I recently spent some time working with Venkata, an Oracle programmer and keen cricket fan :
This post is about the fun and games involved in the bulk loading of data into Oracle – especially when it includes floating point values.
In Venkata’s honour, I’m going to explore this topic through the medium of the career of Sachin Tendulkar…
At this point it’s probably worth wandering off the point to explore some of the highlights of The Little Master’s career.
India’s tour of England in 1996 has been largely forgotten. However, it proved to be a watershed in our hero’s test career.
The first test of the series at Edgebaston was not untypical of the time – Indian batsmen skittled in conditions totally alien to them. The could only muster a paltry 219 in their second innings and none of them got beyond 18…apart from Tendulkar, ninth out for 122, made with a serenity that contrasted starkly with the chaos surrounding him.
It was in the very next match, at Lords, that India granted test debuts to two batsmen who themselves would prove quite useful over the years. Surav Ganguly marked the occasion with a century. Rahul Dravid fell an agonising five runs short of doing the same.
The fourth member of India’s vaunted middle-order announced himself in the next series we look at, against Australia.
Having been soundly beaten in the first test of the series, India came back the hard way in the second test.
VVS Laxmans 281 not out, with the not inconsiderable assistance of Dravid (180) and Harbhajan (13-196) became only the third team in the history of Test Cricket to win after following on. This against arguably the best team to have ever played the game who had their record winning streak of 16 consecutive tests ended somewhat emphatically.
The 2004/05 series against Bangladesh is included because Tendulkar’s highest test score (248 not out) was made in this series.
The 2009/10 series against South Africa is indicative of the journey that India made during the course of Tendulkar’s career.
At this time they were vying with South Africa for the status of the world’s leading test nation.
Back to the techie stuff. What I’m going to cover is :
- What happens when you insert a floating point value into an INTEGER column
- How you can overcome this using SQL*Loader
- An alternative approach using external tables
The target table
The target table looks like this :
CREATE TABLE sachin_tests( season VARCHAR2(7), opponents VARCHAR2(50), venue VARCHAR2(4), matches INTEGER, runs INTEGER, high_score VARCHAR2(4), average FLOAT) /
The data we’re inserting into the table is a selection of Tendulkar’s performance in the test series I’ve already mentioned.
As this table is populated via a data load, this data is in the form of a comma delimited file :
season, opponents, venue, matches, runs, high_score, average, 1996,ENGLAND,AWAY,3,428,177,85.60, 2000/01,AUSTRALIA,HOME,3,304,126,50.66, 2004/05,BANGLADESH,AWAY,2,284,248*,284.00, 2009/10,SOUTH AFRICA,HOME,2,213,106,71.00,
The limitations of datatype validation
As you’d expect, the structure of this table does serve to provide some elementary validation.
If we issue a statement like this …
INSERT INTO sachin_tests( season, opponents, venue, matches, runs, high_score, average) VALUES( '1996 or thereabouts', 'ENGLAND', 'EDGBASTON', 3, 428, '177', 85.60);
We’ll get …
'1996 or thereabouts', * ERROR at line 10: ORA-12899: value too large for column "MIKE"."SACHIN_TESTS"."SEASON" (actual: 19, maximum: 7)
However, what happens when we try inserting a float into an integer column ?
INSERT INTO sachin_tests( season, opponents, venue, matches, runs, high_score, average) VALUES( '1996', 'ENGLAND', 'AWAY', 3, 85.60, '177', 428);
Run this and well get…
1 row created.
So, the record has been inserted without error. But have a look at the value in the table…
SQL> SELECT runs 2 FROM sachin_tests 3 WHERE season = '1996' 4 AND opponents = 'ENGLAND' 5 AND venue = 'AWAY' 6 / RUNS ---------- 86 SQL>
The float value we’ve inserted has been rounded. Is this some insidious bug in Oracle ?
Well, no. The ANSI standard for SQL stipulates that a float value inserted into an integer column will either be truncated or rounded. The action is implementation specific. In Oracle’s case, the value is rounded.
Therefore, if we want to protect against this, we need to write our own validation.
The trouble with triggers
At this point, you may be thinking that the obvious solution is to simply write a trigger to check that
the INTEGER values being inserted are not in fact FLOATS. Let’s knock up a quick test :
CREATE OR REPLACE TRIGGER st_bi_trg BEFORE INSERT ON sachin_tests FOR EACH ROW BEGIN -- -- Output the value to be inserted into the runs column -- DBMS_OUTPUT.PUT_LINE(:new.runs); -- -- Check to make sure that the runs value being inserted is a whole number -- IF MOD( :new.runs, 1) > 0 THEN RAISE_APPLICATION_ERROR(-20000, 'Cannot insert a FLOAT into an INTEGER column.'); END IF; END; /
Now let’s run the offending insert statement again. This time, we’ll set the serveroutput so that we can see the value of RUNS as the trigger fires :
set serveroutput on 1 INSERT INTO sachin_tests( season, opponents, venue, 2 matches, runs, high_score, average) 3 VALUES( 4 '1996', 'ENGLAND', 'AWAY', 5* 3, 85.60, '177', 428) SQL> / 86 1 row created. SQL>
So, the value has been rounded before we even get to the trigger. An alternative approach is required here.
The SQL*LOADER option
The data we have to load is in a flat-file. So, let’s see what SQL*LOADER can do for us.
OPTIONS(SKIP=1) LOAD DATA INFILE 'tests.dat' INTO TABLE sachin_tests APPEND FIELDS TERMINATED BY ',' TRAILING NULLCOLS( season CHAR(7), opponents CHAR(50), venue CHAR(4), matches INTEGER EXTERNAL "DECODE(TRUNC(:matches),:matches,:matches,'Error Matches must be an Integer')", runs INTEGER EXTERNAL "DECODE(TRUNC(:runs), :runs, :runs, 'Error Runs must be an Integer')", high_score CHAR(4), average CHAR(7))
If we run this for our correctly formatted file :
$ sqlldr userid=mike control=tests.ctl Password: SQL*Loader: Release 220.127.116.11.0 - Production on Sun Jun 23 14:15:47 2013 Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved. Commit point reached - logical record count 4 $
Now, let’s play around with the .dat file and try the insert we’ve been having problems with up to now…
season, opponents, venue, matches, runs, high_score, average, 1996,ENGLAND,AWAY,3,85.60,177,428, 2000/01,AUSTRALIA,HOME,3,304,126,50.66, 2004/05,BANGLADESH,AWAY,2,284,248*,284.00, 2009/10,SOUTH AFRICA,HOME,2,213,106,71.00,
Save the file and run it again …
sqlldr userid=mike control=tests.ctl
This time, if we check in the SQL*Loader log file (tests.log), we’ll see :
Record 1: Rejected - Error on table SACHIN_TESTS, column RUNS. ORA-01722: invalid number Table SACHIN_TESTS: 3 Rows successfully loaded. 1 Row not loaded due to data errors. 0 Rows not loaded because all WHEN clauses were failed. 0 Rows not loaded because all fields were null.
SQL*Loader has successfully filtered out our rogue record.
The External Table option
But what if we want to encapsulate the entire process within the database, rather than having to look at the log file on the OS.
Well, we can load the data into an External Table which we use as a staging area.
We can then insert the records either into the target table, or an error table using a multi-table insert statement.
A couple of housekeeping chores are required when setting up an external table.
First, we need to specify a directory to hold the tests.dat file. This needs to be created on the OS and be a directory to which the Oracle owner (usually oracle on *nix), has access…
cd /u01/app/oracle/admin sudo su oracle mkdir ext_tabs
Now copy the tests.data file to the new directory.
Next, we need to create a directory object in the database :
CREATE DIRECTORY ext_tabs AS '/u01/app/oracle/admin/ext_tabs'
The owner of the directory object in the database should automatically have READ and WRITE privileges.
Finally, the external table. The table will be structured to allow all numeric values to be loaded without rounding them…
CREATE TABLE tests_staging_ext( season VARCHAR2(7), opponents VARCHAR2(50), venue VARCHAR2(4), matches NUMBER, runs NUMBER, high_score VARCHAR2(4), average NUMBER) ORGANIZATION EXTERNAL ( type oracle_loader default directory ext_tabs access parameters ( records delimited by newline fields terminated by ',' missing field values are NULL reject rows with all null fields( season, opponents, venue, matches, runs, high_score, average) ) location( 'tests.dat') ) REJECT LIMIT UNLIMITED /
Yep, we’ll take any numeric value into a number column.
If we do a quick check, we should see that all of the records have been loaded into our external table…
SQL> SELECT COUNT(*) 2 FROM tests_staging_ext 3 / COUNT(*) ---------- 4 SQL>
We’ll then write some code to sort out the load into our proper target table as well as an error table. The error table looks like this :
CREATE TABLE sachin_test_errors( season VARCHAR2(7), opponents VARCHAR2(50), venue VARCHAR2(4), matches NUMBER, runs NUMBER, high_score VARCHAR2(4), average NUMBER, rejection_reason VARCHAR2(100)) /
At this point, you may well be thinking that something like this could now do the job…
INSERT FIRST WHEN TRUNC(matches) != TRUNC(matches) OR TRUNC(runs) != TRUNC(runs) THEN INTO sachin_test_errors( season, opponents, venue, matches, runs, high_score, average, rejection_reason) VALUES( season, opponents, venue, matches, runs, high_score, average, 'FLOAT value found where INTEGER expected.') ELSE INTO sachin_tests SELECT season, opponents, venue, matches, runs, high_score, average FROM tests_staging_ext /
…unfortunately, this will cause us to trip over the same issue as we had with the trigger.
In order for this to work properly, we need to push the logic to check for FLOAT values into the SELECT statement itself…
INSERT FIRST WHEN rejection_reason IS NOT NULL THEN INTO sachin_test_errors( season, opponents, venue, matches, runs, high_score, average, rejection_reason) VALUES( season, opponents, venue, matches, runs, high_score, average, rejection_reason) ELSE INTO sachin_tests( season, opponents, venue, matches, runs, high_score, average) VALUES( season, opponents, venue, matches, runs, high_score, average) SELECT season, opponents, venue, matches, runs, high_score, average, CASE WHEN matches != TRUNC(matches) THEN 'MATCHES has a FLOAT value' WHEN runs != TRUNC(runs) THEN 'RUNS has a FLOAT value' ELSE NULL END as rejection_reason FROM tests_staging_ext /
Now, if we look at our target table, we should have the three valid records loaded
SELECT * FROM sachin_tests / SEASON OPPONENTS VENU MATCHES RUNS HIGH AVERAGE ------- -------------------- ---- ------- ---- ---- --------------------------------------- 2000/01 AUSTRALIA HOME 3 304 126 50.66 2004/05 BANGLADESH AWAY 2 284 248* 284 2009/10 SOUTH AFRICA HOME 2 213 106 71
The rejected record should be in the errors table …
SELECT * FROM sachin_test_errors / SEASON OPPONENTS VENU MATCHES RUNS HIGH AVERAGE REJECTION_REASON ------- -------------------- ---- ------- ---- ---- ------- ------------------------------ 1996 ENGLAND AWAY 3 86 177 428 RUNS has a FLOAT value
There, I’ve managed to get through the entire post without mentioning the Champions Trophy final which, England have managed to let India win. At least Venkata will still be smiling.