(Tested on Oracle 11g, Solaris 11 Express, Sun V240 server platform.)
Imagine you have millions of records in a file that need to be inserted into a table (as is typically the scenario in data warehousing applications). How are you going to load the data in? There are three ways to go about it:
1. Using insert statements separately for every record; you can either create an insert command batch file and run it in SQL*Plus, or use a programming language to open a connection to the database and insert the data. (This is fine for several hundred records, but since we are talking about millions of records this method is highly inefficient in terms of resources and time taken.)
2. Using bulk inserts; some programming languages provide a batch-loading API in their database connector. For instance, in Java JDBC one can use Statement.addBatch() or, even better, PreparedStatement in the java.sql package to load multiple records. Moderately good performance in terms of loading speed can be expected, but once again it is not good enough for loading millions.
3. Using SQLLDR; end of story ;). All you need is an input file (which I'm sure you already have :)), then tell SQLLDR which parts of the file to load, which parts to ignore, and what transformations, if any, are needed for certain columns. Can't believe your ears? It's that easy. In the next sections I'm going to discuss SQLLDR in more detail along with an example.
The input file structure
There is a requirement that the input file (a.k.a. a flat file) structure be either delimited (variable-width data) or fixed-width (i.e. tabular). A delimited file contains data with each new record starting on a new line and the data corresponding to each column of the destination table separated/delimited by some character.
e.g.:-
birthday,name,age,job
1957/11/12,Harry,54,officer
1966/11/12,Ben,45,mechanic
1976/11/12,Nimal,34,teacher
Above is an example of a comma-delimited (CSV) file.
birthday   name  age job
1957/11/12 Harry 54  officer
1966/11/12 Ben   45  mechanic
1976/11/12 Nimal 34  teacher
Above is an example of a fixed-width data file.
Did you notice that any given column of data has a fixed width? (i.e. data fields are padded with spaces so that each column has a constant width.) It has become regular practice to output statistics files, performance logs, measurements and dump files in tabular format. In case you have an input file that isn't in either of these formats, you'll simply have to pre-process it programmatically into one of them before loading.
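For instance, if a raw dump used semicolons as separators and contained blank lines, a quick shell one-liner (the file names here are just placeholders) could turn it into a comma-delimited file ready for SQLLDR:

grep -v '^$' raw_dump.txt | tr ';' ',' > citizens.csv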
The Control file
Earlier on I said we have to tell SQLLDR which parts to load, which parts to ignore, how to do some pre-formatting, how to remove unwanted characters in certain columns, etc. That is done through the control file.
Below is the structure of a sample control file (one I typically use to load data).
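A minimal sketch of such a control file, assuming the delimited citizens data shown above and a destination table named citizens (SKIP=1 simply skips the header line, and the buffer sizes in the OPTIONS block are purely illustrative):

OPTIONS (SKIP=1, READSIZE=20971520, BINDSIZE=20971520, ROWS=10000)
UNRECOVERABLE
LOAD DATA INFILE 'citizens.csv'
APPEND INTO TABLE citizens
TRAILING NULLCOLS
(
  date_time DATE "YYYY/MM/DD" TERMINATED BY ',',
  name      CHAR TERMINATED BY ',',
  age       INTEGER EXTERNAL TERMINATED BY ',',
  job       CHAR TERMINATED BY WHITESPACE
)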
As can be seen, the OPTIONS block contains some odd-looking values. READSIZE is the maximum amount of memory in bytes allocated to hold data read from the input file (the read buffer) at a time. BINDSIZE and ROWS are correlated: the former specifies the size in bytes of the record set to be inserted before a commit, and the latter the number of records to be inserted per batch between commits. So the real upper limit is whichever limiting condition is hit first, either the allocated size in bytes or the number of rows. (I'll be discussing these and more options in the future.)
The UNRECOVERABLE option tells SQLLDR not to perform logging (this option is valid only in direct path loading mode, discussed in a later section), so only a little redo is generated. This makes data loading faster, but we lose recoverability in case of a failure.
On the 4th line we have APPEND INTO followed by the table name "citizens", which means existing records in the table are kept as-is and the new data is only appended. One can use the INSERT INTO clause when inserting into an empty table. The REPLACE INTO / TRUNCATE INTO clauses allow us to delete the old records and replace them with the new ones in one sweep! The latter has the added benefits of resetting the high-water mark of the table and faster execution. I use the TRUNCATE INTO option the most, because I usually use SQLLDR to load data into a staging table (a temporary table which holds data until it is merged into the final tables). Cool, right?
In some situations, data will not be present for the last one or more fields of a record.
e.g.:-
1984/09/10,Ravi
Notice that the age and job field values are missing from the end of the record.
In these types of situations, TRAILING NULLCOLS tells SQLLDR to load the record anyway and treat the missing trailing fields as NULLs instead of rejecting the record.
Next we come to the actual definition of the column-to-data mapping in the control file. This section is called the field specification of the control file. Within the parentheses we have identified four columns with their data types and delimiting characters. The first entry says, yes you guessed correctly!, that the first field, terminated by ',', should be loaded into the date_time column of "citizens" as a DATE after being interpreted using the "YYYY/MM/DD" format string.
If we move our attention to the final row, I have used TERMINATED BY WHITESPACE. This is because, if we look at our sample data above, the last field doesn't end with a comma but with a newline character. WHITESPACE covers space, newline and tab characters.
For fixed-width data loading, the control file changes slightly.
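A sketch of the fixed-width version, assuming the column positions of the sample file above (the exact positions and the file name are only illustrative):

-- control file for the fixed-width citizens file
OPTIONS (SKIP=1)
LOAD DATA
INFILE 'citizens.dat'
TRUNCATE INTO TABLE citizens
(
  date_time POSITION(1:10)  DATE "YYYY/MM/DD",
  name      POSITION(12:16) CHAR,
  age       POSITION(18:20) INTEGER EXTERNAL,
  job       POSITION(22:29) CHAR
)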
By the way, you can use "--" to add a comment in the control file.
We have specified the starting and ending positions of each column using the POSITION clause.
Transformations/alterations and using SQL functions in SQLLDR
Let's say we want to convert the name column to upper case when loading. By making the adjustment below in the control file we can achieve it:
name CHAR TERMINATED BY ',' "upper(:name)" ,
Say we have another column in the citizens table called salary, and we need to fill it based on the job type:
salary "decode(:job, 'teacher', '4000$', 'Student', '200$', 'Engineer', '5000$') ",
Listed below are some of the SQL functions that can be used with SQLLDR:
TRIM
REPLACE
REGEXP_REPLACE
LOWER
UPPER
SUBSTR
LTRIM
RTRIM
Before moving to the next section, we need to save the above control file in our working directory.
Calling SQLLDR from the command line
On Windows open a cmd prompt, on Unix a shell. Let's assume that we have the control file and the input file in the same directory. Then we navigate to that directory.
Then run:
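A typical invocation, assuming the control file was saved as citizens.ctl (the user name and password are of course placeholders), looks something like this:

sqlldr userid=scott/tiger control=citizens.ctl log=logFile.log bad=badFile.bad direct=true parallel=true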
Provided that you have set your Windows environment variables / Unix PATH variable correctly, SQLLDR will run.
Rejected records?
Once in a while you might find that the input file contains irregular records. If so, those rejected records are stored in a separate file, called badFile.bad above. The operation log is created in logFile.log; among other things it records the number of rows loaded and rejected and the elapsed and CPU time taken.
In my case it took SQLLDR only 7 minutes to load 41 million records! Impressive, isn't it? In comparison to what, one might ask. Well, I didn't go through the painful process of loading with the first two methods discussed because I felt it wasn't worth it, but trust me when I say that using SQLLDR you can save a lot of time. Another thing to note is that the CPU was utilized heavily right throughout. This means the server may use most of its CPU during a direct path load, which may affect other services it is hosting.
Data loading options in SQLLDR
Now I'm going to discuss the two parameters that were used above, DIRECT=TRUE and PARALLEL=TRUE.
There are two loading options in SQLLDR: conventional and direct path loading. If the DIRECT parameter is not set to true, SQLLDR performs a conventional path load by default. Direct path loading is much faster than conventional path loading, though more restrictive. The speed is achieved by writing prepared data blocks directly into the data files, bypassing most of the RDBMS processing present in conventional path loading. PARALLEL=TRUE (defaults to FALSE) also increases loading speed by allowing multiple direct path loads to run simultaneously. The downside is higher utilization of server resources, which can degrade the other services provided by the server.
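I usually wrap the sqlldr call in a small shell script so that the outcome is mailed to the admin; a minimal sketch (the connect string, file names and mail address are placeholders) might look like this:

#!/bin/sh
# run the load and capture SQLLDR's exit code
sqlldr userid=scott/tiger control=citizens.ctl log=logFile.log bad=badFile.bad direct=true parallel=true
RC=$?

# mail the exit code to the admin (requires a working mailx setup)
echo "SQLLDR finished with exit code $RC" | mailx -s "citizens load" admin@example.com

exit $RC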
(You can see that the return code is captured and then piped to the mailx program so that the admin receives a mail every time the script runs. You need to have mailx configured properly for the above to work.)
For UNIX, the exit codes are as follows:
EX_SUCC 0
EX_FAIL 1
EX_WARN 2
EX_FTL 3
For Windows NT, the exit codes are as follows:
EX_SUCC 0
EX_WARN 2
EX_FAIL 3
EX_FTL 4
If you get a result other than success, the issue can be traced from the SQLLDR log file.
Summary
SQLLDR is a powerful tool installed along with the database. This was a beginner-level discussion on how to use SQLLDR to load data into a table. We discussed input file structure requirements, how to prepare the SQLLDR control file with some important parameters, and how to transform data as it is loaded. Finally, we briefly discussed the loading mode options in SQLLDR.