Generating a Million-Row Data File
Connect to the local database (a self-installed instance):
Find a table of a suitable size:
SELECT COUNT(0) FROM DBA_OBJECTS;
COUNT(0)
----------
65389
Then join it against a 20-row rowset in a Cartesian product, which yields about 1.3 million records.
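As a quick sanity check, the Cartesian join simply multiplies the two row counts (a Python sketch, using the count from the query above):

```python
# The Cartesian join multiplies the row counts: every dba_objects row
# pairs with each of the 20 generated rows.
dba_objects_rows = 65389   # from SELECT COUNT(0) FROM DBA_OBJECTS above
generated_rows = 20
total = dba_objects_rows * generated_rows
print(total)  # 1307780 -- roughly 1.3 million records
```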
To keep the output format generic, convert the created date to a character string on output:
select a.owner||',"'||a.object_name||'",'||a.object_id||','||
to_char(a.created,'yyyy-mm-dd hh24:mi:ss')||','||a.status
from dba_objects a,(select rownum rn from dual connect by rownum<=20) b ;
Save this query as a SQL file: getobject.sql
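A small Python sketch (not part of the original scripts) of the CSV shape that getobject.sql emits: only object_name is wrapped in double quotes, and created is rendered as yyyy-mm-dd hh24:mi:ss. The sample row values below are made up for illustration.

```python
from datetime import datetime

def format_row(owner, object_name, object_id, created, status):
    # Mirrors the SQL concatenation: owner,"object_name",object_id,created,status
    return '%s,"%s",%d,%s,%s' % (
        owner, object_name, object_id,
        created.strftime('%Y-%m-%d %H:%M:%S'), status)

# Hypothetical sample row (object_id and timestamp are invented):
line = format_row('SYS', 'DUAL', 116, datetime(2013, 6, 5, 14, 18, 43), 'VALID')
print(line)  # SYS,"DUAL",116,2013-06-05 14:18:43,VALID
```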
set echo off
set term off
set line 1000 pages 0
set feedback off
set heading off
spool [current path]/ldr_object.csv
@[current path]/getobject.sql
spool off
set heading on
set feedback on
set term on
set echo on
Save this script as a second SQL file: call.sql
Then, in a SQL*Plus session, simply run:
SQL> @[current path]/call.sql
Now be patient; spooling 1.32 million records takes some time.
Check the resulting file size:
exit
du -m ldr_object.csv
// the reported size is in megabytes
Initializing the environment
Create the target table:
CREATE TABLE OBJECTS(
OWNER VARCHAR2(30),
OBJECT_NAME VARCHAR2(50),
OBJECT_ID NUMBER,
STATUS VARCHAR2(2000),
CREATED DATE);
Create an index:
create index idx_obj_owner_name on objects(owner,object_name);
Performing the load
Create a control file as follows:
LOAD DATA
INFILE ldr_object.csv
INSERT INTO TABLE OBJECTS
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
(owner,
object_name ,
object_id,
created date 'yyyy-mm-dd hh24:mi:ss',
status char(2000)
)
Save the control file as ldr_object.ctl.
Note the CREATED column: we specify a date mask for the conversion, and it must match the date format in the data file exactly; otherwise the conversion raises an error and the load fails.
First run SQLLDR with default parameters to see how long it takes, setting ERRORS=10 so that the load aborts after ten errors:
sqlldr SCOTT/TIGER CONTROL=ldr_object.ctl ERRORS=10
5.4.4 Can It Go Any Faster?
A conventional path SQLLDR load binds 64 rows at a time by default. With over a million rows to load, a two-digit batch size is clearly too small, so as a first tweak let's simply append a zero and see how much that helps.
Run the following command: sqlldr SCOTT/TIGER CONTROL=ldr_object.ctl ERRORS=10 ROWS=640
Excerpt from the log:
value used for ROWS parameter changed from 640 to 84
Table OBJECTS:
1307820 Rows successfully loaded.
0 Rows not loaded due to data errors.
0 Rows not loaded because all WHEN clauses were failed.
0 Rows not loaded because all fields were null.
Space allocated for bind array: 254856 bytes (84 rows)
Read buffer bytes: 1048576
Total logical records skipped: 0
Total logical records read: 1307820
Total logical records rejected: 0
Total logical records discarded: 0
Run began on Wed Jun 05 14:18:43 2013
Run ended on Wed Jun 05 14:28:56 2013
Elapsed time was: 00:10:12.39
CPU time was: 00:00:16.40
Note the first line of the excerpt: it reports that 640 rows would have exceeded the default value of the BINDSIZE parameter, so SQLLDR automatically lowered the batch to the most it could hold, 84 rows. In other words, the default BINDSIZE is on the small side. The elapsed time improved by only about 2 seconds, a negligible change.
Next, raise BINDSIZE as well. The default is 256 KB; increase it to 10 MB (1024 * 1024 * 10 = 10485760 bytes) and raise the batch size to 5000 rows:
sqlldr SCOTT/TIGER CONTROL=ldr_object.ctl ERRORS=10 ROWS=5000 BINDSIZE=10485760
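The 640-to-84 reduction in the earlier log can be reproduced arithmetically. The per-row bind cost below is taken from that log (254856 bytes for 84 rows); treating it as fixed for other batch sizes is an assumption, since it ignores per-array overhead:

```python
# Bind-array sizing: the batch (ROWS) must fit within BINDSIZE.
bytes_for_84_rows = 254856            # "Space allocated for bind array" in the log
per_row = bytes_for_84_rows // 84     # bytes per row for this table layout
print(per_row)  # 3034

# With BINDSIZE raised to 10 MB, the same per-row cost leaves room for
# roughly this many rows per batch (an estimate that ignores overhead):
bindsize_10mb = 10 * 1024 * 1024      # 10485760
print(bindsize_10mb // per_row)  # 3456
```

So even with ROWS=5000 requested, a 10 MB bind array caps the effective batch in the mid-three-thousands for rows of this width, mostly because of the 2000-byte STATUS column.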
Oddly, this run took even longer, roughly 13 minutes 48 seconds; why that happened is still under investigation.
Can it go faster still?
Leave all parameters at their defaults and just enable direct path load:
sqlldr SCOTT/TIGER CONTROL=ldr_object.ctl DIRECT=TRUE
Whoa, now this is the speed we were after!
Table OBJECTS:
1307820 Rows successfully loaded.
0 Rows not loaded due to data errors.
0 Rows not loaded because all WHEN clauses were failed.
0 Rows not loaded because all fields were null.
Date conversion cache disabled due to overflow (default size: 1000)
Bind array size not used in direct path.
Column array rows : 5000
Stream buffer bytes: 256000
Read buffer bytes: 1048576
Total logical records skipped: 0
Total logical records read: 1307820
Total logical records rejected: 0
Total logical records discarded: 0
Total stream buffers loaded by SQL*Loader main thread: 1250
Total stream buffers loaded by SQL*Loader load thread: 4996
Run began on Wed Jun 05 14:55:12 2013
Run ended on Wed Jun 05 14:56:28 2013
Elapsed time was: 00:01:15.74
CPU time was: 00:00:11.80
Could it be faster yet?
Whether that is hope or plain greed is hard to say; either way there is no end to it.
Direct path load accepts quite a few parameters of its own, but the data volume here is modest, so few of them would actually improve throughput. I'll focus on the following two:
- STREAMSIZE: a direct path load reads all records by default, so the ROWS parameter does not apply. Records are processed and placed into the stream buffer, whose size is set by STREAMSIZE. The default is 256 KB; here we raise it to 10 MB.
- DATE_CACHE: this parameter sizes a cache of converted date values, measured in entries. The default is 1000, i.e. 1000 converted dates are kept. Since our data includes a date column, we raise it to 5000 to reduce the cost of date conversion.
With these changes, the final command looks like this:
sqlldr SCOTT/TIGER CONTROL=ldr_object.ctl DIRECT=TRUE STREAMSIZE=10485760 DATE_CACHE=5000
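The effect of DATE_CACHE can be sketched as simple memoization: each distinct date string is converted once, and later occurrences are served from the cache. The distinct-date count of 1207 comes from the statistics in the log below (and explains why the earlier run reported the default 1000-entry cache overflowing); the simulation itself is illustrative only.

```python
from datetime import datetime

def load_with_cache(date_strings, max_size=5000):
    # Memoizing date cache: a miss converts and stores, a later
    # occurrence of the same string is a cache hit.
    cache, hits = {}, 0
    for s in date_strings:
        if s in cache:
            hits += 1
        elif len(cache) < max_size:
            cache[s] = datetime.strptime(s, '%Y-%m-%d %H:%M:%S')
    return len(cache), hits

# 1207 distinct timestamps (as in the log), each seen 3 times here:
distinct = ['2013-06-05 14:%02d:%02d' % (m, s)
            for m in range(60) for s in range(21)][:1207]
entries, hits = load_with_cache(distinct * 3)
print(entries, hits)  # 1207 2414  (hits = total rows - distinct values)
```

Scaled up to the real load, 1307820 rows over 1207 distinct dates gives 1307820 - 1207 = 1306613 hits, which is exactly the Hits figure in the log.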
The results suggest we may be close to the performance ceiling:
Date cache:
Max Size: 5000
Entries : 1207
Hits : 1306613
Misses : 0
Bind array size not used in direct path.
Column array rows : 5000
Stream buffer bytes: 16777216
Read buffer bytes: 1048576
Total logical records skipped: 0
Total logical records read: 1307820
Total logical records rejected: 0
Total logical records discarded: 0
Total stream buffers loaded by SQL*Loader main thread: 1250
Total stream buffers loaded by SQL*Loader load thread: 0
Run began on Wed Jun 05 15:15:39 2013
Run ended on Wed Jun 05 15:17:07 2013
Elapsed time was: 00:01:28.14
CPU time was: 00:00:09.05
A Summary of SQL*Loader Loading
In practice, invoking SQLLDR from an Oracle stored procedure is quite awkward, because SQLLDR is an executable rather than an API. Before 9i, implementing similar functionality inside a stored procedure was also complicated: it could be done indirectly with packages such as UTL_FILE, but that meant writing a great deal of code to handle string splitting, filtering, validation, and the rest.
One slip and the run errors out; worse, it can fail halfway through, which may be messier than not running at all. Fortunately, in 9i and later Oracle provides a new feature, external tables: as the name suggests, tables whose data lives outside the database. This feature has been billed as "the SQL*Loader replacement."
2013/06/05 15:25:57
--EOF--