Oracle SQL*Loader: Dealing with delimiter characters in data
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/8940163/
Asked by bonsvr
I am loading some data to Oracle via SQLLDR. The source file is "pipe delimited".
FIELDS TERMINATED BY '|'
But some records contain the pipe character inside the data, not as a separator. This breaks loading of those records, because SQL*Loader treats the in-data pipe characters as field terminators.
Can you point me in a direction to solve this issue?
The data file is about 9 GB, so it is hard to edit manually.
For example,
Loaded row:
ABC|1234567|STR 9 R 25|98734959,32|28.12.2011
Rejected row:
DE4|2346543|WE| 454|956584,84|28.11.2011
Error:
Rejected - Error on table HSX, column DATE_N.
ORA-01847: day of month must be between 1 and last day of month
DATE_N column is the last one.
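To see why the row is rejected, it can help to split it the way a plain FIELDS TERMINATED BY '|' would. The following is a minimal Python sketch, not SQL*Loader itself. It assumes a five-column table with DATE_N last (as the question states) and that surplus trailing fields are ignored; the column names other than DATE_N are made up for illustration.

```python
# Illustration only: mimic naive splitting on '|' for a five-column table.
# Column names other than DATE_N are hypothetical.
COLUMNS = ["CODE", "ID_N", "DESCR", "AMOUNT", "DATE_N"]

def naive_split(line):
    """Assign fields to columns in order, ignoring any surplus fields."""
    fields = line.split("|")
    return dict(zip(COLUMNS, fields))

good = naive_split("ABC|1234567|STR 9 R 25|98734959,32|28.12.2011")
bad = naive_split("DE4|2346543|WE| 454|956584,84|28.11.2011")

print(good["DATE_N"])  # 28.12.2011
print(bad["DATE_N"])   # 956584,84 (the amount, which is not a valid date)
```

The extra pipe shifts every later field one position to the right, so the amount lands in DATE_N, which is consistent with the ORA-01847 error above.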
Accepted answer by Benoit
You could use no field separator at all, read each line as a single field, and do something like:
field FILLER,
col1 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\1')",
col2 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\2')",
col3 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\3')",
col4 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\4')",
col5 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\5')",
col6 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\6')"
This regexp takes six capture groups (inside parentheses) separated by vertical bars (I had to escape them because otherwise a vertical bar means OR in a regexp). All groups except the third cannot contain a vertical bar ([^|]*), the third group may contain anything (.*), and the regexp must span from the beginning to the end of the line (^ and $).
This way we are sure that the third group will eat all superfluous separators. This only works because only one field may contain separators. If you want a sanity check, you can for example require that the fourth group start with a digit (put \d at the beginning of the fourth parenthesized block).
I have doubled all backslashes because we are inside a double-quoted expression, but I am not really sure that I ought to.
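The grouping idea can be tried outside the database first. Below is a small Python sketch of the same technique, using Python's re module rather than Oracle's REGEXP_REPLACE, so details may differ. Note one assumption: the sample rows in the question have five fields, so this sketch uses five capture groups instead of the six shown above; the function name split_record is hypothetical.

```python
import re

# Five-field variant of the pattern: only the third group may contain '|'.
PATTERN = re.compile(r"([^|]*)\|([^|]*)\|(.*)\|([^|]*)\|([^|]*)")

def split_record(line):
    """Return the five fields, or None if the line has fewer than four pipes."""
    match = PATTERN.fullmatch(line)
    return list(match.groups()) if match else None

print(split_record("ABC|1234567|STR 9 R 25|98734959,32|28.12.2011"))
print(split_record("DE4|2346543|WE| 454|956584,84|28.11.2011"))
```

Because the greedy third group is the only one allowed to contain a pipe, the surplus separator in the second row stays inside the third field.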
Answer by John Doyle
It looks to me like SQL*Loader cannot really handle your file as it stands, because the third field can contain the delimiter, is not surrounded by quotes, and is of variable length. Instead, if the data you provided is an accurate example, I can offer a sample workaround. First, create a table with a single VARCHAR2 column at least as long as the longest line in your file. Then load the entire file into this table. From there you can extract each column with a query such as:
with CTE as
(select 'ABC|1234567|STR 9 R 25|98734959,32|28.12.2011' as CTETXT
from dual
union all
select 'DE4|2346543|WE| 454|956584,84|28.11.2011' from dual)
select substr(CTETXT, 1, instr(CTETXT, '|') - 1) as COL1
,substr(CTETXT
,instr(CTETXT, '|', 1, 1) + 1
,instr(CTETXT, '|', 1, 2) - instr(CTETXT, '|', 1, 1) - 1)
as COL2
,substr(CTETXT
,instr(CTETXT, '|', 1, 2) + 1
,instr(CTETXT, '|', -1, 1) - instr(CTETXT, '|', 1, 2) - 1)
as COL3
,substr(CTETXT, instr(CTETXT, '|', -1, 1) + 1) as COL4
from CTE
It's not perfect (though it may be adaptable to SQL*Loader) but would need a bit of work if you have more columns or if your third field is not what I think it is. But, it's a start.
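The query above returns four columns, with everything between the second and the last pipe ending up in COL3. As a rough illustration of how the same counting-from-both-ends idea extends to the five fields in the sample rows, here is a Python sketch (an assumption about the intended layout, not part of the original answer; split_by_position is a hypothetical name). In Oracle the counterpart would presumably use instr(CTETXT, '|', -1, 2) to find the second-to-last pipe.

```python
def split_by_position(line):
    """Take two fields from the left and two from the right; the rest is field 3."""
    # Split off the first two fields, leaving the remainder untouched.
    col1, col2, rest = line.split("|", 2)
    # Split off the last two fields; any surplus pipes stay in col3.
    col3, col4, col5 = rest.rsplit("|", 2)
    return [col1, col2, col3, col4, col5]

print(split_by_position("ABC|1234567|STR 9 R 25|98734959,32|28.12.2011"))
print(split_by_position("DE4|2346543|WE| 454|956584,84|28.11.2011"))
```

A line with fewer than four pipes raises ValueError here, which is one way to surface malformed records instead of loading them silently.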
Answer by Florin Ghita
OK, I recommend that you parse the file and replace the delimiter. On the Unix/Linux command line you could do:
cat current_file | awk -F'|' '{printf("%s,%s,", $1, $2); for(k=3;k<NF-2;k++) printf("%s|", $k); printf("%s,%s,%s", $(NF-2),$(NF-1),$NF);print "";}' > new_file
This command will not change your current file. It creates a new, comma-delimited file with five fields. It splits each input line on "|" and takes the first chunk, the second chunk, everything from the third chunk up to the third-from-last (rejoined with "|"), the second-to-last chunk, and the last chunk.
You can then try loading new_file with sqlldr, using "," as the delimiter.
UPDATE: The command can also be put in a script (named parse.awk), like:
#!/usr/bin/awk -f
# parse.awk
BEGIN {FS="|"}
{
printf("%s,%s,", $1, $2);
for(k=3;k<NF-2;k++)
printf("%s|", $k);
printf("%s,%s,%s\n", $(NF-2),$(NF-1),$NF);
}
and you can run it this way:
cat current_file | awk -f parse.awk > new_file
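One caveat with a comma-delimited output: the amounts in the sample rows (for example 956584,84) use a comma as the decimal separator, so a plain comma delimiter would split them. The following Python sketch shows one possible variation, not the answerer's method: it performs the same regrouping but writes quoted CSV, so fields containing a comma are enclosed in double quotes. The function name convert is hypothetical, and the sketch assumes every line has at least five fields. Such a file could presumably be loaded with FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'.

```python
import csv
import io

def convert(src, dst):
    """Copy pipe-delimited lines from src to dst as quoted CSV with five fields."""
    writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for line in src:
        parts = line.rstrip("\n").split("|")
        if len(parts) < 5:
            raise ValueError(f"too few fields: {line!r}")
        # Everything between the first two and the last two chunks is field 3.
        middle = "|".join(parts[2:-2])
        writer.writerow(parts[:2] + [middle] + parts[-2:])

# Small in-memory demonstration; for the real file, pass open file objects.
src = io.StringIO("DE4|2346543|WE| 454|956584,84|28.11.2011\n")
dst = io.StringIO()
convert(src, dst)
print(dst.getvalue(), end="")
```

The function reads one line at a time, so it should cope with a multi-gigabyte file without holding it in memory, at the cost of being slower than awk.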