Original source: http://stackoverflow.com/questions/51420966/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Split CSV files into smaller files but keeping the headers?
Asked by neisantos
I have a huge CSV file with 1 million lines. I was wondering if there is a way to split this file into smaller ones while keeping the first line (the CSV header) in every file.
It seems split is very fast but also very limited: you cannot add a suffix such as .csv to the filenames.
split -l11000 products.csv file_
Is there an efficient way to do this task in just bash? A one-line command would be great.
Answered by kvantour
The answer to this question is yes, this is possible with AWK.
The idea is to remember the header and print all the remaining lines to files with names of the form filename.00001.csv:
awk -v l=11000 '(NR==1){header=$0;next}
(NR%l==2) {
  close(file);
  file=sprintf("%s.%0.5d.csv",FILENAME,++c)
  sub(/csv[.]/,"",file)
  print header > file
}
{print > file}' file.csv
This works in the following way:
- (NR==1){header=$0;next}: If the record/line is the first line, save that line as the header and skip to the next line.
- (NR%l==2){...}: Every time we have written l=11000 records/lines, we need to start writing to a new file. This happens every time the record/line number modulo l equals 2, that is, on lines 2, 2+l, 2+2l, 2+3l, .... When such a line is found we do the following:
  - close(file): close the file you just wrote to.
  - file=sprintf("%s.%0.5d.csv",FILENAME,++c); sub(/csv[.]/,"",file): define the new filename. For the input file.csv, sprintf produces file.csv.00001.csv and sub removes the first "csv.", leaving file.00001.csv.
  - print header > file: open the file and write the header to it.
- {print > file}: write the current line to the file.
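To see the steps above in action, here is a minimal sketch that runs the same awk program on a tiny sample file. The sample data (an id,name header plus 7 rows) and the value l=3 are assumptions chosen only to keep the output small; the script writes into the current directory, so it is best run in an empty scratch directory.

```shell
#!/usr/bin/env bash
# Sketch: run the header-preserving awk split on a small, made-up CSV file.
set -eu

# Hypothetical sample data: 1 header line + 7 data lines.
printf 'id,name\n' > file.csv
for i in 1 2 3 4 5 6 7; do
  printf '%s,item%s\n' "$i" "$i" >> file.csv
done

# Same program as in the answer, but with l=3 data lines per output file.
awk -v l=3 '(NR==1){header=$0;next}
(NR%l==2) {
  close(file)                                # finish the previous piece
  file=sprintf("%s.%0.5d.csv",FILENAME,++c)  # file.csv.00001.csv
  sub(/csv[.]/,"",file)                      # file.00001.csv
  print header > file                        # every piece starts with the header
}
{print > file}' file.csv

# Show the line count and the first line of each piece.
wc -l file.0000?.csv
head -n 1 file.0000?.csv
```

With 7 data lines and l=3 this should give three pieces: two with the header plus 3 rows and a last one with the header plus 1 row. Note that the NR%l==2 test can only ever be true when l is at least 3, and that sub(/csv[.]/,...) removes the first "csv." it finds, so a path that contains "csv." earlier on would be renamed differently.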
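Note: If you don't care about the filename, you can use the following shorter version:
awk -v m=100 '
(NR==1){h=$0;next}
(NR%m==2) { close(f); f=sprintf("%s.%0.5d",FILENAME,++c); print h > f }
{print > f}' file.csv
Answered by nzkeith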
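Using GNU split to split file.csv:
export inputPrefix='file' parts=16 && split --verbose -d -n l/${parts} --additional-suffix=.csv --filter='([ "$FILE" != "${inputPrefix}.00.csv" ] && head -1 "${inputPrefix}.csv" ; cat) > "$FILE"' "${inputPrefix}.csv" "${inputPrefix}."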
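As a rough illustration of the GNU split approach, the sketch below splits a small made-up file into three pieces and re-adds the header to every piece after the first. It assumes GNU coreutils split (for the --filter and --additional-suffix options); the file name data.csv, the sample rows and the piece count of 3 are hypothetical values picked for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch: GNU split with --filter, re-adding the CSV header to each piece.
# Assumes GNU coreutils; other split implementations lack these options.
set -eu

# Hypothetical sample data: 1 header line + 9 data lines.
printf 'id,name\n' > data.csv
for i in 1 2 3 4 5 6 7 8 9; do
  printf '%s,item%s\n' "$i" "$i" >> data.csv
done

# -n l/3  : split into 3 pieces without breaking any line
# -d      : numeric suffixes (00, 01, 02)
# --filter: run this command once per piece; split sets $FILE to the output
#           name. The first piece already begins with the header, so the
#           header is prepended only to the later pieces.
split -d -n l/3 --additional-suffix=.csv \
  --filter='{ [ "$FILE" != "data.00.csv" ] && head -n 1 data.csv; cat; } > "$FILE"' \
  data.csv data.

# Every piece should now start with the header line.
head -n 1 data.0?.csv
```

Because -n l/N divides the input by size rather than by line count, the pieces are only approximately equal in number of rows; this differs from the awk version, which writes a fixed number of rows per file.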