bash: split a CSV file into smaller files but keep the header?

Question

Asked by neisantos

I have a huge CSV file with 1 million lines. Is there a way to split it into smaller files while keeping the first line (the CSV header) in every one of them?

It seems split is very fast but also very limited: you cannot add a suffix such as .csv to the filenames.

split -l11000 products.csv file_

Is there an effective way to do this task in just bash? A one-line command would be great.

Answer 1

Answered by kvantour

The answer to this question is yes, this is possible with AWK.

The idea is to store the header and write the remaining lines to files named like filename.00001.csv, each of which starts with that header:

awk -v l=11000 '(NR==1){header=$0;next}
                (NR%l==2) {
                   close(file);
                   file=sprintf("%s.%0.5d.csv",FILENAME,++c)
                   sub(/csv[.]/,"",file)
                   print header > file
                }
                {print > file}' file.csv

This works in the following way:

(NR==1){header=$0;next}: If the record/line is the first line, save that line as the header.
(NR%l==2){...}: Every time l=11000 records/lines have been written, a new file must be started. This happens whenever the record/line number modulo l equals 2, that is, on lines 2, 2+l, 2+2l, 2+3l, .... When such a line is found, we do the following:
- close(file): close the file that was just written to.
- file=sprintf("%s.%0.5d.csv",FILENAME,++c); sub(/csv[.]/,"",file): define the new filename; for file.csv this gives file.00001.csv, file.00002.csv, and so on.
- print header > file: open the new file and write the header to it.
{print > file}: write the current line to the file.
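
To see the steps above on a small scale, here is a minimal sketch that runs the same awk program with l=3 on a tiny generated file. The sample data, the temporary directory, and the chunk size are assumptions made for illustration only; they are not part of the original answer.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data: one header line plus seven data rows.
printf 'id,name\n' > file.csv
for i in 1 2 3 4 5 6 7; do printf '%s,item%s\n' "$i" "$i" >> file.csv; done

# l=3 rows per output file; a new file starts at NR = 2, 5, 8 (NR % 3 == 2).
awk -v l=3 '
    NR == 1 { header = $0; next }          # remember the header line
    NR % l == 2 {                          # first data row of each chunk
        close(file)                        # close the previous chunk, if any
        file = sprintf("%s.%0.5d.csv", FILENAME, ++c)
        sub(/csv[.]/, "", file)            # file.csv.00001.csv -> file.00001.csv
        print header > file                # every chunk starts with the header
    }
    { print > file }' file.csv

# Show the first line of every chunk and the line counts.
head -n 1 file.0000*.csv
wc -l file.0000*.csv
```

Note that the NR%l==2 test only fires when l is greater than 2, because NR%l can never equal 2 for l=1 or l=2.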

Note: If you don't care about the filename, you can use the following shorter version:

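awk -v m=100 '
    (NR==1){h=$0;next}
    (NR%m==2) { close(f); f=sprintf("%s.%0.5d",FILENAME,++c); print h > f }
    {print > f}' file.csv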

Answer 2

Answered by nzkeith

Using GNU split to split file.csv:

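export inputPrefix='file' parts=16 &&
split --verbose -d -n l/${parts} --additional-suffix=.csv \
    --filter='([ "$FILE" != "${inputPrefix}.00.csv" ] && head -1 "${inputPrefix}.csv" ; cat) > "$FILE"' \
    "${inputPrefix}.csv" "${inputPrefix}."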
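
As a small self-contained illustration of header-preserving splitting with GNU split, the sketch below splits by a fixed number of rows rather than into a fixed number of parts. It assumes GNU coreutils split (needed for --filter and --additional-suffix); the sample data, the part_ prefix, and the chunk size of 3 are hypothetical choices made for the example. The header is stripped with tail before splitting, and the filter then prepends it to every chunk.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data: one header line plus seven data rows.
printf 'id,name\n' > file.csv
for i in 1 2 3 4 5 6 7; do printf '%s,item%s\n' "$i" "$i" >> file.csv; done

# Feed only the data rows to split (tail -n +2 skips the header).
# For each chunk, split runs the filter with $FILE set to the output
# name; the filter writes the header first, then the chunk from stdin.
tail -n +2 file.csv |
  split -d -l 3 --additional-suffix=.csv \
        --filter='{ head -n 1 file.csv; cat; } > "$FILE"' - part_

# Show the line counts of the resulting chunks.
wc -l part_*.csv
```

Because the header never reaches split here, the filter does not need a special case for the first chunk.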
