bash - How to split a file and keep the first line in each of the pieces?

Disclaimer: This page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1411713/


How to split a file and keep the first line in each of the pieces?

linux, bash, file, shell, text

Asked by Arkady

Given: One big text-data file (e.g., CSV format) with a 'special' first line (e.g., field names).


Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.


I am guessing some concoction of split and head will do the trick?

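For illustration (a hypothetical example, not part of the original question): given a file.txt whose header is "id,name" followed by four data rows, splitting every 2 data lines should produce two pieces, each starting with the header:

file.txt        piece 1        piece 2
id,name         id,name        id,name
1,foo           1,foo          3,baz
2,bar           2,bar          4,qux
3,baz
4,qux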

Accepted answer by Paused until further notice.

This is robhruska's script, cleaned up a bit:


tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done

I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.


If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard-coded one.

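A minimal sketch of that variation (my illustration, assuming GNU mktemp; not part of the original answer):

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    tmp_file=$(mktemp)              # unique, unpredictable name instead of a hard-coded one
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done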

Edit


Using GNU split it's possible to do this:


split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

Broken out for readability:


split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

When --filter is specified, split runs the command (in this case a function, which must be exported) for each output file, setting the variable FILE in the command's environment to the output filename.


A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat", for example.

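As a concrete sketch of such a filter (my example, not from the answer), each piece could be compressed as it is written, by redirecting to a name derived from $FILE:

tail -n +2 file.txt | split --lines=4 --filter='{ head -n 1 file.txt; cat; } | gzip > "$FILE.gz"' - split_

This produces split_aa.gz, split_ab.gz, and so on, each decompressing to a piece that starts with the header.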

Answer by pixelbeat

You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):


tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'

Answer by marco

You can use [mg]awk:


awk 'NR==1{header=$0; count=1; print header > "x_" count; next}
     !((NR-1) % 100){count++; print header > "x_" count}
     {print $0 > "x_" count}' file

100 is the number of lines of each slice. It doesn't require temp files and can be put on a single line.

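To make the slice size configurable rather than hard-coded, the same script can take it from an awk variable (a sketch based on the answer above, not the author's original):

awk -v n=100 'NR==1{header=$0; count=1; print header > "x_" count; next}
     !((NR-1) % n){count++; print header > "x_" count}
     {print $0 > "x_" count}' file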

Answer by Tim Richardson

This one-liner will split the big csv into pieces of 999 records, with the header at the top of each one (so 999 records + 1 header = 1000 rows)


cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
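A quick sanity check (assuming the file_{#}.csv names produced above): every piece except possibly the last should report 1000 lines:

wc -l file_*.csv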

Based on Ole Tange's answer. (re Ole's answer: You can't use line count with pipepart)


Answer by Rob Hruska

I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.


$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done

This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.

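If other xa* files might be present, one variation (mine, not the author's) is to pass an explicit prefix so the glob is unambiguous:

$> tail -n +2 file.txt | split -l 4 - myprefix_
$> for file in myprefix_*; do head -n 1 file.txt > tmp; cat $file >> tmp; mv -f tmp $file; done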

Answer by Sam Bisbee

This is a more robust version of Denis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyway.


trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat $file >> tmp_file
    mv -f tmp_file $file
done

Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyway (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.

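Applying both of those suggestions, the loop might look like this (a sketch, assuming mktemp is available; not the author's exact code):

tmp_file=
trap 'rm -f split_* "$tmp_file"; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    tmp_file=$(mktemp)              # fresh, unpredictable temp name each pass
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done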

Answer by DreamFlasher

I liked the awk version of marco, and adapted from it this simplified one-liner where you can easily specify the split fraction as granular as you want:


awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file
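For example, the modulus test controls the fraction; a hypothetical 90/10 split (every tenth data row into .split2, the rest into .split1) would be:

awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 == 0) print $0 >> FILENAME ".split2"; else print $0 >> FILENAME ".split1"}' file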

Answer by Garren S

I really liked Rob and Dennis' versions, so much so that I wanted to improve them.


Here's my version:


in_file=$1
awk '{if (NR!=1) {print}}' $in_file | split -d -a 5 -l 100000 - $in_file"_" # Get all lines except the first, split into 100,000 line chunks
for file in $in_file"_"*
do
    tmp_file=$(mktemp $in_file.XXXXXX) # Create a safer temp file
    head -n 1 $in_file | cat - $file > $tmp_file # Get header from main file, cat that header with split file contents to temp file
    mv -f $tmp_file $file # Overwrite non-header containing file with header-containing file
done

Differences:


  1. in_file is the file argument you want to split, maintaining headers (see the usage sketch after this list)
  2. Use awk instead of tail, since awk has better performance
  3. Split into 100,000-line files instead of 4
  4. Split file names will be the input file name appended with an underscore and numbers (up to 99999, from the "-d -a 5" split arguments)
  5. Use mktemp to safely handle temporary files
  6. Use a single head | cat line instead of two lines
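Saved as, say, split_with_header.sh (a hypothetical name), the script takes the file to split as its only argument:

bash split_with_header.sh bigfile.csv    # produces bigfile.csv_00000, bigfile.csv_00001, ...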

Answer by Ole Tange

Use GNU Parallel:


parallel -a bigfile.csv --header : --pipepart 'cat > {#}'

If you need to run a command on each of the parts, then GNU Parallel can help do that, too:


parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}

If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):


parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin

If you want to split into 10 MB blocks:


parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin

Answer by Thyag

Below is a 4-liner that can be used to split bigfile.csv into multiple smaller files while preserving the CSV header. It uses only standard commands (head, split, find, grep, xargs, and sed) that should work on most *nix systems. It should also work on Windows if you install mingw-64 / git-bash.


csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find . | grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00

Line by line explanation:


  1. Capture the header to a variable named csvheader
  2. Split bigfile.csv into a number of smaller files with the prefix smallfile_
  3. Find all smallfiles and insert the csvheader into the FIRST line using xargs and sed -i. Note that sed needs to be within "double quotes" in order to use variables.
  4. The first file, named smallfile_00, will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with the sed -i '1d' command.
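To confirm each piece kept exactly one header (a quick check of my own, not part of the answer):

for f in smallfile_*; do echo "== $f"; head -n 2 "$f"; done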