bash - How to split a file and keep the first line in each of the pieces?

Disclaimer: This page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1411713/


How to split a file and keep the first line in each of the pieces?

linux, bash, file, shell, text

Asked by Arkady

Given: One big text-data file (e.g., CSV format) with a 'special' first line (e.g., field names).


Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.


I am guessing some concoction of split and head will do the trick?

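For illustration (a hypothetical example, not part of the original question): given a file.txt whose header is "id,name" followed by four data rows, splitting every 2 data lines should produce two pieces, each starting with the header:

file.txt        piece 1        piece 2
id,name         id,name        id,name
1,foo           1,foo          3,baz
2,bar           2,bar          4,qux
3,baz
4,qux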

Accepted answer by Paused until further notice.

This is robhruska's script, cleaned up a bit:


tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done

I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.


If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard-coded one.

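A minimal sketch of that variation (my illustration, assuming GNU mktemp; not part of the original answer):

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    tmp_file=$(mktemp)              # unique, unpredictable name instead of a hard-coded one
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done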

Edit


Using GNU split it's possible to do this:


split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

Broken out for readability:


split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

When --filter is specified, split runs the command (in this case a function, which must be exported) for each output file, setting the variable FILE in the command's environment to the output filename.


A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat", for example.

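As a concrete sketch of such a filter (my example, not from the answer), each piece could be compressed as it is written, by redirecting to a name derived from $FILE:

tail -n +2 file.txt | split --lines=4 --filter='{ head -n 1 file.txt; cat; } | gzip > "$FILE.gz"' - split_

This produces split_aa.gz, split_ab.gz, and so on, each decompressing to a piece that starts with the header.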

Answer by pixelbeat

You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):


tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'

Answer by marco

You can use [mg]awk:


awk 'NR==1{header=$0; count=1; print header > "x_" count; next}
     !((NR-1) % 100){count++; print header > "x_" count}
     {print $0 > "x_" count}' file

100 is the number of lines of each slice. It doesn't require temp files and can be put on a single line.

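To make the slice size configurable rather than hard-coded, the same script can take it from an awk variable (a sketch based on the answer above, not the author's original):

awk -v n=100 'NR==1{header=$0; count=1; print header > "x_" count; next}
     !((NR-1) % n){count++; print header > "x_" count}
     {print $0 > "x_" count}' file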

Answer by Tim Richardson

This one-liner will split the big csv into pieces of 999 records, with the header at the top of each one (so 999 records + 1 header = 1000 rows)


cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
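A quick sanity check (assuming the file_{#}.csv names produced above): every piece except possibly the last should report 1000 lines:

wc -l file_*.csv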

Based on Ole Tange's answer. (re Ole's answer: You can't use line count with pipepart)


Answer by Rob Hruska

I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.


$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done

This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.

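If other xa* files might be present, one variation (mine, not the author's) is to pass an explicit prefix so the glob is unambiguous:

$> tail -n +2 file.txt | split -l 4 - myprefix_
$> for file in myprefix_*; do head -n 1 file.txt > tmp; cat $file >> tmp; mv -f tmp $file; done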

Answer by Sam Bisbee

This is a more robust version of Denis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyway.


trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat $file >> tmp_file
    mv -f tmp_file $file
done

Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyway (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.

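Applying both of those suggestions, the loop might look like this (a sketch, assuming mktemp is available; not the author's exact code):

tmp_file=
trap 'rm -f split_* "$tmp_file"; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    tmp_file=$(mktemp)              # fresh, unpredictable temp name each pass
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done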

Answer by DreamFlasher

I liked the awk version of marco, and adapted from it this simplified one-liner where you can easily specify the split fraction as granular as you want:


awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file
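For example, the modulus test controls the fraction; a hypothetical 90/10 split (every tenth data row into .split2, the rest into .split1) would be:

awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 == 0) print $0 >> FILENAME ".split2"; else print $0 >> FILENAME ".split1"}' file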

Answer by Garren S

I really liked Rob and Dennis' versions, so much so that I wanted to improve them.


Here's my version:


in_file=$1
awk '{if (NR!=1) {print}}' $in_file | split -d -a 5 -l 100000 - $in_file"_" # Get all lines except the first, split into 100,000 line chunks
for file in $in_file"_"*
do
    tmp_file=$(mktemp $in_file.XXXXXX) # Create a safer temp file
    head -n 1 $in_file | cat - $file > $tmp_file # Get header from main file, cat that header with split file contents to temp file
    mv -f $tmp_file $file # Overwrite non-header containing file with header-containing file
done

Differences:


  1. in_file is the file argument you want to split, maintaining headers (see the usage sketch after this list)
  2. Use awk instead of tail, since awk has better performance
  3. Split into 100,000-line files instead of 4
  4. Split file names will be the input file name appended with an underscore and numbers (up to 99999, from the "-d -a 5" split arguments)
  5. Use mktemp to safely handle temporary files
  6. Use a single head | cat line instead of two lines
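Saved as, say, split_with_header.sh (a hypothetical name), the script takes the file to split as its only argument:

bash split_with_header.sh bigfile.csv    # produces bigfile.csv_00000, bigfile.csv_00001, ...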

Answer by Ole Tange

Use GNU Parallel:


parallel -a bigfile.csv --header : --pipepart 'cat > {#}'

If you need to run a command on each of the parts, then GNU Parallel can help do that, too:


parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}

If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):


parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin

If you want to split into 10 MB blocks:


parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin

Answer by Thyag

Below is a 4-liner that can be used to split bigfile.csv into multiple smaller files while preserving the CSV header. It uses only standard commands (head, split, find, grep, xargs, and sed) that should work on most *nix systems. It should also work on Windows if you install mingw-64 / git-bash.


csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find . | grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00

Line by line explanation:


  1. Capture the header to a variable named csvheader
  2. Split bigfile.csv into a number of smaller files with the prefix smallfile_
  3. Find all smallfiles and insert the csvheader into the FIRST line using xargs and sed -i. Note that sed needs to be within "double quotes" in order to use variables.
  4. The first file, named smallfile_00, will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with the sed -i '1d' command.
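To confirm each piece kept exactly one header (a quick check of my own, not part of the answer):

for f in smallfile_*; do echo "== $f"; head -n 2 "$f"; done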