Note: this page is an English translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, link the original, and attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/1411713/
How to split a file and keep the first line in each of the pieces?
Asked by Arkady
Given: one big text-data file (e.g. in CSV format) with a 'special' first line (e.g., field names).
Wanted: an equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.
I am guessing some concoction of split and head will do the trick?
Accepted answer by Paused until further notice.
This is robhruska's script, cleaned up a bit:
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "$file"
done
I removed wc, cut, ls, and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.
If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard-coded one.
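For instance, the loop above can be adapted to mktemp like this. This is a quick sketch against a generated sample file; the file.txt and split_ names match the answer, but the sample data is made up:

```shell
set -e
cd "$(mktemp -d)"

# A made-up sample: 1 header line + 5 data rows.
printf 'h1,h2\n1,a\n2,b\n3,c\n4,d\n5,e\n' > file.txt

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    tmp_file=$(mktemp)               # unique temp name instead of a fixed one
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done

head -n 1 split_ab                   # every piece now starts with the header
```

With 5 data rows and -l 4, split_aa ends up with the header plus 4 rows and split_ab with the header plus the remaining row.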
Edit
Using GNU split it's possible to do this:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
Broken out for readability:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
When --filter is specified, split runs the command (in this case a function, which must be exported) for each output file, and sets the variable FILE in the command's environment to the output filename.
A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be outputting to a fixed filename in a variable directory: > "$FILE/data.dat", for example.
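As a sketch of that variable-directory idea (assuming GNU split >= 8.13; the part_ prefix and the data.dat name are made up for the example):

```shell
set -e
cd "$(mktemp -d)"
printf 'h1,h2\n1,a\n2,b\n3,c\n4,d\n' > file.txt

# Each output "file" becomes a directory holding a fixed-name data.dat;
# split exports FILE and runs the filter command through the shell.
tail -n +2 file.txt | split -l 2 --filter='mkdir -p "$FILE"; { head -n 1 file.txt; cat; } > "$FILE/data.dat"' - part_

head -n 1 part_aa/data.dat           # the header, followed by 2 data rows
```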
Answered by pixelbeat
You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):
tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'
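A quick sanity check of that one-liner against a generated FILE.in, with the piece size shrunk for the demo (the sample data is made up; output pieces use split's default x prefix):

```shell
set -e
cd "$(mktemp -d)"

# 1 header + 5 data rows
{ echo 'h1,h2'; seq 1 5 | sed 's/$/,x/'; } > FILE.in

tail -n +2 FILE.in | split -l 2 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'

wc -l xaa xab xac     # 3 + 3 + 2 lines: each piece carries the header
```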
Answered by marco
You can use [mg]awk:
awk 'NR==1{
        header=$0;
        count=1;
        print header > "x_" count;
        next
     }

     !( (NR-1) % 100){
        count++;
        print header > "x_" count;
     }
     {
        print $0 > "x_" count
     }' file
100 is the number of lines of each slice. It doesn't require temp files and can be put on a single line.
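For example, with the slice size lowered to 3 and parentheses added around the redirection targets (a portability nicety for awk implementations that don't accept bare concatenation there), run against a made-up 7-row file:

```shell
set -e
cd "$(mktemp -d)"
{ echo 'h1,h2'; seq 1 7 | sed 's/$/,x/'; } > file

awk 'NR==1{ header=$0; count=1; print header > ("x_" count); next }
     !( (NR-1) % 3){ count++; print header > ("x_" count) }
     { print $0 > ("x_" count) }' file

head -n 1 x_2        # every x_N file begins with the header
```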
Answered by Tim Richardson
This one-liner will split the big csv into pieces of 999 records, with the header at the top of each one (so 999 records + 1 header = 1000 rows)
cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
Based on Ole Tange's answer. (Re Ole's answer: you can't use a line count with --pipepart.)
Answered by Rob Hruska
I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.
$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done
This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.
Answered by Sam Bisbee
This is a more robust version of Dennis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyway.
trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat $file >> tmp_file
    mv -f tmp_file $file
done
Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyway (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.
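A related judgment call: trapping EXIT instead of the individual signals also cleans up when the script dies for any other reason (a set -e failure, a plain exit). A tiny sketch of the cleanup firing, where the subshell stands in for an aborted run:

```shell
set -e
cd "$(mktemp -d)"

status=0
(
    trap 'rm -f split_* tmp_file' EXIT    # runs on any exit, not just signals
    touch split_aa split_ab tmp_file      # stand-ins for the real pieces
    exit 13                               # simulate an aborted run
) || status=$?

echo "$status"        # 13, and the temporary files are gone
```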
Answered by DreamFlasher
I liked marco's awk version, and adapted from it this simplified one-liner where you can easily specify the split fraction as granular as you want:
awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file
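Run against a small generated file (12 data rows, with parentheses added around the file-name expressions for awks that don't accept bare concatenation in a redirection), it yields file.split1 and file.split2, each starting with the header:

```shell
set -e
cd "$(mktemp -d)"
{ echo 'hdr'; seq 1 12; } > file

awk 'NR==1{ print $0 > (FILENAME ".split1"); print $0 > (FILENAME ".split2") }
     NR>1 { if (NR % 10 > 5) print $0 >> (FILENAME ".split1");
            else             print $0 >> (FILENAME ".split2") }' file

head -n 1 file.split1     # hdr
```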
Answered by Garren S
I really liked Rob and Dennis' versions, so much so that I wanted to improve them.

Here's my version:
in_file=$1

awk '{if (NR!=1) {print}}' $in_file | split -d -a 5 -l 100000 - $in_file"_" # Get all lines except the first, split into 100,000 line chunks

for file in $in_file"_"*
do
    tmp_file=$(mktemp $in_file.XXXXXX) # Create a safer temp file
    head -n 1 $in_file | cat - $file > $tmp_file # Get header from main file, cat that header with split file contents to temp file
    mv -f $tmp_file $file # Overwrite non-header containing file with header-containing file
done

Differences:
- in_file is the file argument you want to split, maintaining headers
- Use awk instead of tail due to awk having better performance
- Split into 100,000-line files instead of 4
- Split file names will be the input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split arguments)
- Use mktemp to safely handle temporary files
- Use a single head | cat line instead of two lines
Answered by Ole Tange
Use GNU Parallel:
parallel -a bigfile.csv --header : --pipepart 'cat > {#}'

If you need to run a command on each of the parts, then GNU Parallel can help do that, too:
parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}

If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):
parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin

If you want to split into 10 MB blocks:
parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin

Answered by Thyag
Below is a 4-liner that can be used to split a bigfile.csv into multiple smaller files while preserving the csv header. It uses only standard commands (head, split, find, grep, xargs, and sed) that should work on most *nix systems. It should also work on Windows if you install mingw-64 / git-bash.
csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find . | grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00

Line by line explanation:
- Capture the header to a variable named csvheader
- Split the bigfile.csv into a number of smaller files with the prefix smallfile_
- Find all smallfile_ files and insert the csvheader into the FIRST line using xargs and sed -i. Note that you need to use sed within "double quotes" in order to use the variable.
- The first file, smallfile_00, will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with the sed -i '1d' command.
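Scaled down for a quick check (10-line pieces instead of 10000, and a made-up 25-row sample; note that the sed -i usage assumes GNU sed):

```shell
set -e
cd "$(mktemp -d)"
{ echo 'h1,h2'; seq 1 25 | sed 's/$/,x/'; } > bigfile.csv

csvheader=`head -1 bigfile.csv`
split -d -l10 bigfile.csv smallfile_
find . | grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00

head -n 1 smallfile_02    # h1,h2
```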