Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/14616630/
How to split a large file into many small files using bash?
Asked by Yishu Fang
I have a file, say all, with 2000 lines, and I hope it can be split into 4 small files containing lines 1~500, 501~1000, 1001~1500, and 1501~2000.
Perhaps I can do this using:
cat all | head -500 >small1
cat all | tail -1500 | head -500 >small2
cat all | tail -1000 | head -500 >small3
cat all | tail -500 >small4
But this way involves calculating line numbers, which may cause errors when the line count does not divide evenly, or when we want to split the file into many small files (e.g. a file all with 3241 lines that we want to split into 7 files, each with 463 lines).
Is there a better way to do this?
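For reference, the head/tail arithmetic above can be wrapped in a small loop. This is only a sketch of the manual approach, assuming sed and seq are available; the file name all and the small prefix are taken from the question, and the answers below show simpler ways:
lines=$( wc -l < all )                     # total number of lines
parts=4                                    # desired number of pieces
per=$(( (lines + parts - 1) / parts ))     # lines per piece (ceiling division)
for i in $( seq "$parts" ); do
    # print only the i-th range of lines; re-reads the file once per piece
    sed -n "$(( (i - 1) * per + 1 )),$(( i * per ))p" all > small$i
done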
Answered by William Pursell
When you want to split a file, use split:
split -l 500 all all
will split the file into several files that each have 500 lines. If you want to split the file into 4 files of roughly the same size, use something like:
split -l $(( $( wc -l < all ) / 4 + 1 )) all all
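With all as both the input file and the output prefix, split names the pieces allaa, allab, and so on (two-letter alphabetic suffixes by default). As a quick check of the arithmetic, assuming GNU coreutils and a 2000-line test file: 2000 / 4 + 1 = 501 lines per piece, and the last piece absorbs the rounding:
$ seq 2000 > all
$ split -l $(( $( wc -l < all ) / 4 + 1 )) all all
$ wc -l alla*
501 allaa
501 allab
501 allac
497 allad
2000 total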
Answered by John Brodie
Look into the split command; it should do what you want (and more):
$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is 'x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N generate suffixes of length N (default 2)
--additional-suffix=SUFFIX append an additional SUFFIX to file names.
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes[=FROM] use numeric suffixes instead of alphabetic.
FROM changes the start value (default 0).
-e, --elide-empty-files do not generate empty output files with '-n'
--filter=COMMAND write to shell COMMAND; file name is $FILE
-l, --lines=NUMBER put NUMBER lines per output file
-n, --number=CHUNKS generate CHUNKS output files. See below
-u, --unbuffered immediately copy input to output with '-n r/...'
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
SIZE is an integer and optional unit (example: 10M is 10*1024*1024). Units
are K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, ... (powers of 1000).
CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines
l/K/N output Kth of N to stdout without splitting lines
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout
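For instance, the numeric-suffix options listed above can reproduce the small1 … small4 naming from the question. A quick sketch, assuming GNU coreutils (the =FROM form of --numeric-suffixes needs a reasonably recent version):
$ seq 2000 > all
$ split -l 500 --numeric-suffixes=1 --suffix-length=1 all small
$ wc -l small*
500 small1
500 small2
500 small3
500 small4
2000 total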
Answered by Jeroen Janssens
Like the others have already mentioned, you could use split. The complicated command substitution that the accepted answer mentions is not necessary. For reference, I'm adding the following commands, which accomplish almost what has been requested. Note that when using the -n command-line argument to specify the number of chunks, the small* files do not contain exactly 500 lines: with l/N, split balances the pieces by byte size while keeping lines intact, so the line counts vary.
$ seq 2000 > all
$ split -n l/4 --numeric-suffixes=1 --suffix-length=1 all small
$ wc -l small*
583 small1
528 small2
445 small3
444 small4
2000 total
Alternatively, you could use GNU parallel:
$ < all parallel -N500 --pipe --cat cp {} small{#}
$ wc -l small*
500 small1
500 small2
500 small3
500 small4
2000 total
As you can see, this incantation is quite complex: --pipe -N500 feeds the input to each job in 500-line blocks, --cat writes each block to a temporary file whose name replaces {}, and {#} expands to the job's sequence number. GNU Parallel is actually most often used for parallelizing pipelines; IMHO it's a tool worth looking into.
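A slightly simpler variant, assuming GNU parallel, skips --cat and lets each job copy its 500-line block straight from stdin (a sketch; the output files should match the above):
$ < all parallel -N500 --pipe 'cat > small{#}'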