
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/14616630/

Date: 2020-09-09 23:17:03  Source: igfitidea

How to split a large file into many small files using bash?

bash

Asked by Yishu Fang

I have a file, say all, with 2000 lines, and I hope it can be split into 4 small files with line number 1~500, 501~1000, 1001~1500, 1501~2000.


Perhaps, I can do this using:


cat all | head -500 >small1
cat all | tail -1500 | head -500 >small2
cat all | tail -1000 | head -500 >small3
cat all | tail -500 >small4

But this way involves calculating line numbers, which may cause errors when the number of lines does not divide evenly, or when we want to split the file into many small files (e.g.: a file all with 3241 lines that we want to split into 7 files, each with 463 lines).


Is there a better way to do this?


Answered by William Pursell

When you want to split a file, use split:


split -l 500 all all

will split the file into several files that each have 500 lines. If you want to split the file into 4 files of roughly the same size, use something like:


split -l $(( $( wc -l < all ) / 4 + 1 )) all all
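As a quick sanity check (a hedged sketch; the sample file is generated with seq), the formula above on a 2000-line file yields four pieces, with the remainder going to the last one:

```shell
# Create a sample 2000-line file and apply the formula above.
seq 2000 > all
split -l $(( $(wc -l < all) / 4 + 1 )) all all
# 2000/4 + 1 = 501 lines per piece, so the last piece gets what is left:
wc -l alla*   # allaa allab allac: 501 each; allad: 497
```

The `+ 1` rounds the chunk size up so that integer division never produces a fifth, tiny leftover file.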

Answered by John Brodie

Look into the split command; it should do what you want (and more):


$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is 'x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   generate suffixes of length N (default 2)
      --additional-suffix=SUFFIX  append an additional SUFFIX to file names.
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes[=FROM]  use numeric suffixes instead of alphabetic.
                                   FROM changes the start value (default 0).
  -e, --elide-empty-files  do not generate empty output files with '-n'
      --filter=COMMAND    write to shell COMMAND; file name is $FILE
  -l, --lines=NUMBER      put NUMBER lines per output file
  -n, --number=CHUNKS     generate CHUNKS output files.  See below
  -u, --unbuffered        immediately copy input to output with '-n r/...'
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE is an integer and optional unit (example: 10M is 10*1024*1024).  Units
are K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, ... (powers of 1000).

CHUNKS may be:
N       split into N files based on size of input
K/N     output Kth of N to stdout
l/N     split into N files without splitting lines
l/K/N   output Kth of N to stdout without splitting lines
r/N     like 'l' but use round robin distribution
r/K/N   likewise but only output Kth of N to stdout
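For instance (a small sketch; the chunk_ prefix and file names are illustrative), combining the -d and -a options from the help text above gives zero-padded numeric suffixes instead of the default alphabetic ones:

```shell
seq 2000 > all                   # sample 2000-line input
split -d -a 3 -l 500 all chunk_  # numeric suffixes of length 3
ls chunk_*                       # chunk_000 chunk_001 chunk_002 chunk_003
```

Numeric suffixes are convenient when downstream tools need to sort or iterate over the pieces in order.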

Answered by Jeroen Janssens

Like the others have already mentioned, you could use split. The complicated command substitution that the accepted answer mentions is not necessary. For reference, I'm adding the following commands, which accomplish almost what has been requested. Note that when using the -n command-line argument to specify the number of chunks, the small* files do not contain exactly 500 lines.


$ seq 2000 > all
$ split -n l/4 --numeric-suffixes=1 --suffix-length=1 all small
$ wc -l small*
 583 small1
 528 small2
 445 small3
 444 small4
2000 total
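For comparison (a hedged sketch reusing the same file names), passing -l instead of -n produces files with exactly 500 lines each, at the cost of not knowing the file count in advance:

```shell
seq 2000 > all
split -l 500 --numeric-suffixes=1 --suffix-length=1 all small
wc -l small*   # small1 small2 small3 small4: 500 lines each
```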

Alternatively, you could use GNU parallel:


$ < all parallel -N500 --pipe --cat cp {} small{#}
$ wc -l small*
 500 small1
 500 small2
 500 small3
 500 small4
2000 total

As you can see, this incantation is quite complex. GNU Parallel is actually most often used for parallelizing pipelines; IMHO it is a tool worth looking into.
