bash: How to split a large text file into smaller files with an equal number of lines?

Disclaimer: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/2016894/

Date: 2020-09-09 18:50:48 | Source: igfitidea

How to split a large text file into smaller files with equal number of lines?

Tags: bash, file, unix

Asked by danben

I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).


I could do this fairly easily in Python but I'm wondering if there's any kind of ninja way to do this using bash and unix utils (as opposed to manually looping and counting / partitioning lines).


Answered by Mark Byers

Have you looked at the split command?


$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help     display this help and exit
      --version  output version information and exit

You could do something like this:


split -l 200000 filename

which will create files named xaa, xab, xac, ..., each with 200000 lines.
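To see this behaviour on a small scale, here is a quick sketch (the file name and chunk size are made up for the demo):

```shell
# build a 10-line sample file
seq 1 10 > sample.txt

# split into pieces of 3 lines each; the remainder goes into the last piece
split -l 3 sample.txt

wc -l xa*    # xaa, xab, xac hold 3 lines each; xad holds the remaining 1
```

Concatenating the pieces (`cat xa*`) reproduces the original file, since the generated suffixes sort in order.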

Another option, split by size of output file (still splits on line breaks):


 split -C 20m --numeric-suffixes input_filename output_prefix

creates files like output_prefix01, output_prefix02, output_prefix03, ..., each with a maximum size of 20 megabytes.

Answered by Robert Christie

How about the split command?

split -l 200000 mybigfile.txt

Answered by Dave Kirby

Yes, there is a split command. It will split a file by lines or bytes.

$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
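The difference between the decimal and binary suffixes is easy to verify with `-b`; for example, with `-b 1K` each piece is 1024 bytes (the file names here are illustrative):

```shell
# create a 3000-byte file
head -c 3000 /dev/zero > data.bin

# split into 1024-byte pieces: part_aa, part_ab, and a 952-byte part_ac
split -b 1K data.bin part_

wc -c part_*
```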

Answered by zmbush

Use split:

Split a file into fixed-size pieces, creates output files containing consecutive sections of INPUT (standard input if none is given or INPUT is `-')


Syntax split [options] [INPUT [PREFIX]]


http://ss64.com/bash/split.html


Answered by Harshwardhan

Use:


sed -n '1,100p' filename > output.txt

Here, 1 and 100 are the first and last line numbers of the range that will be captured in output.txt.
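On its own, sed -n 'M,Np' only extracts a single range. To split a whole file this way you would loop over the ranges; a minimal sketch (the input file, chunk size, and output names are made up):

```shell
# sample input: 1000 lines
seq 1 1000 > filename

lines_per_file=100
total=$(wc -l < filename)
start=1
i=1
while [ "$start" -le "$total" ]; do
    end=$((start + lines_per_file - 1))
    # print only lines start..end into the i-th output file
    sed -n "${start},${end}p" filename > "output_${i}.txt"
    start=$((end + 1))
    i=$((i + 1))
done
```

Note that this rereads the file once per chunk, so split -l is far more efficient on large files.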

Answered by ialqwaiz

Split the file "file.txt" into files of 10000 lines each:

split -l 10000 file.txt

Answered by Denilson Sá Maia

split (from GNU coreutils, since version 8.8 from 2010-12-22) includes the following parameter:

-n, --number=CHUNKS     generate CHUNKS output files; see explanation below

CHUNKS may be:
  N       split into N files based on size of input
  K/N     output Kth of N to stdout
  l/N     split into N files without splitting lines/records
  l/K/N   output Kth of N to stdout without splitting lines/records
  r/N     like 'l' but use round robin distribution
  r/K/N   likewise but only output Kth of N to stdout

Thus, split -n 4 input output. will generate four files (output.a{a,b,c,d}) with the same number of bytes, but lines might be broken in the middle.

If we want to preserve full lines (i.e. split by lines), then this should work:


split -n l/4 input output.
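A quick way to convince yourself that l/N keeps lines intact (file names are illustrative; this needs GNU coreutils 8.8+ as noted above):

```shell
seq 1 1000 > input
split -n l/4 input output.

# four pieces whose concatenation is byte-identical to the input
wc -l output.*
cat output.* | cmp - input && echo "no line was split"
```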

Related answer: https://stackoverflow.com/a/19031247


Answered by m3nda

In case you just want to split by x number of lines per file, the given answers about split are OK. But I am curious that no one paid attention to the requirements:

  • "without having to count them" -> using wc + cut
  • "having the remainder in extra file" -> split does by default
  • “无需计算它们”-> 使用 wc + cut
  • “将剩余部分放在额外文件中”-> 默认情况下拆分

I can't do that without "wc + cut", but I'm using it like this:

split -l  $(expr `wc $filename | cut -d ' ' -f3` / $chunks) $filename

This can be easily added to your bashrc functions so you can just invoke it passing filename and chunks:


split -l $(expr `wc $1 | cut -d ' ' -f3` / $2) $1
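For instance, a function along these lines could go in ~/.bashrc (the name split_chunks is made up, and it uses shell arithmetic with wc -l < file instead of the wc | cut pipeline, to avoid depending on wc's column layout):

```shell
# split FILE into roughly CHUNKS pieces: split_chunks FILE CHUNKS
split_chunks () {
    split -l $(( $(wc -l < "$1") / $2 )) "$1"
}

# demo on a 10-line sample file
seq 1 10 > sample.txt
split_chunks sample.txt 3   # xaa, xab, xac with 3 lines each; remainder in xad
```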

In case you want just x chunks without a remainder in an extra file, just adapt the formula to add (chunks - 1) to it for each file. I use this approach because usually I just want x number of files rather than x lines per file:

split -l $(expr `wc $1 | cut -d ' ' -f3` / $2 + `expr $2 - 1`) $1

You can add that to a script and call it your "ninja way", because if nothing suits your needs, you can build it :-)

Answered by ghostdog74

You can also use awk:

# start a new output file every 200000 lines; each chunk goes to c".txt"
awk -v c=1 'NR%200000==1 && NR>1{++c}{print $0 > (c".txt")}' largefile

Answered by Matiji66

Use HDFS getmerge to merge small files, then split the result into pieces of the proper size.

This method may cause line breaks.

# split into ~128 MB pieces; check whether the size unit is M or G, please test before use.

begainsize=`hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $1 }'`
sizeunit=`hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $2 }'`
if [ $sizeunit = "G" ];then
    res=$(printf "%.f" `echo "scale=5;$begainsize*8 "|bc`)    # 1 GB = 8 pieces of 128 MB
else
    res=$(printf "%.f" `echo "scale=5;$begainsize/128 "|bc`)  # ceiling, ref http://blog.csdn.net/naiveloafer/article/details/8783518
fi
echo $res
# split into $res files with numeric suffixes.  ref http://blog.csdn.net/microzone/article/details/52839598
compact_file_name=$compact_file"_"
echo "compact_file_name :"$compact_file_name
split -n l/$res $basedir/$compact_file -d -a 3 $basedir/${compact_file_name}

I use getmerge and then split the result into files of about 128 MB each.
