Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/33294986/

Date: 2020-09-18 13:47:34  Source: igfitidea

Splitting large text file on every blank line

Tags: bash, perl, awk

Asked by tropical e

I'm having a bit of trouble splitting a large text file into multiple smaller ones. The format of my text file is as follows:

dasdas #42319 blaablaa 50 50
content content
more content
content conclusion

asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion

asdasd #299 yadayada 60 40
content
content
contend done
...and so on

A typical information table in my file has between 10 and 40 rows.

I would like this file to be split into n smaller files, where n is the number of content tables.

That is,

dasdas #42319 blaablaa 50 50
content content
more content
content conclusion

would be its own separate file (whateverN.txt)

and

asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion

would again be a separate file, whateverN+1.txt, and so forth.

It seems like awk or Perl are nifty tools for this, but having never used them before, I find the syntax kind of baffling.

I found these two questions that closely correspond to my problem, but I failed to modify the syntax to fit my needs:

"Split text file into multiple files" & "How can I split a text file into multiple text files?" (on Unix & Linux)

How should I modify those commands so that they solve my problem?

Answered by jas

Setting RS to null tells awk to use one or more blank lines as the record separator. Then you can simply use NR to set the name of the file corresponding to each new record:

 awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt

RS: This is awk's input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines, or a regexp, in which case records are separated by matches of the regexp in the input text.

$ cat file.txt
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion

asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion

asdasd #299 yadayada 60 40
content
content
contend done

$ awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt

$ ls whatever-*.txt
whatever-1.txt  whatever-2.txt  whatever-3.txt

$ cat whatever-1.txt 
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion

$ cat whatever-2.txt 
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion

$ cat whatever-3.txt 
asdasd #299 yadayada 60 40
content
content
contend done
$ 
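
To see paragraph mode in isolation, here is a minimal sketch, assuming a throwaway sample file (the name sample.txt and its contents are made up for illustration). It only counts records, to show that a run of several blank lines still acts as a single separator:

```shell
# Hypothetical sample: three blocks, separated by one and then three blank lines
tmpdir=$(mktemp -d)
printf 'a1\na2\n\nb1\n\n\n\nc1\nc2\n' > "$tmpdir/sample.txt"

# With RS set to the empty string, NR counts blocks rather than lines
count=$(awk -v RS= 'END { print NR }' "$tmpdir/sample.txt")
echo "$count"
```

Because the count follows blocks and not lines, the NR-based file names stay consecutive even when the spacing between blocks is uneven.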

Answered by Sobrique

Perl has a useful feature called the input record separator, $/.

This is the 'marker' for separating records when reading a file.

So:

#!/usr/bin/env perl
use strict;
use warnings;

local $/ = "\n\n"; 
my $count = 0; 

while ( my $chunk = <> ) {
    open ( my $output, '>', "filename_".$count++ ) or die $!;
    print {$output} $chunk;
    close ( $output ); 
}

Just like that. The <> is the 'magic' filehandle, in that it reads piped data or reads from files specified on the command line (it opens them and reads them). This is similar to how sed or grep work.

This can be reduced to a one-liner:

perl -00 -pe 'open ( $out, ">", "filename_".++$n ); select $out;'  yourfilename_here

Answered by sat

You can use this awk command:

awk 'BEGIN{file="content"++i".txt"} !NF{file="content"++i".txt";next} {print > file}' yourfile

(OR)

awk 'BEGIN{i++} !NF{++i;next} {print > ("filename" i ".txt")}' yourfile

More readable format:

BEGIN {
        file="content"++i".txt"
}
!NF {
        file="content"++i".txt";
        next
}
{
        print > file
}
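
Note that !NF is true for any line without fields, so a line holding only spaces or tabs also counts as a separator here, unlike a strict /^$/ match. A minimal sketch of that behavior, using made-up sample data and the hypothetical output prefix part:

```shell
tmpdir=$(mktemp -d)
# Hypothetical input: the separator line holds two spaces, not just a newline
printf 'a1\na2\n  \nb1\n' > "$tmpdir/sample.txt"

# !NF matches the whitespace-only line, so the counter still advances there
awk -v dir="$tmpdir" '
    BEGIN { i = 1 }
    !NF   { i++; next }
          { print > (dir "/part" i ".txt") }
' "$tmpdir/sample.txt"

ls "$tmpdir"
```

Here the two blocks end up in part1.txt and part2.txt even though the line between them is not strictly empty.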

Answered by KuldeepSinh

In case you get a "too many open files" error like the following:

awk: whatever-18.txt makes too many open files
 input record number 18, file file.txt
 source line number 1

You may need to close the previously created file before creating a new one, as follows.

awk -v RS= '{close("whatever-" i ".txt"); i++}{print > ("whatever-" i ".txt")}' file.txt
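
A variant of the same idea, sketched under the assumption that each output file is written exactly once: build the name from NR, print the record, and close the file right away. The variable names out and dir and the sample data are hypothetical:

```shell
tmpdir=$(mktemp -d)
# Hypothetical sample input with three blank-line-separated blocks
printf 'a1\na2\n\nb1\n\nc1\n' > "$tmpdir/file.txt"

awk -v RS= -v dir="$tmpdir" '{
    out = dir "/whatever-" NR ".txt"   # one file per record
    print > out
    close(out)                         # free the descriptor before the next record
}' "$tmpdir/file.txt"

ls "$tmpdir"
```

Closing right after the write keeps at most one output file open at a time, regardless of how many blocks the input has.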

Answered by Benjamin W.

You could use the csplit command:

csplit \
    --quiet \
    --prefix=whatever \
    --suffix-format=%02d.txt \
    --suppress-matched \
    infile.txt /^$/ {*}

POSIX csplit only uses short options and doesn't know --suffix-format and --suppress-matched, so this requires GNU csplit.

This is what the options do:

  • --quiet – suppress output of file sizes
  • --prefix=whatever – use whatever instead of the default xx filename prefix
  • --suffix-format=%02d.txt – append .txt to the default two-digit suffix
  • --suppress-matched – don't include the lines matching the pattern on which the input is split
  • /^$/ {*} – split on the pattern "empty line" (/^$/) as often as possible ({*})
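
As a rough end-to-end sketch of the options above, assuming GNU csplit is installed and using made-up sample data in a temporary directory (the pattern arguments are quoted here only to keep the shell from touching them):

```shell
tmpdir=$(mktemp -d)
# Hypothetical sample input with three blank-line-separated blocks
printf 'a1\na2\n\nb1\n\nc1\n' > "$tmpdir/infile.txt"

csplit \
    --quiet \
    --prefix="$tmpdir/whatever" \
    --suffix-format=%02d.txt \
    --suppress-matched \
    "$tmpdir/infile.txt" '/^$/' '{*}'

ls "$tmpdir"
```

Note that csplit numbers its output files starting at 00, so the first block lands in whatever00.txt.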

Answered by user2138595

awk -v RS="\n\n" '{print > NR}' file.txt

Sets the record separator to a blank line (two consecutive newlines) and prints each record to a separate file named 1, 2, 3, etc. Only the last file ends in a blank line. Note that a multi-character RS is an extension supported by gawk and mawk; POSIX leaves its behavior unspecified.

Answered by Kalanidhi

You can also try this bash script:

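#!/bin/bash
# Write each blank-line-separated block of InputFile.txt
# to OutputFile_1, OutputFile_2, ...
i=1
fileName="OutputFile_$i"
# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r line; do
    if [ -z "$line" ]; then
        # Empty line: switch to the next output file
        ((++i))
        fileName="OutputFile_$i"
    else
        printf '%s\n' "$line" >> "$fileName"
    fi
done < InputFile.txt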

Answered by Nick P

Since it's Friday and I'm feeling a bit helpful... :)

Try this. If the file is as small as you imply, it's simplest to just read it all at once and work in memory.

use strict;
use warnings;

# slurp file
local $/ = undef;
open my $fh, '<', 'test.txt' or die $!;
my $text = <$fh>;
close $fh;

# split on double new line
my @chunks = split(/\n\n/, $text);

# make new files from chunks
my $count = 1;
for my $chunk (@chunks) {
    open my $ofh, '>', "whatever$count.txt" or die $!;
    print $ofh $chunk, "\n";
    close $ofh;
    $count++;
}

The perldoc pages can explain any individual commands you don't understand, but at this point you should probably look into a tutorial as well.