Source: Stack Overflow, http://stackoverflow.com/questions/33294986/
This page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow (not to this page) and keep the same license.
Splitting large text file on every blank line
Asked by tropical e
I'm having a bit of trouble splitting a large text file into multiple smaller ones. The syntax of my text file is the following:
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion

asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion

asdasd #299 yadayada 60 40
content
content
contend done

...and so on
A typical information table in my file has anywhere between 10-40 rows.
I would like this file to be split into n smaller files, where n is the number of content tables.
That is
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion
would be its own separate file (whateverN.txt), and
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion
would again be a separate file, whateverN+1.txt, and so forth.
It seems like awk or Perl are nifty tools for this, but since I've never used them before, the syntax is kinda baffling.
I found these two questions that closely match my problem, but failed to modify the syntax to fit my needs:
"Split text file into multiple files" and "How can I split a text file into multiple text files?" (on Unix & Linux)
How should one modify the command line inputs so that they solve my problem?
Answer by jas
Setting RS to null tells awk to use one or more blank lines as the record separator. Then you can simply use NR to set the name of the file corresponding to each new record:
awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt
RS: This is awk's input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines, or a regexp, in which case records are separated by matches of the regexp in the input text.
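To make the paragraph-mode behavior concrete, here is a minimal sketch. The file name sample.txt and its contents are made up for illustration, and it assumes an awk that treats an empty RS as described above (gawk and mawk do). The sample has one blank line between the first two records and three between the last two, yet awk should still see exactly three records.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: three records, separated by one blank line and then by three.
printf 'a1\na2\n\nb1\n\n\n\nc1\nc2\n' > sample.txt

# RS set to the empty string selects paragraph mode: each run of blank lines
# counts as a single separator. split() counts the lines inside each record.
awk -v RS= '{ n = split($0, lines, "\n"); print "record " NR ": " n " line(s)" }' sample.txt
```

Because a whole run of blank lines counts as one separator, no empty records (and so no empty output files) are produced.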
$ cat file.txt
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion

asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion

asdasd #299 yadayada 60 40
content
content
contend done
$ awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt
$ ls whatever-*.txt
whatever-1.txt whatever-2.txt whatever-3.txt
$ cat whatever-1.txt
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion
$ cat whatever-2.txt
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion
$ cat whatever-3.txt
asdasd #299 yadayada 60 40
content
content
contend done
$
Answer by Sobrique
Perl has a useful feature called the input record separator, $/.
This is the 'marker' for separating records when reading a file.
So:
#!/usr/bin/env perl
use strict;
use warnings;

local $/ = "\n\n";
my $count = 0;

while ( my $chunk = <> ) {
    open ( my $output, '>', "filename_".$count++ ) or die $!;
    print {$output} $chunk;
    close ( $output );
}
Just like that. The <> is the 'magic' filehandle, in that it reads piped data or from files specified on the command line (it opens them and reads them). This is similar to how sed or grep work.
This can be reduced to a one liner:
perl -00 -pe 'open ( $out, ">", "filename_".++$n ) or die $!; select $out;' yourfilename_here
Answer by sat
You can use this awk:
awk 'BEGIN{file="content"++i".txt"} !NF{file="content"++i".txt";next} {print > file}' yourfile
(OR)
awk 'BEGIN{i++} !NF{++i;next} {print > ("filename" i ".txt")}' yourfile
More readable format:
BEGIN {
    file="content"++i".txt"
}
!NF {
    file="content"++i".txt";
    next
}
{
    print > file
}
Answer by KuldeepSinh
In case you get a "too many open files" error like the following:
awk: whatever-18.txt makes too many open files
input record number 18, file file.txt
source line number 1
You may need to close each newly created file before creating the next one, as follows.
awk -v RS= '{close("whatever-" i ".txt"); i++}{print > ("whatever-" i ".txt")}' file.txt
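A possible variant of the same idea, sketched under a few assumptions: the output name is stored in one variable (out, a name chosen here) so that print and close() are guaranteed to use the identical string, and each file is closed right after it is written. The sample.txt contents and the part- prefix are hypothetical.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample with three blank-line-separated records.
printf 'a1\na2\n\nb1\n\nc1\nc2\n' > sample.txt

awk -v RS= '{
    out = "part-" NR ".txt"   # one variable, so print and close() see the same name
    print > out
    close(out)                # release the file descriptor before the next record
}' sample.txt

ls
```

Since every record goes to its own file exactly once, closing immediately after the write keeps at most one output file open at a time, however many records the input has.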
Answer by Benjamin W.
You could use the csplit command:
csplit \
--quiet \
--prefix=whatever \
--suffix-format=%02d.txt \
--suppress-matched \
infile.txt /^$/ {*}
POSIX csplit only uses short options and doesn't know --suffix-format and --suppress-matched, so this requires GNU csplit.
This is what the options do:
--quiet – suppress output of file sizes
--prefix=whatever – use whatever instead of the default xx filename prefix
--suffix-format=%02d.txt – append .txt to the default two-digit suffix
--suppress-matched – don't include the lines matching the pattern on which the input is split
/^$/ {*} – split on pattern "empty line" (/^$/) as often as possible ({*})
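As a rough illustration of the options above, the following sketch runs the same kind of command on a tiny made-up input. It assumes a GNU csplit recent enough to support --suppress-matched; the file name sample.txt, its contents, and the part- prefix are all hypothetical. The pattern and repeat count are quoted so the shell passes them through untouched.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: three records separated by single blank lines.
printf 'a1\na2\n\nb1\n\nc1\nc2\n' > sample.txt

# Split at every empty line and drop the empty lines themselves.
csplit \
    --quiet \
    --prefix=part- \
    --suffix-format=%02d.txt \
    --suppress-matched \
    sample.txt '/^$/' '{*}'

ls part-*.txt
```

Note that csplit numbers its output from 00, so the first record lands in part-00.txt.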
Answer by user2138595
awk -v RS="\n\n" '{ print > (NR "") }' file.txt
Sets the record separator to a blank line and prints each record to a separate file numbered 1, 2, 3, etc. Only the last file ends in a blank line. Note that gawk and mawk treat a multi-character RS as a regular expression, while POSIX leaves the behavior unspecified.
Answer by Kalanidhi
You could also try this bash script:
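#!/bin/bash
i=1
fileName="OutputFile_$i"
# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line ; do
    if [ -z "$line" ] ; then
        # blank line: switch to the next output file
        ((++i))
        fileName="OutputFile_$i"
    else
        printf '%s\n' "$line" >> "$fileName"
    fi
done < InputFile.txt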
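The loop above moves to a new file name at every blank line, so a run of consecutive blank lines leaves gaps in the numbering. One possible variant, sketched here with made-up names (split_on_blank, the part_ prefix, and the sample input are not from the original answer), treats a run of blank lines as a single separator and only advances the counter when a new record actually starts. It sticks to POSIX shell features, so it should also run under bash.

```shell
#!/bin/sh
# split_on_blank PREFIX: read stdin, write each blank-line-separated record
# to PREFIX1.txt, PREFIX2.txt, ... (hypothetical helper for illustration).
# The body runs in a subshell, so its variables do not leak to the caller.
split_on_blank() (
    prefix=$1
    i=0
    in_record=0
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        if [ -z "$line" ]; then
            in_record=0                     # a blank line ends the current record
        else
            if [ "$in_record" -eq 0 ]; then
                i=$((i + 1))                # first line of a new record
                : > "${prefix}${i}.txt"     # create or truncate the output file
                in_record=1
            fi
            printf '%s\n' "$line" >> "${prefix}${i}.txt"
        fi
    done
)

# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: two blank lines between the first two records.
printf 'a1\na2\n\n\nb1\n\nc1\nc2\n' | split_on_blank part_
ls
```

Truncating each file when its record starts also means that rerunning the script does not append to output left over from an earlier run.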
Answer by Nick P
Since it's Friday and I'm feeling a bit helpful... :)
Try this. If the file is as small as you imply, it's simplest to just read it all at once and work in memory.
use strict;
use warnings;

# slurp file
local $/ = undef;
open my $fh, '<', 'test.txt' or die $!;
my $text = <$fh>;
close $fh;

# split on double new line
my @chunks = split(/\n\n/, $text);

# make new files from chunks
my $count = 1;
for my $chunk (@chunks) {
    open my $ofh, '>', "whatever$count.txt" or die $!;
    print $ofh $chunk, "\n";
    close $ofh;
    $count++;
}
The perl docs can explain any individual commands you don't understand, but at this point you should probably look into a tutorial as well.