使用 BASH shell 脚本按正则表达式将一个大的 txt 文件拆分为 200 个较小的 txt 文件
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/4952021/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Splitting a large txt file into 200 smaller txt files on a regex using shell script in BASH
提问by rosser
Hi guys, I hope the subject is clear enough; I haven't found anything specifically about this among the previously asked questions. I've tried implementing this in Perl or Python, but I think I may be trying too hard.
嗨,伙计们,我希望标题已经说得够清楚了;我在已有的提问里没有找到专门讨论这个问题的内容。我试过用 Perl 或 Python 实现它,但我觉得自己可能把事情搞复杂了。
Is there a simple shell command / pipeline that will split my 4mb .txt file into separate .txt files, based on a beginning and ending regex?
是否有一个简单的 shell 命令/管道可以根据开始和结束的正则表达式将我的 4mb .txt 文件拆分为单独的 .txt 文件?
I provide a short sample of the file below, so you can see that every "story" starts with the phrase "X of XXX DOCUMENTS", which could be used to split the file.
我在下面提供了文件的简短示例,这样您就可以看到每个“故事”都以短语“X of XXX DOCUMENTS”开头,可以用它来拆分文件。
I think this should be easy and I'd be surprised if bash can't do it - faster than Perl/Py.
我认为这应该很容易,如果 bash 不能做到,我会感到惊讶 - 比 Perl/Py 快。
Here it is:
这里是:
1 of 999 DOCUMENTS
Copyright 2011 Virginian-Pilot Companies LLC
All Rights Reserved
The Virginian-Pilot(Norfolk, VA.)
...
3 of 999 DOCUMENTS
Copyright 2011 Canwest News Service
All Rights Reserved
Canwest News Service
...
Thanks in advance for all your help.
在此先感谢您的帮助。
Ross
罗斯
回答by kurumi
awk '/[0-9]+ of [0-9]+ DOCUMENTS/{g++} { print > g".txt"}' file
OSX users will need gawk, as the builtin awk will produce an error like awk: illegal statement at source line 1
OSX 用户需要 gawk,因为系统自带的 awk 会报类似 awk: illegal statement at source line 1 的错误
Ruby(1.9+)
#!/usr/bin/env ruby
g=1
f=File.open(g.to_s + ".txt","w")
open("file").each do |line|
  if line[/\d+ of \d+ DOCUMENTS/]
    f.close
    g+=1
    f=File.open(g.to_s + ".txt","w")
  end
  f.print line
end
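On the gawk note above: a sketch of a variant of the awk one-liner that should also run on the stock awk shipped with OSX, assuming the error comes from the unparenthesized redirection target; the close() call just keeps at most one output file open at a time:
关于上面提到的 gawk:下面是 awk 单行命令的一个变体草稿,假设报错的原因是重定向目标没有加括号,那么它在 OSX 自带的 awk 上也应该能运行;加上 close() 只是为了同一时间最多打开一个输出文件:
# same logic as the one-liner above; the parentheses around (g ".txt") are the portability tweak
awk '/[0-9]+ of [0-9]+ DOCUMENTS/ { if (g) close(g ".txt"); g++ } { print > (g ".txt") }' file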
回答by ?aphink
As suggested in other solutions, you could use csplit for that:
正如其他回答中所建议的,您可以使用 csplit 来完成:
csplit csplit.test '/^\.\.\./' '{*}' && sed -i '/^\.\.\./d' xx*
I haven't found a better way to get rid of the leftover separator in the split files.
我还没有找到更好的办法来去掉拆分后文件里残留的分隔符。
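A sketch of an alternative that splits directly on the "X of XXX DOCUMENTS" header instead of the "..." lines, so no sed cleanup pass is needed; it assumes GNU csplit (for '{*}' and -z), and the input name file and the story_ prefix are only placeholders:
一个替代思路的草稿:直接按 "X of XXX DOCUMENTS" 标题行来拆分,而不是按 "..." 行拆分,这样就不需要再用 sed 清理;它假设使用 GNU csplit(需要 '{*}' 和 -z),文件名 file 和前缀 story_ 只是占位示例:
# -s silences the size report, -z drops an empty leading piece,
# -f/-b name the pieces story_000.txt, story_001.txt, ...
csplit -s -z -f story_ -b '%03d.txt' file '/[0-9][0-9]* of [0-9][0-9]* DOCUMENTS/' '{*}'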
回答by ?aphink
How hard did you try in Perl?
你用 Perl 到底尝试到什么程度了?
Edit: Here is a faster method. It splits the file and then prints the part files.
编辑:这是一个更快的方法。它先拆分文件,然后再写出各个部分文件。
use strict;
use warnings;
my $count = 1;
open (my $file, '<', 'source.txt') or die "Can't open source.txt: $!";
for (split /(?=^.*\d+[^\S\n]*of[^\S\n]*\d+[^\S\n]*DOCUMENTS)/m, join('',<$file>))
{
    if ( s/^.*(\d+)\s*of\s*\d+\s*DOCUMENTS.*(\n|$)//m )
    {
        open (my $part, '>', "Part_$count.txt")
            or die "Can't open Part_$count for output: $!";
        print $part $_;
        close ($part);
        $count++;
    }
}
close ($file);
This is the line by line method:
这是逐行处理的方法:
use strict;
use warnings;
open (my $masterfile, '<', 'yourfilename.txt') or die "Can't open yourfilename.txt: $!";
my $count = 1;
my $fh;
while (<$masterfile>) {
    if ( /(?<!\d)(\d+)\s*of\s*\d+\s*DOCUMENTS/ ) {
        defined $fh and close ($fh);
        open ($fh, '>', "Part_$count.txt") or die "Can't open Part_$count for output: $!";
        $count++;
        next;
    }
    defined $fh and print $fh $_;
}
defined $fh and close ($fh);
close ($masterfile);
回答by bw_üezi
The regex to match "X of XXX DOCUMENTS" is
\d{1,3} of \d{1,3} DOCUMENTS
匹配 "X of XXX DOCUMENTS" 的正则表达式是
\d{1,3} of \d{1,3} DOCUMENTS
Reading line by line and starting to write a new file upon each regex match should be fine.
逐行读取,在每次正则匹配到时开始写入新文件即可。
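A quick way to sanity-check that pattern against the actual file before splitting, assuming the input is named file; note that grep -E has no \d, so the digit class is spelled out:
在拆分之前,可以先用下面的命令在实际文件上验证这个正则(假设输入文件名为 file);注意 grep -E 不支持 \d,所以把数字类写全了:
# count the "X of XXX DOCUMENTS" header lines; the split should produce the same number of files
grep -Ec '[0-9]{1,3} of [0-9]{1,3} DOCUMENTS' file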
回答by Paused until further notice.
Untested:
未经测试:
base=outputfile
start=1
pattern='^[[:blank:]]*[[:digit:]]+ of [[:digit:]]+ DOCUMENTS[[:blank:]]*$'
while read -r line
do
    if [[ $line =~ $pattern ]]
    then
        ((start++))
        printf -v filecount '%04d' $start
        > "$base$filecount"       # create an empty file named like outputfile0001
    fi
    echo "$line" >> "$base$filecount"
done < "$1"                       # assumption: the input file is passed as the first argument
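A hypothetical way to run the loop above and check the result, assuming it was saved as split_docs.sh and that the input is named file (both names are placeholders):
一种假设性的用法:假设上面的循环被保存为 split_docs.sh,输入文件名为 file(这两个名字都只是占位),可以这样运行并检查结果:
bash split_docs.sh file
ls outputfile* | wc -l                       # number of pieces written
grep -c ' of [0-9][0-9]* DOCUMENTS' file     # number of headers; the two counts should roughly match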
