使用 BASH shell 脚本按正则表达式将一个大的 txt 文件拆分为 200 个较小的 txt 文件
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/4952021/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Splitting a large txt file into 200 smaller txt files on a regex using shell script in BASH
提问by rosser
Hi guys, I hope the subject is clear enough; I haven't found anything specifically about this among the previously asked questions. I've tried implementing this in Perl or Python, but I think I may be trying too hard.
嗨,伙计们,我希望标题已经说得够清楚了;我在已有的提问里没有找到专门讨论这个问题的内容。我试过用 Perl 或 Python 实现它,但我觉得自己可能把事情搞复杂了。
Is there a simple shell command / pipeline that will split my 4mb .txt file into separate .txt files, based on a beginning and ending regex?
是否有一个简单的 shell 命令/管道可以根据开始和结束的正则表达式将我的 4mb .txt 文件拆分为单独的 .txt 文件?
I provide a short sample of the file below, so you can see that every "story" starts with the phrase "X of XXX DOCUMENTS", which could be used to split the file.
我在下面提供了文件的简短示例,这样您就可以看到每个“故事”都以短语“X of XXX DOCUMENTS”开头,可以用它来拆分文件。
I think this should be easy and I'd be surprised if bash can't do it - faster than Perl/Py.
我认为这应该很容易,如果 bash 不能做到,我会感到惊讶 - 比 Perl/Py 快。
Here it is:
这里是:
1 of 999 DOCUMENTS
Copyright 2011 Virginian-Pilot Companies LLC
All Rights Reserved
The Virginian-Pilot(Norfolk, VA.)
...
3 of 999 DOCUMENTS
Copyright 2011 Canwest News Service
All Rights Reserved
Canwest News Service
...
Thanks in advance for all your help.
在此先感谢您的帮助。
Ross
罗斯
回答by kurumi
awk '/[0-9]+ of [0-9]+ DOCUMENTS/{g++} { print > g".txt"}' file
OSX users will need gawk, as the builtin awk will produce an error like awk: illegal statement at source line 1
OSX 用户需要 gawk,因为系统自带的 awk 会报类似 awk: illegal statement at source line 1 的错误
Ruby(1.9+)
#!/usr/bin/env ruby
g=1
f=File.open(g.to_s + ".txt","w")
open("file").each do |line|
  if line[/\d+ of \d+ DOCUMENTS/]
    f.close
    g+=1
    f=File.open(g.to_s + ".txt","w")
  end
  f.print line
end
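On the gawk note above: a sketch of a variant of the awk one-liner that should also run on the stock awk shipped with OSX, assuming the error comes from the unparenthesized redirection target; the close() call just keeps at most one output file open at a time:
关于上面提到的 gawk:下面是 awk 单行命令的一个变体草稿,假设报错的原因是重定向目标没有加括号,那么它在 OSX 自带的 awk 上也应该能运行;加上 close() 只是为了同一时间最多打开一个输出文件:
# same logic as the one-liner above; the parentheses around (g ".txt") are the portability tweak
awk '/[0-9]+ of [0-9]+ DOCUMENTS/ { if (g) close(g ".txt"); g++ } { print > (g ".txt") }' file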
回答by ?aphink
As suggested in other solutions, you could use csplit for that:
正如其他回答中所建议的,您可以使用 csplit 来完成:
csplit csplit.test '/^\.\.\./' '{*}' && sed -i '/^\.\.\./d' xx*
I haven't found a better way to get rid of the leftover separator in the split files.
我还没有找到更好的办法来去掉拆分后文件里残留的分隔符。
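A sketch of an alternative that splits directly on the "X of XXX DOCUMENTS" header instead of the "..." lines, so no sed cleanup pass is needed; it assumes GNU csplit (for '{*}' and -z), and the input name file and the story_ prefix are only placeholders:
一个替代思路的草稿:直接按 "X of XXX DOCUMENTS" 标题行来拆分,而不是按 "..." 行拆分,这样就不需要再用 sed 清理;它假设使用 GNU csplit(需要 '{*}' 和 -z),文件名 file 和前缀 story_ 只是占位示例:
# -s silences the size report, -z drops an empty leading piece,
# -f/-b name the pieces story_000.txt, story_001.txt, ...
csplit -s -z -f story_ -b '%03d.txt' file '/[0-9][0-9]* of [0-9][0-9]* DOCUMENTS/' '{*}'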
回答by ?aphink
How hard did you try in Perl?
你用 Perl 到底尝试到什么程度了?
Edit: Here is a faster method. It splits the file and then prints the part files.
编辑:这是一个更快的方法。它先拆分文件,然后再写出各个部分文件。
use strict;
use warnings;
my $count = 1;
open (my $file, '<', 'source.txt') or die "Can't open source.txt: $!";
for (split /(?=^.*\d+[^\S\n]*of[^\S\n]*\d+[^\S\n]*DOCUMENTS)/m, join('',<$file>))
{
    if ( s/^.*(\d+)\s*of\s*\d+\s*DOCUMENTS.*(\n|$)//m )
    {
        open (my $part, '>', "Part_$count.txt")
            or die "Can't open Part_$count for output: $!";
        print $part $_;
        close ($part);
        $count++;
    }
}
close ($file);
This is the line by line method:
这是逐行处理的方法:
use strict;
use warnings;
open (my $masterfile, '<', 'yourfilename.txt') or die "Can't open yourfilename.txt: $!";
my $count = 1;
my $fh;
while (<$masterfile>) {
    if ( /(?<!\d)(\d+)\s*of\s*\d+\s*DOCUMENTS/ ) {
        defined $fh and close ($fh);
        open ($fh, '>', "Part_$count.txt") or die "Can't open Part_$count for output: $!";
        $count++;
        next;
    }
    defined $fh and print $fh $_;
}
defined $fh and close ($fh);
close ($masterfile);
回答by bw_üezi
The regex to match "X of XXX DOCUMENTS" is
\d{1,3} of \d{1,3} DOCUMENTS
匹配 "X of XXX DOCUMENTS" 的正则表达式是
\d{1,3} of \d{1,3} DOCUMENTS
Reading line by line and starting to write a new file upon each regex match should be fine.
逐行读取,在每次正则匹配到时开始写入新文件即可。
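A quick way to sanity-check that pattern against the actual file before splitting, assuming the input is named file; note that grep -E has no \d, so the digit class is spelled out:
在拆分之前,可以先用下面的命令在实际文件上验证这个正则(假设输入文件名为 file);注意 grep -E 不支持 \d,所以把数字类写全了:
# count the "X of XXX DOCUMENTS" header lines; the split should produce the same number of files
grep -Ec '[0-9]{1,3} of [0-9]{1,3} DOCUMENTS' file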
回答by Paused until further notice.
Untested:
未经测试:
base=outputfile
start=1
pattern='^[[:blank:]]*[[:digit:]]+ of [[:digit:]]+ DOCUMENTS[[:blank:]]*$'
while read -r line
do
    if [[ $line =~ $pattern ]]
    then
        ((start++))
        printf -v filecount '%04d' $start
        > "$base$filecount"       # create an empty file named like outputfile0001
    fi
    echo "$line" >> "$base$filecount"
done < "$1"                       # assumption: the input file is passed as the first argument
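A hypothetical way to run the loop above and check the result, assuming it was saved as split_docs.sh and that the input is named file (both names are placeholders):
一种假设性的用法:假设上面的循环被保存为 split_docs.sh,输入文件名为 file(这两个名字都只是占位),可以这样运行并检查结果:
bash split_docs.sh file
ls outputfile* | wc -l                       # number of pieces written
grep -c ' of [0-9][0-9]* DOCUMENTS' file     # number of headers; the two counts should roughly match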
