Splitting a file based on content in Linux

Question

Asked by Greenhorn

I have an email dump of around 400 MB. I want to split it into .txt files with one mail per file. Every e-mail starts with the standard HTML header specifying the doctype.

This means I have to split the file on that header. How do I go about it in Linux?

Answer 1

Accepted answer by kev

Suppose you have a mail.txt like this:

$ cat mail.txt
<html>
    mail A
</html>

<html>
    mail B
</html>

<html>
    mail C
</html>

Run csplit to split on every <html> line:

$ csplit mail.txt '/^<html>$/' '{*}'

 - mail.txt    => input file
 - /^<html>$/  => pattern match every `<html>` line
 - {*}         => repeat the previous pattern as many times as possible

Check the output (xx00 is the empty piece before the first match):

$ ls
mail.txt  xx00  xx01  xx02  xx03
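
Since the question asks for .txt files, the same csplit call can be adjusted with a few GNU csplit options. The following is only a sketch under assumptions: it presumes GNU coreutils, and the scratch directory and the `mail_` prefix are made up for illustration.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)   # scratch directory, hypothetical location
cd "$workdir"

# Recreate the three-mail sample from the answer.
printf '%s\n' '<html>' '    mail A' '</html>' '' \
              '<html>' '    mail B' '</html>' '' \
              '<html>' '    mail C' '</html>' > mail.txt

# -z  drop empty output pieces (such as the one before the first match)
# -s  do not print the byte count of each piece
# -f  output file prefix; -b  printf-style suffix format
csplit -z -s -f mail_ -b '%03d.txt' mail.txt '/^<html>$/' '{*}'

ls
```

For the real dump, the pattern would have to match the actual doctype line instead of `^<html>$`.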

If you want to do it in awk (the files are named after the line number on which each mail starts):

$ awk '/<html>/{filename=NR".txt"}; {print >filename}' mail.txt
$ ls
1.txt  5.txt  9.txt  mail.txt
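
With a 400 MB dump the awk one-liner may keep thousands of output files open at once, which some awk implementations do not tolerate. A possible variant, sketched here with a hypothetical `mail_NNNN.txt` naming scheme, closes each file before starting the next and numbers the mails sequentially:

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)   # scratch directory, hypothetical location
cd "$workdir"

# Recreate the three-mail sample from the answer.
printf '%s\n' '<html>' '    mail A' '</html>' '' \
              '<html>' '    mail B' '</html>' '' \
              '<html>' '    mail C' '</html>' > mail.txt

awk '
    /^<html>$/ {                            # a new mail starts here
        if (out) close(out)                 # release the previous file handle
        out = sprintf("mail_%04d.txt", ++n) # sequential name: mail_0001.txt, ...
    }
    out { print > out }                     # ignore anything before the first header
' mail.txt

ls
```

The anchored pattern `^<html>$` is an assumption taken from the sample; the real header line would go in its place.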

Answer 2

Answered by fge

It is doable with some perl "magic"... Many people would call this ugly but here goes.

The trick is to set $/ (the input record separator) to the header you want to split on and then read your input, like this:

#!/usr/bin/perl -W
use strict;
my $i = 1;

$/ = <<EOF;
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head> <xmeta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
EOF

open my $in, '<', "/path/to/inputfile" or die "cannot open input: $!";

while (my $mail = <$in>) {
    chomp $mail;    # strip the trailing separator ($/) if the record ends with it
    open my $out, '>', "/path/to/emailfile.$i.txt" or die "cannot open output: $!";
    $i++;
    print $out $mail;
    close $out;
}

Edit: fixed. I always forget that $/ is included in the input. Also, the first file will always be empty, but that is easy to handle.

Answer 3

Answered by jaypal singh

I agree with fge. With perl it would be a lot simpler. You can try something like this:

#!/usr/bin/perl

undef $/;
$_ = <>;
$n = 0;

for $match (split(/(?=HEADER_FORMAT)/)) {
      open(O, '>mail' . ++$n);
      print O $match;
      close(O);
}

Replace HEADER_FORMAT with a regular expression matching your header.

Answer 4

Answered by thiton

The csplit program solves your problem elegantly:

csplit "$FILE" '/<!DOCTYPE.*/' '{*}'

Answer 5

Answered by Fredrik Pihl

csplit is the best solution to this problem. I just thought I'd post a bash solution to show that there is no need to reach for perl on this task:

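#!/bin/bash

MAIL='mail'        # path to huge mail-file

# Get the line numbers of all lines containing "html" (opening and closing tags).
mapfile -t line_nos < <(grep -n html "$MAIL" | cut -d: -f1)

file=0
# Step by 2: each mail is an (opening, closing) pair of line numbers.
for ((i = 0; i + 1 < ${#line_nos[@]}; i += 2)); do
    start=${line_nos[i]}
    end=$((${line_nos[i+1]} - 1))   # stop just before the closing tag
    echo "$start, $end"
    sed -n "${start},${end}p" "$MAIL" > "${MAIL}${file}.txt"
    file=$((file + 1))
done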
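
The script above runs sed once per mail, so the input is rescanned for every output file. As a rough alternative, a single pass with a plain `read` loop might look like the sketch below. It assumes each mail starts with a line that is exactly `<html>` (as in the sample, not the real doctype header), uses made-up file names, and is likely to be slow on a 400 MB file because bash reads line by line.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)   # scratch directory, hypothetical location
cd "$workdir"

# Recreate the three-mail sample as the input file "mail".
printf '%s\n' '<html>' '    mail A' '</html>' '' \
              '<html>' '    mail B' '</html>' '' \
              '<html>' '    mail C' '</html>' > mail

n=0
out=/dev/null          # lines before the first header are discarded

# "|| [ -n "$line" ]" keeps a final line that lacks a trailing newline.
while IFS= read -r line || [ -n "$line" ]; do
    if [[ $line == '<html>' ]]; then
        n=$((n + 1))
        out="mail${n}.txt"
        : > "$out"     # create or truncate the next output file
    fi
    printf '%s\n' "$line" >> "$out"
done < mail

ls
```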
