bash: How to split an mbox file into n-MB chunks using the terminal?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/28110536/

Time: 2020-09-18 12:14:26  Source: igfitidea

How to split an mbox file into n-MB big chunks using the terminal?

bash terminal mbox

Asked by Alex

So I've read through this question on SO, but it doesn't quite help me. I want to import a Gmail-generated mbox file into another webmail service, but the problem is that it only allows files of up to 40 MB per import.

So I somehow have to split the mbox file into files of at most 40 MB each and import them one after another. How would you do this?

My initial thought was to use the other script (formail) to save each mail as a single file and afterwards run a script to combine them into files of up to 40 MB, but I still wouldn't know how to do this using the terminal.

I also looked at the split command, but I'm afraid it would cut off mails. Thanks for any help!

Answered by Mark Setchell

If your mbox is in standard format, each message will begin with From and a space:

From [email protected]

So, you could COPY YOUR MBOX TO A TEMPORARY DIRECTORY and try using awk to process it on a message-by-message basis, splitting only at the start of a message. Let's say we went for 1,000 messages per output file:

awk 'BEGIN{chunk=0} /^From /{if(msgs==1000){close("chunk_" chunk ".txt");msgs=0;chunk++} msgs++} {print > ("chunk_" chunk ".txt")}' mbox

Then you will get output files called chunk_0.txt to chunk_n.txt, each containing up to 1,000 messages.

If you are unfortunate enough to be on Windows (which is incapable of understanding single quotes), you will need to save the following in a file called awk.txt:

BEGIN{chunk=0} /^From /{if(msgs==1000){close("chunk_" chunk ".txt");msgs=0;chunk++} msgs++} {print > ("chunk_" chunk ".txt")}

and then type

awk -f awk.txt mbox
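
To check the result, one can count the lines that start with From and a space in each chunk. This is only a minimal sketch: count_messages is a hypothetical helper name, not part of the answer, and it assumes the mbox does not contain unescaped body lines that begin with "From ".

```shell
# count_messages is an illustrative helper (hypothetical, not from the answer).
# It prints "<file> <number of messages>" for every file given.
count_messages() {
    for f in "$@"; do
        # grep -c counts matching lines; each message starts with "From "
        printf '%s %s\n' "$f" "$(grep -c '^From ' "$f")"
    done
}

# Example: count_messages chunk_*.txt
```

If the counts add up to the number of messages in the original mbox, no message was lost at a chunk boundary.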

Answered by Oki Erie Rinaldi

I just improved the script from Mark Setchell's answer. As we can see, that script splits the mbox file based on the number of emails per chunk. This improved script splits the mbox file based on a defined maximum size for each chunk.
So, if you have a size limit when uploading or importing the mbox file, you can try the script below to split the mbox file into chunks of a specified size (see the notes below).
Save the script below to a text file, e.g. mboxsplit.txt, in the directory that contains the mbox file (e.g. named mbox):

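BEGIN { chunk = 0; filesize = 0 }
/^From / {
    # start a new chunk only at a message boundary
    if (filesize >= 40000000) {    # maximum chunk size, in bytes
        close("chunk_" chunk ".txt")
        filesize = 0
        chunk++
    }
}
# length() omits the newline, so add 1 per line; in a multibyte locale some
# awks count characters rather than bytes (LC_ALL=C makes them count bytes)
{ filesize += length() + 1 }
{ print > ("chunk_" chunk ".txt") }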

And then run this line in that directory (the one that contains mboxsplit.txt and the mbox file):

  awk -f mboxsplit.txt mbox

Please note:

  • The resulting chunks may be larger than the defined size, because the size is checked only when a new message starts: the last email written to a chunk can push it past the limit.
  • It will not split an email across chunks.
  • A chunk may contain only one email if that email is larger than the specified chunk size.

I suggest specifying a chunk size lower than the maximum upload/import size.
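
The limit can also be passed in from the command line instead of being edited in the script. The following is a minimal sketch of that idea, not the answer's own code: the function name split_mbox_by_size and the awk variable max are hypothetical, and the chunk_N.txt naming is kept from the script above.

```shell
# split_mbox_by_size MBOX MAX_BYTES
# Illustrative wrapper (hypothetical name): writes chunk_0.txt, chunk_1.txt, ...
# into the current directory, starting a new chunk only at a "From " line.
split_mbox_by_size() {
    LC_ALL=C awk -v max="$2" '
        BEGIN { chunk = 0; size = 0 }
        /^From / && size >= max {        # split only at a message boundary
            close("chunk_" chunk ".txt")
            size = 0
            chunk++
        }
        {
            size += length($0) + 1       # +1 for the newline
            print > ("chunk_" chunk ".txt")
        }
    ' "$1"
}

# Example: stay a little below a 40 MB import limit, then inspect the sizes
# split_mbox_by_size mbox 38000000
# ls -l chunk_*.txt
```

Because a chunk can still overshoot by the size of its last email, checking the sizes with ls -l before importing is a cheap safeguard.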

Answered by Olaf Dietsche

formail is perfectly suited for this task. You may look at formail's +skip and -total options:

Options
...
+skip
Skip the first skip messages while splitting.
-total
Output at most total messages while splitting.

Depending on the size of your mailbox and mails, you may try

formail -100 -s <google.mbox >import-01.mbox
formail +100 -100 -s <google.mbox >import-02.mbox
formail +200 -100 -s <google.mbox >import-03.mbox

etc.

The parts need not be of equal size, of course. If there's one large e-mail, you may take only 60 messages with formail +100 -60 -s <google.mbox >import-02.mbox, or if there are many small messages, maybe 500 with formail +100 -500 -s <google.mbox >import-02.mbox.

To find an initial number of mails per chunk, try

formail -100 -s <google.mbox | wc
formail -500 -s <google.mbox | wc
formail -1000 -s <google.mbox | wc

You may need to experiment a bit to find a count that suits your mailbox. On the other hand, since this seems to be a one-time task, you may not want to spend too much time on this.
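
Once a workable count is found, the repeated formail calls can be generated in a loop. This is only a sketch under assumptions: split_by_count is a hypothetical helper name, it uses the same fixed count for every part, it assumes formail accepts +0 for "skip nothing", and it rereads the mailbox once per part.

```shell
# split_by_count MBOX N
# Illustrative helper (hypothetical name): writes import-01.mbox,
# import-02.mbox, ... with at most N messages each.
split_by_count() {
    mbox=$1 per=$2 skip=0 part=1
    while :; do
        out=$(printf 'import-%02d.mbox' "$part")
        formail +"$skip" -"$per" -s <"$mbox" >"$out"
        # stop once formail has no more messages to output
        if [ ! -s "$out" ]; then rm -f "$out"; break; fi
        skip=$((skip + per))
        part=$((part + 1))
    done
}

# Example: split_by_count google.mbox 100
```

Any part that still exceeds the import limit would have to be redone by hand with a smaller count, as described above.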

Answered by David W.

My initial thought was to use the other script (formail) to save each mail as a single file and afterwards run a script to combine them into files of up to 40 MB, but I still wouldn't know how to do this using the terminal.

If I understand you correctly, you want to split the file up, then combine the pieces into one big file before importing it. That sounds like what split and cat were meant to do. split splits the file according to your size specification, given either in lines or in bytes. It then adds a suffix to the pieces to keep them in order. You then use cat to put the files back together:

$ split -b40m -a5 mbox mbox.  # this makes mbox.aaaaa, mbox.aaaab, etc.

Once you get the files on the other system:

$ cat mbox.* > mbox

You wouldn't do this if you need every file to contain only whole messages, for example because you are going to import each file into the new mail system one at a time: split cuts at byte boundaries, so a message can end up divided between two files.
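
The split and cat round trip can be checked locally before anything is transferred. A minimal sketch, assuming a split that takes the output prefix as its last argument; roundtrip_check is a hypothetical helper name, and the optional second argument (piece size, default 40m) is added here only for illustration.

```shell
# roundtrip_check FILE [SIZE]
# Illustrative helper (hypothetical name): splits FILE into pieces named
# FILE.aaaaa, FILE.aaaab, ..., joins them again and compares the result
# with the original file.
roundtrip_check() {
    split -b "${2:-40m}" -a 5 "$1" "$1."
    # the glob expands in sorted order, which matches the suffix order
    cat "$1".????? | cmp -s - "$1" && echo "pieces rejoin to an identical file"
}

# Example: roundtrip_check mbox
```

The check confirms only that the pieces rejoin byte for byte; it says nothing about whether an individual piece is a valid mbox on its own.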