Original question: http://stackoverflow.com/questions/28110536/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to split an mbox file into n-MB big chunks using the terminal?
Asked by Alex
So I've read through this question on SO, but it doesn't quite help me. I want to import a Gmail-generated mbox file into another webmail service, but the problem is that it only allows files of up to 40 MB per import.
So I somehow have to split the mbox file into files of at most 40 MB each and import them one after another. How would you do this?
My initial thought was to use the other script (formail) to save each mail as a single file and afterwards run a script to combine them into files of up to 40 MB, but I still wouldn't know how to do this using the terminal.
I also looked at the split command, but I'm afraid it would cut off mails.
Thanks for any help!
Answered by Mark Setchell
If your mbox is in standard format, each message will begin with From and a space:
From [email protected]
So, you could COPY YOUR MBOX TO A TEMPORARY DIRECTORY and try using awk to process it on a message-by-message basis, only splitting at the start of a message. Let's say we went for 1,000 messages per output file:
awk 'BEGIN{chunk=0} /^From /{if(msgs==1000){close("chunk_" chunk ".txt");msgs=0;chunk++}msgs++} {print > ("chunk_" chunk ".txt")}' mbox
Then you will get output files called chunk_0.txt to chunk_n.txt, each containing up to 1,000 messages.
If you are unfortunate enough to be on Windows (whose command prompt does not treat single quotes as quoting characters), you will need to save the following in a file called awk.txt:
BEGIN{chunk=0} /^From /{if(msgs==1000){close("chunk_" chunk ".txt");msgs=0;chunk++}msgs++} {print > ("chunk_" chunk ".txt")}
and then type:
awk -f awk.txt mbox
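A quick way to sanity-check the result is to count the From lines in each chunk. The following is only a minimal sketch on a throwaway mailbox: the file name sample.mbox, the demo_ prefix, the example.com addresses, and the limit of 2 messages per file are all made up for the demonstration and are not part of the answer above.

```shell
#!/bin/sh
# Hypothetical demo data: a tiny 5-message mbox in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for i in 1 2 3 4 5; do
    printf 'From sender%d@example.com Thu Jan 22 10:00:00 2015\nSubject: test %d\n\nBody of message %d\n\n' "$i" "$i" "$i"
done > sample.mbox

# Same splitting idea as the one-liner above, with per=2 instead of 1000.
awk -v per=2 '
    BEGIN { chunk = 0 }
    # At a message boundary, move to the next file once the current one is full.
    /^From / { if (msgs == per) { close("demo_" chunk ".txt"); msgs = 0; chunk++ } msgs++ }
    { print > ("demo_" chunk ".txt") }
' sample.mbox

# Count the messages that landed in each chunk (one "file:count" line per chunk).
grep -c '^From ' demo_*.txt
```

With five messages and two per file, this should report two messages in demo_0.txt and demo_1.txt and one in demo_2.txt. Running cat demo_*.txt | cmp - sample.mbox is a further check that nothing was lost or reordered.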
Answered by Oki Erie Rinaldi
I just improved the script from Mark Setchell's answer. As we can see, that script splits the mbox file based on the number of emails per chunk. This improved script splits the mbox file based on a defined maximum size for each chunk.
So, if you have a size limit when uploading or importing the mbox file, you can try the script below to split the mbox file into chunks of a specified size*.
Save the script below to a text file, e.g. mboxsplit.txt, in the directory that contains the mbox file (e.g. named mbox):
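BEGIN { chunk = 0; filesize = 0 }
/^From / {
    # Start a new chunk only at a message boundary, once the current chunk is full
    if (filesize >= 40000000) {    # chunk size threshold in bytes
        close("chunk_" chunk ".txt")
        filesize = 0
        chunk++
    }
}
# +1 accounts for the newline that print writes after each line;
# length() counts characters, so run awk with LC_ALL=C if you need exact bytes
{ filesize += length($0) + 1 }
{ print > ("chunk_" chunk ".txt") }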
Then run this line in the directory that contains both mboxsplit.txt and the mbox file:
awk -f mboxsplit.txt mbox
Please note:
- A chunk may end up larger than the defined size. The size is only checked at the start of each message, so a chunk can overshoot the limit by up to the size of the last email written to it.
- It will not split an email body across chunks.
- A chunk may contain only one email if that email is larger than the specified chunk size.
I suggest setting the chunk size somewhat lower than the maximum upload/import size.
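To avoid editing the script for every limit, the threshold can also be passed in from the command line with awk's -v option. The following is a minimal sketch under stated assumptions: the variable name max, the part_ prefix, sample.mbox, and the tiny 200-byte limit are hypothetical and chosen only so the demonstration runs on a scratch mailbox.

```shell
#!/bin/sh
# Hypothetical demo data: six small messages in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for i in 1 2 3 4 5 6; do
    printf 'From sender%d@example.com Thu Jan 22 10:00:00 2015\nSubject: test %d\n\n%s\n\n' \
        "$i" "$i" "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
done > sample.mbox

# max is the threshold in bytes. It is 200 here; for a 40 MB import limit you
# might pick something lower, such as 38000000 (an assumed safety margin).
# LC_ALL=C makes length() count bytes rather than multibyte characters.
LC_ALL=C awk -v max=200 '
    BEGIN { chunk = 0; size = 0 }
    # Switch files only at a message boundary, once the current chunk is full.
    /^From / && size >= max { close("part_" chunk ".mbox"); size = 0; chunk++ }
    { size += length($0) + 1; print > ("part_" chunk ".mbox") }
' sample.mbox

# Report the size in bytes of every chunk.
wc -c part_*.mbox
```

Because the check happens before a message is written, every chunk ends on a message boundary, and wc -c shows how far each chunk overshoots the threshold.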
Answered by Olaf Dietsche
formail is perfectly suited for this task. You may look at formail's +skip and -total options:
Options
...
+skip
Skip the first skip messages while splitting.
-total
Output at most total messages while splitting.
Depending on the size of your mailbox and mails, you may try:
formail -100 -s <google.mbox >import-01.mbox
formail +100 -100 -s <google.mbox >import-02.mbox
formail +200 -100 -s <google.mbox >import-03.mbox
etc.
The parts need not be of equal size, of course. If there's one large e-mail, you may take fewer messages, e.g. formail +100 -60 -s <google.mbox >import-02.mbox, or if there are many small messages, maybe formail +100 -500 -s <google.mbox >import-02.mbox.
To find an initial number of mails per chunk, try:
formail -100 -s <google.mbox | wc
formail -500 -s <google.mbox | wc
formail -1000 -s <google.mbox | wc
You may need to experiment a bit to find counts that suit your mailbox. On the other hand, since this seems to be a one-time task, you may not want to spend too much time on it.
Answered by David W.
"My initial thought was to use the other script (formail) to save each mail as a single file and afterwards run a script to combine them to 40 MB huge files, but still I wouldn't know how to do this using the terminal."
If I understand you correctly, you want to split the file up, then combine the pieces into one big file again before importing it. That sounds like what split and cat were meant to do. split cuts the file according to your size specification, given either in lines or in bytes, and adds a suffix to each piece to keep them in order. You then use cat to put the pieces back together:
$ split -b40m -a5 mbox mbox. # this makes mbox.aaaaa, mbox.aaaab, etc.
Once you get the files on the other system:
$ cat mbox.* > mbox
Don't do this, however, if you are going to import each file into the new mail system one at a time: split -b cuts at byte offsets, so messages would be split between files.
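Both points can be checked on a small scale. The following is a minimal sketch, assuming a made-up four-message mailbox (sample.mbox), a hypothetical piece. prefix, and a 100-byte piece size picked only to force cuts in the middle of messages.

```shell
#!/bin/sh
# Hypothetical demo data: four small messages in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for i in 1 2 3 4; do
    printf 'From sender%d@example.com Thu Jan 22 10:00:00 2015\nSubject: test %d\n\nBody of message %d\n\n' "$i" "$i" "$i"
done > sample.mbox

# Cut into 100-byte pieces named piece.aa, piece.ab, ...
split -b 100 sample.mbox piece.

# Pieces after the first generally start mid-message, not on a "From " line.
head -n 1 piece.ab

# Concatenating the pieces in name order restores the original byte for byte.
cat piece.* > rejoined.mbox
cmp sample.mbox rejoined.mbox && echo "round trip OK"
```

The round trip succeeds, but the individual pieces are not valid mailboxes on their own, which is why this approach only fits a transfer where the pieces are rejoined before import.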