Original source: http://stackoverflow.com/questions/31392361/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow (not me), cite the original address, and follow the same CC BY-SA license.
How to read eml file in python?
Asked by B?o Nguy?n
I do not know how to load an eml file in Python 3.4.
I want to list all of them and read all of them in Python.
Answered by Dalen
This is how you get the content of an e-mail, i.e. a *.eml file. It works perfectly on Python 2.5 - 2.7. Try it on Python 3; it should work as well.
# Python 3.2+ needs the binary parser for files opened in "rb" mode;
# Python 2 only has message_from_file, which does the same job there.
try:
    from email import message_from_binary_file as message_from_file
except ImportError:
    from email import message_from_file
import os

# Path to directory where attachments will be stored:
path = "./msgfiles"

# To have attachments extracted into memory, change behaviour of 2 following functions:

def file_exists (f):
    """Checks whether extracted file was extracted before."""
    return os.path.exists(os.path.join(path, f))

def save_file (fn, cont):
    """Saves cont to a file fn"""
    # Create the target directory on first use, otherwise open() fails:
    if not os.path.isdir(path): os.makedirs(path)
    file = open(os.path.join(path, fn), "wb")
    file.write(cont)
    file.close()

def construct_name (id, fn):
    """Constructs a file name out of messages ID and packed file name"""
    # The ID must contain at least one dot, e.g. "message.eml":
    id = id.split(".")
    id = id[0]+id[1]
    return id+"."+fn

def disqo (s):
    """Removes double or single quotations."""
    s = s.strip()
    if s.startswith("'") and s.endswith("'"): return s[1:-1]
    if s.startswith('"') and s.endswith('"'): return s[1:-1]
    return s

def disgra (s):
    """Removes < and > from HTML-like tag or e-mail address or e-mail ID."""
    s = s.strip()
    if s.startswith("<") and s.endswith(">"): return s[1:-1]
    return s

def decode_text (m):
    """Returns the payload of a text part as a decoded string."""
    # get_payload(decode=True) returns bytes on Python 3, so decode them
    # with the charset declared by the part (UTF-8 if none or unknown):
    raw = m.get_payload(decode=True)
    if raw is None: return ""
    try: return raw.decode(m.get_content_charset() or "utf-8", "replace")
    except LookupError: return raw.decode("utf-8", "replace")

def pullout (m, key):
    """Extracts content from an e-mail message.
    This works for multipart and nested multipart messages too.
    m   -- email.Message() or mailbox.Message()
    key -- Initial message ID (some string)
    Returns tuple(Text, Html, Files, Parts)
    Text  -- All text from all parts.
    Html  -- All HTMLs from all parts
    Files -- Dictionary mapping each packed file name to a tuple
             (name of the extracted file, content ID or None).
    Parts -- Number of parts in original message.
    """
    Html = ""
    Text = ""
    Files = {}
    Parts = 0
    if not m.is_multipart():
        if m.get_filename(): # It's an attachment
            fn = m.get_filename()
            cfn = construct_name(key, fn)
            Files[fn] = (cfn, None)
            if file_exists(cfn): return Text, Html, Files, 1
            save_file(cfn, m.get_payload(decode=True))
            return Text, Html, Files, 1
        # Not an attachment!
        # See where this belongs. Text, Html or some other data:
        cp = m.get_content_type()
        if cp=="text/plain": Text += decode_text(m)
        elif cp=="text/html": Html += decode_text(m)
        else:
            # Something else!
            # Extract a message ID and a file name if there is one:
            # This is some packed file and name is contained in content-type header
            # instead of content-disposition header explicitly
            cp = m.get("content-type", "")
            try: id = disgra(m.get("content-id"))
            except AttributeError: id = None # There is no content-id header
            # Find file name:
            o = cp.find("name=")
            if o==-1: return Text, Html, Files, 1
            ox = cp.find(";", o)
            if ox==-1: ox = None
            o += 5; fn = cp[o:ox]
            fn = disqo(fn)
            cfn = construct_name(key, fn)
            Files[fn] = (cfn, id)
            if file_exists(cfn): return Text, Html, Files, 1
            save_file(cfn, m.get_payload(decode=True))
        return Text, Html, Files, 1
    # This IS a multipart message.
    # So, we iterate over its parts and call pullout() recursively for each part.
    for pl in m.get_payload():
        # pl is a new Message object which goes back to pullout
        t, h, f, p = pullout(pl, key)
        Text += t; Html += h; Files.update(f); Parts += p
    return Text, Html, Files, Parts

def extract (msgfile, key):
    """Extracts all data from e-mail, including From, To, etc., and returns it as a dictionary.
    msgfile -- A file-like readable object, opened in binary mode
    key     -- Some ID string for that particular Message. Can be a file name or anything.
    Returns dict()
    Keys: from, to, subject, date, text, html, parts[, files]
    Key files will be present only when message contained binary files.
    For more see __doc__ for pullout() and caption() functions.
    """
    m = message_from_file(msgfile)
    From, To, Subject, Date = caption(m)
    Text, Html, Files, Parts = pullout(m, key)
    Text = Text.strip(); Html = Html.strip()
    msg = {"subject": Subject, "from": From, "to": To, "date": Date,
        "text": Text, "html": Html, "parts": Parts}
    if Files: msg["files"] = Files
    return msg

def caption (origin):
    """Extracts: To, From, Subject and Date from email.Message() or mailbox.Message()
    origin -- Message() object
    Returns tuple(From, To, Subject, Date)
    If message doesn't contain one/more of them, the empty strings will be returned.
    """
    # "in" replaces has_key(), which Python 3 no longer provides:
    Date = ""
    if "date" in origin: Date = origin["date"].strip()
    From = ""
    if "from" in origin: From = origin["from"].strip()
    To = ""
    if "to" in origin: To = origin["to"].strip()
    Subject = ""
    if "subject" in origin: Subject = origin["subject"].strip()
    return From, To, Subject, Date

# Usage:
f = open("message.eml", "rb")
print(extract(f, f.name))
f.close()
I wrote this for my mail group using the mailbox module, which is why it is so convoluted. It has never failed me and never produced any junk. If the message is multipart and carries other files, the output dictionary will contain a key "files" (a sub-dictionary) with the file names of all extracted parts that were not text or HTML. That was my way of extracting attachments and other binary data. You may change it in pullout(), or just change the behaviour of file_exists() and save_file().
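As a rough illustration of the last remark, the two helper functions could be swapped for versions that keep the extracted data in memory. This is only a sketch: the dictionary name `extracted` and the sample file name and bytes are assumptions made for the example, not part of the original answer.

```python
# Hypothetical in-memory replacements for file_exists() and save_file().
# The name `extracted` is an assumption made for this sketch.
extracted = {}  # maps constructed file name -> raw bytes of the packed file

def file_exists(f):
    """Checks whether a file with this name was already stored in memory."""
    return f in extracted

def save_file(fn, cont):
    """Keeps cont in memory under the name fn instead of writing it to disk."""
    extracted[fn] = cont

# Small demonstration with made-up data:
if not file_exists("messageeml.report.pdf"):
    save_file("messageeml.report.pdf", b"%PDF-1.4 made-up content")
print(sorted(extracted))
```

With these definitions in place of the originals, pullout() would need no other change, because it only touches storage through these two functions.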
construct_name() builds a file name from the message ID and the file name of the packed part, if there is one.
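To make the naming scheme concrete, here is a small sketch that repeats construct_name() as given in the answer and calls it with two keys. The keys and the file name "report.pdf" are made up for the illustration.

```python
def construct_name(id, fn):
    """Same logic as construct_name() in the answer above."""
    id = id.split(".")
    id = id[0] + id[1]
    return id + "." + fn

# Key taken from a file name, as in the usage example of the answer:
print(construct_name("message.eml", "report.pdf"))

# Key taken from a made-up message ID with the angle brackets removed:
print(construct_name("1234.5678@example.com", "report.pdf"))

# A key without any dot cannot be handled by this scheme:
try:
    construct_name("message", "report.pdf")
except IndexError:
    print("the key needs at least one dot")
```

Only the first two dot-separated pieces of the key are used, so two keys that share those pieces would map to the same stored file name.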
In pullout() the Text and Html variables are strings. For an online mail group it was fine to get all text or HTML packed into a multipart message, other than attachments, in one piece.
If you need something more sophisticated, change Text and Html to lists, append each part to them, and combine them as needed. Nothing problematic.
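One possible shape of that list-based variant is sketched below. It is not the answer's pullout(): the function name `collect_bodies` and the sample message are assumptions for the example, and it uses the standard Message.walk() method instead of explicit recursion.

```python
from email import message_from_string

def collect_bodies(m):
    """Returns (texts, htmls): lists holding one decoded string per body part."""
    texts, htmls = [], []
    for part in m.walk():
        # Containers hold no text themselves, and named parts are attachments:
        if part.is_multipart() or part.get_filename():
            continue
        raw = part.get_payload(decode=True)
        if raw is None:
            continue
        body = raw.decode(part.get_content_charset() or "utf-8", "replace")
        ctype = part.get_content_type()
        if ctype == "text/plain":
            texts.append(body)
        elif ctype == "text/html":
            htmls.append(body)
    return texts, htmls

# Made-up sample message with one plain-text and one HTML alternative:
SAMPLE = """\
From: a@example.com
To: b@example.com
Subject: demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XX"

--XX
Content-Type: text/plain; charset="utf-8"

Hello
--XX
Content-Type: text/html; charset="utf-8"

<p>Hello</p>
--XX--
"""

texts, htmls = collect_bodies(message_from_string(SAMPLE))
print(len(texts), len(htmls))
```

Keeping the parts separate lets the caller decide later whether to join them, show only the first one, or prefer HTML over plain text.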
There may be some errors here, because the code is intended to work with mailbox.Message(), not with email.Message(). I tried it on email.Message() and it worked fine.
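A likely reason it works on both is that mailbox.Message is a subclass of email.message.Message, so header access, is_multipart() and get_payload() are inherited. The following minimal check illustrates this with a made-up one-line message; it is a sketch, not a test of the full code above.

```python
import email.message
import mailbox
from email import message_from_string

plain = message_from_string("Subject: hi\n\nbody\n")  # made-up sample message
wrapped = mailbox.Message(plain)  # copies the parsed message into a mailbox.Message

print(issubclass(mailbox.Message, email.message.Message))

# Both objects answer the calls that pullout() and caption() rely on:
for m in (plain, wrapped):
    print(type(m).__name__, m.is_multipart(), m["subject"], repr(m.get_payload()))
```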
You said you "wish to list them all". From where? If you mean the local mailbox of some nice open-source mailer (mbox, Maildir and similar formats), then you can do it with the mailbox module; for a POP3 server the standard library has poplib. If you want to list them from other programs, then you have a problem. For example, to get mails from MS Outlook, you have to know how to read OLE2 compound files. Other mailers rarely call their messages *.eml files, so I think this is exactly what you would like to do. In that case, search PyPI for the olefile or compoundfiles module and Google around for how to extract an e-mail from an MS Outlook inbox file. Or save yourself the mess and just export the messages from Outlook to some directory. Once you have them as eml files, apply this code.
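The two listing routes mentioned above could be sketched roughly as follows, assuming Python 3: one function walks an mbox file with the standard mailbox module, the other reads every *.eml file in a directory. The function names, the choice of the mbox format and the directory layout are assumptions made for this illustration; the question does not say where the messages live.

```python
import glob
import mailbox
import os
from email import message_from_binary_file

def list_mbox(path):
    """Returns [(key, subject), ...] for every message in an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        return [(key, msg.get("subject", "")) for key, msg in box.iteritems()]
    finally:
        box.close()

def read_eml_dir(directory):
    """Yields (file name, Message) for every *.eml file in directory."""
    # Note: the "*.eml" pattern is case-sensitive on most non-Windows systems.
    for fn in sorted(glob.glob(os.path.join(directory, "*.eml"))):
        with open(fn, "rb") as fp:
            yield os.path.basename(fn), message_from_binary_file(fp)

if __name__ == "__main__":
    # Lists the subject of every exported message in the current directory:
    for name, msg in read_eml_dir("."):
        print(name, msg.get("subject", ""))
```

Each Message object produced this way can be handed to a function such as pullout() or caption() from the answer above.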
Answered by Mike
I found this code much simpler:
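import email
import os

path = './'
listing = os.listdir(path)

for fle in listing:
    if fle.lower().endswith(".eml"):
        # Binary mode plus message_from_binary_file() is what Python 3 needs:
        with open(os.path.join(path, fle), 'rb') as fp:
            msg = email.message_from_binary_file(fp)
        # Only a multipart message carries a list of parts:
        if not msg.is_multipart():
            continue
        for attachment in msg.get_payload():
            fnam = attachment.get_filename()
            data = attachment.get_payload(decode=True)
            # Skip parts without a file name (e.g. the body) or without data:
            if not fnam or data is None:
                continue
            # basename() keeps the attachment inside the current directory:
            with open(os.path.basename(fnam), 'wb') as f:
                f.write(data)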
Answered by IvanTheFirst
Try this:
#!python3
# -*- coding: utf-8 -*-
import email
import os

SOURCE_DIR = 'email'
DEST_DIR = 'temp'

def extractattachements(fle, suffix=None):
    # Parse in binary mode so the file's own encoding does not matter:
    with open(fle, 'rb') as fp:
        message = email.message_from_binary_file(fp)
    filenames = []
    if message.get_content_maintype() == 'multipart':
        for part in message.walk():
            if part.get_content_maintype() == 'multipart': continue
            #if part.get('Content-Disposition') is None: continue
            # Only parts declared as generic binary data are extracted:
            if part.get_content_type() != 'application/octet-stream': continue
            filename = part.get_filename()
            if not filename: continue  # Unnamed part, nothing to save it as
            if suffix:
                # Insert the suffix before the extension: a.bin -> a_suffix.bin
                root, ext = os.path.splitext(filename)
                filename = ''.join([root, '_', suffix, ext])
            filename = os.path.join(DEST_DIR, os.path.basename(filename))
            with open(filename, 'wb') as fb:
                fb.write(part.get_payload(decode=True))
            filenames.append(filename)
    return filenames

def main():
    # Make sure the destination directory exists before writing into it:
    os.makedirs(DEST_DIR, exist_ok=True)
    onlyfiles = [f for f in os.listdir(SOURCE_DIR) if os.path.isfile(os.path.join(SOURCE_DIR, f))]
    for file in onlyfiles:
        #print(os.path.join(SOURCE_DIR, file))
        extractattachements(os.path.join(SOURCE_DIR, file))
    return True

if __name__ == "__main__":
    main()