Python: How to parse the Body from a raw email, given that the raw email does not have a "Body" tag or anything

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/17874360/

Time: 2020-08-19 09:23:23  Source: igfitidea

python email python-2.7 mod-wsgi wsgi

Question

It seems easy to get the From, To, Subject, etc. via:

import email
b = email.message_from_string(a)
bbb = b['from']
ccc = b['to']

assuming that "a" is the raw email string, which looks something like this:

a = """From [email protected] Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
    by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
    for <[email protected]>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root@localhost)
    by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
    Thu, 25 Jul 2013 19:28:59 -0700
From: [email protected]
Subject: oooooooooooooooo
To: [email protected]
Cc: 
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861@a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"

This is a multi-part message in MIME format.

--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo

--bound1374805739--"""

THE QUESTION

How do you get the Body of this email via Python?

So far this is the only code I am aware of, but I have yet to test it.

if email.is_multipart():
    for part in email.get_payload():
        print part.get_payload()
else:
    print email.get_payload()

Is this the correct way?

Or maybe there is something simpler, such as:

import email
b = email.message_from_string(a)
bbb = b['body']

?

Accepted answer by falsetru

Use Message.get_payload:

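import email

b = email.message_from_string(a)
if b.is_multipart():
    # multipart: get_payload() returns a list of sub-messages
    for payload in b.get_payload():
        # if payload.is_multipart(): ...
        print(payload.get_payload())
else:
    # not multipart: get_payload() returns the body itself
    print(b.get_payload())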
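
The commented-out is_multipart() check above hints that a part can itself be multipart. A minimal sketch of handling that with recursion follows. It assumes Python 3; the helper name iter_leaf_payloads and the sample message are made up for illustration and are not part of the original answer.

```python
import email


def iter_leaf_payloads(msg):
    """Yield the payload of every non-multipart part, descending into nested multiparts."""
    if msg.is_multipart():
        for part in msg.get_payload():
            yield from iter_leaf_payloads(part)
    else:
        yield msg.get_payload()


# Hypothetical sample: multipart/mixed wrapping a multipart/alternative
raw = (
    "From: sender@example.com\n"
    "To: rcpt@example.com\n"
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="outer"\n'
    "\n"
    "--outer\n"
    'Content-Type: multipart/alternative; boundary="inner"\n'
    "\n"
    "--inner\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain body\n"
    "--inner\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>html body</p>\n"
    "--inner--\n"
    "--outer--\n"
)

leaves = [p.strip() for p in iter_leaf_payloads(email.message_from_string(raw))]
print(leaves)
```

This returns every leaf payload, attachments included, so it is only a starting point for locating the body.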

Answer by Jimmy Lin

There is no b['body'] in Python: indexing a message looks up a header, and a missing header simply gives None. You have to use get_payload.

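# mailEntity is a parsed message, e.g. email.message_from_string(a)
payload = mailEntity.get_payload()
if isinstance(payload, list):
    # multipart: each item is a sub-message
    for eachPayload in payload:
        # do the things you want here;
        # the real mail body is in eachPayload.get_payload()
        print(eachPayload.get_payload())
else:
    # there is only a single (e.g. text/plain) part,
    # so get_payload() is the body itself
    print(payload)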
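
To illustrate the point about b['body'], here is a small sketch, assuming Python 3 and a made-up two-header message. It shows that subscripting only ever looks up headers, while the body comes from get_payload().

```python
import email

# Hypothetical minimal message: two headers, a blank line, then the body
raw = "From: a@example.com\nSubject: hi\n\nhello world\n"
b = email.message_from_string(raw)

print(b['subject'])     # header lookup is case-insensitive
print(b['body'])        # no "Body" header exists, so this is None
print(b.get_payload())  # the body text of a non-multipart message
```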

Good Luck.

Answer by Todor Minakov

To be reasonably sure you are working with the actual email body (though there is still a chance you are not parsing the right part), you have to skip attachments and focus on the plain or HTML part (depending on your needs) for further processing.

Since attachments can be, and very often are, text/plain or text/html parts themselves, this non-bulletproof sample skips them by checking the Content-Disposition header:

b = email.message_from_string(a)
body = ""

if b.is_multipart():
    for part in b.walk():
        ctype = part.get_content_type()
        cdispo = str(part.get('Content-Disposition'))

        # skip any text/plain (txt) attachments
        if ctype == 'text/plain' and 'attachment' not in cdispo:
            body = part.get_payload(decode=True)  # decode
            break
# not multipart - i.e. plain text, no attachments, keeping fingers crossed
else:
    body = b.get_payload(decode=True)

BTW, walk() iterates marvelously over MIME parts, and get_payload(decode=True) does the dirty work of decoding base64 etc. for you.
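
One caveat worth sketching: under Python 3, get_payload(decode=True) returns bytes, so a further step is needed to get text. The example below is a minimal sketch under that assumption. The sample message is made up, and falling back to UTF-8 when no charset is declared is a choice made here for illustration, not something the answer prescribes.

```python
import base64
import email

# Hypothetical single-part message whose body is base64-encoded UTF-8 text
encoded = base64.b64encode("héllo".encode("utf-8")).decode("ascii")
raw = (
    "From: a@example.com\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + encoded + "\n"
)

part = email.message_from_string(raw)
payload = part.get_payload(decode=True)          # transfer encoding undone, still bytes
charset = part.get_content_charset() or "utf-8"  # assumed fallback when none is declared
text = payload.decode(charset, errors="replace")
print(text)
```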

Some background: as I implied, the wonderful world of MIME email is full of pitfalls that lead to finding the "wrong" message body. In the simplest case the body is the sole text/plain part, and get_payload() is very tempting, but we don't live in a simple world: the body is often wrapped in multipart/alternative, multipart/related, multipart/mixed, etc. content. Wikipedia's MIME article describes this concisely, but since all of the cases below are valid, and common, you have to build in safety nets all around:

Very common: pretty much what you get from a normal editor (Gmail, Outlook) when sending formatted text with an attachment:

multipart/mixed
 |
 +- multipart/related
 |   |
 |   +- multipart/alternative
 |   |   |
 |   |   +- text/plain
 |   |   +- text/html
 |   |      
 |   +- image/png
 |
 +-- application/msexcel

Relatively simple - just alternative representation:

multipart/alternative
 |
 +- text/plain
 +- text/html

For good or bad, this structure is also valid:

multipart/alternative
 |
 +- text/plain
 +- multipart/related
      |
      +- text/html
      +- image/jpeg
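
To see such structures in practice, the sketch below builds a message shaped like the first ("very common") tree with the standard email.mime classes and prints its outline. It assumes Python 3. The outline helper, the placeholder binary part, and the notes.txt attachment are all made up for illustration.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# multipart/alternative holding the two renderings of the body
alt = MIMEMultipart("alternative")
alt.attach(MIMEText("plain body", "plain"))
alt.attach(MIMEText("<p>html body</p>", "html"))

# multipart/related: the body plus an inline resource (placeholder bytes)
related = MIMEMultipart("related")
related.attach(alt)
related.attach(MIMEApplication(b"\x00\x01", "octet-stream"))

# multipart/mixed: everything above plus a text/plain attachment
mixed = MIMEMultipart("mixed")
mixed.attach(related)
attachment = MIMEText("notes", "plain")
attachment.add_header("Content-Disposition", "attachment", filename="notes.txt")
mixed.attach(attachment)


def outline(msg, depth=0):
    """Return the content types of msg and its sub-parts, indented by depth."""
    lines = ["  " * depth + msg.get_content_type()]
    if msg.is_multipart():
        for part in msg.get_payload():
            lines.extend(outline(part, depth + 1))
    return lines


print("\n".join(outline(mixed)))

# walk() visits the same parts depth-first
types = [p.get_content_type() for p in mixed.walk()]

# first text/plain part that is not marked as an attachment
body_part = next(
    p for p in mixed.walk()
    if p.get_content_type() == "text/plain"
    and "attachment" not in str(p.get("Content-Disposition"))
)
body = body_part.get_payload(decode=True)
```

Note that the message contains two text/plain parts; only the Content-Disposition check separates the body from the attachment.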

Hope this helps a bit.

P.S. My point is don't approach email lightly - it bites when you least expect it :)

Answer by Amit Sharma

There is a very good package, mailparser, available for parsing email contents, and it comes with proper documentation.

import mailparser

mail = mailparser.parse_from_file(f)
mail = mailparser.parse_from_file_obj(fp)
mail = mailparser.parse_from_string(raw_mail)
mail = mailparser.parse_from_bytes(byte_mail)

How to Use:

mail.attachments: list of all attachments
mail.body
mail.to

Answer by Ajay Ohri

If emails is a pandas DataFrame and emails.message is the column holding the raw email text:

## Helper functions
def get_text_from_email(msg):
    '''To get the content from email objects'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append( part.get_payload() )
    return ''.join(parts)

def split_email_addresses(line):
    '''To separate multiple email addresses'''
    if line:
        addrs = line.split(',')
        addrs = frozenset(map(lambda x: x.strip(), addrs))
    else:
        addrs = None
    return addrs 

import email
# Parse the emails into a list email objects
messages = list(map(email.message_from_string, emails['message']))
emails.drop('message', axis=1, inplace=True)
# Get fields from parsed email objects
keys = messages[0].keys()
for key in keys:
    emails[key] = [doc[key] for doc in messages]
# Parse content from emails
emails['content'] = list(map(get_text_from_email, messages))
# Split multiple email addresses
emails['From'] = emails['From'].map(split_email_addresses)
emails['To'] = emails['To'].map(split_email_addresses)

# Extract the root of 'file' as 'user'
emails['user'] = emails['file'].map(lambda x:x.split('/')[0])
del messages

emails.head()

Answer by Deepesh Verma

Here's the code that works for me every time (for Outlook emails):

# Read the subject and body of emails in a folder (or subfolder)

import win32com.client  # import package

# create the MAPI namespace object
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

# get to the desired folder ([email protected] is my root folder;
# 'Inbox' and 'SubFolderName' are the subfolders)
root_folder = outlook.Folders['[email protected]'].Folders['Inbox'].Folders['SubFolderName']

messages = root_folder.Items

for message in messages:
    if message.Unread:  # process only unread emails
        subject_content = message.Subject  # subject line of the mail
        body_content = message.Body        # body of the mail

        print(subject_content)
        print(body_content)

        message.Unread = False  # mark the mail as read

Answer by Doctor J

Python 3.6+ provides built-in convenience methods to find and decode the plain text body, as in @Todor Minakov's answer. You can use the EmailMessage.get_body() and get_content() methods:

import email
import email.policy

# s is the raw email string
msg = email.message_from_string(s, policy=email.policy.default)
body = msg.get_body(('plain',))
if body:
    body = body.get_content()
print(body)

Note that this will give None if there is no (obvious) plain text body part.
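
A minimal sketch of that None case, and one possible way around it, assuming Python 3.6+ and a made-up HTML-only message. Widening the preference list to ('plain', 'html') is just one option for illustration, not something the answer prescribes.

```python
import email
import email.policy

# Hypothetical message that has an HTML body only
raw = (
    "From: a@example.com\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<p>only html</p>\n"
)
msg = email.message_from_string(raw, policy=email.policy.default)

plain = msg.get_body(preferencelist=("plain",))            # None: no text/plain part
fallback = msg.get_body(preferencelist=("plain", "html"))  # settles for text/html

print(plain)
print(fallback.get_content_type())
print(fallback.get_content())
```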

If you are reading from e.g. an mbox file, you can give the mailbox constructor an EmailMessage factory:

import email.policy
import mailbox

mbox = mailbox.mbox(mboxfile, factory=lambda f: email.message_from_binary_file(f, policy=email.policy.default), create=False)
for msg in mbox:
    ...

Note that you must pass email.policy.default as the policy, since it's not the default (the actual default is email.policy.compat32)...
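
The difference is easy to check. The sketch below, assuming Python 3.6+ and a made-up one-line message, parses the same text with and without the policy argument and compares the resulting classes.

```python
import email
import email.policy
from email.message import EmailMessage, Message

raw = "Subject: hi\n\nbody\n"

legacy = email.message_from_string(raw)                               # compat32 policy
modern = email.message_from_string(raw, policy=email.policy.default)  # newer API

print(type(legacy).__name__, hasattr(legacy, "get_body"))
print(type(modern).__name__, hasattr(modern, "get_body"))
```

Without the policy argument the parser returns a plain Message, which has no get_body() method.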