C#: Parse email content from quoted reply
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/278788/
Parse email content from quoted reply
Asked by VanOrman
I'm trying to figure out how to parse out the text of an email from any quoted reply text that it might include. I've noticed that usually email clients will put an "On such and such date so and so wrote" or prefix the lines with an angle bracket. Unfortunately, not everyone does this. Does anyone have any idea on how to programmatically detect reply text? I am using C# to write this parser.
Answer by 3Doubloons
There is no universal indicator of a reply in an e-mail. The best you can do is try to catch the most common patterns and add new ones as you come across them.
Keep in mind that some people insert replies inside the quoted text (My boss for example answers questions on the same line as I asked them) so whatever you do, you might lose some information you would have liked to keep.
Answer by VanOrman
I did a lot more searching on this and here's what I've found. There are basically two situations under which you are doing this: when you have the entire thread and when you don't. I'll break it up into those two categories:
When you have the thread:
If you have the entire series of emails, you can achieve a very high level of assurance that what you are removing is actually quoted text. There are two ways to do this. One, you could use the message's Message-ID, In-Reply-To ID, and Thread-Index to determine the individual message, its parent, and the thread it belongs to. For more information on this, see RFC 822, RFC 2822, this interesting article on threading, or this article on threading. Once you have re-assembled the thread, you can then remove the external text (such as the To, From, and CC lines) and you're done.
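As a rough illustration of the header-based approach, here is a minimal sketch that reads Message-ID and In-Reply-To from raw header blocks and records each message's parent. The class and method names (`ThreadSketch`, `BuildParentMap`) are hypothetical, and the sketch assumes unfolded, single-line headers; it ignores Thread-Index and References entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ThreadSketch
{
    // Returns the first <...> value of the named header, or null if it is absent.
    // Assumption: the header is not folded across several lines.
    private static string GetHeader(string rawHeaders, string name)
    {
        var match = Regex.Match(
            rawHeaders,
            "^" + Regex.Escape(name) + @":\s*(<[^>\r\n]+>)",
            RegexOptions.IgnoreCase | RegexOptions.Multiline);
        return match.Success ? match.Groups[1].Value : null;
    }

    // Maps each message's ID to its parent's ID (null for a thread root).
    public static Dictionary<string, string> BuildParentMap(IEnumerable<string> rawHeaderBlocks)
    {
        var parents = new Dictionary<string, string>();
        foreach (var headers in rawHeaderBlocks)
        {
            var id = GetHeader(headers, "Message-ID");
            if (id == null) continue; // a message without an ID cannot be placed
            parents[id] = GetHeader(headers, "In-Reply-To");
        }
        return parents;
    }
}
```

With the parent map in hand, you can walk from any reply up to the root and compare its body against the bodies of its ancestors.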
If the messages you are working with do not have the headers, you can instead use similarity matching to determine which parts of an email are the quoted reply text, that is, the text repeated from an earlier message. In this case you might want to look into a Levenshtein distance algorithm, such as this one on Code Project or this one.
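For reference, a minimal sketch of Levenshtein distance with a normalized similarity score follows. This is the textbook two-row dynamic programming version, not the code from either linked article; the names `SimilaritySketch`, `Levenshtein`, and `Similarity` are hypothetical.

```csharp
using System;

public static class SimilaritySketch
{
    // Number of single-character insertions, deletions, and substitutions
    // needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (var j = 0; j <= b.Length; j++) previous[j] = j;

        for (var i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (var j = 1; j <= b.Length; j++)
            {
                var cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            var swap = previous; previous = current; current = swap;
        }
        return previous[b.Length];
    }

    // Similarity in [0, 1]; 1 means the strings are identical.
    public static double Similarity(string a, string b)
    {
        var longest = Math.Max(a.Length, b.Length);
        return longest == 0 ? 1.0 : 1.0 - (double)Levenshtein(a, b) / longest;
    }
}
```

One way to use it is to compare each line of the reply against the lines of earlier messages and treat lines above some similarity threshold as quoted; the right threshold is something you would have to tune on your own mail.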
No matter what, if you're interested in the threading process, check out this great PDF on reassembling email threads.
When you don't have the thread:
If you are stuck with only one message from the thread, you're going to have to try to guess what the quote is. In that case, here are the different quotation methods I have seen:
- A line (as seen in Outlook)
- Angle brackets
- "---Original Message---"
- "On such-and-such day, so-and-so wrote:"
Remove the text from there down and you're done. The downside to any of these is that they all assume that the sender put their reply on top of the quoted text and did not interleave it (as was the old style on the internet). If that happens, good luck. I hope this helps some of you out there!
Answer by Oleg Yaroshevych
First of all, this is a tricky task.
You should collect typical responses from different e-mail clients and prepare correct regular expressions (or whatever) to parse them. I've collected responses from Outlook, Thunderbird, Gmail, Apple Mail and mail.ru.
I am using regular expressions to parse the response in the following manner: if one expression does not match, I try the next one.
// Verbatim strings (@"...") pass \s through to the regex engine; "\s" in a regular C# string literal does not compile.
new Regex(@"From:\s*" + Regex.Escape(_mail), RegexOptions.IgnoreCase);
new Regex("<" + Regex.Escape(_mail) + ">", RegexOptions.IgnoreCase);
new Regex(Regex.Escape(_mail) + @"\s+wrote:", RegexOptions.IgnoreCase);
new Regex(@"\n.*On.*(\r\n)?wrote:\r\n", RegexOptions.IgnoreCase | RegexOptions.Multiline);
new Regex(@"-+original\s+message-+\s*$", RegexOptions.IgnoreCase);
new Regex(@"from:\s*$", RegexOptions.IgnoreCase);
To remove any remaining quoted lines at the end:
new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline);
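A minimal sketch of applying such a pattern with `Regex.Replace` is shown here. The helper name `RemoveQuotedLines` is hypothetical, and the pattern is a slight variation on the one above: the trailing `\n?` is my addition so that removed lines do not leave blank lines behind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteStripSketch
{
    // Removes every line that starts with '>' and trims the result.
    public static string RemoveQuotedLines(string text)
    {
        // With Multiline, ^ matches at each line start; .* runs to the end of
        // the line (including any '\r'), and \n? consumes the line break.
        var quoted = new Regex(@"^>.*\n?", RegexOptions.Multiline);
        return quoted.Replace(text, string.Empty).Trim();
    }
}
```

Note that this also removes interleaved quotes in the middle of a message, which may or may not be what you want.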
Here is my small collection of test responses (samples separated by ----):
From: [email protected] [mailto:[email protected]]
Sent: Tuesday, January 13, 2009 1:27 PM
----
2008/12/26 <[email protected]>
> text
----
[email protected] wrote:
> text
----
[email protected] wrote: text
text
----
2009/1/13 <[email protected]>
> text
----
[email protected] wrote: text
text
----
2009/1/13 <[email protected]>
> text
> text
----
2009/1/13 <[email protected]>
> text
> text
----
[email protected] wrote:
> text
> text
<response here>
----
--- On Fri, 23/1/09, [email protected] <[email protected]> wrote:
> text
> text
Best regards, Oleg Yaroshevych
Answer by Eric R. Rath
If you control the original message (e.g. notifications from a web application), you can put a distinct, identifiable header in place, and use that as the delimiter for the original post.
Answer by hurshagrawal
Thank you, Goleg, for the regexes! Really helped. This isn't C#, but for the googlers out there, here's my Ruby parsing script:
def extract_reply(text, address)
  # Single-quoted strings keep \s as a regex escape; in a double-quoted Ruby string "\s" is a literal space.
  regex_arr = [
    Regexp.new('From:\s*' + Regexp.escape(address), Regexp::IGNORECASE),
    Regexp.new('<' + Regexp.escape(address) + '>', Regexp::IGNORECASE),
    Regexp.new(Regexp.escape(address) + '\s+wrote:', Regexp::IGNORECASE),
    Regexp.new('^.*On.*(\n)?wrote:$', Regexp::IGNORECASE),
    Regexp.new('-+original\s+message-+\s*$', Regexp::IGNORECASE),
    Regexp.new('from:\s*$', Regexp::IGNORECASE)
  ]
  text_length = text.length
  # Find the position of the match closest to the top of the message
  index = regex_arr.inject(text_length) do |min, regex|
    [(text.index(regex) || text_length), min].min
  end
  text[0, index].strip
end
It's worked pretty well so far.
Answer by superluminary
By far the easiest way to do this is by placing a marker in your content, such as:
--- Please reply above this line ---
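A minimal sketch of cutting the message at such a marker follows. The names `MarkerSketch` and `TextAboveMarker` are hypothetical. Because mail clients usually prefix the quoted marker with "> ", the sketch cuts at the start of the line containing the marker rather than at the marker itself.

```csharp
using System;

public static class MarkerSketch
{
    public const string Marker = "--- Please reply above this line ---";

    // Returns everything above the line that contains the marker,
    // or the whole (trimmed) body if the marker is missing.
    public static string TextAboveMarker(string body)
    {
        var index = body.IndexOf(Marker, StringComparison.OrdinalIgnoreCase);
        if (index < 0) return body.Trim();

        // Back up to the start of the marker's line (0 if it is the first line).
        var lineStart = body.LastIndexOf('\n', index) + 1;
        return body.Substring(0, lineStart).Trim();
    }
}
```

An attribution line such as "On ... wrote:" may still sit just above the marker, so you would likely combine this with the regex approach from the other answers.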
As you have no doubt noticed, parsing out quoted text is not a trivial task as different email clients quote text in different ways. To solve this problem properly you need to account for and test in every email client.
Facebook can do this, but unless your project has a big budget, you probably can't.
Oleg has solved the problem using regexes to find the "On 13 Jul 2012, at 13:09, xxx wrote:" text. However, if the user deletes this text, or replies at the bottom of the email, as many people do, this solution will not work.
Likewise, if the email client uses a different date string, or doesn't include a date string at all, the regex will fail.
Answer by Austin
Here is my C# version of @hurshagrawal's Ruby code. I don't know Ruby really well so it could be off, but I think I got it right.
// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public string ExtractReply(string text, string address)
{
    // Verbatim strings (@"...") pass \s through to the regex engine; "\s" is not a valid C# escape.
    var regexes = new List<Regex>
    {
        new Regex(@"From:\s*" + Regex.Escape(address), RegexOptions.IgnoreCase),
        new Regex("<" + Regex.Escape(address) + ">", RegexOptions.IgnoreCase),
        new Regex(Regex.Escape(address) + @"\s+wrote:", RegexOptions.IgnoreCase),
        new Regex(@"\n.*On.*(\r\n)?wrote:\r\n", RegexOptions.IgnoreCase | RegexOptions.Multiline),
        new Regex(@"-+original\s+message-+\s*$", RegexOptions.IgnoreCase),
        new Regex(@"from:\s*$", RegexOptions.IgnoreCase),
        new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline)
    };

    // Cut at the earliest position where any pattern matches.
    var index = text.Length;
    foreach (var regex in regexes)
    {
        var match = regex.Match(text);
        if (match.Success && match.Index < index)
            index = match.Index;
    }
    return text.Substring(0, index).Trim();
}
Answer by Amit M
This is a good solution. Found it after searching for so long.
One addition: as mentioned above, this has to be handled case by case. The expressions above did not correctly parse my Gmail and Outlook 2010 responses, so I added the following two regexes. Let me know if you run into any issues.
//Works for Gmail
new Regex("\n.*On.*<(\r\n)?" + Regex.Escape(address) + "(\r\n)?>", RegexOptions.IgnoreCase),
//Works for Outlook 2010
new Regex("From:.*" + Regex.Escape(address), RegexOptions.IgnoreCase),
Cheers
Answer by Eric Huang
This is an old post, but in case you are not aware, GitHub has a Ruby library for extracting the reply. If you use .NET, I have a .NET one at https://github.com/EricJWHuang/EmailReplyParser
Answer by Paul Mendoza
If you use SigParser.com's API, it will give you an array of all the broken out emails in a reply chain from a single email text string. So if there are 10 emails, you'll get the text for all 10 of the emails.
You can view the detailed API spec here.