Note: this page reproduces a popular StackOverflow question and its answers (originally presented alongside a Chinese translation) under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/12896/

Time: 2020-08-24 21:10:54  Source: igfitidea

parsing raw email in php

php email

Asked by Uberfuzzy

I'm looking for good/working/simple to use php code for parsing raw email into parts.

I've written a couple of brute force solutions, but every time, one small change/header/space/something comes along and my whole parser fails and the project falls apart.

And before I get pointed at PEAR/PECL, I need actual code. My host has some screwy config or something, I can never seem to get the .so's to build right. If I do get the .so made, some difference in path/environment/php.ini doesn't always make it available (apache vs cron vs cli).

Oh, and one last thing, I'm parsing the raw email text, NOT POP3, and NOT IMAP. It's being piped into the php script via a .qmail email redirect.

I'm not expecting SOF to write it for me, I'm looking for some tips/starting points on doing it "right". This is one of those "wheel" problems that I know has already been solved.

Accepted answer by jj33

What are you hoping to end up with? The body, the subject, the sender, an attachment? You should spend some time with RFC 2822 to understand the format of the mail, but here are the simplest rules for well-formed email:

HEADERS\n
\n
BODY

That is, the first blank line (double newline) is the separator between the HEADERS and the BODY. A HEADER looks like this:

HSTRING:HTEXT

HSTRING always starts at the beginning of a line and doesn't contain any white space or colons. HTEXT can contain a wide variety of text, including newlines as long as the newline char is followed by whitespace.

The "BODY" is really just any data that follows the first double newline. (There are different rules if you are transmitting mail via SMTP, but processing it over a pipe you don't have to worry about that).

So, in really simple, circa-1982 RFC 822 terms, an email looks like this:

HEADER: HEADER TEXT
HEADER: MORE HEADER TEXT
  INCLUDING A LINE CONTINUATION
HEADER: LAST HEADER

THIS IS ANY
ARBITRARY DATA
(FOR THE MOST PART)
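
The rules above (split at the first blank line, unfold continuation lines, split each header at the first colon) can be sketched in a few lines of PHP. This is only a minimal illustration, not a full RFC 2822 parser: `parseRawEmail()` is a made-up helper name, and it does no MIME or charset decoding.

```php
<?php
// Minimal sketch: split a raw message into unfolded headers and a body.
// parseRawEmail() is a hypothetical helper name, not part of any library.
function parseRawEmail(string $raw): array
{
    // Normalise CRLF and bare CR line endings to LF so one separator works.
    $raw = preg_replace('/\r\n?/', "\n", $raw);

    // The first blank line separates the headers from the body.
    $parts = explode("\n\n", $raw, 2);
    $headerBlock = $parts[0];
    $body = $parts[1] ?? '';

    // Unfold continuation lines: a newline followed by whitespace
    // belongs to the previous header.
    $headerBlock = preg_replace('/\n[ \t]+/', ' ', $headerBlock);

    $headers = [];
    foreach (explode("\n", $headerBlock) as $line) {
        // HSTRING is everything before the first colon; lines without
        // a valid name (for example an mbox "From " line) are skipped.
        if (!preg_match('/^([^\s:]+):[ \t]*(.*)$/', $line, $m)) {
            continue;
        }
        // Header names are case-insensitive and some (Received) repeat,
        // so every name maps to a list of values.
        $headers[strtolower($m[1])][] = $m[2];
    }

    return ['headers' => $headers, 'body' => $body];
}

$raw = "Subject: Hello\n"
     . "X-Long: first part\n"
     . "  second part\n"
     . "From: alice@example.com\n"
     . "\n"
     . "This is the body.\n\nSecond paragraph.\n";

$mail = parseRawEmail($raw);
echo $mail['headers']['subject'][0], "\n";
echo $mail['headers']['x-long'][0], "\n";
```

Keeping a list per header name is a deliberate choice here: it costs an extra `[0]` for single-valued headers but does not silently drop repeated ones.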

Most modern email is more complex than that, though. Headers can be encoded for charsets or RFC 2047 MIME words, or a ton of other stuff I'm not thinking of right now. The bodies are really hard to roll your own code for these days if you want them to be meaningful. Almost all email that's generated by an MUA will be MIME encoded. That might be uuencoded text, it might be HTML, it might be a uuencoded Excel spreadsheet.
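
As an illustration of the RFC 2047 point, here is a small sketch of decoding an encoded-word header value with PHP's built-in `iconv_mime_decode()`. It assumes the iconv extension is loaded; the sample values are invented for the example.

```php
<?php
// Sketch: decode RFC 2047 "encoded words" found in header values.
// Assumes the iconv extension is available.
$qEncoded = '=?UTF-8?Q?Caf=C3=A9?=';   // quoted-printable style ("Q")
$bEncoded = '=?UTF-8?B?Q2Fmw6k=?=';    // base64 style ("B")

// CONTINUE_ON_ERROR makes the function do its best on malformed
// input instead of giving up on the whole header.
$fromQ = iconv_mime_decode($qEncoded, ICONV_MIME_DECODE_CONTINUE_ON_ERROR, 'UTF-8');
$fromB = iconv_mime_decode($bEncoded, ICONV_MIME_DECODE_CONTINUE_ON_ERROR, 'UTF-8');

echo $fromQ, "\n";
echo $fromB, "\n";
```

If mbstring is available, `mb_decode_mimeheader()` is an alternative for the same job.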

I hope this helps provide a framework for understanding some of the very elemental buckets of email. If you provide more background on what you are trying to do with the data I (or someone else) might be able to provide better direction.

Answer by dan

Try the Plancake PHP Email parser: https://github.com/plancake/official-library-php-email-parser

I have used it for my projects. It works great, it is just one class and it is open source.

Answer by Carter Cole

I cobbled this together, some code isn't mine but I don't know where it came from... I later adopted the more robust "MimeMailParser" but this works fine, I pipe my default email to it using cPanel and it works great.

#!/usr/bin/php -q
<?php
// Config
$dbuser = 'emlusr';
$dbpass = 'pass';
$dbname = 'email';
$dbhost = 'localhost';
$notify= '[email protected]'; // an email address required in case of errors
function mailRead($iKlimit = "") 
    { 
        // Purpose: 
        //   Reads piped mail from STDIN 
        // 
        // Arguments: 
        //   $iKlimit (integer, optional): specifies after how many kilobytes reading of mail should stop 
        //   Defaults to 1024k if no value is specified 
        //     A value of -1 will cause reading to continue until the entire message has been read 
        // 
        // Return value: 
        //   A string containing the entire email, headers, body and all. 

        // Variable preparation         
            // Set default limit of 1024k if no limit has been specified 
            if ($iKlimit == "") { 
                $iKlimit = 1024; 
            } 

            // Error strings 
            $sErrorSTDINFail = "Error - failed to read mail from STDIN!"; 

        // Attempt to connect to STDIN 
        $fp = fopen("php://stdin", "r"); 

        // Failed to connect to STDIN? (shouldn't really happen) 
        if (!$fp) { 
            echo $sErrorSTDINFail; 
            exit(); 
        } 

        // Create empty string for storing message 
        $sEmail = ""; 

        // Read message up until limit (if any) 
        $i_limit = 0; // number of 1 KB chunks read so far
        if ($iKlimit == -1) { 
            while (!feof($fp)) { 
                $sEmail .= fread($fp, 1024); 
            }                     
        } else { 
            while (!feof($fp) && $i_limit < $iKlimit) { 
                $sEmail .= fread($fp, 1024); 
                $i_limit++; 
            }         
        } 

        // Close connection to STDIN 
        fclose($fp); 

        // Return message 
        return $sEmail; 
    }  
$email = mailRead();

// handle email
$lines = explode("\n", $email);

// empty vars
$from = "";
$to = "";
$subject = "";
$headers = "";
$message = "";
$splittingheaders = true;
for ($i=0; $i < count($lines); $i++) {
    if ($splittingheaders) {
        // this is a header
        $headers .= $lines[$i]."\n";

        // look out for special headers
        if (preg_match("/^Subject: (.*)/", $lines[$i], $matches)) {
            $subject = $matches[1];
        }
        if (preg_match("/^From: (.*)/", $lines[$i], $matches)) {
            $from = $matches[1];
        }
        if (preg_match("/^To: (.*)/", $lines[$i], $matches)) {
            $to = $matches[1];
        }
    } else {
        // not a header, but message
        $message .= $lines[$i]."\n";
    }

    if (trim($lines[$i])=="") {
        // empty line, header section has ended
        $splittingheaders = false;
    }
}

// The mysql_* functions were removed in PHP 7, so this uses mysqli with a
// prepared statement (which also replaces the manual escaping).
mysqli_report(MYSQLI_REPORT_OFF); // report failures through return values
if ($conn = @mysqli_connect($dbhost, $dbuser, $dbpass, $dbname)) {
  $stmt = mysqli_prepare($conn, "INSERT INTO email_log (`to`,`from`,`subject`,`headers`,`message`,`source`) VALUES (?,?,?,?,?,?)");
  if (!$stmt
      || !mysqli_stmt_bind_param($stmt, 'ssssss', $to, $from, $subject, $headers, $message, $email)
      || !mysqli_stmt_execute($stmt)
      || mysqli_stmt_affected_rows($stmt) < 1) {
    mail($notify,'Email Logger Error',"There was an error inserting into the email logger database.\n\n".mysqli_error($conn));
  }
} else {
  // Covers both a failed connection and a failed database selection.
  mail($notify,'Email Logger Error',"There was an error connecting the email logger database.\n\n".mysqli_connect_error());
}
?>
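
As a side note, the chunked read loop in the script above can probably be written more compactly with `stream_get_contents()`, which accepts a byte limit directly. The sketch below is an alternative, not the author's code: `readMessage()` is a hypothetical helper, and an in-memory stream stands in for STDIN so the example runs on its own.

```php
<?php
// Sketch: read at most $maxBytes of a piped message from any stream.
// readMessage() is a hypothetical helper, not part of the script above.
function readMessage($stream, int $maxBytes = 1048576): string
{
    $data = stream_get_contents($stream, $maxBytes);
    if ($data === false) {
        throw new RuntimeException('Failed to read the message');
    }
    return $data;
}

// In a pipe script you would call: $email = readMessage(STDIN);
// Here an in-memory stream stands in for STDIN.
$demo = fopen('php://memory', 'w+');
fwrite($demo, "Subject: test\n\nbody");
rewind($demo);
$email = readMessage($demo, 13); // limit of 13 bytes for the demonstration
fclose($demo);
```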

Answer by Yaroslav

There is a library for parsing a raw email message into a PHP array: http://flourishlib.com/api/fMailbox#parseMessage.

The static method parseMessage() can be used to parse a full MIME email message into the same format that fetchMessage() returns, minus the uid key.

$parsed_message = fMailbox::parseMessage(file_get_contents('/path/to/email'));

Here is an example of a parsed message:

array(
    'received' => '28 Apr 2010 22:00:38 -0400',
    'headers'  => array(
        'received' => array(
            0 => '(qmail 25838 invoked from network); 28 Apr 2010 22:00:38 -0400',
            1 => 'from example.com (HELO ?192.168.10.2?) (example) by example.com with (DHE-RSA-AES256-SHA encrypted) SMTP; 28 Apr 2010 22:00:38 -0400'
        ),
        'message-id' => '<[email protected]>',
        'date' => 'Wed, 28 Apr 2010 21:59:49 -0400',
        'from' => array(
            'personal' => 'Will Bond',
            'mailbox'  => 'tests',
            'host'     => 'flourishlib.com'
        ),
        'user-agent'   => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100317 Thunderbird/3.0.4',
        'mime-version' => '1.0',
        'to' => array(
            0 => array(
                'mailbox' => 'tests',
                'host'    => 'flourishlib.com'
            )
        ),
        'subject' => 'This message is encrypted'
    ),
    'text'      => 'This message is encrypted',
    'decrypted' => TRUE,
    'uid'       => 15
);

Answer by TomaszKane

This library, https://github.com/zbateson/MailMimeParser, works for me and doesn't need the mailparse extension.

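<?php
// Setup lines are an assumption based on the library's README; check your installed version.
require 'vendor/autoload.php';
$parser  = new \ZBateson\MailMimeParser\MailMimeParser();
$handle  = fopen('php://stdin', 'r');           // raw message piped to the script
$message = $parser->parse($handle, false);      // second argument is required from 2.0 on
fclose($handle);

echo $message->getHeaderValue('from');          // [email protected]
echo $message
    ->getHeader('from')
    ->getPersonName();                          // Person Name
echo $message->getHeaderValue('subject');       // The email's subject

echo $message->getTextContent();                // or getHtmlContent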

Answer by postfuturist

The PEAR library Mail_mimeDecode is written in plain PHP; you can read it here: Mail_mimeDecode source

Answer by zuups

There are Mailparse functions you could try: http://php.net/manual/en/book.mailparse.php. They are not part of the default PHP configuration, however (mailparse is a PECL extension).

Answer by astateful

Parsing email in PHP isn't an impossible task. What I mean is, you don't need a team of engineers to do it; it is attainable as an individual. Really the hardest part I found was creating the FSM for parsing an IMAP BODYSTRUCTURE result. Nowhere on the Internet had I seen this so I wrote my own. My routine basically creates an array of nested arrays from the command output, and the depth one is at in the array roughly corresponds to the part number(s) needed to perform the lookups. So it handles the nested MIME structures quite gracefully.
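
To give a rough idea of what such a routine can look like, here is a small stack-based scanner that turns a parenthesised IMAP list into nested PHP arrays. It is an illustrative sketch, not the routine described above: `parseImapList()` is a made-up name, the sample response is invented, and IMAP literals (`{n}` followed by raw bytes) are deliberately not handled.

```php
<?php
// Sketch: turn an IMAP parenthesised list (as in a BODYSTRUCTURE response)
// into nested PHP arrays. parseImapList() is a hypothetical name.
function parseImapList(string $input): array
{
    $stack = [];      // lists that are still open, outermost first
    $current = [];    // the list being filled right now
    $len = strlen($input);
    $i = 0;

    while ($i < $len) {
        $ch = $input[$i];
        if ($ch === '(') {
            // Open a nested list: remember the parent and start fresh.
            $stack[] = $current;
            $current = [];
            $i++;
        } elseif ($ch === ')') {
            if (!$stack) {
                throw new InvalidArgumentException("Unbalanced ')' at offset $i");
            }
            // Close the list: append it to its parent.
            $parent = array_pop($stack);
            $parent[] = $current;
            $current = $parent;
            $i++;
        } elseif ($ch === '"') {
            // Quoted string; a backslash escapes the next character.
            $value = '';
            $i++;
            while ($i < $len && $input[$i] !== '"') {
                if ($input[$i] === '\\' && $i + 1 < $len) {
                    $i++;
                }
                $value .= $input[$i];
                $i++;
            }
            if ($i >= $len) {
                throw new InvalidArgumentException('Unterminated quoted string');
            }
            $i++; // skip the closing quote
            $current[] = $value;
        } elseif (ctype_space($ch)) {
            $i++;
        } else {
            // Atom: NIL, a number, or a bare word.
            $n = strcspn($input, " \t\r\n()\"", $i);
            $atom = substr($input, $i, $n);
            $i += $n;
            if (strtoupper($atom) === 'NIL') {
                $current[] = null;
            } elseif (ctype_digit($atom)) {
                $current[] = (int) $atom;
            } else {
                $current[] = $atom;
            }
        }
    }
    if ($stack) {
        throw new InvalidArgumentException("Unbalanced '('");
    }
    return $current;
}

$response = '(("TEXT" "PLAIN" ("CHARSET" "UTF-8") NIL NIL "7BIT" 12 1)'
          . ' ("TEXT" "HTML" ("CHARSET" "UTF-8") NIL NIL "QUOTED-PRINTABLE" 40 2)'
          . ' "ALTERNATIVE")';
$tree = parseImapList($response);
$structure = $tree[0];          // the outermost list
echo $structure[0][1], "\n";    // subtype of part 1
echo $structure[1][5], "\n";    // transfer encoding of part 2
```

The explicit stack is what lets this cope with arbitrary nesting; in this layout the child at index 0 corresponds to part 1, index 1 to part 2, and so on.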

The problem is that PHP's default imap_* functions don't provide much granularity...so I had to open a socket to the IMAP port and write the functions to send and retrieve the necessary information (IMAP FETCH 1 BODY.PEEK[1.2] for example), and that involves looking at the RFC documentation.

The encoding of the data (quoted-printable, base64, 7bit, 8bit, etc.), length of the message, content-type, etc. is all provided to you; for attachments, text, html, etc. You may have to figure out the nuances of your mail server as well since not all fields are always implemented 100%.
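
Once you know a part's transfer encoding, undoing it is the easy bit, since PHP ships the decoders. A minimal sketch, where `decodeTransferEncoding()` is a hypothetical helper name and charset conversion is left out:

```php
<?php
// Sketch: undo a part's Content-Transfer-Encoding.
// decodeTransferEncoding() is a hypothetical helper name.
function decodeTransferEncoding(string $data, string $encoding): string
{
    switch (strtolower(trim($encoding))) {
        case 'base64':
            // Non-strict mode skips the line breaks inside the encoded data.
            return base64_decode($data);
        case 'quoted-printable':
            return quoted_printable_decode($data);
        default:
            // 7bit, 8bit and binary mean the data is stored as-is.
            return $data;
    }
}

$plain  = decodeTransferEncoding("SGVsbG8s\r\nIHdvcmxkIQ==", 'base64');
$quoted = decodeTransferEncoding("Caf=C3=A9 =\r\nau lait", 'Quoted-Printable');
echo $plain, "\n";
echo $quoted, "\n";
```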

The gem is the FSM... if you have a background in Comp Sci it can be really, really fun to make this (the key is that brackets are not a regular grammar ;)); otherwise it will be a struggle and/or result in ugly code, using traditional methods. Also, you need some time!

Hope this helps!

Answer by jj33

You're probably not going to have much fun writing your own MIME parser. The reason you are finding "overdeveloped mail handling packages" is because MIME is a really complex set of rules/formats/encodings. MIME parts can be recursive, which is part of the fun. I think your best bet is to write the best MIME handler you can, parse a message, throw away everything that's not text/plain or text/html, and then force the command in the incoming string to be prefixed with COMMAND: or something similar so that you can find it in the muck. If you start with rules like that you have a decent chance of handling new providers, but you should be ready to tweak if a new provider comes along (or heck, if your current provider chooses to change their messaging architecture).
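
To make that approach a bit more concrete, here is a rough sketch: walk the MIME tree recursively, keep only the text/plain and text/html leaves, then look for a line starting with COMMAND:. All function names are hypothetical, and transfer decoding, charsets and plenty of edge cases are left out, so treat it as a starting point rather than a real MIME handler.

```php
<?php
// Sketch only: recursive MIME walk that keeps text parts and finds "COMMAND:".
// splitEntity(), collectText() and findCommand() are hypothetical names.

// Split one entity into [lower-cased header map, body].
function splitEntity(string $raw): array
{
    $raw = preg_replace('/\r\n?/', "\n", $raw);
    if ($raw !== '' && $raw[0] === "\n") {
        return [[], substr($raw, 1)];                    // entity without headers
    }
    $parts = explode("\n\n", $raw, 2);
    $head = preg_replace('/\n[ \t]+/', ' ', $parts[0]);  // unfold continuations
    $headers = [];
    foreach (explode("\n", $head) as $line) {
        if (strpos($line, ':') !== false) {
            [$name, $value] = explode(':', $line, 2);
            $headers[strtolower(trim($name))] = trim($value);
        }
    }
    return [$headers, $parts[1] ?? ''];
}

// Return the bodies of all text/plain and text/html leaf parts.
function collectText(string $raw): array
{
    [$headers, $body] = splitEntity($raw);
    $type = $headers['content-type'] ?? 'text/plain';    // default per RFC 2045

    if (preg_match('~^multipart/~i', $type)
        && preg_match('/boundary="?([^";]+)"?/i', $type, $m)) {
        $open = '--' . $m[1];       // the closing delimiter is $open . '--'
        $texts = [];
        $current = null;            // null while we are in the preamble
        foreach (explode("\n", $body) as $line) {
            $trimmed = rtrim($line);
            if ($trimmed !== $open && $trimmed !== $open . '--') {
                if ($current !== null) {
                    $current[] = $line;
                }
                continue;
            }
            if ($current !== null) {
                // A delimiter ends the current part; parts may nest, so recurse.
                $texts = array_merge($texts, collectText(implode("\n", $current)));
            }
            if ($trimmed === $open . '--') {
                break;              // everything after this is epilogue
            }
            $current = [];
        }
        return $texts;
    }
    // Throw away everything that is not text/plain or text/html.
    return preg_match('~^text/(plain|html)\b~i', $type) ? [$body] : [];
}

// Find the first "COMMAND: ..." line in any text part.
function findCommand(string $raw): ?string
{
    foreach (collectText($raw) as $text) {
        if (preg_match('/^COMMAND:[ \t]*(.+)$/m', $text, $m)) {
            return trim($m[1]);
        }
    }
    return null;
}

$raw = implode("\r\n", [
    'From: alice@example.com',
    'Content-Type: multipart/mixed; boundary="outer"',
    '',
    'preamble, ignored',
    '--outer',
    'Content-Type: multipart/alternative; boundary="inner"',
    '',
    '--inner',
    'Content-Type: text/plain; charset=utf-8',
    '',
    'Hi there,',
    'COMMAND: restart',
    '--inner',
    'Content-Type: text/html',
    '',
    '<p>COMMAND: restart</p>',
    '--inner--',
    '--outer',
    'Content-Type: application/octet-stream',
    '',
    'AAECAw==',
    '--outer--',
    '',
]);

$texts = collectText($raw);
$command = findCommand($raw);
```

A dedicated parser library will handle far more real-world mail than this, which is rather the point of the paragraph above.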

Answer by Polsonby

I'm not sure if this will be of help to you - hope so - but it will surely help others interested in finding out more about email. Marcus Bointon did one of the best presentations, entitled "Mail() and life after Mail()", at the PHP London conference in March this year, and the slides and MP3 are online. He speaks with some authority, having worked extensively with email and PHP at a deep level.

My perception is that you are in for a world of pain trying to write a truly generic parser.

EDIT - The files seem to have been removed from the PHP London site; found the slides on Marcus' own site: Part 1, Part 2. Couldn't see the MP3 anywhere, though.
