How to read or parse MHTML (.mht) files in Java

Question

Asked by Favonius

I need to mine the content of most known document file types, such as:

pdf
html
doc/docx etc.

For most of these file formats I am planning to use:

http://tika.apache.org/

But as of now Tika does not support MHTML (*.mht) files (http://en.wikipedia.org/wiki/MHTML). There are a few examples in C# (http://www.codeproject.com/KB/files/MhtBuilder.aspx), but I found none in Java.

I tried opening the *.mht file in 7-Zip and it failed, although WinZip was able to decompress the file into images and text (CSS, HTML, script) as text and binary files.

According to the MSDN page (http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content) and the CodeProject page I mentioned earlier, mht files use GZip compression.

Attempting to decompress the file in Java results in the following exceptions. With java.util.zip.GZIPInputStream:

java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:16)

And with java.util.zip.ZipFile:

 java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(Unknown Source)
at java.util.zip.ZipFile.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:21)

Please suggest how to decompress it.

Thanks.

Answer 1

Accepted answer by Favonius

Frankly, I wasn't expecting a solution in the near future and was about to give up, but somehow I stumbled on these pages:

http://en.wikipedia.org/wiki/MIME#Multipart_messages

http://msdn.microsoft.com/en-us/library/ms527355%28EXCHG.10%29.aspx

They are not very catchy at first look, but if you read them carefully you will get a clue. After reading them I fired up IE and started saving random pages as *.mht files. Let me go line by line...

But let me explain beforehand that my ultimate goal was to separate/extract the HTML content and parse it. The solution is not complete in itself, as it depends on the character set or encoding I choose while saving. Even so, it will extract the individual files with minor hitches.

I hope this will be useful for anyone who is trying to parse/decompress *.mht/MHTML files :)

======= Explanation ======== ** Taken from an mht file **

From: "Saved by Windows Internet Explorer 7"

This is the software that was used to save the file.

Subject: Google
Date: Tue, 13 Jul 2010 21:23:03 +0530
MIME-Version: 1.0

Subject, date and MIME version ... much like the mail format.

  Content-Type: multipart/related;
type="text/html";

This is the part which tells us that it is a multipart document. A multipart document has one or more different sets of data combined in a single body, and a multipart Content-Type field must appear in the entity's header. Here, we can also see the type as "text/html".

boundary="----=_NextPart_000_0007_01CB22D1.93BBD1A0"

Out of all this, this is the most important part. It is the unique delimiter which separates the different parts (HTML, images, CSS, script etc.). Once you get hold of this, everything gets easy. Now I just have to iterate through the document, find the different sections, and save them according to their Content-Transfer-Encoding (base64, quoted-printable etc.).
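To make the role of the boundary concrete, here is a minimal sketch of splitting an MHT document into its raw parts. It is not the parser from this answer: the class name MhtSplitSketch and both method names are made up for illustration, and it assumes the whole file fits in memory as one String and that the boundary is declared once in the top-level headers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of the parser in this answer. */
final class MhtSplitSketch {

    // Matches boundary="..." (quoted) or boundary=token in the top-level headers.
    private static final Pattern BOUNDARY =
            Pattern.compile("boundary=\"?([^\";\\r\\n]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the declared boundary string, or null if none is found. */
    static String findBoundary(String mht) {
        Matcher m = BOUNDARY.matcher(mht);
        return m.find() ? m.group(1) : null;
    }

    /** Splits the document into raw parts (part headers plus body), skipping the top-level headers. */
    static List<String> splitParts(String mht) {
        List<String> parts = new ArrayList<>();
        String boundary = findBoundary(mht);
        if (boundary == null) {
            return parts;
        }
        // Every part is introduced by "--" + boundary; the final delimiter has a trailing "--".
        String[] chunks = mht.split(Pattern.quote("--" + boundary));
        for (int i = 1; i < chunks.length; i++) { // chunk 0 holds the top-level headers
            if (chunks[i].startsWith("--")) {
                break; // closing delimiter reached
            }
            parts.add(chunks[i].trim());
        }
        return parts;
    }
}
```

Each returned part still contains its own headers (Content-Type, Content-Transfer-Encoding, Content-Location) followed by a blank line and the encoded body, so decoding remains a separate step.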

SAMPLE

 ------=_NextPart_000_0007_01CB22D1.93BBD1A0
 Content-Type: text/html;
 charset="utf-8"
 Content-Transfer-Encoding: quoted-printable
 Content-Location: http://www.google.com/webhp?sourceid=navclient&ie=UTF-8

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" =
.
.
.
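The trailing "=" at the end of the DOCTYPE line above is a quoted-printable soft line break, and bytes outside printable ASCII appear as "=XX" hex escapes. A minimal decoding sketch, assuming the part body has already been read into a String, could look like the following. The class name QuotedPrintableSketch is hypothetical and this is not the code used in the parser below.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

/** Hypothetical helper for illustration only. */
final class QuotedPrintableSketch {

    /** Decodes quoted-printable text and interprets the resulting bytes in the given charset. */
    static String decode(String input, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c != '=') {
                out.write(c);                       // encoded input is plain ASCII
                i++;
            } else if (input.startsWith("=\r\n", i)) {
                i += 3;                             // soft line break (CRLF): drop it
            } else if (input.startsWith("=\n", i)) {
                i += 2;                             // soft line break (LF only): drop it
            } else if (i + 2 < input.length()
                    && Character.digit(input.charAt(i + 1), 16) >= 0
                    && Character.digit(input.charAt(i + 2), 16) >= 0) {
                // "=XX" escape: two hex digits give one byte
                out.write(Integer.parseInt(input.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                out.write('=');                     // malformed escape: keep it literally
                i++;
            }
        }
        return new String(out.toByteArray(), charset);
    }
}
```

Collecting bytes first and converting to a String at the end matters for multi-byte charsets such as UTF-8, where one character can be spread over several "=XX" escapes.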

** JAVA CODE **

An interface for defining constants.

public interface IConstants 
{
    public String BOUNDARY = "boundary";
    public String CHAR_SET = "charset";
    public String CONTENT_TYPE = "Content-Type";
    public String CONTENT_TRANSFER_ENCODING = "Content-Transfer-Encoding";
    public String CONTENT_LOCATION = "Content-Location";

    public String UTF8_BOM = "=EF=BB=BF";

    public String UTF16_BOM1 = "=FF=FE";
    public String UTF16_BOM2 = "=FE=FF";
}

The main parser class:

/**
 * This program and the accompanying materials are made available under the terms of the Eclipse Public License v1.0
 * which accompanies this distribution, and is available at
 * http://www.eclipse.org/legal/epl-v10.html
 */
package com.test.mht.core;

import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.OutputStreamWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import sun.misc.BASE64Decoder;

/**
 * Parses a *.mht file and decomposes it into its constituent parts.
 * @author Manish Shukla 
 */

public class MHTParser implements IConstants
{
    private File mhtFile;
    private File outputFolder;

    public MHTParser(File mhtFile, File outputFolder) {
        this.mhtFile = mhtFile;
        this.outputFolder = outputFolder;
    }

    /**
     * @throws Exception
     */
    public void decompress() throws Exception
    {
        BufferedReader reader = null;

        String type = "";
        String encoding = "";
        String location = "";
        String filename = "";
        String charset = "utf-8";
        StringBuilder buffer = null;

        try
        {
            reader = new BufferedReader(new FileReader(mhtFile));

            final String boundary = getBoundary(reader);
            if(boundary == null)
                throw new Exception("Failed to find document 'boundary'... Aborting");

            String line = null;
            int i = 1;
            while((line = reader.readLine()) != null)
            {
                String temp = line.trim();
                if(temp.contains(boundary)) 
                {
                    if(buffer != null) {
                        writeBufferContentToFile(buffer,encoding,filename,charset);
                        buffer = null;
                    }

                    buffer = new StringBuilder();
                }else if(temp.startsWith(CONTENT_TYPE)) {
                    type = getType(temp);
                }else if(temp.startsWith(CHAR_SET)) {
                    charset = getCharSet(temp);
                }else if(temp.startsWith(CONTENT_TRANSFER_ENCODING)) {
                    encoding = getEncoding(temp);
                }else if(temp.startsWith(CONTENT_LOCATION)) {
                    location = temp.substring(temp.indexOf(":")+1).trim();
                    i++;
                    filename = getFileName(location,type);
                }else {
                    if(buffer != null) {
                        buffer.append(line + "\n");
                    }
                }
            }

        }finally 
        {
            if(null != reader)
                reader.close();
        }

    }

    private String getCharSet(String temp) 
    {
        String t = temp.split("=")[1].trim();
        return t.substring(1, t.length()-1);
    }

    /**
     * Save the file as per character set and encoding 
     */
    private void writeBufferContentToFile(StringBuilder buffer,String encoding, String filename, String charset) 
    throws Exception
    {

        if(!outputFolder.exists())
            outputFolder.mkdirs();

        byte[] content = null; 

        boolean text = true;

        if(encoding.equalsIgnoreCase("base64")){
            content = getBase64EncodedString(buffer);
            text = false;
        }else if(encoding.equalsIgnoreCase("quoted-printable")) {
            content = getQuotedPrintableString(buffer);         
        }
        else
            content = buffer.toString().getBytes();

        if(!text)
        {
            BufferedOutputStream bos = null;
            try
            {
                bos = new BufferedOutputStream(new FileOutputStream(filename));
                bos.write(content);
                bos.flush();
            }finally {
                bos.close();
            }
        }else 
        {
            BufferedWriter bw = null;
            try
            {
                bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(filename), charset));
                bw.write(new String(content));
                bw.flush();
            }finally {
                bw.close();
            }
        }
    }

    /**
     * When a *.mht file is saved with 'utf-8' encoding, the byte order mark
     * appears in the quoted-printable text as '=EF=BB=BF'; it is stripped here.
     * @see http://en.wikipedia.org/wiki/Byte_order_mark
     */
    private byte[] getQuotedPrintableString(StringBuilder buffer) 
    {
        //Set<String> uniqueHex = new HashSet<String>();
        //final Pattern p = Pattern.compile("(=\p{XDigit}{2})*");

        String temp = buffer.toString().replaceAll(UTF8_BOM, "").replaceAll("=\n", "");

        //Matcher m = p.matcher(temp);
        //while(m.find()) {
        //  uniqueHex.add(m.group());
        //}

        //System.out.println(uniqueHex);

        //for (String hex : uniqueHex) {
            //temp = temp.replaceAll(hex, getASCIIValue(hex.substring(1)));
        //}     

        return temp.getBytes();
    }

    /*private String getASCIIValue(String hex) {
        return ""+(char)Integer.parseInt(hex, 16);
    }*/
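    /**
     * Decodes a base64 part. java.util.Base64 (Java 8+) replaces the internal
     * sun.misc.BASE64Decoder, which is not available in recent JDKs, so the
     * sun.misc import at the top of the file can be deleted. The MIME decoder
     * ignores the line breaks inside the encoded body.
     */
    private byte[] getBase64EncodedString(StringBuilder buffer) throws Exception {
        return java.util.Base64.getMimeDecoder().decode(buffer.toString());
    }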

    /**
     * Tries to get a qualified file name. If the name is not apparent it tries to guess it from the URL.
     * Otherwise it returns 'unknown.<type>'
     */
    private String getFileName(String location, String type) 
    {
        // Backslashes must be doubled inside a Java string literal
        final Pattern p = Pattern.compile("(\\w|_|-)+\\.\\w+");
        String ext = "";
        String name = "";
        if(type.toLowerCase().endsWith("jpeg"))
            ext = "jpg";
        else
            ext = type.split("/")[1];

        if(location.endsWith("/")) {
            name = "main";
        }else
        {
            name = location.substring(location.lastIndexOf("/") + 1);

            Matcher m = p.matcher(name);
            String fname = "";
            while(m.find()) {
                fname = m.group();
            }

            if(fname.trim().length() == 0)
                name = "unknown";
            else
                return getUniqueName(fname.substring(0,fname.indexOf(".")), fname.substring(fname.indexOf(".") + 1, fname.length()));
        }
        return getUniqueName(name,ext);
    }

    /**
     * Returns a qualified unique output file path for the parsed path.
     * If the file already exists, a numerical suffix is appended and
     * incremented until an unused name is found.
     */
    private String getUniqueName(String name,String ext)
    {
        int i = 1;
        File file = new File(outputFolder,name + "." + ext);
        if(file.exists())
        {
            while(true)
            {
                file = new File(outputFolder, name + i + "." + ext);
                if(!file.exists())
                    return file.getAbsolutePath();
                i++;
            }
        }

        return file.getAbsolutePath();
    }

    private String getType(String line) {
        return splitUsingColonSpace(line);
    }

    private String getEncoding(String line){
        return splitUsingColonSpace(line);
    }

    private String splitUsingColonSpace(String line) {
        return line.split(":\\s*")[1].replaceAll(";", "");
    }

    /**
     * Gives you the boundary string
     */
    private String getBoundary(BufferedReader reader) throws Exception 
    {
        String line = null;

        while((line = reader.readLine()) != null)
        {
            line = line.trim();
            if(line.startsWith(BOUNDARY)) {
                return line.substring(line.indexOf("\"") + 1, line.lastIndexOf("\""));
            }
        }

        return null;
    }
}

Regards,

Answer 2

Answer by Roki

You can try http://www.chilkatsoft.com/mht-features.asp. It can pack/unpack MHT files, and you can then handle the extracted content as normal files. The download link is: http://www.chilkatsoft.com/java.asp

Answer 3

Answer by Wajdy Essam

I used http://jtidy.sourceforge.net to parse/read/index mht files (but as normal files, not compressed files).

Answer 4

Answer by wener

You don't have to do it on your own.

With this dependency:

<dependency>
    <groupId>org.apache.james</groupId>
    <artifactId>apache-mime4j</artifactId>
    <version>0.7.2</version>
</dependency>

Run your mht file through it:

public static void main(String[] args)
{
    MessageTree.main(new String[]{"YOUR MHT FILE PATH"});
}

According to its Javadoc, MessageTree does the following:

/**
 * Displays a parsed Message in a window. The window will be divided into
 * two panels. The left panel displays the Message tree. Clicking on a
 * node in the tree shows information on that node in the right panel.
 *
 * Some of this code have been copied from the Java tutorial's JTree section.
 */

Then you can look into it.

;-)

Answer 5

Answer by David Turner

Late to the party, but expanding on @wener's answer for anyone else stumbling across this.

The Apache Mime4J library seems to have the most readily accessible solution for EML or MHTML processing, much easier than rolling your own!

My prototype 'parseMhtToFile' function below rips HTML files and other artifacts out of a Cognos active report 'mht' file, but could be tailored to other purposes.

This is written in Groovy and requires the Apache Mime4J 'core' and 'dom' jars (currently 0.7.2).

import org.apache.james.mime4j.dom.Message
import org.apache.james.mime4j.dom.Multipart
import org.apache.james.mime4j.dom.field.ContentTypeField
import org.apache.james.mime4j.message.DefaultMessageBuilder
import org.apache.james.mime4j.stream.MimeConfig

/**
 * Use Mime4J MessageBuilder to parse an mhtml file (assumes multipart) into
 * separate html files.
 * Files will be written to outDir (or parent) as baseName + partIdx + ext.
 */
void parseMhtToFile(File mhtFile, File outDir = null) {
    if (!outDir) {outDir = mhtFile.parentFile }
    // File baseName will be used in generating new filenames
    def mhtBaseName = mhtFile.name.replaceFirst(~/\.[^\.]+$/, '')

    // -- Set up Mime parser, using Default Message Builder
    MimeConfig parserConfig  = new MimeConfig();
    parserConfig.setMaxHeaderLen(-1); // The default is a mere 10k
    parserConfig.setMaxLineLen(-1); // The default is only 1000 characters.
    parserConfig.setMaxHeaderCount(-1); // Disable the check for header count.
    DefaultMessageBuilder builder = new DefaultMessageBuilder();
    builder.setMimeEntityConfig(parserConfig);

    // -- Parse the MHT stream data into a Message object
    println "Parsing ${mhtFile}...";
    InputStream mhtStream = mhtFile.newInputStream()
    Message message = builder.parseMessage(mhtStream);

    // -- Process the resulting body parts, writing to file
    assert message.getBody() instanceof Multipart
    Multipart multipart = (Multipart) message.getBody();
    def parts = multipart.getBodyParts();
    parts.eachWithIndex { p, i ->
        ContentTypeField cType = p.header.getField('content-type')
        println "${p.class.simpleName}\t${i}\t${cType.mimeType}"

        // Assume mime sub-type is a "good enough" file-name extension 
        // e.g. text/html = html, image/png = png, application/json = json
        String partFileName = "${mhtBaseName}_${i}.${cType.subType}"
        File partFile = new File(outDir, partFileName)

        // Write part body stream to file
        println "Writing ${partFile}...";
        if (partFile.exists()) partFile.delete();
        InputStream partStream = p.body.inputStream;
        partFile.append(partStream);
    }
}

Usage is simply:

File mhtFile = new File('<path>', 'Report-en-au.mht')
parseMhtToFile(mhtFile)
println 'Done.'

Output is:

Parsing <path>\Report-en-au.mht...
BodyPart    0   text/html
Writing <path>\Report-en-au_0.html...
BodyPart    1   image/png
Writing <path>\Report-en-au_1.png...
Done.

Thoughts on other improvements:

For 'text' MIME parts, you can access a Reader instead of a Stream, which might be more appropriate for text mining as the OP requested.
For generated filename extensions, I'd use another library to look up the appropriate extension rather than assume the MIME sub-type is adequate.
Handle single-body (non-Multipart) and recursive Multipart mhtml files and other complexities. These may require a MimeStreamParser with a custom ContentHandler implementation.
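As a rough illustration of the second point, a small lookup table can stand in for a dedicated library. This is only a sketch: the class MimeExtensionSketch, its method, and the handful of mappings are assumptions chosen for the example, not a complete or authoritative registry. Unknown types fall back to the sub-type, which is the behaviour of the Groovy function above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration only; the table is deliberately tiny. */
final class MimeExtensionSketch {

    private static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("text/html", "html");
        KNOWN.put("text/css", "css");
        KNOWN.put("text/plain", "txt");
        KNOWN.put("text/javascript", "js");
        KNOWN.put("application/javascript", "js");
        KNOWN.put("image/jpeg", "jpg");
        KNOWN.put("image/png", "png");
        KNOWN.put("image/gif", "gif");
        KNOWN.put("image/svg+xml", "svg");
    }

    /** Maps a Content-Type value to a file extension; parameters such as charset are ignored. */
    static String extensionFor(String mimeType) {
        if (mimeType == null) {
            return "bin";
        }
        String key = mimeType.trim().toLowerCase(Locale.ROOT);
        int semi = key.indexOf(';');
        if (semi >= 0) {
            key = key.substring(0, semi).trim();
        }
        String ext = KNOWN.get(key);
        if (ext != null) {
            return ext;
        }
        // Fall back to the sub-type, e.g. application/json -> json
        int slash = key.indexOf('/');
        return (slash >= 0 && slash < key.length() - 1) ? key.substring(slash + 1) : "bin";
    }
}
```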

Answer 6

Answer by rakesh

More compact code using the Java Mail APIs:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.net.URL;
import java.util.Properties;

import javax.mail.BodyPart;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;
import javax.mail.internet.MimeMultipart;

import org.apache.commons.io.IOUtils;

public class MhtParser {

    private File mhtFile;
    private File outputFolder;

    public MhtParser(File mhtFile, File outputFolder) {
        this.mhtFile = mhtFile;
        this.outputFolder = outputFolder;
    }

    public void decompress() throws Exception {
        MimeMessage message = 
            new MimeMessage(
                    Session.getDefaultInstance(new Properties(), null),
                    new FileInputStream(mhtFile));

        if (message.getContent() instanceof MimeMultipart) {
            outputFolder.mkdir();
            MimeMultipart mimeMultipart = (MimeMultipart) message.getContent();

            for (int i = 0; i < mimeMultipart.getCount(); i++) {
                BodyPart bodyPart = mimeMultipart.getBodyPart(i);
                String fileName = bodyPart.getFileName();

                if (fileName == null) {
                    String[] locationHeader = bodyPart.getHeader("Content-Location");
                    if (locationHeader != null && locationHeader.length > 0) {
                        fileName = 
                            new File(new URL(locationHeader[0]).getFile()).getName();
                    }
                }

                if (fileName != null) {
                    FileOutputStream out = 
                        new FileOutputStream(new File(outputFolder, fileName));

                    IOUtils.copy(bodyPart.getInputStream(), out);
                    out.flush();
                    out.close();
                }
            }
        }
    }
}
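One thing to watch in the code above: URL.getFile() includes the query string, so a Content-Location such as http://www.google.com/webhp?sourceid=navclient&ie=UTF-8 yields a name containing "?", which is not a valid file name on Windows. A possible way around this, sketched under the assumption that only the path component should be used, is shown below. The class ContentLocationNameSketch and its fallback parameter are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper for illustration only. */
final class ContentLocationNameSketch {

    /** Derives a file name from a Content-Location value, or returns the fallback if none can be derived. */
    static String fileNameFor(String contentLocation, String fallback) {
        if (contentLocation == null) {
            return fallback;
        }
        String path;
        try {
            // getPath() excludes the query string and fragment
            path = new URI(contentLocation.trim()).getPath();
        } catch (URISyntaxException e) {
            return fallback;
        }
        if (path == null || path.isEmpty() || path.endsWith("/")) {
            return fallback; // opaque URI (e.g. cid:...), bare host, or directory-style URL
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        // Replace characters that are not allowed in Windows file names
        name = name.replaceAll("[\\\\/:*?\"<>|]", "_");
        return name.isEmpty() ? fallback : name;
    }
}
```

In the loop above, the fallback could be something like "part" + i, so that parts without a usable Content-Location are still written out instead of being skipped.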
