Convert HTML to XML using Java

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/19489882/

Date: 2020-08-12 17:42:32  Source: igfitidea


Tags: java, html, xml, jtidy

Asked by suresh

Can anyone suggest the best approach for converting HTML to XML using Java? Is there an API available for that? The HTML might also contain JavaScript code.

I have tried the code below:

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.DataInputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;

class HTML2XML {
    public static void main(String args[]) throws JDOMException {
    InputStream isInHtml = null;
    URL url = null;
    URLConnection connection = null;
    DataInputStream disInHtml = null;
    FileOutputStream fosOutHtml = null;
    FileWriter fwOutXml = null;
    FileReader frInHtml = null;
    BufferedWriter bwOutXml = null;
    BufferedReader brInHtml = null;
    try {
        // url = new URL("www.climb.co.jp");
        // connection = url.openConnection();
        // isInHtml = connection.getInputStream();

        frInHtml = new FileReader("D:\\Second.html");
        brInHtml = new BufferedReader(frInHtml);
        SAXBuilder saxBuilder = new SAXBuilder(
                "org.ccil.cowan.tagsoup.Parser", false);
        org.jdom.Document jdomDocument = saxBuilder.build(brInHtml);

        XMLOutputter outputter = new XMLOutputter();
        org.jdom.output.Format newFormat = outputter.getFormat();
        String encoding = "iso-8859-2";
        newFormat.setEncoding(encoding);
        outputter.setFormat(newFormat);

        try {
            outputter.output(jdomDocument, System.out);
            fwOutXml = new FileWriter("D:\\Second.xml");
            bwOutXml = new BufferedWriter(fwOutXml);
            outputter.output(jdomDocument, bwOutXml);
            System.out.flush();
        } catch (IOException e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }

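    } catch (IOException e) {
        // Report the failure instead of silently swallowing it
        e.printStackTrace();
    } finally {
        System.out.flush();
        // Close only the streams that were actually opened. Closing the
        // buffered wrappers also flushes and closes the underlying files.
        try {
            if (bwOutXml != null) {
                bwOutXml.close();
            }
            if (brInHtml != null) {
                brInHtml.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }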
}
}

But it's not working as expected.

Accepted answer by Clyde Lobo

Try JTidy.

JTidy can be used as a tool for cleaning up malformed and faulty HTML.

Answer by Ahsan Shah

HTML is not the same as XML unless it is conforming XHTML or HTML5 in XML mode.

I suggest using an HTML parser to read the HTML and then transform it to XML, or process it directly.
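To illustrate the "parse as HTML, then write out as XML" idea, here is a minimal sketch that uses only classes shipped with the JDK: the Swing HTML parser (`ParserDelegator`) reads the lenient HTML, the callbacks build a W3C DOM tree, and a `Transformer` serializes it as XML. The class name `HtmlToXmlSketch` and the method `toXml` are made up for this example and do not come from the question. The Swing parser is dated (it follows an HTML 3.2 DTD), and this sketch drops comments and `<script>` contents, so for real pages a dedicated library such as TagSoup, JTidy, or jsoup is probably the better choice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Enumeration;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, not part of the original question or answers.
class HtmlToXmlSketch {

    /** Parses lenient HTML and returns it serialized as XML text. */
    static String toXml(String html) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Deque<Node> open = new ArrayDeque<>(); // currently open elements

            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                    open.push(append(t, a));
                }

                @Override
                public void handleEndTag(HTML.Tag t, int pos) {
                    if (!open.isEmpty()) {
                        open.pop();
                    }
                }

                @Override
                public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                    // Closing tags of unknown elements are reported here too; skip them.
                    if (!a.isDefined(HTML.Attribute.ENDTAG)) {
                        append(t, a);
                    }
                }

                @Override
                public void handleText(char[] data, int pos) {
                    if (!open.isEmpty()) {
                        open.peek().appendChild(doc.createTextNode(new String(data)));
                    }
                }

                // Creates an element for the tag and attaches it to the current parent.
                private Element append(HTML.Tag t, MutableAttributeSet a) {
                    Element el = doc.createElement(t.toString());
                    for (Enumeration<?> names = a.getAttributeNames(); names.hasMoreElements();) {
                        Object name = names.nextElement();
                        if (HTMLEditorKit.ParserCallback.IMPLIED.equals(name)) {
                            continue; // marker the parser adds to tags it inserted itself
                        }
                        try {
                            el.setAttribute(name.toString(), String.valueOf(a.getAttribute(name)));
                        } catch (DOMException e) {
                            // attribute name is not legal in XML; skip it
                        }
                    }
                    Node parent = !open.isEmpty() ? open.peek()
                            : doc.getDocumentElement() != null ? doc.getDocumentElement() : doc;
                    parent.appendChild(el);
                    return el;
                }
            };

            new ParserDelegator().parse(new StringReader(html), callback, true);

            Transformer tf = TransformerFactory.newInstance().newTransformer();
            // Force XML output; with an <html> root the serializer may otherwise pick HTML.
            tf.setOutputProperty(OutputKeys.METHOD, "xml");
            tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            tf.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | IOException | TransformerException e) {
            throw new IllegalStateException("HTML to XML conversion failed", e);
        }
    }

    public static void main(String[] args) {
        // Unclosed <p> and <br> are typical of HTML that an XML parser would reject.
        System.out.println(toXml("<p>Hello<br>World"));
    }
}
```

The stack of open elements is what turns the parser's flat stream of start and end events into a tree; the HTML parser supplies the missing end tags, so the resulting DOM can be written out as XML.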

Answer by Rajj

If you want to parse HTML rather than convert it to XML, you can use an HTML parser:

http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/
http://htmlparser.sourceforge.net/javadoc/doc-files/using.html

I hope it helps.