Java: How to save a Jsoup Document to an HTML file?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original address and author information, and attribute it to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/24696766/

Time: 2020-08-14 14:14:36  Source: igfitidea

Tags: java, jsoup, document

Asked by Ali Khezeli

I used this method to retrieve a webpage into an org.jsoup.nodes.Document object:

myDoc = Jsoup.connect(myURL).ignoreContentType(true).get();

How should I write this object to an HTML file? The methods myDoc.html(), myDoc.text() and myDoc.toString() don't output all elements of the document.

Some information inside a JavaScript element can be lost when the page is parsed, for example the "timestamp" in the source of an Instagram media page.

Accepted answer by Alkis Kalogeris

The fact that some elements are ignored must be due to Jsoup's attempt at normalization.

To get the server's exact output without any form of normalization, use the following:

Connection.Response html = Jsoup.connect("PUT_URL_HERE").execute();
System.out.println(html.body());
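
To write that unmodified output to a file rather than print it, one option is to save the response bytes directly. The following is only a minimal sketch using java.nio (Java 11+). The class name RawHtmlSaver, the method saveRaw and the file name page.html are hypothetical, and the byte array in main stands in for what jsoup's Connection.Response.bodyAsBytes() would return, so the example runs without a network connection.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class RawHtmlSaver {
    /** Writes the bytes to target unchanged, creating or overwriting the file. */
    static Path saveRaw(byte[] rawBody, Path target) throws IOException {
        return Files.write(target, rawBody);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the server response (assumption); with jsoup this could be
        // Jsoup.connect(url).execute().bodyAsBytes()
        byte[] rawBody = "<html><script>var timestamp = 1;</script></html>"
                .getBytes(StandardCharsets.UTF_8);
        Path saved = saveRaw(rawBody, Path.of("page.html"));
        System.out.println("Saved " + Files.size(saved) + " bytes to " + saved);
    }
}
```

Because the bytes are written as received, nothing is parsed or re-encoded along the way, so script contents should reach the file untouched.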

Answer by Gondy

Use doc.outerHtml().

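import java.io.File;

import org.apache.commons.io.FileUtils;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public void downloadPage() throws Exception {
    // Fetch the page and parse the response into a Document
    final Response response = Jsoup.connect("http://www.example.net").execute();
    final Document doc = response.parse();

    // Write the document's outer HTML to a file, encoded as UTF-8
    final File f = new File("filename.html");
    FileUtils.writeStringToFile(f, doc.outerHtml(), "UTF-8");
}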

Don't forget to catch exceptions. Add the Apache commons-io library as a dependency (or download it) for a quick and easy way to save files in UTF-8 format.
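
As an illustration of catching the exception instead of declaring throws Exception, here is a minimal sketch that also avoids the commons-io dependency by using java.nio (Java 11+). The class SafeHtmlWriter and the method trySave are hypothetical names, and the string in main stands in for doc.outerHtml().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SafeHtmlWriter {
    /** Returns true if the HTML was written as UTF-8, false if an I/O error occurred. */
    static boolean trySave(String html, Path target) {
        try {
            Files.writeString(target, html, StandardCharsets.UTF_8);
            return true;
        } catch (IOException e) {
            // Report the failure instead of propagating it to the caller
            System.err.println("Could not write " + target + ": " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in for doc.outerHtml() (assumption, so the sketch runs without jsoup)
        String html = "<html><head></head><body><p>Hello</p></body></html>";
        System.out.println(trySave(html, Path.of("filename.html")));
    }
}
```

Returning a boolean is just one possible design. Depending on the application, logging the error and rethrowing it may be more appropriate.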
