Java HtmlUnit - Convert an HtmlPage into an HTML string?

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/6497167/

Date: 2020-10-30 16:06:46  Source: igfitidea

Tags: java, html, xml, htmlunit

Asked by Peter

I'm using HtmlUnit to generate the HTML for various pages, but right now, the best I can do to get the page into the raw HTML that the server returns is to convert the HtmlPage into an XML string.

This is somewhat annoying because the XML output is rendered by web browsers differently than the raw HTML would. Is there a way to convert an HtmlPage into raw HTML instead of XML?

Thanks!

Answered by Rodney Gitzel

page.asXml() will return the HTML. page.asText() returns it rendered down to just text.

Answered by Sergey O.

I'm not 100% certain I understood the question correctly, but maybe this will address your issue:

page.getWebResponse().getContentAsString()

Answered by snorbi

I think there is no direct way to get the final page as HTML. asXml() returns the result as XML, asText() returns the extracted text content.

The best you can do is to use asXml() and "transform" it to HTML:

htmlPage.asXml().replaceFirst("<\\?xml version=\"1.0\" encoding=\"(.+)\"\\?>", "<!DOCTYPE html>")

(Of course you can apply more transformations like converting <br/> to <br> - it depends on your requirements.)

Even the related Google documentation recommends this approach (although it doesn't apply any transformations):

// return the snapshot
out.println(page.asXml());
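
As a rough illustration of the "transform asXml() to HTML" idea in this answer, here is a minimal sketch that works on a plain string such as the one asXml() returns. The class name XmlToHtmlSketch, the method toHtml, and the list of void elements are assumptions made for this example and are not part of HtmlUnit. The regular expressions are deliberately simple and would not cope with attribute values that contain ">".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlUnit: post-processes an XML string into something closer to HTML.
public class XmlToHtmlSketch {

    // Matches an XML declaration at the start of the string, e.g. <?xml version="1.0" encoding="UTF-8"?>
    private static final Pattern XML_DECLARATION =
            Pattern.compile("^\\s*<\\?xml[^>]*\\?>\\s*");

    // Matches self-closed forms of a few HTML void elements, e.g. <br/> or <img src="a.png" />
    // (the element list is an assumption; extend it to suit your pages)
    private static final Pattern SELF_CLOSED_VOID =
            Pattern.compile("<(br|hr|img|input|meta|link)\\b([^>]*?)\\s*/>");

    static String toHtml(String xml) {
        // Drop the XML declaration, if present
        String withoutDeclaration = XML_DECLARATION.matcher(xml).replaceFirst("");
        // Rewrite <br/> as <br>, keeping any attributes
        String html = SELF_CLOSED_VOID.matcher(withoutDeclaration).replaceAll("<$1$2>");
        // Prepend an HTML5 doctype
        return "<!DOCTYPE html>\n" + html;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<html><body>a<br/>b</body></html>";
        System.out.println(toHtml(xml));
    }
}
```

With HtmlUnit on the classpath, the call would presumably look like XmlToHtmlSketch.toHtml(htmlPage.asXml()); whether the result is close enough to the original markup depends on your requirements.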

Answered by Pavlo

Here is the solution that works for me:

ScriptResult scriptResult = htmlPage.executeJavaScript("document.documentElement.outerHTML;");
System.out.println(scriptResult.getJavaScriptResult().toString());

Answered by mP.

I don't know of an answer short of switching on the Page type; for XmlPage and SgmlPage one must take the innerHTML of the HTML element and manually write out its attributes. It is neither elegant nor exact (it's missing the doctype), but it works.

Regarding the suggestion "Page.getWebResponse().getContentAsString()":

This is incorrect, as it returns the text form of the original, unrendered bytes, before any JavaScript has run. If JavaScript executes and changes the page, this method will not see the changes.

Regarding the suggestion "page.asXml() will return the HTML. page.asText() returns it rendered down to just text.":

Just want to confirm that asText() only returns the text within text nodes and does not include the tags and their attributes. If you want the complete HTML, this is not good enough.
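
The difference between "text nodes only" and "full markup" can be illustrated without HtmlUnit. The sketch below uses the JDK's own DOM and Transformer classes as a stand-in: getTextContent() plays the role of a text-only view, and serializing the DOM plays the role of a markup view. This is only an analogy under that assumption; the class and method names are made up for the example, and it says nothing about how HtmlUnit implements asText() or asXml().

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class using only JDK APIs (not HtmlUnit).
public class TextVersusMarkupSketch {

    // Parses a well-formed XML/XHTML fragment into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    // Text-only view: concatenated text nodes, no tags or attributes.
    static String textOnly(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    // Markup view: the DOM serialized back to a string, tags and attributes included.
    static String markup(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(parse(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        String fragment = "<p class=\"note\">Hello <b>world</b></p>";
        System.out.println(textOnly(fragment)); // text nodes only
        System.out.println(markup(fragment));   // tags and attributes preserved
    }
}
```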

Answered by PooBucket

Maybe you want to go with something like this, instead of using the HtmlUnit framework's methods:

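import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// The method name and signature are assumed; the original snippet only showed the method body.
// Reads the raw response body from the URL, so no JavaScript is executed.
static String readHtmlSource(URL url) {
    // InputStreamReader without a charset argument uses the platform default charset
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(url.openConnection().getInputStream()))) {

        StringBuilder htmlSource = new StringBuilder();
        String line;

        // Append each line, restoring the line break that readLine() strips
        while ((line = br.readLine()) != null) {
            htmlSource.append(line).append("\n");
        }

        return htmlSource.toString();

    } catch (IOException e) {
        e.printStackTrace();
        return null; // signals that the page could not be read
    }
}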