Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1898829/

How do I pretty-print HTML with Nokogiri?

html ruby nokogiri pretty-print

Asked by Jarsen

I wrote a web crawler in Ruby and I'm using Nokogiri::HTML to parse the page. I need to print the page out, and while messing around in IRB I noticed a pretty_print method. However, it takes a parameter and I can't figure out what it wants.

My crawler is caching the HTML of the webpages and writing it to files on my local machine. I would like to "pretty print" the HTML so that it looks nice and properly formatted when I do so.

Accepted answer by mislav

By "pretty printing" an HTML page, I presume you mean reformatting the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library, and its output is useful for debugging only.
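
For context on the parameter the question asks about: pretty_print is the hook that Ruby's pp library calls on an object, passing in its own printer object. Below is a minimal sketch of that protocol. The Page class and its fields are hypothetical and have nothing to do with Nokogiri; they only show what the parameter is.

```ruby
require 'pp'

# Hypothetical class, for illustration only (not part of Nokogiri)
class Page
  def initialize(title, links)
    @title = title
    @links = links
  end

  # pp calls this hook and hands in its printer object (q);
  # the method describes the object by calling methods on q
  def pretty_print(q)
    q.group(1, "#<Page", ">") do
      q.breakable
      q.text "title="
      q.pp @title
      q.comma_breakable
      q.text "links="
      q.pp @links
    end
  end
end

page = Page.new("Home", ["/a", "/b"])
pp page                     # prints a debugging view to stdout
dump = page.pretty_inspect  # same view, returned as a String
```

So you normally never call pretty_print yourself; you call pp and it supplies the argument. The result is a debugging dump of the object, not reformatted HTML.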

Several projects understand HTML well enough to reformat it without destroying whitespace that is actually significant (the best known is HTML Tidy). By Googling, however, I found a post titled "Pretty printing XHTML with Nokogiri and XSLT".

It comes down to this:

xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
html = Nokogiri(File.open("source.html"))
puts xsl.apply_to(html).to_s

It requires you, of course, to download the linked XSL file to your filesystem. I've tried it very quickly on my machine and it works like a charm.

Answer by Phrogz

The answer by @mislav is somewhat wrong. Nokogiri does support pretty-printing if you:

  • Parse the document as XML
  • Instruct Nokogiri to ignore whitespace-only nodes ("blanks") during parsing
  • Use to_xhtml or to_xml to specify pretty-printing parameters

In action:

html = '<section>
<h1>Main Section 1</h1><p>Intro</p>
<section>
<h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p>
</section><section>
<h2>Subhead 1.2</h2><p>Meat</p>
</section></section>'

require 'nokogiri'
doc = Nokogiri::XML(html,&:noblanks)
puts doc
#=> <section>
#=>   <h1>Main Section 1</h1>
#=>   <p>Intro</p>
#=>   <section>
#=>     <h2>Subhead 1.1</h2>
#=>     <p>Meat</p>
#=>     <p>MOAR MEAT</p>
#=>   </section>
#=>   <section>
#=>     <h2>Subhead 1.2</h2>
#=>     <p>Meat</p>
#=>   </section>
#=> </section>

puts doc.to_xhtml( indent:3, indent_text:"." )
#=> <section>
#=> ...<h1>Main Section 1</h1>
#=> ...<p>Intro</p>
#=> ...<section>
#=> ......<h2>Subhead 1.1</h2>
#=> ......<p>Meat</p>
#=> ......<p>MOAR MEAT</p>
#=> ...</section>
#=> ...<section>
#=> ......<h2>Subhead 1.2</h2>
#=> ......<p>Meat</p>
#=> ...</section>
#=> </section>

Answer by bronson

This worked for me:

 pretty_html = Nokogiri::HTML(html).to_xhtml(indent: 3) 

I tried the REXML version from another answer, but it corrupted some of my documents. And I hate to bring XSLT into a new project. Both feel antiquated. :)

Answer by Julien

You can try REXML:

require "rexml/document"

doc = REXML::Document.new(xml)
doc.write($stdout, 2)
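
A self-contained variant of the snippet above, in case you want the result as a string (for example to save it) rather than on $stdout. The input string is made up, and on recent Ruby versions rexml ships as a bundled gem, so it may need to be listed in your Gemfile. REXML expects well-formed XML, so this will likely raise on sloppy real-world HTML.

```ruby
require "rexml/document"
require "stringio"

# Made-up, well-formed input for illustration
xml = "<section><h1>Main Section 1</h1><p>Intro</p></section>"

doc = REXML::Document.new(xml)

# Write into a StringIO instead of $stdout; the second argument is the indent width
out = StringIO.new
doc.write(out, 2)

pretty = out.string
puts pretty
```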

Answer by pariser

My solution was to add a print method onto the actual Nokogiri objects. After you run the code in the snippet below, you should be able to write node.print, and it will pretty-print the contents. No XSLT required :-)

Nokogiri::XML::Node.class_eval do
  # Print every Node by default (will be overridden by CharacterData)
  define_method :should_print? do
    true
  end

  # Duplicate this node, replace the contents of the duplicated node with a
  # newline. With this content substitution, the #to_s method conveniently
  # returns a string with the opening tag (e.g. `<a href="foo">`) on the first
  # line and the closing tag on the second (e.g. `</a>`, provided that the
  # current node is not a self-closing tag).
  #
  # Now, print the open tag preceded by the correct amount of indentation, then
  # recursively print this node's children (with extra indentation), and then
  # print the close tag (if there is a closing tag)
  define_method :print do |indent=0|
    duplicate = self.dup
    duplicate.content = "\n"
    open_tag, close_tag = duplicate.to_s.split("\n")

    puts (" " * indent) + open_tag
    self.children.select(&:should_print?).each { |child| child.print(indent + 2) }
    puts (" " * indent) + close_tag if close_tag
  end
end

Nokogiri::XML::CharacterData.class_eval do
  # Only print CharacterData if there's non-whitespace content
  define_method :should_print? do
    content =~ /\S+/
  end

  # Replace every run of consecutive whitespace characters with a single space;
  # precede the output by the given amount of indentation; print this text.
  define_method :print do |indent=0|
    puts (" " * indent) + to_s.strip.gsub(/\s+/, ' ')
  end
  end
end

Answer by khelll

Why don't you try the pp method?

require 'pp'
pp some_var