Java: How do I get JTidy to make an HTML document well-formed?

Question

Asked by Dave

I'm using JTidy r938. I'm using this code to try to clean up a page:

final Tidy tidy = new Tidy();
tidy.setQuiet(false);
tidy.setShowWarnings(true);
tidy.setShowErrors(0);
tidy.setMakeClean(true);
Document document = tidy.parseDOM(conn.getInputStream(), null);

But when I parse this URL, http://www.chicagoreader.com/chicago/EventSearch?narrowByDate=This+Week&eventCategory=93922&keywords=&page=1, things aren't getting cleaned up. For example, the META tags on the page, like

<META http-equiv="Content-Type" content="text/html; charset=UTF-8">

remain as

<META http-equiv="Content-Type" content="text/html; charset=UTF-8">

instead of gaining a closing "</META>" tag or appearing as "<META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>". I confirmed this by outputting the resulting JTidy org.w3c.dom.Document as a String.

What can I do to make JTidy truly clean up the page, i.e. make it well-formed? I realize there are other tools out there, but this question is specifically about using JTidy.
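The check described in the question (outputting the org.w3c.dom.Document as a String) can be sketched with the JDK's identity transformer. This is only a sketch under assumptions: the class name DomToString and the method toXmlString are hypothetical, the demo uses a JDK-parsed document as a stand-in for JTidy's output, and JTidy's DOM implementation is partial, so whether the transformer accepts it should be verified. The serializer's output method matters here: with the "html" method, void elements such as META are written without a closing tag regardless of what the DOM contains, so the sketch requests "xml" explicitly.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of JTidy.
final class DomToString {

    // Serializes a DOM document using the "xml" output method,
    // which writes empty elements in a closed form.
    static String toXmlString(Document document) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in document built with the JDK parser (assumption: in the
        // question this would be the Document returned by tidy.parseDOM).
        String source = "<html><head><meta charset=\"UTF-8\"></meta></head></html>";
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(source)));
        System.out.println(toXmlString(document));
    }
}
```

If the serializer in use defaults to HTML output, the META tags would look unclosed even when JTidy has produced a clean tree, so it may be worth ruling that out before changing any JTidy flags.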

Answer 1

Answered by Paul Vargas

You need to set several flags on Tidy if you want XML output:

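import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import org.w3c.tidy.Tidy;

private String cleanData(String data) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setWraplen(Integer.MAX_VALUE); // effectively disable line wrapping
    tidy.setPrintBodyOnly(true);        // emit only the contents of <body>
    tidy.setXmlOut(true);               // write the result as well-formed XML
    tidy.setSmartIndent(true);
    ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    // parseDOM also pretty-prints the cleaned document to outputStream
    tidy.parseDOM(inputStream, outputStream);
    return outputStream.toString("UTF-8");
}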

Or, if you simply want XHTML output:

Tidy tidy = new Tidy();
tidy.setXHTML(true);

Answer 2

Answered by Brian

Use tidy.setXmlTags(true); to parse the input as XML instead of HTML.

Answer 3

Answered by Adam Mackler

Use Tidy.setForceOutput(true) (at your own risk) to generate the output even if errors are found.

Answer 4

Answered by user3278204

I parse the HTML twice to get well-formed XML:

  BufferedReader br = new BufferedReader(new StringReader(str));
  StringWriter sw = new StringWriter();

  Tidy t = new Tidy();
  t.setDropEmptyParas(true);
  t.setShowWarnings(false); // hide warnings
  t.setQuiet(true); // suppress the summary and other non-essential output
  t.setUpperCaseAttrs(false);
  t.setUpperCaseTags(false);
  t.parse(br,sw);
  StringBuffer sb = sw.getBuffer();
  String strClean = sb.toString();
  br.close();
  sw.close();

  //do another round of tidyness
  br = new BufferedReader(new StringReader(strClean));
  sw = new StringWriter();

  t = new Tidy();
  t.setXmlTags(true);
  t.parse(br,sw);
  sb = sw.getBuffer();
  String strClean2 = sb.toString();
  br.close();
  sw.close();
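
To check whether a result such as strClean2 really is well-formed, one option is to feed it to the JDK's XML parser and see whether it parses. The following is a minimal sketch under assumptions: the class name WellFormedCheck and the method isWellFormed are hypothetical, and external DTDs are replaced with an empty stream so the parser does not fetch them over the network. A side effect of that choice is that named entities defined only in the DTD (for example &nbsp;) will be reported as errors.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of JTidy.
final class WellFormedCheck {

    // Returns true if the text parses as XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            // Resolve every external entity (such as the XHTML DTD) to an empty
            // stream instead of downloading it.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String unclosed = "<head><META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></head>";
        String closed = "<head><META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"/></head>";
        System.out.println(isWellFormed(unclosed));
        System.out.println(isWellFormed(closed));
    }
}
```

Running the same check on the output of each pass would show whether the second round of tidying is actually needed for a given page.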
