Java: How do I make JTidy make HTML documents well-formed?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/10390922/

Time: 2020-10-31 00:50:55  Source: igfitidea

How do I make JTidy make HTML documents well-formed?

Tags: java, html, xml, parsing, jtidy

Asked by Dave

I'm using JTidy v. r938. I'm using this code to attempt to clean up a page …

final Tidy tidy = new Tidy();
tidy.setQuiet(false);
tidy.setShowWarnings(true);
tidy.setShowErrors(0);
tidy.setMakeClean(true);
Document document = tidy.parseDOM(conn.getInputStream(), null);

But when I parse this URL -- http://www.chicagoreader.com/chicago/EventSearch?narrowByDate=This+Week&eventCategory=93922&keywords=&page=1, things aren't getting cleaned up. For example, the META tags on the page, like

<META http-equiv="Content-Type" content="text/html; charset=UTF-8">

remain as

<META http-equiv="Content-Type" content="text/html; charset=UTF-8">

instead of having a "</META>" tag or appearing as "<META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>". I confirm this by outputting the resulting JTidy org.w3c.dom.Document as a String.
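
For reference, one common way to output an org.w3c.dom.Document as a String is the JDK's identity Transformer. The sketch below is an assumption about how such a check could be done, not the asker's actual code: the class and method names are hypothetical, and the DOM is built with the JDK parser as a stand-in for the one returned by tidy.parseDOM (whether JTidy's own DOM implementation behaves identically is not verified here). The output method is set to xml explicitly because, under the XSLT rules for the default output method, a root element named html can switch the serializer to HTML output, which writes empty elements such as META without a closing slash.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomToString {
    /** Serializes a DOM tree as XML text (hypothetical helper, not part of JTidy). */
    public static String toXmlString(Document document) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Force XML serialization so empty elements are written as <tag/>.
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the Document returned by tidy.parseDOM(...).
        String xml = "<html><head><META content=\"text/html\"/></head><body/></html>";
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        System.out.println(toXmlString(document));
    }
}
```

If the string produced this way still shows an unclosed META tag, the problem lies in the DOM itself rather than in how it was printed.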

What can I do to make JTidy truly clean up the page -- i.e. make it well-formed? I realize there are other tools out there, but this question specifically relates to using JTidy.

Answered by Paul Vargas

You need to set several flags on Tidy if you want XML output:

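import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import org.w3c.tidy.Tidy;

private String cleanData(String data) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setWraplen(Integer.MAX_VALUE); // effectively disable line wrapping
    tidy.setPrintBodyOnly(true);        // emit only the contents of <body>
    tidy.setXmlOut(true);               // write the output as well-formed XML
    tidy.setSmartIndent(true);
    ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    // parseDOM writes the cleaned markup to outputStream; the returned Document is ignored here
    tidy.parseDOM(inputStream, outputStream);
    return outputStream.toString("UTF-8");
}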

Or, if you simply want XHTML output:

Tidy tidy = new Tidy();
tidy.setXHTML(true);

Answered by Brian

Use tidy.setXmlTags(true); to parse the input as XML instead of HTML.

Answered by Adam Mackler

Use Tidy.setForceOutput(true) (at your own risk) to generate the output even if errors are found.

Answered by user3278204

I parse the HTML twice to get well-formed XML:

  BufferedReader br = new BufferedReader(new StringReader(str));
  StringWriter sw = new StringWriter();

  Tidy t = new Tidy();
  t.setDropEmptyParas(true);
  t.setShowWarnings(false); // suppress warning messages
  t.setQuiet(true); // suppress the summary and other non-essential output
  t.setUpperCaseAttrs(false);
  t.setUpperCaseTags(false);
  t.parse(br,sw);
  StringBuffer sb = sw.getBuffer();
  String strClean = sb.toString();
  br.close();
  sw.close();

  // second pass: re-parse the cleaned output as XML
  br = new BufferedReader(new StringReader(strClean));
  sw = new StringWriter();

  t = new Tidy();
  t.setXmlTags(true);
  t.parse(br,sw);
  sb = sw.getBuffer();
  String strClean2 = sb.toString();
  br.close();
  sw.close();
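
To confirm that a result such as strClean2 really is well-formed, it can be fed to an XML parser, which rejects anything that is not. The following is a minimal sketch using only the JDK; the class and method names are hypothetical, and the load-external-dtd feature is specific to the parser bundled with the JDK (it is set on the assumption that the cleaned output may carry an XHTML DOCTYPE, which would otherwise trigger a network fetch).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {
    /** Returns true if the markup parses as well-formed XML (hypothetical helper). */
    public static boolean isWellFormed(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: JDK-bundled parser; avoids downloading an external DTD.
            factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler keeps the parser from printing to stderr;
            // fatal (well-formedness) errors are still thrown.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String unclosed =
                "<head><META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></head>";
        String closed =
                "<head><META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"/></head>";
        System.out.println(isWellFormed(unclosed)); // false
        System.out.println(isWellFormed(closed));   // true
    }
}
```

In the two-pass code above, calling isWellFormed(strClean2) would be the corresponding check.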