How do I make JTidy make HTML documents well-formed?
Source: http://stackoverflow.com/questions/10390922/
This page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) on Stack Overflow and keep the same license.
Asked by Dave
I'm using JTidy v. r938. I'm using this code to attempt to clean up a page …
final Tidy tidy = new Tidy();
tidy.setQuiet(false);
tidy.setShowWarnings(true);
tidy.setShowErrors(0);
tidy.setMakeClean(true);
Document document = tidy.parseDOM(conn.getInputStream(), null);
But when I parse this URL -- http://www.chicagoreader.com/chicago/EventSearch?narrowByDate=This+Week&eventCategory=93922&keywords=&page=1, things aren't getting cleaned up. For example, the META tags on the page, like
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
remain as
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
instead of having a "</META>" closing tag or appearing as "<META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>". I confirmed this by outputting the resulting JTidy org.w3c.dom.Document as a String.
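The question does not show how the Document was turned into a String. One common JDK-only way is a javax.xml.transform.Transformer, sketched below. The class name DomToString is hypothetical, and whether JTidy's DOM implementation works with the JDK transformer is an assumption worth testing. Note that the output method chosen here ("xml" versus "html") changes how empty elements are written, so the serialization step itself can influence what the result looks like.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of JTidy.
class DomToString {

    /** Serializes a DOM Document to a String using XML syntax. */
    static String serialize(Document document) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // "xml" asks for XML syntax; "html" would ask for HTML syntax instead.
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // A small stand-in document built with the JDK parser, not with JTidy.
        String xml = "<html><head><meta charset=\"UTF-8\"></meta></head></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        System.out.println(serialize(doc));
    }
}
```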
What can I do to make JTidy truly clean up the page, i.e. make it well-formed? I realize there are other tools out there, but this question specifically relates to using JTidy.
Answered by Paul Vargas
You need to set several flags on Tidy if you want XML output:
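import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import org.w3c.tidy.Tidy;

private String cleanData(String data) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    // Effectively disable line wrapping in the output.
    tidy.setWraplen(Integer.MAX_VALUE);
    // Emit only the contents of <body>, not the whole document.
    tidy.setPrintBodyOnly(true);
    // Write the result as XML.
    tidy.setXmlOut(true);
    tidy.setSmartIndent(true);
    ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    // parseDOM also pretty-prints the tidied markup to outputStream.
    tidy.parseDOM(inputStream, outputStream);
    return outputStream.toString("UTF-8");
}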
Or, if you simply want XHTML output:
Tidy tidy = new Tidy();
tidy.setXHTML(true);
Answered by Brian
Use tidy.setXmlTags(true); to make JTidy parse the input as XML instead of HTML.
Answered by Adam Mackler
Use Tidy.setForceOutput(true) (at your own risk) to generate the output even if errors are found.
Answered by user3278204
I parse the HTML twice to get well-formed XML:
// First pass: tidy the raw HTML held in str.
BufferedReader br = new BufferedReader(new StringReader(str));
StringWriter sw = new StringWriter();
Tidy t = new Tidy();
t.setDropEmptyParas(true);
t.setShowWarnings(false); // hide warnings
t.setQuiet(true);         // suppress nonessential output
t.setUpperCaseAttrs(false);
t.setUpperCaseTags(false);
t.parse(br, sw);
String strClean = sw.toString();
br.close();
sw.close();
// Second pass: parse the tidied markup again, this time as XML.
br = new BufferedReader(new StringReader(strClean));
sw = new StringWriter();
t = new Tidy();
t.setXmlTags(true);
t.parse(br, sw);
String strClean2 = sw.toString();
br.close();
sw.close();
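Whichever of the approaches above is used, it can help to check the result with a real XML parser rather than by eye. The following is a minimal sketch using only the JDK. The class and method names (WellFormedCheck, isWellFormed) are hypothetical and not part of JTidy, and the sample strings stand in for tidied output such as strClean2.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper for illustration; not part of JTidy.
class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Resolve every external entity (for example an XHTML DTD) to an
            // empty stream so the check never goes to the network.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            // Rethrow parse errors instead of letting the parser print them.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXParseException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String unclosed = "<head><META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></head>";
        String closed = "<head><META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"/></head>";
        System.out.println(isWellFormed(unclosed)); // false
        System.out.println(isWellFormed(closed));   // true
    }
}
```

Because the DTD is replaced by an empty stream, named HTML entities such as &nbsp; are undefined in this check and will be reported as not well-formed; numeric character references avoid that.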