Java: Proper usage of JTidy to purify HTML
Notice: this page is adapted from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/2547000/
Proper usage of JTidy to purify HTML
Asked by ragebiswas
I am trying to use JTidy (jtidy-r938.jar) to sanitize an input HTML string, but I seem to have problems getting the default settings right. Often strings such as "hello world" end up as "helloworld" after tidying. I wanted to show what I'm doing here, and any pointers would be really appreciated:
Assume that rawHtml is the String containing the input (real-world) HTML. This is what I'm doing:
Tidy tidy = new Tidy();
tidy.setPrintBodyOnly(true);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PrintStream ps = new PrintStream(baos);
tidy.parse(new StringReader(rawHtml), ps);
return baos.toString("UTF8");
First off, does anything look fundamentally wrong with the above code? I seem to be getting weird results with this.
For example, consider the following input:
<p class="MsoNormal" style="text-autospace:none;"><font color="black"><span style="color:black;">???</span></font><b><font color="#7f0055"><span style="color:#7f0055;font-weight:bold;">private</span></font></b><font color="black"><span style="color:black;"> String parseDescription</span></font><font>
The output is:
<p class="MsoNormal" style="text-autospace:none;"><font color=
"black"><span style="color:black;"> </span></font>
<b><font color="#7F0055"><span style=
"color:#7f0055;font-weight:bold;">private</span></font></b><font
color="black"><span style="color:black;">String
parseDescription</span></font></p>
So,
"private String parseDescription" becomes "privateString parseDescription"
Thanks in advance!
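The symptom described above (a space between two words disappearing after tidying) can be checked mechanically before and after a tidy pass. The following is a minimal sketch using only the Java standard library; the class and method names are hypothetical, and the tag stripping is a naive regular expression rather than a real HTML parser, so it ignores comments, scripts, and entities.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of JTidy: compares the visible text of two HTML fragments.
class WhitespaceLossCheck {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Naively strips tags, then collapses each whitespace run to a single space. */
    static String visibleText(String html) {
        String withoutTags = TAG.matcher(html).replaceAll("");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    /** True when both fragments yield the same visible text, word boundaries included. */
    static boolean sameVisibleText(String before, String after) {
        return visibleText(before).equals(visibleText(after));
    }

    public static void main(String[] args) {
        // Shortened variants of the markup from the question.
        String input = "<b><span>private</span></b><span> String parseDescription</span>";
        String rewrapped = "<b><span>private</span></b>\n<span> String\nparseDescription</span>";
        String spaceLost = "<b><span>private</span></b><span>String\nparseDescription</span>";

        System.out.println(sameVisibleText(input, rewrapped)); // only line breaks changed
        System.out.println(sameVisibleText(input, spaceLost)); // leading space was dropped
    }
}
```

A check like this only reports that a word boundary was lost; it does not explain why, and it does not replace comparing the rendered pages.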
Accepted answer by ragebiswas
Well, this seems to be a bug in JTidy. For the exact file that causes the problem, see:
http://sourceforge.net/tracker/?func=detail&aid=2985849&group_id=13153&atid=113153
Thanks for all the help folks!
Answer by Verhagen
Have a look at how JTidy is configured:
StringWriter writer = new StringWriter();
tidy.getConfiguration().printConfigOptions(writer, true);
System.out.println(writer.toString());
That may make it clear what causes the problem.
What exactly is weird? Could you perhaps add a small example of the actual and the expected output?
Answer by Slava Imeshev
Here is how we are calling JTidy from Ant. You may infer the API call from it:
<tidy destdir="${build.dir.result}">
<fileset dir="${src}" includes="**/*.htm"/>
<parameter name="tidy-mark" value="false"/>
<parameter name="output-xml" value="no"/>
<parameter name="numeric-entities" value="yes"/>
<parameter name="indent-spaces" value="2"/>
<parameter name="indent-attributes" value="no"/>
<parameter name="markup" value="yes"/>
<parameter name="wrap" value="2000"/>
<parameter name="uppercase-tags" value="no"/>
<parameter name="uppercase-attributes" value="no"/>
<parameter name="quiet" value="no"/>
<parameter name="clean" value="yes"/>
<parameter name="show-warnings" value="yes"/>
<parameter name="break-before-br" value="yes"/>
<parameter name="hide-comments" value="yes"/>
<parameter name="char-encoding" value="latin1"/>
<parameter name="output-html" value="yes"/>
</tidy>
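One way to carry the Ant parameters above over to Java is to collect them in a java.util.Properties object under the same option names. The sketch below does only that, with the standard library; the class and method names are hypothetical. Handing the result to JTidy (for example through Tidy.setConfigurationFromProps, which is assumed here from the JTidy API and should be verified against your version) appears only as a comment, so the block compiles without the JTidy jar. Whether any of these settings avoids the whitespace loss described in the question is not established here.

```java
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical helper: mirrors the <parameter> elements of the Ant task above.
class TidyAntOptions {
    private static final String[][] OPTIONS = {
        {"tidy-mark", "false"},
        {"output-xml", "no"},
        {"numeric-entities", "yes"},
        {"indent-spaces", "2"},
        {"indent-attributes", "no"},
        {"markup", "yes"},
        {"wrap", "2000"},
        {"uppercase-tags", "no"},
        {"uppercase-attributes", "no"},
        {"quiet", "no"},
        {"clean", "yes"},
        {"show-warnings", "yes"},
        {"break-before-br", "yes"},
        {"hide-comments", "yes"},
        {"char-encoding", "latin1"},
        {"output-html", "yes"},
    };

    /** Builds a Properties object with the same names and values as the Ant example. */
    static Properties fromAntExample() {
        Properties props = new Properties();
        for (String[] option : OPTIONS) {
            props.setProperty(option[0], option[1]);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = fromAntExample();
        // Assumption: with JTidy on the classpath, the options could be applied like this:
        //   Tidy tidy = new Tidy();
        //   tidy.setConfigurationFromProps(props);
        for (String name : new TreeSet<>(props.stringPropertyNames())) {
            System.out.println(name + " = " + props.getProperty(name));
        }
    }
}
```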

