Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/599634/
Convert HTML Character Back to Text Using Java Standard Library
Asked by Cheok Yan Cheng
I would like to convert some HTML characters back to text using Java Standard Library. I was wondering whether any library would achieve my purpose?
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class Main {
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // "Happy & Sad" in HTML form.
        String s = "Happy &amp; Sad";
        System.out.println(s);
        try {
            // Intended to change it to "Happy & Sad". DOESN'T WORK!
            s = URLDecoder.decode(s, "UTF-8");
            System.out.println(s);
        } catch (UnsupportedEncodingException ex) {
            // Not expected: UTF-8 is supported on every Java platform.
        }
    }
}
Accepted answer by Bill.D
I think the Apache Commons Lang library's StringEscapeUtils.unescapeHtml3() and unescapeHtml4() methods are what you are looking for. See https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html.
Answer by rogeriopvl
I'm not aware of any way to do it using the standard library. But I do know and use this class that deals with html entities.
"HTMLEntities is an Open Source Java class that contains a collection of static methods (htmlentities, unhtmlentities, ...) to convert special and extended characters into HTML entities and vice versa."
http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=htmlentities
Answer by Zach Scrivena
java.net.URLDecoder deals only with the application/x-www-form-urlencoded MIME format (e.g. "%20" represents space), not with HTML character entities. I don't think there's anything on the Java platform for that. You could write your own utility class to do the conversion, like this one.
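As a rough illustration of the "write your own utility class" idea, here is a minimal sketch that uses only the standard library. The class name HtmlUnescaper and its unescape method are hypothetical (they do not come from any answer here), the code assumes Java 9 or later, and it handles only a handful of named entities plus numeric references, so it is not a complete HTML entity decoder.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not from the original answers: decodes a small subset of HTML entities. */
final class HtmlUnescaper {
    // Assumption: only these named entities are needed; the full HTML list is far longer.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "apos", "'", "nbsp", "\u00A0");

    // Matches &name;, decimal &#233; and hexadecimal &#xE9; references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    private HtmlUnescaper() {}

    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String decoded = decode(m.group(1));
            // Unknown or invalid references are copied through unchanged.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(decoded != null ? decoded : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String decode(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body); // null when the name is not in the table
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null; // number too large to be a code point
        }
    }

    public static void main(String[] args) {
        System.out.println(unescape("Happy &amp; Sad")); // prints: Happy & Sad
    }
}
```

The input is scanned in a single pass, so text such as "&amp;lt;" becomes "&lt;" rather than being unescaped twice. For anything beyond a few entities, one of the libraries mentioned in the other answers is likely the safer choice.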
Answer by Rich
The URL decoder should only be used for decoding strings from URLs generated by HTML forms, which use the "application/x-www-form-urlencoded" MIME type. It does not support HTML character entities.
After a search I found a Translate class within the HTML Parser library.
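To make the URLDecoder point concrete, the following small sketch shows what it does decode (form-encoded text) and what it leaves alone (HTML entities). The class name UrlDecoderDemo and the helper decodeForm are made up for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical demo class: shows that URLDecoder handles %XX and '+', not HTML entities. */
final class UrlDecoderDemo {
    static String decodeForm(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Form-encoded input: '+' becomes a space and %26 becomes '&'.
        System.out.println(decodeForm("Happy+%26+Sad"));   // prints: Happy & Sad
        // HTML entity input contains no '%' or '+', so it comes back unchanged.
        System.out.println(decodeForm("Happy &amp; Sad")); // prints: Happy &amp; Sad
    }
}
```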
Answer by jem
You just have to add the jsoup JAR file to your application's libraries and then use this code.
import org.jsoup.Jsoup;

public class Encoder {
    public static void main(String[] args) {
        // The entities are decoded into plain text: <Français>
        String s = Jsoup.parse("&lt;Fran&ccedil;ais&gt;").text();
        System.out.println(s);
    }
}
Link to download jsoup: http://jsoup.org/download
Answer by Daniele
As @jem suggested, it is possible to use jsoup.
With jsoup 1.8.3 it is possible to use the method Parser.unescapeEntities, which retains the original HTML.
import org.jsoup.parser.Parser;
...
String html = Parser.unescapeEntities(original_html, false);
It seems that this method is not present in some earlier releases.
Answer by Bruno Barros
You can use the class org.apache.commons.lang.StringEscapeUtils:
String s = StringEscapeUtils.unescapeHtml("Happy &amp; Sad");
It works.
Answer by Heriberto Gutiérrez Gutiérrez
Or you can use unescapeHtml4:
String miCadena = "GU&Iacute;A TELEF&Oacute;NICA";
System.out.println(StringEscapeUtils.unescapeHtml4(miCadena));
This code prints the line: GUÍA TELEFÓNICA