Convert HTML Character Back to Text Using Java Standard Library

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/599634/

Time: 2020-08-11 16:41:32  Source: igfitidea

Tags: java, html, html-entities

Asked by Cheok Yan Cheng

I would like to convert some HTML characters back to text using Java Standard Library. I was wondering whether any library would achieve my purpose?

import java.io.UnsupportedEncodingException;

public class Main {
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // "Happy & Sad" in HTML form.
        String s = "Happy &amp; Sad";
        System.out.println(s);

        try {
            // Change to "Happy & Sad". DOESN'T WORK!
            s = java.net.URLDecoder.decode(s, "UTF-8");
            System.out.println(s);
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always supported, so this branch is not expected to run.
        }
    }
}

Accepted answer by Bill.D

I think the Apache Commons Lang library's StringEscapeUtils.unescapeHtml3() and unescapeHtml4() methods are what you are looking for. The same class is also available in Apache Commons Text, which is what the link documents. See https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html.

Answered by rogeriopvl

I'm not aware of any way to do it using the standard library. But I do know and use this class that deals with html entities.

"HTMLEntities is an Open Source Java class that contains a collection of static methods (htmlentities, unhtmlentities, ...) to convert special and extended characters into HTML entitities and vice versa."

http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=htmlentities

Answered by Zach Scrivena

java.net.URLDecoder deals only with the application/x-www-form-urlencoded MIME format (e.g. "%20" represents a space), not with HTML character entities. I don't think there's anything on the Java platform for that. You could write your own utility class to do the conversion (the original answer links to one such class).
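A hand-rolled utility of the kind described above might look like the following minimal sketch. It is not the class the answer linked to: the name HtmlEntityDecoder, its unescape method, and the small table of named entities are all assumptions made for illustration, and it assumes Java 9 or later. It covers only a handful of named entities plus decimal and hexadecimal numeric references, and leaves anything it does not recognize unchanged.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the class and method names are illustrative only.
final class HtmlEntityDecoder {

    // Only a few named entities are covered here; HTML defines many more.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&",
            "lt", "<",
            "gt", ">",
            "quot", "\"",
            "apos", "'",
            "nbsp", "\u00A0");

    // Matches &name;, &#123; and &#x7B; style references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);");

    // Replaces every recognized entity in a single pass over the input.
    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String decoded = decode(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or the original entity if it is not recognized.
    private static String decode(String body, String original) {
        if (body.charAt(0) != '#') {
            return NAMED.getOrDefault(body, original);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = Integer.parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : original;
        } catch (NumberFormatException e) {
            return original; // number too large to be a code point
        }
    }

    public static void main(String[] args) {
        System.out.println(unescape("Happy &amp; Sad"));
        System.out.println(unescape("&lt;Fran&#231;ais&gt;"));
    }
}
```

Because the input is scanned only once, a doubly escaped string such as "&amp;lt;" becomes "&lt;" rather than "<". For full coverage of named entities, one of the libraries mentioned in the other answers is likely a better fit than extending this table by hand.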

Answered by Rich

The URL decoder should only be used to decode strings from URLs generated by HTML forms, which use the "application/x-www-form-urlencoded" MIME type. It does not handle HTML character entities.
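To make the difference concrete, here is a small sketch of what URLDecoder does and does not decode. The class name UrlDecoderDemo and the helper decodeForm are made up for this illustration, and the Charset overload of URLDecoder.decode used here assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the names are illustrative only.
final class UrlDecoderDemo {

    // Decodes application/x-www-form-urlencoded text as UTF-8.
    static String decodeForm(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Percent escapes and '+' are decoded.
        System.out.println(decodeForm("Happy%20%26+Sad"));
        // The HTML entity is not form encoding, so it passes through unchanged.
        System.out.println(decodeForm("Happy &amp; Sad"));
    }
}
```

The first call yields "Happy & Sad", while the second returns its input unchanged, which is the behavior the question ran into.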

After a search, I found a Translate class within the HTML Parser library.

Answered by jem

You just have to add the jsoup jar file to your application's lib folder and then use this code.

import org.jsoup.Jsoup;

public class Encoder {
    public static void main(String[] args) {
        // The input is HTML-escaped text; text() returns it with entities decoded.
        String s = Jsoup.parse("&lt;Fran&ccedil;ais&gt;").text();
        System.out.print(s);
    }
}

Link to download jsoup: http://jsoup.org/download

Answered by Daniele

As @jem suggested, it is possible to use jsoup.

With jsoup 1.8.3 it is possible to use the method Parser.unescapeEntities, which retains the original HTML.

import org.jsoup.parser.Parser;
...
String html = Parser.unescapeEntities(original_html, false);

It seems that this method is not present in some earlier releases.

Answered by Bruno Barros

You can use the class org.apache.commons.lang.StringEscapeUtils:

String s = StringEscapeUtils.unescapeHtml("Happy &amp; Sad");

It works.

Answered by Heriberto Gutiérrez Gutiérrez

Or you can use unescapeHtml4:

    String miCadena = "GU&Iacute;A TELEF&Oacute;NICA";
    System.out.println(StringEscapeUtils.unescapeHtml4(miCadena));

This code prints the line: GUÍA TELEFÓNICA
