Notice: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/576095/

Time: 2020-08-28 23:18:45  Source: igfitidea

How can I decode HTML entities?

Tags: html, perl, ascii, special-characters

Asked by Frank

Here's a quick Perl question:

How can I convert HTML special characters like &uuml; or &#39; to normal ASCII text?

I started with something like this:

s/\&#(\d+);/chr($1)/eg;

and could write it for all HTML characters, but some function like this probably already exists?

Note that I don't need a full HTML->Text converter. I already parse the HTML with the HTML::Parser. I just need to convert the text with the special chars I'm getting.

Answered by Telemachus

Take a look at HTML::Entities:

use HTML::Entities;

my $html = "Snoopy &amp; Charlie Brown";

print decode_entities($html), "\n";

You can guess the output.

Answered by Mark Fowler

The above answers tell you how to decode the entities into Perl strings, but you also asked how to change those into ASCII.

Assuming that this is really what you want and you don't want all the Unicode characters, you can look at the Text::Unidecode module from CPAN to zap all those odd characters back into a roughly similar collection of ASCII characters:

use Text::Unidecode qw(unidecode);
use HTML::Entities qw(decode_entities);

my $source = '&#21271;&#20144;';
print unidecode(decode_entities($source));

# That prints: Bei Jing 

Answered by ysth

Note that there are hex-specified characters too. They look like this: &#xE9; (é).
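
As a rough illustration of what handling both numeric forms by hand involves, here is a minimal sketch using only core Perl. The helper name decode_numeric_entities is hypothetical, not part of any module, and the sketch assumes well-formed references with in-range code points; named entities are left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): decode only numeric
# character references, both decimal (&#233;) and hex (&#xE9;).
# One pass over the string, so text produced by a replacement is
# never decoded a second time.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#(?:[xX]([0-9a-fA-F]+)|(\d+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

binmode(STDOUT, ':encoding(UTF-8)');
# prints: café café 'quoted'
print decode_numeric_entities('caf&#xE9; caf&#233; &#39;quoted&#39;'), "\n";
```

Overflowing or invalid code points are not handled here, which is one reason a maintained module is usually the safer choice.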

Use HTML::Entities' decode_entities to translate the entities into actual characters. Converting the result to ASCII requires more work. I've had some success in the past using iconv (Perl interface: Text::Iconv) with the transliterate option on. But if you are dealing with a limited set of entities, or you don't actually need the text reduced to ASCII equivalents, you may be better off limiting what decode_entities produces or providing it with custom conversion maps. See the HTML::Entities documentation.

Answered by Bevan

There are a handful of predefined HTML entities - &amp; &quot; &gt; and so on - that you could hard code.
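
Hard coding that handful might look roughly like the sketch below. The table is an assumption for illustration: it lists the five entities predefined in XML, while HTML defines many more names, and the helper name decode_predefined is hypothetical.

```perl
use strict;
use warnings;

# Illustrative table: the five entities predefined in XML.
# HTML defines many more named entities than these.
my %PREDEFINED = (
    amp  => '&',
    lt   => '<',
    gt   => '>',
    quot => '"',
    apos => "'",
);

# Hypothetical helper: replace only the names listed above, in a single
# pass so that "&amp;lt;" becomes "&lt;" rather than "<".
sub decode_predefined {
    my ($text) = @_;
    $text =~ s/&(amp|lt|gt|quot|apos);/$PREDEFINED{$1}/g;
    return $text;
}

# prints: Snoopy & Charlie Brown <3
print decode_predefined('Snoopy &amp; Charlie Brown &lt;3'), "\n";
```

Any entity outside the table, such as &uuml;, passes through unchanged, so this only suits input known to use a small, fixed set.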

However, the larger case of numeric entities - &#123; - is going to be much harder, as those values are Unicode, and conversion to ASCII is going to range from difficult to impossible.
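
To make the "difficult to impossible" range concrete, here is a sketch of a best-effort ASCII fold using the core Unicode::Normalize module. This technique is swapped in for illustration and is not one the answers mention; the helper name fold_to_ascii is hypothetical. It decomposes characters, drops combining marks, and replaces whatever remains outside ASCII with "?".

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: best-effort ASCII folding, for illustration only.
sub fold_to_ascii {
    my ($text) = @_;
    my $decomposed = NFKD($text);        # e.g. "é" becomes "e" + combining acute
    $decomposed =~ s/\p{Mn}+//g;         # drop the combining marks
    $decomposed =~ s/[^\x00-\x7F]/?/g;   # whatever is left has no ASCII form here
    return $decomposed;
}

print fold_to_ascii("\x{FC}ber caf\x{E9}"), "\n";   # accented Latin folds: uber cafe
print fold_to_ascii("\x{5317}\x{4EB0}"), "\n";      # CJK does not: ??
```

Accented Latin letters fold cleanly, but characters with no decomposition, such as CJK ideographs, are simply lost, which is where a transliteration table becomes necessary.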