HTML: What character encoding should I use for a web page containing mostly Arabic text? Is UTF-8 okay?
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/2996475/
Asked by Paul D. Waite
What character encoding should I use for a web page containing mostly Arabic text?
Is utf-8 okay?
Answered by JoeG
UTF-8 can store the full Unicode range, so it's fine to use for Arabic.
However, if you were wondering what encoding would be most efficient:
All Arabic characters can be encoded using a single UTF-16 code unit (2 bytes), but they may take either 2 or 3 UTF-8 code units (1 byte each), so if you were just encoding Arabic, UTF-16 would be a more space efficient option.
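To make the code-unit counts concrete, here is a minimal Python sketch. The helper name `code_units` and the sample characters are illustrative choices, not part of the original answer.

```python
def code_units(ch: str) -> tuple[int, int]:
    """Return (UTF-8 code units, UTF-16 code units) for a single character."""
    utf8_units = len(ch.encode("utf-8"))            # one byte per UTF-8 code unit
    utf16_units = len(ch.encode("utf-16-le")) // 2  # two bytes per UTF-16 code unit
    return utf8_units, utf16_units

samples = {
    "ARABIC LETTER BEH": "\u0628",                             # basic Arabic block
    "ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM": "\uFEFB",   # presentation form
    "LESS-THAN SIGN": "<",                                     # HTML markup character
}

for name, ch in samples.items():
    print(name, code_units(ch))
```

Letters from the basic Arabic block need two bytes in either encoding; it is the presentation forms and similar higher code points that need three bytes in UTF-8.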
However, you're not just encoding Arabic - you're encoding a significant number of characters that can be stored in a single byte in UTF-8, but take two bytes in UTF-16: all the HTML markup characters <, &, >, = and all the HTML element names.
It's a trade-off and, unless you're dealing with huge documents, it doesn't matter.
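The trade-off can be checked on a small page. The sketch below uses a made-up HTML snippet (the markup and the `greeting` class are assumptions for illustration) and only compares encoded sizes.

```python
# A tiny, hypothetical Arabic page: ASCII markup wrapped around Arabic text.
page = (
    '<!DOCTYPE html>\n'
    '<html lang="ar" dir="rtl">\n'
    '<head><meta charset="utf-8"><title>مرحبا</title></head>\n'
    '<body><p class="greeting">مرحبا بالعالم</p></body>\n'
    '</html>\n'
)

utf8_size = len(page.encode("utf-8"))
utf16_size = len(page.encode("utf-16-le"))  # little-endian, no byte-order mark

print("UTF-8 bytes: ", utf8_size)
print("UTF-16 bytes:", utf16_size)
```

Every ASCII markup character costs one byte in UTF-8 and two in UTF-16, so for a page like this one UTF-8 comes out smaller even though the visible text is Arabic.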
Answered by Maher4Ever
I develop mostly Arabic websites, and these are the two encodings I use:
1. Windows-1256
This is the most common encoding Arabic websites use. It works in most cases (90%) for Arabic users.
Here is one of the biggest Arabic web-development forums: http://traidnt.net/vb/. You can see that they are using this encoding.
The problem with this encoding is that if you are developing a website for international use, this encoding won't work with every user and they will see gibberish instead of the content.
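The gibberish problem can be reproduced in a few lines. This sketch assumes the reader's software falls back to a Western European encoding (Latin-1 here); the variable names are illustrative.

```python
text = "مرحبا"                   # "hello"

raw = text.encode("cp1256")      # bytes as a Windows-1256 page would send them
wrong = raw.decode("latin-1")    # reader wrongly assumes a Western European encoding
right = raw.decode("cp1256")     # reader uses the encoding the page was written in

print("misread:", wrong)
print("correct:", right)
```

The bytes are identical in both cases; only the assumed encoding differs, which is why the same page looks fine for one visitor and unreadable for another.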
2. UTF-8
This encoding solves the previous problem and also works in URLs. I mean, if you want to have Arabic words in your URL, you need them to be in utf-8 or it won't work.
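As an illustration of Arabic words in a URL, the sketch below percent-encodes a word with Python's standard `urllib.parse.quote`, which uses UTF-8 by default. The sample word is an arbitrary choice.

```python
from urllib.parse import quote, unquote

word = "مرحبا"

encoded = quote(word)                     # percent-encodes the UTF-8 bytes
legacy = quote(word, encoding="cp1256")   # same word, Windows-1256 bytes instead

print(encoded)            # %D9%85%D8%B1%D8%AD%D8%A8%D8%A7
print(legacy)
print(unquote(encoded))   # unquote() also assumes UTF-8 by default
```

The two percent-encoded forms are different byte sequences, so a server or browser that expects UTF-8 will not recover the word from the Windows-1256 form.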
The downside of this encoding is that if you save Arabic content to a database (e.g. MySQL) using it (so the database will also be encoded with utf-8), the Arabic text is going to take about double the space it would have taken if it were encoded with windows-1256 (in which case the database would be encoded with latin-1).
I suggest going with utf-8 if you can afford the size increase.
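The size increase mentioned above is easy to measure. This is a rough sketch on one sample string; real database storage adds its own overhead, which is not modelled here.

```python
text = "قاعدة بيانات"   # "database": Arabic letters plus one ASCII space

cp1256_size = len(text.encode("cp1256"))  # one byte per character
utf8_size = len(text.encode("utf-8"))     # two bytes per Arabic letter, one per space

print("characters:        ", len(text))
print("windows-1256 bytes:", cp1256_size)
print("utf-8 bytes:       ", utf8_size)
```

The ratio is just under 2 because spaces, digits and other ASCII characters still take a single byte in UTF-8.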
Answered by JUST MY correct OPINION
UTF-8 is fine, yes. It can encode any code point in the Unicode standard.
Edited to add
To make the answer more complete, your realistic choices are:
- UTF-8
- UTF-16
- UTF-32
Each comes with tradeoffs and advantages.
UTF-8
As Joe Gauterin points out, UTF-8 is very efficient for European texts but can get increasingly inefficient the "farther" from the Latin alphabet you get. If your text is all Arabic, each letter takes two bytes (occasionally three) in UTF-8, so the text will be roughly the same size as, and sometimes larger than, the equivalent text in UTF-16. This is rarely a problem in practice, however, in these days of cheap and plentiful RAM, unless you have a lot of text to deal with. More of a problem is that the variable length of the encoding makes some string operations difficult and slow. For example, you can't easily get the fifth Arabic character in a string because some characters might be 1 byte long (punctuation, say), while others are two or three. This makes actual processing of strings slow and error-prone.
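The "fifth character" problem can be sketched directly on UTF-8 bytes. The function `nth_char_utf8` below is a hypothetical helper written for this illustration; it assumes the input is valid UTF-8 and finds a character by scanning from the start.

```python
def nth_char_utf8(data: bytes, n: int) -> str:
    """Return the n-th (0-based) character of valid UTF-8 data by scanning."""
    if n < 0:
        raise IndexError("negative index not supported")
    index = -1
    start = None
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a character.
        if (byte & 0xC0) != 0x80:
            index += 1
            if index == n:
                start = pos
            elif index == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("character index out of range")
    return data[start:].decode("utf-8")

text = "ab مرحبا"
data = text.encode("utf-8")

print(nth_char_utf8(data, 4))   # fifth character, found by scanning
print(data[4:5])                # fifth byte: only part of an Arabic letter
```

Byte offset and character offset drift apart as soon as the first multi-byte character appears, so every lookup by position costs a scan unless the text is decoded first.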
On the other hand, UTF-8 is likely your best choice if you're doing a lot of mixed European/Arabic text. The more European text in your documents, the better the UTF-8 choice will be.
UTF-16
UTF-16 will give you better space efficiency than UTF-8 if you're using predominantly Arabic text. I don't know about the Arabic code points, however, so I don't know if you risk having variable-length encodings here. (My guess is that this is not an issue, however.) If you do, in fact, have variable-length encodings, all the string processing problems of UTF-8 apply here as well. If not, no problems.
On the other hand, if you have mixed European and Arabic texts, UTF-16 will be less space-efficient. Also, if you find yourself expanding to other scripts like, say, Chinese, you can end up back with variable-length forms and the associated problems, since characters outside the Basic Multilingual Plane take two UTF-16 code units.
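Whether UTF-16 turns variable-length depends on the code points involved, which a short check can show. The helper `utf16_units` and the sample characters are illustrative assumptions.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed for one character in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

print(utf16_units("\u0628"))      # ARABIC LETTER BEH, inside the BMP
print(utf16_units("\u4E2D"))      # common CJK ideograph, inside the BMP
print(utf16_units("\U00020000"))  # CJK Extension B ideograph, outside the BMP
print(utf16_units("\U0001EE00"))  # ARABIC MATHEMATICAL ALEF, outside the BMP
```

Everyday Arabic letters and common Chinese characters fit in one code unit each; it is the rarer characters beyond U+FFFF that need a surrogate pair.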
UTF-32
UTF-32 will basically double your space requirements compared with UTF-16. On the other hand, it's constant-sized for all known (and, likely, unknown ;) script forms. For raw string processing it's your fastest, best option without the problems that variable-length encoding will cause you. (This presupposes you have a string library that knows about 32-bit characters, naturally.)
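The constant-size property is what makes positional access simple. The sketch below uses a hypothetical helper, `nth_char_utf32`, that works on raw UTF-32 bytes and finds a character with plain arithmetic.

```python
def nth_char_utf32(data: bytes, n: int) -> str:
    """Return the n-th (0-based) character of little-endian UTF-32 data."""
    if not 0 <= n < len(data) // 4:
        raise IndexError("character index out of range")
    return data[4 * n : 4 * n + 4].decode("utf-32-le")  # fixed 4 bytes per character

# ASCII, Arabic, a BMP ideograph and a non-BMP ideograph in one string.
text = "a\u0628\u4E2D\U00020000"
utf32 = text.encode("utf-32-le")

print(len(text), "characters,", len(utf32), "bytes")
print(nth_char_utf32(utf32, 3))
```

No scanning is needed, at the price of four bytes for every character, including ASCII.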
Recommendation
My own recommendation is that you use UTF-8 as your external format (because everybody supports it) for storage, transmission, etc. unless you really see a size benefit with UTF-16. So any time you read a string from the outside world it would be UTF-8, and any time you send one to the outside world it, too, would be UTF-8. Within your software, though, unless you're in the habit of manipulating massive strings (in which case I'd recommend different data structures anyway!), I'd recommend using UTF-16 or UTF-32 instead (depending on whether there are any variable-length encoding issues in your UTF-16 data) for speed and simplicity of code.
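One way to picture this recommendation is to decode at the boundary, work on characters inside, and encode again on the way out. In the sketch below Python's built-in `str` stands in for the internal form (it is indexed by code point rather than being literally UTF-16 or UTF-32), and `truncate_utf8` is a hypothetical example function.

```python
def truncate_utf8(raw: bytes, max_chars: int) -> bytes:
    """Keep at most max_chars characters of UTF-8 input, returning UTF-8 output."""
    text = raw.decode("utf-8")        # outside world -> internal representation
    shortened = text[:max_chars]      # slicing counts characters, not bytes
    return shortened.encode("utf-8")  # internal representation -> outside world

incoming = "مرحبا بالعالم".encode("utf-8")   # e.g. bytes read from a file or socket
outgoing = truncate_utf8(incoming, 5)

print(outgoing.decode("utf-8"))
```

Slicing the raw bytes instead could cut an Arabic letter in half; doing the work on decoded text avoids that class of bug.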
Answered by marcgg
UTF-8 is the simplest way to go since it will work with almost everything:
UTF-8 can encode any Unicode character. Files in different languages can be displayed correctly without having to choose the correct code page or font. For instance Chinese and Arabic can be in the same text without special codes inserted to switch the encoding. (via wikipedia)
Of course keep in mind that:
UTF-8 often takes more space than an encoding made for one or a few languages. Latin letters with diacritics and characters from other alphabetic scripts typically take one byte per character in the appropriate multi-byte encoding but take two in UTF-8. East Asian scripts generally have two bytes per character in their multi-byte encodings yet take three bytes per character in UTF-8.
... but in most cases it's not a big issue. It would become one if you start handling huge documents.
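Both quoted points can be checked quickly. The sketch below compares per-character sizes in a language-specific encoding and in UTF-8, then shows mixed Chinese and Arabic text in one string; the chosen legacy encodings (Latin-1, Windows-1256, GBK) are examples picked for illustration.

```python
# (character, a legacy encoding that covers it)
samples = [
    ("\u00E9", "latin-1"),   # é, Latin letter with a diacritic
    ("\u0628", "cp1256"),    # Arabic letter beh
    ("\u4E2D", "gbk"),       # Chinese character
]

sizes = {ch: (len(ch.encode(enc)), len(ch.encode("utf-8"))) for ch, enc in samples}
for ch, (legacy_bytes, utf8_bytes) in sizes.items():
    print(ch, "legacy:", legacy_bytes, "utf-8:", utf8_bytes)

# Chinese and Arabic side by side: fine in UTF-8, impossible in one legacy code page.
mixed = "中文 and العربية"
print(len(mixed.encode("utf-8")), "bytes as UTF-8")
try:
    mixed.encode("cp1256")
except UnicodeEncodeError as err:
    print("windows-1256 cannot encode it:", err.reason)
```

The per-character cost is higher in UTF-8, but it is the only one of these encodings that can hold the mixed string at all.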
Answered by user2304302
UTF-8 often takes more space than an encoding made for one or a few languages. Latin letters with diacritics and characters from other alphabetic scripts typically take one byte per character in the appropriate multi-byte encoding but take two in UTF-8. East Asian scripts generally have two bytes per character in their multi-byte encodings yet take three bytes per character in UTF-8.