原文地址: http://stackoverflow.com/questions/1259409/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
UTF-8 or ISO-8859-1 in XML
Asked by Rob Nicholson
We have an application that takes a text string entered by a user into a web form and packages it in XML. Just to confuse matters a little, the XML is sent as the body of an Outlook email message.
Because the users can paste almost anything into the web form (typically from Word), the text string can contain non-ASCII (7 bit) characters such as those used for open and close double quotes.
The string is travelling intact via email but when we use the Microsoft XML parser, it complains (quite rightly) that the XML contains invalid characters.
A quick fix is to put encoding="iso-8859-1" in the header. However, I wonder if it would be better to encode the XML file in true UTF-8 format at the start as I've read articles that state it would be better for a more harmonious world if every XML document was encoded in UTF-8?
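To make the parser complaint concrete, here is a minimal Python sketch. Python's `xml.etree.ElementTree` stands in for the Microsoft XML parser, the `<note>` element is made up for illustration, and the assumption that the pasted quotes arrive as Windows-1252 bytes is a guess about the web form, not something stated above.

```python
import xml.etree.ElementTree as ET

text = "\u201cquoted\u201d"  # curly double quotes, as typically pasted from Word

# Assumption: the form delivers Windows-1252 bytes (0x93 / 0x94 for the quotes).
cp1252_doc = b'<?xml version="1.0"?><note>' + text.encode("cp1252") + b"</note>"


def parses(doc: bytes) -> bool:
    """Return True if the parser accepts the document as well-formed."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# With no encoding declared, the parser assumes UTF-8 and rejects byte 0x93.
undeclared_ok = parses(cp1252_doc)

# Encoding the same text as real UTF-8, and declaring it, parses cleanly.
utf8_doc = (
    b'<?xml version="1.0" encoding="UTF-8"?><note>'
    + text.encode("utf-8")
    + b"</note>"
)
roundtrip = ET.fromstring(utf8_doc).text

print(undeclared_ok, roundtrip == text)
```

The point of the sketch is only that the declared (or defaulted) encoding has to match the bytes actually written; which encoding the real application emits would need to be checked.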
But... are we going to have trouble as the XML document is actually being transferred via the body of an email message? I understand that UTF-8 is a variable-byte-length encoding system, which I assume uses 7-bit ASCII and escape characters to indicate "there is more data".
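On the "7-bit ASCII plus escape characters" assumption: UTF-8 does keep ASCII characters as single bytes, but every byte of a multi-byte sequence has its high bit set, so the encoded data is 8-bit rather than 7-bit. A small sketch to inspect this (the helper names `describe` and `high_bits_set` are made up here):

```python
def describe(ch: str) -> str:
    """Return the UTF-8 bytes of one character, written in binary."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))


def high_bits_set(ch: str) -> bool:
    """True if every UTF-8 byte of the character has its top bit set."""
    return all(b & 0x80 for b in ch.encode("utf-8"))


# An ASCII letter, a Latin-1 letter, and a left curly double quote.
for ch in ["A", "\u00e9", "\u201c"]:
    print(repr(ch), describe(ch), high_bits_set(ch))
```

So whether the email path is safe depends on it being 8-bit clean, or on a transfer encoding wrapping the bytes, rather than on UTF-8 itself staying within 7 bits.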
Another option is to set to UTF-8 but replace non-ASCII characters with the &#nnn; format.
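The `&#nnn;` option can be sketched in Python with the standard `xmlcharrefreplace` error handler, which writes each non-ASCII character as a decimal character reference. The `<note>` element and sample text are made up for illustration; markup characters such as `&` and `<` still have to be escaped separately, which `xml.sax.saxutils.escape` does here.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

text = "She said \u201chello\u201d"

# Escape &, < and > first, then turn every non-ASCII character
# into a decimal reference of the form &#NNNN;.
ascii_body = escape(text).encode("ascii", errors="xmlcharrefreplace")

doc = b'<?xml version="1.0" encoding="UTF-8"?><note>' + ascii_body + b"</note>"

# The parser resolves the references back to the original characters.
decoded = ET.fromstring(doc).text

print(ascii_body)
print(decoded == text)
```

The resulting document is pure ASCII on the wire, so it is also valid under a UTF-8 declaration, at the cost of a larger and less readable payload.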
Any advice on this rather complicated area would be appreciated.
Cheers, Rob.
Answered by hlovdal
Here from outside English-only-land{1}, I can confirm that UTF-8 works fine everywhere and has done so for many, many years. I have trouble remembering when any MTA last crippled emails by stripping off the 8th bit (leading to "inventions" like QP, which basically fixed the symptom rather than solving the problem). That most certainly happened during the mid-90s, although UTF-8 quickly gained popularity and replaced iso-8859-1. I do not remember when I switched, but I guess it was before 2000 at the latest.
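If a 7-bit-only hop is still a concern, a MIME transfer encoding such as QP can carry the UTF-8 bytes unchanged. Below is a minimal sketch using Python's standard `email` package; the addresses, subject, and `<note>` payload are placeholders, and nothing here reflects how Outlook or the original application builds its messages.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

xml_text = '<?xml version="1.0" encoding="UTF-8"?><note>\u201chello\u201d</note>'

msg = EmailMessage()
msg["Subject"] = "XML payload"        # placeholder header values
msg["From"] = "sender@example.com"
msg["To"] = "receiver@example.com"

# Send the XML as a UTF-8 text/xml body, wrapped in quoted-printable.
msg.set_content(xml_text, subtype="xml", charset="utf-8", cte="quoted-printable")

wire = bytes(msg)                      # the serialized message
is_7bit = all(b < 0x80 for b in wire)  # QP keeps the serialized form pure ASCII

# The receiving side undoes the transfer encoding and decodes the charset.
received = message_from_bytes(wire, policy=policy.default)
body = received.get_content()

print(is_7bit, body.strip() == xml_text)
```

The library appends a trailing newline to text bodies, hence the `strip()` before comparing.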
Speaking of iso-8859-1, it will not be able to cover all possible input from your users. Depending on the language, other iso-8859 variants might be needed (for instance for Finnish and Welsh), and even so the 8859 family does not support languages like Chinese. UTF-8, on the other hand, should cover everything, so I strongly recommend it over iso-8859-1.
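A quick way to see the coverage gap is to try encoding a few strings with different members of the 8859 family. This is a minimal sketch; the candidate list and the sample strings are picked here for illustration and are not from the answer.

```python
CANDIDATES = ["iso-8859-1", "iso-8859-14", "iso-8859-15", "utf-8"]


def encodings_for(text: str) -> list:
    """Return the candidate encodings that can represent every character."""
    usable = []
    for name in CANDIDATES:
        try:
            text.encode(name)
            usable.append(name)
        except UnicodeEncodeError:
            pass  # at least one character is missing from this encoding
    return usable


# French, Welsh, a letter used in Finnish loanwords, and Chinese.
for sample in ["caf\u00e9", "d\u0175r", "\u0161", "\u4e2d\u6587"]:
    print(sample, encodings_for(sample))
```

Each 8859 variant fixes some gaps and leaves others, while UTF-8 accepts every sample.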
{1} This might bias my experience since any program not fully supporting UTF-8 would be considered crap and tend not to be used here.
Answered by marc_s
I would probably try to use UTF-8 whenever possible - it just covers more ground and is more flexible than ISO-8859-1, which already chokes on e.g. Eastern European characters (try to write Jiří or something like that in ISO-8859-1 - it'll fail miserably).
So if you really want to attempt to change (which I applaud!), then I'd go UTF-8 and only fall back to ISO-8859-1 if you really can't make UTF-8 work.
Marc

