PHP: Bullet "•" in XML
Original question: http://stackoverflow.com/questions/16020488/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Bullet "•" in XML
Asked by TecBrat
Similar to this question, I am consuming an XML product that has some illegal chars in it. I seriously doubt I can get them to fix the problem, but I will try. In the meantime I'd like a work-around.
The problem is that it contains a bullet. It renders as "â€¢" in my source. I've tried a few encoding conversions but have not found a combination that works. (I'm not accustomed to even thinking about my encoding type, so I'm out of my element here.) So I tried the code below, and it seems that str_replace does not recognize the "•" (it renders as a tall block in my text editor). You can see the commented lines where I tried a few different things.
I tried str_replace on "â€¢" first, then tweaked things, and this is my latest:
// deal with bullets in XML.
$bullet="•"; //this was copied and pasted from transliterated text.
//$data=iconv( "UTF-8", "windows-1252//TRANSLIT", $data ); //transliterate the text:
//$data=str_replace($bullet,'•',$data); // replace the bullet char
$data=str_replace($bullet,' - ',$data); // replace the bullet char
//$data=iconv( "windows-1252", "UTF-8", $data ); // return the text to utf-8 encoding.
Any ideas how to strip or replace this char? If there's a function to pre-clean the XML, that'd be great, and I wouldn't have to worry about it.
Answered by M8R-1jmw5r
XML by definition has no illegal chars. If some string contains a character that is not part of XML, then that string is not XML by definition.
The character you're concerned about is part of Unicode. As XML is based on Unicode, this is good news. So let's name what you aim for:
- Unicode Character 'BULLET' (U+2022)
You say it renders as â€¢. Because U+2022 is encoded as 0xE2 0x80 0xA2 in UTF-8, it is a fairly safe assumption that you have a UTF-8 encoded string (UTF-8 is the default encoding used in XML, by the way) but are telling the software that renders it to treat it as some single-byte encoding, which turns the single code point into three different characters:
- Unicode Character 'LATIN SMALL LETTER A WITH CIRCUMFLEX' (U+00E2)
- Unicode Character 'EURO SIGN' (U+20AC)
- Unicode Character 'CENT SIGN' (U+00A2)
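The three characters above can be reproduced with a small sketch. It is an illustration only, not code from the original post, and it assumes the misreading encoding is Windows-1252, which is just one candidate. The bullet is written as byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// U+2022 BULLET as its three UTF-8 bytes, written with escapes so the
// result does not depend on the encoding of this source file.
$bullet = "\xE2\x80\xA2";

// Assumption for illustration: treat each byte as a Windows-1252
// character and convert the result to UTF-8 for display.
$misread = iconv('Windows-1252', 'UTF-8', $bullet);

echo bin2hex($bullet), "\n";  // hex dump of the original bytes
echo strlen($bullet), "\n";   // byte length of the single code point
echo $misread, "\n";          // what a Windows-1252 reader would show
```

The data itself is unchanged in this scenario; only the interpretation of the bytes differs, which is why the three-character sequence shows up in place of one bullet.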
Instead, you need to tell the rendering application to use the UTF-8 encoding. That should immediately solve your issue. So find the place where the wrong encoding is introduced; you will likely not need to re-encode anything, just hint the encoding properly.
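If a replacement is still wanted as a stopgap, one possible sketch is to confirm that the data is valid UTF-8 and then match the bullet by its bytes instead of by a pasted character. The helper name `replaceBullet` and the sample string are hypothetical and not part of the original post; no iconv round trip is involved, in line with the point that re-encoding is probably unnecessary.

```php
<?php
// Hypothetical helper, not from the original post: replace U+2022 BULLET
// in a UTF-8 string, matching it by its bytes rather than a pasted glyph.
function replaceBullet(string $data, string $replacement = ' - '): string
{
    // preg_match with the /u modifier fails on input that is not valid
    // UTF-8, so an empty pattern doubles as a cheap encoding check.
    if (preg_match('//u', $data) !== 1) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return str_replace("\xE2\x80\xA2", $replacement, $data);
}

// Sample fragment made up for this example.
$data  = "<item>\xE2\x80\xA2first</item>";
$clean = replaceBullet($data);
echo $clean, "\n";

// When the text is later sent to a browser, the encoding can be declared
// explicitly, for example:
// header('Content-Type: text/html; charset=utf-8');
```

Writing the bullet as `"\xE2\x80\xA2"` keeps the match independent of the encoding the editor uses to save the PHP file, which is a plausible reason a pasted character would fail to match.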
If you wonder which single-byte character encodings have these three Unicode characters at the corresponding bytes (0xE2 0x80 0xA2), here is a list. The most popular of these is Windows 1252:
- ISO-8859-15 (Latin 9)
- OEM 858 (Multilingual Latin I + Euro)
- Windows 1252 (Latin I)
- Windows 1254 (Turkish)
- Windows 1256 (Arabic)
- Windows 1258 (Vietnam)