php bullet "•" in XML

Question

Asked by TecBrat

Similar to this question, I am consuming an XML product that has some illegal chars in it. I seriously doubt I can get them to fix the problem, but I will try. In the meantime I'd like a work-around.

The problem is that it contains a bullet. It renders as "â€¢" in my source. I've tried a few encoding conversions but have not found a combination that works. (I'm not accustomed to even thinking about my encoding type, so I'm out of my element here.) So I tried the code below, and it seems that str_replace does not recognize the "•". (It renders as a tall block in my text editor.) You can see the commented lines where I tried a few different things.

I tried str_replace on "â€¢" first, then tweaked around, and this is my latest:

// deal with bullets in XML.
$bullet="•"; //this was copied and pasted from transliterated text.
//$data=iconv( "UTF-8", "windows-1252//TRANSLIT", $data ); //transliterate the text:
//$data=str_replace($bullet,'&#8226;',$data); // replace the bullet char
$data=str_replace($bullet,' - ',$data); // replace the bullet char
//$data=iconv( "windows-1252", "UTF-8", $data ); // return the text to utf-8 encoding.

Any ideas how to strip or replace this char? If there's a function to pre-clean the XML, that'd be great, and I wouldn't have to worry about it.

Answer 1

Answered by M8R-1jmw5r

XML by definition has no illegal chars. If some string contains a character that is not part of XML, then that string is not XML by definition.

The character you're concerned about is part of Unicode. As XML is based on Unicode, this is good news. So let's name what you aim for:

Unicode Character 'BULLET' (U+2022)

So you now say it renders as â€¢. Because U+2022 is encoded as 0xE2 0x80 0xA2 in UTF-8, it is a more or less safe assumption that you take a UTF-8 encoded string (that is the default encoding used in XML, btw) but command the software that renders it to treat it as some single-byte encoding, hence turning the single code point into three different characters: â, € and ¢.
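The byte-level effect described above can be reproduced with a minimal sketch. It assumes the iconv extension is available (the question already uses it) and picks Windows-1252 as the single-byte encoding; the variable names are illustrative only.

```php
<?php
// U+2022 BULLET written as its three raw UTF-8 bytes.
$bullet = "\xE2\x80\xA2";

echo bin2hex($bullet), "\n"; // e280a2
echo strlen($bullet), "\n";  // 3 (bytes, not characters)

// Simulate the mistake: treat those UTF-8 bytes as Windows-1252
// and convert them "back" to UTF-8. Each byte becomes its own character.
$mojibake = iconv('windows-1252', 'UTF-8', $bullet);

echo $mojibake, "\n"; // â€¢
```

Seeing â€¢ therefore suggests the data itself is intact UTF-8 and only the interpretation is wrong.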

Instead you need to command the rendering application to use the UTF-8 encoding. That should immediately solve your issue. So find the place where you introduce the wrong encoding; you will likely not need to re-encode the data, only to hint the encoding properly.
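If a replacement is still wanted as a stop-gap, one possible sketch is to match the bullet by its UTF-8 bytes instead of pasting the character into the source file, so the editor's own encoding no longer matters. The sample `$data` string below is hypothetical and stands in for the real feed, which is assumed to be UTF-8.

```php
<?php
// Hypothetical sample standing in for the XML feed (assumed UTF-8).
$data = "<item>\xE2\x80\xA2 First point</item>";

// Confirm the input really is valid UTF-8 before touching it.
// preg_match with the /u modifier fails on invalid UTF-8.
$isUtf8 = preg_match('//u', $data) === 1;

// The bullet as explicit bytes, independent of the source file's encoding.
$bullet = "\xE2\x80\xA2";

// Option 1: keep the bullet, but as a numeric character reference.
$asEntity = str_replace($bullet, '&#8226;', $data);

// Option 2: replace the bullet with a plain ASCII dash.
$asDash = str_replace($bullet, '-', $data);

echo $asEntity, "\n"; // <item>&#8226; First point</item>
echo $asDash, "\n";   // <item>- First point</item>
```

Because str_replace works on bytes, this only matches while `$data` is still UTF-8; converting it to another encoding first would change the bytes to look for.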

If you wonder which single-byte character-encodings have these three Unicode Characters at the corresponding bytes (0xE2 0x80 0xA2), here is a list. I have highlighted the most popular one of these:

ISO-8859-15 (Latin 9)
OEM 858 (Multilingual Latin I + Euro)
Windows 1252 (Latin I)
Windows 1254 (Turkish)
Windows 1256 (Arabic)
Windows 1258 (Vietnam)
