PHP XML parser error: Entity not defined

Question

Asked by NightHawk

I have searched stackoverflow on this problem and did find a few topics, but I feel like there isn't really a solid answer for me on this.

I have a form that users submit, and the field's value is stored in an XML file. The XML is set to be encoded with UTF-8.

Every now and then a user will copy/paste text from somewhere and that's when I get the "entity not defined error".

I realize XML only supports a select few entities and anything beyond that is not recognized - hence the parser error.
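
To make the failure concrete, here is a minimal sketch that feeds the same text to SimpleXML twice: once with the HTML-only named reference &nbsp; and once with the numeric reference &#160;. The sample strings and variable names are made up for illustration.

```php
<?php
// Collect libxml errors in a buffer instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical sample data: the same text with a named and a numeric reference.
$named   = '<note>10&nbsp;km</note>';
$numeric = '<note>10&#160;km</note>';

// &nbsp; is an HTML entity that XML does not predefine, so parsing fails.
$failed = simplexml_load_string($named);
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}
libxml_clear_errors();

// Numeric character references need no declaration, so this parses.
$parsed = simplexml_load_string($numeric);
var_dump($failed === false, $parsed !== false);
```

The only difference between the two inputs is the form of the reference, which is why the fixes discussed here all come down to getting rid of HTML-only entity names before the parser sees them.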

From what I gather, there's a few options I've seen:

1. I can find and replace all &nbsp; and swap them out with &#160; or an actual space.
2. I can place the code in question within a CDATA section.
3. I can include these entities within the XML file.

What I'm doing with the XML file is this: the user enters content into a form, it gets stored in an XML file, and that content then gets displayed as XHTML on a Web page (parsed with SimpleXML).

Of the three options, or any other option(s) I'm not aware of, what's really the best way to deal with these entities?

Thanks, Ryan

UPDATE

I want to thank everyone for the great feedback. I actually determined what caused my entity errors. All the suggestions made me look into it more deeply!

Some textboxes were plain old textboxes, but my textareas were enhanced with TinyMCE. It turns out, on closer inspection, that the PHP warnings always referenced data from the TinyMCE-enhanced textareas. Later I noticed that on a PC all the characters were stripped out (because it couldn't read them), but on a Mac you could see little square boxes referencing the Unicode number of that character. The reason it showed up as squares on a Mac in the first place is that I used utf8_encode to encode data that wasn't in UTF-8 to prevent other parsing errors (which is somehow also related to TinyMCE).

The solution to all this was quite simple:

I added this line, entity_encoding : "utf-8", in my tinyMCE.init. Now, all the characters show up the way they are supposed to.

I guess the only thing I don't understand is why the characters still show up when placed in textboxes, because nothing converts them to UTF, but with TinyMCE it was a problem.

Answer 1

Answered by Gaurav Arya

I agree that it is purely an encoding issue. In PHP, this is how I solved this problem:

Before passing the HTML fragment to the SimpleXMLElement constructor, I decoded it using html_entity_decode.
Then I further encoded it using utf8_encode().

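// Decode to ISO-8859-1 explicitly, because utf8_encode() expects ISO-8859-1 input.
$headerDoc = '<temp>' . utf8_encode(html_entity_decode($headerFragment, ENT_COMPAT, 'ISO-8859-1')) . '</temp>';
$xmlHeader = new SimpleXMLElement($headerDoc);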

Now the above code does not throw any undefined entity errors.
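
A variant of the same idea, sketched under a few assumptions: utf8_encode() treats its input as ISO-8859-1 and is deprecated as of PHP 8.2, while html_entity_decode() can emit UTF-8 directly. Decoding everything would also turn &amp; and &lt; into bare characters that break the XML, so the hypothetical helper below (decodeHtmlEntitiesForXml is not from the original answer) re-escapes whatever it decodes. The sample fragment is made up.

```php
<?php
// Hypothetical helper: turn HTML named entities into UTF-8 characters while
// keeping XML-significant characters (&, <, >, quotes) escaped.
function decodeHtmlEntitiesForXml(string $fragment): string
{
    return preg_replace_callback(
        '/&[A-Za-z][A-Za-z0-9]*;/',
        function (array $match): string {
            // Unknown names come back unchanged from html_entity_decode().
            $decoded = html_entity_decode($match[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            // Re-escape so that "&", "<" and unknown "&name;" stay XML-safe.
            return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
        },
        $fragment
    );
}

$headerFragment = 'Caf&eacute;&nbsp;&amp; bar'; // sample input for illustration
$headerDoc = '<temp>' . decodeHtmlEntitiesForXml($headerFragment) . '</temp>';
$xmlHeader = new SimpleXMLElement($headerDoc);
echo (string) $xmlHeader, "\n";
```

Note that an unrecognised name such as &foo; ends up as the literal text "&foo;" rather than raising a parser error, which may or may not be what you want.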

Answer 2

Answered by Tomalak

You could HTML-parse the text and have it re-escaped with the respective numeric entities only (like: &nbsp; → &#160;). In any case, simply using un-sanitized user input is a bad idea.

All of the numeric entities are allowed in XML; only the named ones known from HTML do not work (with the exception of &amp;, &quot;, &lt;, &gt;, &apos;).

Most of the time though, you can just write the actual character (&ouml; → ö) to the XML file, so there is no need to use an entity reference at all. If you are using a DOM API to manipulate your XML (and you should!) this is your safest bet.
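
As a rough sketch of the DOM route in PHP, assuming the user text is already valid UTF-8: the text goes in as a text node, and the serializer takes care of escaping. The element names (entries, entry) and the sample string are invented for this example.

```php
<?php
// Sample user input containing characters that are awkward in hand-built XML.
$userText = "Gr\u{00F6}\u{00DF}e < 5\u{00A0}cm & \"ok\"";

$doc  = new DOMDocument('1.0', 'UTF-8');
$root = $doc->createElement('entries');
$doc->appendChild($root);

// createTextNode() stores the raw characters; saveXML() escapes & and <.
$entry = $doc->createElement('entry');
$entry->appendChild($doc->createTextNode($userText));
$root->appendChild($entry);

$xml = $doc->saveXML();

// Reading it back with SimpleXML returns the original text unchanged.
$roundTrip = simplexml_load_string($xml);
var_dump((string) $roundTrip->entry === $userText);
```

Because no markup is ever assembled by string concatenation here, there is no place where an undeclared entity name could slip into the file.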

Finally (this is the lazy developer solution) you could build a broken XML file (i.e. not well-formed, with entity errors) and just pass it through tidy for the necessary fix-ups. This may work or may fail depending on just how broken the whole thing is. In my experience, tidy is pretty smart, though, and lets you get away with a lot.

Answer 3

Answered by LarsH

1. I can find and replace all [&nbsp;] and swap them out with [&#160;] or an actual space.

This is a robust method, but it requires you to have a table of all the HTML entities (I assume the pasted input is coming from HTML) and to parse the pasted text for entity references.

2. I can place the code in question within a CDATA section.

In other words disable parsing for the whole section? Then you would have to parse it some other way. Could work.

3. I can include these entities within the XML file.

You mean include the entity definitions? I think this is an easy and robust way, if you don't mind making the XML file quite a bit bigger. You could have an "included" file (find one on the web) which is an external entity, which you reference from the top of your main XML file.

One downside is that the XML parser you use has to be one that processes external entities (which not all parsers are required to do). And it must correctly resolve the (possibly relative) URL of the external entity to something accessible. This is not too bad but it may increase constraints on your processing tools.
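
For a feel of what option 3 looks like, here is a small sketch that declares two entities in an internal DTD subset instead of an external file, which sidesteps the URL-resolution concern. The document, element names and the tiny entity list are assumptions for illustration; a real list of HTML entities would be much longer.

```php
<?php
// Sample document that declares the HTML entity names it uses.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE content [
  <!ENTITY nbsp   "&#160;">
  <!ENTITY eacute "&#233;">
]>
<content><p>Caf&eacute;&nbsp;Central</p></content>
XML;

// LIBXML_NOENT asks libxml to substitute entity references with their values.
$content = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOENT);
echo (string) $content->p, "\n";
```

One caveat: LIBXML_NOENT also expands external entities, so it seems wise to use it only on documents whose DOCTYPE you wrote yourself.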

4. You could forbid non-XML in the pasted content. Among other things, this would disallow entity references that are not predefined in XML (the 5 that Tomalak mentioned) or defined in the content itself. However this may violate the requirements of the application, if users need to be able to paste HTML in there.

5. You could parse the pasted content as HTML into a DOM tree by setting someDiv.innerHTML = thePastedContent; In other words, create a div somewhere (probably display=none, except for debugging). Say you then have a JavaScript variable myDiv that holds this div element, and another variable myField that holds the element that is your input text field. Then in JavaScript you do

myDiv.innerHTML = myField.value;

which takes the unparsed text from myField, parses it into an HTML DOM tree, and sticks it into myDiv as HTML content.

Then you would use some browser-based method for serializing (= "de-parsing") the DOM tree back into XML. See for example this question. Then you send the result to the server as XML.

Whether you want to do this fix in the browser or on the server (as @Hannes suggested) will depend on the size of the data, how quick the response has to be, how beefy your server is, and whether you care about hackers sending not-well-formed XML on purpose.

Answer 4

Answered by Hannes

If you want to convert all characters, this may help you (I wrote it a while back) :

http://www.lautr.com/convert-all-applicable-characters-to-numeric-entities-for-use-in-xml

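function _convertAlphaEntitysToNumericEntitys($entity) {
  // Decode the named entity to its UTF-8 character.
  $char = html_entity_decode($entity[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
  // Leave unknown names and the five XML-predefined entities untouched.
  if ($char === $entity[0] || in_array($entity[1], array('amp', 'lt', 'gt', 'quot', 'apos'), true))
    return $entity[0];
  // mb_ord() (PHP 7.2+) returns the code point; ord() only returns the first byte.
  return '&#'.mb_ord($char, 'UTF-8').';';
}

$content = preg_replace_callback(
  '/&(\w+);/',
  '_convertAlphaEntitysToNumericEntitys',
  $content);

function _convertAsciOver127toNumericEntitys($entity) {
  // With the /u modifier, $entity[0] is one complete non-ASCII character.
  return '&#'.mb_ord($entity[0], 'UTF-8').';';
}

// Match every character outside the ASCII range (input must be valid UTF-8).
$content = preg_replace_callback(
  '/[^\x00-\x7F]/u',
  '_convertAsciOver127toNumericEntitys', $content);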
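
If the mbstring extension is available, the second step (turning every non-ASCII character into a numeric reference) can probably be left to a built-in. This is only a sketch: the conversion map below covers U+0080 through U+10FFFF, and the sample string is made up.

```php
<?php
// Sample UTF-8 input containing two non-ASCII characters.
$content = "Caf\u{00E9}\u{00A0}Central";

// Map entries are: start code point, end code point, offset, bit mask.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Every character inside the mapped range becomes a decimal &#NNN; reference.
$encoded = mb_encode_numericentity($content, $convmap, 'UTF-8');
echo $encoded, "\n";
```

The result is pure ASCII, so it survives even if a later step mislabels the file's encoding.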

Answer 5

Answered by HoldOffHunger

This question is a general problem for any language that parses XML or JSON (so, basically, every language).

The above answers are for PHP, but a Perl solution would be as easy as...

use HTML::Entities;

# Characters listed after the leading ^ are left unencoded.
my $excluderegex =
    '^\n\x20-\x20' .   # Don't Encode Newlines or Spaces
       '\x30-\x39' .   # Don't Encode Numbers
       '\x41-\x5a' .   # Don't Encode Capitalized Letters
       '\x61-\x7a' ;   # Don't Encode Lowercase Letters

# in case anything is already encoded
$value = HTML::Entities::decode_entities($value);

# encode properly to numeric
$value = HTML::Entities::encode_numeric($value, $excluderegex);
