Original question: http://stackoverflow.com/questions/121511/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) on Stack Overflow.
Reading XML with an "&" into C# XMLDocument Object
Asked by Ryan Skarin
I have inherited a poorly written web application that seems to have errors when it tries to read in an xml document stored in the database that has an "&" in it. For example there will be a tag with the contents: "Prepaid & Charge". Is there some secret simple thing to do to have it not get an error parsing that character, or am I missing something obvious?
EDIT: Are there any other characters that will cause this same type of parser error for not being well formed?
Accepted answer by Joel Coehoorn
The problem is that the xml is not well-formed. Properly generated xml would list that data like this:
Prepaid &amp; Charge
I've had to fix the same problem before, and I did it with this regex:
Regex badAmpersand = new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)");
Combine that with a string constant defined like this:
const string goodAmpersand = "&amp;";
Now you can just say badAmpersand.Replace(<your input>, goodAmpersand);
Note that a simple String.Replace("&", "&amp;") isn't good enough, since you can't know in advance for a given document whether any & characters will be coded correctly, incorrectly, or even both in the same document. A blanket replace would turn an already correct &amp; into &amp;amp;.
The catches here are that you have to do this to your xml document before loading it into your parser, which likely means an extra pass through it. Also, it does not account for ampersands inside of a CDATA section. Finally, it only catches ampersands, not other illegal characters like <. Update: based on the comment, I need to update the expression for hex-coded (&#x...;) entities as well.
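As a rough sketch of how the pieces above fit together, the following program escapes stray ampersands and then loads the result into an XmlDocument. The class name AmpersandFixer and the sample input are made up for illustration. The third alternative in the lookahead, which covers hex-coded references, is my own assumption about what the update mentioned above might look like, not the answer author's expression.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper; the name is not from the original answer.
public static class AmpersandFixer
{
    // Matches an "&" that does not start a named entity (&amp;),
    // a decimal reference (&#38;) or a hex reference (&#x26;).
    // The hex alternative is an assumption added for this sketch.
    private static readonly Regex BadAmpersand =
        new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};|#[xX][0-9a-fA-F]{1,6};)");

    private const string GoodAmpersand = "&amp;";

    // Escapes only the stray ampersands and leaves existing entities alone.
    public static string Fix(string xml)
    {
        return BadAmpersand.Replace(xml, GoodAmpersand);
    }

    public static void Main()
    {
        // Mixed input: one raw "&", one correct "&amp;", one hex reference.
        string raw = "<item>Prepaid & Charge &amp; more &#x26; done</item>";

        var doc = new XmlDocument();
        doc.LoadXml(Fix(raw)); // would throw XmlException without Fix()
        Console.WriteLine(doc.DocumentElement?.InnerText);
    }
}
```

The same caveats still apply: this runs over the whole string before parsing, it would also rewrite a raw & inside a CDATA section, and it does nothing about other illegal characters such as <.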
Regarding which characters can cause problems, the actual rules are a little complex. For example, certain characters are allowed in data, but not as the first letter of an element name. And there's no simple list of illegal characters. Instead, a large (non-contiguous) swath of Unicode is defined as legal, and anything outside of that is illegal.
So when it comes down to it, you have to trust your document source to have at least a certain amount of compliance and consistency. For example, I've found that people are often smart enough to make sure the tags work properly and escape <, even if they don't know that & isn't allowed, hence your problem today. However, the best thing would be to get this fixed at the source.
Oh, and a note about the CDATA suggestion: I'd use that to make sure xml that I'm creating is well-formed, but when dealing with existing xml from outside, I find the regex method easier.
Answer by Steve g
Answer by Jim
The web application isn't at fault, the XML document is. Ampersands in XML should be encoded as &amp;. Failure to do so is a syntax error.
Edit: in answer to the follow-up question, yes, there are all kinds of similar errors. For example, unbalanced tags, unencoded less-than signs, unquoted attribute values, octets outside of the character encoding and various Unicode oddities, unrecognised entity references, and so on. In order to get any decent XML parser to consume a document, that document must be well-formed. The XML specification requires that a parser encountering a malformed document throw a fatal error.
Answer by ConroyP
There are several characters which will cause XML data to be reported as badly-formed.
From w3schools:
Characters like "<" and "&" are illegal in XML elements.
The best solution for input you can't trust to be XML-compliant is to wrap it in CDATA tags, e.g.
<![CDATA[This is my wonderful & great user text]]>
Everything within the <![CDATA[ and ]]> tags is ignored by the parser.
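For XML that you generate yourself in C#, a minimal sketch of the CDATA idea could look like the following. The class name CDataExample and the element name "comment" are invented for the example; CreateCDataSection is a standard XmlDocument method. The second helper shows the alternative of letting the DOM escape plain text for you, which is my addition for comparison rather than something this answer proposes.

```csharp
using System;
using System.Xml;

// Hypothetical helper; class and element names are illustrative only.
public static class CDataExample
{
    // Wraps untrusted text in a CDATA section inside a <comment> element.
    public static string WrapInCData(string userText)
    {
        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("comment");
        doc.AppendChild(root);
        root.AppendChild(doc.CreateCDataSection(userText));
        return doc.OuterXml;
    }

    // Alternative: store the text as a normal text node and let the
    // DOM escape "&" and "<" when the document is serialized.
    public static string WrapEscaped(string userText)
    {
        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("comment");
        doc.AppendChild(root);
        root.InnerText = userText;
        return doc.OuterXml;
    }

    public static void Main()
    {
        string text = "This is my wonderful & great user text";
        Console.WriteLine(WrapInCData(text));
        Console.WriteLine(WrapEscaped(text));
    }
}
```

Either way, building the document through the DOM rather than by string concatenation means the raw & never reaches the stored XML unescaped. This sketch does not handle text that itself contains the sequence ]]>.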
Answer by Chris Ingrassia
The other answers are all correct, and I concur with their advice, but let me just add one thing:
PLEASE do not make applications that work with non well-formed XML, it just makes the rest of our lives more difficult :).
Granted, there are times when you really just don't have a choice if you have no control over the other end, but you should really have it throwing a fatal error and complaining very loudly and explicitly about what is broken when such an event occurs.
You could probably take it one step further and say "Ack! This XML is broken in these places and for these reasons, here's how I tried to fix it to make it well-formed: ...".
I'm not overly familiar with the MSXML APIs, but most good XML parsers will allow you to install error handlers so that you can trap the exact line/column number where errors are appearing along with getting the error code and message.
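In .NET specifically, one way to get at that information is to catch XmlException, which exposes LineNumber and LinePosition properties. The sketch below is only an illustration of that idea; the class name XmlLoadCheck and the method TryLoad are hypothetical, and the exact message text comes from the framework, so it is not shown here.

```csharp
using System;
using System.Xml;

// Hypothetical helper; names are illustrative only.
public static class XmlLoadCheck
{
    // Returns true when the string parses as well-formed XML.
    // On failure, 'error' describes where and why parsing stopped.
    public static bool TryLoad(string xml, out string error)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            error = "";
            return true;
        }
        catch (XmlException ex)
        {
            error = "Malformed XML at line " + ex.LineNumber +
                    ", column " + ex.LinePosition + ": " + ex.Message;
            return false;
        }
    }

    public static void Main()
    {
        string error;
        if (!TryLoad("<item>Prepaid & Charge</item>", out error))
        {
            // Complain loudly instead of silently repairing the input.
            Console.Error.WriteLine(error);
        }
    }
}
```

Logging the position alongside the message makes it much easier to report the broken document back to whoever produces it.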
Answer by Robert Rossney
Your database doesn't contain XML documents. It contains some well-formed XML documents and some strings that look like XML to a human.
If it's at all possible, you should fix this - in particular, you should fix whatever process is generating the malformed XML documents. Fixing the program that reads data out of this database is just putting wallpaper over a crack in the wall.