Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/157646/

Time: 2020-09-03 10:11:58  Source: igfitidea

Best way to encode text data for XML

.net xml encoding .net-2.0

Asked by Joel Coehoorn

I was looking for a generic method in .Net to encode a string for use in an Xml element or attribute, and was surprised when I didn't immediately find one. So, before I go too much further, could I just be missing the built-in function?

Assuming for a moment that it really doesn't exist, I'm putting together my own generic EncodeForXml(string data) method, and I'm thinking about the best way to do this.

The data I'm using that prompted this whole thing could contain bad characters like &, <, ", etc. It could also contain, on occasion, the properly escaped entities: &amp;, &lt;, and &quot;, which means just using a CDATA section may not be the best idea. That seems kinda clunky anyway; I'd much rather end up with a nice string value that can be used directly in the xml.

I've used a regular expression in the past to just catch bad ampersands, and I'm thinking of using it to catch them in this case as well as the first step, and then doing a simple replace for other characters.

So, could this be optimized further without making it too complex, and is there anything I'm missing?

Function EncodeForXml(ByVal data As String) As String
    Static badAmpersand As new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)")

    data = badAmpersand.Replace(data, "&amp;")

    return data.Replace("<", "&lt;").Replace("""", "&quot;").Replace(">", "&gt;")
End Function

Sorry to all you C#-only folks. I don't really care which language I use, but I wanted to make the Regex static, and you can't do that in C# without declaring it outside the method, so this will be VB.Net.

Finally, we're still on .Net 2.0 where I work, but if someone could take the final product and turn it into an extension method for the string class, that'd be pretty cool too.

Update: The first few responses indicate that .Net does indeed have built-in ways of doing this. But now that I've started, I kind of want to finish my EncodeForXml() method just for the fun of it, so I'm still looking for ideas for improvement. Notably: a more complete list of characters that should be encoded as entities (perhaps stored in a list/map), and something that gets better performance than doing a .Replace() on immutable strings in serial.
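
As a rough illustration of the two ideas in the update (an entity map plus a single pass instead of serial .Replace() calls), here is a minimal C# sketch. The class name XmlEncodeSketch is hypothetical, and unlike the regex approach in the question it assumes the input contains no pre-escaped entities: every & is escaped, so an existing "&amp;" would become "&amp;amp;".

```csharp
using System.Collections.Generic;
using System.Text;

public static class XmlEncodeSketch
{
    // The five entities predefined by the XML specification.
    private static readonly Dictionary<char, string> Entities =
        new Dictionary<char, string>
        {
            { '&', "&amp;" }, { '<', "&lt;" }, { '>', "&gt;" },
            { '"', "&quot;" }, { '\'', "&apos;" }
        };

    // Hypothetical helper: walks the input once and appends either the
    // character or its entity to a StringBuilder.
    // Assumption: input is NOT already escaped; every '&' is encoded.
    public static string EncodeForXml(string data)
    {
        if (string.IsNullOrEmpty(data)) return data;

        var sb = new StringBuilder(data.Length + 16);
        foreach (char c in data)
        {
            string entity;
            if (Entities.TryGetValue(c, out entity))
                sb.Append(entity);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Calling XmlEncodeSketch.EncodeForXml("a < b & \"c\"") should yield a &lt; b &amp; &quot;c&quot;. This sketch does not filter characters that are illegal in XML.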

Accepted answer by MusiGenesis

System.Xml handles the encoding for you, so you don't need a method like this.
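
To make that concrete, here is a small sketch of letting an XmlWriter do the escaping for both an attribute value and element text. The class, method, element, and attribute names (XmlWriterSketch, BuildItem, "item", "label") are made up for illustration.

```csharp
using System.Text;
using System.Xml;

public static class XmlWriterSketch
{
    // Hypothetical helper: builds <item label="...">...</item> and relies
    // on XmlWriter to escape the raw strings passed in.
    public static string BuildItem(string attributeValue, string text)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("item");
            writer.WriteAttributeString("label", attributeValue); // escaped as an attribute value
            writer.WriteString(text);                             // escaped as element text
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

For example, BuildItem("say \"hi\"", "a < b & c") should produce an element whose text reads a &lt; b &amp; c and whose attribute quotes are written as &quot;.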

Answer by Michael Kropat

Depending on how much you know about the input, you may have to take into account that not all Unicode characters are valid XML characters.

Both Server.HtmlEncode and System.Security.SecurityElement.Escape seem to ignore illegal XML characters, while System.Xml.XmlWriter.WriteString throws an ArgumentException when it encounters illegal characters (unless you disable that check, in which case it ignores them). An overview of library functions is available here.

Edit 2011/8/14: seeing that at least a few people have consulted this answer in the last couple of years, I decided to completely rewrite the original code, which had numerous issues, including horribly mishandling UTF-16.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

/// <summary>
/// Encodes data so that it can be safely embedded as text in XML documents.
/// </summary>
public class XmlTextEncoder : TextReader {
    public static string Encode(string s) {
        using (var stream = new StringReader(s))
        using (var encoder = new XmlTextEncoder(stream)) {
            return encoder.ReadToEnd();
        }
    }

    /// <param name="source">The data to be encoded in UTF-16 format.</param>
    /// <param name="filterIllegalChars">It is illegal to encode certain
    /// characters in XML. If true, silently omit these characters from the
    /// output; if false, throw an error when encountered.</param>
    public XmlTextEncoder(TextReader source, bool filterIllegalChars=true) {
        _source = source;
        _filterIllegalChars = filterIllegalChars;
    }

    readonly Queue<char> _buf = new Queue<char>();
    readonly bool _filterIllegalChars;
    readonly TextReader _source;

    public override int Peek() {
        PopulateBuffer();
        if (_buf.Count == 0) return -1;
        return _buf.Peek();
    }

    public override int Read() {
        PopulateBuffer();
        if (_buf.Count == 0) return -1;
        return _buf.Dequeue();
    }

    void PopulateBuffer() {
        const int endSentinel = -1;
        while (_buf.Count == 0 && _source.Peek() != endSentinel) {
            // Strings in .NET are assumed to be UTF-16 encoded [1].
            var c = (char) _source.Read();
            if (Entities.ContainsKey(c)) {
                // Encode all entities defined in the XML spec [2].
                foreach (var i in Entities[c]) _buf.Enqueue(i);
            } else if (!(0x0 <= c && c <= 0x8) &&
                       !new[] { 0xB, 0xC }.Contains(c) &&
                       !(0xE <= c && c <= 0x1F) &&
                       !(0x7F <= c && c <= 0x84) &&
                       !(0x86 <= c && c <= 0x9F) &&
                       !(0xD800 <= c && c <= 0xDFFF) &&
                       !new[] { 0xFFFE, 0xFFFF }.Contains(c)) {
                // Allow if the Unicode codepoint is legal in XML [3].
                _buf.Enqueue(c);
            } else if (char.IsHighSurrogate(c) &&
                       _source.Peek() != endSentinel &&
                       char.IsLowSurrogate((char) _source.Peek())) {
                // Allow well-formed surrogate pairs [1].
                _buf.Enqueue(c);
                _buf.Enqueue((char) _source.Read());
            } else if (!_filterIllegalChars) {
                // Note that we cannot encode illegal characters as entity
                // references due to the "Legal Character" constraint of
                // XML [4]. Nor are they allowed in CDATA sections [5].
                throw new ArgumentException(
                    String.Format("Illegal character: '{0:X}'", (int) c));
            }
        }
    }

    static readonly Dictionary<char,string> Entities =
        new Dictionary<char,string> {
            { '"', "&quot;" }, { '&', "&amp;"}, { '\'', "&apos;" },
            { '<', "&lt;" }, { '>', "&gt;" },
        };

    // References:
    // [1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
    // [2] http://www.w3.org/TR/xml11/#sec-predefined-ent
    // [3] http://www.w3.org/TR/xml11/#charsets
    // [4] http://www.w3.org/TR/xml11/#sec-references
    // [5] http://www.w3.org/TR/xml11/#sec-cdata-sect
}

Unit tests and full code can be found here.

Answer by workmad3

SecurityElement.Escape

documented here
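
A minimal usage sketch of that call follows; the wrapper class SecurityElementSketch is hypothetical and only exists to keep the example self-contained.

```csharp
using System.Security;

public static class SecurityElementSketch
{
    // Thin wrapper around System.Security.SecurityElement.Escape, which
    // replaces <, >, &, " and ' with the corresponding XML entities.
    public static string Escape(string text)
    {
        return SecurityElement.Escape(text);
    }
}
```

For example, SecurityElementSketch.Escape("I <want> to & \"quote\"") should return I &lt;want&gt; to &amp; &quot;quote&quot;.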

Answer by Kilhoffer

In the past I have used HttpUtility.HtmlEncode to encode text for XML. It performs the same task, really. I haven't run into any issues with it yet, but that's not to say I won't in the future. As the name implies, it was made for HTML, not XML.

You've probably already read it, but here is an article on XML encoding and decoding.

EDIT: Of course, if you use an XmlWriter or one of the new XElement classes, this encoding is done for you. In fact, you could just take the text, place it in a new XElement instance, then return the string (.ToString()) version of the element. I've heard that SecurityElement.Escape will perform the same task as your utility method as well, but haven't read much about it or used it.

EDIT2: Disregard my comment about XElement, since you're still on 2.0

Answer by Luke Quinane

Microsoft's AntiXss library (the AntiXssEncoder class in System.Web.dll) has methods for this:

AntiXss.XmlEncode(string s)
AntiXss.XmlAttributeEncode(string s)

It has HTML equivalents as well:

AntiXss.HtmlEncode(string s)
AntiXss.HtmlAttributeEncode(string s)

Answer by Ronnie Overby

In .NET 3.5+:

new XText("I <want> to & encode this for XML").ToString();

Gives you:

I &lt;want&gt; to &amp; encode this for XML

Turns out that this method doesn't encode some things that it should (like quotes).

SecurityElement.Escape (workmad3's answer) seems to do a better job with this, and it's included in earlier versions of .NET.

If you don't mind 3rd party code and want to ensure no illegal characters make it into your XML, I would recommend Michael Kropat's answer.

Answer by GSerg

XmlTextWriter.WriteString() does the escaping.
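
A short sketch of using it to escape a bare string, with no surrounding element, might look like this. The class and method names (XmlTextWriterSketch, EscapeText) are hypothetical.

```csharp
using System.IO;
using System.Xml;

public static class XmlTextWriterSketch
{
    // Hypothetical helper: writes the text through XmlTextWriter.WriteString
    // into an in-memory StringWriter and returns the escaped result.
    public static string EscapeText(string text)
    {
        using (var sw = new StringWriter())
        using (var writer = new XmlTextWriter(sw))
        {
            writer.WriteString(text);
            writer.Flush();
            return sw.ToString();
        }
    }
}
```

For example, EscapeText("a < b & c") should return a &lt; b &amp; c. Note that this escapes for element text; quotes inside attribute values are a separate case handled by WriteAttributeString.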

Answer by Dscoduc

This might be the case where you could benefit from using the WriteCData method.

public override void WriteCData(string text)
    Member of System.Xml.XmlTextWriter

Summary:
Writes out a <![CDATA[...]]> block containing the specified text.

Parameters:
text: Text to place inside the CDATA block.

A simple example would look like the following:

writer.WriteStartElement("name");
writer.WriteCData("<unsafe characters>");
writer.WriteFullEndElement();

The result looks like:

<name><![CDATA[<unsafe characters>]]></name>

When reading the node values the XMLReader automatically strips out the CData part of the innertext so you don't have to worry about it. The only catch is that you have to store the data as an innerText value to an XML node. In other words, you can't insert CData content into an attribute value.
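
To illustrate the read side, here is a small sketch that reads the element back with an XmlReader and returns its content without the CDATA wrapper. The class and method names (CDataReadSketch, ReadRootText) are made up for this example.

```csharp
using System.IO;
using System.Xml;

public static class CDataReadSketch
{
    // Hypothetical helper: returns the text content of the root element.
    // ReadElementContentAsString concatenates text and CDATA content,
    // so the <![CDATA[ ... ]]> markers do not appear in the result.
    public static string ReadRootText(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }
}
```

For example, ReadRootText("<name><![CDATA[<unsafe characters>]]></name>") should return <unsafe characters>.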

Answer by Kev

If this is an ASP.NET app why not use Server.HtmlEncode() ?

Answer by Granger

If you're serious about handling all of the invalid characters (not just the few "html" ones), and you have access to System.Xml, here's the simplest way to do proper Xml encoding of value data:

string theTextToEscape = "Something \x1d else \x1D <script>alert('123');</script>";
var x = new XmlDocument();
x.LoadXml("<r/>"); // simple, empty root element
x.DocumentElement.InnerText = theTextToEscape; // put in raw string
string escapedText = x.DocumentElement.InnerXml; // Returns:  Something &#x1D; else &#x1D; &lt;script&gt;alert('123');&lt;/script&gt;

// Repeat the last 2 lines to escape additional strings.

It's important to know that XmlConvert.EncodeName() is not appropriate, because that's for entity/tag names, not values. Using that would be like Url-encoding when you needed to Html-encode.
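
A quick sketch of what XmlConvert.EncodeName() actually produces shows why it is the wrong tool for values. The wrapper class XmlNameSketch is hypothetical.

```csharp
using System.Xml;

public static class XmlNameSketch
{
    // Thin wrapper around XmlConvert.EncodeName, which rewrites characters
    // that are not valid in an XML name as _xHHHH_ escape sequences.
    public static string EncodeName(string name)
    {
        return XmlConvert.EncodeName(name);
    }
}
```

For example, EncodeName("Order Details") returns Order_x0020_Details: the space is turned into a name-safe escape rather than an entity, which is not what you want for element text or attribute values.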