C# 检测 XML 的更好方法?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/891223/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-05 08:39:32  来源:igfitidea点击:

Better way to detect XML?

c#xmloptimizationxml-parsing

提问by cyberconte

Currently, I have the following c# code to extract a value out of text. If its XML, I want the value within it - otherwise, if its not XML, it can just return the text itself.

目前,我有以下 c# 代码来从文本中提取一个值。如果它是 XML,我想要其中的值 - 否则,如果它不是 XML,它可以只返回文本本身。

String data = "..."
try
{
    return XElement.Parse(data).Value;
}
catch (System.Xml.XmlException)
{
    return data;
}

I know exceptions are expensive in C#, so I was wondering if there was a better way to determine if the text I'm dealing with is xml or not?

我知道 C# 中的异常很昂贵,所以我想知道是否有更好的方法来确定我正在处理的文本是否为 xml?

I thought of regex testing, but I dont' see that as a cheaper alternative. Note, I'm asking for a less expensivemethod of doing this.

我想到了正则表达式测试,但我不认为这是更便宜的替代方案。请注意,我要求一种更便宜的方法来做到这一点。

采纳答案by Colin Burnett

You could do a preliminary check for a < since all XML has to start with one and the bulk of all non-XML will not start with one.

您可以对 < 进行初步检查,因为所有 XML 都必须以 1 开头,而大部分非 XML 不会以 1 开头。

(Free-hand written.)

(徒手写的。)

// Has to have length to be XML
if (!string.IsNullOrEmpty(data))
{
    // If it starts with a < after trimming then it probably is XML
    // Need to do an empty check again in case the string is all white space.
    var trimmedData = data.TrimStart();
    if (string.IsNullOrEmpty(trimmedData))
    {
       return data;
    }

    if (trimmedData[0] == '<')
    {
        try
        {
            return XElement.Parse(data).Value;
        }
        catch (System.Xml.XmlException)
        {
            return data;
        }
    }
}
else
{
    return data;
}

I originally had the use of a regex but Trim()[0] is identical to what that regex would do.

我最初使用了正则表达式,但 Trim()[0] 与该正则表达式的作用相同。

回答by Mark Carpenter

I find that a perfectly acceptable way of handling your situation (it's probably the way I'd handle it as well). I couldn't find any sort of "XElement.TryParse(string)" in MSDN, so the way you have it would work just fine.

我发现这是处理您的情况的完全可以接受的方式(这可能也是我处理它的方式)。我在 MSDN 中找不到任何类型的“XElement.TryParse(string)”,所以你拥有它的方式会很好地工作。

回答by James Anderson

Clue -- all valid xml must start with "<?xml"

线索——所有有效的 xml 必须以"<?xml

You may have to deal with character set differences but checking plain ASCII, utf-8 and unicode will cover 99.5% of the xml out there.

您可能需要处理字符集差异,但检查纯 ASCII、utf-8 和 unicode 将覆盖 99.5% 的 xml。

回答by si618

Why is regex expensive? Doesn't it kill 2 birds with 1 stone (match and parse)?

为什么正则表达式很贵?它不是用 1 块石头杀死 2 只鸟(匹配和解析)吗?

Simple example parsing all elements, even easier if it's just one element!

解析所有元素的简单示例,如果它只是一个元素,那就更容易了!

Regex regex = new Regex(@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>");
MatchCollection matches = regex.Matches(data);
foreach (Match match in matches)
{
    GroupCollection groups = match.Groups;
    string name = groups["tag"].Value;
    string value = groups["text"].Value;
    ...
}

回答by Arsen Mkrtchyan

The way you suggest will be expensive if you will use it in a loop where most of the xml s are not valied, In case of valied xml your code will work like there are not exception handling... so if in most of cases your xml is valied or you doesn't use it in loop, your code will work fine

如果您将在大多数 xml 未验证的循环中使用它,您建议的方式将是昂贵的,如果 xml 验证,您的代码将像没有异常处理一样工作......所以如果在大多数情况下您的xml 有效或您不在循环中使用它,您的代码将正常工作

回答by SO User

The code given below will match all the following xml formats:

下面给出的代码将匹配以下所有 xml 格式:

<text />                             
<text/>                              
<text   />                           
<text>xml data1</text>               
<text attr='2'>data2</text>");
<text attr='2' attr='4' >data3 </text>
<text attr>data4</text>              
<text attr1 attr2>data5</text>

And here's the code:

这是代码:

public class XmlExpresssion
{
    // EXPLANATION OF EXPRESSION
    // <        :   \<{1}
    // text     :   (?<xmlTag>\w+)  : xmlTag is a backreference so that the start and end tags match
    // >        :   >{1}
    // xml data :   (?<data>.*)     : data is a backreference used for the regex to return the element data      
    // </       :   <{1}/{1}
    // text     :   \k<xmlTag>
    // >        :   >{1}
    // (\w|\W)* :   Matches attributes if any

    // Sample match and pattern egs
    // Just to show how I incrementally made the patterns so that the final pattern is well-understood
    // <text>data</text>
    // @"^\<{1}(?<xmlTag>\w+)\>{1}.*\<{1}/{1}\k<xmlTag>\>{1}$";

    //<text />
    // @"^\<{1}(?<xmlTag>\w+)\s*/{1}\>{1}$";

    //<text>data</text> or <text />
    // @"^\<{1}(?<xmlTag>\w+)((\>{1}.*\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

    //<text>data</text> or <text /> or <text attr='2'>xml data</text> or <text attr='2' attr2 >data</text>
    // @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

    private const string XML_PATTERN = @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

    // Checks if the string is in xml format
    private static bool IsXml(string value)
    {
        return Regex.IsMatch(value, XML_PATTERN);
    }

    /// <summary>
    /// Assigns the element value to result if the string is xml
    /// </summary>
    /// <returns>true if success, false otherwise</returns>
    public static bool TryParse(string s, out string result)
    {
        if (XmlExpresssion.IsXml(s))
        {
            Regex r = new Regex(XML_PATTERN, RegexOptions.Compiled);
            result = r.Match(s).Result("${data}");
            return true;
        }
        else
        {
            result = null;
            return false;
        }
    }

}

Calling code:

调用代码:

if (!XmlExpresssion.TryParse(s, out result)) 
    result = s;
Console.WriteLine(result);

回答by V'rasana Oannes

If you want to know if it's valid, why not use the built-in .NetFX object rather than write one from scratch?

如果您想知道它是否有效,为什么不使用内置的 .NetFX 对象而不是从头开始编写?

Hope this helps,

希望这可以帮助,

Bill

账单

回答by Martin Peck

There is no way of validating that the text is XML other than doing something like XElement.Parse. If, for example, the very last close-angle-bracket is missing from the text field then it's not valid XML, and it's very unlikely that you'll spot this with RegEx or text parsing. There are number of illegal characters, illegal sequences etc that RegEx parsiing will most likely miss.

除了执行 XElement.Parse 之类的操作外,无法验证文本是否为 XML。例如,如果文本字段中缺少最后一个小括号,则它不是有效的 XML,并且您不太可能通过 RegEx 或文本解析发现它。RegEx 解析很可能会遗漏大量非法字符、非法序列等。

All you can hope to do is to short cut your failure cases.

你所能做的就是缩短你的失败案例。

So, if you expect to see lots of non-XML data and the less-expected case is XML then employing RegEx or substring searches to detect angle brackets might save you a little bit of time, but I'd suggest this is onlyuseful if you're batch processing lots of data in a tight loop.

因此,如果您希望看到大量非 XML 数据并且不太预期的情况是 XML,那么使用 RegEx 或子字符串搜索来检测尖括号可能会为您节省一点时间,但我建议这在以下情况有用您正在一个紧密的循环中批量处理大量数据。

If, instead, this is parsing user entered data from a web form or a winforms app then I think paying the cost of the Exception might be better than spending the dev and test effort ensuring that your short-cut code doesn't generate false positive/negative results.

相反,如果这是从 Web 表单或 winforms 应用程序解析用户输入的数据,那么我认为支付 Exception 的成本可能比花费开发和测试工作来确保您的快捷代码不会产生误报更好/负面结果。

It's not clear where you're getting your XML from (file, stream, textbox or somewhere else) but remember that whitespace, comments, byte order marks and other stuff can get in the way of simple rules like "it must start with a <".

不清楚从哪里获取 XML(文件、流、文本框或其他地方),但请记住,空格、注释、字节顺序标记和其他东西可能会妨碍简单规则,例如“它必须以 < 开头”。

回答by cyberconte

Update:(original post is below) Colin has the brilliant idea of moving the regex instantiation outside of the calls so that they're created only once. Heres the new program:

更新:(原始帖子在下面)Colin 有一个绝妙的主意,将正则表达式实例化移动到调用之外,以便它们只创建一次。这是新程序:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace ConsoleApplication3
{
    delegate String xmltestFunc(String data);

    class Program
    {
        static readonly int iterations = 1000000;

        private static void benchmark(xmltestFunc func, String data, String expectedResult)
        {
            if (!func(data).Equals(expectedResult))
            {
                Console.WriteLine(data + ": fail");
                return;
            }
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; ++i)
                func(data);
            sw.Stop();
            Console.WriteLine(data + ": " + (float)((float)sw.ElapsedMilliseconds / 1000));
        }

        static void Main(string[] args)
        {
            benchmark(xmltest1, "<tag>base</tag>", "base");
            benchmark(xmltest1, " <tag>base</tag> ", "base");
            benchmark(xmltest1, "base", "base");
            benchmark(xmltest2, "<tag>ColinBurnett</tag>", "ColinBurnett");
            benchmark(xmltest2, " <tag>ColinBurnett</tag> ", "ColinBurnett");
            benchmark(xmltest2, "ColinBurnett", "ColinBurnett");
            benchmark(xmltest3, "<tag>Si</tag>", "Si");
            benchmark(xmltest3, " <tag>Si</tag> ", "Si" );
            benchmark(xmltest3, "Si", "Si");
            benchmark(xmltest4, "<tag>RashmiPandit</tag>", "RashmiPandit");
            benchmark(xmltest4, " <tag>RashmiPandit</tag> ", "RashmiPandit");
            benchmark(xmltest4, "RashmiPandit", "RashmiPandit");
            benchmark(xmltest5, "<tag>Custom</tag>", "Custom");
            benchmark(xmltest5, " <tag>Custom</tag> ", "Custom");
            benchmark(xmltest5, "Custom", "Custom");

            // "press any key to continue"
            Console.WriteLine("Done.");
            Console.ReadLine();
        }

        public static String xmltest1(String data)
        {
            try
            {
                return XElement.Parse(data).Value;
            }
            catch (System.Xml.XmlException)
            {
                return data;
            }
        }

        static Regex xmltest2regex = new Regex("^[ \t\r\n]*<");
        public static String xmltest2(String data)
        {
            // Has to have length to be XML
            if (!string.IsNullOrEmpty(data))
            {
                // If it starts with a < then it probably is XML
                // But also cover the case where there is indeterminate whitespace before the <
                if (data[0] == '<' || xmltest2regex.Match(data).Success)
                {
                    try
                    {
                        return XElement.Parse(data).Value;
                    }
                    catch (System.Xml.XmlException)
                    {
                        return data;
                    }
                }
            }
           return data;
        }

        static Regex xmltest3regex = new Regex(@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>");
        public static String xmltest3(String data)
        {
            Match m = xmltest3regex.Match(data);
            if (m.Success)
            {
                GroupCollection gc = m.Groups;
                if (gc.Count > 0)
                {
                    return gc["text"].Value;
                }
            }
            return data;
        }

        public static String xmltest4(String data)
        {
            String result;
            if (!XmlExpresssion.TryParse(data, out result))
                result = data;

            return result;
        }

        static Regex xmltest5regex = new Regex("^[ \t\r\n]*<");
        public static String xmltest5(String data)
        {
            // Has to have length to be XML
            if (!string.IsNullOrEmpty(data))
            {
                // If it starts with a < then it probably is XML
                // But also cover the case where there is indeterminate whitespace before the <
                if (data[0] == '<' || data.Trim()[0] == '<' || xmltest5regex.Match(data).Success)
                {
                    try
                    {
                        return XElement.Parse(data).Value;
                    }
                    catch (System.Xml.XmlException)
                    {
                        return data;
                    }
                }
            }
            return data;
        }
    }

    public class XmlExpresssion
    {
        // EXPLANATION OF EXPRESSION
        // <        :   \<{1}
        // text     :   (?<xmlTag>\w+)  : xmlTag is a backreference so that the start and end tags match
        // >        :   >{1}
        // xml data :   (?<data>.*)     : data is a backreference used for the regex to return the element data      
        // </       :   <{1}/{1}
        // text     :   \k<xmlTag>
        // >        :   >{1}
        // (\w|\W)* :   Matches attributes if any

        // Sample match and pattern egs
        // Just to show how I incrementally made the patterns so that the final pattern is well-understood
        // <text>data</text>
        // @"^\<{1}(?<xmlTag>\w+)\>{1}.*\<{1}/{1}\k<xmlTag>\>{1}$";

        //<text />
        // @"^\<{1}(?<xmlTag>\w+)\s*/{1}\>{1}$";

        //<text>data</text> or <text />
        // @"^\<{1}(?<xmlTag>\w+)((\>{1}.*\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        //<text>data</text> or <text /> or <text attr='2'>xml data</text> or <text attr='2' attr2 >data</text>
        // @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        private static string XML_PATTERN = @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";
        private static Regex regex = new Regex(XML_PATTERN, RegexOptions.Compiled);

        // Checks if the string is in xml format
        private static bool IsXml(string value)
        {
            return regex.IsMatch(value);
        }

        /// <summary>
        /// Assigns the element value to result if the string is xml
        /// </summary>
        /// <returns>true if success, false otherwise</returns>
        public static bool TryParse(string s, out string result)
        {
            if (XmlExpresssion.IsXml(s))
            {
                result = regex.Match(s).Result("${data}");
                return true;
            }
            else
            {
                result = null;
                return false;
            }
        }

    }


}

And here are the new results:

以下是新结果:

<tag>base</tag>: 3.667
 <tag>base</tag> : 3.707
base: 40.737
<tag>ColinBurnett</tag>: 3.707
 <tag>ColinBurnett</tag> : 4.784
ColinBurnett: 0.413
<tag>Si</tag>: 2.016
 <tag>Si</tag> : 2.141
Si: 0.087
<tag>RashmiPandit</tag>: 12.305
 <tag>RashmiPandit</tag> : fail
RashmiPandit: 0.131
<tag>Custom</tag>: 3.761
 <tag>Custom</tag> : 3.866
Custom: 0.329
Done.

There you have it. Precompiled regex are the way to go, and pretty efficient to boot.

你有它。预编译的正则表达式是要走的路,而且启动效率很高。







(原帖)

I cobbled together the following program to benchmark the code samples that were provided for this answer, to demonstrate the reasoning for my post as well as evaluate the speed of the privded answers.

我拼凑了以下程序来对为此答案提供的代码示例进行基准测试,以证明我的帖子的推理以及评估私有答案的速度。

Without further ado, heres the program.

闲话少说,程序如下。

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace ConsoleApplication3
{
    delegate String xmltestFunc(String data);

    class Program
    {
        static readonly int iterations = 1000000;

        private static void benchmark(xmltestFunc func, String data, String expectedResult)
        {
            if (!func(data).Equals(expectedResult))
            {
                Console.WriteLine(data + ": fail");
                return;
            }
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; ++i)
                func(data);
            sw.Stop();
            Console.WriteLine(data + ": " + (float)((float)sw.ElapsedMilliseconds / 1000));
        }

        static void Main(string[] args)
        {
            benchmark(xmltest1, "<tag>base</tag>", "base");
            benchmark(xmltest1, " <tag>base</tag> ", "base");
            benchmark(xmltest1, "base", "base");
            benchmark(xmltest2, "<tag>ColinBurnett</tag>", "ColinBurnett");
            benchmark(xmltest2, " <tag>ColinBurnett</tag> ", "ColinBurnett");
            benchmark(xmltest2, "ColinBurnett", "ColinBurnett");
            benchmark(xmltest3, "<tag>Si</tag>", "Si");
            benchmark(xmltest3, " <tag>Si</tag> ", "Si" );
            benchmark(xmltest3, "Si", "Si");
            benchmark(xmltest4, "<tag>RashmiPandit</tag>", "RashmiPandit");
            benchmark(xmltest4, " <tag>RashmiPandit</tag> ", "RashmiPandit");
            benchmark(xmltest4, "RashmiPandit", "RashmiPandit");

            // "press any key to continue"
            Console.WriteLine("Done.");
            Console.ReadLine();
        }

        public static String xmltest1(String data)
        {
            try
            {
                return XElement.Parse(data).Value;
            }
            catch (System.Xml.XmlException)
            {
                return data;
            }
        }

        public static String xmltest2(String data)
        {
            // Has to have length to be XML
            if (!string.IsNullOrEmpty(data))
            {
                // If it starts with a < then it probably is XML
                // But also cover the case where there is indeterminate whitespace before the <
                if (data[0] == '<' || new Regex("^[ \t\r\n]*<").Match(data).Success)
                {
                    try
                    {
                        return XElement.Parse(data).Value;
                    }
                    catch (System.Xml.XmlException)
                    {
                        return data;
                    }
                }
            }
           return data;
        }

        public static String xmltest3(String data)
        {
            Regex regex = new Regex(@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>");
            Match m = regex.Match(data);
            if (m.Success)
            {
                GroupCollection gc = m.Groups;
                if (gc.Count > 0)
                {
                    return gc["text"].Value;
                }
            }
            return data;
        }

        public static String xmltest4(String data)
        {
            String result;
            if (!XmlExpresssion.TryParse(data, out result))
                result = data;

            return result;
        }

    }

    public class XmlExpresssion
    {
        // EXPLANATION OF EXPRESSION
        // <        :   \<{1}
        // text     :   (?<xmlTag>\w+)  : xmlTag is a backreference so that the start and end tags match
        // >        :   >{1}
        // xml data :   (?<data>.*)     : data is a backreference used for the regex to return the element data      
        // </       :   <{1}/{1}
        // text     :   \k<xmlTag>
        // >        :   >{1}
        // (\w|\W)* :   Matches attributes if any

        // Sample match and pattern egs
        // Just to show how I incrementally made the patterns so that the final pattern is well-understood
        // <text>data</text>
        // @"^\<{1}(?<xmlTag>\w+)\>{1}.*\<{1}/{1}\k<xmlTag>\>{1}$";

        //<text />
        // @"^\<{1}(?<xmlTag>\w+)\s*/{1}\>{1}$";

        //<text>data</text> or <text />
        // @"^\<{1}(?<xmlTag>\w+)((\>{1}.*\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        //<text>data</text> or <text /> or <text attr='2'>xml data</text> or <text attr='2' attr2 >data</text>
        // @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        private const string XML_PATTERN = @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        // Checks if the string is in xml format
        private static bool IsXml(string value)
        {
            return Regex.IsMatch(value, XML_PATTERN);
        }

        /// <summary>
        /// Assigns the element value to result if the string is xml
        /// </summary>
        /// <returns>true if success, false otherwise</returns>
        public static bool TryParse(string s, out string result)
        {
            if (XmlExpresssion.IsXml(s))
            {
                Regex r = new Regex(XML_PATTERN, RegexOptions.Compiled);
                result = r.Match(s).Result("${data}");
                return true;
            }
            else
            {
                result = null;
                return false;
            }
        }

    }


}

And here are the results. Each one was executed 1 million times.

这是结果。每一个都被执行了 100 万次。

<tag>base</tag>: 3.531
 <tag>base</tag> : 3.624
base: 41.422
<tag>ColinBurnett</tag>: 3.622
 <tag>ColinBurnett</tag> : 16.467
ColinBurnett: 7.995
<tag>Si</tag>: 19.014
 <tag>Si</tag> : 19.201
Si: 15.567

Test 4 took too long, as 30 minutes later it was deemed too slow. To demonstrate how much slower it was, here is the same test only run 1000 times.

测试 4 花费的时间太长,因为 30 分钟后它被认为太慢了。为了演示它有多慢,这里是相同的测试只运行了 1000 次。

<tag>base</tag>: 0.004
 <tag>base</tag> : 0.004
base: 0.047
<tag>ColinBurnett</tag>: 0.003
 <tag>ColinBurnett</tag> : 0.016
ColinBurnett: 0.008
<tag>Si</tag>: 0.021
 <tag>Si</tag> : 0.017
Si: 0.014
<tag>RashmiPandit</tag>: 3.456
 <tag>RashmiPandit</tag> : fail
RashmiPandit: 0
Done.

Extrapolating out to a million executions, it would've taken 3456 seconds, or just over 57 minutes.

外推到一百万次执行,它需要 3456 秒,或者刚好超过 57 分钟。

This is a good example as to why complex regex are a bad idea if you're looking for efficient code. However it showed that simple regex can still be good answer in some cases - i.e. the small 'pre-test' of xml in colinBurnett answer created a potentially more expensive base case, (regex was created in case 2) but also a much shorter else case by avoiding the exception.

这是一个很好的例子,说明如果您正在寻找高效的代码,为什么复杂的正则表达式是个坏主意。然而,它表明,在某些情况下,简单的正则表达式仍然是很好的答案——即,在 colinBurnett 答案中对 xml 的小型“预测试”创建了一个可能更昂贵的基本案例,(正则表达式是在案例 2 中创建的)但也更短情况下避免异常。

回答by John M Gant

A variation on Colin Burnett's technique: you could do a simple regex at the beginning to see if the text starts with a tag, then try to parse it. Probably >99% of strings you'll deal with that start with a valid element are XML. That way you could skip the regex processing for full-blown valid XML and also skip the exception-based processing in almost every case.

Colin Burnett 技术的一个变体:你可以在开始时做一个简单的正则表达式,看看文本是否以标签开头,然后尝试解析它。您将处理的以有效元素开头的字符串中可能有 99% 以上是 XML。这样您就可以跳过针对成熟有效 XML 的正则表达式处理,并且几乎在所有情况下都可以跳过基于异常的处理。

Something like ^<[^>]+>would probably do the trick.

类似的东西^<[^>]+>可能会起作用。