C#: Regex for matching functions and capturing their arguments

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/18906514/

Time: 2020-08-10 13:33:16  Source: igfitidea

Regex for matching Functions and Capturing their Arguments

c# regex

Asked by Brandon Miller

I'm working on a calculator that takes string expressions and evaluates them. I have a function that searches the expression for math functions using Regex, retrieves the arguments, looks up the function name, and evaluates it. The problem is that I can only do this if I know how many arguments there are going to be; I can't get the Regex right. And if I just split the contents between the ( and ) characters on the , character, then I can't have other function calls in that argument.

Here is the function matching pattern: \b([a-z][a-z0-9_]*)\((..*)\)\b

It only works with one argument. How can I create a group for every argument, excluding the ones inside nested functions? For example, it would match: func1(2 * 7, func2(3, 5)) and create capture groups for: 2 * 7 and func2(3, 5)

Here is the function I'm using to evaluate the expression:

    /// <summary>
    /// Attempts to evaluate and store the result of the given mathematical expression.
    /// </summary>
    public static bool Evaluate(string expr, ref double result)
    {
        expr = expr.ToLower();

        try
        {
            // Matches for result identifiers, constants/variables objects, and functions.
            MatchCollection results = Calculator.PatternResult.Matches(expr);
            MatchCollection objs = Calculator.PatternObjId.Matches(expr);
            MatchCollection funcs = Calculator.PatternFunc.Matches(expr);

            // Parse the expression for functions.
            foreach (Match match in funcs)
            {
                System.Windows.Forms.MessageBox.Show("Function found. - " + match.Groups[1].Value + "(" + match.Groups[2].Value + ")");

                int argCount = 0;
                List<string> args = new List<string>();
                List<double> argVals = new List<double>();
                string funcName = match.Groups[1].Value;

                // Ensure the function exists.
                if (_Functions.ContainsKey(funcName)) {
                    argCount = _Functions[funcName].ArgCount;
                } else {
                    Error("The function '"+funcName+"' does not exist.");
                    return false;
                }

                // Create the pattern for matching arguments.
                string argPattTmp = funcName + "\\(\\s*";

                for (int i = 0; i < argCount; ++i)
                    argPattTmp += "(..*)" + ((i == argCount - 1) ? ",":"") + "\\s*";
                argPattTmp += "\\)";

                // Get all of the argument strings.
                Regex argPatt = new Regex(argPattTmp);

                // Evaluate and store all argument values.
                foreach (Group argMatch in argPatt.Matches(match.Value.Trim())[0].Groups)
                {
                    string arg = argMatch.Value.Trim();
                    System.Windows.Forms.MessageBox.Show(arg);

                    if (arg.Length > 0)
                    {
                        double argVal = 0;

                        // Check if the argument is a double or expression.
                        try {
                            argVal = Convert.ToDouble(arg);
                        } catch {
                            // Attempt to evaluate the arguments expression.
                            System.Windows.Forms.MessageBox.Show("Argument is an expression: " + arg);

                            if (!Evaluate(arg, ref argVal)) {
                                Error("Invalid arguments were passed to the function '" + funcName + "'.");
                                return false;
                            }
                        }

                        // Store the value of the argument.
                        System.Windows.Forms.MessageBox.Show("ArgVal = " + argVal.ToString());
                        argVals.Add(argVal);
                    }
                    else
                    {
                        Error("Invalid arguments were passed to the function '" + funcName + "'.");
                        return false;
                    }
                }

                // Parse the function and replace with the result.
                double funcResult = RunFunction(funcName, argVals.ToArray());
                expr = new Regex("\\b"+match.Value+"\\b").Replace(expr, funcResult.ToString());
            }

            // Final evaluation.
            result = Program.Scripting.Eval(expr);
        }
        catch (Exception ex)
        {
            Error(ex.Message);
            return false;
        }

        return true;
    }

    ////////////////////////////////// ---- PATTERNS ---- \\\\\\\\\\\\\\\\\

    /// <summary>
    /// The pattern used for function calls.
    /// </summary>
    public static Regex PatternFunc = new Regex(@"([a-z][a-z0-9_]*)\((..*)\)");

As you can see, there is a pretty bad attempt at building a Regex to match the arguments. It doesn't work.

All I am trying to do is extract 2 * 7 and func2(3, 5) from the expression func1(2 * 7, func2(3, 5)), but it must work for functions with different argument counts as well. If there is a way to do this without using Regex, that is also good.

Answered by Monty Wild

Regular expressions aren't going to get you completely out of trouble with this...

Since you have nested parentheses, you need to modify your code to count ( against ). When you encounter a (, take note of the position, then look ahead, incrementing a counter for each extra ( you find and decrementing it for each ) you find. When your counter is 0 and you find a ), that is the end of your function parameter block, and you can then parse the text between the parentheses. You can also split the text on , when the counter is 0 to get the function parameters.

If you encounter the end of the string before finding that closing ), you have a "(" without ")" error.

You then take the text block(s) between the opening and closing parentheses and any commas, and repeat the above for each parameter, since a parameter may itself contain a nested function call.
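
The counting approach described above can be sketched roughly as follows. This is only an illustration under some assumptions: the class and method names (ArgSplitter, SplitTopLevelArgs) are made up for this example, only the first call in the string is examined, and brackets inside string literals are not handled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original question's code.
public static class ArgSplitter
{
    // Returns the top-level arguments of the first call found in expr.
    public static List<string> SplitTopLevelArgs(string expr)
    {
        int open = expr.IndexOf('(');
        if (open < 0)
            throw new FormatException("No '(' found.");

        var args = new List<string>();
        int depth = 0;            // counts the extra '(' seen after the first one
        int argStart = open + 1;  // start of the argument currently being scanned

        for (int i = open + 1; i < expr.Length; i++)
        {
            char c = expr[i];
            if (c == '(')
            {
                depth++;
            }
            else if (c == ')')
            {
                if (depth == 0)
                {
                    // This is the closing bracket of the call itself.
                    string last = expr.Substring(argStart, i - argStart).Trim();
                    if (last.Length > 0 || args.Count > 0)
                        args.Add(last);
                    return args;
                }
                depth--;
            }
            else if (c == ',' && depth == 0)
            {
                // A comma at depth 0 separates two top-level arguments.
                args.Add(expr.Substring(argStart, i - argStart).Trim());
                argStart = i + 1;
            }
        }

        // Ran out of input before the call was closed.
        throw new FormatException("\"(\" without \")\".");
    }
}
```

Each returned string can then be checked for a nested call and passed through the same method again, which is the "repeat the above for each parameter" step.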

Answered by acarlon

There is both a simple solution and a more advanced solution (added after edit) to handle more complex functions.

To achieve the example you posted, I suggest doing this in two steps, the first step is to extract the parameters (regexes are explained at the end):

\b[^()]+\((.*)\)$

Now, to parse the parameters.

Simple solution

Extract the parameters using:

([^,]+\(.+?\))|([^,]+)

Here are some C# code examples (all asserts pass):

string extractFuncRegex = @"\b[^()]+\((.*)\)$";
string extractArgsRegex = @"([^,]+\(.+?\))|([^,]+)";

//Your test string
string test = @"func1(2 * 7, func2(3, 5))";

var match = Regex.Match( test, extractFuncRegex );
string innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, @"2 * 7, func2(3, 5)" );
var matches = Regex.Matches( innerArgs, extractArgsRegex );            
Assert.AreEqual( matches[0].Value, "2 * 7" );
Assert.AreEqual( matches[1].Value.Trim(), "func2(3, 5)" );

Explanation of regexes. The arguments extraction as a single string:

\b[^()]+\((.*)\)$

where:

  • [^()]+ matches characters that are neither an opening nor a closing bracket.
  • \((.*)\) captures everything inside the brackets.

The args extraction:

([^,]+\(.+?\))|([^,]+)

where:

  • ([^,]+\(.+?\)) matches characters that are not commas, followed by characters in brackets. This picks up the func arguments. Note the +? so that the match is lazy and stops at the first ) it meets.
  • |([^,]+) If the previous alternative does not match, then match consecutive chars that are not commas. These matches go into groups.

More advanced solution

Now, there are some obvious limitations with that approach. For example, it matches the first closing bracket, so it doesn't handle nested functions very well. For a more comprehensive solution (if you require it), we need to use balancing group definitions (as I mentioned before this edit). For our purposes, balancing group definitions allow us to keep track of the instances of the open brackets and subtract the closing bracket instances. In essence, opening and closing brackets will cancel each other out in the balancing part of the search until the final closing bracket is found. That is, the match will continue until the brackets balance and the final closing bracket is found.

So, the regex to extract the params is now (func extraction can stay the same):

(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+

Here are some test cases to show it in action:

string extractFuncRegex = @"\b[^()]+\((.*)\)$";
string extractArgsRegex = @"(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+";

//Your test string
string test = @"func1(2 * 7, func2(3, 5))";

var match = Regex.Match( test, extractFuncRegex );
string innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, @"2 * 7, func2(3, 5)" );
var matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "2 * 7" );
Assert.AreEqual( matches[1].Value.Trim(), "func2(3, 5)" );

//A more advanced test string
test = @"someFunc(a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2)";
match = Regex.Match( test, extractFuncRegex );
innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, @"a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2" );
matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "a" );
Assert.AreEqual( matches[1].Value.Trim(), "b" );            
Assert.AreEqual( matches[2].Value.Trim(), "func1(a,b+c)" );
Assert.AreEqual( matches[3].Value.Trim(), "func2(a*b,func3(a+b,c))" );
Assert.AreEqual( matches[4].Value.Trim(), "func4(e)+func5(f)" );
Assert.AreEqual( matches[5].Value.Trim(), "func6(func7(g,h)+func8(i,(a)=>a+2))" );
Assert.AreEqual( matches[6].Value.Trim(), "g+2" );

Note especially that the method is now quite advanced:

someFunc(a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2)

So, looking at the regex again:

(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+

In summary, it starts out with characters that are not commas or brackets. Then, if there are brackets in the argument, it matches and subtracts the brackets until they balance. It then tries to repeat that match in case there are other functions in the argument. It then goes on to the next argument (after the comma). In detail:

  • [^,()]+ matches anything that is not ',', '(' or ')'.
  • ?: means non-capturing group, i.e. do not store matches within brackets in a group.
  • \( means start at an open bracket.
  • ?> means atomic grouping. Essentially, this means it does not remember backtracking positions. This also helps to improve performance because there are fewer stepbacks to try different combinations.
  • [^()]+| means anything but an opening or closing bracket. This is followed by | (or).
  • \((?<open>)| This is the good stuff and says match '(' or
  • \)(?<-open>) This is the better stuff that says match a ')' and balance out the '('. This means that this part of the match (everything after the first bracket) will continue until all the internal brackets match. Without the balancing expressions, the match would finish on the first closing bracket. The crux is that the engine does not match this ')' against the final ')'; instead it is subtracted from the matching '('. When there are no further outstanding '(', the -open fails, so the final ')' can be matched.
  • The rest of the regex contains the closing parenthesis for the group and the repetitions (*, * and +), which are respectively: repeat the inner bracket match 0 or more times, repeat the full bracket search 0 or more times (0 allows arguments without brackets), and repeat the full match 1 or more times (allows foo(1)+foo(2)).

One final embellishment:

If you add (?(open)(?!)) to the regex:

(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*(?(open)(?!))\)))*)+

The (?!) will always fail if open has captured something (that hasn't been subtracted), i.e. it will always fail if there is an opening bracket without a closing bracket. This is a useful way to test whether the balancing has failed.

Some notes:

  • \b will not match when the last character is a ')' because ')' is not a word character and \b tests for word character boundaries, so your regex would not match.
  • While regex is powerful, unless you are a guru among gurus it is best to keep the expressions simple, because otherwise they are hard to maintain and hard for other people to understand. That is why it is sometimes best to break up the problem into subproblems and simpler expressions and let the language do some of the non search/match operations that it is good at. So, you may want to mix simple regexes with more complex code or vice versa, depending on where you are comfortable.
  • This will match some very complex functions, but it is not a lexical analyzer for functions.
  • If you can have strings in the arguments and the strings themselves can contain brackets, e.g. "go(...", then you will need to modify the regex to take strings out of the comparison. Same with comments.
  • Some links for balancing group definitions: here, here, here and here.

Hope that helps.

Answered by Corey

I'm sorry to burst the RegEx bubble, but this is one of those things that you just can't do effectively with regular expressions alone.

What you're implementing is basically an Operator-Precedence Parser with support for sub-expressions and argument lists. The statement is processed as a stream of tokens (possibly using regular expressions), with sub-expressions processed as high-priority operations.

With the right code you can do this as an iteration over the full token stream, but recursive parsers are common too. Either way, you have to be able to effectively push state and restart parsing at each of the sub-expression entry points (a (, , or <function_name>( token) and push the result up the parser chain at the sub-expression exit points (a ) or , token).
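
As a rough illustration of the recursive variant, here is a minimal recursive-descent sketch. It is not the asker's calculator and not a full operator-precedence parser: the class name MiniParser and the two-entry function table (max, pow) are assumptions made up for this example, and it handles only +, -, *, /, parentheses, unsigned numbers, and function calls. Each nested ( or function argument simply re-enters Expr(), so the call stack does the state pushing.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical sketch; names and the function table are illustrative only.
public class MiniParser
{
    private readonly string _s;
    private int _pos;

    private static readonly Dictionary<string, Func<double[], double>> Funcs =
        new Dictionary<string, Func<double[], double>>
        {
            { "max", a => Math.Max(a[0], a[1]) },
            { "pow", a => Math.Pow(a[0], a[1]) },
        };

    private MiniParser(string s) { _s = s; }

    public static double Eval(string expr)
    {
        var p = new MiniParser(expr);
        double v = p.Expr();
        p.SkipSpaces();
        if (p._pos != p._s.Length)
            throw new FormatException("Unexpected input at " + p._pos);
        return v;
    }

    private void SkipSpaces()
    {
        while (_pos < _s.Length && _s[_pos] == ' ') _pos++;
    }

    // Consumes c (after optional spaces) and reports whether it was there.
    private bool Accept(char c)
    {
        SkipSpaces();
        if (_pos < _s.Length && _s[_pos] == c) { _pos++; return true; }
        return false;
    }

    private void Expect(char c)
    {
        if (!Accept(c)) throw new FormatException("Expected '" + c + "' at " + _pos);
    }

    // expr := term (('+' | '-') term)*
    private double Expr()
    {
        double v = Term();
        while (true)
        {
            if (Accept('+')) v += Term();
            else if (Accept('-')) v -= Term();
            else return v;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double Term()
    {
        double v = Factor();
        while (true)
        {
            if (Accept('*')) v *= Factor();
            else if (Accept('/')) v /= Factor();
            else return v;
        }
    }

    // factor := number | '(' expr ')' | name '(' [expr (',' expr)*] ')'
    private double Factor()
    {
        if (Accept('('))
        {
            double inner = Expr();   // sub-expression entry point
            Expect(')');             // sub-expression exit point
            return inner;
        }
        int start = _pos;
        if (_pos < _s.Length && char.IsLetter(_s[_pos]))
        {
            while (_pos < _s.Length && (char.IsLetterOrDigit(_s[_pos]) || _s[_pos] == '_')) _pos++;
            string name = _s.Substring(start, _pos - start);
            Func<double[], double> f;
            if (!Funcs.TryGetValue(name, out f))
                throw new FormatException("Unknown function '" + name + "'");
            Expect('(');
            var args = new List<double>();
            if (!Accept(')'))
            {
                do { args.Add(Expr()); } while (Accept(','));
                Expect(')');
            }
            return f(args.ToArray());
        }
        while (_pos < _s.Length && (char.IsDigit(_s[_pos]) || _s[_pos] == '.')) _pos++;
        if (_pos == start) throw new FormatException("Expected a number at " + _pos);
        return double.Parse(_s.Substring(start, _pos - start), CultureInfo.InvariantCulture);
    }
}
```

For example, MiniParser.Eval("max(2 * 7, pow(3, 2))") evaluates both arguments recursively before calling max. Argument counts are not validated here; a real implementation would check them against the function table.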

Answered by Mike Clark

There are some new (relatively very new) language-specific enhancements to regex that make it possible to match context-free languages with "regex", but you will find more resources and more help when using the tools more commonly used for this kind of task:

It'd be better to use a parser generator like ANTLR, LEX+YACC, FLEX+BISON, or any other commonly used parser generator. Most of them come with complete examples on how to build simple calculators that support grouping and function calls.

Answered by alexandrecote99

This regex does what you want:

^(?<FunctionName>\w+)\((?>(?(param),)(?<param>(?>(?>[^\(\),"]|(?<p>\()|(?<-p>\))|(?(p)[^\(\)]|(?!))|(?(g)(?:""|[^"]|(?<-g>"))|(?!))|(?<g>")))*))+\)$

Don't forget to escape backslashes and double quotes when pasting it in your code.

It will correctly match arguments in double quotes, inner functions, and numbers, as in this example:
f1(123,"df""j"" , dhf",abc12,func2(),func(123,a>2))

The param stack will contain:
123
"df""j"" , dhf"
abc12
func2()
func(123,a>2)
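
As a sketch of how the pattern above might be pasted into C# and how the param stack can be read back: in a verbatim string (@"...") the backslashes stay as they are and every double quote is doubled. The class and method names (CallParser, GetArgs) are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern from this answer.
public static class CallParser
{
    // Same pattern as above, written as a verbatim string with doubled quotes.
    private static readonly Regex CallPattern = new Regex(
        @"^(?<FunctionName>\w+)\((?>(?(param),)(?<param>(?>(?>[^\(\),""]|(?<p>\()|(?<-p>\))|(?(p)[^\(\)]|(?!))|(?(g)(?:""""|[^""]|(?<-g>""))|(?!))|(?<g>"")))*))+\)$");

    // Returns the top-level arguments, or null if the input does not match.
    public static List<string> GetArgs(string call)
    {
        Match m = CallPattern.Match(call);
        if (!m.Success) return null;

        // Every repetition of the "param" group is kept in its Captures collection.
        var args = new List<string>();
        foreach (Capture c in m.Groups["param"].Captures)
            args.Add(c.Value);
        return args;
    }
}
```

The function name is available separately through the FunctionName group of the same match.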