如何用 C 或 C++ 编写一个简单的正则表达式模式匹配函数?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/4040835/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-28 14:21:36  来源:igfitidea点击:

How do I write a simple regular expression pattern matching function in C or C++?

c++cregexalgorithm

提问by Tracy

This is a question in my paper test today, the function signature is

这是我今天纸考的一道题,函数签名是

int is_match(char* pattern,char* string)

The pattern is limited to only ASCII chars and the quantification *and ?, so it is relatively simple. is_matchshould return 1 if matched, otherwise 0.

该模式仅限于 ASCII 字符和量化*?,因此相对简单。is_match如果匹配,则应返回 1,否则返回 0。

How do I do this?

我该怎么做呢?

回答by Richard Chambers

Brian Kernighanprovided a short article on A Regular Expression Matcherthat Rob Pikewrote as a demonstration program for a book they were working on. The article is a very nice read explaining a bit about the code and regular expressions in general.

Brian Kernighan提供了一篇关于正则表达式匹配器的简短文章,Rob Pike编写了这篇文章,作为他们正在编写的一本书的演示程序。这篇文章非常好读,从总体上解释了代码和正则表达式。

I have played with this code, making a few changes to experiment with some extensions such as to also return where in the string the pattern matches so that the substring matching the pattern can be copied from the original text.

我已经使用了这段代码,做了一些更改以试验一些扩展,例如还返回模式匹配的字符串中的位置,以便可以从原始文本中复制与模式匹配的子字符串。

From the article:

从文章:

I suggested to Rob that we needed to find the smallest regular expression package that would illustrate the basic ideas while still recognizing a useful and non-trivial class of patterns. Ideally, the code would fit on a single page.

Rob disappeared into his office, and at least as I remember it now, appeared again in no more than an hour or two with the 30 lines of C code that subsequently appeared in Chapter 9 of TPOP. That code implements a regular expression matcher that handles these constructs:

c    matches any literal character c
.    matches any single character
^    matches the beginning of the input string
$    matches the end of the input string
*    matches zero or more occurrences of the previous character

This is quite a useful class; in my own experience of using regular expressions on a day-to-day basis, it easily accounts for 95 percent of all instances. In many situations, solving the right problem is a big step on the road to a beautiful program. Rob deserves great credit for choosing so wisely, from among a wide set of options, a very small yet important, well-defined and extensible set of features.

Rob's implementation itself is a superb example of beautiful code: compact, elegant, efficient, and useful. It's one of the best examples of recursion that I have ever seen, and it shows the power of C pointers. Although at the time we were most interested in conveying the important role of a good notation in making a program easier to use and perhaps easier to write as well, the regular expression code has also been an excellent way to illustrate algorithms, data structures, testing, performance enhancement, and other important topics.

我向 Rob 建议我们需要找到最小的正则表达式包来说明基本思想,同时仍能识别出有用且非平凡的一类模式。理想情况下,代码将适合单个页面。

Rob 消失在他的办公室里,至少在我现在的记忆中,他在不超过一两个小时的时间里又出现了 30 行 C 代码,这些代码随后出现在 TPOP 的第 9 章中。该代码实现了处理这些结构的正则表达式匹配器:

c    matches any literal character c
.    matches any single character
^    matches the beginning of the input string
$    matches the end of the input string
*    matches zero or more occurrences of the previous character

这是一个非常有用的类;根据我自己日常使用正则表达式的经验,它很容易占所有实例的 95%。在许多情况下,解决正确的问题是通往美丽程序的一大步。Rob 从众多选项中如此明智地选择了一个非常小但很重要、定义明确且可扩展的功能集,因此值得称赞。

Rob 的实现本身就是优美代码的极好例子:紧凑、优雅、高效和有用。这是我见过的最好的递归示例之一,它展示了 C 指针的强大功能。尽管当时我们最感兴趣的是传达良好符号在使程序更易于使用和编写方面的重要作用,但正则表达式代码也是说明算法、数据结构、测试的极好方式。 、性能增强和其他重要主题。

The actual C source code from the article is very very nice.

文章中的实际 C 源代码非常好。

/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '
#include <string.h>
#include <string>
#include <vector>
#include <iostream>

struct Match {
  Match():_next(0) {}
  virtual bool match(const char * pattern, const char * input) const {
    return !std::strcmp(pattern, input);
  }
  bool next(const char * pattern, const char * input) const {
    if (!_next) return false;
    return _next->match(pattern, input);
  }
  const Match * _next;
};

class MatchSet: public Match {
  typedef std::vector<Match *> Set;
  Set toTry;
public:
  virtual bool match(const char * pattern, const char * input) const {
    for (Set::const_iterator i = toTry.begin(); i !=toTry.end(); ++i) {
      if ((*i)->match(pattern, input)) return true;
    }
    return false;
  }
  void add(Match * m) {
    toTry.push_back(m);
    m->_next = this;
  }
  ~MatchSet() {
    for (Set::const_iterator i = toTry.begin(); i !=toTry.end(); ++i)
      if ((*i)->_next==this) (*i)->_next = 0;
  }
};

struct MatchQuestion: public Match  {
  virtual bool match(const char * pattern, const char * input) const {
    if (pattern[0] != '?')
      return false;
    if (next(pattern+1, input))
      return true;
    if (next(pattern+1, input+1))
      return true;
    return false;
  }
};

struct MatchEmpty: public Match {
  virtual bool match(const char * pattern, const char * input) const {
    if (pattern[0]==0 && input[0]==0)
      return true;
    return false;
  }
};

struct MatchAsterisk: public Match {
  virtual bool match(const char * pattern, const char * input) const {
    if (pattern[0] != '*')
      return false;
    if (pattern[1] == 0) {
      return true;
    }
    for (int i = 0; input[i] != 0; ++i) {
      if (next(pattern+1, input+i))
        return true;
    }
    return false;
  }
};

struct MatchSymbol: public Match {
  virtual bool match(const char * pattern, const char * input) const {
    // TODO: consider cycle here to prevent unnecessary recursion
    // Cycle should detect special characters and call next on them
    // Current implementation abstracts from that
    if (pattern[0] != input[0])
      return false;
    return next(pattern+1, input+1);
  }
};

class DefaultMatch: public MatchSet {
  MatchEmpty empty;
  MatchQuestion question;
  MatchAsterisk asterisk;
  MatchSymbol symbol;
public:
  DefaultMatch() {
    add(&empty);
    add(&question);
    add(&asterisk);
    add(&symbol);
  }
  void test(const char * p, const char * input) const {
    testOneWay(p, input);
    if (!std::strcmp(p, input)) return;
    testOneWay(input, p);
  }
  bool testOneWay(const char * p, const char * input) const {
    const char * eqStr = " == ";
    bool rv = match(p, input);
    if (!rv) eqStr = " != ";
    std::cout << p << eqStr << input << std::endl;
    return rv;
  }

};


int _tmain(int argc, _TCHAR* argv[])
{
  using namespace std;

  typedef vector<string> Strings;
  Strings patterns;

  patterns.push_back("*");
  patterns.push_back("*hw");
  patterns.push_back("h*w");
  patterns.push_back("hw*");

  patterns.push_back("?");
  patterns.push_back("?ab");
  patterns.push_back("a?b");
  patterns.push_back("ab?");

  patterns.push_back("c");
  patterns.push_back("cab");
  patterns.push_back("acb");
  patterns.push_back("abc");

  patterns.push_back("*this homework?");
  patterns.push_back("Is this homework?");
  patterns.push_back("This is homework!");
  patterns.push_back("How is this homework?");

  patterns.push_back("hw");
  patterns.push_back("homework");
  patterns.push_back("howork");

  DefaultMatch d;
  for (unsigned i = 0; i < patterns.size(); ++i)
    for (unsigned j =i; j < patterns.size(); ++j)
      d.test(patterns[i].c_str(), patterns[j].c_str());

    return 0;
}
'); return 0; } /* matchhere: search for regexp at beginning of text */ int matchhere(char *regexp, char *text) { if (regexp[0] == '
for each character in the pattern
  if pattern character after the current one is *
    // enter * state
    while current character from target == current pattern char, and not at end
      get next character from target
    skip a char from the pattern
  else if pattern character after the current one is ?
    // enter ? state
    if current character from target == current pattern char
      get next char from target
    skip a char from the pattern
  else
    // enter character state
    if current character from target == current pattern character
      get next character from target
    else
      return false
return true
') return 1; if (regexp[1] == '*') return matchstar(regexp[0], regexp+2, text); if (regexp[0] == '$' && regexp[1] == '
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int is_match(char* pattern, char* string)
{
  int n = strlen(string);
  int m = strlen(pattern);

  int i, j;
  int **match;

  match = (int **) malloc((n + 1) * sizeof(int *));
  for(i = 0; i <= n; i++) {
    match[i] = (int *) malloc((m + 1) * sizeof(int));
  }

  for(i = n; i >= 0; i--) {
    for(j = m; j >= 0; j--) {
      if(i == n && j == m) {
        match[i][j] = 1;
      }
      else if(i < n && j == m) {
        match[i][j] = 0;
      }
      else {
        match[i][j] = 0;
        if(pattern[j + 1] == '*') {
          if(match[i][j + 2]) match[i][j] = 1;
          if(i < n && pattern[j] == string[i] && match[i + 1][j]) match[i][j] = 1;
        }
        else if(pattern[j + 1] == '?') {
          if(match[i][j + 2]) match[i][j] = 1;
          if(i < n && pattern[j] == string[i] && match[i + 1][j + 2]) match[i][j] = 1;
        }
        else if(i < n && pattern[j] == string[i] && match[i + 1][j + 1]) {
          match[i][j] = 1;
        }
      }
    }
  }

  int result = match[0][0];

  for(i = 0; i <= n; i++) {
    free(match[i]);
  }

  free(match);

  return result;
}

int main(void)
{
  printf("is_match(dummy, dummy)  = %d\n", is_match("dummy","dummy"));
  printf("is_match(dumm?y, dummy) = %d\n", is_match("dumm?y","dummy"));
  printf("is_match(dum?y, dummy)  = %d\n", is_match("dum?y","dummy"));
  printf("is_match(dum*y, dummy)  = %d\n", is_match("dum*y","dummy")); 

  system("pause");

  return 0;
}
') return *text == '
int is_match(char *pattern, char *string)
{
    if (!pattern[0]) {
        return !string[0];
    } else if (pattern[1] == '?') {
        return (pattern[0] == string[0] && is_match(pattern+2, string+1))
            || is_match(pattern+2, string);
    } else if (pattern[1] == '*') {
        size_t i;
        for (i=0; string[i] == pattern[0]; i++)
            if (is_match(pattern+2, string+i)) return 1;
        return 0;
    } else {
        return pattern[0] == string[0] && is_match(pattern+1, string+1);
    }
}
'; if (*text!='
#include<stdio.h>
int mystrstr (const char *,const char *);
int mystrcmp(char *,char *);
int main()
{
    char *s1,*s2;//enter the strings, s1 is main string and s2 is substring.
    printf("Index is %d\n",mystrstr(s1,s2));
    //print the index of the string if string is found
}
//search for the sub-string in the main string
int mystrstr (const char *ps1,const char *ps2) 
{
    int i=0,j=0,c=0,l,m;char *x,*y;
    x=ps1;
    y=ps2;
    while(*ps1++)i++;
    while(*ps2++)j++;
    ps1=x;
    ps2=y;
    char z[j];
    for(l=0;l<i-j;l++)
    {
        for(m=l;m<j+l;m++)
            //store the sub-string of similar size from main string
            z[c++]=ps1[m];
        z[c]='##代码##'
        c=0;
        if(mystrcmp(z,ps2)==0)
        break;
    }
    return l;
}


int mystrcmp(char *ps3,char *ps4) //compare two strings
{
    int i=0;char *x,*y;
    x=ps3;y=ps4;
    while((*ps3!=0)&&(*ps3++==*ps4++))i++;      
    ps3=x;ps4=y;
    if(ps3[i]==ps4[i])
        return 0;
    if(ps3[i]>ps4[i])
        return +1;
    else
        return -1;
}
' && (regexp[0]=='.' || regexp[0]==*text)) return matchhere(regexp+1, text+1); return 0; } /* matchstar: search for c*regexp at beginning of text */ int matchstar(int c, char *regexp, char *text) { do { /* a * matches zero or more instances */ if (matchhere(regexp, text)) return 1; } while (*text != '##代码##' && (*text++ == c || c == '.')); return 0; }

回答by AShelly

See This Questionfor a solution you can not submit. See this paperfor a description of how to implement a more readable one.

有关您无法提交的解决方案,请参阅此问题。有关如何实现更具可读性的描述,请参阅本文

回答by Basilevs

Here is recursive extendable implementation. Tested for first order of pattern complexity.

这是递归可扩展实现。测试模式复杂性的一阶。

##代码##

If something is unclear, ask.

如果有什么不清楚的,问。

回答by Benoit

Cheat. Use #include <boost/regex/regex.hpp>.

欺骗。使用#include <boost/regex/regex.hpp>.

回答by siukurnin

try to make a list of interesting test cases:

尝试列出有趣的测试用例:

is_match("dummy","dummy") should return true;

is_match("dumm?y","dummy") should return true;

is_match("dum?y","dummy") should return false;

is_match("dum*y","dummy") should return true;

is_match("dummy","dummy") 应该返回真;

is_match("dumm?y","dummy") 应该返回真;

is_match("dum?y","dummy") 应该返回 false;

is_match("dum*y","dummy") 应该返回真;

and so on ...

等等 ...

then see how to make the easier test pass, then the next one ...

然后看看如何让更简单的测试通过,然后下一个......

回答by Merlyn Morgan-Graham

Didn't test this, actually code it, or debug it, but this might get you a start...

没有测试这个,实际编码或调试它,但这可能会让你开始......

##代码##

回答by Ivanvi

The full power of regular expressions and finite state machines are not needed to solve this problem. As an alternative there is a relatively simple dynamic programming solution.

解决这个问题不需要正则表达式和有限状态机的全部力量。作为替代,有一个相对简单的动态规划解决方案。

Let match(i, j) be 1 if it is possible to match the the sub-string string[i..n-1] with the sub-pattern pattern[j, m - 1], where n and m are the lengths of stringand patternrespectively. Otherwise let match(i, j) be 0.

如果可以将子字符串字符串[i..n-1] 与子模式模式[j, m - 1]匹配,则让 match(i, j) 为 1 ,其中 n 和 m 是长度分别是字符串模式。否则让 match(i, j) 为 0。

The base cases are:

基本情况是:

  • match(n, m) = 1, you can match an empty string with an empty pattern;

  • match(i, m) = 0, you can't match a non-empty string with an empty pattern;

  • match(n, m) = 1,可以用空模式匹配空字符串;

  • match(i, m) = 0,不能用空模式匹配非空字符串;

The transition is divided into 3 cases depending on whether the current sub-pattern starts with a character followed by a '*', or a character followed by a '?' or just starts with a character with no special symbol after it.

转换分为 3 种情况,具体取决于当前子模式是以字符后跟 '*' 开始,还是以字符后跟 '?' 开始。或者只是以一个字符开头,后面没有特殊符号。

##代码##

The time complexity of this approach is O(n * m). The memory complexity is also O(n * m) but with a simple modification can be reduced to O(m).

这种方法的时间复杂度是 O(n * m)。内存复杂度也是 O(n * m) 但通过简单的修改可以减少到 O(m)。

回答by R.. GitHub STOP HELPING ICE

Simple recursive implementation. It's slow but easy to understand:

简单的递归实现。它很慢但很容易理解:

##代码##

Hope I got it all right.

希望我一切顺利。

回答by Abhinav Jain

A C program to find the index,from where the sub-string in the main string is going to start. enter code here

AC 程序找到索引,从主字符串中的子字符串开始。在此处输入代码

##代码##