在 C/C++ 中进行不区分大小写的子字符串搜索的最快方法？

Question

提问by Anders Sandvig

Note

笔记

The question below was asked in 2008 about some code from 2003. As the OP's updateshows, this entire post has been obsoleted by vintage 2008 algorithms and persists here only as a historical curiosity.

下面的问题是在 2008 年提出的关于 2003 年的一些代码。正如 OP 的更新所示，整篇文章已被 2008 年的老式算法淘汰，仅作为历史好奇心留在这里。

I need to do a fast case-insensitive substring search in C/C++. My requirements are as follows:

我需要在 C/C++ 中进行快速的不区分大小写的子字符串搜索。我的要求如下：

Should behave like strstr() (i.e. return a pointer to the match point).
Must be case-insensitive (doh).
Must support the current locale.
Must be available on Windows (MSVC++ 8.0) or easily portable to Windows (i.e. from an open source library).

应该表现得像 strstr() （即返回一个指向匹配点的指针）。
必须不区分大小写（doh）。
必须支持当前语言环境。
必须在 Windows (MSVC++ 8.0) 上可用或易于移植到 Windows（即来自开源库）。

Here is the current implementation I am using (taken from the GNU C Library):

这是我正在使用的当前实现（取自 GNU C 库）：

/* Return the offset of one string within another.
   Copyright (C) 1994,1996,1997,1998,1999,2000 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

/*
 * My personal strstr() implementation that beats most other algorithms.
 * Until someone tells me otherwise, I assume that this is the
 * fastest implementation of strstr() in C.
 * I deliberately chose not to comment it.  You should have at least
 * as much fun trying to understand it, as I had to write it :-).
 *
 * Stephen R. van den Berg, [email protected] */

/*
 * Modified to use table lookup instead of tolower(), since tolower() isn't
 * worth s*** on Windows.
 *
 * -- Anders Sandvig ([email protected])
 */

#if HAVE_CONFIG_H
# include <config.h>
#endif

#include <ctype.h>
#include <string.h>

typedef unsigned chartype;

char char_table[256];

void init_stristr(void)
{
  int i;
  char string[2];

  string[1] = '$ gcc -Wall -o my_stristr my_stristr.c
steve@solaris:~/code/tmp
$ gcc -Wall -o strcasestr strcasestr.c 
steve@solaris:~/code/tmp
$ ./bench ./my_stristr > my_stristr.result ; ./bench ./strcasestr > strcasestr.result;
steve@solaris:~/code/tmp
$ cat my_stristr.result 
run 1... time = 6.32
run 2... time = 6.31
run 3... time = 6.31
run 4... time = 6.31
run 5... time = 6.32
run 6... time = 6.31
run 7... time = 6.31
run 8... time = 6.31
run 9... time = 6.31
run 10... time = 6.31
average user time over 10 runs = 6.3120
steve@solaris:~/code/tmp
$ cat strcasestr.result 
run 1... time = 3.82
run 2... time = 3.82
run 3... time = 3.82
run 4... time = 3.82
run 5... time = 3.82
run 6... time = 3.82
run 7... time = 3.82
run 8... time = 3.82
run 9... time = 3.82
run 10... time = 3.82
average user time over 10 runs = 3.8200
steve@solaris:~/code/tmp
';
  for (i = 0; i < 256; i++)
  {
    string[0] = i;
    _strlwr(string);
    char_table[i] = string[0];
  }
}

#define my_tolower(a) ((chartype) char_table[a])

char *
my_stristr (phaystack, pneedle)
     const char *phaystack;
     const char *pneedle;
{
  register const unsigned char *haystack, *needle;
  register chartype b, c;

  haystack = (const unsigned char *) phaystack;
  needle = (const unsigned char *) pneedle;

  b = my_tolower (*needle); 
  if (b != 'int main(void)
{
        char * needle="hello";
        char haystack[1024];
        int i;

        for(i=0;i<sizeof(haystack)-strlen(needle)-1;++i)
        {
                haystack[i]='A'+i%57;
        }
        memcpy(haystack+i,needle, strlen(needle)+1);
        /*printf("%s\n%d\n", haystack, haystack[strlen(haystack)]);*/
        init_stristr();

        for (i=0;i<1000000;++i)
        {
                /*my_stristr(haystack, needle);*/
                strcasestr(haystack,needle);
        }


        return 0;
}
')
  {
    haystack--;             /* possible ANSI violation */
    do
      {
        c = *++haystack;
        if (c == '#!/bin/bash
function bc_calc()
{
        echo $(echo "scale=4;" | bc)
}
time="/usr/bin/time -p"
prog=""
accum=0
runs=10
for a in $(jot $runs 1 $runs)
do
        echo -n "run $a... "
        t=$($time $prog 2>&1| grep user | awk '{print }')
        echo "time = $t"
        accum=$(bc_calc "$accum+$t")
done

echo -n "average user time over $runs runs = "
echo $(bc_calc "$accum/$runs")
')
          goto ret0;
      }
    while (my_tolower (c) != (int) b);

    c = my_tolower (*++needle);
    if (c == 'const wchar_t *szk_wcsstri(const wchar_t *s1, const wchar_t *s2)
{
    if (s1 == NULL || s2 == NULL) return NULL;
    const wchar_t *cpws1 = s1, *cpws1_, *cpws2;
    char ch1, ch2;
    bool bSame;

    while (*cpws1 != L'#include <boost/algorithm/string/find.hpp>

const char* istrstr( const char* haystack, const char* needle )
{
   using namespace boost;
   iterator_range<char*> result = ifind_first( haystack, needle );
   if( result ) return result.begin();

   return NULL;
}
')
    {
        bSame = true;
        if (*cpws1 != *s2)
        {
            ch1 = towlower(*cpws1);
            ch2 = towlower(*s2);

            if (ch1 == ch2)
                bSame = true;
        }

        if (true == bSame)
        {
            cpws1_ = cpws1;
            cpws2 = s2;
            while (*cpws1_ != L'char_table[i] = tolower(i);
')
            {
                ch1 = towlower(*cpws1_);
                ch2 = towlower(*cpws2);

                if (ch1 != ch2)
                    break;

                cpws2++;

                if (*cpws2 == L'int StringInStringFindFirst(const char* p_cText, const char* p_cSearchText)
{
    int iTextSize = strlen(p_cText);
    int iSearchTextSize = strlen(p_cSearchText);

    char* p_cFound = NULL;

    if(iTextSize >= iSearchTextSize)
    {
        int iCounter = 0;
        while((iCounter + iSearchTextSize) <= iTextSize)
        {
            if(memcmp( (p_cText + iCounter), p_cSearchText, iSearchTextSize) == 0)
                return  iCounter;
            iCounter ++;
        }
    }

    return -1;
}
')
                    return cpws1_-(cpws2 - s2 - 0x01);
                cpws1_++;
            }
        }
        cpws1++;
    }
    return NULL;
}
')
        goto foundneedle;

    ++needle;
    goto jin;

    for (;;)
    {
      register chartype a;
        register const unsigned char *rhaystack, *rneedle;

        do
        {
          a = *++haystack;
          if (a == 'long GetStringMask(const char* p_cText)
{
    long lMask=0;

    while(*p_cText != 'int main(int argc, char* argv[])
{

    char* p_cText = "this is a test";   
    char* p_cSearchText = "test";

    long lTextMask = GetStringMask(p_cText);
    long lSearchMask = GetStringMask(p_cSearchText);

    int iFoundAt = -1;
    // If Both masks are Valid
    if(lTextMask != 0 && lSearchMask != 0)
    {
        if((lTextMask & lSearchMask) == lSearchMask)
        {       
             iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
        }
    }
    else
    {
        iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
    }


    return 0;
}
')
    {       
        if (*p_cText>='a' && *p_cText<='z')
            lMask = lMask | (1 << (*p_cText - 'a') );
        else if(*p_cText != ' ')
        {
            lMask = 0;
            break;      
        }

        p_cText ++;
    }
    return lMask;
}
')
              goto ret0;
          if (my_tolower (a) == (int) b)
              break;
          a = *++haystack;
          if (a == '#define IS_ALPHA(c) (((c) >= 'A' && (c) <= 'Z') || ((c) >= 'a' && (c) <= 'z'))
#define TO_UPPER(c) ((c) & 0xDF)

char * __cdecl strstri (const char * str1, const char * str2){
        char *cp = (char *) str1;
        char *s1, *s2;

        if ( !*str2 )
            return((char *)str1);

        while (*cp){
                s1 = cp;
                s2 = (char *) str2;

                while ( *s1 && *s2 && (IS_ALPHA(*s1) && IS_ALPHA(*s2))?!(TO_UPPER(*s1) - TO_UPPER(*s2)):!(*s1-*s2))
                        ++s1, ++s2;

                if (!*s2)
                        return(cp);

                ++cp;
        }
        return(NULL);
}
')
              goto ret0;
        shloop:
          ;
        }
      while (my_tolower (a) != (int) b);

jin:      
      a = *++haystack;
      if (a == 'if (BitLookup(table[char1], char2)) { /* match */ }
')
          goto ret0;

        if (my_tolower (a) != (int) c)
          goto shloop;

        rhaystack = haystack-- + 1;
        rneedle = needle;

        a = my_tolower (*rneedle);

        if (my_tolower (*rhaystack) == (int) a)
          do
          {
              if (a == '#define kCongruentCase (1 << 0)
#define kCongruentDiacritical (1 << 1)
#define kCongruentVowel (1 << 2)
#define kCongruentConsonant (1 << 3)
')
                goto foundneedle;

              ++rhaystack;
          a = my_tolower (*++needle);
              if (my_tolower (*rhaystack) != (int) a)
                break;

          if (a == 'inline bool CharsAreCongruent(char c1, char c2, unsigned char congruency)
{
    return (_congruencyTable[c1][c2] & congruency) != 0;
}

#define CaseInsensitiveCharEqual(c1, c2) CharsAreCongruent(c1, c2, kCongruentCase)
')
                goto foundneedle;

          ++rhaystack;
              a = my_tolower (*++needle);
          }
          while (my_tolower (*rhaystack) == (int) a);

        needle = rneedle;       /* took the register-poor approach */

      if (a == '##代码##')
          break;
    }
  }
foundneedle:
  return (char*) haystack;
ret0:
  return 0;
}

Can you make this code faster, or do you know of a better implementation?

你能让这段代码更快，或者你知道更好的实现吗？

Note:I noticed that the GNU C Library now has a new implementation of strstr(), but I am not sure how easily it can be modified to be case-insensitive, or if it is in fact faster than the old one (in my case). I also noticed that the old implementation is still used for wide character strings, so if anyone knows why, please share.

注意：我注意到 GNU C 库现在有一个新的实现strstr()，但我不确定它可以被修改为不区分大小写有多容易，或者它实际上是否比旧的更快（在我的情况下）。我还注意到旧的实现仍然用于宽字符串，所以如果有人知道为什么，请分享。

Update

更新

Just to make things clear—in case it wasn't already—I didn't write this function, it's a part of the GNU C Library. I only modified it to be case-insensitive.

只是为了说明清楚——如果它还没有——我没有编写这个函数，它是 GNU C 库的一部分。我只是将其修改为不区分大小写。

Also, thanks for the tip about strcasestr()and checking out other implementations from other sources (like OpenBSD, FreeBSD, etc.). It seems to be the way to go. The code above is from 2003, which is why I posted it here in hope for a better version being available, which apparently it is. :)

另外，感谢您提供有关strcasestr()和检查来自其他来源（如 OpenBSD、FreeBSD 等）的其他实现的提示。这似乎是要走的路。上面的代码是 2003 年的，这就是为什么我把它贴在这里，希望有更好的版本可用，显然是这样。:)

Answer 1

回答by freespace

The code you posted is about half as fast as strcasestr.

您发布的代码大约是strcasestr.

##代码##

The mainfunction was:

该main功能是：

##代码##

It was suitably modified to test both implementations. I notice as I am typing this up I left in the init_stristrcall, but it shouldn't change things too much. benchis just a simple shell script:

它经过适当修改以测试两种实现。我注意到我在打字时留下了init_stristr电话，但它不应该改变太多。bench只是一个简单的shell脚本：

##代码##

Answer 2

回答by Nitin Chhabra

You can use StrStrI function which finds the first occurrence of a substring within a string. The comparison is not case-sensitive. Don't forget to include its header - Shlwapi.h. Check this out: http://msdn.microsoft.com/en-us/library/windows/desktop/bb773439(v=vs.85).aspx

您可以使用 StrStrI 函数来查找字符串中子字符串的第一次出现。比较不区分大小写。不要忘记包含其标题 - Shlwapi.h。看看这个：http: //msdn.microsoft.com/en-us/library/windows/desktop/bb773439(v=vs.85).aspx

Answer 3

回答by Suzuki Keem

For platform independent use:

对于平台独立使用：

##代码##

Answer 4

回答by deft_code

use boost string algo. It is available, cross platform, and only a header file (no library to link in). Not to mention that you should be using boost anyway.

使用boost 字符串算法。它是可用的，跨平台的，并且只有一个头文件（没有要链接的库）。更不用说无论如何你应该使用boost。

##代码##

Answer 5

回答by Chris Young

Why do you use _strlwr(string); in init_stristr()? It's not a standard function. Presumably it's for locale support, but as it's not standard, I'd just use:

为什么要使用 _strlwr(string); 在 init_strstr() 中？这不是标准功能。大概是为了语言环境支持，但由于它不是标准的，我只使用：

##代码##

Answer 6

回答by quinmars

I'd advice you to take some of the common strcasestr implementation that already exists. For example of glib, glibc, OpenBSD, FreeBSD, etc. You can search for more with google.com/codesearch. You can then make some performance measurements and compare the different implementation.

我建议您采用一些已经存在的常见 strcasestr 实现。例如 glib、glibc、OpenBSD、FreeBSD 等。你可以用 google.com/codesearch 搜索更多。然后，您可以进行一些性能测量并比较不同的实现。

Answer 7

回答by Jo?o Augusto

Assuming both input strings are already lowercase.

假设两个输入字符串都已经是小写的。

##代码##

You could also, try using masks... if for example most of the strings you are going to compare only contains chars from a to z, maybe it's worth to do something like this.

您也可以尝试使用掩码...例如，如果您要比较的大多数字符串仅包含从 a 到 z 的字符，那么也许值得做这样的事情。

##代码##

Then...

然后...

##代码##

Answer 8

回答by Lefteris E

This will not consider the locale, but If you can change the IS_ALPHA and TO_UPPER you can make it to consider it.

这不会考虑语言环境，但是如果您可以更改 IS_ALPHA 和 TO_UPPER，您可以使其考虑它。

##代码##

Answer 9

回答by plinth

If you want to shed CPU cycles, you might consider this - let's assume that we're dealing with ASCII and not Unicode.

如果你想减少 CPU 周期，你可以考虑这个 - 假设我们正在处理 ASCII 而不是 Unicode。

Make a static table with 256 entries. Each entry in the table is 256 bits.

制作一个包含 256 个条目的静态表。表中的每个条目都是 256 位。

To test whether or not two characters are equal, you do something like this:

要测试两个字符是否相等，您可以执行以下操作：

##代码##

To build the table, you set a bit everywhere in table[char1] where you consider it a match for char2. So in building the table you would set the bits at the index for 'a' and 'A' in the 'a'th entry (and the 'A'th entry).

要构建表，您在 table[char1] 中的任何地方设置一点，您认为它与 char2 匹配。因此，在构建表时，您将在第 'a' 个条目（和第 'A' 个条目）中为 'a' 和 'A' 的索引设置位。

Now this is going to be slowish to do the bit lookup (bit look up will be a shift, mask and add most likely), so you could use instead a table of bytes so you use 8 bits to represent 1 bit. This will take 32K - so hooray - you've hit a time/space trade-off! We might want to make the table more flexible, so let's say we do this instead - the table will define congruences instead.

现在进行位查找会很慢（位查找将很可能是移位、掩码和添加），因此您可以使用字节表来代替，因此您可以使用 8 位来表示 1 位。这将需要 32K - 所以万岁 - 你已经达到了时间/空间的权衡！我们可能想让表格更灵活，所以假设我们这样做 - 表格将改为定义同余。

Two characters are considered congruent if and only if there is a function that defines them as equivalent. So 'A' and 'a' are congruent for case insensitivity. 'A', 'À', 'Á' and 'Â' are congruent for diacritical insensitivity.

当且仅当存在将它们定义为等价的函数时，两个字符才被认为是全等的。所以 'A' 和 'a' 不区分大小写是一致的。'A'、'À'、'Á' 和 'Â' 与变音不敏感是一致的。

So you define bitfields that correspond to your congruencies

所以你定义了对应于你的一致性的位域

##代码##

Then your test is something like this:

那么你的测试是这样的：

##代码##

This kind of bit fiddling with ginormous tables is the heart of ctype, by the by.

顺便说一句，这种摆弄巨大表格的位是 ctype 的核心。

Answer 10

回答by Jonathan Leffler

If you can control the needle string so that it is always in lower case, then you can write a modified version of stristr() to avoid the lookups for that, and thus speed up the code. It isn't as general, but it can be faster - slightly faster. Similar comments apply to the haystack, but you are more likely to be reading the haystack from sources outside your control for you cannot be certain that the data meets the requirement.

如果您可以控制针字符串使其始终为小写，那么您可以编写 stristr() 的修改版本以避免查找，从而加快代码速度。它不是一般的，但它可以更快 - 稍微快一点。类似的评论适用于干草堆，但您更有可能从您无法控制的来源读取干草堆，因为您无法确定数据是否满足要求。

Whether the gain in performance is worth it is another question altogether. For 99% of applications, the answer is "No, it is not worth it". Your application might be one of the tiny minority where it matters. More likely, it is not.

性能的提升是否值得完全是另一个问题。对于 99% 的应用程序，答案是“不，不值得”。您的应用程序可能是重要的少数应用程序之一。更有可能的是，事实并非如此。

在 C/C++ 中进行不区分大小写的子字符串搜索的最快方法？

提问by Anders Sandvig

Note

笔记

回答by freespace

回答by Nitin Chhabra

回答by Suzuki Keem

回答by deft_code

回答by Chris Young

回答by quinmars

回答by Jo?o Augusto

回答by Lefteris E

回答by plinth

回答by Jonathan Leffler

相关推荐

最近更新

标签

在 C/C++ 中进行不区分大小写的子字符串搜索的最快方法？

提问by Anders Sandvig

Note

笔记

回答by freespace

回答by Nitin Chhabra

回答by Suzuki Keem

回答by deft_code

回答by Chris Young

回答by quinmars

回答by Jo?o Augusto

回答by Lefteris E

回答by plinth

回答by Jonathan Leffler

相关推荐

C++ OpenCV 如何在使用单个静态图像时将速度矢量绘制为箭头

垂直管道 ( | ) 在 C++ 中是什么意思？

C++ 如何将未知大小的二维数组传递给函数

C++ CMake 可以在 Mac OS X 上指定基础 SDK 吗？

相关推荐

最近更新

标签