C语言 为什么glibc的strlen需要这么复杂才能快速运行?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/57650895/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-02 10:46:46  来源:igfitidea点击:

Why does glibc's strlen need to be so complicated to run quickly?

coptimizationglibcportabilitystrlen

提问by

I was looking through the strlencode hereand I was wondering if the optimizations used in the code are really needed? For example, why wouldn't something like the following work equally good or better?

我正在查看这里strlen代码,我想知道代码中使用的优化是否真的需要?例如,为什么像下面这样的东西不能同样好或更好?

unsigned long strlen(char s[]) {
    unsigned long i;
    for (i = 0; s[i] != '
/* Copyright (C) 1991, 1993, 1997, 2000, 2003 Free Software Foundation, Inc.
   This file is part of the GNU C Library.
   Written by Torbjorn Granlund ([email protected]),
   with help from Dan Sahlin ([email protected]);
   commentary by Jim Blandy ([email protected]).

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

#include <string.h>
#include <stdlib.h>

#undef strlen

/* Return the length of the null-terminated string STR.  Scan for
   the null terminator quickly by testing four bytes at a time.  */
size_t
strlen (str)
     const char *str;
{
  const char *char_ptr;
  const unsigned long int *longword_ptr;
  unsigned long int longword, magic_bits, himagic, lomagic;

  /* Handle the first few characters by reading one character at a time.
     Do this until CHAR_PTR is aligned on a longword boundary.  */
  for (char_ptr = str; ((unsigned long int) char_ptr
            & (sizeof (longword) - 1)) != 0;
       ++char_ptr)
    if (*char_ptr == '
/* Copyright (C) 1991, 1993, 1997, 2000, 2003 Free Software Foundation, Inc.
   This file is part of the GNU C Library.
   Written by Torbjorn Granlund ([email protected]),
   with help from Dan Sahlin ([email protected]);
   commentary by Jim Blandy ([email protected]).

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

#include <string.h>
#include <stdlib.h>

#undef strlen

/* Return the length of the null-terminated string STR.  Scan for
   the null terminator quickly by testing four bytes at a time.  */
size_t
strlen (str)
     const char *str;
{
  const char *char_ptr;
  const unsigned long int *longword_ptr;
  unsigned long int longword, magic_bits, himagic, lomagic;

  /* Handle the first few characters by reading one character at a time.
     Do this until CHAR_PTR is aligned on a longword boundary.  */
  for (char_ptr = str; ((unsigned long int) char_ptr
            & (sizeof (longword) - 1)) != 0;
       ++char_ptr)
    if (*char_ptr == '
size_t stupid_strlen(const char s[]) {
    size_t i;
    for (i=0; s[i] != '
char *str = "hello world";  // or
char array[] = "hello world";
'; i++) ; return i; }
') return char_ptr - str; /* All these elucidatory comments refer to 4-byte longwords, but the theory applies equally well to 8-byte longwords. */ longword_ptr = (unsigned long int *) char_ptr; /* Bits 31, 24, 16, and 8 of this number are zero. Call these bits the "holes." Note that there is a hole just to the left of each byte, with an extra at the end: bits: 01111110 11111110 11111110 11111111 bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD The 1-bits make sure that carries propagate to the next 0-bit. The 0-bits provide holes for carries to fall into. */ magic_bits = 0x7efefeffL; himagic = 0x80808080L; lomagic = 0x01010101L; if (sizeof (longword) > 4) { /* 64-bit version of the magic. */ /* Do the shift in two steps to avoid a warning if long has 32 bits. */ magic_bits = ((0x7efefefeL << 16) << 16) | 0xfefefeffL; himagic = ((himagic << 16) << 16) | himagic; lomagic = ((lomagic << 16) << 16) | lomagic; } if (sizeof (longword) > 8) abort (); /* Instead of the traditional loop which tests each character, we will test a longword at a time. The tricky part is testing if *any of the four* bytes in the longword in question are zero. */ for (;;) { /* We tentatively exit the loop if adding MAGIC_BITS to LONGWORD fails to change any of the hole bits of LONGWORD. 1) Is this safe? Will it catch all the zero bytes? Suppose there is a byte with all zeros. Any carry bits propagating from its left will fall into the hole at its least significant bit and stop. Since there will be no carry from its most significant bit, the LSB of the byte to the left will be unchanged, and the zero will be detected. 2) Is this worthwhile? Will it ignore everything except zero bytes? Suppose every byte of LONGWORD has a bit set somewhere. There will be a carry into bit 8. If bit 8 is set, this will carry into bit 16. If bit 8 is clear, one of bits 9-15 must be set, so there will be a carry into bit 16. Similarly, there will be a carry into bit 24. If one of bits 24-30 is set, there will be a carry into bit 31, so all of the hole bits will be changed. The one misfire occurs when bits 24-30 are clear and bit 31 is set; in this case, the hole at bit 31 is not changed. If we had access to the processor carry flag, we could close this loophole by putting the fourth hole at bit 32! So it ignores everything except 128's, when they're aligned properly. */ longword = *longword_ptr++; if ( #if 0 /* Add MAGIC_BITS to LONGWORD. */ (((longword + magic_bits) /* Set those bits that were unchanged by the addition. */ ^ ~longword) /* Look at only the hole bits. If any of the hole bits are unchanged, most likely one of the bytes was a zero. */ & ~magic_bits) #else ((longword - lomagic) & himagic) #endif != 0) { /* Which of the bytes was the zero? If none of them were, it was a misfire; continue the search. */ const char *cp = (const char *) (longword_ptr - 1); if (cp[0] == 0) return cp - str; if (cp[1] == 0) return cp - str + 1; if (cp[2] == 0) return cp - str + 2; if (cp[3] == 0) return cp - str + 3; if (sizeof (longword) > 4) { if (cp[4] == 0) return cp - str + 4; if (cp[5] == 0) return cp - str + 5; if (cp[6] == 0) return cp - str + 6; if (cp[7] == 0) return cp - str + 7; } } } } libc_hidden_builtin_def (strlen)
') return char_ptr - str; /* All these elucidatory comments refer to 4-byte longwords, but the theory applies equally well to 8-byte longwords. */ longword_ptr = (unsigned long int *) char_ptr; /* Bits 31, 24, 16, and 8 of this number are zero. Call these bits the "holes." Note that there is a hole just to the left of each byte, with an extra at the end: bits: 01111110 11111110 11111110 11111111 bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD The 1-bits make sure that carries propagate to the next 0-bit. The 0-bits provide holes for carries to fall into. */ magic_bits = 0x7efefeffL; himagic = 0x80808080L; lomagic = 0x01010101L; if (sizeof (longword) > 4) { /* 64-bit version of the magic. */ /* Do the shift in two steps to avoid a warning if long has 32 bits. */ magic_bits = ((0x7efefefeL << 16) << 16) | 0xfefefeffL; himagic = ((himagic << 16) << 16) | himagic; lomagic = ((lomagic << 16) << 16) | lomagic; } if (sizeof (longword) > 8) abort (); /* Instead of the traditional loop which tests each character, we will test a longword at a time. The tricky part is testing if *any of the four* bytes in the longword in question are zero. */ for (;;) { /* We tentatively exit the loop if adding MAGIC_BITS to LONGWORD fails to change any of the hole bits of LONGWORD. 1) Is this safe? Will it catch all the zero bytes? Suppose there is a byte with all zeros. Any carry bits propagating from its left will fall into the hole at its least significant bit and stop. Since there will be no carry from its most significant bit, the LSB of the byte to the left will be unchanged, and the zero will be detected. 2) Is this worthwhile? Will it ignore everything except zero bytes? Suppose every byte of LONGWORD has a bit set somewhere. There will be a carry into bit 8. If bit 8 is set, this will carry into bit 16. If bit 8 is clear, one of bits 9-15 must be set, so there will be a carry into bit 16. Similarly, there will be a carry into bit 24. If one of bits 24-30 is set, there will be a carry into bit 31, so all of the hole bits will be changed. The one misfire occurs when bits 24-30 are clear and bit 31 is set; in this case, the hole at bit 31 is not changed. If we had access to the processor carry flag, we could close this loophole by putting the fourth hole at bit 32! So it ignores everything except 128's, when they're aligned properly. */ longword = *longword_ptr++; if ( #if 0 /* Add MAGIC_BITS to LONGWORD. */ (((longword + magic_bits) /* Set those bits that were unchanged by the addition. */ ^ ~longword) /* Look at only the hole bits. If any of the hole bits are unchanged, most likely one of the bytes was a zero. */ & ~magic_bits) #else ((longword - lomagic) & himagic) #endif != 0) { /* Which of the bytes was the zero? If none of them were, it was a misfire; continue the search. */ const char *cp = (const char *) (longword_ptr - 1); if (cp[0] == 0) return cp - str; if (cp[1] == 0) return cp - str + 1; if (cp[2] == 0) return cp - str + 2; if (cp[3] == 0) return cp - str + 3; if (sizeof (longword) > 4) { if (cp[4] == 0) return cp - str + 4; if (cp[5] == 0) return cp - str + 5; if (cp[6] == 0) return cp - str + 6; if (cp[7] == 0) return cp - str + 7; } } } } libc_hidden_builtin_def (strlen)
'; i++) continue; return i; }

Isn't simpler code better and/or easier for the compiler to optimize?

更简单的代码不是更好和/或更容易让编译器优化吗?

The code of strlenon the page behind the link looks like this:

strlen链接后面页面上的代码如下所示:

int main(void) {
    char buf[12];
    printf("%zu\n", the_strlen(fgets(buf, 12, stdin)));
}
% ./a.out
hello world
=================================================================
==8355==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffffe63a3f8 at pc 0x55fbec46ab6c bp 0x7ffffe63a350 sp 0x7ffffe63a340
READ of size 8 at 0x7ffffe63a3f8 thread T0
    #0 0x55fbec46ab6b in the_strlen (.../a.out+0x1b6b)
    #1 0x55fbec46b139 in main (.../a.out+0x2139)
    #2 0x7f4f0848fb96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
    #3 0x55fbec46a949 in _start (.../a.out+0x1949)

Address 0x7ffffe63a3f8 is located in stack of thread T0 at offset 40 in frame
    #0 0x55fbec46b07c in main (.../a.out+0x207c)

  This frame has 1 object(s):
    [32, 44) 'buf' <== Memory access at offset 40 partially overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism or swapcontext
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (.../a.out+0x1b6b) in the_strlen
Shadow bytes around the buggy address:
  0x10007fcbf420: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf430: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf440: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf450: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf460: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x10007fcbf470: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00[04]
  0x10007fcbf480: f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf490: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf4a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf4b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf4c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==8355==ABORTING

Why does this version run quickly?

为什么这个版本跑得快?

Isn't it doing a lot of unnecessary work?

是不是做了很多不必要的工作?

采纳答案by Antti Haapala

You don'tneed and you should neverwrite code like that - especially if you're not a C compiler / standard library vendor. It is code used to implement strlenwith some very questionable speed hacks and assumptions (that are not tested with assertions or mentioned in the comments):

不会需要你千万别写代码这样的-特别是如果你不是一个C编译器/标准库供应商。它是用于实现strlen一些非常可疑的速度黑客和假设的代码(未使用断言测试或在评论中提及):

  • unsigned longis either 4 or 8 bytes
  • bytes are 8 bits
  • a pointer can be cast to unsigned long longand not uintptr_t
  • one can align the pointer simply by checking that the 2 or 3 lowest order bits are zero
  • one can access a string as unsigned longs
  • one can read past the end of array without any ill effects.
  • unsigned long是 4 或 8 个字节
  • 字节是 8 位
  • 一个指针可以被转换为unsigned long longuintptr_t
  • 可以简单地通过检查 2 或 3 个最低位是否为零来对齐​​指针
  • 可以将字符串作为unsigned longs访问
  • 可以读取数组的末尾而不会产生任何不良影响。

What is more, a good compiler could even replace code written as

更重要的是,一个好的编译器甚至可以替换写成的代码

strlen(const char *char_ptr)
{
  typedef unsigned long __attribute__((may_alias)) aliasing_ulong;

  aliasing_ulong *longword_ptr = (aliasing_ulong *)char_ptr;
  for (;;) {
     unsigned long ulong = *longword_ptr++;  // can safely alias anything
     ...
  }
}

(notice that it has to be a type compatible with size_t) with an inlined version of the compiler builtin strlen, or vectorize the code; but a compiler would be unlikely to be able to optimize the complex version.

(请注意,它必须是与size_t)兼容的类型,并且具有内置编译器的内联版本strlen,或者对代码进行向量化;但是编译器不太可能优化复杂版本。



The strlenfunction is described by C11 7.24.6.3as:

strlen函数在C11 7.24.6.3 中描述为:

Description

  1. The strlenfunction computes the length of the string pointed to by s.

Returns

  1. The strlenfunction returns the number of characters that precede the terminating null character.

描述

  1. strlen函数计算 s 指向的字符串的长度。

退货

  1. strlen函数返回终止空字符之前的字符数。

Now, if the string pointed to by swas in an array of characters just long enough to contain the string and the terminating NUL, the behaviourwill be undefinedif we access the string past the null terminator, for example in

现在,如果指向的字符串s在一个字符数组中,长度刚好足以包含字符串和终止 NUL,那么如果我们访问字符串越过空终止符,行为将是未定义的,例如在

   unsigned long longword;
   memcpy(&longword, char_ptr, sizeof(longword));
   char_ptr += sizeof(longword);

So really the onlyway in fully portable / standards compliant C to implement this correctlyis the way it is written in your question, except for trivial transformations - you can pretend to be faster by unrolling the loop etc, but it still needs to be done one byteat a time.

因此,在完全可移植/符合标准的 C 中正确实现这一点的唯一方法是它在您的问题中的编写方式,除了微不足道的转换 - 您可以通过展开循环等来假装更快,但它仍然需要完成一次一个字节

(As commenters have pointed out, when strict portability is too much of a burden, taking advantage of reasonable or known-safe assumptions is not always a bad thing. Especially in code that's part ofone specific C implementation. But you have to understand the rules before knowing how/when you can bend them.)

(正如评论者所指出的那样,当严格的可移植性负担过重时,利用合理或已知安全的假设并不总是一件坏事。尤其是在作为特定 C 实现的一部分的代码中。但您必须了解在知道如何/何时可以弯曲它们之前,规则。)



The linked strlenimplementation first checks the bytes individually until the pointer is pointing to the natural 4 or 8 byte alignment boundary of the unsigned long. The C standard says that accessing a pointer that is not properly aligned has undefined behaviour, so this absolutely has to be done for the next dirty trick to be even dirtier. (In practice on some CPU architecture other than x86, a misaligned word or doubleword load will fault. C is nota portable assembly language, but this code is using it that way). It's also what makes it possible to read past the end of an object without risk of faulting on implementations where memory protection works in aligned blocks (e.g. 4kiB virtual memory pages).

链接的strlen实现首先单独检查字节,直到指针指向unsigned long. C 标准说访问未正确对齐的指针具有未定义的行为,因此绝对必须这样做,以使下一个肮脏的技巧更加肮脏。(实际上,在 x86 以外的某些 CPU 体系结构上,未对齐的字或双字加载会出错。C不是可移植的汇编语言,但此代码以这种方式使用它)。这也是可以读取对象末尾而不会在内存保护在对齐块(例如 4kiB 虚拟内存页面)中工作的实现中出现故障的风险的原因。

Now comes the dirty part: the code breaksthe promise and reads 4 or 8 8-bit bytes at a time (a long int), and uses a bit trick with unsigned addition to quickly figure out if there were anyzero bytes within those 4 or 8 bytes - it uses a specially crafted number to that would cause the carry bit to change bits that are caught by a bit mask. In essence this would then figure out if any of the 4 or 8 bytes in the mask are zeroes supposedly fasterthan looping through each of these bytes would. Finally there is a loop at the end to figure out whichbyte was the first zero, if any, and to return the result.

现在是肮脏的部分:代码违反了承诺并一次读取 4 或 8 个 8 位字节 (a long int),并使用无符号加法的小技巧快速确定这 4 或 8 个字节中是否有任何零字节字节 - 它使用特制的数字来导致进位位更改位掩码捕获的位。本质上,这将确定掩码中的 4 或 8 个字节中的任何一个是否为零,据说比循环遍历这些字节中的每一个都快。最后有一个循环,以确定哪个字节是第一个零(如果有),并返回结果。

The biggest problem is that in sizeof (unsigned long) - 1times out of sizeof (unsigned long)cases it will read past the end of the string - only if the null byte is in the lastaccessed byte (i.e. in little-endian the most significant, and in big-endian the least significant), does it notaccess the array out of bounds!

最大的问题是,在sizeof (unsigned long) - 1超时的sizeof (unsigned long)情况下,它会读取超出字符串的结束-只有当空字节是在最后访问的字节(即little-endian的最显著,并在大端最低显著) ,会不会越界访问数组!



The code, even though used to implement strlenin a C standard library is badcode. It has several implementation-defined and undefined aspects in it and it should not be used anywhereinstead of the system-provided strlen- I renamed the function to the_strlenhere and added the following main:

即使用于strlen在 C 标准库中实现的代码也是糟糕的代码。它有几个实现定义和未定义的方面,它不应该在任何地方使用而不是系统提供的strlen- 我将函数重命名为the_strlenhere并添加以下内容main

 27 /* Return the length of the null-terminated string STR.  Scan for
 28    the null terminator quickly by testing four bytes at a time.  */

The buffer is carefully sized so that it can hold exactly the hello worldstring and the terminator. However on my 64-bit processor the unsigned longis 8 bytes, so the access to the latter part would exceed this buffer.

缓冲区的大小经过仔细调整,以便它可以准确地保存hello world字符串和终止符。但是在我的 64 位处理器上unsigned long是 8 个字节,所以对后半部分的访问将超过这个缓冲区。

If I now compile with -fsanitize=undefinedand -fsanitize=addressand run the resulting program, I get:

如果我现在编译-fsanitize=undefined-fsanitize=address和运行所产生的程序,我得到:

 73   /* Instead of the traditional loop which tests each character,
 74      we will test a longword at a time.  The tricky part is testing
 75      if *any of the four* bytes in the longword in question are zero.  */

i.e. bad things happened.

即发生了不好的事情。

回答by Peter Cordes

There's been a lot of (slightly or entirely) wrong guesses in comments about some details / background for this.

在对此的一些细节/背景的评论中,有很多(轻微或完全)错误的猜测。

You're looking at glibc's optimized C fallback optimized implementation. (For ISAs that don't have a hand-written asm implementation). Or an old version of that code, which is still in the glibc source tree. https://code.woboq.org/userspace/glibc/string/strlen.c.htmlis a code-browser based on the current glibc git tree. Apparently it is still used by a few mainstream glibc targets, including MIPS. (Thanks @zwol).

您正在查看glibc 的优化 C 回退优化实现。(对于没有手写 asm 实现的 ISA)。或者那个代码的旧版本,它仍然在 glibc 源代码树中。 https://code.woboq.org/userspace/glibc/string/strlen.c.html是一个基于当前 glibc git 树的代码浏览器。显然它仍然被一些主流的 glibc 目标使用,包括 MIPS。(感谢@zwol)。

On popular ISAs like x86 and ARM, glibc uses hand-written asm

在 x86 和 ARM 等流行的 ISA 上,glibc 使用手写的 asm

So the incentive to change anything about this code is lower than you might think.

因此,更改有关此代码的任何内容的动机比您想象的要低。

This bithack code (https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord) isn't what actually runs on your server/desktop/laptop/smartphone. It's better than a naive byte-at-a-time loop, but even this bithack is pretty bad compared to efficient asm for modern CPUs(especially x86 where AVX2 SIMD allows checking 32 bytes with a couple instructions, allowing 32 to 64 bytes per clock cycle in the main loop if data is hot in L1d cache on modern CPUs with 2/clock vector load and ALU throughput. i.e. for medium-sized strings where startup overhead doesn't dominate.)

这个 bithack 代码 ( https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord) 并不是在您的服务器/桌面/笔记本电脑/智能手机上实际运行的代码。它比一个简单的一次字节循环要好,但与现代 CPU 的高效 asm 相比即使这种比特黑客也很糟糕(尤其是 x86,其中 AVX2 SIMD 允许使用几条指令检查 32 个字节,每个时钟允许 32 到 64 个字节如果数据在具有 2/时钟向量负载和 ALU 吞吐量的现代 CPU 上的 L1d 缓存中数据很热,则在主循环中循环。即对于启动开销不占主导地位的中型字符串。)

glibc uses dynamic linking tricks to resolve strlento an optimal version for your CPU, so even within x86 there's an SSE2 version(16-byte vectors, baseline for x86-64) and an AVX2 version(32-byte vectors).

glibc 使用动态链接技巧来strlen为您的 CPU解析最佳版本,因此即使在 x86 中,也有SSE2 版本(16 字节向量,x86-64 的基线)和AVX2 版本(32 字节向量)。

x86 has efficient data transfer between vector and general-purpose registers, which makes it uniquely(?) good for using SIMD to speed up functions on implicit-length strings where the loop control is data dependent. pcmpeqb/ pmovmskbmakes it possible to testing 16 separate bytes at a time.

x86 在向量和通用寄存器之间具有高效的数据传输,这使得它独特(?)非常适合使用 SIMD 来加速隐式长度字符串上的函数,其中循环控制依赖于数据。 pcmpeqb/pmovmskb可以一次测试 16 个单独的字节。

glibc has an AArch64 version like that using AdvSIMD, and a version for AArch64 CPUs where vector->GP registers stalls the pipeline, so it does actually use this bithack. But uses count-leading-zeros to find the byte-within-register once it gets a hit, and takes advantage of AArch64's efficient unaligned accesses after checking for page-crossing.

glibc 有一个类似使用 AdvSIMD的 AArch64 版本,还有一个用于 AArch64 CPU 的版本,其中 vector->GP 寄存器会暂停管道,所以它实际上使用了这个 bithack。但是使用计数前导零在命中后找到寄存器内的字节,并在检查页面交叉后利用 AArch64 的高效未对齐访问。

Also related: Why is this code 6.5x slower with optimizations enabled?has some more details about what's fast vs. slow in x86 asm for strlenwith a large buffer and a simple asm implementation that might be good for gcc to know how to inline. (Some gcc versions unwisely inline rep scasbwhich is very slow, or a 4-byte-at-a-time bithack like this. So GCC's inline-strlen recipe needs updating or disabling.)

还相关:为什么启用优化后此代码慢 6.5 倍?有更多关于 x86 asm 中什么是快与慢的详细信息,因为它strlen有一个大缓冲区和一个简单的 asm 实现,这可能有助于 gcc 知道如何内联。(一些 gcc 版本不明智地内联rep scasb,这非常慢,或者像这样一次 4 字节的比特黑客。所以 GCC 的内联-strlen 配方需要更新或禁用。)

Asm doesn't have C-style "undefined behaviour"; it's safe to access bytes in memory however you like, and an aligned load that includes any valid bytes can't fault. Memory protection happens with aligned-page granularity; aligned accesses narrower than that can't cross a page boundary. Is it safe to read past the end of a buffer within the same page on x86 and x64?The same reasoning applies to the machine-code that this C hack gets compilers to create for a stand-alone non-inline implementation of this function.

Asm 没有 C 风格的“未定义行为”;随意访问内存中的字节是安全的,并且包含任何有效字节的对齐加载不会出错。内存保护以对齐的页面粒度发生;比这更窄的对齐访问不能跨越页面边界。 在 x86 和 x64 上的同一页面内读取缓冲区的末尾是否安全?同样的推理适用于这个 C hack 让编译器为这个函数的独立非内联实现创建的机器代码。

When a compiler emits code to call an unknown non-inline function, it has to assume that function modifies any/all global variables and any memory it might possibly have a pointer to. i.e. everything except locals that haven't had their address escape have to be in sync in memory across the call. This applies to functions written in asm, obviously, but also to library functions. If you don't enable link-time optimization, it even applies to separate translation units (source files).

当编译器发出代码来调用未知的非内联函数时,它必须假设该函数修改了任何/所有全局变量以及它可能具有指向的任何内存。即除了没有地址转义的本地人之外的所有内容都必须在整个调用过程中在内存中保持同步。显然,这适用于用 asm 编写的函数,但也适用于库函数。如果您不启用链接时优化,它甚至适用于单独的翻译单元(源文件)。



Why this is safe as part of glibcbut nototherwise.

为什么这作为 glibc 的一部分是安全的不是其他的。

The most important factor is that this strlencan't inline into anything else.It's not safe for that; it contains strict-aliasing UB(reading chardata through an unsigned long*). char*is allowed to alias anything else but the reverse is nottrue.

最重要的因素是这strlen不能内联到其他任何东西中。这是不安全的;它包含严格别名 UBchar通过读取数据unsigned long*)。 char*允许别名任何东西,但相反的是不是真的

This is a library function for an ahead-of-time compiled library (glibc). It won't get inlined with link-time-optimization into callers.This means it just has to compile to safe machine code for a stand-alone version of strlen. It doesn't have to be portable / safe C.

这是一个用于提前编译库 (glibc) 的库函数。 它不会与调用者的链接时间优化一起内联。这意味着它只需要编译为独立版本的strlen. 它不必是便携/安全的 C.

The GNU C library only has to compile with GCC. Apparently it's not supportedto compile it with clang or ICC, even though they support GNU extensions. GCC is an ahead-of-time compilers that turn a C source file into an object file of machine code. Not an interpreter, so unless it inlines at compile time, bytes in memory are just bytes in memory. i.e. strict-aliasing UB isn't dangerous when the accesses with different types happen in different functions that don't inline into each other.

GNU C 库只需要使用 GCC 进行编译。显然不支持用 clang 或 ICC 编译它,即使它们支持 GNU 扩展。GCC 是一种提前编译器,可将 C 源文件转换为机器代码的目标文件。不是解释器,所以除非它在编译时内联,否则内存中的字节只是内存中的字节。即,当不同类型的访问发生在不相互内联的不同函数中时,严格别名 UB 并不危险。

Remember that strlen's behaviour is defined bythe ISO C standard. That function name specifically is part ofthe implementation. Compilers like GCC even treat the name as a built-in function unless you use -fno-builtin-strlen, so strlen("foo")can be a compile-time constant 3. The definition in the library is onlyused when gcc decides to actually emit a call to it instead of inlining its own recipe or something.

请记住,它strlen的行为是ISO C 标准定义。该函数名称特别是实现的一部分。除非您使用-fno-builtin-strlen,否则像 GCC 这样的编译器甚至将名称视为内置函数,因此strlen("foo")可以是编译时常量3。库中的定义在 gcc 决定实际发出对它的调用而不是内联它自己的配方或其他东西时使用。

When UB isn't visible to the compilerat compile time, you get sane machine code. The machine code has to work for the no-UB case, and even if you wantedto, there's no way for the asm to detect what types the caller used to put data into the pointed-to memory.

编译器在编译时看不到 UB时,您将获得正常的机器代码。本机代码必须工作,为无UB的情况下,即使你想要到,有没有办法为ASM检测类型调用者用什么来把数据放到指向的内存。

Glibc is compiled to a stand-alone static or dynamic library that can't inline with link-time optimization. glibc's build scripts don't create "fat" static libraries containing machine code + gcc GIMPLE internal representation for link-time optimization when inlining into a program. (i.e. libc.awon't participate in -fltolink-time optimization into the main program.) Building glibc that way would be potentially unsafe on targets that actually use this .c.

Glibc 被编译成一个独立的静态或动态库,不能内联链接时优化。glibc 的构建脚本不会创建包含机器代码 + gcc GIMPLE 内部表示的“胖”静态库,用于在内联到程序中时进行链接时优化。(即libc.a不会参与-flto到主程序的链接时优化。)以这种方式构建 glibc在实际使用 this 的目标上.c可能是不安全的。

In fact as @zwol comments, LTO can't be used when building glibc itself, because of "brittle" code like this which could break if inlining between glibc source files was possible. (There are some internal uses of strlen, e.g. maybe as part of the printfimplementation)

事实上,正如@zwol 评论的那样,在构建 glibc本身时不能使用 LTO ,因为像这样的“脆弱”代码可能会在 glibc 源文件之间内联时中断。(有一些内部用途strlen,例如可能作为printf实现的一部分)



This strlenmakes some assumptions:

strlen提出了一些假设:

  • CHAR_BITis a multiple of 8. True on all GNU systems. POSIX 2001 even guarantees CHAR_BIT == 8. (This looks safe for systems with CHAR_BIT= 16or 32, like some DSPs; the unaligned-prologue loop will always run 0 iterations if sizeof(long) = sizeof(char) = 1because every pointer is always aligned and p & sizeof(long)-1is always zero.) But if you had a non-ASCII character set where chars are 9 or 12 bits wide, 0x8080...is the wrong pattern.
  • (maybe) unsigned longis 4 or 8 bytes. Or maybe it would actually work for any size of unsigned longup to 8, and it uses an assert()to check for that.
  • CHAR_BIT是 8 的倍数。在所有 GNU 系统上都是如此。POSIX 2001 甚至保证CHAR_BIT == 8。(这对于带有CHAR_BIT= 16or 的系统来说是安全的32,就像一些 DSP;如果sizeof(long) = sizeof(char) = 1因为每个指针总是对齐并且p & sizeof(long)-1总是零,那么未对齐的序言循环将总是运行 0 次迭代。)但是如果你有一个非 ASCII 字符集,其中字符是 9或 12 位宽,0x8080...是错误的模式。
  • (也许)unsigned long是 4 或 8 个字节。或者也许它实际上适用于unsigned long最多 8 的任何大小,并且它使用 anassert()来检查它。

Those two aren't possible UB, they're just non-portability to some C implementations. This code is (or was) part ofthe C implementation on platforms where it does work, so that's fine.

这两个不可能是 UB,它们只是某些 C 实现的不可移植性。这段代码是(或曾经)在它工作的平台上的 C 实现的一部分,所以这很好。

The next assumption is potential C UB:

下一个假设是潜在的 C UB:

  • An aligned load that contains any valid bytes can't fault, and is safe as long as you ignore the bytes outside the object you actually want. (True in asm on every GNU systems, and on all normal CPUs because memory protection happens with aligned-page granularity. Is it safe to read past the end of a buffer within the same page on x86 and x64?safe in C when the UB isn't visible at compile time. Without inlining, this is the case here. The compiler can't prove that reading past the first 0is UB; it could be a C char[]array containing {1,2,0,3}for example)
  • 包含任何有效字节的对齐加载不会出错,并且只要您忽略实际想要的对象之外的字节,它就是安全的。(在每个 GNU 系统和所有普通 CPU 上的 asm 中都是如此,因为内存保护发生在对齐的页面粒度上。 在 x86 和 x64 上的同一页面内读取缓冲区的末尾是否安全?当 UB 时在 C 中是安全的在编译时不可见。没有内联,这里就是这种情况。编译器无法证明读过第一个0是 UB;它可能是一个 Cchar[]数组{1,2,0,3},例如)

That last point is what makes it safe to read past the end of a C object here. That is pretty much safe even when inlining with current compilers because I think they don't currently treat that implying a path of execution is unreachable. But anyway, the strict aliasing is already a showstopper if you ever let this inline.

最后一点使得在这里可以安全地读取 C 对象的末尾。即使与当前的编译器内联,这也非常安全,因为我认为他们目前没有处理暗示执行路径无法访问的情况。但无论如何,如果你让这个内联,严格的别名已经是一个阻碍。

Then you'd have problems like the Linux kernel's old unsafe memcpyCPP macrothat used pointer-casting to unsigned long(gcc, strict-aliasing, and horror stories).

然后你会遇到一些问题,比如 Linux 内核的旧的不安全memcpyCPP 宏,它使用指针转换到unsigned longgcc、严格别名和恐怖故事)。

This strlendates back to the era when you could get away with stuff like that in general; it used to be pretty much safe without the "only when not inlining" caveat before GCC3.

strlen可以追溯到一般情况下你可以摆脱这样的东西的时代;在 GCC3 之前,如果没有“仅在不内联时”警告,它过去非常安全。



UB that's only visible when looking across call/ret boundaries can't hurt us. (e.g. calling this on a char buf[]instead of on an array of unsigned long[]cast to a const char*). Once the machine code is set in stone, it's just dealing with bytes in memory. A non-inline function call has to assume that the callee reads any/all memory.

仅在查看呼叫/返回边界时才可见的 UB 不会伤害我们。(例如,在 achar buf[]上调用它,而不是在转换unsigned long[]为 a的数组上调用const char*)。一旦机器代码一成不变,它就只是处理内存中的字节。非内联函数调用必须假设被调用者读取任何/所有内存。



Writing this safely, without strict-aliasing UB

安全地写这个,没有严格别名 UB

The GCC type attribute may_aliasgives a type the same alias-anything treatment as char*. (Suggested by @KonradBorowsk). GCC headers currently use it for x86 SIMD vector types like __m128iso you can always safely do _mm_loadu_si128( (__m128i*)foo ). (See Is `reinterpret_cast`ing between hardware vector pointer and the corresponding type an undefined behavior?for more details about what this does and doesn't mean.)

GCC类型属性may_alias给出了一个类型相同的别名,任何的待遇char*。(由@KonradBorowsk 建议)。GCC 标头目前将它用于 x86 SIMD 向量类型,__m128i因此您始终可以安全地执行_mm_loadu_si128( (__m128i*)foo ). (请参阅硬件向量指针和相应类型之间的`reinterpret_cast`ing 是否是未定义的行为?有关这意味着什么和不意味着什么的更多详细信息。)

...
#include <string.h>

size_t
strlen(const char *str)
{
    const char *s;

    for (s = str; *s; ++s)
        ;
    return (s - str);
}

DEF_STRONG(strlen);

You could also use aligned(1)to express a type with alignof(T) = 1.
typedef unsigned long __attribute__((may_alias, aligned(1))) unaligned_aliasing_ulong;

你也可以用aligned(1)来表达一个类型alignof(T) = 1
typedef unsigned long __attribute__((may_alias, aligned(1))) unaligned_aliasing_ulong;

A portable way to express an aliasing load in ISO is with memcpy, which modern compilers do know how to inline as a single load instruction. e.g.

在 ISO 中表达别名加载的一种可移植方式是 withmemcpy,现代编译器确实知道如何将其作为单个加载指令内联。例如

##代码##

This also works for unaligned loads because memcpyworks as-if by char-at-a-time access. But in practice modern compilers understand memcpyvery well.

这也适用于未对齐的加载,因为memcpy就像通过char一次访问一样工作。但在实践中,现代编译器memcpy非常理解。

The danger here is that if GCC doesn't knowfor sure that char_ptris word-aligned, it won't inline it on some platforms that might not support unaligned loads in asm. e.g. MIPS before MIPS64r6, or older ARM. If you got an actual function call to memcpyjust to load a word (and leave it in other memory), that would be a disaster. GCC can sometimes see when code aligns a pointer. Or after the char-at-a-time loop that reaches a ulong boundary you could use
p = __builtin_assume_aligned(p, sizeof(unsigned long));

这里的危险是,如果GCC不知道的肯定char_ptr是字对齐的,它不会内联它可能不支持未对齐载荷ASM一些平台。例如 MIPS64r6 之前的 MIPS,或更旧的 ARM。如果你有一个实际的函数调用memcpy只是为了加载一个单词(并将它留在其他内存中),那将是一场灾难。GCC 有时可以看到代码何时对齐指针。或者在达到 ulong 边界的一次字符循环之后,您可以使用
p = __builtin_assume_aligned(p, sizeof(unsigned long));

This doesn't avoid the read-past-the-object possible UB, but with current GCC that's not dangerous in practice.

这并不能避免读取过去对象可能的 UB,但使用当前的 GCC 在实践中并不危险。



Why hand-optimized C source is necessary: current compilers aren't good enough

为什么需要手工优化的 C 源代码:当前的编译器不够好

Hand-optimized asm can be even better when you want every last drop of performance for a widely-used standard library function. Especially for something like memcpy, but also strlen. In this case it wouldn't be much easier to use C with x86 intrinsics to take advantage of SSE2.

当您希望广泛使用的标准库函数的每一次性能下降时,手工优化的 asm 可能会更好。尤其对于喜欢的东西memcpy,也strlen。在这种情况下,将 C 与 x86 内在函数结合使用以利用 SSE2 并不容易。

But here we're just talking about a naive vs. bithack C version without any ISA-specific features.

但在这里我们只是在谈论没有任何 ISA 特定功能的 naive 与 bithack C 版本。

(I think we can take it as a given that strlenis widely enough used that making it run as fast as possible is important. So the question becomes whether we can get efficient machine code from simpler source. No, we can't.)

(我认为我们可以把它当作一个已经strlen被广泛使用的给定,让它尽可能快地运行很重要。所以问题变成了我们是否可以从更简单的源中获得高效的机器代码。不,我们不能。)

Current GCC and clang are not capable of auto-vectorizing loops where the iteration count isn't known ahead of the first iteration. (e.g. it has to be possible to check if the loop will run at least 16 iterations beforerunning the first iteration.) e.g. autovectorizing memcpy is possible (explicit-length buffer) but not strcpy or strlen (implicit-length string), given current compilers.

当前的 GCC 和 clang 无法自动矢量化循环,其中迭代计数在第一次迭代之前是未知的。(例如,必须可以运行第一次迭代之前检查循环是否至少运行 16 次迭代。)例如,自动向量化 memcpy 是可能的(显式长度缓冲区)但不是 strcpy 或 strlen(隐式长度字符串),给定当前编译器。

That includes search loops, or any other loop with a data-dependent if()breakas well as a counter.

这包括搜索循环,或任何其他具有数据相关if()break和计数器的循环。

ICC (Intel's compiler for x86) can auto-vectorize some search loops, but still only makes naive byte-at-a-time asm for a simple / naive C strlenlike OpenBSD's libc uses. (Godbolt). (From @Peske's answer).

ICC(Intel 的 x86 编译器)可以自动矢量化一些搜索循环,但仍然只为strlen像 OpenBSD 的 libc 使用的简单/天真的 C 生成天真的一次字节 asm 。(天马行空)。(来自@Peske 的回答)。

A hand-optimized libc strlenis necessary for performance with current compilers. Going 1 byte at a time (with unrolling maybe 2 bytes per cycle on wide superscalar CPUs) is pathetic when main memory can keep up with about 8 bytes per cycle, and L1d cache can deliver 16 to 64 per cycle. (2x 32-byte loads per cycle on modern mainstream x86 CPUs since Haswell and Ryzen. Not counting AVX512 which can reduce clock speeds just for using 512-bit vectors; which is why glibc probably isn't in a hurry to add an AVX512 version. Although with 256-bit vectors, AVX512VL + BW masked compare into a mask and ktestor kortestcould make strlenmore hyperthreading friendly by reducing its uops / iteration.)

手动优化的 libcstrlen是当前编译器性能所必需的。一次处理 1 个字节(在宽超标量 CPU 上每个周期可能展开 2 个字节)是可悲的,因为主内存可以跟上每个周期大约 8 个字节,而 L1d 缓存每个周期可以提供 16 到 64 个字节。(自 Haswell 和 Ryzen 以来,现代主流 x86 CPU 上每周期 2x 32 字节负载。不包括 AVX512,它可以降低时钟速度仅用于使用 512 位向量;这就是为什么 glibc 可能不急于添加 AVX512 版本.尽管使用 256 位向量,AVX512VL + BW 掩码比较为掩码,ktest或者kortest可以strlen通过减少其 uops/迭代来使超线程更加友好。)

I'm including non-x86 here, that's the "16 bytes". e.g. most AArch64 CPUs can do at least that, I think, and some certainly more. And some have enough execution throughput for strlento keep up with that load bandwidth.

我在这里包括非 x86,即“16 字节”。例如,大多数 AArch64 CPU 至少可以做到这一点,我认为,当然还有一些。有些有足够的执行吞吐量strlen来跟上负载带宽。

Of course programs that work with large strings should usually keep track of lengths to avoid having to redo finding the length of implicit-length C strings very often. But short to medium length performance still benefits from hand-written implementations, and I'm sure some programs do end up using strlen on medium-length strings.

当然,处理大字符串的程序通常应该跟踪长度以避免不得不经常重做查找隐式长度的 C 字符串的长度。但是短到中等长度的性能仍然受益于手写的实现,我相信有些程序最终会在中等长度的字符串上使用 strlen 。

回答by Timothy Jones

It is explained in the comments in the file you linked:

您链接的文件的注释中对此进行了解释:

##代码##

and:

和:

##代码##

In C, it is possible to reason in detail about the efficiency.

在 C 中,可以对效率进行详细推理。

It is less efficient to iterate through individual characters looking for a null than it is to test more than one byte at a time, as this code does.

遍历单个字符以查找空值比一次测试多个字节的效率低,就像这段代码一样。

The additional complexity comes from needing to ensure that the string under test is aligned in the right place to start testing more than one byte at a time (along a longword boundary, as described in the comments), and from needing to ensure that the assumptions about the sizes of the datatypes are not violated when the code is used.

额外的复杂性来自需要确保被测字符串在正确的位置对齐以一次开始测试多个字节(沿着长字边界,如注释中所述),以及需要确保假设使用代码时不违反数据类型的大小。

In most(but not all) modern software development, this attention to efficiency detail is not necessary, or not worth the cost of extra code complexity.

大多数(但不是全部)现代软件开发中,这种对效率细节的关注是不必要的,或者不值得额外代码复杂性的成本。

One place where it does make sense to pay attention to efficiency like this is in standard libraries, like the example you linked.

像这样关注效率确实有意义的一个地方是在标准库中,例如您链接的示例。



If you want to read more about word boundaries, see this question, and this excellent wikipedia page

如果你想阅读更多关于单词边界的信息,请参阅这个问题这个优秀的维基百科页面

回答by Peschke

In addition to the great answers here, I want to point out that the code linked in the question is for GNU's implementation of strlen.

除了这里的好答案之外,我想指出问题中链接的代码是针对 GNU 的strlen.

The OpenBSD implementation of strlenis very similar to the code proposed in the question. The complexity of an implementation is determined by the author.

OpenBSD 实现strlen与问题中提出的代码非常相似。实现的复杂性由作者决定。

##代码##

EDIT: The OpenBSD code I linked above looks to be a fallback implementation for ISAs that don't have there own asm implementation. There are different implementations of strlendepending on architecture. The code for amd64 strlen, for example, is asm. Similar to PeterCordes' comments/answerpointing out that the non-fallback GNU implementations are asm as well.

编辑:我上面链接的 OpenBSD 代码看起来是没有自己的 asm 实现的 ISA 的后备实现。strlen根据架构有不同的实现。例如,amd64strlen的代码是 asm。类似于 PeterCordes 的评论/回答指出非回退 GNU 实现也是 asm。

回答by Konrad Borowski

In short, this is a performance optimization the standard library can do by knowing what compiler it is compiled with - you shouldn't write code like this, unless you are writing a standard library and can depend on a specific compiler. Specifically, it's processing alignment number of bytes at the same time - 4 on 32-bit platforms, 8 on 64-bit platforms. This means it can be 4 or 8 times faster than na?ve byte iteration.

简而言之,这是标准库可以通过知道它是用什么编译器编译的来实现的性能优化 - 您不应该编写这样的代码,除非您正在编写标准库并且可以依赖特定的编译器。具体来说,它同时处理对齐字节数 - 32 位平台上为 4,64 位平台上为 8。这意味着它可以比朴素字节迭代快 4 或 8 倍。

To explain how does this work, consider the following image. Assume the 32-bit platform here (4 bytes alignment).

为了解释这是如何工作的,请考虑下图。假设这里是 32 位平台(4 字节对齐)。

Let's say that the letter "H" of "Hello, world!" string was provided as an argument for strlen. Because the CPU likes having things aligned in memory (ideally, address % sizeof(size_t) == 0), the bytes before the alignment are processed byte-by-byte, using slow method.

假设“Hello, world!”的字母“H” 字符串作为参数提供strlen。因为 CPU 喜欢在内存中对齐(理想情况下,address % sizeof(size_t) == 0),对齐前的字节使用慢速方法逐字节处理。

Then, for each alignment-sized chunk, by calculating (longbits - 0x01010101) & 0x80808080 != 0it checks whether any of the bytes within an integer is zero. This calculation has a false positive when at least one of bytes is higher than 0x80, but more often than not it should work. If that's not the case (as it is in yellow area), the length is increased by alignment size.

然后,对于每个对齐大小的块,通过计算(longbits - 0x01010101) & 0x80808080 != 0它检查整数中的任何字节是否为零。当至少一个字节高于 时,此计算会出现误报0x80,但通常情况下它应该可以工作。如果不是这种情况(因为它在黄色区域中),长度会随着对齐大小而增加。

If any of bytes within an integer turns out to be zero (or 0x81), then the string is checked byte-by-byte to determine the position of zero.

如果整数中的任何字节结果为零(或0x81),则逐字节检查字符串以确定零的位置。

This can make an out-of-bounds access, however because it's within an alignment, it's more likely than not to be fine, memory mapping units usually don't have byte level precision.

这可能会导致越界访问,但是因为它在对齐范围内,所以很可能没有问题,内存映射单元通常没有字节级精度。

回答by gnasher729

You want code to be correct, maintainable, and fast. These factors have different importance:

您希望代码正确、可维护且速度快。这些因素具有不同的重要性:

"correct" is absolutely essential.

“正确”是绝对必要的。

"maintainable" depends on how much you are going to maintain the code: strlen has been a Standard C library function for over 40 years. It's not going to change. Maintainability is therefore quite unimportant - for this function.

“可维护”取决于您要维护代码的程度:strlen 作为标准 C 库函数已有 40 多年的历史。它不会改变。因此,可维护性并不重要——对于这个功能。

"Fast": In many applications, strcpy, strlen etc. use a significant amount of the execution time. To achieve the same overall speed gain as this complicated, but not very complicated implementation of strlen by improving the compiler would take heroic efforts.

“快速”:在许多应用程序中,strcpy、strlen 等使用了大量的执行时间。要通过改进编译器实现与这个复杂但不是很复杂的 strlen 实现相同的总体速度增益,需要付出巨大的努力。

Being fast has another advantage: When programmers find out that calling "strlen" is the fastest method they can measure the number of bytes in a string, they are not tempted anymore to write their own code to make things faster.

速度快还有另一个优势:当程序员发现调用“strlen”是他们可以测量字符串中字节数的最快方法时,他们不再想编写自己的代码来加快速度。

So for strlen, speed is much more important, and maintainability much less important, than for most code that you will ever write.

因此,对于 strlen 而言,与您将要编写的大多数代码相比,速度要重要得多,而可维护性要小得多。

Why must it be so complicated? Say you have a 1,000 byte string. The simple implementation will examine 1,000 bytes. A current implementation would likely examine 64 bit words at a time, which means 125 64-bit or eight-byte words. It might even use vector instructions examining say 32 bytes at a time, which would be even more complicated and even faster. Using vector instructions leads to code that is a bit more complicated but quite straightforward, checking whether one of eight bytes in a 64 bit word is zero requires some clever tricks. So for medium to long strings this code can be expected to be about four times faster. For a function as important as strlen, that's worth writing a more complex function.

为什么一定要这么复杂?假设您有一个 1,000 字节的字符串。简单的实现将检查 1,000 个字节。当前的实现可能一次检查 64 位字,这意味着 125 个 64 位或 8 字节字。它甚至可能使用向量指令一次检查 32 个字节,这会更复杂,甚至更快。使用向量指令会导致代码更复杂但非常简单,检查 64 位字中的八个字节之一是否为零需要一些巧妙的技巧。因此,对于中长字符串,此代码预计会快四倍。对于像 strlen 这样重要的函数,值得编写一个更复杂的函数。

PS. The code is not very portable. But it's part of the Standard C library, which is part of the implementation - it need not be portable.

附注。代码不是很便携。但它是标准 C 库的一部分,也是实现的一部分——它不需要是可移植的。

PPS. Someone posted an example where a debugging tool complained about accessing bytes past the end of a string. An implementation can be designed that guarantees the following: If p is a valid pointer to a byte, then any access to a byte in the same aligned block that would be undefined behaviour according to the C standard, will return an unspecified value.

聚苯乙烯。有人发布了一个示例,其中调试工具抱怨访问超出字符串末尾的字节。可以设计一个实现来保证以下内容:如果 p 是一个指向字节的有效指针,那么根据 C 标准对同一对齐块中的字节的任何访问都将是未定义的行为,将返回一个未指定的值。

PPPS. Intel has added instructions to their later processors that form a building block for the strstr() function (finding a substring in a string). Their description is mind boggling, but they can make that particular function probably 100 times faster. (Basically, given an array a containing "Hello, world!" and an array b starting with 16 bytes "HelloHelloHelloH" and containing more bytes, it figures out that the string a doesn't occur in b earlier than starting at index 15).

购买力平价。英特尔已向其后来的处理器添加了指令,这些指令构成了 strstr() 函数(在字符串中查找子字符串)的构建块。他们的描述令人难以置信,但他们可以使该特定功能的速度提高 100 倍。(基本上,给定一个包含“Hello, world!”的数组 a 和一个以 16 个字节“HelloHelloHelloH”开头并包含更多字节的数组 b,它会发现字符串 a 不会早于从索引 15 开始出现在 b 中) .

回答by Lundin

Briefly: checking a string byte by byte will potentially be slow on architectures that can fetch larger amounts of data at a time.

简而言之:在可以一次获取大量数据的体系结构上,逐字节检查字符串可能会很慢。

If the check for null termination could be done on 32 or 64 bit basis, it reduces the amount of checks the compiler has to perform. That's what the linked code attempts to do, with a specific system in mind. They make assumptions about addressing, alignment, cache use, non-standard compiler setups etc etc.

如果空终止检查可以在 32 位或 64 位基础上完成,它会减少编译器必须执行的检查量。这就是链接代码试图做的,考虑到特定的系统。他们对寻址、对齐、缓存使用、非标准编译器设置等做出假设。

Reading byte by byte as in your example would be a sensible approach on a 8 bit CPU, or when writing a portable lib written in standard C.

在 8 位 CPU 上或在编写用标准 C 编写的可移植库时,如您的示例中那样逐字节读取是一种明智的方法。

Looking at C standard libs for advise how to write fast/good code isn't a good idea, because it will be non-portable and rely on non-standard assumptions or poorly-defined behavior. If you are a beginner, reading such code will likely be more harmful than educational.

查看 C 标准库以获得如何编写快速/好的代码的建议并不是一个好主意,因为它是不可移植的,并且依赖于非标准假设或定义不明确的行为。如果您是初学者,阅读此类代码可能比教育更有害。

回答by Hyman Kelly

One important thing not mentioned by the other answers is that the FSF is very cautious about ensuring that proprietary code does not make it into GNU projects. In the GNU Coding Standardsunder Referring to Proprietary Programs, there is a warning about organising your implementation in a way that it cannot be confused with existing proprietary code:

其他答案没有提到的一件重要事情是,FSF 非常谨慎地确保专有代码不会进入 GNU 项目。在参考专有程序下的GNU 编码标准中,有一条警告,提示以不能与现有专有代码混淆的方式组织您的实现:

Don't in any circumstances refer to Unix source code for or during your work on GNU! (Or to any other proprietary programs.)

If you have a vague recollection of the internals of a Unix program, this does not absolutely mean you can't write an imitation of it, but do try to organize the imitation internally along different lines, because this is likely to make the details of the Unix version irrelevant and dissimilar to your results.

For example, Unix utilities were generally optimized to minimize memory use; if you go for speed instead, your program will be very different.

在任何情况下都不要为 GNU 工作或在您使用 GNU 的过程中参考 Unix 源代码!(或任何其他专有程序。)

如果您对 Unix 程序的内部结构有模糊的回忆,这并不绝对意味着您不能编写它的模仿,而是尝试按照不同的路线在内部组织模仿,因为这可能会使细节Unix 版本与您的结果无关且不同。

例如,Unix 实用程序通常经过优化以最大限度地减少内存使用;如果你追求速度,你的程序就会大不相同。

(Emphasis mine.)

(强调我的。)