A fast way to check whether a char array is all zero in C

Question

Asked by Claudiu

I have an array of bytes, in memory. What's the fastest way to see if all the bytes in the array are zero?

Answer 1

Answered by vladr

Nowadays, short of using SIMD extensions (such as SSE on x86 processors), you might as well iterate over the array and compare each value to 0.

In the distant past, performing a comparison and conditional branch for each element in the array (in addition to the loop branch itself) would have been deemed expensive and, depending on how often (or early) you could expect a non-zero element to appear in the array, you might have elected to completely do without conditionals inside the loop, using solely bitwise-or to detect any set bits and deferring the actual check until after the loop completes:

int i, sum = 0;
/* OR every element together; the loop body never branches on the data. */
for (i = 0; i < ARRAY_SIZE; ++i) {
  sum |= array[i];
}
if (sum != 0) {
  printf("At least one array element is non-zero\n");
}

However, with today's pipelined superscalar processors and their branch prediction, all non-SSE approaches are virtually indistinguishable within a loop. If anything, comparing each element to zero and breaking out of the loop as soon as the first non-zero element is found could, in the long run, be more efficient than the sum |= array[i] approach, which always traverses the entire array. The exception is an array that is almost always made up exclusively of zeroes: in that case, making the sum |= array[i] approach truly branchless with GCC's -funroll-loops could give you the better numbers. See the numbers below for an Athlon processor; results may vary with processor model and manufacturer.

#include <stdio.h>

int a[1024*1024];

/* Methods 1 & 2 are equivalent on x86 */  

int main() {
  int i, j, n;

# if defined METHOD3
  int x;
# endif

  for (i = 0; i < 100; ++i) {
#   if defined METHOD3
    x = 0;
#   endif
    for (j = 0, n = 0; j < sizeof(a)/sizeof(a[0]); ++j) {
#     if defined METHOD1
      if (a[j] != 0) { n = 1; }
#     elif defined METHOD2
      n |= (a[j] != 0);
#     elif defined METHOD3
      x |= a[j];
#     endif
    }
#   if defined METHOD3
    n = (x != 0);
#   endif

    printf("%d\n", n);
  }
}

$ uname -mp
i686 athlon
$ gcc -g -O3 -DMETHOD1 test.c
$ time ./a.out
real    0m0.376s
user    0m0.373s
sys     0m0.003s
$ gcc -g -O3 -DMETHOD2 test.c
$ time ./a.out
real    0m0.377s
user    0m0.372s
sys     0m0.003s
$ gcc -g -O3 -DMETHOD3 test.c
$ time ./a.out
real    0m0.376s
user    0m0.373s
sys     0m0.003s

$ gcc -g -O3 -DMETHOD1 -funroll-loops test.c
$ time ./a.out
real    0m0.351s
user    0m0.348s
sys     0m0.003s
$ gcc -g -O3 -DMETHOD2 -funroll-loops test.c
$ time ./a.out
real    0m0.343s
user    0m0.340s
sys     0m0.003s
$ gcc -g -O3 -DMETHOD3 -funroll-loops test.c
$ time ./a.out
real    0m0.209s
user    0m0.206s
sys     0m0.003s
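
The answer above mentions SIMD extensions but shows no code for them. The following is a minimal sketch of what an SSE2 version might look like, not a tuned implementation. The function name all_zero_simd is made up for illustration, and the code falls back to a plain byte loop when the compiler does not define __SSE2__.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: returns 1 if all n bytes at buf are zero, else 0. */
static int all_zero_simd(const void *buf, size_t n)
{
    const unsigned char *p = buf;
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    for (; i + 16 <= n; i += 16) {
        /* Unaligned load, so the buffer needs no special alignment. */
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        /* Bytes equal to zero become 0xFF; movemask packs their top bits,
           so 0xFFFF means all 16 bytes were zero. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) != 0xFFFF)
            return 0;
    }
#endif
    /* Scalar tail, or the whole buffer when SSE2 is unavailable. */
    for (; i < n; ++i)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

This version exits early on the first non-zero 16-byte block, so it behaves like the compare-and-break loop rather than the OR-everything loop.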

Answer 2

Answered by susmits

Here's a short, quick solution, if you're okay with using inline assembly.

#include <stddef.h>
#include <stdio.h>

int checkzero(const char *string, size_t length);

int main(void) {
    char str1[] = "wow this is not zero!";
    char str2[] = {0, 0, 0, 0, 0, 0, 0, 0};
    printf("%d\n", checkzero(str1, sizeof(str1)));
    printf("%d\n", checkzero(str2, sizeof(str2)));
    return 0;
}

/* x86 only, GCC-style inline assembly. Returns 1 if every byte is zero. */
int checkzero(const char *string, size_t length) {
    unsigned char all_zero;
    __asm__ __volatile__ (
        "cld\n\t"
        "xorb %%al, %%al\n\t"   /* AL = 0; also sets ZF, so length 0 counts as all zero */
        "repz scasb\n\t"        /* scan while bytes equal AL and the count is non-zero */
        "sete %0\n\t"           /* ZF is still set only if the last comparison matched */
        : "=q" (all_zero), "+c" (length), "+D" (string)
        :
        : "eax", "cc", "memory"
    );
    return all_zero;
}

In case you're unfamiliar with assembly, here is what happens: we store the length of the string in the count register and ask the processor to scan the string for bytes equal to zero (we specify this by setting the lower 8 bits of the accumulator, namely %%al, to zero). The count register is decremented on each iteration, and the scan stops at the first non-zero byte or when the count reaches zero. The count alone cannot tell these two outcomes apart, because it is decremented before the comparison: a buffer whose only non-zero byte is the last one also ends with a count of zero. The code therefore reads the zero flag with sete. The flag is set only if the final comparison matched, that is, only if every byte was zero, and that value is returned.

Profiling this yielded the following results:

$ time or.exe

real    0m37.274s
user    0m0.015s
sys     0m0.000s


$ time scasb.exe

real    0m15.951s
user    0m0.000s
sys     0m0.046s

(Both test cases ran 100000 times on arrays of size 100000. The or.exe code comes from Vlad's answer. Function calls were eliminated in both cases.)

Answer 3

Answered by WhirlWind

If you want to do this in 32-bit C, probably just loop over the array as an array of 32-bit integers and compare each one to 0, then make sure the leftover bytes at the end are also 0.
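
A minimal sketch of that idea follows. The function name all_zero_words is hypothetical, and the sketch loads each word with memcpy rather than casting the pointer. That is an assumption on my part to sidestep alignment and aliasing issues, not something the answer specifies.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if all n bytes at buf are zero, else 0. */
static int all_zero_words(const void *buf, size_t n)
{
    const unsigned char *p = buf;
    uint32_t w;

    /* Whole 32-bit words. memcpy of a fixed small size typically
       compiles to a single load. */
    while (n >= sizeof w) {
        memcpy(&w, p, sizeof w);
        if (w != 0)
            return 0;
        p += sizeof w;
        n -= sizeof w;
    }
    /* The leftover bytes at the end: at most three of them. */
    while (n > 0) {
        if (*p++ != 0)
            return 0;
        --n;
    }
    return 1;
}
```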

Answer 4

Answered by Adisak

If the array is of any decent size, your limiting factor on a modern CPU is going to be access to the memory.

Make sure to use cache prefetching a decent distance ahead (e.g. 1-2K) with something like __dcbt or prefetchnta (or prefetcht0 if you are going to use the buffer again soon).

You will also want to use something like SIMD or SWAR to OR multiple bytes at a time. Even with 32-bit words, that is a quarter as many operations as a per-character version. I'd recommend unrolling the ORs and making them feed into a "tree" of ORs. You can see what I mean in my code example: it takes advantage of superscalar capability to do two integer ops (the ORs) in parallel by using ops that do not have as many intermediate data dependencies. I use a tree size of 8 (4x4, then 2x2, then 1x1), but you can expand that to a larger number depending on how many free registers your CPU architecture has.

The following pseudo-code example for the inner loop (no prolog/epilog) uses 32-bit ints, but you could do 64/128-bit with MMX/SSE or whatever is available to you. This will be fairly fast if you have prefetched the block into the cache. You may also need an unaligned check before the loop if your buffer is not 4-byte aligned, and another after it if the length of your buffer (after alignment) is not a multiple of 32 bytes.

const UINT32 *pmem = ***aligned-buffer-pointer***;

UINT32 a0 = 0, a1, a2, a3;  // a0 starts at 0 in case the loop body never runs
while(bytesremain >= 32)
{
    // Compare an aligned "line" of 32-bytes
    a0 = pmem[0] | pmem[1];
    a1 = pmem[2] | pmem[3];
    a2 = pmem[4] | pmem[5];
    a3 = pmem[6] | pmem[7];
    a0 |= a1; a2 |= a3;
    pmem += 8;
    a0 |= a2;
    bytesremain -= 32;
    if(a0 != 0) break;
}

if(a0!=0) then ***buffer-is-not-all-zeros***

I would actually suggest encapsulating the compare of a "line" of values into a single function and then unrolling that a couple times with the cache prefetching.
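
As a rough illustration, the inner loop above could be wrapped into a complete function along these lines. The name all_zero_ortree, the 1024-byte prefetch distance, and the use of memcpy instead of an aligned pointer (which removes the need for an alignment prolog) are all assumptions of this sketch rather than part of the answer. __builtin_prefetch is a GCC/Clang builtin and is compiled out elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if all n bytes at buf are zero, else 0. */
static int all_zero_ortree(const void *buf, size_t n)
{
    const unsigned char *p = buf;
    uint32_t w[8], a0, a1, a2, a3;

    while (n >= sizeof w) {
#if defined(__GNUC__)
        /* Prefetch hint only; the distance is a guess that needs tuning.
           Guarded so the address stays inside the buffer. */
        if (n > 1024)
            __builtin_prefetch(p + 1024);
#endif
        memcpy(w, p, sizeof w);            /* one 32-byte "line" */
        /* OR tree: the four first-level ORs are independent of each other. */
        a0 = w[0] | w[1];
        a1 = w[2] | w[3];
        a2 = w[4] | w[5];
        a3 = w[6] | w[7];
        a0 |= a1;
        a2 |= a3;
        if ((a0 | a2) != 0)
            return 0;
        p += sizeof w;
        n -= sizeof w;
    }
    /* Epilog: fewer than 32 bytes remain. */
    while (n > 0) {
        if (*p++ != 0)
            return 0;
        --n;
    }
    return 1;
}
```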

Answer 5

Answered by Kobor42

Split the memory to be checked in half, and compare the first part to the second.
a. If they differ, the bytes can't all be the same.
b. If they don't differ, repeat for the first half.

Worst case 2*N. Memory efficient and memcmp based.
Not sure if it should be used in real life, but I liked the self-compare idea.
It works for odd length. Do you see why? :-)

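#include <stdbool.h>
#include <stddef.h>
#include <string.h>

bool memcheck(const char *p, char chr, size_t size) {
    // An empty range trivially matches; otherwise check the first char.
    if (size == 0)
        return true;
    if (*p != chr)
        return false;
    size_t near_half, far_half;
    while (size > 1) {
        near_half = size / 2;
        far_half = size - near_half;
        // Compare the leading near_half bytes with the trailing near_half bytes.
        if (memcmp(p, p + far_half, near_half) != 0)
            return false;
        size = far_half;
    }
    return true;
}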

Answer 6

Answered by Ortwin Gentz

Measured two implementations on ARM64, one using a loop with early return on false, one that ORs all bytes:

int is_empty1(unsigned char * buf, int size)
{
    int i;
    for(i = 0; i < size; i++) {
        if(buf[i] != 0) return 0;
    }
    return 1;
}

int is_empty2(unsigned char * buf, int size)
{
    int sum = 0;
    for(int i = 0; i < size; i++) {
        sum |= buf[i];
    }
    return sum == 0;
}

Results:

All results, in microseconds:

        is_empty1   is_empty2
MEDIAN  0.350       3.554
AVG     1.636       3.768

only false results:

        is_empty1   is_empty2
MEDIAN  0.003       3.560
AVG     0.382       3.777

only true results:

        is_empty1   is_empty2
MEDIAN  3.649       3.528
AVG     3.857       3.751

Summary: only for datasets where the probability of false results is very small does the second algorithm, using ORing, perform better, thanks to the omitted branch. Otherwise, returning early is clearly the better-performing strategy.

Answer 7

Answered by zbyszek

Rusty Russell's memeqzero is very fast. It reuses memcmp to do the heavy lifting: https://github.com/rustyrussell/ccan/blob/master/ccan/mem/mem.c#L92.
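
For readers who don't want to follow the link, the trick can be sketched roughly as below. This is written from memory in the spirit of that function, under a made-up name (all_zero_memcmp), and is not a copy of the ccan source. The idea is to check a short prefix by hand and then compare the buffer against itself at an offset: if the first 16 bytes are zero and every byte equals the byte 16 positions before it, all bytes must be zero.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true if all length bytes at data are zero. */
static bool all_zero_memcmp(const void *data, size_t length)
{
    const unsigned char *p = data;
    size_t i;

    /* Check up to the first 16 bytes by hand. */
    for (i = 0; i < 16; i++) {
        if (length == 0)
            return true;
        if (*p != 0)
            return false;
        p++;
        length--;
    }
    /* The prefix is zero, so compare the buffer with itself shifted by 16:
       data[k] == data[k + 16] for every remaining k. */
    return memcmp(data, p, length) == 0;
}
```

The appeal of this approach is that the bulk of the work lands in the C library's memcmp, which is usually already tuned for the target platform.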
