如何让gcc生成合适的代码来检查缓冲区是否充满了NUL字节？

我正在实现一个解析磁带存档的程序。解析器逻辑的一部分是检查归档结束标记，该标记是一个充满NUL字节的512字节块。我为此目的编写了以下代码，希望gcc能够很好地优化它：

int is_eof_block(const char usth[static 512]) { size_t i; for (i = 0; i < 512; i++) if (usth[i] != '\0') return 0; return 1; }

但令我惊讶的是，gcc仍会为此生成可怕的代码，即使我明确允许它访问缓冲区中的整个512字节：

 is_eof_block: leaq 512(%rdi), %rax jmp .L239 .p2align 4,,10 .L243: addq $1, %rdi cmpq %rax, %rdi je .L242 .L239: cmpb $0, (%rdi) je .L243 xorl %eax, %eax ret .p2align 4,,10 .L242: movl $1, %eax ret

我希望gcc生成这样的东西，甚至是SIMD代码：

 is_eof_block: mov $64,%ecx xor %eax,%eax repz scasq setz %al ret

如何重写代码，使其仍然可移植（如：不使用非C99语言扩展，适用于不支持错位内存访问的体系结构），但编译为更好的机器代码，如amd64和AArch32 ？

基准

我写了下面的microbenchmark来演示时差。您可以将MISALIGNED定义为正整数，以使用未对齐的缓冲区进行测试。

benchmark.c

 #include  #include  #define TESTS 10000000 #ifndef MISALIGNED # define MISALIGNED 0 #endif char testarray[512 + MISALIGNED]; extern int is_eof_block(const char[static 512]); int main() { size_t i, j; clock_t begin, end; fprintf(stderr, "testing %d times\n", TESTS); fprintf(stderr, "no byte set to 1... "); begin = clock(); for (i = 0; i < TESTS; i++) if (!is_eof_block(testarray + MISALIGNED)) { fprintf(stderr, "\nWrong test result in iteration %zu!\n", i); return EXIT_FAILURE; } end = clock(); fprintf(stderr, "%fs\n", (end - begin) / (double)CLOCKS_PER_SEC); fprintf(stderr, "with non-null byte... "); begin = clock(); for (i = j = 0; i < TESTS; i++) { testarray[MISALIGNED + j] = '\0'; j = (j + 47) & 511; testarray[MISALIGNED + j] = '1'; if (is_eof_block(testarray + MISALIGNED)) { fprintf(stderr, "\nWrong test result in iteration %zu!\n", i); return EXIT_FAILURE; } } end = clock(); fprintf(stderr, "%fs\n", (end - begin) / (double)CLOCKS_PER_SEC); return EXIT_SUCCESS; }

is_eof_block_c.c

 #include  int is_eof_block(const char test[static 512]) { size_t i; for (i = 0; i < 512; i++) if (test[i] != '\0') return 0; return 1; }

is_eof_block_asm.s

  .text .globl is_eof_block .type is_eof_block,@function .align 16 is_eof_block: mov $64,%ecx xor %eax,%eax repz scasq setz %al ret .size is_eof_block,.-is_eof_block

以下是is_eof_block的C实现链接的输出：

 testing 10000000 times no byte set to 1... 2.281250s with non-null byte... 1.195312s

这是assembly版本：

 testing 10000000 times no byte set to 1... 0.476562s with non-null byte... 0.320312s

两者都使用gcc 5编译，唯一的优化选项是-O3 。传递各种-march=...标志并没有改变代码。差异大约是四分之一。对于未对齐的缓冲区，程序集实现大约慢3％，而与C实现没有区别。

这是一个触及每个字节的版本，似乎比测试工具中的原始函数快2-3倍（我不相信它能准确地反映现实）：

 int is_eof_block1(const char usth[static 512]) { unsigned int i; int res = 0; for (i = 0; i < 512; i++) res |= usth[i]; return res == 0; }

这是一个优化可读性的版本，不会浪费人们的时间，并试图让编写编译器/ libc的人变得更加聪明（它比汇编程序快得多，至少在我的机器上）：

 int is_eof_block2(const char usth[static 512]) { const static char foo[512]; return !memcmp(usth, foo, sizeof(foo)); }

这是一个版本（天真）认为如果你给它一个stdint.h _fast类型，编译器将做最好的工作：

 #include  #include  typedef uint_fast16_t fast_t; // 16 since 512 can't fit in 8 bits #define FAST_SIZE (512/sizeof(fast_t)) typedef union // union to guarantee there's no aliasing mishaps { char usth [512]; fast_t fast [FAST_SIZE]; } block_t; // misc sanity checks: _Static_assert(512%sizeof(fast_t) == 0, "This should never happen"); _Static_assert(sizeof(block_t) == 512, "Padding gone crazy"); int is_eof_block(const block_t* block) { for(const fast_t* i=&block->fast[0]; ifast+FAST_SIZE; i++) { if(*i != 0) return 0; } return 1; } int main (void) { block_t block = {0}; printf("%d", is_eof_block(&block)); }

循环可以用array + iterator替换，而不是指针算术。可能更快或更慢，我没有对它进行基准测试。

编辑：

数组+迭代器版本。这就是为什么我使用uint_fast16_t – 我希望“ fast_t ”比size_t做得更好，然后它必须至少足够大以包含值512。

 int is_eof_block(const block_t* block) { for(fast_t i=0; ifast[i] != 0) return 0; } return 1; }

由于已知该块为512字节，因此将每个16字节组提取到UInt64中，然后对零进行测试。这应该减少循环开销。

对齐问题的一种可能的解决方法是将缓冲区复制到本地结构中。

 struct x { unsigned long long :0; char buffer[512]; };

这将为您提供一个对齐的缓冲区。

由于对该问题的真正有用的评论，我决定使用原始的C代码。谢谢大家的帮助！

如何让gcc生成合适的代码来检查缓冲区是否充满了NUL字节？

基准

benchmark.c

is_eof_block_c.c

is_eof_block_asm.s

使用C中的梯形法则对某些值给出错误答案进行集成

排序2个大型数组

读/ dev / urandom线程安全吗？

从文件中读取后没有获得所有字符

数字中设置的位数

在C中反转一个字符串

通过函数指针调用函数 – 是否取消引用指针？有什么不同？

什么是__flexarr以及c程序员如何/为何使用它？

C：将Farenheit转换为Celsius

如何构建包含Mac OS X入口点的C程序？

如何让gcc生成合适的代码来检查缓冲区是否充满了NUL字节？

基准

benchmark.c

is_eof_block_c.c

is_eof_block_asm.s

使用C中的梯形法则对某些值给出错误答案进行集成

排序2个大型数组

读/ dev / urandom线程安全吗？

从文件中读取后没有获得所有字符

数字中设置的位数

在C中反转一个字符串

通过函数指针调用函数 – 是否取消引用指针？ 有什么不同？

什么是__flexarr以及c程序员如何/为何使用它？

C：将Farenheit转换为Celsius

如何构建包含Mac OS X入口点的C程序？

通过函数指针调用函数 – 是否取消引用指针？有什么不同？