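Fastest sort algorithm for millions of UINT64 RGBZ graphics pixels

I am sorting 10+ million uint64_t values holding RGB data from .RAW files, and 79% of my C program's time is spent in qsort. I am looking for a faster sort for this specific data type.

Being RAW graphics data, the numbers are very random and about 80% of them are unique. No partial ordering or runs of already sorted data can be expected. The 4 uint16_t fields inside each uint64_t are R, G, B and zero (possibly a small count <= ~20).

I have the simplest comparison function I could think of, using unsigned long long values (you can NOT just subtract them):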

    qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64);
    ...
    int comp_uint64(const void *a, const void *b)
    {
        if(*((uint64_t *)a) > *((uint64_t *)b)) return(+1);
        if(*((uint64_t *)a) < *((uint64_t *)b)) return(-1);
        return(0);
    } // End Comp_uint64().
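To make the "you can NOT just subtract them" remark concrete, here is a minimal sketch. Both function names are hypothetical and only for illustration. The subtracting version truncates a 64-bit difference to `int`, so two values that differ only in the upper 32 bits compare as equal (or with the wrong sign).

```c
#include <stdint.h>

/* Hypothetical BROKEN comparator: the 64-bit difference is narrowed to int.
 * On mainstream compilers the narrowing keeps only the low 32 bits. */
static int comp_by_subtraction(const void *a, const void *b)
{
    return (int)(*(const uint64_t *)a - *(const uint64_t *)b);
}

/* Same logic as comp_uint64 above, repeated so this block compiles alone. */
static int comp_by_compare(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    if (x > y) return +1;
    if (x < y) return -1;
    return 0;
}
```

With `a = 1ULL << 32` and `b = 0`, the difference has all-zero low 32 bits, so the subtracting comparator reports "equal" while the explicit comparison reports "greater".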

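There is a very interesting "Programming Puzzles & Code Golf" entry on StackExchange, but they used floats. Then there are QSort, RecQuick, heap, stooge, tree, radix...

swenson/sort looked interesting, but had no (obvious) support for my data type, uint64_t. And the "quick sort" time was the best. Some sources say that the system qsort can be anything, not necessarily "quick sort".

C++ sort bypasses the generic casting of void pointers and achieves great performance improvements over C. There has to be an optimized way to slam U8s through a 64-bit processor at warp speed.

System/compiler info:

I am currently using GCC with Strawberry Perl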

    gcc version 4.9.2 (x86_64-posix-sjlj, built by strawberryperl.com
    Intel 2700K Sandy Bridge CPU, 32GB DDR3
    windows 7/64 pro
    gcc -D__USE_MINGW_ANSI_STDIO -O4 -ffast-math -m64 -Ofast -march=corei7-avx -mtune=corei7 -Ic:/bin/xxHash-master -Lc:/bin/xxHash-master c:/bin/stddev.c -oc:/bin/stddev.g6.exe

First attempt at a better `qsort`, `QSORT()`!

I tried to use Michael Tokarev's inline qsort.

"Ready-to-use"? From the qsort.h documentation:

    -----------------------------
    * Several ready-to-use examples:
    *
    * Sorting array of integers:
    * void int_qsort(int *arr, unsigned n) {
    * #define int_lt(a,b) ((*a)<(*b))
    * QSORT(int, arr, n, int_lt);
    --------------------------------
    Change from type "int" to "uint64_t"
    compile error on TYPE???
    c:/bin/bpbfct.c:586:8: error: expected expression before 'uint64_t'
    QSORT(uint64_t, hpidx, num_pix, islt);

I can't find a real, compiling, working example program, only comments describing the "general concept".

    #define QSORT_TYPE uint64_t
    #define islt(a,b) ((*a)<(*b))
    uint64_t *QSORT_BASE;
    int QSORT_NELT;

    hpidx=(uint64_t *) calloc(num_pix+2, sizeof(uint64_t)); // Hash . PIDX
    QSORT_BASE = hpidx;
    QSORT_NELT = num_pix; // QSORT_LT is function QSORT_LT()
    QSORT(uint64_t, hpidx, num_pix, islt);
    //QSORT(uint64_t *, hpidx, num_pix, QSORT_LT); // QSORT_LT mal-defined?
    //qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64); // << WORKS

The "ready-to-use" examples use the types int, char * and struct elt. Isn't uint64_t a type?? Trying long long:

    QSORT(long long, hpidx, num_pix, islt);
    c:/bin/bpbfct.c:586:8: error: expected expression before 'long'
    QSORT(long long, hpidx, num_pix, islt);

Next attempt: `RADIXSORT`:

Results: RADIX_SORT is radical!

    I:\br3\pf.249465>grep "Event" bb12.log | grep -i Sort
    <grep "Event" bb11.log | grep -i Sort << 5.525 sec average = 3.95 time slower
    4) Time=5.538 sec = 86.34%, Event QSort , hits=1
    4) Time=5.519 sec = 79.41%, Event QSort , hits=1
    4) Time=5.519 sec = 79.02%, Event QSort , hits=1
    4) Time=5.563 sec = 79.49%, Event QSort , hits=1
    4) Time=5.684 sec = 79.83%, Event QSort , hits=1
    4) Time=5.509 sec = 79.30%, Event QSort , hits=1

3.94 times faster than any kind of qsort out of the box!
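For anyone who wants to reproduce this kind of measurement without the logging shown above, here is a rough sketch of a timing harness. It is not the code that produced the log: `time_qsort_seconds`, `next_random` and the xorshift generator are assumptions made for illustration, and the random fill only stands in for real pixel data.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Comparator from the question, repeated so this block compiles alone. */
static int comp_uint64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    if (x > y) return +1;
    if (x < y) return -1;
    return 0;
}

/* xorshift64 step: a simple stand-in for "very random" pixel values. */
static uint64_t next_random(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Fills n pseudo-random keys, sorts them with qsort and returns the CPU
 * time of the sort in seconds. Returns -1.0 if the allocation fails or
 * the result is not in ascending order. */
double time_qsort_seconds(size_t n)
{
    uint64_t state = 0x9E3779B97F4A7C15ULL;   /* arbitrary non-zero seed */
    uint64_t *data = malloc(n * sizeof *data);
    clock_t start, stop;
    size_t i;

    if (data == NULL) return -1.0;
    for (i = 0; i < n; i++) data[i] = next_random(&state);

    start = clock();
    qsort(data, n, sizeof *data, comp_uint64);
    stop = clock();

    for (i = 1; i < n; i++) {
        if (data[i - 1] > data[i]) { free(data); return -1.0; }
    }
    free(data);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling `time_qsort_seconds(10000000)` gives a baseline for roughly the data size in the question; swapping the `qsort` call for another sort gives the comparison.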

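And, more importantly, there was actual, working code, not just 80% of what you need from some guru who assumes you know everything they know and can fill in the other 20%.

Fantastic solution! Thank you, Louis Ricci!

I would use Radix Sort with an 8-bit radix. For 64-bit values, a well-optimized radix sort has to iterate over the list 9 times (one pass to pre-calculate the counts and offsets, and 8 passes for 64 bits / 8 bits). That is 9*N time and 2*N space (using a shadow array).

Here is what an optimized radix sort would look like.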

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* One 256-entry histogram per byte; c8 is the least significant byte. */
    typedef union {
        struct {
            uint32_t c8[256];
            uint32_t c7[256];
            uint32_t c6[256];
            uint32_t c5[256];
            uint32_t c4[256];
            uint32_t c3[256];
            uint32_t c2[256];
            uint32_t c1[256];
        };
        uint32_t counts[256 * 8];
    } rscounts_t;

    uint64_t * radixSort(uint64_t * array, uint32_t size) {
        rscounts_t counts;
        memset(&counts, 0, 256 * 8 * sizeof(uint32_t));
        uint64_t * cpy = (uint64_t *)malloc(size * sizeof(uint64_t));
        if (cpy == NULL) return NULL; /* shadow array could not be allocated */
        uint32_t o8=0, o7=0, o6=0, o5=0, o4=0, o3=0, o2=0, o1=0;
        uint32_t t8, t7, t6, t5, t4, t3, t2, t1;
        uint32_t x;

        // calculate counts
        for(x = 0; x < size; x++) {
            t8 = array[x] & 0xff;
            t7 = (array[x] >> 8) & 0xff;
            t6 = (array[x] >> 16) & 0xff;
            t5 = (array[x] >> 24) & 0xff;
            t4 = (array[x] >> 32) & 0xff;
            t3 = (array[x] >> 40) & 0xff;
            t2 = (array[x] >> 48) & 0xff;
            t1 = (array[x] >> 56) & 0xff;
            counts.c8[t8]++;
            counts.c7[t7]++;
            counts.c6[t6]++;
            counts.c5[t5]++;
            counts.c4[t4]++;
            counts.c3[t3]++;
            counts.c2[t2]++;
            counts.c1[t1]++;
        }

        // convert counts to offsets
        for(x = 0; x < 256; x++) {
            t8 = o8 + counts.c8[x];
            t7 = o7 + counts.c7[x];
            t6 = o6 + counts.c6[x];
            t5 = o5 + counts.c5[x];
            t4 = o4 + counts.c4[x];
            t3 = o3 + counts.c3[x];
            t2 = o2 + counts.c2[x];
            t1 = o1 + counts.c1[x];
            counts.c8[x] = o8;
            counts.c7[x] = o7;
            counts.c6[x] = o6;
            counts.c5[x] = o5;
            counts.c4[x] = o4;
            counts.c3[x] = o3;
            counts.c2[x] = o2;
            counts.c1[x] = o1;
            o8 = t8;
            o7 = t7;
            o6 = t6;
            o5 = t5;
            o4 = t4;
            o3 = t3;
            o2 = t2;
            o1 = t1;
        }

        // radix: 8 stable passes, ping-ponging between array and cpy
        for(x = 0; x < size; x++) {
            t8 = array[x] & 0xff;
            cpy[counts.c8[t8]] = array[x];
            counts.c8[t8]++;
        }
        for(x = 0; x < size; x++) {
            t7 = (cpy[x] >> 8) & 0xff;
            array[counts.c7[t7]] = cpy[x];
            counts.c7[t7]++;
        }
        for(x = 0; x < size; x++) {
            t6 = (array[x] >> 16) & 0xff;
            cpy[counts.c6[t6]] = array[x];
            counts.c6[t6]++;
        }
        for(x = 0; x < size; x++) {
            t5 = (cpy[x] >> 24) & 0xff;
            array[counts.c5[t5]] = cpy[x];
            counts.c5[t5]++;
        }
        for(x = 0; x < size; x++) {
            t4 = (array[x] >> 32) & 0xff;
            cpy[counts.c4[t4]] = array[x];
            counts.c4[t4]++;
        }
        for(x = 0; x < size; x++) {
            t3 = (cpy[x] >> 40) & 0xff;
            array[counts.c3[t3]] = cpy[x];
            counts.c3[t3]++;
        }
        for(x = 0; x < size; x++) {
            t2 = (array[x] >> 48) & 0xff;
            cpy[counts.c2[t2]] = array[x];
            counts.c2[t2]++;
        }
        for(x = 0; x < size; x++) {
            t1 = (cpy[x] >> 56) & 0xff;
            array[counts.c1[t1]] = cpy[x];
            counts.c1[t1]++;
        }

        free(cpy);
        return array;
    }

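EDIT: this implementation is based on the JavaScript version from "Fastest way to sort 32bit signed integer arrays in JavaScript?"

Here is the IDEONE for the C radix sort: http://ideone.com/JHI0d9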
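One possible variation, which is not part of the answer's code: the question says the lowest 16-bit field is zero or a small count, so some byte positions may hold the same value in every key. A pass over such a byte moves nothing relative to the previous pass and can be skipped. The sketch below is a loop-based rewrite under that assumption. The name `radix_sort_skip` and its `int` return convention are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort with an 8-bit radix that skips any byte position where
 * all keys share the same byte value.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int radix_sort_skip(uint64_t *array, size_t size)
{
    size_t counts[8][256];
    uint64_t *scratch, *src, *dst, *tmp;
    size_t i;
    int b, v;

    if (size < 2) return 0;
    scratch = malloc(size * sizeof *scratch);
    if (scratch == NULL) return -1;

    /* One pass over the data builds the histograms for all 8 bytes. */
    memset(counts, 0, sizeof counts);
    for (i = 0; i < size; i++)
        for (b = 0; b < 8; b++)
            counts[b][(array[i] >> (8 * b)) & 0xff]++;

    src = array;
    dst = scratch;
    for (b = 0; b < 8; b++) {
        size_t *c = counts[b];
        size_t offset = 0;

        /* If the bucket of the first key holds every key, this byte is
         * constant across the data and the pass would be a plain copy. */
        if (c[(src[0] >> (8 * b)) & 0xff] == size) continue;

        /* Turn counts into starting offsets (exclusive prefix sum). */
        for (v = 0; v < 256; v++) {
            size_t n = c[v];
            c[v] = offset;
            offset += n;
        }

        /* Stable scatter from src into dst, then swap the two buffers. */
        for (i = 0; i < size; i++)
            dst[c[(src[i] >> (8 * b)) & 0xff]++] = src[i];
        tmp = src; src = dst; dst = tmp;
    }

    /* An odd number of executed passes leaves the result in scratch. */
    if (src != array) memcpy(array, src, size * sizeof *array);
    free(scratch);
    return 0;
}
```

Whether this beats the fully unrolled version depends on how many bytes are actually constant in the data, so it is worth timing both on real pixels.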

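I see a few options, roughly in order from easiest to hardest.

- Enable link-time optimization with the -flto switch. This may let the compiler inline your comparison function. It's too easy not to try.
- If LTO has no effect, you can use an inline qsort implementation such as Michael Tokarev's inline qsort. That page suggests a 2x improvement, attributed entirely to the compiler's ability to inline the comparison function.
- Use C++ std::sort. I know your code is in C, but you can make a small module that only does the sorting and exposes a C interface. You are already using a toolchain with excellent C++ support.
- Try swenson/sort's library. It implements many algorithms, so you can use whichever works best on your data. It appears to be inlinable, and they claim to be faster than qsort.
- Find another sorting library. Something that can do Louis's Radix Sort is a good suggestion.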
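To illustrate why the inlining-based options can help, here is a minimal sketch of a sort specialized for uint64_t, where the comparison is a plain `<` the compiler can emit directly instead of calling through a function pointer. This is not Michael Tokarev's code or swenson/sort. The names `sort_u64` and `insertion_sort_u64` and the cutoff of 16 are assumptions, and it is a plain quicksort with a quadratic worst case.

```c
#include <stddef.h>
#include <stdint.h>

/* Insertion sort, used for small ranges. */
static void insertion_sort_u64(uint64_t *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        uint64_t key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

/* Quicksort specialized for uint64_t: median-of-three pivot, two-index
 * partition, recursion on the smaller side only. */
void sort_u64(uint64_t *a, size_t n)
{
    while (n > 16) {
        uint64_t x = a[0], y = a[n / 2], z = a[n - 1];
        uint64_t pivot = (x < y) ? ((y < z) ? y : ((x < z) ? z : x))
                                 : ((x < z) ? x : ((y < z) ? z : y));
        ptrdiff_t i = 0, j = (ptrdiff_t)n - 1;

        while (i <= j) {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                uint64_t t = a[i];
                a[i] = a[j];
                a[j] = t;
                i++;
                j--;
            }
        }

        /* Now a[0..j] <= pivot and a[i..n-1] >= pivot, with j < i. */
        size_t left_n = (size_t)(j + 1);
        size_t right_n = n - (size_t)i;
        if (left_n < right_n) {
            sort_u64(a, left_n);        /* recurse on the smaller part */
            a += i;                     /* loop on the larger part */
            n = right_n;
        } else {
            sort_u64(a + i, right_n);
            n = left_n;
        }
    }
    insertion_sort_u64(a, n);
}
```

A call such as `sort_u64(hpidx, num_pix)` would replace the `qsort` call from the question; whether it is actually faster on this data has to be measured.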

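Note that you can also do the comparison with a single branch instead of two: find out which value is larger, then subtract.

With some compilers/platforms, the following is branchless and faster, though not much different from the OP's original.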

    int comp_uint64_b(const void *a, const void *b)
    {
        return (*((uint64_t *)a) > *((uint64_t *)b)) -
               (*((uint64_t *)a) < *((uint64_t *)b));
    }

Maybe some ?: operators instead of the ifs would make things a bit faster.
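A sketch of what that suggestion could look like; the name `comp_uint64_ternary` is hypothetical. Whether it is any faster than the if version depends on the compiler, which may well generate the same code for both.

```c
#include <stdint.h>

/* Comparator using conditional expressions instead of if statements. */
int comp_uint64_ternary(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) ? 1 : ((x < y) ? -1 : 0);
}
```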
