Efficiently dividing an unsigned value by a power of two, rounding up

I want to implement unsigned integer division by an arbitrary power of two, rounding up, efficiently. So what I want, mathematically, is ceiling(p/q) 0. In C, a strawman implementation that doesn't take advantage of the restricted domain of q could look like the following function 1:

    /** q must be a power of 2, although this version works for any q */
    uint64_t divide(uint64_t p, uint64_t q) {
        uint64_t res = p / q;
        return p % q == 0 ? res : res + 1;
    }

… of course, I don't actually want to use division or mod at the machine level, since that takes many cycles even on modern hardware. I'm looking for a strength reduction using shifts and/or some other cheap operation(s) – taking advantage of the fact that q is a power of 2.

You can assume we have an efficient lg(unsigned int x) function, which returns the base-2 log of x, if x is a power of 2.

Undefined behavior is fine if q is zero.

Please note that the simple solution (p+q-1) >> lg(q) doesn't work in general – try it with p == 2^64-100 and q == 256, for example 2.
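
For concreteness, here is a minimal standalone check (my addition, not part of the original question) that shows the wrap-around:

    #include <stdint.h>
    #include <stdio.h>

    /* Demonstrates why (p + q - 1) >> lg(q) fails for large p:
     * the addition wraps around before the shift. */
    int main(void) {
        uint64_t p = UINT64_MAX - 99;              /* 2^64 - 100 */
        uint64_t q = 256;
        uint64_t naive    = (p + q - 1) >> 8;      /* lg(256) == 8; wraps, gives 0 */
        uint64_t expected = p / q + (p % q != 0);  /* slow but correct reference */
        printf("naive = %llu, expected = %llu\n",
               (unsigned long long)naive, (unsigned long long)expected);
        return 0;
    }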

Platform details

I'm interested in solutions in C that are likely to perform well across a variety of platforms, but for the sake of concreteness, of awarding the bounty, and because any definitive discussion of performance needs to include a target architecture, I'll be specific about how I'll test them:

  • Skylake CPU
  • gcc 5.4.0 with compile flags -O3 -march=haswell

Using gcc builtins (such as the bitscan/leading-zero builtins) is fine, and in particular I've implemented the lg() function I said was available as follows:

    inline uint64_t lg(uint64_t x) {
        return 63U - (uint64_t)__builtin_clzl(x);
    }

    inline uint32_t lg32(uint32_t x) {
        return 31U - (uint32_t)__builtin_clz(x);
    }

I verified that these compile down to a single bsr instruction, at least with -march=haswell, despite the apparent involvement of a subtraction. You are of course free to ignore these and use whatever other builtins you want in your solution.
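
Since q is guaranteed to be a power of two, counting trailing zeros gives the same result and maps to a single tzcnt with -march=haswell; this alternative is my own note (not part of the original question), though several answers below use the same builtin:

    /* Alternative log helper (assumption: q is a nonzero power of two,
     * so the trailing-zero count equals the base-2 log). */
    static inline uint64_t lg_tz(uint64_t x) {
        return (uint64_t)__builtin_ctzll(x);
    }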

Benchmark

I've written a benchmark of the existing answers, and will share and update the results as changes are made.

Writing a good benchmark for a small, potentially inlined operation is quite difficult. When code is inlined into a call site, a lot of the work of the function can disappear, especially when it's in a loop 3.

You could simply avoid the whole inlining problem by ensuring your code isn't inlined: declare it in another compilation unit. I tried that with my bench binary, but the results were fairly pointless. Nearly all implementations tied at 4 or 5 cycles per call, yet even a dummy method that does nothing other than return 0 takes the same time, so you are mostly just measuring the call + ret overhead. Furthermore, you are almost never really going to use functions like this – unless you messed up, they'll be available for inlining, and that changes everything.

So the two benchmarks I'll focus on repeatedly call the method under test in a loop, allowing inlining, cross-function optimization, loop hoisting and even vectorization.

There are two overall benchmark types: latency and throughput. The key difference is that in the latency benchmark, each call to divide depends on the previous call, so in general the calls cannot easily be overlapped 4:

    uint32_t bench_divide_latency(uint32_t p, uint32_t q) {
        uint32_t total = p;
        for (unsigned i=0; i < ITERS; i++) {
            total += divide_algo(total, q);
            q = rotl1(q);
        }
        return total;
    }

Note that the running total depends on the output of each divide call, and that it is also an input to the next divide call.

The throughput variant, on the other hand, doesn't feed the output of one divide into the subsequent one. This allows work from one call to be overlapped with subsequent ones (both by the compiler, and especially by the CPU), and even allows vectorization:

    uint32_t bench_divide_throughput(uint32_t p, uint32_t q) {
        uint32_t total = p;
        for (unsigned i=0; i < ITERS; i++) {
            total += fname(i, q);
            q = rotl1(q);
        }
        return total;
    }

Note that here we feed the loop counter in as the dividend – it is variable, but it doesn't depend on the previous divide call.
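
The rotl1 helper isn't shown in the excerpts above; a plausible definition (an assumption on my part – the benchmark's own helper may differ) would be:

    /* Rotate a 32-bit value left by one bit, so a power-of-two divisor
     * stays a power of two as it cycles through the loop. */
    static inline uint32_t rotl1(uint32_t x) {
        return (x << 1) | (x >> 31);
    }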

Furthermore, each benchmark has three flavors of behavior for the divisor, q:

  1. Compile-time constant divisor. For example, a call like divide(p, 8). This is common in practice, and the code can be much simpler when the divisor is known at compile time.
  2. Invariant divisor. Here the divisor is not known at compile time, but is constant for the whole benchmarking loop. This allows a subset of the optimizations that a compile-time constant allows.
  3. Variable divisor. The divisor changes on each iteration of the loop. The benchmark functions above show this variant, using a "rotate left by 1" instruction to vary the divisor.

Combining everything, you get a total of 6 distinct benchmarks.
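
As an illustration of how the compile-time-constant case differs (my sketch, not the actual harness code), that variant simply hard-codes q at the call site so the compiler can fold lg(q) and the masks:

    /* Hypothetical constant-q latency loop, analogous to bench_divide_latency above. */
    static uint32_t bench_divide_latency_const8(uint32_t p) {
        uint32_t total = p;
        for (unsigned i = 0; i < ITERS; i++) {
            total += divide_algo(total, 8);   /* q fixed at 8 */
        }
        return total;
    }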

Results

Overall

For the purposes of picking an overall best algorithm, I looked at each of the 12 subtests for the proposed algorithms: (latency, throughput) x (constant q, invariant q, variable q) x (32-bit, 64-bit), and assigned a score of 2, 1 or 0 per subtest as follows:

  • The best algorithm(s) (within 5% tolerance) receive a score of 2.
  • Algorithms that are "close enough" (no more than 50% slower than the best) receive a score of 1.
  • The remaining algorithms score zero.

Hence, the maximum possible total score is 24, but no algorithm achieved that. Here are the overall results:

    ╔═══════════════════════╦═══════╗
    ║ Algorithm             ║ Score ║
    ╠═══════════════════════╬═══════╣
    ║ divide_user23_variant ║    20 ║
    ║ divide_chux           ║    20 ║
    ║ divide_user23         ║    15 ║
    ║ divide_peter          ║    14 ║
    ║ divide_chrisdodd      ║    12 ║
    ║ stoke32               ║    11 ║
    ║ divide_chris          ║     0 ║
    ║ divide_weather        ║     0 ║
    ╚═══════════════════════╩═══════╝

So for the purposes of this specific test code, with this specific compiler and on this platform, user2357112's "variant" (with ... + (p & mask) != 0) performs best, tied with chux's suggestion (which is in fact the same code).
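
For reference, the winning expression as I understand it (reconstructed here; the originals are in the answers below and in the linked benchmark):

    uint32_t divide_user23_variant(uint32_t p, uint32_t q) {
        uint32_t exponent = lg32(q);   /* lg32() from the question */
        uint32_t mask = q - 1;
        return (p >> exponent) + ((p & mask) != 0);
    }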

Here are all the sub-scores that sum to the above:

    ╔══════════════════════════╦═══════╦════╦════╦════╦════╦════╦════╗
    ║                          ║ Total ║ LC ║ LI ║ LV ║ TC ║ TI ║ TV ║
    ╠══════════════════════════╬═══════╬════╬════╬════╬════╬════╬════╣
    ║ divide_peter             ║     6 ║  1 ║  1 ║  1 ║  1 ║  1 ║  1 ║
    ║ stoke32                  ║     6 ║  1 ║  1 ║  2 ║  0 ║  0 ║  2 ║
    ║ divide_chux              ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
    ║ divide_user23            ║     8 ║  1 ║  1 ║  2 ║  2 ║  1 ║  1 ║
    ║ divide_user23_variant    ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
    ║ divide_chrisdodd         ║     6 ║  1 ║  1 ║  2 ║  0 ║  0 ║  2 ║
    ║ divide_chris             ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
    ║ divide_weather           ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
    ║                          ║       ║    ║    ║    ║    ║    ║    ║
    ║ 64-bit Algorithm         ║       ║    ║    ║    ║    ║    ║    ║
    ║ divide_peter_64          ║     8 ║  1 ║  1 ║  1 ║  2 ║  2 ║  1 ║
    ║ div_stoke_64             ║     5 ║  1 ║  1 ║  2 ║  0 ║  0 ║  1 ║
    ║ divide_chux_64           ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
    ║ divide_user23_64         ║     7 ║  1 ║  1 ║  2 ║  1 ║  1 ║  1 ║
    ║ divide_user23_variant_64 ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
    ║ divide_chrisdodd_64      ║     6 ║  1 ║  1 ║  2 ║  0 ║  0 ║  2 ║
    ║ divide_chris_64          ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
    ║ divide_weather_64        ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
    ╚══════════════════════════╩═══════╩════╩════╩════╩════╩════╩════╝

Here, each test is named XY, with X in {Latency, Throughput} and Y in {Constant Q, Invariant Q, Variable Q}. So, for example, LC is the "latency test with constant q".

Analysis

At the highest level, the solutions split roughly into two categories: fast (the top 6 finishers) and slow (the bottom 2). The gap is large: all of the fast algorithms were the fastest on at least two subtests, and in general when they didn't finish first they fell into the "close enough" category (the only exceptions being stoke and chrisdodd). The slow algorithms, however, scored 0 (not even close) on every single test. So you can mostly eliminate the slow algorithms from further consideration.

Auto-vectorization

Among the fast algorithms, a big differentiator was the ability to auto-vectorize.

None of the algorithms was able to auto-vectorize in the latency tests, which makes sense because the latency tests are designed to feed their result directly into the next iteration, so results can really only be computed in a serial fashion.

For the throughput tests, however, many algorithms were able to auto-vectorize for the constant Q and invariant Q cases. In both of those tests the divisor q is loop-invariant (and in the former case it is also a compile-time constant). The dividend is the loop counter, so it is variable but predictable (and in particular, a vector of dividends can be calculated trivially by adding 8 to the previous input vector: [0, 1, 2, ..., 7] + [8, 8, ..., 8] == [8, 9, 10, ..., 15]).

In this scenario, gcc was able to vectorize peter, chux, user23 and user23_variant. For some reason it wasn't able to vectorize chrisdodd, probably because it included a branch (although conditionals don't strictly prevent vectorization, since many of the other solutions have conditional elements and still vectorized). The impact was huge: the algorithms that vectorized showed about an 8x improvement in throughput over the variants that didn't but were otherwise fast.

Vectorization isn't free, though! Here are the function sizes for the "constant" variant of each function, with the Vec? column showing whether the function vectorized or not:

    Size  Vec?  Name
     045   N    bench_c_div_stoke_64
     049   N    bench_c_divide_chrisdodd_64
     059   N    bench_c_stoke32_64
     212   Y    bench_c_divide_chux_64
     227   Y    bench_c_divide_peter_64
     220   Y    bench_c_divide_user23_64
     212   Y    bench_c_divide_user23_variant_64

The trend is clear – vectorized functions are about 4x the size of the non-vectorized ones. This is both because the core loop itself is larger (vector instructions tend to be larger, and there are more of them), and because the loop setup and especially the post-loop code is much larger: for example, the vectorized versions need a reduction to sum up all the partial sums in a vector. The loop count is fixed and a multiple of 8, so no tail code is generated – but if it were variable, the generated code would be even larger.

Furthermore, despite the large improvement in runtime, gcc's vectorization is actually fairly poor. Here's an excerpt from the vectorized version of Peter's routine:

    on entry: ymm4 == all zeros
    on entry: ymm5 == 0x00000001 0x00000001 0x00000001 ...
    4007a4:   c5 ed 76 c4    vpcmpeqd ymm0,ymm2,ymm4
    4007ad:   c5 fd df c5    vpandn   ymm0,ymm0,ymm5
    4007b1:   c5 dd fa c0    vpsubd   ymm0,ymm4,ymm0
    4007b5:   c5 f5 db c0    vpand    ymm0,ymm1,ymm0

This block works independently on 8 DWORD elements originating in ymm2. If we take x to be a single DWORD element of ymm2, and y to be the incoming value of ymm1, these four instructions correspond to:

                      x == 0   x != 0
    x = x ? 0 : -1;  //  -1       0
    x = x & 1;       //   1       0
    x = 0 - x;       //  -1       0
    x = y1 & x;      //  y1       0

So the first three instructions could simply be replaced by the first one, since the resulting state is identical in either case. That's two cycles added to that dependency chain (which isn't loop-carried) and two extra uops. Evidently gcc's optimization phases somehow interact poorly with the vectorization here, since such trivial optimizations are rarely missed in scalar code. Examining the other vectorized versions similarly shows a lot of performance dropped on the floor.

Branchy vs. branch-free

Nearly all of the solutions compiled to branch-free code, even when the C code had conditionals or explicit branches. The conditional portions were small enough that the compiler generally decided to use a conditional move or some variant. One exception is chrisdodd, which compiled with a branch (checking whether p == 0) in all of the throughput tests, but in none of the latency ones. Here's a typical example from the constant-q throughput test:

    0000000000400e60 :
    400e60:  89 f8                   mov    eax,edi
    400e62:  ba 01 00 00 00          mov    edx,0x1
    400e67:  eb 0a                   jmp    400e73
    400e69:  0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
    400e70:  83 c2 01                add    edx,0x1
    400e73:  83 fa 01                cmp    edx,0x1
    400e76:  74 f8                   je     400e70
    400e78:  8d 4a fe                lea    ecx,[rdx-0x2]
    400e7b:  c1 e9 03                shr    ecx,0x3
    400e7e:  8d 44 08 01             lea    eax,[rax+rcx*1+0x1]
    400e82:  81 fa 00 ca 9a 3b       cmp    edx,0x3b9aca00
    400e88:  75 e6                   jne    400e70
    400e8a:  c3                      ret
    400e8b:  0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

The branch at 400e76 skips the case where p == 0. In fact, the compiler could have just peeled the first iteration off (computing its result explicitly) and then avoided the jump entirely, since after that it can prove that p != 0. In these tests the branch is perfectly predictable, which could give an advantage to code that actually compiles with a branch (since the compare-and-branch code is essentially off the critical path and close to free), and it is a big part of why chrisdodd wins the throughput, variable-q case.

Detailed test results

Here you can find some detailed test results and some details on the tests themselves.

Latency

The results below test each algorithm over 1e9 iterations. Cycles are calculated simply by multiplying the time per call by the clock frequency. You can generally assume that something like 4.01 is the same as 4.00, but larger deviations like 5.11 seem to be real and reproducible.

The results for divide_plusq_32 use (p + q - 1) >> lg(q), but are shown only for reference, since this function fails for large p + q. The results for dummy are for a very simple function: return p + q, and let you estimate the benchmark overhead 5 (the addition itself should take at most a cycle).
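
The two reference functions, roughly as described above (my reconstruction; the benchmark's own definitions may differ slightly):

    static inline uint32_t divide_plusq_32(uint32_t p, uint32_t q) {
        return (p + q - 1) >> lg32(q);   /* wrong for large p + q, reference only */
    }

    static inline uint32_t divide_dummy_32(uint32_t p, uint32_t q) {
        return p + q;                    /* approximates the benchmark overhead */
    }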

    ==============================
    Bench: Compile-time constant Q
    ==============================
    Function                     ns/call   cycles
    divide_peter_32                 2.19     5.67
    divide_peter_64                 2.18     5.64
    stoke32_32                      1.93     5.00
    stoke32_64                      1.97     5.09
    stoke_mul_32                    2.75     7.13
    stoke_mul_64                    2.34     6.06
    div_stoke_32                    1.94     5.03
    div_stoke_64                    1.94     5.03
    divide_chux_32                  1.55     4.01
    divide_chux_64                  1.55     4.01
    divide_user23_32                1.97     5.11
    divide_user23_64                1.93     5.00
    divide_user23_variant_32        1.55     4.01
    divide_user23_variant_64        1.55     4.01
    divide_chrisdodd_32             1.95     5.04
    divide_chrisdodd_64             1.93     5.00
    divide_chris_32                 4.63    11.99
    divide_chris_64                 4.52    11.72
    divide_weather_32               2.72     7.04
    divide_weather_64               2.78     7.20
    divide_plusq_32                 1.16     3.00
    divide_plusq_64                 1.16     3.00
    divide_dummy_32                 1.16     3.00
    divide_dummy_64                 1.16     3.00

    ==============================
    Bench: Invariant Q
    ==============================
    Function                     ns/call   cycles
    divide_peter_32                 2.19     5.67
    divide_peter_64                 2.18     5.65
    stoke32_32                      1.93     5.00
    stoke32_64                      1.93     5.00
    stoke_mul_32                    2.73     7.08
    stoke_mul_64                    2.34     6.06
    div_stoke_32                    1.93     5.00
    div_stoke_64                    1.93     5.00
    divide_chux_32                  1.55     4.02
    divide_chux_64                  1.55     4.02
    divide_user23_32                1.95     5.05
    divide_user23_64                2.00     5.17
    divide_user23_variant_32        1.55     4.02
    divide_user23_variant_64        1.55     4.02
    divide_chrisdodd_32             1.95     5.04
    divide_chrisdodd_64             1.93     4.99
    divide_chris_32                 4.60    11.91
    divide_chris_64                 4.58    11.85
    divide_weather_32              12.54    32.49
    divide_weather_64              17.51    45.35
    divide_plusq_32                 1.16     3.00
    divide_plusq_64                 1.16     3.00
    divide_dummy_32                 0.39     1.00
    divide_dummy_64                 0.39     1.00

    ==============================
    Bench: Variable Q
    ==============================
    Function                     ns/call   cycles
    divide_peter_32                 2.31     5.98
    divide_peter_64                 2.26     5.86
    stoke32_32                      2.06     5.33
    stoke32_64                      1.99     5.16
    stoke_mul_32                    2.73     7.06
    stoke_mul_64                    2.32     6.00
    div_stoke_32                    2.00     5.19
    div_stoke_64                    2.00     5.19
    divide_chux_32                  2.04     5.28
    divide_chux_64                  2.05     5.30
    divide_user23_32                2.05     5.30
    divide_user23_64                2.06     5.33
    divide_user23_variant_32        2.04     5.29
    divide_user23_variant_64        2.05     5.30
    divide_chrisdodd_32             2.04     5.30
    divide_chrisdodd_64             2.05     5.31
    divide_chris_32                 4.65    12.04
    divide_chris_64                 4.64    12.01
    divide_weather_32              12.46    32.28
    divide_weather_64              19.46    50.40
    divide_plusq_32                 1.93     5.00
    divide_plusq_64                 1.99     5.16
    divide_dummy_32                 0.40     1.05
    divide_dummy_64                 0.40     1.04

Throughput

Here are the results for the throughput tests. Note that many of the algorithms here auto-vectorized, so performance is comparatively very good for those: a fraction of a cycle per call in many cases. One result is that, unlike most of the latency results, the 64-bit functions are considerably slower, since vectorization is more effective with smaller element sizes (although the gap is larger than I would have expected).

    ==============================
    Bench: Compile-time constant Q
    ==============================
    Function                     ns/call   cycles
    stoke32_32                      0.39     1.00
    divide_chux_32                  0.15     0.39
    divide_chux_64                  0.53     1.37
    divide_user23_32                0.14     0.36
    divide_user23_64                0.53     1.37
    divide_user23_variant_32        0.15     0.39
    divide_user23_variant_64        0.53     1.37
    divide_chrisdodd_32             1.16     3.00
    divide_chrisdodd_64             1.16     3.00
    divide_chris_32                 4.34    11.23
    divide_chris_64                 4.34    11.24
    divide_weather_32               1.35     3.50
    divide_weather_64               1.35     3.50
    divide_plusq_32                 0.10     0.26
    divide_plusq_64                 0.39     1.00
    divide_dummy_32                 0.08     0.20
    divide_dummy_64                 0.39     1.00

    ==============================
    Bench: Invariant Q
    ==============================
    Function                     ns/call   cycles
    stoke32_32                      0.48     1.25
    divide_chux_32                  0.15     0.39
    divide_chux_64                  0.48     1.25
    divide_user23_32                0.17     0.43
    divide_user23_64                0.58     1.50
    divide_user23_variant_32        0.15     0.38
    divide_user23_variant_64        0.48     1.25
    divide_chrisdodd_32             1.16     3.00
    divide_chrisdodd_64             1.16     3.00
    divide_chris_32                 4.35    11.26
    divide_chris_64                 4.36    11.28
    divide_weather_32               5.79    14.99
    divide_weather_64              17.00    44.02
    divide_plusq_32                 0.12     0.31
    divide_plusq_64                 0.48     1.25
    divide_dummy_32                 0.09     0.23
    divide_dummy_64                 0.09     0.23

    ==============================
    Bench: Variable Q
    ==============================
    Function                     ns/call   cycles
    stoke32_32                      1.16     3.00
    divide_chux_32                  1.36     3.51
    divide_chux_64                  1.35     3.50
    divide_user23_32                1.54     4.00
    divide_user23_64                1.54     4.00
    divide_user23_variant_32        1.36     3.51
    divide_user23_variant_64        1.55     4.01
    divide_chrisdodd_32             1.16     3.00
    divide_chrisdodd_64             1.16     3.00
    divide_chris_32                 4.02    10.41
    divide_chris_64                 3.84     9.95
    divide_weather_32               5.40    13.98
    divide_weather_64              19.04    49.30
    divide_plusq_32                 1.03     2.66
    divide_plusq_64                 1.03     2.68
    divide_dummy_32                 0.63     1.63
    divide_dummy_64                 0.66     1.71

a At least by specifying unsigned we avoid the whole can of worms related to the right-shift behavior of signed integers in C and C++.

0 Of course, this notation doesn't actually work in C, where / truncates the result, so the ceiling does nothing. So consider that pseudo-notation rather than straight C.

1 I'm also interested in solutions where all the types are uint32_t rather than uint64_t.

2 In general, any p and q where p + q >= 2^64 causes an issue, due to overflow.

3 That said, the function should be in a loop, because the performance of a micro-function that takes half a dozen cycles only really matters if it's called in a fairly tight loop.

4 This is a bit of a simplification – only the dividend p depends on the output of the previous iteration, so some work related to processing q can still be overlapped.

5 Use those estimates with caution, though – the overhead isn't simply additive. If the overhead shows up as 4 cycles and some function f takes 5, it isn't necessarily accurate to say the cost of the actual work in f is 5 - 4 == 1, because of the way execution is overlapped.

This answer is about what's ideal in asm: what we'd like to convince the compiler to emit for us. (I'm not suggesting actually using inline asm, except as a point of comparison when benchmarking compiler output. https://gcc.gnu.org/wiki/DontUseInlineAsm)

I did manage to get pretty good asm output from pure C for ceil_div_andmask, see below. (It's worse than a CMOV on Broadwell/Skylake, but probably fine on Haswell. Still, the user23/chux version looks even better for both cases.) It's mostly worth mentioning as one of the few cases where I got the compiler to emit the asm I wanted.


It looks like Chris Dodd's general idea of return ((p-1) >> lg(q)) + 1, with special-case handling for d = 0, is one of the best options. I.e. an optimal implementation of it in asm is hard to beat with an optimal implementation of anything else. Chux's (p >> lg(q)) + (bool)(p & (q-1)) also has advantages (like lower latency from p->result), and more CSE when the same q is used for multiple divisions. See below for a latency/throughput analysis of what gcc does with it.

If the same e = lg(q) is reused for multiple dividends, or the same dividend is reused for multiple divisors, different implementations can CSE more of the expression. They can also vectorize effectively with AVX2.

Branches are cheap and very efficient if they predict well, so branching on d==0 will be best if it's almost never taken. If d==0 is not rare, branchless asm will perform better on average. Ideally we could write something in C that lets gcc make the right choice during profile-guided optimization, and that compiles to good asm for either case.

Since the best branchless asm implementations don't add much latency compared to a branchy implementation, branchless is probably the way to go unless the branch would go the same way perhaps 99% of the time. That might be likely for branching on p==0, but is probably less likely for branching on p & (q-1).
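
A sketch of how that branchy special case might be written in C (my addition, assuming the same lg() helper from the question; the typedef is only for self-containment). __builtin_expect is just a static hint, and PGO would override it:

    typedef uint64_t T;   /* same role as T in the snippets below */

    T ceil_div_branchy(T d, T q) {
        T e = lg(q);
        if (__builtin_expect(d == 0, 0))   /* expected to be rarely taken */
            return 0;
        return ((d - 1) >> e) + 1;
    }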


It's hard to guide gcc5.4 into emitting anything optimal. This is my work-in-progress on Godbolt.

I think the optimal sequences for Skylake for this algorithm are as follows. (Shown as stand-alone functions for the AMD64 SysV ABI, but the throughput/latency discussion assumes the compiler will emit something similar inlined into a loop, with no RET attached.)


Branch on carry from calculating d-1 to detect d==0, instead of a separate test & branch. Reduces the uop count nicely, especially on SnB-family where JC can macro-fuse with the SUB.

    ceil_div_pjc_branch:
        xor    eax,eax        ; can take this uop off the fast path by adding a separate xor-and-return block, but in reality we want to inline something like this.
        sub    rdi, 1
        jc     .d_was_zero    ; fuses with the sub on SnB-family
        tzcnt  rax, rsi       ; tzcnt rsi,rsi also avoids any false-dep problems, but this illustrates that the q input can be read-only.
        shrx   rax, rdi, rax
        inc    rax
    .d_was_zero:
        ret
  • Fused-domain uops: 5 (not counting the ret), and one of them is a xor-zero (no execution unit)
  • HSW/SKL latency with successful branch prediction:
    • (d == 0): no data dependency on d or q, breaks the dep chain. (control dependency on d to detect a mispredict and retire the branch).
    • (d != 0): q->result: tzcnt + shrx + inc = 5c
    • (d != 0): d->result: sub + shrx + inc = 3c
  • Throughput: probably just bottlenecked on uop throughput

I've tried, but failed, to get gcc to branch on CF from the subtraction; it always wants to do a separate comparison. I know gcc can be coaxed into branching on CF after subtracting two variables, but perhaps that fails if one is a compile-time constant. (IIRC, this typically compiles to a CF test with unsigned variables: foo -= bar; if(foo > bar) carry_detected = 1;)


Branchless with ADC / SBB to handle the d==0 case. The zero-handling adds only one instruction to the critical path (vs. a version with no special handling for d==0), but it also converts one other instruction from an INC into a sbb rax, -1 to make CF undo the -= -1. Using a CMOV is cheaper on pre-Broadwell, but takes extra instructions to set it up.

    ceil_div_pjc_asm_adc:
        tzcnt  rsi, rsi
        sub    rdi, 1
        adc    rdi, 0         ; d? d-1 : d.  Sets CF=CF
        shrx   rax, rdi, rsi
        sbb    rax, -1        ; result++ if d was non-zero
        ret
  • Fused-domain uops: 5 on SKL (not counting the ret). 7 on HSW
  • SKL latency:
    • q->result: tzcnt + shrx + sbb = 5c
    • d->result: sub + adc + shrx (dep on q starts here) + sbb = 4c
  • Throughput: TZCNT runs on p1. SBB, ADC, and SHRX only run on p06. So I think we bottleneck on 3 uops for p06 per iteration, making this run at best one iteration per 1.5c.

If q and d become ready at the same time, note that this version can run the SUB/ADC in parallel with the 3c latency of TZCNT. If both are coming from the same cache-miss cache line, it's certainly possible. In any case, introducing the dependency on q as late as possible in the d->result dependency chain is an advantage.

Getting this from C seems unlikely with gcc5.4. There is an intrinsic for add-with-carry, but gcc makes a total mess of it. It doesn't use immediate operands for ADC or SBB, and it stores the carry into an integer register between every operation. gcc7, clang3.9, and icc17 all make horrible code from this.

    #include <x86intrin.h>

    // compiles to completely horrible code, putting the flags into integer regs between ops.
    T ceil_div_adc(T d, T q) {
        T e = lg(q);
        unsigned long long dm1;  // unsigned __int64
        unsigned char CF = _addcarry_u64(0, d, -1, &dm1);
        CF = _addcarry_u64(CF, 0, dm1, &dm1);
        T shifted = dm1 >> e;
        _subborrow_u64(CF, shifted, -1, &dm1);
        return dm1;
    }

    # gcc5.4 -O3 -march=haswell
        mov     rax, -1
        tzcnt   rsi, rsi
        add     rdi, rax
        setc    cl
        xor     edx, edx
        add     cl, -1
        adc     rdi, rdx
        setc    dl
        shrx    rdi, rdi, rsi
        add     dl, -1
        sbb     rax, rdi
        ret

CMOV to fix up the whole result: worse latency from q->result, since it's used sooner in the d->result dep chain.

    ceil_div_pjc_asm_cmov:
        tzcnt  rsi, rsi
        sub    rdi, 1
        shrx   rax, rdi, rsi
        lea    rax, [rax+1]     ; inc preserving flags
        cmovc  rax, zeroed_register
        ret
  • Fused-domain uops: 5 on SKL (not counting the ret). 6 on HSW
  • SKL latency:
    • q->result: tzcnt + shrx + lea + cmov = 6c (1c worse than ADC/SBB)
    • d->result: sub + shrx (dep on q starts here) + lea + cmov = 4c
  • Throughput: TZCNT runs on p1. LEA is p15. CMOV and SHRX are p06. SUB is p0156. In theory only bottlenecked on fused-domain uop throughput, so one iteration per 1.25c. With lots of independent operations, resource conflicts from SUB or LEA stealing p1 or p06 shouldn't be a throughput problem, because at one iteration per 1.25c no port is saturated with uops that can only run on that port.

CMOV to get an operand for SUB: I was hoping I could find a way to create an operand for a later instruction that produces a zero when needed, without a dependency on q, e, or the SHRX result. This would help if d is ready before q, or at the same time.

This doesn't achieve that goal, and needs an extra 7-byte mov rdx, -1 in the loop.

    ceil_div_pjc_asm_cmov:
        tzcnt  rsi, rsi
        mov    rdx, -1
        sub    rdi, 1
        shrx   rax, rdi, rsi
        cmovnc rdx, rax
        sub    rax, rdx         ; res += d ? 1 : -res
        ret

Lower-latency version for pre-BDW CPUs with expensive CMOV, using SETCC to create a mask for AND.

    ceil_div_pjc_asm_setcc:
        xor   edx, edx          ; needed every iteration
        tzcnt rsi, rsi
        sub   rdi, 1
        setc  dl                ; d!=0 ?  0 : 1
        dec   rdx               ; d!=0 ? -1 : 0   // AND-mask
        shrx  rax, rdi, rsi
        inc   rax
        and   rax, rdx          ; zero the bogus result if d was initially 0
        ret

It's still 4c latency from d->result (and 6 from q->result), because the SETC/DEC happen in parallel with the SHRX/INC. Total uop count: 8. Most of these insns can run on any port, so it should be 1 iteration per 2 clocks.

Of course, for pre-HSW you also need to replace the SHRX.

We can get gcc5.4 to emit something nearly as good (it still uses a separate TEST instead of setting the mask based on sub rdi, 1, but otherwise the same instructions as above). See it on Godbolt.

    T ceil_div_andmask(T p, T q) {
        T mask = -(T)(p!=0);    // TEST+SETCC+NEG
        T e = lg(q);
        T nonzero_result = ((p-1) >> e) + 1;
        return nonzero_result & mask;
    }

When the compiler knows that p is non-zero, it takes advantage of that and makes nicer code:

    // http://stackoverflow.com/questions/40447195/can-i-hint-the-optimizer-by-giving-the-range-of-an-integer
    #if defined(__GNUC__) && !defined(__INTEL_COMPILER)
    #define assume(x) do{if(!(x)) __builtin_unreachable();}while(0)
    #else
    #define assume(x) (void)(x) // still evaluate it once, for side effects in case anyone is insane enough to put any inside an assume()
    #endif

    T ceil_div_andmask_nonzerop(T p, T q) {
        assume(p!=0);
        return ceil_div_andmask(p, q);
    }

    # gcc5.4 -O3 -march=haswell
        xor     eax, eax        # gcc7 does tzcnt in-place instead of wasting an insn on this
        sub     rdi, 1
        tzcnt   rax, rsi
        shrx    rax, rdi, rax
        add     rax, 1
        ret

Chux / user23_variant

Only 3c latency from p->result, and constant q can CSE a lot.

    T divide_A_chux(T p, T q) {
        bool round_up = p & (q-1);  // compiles differently from user23_variant with clang: AND instead of
        return (p >> lg(q)) + round_up;
    }

        xor     eax, eax        # in-place tzcnt would save this
        xor     edx, edx        # target for setcc
        tzcnt   rax, rsi
        sub     rsi, 1
        test    rsi, rdi
        shrx    rdi, rdi, rax
        setne   dl
        lea     rax, [rdx+rdi]
        ret

Doing the SETCC before the TZCNT would allow an in-place TZCNT, saving the xor eax,eax. I haven't looked at how this inlines in a loop.

  • Fused-domain uops: 8 on HSW/SKL (not counting the ret)
  • HSW/SKL latency:
    • q->result: (tzcnt + shrx(p) | sub + test(p) + setne) + lea (or add) = 5c
    • d->result: test (dep on q starts here) + setne + lea = 3c. (the shrx->lea chain is shorter, and thus not the critical path)
  • Throughput: probably just bottlenecked on the frontend, at one per 2c. Saving the xor eax,eax should speed this up to one per 1.75c (but of course any loop overhead will be part of the bottleneck, because frontend bottlenecks are like that).

    uint64_t exponent = lg(q);
    uint64_t mask = q - 1;
    //     v divide
    return (p >> exponent) + (((p & mask) + mask) >> exponent);
    //                       ^ round up

The separate computation of the “round up” part avoids the overflow issues of (p+q-1) >> lg(q) . Depending on how smart your compiler is, it might be possible to express the “round up” part as ((p & mask) != 0) without branching.
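
Wrapped into a complete function (my packaging of the snippet above, assuming the lg() helper from the question):

    uint64_t divide_user23(uint64_t p, uint64_t q) {
        uint64_t exponent = lg(q);
        uint64_t mask = q - 1;
        /* (p & mask) + mask is >= q exactly when the remainder is nonzero,
         * so the second shift contributes the rounding-up 1 without overflow. */
        return (p >> exponent) + (((p & mask) + mask) >> exponent);
    }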

The efficient way of dividing by a power of 2 for an unsigned integer in C is a right shift — shifting right by one divides by two (rounding down), so shifting right by n divides by 2^n (rounding down).

Now you want to round up rather than down, which you can do by first adding 2^n - 1, or equivalently by subtracting one before the shift and adding one after (except for 0). This works out to something like:

    unsigned ceil_div(unsigned d, unsigned e) {
        /* compute ceil(d/2**e) */
        return d ? ((d-1) >> e) + 1 : 0;
    }

The conditional can be removed by using the boolean value of d for addition and subtraction of one:

    unsigned ceil_div(unsigned d, unsigned e) {
        /* compute ceil(d/2**e) */
        return ((d - !!d) >> e) + !!d;
    }

Due to its size, and the speed requirement, the function should be made static inline. It probably won't make a difference for the optimizer, but the parameters should be const. If it must be shared among many files, define it in a header:

 static inline unsigned ceil_div(const unsigned d, const unsigned e){... 


[Re-write] given OP’s clarification concerning power-of-2 .

The round-up or ceiling part is easy when overflow is not a concern: simply add q-1, then shift.

Otherwise, since the possibility of rounding up depends on all the bits of p smaller than q, detection of those bits is needed first, before they are shifted out.

    uint64_t divide_A(uint64_t p, uint64_t q) {
        bool round_up = p & (q-1);
        return (p >> lg64(q)) + round_up;
    }

This assumes code has an efficient lg64(uint64_t x) function, which returns the base-2 log of x, if x is a power-of-two.

My old answer didn’t work if p was one more than a power of two (whoops). So my new solution, using the __builtin_ctzll() and __builtin_ffsll() functions 0 available in gcc (which as a bonus, provides the fast logarithm you mentioned!):

    uint64_t divide(uint64_t p, uint64_t q) {
        int lp = __builtin_ffsll(p);
        int lq = __builtin_ctzll(q);
        return (p >> lq) + (lp < (lq+1) && lp);
    }

Note that this is assuming that a long long is 64 bits. It has to be tweaked a little otherwise.

The idea here is that we need to round up (add one) if and only if p has fewer trailing zeroes than q. Note that for a power of two, the number of trailing zeroes is equal to the logarithm, so we can use this builtin for the log as well.

The && lp part is just for the corner case where p is zero: without it, the expression would output 1 there.
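
A few quick sanity checks of the corner cases discussed above (my addition; divide() here is the function from this answer):

    #include <assert.h>
    #include <stdint.h>

    int main(void) {
        assert(divide(0, 8)  == 0);   /* lp == 0, so the "&& lp" term keeps it 0 */
        assert(divide(9, 8)  == 2);   /* lp = 1 < lq+1 = 4, so we round up      */
        assert(divide(16, 8) == 2);   /* lp = 5 >= 4, exact, no rounding        */
        return 0;
    }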

0 Can't use __builtin_ctzll() for both because it is undefined if p==0 .

If the dividend/divisor can be guaranteed not to exceed 63 (or 31) bits, you can use the following version, mentioned in the question. Note how p+q could overflow if they use all 64 bits. This would be fine if the SHR instruction shifted in the carry flag, but AFAIK it doesn't.

    uint64_t divide(uint64_t p, uint64_t q) {
        return (p + q - 1) >> lg(q);
    }

If those constraints cannot be guaranteed, you can just do the floor method and then add 1 if it would round up. This can be determined by checking whether any bits in the dividend are within the range of the divisor. Note: p & -p extracts the lowest set bit on 2's-complement machines, as does the BLSI instruction.

    uint64_t divide(uint64_t p, uint64_t q) {
        return (p >> lg(q)) + ( (p & -p) < q );
    }

Which clang compiles to:

        bsrq    %rax, %rsi
        shrxq   %rax, %rdi, %rax
        blsiq   %rdi, %rcx
        cmpq    %rsi, %rcx
        adcq    $0, %rax
        retq

That's a bit wordy and uses some newer instructions, so maybe there is a way to use the carry flag in the original version. Let's see: the RCR instruction does, but it seems like it would be worse... perhaps the SHRD instruction... It would be something like this (unable to test at the moment):

        xor  edx, edx       ; edx = 0 (will store the carry flag)
        bsr  rcx, rsi       ; rcx = lg(q) ... could be moved anywhere before shrd
        lea  rax, [rsi-1]   ; rax = q-1 (adding p could carry)
        add  rax, rdi       ; rax += p (handle carry)
        setc dl             ; rdx = carry flag ... or xor rdx and setc
        shrd rax, rdx, cl   ; rax = rdx:rax >> cl
        ret

1 more instruction, but it should be compatible with older processors (if it works... I'm always getting a source/destination swapped – feel free to edit).

Addendum:

I've implemented the lg() function I said was available as follows:

    inline uint64_t lg(uint64_t x) {
        return 63U - (uint64_t)__builtin_clzl(x);
    }

    inline uint32_t lg32(uint32_t x) {
        return 31U - (uint32_t)__builtin_clz(x);
    }

The fast log functions don't fully optimize to bsr on clang and ICC, but you can do this:

    #if defined(__x86_64__) && (defined(__clang__) || defined(__INTEL_COMPILER))
    static inline uint64_t lg(uint64_t x){
        uint64_t ret;
        //other compilers may want bsrq here
        __asm__("bsr %1, %0" : "=r"(ret) : "r"(x));
        return ret;
    }
    #endif
    #if defined(__i386__) && (defined(__clang__) || defined(__INTEL_COMPILER))
    static inline uint32_t lg32(uint32_t x){
        uint32_t ret;
        __asm__("bsr %1, %0" : "=r"(ret) : "r"(x));
        return ret;
    }
    #endif

There has already been a lot of human brainpower applied to this problem, with several variants of great answers in C along with Peter Cordes’s answer which covers the best you could hope for in asm, with notes on trying to map it back to C .

So while the humans are having their kick at the can, I thought I'd see what some brute computing power has to say! To that end, I used Stanford's STOKE superoptimizer to try to find good solutions to the 32-bit and 64-bit versions of this problem.

Usually, a superoptimizer is something like a brute-force search through all possible instruction sequences until you find the best one by some metric. Of course, with something like 1,000 instructions that will quickly spiral out of control for more than a few instructions 1. STOKE, on the other hand, takes a guided randomized approach: it randomly makes mutations to an existing candidate program, evaluating at each step a cost function that takes both performance and correctness into account. That's the one-liner anyway – there are plenty of papers if that stoked your curiosity.

So within a few minutes STOKE found some pretty interesting solutions. It found almost all the high-level ideas in the existing solutions, plus a few unique ones. For example, for the 32-bit function, STOKE found this version:

    neg  rsi
    dec  rdi
    pext rax, rsi, rdi
    inc  eax
    ret

It doesn’t use any leading/trailing-zero count or shift instructions at all. Pretty much, it uses neg esi to turn the divisor into a mask with 1s in the high bits, and then pext effectively does the shift using that mask. Outside of that trick it’s using the same trick that user QuestionC used: decrement p, shift, increment p – but it happens to work even for zero dividend because it uses 64-bit registers to distinguish the zero case from the MSB-set large p case.

I added the C version of this algorithm to the benchmark, and added it to the results. It's competitive with the other good algorithms, tying for first in the "Variable Q" cases. It does vectorize, but not as well as the other 32-bit algorithms, because it needs 64-bit math and so the vectors can process only half as many elements at once.
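
A C rendering of the pext trick would look roughly like this (my reconstruction, so treat the exact wrapper as an assumption; it needs BMI2 for _pext_u64, which -march=haswell provides):

    #include <stdint.h>
    #include <immintrin.h>   /* _pext_u64 (BMI2) */

    static inline uint32_t stoke32_pext(uint32_t p, uint32_t q) {
        /* -q (in 64 bits) is a mask of bit lg(q) and everything above it,
         * so pext of p-1 through that mask performs the right shift.
         * For p == 0, p-1 is all-ones and the final +1 wraps the low
         * 32 bits back to 0 after truncation. */
        uint64_t shifted = _pext_u64((uint64_t)p - 1, -(uint64_t)q);
        return (uint32_t)(shifted + 1);
    }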

Even better, in the 32-bit case it came up with a variety of solutions which use the fact that some of the intuitive solutions that fail for edge cases happen to "just work" if you use 64-bit ops for part of the calculation. For example:

    tzcntl ebp, esi
    dec    esi
    add    rdi, rsi
    sarx   rax, rdi, rbp
    ret

That’s the equivalent of the return (p + q - 1) >> lg(q) suggestion I mentioned in the question. That doesn’t work in general since for large p + q it overflows, but for 32-bit p and q this solution works great by doing the important parts in 64-bit. Convert that back into C with some casts and it actually figures out that using lea will do the addition in one instruction 1 :

    stoke_32(unsigned int, unsigned int):
        tzcnt   edx, esi
        mov     edi, edi          ; goes away when inlining
        mov     esi, esi          ; goes away when inlining
        lea     rax, [rsi-1+rdi]
        shrx    rax, rax, rdx
        ret

So it’s a 3-instruction solution when inlined into something that already has the values zero-extended into rdi and rsi . The stand-alone function definition needs the mov instructions to zero-extend because that’s how the SysV x64 ABI works .


For the 64-bit function it didn’t come up with anything that blows away the existing solutions but it did come up with some neat stuff, like:

    tzcnt   r13, rsi
    tzcnt   rcx, rdi
    shrx    rax, rdi, r13
    cmp     r13b, cl
    adc     rax, 0
    ret

That guy counts the trailing zeros of both arguments, and then adds 1 to the result if q has fewer trailing zeros than p , since that’s when you need to round up. Clever!

In general, it understood the idea that you needed to shift by the tzcnt really quickly (just like most humans), and then came up with a ton of other solutions to the problem of adjusting the result to account for rounding. It even managed to use blsi and bzhi in several solutions. Here's a 5-instruction solution it came up with:

    tzcnt   r13, rsi
    shrx    rax, rdi, r13
    imul    rsi, rax
    cmp     rsi, rdi
    adc     rax, 0
    ret

It’s a basically a “multiply and verify” approach – take the truncated res = p \ q , multiply it back and if it’s different than p add one: return res * q == p ? ret : ret + 1 . 凉。 Not really better than Peter’s solutions though. STOKE seems to have some flaws in its latency calculation – it thinks the above has a latency of 5 – but it’s more like 8 or 9 depending on the architecture. So it sometimes narrows in solutions that are based on its flawed latency calculation.


1 Interestingly enough, this brute-force approach reaches the limit of its feasibility at around 5-6 instructions: if you assume you can trim the instruction count to, say, 300 by eliminating SIMD and x87 instructions, then you would need ~28 days to try all 300^5 5-instruction sequences at 1,000,000 candidates/second. You could perhaps reduce that by a factor of 1,000 with various optimizations, meaning less than an hour for 5-instruction sequences and maybe a week for 6-instruction ones. As it happens, most of the best solutions for this problem fall into that 5-6 instruction window…

2 This will be a slow lea , however, so the sequence found by STOKE was still optimal for what I optimized for, which was latency.

You can do it like this, by comparing the division n / d against the division (n - 1) / d.

    #include <stdio.h>

    int main(void)
    {
        unsigned n;
        unsigned d;
        unsigned q1, q2;
        double actual;
        for(n = 1; n < 6; n++) {
            for(d = 1; d < 6; d++) {
                actual = (double)n / d;
                q1 = n / d;
                if(n) {
                    q2 = (n - 1) / d;
                    if(q1 == q2) {
                        q1++;
                    }
                }
                printf("%u / %u = %u (%f)\n", n, d, q1, actual);
            }
        }
        return 0;
    }

Program output:

    1 / 1 = 1 (1.000000)
    1 / 2 = 1 (0.500000)
    1 / 3 = 1 (0.333333)
    1 / 4 = 1 (0.250000)
    1 / 5 = 1 (0.200000)
    2 / 1 = 2 (2.000000)
    2 / 2 = 1 (1.000000)
    2 / 3 = 1 (0.666667)
    2 / 4 = 1 (0.500000)
    2 / 5 = 1 (0.400000)
    3 / 1 = 3 (3.000000)
    3 / 2 = 2 (1.500000)
    3 / 3 = 1 (1.000000)
    3 / 4 = 1 (0.750000)
    3 / 5 = 1 (0.600000)
    4 / 1 = 4 (4.000000)
    4 / 2 = 2 (2.000000)
    4 / 3 = 2 (1.333333)
    4 / 4 = 1 (1.000000)
    4 / 5 = 1 (0.800000)
    5 / 1 = 5 (5.000000)
    5 / 2 = 3 (2.500000)
    5 / 3 = 2 (1.666667)
    5 / 4 = 2 (1.250000)
    5 / 5 = 1 (1.000000)

Update

I posted an early answer to the original question, which works, but did not consider the efficiency of the algorithm, or that the divisor is always a power of 2. Performing two divisions was needlessly expensive.

I am using the MSVC 32-bit compiler on a 64-bit system, so there is no chance that I can provide a best solution for the required target. But it is an interesting question, so I have dabbled around to find that the best solution will discover the bit of the 2**n divisor. Using the library function log2 worked but was slow. Doing my own shift was much better, but still my best C solution is:

    unsigned roundup(unsigned p, unsigned q)
    {
        return p / q + ((p & (q-1)) != 0);
    }

My inline 32-bit assembler solution is faster, but of course that will not answer the question. I steal some cycles by assuming that eax is returned as the function value.

    unsigned roundup(unsigned p, unsigned q)
    {
        __asm {
            mov  eax,p
            mov  edx,q
            bsr  ecx,edx    ; cl = bit number of q
            dec  edx        ; q-1
            and  edx,eax    ; p & (q-1)
            shr  eax,cl     ; divide p by q, a power of 2
            sub  edx,1      ; generate a carry when (p & (q-1)) == 0
            cmc
            adc  eax,0      ; add 1 to result when (p & (q-1)) != 0
        }
    }                       ; eax returned as function value

This seems efficient and works for signed values if your compiler is using arithmetic right shifts (usually true).

    #include <stdio.h>

    int main (void)
    {
        for (int i = -20; i <= 20; ++i) {
            printf ("%5d %5d\n", i, ((i - 1) >> 1) + 1);
        }
        return 0;
    }

Use >> 2 to divide by 4, >> 3 to divide by 8, etc. An efficient lg does the work there.

You can even divide by 1! >> 0