在x86上以32位块实现类似学校的划分

假设我有两个大数字(定义如下),我想通过回退到x86 avaliable算法来实现它们的划分

0008768376 – 1653656387 – 0437673667 – 0123767614 – 1039873878 – 2231712290/0038768167 – 3276287672 – 1665265628

C = A / B

这些数字存储为32位无符号整数的向量。 第一个,A,是6-unsigned-int向量,B是3-unsigned-int长向量[这个字段中的每一个我自己命名为’digit’或’field’]

得到的C将是一些3-unsigned-int向量,但要计算它我需要回退到一些可用的x86(32位模式,虽然我也可以听到x64,但这是次要的)算术

告诉我如何计算至少第一个,最重要的C结果矢量字段..

怎么做?

这是来自维基百科的任何大小操作数的无符号长除法伪代码:

 if D == 0 then error(DivisionByZeroException) end Q := 0 -- initialize quotient and remainder to zero R := 0 for i = n-1...0 do -- where n is number of bits in N R := R << 1 -- left-shift R by 1 bit R(0) := N(i) -- set the least-significant bit of R equal to bit i of the numerator if R >= D then R := R - D Q(i) := 1 end end 

这可以扩展到包括有符号的除法(将结果舍入为零,而不是负无穷大):

 if D == 0 then error(DivisionByZeroException) end Q := 0 -- initialize quotient and remainder to zero R := 0 SaveN := N -- save numerator SaveD := D -- save denominator if N < 0 then N = -N -- invert numerator if negative if D < 0 then D = -D -- invert denominator if negative for i = n-1...0 do -- where n is number of bits in N R := R << 1 -- left-shift R by 1 bit R(0) := N(i) -- set the least-significant bit of R equal to bit i of the numerator if R >= D then R := R - D Q(i) := 1 end end if SaveN < 0 then R = -R -- numerator was negative, negative remainder if (SaveN < 0 and SaveD >= 0) or (SaveN >= 0 and SaveD < 0) then Q = -Q -- differing signs of inputs, result is negative 

这是一个相对简单, 未经优化, 未经测试的x86 ASM(NASM语法)实现,应该很容易理解:

  ; function div_192_96 ; parameters: ; 24 bytes: numerator, high words are stored after low words ; 24 bytes: denominator, high words are stored after low words (only low 12 bytes are used) ; 4 bytes: address to store 12 byte remainder in (must not be NULL) ; 4 bytes: address to store 12 byte quotient in (must not be NULL) ; return value: none ; error checking: none GLOBAL div_192_96 div_192_96: pushl ebp ; set up stack frame movl ebp, esp pushl 0 ; high word of remainder pushl 0 ; middle word of remainder pushl 0 ; low word of remainder pushl 0 ; high word of quotient pushl 0 ; middle word of quotient pushl 0 ; low word of quotient movl ecx, 96 .div_loop: jecxz .div_loop_done decl ecx ; remainder = remainder << 1 movl eax, [ebp-8] ; load middle word of remainder shld [ebp-4], eax, 1 ; shift high word of remainder left by 1 movl eax, [ebp-12] ; load low word of remainder shld [ebp-8], eax, 1 ; shift middle word of remainder left by 1 shll [ebp-12], 1 ; shift low word of remainder left by 1 ; quotient = quotient << 1 movl eax, [ebp-20] ; load middle word of remainder shld [ebp-16], eax, 1; shift high word of remainder left by 1 movl eax, [ebp-24] ; load low word of remainder shld [ebp-20], eax, 1; shift middle word of remainder left by 1 shll [ebp-24], 1 ; shift low word of remainder left by 1 ; remainder(0) = numerator(127) movl eax, [ebp+28] ; load high word of numerator shrl eax, 31 ; get top bit in bit 0 orl [ebp-12], eax ; OR into low word of remainder ; numerator = numerator << 1 movl eax, [ebp+24] ; load 5th word of numerator shld [ebp+28], eax, 1; shift 6th word of numerator left by 1 movl eax, [ebp+20] ; load 4th word of numerator shld [ebp+24], eax, 1; shift 5th word of numerator left by 1 movl eax, [ebp+16] ; load 3rd word of numerator shld [ebp+20], eax, 1; shift 4th word of numerator left by 1 movl eax, [ebp+12] ; load 2nd word of numerator shld [ebp+16], eax, 1; shift 3rd word of numerator left by 1 movl eax, [ebp+8] ; load 1st word of numerator shld [ebp+12], eax, 1; shift 2nd word of numerator left by 1 shll [ebp+8], 1 ; shift 1st word of numerator left by 1 ; if (remainder >= denominator) movl eax, [ebp+40] ; compare high word of denominator cmpl eax, [ebp-4] ; with high word of remainder jb .div_loop ja .div_subtract movl eax, [ebp+36] ; compare middle word of denominator cmpl eax, [ebp-8] ; with middle word of remainder jb .div_loop ja .div_subtract movl eax, [ebp+32] ; compare low word of denominator cmpl eax, [ebp-12] ; with low word of remainder jb .div_loop .div_subtract: ; remainder = remainder - denominator movl eax, [ebp+32] ; load low word of denominator subl [ebp-12], eax ; and subtract from low word of remainder movl eax, [ebp+36] ; load middle word of denominator sbbl [ebp-8], eax ; and subtract from middle word of remainder (with borrow) movl eax, [ebp+40] ; load high word of denominator sbbl [ebp-4], eax ; and subtract from high word of remainder (with borrow) ; quotient(0) = 1 orl [ebp-24], 1 ; OR 1 into low word of quotient jmp .div_loop .div_loop_done: movl eax, [ebp+56] ; load remainder storage pointer movl edx, [ebp-12] ; load low word of remainder movl [eax+0], edx ; store low word of remainder movl edx, [ebp-8] ; load middle word of remainder movl [eax+4], edx ; store middle word of remainder movl edx, [ebp-4] ; load high word of remainder movl [eax+8], edx ; store high word of remainder movl eax, [ebp+60] ; load quotient storage pointer movl edx, [ebp-24] ; load low word of quotient movl [eax+0], edx ; store low word of quotient movl edx, [ebp-20] ; load middle word of quotient movl [eax+4], edx ; store middle word of quotient movl edx, [ebp-16] ; load high word of quotient movl [eax+8], edx ; store high word of quotient addl esp, 24 popl ebp ret 

请注意,这只是为了给您一个总体思路,并且尚未经过测试。 顺便说一句,计算汇编中两个相等大小的数的商可能比试图解决溢出问题(在上面的代码中完全未处理)更容易。

GMP的文档包括有关其使用的算法的部分 ,包括划分。 在GMP术语中,每个32b块称为“肢体”。

64位CPU非常适合扩展精度数学,因为它们一次处理两倍,每次操作大约相同的时间。 如果你想自己实现扩展精度,而不是使用LGPLed GMP库 ,我建议使用x86-64 asm并回退到C.(除非你想将它作为你希望仅作为传统发布的东西的一部分使用) 32位二进制文​​件。)

但是,除了这条规则之外,分部是例外:在最近的Intel和AMD的x86设计中,128b / 64b = 64b分区比64b / 32b = 32b分区具有更差的延迟和更低的吞吐量。 请参阅http://agner.org/optimize/ ,在指令表中查找dividiv是有符号除法, div是无符号的。)相比之下, ADC (带进位加载)的每周期吞吐量为1,与64位MUL (和IMUL )相同。

实际上,至少在Intel Sandybridge系列上, MUL/IMUL 64bit * 64bit = 128bit比IMUL/MUL 32bit * 32bit = 64bit 。 mul 32bit是3 uops,吞吐量为每2个周期一个,而mul 64bit为2 uops,吞吐量为每个周期一个。 mul的单操作数forms是rdx:rax = rax * src 。 我想有一个额外的周期需要将64位结果从乘法器中分离/混洗到edx:eax,它被设置为产生128b结果的两个64位半。

在AMD Bulldozer系列CPU上,32位mul比64位mul更快。 我猜他们没有全宽乘法器硬件。

(对于编译器的正常使用, c = a*b通常对所有变量具有相同的宽度,因此结果的上半部分可以被丢弃imul具有dest *= src双操作数forms的imul (但不是mul )这是与最快的单操作数forms相同的速度。所以不要担心使用long s,因此它们在正常代码中会加倍。)

这适用于CPU是运行32位还是64位代码,但32位代码不能执行64位操作。

你问的是div ,我下了一个切线。 x86的div指令执行rdx:rax / src,输出rax = quotient,rdx = remainder。 或者edx:eax …用于32位版本。