在x86上以32位块实现类似学校的划分

假设我有两个大数字（定义如下），我想通过回退到x86 avaliable算法来实现它们的划分

0008768376 – 1653656387 – 0437673667 – 0123767614 – 1039873878 – 2231712290/0038768167 – 3276287672 – 1665265628

C = A / B

这些数字存储为32位无符号整数的向量。第一个，A，是6-unsigned-int向量，B是3-unsigned-int长向量[这个字段中的每一个我自己命名为’digit’或’field’]

得到的C将是一些3-unsigned-int向量，但要计算它我需要回退到一些可用的x86（32位模式，虽然我也可以听到x64，但这是次要的）算术

告诉我如何计算至少第一个，最重要的C结果矢量字段..

怎么做？

这是来自维基百科的任何大小操作数的无符号长除法伪代码：

 if D == 0 then error(DivisionByZeroException) end Q := 0 -- initialize quotient and remainder to zero R := 0 for i = n-1...0 do -- where n is number of bits in N R := R << 1 -- left-shift R by 1 bit R(0) := N(i) -- set the least-significant bit of R equal to bit i of the numerator if R >= D then R := R - D Q(i) := 1 end end

这可以扩展到包括有符号的除法（将结果舍入为零，而不是负无穷大）：

 if D == 0 then error(DivisionByZeroException) end Q := 0 -- initialize quotient and remainder to zero R := 0 SaveN := N -- save numerator SaveD := D -- save denominator if N < 0 then N = -N -- invert numerator if negative if D < 0 then D = -D -- invert denominator if negative for i = n-1...0 do -- where n is number of bits in N R := R << 1 -- left-shift R by 1 bit R(0) := N(i) -- set the least-significant bit of R equal to bit i of the numerator if R >= D then R := R - D Q(i) := 1 end end if SaveN < 0 then R = -R -- numerator was negative, negative remainder if (SaveN < 0 and SaveD >= 0) or (SaveN >= 0 and SaveD < 0) then Q = -Q -- differing signs of inputs, result is negative

这是一个相对简单，未经优化， 未经测试的x86 ASM（NASM语法）实现，应该很容易理解：

  ; function div_192_96 ; parameters: ; 24 bytes: numerator, high words are stored after low words ; 24 bytes: denominator, high words are stored after low words (only low 12 bytes are used) ; 4 bytes: address to store 12 byte remainder in (must not be NULL) ; 4 bytes: address to store 12 byte quotient in (must not be NULL) ; return value: none ; error checking: none GLOBAL div_192_96 div_192_96: pushl ebp ; set up stack frame movl ebp, esp pushl 0 ; high word of remainder pushl 0 ; middle word of remainder pushl 0 ; low word of remainder pushl 0 ; high word of quotient pushl 0 ; middle word of quotient pushl 0 ; low word of quotient movl ecx, 96 .div_loop: jecxz .div_loop_done decl ecx ; remainder = remainder << 1 movl eax, [ebp-8] ; load middle word of remainder shld [ebp-4], eax, 1 ; shift high word of remainder left by 1 movl eax, [ebp-12] ; load low word of remainder shld [ebp-8], eax, 1 ; shift middle word of remainder left by 1 shll [ebp-12], 1 ; shift low word of remainder left by 1 ; quotient = quotient << 1 movl eax, [ebp-20] ; load middle word of remainder shld [ebp-16], eax, 1; shift high word of remainder left by 1 movl eax, [ebp-24] ; load low word of remainder shld [ebp-20], eax, 1; shift middle word of remainder left by 1 shll [ebp-24], 1 ; shift low word of remainder left by 1 ; remainder(0) = numerator(127) movl eax, [ebp+28] ; load high word of numerator shrl eax, 31 ; get top bit in bit 0 orl [ebp-12], eax ; OR into low word of remainder ; numerator = numerator << 1 movl eax, [ebp+24] ; load 5th word of numerator shld [ebp+28], eax, 1; shift 6th word of numerator left by 1 movl eax, [ebp+20] ; load 4th word of numerator shld [ebp+24], eax, 1; shift 5th word of numerator left by 1 movl eax, [ebp+16] ; load 3rd word of numerator shld [ebp+20], eax, 1; shift 4th word of numerator left by 1 movl eax, [ebp+12] ; load 2nd word of numerator shld [ebp+16], eax, 1; shift 3rd word of numerator left by 1 movl eax, [ebp+8] ; load 1st word of numerator shld [ebp+12], eax, 1; shift 2nd word of numerator left by 1 shll [ebp+8], 1 ; shift 1st word of numerator left by 1 ; if (remainder >= denominator) movl eax, [ebp+40] ; compare high word of denominator cmpl eax, [ebp-4] ; with high word of remainder jb .div_loop ja .div_subtract movl eax, [ebp+36] ; compare middle word of denominator cmpl eax, [ebp-8] ; with middle word of remainder jb .div_loop ja .div_subtract movl eax, [ebp+32] ; compare low word of denominator cmpl eax, [ebp-12] ; with low word of remainder jb .div_loop .div_subtract: ; remainder = remainder - denominator movl eax, [ebp+32] ; load low word of denominator subl [ebp-12], eax ; and subtract from low word of remainder movl eax, [ebp+36] ; load middle word of denominator sbbl [ebp-8], eax ; and subtract from middle word of remainder (with borrow) movl eax, [ebp+40] ; load high word of denominator sbbl [ebp-4], eax ; and subtract from high word of remainder (with borrow) ; quotient(0) = 1 orl [ebp-24], 1 ; OR 1 into low word of quotient jmp .div_loop .div_loop_done: movl eax, [ebp+56] ; load remainder storage pointer movl edx, [ebp-12] ; load low word of remainder movl [eax+0], edx ; store low word of remainder movl edx, [ebp-8] ; load middle word of remainder movl [eax+4], edx ; store middle word of remainder movl edx, [ebp-4] ; load high word of remainder movl [eax+8], edx ; store high word of remainder movl eax, [ebp+60] ; load quotient storage pointer movl edx, [ebp-24] ; load low word of quotient movl [eax+0], edx ; store low word of quotient movl edx, [ebp-20] ; load middle word of quotient movl [eax+4], edx ; store middle word of quotient movl edx, [ebp-16] ; load high word of quotient movl [eax+8], edx ; store high word of quotient addl esp, 24 popl ebp ret

请注意，这只是为了给您一个总体思路，并且尚未经过测试。顺便说一句，计算汇编中两个相等大小的数的商可能比试图解决溢出问题（在上面的代码中完全未处理）更容易。

GMP的文档包括有关其使用的算法的部分，包括划分。在GMP术语中，每个32b块称为“肢体”。

64位CPU非常适合扩展精度数学，因为它们一次处理两倍，每次操作大约相同的时间。如果你想自己实现扩展精度，而不是使用LGPLed GMP库，我建议使用x86-64 asm并回退到C.（除非你想将它作为你希望仅作为传统发布的东西的一部分使用） 32位二进制文件。）

但是，除了这条规则之外，分部是例外：在最近的Intel和AMD的x86设计中，128b / 64b = 64b分区比64b / 32b = 32b分区具有更差的延迟和更低的吞吐量。请参阅http://agner.org/optimize/ ，在指令表中查找div 。 idiv是有符号除法， div是无符号的。）相比之下， ADC （带进位加载）的每周期吞吐量为1，与64位MUL （和IMUL ）相同。

实际上，至少在Intel Sandybridge系列上， MUL/IMUL 64bit * 64bit = 128bit比IMUL/MUL 32bit * 32bit = 64bit 快。 mul 32bit是3 uops，吞吐量为每2个周期一个，而mul 64bit为2 uops，吞吐量为每个周期一个。 mul的单操作数forms是rdx:rax = rax * src 。我想有一个额外的周期需要将64位结果从乘法器中分离/混洗到edx：eax，它被设置为产生128b结果的两个64位半。

在AMD Bulldozer系列CPU上，32位mul比64位mul更快。我猜他们没有全宽乘法器硬件。

（对于编译器的正常使用， c = a*b通常对所有变量具有相同的宽度，因此结果的上半部分可以被丢弃imul具有dest *= src双操作数forms的imul （但不是mul ）这是与最快的单操作数forms相同的速度。所以不要担心使用long s，因此它们在正常代码中会加倍。）

这适用于CPU是运行32位还是64位代码，但32位代码不能执行64位操作。

你问的是div ，我下了一个切线。 x86的div指令执行rdx：rax / src，输出rax = quotient，rdx = remainder。或者edx：eax …用于32位版本。

在x86上以32位块实现类似学校的划分

使用scanf读取多行输入

strstr的优化版本（搜索具有恒定长度）

使用Unix和C / C ++帮助阻止套接字

如何为unicode字符制作小写字母

是否可以在运行时生成并执行Rust代码？

矩阵，指针，C *

C数组的范围初始化

错误：从’void *’转换为’int’会在makefile中丢失精度

我遇到多个字符和scanf_s（）的问题

C 3d数组char