单精度浮点精度降低精度差

我正在尝试将范围缩减作为实现正弦函数的第一步。

我遵循KC NG的论文“减少巨大论据”中描述的方法

当使用0到20000的x的输入范围时,我得到的误差大到0.002339146.我的错误显然不应该那么大,我不知道如何减少它。 我注意到误差幅度与输入θ幅度与余弦/正弦相关。

我能够获得本文提到的nearpi.c代码,但我不确定如何将代码用于单精度浮点。 如果有人有兴趣,可以在以下链接找到nearpi.c文件: nearpi.c

这是我的MATLAB代码:

x = 0:0.1:20000; % Perform range reduction % Store constant 2/pi twooverpi = single(2/pi); % Compute y y = (x.*twooverpi); % Compute k (round to nearest integer k = round(y); % Solve for f f = single(yk); % Solve for r r = single(f*single(pi/2)); % Find last two bits of k n = bitand(fi(k,1,32,0),fi(3,1,32,0)); n = single(n); % Preallocate for speed z(length(x)) = 0; for i = 1:length(x) switch(n(i)) case 0 z(i)=sin(r(i)); case 1 z(i) = single(cos(r(i))); case 2 z(i) = -sin(r(i)); case 3 z(i) = single(-cos(r(i))); otherwise end end maxerror = max(abs(single(z - single(sin(single(x)))))) minerror = min(abs(single(z - single(sin(single(x)))))) 

我编辑了nearpi.c程序,以便编译。 但是我不确定如何解释输出。 此外,文件需要输入,我必须手动输入,我也不确定输入的重要性。

这是工作nearpi.c:

 /* ============================================================================ Name : nearpi.c Author : Version : Copyright : Your copyright notice Description : Hello World in C, Ansi-style ============================================================================ */ #include  #include  #include  /* * Global macro definitions. */ # define hex( double ) *(1 + ((long *) &double)), *((long *) &double) # define sgn(a) (a >= 0 ? 1 : -1) # define MAX_k 2500 # define D 56 # define MAX_EXP 127 # define THRESHOLD 2.22e-16 /* * Global Variables */ int CFlength, /* length of CF including terminator */ binade; double e, f; /* [e,f] range of D-bit unsigned int of f; form 1X...X */ // Function Prototypes int dbleCF (double i[], double j[]); void input (double i[]); void nearPiOver2 (double i[]); /* * This is the start of the main program. */ int main (void) { int k; /* subscript variable */ double i[MAX_k], j[MAX_k]; /* i and j are continued fractions (coeffs) */ // fp = fopen("/src/cfpi.txt", "r"); /* * Compute global variables e and f, where * * e = 2 ^ (D-1), ie the D bit number 10...0 * and * f = 2 ^ D - 1, ie the D bit number 11...1 . */ e = 1; for (k = 2; k <= D; k = k + 1) e = 2 * e; f = 2 * e - 1; /* * Compute the continued fraction for (2/e)/(pi/2) , ie * q's starting value for the first binade, given the continued * fraction for pi as input; set the global variable CFlength * to the length of the resulting continued fraction (including * its negative valued terminator). One should use as many * partial coefficients of pi as necessary to resolve numbers * of the width of the underflow plus the overflow threshold. * A rule of thumb is 0.97 partial coefficients are generated * for every decimal digit of pi . * * Note: for radix B machines, subroutine input should compute * the continued fraction for (B/e)/(pi/2) where e = B ^ (D - 1). */ input (i); /* * Begin main loop over all binades: * For each binade, find the nearest multiples of pi/2 in that binade. * * [ Note: for hexadecimal machines ( B = 16 ), the rest of the main * program simplifies(!) to * * B_ade = 1; * while (B_ade < MAX_EXP) * { * dbleCF (i, j); * dbleCF (j, i); * dbleCF (i, j); * CFlength = dbleCF (j, i); * B_ade = B_ade + 1; * } * } * * because the alternation of source & destination are no longer necessary. ] */ binade = 1; while (binade = MAX_EXP) break; /* * For the current (even) binade, find the nearest multiples of pi/2. */ nearPiOver2 (j); /* * Double the continued fraction to get to the next (odd) binade. */ CFlength = dbleCF (j, i); binade = binade + 1; } return 0; } /* end of Main Program */ /* * Subroutine DbleCF doubles a continued fraction whose partial * coefficients are i[] into a continued fraction j[], where both * arrays are of a type sufficient to do D-bit integer arithmetic. * * In my case ( D = 56 ) , I am forced to treat integers as double * precision reals because my machine does not have integers of * sufficient width to handle D-bit integer arithmetic. * * Adapted from a Basic program written by W. Kahan. * * Algorithm based on Hurwitz's method of doubling continued * fractions (see Knuth Vol. 3, p.360). * * A negative value terminates the last partial quotient. * * Note: for the non-C programmers, the statement break * exits a loop and the statement continue skips to the next * case in the same loop. * * The call modf ( l / 2, &l0 ) assigns the integer portion of * half of L to L0. */ int dbleCF (double i[], double j[]) { double k, l, l0, j0; int n, m; n = 1; m = 0; j0 = i[0] + i[0]; l = i[n]; while (1) { if (l  0) { j[m] = j0; j[m + 1] = l0; j0 = 0; m = m + 2; }; if (l == 0) { /* * Even case. */ if (k < 0) { m = m - 1; break; } else { j0 = j0 + k + k; n = n + 2; l = i[n]; continue; }; } /* * Odd case. */ if (k = 0); /* * Double the continued fraction for pi D-3 times using * i and j alternately as source and destination. On my * machine D = 56 so D-3 is odd; hence the following code: * * Double twice (D-3)/2 times, */ for (k = 1; k = 0); /* * Return the length of the continued fraction, including its * terminator and initial zero, in the global variable CFlength. */ CFlength = k; } /* * Given a continued fraction's coefficients in an array i , * subroutine nearPiOver2 finds all machine representable * values near a integer multiple of pi/2 in the current binade. */ void nearPiOver2 (double i[]) { int k, /* subscript for recurrences (see handout) */ K; /* like k , but used during cancel. elim. */ double p[MAX_k], /* product of the q's (see handout) */ q[MAX_k], /* successive tail evals of CF (see handout) */ j[MAX_k], /* like convergent numerators (see handout) */ tmp, /* temporary used during cancellation elim. */ mk0, /* m[k - 1] (see handout) */ mk, /* m[k] is one of the few ints (see handout) */ mkAbs, /* absolute value of m sub k */ mK0, /* like mk0 , but used during cancel. elim. */ mK, /* like mk , but used during cancel. elim. */ z, /* the object of our quest (the argument) */ m0, /* the mantissa of z as a D-bit integer */ x, /* the reduced argument (see handout) */ ldexp (), /* sys routine to multiply by a power of two */ fabs (), /* sys routine to compute FP absolute value */ floor (), /* sys routine to compute greatest int = value */ /* * Compute the q's by evaluating the continued fraction from * bottom up. * * Start evaluation with a big number in the terminator position. */ q[CFlength] = 1.0 + 30; for (k = CFlength - 1; k >= 0; k = k - 1) q[k] = i[k] + 1 / q[k + 1]; /* * Let THRESHOLD be the biggest | x | that we are interesed in * seeing. * * Compute the p's and j's by the recurrences from the top down. * * Stop when * * 1 1 * ----- >= THRESHOLD > ------ . * 2 |j | 2 |j | * k k+1 */ p[0] = 1; j[0] = 0; j[1] = 1; k = 0; do { p[k + 1] = -q[k + 1] * p[k]; if (k > 0) j[1 + k] = j[k - 1] - i[k] * j[k]; k = k + 1; } while (1 / (2 * fabs (j[k])) >= THRESHOLD); /* * Then mk runs through the integers between * * k + k + * (-1) e / p - 1/2 & (-1) f / p - 1/2 . * kk */ for (mkAbs = floor (e / fabs (p[k])); mkAbs <= ceil (f / fabs (p[k])); mkAbs = mkAbs + 1) { mk = mkAbs * sgn (p[k]); /* * For each mk , mk0 runs through integers between * * + * mq - p THRESHOLD . * kkk */ for (mk0 = floor (mk * q[k] - fabs (p[k]) * THRESHOLD); mk0 <= ceil (mk * q[k] + fabs (p[k]) * THRESHOLD); mk0 = mk0 + 1) { /* * For each pair { mk , mk0 } , check that * * k * m = (-1) ( jm - jm ) * 0 k-1 kk k-1 */ m0 = (k & 1 ? -1 : 1) * (j[k - 1] * mk - j[k] * mk0); /* * lies between e and f . */ if (e <= fabs (m0) && fabs (m0)  0) { p[K + 1] = -q[K + 1] * p[K]; tmp = mK0 - i[K] * mK; if (fabs (tmp) > fabs (mK0)) break; mK0 = mK; mK = tmp; K = K + 1; }; /* * Then * x = ( mq - m ) / p * KK K-1 K * * as accurately as one could hope. */ x = (mK * q[K] - mK0) / p[K]; /* * To return z and m0 as positive numbers, * x must take the sign of m0 . */ x = x * sgn (m0); m0 = fabs (m0); /*d * Set z = m0 * 2 ^ (binade+1-D) . */ z = ldexp (m0, binade + 1 - D); /* * Print z (hex), z (dec), m0 (dec), binade+1-D, x (hex), x (dec). */ printf ("%08lx %08lx Z=%22.16EM=%17.17G L+1-%d=%3d %08lx %08lx x=%23.16E\n", hex (z), z, m0, D, binade + 1 - D, hex (x), x); } } } } 

理论

首先让我们注意使用单精度算术的差异。

  1. [等式8] f的最小值可以更大。 由于双精度数是单精度数的超集,因此最接近2/pi的倍数的single只能比~ 2.98e-19更远,因此固定算术表示中的前导零的数量f必须至多为61个前导零(但可能会更少)。 表示此数量fdigits

  2. [9之前的等式]因此,代替121位, y必须精确到fdigits + 24(单精度中的非零有效位) fdigits (额外保护位)= fdigits + 31,并且最多92。

  3. [等式9]“因此,与x的指数的宽度一起, 2/pi必须包含127( single最大指数)+ 31 + fdigits ,或158 + fdigits和至多219位。

  4. [2.5小节] A的大小由二进制点之前的x的零的数量决定(并且不受移动到single ),而C的大小由9之前的等式确定。

    • 对于大xx > = 2 ^ 24), x看起来像这样:[24位,M 0]。 将它乘以A ,其大小是2/pi的前M位,将得到一个整数( x的零将只将整数转换为整数)。
    • 选择C2/piM+d位开始将导致乘积x*C的大小最多为d-24 。 在双精度中, d被选择为174(而不是24,我们有53),因此产品的大小最多为121.在single ,选择d使得d-24 <= 92就足够了,或者更准确地说, d-24 <= fdigits+31 。 也就是说, d可以选择为fdigits + 55,或最多116。
    • 结果, B的大小最多为116位。

因此,我们有两个问题:

  1. 计算fdigits 。 这涉及从链接的论文中阅读参考文献6并理解它。 可能不那么容易。 :)据我所知,这是使用nearpi.c的唯一地方。
  2. 计算B2/pi的相关位。 由于M低于127,我们可以计算离线2/pi的前127 + 116位并将它们存储在一个数组中。 见维基百科 。
  3. 计算y=x*B 这涉及将x乘以116位数。 这是使用第3节的地方。 块的大小选择为24,因为2 * 24 + 2(乘以两个24位数字,并且添加3个这样的数字)小于double的精度53(并且因为24除以96)。 出于类似的原因,我们可以使用大小为11位的块进行single运算。

注意 - B的技巧仅适用于指数为正(x> = 2 ^ 24)的数字。

总结一下 - 首先,你必须用double精度解决问题。 您的Matlab代码也不能以double精度工作(尝试删除single和计算sin(2^53) ,因为您的twooverpi只有53个有效位,而不是175(并且无论如何,您不能在Matlab中直接乘以这样的精确数字)其次,该方案应该适用于single工作,而且关键问题是准确地表示2/pi ,并支持高精度数字的乘法。最后,当一切正常时,你可以尝试找出一个更好的fdigits减少你必须存储和乘法的位数。

希望我没有完全离开 - 欢迎评论和矛盾。

举个例子,让我们计算sin(x) ,其中x = single(2^24-1) ,在有效位( M = 0)之后没有零。 这简化了发现B ,因为B2/pi的前116位组成。 由于x具有24位的精度和116位的B ,因此产品

 y = x * B 

将根据需要具有92位精度。

链接文章的第3部分描述了如何以足够的精度执行此产品; 在我们的例子中,相同的算法可以与大小为11的块一起使用来计算y 。 作为一个苦差事,我希望我没有明确地这样做,而是依靠Matlab的符号数学工具箱。 这个工具箱为我们提供了vpa函数,它允许我们以十进制数字指定数字的精度。 所以,

 vpa('2/pi', ceil(116*log10(2))) 

将产生至少116位精度的2/pi的近似值。 因为vpa只接受其精度参数的整数,所以我们通常不能精确地指定数字的二进制精度,所以我们使用次优的。

以下代码根据本文以single精度计算sin(x)

 x = single(2^24-1); y = x * vpa('2/pi', ceil(116*log10(2))); % Precision = 103.075 k = round(y); f = single(y - k); r = f * single(pi) / 2; switch mod(k, 4) case 0 s = sin(r); case 1 s = cos(r); case 2 s = -sin(r); case 3 s = -cos(r); end sin(x) - s % Expected value: exactly zero. 

y的精度是使用Mathematica获得的,结果certificate它是比Matlab更好的数值工具:))

libm

这个问题的另一个答案(之后已被删除)引导我在libm实现,虽然它在双精度数字上工作,但是非常彻底地遵循链接的文章。

请参阅文件s_sin.c以获取包装器(链接纸中的表2显示为文件末尾的switch语句), e_rem_pio2.c用于参数减少代码(特别感兴趣的是包含第一个396 hex-的数组 -数字2/pi ,从第69行开始)。