单精度浮点精度降低精度差

我正在尝试将范围缩减作为实现正弦函数的第一步。

我遵循KC NG的论文“减少巨大论据”中描述的方法

当使用0到20000的x的输入范围时，我得到的误差大到0.002339146.我的错误显然不应该那么大，我不知道如何减少它。我注意到误差幅度与输入θ幅度与余弦/正弦相关。

我能够获得本文提到的nearpi.c代码，但我不确定如何将代码用于单精度浮点。如果有人有兴趣，可以在以下链接找到nearpi.c文件： nearpi.c

这是我的MATLAB代码：

x = 0:0.1:20000; % Perform range reduction % Store constant 2/pi twooverpi = single(2/pi); % Compute y y = (x.*twooverpi); % Compute k (round to nearest integer k = round(y); % Solve for f f = single(yk); % Solve for r r = single(f*single(pi/2)); % Find last two bits of k n = bitand(fi(k,1,32,0),fi(3,1,32,0)); n = single(n); % Preallocate for speed z(length(x)) = 0; for i = 1:length(x) switch(n(i)) case 0 z(i)=sin(r(i)); case 1 z(i) = single(cos(r(i))); case 2 z(i) = -sin(r(i)); case 3 z(i) = single(-cos(r(i))); otherwise end end maxerror = max(abs(single(z - single(sin(single(x)))))) minerror = min(abs(single(z - single(sin(single(x))))))

我编辑了nearpi.c程序，以便编译。但是我不确定如何解释输出。此外，文件需要输入，我必须手动输入，我也不确定输入的重要性。

这是工作nearpi.c：

 /* ============================================================================ Name : nearpi.c Author : Version : Copyright : Your copyright notice Description : Hello World in C, Ansi-style ============================================================================ */ #include  #include  #include  /* * Global macro definitions. */ # define hex( double ) *(1 + ((long *) &double)), *((long *) &double) # define sgn(a) (a >= 0 ? 1 : -1) # define MAX_k 2500 # define D 56 # define MAX_EXP 127 # define THRESHOLD 2.22e-16 /* * Global Variables */ int CFlength, /* length of CF including terminator */ binade; double e, f; /* [e,f] range of D-bit unsigned int of f; form 1X...X */ // Function Prototypes int dbleCF (double i[], double j[]); void input (double i[]); void nearPiOver2 (double i[]); /* * This is the start of the main program. */ int main (void) { int k; /* subscript variable */ double i[MAX_k], j[MAX_k]; /* i and j are continued fractions (coeffs) */ // fp = fopen("/src/cfpi.txt", "r"); /* * Compute global variables e and f, where * * e = 2 ^ (D-1), ie the D bit number 10...0 * and * f = 2 ^ D - 1, ie the D bit number 11...1 . */ e = 1; for (k = 2; k <= D; k = k + 1) e = 2 * e; f = 2 * e - 1; /* * Compute the continued fraction for (2/e)/(pi/2) , ie * q's starting value for the first binade, given the continued * fraction for pi as input; set the global variable CFlength * to the length of the resulting continued fraction (including * its negative valued terminator). One should use as many * partial coefficients of pi as necessary to resolve numbers * of the width of the underflow plus the overflow threshold. * A rule of thumb is 0.97 partial coefficients are generated * for every decimal digit of pi . * * Note: for radix B machines, subroutine input should compute * the continued fraction for (B/e)/(pi/2) where e = B ^ (D - 1). */ input (i); /* * Begin main loop over all binades: * For each binade, find the nearest multiples of pi/2 in that binade. * * [ Note: for hexadecimal machines ( B = 16 ), the rest of the main * program simplifies(!) to * * B_ade = 1; * while (B_ade < MAX_EXP) * { * dbleCF (i, j); * dbleCF (j, i); * dbleCF (i, j); * CFlength = dbleCF (j, i); * B_ade = B_ade + 1; * } * } * * because the alternation of source & destination are no longer necessary. ] */ binade = 1; while (binade = MAX_EXP) break; /* * For the current (even) binade, find the nearest multiples of pi/2. */ nearPiOver2 (j); /* * Double the continued fraction to get to the next (odd) binade. */ CFlength = dbleCF (j, i); binade = binade + 1; } return 0; } /* end of Main Program */ /* * Subroutine DbleCF doubles a continued fraction whose partial * coefficients are i[] into a continued fraction j[], where both * arrays are of a type sufficient to do D-bit integer arithmetic. * * In my case ( D = 56 ) , I am forced to treat integers as double * precision reals because my machine does not have integers of * sufficient width to handle D-bit integer arithmetic. * * Adapted from a Basic program written by W. Kahan. * * Algorithm based on Hurwitz's method of doubling continued * fractions (see Knuth Vol. 3, p.360). * * A negative value terminates the last partial quotient. * * Note: for the non-C programmers, the statement break * exits a loop and the statement continue skips to the next * case in the same loop. * * The call modf ( l / 2, &l0 ) assigns the integer portion of * half of L to L0. */ int dbleCF (double i[], double j[]) { double k, l, l0, j0; int n, m; n = 1; m = 0; j0 = i[0] + i[0]; l = i[n]; while (1) { if (l  0) { j[m] = j0; j[m + 1] = l0; j0 = 0; m = m + 2; }; if (l == 0) { /* * Even case. */ if (k < 0) { m = m - 1; break; } else { j0 = j0 + k + k; n = n + 2; l = i[n]; continue; }; } /* * Odd case. */ if (k = 0); /* * Double the continued fraction for pi D-3 times using * i and j alternately as source and destination. On my * machine D = 56 so D-3 is odd; hence the following code: * * Double twice (D-3)/2 times, */ for (k = 1; k = 0); /* * Return the length of the continued fraction, including its * terminator and initial zero, in the global variable CFlength. */ CFlength = k; } /* * Given a continued fraction's coefficients in an array i , * subroutine nearPiOver2 finds all machine representable * values near a integer multiple of pi/2 in the current binade. */ void nearPiOver2 (double i[]) { int k, /* subscript for recurrences (see handout) */ K; /* like k , but used during cancel. elim. */ double p[MAX_k], /* product of the q's (see handout) */ q[MAX_k], /* successive tail evals of CF (see handout) */ j[MAX_k], /* like convergent numerators (see handout) */ tmp, /* temporary used during cancellation elim. */ mk0, /* m[k - 1] (see handout) */ mk, /* m[k] is one of the few ints (see handout) */ mkAbs, /* absolute value of m sub k */ mK0, /* like mk0 , but used during cancel. elim. */ mK, /* like mk , but used during cancel. elim. */ z, /* the object of our quest (the argument) */ m0, /* the mantissa of z as a D-bit integer */ x, /* the reduced argument (see handout) */ ldexp (), /* sys routine to multiply by a power of two */ fabs (), /* sys routine to compute FP absolute value */ floor (), /* sys routine to compute greatest int = value */ /* * Compute the q's by evaluating the continued fraction from * bottom up. * * Start evaluation with a big number in the terminator position. */ q[CFlength] = 1.0 + 30; for (k = CFlength - 1; k >= 0; k = k - 1) q[k] = i[k] + 1 / q[k + 1]; /* * Let THRESHOLD be the biggest | x | that we are interesed in * seeing. * * Compute the p's and j's by the recurrences from the top down. * * Stop when * * 1 1 * ----- >= THRESHOLD > ------ . * 2 |j | 2 |j | * k k+1 */ p[0] = 1; j[0] = 0; j[1] = 1; k = 0; do { p[k + 1] = -q[k + 1] * p[k]; if (k > 0) j[1 + k] = j[k - 1] - i[k] * j[k]; k = k + 1; } while (1 / (2 * fabs (j[k])) >= THRESHOLD); /* * Then mk runs through the integers between * * k + k + * (-1) e / p - 1/2 & (-1) f / p - 1/2 . * kk */ for (mkAbs = floor (e / fabs (p[k])); mkAbs <= ceil (f / fabs (p[k])); mkAbs = mkAbs + 1) { mk = mkAbs * sgn (p[k]); /* * For each mk , mk0 runs through integers between * * + * mq - p THRESHOLD . * kkk */ for (mk0 = floor (mk * q[k] - fabs (p[k]) * THRESHOLD); mk0 <= ceil (mk * q[k] + fabs (p[k]) * THRESHOLD); mk0 = mk0 + 1) { /* * For each pair { mk , mk0 } , check that * * k * m = (-1) ( jm - jm ) * 0 k-1 kk k-1 */ m0 = (k & 1 ? -1 : 1) * (j[k - 1] * mk - j[k] * mk0); /* * lies between e and f . */ if (e <= fabs (m0) && fabs (m0)  0) { p[K + 1] = -q[K + 1] * p[K]; tmp = mK0 - i[K] * mK; if (fabs (tmp) > fabs (mK0)) break; mK0 = mK; mK = tmp; K = K + 1; }; /* * Then * x = ( mq - m ) / p * KK K-1 K * * as accurately as one could hope. */ x = (mK * q[K] - mK0) / p[K]; /* * To return z and m0 as positive numbers, * x must take the sign of m0 . */ x = x * sgn (m0); m0 = fabs (m0); /*d * Set z = m0 * 2 ^ (binade+1-D) . */ z = ldexp (m0, binade + 1 - D); /* * Print z (hex), z (dec), m0 (dec), binade+1-D, x (hex), x (dec). */ printf ("%08lx %08lx Z=%22.16EM=%17.17G L+1-%d=%3d %08lx %08lx x=%23.16E\n", hex (z), z, m0, D, binade + 1 - D, hex (x), x); } } } }

理论

首先让我们注意使用单精度算术的差异。

[等式8] f的最小值可以更大。由于双精度数是单精度数的超集，因此最接近2/pi的倍数的single只能比~ 2.98e-19更远，因此固定算术表示中的前导零的数量f必须至多为61个前导零（但可能会更少）。表示此数量fdigits 。
[9之前的等式]因此，代替121位， y必须精确到fdigits + 24（单精度中的非零有效位） fdigits （额外保护位）= fdigits + 31，并且最多92。
[等式9]“因此，与x的指数的宽度一起， 2/pi必须包含127（ single最大指数）+ 31 + fdigits ，或158 + fdigits和至多219位。
[2.5小节] A的大小由二进制点之前的x的零的数量决定（并且不受移动到single ），而C的大小由9之前的等式确定。
- 对于大x （ x > = 2 ^ 24）， x看起来像这样：[24位，M 0]。将它乘以A ，其大小是2/pi的前M位，将得到一个整数（ x的零将只将整数转换为整数）。
- 选择C从2/pi的M+d位开始将导致乘积x*C的大小最多为d-24 。在双精度中， d被选择为174（而不是24，我们有53），因此产品的大小最多为121.在single ，选择d使得d-24 <= 92就足够了，或者更准确地说， d-24 <= fdigits+31 。也就是说， d可以选择为fdigits + 55，或最多116。
- 结果， B的大小最多为116位。

因此，我们有两个问题：

计算fdigits 。这涉及从链接的论文中阅读参考文献6并理解它。可能不那么容易。 :)据我所知，这是使用nearpi.c的唯一地方。
计算B ， 2/pi的相关位。由于M低于127，我们可以计算离线2/pi的前127 + 116位并将它们存储在一个数组中。见维基百科。
计算y=x*B 这涉及将x乘以116位数。这是使用第3节的地方。块的大小选择为24，因为2 * 24 + 2（乘以两个24位数字，并且添加3个这样的数字）小于double的精度53（并且因为24除以96）。出于类似的原因，我们可以使用大小为11位的块进行single运算。

注意 - B的技巧仅适用于指数为正（x> = 2 ^ 24）的数字。

总结一下 - 首先，你必须用double精度解决问题。您的Matlab代码也不能以double精度工作（尝试删除single和计算sin(2^53) ，因为您的twooverpi只有53个有效位，而不是175（并且无论如何，您不能在Matlab中直接乘以这样的精确数字）其次，该方案应该适用于single工作，而且关键问题是准确地表示2/pi ，并支持高精度数字的乘法。最后，当一切正常时，你可以尝试找出一个更好的fdigits减少你必须存储和乘法的位数。

希望我没有完全离开 - 欢迎评论和矛盾。

例

举个例子，让我们计算sin(x) ，其中x = single(2^24-1) ，在有效位（ M = 0）之后没有零。这简化了发现B ，因为B由2/pi的前116位组成。由于x具有24位的精度和116位的B ，因此产品

 y = x * B

将根据需要具有92位精度。

链接文章的第3部分描述了如何以足够的精度执行此产品; 在我们的例子中，相同的算法可以与大小为11的块一起使用来计算y 。作为一个苦差事，我希望我没有明确地这样做，而是依靠Matlab的符号数学工具箱。这个工具箱为我们提供了vpa函数，它允许我们以十进制数字指定数字的精度。所以，

 vpa('2/pi', ceil(116*log10(2)))

将产生至少116位精度的2/pi的近似值。因为vpa只接受其精度参数的整数，所以我们通常不能精确地指定数字的二进制精度，所以我们使用次优的。

以下代码根据本文以single精度计算sin(x) ：

 x = single(2^24-1); y = x * vpa('2/pi', ceil(116*log10(2))); % Precision = 103.075 k = round(y); f = single(y - k); r = f * single(pi) / 2; switch mod(k, 4) case 0 s = sin(r); case 1 s = cos(r); case 2 s = -sin(r); case 3 s = -cos(r); end sin(x) - s % Expected value: exactly zero.

（ y的精度是使用Mathematica获得的，结果certificate它是比Matlab更好的数值工具:)）

在`libm`

这个问题的另一个答案（之后已被删除）引导我在libm实现，虽然它在双精度数字上工作，但是非常彻底地遵循链接的文章。

请参阅文件s_sin.c以获取包装器（链接纸中的表2显示为文件末尾的switch语句）， e_rem_pio2.c用于参数减少代码（特别感兴趣的是包含第一个396 hex-的数组 -数字2/pi ，从第69行开始）。

单精度浮点精度降低精度差

理论

例

在`libm`

NEON pack vector将结果比较到位图

C编译：collect2：错误：ld返回1退出状态

是`x！= x`是一种可以测试NaN的便携方法吗？

在C中，如何计算无符号数量的负数

整数转换（缩小，扩展），未定义的行为

计算保存在文本文件中的2个纬度和经度之间的距离？

保存霍夫曼代码的问题？

如何将Char转换为Float

用C ++编写的strtok（）模拟

在c上使用getchar（）在输入后获得’Enter’

单精度浮点精度降低精度差

理论

例

在libm

NEON pack vector将结果比较到位图

C编译：collect2：错误：ld返回1退出状态

是`x！= x`是一种可以测试NaN的便携方法吗？

在C中，如何计算无符号数量的负数

整数转换（缩小，扩展），未定义的行为

计算保存在文本文件中的2个纬度和经度之间的距离？

保存霍夫曼代码的问题？

如何将Char转换为Float

用C ++编写的strtok（）模拟

在c上使用getchar（）在输入后获得’Enter’

在`libm`