单精度浮点精度降低精度差
我正在尝试将范围缩减作为实现正弦函数的第一步。
我遵循KC NG的论文“减少巨大论据”中描述的方法
当使用0到20000的x的输入范围时,我得到的误差大到0.002339146.我的错误显然不应该那么大,我不知道如何减少它。 我注意到误差幅度与输入θ幅度与余弦/正弦相关。
我能够获得本文提到的nearpi.c代码,但我不确定如何将代码用于单精度浮点。 如果有人有兴趣,可以在以下链接找到nearpi.c文件: nearpi.c
这是我的MATLAB代码:
x = 0:0.1:20000; % Perform range reduction % Store constant 2/pi twooverpi = single(2/pi); % Compute y y = (x.*twooverpi); % Compute k (round to nearest integer k = round(y); % Solve for f f = single(yk); % Solve for r r = single(f*single(pi/2)); % Find last two bits of k n = bitand(fi(k,1,32,0),fi(3,1,32,0)); n = single(n); % Preallocate for speed z(length(x)) = 0; for i = 1:length(x) switch(n(i)) case 0 z(i)=sin(r(i)); case 1 z(i) = single(cos(r(i))); case 2 z(i) = -sin(r(i)); case 3 z(i) = single(-cos(r(i))); otherwise end end maxerror = max(abs(single(z - single(sin(single(x)))))) minerror = min(abs(single(z - single(sin(single(x))))))
我编辑了nearpi.c程序,以便编译。 但是我不确定如何解释输出。 此外,文件需要输入,我必须手动输入,我也不确定输入的重要性。
这是工作nearpi.c:
/* ============================================================================ Name : nearpi.c Author : Version : Copyright : Your copyright notice Description : Hello World in C, Ansi-style ============================================================================ */ #include #include #include /* * Global macro definitions. */ # define hex( double ) *(1 + ((long *) &double)), *((long *) &double) # define sgn(a) (a >= 0 ? 1 : -1) # define MAX_k 2500 # define D 56 # define MAX_EXP 127 # define THRESHOLD 2.22e-16 /* * Global Variables */ int CFlength, /* length of CF including terminator */ binade; double e, f; /* [e,f] range of D-bit unsigned int of f; form 1X...X */ // Function Prototypes int dbleCF (double i[], double j[]); void input (double i[]); void nearPiOver2 (double i[]); /* * This is the start of the main program. */ int main (void) { int k; /* subscript variable */ double i[MAX_k], j[MAX_k]; /* i and j are continued fractions (coeffs) */ // fp = fopen("/src/cfpi.txt", "r"); /* * Compute global variables e and f, where * * e = 2 ^ (D-1), ie the D bit number 10...0 * and * f = 2 ^ D - 1, ie the D bit number 11...1 . */ e = 1; for (k = 2; k <= D; k = k + 1) e = 2 * e; f = 2 * e - 1; /* * Compute the continued fraction for (2/e)/(pi/2) , ie * q's starting value for the first binade, given the continued * fraction for pi as input; set the global variable CFlength * to the length of the resulting continued fraction (including * its negative valued terminator). One should use as many * partial coefficients of pi as necessary to resolve numbers * of the width of the underflow plus the overflow threshold. * A rule of thumb is 0.97 partial coefficients are generated * for every decimal digit of pi . * * Note: for radix B machines, subroutine input should compute * the continued fraction for (B/e)/(pi/2) where e = B ^ (D - 1). */ input (i); /* * Begin main loop over all binades: * For each binade, find the nearest multiples of pi/2 in that binade. * * [ Note: for hexadecimal machines ( B = 16 ), the rest of the main * program simplifies(!) to * * B_ade = 1; * while (B_ade < MAX_EXP) * { * dbleCF (i, j); * dbleCF (j, i); * dbleCF (i, j); * CFlength = dbleCF (j, i); * B_ade = B_ade + 1; * } * } * * because the alternation of source & destination are no longer necessary. ] */ binade = 1; while (binade = MAX_EXP) break; /* * For the current (even) binade, find the nearest multiples of pi/2. */ nearPiOver2 (j); /* * Double the continued fraction to get to the next (odd) binade. */ CFlength = dbleCF (j, i); binade = binade + 1; } return 0; } /* end of Main Program */ /* * Subroutine DbleCF doubles a continued fraction whose partial * coefficients are i[] into a continued fraction j[], where both * arrays are of a type sufficient to do D-bit integer arithmetic. * * In my case ( D = 56 ) , I am forced to treat integers as double * precision reals because my machine does not have integers of * sufficient width to handle D-bit integer arithmetic. * * Adapted from a Basic program written by W. Kahan. * * Algorithm based on Hurwitz's method of doubling continued * fractions (see Knuth Vol. 3, p.360). * * A negative value terminates the last partial quotient. * * Note: for the non-C programmers, the statement break * exits a loop and the statement continue skips to the next * case in the same loop. * * The call modf ( l / 2, &l0 ) assigns the integer portion of * half of L to L0. */ int dbleCF (double i[], double j[]) { double k, l, l0, j0; int n, m; n = 1; m = 0; j0 = i[0] + i[0]; l = i[n]; while (1) { if (l 0) { j[m] = j0; j[m + 1] = l0; j0 = 0; m = m + 2; }; if (l == 0) { /* * Even case. */ if (k < 0) { m = m - 1; break; } else { j0 = j0 + k + k; n = n + 2; l = i[n]; continue; }; } /* * Odd case. */ if (k = 0); /* * Double the continued fraction for pi D-3 times using * i and j alternately as source and destination. On my * machine D = 56 so D-3 is odd; hence the following code: * * Double twice (D-3)/2 times, */ for (k = 1; k = 0); /* * Return the length of the continued fraction, including its * terminator and initial zero, in the global variable CFlength. */ CFlength = k; } /* * Given a continued fraction's coefficients in an array i , * subroutine nearPiOver2 finds all machine representable * values near a integer multiple of pi/2 in the current binade. */ void nearPiOver2 (double i[]) { int k, /* subscript for recurrences (see handout) */ K; /* like k , but used during cancel. elim. */ double p[MAX_k], /* product of the q's (see handout) */ q[MAX_k], /* successive tail evals of CF (see handout) */ j[MAX_k], /* like convergent numerators (see handout) */ tmp, /* temporary used during cancellation elim. */ mk0, /* m[k - 1] (see handout) */ mk, /* m[k] is one of the few ints (see handout) */ mkAbs, /* absolute value of m sub k */ mK0, /* like mk0 , but used during cancel. elim. */ mK, /* like mk , but used during cancel. elim. */ z, /* the object of our quest (the argument) */ m0, /* the mantissa of z as a D-bit integer */ x, /* the reduced argument (see handout) */ ldexp (), /* sys routine to multiply by a power of two */ fabs (), /* sys routine to compute FP absolute value */ floor (), /* sys routine to compute greatest int = value */ /* * Compute the q's by evaluating the continued fraction from * bottom up. * * Start evaluation with a big number in the terminator position. */ q[CFlength] = 1.0 + 30; for (k = CFlength - 1; k >= 0; k = k - 1) q[k] = i[k] + 1 / q[k + 1]; /* * Let THRESHOLD be the biggest | x | that we are interesed in * seeing. * * Compute the p's and j's by the recurrences from the top down. * * Stop when * * 1 1 * ----- >= THRESHOLD > ------ . * 2 |j | 2 |j | * k k+1 */ p[0] = 1; j[0] = 0; j[1] = 1; k = 0; do { p[k + 1] = -q[k + 1] * p[k]; if (k > 0) j[1 + k] = j[k - 1] - i[k] * j[k]; k = k + 1; } while (1 / (2 * fabs (j[k])) >= THRESHOLD); /* * Then mk runs through the integers between * * k + k + * (-1) e / p - 1/2 & (-1) f / p - 1/2 . * kk */ for (mkAbs = floor (e / fabs (p[k])); mkAbs <= ceil (f / fabs (p[k])); mkAbs = mkAbs + 1) { mk = mkAbs * sgn (p[k]); /* * For each mk , mk0 runs through integers between * * + * mq - p THRESHOLD . * kkk */ for (mk0 = floor (mk * q[k] - fabs (p[k]) * THRESHOLD); mk0 <= ceil (mk * q[k] + fabs (p[k]) * THRESHOLD); mk0 = mk0 + 1) { /* * For each pair { mk , mk0 } , check that * * k * m = (-1) ( jm - jm ) * 0 k-1 kk k-1 */ m0 = (k & 1 ? -1 : 1) * (j[k - 1] * mk - j[k] * mk0); /* * lies between e and f . */ if (e <= fabs (m0) && fabs (m0) 0) { p[K + 1] = -q[K + 1] * p[K]; tmp = mK0 - i[K] * mK; if (fabs (tmp) > fabs (mK0)) break; mK0 = mK; mK = tmp; K = K + 1; }; /* * Then * x = ( mq - m ) / p * KK K-1 K * * as accurately as one could hope. */ x = (mK * q[K] - mK0) / p[K]; /* * To return z and m0 as positive numbers, * x must take the sign of m0 . */ x = x * sgn (m0); m0 = fabs (m0); /*d * Set z = m0 * 2 ^ (binade+1-D) . */ z = ldexp (m0, binade + 1 - D); /* * Print z (hex), z (dec), m0 (dec), binade+1-D, x (hex), x (dec). */ printf ("%08lx %08lx Z=%22.16EM=%17.17G L+1-%d=%3d %08lx %08lx x=%23.16E\n", hex (z), z, m0, D, binade + 1 - D, hex (x), x); } } } }
理论
首先让我们注意使用单精度算术的差异。
-
[等式8]
f
的最小值可以更大。 由于双精度数是单精度数的超集,因此最接近2/pi
的倍数的single
只能比~2.98e-19
更远,因此固定算术表示中的前导零的数量f
必须至多为61个前导零(但可能会更少)。 表示此数量fdigits
。 -
[9之前的等式]因此,代替121位,
y
必须精确到fdigits
+ 24(单精度中的非零有效位)fdigits
(额外保护位)=fdigits
+ 31,并且最多92。 -
[等式9]“因此,与
x
的指数的宽度一起,2/pi
必须包含127(single
最大指数)+ 31 +fdigits
,或158 +fdigits
和至多219位。 -
[2.5小节]
A
的大小由二进制点之前的x
的零的数量决定(并且不受移动到single
),而C
的大小由9之前的等式确定。- 对于大
x
(x
> = 2 ^ 24),x
看起来像这样:[24位,M 0]。 将它乘以A
,其大小是2/pi
的前M
位,将得到一个整数(x
的零将只将整数转换为整数)。 - 选择
C
从2/pi
的M+d
位开始将导致乘积x*C
的大小最多为d-24
。 在双精度中,d
被选择为174(而不是24,我们有53),因此产品的大小最多为121.在single
,选择d
使得d-24 <= 92
就足够了,或者更准确地说,d-24 <= fdigits+31
。 也就是说,d
可以选择为fdigits
+ 55,或最多116。 - 结果,
B
的大小最多为116位。
- 对于大
因此,我们有两个问题:
- 计算
fdigits
。 这涉及从链接的论文中阅读参考文献6并理解它。 可能不那么容易。 :)据我所知,这是使用nearpi.c
的唯一地方。 - 计算
B
,2/pi
的相关位。 由于M
低于127,我们可以计算离线2/pi
的前127 + 116位并将它们存储在一个数组中。 见维基百科 。 - 计算
y=x*B
这涉及将x
乘以116位数。 这是使用第3节的地方。 块的大小选择为24,因为2 * 24 + 2(乘以两个24位数字,并且添加3个这样的数字)小于double
的精度53(并且因为24除以96)。 出于类似的原因,我们可以使用大小为11位的块进行single
运算。
注意 - B
的技巧仅适用于指数为正(x> = 2 ^ 24)的数字。
总结一下 - 首先,你必须用double
精度解决问题。 您的Matlab
代码也不能以double
精度工作(尝试删除single
和计算sin(2^53)
,因为您的twooverpi
只有53个有效位,而不是175(并且无论如何,您不能在Matlab中直接乘以这样的精确数字)其次,该方案应该适用于single
工作,而且关键问题是准确地表示2/pi
,并支持高精度数字的乘法。最后,当一切正常时,你可以尝试找出一个更好的fdigits
减少你必须存储和乘法的位数。
希望我没有完全离开 - 欢迎评论和矛盾。
例
举个例子,让我们计算sin(x)
,其中x = single(2^24-1)
,在有效位( M
= 0)之后没有零。 这简化了发现B
,因为B
由2/pi
的前116位组成。 由于x
具有24位的精度和116位的B
,因此产品
y = x * B
将根据需要具有92位精度。
链接文章的第3部分描述了如何以足够的精度执行此产品; 在我们的例子中,相同的算法可以与大小为11的块一起使用来计算y
。 作为一个苦差事,我希望我没有明确地这样做,而是依靠Matlab
的符号数学工具箱。 这个工具箱为我们提供了vpa
函数,它允许我们以十进制数字指定数字的精度。 所以,
vpa('2/pi', ceil(116*log10(2)))
将产生至少116位精度的2/pi
的近似值。 因为vpa
只接受其精度参数的整数,所以我们通常不能精确地指定数字的二进制精度,所以我们使用次优的。
以下代码根据本文以single
精度计算sin(x)
:
x = single(2^24-1); y = x * vpa('2/pi', ceil(116*log10(2))); % Precision = 103.075 k = round(y); f = single(y - k); r = f * single(pi) / 2; switch mod(k, 4) case 0 s = sin(r); case 1 s = cos(r); case 2 s = -sin(r); case 3 s = -cos(r); end sin(x) - s % Expected value: exactly zero.
( y
的精度是使用Mathematica
获得的,结果certificate它是比Matlab
更好的数值工具:))
在libm
这个问题的另一个答案(之后已被删除)引导我在libm
实现,虽然它在双精度数字上工作,但是非常彻底地遵循链接的文章。
请参阅文件s_sin.c以获取包装器(链接纸中的表2显示为文件末尾的switch
语句), e_rem_pio2.c用于参数减少代码(特别感兴趣的是包含第一个396 hex-的数组 -数字2/pi
,从第69行开始)。