Efficient floating point comparison (Cortex-A8)

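I have a large (~100 000) array of floating point variables and a threshold (also floating point).

The problem is that I have to compare each value in the array with the threshold, but the NEON flags transfer takes a really long time (~20 cycles according to a profiler).

Is there an efficient way to compare these values?

NOTE: Since rounding error doesn't matter, I tried the following: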

    float arr[10000];
    float threshold;
    ....
    int a = arr[20]; // e.g.
    int t = threshold;
    if (t > a) {....}

But in this case I get the following instruction sequence:

    vldr.32        s0, [r0]
    vcvt.s32.f32   s0, s0
    vmov           r0, s0    <--- takes 20 cycles, as does `vmrs APSR_nzcv, fpscr` for a floating point comparison
    cmp            r0, r1

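Since the conversion happens on NEON, it makes no difference whether I compare integers in the way described above or compare floats directly.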

If floats are 32-bit IEEE-754, ints are 32-bit too, and there are no +infinity, -infinity or NaN values, we can compare floats as ints with a little trick:

    #include <stdio.h>
    #include <limits.h>
    #include <assert.h>

    #define C_ASSERT(expr) extern char CAssertExtern[(expr)?1:-1]
    C_ASSERT(sizeof(int) == sizeof(float));
    C_ASSERT(sizeof(int) * CHAR_BIT == 32);

    int isGreater(float* f1, float* f2)
    {
      int i1, i2, t1, t2;

      i1 = *(int*)f1;
      i2 = *(int*)f2;

      // Map sign-magnitude float bits to two's complement integers
      t1 = i1 >> 31;
      i1 = (i1 ^ t1) + (t1 & 0x80000001);

      t2 = i2 >> 31;
      i2 = (i2 ^ t2) + (t2 & 0x80000001);

      return i1 > i2;
    }

    int main(void)
    {
      float arr[9] = { -3, -2, -1.5, -1, 0, 1, 1.5, 2, 3 };
      float thr;
      int i;

      // Make sure floats are 32-bit IEEE-754 and
      // reinterpreted as integers as we want/expect
      {
        static const float testf = 8873283.0f;
        unsigned testi = *(unsigned*)&testf;
        assert(testi == 0x4B076543);
      }

      thr = -1.5;
      for (i = 0; i < 9; i++)
      {
        printf("%f %s %f\n", arr[i], "<=\0> " + 3*isGreater(&arr[i], &thr), thr);
      }

      thr = 1.5;
      for (i = 0; i < 9; i++)
      {
        printf("%f %s %f\n", arr[i], "<=\0> " + 3*isGreater(&arr[i], &thr), thr);
      }

      return 0;
    }

Output:

    -3.000000 <= -1.500000
    -2.000000 <= -1.500000
    -1.500000 <= -1.500000
    -1.000000 > -1.500000
    0.000000 > -1.500000
    1.000000 > -1.500000
    1.500000 > -1.500000
    2.000000 > -1.500000
    3.000000 > -1.500000
    -3.000000 <= 1.500000
    -2.000000 <= 1.500000
    -1.500000 <= 1.500000
    -1.000000 <= 1.500000
    0.000000 <= 1.500000
    1.000000 <= 1.500000
    1.500000 <= 1.500000
    2.000000 > 1.500000
    3.000000 > 1.500000

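Of course, if your threshold doesn't change, it makes sense to precompute the final integer value that isGreater() derives from it and uses in the comparison operator.

If you fear undefined behavior in C/C++ in the above code (the pointer casts violate strict aliasing), you can rewrite the code in assembly.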
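The two remarks above (precompute the threshold's integer value, avoid the aliasing casts) can be combined in plain C. The following is only a sketch under the same assumptions as the answer (32-bit IEEE-754 floats, no infinities or NaNs); the names `float_order_key` and `count_greater` are hypothetical and do not come from the original code. It uses `memcpy` for the reinterpretation, which compilers typically turn into a plain register move.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(int32_t), "float must be 32 bits wide");

/* Hypothetical helper: map a finite float to an int32_t key whose signed
   ordering matches the float ordering. -0.0f and +0.0f both map to 0. */
int32_t float_order_key(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);            /* well-defined reinterpretation */
    uint32_t magnitude = bits & 0x7FFFFFFFu;   /* drop the sign bit */
    /* For negative floats a larger magnitude means a smaller value. */
    return (bits & 0x80000000u) ? -(int32_t)magnitude : (int32_t)magnitude;
}

/* Count the elements strictly greater than threshold.
   The threshold key is computed once, outside the loop. */
size_t count_greater(const float *arr, size_t n, float threshold)
{
    const int32_t tkey = float_order_key(threshold);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (float_order_key(arr[i]) > tkey)
            count++;
    }
    return count;
}
```

With the nine-element array from the example above, `count_greater(arr, 9, -1.5f)` and `count_greater(arr, 9, 1.5f)` should agree with the number of `>` lines in the printed output for each threshold.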

If your data is float, then you should do your comparisons with floats, e.g.

    float arr[10000];
    float threshold;
    ....
    float a = arr[20]; // e.g.
    if (threshold > a) {....}

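Otherwise you will incur the overhead of expensive float-to-int conversions.

Your example shows how bad compiler-generated code can be:

It loads a value with NEON just to convert it to int, then does a NEON->ARM transfer that causes a pipeline flush, wasting 11~14 cycles.

The best solution would be to write the function entirely in hand-written assembly.

However, there is a simple trick that allows fast float comparisons without typecasting and truncation: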

Threshold positive (exactly as fast as int comparison):

    void example(float * pSrc, float threshold, unsigned int count)
    {
        typedef union {
            int ival;
            unsigned int uval;
            float fval;
        } unitype;

        unitype v, t;
        if (count==0) return;
        t.fval = threshold;
        do {
            v.fval = *pSrc++;
            if (v.ival < t.ival) {
                // your code here
            }
            else {
                // your code here (optional)
            }
        } while (--count);
    }

Threshold negative (1 cycle more per value than int comparison):

    void example(float * pSrc, float threshold, unsigned int count)
    {
        typedef union {
            int ival;
            unsigned int uval;
            float fval;
        } unitype;

        unitype v, t, temp;
        if (count==0) return;
        t.fval = threshold;
        t.uval &= 0x7fffffff;   // absolute value of the threshold
        do {
            v.fval = *pSrc++;
            temp.uval = v.uval ^ 0x80000000;   // flip the sign of the value
            if (temp.ival >= t.ival) {
                // your code here
            }
            else {
                // your code here (optional)
            }
        } while (--count);
    }
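The two variants above can be folded into one routine that picks the loop from the sign of the threshold. This is a rough sketch rather than the answer's own code: the name `count_below` is hypothetical, it counts values strictly less than the threshold in both branches, and it assumes 32-bit IEEE-754 floats with no infinities or NaNs. Reading a union member other than the one last written is permitted in C but not in C++.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef union {
    int32_t  ival;
    uint32_t uval;
    float    fval;
} unitype;

static_assert(sizeof(unitype) == 4, "float must be 32 bits wide");

/* Hypothetical helper: count the elements strictly less than threshold
   using integer comparisons only. */
size_t count_below(const float *src, size_t count, float threshold)
{
    unitype v, t, temp;
    size_t below = 0;

    t.fval = threshold;
    if (t.ival > 0) {
        /* Positive threshold: the raw bit patterns compare like the floats,
           and every negative value has a negative bit pattern. */
        for (size_t i = 0; i < count; i++) {
            v.fval = src[i];
            if (v.ival < t.ival)
                below++;
        }
    } else {
        /* Zero or negative threshold: compare -value against |threshold|. */
        t.uval &= 0x7fffffffu;
        for (size_t i = 0; i < count; i++) {
            v.fval = src[i];
            temp.uval = v.uval ^ 0x80000000u;   /* flip the sign bit */
            if (temp.ival > t.ival)
                below++;
        }
    }
    return below;
}
```

Routing a zero threshold through the second loop is a deliberate choice here: in the first loop the bit pattern of -0.0f would compare as less than +0.0f, which disagrees with the floating point comparison.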

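I think it's considerably faster than the accepted answer above. Then again, I'm a little late to the party.

If rounding errors don't matter, then you should use std::lrint.

"Faster Floating Point to Integer Conversions" recommends using it for float-to-int conversions.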
