SSE vectorization of the math 'pow' function in gcc

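I am trying to vectorize a loop that uses the 'pow' function from the math library. I know the Intel compiler supports using 'pow' with SSE instructions, but I can't seem to get it to work with gcc (I think). This is the case I am working with: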

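    #include <math.h>

    int main(){
        int i=0;
        float a[256], b[256];
        float x= 2.3;

        for (i =0 ; i<256; i++){
            a[i]=1.5;
        }

        /* loop 1: calls pow() from the math library */
        for (i=0; i<256; i++){
            b[i]=pow(a[i],x);
        }

        /* loop 2: plain multiplication */
        for (i=0; i<256; i++){
            b[i]=a[i]*a[i];
        }

        return 0;
    }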

I am compiling with the following:

    gcc -O3 -Wall -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5 code.c -o runthis

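This is on OS X 10.5.8 using gcc version 4.2 (I also tried 4.5 and couldn't tell whether it had vectorized anything, because it printed no output at all). None of the loops appear to vectorize. Is there an alignment issue, or some other issue for which I need to use restrict? If I write one of the loops as a function, I get slightly more verbose output (code):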

    void pow2(float *a, float * b, int n) {
        int i;
        for (i=0; i<n; i++){
            b[i]=a[i]*a[i];
        }
    }

Output (using verbosity level 7):

    note: not vectorized: can't determine dependence between *D.2878_13 and *D.2877_8 bad data dependence.

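I looked at the gcc auto-vectorization page, but it did not help much. If it is not possible to use pow with gcc, where can I find resources for writing a pow-equivalent function (I am mostly dealing with integer powers)?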

Edit: So I was just digging into some other source. How did it vectorize this?!:

    void array_op(double * d,int len,double value,void (*f)(double*,double*) ) {
        for ( int i = 0; i < len; i++ ){
            f(&d[i],&value);
        }
    };

The relevant gcc output:

    note: Profitability threshold is 3 loop iterations.
    note: LOOP VECTORIZED.

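So now I am at a loss: 'd' and 'value' are modified by a function that gcc knows nothing about. Strange? Maybe I need to test this part more thoroughly to make sure the results of the vectorized portion are correct. I am still looking for a vectorized math library. Why aren't there any open-source ones?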
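On the integer-power part of the question: when the exponent is a non-negative integer, one possible approach is to skip pow entirely and use repeated squaring, which leaves only multiplies in the loop body. The sketch below is an illustration, not something taken from gcc or a math library; the names powi_f and pow_int_array are hypothetical, and whether a given gcc version actually vectorizes the outer loop still has to be checked with the vectorizer report.

```c
/* Hypothetical helper: x raised to a non-negative integer power e by
 * repeated squaring, so about log2(e) multiplies replace one pow() call. */
static inline float powi_f(float x, unsigned e)
{
    float result = 1.0f;
    while (e != 0) {
        if (e & 1u) {
            result *= x;   /* this bit of the exponent is set */
        }
        x *= x;            /* next power of two of the base */
        e >>= 1;
    }
    return result;
}

/* Hypothetical wrapper: b[i] = a[i]^e for n elements.
 * __restrict (a gcc/clang extension) asserts that a and b do not overlap. */
void pow_int_array(const float *__restrict a, float *__restrict b,
                   int n, unsigned e)
{
    int i;
    for (i = 0; i < n; i++) {
        b[i] = powi_f(a[i], e);
    }
}
```

Calling pow_int_array(a, b, 256, 3) would fill b with the cubes of a. For a small exponent that is fixed at compile time, writing the product out by hand (a[i]*a[i]*a[i]) is simpler still.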

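Using __restrict, or consuming the inputs (assigning them to local variables) before writing the outputs, should help.

As it stands, the compiler cannot vectorize because a might alias b, so doing 4 multiplies in parallel and writing back 4 values might not be correct.

(Note that __restrict does not guarantee that the compiler will vectorize, but this much can be said: as the code stands now, it certainly cannot.)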
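A minimal sketch of both suggestions applied to the pow2 function from the question follows. The function names pow2_restrict and pow2_locals are made up for illustration, and __restrict is the gcc/clang spelling (C99 also accepts restrict). Neither variant is guaranteed to vectorize on a particular gcc version; the -ftree-vectorizer-verbose output is still the way to confirm.

```c
/* Variant 1: __restrict is a promise from the caller that a and b never
 * overlap, which removes the dependence the vectorizer reported.
 * Calling this with overlapping arrays is undefined behaviour. */
void pow2_restrict(const float *__restrict a, float *__restrict b, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        b[i] = a[i] * a[i];
    }
}

/* Variant 2: consume the inputs before writing the outputs. Four values are
 * loaded into locals first, so no store can land between the loads. If a and
 * b did overlap, the result would differ from the one-at-a-time loop. */
void pow2_locals(const float *a, float *b, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float x0 = a[i], x1 = a[i + 1], x2 = a[i + 2], x3 = a[i + 3];
        b[i]     = x0 * x0;
        b[i + 1] = x1 * x1;
        b[i + 2] = x2 * x2;
        b[i + 3] = x3 * x3;
    }
    for (; i < n; i++) {   /* leftover elements when n is not a multiple of 4 */
        b[i] = a[i] * a[i];
    }
}
```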

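This is not really an answer to your question; rather, it is a suggestion for how you might avoid the problem entirely.

You mention that you are on OS X; that platform already has APIs that provide the operations you are looking at, with no need for auto-vectorization. Is there some reason you are not using them? Auto-vectorization is really cool, but it takes some work, and in general it does not produce results as good as using APIs that have already been vectorized for you.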

    #include <string.h>
    #include <Accelerate/Accelerate.h>

    int main() {
        int n = 256;
        float a[256], b[256];

        // You can initialize the elements of a vector to a set value using memset_pattern:
        float threehalves = 1.5f;
        memset_pattern4(a, &threehalves, 4*n);

        // Since you have a fixed exponent for all of the base values, we will use
        // the vImage gamma functions. If you wanted to have different exponents
        // for each input (ie from an array of exponents), you would use the vForce
        // vvpowf( ) function instead (also part of Accelerate).
        //
        // If you don't need full accuracy, replace kvImageGamma_UseGammaValue with
        // kvImageGamma_UseGammaValue_half_precision to get better performance.
        GammaFunction func = vImageCreateGammaFunction(2.3f, kvImageGamma_UseGammaValue, 0);
        vImage_Buffer src = { .data = a, .height = 1, .width = n, .rowBytes = 4*n };
        vImage_Buffer dst = { .data = b, .height = 1, .width = n, .rowBytes = 4*n };
        vImageGamma_PlanarF(&src, &dst, func, 0);
        vImageDestroyGammaFunction(func);

        // To simply square a instead, use the vDSP_vsq function.
        vDSP_vsq(a, 1, b, 1, n);

        return 0;
    }

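More generally, unless your algorithm is fairly simple, auto-vectorization is unlikely to deliver great results. In my experience, the spectrum of vectorization techniques usually looks something like this: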

    better performance                                            worse performance
    more effort                                                         less effort
    +------+------+----------------------+----------------------------+-----------+
    |      |      |                      |                            |           |
    |      |      use vectorized APIs    |                   auto vectorization   |
    |      skilled vector C              |                              scalar code
    hand written assembly                unskilled vector C
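To make the "vector C" end of that spectrum concrete, here is a rough sketch of the squaring loop written by hand with SSE intrinsics. It is only an illustration: the function name square_sse is hypothetical, unaligned loads and stores are used so that no alignment assumption is needed, and a scalar fallback is compiled when SSE is not available.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps */
#endif

/* Hypothetical example: b[i] = a[i] * a[i] for n floats. */
void square_sse(const float *a, float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Four floats per iteration; loadu/storeu tolerate unaligned pointers. */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(a + i);
        _mm_storeu_ps(b + i, _mm_mul_ps(v, v));
    }
#endif
    /* Scalar tail, which is also the whole loop when SSE is unavailable. */
    for (; i < n; i++) {
        b[i] = a[i] * a[i];
    }
}
```

Even for a loop this small, the hand-written version has to deal with the tail elements and with alignment itself, which is part of the extra effort the diagram refers to.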
