Tag: auto-vectorization

How do I tell GCC that there is no pointer aliasing for loop auto-vectorization? (restrict does not work): I'm having trouble getting GCC to vectorize this loop: register int_fast8_t __attribute__ ((aligned)) * restrict fillRow = __builtin_assume_aligned(rowMaps + query[i]*rowLen,8); register int __attribute__ ((aligned (16))) *restrict curRow = __builtin_assume_aligned(scoreMatrix + i*rowLen,16), __attribute__ ((aligned (16))) *restrict prevRow = __builtin_assume_aligned(curRow - rowLen,16); register unsigned __attribute__ ((aligned (16))) *restrict shiftCur = __builtin_assume_aligned(shiftMatrix + i*rowLen,16), __attribute__ ((aligned (16))) *restrict shiftPrev = __builtin_assume_aligned(shiftCur - rowLen,16); unsigned j; unsigned […]

How to help gcc vectorize C code: I have the following C code. The first part just reads a matrix of complex numbers, called M, from standard input. The interesting part is the second part. #include #include #include #include #include int main() { int n, m, c, d; float re, im; scanf("%d %d", &n, &m); assert(n==m); complex float M[n][n]; for(c=0; c<n; c++) { for(d=0; d<n; d++) { scanf("%f%fi", &re, &im); M[c][d] = re + im * I; } } for(c=0; c<n; c++) { for(d=0; d<n; d++) { […]

Understanding gcc 4.9.2 auto-vectorization output: I am trying to learn the gcc auto-vectorization module. After reading the documentation from here, this is what I tried (debian jessie amd64): $ cat ex1.c int a[256], b[256], c[256]; foo () { int i; for (i=0; i<256; i++){ a[i] = b[i] + c[i]; } } Then I simply ran: $ gcc -xc -Ofast -msse2 -c -ftree-vectorize -fopt-info-vec-missed ex1.c ex1.c:5:3: note: misalign = 0 bytes of ref b[i_11] ex1.c:5:3: note: misalign = 0 bytes of ref […]

Why doesn't gcc autovectorization work for a 3×3 convolution matrix?: I have implemented the following program for a convolution matrix: #include #include #define NUM_LOOP 1000 #define N 128 //input or output dimension 1 #define M N //input or output dimension 2 #define P 5 //convolution matrix dimension 1; if you want a 3×3 convolution matrix it must be 3 #define Q P //convolution matrix dimension 2 #define Csize P*Q #define Cdiv 1 //div for filter #define […]

Unrolling a loop and doing independent sums with vectorization: For the following loop, GCC will only vectorize it if I tell it to use associative math, e.g. with -Ofast. float sumf(float *x) { x = (float*)__builtin_assume_aligned(x, 64); float sum = 0; for(int i=0; i<2048; i++) sum += x[i]; return sum; } Here is the assembly with -Ofast -mavx: sumf(float*): vxorps %xmm0, %xmm0, %xmm0 leaq 8192(%rdi), %rax .L2: vaddps (%rdi), %ymm0, %ymm0 addq $32, %rdi cmpq %rdi, %rax jne .L2 vhaddps %ymm0, %ymm0, %ymm0 vhaddps %ymm0, %ymm0, %ymm1 […]
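The "independent sums" idea in the title can be written out by hand. The following is a minimal sketch, not the asker's code: the name `sum_independent` and the choice of four accumulators are assumptions made for illustration. It reorders the additions explicitly in the source, so the result may differ in rounding from the strict left-to-right sum. Whether a given GCC version then vectorizes it without associative-math flags is not guaranteed.

```c
#include <stddef.h>

/* Hypothetical helper: sums n floats using four independent partial sums.
 * Each partial sum only depends on its own previous value, so the four
 * additions in the loop body do not form one long dependency chain. */
float sum_independent(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    /* Combine the partial sums, then handle the remaining 0..3 elements. */
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        sum += x[i];
    return sum;
}
```

Because the summation order is fixed by the source here, the compiler does not need permission to reassociate floating-point additions in order to keep the four sums separate.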

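Sum of overlapping arrays, auto-vectorization, and restrict: Ars Technica recently had an article on why some programming languages are faster than others. It compares Fortran and C and mentions summing arrays. In Fortran, arrays are assumed not to overlap, which allows further optimization. In C/C++, pointers to the same type may overlap, so this optimization cannot be used in general. However, in C/C++ one can use the restrict or __restrict keyword to tell the compiler not to assume that the pointers overlap. So I started looking into auto-vectorization. The following code vectorizes in both GCC and MSVC: void dot_int(int *a, int *b, int *c, int n) { for(int i=0; i<n; i++) { c[i] = a[i] + b[i]; } } I tested it with and without overlapping arrays and it gets the correct result. However, the way I manually vectorized the loop using SSE does not handle overlapping arrays. int i=0; for(; i<n-3; i+=4) { __m128i a4 = _mm_loadu_si128((__m128i*)&a[i]); __m128i b4 = _mm_loadu_si128((__m128i*)&b[i]); __m128i c4 = […]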
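To make the overlap question concrete, here is a small sketch with two versions of the excerpt's loop. The names `add_int` and `add_int_restrict` are hypothetical and do not appear in the question. The plain version must give the element-by-element result even when the output overlaps an input, while the `restrict` version promises the compiler that this never happens, so calling it with overlapping arrays is undefined behavior.

```c
#include <stddef.h>

/* No restrict: the compiler must preserve the scalar, in-order semantics,
 * so a store to c[i] may legitimately change a later a[i+1] or b[i+1]. */
void add_int(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* With restrict: the caller guarantees that c does not overlap a or b.
 * The compiler may then load several elements before storing any of them. */
void add_int_restrict(const int *restrict a, const int *restrict b,
                      int *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

For example, with `int buf[5] = {1, 1, 1, 1, 1}` the call `add_int(buf, buf, buf + 1, 4)` feeds each stored value into the next iteration and leaves `buf` as `{1, 2, 4, 8, 16}`. A hand-written SIMD loop that loads four elements before storing would not reproduce that, which matches the behavior the question describes.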

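Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?: I have this piece of code, which segfaults when run on Ubuntu 14.04 on an AMD64-compatible CPU: #include #include #include int main() { uint32_t sum = 0; uint8_t *buffer = mmap(NULL, 1<<18, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); uint16_t *p = (buffer + 1); int i; for (i=0;i<14;++i) { //printf("%d\n", i); sum += p[i]; } return sum; } It only segfaults if the memory is allocated using mmap. If I use malloc, a buffer on the stack, or a global variable, it does not segfault. If I reduce the number of loop iterations to anything less than 14, it no longer segfaults. If I print the array index from inside the loop, it also no longer segfaults. Why does an unaligned memory access segfault on a CPU that is able to access unaligned addresses, and why only under these specific circumstances?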
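In C, reading through a `uint16_t *` that is not suitably aligned is undefined behavior, so one plausible explanation (an assumption, since the excerpt does not include an answer) is that the vectorized loop uses instructions that require alignment. A common portable way to read from a possibly misaligned address is to copy the bytes with `memcpy`. The sketch below uses the hypothetical name `sum_u16_unaligned`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sums `count` 16-bit values stored back to back
 * starting at `bytes`, which may have any alignment. Each value is read
 * in the machine's native byte order. */
uint32_t sum_u16_unaligned(const uint8_t *bytes, size_t count)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        uint16_t v;
        /* memcpy has no alignment requirement on its source pointer. */
        memcpy(&v, bytes + 2 * i, sizeof v);
        sum += v;
    }
    return sum;
}
```

In the excerpt's terms this would be called as `sum_u16_unaligned(buffer + 1, 14)` instead of indexing through `p`. Compilers typically turn the fixed-size `memcpy` into an ordinary load, but that is an optimization detail rather than a guarantee.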