How to rewrite this code with SSE intrinsics

I'm new to SSE intrinsics and would appreciate some hints on using them (this is still hazy to me).

I have code like this:

    /* The loop header and the xc0 line were garbled in the post; they are
       reconstructed from the yc0 line and the unroll by 4, and n is the
       bound used in the answer's loop. */
    for(int k=0; k<n; k+=4)
    {
        int xc0 = 512 + ((idx + k*iddx)>>6);
        int yc0 = 512 + ((idy + k*iddy)>>6);
        int xc1 = 512 + ((idx + (k+1)*iddx)>>6);
        int yc1 = 512 + ((idy + (k+1)*iddy)>>6);
        int xc2 = 512 + ((idx + (k+2)*iddx)>>6);
        int yc2 = 512 + ((idy + (k+2)*iddy)>>6);
        int xc3 = 512 + ((idx + (k+3)*iddx)>>6);
        int yc3 = 512 + ((idy + (k+3)*iddy)>>6);

        unsigned color0 = working_buffer[yc0*working_buffer_size_x + xc0];
        unsigned color1 = working_buffer[yc1*working_buffer_size_x + xc1];
        unsigned color2 = working_buffer[yc2*working_buffer_size_x + xc2];
        unsigned color3 = working_buffer[yc3*working_buffer_size_x + xc3];

        int adr = base_adr + k;
        frame_bitmap[adr]   = color0;
        frame_bitmap[adr+1] = color1;
        frame_bitmap[adr+2] = color2;
        frame_bitmap[adr+3] = color3;
    }

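Everything here is int / unsigned, and this is the critical part of the loop. I'm not sure whether integer SSE would help with speed, but I'd like to know whether it would work at all. Can anyone help with this?

(I'm using mingw32)

My SSE is a bit rusty, but this is what you should do: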

    xmm0: [k, k+1, k+2, k+3]   // xc0, xc1, ...
    xmm1: [k, k+1, k+2, k+3]   // yc0, yc1, ...

    // initialize before the loop
    xmm2: [512, 512, 512, 512]
    xmm3: [idx, idx, idx, idx]
    xmm4: [iddx, iddx, iddx, iddx]
    xmm5: [idy, idy, idy, idy]
    xmm6: [iddy, iddy, iddy, iddy]
    xmm7: [working_buffer_size_x, working_buffer_size_x, working_buffer_size_x, working_buffer_size_x]

Compute:

    xmm0 * xmm4
    xmm0 + xmm3
    xmm0 >> 6
    xmm0 + xmm2
    xmm0: [xc0, xc1, xc2, xc3]
    ///////////////////////////////
    xmm1 * xmm6
    xmm1 + xmm5
    xmm1 >> 6
    xmm1 + xmm2
    xmm1: [yc0, yc1, yc2, yc3]
    xmm1 * xmm7
    xmm1 + xmm0

Now xmm1 is:

    xmm1: [yc0*working_buffer_size_x + xc0,
           yc1*working_buffer_size_x + xc1,
           yc2*working_buffer_size_x + xc2,
           yc3*working_buffer_size_x + xc3]

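You are reading from and writing to memory (the working_buffer and frame_bitmap arrays) on every iteration. Those accesses are much slower than the computation itself, so the speedup will not be as large as you might expect.

Edit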

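You need SSE4.1 (for _mm_mullo_epi32), and anything you access with aligned SSE loads or stores must be 16-byte aligned. In the code below that is the temporary array a; working_buffer and frame_bitmap would need the same alignment if you accessed them with aligned SSE loads or stores as well.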

    #include <emmintrin.h>
    #include <smmintrin.h> // SSE4.1

    int a[4] __attribute__((aligned(16)));
    __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;

    // loop-invariant constants, broadcast to all four lanes
    xmm2 = _mm_set1_epi32(512);
    xmm3 = _mm_set1_epi32(idx);
    xmm4 = _mm_set1_epi32(iddx);
    xmm5 = _mm_set1_epi32(idy);
    xmm6 = _mm_set1_epi32(iddy);
    xmm7 = _mm_set1_epi32(working_buffer_size_x);

    for(int k = 0; k <= n - 4; k += 4){
        // _mm_set_epi32 takes the highest lane first, so lane 0 holds k
        xmm0 = _mm_set_epi32(k + 3, k + 2, k + 1, k);
        xmm1 = _mm_set_epi32(k + 3, k + 2, k + 1, k);

        // xmm0 * xmm4
        xmm0 = _mm_mullo_epi32(xmm0, xmm4);
        // xmm0 + xmm3
        xmm0 = _mm_add_epi32(xmm0, xmm3);
        // xmm0 >> 6
        xmm0 = _mm_srai_epi32(xmm0, 6);
        // xmm0 + xmm2
        xmm0 = _mm_add_epi32(xmm0, xmm2);

        // xmm1 * xmm6
        xmm1 = _mm_mullo_epi32(xmm1, xmm6);
        // xmm1 + xmm5
        xmm1 = _mm_add_epi32(xmm1, xmm5);
        // xmm1 >> 6
        xmm1 = _mm_srai_epi32(xmm1, 6);
        // xmm1 + xmm2
        xmm1 = _mm_add_epi32(xmm1, xmm2);
        // xmm1 * xmm7
        xmm1 = _mm_mullo_epi32(xmm1, xmm7);
        // xmm1 + xmm0
        xmm1 = _mm_add_epi32(xmm1, xmm0);

        // a[0] = yc0*working_buffer_size_x + xc0
        // a[1] = yc1*working_buffer_size_x + xc1
        // a[2] = yc2*working_buffer_size_x + xc2
        // a[3] = yc3*working_buffer_size_x + xc3
        _mm_store_si128((__m128i *)&a[0], xmm1);

        unsigned color0 = working_buffer[ a[0] ];
        unsigned color1 = working_buffer[ a[1] ];
        unsigned color2 = working_buffer[ a[2] ];
        unsigned color3 = working_buffer[ a[3] ];

        int adr = base_adr + k;
        frame_bitmap[adr]   = color0;
        frame_bitmap[adr+1] = color1;
        frame_bitmap[adr+2] = color2;
        frame_bitmap[adr+3] = color3;
    }
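If SSE4.1 is not available, note that _mm_mullo_epi32 is the only SSE4.1 intrinsic the code above uses, and it can be emulated with SSE2. The following is a minimal sketch of a commonly used emulation, not something from the original answer. The helper names mullo_epi32_sse2 and mullo_matches_scalar are made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */

/* Hypothetical helper: lane-wise 32-bit multiply that keeps the low 32 bits
   of each product, built from SSE2 only. The low 32 bits of a product are
   identical for signed and unsigned operands, so the unsigned
   _mm_mul_epu32 can be used even though the data is signed. */
static __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    /* 64-bit products of lanes 0 and 2 */
    __m128i even = _mm_mul_epu32(a, b);
    /* shift lanes 1 and 3 down into positions 0 and 2, then multiply */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    /* collect the low halves of the products into the two lowest lanes */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    /* interleave back to [p0, p1, p2, p3] */
    return _mm_unpacklo_epi32(even, odd);
}

/* Self-check: returns 1 if all four lanes agree with scalar multiplication. */
static int mullo_matches_scalar(const int *x, const int *y)
{
    int r[4];
    assert(x != 0 && y != 0);
    __m128i v = mullo_epi32_sse2(_mm_loadu_si128((const __m128i *)x),
                                 _mm_loadu_si128((const __m128i *)y));
    _mm_storeu_si128((__m128i *)r, v);
    for (int i = 0; i < 4; ++i) {
        /* multiply as unsigned to avoid signed-overflow issues in the reference */
        if (r[i] != (int)((unsigned)x[i] * (unsigned)y[i]))
            return 0;
    }
    return 1;
}
```

With this helper, each _mm_mullo_epi32(x, y) call in the loop could be replaced by mullo_epi32_sse2(x, y). It costs several instructions per multiply, so it is worth timing against the scalar loop.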

You can optimize it further by avoiding _mm_store_si128((__m128i *)&a[0], xmm1); or int adr = base_adr + k; by using assembly that operates on memory directly.
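The temporary array a can also be avoided without dropping to assembly. Here is a hedged sketch using SSE2 only: it pulls each index out of the register with _mm_shuffle_epi32 and _mm_cvtsi128_si32 (with SSE4.1, _mm_extract_epi32 does this in one call), then writes the four colors with a single unaligned store. The function name gather4 and its parameters are made up for illustration. Whether this beats the store to a depends on the compiler and CPU, so measure it.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */

/* Hypothetical helper: idx holds four element indices into src.
   Reads src[idx0..idx3] and writes the four values to dst[0..3]. */
static void gather4(const unsigned *src, __m128i idx, unsigned *dst)
{
    assert(src != 0 && dst != 0);

    /* lane 0 can be read directly; the other lanes are first moved to lane 0 */
    int i0 = _mm_cvtsi128_si32(idx);
    int i1 = _mm_cvtsi128_si32(_mm_shuffle_epi32(idx, _MM_SHUFFLE(1, 1, 1, 1)));
    int i2 = _mm_cvtsi128_si32(_mm_shuffle_epi32(idx, _MM_SHUFFLE(2, 2, 2, 2)));
    int i3 = _mm_cvtsi128_si32(_mm_shuffle_epi32(idx, _MM_SHUFFLE(3, 3, 3, 3)));

    /* _mm_set_epi32 takes the highest lane first, so src[i0] lands in lane 0 */
    __m128i colors = _mm_set_epi32((int)src[i3], (int)src[i2],
                                   (int)src[i1], (int)src[i0]);

    /* unaligned store: dst does not have to be 16-byte aligned */
    _mm_storeu_si128((__m128i *)dst, colors);
}
```

Inside the loop this would replace the store to a and the four scalar writes, for example gather4(working_buffer, xmm1, &frame_bitmap[base_adr + k]);, assuming frame_bitmap holds unsigned values.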