CUDA shared memory array variable

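I am trying to declare a variable for matrix multiplication as follows:

    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

I want the user to be able to enter the size of the matrix to compute, but that means changing BLOCK_SIZE at run time. When I changed it, I got a compiler error: "error: constant value is not known". I looked into it, and it is similar to the problem described in this post. So I tried:

    __shared__ int buf [];

But then I get: "error: incomplete type is not allowed"

Thanks, Dan

Update with code (I pretty much followed this guide, plus some staring at the CUDA guide): the block size is passed in by asking the user for the size of the matrix. They enter x and y. The block size is just x, and for now x and y have to be the same size.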

    __global__ void matrixMul(float* C, float* A, float* B, int wA, int wB, size_t block_size)
    {
        // Block index
        int bx = blockIdx.x;
        int by = blockIdx.y;

        // Thread index
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Index of the first sub-matrix of A processed by the block
        int aBegin = wA * block_size * by;

        // Index of the last sub-matrix of A processed by the block
        int aEnd = aBegin + wA - 1;

        // Step size used to iterate through the sub-matrices of A
        int aStep = block_size;

        // Index of the first sub-matrix of B processed by the block
        int bBegin = block_size * bx;

        // Step size used to iterate through the sub-matrices of B
        int bStep = block_size * wB;

        float Csub = 0;

        // Loop over all the sub-matrices of A and B
        // required to compute the block sub-matrix
        for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
        {
            // Declaration of the shared memory array As
            // used to store the sub-matrix of A
            extern __shared__ float As[];

            // Declaration of the shared memory array Bs
            // used to store the sub-matrix of B
            extern __shared__ float Bs[];
            extern __shared__ float smem[];

            // Load the matrices from global memory to shared memory;
            // each thread loads one element of each matrix
            smem[ty*block_size+tx] = A[a + wA * ty + tx];
            //cuPrintf("\n\nWhat are the memory locations?\n");
            //cuPrintf("The shared memory(A) is: %.2f\n",smem[ty*block_size+tx]);
            smem[block_size*block_size+ty*block_size+tx] = B[b + wB * ty + tx];
            //cuPrintf("The shared memory(B) is: %.2f\n",smem[block_size*block_size+ty*block_size+tx]);

            // Synchronize to make sure the matrices are loaded
            __syncthreads();

            // Multiply the two matrices together;
            // each thread computes one element of the block sub-matrix
            for (int k = 0; k < block_size; ++k)
            {
                Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx];
                //cuPrintf("Csub is currently: %.2f\n",Csub);
            }
            //cuPrintf("\n\n\n");

            // Synchronize to make sure that the preceding computation is done
            // before loading two new sub-matrices of A and B in the next iteration
            //cuPrintf("the results are csub: %.2f\n",Csub);
            __syncthreads();
        }

        // Write the block sub-matrix to device memory;
        // each thread writes one element
        int c = wB * block_size * by + block_size * bx;
        C[c + wB * ty + tx] = Csub;
    }

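    extern __shared__ int buf[];

When you launch the kernel, pass the number of bytes of shared memory as the third launch parameter:

    kernel<<<blocks, threads, numbytes_for_shared>>>(...);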

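If you have multiple extern declarations of shared memory:

    extern __shared__ float As[];

    extern __shared__ float Bs[];

then As will point to the same address as Bs.

You need to keep As and Bs inside a single 1D array:

    extern __shared__ float smem[];

When calling the kernel, launch it with 2*BLOCK_SIZE*BLOCK_SIZE*sizeof(float) bytes of shared memory.

When indexing into As, use smem[y*BLOCK_SIZE+x], and when indexing into Bs, use smem[BLOCK_SIZE*BLOCK_SIZE+y*BLOCK_SIZE+x].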
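Putting those pieces together, here is a minimal sketch of how the single-buffer layout and the launch might look with a tile size chosen at run time. The names `matmul_dyn` and `run_matmul_dyn` are made up for illustration, and the sketch assumes square n x n row-major matrices with n a multiple of the tile size (there are no bounds checks).

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Sketch: C = A * B for n x n row-major matrices, one tile x tile block per
// output tile. Assumes n is a multiple of tile; there are no bounds checks.
__global__ void matmul_dyn(float* C, const float* A, const float* B, int n, int tile)
{
    extern __shared__ float smem[];   // single dynamic buffer, sized at launch
    float* As = smem;                 // first tile*tile floats hold the A tile
    float* Bs = smem + tile * tile;   // next tile*tile floats hold the B tile

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * tile + ty;
    int col = blockIdx.x * tile + tx;
    float acc = 0.0f;

    for (int t = 0; t < n / tile; ++t) {
        As[ty * tile + tx] = A[row * n + t * tile + tx];
        Bs[ty * tile + tx] = B[(t * tile + ty) * n + col];
        __syncthreads();              // both tiles are fully loaded
        for (int k = 0; k < tile; ++k)
            acc += As[ty * tile + k] * Bs[k * tile + tx];
        __syncthreads();              // finish reading before the next load
    }
    C[row * n + col] = acc;
}

// Hypothetical host helper: runs the kernel with a runtime tile size and
// compares the result against a plain CPU triple loop.
bool run_matmul_dyn(int n, int tile)
{
    if (tile <= 0 || n % tile != 0) return false;
    size_t bytes = size_t(n) * n * sizeof(float);
    std::vector<float> hA(n * n), hB(n * n), hC(n * n, -1.0f);
    for (int i = 0; i < n * n; ++i) { hA[i] = float(i % 7); hB[i] = float(i % 5); }

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

        dim3 threads(tile, tile);
        dim3 blocks(n / tile, n / tile);
        // Two tiles of tile*tile floats share the one dynamic allocation.
        size_t shared_bytes = 2 * size_t(tile) * tile * sizeof(float);
        matmul_dyn<<<blocks, threads, shared_bytes>>>(dC, dA, dB, n, tile);

        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);   // freeing a null pointer is a no-op

    for (int r = 0; ok && r < n; ++r)
        for (int c = 0; ok && c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            ok = std::fabs(ref - hC[r * n + c]) < 1e-3f;
        }
    return ok;
}
```

Calling `run_matmul_dyn(64, 16)` and `run_matmul_dyn(64, 8)` exercises the same kernel with two tile sizes without recompiling, which is the point of the dynamic allocation.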

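You have two choices for declaring shared memory inside a kernel: static or dynamic. I presume what you are doing at the moment looks something like this:

    #define BLOCK_SIZE (16)

    __global__ void sgemm0(const float *A, const float *B, float *C)
    {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    }

and you would like to be able to change BLOCK_SIZE easily.

One possibility is to continue to use static shared memory allocation, but make the allocation size a template parameter, like this:

    template<int blocksize>
    __global__ void sgemm1(const float *A, const float *B, float *C)
    {
        __shared__ float As[blocksize][blocksize];
    }

    template void sgemm1<16>(const float *, const float *, float *C);

Then you can instantiate as many different block size variants at compile time as you need.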
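Since the question is about a size entered by the user, one way to connect a runtime value to those compile-time variants is a small dispatcher. This is only a sketch under assumptions: `sgemm_tiled`, `launch_one`, and `launch_sgemm` are hypothetical names, the supported sizes 8, 16, and 32 are an arbitrary choice, and n is assumed to be a multiple of the tile size.

```cuda
#include <cuda_runtime.h>

// Tiled n x n multiply with statically sized shared tiles. TILE is a
// compile-time constant, so the inner loop bound is known to the compiler.
// Assumes n is a multiple of TILE; there are no bounds checks.
template <int TILE>
__global__ void sgemm_tiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[ty][tx] = A[row * n + t * TILE + tx];
        Bs[ty][tx] = B[(t * TILE + ty) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

// Hypothetical helper: launches one specific instantiation.
template <int TILE>
cudaError_t launch_one(const float* dA, const float* dB, float* dC, int n)
{
    if (n <= 0 || n % TILE != 0) return cudaErrorInvalidValue;
    dim3 threads(TILE, TILE);
    dim3 blocks(n / TILE, n / TILE);
    sgemm_tiled<TILE><<<blocks, threads>>>(dA, dB, dC, n);
    return cudaGetLastError();
}

// Hypothetical dispatcher: maps a runtime block size onto the variants that
// were instantiated at compile time; anything else is rejected.
cudaError_t launch_sgemm(int block_size, const float* dA, const float* dB,
                         float* dC, int n)
{
    switch (block_size) {
    case 8:  return launch_one<8>(dA, dB, dC, n);
    case 16: return launch_one<16>(dA, dB, dC, n);
    case 32: return launch_one<32>(dA, dB, dC, n);
    default: return cudaErrorInvalidValue;   // size was not instantiated
    }
}
```

The trade-off is that only the listed sizes are available, and each one adds a compiled copy of the kernel.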

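If you want to allocate the memory dynamically, define it like this:

    __global__ void sgemm2(const float *A, const float *B, float *C)
    {
        extern __shared__ float As[];
    }

and then add the size of the allocation as an argument to the kernel call:

    size_t blocksize = BLOCK_SIZE * BLOCK_SIZE;
    sgemm2<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);

If you have multiple statically declared arrays that you want to replace with dynamically allocated shared memory, be aware that there is only ever one dynamic shared memory allocation per kernel, so multiple items must coexist within (share) that memory segment. So if you had something like:

    #define BLOCK_SIZE (16)

    __global__ void sgemm0(const float *A, const float *B, float *C)
    {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    }

you could replace it with:

    #define BLOCK_SIZE (16)

    __global__ void sgemm3(const float *A, const float *B, float *C)
    {
        extern __shared__ float buffer[];
        float *As = &buffer[0];
        float *Bs = &buffer[BLOCK_SIZE*BLOCK_SIZE];
    }

and launch the kernel like this:

    size_t blocksize = 2 * BLOCK_SIZE * BLOCK_SIZE;
    sgemm3<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);

All of these are equally valid, although I personally favor the template version, because it can enable other compiler optimizations, such as automatic loop unrolling, that the dynamic version cannot get without extra work.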
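If the block size really does come from user input, it may also be worth checking the request against the device limits before launching. The following is a rough sketch, not part of any of the answers: `tile_fits` is a hypothetical helper, and it assumes a square tile of threads with two float tiles in shared memory, as in the examples above.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if a tile x tile thread block with two
// tile x tile float tiles in shared memory fits the limits reported by the
// device. Assumes the two-tile layout used in the sgemm3 example.
bool tile_fits(int tile, int device = 0)
{
    if (tile <= 0) return false;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;

    size_t threads      = size_t(tile) * tile;
    size_t shared_bytes = 2 * threads * sizeof(float);

    return threads <= size_t(prop.maxThreadsPerBlock)
        && shared_bytes <= prop.sharedMemPerBlock;
}
```

A caller could reject or clamp the user's value when `tile_fits` returns false, rather than letting the launch fail.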

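That sounds right.

You usually need to malloc things in this case.

There are two issues here: first, C does not really know about 2D arrays (a 2D array is just an array of arrays), and second, array sizes need to be compile-time constants (or something the compiler can calculate at compile time).

If you are using C99, you can declare the array size using a parameter of the function, but C99 support is … at best.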
