Returning a pointer to a device-allocated matrix from C to Fortran

First, I'm new to Fortran/C/CUDA. Second, I'm working on a Fortran/C program that performs matrix-vector multiplication on the GPU using cuBLAS. I need to multiply many (up to 1000) vectors by a single matrix before the matrix's contents need updating. However, every time a new vector is sent to the GPU, I currently have to re-allocate the current version of the matrix, which is very wasteful and slow given that the matrix has not changed.

I want to be able to multiply the matrix by the vectors without re-allocating the matrix for each vector. One idea I've come up with is to call a separate C function that allocates the matrix on the GPU and returns a pointer to the Fortran main program, and then to call another C function that performs the matrix-vector multiplication.
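For reference, a host-only sketch of that "allocate once, reuse many times" pattern might look like the following. This is not CUDA code: malloc/free stand in for cudaMalloc/cudaFree, the function names (allocate_matrix_, matvec_, free_matrix_) are hypothetical, and a plain loop stands in for cublasSgemv. The trailing underscore matches gfortran's default external-name mangling, and note that a Fortran type(C_PTR) passed by reference arrives in C as float**, not float*.

```c
/* Sketch only: malloc/free stand in for cudaMalloc/cudaFree, and the
 * names are hypothetical. Fortran passes everything by reference, so
 * the C_PTR argument arrives as float**. */
#include <stdlib.h>

/* Allocate the N x N matrix once; hand the pointer back to Fortran
 * through the C_PTR argument. */
void allocate_matrix_(float **mat, int *n)
{
    *mat = (float *)malloc((size_t)(*n) * (size_t)(*n) * sizeof(float));
}

/* Reuse the already-allocated matrix for each new vector: y = A * x.
 * (Row-major indexing here, just to keep the sketch simple.) */
void matvec_(float **mat, float *x, float *y, int *n)
{
    int i, j;
    float *a = *mat;
    for (i = 0; i < *n; i++) {
        y[i] = 0.0f;
        for (j = 0; j < *n; j++)
            y[i] += a[i * (*n) + j] * x[j];
    }
}

void free_matrix_(float **mat)
{
    free(*mat);
    *mat = NULL;
}
```

On the Fortran side the pointer would be held in a type(C_PTR) variable and passed unchanged between the two calls, so the matrix is only ever allocated once.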

Using ISO_C_BINDING, I return a pointer to the floats into a variable:

type(C_PTR) :: ptr 

I then try to pass it to the matrix-vector C function:

In Fortran:

 call cudaFunction(ptr,vector, N) 

In C:

 extern "C" void cudaFunction_(float *mat, float *vector, int *N) 

Everything compiles and runs, but the cublasSgemv call fails to execute. Any ideas as to why this might be happening? I've seen some related posts, but they never try to send the returned pointer back to C, which is (I believe) where my problem lies.

Thanks in advance!

I suggest you not reinvent the wheel, but instead use the cublas fortran bindings that are provided for exactly this purpose.

The "thunking" wrapper is not what you want. It lets you make cublas calls directly from fortran, but it performs implicit copy operations as needed on every call.

You want the "non-thunking" wrapper, which gives you explicit control over the copying that takes place. You can copy data back and forth with the Fortran equivalents of Get/SetMatrix and Get/SetVector.

There is a sample code (example B.2) in the cublas documentation that shows how to use the non-thunking wrapper.

Even if you do want to reinvent the wheel, the wrapper will show you how to do the necessary syntax handling between C and Fortran.
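To give a flavor of that style (this is a sketch, not the actual fortran.c source): the non-thunking wrappers carry the device address through Fortran as a plain integer*8, so no C_PTR is needed on the Fortran side at all. Here malloc stands in for the real cudaMalloc/cublasAlloc call, and the name cublas_alloc_sketch_ is hypothetical.

```c
/* Sketch of the non-thunking wrapper style (not the real fortran.c):
 * the device pointer travels through Fortran as an integer*8 value. */
#include <stdint.h>
#include <stdlib.h>

typedef int64_t devptr_t;   /* matches Fortran's integer*8 */

/* Fortran usage: cublas_status = cublas_alloc(n, elemSize, devPtr)
 * All arguments arrive by reference; malloc stands in for cudaMalloc. */
int cublas_alloc_sketch_(const int64_t *n, const int64_t *elem_size,
                         devptr_t *device_ptr)
{
    void *p = malloc((size_t)(*n) * (size_t)(*elem_size));
    if (p == NULL)
        return 1;                       /* nonzero status = failure */
    *device_ptr = (devptr_t)(uintptr_t)p;
    return 0;                           /* CUBLAS_STATUS_SUCCESS = 0 */
}
```

The later wrapper calls then cast the integer*8 value back to a device pointer before invoking the real cuBLAS routine, which is the same "allocate once, pass a handle around" idea the question is after.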

On a standard Linux CUDA install, the wrappers are located in /usr/local/cuda/src; the non-thunking wrapper is /usr/local/cuda/src/fortran.c.

Here is a fully worked example:

cublasf.f:

      program cublas_fortran_example
      implicit none
      integer i, j
c     helper functions
      integer cublas_init
      integer cublas_shutdown
      integer cublas_alloc
      integer cublas_free
      integer cublas_set_vector
      integer cublas_get_vector
c     selected blas functions
      double precision cublas_ddot
      external cublas_daxpy
      external cublas_dscal
      external cublas_dcopy
      double precision cublas_dnrm2
c     cublas variables
      integer cublas_status
      real*8 x(30), y(30)
      double precision alpha, beta
      double precision nrm
      integer*8 d_x, d_y, d_alpha, d_beta, d_nrm
      integer*8 dsize1, dlength1, dlength2
      double precision dresult

      write(*,*) "testing cublas fortran example"
c     initialize cublas library
c     CUBLAS_STATUS_SUCCESS=0
      cublas_status = cublas_init()
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS Library initialization failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
c     initialize data
      do j=1,30
        x(j) = 1.0
        y(j) = 2.0
      enddo
      dsize1 = 8
      dlength1 = 30
      dlength2 = 1
      alpha = 2.0
      beta = 3.0
c     allocate device storage
      cublas_status = cublas_alloc(dlength1, dsize1, d_x)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
      cublas_status = cublas_alloc(dlength1, dsize1, d_y)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_alpha)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_beta)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_nrm)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
c     copy data from host to device
      cublas_status = cublas_set_vector(dlength1, dsize1, x, dlength2,
     >                                  d_x, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to device failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      cublas_status = cublas_set_vector(dlength1, dsize1, y, dlength2,
     >                                  d_y, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to device failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      dresult = cublas_ddot(dlength1, d_x, dlength2, d_y, dlength2)
      write(*,*) "dot product result=",dresult
      dresult = cublas_dnrm2(dlength1, d_x, dlength2)
      write(*,*) "nrm2 of x result=",dresult
      dresult = cublas_dnrm2(dlength1, d_y, dlength2)
      write(*,*) "nrm2 of y result=",dresult
      call cublas_daxpy(dlength1, alpha, d_x, dlength2, d_y, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_y, dlength2,
     >                                  y, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to host failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      write(*,*) "daxpy y(1) =", y(1)
      write(*,*) "daxpy y(30) =", y(30)
      call cublas_dscal(dlength1, beta, d_x, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_x, dlength2,
     >                                  x, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to host failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      write(*,*) "dscal x(1) =", x(1)
      write(*,*) "dscal x(30) =", x(30)
      call cublas_dcopy(dlength1, d_x, dlength2, d_y, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_y, dlength2,
     >                                  y, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to host failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      write(*,*) "dcopy y(1) =", y(1)
      write(*,*) "dcopy y(30) =", y(30)
c     deallocate GPU memory and exit
      cublas_status = cublas_free(d_x)
      cublas_status = cublas_free(d_y)
      cublas_status = cublas_free(d_alpha)
      cublas_status = cublas_free(d_beta)
      cublas_status = cublas_free(d_nrm)
      cublas_status = cublas_shutdown()
      stop
      end

Compile/run:

 $ gfortran -c -o cublasf.o cublasf.f
 $ gcc -c -DCUBLAS_GFORTRAN -I/usr/local/cuda/include -I/usr/local/cuda/src -o fortran.o /usr/local/cuda/src/fortran.c
 $ gfortran -L/usr/local/cuda/lib64 -lcublas -o cublasf cublasf.o fortran.o
 $ ./cublasf
  testing cublas fortran example
  dot product result=   60.0000000000000
  nrm2 of x result=   5.47722557505166
  nrm2 of y result=   10.9544511501033
  daxpy y(1) =   4.00000000000000
  daxpy y(30) =   4.00000000000000
  dscal x(1) =   3.00000000000000
  dscal x(30) =   3.00000000000000
  dcopy y(1) =   3.00000000000000
  dcopy y(30) =   3.00000000000000
 $

CUDA 5.0, RHEL 5.5