Returning a pointer to a device-allocated matrix from C to Fortran

First, I'm new to Fortran/C/CUDA. Second, I'm working on a Fortran/C program that performs matrix-vector multiplication on the GPU using cuBLAS. I need to multiply many (up to 1000) vectors by a single matrix before the matrix's contents need updating. However, every time a new vector is sent to the GPU, I currently have to re-allocate the current version of the matrix, which is very wasteful and slow given that the matrix has not changed.

I want to be able to multiply the matrix by the vectors without re-allocating the matrix for each vector. One idea I've come up with is to call a separate C function that allocates the matrix on the GPU and returns a pointer to the Fortran main program, and then to call another C function that performs the matrix-vector multiplication.
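For reference, a host-only sketch of that "allocate once, reuse many times" pattern might look like the following. This is not CUDA code: malloc/free stand in for cudaMalloc/cudaFree, the function names (allocate_matrix_, matvec_, free_matrix_) are hypothetical, and a plain loop stands in for cublasSgemv. The trailing underscore matches gfortran's default external-name mangling, and note that a Fortran type(C_PTR) passed by reference arrives in C as float**, not float*.

```c
/* Sketch only: malloc/free stand in for cudaMalloc/cudaFree, and the
 * names are hypothetical. Fortran passes everything by reference, so
 * the C_PTR argument arrives as float**. */
#include <stdlib.h>

/* Allocate the N x N matrix once; hand the pointer back to Fortran
 * through the C_PTR argument. */
void allocate_matrix_(float **mat, int *n)
{
    *mat = (float *)malloc((size_t)(*n) * (size_t)(*n) * sizeof(float));
}

/* Reuse the already-allocated matrix for each new vector: y = A * x.
 * (Row-major indexing here, just to keep the sketch simple.) */
void matvec_(float **mat, float *x, float *y, int *n)
{
    int i, j;
    float *a = *mat;
    for (i = 0; i < *n; i++) {
        y[i] = 0.0f;
        for (j = 0; j < *n; j++)
            y[i] += a[i * (*n) + j] * x[j];
    }
}

void free_matrix_(float **mat)
{
    free(*mat);
    *mat = NULL;
}
```

On the Fortran side the pointer would be held in a type(C_PTR) variable and passed unchanged between the two calls, so the matrix is only ever allocated once.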

Using ISO_C_BINDING, I return a pointer to the floats into a variable:

type(C_PTR) :: ptr 

I then try to pass it to the matrix-vector C function:

In Fortran:

 call cudaFunction(ptr,vector, N) 

In C:

 extern "C" void cudaFunction_(float *mat, float *vector, int *N) 

Everything compiles and runs, but the cublasSgemv call fails to execute. Any ideas as to why this might be happening? I've seen some related posts, but they never try to send the returned pointer back to C, which is (I believe) where my problem lies.

Thanks in advance!

I suggest you not reinvent the wheel, but instead use the cublas fortran bindings that are provided for exactly this purpose.

The "thunking" wrapper is not what you want. It lets you make cublas calls directly from fortran, but it performs implicit copy operations as needed on every call.

You want the "non-thunking" wrapper, which gives you explicit control over the copying that takes place. You can copy data back and forth with the Fortran equivalents of Get/SetMatrix and Get/SetVector.

There is a sample code (example B.2) in the cublas documentation that shows how to use the non-thunking wrapper.

Even if you do want to reinvent the wheel, the wrapper will show you how to do the necessary syntax handling between C and Fortran.
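To give a flavor of that style (this is a sketch, not the actual fortran.c source): the non-thunking wrappers carry the device address through Fortran as a plain integer*8, so no C_PTR is needed on the Fortran side at all. Here malloc stands in for the real cudaMalloc/cublasAlloc call, and the name cublas_alloc_sketch_ is hypothetical.

```c
/* Sketch of the non-thunking wrapper style (not the real fortran.c):
 * the device pointer travels through Fortran as an integer*8 value. */
#include <stdint.h>
#include <stdlib.h>

typedef int64_t devptr_t;   /* matches Fortran's integer*8 */

/* Fortran usage: cublas_status = cublas_alloc(n, elemSize, devPtr)
 * All arguments arrive by reference; malloc stands in for cudaMalloc. */
int cublas_alloc_sketch_(const int64_t *n, const int64_t *elem_size,
                         devptr_t *device_ptr)
{
    void *p = malloc((size_t)(*n) * (size_t)(*elem_size));
    if (p == NULL)
        return 1;                       /* nonzero status = failure */
    *device_ptr = (devptr_t)(uintptr_t)p;
    return 0;                           /* CUBLAS_STATUS_SUCCESS = 0 */
}
```

The later wrapper calls then cast the integer*8 value back to a device pointer before invoking the real cuBLAS routine, which is the same "allocate once, pass a handle around" idea the question is after.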

On a standard Linux CUDA install, the wrappers are located in /usr/local/cuda/src; the non-thunking wrapper is /usr/local/cuda/src/fortran.c.

Here is a fully worked example:

cublasf.f:

      program cublas_fortran_example
      implicit none
      integer i, j
c     helper functions
      integer cublas_init
      integer cublas_shutdown
      integer cublas_alloc
      integer cublas_free
      integer cublas_set_vector
      integer cublas_get_vector
c     selected blas functions
      double precision cublas_ddot
      external cublas_daxpy
      external cublas_dscal
      external cublas_dcopy
      double precision cublas_dnrm2
c     cublas variables
      integer cublas_status
      real*8 x(30), y(30)
      double precision alpha, beta
      double precision nrm
      integer*8 d_x, d_y, d_alpha, d_beta, d_nrm
      integer*8 dsize1, dlength1, dlength2
      double precision dresult

      write(*,*) "testing cublas fortran example"
c     initialize cublas library
c     CUBLAS_STATUS_SUCCESS=0
      cublas_status = cublas_init()
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS Library initialization failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
c     initialize data
      do j=1,30
        x(j) = 1.0
        y(j) = 2.0
      enddo
      dsize1 = 8
      dlength1 = 30
      dlength2 = 1
      alpha = 2.0
      beta = 3.0
c     allocate device storage
      cublas_status = cublas_alloc(dlength1, dsize1, d_x)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
      cublas_status = cublas_alloc(dlength1, dsize1, d_y)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_alpha)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_beta)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
      cublas_status = cublas_alloc(dlength2, dsize1, d_nrm)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS device malloc failed"
        stop
      endif
c     copy data from host to device
      cublas_status = cublas_set_vector(dlength1, dsize1, x, dlength2,
     >                                  d_x, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to device failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      cublas_status = cublas_set_vector(dlength1, dsize1, y, dlength2,
     >                                  d_y, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to device failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      dresult = cublas_ddot(dlength1, d_x, dlength2, d_y, dlength2)
      write(*,*) "dot product result=",dresult
      dresult = cublas_dnrm2(dlength1, d_x, dlength2)
      write(*,*) "nrm2 of x result=",dresult
      dresult = cublas_dnrm2(dlength1, d_y, dlength2)
      write(*,*) "nrm2 of y result=",dresult
      call cublas_daxpy(dlength1, alpha, d_x, dlength2, d_y, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_y, dlength2,
     >                                  y, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to host failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      write(*,*) "daxpy y(1) =", y(1)
      write(*,*) "daxpy y(30) =", y(30)
      call cublas_dscal(dlength1, beta, d_x, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_x, dlength2,
     >                                  x, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to host failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      write(*,*) "dscal x(1) =", x(1)
      write(*,*) "dscal x(30) =", x(30)
      call cublas_dcopy(dlength1, d_x, dlength2, d_y, dlength2)
      cublas_status = cublas_get_vector(dlength1, dsize1, d_y, dlength2,
     >                                  y, dlength2)
      if (cublas_status /= 0) then
        write(*,*) "CUBLAS copy to host failed"
        write(*,*) "cublas_status=",cublas_status
        stop
      endif
      write(*,*) "dcopy y(1) =", y(1)
      write(*,*) "dcopy y(30) =", y(30)
c     deallocate GPU memory and exit
      cublas_status = cublas_free(d_x)
      cublas_status = cublas_free(d_y)
      cublas_status = cublas_free(d_alpha)
      cublas_status = cublas_free(d_beta)
      cublas_status = cublas_free(d_nrm)
      cublas_status = cublas_shutdown()
      stop
      end

Compile/run:

 $ gfortran -c -o cublasf.o cublasf.f
 $ gcc -c -DCUBLAS_GFORTRAN -I/usr/local/cuda/include -I/usr/local/cuda/src -o fortran.o /usr/local/cuda/src/fortran.c
 $ gfortran -L/usr/local/cuda/lib64 -lcublas -o cublasf cublasf.o fortran.o
 $ ./cublasf
  testing cublas fortran example
  dot product result=   60.0000000000000
  nrm2 of x result=   5.47722557505166
  nrm2 of y result=   10.9544511501033
  daxpy y(1) =   4.00000000000000
  daxpy y(30) =   4.00000000000000
  dscal x(1) =   3.00000000000000
  dscal x(30) =   3.00000000000000
  dcopy y(1) =   3.00000000000000
  dcopy y(30) =   3.00000000000000
 $

CUDA 5.0, RHEL 5.5