Matlab下通过Mex文件编写C程序
本节参考NVIDIA网站相关资源,点击此处链接:
1、MEX规则
Matlab提供了MEX文件的方式来支持C/C++代码编写的算法。mex文件需要满足如下要求:
(1) 包含mex.h头文件
(2) 函数名称、参数返回值必须为如下形式:
void mexFunction(int nlhs, mxArray *plhs[],int nrhs, const mxArray *prhs[]);
其中:
nlhs 为输出数组个数(Left Hand Side)
plhs 为指向输出数组的指针
nrhs 为输入数组个数(Right Hand Side)
prhs 为指向输入数组的指针,且输入数组只读。
以上四个变量,是在C/C++代码中唯一可用到的变量。实际上,由于matlab中所有的变量都是mxArray结构(向量,数组,字符串。。。),因此常见的数据类型均可放入mxArray传递给程序进行处理。
(3)常用mex函数:
a. mex函数定义在mex.h(./extern/include)中,并以mex为前缀,例如打印输出的mexPrintf()函数、在mex文件中调用matlab函数的mexCallMATLAB()函数等。
b. 在matrix.h(./extern/include)中定义了mxArray结构以及对矩阵操作的函数,如创建双精度矩阵的mxCreateDoubleMatrix()、从mxArray中获取数据指针mxGetPr()等。
(4)编译
首先可以通过mex -setup来选择编译器,此时matlab会提示:
Would you like mex to locate installed compilers [y]/n?
此处如果选y,则会列出系统中安装的编译器,但不一定完整(作者就遇到这样的情况)。如果选n,则会列出matlab支持的所有编译器,
我们按实际情况选取即可(按照CUDA的要求,vs2005以上)。
另外常用的编译选项有:
-I 增加头文件(.h)包含目录
-L 增加库文件(.lib)包含目录
-l 链接引用的库文件名称
其他命令可查看matlab中的帮助文档。
2、MEX例程
在本教程中,举例实现将两个矩阵相加(C = A + B)。
/******************************************************************** filename: addMatrix.c file ext: c author: wy@gpgpu.org.cn purpose: test matlab with c *********************************************************************/ #include "mex.h" void mexFunction(int nlhs, mxArray *plhs[],int nrhs, const mxArray *prhs[]) { int i, j, mA, nA, mB, nB, nMatSize; double *A, *B, *C; /* 输入变量检查 */ if (nrhs != 2) mexErrMsgTxt("输入参数必须为2个"); if (nlhs != 1) mexErrMsgTxt("输出参数必须为1个"); mA = mxGetM(prhs[0]); nA = mxGetN(prhs[0]); mB = mxGetM(prhs[1]); nB = mxGetN(prhs[1]); if (mA != mB || nA != nB) mexErrMsgTxt("输入矩阵尺寸必须相同"); /*为输出结果分配内存*/ plhs[0]=mxCreateDoubleMatrix(mA,nA,mxREAL); /*获取数据指针*/ A = mxGetPr(prhs[0]); B = mxGetPr(prhs[1]); C = mxGetPr(plhs[0]); /*求和计算*/ nMatSize = mA * nA; for (i=0; i<nMatSize; i++) { C[i] = A[i] + B[i]; } /*输出信息*/ mexPrintf("求和完毕"); }
编译c文件:
mex -v addMatrix.c
编译成功后,在matlab中即可像常规函数一样使用addMatrix函数。
a = rand(2,2)
a =
0.93547 0.41027
0.9169 0.89365
b = rand(2,2)
b =
0.057891 0.81317
0.35287 0.0098613
c = addMatrix(a,b)
求和完毕
c =
0.99336 1.2234
1.2698 0.90351
3. 使用cuda标准库
在nvidia给出的白皮书例程中,给出了二维FFT的实现方法,本文将其引用,说明通过mex文件使用cuda标准库实现加速的方法。
在该程序中,没有调用内核程序,而只使用了runtime库和fft库,并不需要.cu文件,可以用mex命令直接编译。因此,将cuda算法封装之后再由mex文件调用是使用matlab使用cuda的一种方法。
编译命令:
mex -IC:\CUDA\include fft2_cuda.c -LC:\cuda\lib -lcufft -lcudart
/* This is a mex file to offload the FFT2 function in Matlab to CUDA: Y = fft2(X) or Y = fft2(X,M,N) Y = fft2(X) returns the two-dimensional FFT of X. The result Y is the same size as X. Y = fft2(X,M,N) truncates X, or pads X with zeros to create an M-by-N array before doing the transform. The result is M-by-N. On Windows: Setup the mex build from a Matlab shell: mex -setup Compile the mex file: mex -IC:\CUDA\include fft2_cuda.c -LC:\cuda\lib -lcufft -lcudart On Linux: Use the makefile */ #include "mex.h" #include "cufft.h" #include "cuda_runtime.h" /**************************************************************************/ /* MATLAB stores complex numbers in separate arrays for the real and imaginary parts. The following functions take the data in this format and pack it into a complex work array, or unpack it, respectively. We are using cufftComplex defined in cufft.h to handle complex on Windows and Linux */ void pack_r2c(cufftComplex *output_float, double *input_re, int Ntot) { int i; for (i = 0; i < Ntot; i++) { output_float[i].x = input_re[i]; output_float[i].y = 0.0f; } } void pack_c2c(cufftComplex *output_float, double *input_re, double *input_im, int Ntot) { int i; for (i = 0; i < Ntot; i++) { output_float[i].x = input_re[i]; output_float[i].y = input_im[i]; } } void pack_r2c_expand(cufftComplex *output_float, double *input_re, int originalM, int originalN, int M, int N) { int i, j; for (i = 0; i < originalM; i++) for (j = 0; j < originalN; j++) { output_float[i+M*j].x = input_re[i+originalM*j]; output_float[i+M*j].y = 0.0f; } } void pack_c2c_expand(cufftComplex *output_float, double *input_re, double *input_im, int originalM, int originalN, int M, int N) { int i, j; for (i = 0; i < originalM; i++) for (j = 0; j < originalN; j++) { output_float[i+M*j].x = input_re[i+originalM*j]; output_float[i+M*j].y = input_im[i+originalM*j]; } } void unpack_c2c(cufftComplex *input_float, double *output_re, double *output_im, int Ntot) { int i; for (i = 0; i < Ntot; i++) { output_re[i] = input_float[i].x; output_im[i] = input_float[i].y; } } /**************************************************************************/ void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) { int originalM, originalN, M, N; double *ar,*ai; cufftComplex *input_single ; cufftHandle plan; cufftComplex *rhs_complex_d; /* Find out the dimension of the array we want to transform: prhs(originalM,originalN) originalM = Number of rows in the mxArray prhs originalN = Number of columns in the mxArray prhs */ originalM = mxGetM(prhs[0]); originalN = mxGetN(prhs[0]); M = originalM; N = originalN; /* Find out if the result is of the same size of the input. plhs(M,N) M = The number of rows in the mxArray plhs N = The number of columns in the mxArray plhs */ if (nrhs == 3) { M = mxGetScalar(prhs[1]); N = mxGetScalar(prhs[2]); } /* Find out if the input array was real or complex. Matlab is passing two separate pointers for the real and imaginary part. The current version of CUDAFFT is expecting interleaved data. We will need to pack and unpack the data. The current version of CUDA supports only single precision, so the original double precision data need to be converted. */ /* Allocating working array on host */ if(nrhs == 1) input_single = (cufftComplex*) mxMalloc(sizeof(cufftComplex)*N*M); if(nrhs == 3) input_single = (cufftComplex*) mxCalloc(N*M,sizeof(cufftComplex)); /* Pointer for the real part of the input */ ar = (double *) mxGetData(prhs[0]); if (mxIsComplex(prhs[0])) { /* The input array is complex */ ai = (double *) mxGetImagData(prhs[0]); /* Input and output have same dimensions */ if(nrhs ==1) pack_c2c(input_single, ar, ai, N*M); /* Input and output have different dimensions */ if(nrhs ==3) pack_c2c_expand(input_single, ar, ai, originalM, originalN, M, N); } else { /* The input array is real */ /* Input and output have same dimensions */ if(nrhs ==1) pack_r2c(input_single, ar, N*M); /* Input and output have different dimensions */ if(nrhs ==3) pack_r2c_expand(input_single, ar, originalM, originalN, M, N); } /* Allocate array on device */ cudaMalloc( (void **) &rhs_complex_d,sizeof(cufftComplex)*N*M); /* Copy input array in interleaved format to the device */ cudaMemcpy( rhs_complex_d, input_single, sizeof(cufftComplex)*N*M, cudaMemcpyHostToDevice); /* Create plan for CUDA FFT */ cufftPlan2d(&plan, N, M, CUFFT_C2C) ; /* Execute FFT on device */ cufftExecC2C(plan, rhs_complex_d, rhs_complex_d, CUFFT_FORWARD) ; /* Destroy plan */ cufftDestroy(plan); /* Copy result back to host */ cudaMemcpy( input_single, rhs_complex_d, sizeof(cufftComplex)*N*M, cudaMemcpyDeviceToHost); plhs[0]=mxCreateDoubleMatrix(M,N,mxCOMPLEX); ar = mxGetPr(plhs[0]); ai = mxGetPi(plhs[0]); unpack_c2c(input_single, ar, ai, N*M); mxFree(input_single); cudaFree(rhs_complex_d); return; }
