I am trying to get GPc (https://github.com/SheffieldML/GPc) working in Matlab, using mex-files. I got the examples working, I took the bit I'm currently being interested in out as a standalone C++ program, that works just fine. However, when I try to do the same in a mex and run it through Matlab, I'm getting some errors, in particular:
MKL ERROR: Parameter 4 was incorrect on entry to DPOTRF.
or
** On entry to DPOTRF parameter number 4 had an illegal value
depending on whether I use the system version of MKL or the one Matlab carries along. The call to dpotrf is:
dpotrf_(type, nrows, vals, nrows, info);
with all variables valid (type="U", nrows=40, vals = double[40*40]) and with the interface:
extern "C" void dpotrf_(
const char* t, // whether upper or lower triangluar 'U' or 'L'
const int &n, // (input)
double *a, // a[n][lda] (input/output)
const int &lda, // (input)
int &info // (output)
);
(both are taken from GPc). LDA was originally supplied as ncols (which I believe is incorrect, but I didn't inquiry the library author about it yet), but it shouldn't make a difference, because this is called on a square matrix.
I feared that there might be problem with the references, so I changed the interface header to accept int* (like in http://www.netlib.org/clapack/clapack-3.2.1-CMAKE/SRC/dpotrf.c), but that started giving me segfaults, so it made me thinking the references there are right.
Does anybody have an idea what might be wrong?
I've tried to reproduce with an example on my end, but I'm not seeing any errors. In fact the result is identical to MATLAB's.
mex_chol.cpp
#include "mex.h"
#include "lapack.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
// verify arguments
if (nrhs != 1 || nlhs > 1) {
mexErrMsgTxt("Wrong number of arguments.");
}
if (!mxIsDouble(prhs[0]) || mxIsComplex(prhs[0])) {
mexErrMsgTxt("Input must be a real double matrix.");
}
if (mxGetM(prhs[0]) != mxGetN(prhs[0])) {
mexErrMsgTxt("Input must be a symmetric positive-definite matrix.");
}
// copy input matrix to output (its contents will be overwritten)
plhs[0] = mxDuplicateArray(prhs[0]);
// pointer to data
double *A = mxGetPr(plhs[0]);
mwSignedIndex n = mxGetN(plhs[0]);
// perform matrix factorization
mwSignedIndex info = 0;
dpotrf("U", &n, A, &n, &info);
// check if call was successful
if (info < 0) {
mexErrMsgTxt("Parameters had an illegal value.");
} else if (info > 0) {
mexErrMsgTxt("Matrix is not positive-definite.");
}
}
Note that MATLAB already ships with BLAS/LAPCK headers and libraries (Intel MKL implementation). In fact this is what $MATLABROOT\extern\include\lapack.h has as function prototype for dpotrf:
#define dpotrf FORTRAN_WRAPPER(dpotrf)
extern void dpotrf(
char *uplo,
ptrdiff_t *n,
double *a,
ptrdiff_t *lda,
ptrdiff_t *info
);
Here is how you compile the above C++ code:
>> mex -largeArrayDims mex_chol.cpp libmwblas.lib libmwlapack.lib
Finally let's test the MEX function:
% some random symmetric positive semidefinite matrix
A = gallery('randcorr',10);
% our MEX-version of Cholesky decomposition
chol2 = #(A) triu(mex_chol(A));
% compare
norm(chol(A) - chol2(A)) % I get 0
(Note that the MEX code returns the working matrix as is, where the LAPACK routine only overwrites half of the matrix. So I used TRIU to zero-out the other half and extract the upper part).
Related
I'm trying to write an R wrapper for the FINUFFT routines for calculating the FFT of an unevenly sampled series. I have virtually no experience with C/C++, so I'm working from an example that compares the traditional Fourier transform to the NUFFT. The example code follows.
// this is all you must include for the finufft lib...
#include "finufft.h"
#include <complex>
// also needed for this example...
#include <stdio.h>
#include <stdlib.h>
using namespace std;
int main(int argc, char* argv[])
/* Simple example of calling the FINUFFT library from C++, using plain
arrays of C++ complex numbers, with a math test. Barnett 3/10/17
Double-precision version (see example1d1f for single-precision)
Compile with:
g++ -fopenmp example1d1.cpp -I ../src ../lib-static/libfinufft.a -o example1d1 -lfftw3 -lfftw3_omp -lm
or if you have built a single-core version:
g++ example1d1.cpp -I ../src ../lib-static/libfinufft.a -o example1d1 -lfftw3 -lm
Usage: ./example1d1
*/
{
int M = 1e6; // number of nonuniform points
int N = 1e6; // number of modes
double acc = 1e-9; // desired accuracy
nufft_opts opts; finufft_default_opts(&opts);
complex<double> I = complex<double>(0.0,1.0); // the imaginary unit
// generate some random nonuniform points (x) and complex strengths (c):
double *x = (double *)malloc(sizeof(double)*M);
complex<double>* c = (complex<double>*)malloc(sizeof(complex<double>)*M);
for (int j=0; j<M; ++j) {
x[j] = M_PI*(2*((double)rand()/RAND_MAX)-1); // uniform random in [-pi,pi)
c[j] = 2*((double)rand()/RAND_MAX)-1 + I*(2*((double)rand()/RAND_MAX)-1);
}
// allocate output array for the Fourier modes:
complex<double>* F = (complex<double>*)malloc(sizeof(complex<double>)*N);
// call the NUFFT (with iflag=+1): note N and M are typecast to BIGINT
int ier = finufft1d1(M,x,c,+1,acc,N,F,opts);
int n = 142519; // check the answer just for this mode...
complex<double> Ftest = complex<double>(0,0);
for (int j=0; j<M; ++j)
Ftest += c[j] * exp(I*(double)n*x[j]);
int nout = n+N/2; // index in output array for freq mode n
double Fmax = 0.0; // compute inf norm of F
for (int m=0; m<N; ++m) {
double aF = abs(F[m]);
if (aF>Fmax) Fmax=aF;
}
double err = abs(F[nout] - Ftest)/Fmax;
printf("1D type-1 NUFFT done. ier=%d, err in F[%d] rel to max(F) is %.3g\n",ier,n,err);
free(x); free(c); free(F);
return ier;
}
Much of this I don't need, such as generating the test series and comparing to the traditional FFT. Further, I want to return the values of the transform, not just an error code indicating success. Below is my code.
#include "finufft.h"
#include <complex>
#include <Rcpp.h>
#include <stdlib.h>
using namespace Rcpp;
using namespace std;
// [[Rcpp::export]]
ComplexVector finufft(int M, NumericVector x, ComplexVector c, int N) {
// From example code for finufft, sets precision and default options
double acc = 1e-9;
nufft_opts opts; finufft_default_opts(&opts);
// allocate output array for the finufft routine:
complex<double>* F = (complex<double>*)malloc(sizeof(complex<double>*)*N);
// Change vector inputs from R types to C++ types
double* xd = as< double* >(x);
complex<double>* cd = as< complex<double>* >(c);
// call the NUFFT (with iflag=-1): note N and M are typecast to BIGINT
int ier = finufft1d1(M,xd,cd,-1,acc,N,F,opts);
ComplexVector Fd = as<ComplexVector>(*F);
return Fd;
}
When I try to source this in Rstudio, I get the error "no matching function for call to 'as(std::complex<double>*&)'", pointing to the line declaring Fd towards the end. I believe the error indicates that either the function 'as' isn't defined (which I know is false), or the argument to 'as' isn't the correct type. The examples here include one using 'as' to convert to a NumericVector, so unless there's some complication with complex values I don't see why it should be a problem here.
I know there are potential problems using two namespaces, but I don't believe that's the issue here. My best guess is that there's an issue with how I'm trying to use pointers, but I lack the experience to identify it and I can't find any similar examples online to guide me.
Rcpp::as<T> converts from an R data type (SEXP) to a C++ data type, e.g. Rcpp::ComplexVector. This does not fit your situation, where you try to convert from a C-style array to C++. Fortunately Rcpp::Vector, which is the basis for Rcpp::ComplexVector, has a constructor for this task: Vector (InputIterator first, InputIterator last). For the other direction (going from C++ to C-style array) you can use vector.begin() or &vector[0].
However, one needs a reinterpret_cast to convert between Rcomplex* and std::complex<double>*. That should cause no problems, though, since Rcomplex (a.k.a. complex double in C) and std::complex<doulbe> are compatible.
A minimal example:
#include <Rcpp.h>
#include <complex>
using namespace Rcpp;
// [[Rcpp::export]]
ComplexVector foo(ComplexVector v) {
std::complex<double>* F = reinterpret_cast<std::complex<double>*>(v.begin());
int N = v.length();
// do something with F
ComplexVector Fd(reinterpret_cast<Rcomplex*>(F),
reinterpret_cast<Rcomplex*>(F + N));
return Fd;
}
/*** R
set.seed(42)
foo(runif(4)*(1+1i))
*/
Result:
> Rcpp::sourceCpp('56675308/code.cpp')
> set.seed(42)
> foo(runif(4)*(1+1i))
[1] 0.9148060+0.9148060i 0.9370754+0.9370754i 0.2861395+0.2861395i 0.8304476+0.8304476i
BTW, you can move these reinterpret_casts out of sight by using std::vector<std::complex<double>> as argument and return types for your function. Rcpp does the rest for you. This also helps getting rid of the naked malloc:
#include <Rcpp.h>
// dummy function with reduced signature
int finufft1d1(int M, double *xd, std::complex<double> *cd, int N, std::complex<double> *Fd) {
return 0;
}
// [[Rcpp::export]]
std::vector<std::complex<double>> finufft(int M,
std::vector<double> x,
std::vector<std::complex<double>> c,
int N) {
// allocate output array for the finufft routine:
std::vector<std::complex<double>> F(N);
// Change vector inputs from R types to C++ types
double* xd = x.data();
std::complex<double>* cd = c.data();
std::complex<double>* Fd = F.data();
int ier = finufft1d1(M, xd, cd, N, Fd);
return F;
}
I am using Coin-Or's rehearse to implement linear programming.
I need a modulo constraint. Example: x shall be a multiple of 3.
OsiCbcSolverInterface solver;
CelModel model(solver);
CelNumVar x;
CelIntVar z;
unsigned int mod = 3;
// Maximize
solver.setObjSense(-1.0);
model.setObjective(x);
model.addConstraint(x <= 7.5);
// The modulo constraint:
model.addConstraint(x == z * mod);
The result for x should be 6. However, z is set to 2.5, which should not be possible as I declared it as a CellIntVar.
How can I enforce z to be an integer?
I never used that lib, but you i think you should follow the tests.
The core message comes from the readme:
If you want some of your variables to be integers, use CelIntVar instead of CelNumVar. You must bind the solver to an Integer Linear Programming solver as well, for example Coin-cbc.
Looking at Rehearse/tests/testRehearse.cpp -> exemple4() (here presented: incomplete code; no copy-paste):
OsiClpSolverInterface *solver = new OsiClpSolverInterface();
CelModel model(*solver);
...
CelIntVar x1("x1");
...
solver->initialSolve(); // this is the relaxation (and maybe presolving)!
...
CbcModel cbcModel(*solver); // MIP-solver
cbcModel.branchAndBound(); // Use MIP-solver
printf("Solution for x1 : %g\n", model.getSolutionValue(x1, *cbcModel.solver()));
printf("Solution objvalue = : %g\n", cbcModel.solver()->getObjValue());
This kind of usage (use Osi to get LP-solver; build MIP-solver on top of that Osi-provided-LP-solver and call brandAndBound) basically follows Cbc's internal interface (with python's cylp this looks similar).
Just as reference: This is the official CoinOR Cbc (Rehearse-free) example from here:
// Copyright (C) 2005, International Business Machines
// Corporation and others. All Rights Reserved.
#include "CbcModel.hpp"
// Using CLP as the solver
#include "OsiClpSolverInterface.hpp"
int main (int argc, const char *argv[])
{
OsiClpSolverInterface solver1;
// Read in example model in MPS file format
// and assert that it is a clean model
int numMpsReadErrors = solver1.readMps("../../Mps/Sample/p0033.mps","");
assert(numMpsReadErrors==0);
// Pass the solver with the problem to be solved to CbcModel
CbcModel model(solver1);
// Do complete search
model.branchAndBound();
/* Print the solution. CbcModel clones the solver so we
need to get current copy from the CbcModel */
int numberColumns = model.solver()->getNumCols();
const double * solution = model.bestSolution();
for (int iColumn=0;iColumn<numberColumns;iColumn++) {
double value=solution[iColumn];
if (fabs(value)>1.0e-7&&model.solver()->isInteger(iColumn))
printf("%d has value %g\n",iColumn,value);
}
return 0;
}
I am trying to convert matlab code to a c++ mex file in order to run a few computations more efficiently. I am using the armadillo library with blas and lapack for a few matrix operations, which involves interpolating data to apply a delay.
However, I am receiving an inconsistent output from my mex file. If I run the same mex file with the same input, sometimes I receive the correct output, and occasionally it will output a HUGE number (i.e. instead of on the order of 100, it is on the order of 10^246).
I am very new to c++ coding, and have exhausted my general knowledge base. I believe the problem is in my interpolation step, because I am able to consistently output the correct delay matrix, which is the preceeding step.
Does anyone have any idea what I am doing to produce this?
In Matlab I call:
mex test.cpp -lblas -llapack
[outData] = test( squeeze(inData(:,:,ang,:)) , params, angles(ang),1);
My mex file is generally:
#include <math.h>
#include <mex.h>
#include <armadillo>
#include "armaMex.hpp"
using namespace std; //avoid having to scope with std:: before commands
using namespace arma; //avoid having to scope with std:: before commands
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]){
// ============== INITIALIZE =============
// Initialize Data
const mwSize *dims;
int cDim,dDim,aDim,numDims; // Dimension variables
int m, n, a; // Loop variables
mxArray *fs_p, *f0_p, *prf_p, *pval_p, *c_p; // Parameter pointers
const double *fs,*f0,*prf,*pval, *c, *ang; // Parameter variables
const int *nthreads;
// Initialize pointers for param variables
pval_p = mxGetField(prhs[1],0,"pval"); //note that your parameters need these exact names
fs_p = mxGetField(prhs[1],0,"fs");
f0_p = mxGetField(prhs[1],0,"f0");
prf_p = mxGetField(prhs[1],0,"prf");
c_p = mxGetField(prhs[1],0,"c");
// Initialize parameters
pval = mxGetPr(pval_p);
fs = mxGetPr(fs_p);
f0 = mxGetPr(pval_p);
prf = mxGetPr(prf_p);
c = mxGetPr(c_p);
ang = (double*)mxGetData(prhs[2]);
nthreads = (int*)mxGetData(prhs[3]);
dims = mxGetDimensions(prhs[0]);
numDims = (int)mxGetNumberOfDimensions(prhs[0]);
dDim=(int)dims[0];cDim=(int)dims[1];aDim=(int)dims[2];
//Read in channel Data
cube data_in = armaGetCubePr(prhs[0]);
(....... simple calculations that look okay ... )
cube data_out(dDim, bDim, aDim);
cube delayedData(dDim, aDim, bDim);
vec delayArray(dDim); //need to define these tmp variables bc subcube fcn otherwise gives me errors idk
vec tmpIN(dDim);
vec tmpOut(dDim);
vec tmpOUTdata(dDim);
for(m=0;m<bDim;m++){
for(n=0;n<cDim;n++){
for (a=0;a<aDim;a++){
delayArray = tdelays.subcube(0,n,m,dDim-1,n,m);
tmpIN = data_in.subcube(0,n,a,dDim-1,n,a);
tmpOUTdata = data_out.subcube(0,m,a,dDim-1,m,a);
interp1(timeArray, tmpIN , delayArray, tmpOut, "linear",0);
data_out.subcube(0,m,a,dDim-1,m,a) = tmpOUTdata +tmpOut;
}
}
}
// Define output data
plhs[0] = armaCreateMxMatrix(data_out.n_rows, data_out.n_cols, data_out.n_slices);
armaSetCubePr(plhs[0], data_out);
return
}
I've followed the MATLAB example for creating a mex file from here https://uk.mathworks.com/help/matlab/matlab_external/standalone-example.html
The source code it produces is as follows
#include "mex.h"
/* The computational routine */
void arrayProduct(double x, double *y, double *z, mwSize n)
{
mwSize i;
/* multiply each element y by x */
for (i=0; i<n; i++) {
z[i] = x * y[i];
}
}
/* The gateway function */
void mexFunction( int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
double multiplier; /* input scalar */
double *inMatrix; /* 1xN input matrix */
size_t ncols; /* size of matrix */
double *outMatrix; /* output matrix */
/* check for proper number of arguments */
if(nrhs!=2) {
mexErrMsgIdAndTxt("MyToolbox:arrayProduct:nrhs","Two inputs required.");
}
if(nlhs!=1) {
mexErrMsgIdAndTxt("MyToolbox:arrayProduct:nlhs","One output required.");
}
/* make sure the first input argument is scalar */
if( !mxIsDouble(prhs[0]) ||
mxIsComplex(prhs[0]) ||
mxGetNumberOfElements(prhs[0])!=1 ) {
mexErrMsgIdAndTxt("MyToolbox:arrayProduct:notScalar","Input multiplier must be a scalar.");
}
/* make sure the second input argument is type double */
if( !mxIsDouble(prhs[1]) ||
mxIsComplex(prhs[1])) {
mexErrMsgIdAndTxt("MyToolbox:arrayProduct:notDouble","Input matrix must be type double.");
}
/* check that number of rows in second input argument is 1 */
if(mxGetM(prhs[1])!=1) {
mexErrMsgIdAndTxt("MyToolbox:arrayProduct:notRowVector","Input must be a row vector.");
}
/* get the value of the scalar input */
multiplier = mxGetScalar(prhs[0]);
/* create a pointer to the real data in the input matrix */
inMatrix = mxGetPr(prhs[1]);
/* get dimensions of the input matrix */
ncols = mxGetN(prhs[1]);
/* create the output matrix */
plhs[0] = mxCreateDoubleMatrix(1,(mwSize)ncols,mxREAL);
/* get a pointer to the real data in the output matrix */
outMatrix = mxGetPr(plhs[0]);
/* call the computational routine */
arrayProduct(multiplier,inMatrix,outMatrix,(mwSize)ncols);
}
When I run the command mex arrayProduct.cpp (the name of my file), I get the following error:
Building with 'Microsoft Visual C++ 2017'.
Error using mex
LINK : error LNK2001: unresolved external symbol mexfilerequiredapiversion
arrayProduct.lib : fatal error LNK1120: 1 unresolved externals
I'm using MATLAB 2015b 32-bit, with the Visual Studio 2017 C++ compiler. Is there some preliminary set-up required for making mex files that isn't mentioned in the MATLAB tutorial?
The youngest supported compiler for MATLAB R2015b is MSVC Professional 2015. Also, R2015b is the latest version with 32-bit support. Your compiler is probably MSVC 2017, 64-bit.
Try to install .NET4 + SDK 7.1, select that in MATLAB, and re-run your mex command. That is an officially supported compiler for R2015b, and I expect that that solves your problem.
Note: for me, .NET4 refused to install because it detected a previously installed framework, but this answer resolved that problem for me.
I started implementing a few m-files in C++ in order to reduce run times. The m-files produce n-dimensional points and evaluate function values at these points. The functions are user-defined and they are passed to m-files and mex-files as function handles. The mex-files use mexCallMATLAB with feval for finding function values.
I constructed the below example where a function handle fn constructed in the Matlab command line is passed to matlabcallingmatlab.m and mexcallingmatlab.cpp routines. With a freshly opened Matlab, mexcallingmatlab evaluates this function 200000 in 241.5 seconds while matlabcallingmatlab evaluates it in 0.81522 seconds therefore a 296 times slow-down with the mex implementation. These times are the results of the second runs as the first runs seem to be larger probably due to some overhead associated first time loading the program etc.
I have spent many days searching online on this problem and tried some suggestions on it. I tried different mex compiling flags to optimize the mex but there was almost no difference in performance. A previous post in Stackoverflow stated that upgrading Matlab was the solution but I am using probably the latest version MATLAB Version: 8.1.0.604 (R2013a) on Mac OS X Version: 10.8.4. I did compile the mex file with and without –largeArrayDims flag but this didn’t make any difference either. Some suggested that the content of the function handle could be directly coded in the cpp file but this is impossible as I would like to provide this code to any user with any type of function with a vector input and real number output.
As far as I found out, mex files need to go through feval function for using a function handle whereas m-files can directly call function handles provided that Matlab version is newer than some version.
Any help would be greatly appreciated.
simple function handle created in the Matlab command line:
fn = #(x) x'*x
matlabcallingmatlab.m :
function matlabcallingmatlab( fn )
x = zeros(2,1);
for i = 0 : 199999
x(2) = i;
f = fn( x );
end
mexcallingmatlab.cpp:
#include "mex.h"
#include <cstring>
void mexFunction( int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[] )
{
mxArray *lhs[1], *rhs[2]; //parameters to be passed to feval
double f, *xptr, x[] = {0.0, 0.0}; // x: input to f and f=f(x)
int n = 2, nbytes = n * sizeof(double); // n: dimension of input x to f
// prhs[0] is the function handle as first argument to feval
rhs[0] = const_cast<mxArray *>( prhs[0] );
// rhs[1] contains input x to the function
rhs[1] = mxCreateDoubleMatrix( n, 1, mxREAL);
xptr = mxGetPr( rhs[1] );
for (int i = 0; i < 200000; ++i)
{
x[1] = double(i); // change input
memcpy( xptr, x, nbytes ); // now rhs[1] has new x
mexCallMATLAB(1, lhs, 2, rhs, "feval");
f = *mxGetPr( lhs[0] );
}
}
Compilation of mex file:
>> mex -v -largeArrayDims mexcallingmatlab.cpp
So I tried to implement this myself, and I think I found the reason for the slowness.
Basically your code have a small memory leak where you are not freeing the lhs mxArray returned from the call to mexCallMATLAB. It is not exactly a memory-leak, seeing that MATLAB memory manager takes care of freeing the memory when the MEX-file exits:
MATLAB allocates dynamic memory to store the mxArrays in plhs.
MATLAB automatically deallocates the dynamic memory when you clear the MEX-file.
However, if heap space is at a premium, call mxDestroyArray when
you are finished with the mxArrays plhs points to.
Still explicit is better than implicit... So your code is really stressing the deallocator of the MATLAB memory manager :)
mexcallingmatlab.cpp
#include "mex.h"
#ifndef N
#define N 100
#endif
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
// validate input/output arguments
if (nrhs != 1) {
mexErrMsgTxt("One input argument required.");
}
if (mxGetClassID(prhs[0]) != mxFUNCTION_CLASS) {
mexErrMsgTxt("Input must be a function handle.");
}
if (nlhs > 1) {
mexErrMsgTxt("Too many output arguments.");
}
// allocate output
plhs[0] = mxCreateDoubleMatrix(N, 1, mxREAL);
double *out = mxGetPr(plhs[0]);
// prepare for mexCallMATLAB: val = feval(#fh, zeros(2,1))
mxArray *lhs, *rhs[2];
rhs[0] = mxDuplicateArray(prhs[0]);
rhs[1] = mxCreateDoubleMatrix(2, 1, mxREAL);
double *xptr = mxGetPr(rhs[1]) + 1;
for (int i=0; i<N; ++i) {
*xptr = i;
mexCallMATLAB(1, &lhs, 2, rhs, "feval");
out[i] = *mxGetPr(lhs);
mxDestroyArray(lhs);
}
// cleanup
mxDestroyArray(rhs[0]);
mxDestroyArray(rhs[1]);
}
MATLAB
fh = #(x) x'*x;
N = 2e5;
% MATLAB
tic
out = zeros(N,1);
for i=0:N-1
out(i+1) = feval(fh, [0;i]);
end
toc
% MEX
mex('-largeArrayDims', sprintf('-DN=%d',N), 'mexcallingmatlab.cpp')
tic
out2 = mexcallingmatlab(fh);
toc
% check results
assert(isequal(out,out2))
Running the above benchmark a couple of times (to warm it up), I get the following consistent results:
Elapsed time is 0.732890 seconds. % pure MATLAB
Elapsed time is 1.621439 seconds. % MEX-file
No where near the slow times you initially had! Still the pure MATLAB part is about twice as fast, probably because of the overhead of calling an external MEX-function.
(My system: Win8 running 64-bit R2013a)
There's absolutely no reason to expect that a MEX file is, in general, faster than an M file. The only reason that this is often true is that many loops in MATLAB incur a lot of function call overhead, along with parameter checking and such. Rewriting that in C eliminates the overhead, and gives your C compiler a chance to optimize the code.
In this case, there's nothing for the C compiler to optimize... it MUST make the MATLAB interface call for every iteration. In fact, the MATLAB optimizer will do a better job, since it can, in some cases "see" into the function.
In other words, forget using MEX to speed up this program.
There is some overhead cost in calls from mex to Matlab and vice versa. The overhead per call is small, but it it really adds up in a tight loop like this. As your testing indicates, pure Matlab can be much faster in this case! Your other option is to eliminate the mexCallMATLAB call and do everything in pure C++.