How to properly compare results between Matlab/Octave and C++

I'm porting some code written in Matlab/Octave to C++. I only have Octave, so I will just say Octave from now on.
I want to properly compare the results of the Octave code and the C++ code. The algorithms I'm porting take a 2D matrix as input and output another 2D matrix.
To compare the results, I write the input matrix A from Octave using save A.mat A, with default options. This creates an ASCII file A.mat which starts like
# Created by Octave 3.8.1, Tue May 27 12:12:53 2014 CEST <remi#desktop>
# name: values
# type: matrix
# rows: 25
# columns: 5
43.0656 6.752420000000001 68.39323 35.75617 98.85446
...
I run the algorithm in Octave and save the output matrix B similarly.
In my C++ code, I load the matrices A and B using the following piece of code:
// I opened the file A.mat with std::ifstream infile(filename);
// and read the first lines starting with # to get the matrix dimensions.
std::string buffer;
double* matBuffer = new double[rows*cols];
double* cursor = matBuffer; // keep the base pointer so the matrix can still be used/freed later
while (std::getline(infile, buffer)) {
    std::istringstream iss(buffer);
    while (iss >> *cursor) { // parse the whitespace-separated values of one row
        ++cursor;
    }
}
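For reference, a minimal sketch of the header parsing alluded to in the comments above, assuming the format shown at the top of the file (names are illustrative, not the exact code used):
#include <fstream>
#include <sstream>
#include <string>

// Illustrative sketch: consume the '#' header lines and pull out the
// "# rows:" and "# columns:" fields of the Octave text format shown above.
void readHeader(std::ifstream& infile, int& rows, int& cols) {
    std::string line;
    while (infile.peek() == '#' && std::getline(infile, line)) {
        std::istringstream iss(line);
        std::string hash, key;
        iss >> hash >> key;                  // e.g. "#" then "rows:"
        if (key == "rows:")    iss >> rows;
        if (key == "columns:") iss >> cols;
    }
}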
Then I run the C++ code on the values read from A.mat and compare the results with the values read from B.mat by computing the mean squared error (MSE) over the coefficients of B read vs. B computed.
However, with such a design, can I expect the MSE to be 0 between the C++ and Octave code? Of course I do the computation in Octave and C++ on the same machine. But what about the loss of precision due to writing/reading the matrices to/from files? Also, I assume that the coefficients of Octave matrices are stored as doubles by default; is this correct?

Can I expect the MSE to be 0 between the C++ and Octave code?
I don't think so; with the many levels of conversion involved, some loss of precision is hard to avoid.
Also, I assume that the coefficients of Octave matrices are stored as doubles by default; is this correct?
Octave uses double precision for the internal representation of the values, but again there can be a loss of precision when the values are stored as ASCII.
I'd suggest you use a binary format for storing the values, which avoids the precision problems. You can go with the HDF5 format by using
save -hdf5 A.mat A
You can then use the HDF5 API to read the values in your C++ application.
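A minimal sketch of reading such a file with the HDF5 C API (the dataset path "A/value" is an assumption about how Octave lays out a variable named A in its HDF5 files; check yours with h5dump, and error handling is omitted):
#include <hdf5.h>
#include <vector>

// Sketch: read the matrix written by "save -hdf5 A.mat A".
// HDF5 is row-major while Octave is column-major, so the two
// dimensions may appear transposed relative to Octave.
std::vector<double> readMatrix(const char* filename) {
    hid_t file  = H5Fopen(filename, H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset  = H5Dopen(file, "A/value", H5P_DEFAULT);
    hid_t space = H5Dget_space(dset);
    hsize_t dims[2] = {0, 0};
    H5Sget_simple_extent_dims(space, dims, nullptr);
    std::vector<double> data(dims[0] * dims[1]);
    H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());
    H5Sclose(space);
    H5Dclose(dset);
    H5Fclose(file);
    return data;
}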

Related

Understanding tflite examples in documentation

I've been trying to figure out how to use tflite with C++ for a Raspberry Pi project, and I've had trouble understanding the documentation and how to perform inference. I've been confused about the code shown on this page:
https://www.tensorflow.org/lite/api_docs/cc/class/tflite/interpreter
For context, I am inexperienced in C++ and usually work with Python and Matlab.
Here are the lines of code I'm trying to understand:
auto input = interpreter->typed_tensor<float>(0);
for (int i = 0; i < input_size; i++) {
input[i] = ...; interpreter->Invoke();
My understanding of these lines is: they set input equal to the result of interpreter's typed_tensor function, called with (0) as the argument.
Then they loop over values of i between 0 and input_size. For each i value, they set input[i] equal to some value. They then call interpreter->Invoke(), which has the neural net perform inference, with input[i] as the input?
This is then repeated for each i value, each time calling interpreter->Invoke() with input[i] as the input to the neural net.
Is my understanding of this process correct?
What shape should the input take? For example, if I had a TensorFlow model that takes a 1x100 input and converted it to a tflite model, how would I create the input to feed into the tflite model in C++?
What data type should the input to the tflite model be in?
Thank you,
Simon
I've tried looking at example code of other people's uses of tflite in C++, but I haven't been able to examine the inputs to their models. I would also appreciate help with viewing the input tensors while debugging. I'm using Code::Blocks as an IDE for now.
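For reference, a minimal sketch of typical tflite inference in C++ for a model with a single 1x100 float input, based on the standard minimal example (the file name and sizes are placeholders):
#include <memory>
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
    // Load the flatbuffer model (path is a placeholder).
    auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);
    interpreter->AllocateTensors();

    // The input tensor is a flat float buffer; for a 1x100 model it holds 100 floats.
    float* input = interpreter->typed_input_tensor<float>(0);
    for (int i = 0; i < 100; i++) input[i] = 0.0f;  // fill with your data

    // Invoke() is called once, after the whole input tensor has been filled.
    interpreter->Invoke();

    // The output is read the same way after Invoke() returns.
    float* output = interpreter->typed_output_tensor<float>(0);
    (void)output;
    return 0;
}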

How to correctly format input and reshape output data while using a TensorRT engine?

I'm trying to run a deep learning model in the TensorRT runtime. The model conversion step went fine and I'm pretty sure about it.
There are two parts I'm currently struggling with: copying data from host to device (e.g. from OpenCV to TensorRT) and getting the right output shape in order to get the right data. So my questions are:
How does the shape of the input dims relate to the memory buffer? What is the difference when the model input dims are NCHW vs. NHWC? When I read an OpenCV image it is NHWC and the model input is also NHWC; do I have to rearrange the buffer data, and if yes, what is the actual consecutive memory layout I have to produce? Or simply, what format or sequence of data does the engine expect?
About the output (assuming the input is correctly buffered), how do I get the right result shape for each task (detection, classification, etc.)?
E.g. an array or something similar to what I get when working with Python.
I read the Nvidia docs and they are not beginner-friendly at all.
// Let's say I have a model that has a dynamic-shape input dim in NHWC format.
auto input_dims = nvinfer1::Dims4{1, 386, 342, 3}; //Using fixed H, W for testing
context->setBindingDimensions(input_idx, input_dims);
auto input_size = getMemorySize(input_dims, sizeof(float));
// How do I map an OpenCV Mat to this kind of dims, and if I encounter a new input dim format, how do I adapt to it?
And the expected output dims are something like (1,32,53,8), for example; the output buffer comes back as a raw pointer and I don't know the ordering of the data needed to reconstruct the expected array shape.
// Run TensorRT inference
void* bindings[] = {input_mem, output_mem};
bool status = context->enqueueV2(bindings, stream, nullptr);
if (!status)
{
std::cout << "[ERROR] TensorRT inference failed" << std::endl;
return false;
}
auto output_buffer = std::unique_ptr<int[]>{new int[output_size]};
if (cudaMemcpyAsync(output_buffer.get(), output_mem, output_size, cudaMemcpyDeviceToHost, stream) != cudaSuccess)
{
std::cout << "ERROR: CUDA memory copy of output failed, size = " << output_size << " bytes" << std::endl;
return false;
}
cudaStreamSynchronize(stream);
// How do I use this output_buffer to form the right output shape, (1,32,53,8) in this case?
Could you please edit your question and tell us which model you're using, if it's a commonly known NN, perhaps one we can download to test locally?
Now the answer, since it doesn't depend on the model (even though knowing it would help):
How does the shape of the input dims relate to the memory buffer?
If the input is NxCxHxW, you need to allocate N*C*H*W*sizeof(float) bytes for it on your CPU and GPU. To be more precise, you need to allocate space on the GPU for all the bindings, and on the CPU only for the input and output bindings.
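As an illustration (a sketch, not code from your project), the byte count can be computed directly from the nvinfer1::Dims and handed to cudaMalloc:
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Total number of elements described by a Dims object.
size_t volume(const nvinfer1::Dims& d) {
    size_t v = 1;
    for (int i = 0; i < d.nbDims; ++i) v *= d.d[i];
    return v;
}

// Allocate a device buffer for one binding, e.g. the 1x386x342x3 float input above.
void* allocBinding(const nvinfer1::Dims& dims) {
    void* mem = nullptr;
    cudaMalloc(&mem, volume(dims) * sizeof(float));
    return mem;
}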
When I read an OpenCV image, it's NHWC and the model input is also NHWC; do I have to rearrange the buffer data?
No, you do not have to rearrange the buffer data. If you did have to convert between NHWC and NCHW, you can check this or google 'opencv NHWC to NCHW'.
Full working code example here, especially this function.
Or simply, what format or sequence of data does the engine expect?
This depends on how the neural network was trained. You should in general know exactly which kind of preprocessing and image data formats were used to train the NN. You should even use the same libraries to load and process the images, if possible. It's an open problem in ML: if you try to replicate the results of some paper and use their models, but they haven't open-sourced the preprocessing, you might get worse results. In the "worst" case you can implement both NHWC and NCHW and test which of them works.
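If the NHWC-to-NCHW case ever comes up, a minimal repacking sketch (illustrative only, not the code behind the links above) looks like this:
#include <opencv2/core.hpp>
#include <vector>

// Repack an HxWxC interleaved OpenCV image into a CHW float buffer,
// which is what an NCHW engine input expects for a single image (N = 1).
std::vector<float> hwcToChw(const cv::Mat& img /* CV_32FC3, already preprocessed */) {
    const int H = img.rows, W = img.cols, C = img.channels();
    std::vector<float> chw(static_cast<size_t>(C) * H * W);
    for (int c = 0; c < C; ++c)
        for (int h = 0; h < H; ++h)
            for (int w = 0; w < W; ++w)
                chw[(c * H + h) * W + w] = img.ptr<float>(h)[w * C + c];
    return chw;
}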
About the output (assuming the input is correctly buffered), how do I get the right result shape for each task (detection, classification, etc.)? E.g. an array or something similar to what I get when working with Python.
This question really requires knowing which NNs you are referring to. But I myself do the following:
Load the TensorRT .engine file in my code like this and deserialize like this
Print the bindings like this
Then I know the size of the input binding or bindings if there are many inputs, and the size of the output binding or bindings if there are many outputs.
This way you know the right result shape for each task. I hope this answers your question. If not, please add detailed comments and edit your post to be more precise. Thank you.
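Once you know the output binding's dims, the flat buffer copied back to the host is simply the elements laid out with the last dimension varying fastest. A sketch of indexing a (1,32,53,8) float output under that assumption (not something specific to your model):
// Element (n, i, j, k) of a contiguous (N, D1, D2, D3) output lives at
// ((n * D1 + i) * D2 + j) * D3 + k.
inline float at(const float* buf, int n, int i, int j, int k,
                int D1, int D2, int D3) {
    return buf[((n * D1 + i) * D2 + j) * D3 + k];
}

// Example for a (1, 32, 53, 8) output copied back to the host:
// float v = at(host_output, 0, i, j, k, 32, 53, 8);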
I read the Nvidia docs and they are not beginner-friendly at all.
Yes, I agree. You're better off searching for TensorRT C++ (or Python) repositories on GitHub and studying their code. Have you seen the TensorRT samples? It doesn't really take many lines of code to implement TensorRT inference.

Extracting output data from typed_output_tensor in TFLite

Thanks in advance for your support.
I'm trying to get the output of a tensor after inference on a .tflite U-Net neural network. I'm using the TensorFlow Lite image classification code as a baseline.
I need to adapt the code for a segmentation task. My question is how I can access the output of the inferenced model (which is 128x128x1) and write the result into an image.
I have already debugged the code and explored many different approaches. Unfortunately, I'm not confident with the C++ language. What I found is that the call interpreter->typed_output_tensor<float>(0) should be what I need, as also referenced here: https://www.tensorflow.org/lite/guide/inference#loading_a_model. However, I cannot access the 128x128 tensor generated by the network.
You can find the code at the address: https://github.com/tensorflow/tensorflow/blob/770481fb3e9126f9a29db5667f528e450d54d719/tensorflow/lite/examples/label_image/label_image.cc
The interesting part is here (lines 217-224):
const float threshold = 0.001f;
std::vector<std::pair<float, int>> top_results;
int output = interpreter->outputs()[0];
TfLiteIntArray* output_dims = interpreter->tensor(output)->dims;
// assume output dims to be something like (1, 1, ... ,size)
auto output_size = output_dims->data[output_dims->size - 1];
I expect the values to be saved in an image, or an alternative way of saving the output tensor.
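A minimal sketch of one way to do this: read the 128x128x1 float output through typed_output_tensor<float>(0) and write it out as a grayscale PGM image (treating the output as values in [0, 1] is an assumption about the U-Net):
#include <algorithm>
#include <fstream>
#include "tensorflow/lite/interpreter.h"

// Sketch: dump a 128x128x1 float output (assumed in [0, 1]) as a binary PGM image.
void saveMask(tflite::Interpreter* interpreter, const char* path) {
    const int H = 128, W = 128;
    const float* out = interpreter->typed_output_tensor<float>(0);
    std::ofstream f(path, std::ios::binary);
    f << "P5\n" << W << " " << H << "\n255\n";
    for (int i = 0; i < H * W; ++i) {
        float v = std::min(1.0f, std::max(0.0f, out[i]));
        f.put(static_cast<char>(v * 255.0f));
    }
}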

CUDA: least-squares solving, poor in speed

Recently I used CUDA to write an algorithm called 'orthogonal matching pursuit'. In my ugly CUDA code the entire iteration takes 60 s, while the Eigen lib takes just 3 s...
In my code, matrix A is [640,1024] and y is [640,1]. In each step I select some vectors from A to compose a new matrix called A_temp [640,itera], itera=1:500. I allocate an array MaxDex_Host[] on the CPU to tell which column to select.
I want to get x_temp[itera,1] from A_temp*x_temp=y using least squares; I use the CULA API 'culaDeviceSgels' and the cuBLAS matrix-vector multiplication API.
So culaDeviceSgels is called 500 times, and I thought this would be faster than the Eigen lib's QR solver.
I checked the Nsight performance analysis and found that cuStreamDestroy takes a long time. I initialize cuBLAS before the iteration and destroy it after I get the result. So I want to know: what is cuStreamDestroy, and how is it different from cublasDestroy?
The main problem is the memcpys and the function 'gemm_kernel1x1val'; I think this function comes from 'culaDeviceSgels'.
while (itera < 500): I use cublasSgemv and cublasIsamax to get MaxDex_Host[itera], then
MaxDex_Host[itera]=pos;
itera++;
float* A_temp_cpu=new float[M*itera]; // matrices are all in column-major order
for (int j=0;j<itera;j++) // build A_temp [M,itera]; MaxDex_Host[] holds the positions of the columns of A to choose
{
    for (int i=0;i<M;i++) // M=640, A is 640*1024, itera grows by 1 each step
    {
        A_temp_cpu[j*M+i]=A[MaxDex_Host[j]*M+i];
    }
}
// I must allocate one more array because culaDeviceSgels overwrites its input array,
// and I want to use A_temp after the least-squares solve.
float* A_temp_gpu;
float* A_temp2_gpu;
cudaMalloc((void**)&A_temp_gpu,Size_float*M*itera);
cudaMalloc((void**)&A_temp2_gpu,Size_float*M*itera);
cudaMemcpy(A_temp_gpu,A_temp_cpu,Size_float*M*itera,cudaMemcpyHostToDevice);
cudaMemcpy(A_temp2_gpu,A_temp_gpu,Size_float*M*itera,cudaMemcpyDeviceToDevice);
culaDeviceSgels('N',M,itera,1,A_temp_gpu,M,y_Gpu_temp,M); // the x_temp I want is returned in y_Gpu_temp, stored in y_Gpu_temp[0]..y_Gpu_temp[itera-1]
float* x_temp;
cudaMalloc((void**)&x_temp,Size_float*itera);
cudaMemcpy(x_temp,y_Gpu_temp,Size_float*itera,cudaMemcpyDeviceToDevice);
CUDA's memory management seems too complex; is there any more convenient way to solve the least-squares problem?
I think that cuStreamDestroy and gemm_kernel1x1val are internally called by the APIs you are using, so there is not much you can do about them.
To improve your code, I would suggest the following.
You can get rid of A_temp_cpu by keeping a device copy of the matrix A. Then you can copy the selected columns of A into the columns of A_temp_gpu and A_temp2_gpu with a kernel assignment (see the sketch after these suggestions). This would avoid performing the first two cudaMemcpys.
You can preallocate A_temp_gpu and A_temp2_gpu outside the while loop by using the maximum possible value of itera instead of the current itera. This will avoid the first two cudaMallocs inside the loop. The same applies to x_temp.
As far as I know, culaDeviceSgels solves a linear system of equations. I think you could do the same using cuBLAS APIs only. For example, you could perform an LU factorization first with cublasSgetrfBatched() and then use cublasStrsv() twice to solve the two arising triangular systems. You may wish to see whether this solution leads to a faster algorithm.
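A minimal sketch of the column-gather kernel suggested in the first point (assuming A is kept on the device in column-major order; names are illustrative):
// Illustrative kernel: build A_temp (M x itera, column-major) on the device from
// the columns of A selected by colIdx, so no host-side staging copy is needed.
__global__ void gatherColumns(const float* A, const int* colIdx,
                              float* A_temp, int M, int itera)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < M * itera) {
        int j = tid / M;   // destination column in A_temp
        int i = tid % M;   // row
        A_temp[j * M + i] = A[colIdx[j] * M + i];
    }
}

// Launch example (colIdx_gpu is a device copy of MaxDex_Host):
// int threads = 256, blocks = (M * itera + threads - 1) / threads;
// gatherColumns<<<blocks, threads>>>(A_gpu, colIdx_gpu, A_temp_gpu, M, itera);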

Moving a matrix from C++ to Matlab

I'm trying to take a matrix from C++ and import it into Matlab to run bintprog on it; call this matrix m. My C++ code generates these matrices of a certain type, and I need to run bintprog on them quickly, ideally on millions of matrices.
So any of the following would be great:
A way to import a bunch of matrices at once so I can run a lot of iterations through my Matlab code.
Or
A way to call Matlab code right from C++ nicely.
If this is not clear, leave me comments and I'll update what I can.
You can call Matlab commands from C++ code (and vice versa):
Compile your C++ code into a mex function and call bintprog using mexCallMATLAB.
As proposed by Mark, you may call the Matlab engine from native C++ code using the Matlab Engine API (a sketch follows below).
You may compile your C++ code as a shared library and call it from Matlab using calllib.
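A minimal sketch of the second option using the classic C Engine API (the matrix contents and the bintprog arguments are placeholders):
#include <cstring>
#include "engine.h"  // MATLAB Engine API; link against libeng and libmx

int main() {
    const int rows = 3, cols = 3;
    double data[rows * cols] = {0};          // your matrix, column-major

    Engine* ep = engOpen("");                // start a MATLAB session
    if (!ep) return 1;

    mxArray* m = mxCreateDoubleMatrix(rows, cols, mxREAL);
    std::memcpy(mxGetPr(m), data, sizeof(data));  // copy the C++ data into the mxArray

    engPutVariable(ep, "m", m);              // 'm' now exists in the MATLAB workspace
    engEvalString(ep, "x = bintprog(f, m, b);");  // placeholder call; supply f and b yourself

    mxDestroyArray(m);
    engClose(ep);
    return 0;
}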
I suggest a simple solution, assuming that your matrices are kept in a 3-dimensional array:
Build a loop in C++ to save your matrices. Something like this:
#include <fstream>

std::ofstream arquivoOut0("myMatrices.dat");
for (int m = 0; m < numberMatrices; m++) {
    for (int i = 0; i < numberLines; i++) {
        for (int j = 0; j < numberColumns; j++)
            if (j != numberColumns-1) arquivoOut0 << matrices[m][i][j] << "\t";
            else arquivoOut0 << matrices[m][i][j] << "\n";
    }
}
arquivoOut0.close();
OK, you have saved your matrices in an ASCII file. Now you have to read it in Matlab:
load myMatrices.dat
for m = 1:numberMatrices
    for i = 1:numberLines
        for j = 1:numberColumns
            myMatricesInMatlab(m,i,j) = myMatrices((m-1)*numberLines+i, j);
        end
    end
end
Now you can use the toolbox that you need:
for i = 1:numberMatrices
    % apply the toolbox you need to squeeze(myMatricesInMatlab(i,:,:))
end
I think it works, if processing time is not an issue!