CUDA access matrix stored in RAM and possibility of being implemented - c++

Recently I started working with numerical computation and solving mathematical problems numerically, programming in C++ with OpenMP. But now my problem is too big and takes days to solve, even parallelized. So I'm thinking of learning CUDA to reduce the time, but I have some doubts.
The heart of my code is the following function. The inputs are two pointers to vectors. N_mesh_points_x,y,z are predefined integers, weights_x,y,z are column matrices, kern_1 is an exponential function, and table_kernel is a function that accesses a precomputed 50 GB matrix stored in RAM.
void Kernel::paralel_iterate(std::vector<double>* K1, std::vector<double>* K2 )
{
double r, sum_1 = 0 , sum_2 = 0;
double phir;
for (int l = 0; l < N_mesh_points_x; l++){
for (int m = 0; m < N_mesh_points_y; m++){
for (int p = 0; p < N_mesh_points_z; p++){
sum_1 = 0;
sum_2 = 0;
#pragma omp parallel for schedule(dynamic) private(phir) reduction(+: sum_1,sum_2)
for (int i = 0; i < N_mesh_points_x; i++){
for (int j = 0; j < N_mesh_points_y; j++){
for (int k = 0; k < N_mesh_points_z; k++){
if (!(i==l) || !(j==m) || !(k==p)){
phir = weights_x[i]*weights_y[j]*weights_z[k]*kern_1(i,j,k,l,m,p);
sum_1 += phir * (*K1)[position(i,j,k)];
sum_2 += phir;
}
}
}
}
(*K2)[ position(l,m,p)] = sum_1 + (table_kernel[position(l,m,p)] - sum_2) * (*K1)[position (l,m,p)];
}
}
}
return;
}
My questions are:
Can I program at least the central part of this function in CUDA? I only parallelized the inner loops with OpenMP because I got the wrong answer when I parallelized all the loops.
The function table_kernel accesses a big matrix that is too big to fit in the memory of my video card, so the data will stay in RAM. Is this a problem? Can CUDA easily access data in RAM? Or is that impossible, so that everything needed must be stored on the video card?

Can I program at least the central part of this function in CUDA? I only parallelized the inner loops with OpenMP because I got the wrong answer when I parallelized all the loops.
Yes, you should be able to program the portion that you currently have in the OpenMP scope, as a CUDA kernel.
The function table_kernel accesses a big matrix that is too big to fit in the memory of my video card, so the data will stay in RAM. Is this a problem? Can CUDA easily access data in RAM? Or is that impossible, so that everything needed must be stored on the video card?
Since you only access this outside the OpenMP scope, if you only use a CUDA kernel for the work that you are currently doing with OpenMP, it should not be necessary to access table_kernel from the GPU, and therefore this should not be an issue. If you attempt to add additional loops to be parallelized on the GPU, then this may become an issue. Since the access would be relatively infrequent (compared to the processing going on in the inner loops), if you wanted to pursue this, you could try making the table_kernel data available to the GPU via cudaHostAlloc - basically mapping host memory in the GPU address space. This normally is a significant performance hazard, but if you make infrequent accesses to it as mentioned, it may or may not be a serious performance issue.
Note that you won't be able to use or access std::vector in device code, so those types of data containers would probably have to be realized as ordinary double arrays.
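As a rough sketch of that last point, the inner triple loop can first be rewritten in plain C++ against raw double pointers (obtained with std::vector::data()), which is roughly the shape a CUDA kernel body would take as well. The names InnerSums and inner_sums are hypothetical, position() is assumed here to be a row-major index because the original is not shown, and kern_1 is replaced by a caller-supplied callable for the same reason.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: the two accumulators from the original loop.
struct InnerSums {
    double sum_1;
    double sum_2;
};

// Assumed row-major layout; the original position() is not shown.
inline std::size_t position(int i, int j, int k, int ny, int nz) {
    return (static_cast<std::size_t>(i) * ny + j) * nz + k;
}

// Inner reduction for one target point (l, m, p), using only raw arrays.
// `kern` stands in for kern_1, whose definition is not given in the question.
template <typename Kern>
InnerSums inner_sums(const double* K1, const double* wx, const double* wy,
                     const double* wz, int nx, int ny, int nz,
                     int l, int m, int p, Kern kern) {
    assert(K1 && wx && wy && wz);
    InnerSums s{0.0, 0.0};
    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < ny; ++j)
            for (int k = 0; k < nz; ++k) {
                if (i == l && j == m && k == p) continue;  // skip the diagonal term
                const double phir = wx[i] * wy[j] * wz[k] * kern(i, j, k, l, m, p);
                s.sum_1 += phir * K1[position(i, j, k, ny, nz)];
                s.sum_2 += phir;
            }
    return s;
}
```

With the vectors from the question, the call would look something like inner_sums(K1->data(), weights_x.data(), ...), assuming the weights are also held in contiguous containers.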

Related

Advantages on flattening to 1D

I have questions about the flattening operation I see on forums. People often recommend flattening a multi-dimensional vector, or array to a single dimension one.
For example:
int height = 10;
int width = 10;
std::vector<int> grid;
for(int i = 0; i < height; i++){
for(int j = 0; j < width; j++){
grid.push_back(rand() % (i + 1) + j); // i + 1 avoids a modulo by zero when i == 0
}
}
std::vector<std::vector<int>> another_grid;
for(int i = 0; i < height; i++){
std::vector<int> row;
for(int j = 0; j < width; j++){
row.push_back(rand() % (i + 1) + j); // i + 1 avoids a modulo by zero when i == 0
}
another_grid.push_back(row);
}
I can guess that a single vector uses less memory than many, but what about a multidimensional array of int? Are there real advantages to flattening multidimensional data structures?
I can think of multiple reasons to do this, in no particular order and there might be more that I missed:
Slightly less memory use: each vector takes 24 bytes*, if you have 1000 rows, it's 24K more memory. Not that important, but it's there.
Fewer allocations: Again, not very important, but allocations can be slow, and if this is happening for instance in real time and you're allocating buffers for images coming from a camera, having 1 allocation is better than potentially thousands.
Locality: This is the most important one, with a single allocation, all the data is going to be very close to each other, so accessing nearby data will be much faster either because it's already in the cache, or the prefetching hardware can accurately pull the next cache line.
Easier serialization/deserialization: For instance, if this is a texture data, it can be passed to a GPU with a single copy. Same applies for writing to a disk or network, though you may want some compression with those.
The downside is it's less comfortable to write and use, but with a proper class abstracting this away, it's pretty much a must-have if performance matters. It may also be less efficient for certain operations. For instance, with the vector<vector<>> version, you can swap entire rows with a single pointer swap, and the single vector version needs to copy a bunch of data around.
*: This depends on your implementation, but on 64-bit platforms, this is common.
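A minimal sketch of the "proper class abstracting this away" mentioned above, assuming row-major storage. The class name Grid2D and its members are hypothetical, not taken from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper: a 2D grid stored in one contiguous, row-major vector.
template <typename T>
class Grid2D {
public:
    Grid2D(std::size_t height, std::size_t width, T init = T{})
        : height_(height), width_(width), data_(height * width, init) {}

    // Element access by (row, column); the flat index is row * width + column.
    T& operator()(std::size_t row, std::size_t col) {
        assert(row < height_ && col < width_);
        return data_[row * width_ + col];
    }
    const T& operator()(std::size_t row, std::size_t col) const {
        assert(row < height_ && col < width_);
        return data_[row * width_ + col];
    }

    std::size_t height() const { return height_; }
    std::size_t width() const { return width_; }

    // Pointer to the single underlying buffer, e.g. for one bulk copy.
    const T* data() const { return data_.data(); }

private:
    std::size_t height_;
    std::size_t width_;
    std::vector<T> data_;
};
```

Usage then reads almost like the nested version: Grid2D<int> grid(10, 10); grid(3, 4) = 42; while the whole grid still lives in a single allocation.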

How Can I make it faster in c++11 with std::vector?

I have cv::Mat Mat_A and cv::Mat Mat_B, both (800000 x 512) floats,
and the code below looks slow.
int rows = Mat_B.rows;
cv::Mat Mat_A = cv::repeat(img, rows, 1);
Mat_A = Mat_A - Mat_B;
cv::pow(Mat_A, 2, Mat_A);
cv::reduce(Mat_A, Mat_A, 1, CV_REDUCE_SUM);
cv::minMaxLoc(Mat_A, &dis, 0, &point, 0);
How can I do this with std::vector?
I think it should be faster.
On my 2.4 GHz MacBook Pro it took 4 seconds, which is very slow.
I don't think you should use std::vector to do these operations. Image processing (CV aka Computer Vision) algorithms tend to be quite computationally heavy because there is so much data to deal with. OpenCV 2.0 C++ is highly optimized for this kind of operations, e.g. cv::Mat has a header and whenever a cv::Mat is copied with copy assignment or constructor, only the headers are copied with a pointer to the data. They use reference counting to keep track of instances. So memory management is done for you, and that's a good thing.
https://docs.opencv.org/2.4/doc/tutorials/core/mat_the_basic_image_container/mat_the_basic_image_container.html
You could try to compile without debug symbols, i.e. release vs debug. You can also try to compile with optimization flags, e.g. -O3 for gcc, which usually speeds up the generated code (often at the cost of a somewhat larger binary). It might make a difference.
https://www.rapidtables.com/code/linux/gcc/gcc-o.html
Another thing you could try is to give your process a higher priority: the higher the priority, the less often the process yields the CPU. Again, that might not make much difference; it all depends on the other processes and their priorities.
https://superuser.com/questions/42817/is-there-any-way-to-set-the-priority-of-a-process-in-mac-os-x
I hope that helps a bit.
Well, your thinking is wrong.
Why your program is slow:
Your CPU has to loop through a lot of numbers and do a calculation on each one, so the amount of work is large. That's why it's slow. Your program's run time is proportional to the size of Mat_A and Mat_B. You can check this by reducing or increasing their size.
Can we accelerate it with std::vector?
Sorry, but no. Using std::vector will not reduce the computational complexity. OpenCV's arithmetic routines are already heavily optimized, so rewriting them yourself will most likely produce slower code.
How to accelerate the calculation: you need to enable the acceleration options for OpenCV.
You can see them at: https://github.com/opencv/opencv/wiki/CPU-optimizations-build-options . Intel provides the Intel MKL library to accelerate matrix calculations. You could try it first.
Personally, I think the easiest approach is to use the GPU. But your machine doesn't have a GPU, so that's out of scope here.
You keep iterating over the data over and over again to do independent operations on them.
Something like this iterates only once over the data.
// assumes Mat_B and img are cv::Mat; needs #include <limits>
using px_t = float; // you mentioned float, so I'll assume both img and Mat_B use floats
int rows = Mat_B.rows;
cv::Mat output(1, rows, Mat_B.type());
auto output_ptr = output.ptr<px_t>(0);
auto img_ptr = img.ptr<px_t>(0);
int min_idx = 0;
int max_idx = 0;
px_t min_ele = std::numeric_limits<px_t>::max();
px_t max_ele = std::numeric_limits<px_t>::lowest(); // min() is the smallest positive float, not the most negative
for (int i = 0; i < rows; ++i)
{
output_ptr[i] = 0; // cv::Mat has no operator[], so go through the row pointer
auto mat_row = Mat_B.ptr<px_t>(i);
for (int j = 0; j < Mat_B.cols; ++j)
{
// accumulate the squared difference between img and row i of Mat_B
px_t diff = img_ptr[j] - mat_row[j];
output_ptr[i] += diff * diff;
}
if (output_ptr[i] < min_ele)
{
min_idx = i;
min_ele = output_ptr[i];
}
if (output_ptr[i] > max_ele)
{
max_idx = i;
max_ele = output_ptr[i];
}
}
While I am also not sure whether it is faster, you can do this, assuming Mat_B contains uchar:
std::vector<uchar> array_B;
if (Mat_B.isContinuous())
array_B.assign(Mat_B.data, Mat_B.data + Mat_B.total() * Mat_B.channels()); // copy the raw bytes

Slow Matlab R2018b TypedArray array access in C++

I am using MATLAB R2018b mex functions to integrate a C++ library with my MATLAB code. As part of that, I need to take data in a MATLAB array and save into a C++ pointer array and a C++ vector of structures. However, mapping the matlab typed array is proving to be very slow (~0.4 seconds for ~800,000 elements).
here is the relevant code
const matlab::data::TypedArray<float> Vertices = std::move(inputs[0]);
float* positions = new float[Vertices.getNumberOfElements()];
for (size_t i = 0; i < Vertices.getDimensions()[0]; i ++)
{
ctr = 9 * i;
positions[ctr + 0] = Vertices[i][0];
positions[ctr + 1] = Vertices[i][1];
positions[ctr + 2] = Vertices[i][2];
}
What is causing this loop to be slow? I tried reordering the array accesses for Vertices to make the code more cache friendly, but that didn't produce a meaningful speed-up. Right now, the loop takes ~0.4 seconds for 800,000 elements; ideally a memory copy should take far less time, right?
When I looked over previous advice, I found that most answers use older mex functions, where the new(?) MATLAB C++ API doesn't have the same functions or structure.
Edit:
I followed Cris' advice and used a loop over iterators, which cut the run time by more than half, to 0.14 seconds.
The new code I'm using is:
const matlab::data::TypedArray<float> Vertices = std::move(inputs[0]);
float* positions = new float[Vertices.getNumberOfElements()];
size_t ctr = 0;
for (auto it = Vertices.begin(); it != Vertices.end(); ++it)
{
positions[ctr] = *it;
++ctr;
}
So it is faster, but still surprisingly slow (0.14 seconds for 800,000 elements). Is there any other way to speed this loop?
I got a major speedup by applying Cris' advice and using the following code:
const matlab::data::TypedArray<float> Vertices = std::move(inputs[0]);
float* positions = new float[Vertices.getNumberOfElements()];
memcpy(positions, &*Vertices.begin(), sizeof(float) * Vertices.getNumberOfElements());
Runtime went from 0.14 seconds (using standard Visual Studio optimization) to 0.0035 seconds, which is acceptably fast for my application.
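A related sketch in standard C++ only, without the MATLAB headers: letting a std::vector own the destination avoids the manual new[]/delete[], while the copy itself is still a single memcpy. The helper name copy_contiguous is hypothetical, and the source pointer stands in for whatever contiguous buffer the array exposes; that the array's storage is contiguous is an assumption the memcpy version above relies on as well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper: bulk-copy `count` elements from a contiguous buffer.
template <typename T>
std::vector<T> copy_contiguous(const T* src, std::size_t count) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    assert(src != nullptr || count == 0);
    std::vector<T> dst(count);
    if (count != 0) {
        std::memcpy(dst.data(), src, count * sizeof(T));  // one bulk copy
    }
    return dst;
}
```

The returned vector can be handed to C-style interfaces through data(), and its memory is released automatically when it goes out of scope.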

OpenMP doesn't run this for loop in different threads, how can I fix it

I have this code:
#pragma omp parallel for
for( i=0;i<(int)table.size();i++)
{
Vec3b bgrPixel;
TableElement element=table[i];
bgrPixel = inputImage.at<Vec3b>(element.InputPixel.y,element.InputPixel.x);
outputImage.at<Vec4b>(element.OutputPixel.y,element.OutputPixel.x)[0] = bgrPixel[0];
outputImage.at<Vec4b>(element.OutputPixel.y,element.OutputPixel.x)[1] = bgrPixel[1];
outputImage.at<Vec4b>(element.OutputPixel.y,element.OutputPixel.x)[2] = bgrPixel[2];
outputImage.at<Vec4b>(element.OutputPixel.y,element.OutputPixel.x)[3] = 255;
}
When I run it, I can see that only 25% of my processor's power is used. I believe it is not running in parallel. Why is it not running in parallel, and how can I improve its performance?
Images are OpenCV Mat objects.
It was suggested in the comments that Mat::at might do some kind of locking. I checked the OpenCV source, and this doesn't seem to be the case. One version of Mat::at is reproduced below:
template<typename _Tp> inline _Tp& Mat::at(int i0, int i1)
{
CV_DbgAssert( dims <= 2 && data && (unsigned)i0 < (unsigned)size.p[0] &&
(unsigned)(i1*DataType<_Tp>::channels) < (unsigned)(size.p[1]*channels()) &&
CV_ELEM_SIZE1(DataType<_Tp>::depth) == elemSize1());
return ((_Tp*)(data + step.p[0]*i0))[i1];
}
It would seem to me that, as suggested in the comments, the reason for the low CPU usage is most probably that your code doesn't give the CPU much to do. As the code is just simple memory assignments, it is probably memory bound, not CPU bound. My suggestion is that you look into not copying the data (the OpenCV Mat format is very flexible about pointing to the same data with two matrices by just creating a new header). If the InputPixel and OutputPixel values are uncorrelated or have a complex correlation, you probably need to resign yourself to the fact that accessing memory randomly (i.e. lots of cache misses) is going to take some time.

FFTW vs Matlab FFT

I posted this on matlab central but didn't get any responses so I figured I'd repost here.
I recently wrote a simple routine in Matlab that uses an FFT in a for-loop; the FFT dominates the calculations. I wrote the same routine in mex just for experimentation purposes, and it calls the FFTW 3.3 library. It turns out that the Matlab routine runs faster than the mex routine for very large arrays (about twice as fast). The mex routine uses wisdom and performs the same FFT calculations. I also know Matlab uses FFTW, but is it possible their version is slightly more optimized? I even used the FFTW_EXHAUSTIVE flag and it's still about half as fast for large arrays as the Matlab counterpart. Furthermore, I ensured the Matlab I used was single threaded with the "-singleCompThread" flag and the mex file I used was not in debug mode. Just curious if this was the case, or if there are some optimizations Matlab is using under the hood that I don't know about. Thanks.
Here's the mex portion:
void class_cg_toeplitz::analysis() {
// This method computes CG iterations using FFTs
// Check for wisdom
if(fftw_import_wisdom_from_filename("cd.wis") == 0) {
mexPrintf("wisdom not loaded.\n");
} else {
mexPrintf("wisdom loaded.\n");
}
// Set FFTW Plan - use interleaved FFTW
fftw_plan plan_forward_d_buffer;
fftw_plan plan_forward_A_vec;
fftw_plan plan_backward_Ad_buffer;
fftw_complex *A_vec_fft;
fftw_complex *d_buffer_fft;
A_vec_fft = fftw_alloc_complex(n);
d_buffer_fft = fftw_alloc_complex(n);
// CREATE MASTER PLAN - Do this on an empty vector as creating a plan
// with FFTW_MEASURE will erase the contents;
// Use d_buffer
// This is somewhat dangerous because Ad_buffer is a vector; but it does not
// get resized so &Ad_buffer[0] should work
plan_forward_d_buffer = fftw_plan_dft_r2c_1d(d_buffer.size(),&d_buffer[0],d_buffer_fft,FFTW_EXHAUSTIVE);
plan_forward_A_vec = fftw_plan_dft_r2c_1d(A_vec.height,A_vec.value,A_vec_fft,FFTW_WISDOM_ONLY);
// A_vec_fft.*d_buffer_fft will overwrite d_buffer_fft
plan_backward_Ad_buffer = fftw_plan_dft_c2r_1d(Ad_buffer.size(),d_buffer_fft,&Ad_buffer[0],FFTW_EXHAUSTIVE);
// Get A_vec_fft
fftw_execute(plan_forward_A_vec);
// Find initial direction - this is the initial residual
for (int i=0;i<n;i++) {
d_buffer[i] = b.value[i];
r_buffer[i] = b.value[i];
}
// Start CG iterations
norm_ro = norm(r_buffer);
double fft_reduction = (double)Ad_buffer.size(); // Must divide by size of vector because inverse FFT does not do this
while (norm(r_buffer)/norm_ro > relativeresidual_cutoff) {
// Find Ad - use fft
fftw_execute(plan_forward_d_buffer);
// Get A_vec_fft.*fft(d) - A_vec_fft is only real, but d_buffer_fft
// has complex elements; Overwrite d_buffer_fft
for (int i=0;i<n;i++) {
d_buffer_fft[i][0] = d_buffer_fft[i][0]*A_vec_fft[i][0]/fft_reduction;
d_buffer_fft[i][1] = d_buffer_fft[i][1]*A_vec_fft[i][0]/fft_reduction;
}
fftw_execute(plan_backward_Ad_buffer);
// Calculate r'*r
rtr_buffer = 0;
for (int i=0;i<n;i++) {
rtr_buffer = rtr_buffer + r_buffer[i]*r_buffer[i];
}
// Calculate alpha
alpha = 0;
for (int i=0;i<n;i++) {
alpha = alpha + d_buffer[i]*Ad_buffer[i];
}
alpha = rtr_buffer/alpha;
// Calculate new x
for (int i=0;i<n;i++) {
x[i] = x[i] + alpha*d_buffer[i];
}
// Calculate new residual
for (int i=0;i<n;i++) {
r_buffer[i] = r_buffer[i] - alpha*Ad_buffer[i];
}
// Calculate beta
beta = 0;
for (int i=0;i<n;i++) {
beta = beta + r_buffer[i]*r_buffer[i];
}
beta = beta/rtr_buffer;
// Calculate new direction vector
for (int i=0;i<n;i++) {
d_buffer[i] = r_buffer[i] + beta*d_buffer[i];
}
*total_counter = *total_counter+1;
if(*total_counter >= iteration_cutoff) {
// Set total_counter to -1, this indicates failure
*total_counter = -1;
break;
}
}
// Store Wisdom
fftw_export_wisdom_to_filename("cd.wis");
// Free fft alloc'd memory and plans
fftw_destroy_plan(plan_forward_d_buffer);
fftw_destroy_plan(plan_forward_A_vec);
fftw_destroy_plan(plan_backward_Ad_buffer);
fftw_free(A_vec_fft);
fftw_free(d_buffer_fft);
};
Here's the matlab portion:
% Take FFT of A_vec.
A_vec_fft = fft(A_vec); % Take fft once
% Find initial direction - this is the initial residual
x = zeros(n,1); % search direction
r = zeros(n,1); % residual
d = zeros(n+(n-2),1); % search direction; pad to allow FFT
for i = 1:n
d(i) = b(i);
r(i) = b(i);
end
% Enter CG iterations
total_counter = 0;
rtr_buffer = 0;
alpha = 0;
beta = 0;
Ad_buffer = zeros(n+(n-2),1); % This holds the product of A*d - calculate this once per iteration and using FFT; only 1:n is used
norm_ro = norm(r);
while(norm(r)/norm_ro > 10^-6)
% Find Ad - use fft
Ad_buffer = ifft(A_vec_fft.*fft(d));
% Calculate rtr_buffer
rtr_buffer = r'*r;
% Calculate alpha
alpha = rtr_buffer/(d(1:n)'*Ad_buffer(1:n));
% Calculate new x
x = x + alpha*d(1:n);
% Calculate new residual
r = r - alpha*Ad_buffer(1:n);
% Calculate beta
beta = r'*r/(rtr_buffer);
% Calculate new direction vector
d(1:n) = r + beta*d(1:n);
% Update counter
total_counter = total_counter+1;
end
In terms of time, for N = 50000 and b = 1:n it takes about 10.5 seconds with mex and 4.4 seconds with matlab. I'm using R2011b. Thanks
A few observations rather than a definite answer since I do not know any of the specifics of the MATLAB FFT implementation:
Based on the code you have, I can see two explanations for the speed difference:
the speed difference is explained by differences in levels of optimization of the FFT
the while loop in MATLAB is executed a significantly smaller number of times
I will assume you already looked into the second issue and that the numbers of iterations are comparable. (If they aren't, this is most likely due to some accuracy issue and worth further investigation.)
Now, regarding FFT speed comparison:
Yes, the theory is that FFTW is faster than other high-level FFT implementations, but that is only relevant as long as you compare apples to apples. Here you are comparing implementations one level further down, at the assembly level, where not only the choice of algorithm but also its actual optimization for a specific processor, by software developers with varying skills, comes into play.
I have optimized or reviewed optimized FFTs in assembly on many processors over the years (I was in the benchmarking industry), and great algorithms are only part of the story. There are considerations that are very specific to the architecture you are coding for (accounting for latencies, scheduling of instructions, optimization of register usage, arrangement of data in memory, accounting for branch taken/not taken latencies, etc.) and that make differences as important as the selection of the algorithm.
With N=50000, we are also talking about large memory buffers: yet another door for more optimizations that can quickly get pretty specific to the platform you run your code on. How well you manage to avoid cache misses won't be dictated by the algorithm so much as by how the data flows and what optimizations a software developer may have used to bring data in and out of memory efficiently.
Though I do not know the details of the MATLAB FFT implementation, I am pretty sure that an army of DSP engineers has been (and still is) honing its optimization, as it is key to so many designs. This could very well mean that MATLAB had the right combination of developers to produce a much faster FFT.
This is classic performance gain thanks to low-level and architecture-specific optimization.
Matlab uses the FFT from the Intel MKL (Math Kernel Library) binary (mkl.dll). These are routines optimized (at assembly level) by Intel for Intel processors. Even on AMD processors it seems to give nice performance boosts.
FFTW seems like a normal C library that is not as optimized. Hence the performance gain from using the MKL.
I have found the following comment on the MathWorks website [1]:
Note on large powers of 2: For FFT dimensions that are powers of
2, between 2^14 and 2^22, MATLAB software uses special preloaded
information in its internal database to optimize the FFT computation.
No tuning is performed when the dimension of the FFT is a power of 2,
unless you clear the database using the command fftw('wisdom', []).
Although it relates to powers of 2, it may hint that MATLAB employs its own 'special wisdom' when using FFTW for certain (large) array sizes. Consider: 2^16 = 65536.
[1] R2013b Documentation available from http://www.mathworks.de/de/help/matlab/ref/fftw.html (accessed on 29 Oct 2013)
EDIT: @wakjah's reply to this answer is accurate: FFTW does support split real and imaginary memory storage via its Guru interface. My claim about hacking is thus not accurate but can very well apply if FFTW's Guru interface is not used, which is the case by default, so beware still!
First, sorry for being a year late. I'm not convinced that the speed increase you see comes from MKL or other optimizations. There is something quite fundamentally different between FFTW and Matlab, and that is how complex data is stored in memory.
In Matlab, the real and imaginary parts of a complex vector X are separate arrays Xre[i] and Xim[i] (linear in memory, efficient when operating on either of them separately).
In FFTW, the real and imaginary parts are interlaced as double[2] by default, i.e. X[i][0] is the real part, and X[i][1] is the imaginary part.
Thus, to use the FFTW library in mex files one cannot use the Matlab array directly, but must allocate new memory first, then pack the input from Matlab into FFTW format, and then unpack the output from FFTW into Matlab format. i.e.
X = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
Y = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
then
for (size_t i=0; i<N; ++i) {
X[i][0] = Xre[i];
X[i][1] = Xim[i];
}
then
for (size_t i=0; i<N; ++i) {
Yre[i] = Y[i][0];
Yim[i] = Y[i][1];
}
Hence, this requires 2x memory allocations + 4x reads + 4x writes -- all of size N. This does take a toll speed-wise on large problems.
I have a hunch that Mathworks may have hacked the FFTW3 code to allow it to read input vectors directly in the Matlab format, which avoids all of the above.
In this scenario, one can allocate only X and reuse it for Y to run FFTW in place (as fftw_plan_*(N, X, X, ...) instead of fftw_plan_*(N, X, Y, ...)), since the result is copied to the Matlab vectors Yre and Yim anyway, unless the application requires or benefits from keeping X and Y separate.
EDIT: Looking at the memory consumption in real time when running Matlab's fft2() and my code based on the fftw3 library, it shows that Matlab allocates only one additional complex array (the output), whereas my code needs two such arrays (the fftw_complex* buffer plus the Matlab output). An in-place conversion between the Matlab and fftw formats is not possible because Matlab's real and imaginary arrays are not consecutive in memory. This suggests that Mathworks hacked the fftw3 library to read/write the data using the Matlab format.
One other optimization for multiple calls, is to allocate persistently (using mexMakeMemoryPersistent()). I'm not sure if the Matlab implementation does this as well.
Cheers.
p.s. As a side note, the Matlab complex data storage format is more efficient for operating on the real or imaginary vectors separately. With FFTW's format you'd have to read memory with a stride of 2.
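To make the two layouts concrete, here is a small standard-C++ sketch (no FFTW or MEX headers) that converts between split real/imaginary arrays and an interleaved buffer laid out like fftw_complex, i.e. consecutive pairs of doubles. The function names to_interleaved and to_split are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pack split arrays (re[i], im[i]) into one interleaved buffer:
// out[2*i] is the real part and out[2*i + 1] the imaginary part.
std::vector<double> to_interleaved(const std::vector<double>& re,
                                   const std::vector<double>& im) {
    assert(re.size() == im.size());
    std::vector<double> out(2 * re.size());
    for (std::size_t i = 0; i < re.size(); ++i) {
        out[2 * i] = re[i];
        out[2 * i + 1] = im[i];
    }
    return out;
}

// Unpack an interleaved buffer back into split arrays.
void to_split(const std::vector<double>& in,
              std::vector<double>& re, std::vector<double>& im) {
    assert(in.size() % 2 == 0);
    const std::size_t n = in.size() / 2;
    re.resize(n);
    im.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        re[i] = in[2 * i];      // stride-2 read of the real parts
        im[i] = in[2 * i + 1];  // stride-2 read of the imaginary parts
    }
}
```

Each conversion touches every element once, which is the extra traffic the answer above attributes to using FFTW's default interface from a mex file.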