Combine two separate buffers into a complex one - C++

I'm working on a project involving real-time RX and TX transmission using a Software Defined Radio.
I have to pass the SDR transmission API a complex float array buffer to be sent. I'm trying to implement every feature while avoiding "for loops", since working on one element at a time slows down execution, and I need to do a lot of upsampling, FIR filtering and other computationally intensive processing.
Now I am facing a problem. Suppose I have two separate buffers, one representing the real part and the other the imaginary part of the complex sample buffer I have to pass to the API tx function.
Say the real buffer is RRRRRRRRRRRR while the imag buffer is IIIIIIIIIIII. The example is for 12 samples, but in reality it could be 2048, 4096 or more:
int size = 12;
float *reals, *imags;
reals = new float[size];
imags = new float[size];
Now I need an output that is defined as
complex<float> *cplxOut;
cplxOut = new complex<float>[size];
In memory this object is stored as RIRIRIRIRIRIRIRIRIRIRIRI.
Building cplxOut from the two real and imag buffers is easy using a for loop:
for (int i = 0; i < size; i++)
{
    cplxOut[i].real(reals[i]);
    cplxOut[i].imag(imags[i]);
}
I wonder if there is a quicker way to do it using direct memory-move functions on whole buffers.
I tried to use inline assembly to speed up the task, but it has portability problems across architectures and is not supported for x64 on the Windows side.
A possible way could be to upsample by two (interleaving with zeros), shift the imag buffer forward by one place and then OR the two buffers together, but to do the upsampling I would need a for loop as well... so no way.
Do you have any suggestions? I need the fastest way to do it.
Tnx, Fabio

You don't have to construct an array of default-constructed complex objects; use std::vector and emplace_back() instead:
vector<complex<float>> cplxOut;
// to avoid reallocations when adding new elements
cplxOut.reserve(size);
for (int i = 0; i < size; i++)
{
    // create the complex number in-place
    cplxOut.emplace_back(reals[i], imags[i]);
}
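If the goal is raw speed rather than avoiding a loop in the source per se, another option (my sketch, not part of the original answer) exploits the C++11 guarantee that std::complex<float> is layout-compatible with float[2]: the destination buffer may be written through a flat float* view, and this plain interleave pattern is one that optimizing compilers readily auto-vectorize into SIMD interleave instructions at -O2/-O3:

#include <complex>
#include <cstddef>

// Minimal sketch: interleave two float buffers into an existing
// complex<float> buffer through a flat float* view. This aliasing is
// explicitly allowed by the standard, which guarantees that
// reinterpret_cast<float*>(a)[2*i] and [2*i+1] are a[i]'s real and
// imaginary parts.
void interleave(const float* reals, const float* imags,
                std::complex<float>* cplxOut, std::size_t size)
{
    float* flat = reinterpret_cast<float*>(cplxOut);
    for (std::size_t i = 0; i < size; ++i)
    {
        flat[2 * i]     = reals[i]; // real part
        flat[2 * i + 1] = imags[i]; // imaginary part
    }
}

The loop is still there, but each iteration is independent, so the compiler is free to process many samples per instruction.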

Related

Training data structure and access

I'm writing up an implementation of backpropagation for a feedforward neural network in C++, and I'm using the Armadillo library. Right now, I'm loading training data with the load method of Armadillo's matrix class. Two questions:
1) Is this a reasonable choice for storing pre-formatted (CSV), numeric data that fits into main memory (<2 GB)? Certainly some ways to do this are better than others, and it'd be nice to know if this is bad practice. Part of me feels like this isn't a good choice for holding the data, as there are likely more data-oriented structures/frameworks (like I should be accessing some SQL database or something). Another part of me feels like numeric data is by definition just matrices, so this should be fine.
2) I need to sample without replacement from a data set in my implementation, and I see two routes: either I could shuffle the rows of the data set, or I could shuffle an array that indexes the data set. There is a shuffle method for the matrix class in the Armadillo library, and I'm suspicious that what gets shuffled is addresses and not the rows themselves. Wouldn't that be just as efficient as shuffling an indexing array?
1) Yes, this is fine and it's how I would do it, but note that Armadillo matrices are column-major and thus you may need to transpose the CSV that you load. If your data is sufficiently large that it won't fit in main memory, you could consider writing a custom CSV parser that looks at the data in a streaming sense (i.e. one point at a time), thus reducing your RAM footprint, or you could even use mmap() to map a file full of packed doubles as your matrix and let the kernel work out what needs to be swapped in when.
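As a sketch of that mmap() idea (my own illustration, not part of the original answer; it assumes a POSIX system, a file of packed column-major doubles, and hypothetical known dimensions n_rows and n_cols):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <armadillo>

int main()
{
    const arma::uword n_rows = 100, n_cols = 1000; // assumed known

    // Map the file of packed doubles into the address space.
    int fd = open("data.bin", O_RDONLY);
    struct stat sb;
    fstat(fd, &sb);
    double* mem = static_cast<double*>(
        mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    close(fd); // the mapping stays valid after close

    // Wrap the mapping in an arma::mat without copying
    // (copy_aux_mem = false, strict = true). Read-only use only,
    // since the pages are mapped PROT_READ.
    arma::mat data(mem, n_rows, n_cols, false, true);

    // ... train on `data`; the kernel pages it in as needed ...

    munmap(mem, sb.st_size);
    return 0;
}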
2) Because all matrix data is stored contiguously (i.e. double* not double**), shuffle() will be moving the elements in the matrix. What I generally do in this type of situation is create a vector of indices and shuffle it:
uvec indices = linspace<uvec>(0, n - 1, n); // 0, 1, ..., n-1
indices = shuffle(indices); // shuffle() returns a shuffled copy
// Now loop over each shuffled point...
for (uword i = 0; i < n; ++i)
{
    // access the point with data.col(indices[i]) and do whatever
}
(The above code isn't tested, but it should work or easily be adapted into something that works.)
For what it's worth, mlpack (http://www.mlpack.org/) does have a not-yet-stable neural network infrastructure that uses Armadillo, and it may be worth your time to check out; the link below is to the relevant source directly, but poking around on Github and the mlpack website should reveal better documentation.
https://github.com/mlpack/mlpack/tree/master/src/mlpack/methods/ann

How to change sub-matrix of a sparse matrix on CUDA device

I have a sparse matrix structure that I am using in conjunction with CUBLAS to implement a linear solver class. I anticipate that the dimensions of the sparse matrices I will be solving will be fairly large (on the order of 10^7 by 10^7).
I also anticipate that the solver will need to be used many times, and that a portion of this matrix will need to be updated several times (between computing solutions) as well.
Copying the entire matrix structure from system memory to device memory could become quite a performance bottleneck, since only a fraction of the matrix entries will ever need to be changed at a given time.
What I would like to be able to do is to have a way to update only a particular sub-set / sub-matrix rather than recopy the entire matrix structure from system memory to device memory each time I need to change the matrix.
The matrix data structure would reside on the CUDA device in arrays:
d_col, d_row, and d_val
On the system side I would have corresponding arrays I, J, and val.
So ideally, I would only want to change the subsets of d_val that correspond to the values in the system array, val, that changed.
Note that I do not anticipate that any entries will be added to or removed from the matrix, only that existing entries will change in value.
Naively I would think that to implement this, I would have an integer array or vector on the host side, e.g. updateInds, that would track the indices of entries in val that have changed, but I'm not sure how to efficiently tell the CUDA device to update the corresponding values of d_val.
In essence: how do I change the entries in the CUDA device-side array d_val at indices updateInds[1], updateInds[2], ..., updateInds[n] to a new set of values val[updateInds[1]], val[updateInds[2]], ..., val[updateInds[n]], without recopying the entire val array from system memory into the CUDA device memory array d_val?
As long as you only want to change the numerical values of the value array associated with CSR (or CSC, or COO) sparse matrix representation, the process is not complicated.
Suppose I have code like this (excerpted from the CUDA conjugate gradient sample):
checkCudaErrors(cudaMalloc((void **)&d_val, nz*sizeof(float)));
...
cudaMemcpy(d_val, val, nz*sizeof(float), cudaMemcpyHostToDevice);
Now, subsequent to this point in the code, let's suppose I need to change some values in the d_val array, corresponding to changes I have made in val:
for (int i = 10; i < 25; i++)
    val[i] = 4.0f;
The process to move these particular changes is conceptually the same as if you were updating an array using memcpy, but we will use cudaMemcpy to update the d_val array on the device:
cudaMemcpy(d_val+10, val+10, 15*sizeof(float), cudaMemcpyHostToDevice);
Since these values were all contiguous, I can use a single cudaMemcpy call to effect the transfer.
If I have several disjoint regions similar to above, it will require several calls to cudaMemcpy, one for each region. If, by chance, the regions are equally spaced and of equal length:
for (int i = 10; i < 15; i++)
    val[i] = 1.0f;
for (int i = 20; i < 25; i++)
    val[i] = 2.0f;
for (int i = 30; i < 35; i++)
    val[i] = 4.0f;
then it would also be possible to perform this transfer using a single call to cudaMemcpy2D. The method is outlined here.
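As an illustration (my sketch, not part of the original answer): the three regions above start 10 elements apart and are 5 elements long each, which maps onto cudaMemcpy2D's pitched-copy parameters as 3 "rows" of 5 floats with a 10-float pitch:

cudaMemcpy2D(d_val + 10,          // dst: first region on the device
             10 * sizeof(float),  // dpitch: bytes between region starts (device)
             val + 10,            // src: first region on the host
             10 * sizeof(float),  // spitch: bytes between region starts (host)
             5 * sizeof(float),   // width: bytes actually copied per region
             3,                   // height: number of regions
             cudaMemcpyHostToDevice);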
Notes:
cudaMemcpy2D is slower than you might expect compared to a cudaMemcpy operation on the same number of elements.
CUDA API calls have some inherent overhead. If a large part of the matrix is to be updated in a scattered fashion, it may actually be quicker to just transfer the whole d_val array, taking advantage of the fact that this can be done using a single cudaMemcpy operation. (For a moderate number of scattered changes, a scatter kernel is another option; see the sketch after these notes.)
The method described here cannot be used if non-zero values change their location in the sparse matrix. In that case I cannot give a general answer for how to surgically update a CSR sparse matrix on the device, and certain relatively simple changes could necessitate updating most of the array data (all 3 vectors) anyway.
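For the scattered case the asker describes with updateInds, one possibility (my sketch; d_inds, d_newVals and n_updates are hypothetical names, assumed already allocated) is to ship only the changed values plus their indices and scatter them with a small kernel:

// Scatter n new values into d_val at the given indices.
__global__ void scatter_update(float* d_val, const int* d_inds,
                               const float* d_newVals, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_val[d_inds[i]] = d_newVals[i];
}

// Host side: copy the (small) index and value arrays, then launch.
cudaMemcpy(d_inds, updateInds, n_updates * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_newVals, newVals, n_updates * sizeof(float), cudaMemcpyHostToDevice);
scatter_update<<<(n_updates + 255) / 256, 256>>>(d_val, d_inds, d_newVals, n_updates);

Whether this beats re-sending the whole d_val array depends on how many entries changed; for large scattered updates the single bulk cudaMemcpy in the note above may still win.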

Blend two images using GPU

I need to blend thousands of pairs of images very fast.
My code currently does the following. _apply is a function pointer to a function like Blend; it is one of many functions we can pass, but it is not the only one. Any such function takes two values and outputs a third, and it is applied to each channel of each pixel. I would prefer a solution that is general to any such function rather than one specific to blending.
typedef byte (*Transform)(byte src1, byte src2);
Transform _apply;

for (int i = 0; i < _frameSize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}
byte Blend(byte src, byte blend)
{
    int resultPixel = (src + blend) / 2;
    return (byte)resultPixel;
}
I was doing this on the CPU, but the performance is terrible. It is my understanding that doing this on the GPU would be very fast. My program needs to run on computers that have either Nvidia GPUs or Intel GPUs, so whatever solution I use needs to be vendor-independent. If I use the GPU, it has to be OpenGL, to be platform-independent as well.
I think using a GLSL pixel shader would help, but I am not familiar with pixel shaders or with how to apply them to 2D objects (like my images).
Is that a reasonable solution? If so, how do I do this in 2D?
If there is a library that already does that it is also great to know.
EDIT: I am receiving the image pairs from different sources. One always comes from a 3D graphics component in OpenGL (so it is on the GPU originally). The other one comes from system memory, either from a socket (in a compressed video stream) or from a memory-mapped file. The "sink" of the resulting image is the screen. I am expected to show the images on the screen, so going to the GPU is an option, as is using something like SDL to display them.
The blend function that is going to be executed the most is this one
byte Patch(byte delta, byte lo)
{
    // undo the 127 bias on delta, double it, add it to lo,
    // and clamp the result to [0, 255]
    int resultPixel = (2 * (delta - 127)) + lo;
    if (resultPixel > 255)
        resultPixel = 255;
    if (resultPixel < 0)
        resultPixel = 0;
    return (byte)resultPixel;
}
EDIT 2: The image coming from GPU land arrives in this fashion, from FBO to PBO to system memory:
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, width, height, GL_BGR, GL_UNSIGNED_BYTE, 0); // reads into the bound PBO
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
void* mappedRegion = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
Seems like it is probably better to just work everything in GPU memory. The other bitmap can come from system memory. We may get it from a video decoder in GPU memory eventually as well.
Edit 3: One of my images will come from D3D while the other one comes from OpenGL. It seems that something like Thrust or OpenCL is the best option.
From the looks of your Blend function, this is an entirely memory-bound operation. The caches on the CPU can likely hold only a very small fraction of the thousands of images you have, meaning most of your time is spent waiting for RAM to fulfill load/store requests while the CPU idles a lot.
You will NOT get any speedup by copying your images from RAM to the GPU, having the GPU's arithmetic units idle while they wait for GPU RAM to feed them data, waiting for GPU RAM again to write the results, then copying it all back to main RAM. Using the GPU for this could actually slow things down substantially.
But I could be wrong, and you might not be saturating your memory bus already. You will have to try it on your system and profile it. Here are some simple things you can try to optimize.
1. Multi-thread
I would focus on optimizing the algorithm directly on the CPU. The simplest thing is to go multi-threaded, which can be as simple as enabling OpenMP in your compiler and updating your for loop:
#include <omp.h> // add this along with enabling OpenMP support in your compiler
...
#pragma omp parallel for // <--- compiler magic happens here
for (int i = 0; i < _frameSize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}
If your memory bandwidth is not saturated, this will likely speed up the blending by however many cores your system has.
2. Micro-optimizations
Another thing you can try is implementing your Blend using the SIMD instructions that most CPUs provide nowadays. I can't help you with the details without knowing which CPU you are targeting.
You can also try unrolling your for loop to mitigate some of the loop overhead.
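For instance, on x86-64 (where SSE2 is baseline), the averaging blend maps almost directly onto a single intrinsic. This is my illustrative sketch, not part of the original answer; note that _mm_avg_epu8 rounds up, computing (a + b + 1) / 2 rather than the truncating (a + b) / 2 of the original Blend:

#include <emmintrin.h> // SSE2 intrinsics
#include <cstddef>

typedef unsigned char byte; // matching the asker's byte

// Average-blend 16 pixels per iteration. Assumes n is a multiple of 16;
// a scalar tail loop would handle the remainder otherwise.
void blend_avg_sse2(const byte* src, const byte* blend,
                    byte* result, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 16)
    {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(blend + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(result + i),
                         _mm_avg_epu8(a, b)); // per-byte (a + b + 1) / 2
    }
}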
One easy way to achieve both of these is to leverage the Eigen matrix library by wrapping your data in its data structures.
// initialize your data and result buffers
byte *source = ...
byte *blend  = ...
byte *result = ...

// tell Eigen where your data/buffers are, and to treat them like
// dynamic vectors of bytes -- this is a cheap shallow wrapper, no copy
Map<Matrix<byte, Dynamic, 1> > sourceMap(source, _frameSize);
Map<Matrix<byte, Dynamic, 1> > blendMap(blend, _frameSize);
Map<Matrix<byte, Dynamic, 1> > resultMap(result, _frameSize);

// perform the blend using all manner of insane optimization voodoo
// under the covers (beware: the byte-wide sum can wrap before the
// division; cast to a wider type first if that matters)
resultMap = (sourceMap + blendMap) / 2;
3. Use GPGPU
Finally, I will provide a direct answer to your question with an easy way to leverage the GPU without having to know much about GPU programming. The simplest thing to do is to try the Thrust library. You will have to rewrite your algorithms as STL-style algorithms, but that's pretty easy in your case.
// functor for blending
struct blend_functor
{
    template <typename Tuple>
    __host__ __device__
    void operator()(Tuple t)
    {
        // C[i] = (A[i] + B[i]) / 2;
        thrust::get<2>(t) = (thrust::get<0>(t) + thrust::get<1>(t)) / 2;
    }
};

// initialize your data and result buffer
byte *source = ...
byte *blend  = ...
byte *result = ...

// copy the data to vectors on the GPU
thrust::device_vector<byte> A(source, source + _frameSize);
thrust::device_vector<byte> B(blend, blend + _frameSize);
// allocate the result vector on the GPU
thrust::device_vector<byte> C(_frameSize);

// process the data on the GPU device
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(
                     A.begin(), B.begin(), C.begin())),
                 thrust::make_zip_iterator(thrust::make_tuple(
                     A.end(), B.end(), C.end())),
                 blend_functor());

// copy the data back to main RAM (result must already point to
// _frameSize bytes of host storage)
thrust::copy(C.begin(), C.end(), result);
A really neat thing about Thrust is that once you have written the algorithms in a generic way, it can automagically use different back ends for doing the computation. CUDA is the default back end, but you can also configure it at compile time to use OpenMP or TBB (Intel's threading library).
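For example (a detail from Thrust's documentation, not the original answer): compiling the same source with -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP, plus the usual OpenMP compiler flags, retargets the device_vector and for_each above at OpenMP threads on the host, with no source changes.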

Monte Carlo sweep in Cuda

I have a Monte Carlo step in CUDA that I need help with. I have already written the serial code, and it works as expected. Let's say I have 256 particles, which are stored in
vector< vector<double> > *r;
Each element of r has an (x,y) component, both of which are doubles. Here, r is the position of a particle.
Now, in CUDA, I'm supposed to assign this vector on the host and send it to the device. Once on the device, these particles need to interact with each other. Each thread is supposed to run a Monte Carlo sweep. How do I allocate memory and reference/dereference pointers with cudaMalloc? Which functions should be global/shared? I just can't wrap my head around it.
Here's what my memory allocation looks like at the moment:
cudaMalloc((void**)&r, (blocks*threads)*sizeof(double));
CUDAErrorCheck();
kernel <<<blocks, threads>>> (&r, randomnums);
cudaDeviceSynchronize();
CUDAErrorCheck();
cudaMemcpy(r, blocks*threads*sizeof(double), cudaMemcpyDeviceToHost);
The above code is at potato level. I guess I'm not sure what to do, even conceptually. My main problem is allocating memory and passing information to and from device and host. The vector r needs to be allocated, copied from host to device, worked on in the device, and copied back to the host. Any help/"pointers" will be much appreciated.
Your "potato level" code demonstrates a general lack of understanding of CUDA, including but not limited to the management of the r data. I would suggest that you increase your knowledge of CUDA by taking advantage of some of the educational resources available, and then develop an understanding of at least one basic CUDA code, such as the vector add sample. You will then be much better able to frame questions and understand the responses you receive. An example:
This would almost never make sense:
cudaMalloc((void**)&r, (blocks*threads)*sizeof(double));
CUDAErrorCheck();
kernel <<<blocks, threads>>> (&r, randomnums);
You either don't know the very basic concept that data must be transferred to the device (via cudaMemcpy) before it can be used by a GPU kernel, or you can't be bothered to write "potato level" code that makes any sense at all - which suggests to me a lack of effort in writing a sensible question. Also, regardless of what r is, passing &r to a CUDA kernel makes no sense.
Regarding your question about how to move r back and forth:
The first step in solving your problem will be to recast the r position data as something that is easily usable by a GPU kernel. In general, vector is not that useful for ordinary CUDA device code, vector< vector< > > even less so, and if you have pointers floating about (*r), even less still. Therefore, flatten (copy) your position data into one or two dynamically allocated 1-D arrays of double:
#define N 1000
...
vector< vector<double> > r(N);
...
double *pos_x_h, *pos_y_h, *pos_x_d, *pos_y_d;
pos_x_h = (double *)malloc(N*sizeof(double));
pos_y_h = (double *)malloc(N*sizeof(double));
for (int i = 0; i < N; i++){
    vector<double> temp = r[i];
    pos_x_h[i] = temp[0];
    pos_y_h[i] = temp[1];
}
Now you can allocate space for the data on the device and copy the data to the device:
cudaMalloc(&pos_x_d, N*sizeof(double));
cudaMalloc(&pos_y_d, N*sizeof(double));
cudaMemcpy(pos_x_d, pos_x_h, N*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(pos_y_d, pos_y_h, N*sizeof(double), cudaMemcpyHostToDevice);
Now you can properly pass the position data to your kernel:
kernel<<<blocks, threads>>>(pos_x_d, pos_y_d, ...);
Copying the data back after the kernel will be approximately the reverse of the above steps. This will get you started:
cudaMemcpy(pos_x_h, pos_x_d, N*sizeof(double), cudaMemcpyDeviceToHost);
cudaMemcpy(pos_y_h, pos_y_d, N*sizeof(double), cudaMemcpyDeviceToHost);
There are many ways to skin the cat, of course, the above is just an example. However the above data organization will be well suited to a kernel/thread strategy that assigns one thread to process one (x,y) position pair.
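To make the one-thread-per-pair strategy concrete, here is a minimal kernel skeleton of my own (the Monte Carlo move itself is application-specific and left as a placeholder):

__global__ void kernel(double *pos_x, double *pos_y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;   // guard against extra threads in the last block

    double x = pos_x[i];  // this thread's particle
    double y = pos_y[i];

    // ... propose a move, evaluate interactions with the other
    //     particles, and accept or reject the move here ...

    pos_x[i] = x;         // write the (possibly updated) position back
    pos_y[i] = y;
}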

Better way to copy several std::vectors into 1? (multithreading)

Here is what I'm doing:
I'm taking in Bezier points and running Bezier interpolation, then storing the result in a std::vector<std::vector<POINT> >.
The Bezier calculation was slowing me down, so this is what I did.
I start with a std::vector<USERPOINT>, which is a struct with a point and 2 other points for the Bezier handles.
I divide these up into ~4 groups and assign each thread 1/4 of the work. To do this I created 4 std::vector<std::vector<POINT> > to store the results from each thread. In the end all the points have to be in one contiguous vector; before I used multithreading I accessed it directly, but now I reserve the size of the 4 vectors produced by the threads and insert them into the original vector, in the correct order. This works, but unfortunately the copy part is very slow and makes it slower than without multithreading. So now my new bottleneck is copying the results into the final vector. How could I do this much more efficiently?
Thanks
Have all the threads put their results into a single contiguous vector, just like before. You have to ensure each thread only accesses parts of the vector that are separate from the others. As long as that's the case (which it should be regardless -- you don't want to generate the same output twice), each thread is still working with memory that's separate from the others, and you don't need any locking (etc.) for things to work. You do, however, need to ensure that the result vector has the correct size for all the results first -- multiple threads trying (for example) to call resize() or push_back() on the vector will wreak havoc in a hurry (not to mention causing copying, which you clearly want to avoid here).
Edit: As Billy O'Neal pointed out, the usual way to do this would be to pass a pointer to each part of the vector where each thread will deposit its output. For the sake of argument, let's assume we're using the std::vector<std::vector<POINT> > mentioned as the original version of things. For the moment, I'm going to skip over the details of creating the threads (especially since it varies across systems). For simplicity, I'm also assuming that the number of curves to be generated is an exact multiple of the number of threads -- in reality, the curves won't divide up exactly evenly, so you'll have to "fudge" the count for one thread, but that's really unrelated to the question at hand.
std::vector<USERPOINT> inputs;              // input data
std::vector<std::vector<POINT> > outputs;   // space for output data

const int thread_count = 4;

struct work_packet {               // describes the work for one thread
    USERPOINT *inputs;             // where to get its input
    std::vector<POINT> *outputs;   // where to put its output
    int num_points;                // how many points to process
    HANDLE finished;               // signaled when it's done
};

std::vector<work_packet> packets(thread_count); // storage for the packets
std::vector<HANDLE> events(thread_count);       // parent's handles to the events

outputs.resize(inputs.size());  // can't resize the output after processing starts

for (int i = 0; i < thread_count; i++) {
    int offset = i * inputs.size() / thread_count;
    packets[i].inputs     = &inputs[0] + offset;
    packets[i].outputs    = &outputs[0] + offset;
    packets[i].num_points = inputs.size() / thread_count;
    events[i] = packets[i].finished = CreateEvent(NULL, FALSE, FALSE, NULL);
    threads[i].process(&packets[i]);  // launch thread i (details omitted)
}

// wait for all curves to be generated (Win32 style, for the moment).
WaitForMultipleObjects(thread_count, &events[0], TRUE, INFINITE);
Note that although we have to be sure the outputs vector doesn't get resized while being operated on by multiple threads, the individual vectors of points inside outputs can be, because each will only ever be touched by one thread at a time.
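For completeness, a worker matching that packet layout might look like the following sketch (my illustration; interpolate_bezier is a hypothetical stand-in for the asker's actual Bezier routine):

// Thread entry point (Win32 CreateThread signature): process one packet.
DWORD WINAPI worker(LPVOID arg)
{
    work_packet *p = static_cast<work_packet *>(arg);
    for (int i = 0; i < p->num_points; ++i)
    {
        // each slot in outputs belongs to this thread alone, so the
        // inner vector can be filled (and resized) without locking
        p->outputs[i] = interpolate_bezier(p->inputs[i]);
    }
    SetEvent(p->finished); // tell the parent this slice is done
    return 0;
}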
If the simple copy in between is slower than before you started using multithreading, it's entirely likely that what you're doing simply isn't going to scale to multiple cores. If it's something simple like Bezier interpolation, I suspect that's going to be the case.
Remember that the overhead of making the threads and such has an impact on total run time.
Finally: for the copy, what are you using? Is it std::copy?
Multithreading by itself is not going to speed up your process; processing the data on different cores could.