Blend two images using GPU - OpenGL

I need to blend thousands of pairs of images very fast.
My code currently does the following: _apply is a function pointer to a function like Blend below. It is one of many functions we can pass; any such function takes two byte values and outputs a third, and it is applied to each channel of each pixel. I would prefer a solution that is general to any such function rather than one specific to blending.
typedef byte (*Transform)(byte src1, byte src2);
Transform _apply;

for (int i = 0; i < _frameSize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}

byte Blend(byte src, byte blend)
{
    int resultPixel = (src + blend) / 2;
    return (byte)resultPixel;
}
I was doing this on the CPU but the performance is terrible. It is my understanding that doing this on the GPU is very fast. My program needs to run on computers that will have either Nvidia or Intel GPUs, so whatever solution I use needs to be vendor independent. If I use the GPU, it has to be OpenGL to be platform independent as well.
I think using a GLSL pixel shader would help, but I am not familiar with pixel shaders or how to apply them to 2D objects (like my images).
Is that a reasonable solution? If so, how do I do this in 2D?
If there is a library that already does this, that would also be great to know.
EDIT: I am receiving the image pairs from different sources. One always comes from a 3D graphics component in OpenGL (so it is on the GPU originally). The other comes from system memory, either from a socket (in a compressed video stream) or from a memory-mapped file. The "sink" of the resulting image is the screen. I am expected to show the images on the screen, so staying on the GPU is an option, as is using something like SDL to display them.
The blend function that is going to be executed the most is this one:
byte Patch(byte delta, byte lo)
{
    int resultPixel = (2 * (delta - 127)) + lo;
    if (resultPixel > 255)
        resultPixel = 255;
    if (resultPixel < 0)
        resultPixel = 0;
    return (byte)resultPixel;
}
EDIT 2: The image coming from GPU land arrives in this fashion, from FBO to PBO to system memory:
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
// the last argument is an offset into the bound PBO, not a client pointer
glReadPixels(0, 0, width, height, GL_BGR, GL_UNSIGNED_BYTE, 0);
void* mappedRegion = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
It seems it is probably better to just keep everything in GPU memory. The other bitmap can come from system memory. We may eventually get it from a video decoder in GPU memory as well.
EDIT 3: One of my images will come from D3D while the other comes from OpenGL. It seems that something like Thrust or OpenCL is the best option.

From the looks of your Blend function, this is an entirely memory-bound operation. The caches on the CPU can likely hold only a very small fraction of the thousands of images you have, meaning most of your time is spent waiting for RAM to service load/store requests while the CPU idles.
You will NOT get any speedup by copying your images from RAM to the GPU, having the GPU's arithmetic units idle while they wait for GPU RAM to feed them data, waiting for GPU RAM again to write the results, and then copying it all back to main RAM. Using the GPU for this could actually slow things down substantially.
But I could be wrong, and you might not be saturating your memory bus already. You will have to try it on your system and profile it. Here are some simple things you can try to optimize.
1. Multi-thread
I would focus on optimizing the algorithm directly on the CPU. The simplest thing is to go multi-threaded, which can be as simple as enabling OpenMP in your compiler and updating your for loop:
#include <omp.h> // add this along with enabling OpenMP support in your compiler
...
#pragma omp parallel for // <--- compiler magic happens here
for (int i = 0; i < _frameSize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}
If your memory bandwidth is not saturated, this will likely speed up the blending by roughly the number of cores your system has.
2. Micro-optimizations
Another thing you can try is to implement your Blend using SIMD instructions, which most CPUs have nowadays. I can't help you with that without knowing which CPU you are targeting.
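That said, the averaging blend maps directly to a dedicated x86 instruction. Here is a minimal SSE2 sketch, assuming an x86 target and that byte is unsigned char (the function name is mine; note that _mm_avg_epu8 rounds up, while your integer division truncates):
#include <emmintrin.h> // SSE2 intrinsics
// processes 16 pixels per iteration; _mm_avg_epu8 computes (a + b + 1) / 2
// with 9-bit intermediate precision, so the sum cannot overflow
void BlendSSE2(const byte* src, const byte* blend, byte* dst, int n)
{
    int i = 0;
    for (; i + 16 <= n; i += 16)
    {
        __m128i a = _mm_loadu_si128((const __m128i*)(src + i));
        __m128i b = _mm_loadu_si128((const __m128i*)(blend + i));
        _mm_storeu_si128((__m128i*)(dst + i), _mm_avg_epu8(a, b));
    }
    for (; i < n; ++i) // scalar tail for the leftover pixels
        dst[i] = (byte)((src[i] + blend[i]) / 2);
}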
You can also try unrolling your for loop to mitigate some of the loop overhead.
One easy way to achieve both of these is to leverage the Eigen matrix library by wrapping your data in its data structures.
// initialize your data and result buffers
byte *source = ...;
byte *blend = ...;
byte *result = ...;
// tell Eigen where your data/buffers are, and to treat them as dynamic vectors of bytes
// (this is a cheap shallow wrapper, not a copy)
Map<Matrix<byte, Dynamic, 1> > sourceMap(source, _frameSize);
Map<Matrix<byte, Dynamic, 1> > blendMap(blend, _frameSize);
Map<Matrix<byte, Dynamic, 1> > resultMap(result, _frameSize);
// perform the blend with all manner of insane optimization voodoo under the covers;
// the cast to int keeps (src + blend) from overflowing a byte
resultMap = ((sourceMap.cast<int>() + blendMap.cast<int>()) / 2).cast<byte>();
3. Use GPGPU
Finally, I will provide a direct answer to your question with an easy way to leverage the GPU without having to know much about GPU programming. The simplest thing to do is to try the Thrust library. You will have to rewrite your algorithms as STL-style algorithms, but that's pretty easy in your case.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/copy.h>

// functor for blending
struct blend_functor
{
    template <typename Tuple>
    __host__ __device__
    void operator()(Tuple t)
    {
        // C[i] = (A[i] + B[i]) / 2;
        thrust::get<2>(t) = (thrust::get<0>(t) + thrust::get<1>(t)) / 2;
    }
};

// initialize your data and result buffer
byte *source = ...;
byte *blend = ...;
byte *result = new byte[_frameSize];
// copy the data to vectors on the GPU
thrust::device_vector<byte> A(source, source + _frameSize);
thrust::device_vector<byte> B(blend, blend + _frameSize);
// allocate the result vector on the GPU
thrust::device_vector<byte> C(_frameSize);
// process the data on the GPU
thrust::for_each(
    thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin(), C.begin())),
    thrust::make_zip_iterator(thrust::make_tuple(A.end(), B.end(), C.end())),
    blend_functor());
// copy the data back to main RAM; don't keep a pointer into a temporary
// host_vector, it would dangle once the vector is destroyed
thrust::copy(C.begin(), C.end(), result);
A really neat thing about Thrust is that once you have written your algorithm in this generic way, it can automagically use different back ends for the computation. CUDA is the default, but you can also configure it at compile time to use OpenMP or TBB (Intel's threading library).
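For example, to run the exact same for_each on host threads without any CUDA device, you can request the OpenMP back end explicitly through an execution policy. A sketch, assuming OpenMP is enabled in your compiler (raw host pointers are fine here, no device_vector needed):
#include <thrust/system/omp/execution_policy.h>
thrust::for_each(thrust::omp::par,
    thrust::make_zip_iterator(thrust::make_tuple(source, blend, result)),
    thrust::make_zip_iterator(thrust::make_tuple(source + _frameSize,
                                                 blend + _frameSize,
                                                 result + _frameSize)),
    blend_functor());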

Related

Combine two separate buffers into a complex one

I'm working on a project involving real-time RX and TX transmission using a Software Defined Radio.
I have to pass the SDR transmission API a complex float array buffer to be sent. I'm trying to implement every feature without "for loops", since working on one element at a time slows down execution, and I need to do a lot of upsampling, FIR filtering and other compute-intensive work.
Now I am facing a problem. Suppose I have two separate buffers, one representing the real part and the other the imaginary part of the complex sample buffer I have to pass to the API TX function.
Say the real buffer is RRRRRRRRRRRR while the imag buffer is IIIIIIIIIIII. The example is for 12 samples, but really it could be 2048, 4096 or more...
int size = 12;
float *reals, *imags;
reals = new float[size];
imags = new float[size];
Now I need an output that is defined as
complex<float> *cplxOut;
cplxOut = new complex<float>[size];
In memory this object is stored as RIRIRIRIRIRIRIRIRIRIRIRI.
Building cplxOut from the two real and imag buffers is easy using a for loop:
for (int i = 0; i < size; i++)
{
    cplxOut[i].real(reals[i]);
    cplxOut[i].imag(imags[i]);
}
I wonder if there is a quicker way to do it, using direct memory moves on whole buffers.
I tried using inline assembly to speed up the task, but it has portability problems across architectures and is not supported for x64 on the Windows side.
A possible way could be to upsample by two, interleaving with zeros, shift the imag buffer forward one place and then OR the two buffers, but the upsampling would need a for loop as well... so no way.
Do you have any suggestions? I need the fastest way to do this.
Tnx, Fabio
You don't have to construct an array of empty complex objects; use std::vector and emplace_back() instead:
vector<complex<float>> cplxOut;
// reserve to avoid reallocations when adding new elements
cplxOut.reserve(size);
for (int i = 0; i < size; i++)
{
    // create the complex number in place
    cplxOut.emplace_back(reals[i], imags[i]);
}
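If the loop itself still shows up in your profile, the interleave also maps nicely to SSE. A hedged sketch, assuming an x86 target (the function name is mine; std::complex<float> is guaranteed to be laid out as two consecutive floats, so writing through a float pointer is legal):
#include <emmintrin.h> // SSE intrinsics
#include <complex>
// interleaves four pairs per iteration: RRRR + IIII -> RIRI RIRI
void interleave(const float* re, const float* im, std::complex<float>* out, int n)
{
    float* dst = reinterpret_cast<float*>(out);
    int i = 0;
    for (; i + 4 <= n; i += 4)
    {
        __m128 r = _mm_loadu_ps(re + i);
        __m128 v = _mm_loadu_ps(im + i);
        _mm_storeu_ps(dst + 2 * i,     _mm_unpacklo_ps(r, v)); // r0 i0 r1 i1
        _mm_storeu_ps(dst + 2 * i + 4, _mm_unpackhi_ps(r, v)); // r2 i2 r3 i3
    }
    for (; i < n; ++i) // scalar tail
        out[i] = std::complex<float>(re[i], im[i]);
}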

QT QOpenGLWidget : how to modify individual vertices values in VBO without using data block copy?

I don't know if it is possible or not:
- I have an array of QVector3D vertices that I copy to a VBO
- sometimes I want to modify only the z value of a range of vertices between the values (x1, y1) and (x2, y2) - the concerned vertices strictly follow each other
- my "good" idea is to modify only the z values with direct access to the VBO.
I have searched a lot, but all the solutions I saw use memcpy, something like this:
m_vboPos.bind();
GLfloat* PosBuffer = (GLfloat*) m_vboPos.map(QOpenGLBuffer::WriteOnly);
if (PosBuffer != (GLfloat*) NULL)
{
    memcpy(PosBuffer, m_Vertices.constData(), m_Vertices.size() * sizeof(QVector3D));
    m_vboPos.unmap();
    m_vboPos.release();
}
But that copies whole blocks of data.
I don't think using memcpy to change only one float value in every concerned vertex would be very efficient (I have several million vertices in the VBO).
I'd just like to optimize, because copying millions of vertices takes a (too) long time: is there a way to achieve my goal (without memcpy?) for only one float here and there? (I already tried that but couldn't make it work; I must be missing something.)
This call here
GLfloat* PosBuffer = (GLfloat*) (m_vboPos.map(QOpenGLBuffer::WriteOnly));
will internally call glMapBuffer, which means that it just maps the buffer contents into the address space of your process (see also the OpenGL Wiki on Buffer Object Mapping).
Since you map it write-only, you can simply overwrite each and every bit of the buffer as you see fit. There is no need to use memcpy; you can use any means to write to memory, e.g. you can directly do
PosBuffer[3*vertex_id + 2] = 42.0f; // assuming 3 floats per vertex
I don't think using memcpy to change only 1 float value in every concerned vertex would be very efficient (I have several millions of vertices in the VBO).
Yes, doing a million separate memcpy() calls for 4 bytes each will not be a good idea. A modern compiler might actually inline it, so it might be equivalent to just individual assignments, though. But you can also do the assignments directly, since memcpy is not gaining you anything here.
However, it is not clear what the performance impact of all this is. glMapBuffer might return a pointer to:
- some local copy of the VBO in system memory, whose contents will later have to be copied to the GPU. Since the driver does not know which values you changed, it might have to re-transmit the whole buffer.
- some system memory inside the GART area, which is mapped on the GPU, so the GPU will directly access this memory when reading from the buffer.
- some I/O-mapped region in VRAM. In this case, the caching behavior of the memory region might be significantly different, and changing 4 bytes in every 12-byte block might not be the ideal approach; re-copying the whole sub-block as one big chunk might yield better performance.
The mapping itself is also not free: it involves changing the page tables, and the GL driver might have to synchronize its threads or, in the worst case, synchronize with the GPU (to prevent you from overwriting data the GPU is still using for a previous draw call that is still in flight).
sometimes I want to modify only the z value of a range of vertices between the values (x1, y1) and (x2, y2) - the concerned vertices strictly follow each other
So you have a contiguous sub-region of the buffer that you want to modify. I would recommend looking at two alternatives:
1. Use glMapBufferRange (if available in your OpenGL version) to map only the region you care about, as sketched after this list.
2. Forget about buffer mapping completely and try glBufferSubData(). Not individually for each z component of each vertex, but as one big chunk for the whole range of modified vertices. This implies that you keep a local copy of the buffer contents in your memory; just update it and send the result to the GL.
Which option is better will depend on a lot of different factors, and I would not rule either out without benchmarking the actual scenario on the actual implementations you care about. Also have a look at the general strategies for Buffer Object Streaming in OpenGL. A persistently mapped buffer might or might not also be a good option for your use case.
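To make the first alternative concrete, here is a minimal, untested sketch of mapping only the modified range; firstVertex, vertexCount, newZ and vbo are illustrative names:
// each vertex is 3 floats (x, y, z)
GLintptr offset = firstVertex * 3 * sizeof(GLfloat);
GLsizeiptr length = vertexCount * 3 * sizeof(GLfloat);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
GLfloat* ptr = (GLfloat*) glMapBufferRange(GL_ARRAY_BUFFER, offset, length, GL_MAP_WRITE_BIT);
if (ptr)
{
    for (GLsizeiptr v = 0; v < vertexCount; ++v)
        ptr[3 * v + 2] = newZ[v]; // touch only the z component
    glUnmapBuffer(GL_ARRAY_BUFFER);
}
Do not pass GL_MAP_INVALIDATE_RANGE_BIT here: the x and y values inside the mapped range must survive, since only z is rewritten.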
The glMap method works great and is really FAST!
Thanks a lot genpfault, the speed gain is so great that the 3D rendering isn't choppy anymore.
Here is my new code, simplified to offer an easy-to-understand answer:
vertexbuffer.bind();
GLfloat* posBuffer = (GLfloat*) vertexbuffer.map(QOpenGLBuffer::WriteOnly);
if (posBuffer != (GLfloat*) NULL)
{
    // index of the first vertex on line area.y
    int index = NumberOfVertices(area.y + 1, image.cols);
    for (row = ...)
    {
        for (col = ...)
        {
            if (mask.at<uchar>(row, col) != 0)
                posBuffer[3 * index + 2] = depthmap.at<uchar>(row, col) * depth;
            index++;
        }
    }
}
vertexbuffer.unmap();
vertexbuffer.release();

Will iterating multiple images with CUDA increase performance?

The code has been oversimplified just for the purposes of the question.
Before I set up a CUDA environment and make any changes to my code, I wanted to get input on whether executing the code below would be much faster on a GPU.
The code basically iterates through the images and copies each image pixel value to dst only if the corresponding mask value is not zero. The number of images can be as high as 10, and the size of each image can be around 2K by 2K.
If I use #pragma omp it does increase performance. So the question is: will performance increase significantly if I execute this code on a GPU (assuming a good graphics card like a GTX 1050), with each thread handling a separate image?
for (int i = 0; i < images.size(); ++i)
{
    for (int y = 0; y < images[i].height; ++y)
    {
        for (int x = 0; x < images[i].width; ++x)
        {
            bool maskVal = masks[i][y][x];
            if (maskVal)
            {
                dst[i][y][x] = images[i].data(x, y);
            }
        }
    }
}
In this case I would guess not. That piece of code would probably execute faster if the images and destination were already in GPU memory. However, if you plan to take the images and masks from main memory, copy them over the PCI bus, execute the code on the GPU, and then transfer the result back to the CPU, you will be much better off just running this code on the CPU. If you think you will do further parallel processing on the dst images, though, you might as well get them onto the GPU and do this there, since you will have to pay that transfer penalty anyway. The reason OpenMP is faster is that it uses CPU threads, which share the same memory, so no extra copying is required.
Examples of things that do run well on a GPU are image convolution or the Fourier transform, since those tasks are much heavier than what you are doing, so the overhead of the memory transfer matters less.
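For reference, if the images do end up resident on the GPU, the kernel itself is short. A minimal CUDA sketch with one thread per pixel (names are illustrative, and it assumes flat per-image device buffers rather than your nested containers):
__global__ void maskedCopy(const unsigned char* image, const unsigned char* mask,
                           unsigned char* dst, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;
    int idx = y * width + x;
    if (mask[idx] != 0)
        dst[idx] = image[idx]; // copy only where the mask is set
}
// launched once per image, e.g.:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// maskedCopy<<<grid, block>>>(d_image, d_mask, d_dst, width, height);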

Performance Loss when Writing to Memory Buffer (C++)

I am writing a small renderer (based on the rasterisation algorithm). It's a personal project I am doing to test different techniques. I was measuring the time it takes to render a bunch of triangles, and while doing this I noticed something strange. What the program does is write to an image buffer (a 1D array of Vec3ui) if a given pixel overlaps a 2D triangle and passes some other test (it writes the color of that triangle into the buffer).
Vec3<unsigned char> *fb = new Vec3<unsigned char>[w * h];
...
void rasterize(
    ...,
    Vec3<unsigned char> *&fb,
    float *&zbuffer)
{
    Vec3<unsigned char> randcol(drand48() * 255, drand48() * 255, drand48() * 255);
    ...
    uint32_t x, y;
    // loop over the bounding box of the triangle,
    // checking whether each pixel is inside it
    for (y = ymin, p.y = ymin; y <= ymax; ++y, ++p.y)
    {
        for (x = xmin, p.x = xmin; x <= xmax; ++x, ++p.x)
        {
            if (pixelOverTriangle(...))
            {
                fb[y * w + x] = randcol;
            }
        }
    }
}
When I measured the stats, I thought that what would actually take the longest in the process is rendering the triangles, doing all the tests, etc. It happens that when I run the program with a given number of triangles I get the following render time:
74 ms
But when I comment out the line where I write to the image buffer, I get:
5 ms
So to be clear, I do:
if (pixelOverTriangle(...))
{
    // fb[y * w + x] = randcol;
}
In fact more than 90% of the time is spent writing to the image buffer!
I have to say that I tried optimising how the index used to access elements in the array is computed, but that is not where the time goes. The time goes into actually copying the value on the right into the buffer (or so it seems).
I am very surprised by these numbers.
So I have a few questions:
- Is it expected?
- Am I doing something wrong?
- Can I make it better? What techniques can I use to optimise this?
A lot more goes into a memory read/write than C++ makes it seem. More often than not, your processor caches blocks of memory for quick access; this vastly improves performance for data in contiguous memory: arrays, structs, and the stack, for example. However, upon trying to access memory that has not been cached (a cache miss), the processor has to cache a new block of memory, which takes significantly longer (minutes or even hours, if a cycle were scaled to a second). By accessing arbitrary segments of a long block of memory - like your image - you are practically guaranteeing continual cache misses.
To make matters worse, computer memory (RAM) lies on virtual pages that are swapped in and out of physical memory all the time. If your image is big enough to lie across multiple memory pages (usually around 4 KB each), then your operating system may actually be loading and unloading data from secondary storage (your hard drive), which you can imagine taking much longer than a direct read from memory.
I found an article from another Stack Overflow question about cache performance that might answer your question better than I can. Really, it's just important to be aware of what a memory read/write actually does, and how drastically that can affect performance.
A possible answer, which you'll have to check out...
The compiler might notice that your code does nothing and remove it. Look at the disassembly of the function and see if it is actually doing any calculations.
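One quick way to test that hypothesis without reading assembly (a hedged sketch, mirroring the elided loop from the question): give the loop body a side effect the optimizer is not allowed to remove, and see whether the run time jumps back up.
volatile unsigned char sink = 0; // volatile writes cannot be optimized away
...
if (pixelOverTriangle(...))
{
    sink = sink + 1; // side effect in place of the commented-out store
}
If the time stays near 5 ms with this change, the overlap test really is cheap and the framebuffer store dominates; if it jumps back up, the earlier 5 ms figure was the compiler deleting the whole loop.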

CUDA unified memory between GPU and host

I'm writing a CUDA-based program that needs to periodically transfer a set of items from the GPU to host memory. In order to keep the process asynchronous, I was hoping to use CUDA's UMA to have a memory buffer and flag in host memory (so both the GPU and the CPU can access it). The GPU would make sure the flag is clear, add its items to the buffer, and set the flag. The CPU waits for the flag to be set, copies things out of the buffer, and clears the flag. As far as I can see, this doesn't produce any race condition, because it forces the GPU and CPU to take turns, always reading and writing the flag opposite each other.
So far I haven't been able to get this to work because there does seem to be some sort of race condition. I came up with a simpler example that has a similar issue:
#include <stdio.h>

__global__
void uva_counting_test(int n, int *h_i);

int main() {
    int *h_i;
    int n;
    cudaMallocHost(&h_i, sizeof(int));
    *h_i = 0;
    n = 2;
    uva_counting_test<<<1, 1>>>(n, h_i);
    // even numbers
    for (int i = 1; i <= n; ++i) {
        // wait for a change to odd from the GPU
        while (*h_i == (2 * (i - 1)));
        printf("host h_i: %d\n", *h_i);
        *h_i = 2 * i;
    }
    return 0;
}

__global__
void uva_counting_test(int n, int *h_i) {
    // odd numbers
    for (int i = 0; i < n; ++i) {
        // wait for a change to even from the host
        while (*h_i == (2 * (i - 1) + 1));
        *h_i = 2 * i + 1;
    }
}
For me, this case always hangs after the first print statement from the CPU (host h_i: 1). The really unusual thing (which may be a clue) is that I can get it to work in cuda-gdb. If I run it in cuda-gdb, it will hang as before. If I press Ctrl+C, it brings me to the while() loop line in the kernel. From there, surprisingly, I can tell it to continue and it will finish. For n > 2, it freezes on the while() loop in the kernel again after each iteration, but I can keep pushing it forward with Ctrl+C and continue.
If there's a better way to accomplish what I'm trying to do, that would also be helpful.
You are describing a producer-consumer model, where the GPU is producing some data and from time to time the CPU will consume that data.
The simplest way to implement this is to have the CPU be the master. The CPU launches a kernel on the GPU; when it is ready to consume data (i.e. the while loop in your example), it synchronises with the GPU, copies the data back from the GPU, launches the kernel again to generate more data, and does whatever it has to do with the data it copied. This allows you to have the GPU filling a fixed-size buffer while the CPU is processing the previous batch (since there are two copies, one on the GPU and one on the CPU).
That can be improved upon by double-buffering the data, meaning you can keep the GPU busy producing data 100% of the time by ping-ponging between buffers as you copy the other one to the CPU. That assumes the copy-back is faster than the production; if not, you will saturate the copy bandwidth, which is also good.
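A host-side sketch of that double-buffered scheme, using two streams and pinned host memory so the copy-back can overlap the next kernel (produce_kernel, numBatches, N and the launch configuration are illustrative):
int *d_buf[2], *h_buf;
cudaMalloc(&d_buf[0], N * sizeof(int));
cudaMalloc(&d_buf[1], N * sizeof(int));
cudaMallocHost(&h_buf, N * sizeof(int)); // pinned, required for async copies
cudaStream_t compute, copy;
cudaStreamCreate(&compute);
cudaStreamCreate(&copy);
int cur = 0;
for (int batch = 0; batch < numBatches; ++batch) {
    // produce the current batch on the GPU
    produce_kernel<<<grid, block, 0, compute>>>(d_buf[cur], N);
    if (batch > 0) {
        // meanwhile, pull back and consume the previous batch
        cudaMemcpyAsync(h_buf, d_buf[1 - cur], N * sizeof(int),
                        cudaMemcpyDeviceToHost, copy);
        cudaStreamSynchronize(copy);
        // ... consume h_buf on the CPU here ...
    }
    cudaStreamSynchronize(compute); // current batch finished, swap buffers
    cur = 1 - cur;
}
// after the loop, copy back and consume the final batch the same way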
Neither of those is what you actually described. What you asked for is to have the GPU master the data. I'd urge caution with that, since you will need to manage your buffer size carefully and think carefully about timing and communication issues. It's certainly possible to do something like that, but before you explore that direction you should read up on memory fences, atomic operations, and volatile.
I'd try to add
__threadfence_system();
after
*h_i = 2*i + 1;
See here for details. Without it, it's entirely possible that the modification stays in the GPU cache forever. However, you had better listen to the other answers: to improve this for multiple threads/blocks, you have to deal with other "problems" to get a similar scheme to work reliably.
As Tom suggested (+1), it is better to use double buffering. Streams help such a scheme a lot, as you can find depicted here.