When I call glGetBufferSubData on my Shader Storage Buffer Object, there is typically a 4 ms delay. Is it possible for my application to do other work during that time?
// start GetBufferSubData
// do client/app/CPU work
// (wait if needed)
// read results from GetBufferSubData
Or is there some sort of API I could use to asynchronously start copying buffer data back from the GPU?
I was able to get an async readback working using glMapBufferRange with GL_MAP_PERSISTENT_BIT. However, when running a compute shader on that buffer multiple times back to back, this resulted in a massive performance degradation compared to not using a persistent mapping.
The issue with simply marking the buffer with GL_MAP_PERSISTENT_BIT was a substantial performance hit (8x slower) when running a compute shader on that buffer (profiled using Nvidia Nsight Graphics). I suspect that, because of the mapping, OpenGL has to place the buffer in a memory region that is less performant for the GPU but more accessible to the CPU.
My solution was to create a much smaller (1000x smaller, 16 KB) persistently mapped buffer that the CPU uses to read from and write to the larger buffer in small increments as needed. This combination was much faster, with only minor API overhead, and met my needs.
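Roughly, the setup looks like this (a minimal sketch, not my exact code; bigSSBO, srcOffset, bytesNeeded and results are placeholder names, and it assumes OpenGL 4.4 for glBufferStorage):
// One-time setup: a small (16 KB) staging buffer with a persistent, coherent read mapping.
GLuint stagingBuf;
glGenBuffers(1, &stagingBuf);
glBindBuffer(GL_COPY_WRITE_BUFFER, stagingBuf);
glBufferStorage(GL_COPY_WRITE_BUFFER, 16 * 1024, nullptr,
                GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
void* stagingPtr = glMapBufferRange(GL_COPY_WRITE_BUFFER, 0, 16 * 1024,
                                    GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);

// Per readback: copy a slice of the large, unmapped SSBO into the staging buffer on the GPU,
// then fence instead of blocking right away.
glBindBuffer(GL_COPY_READ_BUFFER, bigSSBO);
glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, srcOffset, 0, bytesNeeded);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// ... do other CPU work here ...

// When the data is actually needed, wait on the fence and read through the persistent pointer.
glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(1000000000)); // up to 1 s
glDeleteSync(fence);
memcpy(results, stagingPtr, bytesNeeded);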
Related
Through Visual Studio profiling I see that nearly 50% of my program's execution time is spent in KernelBase.dll. What's it doing in there? I don't know, but the main thing calling it is nvoglv64.dll. To my understanding, nvoglv64.dll is the OpenGL driver, and the main thing calling nvoglv64.dll is one of my functions. This is the function:
void draw() {
    if (mapped)
    {
        mapped = false;
        glBindBuffer(GL_ARRAY_BUFFER, triangleBuffer);
        glUnmapBuffer(GL_ARRAY_BUFFER);   // release the mapping before drawing from the buffer
    }
    glBindVertexArray(trianglesVAO);
    program.bind();
    glDrawElements(GL_TRIANGLES, ...);
}
The way I use this function is as follows: I asynchronously map a GL buffer to client memory and fill it with a large number of triangles, then I draw the buffer using the above function. Except I actually use two buffers: each frame I swap which one is being drawn and which one is being written to.
It's all supposed to be asynchronous. Even glUnmapBuffer and glDrawElements are supposed to be asynchronous; they should just get put in a command queue. What is causing the slowdown?
Through Visual Studio profiling I see that nearly 50% of my program's execution time is spent in KernelBase.dll. What's it doing in there?
Mapping, unmapping and checking the status of memory as you would expect.
If you want everything to truly be asynchronous, and run the risk of clobbering data that has not been rendered yet, you can map unsynchronized buffers (see glMapBufferRange (...)).
Otherwise, there is some internal driver synchronization to prevent you from modifying memory that has not been drawn yet, and that is what you are seeing here. You cannot keep everything asynchronous unless you have enough buffers to accommodate every pending frame.
Now, the thing is that what you just described (2 buffers) is not adequate. You need multiple levels of buffering to prevent pipeline stalls: the driver is (usually) allowed to queue up more than 1 frame's worth of commands, and the CPU will not be allowed to change memory used by any pending frame (the driver will block when you attempt to call glMapBuffer (...)) until that frame is finished.
3 buffers is a good compromise; you might still incur synchronization overhead every once in a while if the CPU gets more than 2 frames ahead of the GPU. That situation (more than 2 pre-rendered frames) indicates that you are GPU-bound, and CPU/driver synchronization for 1 frame will not actually change that, so you can probably afford to block the render thread.
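To make the multi-buffer idea concrete, here is a rough sketch of per-buffer fencing with three buffers in flight (buffer allocation, attribute setup and the function/variable names are assumptions, not from the question):
// Round-robin over 3 pre-allocated vertex buffers, fencing each one after it is drawn,
// so the CPU only waits when it wraps around to a buffer the GPU has not finished with yet.
const int kBufferCount = 3;
GLuint buffers[kBufferCount];   // assumed created and sized with glBufferData elsewhere
GLsync fences[kBufferCount] = { 0, 0, 0 };
int current = 0;

void writeAndDraw(const void* vertexData, GLsizeiptr bytes, GLsizei vertexCount)
{
    // Block only if the GPU is still using this buffer from a few frames ago.
    if (fences[current]) {
        glClientWaitSync(fences[current], GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(1000000000));
        glDeleteSync(fences[current]);
        fences[current] = 0;
    }

    glBindBuffer(GL_ARRAY_BUFFER, buffers[current]);
    // The fence told us the GPU is done with this buffer, so an unsynchronized map is safe here.
    void* dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, bytes,
                                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    memcpy(dst, vertexData, bytes);
    glUnmapBuffer(GL_ARRAY_BUFFER);

    glDrawArrays(GL_TRIANGLES, 0, vertexCount);

    fences[current] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    current = (current + 1) % kBufferCount;
}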
I can use glTexImage2D or glBufferData to send some data to GPU memory. Let's assume that I ask the driver to send more data to the GPU but the GPU memory is already full. I will probably get GL_OUT_OF_MEMORY. What might happen to a rendering thread? What are the possible scenarios? Is it possible that the rendering thread will be terminated?
It depends on the actual OpenGL implementation, but the most likely scenario is that you'll just encounter a serious performance drop while things keep working.
OpenGL uses an abstract memory model, and actual implementations treat the GPU's own memory as a cache. In fact, in most OpenGL implementations texture data doesn't even go directly to the GPU when you load it; it gets loaded into GPU RAM only when it is actually required for rendering. If more textures are in use than fit into GPU RAM, textures are swapped in and out of GPU RAM as needed to complete the rendering.
Older GPU generations required a texture to fit completely into their RAM. GPUs released after about 2012 can access texture subsets from host memory as required, which lifts that limit. In fact, you're more likely to run into maximum texture dimension limits than into memory limits (BT;DT).
Of course, other, less well developed OpenGL implementations may bail out with an out-of-memory error, but at least for AMD and NVidia that's not an issue.
I am new to OpenGL and I would like somebody to explain to me how the program uses the GPU.
I have an array of triangles (a class that contains 3 points). Here is the code that draws them (I know these functions are deprecated).
glBegin(GL_LINES);
for (int i = 0; i < trsize; ++i) {
    glVertex3d((GLdouble)trarr[i].p1().x(), (GLdouble)trarr[i].p1().y(), (GLdouble)trarr[i].p1().z());
    glVertex3d((GLdouble)trarr[i].p2().x(), (GLdouble)trarr[i].p2().y(), (GLdouble)trarr[i].p2().z());
    glVertex3d((GLdouble)trarr[i].p3().x(), (GLdouble)trarr[i].p3().y(), (GLdouble)trarr[i].p3().z());
}
glEnd();
And I also use deprecated functions for rotating, transforming, etc.
When the size of the array is bigger than 50k, the program runs very slowly.
I tried using only the Intel HD or only the NVidia GTX 860M (the default NVidia program allows choosing the GPU), but they both run very slowly. If anything, the Intel HD works a bit faster.
So, why is there no difference between these two GPUs?
And will the program work faster using shaders?
The probable bottleneck is looping over the vertices, accessing the array and pulling out the vertex data 50,000 times per render, then sending the data to the GPU for rendering.
Using a VBO would indeed be faster, and it collapses the cost of extracting the data and sending it to the GPU into a one-time cost at initialization.
Even using a user-memory buffer would speed it up, because you won't be calling 50k functions; the driver can just do a memcpy of the relevant data.
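A sketch of what that could look like, assuming the triangles have been flattened into a contiguous float array (flatVerts and vertexCount are hypothetical names):
// Client-memory vertex array (legacy API): one pointer + one draw call instead of
// three glVertex3d calls per triangle; the driver copies the whole block at once.
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, flatVerts);   // flatVerts: packed x,y,z floats in client memory
glDrawArrays(GL_TRIANGLES, 0, vertexCount);   // vertexCount = number of vertices
glDisableClientState(GL_VERTEX_ARRAY);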
When the size of array is bigger than 50k, the program works very slow.
The major bottleneck when drawing in immediate mode is that all your vertices have to be transferred every frame from your program's memory to GPU memory. The bus between the CPU and the GPU is limited in the amount of data it can transfer, so the best guess is that 50k triangles are simply more than the bus can transport. Another problem is that the driver has to process on the CPU all the commands you send it, which can also be a big overhead.
So, why there is no difference between these two GPUs?
There is (in general) a huge performance difference between the Intel HD graphics and an NVIDIA card, but the bus between the CPU and the GPU might be the same in both cases.
And will the program work faster with using shaders?
It will not benefit directly from the use of shaders, but it definitely will from storing the vertices once in GPU memory (see VBO/VAO). The second improvement is that you can render the whole VBO using only one draw call, which decreases the number of instructions the CPU has to handle.
Seeing the same performance with two GPUs that have substantially different performance potential certainly suggests that your code is CPU limited. But I very much question some of the theories about the performance bottleneck in the other answers/comments.
Some simple calculations suggest that memory bandwidth should not come into play at all. With 50,000 triangles, with 3 vertices each, and 24 bytes per vertex, you're looking at 3,600,000 bytes of vertex data per frame. Say you're targeting 60 frames/second, this is a little over 200 MBytes/second. That's less than 1% of the memory bandwidth of a modern PC.
The most practical implementation of immediate mode on a modern GPU is for drivers to collect all the data into buffers, and then submit it all at once when a buffer fills up. So there's no need for a lot of kernel calls, and the data for each vertex is certainly not sent to the GPU separately.
Driver overhead is most likely the main culprit. With 50,000 triangles, and 3 API calls per triangle, this is 150,000 API calls per frame, or 9 million API calls/second if you're targeting 60 frames/second. That's a lot! For each of these calls, you'll have:
Looping and array accesses in your own code.
The actual function call.
Argument passing.
State management and logic in the driver code.
etc.
One important aspect that makes this much worse than it needs to be: You're using double values for your coordinates. That doubles the amount of data that needs to be passed around compared to using float values. And since the OpenGL vertex pipeline operates in single precision (*), the driver will have to convert all the values to float.
I suspect that you could get a significant performance improvement even with the deprecated immediate mode calls if you started using float for all your coordinates (both in your own storage, and for passing them to OpenGL). You could also use the version of the glVertex*() call that takes a single argument with a pointer to the vector, instead of 3 separate arguments; for float vectors that is glVertex3fv().
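Roughly, the loop could then look like this; the .data() accessor returning a pointer to three consecutive floats is hypothetical and depends on how the point class stores its coordinates:
glBegin(GL_LINES);
for (int i = 0; i < trsize; ++i) {
    // One call per vertex, passing a pointer to 3 floats instead of 3 separate doubles.
    glVertex3fv(trarr[i].p1().data());
    glVertex3fv(trarr[i].p2().data());
    glVertex3fv(trarr[i].p3().data());
}
glEnd();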
Making the move to VBOs is of course the real solution. It will reduce the number of API calls by orders of magnitude and avoid any data copying as long as the vertex data does not change over time.
(*) OpenGL 4.1 adds support for double vertex attributes, but they require the use of specific API functions, and only make sense if single precision floats really are not precise enough.
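To make the VBO route concrete, a minimal sketch assuming the triangle data has been flattened into a std::vector<float> named flatVerts (x, y, z per vertex), using the legacy fixed-function setup to match the rest of the question:
// One-time upload: the vertex data crosses the bus once, at initialization.
GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, flatVerts.size() * sizeof(float),
             flatVerts.data(), GL_STATIC_DRAW);

// Per frame: a single draw call reads the vertices straight out of GPU memory.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (const void*)0);   // offset 0 into the bound VBO
glDrawArrays(GL_TRIANGLES, 0, (GLsizei)(flatVerts.size() / 3));
glDisableClientState(GL_VERTEX_ARRAY);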
Why do mipmaps give a performance boost?
I read somewhere on the net that "when we have 256x256 texture data and want to map it to a 4x4 area, the driver will copy only the generated 4x4 mipmap level to GPU memory and not the 256x256 data, and sampling will work on the 4x4 data copied to GPU memory, which will save a lot of computation". I just want to know whether this is correct or not.
Also, when the glTexImage call happens, it uploads to GPU memory the texture data that is passed in the glTexImage call, which conflicts with the above statement. When we call glGenerateMipmap(), does it create all the mipmap levels on the CPU side and then later copy all those levels to the GPU side?
Mipmapping boosts performance for two main reasons. The first is because it decreases bandwidth between GPU memory and the texture unit and the second is because it improves caching inside the texture unit.
Imagine you have a distant object with a texture applied to it. With mipmapping, instead of processing, reading and caching a huge texture (level 0), the GPU reads a smaller one. In this scenario the bandwidth reduction is quite obvious.
Bandwidth and caching go hand in hand with mipmapping. Sampling the texture of the distant object will most likely read the texture in a sparse manner, for example reading texel (x, y) for one fragment and (x+10, y+10) for the next. (x+10, y+10) may be in another cache line, but when reading from a smaller texture the cache misses can be far fewer.
glTexImageXX creates a single level and uploads the pixel data to the GPU. When glGenerateMipmap is called, the driver will allocate GPU memory for the additional levels and most likely execute one or more GPU jobs to fill those levels. In some cases it may do the same thing on the CPU and then upload the data to the GPU.
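For reference, a typical call sequence; the 256x256 size, RGBA8 format and the pixels pointer are placeholders:
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
// Uploads level 0 only (256x256).
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 256, 256, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, pixels);
// Driver fills levels 1..8 (the 4x4 image is level 6), usually on the GPU.
glGenerateMipmap(GL_TEXTURE_2D);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);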
I am creating something similar to CUDA, but I have noticed that copying memory from RAM to VRAM is very fast, about as fast as copying RAM to RAM, while copying from VRAM to RAM is much slower than RAM to VRAM.
By the way, I am using glTexSubImage2D to copy from RAM to VRAM and glGetTexImage to copy from VRAM to RAM.
Why? And is there a way to bring its performance closer to the RAM-to-VRAM copy?
Transferring data from the GPU to the CPU has always been a very slow operation.
A GPU -> CPU readback introduces a "sync point" where the CPU must wait for the GPU to complete its calculations. During this time, the CPU stops feeding the GPU with data, causing it to stall.
Now, remember that a modern GPU is designed in a highly parallel manner, with thousands of threads in flight at any given moment. The sync point must wait for all those threads to finish processing before it can read back the results of their calculations. Once the readback is complete, all those threads must restart execution from zero... bad!
Reading back the results asynchronously (a few frames later) allows the GPU to continue execution without its threads starving (the stop-and-resume issue outlined above). This improves performance tremendously: the more parallel the GPU, the higher the performance improvement.
Depending on your graphics chip and driver, you may get better performance by using PBOs.
By the way, I am using glTexSubImage2D to copy from RAM to VRAM and glGetTexImage to copy from VRAM to RAM.
Then you're not copying data; you're performing pixel transfer operations, which can require CPU modification depending on your image's internal format, the pixel transfer format, and the pixel transfer type parameters.
Since you didn't provide actual code, there's no way to know whether you chose bad parameters or not.
If you want to test direct copying performance, use a buffer object.
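As a sketch of what that could look like, here is an asynchronous VRAM -> RAM readback through a pixel pack buffer (PBO); the texture name tex and the 256x256 RGBA8 size are placeholders:
// Allocate a pack buffer big enough for one 256x256 RGBA8 image.
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, 256 * 256 * 4, nullptr, GL_STREAM_READ);

glBindTexture(GL_TEXTURE_2D, tex);
// With a pack buffer bound, the last argument is an offset into the PBO, not a client
// pointer, and the call can return without waiting for the copy to finish.
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, (void*)0);

// ... do other work, then map the buffer when the data is actually needed ...
const void* data = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
// ... use data ...
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);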