OpenGL large SSBO hangs when reading - C++

I have four SSBOs, each holding 400400100 ints, and my fragment shader updates their values. I trigger this with a glDrawElements call, and I read the data back afterwards by binding the specific SSBO, calling glMapBuffer, and casting the returned pointer to an int pointer.
The GPU does heavy processing (loops with 10000 iterations) and updates the SSBO. A print statement after the glDrawElements call appears on screen, and so does one after the bind call, but the glMapBuffer call takes forever and hangs the system.
In Windows Task Manager, the GPU is idle most of the time and only the CPU is busy. I think that is because the GPU is only used during the draw call.
Also, is my understanding correct that when I call glMapBuffer, only the bound SSBO is transferred?
Do you have any suggestions as to what might be causing this issue?
I also tried glMapBufferRange, which caused a similar problem.
std::cout << "done rendering" << std::endl; //prints this out
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, 0);
std::cout << "done binding ssbo" << std::endl; //prints this out
GLint *ptr;
ptr = (GLint *) glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY); //hangs here


OpenGL, glMapNamedBuffer takes a long time

I've been writing an OpenGL program that generates vertices on the GPU using compute shaders. The problem is that I need to read back, on the CPU, the number of vertices from a buffer written by one compute shader dispatch, so that I can allocate a buffer of the right size for the next dispatch to fill with vertices.
// Stage 1 - Populate the 3D texture with voxel values
glBindTexture(GL_TEXTURE_3D, _RandomSeedTexture);
glBindImageTexture(2, _VoxelValuesTexture, 0, GL_TRUE, 0, GL_READ_WRITE, GL_R32F); // layer is a GLint, so pass 0 rather than NULL
_EvaluateVoxels.SetVec3("CellSize", voxelCubeDims);
_EvaluateVoxels.SetVec3("StartPos", chunkPosLL);
glDispatchCompute(voxelDim.x + 1, voxelDim.y + 1, voxelDim.z + 1);
// Stage 2 - Calculate the marching cubes case for each cube of 8 voxels,
// listing those that contain polygons and counting the number of vertices that will be produced
_GetNonEmptyVoxels.SetFloat("IsoLevel", isoValue);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, _IntermediateDataSSBO);
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, _AtomicCountersBuffer);
glDispatchCompute(voxelDim.x, voxelDim.y, voxelDim.z);
//printStage2(_IntermediateDataSSBO, true);
// this line takes a long time
unsigned int* vals = (unsigned int*)glMapNamedBuffer(_AtomicCountersBuffer, GL_READ_WRITE);
unsigned int vertex_counter = vals[1];
unsigned int index_counter = vals[0];
vals[0] = 0;
vals[1] = 0;
The image below shows the time in milliseconds that each stage of the code takes to run. "timer Evaluate" refers to the method as a whole, i.e. the sum of the previous stages; "getvertexcounter" refers only to the mapping, reading and unmapping of the buffer containing the number of vertices. Please see the code for more detail.
I've found this to be by far the slowest stage in the process, and I gather it has something to do with the asynchronous nature of the communication between OpenGL and the GPU, and with the need to synchronise data written by the compute shader so that the CPU can read it. My question is this: is this delay avoidable? I don't think the overall approach is flawed, because I know that someone else has implemented the algorithm in a similar way, albeit using DirectX (I think).
You can find my code at , the code in question is in the file ComputeShaderMarcher.cpp and the method unsigned int ComputeShaderMarcher::GenerateMesh(const glm::vec3& chunkPosLL, const glm::vec3& chunkDim, const glm::ivec3& voxelDim, float isoValue, GLuint VBO)
In order to access data from a buffer that you have had OpenGL write some data to, the CPU must halt execution until the GPU has actually written that data. Whatever process you use to access this data (glMapBufferRange, glGetBufferSubData, etc), that process must halt until the GPU has finished generating the data.
So don't try to access GPU-generated data until you're sure the GPU has actually generated it (or you have absolutely nothing better to do on the CPU than wait). Use fence sync objects to test whether the GPU has finished executing past a certain point.
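As a rough sketch of the polling pattern described above (not code from the original answer): the helper below keeps doing CPU work until a fence reports that the GPU has passed it. `WaitResult`, `pollFence`, `tryWait` and `doOtherWork` are hypothetical names introduced for illustration. In a real program `tryWait` would wrap `glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0)` for a fence created with `glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)` right after the dispatch; the GL calls are left out here so the snippet compiles without a GL context.

```cpp
#include <cassert>
#include <functional>

// Hypothetical enum mirroring the four possible outcomes of glClientWaitSync.
enum class WaitResult { AlreadySignaled, ConditionSatisfied, TimeoutExpired, WaitFailed };

// Polls a fence at most `maxPolls` times without blocking.
// tryWait:     one non-blocking check of the fence (assumed to wrap
//              glClientWaitSync with a timeout of 0 in real code).
// doOtherWork: CPU work to do while the GPU is still busy.
// Returns true once the fence has signaled, false on failure or if the
// fence is still unsignaled after maxPolls checks.
inline bool pollFence(const std::function<WaitResult()>& tryWait,
                      const std::function<void()>& doOtherWork,
                      int maxPolls) {
    assert(maxPolls > 0);
    for (int i = 0; i < maxPolls; ++i) {
        switch (tryWait()) {
            case WaitResult::AlreadySignaled:
            case WaitResult::ConditionSatisfied:
                return true;      // GPU has finished; the buffer can now be read
            case WaitResult::WaitFailed:
                return false;     // an error occurred; do not read the buffer
            case WaitResult::TimeoutExpired:
                doOtherWork();    // GPU still busy; use the time productively
                break;
        }
    }
    return false;
}
```

Only after `pollFence` returns true would the buffer be mapped or read, so the map call itself no longer has to wait for the GPU.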

glGetBufferSubData and glMapBufferRange for GL_SHADER_STORAGE_BUFFER very slow on NVIDIA GTX960M

I've been having some issues with transferring a GPU buffer to the CPU for sorting. The buffer is a GL_SHADER_STORAGE_BUFFER holding 300,000 float values. The transfer takes around 10 ms with glGetBufferSubData, and more than 100 ms with glMapBufferRange.
The code I'm using is the following:
std::vector<GLfloat> viewRow;
unsigned int viewRowBuffer = -1;
int length = -1;
void bindRowBuffer(unsigned int buffer) {
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffer);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, buffer);
}
void initRowBuffer(unsigned int &buffer, std::vector<GLfloat> &row, int lengthIn) {
    // Generate and initialize buffer
    length = lengthIn;
    memset(&row[0], 0, length * sizeof(float));
    glGenBuffers(1, &buffer);
    glBufferStorage(GL_SHADER_STORAGE_BUFFER, row.size() * sizeof(float), &row[0], GL_DYNAMIC_STORAGE_BIT | GL_MAP_READ_BIT | GL_MAP_WRITE_BIT);
}
void cleanRowBuffer(unsigned int buffer) {
    float zero = 0.0;
    glClearNamedBufferData(buffer, GL_R32F, GL_RED, GL_FLOAT, &zero);
}
void readGPUbuffer(unsigned int buffer, std::vector<GLfloat> &row) {
    glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, length * sizeof(float), &row[0]);
}
void readGPUMapBuffer(unsigned int buffer, std::vector<GLfloat> &row) {
    float* data = (float*)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, length * sizeof(float), GL_MAP_READ_BIT);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
    memcpy(&row[0], data, length * sizeof(float));
}
The main is doing:
glBindTexture(GL_TEXTURE_2D, gPatch);
countPixs.setInt("gPatch", 0);, SCR_HEIGHT/8, 1);
readGPUbuffer(viewRowBuffer, viewRow);
Here countPixs is a compute shader, but I'm positive the problem is not there, because if I comment out the run command, the read takes exactly the same amount of time.
The weird thing is that if I execute a getbuffer of only 1 float:
glGetBufferSubData(GL_SHADER_STORAGE_BUFFER,0, 1 *sizeof(float),&row[0]);
It takes exactly the same time... so I'm guessing there is something wrong all-the-way... maybe related to the GL_SHADER_STORAGE_BUFFER?
This is likely a delay caused by GPU-CPU synchronization (a round trip).
That is, once you map your buffer, the previous GL commands that touched the buffer must complete immediately, causing a pipeline stall.
Note that drivers are lazy: it is quite probable that the GL commands have not even started executing yet.
If you can, create the buffer with glBufferStorage(..., GL_MAP_PERSISTENT_BIT) and map it persistently. This completely avoids re-mapping and re-allocating GPU memory, and you can keep the mapped pointer across draw calls, with some caveats:
You likely also need GPU fences to detect or wait for the moment the data is actually available from the GPU. (Unless you like reading garbage.)
The mapped buffer can't be resized. (Since you already use glBufferStorage() you are OK.)
It is probably a good idea to combine GL_MAP_PERSISTENT_BIT with GL_MAP_COHERENT_BIT.
After reading the GL 4.5 docs a bit more, I found that glFenceSync is mandatory to guarantee that the data has arrived from the GPU, even with GL_MAP_COHERENT_BIT:
If GL_MAP_COHERENT_BIT is set and the server does a write, the app must call glFenceSync with GL_SYNC_GPU_COMMANDS_COMPLETE (or glFinish). Then the CPU will see the writes after the sync is complete.

Bind CUDA output array/surface to GL texture in ManagedCUDA

I'm currently attempting to connect some form of output from a CUDA program to a GL_TEXTURE_2D for use in rendering. I'm not that worried about the output type from CUDA (whether it'd be an array or surface, I can adapt the program to that).
So the question is, how would I do that? (my current code copies the output array to system memory, and uploads it to the GPU again with GL.TexImage2D, which is obviously highly inefficient - when I disable those two pieces of code, it goes from approximately 300 kernel executions per second to a whopping 400)
I already have a little bit of test code, to at least bind a GL texture to CUDA, but I'm not even able to get the device pointer from it...
ctx = CudaContext.CreateOpenGLContext(CudaContext.GetMaxGflopsDeviceId(), CUCtxFlags.SchedAuto);
uint textureID = (uint)GL.GenTexture(); //create a texture in GL
GL.TexParameter(TextureTarget.Texture2D, TextureParameterName.TextureMinFilter, (int)TextureMinFilter.Linear);
GL.TexParameter(TextureTarget.Texture2D, TextureParameterName.TextureMagFilter, (int)TextureMagFilter.Linear);
GL.TexImage2D(TextureTarget.Texture2D, 0, PixelInternalFormat.Rgba, width, height, 0, OpenTK.Graphics.OpenGL.PixelFormat.Rgba, PixelType.UnsignedByte, null); //allocate memory for the texture in GL
CudaOpenGLImageInteropResource resultImage = new CudaOpenGLImageInteropResource(textureID, CUGraphicsRegisterFlags.WriteDiscard, CudaOpenGLImageInteropResource.OpenGLImageTarget.GL_TEXTURE_2D, CUGraphicsMapResourceFlags.WriteDiscard); //using writediscard because the CUDA kernel will only write to this texture
//then, as far as I understood the ManagedCuda example, I have to do the following when I call my kernel
//(done without a CudaGraphicsInteropResourceCollection because I only have one item)
var ptr = resultImage.GetMappedPointer(); //this crashes
kernelSample.Run(ptr); //pass the pointer to the kernel so it knows where to write
The following exception is thrown when attempting to get the pointer:
ErrorNotMappedAsPointer: This indicates that a mapped resource is not available for access as a pointer.
What do I need to do to fix this?
And even if this exception can be resolved, how would I solve the other part of my question; that is, how do I work with the acquired pointer in my kernel? Can I use a surface for that? Access it as an arbitrary array (pointer arithmetic)?
Looking at this example, apparently I don't even need to map the resource every time I call the kernel, and call the render function. But how would this translate to ManagedCUDA?
Thanks to the example I found, I was able to translate that to ManagedCUDA (after browsing the source code and fiddling around), and I'm happy to announce that this does really improve my samples per second from about 300 to 400 :)
Apparently a 3D array is required (I haven't seen any overloads in ManagedCUDA that use 2D arrays), but that doesn't really matter: I just use a 3D array/texture that is exactly 1 deep.
id = GL.GenTexture();
GL.BindTexture(TextureTarget.Texture3D, id);
GL.TexParameter(TextureTarget.Texture3D, TextureParameterName.TextureMinFilter, (int)TextureMinFilter.Linear);
GL.TexParameter(TextureTarget.Texture3D, TextureParameterName.TextureMagFilter, (int)TextureMagFilter.Linear);
GL.TexImage3D(TextureTarget.Texture3D, 0, PixelInternalFormat.Rgba, width, height, 1, 0, OpenTK.Graphics.OpenGL.PixelFormat.Bgra, PixelType.UnsignedByte, IntPtr.Zero); //allocate memory for the texture but do not upload anything
CudaOpenGLImageInteropResource resultImage = new CudaOpenGLImageInteropResource((uint)id, CUGraphicsRegisterFlags.SurfaceLDST, CudaOpenGLImageInteropResource.OpenGLImageTarget.GL_TEXTURE_3D, CUGraphicsMapResourceFlags.WriteDiscard);
CudaArray3D mappedArray = resultImage.GetMappedArray3D(0, 0);
CudaSurface surfaceResult = new CudaSurface(kernelSample, "outputSurface", CUSurfRefSetFlags.None, mappedArray); //nothing needs to be done anymore - this call connects the 3D array from the GL texture to a surface reference in the kernel
Kernel code:
surface<void, cudaSurfaceType3D> outputSurface;
__global__ void Sample() {
    surf3Dwrite(output, outputSurface, pixelX, pixelY, 0);
}

SDL_GL_SwapWindow bad performance

I did some performance testing and came up with this:
for(U32 i=0;i<objectList.length();++i)
{
    PC d("draw");
    VoxelObject& obj = *objectList[i];
    tmpM = usedView->projection * usedView->transform * obj.transform;
    glUniformMatrix4fv(shader.modelViewMatrixLoc, 1, GL_FALSE,;
    glBindTexture(GL_TEXTURE_2D, typesheet.tbo);
    glUniform1i(shader.typesheetLoc, 0);
    glDrawArrays(GL_TRIANGLES, 0, VoxelObject::VERTICES_PER_BOX*obj.getNumBoxes());
    d.out(); // 2 calls 0.000085s and 0.000043s each
}
PC swap("swap");
SDL_GL_SwapWindow(mainWindow); // 1 call 0.007823s
The call to SDL_GL_SwapWindow(mainWindow); is taking 200 times longer than the draw calls! As I understood it, all that function was supposed to do was swap buffers. That would mean the time it takes to swap should scale with the screen size, right? No, it scales with the amount of geometry... I did some searching online; I have double buffering enabled and vsync is turned off. I am stumped.
Your OpenGL driver is likely doing deferred rendering.
That means the calls to the glDrawArrays and friends don't draw anything. Instead they buffer all required information to perform the operation later on.
The actual rendering happens inside SDL_GL_SwapWindow.
This behavior is typical these days because you want to avoid having to synchronize between the CPU and the GPU as much as possible.

Can I call `glDrawArrays` multiple times while updating the same `GL_ARRAY_BUFFER`?

In a single frame, is it "allowed" to update the same GL_ARRAY_BUFFER continuously and keep calling glDrawArrays after each update?
I know this is probably not the best and not the most recommended way to do it, but my question is: Can I do this and expect to get the GL_ARRAY_BUFFER updated before every call to glDrawArrays ?
Code example would look like this:
// setup a single buffer and bind it
GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
while (!renderStack.empty())
{
    SomeObjectClass * my_object = renderStack.back();
    // calculate the current buffer size for data to be drawn in this iteration
    SomeDataArrays * subArrays = my_object->arrayData();
    unsigned int totalBufferSize = subArrays->bufferSize();
    unsigned int vertCount = my_object->vertexCount();
    // initialise the buffer to the desired size and content
    glBufferData(GL_ARRAY_BUFFER, totalBufferSize, NULL, GL_STREAM_DRAW);
    // actually transfer some data to the GPU through glBufferSubData
    for (int j = 0; j < subArrays->size(); ++j)
    {
        unsigned int subBufferOffset = subArrays->get(j)->bufferOffset();
        unsigned int subBufferSize = subArrays->get(j)->bufferSize();
        void * subBufferData = subArrays->get(j)->bufferData();
        glBufferSubData(GL_ARRAY_BUFFER, subBufferOffset, subBufferSize, subBufferData);
        unsigned int subAttributeLocation = subArrays->get(j)->attributeLocation();
        // set some vertex attribute pointers
        glVertexAttribPointer(subAttributeLocation, ...);
        glEnableVertexAttribArray(subAttributeLocation); // takes only the attribute index
    }
    glDrawArrays(GL_POINTS, 0, (GLsizei)vertCount);
}
You may ask - why would I want to do that and not just preload everything onto the GPU at once ... well, obvious answer, because I can't do that when there is too much data that can't fit into a single buffer.
My problem is, that I can only see the result of one of the glDrawArrays calls (I believe the first one) or in other words, it appears as if the GL_ARRAY_BUFFER is not updated before each glDrawArrays call, which brings me back to my question, if this is even possible.
I am using an OpenGL 3.2 CoreProfile (under OS X) and link with GLEW for OpenGL setup as well as Qt 5 for setting up the window creation.
Yes, this is legal OpenGL code. It is in no way something that anyone should ever actually do. But it is legal. Indeed, it makes even less sense in your case, because you're calling glVertexAttribPointer for every object.
If you can't fit all your vertex data into memory, or need to generate it on the GPU, then you should stream the data with proper buffer streaming techniques.
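One way to approach the streaming mentioned above, as a hedged sketch rather than the answer's own code: keep one fixed-size buffer, write each object's data into a fresh region of it, and re-specify ("orphan") the buffer when the next batch no longer fits. `StreamAllocator`, `StreamRegion` and the 16-byte alignment default are hypothetical and chosen for illustration; only the offset bookkeeping is shown, so that it compiles without a GL context. When `orphanFirst` is true you would call `glBufferData(GL_ARRAY_BUFFER, capacity, NULL, GL_STREAM_DRAW)` before uploading with `glBufferSubData` at the returned offset.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: where the next batch should be written.
struct StreamRegion {
    std::size_t offset;   // byte offset inside the streaming buffer
    bool orphanFirst;     // true if the buffer should be re-specified before writing
};

// Hypothetical helper that hands out regions of one fixed-capacity buffer.
class StreamAllocator {
public:
    explicit StreamAllocator(std::size_t capacity, std::size_t alignment = 16)
        : capacity_(capacity), alignment_(alignment), cursor_(0) {
        assert(capacity_ > 0 && alignment_ > 0);
    }

    // Reserve `size` bytes for one batch. If the batch does not fit in the
    // space that is left, start again at offset 0 and ask the caller to
    // orphan the buffer so earlier draws keep their old storage.
    StreamRegion reserve(std::size_t size) {
        assert(size <= capacity_);
        std::size_t start = alignUp(cursor_);
        bool orphan = false;
        if (start + size > capacity_) {
            start = 0;
            orphan = true;
        }
        cursor_ = start + size;
        return StreamRegion{start, orphan};
    }

private:
    // Round v up to the next multiple of the alignment.
    std::size_t alignUp(std::size_t v) const {
        return (v + alignment_ - 1) / alignment_ * alignment_;
    }

    std::size_t capacity_;
    std::size_t alignment_;
    std::size_t cursor_;
};
```

With a 100-byte buffer and 16-byte alignment, two 40-byte batches land at offsets 0 and 48, and a third wraps back to offset 0 with `orphanFirst` set. The vertex attribute pointers can then be set up once, with each draw selecting its region through the offset instead of re-specifying the whole buffer per object.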