Why is a simple OpenGL loop faster than a Vulkan one?

I have 2 graphics applications, one using OpenGL and one using Vulkan.
The OpenGL loop looks something like this:
glClear(GL_DEPTH_BUFFER_BIT | GL_COLOR_BUFFER_BIT);
static int test = 0;
// The "if" statement here is to ensure that there is no caching or optimization
// done by the OpenGL driver (if such things exist), and that commands are
// re-recorded into the buffer every frame.
if ((test = 1 - test) == 0) {
    glBindBuffer(GL_ARRAY_BUFFER, vertex_buffer1);
    glUseProgram(program1);
    glDrawArrays(GL_TRIANGLES, 0, vertices_size);
    glUseProgram(0);
}
else {
    glBindBuffer(GL_ARRAY_BUFFER, vertex_buffer2);
    glUseProgram(program2);
    glDrawArrays(GL_LINES, 0, vertices_size);
    glUseProgram(0);
}
glfwSwapBuffers(window);
And Vulkan:
static uint32_t image_index = 0;
vkAcquireNextImageKHR(device, swapchain, 0xFFFFFFFF, image_available_semaphores[image_index], VK_NULL_HANDLE, &image_indices[image_index]);
vkWaitForFences(device, 1, &submission_completed_fences[image_index], VK_TRUE, 0xFFFFFFFF);
// command_buffer_bi uses VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT
vkBeginCommandBuffer(cmd_buffers[image_index], &command_buffer_bi);
vkCmdBeginRenderPass(cmd_buffers[image_index], &render_pass_bi[image_index], VK_SUBPASS_CONTENTS_INLINE);
vkCmdEndRenderPass(cmd_buffers[image_index]);
vkEndCommandBuffer(cmd_buffers[image_index]);
vkResetFences(device, 1, &submission_completed_fences[image_index]);
vkQueueSubmit(graphics_queue, 1, &submit_info[image_index], submission_completed_fences[image_index]);
present_info[image_index].pImageIndices = &image_indices[image_index];
vkQueuePresentKHR(present_queue, &present_info[image_index]);
static const int max_swapchain_image_index = swapchain_image_count - 1;
if (++image_index > max_swapchain_image_index) {
    image_index = 0;
}
In the Vulkan loop there are not even any rendering commands, just an empty render pass. Validation layers are disabled.
OpenGL FPS is about 10500, and Vulkan FPS is about 7500 (with 8 swapchain images in use with VK_PRESENT_MODE_IMMEDIATE_KHR; fewer images make the FPS lower).
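For reference, a swapchain with that configuration would be created roughly like this (a minimal sketch; only the image count and present mode come from the description above, the other fields are typical placeholder values):
VkSwapchainCreateInfoKHR swapchain_ci = {};
swapchain_ci.sType            = VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR;
swapchain_ci.surface          = surface;
swapchain_ci.minImageCount    = 8;                             // 8 swapchain images
swapchain_ci.imageFormat      = surface_format.format;
swapchain_ci.imageColorSpace  = surface_format.colorSpace;
swapchain_ci.imageExtent      = surface_extent;
swapchain_ci.imageArrayLayers = 1;
swapchain_ci.imageUsage       = VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT;
swapchain_ci.imageSharingMode = VK_SHARING_MODE_EXCLUSIVE;
swapchain_ci.preTransform     = surface_capabilities.currentTransform;
swapchain_ci.compositeAlpha   = VK_COMPOSITE_ALPHA_OPAQUE_BIT_KHR;
swapchain_ci.presentMode      = VK_PRESENT_MODE_IMMEDIATE_KHR; // no vsync
swapchain_ci.clipped          = VK_TRUE;
vkCreateSwapchainKHR(device, &swapchain_ci, NULL, &swapchain);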
The code is running on a laptop with Ubuntu 18.04, a discrete Nvidia RTX 2060 GPU, Nvidia driver 450.66, and Vulkan API version 1.2.133.
I know that the OpenGL driver is highly optimized, but I can't imagine what else there is to optimize in the Vulkan loop to make it faster than it is.
Are there some low-level Linux driver issues? Or is the Vulkan performance increase only achievable in much more complex applications (e.g. ones using multithreading)?

Related

OpenGL compute shader premature abort after calling glDispatchCompute

I have been trying to run a very simple counting compute shader to get a grasp on how many times my shader runs and how large of a compute array I can process.
It seems that I'm either hitting some driver limit, or my shader takes too long for the card to execute and is prematurely aborted, or something. There does not seem to be any error returned from glDispatchCompute, at least.
I have been reading up on compute shaders, and nowhere does it seem to say that a time limit would be an issue.
The hardware is an Intel integrated graphics card, which is rather low end but does have compute shader support. I want to be able to run compute shaders even on lower-end cards, and I think this card should be able to do it, but I'm running into weird premature-abort problems.
glxinfo | grep compute
GL_ARB_compressed_texture_pixel_storage, GL_ARB_compute_shader,
GL_ARB_compressed_texture_pixel_storage, GL_ARB_compute_shader,
More info:
const GLubyte* renderer = glGetString(GL_RENDERER); // get renderer string
const GLubyte* version = glGetString(GL_VERSION); // version as a string
GLint texture_units = 0;
glGetIntegerv(GL_MAX_TEXTURE_IMAGE_UNITS, &texture_units);
GLint maxAttach = 0;
glGetIntegerv(GL_MAX_COLOR_ATTACHMENTS, &maxAttach);
GLint maxDrawBuf = 0;
glGetIntegerv(GL_MAX_DRAW_BUFFERS, &maxDrawBuf);
GLint workGroupCount[3], workGroupSize[3];
GLint maxInvocations;
glGetIntegerv(GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS, &maxInvocations);
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 0, &workGroupCount[0]);
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 1, &workGroupCount[1]);
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 2, &workGroupCount[2]);
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE, 0, &workGroupSize[0]);
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE, 1, &workGroupSize[1]);
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE, 2, &workGroupSize[2]);
printf("Renderer: %s\n", renderer);
printf("OpenGL version supported: %s\n", version);
printf("Number of texture units: %d\n", texture_units);
printf("Maximum number of color attachments: %d\n", maxAttach);
printf("Maximum number of fragment shader outputs: %d\n", maxDrawBuf);
printf("Maximum work group invocations: %d\n", maxInvocations);
printf("Maximum work group count: %d %d %d\n", workGroupCount[0], workGroupCount[1], workGroupCount[2]);
printf("Maximum work group size: %d %d %d\n", workGroupSize[0], workGroupSize[1], workGroupSize[2]);
Output:
Vendor: Intel Open Source Technology Center (0x8086)
Device: Mesa DRI Intel(R) Haswell Mobile (0x416)
OpenGL vendor string: Intel Open Source Technology Center
OpenGL renderer string: Mesa DRI Intel(R) Haswell Mobile
Renderer: Mesa DRI Intel(R) Haswell Mobile
OpenGL version supported: OpenGL ES 3.1 Mesa 17.0.7
Number of texture units: 32
Maximum number of color attachments: 8
Maximum number of fragment shader outputs: 8
Maximum work group invocations: 2048
Maximum work group count: 65535 65535 65535
Maximum work group size: 2048 2048 2048
Shader:
#version 310 es
layout (local_size_x = 32, local_size_y = 32, local_size_z = 1) in;
layout (binding=0) uniform atomic_uint counter;
void main() {
    atomicCounterIncrement(counter);
}
Setup:
GLuint ac_buffer;
glGenBuffers(1, &ac_buffer);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, ac_buffer);
glBufferData(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint), NULL, GL_DYNAMIC_DRAW);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, 0);
GLuint compute_shader = glCreateShader (GL_COMPUTE_SHADER);
std::string ss;
readfile("compute.cs.c", ss);
const char *shader_source = ss.c_str();
glShaderSource (compute_shader, 1, &shader_source, NULL);
glCompileShader (compute_shader);
printShaderInfoLog(compute_shader);
GLuint shader_program = glCreateProgram ();
glAttachShader (shader_program, compute_shader);
glLinkProgram (shader_program);
printProgramInfoLog(shader_program);
glDeleteShader (compute_shader);
glUseProgram (shader_program);
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, ac_buffer);
glDispatchCompute(1024, 1024, 1);
if (glGetError() != GL_NO_ERROR) {
    printf("There was a problem dispatching compute\n");
}
glMemoryBarrier(GL_ALL_BARRIER_BITS);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, ac_buffer);
GLuint *counter = (GLuint*)glMapBufferRange(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint), GL_MAP_READ_BIT);
printf("Counter: %u\n", *counter);
glUnmapBuffer(GL_ATOMIC_COUNTER_BUFFER);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, 0);
When I call glDispatchCompute with values smaller than 128, I do seem to get reasonable results:
For example, glDispatchCompute(128, 128, 1) results in "Counter: 16777216", which is consistent with 128*128*32*32. But if I call it with 256, 256, 1, I get 66811258 instead, which is no longer consistent with the expected 67108864.
For smaller compute sets I always get the expected results, but for larger ones the counter rarely goes beyond 60-100 million. Could I be hitting some driver limit? I thought that since the max work group count is 65535 along each axis, I should be able to request large dispatches and expect all elements to be processed.
Could it be that my way of counting by means of an atomic is flawed? Why does it still get reasonable results for small groups but fall short for large ones? How can I better debug this issue?
It is possible you're just reading the result before the computation is complete. You need an explicit call to glFinish() to force completion, and you can remove the call to glMemoryBarrier(). In OpenGL ES, glMemoryBarrier() only deals with the relative ordering of stages on the GPU; it doesn't enforce ordering relative to client access.
The desktop OpenGL 4.6 spec supports CLIENT_MAPPED_BUFFER_BARRIER_BIT for synchronizing client-side access, but this isn't available in OpenGL ES.
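Applied to the question's readback code, that looks roughly like this (a sketch based on the setup above):
glDispatchCompute(1024, 1024, 1);
// Block until the GPU has actually finished the dispatch before reading back.
// (On GLES, glMemoryBarrier alone does not order GPU work against client-side mapping.)
glFinish();
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, ac_buffer);
GLuint *counter = (GLuint*)glMapBufferRange(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint), GL_MAP_READ_BIT);
printf("Counter: %u\n", *counter);
glUnmapBuffer(GL_ATOMIC_COUNTER_BUFFER);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, 0);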

SDL_GL_SwapWindow bad performance

I did some performance testing and came up with this:
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
for (U32 i = 0; i < objectList.length(); ++i)
{
    PC d("draw");
    VoxelObject& obj = *objectList[i];
    glBindVertexArray(obj.vao);
    tmpM = usedView->projection * usedView->transform * obj.transform;
    glUniformMatrix4fv(shader.modelViewMatrixLoc, 1, GL_FALSE, tmpM.data());
    //glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, typesheet.tbo);
    glUniform1i(shader.typesheetLoc, 0);
    glDrawArrays(GL_TRIANGLES, 0, VoxelObject::VERTICES_PER_BOX * obj.getNumBoxes());
    d.out(); // 2 calls, 0.000085s and 0.000043s each
}
PC swap("swap");
SDL_GL_SwapWindow(mainWindow); // 1 call, 0.007823s
swap.out();
The call to SDL_GL_SwapWindow(mainWindow) is taking 200 times longer than the draw calls! To my understanding, all that function was supposed to do was swap buffers. That would mean the time it takes to swap should scale with the screen size, right? No, it scales with the amount of geometry... I did some searching online; I have double buffering enabled and vsync is turned off. I am stumped.
Your OpenGL driver is likely doing deferred rendering.
That means the calls to glDrawArrays and friends don't actually draw anything. Instead, they buffer all the information required to perform the operation later on.
The actual rendering happens inside SDL_GL_SwapWindow.
This behavior is typical these days because you want to avoid having to synchronize between the CPU and the GPU as much as possible.
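So the CPU timers in the question mostly measure command submission, not rendering. One way to measure the actual GPU cost of the draw calls (not part of the original answer, just a sketch assuming a desktop GL 3.3+ context with timer query support) is a GL_TIME_ELAPSED query:
// Minimal GL_TIME_ELAPSED sketch: measures GPU time for the draw loop
// independently of when the driver actually flushes the commands.
GLuint timer_query;
glGenQueries(1, &timer_query);

glBeginQuery(GL_TIME_ELAPSED, timer_query);
// ... issue the glDrawArrays calls here ...
glEndQuery(GL_TIME_ELAPSED);

SDL_GL_SwapWindow(mainWindow);

GLuint64 gpu_ns = 0;
glGetQueryObjectui64v(timer_query, GL_QUERY_RESULT, &gpu_ns); // waits for the result (fine for a diagnostic)
printf("GPU time: %.3f ms\n", gpu_ns / 1.0e6);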

OpenGL glReadPixels Performance

I am trying to implement auto exposure for HDR tone mapping, and in trying to reduce the cost of finding the average brightness of my scene I seem to have hit a choke point with glReadPixels. Here is my setup:
1: I create a downsampled FBO to reduce the cost of reading with glReadPixels, using only the GL_RED values in GL_BYTE format.
private void CreateDownSampleExposure() {
    DownFrameBuffer = glGenFramebuffers();
    DownTexture = GL11.glGenTextures();
    glBindFramebuffer(GL_FRAMEBUFFER, DownFrameBuffer);
    GL11.glBindTexture(GL11.GL_TEXTURE_2D, DownTexture);
    GL11.glTexImage2D(GL11.GL_TEXTURE_2D, 0, GL11.GL_RED, 1600/8, 1200/8,
            0, GL11.GL_RED, GL11.GL_BYTE, (ByteBuffer) null);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
            GL11.GL_TEXTURE_2D, DownTexture, 0);
    if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
        System.err.println("error");
    } else {
        System.err.println("success");
    }
    GL11.glBindTexture(GL11.GL_TEXTURE_2D, 0);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}
2: Setting up the ByteBuffers and reading the texture of the FBO texture Above.
Setup() {
    byte[] testByte = new byte[1600/8 * 1000/8];
    ByteBuffer testByteBuffer = BufferUtils.createByteBuffer(testByte.length);
    testByteBuffer.put(testByte);
    testByteBuffer.flip();
}
MainLoop() {
    // Render scene and store result into downSampledFBO texture
    GL11.glBindTexture(GL11.GL_TEXTURE_2D, DeferredFBO.getDownTexture());
    //GL11.glGetTexImage(GL11.GL_TEXTURE_2D, 0, GL11.GL_RED, GL11.GL_BYTE,
    //        testByteBuffer); <- This is slower than readPixels.
    GL11.glReadPixels(0, 0, DisplayManager.Width/8, DisplayManager.Height/8,
            GL11.GL_RED, GL11.GL_BYTE, testByteBuffer);
    int x = 0;
    for (int i = 0; i < testByteBuffer.capacity(); i++) {
        x += testByteBuffer.get(i);
    }
    System.out.println(x); // <- Print out the accumulated brightness value.
}
//Adjust exposure depending on brightness.
The problem is, I can downsample my FBO texture by a factor of 100, so that my glReadPixels reads only 16x10 pixels, and there is little to no performance gain. There is a substantial gain compared to no downsampling, but once I get past dividing the width and height by about 8 it seems to fall off. It seems like there is a huge overhead just in calling this function. Is there something I am doing incorrectly, or not considering, when calling glReadPixels?
glReadPixels is slow because the CPU must wait until the GPU has finished all of its rendering before it can give you the results. The dreaded sync point.
One way to make glReadPixels fast is to use some sort of double/triple buffering scheme, so that you only call glReadPixels on render-to-textures that you expect the GPU has already finished with. This is only viable if waiting a couple of frames before receiving the result of glReadPixels is acceptable in your application. For example, in a video game the latency could be justified as a simulation of the pupil's response time to a change in lighting conditions.
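A sketch of that buffering scheme (not from the original answer; shown as C-style GL rather than LWJGL) using two pixel buffer objects (PBOs) round-robin, so the CPU only maps data the GPU finished a frame earlier:
// Two PBOs: read into one this frame, map the other (filled last frame),
// so glReadPixels never stalls on the current frame's rendering.
GLuint pbo[2];
glGenBuffers(2, pbo);
for (int i = 0; i < 2; ++i) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height, NULL, GL_STREAM_READ); // 1 byte per pixel (GL_RED)
}

// Per frame ("frame" is a running frame counter, assumed):
int write_idx = frame & 1;
int read_idx  = 1 - write_idx;

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[write_idx]);
glReadPixels(0, 0, width, height, GL_RED, GL_BYTE, (void *)0); // async copy into the PBO

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[read_idx]);
const GLbyte *pixels = (const GLbyte *)glMapBufferRange(
        GL_PIXEL_PACK_BUFFER, 0, width * height, GL_MAP_READ_BIT);
if (pixels) {
    // ... accumulate brightness from last frame's data ...
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);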
However, for your particular tone-mapping example, presumably you want to calculate the average brightness only to feed that information back into the GPU for another rendering pass. Instead of glReadPixels, calculate the average by copying your image to successively half-sized render targets with linear filtering (a box filter), until you're down to a 1x1 target.
That 1x1 target is now a texture containing your average brightness, and you can use that texture in your tone-mapping rendering pass. No sync points.
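A minimal sketch of that reduction, using glGenerateMipmap as the chain of box-filtered half-size targets (texture and uniform names are placeholders):
// Let the driver build the mip chain of the luminance texture and sample its
// coarsest level in the tone-mapping pass; no CPU readback involved.
glBindTexture(GL_TEXTURE_2D, luminance_texture); // assumed: single-channel scene luminance
glGenerateMipmap(GL_TEXTURE_2D);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);

// In the tone-mapping fragment shader (GLSL), read the 1x1 top level:
//   float avg = textureLod(u_luminance, vec2(0.5), float(u_num_mips - 1)).r;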

Can't generate mipmaps with off-screen OpenGL context on Linux

This question is a continuation of the problem I described here. This is one of the weirdest bugs I have ever seen. I have my engine running in 2 modes: display mode and off-screen. The OS is Linux. I generate mipmaps for the textures, and in display mode it all works fine. In that mode I use GLFW3 for context creation. Now, the funny part: in the off-screen mode, the context for which I create manually with the code below, the mipmap generation fails OCCASIONALLY! That is, on some runs the resulting output looks ok, and on others the missing levels are clearly visible as the frame is full of texture junk data or entirely empty.
At first I thought I had my mipmap generation routine wrong, which goes like this:
glGenTextures(1, &textureName);
glBindTexture(GL_TEXTURE_2D, textureName);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, imageInfo.Width, imageInfo.Height, 0, imageInfo.Format, imageInfo.Type, imageInfo.Data);
glTexParameteri ( GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, 0 );
glGenerateMipmap(GL_TEXTURE_2D);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
I also tried to play with this param:
glTexParameteri ( GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL, XXX);
including Max level detection formula:
int numMipmaps = 1 + floor(log2(glm::max(imageInfoOut.width, imageInfoOut.height)));
But all this stuff didn't work consistently. Out of 10-15 runs, 3-4 come out with broken mipmaps. What I then found was that switching to GL_LINEAR solved it. Also, in mipmap mode, setting just 1 level worked as well. Finally I started thinking there could be a problem on the context level, because in screen mode it works! I switched context creation to GLFW3 and it works. So I wonder what's going on here? Do I miss something in the Pbuffer setup which breaks mipmap generation? I doubt it, because AFAIK it is done by the driver.
Here is my custom off-screen context creation setup:
int visual_attribs[] = {
    GLX_RENDER_TYPE, GLX_RGBA_BIT,
    GLX_RED_SIZE, 8,
    GLX_GREEN_SIZE, 8,
    GLX_BLUE_SIZE, 8,
    GLX_ALPHA_SIZE, 8,
    GLX_DEPTH_SIZE, 24,
    GLX_STENCIL_SIZE, 8,
    None
};
int context_attribs[] = {
    GLX_CONTEXT_MAJOR_VERSION_ARB, vmaj,
    GLX_CONTEXT_MINOR_VERSION_ARB, vmin,
    GLX_CONTEXT_FLAGS_ARB, GLX_CONTEXT_ROBUST_ACCESS_BIT_ARB
#ifdef DEBUG
        | GLX_CONTEXT_DEBUG_BIT_ARB
#endif
    ,
    GLX_CONTEXT_PROFILE_MASK_ARB, GLX_CONTEXT_COMPATIBILITY_PROFILE_BIT_ARB,
    None
};
_xdisplay = XOpenDisplay(NULL);
int fbcount = 0;
_fbconfig = NULL;
// _render_context
if (!_xdisplay) {
    throw();
}
/* get framebuffer configs, any is usable (might want to add proper attribs) */
if (!(_fbconfig = glXChooseFBConfig(_xdisplay, DefaultScreen(_xdisplay), visual_attribs, &fbcount))) {
    throw();
}
/* get the required extensions */
glXCreateContextAttribsARB = (glXCreateContextAttribsARBProc) glXGetProcAddressARB((const GLubyte *) "glXCreateContextAttribsARB");
glXMakeContextCurrentARB = (glXMakeContextCurrentARBProc) glXGetProcAddressARB((const GLubyte *) "glXMakeContextCurrent");
if (!(glXCreateContextAttribsARB && glXMakeContextCurrentARB)) {
    XFree(_fbconfig);
    throw();
}
/* create a context using glXCreateContextAttribsARB */
if (!(_render_context = glXCreateContextAttribsARB(_xdisplay, _fbconfig[0], 0, True, context_attribs))) {
    XFree(_fbconfig);
    throw();
}
// GLX_MIPMAP_TEXTURE_EXT
/* create temporary pbuffer */
int pbuffer_attribs[] = {
    GLX_PBUFFER_WIDTH, 128,
    GLX_PBUFFER_HEIGHT, 128,
    None
};
_pbuff = glXCreatePbuffer(_xdisplay, _fbconfig[0], pbuffer_attribs);
XFree(_fbconfig);
XSync(_xdisplay, False);
/* try to make it the current context */
if (!glXMakeContextCurrent(_xdisplay, _pbuff, _pbuff, _render_context)) {
    /* some drivers do not support a context without a default framebuffer,
     * so fall back to using the default window.
     */
    if (!glXMakeContextCurrent(_xdisplay, DefaultRootWindow(_xdisplay),
                               DefaultRootWindow(_xdisplay), _render_context)) {
        throw();
    }
}
Almost forgot: my system and hardware:
Kubuntu 13.04 64-bit. GPU: NVIDIA GeForce GTX 680. The engine uses the OpenGL 4.2 API.
Full OpenGL info:
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: GeForce GTX 680/PCIe/SSE2
OpenGL version string: 4.4.0 NVIDIA 331.49
OpenGL shading language version string: 4.40 NVIDIA via Cg compiler
Btw, I also used older drivers and it doesn't matter.
UPDATE:
It seems my assumption regarding GLFW was wrong. The same thing happens when I compile the engine and run it from the terminal. BUT - if I run the engine from the IDE (debug or release) there are no issues with the mipmaps. Is it possible the standalone app runs against different SOs?
To make it clear, I don't use Pbuffers to render into. I render into custom framebuffers.
UPDATE1:
I have read that non-power-of-2 textures can be tricky for auto-generating mipmaps, and that in case OpenGL fails to generate all the levels it turns off texture usage. Is it possible that's what I am experiencing here? Because once the mipmapped texture goes wrong, the rest of the textures (non-mipmapped) disappear too. But if this is the case, then why is the behavior inconsistent?
Uh, why are you using PBuffers in the first place? PBuffers have just too many caveats for there to be any valid reason to use them in a new project.
You want offscreen rendering? Then use Framebuffer Objects (FBOs).
You need a purely off-screen context? Then create a normal window which you simply don't show and create an FBO on it.
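A minimal sketch of such an FBO (size and formats are placeholders):
// Render-to-texture FBO: color texture + depth renderbuffer, independent of any window surface.
GLuint fbo, color_tex, depth_rb;

glGenTextures(1, &color_tex);
glBindTexture(GL_TEXTURE_2D, color_tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);

glGenRenderbuffers(1, &depth_rb);
glBindRenderbuffer(GL_RENDERBUFFER, depth_rb);
glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT24, 1024, 1024);

glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, color_tex, 0);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_RENDERBUFFER, depth_rb);

if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
    /* handle incomplete framebuffer */
}
// Render off-screen here; glGenerateMipmap on textures works as usual in this context.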

CUDA OpenGL Interoperability: cudaGLSetGLDevice

Following the Programming Guide of CUDA 4.0, I call cudaGLSetGLDevice before any other runtime calls. But the next CUDA call, cudaMalloc, returns "all CUDA-capable devices are busy or unavailable."
Also, on the NVIDIA forum (http://forums.nvidia.com/index.php?showtopic=186399) a user said that:
"In multi-GPU systems though you're going to encounter even larger flaws in CUDA...
a) You can't do CUDA/GL interop when the CUDA context and the OpenGL context are on different devices (undocumented, and unsupported in my experience)
b) You can't do GL device affinity on non-windows machines.
c) You can't do GL device affinity on consumer devices (Quadro/Tesla only)"
Is this true? My final application must run on a Linux multi-GPU system. Do I have to change the graphics library I use? And in that case, what are your suggestions?
OS: Opensuse 11.4 64 bit
Graphic Card: GeForce 9600M GT
DRIVER: 275.21
See Cuda and OpenGL Interop
I had to replace a simple cudaMalloc() with a burden of gl* things.
Nevertheless, it works pretty well.
// The lattice as a GL Buffer
GLuint gridVBO = 0;
struct cudaGraphicsResource *gridVBO_CUDA = NULL;
// Ask for GL memory buffers
glGenBuffers(1, &gridVBO);
glBindBuffer(GL_ARRAY_BUFFER, gridVBO);
const size_t size = L * L * sizeof(unsigned char);
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_DYNAMIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, gridVBO);
glBindBuffer(GL_ARRAY_BUFFER, 0);
cutilSafeCall(cudaGraphicsGLRegisterBuffer(&gridVBO_CUDA, gridVBO, cudaGraphicsMapFlagsWriteDiscard));
// Map the GL buffer to a device pointer
unsigned char *grid = NULL;
cutilSafeCall(cudaGraphicsMapResources(1, &gridVBO_CUDA, 0));
size_t num_bytes = 0;
cutilSafeCall(cudaGraphicsResourceGetMappedPointer((void **) &grid,
&num_bytes, gridVBO_CUDA));
// Execution configuration
dim3 dimBlock(TILE_X, TILE_Y);
dim3 dimGrid(L/TILE_X, L/TILE_Y);
// Kernel call
kernel<<<dimGrid, dimBlock>>>(grid);
cutilCheckMsg("Kernel launch failed");
// Unmap buffer object
cutilSafeCall(cudaGraphicsUnmapResources(1, &gridVBO_CUDA, 0));
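Not shown above: at shutdown, the resource should be unregistered before the GL buffer is deleted, for example:
// Teardown: unregister the CUDA graphics resource, then delete the GL buffer.
cutilSafeCall(cudaGraphicsUnregisterResource(gridVBO_CUDA));
glDeleteBuffers(1, &gridVBO);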