Why is glReadPixels so slow and are there any alternatives? - c++

I need to take screenshots at every frame and I need very high performance (I'm using freeGlut). What I figured out is that it can be done like this inside glutIdleFunc(thisCallbackFunction):
GLubyte *data = (GLubyte *)malloc(3 * m_screenWidth * m_screenHeight); // 3 bytes per pixel (GL_RGB)
glReadPixels(0, 0, m_screenWidth, m_screenHeight, GL_RGB, GL_UNSIGNED_BYTE, data);
// pixel (x, y) is then at data[3 * (y * m_screenWidth + x) + channel]
It does work, but I have a huge issue with it: it's really slow. When my window is 512x512 it runs no faster than 90 frames per second with only a cube being rendered; without these two lines it runs at 6500 FPS! If I compare it to the Irrlicht graphics engine, there I can do this:
// irrlicht code
video::IImage *screenShot = driver->createScreenShot();
const uint8_t *data = (uint8_t*)screenShot->lock();
// I can access pixel values from data in a similar manner here
and a 512x512 window runs at 400 FPS even with a huge mesh (a Quake 3 map) loaded! Take into account that I'm using OpenGL as the driver inside Irrlicht. To my inexperienced eye it looks like glReadPixels copies every pixel's data from one place to another, while (uint8_t*)screenShot->lock() just returns a pointer to an already existing array. Can I do something similar to the latter using freeGlut? I expect it to be faster than Irrlicht.
Note that Irrlicht uses OpenGL too (it offers DirectX and other options as well, but in the example above I used OpenGL, and it was in fact the fastest of the options).

OpenGL calls manage a rendering pipeline. By its nature, while the graphics card is presenting an image to the viewer, computation of the next frame is already under way. When you call glReadPixels, the graphics card has to wait for the current frame to finish, read the pixels back, and only then start computing the next frame. The pipeline stalls and the work becomes sequential.
If you can hold two buffers and tell the graphics card to read data into them alternately each frame, you can read back your data one frame late, but without stalling the pipeline. This is the classic double-buffering idea; you can also triple-buffer with a two-frame-late read-back, and so on.
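As a minimal sketch of that idea (assuming an OpenGL 2.1+ context with pixel buffer object support, e.g. loaded through GLEW; initPBOs/readFrame are illustrative names and m_screenWidth/m_screenHeight come from the question), the read-back could look roughly like this:
GLuint pbo[2];
int writeIndex = 0;

void initPBOs()
{
    glGenBuffers(2, pbo);
    for (int i = 0; i < 2; ++i) {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
        glBufferData(GL_PIXEL_PACK_BUFFER, 3 * m_screenWidth * m_screenHeight,
                     nullptr, GL_STREAM_READ);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

void readFrame()
{
    int readIndex = 1 - writeIndex;

    // Start an asynchronous transfer into the "write" PBO; glReadPixels
    // returns immediately because the destination is a buffer object.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[writeIndex]);
    glReadPixels(0, 0, m_screenWidth, m_screenHeight, GL_RGB, GL_UNSIGNED_BYTE, nullptr);

    // Map the PBO that was filled on the previous frame; its transfer has had
    // a whole frame to finish, so this should not stall the pipeline.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[readIndex]);
    GLubyte *data = (GLubyte *)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    if (data) {
        // ... use the pixel data here (it is one frame late) ...
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    writeIndex = readIndex;
}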
There is a relatively old web page describing the phenomenon and implementation here: http://www.songho.ca/opengl/gl_pbo.html
Also there are a lot of tutorials about framebuffers and rendering into a texture on the web. One of them is here: http://www.opengl-tutorial.org/intermediate-tutorials/tutorial-14-render-to-texture/

Related

Adapting an OpenCL kernel to OpenGL-only

I would like to figure out whether there's a way to adapt the following OpenCL approach to OpenGL only. The OpenCL kernel goes through several buffers generated by the host and copied to device memory to determine what graphics to render to the OpenGL texture srgb. For each frame, the CPU generates those buffers, enqueues their copy to the GPU and enqueues the execution of the kernel below, which writes to the OpenGL texture while it is temporarily owned by OpenCL; when all this is done, OpenGL displays the texture on the screen. Each invocation of the kernel fully generates and writes a single pixel in one pass: it operates strictly per pixel, treating every pixel of the texture the same way with the same parameters and data.
kernel void draw_queue_srgb_kernel(global float *paramlist, global int *poslist, global int *entrylist, global uchar *data_cl, write_only image2d_t srgb, const int sector_w, const int sector_size)
{
    const int2 p = (int2)(get_global_id(0), get_global_id(1));
    float4 pv;  // pixel value (linear)

    // this computes the pixel value
    pv = draw_queue(paramlist, poslist, entrylist, data_cl, sector_w, sector_size);

    // this writes the pixel value to the texture
    write_imagef(srgb, p, linear_to_srgb(pv));
}
What is the best way to make a similar approach work with OpenGL only? I need the flexibility to provide arrays that are not textures in any way (they're more like lists of indices and parameters used for drawing) and to do a lot of math for each pixel (I use sqrt(), sin(), exp() and erf() a lot, for instance). On the other hand, every pixel of the texture is treated the same way in a single pass, and I'm writing to an OpenGL texture in the usual 8-bit-per-channel RGB(A) format, so nothing very fancy.
In case you're wondering why I want to ditch OpenCL, it's for two reasons. Graphics drivers tend to handle kernels that pause at regular intervals (like 60 FPS) in ways that waste CPU: the driver's polling threads use as much CPU for the same load whether you run at 400 FPS or slow down to 25. More importantly, I'm using Emscripten to make JavaScript builds, and since WebCL support is pretty much nonexistent and not foreseeably happening, it would be best to stick strictly to WebGL. And it just seems like what I want to do might be better done with OpenGL anyway.
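Purely as a hedged sketch of one WebGL 2-friendly direction (not something the post itself settles on): the host-generated arrays can be packed into single-channel data textures, fetched with texelFetch() from a fragment shader drawn over a full-screen triangle, and the result rendered into the target texture through an FBO. The helper below, with the made-up name uploadFloatArray, shows only the upload side and assumes a GL 3.0 / WebGL 2 class context with GL_R32F textures:
GLuint uploadFloatArray(const float *data, int count)
{
    // Pack a host-generated parameter array into a 1-row, single-channel
    // float texture (count must stay within GL_MAX_TEXTURE_SIZE).
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_R32F, count, 1, 0, GL_RED, GL_FLOAT, data);
    return tex;
}

// Each frame: refresh the arrays with glTexSubImage2D, bind the output texture
// to a framebuffer object, and draw a full-screen triangle whose fragment
// shader does the per-pixel math (sqrt, sin, exp, ...) and writes the sRGB value.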

Why do DirectX fullscreen applications give black screenshots?

You may know that trying to capture DirectX fullscreen applications the GDI way (using BitBlt()) gives a black screenshot.
My question is rather simple but I couldn't find any answer: why? I mean technically, why does it give a black screenshot?
I'm reading a DirectX tutorial here: http://www.directxtutorial.com/Lesson.aspx?lessonid=9-4-1. It's written:
[...] the function BeginScene() [...] does something called locking, where the buffer in the video RAM is 'locked', granting you exclusive access to this memory.
Is this the reason? VRAM is locked so GDI can't access it and it gives a black screenshot?
Or is there another reason? Like DirectX directly "talks" to the graphic card and GDI doesn't get it?
Thank you.
The reason is simple: performance.
The idea is to render the scene as much as possible on the GPU, out of lock-step with the CPU. You use the CPU to send the rendering buffers to the GPU (vertices, indices, shaders, etc.), which is really cheap overall because they're small, and then you do whatever else you want: physics, multiplayer sync, and so on. The GPU can crunch the data and render it on its own.
If you require the scene to be drawn into the window, you have to interrupt the GPU, ask for the rendering buffer bytes (LockRect), ask for the graphics object for the window (more interference with the GPU), render it and free every lock. You've just lost any gain you had from rendering on the GPU out of sync with the CPU. It's even worse when you think of all the CPU cores just sitting idle because you're busy "rendering" (more like waiting on buffer transfers).
So what graphics drivers do is they paint the rendering area with a magic color and tell the GPU the position of the scene, and the GPU takes care of overlaying the scene over the displayed screen based on the magic color pixels (sort of a multi-pass pixel shader that takes from the 2nd texture when the 1st texture has a certain color for x,y, but not that slow). You get completely out of sync rendering, but when you ask the OS for its video memory, you get the magic color where the scene is because that's what it actually uses.
Reference: http://en.wikipedia.org/wiki/Hardware_overlay
I believe it is actually due to double buffering. I'm not 100% sure, but that was the case when I tested screenshots in OpenGL. I noticed that the DC on my window was not always the same; one game was using two different DCs. For other games I wasn't sure what was going on: the DC was the same, but SwapBuffers was called so often that I don't think GDI was even fast enough to capture it. Sometimes I would get half a screenshot and half black.
However, when I hooked into the client, I was able to just ask for the pixels like normal, no GDI or anything. I think there is a reason why we don't use GDI when drawing in games that use DirectX or OpenGL.
You can always look at ways to capture the screen here: http://www.codeproject.com/Articles/5051/Various-methods-for-capturing-the-screen
Anyway, I use the following for grabbing data from DirectX:
HRESULT DXI_Capture(IDirect3DDevice9* Device, const char* FilePath)
{
    IDirect3DSurface9* RenderTarget = nullptr;
    HRESULT result = Device->GetBackBuffer(0, 0, D3DBACKBUFFER_TYPE_MONO, &RenderTarget);
    result = D3DXSaveSurfaceToFile(FilePath, D3DXIFF_PNG, RenderTarget, nullptr, nullptr);
    SafeRelease(RenderTarget);
    return result;
}
Then in my hooked EndScene I call it like so:
HRESULT Direct3DDevice9Proxy::EndScene()
{
    DXI_Capture(ptr_Direct3DDevice9, "C:/Users/School/Desktop/Screenshot.png");
    return ptr_Direct3DDevice9->EndScene();
}
You can either use Microsoft Detours to hook EndScene in some external application, or you can use a wrapper .dll.
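For the Detours route specifically, a hedged sketch (assuming you have already recovered the address of IDirect3DDevice9::EndScene, typically from the vtable of a dummy device; RealEndScene, HookedEndScene and InstallHook are illustrative names, not part of the original code) might look like this:
#include <windows.h>
#include <d3d9.h>
#include <detours.h>

HRESULT DXI_Capture(IDirect3DDevice9* Device, const char* FilePath); // from the function above

typedef HRESULT (APIENTRY *EndScene_t)(IDirect3DDevice9*);
static EndScene_t RealEndScene = nullptr; // filled with the vtable entry

HRESULT APIENTRY HookedEndScene(IDirect3DDevice9* device)
{
    DXI_Capture(device, "C:/Users/School/Desktop/Screenshot.png");
    return RealEndScene(device); // call through to the real EndScene
}

void InstallHook(void* endSceneAddress)
{
    RealEndScene = (EndScene_t)endSceneAddress;
    DetourTransactionBegin();
    DetourUpdateThread(GetCurrentThread());
    DetourAttach(&(PVOID&)RealEndScene, (PVOID)HookedEndScene); // redirect EndScene
    DetourTransactionCommit();
}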

Updating screen pixel colors using a randomly generated color vector

I have divided my display circle into 16 segments. I need to update each segment by reading from a randomly generated vector of colors. Every segment of my circle spans 22.5 degrees; the circle is drawn at 100 pixels per degree, so 100 * 360 = 36000 pixels in total, which works out to 2250 pixels per segment.
From what I've found so far, the function I should use is glDrawPixels(), though I'm not sure yet.
Could you give me some sample code so I can understand how to generate my vector of colors and update the 2250 pixels of a circle segment with it? Also, I don't know where to start updating the segment: do I update the real framebuffer, or should it be updated as a texture?
I also tried this code, but it gives a segmentation fault:
void GlWidget::displayColors()
{
    // Create some nice colours (3 floats per pixel) from data -10..+10
    int size = 2250;
    float* pixels = new float[size * 3];
    for (int i = 0; i < size; i++) {
        pixels[0] = 1;
        pixels[1] = 1;
        pixels[2] = 1;
    }

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glDrawPixels(width(), height(), GL_RGB, GL_FLOAT, pixels);
    // glutSwapBuffers();
}
It's extremely difficult to understand what you are trying to do, but I'll offer a general piece of advice…
On most platforms, OpenGL is hardware accelerated, and is so much faster than simply blitting pixels into display memory that rendering a partial screen is almost never necessary. There are rare exceptions, but you should only consider it if you have a performance problem that optimising your OpenGL pipeline doesn't fix.
Drawing 100 circles is barely going to tax the GPU at all. It sounds like this is a premature optimisation. I strongly suggest you undertake a proper performance analysis before making your code significantly more complex in trying to do what you describe.
It sounds like you're trying to optimise the drawing by only rendering the part of the display that has changed. This is a common optimisation method on buffer-backed displays in GUI systems. However, it is not the norm when rendering with OpenGL. Usually you would render the entire scene for each frame.
Furthermore, glReadPixels() is a very expensive operation. It basically stalls the rendering pipeline and reads back the render buffer from the GPU into main memory, which is relatively slow. It is almost certainly not what you want to do.
If you really need to optimise the drawing, you could cache your drawing into a texture and just render the texture in a quad each frame. But first, make sure that it is really worth the extra effort and complexity.
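As a rough sketch of that caching idea (assuming an OpenGL 3.0+ context or the EXT_framebuffer_object equivalents; createCache and redrawChangedSegments are illustrative names):
GLuint fbo = 0, cacheTex = 0;

void createCache(int w, int h)
{
    glGenTextures(1, &cacheTex);
    glBindTexture(GL_TEXTURE_2D, cacheTex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, cacheTex, 0);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}

void redrawChangedSegments()
{
    // Only called when a segment's colour actually changes.
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    // ... draw the circle segments here with normal OpenGL draw calls ...
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}

// Every frame, draw one textured quad using cacheTex instead of redrawing the circle.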
There are some good books and online resources around on OpenGL and performance analysis. It is definitely worth spending a little time on that research before you spend a lot of time trying to save time. :)

What is the most efficient process to push YUV texture data onto a GPU in OpenGL?

Does anyone know of an efficient way to push 2vuy non-planar data onto a GPU in a way that doesn't require swizzling?
I am grabbing the raw 2vuy data from an h264 video file and successfully loading it into a texture that I map onto an OpenGL object. I notice that my code spends a fair amount of time in glgProcessPixelsWithProcessor. My glTexImage2D call looks like the following:
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_YCBCR_422_APPLE,
             GL_UNSIGNED_SHORT_8_8_APPLE, data);
Apple says in its OpenGL guide that GL_YCBCR_422_APPLE provides "acceptable" performance (p. 103), but that:
Note: If your data needs only to be swizzled, glgProcessPixels performs the swizzling reasonably fast although not as fast as if the data didn't need swizzling. But non-native data formats are converted one byte at a time and incurs a performance cost that is best to avoid.
I assume that some kind of internal format conversion is going on on the CPU. I noticed in another thread that glgProcessPixels is running a block method as well.
Is my path the most efficient? If not, what is?
Your code, as it stands right now, depends on Apple-specific extensions, so I can't tell what's happening inside.
However, what I suggest is that you create three 2D textures, each with exactly one channel, where each texture receives one of the color planes; using independent textures makes supporting the chroma subsampling (that 422) simpler.
In a shader you'd then perform the colorspace conversion. When writing down the math, I suggest you go through a connection color space such as XYZ, as this allows you to take the color profile of the output device into account; ICC profiles provide the conversion data from XYZ color space coordinates to device color space (RGB) coordinates.
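As a hedged illustration of the three-texture upload (assuming the 2vuy frame has first been de-interleaved on the CPU into separate Y/Cb/Cr byte arrays, and a GL 3.x-style context where GL_R8/GL_RED single-channel textures are available; the function and parameter names are placeholders):
void uploadYCbCrPlanes(GLuint planeTex[3],
                       const GLubyte *yPlane, const GLubyte *cbPlane,
                       const GLubyte *crPlane, int width, int height)
{
    const struct { const GLubyte *data; int w, h; } planes[3] = {
        { yPlane,  width,     height },   // luma, full resolution
        { cbPlane, width / 2, height },   // chroma, half horizontal resolution (4:2:2)
        { crPlane, width / 2, height },
    };

    glGenTextures(3, planeTex);
    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);   // rows are tightly packed bytes

    for (int i = 0; i < 3; ++i) {
        glBindTexture(GL_TEXTURE_2D, planeTex[i]);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_R8, planes[i].w, planes[i].h, 0,
                     GL_RED, GL_UNSIGNED_BYTE, planes[i].data);
    }
    // A fragment shader then samples all three textures and performs the
    // YCbCr -> RGB conversion (ideally via XYZ and the device ICC profile).
}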

How to scale to resolution in SDL?

I'm writing a 2D platformer game using SDL with C++. However, I have encountered a huge issue involving scaling to resolution. I want the game to look nice in full HD, so all the images for the game have been created so that the natural resolution of the game is 1920x1080. However, I want the game to scale down to the correct resolution if someone is using a smaller resolution, or to scale up if someone is using a larger one.
The problem is I haven't been able to find an efficient way to do this. I started by using the SDL_gfx library to pre-scale all images, but this doesn't work as it creates a lot of off-by-one errors where a pixel gets lost. And since each animation is contained in a single image, the animation would shift slightly up or down on every frame while playing.
Then, after some looking around, I tried using OpenGL to handle the scaling. Currently my program draws all the images to an SDL_Surface that is 1920x1080, converts this surface to an OpenGL texture, scales the texture to the screen resolution, then draws it. This works fine visually, but it's not efficient at all; I'm getting a maximum of 18 FPS :(
So my question is does anyone know of an efficient way to scale the SDL display to the screen resolution?
It's inefficient because OpenGL was not designed to work that way. The main performance problems with the current design are:
First problem: You're software rasterizing with SDL. Sorry, but no matter what you do with this configuration, that will be a bottleneck. At a resolution of 1920x1080, you have 2,073,600 pixels to color. Assuming it takes you 10 clock cycles to shade each 4-channel pixel, on a 2GHz processor you're running a maximum of 96.4 fps. That doesn't sound bad, except you probably can't shade pixels that fast, and you still haven't done AI, user input, game mechanics, sound, physics, and everything else, and you're probably drawing over some pixels at least once anyway. SDL_gfx may be quick, but for large resolutions, the CPU is just fundamentally overtasked.
Second problem: Each frame, you're copying data across the graphics bus to the GPU. This is the slowest thing you can possibly do graphics-wise, and image data is usually the worst of it because there's so much of it. Basically, each frame you're telling the GPU to copy some two million pixels from RAM to VRAM; at 2,073,600 pixels and 4 bytes each, that's about 8.3 MB per frame, and going by the bus bandwidth figures on Wikipedia you can expect no more than 258.9 fps (roughly 2.1 GB/s), which again doesn't sound bad until you remember everything else you need to do.
My recommendation: switch your application completely to OpenGL. This removes the need to render to a texture and copy to the screen--just render directly to the screen! Also, scaling is handled automatically by your view matrix (glOrtho/gluOrtho2D for 2D), so you don't have to care about the scaling issue at all--your viewport will just show everything at the same scale. This is the ideal solution to your problem.
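As a rough illustration of that point (assuming a legacy fixed-function context, which is what SDL 1.2-era code typically pairs with; the function name is made up):
void setupLogicalResolution(int windowWidth, int windowHeight)
{
    glViewport(0, 0, windowWidth, windowHeight);

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    // All game coordinates stay in the 1920x1080 design space; OpenGL
    // rasterizes them at whatever the actual window resolution is.
    gluOrtho2D(0.0, 1920.0, 1080.0, 0.0);   // y-down, like typical 2D/SDL code

    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
}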
Now, it comes with the one major drawback that you have to recode everything with OpenGL draw commands (which is work, but not too hard, especially in the long run). Short of that, you can try the following ideas to improve speed:
PBOs. Pixel buffer objects can be used to address problem two by making texture loading/copying asynchronous.
Multithread your rendering. Most CPUs have at least two cores and on newer chips two register states can be saved for a single core (Hyperthreading). You're essentially duplicating how the GPU solves the rendering problem (have a lot of threads going). I'm not sure how thread safe SDL_gfx is, but I bet that something could be worked out, especially if you're only working on different parts of the image at the same time.
Make sure you pay attention to where your draw surface lives in SDL. It should probably be SDL_SWSURFACE (because you're drawing on the CPU).
Remove VSync. This can improve performance, even if you're not running at 60Hz
Make sure you're drawing your original texture--DO NOT scale it up or down to a new one. Draw it at a different size, and let the rasterizer do the work!
Sporadically update: Only update half the image at a time. This will probably close to double your "framerate", and it's (usually) not noticeable.
Similarly, only update the changing parts of the image.
Hope this helps.