glTexSubImage2D slow and uses 4% of CPU [closed] - c++

I am using glTexSubImage2D to update a window that uses OpenGL.
I see that this function takes a long time to return, and it also uses 4% of the CPU.
Here is the code that I use:
glEnable(GL_TEXTURE_2D);
glBindTexture(GL_TEXTURE_2D, (*i)->getTextureID());
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, (*i)->getWidth(), (*i)->getHeightView(),
                GL_BGRA, GL_UNSIGNED_BYTE, (*i)->getBuffer());
Does anybody know of a better implementation? Something with better performance that will take less CPU?
Right now this is making my program sluggish.

There are some things you can do, though how much you can benefit from them depends on the circumstances.
First, make sure that your pixel upload format is correct for the driver's needs. You seem to have that taken care of with GL_BGRA, GL_UNSIGNED_BYTE, which is likely the driver's preferred format for GL_RGBA8 image formats.
However, if you happen to have access to OpenGL 4.3 or a driver that implements ARB_internalformat_query2, you can actually detect at runtime what the preferred upload format will be. Like this:
GLint pixelFormat, pixelType;
glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8, GL_TEXTURE_IMAGE_FORMAT, 1, &pixelFormat);
glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8, GL_TEXTURE_IMAGE_TYPE, 1, &pixelType);
Of course, this means that you will need to be able to modify your data generation method to generate data in the above format/type pair.
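For illustration, the queried values can then be fed straight into the upload call (a sketch; width, height, and pixels are placeholder names, not from the question):
// Sketch: upload using whatever format/type the driver reported as preferred.
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                (GLenum)pixelFormat, (GLenum)pixelType, pixels);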
Once you've taken steps to appease the driver, your next option is to use buffer objects to store your pixel transfer data. This probably won't help overall performance, but it can reduce the CPU burden.
However, in order to take the best advantage of this, you need to be able to generate your pixel data "directly" into the buffer object's memory by mapping it. If you are able to do this, then you can probably get back some of the CPU cost of the upload. Otherwise, it may not be worthwhile.
If you do this, you should use proper buffer object streaming techniques.
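For example, a minimal streaming sketch might look like this (assuming the texture from the question is already bound, pbo is an unpack buffer object created up front, and generatePixels is a stand-in for however you produce the BGRA data):
GLsizeiptr size = width * height * 4;
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, size, NULL, GL_STREAM_DRAW); // orphan the old storage
if (void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY)) {
    generatePixels(dst);  // hypothetical: write the BGRA pixels straight into the PBO
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
}
// With an unpack PBO bound, the last parameter is an offset into the buffer,
// so glTexSubImage2D can return without copying from client memory.
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_BGRA, GL_UNSIGNED_BYTE, (const void*)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);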
Double-buffering your texture may also help. That is, while you're rendering from one texture object, you're uploading to another one. This will prevent GPU stalls that wait for the prior rendering to complete. How much this helps really depends on how you're rendering.
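A rough sketch of that idea (tex[0] and tex[1] are two identically sized textures created up front; frame, width, height, and newPixels are placeholders):
int upload = frame & 1;    // texture we refill this frame
int draw   = upload ^ 1;   // texture we sample from this frame
glBindTexture(GL_TEXTURE_2D, tex[upload]);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_BGRA, GL_UNSIGNED_BYTE, newPixels);
glBindTexture(GL_TEXTURE_2D, tex[draw]);
// ... issue the draw calls that sample tex[draw]; the freshly uploaded data is drawn next frame ...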
Without knowing more about the specific circumstances of your application, there's not much more that can be said.

If your texture really is changing every frame, then you will want to use a double buffer to transport your data to the GPU. (If it's not changing every frame, then the obvious optimization is to only upload it once!)
Each frame, you upload data to one buffer and draw data from the other buffer, and you switch which buffer you use each frame. This will speed everything up because the GPU will not have to wait for the memory transfer to finish.
A tutorial on PBOs is somewhat beyond my ability to condense into an answer, but "OpenGL Pixel Buffer Objects" is a decent reference, and I would look at the "OGL Samples" repository to see how PBOs work.
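To make that concrete, a hedged per-frame sketch with two unpack PBOs might look like this (pbo[0]/pbo[1], textureID, size, and produceNextFrame are all placeholder names):
int fill = frame % 2;        // PBO the CPU fills this frame
int copy = (frame + 1) % 2;  // PBO whose contents go into the texture this frame

// 1. Kick off the GPU-side copy from last frame's PBO into the texture.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[copy]);
glBindTexture(GL_TEXTURE_2D, textureID);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_BGRA, GL_UNSIGNED_BYTE, (const void*)0);

// 2. Meanwhile, refill the other PBO with the next frame's pixels.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[fill]);
glBufferData(GL_PIXEL_UNPACK_BUFFER, size, NULL, GL_STREAM_DRAW); // orphan
if (void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY)) {
    produceNextFrame(dst);   // hypothetical CPU-side producer
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
}
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
Note that the texture trails the CPU by one frame, which is the price of not stalling.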
However, if you can't compute a texture frame in advance, then there is no real advantage to using PBOs. Just use glTexSubImage2D.
That said, 4% of CPU might not be a problem.

You should not be changing the data of a texture every frame in order to update your screen. Textures are meant to be loaded once and rarely (if ever) changed. If you are trying to write to individual pixels on your screen, I would recommend not using OpenGL and instead using something more suited to the task, like SDL.
Edit: Okay, this isn't necessarily true; see the discussion in the comments.

As I understand from this answer's comment thread, you're rendering a website on the CPU (or at least the rendered image passes through the CPU), and then applying OpenGL shaders to it. If so, you need a GPU-side renderer: render the web page and apply the shaders on the GPU. That way you no longer upload each frame to the GPU through the CPU, and the CPU is freed from rendering work, as it's intended to be.

Related

What is the difference between clearing the framebuffer using glClear and simply drawing a rectangle to clear the framebuffer?

I think at least some old graphics drivers used to crash if glClear wasn't used, and glClear is probably faster in many cases, but why? How are 3D graphics drivers usually implemented such that these two approaches behave differently?
On a high level, it can be faster because the OpenGL implementation knows ahead of time that the whole buffer needs to be set to the same color/value. The more you know about what exactly needs to be done, the more you can take advantage of possible accelerations.
Let's say setting a whole buffer to the same value is more efficient than setting the same pixels to varying values. With a glClear(), you already know that all pixels will have the same value. If you draw a screen-sized quad with a fragment shader that emits a constant color, the driver would either have to recognize that situation by analyzing the shaders, or the system would have to compare the values coming out of the shader to know that all pixels have the same value.
The reason why setting everything to the same value can be more efficient has to do with framebuffer compression and related technologies. GPUs often don't actually write each pixel out to the framebuffer, but use various kinds of compression schemes to reduce the memory bandwidth needed for framebuffer writes. If you imagine almost any kind of compression, all pixels having the same value is very favorable.
To give you some ideas about the published vendor specific technologies, here are a few sources. You can probably find more with a search.
Article talking about new framebuffer compression method in relatively recent AMD cards: http://techreport.com/review/26997/amd-radeon-r9-285-graphics-card-reviewed/2.
NVIDIA patent on zero bandwidth clears: http://www.google.com/patents/US8330766.
Blurb on ARM web site about Mali framebuffer compression: http://www.arm.com/products/multimedia/mali-technologies/arm-frame-buffer-compression.php.
Why is it faster? Because it is a function that bypasses most of the calculations that other types of drawing have to go through.
Alpha function, blend function, logical operation, stenciling, texture mapping, and depth-buffering are ignored by glClear
Source
Why do some drivers crash without it? It's hard to say, but it probably has something to do with the implementation details of OpenGL. The function does what it's supposed to do, but it might also do more that you don't know about.
OpenGL might infer from this function call other tasks that it needs to perform.

OpenGL read pixels faster than glReadPixels

Is there a way to increase the speed of glReadPixels? Currently I do:
Gdx.gl.glReadPixels(0, 0, Gdx.graphics.getWidth(), Gdx.graphics.getHeight(), GL20.GL_RGBA, GL20.GL_UNSIGNED_BYTE, pixels);
The problem is that it blocks the rendering and is slow.
I have heard of Pixel Buffer Objects, but I am quite unsure on how to wire it up and whether it is faster or not.
Also, is there any other solution than glReadPixels?
Basically, I want to take a screenshot as fast as possible, without blocking the drawing of the next scene.
Is there a way to increase the speed of glReadPixels?
Well, the speed of that operation is actually not the main issue. It has to transfer a certain number of bytes from the framebuffer to your system memory. On a typical desktop system with a discrete GPU, that involves sending the data over PCI Express, and there is no way around that.
But as you already stated, the implicit synchronization is a big issue. If you need that pixel data as soon as possible, you can't really do much better than that synchronous readback. But if you can live with getting that data later, asynchronous readback via pixel buffer objects (PBOs) is the way to go.
The pseudo code for that is:
1. Create the PBO.
2. Bind the PBO as GL_PIXEL_PACK_BUFFER.
3. Do the glReadPixels into the bound PBO.
4. Do something else; both CPU-side work and issuing new commands for the GPU are ideal.
5. Read back the data from the PBO, either with glGetBufferSubData or by mapping the PBO for reading.
The crucial point is the timing of step 5. If you do it too early, you are still blocking the client side, because it will wait for the data to become available. For screenshots, it should not be hard to delay that step by one or two frames. That way, it has only a slight impact on the overall render performance, and it stalls neither the GPU nor the CPU.
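A hedged sketch of that flow, with a single PBO for brevity (a real implementation would usually rotate through two or three PBOs so the map never has to wait; width, height, and screenshot are placeholders):
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, NULL, GL_STREAM_READ);

// Frame N: start the readback. With a pack PBO bound, the pointer argument is an
// offset into the buffer, so this returns without waiting for the GPU.
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

// ... render one or two more frames ...

// Frame N+1 or later: map the PBO and copy out the finished screenshot.
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
if (void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)) {
    memcpy(screenshot, src, width * height * 4); // screenshot: a CPU-side buffer
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);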

mipmap generation in opengl - is it hardware accelerated?

The purpose here isn't rendering but GPGPU; it's for image blurring:
given an image, I need to blur it with a fixed, given separable kernel (see e.g. Separable 2D Blur Kernel).
For GPU processing, a popular method is to first filter the rows, then filter the columns, using the vertex shader and the fragment shader to do so (*).
However, if I have a fixed-size kernel, I think I can use a quickly generated mipmap level close to the one I want and then upsample it (as was suggested here).
The question is therefore: will an OpenGL-created mipmap be faster than a mipmap I create myself using the method of (*)?
Put another way: is mipmap creation optimized on the GPU itself? Will it always outperform (speed-wise) user-written GLSL code? Or does it depend on the graphics card?
Edit:
Thanks for the replies (Kahler, Jean-Simon Brochu). However, I still haven't seen any resources that explicitly say whether mipmap generation by the GPU is faster than user-created mipmaps because of dedicated mipmap-generation hardware...
OpenGL does not care how the functions are implemented.
OpenGL is a set of specifications; among them is glGenerateMipmap.
Anyone can write a software renderer or develop a video card compliant with the specification. If it passes the tests, it's ~OpenGL certified~.
That means that no function is required to be performed on the CPU or the GPU, or anywhere in particular; it just has to produce the results OpenGL expects.
Now for the practical side:
Nowadays, you can just assume mipmap generation is done by the video card, because the major vendors have adopted this approach.
If you really want to know, you will have to check the specific video card you are programming for.
As for performance, assume you can't beat the video card.
Even if you come up with some highly optimized code running on some high-tech, feature-packed CPU, you will still have to upload the mipmaps you generated to the GPU, and that operation alone will probably take more time than letting the GPU do the work after you've uploaded the full-resolution texture.
And if you implement mipmapping as a shader, it is still unlikely to beat the hard-coded (maybe even hard-wired) built-in function (and that is considering the code alone, not counting the fact that the built-in path may schedule better, run separately, and so on).
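For reference, the usual pattern is simply to upload level 0 and let the driver build the rest of the chain (a sketch; tex, width, height, and pixels are placeholders):
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, pixels);
glGenerateMipmap(GL_TEXTURE_2D); // driver/GPU builds all the smaller levels
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);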
This site explains the glGenerateMipmap history better =))

What is the most efficient process to push YUV texture data onto a GPU in OpenGL?

Does anyone know of an efficient way to push 2vuy non-planar data onto a GPU in a way that doesn't require swizzling?
I am grabbing the raw 2vuy data from an h264 video file and successfully loading it into a texture that I map to an OpenGL object. I notice that my code spends a fair amount of time in glgProcessPixelsWithProcessor. My glTexImage2D call looks like the following:
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_YCBCR_422_APPLE,
GL_UNSIGNED_SHORT_8_8_APPLE, data);
Apple says in its OpenGL guide that GL_YCBCR_422_APPLE provides "acceptable" performance (p. 103), but notes that:
Note: If your data needs only to be swizzled, glgProcessPixels performs the swizzling reasonably fast although not as fast as if the data didn't need swizzling. But non-native data formats are converted one byte at a time and incurs a performance cost that is best to avoid.
I assume that there is some kind of internal format conversion going on the CPU. I noticed in another thread that glgProcessPixels is running a block method as well.
Is my path the most efficient? If not, what is?
Your code, as it stands right now, depends on Apple extensions, so I can't tell what's happening inside.
However, what I suggest is that you create three 2D textures, each with exactly one channel, where each texture receives one of the color planes; using independent textures makes supporting chroma subsampling (that 4:2:2) simpler.
In a shader you'd then perform the colorspace conversion. When writing down the math, I suggest you go through a device-independent connection color space like XYZ, as this allows you to take the color profile of the output device into account; ICC profiles provide the conversion data from XYZ color space coordinates to device color space (RGB) coordinates.
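As a hedged sketch, assuming the packed 2vuy data has already been de-interleaved on the CPU into yPlane, cbPlane, and crPlane buffers (placeholder names) and a GL 3.x-style context is available:
GLuint planes[3];
glGenTextures(3, planes);
struct Plane { GLsizei w, h; const void* data; } desc[3] = {
    { width,     height, yPlane  },  // Y (luma), full resolution
    { width / 2, height, cbPlane },  // Cb, horizontally subsampled (4:2:2)
    { width / 2, height, crPlane },  // Cr, horizontally subsampled (4:2:2)
};
for (int i = 0; i < 3; ++i) {
    glBindTexture(GL_TEXTURE_2D, planes[i]);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_R8, desc[i].w, desc[i].h, 0,
                 GL_RED, GL_UNSIGNED_BYTE, desc[i].data);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
}
// A fragment shader then samples all three textures and performs the Y'CbCr -> RGB conversion.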

optimal pixel-read back strategy

I need to render certain scenes and read the whole image back into main memory. I've searched for this, and it seems that most video cards will accelerate the rendering, but the read-back will be very slow. After a bit of research I only found this card mentioning "Hardware-Accelerated Pixel Read-Back".
The other approach would be software rendering, where the read-back problem doesn't exist, but then the rendering performance will be poor.
Likely, I will have to implement both in order to find the optimal trade-off, but my question is about what other hardware alternatives I have. I understand Quadro targets the modelling and design market segment, which is precisely the client target of this application. Does this mean that I'm unlikely to find better pixel read-back performance in other video card lines, e.g. Tesla or Fermi (which don't even have video outputs, by the way)?
I don't know if the performance would be any different, but you could at least try rendering to an off-screen buffer and then setting that as a texture on a full-screen quad (or outputting it to video in some other way).
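In case it helps, a minimal sketch of that off-screen path (width, height, and cpuBuffer are placeholders):
GLuint fbo, colorTex;
glGenTextures(1, &colorTex);
glBindTexture(GL_TEXTURE_2D, colorTex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, NULL);

glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, colorTex, 0);

// ... render the scene into the FBO ...

// Either read it back to main memory ...
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, cpuBuffer);
// ... or rebind the default framebuffer and draw colorTex on a full-screen quad.
glBindFramebuffer(GL_FRAMEBUFFER, 0);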