Efficient transfer of planar images for rendering in OpenGL? - c++

What is the most efficient way to transfer planar YUVA images for rendering in OpenGL?
Currently I'm using 4 separate textures (Y, U, V, A), which I upload to from 4 separate PBOs each frame. However, it seems to be much more efficient to transfer a lot of data in few textures, e.g. transferring YUV422 to a single packed texture is ~50% faster than transferring the same data to 3 separate (Y, U, V) textures.
One thought I've had is whether I could use 2 array textures, one for (Y, A) and one for (U, V); would that be faster?
Another alternative I've considered is to convert from planar to packed while copying data to the PBO for transfer, though this does have some CPU overhead.
Any suggestions?
NOTE: dim(Y) == dim(A) && dim(U) == dim(V) && dim(Y) != dim(U).
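For reference, here is roughly what I have in mind for the (Y, A) array texture (only a sketch; yaTex, yaPbo, yPlane, aPlane, yWidth and yHeight are placeholders, it assumes tightly packed 8-bit planes and GL 4.2 / ARB_texture_storage for glTexStorage3D):

// Two GL_TEXTURE_2D_ARRAY textures: one holds the full-resolution Y and A
// planes as two layers, the other would hold the subsampled U and V planes.
GLuint yaTex;
glGenTextures(1, &yaTex);
glBindTexture(GL_TEXTURE_2D_ARRAY, yaTex);
glTexStorage3D(GL_TEXTURE_2D_ARRAY, 1, GL_R8, yWidth, yHeight, 2); // layer 0 = Y, layer 1 = A

// Per frame: stream both planes of the pair through one PBO, then upload
// both layers with a single glTexSubImage3D call.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, yaPbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, yWidth * yHeight * 2, nullptr, GL_STREAM_DRAW); // orphan old storage
void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(dst, yPlane, yWidth * yHeight);                                  // memcpy from <cstring>
memcpy(static_cast<char*>(dst) + yWidth * yHeight, aPlane, yWidth * yHeight);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glBindTexture(GL_TEXTURE_2D_ARRAY, yaTex);
glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0, 0, 0, 0, yWidth, yHeight, 2,
                GL_RED, GL_UNSIGNED_BYTE, nullptr); // nullptr: source is the bound PBO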

I was wondering how you are generating the textures, i.e. generated dynamically or loaded from a file? If they are loaded from a file, I would recommend loading the data as a single RGBA texture and using a fragment shader to process it as YUVA; the data can then be loaded in one go, which should yield substantially better performance.
If you can provide more information on how the texture is being used, I should be able to give you a more detailed answer.
EDIT: The way that I normally handle YUVA is to render to a texture: use the GPU to convert the RGBA to YUVA, then send the result back to the CPU (via glGetTexImage, for example) and handle the resulting data as YUVA (or drop the alpha and use it as YUV).
Regarding the differently sized planes, I wouldn't worry: pack the data as you see fit and read it out as you see fit. You can fill the areas of a channel that contain no valid data with zeros, or with an odd but memorable value (a date such as 0.17122012, say) so you can easily ignore them programmatically, or make the channel-handling code read only the dimensions that are valid for the channel it is operating on. The extraneous data is minimal (even less if it is all zeros), and the speed gained by letting the GPU handle the data offsets it and still leaves a fast system in place.
Hope that helps a bit more.
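To illustrate the single-texture idea, something like the following (untested sketch; packedFrame, width, height and tex are placeholders, and it assumes you have already interleaved the four planes into one RGBA8 buffer at the Y/A resolution, padding the U/V channels outside their valid region as described above):

// Upload all four channels in one transfer; the fragment shader then treats
// r/g/b/a as Y/U/V/A rather than as colors.
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, packedFrame);
// In GLSL:  vec4 t = texture(uFrame, uv);  // t.r = Y, t.g = U, t.b = V, t.a = A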

Related

Post-processing individual MSAA samples on the CPU

I'm interested in sub-pixel sampling my OpenGL renders around the edge silhouettes of my meshes for a computer vision task. I'm thinking of using MSAA to do it efficiently (but the application is not for anti-aliasing). The problem I find with multisampling is that in order to read the samples from the GPU I can only blit the multisampled framebuffer into a non-multisampled one, so I cannot recover individual sample information. My questions are:
Is there a way to implement a fragment shader that stores the results of a per-sample (GL_SAMPLE_SHADING) computation such that I can read those samples back to the CPU? I've thought of using gl_SampleID to index the output to different out buffers, but I don't know if that's possible at all. Perhaps a method like the linked-list structures used for OIT (i.e. http://on-demand.gputechconf.com/gtc/2014/presentations/S4385-order-independent-transparency-opengl.pdf)? However, there they perform all computations on the GPU, so I'm not sure if I can read the linked-list data from the CPU in any way.
Maybe MSAA is the wrong approach and there are other methods to achieve this. I guess my last resort is to supersample the render x times and thus recover individual samples, but that seems to be a very inefficient solution.
You can write a compute shader which reads each sample's data via imageLoad and then writes it to an SSBO (FS outputs and image load/store would not be appropriate for the output). You'll need the usual memory barrier synchronization when it comes time to read it, but this way, you can write directly to a buffer object, rather than having to use a PBO to read from a texture.
The hardest part will be converting gl_GlobalInvocationID and the other compute shader inputs into the index in the SSBO array as well as the texture coordinate and sample index for your imageLoad operation.
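A rough sketch of what that compute shader could look like (the workgroup size, the rgba8 format and the binding points are assumptions you would adapt to your setup):

// GLSL source embedded as a C++ string literal.
static const char* kCopySamplesCS = R"(
    #version 430
    layout(local_size_x = 8, local_size_y = 8) in;
    layout(rgba8, binding = 0) uniform readonly image2DMS uMsImage;  // the MSAA color texture
    layout(std430, binding = 1) buffer Samples { vec4 samples[]; };  // width * height * numSamples entries
    uniform int uWidth;
    uniform int uHeight;
    uniform int uSamples;
    void main() {
        ivec2 p = ivec2(gl_GlobalInvocationID.xy);
        if (p.x >= uWidth || p.y >= uHeight) return;   // guard partial workgroups
        int base = (p.y * uWidth + p.x) * uSamples;    // flatten (pixel, sample) -> array index
        for (int s = 0; s < uSamples; ++s)
            samples[base + s] = imageLoad(uMsImage, p, s);
    }
)";
// After glDispatchCompute: call glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT) before
// reading the SSBO back with glGetBufferSubData or glMapBuffer.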

Read and Write in one Texture (OpenGL)

I want to store and update information in a texture. The idea is that I create a new texture with the current information. While writing to it during the render pass, I actually want to read the information already at the same pixel and store a weighted average of both values: the value that was rendered to that pixel and the value that was already there.
Now, I have read very often that I cannot read and write the same texture. My question is: might it be possible after all? And if not, should I copy the texture before the rendering step and pass the copy to the shader? If so, how can I copy the texture, or should I do an extra rendering step for copying?
I see two possible options here, depending on the mix equation:
Alpha Blending: If the equation used can be mapped onto the blend functions configurable with glBlendFunc/glBlendEquation, then this is the way to go. If you want to use linear factors for the stored and the new value, this should be possible (see the sketch below). This is also the option where I would expect the best performance.
Image Load Store: With this method one can read and write to the same texture at the same time. The performance will usually be very bad here, and you will have to use image atomic operations to ensure that multiple fragments at the same location always read the correct value.
Copying the texture would, in my opinion, only work if you render an image and then perform one weighted-average computation on it afterwards (otherwise you would have to copy the texture after each store operation). But if that is the case, one could simply render the result of the average computation to a different texture and completely avoid all the trouble of copying the input data.
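For the blending option, a constant-weight average can be expressed directly with the blend state, for example (sketch; the weight 0.25 for the newly rendered value is just an assumption):

// result = w * new_value + (1 - w) * stored_value, computed by the blender
// while rendering into the FBO whose color attachment is the texture being updated.
const float w = 0.25f;
glEnable(GL_BLEND);
glBlendColor(w, w, w, w);
glBlendFunc(GL_CONSTANT_COLOR, GL_ONE_MINUS_CONSTANT_COLOR);
glBlendEquation(GL_FUNC_ADD);
// ... issue the draw calls that produce the new values ...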
If resorting to an extension is an option, you can use NV_texture_barrier which allows writing and reading from the same texture.
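With that extension the usage pattern is essentially (sketch; the draw calls are hypothetical placeholders): bind the texture both as the FBO color attachment and as a sampler, and call the barrier between the passes so that earlier writes become visible to later reads:

drawPassWritingTheTexture();   // hypothetical draw call
glTextureBarrierNV();          // makes the writes above visible to subsequent texture fetches
drawPassReadingTheTexture();   // hypothetical draw call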

Data processing and video generation with OpenGL/CL

Goal: compensate and visualize a stream of 14-bit data (2D video).
Existing solution: Each sample needs to be compensated for a gain and offset, so it requires one multiplication and one addition. Then I assign a colour to the sample by a look-up table and output a stream of "colours" directly to the display. Everything is done on CPU.
Requirements: I need to be able to dynamically set a look-up table (palette).
It seems obvious to use the GPU for such an operation, but I couldn't find any information about how to move from the data domain to the picture domain with OpenGL. I've thought about using OpenCL for the data compensation and image generation and then moving to OpenGL for displaying (or, in general, for manipulating the picture).
Can you recommend a good approach for this? Can this all be achieved efficiently with just OpenGL? How?
Yes, it can be done using only OpenGL.
I would suggest a workflow like the following:
For each frame:
Upload frame from stream to texture memory
Draw a full-screen quad, with texture coordinates from 0,0 to 1,1
In a fragment shader, apply the appropriate transformation to each pixel. The lookup table can also be stored in a texture, so you only have to perform a lookup at the appropriate location.
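For example, the fragment shader could look roughly like this (a sketch only; it assumes the 14-bit samples are uploaded into a GL_R16UI texture and the palette into an N x 1 RGBA texture, and all uniform names are placeholders):

static const char* kCompensateFS = R"(
    #version 330
    uniform usampler2D uData;     // raw 14-bit samples, one per texel
    uniform sampler2D  uPalette;  // lookup table stored as an N x 1 texture
    uniform float uGain;
    uniform float uOffset;
    in  vec2 vTexCoord;
    out vec4 fragColor;
    void main() {
        float s = float(texture(uData, vTexCoord).r);                // 0 .. 16383
        float v = clamp((s * uGain + uOffset) / 16383.0, 0.0, 1.0);  // compensate and normalize
        fragColor = texture(uPalette, vec2(v, 0.5));                 // palette lookup
    }
)";
// The palette texture can be replaced at any time with glTexSubImage2D, which covers
// the requirement of setting the lookup table dynamically.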
In general: This question is at the moment a little bit too broad to be answered in more detail. For example a stream of 14-bit data could be a lot of things. I assumed for this answer you meant a (2D) video stream.

What is the most efficient process to push YUV texture data onto a GPU in OpenGL?

Does anyone know of an efficient way to push 2vuy non-planar data onto a GPU in a way that doesn't require swizzling?
I am grabbing the raw 2vuy data from an h264 video file and successfully loading it into a texture that I map to an OpenGL object. I notice that my code spends a fair amount of time in glgProcessPixelsWithProcessor. My glTexImage2D call looks like the following:
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_YCBCR_422_APPLE,
             GL_UNSIGNED_SHORT_8_8_APPLE, data);
Apple says in its OpenGL guide that GL_YCBCR_422_APPLE provides "acceptable" performance (p. 103), but that
Note: If your data needs only to be swizzled, glgProcessPixels performs the swizzling reasonably fast although not as fast as if the data didn't need swizzling. But non-native data formats are converted one byte at a time and incurs a performance cost that is best to avoid.
I assume that there is some kind of internal format conversion going on the CPU. I noticed in another thread that glgProcessPixels is running a block method as well.
Is my path the most efficient? If not, what is?
Your code, as it stands right now, depends on Apple extensions. I can't tell what's happening inside.
However, what I suggest is that you create three 2D textures, each with exactly one channel, where each texture receives one of the color planes; using independent textures makes supporting chroma subsampling (that 4:2:2) simpler.
In a shader you'd then perform the colorspace conversion. When writing down the math, I suggest you do this via a connection color space, like XYZ, as this allows you to take the color profile of the output device into account; ICC profiles provide the conversion data from XYZ color space coordinates to device color space (RGB) coordinates.
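If you skip the ICC route and just want something working, a fragment shader along these lines would do (sketch only; it uses plain full-range BT.601 coefficients instead of the XYZ conversion suggested above, and the sampler names are placeholders):

static const char* kYuvToRgbFS = R"(
    #version 330
    uniform sampler2D uY;    // full-resolution luma plane (single channel)
    uniform sampler2D uCb;   // half-width chroma planes; the sampler's linear
    uniform sampler2D uCr;   // filtering handles the 4:2:2 upsampling
    in  vec2 vTexCoord;
    out vec4 fragColor;
    void main() {
        float y  = texture(uY,  vTexCoord).r;
        float cb = texture(uCb, vTexCoord).r - 0.5;
        float cr = texture(uCr, vTexCoord).r - 0.5;
        fragColor = vec4(y + 1.402 * cr,
                         y - 0.344136 * cb - 0.714136 * cr,
                         y + 1.772 * cb,
                         1.0);
    }
)";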

Read Framebuffer-texture like a 1D array

I am doing some gpgpu calculations with GL and want to read my results from the framebuffer.
My framebuffer texture is logically a 1D array, but I made it 2D to have a bigger area. Now I want to read from an arbitrary position in the framebuffer texture, for any given length.
That means all calculations are already done on the GPU side and I only need to pass certain data to the CPU, data that may span across the rows of the texture.
Is this possible? If yes, is it slower or faster than calling glReadPixels on the whole image and then cutting out what I need?
EDIT
Of course I know about OpenCL/CUDA but they are not desired because I want my program to run out of the box on (almost) any platform.
Also, I know that glReadPixels is very slow, and one reason might be that it offers some functionality that I do not need (operating in 2D). Therefore I asked for a more basic function that might be faster.
Reading the whole framebuffer with glReadPixels just to discard all of it except for a few pixels/lines would be grossly inefficient. But glReadPixels lets you specify a rectangle within the framebuffer, so why not just restrict it to fetching the few rows of interest? You may end up fetching some extra data at the start and end of the first and last lines fetched, but I suspect the overhead of that is minimal compared with making multiple calls.
Possibly writing your data to the framebuffer in tiles and/or using Morton order might help structure it so that a tighter bounding box can be found and the extra data retrieved is minimised.
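As a rough sketch of the row-restricted read (texW, offset and count are placeholders; it assumes an RGBA float framebuffer and that the 1D index i maps to pixel (i % texW, i / texW)):

// std::vector needs <vector>
int firstRow = offset / texW;
int lastRow  = (offset + count - 1) / texW;
std::vector<float> rows((lastRow - firstRow + 1) * texW * 4);
glReadPixels(0, firstRow, texW, lastRow - firstRow + 1,
             GL_RGBA, GL_FLOAT, rows.data());
// The wanted range starts at element (offset - firstRow * texW) * 4 of 'rows';
// everything before/after it is the extra head/tail data mentioned above.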
You can use a pixel buffer object (PBO) to transfer pixel data from the framebuffer to the PBO, then use glMapBufferARB to read the data directly:
http://www.songho.ca/opengl/gl_pbo.html
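The general PBO read-back pattern looks roughly like this (sketch; texW, firstRow and rowCount are placeholders, and glMapBuffer is the core equivalent of glMapBufferARB):

GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, texW * rowCount * 4 * sizeof(float), nullptr, GL_STREAM_READ);
glReadPixels(0, firstRow, texW, rowCount, GL_RGBA, GL_FLOAT, nullptr); // writes into the bound PBO, returns quickly
// ... ideally do other work here while the transfer completes ...
const float* data = static_cast<const float*>(glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY));
// ... use data ...
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);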