I have a openGL application which is rendering data into a rgba texture. I want to encode and stream it using gstreamer framework (using nvenc plugin for h264 encoding).
I was looking through the documentation to solve these problems:
How to export the existing openGL context of the app to nvenc element.
How to pass the texture id to source from?
How will synchronization work. i.e nvenc has to wait for rendering to finish and similarly app has to wait for nvenc to finish reading from the texture. I am assuming it would either involve using sync fences or glMemoryBarriers.
Any sample code would really be really helpful.
I do want to avoid any texture copies to cpu memory. Nvidia's NVENC sdk mentions that it uses CUDA context to make the calls, and an openGL texture can be imported into CUDA context using cudaGraphicsGLRegisterImage call. So my expectation is that from app to video encoded frame can be done without any copies.
I know this is an old question, but just in case someone else hit this problem...
If your NVENC calls and OpenGL app is in the same thread, you don't need to do anything with the context.
If not, you should probably create two OpenGL contexts, one for rendering, one for encoding. The two contexts should share objects as explained in https://www.khronos.org/opengl/wiki/OpenGL_Context.
You can also create only one context and transfer the context between threads by making it "current" to the thread that's accessing the OpenGL objects, but I found the two contexts way easier.
Texture id is an integer, just pass it.
NvEncMapInputResource "provides synchronization guarantee that any graphics or compute work submitted on the input buffer is completed before the buffer is used for encoding". NvEncEncodePicture has "synchronous mode of encoding".
As of today, NVENC supports OpenGL encode device on linux, so you don't have to register OpenGL texture in CUDA. NVENC can directly access the OpenGL texture, so there's no memory copy on the client side.
If you're working on windows, I believe you can create a CUDA encode device, then get a CUarray from an OpenGL texture, and NVENC can access the CUarray.
Sample code of OpenGL and CUDA encode device can be found in samples of NVENC SDK.
EDIT:
The synchronization guarantee of NvEncMapInputResource seems to hold only in single thread case (or in the same GL context?). Adding a sync object before mapping is mandatory if rendering and encoding are happening in different threads and contexts.
Related
I have an application which uses Ogre engine for rendering (OpenGL). There's a texture that binded to the pipeline. Also there's a CUDA call that modifies that texture. Basically it looks like this:
cudaGraphicsMapResources(tex);
// call cuda kernel that writes to te texture
cudaGraphicsUnmapResources(tex);
How safe is this? Is it possible that CUDA will update the texture that is currently in use by OpenGL? I don't know OpenGL but know other APIs. In DirextX 12 or vulkan I need to set barriers or other sync mechanisms for this kind of work. But on the other hand DirectX 11 allows to update mapped resources safely because it has synchronization inside the API.
It should be safe to do this, primarily because OpenGL stores all relevant buffers and IDs in the GPU. Note however, you might not be able to update the texture that is bound to the GPU.
As long as the texture is mapped to CUDA resource, any attempt to read or write on OpenGL side will lead to undefined results.It is explicitly stated in CUDA docs.
Is it possible to render to OpenGL from Vulkan?
It seems nVidia has something:
https://lunarg.com/faqs/mix-opengl-vulkan-rendering/
Can it be done for other GPU's?
Yes, it's possible if the Vulkan implementation and the OpenGL implementation both have the appropriate extensions available.
Here is a screenshot from an example app in the Vulkan Samples repository which uses OpenGL to render a simple shadertoy to a texture, and then uses that texture in a Vulkan rendered window.
Although your question seems to suggest you want to do the reverse (render to something using Vulkan and then display the results using OpenGL), the same concepts apply.... populate a texture in one API, use synchronization to ensure the GPU work is complete, and then use the texture in the other API. You can also do the same thing with buffers, so for instance you could use Vulkan for compute operations and then use the results in an OpenGL render.
Requirements
Doing this requires that both the OpenGL and Vulkan implementations support the required extensions, however, according to this site, these extensions are widely supported across OS versions and GPU vendors, as long as you're working with a recent (> 1.0.51) version of Vulkan.
You need the the External Objects extension for OpenGL and the External Memory/Fence/Sempahore extensions for Vulkan.
The Vulkan side of the extensions allow you to allocate memory, create semaphores or fences while marking the resulting objects as exportable. The corresponding GL extensions allow you to take the objects and manipulate them with new GL commands which allow you to wait on fences, signal and wait on semaphores, or use Vulkan allocated memory to back an OpenGL texture. By using such a texture in an OpenGL framebuffer, you can pretty much render whatever you want to it, and then use the rendered results in Vulkan.
Export / Import example code
For example, on the Vulkan side, when you're allocating memory for an image you can do this...
vk::Image image;
... // create the image as normal
vk::MemoryRequirements memReqs = device.getImageMemoryRequirements(image);
vk::MemoryAllocateInfo memAllocInfo;
vk::ExportMemoryAllocateInfo exportAllocInfo{
vk::ExternalMemoryHandleTypeFlagBits::eOpaqueWin32
};
memAllocInfo.pNext = &exportAllocInfo;
memAllocInfo.allocationSize = memReqs.size;
memAllocInfo.memoryTypeIndex = context.getMemoryType(
memReqs.memoryTypeBits, vk::MemoryPropertyFlagBits::eDeviceLocal);
vk::DeviceMemory memory;
memory = device.allocateMemory(memAllocInfo);
device.bindImageMemory(image, memory, 0);
HANDLE sharedMemoryHandle = device.getMemoryWin32HandleKHR({
texture.memory, vk::ExternalMemoryHandleTypeFlagBits::eOpaqueWin32
});
This is using the C++ interface and is using the Win32 variation of the extensions. For Posix platforms there are alternative methods for getting file descriptors instead of WIN32 handles.
The sharedMemoryHandle is the value that you'll need to pass to OpenGL, along with the actual allocation size. On the GL side you can then do this...
// These values should be populated by the vulkan code
HANDLE sharedMemoryHandle;
GLuint64 sharedMemorySize;
// Create a 'memory object' in OpenGL, and associate it with the memory
// allocated in vulkan
GLuint mem;
glCreateMemoryObjectsEXT(1, mem);
glImportMemoryWin32HandleEXT(mem, sharedMemorySize,
GL_HANDLE_TYPE_OPAQUE_WIN32_EXT, sharedMemoryHandle);
// Having created the memory object we can now create a texture and use
// the memory object for backing it
glCreateTextures(GL_TEXTURE_2D, 1, &color);
// The internalFormat here should correspond to the format of
// the Vulkan image. Similarly, the w & h values should correspond to
// the extent of the Vulkan image
glTextureStorageMem2DEXT(color, 1, GL_RGBA8, w, h, mem, 0 );
Synchronization
The trickiest bit here is synchronization. The Vulkan specification requires images to be in certain states (layouts) before corresponding operations can be performed on them. So in order to do this properly (based on my understanding), you would need to...
In Vulkan, create a command buffer that transitions the image to ColorAttachmentOptimal layout
Submit the command buffer so that it signals a semaphore that has similarly been exported to OpenGL
In OpenGL, use the glWaitSemaphoreEXT function to cause the GL driver to wait for the transition to complete.
Note that this is a GPU side wait, so the function will not block at all. It's similar to glWaitSync (as opposed to glClientWaitSync)in this regard.
Execute your GL commands that render to the framebuffer
Signal a different exported Semaphore on the GL side with the glSignalSemaphoreEXT function
In Vulkan, execute another image layout transition from ColorAttachmentOptimal to ShaderReadOnlyOptimal
Submit the transition command buffer with the wait semaphore set to the one you just signaled from the GL side.
That's would be an optimal path. Alternatively, the quick and dirty method would be to do the vulkan transition, and then execute queue and device waitIdle commands to ensure the work is done, execute the GL commands, followed by glFlush & glFinish commands to ensure the GPU is done with that work, and then resume your Vulkan commands. This is more of a brute force approach and will likely produce poorer performance than doing the proper synchronization.
NVIDIA has created an OpenGL extension, NV_draw_vulkan_image, which can render a VkImage in OpenGL. It even has some mechanisms for interacting with Vulkan semaphores and the like.
However, according to the documentation, you must bypass all Vulkan layers, since layers can modify non-dispatchable handles and the OpenGL extension doesn't know about said modifications. And their recommended means of doing so is to use the glGetVkProcAddrNV for all of your Vulkan functions.
Which also means that you can't get access to any debugging that relies on Vulkan layers.
There is some more information in this more recent slide deck from SIGGRAPH 2016. Slides 63-65 describe how to blit a Vulkan image to an OpenGL backbuffer. My opinion is that it may have been pretty easy for NVIDIA to support this since the Vulkan driver is contained in libGL.so (on Linux). So it may not have been that hard to give the Vulkan image handle to the GL side of the driver and have it be useful.
As another answer pointed out, there are still no official registered multi-vendor interop extensions. This approach just works on NVIDIA.
I'm fairly new to CUDA, but I've managed to display something generated by a kernel on the screen using OpenGL. I've tried several approach :
Using a PBO and an OpenGL texture (old style);
Using a OpenGL texture as a CUDA surface and rendering on a quad (new style);
Using a renderbuffer as a CUDA surface and rendering using glBlitFramebuffer.
All of them worked, but, while implementing #2, I erroneously set the hint as cudaGraphicsRegisterFlagsWriteDiscard. Since all of the data will be generated by CUDA, I thought this was the correct option. However, later I realized that I needed a CUDA surface to write to an OpenGL texture, and when you use a surface, you are requested to use the LoadStore flag.
So basically my question is this : Since I absolutely need a CUDA surface to write to an OpenGL texture in CUDA, what is the use case of cudaGraphicsRegisterFlagsWriteDiscard in cudaGraphicsGLRegisterImage?
The documentation description seems pretty straightforward. It is for one-way delivery of data from CUDA to OpenGL.
This online book excerpt provides a similar explanation:
Applications where CUDA is the producer and OpenGL is the consumer should register the objects with a write-discard flag...
If you want to see an example, take a look at the postProcessGL cuda sample. In that case, OpenGL is rendering an image, and it's being post-processed (blur added) by cuda, before display. In this case, there are two separate pathways for data flow. In the OpenGL->CUDA case, the data is handled by the createTextureSrc function, and the flag specified is read-only. For the CUDA->OpenGL case (delivery of the post-processed frame) the function is handled in createTextureDst, where a call is made to cudaGraphicsGLRegisterImage with the cudaGraphicsMapFlagsWriteDiscard flag specified, since on this path, CUDA is producing and OpenGL is consuming.
To understand how the textures are handled (populated with data from the cuda operations via a cudaArray) you probably want to study the sequence of operations in processImage().
So what I need is simple: Imagine we have no gui at all - ssh access to some linux where we gonna build and host our app. That app would generate video stream. We have some SDL app with OpenGL shader in it. All we want is to get rendering (as normally we would have in SDL window) as a char* (with size W*H*3) How to do such thing? How to make SDL render stuff not onto its gui window but into some swappable pointer?
To be of any use, OpenGL should be hardware accelerated, so first check if your server does have a GPU that meets your requirements. If you're on a rented virtual server or some standard root server, then you very likely don't have a GPU.
If you have a GPU, then there are two possible methods:
Method 1 -- the easy one
You'll (unfortunately) have to configure and start the X server for it and this X server must also be the current virtual terminal (i.e. it must be the active thing on the graphics card). Then you give the user who'll be running that video generator access to that X display (read man xauth and what it references)
The next step is independent of SDL, it's an OpenGL think: Create a Framebuffer Object onto which the desired graphics is rendered; a PBuffer would work as well, and actually I'd prefer it in this situation, however I found Framebuffer Objects be more reliable than PBuffers on current Linux and its drivers.
Then render to this Framebuffer Object or PBuffer as usual and retrieve the content using glReadPixels
Method 2 -- the flexible one
On the low level this is quite similar to Method 1, but things get abstracted for you: Get VirtualGL http://www.virtualgl.org/ to perform the actual OpenGL rendering on the GPU. Instead of starting your application on a secondary X server you make direct use of the VirtualGL server provided sending the GLX stream and get a JPEG image stream back. You could also use a secondary X server running a virtual framebuffer and take a continous screencapture of that. Or probably most elegant: Write your own X.Org video driver that passes the video to the video streamer directly.
You cannot directly render to a byte array in OpenGL.
There are two ways to work with this. The first way is the simplest and doesn't require context gimmickery, and the second way does.
So first, the simple way.
In order for OpenGL to work, you need to have a window. That doesn't mean the window needs to be visible, but you need to create one to get a valid OpenGL context. Therefore Step 1: Create a window and minimize it.
Now, in order to get valid rendering, the pixels in the framebuffer must pass the "pixel ownership test." When rendering to the framebuffer that holds the screen itself, pixels of the window that are not actually visible on screen fail the pixel ownership test. So the values of those pixels are undefined if you use glReadPixels.
However, this only pertains to the default framebuffer that is associated with the window. Framebuffer objects always pass the pixel ownership test. Therefore, Step 2: Create a framebuffer object and the associated renderbuffers for your needs.
From there, it's pretty simple. Just render as normal and do a glReadPixels when you want to get the data. Pixel buffer objects can be used to asynchronous transfer pixel data, if performance is a concern. Step 3: Render and use glReadPixels to get the data.
The second way is more widely available (FBOs require extension support or OpenGL 3.0), but more platform-specific.
Instead of creating an FBO in step 2, you instead have Step 2: use glXCreatePbuffer to create a pbuffer. A pbuffer is an off-screen render target that acts like the default framebuffer. You glXMakeContextCurrent to tell OpenGL to render to the pbuffer instead of the default framebuffer.
Steps 1 and 3 are the same as above.
Is it possible to allocate some memory on the GPU without cuda?
i'm adding some more details...
i need to get the video frame decoded from VLC and have some compositing functions on the video; I'm doing so using the new SDL rendering capabilities.
All works fine until i have to send the decoded data to the sdl texture... that part of code is handled by standard malloc which is slow for video operations.
Right now i'm not even sure that using gpu video will actually help me
Let's be clear: are you are trying to accomplish real time video processing? Since your latest update changed the problem considerably, I'm adding another answer.
The "slowness" you are experiencing could be due to several reasons. In order get the "real-time" effect (in the perceptual sense), you must be able to process the frame and display it withing 33ms (approximately, for a 30fps video). This means you must decode the frame, run the compositing functions (as you call) on it, and display it on the screen within this time frame.
If the compositing functions are too CPU intensive, then you might consider writing a GPU program to speed up this task. But the first thing you should do is determine where the bottleneck of your application is exactly. You could strip your application momentarily to let it decode the frames and display them on the screen (do not execute the compositing functions), just to see how it goes. If its slow, then the decoding process could be using too much CPU/RAM resources (maybe a bug on your side?).
I have used FFMPEG and SDL for a similar project once and I was very happy with the result. This tutorial shows to do a basic video player using both libraries. Basically, it opens a video file, decodes the frames and renders them on a surface for displaying.
You can do this via Direct3D 11 Compute Shaders or OpenCL. These are similar in spirit to CUDA.
Yes, it is. You can allocate memory in the GPU through OpenGL textures.
Only indirectly through a graphics framework.
You can use OpenGL which is supported by virtually every computer.
You could use a vertex buffer to store your data. Vertex buffers are usually used to store points for rendering, but you can easily use it to store an array of any kind. Unlike textures, their capacity is only limited by the amount of graphics memory available.
http://www.songho.ca/opengl/gl_vbo.html has a good tutorial on how to read and write data to vertex buffers, you can ignore everything about drawing the vertex buffer.