QOpenGLWidget video rendering performance in multiple processes - C++

My problem may seem vague without code, but it actually isn't.
So, I've got an almost properly working widget which renders video frames.
Qt 5.10 and QOpenGLWidget subclassing worked fine; I didn't make any sophisticated optimizations -- there are two textures and a couple of shaders converting the YUV pixel format to RGB -- glTexImage2D() + shaders, no buffers.
Video frames are obtained from FFmpeg, and it shows great performance thanks to hardware acceleration... when there is only one video window.
The piece of software is a "video wall" -- multiple independent video windows on the same screen. Of course, multi-threading would be the preferred solution, but legacy holds for now, I can't change it.
So, 1 window with Full HD video consumes ~2% CPU & 8-10% GPU regardless of the size of the window. But 7-10 similar windows, launched from the same executable at the same time, consume almost all the CPU. My math says that 8 x 2% is nowhere near 100%...
My best guesses are:
This is an FFmpeg decoder issue: hardware acceleration is not magic, and some hardware pipeline stalls.
7-9 independent OpenGL contexts cost a lot more than N times the cost of one.
I'm not using PBOs or other techniques to improve OpenGL rendering. That still explains nothing, but at least it is a guess.
The behavior is the same on Ubuntu, where decoding uses a different codec (I mean that using GPU-accelerated or CPU-accelerated codecs makes no difference!), so it seems more likely that I'm missing something about OpenGL... or not, because launching 6-7 Qt examples with dynamic textures shows normal growth of CPU usage -- approximately the sum over the number of windows.
Anyway, it has become quite tricky for me to profile this case, so I hope someone has solved a similar problem before and can share their experience with me. I'd appreciate any ideas on how to deal with the described riddle.
I can add any pieces of code if that helps.
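For context, the per-frame upload currently looks roughly like this (heavily simplified sketch; member names are illustrative, and the frame is assumed to be NV12-style, i.e. a full-resolution Y plane plus a half-resolution interleaved UV plane):

    // Inside the QOpenGLWidget subclass; called once per decoded frame.
    void VideoWidget::paintGL()
    {
        // Re-upload both planes every frame. With no PBOs, glTexImage2D copies
        // synchronously from CPU memory into the driver.
        glActiveTexture(GL_TEXTURE0);
        glBindTexture(GL_TEXTURE_2D, m_texY);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_R8, m_width, m_height, 0,
                     GL_RED, GL_UNSIGNED_BYTE, m_frame->data[0]);

        glActiveTexture(GL_TEXTURE1);
        glBindTexture(GL_TEXTURE_2D, m_texUV);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RG8, m_width / 2, m_height / 2, 0,
                     GL_RG, GL_UNSIGNED_BYTE, m_frame->data[1]);

        // The fragment shader samples both textures and does the YUV -> RGB conversion.
        m_program->bind();
        m_program->setUniformValue("texY", 0);
        m_program->setUniformValue("texUV", 1);
        drawQuad();   // two triangles covering the widget
    }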

Related

How can I draw to the display, without OpenGL?

I've been learning OpenGL, and as I sit trying to write my VBOs, PBOs, VAOs, textures, quads, bindings, fragment shaders, vertex shaders, and a whole suite of other modern abstractions upon abstractions built after decades of evolution, I wonder: Isn't the display nothing but a large block of memory?
I've heard tales that in the "good ol' days" (such as on the Commodore 64), all you had to do was assign a value to an arbitrary byte in memory, and the screen would change a pixel. Extremely simple and elegant. In the modern day, this has changed with layers upon layers of abstractions and safeguards, such that changing a pixel on your display feels several hundred feet away.
This begs the question, is it possible in the modern day to just "update a pixel of the screen"? Is it possible to write my own graphics driver or something, where I can send commands to some C wrapper which interfaces with the GPU to change those pixels? This is an extremely broad question, but I'm curious. The answer I'm looking for to this question would provide a rough outline of what you'd have to do in order to be able to arbitrarily get some C code to set a pixel on the screen, as well as a rough outline of why OpenGL has progressed the way it has - what problems did VBOs, PBOs, VAOs, bindings, shaders, etc. solve, and how we got to where we are today.
Isn't the display nothing but a large block of memory?
Yes, it is called a framebuffer.
I've heard of tales, that in the "good ol' days" (such as the Commodore 64)
Your current PC works like that right when you power it up! If you use the CPU to write into video memory, that is called a software renderer.
In the modern day, this has changed with layers upon layers of abstractions and safeguards, such that changing a pixel on your display is several hundred feet away.
No, they are not abstractions/safeguards for "changing pixels". Nowadays software renderers are not used anymore. Instead, you have to tell the GPU (which is another computer on its own) how to draw. That "talk" is what the APIs (like OpenGL) do for you.
Now, the GPUs are meant to be fast at drawing, and that requires specialized code and data structures. Those are all the things you mention: VBOs, PBOs, VAOs, shaders, etc. (in OpenGL parlance). There is no way around that, because GPUs are different hardware.
is it possible in the modern day to just "update a pixel of the screen"?
Yes, but that will end up being drawn somehow by the GPU, even if it looks to you like a memory write.
Is it possible to write my own graphics driver or something, where I can send commands to some C wrapper which interfaces with the GPU to change those pixels?
Yes, but that "C wrapper" is the graphics driver. A graphics driver for a modern GPU is very complex.
what you'd have to do in order to be able to arbitrarily get some C code to set a pixel on the screen
You cannot write a "C program" to write to a graphical screen because the C standard does not concern itself with graphical displays.
So it depends on your operating system, your hardware, whether you want 2D or 3D acceleration support, the API you choose...
as well as a rough outline of why OpenGL has progressed the way it has - what problems did VBOs, PBOs, VAOs, bindings, shaders, etc. solve, and how we got to where we are today.
See above.
You can make your own frame buffer - that is just an integer array - and do rasterization on it, then use for example the Windows GDI function SetBitmapBits() to draw it to the display in one go. The final draw-to-display command depends on the operating system.
How you do the rasterization on your framebuffer is completely up to you. You can use the CPU to draw individual pixels or rasterize lines and triangles, see for example this demo of my old CPU graphics engine using Windows GDI: https://youtu.be/GFzisvhtRS4.
Using the CPU is fine as long as you do not rasterize large datasets. From my experience, the limit to real-time 60fps rendering on the CPU is ~50k lines per frame.
If you want to rasterize really large datasets, you have to use a GPU in some way. Since the framebuffer is just an integer array, you can transfer it to/from the GPU using OpenCL or CUDA and - if your dataset happens to already be in video memory - do all the rasterization on the GPU, extremely fast and in parallel. For this you will need an additional z-buffer to decide which pixels get overdrawn by occluding geometry. This way you can rasterize approximately 30 million lines per frame at 60fps. This demo is rendered on the GPU in real time using OpenCL: https://youtu.be/lDsz2maaZEo
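As a rough illustration of the CPU path described above (Windows-only sketch, error handling omitted; the hwnd handle and the 32-bit display format are assumptions), filling an integer array and pushing it to the window with GDI could look like this:

    #include <windows.h>
    #include <vector>

    // Fill a plain integer framebuffer on the CPU, then hand it to GDI in one go.
    void presentFramebuffer(HWND hwnd, int width, int height)
    {
        std::vector<unsigned int> framebuffer(width * height);

        // "Rasterize": here just a simple gradient, but this is where you would
        // plot your own pixels, lines and triangles.
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x)
                framebuffer[y * width + x] = ((x & 0xFF) << 16) | ((y & 0xFF) << 8) | 0x80; // 0x00RRGGBB

        HDC windowDC = GetDC(hwnd);
        HDC memoryDC = CreateCompatibleDC(windowDC);
        HBITMAP bitmap = CreateCompatibleBitmap(windowDC, width, height); // assumed 32 bpp

        // Copy the whole array into the bitmap, then blit it to the window at once.
        SetBitmapBits(bitmap, width * height * sizeof(unsigned int), framebuffer.data());
        HGDIOBJ oldBitmap = SelectObject(memoryDC, bitmap);
        BitBlt(windowDC, 0, 0, width, height, memoryDC, 0, 0, SRCCOPY);

        SelectObject(memoryDC, oldBitmap);
        DeleteObject(bitmap);
        DeleteDC(memoryDC);
        ReleaseDC(hwnd, windowDC);
    }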
Is it possible in the modern day to just "update a pixel of the screen"?
Yes. In Windows for example, you can use SetPixel() to draw a pixel or BitBlt() to draw in bulk. See this Q/A
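For example (Windows sketch; the window handle is assumed to exist, and anything drawn this way gets overwritten on the next repaint):

    #include <windows.h>

    // Draw a short red horizontal line onto a window, one pixel at a time,
    // entirely on the CPU -- no GPU API involved from the program's point of view.
    void drawSomePixels(HWND hwnd)
    {
        HDC hdc = GetDC(hwnd);   // pass NULL instead to draw over the whole screen
        for (int x = 100; x < 200; ++x)
            SetPixel(hdc, x, 100, RGB(255, 0, 0));
        ReleaseDC(hwnd, hdc);
    }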
This works fine, but it means you're using the CPU for rendering, and you'll find the GPU is much more effective for this task, especially if you require a decent framerate and non-trivial graphics. The reason this "whole suite of other modern abstractions upon abstractions" exists is to serve as an interface to the GPU, since the GPU has an independent set of memory and a totally different execution model. Other GPU APIs (OpenCL, DirectX, Vulkan, etc.) all have the same kind of abstractions.
I've glossed over many nuances but I hope the point gets across.

Does OpenGL display image faster than OpenCV?

I am using OpenCV to show an image on a projector. But it seems cv::imshow is not fast enough, or maybe the data transfer from my CPU to the GPU and then to the projector is slow, so I wonder if there is a faster way to display images than OpenCV?
I considered OpenGL; since OpenGL uses the GPU directly, it may be faster than going through the CPU, which is what OpenCV uses. Correct me if I am wrong.
OpenCV already supports OpenGL for image output by itself. No need to write this yourself!
See the documentation:
http://docs.opencv.org/modules/highgui/doc/user_interface.html#imshow
http://docs.opencv.org/modules/highgui/doc/user_interface.html#namedwindow
Create the window first with namedWindow, where you can pass the WINDOW_OPENGL flag.
Then you can even use OpenGL buffers or GPU matrices as input to imshow (the data never leaves the GPU). But it will also use OpenGL to show regular matrix data.
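A minimal sketch (assuming an OpenCV build configured with WITH_OPENGL=ON; the file name is just an example):

    #include <opencv2/highgui/highgui.hpp>

    int main()
    {
        cv::Mat image = cv::imread("frame.png");    // any BGR image

        // Ask highgui for an OpenGL-backed window instead of the default one.
        cv::namedWindow("projector", cv::WINDOW_OPENGL);
        cv::resizeWindow("projector", image.cols, image.rows);

        // imshow now uploads the Mat to an OpenGL texture and lets the GPU draw it.
        // With a CUDA/OpenGL build you can also pass a GpuMat or ogl::Texture2D here,
        // so the data never has to leave the GPU.
        cv::imshow("projector", image);
        cv::waitKey(0);
        return 0;
    }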
Please note:
To enable OpenGL support, configure OpenCV using CMake with WITH_OPENGL=ON. Currently OpenGL is supported only with WIN32, GTK and Qt backends on Windows and Linux (MacOS and Android are not supported). For the GTK backend the gtkglext-1.0 library is required.
Note that this is OpenCV 2.4.8 and this functionality has changed quite recently. I know there was OpenGL support in earlier versions in conjunction with the Qt backend, but I don't remember when it was introduced.
About the performance: It is a quite popular optimization in the CV community to output images using OpenGL, especially when outputting video sequences.
OpenGL is optimised for rendering images, so it's likely faster. It really depends on whether the OpenCV implementation uses any GPU acceleration AND whether the bottleneck is on the rendering side of things.
Have you tried GPU accelerated OpenCV? - http://opencv.org/platforms/cuda.html
How big is the image you are displaying? How long does it take to display the image using cv::imshow now?
I know it's an old question, but I happened to have exactly the same problem. And from my observations I've concluded that the root of the problem is the projector's own latency, especially if one is using an older model.
How did I conclude that?
I displayed the same video sequence with cv::imshow() on the laptop monitor and on the projector. Then I waved my hand. It was obvious that the projector introduces significant latency.
To double-check, I opened a webcam feed, waved my hand in front of it and observed the difference on the monitor and on the projector. The webcam does no processing and no OpenCV operations, so in my understanding the only thing that could explain the latency is the projector itself.

optimal pixel-read back strategy

I need to render certain scenes and read the whole image back into main memory. I've searched for this, and it seems that most video cards will accelerate the rendering but the read-back will be very slow. After a bit of research I only found this card mentioning "Hardware-Accelerated Pixel Read-Back".
The other approach would be software rendering, where the read-back problem doesn't exist, but then the rendering performance will be bad.
Likely I will have to implement both in order to find the optimal trade-off, but my question is about what other alternatives I have hardware-wise. I understand Quadro targets the modelling and design market segment, which is precisely the client target of this application. Does this mean that I'm not likely to find better pixel read-back performance in other video card lines, e.g. Tesla or Fermi, which don't even have video outputs, by the way?
I don't know if the performance would be any different, but you could at least try rendering to an off-screen buffer, then setting that as a texture of a full-screen quad (or outputting that to video in some other way)
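Roughly, the off-screen part could look like this with an FBO (sketch only; error checking omitted, and renderScene(), width and height are placeholders). Whether the read-back itself gets any faster than reading the default framebuffer is something you'd have to measure:

    // Render into an off-screen texture via an FBO; the texture can then be
    // sampled by a full-screen quad or read back to main memory.
    GLuint fbo = 0, colorTex = 0;
    glGenTextures(1, &colorTex);
    glBindTexture(GL_TEXTURE_2D, colorTex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);

    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, colorTex, 0);

    renderScene();                                   // draw into the texture

    // Read-back to main memory; this is the slow, synchronous part.
    std::vector<unsigned char> pixels(width * height * 4);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());

    glBindFramebuffer(GL_FRAMEBUFFER, 0);            // back to the window framebuffer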

Graphics Profiling

I've got an application which drops to around 10 fps. I profiled it with xperf, which showed my app was using just 20% of the CPU, with none of my methods using a larger than expected share of that 20%.
This seems to indicate that the vast drop in fps is because the graphics card isn't able to keep up with rendering the frames, resulting in my program stalling while it catches up...
Is there some way to profile what the graphics card is up to and work out what my program is telling it to do thats slowing it down, so that I can try to improve the frame rate?
For debugging / profiling graphics, try Nvidia PerfHUD
NVIDIA PerfHUD is a powerful real-time performance analysis tool for Direct3D applications.
There is also an ATI solution, called 'GPU PerfStudio'
GPU PerfStudio is a real-time performance analysis tool which has been designed to help tune the graphics performance of your DirectX 9, DirectX 10, and OpenGL applications. GPU PerfStudio displays real-time API, driver and hardware data which can be visualized using extremely flexible plotting and bar chart mechanisms. The application being profiled may be executed locally or remotely over the network. GPU PerfStudio allows the developer to override key rendering states in real-time for rapid bottleneck detection. An auto-analysis window can be used for identifying performance issues at various stages of the graphics pipeline. No special drivers or code modifications are needed to use GPU PerfStudio.
You can find more information and download links here:
http://developer.nvidia.com/object/nvperfhud_home.html
http://developer.amd.com/tools-and-sdks/graphics-development/gpu-perfstudio/
Also, check out this article on FPS:
FPS vs Frame Time
Basically it talks about the fact that a drop from 200fps to 190fps is negligible, whereas a drop from 30fps to 20fps is a MUCH bigger deal. For better performance measuring, you should be calculating frame time rather than FPS.
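For example, measuring frame time is just a matter of timing each iteration of the render loop (sketch; running() and renderFrame() stand in for your own loop condition and draw/present code). Put in milliseconds, a drop from 200 fps to 190 fps is only about 0.3 ms extra per frame, while 30 fps to 20 fps is almost 17 ms extra.

    #include <chrono>
    #include <cstdio>

    // Report per-frame time in milliseconds instead of an averaged FPS counter.
    void renderLoop()
    {
        using clock = std::chrono::steady_clock;
        auto previous = clock::now();
        while (running())                       // placeholder loop condition
        {
            renderFrame();                      // placeholder: draw + present
            auto now = clock::now();
            double ms = std::chrono::duration<double, std::milli>(now - previous).count();
            previous = now;
            std::printf("frame time: %.2f ms (%.1f fps)\n", ms, 1000.0 / ms);
        }
    }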
You never told us what your fps is or what the program is doing at all, so your "vast drop" might not be a big deal at all.
For DirectX, there is PIX for profiling the CPU and GPU operations. It can give very detailed info, and might be worth looking into.
Hope that helps!
You can try using dxprof (search in Google). It's a lightweight app that draws real-time bars, each bar corresponding to one DirectX event (such as a draw call or a resource copy). You can freeze the bars and check the call stack to find out where the draw call originates from.
Are you developing for Windows? If so avoid using Video for Windows as this will limit you in the manner that you describe. Use DirectX instead.
No need to guess. Just pause it a few times under the IDE, and it will show you exactly what it's waiting for.

What can cause a reduction in frame rate when upgrading a graphics card?

We have a two-screen DirectX application that previously ran at a consistent 60 FPS (the monitors' sync rate) using an NVIDIA 8400GS (256 MB). However, when we swapped out the card for one with 512 MB of RAM, the frame rate struggles to get above 40 FPS. (It only gets this high because we're using triple buffering.) The two cards are from the same manufacturer (PNY). All other things are equal: this is a Windows XP Embedded application and we started from a fresh image for each card. The driver version number is 169.21.
The application is all 2D, i.e. just a bunch of textured quads and a whole lot of pre-rendered graphics (hence the need to upgrade the card's memory). We also have compressed animations which the CPU decodes on the fly -- this involves a texture lock. The locks take forever, but I've also tried having a separate system-memory texture for the CPU to update and then updating the rendered texture using the device's UpdateTexture method. No overall difference in performance.
Although I've read through every FAQ I can find on the internet about DirectX performance, this is still the first time I've worked on a DirectX project so any arcane bits of knowledge you have would be useful. :)
One other thing whilst I'm on the subject; when calling Present on the swap chains it seems DirectX waits for the present to complete regardless of the fact that I'm using D3DPRESENT_DONOTWAIT in both present parameters (PresentationInterval) and the flags of the call itself. Because this is a two-screen application this is a problem as the two monitors do not appear to be genlocked, I'm working around it by running the Present calls through a threadpool. What could the underlying cause of this be?
Are the cards exactly the same (both GeForce 8400GS), with only the memory size differing? Quite often, different memory sizes come with slightly different clock rates (i.e. your card with more memory might use slower memory!).
So the first thing to check would be GPU core & memory clock rates, using something like GPU-Z.
It's an easy test to see if the surface lock is the problem: just comment out the texture update and see if the framerate returns to 60 Hz. Unfortunately, writing to a locked surface and updating the resource kills performance; it always has. Are you using mipmaps with the textures? I know DX9 added automatic generation of mipmaps, and it could be taking a lot of time to generate those. If you're constantly locking the same resource each frame, you could also try creating a pool of textures, kind of like triple-buffering except with textures: you let the renderer use one texture, and on the next update you pick the next available texture in the pool that isn't currently being used for rendering. Unless of course you're memory-constrained or you're only making diffs to the animated texture.
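A texture pool of that kind could look roughly like this in D3D9 (sketch only; the pool size, the format and the helper names are illustrative, and error handling is omitted):

    #include <d3d9.h>
    #include <cstring>

    // Round-robin pool of system-memory textures so the CPU never writes into a
    // texture the GPU might still be reading from.
    const int kPoolSize = 3;
    IDirect3DTexture9* g_sysMemPool[kPoolSize] = {};
    IDirect3DTexture9* g_renderTexture = NULL;
    int g_nextSlot = 0;

    void createPool(IDirect3DDevice9* device, UINT w, UINT h)
    {
        for (int i = 0; i < kPoolSize; ++i)
            device->CreateTexture(w, h, 1, 0, D3DFMT_A8R8G8B8,
                                  D3DPOOL_SYSTEMMEM, &g_sysMemPool[i], NULL);
        device->CreateTexture(w, h, 1, 0, D3DFMT_A8R8G8B8,
                              D3DPOOL_DEFAULT, &g_renderTexture, NULL);
    }

    // Called once per decoded animation frame; 'decoded' is assumed to be tightly
    // packed 32-bit pixels of size w x h.
    void updateFrame(IDirect3DDevice9* device, const void* decoded, UINT w, UINT h)
    {
        IDirect3DTexture9* staging = g_sysMemPool[g_nextSlot];
        g_nextSlot = (g_nextSlot + 1) % kPoolSize;

        D3DLOCKED_RECT rect;
        staging->LockRect(0, &rect, NULL, 0);            // CPU writes go here
        for (UINT y = 0; y < h; ++y)
            std::memcpy((BYTE*)rect.pBits + y * rect.Pitch,
                        (const BYTE*)decoded + y * w * 4, w * 4);
        staging->UnlockRect(0);

        // Let the driver schedule the copy into the texture that is actually rendered.
        device->UpdateTexture(staging, g_renderTexture);
    }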