Back in the days of DOS, the way we showed graphics on a computer was simply to copy raw image data into video memory for every frame.
Because of the limited bandwidth between the CPU and the GPU, this would be very inefficient today. To send every frame to the screen at modern resolutions, we would need something like
1080 * 1200 * 4 (bytes of color data) * 60 (frames per second) ≈ 311 megabytes every second.
So instead we preload the textures and vertices into GPU memory and just send the transformations each frame.
So how is HD video playback handled on modern hardware? Is there a way to compress every frame before it is sent to the GPU, or do we just send the raw ~311 MB/s like in the old days?
Unless decompression is being done at least in part on the GPU itself, then yes, you play video by uploading each frame's image to graphics memory.
Your math is off, though. 1080p is 1920x1080, so 30 fps video at 1080p requires ~238 MB/s. And that's perfectly doable. Even PCIe 1.0 x1 could handle it (though barely), and GPUs tend to use x16 slots, so 16x more bandwidth. And PCIe is at version 4.0 on most machines, so it's much faster today.
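Spelled out (the PCIe figures are approximate theoretical peaks):

# Uncompressed 1080p upload bandwidth, back of the envelope.
width, height = 1920, 1080
bytes_per_pixel = 4                               # RGBA, 8 bits per channel
fps = 30

frame_bytes = width * height * bytes_per_pixel    # ~7.9 MiB per frame
per_second = frame_bytes * fps                    # ~237 MiB/s
print(f"{per_second / 2**20:.0f} MiB/s")

pcie_1_x1 = 250e6        # ~250 MB/s, PCIe 1.0, one lane
pcie_4_x16 = 31.5e9      # ~31.5 GB/s, PCIe 4.0, sixteen lanes
print(f"PCIe 1.0 x1 utilisation: {per_second / pcie_1_x1:.0%}")
print(f"PCIe 4.0 x16 utilisation: {per_second / pcie_4_x16:.1%}")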
I render a simple scene with about 50 draw calls in total across the main scene, shadow, reflection and post-process maps. For the following test I set the shadow map size to 4k (4096x4096) and will vary the reflection map size.
The test runs on an NVidia GTX 650 with 1 GB of video memory. While the test application is running, only 600+ MB of video memory is used by all running applications, so I have 200-300 MB of free video RAM. Vsync is disabled and there is no FPS cap in the application, so GPU utilization is 99%-100%. All metrics are measured with the Nvidia driver tool.
OK, let's start the application and measure its FPS depending on the reflection map size:
1k and below - 110 FPS
2k - 102 FPS
4k - 38 FPS
Switching from a 2k to a 4k reflection texture drops the FPS significantly, while the measuring utility shows abnormal PCIe bandwidth utilization, which jumps from 1% to 65%. At the same time, the amount of used video memory does not exceed 700 MB (of the 1 GB available).
OK, let's run the Nvidia graphics debugger and see which OpenGL call is the heaviest. It shows that glClear for the reflection map takes more than 10 ms! Switching the reflection map back to 2k drops this to a few thousand nanoseconds, which is fine for present-day hardware. You may say that glClear actually flushes all previous rendering commands, but if I disable the clear calls in the debugger, the frame rate increases to 90-100 FPS.
This behavior can also be reproduced with Emscripten + Chrome, but in Chrome the slow functions are the glDrawElements calls that draw the terrain into the shadow map. Even stranger, the draw calls for the main-scene terrain (with a much heavier fragment shader) take less time.
So what is the problem? Why does the driver transfer anything over PCIe while there is still 300 MB of free video memory? Could the reason be improper texture or framebuffer usage?
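For scale, here is roughly what a single render target of each size costs, assuming an RGBA8 color attachment plus a 32-bit depth/stencil buffer:

# Rough memory footprint of one render target (color + depth/stencil).
def rt_bytes(size, color_bpp=4, depth_bpp=4):
    return size * size * (color_bpp + depth_bpp)

for size in (1024, 2048, 4096):
    print(f"{size}: {rt_bytes(size) / 2**20:.0f} MiB")
# 1024:   8 MiB
# 2048:  32 MiB
# 4096: 128 MiB

So each step up quadruples the footprint of that single render target.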
I have a glvideomixer sink that shows 16 720p 60 Hz videos simultaneously in a 4x4 matrix. When the sources of all 16 videos are 16 different "H.264 main profile" files, everything runs smoothly, but when I acquire the videos from 4 grabber cards (4 cards x 4 HDMI input ports, each set to 1280x720 at 60 Hz, the same as the video files) the output stutters.
The pipeline is very simple:
glvideomixer(name=vmix)
ksvideosrc(device-index=0...15)->capsfilter(video/x-raw,format=YV12,framerate=60/1,height=720,width=1280)->vmix.sink_0...15
Note: The ksvideosrc element is only available on Windows platform.
AFAIK the pipeline is GL-based, so all the video streams are implicitly uploaded to a GL context, where glvideomixer treats them as GL textures. Am I right?
But I don't understand why everything runs smoothly when I use 16 video files, even though in theory that is more work because the computer must decode those streams before sending them to the GPU, while with the grabber cards the output stutters.
I'm pretty sure the stream format from the cards is raw YV12, because I set the capsfilter element to explicitly request it. Here is the link to the grabbers: http://www.yuan.com.tw/en/products/capture/capture_sc510n4_hdmi_spec.htm
I think the bottleneck is the PCIe bus, but I'm not sure, because the GPU is an AMD FirePro W7100 running at x16 and the 4 grabber cards are x4 PCIe cards running at x4.
It should be noted that everything runs smoothly with up to 13 video signals from the grabbers; adding one more makes the stutter show up.
So: how can I find out where the bottleneck is?
Many thanks in advance.
Edit:
The rig is:
MB: Asus x99-deluxe USB 3.1:
http://www.asus.com/Motherboards/X99DELUXEU31/
CPU: hexa-core i7-5930K (Haswell-E), 40 PCIe lanes:
http://ark.intel.com/es/products/82931/Intel-Core-i7-5930K-Processor-15M-Cache-up-to-3_70-GHz
RAM: Kingston Hyperx PC4 21300 HX426C15FBK2/8 dual channel setup:
http://www.kingston.com/dataSheets/HX426C15FBK2_8.pdf
GPU: AMD FirePro W7100, 8 GB GDDR5, 256-bit:
http://www.amd.com/en-us/products/graphics/workstation/firepro-3d/7100
HDD: Kingston SSD Now V300:
http://www.kingston.com/datasheets/sv300s3_en.pdf
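For reference, a rough estimate of the raw data rate these capture streams represent (assuming uncompressed YV12, i.e. 12 bits per pixel, as set in the capsfilter):

# Raw bandwidth of the capture streams.
width, height, fps = 1280, 720, 60
bytes_per_frame = width * height * 12 // 8        # YV12: 12 bits per pixel

per_stream = bytes_per_frame * fps                # ~79 MiB/s
per_card = per_stream * 4                         # 4 HDMI inputs per grabber
total = per_stream * 16

print(f"per stream: {per_stream / 2**20:.0f} MiB/s")
print(f"per card:   {per_card / 2**20:.0f} MiB/s")
print(f"all 16:     {total / 2**30:.2f} GiB/s")

That is the amount of data that has to flow from the cards into system memory and then up to the GPU every second when all 16 inputs are active.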
We have hundreds of images which our computer receives at a time, and we need to rotate and resize them as fast as possible.
Rotation is done by 90, 180 or 270 degrees.
Currently we are using the command-line tool GraphicsMagick to rotate the images. Rotating an image (5760x3840, ~22 MP) takes around 4 to 7 seconds.
The following Python code sadly gives us comparable results:
import cv
img = cv.LoadImage("image.jpg")
timg = cv.CreateImage((img.height,img.width), img.depth, img.channels) # transposed image
# rotate counter-clockwise
cv.Transpose(img,timg)
cv.Flip(timg,timg,flipMode=0)
cv.SaveImage("rotated_counter_clockwise.jpg", timg)
Is there a faster way to rotate the images using the power of the graphics card? OpenCL and OpenGL come to mind, but we are wondering whether a performance increase would be noticeable.
The hardware we are using is fairly limited as the device should be as small as possible.
Intel Atom D525 (1.8 GHz)
Mobility Radeon HD 5430 Series
4 GB of RAM
SSD Vertility 3
The software is Debian 6 with the official (closed-source) Radeon drivers.
You can perform a lossless rotation that just modifies the EXIF orientation tag. This will rotate your pictures much faster.
Also have a look at the jpegtran utility, which performs lossless JPEG transformations.
https://linux.die.net/man/1/jpegtran
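For example, something along these lines (a sketch; the file names are placeholders) performs a lossless 90° rotation from Python by calling out to jpegtran:

import subprocess

# Lossless 90-degree rotation; -copy all keeps the EXIF/metadata sections.
subprocess.run(
    ["jpegtran", "-rotate", "90", "-copy", "all",
     "-outfile", "rotated.jpg", "image.jpg"],
    check=True,
)

Note that the transform is only perfectly lossless when the image dimensions are multiples of the JPEG block size (usually 8 or 16 pixels), which 5760x3840 is.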
There is a JPEG no-recompression plugin for IrfanView which IIRC can rotate and resize images (in simple ways) without recompressing; it can also run on a whole directory of images. This should be a lot faster.
The GPU probably wouldn't help; you are almost certainly I/O-limited in OpenCV, which is not really optimised for high-speed file access.
I'm not an expert in JPEG and compression topics, but as your problem is pretty much as I/O-limited as it gets (assuming you can rotate without heavy decoding/encoding-related computation), you might not be able to accelerate it very much on the GPU you have. (Un)luckily, your reference point is a pretty slow Atom CPU.
I assume the Radeon has its own separate memory. This means data has to be pushed through PCI-E, which adds latency compared to CPU execution, and unless that transfer is hidden behind computation you can be sure it will be the bottleneck. This is the most probable reason why code that uses OpenCV on the GPU would be slow (besides the fact that you do two memory-bound operations, transpose & flip, instead of a single one).
The key thing is to hide as much of the PCI-E transfer time behind computation as possible by using multiple buffering. Overlapping transfers both to and from the GPU with computation, making use of the full-duplex capability of PCI-E, will only work if the card in question has dual DMA engines, like high-end Radeons or the NVIDIA Quadro/Tesla cards -- which I highly doubt yours does.
If your GPU compute time (the time it takes the GPU to do the rotation) is lower than the time the transfer takes, you won't be able to fully overlap. The HD 5430 has a pretty slow memory interface with only 12.8 GB/s peak, and the rotation kernel should be quite memory-bound. I can only guesstimate, but I would say that if you reach a peak PCI-E transfer rate of ~1.5 GB/s (x4 PCI-E, AFAIK), the compute kernel will be a few times faster than the transfer and you'll be able to overlap very little.
You can simply time the parts separately without writing elaborate asynchronous code, and from that estimate how fast you could get with optimal overlap.
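For example, a quick estimate of what just the PCI-E hop would cost for one of your 22 MP images at the rate assumed above:

# One-way transfer time for a single 22 MP image over the assumed PCI-E link.
image_bytes = 5760 * 3840 * 3      # ~66 MB as 8-bit RGB
pcie_rate = 1.5e9                  # ~1.5 GB/s, the x4 figure above

print(f"{image_bytes / pcie_rate * 1000:.0f} ms per direction")   # ~44 ms

Even with a transfer in each direction that is well under a tenth of a second, nowhere near the 4-7 seconds you currently spend per image, which again points at decoding/encoding and disk I/O as the real cost.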
One thing you might want to consider is getting hardware which doesn't have PCI-E as a bottleneck, e.g.:
an AMD APU-based system, where you can page-lock host memory and use it directly from the GPU;
integrated GPUs which share main memory with the host;
a fast low-power CPU like a mobile Intel Ivy Bridge (e.g. the i5-3427U), which consumes almost as little power as the Atom D525 but has AVX support and should be several times faster.
My goal is to see what happens when using more texture data than fits in physical GPU memory. My first attempt was to load up to 40 DDS textures, resulting in a memory footprint well above the available GPU memory. However, my scene would still render at 200+ fps on a 9500 GT.
My conclusion: the GPU/OpenGL is being smart and only keeps certain parts of the mipmap chains in memory. I thought that should not be possible with a standard configuration, but whatever.
Second attempt: disable mipmapping, so that the GPU always has to sample from the high-res textures. Once again, I loaded about 40 DDS textures into memory. I verified the texture memory usage with gDEBugger: 1.2 GB. Still, my scene was rendering at 200+ fps.
The only thing I noticed was that when I looked away with the camera and then re-centered it on the scene, a serious lag would occur, as if only then were textures being transferred from main memory to the GPU. (I have some basic frustum culling enabled.)
My question: what is going on? How does this 1 GB GPU manage to sample from 1.2 GB of texture data at 200+ fps?
OpenGL can page complete textures in and out of texture memory between draw calls (not just between frames). Only the textures needed for the current draw call actually have to be resident in graphics memory; the others can stay in system RAM. The driver likely only does this with a very small subset of your texture data. It's pretty much the same as any cache: how can you run algorithms on GBs of data when you only have MBs of cache on your CPU?
Also, the PCI-E bus has very high throughput, so you don't really notice the driver doing the paging.
If you want to verify this, glAreTexturesResident might or might not help, depending on how well the driver implements it.
Even if you were forcing texture thrashing in your test (discarding and re-uploading some textures from system memory to GPU memory every frame), which I'm not sure you are, modern GPUs and PCI-E have such huge bandwidth that some thrashing doesn't impact performance that much. One of the 9500 GT models is quoted to have a memory bandwidth of 25.6 GB/s, and x16 PCI-E slots (500 MB/s x 16 = 8 GB/s) are the norm.
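To put a rough number on it, assuming the ~8 GB/s x16 figure above:

# Upper bound on texture data that could be re-uploaded per frame at 200 fps.
pcie_rate = 8e9      # bytes per second over an x16 PCI-E slot
fps = 200

print(f"{pcie_rate / fps / 2**20:.0f} MiB per frame")   # ~38 MiB

In other words, the driver could stream tens of MiB of texture data every frame and still sustain 200 fps, as long as each frame only actually touches a fraction of the full 1.2 GB.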
As for the lag, I would assume the GPU and CPU throttle down their power usage when you aren't drawing visible textures, and when you suddenly load them up they need a brief instant to power back up. In real-life apps and games such sudden 0%-to-100% workload changes rarely happen, so a slight lag is totally understandable and expected, I guess.
I am in the process of writing a full-HD-capable 2D engine for a company of artists; it will hopefully be cross-platform and is written in OpenGL and C++.
The main problem I've been having is how to deal with all those HD sprites. The artists have drawn the graphics at 24 fps and export them as PNG sequences. I have converted them to DDS DXT5 (not ideal, because it needs the DirectX header to load), which reduces the file size a lot. Some scenes in the game can have 5 or 6 animated sprites at a time, and these can consist of 200+ frames each. Currently I am loading the sprites into an array of pointers, but this takes too long to load, even with compressed textures, and uses quite a bit of memory (approx. 500 MB for a full scene).
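For scale, just the raw disk read for a scene of that size is already significant (the disk speed here is a guess):

# Rough load-time floor for the current scene data.
scene_bytes = 500 * 2**20      # ~500 MB of DXT5 sprite data
disk_rate = 100 * 2**20        # ~100 MB/s sequential read from a hard disk

print(f"{scene_bytes / disk_rate:.0f} s just to read the scene off disk")   # ~5 s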
So my question is: do you have any ideas or tips on how to handle such a high volume of frames? There are a couple of ideas I've thought of:
Use the SWF format to store the frames from Flash
Implement a 2D skeletal animation system to replace the PNG sequences (though I have concerns about the joints being visible)
How do games like Castle Crashers load so quickly with great HD graphics?
Well the first thing to bear in mind is that not all platforms support DXT5 (mobiles specifically).
Beyond that, have you considered using something like zlib to compress the textures? The textures will likely have a fair degree of self-similarity, which means they will compress down a lot. In this day and age decompression is cheap thanks to fast processors, and the time saved getting the data off the disk can be far, far more valuable than the time lost to decompression.
I'd start there if I were you.
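A minimal sketch of the idea (the file name is a placeholder, and the DDS header handling is glossed over):

import zlib

# Offline: pack the DXT5 payload with zlib before shipping it.
with open("sprite_frame.dds", "rb") as f:
    raw = f.read()
packed = zlib.compress(raw, 6)
print(f"{len(raw)} -> {len(packed)} bytes")

# At load time: inflate, then hand the DXT5 data to glCompressedTexImage2D.
restored = zlib.decompress(packed)
assert restored == raw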
24 fps hand-drawn animations? Have you considered reducing the frame rate? Even cinema-quality cel animation is only rarely drawn at the full 24 fps. Even going down to 18 fps would get rid of 25% of your data.
In any case, you didn't specify where your load times were long. Is the load from hard disk to memory the problem, or is it the memory-to-texture upload that's the issue? Are you frequently swapping sets of texture data into the GPU, or do you just build a bunch of textures out of it at load time?
If it's a disk-load issue, then your only real choice is to compress the texture data on disk and decompress it into memory. S3TC-style compression is not that compressed; it's designed to be a usable compression format for texturing hardware. You can usually make it smaller by running a standard compression library over it, such as zlib, bzip2, or 7z. Of course, this means having to decompress it, but CPUs are getting faster relative to hard disks, so this is usually a win overall.
If the problem is in texture upload bandwidth, then there aren't very many solutions to that. Well, depending on your hardware of interest. If your hardware of interest supports OpenCL, then you can always transfer compressed data to the GPU, and then use an OpenCL program to decompress it on the fly directly into GPU memory. But requiring OpenCL support will impact the minimum level of hardware you can support.
Don't dismiss 2D skeletal animations so quickly. Games like Odin Sphere are able to achieve better animation of 2D skeletons by having several versions of each of the arm positions. The one that gets drawn is the one that matches up the closest to the part of the body it is attached to. They also use clever art to hide any defects, like flared clothing and so forth.