Rotating hundreds of JPEGs in seconds rather than hours - opengl

We have hundreds of images which our computer gets at a time and we need to rotate and resize them as fast as possible.
Rotation is done by 90, 180 or 270 degrees.
Currently we are using the command line tool GraphicsMagick to rotate the image. Rotating the images (5760*3840 ~ 22MP) takes around 4 to 7 seconds.
The following python code sadly gives us equal results
import cv
img = cv.LoadImage("image.jpg")
timg = cv.CreateImage((img.height,img.width), img.depth, img.channels) # transposed image
# rotate counter-clockwise
cv.Transpose(img,timg)
cv.Flip(timg,timg,flipMode=0)
cv.SaveImage("rotated_counter_clockwise.jpg", timg)
Is there a faster way to rotate the images using the power of the graphics card? OpenCL and OpenGL come to mind but we are wondering whether a performance increase would be noticable.
The hardware we are using is fairly limited as the device should be as small as possible.
Intel Atom D525 (1,8 Ghz)
Mobility Radeon HD 5430 Series
4 GB of RAM
SSD Vertility 3
The software is debian 6 with official (closed source) radeon drivers.

you can perform a lossless rotation that will just modify the EXIF section. This will rotate your pictures faster.
and have a look at jpegtran utility which performs lossless jpeg modifications.
https://linux.die.net/man/1/jpegtran

There is a jpeg no-recompression plugin for irfanview which IIRC can rotate and resize images (in simple ways) without recompressing, it can also run an a directory of images - this should be a lot faster
The GPU probably wouldn't help, you are almost certainly I/O limited in opencv, it's not really optomised for high speed file access

I'm not an expert in jpeg and compression topics, but as your problem is pretty much as I/O limited as it gets (assuming that you can rotate without heavy de/encoding-related computation), you you might not be able to accelerate it very much on the GPU you have. (Un)Luckily your reference is a pretty slow Atom CPU.
I assume that the Radeon has separate main memory. This means that data needs to be communicated through PCI-E which is the extra latency compared to CPU execution and without hiding you can be sure that it is the bottleneck. This is the most probable reason why your code that uses OpenCV on the GPU is slow (besides the fact that you do two memory-bound operations, transpose & flip, instead of a single one).
The key thing is to hide as much of the PCI-E transfer times with computation as possible by using multiple-buffering. Overlapping transfers both to and from the GPU with computation by making use of the full-duplex capability of PCI-E will only work if the card in question has dual-DMA engines like high-end Radeons or the NVIDIA Quadro/Tesla cards -- which I highly doubt.
If your GPU compute-time (the time it takes the GPU to do the rotation) is lower than the time the transfer takes, you won't be able to fully overlap. The HD 4530 has a pretty slow memory interface with only 12.8 Gb/s peak, and the rotation kernel should be quite memory bound. However, I can only guesstimate, but I would say that if you reach peak PCI-E transfer rate of ~1.5 Gb/s (4x PCI-E AFAIK), the compute kernel will be a few times faster than the transfer and you'll be able to overlap very little.
You can simply time the parts separately without requiring elaborate asynchronous code and you can estimate how fast can you get things with an optimum overlap.
One thing you might want to consider is getting hardware which doesn't exhibit PCI-E as a bottleneck, e.g:
AMD APU-based system. On these platforms you will be able to page-lock the memory and use it directly from the GPU;
integrated GPUs which share main memory with the host;
a fast low-power CPU like a mobile Intel Ivy Bridge e.g. i5-3427U which consumes almost as little as the Atom D525 but has AVX support and should be several times faster.

Related

Same Direct2D application performs better on a "slower" machine

I wrote a Direct2D application that displays a certain number of graphics.
When I run this application it takes about 4 seconds to display 700,000 graphic elements on my notebook:
Intel Core i7 CPU Q 720 1.6 GHz
NVIDIA Quadro FX 880M
According to the Direct2D MSDN page:
Direct2D is a user-mode library that is built using the Direct3D 10.1
API. This means that Direct2D applications benefit from
hardware-accelerated rendering on modern mainstream GPUs.
I was expecting that the same application (without any modification) should perform better on a different machine with better specs. So I tried it on a desktop computer:
Intel Xeon(R) CPU 2.27 GHz
NVIDIA GeForce GTX 960
But it took 5 seconds (1 second more) to display the same graphics (same number and type of elements).
I would like to know how can it be possible and what are the causes.
It's impossible to say for sure without measuring. However, my gut tells me that melak47 is correct. There is no lack of GPU acceleration, it's a lack of bandwidth. Integrated GPUs have access to the same memory as the CPU. They can skip the step of having to transfer bitmaps and drawing commands across the bus to dedicated graphics memory for the GPU.
With a primarily 2D workload, any GPU will be spending most of its time waiting on memory. In your case, the integrated GPU has an advantage. I suspect that extra second you feel, is your GeForce waiting on graphics coming across the motherboard bus.
But, you could profile and enlighten us.
Some good points in the comments and other replies.(can't add a comment yet)
Your results dont surprise me as there are some differencies between your 2 setups.
Let's have a look there: http://ark.intel.com/fr/compare/47640,43122
A shame we can't see the SSE version supported by your Xeon CPU. Those are often used for code optimization. Is the model I chose for the comparison even the good one?
No integrated GPU in that Core-I7, but 4 cores + hyperthreading = 8 threads against 2 cores with no hyperthreading for the Xeon.
Quadro stuff rocks when it comes to realtime rendering. As your scene seems to be quite simple, it could be well optimized for that, but just "maybe" - I'm guessing here... could someone with experience comment on that? :-)
So it's not so simple. What appears to be a better gfx card doesn't mean better performance for sure. If you have a bottleneck somewhere else you're screwed!
The difference is small, you must compare every single element of your 2 setups: CPU, RAM, HDD, GPU, Motherboard with type of PCI-e and chipset.
So again, a lot of guessing, some tests are needed :)
Have fun and good luck ;-)

Is there any OpenGL operations cost GPU resource significantly?

The situation I met is that I run several OpenGL programs concurrently on a single server with only one GPU cards. They works fine with the FPS of 60.
But the problem is when I restart one of then, the FPS of the others drop a lot, may to 3X or even 1X
If I restart 2 or more at the same time, it can be worse, FPS can be only single-digit
I'm wondering is there any OpenGL initialization(create context/setup texture) operation that may cost the GPU resource a lot?
The environment: Linux(Ubuntu 14.04) NVIDIA GTX 770 with X11 window system
There are indeed some operations that most OpenGL implementations carry out on the CPU. Most notably when sending images to the GPU any necessary format conversions (i.e. if the image data is not yet in a format that can be processed by the GPU) are done on the CPU. Also every OpenGL object that carries bulk data (textures, render buffers, array buffer objects) usually requires a shadow copy in system memory (the GPU memory mostly acts like a cache) and initializing that is a costly operation as well.
Sending bulk data to the GPU consumes bus bandwidth. While this does not really hit the CPU or GPU due to being carried out using DMA transfer, it consumes memory bandwidth on either side thus impacting memory intensive operations on either side.

Rendering 1.2 GB of textures smoothly, how does a 1 GB GPU do this?

My goal is to see what would happen when using more texture data than what would fit in physical GPU memory. My first attempt was to load up to 40 DDS textures, resulting in a memory footprint way higher than there was GPU memory. However, my scene would still render at 200+ fps on a 9500 GT.
My conclusion: the GPU/OpenGL is being smart and only keeps certain parts of the mipmaps in memory. I thought that should not be possible on a standard config, but whatever.
Second attempt: disable mip mapping, such that the GPU will always have to sample from the high res textures. Once again, I loaded about 40 DDS textures in memory. I verified the texture memory usage with gDEBugger: 1.2 GB. Still, my scene was rendering at 200+ fps.
The only thing I noticed was that when looking away with the camera and then centering it once again on the scene, a serious lag would occur. As if only then it would transfer textures from main memory to the GPU. (I have some basic frustum culling enabled)
My question: what is going on? How does this 1 GB GPU manage to sample from 1.2 GB of texture data at 200+ fps?
OpenGL can page complete textures in and out of texture memory in between draw-calls (not just in between frames). Only those needed for the current draw-call actually need to be resident in graphics memory, the others can just reside in system RAM. It likely only does this with a very small subset of your texture data. It's pretty much the same as any cache - how can you run algorithms on GBs of data when you only have MBs of cache on your CPU?
Also PCI-E busses have a very high throughput, so you don't really notice that the driver does the paging.
If you want to verify this, glAreTexturesResident might or might-not help, depending on how well the driver is implemented.
Even if you were forcing texture thrashing in your test (discarding and uploading of some textures from system memory to GPU memory every frame), which I'm not sure you are, modern GPUs and PCI-E have such a huge bandwidth that some thrashing does impact performance that much. One of the 9500GT models is quoted to have a bandwidth of 25.6 GB/s, and 16x PCI-E slots (500 MB/s x 16 = 8 GB/s) are the norm.
As for the lag, I would assume the GPU + CPU throttle down their power usage when you aren't drawing visible textures, and when you suddenly overload them they need a brief instant to power up. In real life apps and games this 0%-100% sudden workload changes never happen, so a slight lag is totally understandable and expected, I guess.

Improve performance of dense optical flow analysis (easily)?

I wrote a program that uses OpenCV's cvCalcOpticalFlowLK. It performs fine on a low-resolution webcam input, but I need to run it on a full HD stream with significant other computation following the optical flow analysis for each frame. Processing a 5 minute video scaled down to 1440x810 took 4 hours :( Most of the time is being spent in cvCalcOpticalFlowLK.
I've researched improving the speed by adding more raw CPU, but even if I get an 8-core beast, and the speedup is the theoretical ideal (say 8x, since I'm basically only using one of my 2.9GHz cores), I'd only be getting 4FPS. I'd like to reach 30FPS.
More research seems to point to implementing it on the GPU with CUDA, OpenCL, or GLSL(?). I've found some proof-of-concept implementations (eg. http://nghiaho.com/?page_id=189), and many papers saying basically "it's a great application for the GPU, we did it, it was awesome, and no we won't share our code". Needless to say, I haven't gotten any of them to run.
Does anyone know of a GPU-based implementation that would run on Mac with an NVIDIA card? Are there resources that might help me approach writing my own? Are there other dense OF algorithms that might perform better?
Thanks!
What about OpenVidia Bayesian Optical Flow? Also the paper Real-Time Dense and Accurate Parallel Optical Flow using CUDA says that their work is freely available in the CUDA zone. I couldn't find it there immediately, but maybe you will, or can write the authors?

Full HD 2D Texture Memory OpenGL

I am in the process of writing a full HD capable 2D engine for a company of artists which will hopefully be cross platform and is written in OpenGL and C++.
The main problem i've been having is how to deal with all those HD sprites. The artists have drawn the graphics at 24fps and they are exported as png sequences. I have converted them into DDS (not ideal, because it needs the directx header to load) DXT5 which reduces filesize alot. Some scenes in the game can have 5 or 6 animated sprites at a time, and these can consist of 200+ frames each. Currently I am loading sprites into an array of pointers, but this is taking too long to load, even with compressed textures, and uses quite a bit of memory (approx 500mb for a full scene).
So my question is do you have any ideas or tips on how to handle such high volumes of frames? There are a couple of ideas i've thought've of:
Use the swf format for storing the frames from Flash
Implement a 2D skeletal animation system, replacing the png sequences (I have concerns about the joints being visible tho)
How do games like Castle Crashers load so quickly with great HD graphics?
Well the first thing to bear in mind is that not all platforms support DXT5 (mobiles specifically).
Beyond that have you considered using something like zlib to compress the textures? The textures will likely have a fair degree of self similarity which will mean that they will compress down a lot. In this day and age decompression is cheap due to the speed of processors and the time saved getting the data off the disk can be far far more useful than the time lost to decompression.
I'd start there if i were you.
24 fps hand-drawn animations? Have you considered reducing the framerate? Even cinema-quality cel animation is only rarely drawn at the full 24-fps. Even going down to 18 fps will get rid of 25% of your data.
In any case, you didn't specify where your load times were long. Is the load from harddisk to memory the problem, or is it the memory to texture load that's the issue? Are you frequently swapping sets of texture data into the GPU, or do you just build a bunch of textures out of it at load time?
If it's a disk load issue, then your only real choice is to compress the texture data on the disk and decompress it into memory. S3TC-style compression is not that compressed; it's designed to be a useable compression technique for texturing hardware. You can usually make it smaller by using a standard compression library on it, such as zlib, bzip2, or 7z. Of course, this means having to decompress it, but CPUs are getting faster than harddisks, so this is usually a win overall.
If the problem is in texture upload bandwidth, then there aren't very many solutions to that. Well, depending on your hardware of interest. If your hardware of interest supports OpenCL, then you can always transfer compressed data to the GPU, and then use an OpenCL program to decompress it on the fly directly into GPU memory. But requiring OpenCL support will impact the minimum level of hardware you can support.
Don't dismiss 2D skeletal animations so quickly. Games like Odin Sphere are able to achieve better animation of 2D skeletons by having several versions of each of the arm positions. The one that gets drawn is the one that matches up the closest to the part of the body it is attached to. They also use clever art to hide any defects, like flared clothing and so forth.