Same Direct2D application performs better on a "slower" machine - c++

I wrote a Direct2D application that displays a certain number of graphics.
When I run this application it takes about 4 seconds to display 700,000 graphic elements on my notebook:
Intel Core i7 CPU Q 720 1.6 GHz
NVIDIA Quadro FX 880M
According to the Direct2D MSDN page:
Direct2D is a user-mode library that is built using the Direct3D 10.1
API. This means that Direct2D applications benefit from
hardware-accelerated rendering on modern mainstream GPUs.
I was expecting that the same application (without any modification) should perform better on a different machine with better specs. So I tried it on a desktop computer:
Intel Xeon(R) CPU 2.27 GHz
NVIDIA GeForce GTX 960
But it took 5 seconds (1 second more) to display the same graphics (same number and type of elements).
I would like to know how this is possible and what the causes might be.

It's impossible to say for sure without measuring. However, my gut tells me that melak47 is correct. It's not a lack of GPU acceleration; it's a lack of bandwidth. Integrated GPUs have access to the same memory as the CPU, so they can skip the step of transferring bitmaps and drawing commands across the bus to the GPU's dedicated graphics memory.
With a primarily 2D workload, any GPU will spend most of its time waiting on memory. In your case, the integrated GPU has an advantage. I suspect the extra second you see is your GeForce waiting on data coming across the motherboard bus.
But, you could profile and enlighten us.
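To put a rough number on it, a timing sketch like the following would do; the render target is assumed to already exist, and the rectangle loop is only a stand-in for the real 700,000 elements (RenderAndMeasure is a made-up helper, not your code):

#include <windows.h>
#include <d2d1.h>
#include <d2d1helper.h>
#include <cstdio>

void RenderAndMeasure(ID2D1HwndRenderTarget* target)
{
    ID2D1SolidColorBrush* brush = nullptr;
    target->CreateSolidColorBrush(D2D1::ColorF(D2D1::ColorF::Black), &brush);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    target->BeginDraw();
    // stand-in workload: lots of tiny rectangles instead of the real scene
    for (int i = 0; i < 700000; ++i)
    {
        float x = float(i % 1000);
        float y = float((i / 1000) % 700);
        target->DrawRectangle(D2D1::RectF(x, y, x + 2.0f, y + 2.0f), brush, 1.0f);
    }
    HRESULT hr = target->EndDraw();   // EndDraw is where the batched commands get flushed

    QueryPerformanceCounter(&t1);
    double ms = 1000.0 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
    printf("frame took %.1f ms (EndDraw hr = 0x%08lx)\n", ms, (unsigned long)hr);

    brush->Release();
}

Comparing that number on both machines, and then again with the work split into several smaller BeginDraw/EndDraw batches, should indicate whether the time goes into issuing the commands (CPU side) or flushing them (GPU/bus side).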

Some good points in the comments and other replies (I can't add a comment yet).
Your results don't surprise me, as there are some differences between your two setups.
Let's have a look there: http://ark.intel.com/fr/compare/47640,43122
A shame we can't see which SSE version your Xeon supports; those instructions are often used for code optimization. Is the model I chose for the comparison even the right one?
No integrated GPU in that Core i7, but 4 cores + Hyper-Threading = 8 threads, against 2 cores with no Hyper-Threading for the Xeon.
Quadro cards do well at real-time rendering. As your scene seems to be quite simple, it may be well optimized for that, but that's just a "maybe" - I'm guessing here... could someone with experience comment on that? :-)
So it's not that simple. What looks like a better graphics card doesn't guarantee better performance; if you have a bottleneck somewhere else, you're stuck.
The difference is small; you'd have to compare every single element of your two setups: CPU, RAM, HDD, GPU, and motherboard (PCIe type and chipset).
So again, a lot of guessing, some tests are needed :)
Have fun and good luck ;-)

Related

Using OpenGL on lower-power side of Hybrid Graphics chip

I have hit a brick wall and I wonder if someone here can help. My program opens an OpenGL surface for very minor rendering needs. It seems that on the MacBook Pro this causes the graphics driver to switch the hybrid setup from the low-performance Intel graphics to the high-performance AMD/ATI graphics.
This causes me problems, as there seems to be an issue with the AMD driver and putting the Mac to sleep, but it also drains the battery unnecessarily fast. I only need OpenGL to create a static 3D image on occasion; I do not require a fast frame rate!
Is there a way in a Cocoa app to prevent OpenGL switching a hybrid graphics card into performance mode?
The relevant documentation for this is QA1734, “Allowing OpenGL applications to utilize the integrated GPU”:
… On OS X 10.6 and earlier, you are not allowed to choose to run on the integrated GPU instead. …
On OS X 10.7 and later, there is a new attribute called NSSupportsAutomaticGraphicsSwitching. To allow your OpenGL application to utilize the integrated GPU, you must add in the Info.plist of your application this key with a Boolean value of true…
So you can only do this on Lion, and “only … on the dual-GPU MacBook Pros that were shipped Early 2011 and after.”
There are a couple of other important caveats:
Additionally, you must make sure that your application works correctly with multiple GPUs or else the system may continue forcing your application to use the discrete GPU. TN2229 Supporting Multiple GPUs on Mac OS X discusses in detail the required steps that you need to follow.
and:
Features that are available on the discrete GPU may not be available on the integrated GPU. You must check that features you desire to use exist on the GPU you are using. For a complete listing of supported features by GPU class, please see: OpenGL Capabilities Tables.
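For the pixel-format side of that, a rough sketch using the CGL C API (kCGLPFAAllowOfflineRenderers is CGL's counterpart of the offline-renderers attribute discussed in TN2229; if you build your context with NSOpenGLPixelFormat instead, use the corresponding NSOpenGLPFA attribute). The helper name is made up:

#include <OpenGL/OpenGL.h>

CGLContextObj CreateSwitchableContext()
{
    const CGLPixelFormatAttribute attribs[] = {
        kCGLPFAAccelerated,
        kCGLPFAAllowOfflineRenderers,   // don't pin the app to the discrete GPU
        kCGLPFADoubleBuffer,
        (CGLPixelFormatAttribute)0
    };

    CGLPixelFormatObj pix = nullptr;
    GLint npix = 0;
    CGLChoosePixelFormat(attribs, &pix, &npix);

    CGLContextObj ctx = nullptr;
    CGLCreateContext(pix, nullptr, &ctx);
    CGLDestroyPixelFormat(pix);
    return ctx;   // caller releases it with CGLDestroyContext()
}

This is meant to be used together with the NSSupportsAutomaticGraphicsSwitching Info.plist key quoted above, not instead of it.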

opengl fixed function uses gpu or cpu?

I have code which draws parallel coordinates using the OpenGL fixed-function pipeline.
The plot has 7 axes and draws 64k lines, so the output is cluttered, but when I run the code on my laptop (Intel i5, 8 GB DDR3 RAM) it runs fine. A friend ran the same code on two different systems, both with an Intel i7, 8 GB DDR3 RAM and an NVIDIA GPU. On those systems the code stutters and sometimes the mouse pointer becomes unresponsive. If you can give some idea why this is happening, it would be a great help. Initially I thought it would run even faster on those systems, as they have a dedicated GPU. My laptop runs Ubuntu 12.04 and both of the other systems run Ubuntu 10.x.
The fixed-function pipeline is implemented on top of the GPU's programmable features in modern OpenGL drivers, so most of the work is still done by the GPU. Fixed-function OpenGL shouldn't be any slower than doing the same things in GLSL; it's just much less flexible.
What do you mean by the coordinates having 7 axes? Do you have screenshots of your application?
Mouse stuttering sounds like you are seriously taxing your display driver, which suggests you are making too many OpenGL calls. Are you using immediate mode (glBegin/glVertex ...)? Some OpenGL drivers might not have the best implementation of immediate mode. You should use vertex buffer objects for your data.
Maybe I've misunderstood you, but here goes.
There are API calls such as glBegin and glEnd that issue commands to the GPU, so they use GPU horsepower, but there are also calls (filling arrays, other functions with no relation to the API) that run on the CPU.
It's good practice to preload your models outside the draw loop by storing the data in buffers (glGenBuffers etc.) and then using those buffers (VBO/IBO) inside the draw loop.
If managed correctly, this can decrease the load on your GPU/CPU. Hope this helps.
Oleg
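As both replies above point out, the usual cure is to upload the line data once into a vertex buffer object and replace the per-vertex immediate-mode calls with a single draw call. A minimal sketch, assuming GLEW has been initialised for the buffer-object entry points and that the vertices are tightly packed x,y pairs (both are assumptions, not details from the question):

#include <GL/glew.h>
#include <vector>

GLuint CreateLineVbo(const std::vector<float>& vertices)   // interleaved x,y pairs
{
    GLuint vbo = 0;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER,
                 vertices.size() * sizeof(float),
                 vertices.data(), GL_STATIC_DRAW);   // uploaded once, outside the draw loop
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    return vbo;
}

void DrawLines(GLuint vbo, GLsizei vertexCount)   // vertexCount = number of x,y pairs
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, nullptr);      // 2D positions, tightly packed
    glDrawArrays(GL_LINES, 0, vertexCount);        // one call instead of a glVertex per line endpoint
    glDisableClientState(GL_VERTEX_ARRAY);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}

This keeps the fixed-function pipeline but removes the per-vertex driver overhead of immediate mode, which is the usual cause of the kind of stuttering described above.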

Rotating hundreds of JPEGs in seconds rather than hours

We have hundreds of images which our computer gets at a time and we need to rotate and resize them as fast as possible.
Rotation is done by 90, 180 or 270 degrees.
Currently we are using the command-line tool GraphicsMagick to rotate the images. Rotating an image (5760*3840, ~22 MP) takes around 4 to 7 seconds.
The following Python code unfortunately gives comparable results:
import cv
img = cv.LoadImage("image.jpg")
timg = cv.CreateImage((img.height,img.width), img.depth, img.channels) # transposed image
# rotate counter-clockwise
cv.Transpose(img,timg)
cv.Flip(timg,timg,flipMode=0)
cv.SaveImage("rotated_counter_clockwise.jpg", timg)
Is there a faster way to rotate the images using the power of the graphics card? OpenCL and OpenGL come to mind, but we are wondering whether a performance increase would be noticeable.
The hardware we are using is fairly limited as the device should be as small as possible.
Intel Atom D525 (1.8 GHz)
Mobility Radeon HD 5430 Series
4 GB of RAM
SSD Vertility 3
The software is Debian 6 with the official (closed-source) Radeon drivers.
You can perform a lossless rotation that just modifies the EXIF orientation tag; that will "rotate" your pictures much faster.
Also have a look at the jpegtran utility, which performs lossless JPEG transformations, e.g. jpegtran -rotate 90 -copy all -outfile rotated.jpg image.jpg (flag names as in the standard libjpeg build):
https://linux.die.net/man/1/jpegtran
There is a JPEG no-recompression plugin for IrfanView which, IIRC, can rotate and resize images (in simple ways) without recompressing; it can also run on a whole directory of images. This should be a lot faster.
The GPU probably wouldn't help; you are almost certainly I/O limited in OpenCV, and it's not really optimised for high-speed file access.
I'm not an expert in JPEG and compression topics, but as your problem is pretty much as I/O limited as it gets (assuming you can rotate without heavy decoding/encoding-related computation), you might not be able to accelerate it very much on the GPU you have. (Un)luckily, your reference point is a pretty slow Atom CPU.
I assume that the Radeon has its own dedicated memory. This means that data needs to be sent across PCI-E, which adds latency compared to CPU execution, and unless that transfer is hidden you can be sure it will be the bottleneck. This is the most probable reason why code using OpenCV on the GPU would be slow (besides the fact that you do two memory-bound operations, transpose and flip, instead of a single one).
The key is to hide as much of the PCI-E transfer time behind computation as possible by using multiple buffering. Overlapping transfers both to and from the GPU with computation, making use of the full-duplex capability of PCI-E, will only work if the card in question has dual DMA engines, like high-end Radeons or the NVIDIA Quadro/Tesla cards -- which I highly doubt here.
If your GPU compute time (the time it takes the GPU to do the rotation) is lower than the time the transfer takes, you won't be able to fully overlap. The HD 5430 has a pretty slow memory interface with only 12.8 GB/s peak, and the rotation kernel should be quite memory bound. I can only guesstimate, but I would say that if you reach a peak PCI-E transfer rate of ~1.5 GB/s (4x PCI-E, AFAIK), the compute kernel will be a few times faster than the transfer and you'll be able to overlap very little.
You can simply time the parts separately, without elaborate asynchronous code, and estimate how fast you could get with an optimal overlap.
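If you do go the OpenCL route, profiling events give you those two timings without much ceremony. A rough sketch, assuming a command queue created with CL_QUEUE_PROFILING_ENABLE and an already-built rotation kernel and device buffer (all names here are hypothetical):

#include <CL/cl.h>
#include <cstdio>

static double EventMs(cl_event e)
{
    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(e, CL_PROFILING_COMMAND_START, sizeof(start), &start, nullptr);
    clGetEventProfilingInfo(e, CL_PROFILING_COMMAND_END,   sizeof(end),   &end,   nullptr);
    return (end - start) * 1e-6;   // nanoseconds -> milliseconds
}

void TimeTransferAndKernel(cl_command_queue queue, cl_mem deviceBuf,
                           const void* hostPixels, size_t bytes,
                           cl_kernel rotateKernel, size_t globalSize)
{
    cl_event copyEvt = nullptr, kernelEvt = nullptr;

    // 1) PCI-E upload of the image
    clEnqueueWriteBuffer(queue, deviceBuf, CL_FALSE, 0, bytes, hostPixels,
                         0, nullptr, &copyEvt);

    // 2) the rotation kernel itself, queued after the upload
    clEnqueueNDRangeKernel(queue, rotateKernel, 1, nullptr, &globalSize, nullptr,
                           1, &copyEvt, &kernelEvt);
    clFinish(queue);

    printf("transfer: %.2f ms, kernel: %.2f ms\n", EventMs(copyEvt), EventMs(kernelEvt));
    clReleaseEvent(copyEvt);
    clReleaseEvent(kernelEvt);
}

If the kernel time turns out to be only a small fraction of the transfer time, the bandwidth argument above is confirmed and the GPU route isn't worth pursuing on this hardware.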
One thing you might want to consider is getting hardware where PCI-E is not a bottleneck, e.g.:
an AMD APU-based system: on these platforms you can page-lock the memory and use it directly from the GPU;
an integrated GPU which shares main memory with the host;
a fast low-power CPU like a mobile Intel Ivy Bridge (e.g. the i5-3427U), which consumes almost as little power as the Atom D525 but has AVX support and should be several times faster.

How can I stress the GPU

I would like to add some diagnostic code to our application that stresses both the CPU and GPU, and then measures heat. A third party tool is not an option. From what I can tell, CUDA is not an option either, as it requires Nvidia's compiler - is that right? As far as I can tell, my best option is DirectX. Anything simple and non visual on the GPU would do.
Platform: Windows XP Embedded
DirectX 9.0C
Simply create a shader in HLSL which contains an endless loop.
Turn off all culling and instancing and upload tons of triangle data to the GPU for processing and drawing; this will stress the CPU somewhat (not that much these days), and the GPU should suffer under the overdraw burden.
You should be able to use the code from any intro tutorial for this (tutorials that use DrawPrimitiveUP will stress the CPU more, but don't require creating GPU buffers). You probably also want vsync disabled, so that the GPU works as fast as it can (i.e. it doesn't wait on other events).
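A rough Direct3D 9 sketch of that recipe, assuming an already-created IDirect3DDevice9 (the vertex layout, colour and batch counts are arbitrary placeholders):

#include <d3d9.h>
#include <vector>

struct StressVertex { float x, y, z, rhw; DWORD color; };
static const DWORD kStressFvf = D3DFVF_XYZRHW | D3DFVF_DIFFUSE;

// When creating the device, ask for a non-vsynced swap:
//   D3DPRESENT_PARAMETERS pp = {};
//   pp.PresentationInterval = D3DPRESENT_INTERVAL_IMMEDIATE;

void StressFrame(IDirect3DDevice9* device, int batches)
{
    device->SetRenderState(D3DRS_CULLMODE, D3DCULL_NONE);   // draw back faces too
    device->SetFVF(kStressFvf);

    // 1000 identical large triangles per batch to maximise overdraw
    std::vector<StressVertex> tris(3 * 1000);
    for (size_t i = 0; i < tris.size(); ++i)
        tris[i] = { float((i % 3) * 400), float(((i + 1) % 3) * 300), 0.5f, 1.0f, 0xFF00FF00 };

    device->BeginScene();
    for (int b = 0; b < batches; ++b)
        device->DrawPrimitiveUP(D3DPT_TRIANGLELIST,
                                UINT(tris.size() / 3),              // primitive count
                                tris.data(), sizeof(StressVertex)); // CPU-side vertices every call
    device->EndScene();
    device->Present(nullptr, nullptr, nullptr, nullptr);
}

Calling StressFrame in a tight loop keeps the CPU busy re-submitting vertex data and the GPU busy with overdraw; pair it with whatever temperature readout your platform exposes.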

Which 3D cards support full scene antialiasing?

Is there a list of 3D cards available that provide full scene antialiasing as well as which are able to do it in hardware (decent performance)?
Pretty much all cards since DX7-level technology (GeForce 2 / Radeon 7000) can do it. Most notable exceptions are Intel cards (Intel 945 aka GMA 950 and earlier can't do it; I think Intel 965 aka GMA X3100 can't do it either).
Older cards (GeForce 2 / 4 MX, Radeon 7000-9250) used supersampling (render everything into an internally larger buffer and downsample at the end). All later cards have multisampling, where this expensive process is only performed at polygon edges (simply speaking, shaders run once per pixel, while depth/coverage is stored per sample).
Off the top of my head, pretty much any card since a GeForce 2 or so can do it. There's always a performance hit; it varies with the card and the AA mode (of which there are about 100 different kinds), but generally it's a sizeable one.
Agree with Orion Edwards, pretty much everything new can. Performance also depends greatly on the resolution you run at.
Integrated GPUs are going to be really poor performers with games, FSAA or not. If you want even moderate performance, buy a separate video card.
For something that's not crazy expensive, go with either an NVIDIA GeForce 8000-series card or an ATI 3000-series card. Even as an NVIDIA 8800 GTS owner, I will tell you the ATIs have better support for older games.
Although I personally still like FSAA, it is becoming less important with higher resolution screens. Also, more and more games are using deferred rendering which makes FSAA impossible.
Yes, of course integrated cards are awful. :) But this wasn't a question about gaming, but rather about an application that we are writing that will use OpenGL/D3D for 3D rendering. The 3D scene is relatively small, but antialiasing makes a dramatic difference in terms of the quality of the rendering. We are curious if there is some way to easily determine which cards support these features fully and which do not.
With the exception of the 3100, so far all of the cards we've found that do antialiasing are plenty fast for our purposes (as is my GeForce 9500).
Having seen a pile of machines recently that don't do it, I don't think that's quite true. The GMA 950 integrated ones don't do it to start with, and I don't think that the 3100/X3100 do either (at least not in hardware... the 3100 was enormously slow in a demo). Also, I don't believe that the GeForce MX5200 supported it either.
Or perhaps I'm just misunderstanding what you mean when you refer to "AA mode". Are there a lot of cards which support modes that are virtually unnoticable? :)
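If you would rather detect support programmatically than maintain a card list, Direct3D 9 can be queried directly. A minimal sketch checking 4x multisampling against a windowed X8R8G8B8 back buffer (the helper name is mine; check the formats and sample counts you actually plan to use):

#include <d3d9.h>
#include <cstdio>

bool Supports4xMsaa()
{
    IDirect3D9* d3d = Direct3DCreate9(D3D_SDK_VERSION);
    if (!d3d) return false;

    DWORD qualityLevels = 0;
    HRESULT hr = d3d->CheckDeviceMultiSampleType(
        D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
        D3DFMT_X8R8G8B8, TRUE,                  // windowed mode
        D3DMULTISAMPLE_4_SAMPLES, &qualityLevels);

    d3d->Release();
    if (SUCCEEDED(hr))
        printf("4x MSAA supported, %lu quality level(s)\n", qualityLevels);
    return SUCCEEDED(hr);
}

On the OpenGL side the equivalent is to request a multisampled pixel format and then query glGetIntegerv(GL_SAMPLES) on the context you actually obtained.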