Profiling a graphics rendering without a profiler - opengl

Nowadays we have pretty advanced tools to iron out rendering, letting us see the different stages, the time taken by draw calls, etc. But without them the graphics pipeline is quite a black box when it comes to understanding what is happening inside.
Suppose for some reason you have no such tool, or a very limited one. How would you measure anyway what is taking time in your rendering?
I am aware of tricks like discarding draw calls to see the CPU time, setting a 1x1 viewport to see the cost of geometry, using a dumb fragment shader to highlight the fillrate... They are useful already but only give a rough idea of what is going on, and tell nothing about the level of parallelism.
Also, getting the time spent in each stage per draw call seems to be difficult, especially when taking into account the lack of precision due to measurement noise.
What tricks do you use when your backpack is almost empty and you still have to profile your rendering? What does your personal Swiss army knife consist of?

Frame rendering time
Absolute time spent in a small piece of code, a stage, etc. is not that relevant, as GPU driver optimization, batching, parallelism and version differences make it nearly impossible to get precise measurements without GPU counters (which you can get if you use vendor libraries).
What you can measure easily is the impact of each single code change. You'll only get a relative impact, but that's what you really need anyway, and you get it using nothing but frame rendering time.
Ideally you should be able to edit shader or pipeline code at runtime, and have a direct way to check the impact over a whole typical scene, for example by comparing graphs between several code paths. (Beware of static scenes, otherwise you'll end up with highly optimized static views but poor performance in dynamic scenes.)
Here's the Swiss army knife list:
scene states loader
scene recorder (camera paths/add-remove entities,texture, mesh, fake input, etc.) using scene states.
scene states saver
scene frame time logger (not just final average but each frame rendering time)
on-the-fly shader code reload
on-the-fly codepath switch
frame time log reader+graphs+statistic framework
Note that scene state load/save/record are handy for a lot of other things, from debugging to undo/redo to on-the-fly reload, not to mention savegames.
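As a rough illustration of the "frame time log reader + statistics" item, here is a minimal C++ sketch that summarizes a per-frame log and compares two code paths. All names (FrameStats, summarize, relative_change) are made up for this example, and it assumes you already collect one frame time in milliseconds per rendered frame by whatever means you have.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Summary of one frame-time log, all values in milliseconds.
struct FrameStats {
    double mean_ms;
    double median_ms;
    double p99_ms;    // 99th percentile: exposes spikes that an average hides
    double worst_ms;
};

// Nearest-rank percentile on an already sorted, non-empty log.
inline double percentile(const std::vector<double>& sorted_ms, double p) {
    assert(!sorted_ms.empty() && p > 0.0 && p <= 100.0);
    const double rank = std::ceil(p * static_cast<double>(sorted_ms.size()) / 100.0);
    return sorted_ms[static_cast<std::size_t>(rank) - 1];
}

// Takes the log by value because it needs a sorted copy.
inline FrameStats summarize(std::vector<double> frame_ms) {
    assert(!frame_ms.empty());
    std::sort(frame_ms.begin(), frame_ms.end());
    const double sum = std::accumulate(frame_ms.begin(), frame_ms.end(), 0.0);
    return FrameStats{sum / frame_ms.size(),
                      percentile(frame_ms, 50.0),
                      percentile(frame_ms, 99.0),
                      frame_ms.back()};
}

// Relative change of the mean frame time: -0.10 means "10% faster".
inline double relative_change(const FrameStats& before, const FrameStats& after) {
    return (after.mean_ms - before.mean_ms) / before.mean_ms;
}
```

Record the same camera path once per code path, call summarize() on each log and compare the results. Looking at the percentiles as well as the mean is a deliberate choice here: a change that improves the average can still add occasional long frames.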
Add a screenshot taker + image diff, and you can unit test graphic code too.
If you can, add that to your CI server so that a huge performance impact from a code change doesn't go unnoticed. (It also helps artists when they check in their assets without evaluating the rendering impact.)
A must-read on CI graphics testing is here: http://aras-p.info/blog/2011/06/17/testing-graphics-code-4-years-later/
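The screenshot + image diff idea can be sketched in a few lines of C++. This is only an illustration under assumptions: both screenshots are tightly packed 8-bit RGBA buffers of the same size, and the function names and the tolerance scheme are hypothetical, not taken from the linked article.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Counts pixels where at least one RGBA channel differs by more than
// `tolerance`. Both buffers are tightly packed 8-bit RGBA of equal size.
inline std::size_t count_different_pixels(const std::vector<std::uint8_t>& reference,
                                          const std::vector<std::uint8_t>& candidate,
                                          int tolerance) {
    assert(reference.size() == candidate.size());
    assert(reference.size() % 4 == 0);
    std::size_t different = 0;
    for (std::size_t i = 0; i < reference.size(); i += 4) {
        for (std::size_t c = 0; c < 4; ++c) {
            const int delta = std::abs(static_cast<int>(reference[i + c]) -
                                       static_cast<int>(candidate[i + c]));
            if (delta > tolerance) {
                ++different;
                break;  // one bad channel is enough to flag the pixel
            }
        }
    }
    return different;
}

// The test passes when no more than `max_different_pixels` pixels are flagged.
inline bool images_match(const std::vector<std::uint8_t>& reference,
                         const std::vector<std::uint8_t>& candidate,
                         int tolerance,
                         std::size_t max_different_pixels) {
    return count_different_pixels(reference, candidate, tolerance) <= max_different_pixels;
}
```

A small per-channel tolerance and a small allowed pixel count are there because different GPUs and drivers rarely produce bit-identical output. The right thresholds depend on your content.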

Note: I'm responding to the question "Profiling a graphics rendering with a profiler", since that is something I was looking for ;)
I'm working mostly on Mac, and I'm using multiple tools:
gDebugger version 5.8 is available on Windows and Mac (this tool has been bought by AMD, and the v6 version is Windows only). It gives you statistics about state changes, texture usage, draw calls, etc. It's also useful for debugging texture mapping and seeing how your scene is drawn, step by step.
PVRUniSCoEditor is a shader editor. It compiles on the fly and gives you precious details about estimated cycles and register usage.
Instruments (from Xcode utilities, OS X only) gets information from the OpenGL driver. It's great for finding bottlenecks, since you can track which part of the GPU is used at 100% (tiler, renderer, texture unit, etc.).
Adreno Profiler is a Windows tool to profile Adreno-based mobile devices. (Very good tool if you work on Android apps ;))
What's your trick for the "dumb fragment shader to highlight the fillrate"? (Drawing a plain color, or something more advanced?)

Related

Optimizing a Q3DBars graph

I am developing a Qt application that requires visualizations of very large data sets. I was hoping to use Qt's 3D graphing functionality (Q3DBars) to make the data more easily understood. However, I am having difficulty getting a reasonable framerate.
I have done the below in hopes that the framerate would improve. No effect.
bars.setReflection(false);
bars.setReflectivity(false);
bars.setOptimizationHints(...::QAbstract3DGraph::OptimizationHint::OptimizationStatic);
bars.setShadowQuality(QAbstract3DGraph::ShadowQuality::ShadowQualityNone);
bars.setSurfaceType(QSurface::OpenGLSurface);
I have a GTX 1080 and Windows reports 40% GPU usage as I rotate the graph. However, the CPU is loaded significantly during the same rotation.
What can be done to further offload work to the GPU and/or optimize the rendering?
After playing with the other 3D graph classes, I have discovered that Q3DSurface runs smoothly with the same data set. I am unsure what the specific difference is but I suspect the surface is rendered in a vastly different way. Initially, the image looked unclear.
However, after disabling the wireframe the image became very clear.

Real time ray tracer

I would like to make a basic real-time CPU ray tracer in C++ (mainly for learning purposes). This tutorial was great for making a basic ray tracer. But what would be the best solution to draw this on the screen in real time? I'm not asking how to optimize the ray-tracing part, just the painting part, so that it paints on the screen and not to a file.
I'm developing on/for Windows.
You could check out this Code Project article on the basic paint mechanism using Win32API
Update: OP wants fast drawing, which the Win32API does not provide. The OP needs this so that they can measure speedup of the ray-tracing algorithm during optimization process. Other possibilities for drawing are: DirectX, XNA, Allegro, OpenGL.
I work professionally on a realtime CPU raytracer, and from what I have seen in 2 years of work there, the GPU part that displays the image won't be the bottleneck. The bottleneck, if you reach it, will be the speed of your RAM. I don't think the drawing technology will make any significant difference.
As an example, we are using clustering (one CPU is not enough :p). We were able to render 100-200 fps at 1920x1080 when looking at the sky, but the bottleneck was not the display part, it was the network...
EDIT: We are using OpenGL for the display.
When you are doing a CPU raytracer you are not going to call something like printPixelToGPU() for each pixel; you write to your RAM and then send the image to the GPU once it is finished. Calling printPixelToGPU() per pixel would probably cause a big overhead and is (in my opinion) a really bad design choice.
It looks like premature optimization. But if you are still concerned about it, just benchmark how many RAM-to-GPU texture transfers you can do with OpenGL, DirectX, etc., and print the average framerate. You will probably see that the framerate is really high, so you will almost certainly never reach that "bottleneck" unless you are using SDRAM or a really poor GPU.
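To make the "write to RAM, then send once" pattern concrete, here is a small C++ sketch. The Framebuffer class, the UploadFn callback and render_frame are hypothetical names for this example. In an OpenGL program the callback would typically wrap a single texture update such as glTexSubImage2D, but that part is left out so the sketch stays independent of any graphics API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// CPU-side image: the ray tracer only ever writes into this RAM buffer.
class Framebuffer {
public:
    Framebuffer(int width, int height)
        : width_(width), height_(height),
          pixels_(static_cast<std::size_t>(width) * height, 0u) {
        assert(width > 0 && height > 0);
    }
    void set_pixel(int x, int y, std::uint32_t rgba) {
        pixels_[static_cast<std::size_t>(y) * width_ + x] = rgba;
    }
    const std::uint32_t* data() const { return pixels_.data(); }
    int width() const { return width_; }
    int height() const { return height_; }

private:
    int width_;
    int height_;
    std::vector<std::uint32_t> pixels_;
};

// Hypothetical hook for the display API: receives the finished image.
using UploadFn = std::function<void(const std::uint32_t* pixels, int width, int height)>;

// Shades every pixel into RAM, then performs exactly one upload per frame.
template <typename Shade>
void render_frame(Framebuffer& fb, Shade shade, const UploadFn& upload) {
    for (int y = 0; y < fb.height(); ++y)
        for (int x = 0; x < fb.width(); ++x)
            fb.set_pixel(x, y, shade(x, y));
    upload(fb.data(), fb.width(), fb.height());
}
```

The same callback is also a convenient place to time the transfer on its own, which is the benchmark suggested above.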

Anti-aliasing in OpenGL

I just started with OpenGL programming and I am building a clock application. I want it to look something simple like this: http://i.stack.imgur.com/E73ap.jpg
However, my application looks very "un-anti-aliased" : http://i.stack.imgur.com/LUx2v.png
I tried the GL_POLYGON_SMOOTH method mentioned in the Red Book. However, that doesn't seem to do a thing.
I am working on a laptop with Intel integrated graphics. The card doesn't support things like GL_ARB_multisample.
What are my options at this point to make my app look anti-aliased?
Intel integrated videocards are notorious for their lack of support for OpenGL antialiasing. You can work around that, however.
First option: Manual supersampling
Make a texture twice as big as the screen in each dimension. Render your scene to the texture via an FBO, then render the texture at half size so it fills the screen, with bilinear interpolation. This can be very slow (in complex scenes) due to the 4x increase in pixels to draw.
It will result in weak antialiasing (so I don't recommend it for desktop software like your clock).
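To see what the half-size draw computes, here is a CPU-side C++ sketch of the equivalent operation. It assumes the texture is exactly twice the output size and pixel-aligned, in which case each bilinear sample lands in the middle of a 2x2 block and averages its four texels. The function name is made up, and a single grayscale channel is used to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Averages each 2x2 block of a single-channel image into one output pixel.
// src is row-major, src_w * src_h bytes, with even width and height.
inline std::vector<std::uint8_t> downsample_2x(const std::vector<std::uint8_t>& src,
                                               int src_w, int src_h) {
    assert(src_w > 0 && src_h > 0 && src_w % 2 == 0 && src_h % 2 == 0);
    assert(src.size() == static_cast<std::size_t>(src_w) * src_h);
    const int dst_w = src_w / 2;
    const int dst_h = src_h / 2;
    std::vector<std::uint8_t> dst(static_cast<std::size_t>(dst_w) * dst_h);
    for (int y = 0; y < dst_h; ++y) {
        for (int x = 0; x < dst_w; ++x) {
            const std::size_t top = static_cast<std::size_t>(2 * y) * src_w + 2 * x;
            const std::size_t bottom = top + src_w;
            const int sum = src[top] + src[top + 1] + src[bottom] + src[bottom + 1];
            // +2 rounds to nearest instead of truncating.
            dst[static_cast<std::size_t>(y) * dst_w + x] =
                static_cast<std::uint8_t>((sum + 2) / 4);
        }
    }
    return dst;
}
```

With only four samples per output pixel, a hard black/white edge can take at most five distinct gray levels, which is one way to see why the result is described as weak.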
Second option: (advanced)
Use a shader to perform Morphological Antialiasing. This is a new technique and I don't know how easy it is to implement. It's used by some advanced games.
Third option:
Use textures and bilinear interpolation to your advantage by emulating OpenGL's primitives via textures. The technique is described here.
Fourth option:
Use a separate texture for every element of your clock.
For example, for your hour-arrow, don't use a flat black GL_POLYGON shaped like your arrow. Instead, use a rotated quad (GL_QUADS), textured with an hour-arrow image drawn in an image program. Then bilinear interpolation will take care of antialiasing it as you rotate it.
This option would take the least effort and looks very good.
Fifth option:
Use a library that supports software rendering -
Qt
Cairo
Windows GDI+
WPF
XRender
etc
Such libraries contain their own algorithms for antialiased rendering, so they don't depend on your videocard for antialiasing. The advantages are:
Will render the same on every platform. (this is not guaranteed with OpenGL in various cases - for example, the thick diagonal "tick" lines in your screenshot are rendered as parallelograms, rather than rectangles)
Has a big bunch of convenient drawing functions ("drawArc", "drawText", "drawConcavePolygon"), and those will support gradients and borders. You also get things like an Image class.
Some, like Qt, will provide much more desktop-app type functionality. This can be very useful even for a clock app. For example:
in an OpenGL app you'd probably loop every 20msec and re-render the clock, and not even think twice. This would hog unnecessary CPU cycles, and wake up the CPU on a laptop, depleting the battery. By contrast, Qt is very intelligent about when it must redraw parts of your clock (e.g., when the right half of the clock stops being covered by a window, or when your clock moves the minute-arrow one step).
once you get to implementing, e.g. a tray icon, or a settings dialog, for your clock, a library like Qt can make it a snap. It's nice to use the same library for everything.
The disadvantage is much worse performance, but that doesn't matter at all for a clock app, and it turns around when you take into account the intelligent-redrawing functionality I mentioned.
For something like a clock app, the fifth option is very much recommended. OpenGL is mainly useful for games, 3D software and intense graphical stuff like music visualizers. For desktop apps, it's too low-level and the implementations differ too much.
Draw it into a framebuffer object at twice (or more) the final resolution and then use that image as a texture for a single quad drawn in the actual window.

C++ & DirectX - setting shader

Does someone know a fast way to invoke shader processing via DirectX?
Right now I'm setting shaders using D3DXCreateEffectFromFile calls, which create shaders at runtime (once per shader) from *.fx files.
Rendering part for every object (every patch in my case - see further) then means something like:
// --------------------
// Preprocessing
effect->Begin();
effect->BeginPass(0);
effect->SetMatrix (or Vector or whatever - internal shader parameters) (...)
effect->CommitChanges();
// --------------------
// Geometry rendering
// Pass the geometry to render
// ...
// --------------------
// Postprocessing
// End 'effect' passes
effect->EndPass();
effect->End();
This is okay, but the profiler shows weird things - preprocessing (see code) takes about 60% of time (I'm rendering terrain object of 256 patches where every patch contains about 10k vertices).
Actual geometry rendering takes ~35% and postprocessing - 5% of total rendering time.
This seems pretty strange to me and I guess that D3DXEffect interface may not be the best solution for this sort of things.
I've got 2 questions:
1. Do I need to implement my own shader controller / wrapper (probably, low-level) and where should I start from?
2. Would compiling shaders help to somehow improve parameter setting performance?
Maybe somebody knows how to solve this kind of problem, knows of an existing shader interface, or could give some advice about how this kind of problem is solved in modern game engines.
Thank you.
Actual geometry rendering takes ~35% and postprocessing - 5% of total rendering time
If you want to profile shader performance you need to use NVPerfHud or something similar. Using CPU profiler and measuring ticks is not going to help you - rendering is often asynchronous.
Do I need to implement my own shader controller / wrapper (probably, low-level)
Using your own shader wrapper isn't a bad idea - I never liked ID3DXEffect anyway.
With your own wrapper you'll have a total control of resources and program behavior.
Whether you need it or not is for you to decide. With ID3DXEffect you have no guarantee that the implementation is as fast as it could be: it could be wasting CPU cycles doing something you don't really need. The D3DX library contains a few classes that are useful but aren't guaranteed to be efficient (ID3DXEffect, ID3DXMesh, all animation-related and skin-related functions, etc.).
and where should I start from?
D3DXAssembleShader, IDirect3DDevice9::CreateVertexShader, IDirect3DDevice9::CreatePixelShader on DirectX 9, D3D10CompileShader on DirectX 10. Also download DirectX SDK and read shader documentation/tutorials.
Would compiling shaders help to somehow improve parameter setting performance?
Shaders are automatically compiled when you load them. You could try compiling with different optimization settings, but don't expect miracles.
Are you using a DirectX profiler or just timing your client code? Profiling DirectX API calls using timers in the client code is generally not that effective because it's not necessarily synchronously processing your state updates/draw calls as you make them. There's a lot of optimization that goes on behind the scenes. Here is an article about this for DX9 but I'm sure this hasn't changed for later versions:
http://msdn.microsoft.com/en-us/library/bb172234(VS.85).aspx
I've used effects before in DirectX and the system generally works fine. It provides some nice features that might be a pain to implement yourself at a lower-level, so I would stick with it for the moment.
As bshields suggested, your timing information might be inaccurate. It sounds likely that the drawing actually is taking the most time by comparison.
The shader is compiled when it's loaded. Precompiling will save you a half-second of startup time, but so long as the shader doesn't change during runtime, you won't see any actual speed increase. Precompiling is also kind of a pain, if you're still testing a shader. You can do it with the final copy, but unless you have a lot of shaders, you won't get much benefit while loading them.
If you're creating the shaders every frame or every time your geometry is rendered, that's probably the issue. Unless the shader itself (not parameters) changes every frame, you should create the effect once and reuse that.
I don't remember where SetParameter calls go, but you may want to check the docs to make sure your SetMatrix is in the right spot. Setting parameters after the pass has started won't help anything, certainly not speed. Make sure that's set up correctly. Also, set parameters as rarely as possible, there is some slight overhead involved. Per-frame sets will give you a notable slow-down, if you have too many.
All in all, the effects system does work fine in most cases and you shouldn't be seeing what you are. Make sure your profiling is correct, your shader is valid and optimized, and your calls are in the right places.
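One way to "set parameters as rarely as possible" is to remember the last value sent and skip redundant sets. The C++ sketch below is only an illustration: CachedMatrixParam and the apply callback are hypothetical, and in a real D3DX program the callback would forward to something like the effect's SetMatrix call. Whether this helps in practice depends on how much redundant setting your code actually does.

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <utility>

using Matrix4 = std::array<float, 16>;

// Wraps one shader parameter and forwards a value only when it changed.
class CachedMatrixParam {
public:
    explicit CachedMatrixParam(std::function<void(const Matrix4&)> apply)
        : apply_(std::move(apply)) {
        assert(apply_);  // a setter callback is required
    }

    // Returns true if the value was actually forwarded to the setter.
    bool set(const Matrix4& value) {
        if (has_value_ && last_ == value) {
            return false;  // same as last time: skip the call
        }
        apply_(value);
        last_ = value;
        has_value_ = true;
        return true;
    }

private:
    std::function<void(const Matrix4&)> apply_;  // hypothetical hook into the effect
    Matrix4 last_{};
    bool has_value_ = false;
};
```

For 256 terrain patches that share the same view and projection matrices, a wrapper like this would forward those two once per frame instead of once per patch, leaving only the truly per-patch parameters.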

What can cause a reduction in frame rate when upgrading a graphics card?

We have a two-screen DirectX application that previously ran at a consistent 60 FPS (the monitors' sync rate) using a NVIDIA 8400GS (256MB). However, when we swapped out the card for one with 512 MB of RAM the frame rate struggles to get above 40 FPS. (It only gets this high because we're using triple-buffering.) The two cards are from the same manufacturer (PNY). All other things are equal, this is a Windows XP Embedded application and we started from a fresh image for each card. The driver version number is 169.21.
The application is all 2D, i.e. just a bunch of textured quads and a whole lot of pre-rendered graphics (hence the need to upgrade the card's memory). We also have compressed animations which the CPU decodes on the fly - this involves a texture lock. The locks take forever but I've also tried having a separate system memory texture for the CPU to update and then updating the rendered texture using the device's UpdateTexture method. No overall difference in performance.
Although I've read through every FAQ I can find on the internet about DirectX performance, this is still the first time I've worked on a DirectX project so any arcane bits of knowledge you have would be useful. :)
One other thing whilst I'm on the subject; when calling Present on the swap chains it seems DirectX waits for the present to complete regardless of the fact that I'm using D3DPRESENT_DONOTWAIT in both present parameters (PresentationInterval) and the flags of the call itself. Because this is a two-screen application this is a problem as the two monitors do not appear to be genlocked, I'm working around it by running the Present calls through a threadpool. What could the underlying cause of this be?
Are the cards exactly the same (both GeForce 8400GS), with only the memory size differing? Quite often, different memory sizes come with slightly different clock rates (i.e. your card with more memory might use slower memory!).
So the first thing to check would be GPU core & memory clock rates, using something like GPU-Z.
An easy test to see if the surface lock is the problem: just comment out the texture update and see if the framerate returns to 60 Hz. Unfortunately, writing to a locked surface and updating the resource kills performance, and always has. Are you using mipmaps with the textures? I know DX9 added automatic generation of mipmaps, which could be taking up a lot of time. If you're constantly locking the same resource each frame, you could also try creating a pool of textures, kind of like triple-buffering except with textures. You would let the renderer use one texture, and on the next update you pick the next available texture in the pool that's not being used for rendering. Unless, of course, you're memory constrained or you're only making diffs to the animated texture.
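The texture pool idea can be sketched as plain index bookkeeping in C++. This is a simplification under assumptions: it uses a fixed round-robin order rather than asking the API which texture is actually free, and the TextureRing name and its slots are hypothetical stand-ins for real texture objects.

```cpp
#include <cassert>
#include <cstddef>

// Round-robin bookkeeping for a small pool of textures.
// The CPU fills write_slot() while the renderer samples read_slot().
class TextureRing {
public:
    explicit TextureRing(std::size_t count) : count_(count) {
        assert(count >= 2);  // with one texture, reads and writes would collide
    }

    // Slot the CPU decodes into this frame.
    std::size_t write_slot() const { return write_; }

    // Most recently completed slot, which the renderer draws from.
    std::size_t read_slot() const { return (write_ + count_ - 1) % count_; }

    // Call once per update, after the CPU has finished writing.
    void advance() { write_ = (write_ + 1) % count_; }

private:
    std::size_t count_;
    std::size_t write_ = 0;
};
```

Each frame you would lock and fill the texture at write_slot(), draw with the one at read_slot(), then call advance(). With three slots, the texture being locked is one the renderer last used two frames ago, which makes it less likely that the lock has to wait for the GPU.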