C++ & DirectX - setting shader - c++

Does someone know a fast way to invoke shader processing via DirectX?
Right now I'm setting shaders using D3DXCreateEffectFromFile calls, which create the shaders at runtime (once per shader) from *.fx files.
Rendering every object (every patch in my case, see below) then looks something like this:
// Preprocessing
effect->SetMatrix (or Vector or whatever - internal shader parameters) (...)
// Geometry rendering
// Pass the geometry to render
// ...
// Postprocessing
// End 'effect' passes
This is okay, but the profiler shows weird things: preprocessing (see code) takes about 60% of the time (I'm rendering a terrain object of 256 patches, where every patch contains about 10k vertices).
Actual geometry rendering takes ~35% and postprocessing 5% of the total rendering time.
This seems pretty strange to me, and I guess the ID3DXEffect interface may not be the best solution for this sort of thing.
I've got 2 questions:
1. Do I need to implement my own shader controller / wrapper (probably, low-level) and where should I start from?
2. Would compiling shaders help to somehow improve parameter setting performance?
Maybe somebody knows how to solve this kind of problem, knows of an existing shader interface, or could give some advice about how this kind of problem is solved in modern game engines.
Thank you.

"Actual geometry rendering takes ~35% and postprocessing - 5% of total rendering time"
If you want to profile shader performance, you need to use NVPerfHud or something similar. Using a CPU profiler and measuring ticks is not going to help you, because rendering is often asynchronous.
"Do I need to implement my own shader controller / wrapper (probably, low-level)"
Using your own shader wrapper isn't a bad idea - I never liked ID3DXEffect anyway.
With your own wrapper you'll have total control of resources and program behavior.
Whether you need it or not is for you to decide. With ID3DXEffect you have no guarantee that the implementation is as fast as it could be; it could be wasting CPU cycles doing something you don't really need. The D3DX library contains a few classes that are useful but aren't guaranteed to be efficient (ID3DXEffect, ID3DXMesh, all animation-related and skin-related functions, etc.).
"and where should I start from?"
D3DXAssembleShader, IDirect3DDevice9::CreateVertexShader, IDirect3DDevice9::CreatePixelShader on DirectX 9, D3D10CompileShader on DirectX 10. Also download DirectX SDK and read shader documentation/tutorials.
"Would compiling shaders help to somehow improve parameter setting performance?"
Shaders are automatically compiled when you load them. You could try compiling with different optimization settings, but don't expect miracles.

Are you using a DirectX profiler or just timing your client code? Profiling DirectX API calls using timers in the client code is generally not that effective, because the runtime is not necessarily processing your state updates and draw calls synchronously as you make them. There's a lot of optimization that goes on behind the scenes. Here is an article about this for DX9 but I'm sure this hasn't changed for later versions:

I've used effects before in DirectX and the system generally works fine. It provides some nice features that might be a pain to implement yourself at a lower-level, so I would stick with it for the moment.
As bshields suggested, your timing information might be inaccurate. It seems likely that the drawing is actually taking the most time by comparison.
The shader is compiled when it's loaded. Precompiling will save you a half-second of startup time, but so long as the shader doesn't change during runtime, you won't see any actual speed increase. Precompiling is also kind of a pain, if you're still testing a shader. You can do it with the final copy, but unless you have a lot of shaders, you won't get much benefit while loading them.
If you're creating the shaders every frame or every time your geometry is rendered, that's probably the issue. Unless the shader itself (not parameters) changes every frame, you should create the effect once and reuse that.
I don't remember where SetParameter calls go, but you may want to check the docs to make sure your SetMatrix is in the right spot. Setting parameters after the pass has started won't help anything, certainly not speed. Make sure that's set up correctly. Also, set parameters as rarely as possible, since there is some slight overhead involved. Too many per-frame sets will give you a noticeable slow-down.
All in all, the effects system does work fine in most cases and you shouldn't be seeing what you are. Make sure your profiling is correct, your shader is valid and optimized, and your calls are in the right places.
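The advice above to set parameters as rarely as possible can be sketched as a small cache sitting in front of the effect. This is only an illustration under assumptions: `ParamCache`, `Matrix4` and the upload callback are hypothetical names, and in a real D3D9 program the callback would forward to something like ID3DXEffect::SetMatrix. A plain callback is used here so the sketch stays self-contained.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical 4x4 matrix type; stands in for D3DXMATRIX in this sketch.
using Matrix4 = std::array<float, 16>;

// Hypothetical wrapper: forwards a matrix to the backend only when the value
// differs from the one last sent under the same parameter name.
class ParamCache {
public:
    using Upload = std::function<void(const std::string&, const Matrix4&)>;

    explicit ParamCache(Upload upload) : upload_(std::move(upload)) {
        assert(upload_ && "an upload callback is required");
    }

    // Returns true if the value was actually forwarded to the backend.
    bool setMatrix(const std::string& name, const Matrix4& value) {
        auto it = last_.find(name);
        if (it != last_.end() && it->second == value) {
            return false;  // redundant set, skipped
        }
        last_[name] = value;
        upload_(name, value);
        ++uploads_;
        return true;
    }

    // Number of sets that reached the backend so far.
    std::size_t uploadCount() const { return uploads_; }

private:
    Upload upload_;
    std::unordered_map<std::string, Matrix4> last_;
    std::size_t uploads_ = 0;
};
```

In the terrain case from the question, the view and projection matrices are presumably identical for all 256 patches, so with a cache like this only the values that really change per patch would be re-sent. Whether that accounts for the measured 60% is not something this sketch can tell you; a GPU-aware profiler is still needed for that.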


C++ OpenGL wrapper: interface similar to fixed pipeline, can export .collada

I have opengl code that uses the fixed pipeline.
To kill two birds with one stone, I need a wrapper that can help me with the following tasks:
1. Convert the code to the new shader-based pipeline with minimal effort.
I have a class that calls OpenGL functions such as glBegin(triangles/lines), glVertex, glPushMatrix, glTranslate, glColor, gluSphere.
Ideally, I'd like it to derive from a base class that supplies these functions. Behind the scenes, it would use the same high-level logic as the fixed pipeline.
2. Export an OpenGL scene to .collada to load in an external renderer.
OpenGL is a low-level rendering API and has no concept of a scene. For example, from this reddit post:
"You realize that you have to write a shim to capture all API calls
you are interested in to do that. Then, when finally, a draw call is
emitted you have to parse every single vertex and collect the data
from all over the memory from the buffers that you have recorded from
the API calls that set up VAOs, VBOs and IBOs. Then you have to parse
the shader source code so that you can see which uniforms and vertex
attributes contribute to vertex clip coordinate generation. Then you
also have to synthesize/guess which outputs are normal, color, texture
coordinate and so on from the shader source if the resulting program
even have those in .obj file format-wise.
This gets even more complicated if Compute is used to generate data
inside the GPU for any of the buffers. If geometry or tessellator is
used then you also have to implement one of those so that you get
accurate outputs from the vertex processing. TL;DR - you have to write
your own OpenGL 4.5 driver that does exactly the same things a real
hardware driver would do. Good luck with that."
However, my scene is simple, using the fixed pipeline operations above.
I'd like the wrapper to keep track and construct a scene that can be exported.
EDIT: Since recommendation is off-topic, I'll ask the following question.
What I need seems like something obvious that many people would have found useful. Since I can't find a library that does it, I'm wondering whether my approach is unreasonable.
More specifically, how do people port their legacy OpenGL code: do they rewrite the relevant parts from scratch, or does everyone implement their own wrapper as I suggested?
What about constructing a scene to export to collada?
Although some parts of legacy OpenGL are not optimized in current drivers (like glDrawPixels, the raster drawing operations and indexed color mode), between modern hardware and the modest requirements of legacy applications, legacy OpenGL code runs well enough on modern systems.
The main reason to "modernize" legacy OpenGL code is wanting to make use of the modern features. Any sort of "wrapper" will just run into the same kind of design problems that the OpenGL API ran into between OpenGL-1.5 and OpenGL-2.1: lots of built-in variables, default state, implicit action, and so on. This is difficult to document properly, and even more difficult to use reliably, which is why you usually don't find these kinds of wrappers.
If you find yourself in a situation where you absolutely must port your legacy code to modern OpenGL, e.g. to be interoperable with core contexts, then your best course of action is a proper rewrite. Replace immediate mode calls with filling vertex buffers, and replace calls to glTexEnv…, glMaterial…, glLight… with loading appropriate shaders and setting their uniforms.
Or, if you want a quick and dirty method: Just create two contexts, a modern one, and a legacy one and switch between them; often you can establish "list" sharing between them.
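To give a feel for what the wrapper from the question involves (and for the implicit state the answer warns about), here is a minimal sketch of a recorder that mimics glBegin/glVertex/glTranslate/glPushMatrix and collects the transformed vertices. Everything in it is hypothetical: the class and method names are made up, the "matrix" stack only supports translation, and there is no color state and no COLLADA writer. The recorded batches are what you would later copy into a vertex buffer or hand to an exporter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

enum class Primitive { Triangles, Lines };

// One recorded begin/end pair, with vertices already moved into world space.
struct Batch {
    Primitive type;
    std::vector<Vec3> vertices;
};

// Hypothetical recorder with a translation-only transform stack.
class ImmediateRecorder {
public:
    ImmediateRecorder() : stack_{Vec3{0.0f, 0.0f, 0.0f}} {}

    void pushMatrix() {
        const Vec3 top = stack_.back();  // copy before the vector may grow
        stack_.push_back(top);
    }
    void popMatrix() {
        assert(stack_.size() > 1 && "popMatrix without matching pushMatrix");
        stack_.pop_back();
    }
    void translate(float x, float y, float z) {
        Vec3& t = stack_.back();
        t.x += x; t.y += y; t.z += z;
    }

    void begin(Primitive type) {
        assert(!open_ && "begin called twice without end");
        batches_.push_back(Batch{type, {}});
        open_ = true;
    }
    void vertex(float x, float y, float z) {
        assert(open_ && "vertex outside begin/end");
        const Vec3& t = stack_.back();
        batches_.back().vertices.push_back(Vec3{x + t.x, y + t.y, z + t.z});
    }
    void end() {
        assert(open_ && "end without begin");
        open_ = false;
    }

    const std::vector<Batch>& batches() const { return batches_; }

private:
    std::vector<Vec3> stack_;      // current translation per stack level
    std::vector<Batch> batches_;   // the recorded "scene"
    bool open_ = false;            // true between begin() and end()
};
```

Even this toy version already carries hidden state (the current transform, the open batch), which is the design problem described above. A real version would also need rotation and scale, colors, normals, and primitives such as gluSphere.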

mipmap generation in opengl - is it hardware accelerated?

The purpose here isn't rendering, but gpgpu; it's for image blurring:
given an image, I need to blur it with a fixed given separable kernel (see e.g. Separable 2D Blur Kernel).
For GPU processing, a good popular method would be to first filter the lines, then filter the columns; and using the vertex shader and the fragment shader to do so (*)
However, if I have a fixed-size kernel, I think I can use a quickly computed mipmap that is close to the level I want, and then upsample it (as was suggested here).
The question is therefore: will an OpenGL-created mipmap be faster than a mipmap I create myself using the method of (*)?
Put another way: is mipmap creation optimized on the GPU itself? Will it always outperform (speed-wise) user-written GLSL code, or does it depend on the graphics card?
Thanks for the replies (Kahler, Jean-Simon Brochu). However, I still haven't seen any resource that explicitly says whether mipmap generation by the GPU is faster than any user-created mipmaps because of dedicated mipmap-generation hardware.
OpenGL does not care how the functions are implemented.
OpenGL is a set of specifications, and glGenerateMipmap is one of the functions it specifies.
Anyone can write a software renderer or develop a video card compliant with the specification. If it passes the tests, it's ~OpenGL certified~
That means no function is required to be performed on the CPU, on the GPU, or anywhere in particular; it just has to produce the results OpenGL expects.
Now for the practical side:
Nowadays, you can just assume the mipmap generation is done by the video card, because the major vendors have adopted this approach.
If you really want to know, you will have to check for the specific video card you are programming for.
As for performance, assume you can't beat the video card.
Even if you come up with some highly optimized code running on some high-tech-full-of-things CPU, you will have to upload the mipmaps you generated to the GPU, and this operation alone will probably take more time than letting the GPU do the work after you've uploaded the full-resolution texture.
And if you program the mipmapping as a shader, it is still unlikely to beat the hard-coded (maybe even hard-wired) built-in function. (And that's the code alone, not counting the fact that the built-in may schedule better, run separately, etc.)
This site explains the glGenerateMipmap history better =))
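If you want a reference to compare a hand-written downsampling pass against, a single mip level can be sketched on the CPU as a 2x2 box filter. This is an assumption about the filter: a box filter is a common choice for mipmap generation, but implementations are not required to use it, so the results of glGenerateMipmap on a given driver may differ slightly. `Image` and `downsampleBox2x2` are names made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-channel float image stored row by row.
struct Image {
    std::size_t width;
    std::size_t height;
    std::vector<float> pixels;

    float at(std::size_t x, std::size_t y) const { return pixels[y * width + x]; }
};

// Produces the next smaller level by averaging each 2x2 block of the source.
// Assumes even, non-zero dimensions to keep the sketch short.
Image downsampleBox2x2(const Image& src) {
    assert(src.width >= 2 && src.height >= 2);
    assert(src.width % 2 == 0 && src.height % 2 == 0);
    assert(src.pixels.size() == src.width * src.height);

    Image dst{src.width / 2, src.height / 2, {}};
    dst.pixels.reserve(dst.width * dst.height);
    for (std::size_t y = 0; y < dst.height; ++y) {
        for (std::size_t x = 0; x < dst.width; ++x) {
            const float sum = src.at(2 * x, 2 * y) + src.at(2 * x + 1, 2 * y) +
                              src.at(2 * x, 2 * y + 1) + src.at(2 * x + 1, 2 * y + 1);
            dst.pixels.push_back(sum * 0.25f);
        }
    }
    return dst;
}
```

Applying this repeatedly gives coarser and coarser levels, which is the approximation the question proposes in place of the separable kernel. How closely an upsampled level matches the fixed kernel is something you would have to check against your own data.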

Profiling a graphics rendering without a profiler

Nowadays we have pretty advanced tools to iron out rendering, letting us see the different stages, the time taken by draw calls, etc. But without them the graphics pipeline is quite a black box when it comes to understanding what is happening inside.
Suppose for some reason you have no such tool, or a very limited one. How would you measure anyway what is taking time in your rendering?
I am aware of tricks like discarding draw calls to see the CPU time, setting a 1x1 viewport to see the cost of geometry, using a dumb fragment shader to highlight the fillrate... They are useful already but only give a rough idea of what is going on, and tell nothing about the level of parallelism.
Also, getting the time spent in each stage per draw call seems to be difficult, especially given the lack of precision caused by measurement noise.
What tricks do you use when your backpack is almost empty and you still have to profile your rendering? What does your personal Swiss army knife consist of?
Frame rendering time
The absolute time spent in a small piece of code, a stage, etc. is not that relevant, since GPU driver optimization, batching, parallelism, and driver version make it nearly impossible to measure code precisely without GPU counters (which you can get through vendor libraries).
What you can measure easily is the impact of each individual code change. You'll only get the relative impact, but that's what you really need anyway, and it only requires the frame rendering time.
Ideally you should aim to be able to edit shader or pipeline code at runtime, and have a direct way to check the impact over a whole typical scene, for example by comparing graphs between several code paths. (Beware of static scenes, otherwise you'll end up with highly optimized static views but poor performance in dynamic scenes.)
Here's the Swiss army knife list:
scene states loader
scene recorder (camera paths/add-remove entities,texture, mesh, fake input, etc.) using scene states.
scene states saver
scene frame time logger (not just final average but each frame rendering time)
on-the-fly shader code reload
on-the-fly codepath switch
frame time log reader+graphs+statistic framework
Note that scene state load/save/record are handy for a lot of other things, from debugging to undo/redo to on-the-fly reload, not to mention savegames.
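The frame time logger and statistics items from the list above can be sketched roughly as follows. The names (`FrameTimeLog`, `relativeChange`) are made up for this example, and it assumes you already have a per-frame time in milliseconds from whatever timer you use. The point is to keep every sample so that two code paths can be compared on the same recorded scene.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical per-frame logger: stores every frame time, not just an average.
class FrameTimeLog {
public:
    void add(double milliseconds) {
        assert(milliseconds >= 0.0);
        samples_.push_back(milliseconds);
    }

    std::size_t size() const { return samples_.size(); }

    double mean() const {
        assert(!samples_.empty());
        const double sum = std::accumulate(samples_.begin(), samples_.end(), 0.0);
        return sum / static_cast<double>(samples_.size());
    }

    // Nearest-rank percentile, with p in (0, 100].
    double percentile(double p) const {
        assert(!samples_.empty());
        assert(p > 0.0 && p <= 100.0);
        std::vector<double> sorted = samples_;
        std::sort(sorted.begin(), sorted.end());
        const double n = static_cast<double>(sorted.size());
        std::size_t rank = static_cast<std::size_t>(std::ceil(p / 100.0 * n));
        rank = std::min(std::max<std::size_t>(rank, 1), sorted.size());
        return sorted[rank - 1];
    }

private:
    std::vector<double> samples_;
};

// Relative change of the mean frame time: negative means the candidate is faster.
double relativeChange(const FrameTimeLog& baseline, const FrameTimeLog& candidate) {
    assert(baseline.mean() > 0.0);
    return (candidate.mean() - baseline.mean()) / baseline.mean();
}
```

Looking at a high percentile next to the mean is a design choice here, not something from the list itself: it makes occasional spikes visible that an average would hide.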
Add a screenshot taker and an image diff, and you can unit test graphics code too.
If you can, add that to your CI server so that code changes with a huge impact don't go unnoticed. (It also helps with artists who check in their assets without evaluating the rendering impact.)
A must-read on this kind of CI graphics testing: http://aras-p.info/blog/2011/06/17/testing-graphics-code-4-years-later/
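The image diff part of such a test can be very small. Below is a sketch under assumptions: `Screenshot` and `diffRatio` are hypothetical names, the screenshots are tightly packed 8-bit channels already read back into memory, and the per-channel tolerance is a value you would have to tune, since different GPUs and drivers rarely produce bit-identical output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical screenshot: tightly packed 8-bit channels (e.g. RGBA).
struct Screenshot {
    std::size_t width;
    std::size_t height;
    std::size_t channels;
    std::vector<std::uint8_t> data;
};

// Fraction of channel values whose absolute difference exceeds `tolerance`.
double diffRatio(const Screenshot& a, const Screenshot& b, int tolerance) {
    assert(a.width == b.width && a.height == b.height && a.channels == b.channels);
    assert(a.data.size() == a.width * a.height * a.channels);
    assert(b.data.size() == a.data.size());
    assert(!a.data.empty());

    std::size_t differing = 0;
    for (std::size_t i = 0; i < a.data.size(); ++i) {
        const int delta = std::abs(static_cast<int>(a.data[i]) - static_cast<int>(b.data[i]));
        if (delta > tolerance) {
            ++differing;
        }
    }
    return static_cast<double>(differing) / static_cast<double>(a.data.size());
}
```

A test would then render a recorded scene, compare the result against a stored reference image, and fail if the ratio goes above a chosen threshold.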
Note: I'm responding to the question "Profiling a graphics rendering with a profiler", since that's something I was looking for ;)
I'm working mostly on Mac, and I'm using multiple tools:
gDebugger version 5.8 is available on Windows and Mac (this tool has been bought by AMD, and the v6 version is Windows only). It gives you statistics about state changes, texture usage, draw calls, etc. It's also useful for debugging texture mapping and for seeing how your scene is drawn, step by step.
PVRUniSCoEditor is a shader editor. It compiles on the fly and gives you valuable details about estimated cycle counts and register usage.
Instruments (from the Xcode utilities, OS X only) gets information from the OpenGL driver. It's great for finding bottlenecks, since you can track which part of the GPU is used at 100% (tiler, renderer, texture unit, etc.).
Adreno Profiler is a Windows tool for profiling Adreno-based mobile devices. (Very good tool if you work on Android apps ;))
What's your trick with the "dumb fragment shader to highlight the fillrate"? (Drawing a plain color, or something more advanced?)

What has happened with opengl? What kind of nightmare is it now?

I used OpenGL 2 years ago. In one afternoon I read a tutorial, drew a cube (and then learned how to load any 3D model) and learned how to move the camera around with the mouse. It was easy, less than 100 lines of code. I didn't get the pipeline completely, but I was able to do something.
Now I need to refresh my OpenGL for some basic stuff: basically I need to load a 3D model (any model) and move the model around, with the camera fixed. Something I thought would take another afternoon.
I have spent 1 day and have nothing working. I am reading the recommended tutorial http://www.arcsynthesis.org/gltut/ and I don't get anything. Now, to draw just a cube, you need a lot of lines, have to work with lots of buffers, and use some special syntax for shaders... what the hell, I only want to draw a cube. Before, it was just defining 6 sides.
What is going on with OpenGL? Some would argue that it is great now; I think it is screwed.
Is there any easy library to work with? Something that would make my life easier?
GLUT - http://www.opengl.org/resources/libraries/glut/
ASSIMP - http://assimp.sourceforge.net/
These two libraries are all you need to make a simple application where you import a model (various formats). Read their documentation and examples to get a better understanding of how you can "glue" OpenGL and ASSIMP together.
As to whether OpenGL is harder to comprehend now: no. What I've learned in recent years from OpenGL is that graphics programming is never simple or done in a few lines of code. You have to be organised, you have to be careful, and even a simple primitive (e.g. a cube) needs more than 100 lines of code to make it decent and flexible (for example if you want more subdivisions on your polygons, or texturing).
If you learned it only two years ago, then the tutorials were extremely outdated. Immediate Mode has been known to be deprecated for a very, very long time. Actually the first plans to abandon it and display lists date back to 2003.
Vertex Arrays have been around since version 1.1, and they have been the preferred method for sending geometry to OpenGL ever since. In immediate mode every vertex causes several function calls, so for any seriously complex object you spend more time managing the function call stack than doing actual rendering work. If you have used Vertex Arrays consistently since their introduction, switching over to Vertex Buffer Objects is as complicated as inserting or replacing a few lines.
The biggest hurdle when using OpenGL-3 is on Windows, where one has to use a proxy context to get access to the extension functions required to select OpenGL-3 capabilities for context creation. Again, however, it's no big hurdle: 20 lines of code, tops. And some programs, like mine for example, create a proxy GL context anyway, to which all shareable data is uploaded. This makes it possible to quickly destroy and recreate visible contexts while keeping full access to textures, VBOs and so on. (You can share VBOs, which is another reason for using them instead of plain vertex arrays. This might not look like a big deal, at least not if the context is used from a single process. However, on platforms like X11/GLX, OpenGL contexts can be shared between X11 clients, which may even run on different machines!)
Also, the existence of functions like the matrix manipulation stack led people to the misconception that OpenGL was some kind of matrix math library; some even believed it was a particularly fast one. Neither is true. The removal of the matrix manipulation functions was a very important and right thing to do. Every serious OpenGL application will implement its very own matrix math anyway. For example, any modern game using some kind of physics engine passes the transform matrix produced by the physics calculation directly to OpenGL (glLoadMatrix or glUniformMatrix), completely bypassing the rest of the matrix functions. This also means that the sole reason to have multiple matrix stacks (GL_PROJECTION, GL_MODELVIEW, GL_TEXTURE, GL_COLOR), namely being able to use the same set of manipulation functions on several matrices, was obsolete and could have been replaced by something like glLoadMatrixSelected{f,d}v(GLenum target, GLfloat *matrix). However, uniforms and shaders were already around, so the logical step was not to introduce a new function but to reuse the existing API, which had already been used for this task anyway, and to remove what's no longer needed.
TL;DR: The new OpenGL-3 API greatly simplifies using it. It's a lot clearer, has fewer pitfalls and IMHO is also more newbie-friendly.
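As a small illustration of what application-side matrix math looks like, here is a sketch of a 4x4 matrix in the column-major layout that glLoadMatrixf and glUniformMatrix4fv (with transpose set to GL_FALSE) expect. The type and function names are made up for this example, and it only covers identity, translation, multiplication and transforming a point by an affine matrix.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Column-major 4x4 matrix: element (row r, column c) lives at index c * 4 + r.
using Mat4 = std::array<float, 16>;

Mat4 makeIdentity() {
    Mat4 m{};
    m[0] = m[5] = m[10] = m[15] = 1.0f;
    return m;
}

Mat4 makeTranslation(float x, float y, float z) {
    Mat4 m = makeIdentity();
    m[12] = x;
    m[13] = y;
    m[14] = z;
    return m;
}

// Returns a * b, so b is applied to a point first, then a.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (std::size_t c = 0; c < 4; ++c) {
        for (std::size_t row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < 4; ++k) {
                sum += a[k * 4 + row] * b[c * 4 + k];
            }
            r[c * 4 + row] = sum;
        }
    }
    return r;
}

// Transforms the point (x, y, z, 1) by an affine matrix.
std::array<float, 3> transformPoint(const Mat4& m, const std::array<float, 3>& p) {
    assert(m[3] == 0.0f && m[7] == 0.0f && m[11] == 0.0f && m[15] == 1.0f);
    return {m[0] * p[0] + m[4] * p[1] + m[8] * p[2] + m[12],
            m[1] * p[0] + m[5] * p[1] + m[9] * p[2] + m[13],
            m[2] * p[0] + m[6] * p[1] + m[10] * p[2] + m[14]};
}
```

The `data()` pointer of such an array is what you would pass as the matrix argument of the upload call. In practice most projects use an existing math library rather than writing this by hand.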
You don't have to use buffer objects. You can use the deprecated immediate mode. It will be slower, but if you don't really care then go ahead and use OpenGL the way you used to. NeHe has some excellent tutorials on OpenGL 1.x stuff.
Swiftless has some good tutorials (only a few very basic ones) on OpenGL 3.x and 4.x, but the learning curve is, as you've found, very steep.
Does it have to be OpenGL? XNA lets you draw 3D models without breaking your back. Could be worth a look.

GLUTesselator for realtime tesselation?

I'm trying to make a vector drawing application using OpenGL which will allow the user to see the result in real time. The way I have it set up is with an edge flag callback, so the GLU tessellator only outputs triangles, which I then pass to a VBO. I've tried to make all my algorithms as fast as possible, and this is not where my issue is. According to a few code profilers, my big slowdown occurs in the call to gluTessEndPolygon(), which is the function that makes the polygon. I have found that when the shape exceeds 100 input vertices, it gets really, really slow and basically destroys all the hard work I did to optimize everything else. What can I do? I provide the normal (0,0,1). I also tried all the tips from the GL Red Book. Is there a way to make the tessellator tessellate quicker but with less precision?
You might give poly2tri a try to see if it's any faster.