How to do LOD in DirectX 9? - c++

I know in DirectX 11 you can use the awesome tessellation feature for LODs, but knowing that DirectX 9 doesn't have this feature, how would I go about creating LODs for the models in my 3D application/game to speed it up?
I heard that back in the old days, before DirectX 10 or 11 came out, people used to create several versions of the same model with different polycounts (e.g. one with a very low polycount for faraway objects and one with a high polycount for very near objects).
But doing this would mean doubling or even tripling the size of the models in the game, right? Are there any other approaches to achieving LODs in DirectX 9? Or is this really the best solution when it comes to DirectX 9? Can someone at least point me in the right direction so I can go away and do more research about it?

Generating multiple LOD meshes using mesh simplification algorithms (or by hand) might not be as bad as you think in terms of memory consumption. As with mipmaps, your simplified meshes have far fewer vertices than the original, so they shouldn't triple the size of your in-game models. And you don't have to keep the high-resolution meshes in video memory if you're not going to be using them for a while.
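To put rough numbers on the mipmap comparison, here is a small sketch that sums the vertex memory of a whole LOD chain. The function name, the vertex counts and the 32-byte vertex size in the usage note are assumptions picked for illustration, not measurements from a real model.

```cpp
#include <cstddef>
#include <vector>

// Total bytes needed to keep every LOD of one model resident,
// given the vertex count of each level and the size of one vertex.
std::size_t lod_chain_bytes(const std::vector<std::size_t>& vertex_counts,
                            std::size_t bytes_per_vertex) {
    std::size_t total = 0;
    for (std::size_t count : vertex_counts) {
        total += count * bytes_per_vertex;
    }
    return total;
}
```

With hypothetical levels of 10000, 2500 and 600 vertices at 32 bytes each, the whole chain takes 419200 bytes against 320000 bytes for the finest level alone, so roughly 1.3 times the original rather than 3 times.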
An alternative to save memory is to simplify meshes by discarding vertices only. This way, you can use a single vertex buffer and have different index buffers for each LOD. You might get slightly lower quality LOD meshes, but the memory overhead of keeping them all in memory will be much smaller.
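A minimal sketch of that layout, assuming a distance threshold per level: one shared vertex array, one index list per LOD, and a function that picks the level to draw. The type names, `select_lod` and the thresholds are hypothetical. In Direct3D 9 each index list would live in its own index buffer, bound with `SetIndices` before the `DrawIndexedPrimitive` call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vertex { float x, y, z; };

// One LOD level: its own index list into the shared vertex array,
// used while the camera distance is at most max_distance.
struct LodLevel {
    std::vector<std::uint16_t> indices;
    float max_distance;
};

struct LodMesh {
    std::vector<Vertex> vertices;  // shared by every level
    std::vector<LodLevel> levels;  // ordered from finest to coarsest
};

// Returns the index of the level to draw for a given camera distance.
// Falls back to the coarsest level when the object is beyond every threshold.
std::size_t select_lod(const LodMesh& mesh, float distance) {
    assert(!mesh.levels.empty());
    for (std::size_t i = 0; i < mesh.levels.size(); ++i) {
        if (distance <= mesh.levels[i].max_distance) {
            return i;
        }
    }
    return mesh.levels.size() - 1;
}
```

The coarse levels simply skip some of the shared vertices, which is why the simplification has to be restricted to discarding vertices.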
If I'm not mistaken, tessellation is for subdivision, so it wouldn't help you anyway if you want a coarser mesh (though it can probably help interpolate between LODs).


Is sharing vertices between faces worth it?

I am currently working on a WebGL project, although I imagine this question is generic across many graphics APIs.
Let me use the example of a simple cube to demonstrate what I am asking.
A cube has 6 faces with 4 vertices per face so in total we have 24 vertices that make up the cube. However, we could reduce the total number of vertices to only 8 if we share vertices between faces. As I have been reading this can save a lot of precious GPU memory especially when working with complex models and scenes.
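As a rough illustration of the saving for this cube, the sketch below compares the two layouts. It assumes position-only vertices (three 4-byte floats), 16-bit indices, and 36 indices in both cases because the cube is drawn as 12 triangles. All names are made up for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes for an indexed mesh: vertex data plus 16-bit indices.
constexpr std::size_t indexed_mesh_bytes(std::size_t vertex_count,
                                         std::size_t bytes_per_vertex,
                                         std::size_t index_count) {
    return vertex_count * bytes_per_vertex + index_count * sizeof(std::uint16_t);
}

// A cube drawn as 12 triangles needs 36 indices in either layout.
constexpr std::size_t kCubeIndices = 36;
// Assumed vertex format: x, y, z only, with 4-byte floats.
constexpr std::size_t kPositionOnly = 3 * sizeof(float);

// 4 vertices per face, nothing shared between faces.
constexpr std::size_t unshared_cube_bytes =
    indexed_mesh_bytes(24, kPositionOnly, kCubeIndices);
// 8 corner vertices shared by all faces.
constexpr std::size_t shared_cube_bytes =
    indexed_mesh_bytes(8, kPositionOnly, kCubeIndices);
```

Under these assumptions the unshared cube takes 360 bytes and the shared one 168 bytes. Adding normals or UVs to the vertex format widens the gap per vertex, but is also what forces the duplication described next.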
On the other hand though, I have experienced first-hand some of the drawbacks with sharing vertices between faces. These include:
Complex vertex normal calculations as we must find the 'average' normal for each vertex, taking into account the face normals of each face that said vertex is a part of.
Some vertices must be duplicated anyway to 'match up' with their corresponding UV coordinates.
As a vertex may be shared by many faces, we are not able to specify a different colour per face using per vertex colouring.
The book I have been reading really stresses the importance of vertex sharing to minimise memory usage, so when I came across some of the disadvantages of vertex sharing I was unsure as to how viable/helpful vertex sharing really is, and as the author did not mention any of the downsides of vertex sharing I would like to get the opinions of you guys. So is the memory saving produced by vertex sharing really that important?
The disadvantages you named are indeed very real, especially for shapes with lots of sharp edges or different textures. A cube is the worst possible example for vertex sharing: each corner needs 3 different normals and possibly 3 different texture coordinates. It is essentially impossible to share the vertices.
However think of some organic shape. Like a ball, the body of some animal, cars, trees or even something as simple as a desert or something. These shapes probably need a high amount of vertices to look like anything decent, but a lot of those vertices are shared between faces. They need the exact same normals, texture coordinates and whatevers in order to look smooth.
Furthermore, the first disadvantage is not really that important. Calculating the vertex normals can be done in preprocessing, in most cases even by the modeller. This is basically never done in realtime; instead you simply already have the data in this format. However, if it does need to be done in realtime, you can imagine this becoming an actual issue, and you need to start thinking about the trade-offs and profile. But even then it can probably be dealt with using geometry shaders; if the visual fidelity is needed this can be a preferable solution.
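For reference, the preprocessing step is short. The sketch below shows one common way to do it, not necessarily what any particular modelling tool does: sum the unnormalised face normals of every triangle that touches a vertex, which weights larger triangles more, then normalise. The types and function names are invented for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// One normal per shared vertex: accumulate the face normal of every
// triangle that uses the vertex, then normalise the sum.
std::vector<Vec3> average_vertex_normals(const std::vector<Vec3>& positions,
                                         const std::vector<unsigned>& indices) {
    assert(indices.size() % 3 == 0);
    std::vector<Vec3> normals(positions.size(), Vec3{0.0f, 0.0f, 0.0f});
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        const unsigned tri[3] = {indices[i], indices[i + 1], indices[i + 2]};
        const Vec3 n = cross(sub(positions[tri[1]], positions[tri[0]]),
                             sub(positions[tri[2]], positions[tri[0]]));
        for (unsigned v : tri) {
            normals[v].x += n.x;
            normals[v].y += n.y;
            normals[v].z += n.z;
        }
    }
    for (Vec3& n : normals) {
        const float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
        if (len > 0.0f) {
            n.x /= len;
            n.y /= len;
            n.z /= len;
        }
    }
    return normals;
}
```

For a flat quad made of two triangles in the XY plane, every vertex ends up with the normal (0, 0, 1), as expected for a smooth surface.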
In conclusion it heavily depends on what you're doing. In some cases vertex sharing isn't really viable because of the reasons you mentioned. Regardless, in many many cases it can potentially save a lot of memory.

Which is the most optimal and correct way to draw many different dynamic 3D models (they are animated and change every frame)?

I need to know how I can render many different 3D models that change their geometry every frame (they are animated models) and don't share models or textures.
I load all the models and create an instance of a model class for each one.
What is the most optimal way to render them?
To use 1 VBO for each 3D model
To use a single VBO for all models (to be all different, I do not see this option possible)
I work with OpenGL 3.x or higher, C++ on Windows.
TL;DR - there's no silver bullet when it comes to rendering performance
Why is that? It depends on the complicated process that gets your data, converts it, pushes it to the GPU and then makes pixels on the screen flicker. So, instead of "one best way", a few guidelines have emerged that usually improve performance.
Keep all the necessary data on the GPU (because the closer to the screen, the shorter way electrons have to go :))
Send as little data to GPU between frames as possible
Don't sync needlessly between CPU and GPU (that's like trying to run two high speed trains on parallel tracks, but insisting on slowing them down to the point where you can pass something through the window every once in a while).
Now, it's obvious that if you want to have a model that will change, you can't have your cake and eat it. You have to make tradeoffs. Simply put, dynamic objects will never render as fast as static ones. So, what should you do?
Give the driver a hint about the data usage (GL_STREAM_DRAW or GL_DYNAMIC_DRAW) - it is only a hint, but it lets the driver pick a suitable memory arrangement.
Don't use interleaved buffers to mix static vertex attributes with dynamic ones - if you divide the memory, you can batch-update the geometry leaving texture coordinates intact, for example.
Try to do as much as you can purely on the GPU - with compute shaders and transform feedback, it might well be possible to store whole animation data as a buffer itself and calculate it on GPU, avoiding expensive syncs.
And last but not least, always carefully measure the impact of your change on performance. Going blindly won't help. Measure accurately and thoroughly (even stuff like shader compilation time might matter sometimes!). Then, even if you go by trial-and-error, there's a hope you'll get somewhere.
And to address one of your points in particular: whether it's one large VBO or a few smaller ones doesn't really matter, but a huge one might have problems fitting in memory. You can still update parts of it, and what matters most is the memory arrangement inside of it.
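As a sketch of the "one large VBO, update parts of it" option, the helper below lays several models out back to back and records the byte range of each. `BufferRange` and `pack_models` are hypothetical names, and the layout ignores any alignment a real driver might prefer. To refresh a single model you would then upload only its range, for example with `glBufferSubData(GL_ARRAY_BUFFER, offset, size, data)`.

```cpp
#include <cstddef>
#include <vector>

// Where one model's vertices live inside the shared buffer.
struct BufferRange {
    std::size_t offset_bytes;
    std::size_t size_bytes;
};

// Lays the models out back to back in one buffer and returns each model's range.
std::vector<BufferRange> pack_models(const std::vector<std::size_t>& vertex_counts,
                                     std::size_t bytes_per_vertex) {
    std::vector<BufferRange> ranges;
    ranges.reserve(vertex_counts.size());
    std::size_t cursor = 0;
    for (std::size_t count : vertex_counts) {
        const std::size_t size = count * bytes_per_vertex;
        ranges.push_back(BufferRange{cursor, size});
        cursor += size;
    }
    return ranges;
}
```

Whether this beats one VBO per model is exactly the kind of thing to measure rather than assume.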

Hardware support for non-power-of-two textures

I have been hearing controversial opinions on whether it is safe to use non-power-of two textures in OpenGL applications. Some say all modern hardware supports NPOT textures perfectly, others say it doesn't or there is a big performance hit.
The reason I'm asking is because I want to render something to a frame buffer the size of the screen (which may not be a power of two) and use it as a texture. I want to understand what is going to happen to performance and portability in this case.
Arbitrary texture sizes have been specified as a core part of OpenGL ever since OpenGL-2, which was a long time ago (2004). All GPUs designed ever since support NP2 textures just fine. The only question is how good the performance is.
However, ever since GPUs became programmable, optimizations based on the predictable access patterns of fixed-function texture fetches have become more or less obsolete. GPUs now have caches optimized for general data locality, so performance is not much of an issue either. In fact, with P2 textures you may need to upscale the data to match the power-of-two size, which increases the required memory bandwidth, and memory bandwidth is the #1 bottleneck of modern GPUs. So using a slightly smaller NP2 texture may actually improve performance.
In short: You can use NP2 textures safely and performance is not much of a big issue either.
All modern APIs (except some versions of OpenGL ES, I believe) on modern graphics hardware (the last 10 or so generations from ATi/AMD/nVidia and the last couple from Intel) support NP2 texture just fine. They've been in use, particularly for post-processing, for quite some time.
However, that's not to say they're as convenient as power-of-2 textures. One major case is memory packing; drivers can often pack textures into memory far better when they are powers of two. If you look at a texture with mipmaps, the base and all mips can be packed into an area 150% the original width and 100% the original height. It's also possible that certain texture sizes will line up memory pages with stride (texture row size, in bytes), which would provide an optimal memory access situation. NP2 makes this sort of optimization harder to perform, and so memory usage and addressing may be a hair less efficient. Whether you'll notice any effect is very much driver and application-dependent.
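The 150% by 100% figure can be checked with a small layout sketch. It assumes a square power-of-two base level and stacks the mips in a column to the right of it. This is one possible packing for illustration only, and says nothing about how an actual driver arranges memory. `Rect` and `pack_mip_chain` are invented names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Rect { std::size_t x, y, width, height; };

// Places the base level at the origin and stacks mip 1, 2, ... in a column
// to its right. Assumes a square power-of-two base size.
std::vector<Rect> pack_mip_chain(std::size_t size) {
    assert(size > 0 && (size & (size - 1)) == 0);
    std::vector<Rect> rects;
    rects.push_back(Rect{0, 0, size, size});
    std::size_t y = 0;
    for (std::size_t mip = size / 2; mip >= 1; mip /= 2) {
        rects.push_back(Rect{size, y, mip, mip});
        y += mip;
    }
    return rects;
}
```

For a 256x256 base this yields nine rectangles. The mip column is 128 texels wide and 255 texels tall, so everything fits in 384x256, which is 150% of the width and 100% of the height.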
Offscreen effects are perhaps the most common usecase for NP2 textures, especially screen-sized textures. Almost every game on the market now that performs any kind of post-processing or deferred rendering has 1-15 offscreen buffers, many of which are the same size as the screen (for some effects, half or quarter-size are useful). These are generally well-supported, even with mipmaps.
Because NP2 textures are widely supported and almost a sure bet on desktops and consoles, using them should work just fine. If you're worried about platforms or hardware where they may not be supported, easy fallbacks include using the nearest power-of-2 size (may cause slightly lower quality, but will work) or dropping the effect entirely (with obvious consequences).
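For the power-of-2 fallback, a helper along these lines is usually enough. This sketch rounds up, so the screen-sized image still fits. Rounding down and scaling the image is the other reading of "nearest" and trades quality for memory. The function name is made up.

```cpp
#include <cstdint>

// Smallest power of two that is >= value. Returns 1 for 0, and 0 when
// no 32-bit power of two is large enough (value > 2^31).
std::uint32_t next_power_of_two(std::uint32_t value) {
    std::uint32_t result = 1;
    while (result != 0 && result < value) {
        result <<= 1;
    }
    return result;
}
```

A 1920x1080 render target would fall back to 2048x2048 with this rounding, which is a noticeable amount of extra memory, so it is worth keeping as a fallback only.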
I have a lot of experience in making games (4+ years) and using texture atlases for iOS & Android through cross-platform development using OpenGL 2.0.
Stick with PoT textures with a maximum size of 2048x2048, because some devices (especially the cheap ones with cheap hardware) still don't support dynamic texture sizes. I know this from real-life testers and from seeing it first hand. There are so many devices out there now, you never know what sort of GPU you'll be facing.
Your iOS devices will also show black squares and artefacts if you are not using PoT textures.
Just a tip.
Even if arbitrary texture sizes are required by OpenGL X, certain video cards are still not fully compliant with OpenGL. I had a friend with an Intel card who had problems with NPOT2 textures (I assume Intel cards are fully compliant by now).
Do you have a reason for using NPOT2 textures? Then do it, but remember that some old hardware may not support them, and you'll probably need some software fallback that can make your textures POT2.
Don't have any reason for using NPOT2 textures? Then just use POT2 textures (certain compressed formats still require POT2 textures).

GPGPU - effective ping-pong technique?

I'm trying to implement an efficient fluid solver on the GPU using WebGL and GLSL shader programming.
I've found an interesting article:
See: 38.3.2 Slab Operations
I'm wondering if this technique of enforcing boundary conditions is possible with ping-pong rendering?
If I render only lines, what about an interior of the texture?
I've always assumed that the whole input texture must be copied to temporary texture (ofc boundary is updated during that process), as they are swapped after that operation.
This is especially interesting considering that Example 38-5, The Boundary Condition Fragment Program, shows a scheme that IMHO requires the ping-pong technique.
What do you think? Do I misunderstand something?
Generally I've found that texture write is extremely costly and that's why I would like to limit it somehow. Unfortunately, the ping-pong technique enforces a lot of texture writes.
I've actually implemented the technique described in that chapter using FrameBuffer objects as the render to texture method (but in desktop OpenGL since WebGL didn't exist at the time), so it's definitely possible. Unfortunately I don't believe I have the code any more, but if you tag any future questions you have with [webgl] I'll see if I can provide some help.
You will need to ping-pong several times per frame (the article mentions five steps, but I seem to recall the exact number depends on the quality of the simulation you want and on your exact boundary conditions). Using FBOs is quite a bit more efficient than it was when this was written (the author mentions using a GeForce FX 5950, which was a while ago), so I wouldn't worry about the overhead he mentions in the article. As long as you aren't bringing data back to the CPU, you shouldn't find too high a cost for switching between texture and the framebuffer.
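The ping-pong structure itself is simple. The sketch below is a CPU stand-in, assuming a 1D field and a made-up smoothing step rather than the article's actual shaders: every pass reads only from the source buffer, writes the interior and then the boundary cells of the target buffer, and swaps the two. On the GPU the two vectors would be two textures attached to FBOs, the interior loop a quad, and the two boundary writes lines.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Runs `passes` steps over a 1D field using two buffers (ping-pong).
std::vector<float> run_ping_pong(std::vector<float> field, int passes) {
    assert(passes >= 0);
    const std::size_t n = field.size();
    if (n < 3) {
        return field;
    }
    std::vector<float> source = std::move(field);
    std::vector<float> target(n, 0.0f);
    for (int pass = 0; pass < passes; ++pass) {
        // Interior pass: placeholder step, averages the two neighbours.
        for (std::size_t i = 1; i + 1 < n; ++i) {
            target[i] = 0.5f * (source[i - 1] + source[i + 1]);
        }
        // Boundary pass: each edge cell copies its adjacent interior cell.
        target[0] = source[1];
        target[n - 1] = source[n - 2];
        // The freshly written buffer becomes the input of the next pass.
        std::swap(source, target);
    }
    return source;
}
```

Note that the swap only exchanges which buffer is read and which is written. Nothing is copied, and in this sketch every cell of the target is rewritten on each pass.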
You will have some leakage if your boundaries are only a pixel thick, but that may or may not be acceptable depending on how you render your results and the velocity of your fluid. Making the boundaries thicker may help, and there are papers that have been written since this one that explore different ways of confining the fluid within boundaries (I also recall a few on more efficient diffusion/pressure solvers that you might check out after you have this version working; you'll find some interesting follow-ups if you search for papers that cite the GPU Gems article on Google Scholar).
Addendum: I'm not sure I entirely understand your question about boundaries. The key is that you must run a shader at each pixel of what you want to be a boundary, but it doesn't really matter how that pixel gets there, whether it's drawn with lines, points, or triangles (as long as its inputs are correct).
In the very general case (which might not apply if you only have a limited number of boundary primitives), you will likely have to draw a framebuffer-covering quad, since the interactions with the velocity and pressure fields are more complicated (any surrounding pixel could be another boundary pixel, instead of having simply defined edges). See section 38.5.4 (Arbitrary Boundaries) for some explanation of how to do it. If something isn't a boundary, you won't touch the vector field, and if it is, instead of hardcoding which directions you want to look in to sum vector values, you'll probably end up testing the surrounding pixels and only summing the ones that aren't boundaries so that you can enforce the boundary conditions.
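The neighbour test described above might look roughly like this on the CPU. It is a sketch with a scalar field and a separate boundary mask, both assumptions of this example. In a shader the mask would be another texture and the loop four texture fetches. `Grid`, `NeighbourSum` and `sum_fluid_neighbours` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Grid {
    std::size_t width, height;
    std::vector<float> value;          // one scalar per cell
    std::vector<unsigned char> solid;  // 1 = boundary cell, 0 = fluid

    std::size_t index(std::size_t x, std::size_t y) const { return y * width + x; }
};

struct NeighbourSum { float sum; int count; };

// Sums the four axis neighbours of (x, y), skipping cells that lie outside
// the grid or are marked as boundary.
NeighbourSum sum_fluid_neighbours(const Grid& g, std::size_t x, std::size_t y) {
    assert(x < g.width && y < g.height);
    const int dx[4] = {-1, 1, 0, 0};
    const int dy[4] = {0, 0, -1, 1};
    NeighbourSum result{0.0f, 0};
    for (int k = 0; k < 4; ++k) {
        const std::ptrdiff_t nx = static_cast<std::ptrdiff_t>(x) + dx[k];
        const std::ptrdiff_t ny = static_cast<std::ptrdiff_t>(y) + dy[k];
        if (nx < 0 || ny < 0 ||
            nx >= static_cast<std::ptrdiff_t>(g.width) ||
            ny >= static_cast<std::ptrdiff_t>(g.height)) {
            continue;
        }
        const std::size_t i = g.index(static_cast<std::size_t>(nx),
                                      static_cast<std::size_t>(ny));
        if (g.solid[i] != 0) {
            continue;
        }
        result.sum += g.value[i];
        ++result.count;
    }
    return result;
}
```

Returning the count alongside the sum lets the caller decide how to treat the missing neighbours, which is where the actual boundary condition comes in.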

DirectX9, DirectDraw, Optimization?

First off, I'm programming a game. Currently in the render function there are two calls to two different functions. One renders some text, one renders sprites.
On my computer (AMD Phenom(tm) II X4 955 Processor (4 CPUs), ~3.2GHz, 4096MB RAM DDR2, NVIDIA GeForce GTX 285) I have a render speed of ~2200 FPS when rendering around 200 sprites and about 100 FPS when rendering about 14,500.
I'm using a vector to store the information of each object I'm rendering and using one sprite with many draw calls.
VS2008 release mode with full optimization for C++. I know I've heard left and right don't optimize prematurely, but at this point, it's running great for me, but not so well on certain computers.
I can't imagine changing vectors out for arrays since I'm pushing and pulling things from the vector every frame, in an indeterminable method. Nearly randomly.
I've tried floats and doubles and the speed is no different.
Would it be different using DirectDraw rather than DirectX and the Sprite Render method? Since I've no idea what the differences between DirectDraw and DirectX are, I'm not 100% sure what I should be thinking about that.
The game runs fine on average computers, but what I'm comparing my game to is Touhou. Touhou runs at 60 FPS on the weakest computer I've tried, but my game won't run faster than 36~42 FPS. I can't imagine what I'm doing wrong, being so new to DirectX and C++.
Any assistance in this matter would be great. Unfortunately, I won't be around for a while to add information or answer questions.
You need a profiler.
There's some good performance advice in the responses, but it doesn't matter. Trying to optimize a program without a profiler is like trying to write a program without a compiler. Do not guess, measure.
Now with that said, profiling graphics code is an infamous pain in the neck, and there aren't (to my knowledge) any good, free tools to help with it. So never mind that for now: start with an ordinary CPU profiler, and find out which of your calls is really taking up all your time.
I'm using a vector to store the information of each object I'm rendering and using one sprite with many draw calls.
I'm not sure I understand what you're saying, but this sounds like you're drawing essentially the same object in a lot of different places. If that's the case, you probably want to look up DirectX Instancing. The basic idea is you specify 1) the geometry to draw, and 2) a number of places to draw it. This saves re-specifying the geometry every time you draw the object, so it can improve speed considerably.
I can't imagine changing vectors out for arrays since I'm pushing and pulling things from the vector every frame, in an indeterminable method. Nearly randomly.
Are you inserting and/or removing things from positions other than the back of the vector? In a vector, insertions and removals from the middle take O(n) time, that is, the amount of time it takes is proportional to the size of your vector.
If that is the case, then consider using an std::list instead. Note that with 10k+ objects this could easily be causing your performance issues, depending on how often you do it.
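If you do switch to `std::list`, removing objects while walking the container could look like the sketch below. `Bullet` and its `alive` flag are placeholders for whatever your game objects actually store.

```cpp
#include <cstddef>
#include <list>

struct Bullet {
    float x, y;
    bool alive;
};

// Removes every dead bullet in a single pass and returns how many were removed.
// Each erase on a std::list is O(1) and leaves the other iterators valid.
std::size_t remove_dead(std::list<Bullet>& bullets) {
    std::size_t removed = 0;
    for (auto it = bullets.begin(); it != bullets.end();) {
        if (!it->alive) {
            it = bullets.erase(it);  // erase returns the next element
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

A list trades cheap removal for slower iteration because its nodes are scattered in memory, so it is worth profiling both containers with your real object counts.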
Profile your application, and determine if your bottleneck is the CPU or the GPU (or the transfer BUS between the two)
When determined you have a few choices :
1) If it's the CPU, you can try instancing to reduce the number of draw calls. Or, if your target machine does not support hardware instancing, try some kind of batching. To instance or batch a sprite you have to use a quad (two triangles), as the default sprite interface does.
2) If it's the GPU, try to find out whether a shader is causing the slowdown. If that's the case, try to optimize it. If it's not the shader, try to reduce overdraw: if some of your objects are not transparent, draw them front to back.
3) If it's the bus, do the same as for the CPU: with batching you reduce the number of Locks/Unlocks you need to transfer the data (with instancing you would not need to update the buffer at all).
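A minimal sketch of the batching idea from point 1, assuming all sprites share one texture: every sprite becomes a quad in one big vertex array with six indices, so the whole set can go to the GPU in a single indexed draw. The types and `build_batch` are invented for this example. In Direct3D 9 the two arrays would be copied into a dynamic vertex buffer and an index buffer and drawn with one `DrawIndexedPrimitive` call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Sprite { float x, y, width, height; };
struct SpriteVertex { float x, y, u, v; };

struct SpriteBatch {
    std::vector<SpriteVertex> vertices;  // 4 per sprite
    std::vector<std::uint16_t> indices;  // 6 per sprite (two triangles)
};

// Builds one vertex/index array for all sprites.
// 16-bit indices limit a batch to 16384 sprites (65536 vertices).
SpriteBatch build_batch(const std::vector<Sprite>& sprites) {
    assert(sprites.size() <= 16384);
    SpriteBatch batch;
    batch.vertices.reserve(sprites.size() * 4);
    batch.indices.reserve(sprites.size() * 6);
    const std::uint16_t corner[6] = {0, 1, 2, 0, 2, 3};
    for (const Sprite& s : sprites) {
        const std::size_t base = batch.vertices.size();
        batch.vertices.push_back(SpriteVertex{s.x, s.y, 0.0f, 0.0f});
        batch.vertices.push_back(SpriteVertex{s.x + s.width, s.y, 1.0f, 0.0f});
        batch.vertices.push_back(SpriteVertex{s.x + s.width, s.y + s.height, 1.0f, 1.0f});
        batch.vertices.push_back(SpriteVertex{s.x, s.y + s.height, 0.0f, 1.0f});
        for (std::uint16_t c : corner) {
            batch.indices.push_back(static_cast<std::uint16_t>(base + c));
        }
    }
    return batch;
}
```

Since the index pattern never changes, only the vertex array needs to be rebuilt and uploaded each frame.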
That's all. :P
P.S. A warning: DO NOT TRY TO PROFILE DirectX calls with a CPU profiler (use PerfHUD from nVidia, GPUPerfStudio from ATI, or GPA from Intel instead).
It's just time lost: DirectX has a command buffer, and you are not assured that a call made now is executed right then. Most of the time it returns immediately and does nothing.