Getting transformed vertices back from the GPU in DirectX 10/11 - C++

The graphics engine I am developing has hit a major bottleneck: matrix transforms on the vertices (there are nearly no static vertices at all). So far I've been transforming the vertices on the CPU and updating the vertex buffer every frame (the data copying is a small bottleneck by itself, but manageable so far).
So I was thinking: if I could just keep the mesh buffer on the GPU, I could transform the vertices there and get the transformed set of vertices back to main memory for further processing (the subsequent processing requires a bit more interconnectivity than GPU shaders allow). That might eliminate the bottleneck in the current code.
Any tips on how to do that? Thanks.

Look into the stream-output stage in DX11:
http://msdn.microsoft.com/en-us/library/windows/desktop/bb205121(v=vs.85).aspx
It allows you to attach a buffer into which the results from the vertex shader (and, optionally, the geometry shader) are streamed; you can then copy that buffer into a staging resource and read it back on the CPU.
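As a rough sketch of how that can look in C++ with D3D11 (not code from the question: the device/context/vsBlob parameters, the single-float4 SV_POSITION output, and the explicit vertex count are assumptions for illustration), you can create a pass-through geometry shader from the vertex shader's output signature, stream into a GPU buffer, copy that into a staging buffer, and map it on the CPU:

    #include <d3d11.h>

    // Hedged sketch: stream the vertex shader output into a buffer and read it
    // back. The input layout, vertex buffer and the vertex shader itself are
    // assumed to be bound already (IASetInputLayout / IASetVertexBuffers /
    // VSSetShader), and the VS is assumed to write a single float4 SV_POSITION.
    void ReadBackTransformedVertices(ID3D11Device* device,
                                     ID3D11DeviceContext* context,
                                     ID3DBlob* vsBlob, UINT vertexCount)
    {
        // Declare which VS outputs to stream out (must match the VS signature).
        D3D11_SO_DECLARATION_ENTRY soDecl[] = {
            { 0, "SV_POSITION", 0, 0, 4, 0 },
        };
        UINT stride = 4 * sizeof(float);

        // Passing the compiled *vertex* shader creates a pass-through GS that
        // simply streams the VS output. D3D11_SO_NO_RASTERIZED_STREAM skips
        // rasterization entirely (D3D11; pass 0 instead to also rasterize).
        ID3D11GeometryShader* soGS = nullptr;
        device->CreateGeometryShaderWithStreamOutput(
            vsBlob->GetBufferPointer(), vsBlob->GetBufferSize(),
            soDecl, 1, &stride, 1,
            D3D11_SO_NO_RASTERIZED_STREAM, nullptr, &soGS);

        // GPU buffer that receives the streamed-out vertices.
        D3D11_BUFFER_DESC soDesc = {};
        soDesc.ByteWidth = vertexCount * stride;
        soDesc.Usage = D3D11_USAGE_DEFAULT;
        soDesc.BindFlags = D3D11_BIND_STREAM_OUTPUT;
        ID3D11Buffer* soBuffer = nullptr;
        device->CreateBuffer(&soDesc, nullptr, &soBuffer);

        // CPU-readable staging copy of the same size.
        D3D11_BUFFER_DESC stagingDesc = soDesc;
        stagingDesc.Usage = D3D11_USAGE_STAGING;
        stagingDesc.BindFlags = 0;
        stagingDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
        ID3D11Buffer* stagingBuffer = nullptr;
        device->CreateBuffer(&stagingDesc, nullptr, &stagingBuffer);

        // Draw with the SO target bound, then copy and map.
        UINT offset = 0;
        context->SOSetTargets(1, &soBuffer, &offset);
        context->GSSetShader(soGS, nullptr, 0);
        context->Draw(vertexCount, 0);

        ID3D11Buffer* nullTarget = nullptr;
        context->SOSetTargets(1, &nullTarget, &offset);
        context->CopyResource(stagingBuffer, soBuffer);

        D3D11_MAPPED_SUBRESOURCE mapped = {};
        if (SUCCEEDED(context->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped)))
        {
            const float* transformed = static_cast<const float*>(mapped.pData);
            // ... hand the transformed vertices to the CPU-side processing ...
            context->Unmap(stagingBuffer, 0);
        }

        stagingBuffer->Release();
        soBuffer->Release();
        soGS->Release();
    }

Note that mapping the staging buffer immediately after the draw stalls the CPU until the GPU has finished; if that sync hurts, it is common to round-robin a few staging buffers and read back results a frame or two later.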

Related

What are indices in GLSL?

I am going through the OpenGL tutorials on learnopengl.com and feel I have a pretty solid grasp of what I have learned so far. The only thing I am having trouble with is the indices in GLSL. All I know so far is that they are for transferring data between shaders, and between buffers and the vertex shader. I would like to understand the following:
What physical device contains these indices? (is it in RAM or memory on GPU?)
Why are there only 16 indices?
Why do I have to enable them before I can use them on the client side of the program with glEnableVertexAttribArray(n)?
Thank you for taking the time to help me out!
Vertex data is stored (or rather buffered; the CPU may process it, but it all gets sent to the GPU) in GPU memory.
16 is the minimum number of vertex attributes that the OpenGL specification guarantees; the actual limit on your hardware is reported by GL_MAX_VERTEX_ATTRIBS. As mentioned in the tutorial (at the beginning of the shaders section) and in the OpenGL documentation, the number of vertex attributes is limited by the graphics hardware. Each additional vertex attribute is one more thing for your GPU to process per vertex. Disabling all vertex attributes by default and requiring the programmer to enable only the attributes they need allows your code to run efficiently.
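For reference, a minimal sketch of the client-side calls involved (plain C++ with an OpenGL 3.3 context, a loader such as GLAD/GLEW, and <cstdio> assumed; the attribute layout is illustrative):

    GLint maxAttribs = 0;
    glGetIntegerv(GL_MAX_VERTEX_ATTRIBS, &maxAttribs);    // at least 16 per the spec
    printf("Vertex attributes available: %d\n", maxAttribs);

    // Attributes are disabled by default: describe attribute 0 (e.g. positions)
    // for the currently bound VAO/VBO, then enable it.
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void*)0);
    glEnableVertexAttribArray(0);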

Tiled deferred shading without compute shader

I'm building a deferred renderer, and since I want to support a large number of lights in the scene I've had a look at tiled deferred shading.
The problem is that I have to target OpenGL 3.3 hardware and it doesn't support GLSL compute shaders.
Is there a possibility to implement tiled deferred shading with normal shaders?
Tiled deferred rendering does not strictly require a compute shader. What it requires is that, for each tile, you have a series of lights which it will process. A compute shader is merely one way to accomplish that.
An alternative is to build the light lists for each tile frustum on the CPU, then upload that data to the GPU for its eventual use. Obviously this requires much more memory work than the compute-shader version. But it's probably not that expensive, and it allows you to easily play with tile sizes to find the best one. More tiles mean more CPU work and more data to upload, but fewer lights per tile (generally speaking) and more efficient processing.
One way to do that on GL 3.3-class hardware is to make each tile a separate quad. The quad is given, as part of its per-vertex parameters, the starting index of its portion of the total light list and the number of lights for that tile to process. The idea is that there is a globally accessible array, and each tile has a contiguous region of this array that it will process.
This array could be the actual lights themselves, or it could be indices into a second (much smaller) array of lights. You'll have to measure the difference to tell if it's worthwhile to have the additional indirection in the access.
The primary array should probably be a buffer texture, since it can get quite large, depending on the number of lights and tiles. If you go with the indirect route, then the array of actual light data will likely fit into a uniform block. But in either case, you're going to need to employ buffer streaming techniques when uploading to it.
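A hedged sketch of the CPU side under those assumptions (plain C++/OpenGL 3.3; the Light type, the intersects() tile-frustum test, and the buffer/texture names are placeholders, not code from the answer):

    #include <cstdint>
    #include <vector>
    // A GL loader header (GLEW/GLAD) is assumed to be included and initialized.

    struct TileRange { int32_t first; int32_t count; };   // per-tile-quad vertex data

    // Hypothetical Light type and intersects() tile-frustum test stand in for
    // whatever light representation and culling you already have.
    std::vector<TileRange> buildTileLightLists(const std::vector<Light>& lights,
                                               int tileCountX, int tileCountY,
                                               GLuint indexBuffer, GLuint indexTexture)
    {
        std::vector<int32_t> indices;                      // concatenated per-tile lists
        std::vector<TileRange> ranges(tileCountX * tileCountY);

        for (int t = 0; t < tileCountX * tileCountY; ++t) {
            ranges[t].first = (int32_t)indices.size();
            for (int32_t l = 0; l < (int32_t)lights.size(); ++l)
                if (intersects(lights[l], t))              // assumed CPU culling test
                    indices.push_back(l);
            ranges[t].count = (int32_t)indices.size() - ranges[t].first;
        }

        // The fragment shader fetches indices with texelFetch(isamplerBuffer, i).
        glBindBuffer(GL_TEXTURE_BUFFER, indexBuffer);
        glBufferData(GL_TEXTURE_BUFFER, indices.size() * sizeof(int32_t),
                     indices.data(), GL_STREAM_DRAW);
        glBindTexture(GL_TEXTURE_BUFFER, indexTexture);
        glTexBuffer(GL_TEXTURE_BUFFER, GL_R32I, indexBuffer);

        return ranges;   // becomes the per-vertex (first, count) data of each tile quad
    }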

GPU particle metaball-surface rendering

I have a question about a very specific method on how to render surface particles. The method is explained very well in the Nvidia GPU Gems 3 chapter 7 "Point-Based Visualization of Metaballs on a GPU", link to this chapter.
The article is about rendering an implicit surface using points or splats that are evenly distributed over the surface. They say that the computation of these particles is done completely on the GPU. Only the data which defines the surface is sent from CPU to the GPU to keep the traffic as low as possible.
They also give some pseudocode examples of fragment shader programs to compute the particle positions, velocities, etc., and to me it looks like these programs should run once for every particle.
Now my question is, how do they store these particles? What kind of data structure is it?
It must be some kind of buffer or texture that can be accessed for reading as well as for writing operations on the GPU. But how do I render this buffer/texture again in the next rendering step?
My first idea was some kind of vertex-buffer-object which is sent to the GPU once at the beginning and continuously updated there at each rendering pass. Is that possible at all?
One requirement for me is that it must be implemented using OpenGL/GLSL, I hope that is possible.
Yes, you need some kind of VBO and repeated passes over the same data. The data structure can be SoA (Structure of Arrays) or AoS (Array of Structures), depending on how you prefer to code access to the different properties of the array, i.e.:
SoA:
Positions Array
Speed Array
Normal Array
AoS:
Just one Array containing [Position, Speed, Normal].
AoS is the same as an interleaved array for rendering, where you keep all the properties of the mesh in a single array.
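Purely for illustration, the two layouts in C++ (the Vec3/Particle names are made up):

    #include <vector>

    struct Vec3 { float x, y, z; };

    // SoA: one array per property.
    struct ParticlesSoA {
        std::vector<Vec3> positions;
        std::vector<Vec3> speeds;
        std::vector<Vec3> normals;
    };

    // AoS: one array of structs, which maps directly onto an interleaved VBO.
    struct Particle {
        Vec3 position;
        Vec3 speed;
        Vec3 normal;
    };
    using ParticlesAoS = std::vector<Particle>;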
You could use either a VBO or a texture; the main difference is how the caching is done, since textures are optimized for 2D access.
The rendering is done in steps exactly as you are picturing it: you "render" the physics step of the system using shaders that compute the properties you want, and then bind the same structures to the actual graphics rendering in a subsequent step.

Issues about shaders and transformations in OpenGL

If I'm not wrong, shaders are programs that run on the GPU, right?
Do we send data to these programs using glUniformMatrix*?
I don't know if this is right, but if I send an MVP matrix to the shader, the vertices of the object I want to render will use the position calculated by the shader right before I call the render function.
If I want to render a lot of objects, I have to send the MVP matrix and then render the object right after, so I end up with code that sends data to the GPU and then renders, over and over. However, if I'm not wrong again, this is not good practice, because I lose performance: sending information to the GPU is very expensive. So a way to get better performance would be to send all the information to the GPU first and then render all the objects.
And the million-dollar question is: how can the shader program identify that the MVP matrix belongs to one object and not another?
If I'm not wrong, shaders are programs that run on the GPU, right?
Possibly. Many implementations of OpenGL have software renderers that they can fall back to if resources on the GPU are constrained. But usually, yes, they're run on the GPU.
Do we send data to these programs using glUniformMatrix*?
That's the usual way. You also set things like texture coordinates either via immediate mode methods like glTexCoord*() (in legacy OpenGL), or via buffer objects.
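For example (a minimal sketch; 'program', the "uMVP" uniform name and the use of GLM are assumptions, not a prescribed API):

    // Assumed: a compiled and linked 'program' with a "uMVP" uniform, plus
    // <glm/glm.hpp> and <glm/gtc/type_ptr.hpp>.
    glm::mat4 mvp = projection * view * model;
    glUseProgram(program);
    GLint loc = glGetUniformLocation(program, "uMVP");
    glUniformMatrix4fv(loc, 1, GL_FALSE, glm::value_ptr(mvp));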
I don't know if this is right, but if I send an MVP matrix to the shader, the vertices of the object I want to render will use the position calculated by the shader right before I call the render function.
There are different types of shaders. A vertex shader is called once for each vertex. A fragment shader is called once per fragment (roughly once per output screen-space pixel that actually gets drawn). Generally you will probably want to send the model, view, and projection matrices separately to the vertex shader. (Or possibly in some combination that lifts some computations out of the shader.) Then you'll multiply each vertex by the appropriate matrix (or combo of matrices).
And there are other types of shaders beyond those, but those 2 are the most common.
If I want to render a lot of objects, I have to send the MVP matrix and then render the object right after, so I end up with code that sends data to the GPU and then renders, over and over. However, if I'm not wrong again, this is not good practice, because I lose performance: sending information to the GPU is very expensive. So a way to get better performance would be to send all the information to the GPU first and then render all the objects.
I wouldn't get overly worried about performance until you have shaders working properly. Performance can be dependent on a lot of different factors. One is how often you send or receive data to or from the GPU and how much data you're transferring. Another is how many passes you do for each shader, and another is the size of your textures, geometry, and other stuff.
And the million-dollar question is: how can the shader program identify that the MVP matrix belongs to one object and not another?
The way I've done that in the past is to set the current shader program and uniforms via glUseProgram() and glUniform*(), then upload my geometry for an object, and repeat as necessary for each object or set of objects as needed.
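In code, that pattern looks roughly like this (a sketch only: the Object struct, 'program' and the "uMVP" uniform name are illustrative, and GLM is assumed):

    // Assumed headers: a GL loader (GLAD/GLEW), <vector>, <glm/glm.hpp>,
    // <glm/gtc/type_ptr.hpp>.
    struct Object {
        GLuint    vao;          // geometry already uploaded to the GPU
        GLsizei   indexCount;
        glm::mat4 model;
    };

    void renderScene(GLuint program, const glm::mat4& viewProj,
                     const std::vector<Object>& objects)
    {
        glUseProgram(program);                                // one program for all objects
        GLint mvpLoc = glGetUniformLocation(program, "uMVP");

        for (const Object& obj : objects) {
            glm::mat4 mvp = viewProj * obj.model;             // per-object matrix
            glUniformMatrix4fv(mvpLoc, 1, GL_FALSE, glm::value_ptr(mvp));
            glBindVertexArray(obj.vao);
            glDrawElements(GL_TRIANGLES, obj.indexCount, GL_UNSIGNED_INT, nullptr);
        }
    }

Each draw call sees whatever uniform values were set most recently, which is how a single shader program ends up using a different MVP matrix for each object.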

OpenGL voxel engine slow

I'm making a voxel engine in C++ and OpenGL (à la Minecraft) and can't get a decent fps on my 3 GHz machine with an ATI X1600... I'm all out of ideas.
When I have about 12,000 cubes on the screen it falls to under 20 fps - pathetic.
So far the optimizations I have are: frustum culling, back face culling (via OpenGL's glEnable(GL_CULL_FACE)), the engine draws only the visible faces (except the culled ones of course) and they're in an octree.
I've tried VBOs; I don't like them, and they do not significantly increase the fps.
How can Minecraft's engine be so fast... I struggle with 10,000 cubes, whereas Minecraft can easily draw many more at a higher fps.
Any ideas?
#genpfault: I analyze the connectivity and just generate faces for the outer, visible surface. The VBO had a single cube that I glTranslate()d
I'm not an expert at OpenGL, but as far as I understand this is going to save very little time because you still have to send every cube to the card.
Instead what you should do is generate faces for all of the outer visible surface, put that in a VBO, and send it to the card and continue to render that VBO until the geometry changes. This saves you a lot of the time your card is actually waiting on your processor to send it the geometry information.
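As a hedged sketch of that idea (legacy-style OpenGL in C++, to match the era of the question; buildVisibleFaces() and the position-only layout are placeholders for whatever your surface extraction produces):

    // Assumed: a GL context, <vector>, and a hypothetical buildVisibleFaces()
    // that returns interleaved xyz positions for only the outer faces.
    std::vector<float> verts = buildVisibleFaces(chunk);

    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, verts.size() * sizeof(float),
                 verts.data(), GL_STATIC_DRAW);              // uploaded once

    // Every frame: just bind and draw; re-upload only when the chunk changes.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 3 * sizeof(float), (void*)0);
    glDrawArrays(GL_TRIANGLES, 0, (GLsizei)(verts.size() / 3));
    glDisableClientState(GL_VERTEX_ARRAY);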
You should profile your code to find out whether the bottleneck in your application is on the CPU or the GPU. For instance, it might be that your culling/octree algorithms are slow, in which case it is not an OpenGL problem at all.
I would also keep count of the number of cubes you draw on each frame and display that on screen. Just so you know your culling routines work as expected.
Finally you don't mention if your cubes are textured. Try using smaller textures or disable textures and see how much the framerate increases.
gDEBugger is a great tool that will help you find bottlenecks with OpenGL.
I don't know if it's OK here to "bump" an old question, but a few things came to mind:
If your voxels are static, you can speed up the whole rendering process by using an octree for frustum culling, etc. Furthermore, you can also compile a static scene into a potential visibility set (PVS) in the octree. The main principle of a PVS is to precompute, for every node in the tree, which other nodes are potentially visible from it, and store pointers to them in a vector. When it comes to rendering, you first check which node the camera is in and then run frustum culling against all nodes in that node's PVS vector. (Carmack used something like that in the Quake engines, but with binary space partitioning trees.)
If the shading of your voxels is somewhat complex, it can also be fast to do a depth-only pre-pass, without writing into the color buffer, just to fill the depth buffer. After that you render a second pass: disable writing to the depth buffer and render only to the color buffer while testing against the depth buffer. That way you avoid expensive shader computations that would later be overwritten by a new fragment that is closer to the viewer. (Carmack used that in Quake 3.)
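In plain OpenGL the two passes look roughly like this (drawSceneDepthOnly() and drawSceneShaded() are placeholders for your own draw calls):

    // Pass 1: depth only - fill the depth buffer, no color writes, cheap shading.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    drawSceneDepthOnly();          // positions only, simplest possible shader

    // Pass 2: shade only the fragments that survived the depth test.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);         // depth buffer is already complete
    glDepthFunc(GL_LEQUAL);        // accept fragments equal to the pre-pass depth
    drawSceneShaded();             // full, expensive shading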
Another thing that will definitely speed things up is instancing. You store only the position of each voxel and, if necessary, its scale and other parameters in a texture buffer object (TBO). In the vertex shader you can then read the positions of the voxels to be spawned and create an instance of the voxel (i.e. a cube which is given to the shader in a vertex buffer object). So you send the 8 vertices + 8 normals (3 * sizeof(float) * 8 + 3 * sizeof(float) * 8, plus floats for color/texture, etc.) only once to the card in the VBO, and then only the positions of the cube instances (3 * sizeof(float) * number of voxels) in the TBO.
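A sketch of the client-side setup under those assumptions (OpenGL 3.1+ hardware required; the buffer/texture/VAO names and the 4-floats-per-voxel layout are illustrative, and RGBA is used because 3-component buffer-texture formats such as GL_RGB32F only became core in GL 4.0):

    // One vec4 per voxel: xyz position, w free for scale (or left unused).
    glBindBuffer(GL_TEXTURE_BUFFER, voxelDataBuffer);
    glBufferData(GL_TEXTURE_BUFFER, voxelCount * 4 * sizeof(float),
                 voxelData, GL_STREAM_DRAW);
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_BUFFER, voxelDataTexture);
    glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, voxelDataBuffer);

    // Draw the single unit cube (36 indices assumed in cubeVAO) once per voxel;
    // in the vertex shader each instance is offset by
    // texelFetch(voxelSampler, gl_InstanceID).xyz.
    glBindVertexArray(cubeVAO);
    glDrawElementsInstanced(GL_TRIANGLES, 36, GL_UNSIGNED_INT, nullptr, voxelCount);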
Maybe it is possible to parallelize things between the GPU and CPU by combining all three steps in two threads: in the CPU thread you check the octree's PVS and update a TBO for instancing in the next frame, while the GPU thread renders the two passes using the TBO for instancing that was created by the CPU thread in the previous step. After that you swap TBOs. If the camera has not moved, you don't even have to do the CPU calculations again.
Another kind of tree you may be interested in is the so-called k-d tree, which is more general than an octree.
PS: sorry for my English, it's not the clearest...
There are 3rd-party libraries you could use to make the rendering more efficient. For example the C++ PolyVox library can take a volume and generate the mesh for you in an efficient way. It has built-in methods for reducing triangle count and helping to generate things like ambient occlusion. It's got a good community around it so getting support on the forum should be easy.
Have you used a common display list for all your cubes?
Do you skip calling the drawing code for cubes which are not visible to the user?