Performance boost due to mipmapping - opengl

Why do mipmaps boost performance?
I read somewhere on the net that "when we have 256x256 texture data and want to map it to 4x4, then the driver will copy only the generated 4x4 mipmap level to GPU memory and not the 256x256 data, and sampling will work on the 4x4 data copied to GPU memory, which will save a lot of computation". Is this correct?
Also, a glTexImage call uploads the texture data passed to it into GPU memory, which seems to conflict with the statement above. When we call glGenerateMipmap(), does it create all the mipmap levels on the CPU side and then copy them to the GPU?

Mipmapping boosts performance for two main reasons: it reduces the bandwidth used between GPU memory and the texture unit, and it improves caching inside the texture unit.
Imagine you have a distant object with a texture applied to it. With mipmapping, instead of reading, processing and caching a huge texture (level 0), the GPU reads a smaller level. In this scenario the bandwidth reduction is quite obvious.
Bandwidth and caching go hand in hand with mipmapping. Without mipmaps, sampling the texture of the distant object will most likely read it sparsely: for example, texel x,y for one fragment and x+10,y+10 for the next, which may be in another cache line. When reading from a smaller level, neighbouring fragments touch nearby texels, so there can be far fewer cache misses.
glTexImageXX creates a single level and uploads the pixel data to the GPU. When glGenerateMipmap is called, the driver allocates GPU memory for the additional levels and most likely runs one or more GPU jobs to fill them. In some cases it may do the same work on the CPU and then upload the data to the GPU. So all levels normally reside in GPU memory; the saving comes from which level is sampled, not from uploading only the 4x4 level.
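To make the size difference between levels concrete, here is a minimal sketch that lists the levels of a mipmap chain and their memory footprint. It assumes an uncompressed format with a fixed number of bytes per texel and no driver padding or alignment; `MipLevel`, `mipChain` and `chainBytes` are hypothetical names, not OpenGL API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One level of a mipmap chain (hypothetical helper type for illustration).
struct MipLevel {
    std::uint32_t width;
    std::uint32_t height;
    std::size_t bytes;  // assumes tightly packed, uncompressed texels
};

// Builds the list of mip levels for a 2D texture, halving each dimension
// (rounding down, clamped to 1) until the 1x1 level is reached.
std::vector<MipLevel> mipChain(std::uint32_t width, std::uint32_t height,
                               std::size_t bytesPerTexel) {
    std::vector<MipLevel> levels;
    for (;;) {
        levels.push_back({width, height,
                          static_cast<std::size_t>(width) * height * bytesPerTexel});
        if (width <= 1 && height <= 1) break;
        width = std::max<std::uint32_t>(1, width / 2);
        height = std::max<std::uint32_t>(1, height / 2);
    }
    return levels;
}

// Total memory of the full chain in bytes.
std::size_t chainBytes(const std::vector<MipLevel>& levels) {
    std::size_t total = 0;
    for (const MipLevel& level : levels) total += level.bytes;
    return total;
}
```

For a 256x256 texture with 4 bytes per texel, level 0 takes 262,144 bytes while the 4x4 level (level 6) takes 64 bytes, and the whole chain is only about a third larger than level 0 alone.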

Related

Async SSBO Readback

When I call GetBufferSubData with my Shader Storage Buffer Object there is typically a 4ms delay. Is it possible for my application to do work during that time?
// start GetBufferSubData
// do client/app/CPU work
// (wait if needed)
// read results from GetBufferSubData
Or otherwise use some sort of API to asynchronously start copying buffer data from the GPU?
I was able to get an async readback working using glMapBufferRange and GL_MAP_PERSISTENT_BIT. However, when running a compute shader (multiple times back to back) on that buffer, this results in a massive performance degradation compared to no persistent mapping.
Simply marking the buffer with GL_MAP_PERSISTENT_BIT resulted in a substantial performance degradation (8x slower) when running a compute shader on that buffer (profiled using Nvidia Nsight Graphics). I suspect that, because of the mapping, OpenGL has to keep the buffer in memory that is slower for the GPU but more accessible to the CPU.
My solution was to create a much smaller, persistently mapped buffer (1000x smaller, 16 KB) that the CPU uses to read from and write to the larger buffer in small increments when needed. This combination was much faster, with only minor API overhead, and met my needs.
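The post does not show how the large buffer is walked through the small one. As a rough sketch of just the bookkeeping, assuming the transfer is split into pieces no larger than the staging buffer, the chunk plan could look like this (`Chunk` and `planChunks` are hypothetical names). In a real program each chunk would presumably be copied into the mapped staging buffer with something like glCopyBufferSubData and read once a fence (glFenceSync / glClientWaitSync) has signalled.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One copy step: `size` bytes starting at `offset` in the large buffer.
struct Chunk {
    std::size_t offset;
    std::size_t size;
};

// Splits a transfer of `totalBytes` into pieces no larger than the
// staging buffer. Returns an empty plan if either size is zero.
std::vector<Chunk> planChunks(std::size_t totalBytes, std::size_t stagingBytes) {
    std::vector<Chunk> chunks;
    if (stagingBytes == 0) return chunks;
    for (std::size_t offset = 0; offset < totalBytes; offset += stagingBytes) {
        chunks.push_back({offset, std::min(stagingBytes, totalBytes - offset)});
    }
    return chunks;
}
```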

How well do opengl drivers handle large texture arrays in limited VRAM

My game engine tries to allocate large texture arrays so that it can batch the majority (if not all) of its draws together. The array may become large enough that it fails to allocate, at which point I'd (repeatedly) split the texture array in half.
Is it bad design to push the boundaries until glGetError reports GL_OUT_OF_MEMORY and scale back from there?
Is my application a jerk because it's allocating huge chunks of VRAM, which may require swapping into GTT memory? As in, is it less ideal for the graphics driver to manage a few large texture arrays rather than many individual textures when dealing with other OS operations?
It is hard to evaluate how well drivers handle large texture arrays. Behaviour of different drivers may vary a lot.
While using a texture array can improve performance by reducing the number of draw calls, that should not be the main goal. Reducing draw calls matters somewhat on mobile platforms, and even there several dozen of them are not a problem. I'm not sure what your concerns are or what exactly you are trying to optimise, but I would recommend using the GPU vendor's profiling tools before doing any optimisation.
Is it bad design to push the boundaries until glGetError reports GL_OUT_OF_MEMORY and scale back from there?
This is what is typically done when data is loaded to the GPU dynamically. Once the error is received, old data should be unloaded to make room for the new data.
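The "scale back" loop from the question can be sketched independently of OpenGL. This is only an illustration under stated assumptions: `allocateWithBackoff` is a hypothetical helper, and `tryAllocate` stands for whatever the engine does to create the array and check glGetError() for GL_OUT_OF_MEMORY. Note that the OpenGL reference pages describe the GL state as undefined after GL_OUT_OF_MEMORY, so a real implementation should treat a failed attempt with care.

```cpp
#include <cstdint>
#include <functional>

// Tries to allocate a texture array with `layers` layers via `tryAllocate`,
// a hypothetical callback that returns true on success. On failure the layer
// count is halved and the attempt is repeated. Returns the layer count that
// succeeded, or 0 if not even a single layer could be allocated.
std::uint32_t allocateWithBackoff(
    std::uint32_t layers,
    const std::function<bool(std::uint32_t)>& tryAllocate) {
    while (layers > 0) {
        if (tryAllocate(layers)) return layers;
        layers /= 2;
    }
    return 0;
}
```

With a callback that only succeeds for 300 layers or fewer, a request for 1024 layers settles on 256 after two failed attempts.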
Is my application a jerk because it's allocating huge chunks of VRAM, which may require swapping into GTT memory?
There is no way to check whether data was swapped to GTT (if the driver supports GTT at all). The driver handles it on its own, and the OpenGL API gives no access to it. You may need profiling tools such as Nsight if you are using an NVIDIA GPU.
However, if you are planning to have one giant texture array, it must fit into VRAM as a whole; it cannot be partly in VRAM and partly in GTT. I would not recommend relying on GTT at all.
It must fit into VRAM because, when you bind it, the driver cannot know beforehand which layers will be used and which won't, since the selection happens in the shader.
Although texture arrays and 3D textures are conceptually different, at the hardware level they work very similarly; the difference is that the former is filtered in two dimensions and the latter in three.
I played with large 3D textures for a while. I experimented with a GeForce 1070 (it has 6GB), and it handles textures of ~1GB very well. The largest texture I managed to load was around 3GB (2048x2048x7**), but it often throws an error. Even though there should be enough free VRAM to fit the texture, allocating such a big chunk may fail for various reasons. So I would not recommend allocating textures whose size is comparable to the total VRAM unless it is absolutely necessary.

OpenGL program with Intel HD and NVidia GPU usage

I am new to OpenGL and I would like somebody to explain how the program uses the GPU.
I have an array of triangles (a class that contains 3 points). Here is the code that draws them (I know these functions are deprecated).
glBegin(GL_LINES);
for (int i = 0; i < trsize; ++i) {
    glVertex3d((GLdouble)trarr[i].p1().x(), (GLdouble)trarr[i].p1().y(), (GLdouble)trarr[i].p1().z());
    glVertex3d((GLdouble)trarr[i].p2().x(), (GLdouble)trarr[i].p2().y(), (GLdouble)trarr[i].p2().z());
    glVertex3d((GLdouble)trarr[i].p3().x(), (GLdouble)trarr[i].p3().y(), (GLdouble)trarr[i].p3().z());
}
glEnd();
I also use deprecated functions for rotating, transforming, etc.
When the size of the array is bigger than 50k, the program runs very slowly.
I tried using only the Intel HD or only the NVIDIA GTX 860M (the default NVIDIA program allows choosing the GPU), but both run very slowly. The Intel HD may even be a bit faster.
So why is there no difference between these two GPUs?
And will the program run faster if I use shaders?
The probable bottleneck is looping over the vertices: accessing the array and pulling out the vertex data 50,000 times per render, then sending the data to the GPU for rendering.
Using a VBO would indeed be faster: it reduces the cost of extracting the data and sending it to the GPU to a one-time cost at initialization.
Even a vertex array kept in user memory would speed things up, because instead of making 50k function calls you let the driver do a single memcpy of the relevant data.
When the size of the array is bigger than 50k, the program runs very slowly.
The major bottleneck when drawing in immediate mode is that all your vertices have to be transferred from your program's memory to GPU memory every frame. The bus between the CPU and the GPU is limited in the amount of data it can transfer, so the best guess is that 50k triangles are simply more than the bus can transport. Another problem is that the driver has to process all the commands you send it on the CPU, which can also be a big overhead.
So why is there no difference between these two GPUs?
There is (in general) a huge performance difference between an Intel HD GPU and an NVIDIA card, but the bus that feeds them might be the same.
And will the program run faster if I use shaders?
It will not benefit directly from the use of shaders, but it definitely will from storing the vertices once in GPU memory (see VBO/VAO). The second improvement is that you can render the whole VBO with a single draw call, which reduces the number of commands the CPU has to handle.
Seeing the same performance with two GPUs that have substantially different performance potential certainly suggests that your code is CPU limited. But I very much question some of the theories about the performance bottleneck in the other answers/comments.
Some simple calculations suggest that memory bandwidth should not come into play at all. With 50,000 triangles, with 3 vertices each, and 24 bytes per vertex, you're looking at 3,600,000 bytes of vertex data per frame. Say you're targeting 60 frames/second, this is a little over 200 MBytes/second. That's less than 1% of the memory bandwidth of a modern PC.
The most practical implementation of immediate mode on a modern GPU is for drivers to collect all the data into buffers, and then submit it all at once when a buffer fills up. So there's no need for a lot of kernel calls, and the data for each vertex is certainly not sent to the GPU separately.
Driver overhead is most likely the main culprit. With 50,000 triangles, and 3 API calls per triangle, this is 150,000 API calls per frame, or 9 million API calls/second if you're targeting 60 frames/second. That's a lot! For each of these calls, you'll have:
Looping and array accesses in your own code.
The actual function call.
Argument passing.
State management and logic in the driver code.
etc.
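The two estimates in this answer (vertex data volume and API call count) can be reproduced with a few compile-time constants. This is only a worked check of the arithmetic, using the figures stated in the text; the constant and function names are made up for the example.

```cpp
#include <cstdint>

// Figures taken from the text: 50,000 triangles, 3 vertices each,
// 3 double coordinates per vertex, 60 frames per second.
constexpr std::uint64_t kTriangles = 50000;
constexpr std::uint64_t kVerticesPerTriangle = 3;
constexpr std::uint64_t kBytesPerVertex = 3 * sizeof(double);  // 24 where double is 8 bytes
constexpr std::uint64_t kFramesPerSecond = 60;

constexpr std::uint64_t vertexBytesPerFrame() {
    return kTriangles * kVerticesPerTriangle * kBytesPerVertex;
}

constexpr std::uint64_t vertexBytesPerSecond() {
    return vertexBytesPerFrame() * kFramesPerSecond;
}

// One glVertex3d call per vertex.
constexpr std::uint64_t apiCallsPerFrame() {
    return kTriangles * kVerticesPerTriangle;
}

constexpr std::uint64_t apiCallsPerSecond() {
    return apiCallsPerFrame() * kFramesPerSecond;
}

static_assert(sizeof(double) == 8, "the estimate assumes 8-byte doubles");
static_assert(vertexBytesPerFrame() == 3600000, "3.6 MB of vertex data per frame");
static_assert(vertexBytesPerSecond() == 216000000, "216 MB of vertex data per second");
static_assert(apiCallsPerFrame() == 150000, "150,000 calls per frame");
static_assert(apiCallsPerSecond() == 9000000, "9 million calls per second");
```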
One important aspect that makes this much worse than it needs to be: You're using double values for your coordinates. That doubles the amount of data that needs to be passed around compared to using float values. And since the OpenGL vertex pipeline operates in single precision (*), the driver will have to convert all the values to float.
I suspect that you could get a significant performance improvement even with using the deprecated immediate mode calls if you started using float for all your coordinates (both your own storage, and for passing them to OpenGL). You could also use the version of the glVertex*() call that takes a single argument with a pointer to the vector, instead of 3 separate arguments. This would be glVertex3fv() for float vectors.
Making the move to VBOs is of course the real solution. It will reduce the number of API calls by orders of magnitude, and avoid any data copying as long as the vertex data does not change over time.
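As a minimal sketch of the data preparation this implies, the double-precision triangles can be flattened once into a tightly packed float array. `Point3d` and `Triangle` are hypothetical stand-ins for the asker's classes (which use accessor methods rather than public members), and `packVertices` is a made-up helper name.

```cpp
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical stand-ins for the asker's point and triangle classes.
struct Point3d { double x, y, z; };
struct Triangle { Point3d p1, p2, p3; };

// Flattens the triangles into a tightly packed float array:
// x, y, z for each vertex, three vertices per triangle.
std::vector<float> packVertices(const std::vector<Triangle>& triangles) {
    std::vector<float> out;
    out.reserve(triangles.size() * 9);
    for (const Triangle& t : triangles) {
        for (const Point3d* p : {&t.p1, &t.p2, &t.p3}) {
            out.push_back(static_cast<float>(p->x));
            out.push_back(static_cast<float>(p->y));
            out.push_back(static_cast<float>(p->z));
        }
    }
    return out;
}
```

The resulting array could then be uploaded once with glBufferData and drawn with a single glDrawArrays call per frame, or, while still in immediate mode, fed to glVertex3fv three floats at a time.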
(*) OpenGL 4.1 adds support for double vertex attributes, but they require the use of specific API functions, and only make sense if single precision floats really are not precise enough.

One big OpenGL vertex buffer, or many small ones?

Let's say I have 5 entities (objects) with a method Render(). Every entity needs to set its own vertices in a buffer for rendering.
Which of the following two options is better?
Use one big pre-allocated buffer created with glGenBuffers, which every entity uses (the buffer id is passed as an argument to the Render methods) by writing its vertices into it with glBufferSubData.
Every entity creates and uses its own buffer.
If one big buffer is better, how can I render all vertices in this buffer (from all entities) properly, with proper shaders and everything?
Having multiple VBOs is fine as long as they have a certain size. What you want to avoid is to have a lot of small draw calls, and to have to bind different buffers very frequently.
How large the buffers have to be to avoid excessive overhead depends on so many factors that it's almost impossible to even give a rule of thumb. Factors that come into play include:
Hardware performance characteristics.
Driver efficiency.
Number of vertices relative to number of fragments (triangle size).
Complexity of shaders.
Generally it can make sense to keep similar/related objects that you typically draw at the same time in a single vertex buffer.
Putting everything in a single buffer seems extreme, and could in fact have adverse effects. Say you have a large "world", where you only render a small subset in any given frame. If you go to the extreme and have all vertices in one giant buffer, that buffer needs to be accessible to the GPU for each draw call. Depending on the architecture, and how the buffer is allocated, this could mean:
Attempting to keep the buffer in dedicated GPU memory (e.g. VRAM), which could be problematic if it's too large.
Mapping the memory into the GPU address space.
Pinning/wiring the memory.
If any of the above needs to be applied to a very large buffer, but you end up using only a small fraction of it to render a frame, there is significant waste in these operations. In a system with VRAM, it could also prevent other allocations, like textures, from fitting in VRAM.
If rendering is done with calls that can only access a subset of the buffer given by the arguments, like glDrawArrays() or glDrawRangeElements(), it might be possible for the driver to avoid making the whole buffer GPU accessible. But I wouldn't necessarily count on that happening.
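On the question of how to draw several entities out of one shared buffer: one common approach is to give each entity a contiguous vertex range and draw that range with its own shader bound. The sketch below only covers the bookkeeping and assumes all entities share the same vertex layout; `VertexRange`, `layoutSharedBuffer` and `byteOffset` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Where one entity's vertices live inside the shared buffer.
struct VertexRange {
    std::size_t firstVertex;  // would be the `first` argument of glDrawArrays
    std::size_t vertexCount;  // would be the `count` argument of glDrawArrays
};

// Assigns each entity a contiguous range in a shared vertex buffer, in the
// order given. `vertexCounts[i]` is the number of vertices of entity i.
std::vector<VertexRange> layoutSharedBuffer(
    const std::vector<std::size_t>& vertexCounts) {
    std::vector<VertexRange> ranges;
    ranges.reserve(vertexCounts.size());
    std::size_t next = 0;
    for (std::size_t count : vertexCounts) {
        ranges.push_back({next, count});
        next += count;
    }
    return ranges;
}

// Byte offset at which a glBufferSubData-style upload of this range starts.
std::size_t byteOffset(const VertexRange& range, std::size_t bytesPerVertex) {
    return range.firstVertex * bytesPerVertex;
}
```

Each entity's Render() would then bind its shader and draw only its own range, so entities with different shaders can still share one buffer.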
It's easier to use one VBO (Vertex Buffer Object), created with glGenBuffers, for each entity, but that is not always the best thing to do; it depends on the use case. In most cases, though, having one VBO per entity is not a problem and rendering is rarely affected.
Good info is located at: OpenGL Vertex Specification Best Practices

Storing many small textures in OpenGL

I'm building an OpenGL app with many small textures. I estimate that I will have a few hundred textures on the screen at any given moment.
Can anyone recommend best practices for storing all these textures in memory so as to avoid potential performance issues?
I'm also interested in understanding how OpenGL manages textures. Will OpenGL try to store them into GPU memory? If so, how much GPU memory can I count on? If not, how often does OpenGL pass the textures from application memory to the GPU, and should I be worried about latency when this happens?
I'm working with OpenGL 3.3. I intend to use only modern features, i.e. no immediate mode stuff.
If you have a large number of small textures, you would be best off combining them into a single large texture with each of the small textures occupying known sub-regions (a technique sometimes called a "texture atlas"). Switching which texture is bound can be expensive, in that it will limit how much of your drawing you can batch together. By combining into one you can minimize the number of times you have to rebind. Alternatively, if your textures are very similarly sized, you might look into using an array texture (introduction here).
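As an illustration of the "known sub-regions" idea, here is a minimal sketch that computes the texture coordinates of one tile in an atlas. It assumes the simplest possible layout, a uniform grid of equally sized tiles; `UvRect` and `atlasTileUv` are hypothetical names.

```cpp
#include <cassert>

// Normalized texture coordinates of one tile inside the atlas.
struct UvRect {
    float u0, v0;  // corner with the smaller coordinates
    float u1, v1;  // corner with the larger coordinates
};

// Returns the UV rectangle of the tile at (column, row) in an atlas laid out
// as a uniform grid of `columns` x `rows` equally sized tiles.
UvRect atlasTileUv(int column, int row, int columns, int rows) {
    assert(columns > 0 && rows > 0);
    assert(column >= 0 && column < columns);
    assert(row >= 0 && row < rows);
    const float tileWidth = 1.0f / static_cast<float>(columns);
    const float tileHeight = 1.0f / static_cast<float>(rows);
    return {static_cast<float>(column) * tileWidth,
            static_cast<float>(row) * tileHeight,
            static_cast<float>(column + 1) * tileWidth,
            static_cast<float>(row + 1) * tileHeight};
}
```

In practice atlases usually leave some padding between tiles, because filtering and mipmapping can otherwise pull in texels from neighbouring tiles at the edges.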
OpenGL does try to store your textures in GPU memory insofar as possible, but I do not believe that it is guaranteed to actually reside on the graphics card.
The amount of GPU memory you have available will depend on the hardware you run on and the other demands on the system at the time you run. What exactly "GPU memory" means varies across machines: it can be discrete and used only by the GPU, shared with main memory, or some combination of the two.
Assuming your application is not constantly modifying the textures you should not need to be particularly concerned about latency issues. You will provide OpenGL with the textures once and from that point forward it will manage their location in memory. Assuming you don't need more texture data than can easily fit in GPU memory every frame, it shouldn't be cause for concern. If you do need to use a large amount of texture data, try to ensure that you batch all use of a certain texture together to minimize the number of round trips the data has to make. You can also look into the built-in texture compression facilities, supplying something like GL_COMPRESSED_RGBA to your call to glTexImage2D, see the man page for more details.
Of course, as always, your best bet will be to test these things yourself in a situation close to your expected use case. OpenGL provides a good number of guarantees, but much will vary depending on the particular implementation.