OpenGL vs DirectX, memory used when loading shaders

I have an application which has many different shaders, and has both an OpenGL and DirectX 11 implementation. Other than language differences (between GLSL and HLSL), the shaders are identical on both platforms. In DirectX, the shaders are compiled offline using fxc (with the /Fh option), and compiled into the executable. In OpenGL, the text of the shader is embedded into the executable, and is compiled at runtime. It then will be saved as binary with glGetProgramBinary (if the platform is capable, which it is on the test case I have), and reloaded on subsequent runs with glProgramBinary.
I have noticed that in OpenGL, the total memory used by my application (as reported in procexp) grows by an average of 49kb per call to glProgramBinary. The increase does not seem to be directly related to the size of the program's binary: some calls increase memory by roughly the size of the binary, whereas others raise it by many times that size. Also, the binary sizes seem quite large, being ~12kb even for the simplest programs.
Conversely, in DirectX 11, the equivalent calls to ID3D11Device::CreateVertexShader and ID3D11Device::CreatePixelShader do increase the reported memory usage, but only by an average of 6kb per pair (again, the increase seems not directly proportional to the size of the binary). Also, the shader binary blobs compiled offline seem to be drastically smaller, with the smallest being less than 2kb.
Is there some way to reduce the amount of memory associated with shaders in OpenGL, such that it's more in line with the memory usage in DirectX? Or is this a reporting problem (DirectX is actually using the memory, but it doesn't show up in the process's count)?
(Platform is NVidia 373.34, on Windows 10).

Related

Identical code using more than double the RAM memory on different computers

I'm creating a Minecraft clone in C++ with OpenGL.
I noticed today that when debugging the program on my laptop, the RAM usage is way higher than the RAM usage on my desktop PC (~1.3gb vs ~500mb). I'm getting these memory numbers from Visual Studio's diagnostics tools.
I'm using GitHub and even with the same branch, same commit, literally the exact same code, the laptop uses more RAM. I tried cleaning the solution, rebuilding, cloning again, nothing works.
The memory usage is different on the Windows Task Manager, too.
I'm out of ideas of what could be happening. The computers are on different platforms (laptop is Intel 10th, and desktop is Ryzen 3000), the laptop has less RAM (8gb vs 16gb). Both are using the latest Windows 10. I'm using Visual Studio Community 2019.
I'm not sure if a platform difference could cause such a huge impact on memory allocation.

Many laptop architectures use something called unified memory. That is to say, there is only one big pool of memory that is shared between the CPU and GPU (or the equivalent portions on an APU).
On such architectures, allocating video memory is essentially the same thing as allocating RAM. It's all hidden away by the graphics drivers though.
So a graphics-heavy application using more RAM on a laptop than on a desktop with a discrete GPU is not surprising. However, it's not so much that it uses more memory, just that the memory it uses gets tabulated differently.
Assuming both platforms run at the same resolution and the same assets are loaded, you'd expect GPU Memory + RAM usage on desktop would be roughly equivalent to RAM usage on the laptop.
Emphasis on the word roughly. Different graphic architectures/drivers use memory differently, so don't expect a 1-to-1 match here. For example:
A single 1080p framebuffer takes a few megabytes at the minimum. Depending on how the driver interacts with the actual screen, how many of these are around is rarely obvious.
Tiled architectures can completely bypass needing large chunks of memory altogether.
That's the most likely scenario here.
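To put a rough number on the framebuffer point, here is a minimal sketch. It assumes an uncompressed RGBA8 buffer (4 bytes per pixel) with no driver padding, and the function names are made up for illustration. Real drivers may use other formats and keep an unknown number of such buffers around.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for one uncompressed color buffer.
// bytesPerPixel = 4 assumes RGBA8; this is an assumption, not something
// the driver guarantees.
constexpr std::uint64_t framebufferBytes(std::uint32_t width,
                                         std::uint32_t height,
                                         std::uint32_t bytesPerPixel = 4) {
    return static_cast<std::uint64_t>(width) * height * bytesPerPixel;
}

// Total for `bufferCount` such buffers (e.g. a double- or triple-buffered
// swap chain). The actual count is driver-dependent and rarely obvious.
inline std::uint64_t swapChainBytes(std::uint32_t width,
                                    std::uint32_t height,
                                    std::uint32_t bufferCount,
                                    std::uint32_t bytesPerPixel = 4) {
    assert(bufferCount > 0);
    return framebufferBytes(width, height, bytesPerPixel) * bufferCount;
}
```

For 1920x1080 this gives 8,294,400 bytes (about 7.9 MiB) per buffer, so a handful of buffers already accounts for tens of megabytes that show up as RAM on a unified-memory machine.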

How well do opengl drivers handle large texture arrays in limited VRAM

My game engine tries to allocate large texture arrays so that it can batch the majority (if not all) of its draws together. This array may become large enough that it fails to allocate, at which point I'd (repeatedly) split the texture array in halves.
Is it bad design to push the boundaries until receiving a glGetError:Out of memory and scale back from there?
Is my application a jerk because it's allocating huge chunks of VRAM, which may require swapping into GTT memory? As in, is it less ideal for the graphics driver to be manipulating a few large texture arrays rather than many individual textures when dealing with other OS operations?

It is hard to evaluate how well drivers handle large texture arrays. Behaviour of different drivers may vary a lot.
While using a texture array can improve performance by reducing the number of draw calls, that should not be the main goal. Reducing draw calls is somewhat important on mobile platforms, and even there, several dozen of them are not a problem. I'm not sure about your concerns and what exactly you are trying to optimise, but I would recommend using the profiling tools from your GPU vendor before doing any optimisation.
Is it bad design to push the boundaries until receiving a glGetError:Out of memory and scale back from there?
This is what is typically done when data is dynamically loaded to the GPU. Once the error is received, old data should be unloaded to make room for the new data.
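The "push until it fails, then scale back" approach from the question could be sketched roughly like this. The callback tryAllocate is hypothetical: it stands in for whatever code creates a texture array with the given layer count and reports whether glGetError() returned GL_OUT_OF_MEMORY. Nothing here is taken from a real engine.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Tries to cover `totalLayers` layers with as few texture arrays as possible.
// `tryAllocate(layers)` is a hypothetical callback: it should create one
// texture array with that many layers and return false if the driver
// reported GL_OUT_OF_MEMORY.
// Returns the layer count of every array that was created. If the counts
// sum to less than totalLayers, even a one-layer array failed to allocate.
std::vector<int> allocateInChunks(int totalLayers,
                                  const std::function<bool(int)>& tryAllocate) {
    assert(totalLayers > 0);
    std::vector<int> chunks;
    int chunk = totalLayers;   // start optimistic: everything in one array
    int remaining = totalLayers;
    while (remaining > 0 && chunk > 0) {
        const int want = chunk < remaining ? chunk : remaining;
        if (tryAllocate(want)) {
            chunks.push_back(want);
            remaining -= want;
        } else {
            chunk = want / 2;  // scale back and retry with half the size
        }
    }
    return chunks;
}
```

With a driver that accepts at most 300 layers per array, a request for 1000 layers ends up as four arrays of 250. One caveat worth keeping in mind: the OpenGL specification leaves the GL state undefined after GL_OUT_OF_MEMORY, so how gracefully this recovers in practice is up to the driver.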
Is my application a jerk because it's allocating huge chunks of VRAM, which may require swapping into GTT memory?
There is no way to check if data was swapped to GTT or not (if driver supports GTT at all). The driver handles it on its own, and there is no access to that from OpenGL API. You may need to use profiling tools like Nsight, if you are using a GPU from NVidia.
However, if you are planning to have one giant texture array, it must fit into VRAM as a whole; it cannot be partially in VRAM and partially in GTT. I would not recommend relying on GTT at all.
It must fit into VRAM because, when you bind it, the driver cannot know beforehand which layers will be used and which won't, since the selection happens in the shader.
Although a texture array and a 3D texture are conceptually different, at the hardware level they work very similarly; the difference is that the first filters in two dimensions and the second in three.
I was playing with large 3D textures for a while. I did experiments with a GeForce 1070 (it has 6GB), and it handles ~1GB textures very well. The largest texture I managed to load was around 3GB (2048x2048x7**), but loading it often fails with an error. Even when there should be enough free VRAM to fit the texture, the driver may fail to allocate such a big chunk for various reasons. So I would not recommend allocating textures that are comparable to the total size of VRAM unless it is absolutely necessary.

OpenGL program with Intel HD and NVidia GPU usage

I am new to OpenGL and I want somebody to explain to me how the program uses the GPU.
I have an array of triangles (a class that contains 3 points). Here is the code that draws them (I know these functions are deprecated).
glBegin(GL_LINES);
for (int i = 0; i < trsize; ++i) {
    glVertex3d((GLdouble)trarr[i].p1().x(), (GLdouble)trarr[i].p1().y(), (GLdouble)trarr[i].p1().z());
    glVertex3d((GLdouble)trarr[i].p2().x(), (GLdouble)trarr[i].p2().y(), (GLdouble)trarr[i].p2().z());
    glVertex3d((GLdouble)trarr[i].p3().x(), (GLdouble)trarr[i].p3().y(), (GLdouble)trarr[i].p3().z());
}
glEnd();
And I also use deprecated functions for rotating, transforming, etc.
When the size of the array is bigger than 50k, the program runs very slowly.
I tried to use only the Intel HD or only the NVidia GTX 860M (the default NVidia program allows choosing the GPU), but both run very slowly. Maybe the Intel HD is even a bit faster.
So, why is there no difference between these two GPUs?
And will the program run faster if I use shaders?

The probable bottleneck is looping over the vertices, accessing the array and pulling out the vertex data 50000 times per render, then sending the data to the GPU for rendering.
Using a VBO would indeed be faster: it reduces the cost of extracting the data and sending it to the GPU to a one-time cost at initialization.
Even using a user memory buffer would speed it up, because you won't be making 50k function calls; the driver can just do a memcpy of the relevant data.

When the size of the array is bigger than 50k, the program runs very slowly.
The major bottleneck when drawing in immediate mode is that all your vertices have to be transferred in every frame from your program's memory to the GPU memory. The bus between GPU and CPU is limited in the amount of data it can transfer, so the best guess is that 50k triangles are simply more than the bus can transport. Another problem is that the driver has to process all the commands you send it on the CPU, which can also be a big overhead.
So, why is there no difference between these two GPUs?
There is (in general) a huge performance difference between the Intel HD card and an NVIDIA card, but the bus connecting them to the CPU might be the same.
And will the program run faster if I use shaders?
It will not benefit directly from the use of shaders, but definitely from storing the vertices once in GPU memory (see VBO/VAO). The second improvement is that you can render the whole VBO using only one draw call, which decreases the number of instructions the CPU has to handle.
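As a rough illustration of "storing the vertices once", the CPU-side step is to pack all triangle corners into one contiguous array. The Point and Triangle structs below are stand-ins for the asker's classes (assumed, not taken from the question), and the GL calls are only mentioned in comments since they need a live context.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { float x, y, z; };
struct Triangle { Point a, b, c; };  // stand-in for the asker's triangle class

// Packs all triangle corners into one contiguous float array
// (x, y, z per vertex, three vertices per triangle).
std::vector<float> flattenTriangles(const std::vector<Triangle>& tris) {
    std::vector<float> out;
    out.reserve(tris.size() * 9);
    for (const Triangle& t : tris) {
        const Point corners[3] = {t.a, t.b, t.c};
        for (const Point& p : corners) {
            out.push_back(p.x);
            out.push_back(p.y);
            out.push_back(p.z);
        }
    }
    assert(out.size() == tris.size() * 9);
    return out;
}

// With a GL context, the result would be uploaded once at initialization, e.g.
//   glBufferData(GL_ARRAY_BUFFER, buf.size() * sizeof(float), buf.data(), GL_STATIC_DRAW);
// and then drawn every frame with a single call such as
//   glDrawArrays(GL_TRIANGLES, 0, vertexCount);
```

The per-frame loop over 50k triangles disappears: the flattening runs once, and each frame only issues the draw call.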

Seeing the same performance with two GPUs that have substantially different performance potential certainly suggests that your code is CPU limited. But I very much question some of the theories about the performance bottleneck in the other answers/comments.
Some simple calculations suggest that memory bandwidth should not come into play at all. With 50,000 triangles, with 3 vertices each, and 24 bytes per vertex, you're looking at 3,600,000 bytes of vertex data per frame. Say you're targeting 60 frames/second, this is a little over 200 MBytes/second. That's less than 1% of the memory bandwidth of a modern PC.
The most practical implementation of immediate mode on a modern GPU is for drivers to collect all the data into buffers, and then submit it all at once when a buffer fills up. So there's no need for a lot of kernel calls, and the data for each vertex is certainly not sent to the GPU separately.
Driver overhead is most likely the main culprit. With 50,000 triangles, and 3 API calls per triangle, this is 150,000 API calls per frame, or 9 million API calls/second if you're targeting 60 frames/second. That's a lot! For each of these calls, you'll have:
Looping and array accesses in your own code.
The actual function call.
Argument passing.
State management and logic in the driver code.
etc.
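The back-of-the-envelope numbers in this answer can be reproduced with a small helper. It assumes, as in the question, one glVertex3d call per vertex and three doubles (24 bytes) per vertex; the struct and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

struct FrameCost {
    std::uint64_t callsPerFrame;
    std::uint64_t callsPerSecond;
    std::uint64_t bytesPerFrame;
    std::uint64_t bytesPerSecond;
};

// Estimates API call count and vertex data volume for immediate mode,
// assuming one call per vertex and three doubles per vertex.
inline FrameCost estimateImmediateModeCost(std::uint64_t triangles,
                                           std::uint64_t fps) {
    assert(fps > 0);
    const std::uint64_t verticesPerTriangle = 3;
    const std::uint64_t bytesPerVertex = 3 * sizeof(double);  // 24 bytes
    FrameCost c{};
    c.callsPerFrame = triangles * verticesPerTriangle;
    c.callsPerSecond = c.callsPerFrame * fps;
    c.bytesPerFrame = c.callsPerFrame * bytesPerVertex;
    c.bytesPerSecond = c.bytesPerFrame * fps;
    return c;
}
```

For 50,000 triangles at 60 frames/second this yields 150,000 calls and 3,600,000 bytes per frame, i.e. 9,000,000 calls and 216,000,000 bytes per second. The data volume is small, the call count is not.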
One important aspect that makes this much worse than it needs to be: You're using double values for your coordinates. That doubles the amount of data that needs to be passed around compared to using float values. And since the OpenGL vertex pipeline operates in single precision (*), the driver will have to convert all the values to float.
I suspect that you could get a significant performance improvement even with using the deprecated immediate mode calls if you started using float for all your coordinates (both your own storage, and for passing them to OpenGL). You could also use the version of the glVertex*() call that takes a single argument with a pointer to the vector, instead of 3 separate arguments. This would be glVertex3fv() for float vectors.
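A minimal sketch of the storage change, assuming the points are currently kept as three doubles each. Vec3d and Vec3f are hypothetical types, not the asker's classes. Keeping the floats in an array member means the pointer can be passed straight to glVertex3fv().

```cpp
#include <cassert>
#include <vector>

struct Vec3d { double x, y, z; };  // assumed current storage
struct Vec3f { float v[3]; };      // layout that glVertex3fv() can read directly

// Converts the vertex data to single precision once, up front,
// instead of letting the driver convert on every call.
std::vector<Vec3f> toFloatVertices(const std::vector<Vec3d>& in) {
    std::vector<Vec3f> out;
    out.reserve(in.size());
    for (const Vec3d& p : in) {
        out.push_back(Vec3f{{static_cast<float>(p.x),
                             static_cast<float>(p.y),
                             static_cast<float>(p.z)}});
    }
    assert(out.size() == in.size());
    return out;
}

// In the draw loop each vertex then becomes one call with one pointer argument:
//   glVertex3fv(vertices[i].v);
```

On typical platforms each vertex shrinks from 24 to 12 bytes, and the per-call argument passing drops from three values to one pointer.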
Making the move to VBOs is of course the real solution. It will reduce the number of API calls by orders of magnitude, and avoid any data copying as long as the vertex data does not change over time.
(*) OpenGL 4.1 adds support for double vertex attributes, but they require the use of specific API functions, and only make sense if single precision floats really are not precise enough.

What is texture memory, allocated with OpenGL, limited by?

I'm making a 2D game with OpenGL. Something I'm concerned with is texture memory consumption. 2D games use a few orders of magnitude more texture memory than 3D games, most of that coming from animation frames and backgrounds.
Obviously there's a limit to how much texture memory a program can allocate, but what determines that limit? Does the limit come from available general memory to the program, or is it limited by how much video memory is available on the GPU? Can it use swap space at all? What happens when video memory is exhausted?

OpenGL's memory model is very abstract. Up to and including version 2.1 there were two kinds of memory: fast "server" memory and slow "client" memory. But there are no limits that could be queried in any way. When an OpenGL implementation runs out of "server" (= GPU) memory, it may start swapping or just report "out of memory" errors.
OpenGL-3 mostly did away with the two different kinds of memory (OpenGL-4 finished that job). There's just "memory", and the limits are quite arbitrary and depend on the OpenGL implementation (= GPU + driver) the program is running on. All OpenGL implementations are perfectly capable of swapping out textures that have not been used in a while. So the only situation where you would run out of memory would be an attempt to create a very large texture. More recent GPUs are in fact capable of swapping parts of textures in and out on an as-needed basis. Things will get slow, but keep working.
Obviously there's a limit to how much texture memory a program can allocate, but what determines that limit?
The details of the system's OpenGL implementation and, depending on that, the amount of memory installed in the system.

Storing many small textures in OpenGL

I'm building an OpenGL app with many small textures. I estimate that I will have a few hundred textures on the screen at any given moment.
Can anyone recommend best practices for storing all these textures in memory so as to avoid potential performance issues?
I'm also interested in understanding how OpenGL manages textures. Will OpenGL try to store them into GPU memory? If so, how much GPU memory can I count on? If not, how often does OpenGL pass the textures from application memory to the GPU, and should I be worried about latency when this happens?
I'm working with OpenGL 3.3. I intend to use only modern features, i.e. no immediate mode stuff.

If you have a large number of small textures, you would be best off combining them into a single large texture with each of the small textures occupying known sub-regions (a technique sometimes called a "texture atlas"). Switching which texture is bound can be expensive, in that it will limit how much of your drawing you can batch together. By combining into one you can minimize the number of times you have to rebind. Alternatively, if your textures are very similarly sized, you might look into using an array texture (introduction here).
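To make the atlas idea concrete, here is a minimal sketch of looking up a sub-region. It assumes the simplest possible layout, a uniform grid of equally sized tiles filled row by row; the names UVRect and atlasTileUV are made up for this example, and a real atlas may pack differently sized textures and need padding between tiles to avoid filtering bleed.

```cpp
#include <cassert>

struct UVRect { float u0, v0, u1, v1; };

// Returns the normalized texture coordinates of tile `index` in an atlas
// laid out as a uniform grid of `cols` x `rows` equally sized tiles
// (an assumed layout), counted row by row starting at the first row.
inline UVRect atlasTileUV(int index, int cols, int rows) {
    assert(cols > 0 && rows > 0);
    assert(index >= 0 && index < cols * rows);
    const float tileW = 1.0f / static_cast<float>(cols);
    const float tileH = 1.0f / static_cast<float>(rows);
    const int col = index % cols;
    const int row = index / cols;
    return UVRect{col * tileW, row * tileH,
                  (col + 1) * tileW, (row + 1) * tileH};
}
```

Each sprite then uses its UVRect for its quad's texture coordinates while the same atlas texture stays bound for the whole batch.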
OpenGL does try to store your textures in GPU memory insofar as possible, but I do not believe that it is guaranteed to actually reside on the graphics card.
The amount of GPU memory you have available will depend on the hardware you run on and the other demands on the system at the time you run. What exactly "GPU memory" means will vary across machines: it can be discrete and used only by the GPU, shared with main memory, or some combination of the two.
Assuming your application is not constantly modifying the textures you should not need to be particularly concerned about latency issues. You will provide OpenGL with the textures once and from that point forward it will manage their location in memory. Assuming you don't need more texture data than can easily fit in GPU memory every frame, it shouldn't be cause for concern. If you do need to use a large amount of texture data, try to ensure that you batch all use of a certain texture together to minimize the number of round trips the data has to make. You can also look into the built-in texture compression facilities, supplying something like GL_COMPRESSED_RGBA to your call to glTexImage2D, see the man page for more details.
Of course, as always, your best bet will be to test these things yourself in a situation close to your expected use case. OpenGL provides a good number of guarantees, but much will vary depending on the particular implementation.
