A Maya promo video explains how the GPU Cache makes the application run faster for the user. In frameworks like Cinder we redraw all the geometry we want in the scene on every frame update and send it to the video card. So I wonder: what is behind GPU caching from a programmer's perspective? What OpenGL/DirectX APIs are behind such technology? How do I "cache" my mesh in GPU memory?
There is, to my knowledge, no way in OpenGL or DirectX to directly specify what is and is not stored and tracked in the GPU cache. There are, however, guidelines you can follow to make the best use of the cache. Some of these include:
Batch, batch, batch.
Upload data directly to the GPU.
Order indices to maximize vertex locality across the mesh.
Keep state changes to a minimum.
Keep shader changes to a minimum.
Keep texture changes to a minimum.
Use maximum texture compression whenever possible.
Use mipmapping whenever possible (to maximize texel sampling locality).
It is also important to keep in mind that there is no single GPU cache. There are multiple (vertex, texture, etc.) independent caches.
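For example, one common way to honor the "keep state/shader/texture changes to a minimum" guidelines above is to sort draw calls before submitting them, so that work sharing the same shader and texture is issued back to back. A minimal C++ sketch; the struct and field names are illustrative, not from any particular API:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-draw record; the names are illustrative only.
struct DrawCall {
    unsigned shaderId;   // shader program to bind
    unsigned textureId;  // texture to bind
    unsigned vboId;      // vertex buffer to source from
};

// Sort so that draws sharing a shader (and then a texture) are adjacent,
// minimizing the number of state changes during submission.
void sortForSubmission(std::vector<DrawCall>& draws) {
    std::sort(draws.begin(), draws.end(),
              [](const DrawCall& a, const DrawCall& b) {
                  if (a.shaderId  != b.shaderId)  return a.shaderId  < b.shaderId;
                  if (a.textureId != b.textureId) return a.textureId < b.textureId;
                  return a.vboId < b.vboId;
              });
}
```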
Sources:
OpenGL SuperBible - Memory Bandwidth and Vertices
GPU Gems - Graphics Pipeline Performance
GDC 2012 - Optimizing DirectX Graphics
First off, the "GPU cache" terminology that Maya uses probably refers to graphics data that is simply stored on the card, i.e. to optimizing a mesh for device-independent storage and rendering in Maya. For card manufacturers, the notion of a "GPU cache" is different (there it means something more like the L1 or L2 CPU caches).
To answer your final question: in OpenGL terminology, you generally create vertex buffer objects (VBOs). These store the data on the card. Then, when you want to draw, you simply instruct the card to use those buffers.
This avoids the overhead of copying the mesh data from main (CPU) memory into graphics (GPU) memory on every draw. If you need to draw the mesh many times without changing the mesh data, it performs much better.
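A minimal sketch of that pattern in C++/OpenGL, assuming an active GL 2.0+ context, an extension loader such as GLEW, and a bound shader program with a vertex attribute at location 0; the function names are placeholders:

```cpp
#include <GL/glew.h>  // assumption: GLEW (or another loader) and a live GL 2.0+ context

// Called once at load time: copy the mesh data into a buffer that lives in GPU memory.
GLuint createMeshVBO(const GLfloat* vertices, GLsizeiptr byteSize) {
    GLuint vbo = 0;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, byteSize, vertices, GL_STATIC_DRAW);  // STATIC: data won't change
    return vbo;
}

// Called every frame: no vertex data crosses the bus, the GPU reads its own copy.
void drawMesh(GLuint vbo, GLsizei vertexCount) {
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(GLfloat), (const void*)0);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}
```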
I've been learning OpenGL, and as I sit trying to write my VBOs, PBOs, VAOs, textures, quads, bindings, fragment shaders, vertex shaders, and a whole suite of other modern abstractions upon abstractions built after decades of evolution, I wonder: Isn't the display nothing but a large block of memory?
I've heard tales that in the "good ol' days" (such as the Commodore 64), all you had to do was assign a value to an arbitrary byte in memory and the screen would change a pixel. Extremely simple and elegant. In the modern day, this has changed with layers upon layers of abstractions and safeguards, such that changing a pixel on your display feels several hundred feet away.
This begs the question: is it possible in the modern day to just "update a pixel of the screen"? Is it possible to write my own graphics driver or something, where I can send commands to some C wrapper which interfaces with the GPU to change those pixels? This is an extremely broad question, but I'm curious. The answer I'm looking for would provide a rough outline of what you'd have to do in order to be able to arbitrarily get some C code to set a pixel on the screen, as well as a rough outline of why OpenGL has progressed the way it has - what problems did VBOs, PBOs, VAOs, bindings, shaders, etc. solve, and how we got to where we are today.
Isn't the display nothing but a large block of memory?
Yes, it is called a framebuffer.
I've heard of tales, that in the "good ol' days" (such as the Commodore 64)
Your current PC works like that right when you power it up! If you use the CPU to write into video memory, that is called a software renderer.
In the modern day, this has changed with layers upon layers of abstractions and safeguards, such that changing a pixel on your display is several hundred feet away.
No, they are not abstractions/safeguards for "changing pixels". Nowadays software renderers are not used anymore. Instead, you have to tell the GPU (which is another computer on its own) how to draw. That "talk" is what the APIs (like OpenGL) do for you.
Now, the GPUs are meant to be fast at drawing, and that requires specialized code and data structures. Those are all the things you mention: VBOs, PBOs, VAOs, shaders, etc. (in OpenGL parlance). There is no way around that, because GPUs are different hardware.
is it possible in the modern day to just "update a pixel of the screen"?
Yes, but that will end up being drawn somehow by the GPU, even if it looks to you like a memory write.
Is it possible to write my own graphics driver or something, where I can send commands to some C wrapper which interfaces with the GPU to change those pixels?
Yes, but that "C wrapper" is the graphics driver. A graphics driver for a modern GPU is very complex.
what you'd have to do in order to be able to arbitrarily get some C code to set a pixel on the screen
You cannot write a "C program" to write to a graphical screen because the C standard does not concern itself with graphical displays.
So it depends on your operating system, your hardware, whether you want 2D or 3D acceleration support, the API you choose...
as well as a rough outline of why OpenGL has progressed the way it has - what problems did VBOs, PBOs, VAOs, bindings, shaders, etc. solve, and how we got to where we are today.
See above.
You can make your own frame buffer - that is just an integer array - and do rasterization on it, then use for example the Windows GDI function SetBitmapBits() to draw it to the display in one go. The final draw-to-display command depends on the operating system.
How you do the rasterization on your framebuffer is completely up to you. You can use the CPU to draw individual pixels or rasterize lines and triangles, see for example this demo of my old CPU graphics engine using Windows GDI: https://youtu.be/GFzisvhtRS4.
Using the CPU is fine as long as you do not rasterize large datasets. In my experience, the limit for real-time 60fps rendering on the CPU is ~50k lines per frame.
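As a rough illustration of this kind of CPU rasterization, here is a minimal C++ sketch that draws a line into an integer-array framebuffer with Bresenham's algorithm; the final blit to the display (e.g. via GDI, as mentioned above) is omitted since it depends on the operating system:

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

const int W = 800, H = 600;
std::vector<uint32_t> framebuffer(W * H, 0xFF000000);  // 0xAARRGGBB pixels, cleared to opaque black

void set_pixel(int x, int y, uint32_t color) {
    if (x >= 0 && x < W && y >= 0 && y < H)
        framebuffer[y * W + x] = color;  // the "framebuffer" really is just an array write
}

// Bresenham line rasterization on the CPU.
void draw_line(int x0, int y0, int x1, int y1, uint32_t color) {
    int dx = std::abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -std::abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;
    for (;;) {
        set_pixel(x0, y0, color);
        if (x0 == x1 && y0 == y1) break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
}
```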
If you want to rasterize really large datasets, you have to use a GPU in some way. Since the framebuffer is just an integer array, you can transfer it to/from the GPU using OpenCL or CUDA and - if your dataset happens to already be in video memory - do all the rasterization on the GPU extremely fast in parallel. For this you will need an additional z-buffer to decide which pixels may be overdrawn by occluding geometry. This way you can rasterize approximately 30 million lines per frame at 60fps. This demo is rendered on the GPU in real time using OpenCL: https://youtu.be/lDsz2maaZEo
Is it possible in the modern day to just "update a pixel of the screen"?
Yes. In Windows for example, you can use SetPixel() to draw a pixel or BitBlt() to draw in bulk. See this Q/A
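A minimal sketch of the SetPixel() route, drawing directly on the desktop device context; note that anything drawn this way is overwritten as soon as the desktop repaints:

```cpp
#include <windows.h>

int main() {
    HDC screen = GetDC(NULL);  // device context for the whole screen
    for (int x = 0; x < 100; ++x)
        SetPixel(screen, 100 + x, 100, RGB(255, 0, 0));  // a short red horizontal line
    ReleaseDC(NULL, screen);
    return 0;
}
```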
This works fine, but it means you're using the CPU for rendering, and you'll find the GPU is much more effective for this task, especially if you require a decent framerate and non-trivial graphics. The reason for this "whole suite of other modern abstractions upon abstractions" is to serve as an interface to the GPU, since it has its own memory and a totally different execution model. Other GPU APIs (OpenCL, DirectX, Vulkan, etc.) all have the same kind of abstractions.
I've glossed over many nuances but I hope the point gets across.
Both SDL and Game Maker have the concept of surfaces: images that you can modify on the fly and display. I'm using OpenGL 1 and I'd like to know if OpenGL has this concept of a surface.
The only ways that I came up with were:
Every frame, create/destroy a new texture based on needs.
Every frame, update said texture based on needs.
These approaches don't seem very performant, but I see no alternative. Maybe this is how they are implemented in the engines mentioned.
Yes, these two are the ways you would do it in OpenGL 1.0. I don't think there are any other means as far as the 1.0 spec is concerned.
Link: https://www.opengl.org/registry/doc/glspec10.pdf
Do note that textures are stored in device (GPU) memory, which is fast to access for shading, while the approaches above copy data between host (CPU) memory and device memory. Hence the performance hit is the cost of the host-device copy.
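A minimal sketch of the second approach (updating the same texture every frame); it uses glTexSubImage2D, which strictly speaking arrived in OpenGL 1.1 rather than 1.0, and assumes an active GL context:

```cpp
#include <GL/gl.h>  // assumption: an active OpenGL 1.1+ context

// One-time setup: reserve texture storage on the device.
GLuint createSurfaceTexture(GLsizei width, GLsizei height) {
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);  // NULL: allocate only, no upload yet
    return tex;
}

// Every frame: overwrite the pixels in place; the cost is just the host-to-device copy.
void updateSurfaceTexture(GLuint tex, GLsizei width, GLsizei height, const void* pixels) {
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, pixels);
}
```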
Why are you limited to the OpenGL 1.0 spec? If you can go higher, you start getting more options:
Use GLSL shaders to read content from one texture and write the modified result to another texture. Processing will be done on the GPU, and a device-to-device copy is as fast as it gets.
Use CUDA. Map a texture to a CUDA array, use your kernel to modify the content. Or use OpenCL for non-NVIDIA cards.
This would be the better scenario; as long as the modification can be executed in parallel, it will benefit.
I would suggest trying the CPU copy method, as it might be fast enough for your needs. The host-device copy is getting faster with the latest hardware. You might be able to get real-time 60fps or higher even with this copy, unless you plan to do this for a lot of textures.
I have an object type which renders (correctly) a texture onto a 2D mesh depending on rotation (to simulate 3D). However, it is quite slow to load/bind a new texture image for each view. Disabling the view-dependent texture loading results in very quick performance.
Buffering all views/textures of the object may not be a good option: it could contain on the order of 720 views (separate images), each of which may be 600x1000 pixels. There are no guarantees about end-user system specs either, and this is a peripheral application.
Are there any good intermediate OpenGL suggestions between loading textures on demand and buffering all view textures at once?
This is where it would be useful to have a texture cache: you load the lowest-resolution mip levels of all 720 images. These would be your 1x1, 2x2, and so on, images.
As you detect changes in the view, you update the texture cache, prioritizing the most recently used textures so that the one currently in view has the highest priority and the ones that haven't been used for a long time have the lowest.
As textures increase in priority, you bring in their higher-detail mip levels and rebind the textures when they finish loading. The texture cache would load them asynchronously in a separate thread and then notify your main thread when they can be prepared, as that needs to happen on the thread that owns the GL context.
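A rough C++ sketch of the bookkeeping such a texture cache might keep; the names are illustrative, and the asynchronous loader itself is omitted:

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical cache entry for one of the 720 views.
struct CachedView {
    uint32_t textureId;        // GL texture name holding whatever mips are resident
    int      coarsestResident; // coarsest mip level currently loaded (lower = finer detail)
    uint64_t lastUsedFrame;    // priority key: most recently viewed = highest priority
};

std::unordered_map<int, CachedView> cache;  // view index -> cache entry
uint64_t currentFrame = 0;                  // incremented once per frame elsewhere

// Call this for the view that just came on screen. The background loader can then
// stream finer mips for the entries with the newest lastUsedFrame and evict the stalest.
void touchView(int viewIndex) {
    cache[viewIndex].lastUsedFrame = currentFrame;
}
```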
There are some other ways of doing this with newer extensions like Partially Resident Textures from AMD, but that extension has some limitations that make it a bit cumbersome to use.
If the rotation is smooth and slow, you can stream the data from disk depending on the view and prefetch data for the surrounding views.
If you can afford lossy compression, you can keep lots of data in RAM with aggressive compression and then move some of it to VRAM (with DXT/BC compression if possible).
You should check these articles:
J.M.P. van Waveren. Real-time texture streaming & decompression. Intel Software Network, 2006.
J.M.P. van Waveren. Geospatial texture streaming from slow storage devices. Intel Software Network, 2008.
J.M.P. van Waveren. id Tech 5 challenges: from texture virtualization to massive parallelization. SIGGRAPH Talk, 2009.
Currently, my app is using a large amount of memory after loading textures (~200 MB).
I am loading the textures into a char buffer, passing it along to OpenGL and then killing the buffer.
It would seem that this memory is used by OpenGL, which is doing its own texture management internally.
What measures could I take to reduce this?
Is it possible to prevent OpenGL from managing textures internally?
One typical solution is to keep track of which textures you need at a given camera position or time frame, and only load those when you need them (as opposed to loading every single texture when the app starts). You will need a "manager" which controls the loading/unloading and binding of the respective texture numbers, e.g. a container which associates a string (the name of the texture) with the integer texture name used with glBindTexture.
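A minimal sketch of such a manager in C++; the class and method names are hypothetical, and the actual image decoding is left out:

```cpp
#include <string>
#include <unordered_map>
#include <GL/gl.h>

// Hypothetical texture manager: maps a texture name to its GL texture object,
// loading it on first use and freeing the GPU copy when it is no longer needed.
class TextureManager {
public:
    GLuint acquire(const std::string& name) {
        auto it = textures.find(name);
        if (it != textures.end()) return it->second;  // already loaded
        GLuint id = 0;
        glGenTextures(1, &id);
        glBindTexture(GL_TEXTURE_2D, id);
        // ... decode the image file for 'name' and upload it with glTexImage2D here ...
        textures[name] = id;
        return id;
    }

    void release(const std::string& name) {
        auto it = textures.find(name);
        if (it == textures.end()) return;
        glDeleteTextures(1, &it->second);  // lets the driver free its copy of the data
        textures.erase(it);
    }

private:
    std::unordered_map<std::string, GLuint> textures;  // texture name -> GL texture object
};
```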
Another option is to reduce the overall quality/size of the textures you are using.
It would seem that this memory is used by OpenGL,
Yes
which is doing its own texture management internally.
No, not texture management. It just needs to keep the data somewhere. On modern systems the GPU is shared by several processes running simultaneously, and not all of the data may fit into fast GPU memory, so the OpenGL implementation must be able to swap data out. The fast GPU memory is not storage; it's just another cache level, in the same way that system memory is a cache for system storage.
Also GPUs may crash and modern drivers reset them in situ, without the user noticing. For this they need a full copy of the data as well.
Is it possible to prevent OpenGL from managing textures internally?
No, because this would either be tedious to do or break things. But what you can do is load only the textures you really need for drawing a given scene.
If you look through my writings about OpenGL, you'll notice that for years I have been telling people not to write silly things like "initGL" functions. Put everything into your drawing code. You go through a drawing scheduling phase anyway (you must sort translucent objects far to near, do frustum culling, etc.). That gives you the opportunity to check which textures you need and to load them. You can even go as far as loading only lower-resolution mipmap levels, so that when a scene is initially shown it has low detail, and load the higher-resolution mipmaps in the background; this of course requires the minimum and maximum mip levels to be set appropriately as either a texture or sampler parameter.
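A small sketch of that mip-level clamping in OpenGL; GL_TEXTURE_BASE_LEVEL requires OpenGL 1.2+ headers or a loader, and the level numbers here are just an example:

```cpp
#include <GL/gl.h>  // assumption: OpenGL 1.2+; 'tex' already holds its coarse mip levels (4..N)

// While only the coarse mip levels 4..N are resident, clamp sampling to them.
void useCoarseMipsOnly(GLuint tex) {
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, 4);
}

// After the background loader has uploaded levels 0..3, unlock full detail.
void enableFullDetail(GLuint tex) {
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, 0);
}
```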
We have a two-screen DirectX application that previously ran at a consistent 60 FPS (the monitors' sync rate) using an NVIDIA 8400GS (256 MB). However, when we swapped out the card for one with 512 MB of RAM, the frame rate struggles to get above 40 FPS. (It only gets this high because we're using triple-buffering.) The two cards are from the same manufacturer (PNY). All other things are equal: this is a Windows XP Embedded application and we started from a fresh image for each card. The driver version number is 169.21.
The application is all 2D, i.e. just a bunch of textured quads and a whole lot of pre-rendered graphics (hence the need to upgrade the card's memory). We also have compressed animations which the CPU decodes on the fly; this involves a texture lock. The locks take forever, but I've also tried having a separate system-memory texture for the CPU to update and then updating the rendered texture using the device's UpdateTexture method. No overall difference in performance.
Although I've read through every FAQ I can find on the internet about DirectX performance, this is still the first time I've worked on a DirectX project so any arcane bits of knowledge you have would be useful. :)
One other thing while I'm on the subject: when calling Present on the swap chains, it seems DirectX waits for the present to complete regardless of the fact that I'm using D3DPRESENT_DONOTWAIT in both the present parameters (PresentationInterval) and the flags of the call itself. Because this is a two-screen application this is a problem, as the two monitors do not appear to be genlocked. I'm working around it by running the Present calls through a threadpool. What could the underlying cause of this be?
Are the cards exactly the same (both GeForce 8400GS), with only the memory size differing? Quite often, different memory sizes come with slightly different clock rates (i.e. your card with more memory might use slower memory!).
So the first thing to check would be GPU core & memory clock rates, using something like GPU-Z.
It's an easy test to see if the surface lock is the problem: just comment out the texture update and see if the framerate returns to 60 Hz. Unfortunately, writing to a locked surface and updating the resource kills performance; it always has. Are you using mipmaps with the textures? I know DX9 added automatic generation of mipmaps; it could be taking a lot of time to generate those. If you're constantly locking the same resource each frame, you could also try creating a pool of textures, kind of like triple-buffering except with textures. You would let the render use one texture, and on the next update you pick the next available texture in the pool that isn't being used to render. Unless, of course, you're memory-constrained or you're only making diffs to the animated texture.
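A rough sketch of that texture-pool idea with Direct3D 9, assuming the pool textures were created with D3DUSAGE_DYNAMIC in D3DPOOL_DEFAULT so that D3DLOCK_DISCARD is legal; the names and pool size are illustrative:

```cpp
#include <d3d9.h>
#include <cstring>

const int POOL_SIZE = 3;                       // like triple-buffering, but for textures
IDirect3DTexture9* g_pool[POOL_SIZE] = { 0 };  // created elsewhere with D3DUSAGE_DYNAMIC
int g_writeIndex = 0;

// Each frame, write the decoded animation frame into the next texture in the pool,
// so the CPU never locks the texture the GPU may still be reading from.
void updateAnimatedTexture(IDirect3DDevice9* device, const void* frameData,
                           int width, int height, int bytesPerPixel) {
    IDirect3DTexture9* tex = g_pool[g_writeIndex];
    g_writeIndex = (g_writeIndex + 1) % POOL_SIZE;

    D3DLOCKED_RECT lr;
    if (SUCCEEDED(tex->LockRect(0, &lr, NULL, D3DLOCK_DISCARD))) {
        const unsigned char* src = static_cast<const unsigned char*>(frameData);
        unsigned char* dst = static_cast<unsigned char*>(lr.pBits);
        for (int y = 0; y < height; ++y)       // copy row by row, honoring the surface pitch
            std::memcpy(dst + y * lr.Pitch,
                        src + y * width * bytesPerPixel,
                        width * bytesPerPixel);
        tex->UnlockRect(0);
    }
    device->SetTexture(0, tex);  // render this frame from the freshly written texture
}
```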