Reducing RAM usage with regard to Textures - opengl

Currently, my app uses a large amount of memory after loading textures (~200 MB).
I load each texture into a char buffer, pass it to OpenGL, and then free the buffer.
It would seem that this memory is used by OpenGL, which is doing its own texture management internally.
What measures could I take to reduce this?
Is it possible to prevent OpenGL from managing textures internally?

One typical solution is to keep track of which textures you need at a given camera position or time frame, and to load only those when you need them (as opposed to loading every single texture when the app starts). You will need a "manager" that controls the loading, unloading, and binding of each texture (e.g. a container that associates a string, the name of the texture, with the integer texture name generated by glGenTextures and bound with glBindTexture).
Another option is to reduce the overall quality/size of the textures you are using.
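A minimal sketch of such a manager, written in Python only to show the bookkeeping. The `load` and `unload` callbacks are hypothetical placeholders: in a real app they would wrap the actual texture upload (glGenTextures / glTexImage2D) and glDeleteTextures. The byte budget and the least-recently-used eviction policy are assumptions too, not something the answer prescribes.

```python
from collections import OrderedDict


class TextureCache:
    """Maps texture names to handles and evicts the least recently used."""

    def __init__(self, budget_bytes, load, unload):
        self.budget_bytes = budget_bytes
        self.load = load      # hypothetical callback: name -> (handle, size_in_bytes)
        self.unload = unload  # hypothetical callback: handle -> None
        self.entries = OrderedDict()  # name -> (handle, size_in_bytes)
        self.used_bytes = 0

    def get(self, name):
        """Return the handle for `name`, loading the texture on first use."""
        if name in self.entries:
            self.entries.move_to_end(name)  # mark as most recently used
            return self.entries[name][0]
        handle, size = self.load(name)
        self.entries[name] = (handle, size)
        self.used_bytes += size
        self._evict()
        return handle

    def _evict(self):
        """Unload the oldest textures until the budget is met.

        The most recently requested texture is always kept, even if it
        alone exceeds the budget.
        """
        while self.used_bytes > self.budget_bytes and len(self.entries) > 1:
            _, (handle, size) = self.entries.popitem(last=False)
            self.unload(handle)
            self.used_bytes -= size
```

The drawing code would call `cache.get("rock")` right before binding, so textures that stop being requested are the first ones to be unloaded when the budget is exceeded.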

It would seem that this memory is used by OpenGL,
Yes
which is doing its own texture management internally.
No, not texture management. It just needs to keep the data somewhere. On modern systems the GPU is shared by several processes running simultaneously, and not all of the data may fit into fast GPU memory, so the OpenGL implementation must be able to swap data out. The fast GPU memory is not storage; it's just another cache level, just as system memory is a cache for system storage.
Also GPUs may crash and modern drivers reset them in situ, without the user noticing. For this they need a full copy of the data as well.
Is it possible to prevent OpenGL from managing textures internally?
No, because this would either be tedious to do or break things. But what you can do is load only the textures you really need for drawing a given scene.
If you look through my writings about OpenGL, you'll notice that for years I have been telling people not to write silly things like "initGL" functions. Put everything into your drawing code. You'll go through a drawing scheduling phase anyway (you must sort translucent objects far-to-near, do frustum culling, etc.). That gives you the opportunity to check which textures you need, and to load them. You can even go as far as loading only the lower-resolution mipmap levels, so that a scene is initially shown with low detail, and loading the higher-resolution mipmaps in the background; this of course requires setting the minimum and maximum mip levels appropriately, as either texture or sampler parameters.
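To get a feeling for how much the "low-resolution mipmaps first" approach saves, here is a small arithmetic sketch. It assumes an uncompressed 2D texture with 4 bytes per texel (e.g. RGBA8) and a full mip chain down to 1x1; the function names are made up for illustration.

```python
def mip_level_bytes(width, height, level, bytes_per_texel=4):
    """Size in bytes of one mip level of an uncompressed 2D texture."""
    w = max(1, width >> level)
    h = max(1, height >> level)
    return w * h * bytes_per_texel


def mip_chain_bytes(width, height, first_level=0, bytes_per_texel=4):
    """Total bytes for mip levels first_level..last of the full chain."""
    last_level = max(width, height).bit_length() - 1  # floor(log2(max dim))
    return sum(
        mip_level_bytes(width, height, level, bytes_per_texel)
        for level in range(first_level, last_level + 1)
    )


full = mip_chain_bytes(1024, 1024)                    # 5592404 bytes
coarse = mip_chain_bytes(1024, 1024, first_level=2)   # 349524 bytes
```

For a 1024x1024 texture, loading only level 2 and below costs about 1/16 of the full chain, so a scene can be shown with coarse textures at a small fraction of the memory while the remaining levels are streamed in.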

Related

Blitting surfaces in OpenGL

Both SDL and Game Maker have the concept of surfaces: images that you may modify on the fly and display. I'm using OpenGL 1 and I'd like to know if OpenGL has this concept of a surface.
The only ways I came up with were:
Every frame, create / destroy a new texture based on needs.
Every frame, update said texture based on needs.
These approaches don't seem very performant, but I see no alternative. Maybe this is how surfaces are implemented in the mentioned engines.
Yes, these two are the ways you would do it in OpenGL 1.0. I don't think there are any other means as far as the 1.0 spec is concerned.
Link : https://www.opengl.org/registry/doc/glspec10.pdf
Do note that textures are stored in device (GPU) memory, which is fast to access for shading, and the above approaches copy data between host (CPU) memory and device memory. Hence the performance hit is the cost of the host-device copy.
Why are you limited to the OpenGL 1.0 spec? If you can go higher, you start getting more options:
Use GLSL shaders to read from one texture and write the edited result to another texture. Processing is done on the GPU, and a device-device copy is as fast as it gets.
Use CUDA: map a texture to a CUDA array and use your kernel to modify the content. Or use OpenCL for non-NVIDIA cards.
These are the better options as long as the modification can be executed in parallel.
I would suggest trying the CPU copy method first, as it might be fast enough for your needs. Host-device copies are getting faster with the latest hardware. You might be able to get real-time 60 fps or higher even with this copy, unless you plan to do this for a lot of textures.
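To judge whether the per-frame copy is affordable, a back-of-the-envelope estimate of the upload traffic helps. The sketch below assumes an uncompressed texture with 4 bytes per texel that is re-uploaded in full every frame; the function name is made up.

```python
def upload_bytes_per_second(width, height, fps, bytes_per_texel=4):
    """Host-to-device traffic when the whole texture is re-uploaded each frame."""
    return width * height * bytes_per_texel * fps


# One 512x512 RGBA surface updated at 60 fps: 62914560 bytes/s (60 MiB/s).
traffic = upload_bytes_per_second(512, 512, 60)
```

Multiply by the number of surfaces you update per frame and compare the result with the transfer rate you actually measure on your target hardware.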

What is the difference between clearing the framebuffer using glClear and simply drawing a rectangle to clear the framebuffer?

I think at least some old graphics drivers used to crash if glClear wasn't used, and glClear is probably faster in a lot of cases, but why? How are 3D graphics drivers usually implemented such that these two approaches would give different results?
On a high level, it can be faster because the OpenGL implementation knows ahead of time that the whole buffer needs to be set to the same color/value. The more you know about what exactly needs to be done, the more you can take advantage of possible accelerations.
Let's say setting a whole buffer to the same value is more efficient than setting the same pixels to variable values. With a glClear(), you know already that all pixels will have the same value. If you draw a screen sized quad with a fragment shader that emits a constant color, the driver would either have to recognize that situation by analyzing the shaders, or the system would have to compare the values coming out of the shader, to know that all pixels have the same value.
The reason why setting everything to the same value can be more efficient has to do with framebuffer compression and related technologies. GPUs often don't actually write each pixel out to the framebuffer, but use various kinds of compression schemes to reduce the memory bandwidth needed for framebuffer writes. If you imagine almost any kind of compression, all pixels having the same value is very favorable.
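As a toy illustration of why "everything has the same value" is cheap, consider a buffer split into tiles where each tile carries a "cleared" flag. This is a deliberately simplified model written for this answer, not a description of how any particular GPU or driver works; all names are made up.

```python
class TiledBuffer:
    """Toy framebuffer split into tiles, each with a 'cleared' flag."""

    def __init__(self, num_tiles, pixels_per_tile):
        self.pixels_per_tile = pixels_per_tile
        self.tiles = [[0] * pixels_per_tile for _ in range(num_tiles)]
        self.cleared = [False] * num_tiles
        self.clear_value = 0
        self.pixel_writes = 0  # counts writes to per-pixel storage

    def clear(self, value):
        """Record the clear per tile; per-pixel storage is not touched."""
        self.clear_value = value
        self.cleared = [True] * len(self.tiles)

    def fill_by_drawing(self, value):
        """Write every pixel, as drawing a full-screen quad would."""
        for index in range(len(self.tiles)):
            self.cleared[index] = False
            self.tiles[index] = [value] * self.pixels_per_tile
            self.pixel_writes += self.pixels_per_tile

    def read(self, tile_index, pixel_index):
        """A cleared tile answers from the flag instead of pixel storage."""
        if self.cleared[tile_index]:
            return self.clear_value
        return self.tiles[tile_index][pixel_index]
```

In this model `clear()` costs one flag per tile, while `fill_by_drawing()` costs one write per pixel, even though both leave the buffer reading back the same value everywhere.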
To give you some ideas about the published vendor specific technologies, here are a few sources. You can probably find more with a search.
Article talking about new framebuffer compression method in relatively recent AMD cards: http://techreport.com/review/26997/amd-radeon-r9-285-graphics-card-reviewed/2.
NVIDIA patent on zero bandwidth clears: http://www.google.com/patents/US8330766.
Blurb on ARM web site about Mali framebuffer compression: http://www.arm.com/products/multimedia/mali-technologies/arm-frame-buffer-compression.php.
Why is it faster? Because it is a function that bypasses most calculations that other types of drawings have to go through.
Alpha function, blend function, logical operation, stenciling, texture mapping, and depth-buffering are ignored by glClear
Source
Why do some drivers crash without it? It's hard to say, but it probably has something to do with the implementation details of OpenGL. The function does what it's supposed to do, but it might do more that you don't know about.
OpenGL might infer from this function call other tasks that it needs to perform.

Large 3D scene streaming

I'm working on a 3D engine suitable for very large scene display.
Apart from the rendering itself (frustum culling, occlusion culling, etc.), I'm wondering what the best solution for scene management is.
Data is given as a huge list of 3D meshes, with no relation between them, so I don't think I can generate portals.
The main goal is to be able to run this engine on systems with low RAM (500 MB-1 GB), and the scenes loaded into it are very large and can contain millions of triangles, which leads to very intensive memory usage. I'm currently working with a loose octree, constructed at load time. It works well on small and medium scenes, but many scenes are just too huge to fit entirely in memory, so here comes my question:
How would you load and unload chunks of a scene dynamically (and ideally seamlessly), and what criteria would you use to decide whether a chunk should be loaded or unloaded? If needed, I can create a custom file format, as scenes are exported using a custom exporter from known 3D authoring tools.
Important information: Many scenes can't be effectively occluded, because of their construction.
Example: a very large pipe network, where there isn't much occlusion but there is a very high number of elements.
I think that the best solution will be a "solution pack", a pack of different techniques.
Level of detail (LOD) can reduce the memory footprint if unused levels are not loaded. Levels can be switched more or less seamlessly by alpha-blending between the old and the new detail. The easiest controller uses the mesh's distance to the camera.
Free the host memory (RAM) once the object has been uploaded to the GPU (device), and obviously free all unused memory (OpenGL resources too). Valgrind can help you with this one.
Use low-quality meshes and use tessellation to increase visual quality.
Use VBO indexing; this should reduce VRAM usage and increase performance.
Don't use meshes if possible; terrain can be rendered using heightmaps. Some things can be procedurally generated.
Use bump and/or normal maps. This will improve quality, so you can reduce the vertex count.
Divide those "pipes" into different meshes.
Fake 3D meshes with 2D images: impostors, skydomes...
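The "mesh distance to camera" controller from the LOD item above can be sketched as follows. The threshold list is a hypothetical tuning parameter, and level 0 is assumed to be the most detailed mesh.

```python
import math


def select_lod(camera, position, thresholds):
    """Return the LOD index for a mesh; 0 is the most detailed level.

    thresholds: ascending distances at which the next coarser level starts.
    """
    distance = math.dist(camera, position)
    for level, limit in enumerate(thresholds):
        if distance < limit:
            return level
    return len(thresholds)  # beyond the last threshold: coarsest level
```

With `thresholds=[10, 50]`, a mesh 5 units away gets level 0, one 30 units away gets level 1, and anything at 50 units or more gets level 2. Only the levels that are currently selected need to be resident in memory.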
If most of the RAM is going to be used by textures, there are commercial packages available, such as the GraniteSDK, that offer seamless LOD-based texture streaming using a virtual texture cache. See http://graphinesoftware.com/granite . Alternatively you can look at http://ir-ltd.net/
In fact you can use the same technique to construct polygons on the fly from texture data in the shader, but it's going to be a bit more complicated.
For voxels there are techniques to construct octrees entirely in GPU memory and page in/out the parts you really need. The rendering can then be done using raycasting. See this post: Use octree to organize 3D volume data in GPU , http://www.icare3d.org/research/GTC2012_Voxelization_public.pdf and http://www.cse.chalmers.se/~kampe/highResolutionSparseVoxelDAGs.pdf
It comes down to how static the scene is going to be and, following from that, how well you can pre-bake the data according to your visualization needs. It would already help if you can determine visibility constraints up front (e.g. google "potentially visible sets") and organize the data so that you can stream it on request. Since the visualizer will have limits, you always end up with a strategy to fit a section of the data into GPU memory as quickly and accurately as possible.
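One simple streaming criterion is the distance from the camera to each chunk, with separate load and unload radii so that chunks near the boundary are not loaded and unloaded over and over. This is only a sketch under that assumption; the chunk table, the radii, and the function name are hypothetical, and a real engine would probably combine it with visibility data.

```python
import math


def update_resident_chunks(camera, chunk_centers, resident,
                           load_radius, unload_radius):
    """Return (to_load, to_unload) chunk ids for the current camera position.

    chunk_centers: dict of chunk id -> (x, y, z) center.
    resident: set of chunk ids currently in memory.
    unload_radius should be larger than load_radius (hysteresis).
    """
    to_load, to_unload = [], []
    for chunk_id, center in chunk_centers.items():
        distance = math.dist(camera, center)
        if chunk_id not in resident and distance <= load_radius:
            to_load.append(chunk_id)
        elif chunk_id in resident and distance > unload_radius:
            to_unload.append(chunk_id)
    return to_load, to_unload
```

The caller would run this once per frame (or every few frames), queue the loads on a background thread, and free the chunks in `to_unload`.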

What is "GPU Cache" from an OpenGL/DirectX programmer's perspective?

A Maya promo video explains how GPU Cache helps the user by making the application run faster. In frameworks like Cinder we redraw all the geometry we want in the scene on each frame update, sending it to the video card. So I wonder what is behind GPU caching from a programmer's perspective. What OpenGL/DirectX APIs are behind such technology? How do I "cache" my mesh in GPU memory?
There is, to my knowledge, no way in OpenGL or DirectX to directly specify what is to be, and not to be, stored and tracked on the GPU cache. There are however methodologies that should be followed and maintained in order to make best use of the cache. Some of these include:
Batch, batch, batch.
Upload data directly to the GPU
Order indices to maximize vertex locality across the mesh.
Keep state changes to a minimum.
Keep shader changes to a minimum.
Keep texture changes to a minimum.
Use maximum texture compression whenever possible.
Use mipmapping whenever possible (to maximize texel sampling locality)
It is also important to keep in mind that there is no single GPU cache. There are multiple (vertex, texture, etc.) independent caches.
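To make "order indices to maximize vertex locality" measurable, you can simulate a post-transform vertex cache and count misses per triangle (often called the average cache miss ratio). The FIFO policy and the cache size below are assumptions for the simulation; real hardware differs.

```python
from collections import deque


def average_cache_miss_ratio(indices, cache_size=16):
    """Vertex-cache misses per triangle for an indexed triangle list,
    simulated with a FIFO cache holding `cache_size` vertex indices."""
    cache = deque(maxlen=cache_size)
    misses = 0
    for index in indices:
        if index not in cache:
            misses += 1
            cache.append(index)  # the oldest entry drops out when full
    return misses / (len(indices) // 3)


scattered = [0, 1, 2, 10, 11, 12, 2, 1, 3, 12, 11, 13]
reordered = [0, 1, 2, 2, 1, 3, 10, 11, 12, 12, 11, 13]
```

With a deliberately tiny 3-entry cache, the scattered order misses 3.0 times per triangle and the reordered one 2.0 times, for exactly the same four triangles.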
Sources:
OpenGL SuperBible - Memory Bandwidth and Vertices
GPU Gems - Graphics Pipeline Performance
GDC 2012 - Optimizing DirectX Graphics
First off, the "GPU cache" terminology that Maya uses probably refers to optimizing a mesh for device-independent storage and rendering in Maya. For card manufacturers the notion of a "GPU cache" is different (in this case it means something more like the L1 or L2 CPU caches).
To answer your final question: using OpenGL terminology, you generally create vertex buffer objects (VBOs). These will store the data on the card. Then, when you want to draw, you can simply instruct the card to use those buffers.
This will avoid the overhead of copying the mesh data from main (CPU) memory into graphics (GPU) memory. If you need to draw the mesh many times without changing the mesh data, it performs much better.

Per-line texture processing accelerated with OpenGL/OpenCL

I have a rendering step which I would like to perform on a dynamically-generated texture.
The algorithm can operate on rows independently in parallel. For each row, the algorithm will visit each pixel in left-to-right order and modify it in situ (no distinct output buffer is needed, if that helps). Each pass uses state variables which must be reset at the beginning of each row and persist as we traverse the columns.
Can I set up OpenGL shaders, or OpenCL, or whatever, to do this? Please provide a minimal example with code.
If you have access to GL 4.x-class hardware that implements EXT_shader_image_load_store or ARB_shader_image_load_store, I imagine you could pull it off. Otherwise, in-situ read/write of an image is generally not possible (though there are ways with NV_texture_barrier).
That being said, once you start wanting pixels to share state the way you do, you kill off most of your potential gains from parallelism. If the value you compute for a pixel is dependent on the computations of the pixel to its left, then you cannot actually execute each pixel in parallel. Which means that the only parallelism your algorithm actually has is per-row.
That's not going to buy you much.
If you really want to do this, use OpenCL. It's much friendlier to this kind of thing.
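The structure of the computation looks like the sketch below, written in Python only to show where the parallelism is. The per-pixel operation (a running decayed sum) is a made-up stand-in for your algorithm. In OpenCL each row would map to one work-item; the thread pool here just mirrors that layout and is not meant to be fast.

```python
from concurrent.futures import ThreadPoolExecutor


def process_row(row, decay=0.5):
    """Left-to-right pass over one row; state is reset at the row start."""
    state = 0.0  # per-row state variable
    out = []
    for value in row:
        state = state * decay + value  # depends on the pixel to the left
        out.append(state)
    return out


def process_image(image, workers=4):
    """Rows are independent, so they are the only unit of parallelism."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_row, image))
```

For example, `process_image([[1.0, 1.0, 1.0], [2.0, 0.0, 0.0]])` gives `[[1.0, 1.5, 1.75], [2.0, 1.0, 0.5]]`. An image with N rows offers at most N-way parallelism, no matter how wide it is.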
Yes, you can do it. No, you don't need 4.X hardware for that, you need fragment shaders (with flow control), framebuffer objects and floating point texture support.
You need to encode your data into 2D texture.
Store "state variable" in 1st pixel for each row, and encode the rest of the data into the rest of the pixels. It goes without saying that it is recommended to use floating point texture format.
Use two framebuffers, and render them onto each other in a loop using a fragment shader that updates the "state variable" in the first column and performs whatever operation you need on another column, which is the "current" one. To reduce the amount of wasted resources you can limit rendering to the columns you want to process. The NVidia OpenGL SDK examples had "game of life", "GPGPU fluid", and "GPU particles" demos that work in a similar fashion: by encoding data into a texture and then using shaders to update it.
However, just because you can do it doesn't mean you should, and it doesn't mean that it is guaranteed to be fast. Some GPUs might have a very high texture memory read speed but relatively slow computation speed (and vice versa), and not all GPUs have many pipelines for processing things in parallel.
Also, depending on your app, CUDA or OpenCL might be more suitable.