Is there any difference when I allocate multiple small SSBOs for use in compute shaders, versus one big SSBO internally mapped to many arrays?
By difference I mean the read/write performance. Does the GPU memory even care about the SSBO partitioning, or does it handle everything uniformly?
Here is an example in shader:
layout (std430, binding=1) buffer bufferA
{ int elementsA[]; };
layout (std430, binding=2) buffer bufferB
{ int elementsB[]; };
...
// versus a single big buffer:
layout (std430, binding=1) buffer buffers
{
int elementsA[MAXCOUNT_A];
int elementsB[MAXCOUNT_B];
...
};
One big buffer would avoid the need for many allocations on the CPU side and result in cleaner code, leaving the memory partitioning to the shader code. Of course I'd need to specify a maximum size for each array representing a buffer, which might result in needless memory allocation. I am, however, more concerned about the runtime access speed.
Is this merging even a good practice? Right now my code ends up with lots of small buffer allocations, which is kind of ugly :D
The GPU does care about what type of data storage you use. You must first ask yourself why you need SSBOs in general. SSBO data may be stored in global memory on the GPU, whereas UBO data typically sits in on-chip memory, access to which is much faster. I would use SSBOs only for really HUGE amounts of data, where your application cannot live within the UBO block size limits.
Now, regarding your question: you must try it and profile. It is hard to tell whether you're going to gain by using several buffers or just one. But I would go for one long buffer, as it requires less bookkeeping, takes fewer binding slots, and will probably perform faster due to the spatial locality of the data in video memory. I leave the actual test to you.
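For illustration, here is a minimal sketch of the single-allocation approach, assuming hypothetical byte sizes sizeA and sizeB: one large SSBO is created once and two sub-ranges of it are bound to the binding points the shader from the question already declares. The offset of the second range has to respect GL_SHADER_STORAGE_BUFFER_OFFSET_ALIGNMENT.
// Minimal sketch (not from the question): one big SSBO, two binding points.
// sizeA/sizeB are hypothetical byte sizes for elementsA and elementsB.
#include <GL/glew.h>

GLuint createSharedSsbo(GLsizeiptr sizeA, GLsizeiptr sizeB)
{
    GLint align = 0;
    glGetIntegerv(GL_SHADER_STORAGE_BUFFER_OFFSET_ALIGNMENT, &align);

    // Round the start of the second region up to a legal binding offset.
    GLsizeiptr offsetB = ((sizeA + align - 1) / align) * align;

    GLuint ssbo = 0;
    glGenBuffers(1, &ssbo);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
    glBufferData(GL_SHADER_STORAGE_BUFFER, offsetB + sizeB, nullptr, GL_DYNAMIC_DRAW);

    // Expose the two regions under binding points 1 and 2 from the shader.
    glBindBufferRange(GL_SHADER_STORAGE_BUFFER, 1, ssbo, 0, sizeA);
    glBindBufferRange(GL_SHADER_STORAGE_BUFFER, 2, ssbo, offsetB, sizeB);
    return ssbo;
}
This keeps the two-block GLSL declaration unchanged while making only one allocation; whether it actually beats separate buffers is exactly what the profiling above should tell you.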
Related
I use a rendering loop as follows:
Orphan the data and map the buffer.
Record the commands and write the generated vertices into the buffer.
Unmap the buffer.
Iterate over the commands that can change states, bind textures or draw.
At the moment I use a single interleaved vertex format (AoS) that has all the attributes any of my shaders can use.
struct OneSizeFitAllVertex
{
float pos[3];
float uv0[2];
float uv1[2];
float col[4];
};
When using a simpler shader that only uses the position and color, for example, I would only write the attributes I care about into the mapped memory, and the shader code would simply ignore all the unused attributes.
Because this feels wasteful, I am considering using a different vertex format for each of my shaders.
Simple objects rendered using the simple shader would use SimpleVertex:
struct SimpleVertex
{
float pos[3];
float col[4];
};
While other, multi-textured objects would be rendered using the multitexture shader and MultitextureVertex:
struct MultitextureVertex
{
float pos[3];
float uv0[2];
float uv1[2];
};
How should I handle these different formats?
Should I write vertices of different formats into the same mapped buffer and change my AttribPointers before drawing? This would save some space.
Should I map a different buffer for each vertex format? Perhaps more efficient.
Or should I keep the 'one size fits all' vertex format? This is easier.
I am curious to learn what the best practice is in this situation.
Thanks.
There can be a lot of variations based on your underlying system architecture, but let's assume you're using a discrete GPU (e.g., AMD or NVIDIA) with dedicated graphics memory.
You don't mention if you interleave your attributes in an array of structures (AoS), perhaps something like the following:
struct Vertex {
float position[3];
float normal[3];
float uv[2];
...
};
or group similar attributes together (commonly called a structure of arrays, or SoA):
struct VertexAttributes {
float positions[N][3];
float normals[N][3];
float uv[N][2];
...
};
This is relevant given your buffer-mapping approach. When you map an entire buffer, the GPU likely needs to copy its version of the buffer to the CPU, and the driver provides you a pointer to update the values. When you unmap the buffer, the driver will copy the buffer back to the GPU. With the AoS layout and your technique, you'll touch small subsections of the entire buffer, and since the driver doesn't know which bits of memory you've updated, its only recourse is to copy the entire thing back to the GPU. Depending on the size, this can have a significant impact at several levels (poor CPU cache-read utilization, consuming lots of CPU-to-GPU bus bandwidth, etc.). Unfortunately, there aren't good alternatives if you're only updating a small fraction of your vertex attributes. However, if you're updating all of the attributes, this approach is okay (although it's often recommended to use glBufferSubData or similar commands, which saves the read-back from the GPU to the CPU).
Conversely, if you're using the SoA approach, mapping the entire buffer will cause similar problems, but the situation can be better. Since the values for a specific attribute are contiguous in memory, you can use something like glMapBufferRange to only map the memory you need (but again, I'd still recommend using glBufferSubData for the previously stated reasons). Given your current scenario, this is what I'd recommend.
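As an illustration, here is a minimal sketch (not from the question's code) that updates only the normals region of the SoA layout above with glBufferSubData, assuming the buffer was created at its full size beforehand:
// Minimal sketch: update only the normals region of an SoA vertex buffer.
// Offsets follow the VertexAttributes layout above; vbo and N are assumed given.
#include <GL/glew.h>
#include <cstddef>

void updateNormals(GLuint vbo, std::size_t N, const float* normals)
{
    // positions occupy the first N*3 floats, normals the next N*3 floats.
    GLintptr   offset = static_cast<GLintptr>(N * 3 * sizeof(float));
    GLsizeiptr size   = static_cast<GLsizeiptr>(N * 3 * sizeof(float));

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferSubData(GL_ARRAY_BUFFER, offset, size, normals);
}
Only the touched region crosses the bus, and nothing has to be read back from the GPU.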
Are there any benefits of having separate vertex buffers for static and dynamic objects in a DirectX 11 application? My approach is to have the vertices of all objects in a scene stored in the same vertex buffer.
However, I will only have to re-map a small number of objects (1 to 5) of the whole collection (up to 200 objects). The majority of objects are static and will not be transformed in any way. What is the best approach for doing this?
Mapping a big vertex buffer with DISCARD forces the driver to allocate new memory every frame. Up to ~4 frames can be in flight, so there can be up to 4 copies of that buffer. This can lead to memory overcommitment and stuttering. For example, AMD advises discarding only vertex buffers of up to 4 MB (GCN Performance Tweets). Besides, every time you would also needlessly copy the static data into the new buffer.
Mapping with NO_OVERWRITE should work better. It requires you to manage the memory manually, so that you don't overwrite data that is still in flight. I'm not sure about the exact performance implications, but this certainly isn't the simplest path.
The best approach is to simplify the driver's work by providing as many hints as possible: create static vertex buffers with the IMMUTABLE flag, long-lived ones with the DEFAULT flag, and dynamic ones with the DYNAMIC flag. See vendor guides like GCN Performance Tweets or Don't Throw it all Away: Efficient Buffer Management for additional details.
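As a rough illustration of those hints, here is a minimal sketch (names, sizes, and error handling are placeholders, not from the question):
// Minimal sketch: usage hints for static vs. frequently updated vertex data.
#include <d3d11.h>

void createBuffers(ID3D11Device* device,
                   UINT staticSize, const void* staticVertices, ID3D11Buffer** staticVB,
                   UINT dynamicSize, ID3D11Buffer** dynamicVB)
{
    // Static geometry: filled once at creation, never touched by the CPU again.
    D3D11_BUFFER_DESC staticDesc = {};
    staticDesc.ByteWidth = staticSize;
    staticDesc.Usage     = D3D11_USAGE_IMMUTABLE;
    staticDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
    D3D11_SUBRESOURCE_DATA init = { staticVertices, 0, 0 }; // IMMUTABLE needs initial data
    device->CreateBuffer(&staticDesc, &init, staticVB);

    // Frequently rewritten geometry: DYNAMIC with CPU write access, updated via Map/Unmap.
    // Long-lived but rarely changed data would use D3D11_USAGE_DEFAULT instead.
    D3D11_BUFFER_DESC dynamicDesc = {};
    dynamicDesc.ByteWidth      = dynamicSize;
    dynamicDesc.Usage          = D3D11_USAGE_DYNAMIC;
    dynamicDesc.BindFlags      = D3D11_BIND_VERTEX_BUFFER;
    dynamicDesc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
    device->CreateBuffer(&dynamicDesc, nullptr, dynamicVB);
}
With the handful of dynamic objects kept in their own small DYNAMIC buffer, the roughly 200 static objects never need to be re-uploaded.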
We use buffer objects to reduce copy operations between the CPU and GPU, and with texture buffer objects we can change a buffer's use from vertex data to texture data. Are there any other advantages of texture buffer objects? Also, they do not allow filtering; is there any disadvantage to this?
A buffer texture is similar to a 1D texture, but its backing store is not part of the texture object (in contrast to any other texture type); it is instead provided by an actual buffer object bound to TEXTURE_BUFFER. Using a buffer texture has several implications and, AFAIK, one use case that can't be mapped to any other type of texture.
Note that a buffer texture is not a buffer object - a buffer texture is merely associated with a buffer object using glTexBuffer.
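Here is a minimal sketch of that association, with an illustrative size and internal format (the function and names are hypothetical):
// Minimal sketch: a plain buffer object provides the storage, the texture just references it.
#include <GL/glew.h>

GLuint createBufferTexture(GLsizeiptr sizeInBytes)
{
    // Ordinary buffer object holding the raw data.
    GLuint buf = 0;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_TEXTURE_BUFFER, buf);
    glBufferData(GL_TEXTURE_BUFFER, sizeInBytes, nullptr, GL_STATIC_DRAW);

    // The buffer texture merely interprets that storage as RGBA8 texels.
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_BUFFER, tex);
    glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA8, buf);
    return tex;
}
The number of texels such a texture can address is capped by MAX_TEXTURE_BUFFER_SIZE, which is what the size discussion below is about.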
By comparison, buffer textures can be huge. Table 23.53 and following of the core OpenGL 4.4 spec define a minimum maximum (i.e. the minimal value implementations must provide) for the number of texels, MAX_TEXTURE_BUFFER_SIZE. The potential number of texels stored in your buffer object is computed as follows (as found in GL_ARB_texture_buffer_object):
floor(<buffer_size> / (<components> * sizeof(<base_type>)))
The resulting value, clamped to MAX_TEXTURE_BUFFER_SIZE, is the number of addressable texels.
Example:
You have a buffer object storing 4MiB of data. What you want is a buffer texture for addressing RGBA texels, so you choose an internal format RGBA8. The addressable number of texels is then
floor(4MiB / (4 * sizeof(UNSIGNED_BYTE))) == 1024^2 texels == 2^20 texels
If your implementation supports this number, you can address the full range of values in your buffer object. The above isn't too impressive and can simply be achieved with any other texture on current implementations. However, the machine on which I'm writing this answer supports 2^28 == 268435456 texels.
With OpenGL 4.4 (and 4.3, and possibly earlier 4.x versions), the minimum MAX_TEXTURE_SIZE is 2^14 texels for a 1D texture, while the minimum MAX_TEXTURE_BUFFER_SIZE is 2^16 texels, so a buffer texture can still be 4 times as large. On my local machine I can allocate a 2GiB buffer texture (even larger, actually), but only a 1GiB 1D texture when using RGBA32F texels.
A use case for buffer textures is random (and atomic, if desired) read/write access (the latter via image load/store) to a large data store inside a shader. Yes, you can do random read access on arrays of uniforms inside one or multiple blocks, but it gets very tedious if you have to process a lot of data and work with multiple blocks. And even then, looking at the maximum combined size of all uniform components (where a single float component has a size of 4 bytes) in all uniform blocks for a single stage,
MAX_(stage)_UNIFORM_BLOCKS *
MAX_UNIFORM_BLOCK_SIZE +
MAX_(stage)_UNIFORM_COMPONENTS * 4
isn't really a lot of space to work with in a shader stage (depending on how large your implementation allows the above number to be).
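For perspective, these limits are easy to query at runtime; a minimal sketch for the fragment stage (the other stages have analogous enums):
// Minimal sketch: query the per-stage uniform budget described above (fragment stage).
#include <GL/glew.h>
#include <cstdio>

void printFragmentUniformBudget()
{
    GLint blocks = 0, blockSize = 0, components = 0;
    glGetIntegerv(GL_MAX_FRAGMENT_UNIFORM_BLOCKS,     &blocks);
    glGetIntegerv(GL_MAX_UNIFORM_BLOCK_SIZE,          &blockSize);
    glGetIntegerv(GL_MAX_FRAGMENT_UNIFORM_COMPONENTS, &components);

    // Upper bound in bytes: blocks * block size + default-block components * 4 bytes each.
    long long budget = (long long)blocks * blockSize + (long long)components * 4;
    std::printf("fragment-stage uniform budget: ~%lld bytes\n", budget);
}
Compare the result with the gigabyte-scale buffer textures mentioned above and the gap is obvious.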
An important difference between textures and buffer textures is that the data store, as a regular buffer object, can be used in operations where a texture simply does not work. The extension mentions:
The use of a buffer object to provide storage allows the texture data to
be specified in a number of different ways: via buffer object loads
(BufferData), direct CPU writes (MapBuffer), framebuffer readbacks
(EXT_pixel_buffer_object extension). A buffer object can also be loaded
by transform feedback (NV_transform_feedback extension), which captures
selected transformed attributes of vertices processed by the GL. Several
of these mechanisms do not require an extra data copy, which would be
required when using conventional TexImage-like entry points.
An implication of using buffer textures is that look-ups inside a shader can only be done via texelFetch. Buffer textures also aren't mip-mapped and, as you already mentioned, during fetches there is no filtering.
Addendum:
Since OpenGL 4.3, we have what is called a Shader Storage Buffer. These too provide random (atomic) read/write access to a large data store, but they don't need to be accessed with texelFetch() or image load/store functions as is the case for buffer textures. Using buffer textures also implies having to deal with gvec4 return values, both with texelFetch() and imageLoad() / imageStore(). This becomes very tedious as soon as you want to work with structures (or arrays thereof) and you don't want to think up some awkward packing scheme using multiple instances of vec4 or multiple buffer textures to achieve something similar. With a buffer accessed as shader storage, you can simply index into the data store and pull one or more instances of some struct {} directly from the buffer.
Also, since they are very similar to uniform blocks, using them should be fairly straightforward: if you know how to use uniform buffers, it isn't a long way to learning how to use shader storage buffers.
It's also absolutely worth browsing the Issues section of the corresponding ARB extension.
Performance Implications
Daniel Rakos did some performance analysis years ago, both as a comparison of uniform buffers and buffer textures, and also on a slightly more general note based on information from AMD's OpenCL programming guide. There is now a very recent version, specifically targeting OpenCL optimization on AMD platforms.
There are many factors influencing performance:
access patterns and resulting caching behavior
cache line sizes and memory layout
what kind of memory is accessed (registers, local, global, L1/L2 etc.) and its respective memory bandwidth
how well memory fetching latency is hidden by doing something else in the meantime
what kind of hardware you're on, i.e. a dedicated graphics card with dedicated memory or some unified memory architecture
etc., etc.
As always when worrying about performance: implement something that works and see if that solution is fast enough for your needs. Otherwise, implement two or more approaches to solving the problem, profile them and compare.
Also, vendor-specific guides can offer a great deal of insight. The above-mentioned OpenCL user and optimization guides provide a high-level architectural perspective and specific hints on how to optimize your CL kernels - stuff that's also relevant when developing shaders.
One use case I have found is storing per-primitive attributes (accessed in the fragment shader with the help of gl_PrimitiveID) while still maintaining unique vertices in the indexed mesh.
In this question I'm interested in buffer-drawing in OpenGL, specifically in the tradeoff of using one buffer per data set vs one buffer for more than one data set.
Context:
Consider a data set of N vertices each represented by a set of attributes (e.g. color, texture, normals).
Each attribute is represented by a type (e.g. GLfloat, GLint) and a number of components (2, 3, 4). We want to draw this data. Schematically,
(non-interleaved representation)
data set
<-------------->
a_1 a_2 a_3
<---><---><---->
a_i = attribute; e.g. a_2 = 3 GLfloats representing color, thus 3*N GLfloats
We want to map this into the GL state, using glBufferSubData.
Problem
When committing, we have to keep track of the data in our own memory because glBufferSubData requires a start and a size. This sounds to me like an allocation problem: we want to allocate memory and keep track of where each piece lives. Since we want fast access to it, we would like the data to stay in one contiguous block, e.g. in a std::vector<char>. Schematically,
data set 1 data set 2
<------------><-------------->
(both have same buffer id)
We commit to the gl state as:
// id is bound to one std::vector<char>, "data".
glBindBuffer(target, id);
// for each data_set (AFTER calling glBindBuffer).
// for each attribute
// "start": the start point of the attribute.
// "size": (sizeof*components of the attribute)*N.
glBufferSubData(target, start, size, &data[start]);
(Note: non-interleaved, for the sake of keeping the code simple.)
The problem arises when we want to add or remove vertices, e.g. when the LOD changes. Because each data set must be a contiguous chunk, for instance to allow interleaved drawing (even in the non-interleaved case, each attribute is a chunk), we will end up with fragmentation in our std::vector<char>.
On the other hand, we can also use one chunk per buffer: instead of assigning chunks to the same buffer, we assign each chunk, now its own std::vector<char>, to a different buffer. Schematically,
data set 1 (buffer id1)
<------------>
data set 2 (buffer id2)
<-------------->
We commit data to the gl state as:
// for each data_set (BEFORE calling glBindBuffer).
// "data" is the std::vector<char> of this data_set.
// id is now bound to the specific std::vector<char>
glBindBuffer(target, id);
// for each attribute
// "start": the start point of the attribute.
// "size": (sizeof*components of the attribute)*N.
glBufferSubData(target, start, size, &data[start]);
Questions
I'm learning this, so, before any of the below: is this reasoning correct?
Assuming yes,
Is it a problem to have an arbitrary number of buffers?
Is "glBindBuffer" expected to scale with the number of buffers?
What are the major points to take into consideration in this decision?
It is not quite clear whether you are asking about performance trade-offs, but I will answer from that angle.
Is it a problem to have an arbitrary number of buffers?
It is a problem that dates back to the dark medieval times when pipelines were fixed, and it remains for backward-compatibility reasons. glBind* calls are considered one of the performance bottlenecks in modern OpenGL drivers, caused by poor locality of reference and cache misses. Simply speaking, the cache is cold, and a huge part of the time the CPU just waits in the driver for data to arrive from main memory. There is nothing driver implementers can do about it with the current API. Read Nvidia's short article about it and their bindless extension proposals.
2. Is "glBindBuffer" expected to scale with the number of buffers?
Surely: the more objects (buffers, in your case), the more bind calls and the more performance is lost in the driver. But merged, huge resource objects are less manageable.
What are the major points to take into consideration in this decision?
Only one. Profiling results ;)
"Premature optimization is the root of all evil", so try to stay as much objective as possible and believe only in numbers. When numbers will go bad, we can think of:
"Huge", "all in one" resources:
less bind calls
less context changes
harder to manage and debug; needs some additional code infrastructure (to update resource data, for example)
resizing (reallocation) is very slow
Separate resources:
more bind calls, losing time in the driver
more context changes
easier to manage, less error-prone
easy to resize, allocate, reallocate
In the end, we have a performance-complexity trade-off and different behavior when updating data. To commit to one approach or the other, you must:
decide whether you would like to keep things simple and manageable, or add complexity and gain additional FPS (profile in graphics profilers to know how much; is it worth it?)
know how often you resize/reallocate buffers (trace API calls in graphics debuggers).
Hope it helps somehow ;)
If you like theoretical discussions like this, you will probably be interested in another one about interleaving (a DirectX one):
If my vertex positions are shared, but my normals and UVs are not (to preserve hard edges and the like), is it possible to use non-interleaved buffers in DirectX 11 to handle this memory representation, such that I could still use an index buffer with it? Or should I stick with duplicated vertex positions in an interleaved buffer?
And are there any performance concerns with interleaved versus non-interleaved vertex buffers? Thank you!
How to
There are several ways. I'll describe the simplest one.
Just create separate vertex buffers:
ID3D11Buffer* positions;
ID3D11Buffer* texcoords;
ID3D11Buffer* normals;
Create input layout elements, incrementing InputSlot member for each component:
{ "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
{ "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT, 1, D3D11_APPEND_ALIGNED_ELEMENT, D3D11_INPUT_PER_VERTEX_DATA, 0 },
{ "NORMAL", 0, DXGI_FORMAT_R32G32B32_FLOAT, 2, D3D11_APPEND_ALIGNED_ELEMENT, D3D11_INPUT_PER_VERTEX_DATA, 0 },
// ^
// InputSlot
Bind buffers to their slots (better all in one shot):
ID3D11Buffer* vbs[] = {positions, texcoords, normals};
unsigned int strides[] = { /*strides go here*/ };
unsigned int offsets [] = { /*offsets go here*/ };
m_Context->IASetVertexBuffers(0, 3, vbs, strides, offsets );
Draw as usual.
You don't need to change the HLSL code (from HLSL's point of view, it looks as if there were a single buffer).
Note that the code snippets were written on the fly and may contain mistakes.
Edit: you can improve this approach by combining buffers by update rate: if texcoords and normals never change, merge them.
As for performance
It is all about locality of reference: the closer the data, the faster the access.
An interleaved buffer, in most cases, gives (by far) better performance on the GPU side (i.e. rendering): for each vertex, all attributes lie next to each other. But separate buffers give faster CPU access: each array is contiguous, so the next piece of data sits near the previous one.
So, overall, the performance concerns depend on how often you are writing to the buffers. If your limiting factor is CPU writes, stick with separate buffers. If not, go for a single one.
How will you know? Only one way: profile. Both the CPU side and the GPU side (via the graphics debugger/profiler from your GPU's vendor).
Other factors
The best practice is to limit CPU writes, so if you find that you are limited by buffer updates, you probably need to review your approach. Do we need to update the buffer every frame if we are running at 500 fps? The user won't see a difference if you reduce the buffer update rate to 30-60 times per second (decouple buffer updates from frame updates, as in the sketch below). So, if your update strategy is reasonable, you will likely never be CPU-limited, and the best approach is classic interleaving.
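Here is a minimal sketch of that decoupling, assuming the application tracks a per-frame delta time (the 30 Hz rate is just an example):
// Minimal sketch: rewrite the dynamic buffer at a fixed rate instead of every frame.
struct BufferUpdateThrottle
{
    float accumulator = 0.0f;
    float interval    = 1.0f / 30.0f; // target update rate in seconds

    // Returns true when the buffer should be rewritten this frame.
    bool shouldUpdate(float dt)
    {
        accumulator += dt;
        if (accumulator < interval)
            return false;
        accumulator = 0.0f;
        return true;
    }
};

// In the render loop:
//   if (throttle.shouldUpdate(dt)) { /* map, write, unmap */ }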
You could also consider redesigning your data pipeline, or even preparing the data offline (we call it "baking"), so that you don't need to cope with non-interleaved buffers at all. That would be quite reasonable too.
Reduce memory footprint or increase performance?
Memory-to-performance tradeoff. This is the eternal question. Duplicate memory to take advantage of interleaving, or not?
The answer is... "it depends". Are you programming the next CryEngine, targeting top GPUs with gigabytes of memory? Or are you programming for an embedded or mobile platform, where memory is slow and limited? Is one megabyte of memory worth the hassle at all? Or do you have huge models, 100 MB each? We don't know.
It's all up to you to decide. But remember: there are no free candies. If you find the memory savings worth the performance loss, do it. Profile and compare to be sure.
Hope it helps somehow. Happy coding! =)
Interleaved/Separate will mostly affect your Input Assembler stage (GPU side).
A perfect scenario for interleaved is when your buffer memory arrangement perfectly fits your vertex shader input, so your input assembler can simply fetch the data.
In that case you'll be totally fine with interleaved; even when testing with a large model (two versions of the same data, one interleaved, one separate), a timestamp query didn't report any major difference (with some pretty minimal vertex processing and a basic pixel shader).
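For reference, such measurements can be taken with D3D11 timestamp queries; a minimal sketch (not the code used for the test above), with error handling and query reuse omitted:
// Minimal sketch: time a range of draw calls on the GPU with timestamp queries.
#include <d3d11.h>
#include <cstdio>

void measureGpuTime(ID3D11Device* device, ID3D11DeviceContext* ctx)
{
    D3D11_QUERY_DESC desc = {};
    ID3D11Query *disjoint = nullptr, *start = nullptr, *end = nullptr;
    desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
    device->CreateQuery(&desc, &disjoint);
    desc.Query = D3D11_QUERY_TIMESTAMP;
    device->CreateQuery(&desc, &start);
    device->CreateQuery(&desc, &end);

    ctx->Begin(disjoint);
    ctx->End(start);                 // timestamp before the draws
    // ... issue the draw calls being measured ...
    ctx->End(end);                   // timestamp after the draws
    ctx->End(disjoint);

    // Spin until the results arrive (fine for a one-off measurement).
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj = {};
    UINT64 t0 = 0, t1 = 0;
    while (ctx->GetData(disjoint, &dj, sizeof(dj), 0) != S_OK) {}
    while (ctx->GetData(start, &t0, sizeof(t0), 0) != S_OK) {}
    while (ctx->GetData(end, &t1, sizeof(t1), 0) != S_OK) {}

    if (!dj.Disjoint)
        std::printf("GPU time: %.3f ms\n", double(t1 - t0) / double(dj.Frequency) * 1000.0);

    start->Release(); end->Release(); disjoint->Release();
}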
Now, having separate buffers makes it much easier to fine-tune in case you use your geometry in different contexts.
Let's say you have Position/Normals/UV (like in your case).
Now you also have a shader in your pipeline that only requires Position (Shadow Map would be a pretty good example).
With separate buffers, you can simply create a new input layout which contains position only and bind that buffer instead; your IA stage only has to load that buffer. Best of all, you can even do this dynamically using shader reflection, as sketched below.
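Here is a minimal sketch of that reflection step (a hypothetical helper, not code from this answer): it lists the input semantics the compiled vertex shader actually consumes, which you can then match against your separate buffers when building the input layout.
// Minimal sketch: list the input semantics a vertex shader consumes, via reflection.
#include <d3d11.h>
#include <d3d11shader.h>
#include <d3dcompiler.h>
#include <string>
#include <vector>

std::vector<std::string> getVertexInputSemantics(const void* bytecode, SIZE_T bytecodeSize)
{
    ID3D11ShaderReflection* reflector = nullptr;
    D3DReflect(bytecode, bytecodeSize, IID_ID3D11ShaderReflection,
               reinterpret_cast<void**>(&reflector));

    D3D11_SHADER_DESC shaderDesc = {};
    reflector->GetDesc(&shaderDesc);

    std::vector<std::string> semantics;
    for (UINT i = 0; i < shaderDesc.InputParameters; ++i)
    {
        D3D11_SIGNATURE_PARAMETER_DESC param = {};
        reflector->GetInputParameterDesc(i, &param);
        semantics.push_back(param.SemanticName); // e.g. "POSITION", "TEXCOORD", "NORMAL"
    }
    reflector->Release();
    return semantics;
}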
If you bind interleaved data, you will have some overhead due to the fact that it has to be loaded with a stride.
When I tested this I saw about a 20% gain using separate instead of interleaved buffers, which can be quite decent, but since this kind of processing can be largely architecture-dependent, don't take it for granted (an NVIDIA 740M was used for testing).
So simply put, profile (a lot), and check which gives you the best balance between your GPU and CPU loads.
Please also note that the overhead from the input assembler shrinks as the complexity of your shader grows: if you apply some heavy calculations, add some tessellation and some decent shading, the time difference between interleaved and non-interleaved becomes progressively meaningless.
You should stick with interleaved buffers. Any other technique will require some form of indirection to your non-duplicated position buffer, which will cost you performance and cache efficiency.