glBufferData is very slow if I reserve the right amount of memory [closed] - opengl

I'm encountering a situation where glBufferData performance is very bad if I reserve exactly the right amount of memory, but very good (with added visual glitches) if I reserve more.
For context, I have around 32,000 particles to draw, updated every frame.
Here I reserve the exact amount of memory, giving poor performance (12 fps):
glBufferData(GL_ARRAY_BUFFER, 16 * (sizeof(float)) * particlesFallingInFrustum.size(), &instancedData[0], GL_DYNAMIC_DRAW);
Here I reserve more memory, giving very good performance (60 fps):
glBufferData(GL_ARRAY_BUFFER, 16 * (sizeof(float)) * (particlesFallingInFrustum.size() + 60000), &instancedData[0], GL_DYNAMIC_DRAW);
Some details about the data (taken from the same frame):
particlesFallingInFrustum.size() = 32529. It's the number of particles to draw.
instancedData.size() = 520464. It's 16 floats per particle.
A hint, maybe: I encountered this problem after implementing frustum culling, which reduces the number of particles to draw by ~70%. The over-allocated case with good performance corresponds roughly to the amount of memory I needed before frustum culling.
I'm probably missing something obvious, but I can't find what.

glBufferData gave poor performance because I reserved less memory than I then asked glDrawElementsInstanced to draw.
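For anyone hitting the same thing, a minimal sketch of keeping the buffer size and the instance count in sync is shown below. It assumes an instance VBO holding 16 floats per visible particle; the names uploadAndDraw, instanceVBO and indexCount are illustrative, not taken from the original code.

#include <vector>
// Assumes an OpenGL function loader (GLEW, glad, ...) is already initialised.

void uploadAndDraw(GLuint instanceVBO,
                   const std::vector<float>& instancedData,
                   GLsizei indexCount)
{
    // 16 floats per particle, so this is exactly the number of instances
    // for which data was generated this frame.
    const GLsizei instanceCount = static_cast<GLsizei>(instancedData.size() / 16);

    glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);

    // Reserve exactly what will be drawn this frame.
    glBufferData(GL_ARRAY_BUFFER,
                 instancedData.size() * sizeof(float),
                 instancedData.data(),
                 GL_DYNAMIC_DRAW);

    // Draw the same number of instances that the buffer actually holds;
    // asking for more than that is what made the exact-size buffer slow.
    glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT,
                            nullptr, instanceCount);
}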

Related

Optimizing rendering through packing vertex buffers [closed]

John Carmack recently tweeted:
"Another outdated habit is making separate buffer objects for vertexes
and indexes."
I am afraid I don't fully understand what he means. Is he implying that packing all 3D data including vertex, index, and joint data into a single vertex buffer is optimal compared to separate buffers for each? And, if so, would such a technique apply only to OpenGL or could a Vulkan renderer benefit as well?
I think he means there's no particular need to put them in different buffer objects. You probably don't want to interleave them at a fine granularity, but putting e.g. all the indices for a mesh at the beginning of a buffer and then all the vertex data for the mesh after it is not going to be any worse than using separate buffer objects. Use offsets to point the binding points at the correct location in the buffer.
Whether it's better to put them in one buffer I don't know: if it is, it's probably down to ancillary things, such as fewer, larger memory allocations tending to be a little more efficient, or you (or the driver) being able to do one large copy instead of two smaller ones when copies are necessary.
Edit: I'd expect this all to apply to both GL and Vulkan.
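As a rough illustration of the "offsets into one buffer" idea, here is a hedged sketch; the Vertex struct and the helper name are made up for the example. Indices go at the start of the buffer, vertex data follows, and the same buffer object is bound to both targets.

#include <cstdint>
#include <vector>
// Assumes an OpenGL function loader (GLEW, glad, ...) is already initialised.

struct Vertex { float position[3]; };

GLuint createCombinedBuffer(const std::vector<std::uint32_t>& indices,
                            const std::vector<Vertex>& vertices)
{
    const GLsizeiptr indexBytes  = indices.size()  * sizeof(std::uint32_t);
    const GLsizeiptr vertexBytes = vertices.size() * sizeof(Vertex);

    GLuint buf = 0;
    glGenBuffers(1, &buf);

    // Allocate the whole region once, then copy each part into place.
    glBindBuffer(GL_ARRAY_BUFFER, buf);
    glBufferData(GL_ARRAY_BUFFER, indexBytes + vertexBytes, nullptr, GL_STATIC_DRAW);
    glBufferSubData(GL_ARRAY_BUFFER, 0, indexBytes, indices.data());
    glBufferSubData(GL_ARRAY_BUFFER, indexBytes, vertexBytes, vertices.data());

    // The same buffer object backs both binding points; the attribute
    // pointer's offset skips past the index region at the front.
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, buf);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                          reinterpret_cast<const void*>(indexBytes));
    glEnableVertexAttribArray(0);
    return buf;
}

At draw time the last argument of glDrawElements is then just the byte offset of the indices inside the buffer (0 here). One compatibility note: WebGL does not allow a buffer that has been bound to GL_ELEMENT_ARRAY_BUFFER to also be bound to other targets, so this single-buffer layout is not portable everywhere.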
John Carmack has replied with an answer regarding his original tweet:
"The performance should be the same, just less management. I wouldn't bother changing old code."
...
"It isn't a problem, it just isn't necessary on modern hardware. No big efficiency to packing them, just a modest management overhead."
So perhaps it isn't an outdated habit at all, especially since it goes against the intended use case of most APIs and in some cases can break compatibility, as noted by Nico.

Is the HEAP a term for RAM, processor memory, or BOTH? And how many unions can I allocate at once? [closed]

I was hoping someone could lay down some schooling about the whole heap and stack ordeal. I am trying to make a program that creates about 20,000 instances of just one union, and if that works out, some day I may want to implement a much larger program. Beyond my current project's maximum of just 20,000 unions, stored wherever C++ will allocate them, do you think I could up the ante into the millions (approximately 1,360,000 or so) while retaining a reasonable return speed on function calls? And how do you think it will handle 20,000?
The heap is an area used for dynamic memory allocation.
It's usually used to allocate space for collections of variable size, and/or to allocate large amounts of memory. It's definitely not CPU registers.
Beyond that, I think there is no guarantee about what the heap physically is.
It may be RAM, processor cache, or even HDD storage (via swapping); let the OS and hardware decide what it will be in each particular case.
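To get a rough sense of scale, here is a hedged sketch (the Cell union is made up for the example): 20,000 instances of a small union in one heap allocation amount to only a few hundred kilobytes, and even a couple of million stay in the tens of megabytes.

#include <cstdio>
#include <vector>

// Hypothetical union standing in for the one in the question.
union Cell {
    int   i;
    float f;
    char  bytes[8];
};

int main() {
    // A single heap allocation holding 20,000 unions contiguously.
    std::vector<Cell> cells(20000);

    // sizeof(Cell) is 8 bytes here, so 20,000 elements come to roughly
    // 160 KB, and ~1,360,000 elements to roughly 11 MB.
    std::printf("approx heap usage: %zu bytes\n", cells.size() * sizeof(Cell));
    return 0;
}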

Deferred shading, store position or construct it from depth [closed]

I'm in the middle of implementing deferred shading in an engine I'm working on, and now I have to decide whether to use a full RGB32F texture to store positions or to reconstruct them from the depth buffer. So it's basically an RGB32F texel fetch versus a matrix-vector multiplication in the fragment shader, and a trade-off between memory and extra ALU operations.
Please direct me to useful resources and tell me your own experience with the subject.
In my opinion it is preferable to recalculate the position from depth; this is what I do in my deferred engine. The recalculation is fast enough that it doesn't even show up when I profile the render loop, and virtually no performance impact versus ~24 MB of extra video memory usage (for a 1920x1080 RGB32F texture) was an easy choice for me.
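For reference, the reconstruction is just an inverse-projection multiply followed by a perspective divide. Below is a CPU-side sketch of that math using GLM, assuming uv and depth are in [0,1] and the default OpenGL [-1,1] clip-space depth convention; the fragment-shader version is a direct transliteration.

#include <glm/glm.hpp>

glm::vec3 positionFromDepth(const glm::mat4& invProjection,
                            glm::vec2 uv, float depth)
{
    // Map the texture-space UV and the sampled depth from [0,1] to
    // normalised device coordinates in [-1,1].
    glm::vec4 ndc(uv * 2.0f - 1.0f, depth * 2.0f - 1.0f, 1.0f);

    // Undo the projection, then the perspective divide, to get the
    // view-space position of the fragment.
    glm::vec4 view = invProjection * ndc;
    return glm::vec3(view) / view.w;
}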

C++ - Sorting a Big List (RAM USAGE HIGH!) [closed]

I'm writing this because I've noticed that when I need to sort a list of n elements, my RAM usage keeps growing, even though all the elements are already allocated and the only operations required are swapping and moving elements.
The problem is not the speed of my algorithm, but the fact that on every new cycle a lot of RAM gets allocated, and I don't understand why. Could you please help me?
Thanks!
Write a test with 10 elements in the sequence
Run it under valgrind --tool=massif
...
Profit
There are tons of sorting algorithms and container implementations around, and many (if not most) container implementations allocate/deallocate memory on each insert/erase operation, so if dynamic allocation is a problem you really need to go all the way down to the finest details and pick the right combination.
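As an illustration, sorting in place avoids per-element allocation entirely; a minimal sketch follows (the int element type is just a placeholder).

#include <algorithm>
#include <list>
#include <vector>

void sortInPlace(std::vector<int>& v, std::list<int>& l)
{
    // std::sort rearranges the vector's existing elements in place;
    // no new element storage is allocated.
    std::sort(v.begin(), v.end());

    // std::list::sort relinks the existing nodes rather than copying them.
    l.sort();
}

If RAM still grows on every cycle with an in-place sort like this, massif will point at whatever else is allocating (element copies, container rebuilds, and so on).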

What's the best way to allocate HUGE amounts of memory? [closed]

I'm allocating 10 GB of RAM for tons of objects that I will need. I want to be able to squeeze out every last byte of RAM I can before hitting a problem such as a null pointer or a failed allocation.
I know the allocator returns contiguous memory, so if I have memory scattered by other programs, the maximum contiguous size will be quite small (I assume), or at least smaller than the actual amount of free memory remaining.
Is it better to allocate the entire contiguous block I need in one go (10 GB), or is it better to allocate smaller non-contiguous chunks and link them together?
Which one is more likely to always return all the memory I need?
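For what it's worth, the "smaller non-contiguous chunks" option could look something like the sketch below; the 256 MB chunk size and the function name are arbitrary illustrative choices.

#include <cstddef>
#include <memory>
#include <new>
#include <vector>

// Grab memory in fixed-size blocks until an allocation fails, keeping
// every block alive in the returned vector.
std::vector<std::unique_ptr<char[]>> grabChunks(std::size_t totalBytes,
                                                std::size_t chunkBytes = std::size_t(256) << 20)
{
    std::vector<std::unique_ptr<char[]>> chunks;
    std::size_t grabbed = 0;
    while (grabbed < totalBytes) {
        try {
            chunks.push_back(std::make_unique<char[]>(chunkBytes));
            grabbed += chunkBytes;
        } catch (const std::bad_alloc&) {
            break;  // Keep whatever the system could actually provide.
        }
    }
    return chunks;
}

One caveat to the premise, though: on a 64-bit OS with virtual memory, a single large allocation only needs contiguous virtual address space rather than contiguous physical RAM, so fragmentation caused by other programs is usually not the limiting factor.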