How to deallocate glBufferData memory - C++

I created a Vertex Buffer Object class to manage lots of vertices in my application. The user calls the constructor to create a buffer object, and the constructor calls glBufferData to allocate a specified amount of space.
There is a member function called resize that allows the user to change the capacity of the VBO by calling glBufferData again. My question is: how do I deallocate the previous allocation? Or is it done automatically?
glDeleteBuffers, according to the OpenGL docs, only deletes the buffer object itself, with no mention of the actual memory allocated with glBufferData.
Can I keep calling glBufferData on the same bound buffer with no memory leak?

You can't create a memory leak by repeatedly calling glBufferData() for the same buffer object. The new allocation replaces the old one.
There's one subtle aspect that most of the time you don't need to worry about, but may still be useful to understand: There is a possibility of having multiple active allocations for the same buffer object temporarily. This happens due to the asynchronous nature of OpenGL. For illustration, picture a call sequence like this:
(1) glBufferData(dataA)
(2) glDraw()
(3) glBufferData(dataB)
(4) glDraw()
When you make API call (3) in this sequence, the GPU may not yet have finished with the draw call from (2). It may in fact still be queued up somewhere in the driver, and not handed over to the GPU yet. Since call (2) depends on dataA, that data cannot be deleted until the GPU has finished executing draw call (2). In this case, the allocations for dataA and dataB temporarily exist at the same time.
When exactly dataA is actually deleted depends on the implementation. It just can't be earlier than the time the GPU finishes with draw call (2). After that, it could happen immediately, based on some garbage collection timer, when memory runs low, or at many other points.
glDeleteBuffers() will also delete the buffer memory. Very much like the point above, this may not happen immediately. Again, the memory can only be released after the GPU has finished executing all pending operations that use it.
If you don't plan to use the buffer object anymore, calling glDeleteBuffers() is the best option.
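For reference, here is a minimal, hedged sketch of what such a wrapper could look like, assuming a class that only tracks a handle and a capacity (all names such as VertexBuffer, m_id and m_capacity are made up for illustration). Note that resizing this way discards the old contents; preserving them would need an extra copy, e.g. via glCopyBufferSubData.
// Minimal sketch of the resize idea discussed above (names are hypothetical).
// Re-specifying the store with glBufferData replaces the previous allocation;
// the driver frees the old storage once the GPU no longer needs it.
class VertexBuffer {
public:
    VertexBuffer(GLsizeiptr capacity, GLenum usage = GL_DYNAMIC_DRAW)
        : m_usage(usage), m_capacity(capacity) {
        glGenBuffers(1, &m_id);
        glBindBuffer(GL_ARRAY_BUFFER, m_id);
        glBufferData(GL_ARRAY_BUFFER, capacity, nullptr, usage); // initial allocation
    }
    void resize(GLsizeiptr newCapacity) {
        glBindBuffer(GL_ARRAY_BUFFER, m_id);
        // Replaces the old data store; no explicit deallocation is needed.
        glBufferData(GL_ARRAY_BUFFER, newCapacity, nullptr, m_usage);
        m_capacity = newCapacity;
    }
    ~VertexBuffer() {
        glDeleteBuffers(1, &m_id); // releases the handle and, eventually, the storage
    }
private:
    GLuint m_id = 0;
    GLenum m_usage;
    GLsizeiptr m_capacity;
};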

Ten minutes later I read the docs on the glBufferData page:
glBufferData creates a new data store for the buffer object currently bound to
target. Any pre-existing data store is deleted.
which answers my question. I can indeed keep calling it to increase or decrease the size of my VBO.

glDeleteBuffers deletes the buffer handle, and the associated storage, if any, should be collected/released by the system soon afterwards.
If the buffer is currently bound, the driver will unbind it (bind zero), although it is ugly to delete a buffer while it is still bound.

Related

Why does glBufferSubData need to wait until the VBO is no longer used by glDrawElements?

In OpenGL Insights, it says that "OpenGL driver has to wait because VBO is used by glDrawElements from previous frame".
That confused me a lot.
As far as I know, glBufferSubData will copy the data to temporary memory and then transfer it to the GPU later.
So why does the driver still need to wait? It could just append the transfer command to the command queue, delaying the transfer of the data to the GPU until glDrawElements has finished, right?
----- ADDED --------------------------------------------------------------------------
In OpenGL Insights, it says:
http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-AsynchronousBufferTransfers.pdf (Page 397)
However, when using glBufferSubData or glMapBuffer[Range], nothing in the API itself prevents us from modifying data that are currently used by the device for rendering the previous frame, as shown in Figure 28.3. Drivers have to avoid this problem by blocking the function until the desired data are not used anymore: this is called an implicit synchronization.
And also in "Beyond Porting" by Valve & NVIDIA, it says:
http://media.steampowered.com/apps/steamdevdays/slides/beyondporting.pdf
MAP_UNSYNCHRONIZED
Avoids an application-GPU sync point (a CPU-GPU sync point)
But causes the Client and Server threads to serialize
This forces all pending work in the server thread to complete
It's quite expensive (almost always needs to be avoided)
Both of them point out that glBufferSubData/glMapBuffer will block the application thread, not just the driver thread.
Why is that?
There is no rule saying that the driver has to wait. It needs to ensure that buffer content is not modified before draw calls using the old content have finished executing. And it needs to consume the data that the caller passed in before the glBufferSubData() call returns. As long as the resulting behavior is correct, any implementation in the driver is fair game.
Let's illustrate the problem with a typical pseudo-call sequence, labelling the calls for later explanation:
(1) glBindBuffer(buf)
(2) glBufferSubData(dataA)
(3) glDraw()
(4) glBufferSubData(dataB)
(5) glDraw()
The constraints in play are:
The data pointed to by dataA cannot be accessed by the driver after call (2) returns. The OpenGL specs allow the caller to do anything it wants with the data after the call returns, so it needs to be consumed by the driver before the call returns.
The data pointed to by dataB cannot be accessed by the driver after call (4) returns.
The draw command resulting from call (3) needs to be executed while the content of buf is dataA.
The draw command resulting from call (5) needs to be executed while the content of buf is dataB.
Due to the inherently asynchronous nature of OpenGL, the interesting case is call (4). Let's say that dataA has been stored in buf at this point in time, and the draw command for call (3) has been queued up for execution by the GPU. But we can't rely on the GPU having executed that draw command yet. So we can't store dataB in buf because the pending draw command has to be executed by the GPU while dataA is still stored in buf. But we can't return from the call before we consumed dataB.
There are various approaches for handling this situation. The brute force solution is to simply block the execution of call (4) until the GPU has finished executing the draw command from call (3). That will certainly work, but can have very bad performance implications. Because we wait until the GPU completed work before we submit new work, the GPU will likely go temporarily idle. This is often called a "bubble" in the pipeline, and is very undesirable. On top of that, the application is also blocked from doing useful work until the call returns.
The simplest way to work around this is for the driver to copy dataB in call (4), and later place this copy of the data in buf, after the GPU has completed the draw command from call (3), but before the draw command from call (5) is executed. The downside is that it involves additional data copying, but it's often well worth it to prevent the pipeline bubbles.
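For illustration, here is the same (1)-(5) sequence written out as concrete GL calls, with a comment marking the point where the implicit synchronization can occur. The variables size and vertexCount are assumptions, and glDrawArrays stands in for the generic glDraw() used above.
// The (1)-(5) sequence above as real GL calls; buf, dataA, dataB, size and
// vertexCount are assumed to exist already.
glBindBuffer(GL_ARRAY_BUFFER, buf);                                   // (1)
glBufferSubData(GL_ARRAY_BUFFER, 0, size, dataA);                     // (2) dataA is consumed before the call returns
glDrawArrays(GL_TRIANGLES, 0, vertexCount);                           // (3) queued; will read dataA later
// (4) The driver must not overwrite buf while (3) is still pending. It either
//     blocks here (implicit sync, the "bubble" described above) or copies dataB
//     to temporary storage and updates buf only after (3) has executed.
glBufferSubData(GL_ARRAY_BUFFER, 0, size, dataB);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);                           // (5) must see dataB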

Is it necessary to enqueue read/write when using CL_MEM_USE_HOST_PTR?

Assume that I wait() for the kernel to finish its work.
I was wondering whether, when allocating a buffer using the CL_MEM_USE_HOST_PTR flag, it is necessary to use enqueueRead/Write on the buffer, or whether these calls can always be omitted.
Note
I am aware of this note on the reference:
Calling clEnqueueReadBuffer to read a region of the buffer object with
the ptr argument value set to host_ptr + offset, where host_ptr is a
pointer to the memory region specified when the buffer object being
read is created with CL_MEM_USE_HOST_PTR, must meet the following
requirements in order to avoid undefined behavior:
All commands that use this buffer object have finished execution before the read command begins execution
The buffer object is not mapped
The buffer object is not used by any command-queue until the read command has finished execution
So, to clarify my question, I split it in two:
if I create a buffer using the CL_MEM_USE_HOST_PTR flag, can I assume the OpenCL implementation will write to the device cache when necessary, so that I can always avoid calling enqueueWriteBuffer()?
if I call event.wait() after launching a kernel, can I always avoid calling enqueueReadBuffer() to access the computed data in a buffer created with the CL_MEM_USE_HOST_PTR flag?
Maybe I am overthinking it, but even though the description of the flag is clear that host memory will be used to store the data, it is not clear (or I did not find where it is clarified) when the data is available and whether the read/write is always implicit.
You'll never have to use enqueueWriteBuffer(); however, you do have to use enqueueMapBuffer() (and unmap when you are done).
See http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf page 89 (it's the same also in 1.1).
The data is available only after you have mapped the object, and it will become undefined again after you unmap the object. This old thread http://www.khronos.org/message_boards/showthread.php/6912-Clarify-CL_MEM_USE_HOST_PTR also contains a rather useful description.
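As a rough illustration of that answer, here is a hedged sketch using the plain C API (usable from C++): the buffer is created with CL_MEM_USE_HOST_PTR, the kernel runs, and the results are read through a blocking map instead of enqueueReadBuffer. ctx, queue, kernel, hostData and N are assumptions, and error checking is omitted.
// Sketch: read results from a CL_MEM_USE_HOST_PTR buffer via map/unmap.
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            N * sizeof(float), hostData, &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
size_t global = N;
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
// Blocking map: once this returns, the returned pointer (derived from host_ptr
// for CL_MEM_USE_HOST_PTR buffers) holds the up-to-date results.
float* mapped = static_cast<float*>(
    clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                       0, N * sizeof(float), 0, nullptr, nullptr, &err));
// ... read results through 'mapped' ...
// Unmap before the buffer is used by another command; afterwards the mapped
// region must be considered undefined again until the next map.
clEnqueueUnmapMemObject(queue, buf, mapped, 0, nullptr, nullptr);
clReleaseMemObject(buf);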

Good allocator for cross thread allocation and free

I am planning to write a C++ networked application where:
I use a single thread to accept TCP connections and also to read data from them. I am planning to use epoll/select for this. The data is written into buffers that are allocated using some arena allocator, say jemalloc.
Once there is enough data from a single TCP client to form a protocol message, the data is published on a ring buffer. The ring buffer structures contain the fd for the connection and a pointer to the buffer containing the relevant data.
A worker thread processes entries from the ring buffers and sends some result data to the client. After processing each event, the worker thread frees the actual data buffer to return it to the arena allocator for reuse.
I am leaving out details on how the publisher makes data written by it visible to the worker thread.
So my question is: Are there any allocators which optimize for this kind of behavior i.e. allocating objects on one thread and freeing on another?
I am worried specifically about having to use locks to return memory to an arena which is not the thread affinitized arena. I am also worried about false sharing since the producer thread and the worker thread will both write to the same region. Seems like jemalloc or tcmalloc both don't optimize for this.
Before you go down the path of implementing a highly optimized allocator for your multi-threaded application, you should first just use the standard new and delete operators for your implementation. After you have a correct implementation of your application, you can move to address bottlenecks that are discovered through profiling it.
If you get to the stage where it is obvious that the standard new and delete allocators are a bottleneck to the application, the following is the approach I have used:
Assumption: The number of threads are fixed and are statically created.
Each thread has their own arena.
Each object taken from an arena has a reference back to the arena it came from.
Each arena has a separate garbage list for each thread.
When a thread frees an object, it goes back to the arena it came from, but is placed on the thread-specific garbage list.
The thread that actually owns the arena treats its garbage list as the real free list.
Periodically, the thread that owns an arena performs a garbage collection pass to fold objects from the other thread garbage lists into the real free list.
The "periodical" garbage collection pass doesn't necessarily have to be time based. A subset of the garbage could be reaped on every allocation and free, for example.
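As a rough sketch of that scheme, simplified to a single mutex-protected garbage list per arena instead of one per thread, with a fixed block size (assumed to be at least sizeof(void*) and suitably aligned); all names are made up:
// Each thread owns a ThreadArena. Frees from other threads go onto the
// "garbage" list; the owner folds that list back into its free list.
#include <cstddef>
#include <mutex>
#include <vector>
struct Block { Block* next; };
class ThreadArena {
public:
    ThreadArena(std::size_t blockSize, std::size_t blockCount)
        : storage_(blockSize * blockCount) {
        for (std::size_t i = 0; i < blockCount; ++i)
            pushLocal(reinterpret_cast<Block*>(&storage_[i * blockSize]));
    }
    void* allocate() {                       // called only by the owning thread
        if (!freeList_) collectGarbage();    // fold in blocks freed elsewhere
        Block* b = freeList_;
        if (b) freeList_ = b->next;
        return b;                            // nullptr when exhausted
    }
    void deallocateLocal(void* p) { pushLocal(static_cast<Block*>(p)); }
    void deallocateRemote(void* p) {         // called by other threads
        std::lock_guard<std::mutex> lock(garbageMutex_);
        auto* b = static_cast<Block*>(p);
        b->next = garbage_;
        garbage_ = b;
    }
private:
    void pushLocal(Block* b) { b->next = freeList_; freeList_ = b; }
    void collectGarbage() {                  // owner thread only
        std::lock_guard<std::mutex> lock(garbageMutex_);
        while (garbage_) {
            Block* b = garbage_;
            garbage_ = b->next;
            pushLocal(b);
        }
    }
    std::vector<unsigned char> storage_;
    Block* freeList_ = nullptr;
    Block* garbage_ = nullptr;               // blocks freed by other threads
    std::mutex garbageMutex_;
};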
The best way to deal with memory allocation and deallocation issues is to not deal with them at all.
You mention a ring buffer. Those are usually a fixed size. If you can come up with a fixed maximum size for your protocol messages you can allocate all the memory you will ever need at program start. When deallocating, keep the memory but reset it to a fresh state.
Now, your program may need to allocate and deallocate memory while dealing with each message but that will be done in each thread and cross-thread issues will not come into play.
This can work even if your maximum message size is too large to preallocate for every message, as long as you can allocate the amount of memory that most messages will use and have handlers for allocating more when necessary.
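A minimal sketch of that preallocation idea, assuming a fixed maximum message size and single-threaded access for brevity (handing slots between threads would still need a queue or atomics; all names are made up):
// Pool of fixed-size message slots created at program start. "Freeing" just
// returns the slot to the free list; no allocator is involved after startup.
#include <cstddef>
#include <vector>
class MessagePool {
public:
    MessagePool(std::size_t slotSize, std::size_t slotCount)
        : slotSize_(slotSize), storage_(slotSize * slotCount) {
        for (std::size_t i = 0; i < slotCount; ++i) freeSlots_.push_back(i);
    }
    unsigned char* acquire() {               // nullptr when the pool is exhausted
        if (freeSlots_.empty()) return nullptr;
        std::size_t i = freeSlots_.back();
        freeSlots_.pop_back();
        return &storage_[i * slotSize_];
    }
    void release(unsigned char* p) {         // reset-and-reuse instead of freeing
        freeSlots_.push_back((p - storage_.data()) / slotSize_);
    }
private:
    std::size_t slotSize_;
    std::vector<unsigned char> storage_;
    std::vector<std::size_t> freeSlots_;
};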

OpenGL program/shader uninitialization

What's the proper way to do this?
I'm doing these steps:
Create Shader(s)
Compile Shader(s)
Create Program
Attach Shader(s) to Program
Link Program
Delete Shader(s)
In http://www.opengl.org/wiki/GLSL_Object it says: You do not have to explicitly detach shader objects, even after linking the program. However, it is a good idea to do so once linking is complete, as otherwise the program object will keep its attached shader objects alive when you try to delete them.
And the answer to Proper way to delete GLSL shader? also says memory use will increase if I don't delete the shaders.
So, checking http://www.opengl.org/sdk/docs/man/xhtml/glDetachShader.xml, it says: "If shader has already been flagged for deletion by a call to glDeleteShader and it is not attached to any other program object, it will be deleted after it has been detached."
So my #6 is useless unless I detach the shaders afterwards, right?
Should I detach and delete after the Program has been compiled correctly (to save the memory) or should I detach/delete only when my application is closing down?
So my #6 is useless unless I detach the shaders afterwards, right?
Yes. What the GL does is basically reference counting. As long as some other object is referencing the shader object, it will stay alive. If you delete the object, the actual deletion will be deferred until the last reference is removed.
Should I detach and delete after the Program has been compiled correctly (to save the memory) or should I detach/delete only when my application is closing down?
That is up to you. You can delete a shader object as soon as you don't need it any more. If you do not plan to relink the program, you can delete all attached shader objects immediately after the initial link operation. However, shader objects don't consume much memory (and they don't go into GPU memory; only the final programs do), so it is typically not a big deal if you delete them later, or even not at all, since all GL resources are destroyed when the GL context is destroyed (including the case where the application exits). Of course, if you create shaders dynamically at runtime, you should also dynamically delete the old and unused objects, to avoid accumulating lots of unused objects and effectively leaking memory/object names and so on.
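Putting the steps and the detach/delete advice together, a typical sequence might look like the following sketch (error checking omitted; vsSource and fsSource are assumed to hold the GLSL source strings):
// Create, compile, attach, link, then detach and delete the shader objects.
GLuint vs = glCreateShader(GL_VERTEX_SHADER);
glShaderSource(vs, 1, &vsSource, nullptr);
glCompileShader(vs);
GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
glShaderSource(fs, 1, &fsSource, nullptr);
glCompileShader(fs);
GLuint prog = glCreateProgram();
glAttachShader(prog, vs);
glAttachShader(prog, fs);
glLinkProgram(prog);
// Once linked (and assuming no relink is planned), detach so the program no
// longer keeps the shader objects alive, then flag them for deletion.
glDetachShader(prog, vs);
glDetachShader(prog, fs);
glDeleteShader(vs);
glDeleteShader(fs);
// ... use prog ...
glDeleteProgram(prog); // at shutdown, or when the program is no longer needed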

When can I release a source PBO?

I'm using PBOs to asynchronously move data between my CPU and GPU.
When moving data from the GPU, I know I can delete the source texture after I have called glMapBuffer on the PBO.
However, what about the other way around? When do I know that the transfer from the PBO to the texture (glTexSubImage2D(..., NULL)) is done and I can safely release or re-use the PBO? Is it as soon as I bind the texture or something else?
I think that after calling glTexImage you are safe to delete or reuse the buffer without errors, as the driver handles everything for you, including deferred destruction (that's the advantage of buffer objects). But this means that calls to glMapBuffer may block until the preceding glTexImage copy has completed. If you want to reuse the buffer and just overwrite its whole content, it is common practice to reallocate it with glBufferData before calling glMapBuffer. This way the driver knows you don't care about the previous content anymore and can allocate a new buffer that you can use immediately (the memory containing the previous content is then freed by the driver when it is really no longer in use). Just keep in mind that your buffer object is just a handle to memory, which the driver can manage and copy as it likes.
EDIT: This means that in the other direction (GPU to CPU) you can delete the source texture after glGetTexImage has returned, as the driver manages everything behind the scenes. The decision to use buffer objects or not should not have any implications for the order and time in which you call GL functions. Keep in mind that calling glDelete... does not immediately delete an object; it just enqueues this command into the GL command stream, and even then, it's up to the driver when it really frees any memory.
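To tie this together, here is a hedged sketch of the upload path the answer describes: re-specify the PBO with glBufferData before mapping so glMapBuffer does not block on the previous transfer, then start the next PBO-to-texture copy. pbo, tex, width, height and imageSize are assumptions.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
// Re-specify the store: tells the driver the old contents are no longer needed,
// so mapping does not have to wait for the previous glTexSubImage2D transfer.
glBufferData(GL_PIXEL_UNPACK_BUFFER, imageSize, nullptr, GL_STREAM_DRAW);
// Map, fill with new pixel data, unmap.
void* ptr = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
// ... write imageSize bytes of pixel data to ptr ...
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
// Start the asynchronous PBO -> texture transfer; the null pointer is
// interpreted as offset 0 into the bound pixel unpack buffer.
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
// The PBO can be re-specified and refilled again right away; the driver keeps
// the previous storage alive until the transfer above has completed.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);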