In OpenGL it is common practice to orphan buffers that are used frequently. Ideally the driver notices that a buffer of the same size is requested and, if possible, returns the old buffer if it is no longer needed. The driver only allocates new memory when the old buffer is still in use and can't be reused.
In OpenCL (on NVIDIA hardware using the latest developer drivers) I am not sure about this technique. I have a 256 kB buffer, managed by the C++ wrapper's reference counting, which I reallocate frequently. Most of the time this works fine, but in some cases OpenCL throws a CL_OUT_OF_MEMORY error while allocating a new buffer.
Do you think I should switch my approach (e.g. use a constant number of buffers)? Or should I look for another possible cause of this problem?
Kind regards,
Florian
OpenCL uses C semantics for memory allocation and deallocation. As such, it will not automatically reuse buffers. You have to explicitly release a buffer and allocate a new one later. Alternatively, it is good practice to reuse buffers manually, since allocation can be quite an expensive operation.
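To make the "reuse buffers manually" suggestion concrete, here is a minimal sketch of a fixed-size pool. The names BufferPool and FakeBuffer are made up for illustration, and FakeBuffer is only a host-side stand-in for a cl::Buffer so the sketch compiles without an OpenCL SDK. It assumes all buffers have the same size and that the caller only releases a buffer once no enqueued command still uses it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for a device buffer (assumption: real code would hold a cl::Buffer).
struct FakeBuffer {
    explicit FakeBuffer(std::size_t bytes) : storage(bytes) {}
    std::vector<unsigned char> storage;
};

// Hypothetical pool: all buffers are created once, then recycled.
class BufferPool {
public:
    BufferPool(std::size_t bufferBytes, std::size_t count)
        : bufferBytes_(bufferBytes) {
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(std::make_unique<FakeBuffer>(bufferBytes));
    }

    // Returns a recycled buffer, or nullptr if every buffer is in use.
    std::unique_ptr<FakeBuffer> acquire() {
        if (free_.empty()) return nullptr;
        std::unique_ptr<FakeBuffer> buf = std::move(free_.back());
        free_.pop_back();
        return buf;
    }

    // Hands a buffer back to the pool; it must be one of the pool's size.
    void release(std::unique_ptr<FakeBuffer> buf) {
        assert(buf && buf->storage.size() == bufferBytes_);
        free_.push_back(std::move(buf));
    }

    std::size_t available() const { return free_.size(); }

private:
    std::size_t bufferBytes_;
    std::vector<std::unique_ptr<FakeBuffer>> free_;
};
```

With something like this, an empty pool shows up as a nullptr from acquire() that you can handle explicitly (wait, or skip the work), instead of as an allocation error from the driver.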
So I'm starting to learn Vulkan, and I haven't done anything with multiple objects yet. If I'm making a game engine and implementing some kind of "drag and drop" feature, where you drag, for example, a cube from a panel and drop it into the scene, is it better to:
Have separate vertex buffers?
Have just one and make it grow, kind of like a std::vector? This sounds really slow, considering I'd have to perform a transfer operation with command buffers every time a resize happens (every time a new object gets added to the scene).
A growing vertex buffer is usually the way to go. Keep in mind that Vulkan implementations limit the number of separate device memory allocations (see maxMemoryAllocationCount) and that the API is designed for sub-allocation (like in old-school C). Excerpt from NVIDIA's Vulkan recommendations:
Use memory sub-allocation. vkAllocateMemory() is an expensive operation on the CPU. Cost can be reduced by suballocating from a large memory object. Memory is allocated in pages which have a fixed size; sub-allocation helps to decrease the memory footprint.
The only note here is to allocate your buffer aggressively, as large as you believe it will grow. The other point on that page warns against pushing memory to its limits; the OS may give up and kill your program if you ignore this:
When memory is over-committed on Windows, the OS may temporarily suspend a process from the GPU runlist in order to page out its allocations to make room for a different process’ allocations. There is no OS memory manager on Linux that mitigates over-commitment by automatically performing paging operations on memory objects.
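As a rough illustration of sub-allocation, here is a minimal bump (linear) allocator that hands out aligned offsets inside one large block. It is only a sketch: LinearSubAllocator and kAllocFailed are made-up names, the block would be a single VkDeviceMemory or VkBuffer in real code, and freeing individual ranges is not handled (only a full reset).

```cpp
#include <cassert>
#include <cstdint>

// Sentinel returned when the block has no room left (hypothetical name).
constexpr std::uint64_t kAllocFailed = UINT64_MAX;

// Hypothetical bump sub-allocator over one large, pre-allocated block.
class LinearSubAllocator {
public:
    explicit LinearSubAllocator(std::uint64_t capacity) : capacity_(capacity) {}

    // Returns the offset of the new range, or kAllocFailed if it does not fit.
    // alignment must be a power of two.
    std::uint64_t allocate(std::uint64_t size, std::uint64_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        // Round the current head up to the next multiple of alignment.
        const std::uint64_t offset = (head_ + alignment - 1) & ~(alignment - 1);
        if (offset > capacity_ || size > capacity_ - offset) return kAllocFailed;
        head_ = offset + size;
        return offset;
    }

    // Forgets all sub-allocations at once; the block itself stays allocated.
    void reset() { head_ = 0; }

    std::uint64_t used() const { return head_; }

private:
    std::uint64_t capacity_;
    std::uint64_t head_ = 0;
};
```

Each mesh dropped into the scene would then get an offset from allocate() and be copied to that offset, so the expensive allocation happens once up front rather than per object.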
I'm looking for a reliable way to determine current GPU memory usage with OpenCL.
I have found the NVIDIA CUDA API cudaMemGetInfo( size_t* free, size_t* total ), which returns the free and total memory on the current device.
But I'm looking for a solution for AMD and OpenCL. I could not find similar functionality in OpenCL, and I don't know whether AMD has something equivalent.
I don't want to know how much free memory there is on OpenCL devices before allocating buffers, but how much is free after allocating buffers.
As indicated in "How do I determine available device memory in OpenCL?", with OpenCL there is a priori no way to know this, and no need to know it.
For AMD, perhaps try CL_DEVICE_GLOBAL_FREE_MEMORY_AMD from the cl_amd_device_attribute_query extension - this extension will probably only work with proprietary drivers, though.
In the general case it's impossible, because AFAIK there's no way to know when buffers are allocated on the device. In this sense OpenCL is higher-level than CUDA. Buffers belong to contexts, not devices. Calling clCreateBuffer() can, but doesn't have to, allocate any memory on any device; the implementation automatically migrates buffers to device memory before it executes a kernel that needs them, and moves them off the device if it needs to free memory for the next kernel. Even if you get the free memory of a device, you can't use it 100% reliably to decide whether to run a kernel, because clEnqueueNDRangeKernel() doesn't necessarily launch a kernel immediately (it just enqueues it; if there's something else in the queue, it can be delayed), and some other application on the same computer could get scheduled on the GPU in the meantime.
If you want to avoid swapping memory, you'll have to make sure that 1) your application is the only one using the GPU, and 2) for each of your kernels, the total size of its buffer arguments is <= CL_DEVICE_GLOBAL_MEM_SIZE.
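Condition 2) can be checked on the host before enqueueing. The helper below is a hypothetical sketch (kernelArgsFit is not an OpenCL function); it assumes you track the sizes of the buffers you pass to each kernel yourself and have queried the device limit with clGetDeviceInfo(CL_DEVICE_GLOBAL_MEM_SIZE).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns true if the buffer arguments of one kernel fit in global memory.
// bufferSizes: sizes in bytes of every buffer argument of that kernel.
// globalMemSize: value reported for CL_DEVICE_GLOBAL_MEM_SIZE.
bool kernelArgsFit(const std::vector<std::uint64_t>& bufferSizes,
                   std::uint64_t globalMemSize) {
    std::uint64_t total = 0;
    for (std::uint64_t size : bufferSizes) {
        // Compare against the remaining room, which also avoids overflow.
        if (size > globalMemSize - total) return false;
        total += size;
    }
    return true;
}
```

This is a necessary condition at best, not a guarantee: the implementation may need device memory for other things as well.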
I've been looking into writing applications using OpenGL to render data on-screen, and there is one thing that constantly comes up -- it is slow to copy data into the GPU.
I am currently switching between reading the OpenGL SuperBible 7th Edition and reading various tutorials online, and I have not come across an explanation of when data is actually sent to the GPU; I only have guesses.
Is space allocated in the GPU's RAM when I make calls to glBufferStorage/glCreateVertexArrays? Or is this space allocated in my application's memory and then copied over at a later time?
Is the pointer returned from glMapBuffer* a pointer to GPU memory, or is it a pointer to space allocated in my application's memory that is then copied over at a later time?
Assuming that the data is stored in my application's memory and copied over to the GPU, when is the data actually copied? When I make a call to glDrawArrays?
1: glCreateVertexArrays doesn't have anything to do with buffer objects or GPU memory (of that kind), so it's kinda irrelevant.
As for the rest, when OpenGL decides to allocate actual GPU memory is up to the OpenGL implementation. It can defer the actual allocation as long as it wants.
If you're asking about when your data is uploaded to OpenGL, OpenGL will always be finished with any pointer you pass it when that function call returns. So the implementation will either copy the data to the GPU-accessible memory within the call itself, or it will allocate some CPU memory and copy your data into that, scheduling the transfer to the actual GPU storage for later.
As a matter of practicality, you should assume that copying to the buffer doesn't happen immediately. This is because DMAs usually require certain memory alignment, and the pointer you pass may not have that alignment.
But usually, you shouldn't care. Let the implementation do its job.
2: Like the above, the implementation can do whatever it wants when you map memory. It might give you a genuine pointer to GPU-accessible memory. Or it might just allocate a block of CPU memory and DMA it up when you unmap the memory.
The only exception to this is persistent mapping. That feature requires that OpenGL give you an actual pointer to the actual GPU-accessible memory that the buffer resides in. This is because you never actually tell the implementation when you're finished writing to/reading from the memory.
This is also (part of) why OpenGL requires you to allocate buffer storage immutably to be able to use persistent mapping.
3: It is copied whenever the implementation feels that it needs to be.
OpenGL implementations are a black box. What they do is more-or-less up to them. The only requirement the specification makes is that their behavior act "as if" it were doing things the way the specification says. As such, the data can be copied whenever the implementation feels like copying it, so long as everything still works "as if" it had copied it immediately.
Making a draw call does not require that any buffer DMAs that this draw command relies on have completed at that time. It merely requires that those DMAs will happen before the GPU actually executes that drawing command. The implementation could do that by blocking in the glDraw* call until the DMAs have completed. But it can also use internal GPU synchronization mechanisms to tie the drawing command being issued to the completion of the DMA operation(s).
The only thing that will guarantee that the upload has actually completed is to call a function that will cause the GPU to access the buffer, then synchronize the CPU with that command. Synchronizing after only the upload doesn't guarantee anything. The upload itself is not observable behavior, so synchronizing there may not have an effect.
Then again, it might. That's the point; you cannot know.
Not sure what the DX parlance is for these, but I'm sure they have a similar notion.
As far as I'm aware, the advantage of VBOs is that they allocate memory that's directly accessible by the GPU. We can then upload data to this buffer and keep it there for an extended number of frames, avoiding the overhead of uploading the data every frame. Additionally, we're able to alter this data on a per-datum basis, if we choose to.
Therefore, I can see the advantage of using VBOs for static geometry, but I don't see any benefit at all for dynamic objects, since you pretty much have to update all the data every frame anyway?
There are several methods of updating buffers in OpenGL. If you have dynamic data, you can simply reinitialize the buffer storage every frame with the new data (e.g. with glBufferData). You can also use client vertex buffer pointers in compatibility contexts. However, these methods can cause 'churn' in the driver's memory allocation. The new data storage essentially has to sit in system memory until the GPU driver handles it, and it's not possible to get feedback on this process.
In later OpenGL versions (4.4, and via extensions in earlier versions), some functionality was introduced to reduce the overhead of updating dynamic buffers, allowing GPU-allocated memory to be written without direct driver synchronization. This essentially requires that you have the glBufferStorage and glMapBufferRange functionality available. You create the buffer storage with glBufferStorage, passing GL_MAP_PERSISTENT_BIT together with GL_MAP_WRITE_BIT and/or GL_MAP_READ_BIT (depending on whether you are writing and/or reading the data), and then map it with glMapBufferRange using GL_MAP_PERSISTENT_BIT and the matching access flags. However, this technique also requires that you use GPU fencing to ensure you are not overwriting the data while the GPU is reading it. Using this method makes updating VBOs much more efficient than reinitializing the data store or using client buffers.
There is a good presentation on GDC Vault about this technique (skip to the DynamicStreaming heading).
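The fencing part is easiest to see with the buffer split into several regions that are used round-robin, one per frame. The sketch below only simulates this on the CPU: StreamingRing is a made-up class, the std::vector stands in for the persistently mapped pointer from glMapBufferRange, and the inFlight_ flags stand in for GLsync objects that real code would create with glFenceSync and wait on with glClientWaitSync.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulated ring of regions inside one persistently mapped buffer.
class StreamingRing {
public:
    StreamingRing(std::size_t regionBytes, std::size_t regionCount)
        : regionBytes_(regionBytes),
          mapped_(regionBytes * regionCount),
          inFlight_(regionCount, false) {}

    // If the next region is free, stores its byte offset and returns true.
    // Returns false while the (simulated) GPU may still be reading it;
    // real code would wait on that region's fence here instead.
    bool tryBeginWrite(std::size_t& offsetOut) const {
        if (inFlight_[next_]) return false;
        offsetOut = next_ * regionBytes_;
        return true;
    }

    // Call after issuing the draw that reads the region ("place a fence"),
    // then advance to the next region of the ring.
    void submit() {
        inFlight_[next_] = true;
        next_ = (next_ + 1) % inFlight_.size();
    }

    // Simulates the GPU signalling the fence of the given region.
    void gpuFinished(std::size_t index) {
        assert(index < inFlight_.size());
        inFlight_[index] = false;
    }

    unsigned char* data() { return mapped_.data(); }

private:
    std::size_t regionBytes_;
    std::vector<unsigned char> mapped_;  // stand-in for the mapped pointer
    std::vector<bool> inFlight_;         // stand-in for one fence per region
    std::size_t next_ = 0;
};
```

With three regions the CPU can usually run a frame or two ahead of the GPU before it ever has to wait, which is the point of the technique.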
AFAIK, by creating a dynamic vertex buffer you are giving the graphics driver a hint to place the vertex buffer in memory that is fast for the CPU to write but still reasonably fast for the GPU to read. The driver will usually manage it to minimize GPU stalls by handing out non-overlapping memory areas, so that the CPU can write to one area while the GPU reads from another.
If you do not give a hint, the buffer is assumed to be a static resource, so it will be placed in memory that is fast for the GPU to read and write but very slow for the CPU to write.
I've got a multithreaded OpenGL application using PBOs for data transfers between cpu and gpu.
I have pooled the allocation of PBOs, however, when the pools are empty my non-opengl threads have to block for a while until the OpenGL thread reaches a point where it can allocate the buffers (i.e. finished rendering the current frame). This waiting is causing some lag spikes in certain situations which I'd like to avoid.
Is it possible to allocate PBOs on another thread which are then used by the "main" OpenGL thread?
Yes, you can create objects on one thread which can be used on another. You'll need a second GL context for that thread, one that shares objects with the main context.
That being said, there are two concerns.
First, it depends on what you mean by "allocation of PBOs". You should never be allocating buffer objects in the middle of a frame. You should allocate all the buffers you need up front. When the time comes to use them, then you can simply use what you have.
By "allocate", I mean call glBufferData on a previously allocated buffer using a different size or driver hint than was used before. Or by using glGenBuffers and glDeleteBuffers in any way. Neither of these should happen within a frame.
Second, invalidating the buffer should never cause "lag spikes". By "invalidate", I mean reallocating the buffer with glBufferData using the same size and usage hint, or using glMapBufferRange with the GL_MAP_INVALIDATE_BUFFER_BIT flag. You should look at this page on how to stream buffer object data for details. If you're getting issues, then you're probably on NVIDIA hardware and using the wrong buffer object hint (i.e. use a STREAM hint such as GL_STREAM_DRAW).