OpenCL Buffer Instantiation in a Multi-Device Environment - C++

I'm wondering how system-side cl::Buffer objects are instantiated in a multi-device context.
Let's say I have an OCL environment class, which generates, from cl::Platform, ONE cl::Context:
this->ocl_context = cl::Context(CL_DEVICE_TYPE_GPU, con_prop);
And then, the corresponding set of devices:
this->ocl_devices = this->ocl_context.getInfo<CL_CONTEXT_DEVICES>();
I generate one cl::CommandQueue object and one set of cl::Kernel(s) for EACH device.
Let's say I have 4 GPUs of the same type. Now I have 4x cl::Device objects in ocl_devices.
Now, what happens when I have a second handler class to manage computation on each device:
oclHandler h1(cl::Context* c, cl::CommandQueue cq1, std::vector<cl::Kernel*> k1);
...
oclHandler h2(cl::Context* c, cl::CommandQueue cq4, std::vector<cl::Kernel*> k4);
And then INSIDE EACH CLASS, I both instantiate:
oclHandler::classMemberFunction(
    ...
    this->buffer = cl::Buffer(
        *(this->ocl_context),
        CL_MEM_READ_WRITE,
        mem_size,
        NULL,
        NULL
    );
    ...
)
and afterwards write to
oclHandler::classMemberFunction(
    ...
    this->ocl_cq.enqueueWriteBuffer(
        this->buffer,
        CL_FALSE,
        static_cast<unsigned int>(0),
        mem_size,
        ptr,
        NULL,
        NULL
    );
    ...
    this->ocl_cq.finish();
    ...
)
each buffer. My concern is that, because the instantiation is against a cl::Context and not tied to a particular device, memory might end up being assigned four times over on each device. I can't determine when the operation that says "on the device, this buffer runs from 0xXXXXXXXXXXXXXXXX for N bytes" actually occurs.
Should I instantiate one context per device? That seems unnatural, because I'd have to instantiate a context, see how many devices there are, and then instantiate d-1 more contexts... which seems inefficient. My concern is with limiting the memory available device-side. I am doing computations on massive data sets, and I'd probably be using all of the 6 GB available on each card.
Thanks.
EDIT: Is there a way to instantiate a buffer and fill it asynchronously without using a command queue? Let's say I have 4 devices and one host-side buffer full of static, read-only data, around 500 MB in size. If I want to just use clCreateBuffer, with
shared_buffer = new cl::Buffer(
    this->ocl_context,
    CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,
    total_mem_size,
    ptr,
    NULL
);
that will start a blocking write, where I cannot do anything host-side until all of ptr's contents are copied to the newly allocated memory. I have a multithreaded device-management system and I've created one cl::CommandQueue for each device, always passing along &shared_buffer for every kernel::setArg call that needs it. I'm having a hard time wrapping my head around what to do.

When you have a context that contains multiple devices, any buffers that you create within that context are visible to all of its devices. This means that any device in the context could read from any buffer in the context, and the OpenCL implementation is in charge of making sure the data is actually moved to the correct devices as and when they need it. There are some grey areas around what should happen if multiple devices try to access the same buffer at the same time, but this kind of behaviour is generally avoided anyway.
Although all of the buffers are visible to all of the devices, this doesn't necessarily mean that they will be allocated on all of the devices. All of the OpenCL implementations that I've worked with use an 'allocate-on-first-use' policy, whereby the buffer is allocated on the device only when it is needed by that device. So in your particular case, you should end up with one buffer per device, as long as each buffer is only used by one device.
In theory an OpenCL implementation might pre-allocate all of the buffers on all the devices just in case they are needed, but I wouldn't expect this to happen in reality (and I've certainly never seen it happen). If you are running on a platform that has a GPU profiler available, you can often use the profiler to see when and where buffer allocations and data movement are actually occurring, to convince yourself that the system isn't doing anything undesirable.
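To make the allocate-on-first-use behaviour concrete, here is a minimal sketch (using the cl.hpp C++ bindings; mem_size and host_ptrs are placeholders and error handling is omitted) of one shared context with one queue and one buffer per device, where each buffer is only ever touched from its own queue:
#include <CL/cl.hpp>
#include <vector>

void per_device_buffers(void* host_ptrs[], size_t mem_size)
{
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0 };

    // One context shared by all GPU devices on the platform.
    cl::Context context(CL_DEVICE_TYPE_GPU, props);
    std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

    std::vector<cl::CommandQueue> queues;
    std::vector<cl::Buffer> buffers;

    for (size_t i = 0; i < devices.size(); ++i) {
        // One command queue per device.
        queues.push_back(cl::CommandQueue(context, devices[i]));

        // Created against the context; typically not backed by device
        // memory until some device first uses it.
        buffers.push_back(cl::Buffer(context, CL_MEM_READ_WRITE, mem_size));

        // First use on device i: the implementation allocates the buffer
        // there (and, as long as no other device touches it, only there).
        queues[i].enqueueWriteBuffer(buffers[i], CL_FALSE, 0, mem_size,
                                     host_ptrs[i]);
    }

    for (size_t i = 0; i < queues.size(); ++i)
        queues[i].finish();
}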

Related

concurrent data transfer cuda kernel and host [duplicate]

I have some questions.
I'm currently writing a program using CUDA.
In my program, there is one large piece of data on the host, stored in a std::map<string, vector<int>>.
Using this data, some vector<int>s are copied to the GPU's global memory and processed on the GPU.
After processing, some results are generated on the GPU and these results are copied back to the CPU.
This is my program's overall schedule:
1. cudaMemcpy( ... , cudaMemcpyHostToDevice)
2. kernel function (the kernel can only run once the necessary data has been copied to GPU global memory)
3. cudaMemcpy( ... , cudaMemcpyDeviceToHost)
4. repeat steps 1~3 1000 times (for another data vector)
But I want to reduce the processing time, so I decided to use the cudaMemcpyAsync function in my program.
After searching through some documents and web pages, I realized that to use cudaMemcpyAsync, the host memory holding the data to be copied to the GPU's global memory must be allocated as pinned memory.
But my program uses std::map, so I couldn't turn this std::map data into pinned memory.
So instead, I made a pinned-memory buffer array, and this buffer can always handle every case of copying a vector.
Finally, my program worked like this:
1. Memcpy (copy data from the std::map to the buffer in a loop until the whole data set is in the buffer)
2. cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
3. kernel (the kernel can only be executed once the whole data set has been copied to GPU global memory)
4. cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
5. repeat steps 1~4 1000 times (for another data vector)
And my program became much faster than the previous version.
But the problem (really my curiosity) is at this point.
I tried to make another program in a similar way:
1. Memcpy (copy data from the std::map to the buffer, but only for one vector)
2. cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
3. loop steps 1~2 until the whole data set has been copied to GPU global memory
4. kernel (the kernel can only be executed once the necessary data has been copied to GPU global memory)
5. cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
6. repeat steps 1~5 1000 times (for another data vector)
This method came out to be about 10% faster than the method discussed above.
But I don't know why.
I thought cudaMemcpyAsync could only be overlapped with a kernel function, but in my case that doesn't seem to be what is happening; rather, it looks like cudaMemcpyAsync calls can be overlapped with each other.
Sorry for the long question, but I really want to know why.
Can someone explain exactly what cudaMemcpyAsync does, and which operations can be overlapped with it?
The copying activity of cudaMemcpyAsync (as well as kernel activity) can be overlapped with any host code. Furthermore, data copy to and from the device (via cudaMemcpyAsync) can be overlapped with kernel activity. All 3 activities: host activity, data copy activity, and kernel activity, can be done asynchronously to each other, and can overlap each other.
As you have seen and demonstrated, host activity and data copy or kernel activity can be overlapped with each other in a relatively straightforward fashion: kernel launches return immediately to the host, as does cudaMemcpyAsync. However, to get best overlap opportunities between data copy and kernel activity, it's necessary to use some additional concepts. For best overlap opportunities, we need:
Host memory buffers that are pinned, e.g. via cudaHostAlloc()
Usage of CUDA streams to separate the various types of activity (data copy and kernel computation)
Usage of cudaMemcpyAsync (instead of cudaMemcpy)
Naturally your work also needs to be broken up in a separable way. This normally means that if your kernel is performing a specific function, you may need multiple invocations of this kernel so that each invocation can be working on a separate piece of data. This allows us to copy data block B to the device while the first kernel invocation is working on data block A, for example. In so doing we have the opportunity to overlap the copy of data block B with the kernel processing of data block A.
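As a rough sketch of that pattern (not taken from the question's code: myKernel is a hypothetical kernel, total is assumed to be a multiple of chunk, and error checking is omitted), pinned host buffers plus two streams let the copy of one chunk overlap the kernel working on another:
#include <cuda_runtime.h>
#include <cstring>

__global__ void myKernel(const int* in, int* out, int n);  // hypothetical kernel

void process_in_chunks(const int* h_src, int* h_dst, int total, int chunk)
{
    int *h_in, *h_out, *d_in, *d_out;

    // Pinned host buffers: required for cudaMemcpyAsync to be truly asynchronous.
    cudaHostAlloc((void**)&h_in,  total * sizeof(int), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_out, total * sizeof(int), cudaHostAllocDefault);
    memcpy(h_in, h_src, total * sizeof(int));

    cudaMalloc((void**)&d_in,  total * sizeof(int));
    cudaMalloc((void**)&d_out, total * sizeof(int));

    // Two streams: while one stream's kernel runs, the other stream's copies can proceed.
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int off = 0, i = 0; off < total; off += chunk, ++i) {
        cudaStream_t s = streams[i % 2];
        // Copy chunk, run kernel on it, copy the result back, all in one stream.
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(int),
                        cudaMemcpyHostToDevice, s);
        myKernel<<<(chunk + 255) / 256, 256, 0, s>>>(d_in + off, d_out + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(int),
                        cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();

    memcpy(h_dst, h_out, total * sizeof(int));

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFreeHost(h_in);  cudaFreeHost(h_out);
    cudaFree(d_in);      cudaFree(d_out);
}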
The main differences with cudaMemcpyAsync (as compared to cudaMemcpy) are that:
It can be issued in any stream (it takes a stream parameter)
Normally, it returns control to the host immediately (just like a kernel call does) rather than waiting for the data copy to be completed.
Item 1 is a necessary feature so that data copy can be overlapped with kernel computation. Item 2 is a necessary feature so that data copy can be overlapped with host activity.
Although the concepts of copy/compute overlap are pretty straightforward, in practice the implementation requires some work. For additional references, please refer to:
Overlap copy/compute section of the CUDA best practices guide.
Sample code showing a basic implementation of copy/compute overlap.
Sample code showing a full multi/concurrent kernel copy/compute overlap scenario.
Note that some of the above discussion is predicated on having a device of compute capability 2.0 or greater (e.g. for concurrent kernels). Also, different devices may have one or two copy engines, meaning that simultaneous copy to the device and copy from the device is only possible on certain devices.

OpenCL Pipeline failed to allocate buffer with cl_mem_object_allocation_failure

I have an OpenCL pipeline that processes images/video, and it can sometimes be greedy with memory. It crashes on a cl::Buffer() allocation like this:
cl_int err = CL_SUCCESS;
cl::Buffer tmp = cl::Buffer(m_context, CL_MEM_READ_WRITE, sizeData, NULL, &err);
with the error -4 - cl_mem_object_allocation_failure.
This occurs at a fixed point in my pipeline when using very large images. If I just downscale the image a bit, it passes through this very memory-intensive part of the pipeline.
I have access to an Nvidia card with 4 GB that busts at a certain point, and I have also tried an AMD GPU with 2 GB, which busts earlier.
According to this thread, there is no need to know the current allocation because of swapping with VRAM, but it seems that my pipeline exhausts the memory of my device.
So here are my questions:
1) Are there any settings on my computer, or in my pipeline, that would allow more VRAM to be used?
2) Is it okay to use CL_DEVICE_GLOBAL_MEM_SIZE as the reference for the maximum size to allocate, or do I need to use CL_DEVICE_GLOBAL_MEM_SIZE - (local memory + private memory), or something like that?
According to my own memory profiler, I have 92% of CL_DEVICE_GLOBAL_MEM_SIZE allocated at the crash. After resizing a bit, it reports that I used 89% on the resized image and it passed, so I assume my large image is right on the edge.
Some parts of your device's VRAM may be used for the pixel buffer, constant memory, or other uses. For AMD cards, you can set the environment variables GPU_MAX_HEAP_SIZE and GPU_MAX_ALLOC_PERCENT to use a larger part of the VRAM, though this may have unintended side effects. Both are expressed as percentages of the physically available memory on the card.
Additionally, there is a limit on the size of each individual memory allocation. You can get the maximum size for a single allocation by querying CL_DEVICE_MAX_MEM_ALLOC_SIZE, which may be less than CL_DEVICE_GLOBAL_MEM_SIZE. For AMD cards, this size can be controlled with GPU_SINGLE_ALLOC_PERCENT.
None of this requires changes to your code; simply set the variables before you call your executable:
GPU_MAX_ALLOC_PERCENT="100"
GPU_MAX_HEAP_SIZE="100"
GPU_SINGLE_ALLOC_PERCENT="100"
./your_program
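For question 2), a small sketch (OpenCL C++ bindings; the function name is a placeholder) of querying both limits so you can sanity-check a requested allocation before creating the buffer:
#include <CL/cl.hpp>

// Returns true if a single buffer of the requested size is even allowed on
// this device; staying under CL_DEVICE_GLOBAL_MEM_SIZE across all buffers is
// a separate (and in practice somewhat stricter) constraint.
bool allocation_allowed(const cl::Device& device, cl_ulong requested_bytes)
{
    cl_ulong global_mem = device.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>();
    cl_ulong max_alloc  = device.getInfo<CL_DEVICE_MAX_MEM_ALLOC_SIZE>();

    return requested_bytes <= max_alloc && requested_bytes <= global_mem;
}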

Deriving the `VkMemoryRequirements`

Is there a way to get the right values for a VkMemoryRequirements structure without having to allocate a buffer first and without using vkGetBufferMemoryRequirements?
Is it supported/compliant?
Motivation - Short version
I have an application that does the following, and everything works as expected.
VkMemoryRequirements memReq;
vkGetBufferMemoryRequirements(application.shell->device, uniformBuffer, &memReq);
int memType = application.shell->findMemoryType(memReq.memoryTypeBits, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
Internally, findMemoryType loops over the memory types and checks that they have the required property flags.
If I replace the call to vkGetBufferMemoryRequirements with hardcoded values (which are not portable, are specific to my system, and were obtained through debugging), everything still works and I don't get any validation errors.
VkMemoryRequirements memReq = { 768, 256, 1665 };
int memType = application.shell->findMemoryType(memReq.memoryTypeBits, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
The above code is IMHO neat, because it enables you to pre-allocate memory before you actually need it.
Motivation - Long version
In Vulkan you create buffers which initially are not backed by device memory and at a later stage you allocate the memory and bind it to the buffer using vkBindBufferMemory:
VkResult vkBindBufferMemory(
    VkDevice        device,
    VkBuffer        buffer,
    VkDeviceMemory  memory,
    VkDeviceSize    memoryOffset);
The Vulkan spec states that:
memory must have been allocated using one of the memory types allowed
in the memoryTypeBits member of the VkMemoryRequirements structure
returned from a call to vkGetBufferMemoryRequirements with buffer
Which implies that before allocating the memory for a buffer, you should have already created the buffer.
I have a feeling that in some circumstances it would be useful to pre-allocate a chunk of memory before you actually need it; in most of the OpenGL flavors I have experience with, this was not possible, but Vulkan should not suffer from this limitation, right?
Is there a (more or less automatic) way to get the memory requirements before creating the first buffer?
Is it supported/compliant?
Obviously, when you do allocate the memory for the first buffer you can allocate a little more so that when you need a second buffer you can bind it to another range in the same chunk. But my understanding is that, to comply with the spec, you would still need to call vkGetBufferMemoryRequirements on the second buffer, even if it is exactly the same type and the same size as the first one.
This question already recognizes that the answer is "no"; you just seem to want to do an end-run around what you already know. Which you can't.
The code you showed with the hard-coded values works because you already know the answer. It's not that Vulkan requires you to ask the question; Vulkan requires you to provide buffers that use the answer.
However, since "the answer" is implementation-specific, it changes depending on the hardware. It could change when you install a new driver. Indeed, it could change even depending on which extensions or Vulkan features you activate when creating the VkDevice.
That having been said:
Which implies that before allocating the memory for a buffer, you should have already created the buffer.
Incorrect. It requires that you have the answer and have selected memory and byte offsets appropriate to that answer. But Vulkan is specifically loose about what "the answer" actually means.
Vulkan has specific guarantees in place which allow you to know the answer for a particular buffer/image without necessarily having asked about that specific VkBuffer/Image object. The details are kind of complicated, but for buffers they are pretty lax.
The basic idea is that you can create a test VkBuffer/Image and ask about its memory properties. You can then use that answer to know what the properties of the buffers you intend to use which are "similar" to that. At the very least, Vulkan guarantees that two identical buffer/images (formats, usage flags, sizes, etc) will always produce identical memory properties.
But Vulkan also offers a few other guarantees. There are basically 3 things that the memory properties tell you:
The memory types that this object can be bound to.
The alignment requirement for the offset for the memory object.
The byte size the object will take up in memory.
For the size, you get only the most basic guarantee: equivalent buffer/images will produce equivalent sizes.
For the alignment, images are as strict as sizes: only equivalent images are guaranteed to produce equivalent alignment. But for buffers, things are more lax. If the test buffer differs only in usage flags, and the final buffer uses a subset of the usage flags, then the alignment for the final buffer will not be more restrictive than the test buffer. So you can use the alignment from the test buffer.
For the memory types, things are even more loose. For images, the only things that matter are:
Tiling
Certain memory flags (sparse/split-instance binding)
Whether the image format is color or depth/stencil
If the image format is depth/stencil, then the formats must match
External memory
Transient allocation usage
If all of these are the same for two VkImage objects, then the standard guarantees that all such images will support the same set of memory types.
For buffers, things are even more lax. For non-sparse buffers, if your test buffer differs from the final buffer only by usage flags, then if the final one has a subset of the usage flags of the test buffer, then the set of memory types it supports must include all of the ones from the test buffer. The final buffer could support more, but it must support at least those of such a test buffer.
Oh, and linear images and buffers must always be able to be used in at least one mappable, coherent memory type. Of course, this requires that you have created a valid VkDevice/Image with those usage and flags fields, so if the device doesn't allow (for example) linear images to be used as textures, then that gets stopped well before asking about memory properties.
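To illustrate the "test buffer" approach, here is a minimal sketch (error handling omitted; size and usage are whatever superset you intend to use later): create a throwaway buffer, query its requirements, destroy it, and reuse the answer for later, compatible buffers:
#include <vulkan/vulkan.h>

// Query memory requirements via a temporary buffer; per the guarantees
// above, the answer also applies to later buffers that are equivalent
// (or that use a subset of these usage flags, for alignment/memory types).
VkMemoryRequirements queryBufferRequirements(VkDevice device,
                                             VkDeviceSize size,
                                             VkBufferUsageFlags usage)
{
    VkBufferCreateInfo info = {};
    info.sType       = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
    info.size        = size;
    info.usage       = usage;               // superset of intended usage
    info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

    VkBuffer testBuffer = VK_NULL_HANDLE;
    vkCreateBuffer(device, &info, nullptr, &testBuffer);

    VkMemoryRequirements memReq = {};
    vkGetBufferMemoryRequirements(device, testBuffer, &memReq);

    vkDestroyBuffer(device, testBuffer, nullptr);  // test buffer no longer needed
    return memReq;
}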

How do UBOs/SSBOs differ from Vulkan's shader memory bindings?

In the article on Imagination's website, I've read the following paragraph:
For example, there are no glUniform*() equivalent entry points in Vulkan; instead, writing to GPU memory is the only way to pass data to shaders.
When you call glUniform*(), the OpenGL ES driver typically needs to allocate a driver managed buffer and copy data to it, the management of which incurs CPU overhead. In Vulkan, you simply map the memory address and write to that memory location directly.
Is there any difference between that and using Uniform Buffers? They are also allocated explicitly and can carry arbitrary data. Since Uniform Buffers are quite limited in size, perhaps Shader Storage Buffers are a better analogy.
From what I understand, this is not glUniform*() specific: glUniform*() is merely an example used by the author of the article to illustrate the way Vulkan works with regards to communication between the host and the GPU.
When you call glUniform*(), the OpenGL ES driver typically needs to allocate a driver managed buffer and copy data to it, the management of which incurs CPU overhead.
In this scenario, when a user calls glUniform*() with some data, that data is first copied to a buffer owned by the OpenGL implementation. This buffer is probably pinned, and can then be used by the driver to transfer the data through DMA to the device. That's two steps:
Copy user data to driver buffer;
Transfer buffer contents to GPU through DMA.
In Vulkan, you simply map the memory address and write to that memory location directly.
In this scenario, there is no intermediate copy of the user data. You ask Vulkan to map a region into the host's virtual address space, which you directly write to. The data gets to the device through DMA in a completely transparent way for the user.
From a performance standpoint, the benefits are obvious: zero copy. It also means the Vulkan implementation can be simpler, as it does not need to manage an intermediate buffer.
As the specs have not been released yet, here's a fictitious example of what it could look like:
// Assume Lights is some kind of handle to your buffer/data
float4* lights = vkMap(Lights);
for (int i = 0; i < light_count; ++i) {
    // Goes directly to the device
    lights[i] = make_light(/* stuff */);
}
vkUnmap(lights);
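For comparison, a sketch of the same pattern against the released API (assuming the memory was allocated from a HOST_VISIBLE and HOST_COHERENT memory type; Light, make_light and lightMemory are placeholders):
#include <vulkan/vulkan.h>
#include <cstdint>

struct Light { float position[4]; float color[4]; };  // hypothetical layout
Light make_light(/* stuff */);                        // hypothetical helper

void uploadLights(VkDevice device, VkDeviceMemory lightMemory,
                  VkDeviceSize size, uint32_t light_count)
{
    // Map the allocation into host address space and write to it directly;
    // with a HOST_COHERENT memory type no explicit flush is needed.
    void* mapped = nullptr;
    vkMapMemory(device, lightMemory, 0, size, 0, &mapped);

    Light* lights = static_cast<Light*>(mapped);
    for (uint32_t i = 0; i < light_count; ++i)
        lights[i] = make_light(/* stuff */);

    vkUnmapMemory(device, lightMemory);
}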

Read from a socket without the associated memcpy from kernel space to user space

In Linux, is there a way to read from a socket while avoiding the implicit memcpy of the data from kernel space to user space?
That is, instead of doing
ssize_t n = read(socket_fd, buffer, count);
which obviously requires the kernel to do a memcpy from the network buffer into my supplied buffer, I would do something like
ssize_t n = fancy_read(socket_fd, &buffer, count);
and on return have buffer pointing to non memcpy()'ed data received from the network.
Initially I thought the AF_PACKET socket family could be of help, but it cannot.
Nevertheless, it is technically possible, as there is nothing preventing you from implementing a kernel module that handles a system call returning a user-mapped pointer to kernel data (even if that is not very safe).
There are a couple of questions regarding the call you would like to have:
Memory management: how would you know the memory can still be accessed after the fancy_read system call returns?
How would you tell the kernel when it can eventually free that memory? There would need to be some form of memory management in place, and if you wanted the kernel to give you a safe pointer to non-memcpy'd memory, a lot of changes would need to go into the kernel to enable this feature. Just imagine that none of that data could be freed until you say it can; the kernel would need to keep track of all of these returned pointers.
This could be done in a lot of ways, so basically yes, it is possible, but you need to take many things into consideration.