How does Vulkan's memory domain operation work - c++

I read a code example about copying images between the CPU and GPU using VK_PIPELINE_STAGE_HOST_BIT. (For simplicity I'll use pseudocode below.)
For GPU->CPU it is like:
1. vkCmdCopyImage(..., src_img, ..., dst_img, ...);
2. vkCmdPipelineBarrier(..., VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_HOST_BIT, ...);
For CPU->GPU it is like:
1. vkCmdPipelineBarrier(..., VK_PIPELINE_STAGE_HOST_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT, ...);
2. vkCmdCopyImage(..., src_img, ..., dst_img, ...);
I can understand the CPU->GPU part: we need the barrier to make src_img visible to the GPU, so we issue the barrier first and then copy src_img to dst_img. But for GPU->CPU, the copy comes first and the barrier afterwards. I wonder how, without using the barrier to make the image visible to the host first, the image copy can succeed. FYI, the code example is from here: https://cpp.hotexamples.com/site/file?hash=0xf064d23ec4332e7951809e5112a592758b3c2c71a4560a9e77da0176b1a9193a (see the functions CopyToImage and CopyFromImage).

I wonder without using the barrier to make the src_img visible to the host first, how can the image copy succeed?
Because the host isn't reading the source image; it's reading the destination image, the one being copied to. It is the copy command that reads the source image, since the copy command is the one performing the copy.
the dest image is in the host memory
No, it is in device memory. You know that because the memory is encapsulated via a VkDeviceMemory object, was allocated via device memory allocation commands, and will be freed by a device memory deallocation command.
This particular allocation of memory is shared with the host, but it belongs to the device.
The copy operation is a device operation. The barrier exposes the results of that operation to the host.
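Here is a minimal sketch of the GPU->CPU direction, assuming hypothetical handles (cmd, srcImage, dstImage, region) created elsewhere, with dstImage a linear image in GENERAL layout, bound to HOST_VISIBLE memory:

#include <vulkan/vulkan.h>

void recordGpuToCpuCopy(VkCommandBuffer cmd, VkImage srcImage, VkImage dstImage,
                        VkImageCopy region)
{
    // 1. The copy itself is a *device* operation, even though dstImage's
    //    memory is host-visible. No host barrier is needed before it.
    vkCmdCopyImage(cmd,
                   srcImage, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                   dstImage, VK_IMAGE_LAYOUT_GENERAL,
                   1, &region);

    // 2. Make the transfer writes available and visible to the host, so the
    //    CPU can read them after waiting on the submission's fence.
    VkImageMemoryBarrier barrier{};
    barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_HOST_READ_BIT;
    barrier.oldLayout = VK_IMAGE_LAYOUT_GENERAL;
    barrier.newLayout = VK_IMAGE_LAYOUT_GENERAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image = dstImage;
    barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,  // producer: the copy
                         VK_PIPELINE_STAGE_HOST_BIT,      // consumer: CPU reads
                         0, 0, nullptr, 0, nullptr, 1, &barrier);
}

After queue submission, the host still has to wait on a fence before reading, and call vkInvalidateMappedMemoryRanges first if the memory type is not HOST_COHERENT.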

Related

Where is the buffer allocated in OpenCL?

I was trying to create a memory buffer in OpenCL with the C++ bindings. The statement looks like:
cl::Buffer buffer(context, CL_MEM_READ_ONLY, sizeof(float) * 100);
This statement confuses me because it doesn't specify which device the memory is allocated on. In principle, the context contains all devices, including the CPU and GPU, on the chosen platform. Is it true that the buffer is put in a common region shared by all the devices?
The spec does not define where the memory is. For the API user, it is "in the context".
If you have only one device, it is probably (99.99%) going to be on that device. (In rare cases it may be in host memory if the device does not have enough memory at the time.)
In the case of many different devices, it will be on one of them at creation, but it may move transparently to another device depending on the kernel launches.
This is the reason why the call clEnqueueMigrateMemObjects (introduced in OpenCL 1.2) exists.
It allows the user to give the API hints about where the memory will be needed, so it can prepare the copy in advance.
Here is the definition of what it does:
clEnqueueMigrateMemObjects provides a mechanism for assigning which device an OpenCL memory object resides on. A user may wish to have more explicit control over the location of their memory objects on creation. This could be used to:
Ensure that an object is allocated on a specific device prior to usage.
Preemptively migrate an object from one device to another.
Typically, memory objects are implicitly migrated to a device for which enqueued commands, using the memory object, are targeted.
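As a rough sketch of how such a migration hint could be issued (queue_gpu and buf are hypothetical handles, with queue_gpu targeting the device of the next kernel launch; error handling is omitted):

#include <CL/cl.h>

void prefetchToDevice(cl_command_queue queue_gpu, cl_mem buf)
{
    // Hint to the runtime: move buf to queue_gpu's device now, so the later
    // kernel launch does not pay for the migration. 0 flags = normal migrate.
    cl_mem objects[] = { buf };
    cl_int err = clEnqueueMigrateMemObjects(queue_gpu,
                                            1, objects,
                                            0,            // flags
                                            0, nullptr,   // no wait list
                                            nullptr);     // no event out
    (void)err; // real code would check for CL_SUCCESS
}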

Is it necessary to enqueue read/write when using CL_MEM_USE_HOST_PTR?

Assume that I am wait()ing for the kernel to compute the work.
I was wondering whether, when allocating a buffer using the CL_MEM_USE_HOST_PTR flag, it is necessary to use enqueueRead/Write on the buffer, or whether they can always be omitted.
Note: I am aware of this note in the reference:
Calling clEnqueueReadBuffer to read a region of the buffer object with the ptr argument value set to host_ptr + offset, where host_ptr is a pointer to the memory region specified when the buffer object being read is created with CL_MEM_USE_HOST_PTR, must meet the following requirements in order to avoid undefined behavior:
All commands that use this buffer object have finished execution before the read command begins execution
The buffer object is not mapped
The buffer object is not used by any command-queue until the read command has finished execution
So, to clarify my question, I split it into two:
if I create a buffer using CL_MEM_USE_HOST_PTR flag, can I assume the OpenCL implementation will write to device cache when necessary, so I can always avoid to enqueueWriteBuffer()?
if I call event.wait() after launching a kernel, can I always avoid to enqueueReadBuffer() to access computed data on a buffer created with flag CL_MEM_USE_HOST_PTR?
Maybe I am overthinking it, but even though the description of the flag is clear about the fact that host memory will be used to store the data, it is not clear (or I did not find where it is clarified) when the data is available and whether the read/write is always implicit.
You'll never have to use enqueueWriteBuffer(); however, you do have to use enqueueMapBuffer().
See http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf page 89 (it's the same in 1.1 as well).
The data is available only after you have mapped the object, and it becomes undefined again after you unmap it. This old thread http://www.khronos.org/message_boards/showthread.php/6912-Clarify-CL_MEM_USE_HOST_PTR also contains a rather useful description.
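A minimal sketch of that map/unmap pattern (ctx and queue are hypothetical, pre-created handles; most error handling is omitted):

#include <CL/cl.h>
#include <vector>

void readResults(cl_context ctx, cl_command_queue queue, size_t n)
{
    std::vector<float> host_data(n, 0.0f);
    cl_int err = CL_SUCCESS;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                n * sizeof(float), host_data.data(), &err);

    // ... enqueue kernels that write buf, then wait for them ...

    // Even with CL_MEM_USE_HOST_PTR, the spec only guarantees the host sees
    // up-to-date data between a map and the matching unmap.
    float* ptr = static_cast<float*>(
        clEnqueueMapBuffer(queue, buf, CL_TRUE /* blocking */, CL_MAP_READ,
                           0, n * sizeof(float), 0, nullptr, nullptr, &err));
    // ... read ptr[0..n-1] here ...
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, nullptr, nullptr);
    clReleaseMemObject(buf);
}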

The speed of read and write system calls on a shared memory object compared with memcpy

I'm using shared memory (with a semaphore) for communication between two processes:
First, I open the shared memory object using the call:
int fd = shm_open("name") [http://linux.die.net/man/3/shm_open]
Second, I map this shared memory object into my address space using the call:
void* ptr = mmap(..fd..) [http://linux.die.net/man/2/mmap2]
However, I want to use EPOLL in conjunction with the shared memory file descriptor, so I don't use mmap anymore; instead I use EPOLL for monitoring, and then the read and write functions for direct access to the shared memory through fd (the shared memory file descriptor).
My question is: how does the speed of directly reading and writing the shared memory object compare with memcpy on the pointer returned by mmap?
read(fd, buffer) vs memcpy(des, source, size) //???
Hope to see your answer! Thanks!
read is a syscall and implies a privilege transition, which implies address space manipulation (MMU); the kernel then essentially performs a memcpy between the shared memory object and your provided buffer. It basically does the same thing you would do (call memcpy), but adds two expensive operations (the privilege transitions) and a cheap one (locating the source address).
We can conclude that read/write is very likely to be slower.
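To make the two paths concrete, here is a rough sketch ("/name" and SIZE are placeholders; error checks are omitted):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

constexpr size_t SIZE = 4096;

void compare_paths(char* dst)
{
    int fd = shm_open("/name", O_RDONLY, 0600);

    // Path 1: read() - enters the kernel, which then copies into dst.
    // Two privilege transitions per call, on top of the copy itself.
    read(fd, dst, SIZE);

    // Path 2: mmap() once, then a plain memcpy in user space - no syscall
    // per access after the initial mapping.
    void* ptr = mmap(nullptr, SIZE, PROT_READ, MAP_SHARED, fd, 0);
    std::memcpy(dst, ptr, SIZE);
    munmap(ptr, SIZE);
    close(fd);
}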

When can I release a source PBO?

I'm using PBOs to asynchronously move data between my CPU and GPU.
When moving from the GPU, I know I can delete the source texture after I have called glMapBuffer on the PBO.
However, what about the other way around? When do I know that the transfer from the PBO to the texture (glTexSubImage2D(..., NULL)) is done, so that I can safely release or reuse the PBO? Is it as soon as I bind the texture, or something else?
I think after calling glTexImage you are safe to delete or reuse the buffer without errors, as the driver handles everything for you, including deferred destruction (that's the advantage of buffer objects). But this means that calls to glMapBuffer may block until the preceding glTexImage copy has completed. If you want to reuse the buffer and just overwrite its whole content, it is common practice to reallocate it with glBufferData before calling glMapBuffer. This way the driver knows you don't care about the previous content anymore, and it can allocate a new buffer that you can use immediately (the memory containing the previous content is then freed by the driver when it is really not used anymore). Just keep in mind that your buffer object is just a handle to memory that the driver can manage and copy as it likes.
EDIT: This means that in the other direction (GPU->CPU) you can delete the source texture after glGetTexImage has returned, as the driver manages everything behind the scenes. The decision to use buffer objects or not should not have any implications on the order and time in which you call GL functions. Keep in mind that calling glDelete* does not immediately delete an object; it just enqueues this command into the GL command stream, and even then it's up to the driver when it really frees any memory.
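A sketch of the orphaning idiom described above (pbo, tex, pixels, w and h are placeholders for objects the application already owns; a loader such as GLEW is assumed):

#include <GL/glew.h>
#include <cstring>

void uploadViaPbo(GLuint pbo, GLuint tex, const void* pixels, int w, int h)
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);

    // "Orphan" the buffer: the driver can hand us fresh storage immediately
    // instead of blocking until the previous glTexSubImage copy finishes.
    glBufferData(GL_PIXEL_UNPACK_BUFFER, w * h * 4, nullptr, GL_STREAM_DRAW);

    void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    std::memcpy(dst, pixels, static_cast<size_t>(w) * h * 4);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    // The texture update sources from the bound PBO, so the data pointer is
    // interpreted as an offset into the buffer (here: 0).
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}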

Using shared memory under Windows. How to pass different data

I am currently trying to implement some interprocess communication using the Windows CreateFileMapping mechanism. I know that I need to create a file mapping object with CreateFileMapping first and then create a pointer to the actual data with MapViewOfFile. The example then puts data into the mapped file using CopyMemory.
In my application I have an image buffer (1 MB large) which I want to send to another process. So now I acquire a pointer to the image and then copy the whole image buffer into the mapped file. But I wonder if this is really necessary. Isn't it possible to just copy the actual pointer into the shared memory, pointing at the image buffer data? I tried a bit but didn't succeed.
Different processes have different address spaces. If you pass a valid pointer in one process to another process, it will probably point to random data in the second process. So you will have to copy all the data.
I strongly recommend you use Boost::interprocess. It has lots of goodies to manage this kind of stuff & even includes some special Windows-only functions in case you need to interoperate w/ other processes that use particular Win32 features.
The most important thing is to use offset pointers rather than regular pointers. Offset pointers are basically relative pointers (they store the difference between where the pointer is and where the thing pointed to is). This means that even if the two pointers are mapped to different address spaces, as long as the mappings are identical in structure then you are fine.
I've used all kinds of complicated data structures with offset smart pointers and it worked like a charm.
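For illustration, a minimal sketch with Boost.Interprocess ("MySegment" and Node are made up for this example; segment cleanup is omitted):

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/offset_ptr.hpp>

namespace bip = boost::interprocess;

// A node whose link survives being mapped at different base addresses in
// different processes, because offset_ptr stores a relative distance.
struct Node {
    int value;
    bip::offset_ptr<Node> next;
};

int main() {
    bip::managed_shared_memory segment(bip::open_or_create, "MySegment", 65536);
    Node* a = segment.construct<Node>("head")();
    Node* b = segment.construct<Node>("tail")();
    a->value = 1;
    b->value = 2;
    a->next = b;   // stored as an offset, not an absolute address
}

A raw Node* written into the segment would only be valid in the process that stored it; the offset_ptr stays valid in every process that maps the segment.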
Shared memory doesn't mean sending and receiving data. It is a region of memory created to be accessible to a number of processes. To keep it consistent you have to follow some mechanism, such as locks, so that the data does not get corrupted.
In process 1:
CreateFileMapping(): creates the shared memory block, with the name provided in the last parameter, if it is not already present, and returns a handle (you may call it a pointer) if successful.
MapViewOfFile(): maps (includes) this shared block into the process address space and returns a handle (again, you may call it a pointer).
Only with this pointer returned by MapViewOfFile() can you access that shared block.
In process 2:
OpenFileMapping(): if the shared memory block was successfully created by CreateFileMapping(), you can open it using the same name (the name used to create the shared memory block).
UnmapViewOfFile(): unmaps the shared memory block (removes it from that process's address space). When you are done using the shared memory (i.e. access, modification etc.), call this function.
CloseHandle(): finally, to detach the shared memory block from the process, call this with the handle returned by OpenFileMapping() or CreateFileMapping() as its argument.
Though these functions look simple, the behaviour is tricky if the flags are not selected properly.
If you wish to read and write shared memory, specify PAGE_READWRITE (or PAGE_EXECUTE_READWRITE if you also need execute access) in CreateFileMapping().
Whenever you wish to access shared memory after creating it successfully, use FILE_MAP_ALL_ACCESS in MapViewOfFile().
It is better to specify FALSE (do not inherit the handle from the parent process) in OpenFileMapping(), as it avoids confusion.
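Putting the producer side together as a rough sketch ("Local\\ImageShare" and the 1 MB size are placeholders; error checks and synchronization are omitted):

#include <windows.h>
#include <cstring>

const char* kName = "Local\\ImageShare";
const DWORD kSize = 1024 * 1024;  // the 1 MB image buffer from the question

void publishImage(const void* imageData)
{
    // Backed by the paging file (INVALID_HANDLE_VALUE), not a real file.
    HANDLE hMap = CreateFileMappingA(INVALID_HANDLE_VALUE, nullptr,
                                     PAGE_READWRITE, 0, kSize, kName);
    void* view = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, kSize);
    std::memcpy(view, imageData, kSize);   // the unavoidable copy
    // ... signal the consumer (event/semaphore), then later:
    UnmapViewOfFile(view);
    CloseHandle(hMap);
}

// The consumer would call OpenFileMappingA(FILE_MAP_ALL_ACCESS, FALSE, kName)
// followed by the same MapViewOfFile call to see the bytes.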
You CAN get shared memory to use the same address across two processes on Windows. It's achievable with several techniques.
One is using MapViewOfFileEx; here's the significant excerpt from MSDN:
If a suggested mapping address is supplied, the file is mapped at the specified address (rounded down to the nearest 64K boundary) if there is enough address space at the specified address. If there is not enough address space, the function fails. Typically, the suggested address is used to specify that a file should be mapped at the same address in multiple processes. This requires the region of address space to be available in all involved processes. No other memory allocation can take place in the region that is used for mapping, including the use of the VirtualAlloc or VirtualAllocEx function to reserve memory.
If the lpBaseAddress parameter specifies a base offset, the function succeeds if the specified memory region is not already in use by the calling process. The system does not ensure that the same memory region is available for the memory mapped file in other 32-bit processes.
Another related technique is to use a DLL with a section marked read + write + shared. In this case, the OS will pretty much do the MapViewOfFileEx call for you and for any other process which loads the DLL.
You may have to mark your DLL with a FIXED load address, not relocatable, etc., naturally.
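A sketch of the suggested-address call itself (the base address 0x20000000 is purely illustrative, and real code must handle a nullptr result):

#include <windows.h>

void* mapAtFixedAddress(HANDLE hMap)
{
    // Must be 64K-aligned and use the same value in every process.
    void* const kSuggestedBase = reinterpret_cast<void*>(0x20000000);
    void* view = MapViewOfFileEx(hMap, FILE_MAP_ALL_ACCESS,
                                 0, 0,            // offset high/low
                                 0,               // map the whole object
                                 kSuggestedBase);
    // If the region is already in use in this process, view is nullptr and
    // the caller must fall back to copying or to offset-based pointers.
    return view;
}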
You can use marshalling of pointers.
If it's possible, it would be best to have the image data loaded/generated directly into the shared memory area. This eliminates the memory copy and puts it directly where it needs to be. When it's ready you can signal the other process, giving it the offset into your shared memory where the data begins.