OpenCL: parallel write from host to device buffer? - c++

I have a cl_mem buffer that's quite large (100 million floats). I'm trying to decrease the amount of time it takes to fill it with data from the host (I have to pass data from host to device many times, and currently I re-initialize the buffer each time).
Instead of initializing with clCreateBuffer/CL_MEM_COPY_HOST_PTR over and over, it seems it would be more efficient to initialize the buffer once, and then update its data with a multi-threaded approach each subsequent time (so multiple CPU threads each update subsets of the data simultaneously).
Is such an approach possible? I've looked into clEnqueueWriteBuffer, and while it allows a subset of a buffer to be updated, it seems like multiple calls to it would still be executed sequentially by the command queue. Do I need multiple command queues? Is this approach even possible?

It's not entirely clear from your question whether your initialisation/update would be the same every time, or whether the whole of the buffer needs updating between runs. Obviously the easiest way to speed things up will be to remove any duplication of effort and avoid copying the same data multiple times.
Do your measurements suggest that you are not limited by the interface between your CPU and device? Because if you need to copy N MB every time, your device is connected to CPU/system memory by a B MB/s interface, and your copying time is not wildly more than N/B seconds, no amount of multithreading is going to help you.
If you are limited by the sequential nature of some CPU calculation and the subsequent copy to the buffer, you could use the asynchronous (non-blocking) variant of clEnqueueWriteBuffer() to start copying the first chunk of data while calculating the next, etc. Note that clEnqueueWriteBuffer()/CL_MEM_COPY_HOST_PTR typically makes use of the device's DMA engines, which usually don't require much intervention from the host CPUs, and so the transfer can run entirely in parallel with calculations. (Host memory bandwidth is of course shared, as always.)
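To make the idea concrete, here is a minimal sketch of that double-buffered, non-blocking write pattern against the OpenCL C API; produce_chunk(), the chunk size, and the two staging buffers are my own illustrative assumptions, not anything from the question:

#include <CL/cl.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical host-side routine that generates/unpacks one chunk of the data.
void produce_chunk(float* dst, std::size_t first_element, std::size_t count);

void upload_in_chunks(cl_command_queue queue, cl_mem device_buf,
                      std::size_t total_floats, std::size_t chunk_floats) {
    // Two host staging buffers: the CPU fills one while the other is in flight.
    std::vector<float> staging[2] = {std::vector<float>(chunk_floats),
                                     std::vector<float>(chunk_floats)};
    cl_event prev_write = nullptr;

    for (std::size_t offset = 0, i = 0; offset < total_floats;
         offset += chunk_floats, ++i) {
        std::size_t n = std::min(chunk_floats, total_floats - offset);
        std::vector<float>& host_chunk = staging[i % 2];

        produce_chunk(host_chunk.data(), offset, n);   // overlaps the previous transfer

        // Wait for the previous transfer before enqueueing the next one; this also
        // guarantees a staging buffer has been drained by the time we refill it.
        if (prev_write) {
            clWaitForEvents(1, &prev_write);
            clReleaseEvent(prev_write);
        }
        clEnqueueWriteBuffer(queue, device_buf, CL_FALSE /* non-blocking */,
                             offset * sizeof(float), n * sizeof(float),
                             host_chunk.data(), 0, nullptr, &prev_write);
    }
    if (prev_write) {
        clWaitForEvents(1, &prev_write);
        clReleaseEvent(prev_write);
    }
}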
If that is too cumbersome for your purposes, it may be useful to use clEnqueueMapBuffer to map the buffer into the host application's address space. This allows any number of threads to access arbitrary areas of it simultaneously. Be aware, however, that this is no silver bullet, and unless your OpenCL implementation explicitly specifies how this is implemented in practice, it can be that you actually make things worse with it because it might end up copying more than previously.
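As a rough illustration of that approach (my sketch, not a recipe from the answer), here is what mapping the whole buffer once and letting several host threads fill disjoint sub-ranges could look like; fill_range() and the thread count are placeholders:

#include <CL/cl.h>
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-thread update of one sub-range of the buffer.
void fill_range(float* dst, std::size_t first_element, std::size_t count);

void fill_via_map(cl_command_queue queue, cl_mem device_buf,
                  std::size_t total_floats, unsigned num_threads) {
    cl_int err = CL_SUCCESS;
    // CL_MAP_WRITE_INVALIDATE_REGION (OpenCL 1.2+) would avoid reading back the
    // old contents when you intend to overwrite everything anyway.
    float* mapped = static_cast<float*>(clEnqueueMapBuffer(
        queue, device_buf, CL_TRUE /* blocking */, CL_MAP_WRITE,
        0, total_floats * sizeof(float), 0, nullptr, nullptr, &err));

    std::vector<std::thread> workers;
    std::size_t per_thread = (total_floats + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * per_thread;
        std::size_t end = std::min(begin + per_thread, total_floats);
        if (begin >= end) break;
        workers.emplace_back([=] { fill_range(mapped + begin, begin, end - begin); });
    }
    for (auto& w : workers) w.join();

    // Unmapping makes the contents visible to the device again.
    clEnqueueUnmapMemObject(queue, device_buf, mapped, 0, nullptr, nullptr);
    clFinish(queue);
}

Whether this beats chunked clEnqueueWriteBuffer() calls depends entirely on how the implementation realises the mapping, so measure both.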
If your device kernels don't actually end up reading all of the buffer (and you just don't know in advance which parts it will need), or possibly if they only read all of it precisely once, in a nice and predictable pattern, but your host code needs to read & write lots or write to random locations, you could try buffers created with CL_MEM_USE_HOST_PTR. This isn't zero-copy in all implementations, but the idea is to give the device direct access to host memory. You're again limited by the device uplink interface bandwidth, and latency is typically much worse than to device memory, but if your device doesn't actually need to read all of it, this could be faster as you don't have to push the whole buffer down the pipe.
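A minimal sketch of what creating such a host-backed buffer looks like; the read-only flag and the alignment caveat are my assumptions rather than part of the answer:

#include <CL/cl.h>
#include <cstddef>

// The host allocation must outlive the buffer, and many implementations want it
// page-aligned for true zero-copy behaviour.
cl_mem create_host_backed_buffer(cl_context context, float* host_data,
                                 std::size_t count, cl_int* err) {
    // Host threads may update host_data between kernel launches, but you must
    // synchronize with the device first (e.g. clFinish or a map/unmap pair),
    // otherwise the device may see stale or partially written data.
    return clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                          count * sizeof(float), host_data, err);
}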
Finally, if your CPUs are somehow preprocessing/unpacking the data, you could try offloading that to the device instead.

Related

How to eager commit allocated memory in C++?

The General Situation
An application that is extremely intensive on bandwidth, CPU usage, and GPU usage needs to transfer about 10-15GB per second from one GPU to another. It's using the DX11 API to access the GPU, so uploads to the GPU can only happen through buffers that require mapping for each single upload. The upload happens in chunks of 25MB at a time, and 16 threads are writing to mapped buffers concurrently. There's not much that can be done about any of this. The actual concurrency level of the writes would be lower, if it weren't for the following bug.
It's a beefy workstation with 3 Pascal GPUs, a high-end Haswell processor, and quad-channel RAM. Not much can be improved on the hardware. It's running a desktop edition of Windows 10.
The Actual Problem
Once I pass ~50% CPU load, something in MmPageFault() (inside the Windows kernel, called when accessing memory which has been mapped into your address space, but was not committed by the OS yet) breaks horribly, and the remaining 50% CPU load is being wasted on a spin-lock inside MmPageFault(). The CPU becomes 100% utilized, and the application performance completely degrades.
I must assume that this is due to the immense amount of memory which needs to be allocated to the process each second and which is also completely unmapped from the process every time the DX11 buffer is unmapped. Correspondingly, there are actually thousands of calls to MmPageFault() per second, happening sequentially as memcpy() writes its way through the buffer, one for each uncommitted page it encounters.
Once the CPU load goes beyond 50%, the optimistic spin-lock in the Windows kernel protecting the page management completely degrades performance-wise.
Considerations
The buffer is allocated by the DX11 driver. Nothing can be tweaked about the allocation strategy. Use of a different memory API and especially re-use is not possible.
Calls to the DX11 API (mapping/unmapping the buffers) all happen from a single thread. The actual copy operations potentially happen multi-threaded across more threads than there are virtual processors in the system.
Reducing the memory bandwidth requirements is not possible. It's a real-time application. In fact, the hard limit is currently the PCIe 3.0 16x bandwidth of the primary GPU. If I could, I would already need to push further.
Avoiding multi-threaded copies is not possible, as there are independent producer-consumer queues which can't be merged trivially.
The spin-lock performance degradation appears to be so rare (because the use case is pushing it that far) that on Google, you won't find a single result for the name of the spin-lock function.
Upgrading to an API which gives more control over the mappings (Vulkan) is in progress, but it's not suitable as a short-term fix. Switching to a better OS kernel is currently not an option for the same reason.
Reducing the CPU load doesn't work either; there is too much work which needs to be done other than the (usually trivial and inexpensive) buffer copy.
The Question
What can be done?
I need to reduce the number of individual pagefaults significantly. I know the address and size of the buffer which has been mapped into my process, and I also know that the memory has not been committed yet.
How can I ensure that the memory is committed with the least amount of transactions possible?
Exotic flags for DX11 which would prevent de-allocation of the buffers after unmapping, Windows APIs to force commit in a single transaction, pretty much anything is welcome.
The current state
// In the processing threads
{
    DX11DeferredContext->Map(..., &buffer);
    std::memcpy(buffer, source, size);
    DX11DeferredContext->Unmap(...);
}
Current workaround, simplified pseudo code:
// During startup
{
    SetProcessWorkingSetSize(GetCurrentProcess(),
                             size_t(2) * 1024 * 1024 * 1024, -1);  // 2 GiB; cast avoids 32-bit overflow
}
// In the DX11 render loop thread
{
    DX11context->Map(..., &resource);
    VirtualLock(resource.pData, resource.size);
    notify();
    wait();
    DX11context->Unmap(...);
}
// In the processing threads
{
    wait();
    std::memcpy(buffer, source, size);
    signal();
}
VirtualLock() forces the kernel to back the specified address range with RAM immediately. The call to the complementary VirtualUnlock() function is optional; it happens implicitly (and at no extra cost) when the address range is unmapped from the process. (If called explicitly, it costs about 1/3rd of the locking cost.)
In order for VirtualLock() to work at all, SetProcessWorkingSetSize() needs to be called first, as the sum of all memory regions locked by VirtualLock() can not exceed the minimum working set size configured for the process. Setting the "minimum" working set size to something higher than the baseline memory footprint of your process has no side effects unless your system is actually under memory pressure and swapping; your process will still not consume more RAM than its actual working set size.
Just the use of VirtualLock(), albeit in individual threads and using deferred DX11 contexts for Map/Unmap calls, instantly decreased the performance penalty from 40-50% to a slightly more acceptable 15%.
Discarding the deferred context, and triggering both all soft faults and the corresponding de-allocation when unmapping exclusively on a single thread, gave the necessary performance boost. The total cost of that spin-lock is now down to <1% of the total CPU usage.
Summary?
When you expect soft faults on Windows, try what you can to keep them all in the same thread. Performing a parallel memcpy itself is unproblematic, and in some situations even necessary to fully utilize the memory bandwidth. However, that only holds if the memory is already committed to RAM. VirtualLock() is the most efficient way to ensure that.
(Unless you are working with an API like DirectX which maps memory into your process, you are unlikely to encounter uncommitted memory frequently. If you are just working with standard C++ new or malloc your memory is pooled and recycled inside your process anyway, so soft faults are rare.)
Just make sure to avoid any form of concurrent page faults when working with Windows.

Fast memory allocation for real time data acquisition

I have a range of sensors connected to a PC that measure various physical parameters, like force, rotational speed and temperature. These sensors continuously produce samples at some sample rate. A sample consists of a timestamp and the measured dimension itself; the sample rates are on the order of single-digit kilohertz (i.e., somewhere between 1 and 9000 samples per second).
The PC is supposed to read and store these samples during a given period of time. Afterwards the collected data is further treated and evaluated.
What would be a sensible way to buffer the samples? In a realistic setup the acquisition could easily gather a couple of megabytes per second. Paging could also become critical if memory is allocated quickly but then has to be swapped in when written to.
I could think of a threaded approach where a separate thread allocates and manages a pool of (locked, so non-swappable) memory chunks. Given that there are always enough of these chunks pre-allocated, further allocation would only block the memory pool's thread (in case other processes' pages have to be swapped out first), and the acquisition could proceed without interruption.
This basically is a conceptual question. Yet, to be more specific:
It should only rely on portable features, like POSIX. Features from Qt's universe are fine, too.
The sensors can be interfaced in various ways. IP is one possibility. Usually the sensors are directly connected to the PC via local links (RS232, USB, extension cards and such). That is, fast enough.
The timestamps are mostly applied by the acquisition hardware itself, if it is capable of doing so, to avoid jitter over the network etc.
Thinking it over
Should I really worry? Apparently the problem splits into three scenarios:
There is only little data collected at all. It can easily be buffered in one large pre-allocated buffer.
Data is collected slowly. Allocating the buffers on the fly is perfectly fine.
So much data is acquired at high sample rates that allocation is not the problem; any buffer will eventually overflow anyway. The problem is rather how to transfer the data from the memory buffer to permanent storage fast enough.
The idea for solving this type of problem can be as follows:
Separate the problem into 2 or more processes, depending on what you need to do with your data:
Acquirer
Analyzer (if you want to process data in real time)
Writer
Store data in a circular buffer in shared memory (I recommend using boost::interprocess).
The Acquirer will continuously read data from the device and store it in shared memory. In the meantime, once enough data has been read to do any analysis, the Analyzer will start processing it. It can store its results in another circular buffer in shared memory if needed. Also in the meantime, the Writer will read the data from shared memory (acquired or already processed) and store it in the output file.
You need to make sure all the processes are synchronized properly so that they do their jobs simultaneously and you don't lose any data (the data must not be overwritten before it is processed or saved to the output file).
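As a rough sketch of the acquirer side of such a scheme using Boost.Interprocess; the struct layout, the names ("acq_shm", "ring"), the capacity, and read_sample_from_device() are illustrative placeholders only, and overflow handling is omitted:

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/sync/interprocess_condition.hpp>
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>
#include <cstddef>

namespace bip = boost::interprocess;

struct Sample { double timestamp; double value; };

// Hypothetical driver call: blocks until the next sample arrives.
Sample read_sample_from_device();

struct Ring {
    static constexpr std::size_t kCapacity = 1 << 16;  // ~64k samples (~1 MiB)
    bip::interprocess_mutex mutex;
    bip::interprocess_condition not_empty;
    std::size_t head = 0;   // written by the Acquirer
    std::size_t tail = 0;   // advanced by the consumer process
    Sample data[kCapacity];
};

int main() {
    // The Analyzer/Writer processes open the same segment with bip::open_only.
    bip::managed_shared_memory shm(bip::open_or_create, "acq_shm",
                                   sizeof(Ring) + 4096);
    Ring* ring = shm.find_or_construct<Ring>("ring")();

    for (;;) {
        Sample s = read_sample_from_device();   // do the slow part outside the lock
        bip::scoped_lock<bip::interprocess_mutex> lock(ring->mutex);
        ring->data[ring->head % Ring::kCapacity] = s;
        ++ring->head;                           // overflow handling omitted for brevity
        ring->not_empty.notify_one();
    }
}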

Writer/Reader buffer mechanism for large size - high freq data c++

I need a single-writer, multiple-reader (up to 5) mechanism, written in C++, where the writer continuously pushes data packages of almost 1 MB each at 15 packages per second. What I'm trying to do is have one thread keep writing the data while 5 readers simultaneously run search operations on it according to the timestamps of the data. I have to keep each data package for 60 min, after which it can be removed from the container.
Since the data grows like 15 MB/s * 60 * 60 = 54000 MB/h, I need roughly 54 GB of space to keep the data and still make the operations fast enough for both the writer and the readers. But the thing is we cannot keep that much data in cache or RAM, so it must go to a hard drive, preferably an SSD (an HDD would be too slow for that kind of operation).
What I've been thinking of so far is either a circular buffer (since I can calculate the maximum size) implemented directly on an SSD, for which I couldn't find a suitable example and I don't know whether it's even possible, or some kind of mapping mechanism where one circular array lives in RAM and only keeps the timestamps of the data plus the physical location of the data on the hard drive. That way at least the search operations would be faster, I guess.
Since any kind of lock, mutex or semaphore will slow down the operations (writes especially are critical; we cannot lose data because of any read operation), I don't want to use them. I know there are shared locks available, but I think they too have some drawbacks. Is there any way/idea to implement such a system that is lock-free, wait-free and thread-safe as well? Any data structure (container), pattern, example code/project or other kind of suggestion will be highly appreciated, thank you…
EDIT: Is there any other idea rather than bigger amount of RAM?
This can be done on a commodity PC (and can scale to a server without code changes).
Locks are not a problem. With a single writer and few consumers that do time-consuming tasks on big data, you will have rare locking and practically zero lock contention, so it's a non-issue.
Anything from a simple spinlock (if you're really desperate for low latency) to, preferably, a pthread_mutex (which behaves like a spinlock most of the time anyway) will do fine. Nothing fancy.
Note that you do not acquire a lock, receive a megabyte of data from a socket, write it to disk, and then release the lock. That's not how it works.
You receive a megabyte of data and write it to a region that you own exclusively, then acquire a lock, change a pointer (and thus transfer ownership), and release the lock. The lock protects the metadata, not every single byte in a gigabyte-sized buffer. Long running tasks, short lock times, contention = zero.
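A bare-bones sketch of that hand-over, using std::mutex for brevity (the answer suggests a pthread_mutex, which is equivalent here); RingMeta, the block counts, and receive_block() are illustrative names of mine:

#include <cstddef>
#include <cstdint>
#include <mutex>

constexpr std::size_t kBlockSize = 1 << 20;     // 1 MiB packages
constexpr std::size_t kNumBlocks = 55'000;      // one hour of data plus spare blocks

struct RingMeta {
    std::mutex mutex;        // guards begin/end only, never the payload bytes
    std::size_t begin = 0;   // oldest block still valid for readers
    std::size_t end = 0;     // one past the newest published block
};

// Hypothetical: fills dst with the next 1 MiB package from the socket.
void receive_block(std::uint8_t* dst, std::size_t size);

void producer_step(std::uint8_t* mapped_base, RingMeta& meta) {
    // 1. The slot at 'end' lies outside the published [begin, end) range, so the
    //    producer owns it exclusively and can spend as long as it likes here
    //    without holding any lock.
    std::size_t slot = meta.end % kNumBlocks;   // 'end' is only ever written by this thread
    receive_block(mapped_base + slot * kBlockSize, kBlockSize);

    // 2. Short critical section: publish the new block, retire the oldest one.
    std::lock_guard<std::mutex> lock(meta.mutex);
    ++meta.end;
    if (meta.end - meta.begin > kNumBlocks - 3) ++meta.begin;  // keep a few spare blocks
}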
As for the actual data, writing out 15 MiB/s is absolutely no challenge; a normal hard disk will do 5-6 times as much, and an SSD will easily do 10 to 20 times that. It also isn't something you even need to do yourself; it's something you can leave to the operating system to manage.
I would create a 54.1 GB [1] file on disk and memory-map it (assuming a 64-bit system, a reasonable assumption for servers with many gigabytes of RAM, this is no problem). The operating system takes care of the rest. You just write your data to the mapped region, which you use as a circular buffer [2].
What was most recently written will be more or less guaranteed [3] to be resident in RAM, so the consumers can access it without faulting. Older data may or may not be in RAM, depending on whether your server has enough physical RAM available.
Data that is older can still be accessed, but likely at slightly slower speed (if there is not enough physical RAM to keep the whole set resident). It will however not affect the producer or the consumers reading the recently written data (unless the machine is so awfully low-spec that it can't even hold 2-3 of your 1MiB blocks in RAM, but then you have a different problem!).
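For concreteness, a minimal Linux-only sketch of creating and mapping such a backing file; the path, the exact size, and the absence of real error handling are simplifications of mine:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t block  = 1 << 20;            // 1 MiB packages
    const std::size_t blocks = 54 * 1024 + 100;    // one hour of data plus ~100 spare blocks
    const std::size_t size   = blocks * block;

    int fd = open("/data/ringbuffer.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, static_cast<off_t>(size)) != 0) {
        perror("open/ftruncate");
        return 1;
    }
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    unsigned char* base = static_cast<unsigned char*>(p);

    // The producer now simply memcpy()s package i to
    //     base + (i % blocks) * block
    // and the OS keeps the hot tail resident while writing dirty pages back on its own.

    munmap(base, size);
    close(fd);
    return 0;
}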
You are not very concrete on how you intend to process the data, other than that there will be 5 consumers, so I will not go too deep into this part. You may have to implement a job scheduling system, or you can just divide each incoming block into 5 smaller chunks, or whatever -- depending on what exactly you want to do.
What you need to account for in any case is the region (either as pointer, or better as offset into the mapping) of data in your mapped ringbuffer that is "valid" and the region that is "unused".
The producer is the owner of the mapping, and it "allows" the consumers to access the data within the bounds given in the metadata (a begin/end pair of offsets). Only the producer may change this metadata.
Anyone (including the producer) accessing this metadata needs to acquire a lock.
It is probably even possible to do this with atomic operations, but seeing how you only lock rarely, I wouldn't even bother. It's a no-brainer using a lock, and there are no subtle mistakes that you can make.
Since the producer knows that the consumers will only look at data within well-defined bounds, it can write to areas outside the bounds (the area known to be "empty") without locking. It only needs to lock to change the bounds afterwards.
As 54.1 GiB > 54 GiB, you have a hundred spare 1 MiB blocks in the mapping that you can write to. That's probably much more than needed (2 or 3 should do), but it doesn't hurt to have a few extra. As you write to a new block (and increase the valid range by 1), also adjust the other end of the "valid range". That way, threads will no longer be allowed to access an old block, but a thread still working in that block can finish its work (the data still exists).
If one is strict about correctness, this may create a race condition if processing a block takes extremely long (over 1 1/2 minutes in this case). If you want to be absolutely sure, you'll need another lock, which may in the worst case block the producer. That's something you absolutely didn't want, but blocking the producer in the worst case is the only thing that is 100% correct in every contrived case unless a hypothetical computer has unlimited memory.
Given the situation, I think this theoretical race is an "allowable" thing. If processing a single block really takes that long with so much data steadily coming in, you have a much more serious problem at hand, so practically, it's a non-issue.
If your boss decides, at some point in the future, that you should keep more than 1 hour of backlog, you can enlarge the file and remap, and when the "empty" region is next at the end of the old buffer's size, simply extend the "known" file size, and adjust your max_size value in the producer. The consumer threads don't even need to know. You could of course create another file, copy the data, swap, and keep the consumers blocked in the mean time, but I deem that an inferior solution. It is probably not necessary for a size increase to be immediately visible, but on the other hand it is highly desirable that it is an "invisible" process.
If you put more RAM into the computer, your program will "magically" use it, without you needing to change anything. The operating system will simply keep more pages in RAM. If you add another few consumers, it will still work the same.
[1] Intentionally bigger than what you need; let there be a few "extra" 1 MiB blocks.
[2] Preferably, you can madvise the operating system (if you use a system that has a destructive MADV_DONTNEED hint, such as Linux) that you are no longer interested in the contents before overwriting a region. But if you don't do that, it will work either way, only slightly less efficiently, because the OS may do a read-modify-write operation where a write operation would have been enough.
[3] There is of course never really a guarantee, but it's what will be the case anyway.
54 GB/hour = 15 MB/s. A good SSD these days can write 300+ MB/s. If you keep 1 hour in RAM and then occasionally flush older data to disk, you should be able to handle 10x more than 15 MB/s (provided your search algorithm is fast enough to keep up).
Regarding a fast locking mechanism between your threads, I would suggest looking into RCU (Read-Copy-Update). The Linux kernel currently uses it to achieve very efficient locking.
Do you have some minimum hardware requirements? 54 GB in memory is perfectly possible these days (many motherboards can take 4x16 GB, and that's not even server hardware). So if you want to require an SSD, you could maybe just as well require a lot of RAM and have an in-memory circular buffer as you suggest.
Also, if there's sufficient redundancy in the data, it may be viable to use some cheap compression algorithm (one that is easy on the CPU, i.e. some sort of 'level 0' compression). I.e. you don't store the raw data, but some compressed format (and possibly some index) which is decompressed by the readers.
Many good recommendations around. I'd just like to add that for a circular buffer implementation you can have a look at Boost Circular Buffer.

Speeding up file I/O: mmap() vs. read()

I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems, both of which make heavy use of posix_advise: the first is an mmapped read, in which we map the entirety of the data set and read on demand.
The second is a read()/seek() based system.
Both work well, but only for the moderate cases; the read() method manages our overall file cache much better and can deal well with 100s of GB of files, but is badly rate-limited, while mmap is able to pre-cache data, making a sustained data rate of over 200 MB/s easy to maintain, but cannot deal with large total data set sizes.
So my question comes to these:
A: Can read()-type file I/O be further optimized beyond the posix_advise calls on Linux, or, having tuned the disk scheduler, VMM and posix_advise calls, is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
Mmap-vs-reading-blocks is a similar problem to what I am working on, and it provided a good starting point, along with the discussions in mmap-vs-read.
Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data to a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily with the current file as stdin, and the pipe as stdout.
pid_t pid;
if ((pid = fork())) {
    waitpid(pid, NULL, 0);               // parent: wait for cat to finish
} else {
    dup2(dest, 1);                       // child's stdout -> the pipe
    dup2(source, 0);                     // child's stdin  -> the current file
    execlp("cat", "cat", (char *)NULL);  // argument list must be NULL-terminated
}
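On Linux, the splice() route mentioned above might look roughly like this; this is my sketch, assuming source is a regular-file descriptor and dest is the write end of a pipe, as in the snippet above:

#define _GNU_SOURCE               // for splice(); g++ on glibc usually defines this already
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Move the whole file into the pipe without copying the data through userspace.
static int splice_file_to_pipe(int source, int dest) {
    struct stat st;
    if (fstat(source, &st) != 0) return -1;

    loff_t offset = 0;
    while (offset < st.st_size) {
        ssize_t n = splice(source, &offset, dest, nullptr,
                           static_cast<size_t>(st.st_size - offset),
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n <= 0) return -1;    // real code would retry on EINTR and handle short splices
    }
    return 0;
}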
Update0
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
To answer your more specific questions:
A: Can read()-type file I/O be further optimized beyond the posix_advise calls on Linux, or, having tuned the disk scheduler, VMM and posix_advise calls, is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading etc. but it's dangerous and probably unproductive guess work. I'd just go with splicing the files into a pipe.
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing):
MAP_HUGETLB
Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you from running out of memory while keeping your implementation simple if you don't actually have enough physical memory + swap for the entire mapping.
MAP_POPULATE
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups if you have sufficient hardware resources, and if the prefetching is ordered and lazy. I suspect this flag is redundant; the VFS likely does this better by default.
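Putting those flags together, a rough Linux-specific sketch; note that MAP_HUGETLB for file-backed mappings generally requires the file to live on hugetlbfs, which you would have to verify for your setup, so it is left as a commented option here:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map a whole input file read-only with the flags discussed above.
void* map_for_reading(const char* path, std::size_t* out_len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

    // Add MAP_HUGETLB here only if the file actually lives on hugetlbfs.
    int flags = MAP_PRIVATE | MAP_NORESERVE | MAP_POPULATE;
    void* p = mmap(nullptr, static_cast<std::size_t>(st.st_size), PROT_READ, flags, fd, 0);
    close(fd);   // the mapping keeps its own reference to the file
    if (p == MAP_FAILED) return nullptr;

    // In the spirit of the posix_advise calls already used by the question.
    posix_madvise(p, static_cast<std::size_t>(st.st_size), POSIX_MADV_SEQUENTIAL);
    *out_len = static_cast<std::size_t>(st.st_size);
    return p;
}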
Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
And I think you should tune your application, and perhaps even your algorithms, to read data in chunks much bigger than a few kilobytes. Can't that be half a megabyte instead?
The problem here doesn't seem to be which API is used. It doesn't matter whether you use mmap() or read(); the disk still has to seek to the specified point and read the data (although the OS does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes), because you don't have to call the OS for every chunk, which becomes very slow.
I would also advise, like Basile did, reading more than 2 KB consecutively so the disk doesn't have to seek that often.

Writing data chunks while processing - is there a convergence value due to hardware constraints?

I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started writing the results straight away to files, one at a time, which was the slowest option. I figured it gets a lot faster if I collect a certain number of files in a vector and then write them all at once, then go back to processing while the hard disk is busy writing all the stuff I poured into it (that at least seems to be what happens).
My question is, can I somehow estimate a convergence value for the amount of data that I should write, based on the hardware constraints? To me it seems to be a hard disk buffer thing; I have a 16 MB buffer on that hard disk and get these values (all for ~100,000 files):
Buffer size     Time (minutes)
------------------------------
no buffer       ~ 8:30
1 MB            ~ 6:15
10 MB           ~ 5:45
50 MB           ~ 7:00
Or is this just a coincidence ?
I would also be interested in experience / rules of thumb about how writing performance is to be optimized in general, for example are larger hard disk blocks helpful, etc.
Edit:
Hardware is a pretty standard consumer drive (I'm a student, not a data center): a WD 3.5" 1TB/7200/16MB/USB2, HFS+ journalled; the OS is Mac OS X 10.5. (I'll soon give it a try on ext3/Linux and an internal disk rather than an external one.)
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should look at optimizing your read access. The OS (at least Windows) is already really good at helping write access via buffering "under the hood", but if you're reading serially there isn't too much it can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased performance.
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. Writing 100,000s of files is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not 100,000. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and write helper(s) consists of a two-std::vector double-buffer per helper. (One buffer owned by the write thread and one by the read thread.) The read thread fills the buffer until a specified limit, then blocks. The write thread times the write speed with gettimeofday or whatever, and adjusts the limit. If writing went faster than last time, increase the limit by X%. If it went slower, decrease it by X%. X can be small.
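A sketch of that adaptive double-buffer using C++11 primitives (std::chrono instead of gettimeofday); the names, the 5% step, and write_file() are placeholders of mine:

#include <algorithm>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

struct PendingFile { std::string name; std::vector<char> bytes; };

// Hypothetical: opens, writes and closes one result file.
void write_file(const PendingFile& f);

class AdaptiveDoubleBuffer {
    std::vector<PendingFile> front, back;   // front: being filled; back: being flushed
    std::size_t limit = 256;                // how many files to batch before a flush
    double last_rate = 0.0;                 // files per second achieved by the last flush
    bool front_full = false;
    std::mutex m;
    std::condition_variable cv;

public:
    // Called by the read/processing thread.
    void push(PendingFile f) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !front_full; });   // block while the batch is full
        front.push_back(std::move(f));
        if (front.size() >= limit) { front_full = true; cv.notify_all(); }
    }

    // Called by the single write helper thread.
    void writer_loop() {
        for (;;) {
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return front_full; });
                std::swap(front, back);               // take ownership of the full batch
                front_full = false;
                cv.notify_all();                      // let the reader refill immediately
            }
            auto t0 = std::chrono::steady_clock::now();
            for (const auto& f : back) write_file(f);
            double secs = std::chrono::duration<double>(
                              std::chrono::steady_clock::now() - t0).count();
            double rate = back.size() / secs;
            back.clear();

            std::lock_guard<std::mutex> lock(m);      // grow or shrink the batch limit
            limit = std::max<std::size_t>(
                1, static_cast<std::size_t>(limit * (rate >= last_rate ? 1.05 : 0.95)));
            last_rate = rate;
        }
    }
};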