Predictively computing potentially needed values on large shared data structure with infrequent updates - c++

I have a system I need to design with low latency in mind, processing power and memory are generous. I have a large (several GB) data structure that is updated once every few seconds. Many (read only) operations are going to run against this data structure between updates, in parallel, accessing it heavily. As soon as an update occurs, all computations in progress should be cleanly cancelled, as their results are invalidated by the update.
The issue I'm running into here is that writes are infrequent enough, and readers access so often that locking around individual reader access would have a huge hit to performance. I'm fine with the readers reading invalid data, but then I need to deal with any invariants broken (assertions) or segfaults due to stale pointers, etc. At the same time, I can't have readers block writers, so reader-writer locks acquired at every reader's thread start is unacceptable.
The only solution I can think of has a number of issues, which is to allocate a mapping with mmap, put the readers in separate processes, and mprotect the memory to kill the workers when it's time to update. I'd prefer a cross-platform solution (ideally pure C++), however, and ideally without forking every few seconds. This would also require some surgery to get all the data structures located in shm.
Something like a revocable lock would do exactly what I need, but I don't know of any libraries that provide such functionality.

If this was a database I'd use multi-versions concurrency control. Readers obtain a logical snapshot while the underlying physical data structures are mostly lock-free (or locked very shortly and fine-grainedly).
You say your memory is generously equipped. Can you just create a complete copy of the data structure? Then you modify the copy and swap it out atomically.
Or, can you use immutable data-structures so that readers continue to use the old version and the writer creates new objects?
Or, you implement MVCC in a fine-grained way. Let's say you want to version a hash-set. Instead of keeping one value per key, you keep one value per key per version. Readers read from the latest version that is <= the version that existed when they started to read. Writers create a new version number for each write "transaction". Only when all writes are complete readers would start picking up changes from the new version. This is how MVCC-databases do it.
Besides these approaches I also liked your mmap idea. I don't think you need a separate process is your OS supports copy-on-write memory mappings. Then you can map the same memory area multiple times and provide a stable snapshot to readers.

Related

Does Multiple reader single writer implementation in g++-4.4(Not in C++11/14) via boost::shared_mutex impact performance?

Usage: In our production we have around 100 thread which can access the cache we are trying to implement. If cache is missed then information will be fetched from the database and cache will be updated via writer thread.
To achieve this we are planning to implement multiple read and single writer We cannot update the g++ version since we are using g++-4.4
Update: Each worker thread can work for both read and write. If cache is missed then information is cached from the DB.
Problem Statement:
We need to implement the cache to enhance the performance.
For this, cache read are more frequent and write operations to the cache is very much less.
I think we can use boost::shared_mutex boost::shared_lock, boost::upgrade_lock, boost::upgrade_to_unique_lock implementation
But we learnt that boost::shared_mutex has performance issues:
Performance comparison on reader writer locks
Lib boost devel
Questions
Does boost::shared_mutex impact the performance in case read are much frequent?
What are other constructs and design approaches we can take while considering compiler version g++4.4?
Is there a way-around on how to design it, such that reads are lock free?
Also, we are intended to use Map to keep the information for cache.
If writes were non-existent, one possibility can be 2-level cache where you first have a thread-local cache, and then the normal cache with mutex or reader/writer lock.
If writes are extremely rare, you can do the same. But have some lock-free way of invalidating the thread-local cache, e.g. an atomic int updated with every write, and in those cases clear the thread-local cache.
You need to profile it.
In case you're stuck because you don't have a "similar enough" environment where you can actually test things, you can probably write a simple wrapper using pthreads: pthread_rwlock_t
pthread_rwlock_rdlock
pthread_rwlock_wrlock
pthread_rwlock_unlock
Of course you can design things to be lock free. Most obvious solution would be to not share state. (If you do share state, you'll have to check that your target platform supports atomic instructions). However, without any knowledge of your application domain, I feel very safe suggesting you do not want lock-free. See e.g. Do lock-free algorithms really perform better than their lock-full counterparts?
It all depends on the frequency of the updates, the size of the cache and how much is changed in the update.
Let's assume you have a rather big cache with a lot of changes on each update. Then I would use a read-copy-update pattern, which is lock-free.
If your cached data is pretty small and one time read (e.g. a single integer) RCU is also a good choice.
A big cache, with small updates or a big cache with updates which are to frequent for RCU a Read-Write Lock is a good choice.
Alongside other answers suggesting you profile it, a large benefit can be had if you can somehow structure or predict the type, order and size of the requests.
If particular types of data are requested in a typical cycle, it would be better to split up the cache per data type. You will increase cache-hit/miss ratios and the size of each cache can be adapted to the type. You will also reduce possible contention.
Likewise, the size of the requests is important when choosing your update approach. Smaller data fragments may be stored longer or even pooled together, while larger chunks may be requested less frequently.
Even with a basic prediction scheme in place that covers only the most frequent fetch patterns, you may already improve performance quite a bit. It's definitely worth it to try and train e.g. a NN (Neural Network) to guess the next request in advance.

Writer/Reader buffer mechanism for large size - high freq data c++

I need a single writer and multiple reader (up to 5) mechanism that the writer pushes the data of size almost 1 MB each and 15 packages per second continuously which will be writtern in c++. What I’m trying to do is one thread keeps writing the data while 5 readers are going to make some search operations according to the timestamp of the data simultaneously. I have to keep each data package 60 min, and then they can be removed from the container.
Since the data can grow like 15 MB * 60 sec * 60 min = 54000MB/h I need almost 50 GB space to keep the data and make the operations fast enough for both the writer and the readers. Bu the thing is we cannot keep that size data on cache or RAM so it must be in a Hard drive like SSD (HDD would be too slow for that kind of operation)
Up to now what I’ve been thinking is, to make a circular buffer (since I can calculate the max size) directly implemented to an SSD, which I couldn’t find a suitable example up to now and I don’t know if it is possible or not either, or to implement some kind of mapping mechanism that one circular array will be available in the RAM that just keeps the timestamps of the data and physical address of the memory for searching the data which is available on the hard drive. So at least the search operations would be faster I guess.
Since any kind of lock, mutex or semaphore will slow down the operations (especially write is critical we cannot loose data because of any read operation) I don’t want to use them. I know there are some shared locks available but I think again they have some drawbacks. Is there any way/idea to implement such kind of system with lock free, wait free and thread safe as well? Any Data structure (container), pattern, example code/project or other kind of suggestions will be highly appreciated, thank you…
EDIT: Is there any other idea rather than bigger amount of RAM?
This can be done on a commodity PC (and can scale to a server without code changes).
Locks are not a problem. With a single writer and few consumers that do time-consuming tasks on big data, you will have rare locking and practically zero lock contention, so it's a non-issue.
Anything from a simple spinlock (if you're really desperate for low latency) or preferrably a pthread_mutex (which falls back to being a spinlock most of the time, anyway) will do fine. Nothing fancy.
Note that you do not acquire a lock, receive a megabyte of data from a socket, write it to disk, and then release the lock. That's not how it works.
You receive a megabyte of data and write it to a region that you own exclusively, then acquire a lock, change a pointer (and thus transfer ownership), and release the lock. The lock protects the metadata, not every single byte in a gigabyte-sized buffer. Long running tasks, short lock times, contention = zero.
As for the actual data, writing out 15MiB/s is absolutely no challenge, a normal harddisk will do 5-6 times as much, and a SSD will easily do 10 to 20 times that. It also isn't something you even need to do yourself. It's something you can leave to the operating system to manage.
I would create a 54.1GB1 file on disk and memory map that (assuming it's a 64bit system, a reasonable assumption when talking of multi-gigabyte-ram-servers, this is no problem). The operating system takes care of the rest. You just write your data to the mapped region which you use as circular buffer2.
What was most recently written will be more or less guaranteed3 to be resident in RAM, so the consumers can access it without faulting. Older data may or may not be in RAM, depending on whether your server has enough physical RAM available.
Data that is older can still be accessed, but likely at slightly slower speed (if there is not enough physical RAM to keep the whole set resident). It will however not affect the producer or the consumers reading the recently written data (unless the machine is so awfully low-spec that it can't even hold 2-3 of your 1MiB blocks in RAM, but then you have a different problem!).
You are not very concrete on how you intend to process data, other than there will be 5 consumers, so I will not go too deep into this part. You may have to implement a job scheduling system, or you can just divide each incoming block in 5 smaller chunks, or whatever -- depending on what exactly you want to do.
What you need to account for in any case is the region (either as pointer, or better as offset into the mapping) of data in your mapped ringbuffer that is "valid" and the region that is "unused".
The producer is the owner of the mapping, and it "allows" the consumers to access the data within the bounds given in the metadata (a begin/end pair of offsets). Only the producer may change this metadata.
Anyone (including the producer) accessing this metadata needs to acquire a lock.
It is probably even possible to do this with atomic operations, but seeing how you only lock rarely, I wouldn't even bother. It's a no-brainer using a lock, and there are no subtle mistakes that you can make.
Since the producer knows that the consumers will only look at data within well-defined bounds, it can write to areas outside the bounds (the area known being "emtpy") without locking. It only needs to lock to change the bounds afterwards.
As 54.1Gib > 54Gib, you have a hundred spare 1MiB blocks in the mapping that you can write to. That's probably much more than needed (2 or 3 should do), but it doesn't hurt to have a few extra. As you write to a new block (and increase the valid range by 1), also adjust the other end of the "valid range". That way, threads will no longer be allowed to access an old block, but a thread still working in that block can finish its work (the data still exists).
If one is strict about correctness, this may create a race condition if processing a block takes extremely long (over 1 1/2 minutes in this case). If you want to be absolutely sure, you'll need another lock which may in the worst case block the producer. That's something you absolutely didn't want, but blocking the producer in the worst case is the only thing that is 100% correct in every contrieved case unless a hypothetical computer has unlimited memory.
Given the situation, I think this theoretical race is an "allowable" thing. If processing a single block really takes that long with so much data steadily coming in, you have a much more serious problem at hand, so practically, it's a non-issue.
If your boss decides, at some point in the future, that you should keep more than 1 hour of backlog, you can enlarge the file and remap, and when the "empty" region is next at the end of the old buffer's size, simply extend the "known" file size, and adjust your max_size value in the producer. The consumer threads don't even need to know. You could of course create another file, copy the data, swap, and keep the consumers blocked in the mean time, but I deem that an inferior solution. It is probably not necessary for a size increase to be immediately visible, but on the other hand it is highly desirable that it is an "invisible" process.
If you put more RAM into the computer, your program will "magically" use it, without you needing to change anything. The operating system will simply keep more pages in RAM. If you add another few consumers, it will still work the same.
1 Intentionally bigger than what you need, let there be a few "extra" 1MiB blocks.
2 Preferrably, you can madvise the operating system (if you use a system that has a destructive DONT_NEED hint, such as Linux) that you are no longer interested in the contents before overwriting a region. But if you don't do that, it will work either way, only slightly less efficient because the OS will possibly do a read-modify-write operation where a write operation would have been enough.
3 There is of course never really a guarantee, but it's what will be the case anyway.
54GB/hour = 15MB/s. Good SSD these days can write 300+ MB/s. If you keep 1 hour in RAM and then occasionally flush older data to disk, you should be able to handle 10x more than 15MB/s (provided your search algorithm is fast enough to keep up).
Regarding fast locking mechanism between your threads, I would suggest looking into RCU - Read-Copy Update. Linux kernel is currently using it to achieve very efficient locking.
Do you have some minimum hardware requirements? 54GB in memory is perfectly possible these days (many motherboards can take 4x16GB these days, and that's not even server hardware). So if you want to require an SSD, you could maybe just as well require a lot of RAM and have an in-memory circular buffer as you suggest.
Also, if there's sufficient redudancy in the data, it may be viable to use some cheap compression algorithms (those which are easy on the CPU, i.e. some sort of 'level 0' compression). I.e. you don't store the raw data, but some compressed format (and possibly some index) which is decompressed by the readers.
Many good recommendations around. I'd like just to add that for circular buffer implementation you can have a look at Boost Circular Buffer

Implementing a lock free data structure on disk

I have an interesting challenge, for those with strong background in lock-free data structures, and disk based data structures.
I'm looking for a way to build in C++ a data structure to hold a varying amount of objects.
The limitation are such:
The data structure must reside on disk.
There is one thread writing to the data structure and many others reading from it.
Every read is atomic. (lets assume I can atomically read a block of size 32/64KB for this and that all objects are small than that in size.
A write should not block a read, for that it is possible to assume that I can write in an atomic way a block of 32/64KB as well.
Locks cannot be used at all.
Any suggestions?
I was thinking of using something like a B-Tree and when needed to split nodes and write new data than move them to new nodes at the end of the file and then just update the pointers to the nodes which will reside for example in some other file (the original blocks will be marked as free and added to a freestore)
However, I run into a problem if my mapping file is greater than 32/64Kb.. Let say I want it to hold even just 1 million object pointers than at 4 bytes/pointer I get to 4 million bytes which is roughly 4 Megs... (and at 1 billion objects even more than that..) Which means the mappings file cannot be written in an atomic manner.
So if someone has a better suggestion as to maybe how to implement the above - or even some direction it would be greatly appreciated.
As far as I know all opensource/commercial implementations of B-Tree use locks of some sort, which I cannot use.
Thanks,
Max.
You won't get very far by just assuming reads/writes are atomic -- mainly because they're not, and you'll end up emulating it in a way that'll kill performance.
It sounds like you want to research MVCC, which is the pretty standard mechanism to use when designing a lock-free database. The basic concept is that every read gets a "snapshot" of the database -- usually implemented in a lock-free way by leaving old pages alone and performing any modifications to new pages only. Once the old pages are finished being used by readers, they're finally marked for re-use.
While MVCC is significantly more involved than a CPU/RAM lock-free structure, once you have it many of the same optimistic lock-free patterns apply towards using it.
LMDB will do all of this with no problem. It is an MVCC B+tree and readers are completely lockless.

How to LRU-cache numerous objects made of C++ STL heavy structures?

I have big C++/STL data structures (myStructType) with imbricated lists and maps. I have many objects of this type I want to LRU-cache with a key. I can reload objects from disk when needed. Moreover, it has to be shared in a multiprocessing high performance application running on a BSD plateform.
I can see several solutions:
I can consider a life-time sorted list of pair<size_t lifeTime, myStructType v> plus a map to o(1) access the index of the desired object in the list from its key, I can use shm and mmap to store everything, and a lock to manage access (cf here).
I can use a redis server configured for LRU, and redesign my data structures to redis key/value and key/lists pairs.
I can use a redis server configured for LRU, and serialise my data structures (myStructType) to have a simple key/value to manage with redis.
There may be other solutions of course. How would you do that, or better, how have you successfully done that, keeping in mind high performance ?
In addition, I would like to avoid heavy dependencies like Boost.
I actually built caches (not only LRU) recently.
Options 2 and 3 are quite likely not faster than re-reading from disk. That's effectively no cache at all. Also, this would be a far heavier dependency than Boost.
Option 1 can be challenging. For instance, you suggest "a lock". That would be quite a contended lock, as it must protect each and every lifetime update, plus all LRU operations. Since your objects are already heavy, it may be worthwhile to have a unique lock per object. There are intermediate variants of this solution, where there is more than one lock, but also more than one object per lock. (You still need a key to protect the whole map, but that's for replacement only)
You can also consider if you really need strict LRU. That strategy assumes that the chances of an object being reused decreases over time. If that's not actually true, random replacement is just as good. You can also consider evicting more than one element at a time. One of the challenges is that when an element needs removing, it would be so from all threads, but it's sufficient if one thread removes it. That's why a batch removal helps: if a thread tries to take a lock for batch removal and it fails, it can continue under the assumption that the cache will have free space soon.
One quick win is to not update the LRU time of the last used element. It was already the newest, making it any newer won't help. This of course only has an effect if you often use that element quickly again, but (as noted above) otherwise you'd just use random eviction.

Application area of lock-striping

The ConcurrentHashMap of JDK uses a lock-striping technique. It is a nice idea to minimize locking overhead. Are there any other libraries or tools that take advantage of it?
For example, does database engine use it?
If the technique is not so much useful in other areas, what is the limitation of it?
Lock striping is useful when there is a way of breaking a high contention lock into multiple locks without compromising data integrity. If this is possible or not should take some thought and is not always the case. The data structure is also the contributing factor to the decision. So if we use a large array for implementing a hash table, using a single lock for the entire hash table for synchronizing it will lead to threads sequentially accessing the data structure. If this is the same location on the hash table then it is necessary but, what if they are accessing the two extremes of the table.
There is definitely a lot of time saved using lock striping. Multiple runs of a scenario gives almost halves the execution time.
The down side of lock striping is it is difficult to get the state of the data structure that is affected by striping. In the example the size of the table, or trying to list/enumerate the whole table may be cumbersome since we need to acquire all of the striped locks.