Application area of lock-striping - concurrency

The JDK's ConcurrentHashMap uses a lock-striping technique. It is a nice idea for minimizing locking overhead. Are there any other libraries or tools that take advantage of it?
For example, do database engines use it?
If the technique is not that useful in other areas, what are its limitations?

Lock striping is useful when there is a way of breaking a highly contended lock into multiple locks without compromising data integrity. Whether this is possible should take some thought, and is not always the case; the data structure in use is also a contributing factor in the decision. For example, if we use a large array to implement a hash table, using a single lock to synchronize the entire table forces threads to access the data structure sequentially. That is necessary when threads touch the same location in the table, but what if they are accessing its two extremes?
There is definitely a lot of time saved by lock striping: across multiple runs of one scenario, it almost halved the execution time.
The downside of lock striping is that it becomes harder to obtain global state of the structure that is affected by striping. In the example, getting the size of the table, or trying to list/enumerate the whole table, is cumbersome, since we need to acquire all of the striped locks.
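A minimal C++ sketch of the idea (a hypothetical StripedHashMap; rehashing and error handling are omitted): each mutex guards only the buckets that hash to its stripe, so threads touching different stripes proceed in parallel, while a global operation such as size() must take every stripe lock, which is exactly the downside described above.

    #include <functional>
    #include <list>
    #include <mutex>
    #include <utility>
    #include <vector>

    template <typename K, typename V, size_t Stripes = 16>
    class StripedHashMap {
    public:
        StripedHashMap() : buckets_(Stripes * 8) {}

        void put(const K& key, const V& value) {
            size_t b = bucketFor(key);
            std::lock_guard<std::mutex> guard(locks_[b % Stripes]); // one stripe locked
            for (auto& kv : buckets_[b])
                if (kv.first == key) { kv.second = value; return; }
            buckets_[b].push_back(std::make_pair(key, value));
        }

        bool get(const K& key, V& out) {
            size_t b = bucketFor(key);
            std::lock_guard<std::mutex> guard(locks_[b % Stripes]);
            for (auto& kv : buckets_[b])
                if (kv.first == key) { out = kv.second; return true; }
            return false;
        }

        // Global operations are the pain point: every stripe must be held.
        // Stripes are always locked in ascending order to avoid deadlock.
        size_t size() {
            for (size_t i = 0; i < Stripes; ++i) locks_[i].lock();
            size_t n = 0;
            for (auto& bucket : buckets_) n += bucket.size();
            for (size_t i = Stripes; i-- > 0; ) locks_[i].unlock();
            return n;
        }

    private:
        size_t bucketFor(const K& key) const {
            return std::hash<K>()(key) % buckets_.size();
        }
        std::vector<std::list<std::pair<K, V>>> buckets_; // fixed bucket count
        std::mutex locks_[Stripes];                       // one lock per stripe
    };

With 16 stripes, up to 16 threads can update the table concurrently, as long as their keys fall into different stripes.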

Related

Sharing a data set across threads vs. splitting up the data per thread

I have written a small program that generates images of the Mandelbrot set, and I have been using it as an opportunity to teach myself multithreading.
I currently have four threads that each handle calculating a quarter of the data. When they finish, the data is aggregated to then be drawn to a bitmap.
I'm currently pre-calculating all the complex numbers for each pixel in the main thread and putting them into a vector. Then, I split the vector into four smaller vectors to pass into each thread to modify.
Is there a best practice here? Should I be splitting up my data set so that the threads can work without interfering with each other, or should I just use one data set and use mutexes/locking? I suppose benchmarking would probably be my best bet.
Thanks, let me know if you'd want to see my code.
The best practice is to make threads as independent of each other as possible. I'm not familiar with the particular problem you're trying to solve, but if it allows dividing the work equally among threads, splitting up the data set will be the most efficient way. When splitting data, keep false sharing in mind, and minimize cross-thread data movement.
Choosing another parallelisation strategy makes sense in cases where, e.g.:
Eliminating cross-thread dependencies requires a change to the algorithm that will cause too much extra work.
The amount of work per thread isn't balanced, and you need some dynamic work assignment to ensure all threads are busy until work is completed.
The algorithm is composed of different stages such that task parallelism is more efficient than data parallelism (namely, each stage is handled by a different thread, and data is pipelined between threads. This makes sense if there are too many dependencies within each stage).
Bear in mind that a mutex/lock means wasted time waiting, as well as possibly non-trivial synchronisation overhead if the mutex is a kernel object. However, correctness comes first: if other options are too difficult to get right, you'll lose more than you'll gain. Finally, always compare your parallel implementation to a sequential one. Due to data movements and dependencies, the sequential implementation often runs faster than the parallel one.
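As a concrete illustration of the data-splitting approach for this particular problem, here is a hedged sketch (the iterations() escape-time function and the image dimensions are placeholders): each thread owns a disjoint block of rows of a shared output buffer, so no mutex is needed at all.

    #include <complex>
    #include <thread>
    #include <vector>

    // Escape-time iteration count for one pixel (placeholder implementation).
    static int iterations(std::complex<double> c, int maxIter = 1000) {
        std::complex<double> z(0.0, 0.0);
        int n = 0;
        while (std::abs(z) <= 2.0 && n < maxIter) { z = z * z + c; ++n; }
        return n;
    }

    int main() {
        const int width = 800, height = 600;
        const unsigned nThreads = 4;
        std::vector<int> image(width * height); // shared, but slices are disjoint

        std::vector<std::thread> workers;
        for (unsigned t = 0; t < nThreads; ++t) {
            workers.emplace_back([t, width, height, nThreads, &image] {
                // Rows [firstRow, lastRow) belong exclusively to this thread:
                // no two threads ever write the same element, so no locking.
                const int firstRow = height * t / nThreads;
                const int lastRow  = height * (t + 1) / nThreads;
                for (int y = firstRow; y < lastRow; ++y)
                    for (int x = 0; x < width; ++x) {
                        std::complex<double> c(-2.5 + 3.5 * x / double(width),
                                               -1.25 + 2.5 * y / double(height));
                        image[y * width + x] = iterations(c);
                    }
            });
        }
        for (auto& w : workers) w.join();
        // image now holds iteration counts, ready to be mapped to a bitmap.
    }

Because the per-pixel cost in the Mandelbrot set is very uneven, contiguous blocks of rows can leave some threads idle; handing out rows round-robin or from a shared atomic counter gives the dynamic work assignment mentioned above.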

Does a multiple-reader/single-writer implementation in g++-4.4 (not C++11/14) via boost::shared_mutex impact performance?

Usage: In our production system we have around 100 threads that can access the cache we are trying to implement. On a cache miss, the information will be fetched from the database and the cache will be updated by the writer thread.
To achieve this we are planning to implement a multiple-reader/single-writer scheme. We cannot update the g++ version, since we are using g++-4.4.
Update: Each worker thread can perform both reads and writes. On a cache miss, the information is fetched from the DB and added to the cache.
Problem Statement:
We need to implement a cache to enhance performance.
Reads from the cache will be frequent, while writes to it will be much rarer.
I think we can use the boost::shared_mutex, boost::shared_lock, boost::upgrade_lock, and boost::upgrade_to_unique_lock implementation.
But we learnt that boost::shared_mutex has performance issues:
Performance comparison on reader writer locks
Lib boost devel
Questions
Does boost::shared_mutex hurt performance when reads are much more frequent than writes?
What other constructs and design approaches can we take, given compiler version g++-4.4?
Is there a workaround in the design such that reads are lock-free?
Also, we intend to use a map to hold the cached information.
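For reference, the upgrade-lock combination mentioned above would look roughly like this sketch (Info and loadFromDatabase are hypothetical stand-ins for the real cache types; error handling omitted, C++03 style to match g++-4.4):

    #include <map>
    #include <string>
    #include <boost/thread/shared_mutex.hpp>
    #include <boost/thread/locks.hpp>

    struct Info { std::string payload; };
    Info loadFromDatabase(const std::string& key) {   // stub for the real DB fetch
        Info info; info.payload = key; return info;
    }

    class Cache {
    public:
        bool tryGet(const std::string& key, Info& out) {
            boost::shared_lock<boost::shared_mutex> readLock(mutex_);
            std::map<std::string, Info>::const_iterator it = map_.find(key);
            if (it == map_.end()) return false;
            out = it->second;
            return true;
        }

        Info getOrLoad(const std::string& key) {
            // An upgrade lock coexists with readers, excludes other
            // upgraders/writers, and can be promoted without releasing.
            boost::upgrade_lock<boost::shared_mutex> upgradeLock(mutex_);
            std::map<std::string, Info>::const_iterator it = map_.find(key);
            if (it != map_.end()) return it->second;
            Info info = loadFromDatabase(key);        // cache miss: hit the DB
            boost::upgrade_to_unique_lock<boost::shared_mutex> writeLock(upgradeLock);
            map_[key] = info;
            return info;
        }

    private:
        boost::shared_mutex mutex_;
        std::map<std::string, Info> map_;
    };

Note that getOrLoad holds the upgrade lock across the database fetch, which serialises concurrent misses; a real implementation might release the lock during the fetch and re-check the map afterwards.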
If writes were non-existent, one possibility would be a 2-level cache, where you first have a thread-local cache and then the normal cache with a mutex or reader/writer lock.
If writes are extremely rare, you can do the same, but you need some lock-free way of invalidating the thread-local caches, e.g. an atomic integer updated on every write; when a thread sees it change, it clears its thread-local cache.
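A hedged sketch of that two-level scheme (Info and loadShared are hypothetical placeholders for the real types and the locked shared cache). Since g++-4.4 predates std::atomic, the generation counter uses GCC's __sync builtins, and the thread-local map uses the __thread extension:

    #include <map>
    #include <string>

    struct Info { std::string payload; };
    Info loadShared(const std::string& key) {   // stub: the locked shared cache / DB
        Info info; info.payload = key; return info;
    }

    static long g_generation = 0;   // bumped after every shared-cache write

    void onSharedCacheWrite() {
        __sync_fetch_and_add(&g_generation, 1);
    }

    Info cachedGet(const std::string& key) {
        // One private map per thread: lookups here take no lock at all.
        static __thread std::map<std::string, Info>* local = 0;
        static __thread long seenGeneration = 0;
        if (!local) local = new std::map<std::string, Info>(); // never freed; fine for a sketch

        long g = __sync_fetch_and_add(&g_generation, 0);  // atomic read
        if (g != seenGeneration) {  // shared cache changed: our copies may be stale
            local->clear();
            seenGeneration = g;
        }

        std::map<std::string, Info>::const_iterator it = local->find(key);
        if (it != local->end()) return it->second;        // lock-free hit

        Info info = loadShared(key);                      // slow path: locked cache, then DB
        (*local)[key] = info;
        return info;
    }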
You need to profile it.
In case you're stuck because you don't have a "similar enough" environment where you can actually test things, you can probably write a simple wrapper using pthreads: pthread_rwlock_t
pthread_rwlock_rdlock
pthread_rwlock_wrlock
pthread_rwlock_unlock
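A minimal C++03-style RAII wrapper sketch over those pthread calls (names are illustrative), so a lock can't be leaked on an early return or exception:

    #include <pthread.h>

    class RWLock {
    public:
        RWLock()  { pthread_rwlock_init(&lock_, NULL); }
        ~RWLock() { pthread_rwlock_destroy(&lock_); }
        void lockRead()  { pthread_rwlock_rdlock(&lock_); }
        void lockWrite() { pthread_rwlock_wrlock(&lock_); }
        void unlock()    { pthread_rwlock_unlock(&lock_); }
    private:
        RWLock(const RWLock&);            // non-copyable (C++03 style)
        RWLock& operator=(const RWLock&);
        pthread_rwlock_t lock_;
    };

    // Scoped guards: the destructor always releases the lock.
    class ReadGuard {
    public:
        explicit ReadGuard(RWLock& l) : l_(l) { l_.lockRead(); }
        ~ReadGuard() { l_.unlock(); }
    private:
        RWLock& l_;
    };

    class WriteGuard {
    public:
        explicit WriteGuard(RWLock& l) : l_(l) { l_.lockWrite(); }
        ~WriteGuard() { l_.unlock(); }
    private:
        RWLock& l_;
    };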
Of course you can design things to be lock-free. The most obvious solution would be to not share state. (If you do share state, you'll have to check that your target platform supports atomic instructions.) However, without any knowledge of your application domain, I feel very safe suggesting you do not want lock-free. See e.g. Do lock-free algorithms really perform better than their lock-full counterparts?
It all depends on the frequency of the updates, the size of the cache and how much is changed in the update.
Let's assume you have a rather big cache with a lot of changes on each update. Then I would use a read-copy-update pattern, which is lock-free.
If your cached data is pretty small and read in one go (e.g. a single integer), RCU is also a good choice.
For a big cache with small updates, or a big cache whose updates are too frequent for RCU, a read-write lock is a good choice.
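A simplified read-copy-update sketch (shown in C++11 for brevity; with g++-4.4 the same pattern can be built on boost::shared_ptr with a short lock around the pointer swap, and note that std::atomic_load/atomic_store on shared_ptr may use an internal spinlock on some platforms). Readers grab the current snapshot with one atomic load and never block; the single writer copies the structure, modifies the copy, and publishes it atomically:

    #include <map>
    #include <memory>
    #include <string>

    typedef std::map<std::string, std::string> Snapshot;

    class RcuCache {
    public:
        RcuCache() : current_(std::make_shared<Snapshot>()) {}

        // Reader path: one atomic load, no blocking.
        std::shared_ptr<const Snapshot> read() const {
            return std::atomic_load(&current_);
        }

        // Single-writer path: copy, modify the copy, publish.
        void update(const std::string& key, const std::string& value) {
            std::shared_ptr<const Snapshot> old = std::atomic_load(&current_);
            std::shared_ptr<Snapshot> next = std::make_shared<Snapshot>(*old); // full copy
            (*next)[key] = value;
            std::atomic_store(&current_, std::shared_ptr<const Snapshot>(next));
        }

    private:
        std::shared_ptr<const Snapshot> current_;
    };

Old snapshots are reclaimed automatically once the last reader releases its shared_ptr, which is exactly the grace-period bookkeeping RCU otherwise requires by hand.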
Alongside other answers suggesting you profile it, a large benefit can be had if you can somehow structure or predict the type, order and size of the requests.
If particular types of data are requested in a typical cycle, it would be better to split up the cache per data type. You will improve the cache hit/miss ratio, the size of each cache can be adapted to its type, and you will also reduce possible contention.
Likewise, the size of the requests is important when choosing your update approach: smaller data fragments may be stored longer or even pooled together, while larger chunks may be requested less frequently.
Even a basic prediction scheme that covers only the most frequent fetch patterns may already improve performance quite a bit. It's definitely worth trying to train, e.g., a neural network to guess the next request in advance.

Predictively computing potentially needed values on large shared data structure with infrequent updates

I have a system I need to design with low latency in mind, processing power and memory are generous. I have a large (several GB) data structure that is updated once every few seconds. Many (read only) operations are going to run against this data structure between updates, in parallel, accessing it heavily. As soon as an update occurs, all computations in progress should be cleanly cancelled, as their results are invalidated by the update.
The issue I'm running into is that writes are infrequent enough, and readers access the structure so heavily, that locking around each individual reader access would be a huge performance hit. I'm fine with readers reading invalid data, but then I need to deal with any broken invariants (assertions) or segfaults due to stale pointers, etc. At the same time, I can't have readers block writers, so reader-writer locks acquired at every reader thread's start are unacceptable.
The only solution I can think of has a number of issues: allocate a mapping with mmap, put the readers in separate processes, and mprotect the memory to kill the workers when it's time to update. I'd prefer a cross-platform solution (ideally pure C++), however, and ideally one without forking every few seconds. This would also require some surgery to get all the data structures located in shm.
Something like a revocable lock would do exactly what I need, but I don't know of any libraries that provide such functionality.
If this were a database, I'd use multi-version concurrency control (MVCC). Readers obtain a logical snapshot while the underlying physical data structures are mostly lock-free (or locked briefly and at fine granularity).
You say your machine is generously equipped with memory. Can you just create a complete copy of the data structure? Then you modify the copy and swap it in atomically.
Or, can you use immutable data-structures so that readers continue to use the old version and the writer creates new objects?
Or, you implement MVCC in a fine-grained way. Let's say you want to version a hash-set. Instead of keeping one value per key, you keep one value per key per version. Readers read from the latest version that is <= the version that existed when they started to read. Writers create a new version number for each write "transaction". Only when all writes are complete do readers start picking up changes from the new version. This is how MVCC databases do it.
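A toy sketch of that per-key versioning (a hypothetical MvccMap; it assumes a single writer and a key set fixed up front, and it never reclaims old versions, which real MVCC databases must do):

    #include <atomic>
    #include <map>
    #include <string>
    #include <vector>

    class MvccMap {
        struct Node {                  // one value for one key at one version
            long version;
            std::string value;
            Node* next;                // older versions of the same key
            Node(long v, const std::string& s, Node* n)
                : version(v), value(s), next(n) {}
        };
    public:
        explicit MvccMap(const std::vector<std::string>& keys) : published_(0) {
            for (size_t i = 0; i < keys.size(); ++i)
                chains_[keys[i]].store(nullptr);  // fixed key set: no structural changes later
        }

        // A reader pins its snapshot version once, then sees a consistent view.
        long snapshot() const { return published_.load(std::memory_order_acquire); }

        bool get(const std::string& key, long snap, std::string& out) const {
            auto it = chains_.find(key);
            if (it == chains_.end()) return false;
            for (Node* n = it->second.load(std::memory_order_acquire); n; n = n->next)
                if (n->version <= snap) { out = n->value; return true; } // newest visible
            return false;
        }

        // The single writer stages values at version published_+1 ...
        void put(const std::string& key, const std::string& value) {
            auto it = chains_.find(key);
            if (it == chains_.end()) return;      // unknown key: the key set is fixed
            Node* old = it->second.load(std::memory_order_relaxed);
            it->second.store(new Node(published_.load() + 1, value, old),
                             std::memory_order_release);
        }
        // ... and makes the whole write "transaction" visible in one step.
        void commit() { published_.fetch_add(1, std::memory_order_release); }

    private:
        std::map<std::string, std::atomic<Node*>> chains_;
        std::atomic<long> published_;
    };

Readers that pinned snapshot N simply skip nodes stamped N+1 until commit() publishes them, so a multi-key update appears atomic.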
Besides these approaches, I also like your mmap idea. I don't think you need a separate process if your OS supports copy-on-write memory mappings; then you can map the same memory area multiple times and provide a stable snapshot to readers.

Implementing a lock free data structure on disk

I have an interesting challenge, for those with strong background in lock-free data structures, and disk based data structures.
I'm looking for a way to build in C++ a data structure to hold a varying amount of objects.
The limitations are as follows:
The data structure must reside on disk.
There is one thread writing to the data structure and many others reading from it.
Every read is atomic (let's assume I can atomically read a block of size 32/64 KB, and that all objects are smaller than that in size).
A write should not block a read; for that, it is possible to assume that I can also write a block of 32/64 KB atomically.
Locks cannot be used at all.
Any suggestions?
I was thinking of using something like a B-tree: when nodes need splitting and new data needs writing, move them to new nodes at the end of the file, and then just update the pointers to the nodes, which would reside, for example, in some other file (the original blocks would be marked as free and added to a freestore).
However, I run into a problem if my mapping file is greater than 32/64 KB. Say I want it to hold even just 1 million object pointers: at 4 bytes per pointer that is 4 million bytes, roughly 4 MB (and at 1 billion objects far more than that), which means the mapping file cannot be written in an atomic manner.
So if someone has a better suggestion as to maybe how to implement the above - or even some direction it would be greatly appreciated.
As far as I know, all open-source/commercial implementations of B-trees use locks of some sort, which I cannot use.
Thanks,
Max.
You won't get very far by just assuming reads/writes are atomic -- mainly because they're not, and you'll end up emulating it in a way that'll kill performance.
It sounds like you want to research MVCC, which is the pretty standard mechanism to use when designing a lock-free database. The basic concept is that every read gets a "snapshot" of the database -- usually implemented in a lock-free way by leaving old pages alone and performing any modifications to new pages only. Once the old pages are finished being used by readers, they're finally marked for re-use.
While MVCC is significantly more involved than a CPU/RAM lock-free structure, once you have it many of the same optimistic lock-free patterns apply towards using it.
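A deliberately simplified shadow-paging sketch of that leave-old-pages-alone approach (hypothetical on-disk format; real code needs fsync ordering, free-page recycling, and tracking of when readers finish with old pages). Page 0 is assumed reserved for the header:

    #include <cassert>
    #include <cstdio>
    #include <vector>

    const size_t kPageSize = 4096;

    struct Header { long rootPage; };  // lives in page 0, smaller than one sector

    // Copy-on-write: a modified page is appended, never overwritten in place,
    // so readers holding the old root keep seeing a consistent old tree.
    long appendPage(std::FILE* f, const std::vector<char>& page) {
        assert(page.size() == kPageSize);
        std::fseek(f, 0, SEEK_END);
        long offset = std::ftell(f);
        std::fwrite(&page[0], 1, kPageSize, f);
        return offset / kPageSize;     // page number of the new copy
    }

    // Publishing the new tree is a single small header write, which matches
    // the question's assumption that one small block can be written atomically.
    void publishRoot(std::FILE* f, long newRootPage) {
        std::fflush(f);                // new pages must reach the disk first
        Header h; h.rootPage = newRootPage;
        std::fseek(f, 0, SEEK_SET);
        std::fwrite(&h, sizeof h, 1, f);
        std::fflush(f);                // a real store needs fsync here for durability
    }

This append-and-swap-root scheme is the core of copy-on-write B+trees such as LMDB's, mentioned below.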
LMDB will do all of this with no problem. It is an MVCC B+tree and readers are completely lockless.

How to LRU-cache numerous objects made of C++ STL heavy structures?

I have big C++/STL data structures (myStructType) with nested lists and maps. I have many objects of this type that I want to LRU-cache by key. I can reload objects from disk when needed. Moreover, the cache has to be shared in a multi-process, high-performance application running on a BSD platform.
I can see several solutions:
I can maintain a lifetime-sorted list of pair<size_t lifeTime, myStructType v>, plus a map for O(1) access to the index of the desired object in the list from its key; I can use shm and mmap to store everything, and a lock to manage access (cf. here).
I can use a redis server configured for LRU, and redesign my data structures into redis key/value and key/list pairs.
I can use a redis server configured for LRU, and serialise my data structures (myStructType) to have a simple key/value to manage with redis.
There may be other solutions, of course. How would you do this, or better, how have you successfully done this, keeping high performance in mind?
In addition, I would like to avoid heavy dependencies like Boost.
I actually built caches (not only LRU) recently.
Options 2 and 3 are quite likely not faster than re-reading from disk. That's effectively no cache at all. Also, redis would be a far heavier dependency than Boost.
Option 1 can be challenging. For instance, you suggest "a lock". That would be quite a contended lock, as it must protect each and every lifetime update, plus all LRU operations. Since your objects are already heavy, it may be worthwhile to have a unique lock per object. There are intermediate variants of this solution, where there is more than one lock, but also more than one object per lock. (You still need a lock to protect the whole map, but that's for replacement only.)
You can also consider whether you really need strict LRU. That strategy assumes that the chance of an object being reused decreases over time; if that's not actually true, random replacement is just as good. You can also consider evicting more than one element at a time. One of the challenges is that an element that needs removing must be removed from the perspective of all threads, yet it's sufficient if one thread removes it. That's why batch removal helps: if a thread tries to take the lock for a batch removal and fails, it can continue under the assumption that the cache will have free space soon.
One quick win is to not update the LRU timestamp of the most recently used element: it was already the newest, so making it any newer won't help. Of course this only has an effect if you often reuse that element quickly again, but (as noted above) otherwise you'd just use random eviction anyway.
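To tie these tips together, here is a single-process sketch of a classic O(1) LRU (list plus index map) that applies both the skip-the-newest trick and batch eviction; the shm/multi-process requirement from the question would additionally need the containers placed in shared memory behind a process-shared lock:

    #include <list>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    class LruCache {
        typedef std::pair<std::string, std::string> Entry; // key, value
    public:
        LruCache(size_t capacity, size_t batch) : capacity_(capacity), batch_(batch) {}

        bool get(const std::string& key, std::string& out) {
            std::lock_guard<std::mutex> guard(mutex_);
            auto it = index_.find(key);
            if (it == index_.end()) return false;
            // Quick win from above: an entry that is already newest stays put.
            if (it->second != order_.begin())
                order_.splice(order_.begin(), order_, it->second);
            out = it->second->second;
            return true;
        }

        void put(const std::string& key, const std::string& value) {
            std::lock_guard<std::mutex> guard(mutex_);
            auto it = index_.find(key);
            if (it != index_.end()) {
                it->second->second = value;
                order_.splice(order_.begin(), order_, it->second);
                return;
            }
            order_.push_front(Entry(key, value));
            index_[key] = order_.begin();
            // Batch eviction: free several slots at once, so replacement
            // work under the lock is amortised over many inserts.
            if (index_.size() > capacity_)
                for (size_t i = 0; i < batch_ && !order_.empty(); ++i) {
                    index_.erase(order_.back().first);
                    order_.pop_back();
                }
        }

    private:
        size_t capacity_, batch_;
        std::mutex mutex_;
        std::list<Entry> order_; // most recently used at the front
        std::unordered_map<std::string, std::list<Entry>::iterator> index_;
    };

Because index_ stores list iterators and splice never invalidates them, both get and put stay O(1) regardless of cache size.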