How to LRU-cache numerous objects made of heavy C++ STL structures?

I have big C++/STL data structures (myStructType) with nested lists and maps. I have many objects of this type that I want to LRU-cache by key. I can reload objects from disk when needed. Moreover, the cache has to be shared in a high-performance multiprocess application running on a BSD platform.
I can see several solutions:
I can use a lifetime-sorted list of pair<size_t lifeTime, myStructType v>, plus a map giving O(1) access from a key to the desired object's position in the list. I can use shm and mmap to store everything, and a lock to manage access (cf. here). A sketch of this approach appears below.
I can use a redis server configured for LRU, and redesign my data structures to redis key/value and key/lists pairs.
I can use a redis server configured for LRU, and serialise my data structures (myStructType) to have a simple key/value to manage with redis.
There may be other solutions of course. How would you do that, or better, how have you successfully done it, keeping high performance in mind?
In addition, I would like to avoid heavy dependencies like Boost.
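For illustration, here is a minimal single-process sketch of option 1's bookkeeping: the classic list-plus-hash-map LRU (C++11). The key type, capacity, and the empty myStructType are placeholders; the shm/mmap sharing and the lock are deliberately left out.

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Placeholder for the heavy structure described above.
struct myStructType { /* nested lists and maps ... */ };

class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // On a hit, move the entry to the front (most recently used) and return it.
    myStructType* get(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end())
            return nullptr;                       // miss: caller reloads from disk
        entries_.splice(entries_.begin(), entries_, it->second);
        return &it->second->second;
    }

    // Insert or refresh an entry, evicting the least recently used one if full.
    void put(const std::string& key, myStructType value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            entries_.splice(entries_.begin(), entries_, it->second);
            return;
        }
        if (!entries_.empty() && entries_.size() >= capacity_) {
            index_.erase(entries_.back().first);  // evict the LRU tail
            entries_.pop_back();
        }
        entries_.emplace_front(key, std::move(value));
        index_[key] = entries_.begin();
    }

private:
    using Entry = std::pair<std::string, myStructType>;
    std::size_t capacity_;
    std::list<Entry> entries_;   // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};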

I actually built caches (not only LRU) recently.
Options 2 and 3 are quite likely not faster than re-reading from disk. That's effectively no cache at all. Also, this would be a far heavier dependency than Boost.
Option 1 can be challenging. For instance, you suggest "a lock". That would be quite a contended lock, as it must protect each and every lifetime update, plus all LRU operations. Since your objects are already heavy, it may be worthwhile to have a unique lock per object. There are intermediate variants of this solution, where there is more than one lock, but also more than one object per lock. (You still need a lock to protect the whole map, but that's for replacement only.)
You can also consider whether you really need strict LRU. That strategy assumes that the chances of an object being reused decrease over time. If that's not actually true, random replacement is just as good. You can also consider evicting more than one element at a time. One of the challenges is that when an element needs removing, it needs removing from all threads' point of view, but it's sufficient if one thread removes it. That's why batch removal helps: if a thread tries to take a lock for batch removal and fails, it can continue under the assumption that the cache will have free space soon.
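A minimal sketch of that batch-eviction idea (names hypothetical; for the shm case you would want a process-shared pthread mutex instead of std::mutex): the thread that fails to take the eviction lock just carries on.

#include <mutex>

std::mutex evict_mtx;  // protects the replacement/eviction step only

void maybe_evict_batch(/* cache reference ... */) {
    // Try to become the single evicting thread; if another thread already is,
    // continue under the assumption that space will be freed soon.
    std::unique_lock<std::mutex> lk(evict_mtx, std::try_to_lock);
    if (!lk.owns_lock())
        return;
    // ... remove a whole batch of least-recently-used (or random) entries ...
}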
One quick win is to not update the LRU time of the last-used element: it was already the newest, so making it any newer won't help. This of course only has an effect if you often reuse that element quickly, but (as noted above) otherwise you'd just use random eviction anyway.

Related

Does a multiple-reader/single-writer implementation in g++-4.4 (not C++11/14) via boost::shared_mutex impact performance?

Usage: In our production we have around 100 threads which can access the cache we are trying to implement. If the cache is missed, the information will be fetched from the database and the cache will be updated via a writer thread.
To achieve this we are planning to implement a multiple-reader/single-writer scheme. We cannot update the g++ version since we are using g++-4.4.
Update: Each worker thread can work for both read and write. If the cache is missed, the information is fetched from the DB and cached.
Problem Statement:
We need to implement a cache to enhance performance.
Cache reads are much more frequent, and writes to the cache are much rarer.
I think we can use the boost::shared_mutex, boost::shared_lock, boost::upgrade_lock, and boost::upgrade_to_unique_lock implementation.
But we learnt that boost::shared_mutex has performance issues:
Performance comparison on reader writer locks
Lib boost devel
Questions
Does boost::shared_mutex impact performance when reads are much more frequent than writes?
What other constructs and design approaches can we take, considering compiler version g++-4.4?
Is there a workaround to design it such that reads are lock-free?
Also, we intend to use a map to keep the cached information.
If writes were non-existent, one possibility would be a two-level cache, where you first have a thread-local cache and then the normal cache with a mutex or reader/writer lock.
If writes are extremely rare, you can do the same, but with some lock-free way of invalidating the thread-local cache, e.g. an atomic int updated with every write; when it changes, clear the thread-local cache.
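A sketch of that invalidation scheme, written with C++11 atomics for readability (with g++-4.4 you would substitute the __sync_* builtins and __thread); all names here are made up.

#include <atomic>
#include <map>
#include <string>

std::atomic<unsigned> cache_generation(0);   // bumped on every write to the shared cache

struct Record { /* cached data ... */ };

// Thread-local L1 cache plus the generation it was filled under.
thread_local std::map<std::string, Record> local_cache;
thread_local unsigned local_generation = 0;

bool lookup(const std::string& key, Record& out) {
    unsigned g = cache_generation.load(std::memory_order_acquire);
    if (g != local_generation) {             // a write happened somewhere: drop stale entries
        local_cache.clear();
        local_generation = g;
    }
    std::map<std::string, Record>::iterator it = local_cache.find(key);
    if (it != local_cache.end()) { out = it->second; return true; }
    // ... fall through to the shared cache (mutex or reader/writer lock),
    //     then to the database; store the result in local_cache ...
    return false;
}

void publish_write(/* ... update the shared cache under its lock ... */) {
    cache_generation.fetch_add(1, std::memory_order_release);
}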
You need to profile it.
In case you're stuck because you don't have a "similar enough" environment where you can actually test things, you can probably write a simple wrapper using pthreads: pthread_rwlock_t
pthread_rwlock_rdlock
pthread_rwlock_wrlock
pthread_rwlock_unlock
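A minimal RAII wrapper over those calls might look like the following (error handling omitted); it needs nothing beyond pthreads, so it compiles with g++-4.4.

#include <pthread.h>

class RwLock {
public:
    RwLock()  { pthread_rwlock_init(&rw_, NULL); }
    ~RwLock() { pthread_rwlock_destroy(&rw_); }
    void lock_read()  { pthread_rwlock_rdlock(&rw_); }
    void lock_write() { pthread_rwlock_wrlock(&rw_); }
    void unlock()     { pthread_rwlock_unlock(&rw_); }
private:
    pthread_rwlock_t rw_;
    RwLock(const RwLock&);             // non-copyable (pre-C++11 style)
    RwLock& operator=(const RwLock&);
};

// Scoped guards so the lock is released even if a lookup throws.
struct ReadGuard {
    explicit ReadGuard(RwLock& l) : l_(l) { l_.lock_read(); }
    ~ReadGuard() { l_.unlock(); }
    RwLock& l_;
};
struct WriteGuard {
    explicit WriteGuard(RwLock& l) : l_(l) { l_.lock_write(); }
    ~WriteGuard() { l_.unlock(); }
    RwLock& l_;
};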
Of course you can design things to be lock-free. The most obvious solution is to not share state. (If you do share state, you'll have to check that your target platform supports atomic instructions.) However, without any knowledge of your application domain, I feel very safe suggesting you do not want lock-free. See e.g. Do lock-free algorithms really perform better than their lock-full counterparts?
It all depends on the frequency of the updates, the size of the cache and how much is changed in the update.
Let's assume you have a rather big cache with a lot of changes on each update. Then I would use a read-copy-update pattern, which is lock-free.
If your cached data is pretty small and read in one go (e.g. a single integer), RCU is also a good choice.
For a big cache with small updates, or a big cache with updates that are too frequent for RCU, a read-write lock is a good choice.
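A rough read-copy-update sketch along those lines, using a shared_ptr snapshot: readers keep the old version alive while the writer publishes a copy. This assumes shared_ptr is available (with g++-4.4, std::tr1::shared_ptr or boost::shared_ptr plays the same role); the Snapshot type is hypothetical.

#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Snapshot { std::map<std::string, int> data; };   // hypothetical cached state

std::shared_ptr<const Snapshot> g_current(new Snapshot());
std::mutex g_swap_mtx;   // held only long enough to copy or assign the pointer

// Readers grab a snapshot; they keep using it even if a newer one is published.
std::shared_ptr<const Snapshot> acquire_snapshot() {
    std::lock_guard<std::mutex> lk(g_swap_mtx);
    return g_current;
}

// A single writer copies, modifies, then swaps in the new version.
void update(const std::string& key, int value) {
    std::shared_ptr<const Snapshot> old = acquire_snapshot();
    std::shared_ptr<Snapshot> next(new Snapshot(*old));  // copy outside any lock
    next->data[key] = value;
    std::lock_guard<std::mutex> lk(g_swap_mtx);
    g_current = next;   // publish; in-flight readers finish on the old copy
}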
Alongside other answers suggesting you profile it, a large benefit can be had if you can somehow structure or predict the type, order and size of the requests.
If particular types of data are requested in a typical cycle, it would be better to split up the cache per data type. You will improve cache hit ratios, and the size of each cache can be adapted to its type. You will also reduce possible contention.
Likewise, the size of the requests is important when choosing your update approach. Smaller data fragments may be stored longer or even pooled together, while larger chunks may be requested less frequently.
Even with a basic prediction scheme in place that covers only the most frequent fetch patterns, you may already improve performance quite a bit. It may even be worth trying to train e.g. a neural network to guess the next request in advance.

Predictively computing potentially needed values on large shared data structure with infrequent updates

I have a system I need to design with low latency in mind, processing power and memory are generous. I have a large (several GB) data structure that is updated once every few seconds. Many (read only) operations are going to run against this data structure between updates, in parallel, accessing it heavily. As soon as an update occurs, all computations in progress should be cleanly cancelled, as their results are invalidated by the update.
The issue I'm running into here is that writes are infrequent, but readers access the structure so often that locking around each individual reader access would be a huge performance hit. I'm fine with the readers reading invalid data, but then I need to deal with any broken invariants (assertions) or segfaults due to stale pointers, etc. At the same time, I can't have readers block writers, so a reader-writer lock acquired at every reader thread's start is unacceptable.
The only solution I can think of, and it has a number of issues, is to allocate a mapping with mmap, put the readers in separate processes, and mprotect the memory to kill the workers when it's time to update. I'd prefer a cross-platform solution (ideally pure C++), however, and ideally without forking every few seconds. This would also require some surgery to get all the data structures located in shm.
Something like a revocable lock would do exactly what I need, but I don't know of any libraries that provide such functionality.
If this were a database I'd use multi-version concurrency control (MVCC). Readers obtain a logical snapshot while the underlying physical data structures are mostly lock-free (or locked very briefly and in a fine-grained way).
You say your memory is generously equipped. Can you just create a complete copy of the data structure? Then you modify the copy and swap it out atomically.
Or, can you use immutable data-structures so that readers continue to use the old version and the writer creates new objects?
Or, you implement MVCC in a fine-grained way. Let's say you want to version a hash-set. Instead of keeping one value per key, you keep one value per key per version. Readers read from the latest version that is <= the version that existed when they started to read. Writers create a new version number for each write "transaction". Only when all its writes are complete do readers start picking up changes from the new version. This is how MVCC databases do it.
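A toy sketch of that versioned map, showing only the version logic (single writer, many readers). The containers themselves are not synchronised here; in a real implementation they would need fine-grained locking or lock-free structures, so treat this purely as an illustration of the idea.

#include <atomic>
#include <map>
#include <string>

// key -> (version -> value), one value per key per version as described above.
std::map<std::string, std::map<unsigned long, int> > versioned;
std::atomic<unsigned long> committed_version(0);

// Reader: pick the newest value whose version is <= the snapshot it started with.
bool read(const std::string& key, unsigned long snapshot, int& out) {
    std::map<std::string, std::map<unsigned long, int> >::const_iterator k = versioned.find(key);
    if (k == versioned.end()) return false;
    std::map<unsigned long, int>::const_iterator v = k->second.upper_bound(snapshot);
    if (v == k->second.begin()) return false;
    --v;                       // last entry with version <= snapshot
    out = v->second;
    return true;
}

// Writer: stage values under a new version, then commit it in one step.
void write_transaction(const std::string& key, int value) {
    unsigned long next = committed_version.load() + 1;
    versioned[key][next] = value;          // readers ignore this until commit
    committed_version.store(next);         // snapshots taken after this see it
}

A reader would capture unsigned long snapshot = committed_version.load(); once when it starts and pass that to every read it performs.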
Besides these approaches I also liked your mmap idea. I don't think you need a separate process if your OS supports copy-on-write memory mappings. Then you can map the same memory area multiple times and provide a stable snapshot to readers.

(preferably boost) lock-free array/vector/map/etc?

Considering my lack of c++ knowledge, please try to read my intent and not my poor technical question.
This is the backbone of my program https://github.com/zaphoyd/websocketpp/blob/experimental/examples/broadcast_server/broadcast_server.cpp
I'm building a websocket server with websocket++ (and oh is websocket++ sweet; I highly recommend it). I can easily manipulate per-user data thread-safely because it really doesn't need to be touched by different threads; however, I do want to be able to write to an array (I'm going to use the catch-all term "array" from weaker languages like VB, PHP, JS) in one function thread (with multiple iterations that could be running simultaneously) and also read from it in one or more threads.
Take Stack as an example: if I wanted to have all of the ids (the PRIMARY column of all articles) sorted in a particular way, in this case by net votes, and held in memory, I'm thinking I would have a function that's called in its own boost::thread, fired whenever a vote on the site comes in, to reorder the array.
How can I do this without locking & blocking? I'm 100% fine with users reading from an old array while another is being built, but I absolutely do not want their reads or the thread writes to ever fail/be blocked.
Does a lock-free array exist? If not, is there some way to build the new array in a temporary array and then write it to the actual array when the building is finished without locking & blocking?
Have you looked at Boost.Lockfree?
Uh, uh, uh. Complicated.
Look here (for an example): RCU -- and this is only about multiple reads along with ONE write.
My guess is that multiple writers at once are not going to work. You should rather look for a more efficient representation than an array, one that allows for faster updates. How about a balanced tree? log(n) should never block anything in a noticeable fashion.
Regarding boost -- I'm happy that it finally has proper support for thread synchronization.
Of course, you could also keep a copy and batch the updates. Then a background process merges the updates and copies the result for the readers.
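The "build a new array, then publish it" approach from the question can be done with a shared_ptr swap: readers always see either the old or the new vector, never a half-built one, and the lock below is held only for the pointer copy. This is a sketch under those assumptions (boost::shared_ptr works the same way as std::shared_ptr here; with C++11 one could also use the std::atomic_load/atomic_store overloads for shared_ptr instead of the mutex).

#include <algorithm>
#include <memory>
#include <mutex>
#include <vector>

std::shared_ptr<const std::vector<int> > g_sorted_ids(new std::vector<int>());
std::mutex g_publish_mtx;   // held only for the pointer copy/assignment

// Readers: cheap, and never see a partially built array.
std::shared_ptr<const std::vector<int> > current_ids() {
    std::lock_guard<std::mutex> lk(g_publish_mtx);
    return g_sorted_ids;
}

// Writer thread: rebuild whenever a vote comes in, then publish.
void rebuild_and_publish(std::vector<int> ids /* already fetched */) {
    std::sort(ids.begin(), ids.end());   // sort by net votes in real code
    std::shared_ptr<const std::vector<int> > fresh(new std::vector<int>(ids));
    std::lock_guard<std::mutex> lk(g_publish_mtx);
    g_sorted_ids = fresh;                // old array stays alive while readers hold it
}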

How should I handle thread synchronisation in a shared tree data structure?

This is just for a kind of concurrency refresher...
Imagine I have a B+ tree data structure in memory - multiple items per node, only leaf nodes contain items, leaf nodes also form a linked list for easy sequential access. Inserts and deletes mostly only affect a leaf node, but can cause nodes to split or merge in a process that may propagate to the root.
I have a single-thread implementation, and the updates follow a kind of pre-planning approach. A recursion steps up the tree from leaf level as far as nodes need to change, building a linked list (linking local variables in different recursion frames) that describes the changes needed. When it knows what's needed, it can check whether it can allocate all needed nodes, and apply all needed changes (or not) by referencing this plan before falling out of the recursion.
This implementation also "maintains" iterators on updates, so iterators aren't invalidated by inserts/deletes unless the specific item they point to is deleted. Inserts/deletes within the same node cause the iterators pointing into that node to be updated.
Trouble is, I need to make it multithreaded - supporting potentially many readers and writers at once.
I want multiple readers and writers to be able to work at the same time, so long as there is no risk of corruption as a result. So for reading, I don't want mutually exclusive access at all, even to a single node. For writing, I want to lock the minimum number of nodes needed for the change. And I want to avoid deadlock, of course.
Thankfully, it isn't something I actually need to do - but since I've neglected my concurrency skills, this just seems like a good thought experiment.
This is obviously similar to the kinds of problems that databases and filesystems have to handle, so I'm guessing I might get some references to that kind of thing, which would be great.
So - how would I handle the thread synchronisation for this? I can vaguely see a role for mutexes and/or semaphores on nodes, but what strategies would I use to work with them?
Definitely a challenging task! I see that you are a C++ programmer; however, I believe C++ has concepts similar to Java's, so I'll try to help from a Java standpoint.
So for reading, I don't want mutually exclusive access at all, even to a single node
You could use ReadWriteLock. It can be held simultaneously by multiple reader threads, so long as there are no writers. The write lock is exclusive. You just have to use exclusive access when writing. Do you have an analogue in C++?
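On the C++ side, pthread_rwlock_t and boost::shared_mutex play that role, and since C++17 there is std::shared_mutex. A small sketch with the standard type (the node functions are placeholders):

#include <mutex>
#include <shared_mutex>

std::shared_mutex node_mutex;   // one per node, or per subtree

void read_node(/* ... */) {
    std::shared_lock<std::shared_mutex> lk(node_mutex);   // many readers at once
    // ... inspect keys/children ...
}

void write_node(/* ... */) {
    std::unique_lock<std::shared_mutex> lk(node_mutex);   // exclusive
    // ... split/merge/insert ...
}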
And I want to avoid deadlock, of course.
Just lock multiple nodes in order of levels (e.g. from top to bottom). That will guarantee you protection from deadlocks (that would be something similar to Lamport's Bakery Algorithm).
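A sketch of that ordering rule: collect the nodes the update plan touches, sort them by a fixed criterion (level first, then address), and always lock in that order. The Node layout here is illustrative.

#include <algorithm>
#include <cstddef>
#include <mutex>
#include <vector>

struct Node {
    int level;          // distance from the root, say
    std::mutex mtx;
    // ... keys, children ...
};

// Fixed global order: shallower nodes first, ties broken by address.
bool lock_order(const Node* a, const Node* b) {
    if (a->level != b->level) return a->level < b->level;
    return a < b;
}

void lock_plan(std::vector<Node*>& plan) {
    std::sort(plan.begin(), plan.end(), lock_order);
    for (std::size_t i = 0; i < plan.size(); ++i)
        plan[i]->mtx.lock();   // every writer locks in the same order: no cycle, no deadlock
}

void unlock_plan(std::vector<Node*>& plan) {
    for (std::size_t i = plan.size(); i > 0; --i)
        plan[i - 1]->mtx.unlock();
}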
As for databases - they resolve deadlocks by killing one process :-).
One more strategy is to implement a non-blocking tree structure in a manner similar to how Cliff Click implemented his non-blocking hash map (a state machine with all cases covered):
video
Cheers

Alternatives to locks for synchronisation

I'm currently in the process of developing my own little threading library, mainly for learning purposes, and am at the message-queue part, which will involve a lot of synchronisation in various places. Previously I've mainly used locks and mutexes, and condition variables a bit, which are all variations of the same theme: a lock for a section that should only be used by one thread at a time.
Are there any solutions to synchronisation other than using locks? I've read about lock-free synchronisation in places, but some consider hiding the locks in containers to be lock-free, which I disagree with; you just don't explicitly use the locks yourself.
Lock-free algorithms typically involve using compare-and-swap (CAS) or similar CPU instructions that update some value in memory not only atomically, but also conditionally and with an indicator of success. That way you can code something like this:
1 do
2 {
3     current_value = the_variable
4     new_value = ...some expression using current_value...
5 } while (!compare_and_swap(the_variable, current_value, new_value));
compare_and_swap() atomically checks whether the_variable's value is still current_value; only if so does it update the_variable's value to new_value and return true.
The exact calling syntax varies with the CPU, and may involve assembly language or system/compiler-provided wrapper functions (use the latter if available: there may be compiler optimisations or issues that their usage restricts to safe behaviours); generally, check your docs.
The significance is that when another thread updates the variable after the read on line 3 but before the CAS on line 5 attempts the update, the compare and swap instruction will fail because the state from which you're updating is not the one you used to calculate the desired target state. Such do/while loops can be said to "spin" rather than lock, as they go round and round the loop until CAS succeeds.
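In C++11 terms the same spin loop looks like this, using std::atomic's compare_exchange (the "some expression" here just doubles the value, purely for illustration):

#include <atomic>

std::atomic<int> the_variable(1);

void update() {
    int current_value = the_variable.load();
    int new_value;
    do {
        new_value = current_value * 2;   // ...some expression using current_value...
        // compare_exchange_weak reloads current_value on failure, so the loop
        // recomputes from the freshly observed state, exactly as described above.
    } while (!the_variable.compare_exchange_weak(current_value, new_value));
}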
Crucially, your existing threading library can be expected to have a two-stage locking approach for mutex, read-write locks etc. involving:
First stage: spinning using CAS or similar (i.e. spin on { read the current value; if it's not set, then cas(current = not set, new = set) }), which means other threads doing a quick update often won't result in your thread swapping out to wait, with all the relatively time-consuming overheads associated with that.
The second stage is only used if some limit of loop iterations or elapsed time is exceeded: it asks the operating system to queue the thread until it knows (or at least suspects) the lock is free to acquire.
The implication of this is that if you're using a mutex to protect access to a variable, then you are unlikely to do any better by implementing your own CAS-based "mutex" to protect the same variable.
Lock free algorithms come into their own when you are working directly on a variable that's small enough to update directly with the CAS instruction itself. Instead of being...
get a mutex (by spinning on CAS, falling back on slower OS queue)
update variable
release mutex
...they're simplified (and made faster) by simply having the spin on CAS do the variable update directly. Of course, you may find the work to calculate new value from old painful to repeat speculatively, but unless there's a LOT of contention you're not wasting that effort often.
This ability to update only a single location in memory has far-reaching implications, and work-arounds can require some creativity. For example, if you had a container using lock-free algorithms, you may decide to calculate a potential change to an element in the container, but can't sync that with updating a size variable elsewhere in memory. You may need to live without size, or be able to use an approximate size where you do a CAS-spin to increment or decrement the size later, but any given read of size may be slightly wrong. You may need to merge two logically-related data structures - such as a free list and the element-container - to share an index, then bit-pack the core fields for each into the same atomically-sized word at the start of each record. These kinds of data optimisations can be very invasive, and sometimes won't get you the behavioural characteristics you'd like. Mutexes et al are much easier in this regard, and at least you know you won't need a rewrite to mutexes if requirements evolve just that step too far. That said, clever use of a lock-free approach really can be adequate for a lot of needs, and yield a very gratifying performance and scalability improvement.
A core (good) consequence of lock-free algorithms is that one thread can't be holding the mutex then happen to get swapped out by the scheduler, such that other threads can't work until it resumes; rather - with CAS - they can spin safely and efficiently without an OS fallback option.
Things that lock free algorithms can be good for include updating usage/reference counters, modifying pointers to cleanly switch the pointed-to data, free lists, linked lists, marking hash-table buckets used/unused, and load-balancing. Many others of course.
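Two of those cases in a few lines, assuming C++11 atomics: a usage counter bumped with fetch_add, and a pointer switched to publish new data cleanly. The Config type and function names are hypothetical.

#include <atomic>

std::atomic<long> usage_count(0);

void note_use() { usage_count.fetch_add(1, std::memory_order_relaxed); }   // lock-free counter

struct Config { /* ... */ };
std::atomic<Config*> current_config(nullptr);

// Publish a new configuration; readers doing current_config.load() see the old
// or the new object, never half of each. (Reclaiming the old object safely is
// the hard part: hazard pointers, RCU, or simply leaking it in this sketch.)
void publish(Config* fresh) {
    Config* old = current_config.exchange(fresh, std::memory_order_acq_rel);
    (void)old;   // deletion deliberately deferred/skipped here
}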
As you say, simply hiding use of mutexes behind some API is not lock free.
There are a lot of different approaches to synchronization. There are various variants of message-passing (for example, CSP) or transactional memory.
Both of these may be implemented using locks, but that's an implementation detail.
And then of course, for some purposes, there are lock-free algorithms or data-structures, which make do with just a few atomic instructions (such as compare-and-swap), but this isn't really a general-purpose replacement for locks.
Some data structures can be implemented in a lock-free fashion. For example, the producer/consumer pattern can often be implemented using lock-free linked-list structures.
However, most lock-free solutions require significant thought on the part of the person designing the specific program/specific problem domain. They aren't generally applicable for all problems. For examples of such implementations, take a look at Intel's Threading Building Blocks library.
Most important to note is that no lock-free solution is free. You're going to give something up to make that work, at the bare minimum in implementation complexity, and probably performance in scenarios where you're running on a single core (for example, a linked list is MUCH slower than a vector). Make sure you benchmark before using lock free on the base assumption that it would be faster.
Side note: I really hope you're not using condition variables, because there's no way to ensure that their access operates as you wish in C and C++.
Yet another library to add to your reading list: Fast Flow
What's interesting in your case is that they are based on lock-free queues. They have implemented a simple lock-free queue and then have built more complex queues out of it.
And since the code is free, you can peruse it and get the code for the lock-free queue, which is far from trivial to get right.