ConcurrentHashMap, which concurrent features improved in JDK8

ConcurrentHashMap, which concurrent features improved in JDK8 - concurrency

Can any concurrent expert explain in ConcurrentHashMap, which concurrent features improved comparing with which in previous JDKs

Well, the ConcurrentHashMap has been entirely rewritten. Before Java 8, each ConcurrentHashMap had a “concurrency level” which was fixed at construction time. For compatibility reasons, there is still a constructor accepting such a level though not using it in the original way. The map was split into as many segments, as its concurrency level, each of them having its own lock, so in theory, there could be up to concurrency level concurrent updates, if they all happened to target different segments, which depends on the hashing.
In Java 8, each hash bucket can get updated individually, so as long as there are no hash collisions, there can be as many concurrent updates as its current capacity. This is in line with the new features like the compute methods which guaranty atomic updates, hence, locking of at least the hash bucket which gets updated. In the best case, they lock indeed only that single bucket.
Further, the ConcurrentHashMap benefits from the general hash improvements applied to all kind of hash maps. When there are hash collisions for a certain bucket, the implementation will resort to a sorted map like structure within that bucket, thus degrading to a O(log(n)) complexity rather than the O(n) complexity of the old implementation when searching the bucket.

I think there are several changes compared with JDK7:
Lazy initialization: in JDK8, the memory used for each segment is allocated only when some entity is added to the map. In JDK7,this is done when the map is created.
Some new function is added in JDK8 like forEach, reduce, search etc.
Inner structure change : the TreeBin (red-black tree) is used in jdk8 to improve the search efficiency.

Related

Concurrent linked/skip lists: 'update' and 'get' operations

I have an assignment that requires setting up a data structure with concurrent reads/writes (an order book for a matching engine in a trading exchange), and I have settled on concurrent linked/skip lists. I've looked at several of the following articles/reports, where some are lock-free, and many are fine-grained locked (listed below in no particular order):
Practical concurrent unrolled lists using lazy synchronisation
Practical lock-freedom page 53
A Contention-Friendly, Non-Blocking Skip List
A Provably Correct Concurrent Skip List
A Simple Optimistic Skiplist Algorithm
All of these have fairly detailed algorithm pseudocode listings, but there are two issues I note with all of them:
They are all maps—they associate some key to some value, whereas I need the Node class in all these algorithms to simply contain some struct T (more particularly, I need the number of units in that order, the unit selling/buying price, the order ID, and an insertion timestamp). MSDN has a very nice C# implementation of a skip list containing some T (albeit not concurrent, which is a strict requirement), and is straightforward to adapt to C++ (ergo the tag).
They don't have update and get operations—what they do have are find (which returns a boolean value, and not the node value itself), insert, and delete operations. I am wondering if I can somehow compose and modify the latter three to create the former two, but I am a little lost.
How might I implement get/update so that concurrency and thread ordering is maintained?
I am leaning towards fine-grained locking (which is easier to reason about, even if slower) than lock-free algorithms (which I don't fully understand).

Why are Get and MultiGet significantly slower for large key sets compared to using an Iterator?

I'm currently playing around with RocksDB (C++) and was curious about some performance metrics I've experienced.
For testing purposes, my database keys are file paths and the values are filenames. My database has around 2M entries in it. I'm running RocksDB locally on a MacBook Pro 2016 (SSD).
My use case is dominated by reads. Full key scans are quite common as are key scans that include a "significant" number of keys. (50%+)
I'm curious about the following observations:
1. An Iterator is dramatically faster than calling Get when performing full key scans.
When I want to look at all of the keys in the database, I'm seeing a 4-8x performance improvement when using an Iterator instead of calling Get for each key. The use of MultiGet makes no difference.
In the case of calling Get roughly 2M times, the keys have been previously fetched into a vector and sorted lexicographically. Why is calling Get repeatedly so much slower than using an Iterator? Is there a way to narrow the performance gap between the two APIs?
2. When fetching around half the keys, the performance between using an Iterator and Get starts to become negligible.
As the number of keys to fetch is reduced, then making multiple calls to Get starts to take about as long as using an Iterator as the iterator is paying the price of scanning over keys that aren't in the desired keyset.
Is there some "magic" ratio where this becomes true for most databases? For example, if I need to scan over 25% of the keys, then calling Get is faster, but if it's 75% of the keys, then an Iterator is faster. But those numbers are just "made up" by rough testing.
3. Fetching keys in sorted order does not appear to improve performance.
If I pre-sort the keys I want to fetch into the same order that an Iterator would return them in, that does not appear to make calling Get multiple times any faster. Why is that? It's mentioned in the documentation that it's recommended to sort keys before doing a batch insert. Does Get not benefit from the same look-ahead caching that an Iterator benefits from?
4. What settings are recommended for a read-heavy use case?
Finally, are there any specific settings recommended for a read-heavy use case that might involve scanning a significant number of keys at once?
macOS 10.14.3, MacBook Pro 2016 SSD, RocksDB 5.18.3, Xcode 10.1

RocksDB internally represents its data as a log-structured merge tree which has several sorted layers by default (this can be changed with plugins/config). The intuition from Paul's first answer holds, except there is no classical index; the data is actually sorted on disk with pointers to the next files. The lookup operation has on average logarithmic complexity, but advancing an iterator in a sorted range is constant time. So for dense sequential reads, iterating is much faster.
The point where the costs balance out is determined not only by the number of keys you read, but also by the size of the database. As the database grows, the lookup becomes slower, while Next() remains constant. Very recent inserts are likely to be read very fast, since they may still be in memory (memtables).
Sorting the keys actually just improves your cache hit-rate. Depending on your disk, the difference may be very small, e.g., if you have an NVMe SSD, the difference in access time is just not as drastic anymore as it was when it was RAM vs. HDD. If you have to do several operations over the same or even different key-sets doing them by key-order (f(a-c) g(a-c) f(d-g)...) instead of sequentially should improve your performance, since you will have more cache-hits and also benefit from the RocksDB block cache.
The tuning guide is a good starting point, especially the video on database solutions, but if RocksDB is too slow for you also consider using a DB based on a different storage algorithm. LSM is typically better for write-heady workloads, and while RocksDB lets you control read vs. write vs. space amplification very well, a b-tree or ISAM based solution may just be much faster for range-reads/repeated reads.

I don't know anything about RocksDB per-se, but I can answer a lot of this from first principles.
An Iterator is dramatically faster than calling Get when performing full key scans.
This is likely to be because Get has to do a full lookup in the underlying index (starting from the top) whereas advancing an iterator can be achieved by just moving from the current node to the next. Assuming the index is implemented as a red-black tree or similar, there's a lot less work in the second method than the first.
When fetching around half the keys, the performance between using an Iterator and Get starts to become negligible.
So you are skipping entries by calling iterator->Next () multiple times? If so, then there will come a point where it's cheaper to call Get for each key instead, yes. Exactly when that happens will depend on the number of entries in the index (since that determines the number of levels in the tree).
Fetching keys in sorted order does not appear to improve performance.
No, I would not expect it to. Get is (presumably) stateless.
What settings are recommended for a read-heavy use case?
That I don't know, sorry, but you might read:
https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide

Why does CouchDB use an append-only B+ tree and not a HAMT

I'm reading up on datastructures, especially immutable ones like the append-only B+ tree used in CouchDB and the Hash array mapped trie used in Clojure and some other functional programming languages.
The main reason datastructures that work well in memory might not work well on disk appears to be time spent on disk seeks due to fragmentation, as with a normal binary tree.
However, HAMT is also very shallow, so doesn't require any more seeks than a B tree.
Another suggested reason is that deletions from a array mapped trie are more expensive tha from a B tree. This is based on the assumption that we're talking about a dense vector, and doesn't apply when using either as a hash map.
What's more, it seems that a B tree does more rebalancing, so using it in an append-only manner produces more garbage.
So why do CouchDB and practically every other database and filesystem use B trees?
[edit] fractal trees? log-structured merge tree? mind = blown
[edit] Real-life B trees use a degree in the thousands, while a HAMT has a degree of 32. A HAMT of degree 1024 would be possible, but slower due to popcnt handling 32 or 64 bits at a time.

B-trees are used because they are a well-understood algorithm that achieves "ideal" sorted-order read-cost. Because keys are sorted, moving to the next or previous key is very cheap.
HAMTs or other hash storage, stores keys in random order. Keys are retrieved by their exact value, and there is no efficient way to find to the next or previous key.
Regarding degree, it is normally selected indirectly, by selecting page size. HAMTs are most often used in memory, with pages sized for cache lines, while B-trees are most often used with secondary storage, where page sizes are related to IO and VM parameters.
Log Structured Merge (LSM) is a different approach to sorted order storage which achieves more optimal write-efficiency, by trading off some read efficiency. That hit to read efficiency can be a problem for read-modify-write workloads, but the fewer uncached reads there are, the more LSM provides better overall throughput vs B-tree - at the cost of higher worst case read latency.
LSM also offers the promise of a wider-performance envelope. Putting new data into its proper place is "deferred", offering the possibility to tune read-to-write efficiency by controlling the proportion of deferred cleanup work to live work. In theory, an ideal-LSM with zero-deferral is a B-tree and with 100%-deferral is a log.
However, LSM is more of a "family" of algorithms than a specific algorithm like a B-tree. Their usage is growing in popularity, but it is hindered by the lack of a de-facto optimal LSM design. LevelDB/RocksDB is one of the more practical LSM implementations, but it is far from optimal.
Another approach to achieving write-throughput efficiency is to write-optimize B-trees through write-deferral, while attempting to maintain their optimal read-throughput.
Fractal-trees, shuttle-trees, stratified-trees are this type of design, and represent a hybrid gray area between B-tree and LSM. Rather than deferring writes to an offline process, they amortize write-deferral in a fixed way. For example, such a design might represent a fixed 60%-write-deferral fraction. This means they can't achieve the 100% write-deferral performance of an LSM, but they also have a more predictable read-performance, making them more practical drop-in replacements for B-trees. (As in the commercial Tokutek MySQL and MongoDB fractal-tree backends)

Btrees are ordered by their key while in a hash map similar keys have very different hash values so are stored far each other. Now think of a query that do a range scan "give me yesterday's sales": with a hash map you have to scan all the map to find them, with a btree on the sales_dtm columns you'll find them nicely clustered and you exactly know where to start and stop reading.

How to LRU-cache numerous objects made of C++ STL heavy structures?

I have big C++/STL data structures (myStructType) with imbricated lists and maps. I have many objects of this type I want to LRU-cache with a key. I can reload objects from disk when needed. Moreover, it has to be shared in a multiprocessing high performance application running on a BSD plateform.
I can see several solutions:
I can consider a life-time sorted list of pair<size_t lifeTime, myStructType v> plus a map to o(1) access the index of the desired object in the list from its key, I can use shm and mmap to store everything, and a lock to manage access (cf here).
I can use a redis server configured for LRU, and redesign my data structures to redis key/value and key/lists pairs.
I can use a redis server configured for LRU, and serialise my data structures (myStructType) to have a simple key/value to manage with redis.
There may be other solutions of course. How would you do that, or better, how have you successfully done that, keeping in mind high performance ?
In addition, I would like to avoid heavy dependencies like Boost.

I actually built caches (not only LRU) recently.
Options 2 and 3 are quite likely not faster than re-reading from disk. That's effectively no cache at all. Also, this would be a far heavier dependency than Boost.
Option 1 can be challenging. For instance, you suggest "a lock". That would be quite a contended lock, as it must protect each and every lifetime update, plus all LRU operations. Since your objects are already heavy, it may be worthwhile to have a unique lock per object. There are intermediate variants of this solution, where there is more than one lock, but also more than one object per lock. (You still need a key to protect the whole map, but that's for replacement only)
You can also consider if you really need strict LRU. That strategy assumes that the chances of an object being reused decreases over time. If that's not actually true, random replacement is just as good. You can also consider evicting more than one element at a time. One of the challenges is that when an element needs removing, it would be so from all threads, but it's sufficient if one thread removes it. That's why a batch removal helps: if a thread tries to take a lock for batch removal and it fails, it can continue under the assumption that the cache will have free space soon.
One quick win is to not update the LRU time of the last used element. It was already the newest, making it any newer won't help. This of course only has an effect if you often use that element quickly again, but (as noted above) otherwise you'd just use random eviction.

std::map vs. self-written std::vector based dictionary

I'm building a content storage system for my game engine and I'm looking at possible alternatives for storing the data. Since this is a game, it's obvious that performance is important. Especially considering various entities in the engine will be requesting resources from the data structures of the content manager upon their creation. I'd like to be able to search resources by a name instead of an index number, so a dictionary of some sort would be appropriate.
What are the pros and cons to using an std::map and to creating my own dictionary class based on std::vector? Are there any speed differences (if so, where will performance take a hit? I.e. appending vs. accessing) and is there any point in taking the time to writing my own class?
For some background on what needs to happen:
Writing to the data structures occurs only at one time, when the engine loads. So no writing actually occurs during gameplay. When the engine exits, these data structures are to be cleaned up. Reading from them can occur at any time, whenever an entity is created or a map is swapped. There can be as little as one entity being created at a time, or as many as 20, each needing a variable number of resources. Resource size can also vary depending on the size of the file being read in at the start of the engine, images being the smallest and music being the largest depending on the format (.ogg or .midi).

Map: std::map has guaranteed logarithmic lookup complexity. It's usually implemented by experts and will be of high quality (e.g. exception safety). You can use custom allocators for custom memory requirements.
Your solution: It'll be written by you. A vector is for contiguous storage with random access by position, so how will you implement lookup by value? Can you do it with guaranteed logarithmic complexity or better? Do you have specific memory requirements? Are you sure you can implement a the lookup algorithm correctly and efficiently?
3rd option: If you key type is string (or something that's expensive to compare), do also consider std::unordered_map, which has constant-time lookup by value in typical situations (but not quite guaranteed).

If you want the speed guarantee of std::map as well as the low memory usage of std::vector you could put your data in a std::vector, std::sort it and then use std::lower_bound to find the elements.

std::map is written with performance in mind anyway, whilst it does have some overhead as they have attempted to generalize to all circumstances, it will probably end up more efficient than your own implementation anyway. It uses a red-black binary tree, giving all of it's operations O[log n] efficiency (aside from copying and iterating for obvious reasons).
How often will you be reading/writing to the map, and how long will each element be in it? Also, you have to consider how often will you need to resize etc. Each of these questions is crucial to choosing the correct data structure for your implementation.
Overall, one of the std functions will probably be what you want, unless you need functionality which is not in a single one of them, or if you have an idea which could improve on their time complexities.
EDIT: Based on your update, I would agree with Kerrek SB that if you're using C++0x, then std::unordered_map would be a good data structure to use in this case. However, bear in mind that your performance can degrade to linear time complexity if you have conflicting hashes (this cannot happen with std::map), as it will store the two pair's in the same bucket. Whilst this is rare, the probability of it obviously increases with the number of elements. So if you're writing a huge game, it's possible that std::unordered_map could become less optimal than std::map. Just a consideration. :)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js