How do two read operations on the same segment at the same time work in ConcurrentHashMap?

I am trying to understand how read operations work internally in Hashtable, HashMap and ConcurrentHashMap.
ConcurrentHashMap is internally divided into segments of size 32. So at max 32 threads can read at a time. What happens when we get two read operations on the same segment at the same time in ConcurrentHashMap?
Also, I would like to know how multiple reads of the same element work in Hashtable and HashMap.

ConcurrentHashMap is internally divided into segments of size 32. So at max 32 threads can read at a time.
That is not correct. According to the ConcurrentHashMap documentation, "retrieval operations do not entail locking" [emphasis in original]. This means that arbitrarily many threads can read the map at a time. (Note: the above link is to the documentation for the latest version of Java, but the same is true of all versions. The above-quoted statement has been there since the documentation for JDK 1.5, the version where ConcurrentHashMap was introduced.)
Even for updates, which do involve locking, this statement isn't true: ConcurrentHashMap has not been partitioned into "segments" since Java 8. Back when it was partitioned into segments, what mattered for concurrency was the number of segments, not their size; the number of segments was user-configurable; and the default number of segments was 16, not 32.
Also, I would like to know how multiple reads of the same element work in Hashtable and HashMap.
Hashtable.get locks the entire map, so it doesn't matter whether concurrent reads are of the same element or different ones; either way, one locks out the other.
HashMap doesn't worry about concurrency; multiple threads can happily read the map at the same time, but it's not safe to have updates concurrently with reads or other updates.

Related

Multithreaded array of arrays?

I have a data structure which consists of 1,000 array elements, each array element is a smaller array of 8 ints:
std::array<std::array<int, 8>, 1000>
The data structure contains two "pointers", which track the largest and smallest populated array elements (within the "outer", 1000-element array). So for example they might be:
min = 247
max = 842
How can I read and write to this data structure from multiple threads? I am worried about race conditions between pushing/popping elements and maintaining the two "pointers". My basic mode of operation is:
// Pop element from current index
// Calculate new index
// Write element to new index
// Update min and max "pointers"
You are correct that your current algorithm is not thread safe; there are a number of places where contention could occur.
This is impossible to optimize without more information though. You need to know where the slow-down is happening before you can improve it - and for that you need metrics. Profile your code and find out what bits are actually taking the time, because you can only gain by parallelizing those bits and even then you may find that it's actually memory or something else that is the limiting factor, not CPU.
The simplest approach will then be to just lock the entire structure for the full process. This will only work if the threads are doing a lot of other processing as well; if not, you will actually lose performance compared to single threading.
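As a rough illustration, here is a minimal sketch of the coarse-grained version (the struct, the min/max fields and move_element are hypothetical names modelled on your mode of operation):

#include <array>
#include <mutex>

struct Buckets {
    std::array<std::array<int, 8>, 1000> data;
    int min = 0, max = 0;  // the two "pointers"
    std::mutex m;          // one lock guards everything

    void move_element(int from, int to)
    {
        std::lock_guard<std::mutex> lock(m); // pop, write and pointer update form one atomic step
        // ... pop from data[from], write to data[to] ...
        if (to < min) min = to;
        if (to > max) max = to;
    }
};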
After that you can consider having a separate lock for different sections of the data structure. You will need to properly analyse what you are using when and where and work out what would be useful to split. For example you might have chunks of the sub arrays with each chunk having its own lock.
Be careful of deadlocks in this situation though: you might have a thread claim chunk 32 and then want chunk 79 while another thread already holds 79 and then wants 32. Make sure you always claim locks in the same order, or let the standard library order them for you, as sketched below.
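For instance, a minimal sketch using std::scoped_lock (C++17), which acquires several locks with a built-in deadlock-avoidance algorithm; the chunk mutexes here are hypothetical:

#include <mutex>

void move_between_chunks(std::mutex& chunk_a, std::mutex& chunk_b)
{
    std::scoped_lock lock(chunk_a, chunk_b); // acquires both locks without risk of deadlock
    // ... pop from one chunk, push to the other ...
}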
The fastest option (if possible) may even be to give each thread its own copy of the data structure; each processes 1/N of the work, and the results are merged at the end. This way no synchronization is needed at all during processing.
But again it all comes back to the metrics and profiling. This is not a simple problem.

Is there any workaround to "reserve" a cache fraction?

Assume I have to write a C or C++ computational intensive function that has 2 arrays as input and one array as output. If the computation uses the 2 input arrays more often than it updates the output array, I'll end up in a situation where the output array seldom gets cached because it's evicted in order to fetch the 2 input arrays.
I want to reserve one fraction of the cache for the output array and enforce somehow that those lines don't get evicted once they are fetched, in order to always write partial results in the cache.
Update1(output[]);    // Output gets cached
DoCompute1(input1[]); // Input 1 gets cached
DoCompute2(input2[]); // Input 2 gets cached
Update2(output[]); // Output is not in the cache anymore and has to get cached again
...
I know there are mechanisms to help eviction: clflush, clevict, _mm_clevict, etc. Are there any mechanisms for the opposite?
I am thinking of 3 possible solutions:
Using _mm_prefetch from time to time to fetch the data back if it has been evicted. However, this might generate unnecessary traffic, and I need to be very careful about when to introduce the prefetches;
Trying to do processing on smaller chunks of data. However this would work only if the problem allows it;
Disabling hardware prefetchers where that's possible to reduce the rate of unwanted evictions.
Other than that, is there any elegant solution?
Intel CPUs have something called No Eviction Mode (NEM) but I doubt this is what you need.
While you are attempting to optimise away the second (unnecessary) fetch of output[], have you given thought to using SSE2/3/4 registers to store your intermediate output values, updating them when necessary, and writing them back only when all updates related to that part of output[] are done?
I have done something similar while computing FFTs (Fast Fourier Transforms), where part of the output is kept in registers and moved out (to memory) only when it is known it will not be accessed anymore. Until then, all updates happen to the registers. You'll need intrinsics or inline assembly to use the SSE* registers effectively. Of course, such optimisations are highly dependent on the nature of the algorithm and data placement.
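As a hedged sketch of that idea, using SSE intrinsics rather than raw assembly (the multiply-accumulate is a placeholder computation, and n is assumed to be a multiple of 4):

#include <xmmintrin.h>

void update_block(float* output, const float* input1, const float* input2, int n)
{
    __m128 acc = _mm_setzero_ps();               // partial results live in a register
    for (int i = 0; i < n; i += 4) {
        __m128 a = _mm_loadu_ps(input1 + i);
        __m128 b = _mm_loadu_ps(input2 + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(a, b)); // all updates hit the register, not memory
    }
    _mm_storeu_ps(output, acc);                  // output memory is touched only once, at the end
}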
I am trying to get a better understanding of the question:
If it is true that the 'output' array is strictly for output, and you never do something like
output[i] = Foo(newVal, output[i]);
then all elements in output[] are strictly written, never read. If so, all you would ever need to 'reserve' is one cache line. Isn't that correct?
In this scenario, all writes to 'output' generate cache-fills and could compete with the cachelines needed for 'input' arrays.
Wouldn't you want a cap on the cache lines 'output' can consume, as opposed to reserving a certain number of lines?
I see two options, both sketched below, which may or may not work depending on the CPU you are targeting and on your precise program flow:
If output is only written to and not read, you can use streaming-stores, i.e., a write instruction with a no-read hint, so it will not be fetched into cache.
You can use prefetching with a non-temporal (NTA) hint for input. I don't know how this is implemented in general, but I know for sure that on some Intel CPUs (e.g., the Xeon Phi) each hardware thread uses a specific way of the cache for NTA data, i.e., with an 8-way cache, 1/8th per thread.
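A minimal sketch combining both hints (the loop body is a stand-in for the real computation; _mm_stream_ps and _mm_load_ps need 16-byte-aligned pointers, and the prefetch distance is a guess to be tuned):

#include <xmmintrin.h>

void compute(float* output, const float* input, int n)
{
    for (int i = 0; i < n; i += 4) {
        // NTA prefetch: bring input in ahead of use while minimising cache pollution.
        // (Prefetching past the end of the array is harmless; it is only a hint.)
        _mm_prefetch(reinterpret_cast<const char*>(input + i + 64), _MM_HINT_NTA);
        __m128 v = _mm_load_ps(input + i);
        v = _mm_mul_ps(v, v);         // stand-in for the real computation
        _mm_stream_ps(output + i, v); // streaming store: output bypasses the cache
    }
    _mm_sfence();                     // make the streaming stores globally visible
}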
I guess the solution to this is hidden in the algorithm employed, the L1 cache size and the cache line size.
Though I am not sure how much performance improvement we will see with this.
We can probably introduce artificial reads which cleverly dodge the compiler's optimizer and, at run time, do not hurt the computation either. A single artificial read should fill as many cache lines as are needed to accommodate one page. The algorithm should therefore be modified to compute blocks of the output array, something like the blocked matrix multiplication of huge matrices done on GPUs, which uses blocks of matrices for computation and for writing results.
As pointed out earlier, the writes to the output array should happen as streaming stores.
To bring in the artificial reads, we should initialize the output array at the right places ahead of time, once in each block, probably with 0 or 1.

Iterate over a vector using multiple threads (no data sharing or vector modification)

I have a large vector of objects and I just need to iterate over the vector using multiple threads and read the objects (no modification to the data or the vector). What is the most efficient method to do this? Could it be done in a lock-free fashion, maybe using an atomic variable? What is the easiest-to-read implementation of such a multithreaded process?
Edit:
I do not want more than one thread to read the same element of the vector (reading is time-consuming in this case). When one thread is reading an element, I want the next thread to read the first not-yet-read element. For example, when thread 1 is reading object 1, I want thread 2 to read object 2. Whenever one of them is done, it can read object 3, and so on and so forth.
Splitting the input into equal parts is really easy: it uses no locks and causes no memory sharing. So try that, measure how much time each thread needs to complete, and check whether the difference is relevant.
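A minimal sketch of the equal split, where Object and process() are hypothetical stand-ins for your element type and the time-consuming read:

#include <cstddef>
#include <thread>
#include <vector>

struct Object { /* ... your element ... */ };
void process(const Object&); // hypothetical: the time-consuming read of one element

void read_all_static(const std::vector<Object>& v, unsigned num_threads)
{
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&v, t, num_threads] {
            // Thread t owns the contiguous slice [begin, end): no locks, no sharing.
            std::size_t begin = v.size() * t / num_threads;
            std::size_t end = v.size() * (t + 1) / num_threads;
            for (std::size_t i = begin; i != end; ++i)
                process(v[i]);
        });
    for (auto& w : workers)
        w.join();
}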
If the difference is relevant, consider using an array with one atomic<bool> per element. Before reading an element, the thread does compare_exchange_strong on the flag for that element (I think you can even use memory_order_relaxed, but use memory_order_acq_rel at first and only try relaxed if the performance doesn't satisfy you) and actually processes the element only if the exchange succeeds. Otherwise it tries the next element, because someone else is processing or has already processed the current one.
If you can't do that, you can use a single atomic<int> to store the index of the next element to be processed. The threads just use fetch_add (or the postfix ++) to atomically get the next element to process and increment the counter (the considerations for memory ordering are the same as above). If the variance in reading times is high (as determined in step 1), you will have low contention on the atomic variable, so it will perform well.
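A sketch of the shared-counter version, reusing the hypothetical Object and process() from above:

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

void read_all_dynamic(const std::vector<Object>& v, unsigned num_threads)
{
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&] {
            for (;;) {
                std::size_t i = next.fetch_add(1); // atomically claim the next unread index
                if (i >= v.size())
                    break;
                process(v[i]);
            }
        });
    for (auto& w : workers)
        w.join();
}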
If the contention is still too high, and you get a significant slowdown, try to estimate in advance how much time it will take to read an element. If you can, then sort your vector by estimated read time, and make the n-th thread read every n-th element, so that the load will be split more evenly.

What is the appropriate madvise setting for reading a file backwards?

I am using gcc 4.7.2 on a 64-bit Linux box.
I have 20 large sorted binary POD files that I need to read as a part of the final merge in an external merge-sort.
Normally, I would mmap all the files for reading and use a multiset<T,LessThan> to manage the merge sort from small to large, before doing a mmap write out to disk.
However, I realised that if I keep a std::mutex on each of these files, I can create a second thread which reads the file backwards, and sort from large to small at the same time. If I decide, beforehand, that the first thread will take exactly n/2 elements and the second thread will take the rest, I will have no need for a mutex on the output end of things.
Read-lock contention can be expected to occur, on average, maybe 1 time in 20 in this particular case, so that's acceptable.
Now, here's my question. In the first case, it is obvious that I should call madvise with MADV_SEQUENTIAL, but I have no idea what I should do for the second case, where I'm reading the file backwards.
I see no MADV_REVERSE in the man pages. Should I use MADV_NORMAL or maybe don't call madvise at all?
Recall that an external sort is needed when the volume of data is so large that it will not fit into memory. So we are left with a more complex algorithm to use disk as a temporary store. Divide-and-conquer algorithms will usually involve breaking up the data, doing partial sorts, and then merging the partial sorts.
My steps for an external merge-sort
Take n=1 billion random numbers and break them into 20 shards of equal sizes.
Sort each shard individually from small to large, and write each out into its own file.
Open 40 mmaps, 2 for each file, one for going forward and one for going backwards, and associate a mutex with each file.
Instantiate a std::multiset<T,LessThan> buff_fwd; for the forward thread and a std::multiset<T,GreaterThan> buff_rev; for the reverse thread. Some people prefer to use priority queues here, but essentially, any sort-on-insert container will work.
I like to call the two buffers surface and rockbottom, representing the smallest and largest numbers not yet added to the final sort.
Add items from the shards until n/2 is used up, then flush the buffers to one output file using mmap, from the beginning towards the middle in one thread and from the end towards the middle in the other. You can basically flush at will, but at least do it before either buffer uses up too much memory.
I would suggest:
MADV_RANDOM
To prevent useless read-ahead (which is in the wrong direction).
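For what it's worth, a minimal sketch of mapping one shard with that advice (Linux; fd is an already-open shard file of size len):

#include <sys/mman.h>

void* map_for_backward_read(int fd, size_t len)
{
    void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return nullptr;
    madvise(p, len, MADV_RANDOM); // disable read-ahead, which would run in the wrong direction
    return p;
}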

Best Data Structure for this Multithreaded Use Case: Is Intrusive List good?

I need to design a data structure that supports the following operations:
search for an element in the data structure based on a key that is an interval. For example, the value within interval 1-5 may be 3, within 6-11 may be 5, and so on. The intervals are contiguous and there is no overlap between them.
find the previous and next interval - this is almost as frequent as searching for an interval.
split the intervals, join consecutive intervals
concurrency: I have restricted the design to one writer thread and multiple reader threads, as follows. The writer thread can split or join intervals, or modify the value in an interval. Any reader thread reads the value only within one interval (a reader may read multiple intervals, but then I don't have to serialize all the reads - it is okay to have writes in between two reads). There are about 20-80 reads by each reader per write. Further, I still have to decide the number of readers, but it would be around 2-8.
I am considering using a list for adding and deleting elements in the middle. There would only be a limited number of intervals, so probably using a map won't be right. This kind of access (one writer, multiple readers) is not well supported by STL list. boost::intrusive::list seems appropriate. On top of the intrusive list, I will have to acquire locks to read/write the intervals.
Also, I understand intrusive list may be used for better cache locality (along with appropriate memory allocation for the contained objects) than STL list.
Is the approach alright? If yes, I would also be interested to know about your experience with using intrusive::list, particularly for a multithreaded application.
You have 2 different issues here:
How to represent your data structure
How to make it thread safe, in an efficient manner
Your data structure will be doing (20-80) x (2-8) reads for every write.
(1). First, let's assume your range is a data structure as follows:
struct Interval
{
    Interval(int start, int length)
    : m_start(start),
      m_length(length)
    {}

    int m_start;
    int m_length;
    int value; // Or whatever
};
Since reads massively outnumber writes, lookup needs to be fast, while modifications don't.
Using a list for your data structure means O(N) lookups and O(1) modification - exactly the wrong way around.
The simplest possible representation of your structure is a vector. If the intervals are held in sorted order, lookups are O(logN) and modifications are O(N).
To implement this, just add a comparator to Interval:
bool operator<(const Interval& rhs) const
{
    return m_start < rhs.m_start;
}
You can then use std::lower_bound to find, in O(logN), the first interval whose start is not less than your search key; if that interval starts after the key (or the search hit the end), step back one element to get the interval containing the key.
Next and previous interval are O(1) - decrement or increment the returned iterator.
Splitting an interval means inserting a new element after the current one and adjusting the length of the current - O(N).
Joining two intervals means adding the length of the next one to the current one and erasing the next one - O(N).
You should reserve() enough space in the vector for the maximum number of elements to minimise resizing overhead.
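A sketch of the lookup described above, assuming the vector stays sorted and the intervals fully cover the key space (find_interval is a hypothetical helper):

#include <algorithm>
#include <vector>

std::vector<Interval>::iterator find_interval(std::vector<Interval>& intervals, int key)
{
    // First interval whose start is not less than key.
    auto it = std::lower_bound(intervals.begin(), intervals.end(), Interval(key, 0));
    if (it == intervals.end() || it->m_start > key)
        --it; // key falls inside the previous interval (safe because the intervals cover the key space)
    return it;
}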
(2). Following Knuth, 'premature optimisation is the root of all evil'.
A single read/write lock on the structure holding your vector<Interval> will more than likely be sufficient. The only possible problems are (2a) writer starvation because readers monopolise the lock, or (2b) reader starvation because the writer's updates take too long.
(2a) If (and only if) you face writer starvation, you could make the locking more granular. It's extremely likely that this won't be the case, though. To do this:
Make your vector hold its intervals by pointer, not value. This is so the resizes do not move your objects around in memory. Have each interval contain a read/write lock.
For reads:
Take the read lock of the collection, then of the interval you want. If you don't need to read any other intervals, give up the collection lock as soon as you have acquired the interval lock, to allow other threads to proceed.
If you need to read other buckets, you can read-lock them in any order until you give up the collection read lock, at which point the writer may add or remove any intervals you haven't locked. Order doesn't matter while acquiring these locks, since the writer cannot be changing the vector while you hold a read lock on the collection, and read locks do not contend.
For writes:
Take the write lock of the collection, then of the interval you want. Note that you must hold the collection write lock for the entirety of any updates that will add or remove intervals. You can give up the collection lock if you are only updating one interval. Otherwise, you need to hold the write lock and acquire a write lock on any intervals you will be modifying. You can acquire the interval locks in any order, since no readers can acquire new read locks without the collection lock.
The above works by being more selfish to the writer thread, which should eliminate starvation.
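The read path of scheme (2a) might look roughly like this, using C++17 std::shared_mutex; LockedInterval and find_index are hypothetical names:

#include <cstddef>
#include <memory>
#include <shared_mutex>
#include <vector>

struct LockedInterval {
    Interval data;
    std::shared_mutex lock;
};

std::vector<std::unique_ptr<LockedInterval>> intervals; // held by pointer, so resizes don't move them
std::shared_mutex collection_lock;

std::size_t find_index(int key); // hypothetical: binary search over the sorted vector, as sketched earlier

int read_value(int key)
{
    std::shared_lock<std::shared_mutex> coll(collection_lock); // read-lock the collection
    LockedInterval& iv = *intervals[find_index(key)];
    std::shared_lock<std::shared_mutex> item(iv.lock);         // read-lock the interval
    coll.unlock();                                             // let other threads proceed
    return iv.data.value;
}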
(2b) If you face reader starvation, which is even more unlikely, the best solution is to split the collection writes and reads apart. Hold the collection by shared pointer, and have a single write lock on it.
For reads:
Take the write lock and a copy of the shared_ptr. Give up the write lock. The reader can now read the collection without any locks (it's immutable).
For writes:
Take a shared_ptr to the collection as per the reader, giving up the lock. Make a private copy of the collection and modify it (no locks required, since it's a private copy). Take the write lock again and replace the existing shared_ptr with your new collection. The last thread to finish with the old collection will destroy it. All future threads will use the newly updated collection.
Note that this algorithm only works with one writer, as per your problem description.
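A sketch of this copy-on-write scheme (single writer; g_intervals and g_lock are hypothetical names):

#include <memory>
#include <mutex>
#include <vector>

std::shared_ptr<const std::vector<Interval>> g_intervals =
    std::make_shared<std::vector<Interval>>();
std::mutex g_lock;

std::shared_ptr<const std::vector<Interval>> snapshot()
{
    std::lock_guard<std::mutex> guard(g_lock); // brief lock, just to copy the pointer
    return g_intervals;                        // readers then use the snapshot with no locks
}

void writer_update()
{
    auto updated = std::make_shared<std::vector<Interval>>(*snapshot()); // private copy
    // ... modify *updated here; no locks needed, nobody else can see it ...
    std::lock_guard<std::mutex> guard(g_lock);
    g_intervals = std::move(updated); // publish; the old copy dies with its last reader
}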
A concurrent binary tree might be a good fit, allowing reads and writes to different intervals to proceed in parallel.