Concurrency and slice iteration

I have a read-only slice of objects of a certain type.
A lot of concurrent goroutines will iterate over this slice to find a certain object in it and use it.
This slice is strictly read-only: it is written to only once (when the application is launched) and then never again.
Will concurrent goroutines interfere with each other when iterating over the slice? (Do I need to apply a read-lock?)

To the best of my knowledge, as long as the slice is read-only, there's no problem. Reading a slice does not mutate it.

Related

iterate over a vector using multiple threads (no data sharing, or vector modification)

I have a large vector of objects and I just need to iterate over the vector using multiple threads and read the objects (no modification to the data or the vector). What is the most efficient method to do this? Could it be done in a lock-free fashion, maybe using an atomic variable? What is the easiest-to-read implementation of such a multithreaded process?
Edit:
I do not want more than one thread to read the same element of the vector (reading is time-consuming in this case). When one thread is reading an element, I want the next thread to read the first not-yet-read element. For example, when thread 1 is reading object 1, I want thread 2 to read object 2. Whenever one of them is done, it can read object 3, and so on.

1. Splitting the input in equal parts is really, really easy; it doesn't use locks and doesn't cause memory sharing. So try that, measure how much time each thread needs to complete, and check whether the difference is relevant.
2. If the difference is relevant, consider using an array of one atomic<bool> per element. Before reading an element, the thread does a compare_exchange_strong on the flag related to that element (I think you can even use memory_order_relaxed, but use memory_order_acq_rel at first, and only try relaxed if the performance doesn't satisfy you) and only actually processes the element if the exchange succeeds. Otherwise it tries the next element, because someone is processing or has already processed the current one.
3. If you can't do that, you can use a single atomic<int> to store the index of the next element to be processed. The threads just use fetch_add or the postfix ++ to atomically get the next element to process and increment the counter (the considerations for the memory ordering are the same as above). If the variance in reading times is high (as determined by step 1), you will have low contention on the atomic variable, so it will perform well.
4. If the contention is still too high and you get a significant slowdown, try to estimate in advance how much time it will take to read an element. If you can, sort your vector by estimated read time and make the n-th thread read every n-th element, so that the load is split more evenly.
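To make the single atomic counter idea above a bit more concrete, here is a minimal sketch. The function name process_each_once and its parameters are made up for illustration, and it assumes the vector is fully built before the threads start and that the callback is safe to call from several threads at once. It uses the default (sequentially consistent) ordering for simplicity.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: every element of `items` is handed to `process`
// exactly once, by whichever worker thread claims its index first.
template <typename T, typename Fn>
void process_each_once(const std::vector<T>& items, unsigned num_threads, Fn process) {
    assert(num_threads > 0);
    std::atomic<std::size_t> next{0};  // index of the next unclaimed element
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&items, &next, &process] {
            for (;;) {
                // fetch_add returns the old value, so each index is claimed once.
                const std::size_t i = next.fetch_add(1);
                if (i >= items.size()) break;
                process(items[i]);  // read-only access to the shared vector
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

A slow element only delays the thread that claimed it; the other threads keep pulling new indices from the counter.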

Multithreading - In an array what should I protect?

I'm working on some code that has a global array that can be accessed by two threads for reading and writing.
There will be no batch processing where a range of indexes is read or written, so I'm trying to figure out whether I should lock the entire array or only the array index I am currently using.
The easiest solution would be to treat the array as a critical section and put a big fat lock around it, but can I avoid this and just lock an index?
Cheers.

Locking one index implies that you can keep track of which thread is accessing what part of the array. Keeping track of this information, which is shared between the reading and the writing thread, implies that you have one lock around this information. So, you still end up with a global lock.
In this situation, I think that the most efficient approaches are:
- using a reader/writer lock
- or dividing the big array into a few subsets, each subset using a distinct lock.
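A rough sketch of the second option (one lock per subset of the array) could look like this. The class name StripedArray, the element type int and the stripe count of 16 are assumptions chosen for the example, not something from the question; the right number of stripes would have to be found by measuring.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical fixed-size array where index i is guarded by lock (i % kStripes).
class StripedArray {
public:
    explicit StripedArray(std::size_t n) : data_(n, 0) {}

    int read(std::size_t i) {
        assert(i < data_.size());
        std::lock_guard<std::mutex> guard(lock_for(i));
        return data_[i];
    }

    void write(std::size_t i, int value) {
        assert(i < data_.size());
        std::lock_guard<std::mutex> guard(lock_for(i));
        data_[i] = value;
    }

private:
    static constexpr std::size_t kStripes = 16;  // assumed value, tune by profiling

    std::mutex& lock_for(std::size_t i) { return locks_[i % kStripes]; }

    std::vector<int> data_;                   // never resized after construction
    std::array<std::mutex, kStripes> locks_;  // one lock per subset of indexes
};
```

Two threads only block each other when their indexes map to the same stripe, and the memory cost is 16 mutexes instead of one per element.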

If this is C++, I suggest you use STL containers: std::vector or whatever else suits your job. They are fast, easy to use, and don't leak memory.
If you want to do it all by yourself, then of course one method is to use a single mutex (which is bad, because it serializes every access),
or you can use a reader-writer lock for the whole array.
I don't think it's feasible to make each element of an array thread-safe with its own lock; that would eat your memory. Check the link: there are 3 solutions with different outcomes. Test them out and use the best one for your case. (Don't think "OK, I think my program needs the readers-preference algorithm." Try it in your system and then decide, because we really can't assume such things sometimes.)

There is no way of knowing what will be optimal unless you profile under realistic running conditions. I would suggest implementing an array-like class, where you can lock a varying number of elements in groups. Then you fine-tune the size of these groups.
Another option would be to enqueue all read/write operations using an active object. This would make all access sequential, and means you could use a non-concurrent array type to store the data. It would require some sort of concurrent queue data structure under the hood.
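For what it's worth, a bare-bones active object might be sketched like this. The name ActiveArray, the int element type and the use of std::future for reads are my own choices for the example; the "concurrent queue" here is just a std::queue guarded by a mutex and a condition variable.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical active object: only the worker thread ever touches the elements.
class ActiveArray {
public:
    explicit ActiveArray(std::size_t n) : data_(n, 0), worker_([this] { run(); }) {}

    ~ActiveArray() {
        post([this] { done_ = true; });  // runs after all earlier requests
        worker_.join();
    }

    void write(std::size_t i, int value) {
        assert(i < data_.size());
        post([this, i, value] { data_[i] = value; });
    }

    std::future<int> read(std::size_t i) {
        assert(i < data_.size());
        auto result = std::make_shared<std::promise<int>>();
        std::future<int> f = result->get_future();
        post([this, i, result] { result->set_value(data_[i]); });
        return f;
    }

private:
    void post(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> guard(mu_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

    void run() {
        while (!done_) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return !jobs_.empty(); });
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // executed outside the queue lock, one request at a time
        }
    }

    std::vector<int> data_;  // plain, non-concurrent storage (never resized)
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool done_ = false;   // only read and written on the worker thread
    std::thread worker_;  // declared last so the members above exist first
};
```

Callers would use it as a.write(2, 7) followed by a.read(2).get(). Requests are processed in the order they were queued, so a read posted after a write sees that write.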

Optimal strategy to make a C++ hash table, thread safe

(I am interested in design of implementation NOT a readymade construct that will do it all.)
Suppose we have a class HashTable (not hash-map implemented as a tree but hash-table)
and say there are eight threads.
Suppose read to write ratio is about 100:1 or even better 1000:1.
Case A) Only one thread is a writer, and the others, including the writer, can read from the HashTable (they may simply iterate over the entire hash table).
Case B) All threads are identical and all can read and write.
Can someone suggest the best strategy to make the class thread safe, with the following considerations:
1. Top priority to least lock contention
2. Second priority to least number of locks
My understanding so far is this:
One BIG reader-writer lock (semaphore).
Specialize the semaphore so that there can be eight writer-resource instances for case B, where each writer resource locks one row (or a range, for that matter).
(So I guess 1+8 mutexes.)
Please let me know if I am thinking along the right lines, and how we could improve on this solution.

With such high read/write ratios, you should consider a lock free solution, e.g. nbds.
EDIT:
In general, lock free algorithms work as follows:
- Arrange your data structures such that for each function you intend to support there is a point at which you are able to, in one atomic operation, determine whether its results are valid (i.e. other threads have not mutated its inputs since they have been read) and commit to them, with no changes to state visible to other threads unless you commit. This will involve leveraging platform-specific functions such as Win32's atomic compare-and-swap or Cell's cache line reservation opcodes.
- Each supported function becomes a loop that repeatedly reads the inputs and attempts to perform the work, until the commit succeeds.
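The smallest instance of this read, compute, commit-or-retry pattern that I can think of is an atomic "raise to maximum" on a single integer. The sketch below uses the standard std::atomic compare-and-swap rather than a platform-specific call; the function name update_max is made up for the example.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically raise `target` to at least `candidate`.
// Returns true if this call changed the value.
inline bool update_max(std::atomic<int>& target, int candidate) {
    int seen = target.load();  // read the input
    while (seen < candidate) {
        // Commit only if nobody changed `target` since we read it.
        // On failure, compare_exchange_weak stores the current value in `seen`
        // and we go around the loop again.
        if (target.compare_exchange_weak(seen, candidate)) return true;
    }
    return false;  // the stored value was already >= candidate
}
```

Under low contention the loop body runs once; under heavy contention it may retry several times, which is where the gains over a lock start to shrink.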
In cases of very low contention, this is a performance win over locking algorithms since functions mostly succeed the first time through without incurring the overhead of acquiring a lock. As contention increases, the gains become more dubious.
Typically the amount of data it is possible to atomically manipulate is small - 32 or 64 bits is common - so for functions involving many reads and writes, the resulting algorithms become complex and potentially very difficult to reason about. For this reason, it is preferable to look for and adopt a mature, well-tested and well-understood third party lock free solution for your problem in preference to rolling your own.
Hashtable implementation details will depend on various aspects of the hash and table design. Do we expect to be able to grow the table? If so, we need a way to copy bulk data from the old table into the new safely. Do we expect hash collisions? If so, we need some way of walking colliding data. How do we make sure another thread doesn't delete a key/value pair between a lookup returning it and the caller making use of it? Some form of reference counting, perhaps? - but who owns the reference? - or simply copying the value on lookup? - but what if values are large?
Lock-free stacks are well understood and relatively straightforward to implement (to remove an item from the stack, get the current top, attempt to replace it with its next pointer until you succeed, return it; to add an item, get the current top and set it as the item's next pointer, until you succeed in writing a pointer to the item as the new top; on architectures with reserve/conditional write semantics, this is enough, on architectures only supporting CAS you need to append a nonce or version number to the atomically manipulated data to avoid the ABA problem). They are one way of keeping track of free space for keys/data in an atomic lock free manner, allowing you to reduce a key/value pair - the data actually stored in a hashtable entry - to a pointer/offset or two, a small enough amount to be manipulated using your architecture's atomic instructions. There are others.
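As an illustration only, here is a sketch of the push half of such a stack using std::atomic. The class name LockFreeStack is hypothetical. Popping a single item is deliberately left out, because on CAS-only hardware it needs the version-number trick and a safe memory reclamation scheme mentioned above; instead the sketch detaches the whole list in one atomic exchange, which does not suffer from the ABA problem.

```cpp
#include <atomic>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical lock-free stack: push() is the CAS loop described in the text,
// pop_all() takes every item at once (newest first).
template <typename T>
class LockFreeStack {
public:
    ~LockFreeStack() { pop_all(); }

    void push(T value) {
        Node* n = new Node{std::move(value), top_.load(std::memory_order_relaxed)};
        // Retry until we manage to swing top_ from n->next to n.
        // On failure, compare_exchange_weak refreshes n->next with the current top.
        while (!top_.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }

    std::vector<T> pop_all() {
        // Detach the whole list; after this no other thread can reach the nodes.
        Node* n = top_.exchange(nullptr, std::memory_order_acquire);
        std::vector<T> out;
        while (n != nullptr) {
            out.push_back(std::move(n->value));
            Node* next = n->next;
            delete n;
            n = next;
        }
        return out;
    }

private:
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> top_{nullptr};
};
```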
Reads then become a case of looking up the entry, checking the kvp against the requested key, doing whatever it takes to make sure the value will remain valid when we return it (taking a copy / increasing its reference count), checking the entry hasn't been modified since we began the read, returning the value if so, undoing any reference count changes and repeating the read if not.
Writes will depend on what we're doing about collisions; in the trivial case, they are simply a case of finding the correct empty slot and writing the new kvp.
The above is greatly simplified and insufficient to produce your own safe implementation, especially if you are not familiar with lock-free/wait-free techniques. Possible complications include the ABA problem, priority inversion, starvation of particular threads; I have not addressed hash collisions.
The nbds page links to an excellent presentation on a real world approach that allows growth / collisions. Others exist, a quick Google finds lots of papers.
Lock free and wait free algorithms are fascinating areas of research; I encourage the reader to Google around. That said, naive lock free implementations can easily look reasonable and behave correctly much of the time while in reality being subtly unsafe. While it is important to have a solid grasp on the principles, I strongly recommend using an existing, well-understood and proven implementation over rolling your own.

You may want to look at Java's ConcurrentHashMap implementation for one possible implementation.
The basic idea is NOT to lock for every read operation but only for writes. Since in your interview they specifically mentioned an extremely high read:write ratio it makes sense trying to stuff as much overhead as possible into writes.
The ConcurrentHashMap divides the hashtable into so called "Segments" that are themselves concurrently readable hashtables and keep every single segment in a consistent state to allow traversing without locking.
When reading you basically have the usual hashmap get() with the difference that you have to worry about reading stale values, so things like the value of the correct node, the first node of the segment table and next pointers have to be volatile (before C++11, C++ had no memory model, so you could not do this portably; in C++11 and later, std::atomic provides the equivalent guarantees).
When putting a new element in there you get all the overhead, first of all having to lock the given segment. After locking it's basically a usual put() operation, but you have to guarantee atomic writes when updating the next pointer of a node (pointing to the newly created node whose next pointer has to be already correctly pointing to the old next node) or overwriting the value of a node.
When growing the segment, you have to rehash the existing nodes and put them into the new, larger table. The important part is to clone nodes for the new table so as not to influence the old table (by changing their next pointers too early) until the new table is complete and replaces the old one (they use a clever trick there that means they only have to clone about 1/6 of the nodes; that's nice, but I'm not really sure how they arrive at that number).
Note that garbage collection makes this a whole lot easier because you don't have to worry about the old nodes that weren't reused - as soon as all readers are finished they will automatically be GCed. That's solvable though, but I'm not sure what the best approach would be.
I hope the basic idea is somewhat clear - obviously there are several points that aren't trivially ported to c++, but it should give you a good idea.

No need to lock the whole table, just have a lock per bucket. That immediately gives parallelism. Inserting a new node into the table requires a lock on the bucket whose head node is about to be modified. New nodes are always added at the head of the bucket's list so that readers can iterate through the nodes without worrying about seeing new nodes.
Each node has a r/w lock; readers iterating get a read lock. Node modification requires a write lock.
If a thread that is iterating without the bucket lock decides to remove a node, it must try to take the bucket lock; if that attempt fails, it has to release the locks it holds and retry, because it is acquiring locks in a different order and could otherwise deadlock.
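A much simplified sketch of the lock-per-bucket part is below. It is not the full scheme: readers also take the bucket lock here, so there are no per-node r/w locks and no retry logic. The class name BucketLockedMap, the int keys, the std::string values and the default of 64 buckets are assumptions for the example, and the table never grows.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical fixed-size hash table with one mutex per bucket.
class BucketLockedMap {
public:
    explicit BucketLockedMap(std::size_t bucket_count = 64) : buckets_(bucket_count) {
        assert(bucket_count > 0);
    }

    void put(int key, std::string value) {
        Bucket& b = buckets_[index_of(key)];
        std::lock_guard<std::mutex> guard(b.mu);  // locks this bucket only
        for (auto& kv : b.items) {
            if (kv.first == key) { kv.second = std::move(value); return; }
        }
        b.items.emplace_front(key, std::move(value));  // new nodes go at the head
    }

    bool get(int key, std::string& out) const {
        const Bucket& b = buckets_[index_of(key)];
        std::lock_guard<std::mutex> guard(b.mu);
        for (const auto& kv : b.items) {
            if (kv.first == key) { out = kv.second; return true; }
        }
        return false;
    }

private:
    struct Bucket {
        mutable std::mutex mu;
        std::forward_list<std::pair<int, std::string>> items;
    };

    std::size_t index_of(int key) const {
        return std::hash<int>{}(key) % buckets_.size();
    }

    std::vector<Bucket> buckets_;  // sized once, never reallocated
};
```

Operations on keys that hash to different buckets never contend with each other; how much that helps depends on the bucket count and the key distribution.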
Brief overview.

You can try atomic_hashtable for C: https://github.com/Taymindis/atomic_hashtable
It supports read, write, and delete without locking while multiple threads access the data.
The API documentation is in the README.

c++ multithread optimization

In my code I have 2/4 threads performing Monte Carlo simulations. Each of them runs a number of experiments, and they all collect the results into an STL vector.
My question is this: suppose each thread runs 1000 experiments sequentially. Is it better to store the results into the shared vector one at a time, or every once in a while? If the threads wait until they have a sizeable amount of data, writing into the vector will take longer, so I'm not sure whether the second solution is necessarily better than the first one.
PS: each experiment is a numerical computation, so there are no I/O operations.
Thanks

If you are going to wait until all the results are computed before you use any of the results, preallocate space for 4,000 results in the vector and have each thread write into one range of elements in the vector. No locking is required because no two threads access the same element in the vector.
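A minimal sketch of that preallocation scheme follows. The function name run_experiments and the callback signature (it receives the global slot index and returns a double) are assumptions for the example; the callback must be safe to call from several threads.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical driver: thread t writes only slots [t * per_thread, (t + 1) * per_thread).
template <typename Fn>
std::vector<double> run_experiments(unsigned num_threads, std::size_t per_thread, Fn experiment) {
    assert(num_threads > 0);
    std::vector<double> results(num_threads * per_thread);  // sized up front, never resized
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&results, &experiment, t, per_thread] {
            for (std::size_t k = 0; k < per_thread; ++k) {
                const std::size_t slot = t * per_thread + k;
                results[slot] = experiment(slot);  // no lock: slots are disjoint
            }
        });
    }
    for (auto& w : workers) w.join();  // results are only read after all joins
    return results;
}
```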
If you want to use the results as they are computed, use some sort of a concurrent queue data structure instead of a vector.

If you're only putting 2000 to 4000 elements in the vector I doubt it would make much of a difference either way.
Do whatever is most natural for the algorithm. If that doesn't work well enough look into doing it the other way.
After thinking about it for a bit, it might serve both purposes (simplicity and speed) to have each thread store results to a local vector then copy the contents of the local vector to the 'global' vector (protected by a lock) when the thread is done. Of course, that's as long as whatever's waiting for the results can wait until a thread is fully finished before getting an update.
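Something along these lines, as a sketch. The function name run_and_merge and the callback signature (thread index and experiment index in, double out) are made up for the example. Note that the order of the per-thread blocks in the merged vector depends on which thread finishes first.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical driver: each thread fills a private vector, then appends it
// to the shared vector in one locked operation.
template <typename Fn>
std::vector<double> run_and_merge(unsigned num_threads, std::size_t per_thread, Fn experiment) {
    assert(num_threads > 0);
    std::vector<double> all;
    all.reserve(num_threads * per_thread);
    std::mutex all_mutex;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&all, &all_mutex, &experiment, t, per_thread] {
            std::vector<double> local;  // private to this thread, no locking needed
            local.reserve(per_thread);
            for (std::size_t k = 0; k < per_thread; ++k) {
                local.push_back(experiment(t, k));
            }
            std::lock_guard<std::mutex> guard(all_mutex);  // taken once per thread
            all.insert(all.end(), local.begin(), local.end());
        });
    }
    for (auto& w : workers) w.join();
    return all;
}
```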

A singly linked list may be a better choice than a vector here.
If there is only one thread reading from and one thread writing to a FIFO, you don't need any locks. The trick is to always keep at least one 'dummy' element in the list; the FIFO is empty if head == tail. The head and tail pointers can be manipulated for push and pop in such a way that no locking is needed. (In C++11 and later, any pointer that both threads touch must still be a std::atomic, otherwise the accesses are a data race.)
Using this, you can make several queues that need no locking.
If new/delete is taking time, you can have queues that hold reusable elements.
Best of luck.
Remember: exactly one reader and exactly one writer per queue, no more, no less.
The trick is to create a LOT of queues like this, including queues to recycle objects, and you will not need any locking primitives.
If your queues do run empty, all you need is a sleep()/wakeup() mechanism.
And in case I haven't already said it: exactly one reader and exactly one writer.
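Here is how I would sketch such a queue in C++11. The class name SpscQueue is hypothetical. The link between nodes is a std::atomic with release/acquire ordering, so there are no locks but the hand-over from writer to reader is still well defined. The empty test is "the dummy node has no successor", which is equivalent to head == tail without the reader having to look at the writer's tail pointer.

```cpp
#include <atomic>
#include <cassert>
#include <utility>

// Hypothetical single-producer/single-consumer FIFO with a dummy head node.
// push() may be called by exactly one thread, pop() by exactly one other thread.
template <typename T>
class SpscQueue {
public:
    SpscQueue() : head_(new Node), tail_(head_) {}

    ~SpscQueue() {
        while (head_ != nullptr) {
            Node* next = head_->next.load();
            delete head_;
            head_ = next;
        }
    }

    // Producer side only.
    void push(T value) {
        Node* n = new Node;
        n->value = std::move(value);
        tail_->next.store(n, std::memory_order_release);  // publish the node
        tail_ = n;
    }

    // Consumer side only. Returns false if the queue is empty.
    bool pop(T& out) {
        Node* next = head_->next.load(std::memory_order_acquire);
        if (next == nullptr) return false;  // only the dummy is left
        out = std::move(next->value);
        delete head_;   // the old dummy is no longer reachable by the producer
        head_ = next;   // the node we just read becomes the new dummy
        return true;
    }

private:
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
    };
    Node* head_;  // touched only by the consumer (and constructor/destructor)
    Node* tail_;  // touched only by the producer (and constructor)
};
```

With a second producer or a second consumer this breaks immediately, which is exactly the restriction stressed above.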

Design options for references into a thread safe cache when evicting older entries

I'm trying to design a simple cache that obeys the following rules:
- Entries have a unique key.
- When the number of entries in the cache exceeds a certain limit, the older items are evicted (to keep the cache from getting too large).
- Each entry's data is immutable until the entry is removed from the cache.
- A 'reader' can access an entry in the cache and the entry must be valid for the lifetime of the reader.
- Each reader can be on its own thread, and all readers access the same cache instance.
Thread safety with this cache is important, since we don't want readers holding a reference to an entry, only to have it evicted by another thread somewhere else.
Hence, my current implementation just copies out the whole entry when reading from the cache. This is fine for smaller objects, but once objects get too large there's too much copying going on. It's also not so great with large numbers of readers that are accessing the same cached entry.
Since the data is immutable, it would be great if every reader to the same message could just hold a reference instead of a copy, but in some thread safe manner (so it wouldn't get evicted).
A previous implementation used reference counting to achieve this...but it's very tricky with threads, and I went with this simpler approach.
Are there any other patterns/ideas I could use to improve this design?

In a native system without a higher power (such as a VM) capable of performing garbage collection, you aren't going to do much better performance or complexity wise than reference counting.
You are correct that reference counting can be tricky - not only do the increment and decrement have to be atomic, but you need to ensure that the object can't be deleted out from under you before you are able to increment its counter. Thus, if you store the reference counter inside the object, you'll have to somehow avoid the race that occurs between the time you read the pointer to the object out of the cache and the time you manage to increment the counter.
If your structure is a standard container, which is not already thread-safe, you will also have to protect the container from unsupported concurrent access. This protection can dovetail nicely with avoiding the reference counting race condition described above - if you use a reader-writer lock to protect the structure, combined with atomic increments of the in-object reference counter while still holding the reader lock, you'll be protected from anyone deleting the object out from under you before you get the reference count, since such mutators must be "writers".
Here, objects can be evicted from the cache while still having a positive reference count - they will be destroyed when the last outstanding reference is dropped (by your smart pointer class). This is typically considered a feature, since it means that at least some object can always be removed from the cache, but it also has the downside that there is no strict upper bound on the number of objects "alive" in memory, since the reference counting allows objects to stay alive even after they've left the cache. Whether this is acceptable to you depends on your requirements and details such as how long other threads may hold references to objects.
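In C++11 and later, one way to sketch this combination is to let std::shared_ptr do the atomic reference counting and to copy the pointer while holding the container lock. This is a simplification of what is described above: it uses a plain std::mutex instead of a reader-writer lock, and the class name SharedEntryCache, the std::string keys and values, and the oldest-first eviction policy are all assumptions for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache: readers get a shared_ptr to an immutable entry, so an
// entry stays alive for as long as any reader holds it, even after eviction.
class SharedEntryCache {
public:
    using Entry = std::shared_ptr<const std::string>;

    explicit SharedEntryCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Inserts a new entry, evicting the oldest one if the cache is full.
    // Returns false (and changes nothing) if the key is already present.
    bool put(const std::string& key, std::string value) {
        std::lock_guard<std::mutex> guard(mu_);
        if (entries_.count(key) != 0) return false;
        if (entries_.size() >= capacity_ && !insertion_order_.empty()) {
            entries_.erase(insertion_order_.front());  // drops the cache's reference only
            insertion_order_.pop_front();
        }
        insertion_order_.push_back(key);
        entries_.emplace(key, std::make_shared<const std::string>(std::move(value)));
        return true;
    }

    // Returns an empty pointer if the key is not (or no longer) in the cache.
    Entry get(const std::string& key) const {
        std::lock_guard<std::mutex> guard(mu_);
        auto it = entries_.find(key);
        if (it == entries_.end()) return nullptr;
        return it->second;  // reference count is bumped while the lock is held
    }

private:
    const std::size_t capacity_;
    mutable std::mutex mu_;
    std::unordered_map<std::string, Entry> entries_;
    std::deque<std::string> insertion_order_;  // oldest key at the front
};
```

A reader that fetched an entry before it was evicted can keep using it; the string is destroyed when the last shared_ptr to it goes away, which is also why the number of live entries is not strictly bounded by the capacity.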
If you don't have access to (non-standard) atomic increment routines, you can use a mutex to do the atomic increment/decrement, although this may increase the cost significantly in both time and per-object space.
If you want to get more exotic (and faster) you'll need to design a container which is itself threadsafe, and come up with a more complex reference counting mechanism. For example, you may be able to create a hash table where the primary bucket array is never re-allocated, so can be accessed without locking. Furthermore, you can use non-portable double-wide CAS (compare and swap) operations on that array to both read a pointer and increment a reference count adjacent to it (128 bits of stuff on a 64-bit arch), allowing you to avoid the race mentioned above.
A completely different track would be to implement some kind of "delayed safe delete" strategy. Here you avoid reference counting entirely. You remove references from your cache, but do not delete objects immediately, since other threads may still hold pointers to the object. Then later, at some "safe" time, you delete the object. Of course, the trick is to discover when such a safe time exists. Basic strategies involve each thread signaling when it "enters" and "leaves" a danger zone during which it may access the cache and hold references to contained objects. Once all threads which were in the danger zone when an object was removed from the cache have left the danger zone, you can free the object while being sure that no more references are held.
How practical this is depends on whether you have logical "enter" and "leave" points in your application (many request-oriented applications will), and whether the "enter" and "leave" costs can be amortized across many cache accesses. The upside is no reference counting! Of course, you still need a thread-safe container.
You can find references to many academic papers on the topic and some practical performance considerations by examining the papers linked here.

I think you effectively want a reader/writer lock per entry. Readers read lock and unlock as they are using it. The eviction thread has to obtain a write lock (which forces all readers to complete before it can be acquired). There needs to be some way for a reader to then tell (before acquiring a read lock) whether the entry in question has been evicted concurrently.
On the downside, one lock per entry is expensive for a big cache (in terms of memory). You can address that by sharing one lock across a set of entries - this trades memory against concurrency. You need to be careful about deadlock scenarios in that case.

Sounds like a monitor with a std::map as the buffer would be useful in this situation.

I'm thinking that if you want to share a reference, you'll need to keep a count. So long as you use the interlocked inc/dec functions, this should be simple enough even for multiple threads.

It seems to me that the reference counting solution is only tricky in that the updating/testing for eviction of said reference counters must be inside a critical section protected by a mutex. So long as more than one process doesn't access the reference counters at a time it should be thread safe.

Have a circular queue, and don't allow multiple threads to write to it or the cache will be useless. Each thread should have its own cache, maybe with read access to the other caches but not write access.
