(I am interested in design of implementation NOT a readymade construct that will do it all.)
Suppose we have a class HashTable (not hash-map implemented as a tree but hash-table)
and say there are eight threads.
Suppose read to write ratio is about 100:1 or even better 1000:1.
Case A) Only one thread is a writer; the others, including the writer, can read from the HashTable (they may simply iterate over the entire hash table).
Case B) All threads are identical and all could read/write.
Can someone suggest the best strategy to make the class thread safe, with the following considerations:
1. Top priority to least lock contention
2. Second priority to least number of locks
My understanding so far is this:
One BIG reader-writer lock (semaphore).
Specialize the semaphore so that there could be eight writer-resource instances for case B, where each writer resource locks one row (or a range, for that matter).
(So I guess 1 + 8 mutexes.)
Please let me know if I am thinking on the correct line, and how could we improve on this solution.
With such high read/write ratios, you should consider a lock free solution, e.g. nbds.
EDIT:
In general, lock free algorithms work as follows:
arrange your data structures such that, for each function you intend to support, there is a point at which you can determine in one atomic operation whether its results are valid (i.e. other threads have not mutated its inputs since they were read) and commit to them; no changes to state are visible to other threads until you commit. This involves leveraging platform-specific primitives such as Win32's atomic compare-and-swap or the Cell's cache-line reservation opcodes.
each supported function becomes a loop that repeatedly reads the inputs and attempts to perform the work, until the commit succeeds.
In cases of very low contention, this is a performance win over locking algorithms since functions mostly succeed the first time through without incurring the overhead of acquiring a lock. As contention increases, the gains become more dubious.
Typically the amount of data it is possible to atomically manipulate is small - 32 or 64 bits is common - so for functions involving many reads and writes, the resulting algorithms become complex and potentially very difficult to reason about. For this reason, it is preferable to look for and adopt a mature, well-tested and well-understood third party lock free solution for your problem in preference to rolling your own.
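As a minimal illustration of the read/attempt/commit loop described above, here is a hedged sketch using C++11 atomics; the shared variable and function names are purely illustrative:

#include <atomic>

std::atomic<int> shared_value{0};   // illustrative shared state

void add(int delta) {
    int observed = shared_value.load();
    // Commit point: the CAS succeeds only if shared_value still equals
    // 'observed'; on failure 'observed' is reloaded and we try again.
    while (!shared_value.compare_exchange_weak(observed, observed + delta)) {
        // retry with the freshly observed value
    }
}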
Hashtable implementation details will depend on various aspects of the hash and table design. Do we expect to be able to grow the table? If so, we need a way to copy bulk data from the old table into the new safely. Do we expect hash collisions? If so, we need some way of walking colliding data. How do we make sure another thread doesn't delete a key/value pair between a lookup returning it and the caller making use of it? Some form of reference counting, perhaps? - but who owns the reference? - or simply copying the value on lookup? - but what if values are large?
Lock-free stacks are well understood and relatively straightforward to implement. To remove an item, get the current top and attempt to replace it with its next pointer until you succeed, then return it. To add an item, get the current top, set it as the new item's next pointer, and retry until you succeed in writing a pointer to the item as the new top. On architectures with reserve/conditional-write semantics this is enough; on architectures only supporting CAS you need to append a nonce or version number to the atomically manipulated data to avoid the ABA problem. Such stacks are one way of keeping track of free space for keys/data in an atomic, lock-free manner, allowing you to reduce a key/value pair - the data actually stored in a hashtable entry - to a pointer/offset or two, a small enough amount to be manipulated using your architecture's atomic instructions. There are others.
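For concreteness, here is a hedged sketch of such a stack using C++11 atomics. It deliberately omits the nonce/version counter, so it remains vulnerable to ABA and use-after-free if nodes are recycled; it is meant only to show the shape of the push/pop CAS loops described above:

#include <atomic>

template <typename T>
class LockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head{nullptr};
public:
    void push(T v) {
        Node* n = new Node{v, head.load()};
        // Retry until we succeed in swinging head from n->next to n.
        while (!head.compare_exchange_weak(n->next, n)) {}
    }
    bool pop(T& out) {
        Node* n = head.load();
        // Retry until the stack is empty or we win the CAS.
        // NOTE: reading n->next here is only safe if nodes are never
        // freed/recycled while another thread may still hold them (ABA).
        while (n && !head.compare_exchange_weak(n, n->next)) {}
        if (!n) return false;
        out = n->value;
        delete n;   // unsafe in general without hazard pointers / versioning
        return true;
    }
};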
Reads then become a case of looking up the entry, checking the kvp against the requested key, doing whatever it takes to make sure the value will remain valid when we return it (taking a copy / increasing its reference count), checking the entry hasn't been modified since we began the read, returning the value if so, undoing any reference count changes and repeating the read if not.
Writes will depend on what we're doing about collisions; in the trivial case, they are simply a case of finding the correct empty slot and writing the new kvp.
The above is greatly simplified and insufficient to produce your own safe implementation, especially if you are not familiar with lock-free/wait-free techniques. Possible complications include the ABA problem, priority inversion, starvation of particular threads; I have not addressed hash collisions.
The nbds page links to an excellent presentation on a real world approach that allows growth / collisions. Others exist, a quick Google finds lots of papers.
Lock free and wait free algorithms are fascinating areas of research; I encourage the reader to Google around. That said, naive lock free implementations can easily look reasonable and behave correctly much of the time while in reality being subtly unsafe. While it is important to have a solid grasp on the principles, I strongly recommend using an existing, well-understood and proven implementation over rolling your own.
You may want to look at Java's ConcurrentHashMap implementation for one possible implementation.
The basic idea is NOT to lock for every read operation but only for writes. Since your question specifically mentions an extremely high read:write ratio, it makes sense to push as much of the overhead as possible onto writes.
The ConcurrentHashMap divides the hashtable into so-called "segments" that are themselves concurrently readable hashtables, and it keeps every single segment in a consistent state to allow traversing without locking.
When reading, you basically have the usual hashmap get(), with the difference that you have to worry about reading stale values, so things like the value of the correct node, the first node of the segment table and next pointers have to be volatile (with C++'s non-existent memory model you probably can't do this portably; C++0x should help here, but I haven't looked at it so far).
When putting a new element in there you get all the overhead, first of all having to lock the given segment. After locking it's basically a usual put() operation, but you have to guarantee atomic writes when updating the next pointer of a node (pointing to the newly created node whose next pointer has to be already correctly pointing to the old next node) or overwriting the value of a node.
When growing the segment, you have to rehash the existing nodes and put them into the new, larger table. The important part is to clone nodes for the new table so as not to influence the old table (by changing their next pointers too early) until the new table is complete and replaces the old one (they use some clever trick there that means they only have to clone about 1/6 of the nodes - nice, but I'm not really sure how they arrive at that number).
Note that garbage collection makes this a whole lot easier because you don't have to worry about the old nodes that weren't reused - as soon as all readers are finished they will automatically be GCed. That's solvable, though; I'm just not sure what the best approach would be.
I hope the basic idea is somewhat clear - obviously there are several points that aren't trivially ported to C++, but it should give you a good idea.
No need to lock the whole table, just have a lock per bucket. That immediately gives parallelism. Inserting a new node into the table requires a lock on the bucket whose head node is about to be modified. New nodes are always added at the head of the bucket's list so that readers can iterate through the nodes without worrying about seeing new nodes.
Each node has a r/w lock; readers iterating acquire a read lock. Node modification requires a write lock.
Iteration does not hold the bucket lock, so if iteration leads to a node removal, the remover must attempt to take the bucket lock; if that attempt fails, it must release its locks and retry, to avoid deadlock, because the lock order is now different.
Brief overview.
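A hedged sketch of the lock-per-bucket layout described above, assuming a fixed bucket count and omitting the per-node r/w locks and removal logic; all names are illustrative:

#include <cstddef>
#include <mutex>

struct HNode { int key; int value; HNode* next; };

struct Bucket {
    std::mutex lock;
    HNode* head = nullptr;
};

const std::size_t kBuckets = 64;          // fixed table size, no growing
Bucket table[kBuckets];

void insert(int key, int value) {
    Bucket& b = table[static_cast<std::size_t>(key) % kBuckets];
    std::lock_guard<std::mutex> guard(b.lock);
    b.head = new HNode{key, value, b.head};   // prepend at the head of the bucket's list
}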
You can try atomic_hashtable for C: https://github.com/Taymindis/atomic_hashtable
It supports read, write, and delete without locking while multiple threads access the data, and it is simple and stable. The API documentation is given in the README.
Related
I have two threads running. They share an array. One of the threads adds new elements to the array (and removes them) and the other uses this array (read operations only).
Is it necessary for me to lock the array before I add/remove to/from it or read from it?
Further details:
I will need to keep iterating over the entire array in the other thread. No write operations over there as previously mentioned. "Just scanning something like a fixed-size circular buffer"
The easy thing to do in such cases is to use a lock. However, locks can be very slow. I did not want to use locks if their use could be avoided. Also, as came out of the discussions, it might not be necessary (it actually isn't) to lock all operations on the array. Just locking the management of an iterator for the array (the count variable that will be used by the other thread) is enough.
I don't think the question is "too broad". If it still comes out to be so, please let me know. I know the question isn't perfect. I had to combine at least 3 answers in order to be able to solve the question - which suggests most people were not able to fully understand all the issues and were forced to do some guess work. But most of it came out through the comments which I have tried to incorporate in the question. The answers helped me solve my problem quite objectively and I think the answers provided here are quite a helpful resource for someone starting out with multithreading.
If two threads perform an operation on the same memory location, and at least one operation is a write operation, you have a so-called data race. According to C11 and C++11, the behaviour of programs with data races is undefined.
So, you have to use some kind of synchronization mechanism, for example:
std::atomic
std::mutex
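A minimal sketch of those two options, with illustrative names; std::atomic suits a single scalar, while std::mutex is needed when several things change together, such as a growing array:

#include <atomic>
#include <cstddef>
#include <mutex>
#include <vector>

std::atomic<int> flag{0};            // enough for a single shared scalar

std::mutex arr_mutex;                // protects compound operations on the array
std::vector<int> arr;

void producer_add(int v) {
    std::lock_guard<std::mutex> guard(arr_mutex);
    arr.push_back(v);                // may reallocate, so readers must lock too
}

bool consumer_read(std::size_t i, int& out) {
    std::lock_guard<std::mutex> guard(arr_mutex);
    if (i >= arr.size()) return false;
    out = arr[i];
    return true;
}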
If you are writing to and reading from the same location from multiple threads, you will need to perform locking or use atomics. We can see this by looking at the C11 draft standard (the C++11 standard looks almost identical; the equivalent section is 1.10), which says the following in section 5.1.2.4, Multi-threaded executions and data races:
Two expression evaluations conflict if one of them modifies a memory
location and the other one reads or modifies the same memory location.
and:
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is not
atomic, and neither happens before the other. Any such data race
results in undefined behavior.
and:
Compiler transformations that introduce assignments to a potentially
shared memory location that would not be modified by the abstract
machine are generally precluded by this standard, since such an
assignment might overwrite another assignment by a different thread in
cases in which an abstract machine execution would not have
encountered a data race. This includes implementations of data member
assignment that overwrite adjacent members in separate memory
locations. We also generally preclude reordering of atomic loads in
cases in which the atomics in question may alias, since this may
violate the "visible sequence" rules.
If you were just adding data to the array then in the C++ world a std::atomic index would be sufficient since you can add more elements and then atomically increment the index. But since you want to grow and shrink the array then you will need to use a mutex, in the C++ world std::lock_guard would be a typical choice.
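A hedged sketch of the append-only case just described, assuming a fixed-capacity array and a single writer; names are illustrative and bounds checking on the writer side is omitted:

#include <atomic>
#include <cstddef>

const std::size_t kCapacity = 1024;   // illustrative fixed capacity
int data[kCapacity];
std::atomic<std::size_t> count{0};    // number of valid elements

void append(int v) {                  // writer thread
    std::size_t i = count.load(std::memory_order_relaxed);
    data[i] = v;                                       // fill the slot first
    count.store(i + 1, std::memory_order_release);     // then publish it
}

bool read(std::size_t i, int& out) {  // reader threads
    if (i >= count.load(std::memory_order_acquire)) return false;
    out = data[i];
    return true;
}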
To answer your question: maybe.
Simply put, the way that the question is framed doesn't provide enough information about whether or not a lock is required.
In most standard use cases, the answer would be yes. And most of the answers here are covering that case pretty well.
I'll cover the other case.
When would you not need a lock given the information you have provided?
There are some other questions here that would help better define whether you need a lock, whether you can use a lock-free synchronization method, or whether or not you can get away with no explicit synchronization.
Will writing data ever be non-atomic? Meaning, will writing data ever result in "torn data"? If your data is a single 32 bit value on an x86 system, and your data is aligned, then you would have a case where writing your data is already atomic. It's safe to assume that if your data is of any size larger than the size of a pointer (4 bytes on x86, 8 on x64), then your writes cannot be atomic without a lock.
Will the size of your array ever change in a way that requires reallocation? If your reader is walking through your data, will the data suddenly be "gone" (memory has been "delete"d)? Unless your reader takes this into account (unlikely), you'll need a lock if reallocation is possible.
When you write data to your array, is it ok if the reader "sees" old data?
If your data can be written atomically, your array won't suddenly not be there, and it's ok for the reader to see old data... then you won't need a lock. Even with those conditions being met, it would be appropriate to use the built in atomic functions for reading and storing. But, that's a case where you wouldn't need a lock :)
Probably safest to use a lock since you were unsure enough to ask this question. But, if you want to play around with the edge case of where you don't need a lock... there you go :)
One of the threads adds new elements to the array [...] and the other [reads] this array
In order to add and remove elements to/from an array, you will need an index that specifies the last place in the array where the valid data is stored. Such an index is necessary, because arrays cannot be resized without potential reallocation (which is a different story altogether). You may also need a second index to mark the initial location from which reading is allowed.
If you have an index or two like this, and assuming that you never re-allocate the array, it is not necessary to lock when you write to the array itself, as long as you lock the writes of valid indexes.
int lastValid = 0;
int shared[MAX];
pthread_mutex_t lastValid_mutex = PTHREAD_MUTEX_INITIALIZER;
...
// Add the new data; readers never look past lastValid, so no lock is needed here
int count = toAddCount;
for (int i = lastValid ; count != 0 ; count--, i++) {
    shared[i] = new_data(...);
}
// Lock the mutex before modifying lastValid.
// The reader must take the same mutex to protect its reads of lastValid.
pthread_mutex_lock(&lastValid_mutex);
lastValid += toAddCount;
pthread_mutex_unlock(&lastValid_mutex);
The reason this works is that when you perform writes to shared[] outside the locked region, the reader does not "look" past the lastValid index. Once the writing is complete, you lock the mutex; the lock/unlock pair acts as a memory barrier, so the writes to shared[] are guaranteed to be visible to a reader (which takes the same mutex to read lastValid) before it is allowed to see the new index value.
Lock? No. But you do need some synchronization mechanism.
What you're describing sounds an awful lot like a "SPSC" (Single Producer Single Consumer) queue, of which there are tons of lock-free implementations out there, including one in Boost.Lockfree.
The general way these work is that underneath the covers you have a circular buffer containing your objects and an index. The writer knows the last index it wrote to; if it needs to write new data it (1) writes to the next slot, (2) updates the index to the previous slot + 1, and then (3) signals the reader. The reader then reads until its index catches up with the writer's index, and then waits for the next signal. Deletes are implicit since new items in the buffer overwrite previous ones.
You need a way to atomically update the index, which is provided by atomic<> and has direct hardware support. You need a way for the writer to signal the reader. You might also need memory fences, depending on the platform, so that steps (1)-(3) occur in order. You don't need anything as heavy as a lock.
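A hedged, minimal sketch of such an SPSC ring buffer using C++11 atomics (Boost.Lockfree's spsc_queue is a production-quality version of this idea); the reader-signalling part is omitted:

#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>
class SpscRing {
    T buf[N];
    std::atomic<std::size_t> writeIdx{0};   // only the producer modifies this
    std::atomic<std::size_t> readIdx{0};    // only the consumer modifies this
public:
    bool push(const T& v) {                 // producer thread only
        std::size_t w = writeIdx.load(std::memory_order_relaxed);
        std::size_t next = (w + 1) % N;
        if (next == readIdx.load(std::memory_order_acquire))
            return false;                   // full
        buf[w] = v;                                       // (1) write the slot
        writeIdx.store(next, std::memory_order_release);  // (2) publish the index
        return true;                        // (3) signalling the reader is not shown
    }
    bool pop(T& out) {                      // consumer thread only
        std::size_t r = readIdx.load(std::memory_order_relaxed);
        if (r == writeIdx.load(std::memory_order_acquire))
            return false;                   // empty
        out = buf[r];
        readIdx.store((r + 1) % N, std::memory_order_release);
        return true;
    }
};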
"Classical" POSIX would indeed need a lock for such a situation, but this is overkill. You just have to ensure that the reads and writes are atomic. C and C++ have that in the language since their 2011 versions of their standards. Compilers start to implement it, at least the latest versions of Clang and GCC have it.
It depends. One situation where it could go wrong is if you are removing an item in one thread and then reading the last item by its index in the read thread; that read would then be out of bounds.
As far as I know, this is exactly the use case for a lock. Two threads that access one array concurrently must ensure that one thread has finished its work before the other reads.
Thread B might read unfinished data if thread A has not finished its work.
If it's a fixed-size array, and you don't need to communicate anything extra like indices written/updated, then you can avoid mutual exclusion with the caveat that the reader may see:
no updates at all
If your memory ordering is relaxed enough that this happens, you need a store fence in the writer and a load fence in the consumer to fix it
partial writes
if the stored type is not atomic on your platform (int generally should be)
or your values are un-aligned, and especially if they may span cache lines
This is all dependent on your platform though - hardware, OS and compiler can all affect it. You haven't told us what they are.
The portable C++11 solution is to use an array of atomic<int>. You still need to decide what memory ordering constraints you require, and what that means for correctness and performance on your platform.
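For illustration, a minimal sketch of that approach; the release/acquire ordering shown is an assumption about what you need, not a universal answer:

#include <atomic>
#include <cstddef>

const std::size_t kSlots = 256;
std::atomic<int> slots[kSlots];      // static storage, so it starts zero-initialized

void write_slot(std::size_t i, int v) {       // writer thread
    slots[i].store(v, std::memory_order_release);
}

int read_slot(std::size_t i) {                // reader thread
    return slots[i].load(std::memory_order_acquire);
}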
If you use e.g. a vector for your array (so that it can grow dynamically), then reallocation may occur during the writes, and you lose.
If you use data entries larger than what can always be written and read atomically (virtually any complex data type), you lose.
If the compiler / optimizer decides to keep certain things in registers (such as the counter holding the number of valid entries in the array) during some operations, you lose.
Or even if the compiler / optimizer decides to switch order of execution for your array element assignments and counter increments/decrements, you lose.
So you certainly do need some sort of synchronization. What the best way is (for example, it may be worthwhile to lock only parts of the array) depends on your specifics (how often and in what pattern the threads access the array).
I'm working on some code that has a global array that can be accessed by two threads for reading and writing purposes.
There will be no batch processing where a range of indexes are read or written, so I'm trying to figure out if I should lock the entire array or only the array index I am currently using.
The easiest solution would be to consider the array a critical section and put a big fat lock around it, but can I avoid this and just lock an index?
Cheers.
Locking one index implies that you can keep track of which thread is accessing what part of the array. Keeping track of this information, which is shared between the reading and the writing thread, implies that you have one lock around this information. So, you still end up with a global lock.
In this situation, I think that the most efficient approaches are:
- using a reader/writer lock
- or dividing the big array into a few subsets, each subset using a distinct lock.
If this is C++, I suggest you use STL containers: std::vector or something else that suits your job. They are fast, easy to use, and have no memory leaks.
If you want to do it all by yourself, then of course one method is to use a single mutex (which is bad).
Or you can use a reader-writer lock for the whole array.
I don't think it's feasible to make each element of an array thread safe with its own lock - that would eat your memory. Check the link; there are three solutions with different outcomes. Test them out and use the best one for your case. (Don't just assume "OK, I think my program needs the readers-preference algorithm"; try it in your system and decide, because we really can't assume such things sometimes.)
There is no way of knowing what will be optimal unless you profile under realistic running conditions. I would suggest implementing an array-like class, where you can lock a varying number of elements in groups. Then you fine-tune the size of these groups.
Another option would be to enqueue all read/write operations using an active object. This would make all access sequential, and means you could use a non-concurrent array type to store the data. It would require some sort of concurrent queue data structure under the hood.
std::shared_ptr needs to allocate a control block on the heap which holds the reference count. There was another approach I learnt from http://ootips.org/yonat/4dev/smart-pointers.html which keeps all the references in a doubly linked list. It doesn't need additional allocations nor a counter but the reference object itself is larger.
Is there a benchmark or any clear reason showing one implementation is better than the others?
The standard does in theory allow a linked list to be used, but because copying a shared_ptr must be threadsafe it would be harder to implement that with a linked list. The list would need to be protected by a mutex (or be a lockfree list, which is much harder) so that every time a shared_ptr is copied or goes out of scope the list can be safely modified by multiple threads.
It's much simpler and in general more efficient to do reference counting using atomic types and use atomic operations for the ref count updates.
Edit: To answer the comment below, just using atomic pointers to implement the linked list wouldn't be enough. To add or remove a node from the list (which correspond to increasing and decreasing the use_count) you would need to update two pointers atomically, to adjust the links in the nodes before and after the one being added/removed. std::atomic<T*> allows you to update a single pointer atomically, but doesn't help if you need to update two such objects atomically. Since those two pointers live in separate nodes they're not adjacent so even a quadword CAS won't help.
Alternatives are a mutex protecting the whole list (obviously bad for contention) or a mutex per list node where you lock the mutex of each node involved in any update, which uses more memory and affects up to three nodes at a time, i.e. requires locking three mutexes. If you have use_count() less than or equal to five then copying/destroying any one shared_ptr contends with copying/destroying any of the other instances that shares ownership of the same pointer. You might get less contention with high use counts where most updates are to non-neighbour nodes distant from each other, but probably not in the general case. Plenty of programs use shared_ptr with use counts in single digits. Even when use counts are high and there's no contention on any nodes, you still have to lock three mutexes and create/destroy a list node (possibly requiring a heap allocation) and update the pointers in its neighbouring nodes, so an atomic increment/decrement is much simpler and could still be faster despite the contention on the atomic integers.
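For comparison, a hedged sketch of the counter-based approach: a heap-allocated control block whose count is adjusted with atomic operations, which is roughly what mainstream shared_ptr implementations do (real ones also keep a separate weak count, among other details):

#include <atomic>

struct ControlBlock {
    std::atomic<long> use_count{1};   // one reference exists when the block is created
};

void add_ref(ControlBlock* cb) {
    cb->use_count.fetch_add(1, std::memory_order_relaxed);
}

template <typename T>
void release(ControlBlock* cb, T* managed) {
    // acquire/release ensures the deleting thread sees all writes made by
    // threads that previously held a reference.
    if (cb->use_count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete managed;
        delete cb;
    }
}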
Last time I mentioned on the committee reflector that shared_ptr isn't required to use ref counts and could use a list I got the replies:
Does anyone actually do that, given that the Standard recognizes multithreading now?
and (in reference to the thread-safety requirements):
It's much harder to make that work (efficiently) for a reference-linked implementation. I'm not even sure that it's possible, though it may be.
I am looking for a method to implement lock-free queue data structure that supports single producer, and multiple consumers. I have looked at the classic method by Maged Michael and Michael Scott (1996) but their version uses linked lists. I would like an implementation that makes use of bounded circular buffer. Something that uses atomic variables?
On a side note, I am not sure why these classic methods are designed for linked lists that require a lot of dynamic memory management. In a multi-threaded program, all memory management routines are serialized. Aren't we defeating the benefits of lock-free methods by using them in conjunction with dynamic data structures?
I am trying to code this in C/C++ using the pthread library on an Intel 64-bit architecture.
Thank you,
Shirish
The use of a circular buffer makes a lock necessary, since blocking is needed to prevent the head from going past the tail. But otherwise the head and tail pointers can easily be updated atomically. Or in some cases the buffer can be so large that overwriting is not an issue. (in real life you will see this in automated trading systems, with circular buffers sized to hold X minutes of market data. If you are X minutes behind, you have wayyyy worse problems than overwriting your buffer).
When I implemented the MS queue in C++, I built a lock free allocator using a stack, which is very easy to implement. If I have MSQueue then at compile time I know sizeof(MSQueue::node). Then I make a stack of N buffers of the required size. The N can grow, i.e. if pop() returns null, it is easy to go ask the heap for more blocks, and these are pushed onto the stack. Outside of the possibly blocking call for more memory, this is a lock free operation.
Note that the T cannot have a non-trivial dtor. I worked on a version that did allow for non-trivial dtors, and it actually worked. But I found that it was easier just to make the T a pointer to the T that I wanted, where the producer released ownership and the consumer acquired ownership. This of course requires that the T itself is allocated using lock-free methods, but the same allocator I made with the stack works here as well.
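A hedged sketch of such a free-list allocator built on a lock-free stack; as with any Treiber-style stack it needs ABA protection (version counters or similar) in production, which is omitted here:

#include <atomic>

struct FreeBlock { FreeBlock* next; };

class BlockPool {
    std::atomic<FreeBlock*> top{nullptr};
public:
    void give(void* mem) {               // push a block (also used to seed the pool);
        // mem must be at least sizeof(FreeBlock) bytes and suitably aligned
        FreeBlock* b = static_cast<FreeBlock*>(mem);
        b->next = top.load();
        while (!top.compare_exchange_weak(b->next, b)) {}
    }
    void* take() {                       // pop a block, or nullptr if the pool is empty
        FreeBlock* b = top.load();
        while (b && !top.compare_exchange_weak(b, b->next)) {}
        return b;                        // caller asks the heap for more blocks on nullptr
    }
};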
In any case, the point of lock-free programming is not that the data structures themselves are faster. The points are these:
lock free makes me independent of the scheduler. Lock-based programming depends on the scheduler to make sure that the holders of a lock are running so that they can release the lock; this is what causes "priority inversion". On Linux there are some lock attributes (priority inheritance) to make sure this happens.
If I am independent of the scheduler, the OS has a far easier time managing timeslices, and I get far less context switching
it is easier to write correct multithreaded programs using lock-free methods since I don't have to worry about deadlock, livelock, scheduling, synchronization, etc. This is especially true with shared-memory implementations, where a process could die while holding a lock in shared memory, and there is no way to release the lock.
lock free methods are far easier to scale. In fact, I have implemented lock free methods using messaging over a network. Distributed locks like this are a nightmare
That said, there are many cases where lock-based methods are preferable and/or required
when updating things that are expensive or impossible to copy. Most lock-free methods use some sort of versioning: make a copy of the object, update the copy, check whether the shared version is still the same as when you copied it, and if so make your updated copy the current version. Else copy it again, apply the update, and check again. Keep doing this until it works. This is fine when the objects are small, but if they are large, or contain file handles, etc., it is not recommended.
Most types are impossible to access in a lock-free way, e.g. any STL container. These have invariants that require non-atomic access, for example assert(vector.size() == vector.end() - vector.begin()). So if you are updating/reading a vector that is shared, you have to lock it.
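To make the versioning idea above concrete, here is a hedged sketch of the copy/update/CAS loop for a small object published through an atomic pointer; note that safe reclamation of the old version is exactly the part it does not solve:

#include <atomic>

struct Config { int a; int b; };          // small, trivially copyable shared object

std::atomic<Config*> current{new Config{0, 0}};

void update_a(int newA) {
    Config* expected = current.load();
    Config* fresh = new Config;
    do {
        *fresh = *expected;               // copy the shared version
        fresh->a = newA;                  // apply the update to the copy
        // The CAS fails (and reloads 'expected') if another thread published
        // a different version in the meantime; then we copy and retry.
    } while (!current.compare_exchange_weak(expected, fresh));
    // NOTE: the old version pointed to by 'expected' is leaked here; reclaiming
    // it safely (hazard pointers, RCU, ...) is the hard part in real code.
}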
This is an old question, but no one has provided an accepted solution. So I offer this info for others who may be searching.
This website: http://www.1024cores.net
Provides some really useful lockfree/waitfree data structures with thorough explanations.
What you are seeking is a lock-free solution to the reader/writer problem.
See: http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem
For a traditional one-block circular buffer I think this simply cannot be done safely with atomic operations. You need to do so much in one read. Suppose you have a structure that has this:
struct CircularBuffer {
    uint8_t* buf;
    unsigned int size;    // Actual max. buffer size
    unsigned int length;  // Actual stored data length (the writer keeps this from exceeding size)
    unsigned int offset;  // Start of current stored data
};
On a read you need to do the following (this is how I implemented it anyway, you can swap some steps like I'll discuss afterwards):
Check if the read length does not surpass stored length
Check if the offset+read length do not surpass buffer boundaries
Read data out
Increase offset, decrease length
What must you certainly do synchronised (so atomically) to make this work? Combine steps 1 and 4 into one atomic step; or, to clarify, do the following synchronised:
check read_length; this can be something like read_length = min(read_length, length)
decrease length by read_length: length -= read_length
get a local copy of offset: unsigned int local_offset = offset
increase offset by read_length: offset += read_length
Afterwards you can just do a memcpy (or whatever) starting from your local_offset, check if your read goes over circular buffer size (split in 2 memcpy's), ... . This is 'quite' threadsafe, your write method could still write over the memory you're reading, so make sure your buffer is really large enough to minimize that possibility.
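A hedged sketch of that read path, assuming the structure above plus one extra mutex guarding the index fields; the function name and mutex are illustrative, and the offset is wrapped with a modulo here:

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <mutex>

std::mutex index_mutex;   // assumed addition: guards 'length' and 'offset'

unsigned int ring_read(CircularBuffer& cb, uint8_t* dst, unsigned int read_length) {
    unsigned int local_offset;
    {
        std::lock_guard<std::mutex> guard(index_mutex);
        read_length = std::min(read_length, cb.length);    // step 1
        cb.length -= read_length;                           // step 4 (length part)
        local_offset = cb.offset;
        cb.offset = (cb.offset + read_length) % cb.size;    // step 4 (offset part, wrapped)
    }
    // Steps 2-3 outside the lock: split into two memcpy's at the wrap-around.
    unsigned int first = std::min(read_length, cb.size - local_offset);
    std::memcpy(dst, cb.buf + local_offset, first);
    std::memcpy(dst + first, cb.buf, read_length - first);
    return read_length;
}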
Now, while I can imagine you can combine 3 and 4 (I guess that's what they do in the linked-list case) or even 1 and 2 in atomic operations, I cannot see you do this whole deal in one atomic operation :).
You can however try to drop the 'length' checking if your consumers are very smart and will always know what to read. You'd then also need a new woffset variable, because the old method of (offset+length)%size to determine the write offset wouldn't work anymore. Note this is close to the case of a linked list, where you actually always read one element (= fixed, known size) from the list. Also here, if you make it a circular linked list, you can read too much or write to a position you're reading at that moment!
Finally: my advice is to just go with locks. I use a CircularBuffer class (completely safe for reading & writing) for a realtime 720p60 video streamer and I have had no speed issues at all from locking.
This is an old question, but no one has provided an answer that precisely answers it. Given that it still comes up high in search results for (nearly) the same question, there should be an answer, given that one exists.
There may be more than one solution, but here is one that has an implementation:
https://github.com/tudinfse/FFQ
The conference paper referenced in the readme details the algorithm.
I'm trying to design a simple cache that follows the following rules:
Entries have a unique key
When the number of entries in the cache exceeds a certain limit, the older items are evicted (to keep the cache from getting too large).
Each entry's data is immutable until the entry is removed from the cache.
A 'reader' can access an entry in the cache and the entry must be valid for the lifetime of the reader.
Each reader can be on its own thread, and all readers access the same cache instance.
Thread safety with this cache is important, since we don't want readers holding a reference to an entry, only to have it evicted by another thread somewhere else.
Hence, my current implementation just copies out the whole entry when reading from the cache. This is fine for smaller objects, but once objects get too large there's too much copying going on. It's also not so great with large numbers of readers that are accessing the same cached entry.
Since the data is immutable, it would be great if every reader to the same message could just hold a reference instead of a copy, but in some thread safe manner (so it wouldn't get evicted).
A previous implementation used reference counting to achieve this...but it's very tricky with threads, and I went with this simpler approach.
Are there any other patterns/ideas I could use to improve this design?
In a native system without a higher power (such as a VM) capable of performing garbage collection, you aren't going to do much better, performance- or complexity-wise, than reference counting.
You are correct that the reference counting can be tricky - not only do the increment and decrement have to be atomic, but you need to ensure that the object can't be deleted out from under you before you are able to increment it. Thus, if you store the reference counter inside the object, you'll have to somehow avoid the race between the time you read the pointer to the object out of the cache and the time you manage to increment its reference count.
If your structure is a standard container, which is not already thread-safe, you will also have to protect the container from unsupported concurrent access. This protection can dovetail nicely with avoiding the reference counting race condition described above - if you use a read-writer lock to protect the structure, combined with atomic increments of the in-object reference counter while still holding the reader lock, you'll be protected from anyone deleting the object out from under you before you get the reference count, since such mutators must be "writers".
Here, objects can be evicted from the cache while still having a positive reference count - they will be destroyed when the last outstanding reference is dropped (by your smart pointer class). This is typically considered a feature, since it means that at least some object can always be removed from the cache, but it also has the downside that there is no strict upper bound on the number of objects "alive" in memory, since the reference counting allows objects to stay alive even after they've left the cache. Whether this is acceptable to you depends on your requirements and details such as how long other threads may hold references to objects.
If you don't have access to (non-standard) atomic increment routines, you can use a mutex to do the atomic increment/decrement, although this may increase the cost significantly in both time and per-object space.
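A hedged sketch of the reader-lock-plus-atomic-count lookup described above, using C++14's std::shared_timed_mutex (any reader/writer lock would do); all names are illustrative:

#include <atomic>
#include <shared_mutex>
#include <string>
#include <unordered_map>

struct Entry {
    std::atomic<int> refs{1};        // the cache itself holds one reference
    std::string data;                // treated as immutable once inserted
};

std::shared_timed_mutex cache_lock; // reader/writer lock protecting the map
std::unordered_map<std::string, Entry*> cache;

Entry* acquire(const std::string& key) {          // reader threads
    std::shared_lock<std::shared_timed_mutex> guard(cache_lock);
    auto it = cache.find(key);
    if (it == cache.end()) return nullptr;
    // Safe: eviction needs the exclusive lock, so the entry can't vanish
    // between the lookup and this increment.
    it->second->refs.fetch_add(1, std::memory_order_relaxed);
    return it->second;               // caller must call release() when done
}

void release(Entry* e) {
    if (e->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete e;                    // last outstanding reference gone
}

void evict(const std::string& key) {              // a "writer"
    Entry* victim = nullptr;
    {
        std::unique_lock<std::shared_timed_mutex> guard(cache_lock);
        auto it = cache.find(key);
        if (it == cache.end()) return;
        victim = it->second;
        cache.erase(it);
    }
    release(victim);                 // drop the cache's own reference
}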
If you want to get more exotic (and faster) you'll need to design a container which is itself threadsafe, and come up with a more complex reference counting mechanism. For example, you may be able to create a hash table where the primary bucket array is never re-allocated, so can be accessed without locking. Furthermore, you can use non-portable double-wide CAS (compare and swap) operations on that array to both read a pointer and increment a reference count adjacent to it (128 bits of stuff on a 64-bit arch), allowing you to avoid the race mentioned above.
A completely different track would be to implement some kind of "delayed safe delete" strategy. Here you avoid reference counting entirely. You remove references from your cache but do not delete objects immediately, since other threads may still hold pointers to them. Then, later, at some "safe" time, you delete the object. Of course, the trick is to discover when such a safe time exists. Basic strategies involve each thread signaling when it "enters" and "leaves" a danger zone during which it may access the cache and hold references to contained objects. Once all threads that were in the danger zone when an object was removed from the cache have left the danger zone, you can free the object while being sure that no more references are held.
How practical this is depends on whether you have logical "enter" and "leave" points in your application (many request-oriented applications will), and whether the "enter" and "leave" costs can be amortized across many cache accesses. The upside is no reference counting! Of course, you still need a thread-safe container.
You can find references to many academic papers on the topic and some practical performance considerations by examining the papers linked here.
I think you effectively want a reader/writer lock per entry. Readers read lock and unlock as they are using it. The eviction thread has to obtain a write lock (which forces all readers to complete before it can be acquired). There needs to be some way for a reader to then tell (before acquiring a read lock) whether the entry in question has been evicted concurrently.
On the downside, one lock per entry is expensive for a big cache (in terms of memory). You can address that by using one lock across a set of entries - this trades off memory vs. concurrency. You need to be careful about deadlock scenarios in that case.
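A hedged sketch of that lock-striping idea: hash the key to one of a small, fixed number of reader/writer locks instead of one lock per entry (the stripe count and names are illustrative):

#include <array>
#include <cstddef>
#include <functional>
#include <shared_mutex>
#include <string>

const std::size_t kStripes = 16;     // tune: more stripes = more concurrency, more memory
std::array<std::shared_timed_mutex, kStripes> stripes;

std::shared_timed_mutex& stripe_for(const std::string& key) {
    return stripes[std::hash<std::string>{}(key) % kStripes];
}

// Readers:  std::shared_lock<std::shared_timed_mutex> l(stripe_for(key));
// Evictors: std::unique_lock<std::shared_timed_mutex> l(stripe_for(key));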
Sounds like a monitor with a std::map as the buffer would be useful in this situation.
I'm thinking that if you want to share a reference, you'll need to keep a count. So long as you use the interlocked inc/dec functions, this should be simple enough even for multiple threads.
It seems to me that the reference counting solution is only tricky in that the updating/testing for eviction of said reference counters must be inside a critical section protected by a mutex. As long as only one thread accesses the reference counters at a time, it should be thread safe.
Have a circular queue, and don't allow multiple threads to write to it or the cache will be useless. Each thread should have its own cache, maybe with read access to the other caches but not write access.