I was thinking about reference counting based on atomic integers that would be safe from overflow. How can it be done?
Please let's not focus on whether such overflow is a realistic problem or not. The task itself got my interest even if not practically important.
Example
An example implementation of reference counting is shown in the Boost.Atomic documentation. Based on that example we can extract the following sample code:
#include <boost/atomic.hpp>
#include <boost/cstdint.hpp>

struct T
{
    boost::atomic<boost::uintmax_t> counter;
};

void add_reference(T* ptr)
{
    // Relaxed is enough: a new reference can only be formed from an existing one.
    ptr->counter.fetch_add(1, boost::memory_order_relaxed);
}

void release_reference(T* ptr)
{
    if (ptr->counter.fetch_sub(1, boost::memory_order_release) == 1) {
        // Pair the release above with an acquire before the delete.
        boost::atomic_thread_fence(boost::memory_order_acquire);
        delete ptr;
    }
}
In addition, the following explanation is given:
Increasing the reference counter can always be done with memory_order_relaxed: New references to an object can only be formed from an existing reference, and passing an existing reference from one thread to another must already provide any required synchronization.
It is important to enforce that any possible access to the object in one thread (through an existing reference) happens before deleting the object in a different thread. This is achieved by a "release" operation after dropping a reference (any access to the object through this reference must obviously have happened before), and an "acquire" operation before deleting the object.
It would be possible to use memory_order_acq_rel for the fetch_sub operation, but this results in unneeded "acquire" operations when the reference counter does not yet reach zero and may impose a performance penalty.
EDIT >>>
It seems that the Boost.Atomic documentation might be wrong here. The acq_rel ordering might be needed after all.
At least such is the implementation of boost::shared_ptr when done using std::atomic (there are other implementations as well). See file boost/smart_ptr/detail/sp_counted_base_std_atomic.hpp.
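A minimal sketch of that variant (a paraphrase of the idea, not the actual Boost source):

void release_reference_acq_rel(T* ptr)
{
    // acq_rel on the decrement itself replaces the separate acquire fence.
    if (ptr->counter.fetch_sub(1, boost::memory_order_acq_rel) == 1) {
        delete ptr;
    }
}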
Also, Herb Sutter mentions it in his lecture C++ and Beyond 2012: Herb Sutter - atomic<> Weapons, 2 of 2 (the reference counting part starts at 1:19:51). He also seems to discourage the use of standalone fences in this talk.
Thanks to user 2501 for pointing that out in comments below.
<<< END EDIT
Initial attempts
Now the problem is that add_reference as written could (at some point) overflow, and it would do so silently. This obviously could lead to problems when a matched release_reference is called that prematurely destroys the object. (Provided that add_reference is then called once again so the counter reaches 1.)
I was thinking how to make add_reference detect overflow and fail gracefully without risking anything.
Comparing the result to 0 once fetch_add returns will not do, since between the two operations some other thread could call add_reference again (reaching 1) and then release_reference (erroneously destroying the object in the process).
Checking first (with load) will not help either. This way some other thread could add its own reference between our calls to load and fetch_add.
Is this the solution?
Then I thought that maybe we could start with a load, but only if we then follow it with a compare_exchange.
So first we do a load and get a local value. If it is std::numeric_limits<boost::uintmax_t>::max() then we fail: add_reference cannot add another reference, as all possible ones are already taken.
Otherwise we make another local value, which is the previous local reference count plus 1.
And now we do a compare_exchange, providing as the expected value the original local reference count (this ensures that no other thread modified the reference count in the meantime) and as the desired value the incremented local reference count.
Since compare_exchange can fail, we have to do this (including the load) in a loop, until it succeeds or the max value is detected.
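A minimal sketch of that loop (the name try_add_reference is mine; it reuses T from the Boost example above and needs <limits>):

#include <limits>

bool try_add_reference(T* ptr)
{
    boost::uintmax_t current = ptr->counter.load(boost::memory_order_relaxed);
    do {
        if (current == std::numeric_limits<boost::uintmax_t>::max())
            return false; // all possible references are already taken
        // On failure, compare_exchange_weak reloads `current`, so the next
        // iteration retries against a fresh value.
    } while (!ptr->counter.compare_exchange_weak(current, current + 1,
                                                 boost::memory_order_relaxed));
    return true;
}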
Some questions
Is such solution correct?
What memory ordering would be required to make it valid?
Which compare_exchange should be used? _weak or _strong?
Would it affect release_reference function?
Is it used in practice?
The solution is correct, though it could be improved in one respect. Currently, if the value reaches max on the local CPU, it can be decreased by another CPU while the current CPU still caches the old value. It would be worth doing a dummy compare_exchange with the same expected and desired value to confirm the max is still there, and only then throw an exception (or whatever you want).
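A sketch of that refinement (the helper name and the exception type are just illustrations):

#include <stdexcept>

void fail_if_really_at_max(T* ptr)
{
    boost::uintmax_t expected = std::numeric_limits<boost::uintmax_t>::max();
    // A CAS that "replaces" max with max: it succeeds only if the counter is
    // really still at max, forcing a fresh read rather than trusting a
    // possibly stale cached value.
    if (ptr->counter.compare_exchange_strong(expected, expected,
                                             boost::memory_order_relaxed))
        throw std::overflow_error("reference count at maximum");
    // Otherwise `expected` now holds the fresh value and the increment loop
    // can be retried.
}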
For the rest:
It doesn't matter whether you use _weak or _strong, as it will run in a loop anyway and therefore the next load will quite reliably get the latest value.
For add_reference and release_reference: who would then check whether the reference was really added or not? Would it throw an exception? If so, it would probably work. But generally it's better to design such low-level operations so they cannot fail, and instead use uintptr_t for the reference counter, so it could never overflow: it's big enough to cover the address space and therefore any number of objects existing at the same time.
No, it's not used in practice for the above reasons.
Quick math: say uint is 32 bits, so max uint is 4G (4 billion something). Each reference/pointer is at least 4 bytes (8 if you are on a 64-bit system), so to overflow you need 16 GB of memory dedicated to storing references pointing to the same object, which should point to a serious design flaw.
I would say it's not a problem today, nor in the foreseeable future.
This question is moot. Even assuming an atomic increment takes 1 CPU cycle (it does not!), on a 4GHz CPU it would take roughly 150 years to wrap around a 64-bit integer (2^64 increments at 4*10^9 per second is about 4.6*10^9 seconds), provided the CPU does nothing else but keep incrementing.
Taking into account realities of an actual program, I find it hard to believe this is a real issue which can pester you.
Related
Is it possible to perform atomic and non-atomic ops on the same memory location?
I ask not because I actually want to do this, but because I'm trying to understand the C11/C++11 memory model. They define a "data race" like so:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
-- C11 §5.1.2.4 p25, C++11 §1.10 p21
It's the "at least one of which is not atomic" part that is troubling me. If it weren't possible to mix atomic and non-atomic ops, it would just say "on an object which is not atomic."
I can't see any straightforward way of performing non-atomic operations on atomic variables. std::atomic<T> in C++ doesn't define any operations with non-atomic semantics. In C, all direct reads/writes of an atomic variable appear to be translated into atomic operations.
I suppose memcpy() and other direct memory operations might be a way of performing a non-atomic read/write on an atomic variable? i.e. memcpy(&atomicvar, othermem, sizeof(atomicvar))? But is this even defined behavior? In C++, std::atomic is not copyable, so would it be defined behavior to memcpy() it in C or C++?
Initialization of an atomic variable (whether through a constructor or atomic_init()) is defined to not be atomic. But this is a one-time operation: you're not allowed to initialize an atomic variable a second time. Placement new or an explicit destructor call would also not be atomic. But in all of these cases, it doesn't seem like it would be defined behavior anyway to have a concurrent atomic operation that might be operating on an uninitialized value.
Performing atomic operations on non-atomic variables seems totally impossible: neither C nor C++ define any atomic functions that can operate on non-atomic variables.
So what is the story here? Is it really about memcpy(), or initialization/destruction, or something else?
I think you're overlooking another case, the reverse order. Consider an initialized int whose storage is reused to create an std::atomic_int. All atomic operations happen after its ctor finishes, and therefore on initialized memory. But any concurrent, non-atomic access to the now-overwritten int has to be barred as well.
(I'm assuming here that the storage lifetime is sufficient and plays no role)
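A hypothetical sketch of that scenario, just to make the ordering concrete:

#include <atomic>
#include <new>

alignas(std::atomic<int>) int plain = 42; // the older, non-atomic object

void reuse_storage()
{
    // int is trivially destructible, so its lifetime can be ended simply by
    // reusing its storage for a new object.
    std::atomic<int>* a = new (&plain) std::atomic<int>(0);
    a->store(1, std::memory_order_relaxed); // fine after construction...
    // ...but a concurrent non-atomic read of `plain` in another thread would
    // conflict with both the placement-new and the store above.
}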
I'm not entirely sure because I think that the second access to int would be invalid anyway as the type of the accessing expression int doesn't match the object's type at the time (std::atomic<int>). However, "the object's type at the time" assumes a single linear time progression which doesn't hold in a multi-threaded environment. C++11 in general has that solved by making such assumptions about "the global state" Undefined Behavior per se, and the rule from the question appears to fit in that framework.
So perhaps rephrasing: if a single memory location contains an atomic object as well as a non-atomic object, and if the destruction of the earliest created (older) object is not sequenced-before the creation of the other (newer) object, then access to the older object conflicts with access to the newer object unless the former happens-before the latter.
disclaimer: I am not a parallelism guru.
Is it possible to mix atomic/non-atomic ops on the same memory, and if so, how?
You can write it in the code and it will compile, but it will probably yield undefined behaviour.
When talking about atomics, it is important to understand what kinds of problems they solve.
As you might know, what we call "memory" in short is a multi-layered set of entities which are capable of holding data:
first we have the RAM, then the cache lines, then the registers.
On single-core processors we don't have any synchronization problem. On multi-core processors we have all of them: every core has its own set of registers and cache lines.
This causes a few problems.
The first of them is memory reordering: the CPU may decide at runtime to scramble some reading/writing instructions to make the code run faster. This may yield some strange results that are completely invisible in the high-level code that produced this set of instructions. The most classic example of this phenomenon is the "two threads - two integers" example:
int i=0;
int j=0;
thread a -> i=1, then print j
thread b -> j=1 then print i;
Logically, the result "00" cannot happen: if a finishes first the result may be "01", if b finishes first the result may be "10", and if both finish at the same time the result may be "11". Yet if you build a small program which imitates this situation and run it in a loop, very quickly you will see the result "00".
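A minimal sketch of such a test program (deliberately racy and formally undefined behaviour, written only to imitate the situation described; a real run typically wraps it in a loop):

#include <cstdio>
#include <thread>

int i = 0;
int j = 0;

int main()
{
    std::thread a([] { i = 1; std::printf("%d", j); });
    std::thread b([] { j = 1; std::printf("%d", i); });
    a.join();
    b.join();
    std::printf("\n");
}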
Another problem is memory invisibility. As I mentioned before, a variable's value may be cached in one of the cache lines, or be stored in one of the registers. When the CPU updates a variable's value, it may delay writing the new value back to the RAM. It may keep the value in the cache/register because it was told (by the compiler optimizations) that the value will be updated again soon, so in order to make the program faster it updates the value again and only then writes it back to the RAM. This may cause undefined behaviour if another CPU (and consequently another thread or process) depends on the new value.
For example, look at this pseudo code:
bool b = true;
while (b) -> print 'a'
new thread -> sleep 4 seconds -> b=false;
The character 'a' may be printed infinitely, because b may be cached and never updated.
there are many more problems when dealing with paralelism.
Atomics solve these kinds of issues by (in a nutshell) telling the compiler/CPU how to read and write data to/from the RAM correctly without doing unwanted scrambling (read about memory orders). A memory order may force the CPU to write its values back to the RAM, or to read the values from the RAM even if they are cached.
So, although you can mix non-atomic actions with atomic ones, you are only doing part of the job.
For example, let's go back to the second example:
atomic bool b = true;
while (reload b) print 'a'
new thread - > b = (non atomicly) false.
So although one thread re-reads the value of b from the RAM again and again, the other thread may never write false back to the RAM.
So although you can mix these kinds of operations in the code, it will yield undefined behavior.
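For contrast, a sketch where both sides use the atomic, so the store is guaranteed to eventually become visible to the reading loop:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<bool> b{true};

int main()
{
    std::thread t([] {
        std::this_thread::sleep_for(std::chrono::seconds(4));
        b.store(false); // atomic store: will become visible to the reader
    });
    while (b.load()) // atomic load: re-reads the shared value each time
        std::putchar('a');
    t.join();
}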
I'm interested in this topic since I have code in which sometimes I need to access a range of addresses serially, and at other times to access the same addresses in parallel with some way of managing contention.
So not exactly the situation posed by the original question which (I think) implies concurrent, or nearly so, atomic and non atomic operations in parallel code, but close.
I have managed by some devious casting to persuade my C11 compiler to allow me to access an integer and, much more usefully, a pointer both atomically and non-atomically ("directly"), having established that both types are officially lock-free on my x86_64 system. My, possibly simplistic, interpretation of that is that the sizes of the atomic and non-atomic types are the same and that the hardware can update such types in a single operation.
I definitely would not attempt to mix both types of access to an address in a parallel context; I think that would be doomed to fail. However, I have been successful in using "direct" syntax operations in serial code and "atomic" syntax in parallel code, giving me the best of both worlds of the fastest possible access (and much simpler syntax) in serial, and safely managed contention when in parallel.
So you can do it so long as you don't try to mix both methods in parallel code and you stick to using lock-free types, which probably means up to the size of a pointer.
In C++, I feel like I've always been led to believe that things like var++ and var-- are reasonably thread-safe - AKA - you have a guarantee that your value will increase or decrease at some point in time.
It was upon this belief that I built my understanding of non-blocking algorithms for thread-safe operations.
Yet, today I'm in shock, as I have a variable that is not getting incremented - and am therefore questioning the validity of a large amount of my past work.
In a small program, I have a global variable initialized at 0.
Eight pthreads are started up, each of which calls varname++ a total of 1024 times, totalling 8*1024 increments.
Yet after all threads have finished executing, the value of varname is significantly less than 8*1024.
Did I miss the boat here?
Can someone please enlighten me?
What exactly would lead you to the belief those were thread-safe? In general they are not. The reason for that is simple: arithmetic is generally done on registers, so var++ might be transformed into something like the following:
load var to R1
inc R1
store R1 to var
If another thread modifies var between the load and the store you will obviously lose that update. In reality this problem will be even worse, since the compiler can decide to keep the variable in a register for as long as it wants (well, for as long as it can prove that var isn't accessed through any pointers to it in the same thread).
Having multiple threads access the same variable is defined to be a data race (and therefore undefined behaviour) by the (C++11) standard, unless none of the threads modifies the variable (if all do read accesses only, you are fine).
For thread-safe operations, you need to use either locking (e.g. std::mutex in C++11) or atomic operations. If you use C++11, you can use std::atomic<int> (or whatever type your counter is) as the type for var to get thread-safe modifications. (Arithmetic) operations on std::atomic<T> (like the increment and decrement operators) are guaranteed to be thread-safe by the standard.
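A minimal C++11 sketch of the questioner's program with an atomic counter (thread and iteration counts taken from the question):

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> varname{0}; // atomic counter instead of a plain int

int main()
{
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t)
        threads.emplace_back([] {
            for (int n = 0; n < 1024; ++n)
                ++varname; // atomic read-modify-write: no lost updates
        });
    for (auto& th : threads)
        th.join();
    std::printf("%d\n", varname.load()); // always prints 8192
}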
C++'s pre- and post-increment operators are not thread-safe.
Similar question here: I've heard i++ isn't thread safe, is ++i thread-safe?
Yes, you have missed something--namely that your reads and writes are not atomic. So a number of threads could read the value, then increment it, then write it back, and if those operations are all happening "in parallel", the value will only be incremented by one.
You can fix this using C++11's (or Boost's) std::atomic type wrapper, described here:
http://en.cppreference.com/w/cpp/atomic/atomic
I have the following situation (caused by a defect in the code):
There's a shared variable of primitive type (let it be int) that is initialized during program startup from strictly one thread to value N (let it be 0). Then (strictly after the variable is initialized) during the program runtime various threads are started and they in some random order either read that variable or overwrite it with the very same value N (0 in this example). There's no synchronization around accessing the variable.
Can this situation cause unexpected behavior in the program?
It's incredibly unlikely but not impossible according to the standard.
There's nothing stating what the underlying representation of an integer is, nor does the standard specify how the values are loaded.
I can envisage, however weird, an implementation where the underlying bit pattern for 0 is 10101010 and the architecture only supports loading data into memory by bit-shifting it over eight cycles but reading it as a single unit in one cycle.
If another thread reads the value while the bit pattern is being shifted in (e.g., 00000001, 00000010, 00000101 and so on), you will have a problem.
The chances of anyone designing such a bizarre architecture is so close to zero as to be negligible. But, unfortunately, it's not zero. All I'm trying to get across is that you shouldn't rely on assumptions at all when it comes to standards compliance.
And please, before you vote me down, feel free to quote the part of the standard that states this is not possible :-)
Since C++ does not currently have a standard concurrency model, it would depend entirely on your threading implementation and whatever guarantees it gives. It is all but certainly unsafe in the general case, however, because of the potential for torn reads. There might be specific cases where it would "work" or at least "appear to work."
In C++0x (which does have a standard concurrency model), your scenario would formally result in undefined behavior. There is a long, detailed, hard-to-read specification of the concurrency model in the C++0x Final Committee Draft §1.10, but it basically boils down to this:
Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location (§1.10/3).
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior (§1.10/14).
Your expression evaluations clearly conflict because they modify and read the same memory location, and since the object is not atomic and access is not synchronized using a lock, you have undefined behavior.
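For what it's worth, wrapping the variable in std::atomic is enough to make the scenario well-defined; a sketch:

#include <atomic>

std::atomic<int> shared{0}; // initialized from one thread before the others start

void worker()
{
    // Concurrent loads and stores on a std::atomic (even when storing the
    // same value) are not a data race, so the behaviour is defined.
    if (shared.load(std::memory_order_relaxed) == 0)
        shared.store(0, std::memory_order_relaxed);
}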
No. Of course, you could end up with a data race if one of the threads later tries to change the value. You will also end up with a little cache contention, but I doubt this will have a noticeable effect.
You cannot really rely on it. For primitive types you should be fine, and if the operation is atomic (e.g. a correctly aligned int on most platforms) then writing and reading different values is safe (note that by this I mean something like "x = 5;", not "x += 5;", which is never atomic and is not thread-safe).
For non-primitive types, even if it's the same value all bets are off, since there may be a copy constructor that does something unsafe (like allocating memory).
Yes, it is possible for unexpected behavior to happen in this scenario. Consider the case where the initial value of the variable was not 0: it is possible for one thread to start the set to 0 and another thread to see the variable with only some of the bytes set.
For type int this is very unlikely, as most processors have atomic assignment of word-sized values. However, once you hit 8-byte numeric values (long on some platforms) or large structs, this begins to be an issue.
If no other thread (and this includes the main thread) can change the value from 0 to anything else (let's say 1) while those threads are initializing, then you will not have problems. But if any other thread had the potential to change the value during the start-up phase, you could have a problem. You are playing a dangerous game and I would recommend locking before reading the value.
I'm trying to design a simple cache that follows the following rules:
Entries have a unique key
When the number of entries in the cache exceeds a certain limit, the older items are evicted (to keep the cache from getting too large).
Each entry's data is immutable until the entry is removed from the cache.
A 'reader' can access an entry in the cache and the entry must be valid for the lifetime of the reader.
Each reader can be on its own thread, and all readers access the same cache instance.
Thread safety with this cache is important, since we don't want readers holding a reference to an entry, only to have it evicted by another thread somewhere else.
Hence, my current implementation just copies out the whole entry when reading from the cache. This is fine for smaller objects, but once objects get too large there's too much copying going on. It's also not so great with large numbers of readers that are accessing the same cached entry.
Since the data is immutable, it would be great if every reader to the same message could just hold a reference instead of a copy, but in some thread safe manner (so it wouldn't get evicted).
A previous implementation used reference counting to achieve this...but it's very tricky with threads, and I went with this simpler approach.
Are there any other patterns/ideas I could use to improve this design?
In a native system without a higher power (such as a VM) capable of performing garbage collection, you aren't going to do much better performance or complexity wise than reference counting.
You are correct that reference counting can be tricky: not only do the increment and decrement have to be atomic, but you need to ensure that the object can't be deleted out from under you before you are able to increment the count. Thus, if you store the reference counter inside the object, you'll have to somehow avoid the race that occurs between the time you read the pointer to the object out of the cache and the time you manage to increment the counter.
If your structure is a standard container, which is not already thread-safe, you will also have to protect the container from unsupported concurrent access. This protection can dovetail nicely with avoiding the reference-counting race condition described above: if you use a reader-writer lock to protect the structure, combined with atomic increments of the in-object reference counter while still holding the reader lock, you'll be protected from anyone deleting the object out from under you before you get the reference count, since such mutators must be "writers".
Here, objects can be evicted from the cache while still having a positive reference count - they will be destroyed when the last outstanding reference is dropped (by your smart pointer class). This is typically considered a feature, since it means that at least some object can always be removed from the cache, but it also has the downside that there is no strict upper bound on the number of objects "alive" in memory, since the reference counting allows objects to stay alive even after they've left the cache. Whether this is acceptable to you depends on your requirements and on details such as how long other threads may hold references to objects.
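A sketch of the lookup and release paths described above, assuming C++17's std::shared_mutex (all names are illustrative):

#include <atomic>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

struct Entry {
    std::atomic<int> refs{1}; // one reference held by the cache itself
    // ... immutable payload ...
};

std::shared_mutex cache_mutex;
std::unordered_map<int, Entry*> cache;

Entry* acquire(int key)
{
    std::shared_lock<std::shared_mutex> lock(cache_mutex); // reader lock
    auto it = cache.find(key);
    if (it == cache.end())
        return nullptr;
    // Safe: an evicting writer needs the exclusive lock, so the entry cannot
    // be deleted between the find() and this increment.
    it->second->refs.fetch_add(1, std::memory_order_relaxed);
    return it->second;
}

void release(Entry* e)
{
    if (e->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete e; // last outstanding reference dropped (possibly post-eviction)
}

void evict(int key)
{
    Entry* victim = nullptr;
    {
        std::unique_lock<std::shared_mutex> lock(cache_mutex); // writer lock
        auto it = cache.find(key);
        if (it == cache.end())
            return;
        victim = it->second;
        cache.erase(it);
    }
    release(victim); // drop the cache's own reference; deletes only if last
}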
If you don't have access to (non-standard) atomic increment routines, you can use a mutex to do the atomic increment/decrement, although this may increase the cost significantly in both time and per-object space.
If you want to get more exotic (and faster) you'll need to design a container which is itself thread-safe, and come up with a more complex reference counting mechanism. For example, you may be able to create a hash table where the primary bucket array is never re-allocated, so it can be accessed without locking. Furthermore, you can use non-portable double-wide CAS (compare and swap) operations on that array to both read a pointer and increment a reference count adjacent to it (128 bits of stuff on a 64-bit arch), allowing you to avoid the race mentioned above.
A completely different track would be to implement some kind of "delayed safe delete" strategy. Here you avoid reference counting entirely. You remove references from your cache, but do not delete objects immediately, since other threads may still hold pointers to the object. Then later, at some "safe" time, you delete the object. Of course, the trick is to discover when such a safe time exists. Basic strategies involve each thread signaling when it "enters" and "leaves" a danger zone during which it may access the cache and hold references to contained objects. Once all threads which were in the danger zone when an object was removed from the cache have left the danger zone, you can free the object while being sure that no more references are held.
How practical this is depends on whether you have logical "enter" and "leave" points in your application (many request-oriented applications will), and whether the "enter" and "leave" costs can be amortized across many cache accesses. The upside is no reference counting! Of course, you still need a thread-safe container.
You can find references to many academic papers on the topic and some practical performance considerations by examining the papers linked here.
I think you effectively want a reader/writer lock per entry. Readers read lock and unlock as they are using it. The eviction thread has to obtain a write lock (which forces all readers to complete before it can be acquired). There needs to be some way for a reader to then tell (before acquiring a read lock) whether the entry in question has been evicted concurrently.
On the downside, one lock per entry is expensive for a big cache (in terms of memory). You can address that by using one lock for a set of entries - this trades off memory vs. concurrency. You need to be careful about deadlock scenarios in that case, as sketched below.
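A sketch of that lock-striping idea (the stripe count of 16 is an arbitrary choice):

#include <array>
#include <cstddef>
#include <functional>
#include <shared_mutex>

std::array<std::shared_mutex, 16> stripes; // one lock per group of entries

std::shared_mutex& stripe_for(std::size_t key)
{
    return stripes[std::hash<std::size_t>{}(key) % stripes.size()];
}

// Readers take stripe_for(key) shared; the eviction thread takes it exclusive.
// If any operation ever needs two stripes, acquire them in a fixed (e.g.,
// ascending index) order to avoid deadlock.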
Sounds like a monitor with a std::map as the buffer would be useful in this situation.
I'm thinking that if you want to share a reference, you'll need to keep a count. So long as you use the interlocked inc/dec functions, this should be simple enough even for multiple threads.
It seems to me that the reference counting solution is only tricky in that the updating/testing-for-eviction of said reference counters must be inside a critical section protected by a mutex. So long as more than one thread doesn't access the reference counters at a time, it should be thread-safe.
Have a circular queue, and don't allow multiple threads to write to it or the cache will be useless. Each thread should have its own cache, maybe with read access to the other caches but not write access.
I need to provide synchronization to some members of a structure.
If the structure is something like this
struct SharedStruct {
    int Value1;
    int Value2;
};
and I have a global variable
SharedStruct obj;
I want that the write from a processor
obj.Value1 = 5; // Processor B
to be immediately visible to the other processors, so that when I test the value
if(obj.Value1 == 5) { DoSmth(); } // Processor A
else DoSmthElse();
to get the new value, not some old value from the cache.
First I thought that if I use volatile when writing/reading the values, it would be enough. But I read that volatile can't solve this kind of issue.
The members are guaranteed to be properly aligned on 2/4/8 byte boundaries, and writes should be atomic in this case, but I'm not sure how the cache could interfere with this.
Would using memory barriers (mfence, sfence, etc.) be enough? Or are some interlocked operations required?
Or maybe something like
lock mov addr, REGISTER
?
The easiest would obviously be some locking mechanism, but speed is critical and we can't afford locks :(
Edit
Maybe I should clarify a bit. The value is set only once (it behaves like a flag). All the other threads just need to read it. That's why I think there may be a way to force the read of this new value without using locks.
Thanks in advance!
There Ain't No Such Thing As A Free Lunch. If your data is being accessed from multiple threads, and it is necessary that updates are immediately visible by those other threads, then you have to protect the shared struct by a mutex, or a readers/writers lock, or some similar mechanism.
Performance is a valid concern when synchronizing code, but it is trumped by correctness. Generally speaking, aim for correctness first and then profile your code. Worrying about performance when you haven't yet nailed down correctness is premature optimization.
Use explicitly atomic instructions. I believe most compilers offer these as intrinsics. Compare and Exchange is another good one.
If you intend to write a lockless algorithm, you need to write it so that your changes only take effect when conditions are as expected.
For example, if you intend to insert a linked list object, use the compare/exchange stuff so that it only inserts if the pointer still points at the same location when you actually do the update.
Or if you are going to decrement a reference count and free the memory at count 0, you will want to pre-free it by making it unavailable somehow, check that the count is still 0 and then really free it. Or something like that.
Using a lock, operate, unlock design is generally a lot easier. The lock-free algorithms are really difficult to get right.
All the other answers here seem to hand wave about the complexities of updating shared variables using mutexes, etc. It is true that you want the update to be atomic.
And you could use various OS primitives to ensure that, and that would be good programming style.
However, on most modern processors (certainly the x86), writes of small, aligned scalar values are atomic and immediately visible to other processors due to cache coherency.
So in this special case, you don't need all the synchronizing junk; the hardware does the atomic operation for you. Certainly this is safe with 4-byte values (e.g., "int" in 32-bit C compilers).
So you could just initialize Value1 with an uninteresting value (say 0) before you start the parallel threads, and simply write other values there. If the question is exiting the loop on a fixed value (e.g., if value1 == 5) this will be perfectly safe.
If you insist on capturing the first value written, this won't work. But if you have a parallel set of threads, and any value written other than the uninteresting one will do, this is also fine.
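For completeness, a portable C++11 rendering of that set-once flag (a sketch; explicit release/acquire ordering makes the visibility guarantee part of the program rather than an x86-specific property):

#include <atomic>

void DoSmth();
void DoSmthElse();

struct SharedStruct {
    std::atomic<int> Value1{0}; // the "uninteresting" initial value
    int Value2{0};
};

SharedStruct obj;

void writer() // Processor B
{
    obj.Value1.store(5, std::memory_order_release);
}

void reader() // Processor A
{
    if (obj.Value1.load(std::memory_order_acquire) == 5)
        DoSmth();
    else
        DoSmthElse();
}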
I second peterb's answer to aim for correctness first. Yes, you can use memory barriers here, but they will not do what you want.
You said immediately. However, no matter how immediate this update can be, you could (and will) end up with the if() clause being executed, then the flag being set, and then DoSmthElse() being executed afterwards. This is called a race condition...
You want to synchronize something, it seems, but it is not this flag.
Making the field volatile should make the change "immediately" visible in other threads, but there is no guarantee that the instant at which thread A executes the update doesn't occur after thread B tests the value but before thread B executes the body of the if/else statement.
It sounds like what you really want to do is make that if/else statement atomic, and that will require either a lock, or an algorithm that is tolerant of this sort of situation.