I wrote my own atomic_inc to increment an integer using inline asm; it is used for reference counting of shared objects. gcc 4.8.2 with -fsanitize=thread reports data races, and I eventually found that they are likely caused by my atomic_inc. I don't believe my code has a data-race bug there. Is this a false positive from tsan?
static inline int atomic_add(volatile int *count, int add) {
__asm__ __volatile__(
"lock xadd %0, (%1);"
: "=a"(add)
: "r"(count), "a"(add)
: "memory"
);
return add;
}
void MyClass::Ref() {
// std::unique_lock<std::mutex> lock(s_ref);
atomic_add(&_refs, 1);
}
void MyClass::Unref() {
// std::unique_lock<std::mutex> lock(s_ref);
int n = atomic_add(&_refs, -1) - 1;
// lock.unlock();
assert(n >= 0);
if (n <= 0) {
delete this;
}
}
Part of your problem is that gcc doesn't look inside the asm.
The other part of your problem is that volatile doesn't make a variable thread-safe.
Given __asm__ means you are committed to gcc, why not use the gcc intrinsics? (They are documented and well tested, and gcc will understand their semantics.)
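As a rough sketch of what that could look like, the function below replaces the hand-written lock xadd with the GCC/Clang builtin __atomic_fetch_add (available since GCC 4.7). The name atomic_add_gcc is made up for illustration; the return convention (the value before the addition) is chosen to match xadd.

```cpp
// Sketch only: atomic_add_gcc is a hypothetical name, not from the question.
// __atomic_fetch_add is a GCC/Clang builtin, so the compiler knows its
// semantics instead of seeing an opaque asm block.
static inline int atomic_add_gcc(int *count, int add) {
    // Returns the value held *before* the addition, like lock xadd does.
    return __atomic_fetch_add(count, add, __ATOMIC_SEQ_CST);
}
```

Because the operation is a builtin rather than opaque asm, tools such as ThreadSanitizer should be able to recognise it as an atomic access.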
As to whether the warning is a false positive, I don't know. The safe thing to do is to assume the problem is genuine. It is really hard to see problems in multi-threaded code (even when you know they are there). (Once we ripped out a very clever piece of code using a published mutex algorithm that was failing and replaced it with a simple spin-lock. That fixed the failures, but we never could find why it failed.)
As others have already said, the tool cannot see inside your asm. But you shouldn't do that anyway.
Just use std::atomic and be done with it - that's both thread safe and portable and the compiler knows how to optimize it - unlike your current code.
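A minimal sketch of the reference-counting part with std::atomic might look like this. RefCounted is a hypothetical stand-in for MyClass, and starting the count at 1 is an assumption, since the question does not show the constructor.

```cpp
#include <atomic>

// Hypothetical stand-in for MyClass from the question.
class RefCounted {
public:
    RefCounted() : refs_(1) {}  // assumption: the creator holds one reference

    void Ref() { refs_.fetch_add(1, std::memory_order_relaxed); }

    // Returns true if this call dropped the last reference and deleted the object.
    bool Unref() {
        // fetch_sub returns the previous value, so 1 means we were the last owner.
        if (refs_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete this;
            return true;
        }
        return false;
    }

    int count() const { return refs_.load(std::memory_order_relaxed); }

private:
    ~RefCounted() {}  // private: the object may only be destroyed through Unref()
    std::atomic<int> refs_;
};
```

The decrement uses acquire-release ordering so that writes made by other owners are visible to whichever thread ends up running the destructor.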
I've got a single writer which has to increment a variable at a fairly high frequency, and one or more readers which access this variable at a lower frequency.
The write is triggered by an external interrupt.
Since I need to write at high speed, I don't want to use mutexes or other expensive locking mechanisms.
The approach I came up with was copying the value after writing to it. The reader can then compare the original with the copy. If they are equal, the variable's content is valid.
Here is my implementation in C++:
template<typename T>
class SafeValue
{
private:
volatile T _value;
volatile T _valueCheck;
public:
void setValue(T newValue)
{
_value = newValue;
_valueCheck = _value;
}
T getValue()
{
volatile T value;
volatile T valueCheck;
do
{
valueCheck = _valueCheck;
value = _value;
} while(value != valueCheck);
return value;
}
};
The idea behind this is to detect data races while reading and to retry if they happen. However, I don't know if this will always work. I haven't found anything about this approach online, hence my question:
Is there any problem with my approach when used with a single writer and multiple readers?
I already know that high write frequencies may cause starvation of the reader. Are there other bad effects I have to be cautious of? Could it even be that this isn't thread-safe at all?
Edit 1:
My target system is a ARM Cortex-A15.
T should be able to become at least any primitive integral type.
Edit 2:
std::atomic is too slow on both the reader and the writer side. I benchmarked it on my system: writes are roughly 30 times slower and reads roughly 50 times slower than unprotected, primitive operations.
If this single variable is just an integer, pointer, or plain old value type, you can probably just use std::atomic.
You should try using std::atomic first, but make sure that your compiler knows and understands your target architecture. Since you are targeting Cortex-A15 (ARMv7-A cpu), make sure to use -march=armv7-a or even -mcpu=cortex-a15.
The first shall generate ldrexd instruction which should be atomic according to ARM docs:
Single-copy atomicity
In ARMv7, the single-copy atomic processor accesses are:
all byte accesses
all halfword accesses to halfword-aligned locations
all word accesses to word-aligned locations
memory accesses caused by LDREXD and STREXD instructions to doubleword-aligned locations.
The latter shall generate ldrd instruction which should be atomic on targets supporting Large Physical Address Extension:
In an implementation that includes the Large Physical Address Extension, LDRD and STRD accesses to 64-bit aligned locations are 64-bit single-copy atomic as seen by translation table walks and accesses to translation tables.
--- Note ---
The Large Physical Address Extension adds this requirement to avoid the need to complex measures to avoid atomicity issues when changing translation table entries, without creating a requirement that all locations in the memory system are 64-bit single-copy atomic.
You can also check how Linux kernel implements those:
#ifdef CONFIG_ARM_LPAE
static inline long long atomic64_read(const atomic64_t *v)
{
long long result;
__asm__ __volatile__("# atomic64_read\n"
" ldrd %0, %H0, [%1]"
: "=&r" (result)
: "r" (&v->counter), "Qo" (v->counter)
);
return result;
}
#else
static inline long long atomic64_read(const atomic64_t *v)
{
long long result;
__asm__ __volatile__("# atomic64_read\n"
" ldrexd %0, %H0, [%1]"
: "=&r" (result)
: "r" (&v->counter), "Qo" (v->counter)
);
return result;
}
#endif
There's no way anyone can know. You would have to see if your compiler documents any multi-threaded semantics that would guarantee that this will work, or look at the generated assembler code and convince yourself that it will work. Be warned that in the latter case, it is always possible that a later version of the compiler, different optimization options, or a newer CPU might break the code.
I'd suggest testing std::atomic with the appropriate memory_order. If for some reason that's too slow, use inline assembly.
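As a sketch of what "the appropriate memory_order" could mean for a single-writer counter: the default for std::atomic is sequentially consistent, which is usually the expensive choice. If the counter does not publish any other data, relaxed ordering may be enough. The class and member names below are hypothetical, and the cost on a Cortex-A15 is something to verify in the disassembly rather than take from this example.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical single-writer counter; names are not from the question.
class Counter {
    std::atomic<std::uint32_t> value_{0};
public:
    // Single writer only: a separate load and store avoids an atomic
    // read-modify-write, which would be needed with several writers.
    void increment() {
        std::uint32_t v = value_.load(std::memory_order_relaxed);
        value_.store(v + 1, std::memory_order_relaxed);
    }

    // Any number of readers. Relaxed is enough only if the counter itself is
    // the only data exchanged; use acquire/release if it guards other data.
    std::uint32_t get() const {
        return value_.load(std::memory_order_relaxed);
    }
};
```

For word-sized types, relaxed loads and stores are expected to compile to plain load and store instructions on most targets, so benchmarking this variant against the default ordering is worthwhile before reaching for inline assembly.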
Another option is to have a buffer of non-atomic values the publisher produces and an atomic pointer to the latest.
#include <atomic>
#include <utility>
template<class T>
class PublisherValue {
static auto constexpr N = 32;
T values_[N];
std::atomic<T*> current_{values_};
public:
PublisherValue() = default;
PublisherValue(PublisherValue const&) = delete;
PublisherValue& operator=(PublisherValue const&) = delete;
// Single writer thread only.
template<class U>
void store(U&& value) {
T* p = current_.load(std::memory_order_relaxed);
if(++p == values_ + N)
p = values_;
*p = std::forward<U>(value);
current_.store(p, std::memory_order_release); // (1)
}
// Multiple readers. Make a copy to avoid referring the value for too long.
T load() const {
return *current_.load(std::memory_order_consume); // Sync with (1).
}
};
This is wait-free, but there is a small chance that a reader might be de-scheduled while copying the value and hence read the oldest value while it has been partially overwritten. Making N bigger reduces this risk.
Chandler Carruth introduced two functions in his CppCon2015 talk that can be used to do some fine-grained inhibition of the optimizer. They are useful to write micro-benchmarks that the optimizer won't simply nuke into meaninglessness.
void clobber() {
asm volatile("" : : : "memory");
}
void escape(void* p) {
asm volatile("" : : "g"(p) : "memory");
}
These use inline assembly statements to change the assumptions of the optimizer.
The assembly statement in clobber states that the assembly code in it can read and write anywhere in memory. The actual assembly code is empty, but the optimizer won't look into it because it's asm volatile. It believes it when we tell it the code might read and write everywhere in memory. This effectively prevents the optimizer from reordering or discarding memory writes prior to the call to clobber, and forces memory reads after the call to clobber†.
The one in escape additionally makes the pointer p visible to the assembly block. Again, because the optimizer won't look into the actual inline assembly code, that code can be empty, and the optimizer will still assume that the block uses the address pointed to by the pointer p. This effectively forces whatever p points to to be in memory and not in a register, because the assembly block might perform a read from that address.
(This is important because the clobber function won't force reads or writes for anything that the compiler decides to put in a register, since the assembly statement in clobber doesn't state that anything in particular must be visible to the assembly.)
All of this happens without any additional code being generated directly by these "barriers". They are purely compile-time artifacts.
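To illustrate how the two helpers are typically combined in a micro-benchmark, here is a small sketch along the lines of the vector example from the talk. bench_push_back is a made-up name, the helpers are repeated so the snippet stands alone, and it only builds with GCC or Clang.

```cpp
#include <cstddef>
#include <vector>

// Same helpers as above, repeated so the sketch is self-contained (GCC/Clang only).
static void escape(void* p) { asm volatile("" : : "g"(p) : "memory"); }
static void clobber() { asm volatile("" : : : "memory"); }

// Hypothetical benchmark body: without the two calls, the optimizer would be
// free to remove the vector and the store entirely.
std::size_t bench_push_back(int iterations) {
    std::size_t total = 0;
    for (int i = 0; i < iterations; ++i) {
        std::vector<int> v;
        v.reserve(1);
        escape(v.data());  // the buffer's address is now known to the "asm"
        v.push_back(42);
        clobber();         // the write of 42 must have reached memory here
        total += v.size();
    }
    return total;
}
```

Wrapping the loop in a timer then measures the allocation and the store rather than an empty loop.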
These use language extensions supported in GCC and in Clang, though. Is there a way to have similar behaviour when using MSVC?
†To understand why the optimizer has to think this way, imagine if the assembly block were a loop adding 1 to every byte in memory.
Given your approximation of escape(), you should also be fine with the following approximation of clobber() (note that this is a draft idea, deferring some of the solution to the implementation of the function nextLocationToClobber()):
// always returns false, but in an undeducible way
bool isClobberingEnabled();
// The challenge is to implement this function in a way,
// that will make even the smartest optimizer believe that
// it can deliver a valid pointer pointing anywhere in the heap,
// stack or the static memory.
volatile char* nextLocationToClobber();
const bool clobberingIsEnabled = isClobberingEnabled();
volatile char* clobberingPtr;
inline void clobber() {
if ( clobberingIsEnabled ) {
// This will never be executed, but the compiler
// cannot know about it.
clobberingPtr = nextLocationToClobber();
*clobberingPtr = *clobberingPtr;
}
}
UPDATE
Question: How would you ensure that isClobberingEnabled returns false "in an undeducible way"? Certainly it would be trivial to place the definition in another translation unit, but the minute you enable LTCG, that strategy is defeated. What did you have in mind?
Answer: We can take advantage of a hard-to-prove property from number theory, for example, Fermat's Last Theorem:
bool undeducible_false() {
// It took mathematicians more than 3 centuries to prove Fermat's
// last theorem in its most general form. Hardly that knowledge
// has been put into compilers (or the compiler will try hard
// enough to check all one million possible combinations below).
// Caveat: avoid integer overflow (Fermat's theorem
// doesn't hold for modulo arithmetic)
std::uint32_t a = std::clock() % 100 + 1;
std::uint32_t b = std::rand() % 100 + 1;
std::uint32_t c = reinterpret_cast<std::uintptr_t>(&a) % 100 + 1;
return a*a*a + b*b*b == c*c*c;
}
I have used the following in place of escape.
#ifdef _MSC_VER
#pragma optimize("", off)
template <typename T>
inline void escape(T* p) {
*reinterpret_cast<char volatile*>(p) =
*reinterpret_cast<char const volatile*>(p); // thanks, #milleniumbug
}
#pragma optimize("", on)
#endif
It's not perfect but it's close enough, I think.
Sadly, I don't have a way to emulate clobber.
I have been reading for a while in order to better understand what's going on in multithreaded programming on a modern (multicore) CPU. However, while I was reading this, I noticed the code below in the "Explicit Compiler Barriers" section, which does not use volatile for the IsPublished global.
#define COMPILER_BARRIER() asm volatile("" ::: "memory")
int Value;
int IsPublished = 0;
void sendValue(int x)
{
Value = x;
COMPILER_BARRIER(); // prevent reordering of stores
IsPublished = 1;
}
int tryRecvValue()
{
if (IsPublished)
{
COMPILER_BARRIER(); // prevent reordering of loads
return Value;
}
return -1; // or some other value to mean not yet received
}
The question is: is it safe to omit volatile for IsPublished here? Many people mention that the volatile keyword has little to do with multithreaded programming, and I agree with them. However, among compiler optimizations, constant folding/propagation can be applied, and as the wiki page shows, it is possible for if (IsPublished) to be changed into if (false) if the compiler does not know who can change the value of IsPublished. Am I missing or misunderstanding something here?
Memory barriers can prevent compiler reordering and out-of-order execution by the CPU, but as I said in the previous paragraph, do I still need volatile to avoid constant folding/propagation, which is a dangerous optimization, especially when globals are used as flags in lock-free code?
If tryRecvValue() is called only once, it is safe to omit volatile for IsPublished. The same is true when, between calls to tryRecvValue(), there is a function call for which the compiler cannot prove that it leaves the false value of IsPublished unchanged.
// Example 1(Safe)
int v = tryRecvValue();
if(v == -1) exit(1);
// Example 2(Unsafe): tryRecvValue may be inlined and 'IsPublished' may be not re-read between iterations.
int v;
while(true)
{
v = tryRecvValue();
if(v != -1) break;
}
// Example 3(Safe)
int v;
while(true)
{
v = tryRecvValue();
if(v != -1) break;
some_extern_call(); // Possibly can change 'IsPublished'
}
Constant propagation can be applied only when the compiler can prove the value of the variable. Because IsPublished is declared as non-constant, its value can be proven only if:
1. The variable is assigned the given value, or a read of the variable is followed by a branch that is executed only when the variable has the given value.
2. The variable is read (again) in the same program thread.
3. Between 1 and 2, the variable is not changed within that program thread.
Unless you call tryRecvValue() in some sort of .init function, the compiler will never see the initialization of IsPublished in the same thread as its read. So proving that this variable is false from its initialization is not possible.
Proving that IsPublished is false from the false (empty) branch in tryRecvValue is possible; see Example 2 in the code above.
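For comparison, here is a sketch of the same publish/receive pair written with a C++11 std::atomic flag instead of a plain int plus compiler barriers. This is a different technique from the one in the question: the atomic type stops the compiler from caching or folding the flag, and the release/acquire pair also orders the accesses to Value at the CPU level.

```cpp
#include <atomic>

int Value;
std::atomic<int> IsPublished{0};

void sendValue(int x)
{
    Value = x;
    // Release: the store to Value cannot be reordered after this store.
    IsPublished.store(1, std::memory_order_release);
}

int tryRecvValue()
{
    // Acquire: the flag is really re-read on every call, and the read of
    // Value cannot be reordered before it.
    if (IsPublished.load(std::memory_order_acquire))
    {
        return Value;
    }
    return -1;  // or some other value to mean not yet received
}
```

With this version the loop in Example 2 is no longer a problem, because every iteration has to perform a fresh atomic load.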
As discussed in this question, C++11 optimizes endless loops away.
However, in embedded devices which have a single purpose, endless loops make sense and are actually quite often used. Even a completely empty while(1); is useful for a watchdog-assisted reset. Terminating but empty loops can also be useful in embedded development.
Is there an elegant way to specifically tell the compiler to not remove empty or endless loops, without disabling optimization altogether?
One of the requirements for a loop to be removed (as mentioned in that question) is that it
does not access or modify volatile objects
So,
void wait_forever(void)
{
volatile int i = 1;
while (i) ;
}
should do the trick, although I would certainly verify this by looking at the disassembly of a program produced with your particular toolchain.
A function like this would be a good candidate for GCC's noreturn attribute as well.
void wait_forever(void) __attribute__ ((noreturn));
void wait_forever(void)
{
volatile int i = 1;
while (i) ;
}
int main(void)
{
if (something_bad_happened)
wait_forever();
}
Herlihy and Shavit's book (The Art of Multiprocessor Programming) solution to memory reclamation uses Java's AtomicStampedReference<T>.
Writing one in C++ for x86_64 requires, I imagine, at least a 12-byte swap operation: 8 bytes for a 64-bit pointer and 4 for the int.
Is there x86 hardware support for this and if not, any pointers on how to do wait-free memory reclamation without it?
Yes, there is hardware support, though I don't know if it is exposed by C++ libraries. Anyway, if you don't mind doing some low-level unportable assembly language trickery - look up the CMPXCHG16B instruction in Intel manuals.
Windows gives you a bunch of Interlocked functions that are atomic and can probably be used to do what you want. Similar functions exist for other platforms, and I believe Boost has an interlocked library as well.
Your question isn't super clear and I don't have a copy of Herlihy and Shavit lying around. Perhaps if you elaborated or gave pseudocode outlining what you want to do, we could give you a more specific answer.
Fortunately, I have the book.
For others who may provide answers, the point is to implement this class:
template <typename T>
class AtomicReference {
public:
void set(T *ref, int stamp){ ... }
T *get(int *stamp){ ... }
private:
T *_ref;
int _stamp;
};
in a lock-free way so that:
set() updates the reference and the stamp atomically.
get() returns the reference and sets *stamp to the stamp corresponding to the reference.
JDonner, please correct me if I am wrong.
Now my answer: I don't think you can do it without a lock somewhere (a lock can be while(test_and_set() != ..)). Therefore there is no lock-free algorithm for this; otherwise it would be possible to build an N-byte register in a lock-free way for any N.
If you look at section 9.8.1 of the book, the AtomicMarkableReference is the same thing with a single bit instead of an integer stamp. The authors suggest "stealing" a bit from a pointer so that the mark and the pointer can be extracted from a single word (almost quoted). This obviously means that they want to use a single atomic register to do it.
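A rough C++ sketch of that bit-stealing idea is shown below. MarkableRef and its method names are hypothetical, loosely modelled on Java's AtomicMarkableReference; it assumes T is at least 2-byte aligned, so the lowest bit of any valid T* is always zero and can hold the mark.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical analogue of AtomicMarkableReference: pointer and mark share one
// word. Assumes alignof(T) >= 2 so bit 0 of a valid pointer is free.
template <typename T>
class MarkableRef {
    std::atomic<std::uintptr_t> word_;

    static std::uintptr_t pack(T* p, bool mark) {
        return reinterpret_cast<std::uintptr_t>(p) | (mark ? 1u : 0u);
    }

public:
    explicit MarkableRef(T* p = nullptr, bool mark = false)
        : word_(pack(p, mark)) {}

    // Reads pointer and mark from a single atomic load, so they are consistent.
    T* get(bool* mark) const {
        std::uintptr_t w = word_.load(std::memory_order_acquire);
        *mark = (w & 1u) != 0;
        return reinterpret_cast<T*>(w & ~static_cast<std::uintptr_t>(1));
    }

    // Succeeds only if both the pointer and the mark match the expected pair.
    bool compareAndSet(T* expectedRef, T* newRef,
                       bool expectedMark, bool newMark) {
        std::uintptr_t expected = pack(expectedRef, expectedMark);
        return word_.compare_exchange_strong(expected, pack(newRef, newMark),
                                             std::memory_order_acq_rel);
    }
};
```

A full integer stamp does not fit into spare alignment bits this way, which is why the stamped variant needs either a wider compare-and-swap or a narrower pointer representation.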
However, there may be a way to build wait-free memory reclamation without it. I don't know.
Yes, x64 supports this; you need to use CMPXCHG16B.
You can save a bit on memory by relying on the fact that the pointer will use less than 64 bits. First, define a compare&set function (this ASM works in GCC & ICC):
#include <stdint.h>
inline bool CAS_ (volatile uint64_t* mem, uint64_t old_val, uint64_t new_val)
{
// cmpxchg8b compares EDX:EAX with *mem and, if equal, stores ECX:EBX.
uint32_t old_high = old_val >> 32, old_low = old_val;
uint32_t new_high = new_val >> 32, new_low = new_val;
char res = 0;
asm volatile("lock; cmpxchg8b %2;"
"setz %3; "
: "+a" (old_low), // 0: in/out, low half of expected value
"+d" (old_high), // 1: in/out, high half of expected value
"+m" (*mem), // 2: the 64-bit location being updated
"=q" (res) // 3: set to 1 if the exchange happened
: "b" (new_low), // 4
"c" (new_high) // 5
: "cc", "memory");
return res != 0;
}
You'll then need to build a tagged-pointer type. I'm assuming a 40-bit pointer with a cacheline-width of 128-bytes (like Nehalem). Aligning to the cache-line will give enormous speed improvements by reducing false-sharing, contention, etc.; this has the obvious trade-off of using a lot more memory, in some situations.
template <typename pointer_type, typename tag_type, int PtrAlign=7, int PtrWidth=40>
struct tagged_pointer
{
// Assumes a 64-bit unsigned long (LP64, as with GCC/ICC on x86-64 Linux).
typedef unsigned long raw_value_type;
// 1UL keeps the shift in 64 bits; 1 << 33 would overflow an int.
static const raw_value_type PtrMask = (1UL << (PtrWidth - PtrAlign)) - 1;
static const raw_value_type TagMask = ~ PtrMask;
raw_value_type raw_m_;
tagged_pointer () : raw_m_(0) {}
tagged_pointer (pointer_type ptr) { this->pack(ptr, 0); }
tagged_pointer (pointer_type ptr, tag_type tag) { this->pack(ptr, tag); }
void pack (pointer_type ptr, tag_type tag)
{
// Pointers cannot be shifted directly, so convert to an integer first.
raw_value_type p = reinterpret_cast<raw_value_type>(ptr);
this->raw_m_ = 0;
this->raw_m_ |= ((p >> PtrAlign) & PtrMask);
this->raw_m_ |= ((static_cast<raw_value_type>(tag) << (PtrWidth - PtrAlign)) & TagMask);
}
pointer_type ptr () const
{
raw_value_type p = (this->raw_m_ & PtrMask) << PtrAlign;
return reinterpret_cast<pointer_type>(p);
}
tag_type tag () const
{
raw_value_type t = (this->raw_m_ & TagMask) >> (PtrWidth - PtrAlign);
return static_cast<tag_type>(t);
}
};
I haven't had a chance to debug this code, so you'll need to do that, but this is the general idea.
Note that on the x86_64 architecture with gcc you can enable 128-bit CAS with the -mcx16 option.
int main()
{
__int128_t x = 0;
__sync_bool_compare_and_swap(&x,0,10);
return 0;
}
Compile with:
gcc -mcx16 file.c
The cmpxchg16b instruction provides the expected operation, but beware that some older x86-64 processors don't have it.
You then just need to build an entity with the counter and the pointer, plus the inline asm code. I've written a blog post on the subject here: Implementing Generic Double Word Compare And Swap
Nevertheless, you don't need this operation if you just want to prevent early-free and ABA issues. Hazard pointers are simpler and don't require specific asm code (as long as you use C++11 atomic values). I've got a repo on bitbucket with experimental implementations of various lock-free algorithms: Lock Free Experiment (beware: all these implementations are toys for experimentation, not reliable and tested code for production).