When is a call to "synchronize()" not necessary in C++ AMP?

Background: For a C++ AMP overview, see Daniel Moth's recent BUILD talk.
Going through the initial walk-throughs here, here, here, and here.
Only in that last reference do they make a call to array_view.synchronize().
In these simple examples, is a call to synchronize() not needed? When is it safe to exclude? Can we trust parallel_for_each to behave "synchronously" without it (w/r/t the code that follows)?

Use synchronize() when you want to access the data without going through the array_view interface. If all of your access to the data uses array_view operators and functions, you don't need to call synchronize(). As Daniel mentioned, the destructor of an array_view forces a synchronize as well, but it's better to call synchronize() explicitly in that case so you can catch any exceptions that might be thrown.
The synchronize function forces an update to the buffer within the calling context -- that is, if you write data on the GPU and then call synchronize in CPU code, at that point the updated values are copied to CPU memory.
This seems obvious from the name, but I mention it because other array_view operations can cause a synchronize as well. C++ AMP's array_view tries its best to make copying between CPU and GPU memory implicit -- any operation which reads data through the array_view interface will cause a copy as well.
#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;

int main() {
    std::vector<int> v(10);
    array_view<int, 1> av(10, v);
    // preview-era syntax: the shipped release renamed av.grid to
    // av.extent and restrict(direct3d) to restrict(amp)
    parallel_for_each(av.grid, [=](index<1> i) restrict(direct3d) {
        av[i] = 7;
    });
    // at this point, data isn't copied back
    std::wcout << v[0]; // should print 0
    // using the array_view to access data will force a copy
    std::wcout << av[0]; // should print 7
    // at this point data is copied back
    std::wcout << v[0]; // should print 7
}

my_array_view_instance.synchronize is not required for the simple examples I showed because the destructor calls synchronize. Having said that, I am not following best practice (sorry), which is to explicitly call synchronize. The reason is that if any exceptions are thrown at that point, you would not observe them if you left them up to the destructor, so please call synchronize explicitly.
Cheers
Daniel

Just noticed the second question in your post about parallel_for_each being synchronous vs asynchronous (sorry, I am used to 1 question per thread ;-)
"Can we trust parallel_for_each to behave "synchronously" without it (w/r/t the code that follows)?"
The answer to that is on my post about parallel_for_each:
http://www.danielmoth.com/Blog/parallelforeach-From-Amph-Part-1.aspx
...and also in the BUILD recording you pointed to, from 29:20-33:00:
http://channel9.msdn.com/Events/BUILD/BUILD2011/TOOL-802T
In a nutshell, no, you cannot trust it to be synchronous; it is asynchronous. The (implicit or explicit) synchronization point is any code that tries to access the data that is expected to be copied back from the GPU as a result of your parallel loop.
Cheers
Daniel
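To make both points above concrete -- calling synchronize() explicitly so exceptions surface, and parallel_for_each returning before the work completes -- here is a minimal sketch, written against the shipped C++ AMP API (restrict(amp) and av.extent rather than the preview-era restrict(direct3d) and av.grid):

#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;

int main() {
    std::vector<int> v(10);
    array_view<int, 1> av(10, v);
    // Returns as soon as the work is queued, not when it finishes.
    parallel_for_each(av.extent, [=](index<1> i) restrict(amp) {
        av[i] = 7;
    });
    try {
        // Explicit synchronization point: blocks until the GPU work is
        // done and the data is copied back, and surfaces any deferred
        // runtime errors as observable exceptions.
        av.synchronize();
    } catch (const runtime_exception& ex) {
        std::wcout << ex.what() << std::endl;
    }
    std::wcout << v[0]; // prints 7
}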

I'm betting that it's never safe to exclude, because in a multi-threaded (concurrent or parallel) environment it's never safe to assume anything. Certain constructs give you certain guarantees, but you have to be super careful and meticulous not to break those guarantees by introducing something you think is fine to do when in reality there's a lot of complexity underpinning the whole thing.
Haven't spent any time with C++ AMP yet, but I'm inclined to try it out.

Related

When will variables be removed by optimization?

I'm working with threads and I have a question about how compilers are allowed to optimize the following code:
void MyClass::f() {
    Parent* p = this->m_parent;
    this->m_done = true;
    p->function();
}
It is very important that p (on the stack or in a register) is used to call the function instead of this->m_parent, because as soon as m_done becomes true, this may be deleted from another thread if it happens to run its cleanup pass (I have had actual crashes due to this). In that case m_parent could contain garbage, but the thread's stack/registers would be intact.
My initial testing on GCC/Linux shows that I don't have a race condition but I would like to know if this will be the case on other compilers as well?
Is this a case for gulp volatile? I've looked at What kinds of optimizations does 'volatile' prevent in C++? and Is 'volatile' needed in this multi-threaded C++ code? but I didn't feel like either of them applied to my problem.
I feel uneasy relying on this as I'm unsure what the compiler is allowed to do here. I see the following cases:
No/beneficial optimization: the pointer in this->m_parent is stored on the stack or in a register, and this value is later used to call function(). This is the wanted behavior.
The compiler removes p, but this->m_parent happens to be available in a register already and the compiler uses it for the call to function(). This would work, but unreliably across compilers, which is bad: it could expose bugs when moving to a different platform or compiler version.
The compiler removes p and reads this->m_parent just before the call to function(). This creates a race condition, which I can't have.
Could someone please shed some light on what the compiler is allowed to do here?
Edit
I forgot to mention that this->m_done is a std::atomic<bool> and I'm using C++11.
This code will work perfectly as written in C++11 if m_done is std::atomic<bool>. The read of m_parent is sequenced-before the write to m_done which will synchronize with a hypothetical read of m_done in another thread that is sequenced-before a hypothetical write to m_parent. Taken together, that means that the standard guarantees that this thread's read of m_parent happens-before the other thread's write.
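A minimal sketch of that reasoning, with hypothetical Parent and cleanup-thread names standing in for the poster's real code:

#include <atomic>

struct Parent { void function() {} };

struct MyClass {
    Parent* m_parent;
    std::atomic<bool> m_done;

    MyClass(Parent* parent) : m_parent(parent), m_done(false) {}

    void f() {
        Parent* p = this->m_parent; // read m_parent first...
        this->m_done = true;        // ...then publish completion
        p->function();              // safe: uses the local copy
    }
};

// Hypothetical cleanup thread: its read of m_done synchronizes with the
// write in f(), so the delete happens-after f()'s read of m_parent.
void cleanup_pass(MyClass* obj) {
    while (!obj->m_done) { /* spin, yield, or sleep */ }
    delete obj;
}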
You might run into a reordering problem. Look into memory barriers to solve this: place one between the load of p and the setting of m_done, so that those two operations happen in exactly that order, and you should be fine.
Implementations are available in C++11 and Boost.
Um, some other thread might delete the object out from under you while you're in one of its member functions? Undefined behavior. Plain and simple.

Platform-Specific Workarounds to C++ Static Destruction / Construction Order Problems

I am developing with Visual Studio 2008 in standard (unmanaged) C++ under Windows XP Pro SP 3.
I have created a thread-safe wrapper around std::cout. This wrapper object is a drop-in replacement (i.e. same name) for what used to be a macro that was #defined to cout. It is used by a lot of code. Its behavior is probably pretty much as you would expect:
At construction, it creates a critical section.
During calls to operator<<(), it locks the critical section, passes the data to be printed to cout, and finally releases the critical section.
At destruction, it destroys the critical section.
This wrapper lives in static storage (it's global). Like all such objects, it constructs before main() starts and it destructs after main() exits.
Using my wrapper in the destructor of another object that also lives in static storage is problematic. Since the construction / destruction order of such objects is indeterminate, I may very well try to lock a critical section that has been destroyed. The symptom I'm seeing is that my program blocks on the lock attempt (though I suppose anything is liable to happen).
As far as ways to deal with this...
I could do nothing in the destructor; specifically, I would let the critical section continue to live. The C++ standard guarantees that cout never dies during program execution, and this would be the best possible attempt at making my wrapper behave similarly. Of course, my wrapper would "officially" be dead after its empty destructor runs, but it would probably (I hate that word) be as functional as it was before its destructor ran. On my platform, this does seem to be the case. But oh my gosh is this ugly, non-portable, and liable to future breakage...
I hold the critical section (but not the stream reference to cout) in a pimpl. All critical section accesses via the pimpl are preceded by a check for non-nullness of the pimpl. It so happens that I forgot to set the pimpl to 0 after calling delete on it in the destructor. If I were to set it to 0 (which I should do anyway), calls into my wrapper after it has been destroyed would do nothing with the critical section but would still pass data to be printed to cout. On my platform, this also seems to work. Again, ugly...
I could tell my teammates to not use my wrapper after main() exits. Unfortunately, the aerodynamics of this would be about the same as that of a tank.
QUESTIONS:
* Question 1 *
For case 1, if I leave the critical section undestroyed, there will be a resource leak of a critical section in the OS. Will this leak persist after my program has fully exited? If not, case 1 becomes more viable.
* Question 2 *
For cases 1 and 2, does anybody know if on my particular platform I can indeed safely continue to use my wrapper after its empty destructor runs? It appears I can, but I want to see if anybody knows anything definitive about how my platform behaves in this case...
* Question 3 *
My proposals are obviously imperfect, but I do not see a truly correct solution. Does anybody out there know of a correct solution to this problem?
Side note: Of course, there is a converse problem that could occur if I try to use my wrapper in the constructor of another object that also lives in static storage. In that case, I may try to lock a critical section that has not yet been created. I would like to use the "construct on first use" idiom to fix this, but that entails a syntactic change of all the code that uses my wrapper. This would require giving up the naturalness of using the << operator. And there's way too much code to change anyway. So, this is not a workable option. I'm not very far into the thought process on this half of the problem, but I will ask one question that might be part of another imperfect way to deal with the problem...
* Question 4 *
As I've said, my wrapper lives in static storage (it's global) and it has a pimpl (hormonal problem :) ). I have the impression that the raw bytes of a variable in static storage are set to 0 at load time (unless initialized differently in code). This would mean that my wrapper's pimpl has a value of 0 before construction of my wrapper. Is this correct?
Thank You,
Dave
The first thing is that I would reconsider what you are doing altogether. You cannot create a thread-safe interface by merely adding locking to each one of the operations; thread safety must be designed into the interface. The problem with a drop-in replacement like the one you propose is that it makes each single operation thread-safe (I think they already are), but that does not avoid unwanted interleaving.
Consider two threads that each execute cout << "Hi" << endl;. Locking each operation does not exclude "HiHi\n\n" as output, and things get much more complicated with manipulators: one thread might change the format for the next value to be printed, but another thread might trigger the next write, in which case both formats will be wrong.
On the particular question that you ask, you can consider using the same approach that the standard library takes with the iostreams:
Instead of creating the objects as globals, create a helper type that performs reference counting on the number of instances of the type. The constructor would check if the object is the first of its type to be created and initialize the thread safe wrapper. The last object to be destroyed would destroy your wrapper. The next piece of the puzzle is creating a global static variable of that type in a header that in turn includes the iostreams header. The last piece of the puzzle is that your users should include your header instead of iostreams.
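A sketch of that reference-counting scheme -- the "nifty counter" idiom the standard library itself uses (via std::ios_base::Init) to keep cout alive. The names, and the use of std::mutex in place of the question's Win32 critical section, are illustrative assumptions:

// safe_cout.h -- hypothetical header users include instead of <iostream>
#include <iostream>
#include <mutex>        // stand-in for the Win32 critical section
#include <new>
#include <type_traits>

class SafeCout
{
    std::mutex m_lock;
public:
    template <typename T>
    SafeCout& operator<<(const T& value)
    {
        std::lock_guard<std::mutex> guard(m_lock);
        std::cout << value;
        return *this;
    }
};

extern SafeCout& safe_cout;

static struct SafeCoutInitializer
{
    SafeCoutInitializer();
    ~SafeCoutInitializer();
} safeCoutInit; // one instance per translation unit that includes this header

// safe_cout.cpp
static int nifty_counter = 0;
static std::aligned_storage<sizeof(SafeCout), alignof(SafeCout)>::type storage;
SafeCout& safe_cout = reinterpret_cast<SafeCout&>(storage);

SafeCoutInitializer::SafeCoutInitializer()
{
    if (nifty_counter++ == 0) new (&storage) SafeCout(); // first instance constructs
}
SafeCoutInitializer::~SafeCoutInitializer()
{
    if (--nifty_counter == 0) safe_cout.~SafeCout();     // last instance destroys
}

Because the initializer in each translation unit is constructed before, and destroyed after, any static object defined below the #include in that unit, the wrapper is guaranteed to be alive whenever those objects use it.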

What are the threading guarantees of today's C and C++ compilers?

I'm wondering what are the guarantees that compilers make to ensure that threaded writes to memory have visible effects in other threads.
I know countless cases in which this is problematic, and I'm sure that if you're interested in answering you know it too, but please focus on the cases I'll be presenting.
More precisely, I am concerned about the circumstances that can lead to threads missing memory updates done by other threads. I don't care (at this point) if the updates are non-atomic or badly synchronized: as long as the concerned threads notice the changes, I'll be happy.
I hope that compilers make a distinction between two kinds of variable accesses:
Accesses to variables that necessarily have an address;
Accesses to variables that don't necessarily have an address.
For instance, if you take this snippet:
void sleepingbeauty()
{
    int i = 1;
    while (i) sleep(1);
}
Since i is a local, I assume that my compiler can optimize it away, and just let the sleeping beauty fall to eternal slumber.
void onedaymyprincewillcome(int* i);

void sleepingbeauty()
{
    int i = 1;
    onedaymyprincewillcome(&i);
    while (i) sleep(1);
}
Since i is a local, but its address is taken and passed to another function, I assume that my compiler will now know that it's an "addressable" variable, and generate memory reads to it to ensure that maybe some day the prince will come.
int i = 1;

void sleepingbeauty()
{
    while (i) sleep(1);
}
Since i is a global, I assume that my compiler knows the variable has an address and will generate reads to it instead of caching the value.
void sleepingbeauty(int* ptr)
{
    *ptr = 1;
    while (*ptr) sleep(1);
}
I hope that the dereference operator is explicit enough to have my compiler generate a memory read on each loop iteration.
I'm fairly sure that this is the memory access model used by every C and C++ compiler in production out there, but I don't think there are any guarantees. In fact, C++03 is even blind to the existence of threads, so this question wouldn't even make sense with the standard in mind. I'm not sure about C, though.
Is there some documentation out there that specifies whether I'm right or wrong? I know these are muddy waters since the answers may not be on standards grounds, but it seems like an important issue to me.
Besides the compiler generating reads, I'm also worried that the CPU cache could technically retain an outdated value, and that even though my compiler did its best to bring the reads and writes about, the values never synchronise between threads. Can this happen?
Accesses to variables that don't necessarily have an address.
All variables must have addresses (from the language's perspective -- compilers are allowed to avoid giving things addresses if they can, but that's not visible from inside the language). It's a side effect of everything having to be "pointerable" that everything has an address -- even an empty class typically has a size of at least a char so that a pointer can be created to it.
Since i is a local, but its address is taken and passed to another function, I assume that my compiler will now know that it's an "addressable" variable, and generate memory reads to it to ensure that maybe some day the prince will come.
That depends on the content of onedaymyprincewillcome. The compiler may inline that function if it wishes and still make no memory reads.
Since i is a global, I assume that my compiler knows the variable has an address and will generate reads to it.
Yes, but it really doesn't matter if there are reads to it. These reads might simply be going to cache on your current local CPU core, not actually going all the way back to main memory. You would need something like a memory barrier for this, and no C++ compiler is going to do that for you.
I hope that the dereference operator is explicit enough to have my compiler generate a memory read on each loop iteration.
Nope -- not required. The function may be inlined, which would allow the compiler to completely remove these things if it so desires.
The only language feature in the standard that lets you control things like this w.r.t. threading is volatile, which simply requires that the compiler generate reads. That does not mean the value will be consistent though because of the CPU cache issue -- you need memory barriers for that.
If you need true multithreading correctness, you're going to be using some platform specific library to generate memory barriers and things like that, or you're going to need a C++0x compiler which supports std::atomic, which does make these kinds of requirements on variables explicit.
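Under C++11, which shipped the std::atomic mentioned above, the fix is essentially a change to the variable's type; a minimal sketch:

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int> i(1);

void sleepingbeauty()
{
    // The atomic load may not be optimized away, and the store below is
    // guaranteed to become visible to this loop eventually; no volatile
    // or platform-specific barrier is needed.
    while (i.load()) std::this_thread::sleep_for(std::chrono::seconds(1));
}

void onedaymyprincewillcome()
{
    i.store(0); // the kiss: ends the loop in the other thread
}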
You assume wrong.
void onedaymyprincewillcome(int* i);

void sleepingbeauty()
{
    int i = 1;
    onedaymyprincewillcome(&i);
    while (i) sleep(1);
}
In this code, your compiler will load i from memory each time through the loop. Why? NOT because it thinks another thread could alter its value, but because it thinks that sleep could modify its value. It has nothing to do with whether or not i has an address (or must have one), and everything to do with the operations this thread performs that could modify the value.
In particular, it is not guaranteed that assigning to an int is even atomic, although this happens to be true on all platforms we use these days.
Too many things go wrong if you don't use the proper synchronization primitives for your threaded programs. For example,
char *str = 0;
asynch_get_string(&str);
while (!str)
    sleep(1);
puts(str);
This could (and even will, on some platforms) sometimes print out utter garbage and crash the program. It looks safe, but because you are not using the proper synchronization primitives, the change to str could be seen by your thread before the change to the memory location it refers to, even though the other thread initializes the string before setting the pointer.
So just don't, don't, don't do this kind of stuff. And no, volatile is not a fix.
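For contrast, a sketch of the same pattern made safe with C++11 atomics (names are illustrative): the release store publishes the string contents, and the acquire load guarantees the waiting thread sees them before it sees the pointer.

#include <atomic>
#include <cstdio>
#include <cstring>

static char buffer[16];
static std::atomic<char*> str(nullptr);

void asynch_get_string_fixed()
{
    std::strcpy(buffer, "hello");                 // write the data first...
    str.store(buffer, std::memory_order_release); // ...then publish the pointer
}

void waiter()
{
    char* s;
    while (!(s = str.load(std::memory_order_acquire)))
        ; // or sleep(1)
    puts(s); // guaranteed to print "hello": the acquire pairs with the release
}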
Summary: The basic problem is that the compiler only changes what order the instructions go in, and where the load and store operations go. This is not enough to guarantee thread safety in general, because the processor is free to change the order of loads and stores, and the order of loads and stores is not preserved between processors. In order to ensure things happen in the right order, you need memory barriers. You can either write the assembly yourself or you can use a mutex / semaphore / critical section / etc, which does the right thing for you.
While the C++98 and C++03 standards do not dictate a standard memory model that must be used by compilers, C++0x does, and you can read about it here: http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/c++mm.pdf
In the end, for C++98 and C++03, it's really up to the compiler and the hardware platform. Typically there will not be any memory barrier or fence-operation issued by the compiler for normally written code unless you use a compiler intrinsic or something from your OS's standard library for synchronization. Most mutex/semaphore implementations also include a built-in memory barrier operation to prevent speculative reads and writes across the locking and unlocking operations on the mutex by the CPU, as well as prevent any re-ordering of operations across the same read or write calls by the compiler.
Finally, as Billy points out in the comments, on Intel x86 and x86_64 platforms, any read or write operation in a single byte increment is atomic, as well as a read or write of a register value to any 4-byte aligned memory location on x86 and 4 or 8-byte aligned memory location on x86_64. On other platforms, that may not be the case and you would have to consult the platform's documentation.
The only control you have over optimisation is volatile.
Compilers make NO guarantee about concurrent threads accessing the same location at the same time. You will need some type of locking mechanism.
I can only speak for C, and since synchronization is a CPU-implemented functionality, a C programmer would need to call a library function for the OS containing access to the lock (the CriticalSection functions in the Windows NT engine) or implement something simpler (such as a spinlock) and access the functionality himself.
volatile is a good property to use at the module level. Sometimes a non-static (public) variable will work too.
Local (stack) variables will not be accessible from other threads, and should not be.
Variables at the module level are good candidates for access by multiple threads, but will require synchronization functions to work predictably.
Locks are unavoidable but they can be used more or less wisely resulting in a negligible or considerable performance penalty.
I answered a similar question here concerning unsynchronized threads but I think you'll be better off browsing on similar topics to get high-quality answers.
I'm not sure you understand the basics of the topic you claim to be discussing. Two threads, each starting at exactly the same time and looping one million times, each performing an inc on the same variable, will NOT result in a final value of two million (two * one million increments). The value will end up somewhere in between one and two million.
The first increment will cause the value to be read from RAM into the L1 cache (via first the L3, then the L2) of the accessing thread/core. The increment is performed and the new value written, initially to L1, for propagation to lower caches. When it reaches L3 (the highest cache common to both cores), the memory location will be invalidated in the other core's caches. This may seem safe, but in the meantime the other core has simultaneously performed an increment based on the same initial value in the variable. The invalidation from the first core's write will be superseded by the second core's write, invalidating the data in the caches of the first core.
Sounds like a mess? It is! The cores are so fast that what happens in the caches falls way behind: the cores are where the action is. This is why you need explicit locks: to make sure that the new value winds up low enough in the memory hierarchy such that other cores will read the new value and nothing else. Or put another way: slow things down so the caches can catch up with the cores.
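A small illustration of that lost-update effect (the plain increment is a data race, hence formally undefined behavior; it is shown only to demonstrate what the answer describes):

#include <atomic>
#include <iostream>
#include <thread>

int plain = 0;
std::atomic<int> atomic_counter(0);

void work()
{
    for (int n = 0; n < 1000000; ++n)
    {
        ++plain;          // data race: increments can be lost between cores
        ++atomic_counter; // atomic read-modify-write: every increment lands
    }
}

int main()
{
    std::thread a(work), b(work);
    a.join();
    b.join();
    std::cout << plain << '\n';          // typically between 1000000 and 2000000
    std::cout << atomic_counter << '\n'; // always 2000000
}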
A compiler does not "feel." A compiler is rule-based and, if constructed correctly, will optimize to the extent that the rules allow and the compiler writers are able to construct the optimizer. If a variable is volatile and the code is multi-threaded, the rules won't allow the compiler to skip a read. Simple as that, even though on the face of it it may appear devilishly tricky.
I'll have to repeat myself and say that locks cannot be implemented in a compiler because they are specific to the OS. The generated code will call all functions without knowing if they are empty, contain lock code or will trigger a nuclear explosion. In the same way the code will not be aware of a lock being in progress since the core will insert wait states until the lock request has resulted in the lock being in place. The lock is something that exists in the core and in the mind of the programmer. The code shouldn't (and doesn't!) care.
I'm writing this answer because most of the help came from comments to questions, and not always from the authors of the answers. I already upvoted the answers that helped me most, and I'm making this a community wiki to not abuse the knowledge of others. (If you want to upvote this answer, consider also upvoting Billy's and Dietrich's answers too: they were the most helpful authors to me.)
There are two problems to address when values written from a thread need to be visible from another thread:
Caching (a value written from a CPU could never make it to another CPU);
Optimizations (a compiler could optimize away the reads to a variable if it feels it can't be changed).
The first one is rather easy. On modern Intel processors, there is a concept of cache coherence, which means changes to a cache propagate to other CPU caches.
Turns out the optimization part isn't too hard either. As long as the compiler cannot guarantee that a function call won't change the content of a variable, even in a single-threaded model, it won't optimize the reads away. In my examples, the compiler doesn't know that sleep cannot change i, and this is why reads are issued at every iteration. It doesn't need to be sleep, though; any function for which the compiler doesn't have the implementation details would do. I suppose that a particularly well-suited function to use would be one that emits a memory barrier.
In the future, it's possible that compilers will have better knowledge of currently impenetrable functions. However, when that time comes, I expect that there will be standard ways to ensure that changes are propagated correctly. (This is coming with C++11 and the std::atomic<T> class. I don't know about C1x.)

Is this code thread-safe?

Let's say we have a thread-safe compare-and-swap function like
long CAS(long* Dest, long Val, long Cmp)
which compares *Dest and Cmp, copies Val to *Dest if the comparison is successful, and returns the original value of *Dest, all atomically.
So I would like to ask you if the code below is thread-safe.
while (true)
{
    long dummy = *DestVar;
    if (dummy == CAS(DestVar, Value, dummy))
    {
        break;
    }
}
EDIT:
The Dest and Val parameters are pointers to variables created on the heap.
InterlockedCompareExchange is an example of our CAS function.
Edit. An edit to the question means most of this isn't relevant. Still, I'll leave this as all the concerns in the C# case also carry to the C++ case, but the C++ case brings many more concerns as stated, so it's not entirely irrelevant.
Yes, but...
Assuming you mean that this CAS is atomic (which is the case with C# Interlocked.CompareExchange and with some things available in some C++ libraries), then it's thread-safe in and of itself.
However DestVar = Value could be thread-safe in and of itself too (it will be in C#, whether it is in C++ or not is implementation dependent).
In C# a write to an integer is guaranteed to be atomic. As such, doing DestVar = Value will not fail due to something happening in another thread. It's "thread-safe".
In C++ there are no such guarantees, but there are on some processors (in fact, let's just drop C++ for now, there's enough complexity when it comes to the stronger guarantees of C#, and C++ has all of those complexities and more when it comes to these sort of issues).
Now, the use of atomic CAS operations in themselves will always be "thread-safe", but this is not where the complexity of thread safety comes in. It's the thread-safety of combinations of operations that is important.
In your code, at each loop either the value will be atomically over-written, or it won't. In the case where it won't it'll try again and keep going until it does. It could end up spinning for a while, but it will eventually work.
And in doing so it will have exactly the same effect as simple assignment - including the possibility of messing with what's happening in another thread and causing a serious thread-related bug.
Take a look, for comparison, at the answer to Is this use of a static queue thread-safe? and the explanation of how it works. Note that in each case a CAS is either allowed to fail because its failure means another thread has done something "useful", or, when it's checked for success, more is done than just stopping the loop. It's combinations of CASs, each paying attention to the possible states caused by other operations, that allow for lock-free, wait-free code that is thread-safe.
And now we've done with that, note also that you couldn't port that directly to C++ (it depends on garbage collection to make some possible ABA scenarios of little consequence, with C++ there are situations where there could be memory leaks). It really does also matter which language you are talking about.
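To illustrate that point about combinations of operations, here is a sketch of a CAS loop that actually gains something over plain assignment: an atomic read-modify-write built on the InterlockedCompareExchange the question mentions (Win32; the helper name is an illustrative assumption):

#include <windows.h>

// Atomically add delta to *dest. Unlike the loop in the question, the
// retry matters here: each iteration recomputes the new value from the
// latest observed one, so no other thread's update is lost.
long atomic_add(long volatile* dest, long delta)
{
    for (;;)
    {
        long observed = *dest;
        long desired = observed + delta;
        // InterlockedCompareExchange returns the prior value of *dest;
        // the swap took place only if that value equals observed.
        if (InterlockedCompareExchange(dest, desired, observed) == observed)
            return desired;
    }
}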
It's impossible to tell, for any environment. You do not define the following:
What are the memory locations of DestVar and Value? On the heap or on the stack? If they are on the stack, then it is thread-safe, as no other thread can access that memory location.
If DestVar and Value are on the heap, then are they reference types or value types (i.e. do they have copy-by-assignment semantics)? If the latter, then it is thread-safe.
Does CAS synchronize access to itself? In other words, does it have some sort of mutual exclusion structure that allows only one call at a time? If so, then it is thread-safe.
If any of the conditions mentioned above are untrue, then it is indeterminable whether or not this is all thread safe. With more information about the conditions mentioned above (as well as whether or not this is C++ or C#, yes, it does matter) an answer can be provided.
Actually, this code is kind of broken. Either you need to know how the compiler is reading *DestVar (before or after CAS), which has wildly different semantics, or you are trying to spin on *DestVar until some other thread changes it. It's certainly not the former, since that would be crazy. If it's the latter, then you should use your original code. As it stands, your revision is not thread safe, since it isn't safe at all.

Do I need to protect read access to an STL container in a multithreading environment?

I have one std::list<> container and these threads:
One writer thread which adds elements indefinitely.
One reader/writer thread which reads and removes elements while available.
Several reader threads which access the SIZE of the container (by using the size() method)
There is a normal mutex which protects the access to the list from the first two threads. My question is, do the size reader threads need to acquire this mutex too? should I use a read/write mutex?
I'm in a windows environment using Visual C++ 6.
Update: It looks like the answer is not clear yet. To sum up the main doubt: do I still need to protect the SIZE reader threads even if they only call size() (which returns a simple variable), taking into account that I don't need the exact value (i.e. I can accept a +/- 1 variation)? And how could a race condition make my size() call return an invalid value (i.e. one totally unrelated to the good one)?
Answer: In general, the reader threads must be protected to avoid race conditions. Nevertheless, in my opinion, some of the questions stated above in the update haven't been answered yet.
Thanks in advance!
Thank you all for your answers!
Yes, the read threads will need some sort of mutex control, otherwise the write will change things from under it.
A reader/writer mutex should be enough. But strictly speaking this is an implementation-specific issue. It's possible that an implementation may have mutable members even in const objects that are read-only in your code.
Checkout the concurrent containers provided by Intel's Open Source Threading Building Blocks library. Look under "Container Snippets" on the Code Samples page for some examples. They have concurrent / thread-safe containers for vectors, hash maps and queues.
I don't believe the STL containers are thread-safe, as there isn't a good way to handle threads cross-platform. The call to size() is simple, but it still needs to be protected.
This sounds like a great place to use read-write locks.
You should consider that some STL implementations might calculate the size when size() is called.
To overcome this, you could define a new variable:
volatile unsigned int ContainerSize = 0;
Update the variable only inside the already-protected update calls; then you can read / test the variable without protection (taking into account that you don't need the exact value). See the sketch below.
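A sketch of that idea as a wrapper (names are hypothetical; std::atomic is shown instead of the volatile above, since volatile alone guarantees neither atomicity nor ordering):

#include <atomic>
#include <list>
#include <mutex>

template <typename T>
class CountedList
{
    std::list<T> m_list;
    std::mutex m_lock;
    std::atomic<unsigned> m_size;
public:
    CountedList() : m_size(0) {}

    // Writer thread: adds elements under the lock.
    void push_back(const T& v)
    {
        std::lock_guard<std::mutex> g(m_lock);
        m_list.push_back(v);
        m_size.store(static_cast<unsigned>(m_list.size()));
    }

    // Reader/writer thread: removes elements under the lock.
    bool pop_front(T& out)
    {
        std::lock_guard<std::mutex> g(m_lock);
        if (m_list.empty()) return false;
        out = m_list.front();
        m_list.pop_front();
        m_size.store(static_cast<unsigned>(m_list.size()));
        return true;
    }

    // Size-reader threads: no lock taken; the value may lag by +/- 1
    // but is never garbage.
    unsigned approx_size() const { return m_size.load(); }
};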
Yes. I would suggest wrapping your STL library with a class that enforces serial access. Or find a similar class that's already been debugged.
I'm going to say no. In this case.
Without a mutex, what you might find is that size() occasionally returns the wrong value as items are added or removed. If this is acceptable to you, go for it.
However, if you need the accurate size of the list at the moment the reader asks, you'll have to put a critical section around every size() call, in addition to the one you have around the add and erase calls.
PS. VC6's size() simply returns the _Size internal member, so not having a mutex won't be a problem with your particular implementation, except that it might return 1 while a second element is being added, or vice versa.
PPS. Someone mentioned an RW lock; this is a good thing, especially if you're tempted to access the list objects later. Changing your mutex to a boost::shared_mutex would be beneficial then. However, no mutex whatsoever is needed if all you're calling is size().
The STL in VC++ version 6 is not thread-safe, see this recent question. So it seems your question is a bit irrelevant. Even if you did everything right, you may still have problems.
Whether or not size() is safe (for the definition of "safe" that you provide) is implementation-dependent. Even if you are covered on your platform, for your version of your compiler, at your optimization level, with your version of the thread library and C runtime, please do not code this way. It will come back to byte you, and it will be hell to debug. You're setting yourself up for failure.