Multithreaded paranoia - C++

This is a complex question; please consider it carefully before answering.
Consider this situation. Two threads (a reader and a writer) access a single global int. Is this safe? Normally, I would respond without a second thought: yes!
However, it seems to me that Herb Sutter doesn't think so. In his Effective Concurrency articles he discusses a flawed lock-free queue and the corrected version.
At the end of the first article and the beginning of the second, he discusses a rarely considered trait of variables: write ordering. Ints are atomic, which is good, but ints aren't necessarily ordered, which could destroy any lock-free algorithm, including my scenario above. I fully agree that the only way to guarantee correct multithreaded behavior on all platforms, present and future, is to use atomics (a.k.a. memory barriers) or mutexes.
My question: is write reordering ever a problem on real hardware? Or is the multithreaded paranoia just being pedantic?
What about classic uniprocessor systems?
What about simpler RISC processors like an embedded PowerPC?
Clarification: I'm more interested in what Mr. Sutter said about the hardware (processor/cache) reordering variable writes. I can stop the optimizer from breaking code with compiler switches or by hand-inspecting the assembly after compilation. However, I'd like to know if the hardware can still mess up the code in practice.
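For concreteness, here is a minimal sketch of the pattern I have in mind (the names data/ready are made up, and the point is precisely that nothing here is guaranteed to work):

#include <cstdio>
#include <thread>

int data  = 0;    // plain globals: no atomics, no locks
int ready = 0;

void writer()
{
    data  = 42;   // (1) produce the value
    ready = 1;    // (2) signal "data is valid"
}

void reader()
{
    while (ready == 0) { }   // spin until the flag is observed
    // If the compiler or the CPU reorders (1) and (2), or the reader's two
    // loads, this can print 0 instead of 42. (It is also a data race, so in
    // standard terms the behaviour is undefined outright.)
    std::printf("%d\n", data);
}

int main()
{
    std::thread w(writer), r(reader);
    w.join();
    r.join();
}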

Your idea of inspecting the assembly is not good enough; the reordering can happen at the hardware level.
To answer your question "is this ever a problem on real hardware?": Yes! In fact, I've run into that problem myself.
Is it OK to skirt the issue with uniprocessor systems or other special-case situations? I would argue "no" because five years from now you might need to run on multi-core after all, and then finding all these locations will be tricky (impossible?).
One exception: software designed for embedded hardware applications where you have complete control over the hardware. In fact, I have "cheated" like this in those situations, e.g. on an ARM processor.

Yup - use memory barriers to prevent instruction reordering where needed. In some C++ compilers, the volatile keyword has been expanded to insert implicit memory barriers for every read and write - but this isn't a portable solution. (Likewise with the Interlocked* win32 APIs). Vista even adds some new finer-grained Interlocked APIs which let you specify read or write semantics.
Unfortunately, C++ has such a loose memory model that any kind of code like this is going to be non-portable to some extent and you'll have to write different versions for different platforms.
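As a sketch of the Win32 route (illustrative names only; the full-barrier behaviour of the plain Interlocked functions is a Windows guarantee, not a C++ one):

#include <windows.h>

long data = 0;
volatile long ready = 0;

void writer()
{
    data = 42;
    // Plain InterlockedExchange acts as a full memory barrier on Windows,
    // so the store to data cannot be reordered past the store to ready.
    InterlockedExchange(&ready, 1);
}

long reader()
{
    // A no-op compare-exchange reads the flag with barrier semantics; the
    // Vista-era *Acquire/*Release variants allow finer-grained ordering.
    while (InterlockedCompareExchange(&ready, 0, 0) == 0) { /* spin */ }
    return data;
}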

Like you said, because of reordering done at cache or processor level, you actually do need some sort of memory barrier to ensure proper synchronisation, especially for multi-processors (and especially on non-x86 platforms). (I am given to believe that single-processor systems don't have these issues, but don't quote me on this---I'm certainly more inclined to play safe and do the synchronised access anyway.)

We have run into the problem, albeit on Itanium processors, where the instruction reordering is more aggressive than on x86/x64.
The fix was to use an Interlocked instruction, since there was (at the time) no way of telling the compiler to simply put a write barrier after the assignment.
We really need language extensions to deal with this cleanly. Use of volatile (if supported by the compiler) is too coarse-grained for cases where you are trying to squeeze as much performance out of a piece of code as possible.
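(For reference, the kind of extension we wanted eventually arrived with C++11; a sketch of the "write barrier after the assignment" idiom in those terms, with made-up names, looks like this:)

#include <atomic>

int payload = 0;                   // ordinary, non-atomic data
std::atomic<int> published{0};     // flag used to publish it

void producer()
{
    payload = 42;
    // Release fence: all earlier writes become visible before the
    // following (relaxed) store of the flag.
    std::atomic_thread_fence(std::memory_order_release);
    published.store(1, std::memory_order_relaxed);
}

int consumer()
{
    while (published.load(std::memory_order_relaxed) == 0) { /* spin */ }
    // Acquire fence: pairs with the release fence in producer().
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;                // guaranteed to observe 42
}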

is this ever a problem on real hardware?
Absolutely, particularly now with the move to multiple cores for current and future CPUs. If you depend on ordered atomicity to implement features in your application and you cannot guarantee this requirement, via your chosen platform or the use of synchronization primitives, under all conditions (e.g. the customer moves from a single-core CPU to a multi-core CPU), then you are just waiting for a problem to occur.
Quoting from the Herb Sutter article referred to above (the second one):
Ordered atomic variables are spelled in different ways on popular platforms and environments. For example:
volatile in C#/.NET, as in volatile int.
volatile or Atomic* in Java, as in volatile int, AtomicInteger.
atomic<T> in C++0x, the forthcoming ISO C++ Standard, as in atomic<int>.
I have not seen how C++0x implements ordered atomicity, so I'm unable to say whether the upcoming language feature is a pure library implementation or also relies on changes to the language. You could review the proposal to see if it can be incorporated as a non-standard extension to your current tool chain until the new standard is available; it may even already be available for your situation.
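(For what it's worth, the feature ended up as std::atomic<T> in the <atomic> header, backed by memory-model wording in the core language; a minimal sketch of its use, which is sequentially consistent by default:)

#include <atomic>

std::atomic<int> flag{0};    // an ordered atomic variable

void writer()
{
    flag.store(1);           // seq_cst store by default
    // or simply: flag = 1;
}

int reader()
{
    return flag.load();      // seq_cst load by default
}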

It is a problem on real hardware. A friend of mine works for IBM and makes his living primarily by sussing out this kind of problem in customers' codes.
If you want to see how bad things can get, search for academic papers on the Java Memory Model (and also now the C++ memory model). Given the reordering that real hardware can do, trying to figure out what's safe in a high-level language is a nightmare.

No, this isn't safe, and there is real hardware available that exhibits this problem; for example, the memory model of the PowerPC chip in the Xbox 360 allows writes to be reordered. This is exacerbated by the lack of barriers in the intrinsics; see this article on MSDN for more details.

The answer to the question "is it safe?" is inherently ambiguous.
It's always safe, even for doubles, in the sense that your computer won't catch fire.
It's safe, in the sense that you will always get a value that the int held at some time in the past.
It's not safe, in the sense that you may get a value which is/will be updated by another thread.
"Atomic" means that you get the second guarantee. Since double usually is not atomic, you could get 32 old bits and 32 new bits. That's clearly unsafe.

When I asked the question I was most interested in uniprocessor PowerPC. In one of the comments InSciTek Jeff mentioned the PowerPC SYNC and ISYNC instructions. Those were the key to a definitive answer, which I found here on IBM's site.
The article is large and pretty dense, but the takeaway is: no, it is not safe. On older PowerPCs the memory optimizers were not sophisticated enough to cause problems on a uniprocessor. However, the newer ones are much more aggressive and can break even simple access to a global int.

Related

Are there proposals for modeling cache in standard C++? Or any plan?

As I learn more and more about standard C++, I see more and more speakers, authors, and bloggers emphasize the importance of cache hits for performant programs. Yet I haven't seen any effort, in the standard or in any proposal, to deal with this issue, apart from the usual suggestion of "use vectors, because the memory is contiguous".
My observation may well be biased, and of course different hardware platforms have different memory hierarchies; PCs and embedded systems are totally different worlds (my experience is with PCs only). Striving to be portable and avoiding assumptions that would restrict the use cases is a core philosophy of C++. But cache use is too important a topic to leave unaddressed. And, in my primitive understanding, as multicore becomes (or already is) the main hardware platform programs run on, cache utilization becomes even more important.
So, does anyone know if there is any plan to address this topic? Or should it not be addressed in the standard at all because it is an implementation-level problem?
Thank you.

Force two threads to access a global variable directly in memory?

I know that volatile in C++ does not have the same meaning as in Java, so if I'm writing a C++ application for Windows, how can I share a variable between two threads without allowing each thread to cache its own copy of the variable?
Does using critical sections solve this problem, or does it only provide atomicity?
Actually, on Visual Studio, volatile does have pretty much the same meaning as in Java (or C#). Or at least, it used to, and still does by default; see Microsoft's documentation for details.
That said, in standard C++, it is true that volatile means approximately nothing. Also, in standard terms, threads do not "cache" anything and your question is ill-formed. The relevant concepts are atomicity and ordering, the standard term for the latter being the "happens-before" relationship. Everything you need to design, implement, and reason about multi-threaded algorithms is captured in these concepts; the notion of "cache" has nothing to do with it.
Standard C++11 provides many mechanisms for enforcing atomicity and ordering. You will get a better answer if you ask a specific question about implementing a specific algorithm.
[Update, to clarify]
Note that I am not saying you are using the wrong terminology; I am saying you are using the wrong concepts.
The standard does not talk about "cached variables" using different words... It does not talk about cached variables at all. That is because the concept is neither necessary nor sufficient for reasoning about threads. You can know everything about caches and still be unable to analyze concurrent algorithms, and you can know nothing about caches and be able to analyze them perfectly.
Similarly, "accessing a variable directly" is not just the wrong way to talk; the very concept is meaningless in (standard) C++. The notion of "do it right now" means nothing when each thread is progressing at a different rate and observing state changes in a different order. In standard C++, there simply is no "access directly" or "right now"; there is only happens-before.
This is not an academic point. The wrong mental model for concurrency is almost guaranteed to lead to fuzzy thinking and sloppy, buggy code.
Your question really does have no answer as phrased. The right answer could be to use std::atomic or to use std::mutex or to use std::atomic_thread_fence, depending on exactly what it is you are actually trying to do. I am suggesting you ask a question that states clearly what that is.
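As an illustration only (the names are made up, and which tool is right depends on the actual algorithm), the two most common answers for "share an int between two threads" in C++11 look like this:

#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>

// Option 1: an atomic -- atomicity plus ordering, no lock needed.
std::atomic<int> counter{0};

// Option 2: a plain int protected by a mutex.
int value = 0;
std::mutex value_mutex;

int main()
{
    auto work = [] {
        counter.fetch_add(1);                        // atomic increment
        std::lock_guard<std::mutex> lock(value_mutex);
        ++value;                                     // mutex-protected increment
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << counter.load() << ' ' << value << '\n';   // prints "2 2"
}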

When should you not use [[carries_dependency]]?

I've found questions (like this one) asking what [[carries_dependency]] does, and that's not what I'm asking here.
I want to know when you shouldn't use it, because the answers I've read all make it sound like you can plaster this code everywhere and magically you'd get equal or faster code. One comment said the code can be equal or slower, but the poster didn't elaborate.
I imagine the appropriate places to use this are on any function return or parameter that is a pointer or reference and that will be passed or returned within the calling thread, and that it shouldn't be used on callbacks or thread entry points.
Can someone comment on my understanding and elaborate on the subject in general, of when and when not to use it?
EDIT: I know there's this tome on the subject, should any other reader be interested; it may contain my answer, but I haven't had the chance to read through it yet.
In modern C++ you should generally not use std::memory_order_consume or [[carries_dependency]] at all. They're essentially deprecated while the committee comes up with a better mechanism that compilers can practically implement.
And that hopefully doesn't require sprinkling [[carries_dependency]] and kill_dependency all over the place.
2016-06 P0371R1: Temporarily discourage memory_order_consume
It is widely accepted that the current definition of memory_order_consume in the standard is not useful. All current compilers essentially map it to memory_order_acquire. The difficulties appear to stem both from the high implementation complexity, from the fact that the current definition uses a fairly general definition of "dependency", thus requiring frequent and inconvenient use of the kill_dependency call, and from the frequent need for [[carries_dependency]] annotations. Details can be found in e.g. P0098R0.
Notably, in C++ x - x still carries a dependency, but most compilers would naturally break the dependency and replace that expression with a constant 0. Compilers also sometimes turn data dependencies into control dependencies if they can prove something about value ranges after a branch.
On modern compilers that just promote mo_consume to mo_acquire, fully aggressive optimizations can always happen; there's never anything to gain from [[carries_dependency]] and kill_dependency even in code that uses mo_consume, let alone in other code.
This strengthening to mo_acquire has potentially-significant performance cost (an extra barrier) for real use-cases like RCU on weakly-ordered ISAs like POWER and ARM. See this video of Paul E. McKenney's CppCon 2015 talk C++ Atomics: The Sad Story of memory_order_consume. (Link includes a summary).
If you want real dependency-ordering read-only performance, you have to "roll your own", e.g. by using mo_relaxed and checking the asm to verify it compiled to asm with a dependency. (Avoid doing anything "weird" with such a value, like passing it across functions.) DEC Alpha is basically dead and all other ISAs provide dependency ordering in asm without barriers, as long as the asm itself has a data dependency.
If you don't want to roll your own and live dangerously, it might not hurt to keep using mo_consume in "simple" use-cases where it should be able to work; perhaps some future mo_consume implementation will have the same name and work in a way that's compatible with C++11.
There is ongoing work on making a new consume, e.g. 2018's http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html
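To make the discussion concrete, here is a sketch of the canonical consume use case: publishing a pointer and reading through it (the names are illustrative). On today's compilers the consume load is simply strengthened to acquire:

#include <atomic>

struct Config { int timeout; };

std::atomic<Config*> g_config{nullptr};

void publisher()
{
    Config* c = new Config{42};
    g_config.store(c, std::memory_order_release);    // publish the pointer
}

int consumer()
{
    // Intended meaning: only reads that are data-dependent on c need to be
    // ordered after the load, which is free on POWER/ARM. In practice,
    // compilers currently treat this exactly like memory_order_acquire.
    Config* c = g_config.load(std::memory_order_consume);
    return c ? c->timeout : -1;    // dependent load: sees 42 if c != nullptr
}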
because the answers I've read all make it sound like you can plaster
this code everywhere and magically you'd get equal or faster code
The only way you can get faster code is when that annotation allows the omission of a fence.
So the only case where it could possibly be useful is:
your program uses consume ordering on an atomic load operation, in important, frequently executed code;
the "consume value" isn't just used immediately and locally, but also passed to other functions;
the target CPU gives specific guarantees for consuming operations (as strong as a given fence before that operation, just for that operation);
the compiler writers take their job seriously: they manage to translate high level language consuming of a value to CPU level consuming, to get the benefit from CPU guarantees.
That's a bunch of necessary conditions to possibly get measurably faster code.
(And the latest trend in the C++ community is to give up inventing a proper compiling scheme that's safe in all cases and to come up with a completely different way for the user to instruct the compiler to produce code that "consumes" values, with much more explicit, naively translatable, C++ code.)
One comment said the code can be equal or slower, but the poster
didn't elaborate.
Of course, annotations of the kind that you can randomly sprinkle over a program simply cannot make code more efficient in general! That would be too easy, and also self-contradictory.
Either an annotation specifies a constraint on your code, that is, a promise to the compiler, and you can't put it anywhere it doesn't correspond to a guarantee in the code (like noexcept in C++ or restrict in C), or it would break the code in various ways (an exception in a noexcept function stops the program; aliasing of restricted pointers can cause strange miscompilation and bad behavior, formally undefined in that case); the compiler can then use it to optimize the code in specific ways.
Or the annotation doesn't constrain the code in any way, in which case the compiler can't count on anything and the annotation does not create any additional optimization opportunities.
If an annotation gives you more efficient code in some cases at no risk of breaking the program, then you must potentially get less efficient code in other cases. That's true in general, and specifically true of consume semantics, which impose the previously described constraints on the translation of C++ constructs.
I imagine appropriate places to use this is on any function return or
parameter that is a pointer or reference and that will be passed or
returned within the calling thread
No, the one and only case where it might be useful is when the intended calling function will probably use consume memory order.
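For completeness, a sketch of what the annotation looks like syntactically in that one case (illustrative code; as discussed above, compilers that promote consume to acquire gain nothing from it):

#include <atomic>

struct Node { int value; };

std::atomic<Node*> head{nullptr};

// The attribute asserts that the dependency from a consume load flows in
// through this parameter, so the compiler need not insert an acquire fence
// at the call boundary to preserve the ordering of n->value.
int read_value(Node* n [[carries_dependency]])
{
    return n->value;    // dependent load
}

int reader()
{
    Node* n = head.load(std::memory_order_consume);
    return n ? read_value(n) : 0;
}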

Do Fortran 95 constructs such as WHERE, FORALL and SPREAD generally result in faster parallel code?

I have read through the Fortran 95 book by Metcalf, Reid and Cohen, and Numerical Recipes in Fortran 90. They recommend using WHERE, FORALL and SPREAD amongst other things to avoid unnecessary serialisation of your program.
However, I stumbled upon this answer which claims that FORALL is good in theory, but pointless in practice - you might as well write loops as they parallelise just as well and you can explicitly parallelise them using OpenMP (or automatic features of some compilers such as Intel).
Can anyone verify from experience whether they have generally found these constructs to offer any advantages over explicit loops and if statements in terms of parallel performance?
And are there any other parallel features of the language which are good in principle but not worth it in practice?
I appreciate that the answers to these questions are somewhat implementation-dependent, so I'm most interested in gfortran, Intel CPUs and SMP parallelism.
As I said in my answer to the other question, there is a general belief that FORALL has not been as useful as was hoped when it was introduced to the language. As already explained in other answers, it has restrictive requirements and a limited role, and compilers have become quite good at optimizing regular loops. Compilers keep getting better, and capabilities vary from compiler to compiler. Another clue is that Fortran 2008 is trying again: besides adding explicit parallelization to the language (co-arrays, already mentioned), there is also "do concurrent", a new loop form whose restrictions should better allow the compiler to perform automatic parallelization optimizations, yet should be sufficiently general to be useful -- see ftp://ftp.nag.co.uk/sc22wg5/N1701-N1750/N1729.pdf.
In terms of obtaining speed, mostly I select good algorithms and program for readability and maintainability. Only if the program is too slow do I locate the bottlenecks and recode or implement multi-threading (OpenMP). It will be a rare case where FORALL or WHERE versus an explicit do loop makes a meaningful speed difference -- I'd look more at how clearly they state the intent of the program.
I've looked shallowly into this and, sad to report, generally find that writing my loops explicitly results in faster programs than the parallel constructs you write about. Even simple whole-array assignments such as A = 0 are generally outperformed by do-loops.
I don't have any data to hand and if I did it would be out of date. I really ought to pull all this into a test suite and try again, compilers do improve (sometimes they get worse too).
I do still use the parallel constructs, especially whole-array operations, when they are the most natural way to express what I'm trying to achieve. I haven't ever tested these constructs inside OpenMP workshare constructs. I really ought to.
FORALL is a generalised masked assignment statement (as is WHERE). It is not a looping construct.
Compilers can parallelise FORALL/WHERE using SIMD instructions (SSE2, SSE3, etc.), which is very useful for getting a bit of low-level parallelisation. Of course, some poorer compilers don't bother and just serialise the code as a loop.
OpenMP and MPI are more useful at a coarser level of granularity.
In theory, using such assignments lets the compiler know what you want to do and should allow it to optimize it better. In practice, see the answer from Mark... I also think it's useful if the code looks cleaner that way. I have used things such as FORALL myself a couple of times, but didn't notice any performance changes over regular DO loops.
As for optimization, what kind of parallelism do you intend to use? I very much dislike OpenMP, but I guess if you intend to use it, you should test these constructs first.
This should be a comment, not an answer, but it won't fit into that little box, so I'm putting it here. Don't hold it against me :-) Anyway, to continue somewhat from #steabert's comment on his answer: OpenMP and MPI are two different things; one rarely gets to choose between the two, since the choice is dictated more by your architecture than by personal preference. As far as learning the concepts of parallelism goes, I would recommend OpenMP any day; it is simpler, and one can easily make the transition to MPI later on.
But that's not what I wanted to say. This is: a few days ago, Intel announced that it has started supporting co-arrays, an F2008 feature previously supported only by g95. They're not intending to put down g95, but the fact remains that Intel's compiler is more widely used for production code, so this is definitely an interesting line of development. They also changed some things in their Visual Fortran compiler (the name, for a start :-)
More info after the link: http://software.intel.com/en-us/articles/intel-compilers/

What is the triple-checked locking pattern?

Reference:- "Modern C++ Design: Generic Programming and Design Patterns Applied" by Andrei Alexandrescu
Chapter 6 Implementing Singletons.
Even if you use volatile, it is still not guaranteed to make the double-checked locking pattern safe and portable. Why is that so?
Can someone post a good link that explains what a relaxed memory model is and what exactly the problem with the double-checked locking pattern is? (Or explain it here.)
I used to think volatile solved the problem, but it seems that's not correct, as I found out when I read the book.
Even if you use volatile, it is still not guaranteed to make the double-checked locking pattern safe and portable. Why is that so?
I'll try to provide some context.
There are three (Boehm & McLaren) portable use cases for volatile in C++, none of which has anything to do with multithreading. Alexandrescu did come up with a hack a long time back to cajole C++'s type system into helping with multithreaded programming, but that's that. See David Butenhof's reply on comp.programming.threads regarding the same.
The confusion regarding this qualifier stems from the fact that with some compilers (Intel's for example) volatile brings into play memory fence semantics -- which is what you need for safe multithreading. But this is not portable and should not be relied upon.
Further, most professional grade compilers fail to implement volatile flawlessly. See Regehr et al.
Much of the confusion probably also stems from the fact that Java, the other language, has completely different semantics for volatile, and those do involve memory fences. However, note that even then double-checked locking is not free of issues.
Most of the time using a simple lock that you always check is fast enough for a singleton. You can always cache the singleton in a variable or field if you access it a lot.
Don’t think about “hacks” to avoid locking until you have proved that a simple lock is not fast enough. Writing bug free portable threading code is hard enough without making more problems for yourself.
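In C++11 and later you don't even have to write the lock yourself; a function-local static or std::call_once gives you race-free lazy initialization. A minimal sketch (class and names invented for illustration):

#include <mutex>

class Singleton {
public:
    // C++11 guarantees thread-safe initialization of local statics
    // ("magic statics"), so this alone is a correct lazy singleton.
    static Singleton& instance()
    {
        static Singleton s;
        return s;
    }
private:
    Singleton() = default;
};

// The same idea spelled out with std::call_once, for an arbitrary object:
struct Widget { int value = 42; };

Widget* g_widget = nullptr;
std::once_flag g_widget_once;

Widget& widget_instance()
{
    std::call_once(g_widget_once, [] { g_widget = new Widget(); });
    return *g_widget;
}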
I don't know exactly what "triple-checked locking" is, but since almost a decade ago, when the compiler writers and hardware engineers started to take control away from the programmers, my lazy init has had three states:
enum eINITIALIZATION_STATE {FIRST_INITIALIZER=0,INITIALIZING,INITIALIZED} ;
( Just imagine if you sent out library software that was totally incompatible with important multi-threaded algorithms in the field! But I have been writing code for SMP machines for over 30 years, and the multi-core "revolution" (HA!) only really got rolling in the last decade. The compiler writers and HW engineers either didn't know what they were breaking or they decided it did not matter... But enough editorializing! )
Sooo...
enum eINITIALIZATION_STATE {FIRST_INITIALIZER=0,INITIALIZING,INITIALIZED} ;
then you have an initialization control variable:
static int init_state; // it's zero-initialized because it's static
then the idiom is:
IF (init_state == INITIALIZED)
    return your pointer or whatever;   // the "normal" post-initialization path
ENDIF
IF (init_state == FIRST_INITIALIZER)
    IF (compare_and_swap(&init_state,
                         INITIALIZING,
                         FIRST_INITIALIZER) == FIRST_INITIALIZER)
        // first-initializer race resolved - I'm first
        do your initialization here;
        // And now resolve the race induced by the compiler writers and the HW guys
        COMPILER_FENCE();   // needs a macro for portability
        HARDWARE_FENCE();   // needs a macro for portability
        // on Intel, using CAS here supplies the HARDWARE_FENCE()
        compare_and_swap(&init_state, INITIALIZED, INITIALIZING);
        // now you can return your pointer or whatever, or just fall through
    ENDIF
ENDIF
DOWHILE (*const_cast<volatile const int*>(&init_state) != INITIALIZED)
    relinquish or spin;
ENDDO
return your pointer or whatever;
I am not sure if this is what was meant, but because of the three states I suspect this may be equivalent (or at least similar) to whatever was meant by "triple-checked locking".
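(For comparison, a sketch of roughly the same three-state idiom written with C++11 std::atomic, which supplies the compiler and hardware fences for you; the names and the int payload are mine, not the original poster's:)

#include <atomic>
#include <thread>

enum eINITIALIZATION_STATE { FIRST_INITIALIZER = 0, INITIALIZING, INITIALIZED };

std::atomic<int> init_state{FIRST_INITIALIZER};
int* the_object = nullptr;    // whatever is being lazily created

int* get_object()
{
    if (init_state.load(std::memory_order_acquire) == INITIALIZED)
        return the_object;                        // normal post-init fast path

    int expected = FIRST_INITIALIZER;
    if (init_state.compare_exchange_strong(expected, INITIALIZING,
                                           std::memory_order_acq_rel)) {
        the_object = new int(42);                 // do the real initialization
        init_state.store(INITIALIZED, std::memory_order_release);
        return the_object;
    }

    // Another thread won the race; wait for it to finish initializing.
    while (init_state.load(std::memory_order_acquire) != INITIALIZED)
        std::this_thread::yield();
    return the_object;
}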