What is triple check locking pattern - c++

Reference:- "Modern C++ Design: Generic Programming and Design Patterns Applied" by Andrei Alexandrescu
Chapter 6 Implementing Singletons.
Even if you put volatile then also it is not guaranteed to make Double Check locking pattern safe and portable.Why it is so?
If some one can put any good link that explains what is relaxed memory model and what is exactly problem with Double Check pattern.{Or someone can explain}
I used to think volatile has solve the problem but seems its not correct until I read the book.

Even if you put volatile then also it is not guaranteed to make Double Check locking pattern safe and portable.Why it is so?
I'll try to provide some context.
There are three (Boehm & McLaren) portable use cases for volatile in C++ none of which has anything to do with multithreading. Alexandrescu did come up with a hack a long time back to cajole C++'s type system to help with multithreaded programming but that's that. See David Butenhof's reply on comp.programming.threads regarding the same.
The confusion regarding this qualifier stems from the fact that with some compilers (Intel's for example) volatile brings into play memory fence semantics -- which is what you need for safe multithreading. But this is not portable and should not be relied upon.
Further, most professional grade compilers fail to implement volatile flawlessly. See Regehr et al.
Confusion probably gathers much from the fact that Java, the other language, has completely different semantics for volatile and they do involve memory fences. However, note that even the Double Checked Locking is not free from issues.

Most of the time using a simple lock that you always check is fast enough for a singleton. You can always cache the singleton in a variable or field if you access it a lot.
Don’t think about “hacks” to avoid locking until you have proved that a simple lock is not fast enough. Writing bug free portable threading code is hard enough without making more problems for yourself.

I don't know explicitly what "triple checked locking" is but since almost a decade ago when the compiler writers and hardware engineers started to take control away from the programmers my lazy init has had 3 states:
enum eINITIALIZATION_STATE {FIRST_INITIALIZER=0,INITIALIZING,INITIALIZED} ;
( Just imagine if you sent out library software that was totally incompatible with important multi threaded algorithms in the field! But I have been writing code for SMP machines for over 30 years and the multi-core "revolution" (HA!) just really got rolling in the last decade. The compiler writers and HW engineers either didn't know what they were breaking or they decided it did not matter... But enough editorializing! )
Sooo...
enum eINITIALIZATION_STATE {FIRST_INITIALIZER=0,INITIALIZING,INITIALIZED} ;
then you have an initialization control variable:
static int init_state; //its 0 initialized 'cause static
then the idiom is:
IF(init_state == INITIALIZED)
return your pointer or what ever;//the "normal" post initialization path
ENDIF
IF(init_state == FIRST_INITIALIZER)
IF(compare_and_swap(&init_state,
INITIALIZING,
FIRST_INITIALIZER)==FIRST_INITIALIZER)
//first initializer race resolved - I'm first.
do your initialization here;
//And now resolve the race induced by the compiler writers and the HW guys
COMPILER_FENCE();//needs macro for portability
HARDWARE_FENCE();//needs macro for portability
//on intel using cas here supplies the HARDWARE_FENCE();
compare_and_swap(&init_state,INITIALIZER,INITIALIZING);
//now you can return your pointer or what ever or just fall through
ENDIF
ENDIF
DOWHILE(*const_cast<volatile const int*>(&init_state)!=INITIALIZED)
relinquish or spin;
ENDDO
return your pointer or what ever;
I am not sure if this is what was meant but because of the 3 states I suspect this may be equivalent (or at least similar) to whatever was meant by "triple checked locking" .

Related

Force two threads to access a global variable directly in memory?

I know that volatile in C++ does not have the same meaning as in Java, so if I'm writing a C++ application for Windows, how can I share a variable between two threads and not allowing for each thread to cache its own copy of the variable?
Does using critical sections solves this problem or does it only allows for atomicity?
Actually, on Visual Studio, volatile does have pretty much the same meaning as in Java (or C#). Or at least, it used to, and still does by default; see Microsoft's documentation for details.
That said, in standard C++, it is true that volatile means approximately nothing. Also, in standard terms, threads do not "cache" anything and your question is ill-formed. The relevant concepts are atomicity and ordering, the standard term for the latter being the "happens-before" relationship. Everything you need to design, implement, and reason about multi-threaded algorithms is captured in these concepts; the notion of "cache" has nothing to do with it.
Standard C++11 provides many mechanisms for enforcing atomicity and ordering. You will get a better answer if you ask a specific question about implementing a specific algorithm.
[Update, to clarify]
Note that I am not saying you are using the wrong terminology; I am saying you are using the wrong concepts.
The standard does not talk about "cached variables" using different words... It does not talk about cached variables at all. That is because the concept is neither necessary nor sufficient for reasoning about threads. You can know everything about caches and still be unable to analyze concurrent algorithms, and you can know nothing about caches and be able to analyze them perfectly.
Similarly, "accessing a variable directly" is not just the wrong way to talk; the very concept is meaningless in (standard) C++. The notion of "do it right now" means nothing when each thread is progressing at a different rate and observing state changes in a different order. In standard C++, there simply is no "access directly" or "right now"; there is only happens-before.
This is not an academic point. The wrong mental model for concurrency is almost guaranteed to lead to fuzzy thinking and sloppy, buggy code.
Your question really does have no answer as phrased. The right answer could be to use std::atomic or to use std::mutex or to use std::atomic_thread_fence, depending on exactly what it is you are actually trying to do. I am suggesting you ask a question that states clearly what that is.

When should you not use [[carries_dependency]]?

I've found questions (like this one) asking what [[carries_dependency]] does, and that's not what I'm asking here.
I want to know when you shouldn't use it, because the answers I've read all make it sound like you can plaster this code everywhere and magically you'd get equal or faster code. One comment said the code can be equal or slower, but the poster didn't elaborate.
I imagine appropriate places to use this is on any function return or parameter that is a pointer or reference and that will be passed or returned within the calling thread, and it shouldn't be used on callbacks or thread entry points.
Can someone comment on my understanding and elaborate on the subject in general, of when and when not to use it?
EDIT: I know there's this tome on the subject, should any other reader be interested; it may contain my answer, but I haven't had the chance to read through it yet.
In modern C++ you should generally not use std::memory_order_consume or [[carries_dependency]] at all. They're essentially deprecated while the committee comes up with a better mechanism that compilers can practically implement.
And that hopefully doesn't require sprinkling [[carries_dependency]] and kill_dependency all over the place.
2016-06 P0371R1: Temporarily discourage memory_order_consume
It is widely accepted that the current definition of memory_order_consume in the standard is not useful. All current compilers essentially map it to memory_order_acquire. The difficulties appear to stem both from the high implementation complexity, from the fact that the current definition uses a fairly general definition of "dependency", thus requiring frequent and inconvenient use of the kill_dependency call, and from the frequent need for [[carries_dependency]] annotations. Details can be found in e.g. P0098R0.
Notably that in C++ x - x still carries a dependency but most compilers would naturally break the dependency and replace that expression with a constant 0. But also compilers do sometimes turn data dependencies into control dependencies if they can prove something about value-ranges after a branch.
On modern compilers that just promote mo_consume to mo_acquire, fully aggressive optimizations can always happen; there's never anything to gain from [[carries_dependency]] and kill_dependency even in code that uses mo_consume, let alone in other code.
This strengthening to mo_acquire has potentially-significant performance cost (an extra barrier) for real use-cases like RCU on weakly-ordered ISAs like POWER and ARM. See this video of Paul E. McKenney's CppCon 2015 talk C++ Atomics: The Sad Story of memory_order_consume. (Link includes a summary).
If you want real dependency-ordering read-only performance, you have to "roll your own", e.g. by using mo_relaxed and checking the asm to verify it compiled to asm with a dependency. (Avoid doing anything "weird" with such a value, like passing it across functions.) DEC Alpha is basically dead and all other ISAs provide dependency ordering in asm without barriers, as long as the asm itself has a data dependency.
If you don't want to roll your own and live dangerously, it might not hurt to keep using mo_consume in "simple" use-cases where it should be able to work; perhaps some future mo_consume implementation will have the same name and work in a way that's compatible with C++11.
There is ongoing work on making a new consume, e.g. 2018's http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html
because the answers I've read all make it sound like you can plaster
this code everywhere and magically you'd get equal or faster code
The only way you can get faster code is when that annotation allows the omission of a fence.
So the only case where it could possibly be useful is:
your program uses consume ordering on an atomic load operation, in an important frequently executed code;
the "consume value" isn't just used immediately and locally, but also passed to other functions;
the target CPU gives specific guarantees for consuming operations (as strong as a given fence before that operation, just for that operation);
the compiler writers take their job seriously: they manage to translate high level language consuming of a value to CPU level consuming, to get the benefit from CPU guarantees.
That's a bunch of necessary conditions to possibly get measurably faster code.
(And the latest trend in the C++ community is to give up inventing a proper compiling scheme that's safe in all cases and to come up with a completely different way for the user to instruct the compiler to produce code that "consumes" values, with much more explicit, naively translatable, C++ code.)
One comment said the code can be equal or slower, but the poster
didn't elaborate.
Of course annotations of the kind that you can randomly put on programs simply cannot make code more efficient in general! That would be too easy and also self contradictory.
Either some annotation specifies a constrain on your code, that is a promise to the compiler, and you can't put it anywhere where it doesn't correspond an guarantee in the code (like noexcept in C++, restrict in C), or it would break code in various ways (an exception in a noexcept function stops the program, aliasing of restricted pointers can cause funny miscompilation and bad behavior (formerly the behavior is not defined in that case); then the compiler can use it to optimize the code in specific ways.
Or that annotation doesn't constrain the code in any way, and the compiler can't count on anything and the annotation does not create any more optimization opportunity.
If you get more efficient code in some cases at no cost of breaking program with an annotation then you must potentially get less efficient code in other cases. That's true in general and specifically true with consume semantic, which imposes the previously described constrained on translation of C++ constructs.
I imagine appropriate places to use this is on any function return or
parameter that is a pointer or reference and that will be passed or
returned within the calling thread
No, the one and only case where it might be useful is when the intended calling function will probably use consume memory order.

Do Fortran 95 constructs such as WHERE, FORALL and SPREAD generally result in faster parallel code?

I have read through the Fortran 95 book by Metcalf, Reid and Cohen, and Numerical Recipes in Fortran 90. They recommend using WHERE, FORALL and SPREAD amongst other things to avoid unnecessary serialisation of your program.
However, I stumbled upon this answer which claims that FORALL is good in theory, but pointless in practice - you might as well write loops as they parallelise just as well and you can explicitly parallelise them using OpenMP (or automatic features of some compilers such as Intel).
Can anyone verify from experience whether they have generally found these constructs to offer any advantages over explicit loops and if statements in terms of parallel performance?
And are there any other parallel features of the language which are good in principal but not worth it in practice?
I appreciate that the answers to these questions are somewhat implementation dependant, so I'm most interested in gfortran, Intel CPUs and SMP parallelism.
As I said in my answer to the other question, there is a general belief that FORALL has not been as useful as was hoped when it was introduced to the language. As already explained in other answers, it has restrictive requirements and a limited role, and compilers have become quite good at optimizing regular loops. Compilers keep getting better, and capabilities vary from compiler to compiler. Another clue is that the Fortran 2008 is trying again... besides adding explicit parallelization to the language (co-arrays, already mentioned), there is also "do concurrent", a new loop form that requires restrictions that should better allow the compiler to perform automatic parallization optimizations, yet should be sufficiently general to be useful -- see ftp://ftp.nag.co.uk/sc22wg5/N1701-N1750/N1729.pdf.
In terms of obtaining speed, mostly I select good algorithms and program for readability & maintainability. Only if the program is too slow do I locate the bottle necks and recode or implement multi-threading (OpenMP). It will be a rare case where FORALL or WHERE versus an explicit do loop will have a meaningful speed difference -- I'd look more to how clearly they state the intent of the program.
I've looked shallowly into this and, sad to report, generally find that writing my loops explicitly results in faster programs than the parallel constructs you write about. Even simple whole-array assignments such as A = 0 are generally outperformed by do-loops.
I don't have any data to hand and if I did it would be out of date. I really ought to pull all this into a test suite and try again, compilers do improve (sometimes they get worse too).
I do still use the parallel constructs, especially whole-array operations, when they are the most natural way to express what I'm trying to achieve. I haven't ever tested these constructs inside OpenMP workshare constructs. I really ought to.
FORALL is a generalised masked assignment statement (as is WHERE). It is not a looping construct.
Compilers can parallelise FORALL/WHERE using SIMD instructions (SSE2, SSE3 etc) and is very useful to get a bit of low-level parallelisation. Of course, some poorer compilers don't bother and just serialise the code as a loop.
OpenMP and MPI is more useful at a coarser level of granularity.
In theory, using such assignments lets the compiler know what you want to do and should allow it to optimize it better. In practice, see the answer from Mark... I also think it's useful if the code looks cleaner that way. I have used things such as FORALL myself a couple of times, but didn't notice any performance changes over regular DO loops.
As for optimization, what kind of parallellism do you intent to use? I very much dislike OpenMP, but I guess if you inted to use that, you should test these constructs first.
*This should be a comment, not an answer, but it won't fit into that little box, so I'm putting it here. Don't hold it against me :-) Anyways, to continue somewhat onto #steabert's comment on his answer. OpenMP and MPI are two different things; one rarely gets to choose between the two since it's more dictated by your architecture than personal choice. As far as learning concepts of paralellism go, I would recommend OpenMP any day; it is simpler and one easily continues the transition to MPI later on.
But, that's not what I wanted to say. This is - a few days back from now, Intel has announced that it has started supporting Co-Arrays, a F2008 feature previously only supported by g95. They're not intending to put down g95, but the fact remains that Intel's compiler is more widely used for production code, so this is definitely an interesting line of developemnt. They also changed some things in their Visual Fortran Compiler (the name, for a start :-)
More info after the link: http://software.intel.com/en-us/articles/intel-compilers/

In game programming are global variables bad?

I know my gut reaction to global variables is "badd!" but in the two game development courses I've taken at my college globals were used extensively, and now in the DirectX 9 game programming tutorial I am using (www.directxtutorial.com) I'm being told globals are okay in game programming ...? The site also recommends using only structs if you can when doing game programming to help keep things simple.
I'm really confused on this issue, and all the research I've been trying to do is very confusing. I realize there are issues when using global variables (threading issues, they make code harder to maintain, the state of them is hard to track etc) but also there is a cost associated with not using globals, I'd have to pass a loooot of information around very often which can be confusing and I imagine time-costing, although I guess pointers would speed the process up (this is my first time writing a game in C++.) Anyway, I realize there is probably no "right" or "wrong" answer here since both ways work, but I want my code to be as proper as I can so any input would be good, thank you very much!
The trouble with games and globals is that games (nowadays) are threaded at engine level. Game developers using an engine use the engine's abstractions rather than directly programming concurrency (IIRC). In many of the highlevel languages such as C++, threads sharing state is complex. When many concurrent processes share a common resource they have to make sure they don't tread on eachother's toes.
To get around this, you use concurrency control such as mutex and various locks. This in effect makes asynchronous critical sections of code access shared state in a synchronous manner for writing. The topic of concurrency control is too much to explain fully here.
Suffice to say, if threads run with global variables, it makes debugging very hard, as concurrency bugs are a nightmare (think, "which thread wrote that? Who holds that lock?").
There are exceptions in games programming API such as OpenGL and DX. If your shared data/globals are pointers to DX or OpenGL graphics contexts then typically this maps down to GPU operations which don't suffer so much from the same trouble.
Just be careful. Keeping objects representing 'player' or 'zombie' or whatever, and sharing them between threads can be tricky. Spawn 'player' threads and 'zombie group' threads instead and have a robust concurrency abstraction between them based on message passing rather than accessing those object's state across the thread/critical section boundary.
Saying all that, I do agree with the "Say no to globals" point made below.
For more on the complexities of threads and shared state see:
1 POSIX Threads API - I know it is POSIX, but provides a good idea that translates to other API
2 Wikipedia's excellent range of articles on concurrency control mechanisms
3 The Dining Philosopher's problem (and many others)
4 ThreadMentor tutorials and articles on threading
5 Another Intel article, but more of a marketing thing.
6 An ACM article on building multi-threaded game engines
Have worked on AAA game titles, I can tell you that globals should be eradicated immediately before they spread like a cancer. I've seen them corrupt an I/O subsystem so completely that it had to be wholly thrown out to be rewritten.
Say no to globals. Always.
In this respect, there's no difference between games and other programs. While arguably OK in small examples given in elementary courses, global variables are strongly discouraged in real programs.
So if you want to write correct, readable and maintainable code, stay away from global variables as much as possible.
All answers until now deal with the globals/threads issue, but I will add my 2 cents to the struct/class (understood as all public attributes/private attributes + methods) discussion. Preferring structs over classes on the grounds of that being simpler is in the same line of thought of preferring assembler over C++ on the grounds of that being a simpler language.
The fact that you have to think on how your entities are going to be used and provide methods for it makes the concrete entity a little more complex, but greatly simplifies the rest of the code and maintainability. The whole point of encapsulation is that it simplifies the program by providing clear ways in that your data can be modified while maintaining your objects invariants. You control the entry points and what can happen there. Having all attributes public imply that any part of the code can have a small innocent error (forgot to check condition X) and break your invariants completely (health below 0, but no 'death' processing being triggered)
The other common discussion is performance: If I just need to update a datum, then having to call a method will impact my performance. Not really. If methods are simple and you provide them in the header as inlines (inside the class body or outside with the inline keyword), the compiler will be able to copy those instructions to each use place. You get the guarantee that the compiler will not leave out any check by mistake, and no impact in performance.
Having read a bit more what you posted though:
I'm being told globals are okay in
game programming ...? The site also
recommends using only structs if you
can when doing game programming to
help keep things simple.
Games code is no different from other code really. Gratuitous use of globals is bad regardless. And as for 'only use structs', that is just a load of crap. Approach game development on the same principles as any other software - you may find places where you need to bend this but they should be the exception, typically when dealing with low-level hardware issues.
I would say that the advice about globals and 'keeping things simple' is probably a mixture of being easier to learn and old fashioned thinking about performance. When I was being taught about game programming I remember being told that C++ wasn't advised for games as it would be too slow but I've worked on multiple games using every facet of C++ which proves that isn't true.
I would add to everyone's answers here that globals are to be avoided where possible, I wouldn't be afraid to use whatever you need from C++ to make your code understandable and easy to use. If you come up against a performance problem then profile that specific issue and I'll bet that most of the time you won't need to remove use of a C++ feature but just think about your problem better. There may still be some platforms around that require pure C but I don't really have experience of them, even the Gameboy Advance seemed to deal with C++ quite nicely.
The metaissue here is that of state. How much state does any given function depend on, and how much does it change? Then consider how much state is implicit versus explicit, and cross that with the inspect vs. change.
If you have a function/method/whatever that is called DoStuff(), you have no idea from the outside what it depends on, what it needs, and what's going to happen to the shared state. If this is a class member, you also have no idea how that object's state is going to mutate. This is bad.
Contrast to something like cosf(2), this function is understood not to change any global state, and any state that it requires (lookup tables for example) are hidden from view and have no effect on your program-at-large. This is a function that computes a value based on what you give it and it returns that value. It changes no state.
Class member functions then have the opportunity to step up some problems. There's a huge difference between
myObject.hitpoints -= 4;
myObject.UpdateHealth();
and
myObject.TakeDamage(4);
In the first example, an external operation is changing some state that one of its member functions implicitly depends upon. The fact is that these two lines can be separated by many other lines of code begins to make it non-obvious what's going to happen in the UpdateHealth call, even if outside of the subtraction it is the same as the TakeDamage call. Encapsulating the state changes (in the second example) implies that the specifics of the state changes aren't important to the outside world, and hopefully they're not. In the first example, the state changes are explicitly important to the outside world, and this is really no different than setting some globals and calling a function that uses those globals. E.g. hopefully you'd never see
extern float value_to_sqrt;
value_to_sqrt = 2.0f;
sqrt(); // reads the global value_to_sqrt
extern float sqrt_value; // has the results of the sqrt.
And yet, how many people do exactly this sort of thing in other contexts? (Considering especially that class instance state is "global" in regards to that particular instance.)
So- prefer giving explicit instruction to your function calls, and prefer that they return the results directly rather than having to explicitly set state before calling a function and then checking other state after it returns.
The more state dependencies a bit of code has, the harder it'll be to make it multithread safe, but that has already been covered above. The point I want to make is that the problem isn't so much globals but more the visibility of the collection of state that is required for a bit of code to operate (and subsequently how much other code also depends on that state).
Most games aren't multi-threaded, although newer titles are going down that route, and so they've managed to get away with it so far.
Saying that globals are okay in games is like not bothering to fix the brakes on your car because you only drive at 10mph!
It's bad practice which ever way you look at it.
You only have to look at the number of bugs in games to see examples of this.
If in class A you need to access data D, instead of setting D global, you'd better put into A a reference to D.
Globals are NOT intrinsically bad. In C for instance they are part of the language's normal use... and since C++ builds on C they still have a place.
On an aesthetic level, it's better to avoid them where you can sensibly make them part of a class, but if all you do is wrap a bunch of globals into a singleton, you made things worse because at least with globals it's obvious what the point is.
Be careful, but for some things it makes less sense to force OO concepts on what is actually a global value.
Two specific issues that I've encountered in the past:
First: If you're attempting to separate e.g. render phase (const access to most game state) from logic phase (non-const access to most game state) for whatever reason (decoupling render rate from logic rate, synchronizing game state across a network, recording and playback of gameplay at a fixed point in the frame, etc), globals make it very hard to enforce that.
Problems tend to creep in and become hard to debug and eradicate. This also has implications for threaded renderers separate from game logic, or the like (the other answers cover this topic thoroughly).
Second: The presence of many globals tends to bloat the literal pool, which the compiler typically places after each function.
If you get to your state through either a single "struct GlobalTable" which holds globals or a collection of methods on an object or the like, your literal pool tends to be a lot smaller, decreasing the size of the .text section in your executable.
This is mostly a concern for instruction set architectures that can't embed load targets directly into instructions (see e.g. fixed-width ARM or Thumb version 1 instruction encoding on ARM processors). Even on modern processors I'd wager you'll get slightly smaller codegen.
It also hurts doubly when your instruction and data caches are separate (again, as on some ARM processors); you'll tend to get two cachelines loaded where you might only need one with a unified cache. Since the literal pool may count as data, and won't always start on a cacheline boundary, the same cacheline might have to be loaded both into the i-cache and d-cache.
This may count as a "microoptimization", but it's a transform that's relatively easily applied to an entire codebase (e.g. extern struct GlobalTable { /* ... */ } the; and then replace all g_foo with the.foo)... And we've seen code size decrease between 5 and 10 percent in some cases, with a corresponding boost to performance due to keeping the caches cleaner.
I'm going to take a risk and say it depends to me on the scope/scale of your project. Because if you are trying to program an epic game with a boatload of code, then globals could, and easily in the most painful way in hindsight, cost you way more time than they save.
But if you are trying to code, say, something as simple as Super Mario or Metroid or Contra or even simpler, like Pac-Man, then these games were originally coded in 6502 assembly and used globals for almost everything. Just about all the game data and state was stored in data segments, and that didn't stop the devs from shipping a very competent and popular product in spite of working with absolutely inferior tools and engineering standards which would probably horrify people today.
So if you are just writing this kind of small and simple game which has a very limited scope and isn't designed to grow and expand far beyond its original design, isn't designed to be maintained for years and years, with a few thousand lines of simple C++ code, then I don't see the big deal of using a global here or a singleton there. Someone obsessed with trying to engineer Super Mario with the soundest engineering techniques with SOLID and a DI framework could end up taking far, far longer to ship than even the devs who wrote it in 6502 asm.
And I'm getting old and there's something to it there when I look at these old simple games and how they were coded, and it almost seems like the devs were doing something right in spite of the hard-coded magic numbers and globals all over the place while I spend my career fumbling around and trying to figure out the best way to engineer things. That said this is probably a very unpopular opinion, and not one I would have liked either a decade or two ago, but there's something to it. I don't look at the 6502 asm of Metroid and think, "these devs underengineered their product and their lives would have been so much easier if they did this or that." Seems like they did things just about right.
But again this is for small-scale stuff, maybe in the indie category of games by today's standards, and in the smaller of the indie games among them, and far from doing anything ground-breaking in terms of how much data it can process or using cutting-edge hardware techniques. If in doubt, I'd definitely suggest to err on the side of avoiding globals. It's also a little bit trickier in C++ as opposed to say, C, since you can have objects with constructor and destructors, and initialization and destruction order isn't well-defined and easily predictable for global objects. There I'd say to lean even more on the side of avoiding globals since they can trip you up in whole new ways when you aren't explicitly initializing and destroying them yourself in a predictable order. And naturally if you want to multithread a lot beyond a critical loop here and there, then your ability to reason about thread-safety of any particular code will be severely diminished if you cannot minimize the scope/visibility of your game state to the minimum of places.

Multithreaded paranoia

This is a complex question, please consider carefully before answering.
Consider this situation. Two threads (a reader and a writer) access a single global int. Is this safe? Normally, I would respond without thought, yes!
However, it seems to me that Herb Sutter doesn't think so. In his articles on effective concurrency he discusses a flawed lock-free queue and the corrected version.
In the end of the first article and the beginning of the second he discusses a rarely considered trait of variables, write ordering. Int's are atomic, good, but ints aren't necessarily ordered which could destroy any lock-free algorithm, including my above scenario. I fully agree that the only way to guarantee correct multithreaded behavior on all platforms present and future is to use atomics(AKA memory barriers) or mutexes.
My question; is write re-odering ever a problem on real hardware? Or is the multithreaded paranoia just being pedantic?
What about classic uniprocessor systems?
What about simpler RISC processors like an embedded power-pc?
Clarification: I'm more interested in what Mr. Sutter said about the hardware (processor/cache) reordering variable writes. I can stop the optimizer from breaking code with compiler switches or hand inspection of the assembly post-compilation. However, I'd like to know if the hardware can still mess up the code in practice.
Your idea of inspecting the assembly is not good enough; the reordering can happen at the hardware level.
To answer your question "is this ever a problem on read hardware:" Yes! In fact I've run into that problem myself.
Is it OK to skirt the issue with uniprocessor systems or other special-case situations? I would argue "no" because five years from now you might need to run on multi-core after all, and then finding all these locations will be tricky (impossible?).
One exception: Software designed for embedded hardware applications where indeed you have completely control over the hardware. In fact I have "cheated" like this in those situations on e.g. an ARM processor.
Yup - use memory barriers to prevent instruction reordering where needed. In some C++ compilers, the volatile keyword has been expanded to insert implicit memory barriers for every read and write - but this isn't a portable solution. (Likewise with the Interlocked* win32 APIs). Vista even adds some new finer-grained Interlocked APIs which let you specify read or write semantics.
Unfortunately, C++ has such a loose memory model that any kind of code like this is going to be non-portable to some extent and you'll have to write different versions for different platforms.
Like you said, because of reordering done at cache or processor level, you actually do need some sort of memory barrier to ensure proper synchronisation, especially for multi-processors (and especially on non-x86 platforms). (I am given to believe that single-processor systems don't have these issues, but don't quote me on this---I'm certainly more inclined to play safe and do the synchronised access anyway.)
We have run into the problem, albeit on Itanium processors where the instruction reordering is more aggressive than x86/x64.
The fix was to use an Interlocked instruction since there was (at the time) no way of telling the compiler to simply but a write barrier after the assignment.
We really need language extension to deal with this cleanly. Use of volatile (if supported by the compiler) is too coarse grained for the cases where you are trying to squeeze as much performance out of a piece of code as possible.
is this ever a problem on real hardware?
Absolutely, particularly now with the move to multiple cores for current and future CPUs. If you're dependent on ordered atomicity to implement features in your application and you are unable to guarantee this requirement via your chosen platform or the use of synchronization primitives, under all conditions i.e. customer moves from a single-core CPU to multi-core CPU, then you are just waiting for a problem to occur.
Quoting from the referred to Herb Sutter article (second one)
Ordered atomic variables are spelled in different ways on popular platforms and environments. For example:
volatile in C#/.NET, as in volatile int.
volatile or * Atomic* in Java, as in volatile int, AtomicInteger.
atomic<T> in C++0x, the forthcoming ISO C++ Standard, as in atomic<int>.
I have not seen how C++0x implements ordered atomicity so I'm unable to specify whether the upcoming language feature is a pure library implementation or relies on changes to the language also. You could review the proposal to see if it can be incorporated as a non-standard extension to your current tool chain until the new standard is available, it may even be available already for your situation.
It is a problem on real hardware. A friend of mine works for IBM and makes his living primarily by sussing out this kind of problem in customers' codes.
If you want to see how bad things can get, search for academic papers on the Java Memory Model (and also now the C++ memory model). Given the reordering that real hardware can do, trying to figure out what's safe in a high-level language is a nightmare.
No this isn't safe and there is real hardware avaialble that exhibits this problem, for example the memory model in the powerpc chip on xbox 360 allows writes to be reordered. This is exacerbated by the lack of barriers in the intrinsics, see this article on msdn for more details.
The answer to the question" is it safe" is inherently ambiguous.
It's always safe, even for doubles, in the sense that your computer won't catch fire.
It's safe, in the sense that you always will get a value that the int held at some time in the past,
It's not safe, in the sense that you may get a value which is/will be updated by another thread.
"Atomic" means that you get the second guarantee. Since double usually is not atomic, you could get 32 old and 32 new bits. That's clearly unsafe.
When I asked the question I most interested in uniprocessor powerpc. In one of the comments InSciTek Jeff mentioned the powerpc SYNC and ISYNC instructions. Those where the key to a definitive answer. I found it here on IBM's site.
The article is large and pretty dense, but the take away is No, it is not safe. On older powerpc's the memory optimizers where not sophisticated enough to cause problems on a uniprocessor. However, the newer ones are much more aggressive, and can break even simple access to a global int.