Noise with multi-threaded raytracer

Noise with multi-threaded raytracer - c++

This is my first multi-threaded implementation, so it's probably a beginners mistake. The threads handle the rendering of every second row of pixels (so all rendering is handled within each thread). The problem persists if the threads render the upper and lower parts of the screen respectively.
Both threads read from the same variables, can this cause any problems? From what I've understood only writing can cause concurrency problems...
Can calling the same functions cause any concurrency problems? And again, from what I've understood this shouldn't be a problem...
The only time both threads write to the same variable is when saving the calculated pixel color. This is stored in an array, but they never write to the same indices in that array. Can this cause a problem?
Multi-threaded rendered image
(Spam prevention stops me from posting images directly..)
Ps. I use the exactly same implementation in both cases, the ONLY difference is a single vs. two threads created for the rendering.

Both threads read from the same variables, can this cause any problems? From what I've understood only writing can cause concurrency problems...
This should be ok. Obviously, as long the data is initialized before the two threads start reading and destroyed after both threads have finished.
Can calling the same functions cause any concurrency problems? And again, from what I've understood this shouldn't be a problem...
Yes and no. Too hard to tell without the code. What does the function do? Does it rely on shared state (e.g. static variables, global variables, singletons...)? If yes, then this is definitely a problem. If there is never any shared state, then you're ok.
The only time both threads write to the same variable is when saving the calculated pixel color. This is stored in an array, but they never write to the same indices in that array. Can this cause a problem?
Maybe sometimes. An array of what? It's probably safe if sizeof(element) == sizeof(void*), but the C++ standard is mute on multithreading, so it doesn't force your compiler to force your hardware to make this safe. It's possible that your platform could be biting you here (e.g. 64bit machine and one thread writing 32bits which might overwrite an adjacent 32bit value), but this isn't an uncommon pattern. Usually you're better off using synchronization to be sure.
You can solve this in a couple of ways:
Each thread builds its own data, then it is aggregated when they complete.
You can protect the shared data with a mutex.
The lack of commitment in my answers are what make multi-threaded programming hard :P
For example, from Intel® 64 and IA-32 Architectures Software Developer's Manuals, describes how different platforms gaurantee different levels of atomicity:
7.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer
processors since) guarantees that the
following basic memory operations will
always be carried out atomically:
Reading or writing a byte
Reading or writing a word aligned on a 16-bit boundary
Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer
processors since) guarantees that the
following additional memory operations
will always be carried out atomically:
Reading or writing a quadword aligned on a 64-bit boundary
16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer
processors since) guarantee that the
following additional memory operation
will always be carried out atomically:
Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
Accesses to cacheable memory that are
split across bus widths, cache lines,
and page boundaries are not guaranteed
to be atomic by the Intel Core 2 Duo,
Intel Atom, Intel Core Duo, Pentium M,
Pentium 4, Intel Xeon, P6 family,
Pentium, and Intel486 processors. The
Intel Core 2 Duo, Intel Atom, Intel
Core Duo, Pentium M, Pentium 4, Intel
Xeon, and P6 family processors provide
bus control signals that permit
external memory subsystems to make
split accesses atomic; however,
nonaligned data accesses will
seriously impact the performance of
the processor and should be avoided.

I have solved the problem, I did it by building up the data separately for each thread just as Stephen suggested (the elements where not of void* size). Thanks for a very detailed answer!

Related

Can modern x86 hardware not store a single byte to memory?

Speaking of the memory model of C++ for concurrency, Stroustrup's C++ Programming Language, 4th ed., sect. 41.2.1, says:
... (like most modern hardware) the machine could not load or store anything smaller than a word.
However, my x86 processor, a few years old, can and does store objects smaller than a word. For example:
#include <iostream>
int main()
{
char a = 5;
char b = 25;
a = b;
std::cout << int(a) << "\n";
return 0;
}
Without optimization, GCC compiles this as:
[...]
movb $5, -1(%rbp) # a = 5, one byte
movb $25, -2(%rbp) # b = 25, one byte
movzbl -2(%rbp), %eax # load b, one byte, not extending the sign
movb %al, -1(%rbp) # a = b, one byte
[...]
The comments are by me but the assembly is by GCC. It runs fine, of course.
Obviously, I do not understand what Stroustrup is talking about when he explains that hardware can load and store nothing smaller than a word. As far as I can tell, my program does nothing but load and store objects smaller than a word.
The thoroughgoing focus of C++ on zero-cost, hardware-friendly abstractions sets C++ apart from other programming languages that are easier to master. Therefore, if Stroustrup has an interesting mental model of signals on a bus, or has something else of this kind, then I would like to understand Stroustrup's model.
What is Stroustrup talking about, please?
LONGER QUOTE WITH CONTEXT
Here is Stroustrup's quote in fuller context:
Consider what might happen if a linker allocated [variables of char type like] c and b in the same word in memory and (like most modern hardware) the machine could not load or store anything smaller than a word.... Without a well-defined and reasonable memory model, thread 1 might read the word containing b and c, change c, and write the word back into memory. At the same time, thread 2 could do the same with b. Then, whichever thread managed to read the word first and whichever thread managed to write its result back into memory last would determine the result....
ADDITIONAL REMARKS
I do not believe that Stroustrup is talking about cache lines. Even if he were, as far as I know, cache coherency protocols would transparently handle that problem except maybe during hardware I/O.
I have checked my processor's hardware datasheet. Electrically, my processor (an Intel Ivy Bridge) seems to address DDR3L memory by some sort of 16-bit multiplexing scheme, so I don't know what that's about. It is not clear to me that that has much to do with Stroustrup's point, though.
Stroustrup is a smart man and an eminent scientist, so I do not doubt that he is taking about something sensible. I am confused.
See also this question. My question resembles the linked question in several ways, and the answers to the linked question are also helpful here. However, my question goes also to the hardware/bus model that motivates C++ to be the way it is and that causes Stroustrup to write what he writes. I do not seek an answer merely regarding that which the C++ standard formally guarantees, but also wish to understand why the C++ standard would guarantee it. What is the underlying thought? This is part of my question, too.

TL:DR: On every modern ISA that has byte-store instructions (including x86), they're atomic and don't disturb surrounding bytes. (I'm not aware of any older ISAs where byte-store instructions could "invent writes" to neighbouring bytes either.)
The actual implementation mechanism (in non-x86 CPUs) is sometimes an internal RMW cycle to modify a whole word in a cache line, but that's done "invisibly" inside a core while it has exclusive ownership of the cache line so it's only ever a performance problem, not correctness. (And merging in the store buffer can sometimes turn byte-store instructions into an efficient full-word commit to L1d cache.)
About Stroustrup's phrasing
I don't think it's a very accurate, clear or useful statement. It would be more accurate to say that modern CPUs can't load or store anything smaller than a cache line. (Although that's not true for uncacheable memory regions, e.g. for MMIO.)
It probably would have been better just to make a hypothetical example to talk about memory models, rather than implying that real hardware is like this. But if we try, we can maybe find an interpretation that isn't as obviously or totally wrong, which might have been what Stroustrup was thinking when he wrote this to introduce the topic of memory models. (Sorry this answer is so long; I ended up writing a lot while guessing what he might have meant and about related topics...)
Or maybe this is another case of high-level language designers not being hardware experts, or at least occasionally making mis-statements.
I think Stroustrup is talking about how CPUs work internally to implement byte-store instructions. He's suggesting that a CPU without a well-defined and reasonable memory model might implement a byte-store with a non-atomic RMW of the containing word in a cache line, or in memory for a CPU without cache.
Even this weaker claim about internal (not externally visible) behaviour is not true for high-performance x86 CPUs. Modern Intel CPUs have no throughput penalty for byte stores, or even unaligned word or vector stores that don't cross a cache-line boundary. AMD is similar.
If byte or unaligned stores had to do a RMW cycle as the store committed to L1D cache, it would interfere with store and/or load instruction/uop throughput in a way we could measure with performance counters. (In a carefully designed experiment that avoids the possibility of store coalescing in the store buffer before commit to L1d cache hiding the cost, because the store execution unit(s) can only run 1 store per clock on current CPUs.)
However, some high performance designs for non-x86 ISAs do use an atomic RMW cycle to internally commit stores to L1d cache. Are there any modern CPUs where a cached byte store is actually slower than a word store? The cache line stays in MESI Exclusive/Modified state the whole time, so it can't introduce any correctness problems, only a small performance hit. This is very different from doing something that could step on stores from other CPUs. (The arguments below about that not happening still apply, but my update may have missed some stuff that still argues that atomic cache-RMW is unlikely.)
(On many non-x86 ISAs, unaligned stores are not supported at all, or are used more rarely than in x86 software. And weakly-ordered ISAs allow more coalescing in store buffers, so not as many byte store instructions actually result in single-byte commit to L1d. Without these motivations for fancy (power hungry) cache-access hardware, word RMW for scattered byte stores is an acceptable tradeoff in some designs.)
Alpha AXP, a high-performance RISC design from 1992, famously (and uniquely among modern non-DSP ISAs) omitted byte load/store instructions until Alpha 21164A (EV56) in 1996. Apparently they didn't consider word-RMW a viable option for implementing byte stores, because one of the cited advantages for implementing only 32-bit and 64-bit aligned stores was more efficient ECC for the L1D cache. "Traditional SECDED ECC would require 7 extra bits over 32-bit granules (22% overhead) versus 4 extra bits over 8-bit granules (50% overhead)." (#Paul A. Clayton's answer about word vs. byte addressing has some other interesting computer-architecture stuff.) If byte stores were implemented with word-RMW, you could still do error detection/correction with word-granularity.
Current Intel CPUs only use parity (not ECC) in L1D for this reason. (At least some older Xeons could run with L1d in ECC mode at half capacity instead of the normal 32KiB, as discussed on RWT. It's not clear if anything's changed, e.g. in terms of Intel now using ECC for L1d). See also this Q&A about hardware (not) eliminating "silent stores": checking the old contents of cache before the write to avoid marking the line dirty if it matched would require a RMW instead of just a store, and that's a major obstacle.
It turns out some high-perf pipelined designs do use atomic word-RMW to commit to L1d, despite it stalling the memory pipeline, but (as I argue below) it's much less likely that any do an externally-visible RMW to RAM.
Word-RMW isn't a useful option for MMIO byte stores either, so unless you have an architecture that doesn't need sub-word stores for IO, you'd need some kind of special handling for IO (like Alpha's sparse I/O space where word load/stores were mapped to byte load/stores so it could use commodity PCI cards instead of needing special hardware with no byte IO registers).
As #Margaret points out, DDR3 memory controllers can do byte stores by setting control signals that mask out other bytes of a burst. The same mechanisms that get this information to the memory controller (for uncached stores) could also get that information passed along with a load or store to MMIO space. So there are hardware mechanisms for really doing
a byte store even on burst-oriented memory systems, and it's highly likely that modern CPUs will use that instead of implementing an RMW, because it's probably simpler and is much better for MMIO correctness.
How many and what size cycles will be needed to perform longword transferred to the CPU shows how a ColdFire microcontroller signals the transfer size (byte/word/longword/16-byte line) with external signal lines, letting it do byte loads/stores even if 32-bit-wide memory was hooked up to its 32-bit data bus. Something like this is presumably typical for most memory bus setups (but I don't know). The ColdFire example is complicated by also being configurable to use 16 or 8-bit memory, taking extra cycles for wider transfers. But nevermind that, the important point is that it has external signaling for the transfer size, to tell the memory HW which byte it's actually writing.
Stroustrup's next paragraph is
"The C++ memory model guarantees that two threads of execution can update and access separate memory locations without interfering with each other. This is exactly what we would naively expect. It is the compiler’s job to protect us from the sometimes very strange and subtle behaviors of modern hardware. How a compiler and hardware combination achieves that is up to the compiler. ..."
So apparently he thinks that real modern hardware may not provide "safe" byte load/store. The people who design hardware memory models agree with the C/C++ people, and realize that byte store instructions would not be very useful to programmers / compilers if they could step on neighbouring bytes.
All modern (non-DSP) architectures except early Alpha AXP have byte store and load instructions, and AFAIK these are all architecturally defined to not affect neighbouring bytes. However they accomplish that in hardware, software doesn't need to care about correctness. Even the very first version of MIPS (in 1983) had byte and half-word loads/stores, and it's a very word-oriented ISA.
However, he doesn't actually claim that most modern hardware needs any special compiler support to implement this part of the C++ memory model, just that some might. Maybe he really is only talking about word-addressable DSPs in that 2nd paragraph (where C and C++ implementations often use 16 or 32-bit char as exactly the kind of compiler workaround Stroustrup was talking about.)
Most "modern" CPUs (including all x86) have an L1D cache. They will fetch whole cache lines (typically 64 bytes) and track dirty / not-dirty on a per-cache-line basis. So two adjacent bytes are pretty much exactly the same as two adjacent words, if they're both in the same cache line. Writing one byte or word will result in a fetch of the whole line, and eventually a write-back of the whole line. See Ulrich Drepper's What Every Programmer Should Know About Memory. You're correct that MESI (or a derivative like MESIF/MOESI) makes sure this isn't a problem. (But again, this is because hardware implements a sane memory model.)
A store can only commit to L1D cache while the line is in the Modified state (of MESI). So even if the internal hardware implementation is slow for bytes and takes extra time to merge the byte into the containing word in the cache line, it's effectively an atomic read modify write as long as it doesn't allow the line to be invalidated and re-acquired between the read and the write. (While this cache has the line in Modified state, no other cache can have a valid copy). See #old_timer's comment making the same point (but also for RMW in a memory controller).
This is easier than e.g. an atomic xchg or add from a register that also needs an ALU and register access, since all the HW involved is in the same pipeline stage, which can simply stall for an extra cycle or two. That's obviously bad for performance and takes extra hardware to allow that pipeline stage to signal that it's stalling. This doesn't necessarily conflict with Stroustrup's first claim, because he was talking about a hypothetical ISA without a memory model, but it's still a stretch.
On a single-core microcontroller, internal word-RMW for cached byte stores would be more plausible, since there won't be Invalidate requests coming in from other cores that they'd have to delay responding to during an atomic RMW cache-word update. But that doesn't help for I/O to uncacheable regions. I say microcontroller because other single-core CPU designs typically support some kind of multi-socket SMP.
Many RISC ISAs don't support unaligned-word loads/stores with a single instruction, but that's a separate issue (the difficulty is handling the case when a load spans two cache lines or even pages, which can't happen with bytes or aligned half-words). More and more ISAs are adding guaranteed support for unaligned load/store in recent versions, though. (e.g. MIPS32/64 Release 6 in 2014, and I think AArch64 and recent 32-bit ARM).
The 4th edition of Stroustrup's book was published in 2013 when Alpha had been dead for years. The first edition was published in 1985, when RISC was the new big idea (e.g. Stanford MIPS in 1983, according to Wikipedia's timeline of computing HW, but "modern" CPUs at that time were byte-addressable with byte stores. Cyber CDC 6600 was word-addressable and probably still around, but couldn't be called modern.
Even very word-oriented RISC machines like MIPS and SPARC have byte store and byte load (with sign or zero extension) instructions. They don't support unaligned word loads, simplifying the cache (or memory access if there is no cache) and load ports, but you can load any single byte with one instruction, and more importantly store a byte without any architecturally-visible non-atomic rewrite of the surrounding bytes. (Although cached stores can
I suppose C++11 (which introduces a thread-aware memory model to the language) on Alpha would need to use 32-bit char if targeting a version of the Alpha ISA without byte stores. Or it would have to use software atomic-RMW with LL/SC when it couldn't prove that no other threads could have a pointer that would let them write neighbouring bytes.
IDK how slow byte load/store instructions are in any CPUs where they're implemented in hardware but not as cheap as word loads/stores. Byte loads are cheap on x86 as long as you use movzx/movsx to avoid partial-register false dependencies or merging stalls. On AMD pre-Ryzen, movsx/movzx needs an extra ALU uop, but otherwise zero/sign extension is handled right in the load port on Intel and AMD CPUs.) The main x86 downside is that you need a separate load instruction instead of using a memory operand as a source for an ALU instruction (if you're adding a zero-extended byte to a 32-bit integer), saving front-end uop throughput bandwidth and code-size. Or if you're just adding a byte to a byte register, there's basically no downside on x86. RISC load-store ISAs always need separate load and store instructions anyway. x86 byte stores are no more expensive that 32-bit stores.
As a performance issue, a good C++ implementation for hardware with slow byte stores might put each char in its own word and use word loads/stores whenever possible (e.g. for globals outside structs, and for locals on the stack). IDK if any real implementations of MIPS / ARM / whatever have slow byte load/store, but if so maybe gcc has -mtune= options to control it.
That doesn't help for char[], or dereferencing a char * when you don't know where it might be pointing. (This includes volatile char* which you'd use for MMIO.) So having the compiler+linker put char variables in separate words isn't a complete solution, just a performance hack if true byte stores are slow.
PS: More about Alpha:
Alpha is interesting for a lot of reasons: one of the few clean-slate 64-bit ISAs, not an extension to an existing 32-bit ISA. And one of the more recent clean-slate ISAs, Itanium being another from several years later which attempted some neat CPU-architecture ideas.
From the Linux Alpha HOWTO.
When the Alpha architecture was introduced, it was unique amongst RISC architectures for eschewing 8-bit and 16-bit loads and stores. It supported 32-bit and 64-bit loads and stores (longword and quadword, in Digital's nomenclature). The co-architects (Dick Sites, Rich Witek) justified this decision by citing the advantages:
Byte support in the cache and memory sub-system tends to slow down accesses for 32-bit and 64-bit quantities.
Byte support makes it hard to build high-speed error-correction circuitry into the cache/memory sub-system.
Alpha compensates by providing powerful instructions for manipulating bytes and byte groups within 64-bit registers. Standard benchmarks for string operations (e.g., some of the Byte benchmarks) show that Alpha performs very well on byte manipulation.

Not only are x86 CPUs capable of reading and writing a single byte, all modern general purpose CPUs are capable of it. More importantly most modern CPUs (including x86, ARM, MIPS, PowerPC, and SPARC) are capable of atomically reading and writing single bytes.
I'm not sure what Stroustrup was referring to. There used to be a few word addressable machines that weren't capable of 8-bit byte addressing, like the Cray, and as Peter Cordes mentioned early Alpha CPUs didn't support byte loads and stores, but today the only CPUs incapable of byte loads and stores are certain DSPs used in niche applications. Even if we assume he means most modern CPUs don't have atomic byte load and stores this isn't true of most CPUs.
However, simple atomic loads and stores aren't of much use in multithreaded programming. You also typically need ordering guarantees and a way to make read-modify-write operations atomic. Another consideration is that while CPU a may have byte load and store instructions, compiler isn't required to use them. A compiler, for example, could still generate the code Stroustrup describes, loading both b and c using a single word load instruction as an optimization.
So while you do need a well defined memory model, if only so the compiler is forced to generate the code you expect, the problem isn't that modern CPUs aren't capable of loading or storing anything smaller than a word.

The author seems to be concerned about thread 1 and thread 2 getting into a situation where the read-modify-writes (not in software, the software does two separate instructions of a byte size, somewhere down the line logic has to do a read-modify-write) instead of the ideal read modify write read modify write, becomes a read read modify modify write write or some other timing such that both read the pre-modified version and the last one to write wins. read read modify modify write write, or read modify read modify write write or read modify read write modify write.
The concern is to start with 0x1122 and one thread wants to make it 0x33XX the other wants to make it 0xXX44, but with for example a read read modify modify write write you end up with 0x1144 or 0x3322, but not 0x3344
A sane (system/logic) design just doesn't have that problem certainly not for a general purpose processor like this, I have worked on designs with timing issues like this but that is not what we are talking about here, completely different system designs for different purposes. The read-modify-write does not span a long enough distance in a sane design, and x86s are sane designs.
The read-modify-write would happen very near the first SRAM involved (ideally L1 when running an x86 in a typical fashion with an operating system capable of running C++ compiled multi-threaded programs) and happen within a few clock cycles as the ram is at the speed of the bus ideally. And as Peter pointed out this is considered to be the whole cache line that experiences this, within the cache, not a read-modify-write between the processor core and the cache.
The notion of "at the same time" even with multi-core systems isn't necessarily at the same time, eventually you get serialized because performance isn't based on them being parallel from beginning to end, it is based on keeping the busses loaded.
The quote is saying variables allocated to the same word in memory, so that is the same program. Two separate programs are not going to share an address space like that. so
You are welcome to try this, make a multithreaded program that one writes to say address 0xnnn00000 the other writes to address 0xnnnn00001, each does a write, then a read or better several writes of the same value than one read, check the read was the byte they wrote, then repeats with a different value. Let that run for a while, hours/days/weeks/months. See if you trip up the system...use assembly for the actual write instructions to make sure it is doing what you asked (not C++ or any compiler that does or claims it will not put these items in the same word). Can add delays to allow for more cache evictions, but that reduces your odds of "at the same time" collisions.
Your example so long as you insure you are not sitting on two sides of a boundary (cache, or other) like 0xNNNNFFFFF and 0xNNNN00000, isolate the two byte writes to addresses like 0xNNNN00000 and 0xNNNN00001 have the instructions back to back and see if you get a read read modify modify write write. Wrap a test around it, that the two values are different each loop, you read back the word as a whole at whatever delay later as you desire and check the two values. Repeat for days/weeks/months/years to see if it fails. Read up on your processors execution and microcode features to see what it does with this instruction sequence and as needed create a different instruction sequence that tries to get the transactions initiated within a handful or so clock cycles on the far side of the processor core.
EDIT
the problem with the quotes is that this is all about language and the use of. "like most modern hardware" puts the whole of the topic/text in a touchy position, it is too vague, one side can argue all I have to do is find one case that is true to make all the rest true, likewise one side could argue if I find one case the all of the rest is not true. Using the word like kind of messes with that as a possible get out of jail free card.
The reality is that a significant percentage of our data is stored in DRAM in 8 bit wide memories, just that we don't access them as 8 bit wide normally we access 8 of them at a time, 64 bits wide. In some number of weeks/months/years/decades this statement will be incorrect.
The larger quote says "at the same time" and then says read ... first, write ... last, well first and last and at the same time don't make sense together, is it parallel or serial? The context as a whole is concerned about the above read read modify modify write write variations where you have one writing last and depending on when that one read determines if both modifications happened or not. Not about at the same time which "like most modern hardware" doesn't make sense things that start off actually parallel in separate cores/modules eventually get serialized if they are aiming at the same flip-flop/transistor in a memory, one eventually has to wait for the other to go first. Being physics based I don't see this being incorrect in the coming weeks/months/years.

This is correct. An x86_64 CPU, just like an original x86 CPU, is not able to read or write anything smaller than an (in this case 64-bit) word from rsp. to memory. And it will not typically read or write less than a whole cache line, though there are ways to bypass the cache, especially in writing (see below).
In this context, though, Stroustrup refers to potential data races (lack of atomicity on an observable level). This correctness issue is irrelevant on x86_64, because of the cache coherency protocol, which you mentioned. In other words, yes, the CPU is limited to whole word transfers, but this is transparently handled, and you as a programmer generally do not have to worry about it. In fact, the C++ language, starting from C++11, guarantees that concurrent operations on distinct memory locations have well-defined behavior, i.e. the one you'd expect. Even if the hardware did not guarantee this, the implementation would have to find a way by generating possibly more complex code.
That said, it can still be a good idea to keep the fact that whole words or even cache lines are always involved at the machine level in the back of your head, for two reasons.
First, and this is only relevant for people who write device drivers, or design devices, memory-mapped I/O may be sensitive to the way it is accessed. As an example, think of a device that exposes a 64-bit write-only command register in the physical address space. It may then be necessary to:
Disable caching. It is not valid to read a cache line, change a single word, and write back the cache line. Also, even if it were valid, there would still be a great risk that commands might be lost because the CPU cache is not written back soon enough. At the very least, the page needs to be configured as "write-through", which means writes take immediate effect. Therefore, an x86_64 page table entry contains flags that control the CPU's caching behavior for this page.
Ensure that the whole word is always written, on the assembly level. E.g. consider a case where you write the value 1 into the register, followed by a 2. A compiler, especially when optimizing for space, might decide to overwrite only the least significant byte because the others are already supposed to be zero (that is, for ordinary RAM), or it might instead remove the first write because this value appears to be immediately overwritten anyway. However, neither is supposed to happen here. In C/C++, the volatile keyword is vital to prevent such unsuitable optimizations.
Second, and this is relevant for almost any developer writing multi-threaded programs, the cache coherency protocol, while neatly averting disaster, can have a huge performance cost if it is "abused".
Here's a – somewhat contrived – example of a very bad data structure. Assume you have 16 threads parsing some text from a file. Each thread has an id from 0 to 15.
// shared state
char c[16];
FILE *file[16];
void threadFunc(int id)
{
while ((c[id] = getc(file[id])) != EOF)
{
// ...
}
}
This is safe because each thread operates on a different memory location. However, these memory locations would typically reside on the same cache line, or at most are split over two cache lines. The cache coherency protocol is then used to properly synchronize the accesses to c[id]. And herein lies the problem, because this forces every other thread to wait until the cache line becomes exclusively available before doing anything with c[id], unless it is already running on the core that "owns" the cache line. Assuming several, e.g. 16, cores, cache coherency will typically transfer the cache line from one core to another all the time. For obvious reasons, this effect is known as "cache line ping-pong". It creates a horrible performance bottleneck. It is the result of a very bad case of false sharing, i.e. threads sharing a physical cache line without actually accessing the same logical memory locations.
In contrast to this, especially if one took the extra step of ensuring that the file array resides on its own cache line, using it would be completely harmless (on x86_64) from a performance perspective because the pointers are only read from, most the time. In this case, multiple cores can "share" the cache line as read-only. Only when any core tries to write to the cache line, it has to tell the other cores that it is going to "seize" the cache line for exclusive access.
(This is greatly simplified, as there are different levels of CPU caches, and several cores might share the same L2 or L3 cache, but it should give you a basic idea of the problem.)

Not sure what Stroustrup meant by "WORD".
Maybe it is the minimum size of memory storage of the machine?
Anyway not all machines were created with 8bit (BYTE) resolution.
In fact I recommend this awesome article by Eric S. Raymond describing some of the history of computers:
http://www.catb.org/esr/faqs/things-every-hacker-once-knew/
"... It used also to be generally known that 36-bit architectures
explained some unfortunate features of the C language. The original
Unix machine, the PDP-7, featured 18-bit words corresponding to
half-words on larger 36-bit computers. These were more naturally
represented as six octal (3-bit) digits."

Stroustrup is not saying that no machine can perform loads and stores smaller than their native word size, he is saying that a machine couldn't.
While this seems surprising at first, it's nothing esoteric.
For starter, we will ignore the cache hierarchy, we will take that into account later.
Assume there are no caches between the CPU and the memory.
The big problem with memory is density, trying to put more bits possible into the smallest area.
In order to achieve that it is convenient, from an electrical design point of view, to expose a bus as wider as possible (this favours the reuse of some electrical signals, I haven't looked at the specific details though).
So, in architecture where big memories are needed (like the x86) or a simple low-cost design is favourable (for example where RISC machines are involved), the memory bus is larger than the smallest addressable unit (typically the byte).
Depending on the budget and legacy of the project the memory can expose a wider bus alone or along with some sideband signals to select a particular unit into it.
What does this mean practically?
If you take a look at the datasheet of a DDR3 DIMM you'll see that there are 64 DQ0–DQ63 pins to read/write the data.
This is the data bus, 64-bit wide, 8 bytes at a time.
This 8 bytes thing is very well founded in the x86 architecture to the point that Intel refers to it in the WC section of its optimisation manual where it says that data are transferred from the 64 bytes fill buffer (remember: we are ignoring the caches for now, but this is similar to how a cache line gets written back) in bursts of 8 bytes (hopefully, continuously).
Does this mean that the x86 can only write QWORDS (64-bit)?
No, the same datasheet shows that each DIMM has the DM0–DM7 ,DQ0–DQ7 and DQS0–DQS7 signals to mask, direct and strobe each of the 8 bytes in the 64-bit data bus.
So x86 can read and write bytes natively and atomically.
However, now it's easy to see that this could not be the case for every architecture.
For instance, the VGA video memory was DWORD (32-bit) addressable and making it fit in the byte addressable world of the 8086 led to the messy bit-planes.
In general specific purpose architecture, like DSPs, could not have a byte addressable memory at the hardware level.
There is a twist: we have just talked about the memory data bus, this is the lowest layer possible.
Some CPUs can have instructions that build a byte addressable memory on top of a word addressable memory.
What does that mean?
It's easy to load a smaller part of a word: just discard the rest of the bytes!
Unfortunately, I can't recall the name of the architecture (if it even existed at all!) where the processor simulated a load of an unaligned byte by reading the aligned word containing it and rotating the result before saving it in a register.
With stores, the matter is more complex: if we can't simply write the part of the word that we just updated we need to write the unchanged remaining part too.
The CPU, or the programmer, must read the old content, update it and write it back.
This is a Read-Modify-Write operation and it is a core concept when discussing atomicity.
Consider:
/* Assume unsigned char is 1 byte and a word is 4 bytes */
unsigned char foo[4] = {};
/* Thread 0 Thread 1 */
foo[0] = 1; foo[1] = 2;
Is there a data race?
This is safe on x86 because they can write bytes, but what if the architecture cannot?
Both threads would have to read the whole foo array, modify it and write it back.
In pseudo-C this would be
/* Assume unsigned char is 1 byte and a word is 4 bytes */
unsigned char foo[4] = {};
/* Thread 0 Thread 1 */
/* What a CPU would do (IS) What a CPU would do (IS) */
int tmp0 = *((int*)foo) int tmp1 = *((int*)foo)
/* Assume little endian Assume little endian */
tmp0 = (tmp0 & ~0xff) | 1; tmp1 = (tmp1 & ~0xff00) | 0x200;
/* Store it back Store it back */
*((int*)foo) = tmp0; *((int*)foo) = tmp1;
We can now see what Stroustrup was talking about: the two stores *((int*)foo) = tmpX obstruct each other, to see this consider this possible execution sequence:
int tmp0 = *((int*)foo) /* T0 */
tmp0 = (tmp0 & ~0xff) | 1; /* T1 */
int tmp1 = *((int*)foo) /* T1 */
tmp1 = (tmp1 & ~0xff00) | 0x200; /* T1 */
*((int*)foo) = tmp1; /* T0 */
*((int*)foo) = tmp0; /* T0, Whooopsy */
If the C++ didn't have a memory model these kinds of nuisances would have been implementation specific details, leaving the C++ a useless programming language in a multithreading environment.
Considering how common is the situation depicted in the toy example, Stroustrup stressed out the importance of a well-defined memory model.
Formalizing a memory model is hard work, it's an exhausting, error-prone and abstract process so I also see a bit of pride in the words of Stroustrup.
I have not brushed up on the C++ memory model but updating different array elements is fine.
That's a very strong guarantee.
We have left out the caches but that doesn't really change anything, at least for the x86 case.
The x86 writes to memory through the caches, the caches are evicted in lines of 64 bytes.
Internally each core can update a line at any position atomically unless a load/store crosses a line boundary (e.g. by writing near the end of it).
This can be avoided by naturally aligning data (can you prove that?).
In a multi-code/socket environment, the cache coherency protocol ensures that only a CPU at a time is allowed to freely write to a cached line of memory (the CPU that has it in the Exclusive or Modified state).
Basically, the MESI family of protocol use a concept similar to locking found the DBMSs.
This has the effect, for the writing purpose, of "assigning" different memory regions to different CPUs.
So it doesn't really affect the discussion of above.

Why is integer assignment on a naturally aligned variable atomic on x86?

I've been reading this article about atomic operations, and it mentions 32-bit integer assignment being atomic on x86, as long as the variable is naturally aligned.
Why does natural alignment assure atomicity?

"Natural" alignment means aligned to its own type width. Thus, the load/store will never be split across any kind of boundary wider than itself (e.g. page, cache-line, or an even narrower chunk size used for data transfers between different caches).
CPUs often do things like cache-access, or cache-line transfers between cores, in power-of-2 sized chunks, so alignment boundaries smaller than a cache line do matter. (See #BeeOnRope's comments below). See also Atomicity on x86 for more details on how CPUs implement atomic loads or stores internally, and Can num++ be atomic for 'int num'? for more about how atomic RMW operations like atomic<int>::fetch_add() / lock xadd are implemented internally.
First, this assumes that the int is updated with a single store instruction, rather than writing different bytes separately. This is part of what std::atomic guarantees, but that plain C or C++ doesn't. It will normally be the case, though. The x86-64 System V ABI doesn't forbid compilers from making accesses to int variables non-atomic, even though it does require int to be 4B with a default alignment of 4B. For example, x = a<<16 | b could compile to two separate 16-bit stores if the compiler wanted.
Data races are Undefined Behaviour in both C and C++, so compilers can and do assume that memory is not asynchronously modified. For code that is guaranteed not to break, use C11 stdatomic or C++11 std::atomic. Otherwise the compiler will just keep a value in a register instead of reloading every time your read it, like volatile but with actual guarantees and official support from the language standard.
Before C++11, atomic ops were usually done with volatile or other things, and a healthy dose of "works on compilers we care about", so C++11 was a huge step forward. Now you no longer have to care about what a compiler does for plain int; just use atomic<int>. If you find old guides talking about atomicity of int, they probably predate C++11. When to use volatile with multi threading? explains why that works in practice, and that atomic<T> with memory_order_relaxed is the modern way to get the same functionality.
std::atomic<int> shared; // shared variable (compiler ensures alignment)
int x; // local variable (compiler can keep it in a register)
x = shared.load(std::memory_order_relaxed);
shared.store(x, std::memory_order_relaxed);
// shared = x; // don't do that unless you actually need seq_cst, because MFENCE or XCHG is much slower than a simple store
Side-note: for atomic<T> larger than the CPU can do atomically (so .is_lock_free() is false), see Where is the lock for a std::atomic?. int and int64_t / uint64_t are lock-free on all the major x86 compilers, though.
Thus, we just need to talk about the behaviour of an instruction like mov [shared], eax.
TL;DR: The x86 ISA guarantees that naturally-aligned stores and loads are atomic, up to 64bits wide. So compilers can use ordinary stores/loads as long as they ensure that std::atomic<T> has natural alignment.
(But note that i386 gcc -m32 fails to do that for C11 _Atomic 64-bit types inside structs, only aligning them to 4B, so atomic_llong can be non-atomic in some cases. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146#c4). g++ -m32 with std::atomic is fine, at least in g++5 because https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65147 was fixed in 2015 by a change to the <atomic> header. That didn't change the C11 behaviour, though.)
IIRC, there were SMP 386 systems, but the current memory semantics weren't established until 486. This is why the manual says "486 and newer".
From the "Intel® 64 and IA-32 Architectures Software Developer Manuals, volume 3", with my notes in italics. (see also the x86 tag wiki for links: current versions of all volumes, or direct link to page 256 of the vol3 pdf from Dec 2015)
In x86 terminology, a "word" is two 8-bit bytes. 32 bits are a double-word, or DWORD.
###Section 8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory
operations will always be carried out atomically:
Reading or writing a byte
Reading or writing a word aligned on a 16-bit boundary
Reading or writing a doubleword aligned on a 32-bit boundary (This is another way of saying "natural alignment")
That last point that I bolded is the answer to your question: This behaviour is part of what's required for a processor to be an x86 CPU (i.e. an implementation of the ISA).
The rest of the section provides further guarantees for newer Intel CPUs: Pentium widens this guarantee to 64 bits.
The
Pentium processor (and newer processors since) guarantees that the
following additional memory operations will always be carried out
atomically:
Reading or writing a quadword aligned on a 64-bit boundary
(e.g. x87 load/store of a double, or cmpxchg8b (which was new in Pentium P5))
16-bit accesses to uncached memory locations that fit within a 32-bit data bus.
The section goes on to point out that accesses split across cache lines (and page boundaries) are not guaranteed to be atomic, and:
"An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using
multiple memory accesses."
AMD's manual agrees with Intel's about aligned 64-bit and narrower loads/stores being atomic
So integer, x87, and MMX/SSE loads/stores up to 64b, even in 32-bit or 16-bit mode (e.g. movq, movsd, movhps, pinsrq, extractps, etc.) are atomic if the data is aligned. gcc -m32 uses movq xmm, [mem] to implement atomic 64-bit loads for things like std::atomic<int64_t>. Clang4.0 -m32 unfortunately uses lock cmpxchg8b bug 33109.
On some CPUs with 128b or 256b internal data paths (between execution units and L1, and between different caches), 128b and even 256b vector loads/stores are atomic, but this is not guaranteed by any standard or easily queryable at run-time, unfortunately for compilers implementing std::atomic<__int128> or 16B structs.
(Update: x86 vendors have decided that the AVX feature bit also indicates atomic 128-bit aligned loads/stores. Before that we only had https://rigtorp.se/isatomic/ experimental testing to verify it.)
If you want atomic 128b across all x86 systems, you must use lock cmpxchg16b (available only in 64bit mode). (And it wasn't available in the first-gen x86-64 CPUs. You need to use -mcx16 with GCC/Clang for them to emit it.)
Even CPUs that internally do atomic 128b loads/stores can exhibit non-atomic behaviour in multi-socket systems with a coherency protocol that operates in smaller chunks: e.g. AMD Opteron 2435 (K10) with threads running on separate sockets, connected with HyperTransport.
Intel's and AMD's manuals diverge for unaligned access to cacheable memory. The common subset for all x86 CPUs is the AMD rule. Cacheable means write-back or write-through memory regions, not uncacheable or write-combining, as set with PAT or MTRR regions. They don't mean that the cache-line has to already be hot in L1 cache.
Intel P6 and later guarantee atomicity for cacheable loads/stores up to 64 bits as long as they're within a single cache-line (64B, or 32B on very old CPUs like Pentium III).
AMD guarantees atomicity for cacheable loads/stores that fit within a single 8B-aligned chunk. That makes sense, because we know from the 16B-store test on multi-socket Opteron that HyperTransport only transfers in 8B chunks, and doesn't lock while transferring to prevent tearing. (See above). I guess lock cmpxchg16b must be handled specially.
Possibly related: AMD uses MOESI to share dirty cache-lines directly between caches in different cores, so one core can be reading from its valid copy of a cache line while updates to it are coming in from another cache.
Intel uses MESIF, which requires dirty data to propagate out to the large shared inclusive L3 cache which acts as a backstop for coherency traffic. L3 is tag-inclusive of per-core L2/L1 caches, even for lines that have to be in the Invalid state in L3 because of being M or E in a per-core L1 cache. The data path between L3 and per-core caches is only 32B wide in Haswell/Skylake, so it must buffer or something to avoid a write to L3 from one core happening between reads of two halves of a cache line, which could cause tearing at the 32B boundary.
The relevant sections of the manuals:
The P6 family processors (and newer Intel processors
since) guarantee that the following additional memory operation will
always be carried out atomically:
Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.
AMD64 Manual 7.3.2 Access Atomicity
Cacheable, naturally-aligned single loads or stores of up to a quadword are atomic on any processor
model, as are misaligned loads or stores of less than a quadword that
are contained entirely within a naturally-aligned quadword
Notice that AMD guarantees atomicity for any load smaller than a qword, but Intel only for power-of-2 sizes. 32-bit protected mode and 64-bit long mode can load a 48 bit m16:32 as a memory operand into cs:eip with far-call or far-jmp. (And far-call pushes stuff on the stack.) IDK if this counts as a single 48-bit access or separate 16 and 32-bit.
There have been attempts to formalize the x86 memory model, the latest one being the x86-TSO (extended version) paper from 2009 (link from the memory-ordering section of the x86 tag wiki). It's not usefully skimmable since they define some symbols to express things in their own notation, and I haven't tried to really read it. IDK if it describes the atomicity rules, or if it's only concerned with memory ordering.
Atomic Read-Modify-Write
I mentioned cmpxchg8b, but I was only talking about the load and the store each separately being atomic (i.e. no "tearing" where one half of the load is from one store, the other half of the load is from a different store).
To prevent the contents of that memory location from being modified between the load and the store, you need lock cmpxchg8b, just like you need lock inc [mem] for the entire read-modify-write to be atomic. Also note that even if cmpxchg8b without lock does a single atomic load (and optionally a store), it's not safe in general to use it as a 64b load with expected=desired. If the value in memory happens to match your expected, you'll get a non-atomic read-modify-write of that location.
The lock prefix makes even unaligned accesses that cross cache-line or page boundaries atomic, but you can't use it with mov to make an unaligned store or load atomic. It's only usable with memory-destination read-modify-write instructions like add [mem], eax.
(lock is implicit in xchg reg, [mem], so don't use xchg with mem to save code-size or instruction count unless performance is irrelevant. Only use it when you want the memory barrier and/or the atomic exchange, or when code-size is the only thing that matters, e.g. in a boot sector.)
See also: Can num++ be atomic for 'int num'?
Why lock mov [mem], reg doesn't exist for atomic unaligned stores
From the instruction reference manual (Intel x86 manual vol2), cmpxchg:
This instruction can be used with a LOCK prefix to allow the
instruction to be executed atomically. To simplify the interface to
the processor’s bus, the destination operand receives a write cycle
without regard to the result of the comparison. The destination
operand is written back if the comparison fails; otherwise, the source
operand is written into the destination. (The processor never produces
a locked read without also producing a locked write.)
This design decision reduced chipset complexity before the memory controller was built into the CPU. It may still do so for locked instructions on MMIO regions that hit the PCI-express bus rather than DRAM. It would just be confusing for a lock mov reg, [MMIO_PORT] to produce a write as well as a read to the memory-mapped I/O register.
The other explanation is that it's not very hard to make sure your data has natural alignment, and lock store would perform horribly compared to just making sure your data is aligned. It would be silly to spend transistors on something that would be so slow it wouldn't be worth using. If you really need it (and don't mind reading the memory too), you could use xchg [mem], reg (XCHG has an implicit LOCK prefix), which is even slower than a hypothetical lock mov.
Using a lock prefix is also a full memory barrier, so it imposes a performance overhead beyond just the atomic RMW. i.e. x86 can't do relaxed atomic RMW (without flushing the store buffer). Other ISAs can, so using .fetch_add(1, memory_order_relaxed) can be faster on non-x86.
Fun fact: Before mfence existed, a common idiom was lock add dword [esp], 0, which is a no-op other than clobbering flags and doing a locked operation. [esp] is almost always hot in L1 cache and won't cause contention with any other core. This idiom may still be more efficient than MFENCE as a stand-alone memory barrier, especially on AMD CPUs.
xchg [mem], reg is probably the most efficient way to implement a sequential-consistency store, vs. mov+mfence, on both Intel and AMD. mfence on Skylake at least blocks out-of-order execution of non-memory instructions, but xchg and other locked ops don't. Compilers other than gcc do use xchg for stores, even when they don't care about reading the old value.
Motivation for this design decision:
Without it, software would have to use 1-byte locks (or some kind of available atomic type) to guard accesses to 32bit integers, which is hugely inefficient compared to shared atomic read access for something like a global timestamp variable updated by a timer interrupt. It's probably basically free in silicon to guarantee for aligned accesses of bus-width or smaller.
For locking to be possible at all, some kind of atomic access is required. (Actually, I guess the hardware could provide some kind of totally different hardware-assisted locking mechanism.) For a CPU that does 32bit transfers on its external data bus, it just makes sense to have that be the unit of atomicity.
Since you offered a bounty, I assume you were looking for a long answer that wandered into all interesting side topics. Let me know if there are things I didn't cover that you think would make this Q&A more valuable for future readers.
Since you linked one in the question, I highly recommend reading more of Jeff Preshing's blog posts. They're excellent, and helped me put together the pieces of what I knew into an understanding of memory ordering in C/C++ source vs. asm for different hardware architectures, and how / when to tell the compiler what you want if you aren't writing asm directly.

If a 32-bit or smaller object is naturally-aligned within a "normal" part of memory, it will be possible for any 80386 or compatible processor other than the
80386sx to read or write all 32 bits of the object in a single operation. While the ability of a platform to do something in a quick and useful fashion doesn't necessarily mean the platform won't sometimes do it in some other fashion for some reason, and while I believe it's possible on many if not all x86 processors to have regions of memory which can only be accessed 8 or 16 bits at a time, I don't think Intel has ever defined any conditions where requesting an aligned 32-bit access to a "normal" area of memory would cause the system to read or write part of the value without reading or writing the whole thing, and I don't think Intel has any intention of ever defining any such thing for "normal" areas of memory.

Naturally aligned means that the address of the type is a multiple of the size of the type.
For example, a byte can be at any address, a short (assuming 16 bits) must be on a multiple of 2, an int (assuming 32 bits) must be on a multiple of 4, and a long (assuming 64 bits) must be on a multiple of 8.
In the event that you access a piece of data that is not naturally aligned the CPU will either raise a fault or will read/write the memory, but not as an atomic operation. The action the CPU takes will depend on the architecture.
For example, image we've got the memory layout below:
01234567
...XXXX.
and
int *data = (int*)3;
When we try to read *data the bytes that make up the value are spread across 2 int size blocks, 1 byte is in block 0-3 and 3 bytes are in block 4-7. Now, just because the blocks are logically next to each other it doesn't mean they are physically. For example, block 0-3 could be at the end of a cpu cache line, whilst block 3-7 is sitting in a page file. When the cpu goes to access block 3-7 in order to get the 3 bytes it needs it may see that the block isn't in memory and signals that it needs the memory paged in. This will probably block the calling process whilst the OS pages the memory back in.
After the memory has been paged in, but before your process is woken back up another one may come along and write a Y to address 4. Then your process is rescheduled and the CPU completes the read, but now it has read XYXX, rather than the XXXX you expected.

If you were asking why it's designed so, I would say it's a good side product from the design of CPU architecture.
Back in the 486 time, there is no multi-core CPU or QPI link, so atomicity isn't really a strict requirement at that time (DMA may require it?).
On x86, the data width is 32bits (or 64 bits for x86_64), meaning the CPU can read and write up to data width in one shot. And the memory data bus is typically the same or wider than this number. Combined with the fact that reading/writing on aligned address is done in one shot, naturally there is nothing preventing the read/write to be un-atomic. You gain speed/atomic at the same time.

To answer your first question, a variable is naturally aligned if it exists at a memory address that is a multiple of its size.
If we consider only - as the article you linked does - assignment instructions, then alignment guarantees atomicity because MOV (the assignment instruction) is atomic by design on aligned data.
Other kinds of instructions, INC for example, need to be LOCKed (an x86 prefix which gives exclusive access to the shared memory to the current processor for the duration of the prefixed operation) even if the data are aligned because they actually execute via multiple steps (=instructions, namely load, inc, store).

Can more than one Load/Store instructions can be executed at the same instance of time in Multiprocessor Environment

I believe in single processor systems, more than one Store will happen one after the other,
but what is the case for multi processsor systems?
Adding to the question, also if the machine is 32bit and when we try to write
a long int(64 bit) value to the memory, how will the Load/Store instructions behave?
The reason for the above two questions is, if someone tries to read the same memory (a memory
of size 32bit/64 bit, in 32 bit systems), in another
thread will this be safe, or do i need to consider using locks.?
Added:
I wanted to do with minimum locks possible since ours is time critical execution.
Hence I wanted to understand is there ever a possibility of executing two Store/Load instructions
at the same instant of time to the same memory location, if things gets executed in multi processor
environment.

You are wrong if you only look on load/store cpu instructions.
The compiler and your os and cpu can:
change execution order to optimize the code
can hold values in separate caches
can store data in cpu registers without accessing cache or other memory
can optimize access complete away
... a lot more I believe!
If you want to access the same variable from different threads you must use a synchronization mechanism which is provided from your language or a library which fits to your os. Nothing else will give you a guarantee to work.
The problem is not the real access to any kind of memory. You definitely must ensure that your code contains memory barriers as needed for the underlying libraries and OS support. If there are no barriers between multi thread access you will maybe not see any change from a write in one thread while read it from a second one.
This will also be a problem on a single core cpu because the compiler have no idea that you modify a variable from two threads if you don't use any kind of synchronization.
To your add on:
You simply have no control over any kind of memory access without writing your code in assembler. And if you write it in assembler, you! have to deal with registers L1/L2/Lx Caching, Memory Mapping, Inter-CPU-Communication and so on. Forget all about load/store instruction. This is only 1% of the job!
If you have time critical jobs:
try to fix the core were the thread runs on ( see detailed description for threading libraries like posix pthreads or whatever lib you are running on )
it can be much faster to run a single process with a single thread and program it in a cooperative fashion. No locks, no memory barriers, no ipc. But you have to deal with all the thread a like problems. But it is fast!
Often it is much faster to split you problem in some processes each only with one thread and make the ipc minimal. This needs a deep understanding how you can scale your algorithms.
Often a very! simple 8/16 bit cpu runs much faster in special environments in comparison to a fat 8 core cpu with fat os on it.
But you don't tell us the rest of your environment and requirements so the answer never can give a full answer to your real problem. But keep in mind: load/store was yesterday.

This can not be answered generically. You have to know which model of what design of processor it is. An AMD Opteron will be different from an Intel Pentium which is different from a Intel Core2, and all of those are different from an ARMv7 design. [They are probably fairly similar, but there's details that you may care about if you REALLY want to rely on these operations to be performed in a specific way]. And of course, if you share memory between, say, a GPU (graphics processing unit) and a CPU, you have even more possible scenarios of "different design".
There are single core "superscalar" (more than one execution unit) and "out of order execution" (processors that reorder instructions), so more than one execution unit (including more than one load/store unit), and thus more than one instruction (including load or store) can be performed at the same time.
Obviously, once the processor determines that the memory operation needs to go "outside" (that is, the value is not available in the cache), it has to be serialized, but there is no guarantee that a load or store as sequenced by you or the compiler won't be re-ordered between loads and stores. If the processor has instructions to support "data wider than the bus" (e.g. 32-bit processor loading 64-bit word), these are typically atomic to that processor. If the processor does not in itself support 64-bit words, then the load of a 64-bit value would encompass two 32-bit loads.
[When I write "load", the same applies for "store"]
In case of multiprocessor or multicore architectures, it becomes a system architecture question, which makes it even more complicated than "we can't answer this without understanding the processor design", since there are now more components involved: memory design (one lump of memory shared between processors, several lumps of memory that are not directly shared, etc).
In general, if you have multiple threads, you will need to use atomic operations - most processors have a way to say "I want this to happen without someone else interfering" - in the old days, it would be a "lock" pin on the processor(s) that would be wired to anything else that could access the memory bus, and when that pin was active, all other devices has to wait for it to become inactive before accessing the memory bus. These days, it's a fair bit more sophisticated, since there are caches involved. Most systems use an "exclusive cache content" method: The processor signals all it's peers that "I want this address to be exclusive in my cache", at which point all other processors will "flush and invalidate" that particular address in their caches. Then the atomic operation is performed in the cache, and the results available to be read by other processors only when the atomic operation is completed. This is a pretty simplified view of how it works - modern processors are very complex, and there is a lot of work involved with such seemingly simple things as "make sure this value gets updated in a way that doesn't get interrupted by some other processor writing to the same thing".
If there isn't support in the processor for "atomic" operations, then there has to be proper locks (and any processor designed for use in a multicore/multicpu environment will have operations to support locks in some way), where the lock is taken before updating something, and then released after the update. This is clearly more complex than having builtin atomic operations, but it makes the design of the processor simpler. Also, for more complex updates (where more than one 32- or 64-bit value needs updating) this sort of locking is still required - for example, if we have a "queue" where you have a "where we're writing" and "elements in queue" that both need to be updated on write, you can't do that in a single operation [without being VERY clever about it, at least].
In heterogeneous systems, such as GPU + CPU combinations, you can't do atomics between different devices, because the cache of one device doesn't "understand the language" of the other device. So when the CPU says "I want this as exclusive", the GPU sees "Hurdi gurdi meatballs" and thinks "I have no idea what that is about, I'll just ignore it" [or something like that]. In this case, there has to be some other way to access shared data, but it's typically not atomic ever, you have to send commands (via other means than the interprocessor signalling system) to the GPU to say "flush your cache, and tell me when you're done with that", and when the CPU has written something the GPU needs, the CPU will flush it's cache before telling the GPU that it can use the data. This can get pretty messy, and takes a fair amount of time.

I believe in single processor systems, more than one Store will happen one after the other,
False. Most machines are set up like that, but for performance reasons many CPUs can be configured to have a much more relaxed store ordering. This is almost never a problem for an application on a single CPU (because the CPU will make it look like you expect) but it's really critical to understand when talking to hardware.
Here's a wikipedia article: http://en.wikipedia.org/wiki/Memory_ordering
This gets doubly complicated on CPUs with non-coherent local caches. Because then you can have strong ordering as seen from this CPU while other CPUs will see totally different results depending on the cache flush order.
Adding to the question, also if the machine is 32bit and when we try to write a long int(64 bit) value to the memory, how will the Load/Store instructions behave?
Some 32 bit CPUs have instructions to do atomic 64 bit writes, others don't. Those that don't will do two separate writes where a partial result can be seen by other CPUs or threads (if you get unlucky with context switching) or signal handlers or interrupt handlers.
The reason for the above two questions is, if someone tries to read the same memory (a memory of size 32bit/64 bit, in 32 bit systems), in another thread will this be safe, or do i need to consider using locks.?
Yes, no, maybe. If it's just one value and it doesn't tell the other thread that some other memory might be in a certain state, then yes, it can be safe in certain circumstances. You're not guaranteed that the other thread will see the changed value in memory for a long time, but eventually it should see it.
Generally, you can't reason about the behavior of access to shared memory in a threaded environment without strictly following the documentation of the thread model you're using. And most of those say something like without locks the behavior is undefined, with locks everything that happened before the lock is guaranteed to happen before the lock and everything that happens after the lock is guaranteed to happen after the lock. This is not only because of differences between CPUs, but also because the operating system can do something funny and the locking code needs to be designed to convince the compiler to not do something funny either (which is surprisingly hard with modern compilers).

Which things are necessary when sharing memory between different processors?

So I have an ARM processor and a DSP processor. Some data would be shared between those two processor using a shared memory. The data could consist of data structures such as structs and classes of C++. Which things are important to make the application work with this sort of shared memory. One thing that comes to the mind is endianness, that is both processors should either be little endian or big endian. Anything else, which is also necessary to ensure proper shared memory accesses?

Apart from Endianness, following might be important,
Implementation of standard api's might differ and can have adverse effects on shared data e.g. memcpy() on ARM is implemented in a unique way internally and i at least remember having a hard time with a bug porting an RTOS to ARM. I don't recall the exact details but those surely can be found with a little search.
Strictly use types and structures aligned by the tool chains. Because alignment and padding for each architecture might be totally different. So you would be in for a surprise if you try indexing values inside a struct using pointer / array indexing.

Besides the issues in Fayyazki's answer there are several additional hurdles to the arrangement you describe.
Synchronisation: In all likelihood, you will need a pair of doorbell interrupts so that the CPU and DSP can interrupt each other to notify each other of communication. The alternative is polling - which is unlikely to result in high throughput.
Atomicity: Writing multi-word structures (e.g. C++ classes) to shared memory is problematic as you have none of the synchronisation mechanisms you might use in other circumstances such as bus-locking or disabling interrupts. You could use spin-locks to control read and write access if you insist on writing multiword structures to the shared memory. Controlling the memory packing of your structures may in fact result in more multi-word accesses.
Concurrent memory access - if you are using true multi-port SRAM, concurrent writes on the same address will be non-predictable. You therefore don't do this.
Memory/instruction order barriers: You will need to use memory barriers to ensure that writes are observed on the memory bus in the order you expect (many Cortex A-series ARM CPUs have out-of-order execution and store behaviour). This is in addition to either making the address range of the memory uncachable or flushing the cache.
By far the easiest approach to implementing communications between the two processors is to use a ring buffer in which the writing head and tail pointers are the natural word-size of the memory and are maintained by the writing processor and reading processor respectively. You will need barrier (both memory and instruction order) after writing data to the buffer and updating the head pointer to ensure writes into the buffer are observed by memory before updating the head pointer. Reads work in reverse.

ARM's weakly-ordered memory model means you'll likely have to be very careful with synchronisation and coherency if you're using cacheable memory - if this is the case I'd recommend a thorough read of the brain-melting memory model chapters of the ARM ARM (and possibly the barriers appendix).
Also, if the two processors have different views of the same memory (à la Raspberry Pi), then having pointers in the shared data could be fun*...
* for some given value of "fun"

Thread-safety of sub-word-size flags

Consider the following scenario:
There exists a globally accessible variable F.
Thread A repeatedly assigns a random value to F (without any regard to the previous value of F).
Thread B does the same thing as thread A (independent of A).
Thread C repeatedly reads the value of F (and say, prints it).
This is in C++ (Visual C++) on Windows on x64 architecture, multi-processor. F is of type bool and marked volatile, and none of the accesses are protected by any locks.
Question: Is there anything thread-unsafe about this scenario?
Assuming that the logical behavior of the code is valid, is there anything unsafe about the fact that multiple threads are reading and writing the values to the same location at the same time?
What guarantees can be made (across architectures, OSes, compilers) about the atomicity of reading from and writing to variables that are <= word-size on the platform? (I am assuming word-size is important...)
On a related note, what is an acceptible way of communicating between threads the state of completion of some operation (none of the threads are waiting for the operation to complete, they may just be interested in querying the state from time to time)?

It depends on your thread-safety requirements. You'll always get a consistent value (that is, it is impossible that you get half the value of thread A's write and the other half from thread B), but there is no guarantee that the value you'll read is, in fact, the latest one that was logically written.
The problem here is the CPU cache that may or may not get flushed. When a thread writes to the memory, the value first goes to the cache, and eventually it gets written to memory. In the while, if other cores attempt to read the object from memory, they'll get the old value.

Under x86, any read or write to a type correctly aligned for its size is considered to be atomic (so for a bool it need only be a 1 byte alignment boundery), but its recommended to use explict atomic ops for portability and the memory barriers they provide. An excerpt from the Intel System Programming Guide, Vol 3A, Section 8. May 2011 (there is another one as well, can't find it atm).
The Intel486 processor (and newer processors since) guarantees that the following
basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following
additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following
additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache
line
Microsoft also has a few examples of using volatitle bool's for signaling thread exits, however, if you want to signal threads that are waiting, its best to use kernel constructs, on windows this would be a event (see CreateEventA/W), this will prevent the wait from burning cpu cycles when the variable hasn't been set yet.
Update:
For threads that will have almost zero wait time, its a good idea to implement a user level lock, with optional backoff if its a high contention enviroment, intel has a good article on that here, alternatively you can use WinAPI's CriticalSections (these are semi-kernel level).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js