Intel TSX hardware transactional memory what do non-transactional threads see? - c++

Suppose you have two threads, one creates a TSX transaction, and modifies some data structure. The other thread does no synchronization of any kind and reads the same data structure. Is the transaction atomic to it? I can't actually imagine that it can be true, since there is no way afaik to block or restart it if it tries reading a cache line modified by the transaction.
If the transaction is not atomic, then are the write ordering rules on x86 still respected? If it sees write #2, then it is guaranteed that it must be able to see the previous write #1. Does this still hold for writes that happen as part of a transaction?
I could not find answers to these questions anywhere, and I kind of doubt anyone on SO would know either, but at least when somebody finds out this is a Google friendly place to put an answer.

(My answer is based on the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Chapter 12)
The transaction is atomic with respect to the read, in that the read will cause the transaction to abort, and thus the transaction appears never to have taken place. In the transactional region, cache lines that are read (tracked in the L1) form the read-set and lines written to form the write-set. If another processor reads from the write-set (which is your example) or writes to either the read-set or the write-set, then there is a data conflict.
Data conflicts are detected through the cache coherence protocol.
Data conflicts cause transactional aborts. In the initial
implementation, the thread that detects the data conflict will
transactionally abort.
Thus the thread attempting the transaction is tracking the line and will detect the conflict when the other thread makes its read request. It aborts and "the hardware will restart at the instruction address provided by the operation of the XBEGIN instruction". In this chapter, there are no distinctions as to what the second processor is doing. It does not matter whether it is attempting a transaction or performing a simple read.
To summarize, all threads (whether transactional or not) see either the full transaction or nothing. Only the thread in a TSX transaction can see the intermediate state of memory.
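For concreteness, here is a minimal sketch of what such a transactional region might look like using the RTM intrinsics. The shared counter and the fallback flag are invented for this illustration; the pattern (subscribe to the fallback lock inside the transaction, fall back on abort) is the usual lock-elision idiom, not something taken from the manual text above:

#include <immintrin.h>   // _xbegin, _xend, _xabort, _XBEGIN_STARTED (compile with -mrtm)
#include <atomic>

// Illustrative shared state; the names are made up for this sketch.
static int shared_counter = 0;
static std::atomic<bool> fallback_locked{false};

void increment_transactionally() {
    if (_xbegin() == _XBEGIN_STARTED) {
        // Reading the fallback flag adds it to our read-set, so a thread
        // taking the fallback path conflicts with us and aborts us.
        if (fallback_locked.load(std::memory_order_relaxed))
            _xabort(0xff);
        ++shared_counter;   // joins the write-set
        _xend();            // commit: every other thread sees all of it or none of it
        return;
    }
    // Aborted (data conflict, capacity overflow, interrupt, ...): take the fallback path.
    while (fallback_locked.exchange(true, std::memory_order_acquire))
        ;                   // crude spin lock, good enough for a sketch
    ++shared_counter;
    fallback_locked.store(false, std::memory_order_release);
}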

Related

Is my understanding of transactional memory as described below correct?

I am trying to understand TM. I have read Ben's answer here and tried to understand some other articles on the Internet. I am still not quite sure I understood correctly, though. In my understanding, in transactional memory the threads may execute transactions in parallel. If two (or more) threads try to access the same transactional variable, all threads except one will abort the transaction and start over (at some point, not necessarily immediately). The one that doesn't abort updates the transactional variable.
So, in a nutshell, in TM all threads run in parallel and we hope there won't be any overlapping accesses to transactional variables; if there are, we only let one thread continue while the others roll back and retry. Is this understanding of TM correct?
That is a pretty good synopsis. The details are quite convoluted, and it is possible that some transactions cannot be expressed in a given TM monitor, which means that you may have to include two implementations of your transaction: an optimistic and a pessimistic one.
The cache is the underlying implementation; when you make a transactional reference to memory, the cache notes this and either generates an alarm (restart) when any of those references is modified, or rejects the transaction at commit (close) if any has been modified.
The number of transactional variables may have to, in general, be lower than your cache's associativity; otherwise they would evict one another from the cache, resulting in a transaction that could never complete.
How interrupts function in the midst of a transaction remains an open problem.
In short, it was a bit of a fascinating idea 20 years ago. As it nears general usability, it seems to have rapidly expanding hardware requirements. It may be more useful for warming cold climes than accelerating computer systems.

How do changes (read/writes) to std::atomic variables propagate across threads

I recently asked this question: do-i-need-to-use-memory-barriers-to-protect-a-shared-resource
To that question I got a very interesting answer that uses this hypothesis:
Changes to std::atomic variables are guaranteed to propagate across threads.
Why is this so? How is it done? How does this behavior fit within the MESI protocol?
They don't actually have to propagate; the cache coherency model (MESI or something more advanced) provides you a guarantee that the memory behaves coherently, almost as if it were flat and no cached copies existed. Sequential consistency adds to that a guarantee of the same observation order by all agents in the system (note that most CPUs don't provide sequential consistency through HW alone).
If a thread does a memory write (not even an atomic one), the core it runs on will fetch the line and obtain ownership over it. Once the write is done, any thread that attempts to observe the line is guaranteed to see the updated value, even if the line still resides in the modifying core; usually this is achieved by snooping that core and getting the line from it as a response. The cache coherency protocols guarantee that if such a modification exists locally in some core, any other core looking for that line is bound to see it eventually. To do that, the CPU may use snoop filters, directory management (often for cross-socket coherency), or other methods.
Now, you're asking why atomic is important? For two reasons. First, all of the above applies only if the variable resides in memory and not in a register. This is a compiler decision, so the correct type tells it to do so. Other paradigms (like OpenMP or POSIX threads) have other ways to tell the compiler that a variable needs to be shared through memory.
Second, modern cores execute operations out of order, and we don't want any other operation to pass that write and expose stale data. std::atomic tells the compiler to enforce the strongest memory ordering (through the use of explicit fencing or locking - check the generated assembly code), which means that the memory operations from all threads will have a single global ordering. If you didn't do that, strange things could happen, like core A and core B disagreeing on the order of two writes to the same location (meaning that they may see different final values in it).
Last, of course, is the actual atomicity: if your data type is not one whose accesses are guaranteed to be atomic, or it is not properly aligned, std::atomic also solves that problem for you (otherwise the coherency problem intensifies; think of a thread trying to change a value split across two cache lines, and different cores seeing partial values).
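A small sketch of the two guarantees described above (visibility through memory plus indivisible updates); the variable names and iteration counts are invented for illustration:

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> hits{0};   // atomic: kept in memory, each increment is one indivisible read-modify-write
int misses = 0;             // plain int: increments from two threads can be lost (and it's a data race)

void worker() {
    for (int i = 0; i < 100000; ++i) {
        hits.fetch_add(1, std::memory_order_relaxed);
        ++misses;           // load + add + store: another thread can interleave between the steps
    }
}

int main() {
    std::thread a(worker), b(worker);
    a.join(); b.join();
    std::cout << "hits   = " << hits.load() << '\n';   // always 200000
    std::cout << "misses = " << misses << '\n';        // typically less, and formally undefined behavior
}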

Can memory store be reordered really, in an OoOE processor?

We know that two instructions can be reordered by an OoOE processor. For example, there are two global variables shared among different threads.
int data;
bool ready;
A writer thread produce data and turn on a flag ready to allow readers to consume that data.
data = 6;
ready = true;
Now, on an OoOE processor, these two instructions can be reordered (instruction fetch, execution). But what about the final commit/write-back of the results? i.e., will the store be in-order?
From what I learned, this totally depends on a processor's memory model. E.g., x86/64 has a strong memory model, and reorder of stores is disallowed. On the contrary, ARM typically has a weak model where store reordering can happen (along with several other reorderings).
Also, my gut feeling tells me that I am right, because otherwise we wouldn't need a store barrier between those two instructions, as used in typical multi-threaded programs.
But here is what Wikipedia says:
.. In the outline above, the OoOE processor avoids the stall that
occurs in step (2) of the in-order processor when the instruction is
not completely ready to be processed due to missing data.
OoOE processors fill these "slots" in time with other instructions
that are ready, then re-order the results at the end to make it appear
that the instructions were processed as normal.
I'm confused. Is it saying that the results have to be written back in-order? Really, in an OoOE processor, can store to data and ready be reordered?
The simple answer is YES on some processor types.
Before the CPU, your code faces an earlier problem, compiler reordering.
data = 6;
ready = true;
The compiler is free to rearrange these statements since, as far as it knows, they do not affect each other (it is not thread-aware).
Now down to the processor level:
1) An out-of-order processor can process these instructions in different order, including reversing the order of the stores.
2) Even if the CPU performs them in order, the memory controller may not perform them in order, because it may need to flush or bring in new cache lines or do an address translation before it can write them.
3) Even if this doesn't happen, another CPU in the system may not see them in the same order. In order to observe them, it may need to bring in the modified cache lines from the core that wrote them. It may not be able to bring one cache line in earlier than another if it is held by another core, or if there is contention for that line by multiple cores, and its own out-of-order execution will read one before the other.
4) Finally, speculative execution on other cores may read the value of data before ready was set by the writing core, and by the time it gets around to reading ready, it was already set but data was also modified.
These problems are all solved by memory barriers. Platforms with weakly-ordered memory must make use of memory barriers to ensure memory coherence for thread synchronization.
The consistency model (or memory model) for the architecture determines what memory operations can be reordered. The idea is always to achieve the best performance from the code while preserving the semantics expected by the programmer. That is the point of the Wikipedia passage: the memory operations appear in order to the programmer, even though they may have been reordered. Reordering is generally safe when the code is single-threaded, as the processor can easily detect potential violations.
On x86, the common model is that writes are not reordered with other writes. Yet the processor is using out-of-order execution (OoOE), so instructions are being reordered constantly. Generally, the processor has several additional hardware structures to support OoOE, like a reorder buffer and a load-store queue. The reorder buffer ensures that all instructions appear to execute in order, such that interrupts and exceptions break at a specific point in the program. The load-store queue functions similarly, in that it can restore the order of memory operations according to the memory model. The load-store queue also disambiguates addresses, so that the processor can identify when operations are made to the same or different addresses.
Back to OoOE: the processor keeps tens to hundreds of instructions in flight in every cycle. Loads and stores are computing their addresses, etc. The processor may prefetch the cache lines for the accesses (which may include cache coherence), but it cannot actually access the line, either to read or to write, until it is safe (according to the memory model) to do so.
Inserting store barriers, memory fences, etc. tells both the compiler and the processor about further restrictions on reordering the memory operations. The compiler is part of implementing the memory model, as some languages like Java have a specific memory model, while others like C follow the rule that memory accesses should appear as if they were executed in order.
In conclusion, yes, the stores to data and ready can be reordered in an OoOE processor. But it depends on the memory model whether they actually are. So if you need a specific order, provide the appropriate indication using barriers etc., such that the compiler, processor, etc. will not choose a different order for higher performance.
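For the data/ready example from the question, one portable way to ask for that order in C++11 is to make the flag a std::atomic and use release/acquire (the default seq_cst also works). This is a sketch under those assumptions, not something prescribed by any of the answers here:

#include <atomic>

int data = 0;
std::atomic<bool> ready{false};

// Writer thread
void producer() {
    data = 6;                                      // plain store
    ready.store(true, std::memory_order_release);  // neither compiler nor CPU may sink the data store below this
}

// Reader thread
void consumer() {
    while (!ready.load(std::memory_order_acquire)) // later loads may not be hoisted above this
        ;                                          // spin until the flag is published
    int seen = data;                               // guaranteed to read 6 once ready was observed as true
    (void)seen;
}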
On a modern processor the store itself is asynchronous (think of it as submitting a change to the L1 cache and continuing execution; the cache system then propagates it further asynchronously). So changes to two objects that lie on different cache lines may become visible out of order from another CPU's perspective.
Furthermore, even the instructions that store those data can be executed out of order. For example, when two objects are stored "at the same time" but the cache line of one object is retained/locked by another CPU or by bus mastering, the other object may be committed earlier.
Therefore, to properly share data across threads, you need some kind of memory barrier, or you can make use of the transactional memory features found in the latest CPUs, such as TSX.
I think you're misinterpreting "appear that the instructions were processed as normal." What that means is that if I have:
add r1 + 7 -> r2
move r3 -> r1
and the order of those is effectively reversed by out-of-order execution, the value that participates in the add operation will still be the value of r1 that was present prior to the move. Etc. The CPU will cache register values and/or delay register stores to assure that the "meaning" of a sequential instruction stream is not changed.
This says nothing about the order of stores as visible from another processor.

Are POSIX' read() and write() system calls atomic?

I am trying to implement a database index based on the data structure (Blink tree) and algorithms suggested by Lehman and Yao in this paper. In page 2, the authors state that:
The disk is partitioned in sections of fixed size (physical pages; in this paper, these correspond to the nodes of the tree). These are the only units that can be read or written by a process. [emphasis mine] (...)
(...) a process is allowed to lock and unlock a disk page. This lock gives that process exclusive modification rights to that page; also, a process must have a page locked in order to modify that page. (...) Locks do not prevent other processes from reading the locked page. [emphasis mine]
I am not completely sure my interpretation is correct (I am not used to reading academic papers), but I think it can be concluded from the emphasized sentences that the authors mean the operations that read and write a page are assumed to be "atomic", in the sense that, if a process A has already begun reading (resp. writing) a page, another process B may not begin writing (resp. reading) that same page until A is done performing its read (resp. write) operation. Multiple processes simultaneously reading the same page is, of course, a legitimate condition, as is having multiple processes simultaneously performing arbitrary operations on exclusively different pages (process A on page P, process B on page Q, process C on page R, etc.).
Is my interpretation correct?
Can I assume POSIX' read() and write() system calls are "atomic" in the sense described above? Can I rely on these system calls having some internal logic to determine whether a specific read() or write() call should be temporarily blocked based on the position of the file descriptor and the specified size of the chunk to be read or written?
If the answer to the above questions is "No", how should I roll my own locking mechanism?
I don't believe the text you cite implies anything of the sort. It doesn't even mention read() or write() or POSIX. In fact, read() and write() cannot be relied on to be atomic. The only thing POSIX says is that write() must be atomic if the size of the write is no larger than PIPE_BUF bytes, and even that only applies to pipes.
I didn't read the context around the part of the paper you cited, but it sounds like the passage you cited is stating constraints which must be placed on an implementation in order for the algorithm to work correctly. In other words, it states that an implementation of this algorithm requires locking.
How you do that locking is up to you (the implementor). If we are dealing with a regular file and multiple independent processes, you might try fcntl(F_SETLKW)-style locking. If your data structure is in memory and you are dealing with multiple threads in the same process, something else might be appropriate.
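A sketch of what fcntl(F_SETLKW)-style page locking could look like for the regular-file, multi-process case; the page size, the helper names and the (absent) error handling are placeholders, not something prescribed by the paper:

#include <fcntl.h>
#include <unistd.h>

constexpr off_t kPageSize = 4096;   // placeholder "disk page" size

// type is F_RDLCK for shared (read) access or F_WRLCK for exclusive (write) access.
int lock_page(int fd, off_t page_no, short type) {
    struct flock fl {};
    fl.l_type   = type;
    fl.l_whence = SEEK_SET;
    fl.l_start  = page_no * kPageSize;
    fl.l_len    = kPageSize;
    return fcntl(fd, F_SETLKW, &fl);   // F_SETLKW blocks until the range becomes available
}

int unlock_page(int fd, off_t page_no) {
    struct flock fl {};
    fl.l_type   = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start  = page_no * kPageSize;
    fl.l_len    = kPageSize;
    return fcntl(fd, F_SETLK, &fl);
}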
Answers:
Reads concurrent with writes may see torn writes depending on the OS, the filing system, and what flags you opened the file with. A quick summary by flags, OS and filing system is below.
You can lock byte ranges in a file before accessing them using fcntl() on POSIX or LockFile() on Windows.
No O_DIRECT/FILE_FLAG_NO_BUFFERING:
Microsoft Windows 10 with NTFS: update atomicity = 1 byte
Linux 4.2.6 with ext4: update atomicity = 1 byte
FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite (*)
O_DIRECT/FILE_FLAG_NO_BUFFERING:
Microsoft Windows 10 with NTFS: update atomicity = up to 4096 bytes only if page aligned, otherwise 512 bytes if FILE_FLAG_WRITE_THROUGH off, else 64 bytes. Note that this atomicity is probably a feature of PCIe DMA rather than designed in (*).
Linux 4.2.6 with ext4: update atomicity = at least 1Mb, probably infinite (*). Note that earlier Linuxes with ext4 definitely did not exceed 4096 bytes, XFS certainly used to have custom locking but it looks like recent Linux has finally fixed this.
FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite (*)
You can see the raw empirical test results at https://github.com/BoostGSoC13/boost.afio/blob/master/fs_probe/fs_probe_results.yaml. The results were generated by a program using asynchronous file i/o on all platforms. Note that we test for torn offsets only on 512-byte multiples, so I cannot say whether a partial update of a 512-byte sector would tear during the read-modify-write cycle.

Is memory reordering visible to other threads on a uniprocessor?

It's common for modern CPU architectures to employ performance optimizations that can result in out-of-order execution. In single-threaded applications memory reordering may also occur, but it is invisible to the programmer: memory appears to be accessed in program order. And for SMP, memory barriers come to the rescue, which are used to enforce some sort of memory ordering.
What I'm not sure, is about multi-threading in a uniprocessor. Consider the following example: When thread 1 runs, the store to f could take place before the store to x. Let's say context switch happens after f is written, and right before x is written. Now thread 2 starts to run, and it ends the loop and print 0, which is undesirable of course.
// Both x, f are initialized w/ 0.
// Thread 1
x = 42;
f = 1;
// Thread 2
while (f == 0)
;
print x;
Is the scenario described above possible? Or is there a guarantee that physical memory is committed during thread context switch?
According to this wiki,
When a program runs on a single-CPU machine, the hardware performs
the necessary bookkeeping to ensure that the program execute as if all
memory operations were performed in the order specified by the
programmer (program order), so memory barriers are not necessary.
Although it didn't explicitly mention uniprocessor multi-threaded applications, it includes this case.
I'm not sure whether it's correct/complete. Note that this may depend heavily on the hardware (weak/strong memory model), so you may want to mention the hardware you have in mind in your answer. Thanks.
PS: Device I/O, etc. are not my concern here. And it's a single-core uniprocessor.
Edit: Thanks Nitsan for the reminder; we assume no compiler reordering here (just hardware reordering), and that the loop in thread 2 is not optimized away. Again, the devil is in the details.
As a C++ question the answer must be that the program contains a data race, so the behavior is undefined. In reality that means that it could print something other than 42.
That is independent of the underlying hardware. As has been pointed out, the loop can be optimized away and the compiler can reorder the assignments in thread 1, so that result can occur even on uniprocessor machines.
[I'll assume that by a "uniprocessor" machine you mean a processor with a single core and a single hardware thread.]
You now say that you want to assume compiler reordering or loop elimination does not happen. With this, we have left the realm of C++ and are really asking about the corresponding machine instructions. If you want to eliminate compiler reordering, we can probably also rule out any form of SIMD instructions and consider only instructions operating on a single memory location at a time.
So essentially thread1 has two store instructions in the order store-to-x store-to-f, while thread2 has test-f-and-loop-if-not-zero (this may be multiple instructions, but involves a load-from-f) and then a load-from-x.
On any hardware architecture I am aware of or can reasonably imagine, thread 2 will print 42.
One reason is that, if instructions processed by a single processor were not sequentially consistent among themselves, you could hardly assert anything about the effects of a program.
The only event that could interfere here is an interrupt (as is used to trigger a preemptive context switch). A hypothetical machine that stored the entire state of its current execution pipeline upon an interrupt and restored it upon return from the interrupt could produce a different result, but such a machine is impractical and, as far as I know, does not exist. These operations would create quite a bit of additional complexity and/or require additional redundant buffers or registers, all for no good reason - except to break your program. Real processors either flush or roll back the current pipeline upon interrupt, which is enough to guarantee sequential consistency for all instructions on a single hardware thread.
And there is no memory model issue to worry about. The weaker memory models originate from the per-core buffers and caches that sit between the hardware processors and the main memory or nth-level cache they actually share. A single processor has no similarly partitioned resources and no good reason to have them for multiple (purely software) threads. Again, there is no reason to complicate the architecture and waste resources to make the processor and/or memory subsystem aware of something like separate thread contexts if there aren't separate processing resources (processors/hardware threads) to keep those resources busy.
A strong memory ordering executes memory access instructions in the exact order defined by the program; this is often referred to as "program ordering".
A weaker memory ordering may be employed to allow the processor to reorder memory accesses for better performance; this is often referred to as "processor ordering".
AFAIK, the scenario described above is NOT possible on the Intel IA-32 architecture, whose processor ordering outlaws such cases. The relevant rule is (Intel IA-32 Software Developer's Manual, Vol. 3A, §8.2 Memory Ordering):
writes are not reordered with other writes, with the exception of streaming stores, CLFLUSH and string operations.
To illustrate the rule, it gives an example similar to this:
memory location x, y, initialized to 0;
thread 1:
mov [x] 1
mov [y] 1
thread 2:
mov r1 [y]
mov r2 [x]
r1 == 1 and r2 == 0 is not allowed
In your example, thread 1 cannot store f before storing x.
@Eric, in response to your comments:
The fast-string store instruction stosd may store the string out of order within its own operation. In a multiprocessor environment, when one processor stores a string str, another processor may observe str[1] being written before str[0], even though the logical order is presumed to be writing str[0] before str[1].
But these instructions are not reordered with any other stores, and they must have precise exception handling. When an exception occurs in the middle of a stosd, the implementation may choose to delay it so that all out-of-order sub-stores (not necessarily the whole stosd instruction) commit before the context switch.
Edited to address the claims that treat this as a C++ question:
Even if this is considered in the context of C++, as I understand it, a standard-conforming compiler should NOT reorder the assignments of x and f in thread 1.
§1.9.14
Every value computation and side effect associated with a full-expression is sequenced before every value
computation and side effect associated with the next full-expression to be evaluated.
This isn't really a C or C++ question, since you've explicitly assumed no load/store re-ordering, which compilers for both languages are perfectly allowed to do.
Allowing that assumption for the sake of argument, note that the loop may never exit anyway, unless you either:
give the compiler some reason to believe f may change (e.g., by passing its address to some non-inlineable function which could modify it),
mark it volatile, or
make it an explicitly atomic type and request acquire semantics
On the hardware side, your worry about physical memory being "committed" during a context switch isn't an issue. Both software threads share the same memory hardware and cache, so there's no risk of inconsistency there whatever consistency/coherence protocol pertains between cores.
Say both stores were issued, and the memory hardware decides to re-order them. What does this really mean? Perhaps f's address is already in cache, so it can be written immediately, but x's store is deferred until that cache line is fetched. Well, a read from x is dependent on the same address, so either:
the load can't happen until the fetch happens, in which case a sane implementation must issue the queued store before the queued load
or the load can peek into the queue and fetch x's value without waiting for the write
Consider anyway that the kernel pre-emption required to switch threads will itself issue whatever load/store barriers are required for consistency of the kernel scheduler state, and it should be obvious that hardware re-ordering can't be a problem in this situation.
The real issue (which you're trying to avoid) is your assumption that there is no compiler re-ordering: this is simply wrong.
You would only need a compiler fence. From the Linux kernel docs on Memory Barriers (link):
SMP memory barriers are reduced to compiler barriers on uniprocessor
compiled systems because it is assumed that a CPU will appear to be
self-consistent, and will order overlapping accesses correctly with
respect to itself.
To expand on that, the reason why synchronization is not required at the hardware level is that:
All threads on a uniprocessor system share the same memory and thus there are no cache-coherency issues (such as propagation latency) that can occur on SMP systems, and
Any out-of-order load/store instructions in the CPU's execution pipeline would either be committed or rolled back in full if the pipeline is flushed due to a preemptive context switch.
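A sketch of what a compiler-only barrier can look like with GCC/Clang inline assembly (the kernel's barrier() macro expands to essentially this). It relies on the uniprocessor reasoning above; portable multi-core C++ would still use std::atomic:

int data = 0;
int ready = 0;

// Tells the compiler not to move memory accesses across this point;
// it emits no machine instructions, so it costs nothing at run time.
#define COMPILER_BARRIER() asm volatile("" ::: "memory")

void writer() {
    data = 42;
    COMPILER_BARRIER();   // keep the store to data before the store to ready in the emitted code
    ready = 1;
}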
This code (in thread 2) may well never finish, as the compiler can decide to hoist the whole expression out of the loop (this is similar to using an isRunning flag which is not volatile).
That said, you need to worry about two types of reordering here: compiler and CPU; both are free to move the stores around. See http://preshing.com/20120515/memory-reordering-caught-in-the-act for an example. At this point the code you describe above is at the mercy of the compiler, the compiler flags, and the particular architecture. The wiki quoted is misleading, as it may suggest that internal reordering is not at the mercy of the CPU/compiler, which is not the case.
As far as the x86 is concerned, the out-of-order-stores are made consistent from the viewpoint of the executing code with regards to program flow. In this case, "program flow" is just the flow of instructions that a processor executes, not something constrained to a "program running in a thread". All the instructions necessary for context switching, etc. are considered part of this flow so the consistency is maintained across threads.
A context switch has to store the complete machine state so that it can be restored before the suspended thread resumes execution. The machine state includes the processor registers but not the processor pipeline.
If you assume no compiler reordering, this means that all hardware instructions that are "in flight" have to be completed before a context switch (i.e., an interrupt) takes place; otherwise they would be lost, since they are not stored by the context-switch mechanism. This is independent of hardware reordering.
In your example, even if the processor swaps the two hardware instructions "x=42" and "f=1", the instruction pointer is already past the second one, and therefore both instructions must be completed before the context switch begins. If it were not so, they would be lost, since the contents of the pipeline and of the cache are not part of the "context".
In other words, if the interrupt that causes the ctx switch happens when the IP register points at the instruction following "f=1", then all instructions before that point must have completed all their effects.
From my point of view, the processor fetches instructions one by one.
In your case, if "f = 1" was speculatively executed before "x = 42", that means both of these instructions are already in the processor's pipeline. The only possible way to schedule the current thread out is an interrupt, but the processor (at least on x86) will flush the pipeline's instructions before serving the interrupt.
So there is no need to worry about this reordering on a uniprocessor.