thread local and context switch - c++

I've got some C++ code making use of thread local storage; each thread has a vector it can push data into.
I use TLS to store an index ID per thread, which can be used to look up which vector to push data into. It then executes a fair amount of code which pushes data into the vector.
What I'm wondering is whether the OS might reschedule my code to execute on a different thread after it has acquired the pointer to the thread local object. (So far the code executes fine and I have not seen this happen.) But if it were possible, it would seem certain to break my program, since two threads could then end up with the same object.
Assuming this is true, it seems like this would be a problem for any code that uses TLS of any complexity. Is TLS only intended for simple objects where you don't take the address?
Thanks!

Thread Local Storage is just that - storage per thread. Each thread has its own private data structure. That thread, whichever processor it runs on, is the same thread. The OS doesn't schedule work WITHIN threads; it schedules which of the threads runs.
Thread local storage is accomplished by some sort of indirection, which is changed along with the thread itself. There are several ways to do this; for example, the OS may have a particular page at a particular offset from the start of virtual memory in the process, and when a thread is scheduled, the page table is updated to match the thread.
On x86 processors, FS or GS is typically used for "per-thread" data, so the OS will switch the FS register [or the base address that the register refers to, in the case of 64-bit processors] on a context switch. When reading the TLS, the compiler will use the FS or GS segment register to prefix the memory read/write operations, and thus you always get "your private data", not some other thread's.
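To make that concrete, here is a minimal sketch (my own example, not from the original post; names are made up): each thread gets its own copy of a thread_local variable no matter which core the OS migrates the thread to, because the compiler addresses it relative to FS/GS on x86-64.
#include <cstdio>
#include <thread>

thread_local int tls_index = -1;    // typically addressed via FS/GS on x86-64

void worker(int id) {
    tls_index = id;                 // writes only this thread's copy
    // ... arbitrary amount of work; the thread may migrate between cores ...
    std::printf("thread %d still sees %d\n", id, tls_index);
}

int main() {
    std::thread a(worker, 0), b(worker, 1);
    a.join();
    b.join();
}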
Of course, OSes may have bugs, but this is something quite a few things rely on, so if it were broken, it would show up pretty soon (unless it's very subtle, and you have to be standing in just the right place, with the moon in the right phase, wearing the right colour clothes, the wind in the right direction, the date divisible by both 3 and 7, etc., etc.).

TLS means thread local. From your description, each thread accesses a shared vector of vectors through TLS (I'm not sure); if so, you should use some kind of lock. Any sample code?

Related

Do I need to use a mutex to protect access to an array of mutexes from different threads?

Let's say I have a bunch of files, and an array with a mutex for each file. Now I have different threads reading from random files, but first they need to acquire the lock from the array. Should I have a lock on the entire array that must be acquired before taking the mutex for the particular file?
No, but note that by placing the mutexes close together on purpose, you pull the memory they live in into every thread's cache.
Keep each thread's memory accesses away from the memory that the other individual threads deal with.
Associate each thread with data that is as tightly packed (but aligned), and in as few cache lines, as possible. One mutex and one data set - nowhere close to where the other worker threads need access.
You can easily measure the effect by using a homemade std::hardware_destructive_interference_size like ... 64 (popular, non-scientific, but common).
Separate the data in such a way that no other thread needs to touch data within those 64 (or whatever number you come up with) bytes.
It's a "no kidding?" experience.
The number 64 is almost arbitrary. I can compile a program using that constant - but it will not be translated into something meaningful for a different target platform - it'll stay 64. It's a best guess.
Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size
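To illustrate the separation, a rough sketch (my own, with made-up names; the 64 is the "best guess" constant discussed above) of packing each thread's mutex and data into its own cache line:
#include <cstddef>
#include <mutex>

constexpr std::size_t kCacheLine = 64;   // best guess; see the discussion above

struct alignas(kCacheLine) PerThreadSlot {
    std::mutex m;      // this thread's / this file's mutex
    long data = 0;     // the data that mutex protects
};

PerThreadSlot slots[8];  // adjacent slots start on different cache lines, so
                         // thread i working on slots[i] never false-shares
                         // with thread j working on slots[j]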
No, accessing different elements of an array in different threads does not cause data races, and a mutex can be used by multiple threads without external synchronization, because it must be able to fulfill its purpose.
You do not need to add a lock for the array itself. The same is true for member functions of standard library containers that only access elements and do not modify the container itself.
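As a small illustration (my own sketch, hypothetical names): two threads locking and updating different elements of the same arrays need no lock around the arrays themselves.
#include <array>
#include <mutex>
#include <thread>

std::array<std::mutex, 4> file_locks;   // one mutex per file
std::array<int, 4>        counters{};   // per-file data guarded by each mutex

void work_on(int i) {
    std::lock_guard<std::mutex> g(file_locks[i]);  // take only "my" file's mutex
    ++counters[i];
}

int main() {
    std::thread a(work_on, 0), b(work_on, 3);      // distinct elements: no race
    a.join();
    b.join();
}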

memory range sharing in threads : ensure data is not stuck in cache

When sending the address of a memory location from one thread to another, how do I ensure that the data is not stuck in the CPU cache, and that the second thread actually reads the correct value? (I'm using a socketpair() to send the pointer from one thread to another.)
And a related question: how does the C++ compiler, along with the thread primitives, figure out which memory addresses need to be handled specially for synchronization?
struct Test { int fld; };

void thread_1() {
    Test *ptr1 = new Test;
    ptr1->fld = 100;
    ::write(write_fd, &ptr1, sizeof(ptr1));   // send the pointer value
}

void thread_2() {
    Test *ptr2;
    ::read(read_fd, &ptr2, sizeof(ptr2));     // receive the pointer value
    // WHAT MAGIC IS REQUIRED TO ENSURE THIS ?
    assert(ptr2->fld == 100);
}
If you want to pass the value between threads in the same process, I would use std::atomic<int> as the type of the field, along with the related setter and getter functions. Obviously, passing a pointer from one process to another doesn't work at all, unless it's from an area of memory that is guaranteed to have the same address in both processes - shared memory for example, but then you shouldn't need sockets...
Compilers do not, in general, know how to deal with caches, except for atomic types (technically, atomics are usually dealt with using separate instructions, rather than cache-flushing and cache-invalidation, and the processor hardware handles the relevant "talk to other processors about the cache content").
The OS (subject to bugs of course) does that sort of thing when passing data between processes - or within a process. But for passing pointers, you can't rely on that: the newly received pointer value is correct, but the content the pointer points at isn't cache-managed.
On some processors, you can use a memory barrier to ensure the correct ordering of memory operations between threads. This forces the processor to "perform all memory operations before this point". However, in the case of system calls like read and write, the OS should take care of that for you, and ensure that the memory has been properly written before read starts to read the memory it wants to store in the socket buffer, and write will have a memory barrier after it has stored your data (in this case the value of the pointer, but memory barriers affect all reads and/or writes that precede that point).
If you were to implement your own primitives for passing data, and the processors do not have cache coherency (most modern processors do), you would also need to add a cache flush on the writing side and a cache invalidate on the reading side. This is architecture dependent; there is no support for it in standard C or C++. (On some processors only OS functionality [kernel mode] can flush or invalidate cache content, on others it can be done in user-mode code. The granularity of such operations also varies: it may be necessary to flush or invalidate the entire cache system, or individual lines of 32, 64 or 128 bytes can be flushed at a time.)
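For the in-process case, here is a sketch of the question's code with the suggested std::atomic<int> field (it assumes write_fd/read_fd come from a socketpair() set up elsewhere; the explicit release/acquire orderings are my addition, not part of the answer):
#include <atomic>
#include <cassert>
#include <unistd.h>

struct Test { std::atomic<int> fld{0}; };

int write_fd, read_fd;   // assumed to be filled in by socketpair() elsewhere

void thread_1() {
    Test *ptr1 = new Test;
    ptr1->fld.store(100, std::memory_order_release);
    ::write(write_fd, &ptr1, sizeof(ptr1));          // send the pointer value
}

void thread_2() {
    Test *ptr2 = nullptr;
    ::read(read_fd, &ptr2, sizeof(ptr2));            // receive the pointer value
    assert(ptr2->fld.load(std::memory_order_acquire) == 100);
}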
In C++, you don't need to care about implementation details like caches. The only thing you need to do is to make sure there is a C++ happens-before relation.
As Mats Petersson's answer shows, std::atomic is one way to achieve that. All accesses to an atomic variable are ordered, although the order might not be statically determined (i.e. if you have two threads trying to write to the same atomic variable, you can't predict which write happens last).
Another mechanism to enforce synchronization is std::mutex. Threads can attempt to lock a mutex, but only one thread can have a mutex locked at a time. The other threads will block. The compiler will make certain that when one thread unlocks a mutex and the next thread locks the mutex, writes by the first thread can be read by the second thread. If this requires flushing the cache, the compiler will arrange that.
Yet another mechanism is std::atomic_thread_fence. This is useful if you have multiple objects shared between threads (all in the same direction). Instead of making them all atomic, you can make one of them atomic and "attach" a fence to that atomic variable. You then write to the atomic variable last, and read from it first. Obviously this is best encapsulated in a class.
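A minimal sketch of that "one atomic plus a fence" idea (my own example, with made-up names): the plain payload fields are written first and the atomic flag last; the reader checks the flag first and only then reads the payload.
#include <atomic>

struct Shared {
    int a = 0, b = 0;                  // plain, non-atomic payload
    std::atomic<bool> ready{false};    // the one atomic everything hangs on
};

void producer(Shared &s) {
    s.a = 1;
    s.b = 2;
    std::atomic_thread_fence(std::memory_order_release);  // payload before flag
    s.ready.store(true, std::memory_order_relaxed);
}

bool consumer(Shared &s, int &a, int &b) {
    if (!s.ready.load(std::memory_order_relaxed))
        return false;
    std::atomic_thread_fence(std::memory_order_acquire);  // flag before payload
    a = s.a;                            // now guaranteed to see 1 and 2
    b = s.b;
    return true;
}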

What is the best architecture to frequently communicate values between multiple threads?

I am writing an application in C++14 that consists of a master thread and multiple slave threads. The master thread coordinates the slave threads, which cooperatively perform a search, each exploring a part of the search space. A slave thread sometimes encounters a bound on the search. Then it communicates this bound to the master thread, which sends the bound to all other slave threads so that they can possibly narrow their searches.
A slave thread must very frequently check whether there is a new bound available, possibly at the entrance of a loop.
What would be the best way to communicate the bound to the slave threads? I can think of using std::atomic<int>, but I am afraid of the performance implications this has whenever the variable is read inside the loop.
The simplest way here is IMO to not overthink this. Just use a std::mutex for each thread, protecting a std::queue that the boundary information is put in. Have the main thread wait on a std::condition_variable; each child can take the lock, write to a "new boundary" queue, then signal the cv, at which point the main thread wakes up and copies the value to each child one at a time. As you said in your question, at the top of their loops the child threads can check their thread-specific queue to see if there are additional bounding conditions.
You actually don't NEED the "main thread" in this. You could have the children write to all other children's queues directly (still mutex-protected), as long as you're careful to avoid deadlock, it would work that way too.
All of these classes can be seen in the thread support library, with decent documentation here.
Yes, there are interrupt-based ways of doing things, but in this case polling is relatively cheap because it's not a lot of threads hammering on one mutex, but mostly thread-specific mutexes, and mutexes aren't all that expensive to lock, check and unlock quickly. You're not "holding" on to them for long periods, so it's OK. It's a bit of a test really: do you NEED the additional complexity of lock-free? If it's only a dozen (or fewer) threads, then probably not.
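A rough sketch of that layout (all names are made up; error handling and shutdown are omitted): each slave has a mutex-protected mailbox it polls at the top of its loop, and the master waits on a condition variable for bounds reported by slaves and fans them out.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

struct Mailbox {                         // one per slave thread
    std::mutex m;
    std::queue<int> bounds;
    bool poll(int &out) {                // called at the top of the search loop
        std::lock_guard<std::mutex> g(m);
        if (bounds.empty()) return false;
        out = bounds.front();
        bounds.pop();
        return true;
    }
};

struct Master {
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> incoming;            // bounds reported by slaves
    std::vector<Mailbox> *slaves = nullptr;

    void report(int bound) {             // called by a slave that found a bound
        { std::lock_guard<std::mutex> g(m); incoming.push(bound); }
        cv.notify_one();
    }

    void run() {                         // master loop: fan each bound out
        std::unique_lock<std::mutex> lk(m);
        for (;;) {
            cv.wait(lk, [this] { return !incoming.empty(); });
            int bound = incoming.front();
            incoming.pop();
            lk.unlock();
            for (auto &s : *slaves) {    // copy the bound to every mailbox
                std::lock_guard<std::mutex> g(s.m);
                s.bounds.push(bound);
            }
            lk.lock();
        }
    }
};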
Basically you could make a bet on your architecture that a single write to a primitive datatype is atomic. As you only have one writer, your program would not break if you used the volatile keyword to prevent compiler optimizations that might otherwise keep updates to it in registers rather than writing them out to memory.
However, everybody serious about doing things right(tm) will tell you otherwise. Have a look at this article to get a pretty good risk assessment: http://preshing.com/20130618/atomic-vs-non-atomic-operations/
So if you want to be on the safe side, which I recommend, you need to follow the C++ standard. As the C++ standard does not guarantee any atomicity even for the simplest operations, you are stuck with using std::atomic. But honestly, I don't think it is too bad. Sure there is a lock involved, but you can balance out the reading frequency with the benefit of knowing the new boundary early.
To prevent polling the atomic variable, you could use the POSIX signal mechanism to notify slave threads of an update (make sure it works with the platform you are programming for). If that benefits performance or not needs to be seen.
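For comparison, the std::atomic route is only a handful of lines (a sketch under the question's single-writer assumption; the relaxed orderings are my choice): a relaxed load usually compiles to an ordinary load on common architectures, so the per-iteration cost in the hot loop is small.
#include <atomic>
#include <climits>

std::atomic<int> best_bound{INT_MAX};    // written by the master, read by slaves

void slave_iteration() {
    int bound = best_bound.load(std::memory_order_relaxed);  // cheap per-loop check
    // ... prune the local search using `bound` ...
    (void)bound;
}

void master_update(int new_bound) {
    best_bound.store(new_bound, std::memory_order_relaxed);
}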
This is actually very simple. You only have to be aware of how things work to be confident the simple solution is not broken. So, what you need is two things:
1. Be sure the variable is written/read to/from memory every time you access it.
2. Be sure you read it in an atomic way, which means you have to read the full value in one go, or if it is not done naturally, have a cheap test to verify it.
To address #1, you have to declare it volatile. Make sure the volatile keyword is applied to the variable itself, not its pointer or anything like that.
To address #2, it depends on the type. On x86/64, accesses to integer types are atomic as long as they are aligned to their size. That is, int32_t has to be aligned to a 4 byte boundary, and int64_t has to be aligned to an 8 byte boundary.
So you may have something like this:
struct Params {
    volatile uint64_t bound __attribute__((aligned(8)));
};
If your bounds variable is more complex (a struct) but still fits in 64 bits, you may union it with uint64_t and use the same attribute and volatile as above.
If it's too big for 64 bits, you will need some sort of lock to ensure you do not read a half-stale value. The best lock for your circumstances (single writer, multiple readers) is a sequence lock. A sequence lock is simply a volatile int, like above, that serves as the version of the data. Its value starts from 0 and advances by 2 on every update. You increment it by 1 before updating the protected value, and again afterwards. The net result is that even numbers are stable states and odd numbers are transient (value updating). In the readers you do this:
1. Read the version. If not changed - return
2. Read till you get an even number
3. Read the protected variable
4. Read the version again. If you get the same number as before - you're good
5. Otherwise - back to step 2
This is actually one of the topics in my next article. I'll implement that in C++ and let you know. Meanwhile, you can look at the seqlock in the linux kernel.
Another word of caution - you need compiler barriers between your memory accesses so that the compiler does not reorder things it should really not. That's how you do it in gcc:
asm volatile ("":::"memory");
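Putting the pieces together, here is a sketch of that sequence lock written with std::atomic rather than volatile plus the asm barrier (my translation, not code from the article; the payload uses relaxed atomics so the concurrent reads are not formally data races in ISO C++):
#include <atomic>

struct SeqLocked {
    std::atomic<unsigned> seq{0};     // even = stable, odd = write in progress
    std::atomic<int> lo{0}, hi{0};    // the "too big for 64 bit" payload, split in two
};

void publish(SeqLocked &s, int new_lo, int new_hi) {      // single writer
    unsigned v = s.seq.load(std::memory_order_relaxed);
    s.seq.store(v + 1, std::memory_order_relaxed);         // mark "updating"
    std::atomic_thread_fence(std::memory_order_release);
    s.lo.store(new_lo, std::memory_order_relaxed);
    s.hi.store(new_hi, std::memory_order_relaxed);
    s.seq.store(v + 2, std::memory_order_release);          // stable again
}

void snapshot(const SeqLocked &s, int &out_lo, int &out_hi) {  // any number of readers
    unsigned v1, v2;
    do {
        v1 = s.seq.load(std::memory_order_acquire);
        out_lo = s.lo.load(std::memory_order_relaxed);
        out_hi = s.hi.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        v2 = s.seq.load(std::memory_order_relaxed);
    } while (v1 != v2 || (v1 & 1));                          // retry on a torn read
}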

What is faster in CUDA: global memory write + __threadfence() or atomicExch() to global memory?

Assuming that we have lots of threads that will access global memory sequentially, which option performs faster overall? I'm in doubt because __threadfence() takes into account all shared and global memory writes, but the writes are coalesced. On the other hand, atomicExch() takes into account just the important memory addresses, but I don't know if the writes are coalesced or not.
In code:
array[threadIdx.x] = value;
Or
atomicExch(&array[threadIdx.x] , value);
Thanks.
On Kepler GPUs, I would bet on atomicExch since atomics are very fast on Kepler. On Fermi, it may be a wash, but given that you have no collisions, atomicExch could still perform well.
Please make an experiment and report the results.
Those two do very different things.
atomicExch ensures that no two threads try to modify a given cell at the same time. If such a conflict occurs, one or more threads may be stalled. If you know beforehand that no two threads access the same cell, there is no point in using any atomic... function.
__threadfence() delays the current thread (and only the current thread!) to ensure that any subsequent writes by the given thread do actually happen later.
As such, __threadfence() on its own, without any follow-up code is not very interesting.
For that reason, I don't think there is a point in comparing the efficiency of those two. Maybe if you could show a more concrete use case I could relate...
Note that neither of those actually gives you any guarantees on the actual order of execution of the threads.

Can shared memory be read and validated without mutexes?

On Linux I'm using shmget and shmat to setup a shared memory segment that one process will write to and one or more processes will read from. The data that is being shared is a few megabytes in size and when updated is completely rewritten; it's never partially updated.
I have my shared memory segment laid out as follows:
-------------------------
| t0 | actual data | t1 |
-------------------------
where t0 and t1 are copies of the time when the writer began its update (with enough precision such that successive updates are guaranteed to have differing times). The writer first writes to t1, then copies in the data, then writes to t0. The reader on the other hand reads t0, then the data, then t1. If the reader gets the same value for t0 and t1 then it considers the data consistent and valid, if not, it tries again.
Does this procedure ensure that if the reader thinks the data is valid then it actually is?
Do I need to worry about out-of-order execution (OOE)? If so, would the reader using memcpy to get the entire shared memory segment overcome the OOE issues on the reader side? (This assumes that memcpy performs its copy linearly, ascending through the address space. Is that assumption valid?)
Modern hardware is actually anything but sequentially consistent. Thus, this is not guaranteed to work as such if you don't execute memory barriers at the appropriate spots. Barriers are needed because the architecture implements a weaker shared memory consistency model than sequential consistency. This as such has nothing to do with pipelining or OoO, but with allowing multiple processors to efficiently access the memory system in parallel. See e.g. Shared memory consistency models: A tutorial. On a uniprocessor, you don't need barriers, because all the code executes sequentially on that one processor.
Also, there is no need to have two time fields; a sequence count is probably a better choice, as there is no need to worry whether two updates are so close that they get the same timestamp, and updating a counter is much faster than getting the current time. There is also no chance that the clock moves backwards in time, which might happen e.g. when ntpd adjusts for clock drift (though this last problem can be overcome on Linux by using clock_gettime(CLOCK_MONOTONIC, ...)). Another advantage of using sequence counters instead of timestamps is that you need only one sequence counter. The writer increments the counter both before writing the data and after the write is done. Then the reader reads the sequence number, checks that it's even, and if so reads the data, and finally reads the sequence number again and compares it to the first one. If the sequence number is odd, it means a write is in progress, and there is no need to read the data.
The Linux kernel uses a locking primitive called a seqlock that does something like the above. If you're not afraid of "GPL contamination", you can google for the implementation; as such it's trivial, but the trick is getting the barriers correct.
Joe Duffy gives the exact same algorithm and calls it: "A scalable reader/writer scheme with optimistic retry".
It works.
You need two sequence number fields.
You need to read and write them in opposite order.
You might need to have memory barriers in place, depending on the memory ordering guarantees of the system.
Specifically, you need read acquire and store release semantics for the readers and writers when they access t0 or t1 for reading and writing respectively.
What instructions are needed to achieve this depends on the architecture. E.g. on x86/x64, because of the relatively strong guarantees one needs no machine-specific barriers at all in this specific case*.
* one still needs to ensure that the compiler/JIT does not mess around with loads and stores, e.g. by using volatile (which has a different meaning in Java and C# than in ISO C/C++; compilers may differ, however). E.g. using VC++ 2005 or above, with volatile it would be safe doing the above - see the "Microsoft Specific" section. It can be done with other compilers as well on x86/x64; the emitted assembly code should be inspected, and one must make sure that accesses to t0 and t1 are not eliminated or moved around by the compiler.
As a side note, if you ever need MFENCE, the lock or [TopOfStack],0 instruction might be a better option, depending on your needs.