OpenMP thread affinities with static variables - c++

I have a C++ class C which contains some code, including a static variable which is meant to only be read, and perhaps a constexpr static function. For example:
template<std::size_t T>
class C {
public:
    // some functions
    void func1();
    void func2();
    static constexpr std::size_t sfunc1() { return T; }
private:
    std::size_t var1;
    std::array<std::size_t, 10000> array1;
    static int svar1;
};
The idea is to use the thread-affinity mechanisms of OpenMP 4.5 to control the socket (on a NUMA architecture) where the various instances of this class are executed, and therefore also to place each instance in memory close to that socket, avoiding the interconnect between NUMA nodes. My understanding is that, since this code contains a static variable, that variable is effectively shared between all class instances, so I won't have control over the memory location where the static variable is placed upon thread creation. Is this correct? But I presume the other, non-static variables will be located at memory locations close to the socket being used? Thanks

You have to assume that the thread stack, thread-bound malloc, and thread-local storage will allocate from the thread's "local" memory, so any auto or new variables should at least be well placed for the thread they were created on (though I don't know which compilers support that kind of allocation model). But, as you say, static non-const data can only exist in one location. I guess if the compiler recognises const segments, or constructed-const segments, then after construction they could be duplicated per zone and then mapped to the same logical address? Again, I don't know whether compilers do that automagically.
Non-const statics are going to be troublesome. Presumably these statics help perform some sort of thread synchronisation. If they contain flags that are read often and written rarely, then for best performance it may be better for the writer to write to a number of registered copies (one per zone), with each thread using a thread-local pointer to its zone's copy, rather than having half (or three-quarters of) the readers always be slow. Of course, that ceases to be a simple atomic write, and a single mutex just puts you back where you started. I suspect this is roll-your-own-code land.
The simple case that shouldn't be forgotten: if objects are passed between threads, then potentially a thread could be accessing a non-local object.
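The per-zone replicated-flag scheme described above can be sketched in plain C++. This is only a sketch: kZones and the zone argument are assumptions (a real implementation would query the NUMA topology, e.g. via libnuma, and derive the zone from the CPU the thread runs on):

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical zone count; a real version would query the NUMA topology.
constexpr std::size_t kZones = 2;

// One cache-line-aligned copy of the flag per NUMA zone, so readers in
// each zone hit a local copy instead of pulling a remote cache line
// across the interconnect. 64 bytes is an assumed cache-line size.
struct alignas(64) ZoneFlag {
    std::atomic<int> value{0};
};
std::array<ZoneFlag, kZones> g_flag_copies;

// The rare writer updates every zone's copy.
void publish(int v) {
    for (auto& copy : g_flag_copies)
        copy.value.store(v, std::memory_order_release);
}

// Each reader uses its own zone's copy; here the zone id is passed in,
// but a real version would cache a thread-local pointer to it.
int read_local(std::size_t zone) {
    return g_flag_copies[zone].value.load(std::memory_order_acquire);
}
```

As the answer notes, the write is no longer a single atomic store, so a writer racing with another writer still needs its own coordination.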

Related

is it ok that multi threads write to different variables of a shared object in c++

so this shared object is like this (just for demo, not a working one):
class Shared {
    int var1;
    int var2;
public:
    void setter1(int var) {
        var1 = var;
    }
    void setter2(int var) {
        var2 = var;
    }
};
And can thread1 do shared->setter1(3) while thread2 do shared->setter2(2) at the same time without any race condition or problems?
Yes, you can access individual subobjects of a Shared object in two independent threads without synchronization; that will not be a race condition. The layman's explanation is given on cppreference; here is a partial quote:
Different threads of execution are always allowed to access (read and
modify) different memory locations concurrently, with no interference
and no synchronization requirements.
However, watch out for false sharing! Your code seems to be prone to it.
There is no sharing in your example. When you're talking about what is shared between threads, then the only thing that matters is memory location.
If you have a variable of type Shared s;, then s.var1 and s.var2 are two different memory locations. If s.var1 is only ever accessed by one thread, and s.var2 is only ever accessed by some other thread, then neither of those memory locations is shared.
But do watch out for the false sharing that @SergeyA warned you about. It won't impact the correctness of your program, but it could impact the performance.
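A common mitigation for false sharing is to pad each member onto its own cache line. A minimal sketch, assuming a 64-byte cache line (the driver function is mine, just for demonstration):

```cpp
#include <thread>

// Without the alignas, var1 and var2 would likely share one cache line:
// each thread's writes would keep invalidating the other core's cached
// copy. That is still correct, just slow. Padding each member to its
// own 64-byte line (an assumed cache-line size) avoids the ping-pong.
struct SharedPadded {
    alignas(64) int var1 = 0;
    alignas(64) int var2 = 0;
    void setter1(int v) { var1 = v; }
    void setter2(int v) { var2 = v; }
};

// Hypothetical driver: two threads write to different members, as in
// the question, with no synchronization needed.
SharedPadded make_and_run() {
    SharedPadded s;
    std::thread t1([&] { s.setter1(3); });
    std::thread t2([&] { s.setter2(2); });
    t1.join();
    t2.join();
    return s;
}
```

C++17 also offers std::hardware_destructive_interference_size as a portable stand-in for the hard-coded 64, where the library supports it.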

C++ threads stack address range

Does the C++ standard provide a guarantee about the non-overlapping nature of thread stacks (as started by a std::thread)? In particular, is there a guarantee that threads will have their own exclusive, allocated range in the process's address space for the thread stack? Where is this described in the standard?
For example
#include <atomic>
#include <bit>
#include <cstdint>
#include <iostream>
#include <thread>

std::uintptr_t foo() {
    auto integer = int{0};
    return std::bit_cast<std::uintptr_t>(&integer);
}

void bar(std::uint64_t id, std::atomic<std::uint64_t>& atomic) {
    while (atomic.load() != id) {}
    std::cout << foo() << std::endl;
    atomic.fetch_add(1);
}

int main() {
    auto atomic = std::atomic<std::uint64_t>{0};
    auto one = std::thread{[&]() { bar(0, atomic); }};
    auto two = std::thread{[&]() { bar(1, atomic); }};
    one.join();
    two.join();
}
Can this ever print the same value twice? It feels like the standard should provide this guarantee somewhere, but I'm not sure.
The C++ standard does not even require that function calls are implemented using a stack (or that threads have stack in this sense).
The current C++ draft says this about overlapping objects:
Two objects with overlapping lifetimes that are not bit-fields may have the same address if one is nested within the other, or if at least one is a subobject of zero size and they are of different types; otherwise, they have distinct addresses and occupy disjoint bytes of storage.
And in the (non-normative) footnote:
Under the “as-if” rule an implementation is allowed to store two objects at the same machine address or not store an object at all if the program cannot observe the difference ([intro.execution]).
In your example, I do not think the threads synchronize properly, as probably intended, so the lifetimes of the integer objects do not necessarily overlap, so both objects can be put at the same address.
If the code were fixed to synchronize properly and foo were manually inlined into bar, in such a way that the integer object still exists when its address is printed, then there would have to be two objects allocated at different addresses because the difference is observable.
However, none of this tells you whether stackful coroutines can be implemented in C++ without compiler help. Real-world compilers make assumptions about the execution environment that are not reflected in the C++ standard and are only implied by the ABI standards. Particularly relevant to stack-switching coroutines is the fact that the address of the thread descriptor and thread-local variables does not change while executing a function (because they can be expensive to compute and the compiler emits code to cache them in registers or on the stack).
This is what can happen:
Coroutine runs on thread A and accesses errno.
Coroutine is suspended from thread A.
Coroutine resumes on thread B.
Coroutine accesses errno again.
At this point, thread B will access the errno value of thread A, which might well be doing something completely different at this point with it.
This problem is avoided if a coroutine is only ever resumed on the same thread on which it was suspended, which is very restrictive and probably not what most coroutine library authors have in mind. The worst part is that resuming on the wrong thread is likely to appear to work most of the time, because some widely-used thread-local variables (such as errno) that are not quite thread-local do not immediately result in obviously buggy programs.
For all the Standard cares, implementations could call new __StackFrameFoo when foo() needs a stack frame. Where those end up, who knows.
The chief rule is that different objects have different addresses, and that includes objects which "live on the stack". But the rule only applies to two objects which exist at the same time, and even then only as far as the comparison is done with proper thread synchronization. And of course, comparing addresses does hinder the optimizer, which might need to assign an address to an object that could otherwise be optimized out.
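To see the distinct-address rule apply, the two local objects must genuinely be alive at the same time. A small sketch (the atomic handshake is mine, not from the question): each thread publishes the address of its local, then spins until the peer has published too, so both locals overlap in lifetime when the comparison happens.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

std::atomic<std::uintptr_t> addr_a{0}, addr_b{0};

// Publish this thread's local address, then wait for the peer, so that
// both 'local' objects are provably alive at the same time.
void worker(std::atomic<std::uintptr_t>& mine,
            std::atomic<std::uintptr_t>& other) {
    int local = 0;
    mine.store(reinterpret_cast<std::uintptr_t>(&local));
    while (other.load() == 0) {}  // spin until the peer has published
    // 'local' is still alive here, so the lifetimes overlap.
}

bool addresses_differ() {
    std::thread a([] { worker(addr_a, addr_b); });
    std::thread b([] { worker(addr_b, addr_a); });
    a.join();
    b.join();
    return addr_a.load() != addr_b.load();
}
```

With the lifetimes overlapping and the addresses observed, an implementation can no longer fold the two objects onto one address.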

C++ thready safety of a static member variable

For example, I have this class:
class Example
{
public:
    static int m_test;
};
I have threads A and B both using this static member variable. Is this member variable somehow thread-safe under the hood?
I would assume it is not, since it is statically allocated and therefore both threads would be accessing the same memory location, possibly causing collisions. Is that correct or there is some hidden mechanism that makes this static member thread-safe?
No, it is not thread-safe, insofar as there is no built-in mechanism to prevent data races.
static std::atomic<int> m_test; would be, though.
Note that there is also the thread_local storage duration - not of use to you in this instance - but if you used that rather than static, then every thread would get its own m_test.
It is safe if both threads just read that variable. If at least one updates it, then it's a data race -> undefined behavior.
The hidden mechanism is atomic operations - e.g., making this variable of type std::atomic<int> in C++11.
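A minimal sketch of the std::atomic<int> suggestion; the driver function is mine, just to show that concurrent increments are well-defined:

```cpp
#include <atomic>
#include <thread>
#include <vector>

class Example {
public:
    static std::atomic<int> m_test;  // atomic: concurrent access is a defined operation
};
std::atomic<int> Example::m_test{0};  // out-of-class definition, as for any static member

// Hypothetical driver: n_threads threads each add per_thread increments.
int increment_from_threads(int n_threads, int per_thread) {
    std::vector<std::thread> threads;
    for (int i = 0; i < n_threads; ++i)
        threads.emplace_back([per_thread] {
            for (int j = 0; j < per_thread; ++j)
                Example::m_test.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& t : threads) t.join();
    return Example::m_test.load();  // joins synchronize, so this sees every increment
}
```

With a plain static int the same loop would be a data race and undefined behavior.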

Implementing Thread Local Storage in Software

We are porting an embedded application from Windows CE to a different system. The current processor is an STM32F4. Our current codebase heavily uses TLS. The new prototype is running KEIL CMSIS RTOS which has very reduced functionality.
On http://www.keil.com/support/man/docs/armcc/armcc_chr1359124216560.htm it says that thread local storage is supported since 5.04. Right now we are using 5.04. The problem is that when linking our program with a variable definition of __thread int a; the linker cannot find __aeabi_read_tp which makes sense to me.
My question is: Is it possible to implement __aeabi_read_tp and it will work or is there more to it?
If it simply is not possible for us: Is there a way to implement TLS only in software? Let's not talk about performance there for now.
EDIT
I tried implementing __aeabi_read_tp by looking at old source of freeBSD and other sources. While the function is mostly implemented in assembly I found a version in C which boils down to this:
extern "C"
{
    extern osThreadId svcThreadGetId(void);

    void *__aeabi_read_tp()
    {
        return (void *) svcThreadGetId();
    }
}
What this basically does is give me the ID (void*) of my currently executing thread. If I understand correctly that is what we want. Can this possibly work?
Not considering performance, and not going into CMSIS RTOS specifics (which are unknown to me): you can allocate the space needed for your variables - either on the heap or as a static or global variable - I would suggest an array of structures. Then, when you create a thread, pass a pointer to the next unused structure to your thread function.
In the case of a static or global variable, it helps to know how many threads run in parallel, so you can limit the size of the preallocated memory.
EDIT: Added sample of TLS implementation based on pthreads:
#include <pthread.h>

#define MAX_PARALLEL_THREADS 10

struct tls_data { int value; /* per-thread fields - code omitted */ };

static pthread_t threads[MAX_PARALLEL_THREADS];
static struct tls_data tls_data[MAX_PARALLEL_THREADS];
static int tls_data_free_index = 0;

static void *worker_thread(void *arg) {
    /* Not static: each thread must get its own pointer to its slot. */
    struct tls_data *data = (struct tls_data *) arg;
    /* Code omitted. */
    return NULL;
}

static int spawn_thread(void) {
    if (tls_data_free_index >= MAX_PARALLEL_THREADS) {
        /* Consider increasing MAX_PARALLEL_THREADS */
        return -1;
    }
    /* Prepare thread data - code omitted. */
    pthread_create(&threads[tls_data_free_index], NULL,
                   worker_thread, &tls_data[tls_data_free_index]);
    tls_data_free_index++;
    return 0;
}
The not-so-impressive solution is a std::map<threadID, T>. Needs to be wrapped with a mutex to allow new threads.
For something more convoluted, see this idea
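The std::map idea above can be sketched like this. Names are illustrative; every access pays for the mutex, which is exactly why it is "not so impressive":

```cpp
#include <map>
#include <mutex>
#include <thread>
#include <utility>

// Minimal software TLS: one slot per thread id, guarded by a mutex.
template <typename T>
class SoftTls {
public:
    // Returns this thread's slot, default-constructing it on first use.
    // std::map references stay valid while other threads insert.
    T& local() {
        std::lock_guard<std::mutex> lock(mutex_);
        return slots_[std::this_thread::get_id()];
    }
private:
    std::mutex mutex_;
    std::map<std::thread::id, T> slots_;
};

SoftTls<int> g_counter;

// Hypothetical driver: each thread increments only its own slot and
// reports the value it ends up seeing.
std::pair<int, int> run_two_threads() {
    int r1 = 0, r2 = 0;
    std::thread a([&] {
        for (int i = 0; i < 5; ++i) ++g_counter.local();
        r1 = g_counter.local();
    });
    std::thread b([&] {
        for (int i = 0; i < 7; ++i) ++g_counter.local();
        r2 = g_counter.local();
    });
    a.join();
    b.join();
    return {r1, r2};
}
```

A production version would also need a way to reclaim slots when threads exit, which real TLS implementations handle via thread-exit hooks.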
I believe this is possible, but probably tricky.
Here's a paper describing how __thread or thread_local behaves in ELF images (though it doesn't talk about ARM architecture for AEABI):
https://www.akkadia.org/drepper/tls.pdf
The executive summary is:
The linker creates .tbss and/or .tdata sections in the resulting executable to provide a prototype image of the thread local data needed for each thread.
At runtime, each thread control block (TCB) has a pointer to a dynamic thread-local vector table (dtv in the paper) that contains the thread-local storage for that thread. It is lazily allocated and initialized the first time a thread attempts to access a thread-local variable. (presumably by __aeabi_read_tp())
Initialization copies the prototype .tdata image and memsets the .tbss image into the allocated storage.
When source code accesses a thread-local variable, the compiler generates code to read the thread pointer from __aeabi_read_tp() and do all the appropriate indirection to get at the storage for that thread-local variable.
The compiler and linker are doing all the work you'd expect, but you need to initialize and return a "thread pointer" that is properly structured and filled out the way the compiler expects, because the generated instructions follow the hops directly.
There are a few ways that TLS variables are accessed, as mentioned in this paper, which, again, may or may not totally apply to your compiler and architecture:
http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt
But, the problems are roughly the same. When you have runtime-loaded libraries that may bring their own .tbss and .tdata sections, it gets more complicated. You have to expand the thread-local storage for any thread that suddenly tries to access a variable introduced by a library loaded after the storage for that thread was initialized. The compiler has to generate different access code depending on where the TLS variable is declared. You'd need to handle and test all the cases you would want to support.
It's years later, so you probably already solved or didn't solve your problem. In this case, it is (was) probably easiest to use your OS's TLS API directly.

Is this way of creating static instance thread safe?

I have the following sample C++ code:
class Factory
{
public:
    static Factory& createInstance()
    {
        static Factory fac;
        return fac;
    }

private:
    Factory()
    {
        // Does something non-trivial
    }
};
Let's assume that createInstance is called by two threads at the same time. So will the resulting object be created properly? What happens if the second thread enters the createInstance call when the first thread is in the constructor of Factory?
C++11 and above: local static creation is thread-safe.
The standard guarantees that:
The creation is synchronized.
Should the creation throw an exception, then the next time the flow of execution passes the variable's definition point, creation will be attempted again.
It is generally implemented with double-checking:
first a thread-local flag is checked, and if set, then the variable is accessed.
if not yet set, then a more expensive synchronized path is taken, and if the variable is created afterward, the thread-local flag is set.
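Hand-rolled, the double-checked pattern described above looks roughly like this. Only a sketch: real compilers emit ABI helpers (e.g. __cxa_guard_acquire) rather than this code, and here a plain atomic flag stands in for the implementation's guard:

```cpp
#include <atomic>
#include <mutex>
#include <new>

class Factory {
public:
    Factory() { /* does something non-trivial */ }
};

// Storage and guard state for the "local static", at namespace scope
// for the sketch.
alignas(Factory) static unsigned char storage[sizeof(Factory)];
static std::atomic<bool> initialized{false};
static std::mutex init_mutex;

Factory& createInstance() {
    // Fast path: a single acquire load once the object exists.
    if (!initialized.load(std::memory_order_acquire)) {
        // Slow path: serialize competing first callers.
        std::lock_guard<std::mutex> lock(init_mutex);
        if (!initialized.load(std::memory_order_relaxed)) {
            // If the constructor throws, the flag stays false and the
            // next call through here retries - matching the standard's
            // retry-on-exception guarantee.
            new (storage) Factory();
            initialized.store(true, std::memory_order_release);
        }
    }
    return *reinterpret_cast<Factory*>(storage);
}
```

Every caller gets a reference to the same object, and construction happens exactly once even under contention.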
C++03 and C++98: the standard knows no thread.
There are no threads as far as the Standard is concerned, and therefore there is no provision in the Standard regarding synchronization across threads.
However, some compilers implement more than the standard mandates, either in the form of extensions or by giving stronger guarantees, so check the compilers you're interested in. If they are good-quality ones, chances are they will guarantee it.
Finally, it might not be necessary for it to be thread-safe. If you call this method before creating any thread, then you ensure that it will be correctly initialized before the real multi-threading comes into play, and you'll neatly side-step the issue.
Looking at this page, I'd say that this is not thread-safe, because the constructor could get called multiple times before the variable is finally assigned. An InterlockedCompareExchange() might be needed, where you create a local copy of the variable, then atomically assign the pointer to a static field via the interlocked function, if the static variable is null.
Of course it's thread-safe! Unless you are a complete lunatic and spawn threads from constructors of static objects, you won't have any threads until after main() is called, and the createInstance method just returns a reference to an already-constructed object, so there's no way this can fail. ISO C++ guarantees that the object will be constructed before its first use after main() is called: there's no assurance that this will be before main() is called, but it has to be before the first use, and so all systems will perform the initialisation before main() is called. Of course ISO C++ doesn't define behaviour in the presence of threads or dynamic loading, but all compilers for host-level machines provide this support and will try to preserve the semantics specified for singly-threaded, statically-linked code where possible.
The instantiation (first call) itself is thread-safe.
However, subsequent access will not be, in general. For instance, suppose after instantiation, one thread calls a mutable Factory method and another calls some accessor method in Factory, then you will be in trouble.
For example, if your factory keeps a count of the number of instances created, you will be in trouble without some kind of mutex around that variable.
However, if Factory is truly a class with no state (no member variables), then you will be okay.
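For the instance-count example above, the fix the answer alludes to looks like this (the method names and driver are mine, for illustration):

```cpp
#include <mutex>
#include <thread>
#include <vector>

class Factory {
public:
    static Factory& createInstance() {
        static Factory fac;  // thread-safe initialization since C++11
        return fac;
    }
    // Mutable state after initialization still needs its own guard.
    void noteInstanceCreated() {
        std::lock_guard<std::mutex> lock(m_mutex);
        ++m_count;
    }
    int count() {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_count;
    }
private:
    Factory() = default;
    std::mutex m_mutex;
    int m_count = 0;
};

// Hypothetical driver: many threads hit the singleton concurrently.
int hammer(int n_threads) {
    std::vector<std::thread> ts;
    for (int i = 0; i < n_threads; ++i)
        ts.emplace_back([] {
            Factory::createInstance().noteInstanceCreated();
        });
    for (auto& t : ts) t.join();
    return Factory::createInstance().count();
}
```

Without the mutex (or an std::atomic<int> counter), the ++m_count from concurrent threads would be a data race even though the initialization itself is safe.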