Implementing Thread Local Storage in Software

Implementing Thread Local Storage in Software - c++

We are porting an embedded application from Windows CE to a different system. The current processor is an STM32F4. Our current codebase heavily uses TLS. The new prototype is running KEIL CMSIS RTOS which has very reduced functionality.
On http://www.keil.com/support/man/docs/armcc/armcc_chr1359124216560.htm it says that thread local storage is supported since 5.04. Right now we are using 5.04. The problem is that when linking our program with a variable definition of __thread int a; the linker cannot find __aeabi_read_tp which makes sense to me.
My question is: Is it possible to implement __aeabi_read_tp and it will work or is there more to it?
If it simply is not possible for us: Is there a way to implement TLS only in software? Let's not talk about performance there for now.
EDIT
I tried implementing __aeabi_read_tp by looking at old source of freeBSD and other sources. While the function is mostly implemented in assembly I found a version in C which boils down to this:
extern "C"
{
extern osThreadId svcThreadGetId(void);
void *__aeabi_read_tp()
{
return (void*)svcThreadGetId();
}
}
What this basically does is give me the ID (void*) of my currently executing thread. If I understand correctly that is what we want. Can this possibly work?

Not considering the performance and not going into CMIS RTOS specifics (which are unknown to me), you can allocate space needed for your variables - either on heap or as static or global variable - I would suggest to have an array of structures. Then, when you create thread, pass the pointer to the next not used structure to your thread function.
In case of static or global variable, it would be good if you know how many threads are working in parallel for limiting the size of preallocated memory.
EDIT: Added sample of TLS implementation based on pthreads:
#include <pthread.h>
#define MAX_PARALLEL_THREADS 10
static pthread_t threads[MAX_PARALLEL_THREADS];
static struct tls_data tls_data[MAX_PARALLEL_THREADS];
static int tls_data_free_index = 0;
static void *worker_thread(void *arg) {
static struct tls_data *data = (struct tls_data *) arg;
/* Code omitted. */
}
static int spawn_thread() {
if (tls_data_free_index >= MAX_PARALLEL_THREADS) {
// Consider increasing MAX_PARALLEL_THREADS
return -1;
}
/* Prepare thread data - code omitted. */
pthread_create(& threads[tls_data_free_index], NULL, worker_thread, & tls_data[tls_data_free_index]);
}

The not-so-impressive solution is a std::map<threadID, T>. Needs to be wrapped with a mutex to allow new threads.
For something more convoluted, see this idea

I believe this is possible, but probably tricky.
Here's a paper describing how __thread or thread_local behaves in ELF images (though it doesn't talk about ARM architecture for AEABI):
https://www.akkadia.org/drepper/tls.pdf
The executive summary is:
The linker creates .tbss and/or .tdata sections in the resulting executable to provide a prototype image of the thread local data needed for each thread.
At runtime, each thread control block (TCB) has a pointer to a dynamic thread-local vector table (dtv in the paper) that contains the thread-local storage for that thread. It is lazily allocated and initialized the first time a thread attempts to access a thread-local variable. (presumably by __aeabi_read_tp())
Initialization copies the prototype .tdata image and memsets the .tbss image into the allocated storage.
When source code access thread-local variables, the compiler generates code to read the thread pointer from __aeabi_read_tp(), and do all the appropriate indirection to get at the storage for that thread-local variable.
The compiler and linker is doing all the work you'd expect it to, but you need to initialize and return a "thread pointer" that is properly structured and filled out the way the compiler expects it to be, because it's generating instructions directly to follow the hops.
There are a few ways that TLS variables are accessed, as mentioned in this paper, which, again, may or may not totally apply to your compiler and architecture:
http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt
But, the problems are roughly the same. When you have runtime-loaded libraries that may bring their own .tbss and .tdata sections, it gets more complicated. You have to expand the thread-local storage for any thread that suddenly tries to access a variable introduced by a library loaded after the storage for that thread was initialized. The compiler has to generate different access code depending on where the TLS variable is declared. You'd need to handle and test all the cases you would want to support.
It's years later, so you probably already solved or didn't solve your problem. In this case, it is (was) probably easiest to use your OS's TLS API directly.

Related

OpenMP thread affinities with static variables

I have a C++ class C which contains some code, including a static variable which is meant to only be read, and perhaps a constexpr static function. For example:
template<std::size_t T>
class C {
public:
//some functions
void func1();
void func2()
static constexpr std::size_t sfunc1(){ return T; }
private:
std::size_t var1;
std::array<std::size_t,10000> array1;
static int svar1;
}
The idea is to use the thread affinity mechanisms of openMP 4.5 to control the socket (NUMA architecture) where various instances of this class are executed (and therefore also place it in a memory location close to the socket to avoid using the interconnect between the NUMA nodes). It is my understanding that since this code contains a static variable it is effectively shared between all class instances so I won't have control of the memory location where the static variable will be placed, upon thread creation. Is this correct? But I presume the other non-static variables will be located at memory locations close to the socket being used? Thanks

You have to assume that the thread stack, thread-bound malloc, and thread local storage will allocate to the thread's "local" memory - so any auto or new variables should be optimised at least on the thread they were created on, though I don't know which compilers support that kind of allocation model; but as you say, static non-const data can only exist in one location. I guess if the compiler recognises const segments or constructed const segments, then after construction they could be duplicated per zone and then mapped to the same logical address? Again don't know if compilers are doing that automagically.
Non-const statics are going to be troublesome. Presumably these statics are helping to perform some sort of thread synchronisation. If they contain flags that are read often and written rarely then for best performance it may be better for the writer to write to a number of registered copies (one per zone) and each thread uses a thread-local pointer to the appropriate zone copy, than half (or 3/4) the readers are always slow. Of course, that ceases to be a simple atomic write, and a single mutex just puts you back where you started. I suspect this is roll-your-own code land.
The simple case that shouldn't be forgotten: if objects are passed between threads, then potentially a thread could be accessing a non-local object.

How do I safely share a variable with a thread that is in a different compilation unit?

In the structure of my program I've divided "where it gets called from" and "what gets done" into separate source files. As a matter of practicality, this allows me to compile the program as standalone or include it in a DLL. The code below is not the actual code but a simplified example that makes the same point.
There are 3 interacting components here: kernel mode program that loads my DLL, the DLL and its source files and the utility program with it's source, that is maintained separately.
In the DLL form, the program is loaded as a thread. According to the kernel mode application vendor's documentation, I loose the ability to call Win32 API functions after the initialization of the kernel program so I load the thread as an active thread (as opposed to using CREATE_SUSPENDED since I can't wake it).
I have it monitor a flag variable so that it knows when to do something useful through an inelegant but functional:
while ( pauseThreadFlag ) Sleep(1000);
The up to 1 second lag is acceptable (the overall process is lengthy, and infrequently called) and doesn't seem to impact the system.
In the thread source file I declare the variable as
volatile bool pauseThreadFlag = true;
Within the DLL source file I've declared
extern volatile bool pauseThreadFlag;
and when I am ready to have the thread execute, in the DLL I set
pauseThreadFlag = false;
I've had some difficulty in declaring std::string objects as volatile, so instead I have declared my parameters as global variables within the thread's source file and have the DLL call setters which reside in the thread's source. These strings would have been parameters if I could instantiate the thread at will.
(Missing from all of this is locking the variable for thread safety, which is my next "to do")
This strikes me as a bad design ... it's functional but convoluted. Given the constraints that I've mentioned is there a better way to go about this?
I was thinking that a possible revision would be to use the LPVOID lpParams variable given at thread creation to hold pointers to the string objects, even though the strings will be empty when the thread is created, and access them directly from the thread, that way erasing the declarations, setters, etc in the thread program altogether? If this works then the pause flag could also be referenced there, and the extern declarations eliminated (but I think it still needs to be declared volatile to hint the optimizer).
If it makes any difference, the environment is Visual Studio 2010, C++, target platform Win32 (XP).
Thanks!

If all components are running in kernel mode you will want to take a look at KeInitializeEvent, KeSetEvent, KeResetEvent and KeWaitForSingleObject. These all work in a similar fashion to their user mode equivalents.

I ended up removing the struct and replacing it with an object that encapsulates all the data. It's a little hideous, being filled with getters and setters, but in this particular case I'm using the access methods to make sure that locks are properly set/unset.
Using a void cast pointer to this object passed the object correctly and it seems quite stable.

Why do thread creation methods take an argument?

All thread create methods like pthread_create() or CreateThread() in Windows expect the caller to provide a pointer to the arg for the thread. Isn't this inherently unsafe?
This can work 'safely' only if the arg is in the heap, and then again creating a heap variable
adds to the overhead of cleaning the allocated memory up. If a stack variable is provided as the arg then the result is at best unpredictable.
This looks like a half-cooked solution to me, or am I missing some subtle aspect of the APIs?

Context.
Many C APIs provide an extra void * argument so that you can pass context through third party APIs. Typically you might pack some information into a struct and point this variable at the struct, so that when the thread initializes and begins executing it has more information than the particular function that its started with. There's no necessity to keep this information at the location given. For instance you might have several fields that tell the newly created thread what it will be working on, and where it can find the data it will need. Furthermore there's no requirement that the void * actually be used as a pointer, its a typeless argument with the most appropriate width on a given architecture (pointer width), that anything can be made available to the new thread. For instance you might pass an int directly if sizeof(int) <= sizeof(void *): (void *)3.
As a related example of this style: A FUSE filesystem I'm currently working on starts by opening a filesystem instance, say struct MyFS. When running FUSE in multithreaded mode, threads arrive onto a series of FUSE-defined calls for handling open, read, stat, etc. Naturally these can have no advance knowledge of the actual specifics of my filesystem, so this is passed in the fuse_main function void * argument intended for this purpose. struct MyFS *blah = myfs_init(); fuse_main(..., blah);. Now when the threads arrive at the FUSE calls mentioned above, the void * received is converted back into struct MyFS * so that the call can be handled within the context of the intended MyFS instance.

Isn't this inherently unsafe?
No. It is a pointer. Since you (as the developer) have created both the function that will be executed by the thread and the argument that will be passed to the thread you are in full control. Remember this is a C API (not a C++ one) so it is as safe as you can get.
This can work 'safely' only if the arg is in the heap,
No. It is safe as long as its lifespan in the parent thread is as long as the lifetime that it can be used in the child thread. There are many ways to make sure that it lives long enough.
and then again creating a heap variable adds to the overhead of cleaning the allocated memory up.
Seriously. That's an argument? Since this is basically how it is done for all threads unless you are passing something much more simple like an integer (see below).
If a stack variable is provided as the arg then the result is at best unpredictable.
Its as predictable as you (the developer) make it. You created both the thread and the argument. It is your responsibility to make sure that the lifetime of the argument is appropriate. Nobody said it would be easy.
This looks like a half-cooked solution to me, or am i missing some subtle aspects of the APIs?
You are missing that this is the most basic of threading API. It is designed to be as flexible as possible so that safer systems can be developed with as few strings as possible. So we now hove boost::threads which if I guess is build on-top of these basic threading facilities but provide a much safer and easier to use infrastructure (but at some extra cost).
If you want RAW unfettered speed and flexibility use the C API (with some danger).
If you want a slightly safer use a higher level API like boost:thread (but slightly more costly)
Thread specific storage with no dynamic allocation (Example)
#include <pthread.h>
#include <iostream>
struct ThreadData
{
// Stuff for my thread.
};
ThreadData threadData[5];
extern "C" void* threadStart(void* data);
void* threadStart(void* data)
{
intptr_t id = reinterpret_cast<intptr_t>(data);
ThreadData& tData = threadData[id];
// Do Stuff
return NULL;
}
int main()
{
for(intptr_t loop = 0;loop < 5; ++loop)
{
pthread_t threadInfo; // Not good just makes the example quick to write.
pthread_create(&threadInfo, NULL, threadStart, reinterpret_cast<void*>(loop));
}
// You should wait here for threads to finish before exiting.
}

Allocation on the heap does not add a lot of overhead.
Besides the heap and the stack, global variable space is another option. Also, it's possible to use a stack frame that will last as long as the child thread. Consider, for example, local variables of main.
I favor putting the arguments to the thread in the same structure as the pthread_t object itself. So wherever you put the pthread record, put its arguments as well. Problem solved :v) .

This is a common idiom in all C programs that use function pointers, not just for creating threads.
Think about it. Suppose your function void f(void (*fn)()) simply calls into another function. There's very little you can actually do with that. Typically a function pointer has to operate on some data. Passing in that data as a parameter is a clean way to accomplish this, without, say, the use of global variables. Since the function f() doesn't know what the purpose of that data might be, it uses the ever-generic void * parameter, and relies on you the programmer to make sense of it.
If you're more comfortable with thinking in terms of object-oriented programming, you can also think of it like calling a method on a class. In this analogy, the function pointer is the method and the extra void * parameter is the equivalent of what C++ would call the this pointer: it provides you some instance variables to operate on.

The pointer is a pointer to the data that you intend to use in the function. Windows style APIs require that you give them a static or global function.
Often this is a pointer to the class you are intending to use a pointer to this or pThis if you will and the intention is that you will delete the pThis after the ending of the thread.
Its a very procedural approach, however it has a very big advantage which is often overlooked, the CreateThread C style API is binary compatible so that when you wrap this API with a C++ class (or almost any other language) you can do this actually do this. If the parameter was typed, you wouldn't be able to access this from another language as easily.
So yes, this is unsafe but there's a good reason for it.

How to link non thread-safe library so each thread will have its own global variables from it?

I have a program that I link with many libraries. I run my application on profiler and found out that most of the time is spent in "waiting" state after some network requests.
Those requests are effect of my code calling sleeping_function() from external library.
I call this function in a loop which executes many, many times so all waiting times sum up to huge amounts.
As I cannot modify the sleeping_function() I want to start a few threads to run a few iterations of my loop in parallel. The problem is that this function internally uses some global variables.
Is there a way to tell linker on SunOS that I want to link specific libraries in a way that will place all variables from them in Thread Local Storage?

I don’t think you’ll be able to achieve this with just the linker, but you might be able to get something working with some code in C.
The problem is that a call to load a library that is already loaded will return a reference to the already loaded instance instead of loading a new copy. A quick look at the documentation for dlopen and LoadLibrary seems to confirm that there’s no way to load the same library more than once, at least not if you want the image to be prepared for execution. One way to circumvent this would be to prevent the OS from knowing that it is the same library. To do this you could make a copy of the file.
Some pseudo code, just replace calls to sleeping_function with calls to call_sleeping_function_thread_safe:
char *shared_lib_name
void sleeping_function_thread_init(char *lib_name);
void call_sleeping_function_thread_safe()
{
void *lib_handle;
pthread_t pthread;
new_file_name = make_copy_of_file(shared_lib_name);
pthread_create(&pthread, NULL, sleeping_function_thread_init, new_file_name);
}
void sleeping_function_thread_init(char *lib_name)
{
void *lib_handle;
void (*)() sleeping_function;
lib_handle = dlopen(lib_name, RTLD_LOCAL);
sleeping_function = dlsym(lib_handle, "sleeping_function")
while (...)
sleeping_function;
dlclose(lib_handle);
delete_file(lib_name);
}
For windows dlopen becomes LoadLibrary and dlsym becomes GetProcAddress etc... but the basic idea would still work.

In general, this is a bad idea. Global data isn't the only issue that may prevent a non thread-safe library from running in a multithreaded environment.
As one example, what if the library had a global variable that points to a memory-mapped file that it always maps into a single, hardcoded address. In this case, with your technique, you would have one global variable per thread, but they would all point to the same memory location, which would be trashed by multi-threaded access.

C global static - shared among threads?

In C, declaring a variable static in the global scope makes it a global variable. Is this global variable shared among threads or is it allocated per thread?
Update:
If they are shared among threads, what is an easy way to make globals in a preexisting library unique to a thread/non-shared?
Update2:
Basically, I need to use a preexisting C library with globals in a thread-safe manner.

It's visible to the entire process, i.e., all threads. Of course, this is in practice. In theory, you couldn't say because threads have nothing to do with the C standard (at least up to c99, which is the standard that was in force when this question was asked).
But all thread libraries I've ever used would have globals accessible to all threads.
Update 1:
Many thread libraries (pthreads, for one) will allow you to create thread-specific data, a means for functions to create and use data specific to the thread without having it passed down through the function.
So, for example, a function to return pseudo random numbers may want each thread to have an independent seed. So each time it's called it either creates or attaches to a thread-specific block holding that seed (using some sort of key).
This allows the functions to maintain the same signature as the non-threaded ones (important if they're ISO C functions for example) since the other solution involves adding a thread-specific pointer to the function call itself.
Another possibility is to have an array of globals of which each thread gets one, such as:
int fDone[10];
int idx;
: : :
for (i = 0; i < 10; i++) {
idx = i;
startThread (function, i);
while (idx >= 0)
yield();
}
void function () {
int myIdx = idx;
idx = -1;
while (1) {
: : :
}
}
This would allow the thread function to be told which global variable in the array belongs to it.
There are other methods, no doubt, but short of knowing your target environment, there's not much point in discussing them.
Update 2:
The easiest way to use a non-thread-safe library in a threaded environment is to provide wrapper calls with mutex protection.
For example, say your library has a non-thread-safe doThis() function. What you do is provide a wrapper for it:
void myDoThis (a, b) {
static mutex_t serialize;
mutex_claim (&serialize);
doThis (a, b);
mutex_release (&serialize);
}
What will happen there is that only one thread at a time will be able to claim the mutex (and hence call the non-thread-safe function). Others will be blocked until the current one returns.

C/C++ standard doesn't support threads. So all variables shared among threads.
Thread support implemented in C/C++ runtime library which is not part of the standard. Runtime is specific for each implementation of C/C++. If you want to write portable code in C++ you could use boost interprocess library.
To declare thread local variable in Microsoft Visual Studio you could use Microsoft specific keyword __declspec( thread ).

As #Pax mentioned, static variables are visible to all threads. There's no C++ data construct associated with a particular thread.
However, on Windows you can use the TlsAlloc API to allocate index for a thread-specific data and put that index in a static variable. Each thread has its own slot which you can access using this index and the TlsGetValue and TlsSetValue. For more information, read about Using Thread Local Storage on MSDN.
Update: There's no way to make globals in a preexisting library be thread-specific. Any solutions would require you to modify the code as well to be aware that the data has thread affinity.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js