The Cost of thread_local - c++

Now that C++ is adding thread_local storage as a language feature, I'm wondering a few things:
What is the cost of thread_local likely to be?
In memory?
For read and write operations?
Associated with that: how do Operating Systems usually implement this? It would seem like anything declared thread_local would have to be given thread-specific storage space for each thread created.

Storage space: size of the variable * number of threads, or possibly (sizeof(var) + sizeof(var*)) * number of threads.
There are two basic ways of implementing thread-local storage:
Using some sort of system call that gets information about the current kernel thread. Sloooow.
Using some pointer, probably in a processor register, that is set properly at every thread context switch by the kernel - at the same time as all the other registers. Cheap.
On Intel platforms, variant 2 is usually implemented via a segment register (FS or GS, I don't remember which). Both GCC and MSVC support this. Access times are therefore about as fast as for global variables.
It is also possible, but I haven't seen it yet in practice, for this to be implemented via existing library functions like pthread_getspecific. Performance would then be like 1. or 2., plus library call overhead. Keep in mind that variant 2. + library call overhead is still a lot faster than a kernel call.
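For concreteness, here is a minimal sketch of the feature under discussion (all names are illustrative). Each thread gets its own copy of the variable, and on the platforms described above each access compiles down to roughly a segment-register-relative load:

    #include <iostream>
    #include <thread>

    thread_local int counter = 0;   // one independent instance per thread

    void work(const char* name) {
        for (int i = 0; i < 3; ++i)
            ++counter;              // touches only this thread's copy
        std::cout << name << ": " << counter << '\n';   // prints 3 everywhere
    }

    int main() {
        std::thread t1(work, "t1"), t2(work, "t2");
        t1.join();
        t2.join();
        work("main");               // main's copy is independent too
    }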

A description for how it works on Linux by Uli Drepper (maintainer of glibc) can be found here: www.akkadia.org/drepper/tls.pdf
The requirement to handle dynamically loaded modules etc. makes the entire mechanism a bit convoluted, which perhaps partly explains why the document weighs in at 79 pages (!).
Memory-usage-wise, each per-thread variable obviously needs its own per-thread memory (although in some cases this can be done lazily, so that the space is allocated only when the variable is first accessed), and then there are some extra data structures needed for offset tables etc.
Performance-wise, the extra cost of accessing a TLS variable mostly revolves around retrieving the address of the variable. On x86 Linux the GS register is used as the starting point, on x86-64 FS. Usually there are a few pointer dereferences, and a function call (__tls_get_addr) for dynamically loaded code. There is also the cost that creating a new thread is slower, because the implementation needs to allocate space for, and possibly initialize, all the TLS variables (if not done lazily).
TLS is nice for easily making some old thread-unsafe code patterns thread-safe (think errno), but for new code designed from the start for a multi-threaded world it's very seldom needed.
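To make the errno point concrete, here is a sketch of the old "set a global, check it later" pattern made thread-safe (last_error is a hypothetical stand-in, not the real errno machinery):

    thread_local int last_error = 0;    // each thread sees its own error code

    int parse_widget(const char* s) {
        if (s == nullptr) {
            last_error = 22;            // no race with other threads' parses
            return -1;
        }
        last_error = 0;
        return 0;                       // caller may inspect last_error
    }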

Related

Optimal Way of Using mlockall() for Real-time Application (nanosecond sensitive)

I am reading mlockall()'s manpage: http://man7.org/linux/man-pages/man2/mlock.2.html
It mentions:

Real-time processes that are using mlockall() to prevent delays on page faults should reserve enough locked stack pages before entering the time-critical section, so that no page fault can be caused by function calls. This can be achieved by calling a function that allocates a sufficiently large automatic variable (an array) and writes to the memory occupied by this array in order to touch these stack pages. This way, enough pages will be mapped for the stack and can be locked into RAM. The dummy writes ensure that not even copy-on-write page faults can occur in the critical section.
I am a bit confused by this statement:
This can be achieved by calling a function that allocates a sufficiently large automatic variable (an array) and writes to the memory occupied by this array in order to touch these stack pages.
All automatic variables (variables on the stack) are created "on the fly" on the stack when the function is called. So how can I achieve what the last sentence says?
For example, let's say I have this function:
void foo() {
    char a;
    uint16_t b;
    std::deque<int64_t> c;
    // do something with those variables
}
Or does it mean before I call any function, I should call a function like this in main():
void reserveStackPages() {
    int64_t stackPage[4096/8 * 1024 * 1024];
    memset(stackPage, 0, sizeof(stackPage));
}
If yes, does it make a difference if I instead allocate the stackPage variable on the heap, write to it, and then free it? Probably yes, because heap and stack are two different regions of RAM?
The std::deque above is just there to raise another related question: what if I want to reserve memory for things that use both stack pages and heap pages? Would calling a "heap" version of reserveStackPages() help?
The goal is to minimize all the jitter in the application (yes, I know there are many other things to look at, such as TLB misses, etc.; I'm just trying to deal with one kind of jitter at a time and slowly work through all of them).
Thanks in advance.
P.S. This is for a low latency trading application if it matters.
You generally don't need to use mlockall unless you write (more or less hard) real-time applications (I have actually never used it).
If you do need it, you had better code the most real-time parts in C (not in full C++), because you surely want to understand the details of memory allocation. Note that unless you dive into the std::deque implementation, you don't know exactly where its storage sits (probably most of the data is heap-allocated, even though your c is an automatic variable).
You should first understand in detail the virtual address space of your process. For that, proc(5) is useful: from inside your process you can read /proc/self/maps (see this), and from outside (e.g. some terminal) you can run cat /proc/1234/maps for a process of pid 1234. Or use pmap(1).
because heap and stack are 2 different regions in the RAM?
In fact, your process's address space contains many segments (listed in /proc/1234/maps), many more than two. Typically every dynamically linked shared library (such as libc.so) brings in a few segments.
Try cat /proc/self/maps and cat /proc/$$/maps in your terminal to get a better intuition about virtual address spaces. On my machine, the first gives 19 segments of the cat process (each displayed as one line) and the second 97 segments of the zsh (my shell) process.
To ensure that your stack has enough space, you indeed could call a function allocating a large enough automatic variable, like your reserveStackPages; see the sketch below. Beware that call stacks are of limited size in practice (a few megabytes usually; see also setrlimit(2)).
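A minimal sketch of the manpage's technique, under stated assumptions: the 512 KiB reserve and the 4096-byte page size are illustrative; pick them from your worst-case stack depth and your actual page size. The volatile keeps the compiler from optimising the dummy writes away:

    #include <cstddef>
    #include <sys/mman.h>

    void reserve_stack_pages() {
        volatile char pad[512 * 1024];          // automatic array: on the stack
        for (std::size_t i = 0; i < sizeof pad; i += 4096)
            pad[i] = 0;                         // one dummy write per page
    }

    int main() {
        mlockall(MCL_CURRENT | MCL_FUTURE);     // lock current and future pages
        reserve_stack_pages();                  // map and dirty the stack pages
        // ... time-critical section: no stack page faults up to 512 KiB depth ...
    }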
If you really need mlockall (which is unlikely), you might consider linking your program statically (to have fewer segments in your virtual address space).
Look also into madvise(2) (and perhaps mincore(2)). It is generally much more useful than mlockall. BTW, in practice most of your virtual memory is in RAM anyway (unless your system experiences thrashing, in which case you'll notice it immediately).
Read also Operating Systems: Three Easy Pieces to understand the role of paging.
PS. Nanosecond-sensitive applications do not make much sense (because of cache misses that the software does not control).

Does the static keyword play a role in C/C++ and the storage level?

This question has been bugging me for a while.
From what I understand, there are various levels of storage. They are:
CPU Registers
Lower Level Cache
Memory (RAM/ROM)
Hard Disk Space
With "fastest access time / fewest number" of at the top and "slowest access time / most number of" towards the bottom?
In C/C++, how do you control whether variables are put into (and stay in) the lower-level caches? I'm assuming there is no way to control which variables stay in CPU registers, since there is a very limited number of them.
I want to say that the C/C++ static keyword plays some part in it, but wanted to get clarification on this.
I understand how static works in theory. Namely that
#include <stdio.h>

void increment() {
    static int iSum = 0;
    printf(" iSum = %d\n", ++iSum);
}

int main(int argc, char* argv[]) {
    int iInc;
    for (iInc = 0; iInc < 5; iInc++)
        increment();
    return 0;
}
Would print
iSum = 1
iSum = 2
iSum = 3
iSum = 4
iSum = 5
But I am not certain how the different levels of storage come into play. Does where a variable lives depend more on the optimization level, such as invoking the -O2 and -O3 flags on GCC?
Any insight would be greatly appreciated.
Thanks,
Jeff
The static keyword has nothing to do with cache hinting, and the compiler is free to allocate registers as it sees fit. You may have been thinking of the storage-class specifier list, which includes the (deprecated) register specifier.
There's no way to precisely control, via standard-conformant C++ (or C) language features, how caching and/or register allocation work, because you would have to interface deeply with your underlying hardware (writing your own register allocator, or hinting how to store/spill/cache things). Register allocation is usually the compiler back-end's duty, while caching is the processor's job (along with instruction pipelining, branch prediction and other low-level tasks).
It is true that changing the compiler's optimization level might deeply affect how variables are accessed/loaded into registers. Ideally you would keep everything in registers (they're fast), but since you can't (their size and number are limited), the compiler has to make predictions and guess what should be spilled (i.e. taken out of a register and reloaded later) and what not (or even optimized out). Register allocation is an NP-complete problem. In CUDA C you usually can't deal with such issues, but you do have a chance of specifying the caching mechanism you intend to use by choosing different types of memory. However, this is not standard C++, as extensions are involved.
Caches are intermediate storage areas between main memory and registers.
They are used because accessing memory today is very expensive, measured in clock ticks, compared to how things used to be (memory access hasn't increased in speed anywhere near what's happened to CPUs).
So they are a way to "simulate" faster memory access while letting you write exactly the same code as without them.
Variables are never "stored" in the cache as such — their values are only held there temporarily in case the CPU needs them. Once modified, they are written out to their proper place in main memory (if they reside there and not in a register).
And static has nothing to do with any of this.
If a program is small enough, the compiler can decide to use a register for such a variable, too, or inline it so that it disappears completely.
Essentially you need to start looking at writing applications and code that are cache coherent. This is a quick intro to cache coherence:
http://supercomputingblog.com/optimization/taking-advantage-of-cache-coherence-in-your-programs/
It's a long and complicated subject, and it essentially boils down to the actual implementation of algorithms along with the platform they target. There is a similar discussion in the following thread:
Can I force cache coherency on a multicore x86 CPU?
A function variable declared static has a lifetime equal to the duration of the program. That's all C/C++ says about it; nothing about storage/memory.
To answer this question:

In C/C++ how do you control whether variables are put into (and stay in) Lower Level Cache?
You can't. You can do some stuff to help the data stay in cache, but you can't pin anything in cache.
That's not what those caches are for; they are mainly fed from main memory to speed up access, or to allow for some advanced techniques like branch prediction and pipelining.
I think there may be a few things that need clarification. CPU cache (L1, L2, L3, etc.) is a mechanism the CPU uses to avoid having to read and write directly to memory for values that are accessed more frequently. It isn't addressed separately from RAM; it can be thought of as a narrow, fast window onto it.
Using cache effectively is extremely complex, and it requires nuanced knowledge of code memory access patterns, as well as the underlying architecture. You generally don't have any direct control over the cache mechanism, and an enormous amount of research has gone into compilers and CPUs to use CPU cache effectively. There are storage class specifiers, but these aren't meant to perform cache preload or support streaming.
Maybe it should be noted that simply because something takes fewer cycles to access (register, L1, L2, etc.) doesn't mean using it will necessarily make code faster. For example, if something is only written to memory once, loading it into L1 may cause a cache eviction that moves data needed by a tight loop into slower memory. Since the data that's accessed more frequently now takes more cycles to reach, the cumulative impact is lower (not higher) performance.
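To make the "use the cache well without controlling it directly" point concrete, here is the classic illustration, a sketch with illustrative sizes: the same sum computed in two traversal orders, where the row-major loop is typically several times faster purely because consecutive accesses stay within cache lines:

    #include <cstddef>
    #include <vector>

    constexpr std::size_t N = 1024;   // matrix is N*N, stored row-major

    // Row-major traversal: consecutive accesses hit the same cache lines.
    long long sum_rows(const std::vector<long long>& m) {
        long long s = 0;
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                s += m[i * N + j];
        return s;
    }

    // Column-major traversal: each access lands N*8 bytes away, so most
    // loads miss the cache. Same result, typically much slower.
    long long sum_cols(const std::vector<long long>& m) {
        long long s = 0;
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t i = 0; i < N; ++i)
                s += m[i * N + j];
        return s;
    }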

Performance impact of objects

I am a beginner programmer with some experience in C and C++ programming. I was assigned by the university to make a physics simulator, so as you might imagine there's a big emphasis on performance.
My questions are the following:
1. How many assembly instructions does an instance data-member access through a pointer translate to (e.g. vector->x)?
2. Is it much more than, say, another approach where you simply access the memory through a char* (at the same memory location as variable x), or is it the same?
3. Is there a big impact on performance, compiler-wise, if I use an object to access that memory location rather than just accessing it directly?
4. Another question regarding the subject: is accessing heap memory faster or slower than stack memory access?
C++ is a compiled language. Accessing a memory location through a pointer is the same regardless of whether it's a pointer to an object or a char* pointing at the same address; it's one instruction in either case. There are a couple of spots where C++ adds overhead, but it always buys you some flexibility. For example, invoking a virtual function requires an extra level of indirection. However, you would need the same indirection anyway if you were to emulate the virtual function with function pointers, and you would spend a comparable number of CPU cycles if you were to emulate it with a switch or a sequence of ifs.
In general, you should not start optimizing before you know what part of your code to optimize. Usually only a small part of your code is responsible for the bulk of the CPU time used by your program. You do not know what part to optimize until you profile your code. Almost universally it's programmer's code, not the language features of C++, that is responsible for the slowdown. The only way to know for sure is to profile.
1. On x86, a pointer access is typically one extra instruction above and beyond what you normally need to perform the operation (e.g. y = object->x; is one load of the address in object, one load of the value of x, and one store to y; in x86 assembler, both loads and stores are mov instructions with a memory operand). Sometimes it's "zero" instructions, because the compiler can optimise away the load of the object pointer. On other architectures it really comes down to how the architecture works: some have very limited ways of accessing memory and/or loading addresses into pointers, which makes pointer access awkward. (See the sketch after this list.)
2. Exactly the same number of instructions; this applies across the board.
3. As in 2: objects in themselves have no impact at all.
4. Heap memory and stack memory are the same kind of memory. One answer says that "stack memory is always in the cache", which is true near the top of the stack, where all the activity goes on; but if an object created in main has a pointer to it passed down through several layers of function calls before being accessed through that pointer, there is an obvious chance that its memory hasn't been touched in a long while, so there is no real difference there either. The big differences are that heap space is plentiful while stack space is limited, and that running out of heap allows limited recovery while running out of stack means immediate end of execution (without tricks that aren't very portable).
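As a concrete illustration of point 1, here is a hypothetical sketch of what a typical x86-64 compiler emits for a member access through a pointer; the instructions in the comments are indicative of common gcc/clang output, not guaranteed:

    struct Vec { double x, y, z; };

    double get_x(const Vec* v) {
        return v->x;    // one load: movsd xmm0, [rdi] (x is at offset 0)
    }

    double get_y(const Vec* v) {
        return v->y;    // still one load: movsd xmm0, [rdi + 8];
                        // the member offset folds into the addressing mode
    }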
If you look at class as a synonym for struct in C (which, aside from some details, it really is), then you will realize that classes and objects do not really add any extra "effort" to the generated code.
Of course, used correctly, C++ can make it much easier to write code that deals with things that are "done in a very similar way, but subtly differently". In C, you often end up with:
void drawStuff(Shape *shapes, int count)
{
    int i;
    for (i = 0; i < count; i++)
    {
        switch (shapes[i].shapeType)
        {
        case Circle:
            ... code to draw a circle ...
            break;
        case Rectangle:
            ... code to draw a rectangle ...
            break;
        case Square:
            ...
            break;
        case Triangle:
            ...
            break;
        }
    }
}
In C++, we can make that choice at object-creation time, and your drawStuff becomes:
void drawStuff(std::vector<Shape*> shapes)
{
    for (auto s : shapes)
    {
        s->Draw();
    }
}
"Look Ma, no switch..." ;)
(Of course, you do need a switch or something to select which object to create, but once the choice is made, assuming your objects and the surrounding architecture are well defined, everything should work "magically" like the above example.)
Finally, if performance is IMPORTANT, then run benchmarks, run profiling, and check where the code is spending its time. Don't optimise too early (but if you have strict performance criteria for something, keep an eye on it, because deciding in the last week of a project that you need to dramatically reorganise your data and code because performance sucks due to some bad decision is also not the best of ideas!). And don't optimise for individual instructions; look at where the time is spent, and come up with better algorithms WHERE you need to. (In the above example, using const std::vector<Shape*>& shapes would effectively pass a pointer to the shapes vector instead of copying the entire thing, which may make a difference if there are a few thousand elements in shapes.)
It depends on your target architecture. A struct in C (and a class in C++) is just a block of memory containing the members in sequence. An access to such a field through a pointer means adding an offset to the pointer and loading from there. Many architectures allow a load to specify an offset to the target address directly, meaning there is no performance penalty there; and even on extreme RISC machines that lack this, adding the offset is so cheap that the load completely shadows it.
Stack and heap memory are really the same thing. Just different areas. Their basic access speed is therefore the same. The main difference is that the stack will most likely already be in the cache no matter what, whereas heap memory might not be if it hasn't been accessed lately.
Variable. On most processors, instructions are translated into something called microcode, similar to how Java bytecode is translated into processor-specific instructions before it runs. How many actual instructions you get differs between processor manufacturers and models.
Same as above; it depends on processor internals that most of us know little about.
1+2. What you should really be asking is how many clock cycles these operations take. On modern platforms the answer is one. It does not matter how many instructions they are; a modern processor has optimizations that make both run in one clock cycle. I will not go into detail here. In other words, as far as CPU load goes, there is no difference at all.
Here you have the tricky part. While there is no difference in how many clock cycles the instruction itself takes, it needs to have data from memory before it can run, and that can take a HUGE number of clock cycles. Actually, someone showed a few years ago that even a very optimized program on an x86 processor spends at least 50% of its time waiting for memory access.
When you use stack memory you are effectively doing the same thing as creating an array of structs: the data is contiguous (instructions are not duplicated unless you have virtual functions), and if you do sequential access you will get optimal cache hits. When you use heap memory you will create an array of pointers, and each object gets its own allocation. That memory will NOT be contiguous, so sequential access will incur a lot of cache misses. And cache misses are what really make your application slower; they should be avoided at all costs.
I do not know exactly what you are doing, but in many cases even using objects is much slower than plain arrays. An array of objects is laid out [object1][object2] etc. If you do something like the pseudocode "for each object o { o.setX(o.getX() + 1) }", you only touch one field, so your sequential access jumps over the other fields of every object and gets more cache misses than if the X fields were packed in their own array. And if your code uses all the fields of the object, separate per-field arrays will not be slower than the object array; they will just be loaded into different cache blocks.
While plain arrays are faster in C++, they are MUCH faster in other languages like Java, where you should NEVER store bulk data in objects, because Java objects use more memory and are always stored on the heap. This is the most common mistake C++ programmers make in Java, and then they complain that Java is slow. However, if they know how to write optimal C++ programs, they store data in arrays, which are as fast in Java as in C++.
What I usually do is write a class to store the data, containing arrays. Even if it lives on the heap, it's just one object, which is about as fast as using the stack. Then I have something like "class myitem { private: int pos; mydata data; public: int getVar1() { return data.getVar1(pos); } };" (I'm not writing out all of the code here, just illustrating the idea). Then when I iterate through it, the iterator class does not actually return a new myitem instance for each item; it increases the pos value and returns the same object. This means you get a nice OO API while you actually only have a few objects and nicely packed arrays. This pattern is among the fastest in C++, and if you don't use it in Java you will know pain.
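A cleaned-up sketch of the pattern described above (the names follow the answer's fragment but are fleshed out by me; this is illustrative, not a complete library):

    #include <vector>

    // Structure-of-arrays storage: each field lives in its own packed array.
    struct MyData {
        std::vector<int>   var1;
        std::vector<float> var2;
        int   getVar1(int pos) const { return var1[pos]; }
        float getVar2(int pos) const { return var2[pos]; }
    };

    // A lightweight cursor presenting an object-like API over the arrays.
    // Iteration just bumps `pos`; no per-item object is ever allocated.
    class MyItem {
        MyData& data;
        int pos = 0;
    public:
        explicit MyItem(MyData& d) : data(d) {}
        int   getVar1() const { return data.getVar1(pos); }
        float getVar2() const { return data.getVar2(pos); }
        bool next() { return ++pos < static_cast<int>(data.var1.size()); }
    };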
The fact that we get multiple function calls does not really matter. Modern processors have branch prediction, which removes the cost of the vast majority of those calls: long before the code actually runs, the branch predictor will have figured out where the chains of calls lead, making them nearly free.
Also, even if all the calls did run, each would take far fewer clock cycles than the memory accesses they require, which, as I pointed out, makes memory layout the only issue that should really bother you.

Is there a way to make sure an array variable (unsigned int*) will be in memory?

I need to set some default value for all entries in a very large array.
It takes quite a long time (110-120 ms), and I suspect this happens because of misses in memory.
I use memset/std::fill to set the default value. Is there a way to make sure that the array resides in memory before the memset/fill?
Assuming this is a large memory-mapped file, you can use the madvise() libc call with the MADV_WILLNEED argument to hint to the OS that you'll be wanting to access the region mentioned soon.
However YMMV, as the array needs to be large enough that the benefit of the resulting syscall isn't outweighed by the cost of making the call.
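A minimal sketch of that hint for a memory-mapped file, with error handling elided and data.bin as a hypothetical input:

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        int fd = open("data.bin", O_RDONLY);
        size_t n = static_cast<size_t>(lseek(fd, 0, SEEK_END));
        void* p = mmap(nullptr, n, PROT_READ, MAP_PRIVATE, fd, 0);
        madvise(p, n, MADV_WILLNEED);   // ask the kernel to start paging it in
        // ... first accesses to p are now less likely to block on a fault ...
        munmap(p, n);
        close(fd);
    }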
You can lock memory at per-page granularity using mlock, though only up to a fixed amount (I'm not sure what the limit is on OS X, but you can check it using getrlimit with RLIMIT_MEMLOCK).
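A sketch of that approach for the array in question (sizes illustrative; error handling elided):

    #include <algorithm>
    #include <cstdlib>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main() {
        rlimit rl;
        getrlimit(RLIMIT_MEMLOCK, &rl);     // how much we are allowed to lock
        size_t count = 1 << 20;
        size_t bytes = count * sizeof(unsigned int);
        unsigned int* a = static_cast<unsigned int*>(std::malloc(bytes));
        if (bytes <= rl.rlim_cur)
            mlock(a, bytes);                // fault the pages in and pin them
        std::fill(a, a + count, 0xFFFFFFFFu);   // the fill in question
        munlock(a, bytes);
        std::free(a);
    }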
Most likely you have a multi-core processor, and functions like memset can actually degrade in performance when not run on single-core CPUs. It's possible that mutex locking is causing the slowdown. Try allocating the memory on the stack instead of dynamically. Since it's a very large array, I would experiment with making my own memory manager and having multiple threads each store a segment of it (but that's just an idea I had after quickly reading an article). A standard way of doing it would be to use one memory allocator per thread. In any case, I would look into something other than memset.
Maybe the following article would help

Lock Free Queue -- Single Producer, Multiple Consumers

I am looking for a method to implement a lock-free queue data structure that supports a single producer and multiple consumers. I have looked at the classic method by Maged Michael and Michael Scott (1996), but their version uses linked lists. I would like an implementation that makes use of a bounded circular buffer. Something that uses atomic variables?
On a side note, I am not sure why these classic methods are designed for linked lists that require a lot of dynamic memory management. In a multi-threaded program, all memory management routines are serialized. Aren't we defeating the benefits of lock-free methods by using them in conjunction with dynamic data structures?
I am trying to code this in C/C++ using the pthread library on an Intel 64-bit architecture.
Thank you,
Shirish
The use of a circular buffer makes a lock necessary, since blocking is needed to prevent the head from going past the tail. Otherwise, the head and tail pointers can easily be updated atomically. Or, in some cases, the buffer can be so large that overwriting is not an issue. (In real life you will see this in automated trading systems, with circular buffers sized to hold X minutes of market data. If you are X minutes behind, you have far worse problems than overwriting your buffer.)
When I implemented the MS queue in C++, I built a lock-free allocator using a stack, which is very easy to implement. If I have MSQueue, then at compile time I know sizeof(MSQueue::node). I then make a stack of N buffers of the required size. N can grow: if pop() returns null, it is easy to ask the heap for more blocks, which are then pushed onto the stack. Apart from the possibly blocking call for more memory, this is a lock-free operation.
Note that T cannot have a non-trivial dtor. I worked on a version that did allow non-trivial dtors, and it worked. But I found it was easier to just make T a pointer to the T I actually wanted, where the producer released ownership and the consumer acquired it. This of course requires that the T itself be allocated using lock-free methods, but the same stack-based allocator works there as well.
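A minimal sketch of such a stack-based free list (a Treiber stack), with names of my own choosing. Note that a production version must deal with the ABA problem in pop() (e.g. via tagged pointers or hazard pointers); this sketch ignores that for brevity:

    #include <atomic>

    struct Node {
        Node* next;
        // ... a buffer of sizeof(MSQueue::node) bytes would live here ...
    };

    class FreeList {
        std::atomic<Node*> head{nullptr};
    public:
        void push(Node* n) {
            Node* old = head.load(std::memory_order_relaxed);
            do {
                n->next = old;      // link on top of the current head
            } while (!head.compare_exchange_weak(
                old, n, std::memory_order_release, std::memory_order_relaxed));
        }
        Node* pop() {
            Node* old = head.load(std::memory_order_acquire);
            while (old && !head.compare_exchange_weak(
                old, old->next, std::memory_order_acquire,
                std::memory_order_relaxed)) {
                // retry: another thread changed the head under us
            }
            return old;             // nullptr: go ask the heap for more blocks
        }
    };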
In any case, the point of lock-free programming is not the raw speed of the data structures themselves. The points are these:
Lock-free makes me independent of the scheduler. Lock-based programming depends on the scheduler to make sure that the holder of a lock is running so that it can release the lock; this is what causes "priority inversion". On Linux there are some lock attributes to make sure this happens.
If I am independent of the scheduler, the OS has a far easier time managing timeslices, and I get far less context switching
It is easier to write correct multithreaded programs using lock-free methods, since I don't have to worry about deadlock, livelock, scheduling, synchronization, etc. This is especially true with shared-memory implementations, where a process could die while holding a lock in shared memory, and there is no way to release the lock.
Lock-free methods are far easier to scale. In fact, I have implemented lock-free methods using messaging over a network. Distributed locks like this are a nightmare.
That said, there are many cases where lock-based methods are preferable and/or required:
When updating things that are expensive or impossible to copy. Most lock-free methods use some sort of versioning: make a copy of the object, update the copy, check whether the shared version is still the same as when you copied it, and if so make your copy the current version. Otherwise copy it again, apply the update again, and check again. Keep doing this until it works. This is fine when the objects are small, but if they are large, or contain file handles, etc., then it is not recommended. (A sketch of this copy-and-swap loop follows this list.)
Most types are impossible to access in a lock-free way, e.g. any STL container. These have invariants that require non-atomic access; for example, assert(vector.size() == vector.end() - vector.begin()). So if you are updating/reading a vector that is shared, you have to lock it.
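A minimal sketch of the versioning loop mentioned above, for a value small enough to fit in an atomic (for larger objects you would CAS a pointer to a heap-allocated snapshot instead, at which point memory reclamation becomes the hard part):

    #include <atomic>

    std::atomic<int> shared_value{0};

    void add_one() {
        int seen = shared_value.load();
        int updated;
        do {
            updated = seen + 1;     // work on a private copy
        } while (!shared_value.compare_exchange_weak(seen, updated));
        // The CAS fails if another thread changed shared_value since we
        // read it; `seen` is refreshed automatically and we retry.
    }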
This is an old question, but no one has provided an accepted solution. So I offer this info for others who may be searching.
This website: http://www.1024cores.net
Provides some really useful lockfree/waitfree data structures with thorough explanations.
What you are seeking is a lock-free solution to the reader/writer problem.
See: http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem
For a traditional one-block circular buffer, I think this simply cannot be done safely with atomic operations; you need to do too much in one read. Suppose you have a structure that has this:
uint8_t* buf;
unsigned int size;    // actual maximum buffer size
unsigned int length;  // actual stored data length (writes keep this <= size)
unsigned int offset;  // start of currently stored data
On a read you need to do the following (this is how I implemented it, anyway; you can swap some steps, as I'll discuss afterwards):
Check that the read length does not surpass the stored length
Check that offset + read length does not surpass the buffer boundaries
Read the data out
Increase offset, decrease length
What do you certainly need to do synchronised (i.e. atomically) to make this work? Combine steps 1 and 4 into one atomic step; or, to clarify, do the following synchronised:
check read_length; this can be something like read_length = min(read_length, length)
decrease length by read_length: length -= read_length
take a local copy of offset: unsigned int local_offset = offset
increase offset by read_length: offset += read_length
Afterwards you can just do a memcpy (or whatever) starting from your local_offset, check whether your read goes past the circular buffer's end (splitting it into two memcpys if so), etc. This is "quite" thread-safe: your write method could still overwrite the memory you're reading, so make sure your buffer is really large enough to minimize that possibility.
Now, while I can imagine combining steps 3 and 4 (I guess that's what they do in the linked-list case), or even steps 1 and 2, into atomic operations, I cannot see this whole deal being done in one atomic operation :).
You can, however, try to drop the "length" checking if your consumers are very smart and always know exactly what to read. You'd then also need a new woffset variable, because the old method of computing the write offset as (offset + length) % size wouldn't work anymore. Note that this is close to the linked-list case, where you actually always read exactly one element (of fixed, known size) from the list. Also, here too, if you make it a circular linked list, you can read too much, or write to a position you're reading at that moment!
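One way to get the "steps 1 and 4 in one atomic step" behaviour without a lock is to pack length and offset into a single 64-bit word and update it with a compare-and-swap loop. A sketch under that assumption, with all names mine (the memcpy out of the buffer still races with the writer exactly as described above):

    #include <algorithm>
    #include <atomic>
    #include <cstdint>

    const uint32_t size = 1 << 16;          // buffer capacity, a power of two
    std::atomic<uint64_t> state{0};         // high 32 bits: length, low: offset

    // Atomically clamp the request, consume it, and report where it started.
    uint32_t reserve_read(uint32_t want, uint32_t& local_offset) {
        uint64_t cur = state.load(std::memory_order_acquire);
        uint32_t take;
        uint64_t next;
        do {
            uint32_t length = uint32_t(cur >> 32);
            uint32_t offset = uint32_t(cur);
            take = std::min(want, length);              // step 1
            local_offset = offset;                      // step 3
            uint32_t nlen = length - take;              // step 4 (length)
            uint32_t noff = (offset + take) % size;     // step 4 (offset)
            next = (uint64_t(nlen) << 32) | noff;
        } while (!state.compare_exchange_weak(cur, next,
                                              std::memory_order_acq_rel));
        return take;    // caller memcpys `take` bytes starting at local_offset
    }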
Finally: my advice is to just go with locks. I use a CircularBuffer class (completely safe for reading and writing) for a real-time 720p60 video streamer, and I have no speed issues at all from locking.
This is an old question, but no one has provided an answer that precisely answers it. Given that it still comes up high in search results for (nearly) the same question, there should be an answer, given that one exists.
There may be more than one solution, but here is one that has an implementation:
https://github.com/tudinfse/FFQ
The conference paper referenced in the readme details the algorithm.