Why struct with padding fields works faster

Why struct with padding fields works faster - concurrency

I just found this library, that provides lock-free ring, that works way faster then channels: https://github.com/textnode/gringo (and it works really faster especially with GOMAXPROCS > 1 )
But interesting part is struct for managing queue state:
type Gringo struct {
padding1 [8]uint64
lastCommittedIndex uint64
padding2 [8]uint64
nextFreeIndex uint64
padding3 [8]uint64
readerIndex uint64
padding4 [8]uint64
contents [queueSize]Payload
padding5 [8]uint64
}
If i remove "paddingX [8]uint64" fields it works about 20% slower. How it can be?
Also appreciate if someone explained why this lock-free algorithm much faster then channels, even buffered?

Padding eliminates false sharing by putting each structure on its own cache line. If two variables share a cache line, a read of an unmodified variable will be as expensive as a read of a modified variable if there's an intervening write to the other variable.
When a variable is read on multiple cores and not modified, the cache line is shared by the cores. This makes the reads very cheap. Before any core can write to any part of that cache line, it must invalidate the cache line on other cores. If any core later reads from that cache line, it will find the cache line invalidated and have to go back to sharing it. This makes painful extra cache coherency traffic when one variable is frequently modified and the other is frequently read.

It works faster because it does not require locks. This is an implementation in Java (called Disruptor) which works really well, and seems to be the inspiration for gringo. They explain the cost of locks and how you can increase throughput here.
As for the padding, the paper also hints at some of the reasons. Basically: processor caches. This paper explains it well. You can gain tremendous performance gain by making the processor access its Level 1 cache instead of going through memory or its outer caches as often as possible. But this requires to take extra precautions as the processor will fully load its cache, and reload it (from memory or level 2-3 caches) every time it is required.
In the case of concurrent data structure, as #David Schwartz said, the false sharing will force the processor to reload its cache much more often, as some data might be loaded in the rest of the memory line, be modified, and force the whole cache to be loaded again.

Related

Does Multiple reader single writer implementation in g++-4.4(Not in C++11/14) via boost::shared_mutex impact performance?

Usage: In our production we have around 100 thread which can access the cache we are trying to implement. If cache is missed then information will be fetched from the database and cache will be updated via writer thread.
To achieve this we are planning to implement multiple read and single writer We cannot update the g++ version since we are using g++-4.4
Update: Each worker thread can work for both read and write. If cache is missed then information is cached from the DB.
Problem Statement:
We need to implement the cache to enhance the performance.
For this, cache read are more frequent and write operations to the cache is very much less.
I think we can use boost::shared_mutex boost::shared_lock, boost::upgrade_lock, boost::upgrade_to_unique_lock implementation
But we learnt that boost::shared_mutex has performance issues:
Performance comparison on reader writer locks
Lib boost devel
Questions
Does boost::shared_mutex impact the performance in case read are much frequent?
What are other constructs and design approaches we can take while considering compiler version g++4.4?
Is there a way-around on how to design it, such that reads are lock free?
Also, we are intended to use Map to keep the information for cache.

If writes were non-existent, one possibility can be 2-level cache where you first have a thread-local cache, and then the normal cache with mutex or reader/writer lock.
If writes are extremely rare, you can do the same. But have some lock-free way of invalidating the thread-local cache, e.g. an atomic int updated with every write, and in those cases clear the thread-local cache.

You need to profile it.
In case you're stuck because you don't have a "similar enough" environment where you can actually test things, you can probably write a simple wrapper using pthreads: pthread_rwlock_t
pthread_rwlock_rdlock
pthread_rwlock_wrlock
pthread_rwlock_unlock
Of course you can design things to be lock free. Most obvious solution would be to not share state. (If you do share state, you'll have to check that your target platform supports atomic instructions). However, without any knowledge of your application domain, I feel very safe suggesting you do not want lock-free. See e.g. Do lock-free algorithms really perform better than their lock-full counterparts?

It all depends on the frequency of the updates, the size of the cache and how much is changed in the update.
Let's assume you have a rather big cache with a lot of changes on each update. Then I would use a read-copy-update pattern, which is lock-free.
If your cached data is pretty small and one time read (e.g. a single integer) RCU is also a good choice.
A big cache, with small updates or a big cache with updates which are to frequent for RCU a Read-Write Lock is a good choice.

Alongside other answers suggesting you profile it, a large benefit can be had if you can somehow structure or predict the type, order and size of the requests.
If particular types of data are requested in a typical cycle, it would be better to split up the cache per data type. You will increase cache-hit/miss ratios and the size of each cache can be adapted to the type. You will also reduce possible contention.
Likewise, the size of the requests is important when choosing your update approach. Smaller data fragments may be stored longer or even pooled together, while larger chunks may be requested less frequently.
Even with a basic prediction scheme in place that covers only the most frequent fetch patterns, you may already improve performance quite a bit. It's definitely worth it to try and train e.g. a NN (Neural Network) to guess the next request in advance.

does false sharing occur when data is read in openmp?

If I have a C++ program with OpenMP parallelization, where different threads constantly use some small shared array only for reading data from it, does false sharing occur in this case? in other words, is false sharing related only to memory write operations, or it can also happen with memory read operations.

Typically used cache coherence protocols, such as MESI (modified, exclusive, shared, invalid), have a specific state for cache lines called "shared". Cache lines are in this state if they are read by multiple processors. Each processor then has a copy of the cache line and can safely read from it without false-sharing. On a write, all processors are informed to invalidate the cache line which is the main cause for false-sharing

False sharing is a performance issue because it causes additional movement of a cache line which takes time. When two variables which are not really shared reside in the same line and separate threads update each of them, the line has to bounce around the machine which increases the latency of each access. In this case if the variables were in separate lines each thread would keep a locally modified copy of "its" line and no data movement would be required.
However, if you are not updating a line, then no data movement is necessary and there is no performance impact from the sharing beyond the fact that you might have been able to have data each thread does need in there, rather than data it doesn't. That is a small, second order, effect. though. So unless you know you are cache capacity limited ignore it!

Scenarios when software prefetching manual instructions are reasonable

I have read about that on x86 and x86-64 Intel gcc provides special prefetching instructions:
#include <xmmintrin.h>
enum _mm_hint
{
_MM_HINT_T0 = 3,
_MM_HINT_T1 = 2,
_MM_HINT_T2 = 1,
_MM_HINT_NTA = 0
};
void _mm_prefetch(void *p, enum _mm_hint h);
Programs can use the _mm_prefetch intrinsic on any
pointer in the program. And The different hints to be used with the _mm_prefetch
intrinsic are implementation defined. Generally said is that each of the hints have its own meaning.
_MM_HINT_T0
fetches data to all levels of the cache for inclusive caches
and to the lowest level cache for exclusive caches
_MM_HINT_T1 hint pulls the data into L2 and
not into L1d. If there is an L3 cache the _MM_HINT_T2
hints can do something similar for it
_MM_HINT_NTA, allows telling the processor to treat the prefetched cache line specially
So can someone describe examples when this instruction used?
And how to properly choose the hint?

The idea of prefetching is based upon these facts:
Accessing memory is very expensive the first time.
The first time a memory address1 is accessed is must be fetched from memory, it is then stored in the cache hierarchy2.
Accessing memory is inherently asynchronous.
The CPU doesn't need any resource from the core to perform the lengthiest part of a load/store3 and thus it can be easily done in parallel with other tasks4.
Thanks to the above it makes sense to try a load before it is actually needed so that when the code will actually need the data, it won't have to wait.
It is very worth nothing that the CPU can go pretty far ahead when looking for something to do, but not arbitrarily deep; so sometimes it needs the help of the programmer to perform optimally.
The cache hierarchy is, by its very nature, an aspect of the micro-architecture not the architecture (read ISA). Intel or AMD cannot give strong guarantees on what these instructions do.
Furthermore using them correctly is not easy as the programmer must have clear in mind how many cycles each instruction can take.
Finally, the latest CPU are getting more and more good at hiding memory latency and lowering it.
So in general prefetching is a job for the skilled assembly programmer.
That said the only possible scenario is where the timing of a piece of code must be consistent at every invocation.
For example, if you know that an interrupt handler always update a state and it must perform as fast as possible, it is worth, when setting the hardware that uses such interrupt, to prefetch the state variable.
Regarding the different level of prefetching, my understanding is that different levels (L1 - L4) correspond to different amounts of sharing and polluting.
For example prefetch0 is good if the thread/core that executes the instruction is the same that will read the variable.
However, this will take a line in all the caches, eventually evicting other, possibly useful, lines.
You can use this for example when you know that you'll need the data surely in short.
prefetch1 is good to make the data quickly available for all core or core group (depending on how L2 is shared) without polluting L1.
You can use this if you know that you may need the data or that you'll need it after having done with another task (that takes priority in using the cache).
This is not as fast as having the data in L1 but much better than having it in memory.
prefetch2 can be used to take out most of the memory access latency since it moves the data in the L3 cache.
It doesn't pollute L1 or L2 and it is shared among cores, so it's good for data used by rare (but possible) code paths or for preparing data for other cores.
prefetchnta is the easiest to understand, it is a non-temporal move. It avoids creating an entry in every cache line for a data that is accessed only once.
prefetchw/prefetchwnt1 are like the others but makes the line Exclusive and invalidates other cores lines that alias this one.
Basically, it makes writing faster as it is in the optimal state of the MESI protocol (for cache coherence).
Finally, a prefetch can be done incrementally, first by moving into L3 and then by moving into L1 (just for the threads that need it).
In short, each instruction let you decide the compromise between pollution, sharing and speed of access.
Since these all require to keep track of the use of the cache very carefully (you need to know that it's not worth creating and entry in the L1 but it is in the L2) the use is limited to very specific environments.
In a modern OS, it's not possible to keep track of the cache, you can do a prefetch just to find your quantum expired and your program replaced by another one that evicts the just loaded line.
As for a concrete example I'm a bit out of ideas.
In the past, I had to measure the timing of some external event as consistently as possible.
I used and interrupt to periodically monitor the event, in such case I prefetched the variables needed by the interrupt handler, thereby eliminating the latency of the first access.
Another, unorthodox, use of the prefetching is to move the data into the cache.
This is useful if you want to test the cache system or unmap a device from memory relying on the cache to keep the data a bit longer.
In this case moving to L3 is enough, not all CPU has an L3, so we may need to move to L2 instead.
I understand these examples are not very good, though.
1 Actually the granularity is "cache lines" not "addresses".
2 Which I assume you are familiar with. Shortly put: It, as present, goes from L1 to L3/L4. L3/L4 is shared among cores. L1 is always private per core and shared by the core's threads, L2 usually is like L1 but some model may have L2 shared across pairs of cores.
3 The lengthiest part is the data transfer from the RAM. Computing the address and initializing the transaction takes up resources (store buffer slots and TLB entries for example).
4 However any resource used to access the memory can become a critical issue as pointed out by #Leeor and proved by the Linux kernel developer.

Concurrent writes to different locations in the same cache line

Suppose I have a C++11 application where two threads write to different but nearby memory locations, using simple pointers to primitive types. Can I be sure that both these writes will end up in memory eventually (probably after both have reached a boost::barrier), or is there a risk that both CPU cores hold their own cache line containing that data, and the second core flushing its modification to RAM will overwrite and undo the modification done by the first write?
I hope that cache coherence will take care of this for me in all situations and on all setups compliant with the C++11 memory model, but I'd like to be sure.

Yes the cache coherency mechanisms will take care of this. This is called False sharing and should be avoided by better separating the data to increase performance.

How can i know my array is in cache?

Lets say my array is 32KB, L1 is 64 KB. Does Windows use some of it while program is running? Maybe I am not able to use L1 because windows is making other programs work? Should I set priority of my program to use all cache?
for(int i=0;i<8192;i++)
{
array_3[i]+=clock()*(rand()%256);//clock() and rand in cache too?
//how many times do I need to use a variable to make it stay in cache?
//or cache is only for reading? look below plz
temp_a+=array_x[i]*my_function();
}
The program is in C/C++.
Same thing for L2 too please.
Also are functions kept in cache? Cache is read only? (If I change my array then it loses the cache bond?)
Does the compiler create the asm codes to use cache more yield?
Thanks

How can i know my array is in cache?
In general, you can't. Generally speaking, the cache is managed directly by hardware, not by Windows. You also can't control whether data resides in the cache (although it is possible to specify that an area of memory shouldn't be cached).
Does windows use some of it while program is running? Maybe i am not able to use L1 because windows is making other programs work? Should i set priority of my program to use all cache?
The L1 and L2 caches are shared by all processes running on a given core. When your process is running, it will use all of cache (if it needs it). When there's a context switch, some or all of the cache will be evicted, depending on what the second process needs. So next time there's a context switch back to your process, the cache may have to be refilled all over again.
But again, this is all done automatically by the hardware.
also functions are kept in cache?
On most modern processors, there is a separate cache for instructions. See e.g. this diagram which shows the arrangement for the Intel Nehalem architecture; note the shared L2 and L3 caches, but the separate L1 caches for instructions and data.
cache is read only?(if i change my array then it loses the cache bond?)
No. Caches can handle modified data, although this is considerably more complex (because of the problem of synchronising multiple caches in a multi-core system.)
does the compiler create the asm codes to use cache more yield?
As cache activity is generally all handled automatically by the hardware, no special instructions are needed.

Cache is not directly controlled by the operating system, it is done
in hardware
In case of a context switch, another application may modify the
cache, but you should not care about this. It is more important to
handle cases when your program behaves cache unfriendly.
Functions are kept in cache (I-Cahce , instruction cache)
Cache is not read only, when you write something it goes to [memory
and] the cache.

The cache is primarily controlled by the hardware. However, I know that Windows scheduler tends to schedule execution of a thread to the same core as before specifically because of the caches. It understands that it will be necessary to reload them on another core. Windows is using this behavior at least since Windows 2000.

As others have stated, you generally cannot control what is in cache. If you are writing code for high-performance and need to rely on cache for performance, then it is not uncommon to write your code so that you are using about half the space of L1 cache. Methods for doing so involve a great deal of discussion beyond the scope of StackOverflow questions. Essentially, you would want to do as much work as possible on some data before moving on to other data.
As a matter of what works practically, using about half of cache leaves enough space for other things to occur that most of your data will remain in cache. You cannot rely on this without cooperation from the operating system and other aspects of the computing platform, so it may be a useful technique for speeding up research calculations but it cannot be used where real-time performance must be guaranteed, as in operating dangerous machinery.
There are additional caveats besides how much data you use. Using data that maps to the same cache lines can evict data from cache even though there is plenty of cache unused. Matrix transposes are notorious for this, because a matrix whose row length is a multiple of a moderate power of two will have columns in which elements map to a small set of cache lines. So learning to use cache efficiently is a significant job.

As far as I know, you can't control what will be in the cache. You can declare a variable as register var_type a and then access to it will be in a single cycle(or a small number of cycles). Moreover, the amount of cycles it will take you to access a chunk of memory also depends on virtual memory translation and TLB.
It should be noted that the register keyword is merely a suggestion and the compiler is perfectly free to ignore it, as was suggested by the comment.

Even though you may not know which data is in cache and which not, you still may get an idea how much of the cache you are utilizing. Modern processor have quite many performance counters and some of them related to cache. Intel's processors may tell you how many L1 and L2 misses there were. Check this for more details of how to do it: How to read performance counters on i5, i7 CPUs

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js