How does affinity scheduling reduce the number of TLB misses and page faults? - scheduling

I am trying to understand how affinity scheduling reduces TLB misses and page faults. Can someone please give me an explanation of how this works? I understand what affinity scheduling is, but I cannot understand how it can reduce TLB misses and page faults.

Affinity scheduling is about putting the right thread on the right CPU. Doing so might well reduce TLB misses, since each CPU keeps its own TLB, and a thread that stays on the same CPU can reuse the translations already cached there. On the other hand, it has no effect on page faults: if a page is in memory for one CPU, it is in memory for all CPUs.
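To make the TLB/cache side of this concrete, here is a minimal Linux sketch (my own illustration, not part of the answer above) of pinning the calling thread to one CPU with pthread_setaffinity_np, so the scheduler keeps it where its TLB and cache entries are already warm; the choice of CPU 2 is arbitrary.

    // Compile with: g++ -O2 -pthread pin.cpp   (Linux/glibc; g++ defines _GNU_SOURCE by default)
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    // Pin the calling thread to a single logical CPU so it is not migrated,
    // letting it keep reusing the TLB and cache contents of that CPU.
    static bool pin_to_core(int core_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
    }

    int main() {
        if (!pin_to_core(2))   // CPU 2 is an arbitrary example
            std::fprintf(stderr, "failed to set affinity\n");
        // ... run the workload here; it will stay on the chosen CPU ...
    }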

Related

Micro benchmarking C++ Linux

I am benchmarking a function of my C++ program using inline rdtsc (first and last instruction in the function). My setup has isolated cores, hyper-threading is off, and the frequency is 3.5 GHz.
I cannot afford more than 1000 CPU cycles, so I count the percentage of calls taking more than 1000 CPU cycles, and that is approximately 2-3%. The structure being accessed in the code is huge and can certainly result in cache misses. But a cache miss is 300-400 CPU cycles.
Is there a problem with rdtsc benchmarking? If not, what else can cause 2-3% of my calls, which go through the same set of instructions, to take an abruptly high number of cycles?
I want help understanding what I should look for to explain this 2-3% of worst cases (WC).
Often, rare "performance noise" like you describe is caused by context switches in the timed region, where your process happened to exceed its scheduler quantum during your interval and some other process was scheduled to run on the core. Another possibility is a core migration by the kernel.
When you say "My setup has isolated cores", are you actually pinning your process to specific cores using a tool (e.g. hwloc)? This can greatly help to get reproducible results. Have you checked for other daemon processes that might also be eligible to run on your core?
Have you tried measuring your code using a sampling profiler like gprof or HPCToolkit? These tools provide a lot more context and behavioral information that can be difficult to discover from manual instrumentation.
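For what it's worth, here is a rough sketch (mine, not the answerer's) of the kind of fenced rdtsc read I would compare against a bare rdtsc; the lfence on each side keeps out-of-order execution from leaking surrounding work into or out of the timed region, which can otherwise produce odd readings:

    #include <cstdint>
    #include <x86intrin.h>   // __rdtsc and _mm_lfence (GCC/Clang on x86-64)

    // Read the time-stamp counter with fences so the read is not reordered
    // with the instructions around it.
    static inline uint64_t fenced_rdtsc() {
        _mm_lfence();
        uint64_t t = __rdtsc();
        _mm_lfence();
        return t;
    }

    // Usage: uint64_t t0 = fenced_rdtsc(); work(); uint64_t elapsed = fenced_rdtsc() - t0;

Note that on recent CPUs the TSC ticks at a fixed reference rate rather than counting actual core cycles, so outliers can also come from frequency changes, not just from cache misses or preemption.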

Multithreaded Cache Miss Exploiting

When I, e.g., iterate over a linked list and am really unlucky, I will have a ~0% cache hit rate (let's assume this anyway). Let's also assume I have a CPU that can only run one instruction at a time (no multicore / hyperthreads) for simplicity. Cool. Now with my 0% hit rate the CPU / program is spending 99% of the time waiting for data.
Question: if a thread is waiting for data from RAM / disk, is that core blocked? Or can I exploit the low cache hit rate by running other threads (or in some other way that has nothing to do with increasing the hit rate), so the CPU does not exclusively wait for data and does other work instead?
If you run SMT, then the other thread can grab all the core resources and hence cover the cache miss (at least partially).
I know of no processor that switches tasks on a cache miss, but I know several architectures that use SMT-2/4/8 (yes, some Power CPUs have SMT-8) to cover such cases.
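As a rough illustration of that (my own sketch; Node, walk and crunch are made-up names): give the core a second runnable thread, and if the OS places the two on SMT siblings, the compute-bound thread can use the execution units while the pointer-chasing thread stalls on memory.

    #include <cstdint>
    #include <thread>

    struct Node { Node* next; uint64_t value; };

    // Memory-latency-bound: each step depends on a load that may miss every cache level.
    uint64_t walk(const Node* head) {
        uint64_t sum = 0;
        for (const Node* n = head; n != nullptr; n = n->next)
            sum += n->value;
        return sum;
    }

    // Compute-bound: keeps the ALUs busy and touches almost no memory.
    uint64_t crunch(uint64_t x, uint64_t iters) {
        for (uint64_t i = 0; i < iters; ++i)
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        return x;
    }

    void run_overlapped(const Node* head) {
        uint64_t a = 0, b = 0;
        std::thread t1([&] { a = walk(head); });            // stalls on cache misses
        std::thread t2([&] { b = crunch(1, 1u << 28); });   // fills those stall cycles on an SMT sibling
        t1.join();
        t2.join();
    }

Whether the two threads really land on siblings of the same physical core is up to the scheduler unless you pin them explicitly.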

How to optimize code for Simultaneous Multithreading?

Currently, I am learning parallel processing on the CPU, which is a well-covered topic with plenty of tutorials and books.
However, I could not find a single tutorial or resource that talks about programming techniques for hyper-threaded CPUs. Not a single code sample.
I know that to utilize hyper-threading, the code must be implemented such that different parts of the CPU can be used at the same time (the simplest example is calculating integer and float at the same time), so it's not plug-and-play.
Which book or resource should I look at if I want to learn more about this topic? Thank you.
EDIT: when I said hyper-threading, I meant Simultaneous Multithreading in general, not Intel's Hyper-Threading specifically.
Edit 2: for example, if I have an 8-core i7 CPU, I can make a sorting algorithm that runs 8 times faster when it uses all 8 cores instead of 1. But it will run the same on a 4-core CPU and a 4c-8t CPU, so in my case SMT does nothing.
Meanwhile, Cinebench will run much better on a 4c-8t CPU than on a 4c-4t CPU.
SMT is generally most effective when one thread is loading something from memory. Depending on where the data lives (L1, L2, L3 cache, or RAM), the read/write latency can span a lot of CPU cycles that would otherwise be wasted doing nothing if only one thread were executed per core.
So, if you want to maximize the impact of SMT, try to interleave the memory accesses of two threads so that one of them can execute instructions while the other reads data. Theoretically, you can also use a thread just for cache warming, i.e. loading data from RAM or main storage into the cache for subsequent use by other threads.
The way to apply this successfully can vary from one system to another, because the access latencies of cache, RAM and main storage, as well as their sizes, may differ by a lot.
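A hedged sketch of the cache-warming idea (mine, not the answerer's; CHUNK and the function names are invented): a helper thread reads the next chunk of a large array while the worker processes the current one, so the worker's loads find the data already in the shared cache.

    #include <cstddef>
    #include <thread>
    #include <vector>

    constexpr std::size_t CHUNK = 1 << 16;   // elements per chunk (made-up size)

    // Worker: does the real computation on chunk i.
    double process_chunk(const std::vector<double>& data, std::size_t i) {
        double sum = 0.0;
        for (std::size_t k = i * CHUNK; k < (i + 1) * CHUNK && k < data.size(); ++k)
            sum += data[k] * data[k];
        return sum;
    }

    // Warmer: touches chunk i+1 so it is pulled into the shared cache
    // while the worker is still busy with chunk i.
    void warm_chunk(const std::vector<double>& data, std::size_t i) {
        volatile double sink = 0.0;
        for (std::size_t k = i * CHUNK; k < (i + 1) * CHUNK && k < data.size(); k += 8)
            sink = sink + data[k];           // one touch per 64-byte cache line
    }

    double process_all(const std::vector<double>& data) {
        const std::size_t chunks = (data.size() + CHUNK - 1) / CHUNK;
        double total = 0.0;
        for (std::size_t i = 0; i < chunks; ++i) {
            std::thread warmer;
            if (i + 1 < chunks)
                warmer = std::thread(warm_chunk, std::cref(data), i + 1);
            total += process_chunk(data, i);
            if (warmer.joinable())
                warmer.join();
        }
        return total;
    }

Spawning a thread per chunk is only for brevity; in practice you would keep one long-lived warmer thread and hand it chunk indices, and you would measure whether the warming actually pays off on your memory hierarchy.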

More TLB misses when process memory size larger?

I have a program which I have written in C++. On Linux the process is allocated a certain amount of memory. Part is the stack, part the heap, part text and part BSS.
Is the following true:
The larger the amount of memory allocated to the heap component of my process, the greater the chance of Translation Lookaside Buffer (TLB) misses?
And generally speaking: the more memory my application process consumes, the greater the chance of TLB misses?
I think there is no direct relationship between the amount of memory allocated and the TLB miss rate. As far as I know, as long as your program has good locality, TLB misses will remain low.
There are several reasons that can lead to a high TLB miss rate:
1. Not enough memory and too many running processes;
2. Low locality in your program;
3. An inefficient access pattern when visiting array elements in loops in your code (see the sketch below).
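To make point 3 concrete (my own illustration, not from the answer): with a row-major 2D array, walking it column-by-column jumps a whole row's worth of bytes per access, so nearly every access touches a different page and a different TLB entry, while a row-by-row walk stays on the same page for many accesses.

    #include <cstddef>
    #include <vector>

    constexpr std::size_t N = 4096;   // 4096*4096 doubles = 128 MiB, far beyond TLB reach

    // Row-by-row: consecutive accesses stay on the same 4 KB page for a long time.
    double sum_row_major(const std::vector<double>& a) {   // a must hold N*N elements
        double s = 0.0;
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                s += a[i * N + j];
        return s;
    }

    // Column-by-column: each access jumps N*sizeof(double) = 32 KiB, i.e. 8 pages,
    // so nearly every access lands on a different page and stresses the TLB.
    double sum_col_major(const std::vector<double>& a) {
        double s = 0.0;
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t i = 0; i < N; ++i)
                s += a[i * N + j];
        return s;
    }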
Programs are usually divided into phases that exhibit completely different memory and execution characteristics - your code may allocate a huge chunk of memory at some point, then be off doing some other unrelated computations. In that case, your TLBs (that are basically just caches for address translation) would age away the unused pages and eventually drop them. While you're not using these pages, you shouldn't care about that.
The real question is: when you get to some performance-critical phase, are you going to work with more pages than your TLBs can hold simultaneously? On the one hand, modern CPUs have large TLBs, often with two levels of caching; the L2 TLB of a modern Intel CPU should have (IIRC) 512 entries, which is 2 MB worth of data if you're using 4 KB pages (with large pages it would be more, but TLBs usually don't like to work with them due to potential conflicts with smaller pages).
It's quite possible for an application to work with more than 2 MB of data, but you should avoid touching all of it at the same time if possible, either by cache tiling or by changing the algorithm. That's not always possible (e.g. when streaming from memory or from I/O), but then TLB misses are probably not your main bottleneck. When working with the same set of data and accessing the same elements multiple times, you should always try to keep them cached as close as possible.
It's also possible to use software prefetches to make the CPU perform the TLB misses (and the following page walks) earlier, preventing them from blocking your progress. On some CPUs, hardware prefetchers already do this for you.
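A hedged sketch of that software-prefetch idea (my own example; the distance of 16 elements is an arbitrary starting point you would tune): issue the prefetch far enough ahead that the translation and the data can arrive before the loop reaches that element. Whether a prefetch that misses the TLB actually starts a page walk depends on the microarchitecture, so measure it.

    #include <cstddef>

    // Sum a large array while prefetching a fixed distance ahead.
    double sum_with_prefetch(const double* a, std::size_t n) {
        constexpr std::size_t DIST = 16;            // prefetch distance in elements (tunable)
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + DIST < n)
                __builtin_prefetch(&a[i + DIST]);   // GCC/Clang builtin; only a hint, may be dropped
            s += a[i];
        }
        return s;
    }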

time of an assembler operation

Why can the same assembler operation (mul, for example) in different parts of a program consume different amounts of time?
P.S. I'm using C++ and a disassembler.
This question is very vague, but generally on a modern CPU you cannot expect operations to have a constant execution time, because a lot of factors can influence this, including but not limited to:
Branch prediction failures
Cache misses
Pipelining
...
There are all kinds of reasons why the same kind of operation can have massively varying performance on modern processors.
Data Cache Misses:
If your operation accesses memory, it might hit the cache in one location and generate a cache miss elsewhere. Cache misses can be on the order of a hundred cycles, while simple operations often execute in a few cycles, so this will make it much slower.
Pipeline Stalls:
Modern CPUs are typically pipelined, so one instruction (or more than one) can be scheduled each cycle, but instructions typically need more than one cycle until the result is available. Your operation might depend on the result of another operation which isn't ready when your operation is scheduled, so the CPU has to wait until the operation generating the result has finished.
Instruction Cache Misses:
The instruction stream is also cached, so you might find a situation where, for one location, the CPU generates a cache miss each time it encounters that location (unlikely for anything that takes a measurable amount of the runtime, though; instruction caches aren't that small).
Branch Misprediction:
Another kind of pipeline stall. The CPU will try to predict which way a conditional jump will go and speculatively execute the code on that path. If it is wrong, it has to discard the results of this speculative execution and start on the other path. This might show up on the first line of the other path in a profiler.
Resource Contention: The operation might not depend on an unavailable result, but the execution unit it needs might still be occupied by another instruction (some instructions are not fully pipelined on all processors, or it might be due to some kind of Hyper-Threading or Bulldozer's shared FPU). Again, the CPU might have to stall until the unit is free.
Page Faults: Should be pretty obvious. Basically a cache miss on steroids. If the accessed memory has to be reloaded from disk, it will cost hundreds of thousands of cycles.
...: The list goes on; however, the points mentioned above are the ones most likely to make an impact, in my opinion.
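To make the pipeline-stall point concrete (my sketch, not the answerer's): the very same multiply instruction costs far more per iteration when every multiply depends on the previous result than when several independent accumulators let the CPU overlap the latencies.

    #include <cstdint>

    // Latency-bound: each multiply must wait for the previous one to finish.
    uint64_t mul_dependent(uint64_t x, uint64_t n) {
        uint64_t acc = 1;
        for (uint64_t i = 0; i < n; ++i)
            acc *= x;                      // serial dependency chain
        return acc;
    }

    // Throughput-bound: four independent chains can be in flight at once,
    // so the same multiply instruction retires several times faster overall.
    uint64_t mul_independent(uint64_t x, uint64_t n) {   // n assumed divisible by 4 for brevity
        uint64_t a = 1, b = 1, c = 1, d = 1;
        for (uint64_t i = 0; i < n; i += 4) {
            a *= x; b *= x; c *= x; d *= x;
        }
        return a * b * c * d;
    }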
I assume you're asking about exactly the same instruction applied to the same operands.
One possible cause that could have huge performance implications is whether the operands are readily available in the CPU cache or whether they have to be fetched from the main RAM.
This is just one example; there are many other potential causes. With modern CPUs it's generally very hard to figure out how many cycles a given instruction will require just by looking at the code.
In the profiler I see "mulps %xmm11, %xmm5", for example. I guess the data is in registers.
xmmXX are the SSE registers. mulps is a packed single-precision multiply; it depends on whether you are comparing an SSE multiply against a normal scalar multiply, in which case the difference is understandable.
We really need more information for a better answer: a chunk of asm and your profiler's figures.
Is it just this instruction that is slow, or a block of instructions? Maybe it's loading from unaligned memory, or you're getting cache misses, pipeline hazards, or any of a significant number of other possibilities.