Dual socket vs single socket memory model? - c++

I am a bit confused about what memory looks like in a dual CPU machine from the perspective of a C/C++ program running on Linux.
Case 1 (understood)
With one quad-core HT CPU and 32GB RAM, I can, in theory, write a single-process application using up to 8 threads and up to 32GB RAM without going into swap or overloading the threading facilities. I am ignoring the OS and other processes here for simplicity.
Case 2 (confusion)
What happens with a dual quad-core HT CPU with 64GB RAM set up?
Development-wise, do you need to write an application to run as two processes (8 threads, 32GB each) that communicate or can you write it as one process (16 threads, 64GB full memory)?
If the answer is the former, what are some efficient modern strategies to utilize the entire hardware? shm? IPC? Also, how do you direct Linux to use a different CPU for each process?

From the application's viewpoint, the number of physical CPUs (dies) doesn't matter. Only the number of virtual processors. These include all cores on all processors, and double, if hyperthreading is enabled on a core. Threads are scheduled on them in the same way. It doesn't matter if the cores are all on one die or spread across multiple dies.
In general, the best way to handle these things is to not. Don't worry about what's running on which core. Just spawn an appropriate number of threads for your application, (up to a theoretical maximum equal to the total number of cores in the system), and let the OS deal with the scheduling.
The memory is shared amongst all cores in the system, of course. But again, it's up to the OS to handle allocation of physical memory. Very few applications really need to worry about how much memory they use, and divvying up that memory between threads. Let the OS handle that.
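A minimal sketch of the "spawn the threads and let the OS schedule them" advice, assuming C++11 or later. `std::thread::hardware_concurrency()` reports the number of logical processors across all sockets, and may return 0 when the count is unknown. The helper names `worker_count` and `parallel_sum` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// One worker per logical processor, falling back to 1 when the runtime
// cannot report a count (hardware_concurrency() returns 0 in that case).
unsigned worker_count() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : n;
}

// Sum `data` with `threads` workers. Each worker sums its own contiguous
// slice; no affinity is set, so the OS decides which core runs which thread.
long long parallel_sum(const std::vector<int>& data, unsigned threads) {
    assert(threads > 0);
    std::vector<long long> partial(threads, 0);
    std::vector<std::thread> pool;
    const std::size_t chunk = (data.size() + threads - 1) / threads;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            long long local = 0;
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;  // a single write per thread
        });
    }
    for (auto& th : pool) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling `parallel_sum(data, worker_count())` uses every logical processor the runtime reports, whether they sit on one socket or two.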

The memory model has nothing to do with the number of cores per se; rather, it has to do with the architecture employed on multi-core computers. Most mainstream computers use the symmetric multiprocessing (SMP) model, wherein a single OS controls all the CPUs, and programs running on those CPUs have access to all the available memory. Each CPU does have private memory (cache), but the RAM is all shared. So on a 64-bit machine it makes no difference whether you write one process or two, as far as memory usage is concerned. Programming-wise, you would be better off using a single process.
As others pointed out, you do need to worry about thread affinities and such, but that has more to do with efficient use of CPU resources, and little to do with RAM usage. There would be some implications for cache usage, though.
Contrast with other memory model computers, like NUMA (Non-Uniform Memory Access), where each CPU has its own block of memory, and communicating across CPUs then requires some arbiter in between. On these computers you WOULD NEED to worry about where to place your threads, memory wise.

Related

Multi-threaded C++ code slower with more physical cores? (threaded C++ mex function)

I am running multi-threaded C++ code on different machines right now. I am using it within a Matlab mex function, so the overall program is run from MatLab. I used the code in this link here, only changed what is done in "main_loop" to fit to my task. The code is running perfectly fine on two of my computers and it is many times faster than running the same C++ code as single thread. So I think that the program itself is fine.
However, when I run the same thing on a third machine, it is suddenly extremely slow. The single-threaded version is fine, but the multi-threaded one takes 10-15 times longer. Now, since everything seems fine on the other computers, my guess is that it has something to do with the specs of the third machine (details below). My guess: the third computer has two physical processors. I guess this requires copying everything physically to both processors? (The original code is intentionally written such that no hard copy of any involved variable is required.) If so, is there a way to control on which processor the threads are started? (It would already help if I could just limit myself to one CPU and avoid copying everything.) I already tried to set the number of threads down to 2, which did not help.
Specs of 2-CPU computer:
Intel Xeon Silver 4210R, 2.40Ghz (2 times), 128 GB Ram 64bit, Windows 10 Pro
Specs of other computers:
Intel Core i7-8700, 3.2Ghz, 64 GB Ram 64bit, Windows 10 Pro
Intel Core i7-10750H, 2.6Ghz, 16 GB Ram 64bit, Windows 10 Pro, Laptop
TL;DR: NUMA effects combined with false sharing are very likely to produce the observed effect only on the 2-socket system. Low-level profiling information is needed to confirm or disprove the hypothesis.
Multi-processor systems are subject to NUMA effects. Non-uniform memory access platforms are composed of NUMA nodes, each with its own local memory. Accessing the memory of another node is more expensive (higher latency and/or lower throughput). Multiple threads/processes located on multiple NUMA nodes accessing the memory of the same NUMA node can saturate it.
Allocated memory is split into pages that are mapped to NUMA nodes. The exact mapping policy depends heavily on the operating system (OS), its configuration, and that of the target processes. The first-touch policy is quite common: the page is allocated on the NUMA node that performs the first access to it. Depending on the chosen policy, the OS can migrate pages from one NUMA node to another based on the amount of remote NUMA node accesses. Controlling the policy is critical on NUMA platforms, especially if the application is not NUMA-aware.
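To illustrate the first-touch idea, here is a minimal sketch in portable C++ in which each worker thread performs the first write to the slice it will later work on. It assumes the OS really uses a first-touch policy and that threads stay on the node where they started; the code does not enforce either. The function name `first_touch_init` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Allocate n doubles without writing to them (new double[n] leaves the
// elements uninitialized), then let each worker write its own slice first.
// Under a first-touch policy, each page would then be placed on the node
// of the thread that wrote to it first. That placement is an assumption
// about the OS, not something this code can guarantee.
std::unique_ptr<double[]> first_touch_init(std::size_t n, unsigned threads, double value) {
    assert(threads > 0);
    std::unique_ptr<double[]> data(new double[n]);
    double* p = data.get();
    const std::size_t chunk = (n + threads - 1) / threads;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        pool.emplace_back([p, begin, end, value] {
            for (std::size_t i = begin; i < end; ++i) p[i] = value;  // first write
        });
    }
    for (auto& th : pool) th.join();
    return data;
}
```

Note that `std::vector<double> v(n)` would not work for this purpose, since the constructor writes every element from the constructing thread.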
The memory of multiple NUMA nodes is kept coherent thanks to a cache coherence protocol and a high-performance inter-processor communication network (Ultra Path Interconnect in your case). Cache coherence also applies between cores of the same processor. The thing is, moving a cache line from (the L2 cache of) one core to another (L2 cache) is much faster than moving it from (the L3 cache of) one processor to another (L3 cache). Here is an analogy with human communication: neurons in different cortical areas of one brain communicate faster than two humans do with each other.
If your application operates in parallel on the same cache line, false sharing can cause a cache-line bouncing effect, which is much more visible between threads spread across different processors.
This is a very complex topic. That being said, you can analyse these effects using low-level profilers like VTune (or perf on Linux). The idea is to analyse low-level hardware performance counters like L2/L3 cache misses/hits, RFOs, remote NUMA accesses, etc. This can be complex and tedious for someone not familiar with how processors and OSes work, but VTune helps a bit. Note that Intel has some more specific tools to analyse (more easily) such effects, which commonly occur in parallel applications. AFAIK, they are part of the Intel XE set of applications (which is not free). The best thing to do is to avoid false sharing using padding, to design your application so that each thread operates on its own memory locations as much as possible (i.e. good locality), to control the NUMA allocation policy, and finally to bind threads/processes to cores (to avoid unexpected migrations).
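A minimal sketch of the padding technique, assuming a 64-byte cache line (common on x86, but an assumption here) and a C++17 compiler so that the over-aligned type is allocated correctly. `PaddedCounter` and `count_per_thread` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; check your target CPU.
constexpr std::size_t kCacheLine = 64;

// alignas rounds the struct size up to a full cache line, so counters that
// are neighbours in an array never share a line.
struct alignas(kCacheLine) PaddedCounter {
    long long value = 0;
};

// Each thread increments only its own counter. Without the padding, the
// counters would sit side by side in one line and bounce between cores.
std::vector<long long> count_per_thread(unsigned threads, long long iterations) {
    assert(threads > 0);
    std::vector<PaddedCounter> counters(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counters, t, iterations] {
            for (long long i = 0; i < iterations; ++i) counters[t].value += 1;
        });
    }
    for (auto& th : pool) th.join();

    std::vector<long long> result;
    for (const auto& c : counters) result.push_back(c.value);
    return result;
}
```

The cost is memory: every counter now occupies a whole cache line, which is usually acceptable for a handful of per-thread slots.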
Experimental benchmarks can also be used to quickly check whether NUMA effects and false sharing occur. For example, you can bind all the threads/processes to the same NUMA node and tell the OS to allocate pages on that node. This enables you to find issues related to NUMA effects. Another example is to bind two threads/processes to two different logical cores (i.e. hardware threads) of the same physical core, and then to different physical cores, to see whether performance is impacted. This helps you to locate false-sharing issues. That being said, such experiments can be affected by many other effects, which add noise and make the analysis pretty complex in practice for large applications. Thus, a low-level analysis based on hardware performance counters is better.
Note that some processors, like AMD Zen ones, are composed of multiple sub-parts (called CCD/CCX) that can be seen as multiple NUMA nodes even though there is only one processor and one socket. Such architectures will certainly become more widespread in the future. In fact, Intel has also started to go in this direction with Sub-NUMA Clustering.

std::vector reserve & resize NUMA locality

I'm currently looking into optimizing NUMA locality of my application.
So far I think I understand that memory will be resident to that NUMA node that first touches it after allocation.
My questions in regards to std::vector (using the default allocator) are:
std::vector::reserve allocates new memory - but does it also touch it? If not, how can I force to touch it after a call to reserve?
does std::vector::resize touch the memory?
what about the constructor that takes size_t?
And about NUMA in general:
If memory that already has been touched is paged out to disk and then is accessed again and generates a hard fault, does that count as a new first touch or is the page loaded into the memory resident to the numa node that touched it first originally?
I'm using c++11 threads. So long as I'm inside a thread and allocating/touching new memory, can I be sure that all this memory will be resident to the same numa node, or is it possible that the OS switches the executing CPU underneath my thread while it executes and then some of my allocations will be in one NUMA domain and others in another?
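On the `std::vector` questions above, a small sketch of what the C++ standard specifies: `reserve` only obtains storage and constructs no elements, while `resize(n)` and the `size_t` constructor value-initialize the elements, which means writing to each of them. Whether the allocator itself writes bookkeeping data near the block is implementation-specific and not covered here. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct VectorState {
    std::size_t size;
    std::size_t capacity;
};

// reserve(n): capacity grows to at least n, but no element is constructed,
// so the library does not write to the element storage.
VectorState after_reserve(std::size_t n) {
    std::vector<int> v;
    v.reserve(n);
    return {v.size(), v.capacity()};
}

// resize(n): n elements are value-initialized (zero for int), so every
// element is written by the calling thread. std::vector<int> v(n) behaves
// the same way.
VectorState after_resize(std::size_t n) {
    std::vector<int> v;
    v.resize(n);
    assert(n == 0 || v.back() == 0);  // the elements exist and were zeroed
    return {v.size(), v.capacity()};
}
```

So to force a touch after `reserve`, the elements have to be created (for example with `resize`) by the thread that should own the pages.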
Assuming we're talking about Intel CPUs: on their Nehalem-vintage CPUs, if you had two such CPUs, there was a power-on option for telling them how to divide up physical memory between them. The physical architecture is two CPUs connected by QPI, with each CPU controlling its own set of memory DIMMs. The options are:
first half of the physical address space on one CPU, second half on the next, or
alternating of memory pages between CPUs
For the first option, if you allocated a piece of memory it'd be down to the OS where it would take that from in the physical address space, and then I suppose a good scheduler would endeavour to run threads accessing that memory on the CPU that's controlling it. For the second option, if you allocated several pages of memory then that'd be split between the two physical CPUs, and then it wouldn't really matter what the scheduler did with threads accessing it. I actually played around with this briefly, and couldn't really spot the difference; Intel had done a good job on the QPI. I'm less familiar with newer Intel architectures, but I'm assuming that it's more of the same.
The other question really is what do you mean by a NUMA node? If we are referring to modern Intel and AMD CPUs, these present a synthesized SMP environment to software, using things like QPI / HyperTransport (and now their modern equivalents) to do so on top of a NUMA hardware architecture. So when talking about NUMA locality, it's really a case of whether or not the OS scheduler will run the thread on a core of the CPU that controls the RAM the thread is accessing (SMP meaning that it can be run on any core and still access the memory, though perhaps with slight latency differences, no matter where in physical memory it was allocated). I don't know the answer, but I think that some will do that. Certainly the endeavours I've made to use core affinity for threads and memory have yielded only a tiny improvement over just letting the OS (Linux 2.6) do its thing. And the cache systems on modern CPUs and their interaction with inter-CPU interconnects like QPI are very clever.
Older OSes dating back to when SMP really was pure hardware SMP wouldn't know to do that.
Small rabbit hole - if we are referring to a pure NUMA system (Transputers, the Cell processor out of the PS3 and its SPEs) then a thread would be running on a specific core and would be able to access only that core's memory; to access data allocated (by another thread) in another core's memory, the software has to sort that out itself by sending data across some interconnect. This is much harder to code for until learned, but the results can be impressively fast. It took Intel about 10 years to match the Cell processor for raw maths speed.

openMP: Running with all threads in parallel leads to out-of-memory-exceptions

I want to shorten the runtime of a lengthy image processing algorithm, which is applied to multiple images, by using parallel processing with OpenMP.
The algorithm works fine with single or limited number (=2) of threads.
But: The parallel processing with openMP requires lots of memory, leading to out-of-memory-exceptions, when running with the maximum number of possible threads.
To resolve the issue, I replaced the "throwing of exceptions" with a "waiting for free memory" in case of low memory, leading to many (<= all) threads just waiting for free memory...
Is there any solution/tool/approach to dynamically maintain the memory or start threads depending on available memory?
Try compiling your program as 64-bit. 32-bit programs can only address up to 2^32 bytes, about 4GB of memory. 64-bit programs can use significantly more (2^64 bytes, which is about 18 exabytes). It's very easy to hit 4GB of memory these days.
Note that if you are using more RAM than you have available, your OS will have to page some memory to disk. This can hurt performance a lot. If you get to this point (where you are using a significant portion of RAM) and still have extra cores, you would have to go deeper into the algorithm to find a more granular section to parallelize.
If you for some reason can't switch to 64-bit, you can do multiprocessing (running multiple instances of a program) so each process will have up to 4GB. You will need to launch and coordinate the processes somehow. Depending on your needs, this could mean using simple command-line arguments or complicated inter-process communication (IPC). OpenMP doesn't do IPC, but Open MPI does. Open MPI is generally used for communication between many nodes on a network, but it can be set up to run concurrent instances on one machine.
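The "waiting for free memory" idea from the question can be made explicit with a byte budget that each worker reserves from before it allocates. This is a minimal sketch using only the C++ standard library, not an OpenMP feature; `MemoryBudget` and its methods are hypothetical names, and it assumes the caller can estimate how many bytes one image needs.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

class MemoryBudget {
public:
    explicit MemoryBudget(std::size_t total_bytes)
        : available_(total_bytes), total_(total_bytes) {}

    // Block until `bytes` can be reserved. A request larger than the whole
    // budget could never succeed, so it is treated as a programming error.
    void acquire(std::size_t bytes) {
        assert(bytes <= total_);
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return available_ >= bytes; });
        available_ -= bytes;
    }

    // Non-blocking variant: reserve only if the bytes are free right now.
    bool try_acquire(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (available_ < bytes) return false;
        available_ -= bytes;
        return true;
    }

    // Return previously reserved bytes and wake up waiting workers.
    void release(std::size_t bytes) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            available_ += bytes;
        }
        cv_.notify_all();
    }

    std::size_t available() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return available_;
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t available_;
    const std::size_t total_;
};
```

A worker would call `acquire(estimate)` before processing an image and `release(estimate)` afterwards, so the number of images in flight adapts to their size instead of being fixed by the thread count.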

What does taskset in linux exactly do?

I have a program running on a 32 core system using Intel TBB.
The problem I have is when I set the program to use 32 threads, the performance doesn't gain enough compared to 16 threads (only 50% boost). However, when I use:
taskset 0xFFFFFFFF ./foo
which would lock the process to 32 cores, the performance is much better.
I have the two following questions:
Why? By Default, the OS would use all 32 cores for the 32 thread program anyway.
I'm assuming that even with taskset, the OS is still allowed to (and would) move the software threads between the hardware threads, i.e. the threads are not pinned. Am I right?
Thanks.
The operating system may choose to use fewer cores for cache reasons. Imagine the application's threads all use the same region of memory: each write then causes a cache invalidation. By forcing the affinity, you are essentially telling the OS that the cache overhead of concurrency is acceptable and that it should go ahead and use all the cores.
You must also remember there are other processes to run (like kthreads from the kernel, and background processes.) and migrating threads between cores is costly and may cause imbalances if your threads are not doing an even amount of work.
Also remember that the OS tries to evenly distribute work on the cores across ALL processes not just yours. This means that the load balancer may choose to not place your process on all 32 cores as there are other processes currently running and migration costs could be high or spreading your process evenly could cause load imbalance among the cpu cores. The OS strives for best system performance not necessarily best per application performance.
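For reference, the mask given to `taskset` is a plain bitmask: bit i set means logical CPU i is allowed, so `0xFFFFFFFF` allows CPUs 0 to 31. The mask only restricts the set of CPUs the process may use; the scheduler can still move threads around within that set. A small sketch of the mask arithmetic, with hypothetical helper names and a 64-CPU limit chosen only to keep it short:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Build the mask that allows the first n logical CPUs (n <= 64 here).
std::uint64_t mask_for_first_n_cpus(unsigned n) {
    assert(n <= 64);
    // Shifting a 64-bit value by 64 is undefined, so handle n == 64 apart.
    return n == 64 ? ~std::uint64_t{0} : ((std::uint64_t{1} << n) - 1);
}

// List the logical CPU indices whose bit is set in the mask.
std::vector<int> cpus_in_mask(std::uint64_t mask) {
    std::vector<int> cpus;
    for (int cpu = 0; cpu < 64; ++cpu) {
        if (mask & (std::uint64_t{1} << cpu)) cpus.push_back(cpu);
    }
    return cpus;
}
```

With these helpers, `mask_for_first_n_cpus(32)` gives the `0xFFFFFFFF` used in the question.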

Multiple instances of program on multi-core machine

I am assuming a dual-core (2 cores per processors) machine with 2 processors for the questions that follow; so a total of 4 "cores". So some natural questions arose:
Suppose I wrote a simple serial program and built it in, say, Visual Studio, and ran the same program twice, say, with distinct input data in each run. Would they be running on the same processor? Or distinct processors? How much RAM would be assigned to each? Would it be the RAM on 1 processor (2 cores) or the total RAM? I believe the two programs would run on distinct processors and should each have the RAM of 1 processor (2 cores); but I am not 100% certain. Would the behavior be any different on Linux?
Now suppose my program was written using a distributed memory parallel interface such as MPI and that I ran it once with 2 processors in the np argument (say). Would the program use both processors (and in effect all 4 cores)? Is this the optimal value for the argument -np? In other words, if I did the same with -np 3 or -np 4; is it correct to assume there would be no added advantage? Again, I think so, but I am not 100% certain. I assume also that I could go higher than 4 (-np 5, -np 6, etc). In such cases, how do the processes compete for memory at values of np > 4? Would the performance get worse for np > 4. I think yes, and perhaps this partly depends on problem size, but again not 100% sure.
Next, suppose I ran two instances of my MPI-built parallel program, both with -np 2, each with, say, different input data. First off, is this possible? I assume it is and that they each run on both processors? How are the two programs synchronized and how do they individually compete for memory? This should, at least in part, be based on the order of launching the programs, presumably?
Lastly, suppose my program was written using a shared memory parallel interface such as OpenMP and that I ran it once. How many "threads" can I run it on to make full use of shared memory parallelism: is it 2 or 4 (since I have 2 processors with 2 cores each)? My guess is 4, since all 4 cores are part of a single shared memory unit. Is that correct? If the answer is 4, does it make sense to run on more than 4 threads? I am not sure this even works (unlike MPI, where I believe we can do -np 5, -np 6 and so on).
Finally, suppose I run 2 instances of the shared memory parallel program, each with, say, different input data. I assume this is possible and that the individual processes would somehow compete for memory, presumably in the order the programs were launched?
Which processor they run on is entirely up to the OS and depends on many factors, including whatever else is happening on the same machine. The common case, though, is that they will tend to sit on one core each, occasionally swapping to different cores ("occasionally" may mean several times a second or even more frequently).
Cores don't have their own RAM on normal PC hardware, and the processes will be given however much RAM they ask for.
For MPI processes, yes, your parallelism should match the core count (assuming a CPU-heavy workload). If two MPI processes run with -np 2, they will simply consume all four cores. Increase anything and they'll start to contend. As explained above, RAM has nothing to do with any of this, though cache will suffer in the presence of contention.
This "question" is way too long, so I'm going to stop now.
@Marcelo is absolutely right and I'd like to just expand on his answer a little bit.
The OS will determine where and when the threads that comprise the application execute, depending on what else is going on in the system and the available resources. Each application will run in its own process, and that process can have hundreds or thousands of threads. The OS (Windows, Linux, Mac, whatever) will switch the execution context of the processing cores to ensure that all applications and services get a slice of the pie.
As for I/O access to such things as RAM, that is physically controlled by the northbridge controller that sits on your motherboard. Each process (not processor!) will have an allocated amount of RAM that it can deal with, which can expand or contract over the lifetime of the application. This is of course limited by the amount of resources available on the system. It is also worth noting that the OS will take care of swapping to disk any RAM requests beyond what is physically available (i.e. virtual memory).
On the other hand though you will need to coordinate access to memory within your application through the use of critical sections and other thread synchronising mechanisms.
OpenMP is a library that helps you write multithreaded parallel applications and makes the syntax of keeping threads in sync easier. I would comment more, but it's been quite a while since I've used it and I'm sure someone could give a better explanation.
I see you are using Windows, so I will summarize by saying that you can set process affinities (which core or cores a process can run on) in the Task Manager. There's also a WinAPI call, but the name escapes me.
a) For a single-threaded program, they will not launch on the same CPU (assuming it's CPU-bound). You can guarantee it by changing the affinity. On Linux there's a call, sched_setaffinity, and a userspace program, taskset.
b) depends on the MPI library; the machinery is library-specific.
c) it depends on the specific application and data pattern. For small data accesses but lots of message passing, you may actually find limiting to 1 CPU to be the most efficient pattern.
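A minimal, Linux-only sketch of the `sched_setaffinity` call mentioned in (a), assuming glibc's `<sched.h>` with the `CPU_*` macros. The helper names are hypothetical. It picks a CPU from the set the thread is already allowed to use, since CPU 0 is not always available (for example inside a container).

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros; g++ usually defines it already
#endif
#include <sched.h>

#include <cassert>

// Number of CPUs the calling thread may currently run on, or -1 on failure.
int allowed_cpu_count() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    return CPU_COUNT(&allowed);
}

// Lowest-numbered CPU in the calling thread's current affinity set, or -1.
int first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) return cpu;
    }
    return -1;
}

// Restrict the calling thread to exactly one CPU. A pid of 0 means
// "the calling thread". Returns true on success.
bool pin_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t target;
    CPU_ZERO(&target);
    CPU_SET(cpu, &target);
    return sched_setaffinity(0, sizeof(target), &target) == 0;
}
```

Calling `pin_to_cpu(first_allowed_cpu())` at the start of a thread keeps it on that one CPU; threads created afterwards by that thread inherit the same mask.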