std::vector reserve & resize NUMA locality - c++

I'm currently looking into optimizing NUMA locality of my application.
So far I think I understand that memory will be resident on the NUMA node of the CPU that first touches it after allocation.
My questions in regards to std::vector (using the default allocator) are:
std::vector::reserve allocates new memory - but does it also touch it? If not, how can I force it to be touched after a call to reserve?
does std::vector::resize touch the memory?
what about the constructor that takes size_t?
And about NUMA in general:
If memory that has already been touched is paged out to disk and is then accessed again, generating a hard fault, does that count as a new first touch, or is the page loaded back into memory on the NUMA node that touched it first originally?
I'm using C++11 threads. As long as I'm inside a thread and allocating/touching new memory, can I be sure that all this memory will be resident on the same NUMA node, or is it possible that the OS switches the executing CPU underneath my thread while it executes, so that some of my allocations end up in one NUMA domain and others in another?
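For concreteness, here is roughly the pattern I have in mind (a minimal sketch assuming a first-touch policy; the size and the single worker thread are just illustrative):

    #include <cstddef>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t n = 100000000;  // illustrative size
        std::vector<double> data;
        data.reserve(n);  // allocates, but (with the default allocator) presumably does not write the storage

        std::thread worker([&] {
            // resize() value-initializes the new elements, i.e. it writes every page,
            // so with a first-touch policy the pages should end up on the NUMA node
            // this thread is running on - assuming the scheduler keeps it there.
            data.resize(n, 0.0);
            // ... NUMA-local work on data ...
        });
        worker.join();
    }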

Assuming we're talking about Intel CPUs: on their Nehalem-vintage CPUs, if you had two such CPUs, there was a power-on option for telling them how to divide up physical memory between them. The physical architecture is two CPUs connected by QPI, with each CPU controlling its own set of memory SIMMs. The options were:
first half of the physical address space on one CPU, second half on the next, or
alternating memory pages between the CPUs.
For the first option, if you allocated a piece of memory it'd be down to the OS where it would take that from in the physical address space, and then I suppose a good scheduler would endeavour to run threads accessing that memory on the CPU that's controlling it. For the second option, if you allocated several pages of memory then that'd be split between the two physical CPUs, and then it wouldn't really matter what the scheduler did with threads accessing it. I actually played around with this briefly, and couldn't really spot the difference; Intel had done a good job on the QPI. I'm less familiar with newer Intel architectures, but I'm assuming that it's more of the same.
The other question really is what you mean by a NUMA node. If we are referring to modern Intel and AMD CPUs, these present a synthesized SMP environment to software, using things like QPI / Hypertransport (and now their modern equivalents) to do so on top of a NUMA hardware architecture. So when talking about NUMA locality, it's really a case of whether or not the OS scheduler will run the thread on a core of the CPU that controls the RAM the thread is accessing (SMP meaning that it can be run on any core and still access, though perhaps with slight latency differences, the memory no matter where in physical memory it was allocated). I don't know the answer, but I think that some will do that. Certainly, endeavours I've made to use core affinity for threads and memory have yielded only a tiny improvement over just letting the OS (Linux 2.6) do its thing. And the cache systems on modern CPUs and their interaction with inter-CPU interconnects like QPI are very clever.
Older OSes dating back to when SMP really was pure hardware SMP wouldn't know to do that.
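For reference, the core-affinity pinning mentioned above looks roughly like this on Linux (a sketch; the helper name is illustrative and error handling is omitted):

    // pthread_setaffinity_np() is a GNU extension (glibc).
    #include <pthread.h>
    #include <sched.h>

    // Ask the scheduler to keep the calling thread on the given core.
    void pin_current_thread_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }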
Small rabbit hole - if we are referring to a pure NUMA system (Transputers, or the Cell processor out of the PS3 with its SPEs), then a thread would be running on a specific core and would be able to access only that core's memory; to access data allocated (by another thread) in another core's memory, the software has to sort that out itself by sending data across some interconnect. This is much harder to code for until you've learned how, but the results can be impressively fast. It took Intel about 10 years to match the Cell processor for raw maths speed.

Related

Multi-threaded C++ code slower with more physical cores? (threaded C++ mex function)

I am running multi-threaded C++ code on different machines right now. I am using it within a MATLAB mex function, so the overall program is run from MATLAB. I used the code in this link here, only changing what is done in "main_loop" to fit my task. The code runs perfectly fine on two of my computers and is many times faster than running the same C++ code as a single thread. So I think that the program itself is fine.
However, when I run the same thing on a third machine, it is suddenly extremely slow. The single-threaded version is fine, but the multi-threaded one takes 10-15 times longer. Now, since everything seems fine on the other computers, my guess is that it has something to do with the specs of the third machine (details below). My guess: the third computer has two physical processors. I guess this requires everything to be physically copied to both processors? (The original code is intentionally written such that no hard copy of any involved variable is required.) If so, is there a way to control on which processor the threads are opened? (It would already help if I could just limit myself to one CPU and avoid copying everything.) I already tried setting the number of threads down to 2, which did not help.
Specs of 2-CPU computer:
Intel Xeon Silver 4210R, 2.40 GHz (x2), 128 GB RAM, 64-bit, Windows 10 Pro
Specs of other computers:
Intel Core i7-8700, 3.2 GHz, 64 GB RAM, 64-bit, Windows 10 Pro
Intel Core i7-10750H, 2.6 GHz, 16 GB RAM, 64-bit, Windows 10 Pro, laptop
TL;DR: NUMA effects combined with false sharing are very likely to produce the observed effect only on the 2-socket system. Low-level profiling information can confirm/disprove this hypothesis.
Multi-processor systems are subject to NUMA effects. Non-uniform memory access platforms are composed of NUMA nodes which have their own local memory. Accessing the memory of another node is more expensive (higher latency and/or lower throughput). Multiple threads/processes located on multiple NUMA nodes accessing the same NUMA node's memory can saturate it.
Allocated memory is split into pages that are mapped to NUMA nodes. The exact mapping policy depends heavily on the operating system (OS), its configuration, and that of the target process. The first-touch policy is quite common: the idea is to allocate the page on the NUMA node of the core performing the first access to the target page. Depending on the chosen policy, the OS can also migrate pages from one NUMA node to another based on the amount of remote NUMA node accesses. Controlling the policy is critical on NUMA platforms, especially if the application is not NUMA-aware.
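(For completeness: the allocation policy can also be requested explicitly instead of being left to first touch. On Windows, a sketch using VirtualAllocExNuma, with an illustrative size and node number and no error handling, might look like this.)

    #include <windows.h>

    int main() {
        const SIZE_T bytes = 1 << 20;  // illustrative size

        // Ask for pages preferably on NUMA node 0 instead of relying on first touch.
        void* p = VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                     0 /* preferred NUMA node */);

        // ... use p ...

        VirtualFreeEx(GetCurrentProcess(), p, 0, MEM_RELEASE);
        return 0;
    }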
The memory of multiple NUMA nodes is kept coherent thanks to a cache coherence protocol and a high-performance inter-processor interconnect (Ultra Path Interconnect in your case). Cache coherence also applies between cores of the same processor. The catch is that moving a cache line from (the L2 cache of) one core to another (L2 cache) on the same processor is much faster than moving it from (the L3 cache of) one processor to (the L3 cache of) another. Here is an analogy with human communication: neurons in different cortical areas of the same brain communicate faster than two separate people do.
If threads of your application operate in parallel on the same cache line, false sharing can cause a cache-line bouncing effect which is much more visible between threads spread across different processors.
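As an illustration, the usual mitigation is to pad each thread's data out to its own cache line (64 bytes is assumed below); a sketch:

    #include <atomic>
    #include <thread>
    #include <vector>

    // Pad each counter to a full (assumed) 64-byte cache line so that threads
    // incrementing different counters do not bounce the same line between cores/processors.
    struct PaddedCounter {
        std::atomic<long> value{0};
        char pad[64 - sizeof(std::atomic<long>)];
    };

    int main() {
        unsigned n = std::thread::hardware_concurrency();
        if (n == 0) n = 4;  // hardware_concurrency() may return 0

        std::vector<PaddedCounter> counters(n);
        std::vector<std::thread> threads;
        for (unsigned i = 0; i < n; ++i)
            threads.emplace_back([&counters, i] {
                for (int k = 0; k < 1000000; ++k)
                    counters[i].value.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& t : threads) t.join();
    }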
This is a very complex topic. That being said, you can analyse these effects using low-level profilers like VTune (or perf on Linux). The idea is to analyse low-level hardware performance counters such as L2/L3 cache misses/hits, RFOs, remote NUMA accesses, etc. This can be complex and tedious for someone not familiar with how processors and the OS work, but VTune helps a bit. Note that Intel has some more specific tools to analyse (more easily) such effects, which typically show up in parallel applications; AFAIK, they are part of the Intel XE set of applications (which is not free). The best thing to do is to avoid false sharing using padding, to design your application so that each thread operates on its own memory locations as much as possible (i.e. good locality), to control the NUMA allocation policy, and finally to bind threads/processes to cores (to avoid unexpected migrations).
Experimental benchmarks can also be used to quickly check whether NUMA effects and false sharing occur. For example, you can bind all the threads/processes to the same NUMA node and tell the OS to allocate pages on this NUMA node; this enables you to find issues related to NUMA effects. Another example is to bind two threads/processes to two different logical cores (i.e. hardware threads) of the same physical core, and then to different physical cores, to see whether performance is impacted; this helps you locate false-sharing issues. That being said, such experiments can be affected by many other effects that add noise and make the analysis pretty complex in practice for large applications. Thus, a low-level analysis based on hardware performance counters is better.
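On Windows (the asker's platform), binding a thread for such an experiment can be done with SetThreadAffinityMask; a sketch (error handling omitted, and systems with more than 64 logical processors, i.e. processor groups, are not handled):

    #include <windows.h>

    // Pin the calling thread to logical processor 0.
    void pin_current_thread_to_cpu0() {
        SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << 0);
    }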
Note that some processors, like AMD Zen ones, are composed of multiple sub-parts (called CCDs/CCXs) that can be seen as multiple NUMA nodes even though there is only one processor and one socket. Such architectures will certainly become more widespread in the future. In fact, Intel has also started to go in this direction with Sub-NUMA Clustering.

When should I prefer write-combined CUDA-allocated mapped host memory?

The cudaHostAlloc() API call has, among others, these flags:
cudaHostAllocMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().
cudaHostAllocWriteCombined: Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host->device transfers.
I could not quite understand when exactly I would prefer the "write-combined" option. I mean, it didn't say the transfer may be faster in just one direction, so why do they only recommend it for one direction? Also, which kinds of systems benefit from this "write-combining"?
I read this white paper, Section 4.7, and still could not get it. Ok, so reading by the CPU is inefficient; but what if other benefits offset this inefficiency? Or - if they cannot, why can't they?
An elucidation would be appreciated.
Write-combined memory allows the CPU to combine multiple narrow writes into fewer wider writes, thus increasing the efficiency of memory writes. If memory serves, WC memory was first introduced with the Intel PentiumPro around 1995 to speed up CPU writes into the frame buffer of video cards. I am not up to speed on which modern system platforms use or support this.
The efficiency of reads performed by the CPU is going to be the same for both cudaHostAllocMapped and cudaHostAllocWriteCombined. But because the latter allows more efficient writes by the CPU, it is recommended for "buffers that will be written by the CPU and read by the device", as stated by the quoted documentation.
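To make that concrete, here is a hedged sketch of the intended usage pattern, host-side only (error checking, device selection and the kernel that actually reads the buffer are omitted; sizes and names are illustrative):

    #include <cuda_runtime.h>
    #include <cstring>

    int main() {
        const size_t bytes = 1 << 20;

        // A staging buffer the CPU only writes and the device only reads.
        // (Older setups may need cudaSetDeviceFlags(cudaDeviceMapHost) first.)
        void* host_buf = nullptr;
        cudaHostAlloc(&host_buf, bytes,
                      cudaHostAllocMapped | cudaHostAllocWriteCombined);

        // CPU side: write-only access is fine for WC memory...
        std::memset(host_buf, 0xAB, bytes);
        // ...but avoid reading host_buf back on the CPU; that is the slow path.

        // Device side: either read it in a kernel through the mapped pointer...
        void* dev_ptr = nullptr;
        cudaHostGetDevicePointer(&dev_ptr, host_buf, 0);

        // ...or copy it to device memory with a regular host->device transfer.
        void* dev_buf = nullptr;
        cudaMalloc(&dev_buf, bytes);
        cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);

        cudaFree(dev_buf);
        cudaFreeHost(host_buf);
        return 0;
    }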

Where is mapped device memory located in virtual addressing when using Intel I/OAT?

When I use Intel I/OAT for zero-copy/zero-CPU-cycle DMA transfers through async_memcpy, where is device memory mapped to in virtual addressing: to a kernel buffer (kernel space) or to a user buffer (user space)?
And does it make any sense to use I/OAT on modern x86_64 CPUs (where a CPU core can access RAM quickly without going through the chipset's northbridge)?
http://www.intel.com/content/www/us/en/wireless-network/accel-technology.html
Given that the memory is physical memory, it can be any memory that the kernel can address, including both kernel buffers and user-space buffers. It does however have to be "pinned" or "locked", so that the memory doesn't get taken away (e.g. someone calling free on the memory should not release it back to the OS for reassignment to another process, because you could get very interesting effects if that happened). These are of course the same rules that apply to various other DMA accesses.
I doubt very much this helps in copying data structures for your average user-mode application. On the other hand, I don't believe Intel would put this sort of feature into the processor unless they thought it was beneficial in some way. The way I understand it is that it's helpful for copying the network receive buffer into the user-mode application that is receiving the data, with less CPU involvement. It doesn't necessarily speed up the actual memory transfer much (if at all), but it frees the CPU to do other things.
I'm pretty sure I saw something not so long ago about this technology [or something very similar] also going into the latest models of processors, so I expect there is some advantage to it.

Dual socket vs single socket memory model?

I am a bit confused about what memory looks like in a dual CPU machine from the perspective of a C/C++ program running on Linux.
Case 1 (understood)
With one quad-core HT CPU and 32 GB RAM, I can, in theory, write a single-process application using up to 8 threads and up to 32 GB RAM without going into swap or overloading the threading facilities - I am ignoring the OS and other processes here for simplicity.
Case 2 (confusion)
What happens with a dual quad-core HT CPU with 64GB RAM set up?
Development-wise, do you need to write an application to run as two processes (8 threads, 32GB each) that communicate or can you write it as one process (16 threads, 64GB full memory)?
If the answer is the former, what are some efficient modern strategies to utilize the entire hardware? shm? IPC? Also, how do you direct Linux to use a different CPU for each process?
From the application's viewpoint, the number of physical CPUs (dies) doesn't matter, only the number of virtual processors. These include all cores on all processors, doubled wherever hyperthreading is enabled on a core. Threads are scheduled on them in the same way. It doesn't matter if the cores are all on one die or spread across multiple dies.
In general, the best way to handle these things is to not. Don't worry about what's running on which core. Just spawn an appropriate number of threads for your application, (up to a theoretical maximum equal to the total number of cores in the system), and let the OS deal with the scheduling.
The memory is shared amongst all cores in the system, of course. But again, it's up to the OS to handle allocation of physical memory. Very few applications really need to worry about how much memory they use, and about divvying up that memory between threads. Let the OS handle that.
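A minimal sketch of "spawn an appropriate number of threads and let the OS schedule them":

    #include <thread>
    #include <vector>

    int main() {
        // All logical processors on all sockets, including hyperthreads.
        unsigned n = std::thread::hardware_concurrency();
        if (n == 0) n = 1;  // the call may return 0

        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([i] {
                (void)i;  // ... work for thread i ...
            });
        for (auto& t : workers) t.join();
    }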
The memory model has NOTHING to do with the number of cores per se; rather, it has to do with the architecture employed on multi-core computers. Most mainstream computers use a symmetric multiprocessing (SMP) model, wherein a single OS controls all the CPUs, and programs running on those CPUs have access to all the available memory. Each CPU does have private memory (cache), but the RAM is all shared. So if you have a 64-bit machine it makes zilch difference whether you write one process or two processes, AS FAR AS memory usage implications are concerned. Programming-wise, you would be better off using a single process.
As others pointed out, you do need to worry about thread affinities and such, but that has more to do with efficient use of CPU resources and little to do with RAM usage. There would be some cache-usage implications, though.
Contrast this with computers using other memory models, such as NUMA (Non-Uniform Memory Access), where each CPU has its own block of memory and communicating across CPUs requires some arbiter in between. On those computers you WOULD NEED to worry about where to place your threads, memory-wise.

memory access vs. memory copy

I am writing an application in C++ that needs to read-only from the same memory many times from many threads.
My question is: from a performance point of view, will it be better to copy the memory for each thread, or to give all threads the same pointer and have all of them access the same memory?
Thanks
There is no definitive answer from the little information you have given about your target system and so on, but on a normal PC, most likely the fastest will be to not copy.
One reason copying could be slow, is that it might result in cache misses if the data area is large. A normal PC would cache read-only access to the same data area very efficiently between threads, even if those threads happen to run on different cores.
One of the benefits explicitly listed by Intel for their approach to caching is "Allows more data-sharing opportunities for threads running on separate cores that are sharing cache". I.e. they encourage a practice where you don't have to program the threads to explicitly cache data, the CPU will do it for you.
Since you specifically mention many threads, I assume you have at least a multi-socket system. Typically, memory banks are associated with processor sockets. That is, one processor is "nearest" to its own memory banks and needs to communicate with the other processors' memory controllers to access data on other banks. (Processor here means the physical thing in the socket.)
When you allocate data, typically a first-write (first-touch) policy is used to determine on which memory banks your data will be placed, which means the processor that first writes it can access it faster than the other processors.
So, at least for multiple processors (not just multiple cores), there should be a performance improvement from allocating a copy for every processor. Be sure to allocate/copy the data within every processor/thread and not from a master thread (to exploit the first-write policy). You also need to make sure that threads will not migrate between processors, because then you are likely to lose the close connection to your memory.
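A sketch of that per-thread copy idea (the helper name is illustrative, and the pinning it relies on is omitted for brevity):

    #include <thread>
    #include <vector>

    // Each worker copies the read-only input inside its own thread, so with a
    // first-touch/first-write policy the copy's pages land on the NUMA node the
    // worker runs on. Pinning the thread keeps it near that memory afterwards.
    void run_workers(const std::vector<double>& shared_input, unsigned n_threads) {
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n_threads; ++i)
            workers.emplace_back([&shared_input] {
                std::vector<double> local_copy = shared_input;  // copy made (and touched) by this thread
                // ... read-only work on local_copy ...
            });
        for (auto& t : workers) t.join();
    }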
I am not sure how copying the data for every thread on a single processor would affect performance, but I would guess that not copying improves the ability to share the contents of the higher-level caches that are shared between cores.
In any case, benchmark and decide based on actual measurements.