When programming against a virtual machine, some systems report that a hardware virtualization extension is needed. What does that mean?
A hardware virtualization extension allows your CPU to maintain a second state which represents the virtual machine's state (for example in VMware). When VM code is scheduled to run, the processor switches to this "virtual" context and then works in this "sandbox". When the hypervisor executes guest code, it needs to emulate many hardware aspects, i.e. perform software virtualization of them. Hardware extensions allow this emulation to be done in hardware, which significantly reduces the virtualization overhead.
When a CPU is emulated (as a vCPU) by the hypervisor, the hypervisor has to translate the instructions meant for the vCPU into instructions for the physical CPU. As you can imagine, this has a massive performance impact. To overcome this, modern processors support virtualization extensions, such as Intel VT-x and AMD-V. These technologies allow a slice of the physical CPU to be directly mapped to the vCPU, so the instructions meant for the vCPU can be executed directly on that physical CPU slice.
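As a quick illustration, here is a minimal sketch of how you could check whether the CPU advertises these extensions from user space. It is an assumption-laden example: it uses GCC/Clang's <cpuid.h> on x86, the bit positions follow the vendor manuals, and firmware can still have the feature disabled even when the bit is set.

    // Query the CPUID feature bits that advertise hardware virtualization
    // support: Intel VT-x (VMX) and AMD-V (SVM).
    #include <cpuid.h>
    #include <cstdio>

    int main() {
        unsigned eax, ebx, ecx, edx;
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            std::printf("Intel VT-x (VMX) bit: %s\n", (ecx & (1u << 5)) ? "set" : "clear");
        if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
            std::printf("AMD-V (SVM) bit:      %s\n", (ecx & (1u << 2)) ? "set" : "clear");
        return 0;
    }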
I am running multi-threaded C++ code on different machines right now. I am using it within a Matlab mex function, so the overall program is run from MatLab. I used the code in this link here, and only changed what is done in "main_loop" to fit my task. The code runs perfectly fine on two of my computers and is many times faster than running the same C++ code single-threaded, so I think the program itself is fine.
However, when I run the same thing on a third machine, it is suddenly extremely slow. The single-threaded version is fine, but the multi-threaded one takes 10-15 times longer. Since everything seems fine on the other computers, my guess is that it has something to do with the specs of the third machine (details below). My guess: the third computer has two physical processors. Does this require everything to be physically copied to both processors? (The original code is intentionally written so that no hard copy of any involved variable is required.) If so, is there a way to control which processor the threads are opened on? (It would already help if I could just limit myself to one CPU and avoid copying everything.) I already tried setting the number of threads down to 2, which did not help.
Specs of 2-CPU computer:
Intel Xeon Silver 4210R, 2.40 GHz (x2), 128 GB RAM, 64-bit, Windows 10 Pro
Specs of other computers:
Intel Core i7-8700, 3.2 GHz, 64 GB RAM, 64-bit, Windows 10 Pro
Intel Core i7-10750H, 2.6 GHz, 16 GB RAM, 64-bit, Windows 10 Pro, laptop
TL;DR: NUMA effects combined with false sharing are very likely to produce the observed slowdown only on the 2-socket system. Use low-level profiling to confirm or disprove this hypothesis.
Multi-processor systems are subject to NUMA effects. Non-uniform memory access platforms are composed of NUMA nodes, each with its own local memory. Accessing the memory of another node is more expensive (higher latency and/or lower throughput), and multiple threads/processes located on several NUMA nodes that all access the same node's memory can saturate it.
Allocated memory is split into pages that are mapped to NUMA nodes. The exact mapping policy depends heavily on the operating system (OS), its configuration, and that of the target process. The first-touch policy is quite common: the idea is to allocate a page on the NUMA node of the thread performing the first access to that page. Depending on the chosen policy, the OS can also migrate pages from one NUMA node to another based on the amount of remote access they receive. Controlling the policy is critical on NUMA platforms, especially if the application is not NUMA-aware.
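To make the first-touch idea concrete, here is a minimal C++ sketch under the assumption that the OS maps each page to the NUMA node of the thread that first writes it (common, but not guaranteed). Each worker initializes the slice it will later process, so those pages end up on that worker's local node; the array size and thread count are made up for the example.

    #include <thread>
    #include <vector>
    #include <cstdlib>
    #include <cstddef>

    int main() {
        const std::size_t n = std::size_t(1) << 26;   // 64M doubles (~512 MiB)
        double* data = static_cast<double*>(std::malloc(n * sizeof(double)));
        const unsigned workers = 4;                   // illustrative thread count

        std::vector<std::thread> pool;
        for (unsigned t = 0; t < workers; ++t)
            pool.emplace_back([=] {
                std::size_t begin = t * n / workers, end = (t + 1) * n / workers;
                for (std::size_t i = begin; i < end; ++i)
                    data[i] = 0.0;                    // first touch happens here
            });
        for (auto& th : pool) th.join();

        std::free(data);
        return 0;
    }

A single-threaded initialization of the whole buffer would instead place every page on one node and force the other socket to perform remote accesses later.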
The memory of multiple NUMA nodes is kept coherent thanks to a cache coherence protocol and a high-performance inter-processor interconnect (Ultra Path Interconnect in your case). Cache coherence also applies between cores of the same processor, but the point is that moving a cache line from (the L2 cache of) one core to another (L2 cache) is much faster than moving it from (the L3 cache of) one processor to another (L3 cache). Here is a human analogy: neurons in different cortical areas of the same brain communicate faster than two people talking to each other.
If your application operates in parallel on the same cache line, false sharing can cause a cache-line bouncing effect which is much more visible between threads spread across different processors.
This is a very complex topic. That being said, you can analyse these effects using low-level profilers like VTune (or perf on Linux). The idea is to look at low-level hardware performance counters such as L2/L3 cache misses/hits, RFOs, remote NUMA accesses, etc. This can be complex and tedious for someone not familiar with how processors and operating systems work, but VTune helps a bit. Note that Intel has some more specific tools to analyse (more easily) such effects, which usually show up in parallel applications; AFAIK, they are part of the Intel XE set of applications (which is not free). The best thing to do is to avoid false sharing using padding, to design your application so that each thread operates on its own memory locations as much as possible (i.e. good locality), to control the NUMA allocation policy, and finally to bind threads/processes to cores (to avoid unexpected migrations).
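As a concrete illustration of the padding point, here is a minimal C++ sketch. It assumes a 64-byte cache line (typical on x86); the counters, thread count and iteration count are made up for the example.

    #include <atomic>
    #include <thread>
    #include <cstdio>

    // Without alignas(64), two adjacent counters can share one cache line and
    // bounce between cores (or, worse, between sockets). With it, each counter
    // owns a full cache line.
    struct Padded {
        alignas(64) std::atomic<long> value{0};
    };

    int main() {
        Padded counters[2];
        auto work = [&](int i) {
            for (long k = 0; k < 10000000; ++k)
                counters[i].value.fetch_add(1, std::memory_order_relaxed);
        };
        std::thread t0(work, 0), t1(work, 1);
        t0.join(); t1.join();
        std::printf("%ld %ld\n", counters[0].value.load(), counters[1].value.load());
        return 0;
    }

Removing the alignas(64) and re-running on the 2-socket machine is a quick way to see how large the cache-line bouncing penalty is there compared to the single-socket machines.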
Experimental benchmarks can also be used to quickly check whether NUMA effects or false sharing occur. For example, you can bind all the threads/processes to the same NUMA node and tell the OS to allocate pages on that node; this lets you find issues related to NUMA effects. Another example is to bind two threads/processes to two different logical cores (i.e. hardware threads) of the same physical core, and then to different physical cores, to see whether performance is impacted; this helps you locate false-sharing issues. That being said, such experiments can be affected by many other effects that add noise and make the analysis pretty complex in practice for large applications, so a low-level analysis based on hardware performance counters is better.
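Since your slow machine runs Windows, here is a minimal sketch of the binding part of such an experiment using the Win32 API. The mask and core index are assumptions (real core numbering depends on the machine), and on systems with more than 64 logical processors you would use SetThreadGroupAffinity instead.

    #include <windows.h>
    #include <cstdio>

    int main() {
        // Pin the calling thread to logical processor 0. If every worker does the
        // same with its own index on one socket, all threads (and their
        // first-touched pages) stay on a single NUMA node.
        DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << 0);
        if (previous == 0)
            std::printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
        else
            std::printf("Thread pinned; previous affinity mask was 0x%llx\n",
                        (unsigned long long)previous);
        return 0;
    }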
Note that some processors, like AMD Zen ones, are composed of multiple sub-parts (called CCDs/CCXs) that can be seen as multiple NUMA nodes even though there is only one processor and one socket. Such architectures will certainly become more widespread in the future. In fact, Intel has also started to go in this direction with Sub-NUMA Clustering.
I use mechanical simulation software that takes 5-10 hours to complete one simulation. My software licence is limited to 4 cores.
Specs of the machine that currently runs the software:
Windows 7 Pro
1x Xeon E5-2650 v2 2.60GHz (8-core)
32GB Ram
SSD
I'm trying to find a way to reduce as much as possible the time of simulation.
In virtualization, is it possible to take, for example, 2x physical 8-core 2.50 GHz CPUs and make a 4-core vCPU running at 10 GHz per core? Each virtual core would use 4 physical cores in this example. Is it possible to do this?
Any suggestion?
Is it possible to use an Amazon AWS EC2 server to do this?
Thank you!
Amazon EC2 is available in a variety of Instance Types. Each type has a set number of virtual CPUs, RAM, etc. Each vCPU is a hyperthread of an Intel Xeon core, so you might want to check your licensing to confirm how it defines a 'core'.
By choosing an Instance Type, you can control how many vCPUs you get, and you will know the type of processor being used. However, you cannot combine CPUs to make a faster CPU.
I am considering vectorizing some floor() calls using sse2 intrinsics, then measuring the performance gain. But ultimately the binary is going to be run on a virtual machine which I have no access to.
I don't really know how a VM works. Is a binary entirely executed on a software-emulated virtual CPU?
If not, supposing the VM runs on a CPU with SSE2, could the VM use its CPU's SSE2 instructions when executing an SSE2 instruction from my binary?
Could my vectorization be beneficial on the VM?
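For reference, the kind of SSE2-only floor I have in mind looks roughly like this sketch (it assumes the values fit in the int32 range, and the function name is mine):

    #include <emmintrin.h>  // SSE2 only, no SSE4.1 _mm_floor_pd
    #include <cstdio>

    // Floor of two packed doubles: truncate toward zero, then subtract 1 where
    // truncation rounded up (i.e. for negative non-integer inputs).
    static inline __m128d floor_sse2(__m128d x) {
        __m128d t    = _mm_cvtepi32_pd(_mm_cvttpd_epi32(x)); // truncate toward zero
        __m128d mask = _mm_cmpgt_pd(t, x);                   // set where truncation rounded up
        return _mm_sub_pd(t, _mm_and_pd(mask, _mm_set1_pd(1.0)));
    }

    int main() {
        double out[2];
        _mm_storeu_pd(out, floor_sse2(_mm_set_pd(-1.5, 2.7)));
        std::printf("%f %f\n", out[0], out[1]);  // expected: 2.000000 -2.000000
        return 0;
    }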
I don't really know how a VM works. Is a binary entirely executed on a software-emulated virtual CPU?
For serious purposes, no, because it's too slow. (But e.g. Bochs does; it can be useful for kernel debugging among other things)
The binary is executed "normally" as much as possible. This generally means any code that doesn't try to interact with the OS will be executed directly; code that does, for example system calls, is likely to require the involvement of the VM implementation.
If not, supposing the VM runs on a CPU with SSE2, could the VM use its CPU's SSE2 instructions when executing an SSE2 instruction from my binary?
Yes.
Could my vectorization be beneficial on the VM?
Yes.
It depends on the VM technology and the CPU capabilities. The first x86 VMs (like VMware on 32-bit machines) used recompilation: they scanned the VM's binary code for harmful instructions (like accessing raw memory or special registers) and replaced them with hypercalls.
Since SSE2 instructions are not harmful, they would just be left as-is, with no performance penalty added in the VM. Moreover, modern x86 CPUs use "hardware virtualization", which avoids recompilation: harmful instructions are caught by the CPU and generate a trap, but again, SSE2 instructions shouldn't trigger it.
There are of course full processor emulators like QEMU (not QEMU-KVM) or Bochs, but that's a different story. A Bochs-emulated CPU, for example, is about 1000 times slower than the host CPU.
If a program is compiled on a Xeon Phi coprocessor and contains instructions from the IMCI instruction set extension, is it possible to run it on a user machine with no Xeon Phi coprocessor?
If it is possible, will the performance be improved on the user machine compared to the same application compiled without IMCI instructions, for instance on a Core i7 processor?
In other words, to benefit from the increased performance of an Intel instruction set extension, is it necessary for the user machine to have a processor that supports this extension?
If a program is compiled on a Xeon Phi coprocessor and contains instructions from the IMCI instruction set extension, is it possible to run it on a user machine with no Xeon Phi coprocessor?
If your program uses IMCI, you need a processor (or coprocessor, the distinction is relative here) that supports those instructions.
This is true for every instruction you use.
Actually, I'm only aware of Intel Xeon Phi coprocessors supporting IMCI, so the answer is no.
If it is possible, will the performance be improved on the user machine compared to the same application compiled without IMCI instructions, for instance on a Core i7 processor?
In other words, to benefit from the increased performance of an Intel instruction set extension, is it necessary for the user machine to have a processor that supports this extension?
I'm not sure what you are asking here. You can't use an instruction set extension that is not supported by the target processor; this is as obvious as the fact that you cannot speak Russian with someone who doesn't understand Russian.
If you try to use unsupported instructions, the processor will raise a #UD exception signaling an unrecognized instruction. The program state cannot advance, since you cannot skip instructions in the program flow, and the application will be forced to stop.
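To make that concrete, the usual way to stay safe is runtime dispatch: probe the CPU once and only call the code path whose instructions it actually implements, so the unsupported path is never executed and no #UD is raised. A minimal sketch (GCC/Clang specific, since it relies on __builtin_cpu_supports; the kernel names are made up):

    #include <cstdio>

    static void kernel_avx512() { std::puts("running the AVX-512 build of the kernel"); }
    static void kernel_scalar() { std::puts("running the portable scalar fallback"); }

    int main() {
        if (__builtin_cpu_supports("avx512f"))
            kernel_avx512();   // only reached on CPUs that report AVX-512F
        else
            kernel_scalar();   // safe on any x86-64 CPU
        return 0;
    }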
The KNL microarchitecture of the Xeon Phi will support AVX-512, which is also supported by "mainstream" CPUs.
This question may be useful: Are there SIMD(SSE / AVX) instructions in the x86-compatible accelerators Intel Xeon Phi?
Also note that you should see the Xeon Phi (as it is now) as a coprocessor compatible with the IA32e architecture rather than as a member of the IA32e family.
If I'm using Oracle's virtualbox, and I assign more than one virtual core to the virtual machine, how are the actual cores assigned? Does it use both real cores in the virtual machine, or does it use something that emulates cores?
Your question is almost like asking: how does an operating system determine which core to run a given process/thread on? Your computer is making that type of decision all the time - it has far more processes/threads running than you have cores available. This specific answer is similar in nature, but it also depends on how the guest machine is configured and what support your hardware has available to accelerate the virtualization process, so this answer is certainly not definitive and I won't really touch on how the host schedules code to be executed. Let's examine two relatively simple cases:
The first would be a fully virtualized machine - this would be a machine with no or minimal acceleration enabled. The hardware presented to the guest is fully virtualized even though many CPU instructions are simply passed through and executed directly on the CPU. In cases like this, your guest VM more-or-less behaves like any process running on the host: The CPU resources are scheduled by the operating system (to be clear, the host in this case) and the processes/threads can be run on whatever cores they are allowed to. The default is typically any core that is available, though some optimizations may be present to try and keep a process on the same core to allow the L1/L2 caches to be more effective and minimize context switches. Typically you would only have a single CPU allocated to the guest operating system in these cases, and that would roughly translate to a single process running on the host.
In a slightly more complex scenario, a virtual machine is configured with all available CPU virtualization acceleration options. In Intel speak these are referred to as VT-x; for AMD it is AMD-V. These primarily support privileged instructions that would normally require some binary translation / trapping to keep the host and guest protected. As such, the host operating system loses a little bit of visibility. Include in that hardware-accelerated MMU support (such that memory page tables can be accessed directly without being shadowed by the virtualization software) and the visibility drops a little more. Ultimately though, it still largely behaves as in the first example: it is a process running on the host and is scheduled accordingly - only that you can think of a thread being allocated to run the instructions (or pass them through) for each virtual CPU.
It is worth noting that while you can (with the right hardware support) allocate more virtual cores to the guest than you have available, it isn't a good idea. Typically this will result in decreased performance as the guest potentially thrashes the CPU and can't properly schedule the resources that are being requested - even if the CPU is not fully taxed. I bring this up as a scenario that shares certain similarities with a multi-threaded program that spawns far more threads (that are actually busy) than there are idle CPU cores available to run them. Your performance will typically be worse than if you had used fewer threads to get the work done.
In the extreme case, VirtualBox even supports hot-plugging CPU resources, though only a few operating systems properly support it: Windows 2008 Data Center edition and certain Linux kernels. The same rules generally apply: one guest CPU core is treated as a process/thread on a logical core of the host, but it is really up to the host and the hardware itself to decide which logical core will be used for the virtual core.
With all that being said - your question of how VirtualBox actually assigns those resources... well, I haven't dug through the code, so I certainly can't answer definitively, but it has been my experience that it generally behaves as described. If you are really curious, you could experiment with tagging the VirtualBox VBoxSvc.exe and associated processes in Task Manager, choosing the "Set Affinity" option, limiting their execution to a single CPU, and seeing whether those settings are honored. Whether they are probably depends on what level of hardware assist you have available, as the guest probably isn't really running as part of those processes.
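For the curious, a programmatic version of that Task Manager experiment might look like the sketch below (Windows-only; the PID is whatever you read off Task Manager for the VirtualBox process, and restricting to CPU 0 is just for the experiment):

    #include <windows.h>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv) {
        if (argc < 2) { std::puts("usage: setaffinity <pid>"); return 1; }
        DWORD pid = static_cast<DWORD>(std::strtoul(argv[1], nullptr, 10));

        HANDLE h = OpenProcess(PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION,
                               FALSE, pid);
        if (!h) { std::printf("OpenProcess failed: %lu\n", GetLastError()); return 1; }

        if (!SetProcessAffinityMask(h, 0x1))   // restrict the process to logical CPU 0
            std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        else
            std::puts("Affinity restricted to CPU 0");

        CloseHandle(h);
        return 0;
    }

Whether the guest's vCPU threads actually follow that restriction is exactly the open question; comparing per-core CPU usage before and after gives a rough answer.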