In VirtualBox, what happens when you allocate more than one virtual core?

If I'm using Oracle's VirtualBox and I assign more than one virtual core to the virtual machine, how are the actual cores assigned? Does it use both real cores in the virtual machine, or does it use something that emulates cores?

Your question is almost like asking: how does an operating system determine which core to run a given process/thread on? Your computer is making that type of decision all the time - it has far more processes/threads running than you have cores available. The answer here is similar in nature, but it also depends on how the guest machine is configured and what support your hardware has available to accelerate virtualization - so this answer is certainly not definitive, and I won't really touch on how the host schedules code for execution. Let's examine two relatively simple cases:
The first would be a fully virtualized machine - this would be a machine with no or minimal acceleration enabled. The hardware presented to the guest is fully virtualized even though many CPU instructions are simply passed through and executed directly on the CPU. In cases like this, your guest VM more-or-less behaves like any process running on the host: The CPU resources are scheduled by the operating system (to be clear, the host in this case) and the processes/threads can be run on whatever cores they are allowed to. The default is typically any core that is available, though some optimizations may be present to try and keep a process on the same core to allow the L1/L2 caches to be more effective and minimize context switches. Typically you would only have a single CPU allocated to the guest operating system in these cases, and that would roughly translate to a single process running on the host.
In a slightly more complex scenario, a virtual machine is configured with all available CPU virtualization acceleration options. In Intel terms this is VT-x; for AMD it is AMD-V. These primarily handle privileged instructions that would otherwise require binary translation or trapping to keep the host and guest protected. As such, the host operating system loses a little bit of visibility. Add to that hardware-accelerated MMU support (so that memory page tables can be accessed directly without being shadowed by the virtualization software) and the visibility drops a little more. Ultimately, though, it still largely behaves like the first example: it is a process running on the host and is scheduled accordingly - only now you can think of one thread being allocated to run (or pass through) the instructions for each virtual CPU.
It is worth noting that while you can (with the right hardware support) allocate more virtual cores to the guest than you physically have available, it isn't a good idea. Typically this results in decreased performance, as the guest potentially thrashes the CPU and can't properly schedule the resources being requested - even if the CPU is not fully taxed. I bring this up because it shares certain similarities with a multi-threaded program that spawns far more threads (that are actually busy) than there are idle CPU cores available to run them. Your performance will typically be worse than if you had used fewer threads to get the work done.
In the extreme case, VirtualBox even supports hot-plugging CPU resources - though only a few operating systems properly support it: Windows Server 2008 Datacenter edition and certain Linux kernels. The same rules generally apply: one guest CPU core is treated as a process/thread on a logical core of the host, but it is really up to the host and the hardware itself to decide which logical core will be used for the virtual core.
With all that being said - as for how VirtualBox actually assigns those resources... well, I haven't dug through the code, so I certainly can't answer definitively, but it has been my experience that it generally behaves as described. If you are really curious, you could experiment by selecting the VirtualBox VBoxSVC.exe and associated processes in Task Manager, choosing the "Set Affinity" option, limiting their execution to a single CPU, and seeing whether those settings are honored. Whether the host honors them probably depends on what level of hardware assist you have available, as the guest's virtual CPUs probably aren't really running as part of those processes.
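As a rough programmatic version of that experiment, here is a hedged Win32 sketch (the PID comes from the command line and the single-CPU mask is just an example; whether the guest's vCPU threads are actually affected depends on the caveats above). It mirrors Task Manager's "Set Affinity" option:

    #include <windows.h>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv)
    {
        if (argc < 2) { std::printf("usage: pin <pid>\n"); return 1; }
        DWORD pid = static_cast<DWORD>(std::strtoul(argv[1], nullptr, 10));

        // PROCESS_SET_INFORMATION is required to change another process's affinity.
        HANDLE process = OpenProcess(PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION,
                                     FALSE, pid);
        if (!process) { std::printf("OpenProcess failed: %lu\n", GetLastError()); return 1; }

        // Mask 0x1 = logical CPU 0 only (example); every thread of the target
        // process is then scheduled on that CPU, if the host honors it.
        if (!SetProcessAffinityMask(process, 0x1))
            std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());

        CloseHandle(process);
        return 0;
    }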

Related

Multi-threaded C++ code slower with more physical cores? (threaded C++ mex function)

I am running multi-threaded C++ code on different machines right now. I am using it within a MATLAB MEX function, so the overall program is run from MATLAB. I used the code in this link here, changing only what is done in "main_loop" to fit my task. The code runs perfectly fine on two of my computers and is many times faster than running the same C++ code single-threaded, so I think the program itself is fine.
However, when I run the same thing on a third machine, it is suddenly extremely slow. The single-threaded version is fine, but the multi-threaded one takes 10-15 times longer. Now, since everything seems fine on the other computers, my guess is that it has something to do with the specs of the third machine (details below). My guess: the third computer has two physical processors. Does this require copying everything physically to both processors? (The original code is intentionally written so that no hard copy of any involved variable is required.) If so, is there a way to control which processor the threads are created on? (It would already help if I could just limit myself to one CPU and avoid copying everything.) I already tried setting the number of threads down to 2, which did not help.
Specs of the 2-CPU computer:
Intel Xeon Silver 4210R, 2.40 GHz (x2), 128 GB RAM, 64-bit Windows 10 Pro
Specs of the other computers:
Intel Core i7-8700, 3.2 GHz, 64 GB RAM, 64-bit Windows 10 Pro
Intel Core i7-10750H, 2.6 GHz, 16 GB RAM, 64-bit Windows 10 Pro, laptop
TL;DR: NUMA effects combined with false sharing are very likely to produce the observed slowdown only on the 2-socket system. Low-level profiling can confirm or disprove this hypothesis.
Multi-processor systems are subject to NUMA effects. Non-uniform memory access platforms are composed of NUMA nodes which have their own local memory. Accessing the memory of another node is more expensive (higher latency and/or lower throughput). Multiple threads/processes located on several NUMA nodes that access the same NUMA node's memory can saturate it.
Allocated memory is split into pages that are mapped to NUMA nodes. The exact mapping policy depends heavily on the operating system (OS), its configuration, and that of the target processes. The first-touch policy is quite common: the page is allocated on the NUMA node that performs the first access to it. Depending on the chosen policy, the OS can also migrate pages from one NUMA node to another based on the amount of remote access. Controlling the policy is critical on NUMA platforms, especially if the application is not NUMA-aware.
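As one hedged, Windows-flavoured illustration of controlling placement explicitly (the node number and allocation size are arbitrary examples, not something from the question), VirtualAllocExNuma lets you ask for memory preferentially backed by a given NUMA node:

    #include <windows.h>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        const SIZE_T size = 64ull * 1024 * 1024;   // 64 MiB (example)

        // Request memory preferentially backed by NUMA node 0 (example node).
        void* buffer = VirtualAllocExNuma(GetCurrentProcess(), nullptr, size,
                                          MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                          /*nndPreferred=*/0);
        if (!buffer) { std::printf("VirtualAllocExNuma failed: %lu\n", GetLastError()); return 1; }

        // Touch the pages so they are actually faulted in on the preferred node.
        std::memset(buffer, 0, size);

        VirtualFree(buffer, 0, MEM_RELEASE);
        return 0;
    }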
The memory of multiple NUMA nodes is kept coherent thanks to a cache coherence protocol and a high-performance inter-processor interconnect (Ultra Path Interconnect in your case). Cache coherence also applies between cores of the same processor. The thing is, moving a cache line from (the L2 cache of) one core to another (L2 cache) is much faster than moving it from (the L3 cache of) one processor to another (L3 cache). Here is an analogy: neurons in different cortical areas of one brain communicate faster than two humans talking to each other.
If your application's threads operate in parallel on the same cache line, false sharing can cause a cache-line bouncing effect which is much more visible between threads spread across different processors.
This is a very complex topic. That being said, you can analyse these effects using low-level profilers like VTune (or perf on Linux). The idea is to analyse low-level hardware performance counters like L2/L3 cache misses/hits, RFOs, remote NUMA accesses, etc. This can be complex and tedious for someone not familiar with how processors and the OS work, but VTune helps a bit. Note that Intel has some more specific tools to analyse (more easily) such effects, which usually show up in parallel applications; AFAIK they are part of the Intel XE set of applications (which is not free). The best approach is to avoid false sharing using padding (a minimal sketch follows), to design your application so that each thread operates on its own memory locations as much as possible (i.e. good locality), to control the NUMA allocation policy, and finally to bind threads/processes to cores (to avoid unexpected migrations).
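Here is a minimal sketch of the padding idea (the counter layout and iteration count are arbitrary): each thread gets its own cache-line-aligned slot, so one thread's writes never invalidate the line another thread is using.

    #include <algorithm>
    #include <cstdint>
    #include <thread>
    #include <vector>

    // alignas(64) keeps each counter on its own cache line (64 bytes is the
    // usual x86 line size), which prevents false sharing between the threads.
    struct alignas(64) PaddedCounter {
        std::uint64_t value = 0;
    };

    int main()
    {
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<PaddedCounter> counters(n);
        std::vector<std::thread> workers;

        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([&counters, i] {
                for (std::uint64_t k = 0; k < 100'000'000ull; ++k)
                    counters[i].value++;        // private, padded slot
            });

        for (auto& t : workers)
            t.join();
        return 0;
    }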
Experimental benchmarks can also be used to quickly check whether NUMA effects or false sharing occur. For example, you can bind all the threads/processes to the same NUMA node and tell the OS to allocate pages on that NUMA node; this enables you to find issues related to NUMA effects. Another example is to bind two threads/processes to two different logical cores (i.e. hardware threads) of the same physical core, and then to different physical cores, to see whether performance is impacted; this helps you locate false-sharing issues (see the pinning sketch below). That being said, such experiments can be affected by many other effects that add noise and make the analysis pretty complex in practice for large applications. Thus, a low-level analysis based on hardware performance counters is better.
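And a hedged sketch of the pinning experiment on Windows (the affinity masks are placeholders - check your machine's logical-CPU numbering, e.g. with coreinfo, before trusting them): two threads hammer the same atomic counter while pinned either to two hardware threads of one core or to cores that are presumably on different sockets, and the run times are compared.

    #include <windows.h>
    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    std::atomic<std::uint64_t> shared_counter{0};

    void work()
    {
        for (int i = 0; i < 50'000'000; ++i)
            shared_counter.fetch_add(1, std::memory_order_relaxed);
    }

    void run_pinned(DWORD_PTR mask_a, DWORD_PTR mask_b)
    {
        std::thread a([mask_a] { SetThreadAffinityMask(GetCurrentThread(), mask_a); work(); });
        std::thread b([mask_b] { SetThreadAffinityMask(GetCurrentThread(), mask_b); work(); });
        a.join();
        b.join();
    }

    int main()
    {
        using clock = std::chrono::steady_clock;
        using ms = std::chrono::milliseconds;

        auto t0 = clock::now();
        run_pinned(0x1, 0x2);                     // assumed: two hardware threads of one core
        auto t1 = clock::now();
        run_pinned(0x1, DWORD_PTR(1) << 20);      // assumed: cores on different sockets
        auto t2 = clock::now();

        std::printf("same core:        %lld ms\n",
                    (long long)std::chrono::duration_cast<ms>(t1 - t0).count());
        std::printf("different socket: %lld ms\n",
                    (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
        return 0;
    }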
Note that some processors, like AMD Zen ones, are composed of multiple sub-parts (called CCDs/CCXs) that can be seen as multiple NUMA nodes even though there is only one processor and one socket. Such architectures will certainly become more widespread in the future. In fact, Intel has also started to go in this direction with Sub-NUMA Clustering.

Artificially reducing CPU speed or number of cores

I have a pretty beefy development machine that I am using to develop and test my project. I set up a situation that should load the CPU pretty heavily and checked performance monitor to see that I was using 130% CPU. I want to know how my program would behave on a machine with fewer cores. Is there a way to throttle the number of CPUs that the OS will allocate to my program? Or can I reduce the effective clock speed somehow?
You can run your program inside VirtualBox. Most virtualization products let you change how many resources the OS inside the VM can access, such as the number of cores/threads and the amount of memory.
As far as I know, with C++ alone you can only switch between single-threaded and multi-threaded execution; for anything finer-grained you need to look at OS-specific facilities (see the sketch below).
Another way is to run multiple instances of your software at the same time.
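Building on the OS-specific route mentioned above, here is a hedged Windows sketch (the mask 0x3, i.e. logical CPUs 0 and 1, is just an example): restricting the current process's affinity at startup roughly simulates a machine with fewer cores, although it does not change the clock speed.

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        // Restrict this process to logical CPUs 0 and 1 (mask 0x3 is an example).
        if (!SetProcessAffinityMask(GetCurrentProcess(), 0x3))
            std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());

        // ... run the normal multi-threaded workload here; the scheduler will
        // only use the CPUs in the mask (clock speed is unchanged) ...
        return 0;
    }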

Can I somehow protect the whole Windows machine from stalling in case my 64-bit program consumes all memory? [duplicate]

Is there any way to set a system-wide limit on the memory a process can use in Windows XP? I have a couple of unstable apps which work fine most of the time but can hit a bug which results in eating the whole memory in a matter of seconds (or at least I suppose that's what happens). This results in a hard reset, as Windows becomes totally unresponsive and I lose my work.
I would like to be able to do something like the /etc/limits on Linux - setting M90, for instance (to set 90% max memory for a single user to allocate). So the system gets the remaining 10% no matter what.
Use Windows Job Objects. Jobs are like process groups and can limit memory usage and process priority.
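For illustration, a hedged Win32 sketch of that approach (the executable name "unstable_app.exe" and the 512 MiB cap are placeholders): create a job with a per-process committed-memory limit and start the unstable application inside it, so a runaway allocation fails inside the job instead of exhausting the machine.

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        HANDLE job = CreateJobObjectA(nullptr, nullptr);
        if (!job) { std::printf("CreateJobObject failed: %lu\n", GetLastError()); return 1; }

        // Cap committed memory per process in the job at 512 MiB (example figure).
        JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits = {};
        limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
        limits.ProcessMemoryLimit = 512ull * 1024 * 1024;
        if (!SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                     &limits, sizeof(limits)))
        { std::printf("SetInformationJobObject failed: %lu\n", GetLastError()); return 1; }

        // Start the application suspended, put it in the job, then let it run.
        char cmd[] = "unstable_app.exe";       // placeholder command line
        STARTUPINFOA si = { sizeof(si) };
        PROCESS_INFORMATION pi = {};
        if (CreateProcessA(nullptr, cmd, nullptr, nullptr, FALSE, CREATE_SUSPENDED,
                           nullptr, nullptr, &si, &pi))
        {
            AssignProcessToJobObject(job, pi.hProcess);  // commits beyond the cap now fail
            ResumeThread(pi.hThread);
            CloseHandle(pi.hThread);
            CloseHandle(pi.hProcess);
        }
        CloseHandle(job);
        return 0;
    }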
Use the Application Verifier (AppVerifier) tool from Microsoft.
In my case I need to simulate memory no longer being available so I did the following in the tool:
Added my application
Unchecked Basic
Checked Low Resource Simulation
Changed TimeOut to 120000 - my application will run normally for 2 minutes before anything goes into effect.
Changed HeapAlloc to 100 - 100% chance of heap allocation error
Set Stacks to true - the stack will not be able to grow any larger
Save
Start my application
After 2 minutes my program could no longer allocate new memory and I was able to see how everything was handled.
Depending on your applications, it might be easier to limit the memory the language interpreter uses. For example with Java you can set the amount of RAM the JVM will be allocated.
Otherwise it is possible to set it once for each process with the Windows API:
SetProcessWorkingSetSize Function
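A minimal, hedged sketch of that call (the sizes are arbitrary examples; note that this caps resident physical pages rather than total committed memory, and the maximum is a soft limit unless the Ex variant with a hard-limit flag is used):

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        const SIZE_T min_ws = 16ull * 1024 * 1024;    // 16 MiB  (example figures)
        const SIZE_T max_ws = 256ull * 1024 * 1024;   // 256 MiB

        if (!SetProcessWorkingSetSize(GetCurrentProcess(), min_ws, max_ws))
            std::printf("SetProcessWorkingSetSize failed: %lu\n", GetLastError());

        // Pages above the cap are trimmed to the pagefile rather than
        // causing allocations to fail.
        return 0;
    }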
No way to do this that I know of, although I'm very curious to read if anyone has a good answer. I have been thinking about adding something like this to one of the apps my company builds, but have found no good way to do it.
The one thing I can think of (although not directly on point) is that I believe you can limit the total memory usage for a COM+ application in Windows. It would require the app to be written to run in COM+, of course, but it's the closest way I know of.
The working set stuff is good (Job Objects also control working sets), but that's not total memory usage, only real memory usage (paged in) at any one time. It may work for what you want, but afaik it doesn't limit total allocated memory.
Per process limits
From an end-user perspective, there are some helpful answers (and comments) at the superuser question “Is it possible to limit the memory usage of a particular process on Windows”, including discussions of how to set recursive quota limits on any or all of:
CPU assignment (quantity, affinity, NUMA groups),
CPU usage,
RAM usage (both ‘committed’ and ‘working set’), and
network usage,
… mostly via the built-in Windows ‘Job Objects’ system (as mentioned in Adam Mitz’s answer and Stephen Martin’s comment above), using:
the registry (for persistence, when desired) or
free tools, such as the open-source Process Governor.
(Note: nested Job Objects ~may~ not have been available under all earlier versions of Windows, but the un-nested version appears to date back to Windows XP)
Per-user limits
As far as overall per-user quotas:
??
It is possible that each user session is automatically assigned to a job group itself; if true, per-user limits should be able to be applied to that job group. Update: nope; Job Objects can only be nested at the time they are created or associated with a specific process, and in some cases a child Job Object is allowed to ‘break free’ from its parent and become independent, so they can’t facilitate ‘per-user’ resource limits.
(NTFS does support per-user file system ~storage~ quotas, though)
Per-system limits
Besides simple BIOS or ‘energy profile’ restrictions:
VM hypervisor or Kubernetes-style container resource limit controls may be the most straightforward (in terms of end-user understandability, at least) option.
Footnotes, regarding per-process and other resource quotas / QoS for non-Windows systems:
‘Classic’ Mac OS (including ‘classic’ applications running on 2000s-era versions of Mac OS X): per-application memory limits can be easily set within the ‘Memory’ section of the Finder ‘Get Info’ window for the target program; as a system using a cooperative multitasking concurrency model, per-process CPU limits were impossible.
BSD: ? (probably has some overlap with linux and non-proprietary macOS methods?)
macOS (aka ‘Mac OS X’): no user-facing interface; system support includes, depending on version, the ‘Multiprocessing Services API’, Grand Central Dispatch, POSIX threads / pthread, ‘operation objects’, and possibly others.
Linux: ‘Resource Manager’/limits.conf, control groups/‘cgroups’, process priority/‘niceness’/renice, others?
IBM z/OS and other mainframe-style systems: resource controls / allocation was built-in from nearly the beginning

Locking a process to Cuda core

I'm just getting into GPU processing.
I was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
For example, you may have a small C program that performs an image filter on an index of images. Can you have that program running on each CUDA core so that it essentially runs forever - reading/writing from its own memory to system memory and disk?
If this is possible, what are the implications for CPU performance - can we totally offset CPU usage or does the CPU still need to have some input/output?
My semantics here are probably way off. I apologize if what I've said requires some interpretation. I'm not that used to GPU stuff yet.
Thanks.
All of my comments here should be prefaced with "at the moment". Technology is constantly evolving.
was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
process is mostly a (host) operating system term. CUDA doesn't define a process separately from the host operating system definition of it, AFAIK. CUDA threadblocks, once launched on a Streaming Multiprocessor (or SM, a hardware execution resource component inside a GPU), in many cases will stay on that SM for their "lifetime", and the SM includes an array of "CUDA cores" (a bit of a loose or conceptual term). However, there is at least one documented exception today to this in the case of CUDA Dynamic Parallelism, so in the most general sense, it is not possible to "lock" a CUDA thread of execution to a CUDA core (using core here to refer to that thread of execution forever remaining on a given warp lane within a SM).
Can you have that program running on each CUDA core that essentially runs forever
You can have a CUDA program that runs essentially forever. It is a recognized programming technique sometimes referred to as persistent threads. Such a program will naturally occupy/require one or more CUDA cores (again, using the term loosely). As already stated, that may or may not imply that the program permanently occupies a particular set of physical execution resources.
reading/writing from it's own memory to system memory
Yes, that's possible, extending the train of thought. Writing to its own memory is obviously possible, by definition, and writing to system memory is possible via the zero-copy mechanism (slides 21/22), given a reasonable assumption of appropriate setup activity for this mechanism (a short sketch appears at the end of this answer).
and disk?
No, that's not directly possible today, without host system interaction, and/or without a significant assumption of atypical external resources such as a disk controller of some sort connected via a GPUDirect interface (with a lot of additional assumptions and unspecified framework). The GPUDirect exception requires so much additional framework, that I would say, for typical usage, the answer is "no", not without host system activity/intervention. The host system (normally) owns the disk drive, not the GPU.
If this is possible, what are the implications for CPU performance - can we totally offset CPU usage or does the CPU still need to have some input/output?
In my opinion, the CPU must still be considered. One consideration is if you need to write to disk. Even if you don't, most programs derive I/O from somewhere (e.g. MPI) and so the implication of a larger framework of some sort is there. Secondly, and relatedly, the persistent threads programming model usually implies a producer/consumer relationship, and a work queue. The GPU is on the processing side (consumer side) of the work queue, but something else (usually) is on the producer side, typically the host system CPU. Again, it could be another GPU, either locally or via MPI, that is on the producer side of the work queue, but that still usually implies an ultimate producer somewhere else (i.e. the need for system I/O).
Additionally:
Can CUDA threads send packets over a network?
This is like the disk question. These questions could be viewed in a general way, in which case the answer might be "yes". But restricting ourselves to formal definitions of what a CUDA thread can do, I believe the answer is more reasonably "no". CUDA provides no direct definitions for I/O interfaces to disk or network (or many other things, such as a display!). It's reasonable to conjecture or presume the existence of a lightweight host process that simply copies packets between a CUDA GPU and a network interface. With this presumption, the answer might be "yes" (and similarly for disk I/O). But without this presumption (and/or a related, perhaps more involved presumption of a GPUDirect framework), I think the most reasonable answer is "no". According to the CUDA programming model, there is no definition of how to access a disk or network resource directly.
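To make the persistent-threads and zero-copy points above concrete, here is a hedged CUDA C++ sketch (the flag/counter names are mine, error checking is largely omitted, and it assumes a device that supports mapped host memory). The kernel runs until the host tells it to stop, writing directly into pinned system memory the whole time:

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <cuda_runtime.h>

    __global__ void persistent_worker(volatile int* quit, volatile int* counter)
    {
        // One block, one thread for clarity; real persistent-thread kernels
        // normally poll a work queue here instead of bumping a single counter.
        while (*quit == 0)
            *counter = *counter + 1;     // plain write straight into pinned host memory
    }

    int main()
    {
        cudaSetDeviceFlags(cudaDeviceMapHost);            // enable zero-copy mapping

        int *quit_h, *counter_h, *quit_d, *counter_d;
        cudaHostAlloc(&quit_h,    sizeof(int), cudaHostAllocMapped);
        cudaHostAlloc(&counter_h, sizeof(int), cudaHostAllocMapped);
        *quit_h = 0;
        *counter_h = 0;
        cudaHostGetDevicePointer(&quit_d, quit_h, 0);
        cudaHostGetDevicePointer(&counter_d, counter_h, 0);

        persistent_worker<<<1, 1>>>(quit_d, counter_d);   // kernel keeps running...
        cudaStreamQuery(0);                               // nudge the driver to submit the launch

        // ...while the host (the producer / I/O side) does something else for a while.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));

        *quit_h = 1;                                      // ask the kernel to stop
        cudaDeviceSynchronize();

        std::printf("GPU wrote %d increments into host memory\n", *counter_h);
        cudaFreeHost(quit_h);
        cudaFreeHost(counter_h);
        return 0;
    }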

What does hardware virtualization extension mean?

When programming against a virtual machine, some systems remind you that a hardware virtualization extension is needed. What does it mean?
A hardware virtualization extension allows your computer to have a second state which represents the virtual machine state (used by hypervisors such as VMware). When VM code is scheduled to run, the processor switches to its "virtual" context and then works in this "sandbox". Without such support, when the hypervisor executes guest code it needs to emulate many hardware aspects - perform software virtualization of them. Hardware extensions allow that emulation to be done in hardware, which significantly reduces virtualization overhead.
When a CPU is emulated (vCPU) by the hypervisor, the hypervisor has to translate the instructions meant for the vCPU to the physical CPU. As you can imagine, this has a massive performance impact. To overcome this, modern processors support virtualization extensions such as Intel VT-x and AMD-V. These technologies provide the ability for a slice of the physical CPU to be directly mapped to the vCPU. Therefore the instructions meant for the vCPU can be directly executed on the physical CPU slice.
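As a hedged, concrete check of the above, this small sketch (MSVC intrinsics; on GCC/Clang the equivalent lives in <cpuid.h>) reads the documented CPUID feature bits for the two extensions. Note that firmware/BIOS settings can still disable the feature even when the bit is set:

    #include <intrin.h>
    #include <cstdio>

    int main()
    {
        int regs[4] = {};                 // EAX, EBX, ECX, EDX

        __cpuid(regs, 1);                 // standard feature leaf
        bool vmx = (regs[2] >> 5) & 1;    // ECX bit 5: Intel VT-x (VMX)

        __cpuid(regs, 0x80000001);        // extended feature leaf
        bool svm = (regs[2] >> 2) & 1;    // ECX bit 2: AMD-V (SVM)

        std::printf("Intel VT-x (VMX): %s\n", vmx ? "yes" : "no");
        std::printf("AMD-V (SVM):      %s\n", svm ? "yes" : "no");
        return 0;
    }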