sse2 vectorization and virtual machines - c++

I am considering vectorizing some floor() calls using SSE2 intrinsics and then measuring the performance gain. But ultimately the binary is going to be run on a virtual machine which I have no access to.
I don't really know how a VM works. Is a binary entirely executed on a software-emulated virtual CPU?
If not, supposing the VM runs on a CPU with SSE2, could the VM use its host CPU's SSE2 instructions when executing an SSE2 instruction from my binary?
Could my vectorization be beneficial on the VM?

I don't really know how a VM works. Is a binary entirely executed on a software-emulated virtual cpu?
For serious purposes, no, because it's too slow. (But e.g. Bochs does; it can be useful for kernel debugging among other things)
The binary is executed "normally" as much as possible. This generally means any code that doesn't try to interact with the OS will be executed directly. For example, system calls are likely to require the involvement of the VM implementation.
If not, supposing the VM runs on a CPU with SSE2, could the VM use its host CPU's SSE2 instructions when executing an SSE2 instruction from my binary?
Yes.
Could my vectorization be beneficial on the VM?
Yes.

Depends on the VM technology and CPU capabilities. The first x86 VMs (like VMware on 32-bit machines) used binary recompilation: they scanned the guest's binary code for harmful instructions (like accessing raw memory or special registers) and replaced them with hypercalls.
Since SSE2 instructions are not harmful, they would simply be left as is, with no performance penalty in the VM. Moreover, modern x86 CPUs provide "hardware virtualization", which avoids the need for recompilation. Harmful instructions are caught by the CPU and trap to the hypervisor, but again SSE2 instructions shouldn't trigger that.
There are of course full processor emulators like QEMU (not QEMU-KVM) or Bochs, but that's a different story. A Bochs-emulated CPU, for example, is about 1000 times slower than the host CPU.
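For context, here is a minimal sketch of the kind of SSE2 floor() vectorization the question describes. SSE2 itself has no packed floor instruction (roundps only arrived with SSE4.1), so this version truncates toward zero and then corrects negative non-integers. The names floor_sse2 and floor_array are hypothetical, and the sketch assumes every input has a magnitude below 2^31 so that it survives the round trip through a 32-bit integer.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cmath>
#include <cstddef>

// floor() of four floats using SSE2 only.
// Assumption: |x| < 2^31, otherwise the int32 conversion overflows.
inline __m128 floor_sse2(__m128 x) {
    // Truncate toward zero, then convert back to float.
    __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));
    // Lanes where truncation rounded up (negative non-integers) get all-ones.
    __m128 too_big = _mm_cmpgt_ps(t, x);
    // Subtract 1.0 only in those lanes.
    return _mm_sub_ps(t, _mm_and_ps(too_big, _mm_set1_ps(1.0f)));
}

// Hypothetical helper: applies floor to n floats, 4 at a time plus a scalar tail.
void floor_array(const float* in, float* out, std::size_t n) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, floor_sse2(_mm_loadu_ps(in + i)));
    for (; i < n; ++i)
        out[i] = std::floor(in[i]);
}
```

On a VM that uses hardware virtualization or binary translation, these instructions execute directly on the host CPU, so a speedup measured natively would be expected to carry over. Under a full emulator such as Bochs the picture can be quite different.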

Related

How is if statement executed in NVIDIA GPUs?

As far as I know, GPU cores are very simple and can only execute basic mathematical instructions.
If I have a kernel with an if statement, what executes that if statement? The FP32, FP64 and INT32 units can only execute operations on floats, doubles and integers, not a COMPARE instruction, or am I wrong? And what happens if I have a printf call in a kernel? What executes that?
Compare instructions are arithmetic instructions, you can implement a comparison with subtraction and a flag register, and GPGPUs have them.
But they are often not advertised as much as the number-crunching capability of the whole GPU.
NVIDIA doesn't publish the machine code documentation for their GPUs nor the ISA of the respective assembly (called SASS).
Instead, NVIDIA maintains the PTX language which is designed to be more portable across different generations while still being very close to the actual machine code.
PTX is a predicated architecture. The setp instruction (which again, is just a subtraction with a few caveats) sets the value of the defined predicate registers and these are used to conditionally execute other instructions. Including the bra instruction which is a branch, making it possible to execute conditional branches.
One could argue that PTX is not SASS but it seems the predicate architecture is what NVIDIA GPUs, at least, used to do.
AMD GPUs seem to use the traditional approach to branching: there are comparison instructions (e.g. S_CMP_EQ_U64) and conditional branches (e.g. S_CBRANCH_SCCZ).
Intel GPUs also rely on predication but have different instructions for divergent vs non-divergent branches.
So GPGPUs do have branch instructions, in fact, their SIMT model has to deal with the branch divergence problem.
Before c. 2006 GPUs were not fully programmable and programmers had to rely on other tricks (like data masking or branchless code) to implement their kernel.
Keep in mind that at the time it was not widely accepted that one could execute arbitrary programs or make arbitrary shading effects with GPUs. GPUs relaxed their programming constraints with time.
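To make the predication and data-masking ideas above concrete, here is a small CPU-side sketch in plain C++ (not CUDA, PTX, or SASS). It emulates a group of lanes that all step through both sides of an if/else, with a per-lane predicate deciding which results are kept. The lane count kLanes, the alias Warp, and the function run_predicated are hypothetical names chosen for illustration; real NVIDIA warps have 32 lanes.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 8;          // illustrative width only
using Warp = std::array<int, kLanes>;

// Emulates, for every lane i:
//     if (x[i] < 0) x[i] = -x[i]; else x[i] = 2 * x[i];
// without a per-lane jump: both sides run, a predicate selects the result.
Warp run_predicated(Warp x) {
    std::array<bool, kLanes> p{};

    // Step 1 (the "setp" idea): a comparison is a subtraction plus a sign test.
    for (std::size_t i = 0; i < kLanes; ++i)
        p[i] = (x[i] - 0) < 0;

    // Step 2: the "then" side, committed only where the predicate is true.
    for (std::size_t i = 0; i < kLanes; ++i)
        x[i] = p[i] ? -x[i] : x[i];

    // Step 3: the "else" side, committed only where the predicate is false.
    for (std::size_t i = 0; i < kLanes; ++i)
        x[i] = !p[i] ? 2 * x[i] : x[i];

    return x;
}
```

Every lane pays for both sides of the branch, which is the cost of divergence in a SIMT model: the more the lanes disagree, the more work is discarded.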
Putting a printf in a CUDA kernel probably won't work, because there is no C runtime on the GPU (remember the GPU is an entirely different executor from the CPU), and I would guess the linking fails.
You can theoretically force a GPU implementation of the CRT and design a mechanism to call syscalls from the GPU code but that would be unimaginably slow since GPUs are not designed for this kind of work.
EDIT: Apparently NVIDIA actually did implement a printf on the GPU that prints to a buffer shared with host.
The problem here is not the presence of branches but the very nature of printf.

What happens when I compile on machine that supports avx2 and run the binary on another machine that only supports avx?

I compiled my C++ program on a machine that supports AVX2 (Intel E5-2643 v3). It compiles and runs just fine. I confirmed that AVX2 instructions are used: after disassembling the binary, I saw AVX2 instructions such as vpbroadcastd.
Then I ran this binary on another machine that only has the AVX instruction set (Intel E5-2643 v2). It also runs fine. Does the binary fall back to a backward-compatible AVX instruction instead? What is this instruction? Do you see any potential issue?
There are multiple compilers and multiple settings you can use but the general principle is that usually a compiler is not targeting a particular processor, it's targeting an architecture, and by default it will usually have a fairly inclusive approach meaning the generated code will be compatible with as many processors as reasonable. You would normally expect an x86_64 compiler to generate code that runs without AVX2, indeed, that it should run on some of the earliest CPUs supporting the x86_64 instruction set.
If you have code that benefits greatly from extensions to the instruction set that aren't universally supported like AVX2, your aim when producing software is generally to degrade gracefully. For instance you could use runtime feature detection to see if the current processor supports AVX2 and run a separate code path. Some compilers may support automated ways of doing this or helpers to assist you in achieving this yourself.
It's not rare to have AVX2 instructions in a binary that uses CPU detection to make sure it only runs them on CPUs that support them. (e.g. via cpuid and setting function pointers).
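The function-pointer dispatch mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it targets GCC or Clang on x86, where __builtin_cpu_supports and the target function attribute are available, and the names sum_fn, sum_scalar, sum_avx2 and pick_sum are hypothetical.

```cpp
#include <cstddef>

using sum_fn = long long (*)(const int*, std::size_t);

// Baseline version: compiled for the default target, runs on any x86-64 CPU.
long long sum_scalar(const int* p, std::size_t n) {
    long long s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Same source, but the compiler is allowed to emit AVX2 instructions here.
// Calling it on a CPU without AVX2 could fault with an illegal instruction.
__attribute__((target("avx2")))
long long sum_avx2(const int* p, std::size_t n) {
    long long s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Runtime detection: pick the AVX2 version only if the CPU reports support.
sum_fn pick_sum() {
    __builtin_cpu_init();  // harmless if the feature data is already initialized
    return __builtin_cpu_supports("avx2") ? sum_avx2 : sum_scalar;
}
```

A caller would typically evaluate pick_sum() once at startup and keep the returned pointer, so the detection cost is not paid on every call.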
If an AVX2 instruction is actually executed on a CPU without AVX2 support, it raises #UD, so the OS delivers SIGILL (illegal instruction) to your process, or the Windows equivalent.
There are a few cases where an instruction like lzcnt decodes as rep bsr, which runs as bsr on CPUs without BMI1. (Giving a different answer). But VEX-coded AVX2 instructions just fault on older CPUs.

Is there a way to flush the entire CPU cache related to a program?

On x86-64 platforms, the CLFLUSH assembly instruction flushes the cache line corresponding to a given address. Instead of flushing the cache line for a specific address, is there a way to flush the entire cache (either the cache contents related to the program being executed, or the entire cache), for example by filling it with dummy contents (or any other approach I am not aware of):
using only standard C++17?
using standard C++17 and compiler intrinsics if necessary?
What would be the contents of the following function: (the function should work regardless of compiler optimizations)?
void flush_cache()
{
// Contents
}
For links to related questions about clearing caches (especially on x86), see the first answer on WBINVD instruction usage.
No, you cannot do this reliably or efficiently with pure ISO C++. It doesn't know or care about CPU caches. The best you could do is touch a lot of memory so everything else ends up getting evicted (see footnote 1), but this is not what you're really asking for. (Of course, flushing all cache is by definition inefficient...)
See Flushing the cache to prevent benchmarking fluctuations for some tips about implementation details if you go that route.
CPU cache management functions / intrinsics / asm instructions are implementation-specific extensions to the C++ language. But other than inline asm, no C or C++ implementations that I'm aware of provide a way to flush all cache, rather than a range of addresses. That's because it's not a normal thing to do.
On x86, for example, the asm instruction you're looking for is wbinvd. It writes back any dirty lines before evicting, unlike invd (which drops cache without write-back, useful when leaving cache-as-RAM mode). So in theory wbinvd has no architectural effect, only microarchitectural, but it's so slow that it's a privileged instruction. As Intel's insn ref manual entry for wbinvd points out, it will increase interrupt latency, because it is not itself interruptible and may have to wait for 8 MiB or more of dirty L3 cache to be flushed. i.e. delaying interrupts for that long can be considered an architectural effect, unlike most timing effects. It's also complicated on a multi-core system because it has to flush caches for all cores.
I don't think there's any way to use it in user-space (ring 3) on x86. Unlike cli / sti and in/out, it's not enabled by the IO-privilege level (which you can set on Linux with an iopl() system call). So wbinvd only works when actually running in ring 0 (i.e. in kernel code). See Privileged Instructions and CPU Ring Levels.
But if you're writing a kernel (or freestanding program that runs in ring0) in GNU C or C++, you could use asm("wbinvd" ::: "memory");. On a computer running actual DOS, normal programs run in real mode (which doesn't have any lower-privilege levels; everything is effectively kernel). That would be another way to run a microbenchmark that needs to run privileged instructions to avoid kernel<->userspace transition overhead for wbinvd, and also has the convenience of running under an OS so you can use a filesystem. Putting your microbenchmark into a Linux kernel module might be easier than booting FreeDOS from a USB stick or something, though. Especially if you want control of turbo frequency stuff.
The only reason I can think of that you might want this is for some kind of experiment to figure out how the internals of a specific CPU are designed. So the details of exactly how it's done are critical. It doesn't make sense to me to even want a portable / generic way to do this.
Or maybe in a kernel before reconfiguring physical memory layout, e.g. so there's now an MMIO region for an ethernet card where there used to be normal DRAM. But in that case your code is already totally arch-specific.
Normally when you want / need to flush caches for correctness reasons, you know which address range needs flushing. e.g. when writing drivers on architectures with DMA that isn't cache coherent, so write-back happens before a DMA read, and doesn't step on a DMA write. (And the eviction part is important for DMA reads, too: you don't want the old cached value). But x86 has cache-coherent DMA these days, because modern designs build the memory controller into the CPU die so system traffic can snoop L3 on the way from PCIe to memory.
The major case outside of drivers where you need to worry about caches is with JIT code-generation on non-x86 architectures with non-coherent instruction caches. If you (or a JIT library) write some machine code into a char[] buffer and cast it to a function pointer, architectures like ARM don't guarantee that code-fetch will "see" that newly-written data.
This is why gcc provides __builtin___clear_cache. It doesn't necessarily flush anything, only makes sure it's safe to execute that memory as code. x86 has instruction caches that are coherent with data caches and supports self-modifying code without any special syncing instructions. See godbolt for x86 and AArch64, and note that __builtin___clear_cache compiles to zero instructions for x86, but has an effect on surrounding code: without it, gcc can optimize away stores to a buffer before casting to a function pointer and calling. (It doesn't realize that data is being used as code, so it thinks they're dead stores and eliminates them.)
Despite the name, __builtin___clear_cache is totally unrelated to wbinvd. It needs an address range as args, so it's not going to flush and invalidate the entire cache. It also doesn't use clflush, clflushopt, or clwb to actually write back (and optionally evict) data from cache.
When you need to flush some cache for correctness, you only want to flush a range of addresses, not slow the system down by flushing all the caches.
It rarely if ever makes sense to intentionally flush caches for performance reasons, at least on x86. Sometimes you can use pollution-minimizing prefetch to read data without as much cache pollution, or use NT stores to write around cache. But doing "normal" stuff and then clflushopt after touching some memory for the last time is generally not worth it in normal cases. Like a store, it has to go all the way through the memory hierarchy to make sure it finds and flushes any copy of that line anywhere.
There isn't a light-weight instruction designed as a performance hint, like the opposite of _mm_prefetch.
The only cache-flushing you can do in user-space on x86 is with clflush / clflushopt. (Or with NT stores, which also evict the cache line if it was hot before hand). Or of course creating conflict evictions for known L1d size and associativity, like writing to multiple lines at multiples of 4kiB which all map to the same set in a 32k / 8-way L1d.
There's an Intel intrinsic _mm_clflush(void const *p) wrapper for clflush (and another for clflushopt), but these can only flush cache lines by (virtual) address. You could loop over all the cache lines in all the pages your process has mapped... (But that can only flush your own memory, not cache lines that are caching kernel data, like the kernel stack for your process or its task_struct, so the first system-call will still be faster than if you had flushed everything).
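A minimal sketch of that by-address approach, looping _mm_clflush over a buffer one cache line at a time, might look like this. The function name flush_range is hypothetical, and the 64-byte default line size is an assumption that holds on mainstream x86 CPUs but should really be queried rather than hard-coded.

```cpp
#include <emmintrin.h>  // _mm_clflush, _mm_mfence (SSE2)
#include <cassert>
#include <cstddef>
#include <cstdint>

// Flushes every cache line overlapping [p, p + bytes) from all cache levels.
// Only works on memory mapped into this process; kernel data is untouched.
void flush_range(const void* p, std::size_t bytes, std::size_t line = 64) {
    assert(line != 0 && (line & (line - 1)) == 0);  // must be a power of two
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t end = addr + bytes;
    addr &= ~(static_cast<std::uintptr_t>(line) - 1);  // align down to a line
    for (; addr < end; addr += line)
        _mm_clflush(reinterpret_cast<const void*>(addr));
    _mm_mfence();  // order the flushes with respect to later loads and stores
}
```

Flushing has no architectural effect, so the data in the buffer reads back unchanged afterwards; only the access latency changes.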
There's a Linux system call wrapper to portably evict a range of addresses: cacheflush(char *addr, int nbytes, int flags). Presumably the implementation on x86 uses clflush or clflushopt in a loop, if it's supported on x86 at all. The man page says it first appeared in MIPS Linux "but nowadays, Linux provides a cacheflush() system call on some other architectures, but with different arguments."
I don't think there's a Linux system call that exposes wbinvd, but you could write a kernel module that adds one.
Recent x86 extensions introduced more cache-control instructions, but still only by address to control specific cache lines. The use-case is for non-volatile memory attached directly to the CPU, such as Intel Optane DC Persistent Memory. If you want to commit to persistent storage without making the next read slow, you can use clwb. But note that clwb is not guaranteed to avoid eviction, it's merely allowed to. It might run the same as clflushopt, like may be the case on SKX.
See https://danluu.com/clwb-pcommit/, but note that pcommit isn't required: Intel decided to simplify the ISA before releasing any chips that need it, so clwb or clflushopt + sfence are sufficient. See https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction.
Anyway, this is the kind of cache-control that's relevant for modern CPUs. Whatever experiment you're doing requires ring0 and assembly on x86.
Footnote 1: Touching a lot of memory: pure ISO C++17
You could maybe allocate a very large buffer and then memset it (so those writes will pollute all the (data) caches with that data), then unmap it. If delete or free actually returns the memory to the OS right away, then it will no longer be part of your process's address space, so only a few cache lines of other data will still be hot: probably a line or two of stack (assuming you're on a C++ implementation that uses a stack, as well as running programs under an OS...). And of course this only pollutes data caches, not instruction caches, and as Basile points out, some levels of cache are private per-core, and OSes can migrate processes between CPUs.
Also, beware that using an actual memset or std::fill function call, or a loop that optimizes to that, could be optimized to use cache-bypassing or pollution-reducing stores. And I also implicitly assumed that your code is running on a CPU with write-allocate caches, instead of write-through on store misses (because all modern CPUs are designed this way). x86 supports WT memory regions on a per-page basis, but mainstream OSes use WB pages for all "normal" memory.
Doing something that can't optimize away and touches a lot of memory (e.g. a prime sieve with a long array instead of a bitmap) would be more reliable, but of course still dependent on cache pollution to evict other data. Just reading large amounts of data isn't reliable either; some CPUs implement adaptive replacement policies that reduce pollution from sequential accesses, so looping over a big array hopefully doesn't evict lots of useful data. E.g. the L3 cache in Intel IvyBridge and later does this.
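As a rough illustration of this footnote, the sketch below touches one byte per assumed cache line of a large buffer through a volatile pointer, so the compiler cannot drop the accesses or turn them into cache-bypassing stores. It is a best-effort polluter, not a real flush: the name pollute_cache, the 64 MiB default size, and the 64-byte stride are all assumptions, and the adaptive replacement policies mentioned above can still keep other data resident.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reads and writes one byte per (assumed 64-byte) line of a large buffer.
// Returns a checksum so the work has a visible result.
std::size_t pollute_cache(std::size_t bytes = std::size_t{64} << 20) {
    assert(bytes > 0);
    std::vector<unsigned char> buf(bytes, 1);
    volatile unsigned char* p = buf.data();  // volatile: accesses must happen
    std::size_t sum = 0;
    for (std::size_t i = 0; i < bytes; i += 64) {
        p[i] = static_cast<unsigned char>(p[i] + 1);  // read-modify-write
        sum += p[i];
    }
    return sum;  // buf is released when the function returns
}
```

The buffer should be several times larger than the last-level cache for this to have much effect, and the result still only concerns the data caches of whichever core the thread happened to run on.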
The answer is no, there is no standard C++ way to do this (even with some compiler intrinsics). GCC has __builtin___clear_cache and __builtin_prefetch and Clang probably has them also.
As Johan commented, x86-64 has a privileged instruction for doing what you want, but __builtin___clear_cache doesn't use it (and is a no-op on x86-64, because instruction caches are coherent with data caches on that architecture so hardware takes care of syncing recently-stored data before executing it as code).
On Linux, you might (perhaps) use the cacheflush(2) Linux specific system call. I never used it, and I don't know if it is implemented on x86-64.
BTW, you should not reason on programs, but on processes. Each has its own virtual address space.
Your question lacks some motivation. If you care about micro-benchmarking, be aware that the kernel scheduler is allowed to reschedule and move your thread or process to some other core at arbitrary machine code instruction (be however aware of processor affinity).
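For the processor-affinity remark, a Linux-specific sketch of pinning the calling thread to one core before a micro-benchmark could look like this. It assumes glibc's sched_getaffinity and sched_setaffinity wrappers (g++ defines _GNU_SOURCE by default, which exposes them); the function name pin_to_first_allowed_cpu is hypothetical.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros

// Pins the calling thread to the first CPU it is currently allowed to use.
// Returns that CPU's index, or -1 if either system call fails.
int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            // pid 0 means "the calling thread".
            return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
        }
    }
    return -1;
}
```

Pinning stops the scheduler from migrating the thread to another core, but it does not stop other threads or interrupts from running on that core and disturbing its caches.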
(the function should work regardless of compiler optimizations)?
No, optimizing compilers are reordering and rescheduling machine code instructions and often mix several computations related to different C++ statements. They are allowed to do some computations at compile-time. Read more about the as-if rule. See CppCon 2017 talk: Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”.

What does hardware virtualization extension mean?

When programming against a virtual machine, some systems report that a hardware virtualization extension is needed. What does that mean?
A hardware virtualization extension gives the processor a second state that represents the virtual machine's state (for example for VMware). When VM code is scheduled to run, the processor switches to its "virtual" context and then works in this "sandbox". When a hypervisor executes guest code, it needs to emulate many hardware aspects, that is, virtualize them in software. Hardware extensions allow that emulation to be done in hardware, which significantly reduces virtualization overhead.
When a CPU is emulated (vCPU) by the hypervisor, the hypervisor has to translate the instructions meant for the vCPU to the physical CPU. As you can imagine, this has a massive performance impact. To overcome this, modern processors support virtualization extensions, such as Intel VT-x and AMD-V. These technologies provide the ability for a slice of the physical CPU to be directly mapped to the vCPU. Therefore the instructions meant for the vCPU can be directly executed on the physical CPU slice.
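Whether a processor advertises these extensions can be checked with the CPUID instruction. The sketch below assumes GCC or Clang on x86, using __get_cpuid from <cpuid.h>; the struct VirtFlags and the function query_virt_flags are hypothetical names. It reads the VMX bit for Intel VT-x (leaf 1, ECX bit 5), the SVM bit for AMD-V (leaf 0x80000001, ECX bit 2), and the hypervisor-present bit (leaf 1, ECX bit 31), which is normally set when the code is itself running inside a VM.

```cpp
#include <cpuid.h>  // __get_cpuid (GCC/Clang, x86 only)

struct VirtFlags {
    bool vmx;         // Intel VT-x advertised
    bool svm;         // AMD-V advertised
    bool hypervisor;  // running under a hypervisor
    bool sse2;        // SSE2 advertised (CPUID leaf 1, EDX bit 26)
};

VirtFlags query_virt_flags() {
    VirtFlags f{};
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;

    // Basic feature leaf.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.vmx        = ((ecx >> 5) & 1u) != 0;
        f.hypervisor = ((ecx >> 31) & 1u) != 0;
        f.sse2       = ((edx >> 26) & 1u) != 0;
    }
    // Extended feature leaf; __get_cpuid returns 0 if it is not supported.
    if (__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx)) {
        f.svm = ((ecx >> 2) & 1u) != 0;
    }
    return f;
}
```

Inside a guest, the VMX or SVM bit is usually hidden unless the hypervisor exposes nested virtualization, while ordinary features such as SSE2 are typically passed through.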

in virtualbox, what happens when you allocate more than one virtual core?

If I'm using Oracle's virtualbox, and I assign more than one virtual core to the virtual machine, how are the actual cores assigned? Does it use both real cores in the virtual machine, or does it use something that emulates cores?
Your question is almost like asking: How does an operating system determine which core to run a given process/thread on? Your computer is making that type of decision all the time - it has far more processes/threads running than you have cores available. This specific answer is similar in nature but also depends on how the guest machine is configured and what support your hardware has available to accelerate the virtualization process - so this answer is certainly not definitive and I won't really touch on how the host schedules code to be executed, but let's examine two relatively simple cases:
The first would be a fully virtualized machine - this would be a machine with no or minimal acceleration enabled. The hardware presented to the guest is fully virtualized even though many CPU instructions are simply passed through and executed directly on the CPU. In cases like this, your guest VM more-or-less behaves like any process running on the host: The CPU resources are scheduled by the operating system (to be clear, the host in this case) and the processes/threads can be run on whatever cores they are allowed to. The default is typically any core that is available, though some optimizations may be present to try and keep a process on the same core to allow the L1/L2 caches to be more effective and minimize context switches. Typically you would only have a single CPU allocated to the guest operating system in these cases, and that would roughly translate to a single process running on the host.
In a slightly more complex scenario, a virtual machine is configured with all available CPU virtualization acceleration options. In Intel speak this is referred to as VT-x; for AMD it is AMD-V. These primarily support privileged instructions that would normally require some binary translation / trapping to keep the host and guest protected. As such, the host operating system loses a little bit of visibility. Add hardware-accelerated MMU support (such that memory page tables can be accessed directly without being shadowed by the virtualization software) and the visibility drops a little more. Ultimately though it still largely behaves as the first example: It is a process running on the host and is scheduled accordingly - only that you can think of a thread being allocated to run the instructions (or pass them through) for each virtual CPU.
It is worth noting that while you can (with the right hardware support) allocate more virtual cores to the guest than you have available, it isn't a good idea. Typically this will result in decreased performance as the guest potentially thrashes the CPU and can't properly schedule the resources that are being requested - even if the CPU is not fully taxed. I bring this up as a scenario that shares certain similarities with a multi-threaded program that spawns far more threads (that are actually busy) than there are idle CPU cores available to run them. Your performance will typically be worse than if you had used fewer threads to get the work done.
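The thread analogy can be sketched in standard C++: a program can ask how many logical cores it sees (inside a guest, that is the number of virtual CPUs assigned to the VM) and avoid creating more busy workers than that. The function name worker_count is hypothetical.

```cpp
#include <algorithm>
#include <thread>

// Caps the number of busy worker threads at the number of logical cores
// visible to this program.
unsigned worker_count(unsigned requested) {
    unsigned hw = std::thread::hardware_concurrency();  // may be 0 if unknown
    if (hw == 0) hw = 1;                                // conservative fallback
    return std::min(requested, hw);
}
```

The same reasoning applies one level up: giving a guest more virtual cores than the host has physical ones just moves the oversubscription from the program to the hypervisor.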
In the extreme case, VirtualBox even supports hot-plugging CPU resources - though only a few operating systems properly support it: Windows 2008 Data Center edition and certain Linux kernels. The same rules generally apply where one guest CPU core is treated as a process/thread on a logical core for the host, however it is really up to the host and hardware itself to decide which logical core will be used for the virtual core.
With all that being said - your question of how VirtualBox actually assigns those resources... well, I haven't dug through the code so I certainly can't answer definitively but it has been my experience that it generally behaves as described. If you are really curious you could experiment with tagging the VirtualBox VBoxSvc.exe and associated processes in Task Manager and choosing the "Set Affinity" option and limiting their execution to a single CPU and see if those settings are honored. It probably depends on what level of HW assist you have available if those settings are honored by the host as the guest probably isn't really running as part of those.