Intel instruction set extensions and the user machine (AVX, IMCI, ...) - C++

If a program is compiled on a Xeon Phi coprocessor and contains instructions from the IMCI instruction set extension, is it possible to run it on a user machine with no Xeon Phi coprocessor?
If it is possible, will the performance be improved on the user machine, compared to the same application with no IMCI instructions, compiled for instance on a Core i7 processor?
In other words, to benefit from the increased performance of an Intel instruction set extension, must the user machine have a processor that supports this extension?

If a program is compiled on a Xeon Phi coprocessor and contains instructions from the IMCI instruction set extension, is it possible to run it on a user machine with no Xeon Phi coprocessor?
If your program uses IMCI, you need a processor (or coprocessor, the distinction is relative here) that supports those instructions.
This is true for every instruction you use.
As far as I know, only Intel Xeon Phi coprocessors support IMCI, so the answer is no.
If it is possible, will the performance be improved on the user machine, compared to the same application with no IMCI instructions, compiled for instance on a Core i7 processor?
In other words, to benefit from the increased performance of an Intel instruction set extension, must the user machine have a processor that supports this extension?
I'm not sure what you are asking here. You cannot use an instruction set extension that the target processor does not support; this is as obvious as the fact that you cannot speak Russian to someone who does not understand Russian.
If you try to execute an unsupported instruction, the processor raises a #UD exception to signal an unrecognized instruction. The program state cannot advance, because you cannot skip instructions in the program flow, so the application is forced to stop.
The KNL microarchitecture of the Xeon Phi will support AVX-512, which is also supported by "mainstream" CPUs.
This question may be useful: Are there SIMD (SSE/AVX) instructions in the x86-compatible accelerators Intel Xeon Phi?
Also note that you should see the Xeon Phi (as it is now) as a coprocessor compatible with the IA32e architecture rather than as a member of the IA32e family.
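As an illustration of that constraint (not part of the original answer), here is a minimal sketch of runtime feature detection, assuming GCC or Clang, which provide the __builtin_cpu_supports builtin; other compilers need their own CPUID helpers:

```cpp
// Minimal sketch: check at runtime whether the CPU actually running the program
// supports an extension before entering code compiled for it.
// __builtin_cpu_supports is a GCC/Clang builtin (feature names depend on compiler version).
#include <cstdio>

int main() {
    if (__builtin_cpu_supports("avx2"))
        std::puts("AVX2 available: an AVX2 code path may run on this machine.");
    else
        std::puts("No AVX2: executing AVX2 instructions here would raise #UD (SIGILL).");

    if (__builtin_cpu_supports("avx512f"))
        std::puts("AVX-512F available.");
    return 0;
}
```

Code guarded this way can ship a fast path for the extension while still running, more slowly, on machines that lack it.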

Related

What happens when I compile on a machine that supports AVX2 and run the binary on another machine that only supports AVX?

I compiled my C++ program on a machine that supports AVX2 (Intel E5-2643 v3). It compiles and runs just fine. I confirmed that AVX2 instructions are used: after disassembling the binary, I saw AVX2 instructions such as vpbroadcastd.
Then I ran this binary on another machine that only has the AVX instruction set (Intel E5-2643 v2). It also runs fine. Does the binary fall back to backward-compatible AVX instructions instead? Which instructions are those? Do you see any potential issue?
There are multiple compilers and multiple settings you can use, but the general principle is that a compiler usually does not target a particular processor; it targets an architecture, and by default it usually takes a fairly inclusive approach, meaning the generated code is compatible with as many processors as reasonable. You would normally expect an x86_64 compiler to generate code that runs without AVX2, indeed code that runs on some of the earliest CPUs supporting the x86_64 instruction set.
If you have code that benefits greatly from extensions to the instruction set that aren't universally supported like AVX2, your aim when producing software is generally to degrade gracefully. For instance you could use runtime feature detection to see if the current processor supports AVX2 and run a separate code path. Some compilers may support automated ways of doing this or helpers to assist you in achieving this yourself.
It's not rare to have AVX2 instructions in a binary that uses CPU detection to make sure it only runs them on CPUs that support them. (e.g. via cpuid and setting function pointers).
If the AVX2 instruction actually executed on a CPU without AVX2 support, it raises #UD, so the OS delivers SIGILL (illegal instruction) to your process, or the Windows equivalent.
There are a few cases where an instruction like lzcnt decodes as rep bsr, which runs as bsr on CPUs without BMI1 (giving a different answer). But VEX-coded AVX2 instructions just fault on older CPUs.
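Here is a minimal sketch of the "CPU detection + function pointers" pattern described above, assuming GCC or Clang (the target attribute and __builtin_cpu_supports are compiler extensions; the function names are hypothetical). GCC can also automate this with function multi-versioning, e.g. __attribute__((target_clones("avx2","default"))).

```cpp
#include <cstddef>

// AVX2 build of this one function only; the rest of the file stays baseline x86-64.
__attribute__((target("avx2")))
static void add_arrays_avx2(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];              // the compiler may auto-vectorize this with AVX2
}

static void add_arrays_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];              // baseline path, runs on any x86-64 CPU
}

using AddFn = void (*)(const float*, const float*, float*, std::size_t);

// Dispatcher: the pointer is resolved once, on first use, and only ever targets
// code the running CPU can actually execute.
static void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    static const AddFn impl =
        __builtin_cpu_supports("avx2") ? add_arrays_avx2 : add_arrays_scalar;
    impl(a, b, out, n);
}
```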

sse2 vectorization and virtual machines

I am considering vectorizing some floor() calls using SSE2 intrinsics, then measuring the performance gain. But ultimately the binary is going to be run on a virtual machine to which I have no access.
I don't really know how a VM works. Is a binary entirely executed on a software-emulated virtual CPU?
If not, supposing the VM runs on a CPU with SSE2, could the VM use its host CPU's SSE2 instructions when executing an SSE2 instruction from my binary?
Could my vectorization be beneficial on the VM ?
I don't really know how a VM works. Is a binary entirely executed on a software-emulated virtual CPU?
For serious purposes, no, because it's too slow. (But e.g. Bochs does; it can be useful for kernel debugging, among other things.)
The binary is executed "normally" as much as possible. This generally means any code that doesn't try to interact with the OS will be executed directly. For example, system calls are likely to require the involvement of the VM implementation.
If not, supposing the VM runs on a CPU with SSE2, could the VM use its host CPU's SSE2 instructions when executing an SSE2 instruction from my binary?
Yes.
Could my vectorization be beneficial on the VM?
Yes.
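For context (not part of the original answers), this is roughly what such a vectorized floor might look like. SSE2 itself has no floor instruction (that arrived with SSE4.1's roundpd), so a common trick is truncation plus a fix-up; this sketch assumes the inputs fit in int32 range, and the helper name is hypothetical:

```cpp
#include <emmintrin.h>   // SSE2 intrinsics

// floor() of two doubles at a time; valid only while the values fit in int32 range.
static inline __m128d floor_sse2(__m128d x) {
    // Convert with truncation toward zero, then back to double.
    __m128d t = _mm_cvtepi32_pd(_mm_cvttpd_epi32(x));
    // For negative non-integers truncation rounded up, so subtract 1 wherever t > x.
    __m128d too_big = _mm_cmpgt_pd(t, x);
    return _mm_sub_pd(t, _mm_and_pd(too_big, _mm_set1_pd(1.0)));
}
```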
It depends on the VM technology and on the CPU's capabilities. The first x86 VMs (like VMware on 32-bit machines) used recompilation: they scanned the guest's binary code for harmful instructions (such as accesses to raw memory or special registers) and replaced them with hypercalls.
Since SSE2 instructions are not harmful, they are simply left as they are, and no performance penalty is added in the VM. Moreover, modern x86 CPUs offer hardware virtualization, which avoids recompilation: harmful instructions are trapped by the CPU and generate an interrupt, but again, SSE2 instructions shouldn't trigger it.
There are of course full processor emulators like QEMU (without KVM) or Bochs, but that is a different story. A Bochs-emulated CPU, for example, is about 1000 times slower than the host CPU.

Run OpenCL without compatible hardware?

I have two PCs:
a new high-end desktop PC, OpenCL compatible CPU and GPU, 32GB RAM
a very old laptop, Intel Celeron CPU, 512MB RAM, Ati M200 GPU
I am writing OpenCL/C++ software on my desktop PC. But when I travel somewhere, I continue the work on my old-school laptop. Programming C++ on this laptop is fine, but I can't try the OpenCL parts of my code. So at the moment I am writing OpenCL code without knowing whether it is correct or not.
Is there a way to virtualize an OpenCL-compatible CPU/GPU? I don't need high performance, I just want to try my code; it doesn't matter if it is very slow (slower than running it single-threaded on my Celeron CPU).
I guess the answer is no.
(BTW, my plan is that there will be an option in my program to run it with or without OpenCL. This is also needed to measure performance and to compare OpenCL on the CPU/GPU against the CPU in single-thread mode without OpenCL.)
Almost an answer, but not completely what I am looking for: http://www.acooke.org/cute/Developing0.html
For all existing OpenCL implementations, you need some form of SSE.
A website gathering all this info is here.
The lowest requirements are those of the AMD OpenCL drivers, which require SSE3. As the list shows, that goes all the way back to late Pentium 4s.
In order to be sure about your CPU's capabilities, you'll need to use something like CPU-Z which can show the capabilities of your processor.
All that aside, I searched for laptops with your GPU, and ended up with processors like the Intel Celeron M 420, which according to Intel doesn't even have 64-bit support (which would imply SSE2).
I currently know of no other OpenCL implementations that are worth anything, so the answer would be no.
On the other hand, some websites claim that processor has SSE3 support, so that would mean AMD's OpenCL SDK is your option of choice.
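Rather than guessing, you can also just ask the machine. Below is a minimal sketch, assuming the OpenCL 1.2 headers and an ICD loader are installed (link with -lOpenCL); if no platform shows up on the laptop, there is nothing to run your kernels on:

```cpp
// Minimal sketch: list whatever OpenCL platforms and devices the machine exposes.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    if (num_platforms == 0) { std::puts("No OpenCL platform found."); return 1; }

    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

    for (cl_platform_id p : platforms) {
        char name[256] = {};
        clGetPlatformInfo(p, CL_PLATFORM_NAME, sizeof(name), name, nullptr);
        std::printf("Platform: %s\n", name);

        cl_uint num_devices = 0;
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &num_devices);
        std::vector<cl_device_id> devices(num_devices);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, num_devices, devices.data(), nullptr);

        for (cl_device_id d : devices) {
            char dev_name[256] = {};
            clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(dev_name), dev_name, nullptr);
            std::printf("  Device: %s\n", dev_name);
        }
    }
    return 0;
}
```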

OpenCL Verification of Parallel Execution

What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims it is supposedly doing. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to be executed in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL API 1.2 specification, along with a simple OpenCL C kernel which simply squares the input integer.
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that they will fit on the compute units and so that OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, this should hopefully be enough to force the scheduler to execute the maximum number of kernels and work items as efficiently as possible, making use of the available cores / processors.
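For reference, the two limits used above come from clGetDeviceInfo; a minimal sketch (the helper name is hypothetical, and the device id is assumed to come from clGetDeviceIDs):

```cpp
#include <CL/cl.h>
#include <cstdio>

// Print the two device limits referred to above.
void print_limits(cl_device_id device) {
    cl_uint compute_units = 0;
    size_t  max_wg_size   = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, nullptr);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg_size), &max_wg_size, nullptr);
    std::printf("MAX_COMPUTE_UNITS = %u, MAX_WORK_GROUP_SIZE = %lu\n",
                (unsigned)compute_units, (unsigned long)max_wg_size);
}
```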
For a CPU, you can check cpuid, or use sched_getcpu or GetCurrentProcessorNumber, to see which core / processor the current thread is executing on.
Is there a method in the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C language built-in function, or perhaps some form of assembly language understood by the vendors' compilers, which I could use to obtain this information?
Is there an equivalent to cpuid, sched_getcpu, or GetCurrentProcessorNumber for GPUs for core usage monitoring, etc.? Perhaps something vendor- or architecture-specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, neither of which is useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?
Hardware Details:
GPU: AMD FirePro, using the AMD Capeverde architecture, 7700M series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (as there are manuals for x86), that would possibly be a start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Win 7 64-bit; it will also eventually need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Manager, Intel's OpenCL compiler is a clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out the program's buffer copies and visualisations and leave only the kernel executions intact. Then put it in a tight loop and watch the heat rise. If it heats up like FurMark, then it is using the cores. If it is not heating up, you can also disable the serial operations in the kernels (gid == 0) and try again. For example, a simple n-body simulator pushes a well-cooled HD 7000 series GPU to over 70°C in minutes, and to 90°C with a poor cooler. Compare that to a known benchmark's temperature limits.
A similar thing exists for the CPU: using float4 heats it more than plain floats, which shows that even the instruction type matters if you want to use all the ALUs (let alone all the threads).
If the GPU has a really good cooler, you can watch its Vdroop instead. More load means more voltage drop; more cores mean more drop, and more load per core also means more drop.
Whatever you do, it is up to the compiler's and the hardware's abilities, and you don't have explicit control over the ALUs, because OpenCL hides the hardware complexity from the developer.
Using MSI Afterburner or similar software is not useful, because such tools show 100% usage even when you use only 1% of the card's true potential.
Simply look at the temperature difference of the computer case between the equilibrium state and the starting state. If the delta-T is around 50 with OpenCL and 5 without, OpenCL is parallelising things, although you cannot tell by how much.
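To complement the temperature approach: there is no standard OpenCL built-in that reports a physical core, but the standard work-item functions can at least record how the work was distributed into work-groups. A hedged sketch (the kernel name and the extra output buffer are hypothetical), building on the "square the input" kernel from the question:

```cpp
// Host-side constant holding the kernel source (OpenCL C inside a C++ raw string).
static const char* kTraceKernelSource = R"CLC(
__kernel void square_and_trace(__global const int* in,
                               __global int* out,
                               __global int* group_of)   // one entry per work-item
{
    size_t gid = get_global_id(0);        // standard OpenCL C built-in
    out[gid]      = in[gid] * in[gid];
    group_of[gid] = (int)get_group_id(0); // logical work-group ID, not a physical core
}
)CLC";
```

Reading back group_of shows how the NDRange was split into work-groups; how those groups map onto physical compute units remains up to the driver and is not exposed portably.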

AMD multi-core programming

I want to start writing applications (C++) that will utilize the additional cores to execute portions of the code that need to perform lots of calculations and whose computations are independent of each other. I have the following processor: x64 Family 15 Model 104 Stepping 2, AuthenticAMD, ~1900 MHz, running Windows Vista Home Premium 32-bit and openSUSE 11.0 64-bit.
On Intel platforms I've used the following APIs: Intel TBB and OpenMP. Do they work on AMD, and does AMD have similar APIs? What has been your experience?
OpenMP and TBB are both available for AMD as well; it is also a compiler question.
E.g. see linux TBB on AMD.
I think the latest development on this front is to use the graphics card via CUDA or similar APIs, but this depends on the nature of your calculations. If it fits, it is faster than the CPU anyway.
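For completeness, a minimal OpenMP sketch of the "independent calculations" case (it works identically on AMD and Intel CPUs; compile with -fopenmp on GCC/Clang or /openmp on MSVC; the data and the formula are placeholders):

```cpp
#include <omp.h>
#include <vector>
#include <cmath>
#include <cstdio>

int main() {
    std::vector<double> data(1000000, 2.0);

    // Each iteration is independent, so OpenMP can split the range across all cores.
    #pragma omp parallel for
    for (long i = 0; i < (long)data.size(); ++i)
        data[i] = std::sqrt(data[i]) * std::log(data[i] + 1.0);

    std::printf("ran on up to %d threads\n", omp_get_max_threads());
    return 0;
}
```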