I'm implementing an eigen-decomposition method using the Intel Math Kernel Library (MKL) in Fortran.
Inside the method, I first apply zgehrd to reduce the input matrix to upper Hessenberg form.
While debugging, however, I found that, given the same input matrix, zgehrd produces different results on different computers. Some of them run Windows 10 while others are still on Windows 7.
To further test whether this problem is system dependent, I installed a Windows 10 (Pro, 64-bit) virtual machine on a Windows 10 (Home, 64-bit) computer. It turns out the results still differ slightly in this case.
Since the eigen-decomposition method will be called recursively by an optimizer, the slight differences accumulate. I've tried enforcing Conditional Numerical Reproducibility from Fortran, yet it does not help. Any help would be appreciated.
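For reference, the reduction step described above looks roughly like this; a minimal sketch using MKL's LAPACKE C interface rather than the Fortran interface from the question, with a made-up 3x3 matrix:

```cpp
// Minimal sketch of the Hessenberg reduction step (assumptions: Intel MKL's
// LAPACKE C interface; the 3x3 matrix values are made up for illustration).
#include <mkl.h>
#include <mkl_lapacke.h>
#include <cstdio>

int main() {
    const MKL_INT n = 3;
    // Column-major storage of an arbitrary complex matrix A.
    MKL_Complex16 a[n * n] = {
        {4.0,  0.0}, {1.0,  1.0}, {2.0, -1.0},
        {1.0, -1.0}, {3.0,  0.0}, {0.5,  0.5},
        {2.0,  1.0}, {0.5, -0.5}, {1.0,  0.0}
    };
    MKL_Complex16 tau[n - 1];  // Householder scalars produced by zgehrd

    // Reduce A to upper Hessenberg form H, with A = Q * H * Q^H.
    MKL_INT info = LAPACKE_zgehrd(LAPACK_COL_MAJOR, n, 1, n, a, n, tau);
    std::printf("LAPACKE_zgehrd returned %lld\n", static_cast<long long>(info));
    return info != 0;
}
```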
If you want to get bit-to-bit identical output with Intel MKL on different machines, first enable MKL_VERBOSE mode (set/export the environment variable MKL_VERBOSE=1) and check the lowest code branch reported. Example:
Running the MKL code on AVX and AVX-512 based systems, we see the following messages:
MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions (**Intel(R) AVX**) enabled processors, Lnx 2.80GHz intel_thread
and on a Skylake (AVX-512) system:
MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (**Intel(R) AVX-512**) enabled processors, Lnx 2.20GHz intel_thread
The next step:
Enable MKL's bitwise reproducibility feature by setting the environment variable: set/export MKL_CBWR=AVX
MKL then guarantees that you will see the same outputs on AVX and AVX-512 based systems, provided the number of threads and the OS are the same.
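The same branch pinning can also be done programmatically. Below is a minimal C/C++ sketch, assuming Intel MKL's Conditional Numerical Reproducibility (CNR) API from mkl.h; the call must come before any other MKL routine:

```cpp
// Minimal sketch: force every MKL call in this process onto the AVX branch
// (assumption: an MKL version providing the CNR API in mkl.h).
#include <mkl.h>
#include <cstdio>

int main() {
    // Must be the first MKL call in the program; after this, AVX and AVX-512
    // machines take the same code path and produce bitwise-identical results
    // (given the same number of threads and the same OS).
    int status = mkl_cbwr_set(MKL_CBWR_AVX);
    if (status != MKL_CBWR_SUCCESS) {
        std::fprintf(stderr, "CNR branch not supported on this CPU/MKL build\n");
        return 1;
    }
    // ... subsequent MKL calls (e.g. zgehrd) go here ...
    return 0;
}
```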
Related
I compiled my C++ program on a machine that supports AVX2 (Intel E5-2643 v3). It compiles and runs just fine. I confirmed that AVX2 instructions are used, since after disassembling the binary I saw AVX2 instructions such as vpbroadcastd.
Then I ran this binary on another machine that only has the AVX instruction set (Intel E5-2643 v2). It also runs fine. Does the binary fall back to backward-compatible AVX instructions instead? Which instructions are actually executed? Do you see any potential issue?
There are multiple compilers and multiple settings you can use, but the general principle is that a compiler usually does not target a particular processor; it targets an architecture, and by default it takes a fairly inclusive approach, meaning the generated code will be compatible with as many processors as reasonable. You would normally expect an x86_64 compiler to generate code that runs without AVX2; indeed, it should run on some of the earliest CPUs supporting the x86_64 instruction set.
If you have code that benefits greatly from extensions to the instruction set that aren't universally supported, like AVX2, your aim when producing software is generally to degrade gracefully. For instance, you could use runtime feature detection to see whether the current processor supports AVX2 and take a separate code path if it does. Some compilers may support automated ways of doing this, or provide helpers to assist you in achieving it yourself.
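A minimal sketch of such runtime dispatch, assuming GCC/Clang built-ins; the kernel names scale_avx2 and scale_scalar are invented for illustration:

```cpp
// Runtime dispatch between an AVX2 path and a baseline path.
#include <cstddef>

// AVX2 build of the kernel; only ever called after the CPU check below succeeds.
__attribute__((target("avx2")))
static void scale_avx2(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= factor;  // may auto-vectorize with AVX2 at -O3
}

// Baseline build of the same kernel, safe on any x86-64 CPU.
static void scale_scalar(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= factor;
}

int main() {
    // Pick the implementation once, based on what the CPU actually supports.
    void (*scale)(float*, std::size_t, float) =
        __builtin_cpu_supports("avx2") ? scale_avx2 : scale_scalar;

    float buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    scale(buf, 8, 2.0f);  // runs the AVX2 path only on AVX2-capable machines
    return 0;
}
```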
It's not rare to have AVX2 instructions in a binary that uses CPU detection to make sure it only runs them on CPUs that support them. (e.g. via cpuid and setting function pointers).
If an AVX2 instruction is actually executed on a CPU without AVX2 support, it raises #UD, so the OS delivers SIGILL (illegal instruction) to your process, or the Windows equivalent.
There are a few cases where an instruction such as lzcnt decodes as rep bsr, which runs as plain bsr on CPUs without BMI1, giving a different answer. But VEX-coded AVX2 instructions simply fault on older CPUs.
If a program is compiled on a Xeon Phi coprocessor and contains instructions from the IMCI instruction set extension, is it possible to run it on a user machine with no Xeon Phi coprocessor?
If it is possible, will the performance be improved on the user machine compared to the same application compiled without IMCI instructions, for instance on a Core i7 processor?
In other words, to benefit from the increased performance of an Intel instruction set extension, must the user machine have a processor that supports this extension?
If a program is compiled on a Xeon Phi coprocessor and contains instructions from the IMCI instruction set extension, is it possible to run it on a user machine with no Xeon Phi coprocessor?
If your program uses IMCI, you need a processor (or coprocessor, the distinction is relative) that supports those instructions.
This is true for every instruction you use.
As far as I know, only Intel Xeon Phi coprocessors support IMCI, so the answer is no.
If it is possible, will the performance be improved on the user machine compared to the same application compiled without IMCI instructions, for instance on a Core i7 processor?
In other words, to benefit from the increased performance of an Intel instruction set extension, must the user machine have a processor that supports this extension?
I'm not sure what you are asking here: you can't use an instruction set extension that the target processor doesn't support, just as you can't speak Russian with someone who doesn't understand Russian.
If you try to execute unsupported instructions, the processor raises a #UD exception signaling an unrecognized instruction; the program cannot advance, since you cannot skip instructions in the program flow, and the application is forced to stop.
The KNL microarchitecture of the Xeon Phi will support AVX-512, which is also supported by "mainstream" CPUs.
This question may be useful: Are there SIMD(SSE / AVX) instructions in the x86-compatible accelerators Intel Xeon Phi?
Also note that you should see the Xeon Phi (as it is now) as a coprocessor compatible with the IA-32e architecture rather than as a member of the IA-32e family.
I am going to offload some computation to a Xeon Phi, but I would like to test different APIs and different approaches to parallel programming first.
Is there a simulator / emulator for Xeon Phi (for either Windows or Linux)?
In the event that future internet users see this question and wonder about Knights Landing simulation, the Intel SDE (https://software.intel.com/en-us/articles/intel-software-development-emulator) emulates AVX-512.
For the uninitiated, Knights Landing is the official code name for the next generation of Intel Xeon Phi processors. It is no more correct to assume that Xeon Phi means Knights Corner than it is to assume that Xeon means Haswell; it's just that there has only been one iteration of Xeon Phi to date.
The Suitability feature in Intel(R) Advisor XE 2015 Beta (which can be "enrolled" for free here) can be used to address your request. The Suitability Beta is specifically able to:
evaluate whether Intel® Xeon Phi™ (native, or limited support for offload) performance levels can exceed CPU performance peaks for a given workload
evaluate the impact of imbalance, run-time overhead and other performance losses depending on the parallel API, the number of threads, and the loop iteration count/granularity being used
All of these "evaluations" can be done on an arbitrary x86 machine (Windows or Linux are supported), so it really is a sort of "emulation". However, it is a software-based modeling tool, not a traditional hardware simulator or emulator.
Note: the Xeon Phi-specific functionality is currently only available as an "experimental" feature. This means that at the moment (as of April 2014) it is still Beta quality and is not enabled by default; you have to set the experimental variable ADVIXE_EXPERIMENTAL=suitability_xeon_phi_modeling to enable it. Advisor Beta experimental features usually become more mature later in the year (either in a Beta update or in later releases).
This note does not apply to the other parts of the Suitability feature, which are not Xeon Phi-specific.
Here is a screenshot of the GUI look & feel of this Beta experimental feature (the bold red annotation is mine):
What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or is there at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims it is supposedly doing. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to be executed in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL 1.2 API specification, along with a simple OpenCL C kernel which simply squares the input integer.
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that the work-groups fit on the compute units and OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, this should hopefully be enough to force the scheduler to execute as many kernels and work-items in parallel as possible, making use of the available cores/processors.
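For illustration, the sizing described above could be computed on the host roughly like this (a sketch assuming the OpenCL 1.2 C API and an already-selected device; the multiplier 16 is arbitrary):

```cpp
// Derive a global work size from the device's reported limits.
#include <CL/cl.h>
#include <cstdio>

size_t choose_global_size(cl_device_id device) {
    cl_uint max_compute_units = 0;
    size_t max_work_group_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(max_compute_units), &max_compute_units, nullptr);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_work_group_size), &max_work_group_size, nullptr);

    // amount_of_work is a scalar multiple of MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE,
    // so every compute unit should receive several work-groups to execute.
    size_t amount_of_work = 16 * max_compute_units * max_work_group_size;
    std::printf("compute units: %u, work-group size: %zu, global size: %zu\n",
                max_compute_units, max_work_group_size, amount_of_work);
    return amount_of_work;
}
```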
For a CPU, you can use cpuid, sched_getcpu, or GetCurrentProcessorNumber to check which core / processor the current thread is currently executing on (see the sketch after these questions).
Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C language built-in function, or perhaps some form of assembly language understood by the vendors' compilers, which I could use to obtain this information?
Is there an equivalent to cpuid, sched_getcpu, or GetCurrentProcessorNumber for GPUs, for core usage monitoring, etc.? Perhaps something vendor or architecture specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, neither of which is useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?
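As a sketch of the CPU-side check mentioned above (assuming Linux with glibc; on Windows the analogue is GetCurrentProcessorNumber):

```cpp
// Report which logical CPU the calling thread is currently running on.
#include <sched.h>    // sched_getcpu (glibc; may require _GNU_SOURCE with some compilers)
#include <cstdio>

int main() {
    int cpu = sched_getcpu();   // logical CPU of the calling thread right now
    if (cpu < 0) {
        std::perror("sched_getcpu");
        return 1;
    }
    std::printf("this thread is currently on logical CPU %d\n", cpu);
    return 0;
}
```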
Hardware Details:
GPU: AMD FirePro, using the AMD Cape Verde architecture, 7700M series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (as there are for x86), that would possibly be a start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Win 7 64-bit; it will also eventually need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Manager, Intel's OpenCL compiler is a clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out the program's buffer copies and visualisations and leave only the kernel executions intact. Then put them in a tight loop and watch the temperature rise. If it heats up like FurMark, it is using the cores. If it is not heating up, you can also disable the serial operations in the kernels (the gid==0 branches) and try again. For example, a simple n-body simulator pushes a well-cooled HD 7000 series GPU to over 70°C within minutes, and to 90°C with a poor cooler. Compare this against a known benchmark's temperature limits.
Something similar exists for the CPU: using float4 produces more heat than plain floats, which shows that even the instruction type matters for keeping all ALUs busy (let alone all threads).
If the GPU has a really good cooler, you can watch its Vdroop instead: more load means more voltage drop, more active cores mean more drop, and more load per core also means more drop.
Whatever you do, it is up to the compiler's and the hardware's abilities, and you don't have explicit control over the ALUs, because OpenCL hides the hardware complexity from the developer.
Using MSI Afterburner or similar software is not useful, because it shows 100% usage even when you use only 1% of the card's true potential.
Simply look at the temperature difference of the computer case between the equilibrium state and the starting state. If the delta-T is around 50 with OpenCL and 5 without, OpenCL is parallelising something, though you can't know how much.
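A rough sketch of the tight-loop stress test described above (assuming a command queue, kernel, and work sizes already set up elsewhere; the iteration count is arbitrary):

```cpp
// Launch the kernel many times with no buffer transfers in between, so any
// heat produced comes from the ALUs doing the actual kernel work.
#include <CL/cl.h>

void stress_kernel(cl_command_queue queue, cl_kernel kernel,
                   size_t global_size, size_t local_size) {
    for (int i = 0; i < 10000; ++i) {
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
                               &global_size, &local_size, 0, nullptr, nullptr);
    }
    clFinish(queue);  // wait for all launches to drain before checking temperature
}
```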
I want to start writing applications (C++) that will use the additional cores to execute portions of code that need to perform lots of calculations and whose computations are independent of each other. I have the following processor: x64 Family 15 Model 104 Stepping 2, AuthenticAMD, ~1900 MHz, running Windows Vista Home Premium 32-bit and openSUSE 11.0 64-bit.
On Intel platforms I've used Intel TBB and OpenMP. Do they work on AMD, and does AMD have similar APIs? What has been your experience?
OpenMP and TBB are both available for AMD as well - it is also a compiler question.
E.g. see linux TBB on AMD.
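For example, a minimal OpenMP parallel loop (assuming any compiler with OpenMP support, e.g. g++ -fopenmp; it behaves the same on AMD and Intel CPUs):

```cpp
// Split independent loop iterations across all available cores with OpenMP.
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const std::size_t n = 1 << 20;
    std::vector<double> data(n, 1.0);

    // Iterations are independent, so they can be distributed across threads.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(n); ++i) {
        data[i] = data[i] * data[i] + 0.5;
    }

    std::printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}
```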
I think the latest development on this front is to use the graphics card via CUDA or similar APIs, but this depends on the nature of your calculations. If it fits, it is faster than the CPU anyway.