AMD multi-core programming - c++

I want to start writing C++ applications that use the additional cores to execute portions of the code that perform lots of calculations and whose computations are independent of each other. I have the following processor: x64 Family 15 Model 104 Stepping 2 Authentic AMD ~1900 MHz, running Windows Vista Home Premium 32-bit and openSUSE 11.0 64-bit.
On Intel platforms I've used Intel TBB and OpenMP. Do they work on AMD, and does AMD have similar APIs? What has been your experience?

OpenMP and TBB are both available for AMD as well; whether they work is also a question of which compiler you use.
E.g. see linux TBB on AMD.
I think the latest development on this end is to use the graphics card via CUDA or similar APIs, but this depends on the nature of your calculations. If your workload fits that model (many independent, data-parallel calculations), it can be much faster than the CPU.
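As a minimal sketch of the kind of loop the question describes (independent, calculation-heavy iterations), the following uses an OpenMP parallel for. The function name heavy_map and the inner workload are made up for illustration; the pragma itself is standard OpenMP and not tied to Intel or AMD hardware. It assumes a compiler with OpenMP enabled (for example -fopenmp with GCC); without it the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Applies an expensive computation to every element independently.
// Each iteration reads in[i] and writes out[i] only, so iterations
// can safely run on different cores.
std::vector<double> heavy_map(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    const long n = static_cast<long>(in.size());
    assert(n >= 0);

    // With OpenMP enabled, the iterations are divided among the threads
    // of the team; otherwise this is an ordinary serial loop.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        double x = in[i];
        // Placeholder workload: 1000 dependent steps per element.
        for (int k = 0; k < 1000; ++k) {
            x = std::sqrt(x * x + 1.0);
        }
        out[i] = x;
    }
    return out;
}
```

Calling heavy_map on a large vector and watching per-core load in a system monitor is a quick way to see whether the OpenMP runtime is actually spreading the work.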

Related

Math Kernel Library function produces different results on different machines

I'm implementing an eigen-decomposition method using the Math Kernel Library in Fortran.
Inside the method, I first apply zgehrd to convert the input matrix into upper Hessenberg form.
While debugging, however, I found that, given the same input matrix, zgehrd produces different results on different computers. Some of the computers run Windows 10, while others still run Windows 7.
To test whether this problem is system dependent, I installed a Windows 10 (Pro 64-bit) virtual machine on a Windows 10 (Home 64-bit) computer. The results still differ slightly in this case.
Since the eigen-decomposition method is called repeatedly by an optimizer, the slight differences accumulate. I've tried enforcing Conditional Numerical Reproducibility in Fortran, but it does not help. Any help would be appreciated.
If you want to see bit-for-bit identical results with Intel MKL on different machines, first enable MKL_VERBOSE mode (set/export the environment variable MKL_VERBOSE=1) and check the lowest reported code branch. Example:
Running the MKL code on an AVX-based system and on an AVX-512-based system, we see the following messages:
MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions (**Intel(R) AVX**) enabled processors, Lnx 2.80GHz intel_thread
and on a Skylake system:
MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (**Intel(R) AVX-512**) enabled processors, Lnx 2.20GHz intel_thread
The next step:
Enable MKL's bitwise reproducibility feature by setting the environment variable to the lowest reported code branch: set/export MKL_CBWR=AVX
MKL then guarantees that you will see the same output on AVX and AVX-512 based systems, provided the number of threads and the OS are the same.
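To illustrate why different code branches can give slightly different results in the first place, here is a small sketch of the general mechanism: floating-point addition is not associative, so summing in a different order (as a wider vector unit effectively does) can change the result. This is not MKL's actual code; the function names are made up for illustration, and it assumes IEEE-754 double arithmetic without fast-math style reordering by the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds the values strictly left to right, as a scalar loop would.
double sum_in_order(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Adds even-indexed and odd-indexed values in two separate partial sums
// and combines them at the end, mimicking a 2-wide vector accumulation.
double sum_two_lanes(const std::vector<double>& v) {
    double lane[2] = {0.0, 0.0};
    for (std::size_t i = 0; i < v.size(); ++i) lane[i % 2] += v[i];
    return lane[0] + lane[1];
}

// Same four numbers, two summation orders:
//   in order : ((1e20 + 1) - 1e20) + 1 = 1   (the first 1 is absorbed by 1e20)
//   two lanes: (1e20 - 1e20) + (1 + 1) = 2
inline std::vector<double> example_values() {
    return {1e20, 1.0, -1e20, 1.0};
}
```

Both answers are "correct" roundings of different evaluation orders, which is why a reproducibility setting has to pin the code branch rather than fix a bug.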

C++: Low CPU usage on Ubuntu multi core server

I'm having a problem running C++ code on a powerful multi-core server that runs Ubuntu. The problem is that my app uses less than 10% of one CPU, but the same app uses around 100% of one CPU on my i3 notebook, which runs a different version of Ubuntu.
My OS:
Linux version 3.11.0-23-generic (buildd@batsu) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #40~precise1-Ubuntu SMP Wed Jun 4 22:06:36 UTC 2014
The server's OS:
Linux version 3.11.0-12-generic (buildd@allspice) (gcc version 4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu7) ) #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013
At least for now, I do not need to parallelize the code or make it more efficient. I just want to know how I can achieve 100% use of one core on this server.
Could anyone help me?
It may not be your OS but the compiler. Compilers are moving targets; year by year they (hopefully) improve their automatic optimizations. Your code may be getting vectorized without your knowing it. Note that, going by the version strings you posted, the two machines use different compilers (gcc 4.6.3 on the laptop, gcc 4.8.1 on the server).
See if you still have the performance delta when you disable all optimizations (-O0 or some such). If you are trying to maximize CPU cycles, you may be using numerical calculations that are easily vectorized. The same goes for parallelization. You can also get general optimization reports as well as a specific vectorization report from gcc. I don't recall the exact flag, but you can find it easily online.
Also, there is a world of difference between the number of cores on a server (probably a multi-core Xeon) and your i3. Your i3 has 2 cores, each capable of running two hardware threads, meaning you have in effect 4 CPUs. Depending on your server configuration, you can have up to 18 cores with two hardware threads each in a processor. That translates to 36 effective CPUs. You can also have multiple processors per motherboard. You can do the math.
Both the compiler and the OS can affect an application's processor use. If you are forking off multiple threads to try to consume processing, the OS can farm those out to different processors, reducing the usage you see on any one CPU. Even if you are running pure serial code, a smart compiler can break up the code into multiple threads that your threading library may distribute over those 36 effective CPUs.
Your OS can also meter how much processing you can hog. If this server is not under your control, the administrator may have established a policy that limits the percentage of processing that any one application can consume.
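The arithmetic above can be written down as a tiny sketch. The helper name logical_cpus is made up for illustration; std::thread::hardware_concurrency is the standard C++11 call that reports what the runtime believes the current machine offers (it is allowed to return 0 when the value cannot be determined).

```cpp
#include <cassert>
#include <thread>

// Effective ("logical") CPUs = sockets * cores per socket * hardware threads per core.
unsigned logical_cpus(unsigned sockets, unsigned cores_per_socket,
                      unsigned threads_per_core) {
    assert(threads_per_core >= 1);  // every core runs at least one hardware thread
    return sockets * cores_per_socket * threads_per_core;
}

// Number of logical CPUs reported for the machine this runs on.
// The standard allows 0 here, meaning "not computable".
unsigned reported_logical_cpus() {
    return std::thread::hardware_concurrency();
}
```

With the numbers from the answer, logical_cpus(1, 2, 2) gives the i3's 4 effective CPUs and logical_cpus(1, 18, 2) gives 36; a two-socket board of the same processor would give 72.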
So in conclusion:
(1) Disable all optimization
(2) Check on individual core CPU usage to see what the load is on all your effective CPUs
(3) Restructure your code to farm out tasks that will be distributed across your effective CPUs, each consuming as much processing as possible.
(4) Make sure your admin isn't limiting the amount of processing individual applications can consume.
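One more check that may help narrow this down, sketched here as an assumption rather than as part of the advice above: compare the CPU time the process consumed with the wall-clock time that elapsed. A ratio near 1 means the code kept one core busy; a ratio near 0.1 means it spent most of its time waiting (I/O, sleeping, or being throttled). The names cpu_utilization and busy_work are hypothetical. std::clock measures process CPU time on Linux, but not on every platform (MSVC's implementation reports elapsed time), so treat the number as Linux-specific.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Returns (CPU seconds used by the process) / (wall-clock seconds elapsed)
// for one call of `work`. Can exceed 1.0 if `work` uses several threads.
template <typename Work>
double cpu_utilization(Work work) {
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    work();

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    const double cpu_s =
        static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    const double wall_s =
        std::chrono::duration<double>(wall_end - wall_start).count();
    return wall_s > 0.0 ? cpu_s / wall_s : 0.0;
}

// Purely CPU-bound placeholder workload; `volatile` keeps the loop
// from being optimized away.
double busy_work(long iterations) {
    assert(iterations >= 0);
    volatile double acc = 0.0;
    for (long i = 0; i < iterations; ++i) {
        acc = acc + 1e-9 * static_cast<double>(i);
    }
    return acc;
}
```

Wrapping the real application's main loop in cpu_utilization on both machines would show whether the server run is actually compute-bound or mostly idle.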

Run OpenCL without compatible hardware?

I have two PCs:
a new high-end desktop PC, OpenCL compatible CPU and GPU, 32GB RAM
a very old laptop, Intel Celeron CPU, 512MB RAM, Ati M200 GPU
I am writing OpenCL/C++ software on my desktop PC. When I travel, I continue the work on my old laptop. Programming C++ on this laptop works fine, but I can't try the OpenCL parts of my code there. So at those times I am writing OpenCL code without knowing whether it works.
Is there a way to virtualize an OpenCL-compatible CPU/GPU? I don't need high performance; I just want to try my code, and it doesn't matter if it is very slow (even slower than running it single-threaded on my Celeron CPU).
I guess the answer is no.
(BTW, my plan is to have an option in my program so that you can run it with or without OpenCL. This is also needed to measure performance and to compare OpenCL on CPU/GPU against single-threaded CPU code without OpenCL.)
almost an answer, but not completely what I am looking for: http://www.acooke.org/cute/Developing0.html
For all existing OpenCL implementations, you need some form of SSE.
A website gathering all this info is here.
The lowest requirements are provided by the AMD OpenCL drivers, which require SSE3. As the list shows, that goes all the way back to late Pentium 4's.
In order to be sure about your CPU's capabilities, you'll need to use something like CPU-Z which can show the capabilities of your processor.
All that aside, I searched for laptops with your GPU, and ended up with processors like the Intel Celeron M 420, which according to Intel doesn't even have 64-bit support (which would imply SSE2).
I currently know of no other OpenCL implementations that are worth anything, so the answer would be no.
On the other hand, some websites claim that processor has SSE3 support, so that would mean AMD's OpenCL SDK is your option of choice.
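Besides CPU-Z, the SSE level can also be checked from code. The sketch below assumes GCC or Clang on an x86 target, where __builtin_cpu_supports is available as a compiler built-in; on other compilers or architectures it simply reports false. The function names are made up for illustration.

```cpp
#include <cassert>

// True if the running CPU reports SSE2 (GCC/Clang on x86 only).
bool cpu_supports_sse2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;  // unknown compiler/architecture: report "not detected"
#endif
}

// True if the running CPU reports SSE3 (GCC/Clang on x86 only).
bool cpu_supports_sse3() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse3") != 0;
#else
    return false;
#endif
}

// True if the CPU meets the SSE3 requirement quoted above.
bool meets_sse3_requirement() {
    const bool sse3 = cpu_supports_sse3();
    // Any SSE3-capable x86 CPU also implements SSE2; a mismatch would
    // mean the detection itself is unreliable.
    assert(!sse3 || cpu_supports_sse2());
    return sse3;
}
```

Running this on the laptop would settle the conflicting claims about the Celeron without relying on third-party websites.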

Is there a simulator/emulator of Xeon Phi?

I am going to offload some computation to Xeon Phi, but I would like to test different APIs and different approaches to parallel programming first.
Is there a simulator / emulator for Xeon Phi (either Windows or Linux) ?
In the event that future internet users see this question and wonder about Knights Landing simulation, the Intel SDE (https://software.intel.com/en-us/articles/intel-software-development-emulator) emulates AVX-512.
For the uninitiated, Knights Landing is the official code name for the next generation of the Intel Xeon Phi processor. It is no more correct to assume that Xeon Phi means Knights Corner than to assume that Xeon means Haswell. It's just that there has only been one iteration of Xeon Phi to date.
The Suitability feature in Intel(R) Advisor XE 2015 Beta (you can enroll for free here) could address your request. Suitability Beta can specifically:
evaluate whether Intel® Xeon Phi™ performance (native, or with limited support for offload) can exceed peak CPU performance for a given workload
evaluate the impact of imbalance, run-time overhead and other performance losses depending on the parallel API, number of threads and loop iteration count/granularity used
All of these evaluations can be done on an arbitrary x86 machine (Windows or Linux). So it is a sort of "emulation". However, it is a software-based modeling tool, not a traditional hardware simulator or emulator.
Note: the Xeon Phi-specific functionality is currently only available as an "experimental" feature. That means that at the moment (as of April 2014) it is still Beta quality and is not enabled by default. You have to set the experimental variable ADVIXE_EXPERIMENTAL=suitability_xeon_phi_modeling to enable it. Advisor Beta experimental features usually become more mature later in the year (either in a Beta Update or in later releases).
This note does not apply to the other parts of the Suitability feature, which are not Xeon Phi-specific.
Here is a screenshot of the GUI of this Beta experimental feature (the bold red annotations are mine):

OpenCL Verification of Parallel Execution

What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims it is supposedly doing. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to be executed in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL 1.2 API specification, along with a simple OpenCL C kernel which squares the input integer.
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that they will fit on the compute units and so that OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, hopefully OpenCL
Hopefully this is enough to force the schedulers to execute the maximum number of kernels and work items as efficiently as possible, making use of the available cores / processors.
For a CPU, you can use cpuid, sched_getcpu, or GetCurrentProcessorNumber to check which core / processor the current thread is executing on.
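For reference, the CPU-side check can be sketched as follows. This is only an illustration for host threads, not for OpenCL work items: it assumes Linux with glibc for sched_getcpu and uses an OpenMP loop as a stand-in for parallel work (without OpenMP enabled, the loop runs serially on one CPU). The names current_cpu, cpus_used and distinct_cpus are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>
#if defined(__linux__)
#include <sched.h>  // sched_getcpu (glibc extension)
#endif

// Logical CPU the calling thread is running on right now, or -1 if unknown.
int current_cpu() {
#if defined(__linux__)
    return sched_getcpu();
#else
    return -1;  // on Windows, GetCurrentProcessorNumber() plays this role
#endif
}

// Runs `n` chunks of busy work and records which CPU each chunk ended on.
std::vector<int> cpus_used(int n) {
    assert(n >= 0);
    std::vector<int> cpu_of_chunk(static_cast<std::size_t>(n), -1);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        volatile double sink = 0.0;  // keeps the busy loop from being removed
        for (int k = 0; k < 200000; ++k) sink = sink + k * 1e-9;
        cpu_of_chunk[i] = current_cpu();
    }
    return cpu_of_chunk;
}

// Number of different CPUs that appear in the record.
std::size_t distinct_cpus(const std::vector<int>& cpu_of_chunk) {
    return std::set<int>(cpu_of_chunk.begin(), cpu_of_chunk.end()).size();
}
```

If distinct_cpus(cpus_used(64)) is greater than 1, the chunks demonstrably ran on more than one core; a result of 1 is inconclusive, since the scheduler may simply have kept everything on one CPU.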
Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C language built in function... or perhaps do the vendor's compilers understand some form of assembly language which I could use to obtain this information?
Is there an equivalent to cpuid, sched_getcpu, or GetCurrentProcessorNumber for GPUs, for core usage monitoring etc.? Perhaps something vendor or architecture specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, both of which are not useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?
Hardware Details:
GPU: AMD FirePro, using the AMD Cape Verde architecture, 7700M Series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (as there are manuals for x86), that would possibly be a start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Win 7 64-bit; it will eventually also need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Manager, Intel's OpenCL compiler is a clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out the program's buffer copies and visualisations and leave only the kernel executions intact. Then put them in a tight loop and watch the temperature rise. If it heats up like FurMark, it is using the cores. If it does not heat up, you can also disable the serial operations in the kernels (gid==0) and try again. For example, a simple n-body simulator pushes a well-cooled HD7000 series GPU to over 70°C in minutes, and to 90°C with a poor cooler. Compare it to a known benchmark's temperature limits.
The same applies to the CPU. Using float4 heats it more than plain floats, which shows that even the instruction type matters for using all ALUs (let alone threads).
If the GPU has a really good cooler, you can watch its Vdroop instead. More load means more voltage drop: more active cores give more drop, and more load per core also gives more drop.
Whatever you do, it is up to the abilities of the compiler and the hardware, and you don't have explicit control over the ALUs, because OpenCL hides the hardware complexity from the developer.
Using MSI Afterburner or similar software is not useful, because it shows 100% usage even when you use 1% of the card's true potential.
Simply look at the temperature difference of the computer case at equilibrium compared to its starting state. If delta-T is around 50 with OpenCL and 5 without, OpenCL is parallelising the work, although you can't know by how much.