Is there a simulator/emulator of Xeon Phi? - c++

I am going to offload some computation to a Xeon Phi, but I would like to test different APIs and different approaches to parallel programming first.
Is there a simulator/emulator for Xeon Phi (either Windows or Linux)?

In the event that future internet users see this question and wonder about Knights Landing simulation, the Intel SDE (https://software.intel.com/en-us/articles/intel-software-development-emulator) emulates AVX-512.
For the uninitiated, Knights Landing is the official code name for the next generation of Intel Xeon Phi processors. Assuming that Xeon Phi means Knights Corner is no more correct than assuming that Xeon means Haswell; it's just that there has only been one iteration of Xeon Phi to date.
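For example, you could compile a small AVX-512 loop and run the binary under SDE on an ordinary host. A minimal sketch, assuming a compiler that accepts -mavx512f and that the program is launched through SDE rather than natively:

    #include <immintrin.h>
    #include <cstdio>

    int main() {
        alignas(64) float a[16], b[16], c[16];
        for (int i = 0; i < 16; ++i) { a[i] = float(i); b[i] = 2.0f; }

        // One 512-bit vector add: 16 packed floats per instruction.
        __m512 va = _mm512_load_ps(a);
        __m512 vb = _mm512_load_ps(b);
        _mm512_store_ps(c, _mm512_add_ps(va, vb));

        for (int i = 0; i < 16; ++i) printf("%.1f ", c[i]);
        printf("\n");
    }

Run natively on a host without AVX-512, this would die with an illegal-instruction fault; under SDE the AVX-512 instructions are emulated, so the logic can be tested before real hardware is available.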

The Suitability feature in Intel(R) Advisor XE 2015 Beta (free enrollment is available here) can be used to address your request. Suitability Beta is specifically able to:
evaluate whether Intel® Xeon Phi™ performance levels (native, or limited support for offload) can exceed CPU performance peaks for a given workload
evaluate the impact of imbalance, run-time overhead, and other performance losses depending on the parallel API, the number of threads, and the loop iteration count/granularity being used
All of these evaluations can be done on an arbitrary x86 machine (Windows and Linux are supported), so it really is a sort of "emulation". However, it is a software-based modeling tool, not a traditional hardware simulator or emulator.
Note: the Xeon Phi-specific functionality is currently only available as an "experimental" feature, which means that at the moment (as of April 2014) it is still Beta quality and disabled by default. You will have to set the environment variable ADVIXE_EXPERIMENTAL=suitability_xeon_phi_modeling to enable it. Advisor Beta experimental features usually become more polished and mature later in the year (either in a Beta Update or in later releases).
This note does not apply to the other parts of the Suitability feature, which are not Xeon Phi-specific.
Here is a screenshot of the Beta experimental feature's GUI look & feel (the bold red annotation is my addition):

Related

Does the Intel Xeon Phi co-processor support graphics processing at the hardware level?

I am going to do some rendering experiments on a large-scale computer system with a massive number of processors. This system uses Intel Xeon E5 processors and Intel Xeon Phi co-processors. I've read the documentation and the developer guide for the Xeon Phi co-processor, but none of them mention details about OpenGL or DirectX.
I'm not familiar with the Xeon Phi co-processor and I want to know whether it supports OpenGL or DirectX for graphics processing at the hardware level.
Technically, OpenGL depends on nothing. Pure software implementations of OpenGL are perfectly valid and do exist, for example the Mesa softpipe implementation; you could try to optimize it for the Xeon Phi, though I doubt you'd beat even the most humble low-cost entry-level GPU with it.
Of course, most of the time you want OpenGL to be accelerated by a dedicated GPU. A Xeon Phi-optimized OpenGL implementation is certainly feasible, though to my knowledge none exists. When Intel was pushing their Larrabee architecture, it was meant as a new approach to realtime graphics, and an OpenGL implementation for Larrabee would have been part of it. But Larrabee never saw the light of day; it remained an Intel research project.

Run OpenCL without compatible hardware?

I have two PCs:
a new high-end desktop PC with an OpenCL-compatible CPU and GPU and 32GB RAM
a very old laptop with an Intel Celeron CPU, 512MB RAM, and an ATI M200 GPU
I am writing OpenCL/C++ software on my desktop PC, but when I travel I continue the work on my old laptop. Programming C++ on the laptop is fine, but I can't try the OpenCL parts of my code, so right now I am writing OpenCL code without knowing whether it is correct.
Is there a way to virtualize an OpenCL-compatible CPU/GPU? I don't need high performance; I just want to try my code, and it doesn't matter if it is very slow (even slower than running it single-threaded on my Celeron CPU).
I guess the answer is no.
(By the way, my plan is to have an option in my program so it can run with or without OpenCL. This is also needed to measure performance and compare OpenCL on the CPU/GPU against the CPU in single-threaded mode without OpenCL.)
almost an answer, but not completely what I am looking for: http://www.acooke.org/cute/Developing0.html
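For that planned with/without-OpenCL switch, one minimal sketch is to probe for a platform at runtime and fall back to the plain C++ path. This assumes an OpenCL ICD loader (OpenCL.dll / libOpenCL.so) is present so the call can at least be made; the function and structure here are only illustrative:

    #include <CL/cl.h>
    #include <iostream>

    // Returns true if at least one OpenCL platform is visible at runtime.
    bool opencl_available() {
        cl_uint num_platforms = 0;
        cl_int err = clGetPlatformIDs(0, nullptr, &num_platforms);
        return err == CL_SUCCESS && num_platforms > 0;
    }

    int main() {
        if (opencl_available()) {
            std::cout << "Running the OpenCL path\n";
            // ... create context, build kernels, enqueue work ...
        } else {
            std::cout << "No OpenCL platform found, using the CPU fallback\n";
            // ... plain single-threaded C++ implementation ...
        }
    }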
For all existing OpenCL implementations, you need some form of SSE.
A website gathering all this info is here.
The lowest requirements are those of the AMD OpenCL drivers, which need SSE3. As the list shows, that goes all the way back to late Pentium 4s.
In order to be sure about your CPU's capabilities, you'll need to use something like CPU-Z which can show the capabilities of your processor.
All that aside, I searched for laptops with your GPU, and ended up with processors like the Intel Celeron M 420, which according to Intel doesn't even have 64-bit support (which would imply SSE2).
I currently know of no other OpenCL implementations that are worth anything, so the answer would be no.
On the other hand, some websites claim that processor has SSE3 support, so that would mean AMD's OpenCL SDK is your option of choice.
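If you would rather check from code than with CPU-Z, GCC and Clang expose a built-in for this. A minimal sketch, assuming GCC 4.8+ or a compatible Clang:

    #include <iostream>

    int main() {
        // __builtin_cpu_init/__builtin_cpu_supports query CPUID at run time.
        __builtin_cpu_init();
        if (__builtin_cpu_supports("sse3"))
            std::cout << "SSE3 present: AMD's CPU OpenCL runtime should work here\n";
        else
            std::cout << "No SSE3: the AMD CPU runtime will not run on this machine\n";
    }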

How to offload particular threads of a single app to particular Xeon Phi cores?

Suppose I have a single C/C++ app running on the host. There are a few threads running on the host CPU and 50 threads running on the Xeon Phi cores.
How can I make sure that each of these 50 threads runs on its own Xeon Phi core and is never evicted from that core's cache (given that the code is small enough)?
Could someone please outline a very general idea of how to do this and which tool/API would be most suitable (for C/C++ code)?
What is the fastest way to exchange data between the host thread-aggregator and the 50 Phi threads?
Given that the actual parallelism will be very limited, this application is going to be more like a 51-thread plain application with some basic multithreaded data synchronization.
Can I use conventional C/C++ compiler to create the app like this?
You have raised several questions:
Yes, you can take a conventional C/C++ program and compile it with the regular Intel C/C++/Fortran compilers (known as Intel Composer XE) to generate a binary that runs on the Intel Xeon Phi co-processor in "native"/"symmetric" or "offload" mode. In the simplest case you just recompile your C/C++ program with -mmic and run it "natively" on the Phi as is.
Which API to use? Use the OpenMP 4.0 standard or the Intel Cilk Plus programming model (in practice a set of pragmas or keywords applicable to C/C++). OpenCL, Intel TBB, and likely OpenACC are also possible, but OpenMP and Cilk Plus can express threading, vectorization, and offload (the three things essential for Xeon Phi programming) without refactoring or rewriting a "conventional" C/C++/Fortran program.
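As a rough illustration of the OpenMP 4.0 route (a sketch only; the exact map clauses and the fallback behaviour when no coprocessor is present depend on the compiler):

    #include <cstdio>

    int main() {
        const int n = 1000;
        float a[n], b[n];
        for (int i = 0; i < n; ++i) a[i] = float(i);

        // Offload the loop to the default device (e.g. a Xeon Phi) and run it
        // with a team of threads there; data movement is described by map().
        #pragma omp target map(to: a[0:n]) map(from: b[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            b[i] = a[i] * a[i];

        printf("b[10] = %f\n", b[10]);
    }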
Thread pinning: this can be achieved via OpenMP affinity (see more details on MIC_KMP_AFFINITY below) or via Intel TBB's affinity support.
The fastest way to exchange data between the host and the target Phi is to avoid any exchange at all, for example by using the MPI symmetric approach. However, you seem to be asking about the "offload" programming model specifically; there, asynchronous offload gives the best achievable performance, while synchronous offload is simpler to program but slower.
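A sketch of the asynchronous pattern using Intel's offload pragmas (Language Extensions for Offload); the exact clause spelling should be checked against the Composer XE documentation, and the signal/wait tag here is purely illustrative:

    #include <cstdio>

    // Mark the function so the compiler also builds a coprocessor version.
    __attribute__((target(mic))) void square(const float *a, float *b, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) b[i] = a[i] * a[i];
    }

    int main() {
        const int n = 1 << 20;
        static float a[n], b[n];
        for (int i = 0; i < n; ++i) a[i] = float(i);

        char tag;  // only its address is used, as a handle for signal/wait

        // Start the offload and return to the host thread immediately.
        #pragma offload target(mic:0) signal(&tag) in(a:length(n)) out(b:length(n))
        square(a, b, n);

        // ... do useful host-side work here while the coprocessor computes ...

        // Block until the offload tagged with &tag has finished.
        #pragma offload_wait target(mic:0) wait(&tag)

        printf("b[10] = %f\n", b[10]);
    }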
Overall, you are asking several general questions, so I would recommend starting from the very beginning, i.e. the following ~10-page Dr. Dobb's manual or the Intel intro document.
Thread pinning is a more advanced topic, and at the same time it seems to be the most interesting one for you, so I will explain it in more detail:
If your code is parallelized using the OpenMP 4.0 standard, then you can achieve the desired behavior using MIC_KMP_AFFINITY / MIC_KMP_PLACE_THREADS for the Xeon Phi and KMP_AFFINITY / KMP_PLACE_THREADS for the host CPU.
Quite likely you're looking for this specific setting: MIC_KMP_PLACE_THREADS=50c,1t
I've seen people mention PHI_KMP_AFFINITY instead of MIC_KMP_AFFINITY. I believe they are aliases, but I haven't tried it myself.
Using 50 threads on a Xeon Phi is usually not the best idea; it's better to try around 120 threads or so.
More details about affinity on Xeon Phi are explained in these 3 articles:
http://www.prace-project.eu/Best-Practice-Guide-Intel-Xeon-Phi-HTML#id-1.6.2.3
and
https://software.intel.com/en-us/articles/best-known-methods-for-using-openmp-on-intel-many-integrated-core-intel-mic-architecture
and
https://software.intel.com/en-us/articles/openmp-thread-affinity-control

OpenCL Verification of Parallel Execution

What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or is there at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to be executed in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL 1.2 API specification, along with a simple OpenCL C kernel which simply squares the input integer.
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that they will fit on the compute units and so that OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, this should hopefully be enough to force the scheduler to execute the maximum number of kernels and work items as efficiently as possible, making use of all available cores / processors.
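Roughly, the sizing described above looks like the following sketch (pick_global_size and multiple are just illustrative names):

    #include <CL/cl.h>

    // Choose a global size that is a multiple of
    // MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE for the given device.
    size_t pick_global_size(cl_device_id dev, size_t multiple) {
        cl_uint compute_units = 0;
        size_t max_wg_size = 0;
        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, nullptr);
        clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(max_wg_size), &max_wg_size, nullptr);
        return multiple * compute_units * max_wg_size;
    }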
For a CPU, you can check cpuid, or sched_getcpu, or GetProcessorNumber in order to check which core / processor the current thread is currently executing on.
Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C language built in function... or perhaps do the vendor's compilers understand some form of assembly language which I could use to obtain this information?
Is there an equivalent to cpuid, sched_getcpu, or GetProcessorNumber for GPUs, for core usage monitoring, etc.? Perhaps something vendor- or architecture-specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, both of which are not useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?
Hardware Details:
GPU: AMD FirePro, AMD Cape Verde architecture, 7700M series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (as there are for x86), that would be a possible start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Windows 7 64-bit; it will also eventually need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Manager, Intel's OpenCL compiler is a clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out a program's buffer copies and visualisations and leave only the kernel executions intact. Then put it in a tight loop and watch the heat rise. If it heats up like FurMark, then it is using the cores. If it does not heat up, you can also disable the serial operations in the kernels (the gid==0 parts) and try again. For example, a simple n-body simulator pushes a well-cooled HD 7000 series GPU to over 70°C in minutes, and to 90°C with a poor cooler. Compare it against a known benchmark's temperature levels.
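The tight loop itself can be wrapped up as a small helper, sketched here (queue and kernel are assumed to have been created in the usual way):

    #include <CL/cl.h>

    // Kernel-only stress loop: no buffer reads/writes, no visualisation,
    // so nearly all of the time is spent executing the kernel on the device.
    void stress_loop(cl_command_queue queue, cl_kernel kernel,
                     size_t global_size, size_t local_size, int iterations) {
        for (int i = 0; i < iterations; ++i) {
            clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
                                   &global_size, &local_size, 0, nullptr, nullptr);
        }
        clFinish(queue);  // wait for completion, then check temperature / Vdroop
    }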
A similar thing exists for the CPU. Using float4 generates more heat than plain floats, which shows that even the instruction type matters for keeping all ALUs busy (let alone all threads).
If the GPU has a really good cooler, you can watch its Vdroop instead. More load means more voltage drop: more cores means more drop, and more load per core also means more drop.
Whatever you do, it is up to the compiler's and the hardware's abilities, and you have no explicit control over the ALUs, because OpenCL hides the hardware complexity from the developer.
Using MSI Afterburner or similar software is not useful, because it shows 100% usage even when you are using only 1% of the card's true potential.
Alternatively, simply look at the temperature difference of the computer case between its starting state and its equilibrium state. If the delta-T is around 50 degrees with OpenCL and 5 without, OpenCL is parallelising the work, though you can't know by how much.

AMD multi-core programming

I want to start writing applications (C++) that will utilize the additional cores to execute portions of the code that need to perform lots of calculations and whose computations are independent of each other. I have the following processor: x64 Family 15 Model 104 Stepping 2, AuthenticAMD, ~1900 MHz, running Windows Vista Home Premium 32-bit and openSUSE 11.0 64-bit.
On Intel platforms I've used the Intel TBB and OpenMP APIs. Do they work on AMD, and does AMD have similar APIs? What has been your experience?
OpenMP and TBB are both available for AMD as well - it is also a compiler question.
E.g. see linux TBB on AMD.
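For independent per-element calculations like the ones you describe, a minimal OpenMP sketch (compile with -fopenmp on GCC or /openmp on MSVC; the loop body is just a placeholder):

    #include <omp.h>
    #include <vector>
    #include <cstdio>

    int main() {
        const int n = 1000000;
        std::vector<double> result(n);

        // Each iteration is independent, so OpenMP can spread them across
        // all cores; this works the same way on AMD and Intel CPUs.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            result[i] = static_cast<double>(i) * i;

        printf("max threads available: %d\n", omp_get_max_threads());
        return 0;
    }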
I think the latest development on this front is to use the graphics card via CUDA or similar APIs - but this depends on the nature of your calculations. If it fits, it is faster than the CPU anyway.