How to bind my job to an Intel Xeon Phi coprocessor?

I have a server with four MIC cards (mic0-mic3), and it works well. How can I bind a parallel job (mic_app) to mic0 so that no other parallel job can run on mic0? And how can I detect that mic0 is already running a parallel job (mic_app)?

The OFFLOAD_DEVICES environment variable restricts the offload process to use only the coprocessors specified as the value of the variable.
Set this variable to a comma-separated list of target device numbers in the range 0 to (number_of_devices_in_the_system - 1), where 0 is the first coprocessor in the system and (number_of_devices_in_the_system - 1) is the last coprocessor in the system.
Coprocessors available for offloading are numbered logically. The function _Offload_number_of_devices() returns the number of available coprocessors. Coprocessor indices that you use in the target specifier of the offload pragmas are in the range 0 to number_of_devices_in_the_system-1.
Default: The offload process uses all devices.
Example: OFFLOAD_DEVICES = 1,2
On a system with more than two coprocessors installed, this setting enables the application to use only coprocessors 1 and 2. Offloads to logical targets 0 or 1 are performed on the second and third physical coprocessors. Offloads to target numbers higher than 1 wrap around so that all offloads remain within the two allowed coprocessors. The function _Offload_get_device_number(), executed on a coprocessor, returns 0 or 1 when the offload is running on the first or second allowed coprocessor.
For the full list, see the "Supported Environment Variables" section of the Intel compiler documentation.
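So, to reserve mic0 for one job and keep other offload jobs off it, you can partition the devices between launches with this variable. A minimal sketch, using the mic_app name from the question (other_app stands in for any second job):

# give mic0 (logical device 0) exclusively to mic_app
OFFLOAD_DEVICES=0 ./mic_app &

# confine any other offload job to the remaining cards
OFFLOAD_DEVICES=1,2,3 ./other_app &

Note that this only restricts each process's own offloads; it does not by itself tell you whether mic0 is already busy, so the partitioning has to be enforced by whatever launches the jobs.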

How to write a value to a port in DIO module AUTOSAR?

I am working on an AUTOSAR project on an STM32 NUCLEO-F767ZI board, and I have to write a value to a port in the DIO module. I know that there is a function called HAL_GPIO_WritePin(), but how can I write a value to an entire port?
You can do that by writing the value for each channel in that port.
The ports usually have 16 channels, so the value you want to write is a 16-bit number made of 0s and 1s (LOW and HIGH). For each bit in that number, you call HAL_GPIO_WritePin() and pass GPIO_PIN_RESET for 0 or GPIO_PIN_SET for 1 to write the value to the corresponding channel.
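A minimal sketch of that loop for the F767ZI, assuming an STM32 HAL project (the function name Dio_WritePort is only illustrative, echoing the AUTOSAR DIO API):

#include "stm32f7xx_hal.h"

/* Write a 16-bit value to an entire port, one channel at a time. */
void Dio_WritePort(GPIO_TypeDef *port, uint16_t value)
{
    for (uint16_t pin = 0u; pin < 16u; ++pin) {
        /* pick SET or RESET depending on the corresponding bit of value */
        GPIO_PinState state =
            (value & (1u << pin)) ? GPIO_PIN_SET : GPIO_PIN_RESET;
        HAL_GPIO_WritePin(port, (uint16_t)(1u << pin), state);
    }
}

On STM32 you could also write the whole port in a single store through the output data register (port->ODR = value;), which avoids the loop entirely, but the per-pin loop above matches the channel-by-channel approach described here.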

Strange behaviour of Parallel Boost Graph Library example code

I have set up simple tests with the Parallel Boost Graph Library (PBGL), which I have never used before, and observed entirely unexpected behaviour that I would like to have explained.
My steps were as follows:
Dump test data in METIS format (a kind of social graph with 50 million vertices and 100 million edges);
Build a modified PBGL example from graph_parallel\example\dijkstra_shortest_paths.cpp
The example was slightly extended to run the eager, Crauser, and delta-stepping algorithms.
Note: building the example required an obscure workaround around the MUTABLE_QUEUE define in crauser_et_al_shortest_paths.hpp (the example code is in fact incompatible with the new mutable_queue)
int lookahead = 1;
delta_stepping_shortest_paths(g, start, dummy_property_map(), get(vertex_distance, g), get(edge_weight, g), lookahead);
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)).lookahead(lookahead));
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)));
Run
mpiexec -n 1 mytest.exe mydata.me
mpiexec -n 2 mytest.exe mydata.me
mpiexec -n 4 mytest.exe mydata.me
mpiexec -n 8 mytest.exe mydata.me
The observed behaviour:
-n 1:
mem usage: 35 GB in 1 running process, which utilizes exactly 1 hardware thread (processor load 12.5%)
delta-stepping time: about 1 min 20 s
eager time: about 2 min
crauser time: about 3 min 20 s.
-n 2:
crash at the data-loading stage.
-n 4:
mem usage: 40+ GB in roughly equal parts across 4 running processes, each of which utilizes exactly 1 hardware thread
calculation times are unchanged, within the margin of observation error.
-n 8:
mem usage: 44+ GB in roughly equal parts across 8 running processes, each of which utilizes exactly 1 hardware thread
calculation times are unchanged, within the margin of observation error.
So, apart from the excessive memory usage and the very low overall performance, the only changes I observe when more MPI processes run are a slightly increased total memory consumption and a linear rise in processor load.
It is nevertheless evident that the initial graph is somehow partitioned between the processes (probably by vertex number ranges).
What is wrong with this test (and, probably, with my idea of MPI usage as a whole)?
My environment:
- one Win 10 PC with 64 GB RAM and 8 cores;
- MS MPI 10.0.12498.5;
- MSVC 2017, toolset v141;
- Boost 1.71
N.B. See original example code here.

gnu parallel and resource management

I would like to use the GNU parallel command line to act as a simple scheduling mechanism.
In my case, I have N GPUs on a system, and I would like to effectively queue a list of jobs onto those GPUs.
Basically, I have a list of inputs, and I would naively run
parallel --jobs=4 ./my_script.sh ::: $(cat list_of_things.txt) ::: 0 1 2 3
where ./my_script.sh accepts two args: the thing I want to process, and the GPU I want to process it on.
What I want is for each thing in the list to run on just one of the GPUs (0 through 3).
However, this ends up just running each thing 4 times.
Try this:
parallel --jobs=4 ./my_script.sh {%} {} :::: list_of_things.txt
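Here {%} expands to GNU parallel's job-slot number (1 through 4 with --jobs=4), {} is the current input line, and :::: reads the arguments from the file rather than crossing two literal lists. Since the slot number is 1-based, my_script.sh can map it to a 0-based GPU index itself. A minimal sketch (using CUDA_VISIBLE_DEVICES to pin the GPU is an assumption about your tooling, and ./process_one is a stand-in for the real work):

#!/bin/bash
# my_script.sh: $1 = job slot (1-4), $2 = item to process
gpu=$(( $1 - 1 ))   # convert the 1-based slot to a 0-based GPU index
CUDA_VISIBLE_DEVICES=$gpu ./process_one "$2"

Because at most 4 jobs run at once and each slot maps to one GPU, every item is processed exactly once, on exactly one of the GPUs 0 through 3.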

changing thread number doesn't affect code

I am trying to learn Xeon Phi, and while studying the Intel Xeon Phi Coprocessor HPC book, I tried to run the code here (from the book).
The code uses OpenMP and 2 threads.
But the results I am getting are the same as when running with 1 thread
(as if OpenMP were not used at all).
I even tried different combinations on the MIC, but still the same:
export OMP_NUM_THREADS=2
export MIC_OMP_NUM_THREADS=124
export MIC_ENV_PREFIX=MIC
It seems that somehow OpenMP is not enabled? Am I missing something here?
The code using only 1 thread is here
I compiled using:
icc -mmic -openmp -qopt-report -O3 hello.c
Thanks!
I am not sure exactly which book you are talking about, but perhaps this will help.
The code you show does not use the offload programming style and must be run natively on the coprocessor, meaning you copy the executable to the coprocessor and run it there, or you use the micnativeloadex utility to run the code from the host processor. You show that you know the code must be run natively, because you compiled it with the -mmic option.
If you use micnativeloadex, then the number of omp threads on the coprocessor is set by executing "export MIC_OMP_NUM_THREADS=124" on the host. If you copy the executable to the coprocessor and then log in to run it there, the number of omp threads on the coprocessor is set by executing "export OMP_NUM_THREADS=124" on the coprocessor. If you use "export OMP_NUM_THREADS=2" on the coprocessor, you get only two threads; the MIC_OMP_NUM_THREADS environment variable is not used if you set it directly on the coprocessor.
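For example (a sketch, assuming the executable built from hello.c was named hello, e.g. with -o hello):

# from the host, via micnativeloadex:
export MIC_OMP_NUM_THREADS=124
micnativeloadex ./hello

# or copy the binary to the coprocessor and run it natively:
ssh mic0
export OMP_NUM_THREADS=124
./hello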
I don't see any place in the code where it prints out the number of threads, so I don't know for sure how you determined the number of threads actually being used. I suspect you were using a tool like micsmc. However, micsmc tells you how many cores are in use, not how many threads are in use.
By default, the omp threads are laid out in order, so that the first core would run threads 0,1,2,3, the second core would run threads 4,5,6,7 and so on. If you are using only two threads, both threads would run on the first core.
So, is that what you are seeing - not that you are using only one thread but instead that you are using only one core?
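One way to settle it is to have the program report its own team size. A minimal sketch, compiled the same way as hello.c:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* exactly one thread of the team prints the team size */
        #pragma omp single
        printf("OpenMP is using %d threads\n", omp_get_num_threads());
    }
    return 0;
}

If this prints 2 while micsmc shows one busy core, the two threads are simply sharing a core, as described above.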
I was looking at the serial version of the code you are using. For the following lines:
for (j = 0; j < MAXFLOPS_ITERS; j++)
{
    //
    // scale 1st array and add in the 2nd array
    // example usage - y = mx + b;
    //
    for (k = 0; k < LOOP_COUNT; k++)
    {
        fa[k] = a * fa[k] + fb[k];
    }
}
I see that you do not scan the complete array here. Instead, you keep updating the first 128 (LOOP_COUNT) elements of the array fa. If you wish to compare this serial version to the parallel code you are referring to, then you will have to ensure that the program does the same amount of work in both versions.
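For instance, to make the serial version sweep the whole array rather than revisiting the first LOOP_COUNT elements, the inner bound would have to be the full array length (ARRAY_SIZE is a hypothetical name for that length, not one from the book's code):

for (j = 0; j < MAXFLOPS_ITERS; j++)
{
    for (k = 0; k < ARRAY_SIZE; k++)   /* full array, not just LOOP_COUNT */
    {
        fa[k] = a * fa[k] + fb[k];
    }
}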
Thanks
I noticed three things in your first OpenMP program:
The total floating-point operations should reflect the number of threads doing the work. Therefore,
gflops = (double)(1.0e-9 * LOOP_COUNT * MAXFLOPS_ITERS * FLOPSPERCALC * numthreads);
You hard-coded the number of threads to 2. If you want to use the OMP environment variable, you should comment out the call omp_set_num_threads(2);
After transferring the binary to the coprocessor, set the OMP environment variable on the coprocessor with OMP_NUM_THREADS, not MIC_OMP_NUM_THREADS. For example, if you want 64 threads to run your program on the coprocessor:
% ssh mic0
% export OMP_NUM_THREADS=64

Can a process send data to itself? Using MPICH2

I have an upper triangular matrix and the result vector b.
My program needs to solve the linear system
Ax = b
using the pipeline method.
One of the constraints is that the number of processes is smaller than the number of
equations (let's say it can be from 2 to numberOfEquations - 1).
I don't have the code right now; I'm thinking about the pseudocode.
My idea was that one of the processes will create the random upper triangular matrix (A)
and the vector b.
Let's say this is the random matrix:
1 2 3 4 5 6
0 1 7 8 9 10
0 0 1 12 13 14
0 0 0 1 16 17
0 0 0 0 1 18
0 0 0 0 0 1
and the vector b is [10 5 8 9 10 5],
and I have a smaller number of processes than equations (let's say 2 processes).
So what I thought is that some process will send each process a line from the matrix and the relevant number from vector b.
So the last line of the matrix and the last number in vector b will be sent to
process[numProcs-1] (here I mean the last process, process 1),
which then computes x and sends the result to process 0.
Now process 0 needs to compute the 5th line of the matrix, and here I'm stuck:
I have the x that was computed by process 1, but how can a process send itself
the next line of the matrix and the relevant number from vector b that needs to be computed?
Is it possible? I don't think it's right to send to "myself"
Yes, MPI allows a process to send data to itself, but one has to be extra careful about possible deadlocks when blocking operations are used. In that case one usually pairs a non-blocking send with a blocking receive, or vice versa, or one uses calls like MPI_Sendrecv. Sending a message to self usually ends up with the message simply being memory-copied from the source buffer to the destination one, with no networking or other heavy machinery involved.
And no, communicating with self is not necessarily a bad thing. The most obvious benefit is that it makes the code more symmetric, as it removes/reduces the special logic needed to handle self-interaction. Sending to/receiving from self also happens in most collective communication calls. For example, MPI_Scatter also sends part of the data to the root process. To prevent some send-to-self cases that unnecessarily replicate data and decrease performance, MPI allows an in-place mode (MPI_IN_PLACE) for most communication-related collectives.
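A minimal sketch of a safe send-to-self, pairing a non-blocking send with a blocking receive as described above (the row contents are just placeholders):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double row[6] = {0, 0, 0, 0, 1, 18};   /* e.g. one matrix row */
    double next[6];

    /* the non-blocking send cannot deadlock against the blocking receive */
    MPI_Request req;
    MPI_Isend(row, 6, MPI_DOUBLE, rank, 0, MPI_COMM_WORLD, &req);
    MPI_Recv(next, 6, MPI_DOUBLE, rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("rank %d received its own row, last entry %g\n", rank, next[5]);

    MPI_Finalize();
    return 0;
}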
Is it possible? I don't think it's right to send to "myself"
Sure, it is possible to communicate with oneself. There is even a communicator for it: MPI_COMM_SELF. Talking to yourself is not too uncommon.
Your setup sounds like you would rather use MPI collectives. Have a look at MPI_Scatter and MPI_Gather and see whether they provide the functionality you are looking for.
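A sketch of that pattern for this problem, simplified to one entry of b per process and assumed to be launched with exactly N processes (mpiexec -n 6 here); the actual back-substitution step is left as a placeholder:

#include <stdio.h>
#include <mpi.h>

#define N 6   /* number of equations, as in the example above */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != N && rank == 0)
        fprintf(stderr, "run with mpiexec -n %d\n", N);

    double b[N] = {10, 5, 8, 9, 10, 5};
    double myb, x, xs[N];

    /* the root hands one entry of b to every process, itself included */
    MPI_Scatter(b, 1, MPI_DOUBLE, &myb, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    x = myb;   /* placeholder for the real pipeline/back-substitution step */

    /* the root collects the computed components, including its own */
    MPI_Gather(&x, 1, MPI_DOUBLE, xs, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("gathered x[0] = %g\n", xs[0]);

    MPI_Finalize();
    return 0;
}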