CPU usage drops on Linux (Ubuntu) - amazon-web-services

I was using an AWS c5.4xlarge instance, which has 16 vCPUs, and running a Python program with 10 processes. However, the CPU usage of each process gradually dropped to about 10% within just 10 seconds. The total CPU usage of the 16-vCPU instance was only about 6%.
I reduced the number of processes, but the CPU usage of each process was still quite low. Everything is fine on my own macOS machine.
What is going wrong here?

OK, I found the answer. This is about processor affinity. For Linux beginners:
https://en.wikipedia.org/wiki/Processor_affinity
In Linux you can bind a specific process to particular CPUs from the terminal:
$ taskset -cp CPU_ID PID
For example:
$ taskset -cp 0-1 1000
will allocate CPUs 0 and 1 to the process with PID 1000.
You can find the PID by using
$ top
in your terminal.
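For a quick check, taskset can also print the current affinity of every worker; a minimal sketch, assuming the workers show up as python processes in pgrep:
$ for pid in $(pgrep -f python); do taskset -cp "$pid"; done
If they all report the same narrow CPU list, you can widen it, e.g. to all 16 vCPUs:
$ for pid in $(pgrep -f python); do taskset -cp 0-15 "$pid"; done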

Related

How to use mpirun to use different CPU cores for different programs?

I have a virtual machine with 32 cores.
I am running some simulations for which I need to utilize 16 cores at one time.
I use the following command to run a job on 16 cores:
mpirun -n 16 program_name args > log.out 2>&1
This program runs on 16 cores.
Now if I want to run the same program on the rest of the cores, with different arguments, I use a similar command:
mpirun -n 8 program_name diff_args > log_1.out 2>&1
The second job ends up using the same 16 cores that the first one is already using.
How can I use mpirun to run this second job on 8 different cores, rather than the 16 the first job is using?
I am using headless Ubuntu 16.04.
Open MPI's launcher supports restricting the CPU set via the --cpu-set option. It accepts a set of logical CPUs expressed as a list of the form s0,s1,s2,..., where each list entry is either a single logical CPU number or a range of CPUs n-m.
Provided that the logical CPUs in your VM are numbered consecutively, what you have to do is:
mpirun --cpu-set 0-15 --bind-to core -n 16 program_name args > log.out 2>&1
mpirun --cpu-set 16-23 --bind-to core -n 8 program_name diff_args > log_1.out 2>&1
--bind-to core tells Open MPI to bind the processes to separate cores each while respecting the CPU set provided in the --cpu-set argument.
It might be helpful to use a tool such as lstopo (part of the hwloc library used by Open MPI) to obtain the topology of the system, which helps in choosing the right CPU numbers and, e.g., prevents binding to hyperthreads, although this is less meaningful in a virtualised environment.
(Note that lstopo uses a confusing naming convention and calls the OS logical CPUs physical, so look for the numbers in the (P#n) entries. lstopo -p hides the hwloc logical numbers and prevents confusion.)
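To double-check the placement, Open MPI can also print the bindings it actually applied; a sketch, assuming a reasonably recent Open MPI that supports --report-bindings:
mpirun --report-bindings --cpu-set 0-15 --bind-to core -n 16 program_name args
Each rank then prints a line on standard error before the program starts, showing which core it was bound to.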

SCHED_FIFO thread freezes terminal

I have a CentOS minimal installation on a hexa-core 3.5 GHz machine and I do not understand why a SCHED_FIFO realtime thread pinned to one core only freezes the terminal. How can I avoid this while keeping the realtime behaviour of the thread, without using sleep in the loop or blocking it? To simplify the problem, this thread tries to dequeue items from a non-blocking, lock-free, concurrent queue in an infinite loop.
The kernel runs on core 0; all the other cores are free. All other threads, and my process too, are SCHED_OTHER with the same priority, 20. This is the only thread where I need ultra-low latency for some high-frequency calculations. After starting the application everything seems to work, but my terminal freezes (I connect remotely through ssh). I am still able to see the threads created and force-close my app from htop. The RT thread runs at 100%, saturating its assigned core as expected. When I kill the app, the terminal is released and I can use it again.
It looks like that thread has higher priority than everything else across all cores, but I want this to apply only on the core I pinned it to.
Thank you
Hi Victor, you need to isolate the core from the Linux scheduler so that it does not try to assign lower-priority tasks, such as the one running your terminal, to a core that is running higher-priority SCHED_* jobs. In your case you can isolate core 1 by adding the kernel option isolcpus=1 to your grub.cfg (or whatever boot loader configuration you are using).
After rebooting you can confirm that core 1 was successfully isolated by running
dmesg | grep isol
and checking that your kernel was booted with the option.
Here is some more info on isolcpus:
https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html
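On CentOS this usually means editing /etc/default/grub and regenerating grub.cfg; a rough sketch, where the exact paths and the grub2-mkconfig target are assumptions for a CentOS 7-style setup:
# in /etc/default/grub, append isolcpus=1 to the existing kernel command line
GRUB_CMDLINE_LINUX="... isolcpus=1"
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
# on newer kernels the isolated set is also visible here
cat /sys/devices/system/cpu/isolated
After that, pin the RT thread to the isolated core (e.g. with taskset -cp 1 <tid>) so it is the only thing running there.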

CPU usage of a process tree using perf?

I executed the following command to get the CPU usage:
perf record -F 99 -p PID sleep 1
But I think this command gives me the CPU usage of this process only.
If it forks any new processes, their CPU usage will not be included in the perf report.
Can someone suggest how to get the CPU usage of a given PID combined with that of all its descendants?
I am also interested in the peak CPU usage over any interval. Is that also possible with perf?
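A hedged suggestion: when perf record launches the workload itself rather than attaching with -p, child processes inherit the events by default, so samples from the whole process tree land in one report (sampling at -F 99 only approximates CPU usage). A sketch, with ./my_program standing in for the real program:
perf record -F 99 -g -- ./my_program args
perf report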

Speed variation between vCPUs on the same Amazon EC2 instance

I'm exploring the feasibility of running numerical computations on Amazon EC2. I currently have one c4.8xlarge instance running. It has 36 vCPUs, each of which is a hyperthread of a Haswell Xeon chip. The instance runs Ubuntu in HVM mode.
I have a GCC-optimised binary of a completely sequential (i.e. single-threaded) program. I launched 30 copies of it with CPU pinning:
for i in `seq 0 29`; do
nohup taskset -c $i $BINARY_PATH &> $i.out &
done
The 30 processes run almost identical calculations. There's very little disk activity (a few megabytes every 5 minutes), and there's no network activity or interprocess communication whatsoever. htop reports that all processes run constantly at 100%.
The whole thing has been running for about 4 hours at this point. Six processes (12-17) have already done their task, while processes 0, 20, 24 and 29 look as if they will require another 4 hours to complete. Other processes fall somewhere in between.
My questions are:
Other than resource contention with other users, is there anything else that may be causing the significant variation in performance between the vCPUs within the same instance? As it stands, the instance would be rather unsuitable for any OpenMP or MPI jobs that synchronise between threads/ranks.
Is there anything I can do to achieve a more uniform (hopefully higher) performance across the cores? I have basically excluded hyperthreading as a culprit here since the six "fast" processes are hyperthreads on the same physical cores. Perhaps there's some NUMA-related issue?
My experience is on c3 instances. It's likely similar with c4.
For example, take a c3.2xlarge instance with 8 vCPUs (most of the explanation below is derived from direct discussion with AWS support).
In fact only the first 4 vCPUs are usable for heavy scientific calculations; the last 4 vCPUs are hyperthreads. For scientific applications it is often not useful to use hyperthreading: it causes context switching and reduces the cache (and associated bandwidth) available per thread.
To find out the exact mapping between the vCPUs and the physical cores, look into /proc/cpuinfo
"physical id" : shows the physical processor id (only one processor in c3.2xlarge)
"processor" : gives the number of vCPUs
"core id" : tells you which vCPUs map back to each Core ID.
If you put this in a table, you have:
physical_id  processor  core_id
0            0          0
0            1          1
0            2          2
0            3          3
0            4          0
0            5          1
0            6          2
0            7          3
You can also get this from thread_siblings_list, the kernel's internal map of cpuX's hardware threads within the same core as cpuX (https://www.kernel.org/doc/Documentation/cputopology.txt):
cat /sys/devices/system/cpu/cpuX/topology/thread_siblings_list
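To print that mapping for every vCPU in one go, a small sketch over the same sysfs files:
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  # prints e.g. "cpu0: core 0, siblings 0,4"
  echo "$(basename $cpu): core $(cat $cpu/topology/core_id), siblings $(cat $cpu/topology/thread_siblings_list)"
done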
When hyper-threading is enabled, each vCPU (processor) is a "logical core", and there are 2 "logical cores" associated with each "physical core".
So, in your case, one solution is to disable hyperthreading with:
echo 0 > /sys/devices/system/cpu/cpuX/online
Where X for a c3.2xlarge would be 4...7
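For example, as a loop over those sibling vCPUs (run as root; the 4-7 range matches the c3.2xlarge example above, and writing 1 brings a vCPU back online):
for X in 4 5 6 7; do
  echo 0 > /sys/devices/system/cpu/cpu$X/online
done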
EDIT: you can observe this behaviour only in HVM instances. In PV instances this topology is hidden by the hypervisor: all core ids and processor ids in /proc/cpuinfo are '0'.

why does taskset have no effect on fedora?

I am trying to run a highly multi-threaded application and want to measure its performance with different cores (0, 1, 2, 3, 4, 5, 6 ... 12). I found taskset when googling:
taskset 0x00000003 ./my_app
but when I look at Fedora's system monitor, it shows only one core at 100% and the others at 12%, 0%, etc.
Is there any way to tell the process to run on certain cores? I also heard of an option like -t <number of cores>, e.g.
./my_app -t2
for cores 0 and 1, but this also has no effect.
What am I doing wrong? Can anyone please point me in the right direction?
taskset 0x00000003 ./my_app sets the affinity of the my_app process to CPUs 0 and 1 (the mask 0x3 covers the first two logical CPUs). If your application is multithreaded, the threads inherit the affinity, but how they are distributed across those two CPUs is not fixed.
To set the affinity of each thread within your process, you can either use taskset after the process is running (i.e. run my_app, find the thread ids and call taskset -pc <core> <tid> for each one) or set the affinity at thread creation (with sched_setaffinity, or pthread_setaffinity_np if you are using pthreads, etc.).
Whatever ./my_app -t2 does is specific to your application.
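For the taskset route, a rough sketch of pinning threads after the process has started (the TIDs and core numbers below are placeholders):
# list the thread ids (TID) and the CPU each thread last ran on
ps -eLo pid,tid,psr,comm | grep my_app
# pin one thread per core, e.g. TID 12345 to core 0 and TID 12346 to core 1
taskset -pc 0 12345
taskset -pc 1 12346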