I'm exploring the feasibility of running numerical computations on Amazon EC2. I currently have one c4.8xlarge instance running. It has 36 vCPUs, each of which is a hyperthread of a Haswell Xeon chip. The instance runs Ubuntu in HVM mode.
I have a GCC-optimised binary of a completely sequential (i.e. single-threaded) program. I launched 30 instances of it, each pinned to its own vCPU, like so:
for i in `seq 0 29`; do
nohup taskset -c $i $BINARY_PATH &> $i.out &
done
The 30 processes run almost identical calculations. There's very little disk activity (a few megabytes every 5 minutes), and there's no network activity or interprocess communication whatsoever. htop reports that all processes run constantly at 100%.
The whole thing has been running for about 4 hours at this point. Six processes (12-17) have already done their task, while processes 0, 20, 24 and 29 look as if they will require another 4 hours to complete. Other processes fall somewhere in between.
My questions are:
Other than resource contention with other users, is there anything else that may be causing the significant variation in performance between the vCPUs within the same instance? As it stands, the instance would be rather unsuitable for any OpenMP or MPI jobs that synchronise between threads/ranks.
Is there anything I can do to achieve a more uniform (hopefully higher) performance across the cores? I have basically excluded hyperthreading as a culprit here since the six "fast" processes are hyperthreads on the same physical cores. Perhaps there's some NUMA-related issue?
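For what it's worth, the NUMA layout is easy to inspect directly on the instance; a quick sketch (numactl may need to be installed first, and $BINARY_PATH is the same binary as above):
# how many NUMA nodes the instance exposes, and which vCPUs belong to each
lscpu | grep -i numa
numactl --hardware
# optionally keep a process on the CPUs and memory of a single node, e.g. node 0
numactl --cpunodebind=0 --membind=0 $BINARY_PATH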
My experience is on c3 instances. It's likely similar with c4.
For example, take a c3.2xlarge instance with 8 vCPUs (most of the explanation below is derived from direct discussion with AWS support).
In fact, only the first 4 vCPUs are usable for heavy scientific calculations; the last 4 vCPUs are hyperthreads. Hyperthreading is often not useful for scientific applications: it causes extra context switching and reduces the cache (and associated bandwidth) available per thread.
To find out the exact mapping between the vCPUs and the physical cores, look into /proc/cpuinfo:
"physical id" : shows the physical processor id (only one processor in c3.2xlarge)
"processor" : gives the vCPU (logical processor) number
"core id" : tells you which physical core each vCPU maps back to.
If you put this in a table, you have:
physical_id processor core_id
0 0 0
0 1 1
0 2 2
0 3 3
0 4 0
0 5 1
0 6 2
0 7 3
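The same table can be pulled straight from /proc/cpuinfo, for example with a small awk sketch like this:
# print physical id, vCPU number and core id for every logical CPU
awk -F': ' '/^physical id/ {phys=$2}
            /^processor/   {proc=$2}
            /^core id/     {print phys, proc, $2}' /proc/cpuinfo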
You can also get this from "thread_siblings_list", the kernel's map of cpuX's hardware threads within the same core as cpuX (https://www.kernel.org/doc/Documentation/cputopology.txt):
cat /sys/devices/system/cpu/cpuX/topology/thread_siblings_list
When Hyper-threading is enabled, each vCPU (processor) is a "Logical Core", and there will be 2 "Logical Cores" associated with each "Physical Core".
So, in your case, one solution is to disable hyperthreading with:
echo 0 > /sys/devices/system/cpu/cpuX/online
where X for a c3.2xlarge would be 4 to 7.
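As a rough sketch, the same thing can be scripted so it works regardless of the instance size, by offlining every vCPU that is not the lowest-numbered sibling of its physical core (run as root):
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    n=${cpu##*/cpu}                                              # logical CPU number
    first=$(cut -d, -f1 "$cpu"/topology/thread_siblings_list | cut -d- -f1)
    [ "$n" != "$first" ] && echo 0 > "$cpu"/online               # offline the second hyperthread
done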
EDIT: you can observe this behaviour only in HVM instances. In PV instances, this topology is hidden by the hypervisor: all core ids & processor ids in /proc/cpuinfo are '0'.
I am using shared memory for transferring data between two processes, using boost::interprocess::managed_shared_memory to allocate a vector as a buffer, with atomic variables enforcing memory synchronization (similar to boost::lockfree::spsc_queue).
I was measuring the end-to-end latency for the setup with 2 processes:
sender process - writes to the buffer in shared memory, then sleeps, so it pushes data roughly once every 55 microseconds.
receiver process - runs a busy loop to see if something can be consumed from the buffer.
I am using a RingBuffer of size 4K (high, for safety), although ideally a maximum of 1 element will be present in the buffer with the current setup. Also, I am pushing data around 3 million times to get a good estimate of the end-to-end latency.
To measure the latency, I get the current time in nanoseconds and store it in a vector (resized to 3 million entries at the beginning). I have a 6-core setup with isolated CPUs, and I taskset the sender and receiver processes to different cores. I also make sure no other program of mine is running on the machine during the test. Output of /proc/cmdline:
initrd=\initramfs-linux-lts.img root=PARTUUID=cc2a533b-d26d-4995-9166-814d7f59444d rw isolcpus=0-4 intel_idle.max_cstate=0 idle=poll
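For completeness, the pinning looks roughly like this (a sketch; ./sender and ./receiver stand in for the actual binaries, and the core numbers are just examples from the isolated 0-4 range):
taskset -c 2 ./sender   &        # writer on isolated core 2
taskset -c 3 ./receiver &        # busy-polling reader on isolated core 3
taskset -cp $(pgrep -f receiver) # verify the affinity actually applied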
I have already verified that all data transfer is accurate and nothing is lost, so a simple row-wise subtraction of the timestamps is sufficient to get the latency.
I am getting a latency of around 300-400 nanoseconds as the mean and median of the distribution, but the standard deviation was too high (a few thousand nanos). Looking at the numbers, I found that there are 2-3 instances where the latency shoots up to 600000 nanos and then gradually comes down (in steps of around 56000 nanos - probably queueing is happening and consecutive pops from the buffer are successful). Attaching a sample of this "jitter" here:
568086 511243 454416 397646 340799 284018 227270 170599 113725 57022 396
If I filter out these jittery data points, the std_dev becomes much smaller. So I went digging into what the reason could be. Initially I looked for a pattern, or whether it occurs periodically, but it does not seem to, in my opinion.
I ran the receiver process under perf stat -d; it clearly shows the number of context switches to be 0.
Interestingly, when looking at the receiver process's /proc/${pid}/status, I monitor voluntary_ctxt_switches and nonvoluntary_ctxt_switches and see that nonvoluntary_ctxt_switches increases at a rate of around 1 per second, while voluntary_ctxt_switches stays constant once the data sharing starts. But for the roughly 200 seconds of my setup's runtime, the number of latency spikes is only around 2 or 3 and does not match the frequency of these context-switch numbers. (What does this count correspond to, then?)
I also followed a thread which feels relevant, but couldn't get anything out of it.
For the core running the receiver process, the context-switch trace on core 1 is (the number of spikes this time was 5):
$ grep " 1)" trace | grep "=>"
1) jemallo-22010 => <idle>-0
1) <idle>-0 => kworker-138
1) kworker-138 => <idle>-0
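For reference, one way to capture the task switches on a single core is the stock ftrace interface, roughly like this (a sketch using the sched_switch event, run as root; paths assume tracefs is mounted at /sys/kernel/debug/tracing, and this is not necessarily the exact tracer used above):
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo > trace                                # clear the ring buffer
echo 2 > tracing_cpumask                    # hex mask: trace only CPU 1
echo 1 > events/sched/sched_switch/enable   # record task switches
echo 1 > tracing_on
# ... run the receiver, then stop and inspect:
echo 0 > tracing_on
grep "==>" trace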
I also checked the difference between /proc/interrupts before and after the run of the setup.
The differences are:
name                                 receiver_core    sender_core
enp1s0f0np1-0                        2                0
eno1                                 0                3280
Non-maskable interrupts              25               25
Local timer interrupts               2K               ~3M
Performance monitoring interrupts    25               25
Rescheduling interrupts              9                12
Function call interrupts             120              110
machine-check polls                  1                1
I am not exactly sure what most of these numbers represent, but I am curious why there are rescheduling interrupts, and what enp1s0f0np1-0 is.
It might be that the spikes are not caused by context switches in the first place, but a number in the 600-microsecond range does hint in that direction. Leads in any other direction would be very helpful. I have also tried restarting the server.
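For reference, the before/after interrupt counts can be compared with two plain snapshots, roughly like this (a sketch; run_setup.sh is just a placeholder for whatever starts the sender and receiver):
cat /proc/interrupts > irq_before.txt
./run_setup.sh                              # placeholder: starts sender + receiver
cat /proc/interrupts > irq_after.txt
diff --side-by-side irq_before.txt irq_after.txt | less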
Turns out the problem was indeed not related to context switches.
I was also dumping the received data to a file; stopping that recording removed the spikes. So the high latency was due to some kind of write flush happening.
This is what I get when creating an e2-small machine:
This is what I get when checking the machine after creation:
This is what is shown on the page https://cloud.google.com/compute/vm-instance-pricing:
This is what I get when using cat /proc/cpuinfo:
In your first screenshot, it shows e2-small (2 vCPU, 2 GB memory).
In your second screenshot, it shows 1 shared core.
One CPU core is 2 vCPUs. Therefore, your first and second screenshots are showing the same thing.
A vCPU is a hyper-thread. Each CPU core consists of two hyper-threads.
How does Hyper-Threading work? When Intel® Hyper-Threading Technology is active, the CPU exposes two execution contexts per physical core. This means that one physical core now works like two "logical cores" that can handle different software threads.
What Is Hyper-Threading?
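If in doubt, the thread/core mapping is easy to confirm from inside the VM; a quick sketch on any Linux guest:
# summary of logical CPUs, threads per core, cores per socket, sockets
lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'
# which logical CPUs (vCPUs) share physical core 0 (on an e2-small, expect both vCPUs to appear)
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list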
I am running a cluster of EMR Spark with this setup:
Master: 1 of m5.xlarge
Core: 4 of m5.xlarge
spark.executor.instances 4
spark.executor.cores 4
spark.driver.memory 11171M
spark.executor.memory 10356M
spark.emr.maximizeResourceAllocation true
spark.emr.default.executor.memory 10356M
spark.emr.default.executor.cores 4
spark.emr.default.executor.instances 4
where m5.xlarge is an instance type which has 4 vCPUs and 16 GB of memory.
Since I am using Spark to migrate a database, the workload is very I/O-intensive but not very CPU-intensive. I notice each executor node only spawns 4 threads (seemingly 1 thread per vCPU), while the CPU still has plenty of headroom.
Is there a way to force a higher thread allocation per executor node so that I can fully utilize my resources? Thanks.
One vCPU can run only one thread at a time.
If you have assigned 4 vCPUs to your executor, it will never spawn more than 4 threads.
For more detail
Calculation of vCPU & Cores
First, we need to select a virtual server and CPU. For this example, we’ll select Intel Xeon E-2288G as the underlying CPU. Key stats for the Intel Xeon E-2288G include 8 cores / 16 threads with a 3.7GHz base clock and 5.0GHz turbo boost. There is 16MB of onboard cache.
(2 threads per core x 8 cores) x 1 CPU = 16 vCPU
reference
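On an actual Linux machine the vCPU count can be cross-checked from lscpu, for example (a small sketch; the field parsing is just one way to do it):
# vCPUs = threads per core x cores per socket x sockets
threads=$(lscpu | awk -F: '/Thread.s. per core/ {gsub(/ /, "", $2); print $2}')
cores=$(lscpu   | awk -F: '/Core.s. per socket/ {gsub(/ /, "", $2); print $2}')
sockets=$(lscpu | awk -F: '/^Socket/            {gsub(/ /, "", $2); print $2}')
echo "$((threads * cores * sockets)) vCPUs"     # for the E-2288G: 2 x 8 x 1 = 16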
I was using an AWS c5.4xlarge instance, which has 16 vCPUs, and running a Python program with 10 processes. However, the CPU usage of each process gradually dropped to 10% (as shown in the picture) within just 10 seconds. The total CPU usage of the 16-vCPU instance was only about 6%.
I reduced the number of processes, but the CPU usage of each process was still quite low. Everything is fine on my own macOS machine.
What is wrong with this?
OK, I found the answer. This is about processor affinity. For Linux beginners:
https://en.wikipedia.org/wiki/Processor_affinity
In Linux you can assign specific CPUs to a process from the terminal:
$ taskset -cp CPU_ID PID
For example:
$ taskset -cp 0-1 1000
will allocate CPUs 0 and 1 to the process with PID 1000.
You can find the PID by using
$ top
in your terminal.
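As a sketch of how this could look for a multi-process job (worker.py is just a placeholder for the actual script), each worker can be pinned to its own vCPU:
cpu=0
for pid in $(pgrep -f worker.py); do
    taskset -cp $cpu $pid        # bind this worker to CPU $cpu
    cpu=$((cpu + 1))
done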
I have recently set up an instance (m4.4xlarge).
When I execute the 'lscpu' command, the output looks something like the following:
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
CPU socket(s): 1
.
.
.
Does this mean that only 8 cores can be utilized?
If so, what are the rest of the CPUs for?
m4.4xlarge instances have 16 logical CPUs, so it looks like your EC2 instance is reporting it as having a single socket, with 1 physical CPU that has 8 cores. Each core can execute two threads simultaneously (Intel Hyperthreading technology) so each core is presented as 2 logical CPUs.
CPU(s): 16 <- logical CPUs (Threads per core * Cores per socket)
On-line CPU(s) list: 0-15
Thread(s) per core: 2 <- Each core has hyperthreading and presents
itself as two logical CPUs
Core(s) per socket: 8 <- Instance sees it has 8-core physical CPU per socket
CPU socket(s): 1 <- Instance sees it has 1 physical CPU
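To see exactly which pairs of logical CPUs share a physical core, lscpu can print the topology one line per CPU (a quick check; with hyperthreading, each CORE value should appear for two different CPU numbers):
lscpu -p=CPU,CORE,SOCKET     # e.g. "8,0,0" means logical CPU 8 sits on core 0, socket 0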