Nested cgroups: how does a parent cgroup distribute resources when it has processes in it AND child cgroups with processes in them?

I've been studying containerization in depth in the context of Docker and Kubernetes, and one thing hasn't quite clicked for me yet, and I couldn't find an answer to it: exactly what the title asks.
Suppose:
I have a parent cgroup "A" with access to 50% of the CPU, and 2 processes running in it (PIDs 1 and 2), each constantly using 25% absolute CPU.
Then I add a child cgroup "A1" to it, asking for everything A can give (shares=1024 & quota/period ratio = 1.0), with 2 other processes attached to it (PIDs 3 and 4).
How would CPU be distributed among these 4 processes at different hierarchy levels?
The way I think of it, PIDs 1 and 2 and cgroup A1 would each get an equal 1/3 of A's 50%, so:
PID 1: 16.66666%
PID 2: 16.66666%
cgroup A1: 16.66666% (which means PIDs 3 and 4 would get 8.3333% CPU each).
Is that right? Does it make sense, or am I missing something?
If it is not right, how should I educate myself to better visualize this multi-level distribution and think about it in a more intuitive way?
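In case it helps to make the setup concrete, below is a rough sketch of how the hierarchy above could be created for experimentation. It assumes cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu, reads "50% CPU" as a CFS quota of half of one core, and uses the hypothetical PIDs from above; the exact paths and values are just one way to express these limits.

#include <stdio.h>
#include <sys/stat.h>

/* Write a single value into a cgroup control file. */
static void write_str(const char *path, const char *val) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return; }
    fputs(val, f);
    fclose(f);
}

int main(void) {
    /* Parent cgroup A: capped at 50% of one CPU (quota = period / 2). */
    mkdir("/sys/fs/cgroup/cpu/A", 0755);
    write_str("/sys/fs/cgroup/cpu/A/cpu.cfs_period_us", "100000");
    write_str("/sys/fs/cgroup/cpu/A/cpu.cfs_quota_us",  "50000");

    /* Child cgroup A1: shares=1024 and quota/period ratio 1.0,
       i.e. asking for everything A can give. */
    mkdir("/sys/fs/cgroup/cpu/A/A1", 0755);
    write_str("/sys/fs/cgroup/cpu/A/A1/cpu.shares",        "1024");
    write_str("/sys/fs/cgroup/cpu/A/A1/cpu.cfs_period_us", "100000");
    write_str("/sys/fs/cgroup/cpu/A/A1/cpu.cfs_quota_us",  "100000");

    /* PIDs 1 and 2 would then be written into A/cgroup.procs and
       PIDs 3 and 4 into A/A1/cgroup.procs, after which the actual
       split can be observed with top or pidstat. */
    return 0;
}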

Related

Akka Dispatcher Thread creation

I have been working with the Akka actor model. I have a use case where more than 1000 actors will be active and I have to process those actors. I thought of controlling the thread count through the configuration defined in application.conf.
But the number of dispatcher threads created in my application leaves me unable to tune the dispatcher configuration. Each time I restart my application, I see a different number of dispatcher threads created (I have checked this via a thread dump each time after starting the application).
The thread count is not even equal to the one I defined in parallelism-min. Due to this low thread count, my application is processing very slowly.
Checking the number of cores on my machine with the code below:
Runtime.getRuntime().availableProcessors();
It displays 40. But the number of dispatcher threads created is less than 300, even though I configured the parallelism as 500.
Following is my application.conf file:
consumer-dispatcher {
  type = "Dispatcher"
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 500
    parallelism-factor = 20.0
    parallelism-max = 1000
  }
  shutdown-timeout = 1s
  throughput = 1
}
May I know on what basis Akka creates dispatcher threads internally, and how I can increase the dispatcher thread count to increase the parallel processing of actors?
X-Post from discuss.lightbend.com
First let me answer the question directly.
A fork-join-executor will be backed by a java.util.concurrent.ForkJoinPool with its parallelism set to the implied parallelism from the dispatcher config (parallelism-factor * available processors, but no larger than parallelism-max and no smaller than parallelism-min). So, in your case, 800.
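As a plain arithmetic sketch of that sizing rule (not Akka's actual code), using the numbers from the question:

#include <stdio.h>

int main(void) {
    int processors = 40;        /* Runtime.getRuntime().availableProcessors() */
    double factor  = 20.0;      /* parallelism-factor */
    int min = 500, max = 1000;  /* parallelism-min, parallelism-max */

    int parallelism = (int)(factor * processors);
    if (parallelism < min) parallelism = min;
    if (parallelism > max) parallelism = max;

    printf("implied parallelism = %d\n", parallelism);  /* prints 800 */
    return 0;
}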
And while I'm no expert on the implementation of the ForkJoinPool, the source for the Java implementation of ForkJoinPool says "All worker thread creation is on-demand, triggered by task submissions, replacement of terminated workers, and/or compensation for blocked workers." and it has methods like getActiveThreadCount(), so it's clear that ForkJoinPool doesn't just naively create a giant pool of workers.
In other words, what you are seeing is expected: it’s only going to create threads as they are needed. If you really must have a gigantic pool of worker threads you could create a thread-pool-executor with a fixed-pool-size of 800. This would give you the implementation you are looking for.
But, before you do so, I think you are entirely missing the point of actors and Akka. One of the reasons that people like actors is that they are much more lightweight than threads and can give you a lot more concurrency than a thread. (Also note that concurrency != parallelism, as noted in the documentation on concepts.) So trying to create a pool of 800 threads to back 1000 actors is very wasteful. The introduction in the Akka docs highlights that "Millions of actors can be efficiently scheduled on a dozen of threads".
I can't tell you exactly how many threads you need without knowing your application (for example, whether you have blocking behavior), but the defaults (which would give you a parallelism factor of 20) are probably just fine. Benchmark to be certain, but I really don't think you have a problem with too few threads. (The ForkJoinPool behavior you are observing seems to confirm this.)

Speed variation between vCPUs on the same Amazon EC2 instance

I'm exploring the feasibility of running numerical computations on Amazon EC2. I currently have one c4.8xlarge instance running. It has 36 vCPUs, each of which is a hyperthread of a Haswell Xeon chip. The instance runs Ubuntu in HVM mode.
I have a GCC-optimised binary of a completely sequential (i.e. single-threaded) program. I launched 30 instances with CPU-pinning thus:
for i in `seq 0 29`; do
    nohup taskset -c $i $BINARY_PATH &> $i.out &
done
The 30 processes run almost identical calculations. There's very little disk activity (a few megabytes every 5 minutes), and there's no network activity or interprocess communication whatsoever. htop reports that all processes run constantly at 100%.
The whole thing has been running for about 4 hours at this point. Six processes (12-17) have already done their task, while processes 0, 20, 24 and 29 look as if they will require another 4 hours to complete. Other processes fall somewhere in between.
My questions are:
Other than resource contention with other users, is there anything else that may be causing the significant variation in performance between the vCPUs within the same instance? As it stands, the instance would be rather unsuitable for any OpenMP or MPI jobs that synchronise between threads/ranks.
Is there anything I can do to achieve a more uniform (hopefully higher) performance across the cores? I have basically excluded hyperthreading as a culprit here since the six "fast" processes are hyperthreads on the same physical cores. Perhaps there's some NUMA-related issue?
My experience is on c3 instances. It's likely similar with c4.
For example, take a c3.2xlarge instance with 8 vCPUs (most of the explanation below is derived from direct discussion with AWS support).
In fact, only the first 4 vCPUs are usable for heavy scientific calculations. The last 4 vCPUs are hyperthreads. For scientific applications it's often not useful to use hyperthreading; it causes context switching or reduces the available cache (and associated bandwidth) per thread.
To find out the exact mapping between the vCPUs and the physical cores, look into /proc/cpuinfo:
"physical id": shows the physical processor id (only one processor in a c3.2xlarge)
"processor": gives the vCPU number
"core id": tells you which physical core each vCPU maps back to.
If you put this in a table, you have:
physical_id  processor  core_id
0            0          0
0            1          1
0            2          2
0            3          3
0            4          0
0            5          1
0            6          2
0            7          3
You can also get this from "thread_siblings_list", the kernel's internal map of cpuX's hardware threads within the same core as cpuX (https://www.kernel.org/doc/Documentation/cputopology.txt):
cat /sys/devices/system/cpu/cpuX/topology/thread_siblings_list
When hyper-threading is enabled, each vCPU (processor) is a "logical core", and there are 2 "logical cores" associated with each "physical core".
So, in your case, one solution is to disable hyperthreading with:
echo 0 > /sys/devices/system/cpu/cpuX/online
Where X for a c3.2xlarge would be 4...7
EDIT: you can observe this behaviour only in HVM instances. In PV instances, this topology is hidden by the hypervisor: all core ids and processor ids in /proc/cpuinfo are '0'.
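If you would rather read that mapping programmatically than eyeball /proc/cpuinfo, here is a small sketch (assuming an HVM instance and consecutively numbered CPUs) that prints each vCPU's hyperthread siblings from the sysfs files mentioned above:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    for (long cpu = 0; cpu < ncpu; cpu++) {
        char path[128], siblings[64];
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%ld/topology/thread_siblings_list", cpu);
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); continue; }
        if (fgets(siblings, sizeof siblings, f)) {
            /* e.g. on a c3.2xlarge, cpu0 and cpu4 both print "0,4":
               they are hyperthreads of the same physical core. */
            printf("cpu%ld siblings: %s", cpu, siblings);
        }
        fclose(f);
    }
    return 0;
}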

MPI fortran code on top of another one

I wrote an MPI Fortran program that I need to run multiple times (for clarity, let's call this program P1). The minimum number of cores I can use to run a program is 512. The problem is that P1 scales best with 128 cores.
What I want to do is create another program (P2) on top of P1 that calls P1 4 times simultaneously, with each call running on 128 cores.
Basically, I need to run 4 instances simultaneously, each with a number of processes equal to the total number of processors divided by 4.
Do you think this is possible? My problem is that I don't know where to look to do this.
I am currently looking at MPI groups and communicators; am I on the right path to reach my goal?
EDIT :
The system scheduler is LoadLeveler. When I submit a job I need to specify how many nodes I need. There are 16 cores per node and the minimum number of nodes I can use is 32. In the batch script we also specify -np NBCORES, but if we do so, e.g. -np 128, the time is accounted as if we were using 512 cores (32 nodes) even though the job ran on 128 cores.
I was able to do it thanks to your answers.
As I mentioned later (sorry for that), the scheduler is LoadLeveler.
If you have access to the subblock module, follow this, as Hristo Iliev mentioned: http://www.hpc.cineca.it/content/batch-scheduler-loadleveler-0#sub-block
If you don't, you can use a multi-step job with no dependencies between the steps, so they are executed simultaneously. It is a classic multi-step job; you just have to remove any # dependency flags (in the case of LoadLeveler).
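If you go the MPI communicator route mentioned in the question instead, a rough sketch of the split looks like this (shown in C for brevity; MPI_COMM_SPLIT works the same way from Fortran, and 128 is just the per-instance size from the question):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int group = world_rank / 128;          /* 0..3 when world_size == 512 */
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, group, world_rank, &sub);

    int sub_rank, sub_size;
    MPI_Comm_rank(sub, &sub_rank);
    MPI_Comm_size(sub, &sub_size);
    printf("world rank %d -> group %d, sub rank %d of %d\n",
           world_rank, group, sub_rank, sub_size);

    /* ...run one instance of P1's solver on "sub" instead of MPI_COMM_WORLD... */

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}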

How can I get a process to core mapping in C?

What library function can I call to get a mapping of processes to cores, or, given a process ID, tell me which core it's running on, last ran on, or is scheduled to run on? Something like this:
core 1: 14232,42323
core 2: 42213,63434,434
core 3: 34232,34314
core 4: 42325,6353,1434,4342
core 5: 43432,64535,14345,34233
core 6: 23242,53422,4231,34242
core 7: 78789
core 8: 23423,23124,5663
I know sched_getcpu returns the core number of the calling process. A function that, given a process ID, returned the core number would be good too, but I have not found one. sched_getaffinity is not useful either; it just tells you, for a given process, which cores it can run on, which is not what I'm interested in.
I don't know that you can get information about what CPU any particular process is running on, but if you look in /proc, you'll find one entry for each running process. Under that, in /proc/<pid>/cpuset you'll find information about the set of CPUs that can be used to run that process.
Your question does not have any precise answer. The scheduler can migrate a process from one processor core to another at any time (and it actually does that), so by the time you get the answer it may already be wrong. And a process is usually not tied to any particular core (unless its CPU affinity has been set, e.g. with sched_setaffinity(2), which is unusual; see also cpuset(7) for more).
Why are you asking? Why does that matter?
You probably want to dig inside /proc; see the proc(5) man page.
In other words, if the kernel does give that information, it is through /proc/, but I suspect it is not exposed because it does not make much sense.
NB: the kernel will schedule processes on the various processor cores much better than you can, so you should not normally care about which core is running some pid.
Yes, the virtual file /proc/[pid]/stat seems to have this info: man 5 proc:
/proc/[pid]/stat
    Status information about the process. This is used by ps(1). It is defined in /usr/src/linux/fs/proc/array.c.
    (...fields description...)
    processor %d (since Linux 2.2.8)
        CPU number last executed on.
on my dual core:
cat /proc/*/stat | awk '{printf "%-32s %d\n", $2 ":", $(NF-5)}'
(su): 0
(bash): 0
(tail): 1
(hd-audio0): 1
(chromium-browse): 0
(bash): 1
(upstart-socket-): 1
(rpcbind): 1
..though I can't say if it's pertinent and/or accurate..
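To read that field from C instead of awk (with the same caveat about how pertinent it is, since the scheduler may migrate the process at any moment), here is a minimal sketch: the processor value is field 39 of /proc/[pid]/stat, and because the comm field can contain spaces it is safer to count fields from the last ')':

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    char path[64], buf[2048];
    snprintf(path, sizeof path, "/proc/%s/stat", argv[1]);
    FILE *f = fopen(path, "r");
    if (!f || !fgets(buf, sizeof buf, f)) { perror(path); return 1; }
    fclose(f);

    char *p = strrchr(buf, ')');          /* end of the comm field */
    if (!p) return 1;

    /* comm is field 2; the token right after ')' is field 3 (state),
       so the processor field (39) follows 36 tokens later. */
    char *tok = strtok(p + 2, " ");
    for (int field = 3; tok && field < 39; field++)
        tok = strtok(NULL, " ");
    if (tok) printf("pid %s last ran on CPU %s\n", argv[1], tok);
    return 0;
}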

Open MPI: how to run exactly 1 process per host

Actually I have 3 questions. Any input is appreciated. Thank you!
1) How to run exactly 1 process on each host? My application uses TBB for multi-threading. Does it mean that I should run exactly 1 process on each host for best performance?
2) My cluster has heterogeneous hosts. Some hosts have better CPUs and more memory than others. How do I map process ranks to real hosts for work-distribution purposes? I am thinking of using the hostname. Is there a better way to do it?
3) How are process ranks assigned? Which process gets rank 0?
1) TBB splits loops into several threads of a thread pool to utilize all processors of one machine. So you should only run one process per machine. More processes would fight with each other for processor time. The number of processes per machine is given by options in your hostfile:
# my_hostfile
192.168.0.208 slots=1 max_slots=1
...
2) Giving each machine an appropriate amount of work according to its performance is not trivial.
The easiest approach is to split the workload into small pieces of work, send them to the slaves, collect their answers, and give them new pieces of work, until you are done. There is an example on my website (in German). You can also find some references to manuals and tutorials there.
3) Each process gets a number (processID) in your program by
MPI_Comm_rank(MPI_COMM_WORLD, &processID);
The master has processID == 0. The others may be given ranks in the order of the slots in your hostfile; another possibility is that they are assigned in the order in which the connections to the slaves are established. I don't know for certain.
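For question 2), here is a small sketch of how each rank can report its rank and the host it landed on, using the standard MPI_Get_processor_name call; launched with the hostfile above (e.g. mpirun --hostfile my_hostfile ./a.out), it shows the rank-to-host mapping directly:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size, name_len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &name_len);

    /* Rank 0 is the conventional master; the printed host name lets you
       decide how much work to hand each rank on a heterogeneous cluster. */
    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}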