I have read a lot of examples of creating a Dockerfile for a specific PHP setup, including installing extensions. The docker-php-ext-install command sometimes includes -j$(nproc) as an option. What exactly is happening there? I suspect nproc has something to do with the number of processes?
Here's an example from https://github.com/docker-library/docs/tree/master/php#php-core-extensions
FROM php:7.0-fpm
RUN apt-get update && apt-get install -y \
        libfreetype6-dev \
        libjpeg62-turbo-dev \
        libmcrypt-dev \
        libpng-dev \
    && docker-php-ext-install -j$(nproc) iconv mcrypt \
    && docker-php-ext-configure gd --with-freetype-dir=/usr/include/ --with-jpeg-dir=/usr/include/ \
    && docker-php-ext-install -j$(nproc) gd
I'm new to docker and want to understand exactly what's happening in each step rather than blindly copying and pasting things from various examples and tutorials.
It is the number of jobs for the make calls contained inside the docker-php-ext-install script (line 53 stores the option in the variable $j, and lines 105-106 call make -j$j).
The nproc command supplies the script with the number of processing units (hardware threads) available on your system. For example, on my system it expands to:
make -j$(nproc) -> make -j8
so make runs up to 8 recipes in parallel.
From the make manual:
-j [jobs], --jobs[=jobs]:
Specifies the number of jobs (commands) to run simultaneously. If there is more than one -j option, the last one is effective. If the -j option is given without an argument, make will not limit the number of jobs that can run simultaneously.
with more information in the GNU make documentation about parallel jobs:
GNU make knows how to execute several recipes at once. Normally, make will execute only one recipe at a time, waiting for it to finish before executing the next. However, the -j or --jobs option tells make to execute many recipes simultaneously. [...] On MS-DOS, the -j option has no effect, since that system doesn’t support multi-processing.
If the -j option is followed by an integer, this is the number of recipes to execute at once; this is called the number of job slots. If there is nothing looking like an integer after the -j option, there is no limit on the number of job slots. The default number of job slots is one, which means serial execution (one thing at a time).
Ideally, when that number equals the number of hardware threads available (roughly the number of processors, or in this case the number returned by nproc), you should get the fastest compilation possible.
nproc is a Linux command (see man nproc): it prints the number of processing units available, i.e. your CPU cores. You can think of it as the "number of processors" [source: tron5's comment].
You must also consider the available memory, though. For example, if you allocate 8 job slots with only 1 GB of RAM and three simultaneous compilation jobs already fill the RAM, then the fourth will exit with an out-of-memory error and abort the whole build.
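For example, here is a hedged sketch of a memory-aware job count (the one-job-per-GiB ratio is only an assumption; tune it for your build):
jobs=$(awk '/MemTotal/ {print int($2 / 1048576)}' /proc/meminfo)   # total RAM in GiB
cores=$(nproc)
[ "$jobs" -gt "$cores" ] && jobs=$cores    # never exceed the core count
[ "$jobs" -lt 1 ] && jobs=1                # always allow at least one job
docker-php-ext-install -j"$jobs" gd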
I'm using perf for profiling on Ubuntu 20.04 (though I can use any other free tool). It allows passing a delay on the command line, so that event collection starts a certain time after program launch. However, this time varies a lot (by 20 seconds out of 1000), and there are tail computations I am not interested in either.
So it would be great to call some API from my program to start perf event collection for the fragment of code I'm interested in, and then stop collection after the code finishes.
It's not really an option to run the code in a loop because there is a ~30-second initialization phase and a ~10-second measurement phase, and I'm only interested in the latter.
There is an inter-process communication mechanism to achieve this between the program being profiled (or a controlling process) and the perf process: Use the --control option in the format --control=fifo:ctl-fifo[,ack-fifo] or --control=fd:ctl-fd[,ack-fd] as discussed in the perf-stat(1) manpage. This option specifies either a pair of pathnames of FIFO files (named pipes) or a pair of file descriptors. The first file is used for issuing commands to enable or disable all events in any perf process that is listening to the same file. The second file, which is optional, is used to check with perf when it has actually executed the command.
There is an example in the manpage that shows how to use this option to control a perf process from a bash script, which you can easily translate to C/C++:
ctl_dir=/tmp/
ctl_fifo=${ctl_dir}perf_ctl.fifo
test -p ${ctl_fifo} && unlink ${ctl_fifo}
mkfifo ${ctl_fifo}
exec {ctl_fd}<>${ctl_fifo} # open for read+write; bash stores the new FD number in ctl_fd
This first checks that /tmp/perf_ctl.fifo, if it exists, is a named pipe, and only then deletes it. It's not a problem if the file doesn't exist, but if it exists and is not a named pipe, the file should not be deleted and mkfifo should fail instead. mkfifo creates a named pipe with the pathname /tmp/perf_ctl.fifo. The next command then opens the file for reading and writing and stores the number of the newly allocated file descriptor in ctl_fd. The equivalent syscalls are fstat, unlink, mkfifo, and open. Note that the named pipe will be written to by the shell script (the controlling process) or the process being profiled, and read from by the perf process. The same commands are repeated for the second named pipe, ctl_fd_ack, which will be used to receive acknowledgements from perf.
perf stat -D -1 -e cpu-cycles -a -I 1000 \
--control fd:${ctl_fd},${ctl_fd_ack} \
-- sleep 30 &
perf_pid=$!
This forks the current process and runs the perf stat program in the child process, which inherits the same file descriptors. The -D -1 option tells perf to start with all events disabled. You probably need to change the perf options as follows:
perf stat -D -1 -e <your event list> --control fd:${ctl_fd},${ctl_fd_ack} -p pid
In this case, the program to be profiled is the same as the controlling process, so tell perf to profile your already-running program using -p. The equivalent syscalls are fork followed by execv in the child process.
sleep 5 && echo 'enable' >&${ctl_fd} && read -u ${ctl_fd_ack} e1 && echo "enabled(${e1})"
sleep 10 && echo 'disable' >&${ctl_fd} && read -u ${ctl_fd_ack} d1 && echo "disabled(${d1})"
The example script sleeps for about 5 seconds, writes 'enable' to the ctl_fd pipe, and then checks the response from perf to ensure that the events have been enabled before proceeding to disable the events after about 10 seconds. The equivalent syscalls are write and read.
The rest of the script closes the file descriptors and deletes the pipe files.
Putting it all together now, your program should look like this:
/* PART 1
Initialization code.
*/
/* PART 2
Create named pipes and fds.
Fork perf with disabled events.
perf is running now but nothing is being measured.
You can redirect perf output to a file if you wish.
*/
/* PART 3
Enable events.
*/
/* PART 4
The code you want to profile goes here.
*/
/* PART 5
Disable events.
perf is still running but nothing is being measured.
*/
/* PART 6
Cleanup.
Let this process terminate, which would cause the perf process to terminate as well.
Alternatively, use `kill(pid, SIGINT)` to gracefully kill perf.
perf stat outputs the results when it terminates.
*/
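Putting the shell fragments from the manpage together in the same PART 1-6 order gives a minimal runnable sketch (assumptions: perf is new enough to support --control, roughly v5.9+; the sleep calls stand in for your real initialization and measurement phases; -p $$ attaches perf to this very shell):
#!/bin/bash
sleep 3      # PART 1: stands in for the ~30 s initialization phase

# PART 2: create the named pipes, open them, and fork perf with all
# events disabled (-D -1)
ctl_fifo=/tmp/perf_ctl.fifo
ack_fifo=/tmp/perf_ctl_ack.fifo
for f in "$ctl_fifo" "$ack_fifo"; do
    test -p "$f" && unlink "$f"
    mkfifo "$f"
done
exec {ctl_fd}<>"$ctl_fifo"
exec {ctl_fd_ack}<>"$ack_fifo"
perf stat -D -1 -e cycles --control fd:${ctl_fd},${ctl_fd_ack} -p $$ &
perf_pid=$!

# PART 3: enable events and wait for perf's acknowledgement
echo enable >&${ctl_fd} && read -u ${ctl_fd_ack} a && echo "enabled(${a})"

sleep 1      # PART 4: stands in for the ~10 s phase you want measured

# PART 5: disable events again
echo disable >&${ctl_fd} && read -u ${ctl_fd_ack} a && echo "disabled(${a})"

# PART 6: cleanup; SIGINT makes perf stat print the counts
kill -INT "$perf_pid"
wait "$perf_pid"
exec {ctl_fd}>&- {ctl_fd_ack}>&-
rm -f "$ctl_fifo" "$ack_fifo"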
Is there a way to limit the number of threads a gsutil -m command spawns? Can I say something like gsutil -m --threads=4 to spawn exactly four threads?
You should set the parallel_thread_count and parallel_process_count values in the boto configuration file. Note that gsutil's total parallelism is the product of the two, so for exactly four concurrent operations set parallel_process_count to 1 and parallel_thread_count to 4.
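For example, in the [GSUtil] section of your boto configuration file (typically ~/.boto; the path can vary):
[GSUtil]
parallel_process_count = 1
parallel_thread_count = 4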
Gsutil Top-Level Command-Line Options
-m flag
Causes supported operations (acl ch, acl set, cp, mv, rm, rsync, and
setmeta) to run in parallel. This can significantly improve
performance if you are performing operations on a large number of
files over a reasonably fast network connection.
gsutil performs the specified operation using a combination of
multi-threading and multi-processing, using a number of threads and
processors determined by the parallel_thread_count and
parallel_process_count values set in the boto configuration file. You
might want to experiment with these values, as the best values can
vary based on a number of factors, including network speed, number of
CPUs, and available memory.
Using the -m option may make your performance worse if you are using a
slower network, such as the typical network speeds offered by
non-business home network plans. It can also make your performance
worse for cases that perform all operations locally (e.g., gsutil
rsync, where both source and destination URLs are on the local disk),
because it can "thrash" your local disk.
If a download or upload operation using parallel transfer fails before
the entire transfer is complete (e.g. failing after 300 of 1000 files
have been transferred), you will need to restart the entire transfer.
Also, although most commands will normally fail upon encountering an
error when the -m flag is disabled, all commands will continue to try
all operations when -m is enabled with multiple threads or processes,
and the number of failed operations (if any) will be reported as an
exception at the end of the command's execution.
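If you would rather not edit the configuration file, gsutil also accepts per-invocation overrides of boto values via its top-level -o option (check gsutil help options for your version; the bucket path below is a placeholder):
gsutil -o "GSUtil:parallel_process_count=1" \
       -o "GSUtil:parallel_thread_count=4" \
       -m cp -r ./data gs://my-bucket/data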
I have a virtual machine with 32 cores.
I am running some simulations for which I need to utilize 16 cores at one time.
I use the below command to run a job on 16 cores :
mpirun -n 16 program_name args > log.out 2>&1
This program runs on 16 cores.
Now if I want to run the same program on the rest of the cores, with different arguments, I use a similar command:
mpirun -n 8 program_name diff_args > log_1.out 2>&1
The second process utilizes the same 16 cores that were utilized earlier.
How can I use mpirun to run this process on 8 different cores, not the 16 that the first job is already using?
I am using headless Ubuntu 16.04.
Open MPI's launcher supports restricting the CPU set via the --cpu-set option. It accepts a set of logical CPUs expressed as a comma-separated list s0,s1,s2,..., where each list entry is either a single logical CPU number or a range of CPUs n-m.
Provided that the logical CPUs in your VM are numbered consecutively, what you have to do is:
mpirun --cpu-set 0-15 --bind-to core -n 16 program_name args > log.out 2>&1
mpirun --cpu-set 16-23 --bind-to core -n 8 program_name diff_args > log_1.out 2>&1
--bind-to core tells Open MPI to bind the processes to separate cores each while respecting the CPU set provided in the --cpu-set argument.
It might be helpful to use a tool such as lstopo (part of the hwloc library, which Open MPI uses) to obtain the topology of the system; it helps in choosing the right CPU numbers and, e.g., prevents binding to hyperthreads, although this is less meaningful in a virtualised environment.
(Note that lstopo uses a confusing naming convention and calls the OS logical CPUs physical, so look for the numbers in the (P#n) entries. lstopo -p hides the hwloc logical numbers and prevents confusion.)
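As a sanity check, Open MPI can print where each rank actually lands via its --report-bindings option (the exact output format differs between versions):
mpirun --cpu-set 0-15 --bind-to core --report-bindings -n 16 program_name args > log.out 2>&1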
I am using Linux Ubuntu, and programming in C++. I have been able to access the performance counters (instruction counts, cache misses etc) using perf_event (actually using programs from this link: https://github.com/castl/easyperf).
However, now I am running a multi-threaded application using pthreads, and need the instruction counts and cycles to completion of each thread separately. Any ideas on how to go about this?
Thanks!
perf is a system profiling tool you can use. It's not like https://github.com/castl/easyperf, which is a library you use in your code. Follow these steps to profile your program:
Install perf on Ubuntu. The installation can be quite different in different Linux distributions; you can find an installation tutorial online.
Run your program and get all of its thread ids:
ps -eLf | grep [application name]
Open a separate terminal and run perf stat -t [threadid]; from the man page (a concrete example follows the usage text):
usage: perf stat [<options>] [<command>]
-e, --event <event> event selector. use 'perf list' to list available events
-i, --no-inherit child tasks do not inherit counters
-p, --pid <n> stat events on existing process id
-t, --tid <n> stat events on existing thread id
-a, --all-cpus system-wide collection from all CPUs
-c, --scale scale/normalize counters
-v, --verbose be more verbose (show counter open errors, etc)
-r, --repeat <n> repeat command and print average + stddev (max: 100)
-n, --null null run - dont start any counters
-B, --big-num print large numbers with thousands' separators
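For example, to count instructions and cache misses on one thread for ten seconds (the thread id 12345 is a placeholder; sleep 10 merely acts as a timer):
perf stat -e instructions,cache-misses -t 12345 sleep 10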
There is also an analysis article about perf that can give you a feel for the tool.
You can use the standard tool for accessing perf_events, namely perf (from linux-tools). It can work with all threads of your program and report both a summary profile and per-thread (per-pid/per-tid) profiles.
This profile is not exact hardware counter values, but rather the result of sampling every N events, with N tuned so that samples are taken around 99 times per second (99 Hz). You can also try the -c 2000000 option to get one sample every 2 million hardware events. For example, for the cycles event (full list: perf list, or try some of the events listed by perf stat ./program):
perf record -e cycles -F 99 ./program
perf record -e cycles -c 2000000 ./program
Summary across all threads (-n shows the total number of samples):
perf report -n
Per pid (tids are actually used here, so this lets you select any thread).
The text variant lists all recorded threads with their summary sample counts (with -c 2000000 you can multiply a thread's sample count by 2 million to estimate its hardware event count):
perf report -n -s pid | cat
Or the ncurses-like interactive variant, where you can select any thread and see its own profile:
perf report -n -s pid
Please take a look at the perf tool documentation here; it supports some of the events (e.g. both instructions and cache-misses) that you're looking to profile. Extract from the wiki page linked above:
The perf tool can be used to count events on a per-thread, per-process, per-cpu or system-wide basis. In per-thread mode, the counter only monitors the execution of a designated thread. When the thread is scheduled out, monitoring stops. When a thread migrates from one processor to another, counters are saved on the current processor and are restored on the new one.
We have a Linux server with multiple users logged in. If someone runs make -jN it hogs the whole server CPU usage and responsiveness to other users decreases drastically.
Is there any way to decrease the priority of make process run by anyone in Linux?
Make has a '-l' (--load-average) option.
If you specify 'make -l 3', make will not launch additional jobs if there are already jobs running and the load is over 3.
From the manpage:
-l [load], --load-average[=load]
Specifies that no new jobs (commands) should be started if there
are other jobs running and the load average is at least load (a
floating-point number). With no argument, removes a previous load
limit.
It doesn't really decrease the priority of make, but it can avoid causing too much load.
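A common way to combine the two, as a sketch (tune the numbers to your machine), is to allow parallel jobs but cap them by load average:
make -j$(nproc) -l$(nproc)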
Replace make with your own wrapper script that prepends a "nice -n <N>" command, so that the higher the -jN, the higher the niceness.
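A minimal sketch of such a wrapper (assumptions: it is installed ahead of the real make in PATH, e.g. as /usr/local/bin/make, and it only recognizes the -jN spelling, not -j N):
#!/bin/bash
# Scale niceness with the requested parallelism, capped at the maximum of 19.
jobs=1
for arg in "$@"; do
    case $arg in
        -j[0-9]*) jobs=${arg#-j} ;;
    esac
done
exec nice -n "$(( jobs < 19 ? jobs : 19 ))" /usr/bin/make "$@"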
Alternatively, start a super-user process that runs ps -u "user name" | grep make and counts the make processes, then uses renice on the process ids to bring them in line, or applies any other algorithm you want.
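A hedged sketch of that watchdog (run as root; the 30-second interval and the niceness of 10 are arbitrary choices):
#!/bin/bash
# Periodically renice every running make so builds stay low priority.
while sleep 30; do
    for pid in $(pgrep -x make); do
        renice -n 10 -p "$pid" >/dev/null 2>&1
    done
done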