I am using Linux Ubuntu, and programming in C++. I have been able to access the performance counters (instruction counts, cache misses etc) using perf_event (actually using programs from this link: https://github.com/castl/easyperf).
However, now I am running a multi-threaded application using pthreads, and need the instruction counts and cycles to completion of each thread separately. Any ideas on how to go about this?
Thanks!
perf is a system profiling tool you can use. It is not like https://github.com/castl/easyperf, which is a library that you use inside your code; perf runs as a separate program. Follow these steps to profile your program:
Install perf on Ubuntu. The installation can differ between Linux distributions; you can find an installation tutorial online.
Run your program and get all of its thread IDs:
ps -eLf | grep [application name]
Open a separate terminal and run perf as perf stat -t [threadid]. From the man page:
usage: perf stat [<options>] [<command>]
-e, --event <event> event selector. use 'perf list' to list available events
-i, --no-inherit child tasks do not inherit counters
-p, --pid <n> stat events on existing process id
-t, --tid <n> stat events on existing thread id
-a, --all-cpus system-wide collection from all CPUs
-c, --scale scale/normalize counters
-v, --verbose be more verbose (show counter open errors, etc)
-r, --repeat <n> repeat command and print average + stddev (max: 100)
-n, --null null run - dont start any counters
-B, --big-num print large numbers with thousands' separators
There is an analysis article about perf that can give you a feel for the tool.
You can use the standard tool for accessing perf_event: perf (from linux-tools). It can work with all threads of your program and report both a summary profile and per-thread (per-pid/per-tid) profiles.
This profile is not an exact hardware-counter value, but rather the result of sampling every N events, with N tuned so that samples are taken around 99 times per second (99 Hz). You can also try the -c 2000000 option to take a sample every 2 million hardware events. For example, for the cycles event (full list: perf list, or try some of the events listed by perf stat ./program):
perf record -e cycles -F 99 ./program
perf record -e cycles -c 2000000 ./program
Summary over all threads; -n shows the total number of samples:
perf report -n
Per PID (actually TIDs are used here, so it will let you select any thread).
The text variant lists all recorded threads with their summary sample counts (with -c 2000000 you can multiply the sample count by 2 million to estimate the hardware event count for the thread):
perf report -n -s pid | cat
Or the ncurses-like interactive variant, where you can select any thread and see its own profile:
perf report -n -s pid
Please take a look at the perf tool documentation here; it supports some of the events (e.g. both instructions and cache-misses) that you're looking to profile. An extract from the wiki page linked above:
The perf tool can be used to count events on a per-thread, per-process, per-CPU or system-wide basis. In per-thread mode, the counter only monitors the execution of a designated thread. When the thread is scheduled out, monitoring stops. When a thread migrates from one processor to another, counters are saved on the current processor and are restored on the new one.
Related
I have source code that uses several threads.
I want to see what percentage of CPU and memory each thread uses.
So I used the "htop" command. (I am using Ubuntu.)
It has a "PID" column, which is the process/thread ID.
I googled how to get thread id.
https://en.cppreference.com/w/cpp/thread/thread/id/hash
How to convert std::thread::id to string in c++?
But somehow the thread IDs I got from the source code do not match the PID values in htop's output.
Any better idea or help would be great, thank you.
PID is the process ID. Run top -H:
man 1 top
-H :Threads-mode operation
Instructs top to display individual threads. Without this
command-line option a summation of all threads in each
process is shown. Later this can be changed with the `H'
interactive command
I'm using perf for profiling on Ubuntu 20.04 (though I could use any other free tool). It allows passing a delay on the command line, so that event collection starts a certain time after program launch. However, this time varies a lot (by 20 seconds out of 1000), and there are tail computations that I am not interested in either.
So it would be great to call some API from my program to start perf event collection for the fragment of code I'm interested in, and then stop collection after the code finishes.
It's not really an option to run the code in a loop, because there is a ~30-second initialization phase and a 10-second measurement phase, and I'm only interested in the latter.
There is an inter-process communication mechanism to achieve this between the program being profiled (or a controlling process) and the perf process: Use the --control option in the format --control=fifo:ctl-fifo[,ack-fifo] or --control=fd:ctl-fd[,ack-fd] as discussed in the perf-stat(1) manpage. This option specifies either a pair of pathnames of FIFO files (named pipes) or a pair of file descriptors. The first file is used for issuing commands to enable or disable all events in any perf process that is listening to the same file. The second file, which is optional, is used to check with perf when it has actually executed the command.
There is an example in the manpage that shows how to use this option to control a perf process from a bash script, which you can easily translate to C/C++:
ctl_dir=/tmp/
ctl_fifo=${ctl_dir}perf_ctl.fifo
test -p ${ctl_fifo} && unlink ${ctl_fifo}
mkfifo ${ctl_fifo}
exec {ctl_fd}<>${ctl_fifo} # open for read+write, store the fd number in $ctl_fd
This first checks that /tmp/perf_ctl.fifo, if it exists, is a named pipe, and only then deletes it. It's not a problem if the file doesn't exist, but if it exists and is not a named pipe, it should not be deleted, and mkfifo should fail instead. mkfifo creates a named pipe with the pathname /tmp/perf_ctl.fifo. The next command opens the file for reading and writing and stores the resulting file descriptor number in ctl_fd. The equivalent syscalls are fstat, unlink, mkfifo, and open. Note that the named pipe will be written to by the shell script (controlling process) or the process being profiled, and read from by the perf process. The same commands are repeated for the second named pipe, ctl_fd_ack, which will be used to receive acknowledgements from perf.
perf stat -D -1 -e cpu-cycles -a -I 1000 \
--control fd:${ctl_fd},${ctl_fd_ack} \
-- sleep 30 &
perf_pid=$!
This forks the current process and runs the perf stat program in the child process, which inherits the same file descriptors. The -D -1 option tells perf to start with all events disabled. You probably need to change the perf options as follows:
perf stat -D -1 -e <your event list> --control fd:${ctl_fd},${ctl_fd_ack} -p pid
In this case, the program to be profiled is the same as the controlling process, so tell perf to profile your already-running program using -p. The equivalent syscalls are fork followed by execv in the child process.
sleep 5 && echo 'enable' >&${ctl_fd} && read -u ${ctl_fd_ack} e1 && echo "enabled(${e1})"
sleep 10 && echo 'disable' >&${ctl_fd} && read -u ${ctl_fd_ack} d1 && echo "disabled(${d1})"
The example script sleeps for about 5 seconds, writes 'enable' to the ctl_fd pipe, and then checks the response from perf to ensure that the events have been enabled before proceeding to disable the events after about 10 seconds. The equivalent syscalls are write and read.
The rest of the script deletes the file descriptors and the pipe files.
Putting it all together now, your program should look like this:
/* PART 1
Initialization code.
*/
/* PART 2
Create named pipes and fds.
Fork perf with disabled events.
perf is running now but nothing is being measured.
You can redirect perf output to a file if you wish.
*/
/* PART 3
Enable events.
*/
/* PART 4
The code you want to profile goes here.
*/
/* PART 5
Disable events.
perf is still running but nothing is being measured.
*/
/* PART 6
Cleanup.
Let this process terminate, which would cause the perf process to terminate as well.
Alternatively, use `kill(pid, SIGINT)` to gracefully kill perf.
perf stat outputs the results when it terminates.
*/
I have a complex application that executes in a number of phases. I would like to profile only one of the phases.
The C++ application runs on Linux, x86-64.
This program takes several minutes to run. If I use perf to profile the whole thing, the resulting data set is too large for perf report to process. However, at this point I am interested only in profiling the execution of one phase of the program that takes maybe 1/3 of the total time. Perhaps this data set will be easier for perf to report on.
Ideally, I'd like something along the lines of "send yourself SIGUSR1 to start profiling, and SIGUSR2 to stop it". At that point I can easily delineate the execution phase that I want profile information for.
I can always write my own (albeit basic) profiler using SIGPROF, but is there a way I can do this with existing tools such as perf?
A possible way to do this is to attach perf to an existing process.
So, start your program and find out its PID. Then, when appropriate, start profiling with the -p <pid> option, and use CTRL-C or SIGINT to stop it. This trick only works if you don't need to start and stop profiling many times, as the data-append functionality was removed from perf a long time ago.
Or maybe you can just decrease the sampling frequency with -F, so the resulting data becomes more tractable.
I executed the following command to get the CPU usage:
perf record -F 99 -p PID sleep 1
But I think this command gives me the CPU usage of this process only.
If it forks any new process, that process's CPU usage will not be included in the perf report.
Can someone suggest how to get the CPU usage of a given PID combined with that of all its descendants?
I am also interested in the peak CPU usage over any interval. Is this possible with perf too?
We have a Linux server with multiple users logged in. If someone runs make -jN, it hogs the whole server's CPU and responsiveness for the other users drops drastically.
Is there any way to decrease the priority of make process run by anyone in Linux?
Make has a '-l' (--load-average) option.
If you specify 'make -l 3', make will not launch additional jobs if there are already jobs running and the load is over 3.
From the manpage:
-l [load], --load-average[=load]
Specifies that no new jobs (commands) should be started if there
are other jobs running and the load average is at least load (a
floating-point number). With no argument, removes a previous load
limit.
It doesn't really decrease the priority of make, but it can avoid causing too much load.
Replace make with your own script that prefixes it with a "nice -n <>" command, so that the higher the -jN, the higher the niceness.
Or start a super-user process that runs ps -u "user name" | grep make and counts the processes, then uses renice on the process IDs to bring them in line, or applies any other algorithm you want.