How to run record instruction-history and function-call-history in GDB? - gdb

(EDIT: per the first answer below, the current "trick" seems to be using an Atom processor. But I hope some gdb guru can say whether this is a fundamental limitation, or whether adding support for other processors is on the roadmap.)
Reverse execution seems to be working in my environment: I can reverse-continue, see a plausible record log, and move around within it:
(gdb) start
...Temporary breakpoint 5 at 0x8048460: file bang.cpp, line 13.
Starting program: /home/thomasg/temp/./bang
Temporary breakpoint 5, main () at bang.cpp:13
13 f(1000);
(gdb) record
(gdb) continue
Continuing.
Breakpoint 3, f (d=900) at bang.cpp:5
5 if(d) {
(gdb) info record
Active record target: record-full
Record mode:
Lowest recorded instruction number is 1.
Highest recorded instruction number is 1005.
Log contains 1005 instructions.
Max logged instructions is 200000.
(gdb) reverse-continue
Continuing.
Breakpoint 3, f (d=901) at bang.cpp:5
5 if(d) {
(gdb) record goto end
Go forward to insn number 1005
#0 f (d=900) at bang.cpp:5
5 if(d) {
However the instruction and function histories aren't available:
(gdb) record instruction-history
You can't do that when your target is `record-full'
(gdb) record function-call-history
You can't do that when your target is `record-full'
And the only target type available is full; the other documented type, "btrace", fails with "Target does not support branch tracing."
So quite possibly it just isn't supported for this target, but as it's a mainstream modern one (gdb 7.6.1-ubuntu, on amd64 Linux Mint "Petra" running an "Intel(R) Core(TM) i5-3570") I'm hoping that I've overlooked a crucial step or config?

It seems that there is no other solution except a CPU that supports it.
More precisely, both your CPU and your kernel have to support Intel Processor Trace (Intel PT). On Linux this can be checked with:
grep intel_pt /proc/cpuinfo
See also: https://unix.stackexchange.com/questions/43539/what-do-the-flags-in-proc-cpuinfo-mean
The commands only work in record btrace mode.
In the GDB source commit beab5d9, it is nat/linux-btrace.c:kernel_supports_pt that checks if we can enter btrace. The following checks are carried out:
check if /sys/bus/event_source/devices/intel_pt/type exists and read the type
do a syscall (SYS_perf_event_open, &attr, child, -1, -1, 0); with the read type, and see if it returns >=0. TODO: why not use the C wrapper?
The first check fails for me: the file does not exist.
Kernel side
cd into the kernel 4.1 source and:
git grep '"intel_pt"'
we find arch/x86/kernel/cpu/perf_event_intel_pt.c which sets up that file. In particular, it does:
if (!test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT))
goto fail;
so intel_pt is a pre-requisite.
How I've found kernel_supports_pt
First grep for:
git grep 'Target does not support branch tracing.'
which leads us to btrace.c:btrace_enable. From there, a quick debug session leads to kernel_supports_pt:
gdb -q -ex start -ex 'b btrace_enable' -ex c --args /home/ciro/git/binutils-gdb/install/bin/gdb --batch -ex start -ex 'record btrace' ./hello_world.out
VirtualBox does not support it either, see: Extract execution log from gdb record in a VirtualBox VM
Intel SDE
Intel SDE 7.21 already has this CPU feature, checked with:
./sde64 -- cpuid | grep 'Intel processor trace'
But I'm not sure if the Linux kernel can be run on it: https://superuser.com/questions/950992/how-to-run-the-linux-kernel-on-intel-software-development-emulator-sde
Other GDB methods
More generic questions, with less efficient software solutions:
call graph: List of all function calls made in an application
instruction trace: Displaying each assembly instruction executed in gdb

At least a partial answer (for the "am I doing it wrong" aspect) - from gdb-7.6.50.20140108/gdb/NEWS
* A new record target "record-btrace" has been added. The new target
uses hardware support to record the control-flow of a process. It
does not support replaying the execution, but it implements the
below new commands for investigating the recorded execution log.
This new recording method can be enabled using:
record btrace
The "record-btrace" target is only available on Intel Atom processors
and requires a Linux kernel 2.6.32 or later.
* Two new commands have been added for record/replay to give information
about the recorded execution without having to replay the execution.
The commands are only supported by "record btrace".
record instruction-history prints the execution history at
instruction granularity
record function-call-history prints the execution history at
function granularity
It's not often that I envy the owner of an Atom processor ;-)
I'll edit the question to refocus upon the question of workarounds or plans for future support.

Related

Memcheck (valgrind) reporting inconsistent results in different docker hosts

I have a fairly robust CI test suite for a C++ library; the tests (around 50) run on the same docker image but on different machines.
In one machine ("A") all the memcheck (valgrind) tests pass (i.e. no memory leaks).
In the other ("B"), all tests produce the same valgrind error below.
51/56 MemCheck #51: combinations.cpp.x ....................***Exception: SegFault 0.14 sec
valgrind: m_libcfile.c:66 (vgPlain_safe_fd): Assertion 'newfd >= VG_(fd_hard_limit)' failed.
Cannot find memory tester output file: /builds/user/boost-multi/build/Testing/Temporary/MemoryChecker.51.log
The machines are very similar, both are intel i7.
The only difference I can think of is that one is:
A. Ubuntu 22.10, Linux 5.19.0-29, docker 20.10.16
and the other:
B. Fedora 37, Linux 6.1.7-200.fc37.x86_64, docker 20.10.23
and perhaps some configuration of docker I don't know about.
Is there some configuration of docker that might generate the difference? or of the kernel? or some option in valgrind to workaround this problem?
I know for a fact that in real machines (not docker) valgrind doesn't produce any memory error.
The options I use for valgrind are always --leak-check=yes --num-callers=51 --trace-children=yes --leak-check=full --track-origins=yes --gen-suppressions=all.
Valgrind version in the image is 3.19.0-1 from the debian:testing image.
Note that this isn't an error reported by valgrind, it is an error within valgrind.
Perhaps after all, the only difference is that Ubuntu version of valgrind is compiled in release mode and the error is just ignored. (<-- this doesn't make sense, valgrind is the same in both cases because the docker image is the same).
I tried removing --num-callers=51 or setting it at 12 (default value), to no avail.
I found a difference between the images and the real machine and a workaround.
It has to do with the number of file descriptors.
(This was pointed out briefly in one of the valgrind bug reports for macOS: https://bugs.kde.org/show_bug.cgi?id=381815#c0)
Inside the docker image running in Ubuntu 22.10:
ulimit -n
1048576
Inside the docker image running in Fedora 37:
ulimit -n
1073741816
(which looks like a ridiculous number or an overflow)
In the Fedora 37 and the Ubuntu 22.10 real machines:
ulimit -n
1024
So, doing this in the CI recipe, "solved" the problem:
- ulimit -n # reports current value
- ulimit -n 1024 # workaround needed by valgrind in docker running in Fedora 37
- ctest ... (with memcheck)
I have no idea why this workaround works.
For reference:
$ ulimit --help
...
-n the maximum number of open file descriptors
First off, "you are doing it wrong" with your Valgrind arguments. For CI I recommend a two stage approach. Use as many default arguments as possible for the CI run (--trace-children=yes may well be necessary but not the others). If your codebase is leaky then you may need to check for leaks, but if you can maintain a zero leak policy (or only suppressed leaks) then you can tell if there are new leaks from the summary. After your CI detects an issue you can run again with the kitchen sink options to get full information. Your runs will be significantly faster without all those options.
Back to the question.
Valgrind is trying to dup() some file (the guest exe, a tempfile or something like that). The fd that it gets is higher than what it thinks the nofile rlimit is, so it asserts.
A billion files is ridiculous.
Valgrind will try to call prlimit RLIMIT_NOFILE, with a fallback call to rlimit, and a second fallback to setting the limit to 1024.
To really see what is going on you need to modify the Valgrind source (m_main.c, setup_file_descriptors, set local show to True). With this change I see
fd limits: host, before: cur 65535 max 65535
fd limits: host, after: cur 65535 max 65535
fd limits: guest : cur 65523 max 65523
Otherwise with strace I see
2049 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=65535, rlim_max=65535}) = 0
2049 prlimit64(0, RLIMIT_NOFILE, {rlim_cur=65535, rlim_max=65535}, NULL) = 0
(all the above on RHEL 7.6 amd64)
EDIT: Note that the above shows Valgrind querying and setting the resource limit. If you use ulimit to lower the limit before running Valgrind, then Valgrind will try to honour that limit. Also note that Valgrind reserves a small number (8) of files for its own use.

CUDA GPU __global__ function does not complete

__global__ void functionA()
{
printf("functionA");
}
int main()
{
printf("main1");
functionA<<<1,1>>>();
printf("main2");
}
I'm trying to run a simple test with the above. But the program only outputs "main1". The program should output "functionA" and "main2" too.
This seems to have two reasons:
First of all you need to add
cudaDeviceSynchronize();
after the CUDA routine in order to block the main until the device has completed all tasks.
Furthermore this might happen if you set the wrong GPU architecture/compute capability XX when compiling the code
$ nvcc -gencode=arch=compute_XX,code=sm_XX -o my_app my_app.cu
In this case only the host code is run while the parts on the accelerator will be omitted it seems. You can find an overview of the corresponding number XX for the different hardware generations over here. The K20m you are running is 35. So it should be
$ nvcc -gencode=arch=compute_35,code=sm_35 -o my_app my_app.cu
in your case.
This might also occur if you have multiple graphics accelerators in your system and the code is executed on the wrong one. Each graphics card/accelerator is assigned a particular device id. The device with number 0 should be assigned automatically to the most powerful device and will be used by default. So the first time I compiled the code on my system, which contains a powerful Tesla K80 (architecture 37) and a low-power Quadro P620 (architecture 60), I selected 37 and got the same error as you, whereas with 60 the code ran. I then used the Querying Device Properties example to list the CUDA-capable devices and their device ids, only to find that on my system the Tesla K80 is set as 1 and 2 while the simple Quadro P620 graphics card is set as 0. I assume this is because the K80 is deprecated in CUDA 11!
You can select the device inside your code with cudaSetDevice or change it when launching the program with
$ CUDA_VISIBLE_DEVICES="1" ./my_app
where 1 has to be replaced by the device id you wish to use. Doing so should make your code run without any problems.
You can also test whether this really is the issue by cloning the GitHub repository of "Learn CUDA Programming", browsing to Chapter01/01_cuda_introduction/01_hello_world/, building it with $ make and finally running it with $ ./hello_world. It automatically compiles for multiple architectures/compute capabilities and should therefore run without any issue!

MPI_Comm_spawn fails with "All nodes which are allocated for this job are already filled"

I'm trying to use Torque's (5.1.1) qsub command to launch multiple OpenMPI
processes, one process per node, and having each process launch a single
process on its own local node using MPI_Comm_spawn(). MPI_Comm_spawn() is reporting:
All nodes which are allocated for this job are already filled.
My OpenMPI version is 4.0.1.
I am following the instructions here to control the mapping of nodes.
Controlling node mapping of MPI_COMM_SPAWN
using the --map-by ppr:1:node option to mpiexec, and a hostfile (programmatically derived
from the ${PBS_NODEFILE} file that Torque produces). My derived file MyHostFile looks
like this:
n001.cluster.com slots=2 max_slots=2
n002.cluster.com slots=2 max_slots=2
while the original ${PBS_NODEFILE} only has the node names, and no slot specifications.
My qsub command is
qsub -V -j oe -e ./tempdir -o ./tempdir -N MyJob MyJob.bash
The mpiexec command from MyJob.bash is
mpiexec --display-map --np 2 --hostfile MyNodefile --map-by ppr:1:node <executable>.
MPI_Comm_spawn() causes this error to be printed:
Data for JOB [22220,1] offset 0 Total slots allocated 1 <=====
======================== JOB MAP ========================
Data for node: n001 Num slots: 1 Max slots: 0 Num procs: 1
Process OMPI jobid: [22220,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]
=============================================================
All nodes which are allocated for this job are already filled.
There are two things that occur to me:
(1) "Total slots allocated" is 1 above, but I need at least two slots available.
(2) It may not be right to try to specify a hostfile to mpiexec when
using Torque (though it is derived from the Torque hostfile ${PBS_NODEFILE}). Maybe my derived hostfile is being ignored.
Is there a way to make this work? I've tried recompiling OpenMPI
without Torque support, hopefully preventing OpenMPI from interacting
with it, but it didn't change the error message.
Answering my own question: adding the argument -l nodes=1:ppn=2 to the qsub command reserves 2 processors on the node, even though mpiexec is launching only one process. MPI_Comm_spawn() can then spawn the new process on the second reserved slot.
I also had to compile OpenMPI without Torque support, since including it causes my hostfile argument to be ignored and the Torque-generated hostfile to be used.

mysterious rtm abort using haswell tsx

I'm experimenting with the TSX extensions in Haswell by adapting an existing medium-sized (thousands of lines) codebase to use GCC's transactional memory extensions (which indirectly use Haswell TSX on this machine) instead of coarse-grained locks. I am using GCC's transactional_memory extensions, not writing my own _xbegin / _xend directly, and I set ITM_DEFAULT_METHOD=htm.
I'm having issues getting it to work fast enough because I get high rates of hardware transaction abort for mysterious reasons. As shown below, these aborts are not due to conflicts nor due to capacity limitations.
Here is the perf command I used to quantify the failure rate and underlying causes:
perf stat \
-e cpu/event=0x54,umask=0x2,name=tx_mem_abort_capacity_write/ \
-e cpu/event=0x54,umask=0x1,name=tx_mem_abort_conflict/ \
-e cpu/event=0x5d,umask=0x1,name=tx_exec_misc1/ \
-e cpu/event=0x5d,umask=0x2,name=tx_exec_misc2/ \
-e cpu/event=0x5d,umask=0x4,name=tx_exec_misc3/ \
-e cpu/event=0x5d,umask=0x8,name=tx_exec_misc4/ \
-e cpu/event=0x5d,umask=0x10,name=tx_exec_misc5/ \
-e cpu/event=0xc9,umask=0x1,name=rtm_retired_start/ \
-e cpu/event=0xc9,umask=0x2,name=rtm_retired_commit/ \
-e cpu/event=0xc9,umask=0x4,name=rtm_retired_aborted/pp \
-e cpu/event=0xc9,umask=0x8,name=rtm_retired_aborted_misc1/ \
-e cpu/event=0xc9,umask=0x10,name=rtm_retired_aborted_misc2/ \
-e cpu/event=0xc9,umask=0x20,name=rtm_retired_aborted_misc3/ \
-e cpu/event=0xc9,umask=0x40,name=rtm_retired_aborted_misc4/ \
-e cpu/event=0xc9,umask=0x80,name=rtm_retired_aborted_misc5/ \
./myprogram -th 1 -reps 3000000
So, the program runs some code with transactions in it 30 million times. Each request involves one gcc __transaction_atomic block. There is only one thread in this run.
This particular perf command captures most of the relevant tsx performance events described in the Intel software developers manual vol 3.
The output from perf stat is the following:
0 tx_mem_abort_capacity_write [26.66%]
0 tx_mem_abort_conflict [26.65%]
29,937,894 tx_exec_misc1 [26.71%]
0 tx_exec_misc2 [26.74%]
0 tx_exec_misc3 [26.80%]
0 tx_exec_misc4 [26.92%]
0 tx_exec_misc5 [26.83%]
29,906,632 rtm_retired_start [26.79%]
0 rtm_retired_commit [26.70%]
29,985,423 rtm_retired_aborted [26.66%]
0 rtm_retired_aborted_misc1 [26.75%]
0 rtm_retired_aborted_misc2 [26.73%]
29,927,923 rtm_retired_aborted_misc3 [26.71%]
0 rtm_retired_aborted_misc4 [26.69%]
176 rtm_retired_aborted_misc5 [26.67%]
10.583607595 seconds time elapsed
As you can see from the output:
The rtm_retired_start count is 30 million (matches input to program)
The rtm_retired_aborted count is about the same (no commits at all)
The abort_conflict and abort_capacity counts are 0, so these are not the reasons. Also, recall it is only one thread running, conflicts should be rare.
The only actual leads here are the high values of tx_exec_misc1 and rtm_retired_aborted_misc3, which are somewhat similar in description.
The Intel manual (vol 3) defines the rtm_retired_aborted_misc3 counter as:
code: C9H 20H
mnemonic: RTM_RETIRED.ABORTED_MISC3
description: Number of times an RTM execution aborted due to HLE unfriendly instructions.
The definition for tx_exec_misc1 has some similar words:
code: 5DH 01H
mnemonic: TX_EXEC.MISC1
description: Counts the number of times a class of instructions that may cause a transactional abort was executed. Since this is the count of execution, it may not always cause a transactional abort.
I checked the assembly location for the aborts using perf record/ perf report using high precision (PEBS) support for rtm_retired_aborted. The location has a mov instruction from register to register. No weird instruction names seen nearby.
Update:
Here are two things I've tried since then:
1) the tx_exec_misc1 and rtm_retired_aborted_misc3 signature we see here can be obtained, for example, by a dummy block of the form
for (int i = 0; i < 10000000; i++){
__transaction_atomic{
_xabort(1);
}
}
or one of the form
for (int i = 0; i < 10000000; i++){
__transaction_atomic{
printf("hello");
fflush(stdout);
}
}
In both cases, the perf counters look similar to what I see. However, in both cases the perf report for -e cpu/tx-abort/ points to the intuitively correct assembly lines: an xabort instruction for the first example and a syscall one for the second one. In the real codebase, the perf report points to a stack push right at the start of a function:
: 00000000004167e0 <myns::myfun()>:
100.00 : 4167e0: push %rbp
0.00 : 4167e1: mov %rsp,%rbp
0.00 : 4167e4: push %r15
I have also run the same command under the intel software development emulator. It turns out that the problem goes away in that case: I get no aborts as far as the application is concerned.
Though it's been the case for a while, I found this unanswered question while searching, so here's the answer: This is a hardware bug in Haswell and early Broadwell chips.
The particular hardware erratum assigned by Intel is HSW136, and is not fixable using microcode updates. Indeed, I think it was in stepping 4 that the feature was no longer reported as available by the cpuid instruction, even when there was (faulty) silicon on the chip to implement it.

Efficient variable watching in C/C++

I'm currently writing a multi-threaded, highly efficient and scalable algorithm. Because I have to guess a parameter for the code and I'm not sure how the calculation performs on a specific data set, I would like to watch a variable. The test only works with a real-world, huge data set. It is possible to analyze the collected data after profiling. Imagine the following simple code example (real code can contain multiple watch points):
// function gets called by loops of multiple threads
void payload(data_t* data, double threshold) {
double value = calc(data);
// here I want to watch the value
if (value < threshold) {
doSomething(data);
} else {
doSomethingElse(data);
}
}
I thought about the following approaches:
1. Use cout or other system outputs
2. Use a binary output (file, network)
3. Set a breakpoint via gdb/lldb
4. Use variable watching + logging via gdb/lldb
I'm not happy with the results because: To use 1. and 2. I have to change the code, but this is a debugging/evaluating task. Furthermore 1. requires locking and 1.+2. requires I/O operations, which heavily slows down the entire code and makes testing with real data nearly impossible. 3. is also too slow. To use 4., I have to know the variable address because it's not a global variable, but because threads get created by a dynamic scheduler, this would require breaking + stepping for each thread.
So my conclusion is that I need a profiler/debugger that works at machine code level and dumps/logs/watches the variable without double->string conversion and is highly efficient. In other words: I would like to profile the internal state of my algorithm without heavy slow-down and without deep modification. Does anybody know a tool that is able to do this?
OK, this took some time but now I'm able to present a solution for my problem. It's called tracepoints. Instead of breaking the program every time, it's more lightweight and (ideally) doesn't change performance/timing too much. It does not require code changes. Here is an explanation how to use them using gdb:
Make sure you compiled your program with debugging symbols (using the -g flag). Now, start the gdb server and provide a network port (e.g. 10000) and the program arguments:
gdbserver :10000 ./program --parameters you --want --to use
Now, switch to a second console and start gdb (program parameters are not required here):
gdb ./program
All following commands are entered in the gdb command line interface. So let's connect to the server:
target remote :10000
After you got the connection confirmation, use trace or ftrace to set a tracepoint to a specific source location (try ftrace first, it should be faster but doesn't work on all platforms):
trace source.c:127
This should create tracepoint #1. Now you can setup an action for this tracepoint. Here I want to collect the data from myVariable
action 1
collect myVariable
end
If you expect a lot of data or want to use the data later (after restart), you can set a binary trace file:
tsave trace.bin
Now, start tracing and run the program:
tstart
continue
You can wait for program exit or interrupt your program using CTRL-C (still on gdb console, not on server side). Continue by telling gdb that you want to stop tracing:
tstop
Now we come to the tricky part. I'm not really happy with the following code because it's really slow:
set pagination off
set logging file trace.txt
tfind start
while ($trace_frame != -1)
set logging on
printf "%f\n", myVariable
set logging off
tfind
end
This dumps all variable data to a text file. You can add some filter or preparation here. Now you're done and you can exit gdb. This will also shutdown the server:
quit
For detailed documentation especially for explanation of filtering and more advanced tracepoint positions, you can visit the following document: http://sourceware.org/gdb/onlinedocs/gdb/Tracepoints.html
To isolate trace file writing from your program execution, you can use cgroups or another network connected computer. When using another computer, you have to add the host to the port information (e.g. 192.168.1.37:10000). To load a binary trace file later, just start gdb as shown above (forget the server) and change the target:
gdb ./program
target tfile trace.bin
You can set a hardware watchpoint using the gdb debugger. For example, if you have a
bool b;
variable and you want to be notified every time its value has changed (by any thread),
you would declare a watchpoint like this:
(gdb) watch *(bool*)0x7fffffffe344
example:
root#comp:~# gdb prog
GNU gdb (GDB) 7.5-ubuntu
Copyright ...
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /dist/Debug/GNU-Linux-x86/cppapp_socket5_ipaddresses...done.
(gdb) watch *(bool*)0x7fffffffe344
Hardware watchpoint 1: *(bool*)0x7fffffffe344
(gdb) start
Temporary breakpoint 2 at 0x40079f: file main.cpp, line 26.
Starting program: /dist/Debug/GNU-Linux-x86/cppapp_socket5_ipaddresses
Hardware watchpoint 1: *(bool*)0x7fffffffe344
Old value = true
New value = false
main () at main.cpp:50
50 if (strcmp(mask, "255.0.0.0") != 0) {
(gdb) c
Continuing.
Hardware watchpoint 1: *(bool*)0x7fffffffe344
Old value = false
New value = true
main () at main.cpp:41
41 if (ifa ->ifa_addr->sa_family == AF_INET) { // check it is IP4
(gdb) c
Continuing.
mask:255.255.255.0
eth0 IP Address 192.168.1.5
[Inferior 1 (process 18146) exited normally]
(gdb) q