Debugging and killing apps on Mac OS X? (C++)

Hey all, I'm in the process of debugging a C++ app on Mac OS X 10.5. Occasionally I'll do something bad and cause a segfault or some other illegal operation. This results in the app hanging for a while, and eventually a system dialog notifying me of the crash. The wait between the hang and the dialog is significant: a few minutes. If I try to force-quit the application or kill -9 it from the command line, nothing happens. If I start the app from the debugger (gdb), I get back to the gdb prompt upon a crash and can exit the process cleanly. That's not ideal, though, as gdb is slow to load.
Anyway, can you recommend something? Is there a way to disable the crash-reporting mechanism in OS X?
Thanks.
Update 1:
Here are the zombie processes left over from an Xcode run. Apparently Xcode can't stop them properly either.
eightieight@eightieights-MacBook-Pro:~$ ps auxw|grep -i Reader
eightieight 28639 0.0 0.0 599828 504 s004 R+ 2:54pm 0:00.00 grep -i reader
eightieight 28288 0.0 1.1 1049324 45032 ?? UEs 2:46pm 0:00.89 /Users/eightieight/workspace/spark/spark/reader/browser/build/Debug/Reader.app/Contents/MacOS/Reader
eightieight 28271 0.0 1.1 1049324 45036 ?? UEs 2:45pm 0:00.89 /Users/eightieight/workspace/spark/spark/reader/browser/build/Debug/Reader.app/Contents/MacOS/Reader
eightieight 28146 0.0 1.1 1049324 44996 ?? UEs 2:39pm 0:00.90 /Users/eightieight/workspace/spark/spark/reader/browser/build/Debug/Reader.app/Contents/MacOS/Reader
eightieight 27421 0.0 1.1 1049328 45024 ?? UEs 2:29pm 0:00.88 /Users/eightieight/workspace/spark/spark/reader/browser/build/Debug/Reader.app/Contents/MacOS/Reader
eightieight 27398 0.0 1.1 1049324 45044 ?? UEs 2:28pm 0:00.90 /Users/eightieight/workspace/spark/spark/reader/browser/build/Debug/Reader.app/Contents/MacOS/Reader

There's the CrashReporterPrefs app that comes with Xcode (search for it with Spotlight; it should be in /Developer/Applications/Utilities). That can be set to Server mode to disable the application 'Unexpectedly Quit' dialog too.
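If you'd rather not hunt for the app, the same preference can apparently be set from the terminal; I believe Server mode corresponds to this user default (treat the exact value as an assumption):
defaults write com.apple.CrashReporter DialogType server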
Here's another suggestion:
sudo chmod 000 /System/Library/CoreServices/Problem\ Reporter.app
To re-enable, do the following:
sudo chmod 755 /System/Library/CoreServices/Problem\ Reporter.app
It might be that the application is dumping a large core file; you'd probably notice the effect on available disk space, though. You can switch off core dumping with:
sudo sysctl -w kern.coredump=0
Reactivate by setting =1.
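A per-shell alternative, if you don't want to change the system-wide setting, is to zero the core-size limit for the shell you launch the app from:
ulimit -c 0            # no core dumps for processes started from this shell
sysctl kern.coredump   # check the current system-wide setting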

This article from osxdaily.com says you just need to type:
defaults write com.apple.CrashReporter DialogType none
in the Terminal. I don't know if that will fix the delay, though.
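You can check or undo that setting later with:
defaults read com.apple.CrashReporter DialogType     # show the current value (errors out if unset)
defaults delete com.apple.CrashReporter DialogType   # revert to the default dialog behaviour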

I finally figured it out. In /System/Library/CoreServices:
---------- 1 root wheel 56752 11 Aug 2009 ReportPanic
ReportPanic had had all of its permissions stripped (mode 000); that must've been from my earlier attempts to disable the annoying report dialog, and it seems to be what was causing the long hang before the crash dialog. Live and learn. :]
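For anyone who ends up in the same state, restoring the permissions (mirroring the Problem Reporter commands above, and assuming the stock mode is 755) should put things back to normal:
sudo chmod 755 /System/Library/CoreServices/ReportPanic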

Related

Memcheck (valgrind) reporting inconsistent results on different docker hosts

I have a fairly robust set of CI tests for a C++ library; these tests (around 50) run on the same Docker image but on different machines.
On one machine ("A") all the memcheck (valgrind) tests pass (i.e. no memory leaks).
On the other ("B"), all tests produce the same valgrind error below.
51/56 MemCheck #51: combinations.cpp.x ....................***Exception: SegFault 0.14 sec
valgrind: m_libcfile.c:66 (vgPlain_safe_fd): Assertion 'newfd >= VG_(fd_hard_limit)' failed.
Cannot find memory tester output file: /builds/user/boost-multi/build/Testing/Temporary/MemoryChecker.51.log
The machines are very similar; both are Intel i7.
The only difference I can think of is that one is:
A. Ubuntu 22.10, Linux 5.19.0-29, docker 20.10.16
and the other:
B. Fedora 37, Linux 6.1.7-200.fc37.x86_64, docker 20.10.23
and perhaps some configuration of docker I don't know about.
Is there some configuration of Docker that might explain the difference? Or of the kernel? Or some Valgrind option to work around this problem?
I know for a fact that on real machines (not Docker) Valgrind doesn't report any memory errors.
The options I use for Valgrind are always --leak-check=yes --num-callers=51 --trace-children=yes --leak-check=full --track-origins=yes --gen-suppressions=all.
Valgrind version in the image is 3.19.0-1 from the debian:testing image.
Note that this isn't an error reported by Valgrind; it is an error within Valgrind itself.
Perhaps, after all, the only difference is that the Ubuntu version of Valgrind is compiled in release mode and the error is just ignored. (<-- this doesn't make sense: Valgrind is the same in both cases because the Docker image is the same.)
I tried removing --num-callers=51, or setting it to 12 (the default value), to no avail.
I found a difference between the images and the real machine and a workaround.
It has to do with the number of file descriptors.
(This was pointed out briefly in one of the valgrind bug threads about Mac OS: https://bugs.kde.org/show_bug.cgi?id=381815#c0)
Inside the docker image running in Ubuntu 22.10:
ulimit -n
1048576
Inside the docker image running in Fedora 37:
ulimit -n
1073741816
(which looks like a ridiculous number or an overflow)
In the Fedora 37 and the Ubuntu 22.10 real machines:
ulimit -n
1024
So, doing this in the CI recipe "solved" the problem:
- ulimit -n # report the current value
- ulimit -n 1024 # workaround needed by valgrind in docker running on Fedora 37
- ctest ... (with memcheck)
I have no idea why this workaround works.
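An alternative I haven't fully tested would be to set the limit at the container level instead of inside the CI recipe, using Docker's --ulimit flag (the image name is a placeholder):
docker run --ulimit nofile=1024:1024 my-ci-image ctest ...   # soft:hard RLIMIT_NOFILE for the whole container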
For reference:
$ ulimit --help
...
-n the maximum number of open file descriptors
First off, "you are doing it wrong" with your Valgrind arguments. For CI I recommend a two stage approach. Use as many default arguments as possible for the CI run (--trace-children=yes may well be necessary but not the others). If your codebase is leaky then you may need to check for leaks, but if you can maintain a zero leak policy (or only suppressed leaks) then you can tell if there are new leaks from the summary. After your CI detects an issue you can run again with the kitchen sink options to get full information. Your runs will be significantly faster without all those options.
Back to the question.
Valgrind is trying to dup() some file (the guest exe, a tempfile, or something like that). The fd that it gets is higher than what it thinks the nofile rlimit is, so it asserts.
A billion files is ridiculous.
Valgrind will try to call prlimit RLIMIT_NOFILE, with a fallback call to rlimit, and a second fallback to setting the limit to 1024.
To really see what is going on you need to modify the Valgrind source (m_main.c, setup_file_descriptors: set the local show to True). With this change I see
fd limits: host, before: cur 65535 max 65535
fd limits: host, after: cur 65535 max 65535
fd limits: guest : cur 65523 max 65523
Otherwise with strace I see
2049 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=65535, rlim_max=65535}) = 0
2049 prlimit64(0, RLIMIT_NOFILE, {rlim_cur=65535, rlim_max=65535}, NULL) = 0
(all the above on RHEL 7.6 amd64)
EDIT: Note that the above shows Valgrind querying and setting the resource limit. If you use ulimit to lower the limit before running Valgrind, then Valgrind will try to honour that limit. Also note that Valgrind reserves a small number (8) of files for its own use.

HTCondor - Partitionable slot not working

I am following the tutorials Center for High Throughput Computing and Introduction to Configuration on the HTCondor website to set up a partitionable slot. Before any configuration I run
condor_status
and get the following output.
I update the file 00-minicondor in /etc/condor/config.d by adding the following lines at the end of the file.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=4
SLOT_TYPE_1_PARTITIONABLE = TRUE
and reconfigure
sudo condor_reconfig
Now with
condor_status
I get this output as expected. Now, I run the following command to check everything is fine
condor_status -af Name Slotype Cpus
and find slot1#ip-172-31-54-214.ec2.internal undefined 1 instead of slot1#ip-172-31-54-214.ec2.internal Partitionable 4 61295 that is what I would expect. Moreover, when I try to summit a job that asks for more than 1 cpu it does not allocate space for it (It stays waiting forever) as it should.
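For reference, a minimal submit description that reproduces this looks roughly like the sketch below (the executable and file name are just placeholders):
cat > multicpu.sub <<'EOF'
executable   = /bin/sleep
arguments    = 300
request_cpus = 2
queue
EOF
condor_submit multicpu.sub
condor_q   # the job stays idle instead of matching the partitionable slot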
I don't know if I made some mistake during the installation process or what could be happening. I would really appreciate any help!
EXTRA INFO: If it can be of any help, I have installed HTCondor with the command
curl -fsSL https://get.htcondor.org | sudo /bin/bash -s -- --no-dry-run
on Ubuntu 18.04 running on an old p2.xlarge instance (it has 4 cores).
UPDATE: After rebooting the whole thing it seems to be working. I can now submit jobs with different CPU requests and it will start them properly.
The only issue that persists is that the memory allocation is not displayed properly, but in reality it is allocating enough memory for the job (in this case around 12 GB).
If I run again
condor_status -af Name Slotype Cpus
I still get output I am not supposed to, but at least it is showing the correct number of CPUs (even though the slot type still says undefined).
What is the output of condor_q -better when the job is idle?

Unexpected crash while clustering with RStudio on EC2 (AWS)

I am experiencing crashes with RStudio on EC2 while clustering, currently with 32 cores, using the doSNOW package. The problem keeps happening, and the logs in RStudio and the AWS logs show the following problems:
The previous R session was abnormally terminated due to an unexpected crash. You may have lost workspace data as a result of this crash
I have tried a workaround found on the RStudio community page like this:
rm -rf ~/.rstudio
I restarted and terminated RStudio many times, but it didn't help. I changed to a bigger instance:
r4.8xlarge
but the calculation couldn't be completed there either.
Apr 30 14:14:23 ip-172-31-46-102 rsession-rstudio[12984]: ERROR session hadabend; LOGGED FROM: rstudio::core::Error {anonymous}::rInit(const rstudio::r::session::RInitInfo&) /home/ubuntu/rstudio/src/cpp/session/SessionMain.cpp:563
This is the code that is running when RStudio crashes:
# Clustering using Gower distance and hclust()
d <- sapply(1:nrow(data), function(i) gower_dist(data[i,], data))  # full pairwise distance matrix, built row by row
d <- as.dist(d)  # convert the matrix into a dist object
h <- hclust(d)   # this causes the error
The problem is solved: hclust() is not really suitable for big data. Replacing it with flashClust (a faster drop-in replacement) means RStudio no longer crashes, and the calculation completes successfully.

Ember CLI build killed

I build my Ember CLI app inside a Docker container on startup. The build fails without an error message; it just says Killed:
root@fstaging:/frontend/source# node_modules/ember-cli/bin/ember build -prod
version: 1.13.15
Could not find watchman, falling back to NodeWatcher for file system events.
Visit http://www.ember-cli.com/user-guide/#watchman for more info.
Buildingember-auto-register-helpers is not required for Ember 2.0.0 and later please remove from your `package.json`.
Building.DEPRECATION: The `bind-attr` helper ('app/templates/components/file-selector.hbs' # L1:C7) is deprecated in favor of HTMLBars-style bound attributes.
at isBindAttrModifier (/app/source/bower_components/ember/ember-template-compiler.js:11751:34)
Killed
The same Docker image starts up successfully in another environment, but without hardware constraints. Does Ember CLI have hard-coded hardware constraints for the build process? The RAM is limited to 128 MB and swap to 2 GB.
That is likely not enough memory for Ember CLI to do what it needs. You are correct in that the process is being killed because of an OOM situation. If you log in to the host and take a look at the dmesg output, you will probably see something like:
V8 WorkerThread invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
V8 WorkerThread cpuset=867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032 mems_allowed=0
CPU: 0 PID: 2027 Comm: V8 WorkerThread Tainted: G O 4.1.13-boot2docker #1
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
0000000000000000 00000000000000d0 ffffffff8154e053 ffff880039381000
ffffffff8154d3f7 ffff8800395db528 ffff8800392b4528 ffff88003e214580
ffff8800392b4000 ffff88003e217080 ffffffff81087faf ffff88003e217080
Call Trace:
[<ffffffff8154e053>] ? dump_stack+0x40/0x50
[<ffffffff8154d3f7>] ? dump_header.isra.10+0x8c/0x1f4
[<ffffffff81087faf>] ? finish_task_switch+0x4c/0xda
[<ffffffff810f46b1>] ? oom_kill_process+0x99/0x31c
[<ffffffff811340e6>] ? task_in_mem_cgroup+0x5d/0x6a
[<ffffffff81132ac5>] ? mem_cgroup_iter+0x1c/0x1b2
[<ffffffff81134984>] ? mem_cgroup_oom_synchronize+0x441/0x45a
[<ffffffff8113402f>] ? mem_cgroup_is_descendant+0x1d/0x1d
[<ffffffff810f4d77>] ? pagefault_out_of_memory+0x17/0x91
[<ffffffff815565d8>] ? page_fault+0x28/0x30
Task in /docker/867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032 killed as a result of limit of /docker/867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032
memory: usage 131072kB, limit 131072kB, failcnt 2284203
memory+swap: usage 262032kB, limit 262144kB, failcnt 970540
kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /docker/867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032: cache:340KB rss:130732KB rss_huge:10240KB mapped_file:8KB writeback:0KB swap:130960KB inactive_anon:72912KB active_anon:57880KB inactive_file:112KB active_file:40KB unevictable:0KB
[ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 1993] 0 1993 380 1 6 3 17 0 sh
[ 2025] 0 2025 203490 32546 221 140 32713 0 npm
Memory cgroup out of memory: Kill process 2025 (npm) score 1001 or sacrifice child
Killed process 2025 (npm) total-vm:813960kB, anon-rss:130184kB, file-rss:0kB
It might be worthwhile to profile the container using something like https://github.com/google/cadvisor to find out what kind of memory maximums it may need.
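If the environment allows it, simply raising the container's memory limit before the build is the most direct fix; a sketch with a placeholder image name:
docker run -m 1g --memory-swap 2g my-frontend-image node_modules/ember-cli/bin/ember build -prod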

How to run record instruction-history and function-call-history in GDB?

(EDIT: per the first answer below, the current "trick" seems to be using an Atom processor. But I hope some gdb guru can answer whether this is a fundamental limitation, or whether adding support for other processors is on the roadmap.)
Reverse execution seems to be working in my environment: I can reverse-continue, see a plausible record log, and move around within it:
(gdb) start
...Temporary breakpoint 5 at 0x8048460: file bang.cpp, line 13.
Starting program: /home/thomasg/temp/./bang
Temporary breakpoint 5, main () at bang.cpp:13
13 f(1000);
(gdb) record
(gdb) continue
Continuing.
Breakpoint 3, f (d=900) at bang.cpp:5
5 if(d) {
(gdb) info record
Active record target: record-full
Record mode:
Lowest recorded instruction number is 1.
Highest recorded instruction number is 1005.
Log contains 1005 instructions.
Max logged instructions is 200000.
(gdb) reverse-continue
Continuing.
Breakpoint 3, f (d=901) at bang.cpp:5
5 if(d) {
(gdb) record goto end
Go forward to insn number 1005
#0 f (d=900) at bang.cpp:5
5 if(d) {
However the instruction and function histories aren't available:
(gdb) record instruction-history
You can't do that when your target is `record-full'
(gdb) record function-call-history
You can't do that when your target is `record-full'
And the only target type available is full; the other documented type, "btrace", fails with "Target does not support branch tracing."
So quite possibly it just isn't supported for this target, but as it's a mainstream modern one (gdb 7.6.1-ubuntu, on amd64 Linux Mint "Petra" running an "Intel(R) Core(TM) i5-3570") I'm hoping that I've overlooked a crucial step or config?
It seems that there is no other solution except a CPU that supports it.
More precisely, your kernel has to support Intel Processor Tracing (Intel PT). This can be checked in Linux with:
grep intel_pt /proc/cpuinfo
See also: https://unix.stackexchange.com/questions/43539/what-do-the-flags-in-proc-cpuinfo-mean
These commands only work in record btrace mode.
In the GDB source commit beab5d9, it is nat/linux-btrace.c:kernel_supports_pt that checks if we can enter btrace. The following checks are carried out:
check if /sys/bus/event_source/devices/intel_pt/type exists and read the type
do a syscall (SYS_perf_event_open, &attr, child, -1, -1, 0); with the read type, and see if it returns >=0. TODO: why not use the C wrapper?
The first check fails for me: the file does not exist.
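That first check is easy to reproduce by hand (the sysfs path is the one the GDB source looks for):
ls /sys/bus/event_source/devices/intel_pt/ 2>/dev/null || echo "no intel_pt PMU"
cat /sys/bus/event_source/devices/intel_pt/type   # holds the perf event type on PT-capable systems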
Kernel side
cd into the kernel 4.1 source and:
git grep '"intel_pt"'
we find arch/x86/kernel/cpu/perf_event_intel_pt.c which sets up that file. In particular, it does:
if (!test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT))
    goto fail;
so intel_pt is a pre-requisite.
How I've found kernel_supports_pt
First grep for:
git grep 'Target does not support branch tracing.'
which leads us to btrace.c:btrace_enable. After a quick debug with:
gdb -q -ex start -ex 'b btrace_enable' -ex c --args /home/ciro/git/binutils-gdb/install/bin/gdb --batch -ex start -ex 'record btrace' ./hello_world.out
VirtualBox does not support it either: Extract execution log from gdb record in a VirtualBox VM
Intel SDE
Intel SDE 7.21 already has this CPU feature, checked with:
./sde64 -- cpuid | grep 'Intel processor trace'
But I'm not sure if the Linux kernel can be run on it: https://superuser.com/questions/950992/how-to-run-the-linux-kernel-on-intel-software-development-emulator-sde
Other GDB methods
More generic questions, with less efficient software solutions:
call graph: List of all function calls made in an application
instruction trace: Displaying each assembly instruction executed in gdb
At least a partial answer (for the "am I doing it wrong" aspect) - from gdb-7.6.50.20140108/gdb/NEWS
* A new record target "record-btrace" has been added. The new target
uses hardware support to record the control-flow of a process. It
does not support replaying the execution, but it implements the
below new commands for investigating the recorded execution log.
This new recording method can be enabled using:
record btrace
The "record-btrace" target is only available on Intel Atom processors
and requires a Linux kernel 2.6.32 or later.
* Two new commands have been added for record/replay to give information
about the recorded execution without having to replay the execution.
The commands are only supported by "record btrace".
record instruction-history prints the execution history at
instruction granularity
record function-call-history prints the execution history at
function granularity
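For completeness, on an Atom (or otherwise btrace-capable) machine the session would presumably look something like this; a sketch only, since I can't verify it on this hardware:
(gdb) record btrace
(gdb) continue
(gdb) record function-call-history
(gdb) record instruction-history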
It's not often that I envy the owner of an Atom processor ;-)
I'll edit the question to refocus upon the question of workarounds or plans for future support.