Memcheck (valgrind) reporting inconsistent results on different docker hosts - c++

I have a fairly robust set of CI tests for a C++ library; these tests (around 50) run on the same docker image but on different machines.
On one machine ("A") all the memcheck (valgrind) tests pass (i.e. no memory leaks).
On the other ("B"), all tests produce the same valgrind error below.
51/56 MemCheck #51: combinations.cpp.x ....................***Exception: SegFault 0.14 sec
valgrind: m_libcfile.c:66 (vgPlain_safe_fd): Assertion 'newfd >= VG_(fd_hard_limit)' failed.
Cannot find memory tester output file: /builds/user/boost-multi/build/Testing/Temporary/MemoryChecker.51.log
The machines are very similar; both have Intel i7 processors.
The only difference I can think of is that one is:
A. Ubuntu 22.10, Linux 5.19.0-29, docker 20.10.16
and the other:
B. Fedora 37, Linux 6.1.7-200.fc37.x86_64, docker 20.10.23
and perhaps some configuration of docker I don't know about.
Is there some configuration of docker that might generate the difference? Or of the kernel? Or some option in valgrind to work around this problem?
I know for a fact that in real machines (not docker) valgrind doesn't produce any memory error.
The options I use for valgrind are always --leak-check=yes --num-callers=51 --trace-children=yes --leak-check=full --track-origins=yes --gen-suppressions=all.
Valgrind version in the image is 3.19.0-1 from the debian:testing image.
Note that this isn't an error reported by valgrind, it is an error within valgrind.
Perhaps, after all, the only difference is that the Ubuntu version of valgrind is compiled in release mode and the error is just ignored. (<-- this doesn't make sense, valgrind is the same in both cases because the docker image is the same).
I tried removing --num-callers=51 or setting it at 12 (default value), to no avail.

I found a difference between the images and the real machine and a workaround.
It has to do with the number of file descriptors.
(This was pointed out briefly in one of the threads on the valgrind bug tracker about Mac OS: https://bugs.kde.org/show_bug.cgi?id=381815#c0)
Inside the docker image running in Ubuntu 22.10:
ulimit -n
1048576
Inside the docker image running in Fedora 37:
ulimit -n
1073741816
(which looks like a ridiculous number or an overflow)
In the Fedora 37 and the Ubuntu 22.10 real machines:
ulimit -n
1024
So, doing this in the CI recipe, "solved" the problem:
- ulimit -n # reports current value
- ulimit -n 1024 # workaround needed by valgrind in docker running in Fedora 37
- ctest ... (with memcheck)
I have no idea why this workaround works.
For reference:
$ ulimit --help
...
-n the maximum number of open file descriptors
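For reference, the same numbers can also be read programmatically; this is just a small illustrative C++ snippet (standard getrlimit, nothing docker-specific) that prints the soft and hard limits behind ulimit -n:

#include <cstdio>
#include <sys/resource.h>

// Print the soft and hard limits on open file descriptors,
// i.e. the values reported by `ulimit -n` and `ulimit -Hn`.
int main() {
    rlimit lim{};
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        std::perror("getrlimit");
        return 1;
    }
    std::printf("soft: %llu hard: %llu\n",
                (unsigned long long)lim.rlim_cur,
                (unsigned long long)lim.rlim_max);
    return 0;
}

Running this inside the two containers should reproduce the 1048576 vs 1073741816 difference seen above.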

First off, "you are doing it wrong" with your Valgrind arguments. For CI I recommend a two stage approach. Use as many default arguments as possible for the CI run (--trace-children=yes may well be necessary but not the others). If your codebase is leaky then you may need to check for leaks, but if you can maintain a zero leak policy (or only suppressed leaks) then you can tell if there are new leaks from the summary. After your CI detects an issue you can run again with the kitchen sink options to get full information. Your runs will be significantly faster without all those options.
Back to the question.
Valgrind is trying to dup() some file descriptor (for the guest exe, a tempfile or something like that). The fd that it gets is higher than what it thinks the nofile rlimit is, so it is asserting.
A billion files is ridiculous.
Valgrind will try to call prlimit RLIMIT_NOFILE, with a fallback call to rlimit, and a second fallback to setting the limit to 1024.
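Roughly, and only as a simplified sketch of the behaviour described above (not Valgrind's actual code), that fallback chain amounts to:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>

// Simplified sketch: determine the nofile limit as described above,
// prlimit() first, then getrlimit(), then a hard-coded default of 1024.
static long pick_nofile_limit(void) {
    struct rlimit lim;
    if (prlimit(0, RLIMIT_NOFILE, NULL, &lim) == 0)  /* query only, pid 0 = self */
        return (long)lim.rlim_cur;
    if (getrlimit(RLIMIT_NOFILE, &lim) == 0)
        return (long)lim.rlim_cur;
    return 1024;
}

int main(void) {
    printf("nofile limit seen: %ld\n", pick_nofile_limit());
    return 0;
}

(Valgrind also reserves a small range of descriptors at the top of that limit for its own use, as noted below.)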
To really see what is going on you need to modify the Valgrind source (m_main.c, setup_file_descriptors, set the local show to True). With this change I see
fd limits: host, before: cur 65535 max 65535
fd limits: host, after: cur 65535 max 65535
fd limits: guest : cur 65523 max 65523
Otherwise with strace I see
2049 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=65535, rlim_max=65535}) = 0
2049 prlimit64(0, RLIMIT_NOFILE, {rlim_cur=65535, rlim_max=65535}, NULL) = 0
(all the above on RHEL 7.6 amd64)
EDIT: Note that the above show Valgrind querying and setting the resource limit. If you use ulimit to lower the limit before running Valgrind, then Valgrind will try to honour that limit. Also note that Valgrind reserves a small number (8) of files for its own use.

Related

HTCondor - Partitionable slot not working

I am following the tutorials Center for High Throughput Computing and Introduction to Configuration on the HTCondor website to set up a partitionable slot. Before any configuration I run
condor_status
and get the following output.
I update the file 00-minicondor in /etc/condor/config.d by adding the following lines at the end of the file.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=4
SLOT_TYPE_1_PARTITIONABLE = TRUE
and reconfigure
sudo condor_reconfig
Now with
condor_status
I get this output as expected. Now, I run the following command to check everything is fine
condor_status -af Name Slotype Cpus
and find slot1#ip-172-31-54-214.ec2.internal undefined 1 instead of slot1#ip-172-31-54-214.ec2.internal Partitionable 4 61295, which is what I would expect. Moreover, when I try to submit a job that asks for more than 1 cpu, it is not allocated as it should be (it stays waiting forever).
I don't know if I made some mistake during the installation process or what could be happening. I would really appreciate any help!
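For completeness, the kind of submit description meant above is something along these lines (all file names are placeholders; request_cpus is the part that matters for the test):
executable     = my_job.sh
request_cpus   = 2
request_memory = 2GB
output         = my_job.out
error          = my_job.err
log            = my_job.log
queue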
EXTRA INFO: In case it helps, I have installed HTCondor with the command
curl -fsSL https://get.htcondor.org | sudo /bin/bash -s -- --no-dry-run
on Ubuntu 18.04 running on an old p2.xlarge instance (it has 4 cores).
UPDATE: After rebooting the whole thing it seems to be working. I can now send jobs with different CPUs requests and it will start them properly.
The only issue I would say persists is that Memory allocation is not showing properly, for example:
But in reality it is allocating enough memory for the job (in this case around 12 GB).
If I run again
condor_status -af Name Slotype Cpus
I still get something I am not supposed to
But at least it is showing the correct number of CPUs (even if it just says undefined).
What is the output of condor_q -better when the job is idle?

CUDA GPU __global__ function does not complete

#include <cstdio>

__global__ void functionA()
{
    printf("functionA");
}

int main()
{
    printf("main1");
    functionA<<<1,1>>>();
    printf("main2");
}
I'm trying to run a simple test with the above. But the program only outputs "main1". The program should output "functionA" and "main2" too.
This seems to have two reasons:
First of all you need to add
cudaDeviceSynchronize();
after the CUDA routine in order to block the main until the device has completed all tasks.
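Applied to the snippet from the question, a minimal corrected version could look like this (newlines added to the printf calls only to make the output easier to read; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>   // implicit with nvcc, explicit here for clarity

__global__ void functionA()
{
    printf("functionA\n");
}

int main()
{
    printf("main1\n");
    functionA<<<1, 1>>>();
    // Kernel launches are asynchronous: block here until the device is done,
    // which also flushes the device-side printf buffer.
    cudaDeviceSynchronize();
    printf("main2\n");
    return 0;
}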
Furthermore this might happen if you set the wrong GPU architecture/compute capability XX when compiling the code
$ nvcc -gencode=arch=compute_XX,code=sm_XX -o my_app my_app.cu
In this case it seems only the host code is run while the parts for the accelerator are omitted. You can find an overview of the corresponding number XX for the different hardware generations over here. The K20m you are running corresponds to 35. So it should be
$ nvcc -gencode=arch=compute_35,code=sm_35 -o my_app my_app.cu
in your case.
This might also occur if you have multiple graphics accelerators in your system and the code is executed on the wrong one. Each graphics card/accelerator is assigned a particular device id. Device 0 should normally be assigned to the most powerful device and is used by default. The first time I compiled the code on my system, which contains a powerful Tesla K80 (architecture 37) and a low-power Quadro P620 (architecture 60), I selected 37 and got the same error as you, while selecting 60 made the code run. I then used the Querying Device Properties example to list the CUDA-capable devices and their corresponding device ids, only to find out that on my system the Tesla K80 shows up as devices 1 and 2 while the simple Quadro P620 graphics card is device 0. I assume this is because the K80 is deprecated in CUDA 11!
You can select the device inside your code with cudaSetDevice or change it when launching the program with
$ CUDA_VISIBLE_DEVICES="1" ./my_app
where 1 has to be replaced by the device id you wish to use. Doing so should make your code run without any problems.
You can also test whether this really is the issue by cloning the GitHub repository of "Learn CUDA Programming", browsing to Chapter01/01_cuda_introduction/01_hello_world/, building with $ make and finally running it with $ ./hello_world. It automatically compiles for multiple architectures/compute capabilities and should therefore run without any issue!

why does use "perf stat -B -e branches,branch-misses ./a.out" to get one process branch prediction miss is always zero in centos7 [duplicate]

I want to evaluate the performance of one process using the "branch-misses" hardware event. But when I use perf stat to collect "branch-misses" data, it always returns 0, simply because my OS is running inside KVM.
Since it is difficult for me to get a real machine to do the test on, I want to know whether there is any method to get "branch-misses" from "perf stat" while running inside KVM.
I really need your help. Thanks a lot.

Limiting Java 8 Memory Consumption

I'm running three Java 8 JVMs on a 64 bit Ubuntu VM which was built from a minimal install with nothing extra running other than the three JVMs. The VM itself has 2GB of memory and each JVM was limited by -Xmx512M which I assumed would be fine as there would be a couple of hundred MB spare.
A few weeks ago, one crashed and the hs_err_pid dump showed:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 196608 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Use 64 bit Java on a 64 bit OS
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
I restarted the JVM with a reduced heap size of 384MB and so far everything is fine. However, when I now look at the VM using the ps command, sorted by descending RSS size, I see
RSS %MEM VSZ PID CMD
708768 35.4 2536124 29568 java -Xms64m -Xmx512m ...
542776 27.1 2340996 12934 java -Xms64m -Xmx384m ...
387336 19.3 2542336 6788 java -Xms64m -Xmx512m ...
12128 0.6 288120 1239 /usr/lib/snapd/snapd
4564 0.2 21476 27132 -bash
3524 0.1 5724 1235 /sbin/iscsid
3184 0.1 37928 1 /sbin/init
3032 0.1 27772 28829 ps ax -o rss,pmem,vsz,pid,cmd --sort -rss
3020 0.1 652988 1308 /usr/bin/lxcfs /var/lib/lxcfs/
2936 0.1 274596 1237 /usr/lib/accountsservice/accounts-daemon
..
..
and the free command shows
total used free shared buff/cache available
Mem: 1952 1657 80 20 213 41
Swap: 0 0 0
Taking the first process as an example, there is an RSS size of 708768 KB even though the heap limit would be 524288 KB (512*1024).
I am aware that extra memory is used beyond the JVM heap, but the question is: how can I control this to ensure I do not run out of memory again? I am trying to set the heap size for each JVM as large as possible without crashing them.
Or is there a good general guideline as to how to set JVM heap size in relation to overall memory availability ?
There does not appear to be a way of controlling how much extra memory the JVM will use over the heap. However by monitoring the application over a period of time, a good estimate of this amount can be obtained. If the overall consumption of the java process is higher than desired, then the heap size can be reduced. Further monitoring is needed to see if this impacts performance.
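A small sketch of the in-process part of that monitoring, using only the standard management beans (it covers the memory the JVM itself tracks plus the live thread count, not every native allocation):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class MemoryReport {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        System.out.printf("heap used/committed/max: %d/%d/%d MB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
        System.out.printf("non-heap used/committed: %d/%d MB%n",
                nonHeap.getUsed() >> 20, nonHeap.getCommitted() >> 20);
        System.out.printf("live threads: %d%n", threads);
    }
}

Comparing these figures against the RSS reported by ps gives an idea of how much memory is neither heap nor tracked non-heap (thread stacks, direct buffers, JIT scratch space, and so on).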
Continuing with the example above and using the command ps ax -o rss,pmem,vsz,pid,cmd --sort -rss we see usage as of today is
RSS %MEM VSZ PID CMD
704144 35.2 2536124 29568 java -Xms64m -Xmx512m ...
429504 21.4 2340996 12934 java -Xms64m -Xmx384m ...
367732 18.3 2542336 6788 java -Xms64m -Xmx512m ...
13872 0.6 288120 1239 /usr/lib/snapd/snapd
..
..
These java processes are all running the same application but with different data sets. The first process (29568) has stayed stable using about 190M beyond the heap limit while the second (12934) has reduced from 156M to 35M. The total memory usage of the third has stayed well under the heap size which suggests the heap limit could be reduced.
It would seem that allowing 200MB of extra non-heap memory per Java process here would be more than enough, as that gives 600MB of leeway in total. Subtracting this from 2GB leaves 1400MB, so the three -Xmx values combined should stay below this amount.
As can be gleaned from the article pointed out in a comment by Fairoz, there are many different ways in which the JVM can use non-heap memory. One that is measurable, though, is the thread stack size. The default for a JVM can be found on Linux using java -XX:+PrintFlagsFinal -version | grep ThreadStackSize. In the case above it is 1MB, and as there are about 25 threads, we can safely say that at least 25MB extra will always be required.

How to run record instruction-history and function-call-history in GDB?

(EDIT: per the first answer below, the current "trick" seems to be using an Atom processor. But I hope some gdb guru can answer whether this is a fundamental limitation, or whether adding support for other processors is on the roadmap?)
Reverse execution seems to be working in my environment: I can reverse-continue, see a plausible record log, and move around within it:
(gdb) start
...Temporary breakpoint 5 at 0x8048460: file bang.cpp, line 13.
Starting program: /home/thomasg/temp/./bang
Temporary breakpoint 5, main () at bang.cpp:13
13 f(1000);
(gdb) record
(gdb) continue
Continuing.
Breakpoint 3, f (d=900) at bang.cpp:5
5 if(d) {
(gdb) info record
Active record target: record-full
Record mode:
Lowest recorded instruction number is 1.
Highest recorded instruction number is 1005.
Log contains 1005 instructions.
Max logged instructions is 200000.
(gdb) reverse-continue
Continuing.
Breakpoint 3, f (d=901) at bang.cpp:5
5 if(d) {
(gdb) record goto end
Go forward to insn number 1005
#0 f (d=900) at bang.cpp:5
5 if(d) {
However the instruction and function histories aren't available:
(gdb) record instruction-history
You can't do that when your target is `record-full'
(gdb) record function-call-history
You can't do that when your target is `record-full'
And the only target type available is full; the other documented type "btrace" fails with "Target does not support branch tracing."
So quite possibly it just isn't supported for this target, but as it's a mainstream modern one (gdb 7.6.1-ubuntu, on amd64 Linux Mint "Petra" running an "Intel(R) Core(TM) i5-3570") I'm hoping that I've overlooked a crucial step or config?
It seems that there is no other solution except a CPU that supports it.
More precisely, your kernel has to support Intel Processor Tracing (Intel PT). This can be checked in Linux with:
grep intel_pt /proc/cpuinfo
See also: https://unix.stackexchange.com/questions/43539/what-do-the-flags-in-proc-cpuinfo-mean
These commands only work in record btrace mode.
In the GDB source commit beab5d9, it is nat/linux-btrace.c:kernel_supports_pt that checks if we can enter btrace. The following checks are carried out:
check if /sys/bus/event_source/devices/intel_pt/type exists and read the type
do a syscall (SYS_perf_event_open, &attr, child, -1, -1, 0); with the read type, and see if it returns >=0. TODO: why not use the C wrapper?
The first check fails for me: the file does not exist.
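For reference, here is a standalone approximation of those two checks (simplified: the real code sets more perf_event_attr fields and runs the second check against a forked child rather than the current process):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void) {
    /* check 1: does the intel_pt PMU exist, and what is its event type? */
    FILE *f = fopen("/sys/bus/event_source/devices/intel_pt/type", "r");
    if (!f) { puts("no intel_pt PMU: first check fails"); return 1; }
    int type = 0;
    if (fscanf(f, "%d", &type) != 1) { fclose(f); return 1; }
    fclose(f);

    /* check 2: can an intel_pt event actually be opened? (on ourselves here) */
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = type;
    attr.size = sizeof(attr);
    long fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { puts("perf_event_open failed: second check fails"); return 1; }
    puts("Intel PT looks usable");
    close((int)fd);
    return 0;
}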
Kernel side
cd into the kernel 4.1 source and:
git grep '"intel_pt"'
we find arch/x86/kernel/cpu/perf_event_intel_pt.c which sets up that file. In particular, it does:
if (!test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT))
goto fail;
so intel_pt is a pre-requisite.
How I've found kernel_supports_pt
First grep for:
git grep 'Target does not support branch tracing.'
which leads us to btrace.c:btrace_enable. After a quick debug with:
gdb -q -ex start -ex 'b btrace_enable' -ex c --args /home/ciro/git/binutils-gdb/install/bin/gdb --batch -ex start -ex 'record btrace' ./hello_world.out
Virtual box does not support it either: Extract execution log from gdb record in a VirtualBox VM
Intel SDE
Intel SDE 7.21 already has this CPU feature, checked with:
./sde64 -- cpuid | grep 'Intel processor trace'
But I'm not sure if the Linux kernel can be run on it: https://superuser.com/questions/950992/how-to-run-the-linux-kernel-on-intel-software-development-emulator-sde
Other GDB methods
More generic questions, with less efficient software solutions:
call graph: List of all function calls made in an application
instruction trace: Displaying each assembly instruction executed in gdb
At least a partial answer (for the "am I doing it wrong" aspect) - from gdb-7.6.50.20140108/gdb/NEWS
* A new record target "record-btrace" has been added. The new target
uses hardware support to record the control-flow of a process. It
does not support replaying the execution, but it implements the
below new commands for investigating the recorded execution log.
This new recording method can be enabled using:
record btrace
The "record-btrace" target is only available on Intel Atom processors
and requires a Linux kernel 2.6.32 or later.
* Two new commands have been added for record/replay to give information
about the recorded execution without having to replay the execution.
The commands are only supported by "record btrace".
record instruction-history prints the execution history at
instruction granularity
record function-call-history prints the execution history at
function granularity
It's not often that I envy the owner of an Atom processor ;-)
I'll edit the question to refocus upon the question of workarounds or plans for future support.