Can someone explain this valgrind output with open mpi? - c++

I have an application that uses OpenMPI, and I run it on both Windows and Linux. The Windows version works fine, but running it on Linux causes a memory allocation error. The problem occurs only for certain application arguments that require more calculations.
To rule out memory leaks, I checked the Linux version with Valgrind and got some output. I then searched for information about that output and found some posts on Stack Overflow and GitHub (not enough reputation to attach links). Following those, I updated OpenMPI to 2.0.2 and checked the app again, which produced new output. Is this a memory leak in OpenMPI, or am I doing something wrong?
A piece of output:
==16210== 4 bytes in 1 blocks are definitely lost in loss record 5 of 327
==16210== at 0x4C2DBB6: malloc (vg_replace_malloc.c:299)
==16210== by 0x5657A59: strdup (strdup.c:42)
==16210== by 0x51128E6: opal_basename (in /home/vshmelev/OMPI_2.0.2/lib/libopen-pal.so.20.2.0)
==16210== by 0x7DDECA9: ???
==16210== by 0x7DDEDD4: ???
==16210== by 0x6FBFF84: ???
==16210== by 0x4E4EA9E: orte_init (in /home/vshmelev/OMPI_2.0.2/lib/libopen-rte.so.20.1.0)
==16210== by 0x4041FD: orterun (orterun.c:818)
==16210== by 0x4034E5: main (main.c:13)
OpenMPI version: Open MPI 2.0.2
Valgrind version: valgrind-3.12.0
Virtual machine characteristics: Ubuntu 16.04 LTS x64
When using MPICH, the Valgrind output is:
==87863== HEAP SUMMARY:
==87863== in use at exit: 131,120 bytes in 2 blocks
==87863== total heap usage: 2,577 allocs, 2,575 frees, 279,908 bytes allocated
==87863==
==87863== 131,120 bytes in 2 blocks are still reachable in loss record 1 of 1
==87863== at 0x4C2DBB6: malloc (vg_replace_malloc.c:299)
==87863== by 0x425803: alloc_fwd_hash (sock.c:332)
==87863== by 0x425803: HYDU_sock_forward_stdio (sock.c:376)
==87863== by 0x432A99: HYDT_bscu_stdio_cb (bscu_cb.c:19)
==87863== by 0x42D9BF: HYDT_dmxu_poll_wait_for_event (demux_poll.c:75)
==87863== by 0x42889F: HYDT_bscu_wait_for_completion (bscu_wait.c:60)
==87863== by 0x42863C: HYDT_bsci_wait_for_completion (bsci_wait.c:21)
==87863== by 0x40B123: HYD_pmci_wait_for_completion (pmiserv_pmci.c:217)
==87863== by 0x4035C5: main (mpiexec.c:343)
==87863==
==87863== LEAK SUMMARY:
==87863== definitely lost: 0 bytes in 0 blocks
==87863== indirectly lost: 0 bytes in 0 blocks
==87863== possibly lost: 0 bytes in 0 blocks
==87863== still reachable: 131,120 bytes in 2 blocks
==87863== suppressed: 0 bytes in 0 blocks
==87863==
==87863== For counts of detected and suppressed errors, rerun with: -v
==87863== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

These outputs point to some memory leak in the MPI library, not your application code. You can safely ignore them.
More specifically, these leaks are coming from the launchers. ORTE is the runtime environment for OpenMPI responsible for launching and managing MPI processes. Hydra is the launcher and process manager for MPICH.

The term "definitely lost" means that the main function of your program at line 13 (as far as I can see in the output) is leaking memory directly, or calls some other function (orterun) that causes a memory leak. You must fix those leaks or provide more of your code.
Take a look here first.

Related

__static_initialization_and_destruction_0 seg fault

I was developing a C++ program and everything was working fine. Then, while I was programming, I ran make and ran my program like usual. But during the execution of it all, my computer crashed and shut itself off. I reopened my computer and ran make again but this time it gave me a bunch of errors.
Everything seemed off, like my whole computer is corrupted. But everything was working as intended in my operating system except the stuff related to my program. I've tried making a new c++ project, it works just fine. I've tried deleting the project and re-compiling it from github to no avail.
I managed to compile the program in the end but now it gives me a seg fault. The first thing my program does is to print out Starting... to the screen, but this segfault occurs without ever printing that, so it led me to believe that this error is linker related. (Even when the make command was failing, before I fixed it, it told me there was a linker error)
Here is what valgrind says:
turgut@turgut-N56VZ:~/Desktop/CppProjects/videoo-render$ valgrind bin/Renderer
==7521== Memcheck, a memory error detector
==7521== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==7521== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==7521== Command: bin/Renderer
==7521==
==7521== Invalid read of size 1
==7521== at 0x484FBD4: strcmp (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==7521== by 0x121377: __static_initialization_and_destruction_0 (OpenGLRenderer.cpp:111)
==7521== by 0x121377: _GLOBAL__sub_I__ZN6OpenGL7Texture5max_zE (OpenGLRenderer.cpp:197)
==7521== by 0x659FEBA: call_init (libc-start.c:145)
==7521== by 0x659FEBA: __libc_start_main@@GLIBC_2.34 (libc-start.c:379)
==7521== by 0x1216A4: (below main) (in /home/turgut/Desktop/CppProjects/videoo-render/bin/Renderer)
==7521== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==7521==
==7521==
==7521== Process terminating with default action of signal 11 (SIGSEGV)
==7521== Access not within mapped region at address 0x0
==7521== at 0x484FBD4: strcmp (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==7521== by 0x121377: __static_initialization_and_destruction_0 (OpenGLRenderer.cpp:111)
==7521== by 0x121377: _GLOBAL__sub_I__ZN6OpenGL7Texture5max_zE (OpenGLRenderer.cpp:197)
==7521== by 0x659FEBA: call_init (libc-start.c:145)
==7521== by 0x659FEBA: __libc_start_main@@GLIBC_2.34 (libc-start.c:379)
==7521== by 0x1216A4: (below main) (in /home/turgut/Desktop/CppProjects/videoo-render/bin/Renderer)
==7521== If you believe this happened as a result of a stack
==7521== overflow in your program's main thread (unlikely but
==7521== possible), you can try to increase the size of the
==7521== main thread stack using the --main-stacksize= flag.
==7521== The main thread stack size used in this run was 8388608.
==7521==
==7521== HEAP SUMMARY:
==7521== in use at exit: 72,741 bytes in 3 blocks
==7521== total heap usage: 3 allocs, 0 frees, 72,741 bytes allocated
==7521==
==7521== LEAK SUMMARY:
==7521== definitely lost: 0 bytes in 0 blocks
==7521== indirectly lost: 0 bytes in 0 blocks
==7521== possibly lost: 0 bytes in 0 blocks
==7521== still reachable: 72,741 bytes in 3 blocks
==7521== suppressed: 0 bytes in 0 blocks
==7521== Rerun with --leak-check=full to see details of leaked memory
==7521==
==7521== For lists of detected and suppressed errors, rerun with: -s
==7521== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)
turgut@turgut-N56VZ:~/Desktop/Cpp
OpenGLRenderer.cpp:197 is just the end of the file, and here is what's written at OpenGLRenderer.cpp:111:
static bool __debug = strcmp(getenv("DEBUG"), "true") == 0;
It says that there is an error with strcmp, but I've tried using that function in a different project and it worked just fine.
What could be the reason for this? I'm on Ubuntu 22.04, gcc version 11.2.0.
"this segfault occurs without ever printing that, so it led me to believe that this error is linker related."
The linker is not involved in your program running, so it can't be "linker related".
There is a dynamic loader (if your program uses shared libraries), so perhaps that's what you meant.
In any case, the crash is happening because OpenGLRenderer.cpp:111 (probably in libGL.so) is calling strcmp() with one of the arguments being NULL (which is not a valid thing to do). This does happen before main.
This line:
static bool __debug = strcmp(getenv("DEBUG"), "true") == 0;
is buggy: it will crash when DEBUG is not set in the environment (getenv("DEBUG") will return NULL in that case).
As a workaround, you can run export DEBUG=off before running your program, and the crash will go away.
It's unclear whether you inserted this line into OpenGLRenderer.cpp yourself or whether it was already present, but it's buggy either way.
P.S. A correct way to initialize __debug could be:
static const char *debug_str = getenv("DEBUG");
static const bool debug = strcmp(debug_str == NULL ? "off" : debug_str, "true") == 0;
P.P.S. Avoid using identifiers prefixed with __ (such as __debug) -- they are reserved.
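As a minimal sketch of the same fix packaged as a reusable helper, the NULL check can be done once before any string comparison. The names env_flag_is_true and debug_enabled are hypothetical and do not come from the original code; the only assumption is that "true" is the one value that should enable debugging.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns true only when the environment variable
// `name` is set and its value is exactly "true".
// std::getenv returns a null pointer for an unset variable, so the result
// must be checked before it is passed to std::strcmp.
bool env_flag_is_true(const char* name) {
    assert(name != nullptr);  // precondition: a variable name is required
    const char* value = std::getenv(name);
    return value != nullptr && std::strcmp(value, "true") == 0;
}

// Safe even when DEBUG is unset: the initializer never dereferences NULL,
// so nothing can crash during static initialization (before main).
static const bool debug_enabled = env_flag_is_true("DEBUG");
```

With this shape, running the program without DEBUG in the environment simply leaves debug_enabled false instead of crashing before main.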

Valgrind OpenCV

Here is my test program:
#include "opencv2/videoio.hpp"
int main(int argc, char** argv) {
cv::VideoCapture videoCapture(argv[1]);
cv::Mat frame;
videoCapture.read(frame);
return 0;
}
I run this program like this:
valgrind --leak-check=yes ./GyroRecord ./walks6/w63/39840012.avi > valgrind_output 2>&1
So that the entire output is saved in the valgrind_output file.
The contents of valgrind_output can be checked here.
But, if the link dies in the future, this is the summary:
==9677== LEAK SUMMARY:
==9677== definitely lost: 0 bytes in 0 blocks
==9677== indirectly lost: 0 bytes in 0 blocks
==9677== possibly lost: 1,352 bytes in 18 blocks
==9677== still reachable: 166,408 bytes in 1,296 blocks
==9677== of which reachable via heuristic:
==9677== newarray : 1,536 bytes in 16 blocks
==9677== suppressed: 0 bytes in 0 blocks
==9677== Reachable blocks (those to which a pointer was found) are not shown.
==9677== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==9677==
==9677== For counts of detected and suppressed errors, rerun with: -v
==9677== ERROR SUMMARY: 18 errors from 18 contexts (suppressed: 0 from 0)
I would like to reduce "possibly lost" bytes to 0. Is that possible? Or will I always have some "possibly lost" bytes when using OpenCV?
OpenCV comes with suppression files (with the extension .supp) for valgrind that can be used to hide messages about resources allocated (often early in the program's execution) that will be kept allocated until the program dies and the OS has to clean up the mess.
The suppression files are placed in /usr/share/OpenCV (on my system):
Example:
valgrind --leak-check=yes --suppressions=/usr/share/OpenCV/valgrind.supp --suppressions=/usr/share/OpenCV/valgrind_3rdparty.supp ./GyroRecord ./walks6/w63/39840012.avi
Using these helped me a lot when running valgrind on an OpenCV project.

pbs_server, E5-2620v4 and general protection

I am trying to install Torque 6.0.2 on Debian 8.5 on an Intel Xeon E5-2620v4. However, when I try to start pbs_server, it fails with a segmentation fault. The gdb backtrace:
#1 0x0000000000440ab6 in container::item_container<pbsnode*>::unlock (this=0xb5d900 <allnodes>) at ../../src/include/container.hpp:537
#2 0x00000000004b787f in mom_hierarchy_handler::nextNode (this=0x4e610c0 <hierarchy_handler>, iter=0x7fffffff98b8) at mom_hierarchy_handler.cpp:122
#3 0x00000000004b7a7d in mom_hierarchy_handler::make_default_hierarchy (this=0x4e610c0 <hierarchy_handler>) at mom_hierarchy_handler.cpp:149
#4 0x00000000004b898d in mom_hierarchy_handler::loadHierarchy (this=0x4e610c0 <hierarchy_handler>) at mom_hierarchy_handler.cpp:433
#5 0x00000000004b8ae8 in mom_hierarchy_handler::initialLoadHierarchy (this=0x4e610c0 <hierarchy_handler>) at mom_hierarchy_handler.cpp:472
#6 0x000000000045262a in pbsd_init (type=1) at pbsd_init.c:2299
#7 0x00000000004591ff in main (argc=2, argv=0x7fffffffdec8) at pbsd_main.c:1883
dmesg:
traps: pbs_server[22249] general protection ip:7f9c08a7a2c8 sp:7ffe520b5238 error:0 in libpthread-2.19.so[7f9c08a69000+18000]
valgrind:
==22381== Memcheck, a memory error detector
==22381== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==22381== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==22381== Command: pbs_server
==22381==
==22381==
==22381== HEAP SUMMARY:
==22381== in use at exit: 18,051 bytes in 53 blocks
==22381== total heap usage: 169 allocs, 116 frees, 42,410 bytes allocated
==22381==
==22382==
==22382== HEAP SUMMARY:
==22382== in use at exit: 19,755 bytes in 56 blocks
==22382== total heap usage: 172 allocs, 116 frees, 44,114 bytes allocated
==22382==
==22381== LEAK SUMMARY:
==22381== definitely lost: 0 bytes in 0 blocks
==22381== indirectly lost: 0 bytes in 0 blocks
==22381== possibly lost: 0 bytes in 0 blocks
==22381== still reachable: 18,051 bytes in 53 blocks
==22381== suppressed: 0 bytes in 0 blocks
==22381== Rerun with --leak-check=full to see details of leaked memory
==22381==
==22381== For counts of detected and suppressed errors, rerun with: -v
==22381== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==22383==
==22383== Process terminating with default action of signal 11 (SIGSEGV)
==22383== General Protection Fault
==22383== at 0x72192CB: __lll_unlock_elision (elision-unlock.c:33)
==22383== by 0x4E7E1A: unlock_node(pbsnode*, char const*, char const*, int) (u_lock_ctl.c:268)
==22383== by 0x4B7A66: mom_hierarchy_handler::make_default_hierarchy() (mom_hierarchy_handler.cpp:164)
==22383== by 0x4B898C: mom_hierarchy_handler::loadHierarchy() (mom_hierarchy_handler.cpp:433)
==22383== by 0x4B8AE7: mom_hierarchy_handler::initialLoadHierarchy() (mom_hierarchy_handler.cpp:472)
==22383== by 0x452629: pbsd_init(int) (pbsd_init.c:2299)
==22383== by 0x4591FE: main (pbsd_main.c:1883)
==22382== LEAK SUMMARY:
==22382== definitely lost: 0 bytes in 0 blocks
==22382== indirectly lost: 0 bytes in 0 blocks
==22382== possibly lost: 0 bytes in 0 blocks
==22382== still reachable: 19,755 bytes in 56 blocks
==22382== suppressed: 0 bytes in 0 blocks
==22382== Rerun with --leak-check=full to see details of leaked memory
==22382==
==22382== For counts of detected and suppressed errors, rerun with: -v
==22382== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==22383==
==22383== HEAP SUMMARY:
==22383== in use at exit: 325,348 bytes in 186 blocks
==22383== total heap usage: 297 allocs, 111 frees, 442,971 bytes allocated
==22383==
==22383== LEAK SUMMARY:
==22383== definitely lost: 134 bytes in 6 blocks
==22383== indirectly lost: 28 bytes in 3 blocks
==22383== possibly lost: 524 bytes in 17 blocks
==22383== still reachable: 324,662 bytes in 160 blocks
==22383== suppressed: 0 bytes in 0 blocks
==22383== Rerun with --leak-check=full to see details of leaked memory
==22383==
==22383== For counts of detected and suppressed errors, rerun with: -v
==22383== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
No other software shows this behavior; I tested the machine for 2 days under full load without problems. I have already tried updating the processor's microcode. Has anybody seen this behavior with Torque 6.0.2, or in some other scenario?
Best regards.
This is no microcode fault. It is an outright lock balance issue in whatever software you're running (and not in glibc/libpthreads).
Don't try to unlock an already unlocked lock. That's forbidden behavior, and the reason for the trap.
For performance reasons, glibc doesn't bother to test for it and segfault, so a lot of broken code got away with it for a long time. The hardware implementations of lock elision, OTOH, do raise traps (Intel TSX, IBM Power 8, S390/X...), so this kind of breakage is going to become apparent everywhere, very fast.
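One way to catch this kind of imbalance during debugging, sketched here under the assumption that the code uses POSIX mutexes, is to create the mutex with the PTHREAD_MUTEX_ERRORCHECK type. With that type, unlocking a mutex that is not locked is reported through the return value instead of being undefined. The function names below are hypothetical and are not part of Torque; older toolchains may need -pthread when building.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper: initializes `m` as an error-checking mutex.
static void init_errorcheck_mutex(pthread_mutex_t* m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(m, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);
}

// Unbalanced case: unlock without a preceding lock.
// An error-checking mutex returns an error code (EPERM) instead of
// silently corrupting its state or trapping.
int unlock_without_lock() {
    pthread_mutex_t m;
    init_errorcheck_mutex(&m);
    int rc = pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}

// Balanced case: every unlock is paired with a lock, so unlock returns 0.
int balanced_lock_unlock() {
    pthread_mutex_t m;
    init_errorcheck_mutex(&m);
    int rc = pthread_mutex_lock(&m);
    assert(rc == 0);
    rc = pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}
```

Checking the return value of pthread_mutex_unlock in a debug build like this can point at the unbalanced call site without depending on whether the CPU supports lock elision.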

Valgrind Error: in use at exit: 72,704 bytes C++ Initialization List weirdness with char*

Issue:
I have a weird issue that I wasn't expecting. I have a class called Answers
and within the header is this:
class Answer
{
char* aText;
bool b_correct;
public:
Answer():aText(0){;} //default constructor
};
The main (testing) driver code is this:
int main(void)
{
static const unsigned int MAX_ANSWERS = 5;
Answer answers[MAX_ANSWERS];
}
The (unexpected) weirdness I am getting is that there is an alloc happening, and I haven't used a new anywhere in my code yet. I'm guessing that the char* is calling this in the initialization list.
I am using valgrind to test my code, and I'm getting 11 allocs and 10 frees. When I remove the initializer of :aText(0), the extra alloc goes away.
I get that this is badly constructed code. I am following a course outline to learn how to write in C++. Can someone please help me understand how the memory is allocated or what's happening during the initialization list to cause a call to new?
I know the error is coming from the code shown. I know the extra alloc is happening when I compile and run just this code.
Valgrind Output:
==12598== Memcheck, a memory error detector
==12598== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==12598== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==12598== Command: ./Answers
==12598==
==12598==
==12598== HEAP SUMMARY:
==12598== in use at exit: 72,704 bytes in 1 blocks
==12598== total heap usage: 1 allocs, 0 frees, 72,704 bytes allocated
==12598==
==12598== LEAK SUMMARY:
==12598== definitely lost: 0 bytes in 0 blocks
==12598== indirectly lost: 0 bytes in 0 blocks
==12598== possibly lost: 0 bytes in 0 blocks
==12598== still reachable: 72,704 bytes in 1 blocks
==12598== suppressed: 0 bytes in 0 blocks
==12598== Rerun with --leak-check=full to see details of leaked memory
==12598==
==12598== For counts of detected and suppressed errors, rerun with: -v
==12598== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Platform Information:
Fedora 22
gcc.x86_64 5.1.1-4.fc22
valgrind.x86_64 1:3.10.1-13.fc22
codeblocks.x86_64 13.12-14.fc22
This is a known GCC 5.1 bug, not a valgrind bug.
Details here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64535
Possible workarounds:
Downgrade GCC to an earlier version, or wait for Valgrind to release a fix for this error. Both solutions are being worked on by their respective communities.

Using valgrind to find a memory leak in the mysql c++ client

I'm using valgrind to try to track down a memory leak in the MySQL C++ client distributed by MySQL.
In both the examples (resultset.cpp) and my own program, there is a single 56 byte block that is not freed. In my own program, I've traced the leak to a call to the mysql client.
Here are the results when I run the test:
valgrind --leak-check=full --show-reachable=yes ./my-executable
==29858== Memcheck, a memory error detector
==29858== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==29858== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==29858== Command: ./my-executable
==29858==
==29858==
==29858== HEAP SUMMARY:
==29858== in use at exit: 56 bytes in 1 blocks
==29858== total heap usage: 693 allocs, 692 frees, 308,667 bytes allocated
==29858==
==29858== 56 bytes in 1 blocks are still reachable in loss record 1 of 1
==29858== at 0x4C284A8: malloc (vg_replace_malloc.c:236)
==29858== by 0x400D334: _dl_map_object_deps (dl-deps.c:506)
==29858== by 0x4013652: dl_open_worker (dl-open.c:291)
==29858== by 0x400E9C5: _dl_catch_error (dl-error.c:178)
==29858== by 0x4012FF9: _dl_open (dl-open.c:583)
==29858== by 0x7077BCF: do_dlopen (dl-libc.c:86)
==29858== by 0x400E9C5: _dl_catch_error (dl-error.c:178)
==29858== by 0x7077D26: __libc_dlopen_mode (dl-libc.c:47)
==29858== by 0x72E5FEB: pthread_cancel_init (unwind-forcedunwind.c:53)
==29858== by 0x72E614B: _Unwind_ForcedUnwind (unwind-forcedunwind.c:126)
==29858== by 0x72E408F: __pthread_unwind (unwind.c:130)
==29858== by 0x72DDEB4: pthread_exit (pthreadP.h:265)
==29858==
==29858== LEAK SUMMARY:
==29858== definitely lost: 0 bytes in 0 blocks
==29858== indirectly lost: 0 bytes in 0 blocks
==29858== possibly lost: 0 bytes in 0 blocks
==29858== still reachable: 56 bytes in 1 blocks
==29858== suppressed: 0 bytes in 0 blocks
==29858==
==29858== For counts of detected and suppressed errors, rerun with: -v
==29858== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 6)
I have a few questions regarding this:
How should I interpret the --show-reachable block?
Is that block useful for me to try and zero in on the error?
If the block is not useful, does valgrind have another mechanism that would help me trace the leak?
If not, is there some other tool (hopefully OSS on linux) to help me narrow this down?
Thanks in advance..
UPDATE: Here is the code that I found on my system for the definition of pthread_exit. I'm not certain that this is the actual source that is being invoked. However, if it is, can anyone explain what might be going wrong?
void
pthread_exit (void *retval)
{
/* specific to PTHREAD_TO_WINTHREAD */
ExitThread ((DWORD) ((size_t) retval)); /* thread becomes signalled so its death can be waited upon */
/*NOTREACHED*/
assert (0); return; /* void fnc; can't return an error code */
}
Reachable just means that the blocks had a valid pointer referencing them in scope when the program exited, which indicates that the program does not explicitly free everything on exit because it relies on the underlying OS to do so. What you should be looking for are lost blocks, where blocks of memory lost all references to them and can no longer be freed.
So, the 56 bytes were probably allocated in main, which did not explicitly free them. What you posted does not show a memory leak. It shows main freeing everything but what main allocated because main assumes that when it dies, all memory will be reclaimed by the kernel.
Specifically, it's pthread (in main) making this assumption (which is a valid assumption on darn near everything found in production written in the last 15+ years). The need to free blocks that still have a valid reference on exit is a bit of a contentious point, but for this specific question all that needs to be mentioned is that the assumption was made.
Edit
It's actually pthread_exit() not cleaning something up on exit, but as explained it probably doesn't need to (or quite possibly can't) once it reaches that point.
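The difference between "still reachable" and "definitely lost" described above can be illustrated with a small sketch. The function and variable names are hypothetical, and the classification in the comments is what Memcheck would be expected to report for an unoptimized build (for example with -O0 -g); an optimizing compiler may remove the second allocation entirely.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// A global pointer that is never freed. At exit the block is still
// referenced, so a leak checker would be expected to classify it as
// "still reachable" rather than as a leak.
static char* g_buffer = nullptr;

char* keep_reachable(std::size_t n) {
    g_buffer = static_cast<char*>(std::malloc(n));
    assert(g_buffer != nullptr);
    std::memset(g_buffer, 0, n);
    return g_buffer;
}

// The only pointer to the block is a local variable. Once the function
// returns, nothing references the block any more, which is the situation
// reported as "definitely lost".
void lose_block(std::size_t n) {
    char* p = static_cast<char*>(std::malloc(n));
    assert(p != nullptr);
    std::memset(p, 0, n);
    // p goes out of scope here without free(): the block can never be freed.
}
```

Calling both functions from main and running the program under valgrind --leak-check=full --show-reachable=yes should list the two blocks under different headings. Only the second kind is the one worth chasing.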