pbs_server, E5-2620v4 and general protection - c++

I am trying to install Torque 6.0.2 on Debian 8.5 on an Intel Xeon E5-2620v4. However, when I try to start pbs_server I get a segmentation fault. Backtrace from gdb:
#1 0x0000000000440ab6 in container::item_container<pbsnode*>::unlock (this=0xb5d900 <allnodes>) at ../../src/include/container.hpp:537
#2 0x00000000004b787f in mom_hierarchy_handler::nextNode (this=0x4e610c0 <hierarchy_handler>, iter=0x7fffffff98b8) at mom_hierarchy_handler.cpp:122
#3 0x00000000004b7a7d in mom_hierarchy_handler::make_default_hierarchy (this=0x4e610c0 <hierarchy_handler>) at mom_hierarchy_handler.cpp:149
#4 0x00000000004b898d in mom_hierarchy_handler::loadHierarchy (this=0x4e610c0 <hierarchy_handler>) at mom_hierarchy_handler.cpp:433
#5 0x00000000004b8ae8 in mom_hierarchy_handler::initialLoadHierarchy (this=0x4e610c0 <hierarchy_handler>) at mom_hierarchy_handler.cpp:472
#6 0x000000000045262a in pbsd_init (type=1) at pbsd_init.c:2299
#7 0x00000000004591ff in main (argc=2, argv=0x7fffffffdec8) at pbsd_main.c:1883
dmesg:
traps: pbs_server[22249] general protection ip:7f9c08a7a2c8 sp:7ffe520b5238 error:0 in libpthread-2.19.so[7f9c08a69000+18000]
valgrind:
==22381== Memcheck, a memory error detector
==22381== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==22381== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==22381== Command: pbs_server
==22381==
==22381==
==22381== HEAP SUMMARY:
==22381== in use at exit: 18,051 bytes in 53 blocks
==22381== total heap usage: 169 allocs, 116 frees, 42,410 bytes allocated
==22381==
==22382==
==22382== HEAP SUMMARY:
==22382== in use at exit: 19,755 bytes in 56 blocks
==22382== total heap usage: 172 allocs, 116 frees, 44,114 bytes allocated
==22382==
==22381== LEAK SUMMARY:
==22381== definitely lost: 0 bytes in 0 blocks
==22381== indirectly lost: 0 bytes in 0 blocks
==22381== possibly lost: 0 bytes in 0 blocks
==22381== still reachable: 18,051 bytes in 53 blocks
==22381== suppressed: 0 bytes in 0 blocks
==22381== Rerun with --leak-check=full to see details of leaked memory
==22381==
==22381== For counts of detected and suppressed errors, rerun with: -v
==22381== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==22383==
==22383== Process terminating with default action of signal 11 (SIGSEGV)
==22383== General Protection Fault
==22383== at 0x72192CB: __lll_unlock_elision (elision-unlock.c:33)
==22383== by 0x4E7E1A: unlock_node(pbsnode*, char const*, char const*, int) (u_lock_ctl.c:268)
==22383== by 0x4B7A66: mom_hierarchy_handler::make_default_hierarchy() (mom_hierarchy_handler.cpp:164)
==22383== by 0x4B898C: mom_hierarchy_handler::loadHierarchy() (mom_hierarchy_handler.cpp:433)
==22383== by 0x4B8AE7: mom_hierarchy_handler::initialLoadHierarchy() (mom_hierarchy_handler.cpp:472)
==22383== by 0x452629: pbsd_init(int) (pbsd_init.c:2299)
==22383== by 0x4591FE: main (pbsd_main.c:1883)
==22382== LEAK SUMMARY:
==22382== definitely lost: 0 bytes in 0 blocks
==22382== indirectly lost: 0 bytes in 0 blocks
==22382== possibly lost: 0 bytes in 0 blocks
==22382== still reachable: 19,755 bytes in 56 blocks
==22382== suppressed: 0 bytes in 0 blocks
==22382== Rerun with --leak-check=full to see details of leaked memory
==22382==
==22382== For counts of detected and suppressed errors, rerun with: -v
==22382== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==22383==
==22383== HEAP SUMMARY:
==22383== in use at exit: 325,348 bytes in 186 blocks
==22383== total heap usage: 297 allocs, 111 frees, 442,971 bytes allocated
==22383==
==22383== LEAK SUMMARY:
==22383== definitely lost: 134 bytes in 6 blocks
==22383== indirectly lost: 28 bytes in 3 blocks
==22383== possibly lost: 524 bytes in 17 blocks
==22383== still reachable: 324,662 bytes in 160 blocks
==22383== suppressed: 0 bytes in 0 blocks
==22383== Rerun with --leak-check=full to see details of leaked memory
==22383==
==22383== For counts of detected and suppressed errors, rerun with: -v
==22383== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
No other software shows this behavior. I tested the machine for 2 days under full load without problems, and I have already tried updating the processor microcode. Has anybody seen this behavior with Torque 6.0.2, or in some other scenario?
Best regards.

This is not a microcode fault. It is an outright lock-balance bug in whatever software you're running (and not in glibc/libpthread).
Don't try to unlock an already unlocked lock. That's forbidden behavior, and the reason for the trap.
For performance reasons, glibc doesn't bother to test for it and segfault, so a lot of broken code got away with it for a long time. The hardware implementations of lock elision, OTOH, do raise traps (Intel TSX, IBM Power 8, S390/X...), so this kind of breakage is going to become apparent everywhere, very fast.
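One way to surface this kind of unbalanced unlock without relying on a hardware trap is an error-checking mutex. The following is a minimal sketch, assuming the code under test uses POSIX mutexes (as the libpthread frame in the traces suggests). The function name is hypothetical and is not part of Torque.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper: returns the error code produced by unlocking a mutex
// that is not locked. With PTHREAD_MUTEX_ERRORCHECK the call reports an error
// instead of invoking undefined behavior.
int unlock_unlocked_errorcheck_mutex() {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);

    pthread_mutex_t m;
    rc = pthread_mutex_init(&m, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    // Balanced pair: one lock, one unlock. Both calls return 0.
    int balanced = pthread_mutex_lock(&m);
    balanced |= pthread_mutex_unlock(&m);

    // Unbalanced: the mutex is already unlocked, so this returns EPERM.
    int extra = pthread_mutex_unlock(&m);

    pthread_mutex_destroy(&m);
    return balanced == 0 ? extra : -1;
}
```

Calling unlock_unlocked_errorcheck_mutex() from main (built with -pthread) should return EPERM. With a default mutex the same second unlock is undefined behavior, which is why it can pass silently on one machine and trap on another.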

Related

valgrind doesn't recognize invalid write

Valgrind doesn't detect memory errors.
I am using valgrind 3.11 and gcc 5.4.0 under Ubuntu, and
my program contains incorrect code like the sample below.
I analyzed this program with valgrind, but valgrind doesn't report any errors.
#include <string.h>
int main(){
int a[3];
memcpy(a,"aaabbbcccdddeeefffggghhh", 24);
return 0;
}
What's wrong with valgrind?
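valgrind does not track the bounds of stack arrays, so it does not know the size of a. As long as the overrun stays inside valid stack memory, it cannot detect the error.
For comparison, take this version with a heap allocation: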
#include <string.h>
int main(){
int * a = new int[3];
memcpy(a,"aaabbbcccdddeeefffggghhh", 24);
return 0;
}
valgrind can detect the error because it knows the size of the allocated block:
pi@raspberrypi:/tmp $ valgrind ./a.out
==16164== Memcheck, a memory error detector
==16164== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==16164== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==16164== Command: ./a.out
==16164==
==16164== Invalid write of size 8
==16164== at 0x4865F44: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem.so)
==16164== Address 0x4bc9f60 is 8 bytes inside a block of size 12 alloc'd
==16164== at 0x48485F0: operator new[](unsigned int) (vg_replace_malloc.c:417)
==16164== by 0x105A7: main (v.cc:3)
==16164==
==16164== Invalid write of size 8
==16164== at 0x4865F54: ??? (in /usr/lib/arm-linux-gnueabihf/libarmmem.so)
==16164== Address 0x4bc9f68 is 4 bytes after a block of size 12 alloc'd
==16164== at 0x48485F0: operator new[](unsigned int) (vg_replace_malloc.c:417)
==16164== by 0x105A7: main (v.cc:3)
==16164==
==16164==
==16164== HEAP SUMMARY:
==16164== in use at exit: 12 bytes in 1 blocks
==16164== total heap usage: 2 allocs, 1 frees, 20,236 bytes allocated
==16164==
==16164== LEAK SUMMARY:
==16164== definitely lost: 12 bytes in 1 blocks
==16164== indirectly lost: 0 bytes in 0 blocks
==16164== possibly lost: 0 bytes in 0 blocks
==16164== still reachable: 0 bytes in 0 blocks
==16164== suppressed: 0 bytes in 0 blocks
==16164== Rerun with --leak-check=full to see details of leaked memory
==16164==
==16164== For counts of detected and suppressed errors, rerun with: -v
==16164== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 6 from 3)
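Since the stack overrun goes unreported, the copy itself has to be bounded by the destination size. The following is a minimal sketch under that assumption. bounded_copy and copy_into_stack_array are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies at most dst_size bytes and returns how many
// bytes were actually copied.
std::size_t bounded_copy(void* dst, std::size_t dst_size,
                         const void* src, std::size_t src_size) {
    assert(dst != nullptr && src != nullptr);
    const std::size_t n = src_size < dst_size ? src_size : dst_size;
    std::memcpy(dst, src, n);
    return n;
}

// Mirrors the question: a 3-int stack array and a 24-character source.
std::size_t copy_into_stack_array() {
    int a[3];
    const char src[] = "aaabbbcccdddeeefffggghhh";  // 24 characters plus '\0'
    // sizeof a is the full array size here because a has not decayed to a pointer.
    return bounded_copy(a, sizeof a, src, sizeof src - 1);
}
```

copy_into_stack_array() returns 3 * sizeof(int), so the write never leaves the array regardless of how long the source is.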

Valgrind OpenCV

Here is my test program:
#include "opencv2/videoio.hpp"
int main(int argc, char** argv) {
cv::VideoCapture videoCapture(argv[1]);
cv::Mat frame;
videoCapture.read(frame);
return 0;
}
I run this program like this:
valgrind --leak-check=yes ./GyroRecord ./walks6/w63/39840012.avi > valgrind_output 2>&1
That way the entire output is saved in the valgrind_output file.
The contents of valgrind_output can be checked here.
But, if the link dies in the future, this is the summary:
==9677== LEAK SUMMARY:
==9677== definitely lost: 0 bytes in 0 blocks
==9677== indirectly lost: 0 bytes in 0 blocks
==9677== possibly lost: 1,352 bytes in 18 blocks
==9677== still reachable: 166,408 bytes in 1,296 blocks
==9677== of which reachable via heuristic:
==9677== newarray : 1,536 bytes in 16 blocks
==9677== suppressed: 0 bytes in 0 blocks
==9677== Reachable blocks (those to which a pointer was found) are not shown.
==9677== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==9677==
==9677== For counts of detected and suppressed errors, rerun with: -v
==9677== ERROR SUMMARY: 18 errors from 18 contexts (suppressed: 0 from 0)
I would like to reduce "possibly lost" bytes to 0. Is that possible? Or will I always have some "possibly lost" bytes when using OpenCV?
OpenCV comes with suppression files (with the extension .supp) for valgrind. They can be used to hide messages about resources that are allocated (often early in the program's execution) and kept allocated until the program dies, leaving the OS to clean up.
The suppression files are placed in /usr/share/OpenCV (on my system):
Example:
valgrind --leak-check=yes --suppressions=/usr/share/OpenCV/valgrind.supp --suppressions=/usr/share/OpenCV/valgrind_3rdparty.supp ./GyroRecord ./walks6/w63/39840012.avi
Using these helped me a lot when running valgrind on an OpenCV project.

Valgrind throws no error but not all heap allocations have been freed

This is what I get after executing my program with Valgrind:
jscherman@jscherman:~/ClionProjects/algo2-t4-tries$ g++ Set.hpp tests.cpp DiccString.hpp && valgrind --leak-check=yes --show-leak-kinds=all ./a.out
==6823== Memcheck, a memory error detector
==6823== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==6823== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==6823== Command: ./a.out
==6823==
test_empty_dicc...ok
test_copy_constructor...ok
test_define_defined...ok
test_get..ok
test_remove...ok
test_remove_tiny...ok
test_keys...ok
==6823==
==6823== HEAP SUMMARY:
==6823== in use at exit: 72,704 bytes in 1 blocks
==6823== total heap usage: 282 allocs, 281 frees, 275,300 bytes allocated
==6823==
==6823== 72,704 bytes in 1 blocks are still reachable in loss record 1 of 1
==6823== at 0x4C2DC10: malloc (vg_replace_malloc.c:299)
==6823== by 0x4EC3EFF: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
==6823== by 0x40104E9: call_init.part.0 (dl-init.c:72)
==6823== by 0x40105FA: call_init (dl-init.c:30)
==6823== by 0x40105FA: _dl_init (dl-init.c:120)
==6823== by 0x4000CF9: ??? (in /lib/x86_64-linux-gnu/ld-2.23.so)
==6823==
==6823== LEAK SUMMARY:
==6823== definitely lost: 0 bytes in 0 blocks
==6823== indirectly lost: 0 bytes in 0 blocks
==6823== possibly lost: 0 bytes in 0 blocks
==6823== still reachable: 72,704 bytes in 1 blocks
==6823== suppressed: 0 bytes in 0 blocks
==6823==
==6823== For counts of detected and suppressed errors, rerun with: -v
==6823== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
The last line of the output reports no errors, so it seems there are no leaks. Yet we also have this line:
==6823== total heap usage: 282 allocs, 281 frees, 275,300 bytes allocated
How is it that I don't have any errors but there is still an allocation that hasn't been freed? Is there something wrong with my program, or is Valgrind doing something behind the scenes?
The backtrace reported by valgrind shows that the memory allocation in question was made in the initialization function of one of the shared libraries loaded by the application, apparently the C++ library itself.
It is quite common for shared libraries to make one-time allocations for various bits and pieces of data and not bother to explicitly deallocate them when they get unloaded.
This does not constitute a memory leak in your own code.
valgrind comes with a list of known allocations of this nature, called a "suppression list", for the explicit purpose of suppressing reports about these known one-off allocations.
But occasionally these suppression lists do miss an allocation or two.
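To illustrate how an allocation can appear without any new or malloc in main, here is a minimal sketch of a one-time allocation made during static initialization. g_pool, kPoolSize and pool_is_allocated are hypothetical names standing in for whatever a library sets up at load time. They are not taken from libstdc++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

namespace {
// Hypothetical one-time buffer. The initializer runs during static
// initialization, before main is entered.
const std::size_t kPoolSize = 1024;
void* g_pool = std::malloc(kPoolSize);
}  // namespace

// The pointer stays valid for the whole run and is never passed to free(),
// so at exit the block is still referenced by g_pool.
bool pool_is_allocated() {
    return g_pool != nullptr;
}
```

A program containing this translation unit would be expected to show one more alloc than frees, with the block listed under "still reachable" rather than under any of the "lost" categories, because g_pool still points to it when the process exits.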

Valgrind Error: in use at exit: 72,704 bytes C++ Initialization List weirdness with char*

Issue:
I have a weird issue that I wasn't expecting. I have a class called Answers
and within the header is this:
class Answer
{
char* aText;
bool b_correct;
public:
Answer():aText(0){;} //default constructor
}; // the class definition must end with a semicolon
The main (testing) driver code is this:
int main(void)
{
static const unsigned int MAX_ANSWERS = 5;
Answer answers[MAX_ANSWERS];
}
The (unexpected) weirdness I am getting is that there is an alloc happening, even though I haven't used new anywhere in my code yet. I'm guessing that the char* triggers it in the initialization list.
I am using valgrind to test my code, and I'm getting 11 allocs and 10 frees. When I remove the :aText(0) initializer, the extra alloc goes away.
I get that this is badly constructed code. I am following a course outline to learn how to write C++. Can someone please help me understand how the memory is allocated, or what happens in the initialization list to cause a call to new?
I know the error is coming from the code shown, because the extra alloc happens when I compile and run just this code.
Valgrind Output:
==12598== Memcheck, a memory error detector
==12598== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==12598== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==12598== Command: ./Answers
==12598==
==12598==
==12598== HEAP SUMMARY:
==12598== in use at exit: 72,704 bytes in 1 blocks
==12598== total heap usage: 1 allocs, 0 frees, 72,704 bytes allocated
==12598==
==12598== LEAK SUMMARY:
==12598== definitely lost: 0 bytes in 0 blocks
==12598== indirectly lost: 0 bytes in 0 blocks
==12598== possibly lost: 0 bytes in 0 blocks
==12598== still reachable: 72,704 bytes in 1 blocks
==12598== suppressed: 0 bytes in 0 blocks
==12598== Rerun with --leak-check=full to see details of leaked memory
==12598==
==12598== For counts of detected and suppressed errors, rerun with: -v
==12598== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Platform Information:
Fedora 22
gcc.x86_64 5.1.1-4.fc22
valgrind.x86_64 1:3.10.1-13.fc22
codeblocks.x86_64 13.12-14.fc22
This is a known GCC 5.1 bug, not a valgrind bug.
Details here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64535
Possible workarounds:
Downgrade GCC to an earlier version, or wait for Valgrind to ship a fix for this report. Both are being worked on by their respective communities.

Using valgrind to find a memory leak in the mysql c++ client

I'm using valgrind to try to track down a memory leak in the MySQL C++ client distributed by MySQL.
In both the examples (resultset.cpp) and my own program, there is a single 56 byte block that is not freed. In my own program, I've traced the leak to a call to the mysql client.
Here are the results when I run the test:
valgrind --leak-check=full --show-reachable=yes ./my-executable
==29858== Memcheck, a memory error detector
==29858== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==29858== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==29858== Command: ./my-executable
==29858==
==29858==
==29858== HEAP SUMMARY:
==29858== in use at exit: 56 bytes in 1 blocks
==29858== total heap usage: 693 allocs, 692 frees, 308,667 bytes allocated
==29858==
==29858== 56 bytes in 1 blocks are still reachable in loss record 1 of 1
==29858== at 0x4C284A8: malloc (vg_replace_malloc.c:236)
==29858== by 0x400D334: _dl_map_object_deps (dl-deps.c:506)
==29858== by 0x4013652: dl_open_worker (dl-open.c:291)
==29858== by 0x400E9C5: _dl_catch_error (dl-error.c:178)
==29858== by 0x4012FF9: _dl_open (dl-open.c:583)
==29858== by 0x7077BCF: do_dlopen (dl-libc.c:86)
==29858== by 0x400E9C5: _dl_catch_error (dl-error.c:178)
==29858== by 0x7077D26: __libc_dlopen_mode (dl-libc.c:47)
==29858== by 0x72E5FEB: pthread_cancel_init (unwind-forcedunwind.c:53)
==29858== by 0x72E614B: _Unwind_ForcedUnwind (unwind-forcedunwind.c:126)
==29858== by 0x72E408F: __pthread_unwind (unwind.c:130)
==29858== by 0x72DDEB4: pthread_exit (pthreadP.h:265)
==29858==
==29858== LEAK SUMMARY:
==29858== definitely lost: 0 bytes in 0 blocks
==29858== indirectly lost: 0 bytes in 0 blocks
==29858== possibly lost: 0 bytes in 0 blocks
==29858== still reachable: 56 bytes in 1 blocks
==29858== suppressed: 0 bytes in 0 blocks
==29858==
==29858== For counts of detected and suppressed errors, rerun with: -v
==29858== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 6)
I have a few questions regarding this:
How should I interpret the --show-reachable block?
Is that block useful for me to try and zero in on the error?
If the block is not useful, does valgrind have another mechanism that would help me trace the leak?
If not, is there some other tool (hopefully OSS on linux) to help me narrow this down?
Thanks in advance..
UPDATE: Here is the code that I found on my system for the definition of pthread_exit. I'm not certain that this is the actual source that is being invoked. However, if it is, can anyone explain what might be going wrong?
void
pthread_exit (void *retval)
{
/* specific to PTHREAD_TO_WINTHREAD */
ExitThread ((DWORD) ((size_t) retval)); /* thread becomes signalled so its death can be waited upon */
/*NOTREACHED*/
assert (0); return; /* void fnc; can't return an error code */
}
Reachable just means that the blocks had a valid pointer referencing them in scope when the program exited, which indicates that the program does not explicitly free everything on exit because it relies on the underlying OS to do so. What you should be looking for are lost blocks, where blocks of memory lost all references to them and can no longer be freed.
So, the 56 bytes were probably allocated in main, which did not explicitly free them. What you posted does not show a memory leak. It shows main freeing everything but what main allocated because main assumes that when it dies, all memory will be reclaimed by the kernel.
Specifically, it's pthread (in main) making this assumption (which is a valid assumption on darn near everything found in production written in the last 15+ years). The need to free blocks that still have a valid reference on exit is a bit of a contentious point, but for this specific question all that needs to be mentioned is that the assumption was made.
Edit
It's actually pthread_exit() not cleaning something up on exit, but as explained it probably doesn't need to (or quite possibly can't) once it reaches that point.
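The difference between a reachable block and a lost block can be sketched as follows. This is a minimal illustration, assuming an unoptimized build (for example -O0 -g) so the compiler does not remove the unused allocation. g_kept and allocate_kept_and_dropped are hypothetical names.

```cpp
#include <cassert>
#include <cstdlib>

// Global that still points at its block when the program exits.
static void* g_kept = nullptr;

// Allocates two 56-byte blocks and frees neither.
// Returns true if both allocations succeeded.
bool allocate_kept_and_dropped() {
    g_kept = std::malloc(56);         // a valid pointer remains at exit
    void* dropped = std::malloc(56);  // the only reference is this local
    const bool ok = g_kept != nullptr && dropped != nullptr;
    dropped = nullptr;                // last reference gone, block can never be freed
    return ok;
}
```

Under Memcheck the first block would be expected to show up as "still reachable" and the second as "definitely lost". Only the second kind is the leak worth chasing.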