I am trying to track down a bug that occasionally crashes my app in the destructor of this trivial C++ class:
class CrashClass {
public:
    CrashClass(double r1, double s1, double r2, double s2, double r3, double s3, string dateTime) : mR1(r1), mS1(s1), mR2(r2), mS2(s2), mR3(r3), mS3(s3), mDateTime(dateTime) { }
    CrashClass() : mR1(0), mS1(0), mR2(0), mS2(0), mR3(0), mS3(0) { }
    ~CrashClass() {}
    string GetDateTime() { return mDateTime; }
private:
    double mR1, mS1, mR2, mS2, mR3, mS3;
    string mDateTime;
};
A number of these objects are stored in a standard C++ vector and used in a second class:
class MyClass {
(...)
private:
vector<CrashClass> mCrashClassVec;
};
MyClass is created and dealloc'd as required many times over.
The code is using C++17 on the latest Xcode 10.1 under macOS 10.14.4.
All of this is part of a computationally intensive simulation app running for multiple hours to days. On a 6-core i7 machine running 12 calculations in parallel (using macOS' GCD framework) this frequently crashes after a couple of hours with a
pointer being freed was not allocated
error when invoking mCrashClassVec.clear() on the member in MyClass, i.e.
frame #0: 0x00007fff769a72f6 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00000001004aa80d libsystem_pthread.dylib`pthread_kill + 284
frame #2: 0x00007fff769116a6 libsystem_c.dylib`abort + 127
frame #3: 0x00007fff76a1f977 libsystem_malloc.dylib`malloc_vreport + 545
frame #4: 0x00007fff76a1f738 libsystem_malloc.dylib`malloc_report + 151
frame #5: 0x0000000100069448 BackTester`MyClass::DoStuff(int, int) [inlined] std::__1::__libcpp_deallocate(__ptr=<unavailable>) at new:236 [opt]
frame #6: 0x0000000100069443 BackTester`MyClass::DoStuff(int, int) [inlined] std::__1::allocator<char>::deallocate(__p=<unavailable>) at memory:1796 [opt]
frame #7: 0x0000000100069443 BackTester`MyClass::DoStuff(int, int) [inlined] std::__1::allocator_traits<std::__1::allocator<char> >::deallocate(__p=<unavailable>) at memory:1555 [opt]
frame #8: 0x0000000100069443 BackTester`MyClass::DoStuff(int, int) [inlined] std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_string() at string:1941 [opt]
frame #9: 0x0000000100069439 BackTester`MyClass::DoStuff(int, int) [inlined] std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_string() at string:1936 [opt]
frame #10: 0x0000000100069439 BackTester`MyClass::DoStuff(int, int) [inlined] CrashClass::~CrashClass(this=<unavailable>) at CrashClass.h:61 [opt]
frame #11: 0x0000000100069439 BackTester`MyClass::DoStuff(int, int) [inlined] CrashClass::~CrashClass(this=<unavailable>) at CrashClass.h:61 [opt]
frame #12: 0x0000000100069439 BackTester`MyClass::DoStuff(int, int) [inlined] std::__1::allocator<CrashClass>::destroy(this=<unavailable>, __p=<unavailable>) at memory:1860 [opt]
frame #13: 0x0000000100069439 BackTester`MyClass::DoStuff(int, int) [inlined] void std::__1::allocator_traits<std::__1::allocator<CrashClass> >::__destroy<CrashClass>(__a=<unavailable>, __p=<unavailable>) at memory:1727 [opt]
frame #14: 0x0000000100069439 BackTester`MyClass::DoStuff(int, int) [inlined] void std::__1::allocator_traits<std::__1::allocator<CrashClass> >::destroy<CrashClass>(__a=<unavailable>, __p=<unavailable>) at memory:1595 [opt]
frame #15: 0x0000000100069439 BackTester`MyClass::DoStuff(int, int) [inlined] std::__1::__vector_base<CrashClass, std::__1::allocator<CrashClass> >::__destruct_at_end(this=<unavailable>, __new_last=0x00000001011ad000) at vector:413 [opt]
frame #16: 0x0000000100069429 BackTester`MyClass::DoStuff(int, int) [inlined] std::__1::__vector_base<CrashClass, std::__1::allocator<CrashClass> >::clear(this=<unavailable>) at vector:356 [opt]
frame #17: 0x0000000100069422 BackTester`MyClass::DoStuff(int, int) [inlined] std::__1::vector<CrashClass, std::__1::allocator<CrashClass> >::clear(this=<unavailable>) at vector:749 [opt]
Side note: The vector being cleared might have no elements (yet).
In the stacktrace (bt all) I can see other threads performing operations on their copies of CrashClass vectors but as far as I can see from comparing addresses in the stack trace all of those are in fact private copies (as designed), i.e. none of this data is shared between the threads.
Naturally the bug only occurs in full production mode, i.e. all attempts to reproduce the crash
running in DEBUG mode,
running under Xcode's Address Sanitizer (for many hours/overnight),
running under Xcode's Thread Sanitizer (for many hours/overnight),
running a cut-down version of the class with just the critical code left/replicated,
failed and did not trigger the crash.
Why might deallocating a simple member allocated on the stack fail with a pointer being freed was not allocated error?
Additional hints on how to debug this, or on how to trigger the bug more reliably so that I can investigate further, are also very welcome.
Update 5/2019
The bug is still around, intermittently crashing the app, and I'm starting to believe that the issues I'm experiencing are actually caused by Intel's data corruption bug in recent CPU models.
https://mjtsai.com/blog/2019/05/17/microarchitectural-data-sampling-mds-mitigation/
https://mjtsai.com/blog/2017/06/27/bug-in-skylake-and-kaby-lake-hyper-threading/
https://www.tomshardware.com/news/hyperthreading-kaby-lake-skylake-skylake-x,34876.html
You might try a few tricks:
Run the production version on a single thread for an even longer duration (say a week or two) to see whether it still crashes.
Ensure that you don't consume all available RAM, taking memory fragmentation into account.
Ensure that your program does not leak memory or use more memory the longer it runs.
Add some tracking: add an extra member and set it to a known value in the destructor, so that you would recognize the pattern if a double delete occurs.
Try running the program on another platform and with another compiler.
Your compiler or library might contain bugs. Try another (more recent) version.
Remove code from the original version until it no longer crashes. That works better if you can consistently get the crash with a sequence that somehow corrupts memory.
Once you get a crash, run the program with the exact same data (for each thread) and see whether it always crashes at the same location.
Rewrite or validate any unsafe code in your application. Avoid casts, printf and other old-school variadic functions, and unsafe functions such as strcpy.
Use a checked STL version.
Try an unoptimized release version.
Try an optimized debug version.
Learn the differences between DEBUG and RELEASE builds for your compiler.
Rewrite the problematic code from scratch. Maybe it won't have the bug.
Inspect the data when it crashes to see whether it looks valid.
Review your error/exception handling to see whether you ignore some potential problem.
Test how your program behaves when it runs out of memory, out of disk space, when an exception is thrown…
Ensure that your debugger stops at each thrown exception, handled or not.
Ensure that your program compiles and runs without warnings, or that you understand them and are sure they do not matter.
You might reserve memory to reduce fragmentation and reallocation. If your program runs for hours, memory might become so fragmented that the system cannot find a block that is big enough.
Since your program is multithreaded, ensure that your run-time library is also compatible with that.
Ensure that you don't share data across threads, or that shared data is adequately protected.
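The tracking suggestion in the list above (an extra member set to a known value in the destructor) could look roughly like the following. This is only a sketch: TrackedCrashClass is a hypothetical stand-in for CrashClass, and the names kAlive, kDead and LooksAlive as well as the marker values are invented for illustration. It uses an explicit check rather than assert so that it stays active in release builds, where the crash occurs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Hypothetical marker values; any two distinctive patterns would do.
constexpr std::uint32_t kAlive = 0xA11CE5EDu;
constexpr std::uint32_t kDead  = 0xDEADBEEFu;

// Hypothetical stand-in for CrashClass with a lifetime marker added.
class TrackedCrashClass {
public:
    TrackedCrashClass() = default;
    explicit TrackedCrashClass(std::string dateTime) : mDateTime(std::move(dateTime)) {}
    TrackedCrashClass(const TrackedCrashClass&) = default;
    TrackedCrashClass& operator=(const TrackedCrashClass&) = default;

    ~TrackedCrashClass() {
        // Anything other than kAlive means the object was already destroyed
        // (marker == kDead) or its memory was overwritten (any other value).
        if (mMarker != kAlive) {
            std::fprintf(stderr, "TrackedCrashClass %p: bad marker 0x%08x\n",
                         static_cast<void*>(this), static_cast<unsigned>(mMarker));
            std::abort();
        }
        // Write through a volatile pointer so the optimizer is less likely
        // to discard the store as dead.
        *const_cast<volatile std::uint32_t*>(&mMarker) = kDead;
    }

    bool LooksAlive() const { return mMarker == kAlive; }
    const std::string& GetDateTime() const { return mDateTime; }

private:
    std::uint32_t mMarker = kAlive;
    std::string mDateTime;
};
```

If the process aborts in this destructor with the marker equal to kDead, a double destruction is likely; any other value suggests that something wrote over the object before the string member was freed.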
Related
Using the following setup:
Cortex-M3 based µC
gcc-arm cross toolchain
using C and C++
FreeRtos 7.5.3
Eclipse Luna
Segger Jlink with JLinkGDBServer
Code Confidence FreeRtos debug plugin
Using JLinkGDBServer and eclipse as debug frontend, I always have a nice stacktrace when stepping through my code. When using the Code Confidence freertos tools (eclipse plugin), I also see the stacktraces of all threads which are currently not running (without that plugin, I see just the stacktrace of the active thread). So far so good.
But now, when my application falls into a hardfault, the stacktrace is lost.
Well, I know the technique for finding the code address that causes the hardfault (as seen here).
But this is very poor information compared to a full stacktrace.
OK, sometimes when falling into a hardfault there is no way to retain a stacktrace, e.g. when the stack is corrupted by the faulty code. But if the stack is healthy, I think that getting a stacktrace might be possible (isn't it?).
I think the reason for losing the stacktrace in the hardfault is that the stack pointer is switched from PSP to MSP automatically by the Cortex-M3 architecture. One idea is to (maybe) set the MSP to the previous PSP value (and maybe do some additional stack preparation?).
Any suggestions on how to do that, or other approaches to retain a stacktrace when in a hardfault?
Edit 2015-07-07, added more details.
I use this code to provoke a hardfault:
__attribute__((optimize("O0"))) static void checkHardfault() {
    volatile uint32_t* varAtOddAddress = (uint32_t*)-1;
    (*varAtOddAddress)++;
}
When stepping into checkHardfault(), my stacktrace looks good like this:
gdb-> backtrace
#0 checkHardfault () at Main.cxx:179
#1 0x100360f6 in GetOneEvent () at Main.cxx:185
#2 0x1003604e in executeMainLoop () at Main.cxx:121
#3 0x1001783a in vMainTask (pvParameters=0x0) at Main.cxx:408
#4 0x00000000 in ?? ()
When I run into the hardfault (at (*varAtOddAddress)++;) and find myself inside HardFault_Handler(), the stacktrace is:
gdb-> backtrace
#0 HardFault_Handler () at Hardfault.c:312
#1 <signal handler called>
#2 0x10015f36 in prvPortStartFirstTask () at freertos/portable/GCC/ARM_CM3/port.c:224
#3 0x10015fd6 in xPortStartScheduler () at freertos/portable/GCC/ARM_CM3/port.c:301
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
The quickest way to get the debugger to give you the details of the state prior to the hard fault is to return the processor to the state prior to the hard fault.
In the debugger, write a script that takes the information from the various hardware registers and restores PC, LR, and R0-R14 to their state just prior to the hard fault, then do your stack dump.
Of course, this isn't always helpful when you end up at the hard fault because of popping stuff off of a blown stack or stomping on stuff in memory. You generally tend to corrupt a bunch of the important registers, return back to some crazy spot in memory, and then execute whatever's there. You can end up hard faulting many thousands (millions?) of cycles after your real problem happens.
Consider using the following gdb macro to restore the register contents:
define hfstack
set $frame_ptr = (unsigned *)$sp
if $lr & 0x10
set $sp = $frame_ptr + (8 * 4)
else
set $sp = $frame_ptr + (26 * 4)
end
set $lr = $frame_ptr[5]
set $pc = $frame_ptr[6]
bt
end
document hfstack
set the correct stack context after a hard fault on Cortex M
end
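The indices used in the macro (5 for LR, 6 for PC) follow from the order in which a Cortex-M core pushes registers on exception entry. As a rough sketch of that layout, assuming the standard ARMv7-M basic frame, the following C++ decodes a stacked frame from a raw word array. StackedFrame, DecodeFrame, FrameWords and UsedProcessStack are hypothetical helper names, not part of any vendor library. The check of EXC_RETURN bit 2 (PSP versus MSP) is an addition the macro itself does not perform; it relates to the PSP/MSP switch mentioned in the question.

```cpp
#include <cassert>
#include <cstdint>

// Registers pushed automatically on exception entry (basic frame, no FPU state),
// in ascending address order starting at the stacked stack pointer.
struct StackedFrame {
    std::uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

// EXC_RETURN bit 4 set: basic 8-word frame; clear: extended 26-word frame with FPU state.
constexpr std::uint32_t FrameWords(std::uint32_t excReturn) {
    return (excReturn & 0x10u) ? 8u : 26u;
}

// EXC_RETURN bit 2 set: the frame was pushed on the process stack (PSP);
// clear: it was pushed on the main stack (MSP).
constexpr bool UsedProcessStack(std::uint32_t excReturn) {
    return (excReturn & 0x4u) != 0u;
}

// Copies the eight stacked words into named fields.
inline StackedFrame DecodeFrame(const std::uint32_t* sp) {
    return StackedFrame{sp[0], sp[1], sp[2], sp[3], sp[4], sp[5], sp[6], sp[7]};
}
```

In a real handler, the pointer passed to DecodeFrame would be the PSP or MSP value selected according to UsedProcessStack, with the EXC_RETURN value taken from LR on handler entry.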
Before I ask my question let me explain my environment:
I have a C/C++ application that runs continuously (Infinite loop) inside an embedded Linux device.
The application records some data from the system and stores them in text files on an SD-card (1 file per day).
The recording occurs on a specific trigger detected from the systems (each 5 minutes for example) and each trigger inserts a new line in the text files.
Typical datatypes used within the application are: (o/i)stream, char arrays, char*, c_str() function, structs and struct*, static string arrays, #define, enums, FILE*, vector<>, and usual ones (int, string, etc.). Some of these datatypes are passed as arguments to functions.
The application is cross compiled with a custom GCC compiler within a Buildroot and BusyBox package for the device's CPU Atmel AT91RM9200QU.
The application executes some system commands using popen in which the output is read using the resulting FILE*
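The question does not show how the FILE* returned by popen is released, so the following is only a sketch of one pattern worth ruling out: every popen should be matched by a pclose, otherwise the stream and its buffer stay allocated. RunCommand and PipeCloser are hypothetical names, and the code assumes a POSIX system.

```cpp
#include <cassert>
#include <memory>
#include <stdio.h>   // popen and pclose are POSIX functions declared here
#include <string>

// Deleter that closes a stream opened with popen.
struct PipeCloser {
    void operator()(FILE* f) const {
        if (f != nullptr) {
            pclose(f);
        }
    }
};

// Runs a command and returns everything it wrote to stdout.
// The unique_ptr guarantees pclose is called on every return path.
inline std::string RunCommand(const std::string& command) {
    std::unique_ptr<FILE, PipeCloser> pipe(popen(command.c_str(), "r"));
    std::string output;
    if (!pipe) {
        return output;  // popen failed; nothing to read
    }
    char buffer[256];
    while (fgets(buffer, sizeof buffer, pipe.get()) != nullptr) {
        output += buffer;
    }
    return output;
}
```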
Now the application has been running for three days and I noticed an increase of 32 KB in the virtual memory size (VSZ from the top command) each day. The device restarted by mistake; I launched the application again and the VSZ value started from the usual fresh-start value (about 2532 KB).
I developed another application that monitors the VSZ value of the application; it is scheduled with crontab to run every hour. I noticed that, at some point during the day, the 32 KB increase accumulates in steps of 4 KB each hour.
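For reference, hourly monitoring of this kind can also be done from within a process by reading the VmSize and VmRSS fields of /proc/<pid>/status on Linux. The following is a minimal sketch, not the monitoring tool described above; ParseStatusFieldKb and ReadTextFile are hypothetical helper names.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Returns the value in kB of a field such as "VmSize" or "VmRSS" from the text
// of /proc/<pid>/status, or -1 if the field is missing or not numeric.
inline long ParseStatusFieldKb(const std::string& statusText, const std::string& field) {
    std::istringstream lines(statusText);
    const std::string prefix = field + ":";
    std::string line;
    while (std::getline(lines, line)) {
        if (line.compare(0, prefix.size(), prefix) == 0) {
            std::istringstream rest(line.substr(prefix.size()));
            long kb = -1;
            if (rest >> kb) {
                return kb;
            }
            return -1;
        }
    }
    return -1;
}

// Reads a whole text file; returns an empty string if it cannot be opened.
inline std::string ReadTextFile(const std::string& path) {
    std::ifstream in(path);
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}
```

Calling ParseStatusFieldKb(ReadTextFile("/proc/self/status"), "VmSize") at each recording trigger and logging the result would show which recordings coincide with the 4 KB steps.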
So the main question is: what could cause the VSZ to increase? My concern is that it will eventually reach a limit and crash the system, because the device has approx. 27 MB of RAM.
Update: Besides the VSZ value, the RSS also increases. I ran the application under valgrind --leak-check=full, aborted it after the first recording, and the following message appeared many times.
==28211== 28 bytes in 1 blocks are possibly lost in loss record 15 of 52
==28211== at 0x4C29670: operator new(unsigned long) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==28211== by 0x4EF33D8: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.19)
==28211== by 0x4EF4B00: char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag) (in /usr/lib64/libstdc++.so.6.0.19)
==28211== by 0x4EF4F17: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.19)
==28211== by 0x403842: __static_initialization_and_destruction_0 (gatewayfunctions.h:28)
*==28211== by 0x403842: _GLOBAL__sub_I__Z18szBuildUDPTelegramSsii (gatewayfunctions.cpp:396)
==28211== by 0x41AE7C: __libc_csu_init (elf-init.c:88)
==28211== by 0x5676A94: (below main) (in /lib64/libc-2.19.so)
The same message appears repeatedly, except that the line marked with * shows a different file name. The other thing I noticed is that line 28 of gatewayfunctions.h is a static string array declaration; this array is used in two files only. Any suggestions?
I'm running an app on iOS and periodically (not very often) it crashes with EXC_BAD_ACCESS.
The crash occurs while starting boost::thread:
boost::thread(boost::bind(&SomeClass::someStaticFunction, someParam));
and the call stack I see is:
* thread #35: tid = 0x2a822, 0x00d2469e NdsVgconnectTestApp`boost::(anonymous namespace)::thread_proxy(param=<unavailable>) + 246 at thread.cpp:164, stop reason = EXC_BAD_ACCESS (code=1, address=0x20000008)
* frame #0: 0x00d2469e NdsVgconnectTestApp`boost::(anonymous namespace)::thread_proxy(param=<unavailable>) + 246 at thread.cpp:164
frame #1: 0x3b877918 libsystem_pthread.dylib`_pthread_body + 140
frame #2: 0x3b87788a libsystem_pthread.dylib`_pthread_start + 102
I'm passing a static function to boost::thread, so it's hard to believe that there is some problem with addressing or pointer corruption. So my question is: can an EXC_BAD_ACCESS crash be an artifact of the iOS device running out of memory, or of the app exceeding the memory limit given by the OS?
We are facing a C++ application crash due to a segmentation fault on Red Hat Linux. We are using embedded Python in C++.
Please find our limitations below:
I don't have access to the production machine where the application crashes. The client sends us core dump files when the application crashes.
The problem is not reproducible on our test machine, which has exactly the same configuration as the production machine.
Sometimes the application crashes after 1 hour, 4 hours … 1 day or 1 week. We haven't found a time frame or any specific pattern in which the application crashes.
The application is complex, and embedded Python code is used from a lot of places within the application. We have done extensive code reviews but couldn't find the fix that way.
As per the stack trace in the core dump, it is crashing around a multiplication operation. We reviewed the code for such an operation but haven't found any code where it is performed. Such operations might be called through Python scripts executed from the embedded Python, over which we have no control and which we can't review.
We can't use any profiling tool like Valgrind in the production environment.
We are using gdb on our local machine to analyze the core dump. We can't run gdb on the production machine.
Please find below the efforts we have put in:
We have analyzed the logs and continuously fired the requests that arrive at our application at our test environment to reproduce the problem.
We are not finding the crash point in the logs. Every time we get different logs. I think this is because memory is smashed somewhere else and the application crashes some time later.
We have checked the load on our application at various points and it never exceeded our application's limit.
Memory utilization of our application is also normal.
We have profiled our application with Valgrind on our test machine and removed the Valgrind errors, but the application is still crashing.
I appreciate any help to guide us in solving the problem.
Below are the version details:
Red hat linux server 5.6 (Tikanga)
Python 2.6.2 GCC 4.1
Following is the stack trace I am getting from the core dump files they have shared (analyzed on my machine). FYI, we don't have access to the production machine to run gdb on the core dump files.
0 0x00000033c6678630 in ?? ()
1 0x00002b59d0e9501e in PyString_FromFormatV (format=0x2b59d0f2ab00 "can't multiply sequence by non-int of type '%.200s'", vargs=0x46421f20) at Objects/stringobject.c:291
2 0x00002b59d0ef1620 in PyErr_Format (exception=0x2b59d1170bc0, format=<value optimized out>) at Python/errors.c:548
3 0x00002b59d0e4bf1c in PyNumber_Multiply (v=0x2aaaac080600, w=0x2b59d116a550) at Objects/abstract.c:1192
4 0x00002b59d0ede326 in PyEval_EvalFrameEx (f=0x732b670, throwflag=<value optimized out>) at Python/ceval.c:1119
5 0x00002b59d0ee2493 in call_function (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:3794
6 PyEval_EvalFrameEx (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:2389
7 0x00002b59d0ee2493 in call_function (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:3794
8 PyEval_EvalFrameEx (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:2389
9 0x00002b59d0ee2493 in call_function (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:3794
10 PyEval_EvalFrameEx (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:2389
11 0x00002b59d0ee2493 in call_function (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:3794
12 PyEval_EvalFrameEx (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:2389
13 0x00002b59d0ee2d9f in ?? () at Python/ceval.c:2968 from /usr/local/lib/libpython2.6.so.1.0
14 0x0000000000000007 in ?? ()
15 0x00002b59d0e83042 in lookdict_string (mp=<value optimized out>, key=0x46424dc0, hash=40722104) at Objects/dictobject.c:412
16 0x00002aaab09d5458 in ?? ()
17 0x00002aaab09d5458 in ?? ()
18 0x00002aaab02a91f0 in ?? ()
19 0x00002aaab0b2c3a0 in ?? ()
20 0x0000000000000004 in ?? ()
21 0x00000000026d5eb8 in ?? ()
22 0x00002aaab0b2c3a0 in ?? ()
23 0x00002aaab071e080 in ?? ()
24 0x0000000046422bf0 in ?? ()
25 0x0000000046424dc0 in ?? ()
26 0x00000000026d5eb8 in ?? ()
27 0x00002aaab0987710 in ?? ()
28 0x00002b59d0ee2de2 in PyEval_EvalFrame (f=0x0) at Python/ceval.c:538
29 0x0000000000000000 in ?? ()
You are almost certainly doing something bad with pointers in your C++ code, which can be very tough to debug.
Do not assume that the stack trace is relevant. It might be relevant, but pointer misuse can often lead to crashes some time later.
Build with full warnings on. The compiler can point out some non-obvious pointer misuse, such as returning a reference to a local.
Investigate your arrays. Try replacing arrays with std::vector (C++03) or std::array (C++11) so you can iterate using begin() and end() and you can index using at().
Investigate your pointers. Replace them with std::unique_ptr (C++11) or boost::scoped_ptr wherever you can (there should be no overhead in release builds). Replace the rest with shared_ptr or weak_ptr. Any that can't be replaced are probably the source of problematic logic.
Because of the very problems you're seeing, modern C++ allows almost all raw pointer usage to be removed entirely. Try it.
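As a small illustration of the two replacements suggested above, here is a sketch showing how at() turns an out-of-range index into an exception instead of silent memory corruption, and how std::unique_ptr makes ownership explicit. Buffer, MakeBuffer and CheckedReadThrows are hypothetical names used only for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Returns true if the bounds-checked access threw std::out_of_range.
inline bool CheckedReadThrows(const std::vector<int>& values, std::size_t index) {
    try {
        (void)values.at(index);  // at() validates the index, operator[] does not
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// Fixed-size storage in place of a raw C array.
struct Buffer {
    std::array<int, 4> data{};
};

// The owner is the unique_ptr; the Buffer is freed exactly once when it goes out of scope.
inline std::unique_ptr<Buffer> MakeBuffer() {
    return std::unique_ptr<Buffer>(new Buffer());
}
```

With this style, an off-by-one index surfaces at the faulty access rather than as a crash some time later, and moving a unique_ptr leaves the source null, so ownership cannot be duplicated by accident.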
First things first, compile both your binary and libpython with debug symbols and push it out. The stack trace will be much easier to follow.
The relevant argument to g++ is -g.
Suggestions:
As already suggested, provide a complete debug build
Provide a memory test tool and a CPU torture test
Load debug symbols of python library when analyzing the core dump
The stacktrace shows something concerning eval(), so I guess you do dynamic code generation and evaluation/execution. If so, within this code, or passed arguments, there might be the actual error. Assertions at any interface to the code and code dumps may help.
Running valgrind, I get loads of memory leaks in OpenCV, especially with the namedWindow function.
In the main, I have an image CSImg and PGImg:
std::string cs = "Computer Science Students";
std::string pg = "Politics and Government Students";
CSImg.displayImage(cs);
cv::destroyWindow(cs);
PGImg.displayImage(pg);
cv::destroyWindow(pg);
The displayImage function is:
void ImageHandler::displayImage(std::string& windowname){
    namedWindow(windowname);
    imshow(windowname, m_image);
    waitKey(7000);
}
Valgrind is giving me enormous memory leaks when I do displayImage.
For example:
==6561== 2,359,544 bytes in 1 blocks are possibly lost in loss record 3,421 of 3,421
==6561== at 0x4C2B3F8: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==6561== by 0x4F6C94C: cv::fastMalloc(unsigned long) (in /usr/lib/libopencv_core.so.2.3.1)
==6561== by 0x4F53650: cvCreateData (in /usr/lib/libopencv_core.so.2.3.1)
==6561== by 0x4F540F0: cvCreateMat (in /usr/lib/libopencv_core.so.2.3.1)
==6561== by 0x56435AF: cvImageWidgetSetImage(_CvImageWidget*, void const*) (in /usr/lib/libopencv_highgui.so.2.3.1)
==6561== by 0x5644C14: cvShowImage (in /usr/lib/libopencv_highgui.so.2.3.1)
==6561== by 0x5642AF7: cv::imshow(std::string const&, cv::_InputArray const&) (in /usr/lib/libopencv_highgui.so.2.3.1)
==6561== by 0x40CED7: ImageHandler::displayImage(std::string&) (imagehandler.cpp:33)
==6561== by 0x408CF5: main (randomU.cpp:601)
imagehandler.cpp, line 33 is:
imshow(windowname, m_image); //the full function is written above ^
randomU.cpp line 601 is:
CSImg.displayImage(cs);
Any help is appreciated.
Ask for any further info you need.
Sorry, the stark reality is that OpenCV appears to leak. According to the Leaks instrument (Xcode tools), it also leaks on the side of its Qt interface due to self-references.
Further evidence that this is not just a false alarm: on my Mac, OpenCV 2.4.3 continuously grows in memory (according to Activity Monitor) when processing webcam input. (I am not using any pointers or data storage, so theoretically my OpenCV program should remain of constant size.)
Actually you don't need to call namedWindow anymore. You just call a "naked" cv::imshow(windowname,m_image). It works fine even if you overwrite.
REMARK:
waitKey has two usages:
1. to wait forever, then waitKey(0);
2. to wait for just a bit, possibly because you are displaying input from your webcam. Then do waitKey(30); (or less, depending on the fps of what you are playing. For movies, 30.)