I use libcurl for FTP works and it works fine but if left idle for some time it just crashes. Here is the backtrace which despite reading it for some time I cannot make sense of what is wrong. The trace does not show where in my functions crash originates and so am left orphan here by debugger. I use threads if that add value
Compiler is GCC 4.7 on Linux
0 0x00007fff8e09b524 addbyter /home/stefano/Desktop/myproject/curl-7.33.0/lib/mprintf.c 914
1 0x00007fff8e09a32f dprintf_formatf /home/stefano/Desktop/myproject/curl-7.33.0/lib/mprintf.c 572
2 0x00007fff8e09b5a4 curl_mvsnprintf /home/stefano/Desktop/myproject/curl-7.33.0/lib/mprintf.c 932
3 0x00007fff8e089510 Curl_failf /home/stefano/Desktop/myproject/curl-7.33.0/lib/sendf.c 152
4 0x00007fff8e07dbf4 Curl_resolv_timeout /home/stefano/Desktop/myproject/curl-7.33.0/lib/hostip.c 618
5 0x00007fff78012bf8 ??
6 0x000000c300000016 ??
7 0x00007fff8e0d3604 ??
8 0x0000000000000002 ??
9 0x00000000001b7740 ??
10 0x0000000000000000 ??
UPDATE 1
Run it again under debugger and met a crash at the line
FILE *fd;
fd = fopen(files[i].c_str(), "rb"); //<---here goes the crash!
files[i].c_str() is supposed to give const* char from wxString
The new BT is
0 0x00007fff8e08952a Curl_failf /home/stefano/Desktop/myproject/curl-7.33.0/lib/sendf.c 154
1 0x00007fff8e07dbf4 Curl_resolv_timeout /home/stefano/Desktop/myproject/curl-7.33.0/lib/hostip.c 618
2 0x00007fff780158c8 ??
3 0x00000000001b7730 ??
4 0x00007fff78009808 ??
5 0x00007fff78015e79 ??
6 0x00007fff78009808 ??
7 0x00007fff8c8a04a0 ??
8 0x00007fff8e0c84ca ftp_multi_statemach /home/stefano/Desktop/myproject/curl-7.33.0/lib/ftp.c 3113
Such error can be caused if you're using curl in non-main thread. When curl can't resolve dns entry, it sends a signal (by default) to interrupt a thread by timeout. Signals are not thread safe and can cause a crash. You should compile libcurl with --enable-threaded-resolver or with support of c-ares.
Also for me it was useful to disable signals at all
curl_easy_setopt(curl, CURLOPT_NOSIGNAL, 1)
Related
I have a core on both Solaris/Linux platforms and I don´t see the problem.
On a linux platform, I have the following core:
(gdb) where
#0 0x001aa81b in do_lookup_x () from /lib/ld-linux.so.2
#1 0x001ab0da in _dl_lookup_symbol_x () from /lib/ld-linux.so.2
#2 0x001afa05 in _dl_fixup () from /lib/ld-linux.so.2
#3 0x001b5c90 in _dl_runtime_resolve () from /lib/ld-linux.so.2
#4 0x00275e4c in __gxx_personality_v0 () from /opt/gnatpro/lib/libstdc++.so.6
#5 0x00645cfe in _Unwind_RaiseException_Phase2 (exc=0x2a7b10, context=0xffd58434) at ../../../src/libgcc/../gcc/unwind.inc:67
#6 0x00646082 in _Unwind_RaiseException (exc=0x2a7b10) at ../../../src/libgcc/../gcc/unwind.inc:136
#7 0x0027628d in __cxa_throw () from /opt/gnatpro/lib/libstdc++.so.6
#8 0x00276e4f in operator new(unsigned int) () from /opt/gnatpro/lib/libstdc++.so.6
#9 0x08053737 in Receptor::receive (this=0x93c12d8, msj=...) at Receptor.cc:477
#10 0x08099666 in EventProcessor::run (this=0xffd75580) at EventProcessor.cc:437
#11 0x0809747d in SEventProcessor::run (this=0xffd75580) at SEventProcessor.cc:80
#12 0x08065564 in main (argc=1, argv=0xffd76734) at my_project.cc:20
On a Solaris platform I have another core:
$ pstack core.ultimo
core 'core.ultimo' of 9220: my_project_sun
----------------- lwp# 1 / thread# 1 --------------------
0006fa28 __1cDstdGvector4CpnMDistribuidor_n0AJallocator4C2___Dend6kM_pk2_ (1010144, 1ce84, ffbd0df8, ffb7a18c, fffffff8, ffbedc7c) + 30
0005d580 __1cDstdGvector4CpnMDistribuidor_n0AJallocator4C2___Esize6kM_I_ (1010144, 219, 1ce84, ffffffff, fffffff8, ffbedc7c) + 30
0005ab14 __1cTReceptorHreceive6MrnKMensaje__v_ (33e630, ffbede70, ffffffff, 33e634, 33e68c, 0) + 1d4
0015df78 __1cREventProcessorDrun6M_v_ (ffbede18, 33e630, dcc, 1, 33e730, 6e) + 350
00159a50 __1cWSEventProcessorDrun6M_v_ (da08000, 2302f7, 111de0c, 159980, ff1fa07c, cc) + 48
000b6acc main (1, ffbeef74, ffbeef7c, 250000, 0, 0) + 16c
00045e10 _start (0, 0, 0, 0, 0, 0) + 108
----------------- lwp# 2 / thread# 2 --------------------
...
The piece of code is:
...
msj2.tipo(UPDATE);
for(i = 0; i < distr.size(); ++i)
{
distr[i]->insert(new Mensaje(msj2)); **--> Receptor.cc:477**
}
...
This core happens randomly, sometimes the process is running for weeks.
The size of the core is 4291407872 B.
I am running valgrind to see if the heap is corrupted but by now I have not encountered problems as "Invalid read", "Invalid write" ...
Also, when I was running valgrind I have found twice the following message:
==19002== Syscall param semctl(arg) points to uninitialised byte(s)
and I have detected the lines of code but could these errors lead to the core? I think that I have seen these errors with valgrind before and they weren´t as important and the ones that say "Invalid read/write".
If you have any idea how to solve this problem, it would be highly appreciated.
The core size is the clue. The largest 32-bit unsigned number is 4,294,967,295. Your core is quite close to that indicating that the process is out of memory. The most likely cause is a memory leak.
See my recent article Memory Leaks in C/C++
Valgrind will find the issue for you on Linux. You have to start it with the --leak-check option for this. It will check for leaks when the process exits gracefully so you will need a way to shut the process down.
Dtrace with dbx on Solaris will also likely work.
Also, when I was running valgrind I have found twice the following
message:
==19002== Syscall param semctl(arg) points to uninitialised byte(s)
and I have detected the lines of code but could these errors lead to
the core?
Yes, that could result in a SIGSEGV, as it's quite likely undefined behavior. (I'm not going to say it's definitely undefined behavior without seeing the actual code - but it likely is.) It's not likely that doing that can cause a SIGSEGV, but then again the intermittent failure you're seeing doesn't happen all that often. So you do need to fix that problem.
In addition to valgrind, on Solaris you can also use libumem and watchmalloc to check for problems managing heap memory. See the man pages for umem_debug and watchmalloc to get started.
To use dbx on Solaris, you need to have Solaris Studio installed (it's free). Solaris Studio also offers a way to use the run-time memory checking of dbx without having to directly invoke the dbx debugger. See the man page for bcheck. The bcheck man page will be in the Solaris Studio installation directory tree, in the man directory.
And if it is a memory leak, you should be able to see the process address space growing over time.
I am creating a lot of QWebViews which each create QThreads. The problem is that I am running out of stack space to create more threads. Therefore, I was wondering if there is a way to clean up existing threads. To be clear, I am not the one creating these threads: Qt creates the threads when I show a QWebView.
Most of the threads (about 400/500) have the exact same stack:
0 ntdll!RtlEnableEarlyCriticalSectionEventCreation C:\Windows\system32\ntdll.dll 0 0x770b013d
1 KERNEL32!GetVolumePathNamesForVolumeNameA C:\Windows\syswow64\kernel32.dll 0 0x766d1a2c
2 USER32!MessageBoxA C:\Windows\syswow64\user32.dll 0 0x74cd086a
3 QEventDispatcherWin32::processEvents qeventdispatcher_win.cpp 831 0x69de3948
4 QEventLoop::processEvents qeventloop.cpp 149 0x69dbf0c5
5 QEventLoop::exec qeventloop.cpp 204 0x69dbf223
6 QThread::exec qthread.cpp 501 0x69cd412b
7 QThread::run qthread.cpp 568 0x69cd4283
8 QThreadPrivate::start qthread_win.cpp 346 0x69cd54d1
9 msvcrt!_itow_s C:\Windows\syswow64\msvcrt.dll 0 0x75401287
10 msvcrt!_endthreadex C:\Windows\syswow64\msvcrt.dll 0 0x75401328
11 KERNEL32!BaseCleanupAppcompatCacheSupport C:\Windows\syswow64\kernel32.dll 0 0x766d339a
12 ntdll!RtlpNtSetValueKey C:\Windows\system32\ntdll.dll 0 0x770c9ef2
13 ntdll!RtlpNtSetValueKey C:\Windows\system32\ntdll.dll 0 0x770c9ec5
14 ?? 0
Is there any way to clean them up?
Since I realized that the thread was allocated for the NetworkManager, I created a global NetworkManager instance and set it to the NetworkAccessManager of every QWebView. This allowed it to reuse the same group of threads for all of the pages and thus doesnt leave many threads laying around.
We are facing C++ application crash issue due to segmentation fault on RED hat Linux. We are using embedded python in C++.
Please find below my limitation
Don’t I have access to production machine where application crashes. Client send us core dump files when application crashes.
Problem is not reproducible on our test machine which has exactly same configuration as production machine.
Sometime application crashes after 1 hour, 4 hour ….1 day or 1 week. We haven’t get time frame or any specific pattern in which application crashes.
Application is complex and embedded python code is used from lot of places from within application. We have done extensive code reviews but couldn’t find the fix by doing code review.
As per stack trace in core dump, it is crashing around multiplication operation, reviewed code for such operation in code we haven’t get any code where such operation is performed. Might be such operations are called through python scripts executed from embedded python on which we don’t have control or we can’t review it.
We can’t use any profiling tool on production environment like Valgrind.
We are using gdb on our local machine to analyze core dump. We can’t run gdb on production machine.
Please find below the efforts we have putted in.
We have analyzed logs and continuously fired request that coming towards our application on our test environment to reproduce the problem.
We are not getting crash point in logs. Every time we get different logs. I think this is due to; Memory is smashed somewhere else and application crashes after sometime.
We have checked load at any point on our application and it is never exceeded our application limit.
Memory utilization of our application is also normal.
We have profiled our application with help of Valgrind in our test machine and removed valgrind errors but application is still crashing.
I appreciate any help to guide us to proceed further to solve the problem.
Below is the version details
Red hat linux server 5.6 (Tikanga)
Python 2.6.2 GCC 4.1
Following is the stack trace I am getting from the core dump files they have shared (on my machine). FYI, We don’t have access to production machine to run gdb on core dump files.
0 0x00000033c6678630 in ?? ()
1 0x00002b59d0e9501e in PyString_FromFormatV (format=0x2b59d0f2ab00 "can't multiply sequence by non-int of type '%.200s'", vargs=0x46421f20) at Objects/stringobject.c:291
2 0x00002b59d0ef1620 in PyErr_Format (exception=0x2b59d1170bc0, format=<value optimized out>) at Python/errors.c:548
3 0x00002b59d0e4bf1c in PyNumber_Multiply (v=0x2aaaac080600, w=0x2b59d116a550) at Objects/abstract.c:1192
4 0x00002b59d0ede326 in PyEval_EvalFrameEx (f=0x732b670, throwflag=<value optimized out>) at Python/ceval.c:1119
5 0x00002b59d0ee2493 in call_function (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:3794
6 PyEval_EvalFrameEx (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:2389
7 0x00002b59d0ee2493 in call_function (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:3794
8 PyEval_EvalFrameEx (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:2389
9 0x00002b59d0ee2493 in call_function (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:3794
10 PyEval_EvalFrameEx (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:2389
11 0x00002b59d0ee2493 in call_function (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:3794
12 PyEval_EvalFrameEx (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:2389
13 0x00002b59d0ee2d9f in ?? () at Python/ceval.c:2968 from /usr/local/lib/libpython2.6.so.1.0
14 0x0000000000000007 in ?? ()
15 0x00002b59d0e83042 in lookdict_string (mp=<value optimized out>, key=0x46424dc0, hash=40722104) at Objects/dictobject.c:412
16 0x00002aaab09d5458 in ?? ()
17 0x00002aaab09d5458 in ?? ()
18 0x00002aaab02a91f0 in ?? ()
19 0x00002aaab0b2c3a0 in ?? ()
20 0x0000000000000004 in ?? ()
21 0x00000000026d5eb8 in ?? ()
22 0x00002aaab0b2c3a0 in ?? ()
23 0x00002aaab071e080 in ?? ()
24 0x0000000046422bf0 in ?? ()
25 0x0000000046424dc0 in ?? ()
26 0x00000000026d5eb8 in ?? ()
27 0x00002aaab0987710 in ?? ()
28 0x00002b59d0ee2de2 in PyEval_EvalFrame (f=0x0) at Python/ceval.c:538
29 0x0000000000000000 in ?? ()
You are almost certainly doing something bad with pointers in your C++ code, which can be very tough to debug.
Do not assume that the stack trace is relevant. It might be relevant, but pointer misuse can often lead to crashes some time later
Build with full warnings on. The compiler can point out some non-obvious pointer misuse, such as returning a reference to a local.
Investigate your arrays. Try replacing arrays with std::vector (C++03) or std::array (C++11) so you can iterate using begin() and end() and you can index using at().
Investigate your pointers. Replace them with std::unique_ptr(C++11) or boost::scoped_ptr wherever you can (there should be no overhead in release builds). Replace the rest with shared_ptr or weak_ptr. Any that can't be replaced are probably the source of problematic logic.
Because of the very problems you're seeing, modern C++ allows almost all raw pointer usage to be removed entirely. Try it.
First things first, compile both your binary and libpython with debug symbols and push it out. The stack trace will be much easier to follow.
The relevant argument to g++ is -g.
Suggestions:
As already suggested, provide a complete debug build
Provide a memory test tool and a CPU torture test
Load debug symbols of python library when analyzing the core dump
The stacktrace shows something concerning eval(), so I guess you do dynamic code generation and evaluation/execution. If so, within this code, or passed arguments, there might be the actual error. Assertions at any interface to the code and code dumps may help.
I have a strange problem that I can't solve. Please help!
The program is a multithreaded c++ application that runs on ARM Linux machine. Recently I began testing it for the long runs and sometimes it crashes after 1-2 days like so:
*** glibc detected ** /root/client/my_program: free(): invalid pointer: 0x002a9408 ***
When I open core dump I see that the main thread it seems has a corrupt stack: all I can see is infinite abort() calls.
GNU gdb (GDB) 7.3
...
This GDB was configured as "--host=i686 --target=arm-linux".
[New LWP 706]
[New LWP 700]
[New LWP 702]
[New LWP 703]
[New LWP 704]
[New LWP 705]
Core was generated by `/root/client/my_program'.
Program terminated with signal 6, Aborted.
#0 0x001c44d4 in raise ()
(gdb) bt
#0 0x001c44d4 in raise ()
#1 0x001c47e0 in abort ()
#2 0x001c47e0 in abort ()
#3 0x001c47e0 in abort ()
#4 0x001c47e0 in abort ()
#5 0x001c47e0 in abort ()
#6 0x001c47e0 in abort ()
#7 0x001c47e0 in abort ()
#8 0x001c47e0 in abort ()
#9 0x001c47e0 in abort ()
#10 0x001c47e0 in abort ()
#11 0x001c47e0 in abort ()
And it goes on and on. I tried to get to the bottom of it by moving up the stack: frame 3000 or even more, but eventually core dump runs out of frames and I still can't see why this has happened.
When I examine the other threads everything seems normal there.
(gdb) info threads
Id Target Id Frame
6 LWP 705 0x00132f04 in nanosleep ()
5 LWP 704 0x001e7a70 in select ()
4 LWP 703 0x00132f04 in nanosleep ()
3 LWP 702 0x00132318 in sem_wait ()
2 LWP 700 0x00132f04 in nanosleep ()
* 1 LWP 706 0x001c44d4 in raise ()
(gdb) thread 5
[Switching to thread 5 (LWP 704)]
#0 0x001e7a70 in select ()
(gdb) bt
#0 0x001e7a70 in select ()
#1 0x00057ad4 in CSerialPort::read (this=0xbea7d98c, string_buffer=..., delimiter=..., timeout_ms=1000) at CSerialPort.cpp:202
#2 0x00070de4 in CScanner::readResponse (this=0xbea7d4cc, resp_recv=..., timeout=1000, delim=...) at PidScanner.cpp:657
#3 0x00071198 in CScanner::sendExpect (this=0xbea7d4cc, cmd=..., exp_str=..., rcv_str=..., timeout=1000) at PidScanner.cpp:604
#4 0x00071d48 in CScanner::pollPid (this=0xbea7d4cc, mode=1, pid=12, pid_str=...) at PidScanner.cpp:525
#5 0x00072ce0 in CScanner::poll1 (this=0xbea7d4cc)
#6 0x00074c78 in CScanner::Poll (this=0xbea7d4cc)
#7 0x00089edc in CThread5::Thread5Poll (this=0xbea7d360)
#8 0x0008c140 in CThread5::run (this=0xbea7d360)
#9 0x00088698 in CThread::threadFunc (p=0xbea7d360)
#10 0x0012e6a0 in start_thread ()
#11 0x001e90e8 in clone ()
#12 0x001e90e8 in clone ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(Classes and functions names are a bit wierd because I changed them -:)
So, thread #1 is where the stack is corrupt, backtrace of every other (2-6) shows
Backtrace stopped: previous frame identical to this frame (corrupt stack?).
It happends because threads 2-6 are created in the thread #1.
The thing is that I can't run the program in gdb because it runs on an embedded system. I can't use remote gdb server. The only option is examining core dumps that occur not very often.
Could you please suggest something that could move me forward with this? (Maybe something else I can extract from the core dump or maybe somehow to make some hooks in the code to catch abort() call).
UPDATE: Basile Starynkevitch suggested to use Valgrind, but turns out it's ported only for ARMv7. I have ARM 926 which is ARMv5, so this won't work for me. There are some efforts to compile valgrind for ARMv5 though: Valgrind cross compilation for ARMv5tel, valgrind on the ARM9
UPDATE 2: Couldn't make Electric Fence work with my program. The program uses C++ and pthreads. The version of Efence I got, 2.1.13 crashed in a arbitrary place after I start a thread and try to do something more or less complicated (for example to put a value into an STL vector). I saw people mentioning some patches for Efence on the web but didn't have time to try them. I tried this on my Linux PC, not on the ARM, and other tools like valgrind or Dmalloc don't report any problems with the code. So, everyone using version 2.1.13 of efence be prepared to have problems with pthreads (or maybe pthread + C++ + STL, don't know).
My guess for the "infinite' aborts is that either abort() causes a loop (e.g. abort -> signal handler -> abort -> ...) or that gdb can't correctly interpret the frames on the stack.
In either case I would suggest manually checking out the stack of the problematic thread. If abort causes a loop, you should see a pattern or at least the return address of abort repeating every so often. Perhaps you can then more easily find the root of the problem by manually skipping large parts of the (repeating) stack.
Otherwise, you should find that there is no repeating pattern and hopefully the return address of the failing function somewhere on the stack. In the worst case such addresses are overwritten due to a buffer overflow or such, but perhaps then you can still get lucky and recognise what it is overwritten with.
One possibility here is that something in that thread has very, very badly smashed the stack by vastly overwriting an on-stack data structure, destroying all the needed data on the stack in the process. That makes postmortem debugging very unpleasant.
If you can reproduce the problem at will, the right thing to do is to run the thread under gdb and watch what is going on precisely at the moment when the the stack gets nuked. This may, in turn, require some sort of careful search to determine where exactly the error is happening.
If you cannot reproduce the problem at will, the best I can suggest is very carefully looking for clues in the thread local storage for that thread to see if it hints at where the thread was executing before death hit.
I'm coding a game for Mac in c++, and I'm getting a SIGABRT, and the console prints the following:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::bad_lexical_cast> >'
what(): bad lexical cast: source type value could not be interpreted as target
Program received signal: “SIGABRT”.
So, I'm doing a bad lexical_cast. But the problem is that I can't know where, because the call stack is as follows:
#0 0x7fff85fb629a in mach_msg_trap
#1 0x7fff85fb690d in mach_msg
#2 0x7fff81f58932 in __CFRunLoopRun
#3 0x7fff81f57dbf in CFRunLoopRunSpecific
#4 0x7fff88dba7ee in RunCurrentEventLoopInMode
#5 0x7fff88dba5f3 in ReceiveNextEventCommon
#6 0x7fff88dba4ac in BlockUntilNextEventMatchingListInMode
#7 0x7fff84f85e64 in _DPSNextEvent
#8 0x7fff84f857a9 in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]
#9 0x7fff84f4b48b in -[NSApplication run]
#10 0x7fff84f441a8 in NSApplicationMain
#11 0x1000ef759 in os_gameMainLoop at main-osx.mm:22
#12 0x10009a97d in main at words.cpp:18
That´s not the right stack.
What's match_msg_trap?
Whay am I getting this call stack?
Do I have any way to get a good call stack on the crash?
Thanks!
The debugger stopped in the wrong thread. Try t a a bt in GDB to see backtraces for all the threads.
mach_msg_trap is where threads park while they are waiting for a message to come in. So, you are looking at a thread that isn't running. Mach is the name of the message-passing interface on OS X.