I've been working on a WebRTC data channel library in C/C++ and wrote a program in C to:
Create two peers from the same process.
Establish a connection between them.
Close the connection if it's successful.
Everything runs fine in a Debian Docker container and on my openSUSE Tumbleweed host (all x86_64), but in an Alpine Linux container (also x86_64) I'm getting a SEGFAULT inside the child processes:
The function above is from the program's dependency, libnice. It seems like *agent == NULL, yet nothing in the caller's scope sets it to null. I even inserted a printf("Argument is %p", agent); right before the function call; it prints the pointer's value and I can verify it's not null. From the disassembly, the segfault appears to occur on the instruction that copies agent (0x557a1d20) into a local variable on the callee's stack. The segfault always occurs at this point, even after a make clean and recompilation. A failure in the activation record? Stack corruption?
UPDATE: I made a more lightweight container and ran it, and now it segfaults at a different place in that same priv_conn_keepalive_tick_unlocked. The argument seems to be set though (Notice the 0x7ffff7f9ad08):
Since I thought I might be hitting musl libc's default stack limit of 80k, I used getrlimit(RLIMIT_STACK, &rl) to obtain the stack size, and it is already 8 MB, not 80k. Increasing this limit further does not seem to make any difference, except that if I assign more than 8 MB, my program crashes early inside GDB. GDB says it got an unknown signal "? ?"; outside GDB, it crashes at the same point where it crashes without the altered stack size.
I'm not sure what exactly the problem is (stack corruption?) and what to do next to resolve this.
Here's my program's flow:
For every peer that is created, a child process is created with fork(). Parent <--> child communication is done through ZeroMQ, and I use protocol buffers to forward any callbacks (and their arguments) that are triggered inside the child onto an event loop running in the parent process.
So for the above program, there are 2 child processes and 1 parent process.
Steps to reproduce:
Source file: https://github.com/hamon-in/librtcdcpp/blob/alpine-test/examples/websocket_client/2in1.c
Alpine docker container: https://github.com/hamon-in/librtcdcpp/blob/alpine-test/Dockerfile.amd64
Run the container; the binary is located at /psl-librtcdcpp/examples/websocket_client/2in1
2in1 will spawn two child processes both of which will crash.
On further investigation, the crash is in an instruction writing at a mildly large negative offset from the stack base pointer, so it's probably just a simple stack overflow.
The right way to fix this is reducing the excess stack usage or explicitly requesting a large stack at pthread_create time, but I don't see where pthread_create is being called from. A quick check to verify that this is the problem would be to override the default stack size for new threads by performing the following somewhere early in the program:
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 1<<20); // 1 MB
pthread_setattr_default_np(&attr);
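For the other option mentioned above, requesting a larger stack explicitly at pthread_create time, a minimal sketch could look like the following. The names worker and run_with_stack and the 256 KiB buffer are illustrative assumptions, not taken from the program in the question; only the standard pthread attribute calls are used.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>

// Hypothetical worker: touches a large local buffer, which is the kind of
// stack usage that can overflow a small default thread stack.
static void* worker(void* arg) {
    volatile char scratch[256 * 1024];            // 256 KiB on the stack
    scratch[0] = 1;
    scratch[sizeof(scratch) - 1] = 1;
    *static_cast<int*>(arg) = scratch[0] + scratch[sizeof(scratch) - 1];
    return nullptr;
}

// Runs worker() on a thread whose stack size is requested explicitly.
// Returns 0 on success, otherwise the first non-zero pthread error code.
int run_with_stack(std::size_t stack_bytes, int* result) {
    assert(result != nullptr);
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    pthread_t tid;
    if (rc == 0) rc = pthread_create(&tid, &attr, worker, result);
    pthread_attr_destroy(&attr);
    if (rc != 0) return rc;
    return pthread_join(tid, nullptr);
}
```

Calling run_with_stack(1 << 20, &result) asks for a 1 MB stack for that one thread only, which avoids changing the process-wide default.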
Add -Werror=implicit-function-declaration to your CFLAGS and you'll immediately have the cause. The key clue is the pointer value 0x557a1d20, which is almost surely the result of truncating a pointer to 32 bits. This happens when you fail to declare a function that returns a pointer: the compiler (by an awful backwards default) assumes it returns int rather than producing an error, then allows the implicit conversion from int to pointer despite the C language disallowing it.
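To make the truncation concrete, here is a small sketch of the arithmetic only. It does not reproduce the implicit declaration itself (C++ rejects those outright); it just shows what a 64-bit pointer value looks like after a round trip through a 32-bit int. The full address 0x7f8c557a1d20 is a made-up example whose low half matches the 0x557a1d20 from the question, and through_int is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Mimics a 64-bit pointer value passing through a 32-bit int:
// only the low 32 bits survive, and widening back sign-extends.
std::uint64_t through_int(std::uint64_t pointer_value) {
    std::int32_t as_int =
        static_cast<std::int32_t>(static_cast<std::uint32_t>(pointer_value));
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(as_int));
}

// Usage sketch with a hypothetical heap address.
void truncation_demo() {
    // The upper half (0x7f8c) is lost; the result is no longer a valid pointer.
    assert(through_int(0x7f8c557a1d20ULL) == 0x557a1d20ULL);
}
```

A pointer that prints as a suspiciously short value such as 0x557a1d20 on a 64-bit system is therefore a hint to look for a missing prototype.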
I have written a chess engine which plays at a high level. Unfortunately, it is known for crashing. A few people are trying to figure out the reason for those crashes. The code can be found here. We tracked all the crashes we encountered down to one file, uci.cpp, which deals with all the input/output from/to the user.
The job of uci.cpp/h is to implement the Universal Chess Interface (UCI) protocol. For that purpose we have a global Board object which represents a position. To be able to receive a stop command while the engine searches a position, we moved the search into its own thread. We use searchThread.joinable() to check whether a search is still running, and searchThread.join() before we start a new search to make sure we do not have multiple searches running.
Someone has sent us the list of crashes he was able to provoke on his machine:
1a) general protection fault on libstdc++.so.6.0.28, thread::join crash
1b) segfault on libpthread-2.31.so (replace 2.31 by your libc version), also thread::join crash
2a) trap stack segment on the binary itself, delete board crash
2b) segfault on the binary itself, delete board crash
Error type 1/2: thread.join()
The first 2 crash types are both related to calling searchThread.join(), although we check that it's joinable.
void uci_stop() {
    search_stop();
    if (searchThread.joinable()) {
        searchThread.join();
    }
}
What could cause .join() to fail?
We analysed our code a while ago and were not able to find any memory leaks. As far as we know, all data that is allocated within the searchThread will also get deleted.
Error type 3/4: delete board;
We use a global pointer for the board and in very rare instances, deleting that pointer will fail.
I found a potential solution: allocating the board object statically as a global, which means I wouldn't have to use new and delete anymore.
// old:
Board* board;
// new:
Board board{""};
We think that this solves the problem, so it is not the main part of this question, although we are still curious why delete board could fail in some instances.
All crashes have only occurred in roughly 2% of all games. With a game taking about 60 moves, thread.join() fails about 1 in 1000 times.
I have been having a peculiar problem. I have developed a C++ program on a Linux cluster at work. I have tried to use it at home on an Ubuntu 14.04 machine. The program is composed of 6 files: main.hpp, main.cpp (dependent on) sarsa.hpp, sarsa.cpp (class Sarsa) (dependent on) wec.hpp, wec.cpp. It does compile, but when I run it, it either produces a segmentation fault or does not enter one fundamental function of the class Sarsa.
The main code calls the constructor and setter functions without problems:
Sarsa run;
run.setVectorSize(memory,3,tilings,1000);
etc.
However, it cannot run the public function episode: learningRate, which should contain a large integer, is 0 for all episodes (iterations).
learningRate[episode]=run.episode(numSteps,graph);}
I tried to debug the code with gdb, which has returned:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000408f4a in main () at main.cpp:152
152 learningRate[episode]=run.episode(numSteps,graph);}
I also tried valgrind, which returned:
==10321== Uninitialised value was created by a stack allocation
==10321== at 0x408CAD: main (main.cpp:112)
But no memory leakage issues.
I was wondering if there was a setting to try to debug the external file sarsa.cpp, since I think that class is likely to be the culprit.
In the file I use C++11 features (I would expect compile-time errors, though), so I even compiled with g++ -std=c++0x, but there was no improvement.
Unluckily, because of the size of the code, I cannot post it here. I would really appreciate any help with this problem. Am I missing anything obvious? Could you help me at least with the debugging?
Thank you in advance for the help.
Correction:
main.cpp:
Definition of the global array:
#define numEpisodes 10
int learningRate[numEpisodes];
Towards the end of the main function:
for (int episode; episode<numEpisodes; episode++) {
    if (episode==(numEpisodes-1)) { // Save the simulation data only at the
        graph=true;}                // last episode
    learningRate[episode]=run.episode(numSteps,graph);}
As the code you just added to the question reveals, the problem arises because you did not initialize the episode variable. The behavior of any code that uses its value before you assign one is undefined, so it is entirely reasonable that the program behaves differently in one environment than in another.
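A minimal sketch of the corrected loop follows. Since the real Sarsa class is not shown, fakeEpisode is a hypothetical stand-in for run.episode, and runAllEpisodes is an invented wrapper; the only change that matters is the explicit "= 0" initializer.

```cpp
#include <cassert>

constexpr int numEpisodes = 10;
int learningRate[numEpisodes];

// Hypothetical stand-in for Sarsa::episode(); the real function is not shown.
int fakeEpisode(int numSteps, bool graph) {
    return graph ? -numSteps : numSteps;
}

void runAllEpisodes(int numSteps) {
    bool graph = false;
    // episode is initialized to 0, so every index is well defined.
    for (int episode = 0; episode < numEpisodes; ++episode) {
        if (episode == numEpisodes - 1) {
            graph = true;               // save data only for the last episode
        }
        assert(episode >= 0 && episode < numEpisodes);
        learningRate[episode] = fakeEpisode(numSteps, graph);
    }
}
```

Compiling with -Wall -Wextra usually makes gcc warn about the uninitialized variable in the original form of the loop.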
A segmentation fault indicates an invalid memory access. Usually this means that somewhere, you're reading or writing past the end of an array, or through an invalid pointer, or through an object that has already been freed. You don't necessarily get the segmentation fault at the point where the bug occurs; for instance, you could write past the end of an array onto heap metadata, which causes a crash later on when you try to allocate or release an unrelated object. So it's perfectly reasonable for a program to appear to work on one system but crash on another.
In this case, I'd start by looking at learningRate[episode]. What is the value of episode? Is it within the bounds of learningRate?
I was wondering if there was a setting to try to debug the external file sarsa.cpp, since I think that class is likely to be the culprit.
It's possible to set breakpoints in source files other than main.cpp, for example with break Sarsa::episode.
break location
Set a breakpoint at the given location, which can specify a function name, a line number, or an address of an instruction.
At least, I think that's your question. You'll also need to know how to step into functions.
More importantly, you need to learn what your tools are trying to tell you. A segfault is the operating system's reaction to an attempt to access memory that doesn't belong to you. One common reason for that is dereferencing NULL. Another is dereferencing a pointer that was never initialized. The Valgrind error message suggests that you may have an uninitialized pointer.
Without the code, I can't tell you why the pointer isn't initialized when you run the program on your home system, but is (apparently) initialized when you run it at work. I suspect that you don't have the necessary data on your home system, but you'll need to investigate and figure that out. The fundamental question to keep asking yourself is "what is different between my home computer and my work computer?"
In Ubuntu 14.04, I have a C++ API as a shared library which I am opening using dlopen, and then creating pointers to functions using dlsym. One of these functions CloseAPI releases the API from memory. Here is the syntax:
void* APIhandle = dlopen("Kinova.API.USBCommandLayerUbuntu.so", RTLD_NOW|RTLD_GLOBAL);
int (*CloseAPI)() = (int (*)()) dlsym(APIhandle,"CloseAPI");
If I ensure that during my code, the CloseAPI function is always called before the main function returns, then everything seems fine, and I can run the program again the next time. However, if I Ctrl-C and interrupt the program before it has had time to call CloseAPI, then on the next time I run the program, I get a return error whenever I call any of the API functions. I have no documentation saying what this error is, but my intuition is that there is some sort of lock on the library from the previous run of the program. The only thing that allows me to run the program again, is to restart my machine. Logging in and out does not work.
So, my questions are:
1) If my library is a shared library, why am I getting this error when I would have thought a shared library can be loaded by more than one program simultaneously?
2) How can I resolve this issue if I am going to be expecting Ctrl-C to be happening often, without being able to call CloseAPI?
So, using this API correctly requires you to do proper cleanup after use (which is not really user-friendly).
First of all, if you really need to use Ctrl-C, let the program end properly on this signal: Is destructor called if SIGINT or SIGSTP issued?
Then use a stack object that holds the resource (a pointer to the CloseAPI function in this case) and make sure this object calls CloseAPI in its destructor (you may want to check that CloseAPI hasn't already been called). See more in "Effective C++, Chapter 3: Resource Management".
That way, even if you don't call CloseAPI explicitly, the container object will do it for you.
P.S. You should consider doing this even if you're not going to use Ctrl-C. Imagine an exception occurs and your program has to stop: you want to be sure you don't leave the OS in an undefined state.
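The two steps described above (end the main loop on SIGINT, and let a stack object call CloseAPI from its destructor) could be sketched as follows. ApiGuard, onSigint, runUntilInterrupted and fakeCloseAPI are all hypothetical names; fakeCloseAPI stands in for the function pointer obtained from dlsym, which is not available here.

```cpp
#include <cassert>
#include <csignal>

// Set from the signal handler; sig_atomic_t is safe to write from a handler.
volatile std::sig_atomic_t g_interrupted = 0;
void onSigint(int) { g_interrupted = 1; }

// Calls the close function exactly once, at the latest when leaving scope.
class ApiGuard {
public:
    using CloseFn = int (*)();
    explicit ApiGuard(CloseFn closeFn) : closeFn_(closeFn) {
        assert(closeFn != nullptr);
    }
    ~ApiGuard() { close(); }
    ApiGuard(const ApiGuard&) = delete;
    ApiGuard& operator=(const ApiGuard&) = delete;
    void close() {
        if (closeFn_) { closeFn_(); closeFn_ = nullptr; }
    }
private:
    CloseFn closeFn_;
};

// Hypothetical stand-in for the CloseAPI pointer returned by dlsym.
int g_closeCalls = 0;
int fakeCloseAPI() { ++g_closeCalls; return 0; }

int runUntilInterrupted(int maxIterations) {
    std::signal(SIGINT, onSigint);      // Ctrl-C now only sets a flag
    ApiGuard guard(fakeCloseAPI);
    int done = 0;
    while (!g_interrupted && done < maxIterations) {
        ++done;                         // real work would go here
    }
    return done;                        // guard's destructor closes the API
}
```

The handler does nothing but set a flag, so the loop exits normally and the destructor runs during ordinary stack unwinding; this does not help if the process is killed with SIGKILL.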
TL;DR: How do I automatically add a watch in gdb when a function is called so I can debug some memory corruption?
I am currently dealing with some memory corruption in C++
I am mostly seeing 4-5 types of recurring crashes - all of which make little to no sense, so I'm guessing it has to be related to memory corruption.
These crashes only happen on the production server, round about every 2-5hours.
Most of them consist of accessing or passing a null pointer in a place where one can't possibly occur.
One of these places is a lambda capturing this. (see below)
Obviously looked at core dumps and even had gdb attached while it crashed
valgrind: I've spent hours staring at multiple instances of valgrind with no success.
Enabled gcc's stack protection (-fstack-protector-all)
I have tried looking over the code & the changes, but it has been impossible for me to find anything (100k lines of code total, "On master, 10,437 files have changed and there have been 3,352,600 additions and 85,495 deletions." since the last release on the production server). I might have just plain missed something, or not looked in the right spots - I can't tell.
Used cppcheck to see if there was something plain obvious wrong with the code
If there is an easier/more straightforward method of finding where the corruption occurs, feel free to suggest that too.
Let's look at some simplified code.
I have a class, Socket, which manages a client connection.
It is constructed something like this
Listener::OnAccept(fd){
    Socket* s = new Socket();
    if (s->Setup(fd)){
        // push into a vector and do some other things
    }
}
Socket::Setup calls (virtual) OnConnect of the Socket class, which then creates a ping event, using a lambda:
Socket::OnConnect(){
    m_pingEvent = new Event([this](Event* e){
        if (!this->GotPong()){
            // close connection
        }else{
            this->Ping();
        }
    }, 30 /*seconds*/, true /* loop */);
}
Event accepts an std::function as the callback
m_pingEvent is deleted in the destructor (if set) which will cancel the event if it is running.
What happens (rarely) is that the lambda calls Ping on a nullptr, which calls m_pingPacket->Send() on this=0x1f8, which leads to a segfault.
My question - or rather my proposed solution - would be watching the captured this pointer for writes, which definitely shouldn't happen.
There is only one small issue with that.
How would I even watch such a large number of pointers without manually adding each one? (about 400 concurrent connections with a lot of (dis)connects)
As for the captured data I found this is in the __closure object:
(gdb) frame 2
#2 0x081b9d63 in operator() (e=0x9b2a748, __closure=0xb5a8318)
at net/socket/Client.cpp:151
151 net/socket/Client.cpp: No such file or directory.
(gdb) ptype __closure
type = const struct {
net::socket::Client * const __this;
} * const
Which I can get when creating the lambda easily by just moving the lambda to "auto callback = " which will be of type:
(gdb) info locals
callback = {__this = 0xb4dd0948}
(gdb) ptype callback
type = struct {
net::socket::Client * const __this;
}
(gdb) print callback
$1 = {__this = 0xb4dd0948}
(This is gcc version 4.7.2 (Debian 4.7.2-5) for reference, might be different with other compilers/versions)
Shortly before posting I realized the struct would probably change address once moved into the std::function (is this correct?)
I've been digging through the GNU "functional" header, but I haven't really been able to find anything yet. I'll keep looking (and updating this).
Another note: I am posting this full description with all of the details included in case anyone has an easier solution for me. (XY Problem)
Edit:
(gdb) print *(void**)m_pingEvent->m_callback._M_functor._M_unused._M_object
$8 = (void *) 0xb4dd56d8
(gdb) print this
$4 = (net::socket::Client * const) 0xb4dd56d8
Found it :)
Edit2:
break net/socket/Client.cpp:158
commands
silent
watch -l m_pingEvent->m_callback._M_functor._M_unused._M_object
continue
end
This has two flaws: you can only watch 4 addresses at a time, and there is no way to delete the watch once the object is freed.
So it's unusable.
Edit 3:
I've figured out how to do the watching using this python script I wrote (linking this one externally since it's quite long): https://gist.github.com/imermcmaps/4a6d8a1577118645acf3
Next issue is making sense of the output..
Added watch 7 -> 0x10eb2200
Hardware watchpoint 7: -location m_pingEvent->m_callback._M_functor._M_unused._M_obj
Old value = (void *) 0x10eba4b0
New value = (void *) 0x10eba400
net::Packet::Packet (this=0x10eb1088) at ../shared/net/Packet.cpp:13
Like it's saying it changed from an old value, which shouldn't even be the original value, since I'm checking if the this pointer and the pointer value match, which they do.
Edit 4 (yay):
Turns out watch -l doesn't work the way I want it to.
Manually grabbing the address and then watching that address seems to work
How do I automatically add a watch in gdb when a function is called so
I can debug some memory corruption?
Memory corruption is often detected only after the real corruption has already occurred, possibly caused by some other module loaded within your process. So manual debugging may not be very useful for really complex projects, because any third-party module or library loaded within your process may also lead to this problem. From your post it looks like the problem is not always reproducible, which suggests a threading/synchronization problem that leads to some sort of memory corruption. Based on my experience, I strongly suggest you concentrate on reproducing the problem under dynamic tools (Valgrind/Helgrind).
You mentioned in your question that you are able to run your program under Valgrind. You may want to run your program (a.out) as follows, in case you have not already done so:
$ valgrind --tool=memcheck --db-attach=yes ./a.out
This way Valgrind automatically attaches your program to the debugger when the first memory error is detected, so that you can do live debugging (GDB). This seems to be the best way to find the root cause of your problem. (Note that --db-attach was removed in Valgrind 3.11; newer versions provide the gdbserver-based --vgdb option instead.)
However, I think there may be a data race that is leading to the memory corruption. So you may want to use Helgrind to find data races and other threading problems that might be behind it.
For more information on these, you may refer the following post:
https://stackoverflow.com/a/22658693/2724703
https://stackoverflow.com/a/22617989/2724703
I have a very strange bug cropping up right now in a fairly massive C++ application at work (massive in terms of CPU and RAM usage as well as code length - in excess of 100,000 lines). This is running on a dual-core Sun Solaris 10 machine. The program subscribes to stock price feeds and displays them on "pages" configured by the user (a page is a window construct customized by the user - the program allows the user to configure such pages). This program used to work without issue until one of the underlying libraries became multi-threaded. The parts of the program affected by this have been changed accordingly. On to my problem.
Roughly once in every three executions the program will segfault on startup. This is not necessarily a hard rule - sometimes it'll crash three times in a row then work five times in a row. It's the segfault that's interesting (read: painful). It may manifest itself in a number of ways, but most commonly what will happen is function A calls function B and upon entering function B the frame pointer will suddenly be set to 0x000002. Function A:
result_type emit(typename type_trait<T_arg1>::take _A_a1) const
{ return emitter_type::emit(impl_, _A_a1); }
This is a simple signal implementation. impl_ and _A_a1 are well-defined within their frame at the crash. On actual execution of that instruction, we end up at program counter 0x000002.
This doesn't always happen on that function. In fact it happens in quite a few places, but this is one of the simpler cases that doesn't leave that much room for error. Sometimes what will happen is a stack-allocated variable will suddenly be sitting on junk memory (always on 0x000002) for no reason whatsoever. Other times, that same code will run just fine. So, my question is, what can mangle the stack so badly? What can actually change the value of the frame pointer? I've certainly never heard of such a thing. About the only thing I can think of is writing out of bounds on an array, but I've built it with a stack protector which should come up with any instances of that happening. I'm also well within the bounds of my stack here. I also don't see how another thread could overwrite the variable on the stack of the first thread since each thread has its own stack (this is all pthreads). I've tried building this on a Linux machine and while I don't get segfaults there, roughly one out of three times it will freeze up on me.
Stack corruption, 99.9% definitely.
The smells you should be looking carefully for are:-
Use of 'C' arrays
Use of 'C' strcpy-style functions
memcpy
malloc and free
thread-safety of anything using pointers
Uninitialised POD variables.
Pointer Arithmetic
Functions trying to return local variables by reference
I had that exact problem today and was knee-deep in gdb mud, debugging for a straight hour, before it occurred to me that I had simply written past the bounds of a C array (where I least expected it).
So, if possible, use vectors instead, because any decent STL implementation will report an out-of-bounds access with a clear error message in debug mode (whereas C arrays punish you with segfaults).
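As a small illustration of that advice, the sketch below uses std::vector::at(), which always checks the index and throws std::out_of_range. The helper name checkedWrite is invented for this example. With libstdc++, operator[] can additionally be checked by building with -D_GLIBCXX_DEBUG, which is presumably the kind of "debug mode" meant above.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Writes value at index; returns false instead of corrupting memory
// when the index is out of range.
bool checkedWrite(std::vector<char>& buf, std::size_t index, char value) {
    try {
        buf.at(index) = value;          // at() throws std::out_of_range
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}

// Usage sketch: the same off-by-one that silently tramples the stack
// with a C array is reported here.
void checkedWriteDemo() {
    std::vector<char> message(10);
    assert(checkedWrite(message, 9, '!'));    // last valid element
    assert(!checkedWrite(message, 10, '!'));  // one past the end: rejected
}
```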
I'm not sure what you're calling a "frame pointer", as you say:
On actual execution of that instruction, we end up at program counter 0x000002
Which makes it sound like the return address is being corrupted. The frame pointer is a pointer that points to the location on the stack of the current function call's context. It may well point to the return address (this is an implementation detail), but the frame pointer itself is not the return address.
I don't think there's enough information here to really give you a good answer, but some things that might be culprits are:
incorrect calling convention. If you're calling a function using a calling convention different from how the function was compiled, the stack may become corrupted.
RAM hit. Anything writing through a bad pointer can cause garbage to end up on the stack. I'm not familiar with Solaris, but most thread implementations have the threads in the same process address space, so any thread can access any other thread's stack. One way a thread can get a pointer into another thread's stack is if the address of a local variable is passed to an API that ultimately deals with the pointer on a different thread. Unless you synchronize things properly, this will end up with the pointer accessing invalid data. Given that you're dealing with a "simple signal implementation", it seems possible that one thread is sending a signal to another. Maybe one of the parameters in that signal has a pointer to a local?
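The "pointer to a local handed to another thread" hazard described above can be avoided by giving the other thread ownership of the data instead of a pointer into the caller's stack frame. The following is only a sketch of that idea using std::thread and std::shared_ptr; startWorker and runSafely are hypothetical names, and the application in the question uses raw pthreads and its own signal library.

```cpp
#include <cassert>
#include <memory>
#include <thread>

// Risky pattern (deliberately not executed here): start a thread with
// &someLocal and return before the thread has finished, so the thread
// later writes into a stack frame that no longer exists.

// Safer: the thread shares ownership of heap data, so the data outlives
// whichever side finishes first.
std::thread startWorker(std::shared_ptr<int> value) {
    assert(value != nullptr);
    return std::thread([value] { *value += 1; });
}

int runSafely() {
    auto value = std::make_shared<int>(41);
    std::thread t = startWorker(value);
    t.join();            // join() also orders the worker's write before the read
    return *value;
}
```

Here runSafely() returns 42; the shared_ptr copy captured by the lambda keeps the integer alive even if the creating function were to return before the worker runs.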
There's some confusion here between stack overflow and stack corruption.
Stack overflow is a very specific issue caused by trying to use more stack than the operating system has allocated to your thread. The three usual causes look like this.
void foo()
{
    foo(); // endless recursion - whoops!
}
void foo2()
{
    char myBuffer[A_VERY_BIG_NUMBER]; // The stack can't hold that much.
}
class bigObj
{
    char myBuffer[A_VERY_BIG_NUMBER];
};
void foo2( bigObj big1) // pass by value of a big object - whoops!
{
}
In embedded systems, thread stack size may be measured in bytes and even a simple calling sequence can cause problems. By default on windows, each thread gets 1 Meg of stack, so causing stack overflow is much less of a common problem. Unless you have endless recursion, stack overflows can always be mitigated by increasing the stack size, even though this usually is NOT the best answer.
Stack Corruption simply means writing outside the bounds of the current stack frame, thus potentially corrupting other data - or return addresses on the stack.
At its simplest:
void foo()
{
    char message[10];
    message[10] = '!'; // whoops! beyond end of array
}
That sounds like a stack overflow problem - something is writing beyond the bounds of an array and trampling over the stack frame (and probably the return address too) on the stack. There's a large literature on the subject. "The Shell Programmer's Guide" (2nd Edition) has SPARC examples that may help you.
With C++, uninitialized variables and race conditions are likely suspects for intermittent crashes.
Is it possible to run the thing through Valgrind? Perhaps Sun provides a similar tool. Intel VTune (Actually I was thinking of Thread Checker) also has some very nice tools for thread debugging and such.
If your employer can spring for the cost of the more expensive tools, they can really make these sorts of problems a lot easier to solve.
It's not hard to mangle the frame pointer - if you look at the disassembly of a routine you will see that it is pushed at the start of a routine and pulled at the end - so if anything overwrites the stack it can get lost. The stack pointer is where the stack is currently at - and the frame pointer is where it started at (for the current routine).
Firstly I would verify that all of the libraries and related objects have been rebuilt clean and all of the compiler options are consistent - I've had a similar problem before (Solaris 2.5) that was caused by an object file that hadn't been rebuilt.
It sounds exactly like an overwrite - and putting guard blocks around memory isn't going to help if it is simply a bad offset.
After each core dump examine the core file to learn as much as you can about the similarities between the faults. Then try to identify what is getting overwritten. As I remember the frame pointer is the last stack pointer - so anything logically before the frame pointer shouldn't be modified in the current stack frame - so maybe record this and copy it elsewhere and compare upon return.
Is something meaning to assign a value of 2 to a variable but instead is assigning its address to 2?
The other details are lost on me but "2" is the recurring theme in your problem description. ;)
I would second that this definitely sounds like a stack corruption due to out of bound array or buffer writing. Stack protector would be good as long as the writing is sequential, not random.
I second the notion that it is likely stack corruption. I'll add that the switch to a multi-threaded library makes me suspect that a lurking bug has been exposed. Possibly the buffer overflow used to land on unused memory, and now it's hitting another thread's stack. There are many other possible scenarios.
Sorry if that doesn't give much of a hint at how to find it.
I tried Valgrind on it, but unfortunately it doesn't detect stack errors:
"In addition to the performance penalty an important limitation of Valgrind is its inability to detect bounds errors in the use of static or stack allocated data."
I tend to agree that this is a stack overflow problem. The tricky thing is tracking it down. Like I said, there's over 100,000 lines of code to this thing (including custom libraries developed in-house - some of it going as far back as 1992) so if anyone has any good tricks for catching that sort of thing, I'd be grateful. There's arrays being worked on all over the place and the app uses OI for its GUI (if you haven't heard of OI, be grateful) so just looking for a logical fallacy is a mammoth task and my time is short.
Also agreed that the 0x000002 is suspect. It is about the only constant between crashes. Even weirder is the fact that this only cropped up with the multi-threaded switch. I think that the smaller stack as a result of the multiple-threads is what's making this crop up now, but that's pure supposition on my part.
No one asked this, but I built with gcc-4.2. Also, I can guarantee ABI safety here so that's also not the issue. As for the "garbage at the end of the stack" on the RAM hit, the fact that it is universally 2 (though in different places in the code) makes me doubt that as garbage tends to be random.
It is impossible to know, but here are some hints that I can come up with.
With pthreads, each thread's stack has a fixed size: either the default or one you request (or allocate yourself) through the thread attributes. Is it large enough? There is no automatic stack growth like in a single-threaded process.
If you are sure that you don't corrupt the stack by writing past stack-allocated data, check for rogue pointers (mostly uninitialized pointers).
One of the threads could overwrite some data that others depend on (check your data synchronisation).
Debugging is usually not very helpful here. I would try to create lots of log output (traces for entry and exit of every function/method call) and then analyze the log.
The fact that the error manifests itself differently on Linux may help. What thread mapping are you using on Solaris? Make sure you map every thread to its own LWP to ease the debugging.
Also agreed that the 0x000002 is suspect. It is about the only constant between crashes. Even weirder is the fact that this only cropped up with the multi-threaded switch. I think that the smaller stack as a result of the multiple-threads is what's making this crop up now, but that's pure supposition on my part.
If you pass anything on the stack by reference or by address, this would most certainly happen if another thread tried to use it after the first thread returned from a function.
You might be able to reproduce this by forcing the app onto a single processor. I don't know how you do that on SPARC.