Weird crashes in C++ Project on thread.join()

Weird crashes in C++ Project on thread.join() - c++

i have written a chess engine which plays at a high level. Unfortunately we are known for multiple engine crashes. There are a few people trying to figure out the reason for those crashes. The code can be found here. We tracked all crashes we encountered down to one file which deals with all the input/output from/to the user. uci.cpp is the file.
The job of uci.cpp/h is to implement the universal chess protocol (uci). For that purpose we have some global Board object which represents a position. To be able to receive a stop command while the engine searches a specific position, we brought the search into its own thread. We use thread.joinable() to check if a search is still running and searchThread.join() before we start a new search to make sure we do not have multiple searches running.
Someone has sent us the list of crashes he was able to provoke on his machine:
1a) general protection fault on libstdc++.so.6.0.28, thread::join crash
1b) segfault on libpthread-2.31.so (replace 2.31 by your libc version), also thread::join crash
2a) trap stack segment on the binary itself, delete board crash
2b) segfault on the binary itself, delete board crash
Error type 1/2: thread.join()
The first 2 crash types are both related to calling searchThread.join() although we check if its joinable.
void uci_stop() {
search_stop();
if (searchThread.joinable()) {
searchThread.join();
}
}
What are reasons that could .join() to fail?
We analysed our code a while ago and were not able to find any memory leaks. As far as we know, all data that is allocated within the searchThread will also get deleted.
Error type 3/4: delete board;
We use a global pointer for the board and in very rare instances, deleting that pointer will fail.
I found a potential solution which will statically allocate the board object globally which means I wouldnt havent to use new and delete anymore.
// old:
Board* board;
// new:
Board board{""};
We think that this solves the problem and is not the main part of this question although we are still curious why delete board could fail in some instances.
All crashes have only occured in roughly 2% of all games. With a game taking about 60 moves, thread.join() fails about 1 in 1000 times.

Related

Rare segmentation fault during object creation with new

In a Java application, I use JNI to call several C++ methods. One of the methods creates an object that has to persist after the method finished and that is used in other method calls. To this end, I create a pointer of the object, which I return to Java as a reference for later access (note: the Java class implements Closable and in the close method, I call a method to delete the object).
However, in rare cases, approximately after 50.000 calls, the C++ code throws a segmentation fault. Based on the content of the log file, only a few lines of code are suspicious to be the source of error (they between the last printed log message and the next one):
MyObject* handle = new MyObject(some_vector, shared_ptr1, shared_ptr2);
handles.insert(handle); // handles is a std::set
jlong handleId = (jlong) handle;
I'd like to know whether there is a possible issue here apart from the fact that I'm using old-style C pointers. Could multi-threading be a problem? Or could the pointer ID be truncated when converted to jlong?
I also want to note that from my previous experience, I'm aware that the log is only a rough indicator of where a segmentation fault occurred. It may as well have been occurred later in the code and the next log message was simply not printed yet. However, reproducing this error may take 1-2 days, so I'd like to check out whether these lines have a problem.

After removing my std::set from the code, the error did not occur anymore. Conclusion: std::set in multithreading must be protected to avoid unrecoverable crashes.

Program segfaults on alpine linux. How do I resolve it?

I've been working on a webrtc datachannel library in C/C++ and wrote a program in C to:
Create two peers from the same process.
Establish a connection between them.
Close the connection if it's successful.
Everything runs fine on a debian docker container and on my host opensuse tumbleweed (all x86_64 and 64bit), but on alpine linux container (64bit x86_64), I'm getting a SEGFAULT inside the child processes:
The function above is from the program's dependency "libnice". It seems like *agent == NULL and there is no way that is made null in the caller's scope. I even inserted a printf("Argument is %p", agent); right before the function call and it prints out its memory and I can verify it's not null. From the disassembly, it looks like the line where copying the agent's contents (0x557a1d20) as the local variable in the callee's stack results in a segfault. The segfault always occurs at this point even after a make clean and recompilation. Fail at activation record? Stack corruption?
UPDATE: I made a more lightweight container and ran it, and now it segfaults at a different place in that same priv_conn_keepalive_tick_unlocked. The argument seems to be set though (Notice the 0x7ffff7f9ad08):
Since I thought I might be hitting the libmusl's default stack limit of 80k, I used getrlimit(RLIMIT_STACK, &rl) to obtain the stack size and it looks like it's already 8 MB and not 80k. Increasing this limit further does not seem to make any difference except that if I assign more than 8 MB, my program crashes early inside the Gdb. Gdb says it got an unknown signal "? ?"; outside the gdb, it crashes at the normal point where it normally crashes without the altered stack size.
I'm not sure what exactly the problem is (stack corruption?) and what to do next to resolve this.
Here's my program's flow:
For every peer that is created, a child process is created with a fork(). Parent <--> child communication is done by ZeroMQ and I use protocol buffers to forward any callbacks (and its arguments) that are triggered inside the child onto an event loop running in the parent process.
So for the above program, there are 2 child processes and 1 parent process.
Steps to reproduce:
Source file: https://github.com/hamon-in/librtcdcpp/blob/alpine-test/examples/websocket_client/2in1.c
Alpine docker container: https://github.com/hamon-in/librtcdcpp/blob/alpine-test/Dockerfile.amd64
Run the container and binary is located at /psl-librtcdcpp/examples/websocket_client/2in1
2in1 will spawn two child processes both of which will crash.

On further investigation, the crash is in an instruction writing at a mildly large negative offset from the stack base pointer, so it's probably just a simple stack overflow.
The right way to fix this is reducing the excess stack usage or explicitly requesting a large stack at pthread_create time, but I don't see where pthread_create is being called from. A quick check to verify that this is the problem would be to override the default stack size for new threads by performing the following somewhere early in the program:
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 1<<20); // 1 MB
pthread_setattr_default_np(&attr);

Add -Werror=implicit-function-declaration to your CFLAGS and you'll immediately have the cause. The key clue is the pointer value 0x557a1d20, which is almost surely the result of truncating a pointer to 32 bits. This happens when you failed to declare a function that returns a pointer and the compiler (by an awful backwards default) assumes it returns int rather than producing an error, then subsequently allows the implicit conversion from int to pointer despite the C language disallowing it.

Segmentation fault on one Linux machine but not another with C++ code

I have been having a peculiar problem. I have developed a C++ program on a Linux cluster at work. I have tried to use it home on an Ubuntu 14.04 machine, but the program, which is composed of 6 files: main.hpp,main.cpp (dependent on) sarsa.hpp,sarsa.cpp (class Sarsa) (dependent on) wec.hpp,wec.cpp, does compile, but when I run it it either returns segmenation fault or does not enter one fundamental function of the class Sarsa.
The main code calls the constructor and setter functions without problems:
Sarsa run;
run.setVectorSize(memory,3,tilings,1000);
etc.
However, it cannot run the public function episode , since learningRate, which should contain a large integer, returns 0 for all episodes (iterations).
learningRate[episode]=run.episode(numSteps,graph);}
I tried to debug the code with gdb, which has returned:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000408f4a in main () at main.cpp:152
152 learningRate[episode]=run.episode(numSteps,graph);}
I also tried valgrind, which returned:
==10321== Uninitialised value was created by a stack allocation
==10321== at 0x408CAD: main (main.cpp:112)
But no memory leakage issues.
I was wondering if there was a setting to try to debug the external file sarsa.cpp, since I think that class is likely to be the culpript
In the file, I use C++v11 language (I would be expecting errors at compile-time,though), so I even compiled with g++ -std=c++0x, but there were no improvements.
Unluckily, because of the size of the code, I cannot post it here. I would really appreciate any help with this problem. Am I missing anything obvious? Could you help me at least with the debugging?
Thank you in advance for the help.
Correction:
main.cpp:
Definition of the global array:
`#define numEpisodes 10
int learningRate[numEpisodes];`
Towards the end of the main function:
for (int episode; episode<numEpisodes; episode++) {
if (episode==(numEpisodes-1)) { // Save the simulation data only at the
graph=true;} // last episode
learningRate[episode]=run.episode(numSteps,graph);}

As the code you just added to the question reveals, the problem arises because you did not initialize the episode variable. The behavior of any code that uses its value before you assign one is undefined, so it is entirely reasonable that the program behaves differently in one environment than in another.

A segmentation fault indicates an invalid memory access. Usually this means that somewhere, you're reading or writing past the end of an array, or through an invalid pointer, or through an object that has already been freed. You don't necessarily get the segmentation fault at the point where the bug occurs; for instance, you could write past the end of an array onto heap metadata, which causes a crash later on when you try to allocate or release an unrelated object. So it's perfectly reasonable for a program to appear to work on one system but crash on another.
In this case, I'd start by looking at learningRate[episode]. What is the value of episode? Is it within the bounds of learningRate?

I was wondering if there was a setting to try to debug the external file sarsa.cpp, since I think that class is likely to be the culpript
It's possible to set breakpoints in functions other than main.cpp.
break location
Set a breakpoint at the given location, which can specify a function name, a line number, or an address of an instruction.
At least, I think that's your question. You'll also need to know how to step into functions.
More importantly, you need to learn what your tools are trying to tell you. A segfault is the operating system's reaction to an attempt to dereference memory that doesn't belong to you. One common reason for that is trying to dereference NULL. Another would be trying to dereference a pointer that was never initialized. The Valgrind error message suggests that you may have an unitialized pointer.
Without the code, I can't tell you why the pointer isn't initialized when you run the program on your home system, but is (apparently) initialized when you run it at work. I suspect that you don't have the necessary data on your home system, but you'll need to investigate and figure that out. The fundamental question to keep asking yourself is "what is different between my home computer an dmy work computer?"

My code crashes on delete this

I get a segmentation fault when attempting to delete this.
I know what you think about delete this, but it has been left over by my predecessor. I am aware of some precautions I should take, which have been validated and taken care of.
I don't get what kind of conditions might lead to this crash, only once in a while. About 95% of the time the code runs perfectly fine but sometimes this seems to be corrupted somehow and crash.
The destructor of the class doesn't do anything btw.
Should I assume that something is corrupting my heap somewhere else and that the this pointer is messed up somehow?
Edit : As requested, the crashing code:
long CImageBuffer::Release()
{
long nRefCount = InterlockedDecrement(&m_nRefCount);
if(nRefCount == 0)
{
delete this;
}
return nRefCount;
}
The object has been created with a new, it is not in any kind of array.

The most obvious answer is : don't delete this.
If you insists on doing that, then use common ways of finding bugs :
1. use valgrind (or similar tool) to find memory access problems
2. write unit tests
3. use debugger (prepare for loooong staring at the screen - depends on how big your project is)

It seems like you've mismatched new and delete. Note that delete this; can only be used on an object which was allocated using new (and in case of overridden operator new, or multiple copies of the C++ runtime, the particular new that matches delete found in the current scope)

Crashes upon deallocation can be a pain: It is not supposed to happen, and when it happens, the code is too complicated to easily find a solution.
Note: The use of InterlockedDecrement have me assume you are working on Windows.
Log everything
My own solution was to massively log the construction/destruction, as the crash could well never happen while debugging:
Log the construction, including the this pointer value, and other relevant data
Log the destruction, including the this pointer value, and other relevant data
This way, you'll be able to see if the this was deallocated twice, or even allocated at all.
... everything, including the stack
My problem happened in Managed C++/.NET code, meaning that I had easy access to the stack, which was a blessing. You seem to work on plain C++, so retrieving the stack could be a chore, but still, it remains very very useful.
You should try to load code from internet to print out the current stack for each log. I remember playing with http://www.codeproject.com/KB/threads/StackWalker.aspx for that.
Note that you'll need to either be in debug build, or have the PDB file along the executable file, to make sure the stack will be fully printed.
... everything, including multiple crashes
I believe you are on Windows: You could try to catch the SEH exception. This way, if multiple crashes are happening, you'll see them all, instead of seeing only the first, and each time you'll be able to mark "OK" or "CRASHED" in your logs. I went even as far as using maps to remember addresses of allocations/deallocations, thus organizing the logs to show them together (instead of sequentially).
I'm at home, so I can't provide you with the exact code, but here, Google is your friend, but the thing to remember is that you can't have a __try/__except handdler everywhere (C++ unwinding and C++ exception handlers are not compatible with SEH), so you'll have to write an intermediary function to catch the SEH exception.
Is your crash thread-related?
Last, but not least, the "I happens only 5% of the time" symptom could be caused by different code path executions, or the fact you have multiple threads playing together with the same data.
The InterlockedDecrement part bothers me: Is your object living in multiple threads? And is m_nRefCount correctly aligned and volatile LONG?
The correctly aligned and LONG part are important, here.
If your variable is not a LONG (for example, it could be a size_t, which is not a LONG on a 64-bit Windows), then the function could well work the wrong way.
The same can be said for a variable not aligned on 32-byte boundaries. Is there #pragma pack() instructions in your code? Does your projet file change the default alignment (I assume you're working on Visual Studio)?
For the volatile part, InterlockedDecrement seem to generate a Read/Write memory barrier, so the volatile part should not be mandatory (see http://msdn.microsoft.com/en-us/library/f20w0x5e.aspx).

I get an exception if I leave the program running for a while

Platform : Win32
Language : C++
I get an error if I leave the program running for a while (~10 min).
Unhandled exception at 0x10003fe2 in ImportTest.exe: 0xC0000005: Access violation reading location 0x003b1000.
I think it could be a memory leak but I don't know how to find that out.
Im also unable to 'free()' memory because it always causes (maybe i shouldn't be using free() on variables) :
Unhandled exception at 0x76e81f70 in ImportTest.exe: 0xC0000005: Access violation reading location 0x0fffffff.
at that stage the program isn't doing anything and it is just waiting for user input
dllHandle = LoadLibrary(L"miniFMOD.dll");
playSongPtr = (playSongT)GetProcAddress(dllHandle,"SongPlay");
loadSongPtr = (loadSongT)GetProcAddress(dllHandle,"SongLoadFromFile");
int songHandle = loadSongPtr("FILE_PATH");
// ... {just output , couldn't cause errors}
playSongPtr(songHandle);
getch(); // that is where it causes an error if i leave it running for a while
Edit 2:
playSongPtr(); causes the problem. but i don't know how to fix it

I think it's pretty clear that your program has a bug. If you don't know where to start looking, a useful technique is "divide and conquer".
Start with your program in a state where you can cause the exception to happen. Eliminate half your code, and try again. If the exception still happens, then you've got half as much code to look through. If the exception doesn't happen, then it might have been related to the code you just removed.
Repeat the above until you isolate the problem.
Update: You say "at that stage the program isn't doing anything" but clearly it is doing something (otherwise it wouldn't crash). Is your program a console mode program? If so, what function are you using to wait for user input? If not, then is it a GUI mode program? Have you opened a dialog box and are waiting for something to happen? Have you got any Windows timers running? Any threads?
Update 2: In light of the small snippet of code you posted, I'm pretty sure that if you try to remove the call to the playSongPtr(songHandle) function, then your problem is likely to go away. You will have to investigate what the requirements are for "miniFMOD.dll". For example, that DLL might assume that it's running in a GUI environment instead of a console program, and may do things that don't necessarily work in console mode. Also, in order to do anything in the background (including playing a song), that DLL probably needs to create a thread to periodically load the next bit of the song and queue it in the play buffer. You can check the number of threads being created by your program in Task Manager (or better, Process Explorer). If it's more than one, then there are other things going on that you aren't directly controlling.

The error tells you that memory is accessed which you have not allocated at the moment. It could be a pointer error like dereferencing NULL. Another possibility is that you use memory after you freed it.
The first step would be to check your code for NULL reference checks, i.e. make sure you have a valid pointer before you use it, and to check the lifecycle of all allocated and freed resources. Writing NULL's over references you just freed might help find the problem spot.

I doubt this particular problem is a memory leak; the problem is dereferencing a pointer that does not point to something useful. To check for a memory leak, watch your process in your operating system's process list tool (task manager, ps, whatever) and see if the "used memory" value keeps growing.
On calling free: You should call free() once and only once on the non-null values returned from malloc(), calloc() or strdup(). Calling free() less than once will lead to a memory leak. Calling free() more than once will lead to memory corruption.
You should get a stack trace to see what is going on when the process crashes. Based on my reading of the addresses involved you probably have a stack overflow or have an incorrect pointer calculation using a stack address (in C/C++ terms: an "auto" variable.) A stack trace will tell you how you got to the point where it crashed.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js