I am examining a few crashes that all have the signal SIGSEGV with the reason SEGV_ACCERR. After searching for SEGV_ACCERR, the closest thing I have found to a human readable explanation is: Invalid Permissions for object
What does this mean in a more general sense? When would a SEGV_ACCERR arise? Is there more specific documentation on this reason?
This is an error that I have mostly seen on 64-bit iOS devices. It can happen when multiple threads read and modify a variable under ARC without synchronization. For example, I fixed a crash today where several background threads were reading and updating static NSDate and NSString variables without any kind of locking or queueing.
Using core data objects on multiple threads can also cause this crash, as I have seen many times in my crash logs.
I also use Crittercism, and this particular crash was a SEGV_ACCERR that only affected 64 bit devices.
As stated in the sigaction man page, SEGV_ACCERR is a signal code for SIGSEGV that means "Invalid permissions for mapped object". In contrast to SEGV_MAPERR, which means the address is not mapped to any object, SEGV_ACCERR means the address is mapped but the process lacks permission for the attempted access, for example writing to a read-only page or executing from a non-executable one.
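To make the difference between the two codes concrete, here is a minimal sketch, assuming Linux and POSIX signals. It installs a SIGSEGV handler with SA_SIGINFO, provokes one fault of each kind, and reports the si_code that was delivered. The helper names (record_segv, segv_code_for_write and the two probe functions) are made up for illustration, and recovering from a fault with siglongjmp is only reasonable in a small experiment like this one.

```cpp
#include <cassert>
#include <cstddef>
#include <setjmp.h>
#include <signal.h>
#include <sys/mman.h>

static sigjmp_buf g_resume;                   // where to continue after the fault
static volatile sig_atomic_t g_si_code = 0;   // si_code captured by the handler

// Handler: remember why the fault happened, then jump back out of the faulting write.
static void record_segv(int, siginfo_t* info, void*) {
    g_si_code = info->si_code;
    siglongjmp(g_resume, 1);
}

// Writes one byte to addr and returns the si_code of the resulting SIGSEGV (0 if none).
int segv_code_for_write(volatile char* addr) {
    struct sigaction action {};
    struct sigaction previous {};
    action.sa_sigaction = record_segv;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    sigaction(SIGSEGV, &action, &previous);

    g_si_code = 0;
    if (sigsetjmp(g_resume, 1) == 0) {
        *addr = 'x';  // may fault; the handler then resumes after this block
    }
    sigaction(SIGSEGV, &previous, nullptr);
    return g_si_code;
}

// The page is mapped, but only for reading: expect SEGV_ACCERR.
int code_for_read_only_page() {
    const std::size_t length = 4096;
    void* page = mmap(nullptr, length, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(page != MAP_FAILED);
    int code = segv_code_for_write(static_cast<volatile char*>(page));
    munmap(page, length);
    return code;
}

// The page was unmapped before the write: expect SEGV_MAPERR.
int code_for_unmapped_page() {
    const std::size_t length = 4096;
    void* page = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(page != MAP_FAILED);
    munmap(page, length);
    return segv_code_for_write(static_cast<volatile char*>(page));
}
```

Comparing the two return values against SEGV_ACCERR and SEGV_MAPERR shows which kind of fault each write produced.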
I've seen this in cases where code tries to execute from places other than "text".
For example, if a function pointer points into the heap or the stack and you try to execute the code there, the CPU raises this fault, because those pages are normally not executable.
It's possible to get a SEGV_ACCERR because of a stack overflow. Specifically, this happened to me on Android ARM64 with the following:
VeryLargeStruct s;
s = {}; // SEGV_ACCERR
It seems that the zero-initialization created a temporary that caused a stack overflow. This only happened with -O0; presumably the temporary was optimized away at higher optimization levels.
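One way to sidestep this kind of overflow is to keep the large object off the stack altogether. The following is only a sketch: the struct layout is borrowed from the reproduction elsewhere in this thread, and make_zeroed_struct and reset are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <memory>

struct VeryLargeStruct {
    int array[4096 * 4096];  // 64 MiB when int is 4 bytes
};

// Allocates on the heap. make_unique value-initializes the object,
// so every element starts out as zero.
std::unique_ptr<VeryLargeStruct> make_zeroed_struct() {
    return std::make_unique<VeryLargeStruct>();
}

// Clears the struct in place without building a temporary on the stack.
void reset(VeryLargeStruct& s) {
    std::fill(std::begin(s.array), std::end(s.array), 0);
}
```

With this layout the only thing on the stack is the smart pointer, so the result should not depend on whether the optimizer removes a temporary.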
On Android arm64, if stack.cpp contains:
struct VeryLargeStruct {
int array[4096*4096];
};
int main() {
struct VeryLargeStruct s;
s = {};
}
and I run:
aarch64-linux-android26-clang++ -std=c++20 -g -DANDROID_STL=c++_static -static-libstdc++ stack.cpp -o stack
adb push stack /data/local/tmp
adb shell /data/local/tmp/stack
/data/tombstones/tombstone_01 contains a SEGV_MAPERR, not SEGV_ACCERR:
id: 11363, tid: 11363, name: stack >>> /data/local/tmp/stack <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x7ff2d02dd8
I do get SEGV_ACCERR when const.c contains:
int main() {
char *str="hello";
str[0]='H';
}
Then /data/tombstones/tombstone_00 contains:
pid: 9844, tid: 9844, name: consts >>> /data/local/tmp/consts <<<
Signal 11 (SIGSEGV), code 2 (SEGV_ACCERR), fault addr 0x55d10674e8
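For contrast, here is a sketch of the same modification done on a writable copy instead of on the literal itself. The function name capitalized_hello is made up for this example.

```cpp
#include <cassert>
#include <string>

std::string capitalized_hello() {
    char str[] = "hello";  // an array initialized from the literal: a writable copy
    str[0] = 'H';          // modifies the copy, not the read-only literal
    return std::string(str);
}
```

The literal itself typically lives in a read-only mapping, which is why writing through a pointer to it faults with SEGV_ACCERR, whereas the array is ordinary writable stack memory.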
I've been working on a webrtc datachannel library in C/C++ and wrote a program in C to:
Create two peers from the same process.
Establish a connection between them.
Close the connection if it's successful.
Everything runs fine in a Debian Docker container and on my openSUSE Tumbleweed host (both 64-bit x86_64), but in an Alpine Linux container (also 64-bit x86_64) I'm getting a segfault inside the child processes:
The function above is from the program's dependency, libnice. It looks as if *agent == NULL, yet nothing in the caller's scope sets it to null. I even inserted a printf("Argument is %p", agent); right before the function call; it prints the pointer's value, so I can verify it's not null. From the disassembly, the segfault appears to happen on the line that copies agent (0x557a1d20) into a local variable on the callee's stack. The segfault always occurs at this point, even after a make clean and recompilation. A failure in the activation record? Stack corruption?
UPDATE: I made a more lightweight container and ran it, and now it segfaults at a different place in that same priv_conn_keepalive_tick_unlocked. The argument seems to be set though (Notice the 0x7ffff7f9ad08):
Since I thought I might be hitting musl libc's default stack limit of 80k, I used getrlimit(RLIMIT_STACK, &rl) to obtain the stack size, and it looks like it's already 8 MB rather than 80k. Increasing this limit further does not seem to make any difference, except that if I assign more than 8 MB, my program crashes early inside gdb. gdb says it got an unknown signal "? ?"; outside gdb, it crashes at the same point as it does without the altered stack size.
I'm not sure what exactly the problem is (stack corruption?) and what to do next to resolve this.
Here's my program's flow:
For every peer that is created, a child process is created with a fork(). Parent <--> child communication is done by ZeroMQ and I use protocol buffers to forward any callbacks (and its arguments) that are triggered inside the child onto an event loop running in the parent process.
So for the above program, there are 2 child processes and 1 parent process.
Steps to reproduce:
Source file: https://github.com/hamon-in/librtcdcpp/blob/alpine-test/examples/websocket_client/2in1.c
Alpine docker container: https://github.com/hamon-in/librtcdcpp/blob/alpine-test/Dockerfile.amd64
Run the container and binary is located at /psl-librtcdcpp/examples/websocket_client/2in1
2in1 will spawn two child processes both of which will crash.
On further investigation, the crash is in an instruction writing at a mildly large negative offset from the stack base pointer, so it's probably just a simple stack overflow.
The right way to fix this is reducing the excess stack usage or explicitly requesting a large stack at pthread_create time, but I don't see where pthread_create is being called from. A quick check to verify that this is the problem would be to override the default stack size for new threads by performing the following somewhere early in the program:
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 1<<20); // 1 MB
pthread_setattr_default_np(&attr);
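The other option mentioned above, requesting a larger stack explicitly at pthread_create time, could look roughly like this. It is a sketch using only standard pthread calls; the worker function, the buffer size and the helper name run_with_stack are hypothetical and stand in for whatever the real thread does.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>

constexpr std::size_t kBufferBytes = 512 * 1024;  // hypothetical per-thread stack need

// Thread entry point: touches a large stack buffer, one byte per 4 KiB.
void* touch_large_buffer(void* result) {
    volatile char buffer[kBufferBytes];
    for (std::size_t i = 0; i < kBufferBytes; i += 4096) {
        buffer[i] = 1;
    }
    *static_cast<bool*>(result) = (buffer[0] == 1);
    return nullptr;
}

// Runs the worker on a thread whose stack size is requested explicitly.
bool run_with_stack(std::size_t stack_bytes) {
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) return false;

    bool touched = false;
    pthread_t thread;
    int rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0) rc = pthread_create(&thread, &attr, touch_large_buffer, &touched);
    pthread_attr_destroy(&attr);
    if (rc != 0) return false;

    pthread_join(thread, nullptr);
    return touched;
}
```

Calling run_with_stack(4u * 1024 * 1024) gives the worker a 4 MB stack regardless of the libc default, which matters on libcs whose default thread stack is small.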
Add -Werror=implicit-function-declaration to your CFLAGS and you'll immediately have the cause. The key clue is the pointer value 0x557a1d20, which is almost surely the result of truncating a pointer to 32 bits. This happens when you fail to declare a function that returns a pointer: the compiler (by an awful backwards default) assumes it returns int rather than producing an error, and then allows the implicit conversion from int to pointer even though the C language disallows it.
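The truncation itself is easy to picture with plain integers. In the sketch below, the 64-bit input value is invented for illustration (only its low half, 0x557a1d20, comes from the question), and sign extension of the intermediate int is ignored.

```cpp
#include <cassert>
#include <cstdint>

// Keeps only the low 32 bits, which is what survives when a 64-bit
// pointer value is squeezed through a 32-bit int return value.
std::uint64_t low_32_bits(std::uint64_t pointer_bits) {
    return static_cast<std::uint32_t>(pointer_bits);
}
```

For example, low_32_bits(0x00007f12557a1d20) yields 0x557a1d20: the upper half of the address is gone, so dereferencing the result faults.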
I have been having a peculiar problem. I developed a C++ program on a Linux cluster at work and tried to use it at home on an Ubuntu 14.04 machine. The program consists of six files: main.hpp and main.cpp, which depend on sarsa.hpp and sarsa.cpp (class Sarsa), which in turn depend on wec.hpp and wec.cpp. It compiles, but when I run it, it either fails with a segmentation fault or does not enter one fundamental function of the class Sarsa.
The main code calls the constructor and setter functions without problems:
Sarsa run;
run.setVectorSize(memory,3,tilings,1000);
etc.
However, it cannot run the public function episode , since learningRate, which should contain a large integer, returns 0 for all episodes (iterations).
learningRate[episode]=run.episode(numSteps,graph);}
I tried to debug the code with gdb, which has returned:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000408f4a in main () at main.cpp:152
152 learningRate[episode]=run.episode(numSteps,graph);}
I also tried valgrind, which returned:
==10321== Uninitialised value was created by a stack allocation
==10321== at 0x408CAD: main (main.cpp:112)
But no memory leakage issues.
I was wondering if there was a setting to debug the external file sarsa.cpp, since I think that class is likely to be the culprit.
In the file I use C++11 features (I would expect errors at compile time, though), so I even compiled with g++ -std=c++0x, but there was no improvement.
Unluckily, because of the size of the code, I cannot post it here. I would really appreciate any help with this problem. Am I missing anything obvious? Could you help me at least with the debugging?
Thank you in advance for the help.
Correction:
main.cpp:
Definition of the global array:
#define numEpisodes 10
int learningRate[numEpisodes];
Towards the end of the main function:
for (int episode; episode<numEpisodes; episode++) {
if (episode==(numEpisodes-1)) { // Save the simulation data only at the
graph=true;} // last episode
learningRate[episode]=run.episode(numSteps,graph);}
As the code you just added to the question reveals, the problem arises because you did not initialize the episode variable. The behavior of any code that uses its value before you assign one is undefined, so it is entirely reasonable that the program behaves differently in one environment than in another.
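A corrected version of the loop might look like the sketch below. The only essential change is the initialization of episode. The function episode_stub is a hypothetical stand-in for run.episode(), which the question does not show, and the array is made local so the example is self-contained.

```cpp
#include <array>
#include <cassert>

constexpr int numEpisodes = 10;

// Hypothetical stand-in for Sarsa::episode(); the real one is not shown.
int episode_stub(int numSteps, bool graph) {
    return graph ? -numSteps : numSteps;
}

std::array<int, numEpisodes> run_all_episodes(int numSteps) {
    std::array<int, numEpisodes> learningRate{};
    bool graph = false;
    for (int episode = 0; episode < numEpisodes; episode++) {  // episode now starts at 0
        if (episode == numEpisodes - 1) {
            graph = true;  // save the simulation data only at the last episode
        }
        learningRate[episode] = episode_stub(numSteps, graph);
    }
    return learningRate;
}
```

Compiling with warnings enabled (for example -Wall with GCC or Clang) will often flag the uninitialized variable in the original loop.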
A segmentation fault indicates an invalid memory access. Usually this means that somewhere, you're reading or writing past the end of an array, or through an invalid pointer, or through an object that has already been freed. You don't necessarily get the segmentation fault at the point where the bug occurs; for instance, you could write past the end of an array onto heap metadata, which causes a crash later on when you try to allocate or release an unrelated object. So it's perfectly reasonable for a program to appear to work on one system but crash on another.
In this case, I'd start by looking at learningRate[episode]. What is the value of episode? Is it within the bounds of learningRate?
I was wondering if there was a setting to debug the external file sarsa.cpp, since I think that class is likely to be the culprit.
It's possible to set breakpoints in files other than main.cpp.
break location
Set a breakpoint at the given location, which can specify a function name, a line number, or an address of an instruction.
At least, I think that's your question. You'll also need to know how to step into functions.
More importantly, you need to learn what your tools are trying to tell you. A segfault is the operating system's reaction to an attempt to access memory that doesn't belong to you. One common reason for that is trying to dereference NULL. Another would be trying to dereference a pointer that was never initialized. The Valgrind error message suggests that you may have an uninitialized variable.
Without the code, I can't tell you why the value isn't initialized when you run the program on your home system but is (apparently) initialized when you run it at work. I suspect that you don't have the necessary data on your home system, but you'll need to investigate and figure that out. The fundamental question to keep asking yourself is "what is different between my home computer and my work computer?"
TL;DR: How do I automatically add a watch in gdb when a function is called so I can debug some memory corruption?
I am currently dealing with some memory corruption in C++
I am mostly seeing 4-5 types of recurring crashes, all of which make little to no sense, so I'm guessing they are related to memory corruption.
These crashes only happen on the production server, roughly every 2-5 hours.
Most of them consist of accessing or passing a null pointer where one can't possibly have existed in the first place.
One of these places is a lambda capturing this. (see below)
Obviously looked at core dumps and even had gdb attached while it crashed
valgrind: I've spent hours staring at multiple instances of valgrind with no success.
Enabled gcc's stack protection (-fstack-protector-all)
I have tried looking over the code & the changes, but it has been impossible for me to find anything (100k lines of code total, "On master, 10,437 files have changed and there have been 3,352,600 additions and 85,495 deletions." since the last release on the production server). I might have just plain missed something, or not looked in the right spots - I cant tell.
Used cppcheck to see if there was something plain obvious wrong with the code
If there is an easier or more straightforward method of finding where the corruption occurs, feel free to suggest that too.
Let's look at some simplified code.
I have a class, Socket, which manages a client connection.
It is constructed something like this
Listener::OnAccept(fd){
Socket* s = new Socket();
if (s->Setup(fd)){
// push into a vector and do some other things
}
}
Socket::Setup calls (virtual) OnConnect of the Socket class, which then creates a ping event, using a lambda:
Socket::OnConnect(){
m_pingEvent = new Event([this](Event* e){
if (!this->GotPong()){
// close connection
}else{
this->Ping();
}
}, 30 /*seconds*/, true /* loop */);
}
Event accepts an std::function as the callback
m_pingEvent is deleted in the destructor (if set) which will cancel the event if it is running.
What happens (rarely) is that the lambda calls Ping on a nullptr, which calls m_pingPacket->Send() on this=0x1f8, which leads to a segfault.
My question, or rather my proposed solution, would be to watch the captured this pointer for writes, which definitely shouldn't happen.
There is only one small issue with that.
How would I even watch such a large number of pointers without manually adding each one? (There are about 400 concurrent connections, with a lot of connects and disconnects.)
As for the captured data I found this is in the __closure object:
(gdb) frame 2
#2 0x081b9d63 in operator() (e=0x9b2a748, __closure=0xb5a8318)
at net/socket/Client.cpp:151
151 net/socket/Client.cpp: No such file or directory.
(gdb) ptype __closure
type = const struct {
net::socket::Client * const __this;
} * const
Which I can get when creating the lambda easily by just moving the lambda to "auto callback = " which will be of type:
(gdb) info locals
callback = {__this = 0xb4dd0948}
(gdb) ptype callback
type = struct {
net::socket::Client * const __this;
}
(gdb) print callback
$1 = {__this = 0xb4dd0948}
(This is gcc version 4.7.2 (Debian 4.7.2-5) for reference, might be different with other compilers/versions)
Shortly before posting I realized the struct would probably change address once it is moved into the std::function (is this correct?)
I've been digging through the GNU "functional" header, but I haven't really been able to find anything yet. I'll keep looking (and updating this).
Another note: I am posting this full description with all of the details included in case anyone has an easier solution for me. (XY Problem)
Edit:
(gdb) print *(void**)m_pingEvent->m_callback._M_functor._M_unused._M_object
$8 = (void *) 0xb4dd56d8
(gdb) print this
$4 = (net::socket::Client * const) 0xb4dd56d8
Found it :)
Edit2:
break net/socket/Client.cpp:158
commands
silent
watch -l m_pingEvent->m_callback._M_functor._M_unused._M_object
continue
end
This has two flaws: you can only watch 4 addresses at a time, and there is no way to delete the watch once the object is freed.
So it's unusable.
Edit 3:
I've figured out how to do the watching using this python script I wrote (linking this one externally since it's quite long): https://gist.github.com/imermcmaps/4a6d8a1577118645acf3
Next issue is making sense of the output..
Added watch 7 -> 0x10eb2200
Hardware watchpoint 7: -location m_pingEvent->m_callback._M_functor._M_unused._M_obj
Old value = (void *) 0x10eba4b0
New value = (void *) 0x10eba400
net::Packet::Packet (this=0x10eb1088) at ../shared/net/Packet.cpp:13
It's saying the value changed from an old value that shouldn't even have been the original value, since I check that the this pointer and the watched pointer value match, and they do.
Edit 4 (yay):
Turns out watch -l doesn't work the way I want it to.
Manually grabbing the address and then watching that address seems to work.
How do I automatically add a watch in gdb when a function is called so
I can debug some memory corruption?
Memory corruption is often detected only after the real corruption has already occurred, possibly in some other module loaded into your process. Manual debugging may therefore not be very useful for complex projects, because any third-party module or library loaded into your process may also cause this problem. From your post it looks like the problem is not always reproducible, which suggests a threading/synchronization problem that leads to memory corruption. Based on my experience, I strongly suggest you concentrate on reproducing the problem under dynamic analysis tools (Valgrind/Helgrind).
You mentioned in your question that you are able to run your program under Valgrind. If you have not already done so, you may want to run your program (a.out) like this:
$ valgrind --tool=memcheck --db-attach=yes ./a.out
This way Valgrind attaches your program to the debugger as soon as the first memory error is detected, so that you can do live debugging with GDB. This seems to be the best way to find the root cause of your problem. (Newer Valgrind releases dropped --db-attach in favor of the built-in gdbserver, enabled with --vgdb-error.)
However, I think there may be a data race that is leading to the memory corruption, so you may want to use Helgrind to find data races and other threading problems.
For more information on these, you may refer to the following posts:
https://stackoverflow.com/a/22658693/2724703
https://stackoverflow.com/a/22617989/2724703
What are the scenarios where a process gets a SIGABRT in C++? Does this signal always come from within the process or can this signal be sent from one process to another?
Is there a way to identify which process is sending this signal?
abort() sends the calling process the SIGABRT signal, this is how abort() basically works.
abort() is usually called by library functions which detect an internal error or some seriously broken constraint. For example malloc() will call abort() if its internal structures are damaged by a heap overflow.
SIGABRT is commonly used by libc and other libraries to abort the program in case of critical errors. For example, glibc sends a SIGABRT when it detects a double free or other heap corruption.
Also, most assert implementations make use of SIGABRT in case of a failed assert.
Furthermore, SIGABRT can be sent from any other process like any other signal. Of course, the sending process needs to run as same user or root.
You can send any signal to any process using the kill(2) interface:
kill -SIGABRT 30823
30823 was a dash process I started, so I could easily find the process I wanted to kill.
$ /bin/dash
$ Aborted
The Aborted output is apparently how dash reports a SIGABRT.
It can be sent directly to any process using kill(2), or a process can send the signal to itself via assert(3), abort(3), or raise(3).
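On the question of identifying the sender: a handler installed with SA_SIGINFO receives a siginfo_t whose si_pid field holds the sending process ID for signals sent with kill(2). The sketch below assumes a POSIX system and a single-threaded program. The names record_sender and sender_of_self_sent_abort are made up, and the process signals itself only to keep the example self-contained.

```cpp
#include <cassert>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

static volatile sig_atomic_t g_sender_pid = 0;

// Handler: remember who sent the signal (meaningful for kill(2)-style signals).
static void record_sender(int, siginfo_t* info, void*) {
    g_sender_pid = info->si_pid;
}

// Installs the handler, sends SIGABRT to this process, and returns the recorded sender.
pid_t sender_of_self_sent_abort() {
    struct sigaction action {};
    struct sigaction previous {};
    action.sa_sigaction = record_sender;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    sigaction(SIGABRT, &action, &previous);

    g_sender_pid = 0;
    kill(getpid(), SIGABRT);  // stands in for another process running kill -SIGABRT <pid>

    sigaction(SIGABRT, &previous, nullptr);
    return static_cast<pid_t>(g_sender_pid);
}
```

Here the recorded sender equals getpid(). If a different process sent the signal, si_pid would hold that process's ID instead. Note that a handler that returns does not stop abort() itself from terminating the program; it only survives here because the signal came from kill(2).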
It usually happens when there is a problem with memory allocation.
It happened to me when my program was trying to allocate an
array with negative size.
There's another simple cause in the case of C++. The std::thread destructor behaves roughly like this:
std::thread::~thread() {
if (joinable())
std::terminate();
}
i.e. the thread object went out of scope but you forgot to call either
thread::join();
or
thread::detach();
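A minimal sketch of the safe pattern, with a made-up function name: the thread is joined before the std::thread object goes out of scope, so its destructor finds it no longer joinable.

```cpp
#include <cassert>
#include <thread>

int compute_on_worker() {
    int result = 0;
    std::thread worker([&result] { result = 42; });
    worker.join();  // after join() the thread is no longer joinable
    return result;
}  // ~thread runs here without calling std::terminate()
```

Removing the join() call makes the destructor call std::terminate(), which by default ends in abort() and therefore SIGABRT.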
The GNU libc will print out information to /dev/tty regarding some fatal conditions before it calls abort() (which then triggers SIGABRT), but if you are running your program as a service or otherwise not in a real terminal window, these message can get lost, because there is no tty to display the messages.
See my post on redirecting libc to write to stderr instead of /dev/tty:
Catching libc error messages, redirecting from /dev/tty
A case where a process gets SIGABRT from itself:
Hrvoje mentioned a buried pure virtual function being called from a constructor and causing an abort; I recreated an example of this.
Here, when d is constructed, it first calls the constructor of its base class A
and passes it a pointer to itself.
The A constructor calls the pure virtual method before the vtable has been filled with a valid pointer,
because d is not constructed yet.
#include<iostream>
using namespace std;
class A {
public:
A(A *pa){pa->f();}
virtual void f()=0;
};
class D : public A {
public:
D():A(this){}
virtual void f() {cout<<"D::f\n";}
};
int main(){
D d;
A *pa = &d;
pa->f();
return 0;
}
compile: g++ -o aa aa.cpp
ulimit -c unlimited
run: ./aa
pure virtual method called
terminate called without an active exception
Aborted (core dumped)
Now let's quickly look at the core file and confirm that SIGABRT was indeed raised:
gdb aa core
see regs:
i r
rdx 0x6 6
rsi 0x69a 1690
rdi 0x69a 1690
rip 0x7feae3170c37
check code:
disas 0x7feae3170c37
mov $0xea,%eax = 234 <- this is the tgkill syscall, which sends a signal to a thread
syscall <-----
http://blog.rchapman.org/posts/Linux_System_Call_Table_for_x86_64/
234 sys_tgkill pid_t tgid pid_t pid int sig = 6 = SIGABRT
:)
In my case, it was caused by writing input into an array at an index equal to the array's length.
string x[5];
for(int i=1; i<=5; i++){
cin>>x[i];
}
x[5] is accessed, but it does not exist: the valid indices are 0 through 4.
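A corrected loop, sketched with a hypothetical helper read_five that takes any input stream, indexes from 0 and stops before the array's size:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

std::array<std::string, 5> read_five(std::istream& in) {
    std::array<std::string, 5> x;
    for (std::size_t i = 0; i < x.size(); i++) {  // indices 0..4 only
        in >> x[i];
    }
    return x;
}
```

Passing std::cin reproduces the original program's input handling. A std::istringstream works the same way and is convenient for trying it out.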
I will give my answer from a competitive programming(cp) perspective, but it applies to other domains as well.
In cp, constraints are often quite large.
For example: I had a question with variables N, M, Q such that 1 ≤ N, M, Q < 10^5.
The mistake I made was declaring a 2D integer array of size 10000 x 10000 in C++, and I struggled with the SIGABRT error at Codechef for almost 2 days.
Now, if we calculate :
Typical size of an integer : 4 bytes
No. of cells in our array : 10000 x 10000
Total size (in bytes) : 400000000 bytes = 4*10^8 ≈ 400 MB
Your solutions to such questions may work on your PC (not always), as it can afford this size.
But the resources at coding sites (online judges) are much more tightly limited.
Hence, the SIGABRT error and other such errors.
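The arithmetic above can be checked at compile time. This is only a sketch: array_bytes is a made-up helper, and the 4-byte element size is the "typical" integer size assumed in the calculation.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a rows x cols array of elements of the given size.
constexpr std::uint64_t array_bytes(std::uint64_t rows, std::uint64_t cols,
                                    std::uint64_t element_size) {
    return rows * cols * element_size;
}

// 10000 x 10000 four-byte integers: 4 * 10^8 bytes, roughly 400 MB.
static_assert(array_bytes(10000, 10000, 4) == 400000000ULL,
              "matches the figure computed above");
```

Running the same calculation against the judge's stated memory limit before choosing a data structure avoids finding out at submission time.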
Conclusion:
In such questions, we should not declare an array, vector or any other data structure (DS) of this size; the task is to make the algorithm efficient enough that it works without such a DS or with less memory.
PS : There might be other reasons for this error; above was one of them.
As @sarnold aptly pointed out, any process can send a signal to any other process (permissions allowing). So one process can send SIGABRT to another, and the receiving process by default has no indication of whether the signal came from its own misbehavior, such as memory corruption, or was sent deliberately by someone else.
In one of the systems I worked on, there is a deadlock detector that checks whether a process is making progress by monitoring a heartbeat. If the heartbeat stops, it declares the process deadlocked and sends SIGABRT to it.
I just wanted to share this perspective with reference to the question asked.
Regarding the first question: What are the scenarios where a process gets a SIGABRT in C++?
I can think of two special cases where a C++ program is aborted automatically -- not by directly calling std::abort() or std::terminate():
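One: An exception is thrown while another exception is being handled, and nothing catches it. (Throwing from inside a catch block is legal by itself; the snippet below aborts only if the second exception is never caught. A destructor that lets an exception escape during stack unwinding calls std::terminate() unconditionally.)
try {
throw "abc";
}
catch (...) {
throw "def"; // aborts if nothing further up catches this
}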
Two: An uncaught exception that attempts to propagate outside main().
int main(int argc, char** argv)
{
throw "abc"; // abort here
}
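For comparison, a sketch of the same throw with a handler in place, so that nothing escapes and std::terminate() is never reached. The names do_work and run_guarded are hypothetical, and std::runtime_error replaces the string literal so the handler can read a message.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the real work, which throws.
void do_work() {
    throw std::runtime_error("abc");
}

// Returns the message of the caught exception, or an empty string if none was thrown.
std::string run_guarded() {
    try {
        do_work();
    } catch (const std::exception& e) {
        return e.what();  // handled here, so the program keeps running
    }
    return "";
}
```

Wrapping the body of main() in a try block like this is a common way to turn an unexplained SIGABRT into a readable error message.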
C++ experts could probably name more special cases.
There is also a lot of good info on these reference pages:
https://en.cppreference.com/w/cpp/utility/program/abort
https://en.cppreference.com/w/cpp/error/terminate
For Android native code, here are some reasons abort is called according to https://source.android.com/devices/tech/debug/native-crash :
Aborts are interesting because they are deliberate. There are many different ways to abort (including calling abort(3), failing an assert(3), using one of the Android-specific fatal logging types), but all involve calling abort.
The error munmap_chunk invalid pointer also causes a SIGABRT and in my case it was very hard to debug as I was not using pointers at all. It turned out that it was related to std::sort().
std::sort() requires a compare function that creates a strict weak ordering! That means both comparator(a, b) and comparator(b, a) must return false when a==b holds. (see https://en.cppreference.com/w/cpp/named_req/Compare) In my case I defined the operator< in my struct like below:
bool operator<(const MyStruct& o) const {
return value <= o.value; // Note the equality sign
}
and this was causing the SIGABRT because the function does not create a strict weak ordering. Removing the = solved the problem.
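The corrected comparison, sketched in a self-contained form. MyStruct is reduced to a single field here, and sorted_values is a made-up helper for demonstration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct MyStruct {
    int value;
    bool operator<(const MyStruct& o) const {
        return value < o.value;  // strict: a < a is false
    }
};

// Sorts a copy of the items and returns their values in order.
std::vector<int> sorted_values(std::vector<MyStruct> items) {
    std::sort(items.begin(), items.end());
    std::vector<int> values;
    for (const MyStruct& item : items) {
        values.push_back(item.value);
    }
    return values;
}
```

With the strict comparison, duplicate values are handled correctly, because neither of two equal elements compares less than the other.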
Platform : Win32
Language : C++
I get an error if I leave the program running for a while (~10 min).
Unhandled exception at 0x10003fe2 in ImportTest.exe: 0xC0000005: Access violation reading location 0x003b1000.
I think it could be a memory leak but I don't know how to find that out.
I'm also unable to free() memory, because it always causes the following (maybe I shouldn't be using free() on variables):
Unhandled exception at 0x76e81f70 in ImportTest.exe: 0xC0000005: Access violation reading location 0x0fffffff.
At that stage the program isn't doing anything; it is just waiting for user input.
dllHandle = LoadLibrary(L"miniFMOD.dll");
playSongPtr = (playSongT)GetProcAddress(dllHandle,"SongPlay");
loadSongPtr = (loadSongT)GetProcAddress(dllHandle,"SongLoadFromFile");
int songHandle = loadSongPtr("FILE_PATH");
// ... {just output , couldn't cause errors}
playSongPtr(songHandle);
getch(); // that is where it causes an error if i leave it running for a while
Edit 2:
playSongPtr(); causes the problem, but I don't know how to fix it.
I think it's pretty clear that your program has a bug. If you don't know where to start looking, a useful technique is "divide and conquer".
Start with your program in a state where you can cause the exception to happen. Eliminate half your code, and try again. If the exception still happens, then you've got half as much code to look through. If the exception doesn't happen, then it might have been related to the code you just removed.
Repeat the above until you isolate the problem.
Update: You say "at that stage the program isn't doing anything" but clearly it is doing something (otherwise it wouldn't crash). Is your program a console mode program? If so, what function are you using to wait for user input? If not, then is it a GUI mode program? Have you opened a dialog box and are waiting for something to happen? Have you got any Windows timers running? Any threads?
Update 2: In light of the small snippet of code you posted, I'm pretty sure that if you try to remove the call to the playSongPtr(songHandle) function, then your problem is likely to go away. You will have to investigate what the requirements are for "miniFMOD.dll". For example, that DLL might assume that it's running in a GUI environment instead of a console program, and may do things that don't necessarily work in console mode. Also, in order to do anything in the background (including playing a song), that DLL probably needs to create a thread to periodically load the next bit of the song and queue it in the play buffer. You can check the number of threads being created by your program in Task Manager (or better, Process Explorer). If it's more than one, then there are other things going on that you aren't directly controlling.
The error tells you that memory is accessed which you have not allocated at the moment. It could be a pointer error like dereferencing NULL. Another possibility is that you use memory after you freed it.
The first step would be to check your code for NULL checks, i.e. make sure you have a valid pointer before you use it, and to check the lifecycle of all allocated and freed resources. Writing NULL over pointers you just freed might help find the problem spot.
I doubt this particular problem is a memory leak; the problem is dereferencing a pointer that does not point to something useful. To check for a memory leak, watch your process in your operating system's process list tool (task manager, ps, whatever) and see if the "used memory" value keeps growing.
On calling free: You should call free() once and only once on the non-null values returned from malloc(), calloc() or strdup(). Calling free() less than once will lead to a memory leak. Calling free() more than once will lead to memory corruption.
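The "free once, then overwrite with NULL" advice can be wrapped in a small helper. This is a sketch, and free_and_null is a made-up name: it frees the block and clears the caller's pointer, so an accidental second call becomes a harmless free(nullptr) and a later dereference fails at an obvious address.

```cpp
#include <cassert>
#include <cstdlib>

// Frees memory obtained from malloc/calloc/strdup and clears the caller's pointer.
template <typename T>
void free_and_null(T*& pointer) {
    std::free(pointer);  // free(nullptr) is a no-op
    pointer = nullptr;
}
```

This only protects the one pointer variable passed in; other copies of the same address still dangle after the call.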
You should get a stack trace to see what is going on when the process crashes. Based on my reading of the addresses involved you probably have a stack overflow or have an incorrect pointer calculation using a stack address (in C/C++ terms: an "auto" variable.) A stack trace will tell you how you got to the point where it crashed.