I'm writing a server daemon for Linux. I want to know what the convention is in the UNIX/Linux community for what a daemon should do when it encounters a fatal error (e.g. a server failing to listen, a segmentation fault, etc.). I already log everything to the system log, but I want to know what to do once a fatal error occurs. Should I log it and keep running in an infinite, do-nothing loop? Should I log it and exit? What is the standard thing to do here, and how do I do it?
The daemon is written in C++, and I'm using a custom exception system to wrap POSIX error codes, so I'll know when things are fatal.
There are degrees of 'fatal error'.
A server failing to listen is possibly a temporary issue; your daemon should probably keep trying to establish the listening socket, retrying periodically and backing off slowly (1 second, 2 seconds, 4 seconds, and so on), as sketched below.
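A minimal sketch of such a retry loop, assuming an IPv4 TCP listener (the function name and the 64-second cap are my own choices, not anything standard):

```cpp
// Sketch only: keep trying to establish the listening socket, doubling the
// delay after each failure (1 s, 2 s, 4 s, ... capped at 64 s).
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <syslog.h>
#include <cstring>
#include <cerrno>

int listen_with_backoff(unsigned short port) {
    unsigned delay = 1;
    for (;;) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd >= 0) {
            sockaddr_in addr{};
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            addr.sin_port = htons(port);
            if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) == 0
                && listen(fd, SOMAXCONN) == 0)
                return fd;                           // success: hand back the socket
            close(fd);
        }
        syslog(LOG_WARNING, "listen failed (%s), retrying in %u s",
               strerror(errno), delay);
        sleep(delay);
        if (delay < 64) delay *= 2;                  // exponential backoff
    }
}
```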
If you catch a segfault, maybe the best thing is for the daemon to restart itself by re-executing its own binary. The fault might recur, of course.
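A rough sketch of the re-exec idea, assuming Linux (/proc/self/exe) and argv saved at startup. Note that almost nothing is safe to do inside a SIGSEGV handler, but execve() is on the async-signal-safe list:

```cpp
// Sketch: re-execute the daemon from the SIGSEGV handler as a last resort.
// The process state may already be corrupt, and if the fault recurs at
// startup this will re-exec in a loop, so a real daemon should count restarts.
#include <csignal>
#include <unistd.h>

extern char** environ;
static char** saved_argv;                            // set in main()

extern "C" void on_segv(int) {
    execve("/proc/self/exe", saved_argv, environ);   // async-signal-safe
    // If the exec itself failed, fall back to the default action and die.
    signal(SIGSEGV, SIG_DFL);
    raise(SIGSEGV);
}

int main(int argc, char** argv) {
    (void)argc;
    saved_argv = argv;
    signal(SIGSEGV, on_segv);
    // ... daemon main loop ...
}
```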
You shouldn't go into an infinite do-nothing loop; you should terminate rather than do that. If your loop isn't infinite but could be broken by a signal or something, maybe do-nothing is OK; I recommend the pause() system call as a way to do nothing without consuming CPU time.
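For example, a signal-breakable wait built on pause() might look like this (a purist would block the signal and use sigsuspend() to close the race between the flag check and the pause() call):

```cpp
#include <csignal>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;
extern "C" void on_signal(int) { got_signal = 1; }

int main() {
    signal(SIGHUP, on_signal);
    while (!got_signal)
        pause();                 // sleep until a signal arrives; no CPU consumed
    // ... act on the signal: reload configuration, exit cleanly, ...
}
```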
You should certainly log what you're doing and why before you exit.
I am currently working on a C++ daemon program that listens on a port for incoming requests.
I would like to catch all the errors related to the program. To that end I implemented a logger in my program and caught some of the expected errors, but other errors remain uncatchable with those methods, for example a segfault, or the program being stopped because of a memory shortage.
I had the idea of using 'dmesg', which contains the logs of different processes, and taking what I need from there. The problem with this approach is that the dmesg logs don't contain much human-readable information; more than that, the entries are not dated. So I ran my program under 'gdb': now the logs are more elaborate and contain better information, but I can't capture gdb's output.
My questions are:
Is my approach to this problem correct? If yes, how can I continue from where I am now?
Is there a better way to listen for these errors?
I will need something similar in a C program; do you have a suggestion?
EDIT
After some research I think I will use another daemon to check every 5 minutes or so whether my daemon is still running, and to re-launch it if it's down. With this settled, I now need to record the error; this is where I am stuck.
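One refinement to that plan: if the watchdog is the parent of your daemon, it doesn't need to poll every 5 minutes at all. It can block in waitpid() and find out immediately when, and why, the child died, which also gives you the error to record. A rough sketch, with /usr/local/bin/mydaemon as a stand-in path:

```cpp
// Sketch of a watchdog: fork/exec the monitored daemon, block until it
// exits, log the reason, and relaunch it after a short delay.
#include <sys/wait.h>
#include <unistd.h>
#include <syslog.h>

int main() {
    const char* DAEMON_PATH = "/usr/local/bin/mydaemon"; // hypothetical path
    openlog("watchdog", LOG_PID, LOG_DAEMON);
    for (;;) {
        pid_t pid = fork();
        if (pid == 0) {
            execl(DAEMON_PATH, DAEMON_PATH, (char*)nullptr);
            _exit(127);                       // exec failed
        }
        int status = 0;
        waitpid(pid, &status, 0);             // blocks until the daemon dies
        if (WIFSIGNALED(status))
            syslog(LOG_ERR, "daemon killed by signal %d, restarting",
                   WTERMSIG(status));
        else
            syslog(LOG_ERR, "daemon exited with status %d, restarting",
                   WEXITSTATUS(status));
        sleep(5);                             // avoid a tight restart loop
    }
}
```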
C++ code of around 5k lines hangs randomly on Linux. My code deals with transmitting and receiving packets through a raw socket. The code just stops at a random point without any response; not even Ctrl+C helps, so every time it hangs I have to kill the process.
I tried gdb and the result was the same: it hung, and Ctrl+C produced a SIGTERM error message.
Under valgrind the code hung the same way.
How do I debug this issue? Is it some kind of system error?
Using the strace command, it was clear that the hang was due to a futex_wait_private call: the socket read had ended up in a deadlock. Increasing the select() timeout value resolved the issue.
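For reference, a guarded read might look like this sketch (wait_readable() is my own helper name, not from the original code):

```cpp
// Sketch: guard a blocking socket read with select() and a timeout so a
// stalled peer cannot hang the whole program.
#include <sys/select.h>
#include <cerrno>

// Returns 1 if fd is readable, 0 on timeout, -1 on error.
int wait_readable(int fd, int timeout_sec) {
    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        timeval tv{timeout_sec, 0};
        int rc = select(fd + 1, &rfds, nullptr, nullptr, &tv);
        if (rc >= 0 || errno != EINTR)
            return rc;
        // Interrupted by a signal: rebuild the set and the timeout, try again.
    }
}
```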
What is the technique to log the segmentation faults and run time errors which crash the program, through a remote logging library?
The language is C++.
Here is a solution for printing a backtrace when you get a segfault, as an example of what you can do when such an error happens.
That leaves you with the problem of logging the error through the remote library. I would suggest keeping the signal handler as simple as possible and logging to a local file, because you cannot assume that a previously initialized logging library still works correctly once a segmentation fault has occurred.
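A sketch of such a minimal handler, using glibc's backtrace facilities and only async-signal-safe I/O (note that backtrace() can allocate on its first call, so it is invoked once at startup to preload libgcc):

```cpp
#include <execinfo.h>
#include <csignal>
#include <fcntl.h>
#include <unistd.h>

static int crash_log_fd = -1;                        // opened once at startup

extern "C" void on_segv(int sig) {
    void* frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, crash_log_fd);   // no malloc, plain write()
    signal(sig, SIG_DFL);
    raise(sig);                                      // re-raise: default action, core dump
}

int main() {
    crash_log_fd = open("crash.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    void* warm[1];
    backtrace(warm, 1);                              // preload libgcc outside the handler
    signal(SIGSEGV, on_segv);
    // ... rest of the program ...
}
```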
What is the technique to log the segmentation faults and run time errors which crash the program, through a remote logging library?
In my experience, trying to log debugging messages (remotely or to a file) while the program is crashing is not very reliable, especially if the app takes the system down along with it:
With a TCP connection you might lose the last several messages while the system is crashing. (TCP maintains packet order and retransmits lost data, AFAIK, but if the app just quits, buffered data can be lost before it is transmitted.)
With a UDP connection you might lose messages entirely because of the nature of UDP, or receive them out of order.
If you're writing to a file, the OS might discard the most recent changes (buffers not flushed, a journaled filesystem reverting to an earlier state of the file).
Flushing buffers after every write or sending messages via TCP/UDP might induce performance penalties for a program that produces thousands of messages per second.
So as far as I know, a good approach is to maintain an in-memory plaintext log and write a core dump once the program has crashed. That way you'll be able to find the contents of the log inside the core dump. Writing to an in-memory log is also significantly faster than writing to a file or sending messages over the network. Alternatively, you could use some kind of "dual logging": write every debug message immediately into the in-memory log, then send it asynchronously (from another thread) to a log file or over the network.
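A sketch of the in-memory idea (RingLog and the RINGLOG marker are my own invention; the marker makes the buffer easy to locate in the dump, e.g. with strings core | grep RINGLOG, or by printing g_log in gdb):

```cpp
#include <cstdarg>
#include <cstdio>
#include <cstddef>

struct RingLog {
    char marker[8];                                  // recognizable tag in the dump
    char buf[64 * 1024];
    std::size_t pos = 0;
    void write(const char* fmt, ...) {
        char line[256];
        va_list ap;
        va_start(ap, fmt);
        int n = std::vsnprintf(line, sizeof line, fmt, ap);
        va_end(ap);
        if (n < 0) return;
        if (n > (int)sizeof line) n = (int)sizeof line;
        for (int i = 0; i < n; ++i) {                // wrap around when full
            buf[pos] = line[i];
            pos = (pos + 1) % sizeof buf;
        }
    }
};

static RingLog g_log = { "RINGLOG" };                // static, so it lands in the core dump

// Usage: g_log.write("accepted connection fd=%d\n", fd);
```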
Handling of exceptions:
Platform-specific. On Windows you can use _set_se_translator to translate structured (SEH) exceptions into C++ exceptions, or SetUnhandledExceptionFilter to generate a backtrace or dump when nothing else handles the fault.
On Linux I think you should be able to install a handler for the SIGSEGV signal.
While catching the segfault sounds like a decent idea, instead of trying to handle it from within the program it makes more sense to generate a core dump and bail. On Windows you can call MiniDumpWriteDump from within the program, and on Linux the system can be configured to produce core dumps from the shell (ulimit -c, I think?).
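Yes, ulimit -c unlimited in the shell that starts the program. You can also raise the limit from inside the program itself, which is handy for a daemon not started from a shell; a sketch (where the core file's location still depends on /proc/sys/kernel/core_pattern):

```cpp
#include <sys/resource.h>

void enable_core_dumps() {
    rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) == 0) {
        rl.rlim_cur = rl.rlim_max;                   // raise soft limit to the hard cap
        setrlimit(RLIMIT_CORE, &rl);
    }
}
```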
I'd like to give some solutions:
Use core dumps: start a daemon that monitors for new core dumps, collects them, and sends them to your host.
Use GDB (with gdbserver): you can debug remotely and see the backtrace after a crash.
To catch the segfault signal and send a log accordingly, read this post:
Is there a point to trapping "segfault"?
If it turns out that you won't be able to send the log from a signal handler (maybe the crash occurred before the logger was initialized), then you may need to write the info to a file and have an external entity send it remotely.
EDIT: Putting back some original info to be able to send the core file remotely too
To be able to send the core file remotely, you'll need an external entity (a different process than the one that crashed) that will "wait" for core files and send them remotely as they appear (possibly using scp). Additionally, the crashing process could catch the segfault signal and notify the monitoring process that a crash has occurred and a core file will be available soon.
Is it possible to have a program restart automatically if it crashes?
Something like:
An unhandled exception is thrown.
Release all resources allocated by process.
Start over and call main.
I would like this behavior for a server application I'm working on. If clients misuse the server it can get a std::bad_alloc exception, in which case I would like the server to simply restart instead of crashing and shutting down, thus avoiding a manual restart.
I've done this before on Windows by running said program from another program via a Win32 CreateProcess call. The other program then waits for the "monitored" process to exit, and calls CreateProcess() again if it does. You wait for a process to exit by performing a WaitForSingleObject on the process handle, which you get back from your CreateProcess() call.
You will of course want to program in some way to make the monitoring process shut itself and its child process down.
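A bare-bones sketch of that monitor (server.exe is a placeholder, and the shutdown logic mentioned above is left out):

```cpp
// Sketch: launch the server, block until it exits, and relaunch it.
#include <windows.h>

int main() {
    for (;;) {
        STARTUPINFOW si = { sizeof si };
        PROCESS_INFORMATION pi = {};
        wchar_t cmd[] = L"server.exe";       // CreateProcessW may modify this buffer
        if (!CreateProcessW(nullptr, cmd, nullptr, nullptr, FALSE,
                            0, nullptr, nullptr, &si, &pi))
            return 1;                        // could not start the child at all
        WaitForSingleObject(pi.hProcess, INFINITE);  // block until it exits
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
        Sleep(1000);                         // brief pause before restarting
    }
}
```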
Let Windows be your watchdog. You can call ChangeServiceConfig2 to set the failure actions for your service. (If your server isn't a service, then you're doing it wrong.) Specify SERVICE_CONFIG_FAILURE_ACTIONS for the dwInfoLevel parameter, and in the SERVICE_FAILURE_ACTIONS structure, set lpsaActions to an array of one or more SC_ACTION values. The type you want is SC_ACTION_RESTART.
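A sketch of what that configuration call could look like, assuming a service named MyServer (the delays and retry counts are arbitrary choices):

```cpp
#include <windows.h>

bool set_restart_on_failure() {
    SC_HANDLE scm = OpenSCManagerW(nullptr, nullptr, SC_MANAGER_ALL_ACCESS);
    if (!scm) return false;
    SC_HANDLE svc = OpenServiceW(scm, L"MyServer", SERVICE_ALL_ACCESS);
    bool ok = false;
    if (svc) {
        SC_ACTION actions[] = {
            { SC_ACTION_RESTART, 5000 },     // restart 5 s after 1st failure
            { SC_ACTION_RESTART, 5000 },     // restart 5 s after 2nd failure
            { SC_ACTION_NONE,    0    },     // then give up
        };
        SERVICE_FAILURE_ACTIONSW sfa = {};
        sfa.dwResetPeriod = 60 * 60;         // reset the failure count after 1 h
        sfa.cActions = 3;
        sfa.lpsaActions = actions;
        ok = ChangeServiceConfig2W(svc, SERVICE_CONFIG_FAILURE_ACTIONS, &sfa) != FALSE;
        CloseServiceHandle(svc);
    }
    CloseServiceHandle(scm);
    return ok;
}
```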
I did something similar by implementing a watchdog. The watchdog ran as a service and would wait for a ping (called petting the dog) from the monitored process. If the monitored process died due to an exception, watchdog would cleanup and relaunch the application.
In case the application was not responding (no ping within a certain time), the watchdog would kill it and then restart it.
Here is a link to an implementation that you might want to use:
http://www.codeproject.com/KB/security/WatchDog.aspx
(PS: I implemented my own version but I cannot post it here. I found this from a quick google search and have no first hand experience with this particular implementation.)
If you just catch the exception, it should be possible to restart your server through internal program logic, without restarting the whole process.
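For example, something along these lines, where run_server() and reset_state() are placeholders for your own logic:

```cpp
// Sketch: instead of letting std::bad_alloc terminate the process, catch it
// at the top of the main loop, reset the server's state, and keep serving.
#include <new>
#include <iostream>

void reset_state() { /* free caches, drop half-open connections, ... */ }
void run_server()  { /* accept/dispatch loop; may throw std::bad_alloc */ }

int main() {
    for (;;) {
        try {
            run_server();                    // normal operation
        } catch (const std::bad_alloc&) {
            std::cerr << "out of memory, resetting server state\n";
            reset_state();                   // recover instead of crashing
        }
    }
}
```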
Like @T.E.D., we've done this in an application we built. Our application is a Windows service, so the helper program stops the service (and eventually kills it, if it hangs) and then starts the service again.
I am trying to write two server/client programs under Linux that communicate through named pipes. The problem is that sometimes when I try to write from the server into a pipe that doesn't exist anymore (the client has stopped), I get a "Resource temporarily unavailable" error and the server stops completely.
I understand that this is caused by using the O_NONBLOCK flag when opening the FIFO, which marks the point where the program would otherwise block until it could write to the file again. But is there a way to stop this behavior and not halt the entire program when a problem occurs? Shouldn't the write call return -1 and the program continue normally?
Another strange thing is that this error only occurs when running the programs outside the IDE (Eclipse). If I run both programs inside Eclipse, then on error the write function just returns -1 and the program continues normally.
If you want write() to return -1 on error (and set errno to EPIPE) instead of killing your server outright when the read end of your pipe is no longer connected, you must ignore the SIGPIPE signal with signal(SIGPIPE, SIG_IGN).
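A minimal sketch of that fix (/tmp/myfifo is a placeholder path):

```cpp
// Sketch: ignore SIGPIPE once at startup, then treat EPIPE from write() as
// an ordinary "the reader went away" condition instead of dying.
#include <csignal>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    signal(SIGPIPE, SIG_IGN);                        // broken pipe no longer kills us
    int fd = open("/tmp/myfifo", O_WRONLY | O_NONBLOCK);
    if (fd == -1) { perror("open"); return 1; }
    const char msg[] = "hello\n";
    if (write(fd, msg, sizeof msg - 1) == -1) {
        if (errno == EPIPE)
            fprintf(stderr, "reader went away, dropping this client\n");
        else
            perror("write");
    }
    close(fd);
    return 0;
}
```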
The difference in behaviour you describe is strange; you could have a memory problem somewhere, or you missed a test. (Or Eclipse does something special with signal handling?)
To quote the section 2 man page for write:
"[errno=]EPIPE An attempt is made to write to a pipe or a FIFO that is not open for reading by any process, or that has only one end open (or to a file descriptor created by socket(3SOCKET), using type SOCK_STREAM that is no longer connected to a peer endpoint). A SIGPIPE signal will also be sent to the thread. The process dies unless special provisions were taken to catch or ignore the signal." [Emphasis mine].
As Platypus said, you'll need to ignore the SIGPIPE signal:
signal(SIGPIPE, SIG_IGN). You could also catch the signal and handle the pipe disconnection in a different way in your server.
Maybe you can just wrap it in a try...catch statement?