Catch all errors and exceptions related to my program - C++

I am currently working on a C++ daemon program that listens on a port for incoming requests.
I would like to catch all the errors related to the program. For that I implemented a logger in my program and caught some of the likely ones, but other errors remain uncatchable with those methods, for example a segfault, or the program being stopped because of a memory shortage.
I had the idea of using 'dmesg', which contains logs from different processes, and taking what I need from there. The problem with this approach is that the logs from 'dmesg' don't contain human-readable information; moreover, the logs are not dated. So I ran my program under 'gdb'. Now the logs are more detailed and contain better information, but I can't capture the messages from 'gdb'.
My questions are:
Is my approach to this problem correct? If yes, how can I continue from where I am now?
Is there a better way to listen for errors than this?
I will need something similar in a C program. Do you have a suggestion?
EDIT
After some research, I think I will use another daemon to check every 5 minutes or so whether my other daemon is running, in order to relaunch it if it's down. With this settled, I now need to record the error. This is where I am stuck.
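One way to record why the daemon went down, rather than only noticing that it is gone, is to make the watchdog the daemon's parent, so it can read the termination status with waitpid(). The following is a minimal sketch assuming a POSIX system; describe_wait_status and run_once are hypothetical helper names, not part of any library.

```cpp
#include <string.h>     // strsignal
#include <string>
#include <sys/types.h>
#include <sys/wait.h>   // waitpid, WIFEXITED, WIFSIGNALED, ...
#include <unistd.h>     // fork, execv, _exit

// Turn a waitpid() status into a human-readable line for the log.
std::string describe_wait_status(int status) {
    if (WIFEXITED(status))
        return "exited with code " + std::to_string(WEXITSTATUS(status));
    if (WIFSIGNALED(status)) {
        const int sig = WTERMSIG(status);
        const char* name = strsignal(sig);
        return "killed by signal " + std::to_string(sig) +
               " (" + (name ? name : "unknown") + ")";
    }
    return "stopped or continued";
}

// Start the program named by argv[0] once and block until it ends.
// On success, stores the raw wait status in `status` and returns true.
bool run_once(char* const argv[], int& status) {
    const pid_t pid = fork();
    if (pid < 0) return false;   // fork failed
    if (pid == 0) {
        execv(argv[0], argv);    // only returns on failure
        _exit(127);
    }
    return waitpid(pid, &status, 0) == pid;
}
```

A supervisor loop would call run_once, write describe_wait_status(status) together with a timestamp to its own log, and then relaunch. A segfault shows up as SIGSEGV and a kill by the kernel's out-of-memory killer as SIGKILL. The same calls are available from C.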

Related

How to debug a hanging job resulting from reading from Lustre?

I have a job in interruptible sleep state (S) that has been hanging for a few hours.
I can't use gdb (gdb hangs when attaching to the PID).
I can't use strace either; strace resumes the hanging job =(
The WCHAN field shows the PID is waiting in ptlrpc. After some searching online, it looks like this is a Lustre operation. The print output also revealed that the program is stuck reading data from Lustre. Any ideas or suggestions on how to proceed with the diagnosis? Or possible reasons why the hang happens?
You can check /proc/$PID/stack on the client to see the whole stack of the process, which would give you some more information about what the process is doing (ptlrpc_set_wait() is just the generic "wait for RPC completion" function).
That said, what is more likely to be useful is to check the kernel console error messages (dmesg and/or /var/log/messages) to see what is going on. Lustre is definitely not shy about logging errors when there is a problem.
Very likely this will show that the client is waiting on a server to complete the RPC, so you'll also have to check dmesg and/or /var/log/messages on the server to see what the problem is there. There are several existing docs that go into detail about how to debug Lustre issues:
https://wiki.lustre.org/Diagnostic_and_Debugging_Tools
https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Wednesday/12A-Spitz-Paper.pdf
At that point, you are probably best off checking for existing Lustre bugs at https://jira.whamcloud.com/, searching for the first error messages that are reported, or maybe a stack trace. It is very likely (depending on what error is being hit) that there is already a fix available, and upgrading to the latest maintenance release (2.12.7 currently) or applying a patch (if the bug was recently fixed) will solve your problem.

In ColdFusion 10, how can I tell whether OnApplicationEnd is caused by application timeout or server shut down?

In Application.cfc, OnApplicationEnd is called either when the application is timing out or when the server is shutting down. However, can I tell exactly which one is the cause? I only want to run some cleanup code when the server is shutting down but NOT when the application is timing out. Can I actually do that?
The stack trace is probably different in both circumstances. Set up a test to catch an error and log the stack trace in each instance. Then you will know what to look for when onApplicationEnd is called to determine the cause. (You'll need to catch an error every time and search through the stack).
Of course, this comes with a big disclaimer that you're relying on undocumented behaviors that can change with any update to ColdFusion, etc, etc. Honestly, it would be better to encapsulate the logic so it doesn't care why the application is being shut down.

How to report correctly the abrupt end of another process in Linux?

I'm working on an embedded solution where two apps are running: one is the user interface and the other runs in the background providing data for the UI.
Recently I came across a memory leak or similar error that makes Linux kill the secondary process, leaving the UI in a stalled state without telling the user anything about what is going on. I found the problem by reading Linux's message log file and the software's output on the terminal, "Kill -myapp".
My question is: how can I detect such an event (and other similar ones) coming from the secondary software, so that I can properly report it to the user and log it? I mean, it's easy to look at the process tree from time to time to see if the secondary app is running and, if it's not, report "some event happened" in the UI. It's also plausible to have an error-handling system inside the secondary app that writes what just happened to a log file, and to have the UI read that file for new entries from time to time. But how can the UI app know in better detail what is going on in such abrupt events? (In this case "Linux killed process", but it could be a "segmentation pipe" or any other.) (And if there is another, better solution than this "constantly read a log file produced by the secondary app", I'd also like to know.)
Notes: the UI is written in C++/Qt and the secondary app is in C. Although a solution using the Qt library would be welcomed, I think it would be better for the entire programming community if a more generalized solution was given.
You can create a signal handler for catchable POSIX signals such as SIGTERM or SIGSEGV in the backend process (SIGKILL, which is what the kernel's out-of-memory killer sends, cannot be caught or handled) and notify the UI using, for example, another signal with sigqueue. Any IPC mechanism should work, as long as it's async-signal-safe. Read more about signals: tutorial and manual
It may still be a good idea to check from the ui side periodically because the handler might not succeed.
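As an illustration of the sigqueue idea, here is a minimal sketch assuming POSIX real-time signal support. The helper names (block_notifications, notify_peer, wait_for_notification) and the use of SIGUSR1 with an integer payload are assumptions made for the example, not something the original setup prescribes.

```cpp
#include <signal.h>     // sigqueue, sigwaitinfo, sigprocmask
#include <sys/types.h>  // pid_t
#include <unistd.h>     // getpid, for callers that want to signal themselves

// Receiver side (e.g. the UI): block SIGUSR1 so it can be collected
// synchronously with sigwaitinfo() instead of interrupting the program.
bool block_notifications() {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    return sigprocmask(SIG_BLOCK, &set, nullptr) == 0;
}

// Sender side (e.g. the backend): queue SIGUSR1 to `target` with `code`
// as payload. sigqueue() is async-signal-safe, so this may be called
// from inside a signal handler.
bool notify_peer(pid_t target, int code) {
    union sigval value;
    value.sival_int = code;
    return sigqueue(target, SIGUSR1, value) == 0;
}

// Receiver side: wait for the next queued SIGUSR1 and return its payload,
// or -1 on error.
int wait_for_notification() {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    siginfo_t info;
    if (sigwaitinfo(&set, &info) < 0) return -1;
    return info.si_value.sival_int;
}
```

The backend could pass the number of the signal it caught as the payload. In a Qt UI, the blocking wait would typically live in a helper thread, which is a design choice outside this sketch.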
As for a better way to check if process is alive compared to reading the log file:
Check if process exists given its pid
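A common way to do that check on POSIX systems is to send signal 0 with kill(), which performs the existence and permission checks without delivering anything. A minimal sketch, with process_exists as a hypothetical helper name:

```cpp
#include <cerrno>
#include <signal.h>     // kill
#include <sys/types.h>  // pid_t

// Returns true if a process with this pid currently exists.
// Note: a zombie (exited but not yet reaped) still counts as existing,
// and pids can be reused by unrelated processes.
bool process_exists(pid_t pid) {
    if (kill(pid, 0) == 0) return true;
    // EPERM means the process exists but we may not signal it.
    return errno == EPERM;
}
```

Because of pid reuse, this is only a hint. If the UI is the parent of the backend, waiting on the child with waitpid() is more reliable.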

Debugging a TCP-Server

I have a C++ application that accepts TCP connections from client applications.
After a seemingly random time of running fine (days), it stops receiving followup messages from the clients and only sees the first message on each TCP connection. After a re-start all is fine again.
The trouble is, this only happens on the production server, where I have to restart it as soon as it gets stuck, and I have been unable to reproduce this on a lab machine. None of the socket operations seem to return an error that I would see in my logfile, and the application is huge, so I can't just post the relevant part here.
First messages keep coming through all the time; only subsequent messages aren't received after a while. Even when my application stops receiving the follow-up messages, I can see them coming in with Wireshark.
Any ideas how I might find out what is happening ? What should I be looking for ?
Any config settings used here? In the past I have put a condition on a server accept to ignore messages after 50,000 have been processed. This was to prevent run-away situations in development. This code went live on one occasion without changing the config setting to 'allow infinite messages'. The result was exactly what you describe, ok for 2-3 days, then messages sent ok, but just ignored with no errors anywhere.
This may not be the case here, but I mention it as an example of where you may have to look.

How to log the segmentation faults and run time errors which can crash the program, through a remote logging library?

What is the technique to log the segmentation faults and run time errors which crash the program, through a remote logging library?
The language is C++.
Here is the solution for printing backtrace, when you get a segfault, as an example what you can do when such an error happens.
That leaves you with the problem of logging the error to the remote library. I would suggest keeping the signal handler as simple as possible and logging to a local file, because you cannot assume that a previously initialized logging library still works correctly once a segmentation fault has occurred.
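A minimal sketch of such a handler on a POSIX system follows. The log file is opened up front so that the handler itself only calls async-signal-safe functions (write, signal, raise). The names g_crash_fd, on_fatal_signal and install_crash_logger are hypothetical, and the choice of SIGSEGV and SIGABRT is an assumption for the example.

```cpp
#include <fcntl.h>    // open
#include <signal.h>   // sigaction, signal, raise
#include <unistd.h>   // write

// File descriptor opened before any crash, so the handler never calls open().
static int g_crash_fd = -1;

extern "C" void on_fatal_signal(int sig) {
    static const char msg[] = "fatal signal caught, terminating\n";
    if (g_crash_fd >= 0) {
        ssize_t ignored = write(g_crash_fd, msg, sizeof(msg) - 1);
        (void)ignored;
    }
    // Restore the default action and re-raise, so the process still dies
    // with the original signal (and can still produce a core dump).
    signal(sig, SIG_DFL);
    raise(sig);
}

// Opens (or creates) the local crash log and installs the handler.
bool install_crash_logger(const char* path) {
    g_crash_fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (g_crash_fd < 0) return false;
    struct sigaction sa{};
    sa.sa_handler = on_fatal_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGSEGV, &sa, nullptr) == 0 &&
           sigaction(SIGABRT, &sa, nullptr) == 0;
}
```

A separate process can later pick up the local file and forward it to the remote logger. A crash caused by stack overflow would additionally need an alternate signal stack (sigaltstack), which this sketch leaves out.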
What is the technique to log the segmentation faults and run time errors which crash the program, through a remote logging library?
From my experience, trying to log debugging messages (remotely or to a file) while the program is crashing might not be very reliable, especially if the app takes the system down along with it:
With a TCP connection you might lose the last several messages while the system is crashing. (TCP maintains data packet order and uses error correction, AFAIK. So if the app just quits, some data can be lost before being transmitted.)
With a UDP connection you might lose messages because of the nature of UDP, and receive them out of order.
If you're writing to a file, the OS might discard the most recent changes (buffers not flushed, journaled filesystem reverting to an earlier state of the file).
Flushing buffers after every write or sending messages via TCP/UDP might induce performance penalties for a program that produces thousands of messages per second.
So as far as I know, a good idea is to maintain an in-memory plaintext log and write a core dump once the program has crashed. This way you'll be able to find the contents of the log within the core dump. Also, writing to an in-memory log will be significantly faster than writing to a file or sending messages over the network. Alternatively, you could use some kind of "dual logging": write every debug message immediately into the in-memory log, and then send them asynchronously (in another thread) to a log file or over the network.
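To make the in-memory log idea concrete, here is a minimal sketch of a fixed-size ring buffer that keeps only the most recent text. The class name MemoryLog, its interface and the capacity parameter are assumptions for illustration.

```cpp
#include <array>
#include <cstddef>
#include <mutex>
#include <string>

// Fixed-size in-memory log: once full, the oldest bytes are overwritten.
template <std::size_t Capacity>
class MemoryLog {
public:
    // Append one line (a newline is added) to the buffer.
    void append(const std::string& line) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (char c : line) put(c);
        put('\n');
    }

    // Copy out the retained text, oldest byte first.
    std::string snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::string out;
        const std::size_t start = (size_ == Capacity) ? head_ : 0;
        for (std::size_t i = 0; i < size_; ++i)
            out.push_back(buf_[(start + i) % Capacity]);
        return out;
    }

private:
    void put(char c) {
        buf_[head_] = c;
        head_ = (head_ + 1) % Capacity;
        if (size_ < Capacity) ++size_;
    }

    mutable std::mutex mutex_;
    std::array<char, Capacity> buf_{};
    std::size_t head_ = 0;  // next write position
    std::size_t size_ = 0;  // number of valid bytes
};
```

If an instance lives in a global or static variable, its plaintext contents end up in the core dump, where they can be found with a debugger or by searching the file. For the "dual logging" variant, a background thread could periodically call snapshot() and forward the text.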
Handling of exceptions:
Platform-specific. On Windows you can use _set_se_translator, either to generate a backtrace or to translate platform (structured) exceptions into C++ exceptions.
On Linux I think you should be able to create a handler for the SIGSEGV signal.
While catching a segfault sounds like a decent idea, instead of trying to handle it from within the program it makes sense to generate a core dump and bail. On Windows you can use MiniDumpWriteDump from within the program, and on Linux the system can be configured to produce core dumps from the shell (ulimit -c, I think?).
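On Linux, the limit that ulimit -c controls can also be raised from inside the program with setrlimit. A minimal sketch, with enable_core_dumps as a hypothetical helper name:

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_CORE

// Raise the soft core-file size limit to the hard limit for this process.
// Returns false if the limits could not be read or changed.
bool enable_core_dumps() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) return false;
    lim.rlim_cur = lim.rlim_max;  // an unprivileged process cannot exceed the hard limit
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

This only helps if the hard limit is non-zero. Where the core file ends up, and whether it is piped to a helper program instead, depends on the system's /proc/sys/kernel/core_pattern setting.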
I'd like to give some solutions:
Use core dumps, and start a daemon that monitors for and collects core dumps and sends them to your host.
GDB (with gdbserver): you can debug remotely and see the backtrace if the program crashes.
To catch the segfault signal and send a log accordingly, read this post:
Is there a point to trapping "segfault"?
If it turns out that you won't be able to send the log from a signal handler (maybe the crash occurred before the logger had been initialized), then you may need to write the info to a file and have an external entity send it remotely.
EDIT: Putting back some original info to be able to send the core file remotely too
To be able to send the core file remotely, you'll need an external entity (a different process than the one that crashed) that will "wait" for core files and send them remotely as they appear. (possibly using scp) Additionally, the crashing process could catch the segfault signal and notify the monitoring process that a crash has occurred and a core file will be available soon.