How to debug a job that hangs while reading from Lustre?

I have a job in interruptible sleep state (S) that has been hanging for a few hours.
I can't use gdb (gdb itself hangs when attaching to the PID).
I can't use strace either, since strace resumes the hanging job. =(
The WCHAN field shows the PID is waiting in ptlrpc. After some searching online, it looks like this is a Lustre operation. The program's printed output also shows that it is stuck reading data from Lustre. Any ideas or suggestions on how to proceed with the diagnosis? Or possible reasons why the hang happens?

You can check /proc/$PID/stack on the client to see the whole stack of the process, which would give you some more information about what the process is doing (ptlrpc_set_wait() is just the generic "wait for RPC completion" function).
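If you would rather capture that from a small helper than by hand, a minimal sketch (reading /proc/<pid>/stack usually requires root; unlike gdb or strace, it does not attach to, stop, or resume the target):

    #include <fstream>
    #include <iostream>
    #include <string>

    // Dump the kernel stack of a stuck process without attaching to it.
    int main(int argc, char** argv) {
        if (argc != 2) { std::cerr << "usage: " << argv[0] << " <pid>\n"; return 1; }
        std::string path = std::string("/proc/") + argv[1] + "/stack";
        std::ifstream stack(path);
        if (!stack) { std::cerr << "cannot open " << path << "\n"; return 1; }
        std::string line;
        while (std::getline(stack, line))
            std::cout << line << "\n";   // frames such as ptlrpc_set_wait, etc.
        return 0;
    }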
That said, what is more likely to be useful is to check the kernel console error messages (dmesg and/or /var/log/messages) to see what is going on. Lustre is definitely not shy about logging errors when there is a problem.
Very likely this will show that the client is waiting on a server to complete the RPC, so you'll also have to check dmesg and/or /var/log/messages on the server to see what the problem is there. There are several existing docs that go into detail about how to debug Lustre issues:
https://wiki.lustre.org/Diagnostic_and_Debugging_Tools
https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Wednesday/12A-Spitz-Paper.pdf
At that point, you are probably best off searching the existing Lustre bugs at https://jira.whamcloud.com/ for the first error messages that are reported, or perhaps a stack trace. It is very likely (depending on what error is being hit) that there is already a fix available, and upgrading to the latest maintenance release (2.12.7 currently) or applying a patch (if the bug was fixed recently) will solve your problem.

Related

Akka EventsourcedBehavior JournalFailureException but stack trace doesn't show underlying cause

I have an akka persistence app (EventSourcedBehavior based actors, akka 2.6.13) using akka-persistence-jdbc 3.5.3 for the journal/snapshot plugin, along with a cockroachdb cluster. Things work fine, but recently I've seen a lot of event persist failures, and the error logs do not show any underlying cause of the issue - no SQL-level exceptions in the trace at all. At the same time we usually see errors when actors are being restored, with the journal again throwing JournalFailureExceptions, but no underlying reason.
If I can't see any underlying reasons (the only thing the logs show is "async write timed out after 5.00 s"; is this timeout value configurable?), does this mean there is something else causing the issues, unrelated to the journal plugin implementation or the database? How can I debug this further? I've examined the message handler in my EventSourcedBehavior that failed when persisting an event to see if it is doing anything weird or blocking, but I can't see anything obviously wrong.
Any ideas welcome.
Thanks
The JournalFailureExceptions likely indicate connectivity or slow responses from the DB. If it's just slowness, scaling out/up the cockroach cluster might help.
"async write timed out after" is from cluster sharding's remember-entities feature (that's the only occurrence in Akka) which also indicates connectivity issues or slow responses from the DB.
There is most likely no problem with your behaviors. It's worth noting that remember-entities (especially in eventsourced mode; ddata mode is a little better in this regard, if you're OK with not remembering entities across full-cluster restarts) itself puts a substantial load on persistence and your DB, and in my experience is consistently counterproductive once you have more than a few hundred entities. Unless you've actually tried disabling it and seen a net negative effect, I suggest disabling remember-entities.

Retrieve the information for an address using gdb

While running strace on a Java application I noticed some syscalls taking a long time (mostly futex).
futex(0x7f8578001fd4, FUTEX_WAIT_PRIVATE, 1311, NULL) = 0 <15.082094>
I really want to understand which shared resource this futex wait is for.
But I'm not sure how.
I did some googling and found that GDB can be helpful for finding the cause. Unfortunately, I'm not very familiar with GDB, as I've barely used it before.
Can someone help me understand how to find the answer I'm looking for?
The futex operation is waiting for another thread to release a lock. You should first look at Java-aware tools (such as jstack or a thread dump) to see whether this is a high-level Java lock. Even sending SIGQUIT to the JVM (pressing Ctrl+\ in its terminal is sufficient) may be enough, since it makes the JVM print a thread dump showing what each thread is blocked on.
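If you want to trigger that thread dump from code rather than the keyboard, a minimal sketch (equivalent to running kill -QUIT <pid>; the JVM writes the dump to its own stdout):

    // sigquit.cpp - ask a JVM to print a thread dump to its stdout.
    #include <sys/types.h>
    #include <signal.h>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv) {
        if (argc != 2) { std::fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
        pid_t pid = static_cast<pid_t>(std::atol(argv[1]));
        if (kill(pid, SIGQUIT) != 0) {   // needs permission to signal that process
            std::perror("kill");
            return 1;
        }
        return 0;
    }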

gRPC C++ client calls against Bigtable hang occasionally

I am having a problem with a gRPC C++ client making calls against Google Cloud Bigtable. These calls occasionally hang, and only if a call deadline is set does the call return. There is an issue filed with the gRPC team: https://github.com/grpc/grpc/issues/6278 with a stack trace and a piece of gRPC tracing log provided.
The call that hangs most often is the ReadRows stream read call. I have seen the MutateRow call hang a few times as well, but that is rather rare.
gRPC tracing shows that there is some response coming back from the server; however, that response seems to be insufficient for the gRPC client to make progress.
I did spend a fair amount of time debugging the code; no obvious problems have been found so far on the client side, and no memory corruption has been seen. This is a single-threaded application, making one call at a time, so client-side concurrency is not a suspect. The client runs on a Google Compute Engine box, so the network is likely not an issue either. The gRPC client is kept up to date with the GitHub repository mainline.
Any suggestions would be appreciated. If anyone has debugging ideas, that would be great as well. Using valgrind and gdb and reducing the application to a subset with reproducible results have not helped so far; I have not been able to find out what the problem is. The problem is random and shows up occasionally.
Additional note on May 17, 2016
There was a suggestion that re-tries may help to deal with the issue.
Unfortunately retries do not work very well for us, because we would have to carry them over into the application logic. We can easily retry updates (MutateRow calls), and we do; those are not streaming calls and are easy to retry. However, once the application has begun iterating over the results of a DB query, retrying after a failure means the application has to re-issue the query and start iterating over the results again, which is problematic. We could consider changing our applications to read the whole result set at once and then iterate over it in memory, which would make retries manageable, but that is problematic for all kinds of reasons, such as memory footprint and application latency. We want to process DB query results as soon as they arrive, not once all of them are in memory. When a call hangs, the timeout is also added to the call latency. So retrying query-result iteration is costly to a degree that makes it impractical.
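For reference, a sketch of the deadline workaround mentioned above, assuming the stub and message types come from the generated google.bigtable.v2 bindings (the table name and the 60 s deadline are illustrative):

    #include <chrono>
    #include <iostream>
    #include <memory>

    #include <grpc++/grpc++.h>
    #include "google/bigtable/v2/bigtable.grpc.pb.h"

    using google::bigtable::v2::Bigtable;
    using google::bigtable::v2::ReadRowsRequest;
    using google::bigtable::v2::ReadRowsResponse;

    // Read rows with a hard deadline so a hung stream eventually returns
    // DEADLINE_EXCEEDED instead of blocking forever.
    void ReadWithDeadline(const std::shared_ptr<grpc::Channel>& channel) {
        std::unique_ptr<Bigtable::Stub> stub = Bigtable::NewStub(channel);

        grpc::ClientContext context;
        context.set_deadline(std::chrono::system_clock::now() +
                             std::chrono::seconds(60));  // illustrative value

        ReadRowsRequest request;
        request.set_table_name(
            "projects/my-project/instances/my-instance/tables/my-table");  // illustrative

        std::unique_ptr<grpc::ClientReader<ReadRowsResponse>> reader(
            stub->ReadRows(&context, request));

        ReadRowsResponse response;
        while (reader->Read(&response)) {
            // process response.chunks() as they arrive
        }

        grpc::Status status = reader->Finish();
        if (!status.ok()) {
            // A hang now surfaces here as DEADLINE_EXCEEDED instead of blocking.
            std::cerr << "ReadRows failed: " << status.error_message() << std::endl;
        }
    }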
We've experienced hanging issues with gRPC in various languages. The gRPC team is investigating.

catch all errors and exceptions related to my program

I am currently working on a C++ daemon program that listens on a port for incoming requests.
I would like to catch all the errors related to the program. For that I implemented a logger in my program and caught some of the expected ones, but other errors remain uncatchable using those methods, for example a segfault, or the program being stopped because of a memory shortage.
I had the idea of using 'dmesg', which contains logs from different processes, and then taking what I need from there. The problem with this approach is that the logs from 'dmesg' don't contain very human-readable information and, on top of that, they are not dated. So I used 'gdb' on my program; now the logs are more elaborate and contain better information, but I can't capture gdb's messages.
My questions are:
Is my approach to this problem correct? If yes, how can I continue from where I am now?
Is there another, better way to listen for errors than this?
I will need something similar in a C program; do you have a suggestion?
EDIT
After some research I think I will use another daemon to check every 5 minutes or so whether my other daemon is running, in order to re-launch it if it's down. With this settled, I now need to record the error; this is where I am stuck.
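One way to get at the error itself: a minimal sketch of such a watchdog, assuming the watchdog is the one that launches the daemon (the path is illustrative). Because it is the parent, waitpid() tells it exactly how the child ended, including which signal killed it (e.g. SIGSEGV for a segfault, or SIGKILL from the OOM killer when memory runs out):

    // watchdog.cpp: start the daemon, wait for it to die, log how it died, restart.
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <string.h>
    #include <syslog.h>

    int main() {
        openlog("watchdog", LOG_PID, LOG_DAEMON);
        for (;;) {
            pid_t pid = fork();
            if (pid < 0) { syslog(LOG_ERR, "fork failed"); sleep(5); continue; }
            if (pid == 0) {
                // Child: replace ourselves with the daemon (path is illustrative).
                execl("/usr/local/bin/mydaemon", "mydaemon", (char*)nullptr);
                _exit(127);  // exec failed
            }
            int status = 0;
            waitpid(pid, &status, 0);  // blocks until the daemon exits or is killed
            if (WIFSIGNALED(status)) {
                // Killed by a signal: SIGSEGV, SIGKILL (OOM killer), etc.
                syslog(LOG_ERR, "daemon killed by signal %d (%s), restarting",
                       WTERMSIG(status), strsignal(WTERMSIG(status)));
            } else if (WIFEXITED(status)) {
                syslog(LOG_ERR, "daemon exited with status %d, restarting",
                       WEXITSTATUS(status));
            }
            sleep(5);  // back off a little before relaunching
        }
    }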

How to report correctly the abrupt end of another process in Linux?

I'm working on an embedded solution where two apps are running: one is the user interface and the other runs in the background providing data for the UI.
Recently I came across a memory leak or similar error that is making Linux kill the secondary process, leaving the UI stalled without telling the user anything about what is going on. I found the problem by reading Linux's message log file and the "Kill -myapp" message printed on the terminal.
My question is: how could I notice such an event (and other similar ones) coming from the secondary program, so I could properly report it to the user and log it? It's easy to look at the process tree from time to time to see whether the secondary app is running and, if it's not, report "some event happened" in the UI. It's also plausible to have an error handler inside the secondary app that writes what just happened to a log file and have the UI read that file for new entries from time to time. But how could the UI app know in more detail what is going on in more abrupt events (in this case "Linux killed the process", but it could be a segmentation fault, a broken pipe or anything else)? And if there is a better solution than "constantly read a log file produced by the secondary app", I'd also like to know about it.
Notes: the UI is written in C++/Qt and the secondary app is in C. Although a solution using the Qt library would be welcome, I think it would be better for the entire programming community if a more generalized solution were given.
You can install a signal handler for POSIX signals such as SIGTERM or SIGSEGV in the backend process and notify the UI, for example by sending it another signal with sigqueue. Note that SIGKILL itself (which is what the kernel's OOM killer sends) cannot be caught, so a handler will never see that case. Any IPC mechanism should work, as long as it's async-signal-safe. Read more about signals: tutorial and manual
It may still be a good idea to check from the UI side periodically, because the handler might not run or might not succeed.
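A minimal sketch of that idea on the backend side, assuming it knows the UI's PID (hard-coded here; in practice it might be read from a pid file). It is written as C++ here, but the calls are plain POSIX and work the same way in C:

    #include <signal.h>
    #include <unistd.h>

    static pid_t ui_pid = 12345;            // hypothetical: read from a pid file in practice
    static const int UI_NOTIFY_SIG = SIGUSR1;

    extern "C" void fatal_handler(int signo) {
        // Only async-signal-safe calls are allowed here; sigqueue() is one of them.
        union sigval value;
        value.sival_int = signo;            // tell the UI which signal hit us
        sigqueue(ui_pid, UI_NOTIFY_SIG, value);
        // Re-raise with the default action so the kernel still reports the crash.
        signal(signo, SIG_DFL);
        raise(signo);
    }

    int main() {
        struct sigaction sa = {};
        sa.sa_handler = fatal_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, nullptr);   // segfault
        sigaction(SIGTERM, &sa, nullptr);   // polite kill
        // SIGKILL cannot be caught, so the UI must still poll for that case.

        for (;;) pause();                   // stand-in for the real daemon work
    }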
As for a better way to check whether the process is alive, compared to reading the log file:
Check if process exists given its pid
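The usual technique behind that linked question is kill() with signal number 0, which performs the existence/permission check without delivering a signal; a minimal sketch:

    #include <sys/types.h>
    #include <signal.h>
    #include <cerrno>

    // Returns true if a process with this pid currently exists.
    // Note: pids can be reused after a process dies, so this is a liveness hint,
    // not a guarantee that it is still your daemon.
    bool process_alive(pid_t pid) {
        if (kill(pid, 0) == 0) return true;
        return errno == EPERM;  // exists, but we lack permission to signal it
    }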