How do I solve problems that occur when Akka cannot get a heartbeat?

When I work with Akka, I run into problems while debugging: the nodes can't exchange heartbeats while execution is paused at a breakpoint, and then they stop seeing each other.
As far as I can tell, if a node fails to send a heartbeat for about 10 seconds, it is dropped from the cluster and the nodes can no longer see each other. I want to extend this period; for example, a node should only be dropped from the cluster after it has failed to heartbeat for 5 minutes. I'm looking for a solution like this.

Turn off auto-downing. That way the node you are debugging might get marked as unreachable, but it will never be removed from the cluster. (And as soon as you release it from being frozen, the other nodes will mark it as reachable again.)
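For reference, here is a minimal application.conf sketch covering both knobs; the key names come from Akka's cluster configuration, but the 5-minute value is just illustrative, not a recommendation:

akka.cluster {
  # Never remove an unreachable node from the cluster automatically.
  auto-down-unreachable-after = off

  failure-detector {
    # Tolerate long heartbeat pauses (e.g. a JVM frozen at a breakpoint)
    # before marking the node unreachable in the first place.
    acceptable-heartbeat-pause = 300 s
  }
}

Note that raising acceptable-heartbeat-pause also delays detection of genuinely crashed nodes, so don't set it higher than the longest pause you actually expect while debugging.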

Related

How to debug a hanging job resulting from reading from lustre?

I have a job in an interruptible sleep state (S) that has been hanging for a few hours.
I can't use gdb (gdb hangs when attaching to the PID).
I can't use strace (strace resumes the hanging job).
The WCHAN field shows the PID is waiting on ptlrpc. After some searching online, it looks like this is a Lustre operation. The print output also revealed that the program is stuck reading data from Lustre. Any ideas or suggestions on how to proceed with the diagnosis, or possible reasons why the hang happens?
You can check /proc/$PID/stack on the client to see the whole stack of the process, which would give you some more information about what the process is doing (ptlrpc_set_wait() is just the generic "wait for RPC completion" function).
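If you want to grab that stack programmatically, say, to snapshot it every few minutes while the job hangs, here is a small sketch; reading /proc/<pid>/stack typically requires root:

#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    // PID of the hung process, passed on the command line.
    if (argc < 2) { std::cerr << "usage: " << argv[0] << " <pid>\n"; return 1; }
    std::ifstream stack(std::string("/proc/") + argv[1] + "/stack");
    if (!stack) { std::cerr << "cannot open /proc/" << argv[1] << "/stack\n"; return 1; }
    std::string line;
    while (std::getline(stack, line))
        std::cout << line << '\n';  // look for frames like ptlrpc_set_wait()
    return 0;
}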
That said, what is more likely to be useful is to check the kernel console error messages (dmesg and/or /var/log/messages) to see what is going on. Lustre is definitely not shy about logging errors when there is a problem.
Very likely this will show that the client is waiting on a server to complete the RPC, so you'll also have to check dmesg and/or /var/log/messages on the server to see what the problem is there. There are several existing docs that go into detail about how to debug Lustre issues:
https://wiki.lustre.org/Diagnostic_and_Debugging_Tools
https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Wednesday/12A-Spitz-Paper.pdf
At that point, you are probably best off checking for existing Lustre bugs at https://jira.whamcloud.com/ by searching for the first error messages that are reported, or maybe a stack trace. It is very likely (depending on what error is being hit) that there is already a fix available, and upgrading to the latest maintenance release (2.12.7 currently), or applying a patch (if the bug was fixed recently), will solve your problem.

MessageProducer.send() is too slow for a particular topic

I've narrowed down the area of the problem I'm facing and it turned out that MessageProducer.send() is too slow when it is created for a particular topic "replyfeserver":
auto producer = context.CreateProducerFromTopic("replyfeserver");
producer->send(textMessage); //it is slow
Here the call to send() occasionally blocks for 55-65 seconds (roughly every 4-5 calls), and for 5-15 seconds in general.
However, if I use some other topic, say "feserver.action.status":
auto producer = context.CreateProducerFromTopic("feserver.action.status");
producer->send(textMessage); //it is fast!
Now the call to send() returns immediately, within a fraction of a second. I've tried send() with several other topics, and all of them are fast enough.
What could be the possible issues with this particular topic "replyfeserver"? What should I look at in order to diagnose it? I had been using this topic for the last 2 months.
I'm using the XMS C++ API; please assume that the context object is an abstraction that creates the session, destination, consumer, producer, and so on.
I'd also like to know if there is any difference between these two approaches:
xms::Destination dest("topic://replyfeserver");
vs
xms::Destination dest = session.createTopic("replyfeserver");
I tried both, and it doesn't make any difference; at least I didn't notice one.
There shouldn't be any difference. Personally, I like to have my topics in a hierarchy, i.e. A.B.C.
I would run an MQ trace, then open a PMR with IBM, give them the trace, and ask them to explain the delay.
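If you want hard numbers to go with that trace, here is a hedged timing sketch around send(); the template parameters stand in for whatever types your XMS abstraction actually exposes:

#include <chrono>
#include <iostream>

// Times one send() call and logs the latency in milliseconds.
// Producer/Message are placeholders for the XMS types in the question.
template <typename Producer, typename Message>
long long timed_send(Producer& producer, Message& msg) {
    auto t0 = std::chrono::steady_clock::now();
    producer->send(msg);
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::cout << "send() took " << ms << " ms\n";
    return ms;
}

Logging this per call and per topic would show whether the 55-65 second stalls correlate with specific calls, which is exactly the kind of evidence IBM will want alongside the trace.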

What happens to running processes on a continuous Azure WebJob when website is redeployed?

I've read about graceful shutdowns here using the WEBJOBS_SHUTDOWN_FILE and here using Cancellation Tokens, so I understand the premise of graceful shutdowns; however, I'm not sure how they will affect WebJobs that are in the middle of processing a queue message.
So here's the scenario:
I have a WebJob with functions listening to queues.
Message is added to Queue and job begins processing.
While processing, someone pushes to develop, triggering a redeploy.
Assuming I have my WebJobs hooked up to deploy on git pushes, this deploy will also trigger the WebJobs to be updated, which (as far as I understand) will kick off some sort of shutdown workflow in the jobs. So I have a few questions stemming from that.
Will jobs in the middle of processing a queue message finish processing the message before the job quits? Or is any shutdown notification essentially treated as "this bitch is about to shut down; if you don't have anything to handle it, you're SOL"?
If we are SOL, is our best option for handling shutdowns essentially to wrap anything you're doing in the equivalent of DB transactions and implement your shutdown handler in such a way that all changes are rolled back on shutdown?
If a queue message is in the middle of being processed and the WebJob shuts down, will that message be requeued? If not, does that mean that my shutdown handler needs to handle requeuing that message?
Is it possible for functions listening to queues to grab any more queue messages after the job has been notified that it needs to shut down?
Any guidance here is greatly appreciated! Also, if anyone has any other useful links on how to handle job shutdowns besides the ones I mentioned, it would be great if you could share those.
After no small amount of testing, I think I've found the answers to my questions and I hope someone else can gain some insight from my experience.
NOTE: All of these scenarios were tested using .NET console apps and Azure queues, so I'm not sure how blobs or table storage, or different Job file types, would handle these scenarios.
After a job has been marked to exit, any triggered functions that are running get a configured grace period (5 seconds by default, but configurable via a settings.job file) to finish before they are exited. If they do not finish within the grace period, the function quits. Main() (or whichever file you declared host.RunAndBlock() in), however, will finish running any code after host.RunAndBlock() for up to the amount of time remaining in the grace period (I'm not sure how that would work if you used an infinite loop instead of RunAndBlock). As for handling the quit in your functions, you can essentially "listen" to the CancellationToken that you can pass in to your triggered functions by checking IsCancellationRequested, and then handle it accordingly. Also, you are not SOL if you don't handle the quits yourself. Huzzah! See point #3.
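For reference, the grace period mentioned above is set via a settings.job file at the root of the WebJob; stopping_wait_time is in seconds, and 60 here is just an example value:

{
  "stopping_wait_time": 60
}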
While you are not SOL if you don't handle the quit (see point #3), I do think it is a good idea to wrap all of your jobs in transactions that you don't commit until you're absolutely sure the job has run its course. That way, if your function exits mid-process, you'll be less likely to have to worry about corrupted data. I can think of a couple of scenarios where you might want to commit transactions as you go (batch jobs, for instance); however, you would then need to structure your data or logic so that previously processed entities aren't reprocessed after the job restarts.
You are not in trouble if you don't handle job quits yourself. My understanding of what's going on under the covers is virtually non-existent; however, I am quite sure of the results. If a function is in the middle of processing a queue message and is forced to quit before it can finish, HAVE NO FEAR! When the job grabs a message to process, it essentially hides it on the queue for a certain amount of time. If your function quits while processing the message, that message will "become visible" again after x amount of time, and it will be re-grabbed and run against the potentially updated code that was just deployed.
I have about 90% confidence in my findings for #4, because testing it involved quick-switching between windows while not being totally sure what was going on with certain pieces. But here's what I found: on the off chance that a queue has a new message added to it in the grace period before a job quits, I THINK one of two things can happen. If the function doesn't poll that queue before the job quits, the message will stay on the queue and be grabbed when the job restarts. However, if the function DOES grab the message, it will be treated the same as any other interrupted message: it will "become visible" on the queue again and be rerun when the job restarts.
That pretty much sums it up. I hope other people will find this useful. Let me know if you want any of this expounded on and I'll be happy to try. Or if I'm full of it and you have lots of corrections, those are probably more welcome!

C++: How to set a timeout (not reading input, not threaded)?

Got a large C++ function in Linux that calls a whole lot of other functions, making up an algorithm. At various points, given certain bad inputs, the algorithm can get "stuck" and go on forever. Adding a timeout seems appropriate, as all the potential "stuck" points cannot be predicted. But despite scouring the Internet for timeout examples, I've only found how to apply timeouts when the thing you're timing is either a separate thread or is reading input. My code is a single thread and does not modify file descriptors, so I'm not having any luck. Do I basically have no choice but to thread it?
I am not sure about the situation; server applications or embedded applications often run for years in the background without stopping. I think one option is to let your program run in the background and log to a file (or the screen) periodically, and, if you really want to stop the program after a certain time, you can use the timeout command or a script to kill your program after that time, e.g. timeout 15s your-prog.
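If you'd rather time out in-process without threads, one classic single-threaded approach (not mentioned above, so treat this as a sketch of an alternative) is POSIX alarm(): request a SIGALRM and have the algorithm poll a flag at points where it is safe to abandon work:

#include <csignal>
#include <cstdio>
#include <unistd.h>

// Set asynchronously by the signal handler; sig_atomic_t keeps it safe.
static volatile sig_atomic_t g_timed_out = 0;

extern "C" void on_alarm(int) { g_timed_out = 1; }

int main() {
    std::signal(SIGALRM, on_alarm);
    alarm(15 * 60);  // budget: 15 minutes (illustrative)

    // The algorithm checks the flag at each "safe" point.
    while (!g_timed_out) {
        // ... one bounded step of the algorithm ...
        break;  // placeholder so this sketch terminates on its own
    }

    alarm(0);  // cancel any pending alarm
    if (g_timed_out) {
        std::fprintf(stderr, "algorithm timed out, aborting\n");
        return 1;
    }
    return 0;
}

The catch is that every potentially unbounded loop has to check the flag, which is the same problem the question points out: you can't always predict where the algorithm gets stuck.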

Debugging a TCP-Server

I have a C++ application that accepts TCP connections from client applications.
After a seemingly random time of running fine (days), it stops receiving follow-up messages from the clients and only sees the first message on each TCP connection. After a restart, all is fine again.
The trouble is, this only happens on the production server, where I have to restart it as soon as it gets stuck, and I have been unable to reproduce it on a lab machine. None of the socket operations seems to return an error that I would see in my logfile, and the application is huge, so I can't just post the relevant part here.
First messages keep coming through all the time; only subsequent messages aren't received after a while. Even when my application stops receiving the follow-up messages, I can see them coming in with Wireshark.
Any ideas how I might find out what is happening? What should I be looking for?
Are any config settings used here? In the past, I put a condition on a server accept to ignore messages after 50,000 had been processed. This was to prevent runaway situations in development. This code went live on one occasion without the config setting being changed to 'allow infinite messages'. The result was exactly what you describe: OK for 2-3 days, then messages sent fine but just ignored, with no errors anywhere.
This may not be the case here, but I mention it as an example of where you may have to look.
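One piece of instrumentation that might also help while the problem is live: log every return value from the receive path, including 0 (orderly peer close) and errno on failure, so a silent condition shows up in the logfile. A minimal sketch (fd is assumed to be an already-accepted connection):

#include <cerrno>
#include <cstdio>
#include <cstring>
#include <sys/types.h>
#include <sys/socket.h>

// Reads once from fd and logs exactly what happened.
// Returns bytes read, 0 on orderly peer close, -1 on error.
ssize_t logged_recv(int fd, char* buf, size_t len) {
    ssize_t n = recv(fd, buf, len, 0);
    if (n > 0)
        std::fprintf(stderr, "fd %d: received %zd bytes\n", fd, n);
    else if (n == 0)
        std::fprintf(stderr, "fd %d: peer closed connection\n", fd);
    else
        std::fprintf(stderr, "fd %d: recv failed: %s\n", fd, std::strerror(errno));
    return n;
}

Since Wireshark shows the messages arriving, a log like this would tell you whether the data ever reaches recv() at all, which narrows the problem to either the socket handling or the application logic above it.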