In our project we have a stateful server. The server runs a rule engine (Drools) and exposes its functionality through a REST service. It is a monitoring system, and it is critical to have an uptime of more or less 100%. Therefore we need strategies for shutting a server down for maintenance, and for continuing to monitor an agent when one server is offline.
A first step might be to put a message queue or service bus in front of the Drools servers to hold messages that have not yet been processed, and to add mechanisms that back up the state of the server to a database or other storage. That makes it possible to shut the server down for a few minutes to deploy a new version. But the question is: what do we do when a server goes offline unexpectedly? Are there any failover strategies for stateful servers? What is your experience? Any advice is welcome.
There's no 'correct' way that I can think of. It rather depends on things like:
sensitivity to changes over time windows.
how quickly your application needs to be brought back up.
impact if events are missed.
impact if the events it is monitoring are not up to the second.
how the application raises events back to the outside world.
Some ideas for enabling failover:
1. Start from a clean slate. Examine the most serious impact of this before spending time developing anything else.
2. Load a list of facts (today's transactions, perhaps) from a database, and potentially replay them in order, possibly using a pseudo clock. I'm aware of this being used for some pricing applications in the financial sector, although I'm also aware that some of those systems can take a very long time to catch up due to the number of events that need to be replayed. (A sketch of this replay approach follows the list.)
3. Persist the stateful session periodically, with the interval determined by how far behind the DR application is permitted to be and how long it takes to persist a session. This way, the DR application can retrieve the same session from the database. However, there will be a gap in events received, based on the interval between persists. And of course, if the reason for failure is corruption of the session, this doesn't work so well. (A snapshot sketch also follows the list.)
4. Configure middleware to forward events to 2 queues, and subscribe the primary and DR applications to those queues. This way, both monitors should be in sync and able to make decisions based on the last 1 minute of activity. Note that if one leg is taken out for a period, it will need to catch up, and your middleware needs the capacity to store multiple hours' (however long an outage might be) worth of events on a queue. Also, your rules need to work off the timestamp the event carried when queued, rather than the current time. Otherwise, when bringing a leg back after an outage, it could well raise alerts based on events in a time window.
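For idea 2, here is a minimal, hedged sketch of replaying stored facts into a Drools session configured with a pseudo clock, so that time-window rules see the original event times rather than wall-clock time. The `StoredEvent` shape and the `"monitoringSession"` name are assumptions for illustration, not part of the original setup:

```scala
import java.util.concurrent.TimeUnit
import org.kie.api.KieServices
import org.kie.api.runtime.conf.ClockTypeOption
import org.kie.api.time.SessionPseudoClock

// Hypothetical shape for events persisted to the database.
case class StoredEvent(fact: Any, timestampMillis: Long)

def replay(events: Seq[StoredEvent]): Unit = {
  val ks     = KieServices.Factory.get()
  val config = ks.newKieSessionConfiguration()
  config.setOption(ClockTypeOption.get("pseudo")) // rules see replayed time, not wall-clock time

  // "monitoringSession" is a placeholder for your own kmodule session name.
  val session = ks.getKieClasspathContainer.newKieSession("monitoringSession", config)
  val clock: SessionPseudoClock = session.getSessionClock()

  events.sortBy(_.timestampMillis).foreach { e =>
    // Advance the pseudo clock to the event's original timestamp before inserting,
    // so sliding time windows in the rules behave as they did the first time around.
    val delta = e.timestampMillis - clock.getCurrentTime
    if (delta > 0) clock.advanceTime(delta, TimeUnit.MILLISECONDS)
    session.insert(e.fact)
    session.fireAllRules()
  }
}
```

For idea 3, a sketch of periodic snapshots using the Drools marshaller API; where you store the resulting blob (database, file share, etc.) is up to you:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.kie.api.{KieBase, KieServices}
import org.kie.api.runtime.KieSession

class SessionSnapshotter(kieBase: KieBase) {
  private val marshaller = KieServices.Factory.get().getMarshallers.newMarshaller(kieBase)

  def snapshot(session: KieSession): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    marshaller.marshall(out, session) // serializes facts and rule-engine state
    out.toByteArray                   // persist this blob; the DR node can load it later
  }

  def restore(blob: Array[Byte]): KieSession =
    marshaller.unmarshall(new ByteArrayInputStream(blob))
}
```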
An additional point to consider when replaying events is that you probably don't want any alerts to be raised to the outside world until you have completed the replay. For instance, you probably don't want 50 alert emails sent to say that ApplicationX is down, up, down, up, down, up, ...
I'll assume that a monitoring application might be pushing alerts to the outside world in some form. If you have a hot-hot configuration as in idea 4, you also need to control your alerts. I would be tempted to deal with this by configuring each monitor to push alerts to its own queue. Middleware could then forward alerts from the secondary monitor to a dead letter queue. Failover would be to reconfigure the middleware so that primary alerts go to the dead letter queue and secondary alerts go to the alert channel. This mechanism could also be used to discard events raised during a replay recovery.
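A minimal sketch of that gating idea, with hypothetical `sendToAlertChannel`/`sendToDeadLetter` hooks standing in for your middleware publishing calls:

```scala
// Park alerts on a dead letter queue while replaying (or while acting as the
// secondary monitor), and release them to the real channel otherwise.
class AlertGate(sendToAlertChannel: String => Unit, sendToDeadLetter: String => Unit) {
  @volatile private var suppressed = true

  def activate(): Unit = suppressed = false // call when replay completes / on failover

  def raise(alert: String): Unit =
    if (suppressed) sendToDeadLetter(alert)
    else sendToAlertChannel(alert)
}
```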
Given the complexity and potential mess that can arise from replaying events, for a monitoring application I would probably prefer starting from a clean slate, or going with persisted sessions. However this may well depend on what you are monitoring.
We're using WSO2 DAS 3.1.0 to receive events from WSO2 API-Manager and forward them to a database.
If we send maybe 70-100 events/second for 4-5 hours to DAS, the performance slowly deteriorates and it starts "lagging" behind. At first we suspected a problem pushing the resulting events to our database (we have an event receiver, an execution plan that summarizes events per second, and a publisher to our database), but we've now concluded that this isn't the issue; the database has no problem keeping up with the load at all.
To isolate the problem we added, for example, an event publisher to file on the incoming event receiver (before we do any handling in our execution plan), and we can see that when DAS performance deteriorates there is no output for this publisher either for several seconds; hence the problem is in the handling of incoming events. (We've also added a queue in front of the database publishing to make sure no back-pressure was propagating to the handling of incoming events.)
The really strange part, however, is that when this behavior occurs (the performance of handling incoming events in DAS deteriorates), there's no way to get out of it apart from restarting the entire server (after which it works again without problems for several hours). Even if we stop sending events to the server for several days, when we start sending even 1-2 events to the server, several seconds pass between events being handled (so it straight away "lags" behind incoming events). It's as if the performance gets exponentially slower at handling incoming events until we restart DAS.
I'd be very happy for any potential clues as to where to make changes so that this behavior doesn't occur (purging internal events has no effect either).
After a lot of debugging we finally found the cause for this.
In our Siddhi statements we use 'group by' with dynamically changing timestamps, which it turns out is handled extremely inefficiently, as described by this bug: https://github.com/wso2/siddhi/issues/431.
After patching the specified classes the problem disappeared (but there's currently still a bug where the product runs out of memory because it doesn't release the dynamic 'group by' information).
I am evaluating whether to use Akka and Akka Persistence as the key toolkit for a project that has a complex background process (which might be triggered by Quartz at a fixed time each day).
The background process will communicate with many different external services via HTTP, generate many encrypted files locally, and transfer them via SFTP.
From a business perspective:
The service is mission critical; roughly speaking, it will automatically charge N million users' bank cards and help them purchase fund products.
From a technical perspective:
Each external service might be unavailable for any number of reasons, such as network issues or the external service running out of resources (e.g. JDBC connections).
Our service might be killed, restarted, or re-deployed for urgent reasons, or crash with some unexpected error.
If the process is restarted with an incomplete job, it needs to complete that job gracefully using different strategies, such as redoing the work, confirming the external system's business state, or resuming from a certain checkpoint.
I was reading the official AkkaScala.PDF and some YouTube conference videos, and all of them mention that an actor's state can be restored by replaying the events from the journal after a JVM crash.
This may be a stupid question, since I did not find it being discussed anywhere:
Imagine there were 1000 persistent actors living in the service, and the service's JVM crashed and restarted. Who should be in charge of triggering the re-creation of those 1000 persistent actors in the newly created actor system, in both single-process mode and clustered mode? And how? Or what articles should I read first?
You should read the basics of Akka Persistence and Akka Persistence Query. The first thing that comes to mind is to use Akka Persistence Query's AllPersistenceIdsQuery or CurrentPersistenceIdsQuery. It will give you all persistence IDs, which you can use to re-create your persistent actors. A persistent actor with a specific persistence ID will replay all events from the event store journal, and you can take snapshots to speed up recovery. Your event store will probably be some kind of database (e.g. Cassandra). Given that your persistent actor has some specific mutable state, it will be brought back to its last state after the recovery. Recovery might take some time.
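A hedged sketch of that approach, assuming the Cassandra journal plugin; the `WorkerActor` is a hypothetical stand-in for your own persistent actors:

```scala
import akka.actor.{ActorSystem, Props}
import akka.persistence.PersistentActor
import akka.persistence.cassandra.query.scaladsl.CassandraReadJournal
import akka.persistence.query.PersistenceQuery
import akka.stream.ActorMaterializer

// Hypothetical persistent actor re-created for each recovered ID.
class WorkerActor(override val persistenceId: String) extends PersistentActor {
  def receiveCommand = { case msg => /* handle new commands */ }
  def receiveRecover = { case evt => /* rebuild state from replayed events */ }
}
object WorkerActor { def props(id: String): Props = Props(new WorkerActor(id)) }

object Recovery extends App {
  implicit val system: ActorSystem = ActorSystem("background-jobs")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  val journal = PersistenceQuery(system)
    .readJournalFor[CassandraReadJournal](CassandraReadJournal.Identifier)

  // currentPersistenceIds() emits the IDs known at query time and completes;
  // each re-created actor then replays its own events (and snapshots) itself.
  journal.currentPersistenceIds().runForeach { persistenceId =>
    system.actorOf(WorkerActor.props(persistenceId), name = persistenceId)
  }
}
```

Something has to own this recovery step: in a single process it can simply run at startup, while in a cluster you'd typically let Cluster Sharding re-create entity actors on demand instead of eagerly.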
I am looking to build a system that can process a stream of requests that each need a long processing time, say 5 minutes. My goal is to speed up request processing with a minimal resource footprint, given that requests can arrive in bursts.
I could use something like a service bus to queue the requests and have multiple processes (actors, in Akka terms) that subscribe for messages and start processing. I could also have a watchdog that watches the queue length in the service bus and creates more actors/actor systems, or stops a few.
If I want to do the same in an actor system like Akka.NET, how can this be done? Say something like this:
I may want to spin up/stop new remote actor systems based on my request queue length.
Send the message to any one of the available actors, which can start processing without the sender having to check who has the bandwidth to process.
Messages should not be lost, and if an actor fails, the message should be passed to the next available actor.
Can this be done with Akka.NET, or is this not a valid use case for the actor system? Can someone please share some thoughts or point me to resources where I can get more details?
I may want to spin up/stop new remote actor systems based on my request queue length.
This is not supported out of the box by Akka.Cluster. You would have to build something custom for it.
However, Akka.NET has pool routers, which are able to resize automatically according to configurable parameters. You may be able to build something around them.
Send the message to any one of the available actors, which can start processing without the sender having to check who has the bandwidth to process.
If you look at Akka.NET routers, there are various strategies that can be used to assign work. SmallestMailbox is probably the closest to what you're after.
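A hedged sketch of what that might look like (shown in JVM Akka Scala syntax, since the Akka.NET API mirrors it closely; the `Worker` actor is just a placeholder):

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.{DefaultResizer, SmallestMailboxPool}

class Worker extends Actor {
  def receive = {
    case job: String => // long-running processing goes here
  }
}

object RouterExample extends App {
  val system = ActorSystem("processing")

  // The resizer grows the pool up to 10 workers under load and shrinks it when
  // idle, which covers part of the "watchdog" requirement without custom code.
  val resizer = DefaultResizer(lowerBound = 2, upperBound = 10)

  val router = system.actorOf(
    SmallestMailboxPool(nrOfInstances = 2, resizer = Some(resizer)).props(Props[Worker]),
    name = "workers")

  // Each job is routed to the worker with the fewest queued messages.
  router ! "job-1"
}
```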
Messages should not be lost, and if an actor fails, the message should be passed to the next available actor.
Akka.NET supports At-Least-Once Delivery. Read more about it in the docs or on the Petabridge blog.
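For flavor, a hedged sketch of the at-least-once delivery pattern (again in JVM Akka Scala syntax, which Akka.NET mirrors; the message and event types are invented for the example):

```scala
import akka.actor.ActorPath
import akka.persistence.{AtLeastOnceDelivery, PersistentActor}

case class Job(payload: String)                       // command from a client
case class JobSent(payload: String)                   // persisted events
case class JobConfirmed(deliveryId: Long)
case class Process(deliveryId: Long, payload: String) // message to the worker
case class Done(deliveryId: Long)                     // worker's confirmation

class ReliableSender(workerPath: ActorPath) extends PersistentActor with AtLeastOnceDelivery {
  override def persistenceId = "reliable-sender"

  override def receiveCommand = {
    case Job(payload)     => persist(JobSent(payload))(updateState)
    case Done(deliveryId) => persist(JobConfirmed(deliveryId))(updateState)
  }

  override def receiveRecover = {
    case evt @ (_: JobSent | _: JobConfirmed) => updateState(evt)
  }

  // deliver() keeps redelivering Process until confirmDelivery() is called,
  // so a crashed worker doesn't lose the message.
  def updateState(evt: Any): Unit = evt match {
    case JobSent(payload)         => deliver(workerPath)(id => Process(id, payload))
    case JobConfirmed(deliveryId) => confirmDelivery(deliveryId)
  }
}
```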
While you may achieve some of your goals with Akka cluster, I wouldn't advise it. Your requirements clearly state that your concerns are:
Reliable message delivery, where service buses and message queues are the better option. There are a lot of solutions here depending on your needs, e.g. MassTransit, NServiceBus, or queues (RabbitMQ).
Scaling workers, which is an infrastructure problem and isn't solved by actor frameworks themselves. From what you've said, you don't even need a cluster.
You could use Akka to build the message-processing logic, such as the workers. But as I said, you don't need it if your goal is to replace an existing service bus.
I am developing a Windows Phone app where users can update a list. Each update, delete, add, etc. needs to be stored in a database that sits behind a web service. As well as ensuring all the operations made on the phone end up in the cloud, I need to make sure the app is really responsive and the user doesn't feel any lag whatsoever.
What's the best design to use here? Should each check box change and each text box edit fire a new thread to contact the web service? Should I locally store a list of things that need to be updated, then send them to the server in batches every so often (and what about the back button)? Am I missing another, even easier implementation?
Thanks in advance,
Data updates to your web service are going to take some time to execute, so in terms of providing the very best responsiveness to the user your best approach would be to fire these off on a background thread.
If updates not being sent (until your app resumes) after a back press is a concern for your app, then you can increase the frequency of sending these updates.
Storing data locally would be a good idea following each change to make sure nothing is lost since you don't know if your app will get interrupted such as by a phone call.
You are able to intercept the back button, which would allow you to notify the user of pending updates being processed, or to request confirmation to defer transmission (say, in the case of a poorly performing network location). Perhaps a visual cue in your UI would be helpful to indicate pending requests in your storage queue.
You may want to give some thought to the overall frequency of data updates in a typical usage scenario for your application and how intensely this would utilise the network connection. Depending on this you may want to balance frequency of updates with potential power consumption.
This may guide you on whether to fire updates off of field-level changes, a timer when the queue isn't empty, and/or the user moving on to a different row of data, among other possibilities.
General efficiency guidance with mobile network communications is to have larger and less frequent transmissions rather than a "chatty" or frequent transmissions pattern, however this is up to you to decide what is most applicable for your application.
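To make that concrete, here's a hedged, language-agnostic sketch of such a local batching queue (written in Scala for brevity; the phone app itself would use C#). The batch size and interval are illustrative:

```scala
import java.util.{Timer, TimerTask}
import scala.collection.mutable

// Queues changes locally and flushes them in batches, either when enough have
// accumulated or when the timer fires, avoiding a "chatty" transmission pattern.
class PendingUpdates(flush: Seq[String] => Unit,
                     batchSize: Int = 10,
                     intervalMs: Long = 30000) {
  private val queue = mutable.Queue.empty[String]
  private val timer = new Timer(true)
  timer.schedule(new TimerTask { def run(): Unit = flushNow() }, intervalMs, intervalMs)

  def add(update: String): Unit = synchronized {
    queue.enqueue(update) // also persist locally, in case the app is interrupted
    if (queue.size >= batchSize) flushNow()
  }

  def flushNow(): Unit = synchronized {
    if (queue.nonEmpty) flush(queue.dequeueAll(_ => true))
  }
}
```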
You might want to look into something similar to REST or SOAP.
Each update, delete, add would send a request to the web service. After the request is fulfilled, the web service sends a message back to the Phone application.
Since you want to keep this simple on the Phone application, you would send a URL to the web service, and the web service would respond with a simple message you can easily parse.
Something like this:

```
http://webservice?action=update&id=10345&data=...
```

With a reply of:

```
Update 10345 successful
```
The id number is just an incrementing sequence to identify the request / response pair.
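A hedged sketch of a client for that scheme (Scala used for illustration; `baseUrl` and the helper name are hypothetical, and real code should URL-encode `data`):

```scala
import scala.io.Source

// Sends one update and returns the service's plain-text reply,
// e.g. "Update 10345 successful".
def sendUpdate(baseUrl: String, id: Int, data: String): String = {
  val url = s"$baseUrl?action=update&id=$id&data=$data"
  val in  = Source.fromURL(url)
  try in.mkString.trim
  finally in.close()
}
```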
There is the Microsoft Sync Framework, recently released and discussed some weeks back on DotNetRocks. I must admit I didn't consider this till I read your comment.
I've not looked into the sync framework's dependencies, and thus its ability to run on the WP7 platform, as yet, but it's probably worth checking out.
Here's a link to the framework.
And a link to Carl and Richard's show with Lev Novik, an architect on the project if you're interested in some background info. It was quite an interesting show.
I want to know the technical reasons why the Lift web framework has high performance and scalability. I know it uses Scala, which has an actor library, but according to the install instructions its default configuration is with Jetty. Does it use the actor library to scale?
Is the scalability built in right out of the box? Do you just add additional servers and nodes and it automatically scales; is that how it works? Can it handle 500,000+ concurrent connections with supporting servers?
I am trying to create a web services framework for the enterprise level that can beat what is out there and is easy to scale, configurable, and maintainable. My definition of scaling is that you just add more servers and can accommodate the extra load.
Thanks
Lift's approach to scalability is within a single machine. Scaling across machines is a larger, tougher topic. The short answer there is: Scala and Lift don't do anything to either help or hinder horizontal scaling.
As far as actors within a single machine, Lift achieves better scalability because a single instance can handle more concurrent requests than most other servers. To explain, I first have to point out the flaws in the classic thread-per-request handling model. Bear with me, this is going to require some explanation.
A typical framework uses a thread to service a page request. When the client connects, the framework assigns a thread out of a pool. That thread then does three things: it reads the request from a socket; it does some computation (potentially involving I/O to the database); and it sends a response out on the socket. At pretty much every step, the thread will end up blocking for some time. When reading the request, it can block while waiting for the network. When doing the computation, it can block on disk or network I/O. It can also block while waiting for the database. Finally, while sending the response, it can block if the client receives data slowly and TCP windows get filled up. Overall, the thread might spend 30-90% of its time blocked. It spends 100% of its time, however, on that one request.
A JVM can only support so many threads before it really slows down. Thread scheduling, contention for shared-memory entities (like connection pools and monitors), and native OS limits all impose restrictions on how many threads a JVM can create.
Well, if the JVM is limited in its maximum number of threads, and the number of threads determines how many concurrent requests a server can handle, then the number of concurrent requests a server can handle is capped by the number of threads the JVM can support.
(There are other issues that can impose lower limits, such as GC thrashing. Threads are a fundamental limiting factor, but not the only one!)
Lift decouples thread from requests. In Lift, a request does not tie up a thread. Rather, a thread does an action (like reading the request), then sends a message to an actor. Actors are an important part of the story, because they are scheduled via "lightweight" threads. A pool of threads gets used to process messages within actors. It's important to avoid blocking operations inside of actors, so these threads get returned to the pool rapidly. (Note that this pool isn't visible to the application, it's part of Scala's support for actors.) A request that's currently blocked on database or disk I/O, for example, doesn't keep a request-handling thread occupied. The request handling thread is available, almost immediately, to receive more connections.
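As a rough illustration of that decoupling, here is a hedged sketch using the old scala.actors library this answer alludes to (it is not Lift's actual internals, and the names are invented):

```scala
import scala.actors.Actor
import scala.actors.Actor._

// A callback the front-end thread registers for completing the response later.
case class HandleRequest(requestId: Int, respond: String => Unit)

// The actor processes requests on a small shared thread pool; the thread that
// accepted the connection is never tied up while the work happens.
val requestHandler: Actor = actor {
  loop {
    react {
      case HandleRequest(id, respond) =>
        // A real handler would chain further actors rather than block on I/O here.
        respond(s"response for request $id")
    }
  }
}

// Called on the Jetty thread; returns immediately, freeing the thread
// to accept more connections.
def onRequest(id: Int): Unit =
  requestHandler ! HandleRequest(id, body => println(body))
```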
This method for decoupling requests from threads allows a Lift server to have many more concurrent requests than a thread-per-request server. (I'd also like to point out that the Grizzly library supports a similar approach without actors.) More concurrent requests means that a single Lift server can support more users than a regular Java EE server.
@mtnyguard:
"Scala and Lift don't do anything to either help or hinder horizontal scaling"
That's not quite right. Lift is a highly stateful framework. For example, if a user requests a form, then they can only post the request to the same machine the form came from, because the form-processing action is saved in the server state.
And this actually hinders scalability in a way, because this behaviour is inconsistent with a shared-nothing architecture.
No doubt Lift is highly performant, but performance and scalability are two different things. So if you want to scale horizontally with Lift, you have to define sticky sessions on the load balancer, which will direct a user to the same machine for the duration of a session.
Jetty may be the point of entry, but the actor ends up servicing the request. I suggest having a look at the Twitter-esque example, 'skitter', to see how you would be able to create a very scalable service. IIRC, this is one of the things that made the Twitter people take notice.
I really like @dre's reply, as he correctly states that the statefulness of Lift is a potential problem for horizontal scalability.
The problem:
Instead of me describing the whole thing again, check out the discussion (not the content) on this post: http://javasmith.blogspot.com/2010/02/automagically-cluster-web-sessions-in.html
The solution would be, as @dre said, sticky-session configuration on the load balancer at the front, plus adding more instances. But since request handling in Lift is done with a thread + actor combination, you can expect one instance to handle more requests than normal frameworks. That gives it an edge over sticky sessions in other frameworks, i.e. an individual instance's capacity to process more may help you scale.
You also have Akka-Lift integration, which would be another advantage here.