akka cluster fast handover

akka cluster fast handover - akka

I'm using an akka cluster singleton feature to have a single master node and multiple worker nodes. Worker nodes send results to the master and master persists them in a database.
The problem is that it takes akka about 10 seconds to hand the singleton over, which results in 10 seconds downtime every time I deploy. Is there a way to make the singleton handover faster? Or is there some other approach to a single master - multiple workers pattern?

Related

Distributing tasks over HTTP using SpringBoot non-blocking WebClient (performance question?)

I have a large number of tasks - N, needs to be distributed to multiple http worker nodes via load balancer. Though there exists multiple nodes - n, combining all nodes we have a max-concurrency setting - x.
Always
N > x > n
One node can run those tasks in multiple threads. Mean time consumption for each task is about 50 sec to 1 min. Using Web-Client to distribute tasks and Mono response from Workers.
There exists a distributor and designed the process as follows:
1. Remove a task from queue.
2. Send the task via POST request using Web-Client and subscribe immediately with a subscriber instance
3. Holt new subscription when max concurrency reached to x
4. When any one of the above distributed task completes it calls on-accept(T) method of the subscriber.
5. If task queue is not empty, remove and send the next task / (x+1) task.
6. Keep track of total number completed tasks.
7. If all tasks completed & queue empty set Completable Future object as complete
8. Exit
The above process works fine. Tested with N=5000, n=10 & x=25.
Now the confusion is in this design we always have x number of concurrent subscriptions. As soon as one ends we create another until all tasks are completed. What is the impact of this in large scale production environment? If number of concurrent subscription (the value of x > 10,000) increases via the HTTP(s) load balancer is that going to have serious impact on performance and network latency? Our expected production volume will be something like below:
N=200,000,000
n=100
x=10,000
I will be grateful if some one with knowledge of Reactor and Web-Client expertise comment of this approach. Our main concern is having too many concurrent subscriptions.

Apache STORM - Tuples Distribution among workers

I am working on a project using Apache STORM and the topology that I'm using consist of Spout, BoltA(2 Executors) and BoltB(1 Executor) in sequence.
Spout --> BoltA --> BoltB
My understanding is that Shuffle grouping divide the tuples equally among the bolt tasks but what I noticed is that it is true only when all the tasks are on the same worker. When there are more than 1 workers let's say 2 workers and each worker hosts one task instance of same bolt then load is not even in both the tasks.
With my topology - I have load on one task as 90% and the second task has 0%.
Why is that different for multiple workers.

If you have shuffle connections between the Spout and BoltA, the tuples should be evenly distributed.
As Stig Rohde Døssing mentioned, the behavior you mention matches the "local or shuffle grouping" (see Storm Concepts) which will preferentially send tuples to tasks on the local worker process.
So if the Spout has a parallelism of 1 and one of the BoltA tasks is on the same worker process, tuples from the spout will be preferentially routed to that local downstream task.

Creating a scalable and fault tolerant system using AWS ECS

We're designing C# scheduled task (runs every few hours) that will run on AWS ECS instances that will grab batched transaction data for thousands of customers from an endpoint, modify the data then send it on to another web service. We will be maintaining the state of the last successful batch in a separate database (using some like created date of the transactions). We need the system to be scalable so as more customers are added we add additional ECS containers to process the data.
There are the options we're considering:
Each container only processes a specific subset of the data. As more customers are added more contains are added. We would need to maintain a logical separation of what contains are processing what customers data.
All the containers process all of the customers. We use some kind of locking flags on the database to let other processes know that the customers data is being processed.
Some other approach.
I think that option 2 is probably the best, but it adds a lot of complexity regarding the locking and unlocking of customers. Are there specific design patterns I could be pointed towards if that if the correct solution?

In both scenarios an important thing to consider is retries in case processing for a specific customer fails. One potential way to distribute jobs across a vast number of container with retries would be to use AWS SQS.
A single container would run periodically every few hours and be the job generator. It would create one SQS queued item for each customer that needs to be processed. In response to items appearing in the queue a number of "worker" containers would be spun up by ECS to consume items from the queue. This can be made to autoscale relative to the number of items in the queue to quickly spin up many containers that can work in parallel.
Each container would use its own high performance concurrent poller similar to this (https://www.npmjs.com/package/squiss) to start grabbing items from the queue and processing them. If a worker failed or crashed due to a bug then SQS will automatically redeliver and dropped queued items that worker had been working on to a different worker after they time out.
This approach would give you a great deal of flexibility, and would let you horizontally scale out the number of workers, while letting any of the workers process any jobs from the queue that it grabs. It would also ensure that every queued item gets processed at least once, and that none get dropped forever in case something crashes or goes wrong.

Distributed Priority Queue, once and only once

TL;DR
I have producers, tasks and consumers. I need a scalable queuing system which can ensure that a task can be consumed once and only once, and which can sort the tasks according to their priority.
The context:
We have a prototype working, but it's not "scale ready", and today we need to scale...
Below is the prototype "process":
1°) Some customers upload dataset in the database (PostgreSQL)
2°) Each second, an application fetches for new dataset in the database and converts them into tasks.
One customer's dataset can generate thousand of tasks (~500K tasks/day, ~30K tasks/customer)
3°) An application "Dispatcher"
fetches sorted tasks from the database (tasks with the smallest dataset will be proceeded first even if they have been submitted later + some random value to shuffle)
performs some validations (check if the task has been canceled or not)
dispatch the task to the according worker.
Each worker can process only one kind of task, but it can process thousands of them concurrently.
4°) The workers receive the task, and push the result to the database
5°) A "Monitor" application checks the state of all tasks, and retries any task that needs to (worker crashed).
Today, the bottleneck is the SQL server, I can tune it but I would prefer to redesign it the right way. So, I was wondering if there are some best practices for that kind of process?
It seems I need a distributed queuing system (Kafka?), which can guarantee that a task will be processed once and only once, but which will also manage priority.

AWS SWF Simple Workflow - Best Way to Keep Activity Worker Scripts Running?

The maximum amount of time the pollForActivityTask method stays open polling for requests is 60 seconds. I am currently scheduling a cron job every minute to call my activity worker file so that my activity worker machine is constantly polling for jobs.
Is this the correct way to have continuous queue coverage?

The way that the Java Flow SDK does it and the way that you create an ActivityWorker, give it a tasklist, domain, activity implementations, and a few other settings. You set both the setPollThreadCount and setTaskExecutorSize. The polling threads long poll and then hand over work to the executor threads to avoid blocking further polling. You call start on the ActivityWorker to boot it up and when wanting to shutdown the workers, you can call one of the shutdown methods (usually best to call shutdownAndAwaitTermination).
Essentially your workers are long lived and need to deal with a few factors:
New versions of Activities
Various tasklists
Scaling independently on tasklist, activity implementations, workflow workers, host sizes, etc.
Handle error cases and deal with polling
Handle shutdowns (in case of deployments and new versions)

I ended using a solution where I had another script file that is called by a cron job every minute. This file checks whether an activity worker is already running in the background (if so, I assume a workflow execution is already being processed on the current server).
If no activity worker is there, then the previous long poll has completed and we launch the activity worker script again. If there is an activity worker already present, then the previous poll found a workflow execution and started processing so we refrain from launching another activity worker.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js