Apache Storm - tuple distribution among workers - grouping

I am working on a project using Apache Storm, and the topology I'm using consists of a Spout, BoltA (2 executors), and BoltB (1 executor) in sequence.
Spout --> BoltA --> BoltB
My understanding is that shuffle grouping divides the tuples equally among the bolt's tasks, but I noticed this holds only when all the tasks are on the same worker. When there is more than one worker, say two workers each hosting one task of the same bolt, the load is not even across the two tasks.
With my topology, one task carries about 90% of the load and the second task close to 0%.
Why does the behavior differ with multiple workers?

If you have a shuffle grouping between the Spout and BoltA, the tuples should be evenly distributed.
As Stig Rohde Døssing mentioned, the behavior you describe matches the "local or shuffle grouping" (see Storm Concepts), which preferentially sends tuples to tasks on the local worker process.
So if the Spout has a parallelism of 1 and one of the BoltA tasks is on the same worker process, tuples from the spout will be preferentially routed to that local downstream task.
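For reference, this is the difference in topology wiring; a minimal sketch where MySpout, BoltA, and BoltB are placeholders for your actual components:

```java
import org.apache.storm.topology.TopologyBuilder;

public class GroupingSketch {
    public static TopologyBuilder build() {
        // MySpout, BoltA and BoltB stand in for your own spout/bolt classes.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MySpout(), 1);

        // shuffleGrouping spreads tuples evenly across both BoltA tasks,
        // regardless of which worker process hosts them:
        builder.setBolt("boltA", new BoltA(), 2).shuffleGrouping("spout");

        // localOrShuffleGrouping instead prefers a BoltA task in the sender's
        // own worker process, which produces exactly the 90/0 skew described:
        // builder.setBolt("boltA", new BoltA(), 2).localOrShuffleGrouping("spout");

        builder.setBolt("boltB", new BoltB(), 1).shuffleGrouping("boltA");
        return builder;
    }
}
```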

Related

Distributing tasks over HTTP using Spring Boot's non-blocking WebClient (performance question)

I have a large number of tasks, N, that need to be distributed to multiple HTTP worker nodes via a load balancer. Although there are multiple nodes, n, across all nodes combined we have a max-concurrency setting, x.
Always
N > x > n
One node can run these tasks in multiple threads. The mean time per task is about 50 seconds to 1 minute. I use WebClient to distribute the tasks and receive a Mono response from the workers.
There is a distributor, and I designed the process as follows:
1. Remove a task from the queue.
2. Send the task via a POST request using WebClient and subscribe immediately with a subscriber instance.
3. Halt new subscriptions when the max concurrency, x, is reached.
4. When any one of the distributed tasks completes, it calls the on-accept(T) method of the subscriber.
5. If the task queue is not empty, remove and send the next, (x+1)-th, task.
6. Keep track of the total number of completed tasks.
7. If all tasks are completed and the queue is empty, mark a CompletableFuture object as complete.
8. Exit
The above process works fine; tested with N=5000, n=10, and x=25.
Now the concern: in this design we always have x concurrent subscriptions. As soon as one ends, we create another, until all tasks are completed. What is the impact of this in a large-scale production environment? If the number of concurrent subscriptions grows (x > 10,000) through the HTTP(S) load balancer, will that seriously hurt performance and network latency? Our expected production volume is roughly:
N=200,000,000
n=100
x=10,000
I would be grateful if someone with Reactor and WebClient expertise could comment on this approach. Our main concern is having too many concurrent subscriptions.
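For what it's worth, the subscribe-and-refill loop described above can also be expressed with Reactor's built-in concurrency cap in Flux.flatMap. A minimal sketch, where Task, Result, and the /work endpoint are hypothetical stand-ins for your own types and API:

```java
import java.util.List;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;

public class TaskDistributor {
    // Sketch: flatMap's concurrency argument keeps at most x requests in
    // flight, replacing the hand-rolled "subscribe, then refill" loop.
    static long distribute(List<Task> tasks, int x) {
        WebClient client = WebClient.create("http://load-balancer");
        return Flux.fromIterable(tasks)
            .flatMap(task -> client.post()
                    .uri("/work")              // hypothetical worker endpoint
                    .bodyValue(task)
                    .retrieve()
                    .bodyToMono(Result.class),
                x)                             // max concurrent subscriptions
            .count()                           // total number of completed tasks
            .block();                          // or subscribe() + CompletableFuture
    }
}
```

At x = 10,000 the limiting factor is usually not the number of Reactor subscriptions, which are cheap, but the HTTP connection pool, the load balancer, and the workers themselves, so those would be the things to load-test first.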

Celery worker best practices

I am using Celery to run background jobs for my Django app, hosted on Heroku, with Redis as the broker, and I want to set up task prioritization.
I am currently using the default Celery queue, which all the workers feed from. I considered implementing prioritization within that single queue, but it is described everywhere as bad practice.
The consensus on the best approach to the priority problem is to set up a separate Celery queue for each level of priority. Let's say:
Queue1 for highest priority tasks, assigned to x workers
Queue2 the default queue, assigned to all the other workers
The first problem I see with this method is that if there is no high priority task at some point, I lose the productivity of those x workers.
Also, say my infrastructure scales up and I have more workers available: only the number of "default" workers would grow dynamically. Besides, this method prevents me from keeping identical dynos (containers, on Heroku), which does not seem optimal for scalability.
Is there an efficient way to deal with task prioritization and keep replicable workers at the same time?
I know this is a very old question, but this might be helpful for someone who is still looking for an answer.
In my current project, we have an implementation like this:
1) High --> user actions (the user is waiting for the task to complete)
2) Low --> non-user actions (there is no wait time)
3) Reporting --> sending emails/reports to the user
4) Maintenance --> cleaning up historical data from the DB
Each of these queues holds tasks of different priorities (low, medium, and high), and we had to implement priority handling for our Celery tasks.
A) Let's take the scenario in which we are mainly interested in processing tasks by priority.
1) We have 2 or more queues and we push tasks into the queue(s), specifying a priority.
2) All the workers (let's say I have 4 workers) listen to all the queues.
3) In this scenario, say you have 100 tasks in your queues; of these, 20 are high priority, 30 are medium priority, and 50 are low priority.
4) Celery then processes the high priority tasks across the queues first, then the medium, and finally the low priority ones.
B) Queue1 for highest priority tasks, assigned to x workers
Queue2 the default queue, assigned to all the other workers
1) This approach is helpful **when you are more concerned about the performance of processing the tasks in a specific queue**, e.g., I have a queue **HIGH** whose tasks are very important to me irrespective of each task's individual priority.
2) So I would have a dedicated worker that processes tasks only from that particular queue. (As you mentioned in the question, this has some limitations.)
We have to choose between these two options based on our requirements. I hope this helps,
Suresh.
For this answer, W1 and W2 are workers consuming high and low priority tasks, respectively.
You can scale W1 and W2 as separate containers. You can have three containers, essentially built from the same image: one for the app, two for the workers. If you have a higher volume of one kind of task, only that container needs to scale. Also, depending on the kind of dyno you are using, you can set the workers' concurrency to use resources more efficiently.
For your reference, this is something that I did in one of my projects.

Akka cluster fast handover

I'm using the Akka cluster singleton feature to have a single master node and multiple worker nodes. The worker nodes send results to the master, and the master persists them in a database.
The problem is that it takes Akka about 10 seconds to hand the singleton over, which results in 10 seconds of downtime every time I deploy. Is there a way to make the singleton handover faster? Or is there some other approach to a single-master, multiple-workers pattern?

Distributed Priority Queue, once and only once

TL;DR
I have producers, tasks and consumers. I need a scalable queuing system which can ensure that a task can be consumed once and only once, and which can sort the tasks according to their priority.
The context:
We have a prototype working, but it's not "scale ready", and today we need to scale...
Below is the prototype "process":
1°) Customers upload datasets into the database (PostgreSQL).
2°) Every second, an application polls the database for new datasets and converts them into tasks.
One customer's dataset can generate thousands of tasks (~500K tasks/day, ~30K tasks/customer).
3°) An application "Dispatcher"
fetches sorted tasks from the database (tasks from the smallest datasets are processed first, even if they were submitted later, plus some random value to shuffle them),
performs some validations (checks whether the task has been canceled),
and dispatches each task to the appropriate worker.
Each worker can process only one kind of task, but it can process thousands of them concurrently.
4°) The workers receive the tasks and push the results to the database.
5°) A "Monitor" application checks the state of all tasks and retries any task that needs it (e.g., because a worker crashed).
Today the bottleneck is the SQL server. I could tune it, but I would prefer to redesign the system the right way. So I was wondering: are there best practices for this kind of process?
It seems I need a distributed queuing system (Kafka?) that can guarantee a task will be processed once and only once, but that can also manage priority.
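Until such a redesign happens, one common way to get once-and-only-once hand-out from the existing PostgreSQL store is row claiming with FOR UPDATE SKIP LOCKED (available since PostgreSQL 9.5). A minimal JDBC sketch; the tasks table and its columns are hypothetical stand-ins for the real schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DispatcherSketch {
    // Sketch: claim the next pending task exactly once. Concurrent dispatchers
    // skip rows another transaction has already locked, so no task is handed
    // out twice. Table and column names are illustrative.
    static Long claimNextTask(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        String sql =
            "SELECT id FROM tasks " +
            "WHERE status = 'pending' " +
            "ORDER BY dataset_size, shuffle_key " + // smallest first + shuffle
            "LIMIT 1 FOR UPDATE SKIP LOCKED";
        try (PreparedStatement select = conn.prepareStatement(sql);
             ResultSet rs = select.executeQuery()) {
            if (!rs.next()) {
                conn.rollback();
                return null; // no pending task
            }
            long id = rs.getLong("id");
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE tasks SET status = 'dispatched' WHERE id = ?")) {
                update.setLong(1, id);
                update.executeUpdate();
            }
            conn.commit();
            return id;
        }
    }
}
```

Note that Kafka partitions are strictly FIFO, so the priority requirement would still have to be layered on top (e.g., one topic per priority level); that is often the deciding factor between Kafka and a database-backed queue here.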

Quartz Job throttling

A slightly weird requirement:
I have a few Quartz jobs acting as data collectors; they collect data from different locations as and when it becomes available. I have another job [data load], called/triggered by the collector jobs, that updates my DB.
My requirement is to somehow throttle the Loader job so that only two instances run in parallel while handling the work coming from the collector jobs:
Collector Jobs 1,2,...N > Loader Job (two instances)
The job programs are deployed in a clustered Tomcat.
Two Questions:
1) How can I make the collector jobs wait when two instances of the Loader job are already in progress? Is there a way to use Quartz to implement FIFO logic to throttle the work going to the Loader job? I also do not want a collector to pick up more data if a batch is already waiting to be processed.
2) Is there a way to run a job with only two threads, so that no more than two instances are active at a time? My DB table limits me to two instances running in parallel.
It's 8 years later and this question shows up as the top Google result for job throttling. While the case in the question clearly begs for a queue, the actual answer was never given.
So... to throttle jobs in Quartz, one has to use a TriggerListener and implement the throttling in vetoJobExecution. The job itself can also be annotated with @DisallowConcurrentExecution to prevent concurrent executions.
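A minimal sketch of such a listener, assuming Quartz 2.x (TriggerListenerSupport and vetoJobExecution are the actual Quartz APIs; the name and the slot-counting logic are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;
import org.quartz.JobExecutionContext;
import org.quartz.Trigger;
import org.quartz.Trigger.CompletedExecutionInstruction;
import org.quartz.listeners.TriggerListenerSupport;

// Sketch: veto Loader firings once two instances are already running.
// Note: the counter is per JVM; in a clustered Tomcat each node throttles
// independently unless the count is kept in shared state (e.g., the DB).
public class LoaderThrottleListener extends TriggerListenerSupport {

    private static final int MAX_CONCURRENT = 2;
    private final AtomicInteger running = new AtomicInteger();

    @Override
    public String getName() {
        return "loaderThrottleListener";
    }

    @Override
    public boolean vetoJobExecution(Trigger trigger, JobExecutionContext context) {
        // Try to claim a slot; veto this firing if both slots are taken.
        if (running.incrementAndGet() > MAX_CONCURRENT) {
            running.decrementAndGet();
            return true; // vetoed: the trigger fired but the job will not run
        }
        return false;
    }

    @Override
    public void triggerComplete(Trigger trigger, JobExecutionContext context,
                                CompletedExecutionInstruction instruction) {
        // Release the slot when a non-vetoed execution finishes.
        running.decrementAndGet();
    }
}
```

The listener is registered via scheduler.getListenerManager().addTriggerListener(...), ideally with a matcher restricted to the Loader job's triggers so other jobs are not counted against the limit.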
It seems you have a producer-consumer situation here.
The producer and the consumer are usually separated by a queue. Have your collectors put items into a (persistent?) queue, and have your Loader read from the queue and dispatch to up to 2 handling threads.
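A minimal in-process sketch of that shape (the queue capacity and item type are illustrative; with clustered Tomcat you would want a shared or persistent queue rather than an in-memory one):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: collectors enqueue work; exactly two loader threads drain it,
// which caps the DB at two concurrent loads. T is whatever the collectors
// produce.
public class LoaderPipeline<T> {

    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>(1000);
    private final ExecutorService loaders = Executors.newFixedThreadPool(2);

    public void start() {
        for (int i = 0; i < 2; i++) {
            loaders.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        T item = queue.take(); // blocks until work arrives
                        loadIntoDb(item);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    // Called from the collector jobs; blocks when the queue is full,
    // which is what makes the collectors wait.
    public void submit(T item) throws InterruptedException {
        queue.put(item);
    }

    private void loadIntoDb(T item) {
        // DB load logic goes here
    }
}
```

The blocking put addresses question 1 (collectors wait while work is backed up), and the fixed pool of two threads addresses question 2 (never more than two loads in parallel per node).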