TL;DR
I have producers, tasks, and consumers. I need a scalable queuing system that can ensure each task is consumed once and only once, and that can sort tasks by priority.
The context:
We have a prototype working, but it's not "scale ready", and today we need to scale...
Below is the prototype "process":
1°) Customers upload datasets into the database (PostgreSQL)
2°) Every second, an application polls the database for new datasets and converts them into tasks.
One customer's dataset can generate thousands of tasks (~500K tasks/day, ~30K tasks/customer)
3°) A "Dispatcher" application
fetches sorted tasks from the database (tasks from the smallest dataset are processed first, even if they were submitted later, plus some random value to shuffle them; see the sketch after this list)
performs some validations (checks whether the task has been canceled)
dispatches each task to the appropriate worker.
Each worker can process only one kind of task, but it can process thousands of them concurrently.
4°) The workers receive the task, and push the result to the database
5°) A "Monitor" application checks the state of all tasks, and retries any task that needs to (worker crashed).
Today the bottleneck is the SQL server. I could tune it, but I would prefer to redesign the system the right way. So I was wondering: are there best practices for this kind of process?
It seems I need a distributed queuing system (Kafka?) that can guarantee a task will be processed once and only once, but that also supports priorities.
Related
We're designing a C# scheduled task (running every few hours) on AWS ECS instances that will grab batched transaction data for thousands of customers from an endpoint, modify the data, then send it on to another web service. We will maintain the state of the last successful batch in a separate database (using something like the created date of the transactions). We need the system to be scalable, so that as more customers are added we can add additional ECS containers to process the data.
These are the options we're considering:
Each container only processes a specific subset of the data. As more customers are added, more containers are added. We would need to maintain a logical separation of which containers are processing which customers' data.
All the containers process all of the customers. We use some kind of locking flags in the database to let other processes know that a customer's data is being processed.
Some other approach.
I think that option 2 is probably the best, but it adds a lot of complexity around locking and unlocking customers. Are there specific design patterns I could be pointed towards, if that is the correct solution?
In both scenarios an important thing to consider is retries in case processing for a specific customer fails. One potential way to distribute jobs across a vast number of containers, with retries, would be to use AWS SQS.
A single container would run periodically every few hours and be the job generator. It would create one SQS queued item for each customer that needs to be processed. In response to items appearing in the queue a number of "worker" containers would be spun up by ECS to consume items from the queue. This can be made to autoscale relative to the number of items in the queue to quickly spin up many containers that can work in parallel.
Each container would use its own high-performance concurrent poller, similar to this (https://www.npmjs.com/package/squiss), to start grabbing items from the queue and processing them. If a worker failed or crashed due to a bug, SQS would automatically redeliver any dropped queued items that worker had been working on to a different worker after they time out.
This approach would give you a great deal of flexibility, and would let you horizontally scale out the number of workers, while letting any of the workers process any jobs from the queue that it grabs. It would also ensure that every queued item gets processed at least once, and that none get dropped forever in case something crashes or goes wrong.
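As a rough illustration of the generator/worker split (this is not the squiss poller, which is a Node.js library; the sketch below assumes the AWS SDK for Java v2, and the queue URL and customer lookup are placeholders):

```java
import java.util.List;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.*;

public class CustomerBatchQueue {
    // Placeholder queue URL
    static final String QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/customer-batches";

    // Generator container: enqueue one item per customer that needs processing.
    static void enqueue(SqsClient sqs, List<String> customerIds) {
        for (String id : customerIds) {
            sqs.sendMessage(SendMessageRequest.builder()
                    .queueUrl(QUEUE_URL)
                    .messageBody(id)
                    .build());
        }
    }

    // Worker container: long-poll, process, and delete only on success so SQS
    // redelivers the item to another worker if this one crashes mid-way.
    static void workLoop(SqsClient sqs) {
        while (true) {
            ReceiveMessageResponse response = sqs.receiveMessage(ReceiveMessageRequest.builder()
                    .queueUrl(QUEUE_URL)
                    .maxNumberOfMessages(10)
                    .waitTimeSeconds(20)   // long polling
                    .build());
            for (Message message : response.messages()) {
                processCustomer(message.body());   // placeholder for fetch/modify/forward
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(QUEUE_URL)
                        .receiptHandle(message.receiptHandle())
                        .build());
            }
        }
    }

    static void processCustomer(String customerId) { /* placeholder */ }
}
```

Because a message is only deleted after processing succeeds, a crashed worker's items reappear after the visibility timeout and get picked up by another worker, which is what gives the at-least-once behaviour described above.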
I'm developing a Django app which relies heavily on Celery task scheduling, using Redis as the backend. Tasks can be scheduled to run a long time in the future, as well as in a few seconds/minutes.
I've read about the Redis visibility timeout and the consequences of scheduling tasks with a timedelta greater than the visibility timeout (I'm also in the process of dealing with it in a previous project), so I'm interested in whether there's anything neater than my solution, which is: have another "helper" task run 5 minutes before the "main" one needs to be executed, schedule the "main" task to run at the required time, store its task id in the DB, and then check in the "main" task whether the stored task id is the one actually being run. The last part (storing the task id) is required because multiple runs of the "helper" task could spawn a lot of "main" task instances, but with this approach each will have a different task id.
I really hate how that approach sounds and how it works: if the task is scheduled to run a month from now, the "helper" and "main" tasks get executed up to a hundred times.
I also know that it's an open issue, so I'm more interested in a neat workaround than in a solution itself.
Having tested the available options, in my opinion only using RabbitMQ as the broker solves the whole problem.
Although it's a viable option for me, RabbitMQ's lack of some of Redis's configuration parameters (e.g. pool size) makes it unusable for those using hosting services that limit the number of open broker connections.
I want to create a web application where a client calls a REST web service. This returns an OK status to the client (with a link to the result) and creates a new message on an ActiveMQ queue. On the listener side of ActiveMQ there should be workers who process the messages.
I'm stuck here with my concept, because I don't really know how to determine the number of workers I need. The workers only have to call web service interfaces, so no high computational power is needed for the worker itself. Most of the time the worker is waiting for results from the called web service. But one worker cannot handle all requests, so if a limit of requests in the queue is exceeded (I don't know the limit yet), another worker should service the queue.
What is the best practice for doing this job? Should I create one worker per request and destroy it when the work is done? How do I dynamically create workers based on the queue size? Is it better to run these workers all the time, or to create them only when the queue requires it?
I think a Topic/Subscriber architecture is not reasonable, because only one worker should handle each request. Let's imagine 100 requests per minute on average and 500 requests at peak load.
My intention is to get results fast, so no client has to wait for its answer just because resources aren't being used properly...
Thank you
Why don't you figure out the max number of workers you could realistically support, spin up that many, and leave them running forever? I'd use a prefetch of either 0 or 1 to avoid piling up a bunch of messages in one worker's prefetch buffer while the others sit idle. (Prefetch=0 will pull the next message when the current one is finished, whereas prefetch=1 keeps a single message sitting "on deck", available to be processed without needing to fetch it from the network; the downside is that a consumer might be free to consume a message but can't, because the message is sitting in another consumer's prefetch buffer waiting for that consumer to be ready for it.) I'd use prefetch=0 as long as the time to download your messages from the broker isn't unreasonable, since it will spread the workload as evenly as possible.
Then, whenever there are messages to be processed, either a worker is available to process the next message (so no delay), or all the workers are busy (so of course you're going to have to wait because you're at capacity, but as soon as a worker becomes available it will take the next message from the queue).
Also, you're right that you want queues (where a message will be consumed by only a single worker) not topics (where a message will be consumed by each worker).
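If it helps, here is a minimal sketch of such a fixed, long-running consumer with prefetch=0, assuming ActiveMQ 5.x and the javax.jms API; the broker URL, queue name, and the downstream web-service call are placeholders:

```java
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class QueueWorker {
    public static void main(String[] args) throws JMSException {
        // prefetch=0: each consumer pulls the next message only when it is free
        ConnectionFactory factory = new ActiveMQConnectionFactory(
                "tcp://localhost:61616?jms.prefetchPolicy.queuePrefetch=0");
        Connection connection = factory.createConnection();
        connection.start();

        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("requests");           // placeholder queue name
        MessageConsumer consumer = session.createConsumer(queue);

        while (true) {
            Message message = consumer.receive();                 // blocks until a message arrives
            if (message instanceof TextMessage) {
                String body = ((TextMessage) message).getText();
                callDownstreamService(body);                      // placeholder for the web-service call
            }
        }
    }

    static void callDownstreamService(String payload) {
        // the worker mostly waits on this call, so many consumers can share one host
    }
}
```

You would simply run as many copies of this consumer as the maximum number of workers you decided on; with prefetch=0, every idle consumer competes fairly for the next message.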
The maximum amount of time the pollForActivityTask method stays open polling for requests is 60 seconds. I am currently scheduling a cron job every minute to call my activity worker file so that my activity worker machine is constantly polling for jobs.
Is this the correct way to have continuous queue coverage?
The way the Java Flow SDK does it is that you create an ActivityWorker, give it a tasklist, domain, activity implementations, and a few other settings. You set both setPollThreadCount and setTaskExecutorSize. The polling threads long-poll and then hand work over to the executor threads to avoid blocking further polling. You call start on the ActivityWorker to boot it up, and when you want to shut the workers down, you can call one of the shutdown methods (usually best to call shutdownAndAwaitTermination).
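A rough sketch of that setup (the method names follow the description above and may differ slightly between Flow SDK versions; the SWF client, domain, tasklist, thread counts, and MyActivitiesImpl are all placeholders):

```java
import com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflow;
import com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflowClientBuilder;
import com.amazonaws.services.simpleworkflow.flow.ActivityWorker;

public class WorkerMain {
    public static void main(String[] args) throws Exception {
        AmazonSimpleWorkflow swf = AmazonSimpleWorkflowClientBuilder.defaultClient();

        // Long-lived worker: poll threads long-poll SWF, executor threads run the activities.
        ActivityWorker worker = new ActivityWorker(swf, "my-domain", "my-tasklist"); // placeholders
        worker.addActivitiesImplementation(new MyActivitiesImpl());                  // hypothetical implementation
        worker.setPollThreadCount(2);
        worker.setTaskExecutorSize(10);
        worker.start();

        // On deployment/shutdown, stop polling and let in-flight activities finish.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                worker.shutdownAndAwaitTermination(60, java.util.concurrent.TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
    }
}
```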
Essentially your workers are long lived and need to deal with a few factors:
New versions of Activities
Various tasklists
Scaling independently on tasklist, activity implementations, workflow workers, host sizes, etc.
Handle error cases and deal with polling
Handle shutdowns (in case of deployments and new versions)
I ended up using a solution where another script file is called by a cron job every minute. This file checks whether an activity worker is already running in the background (if so, I assume a workflow execution is already being processed on the current server).
If no activity worker is there, then the previous long poll has completed and we launch the activity worker script again. If there is an activity worker already present, then the previous poll found a workflow execution and started processing so we refrain from launching another activity worker.
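As one illustration of that "is a worker already running?" check (the original script may well inspect the process list instead; this lock-file variant is just an assumption):

```java
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SingleInstanceGuard {
    // Returns true if no other activity worker holds the lock, i.e. it is safe to launch one.
    public static boolean tryAcquire(Path lockFile) throws Exception {
        FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = channel.tryLock();   // null if another process already holds the lock
        if (lock == null) {
            channel.close();
            return false;                    // a worker is already running; skip this cron tick
        }
        // Keep the channel and lock open for the lifetime of the worker process.
        return true;
    }
}
```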
A slightly weird requirement:
I have a few Quartz jobs acting as data collectors; they collect data from different locations as and when it becomes available. Then I have another job [data load] which is triggered by the collector jobs to update my DB.
My requirement is to somehow throttle the Loader job so that only two instances run in parallel while handling the work coming from the collector jobs.
Collector Jobs 1,2,...N > Loader Job (two instances)
The job programs are deployed in a clustered Tomcat.
Two Questions:
1) How can I make the collector jobs wait when two instances of the Loader job are already in progress? Is there any way to use Quartz to implement FIFO logic to throttle the work going to the Loader job? I also do not want a collector to pick up more data if some is already waiting to be processed.
2) Is there any way to run a job with only two threads, so that no more than two instances are active at a time? I have a limitation on my DB table that allows only two instances to run in parallel.
It's 8 years later and this question shows up as the top result on Google when searching for job throttling. And while the case from the question clearly begs for a queue, the actual answer was never given.
So... To throttle jobs in Quartz, one has to use a TriggerListener and implement the throttling in vetoJobExecution. The job itself can also be annotated with @DisallowConcurrentExecution to prevent concurrent executions.
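A minimal sketch of that listener, assuming Quartz 2.x; the Loader job name and the limit of two are illustrative:

```java
import org.quartz.JobExecutionContext;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.listeners.TriggerListenerSupport;

// Vetoes a Loader trigger when two Loader instances are already running.
public class LoaderThrottleListener extends TriggerListenerSupport {

    private static final int MAX_PARALLEL = 2;

    @Override
    public String getName() {
        return "loaderThrottleListener";
    }

    @Override
    public boolean vetoJobExecution(Trigger trigger, JobExecutionContext context) {
        try {
            long running = context.getScheduler().getCurrentlyExecutingJobs().stream()
                    .filter(ctx -> ctx.getJobDetail().getKey().getName().equals("loaderJob")) // illustrative name
                    .count();
            return running >= MAX_PARALLEL;   // true = veto this execution
        } catch (SchedulerException e) {
            return false;                     // if we can't tell, let it run
        }
    }
}

// Registration (e.g. at startup):
//   scheduler.getListenerManager().addTriggerListener(new LoaderThrottleListener());
//
// And on the job class itself:
//   @DisallowConcurrentExecution
//   public class LoaderJob implements Job { ... }
```

Note that getCurrentlyExecutingJobs() only sees jobs running on the local scheduler node, so in a clustered Tomcat a cluster-wide cap of two would still need a shared counter or lock (for example in the DB).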
It seems you have a producer-consumer situation here.
The producer and the consumer are usually separated by a queue. Have your collectors put items into a (persistent?) queue, and have your Loader read from the queue and dispatch to up to 2 handling threads.
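A minimal in-memory sketch of that shape using plain java.util.concurrent (a clustered or persistent setup would need a broker or DB-backed queue instead; names are illustrative):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class LoaderPipeline {
    // Collectors put work here; a bounded queue makes collectors wait when the Loader is behind.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);

    // Exactly two loader threads, matching the two-instance limit on the DB table.
    private final ExecutorService loaders = Executors.newFixedThreadPool(2);

    public void submitFromCollector(String item) throws InterruptedException {
        queue.put(item);   // blocks when the queue is full
    }

    public void start() {
        for (int i = 0; i < 2; i++) {
            loaders.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        String item = queue.take();   // blocks until work is available
                        loadIntoDb(item);             // placeholder for the actual load
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    private void loadIntoDb(String item) { /* placeholder */ }
}
```

The bounded queue is what makes the collectors wait: queue.put() blocks when the Loader side falls behind, which is the FIFO throttling asked about in question 1.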