Quartz Job throttling

Quartz Job throttling - concurrency

Little bit wiered requirement:
I have a few quartz jobs that are acting as data collectors, collects data from different locations as and when available. Then I have another job [data load] which is being called/triggered from the collector jobs to update my DB.
My requirement is to some how throttle Load Job to have only two instances running in parallel and handle the work coming from the collector jobs
Collector Jobs 1,2,...N > Loader Job (two instances)
Job programs are deployed in clusted Tomcat.
Two Questions:
1) How can I make the Collector jobs to wait, when two instances of the Loader job already in process? Is there any way to use the quartz program to impelement FIFO logic to throttle the work to Loader job? I also do not want the collector to pick up another data, if one is already waiting to be processed.
2) Is there any way to run a job with two threads only? No more than two instances should be active at a time? I have limitation on my DB table to run only two instances in parallel.

It's 8 years later and the question shows up as the top result in google when searching for job throttling. And while the case from question clearly begs for using a queue, the actual answer was never given.
So... To throttle jobs in quartz one has to use TriggerListener and implement the throttling in vetoJobExecution. The job itself can also be annotated to prevent concurrent executions with #DisallowConcurrentExecution.

It seems you have a producer-consumer situation here.
The producer and the consumer are usually separated by a queue. Have your collectors put items into a (persistent?) queue and have your Loader reading from the queue and dispatching up to 2 handling threads.

Related

How to limit number of concurrent workflows running?

The title is pretty much the question. Is there some way to limit the number of concurrent workflows running at any given time?
Some background:
I'm using eventarc to dispatch a workflow once a message has been sent to a pubsub topic. The workflow will be used to start some long-running operation (LRO) but for reasons I won't go into, I don't want more than 3 instances of this workflow running at a given time.
Is there some way to do this? - primarily from some type of configuration rather than using another compute resource.

There is no configuration to limit running processes that specifically targets sessions that are executed by a Workflow enabled for concurrent execution.
The existing process limit applies to all sessions without differentiating between those from non-concurrent or concurrent enabled Workflows.
Synchronization enables users to limit the parallel execution of certain workflows or templates within a workflow without having to restrict others.
Users can create multiple synchronization configurations in the ConfigMap that can be referred to from a workflow or template within a workflow. Alternatively, users can configure a mutex to prevent concurrent execution of templates or workflows using the same mutex.
Refer to this link for more information.

Summarizing your requirements:
Trigger workflow executions with Pub/Sub messages
Execute at most 3 workflow executions concurrently
Queue up waiting Pub/Sub messages
(Unspecified) Do you need messages processed in the order delivered?
There is no out-of-the box capability to achieve this. For fun, below is a solution that doesn't need secondary compute (and therefore is still fully managed).
The key to making this work is likely starting new executions for every message, but waiting in that execution if needed. Workflows does not provide a global concurrency construct, so you'll need to use some external storage, such as Firestore. An algorithm like this could work:
Create a callback
Push the callback into a FIFO queue
Atomically increment a counter (which returns the new value)
If the returned value is <= 3, pop the last callback and call it
Wait on the callback
-- MAIN WORKFLOW HERE --
Atomically decrement the counter
If the returned value is < 3, pop the last callback and call it
To keep things cleaner, you could put the above steps in a the triggered workflow and the main logic in a separate workflow that is called as needed.

Creating a scalable and fault tolerant system using AWS ECS

We're designing C# scheduled task (runs every few hours) that will run on AWS ECS instances that will grab batched transaction data for thousands of customers from an endpoint, modify the data then send it on to another web service. We will be maintaining the state of the last successful batch in a separate database (using some like created date of the transactions). We need the system to be scalable so as more customers are added we add additional ECS containers to process the data.
There are the options we're considering:
Each container only processes a specific subset of the data. As more customers are added more contains are added. We would need to maintain a logical separation of what contains are processing what customers data.
All the containers process all of the customers. We use some kind of locking flags on the database to let other processes know that the customers data is being processed.
Some other approach.
I think that option 2 is probably the best, but it adds a lot of complexity regarding the locking and unlocking of customers. Are there specific design patterns I could be pointed towards if that if the correct solution?

In both scenarios an important thing to consider is retries in case processing for a specific customer fails. One potential way to distribute jobs across a vast number of container with retries would be to use AWS SQS.
A single container would run periodically every few hours and be the job generator. It would create one SQS queued item for each customer that needs to be processed. In response to items appearing in the queue a number of "worker" containers would be spun up by ECS to consume items from the queue. This can be made to autoscale relative to the number of items in the queue to quickly spin up many containers that can work in parallel.
Each container would use its own high performance concurrent poller similar to this (https://www.npmjs.com/package/squiss) to start grabbing items from the queue and processing them. If a worker failed or crashed due to a bug then SQS will automatically redeliver and dropped queued items that worker had been working on to a different worker after they time out.
This approach would give you a great deal of flexibility, and would let you horizontally scale out the number of workers, while letting any of the workers process any jobs from the queue that it grabs. It would also ensure that every queued item gets processed at least once, and that none get dropped forever in case something crashes or goes wrong.

Distributed Priority Queue, once and only once

TL;DR
I have producers, tasks and consumers. I need a scalable queuing system which can ensure that a task can be consumed once and only once, and which can sort the tasks according to their priority.
The context:
We have a prototype working, but it's not "scale ready", and today we need to scale...
Below is the prototype "process":
1°) Some customers upload dataset in the database (PostgreSQL)
2°) Each second, an application fetches for new dataset in the database and converts them into tasks.
One customer's dataset can generate thousand of tasks (~500K tasks/day, ~30K tasks/customer)
3°) An application "Dispatcher"
fetches sorted tasks from the database (tasks with the smallest dataset will be proceeded first even if they have been submitted later + some random value to shuffle)
performs some validations (check if the task has been canceled or not)
dispatch the task to the according worker.
Each worker can process only one kind of task, but it can process thousands of them concurrently.
4°) The workers receive the task, and push the result to the database
5°) A "Monitor" application checks the state of all tasks, and retries any task that needs to (worker crashed).
Today, the bottleneck is the SQL server, I can tune it but I would prefer to redesign it the right way. So, I was wondering if there are some best practices for that kind of process?
It seems I need a distributed queuing system (Kafka?), which can guarantee that a task will be processed once and only once, but which will also manage priority.

AWS SWF Simple Workflow - Best Way to Keep Activity Worker Scripts Running?

The maximum amount of time the pollForActivityTask method stays open polling for requests is 60 seconds. I am currently scheduling a cron job every minute to call my activity worker file so that my activity worker machine is constantly polling for jobs.
Is this the correct way to have continuous queue coverage?

The way that the Java Flow SDK does it and the way that you create an ActivityWorker, give it a tasklist, domain, activity implementations, and a few other settings. You set both the setPollThreadCount and setTaskExecutorSize. The polling threads long poll and then hand over work to the executor threads to avoid blocking further polling. You call start on the ActivityWorker to boot it up and when wanting to shutdown the workers, you can call one of the shutdown methods (usually best to call shutdownAndAwaitTermination).
Essentially your workers are long lived and need to deal with a few factors:
New versions of Activities
Various tasklists
Scaling independently on tasklist, activity implementations, workflow workers, host sizes, etc.
Handle error cases and deal with polling
Handle shutdowns (in case of deployments and new versions)

I ended using a solution where I had another script file that is called by a cron job every minute. This file checks whether an activity worker is already running in the background (if so, I assume a workflow execution is already being processed on the current server).
If no activity worker is there, then the previous long poll has completed and we launch the activity worker script again. If there is an activity worker already present, then the previous poll found a workflow execution and started processing so we refrain from launching another activity worker.

SQS/task-queue job retry count strategy?

I'm implementing a task queue with Amazon SQS ( but i guess the question applies to any task-queue ) , where the workers are expected to take different action depending on how many times the job has been re-tried already ( move it to a different queue, increase visibility timeout, send an alert..etc )
What would be the best way to keep track of failed job count? I'd like to avoid having to keep a centralized db for job:retry-count records. Should i look at time spent in the queue instead in a monitoring process? IMO that would be ugly or un-clean at best, iterating over jobs until i find ancient ones..
thanks!
Andras

There is another simpler way. With your message you can request ApproximateReceiveCount information and base your retry logic on that. This way you won't have to keep it in the database and can calculate it from the message itself.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ReceiveMessage.html

I've had good success combining SQS with SimpleDB. It is "centralized", but only as much as SQS is.
Every job gets a record in simpleDB and a task in SQS. You can put any information you like in SimpleDB like the job creation time. When a worker pulls a job from the queue it can grab the corresponding record from simpleDB to determine it's history. You can see how old the job is, and you can see how many times it has been attempted. Once you're done, you can add worker data to the SimpleDB record (completion time, outcome, logs, errors, stack-trace, whatever) and acknowledge the message from SQS.
I prefer this method because it helps diagnose faults by providing lots of debug info for failed tasks. It also allows workers to handle the job differently depending on how long the job has been queued, how many failures it's had, etc.
It also gives you the ability to query SimpleDB directly and calculate things like average time per task, percent failure rate, etc.

Amazon just released Simple workflow serice (swf) which you can think of as a more sophisticated/flexible version of GAE Task queues.
It will let you monitor your tasks (with hearbeats), configure retry strategies and create complicated workflows. It looks pretty promising abstracting out task dependencies, scheduling and fault tolerance for tasks (esp. asynchronous ones)
Checkout http://docs.amazonwebservices.com/amazonswf/latest/developerguide/swf-dg-intro-to-swf.html for overview.

SQS stands for "Simple Queue Service" which, in concept is the incorrect name for that service. The first and foremost feature of a "Queue" is FIFO (First in, First out), and SQS lacks that. Just wanting to clarify.
Also, Azure Queue Services lacks that as well. For the best cloud Queue service, use Azure's Service Bus since it's a TRUE Queue concept.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js