In the Airflow setup I'm using, the pipelines sometimes wait a long time to be scheduled. There have also been instances where a job ran for too long (presumably taking up resources needed by other jobs).
I'm trying to work out how to programmatically identify the health of the scheduler, and potentially monitor it in the future, without any additional frameworks. I started to look at the metadata database tables. All I can think of so far is to look at start_date and end_date from dag_run, and the duration of the tasks. What other metrics should I be looking at? Many thanks for your help.
There is no need to go "deep" inside the database.
Airflow provides metrics that you can use for this very purpose: https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
If you scroll down, you will see all the useful metrics and some of them are precisely what you are looking for (especially Timers).
This can be done with the usual metrics integration. Airflow publishes its metrics via StatsD, and the official Airflow Helm Chart (https://airflow.apache.org/docs/helm-chart/stable/index.html) even exposes those metrics to Prometheus via the statsd exporter.
Regarding the Spark job - yeah - the current Spark submit hook/operator works in "active poll" mode: the Airflow worker process polls the status of the job. But Airflow can run multiple worker jobs in parallel. Also, if you want, you can implement your own task that behaves differently.
In "classic" Airflow you'd need to implement a submit operator (to submit the job) and a "poke/reschedule" sensor (to wait for the job to complete), and build your DAG so that the sensor task is triggered after the operator. The "reschedule" mode works in such a way that the sensor only takes a worker slot for the duration of the poke and then frees the slot for some time (until it checks again). A minimal sketch of this pattern is shown below.
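For illustration, here is a minimal sketch of that submit-then-poke pattern, assuming Airflow 2.x. The submit helper, the status-check helper and the DAG/task names are hypothetical placeholders, not a real provider API:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor


def poll_cluster_for_status(app_id: str) -> str:
    """Hypothetical placeholder - query YARN/Livy/Spark-on-K8s for the job state here."""
    return "FINISHED"


def submit_spark_job(**context):
    # Submit the job (e.g. spark-submit --deploy-mode cluster ...) and return quickly.
    # The returned application id goes to XCom so the sensor can find it later.
    return "application_1234_0001"  # hypothetical application id


def spark_job_is_done(**context) -> bool:
    app_id = context["ti"].xcom_pull(task_ids="submit_job")
    return poll_cluster_for_status(app_id) == "FINISHED"


with DAG(
    dag_id="spark_submit_then_poke",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit = PythonOperator(task_id="submit_job", python_callable=submit_spark_job)

    wait = PythonSensor(
        task_id="wait_for_job",
        python_callable=spark_job_is_done,
        mode="reschedule",   # give up the worker slot between checks
        poke_interval=300,   # check every 5 minutes
        timeout=6 * 60 * 60,
    )

    submit >> wait
```

The key part is mode="reschedule": between pokes the sensor releases its worker slot, which is exactly the behaviour described above.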
As of Airflow 2.2 you can also write a deferrable operator (https://airflow.apache.org/docs/apache-airflow/stable/concepts/deferring.html?highlight=deferrable), where you write a single operator that does the submission first and then defers the status check - all in one operator. Deferrable operators efficiently handle (using asyncio) potentially thousands of waiting/deferred operators without taking up worker slots or excessive resources. A rough sketch is shown below.
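And a rough sketch of the deferrable version, assuming Airflow 2.2+; SparkJobTrigger, SubmitAndWaitForSparkJob and their internals are hypothetical names used only to illustrate the defer/trigger mechanics:

```python
import asyncio
from typing import Any, AsyncIterator, Dict, Tuple

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.base import BaseTrigger, TriggerEvent


class SparkJobTrigger(BaseTrigger):
    """Runs in the triggerer process and waits asynchronously for the job to finish."""

    def __init__(self, app_id: str, poll_interval: float = 30.0):
        super().__init__()
        self.app_id = app_id
        self.poll_interval = poll_interval

    def serialize(self) -> Tuple[str, Dict[str, Any]]:
        # How the triggerer re-creates this trigger: import path + constructor kwargs.
        return (
            "my_package.triggers.SparkJobTrigger",  # hypothetical import path
            {"app_id": self.app_id, "poll_interval": self.poll_interval},
        )

    async def run(self) -> AsyncIterator[TriggerEvent]:
        while True:
            if await self._job_finished():
                yield TriggerEvent({"app_id": self.app_id, "status": "done"})
                return
            await asyncio.sleep(self.poll_interval)

    async def _job_finished(self) -> bool:
        # Hypothetical: call your cluster manager's async API here.
        return True


class SubmitAndWaitForSparkJob(BaseOperator):
    """Submits the job, then defers until the trigger reports completion."""

    def execute(self, context):
        app_id = "application_1234_0001"  # hypothetical: submit the real job here
        self.defer(
            trigger=SparkJobTrigger(app_id=app_id),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Resumes in a worker slot only once the trigger has fired.
        self.log.info("Spark job finished: %s", event)
```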
Update: If you really cannot use StatsD (Helm is not needed, StatsD is enough), you should still never query the DB directly to get information about the DAGs. Use the stable Airflow REST API instead: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
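For example, a quick sketch of checking scheduler health and recent DAG runs through the stable REST API; the base URL, credentials and dag_id are placeholders, and this assumes basic auth is enabled:

```python
import requests

AIRFLOW_URL = "http://localhost:8080/api/v1"  # placeholder
AUTH = ("admin", "admin")                     # placeholder credentials

# Scheduler health, including the latest scheduler heartbeat.
health = requests.get(f"{AIRFLOW_URL}/health", auth=AUTH, timeout=10).json()
print(health["scheduler"]["status"], health["scheduler"]["latest_scheduler_heartbeat"])

# Recent runs of one DAG: state, start_date and end_date without touching the DB.
resp = requests.get(
    f"{AIRFLOW_URL}/dags/my_dag/dagRuns",  # my_dag is a placeholder
    params={"order_by": "-execution_date", "limit": 20},
    auth=AUTH,
    timeout=10,
)
resp.raise_for_status()
for run in resp.json()["dag_runs"]:
    print(run["dag_run_id"], run["state"], run["start_date"], run["end_date"])
```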
Preamble: I have a web app whose backend is based on a serverless architecture. It's basically an Amplify app hosted on AWS with a DynamoDB database. I've learnt it's possible to create a task scheduling system of sorts (more here). A quick summary of the article is: "It's possible to create a task scheduling table by taking advantage of TTL and DynamoDB Streams to execute a Lambda function at specific times. The TTL specifies a set time for a record to be deleted; we can capture this delete event in a DynamoDB stream and run some tasks based on information from the stream."
Problem:
The goal is to send a series of emails to users who sign up for our service. Each user that signs up gets a series of "Getting Started" emails. The first of the emails is sent 24 hours after a user signs up, the second 3 days later and the third exactly 7 days after sign up.
I see how a cron job would be suitable here, but it just seems a bit inefficient to me. I would basically have to search the users table for users whose sign-up time falls within a specific 24-hour window and send them the email, whereas with a task scheduler table I could add a task to the table (something like "send first email to user300" with a TTL of when I want it to be sent) and listen for delete events to run the task. No need to run a cron job daily, just a function that handles each task as it comes. A rough sketch of that idea is below.
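To make the comparison concrete, here is a rough sketch of the TTL approach, assuming a table called scheduled_tasks with TTL enabled on a run_at attribute and its stream (configured with OLD_IMAGE) wired to the Lambda handler below; all names are placeholders:

```python
import time

import boto3

table = boto3.resource("dynamodb").Table("scheduled_tasks")


def schedule_welcome_emails(user_id: str) -> None:
    """Insert one item per email; DynamoDB deletes each item once its TTL passes."""
    now = int(time.time())
    for email_name, delay_days in [("welcome", 1), ("tips", 3), ("checkin", 7)]:
        table.put_item(
            Item={
                "pk": f"{user_id}#{email_name}",
                "user_id": user_id,
                "email": email_name,
                # TTL attribute, epoch seconds. Note: TTL deletions are approximate
                # and can lag well behind run_at.
                "run_at": now + delay_days * 24 * 3600,
            }
        )


def stream_handler(event, context):
    """Lambda attached to the table's stream: react only to deletions."""
    for record in event["Records"]:
        if record["eventName"] != "REMOVE":
            continue
        old = record["dynamodb"]["OldImage"]
        user_id = old["user_id"]["S"]
        email = old["email"]["S"]
        # send_email(user_id, email)  # e.g. via SES - left out of this sketch
        print(f"sending '{email}' email to {user_id}")
```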
I think this is more of a performance vs. storage problem. Having a task scheduler table would take up space: if we add all the emails to be sent to a user as tasks on the table (each email to be sent to a specific user is its own task) each time a user signs up, then I see the task scheduler table growing by 3n records for every n users signed up. But this may not really be a problem, as tasks are deleted after they are run. I do not know the performance cost of using a cron job for this particular task, hence I'm here. I also may be wrong, and the cost of running and updating this task scheduler table may be more than that of the cron job.
I initially thought of setting up a dummy user table, running both the cron job and the task scheduler, and documenting the cost of each, but you can imagine how much time and effort that would take.
So I guess my question is which is a more efficient solution in terms of performance and cost?
There is no perfect solution here. Keep in mind that DynamoDB TTL can take up to 48h to fire, so it's probably unacceptable. Cron jobs with Lambda are cheap and easy to set up. You could also use SQS and populate it with a daily cron. Yan Cui wrote a great article about this problem: https://theburningmonk.com/2019/03/dynamodb-ttl-as-an-ad-hoc-scheduling-mechanism/
This may not exactly be an answer. Based on the Medium article you linked, the author had a plausible reason why TTL and DynamoDB Streams would be better than a cron job, which you reiterated. Setting up a cron job is easier and cheaper (free), and I doubt the performance will be that much worse unless the database is huge. I don't have any experience doing something like this, so I wouldn't know how large the database would have to be for it to make sense to switch over. Alternatively, you can have as many cron jobs as you want, so I don't see why you couldn't just set up a user-specific cron job whenever someone signs up.
You can set up a CloudWatch Events rule to fire a Lambda function on a regular schedule. The Lambda function can search a database for an applicable result set and perform other actions - send an email, a text message, etc.
Here is an AWS tutorial that covers a very similar use case with step-by-step instructions. The tutorial is implemented using the AWS Java API (but you can implement it using other supported programming languages).
https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/javav2/usecases/creating_scheduled_events
From a Cost perspective - Lambda allows 1M free requests per month. Details are here - https://aws.amazon.com/lambda/pricing/
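As a rough Python sketch of that pattern (the tutorial above uses Java): a Lambda fired by a daily CloudWatch Events schedule that picks out users who signed up roughly a day earlier and emails them. The table, attribute names and addresses are placeholders, and signed_up_at is assumed to be stored as an ISO-8601 string:

```python
from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Attr

users = boto3.resource("dynamodb").Table("users")  # placeholder table name
ses = boto3.client("ses")


def handler(event, context):
    # Users who signed up between 24h and 48h ago get the "getting started" email today.
    end = datetime.now(timezone.utc) - timedelta(days=1)
    start = end - timedelta(days=1)

    resp = users.scan(  # a GSI + query would scale better than a scan
        FilterExpression=Attr("signed_up_at").between(start.isoformat(), end.isoformat())
    )

    for user in resp["Items"]:
        ses.send_email(
            Source="hello@example.com",  # placeholder sender
            Destination={"ToAddresses": [user["email"]]},
            Message={
                "Subject": {"Data": "Getting started"},
                "Body": {"Text": {"Data": "Welcome! Here is how to get started..."}},
            },
        )
```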
I want to implement scheduled jobs/actions in my Django backend.
The actions are basically deducting a monthly recurring payment from the customer, sending a payment link, say, 10 days in advance, and so on. The dates will be based on when the user buys the subscription.
I have never implemented scheduled jobs before. I know there are some options like crontab and Celery.
I want to know what the best strategy/tool for scheduled payments would be.
So basically, what I think I will do is run a scheduled job every day at a particular time, check the eligible candidates, and run the payment module.
Is this a correct strategy, to run jobs every day? Are there better methods available? Is there a way for jobs to run automatically when, say, a customer's new billing cycle arrives?
Yes, the strategy you are following is correct. You can use Celery, Redis and crontab to implement the payment system.
Firstly, you can specify the schedules using crontab. Also, the .delay() function will help you trigger jobs whenever a customer's new billing cycle arrives.
So the flow will be: tasks get triggered when a new billing cycle arrives, using .delay().
Then the Celery worker will register the task and the schedules. You can then use Celery beat to run the tasks periodically, and Redis as the message broker. A minimal sketch is shown after the links below.
Read about .delay() here
Read about celery configuration setting here
Read about setting a task scheduler using celery here
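A minimal sketch of that setup, assuming a Django project called myproject with Redis as the broker; the app, task and schedule names are placeholders:

```python
# myproject/celery.py
import os

from celery import Celery
from celery.schedules import crontab

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

app = Celery("myproject", broker="redis://localhost:6379/0")
app.autodiscover_tasks()

# Celery beat pushes this task to the queue every day at 02:00; a worker runs it.
app.conf.beat_schedule = {
    "charge-due-subscriptions": {
        "task": "billing.tasks.charge_due_subscriptions",
        "schedule": crontab(hour=2, minute=0),
    },
}


# billing/tasks.py
from celery import shared_task


@shared_task
def charge_due_subscriptions():
    # Find subscriptions whose billing date is today and charge them.
    ...


@shared_task
def send_payment_link(subscription_id):
    # Email the customer a payment link for the upcoming renewal.
    ...


# Anywhere in your Django views/signals: trigger a task immediately when a new
# billing cycle starts, instead of waiting for the daily sweep:
# send_payment_link.delay(subscription_id=42)
```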
I have set up multiple jobs in Informatica Cloud to sync data from Oracle with Informatica objects. The job is scheduled to run every 3 minutes as per the business requirements. Sometimes the job runs long due to a Secure Agent resource crunch, and my team used to receive multiple emails like the one below:
The Mapping task failed to run. Another instance of the task is currently running.
Is there any way to suppress these failure emails in the mapping?
This won't be set at the mapping level but at the session or integration service level; see the following: https://network.informatica.com/thread/7312
This type of error comes when the workflow/session is already running and you try to re-run it. Use a script to check whether it is already running and, if so, wait. If you want to run multiple instances of the same workflow:
In the Workflow Properties, enable 'Configure Concurrent Execution' by checking the check box.
Once it's enabled, you have 2 options:
Allow Concurrent Run with same instance name
Allow Concurrent run only with unique instance name
Notifications configured at the task level override those at the org level, so you could do this by configuring notifications at the task level and only sending warnings to the broader list. That said, some people should still receive the error-level notification, because if it recurs multiple times within a short period of time there may be another issue.
Another thought is that batch processes that run every three minutes but take longer than three minutes are usually an opportunity to improve the design. Often a business requirement for short batch intervals comes from a "near real time" desire. If you also have the Cloud Application Integration service, you may want to set up an event to trigger the batch run. If there is still overlap based on events, you can use the Cloud Data Integration API to create a dynamic version of the task each time. For really simple integrations you could perform the integration in CAI, which does allow multiple instances to run at the same time.
HTH
I have written a WebJob which does multiple tasks that run on different schedules - once a day, once every hour, and so on - and I achieved this by using a Timer delegate. Now I am thinking of changing that approach and creating a Scheduler job for each scenario. I was able to find some information about schedules by googling but was never able to join the pieces into a flow.
I learned that we can create a job collection and that each collection can have 'n' jobs based on the pricing tier we are using. After creating a job, how can we bind the program logic that the job must run to the corresponding job?
Also, how can I link jobs to a job collection?
Thanks
A typical workflow is that you would write a message to an Azure message queue, and then you would have an Azure Cloud Service that reads from that queue and does the processing.
To tie specific jobs to specific program logic, you can either embed information about the job type in the message and have something generic pick the messages up and turn them into specific operations/classes, or you could have behavior-specific queues, where each job writes to its own queue and each queue is read by a different Cloud Service. A rough sketch of the first option is shown below.
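A rough sketch of the first option (type information embedded in the message), assuming an Azure Storage Queue and the azure-storage-queue SDK; the queue name, connection string and job types are placeholders:

```python
import json

from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(
    conn_str="<connection-string>",  # placeholder
    queue_name="jobs",               # placeholder
)


def enqueue(job_type: str, payload: dict) -> None:
    # The message carries its own type, so one reader can dispatch to any handler.
    queue.send_message(json.dumps({"type": job_type, "payload": payload}))


HANDLERS = {
    "daily_report": lambda p: print("building daily report", p),
    "hourly_sync": lambda p: print("running hourly sync", p),
}


def process_once() -> None:
    for msg in queue.receive_messages():
        body = json.loads(msg.content)
        HANDLERS[body["type"]](body["payload"])
        queue.delete_message(msg)
```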
I think this will solve my problem, either using API calls or queue processing.
Solution
If I understand your question, you have a WebJob that has multiple methods, each of which needs to be called on a different schedule. Instead of going through the hassle of setting up a Scheduler and having yet another resource that you have to manage, mark each method you need called with a TimerTriggerAttribute.
We're trying to move to AWS and to use DynamoDB. It'd be nice to keep everything in DynamoDB so there aren't extraneous types of databases, but aside from half-complete research projects I'm not really finding anything to use as a scheduler. There are going to be dynamically set schedules in the range of thousands or more, possibly with many running at the same time. For languages, Java, or at least the JVM, would be awesome.
Does anyone know a good Scheduler for DynamoDB or other AWS technology?
---Addendum
When I say scheduler, I'm thinking of something general-purpose like Quartz: I want to set a cron expression and have it run at that time with the code I give it. This isn't about performing some AWS task; this is a task internal to our product. SWF's cron runs inside the VM, so I'm worried about what happens when the VM is down. Data Pipeline seems a bit too much. I've been looking into making a DynamoDB job store for Quartz; consistent reads might get around the transaction and consistency issues, but I'm hesitant - I might be biting off a lot, with a lot of hard-to-notice problems.
Have you looked at AWS Simple Workflow? You would use the AWS Flow Framework to program against the service, and they have a well documented Java API with lots of samples. They support continuous workflows with timers which you can use to run periodic code (see code example here). I'm using SWF and the Flow Framework for Ruby to run async code that gets kicked off from my main app, and it's been working great.
Another new option for you is to look at AWS Lambda. You can attach your Lambda function code directly to a DynamoDB table update event, and Lambda will spin up and shut down the compute resources for you, without you having to manage a server to run your code. Also, recently, AWS launched the ability to call the Lambda function directly -- e.g. you could have an external timer or other code that triggers the function on a specific schedule.
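For the "call the Lambda function directly" part, a small boto3 sketch; the function name and payload are placeholders:

```python
import json

import boto3

lambda_client = boto3.client("lambda")

response = lambda_client.invoke(
    FunctionName="send-welcome-email",  # placeholder function name
    InvocationType="Event",             # async, fire-and-forget
    Payload=json.dumps({"user_id": "user300"}).encode("utf-8"),
)
print(response["StatusCode"])  # 202 for async invocations
```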
Lastly, this SO thread may have other options for you to consider.
Another option is to use AWS Lambda Scheduled Functions (newly announced on October 8th 2015 at AWS re:Invent).
Here is a relevant snippet from the blog (source):
Scheduled Functions (Cron)
You can now invoke a Lambda function on a regular, scheduled basis. You can specify a fixed rate (number of minutes, hours, or days between invocations) or you can specify a Cron-like expression:
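As a hedged sketch (not part of the quoted post), wiring such a schedule up with the CloudWatch Events API via boto3 might look like this; the rule name, schedule expression and function ARN are placeholders:

```python
import boto3

events = boto3.client("events")

events.put_rule(
    Name="nightly-cleanup",
    ScheduleExpression="cron(0 3 * * ? *)",  # every day at 03:00 UTC; "rate(5 minutes)" also works
    State="ENABLED",
)

events.put_targets(
    Rule="nightly-cleanup",
    Targets=[{
        "Id": "nightly-cleanup-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:nightly-cleanup",  # placeholder ARN
    }],
)
# The Lambda function also needs a resource-based permission (lambda add-permission)
# allowing events.amazonaws.com to invoke it - omitted from this sketch.
```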