How to force fail a dag after x amount of time? - airflow-scheduler

I want to force fail a dag after, say, 3 hours have passed.
I have a dag that is scheduled for 2am and one that is scheduled for 6am. I want the 2am dag to stop and give precedence to the one scheduled at 6am.
I have already tried using execution_timeout.
I have also tried dagrun_timeout, but the dag keeps running, as there are no other dag runs queued behind the first one.
NOTE: This is like a cross-dag dependency where I want to give preference to a dag during certain hours.

Programmatically stopping a dag the way I wanted to above is not possible. I solved my issue of giving priority to one dag by assigning a larger pool to it.
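A minimal sketch of that approach, combined with the dagrun_timeout the question mentions. The pool name etl_pool, its slot count, and the priority_weight values are assumptions for illustration, not anything from the original posts; tasks only compete on priority when they share the same pool:

```python
# Rough sketch only. Assumes a pool named "etl_pool" was created first,
# e.g. `airflow pools set etl_pool 4 "shared slots"` (Airflow 2 CLI).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="early_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 2 * * *",      # the 2am dag
    dagrun_timeout=timedelta(hours=3),  # ask the scheduler to fail runs older than 3h
    catchup=False,
) as early_dag:
    BashOperator(
        task_id="long_task",
        bash_command="sleep 600",
        pool="etl_pool",
        priority_weight=1,              # low priority in the shared pool
    )

with DAG(
    dag_id="later_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",      # the 6am dag
    catchup=False,
) as later_dag:
    BashOperator(
        task_id="important_task",
        bash_command="sleep 600",
        pool="etl_pool",
        priority_weight=10,             # gets free pool slots first
    )
```

When both dags have runnable tasks and the pool's slots are scarce, the scheduler hands slots to the higher-priority tasks first, which is one way to give the 6am dag precedence without stopping the 2am run outright.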

Related

AWS Glue Workflow triggering one job multiple times (incorrect behavior)

I have a big Glue workflow (about 100 jobs/crawlers), and it was executing properly until last week. Since then, my first conditional trigger (ALL) has been executing the same job 20 times.
I've configured the job itself to allow just 1 parallel execution, but every time the workflow runs, it tries to launch the same job 20 times.
I also configured the workflow to allow a max concurrency of 1, but that doesn't fix the problem.
Since I started working with Glue workflows, I've noticed that the tool itself is buggy and dated; maybe it's deprecated?
Any tips on how to fix this problem?
I too have faced similar problems. Even when one is fixed, another issue turns up later. So my suggestion is to try Step Functions instead.
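To make that suggestion concrete, here is a hedged sketch of driving a Glue job from Step Functions with boto3. The state machine name, job name, and role ARN are placeholders; the glue:startJobRun.sync service integration starts the job exactly once per execution and waits for it to finish:

```python
import json

import boto3

# Placeholder names throughout; the role must allow glue:StartJobRun etc.
definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # .sync = run the job and wait for it to complete
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "my-glue-job"},
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="glue-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsGlueRole",  # placeholder
)
```

Each execution of the state machine then maps to exactly one job run, so there is no workflow trigger fan-out to misbehave.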

Why does AWS Glue say "Max concurrent runs exceeded", when there are no jobs running?

I have an AWS Glue job, with max concurrent runs set to 1. The job is currently not running. But when I try to run it, I keep getting the error: "Max concurrent runs exceeded".
Deleting and re-creating the job does not help. Also, other jobs in the same account run fine, so it cannot be a problem with account wide service quotas.
Why am I getting this error?
I raised this issue with AWS support, and they confirmed that it is a known bug:
I would like to inform you that this is a known bug, where an internal distributed counter that keeps track of job concurrency goes into a stale state due to an edge case, causing this error. Our internal Service team has to manually reset the counter to fix this issue. Service team has already added the bug fix in their product roadmap and will be working on it. Unfortunately I may not be able to comment on the ETA on the deployment, as we don’t have any visibility on product teams road map and fix release timeline.
The suggested workarounds are:
Increase the max concurrency to 2 or higher
Re-create the job with a different name
The Glue container takes some time to start, and likewise takes some time to shut down after your job ends. If you try to start a new run in that window while the default concurrency is 1, you will get this error.
How to resolve:
Go to your Glue job --> under the Job details tab you will find "Maximum concurrency" (default value 1); change it to 3 or more, as needed.
I tried changing "Maximum concurrency" to 2 and then ran it.
It worked, but running it again caused the same issue. I looked in my S3 bucket and the data had been dumped, so it did run once.
I'm still looking for a stable solution, but this may work!
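If you hit this error intermittently (whether from the stale-counter bug or the container wind-down delay described above), a retry wrapper around start_job_run is one pragmatic stopgap. A sketch, with the job name and timings as placeholders:

```python
import time

import boto3
from botocore.exceptions import ClientError

glue = boto3.client("glue")


def start_with_retry(job_name, attempts=5, wait_seconds=60):
    """Retry start_job_run while Glue reports max concurrency exceeded."""
    for _ in range(attempts):
        try:
            return glue.start_job_run(JobName=job_name)["JobRunId"]
        except ClientError as err:
            if err.response["Error"]["Code"] != "ConcurrentRunsExceededException":
                raise  # some other failure; don't mask it
            # give the previous container time to wind down, then retry
            time.sleep(wait_seconds)
    raise RuntimeError(f"{job_name} still at max concurrency after {attempts} attempts")


run_id = start_with_retry("my-glue-job")  # placeholder job name
```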

Amazon EMR: Only start new scheduled job if previous job has finished

I have an AWS EMR cluster job which runs every 2 hours. I have set up a CloudWatch schedule to trigger it every two hours.
But sometimes the next job (which runs 2 hours after the previous one) starts while the previous one is not yet finished, as it can take more than 2 hours to complete depending on the data to be processed.
I need some configuration to prevent the next job from starting if the previous job is still running.
I've looked but couldn't find any such setting. Does anyone know how to do this?
Add them as EMR steps. EMR steps run sequentially by default (unless you change the step concurrency setting).
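A sketch of what that looks like with boto3; the cluster id, step name, and spark-submit arguments are placeholders. Instead of having CloudWatch launch the work directly, submit it as a step: with the cluster's default StepConcurrencyLevel of 1, a newly added step waits in PENDING until the previous one finishes.

```python
import boto3

emr = boto3.client("emr")

# Queue the recurring batch as a step on an already-running cluster.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[
        {
            "Name": "two-hourly-batch",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/job.py"],  # placeholders
            },
        }
    ],
)
```

The scheduled CloudWatch target can invoke a small Lambda that runs this call, so overlapping schedules simply queue steps rather than launching concurrent jobs.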

Is there a metric for time taken by AWS-DMS for full load?

I'm using AWS DMS to migrate existing data only (full load) from a Postgres db as source to AWS S3 as target. I have created a migration task for this, and the migration itself works.
However, I wanted to know how much time it took for a task to complete. I couldn't find a time completion metric in either the metrics corresponding to the task or the metrics corresponding to the replication-instance.
How do I find out the time taken for the full load?
Using the AWS CLI, you can try the describe-replication-tasks command.
It will provide you with both the start and stop times, as well as the elapsed time.
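For reference, the same information is reachable from boto3. A sketch, with the task ARN as a placeholder; the ReplicationTaskStats structure is where the full-load start/finish timestamps and elapsed milliseconds are reported:

```python
import boto3

dms = boto3.client("dms")

# Look up one task by its ARN (placeholder value below).
task = dms.describe_replication_tasks(
    Filters=[
        {
            "Name": "replication-task-arn",
            "Values": ["arn:aws:dms:us-east-1:123456789012:task:EXAMPLE"],
        }
    ]
)["ReplicationTasks"][0]

stats = task["ReplicationTaskStats"]
print("Full load started: ", stats["FullLoadStartDate"])
print("Full load finished:", stats["FullLoadFinishDate"])
print("Elapsed (ms):      ", stats["ElapsedTimeMillis"])
```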

airflow - how to 'Filling up the DagBag' once only

My dag takes about 50 seconds to parse. I only use external triggers to start dag runs, no schedules. I notice Airflow fills the DagBag a lot: on every trigger_dag command, and in the background it keeps scanning the dags folder, creating .pyc files seemingly instantly once a new .py is deployed.
Is there any way I can deploy my cluster and have the DagBag filled only once, then for the next 2 weeks have dag runs start instantly on any trigger_dag? (Right now it takes 50 seconds just to fill the DagBag before starting.) I have no need to update dag definitions within those 2 weeks.
50 seconds is a huge amount of time for DAG instantiation. It looks like you have a large (or simply slow) piece of code in your DAG file. That is very bad practice:
Note: This means all top level code (ie. anything that isn't defining the DAG) in a DAG file will get run each scheduler heartbeat. Try to avoid top level code to your DAG file unless absolutely necessary.
Airflow works exactly as you described. That is why you should treat the Python files in your DAG folder mostly as configuration files (with some programmatic capabilities). You can't change this with any magic config key or the like. This behaviour is at the core of Airflow.
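A minimal before/after sketch of that advice. heavy_query here is a hypothetical stand-in for whatever currently takes ~50 seconds at parse time; the fix is to move it from module scope into the task callable, so it runs when the task executes rather than on every parse:

```python
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def heavy_query():
    # Hypothetical stand-in for the slow work in the real DAG file.
    time.sleep(50)
    return []


# BAD: top-level call, executed on every parse (scheduler loop, trigger_dag, ...):
# rows = heavy_query()


def run_heavy_query():
    # GOOD: executed only when the task instance actually runs.
    rows = heavy_query()
    print(f"fetched {len(rows)} rows")


with DAG(
    dag_id="externally_triggered",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # externally triggered only, as in the question
    catchup=False,
) as dag:
    PythonOperator(task_id="heavy", python_callable=run_heavy_query)
```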