Airflow backfill not completely filling up to the present - python-2.7

I'm struggling with a weird problem I can't figure out. I have a basic DAG that does nothing fancy; it just uses the BashOperator to start a Python script.
I have this DAG scheduled to run every Monday. When I switch the DAG on in the webserver, it starts backfilling up to the 25th of September, but it does not backfill for the 2nd of October. When I change the schedule from weekly to daily, it works fine.
This is my DAG setup.
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'owner': 'xxxxx',
    'depends_on_past': False,
    'email': ['xxxxxx'],
    'start_date': datetime(2017, 9, 1, 0, 0),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=10),
}

# Create DAG
dag = DAG(dag_id='IFS_weekly_forecast',
          schedule_interval='0 8 * * MON',
          default_args=default_args)
As you can see from this picture, the backfilling is just fine up to the 25th. After that there are no new tasks queued.
What am I doing wrong here? I have other DAGs that have been running fine for weeks. I also restarted the scheduler and the webserver, but that did not help.
Edit:
The topic below seems to cover the same problem, but that changes my question: how can I let Airflow run weekly on a given date, instead of waiting for that entire period to finish?
Airflow does not backfill latest run
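For context, Airflow only triggers a scheduled DAG run once its schedule interval has fully elapsed, so the run stamped with execution_date 2017-10-02 is only queued on 2017-10-09 at 08:00. A minimal illustration of that relationship, assuming the DAG definition above (dates are for illustration only):

from datetime import datetime, timedelta

# For a weekly '0 8 * * MON' schedule, the run with execution_date 2017-10-02 08:00
# covers the interval 2017-10-02 -> 2017-10-09 and only fires after it ends.
execution_date = datetime(2017, 10, 2, 8, 0)
schedule_interval = timedelta(weeks=1)
trigger_time = execution_date + schedule_interval
print(trigger_time)  # 2017-10-09 08:00:00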

Related

Superset chart email schedules and Celery beat schedule: why is the report sent on the Celery beat schedule instead of the chart email schedule?

My configured email report is named "Raw player games" with the crontab */20 * * * * (at every 20th minute I expect a report in my inbox); see the "Raw player games" screenshot.
Another crontab is configured in the main Superset config, superset_config.py:
# superset_config.py
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    'email_reports.schedule_hourly': {
        'task': 'email_reports.schedule_hourly',
        'schedule': crontab(minute=1, hour='*'),  # at minute 1 of every hour
    },
}
I receive emails, but only one per hour. I don't see any errors in the logs, and all jobs in Celery Flower are in the success state.
apache-superset==0.37.2
celery==4.4.7
Why does Superset send me reports only once an hour? How do I reconfigure Superset to handle my crontab correctly, and what did I miss?
Note that your beat schedule is configured to run hourly: at minute 1 of every hour, beat enqueues a job that checks whether it is time to send a new report. Configuring a finer resolution on the Superset side alone therefore makes no difference.
In addition, by default the email reports functionality has an hourly resolution:
https://github.com/apache/incubator-superset/blob/master/superset/tasks/schedules.py#L823
This default can be changed by configuring:
EMAIL_REPORTS_CRON_RESOLUTION
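A minimal sketch of how both settings might be adjusted in superset_config.py so a report can fire every 20 minutes. The values are assumptions (not verified against 0.37.2); the idea is simply to run the beat task more often and lower the report resolution:

# superset_config.py (sketch)
from celery.schedules import crontab

# Run the report-scheduling task more often than once an hour...
CELERYBEAT_SCHEDULE = {
    'email_reports.schedule_hourly': {
        'task': 'email_reports.schedule_hourly',
        'schedule': crontab(minute='*/10'),  # every 10 minutes
    },
}

# ...and lower the resolution (in minutes) used when matching report crontabs.
EMAIL_REPORTS_CRON_RESOLUTION = 20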

Airflow DAG is running for all the retries

I have a DAG that has been running for a few months, and for the last week it has been behaving abnormally. I am running a BashOperator that executes a shell script, and in the shell script we have a Hive query.
The number of retries is set to 4, as below.
from datetime import timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 4,
    'retry_delay': timedelta(minutes=5)
}
I can see in the log that it triggers the Hive query, loses the heartbeat after some time (around 5 to 6 minutes), and goes for a retry.
YARN shows that the query is not yet finished, but Airflow has already triggered the next attempt, so now two queries are running in YARN for the same task (one for the first run and one for the retry). In the same way, this DAG ends up triggering 5 queries for the same task (since retries is 4) and finally shows a failed status.
The interesting point is that the same DAG had been running fine for a long time. Also, this issue affects all the Hive-related DAGs in production.
Today I upgraded to the latest version of Airflow, v1.10.9.
I am using the LocalExecutor in this case.
Has anyone faced a similar issue?
The Airflow UI doesn't initiate retries on its own, irrespective of whether it's connected to the backend DB or not. It looks like your tasks are going zombie; in that case the scheduler's zombie detection kicks in and calls the task instance's (TI's) handle_failure method. So, in a nutshell, you can override that method for your DAG and add some logging to see what is happening. In fact, you should be able to query the Hadoop RM, check the status of your job, and make a decision accordingly, including cancelling retries.
For example, see this code, which I wrote to handle Zombie failures only.
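That linked code is not reproduced here. As a rough, hypothetical sketch of the idea only (the wrapper name and the monkey-patching approach are illustrative, not the answerer's actual code), wrapping handle_failure to add logging might look like this on Airflow 1.10.x:

# Hypothetical sketch: wrap TaskInstance.handle_failure to log extra context
# whenever the scheduler fails a (possibly zombie) task instance.
import logging

from airflow.models import TaskInstance

_original_handle_failure = TaskInstance.handle_failure

def handle_failure_with_logging(self, error, *args, **kwargs):
    logging.error(
        "handle_failure called for %s.%s (execution_date=%s): %s",
        self.dag_id, self.task_id, self.execution_date, error,
    )
    # Here you could also query the Hadoop RM / YARN REST API to check whether
    # the underlying application is still running before deciding about retries.
    return _original_handle_failure(self, error, *args, **kwargs)

TaskInstance.handle_failure = handle_failure_with_logging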

EMR Job Long Running Notifications

Consider that we have around 30 EMR jobs that run between 5:30 AM and 10:30 AM PST.
We have S3 buckets where we receive flat files, and through Lambda functions the received files are copied to other target paths.
We have DynamoDB tables for data processing once the data arrives in the target path.
The problem is that, since we have multiple dependencies and parallel execution, a job sometimes fails due to memory issues and sometimes takes longer to complete.
Sometimes it runs for 4 or 5 hours and is finally terminated by a memory or other issue, such as a subnet not being available or an EC2 problem, so we don't want to wait that long.
E.g.: Job_A processes files 1 to 4 and Job_B processes files 5 to 10, and so on.
Here Job_B depends on Job_A's 3rd file, so Job_B waits until Job_A completes. We have dependencies like this throughout our process.
I would like to get a notification from the EMR jobs like the following:
E.g.: the average running time for Job_A is 1 hour, but if it runs for more than 1 hour, I need to be notified by email or some other way.
How can I achieve this? Please help or advise.
Regards,
Karthik
Repeatedly list the steps using a Lambda function and the AWS SDK (e.g. boto3) and check the start date. When it is more than 1 hour in the past, you can trigger a notification, for example via Amazon SES. See the documentation.
For example, you can call list_steps for the running steps only:
import boto3

client = boto3.client('emr')
response = client.list_steps(
    ClusterId='string',
    StepStates=['RUNNING']
)
It will give you a response like the one below.
{
    'Steps': [
        {
            ...
            'Status': {
                ...
                'Timeline': {
                    'CreationDateTime': datetime(2015, 1, 1),
                    'StartDateTime': datetime(2015, 1, 1),
                    'EndDateTime': datetime(2015, 1, 1)
                }
            }
        },
    ],
    ...
}
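Putting the pieces together, a minimal Lambda-style sketch might look like the following. The cluster ID, threshold, and email addresses are placeholders, and sending through SES is just one of the notification options mentioned above:

# Sketch of a periodically triggered Lambda handler: alert when a RUNNING EMR
# step has been running longer than a threshold. All identifiers are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

THRESHOLD = timedelta(hours=1)

def lambda_handler(event, context):
    emr = boto3.client('emr')
    ses = boto3.client('ses')
    steps = emr.list_steps(ClusterId='j-XXXXXXXXXXXXX',
                           StepStates=['RUNNING'])['Steps']
    now = datetime.now(timezone.utc)
    for step in steps:
        started = step['Status']['Timeline'].get('StartDateTime')
        if started and now - started > THRESHOLD:
            ses.send_email(
                Source='alerts@example.com',
                Destination={'ToAddresses': ['oncall@example.com']},
                Message={
                    'Subject': {'Data': 'EMR step %s running too long' % step['Name']},
                    'Body': {'Text': {'Data': 'Step %s started at %s and is still running.'
                                              % (step['Id'], started)}},
                },
            )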

Celery interval schedule time

I am using django-celery-beat to schedule my task.
Currently I am just creating an interval schedule of 2 days and a periodic task to run at that interval.
My main problem is: when I schedule a task to run every 2 days, at what time does it run? And can't I change that time? I need to run the interval task at a certain time provided by the user.
The code written so far is
from django_celery_beat.models import PeriodicTask

# update_or_create returns an (object, created) tuple
periodic_task, created = PeriodicTask.objects.update_or_create(
    name='my-interval-task',
    defaults={
        'interval': schedule,  # interval schedule object
        'task': 'myapp.tasks.auto_refresh',
    }
)
Have a look at the crontab class.
E.g. schedule = crontab(hour=0, minute=0, day_of_month='2-30/2') fires at midnight on every even-numbered day of the month.
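Since the question uses django-celery-beat, here is a minimal sketch of the same idea with its CrontabSchedule model; the time fields are illustrative, and in practice you would fill them in from the time the user provides:

from django_celery_beat.models import CrontabSchedule, PeriodicTask

# Illustrative: fire at 09:30 on every even-numbered day of the month.
schedule, _ = CrontabSchedule.objects.get_or_create(
    minute='30',
    hour='9',
    day_of_month='2-30/2',
    month_of_year='*',
    day_of_week='*',
)

PeriodicTask.objects.update_or_create(
    name='my-interval-task',
    defaults={
        'crontab': schedule,   # schedule by crontab...
        'interval': None,      # ...and clear the old interval schedule
        'task': 'myapp.tasks.auto_refresh',
    },
)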

celerybeat set to crontab(day_of_month=1) sends task multiple times in a month

I have this task, which is set to crontab(day_of_month=1). But when that day comes, it keeps sending the task every minute, even though it is supposed to run only once.
From my tasks.py:
from celery.task import periodic_task
from celery.task.schedules import crontab

@periodic_task(run_every=crontab(day_of_month=1))
def Sample():
    ...
Am I missing something?
By default every crontab field is '*', so a crontab with only day_of_month set will fire every minute of that day; you need to specify the minute and hour as well.
Change @periodic_task(run_every=crontab(day_of_month=1)) to @periodic_task(run_every=crontab(minute=0, hour=0, day_of_month=1)).
This would run the task only at midnight on the first day of the month.
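For completeness, a sketch of the corrected task definition with the same imports as the question:

from celery.task import periodic_task
from celery.task.schedules import crontab

# Runs once, at 00:00 on the first day of every month.
@periodic_task(run_every=crontab(minute=0, hour=0, day_of_month=1))
def Sample():
    ...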