DAG schedule in Airflow 2.0 - airflow-scheduler

How can I schedule a DAG in Airflow 2.0 so that it does not run on holidays?
Question 1: How can it run on every 5th working day of the month?
Question 2: How can it run on the 5th working day of the month and, if that day is a holiday, run on the next day that is not a holiday?

For the moment this can't be done (at least not natively). Airflow DAGs accept either a single cron expression or a timedelta. If you can't express the desired scheduling logic with one of them, you cannot have that schedule in Airflow. The good news is that Airflow has AIP-39 Richer scheduler_interval to address this and provide more scheduling capabilities in future versions.
That said, you can work around this by setting your DAG to run with schedule_interval="@daily" and placing a BranchPythonOperator as the first task of the DAG. In the Python callable you write your scheduling logic: the function returns the task_id to follow, "continue_task" if it's the 5th working day of the month and "stop_task" otherwise, and your workflow branches accordingly. In the first case the DAG continues to execute; in the second it ends. This is not ideal, but it will work. A possible template:
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator

def check_fifth():
    # write the logic for finding whether today is the 5th working day of the month
    if logic:
        return "continue_task"
    else:
        return "stop_task"

with DAG(dag_id='stackoverflow',
         default_args=default_args,
         schedule_interval="@daily",
         catchup=False
         ) as dag:
    start_op = BranchPythonOperator(
        task_id='choose',
        python_callable=check_fifth
    )
    stop_op = DummyOperator(
        task_id='stop_task'
    )
    # replace with the operator that you want to actually execute
    continue_op = YourOperator(
        task_id='continue_task'
    )
    start_op >> [stop_op, continue_op]
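The "5th working day" check left open in the template could look roughly like the sketch below. It is only an illustration using the standard library: the HOLIDAYS set and the helper name nth_working_day are assumptions, not part of Airflow, and a real setup would load holidays from a proper calendar source. Because it counts working days (Monday to Friday, not a holiday), a holiday simply pushes the 5th working day to the next non-holiday weekday, which also covers Question 2.

```python
from datetime import date, timedelta

# Hypothetical holiday list; replace with your own calendar source.
HOLIDAYS = {date(2021, 1, 1)}

def nth_working_day(year, month, n, holidays=frozenset()):
    """Return the date of the n-th working day (Mon-Fri, not a holiday) of a month."""
    day = date(year, month, 1)
    count = 0
    while day.month == month:
        if day.weekday() < 5 and day not in holidays:
            count += 1
            if count == n:
                return day
        day += timedelta(days=1)
    raise ValueError("month has fewer than %d working days" % n)

def check_fifth(today=None):
    """Branch callable: return the task_id to follow for the given day."""
    today = today or date.today()
    fifth = nth_working_day(today.year, today.month, 5, HOLIDAYS)
    return "continue_task" if today == fifth else "stop_task"

# 1 January 2021 is a Friday and a holiday here, so counting starts on Monday the 4th.
print(nth_working_day(2021, 1, 5, HOLIDAYS))
print(check_fifth(date(2021, 1, 8)))
```

Note that this sketch compares against the wall-clock date by default; inside Airflow you may prefer to pass in the run's own date from the task context instead.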

Related

Running Airflow DAG on Tuesday 00:00:00 every week

I'm trying to run a DAG on Airflow (GCP Cloud Composer, to be exact) on a weekly basis.
But the DAG does not run on Tuesdays, as I'm specifying in the cron expression.
In all the examples I found, the schedule_interval was set as an interval (daily, weekly, and so on). I can't figure out what the error in my settings might be.
default_dag_args = {
    'start_date': datetime.datetime.strptime('07/08/2020 00:00:00', '%d/%m/%Y %H:%M:%S'),
    'depends_on_past': False,
    'catchup': ...,
    'retry_delay': ...,
    'project_id': ...
}
with models.DAG(
        'every_Tues_00_00',
        schedule_interval="0 0 * * 2",
        default_args=default_dag_args) as dag:
    .
    .
    .
Something to keep in mind is when Airflow triggers the task.
"For example, if you run a DAG on a schedule_interval of one day, the run stamped 2020-01-01 will be triggered soon after 2020-01-01T23:59. In other words, the job instance is started once the period it covers has ended. The execution_date available in the context will also be 2020-01-01." [1]
[1] https://airflow.apache.org/docs/stable/dag-run.html
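To make the quoted rule concrete for the weekly cron in the question, the sketch below computes, in plain Python, the first Tuesday-midnight run stamp at or after the question's start_date and the moment that run would be triggered, one interval later. It only mirrors the rule in the quote and is not Airflow's scheduling code; the function name first_weekly_run is hypothetical.

```python
from datetime import datetime, timedelta

def first_weekly_run(start_date, weekday):
    """Return (run_stamp, trigger_time) for a weekly schedule at midnight.

    weekday uses Python's convention (Monday=0), so Tuesday is 1.
    """
    midnight = start_date.replace(hour=0, minute=0, second=0, microsecond=0)
    run_stamp = midnight + timedelta(days=(weekday - midnight.weekday()) % 7)
    if run_stamp < start_date:
        # start_date falls later on that weekday, so use the following week
        run_stamp += timedelta(weeks=1)
    # The run is started once the period it covers has ended.
    return run_stamp, run_stamp + timedelta(weeks=1)

start = datetime.strptime('07/08/2020 00:00:00', '%d/%m/%Y %H:%M:%S')
run_stamp, trigger_time = first_weekly_run(start, weekday=1)
print(run_stamp)     # 2020-08-11 00:00:00
print(trigger_time)  # 2020-08-18 00:00:00
```

Under this reading, the first run for this start_date would show up on Tuesday 18 August 2020, stamped 11 August 2020.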

How to run an Airflow DAG once after x minutes?

I need to run a DAG exactly once, but only after waiting 10 minutes:
with models.DAG(
        'bq_executor',
        schedule_interval='@once',
        start_date=datetime.now() + timedelta(minutes=10),
        catchup=False,
        default_args=default_dag_args) as dag:
    # DAG operator here
but I can't see the execution after 10 minutes. Is something wrong with start_date?
If I use schedule_interval = '*/10 * * * *' and start_date= datetime(2019, 8, 1) (a date in the past), I can see the execution every 10 minutes.
Don't use datetime.now(): it is re-evaluated every time the DAG file is loaded, so now() + 10 minutes is always a future timestamp and the DAG never gets scheduled.
Airflow always runs the DAGs you have added AFTER the start_date. So if your start_date is today, the DAG will start after today 23:59.
Scheduling is tricky in this respect, so check the documentation and examples:
https://airflow.apache.org/scheduler.html
In your case, just switch the start_date to yesterday (or today - 1) and it will start today with yesterday's ds (date stamp).
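The moving-target problem can be illustrated without Airflow. The sketch below is a deliberate simplification: is_due and moving_start_date are hypothetical helpers standing in for "the scheduler only considers a run once the current time has reached start_date", and the parse times are made up.

```python
from datetime import datetime, timedelta

def moving_start_date(parse_time):
    """What datetime.now() + timedelta(minutes=10) evaluates to at a given parse time."""
    return parse_time + timedelta(minutes=10)

# Static date in the past, as in the working variant from the question.
FIXED_START_DATE = datetime(2019, 8, 1)

def is_due(start_date, now):
    """Simplified rule: a run is only considered once now has reached start_date."""
    return now >= start_date

first_parse = datetime(2019, 8, 2, 12, 0)
for minutes in (0, 10, 60):
    now = first_parse + timedelta(minutes=minutes)
    # The moving start_date is recomputed on every parse and stays 10 minutes ahead.
    print(now, is_due(moving_start_date(now), now), is_due(FIXED_START_DATE, now))
```

However late the file is re-parsed, the recomputed start_date stays ahead of the clock, while the fixed date is reached immediately.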

django-apscheduler schedule a job to run at a specific time of the day

There isn't much documentation available at https://github.com/jarekwg/django-apscheduler.
I want to set up a job which will run at exactly 12:00 AM each day.
How do I set that up in django-apscheduler?
What I have tried so far is this:
@register_job(scheduler, "interval", days=1)
def pending_block_activity_notification():
    print("Job started")
How do I specify that it should run once a day at exactly 12:00 am?
My configuration runs at an interval of 1 day, but the interval is counted from when the Django server is started.
Finally I found the solution.
The syntax is the same as APScheduler's.
@register_job(scheduler, "cron", hour=0)
def pending_block_activity_notification():
    print("pending_block_activity_notification Job started")
Similarly, we can run the job at 12:00 am, 6:00 am, 12:00 pm and 6:00 pm as follows:
@register_job(scheduler, "cron", hour='0,6,12,18')
def pending_block_activity_notification():
    print("pending_block_activity_notification Job started")
The valid expressions are listed in the APScheduler docs:
https://apscheduler.readthedocs.io/en/latest/modules/triggers/cron.html
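To see what a cron trigger with hour='0,6,12,18' means in practice, the plain-Python sketch below computes the next fire time for a given moment. It is only an illustration of the schedule (top of the hour, at the listed hours), not APScheduler's implementation, and next_fire_time is a hypothetical helper.

```python
from datetime import datetime, timedelta

def next_fire_time(now, hours=(0, 6, 12, 18)):
    """Return the next datetime strictly after now at one of the given hours, minute 0."""
    for day_offset in (0, 1):
        day = (now + timedelta(days=day_offset)).date()
        for hour in sorted(hours):
            candidate = datetime(day.year, day.month, day.day, hour)
            if candidate > now:
                return candidate
    raise ValueError("hours must not be empty")

print(next_fire_time(datetime(2021, 3, 1, 7, 30)))  # 2021-03-01 12:00:00
print(next_fire_time(datetime(2021, 3, 1, 18, 0)))  # 2021-03-02 00:00:00
```

Unlike the "interval" trigger from the question, the result depends only on the clock, not on when the server was started.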

Python make script run at specified time daily

I want to make this script run automatically once or twice a day at a specified time. What would be the best way to approach this?
import re
import urllib.request

def get_data():
    """Read the currency rates from cinkciarz.pl, print them, and return the
    PLN/USD rate."""
    url = "https://cinkciarz.pl/kantor/kursy-walut-cinkciarz-pl/usd"
    with urllib.request.urlopen(url) as sock:
        html_source = sock.read().decode("utf-8", errors="replace")
    # every matching table cell holds one rate
    currency_rate = re.findall(r'<td class="cur_down">(.*?)</td>', html_source)
    for each_td in currency_rate:
        print(each_td)
    my_rate = currency_rate[0]
    print(my_rate)
    return my_rate
You can use crontab to run any script at regular intervals. See https://stackoverflow.com/a/8727991/1517864
To run a script once a day (at 12:00), you will need an entry like this in your crontab:
0 12 * * * python /path/to/script.py
Alternatively, you can use a bash loop that runs the command and then sleeps:
while true; do <your_command>; sleep <interval_in_seconds>; done
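If cron is not available, the same run-then-sleep idea can be sketched in Python itself, sleeping until a specific time of day rather than for a fixed interval. The helper names seconds_until and run_daily are hypothetical, and the sketch uses naive local datetimes, so it does not account for daylight saving changes.

```python
import time
from datetime import datetime, timedelta

def seconds_until(hour, minute=0, now=None):
    """Seconds from now until the next occurrence of hour:minute (local time)."""
    now = now or datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # today's slot has already passed, so aim for tomorrow
        target += timedelta(days=1)
    return (target - now).total_seconds()

def run_daily(job, hour, minute=0):
    """Call job() every day at hour:minute; blocks forever."""
    while True:
        time.sleep(seconds_until(hour, minute))
        job()

print(seconds_until(12, 0, now=datetime(2021, 1, 1, 11, 0)))  # 3600.0
```

Calling run_daily(get_data, 12) would then invoke the function once a day at 12:00 for as long as the process stays alive; cron remains the more robust choice because it survives reboots and crashes.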

Xively read data in Python

I have written a Python 2.7 script to retrieve all my historical data from Xively.
Originally I wrote it in C#, and it works perfectly.
I am limiting each request to a 6-hour block, to retrieve all stored data.
My version in Python is as follows:
requestString = 'http://api.xively.com/v2/feeds/41189/datastreams/0001.csv?key=YcfzZVxtXxxxxxxxxxxORnVu_dMQ&start=' + requestDate + '&duration=6hours&interval=0&per_page=1000'
response = urllib2.urlopen(requestString).read()
The request date is in the correct format; I compared the full C# requestString with the Python one.
Using the above request, I only get 101 lines of data, which equates to a few minutes of results.
My suspicion is that the .read() function is at fault: it returns about 34k characters, which is far less than the C# version. I tried passing 100000 as an argument to the read function, but the result did not change.
Here is another solution, also written in Python 2.7.
In my case, I retrieve the data in 30-minute blocks, because many sensors send values every minute and the Xively API limits a request to half an hour of data at that sending frequency.
This is the general module:
for day in datespan(start_datetime, end_datetime, deltatime):  # step from start_datetime to end_datetime in deltatime increments
    while True:  # retry until the data is retrieved correctly
        try:
            response = urllib2.urlopen('https://api.xively.com/v2/feeds/' + str(feed) + '.csv?key=' + apikey_xively + '&start=' + day.strftime("%Y-%m-%dT%H:%M:%SZ") + '&interval=' + str(interval) + '&duration=' + duration)  # get data
            break
        except Exception:
            time.sleep(0.3)  # wait briefly, then try again
    cr = csv.reader(response)  # return data in columns
    print '.'
    for row in cr:
        if row[0] in id:  # choose desired data
            f.write(row[0] + "," + row[1] + "," + row[2] + "\n")  # write "id,timestamp,value"
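The excerpt does not show datespan. A minimal sketch of what such a generator might look like is given below; treating the end as exclusive is an assumption, and the linked script may define it differently.

```python
from datetime import datetime, timedelta

def datespan(start, end, delta):
    """Yield start, start + delta, ... for as long as the value is before end."""
    current = start
    while current < end:
        yield current
        current += delta

# Two 30-minute blocks cover one hour.
for block_start in datespan(datetime(2014, 1, 1), datetime(2014, 1, 1, 1), timedelta(minutes=30)):
    print(block_start.strftime("%Y-%m-%dT%H:%M:%SZ"))
```

Each yielded value is the start of one request block, so the block size passed as delta should match the duration sent to the API.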
You can find the full script here: https://github.com/CarlosRufo/scripts/blob/master/python/retrievalDataXively.py
Hope this helps; I'm happy to answer any questions :)