Can we parameterize the Airflow schedule_interval dynamically by reading from Variables instead of passing a cron expression?

Can we parameterize the Airflow schedule_interval dynamically by reading it from an Airflow Variable instead of passing the cron expression directly?
I have set it the following way, as per the Airflow documentation:
args = {
    'owner': 'pavan',
    'depends_on_past': False,
    'start_date': datetime(2020, 1, 15),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0,
    'on_failure_callback': notify_email
}

with DAG(dag_id=DAG_NAME, default_args=args, schedule_interval='* 1 * * *', catchup=False) as dag:

Yes.
Technically you can do it, but it brings two problems:
minor problem: reading a Variable means a SQL query is fired against Airflow's SQLAlchemy backend meta-db. Doing it in your DAG-definition script means this happens every time the DAG is parsed, which Airflow does continuously in the background. Read point 2 here.
major problem: a Variable can be edited via the UI, but altering an Airflow DAG's schedule_interval can have weird behaviours and may require you to either rename the DAG or (anecdotal finding) restart the scheduler to fix it.
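For completeness, a minimal sketch of what that could look like (the Variable name my_dag_schedule and its fallback value are assumptions, not from the original post):

from airflow import DAG
from airflow.models import Variable

# Hypothetical Variable name; falls back to a hard-coded cron if the Variable is not set.
# Note: this Variable.get() call runs on every DAG parse (the "minor problem" above).
schedule = Variable.get("my_dag_schedule", default_var="* 1 * * *")

with DAG(dag_id=DAG_NAME, default_args=args, schedule_interval=schedule, catchup=False) as dag:
    ...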

How can I run specific tasks from an Airflow DAG

Current state of the Airflow DAG:
ml_processors = [a, b, c, d, e]
abc_task >> ml_processors (all ML models from a to e run in parallel after abc_task completes successfully)
ml_processors >> xyz_task (once a to e have all succeeded, xyz_task runs)
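A minimal sketch of that structure (operator types, names, and dates are illustrative, not from the original setup):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator  # EmptyOperator in newer Airflow releases

with DAG(dag_id="ml_pipeline", start_date=datetime(2021, 3, 1), schedule_interval=None) as dag:
    abc_task = DummyOperator(task_id="abc_task")
    xyz_task = DummyOperator(task_id="xyz_task")
    ml_processors = [DummyOperator(task_id=f"{name}_processor") for name in "abcde"]
    abc_task >> ml_processors
    ml_processors >> xyz_task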
Problem statement: there are instances when one of the machine learning models (a task in Airflow) gets a new version with better accuracy and we want to reprocess our data. Let's say c_processor gets a new version and only its data needs to be reprocessed. In that case I would like to run only c_processor >> xyz_task.
What I know/tried
I know that I can go back into successful DAG runs and clear the tasks for a certain period of time to run only specific tasks. But this might not be very efficient when, let's say, both c_processor and d_processor need to be rerun, because I would end up doing two steps here:
c_processor >> xyz_task
d_processor >> xyz_task, which I would like to avoid.
I read about backfill in Airflow, but it looks like it is meant for the whole DAG rather than for specific/selected tasks from a DAG.
Environment/setup
Using a Google Cloud Composer environment.
The DAG is triggered on file upload to GCP Storage.
I am interested to know if there are any other ways to rerun only specific tasks from an Airflow DAG.
The airflow tasks clear command also lets you clear specific tasks in a DAG with the --task-regex flag. In this case, you can run airflow tasks clear --task-regex "[c|d]_processor" --downstream -s 2021-03-22 -e 2021-03-23 <dag_id>, which clears the state of the c and d processors along with their downstream tasks.
One caveat though: this will also clear the state of the original task runs.

Understanding Airflow execution_date when property 'catchup=false'

I am trying to see how Airflow sets execution_date for a DAG. I have set catchup=False in the DAG. Here is my DAG definition:
dag = DAG(
    'child',
    max_active_runs=1,
    description='A sample pipeline run',
    start_date=days_ago(0),
    catchup=False,
    schedule_interval=timedelta(minutes=5)
)
Now, since catchup=False, it should skip the runs prior to the current time. It does that, but strangely it is not setting the execution_date correctly.
Here are the runs' execution times:
[screenshot: execution times of the scheduled runs]
We can see the runs are scheduled at a frequency of 5 minutes. But why does it append seconds and milliseconds to the time?
This is impacting my sensors later.
Note that the behaviour is fine when catchup=True.
I did some permutations. It seems that the execution_date comes out correctly when I specify a cron expression instead of a timedelta.
So, my DAG now is
dag = DAG(
    'child',
    max_active_runs=1,
    description='A sample pipeline run',
    start_date=days_ago(0),
    catchup=False,
    schedule_interval='*/5 * * * *'
)
Hope this helps someone. I have also raised a bug for this, which can be tracked at https://github.com/apache/airflow/issues/11758
Regarding execution_date, you should have a look at the scheduler documentation. It marks the beginning of the period, but the run gets triggered at the end of that period (one schedule_interval after start_date).
The scheduler won't trigger your tasks until the period it covers has ended, e.g. a job with schedule_interval set to @daily runs after the day has ended. This technique makes sure that whatever data is required for that period is fully available before the DAG is executed. In the UI, it appears as if Airflow is running your tasks a day late.
Note
If you run a DAG on a schedule_interval of one day, the run with execution_date 2019-11-21 triggers soon after 2019-11-21T23:59.
Let’s Repeat That: the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Also the article Scheduling Tasks in Airflow might be worth a read.
You should also avoid setting the start_date to a relative value - this can lead to unexpected behaviour, because the value is re-evaluated every time the DAG file is parsed.
There is a longer explanation in the Airflow FAQ:
We recommend against using dynamic values as start_date, especially datetime.now(), as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along.
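Putting the two pieces of advice together, a minimal sketch with a cron expression and a static start_date (the date itself is illustrative):

from datetime import datetime
from airflow import DAG

dag = DAG(
    'child',
    max_active_runs=1,
    description='A sample pipeline run',
    start_date=datetime(2020, 10, 1),    # static value, interpreted the same on every parse
    catchup=False,
    schedule_interval='*/5 * * * *'      # cron expression instead of a timedelta
)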

Google Cloud Spanner insert is very slow

I am new to Cloud Spanner. My Cloud Spanner instance is a single-node instance (as I am doing a POC). I am doing a simple insert; however, it returns the result after 5 to 8 seconds, although it shows Time elapsed: 1.16 secs.
I am clueless as to why. Also, even if I go by the time that Google shows, 1.16 seconds for a simple insert is still too much. A benchmark performed by a couple of people shows an average time of around 16 ms to 80 ms.
Here's my insert statement:
INSERT INTO SP_DATAFLOW_PER_INCR
  (PER_NATL_ID, DQ_PER_NATL_ID_FLG, PER_FRST_NM, PER_MID_NM, PER_LST_NM, PER_MTRNL_LST_NM,
   PER_BRTH_DT, DQ_PER_BRTH_DT_FLG, PER_SEX_CD, DQ_PER_SEX_CD_FLG, PER_CELL_NUM, DQ_PER_CELL_NUM_FLG,
   PER_DTH_IND, PER_EMAIL_ADR_TXT, DQ_PER_EMAIL_ADR_TXT_FLG, PER_ED_LVL_CD, REC_STRT_TS, REC_END_TS,
   CUR_REC_IND, loadingdate)
VALUES
  ('00000000000000666971', false, 'DE', 'DIOS', 'DIAZ', 'JUAN', '1906-08-11', false, 'M', false,
   'No Data', false, false, '', true, '', '2019-12-09 03:51:06.249454 UTC', null, true, '2019-12-09');
Am I doing anything wrong?
Can someone please help me understand this? The Google documentation does not mention anything about this behavior.
Running queries through the GCP console or the gcloud command line is expected to be a little slow. For a simple insert there is some overhead, such as creating a session, which can be expensive.
If you want to evaluate the performance of Cloud Spanner, you should use the client libraries (see the tutorials).
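As a rough illustration, a minimal sketch of the same kind of insert through the Python client library, which reuses pooled sessions instead of paying the per-request overhead of the console (the instance/database IDs and the shortened column list are assumptions for brevity):

from datetime import date
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")        # hypothetical instance ID
database = instance.database("my-database")      # hypothetical database ID

# Mutation-based insert; the client library manages sessions for you.
with database.batch() as batch:
    batch.insert(
        table="SP_DATAFLOW_PER_INCR",
        columns=("PER_NATL_ID", "PER_FRST_NM", "PER_LST_NM", "loadingdate"),
        values=[("00000000000000666971", "DE", "DIAZ", date(2019, 12, 9))],
    )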

Airflow: Dag scheduled twice a few seconds apart

I am trying to run a DAG only once a day at 00:15:00 (15 minutes past midnight), yet it's being scheduled twice, a few seconds apart.
dag = DAG(
    'my_dag',
    default_args=default_args,
    start_date=airflow.utils.dates.days_ago(1) - timedelta(minutes=10),
    schedule_interval='15 0 * * * *',
    concurrency=1,
    max_active_runs=1,
    retries=3,
    catchup=False,
)
The main goal of that DAG is to check for new emails, then check for new files in an SFTP directory, and then run a "merger" task to add those new files to a database.
All the jobs are Kubernetes pods:
email_check = KubernetesPodOperator(
    namespace='default',
    image="g.io/email-check:0d334adb",
    name="email-check",
    task_id="email-check",
    get_logs=True,
    dag=dag,
)

sftp_check = KubernetesPodOperator(
    namespace='default',
    image="g.io/sftp-check:0d334adb",
    name="sftp-check",
    task_id="sftp-check",
    get_logs=True,
    dag=dag,
)

my_runner = KubernetesPodOperator(
    namespace='default',
    image="g.io/my-runner:0d334adb",
    name="my-runner",
    task_id="my-runner",
    get_logs=True,
    dag=dag,
)

my_runner.set_upstream([sftp_check, email_check])
So, the issue is that there seem to be two runs of the DAG scheduled a few seconds apart. They do not run concurrently, but as soon as the first one is done, the second one kicks off.
The problem here is that the my_runner job is intended to run only once a day: it tries to create a file with the date as a suffix, and if the file already exists it throws an exception, so the second run always fails (because the file for the day has already been created by the first run).
Since an image (or two) is worth a thousand words, here they are:
You'll see that there's a first run scheduled "22 seconds after 00:15" (that's fine; it varies by a couple of seconds here and there) and then a second one that always seems to be scheduled "58 seconds after 00:15 UTC" (at least according to the names they get). So the first one runs fine, and nothing else seems to be running. As soon as it finishes, the second run (the one scheduled at 00:15:58) starts and fails.
A "good" one: [screenshot]
A "bad" one: [screenshot]
Can you check the schedule_interval parameter? schedule_interval='15 0 * * * *' - a cron schedule takes only five fields, and I see an extra star.
Also, can you use a fixed start_date?
start_date=datetime(2019, 11, 10)
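A minimal sketch of what that suggestion amounts to (the date is illustrative):

dag = DAG(
    'my_dag',
    default_args=default_args,
    start_date=datetime(2019, 11, 10),   # fixed, static start_date
    schedule_interval='15 0 * * *',      # five-field cron: 00:15 every day
    max_active_runs=1,
    catchup=False,
)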
It looks like setting the start_date to 2 days ago instead of 1 did the trick
dag = DAG(
    'my_dag',
    ...
    start_date=airflow.utils.dates.days_ago(2),
    ...
)
I don't know why.
I just have a theory. Maaaaaaybe (big maybe) the issue was that, because days_ago(...) sets a UTC datetime with hour/minute/second set to 0 and then subtracts the number of days given in the argument, saying "one day ago" or even "one day and 10 minutes ago" didn't put the start_date over the next period (00:15), and that somehow confused Airflow?
Let’s Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
https://airflow.readthedocs.io/en/stable/scheduler.html#scheduling-triggers
So, the end of the period would be 00:15... If my theory is correct, using airflow.utils.dates.days_ago(1) - timedelta(minutes=16) would probably also work.
This doesn't explain why, if I set a date very far in the past, it just doesn't run, though. ¯\_(ツ)_/¯

EMR PySpark structured streaming takes too long to read from big s3 bucket

I have a two-node EMR cluster with PySpark installed, reading data from S3. The code is a very simple filter-and-transform operation that uses sqlContext.readStream.text to fetch data from the bucket. The bucket is ~10 TB in size and has around 75k files organized as bucket/year/month/day/hour/*, with * representing up to 20 files of 128 MB each. I started the streaming job by providing the bucket path s3://bucket_name/dir/ and letting PySpark read all the files in it. It's now been almost 2 hours, the job hasn't even started consuming data from S3, and the network traffic reported by Ganglia is minimal.
I'm scratching my head over why this process is so slow and how I can speed it up, since the machines I'm paying for are currently sitting basically idle.
When I use .status and .lastProgress to track the state, I get the following responses respectively:
{'isDataAvailable': False,
'isTriggerActive': True,
'message': 'Getting offsets from FileStreamSource[s3://bucket_name/dir]'}
and
{'durationMs': {'getOffset': 207343, 'triggerExecution': 207343},
'id': '******-****-****-****-*******',
'inputRowsPerSecond': 0.0,
'name': None,
'numInputRows': 0,
'processedRowsPerSecond': 0.0,
'runId': '******-****-****-****-*******',
'sink': {'description': 'FileSink[s3://dest_bucket_name/results/file_name.csv]'},
'sources': [{'description': 'FileStreamSource[s3://bucket_name/dir]',
'endOffset': None,
'inputRowsPerSecond': 0.0,
'numInputRows': 0,
'processedRowsPerSecond': 0.0,
'startOffset': None}],
'stateOperators': [],
'timestamp': '2018-02-19T22:31:13.385Z'}
Any idea what could be causing the data consumption to take so long? Is this normal behaviour? Am I doing something wrong? Any tips on how this process could be improved?
Any help is greatly appreciated. Thanks.
Spark checks for files in the source folder and tries to discover partitions by checking whether sub-folder names match the pattern column-name=column-value.
Since your data is partitioned by date, the files should be laid out something like s3://bucket_name/dir/year=2018/month=02/day=19/hour=08/data-file.
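As a rough sketch of what the stream could look like once the layout uses Hive-style year=/month=/day=/hour= folders (bucket names, the filter, and the output paths are illustrative, not from the original post):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("s3-stream-sketch").getOrCreate()

# With key=value folder names, Spark can expose year/month/day/hour as
# partition columns instead of inferring structure from a flat file listing.
stream = spark.readStream.text("s3://bucket_name/dir/")

filtered = stream.filter(col("value").contains("something"))   # placeholder transform

query = (
    filtered.writeStream
            .format("csv")
            .option("path", "s3://dest_bucket_name/results/")
            .option("checkpointLocation", "s3://dest_bucket_name/checkpoints/")
            .start()
)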