airflow - how to get start date of current dag_run (not specific task)? - airflow-scheduler

Tasks 1, 2, 3, 4 in the same dag will insert to a db table.
I then want task 7 to update the db table only for rows with timestamp >= the time of the start of the dagrun (not the start time of task 7).
Is there some Jinja/kwarg/context macro I can use?
I didn't see any example of how to get the dag_run start_date (not the execution date).

The context variable contains a number of fields with information about the task context, including dag_run.start_date:
context['dag_run'].start_date
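Since dag_run is also part of the template context, the same value can be used in any templated field. A minimal sketch, assuming a Postgres-backed table with an inserted_at timestamp column (the operator, table, and column names here are only illustrative):

from airflow.providers.postgres.operators.postgres import PostgresOperator

# Hypothetical "task 7": only touch rows inserted at or after the DAG run's start.
update_rows = PostgresOperator(
    task_id="update_rows",
    sql="""
        UPDATE my_table                                   -- assumed table name
        SET processed = TRUE
        WHERE inserted_at >= '{{ dag_run.start_date }}'   -- rendered per DAG run
    """,
)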

kwargs['dag_run'].start_date will provide the start date (as opposed to the execution date) of the DAG run:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.decorators import task
with DAG(
    "demo_dag",                        # DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval='* * * * *',     # every minute
    catchup=False
) as dag:

    @task(task_id="task")
    def demo(**kwargs):
        print("kwargs['dag_run'].start_date:")
        print(kwargs["dag_run"].start_date)
        print("kwargs['dag_run'].execution_date:")
        print(kwargs["dag_run"].execution_date)

    task1 = demo()
This results in log entries similar to:
[2023-02-08, 09:43:01 NZDT] {logging_mixin.py:115} INFO - kwargs['dag_run'].start_date:
[2023-02-08, 09:43:01 NZDT] {logging_mixin.py:115} INFO - 2023-02-07 20:43:00.996729+00:00
[2023-02-08, 09:43:01 NZDT] {logging_mixin.py:115} INFO - kwargs['dag_run'].execution_date:
[2023-02-08, 09:43:01 NZDT] {logging_mixin.py:115} INFO - 2023-02-07 20:42:00+00:00
A discussion of the difference between start_date and execution_date can be found here: https://infinitelambda.com/airflow-start-date-execution-date/

Related

How to create a BigQuery table with Airflow failure notification?

I have a Airflow DAG on GCP composer that runs every 5 minutes. I would like to create a BigQuery table that will have the time when DAG starts to run and a flag identifying whether it's a successful run or failed run. For example, if the DAG runs at 2020-03-23 02:30 and the run fails, the BigQuery table will have time column with 2020-03-23 02:30 and flag column with 1. If it's a successful run, then the table will have time column with 2020-03-23 02:30 and flag column with 0. The table will append new rows.
Thanks in advance
You can use the list_dag_runs CLI command to list the DAG runs for a given dag_id. The information returned includes the state of each run.
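For example (the exact CLI syntax depends on the Airflow version; both forms below list the runs and their state):

# Airflow 1.10.x
airflow list_dag_runs my_dag

# Airflow 2.x
airflow dags list-runs -d my_dag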
Another option is to retrieve the information via Python code, which can be done a few different ways. One approach I've used in the past is the find method of airflow.models.dagrun.DagRun:
from airflow.models import DagRun

dag_id = 'my_dag'
dag_runs = DagRun.find(dag_id=dag_id)
for dag_run in dag_runs:
    print(dag_run.state)
Finally, use the BigQuery operator to write the DAG information into a BigQuery table. You can find an example of how to use the BigQueryOperator here.
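As a rough sketch of that last step (the contrib import path matches older Composer/Airflow 1.10 images; the project, dataset, table, and query below are placeholders):

from airflow.contrib.operators.bigquery_operator import BigQueryOperator

# Hypothetical task: append one row per run with the run time and a status flag.
write_status = BigQueryOperator(
    task_id='write_status',
    sql="SELECT TIMESTAMP('{{ ts }}') AS time, 0 AS flag",   # 0 = success in this example
    destination_dataset_table='my_project.my_dataset.dag_status',
    write_disposition='WRITE_APPEND',
    use_legacy_sql=False,
)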
Based on the solution by @Enrique, here is my final solution:
from airflow.models import DagRun
from airflow.operators.python import PythonOperator  # airflow.operators.python_operator on Airflow 1.10

def status_check(**kwargs):
    import pandas as pd
    import pandas_gbq
    from google.cloud import bigquery

    dag_id = 'dag_id'
    dag_runs = DagRun.find(dag_id=dag_id)

    arr = []
    arr1 = []
    for dag_run in dag_runs:
        arr.append(dag_run.state)
        arr1.append(dag_run.execution_date)

    data1 = {'dag_status': arr, 'time': arr1}
    df = pd.DataFrame(data1)

    project_name = "project_name"
    dataset = "Dataset"
    outputBQtableName = '{}.dag_status_tab'.format(dataset)

    df.to_gbq(outputBQtableName, project_id=project_name,
              if_exists='replace', progress_bar=False,
              table_schema=[{'name': 'dag_status', 'type': 'STRING'},
                            {'name': 'time', 'type': 'TIMESTAMP'}])
    return None

Dag_status = PythonOperator(
    task_id='Dag_status',
    python_callable=status_check,
)

Elasticsearch-Hadoop formatting multi resource writes issue

I am interfacing Elasticsearch with Spark, using the Elasticsearch-Hadoop plugin and I am having difficulty writing a dataframe with a timestamp type column to Elasticsearch.
The problem is when I try to write using dynamic/multi resource formatting to create a daily index.
From the relevant documentation I get the impression that this is possible; however, the Python example below fails to run unless I change my dataframe column type to date.
import pyspark
conf = pyspark.SparkConf()
conf.set('spark.jars', 'elasticsearch-spark-20_2.11-6.1.2.jar')
conf.set('es.nodes', '127.0.0.1:9200')
conf.set('es.read.metadata', 'true')
conf.set('es.nodes.wan.only', 'true')
from datetime import datetime, timedelta
now = datetime.now()
before = now - timedelta(days=1)
after = now + timedelta(days=1)
cols = ['idz', 'name', 'time']
vals = [(0,'maria', before), (1, 'lolis', after)]
time_df = spark.createDataFrame(vals, cols)
When I try to write, I use the following:
time_df.write.mode('append').format(
'org.elasticsearch.spark.sql'
).options(
**{'es.write.operation': 'index' }
).save('xxx-{time|yyyy.MM.dd}/1')
Unfortunately this renders an error:
.... Caused by: java.lang.IllegalArgumentException: Invalid format:
"2018-03-04 12:36:12.949897" is malformed at " 12:36:12.949897" at
org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:945)
On the other hand, this works perfectly fine if I use dates when I create my dataframe:
cols = ['idz', 'name', 'time']
vals = [(0,'maria', before.date()), (1, 'lolis', after.date())]
time_df = spark.createDataFrame(vals, cols)
Is it possible to format a dataframe timestamp to be written to daily indexes with this method, without also keeping a date column around? How about monthly indexes?
Pyspark version:
spark version 2.2.1
Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_151
ElasticSearch version
number "6.2.2" build_hash "10b1edd"
build_date "2018-02-16T19:01:30.685723Z" build_snapshot false
lucene_version "7.2.1" minimum_wire_compatibility_version "5.6.0"
minimum_index_compatibility_version "5.0.0"
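One possible workaround (not from the original thread, just a sketch): keep the timestamp column as-is, but add a string column pre-formatted for the index pattern and reference that column in the resource name instead:

from pyspark.sql import functions as F

# Derive a plain string column in the exact shape the daily index name needs.
indexed_df = time_df.withColumn('index_day', F.date_format(F.col('time'), 'yyyy.MM.dd'))

indexed_df.write.mode('append').format(
    'org.elasticsearch.spark.sql'
).options(
    **{'es.write.operation': 'index'}
).save('xxx-{index_day}/1')

For monthly indexes the same idea applies with a 'yyyy.MM' pattern. The extra field does end up in the documents unless it is excluded (es.mapping.exclude).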

Django + PostgreSQL: Fill missing dates in a range

I have a table with one of the columns as date. It can have multiple entries for each date.
date .....
----------- -----
2015-07-20 ..
2015-07-20 ..
2015-07-23 ..
2015-07-24 ..
I would like to get data in the following form using Django ORM with PostgreSQL as database backend:
date count(date)
----------- -----------
2015-07-20 2
2015-07-21 0 (missing after aggregation)
2015-07-22 0 (missing after aggregation)
2015-07-23 1
2015-07-24 1
Corresponding PostgreSQL Query:
WITH RECURSIVE date_view(start_date, end_date)
AS ( VALUES ('2015-07-20'::date, '2015-07-24'::date)
UNION ALL SELECT start_date::date + 1, end_date
FROM date_view
WHERE start_date < end_date )
SELECT start_date, count(date)
FROM date_view LEFT JOIN my_table ON date=start_date
GROUP BY date, start_date
ORDER BY start_date ASC;
I'm having trouble translating this raw query to Django ORM query.
It would be great if someone can give a sample ORM query with/without a workaround for Common Table Expressions using PostgreSQL as database backend.
The simple reason is quoted here:
My preference is to do as much data processing in the database, short of really involved presentation stuff. I don't envy doing this in application code, just as long as it's one trip to the database
As per this answer, Django doesn't support CTEs natively, but that answer seems quite outdated.
References:
MySQL: Select All Dates In a Range Even If No Records Present
WITH Queries (Common Table Expressions)
Thanks
I do not think you can do this with the pure Django ORM, and I am not even sure this can be done neatly with extra(). The Django ORM is incredibly good at handling the usual stuff, but for more complex SQL statements and requirements, more so with DBMS-specific implementations, it is just not quite there yet. You might have to go lower and execute raw SQL directly, or offload that requirement to the application layer.
You can always generate the missing dates using Python, but that will be incredibly slow if the range and number of elements are huge. If this is being requested by AJAX for other uses (e.g. charting), then you can offload that to JavaScript.
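A minimal sketch of that Python-side gap filling (not from the original answer), assuming a model MyModel with a zdate date column; both names are placeholders:

from datetime import date, timedelta
from django.db.models import Count

start, end = date(2015, 7, 20), date(2015, 7, 24)

# Let the database aggregate the dates it does have...
counts = dict(
    MyModel.objects.filter(zdate__range=(start, end))
           .values_list('zdate')
           .annotate(c=Count('id'))
)

# ...then fill the missing days in Python.
result = [(start + timedelta(days=n), counts.get(start + timedelta(days=n), 0))
          for n in range((end - start).days + 1)]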
from datetime import date, timedelta

from django.db.models.functions import Trunc
from django.db.models.expressions import Value
from django.db.models import Count, DateField

# A is the model
start_date = date(2022, 5, 1)
end_date = date(2022, 5, 10)

result = A.objects \
    .annotate(date=Trunc('created', 'day', output_field=DateField())) \
    .filter(date__gte=start_date, date__lte=end_date) \
    .values('date') \
    .annotate(count=Count('id')) \
    .union(
        A.objects.extra(select={
            'date': 'unnest(Array[%s]::date[])' %
                    ','.join(map(lambda d: "'%s'::date" % d.strftime('%Y-%m-%d'),
                                 set(start_date + timedelta(n) for n in range((end_date - start_date).days + 1)) -
                                 set(A.objects.annotate(date=Trunc('created', 'day', output_field=DateField()))
                                     .values_list('date', flat=True))))})
        .annotate(count=Value(0))
        .values('date', 'count')) \
    .order_by('date')
Instead of the recursive CTE, you could use generate_series() to construct a calendar table:
SELECT calendar, count(mt.zdate) as THE_COUNT
FROM generate_series('2015-07-20'::date
, '2015-07-24'::date
, '1 day'::interval) calendar
LEFT JOIN my_table mt ON mt.zdate = calendar
GROUP BY 1
ORDER BY 1 ASC;
BTW: I renamed date to zdate. DATE is a bad name for a column (it is the name of a data type).
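If you want to run that query from Django rather than in psql, a raw-SQL sketch along these lines should work (table and column names as above):

from django.db import connection

SQL = """
    SELECT calendar::date AS zdate, count(mt.zdate) AS the_count
    FROM generate_series(%s::date, %s::date, '1 day'::interval) calendar
    LEFT JOIN my_table mt ON mt.zdate = calendar
    GROUP BY 1
    ORDER BY 1 ASC
"""

with connection.cursor() as cursor:
    cursor.execute(SQL, ['2015-07-20', '2015-07-24'])
    rows = cursor.fetchall()   # list of (date, count) tuples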

Django - Oracle backend error

I have the following model in Django:
class Event(models.Model):
    # some fields
    start_date = models.DateField()
    end_date = models.DateField()
I'm using Oracle 10g Database with Django 1.5 and cx_oracle 5.1.2. The issue here is when I try to create a new object in the admin interface (picking dates from the calendar), the following error is raised:
ORA-01843: not a valid month
syncdb has created a DATE field in oracle for start_date and end_date. Does this look like a backend bug or am I doing something wrong?
I do have other models with DateTimeField() and they work fine when I persist new objects, the issue looks related to DateField itself.
UPDATE: I have checked the backend implementation, and in backends/oracle/base.py lines 513 to 516:
cursor.execute(
"ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD HH24:MI:SS'"
" NLS_TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS.FF'"
+ (" TIME_ZONE = 'UTC'" if settings.USE_TZ else ''))
Executing this statement allows an insert statement to have literal values for DATE fields. I have checked the query generated by the backend and it is inserting '2013-03-20' in start_date and end_date. The date matches NLS_DATE_FORMAT, so this in theory should work!
UPDATE: I believe my case is related to cx_oracle.
UPDATE: Since I still don't have a definite answer (although I'm almost sure it's cx_oracle that's causing this issue), I changed my DateField into a DateTimeField which translates into oracle's TIMESTAMP and works perfectly fine.
Based on jtiai's problem description, I made the following workaround: before calling any problematic SQL (e.g. Oracle 10.5.0.2 and 11.2.0.1, cx_oracle 5.1.2), reset NLS_DATE_FORMAT/NLS_TIMESTAMP_FORMAT again. This is done in django/db/backends/oracle/base.py in the method def execute(...):
--- base.py 2013-10-31 12:19:24.000000000 +0100
+++ base_new.py 2013-10-31 12:20:32.000000000 +0100
@@ -707,6 +707,18 @@
query = convert_unicode(query % tuple(args), self.charset)
self._guess_input_sizes([params])
try:
+ # BUG-WORKAROUND: ulr1-131031
+ # https://stackoverflow.com/a/17269719/565525
+ # It's actually a bug in the Oracle 10.5.0.2 and 11.2.0.1. Bug can be reproduced as following:
+ # - set NLS_TIMESTAMP_FORMAT in session.
+ # - Run any implicit or explicit TO_DATE conversion with unicode data.
+ # - **Next implicit or explicit TO_TIMESTAMP with unicode data will trigger internal reset of timestamp format.**
+ # - All consecutive TO_TIMESTAMP will fail and TO_CHAR of timestamp will produce invalid output.
+ self.cursor.execute(
+ "ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD HH24:MI:SS'"
+ " NLS_TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS.FF'"
+ + (" TIME_ZONE = 'UTC'" if settings.USE_TZ else ''))
+
return self.cursor.execute(query, self._param_generator(params))
except Database.IntegrityError as e:
six.reraise(utils.IntegrityError, utils.IntegrityError(*tuple(e.args)), sys.exc_info()[2])
The cause of the error is that you entered a date, but the month portion of the date was not a valid month. Oracle gives resolutions for this problem.
1 - Re-enter the date value using either a MONTH or MON format mask. The valid values for month are:
January
February
March
.......
// and so on
OR
Jan
Feb
Mar
.......
// and so on
2 - If the above resolution fails, use the to_date function instead.
to_date( string1, [ format_mask ], [ nls_language ] )
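For illustration, a minimal cx_Oracle sketch of that second option (the connection string and the app_event table/columns are placeholders):

import cx_Oracle

conn = cx_Oracle.connect('user/password@localhost/XE')   # placeholder credentials
cur = conn.cursor()

# Spell out the format mask instead of relying on the session NLS_DATE_FORMAT.
cur.execute(
    "INSERT INTO app_event (start_date, end_date) "
    "VALUES (TO_DATE(:1, 'YYYY-MM-DD'), TO_DATE(:2, 'YYYY-MM-DD'))",
    ['2013-03-20', '2013-03-27'])
conn.commit()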

How to use AWS Glue / Spark to convert CSVs partitioned and split in S3 to partitioned and split Parquet

In AWS Glue's catalog, I have an external table defined with partitions that looks roughly like this in S3 and partitions for new dates are added daily:
s3://my-data-lake/test-table/
2017/01/01/
part-0000-blah.csv.gz
.
.
part-8000-blah.csv.gz
2017/01/02/
part-0000-blah.csv.gz
.
.
part-7666-blah.csv.gz
How could I use Glue/Spark to convert this to Parquet that is also partitioned by date and split across n files per day? The examples don't cover partitioning, splitting, or provisioning (how many nodes and how big). Each day contains a couple hundred GBs.
Because the source CSVs are not necessarily in the right partitions (wrong date) and are inconsistent in size, I'm hoping to write to partitioned Parquet with the right partition and a more consistent size.
Since the source CSV files are not necessarily associated with the right date, you could add to them additional information regarding the collection date/time (or use any date if already available):
{"collectDateTime": {
"timestamp": 1518091828,
"timestampMs": 1518091828116,
"day": 8,
"month": 2,
"year": 2018
}}
Then your job could use this information in the output DynamicFrame and ultimately use them as partitions. Some sample code of how to achieve this:
from awsglue.transforms import *
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.types import *
import sys
import datetime

###
# CREATE THE NEW SIMPLIFIED LINE
##
def create_simplified_line(event_dict):
    # collect date time
    collect_date_time_dict = event_dict["collectDateTime"]

    new_line = {
        # TODO: COPY YOUR DATA HERE
        "myData": event_dict["myData"],
        "someOtherData": event_dict["someOtherData"],
        "timestamp": collect_date_time_dict["timestamp"],
        # note: long() is Python 2; use int() on Python 3 Glue jobs
        "timestampmilliseconds": long(collect_date_time_dict["timestamp"]) * 1000,
        "year": collect_date_time_dict["year"],
        "month": collect_date_time_dict["month"],
        "day": collect_date_time_dict["day"]
    }
    return new_line

###
# MAIN FUNCTION
##

# context
glueContext = GlueContext(SparkContext.getOrCreate())

# fetch from previous day source bucket
previous_date = datetime.datetime.utcnow() - datetime.timedelta(days=1)

# build s3 paths
s3_path = "s3://source-bucket/path/year={}/month={}/day={}/".format(previous_date.year, previous_date.month, previous_date.day)

# create dynamic_frame
dynamic_frame = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options={"paths": [s3_path]}, format="json", format_options={}, transformation_ctx="dynamic_frame")

# resolve choices (optional)
dynamic_frame_resolved = ResolveChoice.apply(frame=dynamic_frame, choice="project:double", transformation_ctx="dynamic_frame_resolved")

# transform the source dynamic frame into a simplified version
result_frame = Map.apply(frame=dynamic_frame_resolved, f=create_simplified_line)

# write to simple storage service in parquet format
glueContext.write_dynamic_frame.from_options(frame=result_frame, connection_type="s3", connection_options={"path": "s3://target-bucket/path/", "partitionKeys": ["year", "month", "day"]}, format="parquet")
Did not test it, but the script is just a sample of how to achieve this and is fairly straightforward.
UPDATE
1) As for having specific file sizes/numbers in output partitions,
Spark's coalesce and repartition features are not yet implemented in Glue's Python API (only in Scala).
You can convert your dynamic frame into a data frame and leverage Spark's partition capabilities.
Convert to a dataframe and partition based on "partition_col":
partitioned_dataframe = datasource0.toDF().repartition(1)
Convert back to a DynamicFrame for further processing (DynamicFrame comes from awsglue.dynamicframe):
partitioned_dynamicframe = DynamicFrame.fromDF(partitioned_dataframe, glueContext, "partitioned_df")
The good news is that Glue has an interesting feature: if you have more than 50,000 input files per partition, it will automatically group them for you.
In case you want to specifically set this behavior regardless of input files number (your case), you may set the following connection_options while "creating a dynamic frame from options":
dynamic_frame = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options={"paths": [s3_path], 'groupFiles': 'inPartition', 'groupSize': 1024 * 1024}, format="json", format_options={}, transformation_ctx="dynamic_frame")
In the example above, it would attempt to group files into 1 MB groups.
It is worth mentioning that this is not the same as coalesce, but it may help if your goal is to reduce the number of files per partition.
2) If files already exist in the destination, will it just safely add it (not overwrite or delete)
Glue's default SaveMode for write_dynamic_frame.from_options is to append.
When saving a DataFrame to a data source, if data/table already
exists, contents of the DataFrame are expected to be appended to
existing data.
3) Given each source partition may be 30-100GB, what's a guideline for # of DPUs
I'm afraid I won't be able to answer that. It depends on how fast it'll load your input files (size/number), your script's transformations, etc.
Import the datetime library
import datetime
Split the timestamp based on partition conditions
now = datetime.datetime.now()
year = str(now.year)
month = str(now.month)
day = str(now.day)
currdate = "s3://Destination/" + year + "/" + month + "/" + day
Add the variable currdate to the path address in the writer class. The results will be partitioned Parquet files.
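For example, the write step could look roughly like this, reusing the write_dynamic_frame call from the Glue script above (result_frame stands in for whatever DynamicFrame the job produced):

glueContext.write_dynamic_frame.from_options(
    frame=result_frame,                      # DynamicFrame built earlier in the job
    connection_type="s3",
    connection_options={"path": currdate},   # s3://Destination/<year>/<month>/<day>
    format="parquet")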