How to restrict access to airflow.models? - airflow-scheduler

I have an Airflow instance with many tenants that have DAGs. They want to extract metadata about their DAG runs, like DagRun.end_date. However, I want to restrict each tenant so that they can only access data related to their own DAG runs and cannot access data from other tenants' DAG runs. How can this be done?
This is what I imagine the DAG would look like:
# custom macro function
def get_last_dag_run(dag):
    last_dag_run = dag.get_last_dagrun()
    if last_dag_run is None:  # no run has completed yet
        return None
    return last_dag_run.end_date
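More concretely, here is roughly how I would register it as a user-defined macro and use it from a template (the DAG id and task are just placeholders for this question; import paths assume Airflow 2.x):

# placeholder sketch, not production code
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator


def get_last_dag_run(dag):
    last_dag_run = dag.get_last_dagrun()
    return last_dag_run.end_date if last_dag_run else None


with DAG(
    dag_id="tenant_a_reporting",                     # placeholder
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    user_defined_macros={"get_last_dag_run": get_last_dag_run},
) as dag:
    # The macro is only rendered for templates in this DAG, but the function
    # itself can query any DAG's runs via airflow.models -- which is exactly
    # what I want to restrict.
    BashOperator(
        task_id="print_last_end_date",
        bash_command="echo '{{ get_last_dag_run(dag) }}'",
    )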
I found these resources, which explain how to extract the data but not how to restrict access to it:
Getting the date of the most recent successful DAG execution
Apache airflow macro to get last dag run execution time
How to get last two successful execution dates of Airflow job?
how to get latest execution time of a dag run in airflow
How to find the start date and end date of a particular task in dag in airflow?
How to get dag status like running or success or failure

NB: I am a contributor to Airflow.
This is not possible with the current Airflow architecture.
We are slowly working to make Airflow multi-tenant capable, but for now we are about half-way there, and I believe it will take several more major releases to get there.
Currently the only way to isolate tenants is to give every tenant a separate Airflow instance, which is not as bad as you might initially think. If you run them in separate namespaces on the same auto-scaling Kubernetes cluster, add KEDA autoscaling, and use the same database server (but give each tenant a separate schema), this can be quite efficient (especially if you use Terraform to set up and tear down such Airflow instances, for example).

Related

How to get dataflow job id from inside that dataflow job - JAVA

In my current architecture, multiple Dataflow jobs are triggered at various stages. As part of the ABC framework, I need to capture the job IDs of those jobs as audit metrics inside the Dataflow pipeline and write them to BigQuery.
How do I get the run ID of a Dataflow job from within the pipeline using Java?
Is there an existing method I can use for that, or do I need to use Google Cloud's client library inside the pipeline?
If you are submitting to dataflow, I believe this might work:
DataflowPipelineJob result = (DataflowPipelineJob) pipeline.run();
result.getJobId();
But as far as I know, you cannot access that from within the pipeline itself (DoFns etc.).
The best way to ensure you know your job ID/name is to set it yourself. You can do this by setting --jobName; it is then accessible via options.getJobName(), and Dataflow will use it. Note that it must be unique.

How airflow loads/updates DagBag from dags home folder on google cloud platform?

Please do not downvote my post. If needed I will update and correct my wording. I have done my homework research; I am a little new and trying to understand this.
I would like to understand how Airflow on Google Cloud Platform picks up changes from the DAGs home folder and reflects them in the UI. Please also help me with my DAGs setup script. I have read many answers along with books; the book link is here.
I tried figuring out my answer from page 69, which says:
3.11 Scheduling & Triggers: The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.
My understanding from this book is that the scheduler regularly picks up changes from the DAGs home folder. (Is that correct?)
I also read multiple answers on Stack Overflow; I found this one useful: Link
But that answer still does not describe the process that creates/updates the DagBag from script.py in the DAG home folder, or how changes are detected.
Please help me with my dags setup script.
We have created a generic Python script that dynamically creates DAGs by reading/iterating over config files.
Below is the directory structure:
/dags/workflow/
/dags/workflow/config/dag_a.json
/dags/workflow/config/dag_b.json
/dags/workflow/task_a_with_single_operator.py
/dags/workflow/task_b_with_single_operator.py
/dags/dag_creater.py
The execution flow of dag_creater.py is as follows (an illustrative placeholder sketch follows the steps):
1. Iterate over the dags/workflow/config folder, get each config JSON file, and read the variable dag_id.
2. Create Parent_dag = DAG(dag_id=dag_id, start_date=start_date, schedule_interval=schedule_interval, default_args=default_args, catchup=False)
3. Read the tasks and their dependencies for that dag_id from the config JSON file (example: [[a,[]],[b,[a]],[c,[b]]]) and code them as task_a >> task_b >> task_c
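Illustratively (placeholder names only, since I cannot share our real code), dag_creater.py does roughly the following:

# dag_creater.py -- illustrative placeholder sketch, not our real code
import glob
import json
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator  # stand-in for our single-operator tasks

CONFIG_GLOB = os.path.join(os.path.dirname(__file__), "workflow", "config", "*.json")

for config_path in glob.glob(CONFIG_GLOB):
    with open(config_path) as f:
        config = json.load(f)

    dag = DAG(
        dag_id=config["dag_id"],
        start_date=datetime(2019, 1, 1),                    # placeholder
        schedule_interval=config.get("schedule_interval"),
        default_args={"owner": "airflow"},                  # placeholder
        catchup=False,
    )

    # config["tasks"] is assumed to look like [["a", []], ["b", ["a"]], ["c", ["b"]]]
    tasks = {task_id: DummyOperator(task_id=task_id, dag=dag)
             for task_id, _ in config["tasks"]}
    for task_id, upstream_ids in config["tasks"]:
        for upstream_id in upstream_ids:
            tasks[upstream_id] >> tasks[task_id]

    # Expose each DAG at module level so the scheduler's DagBag picks it up.
    globals()[config["dag_id"]] = dag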
This way the DAG is created. Everything works fine; the DAGs are also visible in the UI and run fine.
But the problem is that my DAG creation script runs every time. Even in each task's logs I see the logs of all the DAGs. I expected this script to run only once, just to fill the entries in the metadata database. I am unable to understand why it runs every time.
Please help me understand the process.
I know airflow initdb is run once, when we first set up the metadata database, so that is not what performs this update all the time.
Is it the scheduler heartbeat that updates everything?
Is my setup correct?
Please note: I can't post real code, as that is a restriction from my organization. However, if asked, I will provide more information.
The Airflow scheduler runs continuously in the Airflow runtime environment; it is the main component that monitors changes in the DAG folder and triggers the relevant DAG tasks residing in this folder. The main settings for the scheduler service can be found in the airflow.cfg file, essentially the heartbeat intervals, which affect general DAG task maintenance.
However, how a particular task is executed is defined by the executor configured in the Airflow configuration.
To make DAGs available to the Airflow runtime environment, GCP Composer uses Cloud Storage with a specific folder structure: any object with a *.py extension that arrives in the /dags folder is synchronized and checked for a DAG definition.
If you expect to run the DAG-generating script within the Airflow runtime, then in this particular use case I would advise you to look at PythonOperator, using it in a separate DAG to invoke and execute your custom generic Python code with a schedule that guarantees it runs only once. You can check out this Stack thread for implementation details.
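For illustration, a one-off DAG along those lines might look like this (function and task names are placeholders, not anything from your setup):

# Minimal sketch of the "run the generic script once" idea described above.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def run_generic_setup():
    # Call your custom generic code here (the work that should not be
    # repeated on every scheduler parse of the DAG folder).
    pass


with DAG(
    dag_id="one_time_dag_setup",          # placeholder
    start_date=datetime(2019, 1, 1),
    schedule_interval="@once",            # exactly one DagRun, then never again
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_generic_script",
        python_callable=run_generic_setup,
    )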

Way to trigger dataflow only after Big Query Job finished

Actually, these are the steps my data goes through:
New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load this data into BigQuery.
I need a low-cost solution to know when this BigQuery job is finished and to trigger a Dataflow pipeline only after the job is completed.
Notes:
I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know if it is a good idea: from what I saw, this trigger uses the job ID, which cannot be fixed, so apparently the function would have to be redeployed whenever a job runs. And of course it's an alpha solution.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log that indicates that the job finished.
My files are large, so it isn't a good idea to have a Google Cloud Function wait until the job finishes.
Despite your comment about Stackdriver Logging, you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can additionally add a dataset filter if needed.
Then create a sink (to Pub/Sub) on this advanced filter, trigger your Cloud Function from it, and run your Dataflow job.
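For illustration only, the Cloud Function triggered by that sink could launch a templated Dataflow job roughly like this (a sketch assuming the google-api-python-client library; project, region, template path and parameter names are placeholders, not anything from your setup):

# Rough sketch of a Pub/Sub-triggered Cloud Function (Python) that launches a
# templated Dataflow job once the BigQuery load job completes.
from googleapiclient.discovery import build


def launch_dataflow(event, context):
    # `event` carries the exported log entry; you could inspect it here to
    # filter on dataset/table before launching anything.
    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().locations().templates().launch(
        projectId="my-project",                          # placeholder
        location="us-central1",                          # placeholder
        gcsPath="gs://my-bucket/templates/my-template",  # placeholder template path
        body={
            "jobName": "post-bq-load-job",
            "parameters": {"inputTable": "my-project:my_dataset.my_table"},  # placeholder
        },
    )
    response = request.execute()
    print(response)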
If this doesn't match your expectation, can you detail why?
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs in a sequential fashion. Composer runs a DAG, executes each node of the DAG, and checks dependencies to ensure that things run either in parallel or sequentially based on the conditions you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
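As a hedged sketch of that pattern (operator import paths assume a recent apache-airflow-providers-google package; all project, bucket and table names are placeholders), the DAG could chain the BigQuery load and the Dataflow job like this:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

with DAG(
    dag_id="gcs_to_bq_then_dataflow",     # placeholder
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,               # trigger externally, e.g. from the GCS event
    catchup=False,
) as dag:
    load_to_bq = BigQueryInsertJobOperator(
        task_id="load_to_bq",
        configuration={
            "load": {
                "sourceUris": ["gs://my-bucket/incoming/*.csv"],   # placeholder
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "my_dataset",
                    "tableId": "my_table",
                },
                "sourceFormat": "CSV",
                "writeDisposition": "WRITE_APPEND",
            }
        },
    )

    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow",
        template="gs://my-bucket/templates/my-template",   # placeholder
        parameters={"inputTable": "my-project:my_dataset.my_table"},
        location="us-central1",
    )

    # The Dataflow task starts only after the BigQuery load job has succeeded.
    load_to_bq >> run_dataflow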

Google Dataprep: Scheduling with updated data source

Is there a way to trigger a Dataprep flow on a GCS (Google Cloud Storage) file upload? Or, at least, is it possible to make Dataprep run each day and take the newest file from a certain directory in GCS?
It should be possible, because otherwise what is the point of scheduling? Running the same job over the same data source with the same output?
It seems this product is very immature at the moment, so no API endpoint exists to run a job in this service. It is only possible to run a job in the UI.
In general, this is a pattern that is typically used for running jobs on a schedule. Maybe at some point the service will allow you to publish into the "queue" that Run Job already uses.

Google App Engine Parse Logs in DataStore Save to Table

I am new to GAE and am trying to quickly find a way to retrieve logs from Datastore, clean them to my specs, and then save them to a table to be called on later for a reports view in my app. I was thinking of using Google Dataflow and creating batch jobs (the app is Python/Django), but the documentation does not seem to fit my use case, so maybe Dataflow is not the answer. I could create a Python script with BigQuery and schedule it through cron, but then I would have to handle errors myself, and it seems there should be a faster way to solve this problem.
Any help/thoughts/suggestions is always greatly appreciated.
You can use the Dataflow/Beam Python SDK to develop a pipeline that reads entities from Datastore [1], transforms the data, and writes a table to BigQuery [2]. To schedule this job to run regularly you'll have to use a third-party mechanism such as a cron job. Note that Dataflow performs automatic scaling and retries to handle errors, so you are not expected to address these complexities manually.
[1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/datastore/v1/datastoreio.py
[2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
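For illustration, a minimal sketch of such a pipeline might look like the following (it uses the current v1new Datastore IO rather than the v1 module linked in [1]; the kind, field, bucket and table names are placeholders, not taken from your app):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

PROJECT = "my-project"  # placeholder


def entity_to_row(entity):
    # Clean/shape each Datastore entity into the row layout wanted in BigQuery.
    props = entity.properties
    return {
        "user": props.get("user"),           # placeholder fields
        "message": props.get("message"),
        "logged_at": props.get("timestamp"),
    }


def run():
    options = PipelineOptions(project=PROJECT, temp_location="gs://my-bucket/tmp")
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadLogs" >> ReadFromDatastore(Query(kind="LogEntry", project=PROJECT))
            | "CleanRows" >> beam.Map(entity_to_row)
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-project:reports.cleaned_logs",   # placeholder table
                schema="user:STRING,message:STRING,logged_at:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()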