map reduce directly on Airflow - mapreduce

What is the downside of implementing map-reduce directly on Airflow?
I can dynamically create operators for map and reduce when creating the DAG for Airflow.

Airflow >= 2.3.0:
Support for map-reduce-like workflows was added via an AIP (Airflow Improvement Proposal); see AIP-42 Dynamic Task Mapping.
You can use mapped tasks to achieve this; see the sketch below.
Airflow < 2.3.0:
Airflow does not support map-reduce.
You can still create tasks and DAGs dynamically, but not in a map-reduce manner; see the docs.
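A minimal sketch of a map/reduce-style DAG with dynamic task mapping and the TaskFlow API (the DAG id, tasks, and data below are made up for illustration):

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval=None, start_date=datetime(2022, 1, 1), catchup=False)
def mapreduce_example():
    @task
    def make_inputs():
        # produce the list of items to map over
        return [1, 2, 3, 4]

    @task
    def mapper(x):
        # "map" step: one task instance is created per input item
        return x * x

    @task
    def reducer(values):
        # "reduce" step: receives the results of all mapped task instances
        return sum(values)

    reducer(mapper.expand(x=make_inputs()))

mapreduce_example()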

Airflow now supports dynamic task mapping as of its 2.3.0 release.
See the release announcement for more info: Apache Airflow 2.3.0 is here

Related

Specify specific worker configuration in GCP Dataflow

Is it possible to specify the configuration I want for a single Dataflow worker? It seems like the default has 4 cores and 15 GB of memory, which is more than enough. How can I downsize it, or is this the smallest worker size offered?
Per the Workers section of the Dataflow > How-to Guides > Deploying a Pipeline page, you can specify a custom machine type (with different cores or memory) with the --worker_machine_type option.
You can also see the other Dataflow worker-related options in the docs/source code for the WorkerOptions class, which parses the various worker-related command-line options. Some of the other options listed here include: disk_size_gb, worker_disk_type.
Tangentially related: Additional GCP-related Dataflow options are handled by the GoogleCloudOptions class.
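For example, with the Beam Python SDK these can be passed as pipeline options; the project, region, and bucket below are placeholders:

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                 # placeholder project id
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",   # placeholder bucket
    "--worker_machine_type=n1-standard-1",  # smaller than the default machine type
    "--disk_size_gb=30",
])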

How to restrict access to airflow.models?

I have an Airflow instance with many tenants that have DAGs. They want to extract metadata on their DAG runs, such as DagRun.end_date. However, I want to restrict each tenant so they can only access data related to their own DAG runs and cannot access data from other tenants' DAG runs. How can this be done?
This is what I imagine the DAG to look like
# custom macro function
def get_last_dag_run(dag):
    last_dag_run = dag.get_last_dagrun()
    return last_dag_run.end_date
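To make the snippet concrete, such a macro would roughly be registered on the DAG like this (the DAG id and schedule are placeholders):

from datetime import datetime
from airflow import DAG

dag = DAG(
    "tenant_dag",                     # placeholder DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",       # placeholder schedule
    user_defined_macros={"get_last_dag_run": get_last_dag_run},
)
# a templated operator field could then use: {{ get_last_dag_run(dag) }}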
I found these resources which explain how to extract data but not how to restrict it.
Getting the date of the most recent successful DAG execution
Apache airflow macro to get last dag run execution time
How to get last two successful execution dates of Airflow job?
how to get latest execution time of a dag run in airflow
How to find the start date and end date of a particular task in dag in airflow?
How to get dag status like running or success or failure
NB: I am a contributor to Airflow.
This is not possible with the current Airflow architecture.
We are slowly working to make Airflow multi-tenant capable, but we are only about halfway there, and I believe it will take several more major releases.
Currently the only way to isolate tenants is to give each tenant a separate Airflow instance, which is not as bad as you might initially think. If you run them in separate namespaces on the same auto-scaling Kubernetes cluster, add KEDA autoscaling, and use the same database server (with a separate schema per tenant), this can be quite efficient, especially if you use Terraform to set up and tear down such Airflow instances.

How to write a file in a GCS bucket using Airflow

New to Airflow here.
I have Python code that reads a BigQuery table, applies some transformations as a pandas DataFrame, and saves the result as a file.
Using Airflow, I need a DAG that executes my code and saves the file to a Google Cloud Storage bucket.
Airflow is deployed on Composer.
How am I supposed to do that?
If your transformation can be expressed in BigQuery SQL, you can use the BigQuery-to-GCS operator:
https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/transfers/bigquery_to_gcs/index.html
Examples here:
https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/example_dags/example_bigquery_to_gcs.py
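A minimal sketch of using it (the table and bucket names are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

with DAG("bq_to_gcs_example", start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False) as dag:
    export_table = BigQueryToGCSOperator(
        task_id="export_table",
        source_project_dataset_table="my-project.my_dataset.my_table",          # placeholder
        destination_cloud_storage_uris=["gs://my-bucket/exports/part-*.csv"],   # placeholder
        export_format="CSV",
    )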
If you need to do a more complex transformation for which there is no external service you can orchestrate, create a custom operator that uses the BigQuery hook and the GCS hook and does what you want. It is easier than you think; just take a look at the BigQueryToGCS operator and you will see that it is rather straightforward.
https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/transfers/bigquery_to_gcs.py
Airflow is all Python, so it does not really matter much whether you compose existing operators in a DAG or write your own operators (and then compose them); it's all Python code. Airflow provides hook abstractions specifically to hide the complexity of communicating with external services, while letting you, as the DAG/operator author, write operator code that uses the hooks and adds some extra operations on top.
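A rough sketch of such a custom operator; the class name, the transformation, and the connection id are illustrative assumptions:

from airflow.models import BaseOperator
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.providers.google.cloud.hooks.gcs import GCSHook

class BigQueryPandasToGCSOperator(BaseOperator):
    # Hypothetical operator: read query results into pandas, transform, upload to GCS.
    def __init__(self, *, sql, bucket, object_name, gcp_conn_id="google_cloud_default", **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.bucket = bucket
        self.object_name = object_name
        self.gcp_conn_id = gcp_conn_id

    def execute(self, context):
        bq_hook = BigQueryHook(gcp_conn_id=self.gcp_conn_id, use_legacy_sql=False)
        df = bq_hook.get_pandas_df(sql=self.sql)   # query results as a pandas DataFrame
        df["processed"] = True                     # placeholder transformation
        gcs_hook = GCSHook(gcp_conn_id=self.gcp_conn_id)
        gcs_hook.upload(
            bucket_name=self.bucket,
            object_name=self.object_name,
            data=df.to_csv(index=False),           # upload the CSV contents as an object
        )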

What is the proper way to use Google Pub/Sub with Flink Streaming using Dataproc?

I'm trying to figure out the proper way to run Apache Flink on Dataproc and use Google Pub/Sub as a source/sink. When I create a Dataproc cluster with the most recent image version (1.4) and apply the Flink initialization action, Flink 1.6.4 gets installed.
The problem is that flink-connector-gcp-pubsub is only available starting from Flink version 1.9.0.
So my question is: what is the proper way to use all of this together? Should I build my own GCE image with the latest Flink? Does one already exist?
As you already said, flink-connector-gcp-pubsub is only available from Flink 1.9.0, so you have two options:
Either implement the connector yourself
Or build your own image based on the Flink initialization action
I would not recommend implementing the connector, as it is a complex task that requires an in-depth understanding of Flink, while building your own image should be relatively easy given the existing example for Flink 1.6.4.
I solved this problem by running Flink 1.9.0 in Kubernetes. This way I do not depend on anybody and can run whatever version I need.

Configuring Spark on EMR

When you pick a more performant node, say an r3.xlarge vs. an m3.xlarge, will Spark automatically utilize the additional resources, or is this something you need to configure and tune manually?
As far as configuration goes, which are the most important values to tune to get the most out of your cluster?
It will try...
AWS has a setting you can enable in your EMR cluster configuration that will attempt to do this: spark.dynamicAllocation.enabled. In the past there were issues where this setting gave too many resources to Spark; in newer releases the amount given to Spark has been lowered. However, if you are using PySpark, it will not take Python's resource requirements into account.
I typically disable dynamicAllocation and set the appropriate memory and core settings from my own code, based on which instance type is selected (see the sketch at the end of this answer).
This page discusses what defaults they will select for you:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html
If you do it manually, at a minimum you will want to set:
spark.executor.memory
spark.executor.cores
Also, you may need to adjust the yarn container size limits with:
yarn.scheduler.maximum-allocation-mb
yarn.scheduler.minimum-allocation-mb
yarn.nodemanager.resource.memory-mb
Make sure you leave a core and some RAM for the OS, plus RAM for Python if you are using PySpark.
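A minimal sketch of that manual approach in PySpark; the numbers are illustrative for a single r3.xlarge worker (4 vCPUs, about 30 GB RAM), not recommendations:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("manual-executor-sizing")
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.cores", "3")      # leave one core for the OS and YARN daemons
    .config("spark.executor.memory", "20g")   # leave headroom for the OS and Python workers
    .getOrCreate()
)
# The yarn.scheduler.* and yarn.nodemanager.* limits are cluster-side settings
# and are configured on the EMR cluster itself, not from Spark code.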