What is the best Airflow architecture for AWS EMR clusters?

I have an AWS EMR cluster with 1 master node, 30 core nodes, and some auto-scaled task nodes.
Currently, hundreds of Hive and MySQL jobs are run on the cluster by Oozie, and I'm going to migrate some of them from Oozie to Airflow.
I googled how to apply Airflow to my cluster.
I found out that every DAG is supposed to be located on every node and that an Airflow worker must be installed on all nodes.
But my DAGs will be updated frequently and new DAGs will be added frequently, while the cluster has about 100 nodes, some of which are auto-scaled.
And, as you know, only the master node has the Hive/MySQL applications on the cluster.
So I am very confused.
Can anyone suggest an Airflow architecture that fits my EMR cluster?

Airflow worker nodes are not the same as EMR nodes.
In a typical setup, a Celery worker (an "Airflow worker node") reads from a queue of jobs and executes them using the appropriate operator (in this case probably a SparkSubmitOperator, or possibly an SSHOperator).
Celery workers would not run on your EMR nodes, as those are dedicated to running Hadoop jobs; they would typically run on EC2 instances outside of your EMR cluster.
One common solution to having the same DAGs on every Celery worker is to put the DAGs on network storage (such as EFS) and mount that network drive on the Celery worker EC2 instances.
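
Since only the master node has the Hive client, a minimal sketch of such a DAG could use an SSHOperator pointed at the EMR master. Everything below is illustrative: the connection ID, DAG name, and script path are hypothetical, and the import path assumes the Airflow 2 SSH provider package is installed.

from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="emr_hive_job",              # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run a Hive script on the EMR master node, where the Hive client lives.
    run_hive = SSHOperator(
        task_id="run_hive_query",
        ssh_conn_id="emr_master_ssh",   # hypothetical Airflow SSH connection
        command="hive -f /home/hadoop/jobs/daily_agg.hql",  # hypothetical script path
    )

With this layout, the Celery worker only needs SSH access to the master node; Hive itself stays on EMR.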

Related

How to run some batch script on all MWAA Airflow worker nodes?

Is there a way to run a specific batch script on every worker node that MWAA spins up?
Is there any feature in Airflow that can do this?
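
For what it's worth, newer MWAA releases support an environment-level startup script: a shell file in the environment's S3 bucket that MWAA runs on every worker, scheduler, and webserver container as it starts. A hedged sketch follows; the script name and environment name are placeholders, and the CLI flag is my reading of the UpdateEnvironment API's StartupScriptS3Path parameter, so verify it against the current docs.

#!/bin/sh
# startup.sh (hypothetical name) -- MWAA runs this on each component at startup
echo "install OS packages or export environment variables here"

Then point the environment at the uploaded script, e.g.:

aws mwaa update-environment --name my-mwaa-env --startup-script-s3-path startup.sh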

Spark on EMR | EKS | YARN

I am migrating from on-premises to the AWS stack, and I am often confused about how Apache Spark works on AWS and similar platforms.
Let me share my current understanding of on-premises Spark running on YARN. When an application is submitted to the Spark cluster, an application master is created on one of the data nodes (as a container), and it takes care of the application by spawning executor tasks on the data nodes. This means the Spark code is deployed to the nodes where the data resides, so there is less network transfer. Moreover, this is logical and easy to visualize (at least to me).
But suppose the same Spark application runs on AWS: it fetches the data from S3 and runs on top of EKS. As I understand it, the Spark driver and the executor tasks are spawned in Kubernetes pods.
- Does this mean the data has to be transferred over the network from S3 to the EKS node where each executor pod is spawned?
I have seen some videos that use EMR on top of EKS, but I am a little confused here.
- Since EMR provides the Spark runtime, why do we use EKS here? Can't we run EMR alone for Spark applications in an actual production environment? (I know that EKS can be a replacement for YARN in the Spark world.)
- Can't we run Spark on top of EKS without using EMR? (I am thinking of EMR as a cluster in which the Spark drivers and executors can run.)
Edit: this is a question about Spark-on-Kubernetes integration in general, not specific to AWS.
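
On the last point: yes, Spark can run on Kubernetes without EMR, using spark-submit's native Kubernetes master (EMR on EKS mainly layers AWS's optimized Spark runtime, images, and IAM/logging integration on top of that). A minimal sketch, with the API server URL and container image as placeholders:

spark-submit \
  --master k8s://https://<k8s-apiserver-host>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/spark/examples/jars/spark-examples.jar

The executors then read their S3 splits over the network, which is the usual trade-off of separating storage from compute.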

Airflow cluster: Is it needed to deploy DAGs / Workflows in all the workers?

We are planning to update Airflow and switch from a single Airflow server to an Airflow cluster (on AWS).
We've been checking this article and this one.
We are using SQS as the queue service, and although the documentation says we only need to deploy our DAG .py files on the masters, we wonder whether this is correct, since the communication through the queues doesn't include the code.
In our tests, our DAGs do not work unless we deploy them on all nodes, workers and masters alike.
So, what should we do?
Many thanks!
Your DAGs need to be synced across all workers for this to work: the scheduler dispatches each task to whichever worker is available, and that worker parses its local copy of the DAG file in order to execute the task. If the DAGs are not synced across all workers, an older copy of the DAG may be run.
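
A common way to keep those copies identical is to publish DAGs to shared storage and pull them on every node. A hedged sketch using S3 (bucket name and paths are placeholders), run from cron on each master and worker:

* * * * * aws s3 sync s3://my-airflow-bucket/dags /home/airflow/dags --delete

Mounting the DAG folder from a shared EFS volume works too and avoids the sync delay.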

Does AWS ECS internally maintain a task queue?

We are in the development phase, so we are using an AWS ECS cluster consisting of 2 EC2 instances, and we submit tasks to the ECS cluster using Airflow's ECSOperator. We are looking to scale this process, so we are going to use Airflow's CeleryExecutor, which submits and schedules tasks on Airflow concurrently.
So the question is: do we need to care about the number of tasks submitted to ECS, or will ECS service any number of submitted tasks without failure via some internal queuing mechanism?
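
For reference, a minimal sketch of the submission side, assuming the Amazon provider package (the class was later renamed EcsRunTaskOperator; all names below are hypothetical). Note that ECSOperator calls RunTask and then polls until the ECS task stops, so any queuing happens on the Airflow/Celery side, not inside ECS:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import ECSOperator

with DAG(
    dag_id="ecs_batch_job",               # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit = ECSOperator(
        task_id="run_on_ecs",
        cluster="dev-cluster",            # hypothetical ECS cluster name
        task_definition="my-task-def",    # hypothetical task definition
        launch_type="EC2",
        overrides={"containerOverrides": []},  # no per-run overrides
    )

With the EC2 launch type, RunTask fails immediately when the 2 instances have no spare capacity rather than queuing, so concurrency limits on the Airflow side do matter.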

How do I get Spark installed on Amazon's EMR core/worker nodes/instances while creating the cluster?

I am trying to launch an EMR cluster with the Spark (1.6.0) and Hadoop (Distribution: Amazon 2.7.1) applications. The release label is emr-4.4.0.
The cluster gets set up as needed, but it doesn't run a Spark master (on the master instances) as a daemon process, and I also cannot find Spark installed on the worker (core) instances (the Spark dir under /usr/lib/ has just the lib and yarn directories).
I'd like the Spark master and worker daemons to run as soon as the cluster has been set up (i.e., workers connect to the master automatically and become part of the Spark cluster).
How do I achieve this? Or am I missing something?
Thanks in advance!
Spark on EMR is installed in YARN mode; that is why you cannot see standalone master and slave daemons. http://spark.apache.org/docs/latest/running-on-yarn.html
Standalone Spark master and worker daemons are spawned only in Spark standalone mode. http://spark.apache.org/docs/latest/spark-standalone.html
Now, if you do want to run Spark masters and workers on EMR anyway, you can do so using
/usr/lib/spark/sbin/start-master.sh
/usr/lib/spark/sbin/start-slave.sh
and configuring accordingly.
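
For completeness, a hedged sketch of wiring the daemons together (the master hostname is a placeholder; 7077 is the standalone master's default port):

# on the master instance
/usr/lib/spark/sbin/start-master.sh

# on each core instance, pointing the worker at the master
/usr/lib/spark/sbin/start-slave.sh spark://<master-private-dns>:7077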