EMR Serverless Airflow Operator not allowing EMR custom images - amazon-web-services

I want to launch a Spark job on EMR Serverless from Airflow. I want to use Spark 3.3.0 and Scala 2.13, but the EMR 6.9.0 release ships with Scala 2.12. I created a fat JAR including all Spark dependencies, but that doesn't work either. As an alternative, I am trying to use an EMR custom image by creating the application with --image-configuration, but the Airflow operator won't pass all of the boto API arguments through.
create_app = EmrServerlessCreateApplicationOperator(
    task_id="create_my_app",
    job_type="SPARK",
    release_label="emr-6.9.0",
    config={
        "name": "data-ingestion",
        "imageConfiguration": {
            "imageUri": "xxxxxxx.dkr.ecr.eu-west-1.amazonaws.com/emr-custom-image:0.0.1"
        },
    },
)
Airflow gives the following error message:
Unknown parameter in input: "imageConfiguration", must be one of:
name, releaseLabel, type, clientToken, initialCapacity, maximumCapacity, tags, autoStartConfiguration, autoStopConfiguration, networkConfiguration
This other config won't work either:
config={"name": "data-ingestion",
"imageUri": "xxxxxxx.dkr.ecr.eu-west-1.amazonaws.com/emr-custom-image:0.0.1"})
Does anybody have any ideas other than downgrading my Scala version?

The Airflow operator passes its arguments to the boto3 client, and this client creates the application.
The imageConfiguration parameter was added to the boto3 client in version 1.26.44 (PR), and the other parameters were added in different versions (please check the changelog).
So you can try upgrading the boto3 version on your Airflow server, provided that it is compatible with the other dependencies; if not, you may need to upgrade your Airflow version.
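A quick way to check whether the installed boto3 is new enough is a version comparison like the sketch below (only boto3 itself is assumed); if it is too old, pin boto3>=1.26.44 in the environment's requirements and redeploy.

# Check that the installed boto3 is recent enough to accept "imageConfiguration"
# (support was added in boto3 1.26.44).
import boto3

required = (1, 26, 44)
installed = tuple(int(part) for part in boto3.__version__.split("."))

if installed < required:
    print(f"boto3 {boto3.__version__} is too old; pin boto3>=1.26.44 and redeploy")
else:
    print(f"boto3 {boto3.__version__} should accept imageConfiguration")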

Related

AWS EMR pyspark java.lang.illegalArgumentException when using pandas_udf

Using pyspark 2.4.7 and pyarrow 6.0.1.
I know from the documentation that there is a compatibility issue, so I need to set ARROW_PRE_0_15_IPC_FORMAT=1 inside spark-env.sh.
This solves the problem on my local machine, but I am still getting the same error on AWS EMR 5.33.1.
I am using boto3 and have configured spark-env by passing
[...,
 {'Classification': 'spark-env',
  'Configurations': [{'Classification': 'export',
                      'Properties': {'ARROW_PRE_0_15_IPC_FORMAT': '1'}}],
  'Properties': {}}]
and EMR loads the property; its config can be seen in the EMR UI.
I've read that these configs are only used for the master node, so are the worker nodes still getting the same error because of that?
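For reference, a minimal sketch of how that classification is passed to boto3's run_job_flow; the region, cluster name, instance settings, and IAM roles below are placeholders, not values from the original post.

import boto3

# Placeholder region and cluster settings; only the spark-env classification
# mirrors the configuration described in the question.
emr = boto3.client("emr", region_name="us-east-1")

spark_env = {
    "Classification": "spark-env",
    "Configurations": [
        {
            "Classification": "export",
            "Properties": {"ARROW_PRE_0_15_IPC_FORMAT": "1"},
        }
    ],
    "Properties": {},
}

response = emr.run_job_flow(
    Name="pandas-udf-test",  # placeholder name
    ReleaseLabel="emr-5.33.1",
    Applications=[{"Name": "Spark"}],
    Configurations=[spark_env],
    Instances={
        "MasterInstanceType": "m5.xlarge",  # placeholder sizing
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR roles assumed
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])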

Is it possible to use a custom hadoop version with EMR?

As of today (2022-06-28), the latest AWS EMR release is 6.6.0, which uses Hadoop 3.2.1.
I need to use a different Hadoop version (3.2.2). I tried the following approach, but it doesn't work: you can either set ReleaseLabel or a Hadoop version, but not both.
client = boto3.client("emr", region_name="us-west-1")
response = client.run_job_flow(
    ReleaseLabel="emr-6.6.0",
    Applications=[{"Name": "Hadoop", "Version": "3.2.2"}],
)
Another approach that does not seem to be an option is loading a specific Hadoop jar with SparkSession.builder.getOrCreate(), like so:
spark = SparkSession \
    .builder \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.2') \
    .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem') \
    .getOrCreate()
Is it even possible to run an EMR cluster with a different Hadoop version? If so, how does one go about doing that?
I'm afraid not. AWS doesn't want the support headache of allowing unsupported Hadoop versions, so they're always a little behind, presumably because they take time to test each new release and its compatibility with the other Hadoop tools: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-660-release.html.
You'd have to build your own cluster from scratch on EC2.
You just need to add the script to the Bootstrap section when you spin up your cluster, namely spark-patch-s3a-fix_emr-6.6.0.sh. Amazon provided this fix only for EMR 6.6.0.
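A rough sketch of attaching that bootstrap script through boto3; the S3 path, cluster name, and instance settings are placeholders (upload the script to your own bucket first).

import boto3

emr = boto3.client("emr", region_name="us-west-1")

response = emr.run_job_flow(
    Name="emr-6.6.0-with-s3a-patch",  # placeholder name
    ReleaseLabel="emr-6.6.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    BootstrapActions=[
        {
            "Name": "spark-patch-s3a-fix",
            "ScriptBootstrapAction": {
                # Placeholder location for the patch script in your own bucket.
                "Path": "s3://my-bucket/bootstrap/spark-patch-s3a-fix_emr-6.6.0.sh",
            },
        }
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",  # placeholder sizing
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR roles assumed
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])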

AWS EMR Spark error with `Failed to load class of driverClassName com.mysql.jdbc.Driver`

I'm currently trying to add a process on EMR 6.1.0 that will use Spark to store aggregated data in MySQL.
However, when I actually run Spark, I get the following error.
Exception in thread "main" java.lang.RuntimeException: Failed to load class of driverClassName com.mysql.jdbc.
This error did not occur in EMR 6.0.0.
In the process of updating from EMR 6.0.0 to 6.1.0, I changed the Spark version from 2.4.4 to 3.0.0.
The code itself has not changed significantly, and we know that it is not a network problem.
I've spent a lot of time looking through the AWS documentation and can't seem to find any hints.
Can anyone help me?
Place the MySQL connector jar under the $SPARK_HOME/jars folder, or pass the MySQL connector jar path to spark-shell/spark-submit using the --jars flag.
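A small sketch of the jar-on-the-classpath route, shown here from PySpark for brevity and assuming the connector has been copied to a local path on the cluster; the path, endpoint, table, and credentials are placeholders.

from pyspark.sql import SparkSession

# Placeholder jar path; copy the MySQL connector onto the cluster first.
spark = (
    SparkSession.builder
    .appName("mysql-driver-example")
    .config("spark.jars", "/home/hadoop/mysql-connector-java-8.0.28.jar")
    .getOrCreate()
)

# With the connector on the classpath, the driver class can be resolved.
df = spark.createDataFrame([(1, "a")], ["id", "value"])
(
    df.write.format("jdbc")
    .option("url", "jdbc:mysql://my-host:3306/my_db")  # placeholder endpoint
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "aggregated_data")  # placeholder table
    .option("user", "my_user")
    .option("password", "my_password")
    .mode("append")
    .save()
)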
Spark 3.x depends on HikariCP.
https://github.com/apache/spark/blob/v3.0.0/dev/deps/spark-deps-hadoop-3.2-hive-2.3#L1
The preloaded HikariCP can't load your application's classes because of the ClassLoader it uses:
https://github.com/brettwooldridge/HikariCP/blob/HikariCP-2.5.1/src/main/java/com/zaxxer/hikari/HikariConfig.java#L318
this.getClass().getClassLoader().loadClass(driverClassName)
You should add shade settings if you use the sbt-assembly plugin.
assembly / assemblyShadeRules := {
  Seq("com.zaxxer.hikari").map { packageName =>
    ShadeRule.rename(s"${packageName}.**" -> s"my_app_shade_package.${packageName}.@1").inAll
  }
}

Accessing Airflow REST API in AWS Managed Workflows?

I have Airflow running in AWS MWAA. I would like to access the REST API, and there are two ways to do this, but neither seems to work for me.
1. Overriding api.auth_backend. This used to work, but now AWS MWAA won't allow you to add this option; it is considered blocklisted and not allowed.
api.auth_backend = airflow.api.auth.backend.default
2. Using the MWAA CLI (Python). This doesn't work if any of the DAGs uses packages that are in the requirements.txt file.
a. As an example, I have "paramiko" in requirements.txt because I have a task that uses SSHOperator. The MWAA CLI fails with "no module paramiko".
b. Also noted here: https://docs.aws.amazon.com/mwaa/latest/userguide/access-airflow-ui.html
"Any command that parses a DAG (such as list_dags, backfill) will fail if the DAG uses plugins that depend on packages that are installed through requirements.txt."
We are using MWAA 2.0.2 and managed to use Airflow's REST API through the MWAA CLI, basically following the instructions and sample code from the Apache Airflow CLI command reference. You'll notice that not all REST API calls are supported, but many of them are (even when you have a requirements.txt in place).
Also have a look at the AWS sample code on GitHub.
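As a rough sketch of that CLI-over-HTTP approach (the environment name and region are hypothetical; the create_cli_token call and the /aws_mwaa/cli endpoint follow the MWAA documentation):

import base64
import boto3
import requests

ENV_NAME = "my-mwaa-environment"  # hypothetical environment name

# Exchange AWS credentials for a short-lived CLI token.
mwaa = boto3.client("mwaa", region_name="eu-west-1")  # placeholder region
token = mwaa.create_cli_token(Name=ENV_NAME)

# Run an Airflow CLI command through the MWAA web server.
resp = requests.post(
    f"https://{token['WebServerHostname']}/aws_mwaa/cli",
    headers={
        "Authorization": f"Bearer {token['CliToken']}",
        "Content-Type": "text/plain",
    },
    data="dags list",  # Airflow 2.x CLI syntax
)
result = resp.json()
print(base64.b64decode(result["stdout"]).decode())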

GCP, Composer, Airflow, Operators

Google Cloud Composer uses Cloud Storage to store Apache Airflow DAGs. However, where are the operators stored? I am getting the following error:
Broken DAG: [/home/airflow/gcs/dags/example_pubsub_flow.py] cannot import name PubSubSubscriptionCreateOperator.
This operator was added in Airflow 1.10.0. As of today, Cloud Composer is still using Airflow 1.9.0, hence this operator is not available yet. You can add it as a plugin.
Apparently, according to this message in the Composer Google Group, installing the contrib operator as a plugin does not require adding the Plugin boilerplate.
It is enough to register the plugin via this command:
gcloud beta composer environments storage plugins import --environment dw --location us-central1 --source=custom_operators.py
See here for details.
The drawback is that if your contrib operator uses other contrib modules, you will also have to copy those and modify the way they are imported in Python, using:
from my_custom_operator import MyCustomOperator
instead of:
from airflow.contrib.operators.my_custom_operator import MyCustomOperator
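A minimal sketch of a DAG in the dags/ folder that uses the copied operator with the direct import; my_custom_operator, MyCustomOperator, and the operator's arguments are placeholder names carried over from the answer.

from datetime import datetime

from airflow import DAG
from my_custom_operator import MyCustomOperator  # copied module, placeholder name

dag = DAG(
    dag_id="custom_operator_example",
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,
)

run_custom = MyCustomOperator(
    task_id="run_custom_operator",  # constructor args depend on the real operator
    dag=dag,
)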