Is it possible to use a custom hadoop version with EMR? - amazon-web-services

As of today (2022-06-28), the latest AWS EMR release is 6.6.0, which uses Hadoop 3.2.1.
I need to use a different Hadoop version (3.2.2). I tried the following approach, but it doesn't work: you can either set the ReleaseLabel or a Hadoop version, but not both.
import boto3

client = boto3.client("emr", region_name="us-west-1")
response = client.run_job_flow(
    ReleaseLabel="emr-6.6.0",
    # rejected: with a release label, the Hadoop version is fixed by the release
    Applications=[{"Name": "Hadoop", "Version": "3.2.2"}],
)
Another approach that does not seem to be an option is loading a specific Hadoop jar with SparkSession.builder.getOrCreate(), like so:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.2') \
    .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem') \
    .getOrCreate()
Is it even possible to run an EMR cluster with a different Hadoop version? If so, how does one go about doing that?

I'm afraid not. AWS doesn't want the support headache of allowing arbitrary Hadoop versions, so EMR is always a little behind upstream; they presumably take time to test each new release and its compatibility with the other Hadoop tools. See the EMR 6.6.0 release notes: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-660-release.html
You'd have to build your own cluster from scratch in EC2.

You just need to add the script spark-patch-s3a-fix_emr-6.6.0.sh to the Bootstrap actions section when you spin up your cluster. =) Amazon provided this fix only for EMR 6.6.0.
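For reference, a bootstrap action can be attached when the cluster is launched. Below is a minimal sketch using boto3's run_job_flow; the S3 path to the patch script, the cluster name, instance types, and roles are placeholders, so adjust them to your setup.

import boto3

emr = boto3.client("emr", region_name="us-west-1")
response = emr.run_job_flow(
    Name="my-cluster",  # placeholder cluster name
    ReleaseLabel="emr-6.6.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "InstanceCount": 3,
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "patch-s3a",
            "ScriptBootstrapAction": {
                # placeholder S3 location where you uploaded the patch script
                "Path": "s3://<mybucket>/bootstrap/spark-patch-s3a-fix_emr-6.6.0.sh"
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)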

Related

EMR Serverless Airflow Operator not allowing EMR custom images

I want to launch a Spark job on EMR Serverless from Airflow. I want to use Spark 3.3.0 and Scala 2.13, but the emr-6.9.0 release ships with Scala 2.12. I created a fat jar including all Spark dependencies, and that doesn't work either. As an alternative, I am trying to use an EMR custom image by creating an application with --image-configuration via the Airflow operator, but it just won't pass through all the arguments of the boto API.
create_app = EmrServerlessCreateApplicationOperator(
    task_id="create_my_app",
    job_type="SPARK",
    release_label="emr-6.9.0",
    config={
        "name": "data-ingestion",
        "imageConfiguration": {
            "imageUri": "xxxxxxx.dkr.ecr.eu-west-1.amazonaws.com/emr-custom-image:0.0.1"
        },
    },
)
Airflow gives the following error message:
Unknown parameter in input: "imageConfiguration", must be one of:
name, releaseLabel, type, clientToken, initialCapacity, maximumCapacity, tags, autoStartConfiguration, autoStopConfiguration, networkConfiguration
This other config won't work either:
config={
    "name": "data-ingestion",
    "imageUri": "xxxxxxx.dkr.ecr.eu-west-1.amazonaws.com/emr-custom-image:0.0.1",
})
Does anybody have any ideas other than downgrading my Scala version?
The Airflow operator passes the arguments to the boto3 client, and that client creates the application.
The imageConfiguration parameter was added to the boto3 client in 1.26.44 (PR), and the other parameters were added in different versions (please check the changelog).
So you can try to upgrade boto3 on your Airflow server, provided it is compatible with the other dependencies; if not, you may need to upgrade your Airflow version.
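As a quick sanity check, you can call boto3 directly from the environment Airflow runs in: once the installed boto3 is new enough, create_application accepts imageConfiguration. A minimal sketch, reusing the image URI from the question with a hypothetical application name and region:

import boto3

print(boto3.__version__)  # should be >= 1.26.44 per the changelog note above

client = boto3.client("emr-serverless", region_name="eu-west-1")
response = client.create_application(
    name="data-ingestion",
    releaseLabel="emr-6.9.0",
    type="SPARK",
    imageConfiguration={
        "imageUri": "xxxxxxx.dkr.ecr.eu-west-1.amazonaws.com/emr-custom-image:0.0.1"
    },
)
print(response["applicationId"])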

How to read a csv file from s3 bucket using pyspark

I'm using Apache Spark 3.1.0 with Python 3.9.6. I'm trying to read a CSV file from an AWS S3 bucket, something like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
file = "s3://bucket/file.csv"
c = spark.read \
    .csv(file) \
    .count()
print(c)
But I'm getting the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o26.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
I understand that I need to add extra libraries, but I couldn't find any definite information on exactly which ones and which versions. I've tried adding something like this to my code, but I'm still getting the same error:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
How can I fix this?
You need to use hadoop-aws version 3.2.0 for Spark 3. Specifying the hadoop-aws library in --packages is enough to read files from S3.
--packages org.apache.hadoop:hadoop-aws:3.2.0
You also need to set the configurations below:
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<access_key>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")
After that, you can read the CSV file:
spark.read.csv("s3a://bucket/file.csv")
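Putting the pieces together, here is a minimal end-to-end sketch. It assumes hadoop-aws 3.2.0 matches the Hadoop version bundled with your Spark build, and the access/secret keys are placeholders; setting them through spark.hadoop.* at session creation is equivalent to the hadoopConfiguration() calls above.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # pulls hadoop-aws (and its transitive AWS SDK dependency) at startup;
    # the version must match the Hadoop your Spark build uses
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
    # placeholder credentials; prefer an instance profile or environment variables in practice
    .config("spark.hadoop.fs.s3a.access.key", "<access_key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret_key>")
    .getOrCreate()
)

df = spark.read.csv("s3a://bucket/file.csv")
print(df.count())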
Thanks Mohana for the pointer! After breaking my head over it for more than a day, I was finally able to figure it out. Summarizing my learnings:
Make sure which version of Hadoop your Spark comes with:
print(f'pyspark hadoop version: {spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}')
or look for:
ls jars/hadoop*.jar
The issue I was having was that an older Spark installation from a while back shipped with Hadoop 2.7 and was messing everything up.
This should give a brief idea of what binaries you need to download.
For me it was Spark 3.2.1 and Hadoop 3.3.1.
Hence I downloaded:
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.3.1
https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.901 # added this just in case;
Placed these jar files in the spark installation dir:
spark/jars/
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 runner.py
Then your code snippet that reads from AWS S3 should work.

AWS EMR Spark error with `Failed to load class of driverClassName com.mysql.jdbc.Driver`

I'm currently trying to add a process in EMR 6.1.0 that will use Spark to store aggregated data in MySQL.
However, when I actually run Spark, I get the following error.
Exception in thread "main" java.lang.RuntimeException: Failed to load class of driverClassName com.mysql.jdbc.Driver
This error did not occur in EMR 6.0.0.
In the process of updating from EMR 6.0.0 to 6.1.0, I changed the Spark version from 2.4.4 to 3.0.0.
The code itself has not changed significantly, and we know that it is not a network problem.
I've spent a lot of time looking through the AWS documentation and can't seem to find any hints.
Can anyone help me?
Place the MySQL connector jar under the $SPARK_HOME/jars folder, or pass the MySQL connector jar path to spark-shell/spark-submit using the --jars flag.
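If the job happens to be PySpark, the same idea looks roughly like the sketch below. The jar path, JDBC URL, table name, and credentials are hypothetical placeholders; spark.jars plays the same role as the --jars flag mentioned above.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # hypothetical local jar path; equivalent to --jars on spark-submit
    .config("spark.jars", "/home/hadoop/jars/mysql-connector-java-8.0.28.jar")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

(
    df.write.format("jdbc")
    # hypothetical endpoint, table, and credentials
    .option("url", "jdbc:mysql://my-db-host:3306/mydb")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "aggregated_data")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save()
)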
Spark 3.x depends on HikariCP.
https://github.com/apache/spark/blob/v3.0.0/dev/deps/spark-deps-hadoop-3.2-hive-2.3#L1
The preloaded HikariCP can't load your application's classes because of the ClassLoader it uses:
https://github.com/brettwooldridge/HikariCP/blob/HikariCP-2.5.1/src/main/java/com/zaxxer/hikari/HikariConfig.java#L318
this.getClass().getClassLoader().loadClass(driverClassName)
You should add shade rules if you use the sbt-assembly plugin:
assembly / assemblyShadeRules := {
  Seq("com.zaxxer.hikari").map { packageName =>
    ShadeRule.rename(s"${packageName}.**" -> s"my_app_shade_package.${packageName}.@1").inAll
  }
}

Accessing Airflow REST API in AWS Managed Workflows?

I have Airflow running in AWS MWAA. I would like to access the REST API, and there are two ways to do this, but neither seems to work for me.
Overriding api.auth_backend. This used to work, but AWS MWAA no longer allows you to add this option; it is considered blocklisted and not allowed.
api.auth_backend = airflow.api.auth.backend.default
Using the MWAA CLI (Python). This doesn't work if any of the DAGs uses packages that are in the requirements.txt file.
a. As an example, I have "paramiko" in requirements.txt because I have a task that uses SSHOperator. The MWAA CLI fails with "no module paramiko".
b. Also noted here, https://docs.aws.amazon.com/mwaa/latest/userguide/access-airflow-ui.html
"Any command that parses a DAG (such as list_dags, backfill) will fail if the DAG uses plugins that depend on packages that are installed through requirements.txt."
We are using MWAA 2.0.2 and managed to use Airflow's REST API through the MWAA CLI, basically following the instructions and sample code from the Apache Airflow CLI command reference. You'll notice that not all REST API calls are supported, but many of them are (even when you have a requirements.txt in place).
Also have a look at AWS sample codes on GitHub.
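For reference, the usual pattern is to request a CLI token for the environment and POST the Airflow CLI command to the environment's /aws_mwaa/cli endpoint. A minimal sketch, with a hypothetical environment name and region:

import base64

import boto3
import requests

mwaa = boto3.client("mwaa", region_name="us-east-1")
token = mwaa.create_cli_token(Name="my-mwaa-env")  # hypothetical environment name

resp = requests.post(
    f"https://{token['WebServerHostname']}/aws_mwaa/cli",
    headers={
        "Authorization": f"Bearer {token['CliToken']}",
        "Content-Type": "text/plain",
    },
    data="dags list",  # any supported Airflow CLI command
)
result = resp.json()
print(base64.b64decode(result["stdout"]).decode())
print(base64.b64decode(result["stderr"]).decode())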

amazon emr spark submission from S3 not working

I have a cluster up and running. I am trying to add a step to run my code. The code itself works fine on a single instance. The only thing is that I can't get it to work off S3.
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name=SomeSparkApp,Args=[--deploy-mode,cluster,--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
This is exactly what examples show I should do. What am I doing wrong?
Error I get:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown/unsupported param List(--executor-memory, 0.5g, --executor-cores, 2, --primary-py-file, s3://<mybucketname>/mypythonfile.py, --class, org.apache.spark.deploy.PythonRunner)
Usage: org.apache.spark.deploy.yarn.Client [options]
Options:
--jar JAR_PATH Path to your application's JAR file (required in yarn-cluster
mode)
.
.
.
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'
When I specify as this instead:
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name= SomeSparkApp,Args=[--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
I get this error instead:
Error: Only local python files are supported: Parsed arguments:
master yarn-client
deployMode client
executorMemory 0.5g
executorCores 2
EDIT: It gets further along when I manually create the Python file after SSH'ing into the cluster and specify the step as follows:
aws emr add-steps --cluster-id 'j-XXXXX' --steps Type=spark,Name= SomeSparkApp,Args=[--executor-memory,1g,/home/hadoop/mypythonfile.py]
But that still doesn't do the job.
Any help appreciated. This is really frustrating, since a well-documented method on AWS's own blog, https://blogs.aws.amazon.com/bigdata/post/Tx578UTQUV7LRP/Submitting-User-Applications-with-spark-submit, does not work.
I will ask, just in case: did you use the correct bucket and cluster ID?
Anyway, I had similar problems; for example, I could not use --deploy-mode,cluster when reading from S3.
When I used --deploy-mode,client,--master,local[4] in the arguments, I think it worked, but I still needed something else (I can't remember exactly what), so I resorted to a solution like this:
Firstly, I use a bootstrap action where a shell script runs the command:
aws s3 cp s3://<mybucket>/wordcount.py wordcount.py
and then I add a step at cluster creation through the SDK in my Go application, but the equivalent CLI command looks like this:
aws emr add-steps --cluster-id j-XXXXX --steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=["spark-submit",--master,local[4],/home/hadoop/wordcount.py,s3://<mybucket>/<inputfile.txt> s3://<mybucket>/<outputFolder>/]
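For completeness, here is a rough boto3 equivalent of that add-steps call, under the same assumption that the bootstrap action already copied wordcount.py onto the instance; the cluster ID and S3 paths are placeholders.

import boto3

emr = boto3.client("emr", region_name="us-west-1")
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXX",  # placeholder cluster ID
    Steps=[
        {
            "Name": "Spark Program",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--master", "local[4]",
                    "/home/hadoop/wordcount.py",        # copied by the bootstrap action
                    "s3://<mybucket>/<inputfile.txt>",  # placeholder input
                    "s3://<mybucket>/<outputFolder>/",  # placeholder output
                ],
            },
        }
    ],
)
print(response["StepIds"])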
I searched for days and finally discovered this thread, which states:
PySpark currently only supports local files. This does not mean it only runs in local mode, however; you can still run PySpark on any cluster manager (though only in client mode). All this means is that your python files must be on your local file system. Until this is supported, the straightforward workaround then is to just copy the files to your local machine.