I am using Amazon EMR, and the log for my question is the following:
Image name '<account_id>.dkr.ecr.ap-northeast-2.amazonaws.com/pyspark-etl@sha256:3d3a07135.......87563e51ef5c841841a5d1a6dde9c' doesn't match docker image name pattern
Release label: emr-6.2.0
Hadoop distribution: Amazon 3.2.1
Applications: Spark 3.0.1, Hive 3.1.2, JupyterHub 1.1.0, Ganglia 3.7.2, Zeppelin 0.9.0, Livy 0.7.0, Hue 4.8.0, PrestoSQL 343
The reason I had to use the sha256 digest is because I previously used the :latest tag of the pyspark image, hardcoded in an Airflow job that is itself also containerized as an ECR image.
So, when my Airflow container runs an EMR operator (an SSHOperator, to be precise) that does a CLI spark-submit, it pulls the :latest Spark container, which for some reason does not get updated.
This is strange, because when I SSH into a core instance I am able to pull the image by its sha256 digest from ECR, and the :latest tag is also updated if something has changed (i.e., the digest changed).
I think this is something in the Spark configuration, or in the AWS build of Spark, that prohibits the digest name pattern, but I can't debug it because I do not have the (Amazon) Spark source myself. I would appreciate your answer.
Many thanks,
I am editing my own question because I got an answer from someone.
The problem was the YARN configuration installed on my EMR master node. The YARN default setting for Docker image updates is false.
https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java#L344
So, in order to change the default setting, go to /etc/hadoop/conf/, find yarn-site.xml, and fix (or add) this property, setting the value to true so that the latest image is pulled:
<property>
  <name>yarn.nodemanager.runtime.linux.docker.image-update</name>
  <value>true</value>
  <description>
    Optional. Default option to decide whether to pull the latest image
    or not.
  </description>
</property>
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/DockerContainers.html
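If you prefer not to edit the file by hand on a running node, the same property can also be set when the cluster is created through an EMR configuration classification. A minimal sketch under that assumption (every option other than the classification is a placeholder to adapt):

# Launch the cluster with the YARN property applied at creation time
aws emr create-cluster \
  --name "pyspark-etl-cluster" \
  --release-label emr-6.2.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --configurations '[
    {
      "Classification": "yarn-site",
      "Properties": {
        "yarn.nodemanager.runtime.linux.docker.image-update": "true"
      }
    }
  ]'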
Related
I want to launch a Spark job on EMR Serverless from Airflow. I want to use Spark 3.3.0 and Scala 2.13, but the emr-6.9.0 release ships with Scala 2.12. I created a fat JAR including all the Spark dependencies, and that doesn't work either. As an alternative, I am trying to use an EMR custom image by creating an application with --image-configuration through the Airflow operator, but it just won't pass all the arguments through to the boto API.
from airflow.providers.amazon.aws.operators.emr import EmrServerlessCreateApplicationOperator

create_app = EmrServerlessCreateApplicationOperator(
    task_id="create_my_app",
    job_type="SPARK",
    release_label="emr-6.9.0",
    config={
        "name": "data-ingestion",
        "imageConfiguration": {
            "imageUri": "xxxxxxx.dkr.ecr.eu-west-1.amazonaws.com/emr-custom-image:0.0.1"
        },
    },
)
Airflow gives the following error message:
Unknown parameter in input: "imageConfiguration", must be one of:
name, releaseLabel, type, clientToken, initialCapacity, maximumCapacity, tags, autoStartConfiguration, autoStopConfiguration, networkConfiguration
This other config won't work either:
config={"name": "data-ingestion",
"imageUri": "xxxxxxx.dkr.ecr.eu-west-1.amazonaws.com/emr-custom-image:0.0.1"})
Does anybody have any ideas other than downgrading my Scala version?
The Airflow operator passes the arguments to the boto3 client, and that client creates the application.
The imageConfiguration parameter was added to the boto3 client in 1.26.44 (PR), and the other parameters were added in different versions (please check the changelog).
So you can try to upgrade the boto3 version on your Airflow server, provided that it is compatible with your other dependencies; if not, you may need to upgrade your Airflow version.
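For example, a quick way to check and upgrade, assuming you manage the Airflow environment's packages with pip (pin to whatever version your provider constraints allow):

# Check the boto3 version Airflow currently sees
python -c "import boto3; print(boto3.__version__)"
# Upgrade to a release that knows about imageConfiguration
pip install --upgrade "boto3>=1.26.44"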
I am running DynamoDB locally using the instructions here. To remove potential Docker networking issues I am using the "Download Locally" version of the instructions. Before running DynamoDB locally I run aws configure to set some fake values for the AWS access key, secret, and region, and here is the output:
$ aws configure
AWS Access Key ID [****************fake]:
AWS Secret Access Key [****************ake2]:
Default region name [local]:
Default output format [json]:
Here is the output of running DynamoDB locally:
$ java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb
Initializing DynamoDB Local with the following configuration:
Port: 8000
InMemory: false
DbPath: null
SharedDb: true
shouldDelayTransientStatuses: false
CorsParams: *
I can confirm that DynamoDB is running locally successfully by listing tables using the AWS CLI:
$ aws dynamodb list-tables --endpoint-url http://localhost:8000
{
"TableNames": []
}
But when I visit http://localhost:8000/shell in my browser, I get an error and the page does not load.
I tried running curl against the shell to see if I could get a more useful error message:
$ curl http://localhost:8000/shell
{
"__type":"com.amazonaws.dynamodb.v20120810#MissingAuthenticationToken",
"Message":"Request must contain either a valid (registered) AWS access key ID or X.509 certificate."}%
I tried looking up the error above, but there isn't much setup I can do when running the shell merely in the browser. Any help on how I can run the DynamoDB JavaScript web shell with this setup is appreciated.
Software versions:
aws cli: aws-cli/2.4.7 Python/3.9.9 Darwin/20.6.0 source/x86_64 prompt/off
OS: MacOS Big Sur 11.6.2 (20G314)
The DynamoDB Local Web Shell was deprecated with version 1.16.X and is no longer available from 1.17.X onwards. There are no immediate plans for a new Web Shell to be introduced.
You can download an old version of DynamoDB Local < 1.17.X should you wish to use the Web Shell.
List the available versions:
aws s3 ls s3://dynamodb-local-frankfurt/
Download the most recent working version with the Web Shell:
aws s3 cp s3://dynamodb-local-frankfurt/dynamodb_local_2021-04-27.tar.gz .
The next release of DynamoDB Local will have an updated README indicating its deprecation
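Once the archive is downloaded, a minimal sketch for unpacking and starting the older build (mirroring the run command from the question):

tar -xzf dynamodb_local_2021-04-27.tar.gz
java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb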
As I answered in "DynamoDB local http://localhost:8000/shell", this appears to be a regression in new versions of DynamoDB Local, where the shell mysteriously stopped working, whereas it did work in versions from a year ago.
Somebody should report it to Amazon. If there is some flag that new versions require you to set to enable the shell, it isn't documented anywhere that I can find.
Update Java to the latest version and voilà, it works!
I have PySpark code, stored both on the master node of an AWS EMR cluster and in an S3 bucket, that fetches over 140M rows from a MySQL database and writes the sum of a column back to the log files on S3.
When I spark-submit the PySpark code from the master node, the job completes successfully and the output is stored in the log files in the S3 bucket.
However, when I spark-submit the PySpark code that is in the S3 bucket using the commands below (on the terminal, after SSH-ing into the master node):
spark-submit --master yarn --deploy-mode cluster --py-files s3://bucket_name/my_script.py
This returns an Error: Missing application resource. error.
spark-submit s3://bucket_name/my_script.py
This shows:
20/07/02 11:26:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2369)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1911)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:766)
at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:137)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
... 20 more
I read about having to add a Spark step on the AWS EMR cluster to submit PySpark code stored on S3.
Am I correct in saying that I would need to create a step in order to submit my PySpark job stored on S3?
In the 'Add Step' window that pops up in the AWS Console, the 'Application location' field says that I'll have to type in the location of the JAR file. What JAR file are they referring to? Does my PySpark script have to be packaged into a JAR file (and if so, how do I do that), or do I just give the path to my PySpark script?
In the 'Add Step' window that pops up in the AWS Console, in the spark-submit options, how do I know what to write for the --class parameter? Can I leave this field empty? If not, why not?
I have gone through the AWS EMR documentation. I have so many questions because I dove nose-first into the problem and only researched when an error popped up.
Your spark-submit should be this:
spark-submit --master yarn --deploy-mode cluster s3://bucket_name/my_script.py
--py-files is used to pass Python dependency modules, not the application code.
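For example, if your job also needed extra Python modules, a sketch of the combined call (deps.zip is a hypothetical archive of your dependencies) would be:

spark-submit --master yarn --deploy-mode cluster \
  --py-files s3://bucket_name/deps.zip \
  s3://bucket_name/my_script.py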
When you are adding a step in EMR to run a Spark job, the JAR location is your Python file path, i.e. s3://bucket_name/my_script.py.
No, it is not mandatory to use a step to submit a Spark job.
You can also use spark-submit.
To submit a PySpark script using a step, please refer to the AWS doc and Stack Overflow.
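For reference, a minimal sketch of adding such a step from the CLI (the cluster ID is a placeholder):

# Add a Spark step that runs the script straight from S3
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="My PySpark job",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://bucket_name/my_script.py]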
For problem 1:
By default, Spark will use Python 2.
You need to add two configs.
Go to $SPARK_HOME/conf/spark-env.sh and add:
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
Note: if you have any custom bundle, add it using --py-files.
For problem 2:
A hadoop-assembly JAR exists in /usr/share/aws/emr/emrfs/lib/. It contains com.amazon.ws.emr.hadoop.fs.EmrFileSystem.
You need to add this to your classpath.
A better option, in my opinion, is to create a symbolic link from the hadoop-assembly JAR into HADOOP_HOME (/usr/lib/hadoop) in your bootstrap action.
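A minimal bootstrap-action sketch along those lines (the exact JAR file name varies by EMR release, so both the wildcard and the target directory are assumptions to verify on your cluster):

#!/bin/bash
# Symlink the EMRFS hadoop-assembly JAR into the Hadoop install directory
# so that com.amazon.ws.emr.hadoop.fs.EmrFileSystem can be resolved.
sudo ln -sf /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-*.jar /usr/lib/hadoop/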
It appears that one must provide a full new task definition for each service update, even though most of the time new deployments consist exclusively of an update to one of the underlying Docker images.
While this is understandable as a core architectural choice, it is quite cumbersome. Is there a command-line option that makes this easier, given that the full JSON spec for task definitions is quite complex?
Right now developers need to provide complex scripts and deployment orchestration to achieve this relatively routine task in their CI/CD processes.
I see attempts at this Here and Here. These solutions do not appear to work in all cases (for example, for Fargate launch types).
I know that if the updated image re-uses the same tag this problem becomes easier; however, in dev cultures that value reproducibility and auditability, that is simply not a reasonable option.
Is there no other option than to leverage both the AWS API & JSON manipulation libraries?
EDIT: It appears this project does a fairly good job: https://github.com/fabfuel/ecs-deploy
I found a few approaches:
As mentioned in my comment, use the ecs-deploy script per the GitHub link.
Create a task definition via the --generate-cli-skeleton option of the AWS CLI.
Fill out all details except for the execution role ARN, the task role ARN, and the image.
These cannot be filled out in advance because they change per commit or per environment you want to deploy to.
Commit this skeleton to git, so it is part of your workspace on the CI.
Then, at build time on the CI, use a JSON traversing/parsing library or a utility such as https://jqplay.org/ to replace the roleArn and image name (see the sketch below).
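A hedged sketch of that replacement and registration step with jq and the AWS CLI (the file names, role ARN, image, cluster, service, and task family are all placeholders to adapt):

#!/bin/bash
# Substitute the per-build values into the committed skeleton,
# register a new task definition revision, and point the service at it.
NEW_IMAGE="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:$(git rev-parse HEAD)"
EXECUTION_ROLE_ARN="arn:aws:iam::123456789012:role/my-ecs-execution-role"

jq --arg img "$NEW_IMAGE" --arg role "$EXECUTION_ROLE_ARN" \
   '.containerDefinitions[0].image = $img | .executionRoleArn = $role' \
   task-definition-skeleton.json > task-definition.json

aws ecs register-task-definition --cli-input-json file://task-definition.json
aws ecs update-service --cluster my-cluster --service my-service --task-definition my-task-family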
Use https://github.com/fabfuel/ecs-deploy.
If you want to update only the tag of an existing task:
ecs deploy <CLUSTER NAME> <SERVICE NAME> --region <REGION NAME> --tag <NEW TAG>
e.g. ecs deploy default web-service --region us-east-1 --tag v2.0
In your CI/CD you can use the git hash:
Running git rev-parse HEAD will return a hash like d63c16cd4d0c9a30524c682fe4e7d417faae98c9.
docker build -t image-name:$(git rev-parse HEAD) .
docker push image-name:$(git rev-parse HEAD)
And use the same tag on the task:
ecs deploy default web-service --region us-east-1 --tag $(git rev-parse HEAD)
I'm running emr-5.12.0, with Amazon Hadoop 2.8.3, Hive 2.3.2, Hue 4.1.0, Livy 0.4.0, Spark 2.2.1 and Zeppelin 0.7.3, on one m4.large as my master node and one m4.large as a core node.
I am trying to execute a bootstrap action that configures some parts of the cluster. One of these includes the line:
sudo sed -i '/zeppelin.pyspark.python/c\ \"zeppelin.pyspark.python\" : \"python3\",' /etc/alternatives/zeppelin-conf/interpreter.json
It makes sure that Zeppelin uses Python 3.4 instead of Python 2.7. It works fine if I execute it in the terminal after SSH-ing into the master node, but it fails when I submit it as a Custom JAR step in the AWS web interface. I get the following error:
sed: can't read /etc/alternatives/zeppelin-conf/interpreter.json: No such file or directory
Command exiting with ret '2'
The same thing happens if I use
sudo sed -i '/zeppelin.pyspark.python/c\ \"zeppelin.pyspark.python\" : \"python3\",' /etc/zeppelin-conf/interpreter.json
Obviously I could just change it from the Zeppelin UI, but I would like to include it in the bootstrap action.
Thanks!
It turns out that a bootstrap action submitted through the AWS EMR web interface is submitted as a regular EMR step, so it's only run on the master node. This can be seen if you click 'AWS CLI export' in the cluster web interface. The intended bootstrap action is listed as a regular step.
Using the command line to launch a cluster with a bootstrap action bypasses this problem, so I've just used that.
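For reference, a minimal sketch of launching a cluster with a bootstrap action from the CLI (the bucket, script name, and other options are placeholders):

# Bootstrap actions passed this way run on every node before applications start
aws emr create-cluster \
  --name "zeppelin-python3-cluster" \
  --release-label emr-5.12.0 \
  --applications Name=Spark Name=Zeppelin \
  --instance-type m4.large --instance-count 2 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/configure-zeppelin.sh,Name="Configure Zeppelin"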
Edit: Looking back at the web interface, it's pretty clear that I was adding regular steps instead of bootstrap actions. My bad!