Setup Apache Sedona on EMR

I want to be able to use Apache Sedona for distributed GIS computing on AWS EMR. We need the right bootstrap script to install all the dependencies.
I tried setting up GeoSpark on EMR 5.33 using the JARs listed here, but it didn't work because some dependencies were still missing.
I then set Sedona up manually on a local machine, worked out the difference in JARs between a plain Spark 3 installation and the Sedona setup, and came up with the following bootstrap script:
#!/bin/bash
sudo pip3 install numpy
sudo pip3 install boto3 pandas findspark shapely py4j attrs
sudo pip3 install geospark --no-dependencies
sudo pip3 install apache-sedona
sudo aws s3 cp s3://emr_setup/apache-sedona-1.0.1-incubating-bin/sedona-python-adapter-2.4_2.11-1.0.1-incubating.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/apache-sedona-1.0.1-incubating-bin/sedona-viz-2.4_2.11-1.0.1-incubating.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/geospark_bin/postgresql-42.2.23.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/sedona-core-2.4_2.11-1.0.1-incubating.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/stream-2.7.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/orc-core-1.5.5-nohive.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/jersey-media-jaxb-2.22.2.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/hadoop-mapreduce-client-common-2.6.5.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/hadoop-mapreduce-client-shuffle-2.6.5.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/org.w3.xlink-24.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/minlog-1.3.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/jersey-client-2.22.2.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/xz-1.5.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/pyrolite-4.13.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/hadoop-yarn-common-2.6.5.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/curator-recipes-2.6.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/aopalliance-1.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/commons-configuration-1.6.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/commons-beanutils-1.7.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/gt-metadata-24.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/spark-unsafe_2.11-2.4.7.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/objenesis-2.5.1.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/commons-httpclient-3.1.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/stax-api-1.0-2.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/hk2-api-2.4.0-b34.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/apacheds-i18n-2.0.0-M15.jar /usr/lib/spark/jars/
The EMR cluster starts, but the notebooks attached to it don't seem to be able to start; the master node appears to fail for some reason.
I need help preparing the right bootstrap script to install Apache Sedona on EMR 6.0.

Here is a complete tutorial for setting up Sedona on EMR on EC2.
EMR version: 6.9.0.
Installed applications: Hadoop 3.3.3, JupyterEnterpriseGateway 2.6.0, Livy 0.7.1, Spark 3.3.0
I am using it together with EMR Studio (notebooks).
In an S3 bucket, add a script with the following content:
#!/bin/bash
# EMR clusters only have ephemeral local storage. It does not really matter where we store the jars.
sudo mkdir /jars
# Download Sedona jar
sudo curl -o /jars/sedona-python-adapter-3.0_2.12-1.3.1-incubating.jar "https://repo1.maven.org/maven2/org/apache/sedona/sedona-python-adapter-3.0_2.12/1.3.1-incubating/sedona-python-adapter-3.0_2.12-1.3.1-incubating.jar"
# Download GeoTools jar
sudo curl -o /jars/geotools-wrapper-1.3.0-27.2.jar "https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.3.0-27.2/geotools-wrapper-1.3.0-27.2.jar"
# Install necessary python libraries
sudo python3 -m pip install pandas geopandas==0.10.2
sudo python3 -m pip install attrs matplotlib descartes apache-sedona==1.3.1
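To make the script available to the cluster, first upload it to the bucket; for example (the bucket and object key below are assumptions):
# Upload the bootstrap script to S3 (bucket and key names are placeholders)
aws s3 cp install-sedona.sh s3://my-emr-bucket/bootstrap/install-sedona.sh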
When you create an EMR cluster, specify the location of this script in the bootstrap action.
When you create the EMR cluster, add the following content in the software configuration:
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.dist.jars": "/jars/sedona-python-adapter-3.0_2.12-1.3.1-incubating.jar,/jars/geotools-wrapper-1.3.0-27.2.jar",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
      "spark.sql.extensions": "org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
    }
  }
]
The key point is to use Sedona 1.3.1-incubating, which can pick up the jars specified in the spark.yarn.dist.jars property. The spark.jars property is ignored on EMR on EC2 because it uses YARN to deploy the jars. See SEDONA-183.
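For reference, the same bootstrap action and software configuration can also be supplied from the AWS CLI when creating the cluster. A rough sketch, assuming the script was uploaded as above and the JSON classification is saved locally; the cluster name, bucket path, instance type and count are assumptions:
# Create the cluster with the bootstrap script and the spark-defaults classification
# (all names, paths and sizing below are placeholders)
aws emr create-cluster \
  --name "sedona-cluster" \
  --release-label emr-6.9.0 \
  --applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway \
  --bootstrap-actions Path=s3://my-emr-bucket/bootstrap/install-sedona.sh \
  --configurations file://sedona-spark-defaults.json \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles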

Related

Can you multi-stage build a Docker image with both the aws and gsutil CLIs?

I am wondering if there is a straightforward way in Docker to build an image that has both the aws CLI and the gsutil CLI installed on it for use. Unfortunately, an S3 bucket name containing periods causes a "Host ... returned an invalid certificate" error (https://github.com/GoogleCloudPlatform/gsutil/issues/267), and I cannot change the S3 bucket name, which means I cannot do the following:
gsutil -m cp -r "s3://path.with.periods/path/files" "gs://bucket_path/path"
so instead I'll have to do something like
aws s3 cp --recursive --quiet "s3://path.with.periods/path/files" ./
gsutil -m cp -r "./" "gs://bucket_path/path"
but I was wondering if there is a straightforward Dockerfile that could run these commands?
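For reference, the two-step workaround above can be wrapped in a single script once both CLIs are available in the image; a rough sketch (the temporary staging directory is an assumption):
#!/bin/bash
# Stage the S3 files locally with the aws CLI, then push them to GCS with gsutil.
# The temporary staging directory is an assumption.
set -euo pipefail
staging=$(mktemp -d)
cd "$staging"
aws s3 cp --recursive --quiet "s3://path.with.periods/path/files" ./
gsutil -m cp -r "./" "gs://bucket_path/path"
cd / && rm -rf "$staging"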

Copy S3 File to Docker Image via Dockerfile

I have a Dockerfile that installs awscli and then tries to run aws s3 cp to get a file and put it on the docker image.
My dockerfile is:
FROM my-kie-server:latest
USER root
RUN echo "ip_resolve=4" >> /etc/yum.conf
ENV http_proxy host.docker.internal:9000
ENV https_proxy host.docker.internal:9000
ENV HTTP_PROXY host.docker.internal:9000
ENV HTTPS_PROXY host.docker.internal:9000
RUN yum install -y maven
RUN yum install -y awscli
USER jboss
ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY
RUN aws s3 cp s3://myBucket/myPath/myFile.jar x.jar
But when I build the image I get this error:
fatal error: [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:618)
The command '/bin/sh -c aws s3 cp s3://myBucket/myPath/myFile.jar x.jar' returned a non-zero code: 1
I have tried using --no-verify-ssl on the aws s3 cp command but get the same error.
I've found very little online that mentions this UNKNOWN_PROTOCOL error. Any advice appreciated, thanks.
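For context, the credentials declared as ARGs in a Dockerfile like this are normally supplied at build time; a hedged example of the build invocation (the image tag and the way the credentials are sourced are assumptions):
# Pass the AWS credentials declared as ARGs into the build (image tag is a placeholder)
docker build \
  --build-arg AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" \
  --build-arg AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" \
  -t my-kie-server:with-jar .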

How to deploy files from S3 to an EC2 instance based on an S3 event

I am working on a pipeline in which I push some artifacts into S3. I have written a shell script that downloads the folder and copies each file to its desired location on a WildFly server (EC2 instance).
#!/bin/bash
mkdir /home/ec2-user/test-temp
cd /home/ec2-user/test-temp
aws s3 cp s3://deploy-artifacts/test-APP test-APP --recursive --region us-east-1
aws s3 cp s3://deploy-artifacts/test-COMMON test-COMMON --recursive --region us-east-1
cd /home/ec2-user/
sudo mkdir -p /opt/wildfly/modules/system/layers/base/psg/common
sudo cp -rf ./test-temp/test-COMMON/standalone/configuration/standalone.xml /opt/wildfly/standalone/configuration
sudo cp -rf ./test-temp/test-COMMON/modules/system/layers/base/com/microsoft/* /opt/wildfly/modules/system/layers/base/com/microsoft/
sudo cp -rf ./test-temp/test-COMMON/modules/system/layers/base/com/mysql /opt/wildfly/modules/system/layers/base/com/
sudo cp -rf ./test-temp/test-COMMON/modules/system/layers/base/psg/common/* /opt/wildfly/modules/system/layers/base/psg/common
sudo cp -rf ./test-temp/test-APP/standalone/deployments/HS.war /opt/wildfly/standalone/deployments
sudo cp -rf ./test-temp/test-APP/bin/resource /opt/wildfly/bin/resource
sudo cp -rf ./test-temp/test-APP/modules/system/layers/base/psg/* /opt/wildfly/modules/system/layers/base/psg/
sudo cp -rf ./test-temp/test-APP/standalone/deployments/* /opt/wildfly/standalone/deployments/
sudo chown -R wildfly:wildfly /opt/wildfly/
sudo service wildfly start
But every time I push new artifacts into S3, I have to go to the server and run this script manually. Is there a way to automate it? I was reading about Lambda, but once Lambda knows about the change in S3, where do I define my shell script to run?
Any guidance will be helpful.
To trigger the Lambda function when a file is uploaded to the S3 bucket, you have to set up an event notification on the bucket.
Steps for setting up the S3 event notification:
1 - Your Lambda function and the S3 bucket should be in the same region.
2 - Go to the Properties tab of the S3 bucket.
3 - Open the Events section and provide values for the event types, such as put or copy.
4 - Specify the Lambda ARN in the "Send to" option.
Now create a Lambda function and add the S3 bucket as a trigger. Just make sure your Lambda IAM policy is set up properly.
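The same event notification can also be set up from the AWS CLI instead of the console; a rough sketch, assuming the bucket name from the question and placeholder function, account and region values:
# Allow S3 to invoke the Lambda function (function name, account id and region are placeholders)
aws lambda add-permission \
  --function-name deploy-trigger \
  --statement-id s3-invoke \
  --action lambda:InvokeFunction \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::deploy-artifacts
# Fire the function on object put/copy events in the bucket
aws s3api put-bucket-notification-configuration \
  --bucket deploy-artifacts \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [{
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:deploy-trigger",
      "Events": ["s3:ObjectCreated:Put", "s3:ObjectCreated:Copy"]
    }]
  }'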

AWS CodeDeploy agent not able to install?

Hi, I am trying to install the CodeDeploy agent on my EC2 instance but I am not able to succeed.
I am following the steps below:
sudo apt-get update
sudo apt-get install awscli
sudo apt-get install ruby2.0
cd /home/ubuntu
sudo aws s3 cp s3://bucket-name/latest/install . --region region-name
sudo chmod +x ./install
sudo ./install auto
but the ./install file is missing for me.
I don't think it's a problem with the AMI, as I used the same steps with the same AMI on a different EC2 instance. Does anyone have any idea? Please help me.
You need to fill in the bucket name and region name in sudo aws s3 cp s3://bucket-name/latest/install . --region region-name. If you are in us-east-1, you would use aws-codedeploy-us-east-1 and us-east-1.
All the buckets follow that pattern, so you can fill in the name for whichever region you are in.
See http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-set-up-new-instance.html for a complete list of buckets for each region.
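For example, with the us-east-1 bucket filled in, the commands from the question become:
cd /home/ubuntu
sudo aws s3 cp s3://aws-codedeploy-us-east-1/latest/install . --region us-east-1
sudo chmod +x ./install
sudo ./install auto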

User Data script not downloading files from S3 bucket?

I have an AMI with s3cmd and the EC2 API tools pre-configured. While creating a new instance with user data for downloading files from an S3 bucket, I face the following problems.
In the user data I have code for:
- creating a new directory on the new instance
- downloading a file from the AWS S3 bucket
The script is:
#! /bin/bash
cd /home
mkdir pravin
s3cmd get s3://bucket/usr.sh >> temp.log
But in the above script, mkdir pravin creates a new directory named pravin, while s3cmd get s3://bucket/usr.sh does not download the file from the AWS S3 bucket.
It also creates temp.log, but it is empty.
How can I solve this problem?
An alternative solution would be to use an instance that has an IAM role assigned to it and the aws-cli, which would require that you have Python installed. All of this could be accomplished by inserting the following in the user-data field for your instance:
#!/bin/bash
apt-get update
apt-get -y install python-pip
apt-get -y install awscli
mkdir pravin
aws s3 cp s3://bucket/usr.sh temp.log --region {YOUR_BUCKET_REGION}
NOTE: The above is applicable for Ubuntu only.
And then for your instance's IAM role you would attach a policy like so:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::YourBucketName/*"
    }
  ]
}
I suspect that the user running the user data script lacks a .s3cfg file. You may need to find a way to indicate the location of the file when running this script.
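If that is the case, pointing s3cmd at an explicit config file may help; a rough sketch of the user data, assuming the config lives in the default user's home directory:
#!/bin/bash
# Point s3cmd at an explicit config file, since the root user running the
# user data script usually has no ~/.s3cfg (the path below is an assumption).
cd /home
mkdir pravin
s3cmd --config /home/ubuntu/.s3cfg get s3://bucket/usr.sh >> temp.log 2>&1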