How to import Spark packages in AWS Glue?

I would like to use the GraphFrames package. If I were to run pyspark locally, I would use the command:
~/hadoop/spark-2.3.1-bin-hadoop2.7/bin/pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
But how would I run an AWS Glue script with this package? I found nothing in the documentation...

You can provide a path to extra libraries packaged into zip archives located in S3.
Please check out this doc for more details.

It's possible to use graphframes as follows:
Download the graphframes Python library package file, e.g. from here. Unzip the .tar.gz and then re-archive it to a .zip. Put it somewhere in S3 that your Glue job has access to.
When setting up your Glue job:
Make sure that your Python Library Path references the zip file
For job parameters, you need {"--conf": "spark.jars.packages=graphframes:graphframes:0.6.0-spark2.3-s_2.11"}
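If you prefer to configure the job with boto3 rather than the console, a minimal sketch of the same settings follows; the job name, role, bucket, and script location are assumptions, not values from the question:
import boto3

glue = boto3.client("glue")

# Hypothetical names; the two DefaultArguments mirror the console settings above.
glue.create_job(
    Name="graphframes-job",
    Role="MyGlueServiceRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/my_job.py"},
    DefaultArguments={
        # Python library path -> the graphframes zip uploaded to S3
        "--extra-py-files": "s3://my-bucket/libs/graphframes.zip",
        # Spark packages configuration
        "--conf": "spark.jars.packages=graphframes:graphframes:0.6.0-spark2.3-s_2.11",
    },
)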

Everyone looking for an answer, please read this comment.
In order to use an external package in AWS Glue PySpark or Python shell jobs:
1)
Clone the repo from the following URL:
https://github.com/bhavintandel/py-packager/tree/master
git clone git@github.com:bhavintandel/py-packager.git
cd py-packager
2)
Add your required package under requirements.txt. For example:
pygeohash
Update the version and project name under setup.py. For example:
VERSION = "0.1.0"
PACKAGE_NAME = "dependencies"
3) Run the following "command 1" to create a .zip package for PySpark, OR "command 2" to create egg files for a Python shell job:
Command 1:
sudo make build_zip
Command 2:
sudo make bdist_egg
The above commands will generate the package in the dist folder.
4) Finally, upload this package from the dist directory to an S3 bucket. Then go to the AWS Glue Job Console, edit the job, find the script libraries option, click the folder icon of "Python library path", then select your S3 path.
Finally, use it in your Glue script:
import pygeohash as pgh
Done!

Also set the --user-jars-first: "true" parameter in the Glue job.

Related

Can't import packages from layers in AWS Lambda

I know this question exists in several places, but even by following different guides/answers I still can't get it to work. I have no idea what I'm doing wrong. I have a Lambda Python function on AWS where I need to do an "import requests". This is my approach so far.
Create a .zip directory of the packages. Locally I do:
pip3 install requests -t ./
zip -r okta_layer.zip .
Upload the .zip directory to a Lambda layer:
I go to the AWS console and go to Lambda layers. I create a new layer based on this .zip file.
I go to my Lambda Python function and add the layer to the function directly from the console. I can now see the layer under "Layers" for the Lambda function. But when I run the script, it still complains:
Unable to import module 'lambda_function': No module named 'requests'
I solved the problem. Apparently I needed to have a .zip archive with a "python" folder inside, and all the packages inside that "python" folder.
I only had the packages directly in the zip, without a "python" folder ...
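For reference, a minimal sketch of building the layer with that structure from Python; the directory and archive names follow the example above, everything else is an assumption:
import shutil
import subprocess

# Install the packages into a "python/" subfolder, as Lambda layers expect.
subprocess.run(["pip3", "install", "requests", "-t", "okta_layer/python"], check=True)

# Zip the parent folder so that "python/" sits at the root of the archive.
shutil.make_archive("okta_layer", "zip", root_dir="okta_layer")
The resulting okta_layer.zip then contains python/requests/... at its top level, which is the layout Lambda unpacks onto the function's path.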

What is the correct way of installing a JDBC driver on EMR for Sqoop to use?

I am running Sqoop 1.4.7 on AWS EMR 5.21.1 and am trying to import data from a database. I have successfully been able to do this manually where I create an EMR instance with Sqoop installed via the EMR Console.
Here are the preliminary steps that I performed in order to run sqoop on EMR
Download the JDBC Driver
Move the JDBC driver to the /usr/lib/sqoop/lib directory
I was able to successfully run a sqoop import when I was SSH'd into an EMR cluster, using these commands:
wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar
sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/
When I try to run these commands from an EMR bootstrap script, however, I get the error:
usr/lib/sqoop/lib/ No such file or directory
After doing some investigation I realized this is because "Bootstrap actions execute before core services, such as Hadoop or Spark, are installed", as found here.
So the /usr/lib/sqoop/lib directory doesn't exist when I run my bootstrap steps.
Here are some solutions which work, but they feel like workarounds:
Create the /usr/lib/sqoop/lib directory in my bootstrap script and then place the jar in it
Add the jar to this directory as an EMR step. (It turns out this is the correct approach; see the accepted answer below.)
What is the correct way of installing this JDBC driver on EMR?
The 2nd option is the correct way to do it. The documentation explains running bash scripts as an EMR step.
You can also use command-runner.jar as the jar, with the arguments being:
bash -c "wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar;sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/"

AWS Glue Python Shell Job Connect Timeout Error

I'm trying to run an AWS Glue Python Shell job, but it gives me a Connect Timeout Error.
Error Image : https://i.stack.imgur.com/MHpHg.png
Script : https://i.stack.imgur.com/KQxkj.png
It looks like you didn't add a Secrets Manager endpoint to your VPC. Since the traffic does not leave the AWS network, there is no internet access inside your Glue job's VPC. So if you want to connect to Secrets Manager, you need to add a VPC endpoint for it.
Refer to this on how you can add it to your VPC, and this to make sure you have properly configured the security groups.
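A minimal sketch of creating that endpoint with boto3; the region, VPC, subnet, and security group IDs are placeholders:
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Interface endpoint for Secrets Manager in the Glue job's VPC (all IDs are placeholders).
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.secretsmanager",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)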
Hi,
We got AWS Glue Python Shell working with all dependencies as follows. Glue has an awscli dependency as well, along with boto3.
AWS Glue Python Shell with Internet
Add the awscli and boto3 whl files to the Python library path during Glue job execution. This option is slow, as it has to download and install the dependencies.
Download the following whl files
awscli-1.18.183-py2.py3-none-any.whl
boto3-1.16.23-py2.py3-none-any.whl
Upload the files to an S3 bucket in your given Python library path
Add the S3 whl file paths to the Python library path. Give the full S3 paths of the whl files, separated by commas, for example:
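As a sketch, the console's Python library path corresponds to the --extra-py-files job argument, and its value is simply the comma-separated S3 paths (the bucket name here is an assumption):
# Hypothetical bucket; comma-separated S3 paths to the two whl files.
default_arguments = {
    "--extra-py-files": "s3://my-bucket/libs/awscli-1.18.183-py2.py3-none-any.whl,"
                        "s3://my-bucket/libs/boto3-1.16.23-py2.py3-none-any.whl",
}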
AWS Glue Python Shell without Internet connectivity
Reference: AWS Wrangler Glue dependency build
We followed the steps mentioned above for awscli and boto3 whl files
Below is the latest requirements.txt compiled for the newest versions
colorama==0.4.3
docutils==0.15.2
rsa==4.5.0
s3transfer==0.3.3
PyYAML==5.3.1
botocore==1.19.23
pyasn1==0.4.8
jmespath==0.10.0
urllib3==1.26.2
python_dateutil==2.8.1
six==1.15.0
Download the dependencies to libs folder
pip download -r requirements.txt -d libs
Move the original main whl files to the libs directory as well
awscli-1.18.183-py2.py3-none-any.whl
boto3-1.16.23-py2.py3-none-any.whl
Package as a zip file
cd libs
zip ../boto3-depends.zip *
Upload boto3-depends.zip to S3 and add the path to the Glue job's Referenced files path
Note: It is Referenced files path and not Python library path
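If you set the job arguments programmatically instead of through the console, a minimal sketch of that distinction (the bucket name is an assumption):
# Hypothetical bucket; "Referenced files path" maps to the --extra-files job argument,
# while "Python library path" maps to --extra-py-files.
default_arguments = {
    "--extra-files": "s3://my-bucket/libs/boto3-depends.zip",
}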
Placeholder code to install the latest awscli and boto3 and load them into the AWS Glue Python shell:
import os.path
import subprocess
import sys

# borrowed from https://stackoverflow.com/questions/48596627/how-to-import-referenced-files-in-etl-scripts
def get_referenced_filepath(file_name, matchFunc=os.path.isfile):
    for dir_name in sys.path:
        candidate = os.path.join(dir_name, file_name)
        if matchFunc(candidate):
            return candidate
    raise Exception("Can't find file: {}".format(file_name))

zip_file = get_referenced_filepath("awswrangler-depends.zip")

# Unzip the referenced dependency archive (the /tmp/depends target directory is illustrative)
subprocess.run(["unzip", "-o", zip_file, "-d", "/tmp/depends"])

# Can't install --user, or without "-t ." because of permissions issues on the filesystem
subprocess.run("pip3 install -t . /tmp/depends/*.whl /tmp/depends/*.tar.gz", shell=True)

# Additional code as part of AWS Thread https://forums.aws.amazon.com/thread.jspa?messageID=954344
sys.path.insert(0, '/glue/lib/installation')
keys = list(sys.modules.keys())
for k in keys:
    if 'boto' in k:
        del sys.modules[k]

import boto3
print('boto3 version')
print(boto3.__version__)
Check that the code is working with the latest AWS CLI API.
Thanks
Sarath

How do I download files within a SageMaker notebook instance programmatically?

We have a notebook instance within Sagemaker which contains many Jupyter Python scripts. I'd like to write a program which downloads these various scripts each day (i.e. so that I could back them up). Unfortunately I don't see any reference to this in the AWS CLI API.
Is this achievable?
It's not exactly what you want, but it looks like a VCS can fit your needs. You can use GitHub (if you already use it) or CodeCommit (free private repos). Details and additional approaches, such as syncing a target directory with an S3 bucket: https://aws.amazon.com/blogs/machine-learning/how-to-use-common-workflows-on-amazon-sagemaker-notebook-instances/
Semi-automatic way:
conda install -y -c conda-forge zip
!zip -r -X folder.zip folder-to-zip
Then download that zipfile.
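From inside the notebook you can also push that archive to S3 with boto3, in line with the S3-sync idea mentioned above; the bucket and key here are assumptions:
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key for the backup archive created above.
s3.upload_file("folder.zip", "my-notebook-backups", "backups/folder.zip")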

Copying/using Python files from S3 to Amazon Elastic MapReduce at bootstrap time

I've figured out how to install python packages (numpy and such) at the bootstrapping step using boto, as well as copying files from S3 to my EC2 instances, still with boto.
What I haven't figured out is how to distribute python scripts (or any file) from S3 buckets to each EMR instance using boto. Any pointers?
If you are using boto, I recommend packaging all your Python files in an archive (.tar.gz format) and then using the cacheArchive directive in Hadoop/EMR to access it.
This is what I do:
Put all necessary Python files in a sub-directory, say, "required/" and test it locally.
Create an archive of this: cd required && tar czvf required.tgz *
Upload this archive to S3: s3cmd put required.tgz s3://yourBucket/required.tgz
Add this command-line option to your steps: -cacheArchive s3://yourBucket/required.tgz#required
The last step will ensure that your archive file containing Python code will be in the same directory format as in your local dev machine.
To actually do step #4 in boto, here is the code:
step = StreamingStep(name=jobName,
    mapper='...',
    reducer='...',
    ...
    cache_archives=["s3://yourBucket/required.tgz#required"],
)
conn.add_jobflow_steps(jobID, [step])
And to allow for the imported code in Python to work properly in your mapper, make sure to reference it as you would a sub-directory:
sys.path.append('./required')
import myCustomPythonClass
# Mapper: do something!