AWS Glue Python Shell Job Connect Timeout Error - amazon-web-services

I am trying to run an AWS Glue Python Shell job, but it gives me a Connect Timeout error.
Error Image : https://i.stack.imgur.com/MHpHg.png
Script : https://i.stack.imgur.com/KQxkj.png

It looks like you didn't add a Secrets Manager endpoint to your VPC. Since the traffic does not leave the AWS network, there is no internet access inside your Glue job's VPC, so if you want to connect to Secrets Manager you need to add a VPC endpoint for it.
Refer to this guide on how to add the endpoint to your VPC, and this one to make sure your security groups are configured properly.
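As a rough sketch (assuming the AWS CLI; the IDs and region below are placeholders to replace with your own), creating the interface endpoint looks something like this:

# Hypothetical values: substitute your own VPC, subnet, security group and region
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-east-1.secretsmanager \
    --subnet-ids subnet-0123456789abcdef0 \
    --security-group-ids sg-0123456789abcdef0

The security group attached to the endpoint must allow inbound HTTPS (port 443) from the Glue job's security group.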

AWS Glue Git Issue
Hi,
We got AWS Glue Python Shell working with all dependencies as follows. The Glue job has an awscli dependency as well, along with boto3.
AWS Glue Python Shell with Internet
Add the awscli and boto3 whl files to the Python library path during Glue job execution. This option is slow, as it has to download and install the dependencies.
Download the following whl files
awscli-1.18.183-py2.py3-none-any.whl
boto3-1.16.23-py2.py3-none-any.whl
Upload the files to an S3 bucket under your chosen Python library path
Add the S3 whl file paths to the Python library path, giving the full S3 path of each whl file, separated by commas
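For illustration only (the bucket and prefix are placeholders), the resulting Python library path value would look like:

s3://your-bucket/glue/libs/awscli-1.18.183-py2.py3-none-any.whl,s3://your-bucket/glue/libs/boto3-1.16.23-py2.py3-none-any.whl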
AWS Glue Python Shell without Internet connectivity
Reference: AWS Wrangler Glue dependency build
We followed the steps mentioned above for the awscli and boto3 whl files.
Below is the latest requirements.txt, compiled for the newest versions:
colorama==0.4.3
docutils==0.15.2
rsa==4.5.0
s3transfer==0.3.3
PyYAML==5.3.1
botocore==1.19.23
pyasn1==0.4.8
jmespath==0.10.0
urllib3==1.26.2
python_dateutil==2.8.1
six==1.15.0
Download the dependencies to a libs folder:
pip download -r requirements.txt -d libs
Also move the original main whl files to the libs directory:
awscli-1.18.183-py2.py3-none-any.whl
boto3-1.16.23-py2.py3-none-any.whl
Package as a zip file:
cd libs
zip ../boto3-depends.zip *
Upload boto3-depends.zip to S3 and add the path to the Glue job's Referenced files path.
Note: it is the Referenced files path and not the Python library path.
Placeholder code to install the latest awscli and boto3 and load them into the AWS Glue Python Shell job:
import os.path
import subprocess
import sys

# borrowed from https://stackoverflow.com/questions/48596627/how-to-import-referenced-files-in-etl-scripts
def get_referenced_filepath(file_name, matchFunc=os.path.isfile):
    # Referenced files are placed on sys.path by Glue; search it for the file
    for dir_name in sys.path:
        candidate = os.path.join(dir_name, file_name)
        if matchFunc(candidate):
            return candidate
    raise Exception("Can't find file: {}".format(file_name))

zip_file = get_referenced_filepath("boto3-depends.zip")

# NOTE: the exact command arguments were lost in formatting; the two calls below
# are a best-guess reconstruction of the intent (unzip the referenced archive,
# then install from it).
subprocess.run("unzip -o {} -d /tmp/boto3-depends".format(zip_file), shell=True)

# Can't install --user, or without "-t ." because of permissions issues on the filesystem
subprocess.run("pip3 install --no-index --find-links=/tmp/boto3-depends -t . awscli boto3", shell=True)

# Additional code as part of AWS thread https://forums.aws.amazon.com/thread.jspa?messageID=954344
sys.path.insert(0, '/glue/lib/installation')
keys = list(sys.modules.keys())
for k in keys:
    if 'boto' in k:
        del sys.modules[k]

import boto3
print('boto3 version')
print(boto3.__version__)
Check that the code works with the latest AWS CLI API.
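One way to do that check (my own addition, not part of the original steps) is to print the installed package versions right after the install step:

# Hypothetical verification: print the versions picked up from the fresh install
import awscli
import botocore
print('awscli version:', awscli.__version__)
print('botocore version:', botocore.__version__)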
Thanks
Sarath

Related

Where do I put a custom config (.cfg) file in AWS MWAA Airflow?

I have a config file, dev.cfg, that looks like this:
[S3]
bucket_name = my-bucket
I need this in my code to do S3 things. I do not have access (nor will I be given it) to modify the environment or configuration variables in the AWS console. My only method of putting files in S3 is the CLI (aws s3 cp ...).
This is the project directory structure in S3:
my-bucket/
    dags/
        dev.cfg
        some_dag_file.py
    plugins.zip
    requirements.txt
In the plugins.zip file, there is a plugin that should set the path to dev.cfg to an env var (DEV_CONFIG_PATH) that my code uses:
import os
from airflow.plugins_manager import AirflowPlugin

os.environ["DEV_CONFIG_PATH"] = os.path.join(
    os.getenv('AIRFLOW_HOME'), 'dags', 'dev.cfg')

class EnvVarPlugin(AirflowPlugin):
    name = 'env_var_plugin'
However, I'm getting an import error in the Airflow UI:
configparser.NoSectionError: No section: 'S3'
Any help is appreciated.
I'm using MWAA with Airflow version 2.4.3 running python 3.10.
Try asking your infra team to set the Airflow configuration option core.lazy_load_plugins to False.
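For context, here is a minimal sketch (my own illustration, reusing the DEV_CONFIG_PATH variable from the plugin above) of the read path that produces this error. configparser.read() silently skips files it cannot find, so if the plugin never runs and the variable or file is missing, the later section lookup raises NoSectionError:

import configparser
import os

config = configparser.ConfigParser()
# read() ignores missing files instead of raising, so a bad path fails later
config.read(os.getenv("DEV_CONFIG_PATH", "dev.cfg"))
# raises configparser.NoSectionError: No section: 'S3' if nothing was loaded
bucket_name = config.get("S3", "bucket_name")
print(bucket_name)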

AWS Lambda Console - Upgrade boto3 version

I am creating a DeepLens project to recognise people when one of a select group of people is scanned by the camera.
The project uses a Lambda function, which processes the images and calls the 'rekognition' AWS API.
When I trigger the API from my local machine, I get a good response.
When I trigger the API from the AWS console, I get a failed response.
Problem
After much digging, I found that the 'boto3' (AWS python library) is of version:
1.9.62 - on my local machine
1.8.9 - on AWS console
Question
Can I upgrade the 'boto3' library version in the AWS Lambda console? If so, how?
If you don't want to package a more recent boto3 version with your function, you can download boto3 with each invocation of the Lambda. Remember that /tmp/ is the only directory that Lambda will allow you to write to, so you can use it to temporarily download boto3:
import sys
from pip._internal import main

main(['install', '-I', '-q', 'boto3', '--target', '/tmp/', '--no-cache-dir', '--disable-pip-version-check'])
sys.path.insert(0, '/tmp/')

import boto3
from botocore.exceptions import ClientError

def handler(event, context):
    print(boto3.__version__)
You can achieve the same by packaging a Python function with its dependencies or by using a virtual environment.
Those are the available options; other than that, you can also try contacting the AWS team to see if they can help you with the upgrade.
I know you're asking for a solution through the console, but as far as I know this is not possible.
To solve this you need to provide the boto3 version you require to your lambda (either with the solution from user1998671 or with what Shivang Agarwal is proposing). A third solution is to provide the required boto3 version as a layer for the lambda. The big advantage of the layer is that you can re-use it for all your lambdas.
This can be achieved by following the guide from AWS (the following is mainly copied from the linked guide from AWS):
IMPORTANT: Make sure to replace boto3-mylayer with a name that suits you.
Create a lib folder by running the following command:
LIB_DIR=boto3-mylayer/python
mkdir -p $LIB_DIR
Install the library to LIB_DIR by running the following command:
pip3 install boto3 -t $LIB_DIR
Zip all the dependencies to /tmp/boto3-mylayer.zip by running the following command:
cd boto3-mylayer
zip -r /tmp/boto3-mylayer.zip .
Publish the layer by running the following command:
aws lambda publish-layer-version --layer-name boto3-mylayer --zip-file fileb:///tmp/boto3-mylayer.zip
The command returns the new layer's Amazon Resource Name (ARN), similar to the following one:
arn:aws:lambda:region:$ACC_ID:layer:boto3-mylayer:1
To attach this layer to your lambda execute the following:
aws lambda update-function-configuration --function-name <name-of-your-lambda> --layers <layer ARN>
To verify the boto version in your lambda you can simply add the following two print commands in your lambda:
print(boto3.__version__)
print(botocore.__version__)
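Putting it together, here is a minimal handler sketch (my own example; it only assumes the layer created above is attached to the function):

import boto3
import botocore

def handler(event, context):
    # With the boto3-mylayer layer attached, these should report the layer's
    # versions rather than the runtime's bundled ones.
    print(boto3.__version__)
    print(botocore.__version__)
    return {"boto3": boto3.__version__, "botocore": botocore.__version__}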

How to import Spark packages in AWS Glue?

I would like to use the GraphFrames package. If I were to run pyspark locally, I would use the command:
~/hadoop/spark-2.3.1-bin-hadoop2.7/bin/pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
But how would I run an AWS Glue script with this package? I found nothing in the documentation...
You can provide a path to extra libraries packaged into zip archives located in s3.
Please check out this doc for more details
It's possible to use graphframes as follows:
Download the graphframes Python library package file, e.g. from here. Unzip the .tar.gz and then re-archive it to a .zip. Put it somewhere in S3 that your Glue job has access to.
When setting up your glue job:
Make sure that your Python Library Path references the zip file
For job parameters, you need {"--conf": "spark.jars.packages=graphframes:graphframes:0.6.0-spark2.3-s_2.11"}
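Once both are in place, a minimal sketch of using the package inside the Glue script (the DataFrames below are made-up examples, not from the original answer):

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()
# Toy graph: GraphFrame expects an "id" column for vertices and "src"/"dst" for edges
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "knows")], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
g.inDegrees.show()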
Everyone looking for an answer, please read this comment.
In order to use an external package in AWS Glue PySpark or Python shell:
1)
Clone the repo from the following URL:
https://github.com/bhavintandel/py-packager/tree/master
git clone git@github.com:bhavintandel/py-packager.git
cd py-packager
2)
Add your required package under requirements.txt. For ex.,
pygeohash
Update the version and project name under setup.py. For ex.,
VERSION = "0.1.0"
PACKAGE_NAME = "dependencies"
3) Run the following "command1" to create a .zip package for PySpark, OR "command2" to create egg files for Python shell:
command1:
sudo make build_zip
Command2:
sudo make bdist_egg
The above commands will generate the package in the dist folder.
4) Finally, upload this package from the dist directory to an S3 bucket. Then go to the AWS Glue job console, edit the job, find the script libraries option, click the folder icon of "python library path", and select your S3 path.
Then use it in your Glue script:
import pygeohash as pgh
Done!
Also set the --user-jars-first: "true" parameter in the Glue job.

AWS Lambda returns "Unable to import module"

I am running the AWS DevSecOps project present here:
In the StaticCodeAnalysis stage of the pipeline I am getting AWS Lambda function failed.
On checking the log the error is:
"Unable to import module cfn_validate_lambda: No module named cfn_validate_lambda".
I checked the S3 bucket that has the Python code zip, and also ensured that the zip file has Public in its permissions.
Please let me know how to resolve this.
Thanks.
You have to package and zip the dependencies carefully...
The problem lies in the packaging hierarchy. After you install the dependencies in a directory, zip the lambda function as follows (in the example below, lambda_function is the name of my function)
Try this:
pip install requests -t .
zip -r9 lambda_function.zip .
zip -g lambda_function.zip lambda_function.py
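For reference, the unzipped archive should end up with the handler module and the dependency packages at its top level, roughly like this (the dependency directories shown are only illustrative; cfn_validate_lambda.py stands in for whatever module your handler setting points to):

lambda_function.zip
    lambda_function.py        (or cfn_validate_lambda.py in your case)
    requests/
    ...other dependency packages...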

Copying/using Python files from S3 to Amazon Elastic MapReduce at bootstrap time

I've figured out how to install python packages (numpy and such) at the bootstrapping step using boto, as well as copying files from S3 to my EC2 instances, still with boto.
What I haven't figured out is how to distribute python scripts (or any file) from S3 buckets to each EMR instance using boto. Any pointers?
If you are using boto, I recommend packaging all your Python files in an archive (.tar.gz format) and then using the cacheArchive directive in Hadoop/EMR to access it.
This is what I do:
Put all necessary Python files in a sub-directory, say, "required/" and test it locally.
Create an archive of this: cd required && tar czvf required.tgz *
Upload this archive to S3: s3cmd put required.tgz s3://yourBucket/required.tgz
Add this command-line option to your steps: -cacheArchive s3://yourBucket/required.tgz#required
The last step will ensure that your archive file containing Python code will be in the same directory format as in your local dev machine.
To actually do step #4 in boto, here is the code:
step = StreamingStep(name=jobName,
                     mapper='...',
                     reducer='...',
                     ...
                     cache_archives=["s3://yourBucket/required.tgz#required"],
                     )
conn.add_jobflow_steps(jobID, [step])
And to allow for the imported code in Python to work properly in your mapper, make sure to reference it as you would a sub-directory:
sys.path.append('./required')
import myCustomPythonClass
# Mapper: do something!