I know this question exists in several places, but even after following different guides/answers I still can't get it to work, and I have no idea what I'm doing wrong. I have a Python Lambda function on AWS where I need to do an "import requests". This is my approach so far.
Create a .zip archive of the packages. Locally I run:
pip3 install requests -t ./
zip -r okta_layer.zip .
Upload the .zip archive as a Lambda layer:
I go to the AWS console, open Lambda layers, and create a new layer based on this .zip file.
I go to my Python Lambda function and add the layer to the function directly from the console. I can now see the layer under "Layers" for the Lambda function. But when I run the function, it still complains:
Unable to import module 'lambda_function': No module named 'requests'
I solved the problem. Apparently the .zip archive needs a "python" folder at its root, and all the packages have to sit inside that "python" folder.
I only had the packages directly at the root of the zip, without a "python" folder.
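For reference, a minimal packaging sketch of that layout (reusing the okta_layer.zip name from the question):
# install the packages into a "python" sub-directory instead of the archive root
mkdir -p python
pip3 install requests -t python/
# zip the "python" folder itself so it sits at the top level of the archive
zip -r okta_layer.zip python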
I would like to use the GraphFrames package. If I were to run pyspark locally, I would use the command:
~/hadoop/spark-2.3.1-bin-hadoop2.7/bin/pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
But how would I run an AWS Glue script with this package? I found nothing in the documentation...
You can provide a path to extra libraries packaged into zip archives located in S3.
Please check out this doc for more details.
It's possible to use graphframes as follows:
Download the graphframes Python library package file, e.g. from here. Unzip the .tar.gz and then re-archive it as a .zip. Put it somewhere in S3 that your Glue job has access to.
When setting up your Glue job:
Make sure that your Python Library Path references the zip file
For job parameters, you need {"--conf": "spark.jars.packages=graphframes:graphframes:0.6.0-spark2.3-s_2.11"}
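As a rough sketch, the same setup could also be expressed through the AWS CLI; the job name, role, and S3 paths below are placeholders, not values from the original post:
aws glue create-job \
  --name graphframes-job \
  --role MyGlueServiceRole \
  --command Name=glueetl,ScriptLocation=s3://yourBucket/scripts/my_job.py \
  --default-arguments '{"--extra-py-files":"s3://yourBucket/libs/graphframes.zip","--conf":"spark.jars.packages=graphframes:graphframes:0.6.0-spark2.3-s_2.11"}'
Here "--extra-py-files" corresponds to the "Python library path" field in the console.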
Everyone looking for an answer, please read this comment.
In order to use an external package in AWS Glue PySpark or Python shell:
1)
Clone the repo from the following URL:
https://github.com/bhavintandel/py-packager/tree/master
git clone git@github.com:bhavintandel/py-packager.git
cd py-packager
2)
Add your required package to requirements.txt. For example:
pygeohash
Update the version and project name in setup.py. For example:
VERSION = "0.1.0"
PACKAGE_NAME = "dependencies"
3) Run the following command 1 to create a .zip package for PySpark, OR command 2 to create egg files for Python shell:
Command 1:
sudo make build_zip
Command 2:
sudo make bdist_egg
The above commands will generate the package in the dist folder.
4) Finally, upload the package from the dist directory to an S3 bucket. Then go to the AWS Glue job console, edit the job, find the script libraries option, click the folder icon of "Python library path", and select your S3 path.
Then use it in your Glue script:
import pygeohash as pgh
Done!
Also set the --user-jars-first: "true" parameter in the Glue job.
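A quick sketch of the upload in step 4 with the AWS CLI (the bucket path and the generated file name are assumptions; check what the build actually puts in dist):
aws s3 cp dist/dependencies-0.1.0.zip s3://yourBucket/glue-libs/dependencies-0.1.0.zip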
I am running the AWS DevSecOps project present here:
In the StaticCodeAnalysis stage of the pipeline, the AWS Lambda function fails.
On checking the log, the error is:
"Unable to import module cfn_validate_lambda: No module named cfn_validate_lambda".
I checked the S3 bucket that holds the Python code zip and also ensured that the zip file has public permissions.
Please let me know how to resolve this.
Thanks.
You have to package and zip the dependencies carefully...
The problem lies in the packaging hierarchy. After you install the dependencies in a directory, zip the Lambda function as follows (in the example below, lambda_function is the name of my function):
Try this:
pip install requests -t .
zip -r9 lambda_function.zip .
zip -g lambda_function.zip lambda_function.py
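If you deploy from the command line, the resulting archive can then be pushed with the AWS CLI; the function name below is assumed to match the example:
aws lambda update-function-code --function-name lambda_function --zip-file fileb://lambda_function.zip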
I am using the Python module xmlsec in my Lambda function. The import looks like import dm.xmlsec.binding as xmlsec. The proper directory structure exists: at the root of the archive there is dm/xmlsec/binding/__init__.py, and the rest of the module is there. However, when executing the function on Lambda, I get the error "No module named dm.xmlsec.binding".
I have built many Python 2.7 Lambda functions in the same way as this one with no issues. I install all of the needed Python modules in my build directory, with the lambda function at the root. I then zip the package recursively and update the existing function with the resulting archive using the AWS CLI. I've also tried manually uploading the archive in the console, with the same result.
I was honestly expecting some trouble with this module, but I did expect Lambda to at least see it. What is going on?
I'm trying to download a postgres driver to each node of my cluster. I wrote the following bootstrap action, but it doesn't seem to have worked:
#!/bin/bash
aws s3 cp s3://path/to/driver/jars/postgresql-9.4.1210.jre7.jar .
I know this must be an easy thing to do, but I can't seem to find an obvious example.
The bootstrap action you have looks fine and is probably working. You are likely assuming that it will download the file to the directory where you land when SSHing into the cluster, which is /home/hadoop, but that is not the case. The working directory of bootstrap actions is somewhere under /var/lib/bootstrap-actions, if I remember correctly.
It would be easier to find the file you've downloaded if you change "." to something like "/home/hadoop". You could also create some other new directory to which to download the file as part of this script (using "sudo mkdir" and "sudo chown" if necessary).
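For example, a revised bootstrap action along these lines (the target directory is just an illustration) makes the jar easy to find later:
#!/bin/bash
# download the driver to a fixed, known location rather than the bootstrap working directory
sudo mkdir -p /home/hadoop/jars
sudo chown hadoop:hadoop /home/hadoop/jars
aws s3 cp s3://path/to/driver/jars/postgresql-9.4.1210.jre7.jar /home/hadoop/jars/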
I've figured out how to install Python packages (numpy and such) at the bootstrapping step using boto, as well as how to copy files from S3 to my EC2 instances, also with boto.
What I haven't figured out is how to distribute python scripts (or any file) from S3 buckets to each EMR instance using boto. Any pointers?
If you are using boto, I recommend packaging all your Python files in an archive (.tar.gz format) and then using the cacheArchive directive in Hadoop/EMR to access it.
This is what I do:
Put all necessary Python files in a sub-directory, say, "required/" and test it locally.
Create an archive of this: cd required && tar czvf required.tgz *
Upload this archive to S3: s3cmd put required.tgz s3://yourBucket/required.tgz
Add this command-line option to your steps: -cacheArchive s3://yourBucket/required.tgz#required
The last step will ensure that your archive file containing Python code will be in the same directory format as in your local dev machine.
To actually do step #4 in boto, here is the code:
from boto.emr.step import StreamingStep

step = StreamingStep(name=jobName,
                     mapper='...',
                     reducer='...',
                     ...
                     cache_archives=["s3://yourBucket/required.tgz#required"],
                     )
conn.add_jobflow_steps(jobID, [step])
And to allow for the imported code in Python to work properly in your mapper, make sure to reference it as you would a sub-directory:
sys.path.append('./required')
import myCustomPythonClass
# Mapper: do something!