AWS CloudFormation Package Glue Extra Python Files

I am trying to use CloudFormation package to include the Glue script and extra Python files from the repo so they are uploaded to S3 during the package step.
For the Glue script it's straightforward; I can use
Properties:
  Command:
    Name: pythonshell # glueetl = Spark, pythonshell = Python shell
    PythonVersion: 3
    ScriptLocation: "../glue/test.py"
But how would I be able to do the same for the extra Python files? The following does not work. It seems that I could upload the file using the Include transform, but I'm not sure how to reference it back in --extra-py-files:
DefaultArguments:
  "--extra-py-files":
    - "../glue/test2.py"

Sadly, you can't do this. For Glue, package only supports the:
Command.ScriptLocation property for the AWS::Glue::Job resource
Packaging DefaultArguments is not supported. This means that you have to do it "manually" (e.g. with a small upload script) outside of CloudFormation, as sketched below.
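A minimal sketch of that manual step, assuming boto3 and a placeholder bucket name (neither is from the original answer): upload the extra file to S3 yourself, then point --extra-py-files at the uploaded object.

import boto3

# Upload the extra Python file to a bucket you control (names are placeholders).
s3 = boto3.client("s3")
s3.upload_file("../glue/test2.py", "my-artifact-bucket", "glue/test2.py")

# Then reference the uploaded object directly in the template:
#   DefaultArguments:
#     "--extra-py-files": "s3://my-artifact-bucket/glue/test2.py"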

Related

google cloud functions command to package without deploying

I must be missing something because I can't find this option here: https://cloud.google.com/sdk/gcloud/reference/beta/functions/deploy
I want to package and upload my function to a bucket: --stage-bucket
But not actually deploy the function
I'm going to deploy multiple functions (different handlers) from the same package with a Deployment Manager template: type: 'gcp-types/cloudfunctions-v1:projects.locations.functions'
gcloud beta functions deploy insists on packaging AND deploying the function.
Where is the gcloud beta functions package command?
Here is an example of the DM template I plan to run:
resources:
- name: resource-name
type: 'gcp-types/cloudfunctions-v1:projects.locations.functions'
properties:
labels:
testlabel1: testlabel1value
testlabel2: testlabel2value
parent: projects/my-project/locations/us-central1
location: us-central1
function: function-name
sourceArchiveUrl: 'gs://my-bucket/some-zip-i-uploaded.zip'
environmentVariables:
test: '123'
entryPoint: handler
httpsTrigger: {}
timeout: 60s
availableMemoryMb: 256
runtime: nodejs8
EDIT: I realized I have another question. When I upload a zip, does that zip need to include dependencies? Do I have to do npm install or pip install first and include those packages in the zip, or does Cloud Functions read my requirements.txt and package.json and do that for me?
The SDK CLI does not provide a command to package your function.
This link will provide you with detail on how to zip your files together. There are just two points to follow:
File type should be a zip file.
File size should not exceed 100MB limit.
Then you need to call an API, which returns a Signed URL to upload the package.
Once uploaded, you can specify the URL, minus the extra parameters, as the location.
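A rough sketch of that flow in Python, assuming the google-api-python-client and requests libraries with default application credentials; the project, region, and file names are placeholders, not part of the original answer:

import requests
from googleapiclient import discovery

service = discovery.build("cloudfunctions", "v1")
parent = "projects/my-project/locations/us-central1"

# Ask the Cloud Functions API for a signed upload URL.
upload_url = (
    service.projects()
    .locations()
    .functions()
    .generateUploadUrl(parent=parent, body={})
    .execute()["uploadUrl"]
)

# PUT the zip to the signed URL; these two headers are required by the API.
with open("function.zip", "rb") as zip_file:
    requests.put(
        upload_url,
        data=zip_file,
        headers={
            "Content-Type": "application/zip",
            "x-goog-content-length-range": "0,104857600",
        },
    )

# Per the answer above, the same URL (minus its query parameters) is then
# used as the function's source location in the template.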
There is no gcloud functions command to "package" your deployment, presumably because this amounts to just creating a zip file and putting it into the right place, then referencing that place.
Probably the easiest way to do this is to generate a zip file and copy it into a GCS bucket, then set the sourceArchiveUrl on the template to the correct location.
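A minimal sketch of that approach in Python, assuming the google-cloud-storage client library; the source directory, bucket, and object names are placeholders:

import shutil
from google.cloud import storage

# Zip the function's source directory into function.zip.
shutil.make_archive("function", "zip", "my-function-src")

# Copy the archive to a GCS bucket; the resulting gs:// path is what goes
# into sourceArchiveUrl in the Deployment Manager template.
bucket = storage.Client().bucket("my-bucket")
bucket.blob("function.zip").upload_from_filename("function.zip")
# sourceArchiveUrl: 'gs://my-bucket/function.zip'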
There are 2 other methods:
You can point to source code in source repository (this would use the sourceRepository part of the template).
You can get a direct url (using this API) to upload a ZIP file to using a PUT request, upload the code there, and then pass this same URL to the signedUploadUrl on the template. This is the method discussed in #John's answer. It does not require you to do any signing yourself, and likewise does not require you to create your own bucket to store the code in (the "Signed URL" refers to a private cloud functions location).
At least with the two zip file methods you do not need to include the (publicly available) dependencies -- the package.json (or requirements.txt) file will be processed by cloud functions to install them. I don't know about the SourceRepository method but I would expect it would work similarly. There's documentation about how cloud functions installs dependencies during deployment of a function for node and python.

AWS Glue automatic job creation

I have a PySpark script which I can run in AWS Glue. But every time I create the job from the UI and copy my code into the job. Is there any way I can automatically create the job from my file in an S3 bucket? (I have all the libraries and the Glue context that will be used while running.)
Another alternative is to use AWS CloudFormation. You can define all the AWS resources you want to create (not only Glue jobs) in a template file and then update the stack whenever you need, from the AWS Console or using the CLI (a sketch for driving this from code follows the template below).
A template for a Glue job would look like this:
MyJob:
  Type: AWS::Glue::Job
  Properties:
    Command:
      Name: glueetl
      ScriptLocation: "s3://aws-glue-scripts//your-script-file.py"
    DefaultArguments:
      "--job-bookmark-option": "job-bookmark-enable"
    ExecutionProperty:
      MaxConcurrentRuns: 2
    MaxRetries: 0
    Name: cf-job1
    Role: !Ref MyJobRole # reference to a Role resource which is not presented here
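To drive the same deployment from code instead of the console, a minimal boto3 sketch (the stack name, template file name, and capabilities are assumptions, not part of the original answer) could look like this:

import boto3

cfn = boto3.client("cloudformation")
with open("glue-jobs-template.yaml") as f:
    template_body = f.read()

# Create the stack on the first run; use update_stack for later changes.
cfn.create_stack(
    StackName="my-glue-jobs",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_IAM"],  # needed if the template also creates the IAM role
)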
I created an open source library called datajob to deploy and orchestrate glue jobs. You can find it on github https://github.com/vincentclaes/datajob and on pypi
pip install datajob
npm install -g aws-cdk@1.87.1
You create a file datajob_stack.py that describes your Glue jobs and how they are orchestrated:
from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow

with DataJobStack(stack_name="data-pipeline-simple") as datajob_stack:
    # here we define 3 glue jobs with a relative path to the source code.
    task1 = GlueJob(
        datajob_stack=datajob_stack,
        name="task1",
        job_path="data_pipeline_simple/task1.py",
    )
    task2 = GlueJob(
        datajob_stack=datajob_stack,
        name="task2",
        job_path="data_pipeline_simple/task2.py",
    )
    task3 = GlueJob(
        datajob_stack=datajob_stack,
        name="task3",
        job_path="data_pipeline_simple/task3.py",
    )

    # we instantiate a step functions workflow and add the sources
    # we want to orchestrate.
    with StepfunctionsWorkflow(
        datajob_stack=datajob_stack, name="data-pipeline-simple"
    ) as sfn:
        [task1, task2] >> task3
To deploy your code to Glue, execute:
export AWS_PROFILE=my-profile
datajob deploy --config datajob_stack.py
Any feedback is much appreciated!
Yes, it is possible. For instance, you can use the boto3 library for this purpose:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.create_job
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html
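For example, a minimal sketch using boto3's create_job call; the role ARN, bucket, and script names are placeholders, not from the original answer:

import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="cf-job1",
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",  # placeholder role ARN
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://aws-glue-scripts/your-script-file.py",
    },
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
    MaxRetries=0,
)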
I wrote a script which does the following:
We have a (glue)_dependency.txt file; the script gets the paths of all dependency files and creates a zip file.
It uploads the Glue file and the zip file to S3 by using s3 sync.
Optionally, if there is any change in the job settings, it will re-deploy the CloudFormation template.
You may write a shell script to do this; a rough sketch of the workflow is below.
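A rough Python sketch of that workflow (the dependency file name, script name, and bucket are assumptions; the original answer uses a shell script with s3 sync instead):

import zipfile
import boto3

# Read the dependency list: one path per line.
with open("glue_dependency.txt") as f:
    dependency_paths = [line.strip() for line in f if line.strip()]

# Bundle all dependency files into a single zip.
with zipfile.ZipFile("dependencies.zip", "w") as zf:
    for path in dependency_paths:
        zf.write(path)

# Upload the Glue script and the zip to S3 (bucket and keys are placeholders).
s3 = boto3.client("s3")
s3.upload_file("glue_job.py", "my-glue-bucket", "scripts/glue_job.py")
s3.upload_file("dependencies.zip", "my-glue-bucket", "scripts/dependencies.zip")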

aws: .net Core: zip the built code and copy to s3 output bucket

I am a .NET developer using a .NET Core 2.x application to build and upload the release code to an S3 bucket. Later that code will be used to deploy to an EC2 instance.
I am new to CI/CD on AWS and still in the learning phase.
In order to create CI/CD for my sample project, I went through some AWS tutorials and was able to create the following buildspec.yml file. Using that file I am able to run a successful build.
The problem comes in the UPLOAD_ARTIFACTS phase. I am unable to understand how to create a zip file that will be uploaded to the S3 bucket specified in the build project.
My buildspec.yml file contains the following code. Please help me find what is wrong or what I am missing.
version: 0.2
phases:
  build:
    commands:
      - dotnet restore
      - dotnet build
artifacts:
  files:
    - target/cicdrepo.zip
    - .\bin\Debug\netcoreapp2.1\*
I think I have to add a post_build phase and some commands that will generate the zip file, but I don't know the commands.
Following is the output image from the build logs.
Your file is good. All you need to do is create an S3 bucket, then configure your CodeBuild project to zip (or not) your artifacts for you and store them in S3.
This is the step you need to configure:
Edit:
If you want all your files to be copied to the root of your zip file, you can use:
artifacts:
  files:
    - ...
  discard-paths: yes

AWS Lambda Console - Upgrade boto3 version

I am creating a DeepLens project to recognise people when one of a select group of people is scanned by the camera.
The project uses a Lambda function, which processes the images and calls the 'rekognition' AWS API.
When I trigger the API from my local machine, I get a good response.
When I trigger the API from the AWS console, I get a failed response.
Problem
After much digging, I found that the 'boto3' (AWS python library) is of version:
1.9.62 - on my local machine
1.8.9 - on AWS console
Question
Can I upgrade the 'boto3' library version on the AWS Lambda console? If so, how?
If you don't want to package a more recent boto3 version with your function, you can download boto3 with each invocation of the Lambda. Remember that /tmp/ is the directory that Lambda allows you to write to, so you can use it to temporarily download boto3:
import sys
from pip._internal import main

main(['install', '-I', '-q', 'boto3', '--target', '/tmp/', '--no-cache-dir', '--disable-pip-version-check'])
sys.path.insert(0, '/tmp/')

import boto3
from botocore.exceptions import ClientError

def handler(event, context):
    print(boto3.__version__)
You can achieve the same with either a Python function with dependencies or with a virtual environment.
These are the available options; other than that, you can also try to contact the Amazon team to see if they can help you with the upgrade.
I know you're asking for a solution through the console, but to my knowledge this is not possible.
To solve this you need to provide the boto3 version you require to your lambda (either with the solution from user1998671 or with what Shivang Agarwal is proposing). A third solution is to provide the required boto3 version as a layer for the lambda. The big advantage of the layer is that you can re-use it for all your lambdas.
This can be achieved by following the guide from AWS (the following is mainly copied from the linked guide from AWS):
IMPORTANT: Make sure to replace boto3-mylayer with a name that suits you.
Create a lib folder by running the following command:
LIB_DIR=boto3-mylayer/python
mkdir -p $LIB_DIR
Install the library to LIB_DIR by running the following command:
pip3 install boto3 -t $LIB_DIR
Zip all the dependencies to /tmp/boto3-mylayer.zip by running the following command:
cd boto3-mylayer
zip -r /tmp/boto3-mylayer.zip .
Publish the layer by running the following command:
aws lambda publish-layer-version --layer-name boto3-mylayer --zip-file fileb:///tmp/boto3-mylayer.zip
The command returns the new layer's Amazon Resource Name (ARN), similar to the following one:
arn:aws:lambda:region:$ACC_ID:layer:boto3-mylayer:1
To attach this layer to your lambda execute the following:
aws lambda update-function-configuration --function-name <name-of-your-lambda> --layers <layer ARN>
To verify the boto3 version in your Lambda, you can simply add the following two print statements to your function (botocore must also be imported for the second one):
print(boto3.__version__)
print(botocore.__version__)

Copying/using Python files from S3 to Amazon Elastic MapReduce at bootstrap time

I've figured out how to install python packages (numpy and such) at the bootstrapping step using boto, as well as copying files from S3 to my EC2 instances, still with boto.
What I haven't figured out is how to distribute python scripts (or any file) from S3 buckets to each EMR instance using boto. Any pointers?
If you are using boto, I recommend packaging all your Python files in an archive (.tar.gz format) and then using the cacheArchive directive in Hadoop/EMR to access it.
This is what I do:
1. Put all necessary Python files in a sub-directory, say, "required/", and test it locally.
2. Create an archive of this: cd required && tar czvf required.tgz *
3. Upload this archive to S3: s3cmd put required.tgz s3://yourBucket/required.tgz
4. Add this command-line option to your steps: -cacheArchive s3://yourBucket/required.tgz#required
The last step will ensure that your archive file containing Python code will be in the same directory format as in your local dev machine.
To actually do step #4 in boto, here is the code:
step = StreamingStep(name=jobName,
                     mapper='...',
                     reducer='...',
                     ...
                     cache_archives=["s3://yourBucket/required.tgz#required"],
                     )
conn.add_jobflow_steps(jobID, [step])
And to allow for the imported code in Python to work properly in your mapper, make sure to reference it as you would a sub-directory:
sys.path.append('./required')
import myCustomPythonClass
# Mapper: do something!