AWS Glue automatic job creation - amazon-web-services

I have pyspark script which I can run in AWS GLUE. But everytime I am creating job from UI and copying my code to the job .Is there anyway I can automatically create job from my file in s3 bucket. (I have all the library and glue context which will be used while running )

Another alternative is to use AWS CloudFormation. You can define all AWS resources you want to create (not only Glue jobs) in a template file and then update stack whenever you need from AWS Console or using cli.
Template for a Glue job would look like this:
MyJob:
Type: AWS::Glue::Job
Properties:
Command:
Name: glueetl
ScriptLocation: "s3://aws-glue-scripts//your-script-file.py"
DefaultArguments:
"--job-bookmark-option": "job-bookmark-enable"
ExecutionProperty:
MaxConcurrentRuns: 2
MaxRetries: 0
Name: cf-job1
Role: !Ref MyJobRole # reference to a Role resource which is not presented here

I created an open source library called datajob to deploy and orchestrate glue jobs. You can find it on github https://github.com/vincentclaes/datajob and on pypi
pip install datajob
npm install -g aws-cdk#1.87.1
you create a file datajob_stack.py that describes your glue jobs and how they are orchestrated:
from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow
with DataJobStack(stack_name="data-pipeline-simple") as datajob_stack:
# here we define 3 glue jobs with a relative path to the source code.
task1 = GlueJob(
datajob_stack=datajob_stack,
name="task1",
job_path="data_pipeline_simple/task1.py",
)
task2 = GlueJob(
datajob_stack=datajob_stack,
name="task2",
job_path="data_pipeline_simple/task2.py",
)
task3 = GlueJob(
datajob_stack=datajob_stack,
name="task3",
job_path="data_pipeline_simple/task3.py",
)
# we instantiate a step functions workflow and add the sources
# we want to orchestrate.
with StepfunctionsWorkflow(
datajob_stack=datajob_stack, name="data-pipeline-simple"
) as sfn:
[task1, task2] >> task3
To deploy your code to glue execute:
export AWS_PROFILE=my-profile
datajob deploy --config datajob_stack.py
any feedback is much appreciated!

Yes, it is possible. For instance, you can use boto3 framework for this purpose.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.create_job
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html

I wrote script which does following:
We have (glue)_dependency.txt file, script gets path of all dependency files and create zip file.
It uploads glue file and zip file in S3 by using s3 sync
Optionally, if any change in job setting will re-deploy cloudformation template
You may write shell script to do it.

Related

Find which files were updated in a commit in AWS CodePipeline

I am triggering AWS CodePipeline with a PR commit through Bitbucket.
Is there any way I can find which files were updated/created with the latest commit?
I looked into the Environment Variables and I looked into the pipeline's details and history CLI commands:
$ aws codepipeline list-pipeline-executions --pipeline-name <name> and
$ aws codepipeline get-pipeline-state --name <name> but they don't have the information I need.
Why do I need this?
The pipeline triggers one or more Airflow DAGs and listens to their status. We include the DAG name(s) in the branch name, we parse it using CODEBUILD_WEBHOOK_HEAD_REF and we trigger the DAG(s). This is error-prone, and since the DAG names match their filename we would like to extract them automatically.
Is it something doable? Thanks for your help!
You can get the commit hash from the pipeline execution.
when using cli:
aws codepipeline list-pipeline-executions --pipeline-name <pipeline-name>
you get a list of executions, the first execution is the latest one.
pipelineExecutionSummaries: [
{ ...
sourceRevisions: [
{
...
revisionId: "<commit-hash>"
}
]
}
...
]
then when you have the commit hash you can get the files that are modified in that commit
git diff-tree --stat <commit-hash>
further logic depending on your application you can handle when you got the files that are modified.

AWS CloudFormation Package Glue Extra Python Files

I am trying to use Cloudformation package to include the glue script and extra python files from the repo to be uploaded to s3 during the package step.
For the glue script it's straightforward where I can use
Properties:
Command:
Name: pythonshell #glueetl -spark # pythonshell -python shell...
PythonVersion: 3
ScriptLocation: "../glue/test.py"
But how would I be able to do the same for extra python files? The following does not work, it seems that I could upload the file using the Include Transform but not sure how to reference it back in extra-py-files?
DefaultArguments:
"--extra-py-files":
- "../glue/test2.py"
Sadly, you can't do this. package only supports for glue:
Command.ScriptLocation property for the AWS::Glue::Job resource
Packaging DefaultArguments arguments is not supported. This means that you have to do it "manually" (e.g. create bash script) outside of CloudFormation.

AWS Lambda for .net core using GitLab

I am new to gitlab and need to use it for pushing zipped lambda code written in .net core to s3 which will be later referenced by terraform.
Below is the step that works WITHOUT GITLAB
Created a dotnet Core Lambda Application for DynamoDB using command -
"dotnet new lambda.Dynamo" (standard template)
Ran command dotnet lambda package => generated a zip file in release
folder of the solution
Wrote a terraform script which reference this zip file and ran
terraform to create Lambda in AWS UI
Now I need to achieve all this USING GITLAB
Below are the pseudo steps
Create a new gitlab repo and upload .net core Lambda Application for
DynamoDB
add .gitlab-ci.yml =>
A. This script should be able to create a zip file for the .net core project
B. This script should be able to push the generated zip file to S3
Have a terraform that references this Zip file from S3.
Question 1 - Does the above mentioned Pseudo Stepsfor gitlab makes logical sense?
Question 2 - Can I get some guidance on build script which can help in achieving the pseudo steps. Thanks!

Unable to spark-submit a pyspark file on s3 bucket

I have a pyspark code stored both on the master node of an AWS EMR cluster and in an s3 bucket that fetches over 140M rows from a MySQL database and stores the sum of a column back in the log files on s3.
When I spark-submit the pyspark code on the master node, the job gets completed successfully and the output is stored in the log files on the S3 bucket.
However, when I spark-submit the pyspark code on the S3 bucket using these- (using the below commands on the terminal after SSH-ing to the master node)
spark-submit --master yarn --deploy-mode cluster --py-files s3://bucket_name/my_script.py
This returns a Error: Missing application resource. error.
spark_submit s3://bucket_name/my_script.py
This shows :
20/07/02 11:26:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2369)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1911)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:766)
at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:137)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
... 20 more
I read about having to add a Spark Step on the AWS EMR cluster to submit a pyspark code stored on the S3.
Am I correct in saying that I would need to create a step in order to submit my pyspark job stored on the S3?
In the 'Add Step' window that pops up on the AWS Console, in the 'Application location' field, it says that I'll have to type in the location to the JAR file. What JAR file are they referring to? Does my pyspark script have to be packaged into a JAR file and how do I do that or do I mention the path to my pyspark script?
In the 'Add Step' window that pops up on the AWS Console, in the Spark-submit options, how do I know what to write for the --class parameter? Can I leave this field empty? If no, why not?
I have gone through the AWS EMR documentation. I have so many questions because I dived nose-down into the problem and only researched when an error popped up.
Your spark submit should be this.
spark-submit --master yarn --deploy-mode cluster s3://bucket_name/my_script.py
--py-files is used if you want to pass the python dependency modules, not the application code.
When you are adding step in EMR to run spark job, jar location is your python file path. i.e. s3://bucket_name/my_script.py
No its not mandatory to use STEP to submit spark job.
You can also use spark-submit
To submit a pyspark script using STEP please refer aws doc and stackoverflow
For problem 1:
By default spark will use python2.
You need to add 2 config
Go to $SPARK_HOME/conf/spark-env.sh and add
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
Note: If you have any custom bundle add that using --py-files
For problem 2:
A hadoop-assembly jar exists on /usr/share/aws/emr/emrfs/lib/. That contains com.amazon.ws.emr.hadoop.fs.EmrFileSystem.
You need to add this to your classpath.
A better option to me is to create a symbolic link of hadoop-assembly jar to HADOOP_HOME (/usr/lib/hadoop) in your bootstrap action.

AWS Lambda Console - Upgrade boto3 version

I am creating a DeepLens project to recognise people, when one of select group of people are scanned by the camera.
The project uses a lambda, which processes the images and triggers the 'rekognition' aws api.
When I trigger the API from my local machine - I get a good response
When I trigger the API from AWS console - I get failed response
Problem
After much digging, I found that the 'boto3' (AWS python library) is of version:
1.9.62 - on my local machine
1.8.9 - on AWS console
Question
Can I upgrade the 'boto3' library version on the AWS lambda console ?? If so, how ?
If you don't want to package a more recent boto3 version with you function, you can download boto3 with each invocation of the Lambda. Remember that /tmp/ is the directory that Lambda will allow you to download to, so you can use this to temporarily download boto3:
import sys
from pip._internal import main
main(['install', '-I', '-q', 'boto3', '--target', '/tmp/', '--no-cache-dir', '--disable-pip-version-check'])
sys.path.insert(0,'/tmp/')
import boto3
from botocore.exceptions import ClientError
def handler(event, context):
print(boto3.__version__)
You can achieve the same with either Python function with dependencies or with a Virtual Environment.
These are the available options other than that you also try to contact Amazon team if they can help you with up-gradation.
I know, you're asking for a solution through Console, but this is not possible (as of my knowledge).
To solve this you need to provide the boto3 version you require to your lambda (either with the solution from user1998671 or with what Shivang Agarwal is proposing). A third solution is to provide the required boto3 version as a layer for the lambda. The big advantage of the layer is that you can re-use it for all your lambdas.
This can be achieved by following the guide from AWS (the following is mainly copied from the linked guide from AWS):
IMPORTANT: Make sure to adjust boto3-mylayer with a for you suitable name.
Create a lib folder by running the following command:
LIB_DIR=boto3-mylayer/python
mkdir -p $LIB_DIR
Install the library to LIB_DIR by running the following command:
pip3 install boto3 -t $LIB_DIR
Zip all the dependencies to /tmp/boto3-mylayer.zip by running the following command:
cd boto3-mylayer
zip -r /tmp/boto3-mylayer.zip .
Publish the layer by running the following command:
aws lambda publish-layer-version --layer-name boto3-mylayer --zip-file fileb:///tmp/boto3-mylayer.zip
The command returns the new layer's Amazon Resource Name (ARN), similar to the following one:
arn:aws:lambda:region:$ACC_ID:layer:boto3-mylayer:1
To attach this layer to your lambda execute the following:
aws lambda update-function-configuration --function-name <name-of-your-lambda> --layers <layer ARN>
To verify the boto version in your lambda you can simply add the following two print commands in your lambda:
print(boto3.__version__)
print(botocore.__version__)