AWS Glue: get job_id from within the script using pyspark

I am trying to access the AWS ETL Glue job id from the script of that job. This is the RunID that you can see in the first column in the AWS Glue Console, something like jr_5fc6d4ecf0248150067f2. How do I get it programmatically with pyspark?

As documented at https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-get-resolved-options.html, it is passed to the Glue job as a command-line argument. You can access JOB_RUN_ID and the other default/reserved or custom job parameters with the getResolvedOptions() function.
import sys
from awsglue.utils import getResolvedOptions

# Reserved parameters such as JOB_RUN_ID are resolved automatically; recent
# versions of awsglue.utils require the options argument, so pass an empty list.
args = getResolvedOptions(sys.argv, [])
job_run_id = args['JOB_RUN_ID']
NOTE: JOB_RUN_ID is a default identity parameter, so it does not need to be included in the options list (the second argument to getResolvedOptions()) to read its value at runtime in a Glue job.

Alternatively, you can use the boto3 SDK for Python to access AWS services from the caller's side (for example, a Lambda function that starts the job): start_job_run() returns the run id as JobRunId.
import boto3

# Glue client; adjust the region (and endpoint, if you need one) for your account
glue = boto3.client('glue', region_name='us-east-2')

def lambda_handler(event, context):
    # Other Glue APIs are available from the same client, e.g.:
    # glue.start_crawler(Name='test_crawler')

    # Start the job and print the run id returned by the API
    my_new_job_run = glue.start_job_run(JobName='my_glue_job')  # your job name here
    print(my_new_job_run['JobRunId'])

Related

AWS Step Function action GetJob (AWS Glue) does not return CodeGenConfigurationNodes despite the documentation saying it should, what am I missing?

According to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-GetJob, the GetJob action (AWS Glue) should return CodeGenConfigurationNodes for a given JobName, but it only returns the Job object.
Am I missing an undocumented request parameter? Is there another way to get the CodeGenConfigurationNodes for an AWS Glue job in Step Functions?
[UPDATE]
OK, I tried using the get_job() Python function in a Lambda:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import json
import boto3

glue = boto3.client(service_name='glue', region_name='eu-west-2',
                    endpoint_url='https://glue.eu-west-2.amazonaws.com')

def lambda_handler(event, context):
    job_name = event.get('JobName')
    job_details = glue.get_job(JobName=job_name)
    return json.dumps(job_details, indent=4, sort_keys=True, default=str)
And that doesn't return CodeGenConfigurationNodes either, so is this a bug in the Python library?
CodeGenConfigurationNodes is only returned if the job is a Glue Studio job. Jobs that were not created in Glue Studio do not have CodeGenConfigurationNodes, because it represents the visual DAG.
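A minimal sketch of how you could guard for that in the Lambda above; the job name here is a placeholder:
import boto3

glue = boto3.client('glue', region_name='eu-west-2')

def lambda_handler(event, context):
    # CodeGenConfigurationNodes is only present for Glue Studio (visual) jobs,
    # so treat it as optional when reading the GetJob response.
    job = glue.get_job(JobName=event.get('JobName', 'my_visual_job'))['Job']
    nodes = job.get('CodeGenConfigurationNodes')
    if nodes is None:
        return {'visual': False, 'message': 'Not a Glue Studio job; no visual DAG'}
    return {'visual': True, 'nodeCount': len(nodes)}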

How to check the cluster status of Redshift using boto3 module in Python?

I need to check whether a Redshift cluster is paused or available using Python. I am aware of the boto3 module, which provides a describe_clusters() method, but I am not sure how to proceed from there to write a Python script.
You could try:
import boto3
import pandas as pd

DWH_CLUSTER_IDENTIFIER = 'Your clusterId'
KEY = 'Your AWS Key'
SECRET = 'Your AWS Secret'

# Get a client to connect to Redshift
redshift = boto3.client('redshift',
                        region_name="us-east-1",
                        aws_access_key_id=KEY,
                        aws_secret_access_key=SECRET)

# Get the cluster list as it appears in the AWS console
myClusters = redshift.describe_clusters()['Clusters']

if len(myClusters) > 0:
    df = pd.DataFrame(myClusters)
    myCluster = df[df.ClusterIdentifier == DWH_CLUSTER_IDENTIFIER.lower()]
    print("My cluster status is: {}".format(myCluster['ClusterAvailabilityStatus'].item()))
else:
    print('No clusters available')
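If you only need a single, known cluster, a minimal sketch without pandas (the cluster identifier is a placeholder and credentials come from the default chain here):
import boto3

redshift = boto3.client('redshift', region_name='us-east-1')

# describe_clusters accepts ClusterIdentifier to fetch just one cluster;
# it raises an error if that cluster does not exist.
cluster = redshift.describe_clusters(ClusterIdentifier='my-cluster')['Clusters'][0]
print(cluster['ClusterStatus'], cluster.get('ClusterAvailabilityStatus'))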

Transfering files into S3 buckets based on Glue job status

I am new to AWS Glue, and my aim is to extract, transform, and load files uploaded to an S3 bucket into an RDS instance. I also need to move the files into separate S3 buckets based on the Glue job status (success/failure). More than one file will be uploaded to the initial S3 bucket. How can I get the names of the uploaded files so that I can transfer them to the appropriate buckets?
Step 1: Upload files to S3 bucket1.
Step 2: Trigger lamda function to call Job1
Step 3: On success of job1 transfer file to S3 bucket2
Step 4: On failure transfer to another S3 bucket
Have a Lambda event trigger listening to the S3 folder you are uploading the files to. In the Lambda, use the AWS Glue API to run the Glue job (essentially a Python script in AWS Glue).
In the Glue Python script, use the appropriate library, such as pymysql, packaged as an external library with your script, and perform the data load operations from S3 into your RDS tables. If you are using Aurora MySQL, AWS provides a handy "load from S3" feature, so you can load the file directly into the tables (you may need to do some configuration in the parameter group / IAM roles).
Lambda script to call the Glue job:
import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

def lambda_handler(event, context):
    gluejobname = "<YOUR GLUE JOB NAME>"
    try:
        runId = glue.start_job_run(JobName=gluejobname)
        status = glue.get_job_run(JobName=gluejobname, RunId=runId['JobRunId'])
        print("Job Status : ", status['JobRun']['JobRunState'])
    except Exception as e:
        print(e)
        raise e
Glue Script:
import mysql.connector
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import *
from pyspark.sql.types import StringType
from pyspark.sql.types import DateType
from pyspark.sql.functions import unix_timestamp, from_unixtime
from pyspark.sql import SQLContext

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

url = "<RDS URL>"
uname = "<USER NAME>"
pwd = "<PASSWORD>"
dbase = "DBNAME"

def connect():
    conn = mysql.connector.connect(host=url, user=uname, password=pwd, database=dbase)
    cur = conn.cursor()
    return cur, conn

def create_stg_table():
    cur, conn = connect()
    createStgTable1 = "<CREATE STAGING TABLE IF REQUIRED>"  # placeholder for your staging-table DDL
    loadQry = "LOAD DATA FROM S3 PREFIX 'S3://PATH FOR YOUR CSV' REPLACE INTO TABLE <DB.TABLENAME> FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' IGNORE 1 LINES (@var1, @var2, @var3, @var4, @var5, @var6, @var7, @var8) SET ......;"
    cur.execute(createStgTable1)
    cur.execute(loadQry)
    conn.commit()
    conn.close()
You can then create a CloudWatch Events / EventBridge rule that watches the Glue job status and, depending on the status, performs the file copy operations between the S3 buckets. We have a similar setup in production.
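For example, a minimal sketch of such a Lambda, assuming it is triggered by an EventBridge rule on the Glue "Job State Change" event; the bucket names are placeholders:
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # The Glue "Job State Change" event carries the outcome in detail.state
    # (e.g. SUCCEEDED or FAILED).
    state = event['detail']['state']
    dest_bucket = 'bucket2' if state == 'SUCCEEDED' else 'bucket3-failures'

    # Copy everything from the landing bucket to the bucket chosen by the outcome.
    for obj in s3.list_objects_v2(Bucket='bucket1').get('Contents', []):
        s3.copy_object(Bucket=dest_bucket,
                       Key=obj['Key'],
                       CopySource={'Bucket': 'bucket1', 'Key': obj['Key']})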
Regards
Yuva

arguments error while calling an AWS Glue Pythonshell job from boto3

Based on the previous post, I have an AWS Glue Pythonshell job that needs to retrieve some information from the arguments that are passed to it through a boto3 call.
My Glue job name is test_metrics
The Glue Python shell code looks like this:
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['test_metrics',
                           's3_target_path_key',
                           's3_target_path_value'])

print("Target path key is: ", args['s3_target_path_key'])
print("Target Path value is: ", args['s3_target_path_value'])
The boto3 code that calls this job is below:
import boto3

glue = boto3.client('glue')
response = glue.start_job_run(
    JobName='test_metrics',
    Arguments={
        '--s3_target_path_key': 's3://my_target',
        '--s3_target_path_value': 's3://my_target_value'
    }
)
print(response)
I see a 200 response after I run the boto3 code on my local machine, but the Glue error log tells me:
test_metrics.py: error: the following arguments are required: --test_metrics
What am I missing?
Which job are you trying to launch, a Spark job or a Python shell job?
If it is a Spark job, JOB_NAME is a mandatory parameter; in a Python shell job, it is not needed at all.
So in your Python shell job, replace
args = getResolvedOptions(sys.argv,
                          ['test_metrics',
                           's3_target_path_key',
                           's3_target_path_value'])
with
args = getResolvedOptions(sys.argv,
                          ['s3_target_path_key',
                           's3_target_path_value'])
Seems like the documentation is a bit broken. I had to update the boto3 code as below to make it work:
glue = boto3.client('glue')
response = glue.start_job_run(
    JobName='test_metrics',
    Arguments={
        '--test_metrics': 'test_metrics',
        '--s3_target_path_key': 's3://my_target',
        '--s3_target_path_value': 's3://my_target_value'
    }
)
You can also get the Glue job name in a Python shell job directly from sys.argv.
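A minimal sketch of that idea, assuming the arguments arrive as plain '--key value' pairs on the command line:
import sys

# Print the raw command-line arguments Glue passes to the job; this shows
# exactly which flags (job name, run id, custom parameters) are present.
print(sys.argv)

# Naive pairing of '--key value' arguments into a dict for inspection.
raw_args = dict(zip(sys.argv[1::2], sys.argv[2::2]))
print(raw_args)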

returning JSON response from AWS Glue Pythonshell job to the boto3 caller

Is there a way to send a JSON response (a dictionary of outputs) from an AWS Glue Python shell job, similar to returning a JSON response from AWS Lambda?
I am calling a Glue pythonshell job like below:
response = glue.start_job_run(
    JobName='test_metrics',
    Arguments={
        '--test_metrics': 'test_metrics',
        '--s3_target_path_key': 's3://my_target',
        '--s3_target_path_value': 's3://my_target_value'
    }
)
print(response)
The response I get is a 200, which only tells me that the glue.start_job_run call succeeded. From the documentation, all I can see is that the result of a Glue job is written either to S3 or to some other database.
I tried adding return {'result': 'some_string'} at the end of my Glue Python shell job to test whether it works, with the code below.
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['JOB_NAME',
                           's3_target_path_key',
                           's3_target_path_value'])

print("Target path key is: ", args['s3_target_path_key'])
print("Target Path value is: ", args['s3_target_path_value'])

return {'result': "some_string"}
But it throws SyntaxError: 'return' outside function.
Glue is not made to return a response, because it is expected to run long-running operations. Blocking the caller while waiting for a long-running task is not the right approach in itself. Instead, you can use a launch job (service 1) -> execute job (service 2) -> get result (service 3) pattern: from service 2 (the job you execute) you send the JSON result to whichever service 3 you want to hand it to. For example, if you invoke a Lambda from the Glue job, you can send the JSON response to it.
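A minimal sketch of that last idea at the end of the Glue Python shell script; the Lambda name process_metrics_result is a placeholder:
import json
import boto3

lambda_client = boto3.client('lambda')

result = {'result': 'some_string'}

# Hand the result off to a downstream Lambda instead of 'returning' it;
# InvocationType='Event' makes the call asynchronous (fire-and-forget).
lambda_client.invoke(
    FunctionName='process_metrics_result',
    InvocationType='Event',
    Payload=json.dumps(result).encode('utf-8'),
)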