How to access run-property of AWS Glue workflow in Glue job? - amazon-web-services

I have been working with AWS Glue workflows for orchestrating batch jobs.
We need to pass a push-down predicate in order to limit the processing for the batch job.
When we run Glue jobs alone, we can pass push-down predicates as command line arguments at run time (i.e. aws glue start-job-run --job-name foo.scala --arguments --arg1-text ${arg1}..). But when we use a Glue workflow to execute Glue jobs, it is a bit unclear.
When we orchestrate batch jobs using AWS Glue workflows, we can add run properties while creating the workflow.
Can I use run properties to pass a push-down predicate to my Glue job?
If yes, how can I define the value of the run property (the push-down predicate) at run time? The reason I want to define the value at run time is that the predicate changes every day (i.e. run the glue-workflow for the past 10 days, past 20 days, past 2 days, etc.).
I tried:
aws glue start-workflow-run --name workflow-name | jq -r '.RunId '
aws glue put-workflow-run-properties --name workflow-name --run-id "ID" \
    --run-properties pushdownpredicate="some value"
I am able to see the run property I have passed when I read it back with:
aws glue get-workflow-run-properties --name workflow-name --run-id "ID"
But I am not able to detect "pushdownpredicate" in my Glue Job.
Any idea how to access workflow's run property in Glue Job?

If you are using Python as the programming language for your Glue job, then you can issue the get_workflow_run_properties API call to retrieve the properties and use them inside your Glue job.
# client = boto3.client('glue')
response = client.get_workflow_run_properties(
    Name='string',
    RunId='string'
)
This will give you the response below, which you can parse and use:
{
    'RunProperties': {
        'string': 'string'
    }
}
If you are using Scala, then you can use the equivalent AWS SDK call.
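For the original question's use case, here is a minimal sketch (my own assembly of the pieces above, not something stated in this answer) that reads a run property named pushdownpredicate and applies it when reading from the Data Catalog; my_db and my_table are placeholders:
import sys

import boto3
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# WORKFLOW_NAME and WORKFLOW_RUN_ID are injected automatically when the job
# is started by a workflow.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "WORKFLOW_NAME", "WORKFLOW_RUN_ID"])

run_properties = boto3.client("glue").get_workflow_run_properties(
    Name=args["WORKFLOW_NAME"], RunId=args["WORKFLOW_RUN_ID"]
)["RunProperties"]

# "pushdownpredicate" is the run property name used in the question.
predicate = run_properties.get("pushdownpredicate", "")

glue_context = GlueContext(SparkContext.getOrCreate())
# "my_db" and "my_table" are placeholders.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table",
    push_down_predicate=predicate,
)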

First of all, you need to be sure that the job is being run from a workflow:
import boto3
from typing import Dict

def get_worfklow_params(args: Dict[str, str]) -> Dict[str, str]:
    """
    get_worfklow_params is delegated to retrieve the WORKFLOW parameters
    """
    glue_client = boto3.client("glue")
    if "WORKFLOW_NAME" in args and "WORKFLOW_RUN_ID" in args:
        workflow_args = glue_client.get_workflow_run_properties(
            Name=args['WORKFLOW_NAME'], RunId=args['WORKFLOW_RUN_ID'])["RunProperties"]
        print("Found the following workflow args: \n{}".format(workflow_args))
        return workflow_args
    print("Unable to find run properties for this workflow!")
    return None
This method returns a map of the workflow input parameters.
Then you can use the following method in order to retrieve a given parameter:
def get_worfklow_param(args: Dict[str, str], arg: str) -> str:
    """
    get_worfklow_param is delegated to verify if the given parameter is present in the job and return it. If it is not present, None is returned.
    """
    if args is None:
        return None
    return args[arg] if arg in args else None
To reuse the code, in my opinion it is better to create a Python wheel (whl) module and set the module on the script path of your job (a minimal packaging sketch follows). This way, you can pull in the methods with a simple import.
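As a rough illustration of the wheel approach (the package name and the use of --extra-py-files are my assumptions, not something stated in this answer), the two helpers above could live in a small package:
# setup.py for a hypothetical package "glue_workflow_utils" containing the
# two helpers above; build it with "python setup.py bdist_wheel" and attach
# the resulting .whl to the job (for example via the --extra-py-files
# job parameter), then simply:
#
#   from glue_workflow_utils import get_worfklow_params, get_worfklow_param
#
from setuptools import setup, find_packages

setup(
    name="glue_workflow_utils",
    version="0.1.0",
    packages=find_packages(),
)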
Without the whl module, you can proceed in the following way:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    import sys
    import boto3
    from typing import Dict
    from awsglue.utils import getResolvedOptions

    def get_worfklow_params(args: Dict[str, str]) -> Dict[str, str]:
        """
        get_worfklow_params is delegated to retrieve the WORKFLOW parameters
        """
        glue_client = boto3.client("glue")
        if "WORKFLOW_NAME" in args and "WORKFLOW_RUN_ID" in args:
            workflow_args = glue_client.get_workflow_run_properties(
                Name=args['WORKFLOW_NAME'], RunId=args['WORKFLOW_RUN_ID'])["RunProperties"]
            print("Found the following workflow args: \n{}".format(workflow_args))
            return workflow_args
        print("Unable to find run properties for this workflow!")
        return None

    def get_worfklow_param(args: Dict[str, str], arg: str) -> str:
        """
        get_worfklow_param is delegated to verify if the given parameter is present in the job and return it. If it is not present, None is returned.
        """
        if args is None:
            return None
        return args[arg] if arg in args else None

    _args = getResolvedOptions(sys.argv, ['JOB_NAME', 'WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])
    worfklow_params = get_worfklow_params(_args)
    job_run_id = get_worfklow_param(_args, "WORKFLOW_RUN_ID")
    # Custom run properties come back in the workflow properties, not in sys.argv.
    my_parameter = get_worfklow_param(worfklow_params, "WORKFLOW_CUSTOM_PARAMETER")

If you run a Glue job via a workflow, then sys.argv (which is a list) will contain the parameters --WORKFLOW_NAME and --WORKFLOW_RUN_ID. You can use this fact to check whether a Glue job is being run by a workflow or not, and then retrieve the workflow runtime properties:
import sys

import boto3
from awsglue.utils import getResolvedOptions

glue_client = boto3.client("glue")

# Wrapped in a helper function here so that the return/raise below are valid
# outside of the larger script this snippet was taken from.
def get_workflow_run_properties():
    if '--WORKFLOW_NAME' in sys.argv and '--WORKFLOW_RUN_ID' in sys.argv:
        glue_args = getResolvedOptions(
            sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID']
        )
        workflow_args = glue_client.get_workflow_run_properties(
            Name=glue_args['WORKFLOW_NAME'], RunId=glue_args['WORKFLOW_RUN_ID']
        )["RunProperties"]
        return {**workflow_args}
    else:
        raise Exception("GlueJobNotStartedByWorkflow")
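A possible way to call the helper above from the job script (the property name pushdownpredicate and the fallback behaviour are my assumptions, matching the original question rather than this answer):
# Hypothetical usage of the helper above inside the job script.
try:
    run_properties = get_workflow_run_properties()
    predicate = run_properties.get("pushdownpredicate", "")
except Exception:
    # Job was started directly (not by a workflow); fall back to no predicate.
    predicate = ""

print("Using push-down predicate: {}".format(predicate))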

Related

Pass output from one workflow step to another in GCP

I am trying to orchestrate a GCP workflow to first run a query in BigQuery to get some metadata (name & id) that would then be passed to another step in the workflow that starts a Dataflow job with those parameters as input.
So step by step I want something like:
Result = Query("SELECT ID & name from bigquery table")
Start dataflow job: Input(result)
Is this possible, or is there a better solution?
I propose two solutions and I hope they can help.
- Solution 1 :
If you have an orchestrator like Airflow in Cloud Composer:
Use a task with a BigQueryInsertJobOperator in Airflow; this operator allows you to execute a query against BigQuery
Pass the result to a second operator via XCom
The second operator is an operator that extends BeamRunPythonPipelineOperator
When you extend BeamRunPythonPipelineOperator, you override the execute method. In this method, you can recover the data from the previous operator via an XCom pull, as a Dict
Pass this Dict as pipeline options to your operator that extends BeamRunPythonPipelineOperator
The BeamRunPythonPipelineOperator will launch your Dataflow job
An example of an operator with an execute method:
class CustomBeamOperator(BeamRunPythonPipelineOperator):

    def __init__(
            self,
            your_field,
            ...
            **kwargs) -> None:
        super().__init__(**kwargs)
        self.your_field = your_field
        ...

    def execute(self, context):
        task_instance = context['task_instance']
        your_conf_from_bq = task_instance.xcom_pull('task_id_previous_operator')

        operator = BeamRunPythonPipelineOperator(
            runner='DataflowRunner',
            py_file='your_dataflow_main_file.py',
            task_id='launch_dataflow_job',
            pipeline_options=your_conf_from_bq,
            py_system_site_packages=False,
            py_interpreter='python3',
            dataflow_config=DataflowConfiguration(
                location='your_region'
            )
        )

        operator.execute(context)
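For context, a rough sketch of how such an operator might be wired into a DAG. This is my own simplification: it uses a plain PythonOperator with the BigQuery client to push the query result to XCom instead of the BigQueryInsertJobOperator described above, and it assumes the CustomBeamOperator class above with its ellipses filled in; the DAG id, task ids, query, and project/table names are placeholders:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import bigquery


def fetch_conf_from_bq():
    # Runs the metadata query and returns a dict, which Airflow stores as XCom.
    client = bigquery.Client()
    row = list(client.query(
        "SELECT id, name FROM `your_project.your_dataset.your_table` LIMIT 1"
    ).result())[0]
    return {"id": row["id"], "name": row["name"]}


with DAG("bq_to_dataflow", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    query_task = PythonOperator(
        task_id="task_id_previous_operator",  # the id the custom operator pulls from
        python_callable=fetch_conf_from_bq,
    )

    launch_task = CustomBeamOperator(
        task_id="launch_custom_beam_job",
        your_field="example",
        py_file="your_dataflow_main_file.py",
    )

    query_task >> launch_task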
- Solution 2 :
If you don't have an orchestrator like Airflow:
You can use the same virtual env that launches your actual Dataflow job, but add the Python BigQuery client as a package: https://cloud.google.com/bigquery/docs/reference/libraries
Create a main Python file that retrieves your conf from the BigQuery table as a Dict via the BigQuery client
Generate with Python the command line that launches your Dataflow job with the conf previously retrieved from the database, for example:
python -m folder.your_main_file \
    --runner=DataflowRunner \
    --conf1=conf1/ \
    --conf2=conf2
    ....
    --setup_file=./your_setup.py \
Launch the previous command with a Python subprocess (see the sketch after this list)
You can also maybe try this API to launch the Dataflow job: https://pypi.org/project/google-cloud-dataflow-client/
I haven't tried it.
I think the solution with Airflow is easier.
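A minimal sketch of Solution 2 (the project/table, column names, and pipeline flags are placeholders I made up, not part of the answer):
import subprocess

from google.cloud import bigquery


def main():
    # Retrieve the configuration for the Dataflow job from a BigQuery table.
    client = bigquery.Client()
    row = list(client.query(
        "SELECT id, name FROM `your_project.your_dataset.your_conf_table` LIMIT 1"
    ).result())[0]

    # Build the command line for the Dataflow job from the retrieved conf.
    cmd = [
        "python", "-m", "folder.your_main_file",
        "--runner=DataflowRunner",
        "--conf1={}".format(row["id"]),
        "--conf2={}".format(row["name"]),
        "--setup_file=./your_setup.py",
    ]

    # Launch the Dataflow job with a subprocess and fail loudly on error.
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    main()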

arguments error while calling an AWS Glue Pythonshell job from boto3

Based on the previous post, I have an AWS Glue Pythonshell job that needs to retrieve some information from the arguments that are passed to it through a boto3 call.
My Glue job name is test_metrics
The Glue Pythonshell code looks like the following:
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['test_metrics',
                           's3_target_path_key',
                           's3_target_path_value'])
print("Target path key is: ", args['s3_target_path_key'])
print("Target Path value is: ", args['s3_target_path_value'])
The boto3 code that calls this job is below:
import boto3

glue = boto3.client('glue')
response = glue.start_job_run(
    JobName='test_metrics',
    Arguments={
        '--s3_target_path_key': 's3://my_target',
        '--s3_target_path_value': 's3://my_target_value'
    }
)
print(response)
I see a 200 response after I run the boto3 code on my local machine, but the Glue error log tells me:
test_metrics.py: error: the following arguments are required: --test_metrics
What am I missing?
Which job are you trying to launch, a Spark job or a Python shell job?
If it is a Spark job, JOB_NAME is a mandatory parameter. In a Python shell job, it is not needed at all.
So in your Python shell job, replace
args = getResolvedOptions(sys.argv,
                          ['test_metrics',
                           's3_target_path_key',
                           's3_target_path_value'])
with
args = getResolvedOptions(sys.argv,
                          ['s3_target_path_key',
                           's3_target_path_value'])
Seems like the documentation is kinda broken.
I had to update the boto3 code like below to make it work:
import boto3

glue = boto3.client('glue')
response = glue.start_job_run(
    JobName='test_metrics',
    Arguments={
        '--test_metrics': 'test_metrics',
        '--s3_target_path_key': 's3://my_target',
        '--s3_target_path_value': 's3://my_target_value'}
)
We can get the Glue job name in a Python shell job from sys.argv.
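For example, a quick way to see what is actually being passed (this inspection snippet is mine, not part of the answer):
import sys

# sys.argv[0] is the script path Glue invokes; the remaining entries are the
# --key value pairs passed to the run, so printing it shows what is available.
print(sys.argv)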

returning JSON response from AWS Glue Pythonshell job to the boto3 caller

Is there a way to send a JSON response (of a dictionary of outputs) from an AWS Glue Pythonshell job, similar to returning a JSON response from AWS Lambda?
I am calling a Glue Pythonshell job like below:
response = glue.start_job_run(
    JobName='test_metrics',
    Arguments={
        '--test_metrics': 'test_metrics',
        '--s3_target_path_key': 's3://my_target',
        '--s3_target_path_value': 's3://my_target_value'}
)
print(response)
The response I get is a 200, stating that the start_job_run call was a success. From the documentation, all I see is that the result of a Glue job has to be written to S3 or some other data store.
I tried adding return {'result':'some_string'} at the end of my Glue Pythonshell job to test whether it works or not, with the code below:
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['JOB_NAME',
                           's3_target_path_key',
                           's3_target_path_value'])
print("Target path key is: ", args['s3_target_path_key'])
print("Target Path value is: ", args['s3_target_path_value'])
return {'result':"some_string"}
But it throws the error SyntaxError: 'return' outside function.
Glue is not made to return a response, as it is expected to run long-running operations. Blocking on a response for a long-running task is not the right approach in itself. Instead, you may use a launch job (service 1) -> execute job (service 2) -> get result (service 3) pattern. You can send a JSON response from AWS service 2 (execute job) to the AWS service 3 you want to launch; e.g. if you launch a Lambda from the Glue job, you can send a JSON response to it.
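A related pattern that often works (this is my own sketch, not the exact service-1/2/3 flow described above, and the bucket/key names are placeholders): the Pythonshell job writes its JSON result to S3, and the caller polls the run state and then reads the object back.
import json
import time

import boto3

glue = boto3.client('glue')
s3 = boto3.client('s3')

# Inside the Glue Pythonshell job, instead of "return":
#   s3.put_object(Bucket='my-bucket', Key='results/test_metrics.json',
#                 Body=json.dumps({'result': 'some_string'}))

# On the caller side: start the job, wait for it to finish, then fetch the result.
run_id = glue.start_job_run(JobName='test_metrics')['JobRunId']
while glue.get_job_run(JobName='test_metrics', RunId=run_id)['JobRun']['JobRunState'] in ('STARTING', 'RUNNING', 'STOPPING'):
    time.sleep(30)

body = s3.get_object(Bucket='my-bucket', Key='results/test_metrics.json')['Body'].read()
print(json.loads(body))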

AWS Glue: get job_id from within the script using pyspark

I am trying to access the AWS ETL Glue job id from the script of that job. This is the RunID that you can see in the first column in the AWS Glue Console, something like jr_5fc6d4ecf0248150067f2. How do I get it programmatically with pyspark?
As documented in https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-get-resolved-options.html, it is passed in as a command line argument to the Glue job. You can access the JOB_RUN_ID and other default/reserved or custom job parameters using the getResolvedOptions() function.
import sys
from awsglue.utils import getResolvedOptions
args = getResolvedOptions(sys.argv)
job_run_id = args['JOB_RUN_ID']
NOTE: JOB_RUN_ID is a default identity parameter, we don't need to include it as part of options (the second argument to getResolvedOptions()) for getting its value during runtime in a Glue Job.
You can use the boto3 SDK for Python to access the AWS services:
import boto3

def lambda_handler(event, context):
    client = boto3.client('glue')
    client.start_crawler(Name='test_crawler')

    glue = boto3.client(service_name='glue', region_name='us-east-2',
                        endpoint_url='https://glue.us-east-2.amazonaws.com')

    # myJob is assumed to come from an earlier create_job/get_job call.
    myNewJobRun = glue.start_job_run(JobName=myJob['Name'])
    print(myNewJobRun['JobRunId'])

Wait until a Jenkins build is complete

I am using Python 2.7 and Jenkins.
I am writing some code in Python that will perform a check-in and wait/poll for the Jenkins job to complete. I would like some thoughts on how to achieve it.
Python function to create a check-in in Perforce -> this can be easily done as P4 has a CLI
Python code to detect when a build gets triggered -> I have the changelist and the job number. How do I poll the Jenkins API for the build log to check whether it has the appropriate changelists? The output of this step is a build URL which is carrying out the job.
How do I wait till the Jenkins job is complete?
Can I use snippets from the Jenkins REST API or from the Python Jenkins module?
If you need to know if the job is finished, the buildNumber and buildTimestamp are not enough.
This is the gist of how I find out if a job is complete, I have it in ruby but not python so perhaps someone could update this into real code.
lastBuild = get jenkins/job/myJob/lastBuild/buildNumber
get jenkins/job/myJob/build?token=gogogo            # trigger the build
currentBuild = get jenkins/job/myJob/lastBuild/buildNumber
while currentBuild == lastBuild                     # wait for a new build to start
    sleep 1
    currentBuild = get jenkins/job/myJob/lastBuild/buildNumber
buildInfo = get jenkins/job/myJob/[currentBuild]/api/xml?depth=0
while buildInfo["freeStyleBuild/building"] == true  # wait for that build to finish
    buildInfo = get jenkins/job/myJob/[currentBuild]/api/xml?depth=0
    sleep 1
i.e. I found I needed to A) wait until the build starts (new build number) and B) wait until the build finishes (building is false).
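Since the answer above asks for a Python translation, here is a rough equivalent using the requests library against the same Jenkins endpoints (the server URL, credentials, and token are placeholders; this is my sketch, not the original author's code):
import time

import requests

base = "https://jenkins.example.com/job/myJob"
auth = ("user", "api_token")  # placeholder credentials


def last_build_number():
    # /lastBuild/buildNumber returns the number as plain text.
    return int(requests.get(f"{base}/lastBuild/buildNumber", auth=auth).text)


previous = last_build_number()
requests.post(f"{base}/build?token=gogogo", auth=auth)  # trigger the build

# A) wait until a new build number appears
current = previous
while current == previous:
    time.sleep(1)
    current = last_build_number()

# B) wait until that build reports building == false
while requests.get(f"{base}/{current}/api/json?depth=0", auth=auth).json()["building"]:
    time.sleep(1)

print("Build", current, "finished")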
You can query the last build timestamp to determine if the build finished. Compare it to what it was just before you triggered the build, and see when it changes. To get the timestamp, add /lastBuild/buildTimestamp to your job URL.
As a matter of fact, in your Jenkins, add /lastBuild/api/ to any job and you will see a lot of API information. It even has a Python API, but I'm not familiar with that so I can't help you further.
However, if you are using XML, you can add /lastBuild/api/xml?depth=0 and inside the XML you can see the <changeSet> object with the list of revisions/commit messages that triggered the build.
Simple solution using invoke and block_until_complete methods (tested with Python 3.7)
import jenkinsapi
from jenkinsapi.jenkins import Jenkins
...
server = Jenkins(jenkinsUrl, username=jenkinsUser,
                 password=jenkinsToken, ssl_verify=sslVerifyFlag)
job = server.create_job(jobName, None)
queue = job.invoke()
queue.block_until_complete()
Inspired by a test method in pycontribs.
This snippet starts a build job and waits until the job is done.
It is easy to start the job, but we need some kind of logic to know when the job is done. First we need to wait for the job ID to be applied, and then we can query the job for details:
import time

from jenkinsapi import jenkins

server = jenkins.Jenkins(jenkinsurl, username=username, password='******')
job = server.get_job(j_name)
prev_id = job.get_last_buildnumber()
server.build_job(j_name)

while True:
    print('Waiting for build to start...')
    if prev_id != job.get_last_buildnumber():
        break
    time.sleep(3)

print('Running...')
last_build = job.get_last_build()
while last_build.is_running():
    time.sleep(1)

print(str(last_build.get_status()))
Don't know if this was available at the time of the question, but the jenkinsapi module's Job.invoke() and/or Jenkins.build_job() return a QueueItem object, which can block_until_building() or block_until_complete():
jobq = server.build_job(job_name, job_params)
jobq.block_until_building()
print("Job %s (%s) is building." % (jobq.get_job_name(), jobq.get_build_number()))
jobq.block_until_complete(5) # check every 5s instead of the default 15
print("Job complete, %s" % jobq.get_build().get_status())
Was going through the same problem and this worked for me, using python3 and python-jenkins.
while "".join([d['color'] for d in j.get_jobs() if d['name'] == "job_name"]) == 'blue_anime':
print('Job is Running')
time.sleep(1)
print('Job Over!!')
This is working for me
#!/usr/bin/env python
import jenkins
import time

server = jenkins.Jenkins('https://jenkinsurl/', username='xxxxx', password='xxxxxx')
j_name = 'test'
server.build_job(j_name, {'testparam1': 'test', 'testparam2': 'test'})

while True:
    print('Running....')
    if server.get_job_info(j_name)['lastCompletedBuild']['number'] == server.get_job_info(j_name)['lastBuild']['number']:
        print("Last ID %s, Current ID %s" % (server.get_job_info(j_name)['lastCompletedBuild']['number'],
                                             server.get_job_info(j_name)['lastBuild']['number']))
        break
    time.sleep(3)

print('Stop....')
console_output = server.get_build_console_output(j_name, server.get_job_info(j_name)['lastBuild']['number'])
print(console_output)
The main issue is that build_job doesn't return the number of the job; it returns the number of a queue item (which only lasts 5 minutes). So the trick is:
build_job
get the queue number
with the queue number, get the job_number
now we know the name of the job and the job number
get_job_info and loop over the builds till we find one with our job number
check the status
So I made a function for it with a time_out:
import time
from datetime import datetime, timedelta

import jenkins


def launch_job(jenkins_connection, job_name, parameters={}, wait=False, interval=30, time_out=7200):
    """
    Create a jenkins job and wait for the job to finish
    :param jenkins_connection: jenkins server                              jenkins object
    :param job_name: the name of the job we want to create and watch       string
    :param parameters: the parameters of the job to build                  dict
    :param wait: if we want to wait for the job to finish or not           bool
    :param interval: how often we want to monitor (seconds)                int
    :param time_out: break the loop after X seconds                        int
    :return: build job number                                              int
    """
    # we launch the job and it returns a queue_id
    job_id = jenkins_connection.build_job(job_name, parameters)

    # from the queue_id we get the job number that was created
    # (the "executable" key appears once the build has left the queue)
    queue_job = jenkins_connection.get_queue_item(job_id, depth=0)
    build_number = queue_job["executable"]["number"]
    print(f"job_name: {job_name} build_number: {build_number}")

    if wait is True:
        now = datetime.now()
        later = now + timedelta(seconds=time_out)
        while True:
            # we check current time vs the timeout (later)
            if datetime.now() > later:
                raise ValueError(f"Job: {job_name}:{build_number} has been running for more than {time_out}s; we "
                                 f"stop monitoring the job, you can check it in Jenkins")

            b = jenkins_connection.get_job_info(job_name, depth=1, fetch_all_builds=False)
            for i in b["builds"]:
                loop_id = i["id"]
                if int(loop_id) == build_number:
                    result = i["result"]
                    print(f"result: {result}")  # in the json it looks like null until the build finishes
                    if result is not None:
                        return i
            time.sleep(interval)

    return build_number
So we ask Jenkins to build the job -> get the queue # -> get the job # -> loop over the build info and check the status till it changes from None to something else.
If it works, it will return the dictionary with the information of that job. (I hope the jenkins library implements something like this one day.)
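For example, a possible way to call it (the server URL, credentials, job name, and parameters are placeholders of mine):
import jenkins

# Placeholder connection details.
server = jenkins.Jenkins('https://jenkinsurl/', username='xxxxx', password='xxxxxx')

# Start the job and block until it finishes (or the 1-hour timeout hits).
build = launch_job(server, 'test', parameters={'testparam1': 'test'},
                   wait=True, interval=30, time_out=3600)
print(build["result"] if isinstance(build, dict) else build)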