invoke glue job from another glue job - amazon-web-services

I have two Glue jobs, created from the AWS console. I would like to invoke one Glue job (Python) from another Glue (Python) job, with parameters. What would be the best approach to do this? I appreciate your help.

You can use Glue workflows and set up workflow run properties (parameters), as mentioned by Bob Haffner, then trigger the Glue jobs from the workflow. The advantage is that if the second Glue job fails for any reason, you can resume / rerun only the second job after fixing the issue. Workflow run properties can also be used to pass values from one Glue job to the other. Sample code for reading and writing workflow run properties:
In the first Glue job:
import sys
import boto3
from awsglue.utils import getResolvedOptions

glue_client = boto3.client('glue')

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])
workflow_name = args['WORKFLOW_NAME']
workflow_run_id = args['WORKFLOW_RUN_ID']

# Read the current run properties, add the values this job computed, then write them back
workflow_params = glue_client.get_workflow_run_properties(Name=workflow_name, RunId=workflow_run_id)["RunProperties"]
workflow_params['param1'] = param_value1
workflow_params['param2'] = param_value2
workflow_params['param3'] = param_value3
workflow_params['param4'] = param_value4
glue_client.put_workflow_run_properties(Name=workflow_name, RunId=workflow_run_id, RunProperties=workflow_params)
and in the second Glue job:
import sys
import boto3
from awsglue.utils import getResolvedOptions

glue_client = boto3.client('glue')

args = getResolvedOptions(sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])
workflow_name = args['WORKFLOW_NAME']
workflow_run_id = args['WORKFLOW_RUN_ID']

# Read back the properties written by the first job
workflow_params = glue_client.get_workflow_run_properties(Name=workflow_name, RunId=workflow_run_id)["RunProperties"]
param_value1 = workflow_params['param1']
param_value2 = workflow_params['param2']
param_value3 = workflow_params['param3']
param_value4 = workflow_params['param4']
For how to set up a Glue workflow, refer to:
https://docs.aws.amazon.com/glue/latest/dg/creating_running_workflows.html
https://medium.com/@pioneer21st/orchestrating-etl-jobs-in-aws-glue-using-workflow-758ef10b8434
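If you prefer to wire this up programmatically rather than through the console, a minimal boto3 sketch might look like the following (the workflow, trigger and job names here are placeholders, not from the original question):
import boto3

glue_client = boto3.client('glue')

# Workflow that will carry the run properties shared by both jobs
glue_client.create_workflow(Name='my-workflow')

# On-demand trigger that starts the first job
glue_client.create_trigger(
    Name='start-first-job',
    WorkflowName='my-workflow',
    Type='ON_DEMAND',
    Actions=[{'JobName': 'first-glue-job'}]
)

# Conditional trigger that starts the second job only after the first succeeds
glue_client.create_trigger(
    Name='start-second-job',
    WorkflowName='my-workflow',
    Type='CONDITIONAL',
    StartOnCreation=True,
    Predicate={'Conditions': [{
        'LogicalOperator': 'EQUALS',
        'JobName': 'first-glue-job',
        'State': 'SUCCEEDED'
    }]},
    Actions=[{'JobName': 'second-glue-job'}]
)

# Kick off the whole chain
glue_client.start_workflow_run(Name='my-workflow')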

Related

aws_glue_trigger in terraform creates invalid expression schedule in aws

I am trying to create an AWS Glue job trigger in Terraform, based on the condition that a crawler (itself triggered by cron) has succeeded:
resource "aws_glue_trigger" "trigger" {
name = "trigger"
type = "CONDITIONAL"
actions {
job_name = aws_glue_job.job.name
}
predicate {
conditions {
crawler_name = aws_glue_crawler.crawler.name
crawl_state = "SUCCEEDED"
}
}
}
It applies cleanly, but in the job's Schedules property I get a row with
Invalid expression in the Cron column while the status is Activated. Of course it won't trigger because of that. What am I missing here?
I am not sure if I understood the question correctly, but this is my Glue trigger configuration, which is meant to run at a scheduled time, and it does fire at that time.
resource "aws_glue_trigger" "tr_one" {
name = "tr_one"
schedule = var.wf_schedule_time
type = "SCHEDULED"
workflow_name = aws_glue_workflow.my_workflow.name
actions {
job_name = var.my_glue_job_1
}
}
// Specify the schedule time in UTC format to run the Glue workflows
wf_schedule_time = "cron(56 09 * * ? *)"
Please note that the schedule must be given in UTC.
I had the same problem. Unfortunately I did not find an easy way to solve the 'invalid expression' using aws_glue_trigger alone, but I found a workaround using Glue workflows that achieves the same goal (triggering a Glue job after a crawler succeeds). I am not quite sure it is the best way to do it.
First I created a Glue workflow:
resource "aws_glue_workflow" "my_workflow" {
name = "my-workflow"
}
Then I created a scheduled trigger for my crawler (and removed the schedule from the Glue crawler it references):
resource "aws_glue_trigger" "crawler_scheduler" {
name = "crawler-trigger"
workflow_name = "my-workflow"
type = "SCHEDULED"
schedule = "cron(15 12 * * ? *)"
actions {
crawler_name = "my-crawler"
}
}
Lastly I created the final trigger for my Glue job, which should run after the crawler succeeds. The important part is that both triggers are linked to the same workflow, which effectively links the crawler and the job.
resource "aws_glue_trigger" "job_trigger" {
name = "${each.value.s3_bucket_id}-ndjson_to_parquet-trigger"
type = "CONDITIONAL"
workflow_name = "my-workflow"
predicate {
conditions {
crawler_name = "my-crawler"
crawl_state = "SUCCEEDED"
}
}
actions {
job_name = "my-job"
}
}
The Glue job still shows the 'invalid expression' message under the schedule label, but this time the job is successfully triggered once the scheduled trigger runs. In addition, you also get a visualization of the run in the Glue workflows view.
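If you want to exercise the chain without waiting for the cron schedule, you can also start a workflow run manually; a small boto3 sketch, assuming the workflow name used above:
import boto3

# Start a run of the workflow defined above ("my-workflow") on demand,
# instead of waiting for the scheduled trigger to fire.
glue = boto3.client("glue")
run_id = glue.start_workflow_run(Name="my-workflow")["RunId"]
print("Started workflow run:", run_id)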

Moto to launch aws glue jobs

I would like to test a glue job with Moto ( https://docs.getmoto.org/en/latest/docs/services/glue.html ).
So I first start by creating the glue job:
import unittest
import boto3
from moto import mock_glue, mock_s3

@mock_glue
@mock_s3
class TestStringMethods(unittest.TestCase):
    ...
    ...
    self.s3_client.upload_fileobj(open("etl.py", "rb"), self.s3_bucket_name, "etl.py")
    self.glue_client = boto3.client('glue')
    self.glue_client.create_job(
        Name="Test Monitoring Job",
        Role="test_role",
        Command=dict(Name="glueetl", ScriptLocation=f"s3://{self.s3_bucket_name}/etl.py"),
        GlueVersion='2.0',
        NumberOfWorkers=1,
        WorkerType='G.1X'
    )
    # Job is created with correct confs
    assert (job["Name"] == job_name)
    assert (job["GlueVersion"] == "2.0")
Then I proceed to launch it:
job_run_response = self.glue_client.start_job_run(
    JobName=job_name,
    Arguments={...}
)
and get the job run:
response = self.glue_client.get_job_run(
    JobName=job_name,
    RunId=job_run_response['JobRunId']
)
However, at this point I find that the configuration of the job I just launched is not the same as what I defined in create_job.
Look, for example, at the Glue version I asserted on earlier.
print(response)
{'JobRun': {'Id': '01',... 'Arguments': {'runSpark': 'spark -f test_file.py'}, 'ErrorMessage': '', 'PredecessorRuns': [{'JobName': 'string', 'RunId': 'string'}] ... 'GlueVersion': '0.9' }
There is no evidence that my code actually ran; apart from the job name, it looks like some standard default.
My questions basically are:
Do you have experience with Moto supporting this functionality?
If yes, can you spot if something is off in this code?

How to read and parse data from PubSub topic into a beam pipeline and print it

I have a program which creates a topic in Pub/Sub and publishes messages to it. I also have an automated Dataflow job (created from a template) which saves these messages into my BigQuery table. Now I intend to replace the template-based job with a Python pipeline, where my requirement is to read data from Pub/Sub, apply transformations, and save the data into BigQuery / publish it to another Pub/Sub topic. I started writing the script in Python and did a lot of trial and error, but to my dismay I could not get it to work. The code looks like this:
import apache_beam as beam
from apache_beam.io import WriteToText

TOPIC_PATH = "projects/test-pipeline-253103/topics/test-pipeline-topic"
OUTPUT_PATH = "projects/test-pipeline-253103/topics/topic-repub"

def run():
    o = beam.options.pipeline_options.PipelineOptions()
    p = beam.Pipeline(options=o)
    print("I reached here")
    # Read from PubSub into a PCollection.
    data = (
        p
        | "Read From Pub/Sub" >> beam.io.ReadFromPubSub(topic=TOPIC_PATH)
    )
    data | beam.io.WriteToPubSub(topic=OUTPUT_PATH)
    print("Lines: ", data)

run()
I would really appreciate any help with this.
Note: I have my project set up on Google Cloud and I am running the script locally.
Here is the working code.
import apache_beam as beam

TOPIC_PATH = "projects/test-pipeline-253103/topics/test-pipeline-topic"
OUTPUT_PATH = "projects/test-pipeline-253103/topics/topic-repub"

class PrintValue(beam.DoFn):
    def process(self, element):
        print(element)
        return [element]

def run():
    o = beam.options.pipeline_options.PipelineOptions()
    # Replace this with the --streaming execution param
    standard_options = o.view_as(beam.options.pipeline_options.StandardOptions)
    standard_options.streaming = True
    p = beam.Pipeline(options=o)
    print("I reached here")
    # Read from PubSub into a PCollection.
    data = p | beam.io.ReadFromPubSub(topic=TOPIC_PATH) | beam.ParDo(PrintValue()) | beam.io.WriteToPubSub(topic=OUTPUT_PATH)
    # Don't forget to run the pipeline!
    result = p.run()
    result.wait_until_finish()

run()
In summary:
You forgot to run the pipeline. Beam is a graph programming model: in your previous code you built the graph but never ran it. Here, at the end, we run it (a non-blocking call) and then wait for it to finish (a blocking call).
When you start your pipeline, Beam mentions that Pub/Sub only works in streaming mode. So you can either start your code with the --streaming param, or set it programmatically as shown in my code.
Be careful: streaming mode means listening indefinitely on Pub/Sub. If you run this on Dataflow, your pipeline stays up until you stop it, which can get expensive if you have few messages. Make sure that is the model you want.
An alternative is to run your pipeline for a limited period of time (use a scheduler to start it and another to stop it). But in that case the messages have to accumulate somewhere. Here you use a topic as the entry of the pipeline, which forces Beam to create a temporary subscription and listen for messages on it. Messages published before that subscription was created will not be received or processed.
The idea is instead to create a subscription yourself; that way the messages are retained in it (up to 7 days by default). Then use the subscription name as the entry of your pipeline: beam.io.ReadFromPubSub(subscription=SUB_PATH). The messages will be pulled from it and processed by Beam (ordering is not guaranteed!).
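A minimal sketch of that variant, building on the working code above (the subscription path here is hypothetical):
# Sketch: read from a pre-created subscription instead of the topic, so that
# messages published before the pipeline starts are retained (SUB_PATH is hypothetical).
SUB_PATH = "projects/test-pipeline-253103/subscriptions/test-pipeline-sub"

data = (
    p
    | beam.io.ReadFromPubSub(subscription=SUB_PATH)
    | beam.ParDo(PrintValue())
    | beam.io.WriteToPubSub(topic=OUTPUT_PATH)
)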
Based on the Beam programming guide, you simply have to add a transform step to your pipeline. Here is an example of a transform:
class PrintValue(beam.DoFn):
    def process(self, element):
        print(element)
        return [element]
Add it to your pipeline:
data | beam.ParDo(PrintValue()) | beam.io.WriteToPubSub(topic=OUTPUT_PATH)
You can add as many transforms as you want. You can inspect values and emit elements into tagged PCollections (to produce multiple outputs) for fan-out, or use side inputs for fan-in.
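As an illustration of the tagged-output idea (the DoFn and tag names below are made up for the example, not from the original code):
import apache_beam as beam
from apache_beam import pvalue

class SplitBySize(beam.DoFn):
    # Route short messages to a tagged output, everything else to the main output.
    def process(self, element):
        if len(element) < 100:
            yield pvalue.TaggedOutput("short", element)
        else:
            yield element

results = data | beam.ParDo(SplitBySize()).with_outputs("short", main="long")
short_messages = results.short   # fan-out: one PCollection per tag
long_messages = results.long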

Giving input_file_path argument to glue from Glue console

I want to pass an S3 filename as an input_file_path argument to a job that I execute from the Glue console. Is there any way to provide the input_file_path argument from the AWS Glue console?
If you want to create a job parameter via the Glue console (UI), you first have to define the argument on the job:
unfold the Script libraries and job parameters section
enter your parameter as a key/value pair
Read it in your code (Scala example):
import com.amazonaws.services.glue.util.GlueArgParser

def main(sysArgs: Array[String]) {
  val args = GlueArgParser.getResolvedOptions(sysArgs, Array("input_file_path"))
  print(s"Input path: ${args("input_file_path")}")
}
Read it in your code (Python):
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['input_file_path'])
print(args['input_file_path'])
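If you start the job programmatically rather than from the console, the same argument can also be passed at run time; a minimal boto3 sketch, where the job name and S3 path are placeholders:
import boto3

glue = boto3.client('glue')
glue.start_job_run(
    JobName='my-glue-job',  # placeholder job name
    Arguments={'--input_file_path': 's3://my-bucket/path/to/input.csv'}  # placeholder path
)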

How to delete / drop multiple tables in AWS athena?

I am trying to drop a few tables from Athena and I cannot run multiple DROP queries at the same time. Is there a way to do it?
Thanks!
You are correct. It is not possible to run multiple queries in one request.
An alternative is to create the tables in a dedicated database. Dropping that database will then cause all of its tables to be deleted.
For example:
CREATE DATABASE foo;
CREATE EXTERNAL TABLE bar1 ...;
CREATE EXTERNAL TABLE bar2 ...;
DROP DATABASE foo CASCADE;
The DROP DATABASE command will delete the bar1 and bar2 tables.
You can use the AWS CLI batch-delete-table command to delete multiple tables at once.
aws glue batch-delete-table \
--database-name <database-name> \
--tables-to-delete "<table1-name>" "<table2-name>" "<table3-name>" ...
You can use the AWS Glue interface to do this now. The prerequisite is that you must upgrade to the AWS Glue Data Catalog.
If you upgrade to the AWS Glue Data Catalog from Athena, the metadata for tables created in Athena is visible in Glue, and you can use the AWS Glue UI to check multiple tables and delete them at once.
FAQ on upgrading the data catalog: https://docs.aws.amazon.com/athena/latest/ug/glue-faq.html
You could write a shell script to do this for you:
for table in products customers stores; do
  aws athena start-query-execution --query-string "drop table $table" --result-configuration OutputLocation=s3://my-output-result-bucket
done
Use AWS Glue's Python shell and invoke this function:
import boto3

def run_query(query, database, s3_output):
    client = boto3.client('athena')
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': database
        },
        ResultConfiguration={
            'OutputLocation': s3_output,
        }
    )
    print('Execution ID: ' + response['QueryExecutionId'])
    return response
Athena configuration:
s3_input = 's3://athena-how-to/data'
s3_output = 's3://athena-how-to/results/'
database = 'your_database'
table = 'tableToDelete'
query_1 = "drop table %s.%s;" % (database, table)

queries = [query_1]
# queries = [create_database, create_table, query_1, query_2]
for q in queries:
    print("Executing query: %s" % (q))
    res = run_query(q, database, s3_output)
@Vidy
I would second what @Prateek said. Please provide an example of your code. Also, please tag your post with the language/shell that you're using to interact with AWS.
Currently, you cannot run multiple queries in one request. However, you can make multiple requests simultaneously; at the time of writing (2018-06-15) you can run 20 requests simultaneously. You can do this through an API call or the console, and you can also use the CLI or the SDK (if available for your language of choice).
For example, in Python you could use the multiprocessing or threading modules to manage concurrent requests. Just remember to consider thread/process safety when creating resources/clients.
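As a sketch of that idea (database, bucket and table names below are placeholders), using a thread pool and a single shared Athena client (boto3 clients, unlike sessions, are generally safe to share across threads):
import boto3
from concurrent.futures import ThreadPoolExecutor

athena = boto3.client("athena")

def drop_table(table):
    # start_query_execution only submits the query; it does not wait for completion
    response = athena.start_query_execution(
        QueryString="DROP TABLE IF EXISTS %s" % table,
        QueryExecutionContext={"Database": "my_database"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-output-bucket/athena/"}  # placeholder bucket
    )
    return response["QueryExecutionId"]

tables = ["products", "customers", "stores"]
with ThreadPoolExecutor(max_workers=5) as pool:
    for table, execution_id in zip(tables, pool.map(drop_table, tables)):
        print(table, execution_id)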
Service Limits:
Athena Service Limits
AWS Service Limits for which you can request a rate increase
I could not get Carl's method to work by executing DROP TABLE statements, even though they did work in the console.
So I thought it was worth posting the approach that worked for me, which uses a combination of the AWS SDK for pandas (awswrangler) and the CLI.
import awswrangler as wr
import boto3
import os

session = boto3.Session(
    aws_access_key_id='XXXXXX',
    aws_secret_access_key='XXXXXX',
    aws_session_token='XXXXXX'
)

database_name = 'athena_db'
athena_s3_output = 's3://athena_s3_bucket/athena_queries/'

# List every table in the database via information_schema
df = wr.athena.read_sql_query(
    sql="SELECT DISTINCT table_name FROM information_schema.tables WHERE table_schema = '" + database_name + "'",
    database=database_name,
    s3_output=athena_s3_output,
    boto3_session=session
)
print(df)

# ensure that your aws profile is valid for CLI commands
# i.e. your credentials are set in C:\Users\xxxxxxxx\.aws\credentials
for table in df['table_name']:
    cli_string = 'aws glue delete-table --database-name ' + database_name + ' --name ' + table
    print(cli_string)
    os.system(cli_string)
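A variation on this (not part of the original approach) is to skip the CLI and the shell-out entirely and delete the tables through the Glue API with the same boto3 session, for example:
# Alternative sketch: use the Glue API directly instead of shelling out to the CLI
glue_client = session.client('glue')
glue_client.batch_delete_table(
    DatabaseName=database_name,
    TablesToDelete=df['table_name'].tolist()
)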