I have two AWS Glue workflows, WorkflowA and WorkflowB. I want to run WorkflowA on a schedule, but WorkflowB should run only after WorkflowA completes successfully.
I attempted to create a trigger called startWorkflowB that is event-based on the successful completion of WorkflowA's last task.
Now, when I try to use the startWorkflowB trigger as the first task in WorkflowB, I get the following error -
My end result should be that WorkflowB runs only after the successful completion of WorkflowA.
What am I doing wrong here? Is it feasible to have a linear dependency between two workflows among multiple Glue workflows?
I have written a Cloud Storage-triggered Cloud Function. I have 10-15 files landing at 5-second intervals in a cloud bucket, and each file loads data into a BigQuery table (truncate and load).
When there are 10 files in the bucket, I want the Cloud Function to process them sequentially, i.e. one file at a time, because all the files operate on the same table.
Currently the Cloud Function gets triggered for multiple files at a time, and it fails in the BigQuery operation because multiple files try to access the same table.
Is there any way to configure this in Cloud Functions?
Thanks in advance!
You can achieve this by using Pub/Sub and the max instances parameter on Cloud Functions.
Firstly, use the notification capability of Google Cloud Storage and sink the events into a Pub/Sub topic.
Now you will receive a message every time an event occurs on the bucket. If you want to filter on file creation only (object finalize), you can apply a filter on the subscription. I wrote an article on this.
Then, create an HTTP function (an HTTP function is required if you want to apply a filter) with max instances set to 1. That way, only one function instance can run at a time. So, no concurrency!
Finally, create a Pub/Sub subscription on the topic, with or without a filter, to call your function over HTTP.
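For illustration, here is a minimal Python sketch of such an HTTP function (function and helper names are placeholders, not your actual code), assuming a Pub/Sub push subscription delivers the Cloud Storage notification and the function is deployed with max instances set to 1:

# A hedged sketch, not a drop-in implementation: handle_gcs_event and
# process_file are hypothetical names.
import functions_framework


@functions_framework.http
def handle_gcs_event(request):
    # Pub/Sub push delivery wraps the message in a JSON envelope.
    envelope = request.get_json(silent=True)
    if not envelope or "message" not in envelope:
        return ("Bad Request: no Pub/Sub message received", 400)

    # For Cloud Storage notifications, the object metadata is available in
    # the message attributes (the full JSON payload is base64-encoded in
    # message["data"]).
    attributes = envelope["message"].get("attributes", {})
    bucket = attributes.get("bucketId")
    file_name = attributes.get("objectId")

    # process_file(bucket, file_name)  # your truncate-and-load logic goes here
    return ("Processed {}/{}".format(bucket, file_name), 200)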
EDIT
Thanks to your code, I understood what happens. In fact, BigQuery is a declarative system. When you perform a request or a load job, a job is created and it runs in the background.
In Python, you can explicitly wait for the end of the job, but with pandas I didn't find how!
I just found a Google Cloud page that explains how to migrate from pandas to the BigQuery client library. As you can see, there is a line at the end
# Wait for the load job to complete.
job.result()
that waits for the end of the job.
You did it well in the _insert_into_bigquery_dwh function, but it's not the case in the staging _insert_into_bigquery_staging one. This can lead to 2 issues:
The dwh function works on old data, because the staging load isn't finished yet when you trigger this job.
If the staging load takes, let's say, 10 seconds and runs in the "background" (you don't explicitly wait for its end in your code) and the dwh load takes 1 second, the next file is processed at the end of the dwh function, even though the staging one is still running in the background. And that leads to your issue. (A sketch of the fix follows below.)
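A hedged sketch of what the staging load could look like with the BigQuery client library instead of pandas, so the job can be awaited explicitly (table, bucket, and file names are placeholders; only the function name _insert_into_bigquery_staging comes from your code):

from google.cloud import bigquery


def _insert_into_bigquery_staging(bucket_name, file_name):
    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.CSV
    job_config.skip_leading_rows = 1
    job_config.autodetect = True
    # Truncate-and-load semantics, as described in the question.
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

    load_job = client.load_table_from_uri(
        "gs://{}/{}".format(bucket_name, file_name),
        "my_project.my_dataset.staging_table",
        job_config=job_config,
    )

    # Block until the load job finishes (and raise if it failed) before the
    # dwh step starts, so the next file is not processed too early.
    load_job.result()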
The architecture you describe isn't the same as the one from the documentation you linked. Note that in the flow diagram and the code samples, the storage event triggers the Cloud Function, which streams the data directly into the destination table. Since BigQuery allows multiple concurrent streaming inserts, several functions can execute at the same time without problems. In your use case, the intermediate table loaded with write-truncate for data cleaning makes a big difference, because each execution needs the previous one to finish, thus requiring a sequential processing approach.
I would like to point out that Pub/Sub doesn't allow you to configure the rate at which messages are delivered; if 10 messages arrive at the topic, they will all be sent to the subscriber, even if processed one at a time. Limiting the function to one instance may lead to overhead for the above reason and could increase latency as well. That said, since the expected workload is 15-30 files a day, the above may not be a big concern.
If you'd like to have parallel executions, you may try creating a new table for each message and setting a short expiration deadline for it using the table.expires setter, so that multiple executions don't conflict with each other. Here is the related library reference. Otherwise, the great answer from Guillaume would completely get the job done.
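A small sketch of the per-message temporary table idea with the google-cloud-bigquery library (project, dataset, and table names are placeholders): each execution writes to its own table, which BigQuery deletes automatically once the expiration time passes.

import datetime
import uuid

from google.cloud import bigquery

client = bigquery.Client()

# One temporary table per incoming message.
table_id = "my_project.my_dataset.tmp_{}".format(uuid.uuid4().hex)
table = bigquery.Table(table_id)

# Let the table expire one hour from now so leftovers clean themselves up.
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1)

table = client.create_table(table)
print("Created {} (expires {})".format(table.full_table_id, table.expires))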
I have submitted 3 jobs in parallel in AWS Batch, and I want to create a trigger for when all 3 jobs are completed.
Something like: I should be able to specify 3 job IDs and update the DB once all 3 jobs are done.
I could do this easily with long polling, but I wanted to do something event-based.
I need your help with this.
The easiest option would be to create a fourth Batch job that specifies the other three jobs as dependencies. This job will sit in the PENDING state until the other three jobs have succeeded, and then it will run. Inside that job, you could update the DB or do whatever other actions you wanted.
One downside to this approach is that if one of the jobs fails, the pending job will automatically go into a FAILED state without running.
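For reference, a hedged boto3 sketch of submitting such a dependent job (job name, queue, job definition, and job IDs are placeholders):

import boto3

batch = boto3.client("batch")

# The fourth job only starts once all three dependencies have SUCCEEDED.
response = batch.submit_job(
    jobName="update-db-after-batch",
    jobQueue="my-job-queue",
    jobDefinition="update-db-job-definition",
    dependsOn=[
        {"jobId": "job-id-1"},
        {"jobId": "job-id-2"},
        {"jobId": "job-id-3"},
    ],
)
print("Submitted dependent job:", response["jobId"])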
I have two distinct pipelines (A and B). When A has finished, I would like to kick off the second one (B) immediately.
So far, to accomplish that I have added a ShellCommandActivity with the following command:
aws datapipeline activate-pipeline --pipeline-id <my pipeline id>
Are there other better ways to do that?
You can use a combination of indicator files (zero-byte files) and Lambda to loosely couple the two data pipelines. You need to make the following changes -
Data Pipeline 1 - Using a shell command, touch a zero-byte file as the last step of the pipeline in a given S3 path.
Create a Lambda function that watches for the indicator file and activates Data Pipeline 2.
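A hedged sketch of that Lambda function in Python (the pipeline ID and S3 key prefix are placeholders), assuming it is wired to the bucket's ObjectCreated event:

import boto3

datapipeline = boto3.client("datapipeline")

PIPELINE_B_ID = "df-XXXXXXXXXXXX"           # hypothetical pipeline ID for Data Pipeline 2
INDICATOR_PREFIX = "indicators/pipeline-1"   # hypothetical S3 prefix of the zero-byte file


def lambda_handler(event, context):
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        # Only react to the indicator file written by Data Pipeline 1.
        if key.startswith(INDICATOR_PREFIX):
            datapipeline.activate_pipeline(pipelineId=PIPELINE_B_ID)
    return {"status": "done"}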
Note - This may not be very helpful if you are looking at a simple scenario of just executing two data pipelines sequentially. However, it is helpful when you want to create intricate dependencies between pipelines, e.g. you have a set of staging jobs (each corresponding to one pipeline) and you want to trigger your data-mart or derived-table jobs only after all the staging jobs are completed.
We have 5 pipes in our data pipeline, which execute on the following basis:
pipe 1 - pipe 4 = daily basis
pipe 5 - end of the month.
We are considering creating a separate pipeline for pipe 5, as it doesn't have any dependency on the other pipes.
Is there any way I can execute all pipes except pipe 5, using something like a decision variable as we have in Oozie, so that the execution of pipe 5 is skipped and the pipeline completes without any "error"/"Waiting on dependencies" status?
You're probably better off creating multiple pipelines and setting them on different schedules. If you'd like to spice things up, you can use CloudWatch scheduling and AWS Lambda to schedule pipeline creation/deletion in a cron-like way. You could also use AWS Step Functions to define the workflow of each component.
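A hedged boto3 sketch of the cron-like scheduling idea (rule name, schedule, and Lambda ARN are placeholders): a CloudWatch Events/EventBridge rule fires once a month and targets a Lambda that creates or activates the pipe 5 pipeline.

import boto3

events = boto3.client("events")

rule_name = "activate-pipe5-monthly"

# Fire at 02:00 UTC on the 1st of every month (AWS cron syntax).
events.put_rule(
    Name=rule_name,
    ScheduleExpression="cron(0 2 1 * ? *)",
    State="ENABLED",
)

# Point the rule at the Lambda that handles pipe 5. The Lambda also needs a
# resource-based permission allowing events.amazonaws.com to invoke it.
events.put_targets(
    Rule=rule_name,
    Targets=[{
        "Id": "pipe5-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:activate-pipe5",
    }],
)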
I know there are APIs to configure notifications when a job fails or finishes.
But what if, say, I run a Hive query that counts the number of rows in a table, and I want to send emails to the concerned parties if the returned result is zero? How can I do that?
Thanks.
You may want to look at Airflow and Qubole's operator for Airflow. We use Airflow to orchestrate all jobs run with Qubole and, in some cases, non-Qubole environments. We use the DataDog API to report successes/failures of each task (Qubole or non-Qubole). DataDog in this case can be replaced by Airflow's email operator. Airflow also has some chat operators (like Slack).
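As an illustration, a minimal Airflow 2.x-style sketch of that idea (DAG ID, table name, and email address are placeholders, and get_row_count is a stub you would replace with the actual Qubole/Hive query, e.g. via the QuboleOperator): the task fails when the count is zero, and Airflow's email-on-failure notification alerts the concerned parties.

from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowException
from airflow.operators.python import PythonOperator


def get_row_count(table_name):
    # Placeholder: replace with a real query, e.g. issued through the
    # QuboleOperator or the Qubole SDK.
    return 0


def check_row_count(**_):
    rows = get_row_count("my_table")
    if rows == 0:
        # Failing the task makes Airflow send the email_on_failure alert.
        raise AirflowException("my_table has 0 rows")


with DAG(
    dag_id="row_count_alert",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"email": ["team@example.com"], "email_on_failure": True},
) as dag:
    PythonOperator(task_id="check_row_count", python_callable=check_row_count)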
There is no direct API for triggering a notification based on the results of a query.
However, there is a way to do this using Qubole:
-Create a workflow in Qubole with the following steps:
1. Your query (any query) that writes its output to a particular location on S3.
2. A shell script - this script reads the result from S3 and fails the job based on any criteria. For instance, in your case, fail the job if the result returns 0 rows (a small Python sketch of this check follows below).
-Schedule this workflow using the "Scheduler" API to notify on failure.
You can also use the "Sendmail" shell command to send mail based on the results in step 2 above.
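A minimal Python sketch of the step-2 check (bucket name and key are placeholders, and the output is assumed to be a single row count written by the query in step 1); the shell step would simply invoke this script and let the non-zero exit code fail the job:

import sys

import boto3

s3 = boto3.client("s3")

# Read the query result that step 1 wrote to S3.
obj = s3.get_object(Bucket="my-results-bucket", Key="query-output/000000_0")
body = obj["Body"].read().decode("utf-8").strip()

# Assumption: the query output is a single number (the row count).
if not body or int(body) == 0:
    print("Row count is zero - failing the job to trigger the notification")
    sys.exit(1)

print("Row count is {} - OK".format(body))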