aws_glue_trigger in Terraform creates "Invalid expression" schedule in AWS

I am trying to create an AWS Glue job trigger in Terraform, based on the condition that a crawler (itself triggered by cron) has succeeded:
resource "aws_glue_trigger" "trigger" {
name = "trigger"
type = "CONDITIONAL"
actions {
job_name = aws_glue_job.job.name
}
predicate {
conditions {
crawler_name = aws_glue_crawler.crawler.name
crawl_state = "SUCCEEDED"
}
}
}
It applies cleanly, but in the job's Schedules property the trigger shows
Invalid expression in the Cron column while the status is Activated. Of course it won't fire because of that. What am I missing here?

Not sure if I understood the question correctly, but this is my Glue trigger configuration, which runs at a scheduled time, and it does fire at that time.
resource "aws_glue_trigger" "tr_one" {
name = "tr_one"
schedule = var.wf_schedule_time
type = "SCHEDULED"
workflow_name = aws_glue_workflow.my_workflow.name
actions {
job_name = var.my_glue_job_1
}
}
// Specify the schedule time in UTC format to run Glue workflows
wf_schedule_time = "cron(56 09 * * ? *)"
Please note that the schedule must be given in UTC.
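For reference, a minimal declaration of that variable might look like this (a sketch; the description is an assumption, and the default simply mirrors the cron value above):
variable "wf_schedule_time" {
  # Description text is an assumption; adjust to your conventions
  description = "Cron expression (UTC) for the Glue workflow trigger"
  type        = string
  default     = "cron(56 09 * * ? *)"
}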

I had the same problem. Unfortunately I did not find an easy way to get rid of the 'Invalid expression' using aws_glue_trigger alone, but I figured out a workaround using Glue workflows that achieves the same goal (triggering a Glue job after a crawler succeeds). I am not quite sure whether this is the best way to do it.
First I created a Glue workflow:
resource "aws_glue_workflow" "my_workflow" {
name = "my-workflow"
}
Then I created a scheduled trigger for my crawler (and removed the schedule from the Glue crawler it references):
resource "aws_glue_trigger" "crawler_scheduler" {
name = "crawler-trigger"
workflow_name = "my-workflow"
type = "SCHEDULED"
schedule = "cron(15 12 * * ? *)"
actions {
crawler_name = "my-crawler"
}
}
Lastly I created the final trigger for the Glue job, which should run after the crawler succeeds. The important aspect here is that both triggers are attached to the same workflow, effectively linking crawler and job.
resource "aws_glue_trigger" "job_trigger" {
name = "${each.value.s3_bucket_id}-ndjson_to_parquet-trigger"
type = "CONDITIONAL"
workflow_name = "my-workflow"
predicate {
conditions {
crawler_name = "my-crawler"
crawl_state = "SUCCEEDED"
}
}
actions {
job_name = "my-job"
}
}
The Glue job still shows 'Invalid expression' under the schedule label, but this time you can successfully trigger it by just running the scheduled trigger. In addition you even get a visualization of the run in Glue workflows.
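As a side note, hard-coding the names ("my-workflow", "my-crawler") works, but referencing the resources directly lets Terraform track the dependencies between them. A sketch of the scheduler written that way, assuming the resource labels used above and a crawler managed in the same configuration:
resource "aws_glue_trigger" "crawler_scheduler" {
  name          = "crawler-trigger"
  workflow_name = aws_glue_workflow.my_workflow.name
  type          = "SCHEDULED"
  schedule      = "cron(15 12 * * ? *)"

  actions {
    # Assumes a crawler resource labelled "crawler" exists in this configuration
    crawler_name = aws_glue_crawler.crawler.name
  }
}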

Related

Invoke Glue job from another Glue job

I have two Glue jobs, created from the AWS console. I would like to invoke one Glue job (Python) from another Glue (Python) job with parameters. What would be the best approach to do this? I appreciate your help.
You can use Glue workflows and set up workflow parameters as mentioned by Bob Haffner, then trigger the Glue jobs using the workflow. The advantage here is that if the second Glue job fails due to any errors, you can resume/rerun only the second job after fixing the issues. Workflow parameters can also be passed from one Glue job to another. Sample code for reading/writing workflow parameters:
In the first Glue job:
import sys
import boto3
from awsglue.utils import getResolvedOptions

glue_client = boto3.client('glue')

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])
workflow_name = args['WORKFLOW_NAME']
workflow_run_id = args['WORKFLOW_RUN_ID']

# Read the current run properties, add our parameters, and write them back
workflow_params = glue_client.get_workflow_run_properties(Name=workflow_name, RunId=workflow_run_id)["RunProperties"]
workflow_params['param1'] = param_value1
workflow_params['param2'] = param_value2
workflow_params['param3'] = param_value3
workflow_params['param4'] = param_value4
glue_client.put_workflow_run_properties(Name=workflow_name, RunId=workflow_run_id, RunProperties=workflow_params)
And in the second Glue job:
import sys
import boto3
from awsglue.utils import getResolvedOptions

glue_client = boto3.client('glue')

args = getResolvedOptions(sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])
workflow_name = args['WORKFLOW_NAME']
workflow_run_id = args['WORKFLOW_RUN_ID']

# Read back the parameters set by the first job
workflow_params = glue_client.get_workflow_run_properties(Name=workflow_name, RunId=workflow_run_id)["RunProperties"]
param_value1 = workflow_params['param1']
param_value2 = workflow_params['param2']
param_value3 = workflow_params['param3']
param_value4 = workflow_params['param4']
For how to set up a Glue workflow, refer here:
https://docs.aws.amazon.com/glue/latest/dg/creating_running_workflows.html
https://medium.com/@pioneer21st/orchestrating-etl-jobs-in-aws-glue-using-workflow-758ef10b8434
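If the jobs are managed with Terraform (as elsewhere in this thread), the workflow plus triggers can be sketched roughly like this; all job and resource names below are placeholders, and the conditional trigger starts the second job once the first succeeds:
resource "aws_glue_workflow" "etl" {
  name = "my-etl-workflow"
}

# Starts the first job when the workflow is run
resource "aws_glue_trigger" "start_first_job" {
  name          = "start-first-job"
  type          = "ON_DEMAND"
  workflow_name = aws_glue_workflow.etl.name

  actions {
    job_name = "first-glue-job" # placeholder
  }
}

# Fires the second job only after the first job succeeded
resource "aws_glue_trigger" "start_second_job" {
  name          = "start-second-job"
  type          = "CONDITIONAL"
  workflow_name = aws_glue_workflow.etl.name

  predicate {
    conditions {
      job_name = "first-glue-job" # placeholder
      state    = "SUCCEEDED"
    }
  }

  actions {
    job_name = "second-glue-job" # placeholder
  }
}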

AWS CodeBuild webhook triggers when it shouldn't start

I have the following setup for CodeBuild's webhook:
resource "aws_codebuild_webhook" "apply" {
project_name = aws_codebuild_project.codebuild-apply.name
build_type = "BUILD"
filter_group {
filter {
type = "EVENT"
pattern = "PUSH"
}
filter {
type = "FILE_PATH"
pattern = "environments/test/*"
}
filter {
type = "HEAD_REF"
pattern = "master"
}
}
}
The purpose is to run it only when changes are pushed to the master branch.
Currently this webhook starts the buildspec when changes are made in environments/test/ on every branch, not only master.
What is wrong, and how do I set it up correctly?
According to https://docs.aws.amazon.com/codebuild/latest/userguide/github-webhook.html, the right format for the pattern of your HEAD_REF filter is ^refs/heads/master$.
I only now realized that you use Terraform. Can you try with:
filter {
  type    = "HEAD_REF"
  pattern = "refs/heads/master"
}
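Putting that together with the anchored pattern from the docs linked above, the full webhook from the question would become something like this sketch:
resource "aws_codebuild_webhook" "apply" {
  project_name = aws_codebuild_project.codebuild-apply.name
  build_type   = "BUILD"

  filter_group {
    filter {
      type    = "EVENT"
      pattern = "PUSH"
    }
    filter {
      type    = "FILE_PATH"
      pattern = "environments/test/*"
    }
    filter {
      # Anchored so that e.g. refs/heads/master-hotfix does not match
      type    = "HEAD_REF"
      pattern = "^refs/heads/master$"
    }
  }
}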

How to make a PubSub triggered Cloud Function with message ordering using terraform?

I am trying to create a Cloud Function that is triggered from a Pub/Sub subscription, but I need message ordering enabled. I know how to use the event_trigger block in the google_cloudfunctions_function resource when creating a function linked to a subscription, but that path does not allow the enable_message_ordering setting described under Pub/Sub. When using the subscription's push config, I don't know how to link the endpoint to the function.
So is there a way I can link the function to a subscription with message ordering enabled?
Can I just use the internal URL to the function as the push config URL?
You can't use background functions triggered by Pub/Sub together with message ordering (or filtering).
You have to deploy an HTTP function instead (take care: the signature of the function changes, and the format of the Pub/Sub message also changes slightly).
Then create a Pub/Sub PUSH subscription that uses the Cloud Functions URL. It's best to also add a service account on Pub/Sub so that only it is allowed to call your function.
For completeness I wanted to add the Terraform that I used to do this, in case others are looking.
# This is the HTTP function that processes the events from PubSub; note it is set up as an HTTP trigger
resource "google_cloudfunctions_function" "processEvent" {
  name    = "processEvent"
  runtime = var.RUNTIME

  environment_variables = {
    GCP_PROJECT_ID = var.GCP_PROJECT_ID
    LOG_LEVEL      = var.LOG_LEVEL
  }

  available_memory_mb   = var.AVAILABLE_MEMORY
  timeout               = var.TIMEOUT
  source_archive_bucket = var.SOURCE_ARCHIVE_BUCKET
  source_archive_object = google_storage_bucket_object.processor-archive.name
  trigger_http          = true
  entry_point           = "processEvent"
}
# Define the topic
resource "google_pubsub_topic" "event-topic" {
  name = "event-topic"
}

# We need to create the subscription explicitly as we need to enable message ordering
resource "google_pubsub_subscription" "processEvent_subscription" {
  name                 = "processEvent_subscription"
  topic                = google_pubsub_topic.event-topic.name
  ack_deadline_seconds = 20

  push_config {
    push_endpoint = "https://${var.REGION}-${var.GCP_PROJECT_ID}.cloudfunctions.net/${google_cloudfunctions_function.processEvent.name}"

    oidc_token {
      # A new IAM service account is needed to allow the subscription to trigger the function
      service_account_email = "cloudfunctioninvoker@${var.GCP_PROJECT_ID}.iam.gserviceaccount.com"
    }
  }

  enable_message_ordering = true
}
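For this to work, the service account referenced in oidc_token must exist and be allowed to invoke the function. A sketch of those missing pieces, with names matching the assumptions above:
resource "google_service_account" "invoker" {
  # account_id matches the email hard-coded in the subscription above
  account_id   = "cloudfunctioninvoker"
  display_name = "Invokes the Cloud Function from the PubSub push subscription"
}

resource "google_cloudfunctions_function_iam_member" "invoker" {
  project        = var.GCP_PROJECT_ID
  region         = var.REGION
  cloud_function = google_cloudfunctions_function.processEvent.name
  role           = "roles/cloudfunctions.invoker"
  member         = "serviceAccount:${google_service_account.invoker.email}"
}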

GCP Cloud Tasks: shorten period for creating a previously created named task

We are developing a GCP Cloud Tasks based queue process that sends a status email whenever a particular Firestore doc write-trigger fires. The reason we use Cloud Tasks is so that a delay can be created (using the scheduledTime property, 2 min in the future) before the email is sent, and to control dedup (by using a task name formatted as [firestore-collection-name]-[doc-id]), since the 'write' trigger on the Firestore doc can fire several times as the document is created and then quickly updated by backend cloud functions.
Once the task's delay period has been reached, the cloud task runs and the email is sent with updated Firestore document info included, after which the task is deleted from the queue and all is good.
Except:
If the user updates the Firestore doc (say 20 or 30 min later) we want to resend the status email but are unable to create the task using the same task-name. We get the following error:
409 The task cannot be created because a task with this name existed too recently. For more information about task de-duplication see https://cloud.google.com/tasks/docs/reference/rest/v2/projects.locations.queues.tasks/create#body.request_body.FIELDS.task.
This was unexpected, as the queue is empty at this point and the last task completed successfully. The documentation referenced in the error message says:
If the task's queue was created using Cloud Tasks, then another task with the same name can't be created for ~1 hour after the original task was deleted or executed.
Question: is there some way this restriction can be bypassed, by lowering the amount of time or even removing the restriction altogether?
The short answer is no. As you've already pointed out, the docs are very clear regarding this behavior: you should wait 1 hour to create a task with the same name as one that was previously created. Neither the API nor the client libraries allow you to decrease this time.
Having said that, I would suggest that instead of reusing the same task ID, you use a different one for each task and add an identifier in the body of the request. For example, using Python:
from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2
import datetime

def create_task(project, queue, location, payload=None, in_seconds=None):
    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path(project, location, queue)
    task = {
        'app_engine_http_request': {
            'http_method': 'POST',
            'relative_uri': '/task/' + queue
        }
    }
    if payload is not None:
        converted_payload = payload.encode()
        task['app_engine_http_request']['body'] = converted_payload
    if in_seconds is not None:
        d = datetime.datetime.utcnow() + datetime.timedelta(seconds=in_seconds)
        timestamp = timestamp_pb2.Timestamp()
        timestamp.FromDatetime(d)
        task['schedule_time'] = timestamp
    response = client.create_task(parent, task)
    print('Created task {}'.format(response.name))
    print(response)

# You can change DOCUMENT_ID to USER_ID or something else to identify the task
create_task(PROJECT_ID, QUEUE, REGION, DOCUMENT_ID)
Facing a similar problem of needing to debounce multiple instances of Firestore write-trigger functions, we worked around the default Cloud Tasks task-name based dedup mechanism (still a constraint as of Nov 2022) by building a small debounce "helper" using Firestore transactions.
We're using a helper collection _syncHelper_ to implement a delayed throttle for side effects of write-trigger fires - in the OP's case, send one email for all writes within 2 minutes.
In our case we are using the Firebase Functions task queue utils rather than interacting with Cloud Tasks directly, but that's immaterial to the solution. The key is to determine the task's execution time in advance and use that as the "dedup key":
const { getFirestore, Timestamp } = require('firebase-admin/firestore');
const { getFunctions } = require('firebase-admin/functions');

async function enqueueTask(shopId) {
  const queueName = 'doSomething';
  const now = new Date();
  const next = new Date(now.getTime() + 2 * 60 * 1000);
  try {
    const shouldEnqueue = await getFirestore().runTransaction(async t => {
      const syncRef = getFirestore().collection('_syncHelper_').doc(<collection_id-doc_id>);
      const doc = await t.get(syncRef);
      let data = doc.data();
      // Skip enqueueing if the stored "next execution" time is still in the future
      if (data?.timestamp.toDate() > now) {
        return false;
      }
      await t.set(syncRef, { timestamp: Timestamp.fromDate(next) });
      return true;
    });
    if (shouldEnqueue) {
      let queue = getFunctions().taskQueue(queueName);
      await queue.enqueue(
        { timestamp: next.toISOString() },
        { scheduleTime: next }
      );
    }
  } catch {
    ...
  }
}
This will ensure a new task is enqueued only if the "next execution" time has passed.
The execution operation (also a cloud function in our case) will remove the sync data entry if it hasn't been changed since it was executed:
exports.doSomething = functions.tasks.taskQueue({
  retryConfig: {
    maxAttempts: 2,
    minBackoffSeconds: 60,
  },
  rateLimits: {
    maxConcurrentDispatches: 2,
  },
}).onDispatch(async data => {
  let { timestamp } = data;
  await sendYourEmailHere();
  await getFirestore().runTransaction(async t => {
    const syncRef = getFirestore().collection('_syncHelper_').doc(<collection_id-doc_id>);
    const doc = await t.get(syncRef);
    let data = doc.data();
    // Only clean up the helper doc if it wasn't updated again after this task was scheduled
    if (data?.timestamp.toDate() <= new Date(timestamp)) {
      await t.delete(syncRef);
    }
  });
});
This isn't a bulletproof solution (if the doSomething() execution function has high latency, for example) but it's good enough for 99% of our use cases.

AWS CodeBuild Branch filter option removed

We are using the AWS CodeBuild 'Branch filter' option to trigger a build only when a push to master is made. However, the 'Branch filter' option has apparently been removed recently and 'Webhook event filter groups' were added. They should provide more functionality, I expect, but I cannot see how to reproduce the branch filter.
Can someone help?
I couldn't see this change flagged anywhere, but it worked for me setting the event type to PUSH and HEAD_REF to refs/heads/branch-name, as per
https://docs.aws.amazon.com/codebuild/latest/userguide/sample-github-pull-request.html
You need to use filter groups instead of branch_filter.
Example in Terraform (0.12+):
For feature branches:
resource "aws_codebuild_webhook" "feature" {
project_name = aws_codebuild_project.feature.name
filter_group {
filter {
type = "EVENT"
pattern = "PULL_REQUEST_CREATED, PULL_REQUEST_UPDATED, PULL_REQUEST_REOPENED"
}
filter {
type = "HEAD_REF"
pattern = "^(?!^/refs/heads/master$).*"
exclude_matched_pattern = false
}
}
}
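Instead of the negative lookahead, you could arguably let CodeBuild do the exclusion via exclude_matched_pattern; a sketch of the equivalent filter:
filter {
  # Excludes master instead of matching everything-but-master
  type                    = "HEAD_REF"
  pattern                 = "^refs/heads/master$"
  exclude_matched_pattern = true
}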
For the master branch:
resource "aws_codebuild_webhook" "master" {
project_name = aws_codebuild_project.master.name
filter_group {
filter {
type = "EVENT"
pattern = "PUSH"
}
filter {
type = "HEAD_REF"
pattern = "^refs/heads/master$"
exclude_matched_pattern = false
}
}
}
So they both require an aws_codebuild_project each; thus you will have two CodeBuild projects per repository.
branch_filter does not work in CodeBuild, although it is still configurable via the UI or API; filter_groups are what carry the required logic.