How to keep Lambda from triggering multiple times? - amazon-web-services

Tech stack: Salesforce data -> AWS AppFlow -> S3 -> Databricks job
Hello! I have an AppFlow flow that is grabbing Salesforce data and uploading it to S3 in a folder with multiple Parquet files. I have a Lambda that is listening to the prefix where this folder is being dropped. This Lambda then triggers a Databricks job, which is an ingestion process I have created.
My main issue is that when these files are uploaded to S3, my Lambda is triggered once per file, and I was curious how I can have the Lambda run just once.

Amazon AppFlow publishes a flow notification (see Flow notifications - Amazon AppFlow) when a flow is complete:
Amazon AppFlow is integrated with Amazon CloudWatch Events to publish events related to the status of a flow. The following flow events are published to your default event bus.
AppFlow End Flow Run Report: This event is published when a flow run is complete.
You could trigger the Lambda function when this Event is published. That way, it is only triggered when the Flow is complete.
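If you go this route, a rough sketch of the rule wiring with boto3 might look like the following. The rule name, Lambda ARN, and the optional flow-name filter are placeholders, and the detail field names should be verified against a sample event from your account.

```python
import json
import boto3

# Sketch only: create an EventBridge rule that fires when AppFlow publishes the
# "AppFlow End Flow Run Report" event, and point it at the existing Lambda.
events = boto3.client("events")
lambda_client = boto3.client("lambda")

rule = events.put_rule(
    Name="appflow-run-complete",
    EventPattern=json.dumps({
        "source": ["aws.appflow"],
        "detail-type": ["AppFlow End Flow Run Report"],
        # Optionally narrow the rule to a single flow; verify the detail field
        # name against a real event before relying on it.
        # "detail": {"flow-name": ["my-salesforce-flow"]},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="appflow-run-complete",
    Targets=[{"Id": "trigger-databricks-lambda",
              "Arn": "arn:aws:lambda:us-east-1:123456789012:function:trigger-databricks-job"}],
)

# Allow EventBridge to invoke the function.
lambda_client.add_permission(
    FunctionName="trigger-databricks-job",
    StatementId="allow-eventbridge-appflow",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
```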

I hope I've understood your issue correctly, but it sounds like your Lambda is working as configured: with the S3 trigger set up to fire on every upload, the Lambda will be invoked once for every file dropped into the bucket.
If you want to reduce how often your Lambda runs, set up an EventBridge trigger instead: use an EventBridge cron rule to invoke the Lambda on a defined schedule and have it check the bucket for new files. You could then send all the files to your Databricks job in bulk rather than individually.
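For example, a minimal sketch of such a scheduled handler, assuming an EventBridge cron rule invokes it; the bucket, prefix, Databricks job ID, and the DATABRICKS_HOST/DATABRICKS_TOKEN environment variables shown here are placeholders (the call uses the Databricks Jobs 2.1 run-now endpoint):

```python
import os
from datetime import datetime, timedelta, timezone

import boto3
import requests

s3 = boto3.client("s3")

BUCKET = "my-appflow-bucket"          # placeholder
PREFIX = "salesforce/"                # placeholder
WINDOW = timedelta(minutes=5)         # should match the cron schedule


def handler(event, context):
    # Collect objects written under the prefix since the last schedule tick.
    cutoff = datetime.now(timezone.utc) - WINDOW
    new_keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= cutoff:
                new_keys.append(obj["Key"])

    if not new_keys:
        return {"started": False}

    # Trigger one Databricks job run for the whole batch.
    resp = requests.post(
        f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
        json={"job_id": int(os.environ["DATABRICKS_JOB_ID"]),
              "notebook_params": {"keys": ",".join(new_keys)}},
        timeout=30,
    )
    resp.raise_for_status()
    return {"started": True, "files": len(new_keys)}
```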

Related

AWS Transcribe callback function

I want to call AWS transcribe function from an AWS Lambda.
In that Lambda handler, I want to start the transcription job but not wait for it to finish in a while loop, since that would not be cost-efficient. I don't see any way for the transcription job, when it finishes, to call another Lambda or something like that, to store the transcription information in an S3 bucket, for example.
Any idea how to solve this?
See Using Amazon EventBridge with Amazon Transcribe.
With Amazon EventBridge, you can respond to state changes in your Amazon Transcribe jobs by initiating events in other AWS services. When a transcription job changes state, EventBridge automatically sends an event to an event stream. You create rules that define the events that you want to monitor in the event stream and the action that EventBridge should take when those events occur. For example, routing the event to another service (or target), which can then take an action. You could, for example, configure a rule to route an event to an AWS Lambda function when a transcription job has completed successfully.
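A minimal sketch of such a rule with boto3; the rule name and Lambda ARN are placeholders, and the invoke-permission wiring is the same as for any EventBridge-to-Lambda rule:

```python
import json
import boto3

# Sketch only: match Transcribe job state-change events that report COMPLETED
# and route them to a Lambda function.
events = boto3.client("events")

events.put_rule(
    Name="transcribe-job-completed",
    EventPattern=json.dumps({
        "source": ["aws.transcribe"],
        "detail-type": ["Transcribe Job State Change"],
        "detail": {"TranscriptionJobStatus": ["COMPLETED"]},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="transcribe-job-completed",
    Targets=[{"Id": "post-transcribe-lambda",
              "Arn": "arn:aws:lambda:us-east-1:123456789012:function:handle-transcript"}],
)
```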
Another alternative is:
- When you call StartTranscriptionJob, you supply an S3 bucket name and S3 object key that will receive the transcribed results.
- You can then use the Amazon S3 Event Notifications feature to notify you or to automatically trigger a Lambda function.
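A short sketch of that second option with boto3; the job name, media URI, bucket, and key are placeholders:

```python
import boto3

# Start the job with an output bucket/key so the transcript lands in S3, then
# let an S3 event notification on that prefix invoke the follow-up Lambda.
transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="call-1234",
    LanguageCode="en-US",
    Media={"MediaFileUri": "s3://my-audio-bucket/calls/call-1234.mp3"},
    MediaFormat="mp3",
    OutputBucketName="my-transcripts-bucket",
    OutputKey="transcripts/call-1234.json",  # prefix the S3 notification watches
)
```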

How do I trigger an AWS Lambda function only if a bulk upload has finished on S3?

We have a simple ETL setup, described below:
Vendor uploads crawled Parquet data to our S3 bucket.
An S3 event triggers a Lambda function, which triggers a Glue crawler to update the existing table partition in Glue.
This works fine most of the time, but in some cases our vendor might upload files consecutively within a short time period, for example when refreshing history data. This causes an issue, since the Glue crawler cannot run concurrently and the job will fail.
I'm wondering if there is anything we can do to avoid this potential error. I've looked into SQS but am not exactly sure if it can help me; below is what I would like to achieve:
Vendor uploads a file to S3.
S3 sends an event to SQS.
SQS holds the event and waits until there has been no further event for a given time period, say 5 minutes.
After no further event for 5 minutes, SQS triggers the Lambda function to run the Glue crawler.
Is this doable with S3 and SQS?
SQS holds the event
Yes, you can do this, as you can set an SQS delivery delay of up to 15 minutes.
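For example, a queue-level delivery delay can be set when creating the queue (the queue name is a placeholder); note that it delays each message individually rather than waiting for uploads to stop:

```python
import boto3

sqs = boto3.client("sqs")

# Sketch only: every message becomes visible to consumers 5 minutes after it
# is sent; the maximum allowed DelaySeconds is 900 (15 minutes).
sqs.create_queue(
    QueueName="s3-upload-events",
    Attributes={"DelaySeconds": "300"},
)
```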
waits until there has been no further event for a given time period, say 5 minutes.
No, there is no automated way to do that. You have to develop your own custom solution. The most trivial way would be to not couple SQS with Lambda, and instead run the Lambda on a schedule (e.g. every 5 minutes). The Lambda would need logic to determine whether any new files have been uploaded recently, and if not, trigger your Glue job. This would probably involve DynamoDB to keep track of the last uploaded files between Lambda executions.
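A rough sketch of that idea, assuming a hypothetical DynamoDB table and crawler name (this is one possible shape, not a prescribed design):

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")
glue = boto3.client("glue")

TABLE = "upload-tracker"      # placeholder table with "pk" as partition key
QUIET_SECONDS = 300           # debounce window


def on_s3_upload(event, context):
    # Triggered by S3: just record the time of the most recent upload.
    dynamodb.put_item(
        TableName=TABLE,
        Item={"pk": {"S": "vendor-feed"},
              "last_upload": {"N": str(int(time.time()))}},
    )


def on_schedule(event, context):
    # Triggered by an EventBridge schedule (e.g. every 5 minutes): start the
    # crawler only if no new file has arrived during the quiet period.
    item = dynamodb.get_item(TableName=TABLE,
                             Key={"pk": {"S": "vendor-feed"}}).get("Item")
    if not item:
        return
    last_upload = int(item["last_upload"]["N"])
    if time.time() - last_upload >= QUIET_SECONDS:
        try:
            glue.start_crawler(Name="vendor-feed-crawler")
        except glue.exceptions.CrawlerRunningException:
            pass  # already running; try again on the next schedule tick
```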

Invoking third party API(IICS) for each events on S3 through eventbridge

I have been trying to run an IICS job every time an event occurs on my S3 bucket, so I turned on EventBridge notifications in S3, created a rule, a connection, and an API destination, and set that API destination as the rule's target.
Now when I upload a file to S3, I can see the two graphs, invocations and triggered rules, register the event (see the graph image here).
The graph makes it look like the API is triggered, but the IICS job has not run.
Thank you!
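For reference, the setup described above would look roughly like this in boto3; the connection, destination, endpoint, role, and bucket names are all placeholders:

```python
import json
import boto3

# Purely illustrative reconstruction of the described rule -> API destination wiring.
events = boto3.client("events")

conn = events.create_connection(
    Name="iics-connection",
    AuthorizationType="API_KEY",
    AuthParameters={"ApiKeyAuthParameters": {"ApiKeyName": "x-api-key",
                                             "ApiKeyValue": "REPLACE_ME"}},
)

dest = events.create_api_destination(
    Name="iics-job-trigger",
    ConnectionArn=conn["ConnectionArn"],
    InvocationEndpoint="https://example.informaticacloud.com/jobs/start",
    HttpMethod="POST",
)

events.put_rule(
    Name="s3-object-created-to-iics",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["my-upload-bucket"]}},
    }),
    State="ENABLED",
)

# The target role must allow events:InvokeApiDestination on the destination.
events.put_targets(
    Rule="s3-object-created-to-iics",
    Targets=[{"Id": "iics-destination",
              "Arn": dest["ApiDestinationArn"],
              "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-api-destination"}],
)
```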

looking for a better way to visualize data lake pipeline on AWS

I am building a data lake pipeline on AWS which includes many AWS services like S3, CloudWatch, Lambda, Glue crawler, Glue job, etc. The pipeline flow works like this:
- CloudWatch schedules a cron job that triggers a Lambda to fetch external data and save it in an S3 bucket.
- A Lambda is triggered whenever a file is uploaded to the S3 bucket, and it triggers a Glue crawler.
- CloudWatch listens for the Glue crawler state change and triggers a Lambda, which calls a Glue job to do the data ETL.
It works fine, but I find it hard to monitor the whole process. The only thing I have is the logs saved in CloudWatch and some notifications/alerts. Is there a better way to monitor this pipeline, such as viewing it as a workflow diagram to see each execution?
You can try AWS X-Ray. AWS X-Ray helps developers analyze and debug production, distributed applications, such as those built using a microservices architecture. It traces user requests as they travel through your entire application. It aggregates the data generated by the individual services and resources that make up your application, providing you an end-to-end view of how your application is performing. Check here for more details.
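As a small sketch, instrumenting one of the pipeline's Lambdas might look like this, assuming active tracing is enabled on the function and the aws-xray-sdk package is bundled with it; the crawler and subsegment names are placeholders:

```python
from aws_xray_sdk.core import xray_recorder, patch_all
import boto3

patch_all()  # trace boto3 (and other supported library) calls automatically

glue = boto3.client("glue")


def handler(event, context):
    # Custom subsegments show up as separate nodes in the X-Ray trace / service map.
    with xray_recorder.in_subsegment("start-glue-crawler"):
        glue.start_crawler(Name="my-crawler")
```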

Running AWS Glue jobs in parallel

I have 30 Glue jobs that I want to run in parallel. If one job fails, the others must continue. I started with Step Functions, creating a state machine that executes a runner Lambda function, which in turn triggers a Glue job depending on a parameter (the name of the Glue job). For one job there is a decent amount of Step Functions logic implemented (retry, error handling, etc.).
Is there any way to execute a state machine from another state machine? That way I could have 30 parallel tasks, each executing another state machine. If you have any suggestions, please feel free to share.
AWS recommends using SNS for a fan-out architecture to run parallel jobs from a single S3 event, as you get an overlapping-configuration error if you try to point two Lambdas at the same S3 event.
You basically send the S3 event to SNS and subscribe your 30 Lambdas so they all trigger from the SNS notification (containing the details of the S3 event) when it's published.
Create the Topic
Update the Topic Policy to allow Event Notifications from an S3 Bucket
Configure the S3 Bucket to send Event Notifications to the SNS Topic
Create the parallel Lambda functions, one for each job
Modify the Lambda functions to process SNS messages of S3 event notifications instead of the S3 event itself
https://aws.amazon.com/blogs/compute/fanout-s3-event-notifications-to-multiple-endpoints/
There is also another nice example, with a CloudFormation template: https://aws.amazon.com/blogs/compute/messaging-fanout-pattern-for-serverless-architectures-using-amazon-sns/
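As a rough illustration of the last step in the list above, a handler for one of the parallel Lambdas could unwrap the S3 event from the SNS message body like this (sketch only):

```python
import json

def handler(event, context):
    # SNS delivers records whose Message field is the original S3 event as a JSON string.
    for record in event["Records"]:
        s3_event = json.loads(record["Sns"]["Message"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            # ... run this function's job against s3://bucket/key ...
            print(f"processing s3://{bucket}/{key}")
```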