AWS Appflow | Trigger lambda - amazon-web-services

I am pulling data from Salesforce using AWS Appflow and it is dumping data into S3. It is creating many files in S3. How can I run lambda only once per Appflow run. In my scenario, lambda is triggering n number of times, where n is the number of files created in S3 per run.
Note : 1. I can't create 1 file per run.
2. I am open for any other approach

As you have made it clear that your using S3 event for the trigger.
That addresses the issue of why your lambda function is being triggered every time there is a PUT event on S3.
Unfortunately there is no such trigger that would help you in this use case.
If you still want to use Lambda you can have a cloud watch event as a trigger for lambda function like a corn job.
You can schedule it for any interval which you feel would be appropriate for the function.
Hope this was helpfull

Appflow has the option of aggregating the files to a single file before loading it to S3. You can see that option in additional settings just before the scheduling.

I wouldn't mind my Lambda function get invoked multiple times, as long as it is idempotent.
Normally one AppFlow output file in the S3 bucket should only trigger your Lambda function once if the Lambda function can finish the execution within the timeout and return no error. Failing to do so may trigger a retry mechanism.
As #krr7931 suggested, you can try the aggregation option when setting up your AppFlow flow. AppFlow also supports filters, which can help reduce the amount of data you need to handle.
PS: here's my reflection on using AppFlow with Salesforce https://www.capturedlabs.com/amazon-app-flow-with-salesforce-in-action

Related

Is it possible to achieve parallel processing in AWS Lambda

I am having a python code in AWS Lambda which is triggered based on sqs event generated.
The criteria for generating sqs is if a new file comes into a particular S3 location, then sqs will be created and which in turn calls lambda.
Right now, lambda is processing the files one after the other in a serial mode. But I would like to know if we can process multiple files at the same time.
Example: If 5 files comes to s3 location, all the 5 files should be processed parallely at the same time.
I think you might miss observed the behavior of your system. If you using the Native SQS Standar Queue with Lambda integration, the Lambdas will consume the queus in batches, you can see a detailed explanation here:
https://aws.amazon.com/blogs/compute/understanding-how-aws-lambda-scales-when-subscribed-to-amazon-sqs-queues/
No need to add the SQS.
Enable triggering from the S3 PutObject action to the Lambda. With this, you can ensure invocations per object and also parallelism.
Also, check your Reserve concurrency value
Doc:https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
Check the concurrency dashboard in the monitoring sub section, it is already running in parallel mode.
File may look
Hope this helps!!!

AWS lambda - best practice when reading from long list/s3

I have a scheduled error handling lambda, I would like to use Serverless technology here as opposed to a spring boot service or something.
The lambda will read from an s3 bucket and process accordingly. The problem is at times the s3 bucket may have high volume of data to be processed. long running operations aren't suited to lambdas.
One solution I can think of is have the lambda read and process one item from the bucket and on success trigger another instance of the same lambda unless the bucket is empty/fully-processed. The thing i don't like is that this is synchronous and quite slow. I also need to be conscious of running too many lambdas at the same time as we are hitting a REST endpoint as part of the error flow and don't want to overload it with too many requests.
I am thinking it would be nice to have maybe 3 instances of the lambdas running at the same time until the bucket is empty but not really sure, I am wondering if anyone has any nice patterns that could be used here or suggestions on best practices?
Thanks
Create a S3 bucket for processing your files.
Enable a trigger S3 -> Lambda, on every new file in the bucket lambda will be invoked to process the file, every file is processed separately. https://docs.aws.amazon.com/AmazonS3/latest/user-guide/enable-event-notifications.html
Once the file is processed you could either delete or move file to other place.
About concurrency please have a look at provisioned concurrency https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
Update:
As you still plan to use a scheduler lambda and S3
Lambda reads/lists only the filenames and puts messages into SQS to process the file.
A new Lambda to consume SQS messages and process the file.
Note: I would recommend using SQS initially if the files/messages are not so big, it has built it recovery mechanics, DLQ , delays, visibility etc which you could benefit more than the simple S3 storage, second way is just create message with file reference and still use SQS.
I'd separate the lambda that is called by the scheduler from the lambda that is doing the actual processing. When the scheduler calls the first lambda, it can look at the contents of the bucket, then spawn the worker lambdas to process the objects. This way you have control over how many objects you want per worker.
Given your requirements, I would recommend:
Configure an Amazon S3 Event so that a message is pushed to an Amazon SQS queue when the objects are created in the S3 bucket
Schedule an AWS Lambda function at regular intervals that will:
Check that the external service is working
Invoke a Lambda function to process one message from the queue, and keep looping
The hard part would be throttling the second Lambda function so that it doesn't try to send all request at once (which might impact that external service).
You could probably do this by using a Step Function to trigger Lambda and then, if it was successful, trigger another Lambda function. This could even be done in parallel, such as allowing up to three parallel Lambda executions. The benefit of using Step Functions is that there is no cost for "waiting" for each Lambda to finish executing.
So, the Step Function flow would be something like:
Invoke a "check external service" Lambda function
If it fails, then quit the flow
Invoke the "processing" Lambda function
Get one message
Process the message
If successful, remove the message from the queue
Return success/fail
If it was successful, keep looping until the queue is empty

Is there a way to add delay to trigger a lambda from S3 upload?

I have a Lambda function which is triggered after put/post event of S3 bucket. This works fine if there is only one file uploaded to S3 bucket.
However, at times there could be multiple files uploaded which can take upto 7 minutes to complete the upload process. This triggers my lambda function multiple times which adds overhead of handling this from the code.
Is there any way to either trigger the lambda only once for the complete upload or add delay in the function and avoid multiple execution of Lambda function?
There is no specific interval when the files will be uploaded to S3 hence could not use scheduler.
Delay feature was added for Lambda that has Kinesis or DynamoDB Event Sources recently. But it's not supported for S3 events.
You can send events from S3 to SQS. Then your Lambda will consume SQS events. It consumes them in batch by default.
It seems Multi Part Upload is being used here from the client.
Maybe a duplicate of this? - AWS Lambda and Multipart Upload to/from S3
An alternative might be to have your Lambda function check for existence of all required files before moving on to the action you need to take. The Lambda function would still fire each time, but would exit quickly if not all files have been received yet.

Periodic Read from AWS S3 and publish to SQS

I have an S3 bucket with different files. I need to read those files and publish SQS msg for each row in the file.
I cannot use S3 events as the files need to be processed with a delay - put to SQS after a month.
I can write a scheduler to do this task, read and publish. But can I was AWS for this purpose?
AWS Batch or AWS data pipeline or Lambda.?
I need to pass the date(filename) of the data to be read and published.
Edit : The data volume to be dealt is huge
I can think of two ways to do this entirely using AWS serverless offerings without even having to write a scheduler.
You could use S3 events to start a Step Function that waits for a month before reading the S3 file and sending messages through SQS.
With a little more work, you could use S3 events to trigger a Lambda function which writes the messages to DynamoDB with a TTL of one month in the future. When the TTL expires, you can have another Lambda that listens to the DynamoDB streams, and when there’s a delete event, it publishes the message to SQS. (A good introduction to this general strategy can be found here.)
While the second strategy might require more effort, you might find it less expensive than using Step Functions depending on the overall message throughput and whether or not the S3 uploads occur in bursts or in a smooth distribution.
At the core, you need to do two things:
Enumerate all of the object in a bucket in S3, and perform some action on any object uploaded more than a month ago.
Can you use Lambda or Batch to do this? Sure. A Lambda could be set to trigger once a day, enumerate the files, and post the results to SQS.
Should you? No clue. A lot depends on your scale, and what you plan to do if it takes a long time to perform this work. If your S3 bucket has hundreds of objects, it won't be a problem. If it has billions, your Lambda will need to be able to handle being interrupted, and continuing paging through files from a previous run.
Alternatively, you could use S3 events to trigger a simple Lambda that adds a row to a database. Then, again, some Lambda could run on a cron job that asks the database for old rows, and publishes that set to SQS for others to consume. That's slightly cleaner, maybe, and can handle scaling up to pretty big bucket sizes.
Or, you could do the paging through files, deciding what to do, and processing old files all on a t2.micro if you just need to do some simple work to a few dozen files every day.
It all depends on your workload and needs.

Amazon S3 daily data processing

There is a data being stored on a s3 bucket in a daily basis, we are trying to automate parsing and processing that daily data being sent to s3 bucket, we already have the script that will parse the data, we just need to have the approach on the AWS how to automate this,the approach/use-case we thought was AWS batch that is scheduled to do the script on a daily basis or will get the latest data on that day before EOD, but seems like batch is incapable of doing it.
any ideas and approach? we've seen some approach like using Lambda and SQS/SNS
just to summarize:
data (Daily) > stored in S3 > data will be process by our team > stored to elastic search.
Thanks your ideas.
AWS Lambda is exactly what you want in this case. You can trigger lambda executing on S3 file showing up, that will process the file, and can then send it to ElasticSearch or wherever you want it to end up.
Here's an official explanation from AWS: https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
You can use Lambda + cloud watch events to execute your code on a regular schedule. You can specify a fixed rate ( or you can specify a Cron expression ), for example, in your case, you can execute your lambda every 24 hours, this way, your logic for data processing will run once daily.
Take a look at this article from AWS : Schedule AWS Lambda Functions Using CloudWatch Events