A batch job needs to process over 4,000 files stored in an S3 bucket. The files are spread across 36 different prefixes, and each prefix internally has 4 sub-prefixes.
Essentially there are 36 root folders, each with 4 subfolders, recreated in the S3 bucket using prefixes. The location of each file is
stored in a database, which a Lambda function will read.
I am planning to use Lambda here and want to process the root folders (S3 prefixes) in 6 concurrent Lambda invocations, i.e. run the same Lambda function concurrently.
I want to create an initializer Lambda that reads the configuration for this folder structure from RDS and pushes 6 different SQS messages,
each containing the names of 6 root folders (S3 prefixes). I will create another Lambda, the processing Lambda, which reads an SQS message and processes its 6 folders sequentially
within a single invocation.
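For concreteness, here is a minimal sketch of the initializer Lambda. The queue URL is a placeholder, and the prefixes are hard-coded to stand in for the configuration that would really come from RDS:

```
import json
import boto3

sqs = boto3.client("sqs")

# Placeholder queue URL; in practice this would come from configuration.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/folder-batches"

def handler(event, context):
    # In the real job the 36 root prefixes would be read from RDS;
    # they are hard-coded here to keep the sketch self-contained.
    root_prefixes = [f"root-{i:02d}/" for i in range(36)]

    # Fan out: 6 messages, each carrying 6 root prefixes, so that up to
    # 6 copies of the processing Lambda can run concurrently.
    for i in range(0, len(root_prefixes), 6):
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"prefixes": root_prefixes[i:i + 6]}),
        )
```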
Below are my questions:
1. With 6 SQS messages, will the processing Lambda execute concurrently (one invocation per message)?
2. Assuming an average file size of around 50 KB, the Lambda will have to read each of these files into memory. Will this be an issue for processing in Lambda?
3. The SLA for the job is around 10 minutes, and the batch job is triggered every 20 minutes from 8am to 8pm. Is Lambda the right option, or should this be done with ECS?
You might have a look at the distributed map offered by Step Functions. I think it fits your need perfectly.
https://aws.amazon.com/blogs/aws/step-functions-distributed-map-a-serverless-solution-for-large-scale-parallel-data-processing/
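For reference, here is a rough sketch of what a distributed-map state machine could look like, expressed as a Python/boto3 call. The bucket, prefix, role ARN, and worker-Lambda ARN are all placeholders:

```
import json
import boto3

sfn = boto3.client("stepfunctions")

# Everything below that looks like an ARN, bucket, or prefix is a placeholder.
definition = {
    "StartAt": "ProcessFiles",
    "States": {
        "ProcessFiles": {
            "Type": "Map",
            # Read the work items straight from S3 instead of passing them in the input.
            "ItemReader": {
                "Resource": "arn:aws:states:::s3:listObjectsV2",
                "Parameters": {"Bucket": "my-batch-bucket", "Prefix": "root-00/"},
            },
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
                "StartAt": "HandleFile",
                "States": {
                    "HandleFile": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-file",
                        "End": True,
                    }
                },
            },
            "MaxConcurrency": 100,  # how many files are processed in parallel
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="batch-file-processor",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-distributed-map-role",
)
```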
Related
I have set up a Glue job that runs concurrently to process input files and writes the results to S3. The Glue job runs periodically (it is not a one-time job).
The output in S3 is in the form of CSV files. The requirement is to copy all of those records into AWS SQS, assuming there might be hundreds of files, each containing up to a million records.
Initially I was planning to have a Lambda send the records row by row; however, from the docs I see a 15-minute time limit for Lambda: https://aws.amazon.com/about-aws/whats-new/2018/10/aws-lambda-supports-functions-that-can-run-up-to-15-minutes/#:~:text=You%20can%20now%20configure%20your,Lambda%20function%20was%205%20minutes.
Would it be better to use AWS Batch for copying the records from S3 to SQS? I believe AWS Batch has the capability to scale the process when needed and also to perform the task in parallel.
I want to know whether AWS Batch is the right pick, or whether I am overcomplicating the design.
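Whichever compute service runs this, most of the time goes into the SQS calls themselves, and batching 10 records per SendMessageBatch call cuts the API round trips by 10x. A rough sketch of that batching with a placeholder queue URL (for multi-million-row files you would stream the body rather than read it all at once):

```
import csv
import io
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Placeholder queue URL.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/records"

def copy_csv_to_sqs(bucket, key):
    # Simplest approach: read the whole object into memory; for very large
    # files you would stream the body instead.
    text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    entries = []
    for row in csv.reader(io.StringIO(text)):
        entries.append({"Id": str(len(entries)), "MessageBody": json.dumps(row)})
        if len(entries) == 10:  # SQS allows at most 10 messages per batch call
            sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
            entries = []
    if entries:
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
```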
I've got a bucket that will receive a random number of files within an unknown timeframe.
This could be anything from 1 file in 5 hours to 1,000 files within 1 minute...
I want to invoke a Lambda function when the bucket has new files, but I don't really care about the content of the S3 event the Lambda gets passed. Is there something that will let me call the Lambda a single time if there are new files within the last 10 minutes, without setting up something cron-like that runs every 10 minutes and checks for new files? I really only want to execute it a single time, and only if there are new files.
You can create a CloudWatch Alarm that monitors the Amazon S3 request metrics and fires whenever the number of HTTP PUT requests made for objects in the bucket is greater than zero within a period of ten minutes.
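A rough sketch of such an alarm with boto3, assuming request metrics are already enabled on the bucket (the FilterId below and the SNS topic that would fan out to your Lambda are placeholders):

```
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="new-objects-in-last-10-minutes",
    Namespace="AWS/S3",
    MetricName="PutRequests",
    Dimensions=[
        {"Name": "BucketName", "Value": "my-bucket"},        # placeholder bucket
        {"Name": "FilterId", "Value": "EntireBucket"},       # metrics config on the bucket
    ],
    Statistic="Sum",
    Period=600,                       # ten-minute window
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # no PUTs -> no alarm, no invocation
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:new-files"],  # placeholder topic
)
```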
I have an S3 bucket website with multiple HTML files, and I have created a Lambda function connected to an S3 trigger. I want the function first to index all of the current HTML files in the S3 bucket into my Elasticsearch domain indices; after that, whenever I upload or delete an HTML file in the bucket, the Lambda function should update the index in the ES domain.
The issue is that I am not able to index all of the current HTML files, and when I upload a new HTML file I am not able to index it into the ES domain indices either.
I want the Lambda function to index everything first and then index files one by one. I also want to create a test event in the AWS Lambda console (an indexAll event for S3) to index all the existing files first.
The error is a timeout after 3 seconds.
Your Lambda function timeout is too low (its default is 3 seconds).
Also, it's not a good idea to try to analyze/index all S3 objects within a single invocation of a Lambda function unless you can constrain the number of objects. Lambda has a maximum timeout of 15 minutes.
One option to deal with existing files, as an alternative to EC2, would be to create a list of existing objects in the bucket (you could just list the bucket if it's reasonably sized, like 10k items or fewer, or you could use an S3 Inventory Report if it's a very large bucket). Either way, get a list of objects and then send them to an SQS queue, one by one. Have SQS trigger your Lambda function one object per invocation or a batch of 10 objects per invocation.
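A minimal sketch of that seeding step, with a placeholder queue URL:

```
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Placeholder queue URL; the queue is configured as the Lambda's event source.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/objects-to-index"

def enqueue_existing_objects(bucket):
    # Paginate through the bucket and send one message per object key.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=obj["Key"])
```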
I have a use case: a list of PDF files stored in an S3 bucket. I have listed them and pushed them to SQS for text extraction, and created one Lambda to process those files, providing it the bucket information and the AWS Textract details.
The issue is that the Lambda times out, because SQS triggers multiple Lambda instances for all the files at once, and all of them end up waiting on the Textract service.
I want the Lambda to be triggered one by one for the SQS messages (file names) so that timeouts do not occur, since we have a rate limit for AWS Textract.
Processing 100+ files is a time-consuming task, so I would suggest taking no more than 10 files per Lambda execution.
Use SQS with Lambda as an event source.
https://dzone.com/articles/amazon-sqs-as-an-event-source-to-aws-lambda-a-deep
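A sketch of wiring that up with boto3, with placeholder ARNs. BatchSize caps each invocation at 10 files, and the event source mapping's maximum concurrency (a setting available for SQS sources) keeps the number of parallel pollers low so the Textract rate limit isn't overwhelmed:

```
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:pdf-files",  # placeholder queue
    FunctionName="extract-text",                                    # placeholder function
    BatchSize=10,                            # at most 10 files per invocation
    ScalingConfig={"MaximumConcurrency": 2}, # limit parallel invocations (minimum is 2)
)
```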
I have one Lambda that is executed on an S3 put-item trigger.
So whenever any object is uploaded to S3, the Lambda is triggered.
Say someone uploads 5 files to S3; the Lambda will then execute once for each of the 5 files.
Is there any way for the Lambda to trigger only once for all 5 files?
And after the 5 triggers/executions complete, can I trace how many minutes the Lambda has not been executing because no files were uploaded?
Any help would be really appreciated.
If you have the bucket notification configured for object creation (s3:ObjectCreated) and you either haven't specified a filter or the filter matches the uploaded objects, your Lambda will be triggered once per uploaded object.
To see the number of invocations, you can look at the Invocations metric for your Lambda function in CloudWatch metrics.
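For example, here is a sketch that pulls the Invocations sum in five-minute buckets for a placeholder function name; intervals with no datapoint are windows in which the function did not run:

```
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Invocations",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],  # placeholder
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,            # five-minute buckets
    Statistics=["Sum"],
)

# Print the invocation counts in time order; gaps mean no invocations,
# i.e. no files were uploaded during that window.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))
```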
You may want to add a queue that handles the requests to process new files on S3.
A good fit here would be a Kinesis data stream or SQS; if batching is important to you, Kinesis will probably be better.
The requests can be sent by a Lambda triggered by S3 as you described, but that Lambda would only forward the request to the queue, and another Lambda would then process it. A simpler way would be to send the request from the same code that puts the object into S3 (if possible); see the sketch after this answer.
This way you can have statistics on how many requests were sent, processed, are waiting, and so on.
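A minimal sketch of that simpler variant, where the uploader enqueues the processing request itself (the queue URL is a placeholder):

```
import json

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Placeholder queue URL.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/process-requests"

def upload_and_enqueue(bucket, key, data):
    # Upload the object, then queue a processing request in the same code path,
    # instead of relying on an S3 trigger.
    s3.put_object(Bucket=bucket, Key=key, Body=data)
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"bucket": bucket, "key": key}),
    )
```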