I have this flow on AWS:
Put file on S3 -> trigger -> Lambda function that inserts an item into DynamoDB -> see that I actually got the new item in DynamoDB.
While I'm uploading a few files (about 5-10) to S3, which triggers the Lambda, it takes time to see the expected results in my DynamoDB table.
It seems like there is a queue being handled behind the scenes of the S3 trigger, because when I upload a few more files, those that weren't seen before now appear as items in DynamoDB.
My expected result is to see a new item in DynamoDB for each file uploaded to S3, within a second of the upload.
Is there a way to handle this issue with any configuration?
I think the above scenario is related to "Concurrent Execution" in Lambda as you are trying to upload 5-10 files.
Every Lambda function is allocated with a fixed amount of specific resources regardless of the memory allocation, and each function is allocated with a fixed amount of code storage per function and per account.
AWS Lambda account limits per Region - Concurrent Executions: 100 (default limit).
For event-based sources (e.g. S3), refer to the "Concurrent Executions" section of the Lambda limits documentation.
You can use the following formula to estimate your concurrent Lambda function invocations:
events (or requests) per second * function duration
For example, consider a Lambda function that processes Amazon S3 events. Suppose that the Lambda function takes on average three seconds and Amazon S3 publishes 10 events per second. Then, you will have 30 concurrent executions of your Lambda function.
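As a trivial sketch of that estimate in Python (the numbers below are the ones from the S3 example above):

    def estimated_concurrency(events_per_second, avg_duration_seconds):
        # events (or requests) per second * function duration
        return events_per_second * avg_duration_seconds

    print(estimated_concurrency(10, 3))  # 30 concurrent executions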
To increase the limit:
Refer to the "To request a limit increase for concurrent executions" section of the documentation linked above.
AWS may automatically raise the concurrent execution limit on your behalf to enable your function to match the incoming event rate, as in the case of triggering the function from an Amazon S3 bucket.
Related
The use case is that thousands of very small files are uploaded to S3 every minute, and all the incoming objects have to be processed and stored in a separate bucket using Lambda.
But using s3-object-create as a trigger will cause many Lambda invocations, and concurrency needs to be taken care of. I am trying to batch process the newly created objects every 5-10 minutes. S3 provides batch operations, but its reports are generated daily/weekly. Is there a service available that can help me?
According to the AWS documentation, S3 can publish "new object created" events to the following destinations:
Amazon SNS
Amazon SQS
AWS Lambda
In your case I would:
Create an SQS queue.
Configure the S3 bucket to publish new-object events to the SQS queue.
Reconfigure your existing Lambda to subscribe to the SQS queue.
Configure batching for the incoming SQS events.
Currently, the maximum batch size for the SQS-Lambda subscription is 1000 events. But since your Lambda needs around 2 seconds to process a single event, you should start with something smaller; otherwise the Lambda will time out because it won't be able to process all of the events in the batch.
Thanks to this, uploading X items to S3 will produce roughly X / Y Lambda invocations, where Y is the maximum batch size configured for SQS. For 1000 S3 items and a batch size of 100, it will only invoke around 10 concurrent Lambda executions.
The AWS documentation mentioned above explains how to publish S3 events to SQS, so I won't cover it here, as it's mostly implementation details.
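For orientation, here is a minimal sketch of what a Lambda handler consuming such SQS-batched S3 events might look like in Python. The nesting (S3 notification JSON inside each SQS record body) is the important part; process_object is a hypothetical placeholder for your own per-object work.

    import json

    def handler(event, context):
        # Each SQS record wraps one S3 event notification in its body
        for sqs_record in event["Records"]:
            body = json.loads(sqs_record["body"])
            # S3 "test event" messages have no "Records" key, so guard against them
            for s3_record in body.get("Records", []):
                bucket = s3_record["s3"]["bucket"]["name"]
                key = s3_record["s3"]["object"]["key"]
                process_object(bucket, key)  # hypothetical per-object work

    def process_object(bucket, key):
        # Placeholder: copy / transform / store the object as needed
        print(f"processing s3://{bucket}/{key}")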
Execution time
However, you might run into a problem where the processing is too slow, because the Lambda will probably process the events one by one in a loop.
The workaround is to use asynchronous processing; the implementation depends on which runtime you use for Lambda. For Node.js it would be very easy to achieve.
Also, if you want to speed up the processing in other ways, simply reduce the maximum batch size and increase the Lambda memory configuration, so a single execution processes a smaller number of events and has access to more CPU.
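The same idea works in other runtimes too. As a rough Python sketch (assuming the per-object work is I/O-bound, e.g. calls to S3 or DynamoDB), a thread pool lets one invocation work through its batch concurrently instead of one by one; process_object is again a hypothetical placeholder:

    import json
    from concurrent.futures import ThreadPoolExecutor

    def process_object(bucket, key):
        # I/O-bound work for a single object (hypothetical)
        print(f"processing s3://{bucket}/{key}")

    def handler(event, context):
        # Collect (bucket, key) pairs from the whole SQS batch first
        jobs = []
        for sqs_record in event["Records"]:
            body = json.loads(sqs_record["body"])
            for s3_record in body.get("Records", []):
                jobs.append((s3_record["s3"]["bucket"]["name"],
                             s3_record["s3"]["object"]["key"]))
        # Then process them concurrently instead of in a sequential loop
        with ThreadPoolExecutor(max_workers=10) as pool:
            list(pool.map(lambda job: process_object(*job), jobs))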
I have a process that publishes data to AWS IoT Core, which triggers a Lambda function that inserts the payload into an Amazon S3 bucket.
The process sends around 1.2 million records within a few seconds, and when I check the bucket I see I have lost around 10% of the data. If I add a sleep to the Lambda function, it runs beyond 15 minutes.
What is the solution for this scenario?
It appears that your requirement is to capture the events coming into IoT-Core and save them to Amazon S3.
It also sounds like your Lambda functions are being throttled because they are hitting the concurrency limit, and data is being lost. By default, there is a limit of 1,000 concurrent AWS Lambda executions per Region. This could potentially be fixed by requesting an increase in the maximum number of concurrent executions.
Here is a diagram from How AWS IoT works:
As shown in the diagram, the Rules engine can actually be used to send data to Amazon S3 without requiring Lambda. However, this creates a separate object in Amazon S3 for every message.
If you wish to combine messages together, you can Write to Kinesis Data Firehose Using AWS IoT. Firehose will buffer the data by time or size and then write multiple messages to a single Amazon S3 object. This can be a good way to handle large volumes of data, and it also makes the resulting objects in S3 easier to work with because fewer objects are created. This makes them faster to query and process later (e.g. with Amazon Athena).
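As a rough sketch of that approach, an IoT topic rule with a Firehose action can be created with boto3 along these lines. The rule name, topic filter, IAM role, and delivery stream name below are all hypothetical, and the delivery stream (configured separately to buffer into your S3 bucket) is assumed to already exist:

    import boto3

    iot = boto3.client("iot")

    # Route matching MQTT messages straight into a Kinesis Data Firehose delivery stream
    iot.create_topic_rule(
        ruleName="SendTelemetryToFirehose",                # hypothetical name
        topicRulePayload={
            "sql": "SELECT * FROM 'devices/+/telemetry'",  # hypothetical topic filter
            "actions": [
                {
                    "firehose": {
                        "roleArn": "arn:aws:iam::123456789012:role/iot-firehose-role",
                        "deliveryStreamName": "telemetry-to-s3",
                        "separator": "\n",  # newline-delimit records inside each S3 object
                    }
                }
            ],
        },
    )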
Going from an IoT-Core rule directly to a Lambda can be fragile.
You can use Kinesis to buffer the data, or Firehose to stream it directly to S3. These are standard patterns that AWS recommends for IoT in the AWS Well-Architected framework (https://d1.awsstatic.com/whitepapers/architecture/AWS-IoT-Lens.pdf).
There is a service that generates data in an S3 bucket that is used for warehouse querying. Data is inserted into S3 by a daily mechanism.
I am interested in copying that data from S3 into my service account to further classify it. The classification needs to happen in my AWS account because it is based on information present in my account and is specific to my team/service. The service generating the data in S3 is neither concerned with the classification nor does it have the data needed to make the classification decision.
Each S3 file consists of JSON objects (records). For every record, I need to look it up in a DynamoDB table. Based on whether the data exists in the DynamoDB table, I need to add an additional attribute to the JSON object and store the resulting list in another S3 bucket in my account.
The way I am considering doing this:
Trigger a scheduled CloudWatch event periodically to invoke a Lambda that copies the files from the source S3 bucket into a bucket (let's say Bucket A) in my account.
Then, use another scheduled CloudWatch event to invoke a Lambda that reads the records in the JSON, compares them against the DynamoDB table to determine the classification, and writes the updated records to another bucket (let's say Bucket B).
I have a few questions regarding this:
Are there better alternatives for achieving this?
Would using aws s3 sync in the first Lambda be a good way to achieve this? My concerns revolve around the Lambdas timing out due to the large amount of data, especially for the second Lambda that needs to compare against DynamoDB for every record.
Rather than setting up scheduled events, you can trigger the AWS Lambda functions in real-time.
Use Amazon S3 Events to trigger the Lambda function as soon as a file is created in the source bucket. The Lambda function can call CopyObject() to copy the object to Bucket-A for processing.
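A minimal sketch of that first function in Python with boto3 might look like this; the destination bucket name is hypothetical, and the URL-decoding of the key is the one detail that is easy to forget:

    import urllib.parse
    import boto3

    s3 = boto3.client("s3")
    DESTINATION_BUCKET = "bucket-a"  # hypothetical name

    def handler(event, context):
        for record in event["Records"]:
            source_bucket = record["s3"]["bucket"]["name"]
            # Object keys in S3 event notifications are URL-encoded
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            s3.copy_object(
                Bucket=DESTINATION_BUCKET,
                Key=key,
                CopySource={"Bucket": source_bucket, "Key": key},
            )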
Similarly, an Event on Bucket-A could then trigger another Lambda function to process the file. Some things to note:
Lambda functions run for a maximum of 15 minutes
You can increase the memory assigned to a Lambda function, which also increases the amount of CPU assigned. So this might speed up the function if it is taking longer than 15 minutes.
There is a maximum of 512MB of storage space made available for a Lambda function.
If the data is too big, or takes too long to process, then you will need to find a way to do it outside of AWS Lambda. For example, using Amazon EC2 instances.
If you can export the data from DynamoDB (perhaps on a regular basis), you might be able to use Amazon Athena to do all the processing, but that depends on what you're trying to do. If it is simple SELECT/JOIN queries, it might be suitable.
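If the volumes do stay within Lambda's limits, a rough Python sketch of the second function (reading records from a file in Bucket-A, checking DynamoDB, and writing the enriched list to Bucket-B) could look like the following. The table, bucket, key attribute, and added attribute names are all hypothetical, and each file is assumed to be a JSON array of records:

    import json
    import boto3

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("classification-table")  # hypothetical
    OUTPUT_BUCKET = "bucket-b"  # hypothetical

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            items = json.loads(body)  # assumed to be a JSON array of records
            for item in items:
                # Look up each record; "id" is a hypothetical partition key
                resp = table.get_item(Key={"id": item["id"]})
                item["classified"] = "Item" in resp  # add attribute based on the lookup
            s3.put_object(Bucket=OUTPUT_BUCKET, Key=key,
                          Body=json.dumps(items).encode("utf-8"))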
I set up my AWS workflow so that my Lambda function is triggered when a text file is added to my S3 bucket, and generally it works fine: when I upload a bunch of text files to the S3 bucket, a bunch of Lambdas run at the same time and process each text file.
But my issue is that occasionally 1 or 2 files (out of 20k or so in total) do not trigger the Lambda function as expected. I have no idea why. When I checked the logs, it's NOT that the files were processed by the Lambda but failed; the logs show that the Lambda was not triggered by those 1 or 2 files at all. I don't believe I'm reaching the 1000 concurrent Lambda limit either, since my function runs fast and the peak is around 200 Lambdas.
My question is: is this because AWS Lambda does not guarantee it will be triggered 100% of the time? Like S3, is there always a (albeit tiny) possibility of failure? If not, how can I debug and fix this issue?
You don't mention how long the Lambdas take to execute. The default limit of concurrent executions is 1000. If you are uploading files faster than they can be processed with 1000 Lambdas then you'll want to reach out to AWS support and get your limit increased.
Also from the docs:
Amazon S3 event notifications typically deliver events in seconds but can sometimes take a minute or longer. On very rare occasions, events might be lost.
If your application requires particular semantics (for example, ensuring that no events are missed, or that operations run only once), we recommend that you account for missed and duplicate events when designing your application. You can audit for missed events by using the LIST Objects API or Amazon S3 Inventory reports. The LIST Objects API and Amazon S3 inventory reports are subject to eventual consistency and might not reflect recently added or deleted objects.
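As the documentation suggests, you can audit for missed events by comparing what is actually in the bucket against your own record of what was processed. A minimal sketch with boto3; was_processed is a hypothetical callback into whatever bookkeeping you keep (a DynamoDB table, the destination bucket, etc.):

    import boto3

    s3 = boto3.client("s3")

    def find_unprocessed(bucket, prefix, was_processed):
        # List every object under the prefix and return the keys that were never processed
        missed = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                if not was_processed(obj["Key"]):  # hypothetical lookup
                    missed.append(obj["Key"])
        return missed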
I have one Lambda that is executed on an S3 put-object trigger.
Now the Lambda is triggered for any object uploaded to S3.
Let's say someone uploads 5 files to S3; the Lambda will then be executed once for each of the 5 files.
Is there any way for the Lambda to be triggered only once for all those 5 files?
Also, after those 5 triggers/Lambda executions complete, can I trace how many minutes the Lambda has not been executing because no files were uploaded?
Any help would be really appreciated.
If you have the bucket notification configured for object creation (s3:ObjectCreated) and you haven't specified a filter, or the filter matches the uploaded objects, your Lambda will be triggered once per uploaded object.
To see the number of invocations, you can look at the Invocations metric for your Lambda function in CloudWatch metrics.
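For example, a quick way to pull that metric with boto3 (the function name is hypothetical):

    from datetime import datetime, timedelta
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Sum of invocations over the last hour, in 5-minute buckets
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="Invocations",
        Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],  # hypothetical name
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Sum"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Sum"])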
You may want to add a queue that will handle the requests to process new files on S3.
A relevant one could be a Kinesis data stream or SQS. If batching is important to you, Kinesis will probably be better.
The requests can be sent by a Lambda triggered by S3 as you described, but that Lambda would only send the request to the queue, and another Lambda would then process it. A simpler way would be to send the request from the same code that puts the object into S3 (if possible).
This way you can have statistics of how many requests were sent, processed, waiting, etc.
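For the simpler variant, here is a sketch of enqueueing the processing request from the same code that uploads the object; the queue URL is hypothetical:

    import json
    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-files"  # hypothetical

    def upload_and_enqueue(bucket, key, body):
        # Put the object and enqueue a processing request in the same code path,
        # so every upload is also recorded as a pending request
        s3.put_object(Bucket=bucket, Key=key, Body=body)
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )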