Using S3 instead of SQS for integration purposes

folks.
I recently ran into a question that left me with some concerns and hesitations. I'm creating an "almost serverless" microservice using AWS. Here is its workflow: [Figure: Workflows options]
The thing is, the input message may be large, but AWS SQS limits message size to 256 KB. So I decided to use S3 and S3 notifications to handle inputs: the client PUTs an object, its creation triggers Lambda functions, and so on. That way the 256 KB limit is no longer relevant, but on the other hand I'm using a storage service as an integration service. One of my concerns is dead-letter queue handling, for example.
Maybe someone has faced similar problems. One requirement is to keep things "serverless". Are there any good solutions, improvements, or advice?
Thanks in advance.

I would recommend combining the two approaches:
Write the data to Amazon S3
Create a message in the Amazon SQS queue that includes a reference to the data in S3
This way, you have the benefits of using a queue, with additional storage.
If all the data you require is already in the file, then you can configure an Amazon S3 Event to create the SQS message directly in the queue. The message will include the name of the bucket and the key of the object. Thus, putting the file in S3 will create the SQS message and trigger the AWS Lambda function. This is more scalable than directly triggering the Lambda function from S3.
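As a rough sketch of the consuming side, a Lambda function triggered by the queue could unpack the S3 notification carried in each SQS message body like this. The function names are made up for illustration, and the print call stands in for whatever processing you actually do:

```python
import json
from urllib.parse import unquote_plus


def extract_s3_references(sqs_event):
    """Return (bucket, key) pairs from an SQS-triggered Lambda event.

    Each SQS record body is expected to hold an S3 event notification as JSON.
    """
    references = []
    for sqs_record in sqs_event.get("Records", []):
        notification = json.loads(sqs_record["body"])
        # S3 sends an "s3:TestEvent" message without "Records" when the
        # notification is first configured, so default to an empty list.
        for s3_record in notification.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded (spaces become "+").
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            references.append((bucket, key))
    return references


def handler(event, context):
    # Hypothetical entry point: replace the print with real processing.
    for bucket, key in extract_s3_references(event):
        print(f"would process s3://{bucket}/{key}")
```

Keeping the parsing in its own function makes it easy to test without any AWS access.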

Have you considered using a Kinesis stream, then attaching your Lambda to the stream with a batch size of 1? You could also handle your dead letters with shard timestamps, etc.
Or, if you are able to modify the original message, place your timestamp inside the message; then you can easily use Kinesis Firehose and bulk-load messages. The limit on Kinesis is 2 MB, so it will give you almost 10x the message size of SQS, and you could compress further.

Related

How to save data from a Lambda function into a S3 when we have too much incoming per millisecond?

I have a process that publishes data to IoT-Core, which triggers a Lambda function that inserts the payload into an Amazon S3 bucket.
The process sends around 1.2 million records within a few seconds, and when I check the bucket I see that I have lost around 10% of the data. If I add a sleep to the Lambda function, it runs beyond the 15-minute limit.
What is the solution for this scenario?
It appears that your requirement is to capture the events coming into IoT-Core and save them to Amazon S3.
It also sounds like your Lambda functions are being throttled due to hitting concurrency limits, and data is being lost. By default, there is a limit of 1,000 concurrent AWS Lambda executions per Region. This could potentially be fixed by requesting an increase in the maximum number of concurrent executions.
Here is a diagram from How AWS IoT works:
As shown in the diagram, the Rules engine can actually be used to send data to Amazon S3 without requiring Lambda. However, this creates a separate object in Amazon S3 for every message.
If you wish to combine messages together, you can Write to Kinesis Data Firehose Using AWS IoT. Firehose will buffer the data by time or size, and then output multiple messages to an Amazon S3 object. This could be a good way to handle large volumes of data, and it also makes it easier to work with the resulting objects in S3 because fewer objects are created. This makes them faster to query and process later (e.g., with Amazon Athena).
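To make "buffer by time or size" concrete, here is a toy in-memory illustration of the idea. This is not how Firehose is implemented and it calls no AWS API; the class name and thresholds are invented, and the age check only runs when a new record arrives:

```python
import time


class SizeOrTimeBuffer:
    """Collects records and flushes when a size or age threshold is reached."""

    def __init__(self, flush, max_bytes, max_seconds, clock=time.monotonic):
        self._flush = flush            # callback that receives the combined bytes
        self._max_bytes = max_bytes
        self._max_seconds = max_seconds
        self._clock = clock
        self._records = []
        self._bytes = 0
        self._started = None           # time the first buffered record arrived

    def add(self, record: bytes):
        if self._started is None:
            self._started = self._clock()
        self._records.append(record)
        self._bytes += len(record)
        too_big = self._bytes >= self._max_bytes
        too_old = self._clock() - self._started >= self._max_seconds
        if too_big or too_old:
            self.flush()

    def flush(self):
        # Emit one newline-delimited blob, i.e. one "object" for many records.
        if self._records:
            self._flush(b"\n".join(self._records))
        self._records = []
        self._bytes = 0
        self._started = None
```

The point is only that many small messages end up as one larger object, which is what makes the downstream S3 layout cheaper to query.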
Going from IoT-Core rule direct to a Lambda can be fragile.
You can use Kinesis to buffer the data, or Firehose to stream it directly to S3. These are standard patterns that AWS recommends for IoT in the IoT Lens of the AWS Well-Architected Framework (https://d1.awsstatic.com/whitepapers/architecture/AWS-IoT-Lens.pdf).

Periodic Read from AWS S3 and publish to SQS

I have an S3 bucket with different files. I need to read those files and publish an SQS message for each row in each file.
I cannot use S3 events as the files need to be processed with a delay - put to SQS after a month.
I can write a scheduler to do this task (read and publish). But can I use AWS for this purpose?
AWS Batch, AWS Data Pipeline, or Lambda?
I need to pass the date(filename) of the data to be read and published.
Edit: The data volume to be dealt with is huge.
I can think of two ways to do this entirely using AWS serverless offerings without even having to write a scheduler.
You could use S3 events to start a Step Function that waits for a month before reading the S3 file and sending messages through SQS.
With a little more work, you could use S3 events to trigger a Lambda function which writes the messages to DynamoDB with a TTL of one month in the future. When the TTL expires, you can have another Lambda that listens to the DynamoDB streams, and when there’s a delete event, it publishes the message to SQS. (A good introduction to this general strategy can be found here.)
While the second strategy might require more effort, you might find it less expensive than using Step Functions depending on the overall message throughput and whether or not the S3 uploads occur in bursts or in a smooth distribution.
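For the DynamoDB variant, the second Lambda has to tell TTL expirations apart from ordinary deletes. A minimal sketch, assuming the stream is configured to include old images and that the message text was stored in a string attribute called "payload" (both are assumptions, as are the function names; the SQS client is passed in so nothing here is tied to real resources):

```python
def expired_payloads(stream_event, payload_attribute="payload"):
    """Return payload strings from items that DynamoDB TTL has deleted."""
    payloads = []
    for record in stream_event.get("Records", []):
        if record.get("eventName") != "REMOVE":
            continue
        identity = record.get("userIdentity") or {}
        # TTL deletions are performed by the DynamoDB service principal;
        # deletes issued by an application do not carry this identity.
        if identity.get("principalId") != "dynamodb.amazonaws.com":
            continue
        old_image = record["dynamodb"].get("OldImage", {})
        if payload_attribute in old_image:
            payloads.append(old_image[payload_attribute]["S"])
    return payloads


def forward_expired(stream_event, sqs_client, queue_url):
    """Publish each expired payload to SQS and return how many were sent."""
    payloads = expired_payloads(stream_event)
    for body in payloads:
        sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
    return len(payloads)
```

Here `sqs_client` would be something like `boto3.client("sqs")`, and the queue URL is whatever queue the consumers read from.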
At the core, you need to do two things:
Enumerate all of the objects in a bucket in S3.
Perform some action on any object uploaded more than a month ago.
Can you use Lambda or Batch to do this? Sure. A Lambda could be set to trigger once a day, enumerate the files, and post the results to SQS.
Should you? No clue. A lot depends on your scale, and what you plan to do if it takes a long time to perform this work. If your S3 bucket has hundreds of objects, it won't be a problem. If it has billions, your Lambda will need to be able to handle being interrupted, and continuing paging through files from a previous run.
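A sketch of what "handle being interrupted and continue paging" might look like with the S3 ListObjectsV2 call. The function name, the `max_pages` cap, and the idea of persisting the last key between runs are illustrative assumptions; the client is passed in (for example `boto3.client("s3")`):

```python
from datetime import datetime, timezone


def list_old_keys(s3_client, bucket, cutoff, resume_after=None, max_pages=None):
    """Return (keys, last_seen).

    keys      -- object keys last modified before `cutoff`
    last_seen -- key to resume after on the next run, or None when the
                 whole bucket has been listed
    """
    keys, last_seen, token, pages = [], resume_after, None, 0
    while True:
        kwargs = {"Bucket": bucket}
        if token:
            kwargs["ContinuationToken"] = token   # next page within this run
        elif last_seen:
            kwargs["StartAfter"] = last_seen      # resume from a previous run
        page = s3_client.list_objects_v2(**kwargs)
        for obj in page.get("Contents", []):
            last_seen = obj["Key"]
            if obj["LastModified"] < cutoff:
                keys.append(obj["Key"])
        pages += 1
        if not page.get("IsTruncated"):
            return keys, None
        if max_pages is not None and pages >= max_pages:
            return keys, last_seen
        token = page["NextContinuationToken"]
```

The cutoff would be a timezone-aware datetime such as `datetime.now(timezone.utc) - timedelta(days=30)`, and the returned `last_seen` would need to be stored somewhere durable between runs.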
Alternatively, you could use S3 events to trigger a simple Lambda that adds a row to a database. Then, again, some Lambda could run on a cron job that asks the database for old rows, and publishes that set to SQS for others to consume. That's slightly cleaner, maybe, and can handle scaling up to pretty big bucket sizes.
Or, you could do the paging through files, deciding what to do, and processing old files all on a t2.micro if you just need to do some simple work to a few dozen files every day.
It all depends on your workload and needs.

Can I limit concurrent invocations of an AWS Lambda?

I have a Lambda function that’s triggered by a PUT to an S3 bucket.
I want to limit this Lambda function so that it’s only running one instance at a time – I don’t want two instances running concurrently.
I’ve had a look through the Lambda configuration and docs, but I can’t see anything obvious. I can think about writing my own locking system, but it would be nice if this was already a solved problem.
How can I limit the number of concurrent invocations of a Lambda?
AWS Lambda now supports concurrency limits on individual functions:
https://aws.amazon.com/about-aws/whats-new/2017/11/set-concurrency-limits-on-individual-aws-lambda-functions/
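With boto3 this comes down to a single call on the Lambda client. A minimal sketch, with the client passed in (for example `boto3.client("lambda")`) and a wrapper name that is purely illustrative:

```python
def limit_to_single_instance(lambda_client, function_name):
    """Reserve exactly one concurrent execution for the given function."""
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=1,
    )
```

Invocations that arrive while the single instance is busy are throttled rather than run in parallel, so it is worth thinking about what should happen to those events.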
I would suggest using Kinesis Streams (or alternatively DynamoDB + DynamoDB Streams, which essentially have the same behavior).
You can see Kinesis Streams as a queue. The good part is that you can use a Kinesis Stream as a trigger for your Lambda function. So anything that gets inserted into this queue will automatically be passed over to your function, in order. So you will be able to process those S3 events one by one, one Lambda execution after the other (one instance at a time).
In order to do that, you'll need to create a Lambda function with the simple purpose of getting S3 Events and putting them into a Kinesis Stream. Then you'll configure that Kinesis Stream as your Lambda Trigger.
When you configure the Kinesis Stream as your Lambda trigger, I suggest using the following configuration:
Batch size: 1
This means that your Lambda will be called with only one event from Kinesis. You can select a higher number and you'll get a list of events of that size (for example, if you want to process the last 10 events in one Lambda execution instead of 10 consecutive Lambda executions).
Starting position: Trim horizon
This means it'll behave as a queue (FIFO)
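In code, those two settings correspond to the `BatchSize` and `StartingPosition` parameters of Lambda's `create_event_source_mapping` call. The sketch below only builds the arguments and shows how the function might decode what it receives; the helper names are made up, and the one-at-a-time behaviour assumes a single-shard stream, since ordering in Kinesis is per shard:

```python
import base64


def kinesis_trigger_settings(stream_arn, function_name):
    """Keyword arguments for Lambda's create_event_source_mapping call."""
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "BatchSize": 1,                      # one record per invocation
        "StartingPosition": "TRIM_HORIZON",  # start from the oldest record
    }


def handler(event, context):
    """Decode each Kinesis record; the payload arrives base64-encoded."""
    return [
        base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        for record in event["Records"]
    ]
```

The settings dictionary would be passed as `lambda_client.create_event_source_mapping(**settings)`.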
There is a bit more info in "AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AWS Lambda".
I hope this helps anyone with a similar problem.
P.S. Bear in mind that Kinesis Streams have their own pricing. Using DynamoDB + DynamoDB Streams might be cheaper (or even free due to the non-expiring Free Tier of DynamoDB).
No, this is one of the things I'd really like to see Lambda support, but currently it does not. One of the problems is that if there were a lot of S3 PUT operations happening AWS would have to queue up all the Lambda invocations somehow, and there is currently no support for that.
If you built a locking mechanism into your Lambda function, what would you do with the requests you don't process due to a lock? Would you just throw those S3 notifications away?
The solution most people recommend is to have S3 send the notifications to an SQS queue, and then have your Lambda function scheduled to run periodically, like once a minute, and check if there is an item in the queue that needs to be processed.
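A scheduled function along those lines could drain the queue sequentially, roughly as sketched here. The function name and the `max_batches` cap are illustrative; `sqs_client` would be something like `boto3.client("sqs")` and `process` is whatever you do with each notification:

```python
def drain_queue(sqs_client, queue_url, process, max_batches=100):
    """Process queued messages one at a time; return how many were handled."""
    handled = 0
    for _ in range(max_batches):
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # the most SQS returns per request
            WaitTimeSeconds=0,       # short poll
        )
        messages = response.get("Messages", [])
        if not messages:
            break
        for message in messages:
            process(message["Body"])
            # Delete only after successful processing so failures are retried.
            sqs_client.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
            )
            handled += 1
    return handled
```

Note that with short polling an empty response does not guarantee the queue is empty, so anything left behind is simply picked up by the next scheduled run.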
Alternatively, have S3 send the notifications to SQS and just have a t2.nano EC2 instance with a single-threaded service polling the queue.
I know this is an old thread, but I ran across it trying to figure out how to make sure my time sequenced SQS messages were processed in order coming out of a FIFO queue and not getting processed simultaneously/out-of-order via multiple Lambda threads running.
Per the documentation:
For FIFO queues, Lambda sends messages to your function in the order
that it receives them. When you send a message to a FIFO queue, you
specify a message group ID. Amazon SQS ensures that messages in the
same group are delivered to Lambda in order. Lambda sorts the messages
into groups and sends only one batch at a time for a group. If your
function returns an error, the function attempts all retries on the
affected messages before Lambda receives additional messages from the
same group.
Your function can scale in concurrency to the number of active message
groups.
Link: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
So essentially, as long as you use a FIFO queue and submit your messages that need to stay in sequence with the same MessageGroupID, SQS/Lambda automatically handles the sequencing without any additional settings necessary.
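On the sending side, that means passing `MessageGroupId` to `send_message`. A small sketch that builds the arguments; the helper name is made up, and hashing the body for `MessageDeduplicationId` is just one possible choice (it is needed unless content-based deduplication is enabled on the queue):

```python
import hashlib


def fifo_message(queue_url, body, group_id):
    """Keyword arguments for sqs_client.send_message on a FIFO queue."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # Messages sharing a group ID are delivered in order.
        "MessageGroupId": group_id,
        # Assumption: identical bodies are treated as duplicates.
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```

Usage would be `sqs_client.send_message(**fifo_message(url, body, "sensor-42"))`, where the group ID is whatever identifies a sequence that must stay ordered.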
Have the S3 "Put events" cause a message to be placed on the queue (instead of involving a lambda function). The message should contain a reference to the S3 object. Then SCHEDULE a lambda to "SHORT POLL the entire queue".
PS: S3 events cannot trigger a Kinesis Stream, only SQS, SNS, or Lambda (see http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#supported-notification-destinations). Kinesis Streams are expensive and are used for real-time event handling.

Poll periodically for new files in AWS S3 buckets containing a lot of files?

I have a situation where I need to poll AWS S3 buckets for new files.
Also, it's not just one bucket. There are ~1000+ buckets, and these buckets could have a lot of files.
What are the usual strategies / designs for such a use case? I need to consume new files on each poll. I cannot delete files from the buckets.
Instead of polling, you should subscribe to S3 event notifications: http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
These can be delivered to an SNS topic, an SQS queue, or trigger a Lambda function.
Well, in order to best answer that question, we would need to know what kind of application / architecture is doing the polling and consuming, however the 'AWS' way to do that is to have S3 send out S3 notifications upon creation of each file. The S3 notification contains a reference to the S3 file and can go out to SNS or SQS or even better Lambda which will then trigger the application to spin up, consume the files and then shut down.
Now, if you're going to have a LOT of files, all of those SNS/SQS notifications could get costly and some might then start looking at continuously polling S3 with the S3 SDK/CLI, however you need to keep in mind there are costs associated with the polling as well and you should look at ways to decrease the number of files. For example, if you're using Kinesis Firehose to dump into S3, look at batching. Or you can batch the SQS. Try your best to stick with the event notifications, it's much more resilient.
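If you are the one producing the SQS messages, batching them could look roughly like this, using `send_message_batch` with at most 10 entries per request. The helper names are illustrative, and this sketch does not inspect the `Failed` list in the response, which real code should do:

```python
def batch_entries(bodies, batch_size=10):
    """Yield lists of send_message_batch entries, at most batch_size each."""
    if not 1 <= batch_size <= 10:
        raise ValueError("SQS accepts between 1 and 10 entries per batch")
    for start in range(0, len(bodies), batch_size):
        chunk = bodies[start:start + batch_size]
        # Ids only need to be unique within a single batch request.
        yield [{"Id": str(i), "MessageBody": body} for i, body in enumerate(chunk)]


def send_all(sqs_client, queue_url, bodies):
    """Send bodies in batches; return the number of requests made."""
    requests = 0
    for entries in batch_entries(bodies):
        sqs_client.send_message_batch(QueueUrl=queue_url, Entries=entries)
        requests += 1
    return requests
```

Twenty-three messages then cost three requests instead of twenty-three.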

Large message Amazon SQS

I have an application that uses JMS to send files that are about a few megabytes large. Would it be possible to use Amazon SQS as the JMS provider for this application, as described here?
The problem here is that the max size of an SQS message is 256K. One way around this is to break up each file into multiple messages of 256K. But, if I do that, would having multiple producers send files at the same time break the architecture, as the messages from different producers become mixed up?
In this scenario you cannot use the original message with SQS; you will have to use a new message with a reference to the original message. The reference can be to an S3 object or to a custom location on-prem or within AWS. The S3 option probably involves the least amount of work and has the best cost efficiency (building and running).
If you go with the S3 option, AWS Lambda functions can be used to drop the message into SQS.
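A minimal sketch of that reference-passing approach, with the producer and consumer side by side. The key prefix, the JSON field names, and the function names are all invented for illustration; the clients are passed in (for example `boto3.client("s3")` and `boto3.client("sqs")`):

```python
import json
import uuid


def send_via_s3(s3_client, sqs_client, bucket, queue_url, payload: bytes):
    """Store the payload in S3 and enqueue a small message pointing at it."""
    key = f"payloads/{uuid.uuid4()}"  # hypothetical key layout
    s3_client.put_object(Bucket=bucket, Key=key, Body=payload)
    reference = {"bucket": bucket, "key": key, "size": len(payload)}
    sqs_client.send_message(QueueUrl=queue_url, MessageBody=json.dumps(reference))
    return key


def receive_via_s3(s3_client, message_body):
    """Resolve a reference message back into the original payload bytes."""
    reference = json.loads(message_body)
    response = s3_client.get_object(Bucket=reference["bucket"], Key=reference["key"])
    return response["Body"].read()
```

Because each message carries its own object key, files from multiple producers cannot get mixed up the way split-up chunks could.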
On a side note, the original message considered here seems to be self-contained. It may be a good idea to revisit the contents of the message; you may find ways to trim it and send only locations around, which will result in a smaller payload.
If everything is in the same region, the latency and data transfer costs are minimal. Putting an item in S3 and sending a reference to the object through SQS lets your solution handle data of any size, and takes the work of scaling for the number of items and the size of each item off your hands.
While I said the data transfer costs are minimal, you will still incur data storage costs in S3; you can use S3 lifecycle rules to delete the objects.
@D.Luffy mentioned an important and exciting solution with Lambda: you keep adding items to S3, enable S3 notifications, get those into the queue, and process each queue item (transfer it to another EC2 instance, etc.), making the solution a fire-and-forget kind of thing.
Please do not hesitate to leverage S3 alongside SQS.