Do Lambdas execute operations in sequence? - amazon-web-services

We are contemplating using Amazon Web Services for our project, wherein the upstream flow will push messages into Kinesis and those messages will later be fed into Lambda functions. The messages are in order before processing. As per my understanding, AWS Lambda scales out horizontally based on the volume of messages. We have a volume of 400 messages per second, which means AWS Lambda will respond to the message volume and instantiate new processes in separate containers to leverage parallelism, and to achieve parallelism, ordering has to be compromised. So if 10 ordered messages hit the Lambda functions and one function takes more time than another, AWS will provision a new function in some container to serve the request.
Is the final output going to be in order after all of these processes?
Any help is appreciated.
Thanks.

If you are using Amazon Kinesis Data Firehose, then you can use a Data Transformation to trigger an AWS Lambda function on each incoming record.
This allows each record to be transformed or dropped before continuing through the Firehose. Thus, records can be processed by Lambda while remaining in the same order. The final data can be delivered to Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, or Splunk.
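As a rough sketch of what such a transformation function looks like: Firehose hands the Lambda function a batch of records and expects each one back with its original recordId, a result status, and base64-encoded data. The uppercasing step below is just a placeholder for real transformation logic.

```python
import base64

def lambda_handler(event, context):
    """Kinesis Data Firehose transformation: return every record with its
    original recordId, a result status, and base64-encoded data."""
    output = []
    for record in event['records']:
        payload = base64.b64decode(record['data'])
        transformed = payload.upper()  # placeholder for real transformation logic
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',  # or 'Dropped' / 'ProcessingFailed'
            'data': base64.b64encode(transformed).decode('utf-8'),
        })
    return {'records': output}
```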
If your application is consuming records from Amazon Kinesis directly (instead of via Firehose), then records will be consumed in order, per shard, by your application.

Related

How to save data from a Lambda function into S3 when there is too much incoming data per millisecond?

I have a process that publishes data into AWS IoT Core, which triggers a Lambda function that inserts the payload into an Amazon S3 bucket.
The process sends around 1.2 million records within a few seconds, and when I check the bucket I see I have lost around 10% of the data. If I add a sleep in the Lambda function, it runs beyond the 15-minute limit.
What is the solution for this scenario?
It appears that your requirement is to capture the events coming into AWS IoT Core and save them to Amazon S3.
It also sounds like your Lambda functions are being throttled due to hitting concurrency limits, and data is being lost. By default, there is a limit of 1,000 concurrent AWS Lambda executions per Region. This could potentially be fixed by requesting an increase in the maximum number of concurrent executions.
Here is a diagram from How AWS IoT works:
As shown in the diagram, the Rules engine can actually be used to send data to Amazon S3 without requiring Lambda. However, this creates a separate object in Amazon S3 for every message.
If you wish to combine messages together, you can Write to Kinesis Data Firehose Using AWS IoT. Firehose will buffer the data by time or size, and then output multiple messages to a single Amazon S3 object. This can be a good way to handle large volumes of data, and it also makes the resulting objects in S3 easier to work with because fewer objects are created, which makes them faster to query and process later (e.g. with Amazon Athena).
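For illustration, an IoT topic rule that forwards matching messages to a Firehose delivery stream could be created with boto3 roughly like this; the rule name, topic filter, stream name, and role ARN are all placeholders to substitute with your own resources.

```python
import boto3

iot = boto3.client('iot')

# All names and ARNs below are hypothetical placeholders.
iot.create_topic_rule(
    ruleName='telemetry_to_firehose',
    topicRulePayload={
        'sql': "SELECT * FROM 'sensors/+/telemetry'",
        'awsIotSqlVersion': '2016-03-23',
        'ruleDisabled': False,
        'actions': [{
            'firehose': {
                'roleArn': 'arn:aws:iam::123456789012:role/iot-firehose-role',
                'deliveryStreamName': 'telemetry-stream',
                'separator': '\n',  # newline-delimit records within each S3 object
            }
        }],
    },
)
```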
Going from an IoT Core rule directly to a Lambda function can be fragile.
You can use Kinesis to buffer the data, or Firehose to stream it directly to S3. These are standard patterns that AWS recommends for IoT in the AWS Well-Architected Framework (https://d1.awsstatic.com/whitepapers/architecture/AWS-IoT-Lens.pdf).

Is a queue needed when I use AWS API Gateway/Lambda as a web server?

I am learning Apache Kafka as a queue.
I understand that a queue is needed when I run a web server so as not to drop burst traffic.
A queue can help avoid dropping data during rush hours.
Without a queue, the only thing I can do is add as many servers as the rush-hour traffic requires.
Is that right?
If it is right:
Assume that I use AWS API Gateway + Lambda as the web server.
AWS Lambda can auto-scale, so my Lambda web server never drops burst traffic. Does that mean a queue such as Kafka is not needed in this case?
Surely if I need a pub/sub architecture, Kafka is needed.
Is what I think right?
API Gateway is typically used for cases where you care about the result of the API call and want to do something with the response. In this case, you need to wait for the Lambda function to finish and return the result so it can be passed back to the client. You don't need a queue because Lambda will scale out and add processes for each request. The limit would be the 10,000 requests per second of API Gateway, or the capacity of any downstream systems like a database.
Kafka is designed for real-time data streaming cases: things where you want to process data immediately, such as transcribing video. It is different from pub/sub in that consumers request data from Kafka. If the process requires merging data from multiple input sources on an ongoing basis, then Kafka is a good fit. To say this another way, if the size of the input has no upper bound, stream processing is a good choice. A similar service that is available on AWS is Amazon Kinesis.
Pub/sub (such as Amazon SNS, which can easily trigger Lambda functions) is better for use cases where the size of the input, or the size of a useful batch, can be easily defined, but where data should still be processed near real-time. In a pub/sub system, events are published to subscribers rather than subscribers requesting them.
Another option is a queue like Amazon SQS, which can be useful if there is a bottleneck somewhere else in the system, such as database write capacity, or a Lambda concurrency limit. In this architecture, consumers request items from the queue when they are ready to process them, so it is better for use-cases where results are not immediately required.
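To make the pull-based model concrete, here is a minimal consumer sketch; the queue URL is a placeholder and process() stands in for your real handler.

```python
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/work-queue'  # placeholder

def process(body):
    print(body)  # stand-in for your real processing logic

while True:
    # Long polling: wait up to 20 seconds for messages instead of busy-polling.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    for msg in resp.get('Messages', []):
        process(msg['Body'])
        # Delete only after successful processing; otherwise the message
        # becomes visible again after the visibility timeout.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])
```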

SQS or Kinesis: which one is good for queuing?

I have a server which can only process 20 requests at a time. When lots of requests come in, I want to store the request data in a queue, then read a set of requests (i.e. 20) and process them as a batch. What would be the ideal way to do that? Using SQS, or Kinesis? I'm totally confused.
SQS = Simple Queue Service, for queuing messages in a 1:1 pattern (once a message is consumed, it is removed from the queue).
Kinesis = low-latency, high-volume data streaming, typically 1:N (many consumers of the same messages).
As Kinesis also stores the data for a period of time, the two are often confused, but their architectural patterns are totally different.
Queue => SQS.
Data Streams => Kinesis.
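For the 20-at-a-time requirement in the question, note that a single SQS ReceiveMessage call returns at most 10 messages, so a batch of 20 means accumulating across calls. A rough sketch, with the queue URL as a placeholder:

```python
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/request-queue'  # placeholder
BATCH_SIZE = 20

def read_batch():
    """Accumulate up to BATCH_SIZE messages; ReceiveMessage caps out at 10 per call."""
    batch = []
    while len(batch) < BATCH_SIZE:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=min(10, BATCH_SIZE - len(batch)),
            WaitTimeSeconds=5,
        )
        msgs = resp.get('Messages', [])
        if not msgs:
            break  # queue is drained for now; process what we have
        batch.extend(msgs)
    return batch

batch = read_batch()
print('processing', len(batch), 'requests as one batch')  # stand-in for real work
for m in batch:
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m['ReceiptHandle'])
```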
Taken from https://aws.amazon.com/kinesis/data-streams/faqs/ :

Q: How does Amazon Kinesis Data Streams differ from Amazon SQS?

Amazon Kinesis Data Streams enables real-time processing of streaming big data. It provides ordering of records, as well as the ability to read and/or replay records in the same order to multiple Amazon Kinesis Applications. The Amazon Kinesis Client Library (KCL) delivers all records for a given partition key to the same record processor, making it easier to build multiple applications reading from the same Amazon Kinesis data stream (for example, to perform counting, aggregation, and filtering).

Amazon Simple Queue Service (Amazon SQS) offers a reliable, highly scalable hosted queue for storing messages as they travel between computers. Amazon SQS lets you easily move data between distributed application components and helps you build applications in which messages are processed independently (with message-level ack/fail semantics), such as automated workflows.

Q: When should I use Amazon Kinesis Data Streams, and when should I use Amazon SQS?

We recommend Amazon Kinesis Data Streams for use cases with requirements that are similar to the following:

- Routing related records to the same record processor (as in streaming MapReduce). For example, counting and aggregation are simpler when all records for a given key are routed to the same record processor.
- Ordering of records. For example, you want to transfer log data from the application host to the processing/archival host while maintaining the order of log statements.
- Ability for multiple applications to consume the same stream concurrently. For example, you have one application that updates a real-time dashboard and another that archives data to Amazon Redshift. You want both applications to consume data from the same stream concurrently and independently.
- Ability to consume records in the same order a few hours later. For example, you have a billing application and an audit application that runs a few hours behind the billing application. Because Amazon Kinesis Data Streams stores data for up to 7 days, you can run the audit application up to 7 days behind the billing application.

We recommend Amazon SQS for use cases with requirements that are similar to the following:

- Messaging semantics (such as message-level ack/fail) and visibility timeout. For example, you have a queue of work items and want to track the successful completion of each item independently. Amazon SQS tracks the ack/fail, so the application does not have to maintain a persistent checkpoint/cursor. Amazon SQS will delete acked messages and redeliver failed messages after a configured visibility timeout.
- Individual message delay. For example, you have a job queue and need to schedule individual jobs with a delay. With Amazon SQS, you can configure individual messages to have a delay of up to 15 minutes.
- Dynamically increasing concurrency/throughput at read time. For example, you have a work queue and want to add more readers until the backlog is cleared. With Amazon Kinesis Data Streams, you can scale up to a sufficient number of shards (note, however, that you'll need to provision enough shards ahead of time).
- Leveraging Amazon SQS's ability to scale transparently. For example, you buffer requests and the load changes as a result of occasional load spikes or the natural growth of your business. Because each buffered request can be processed independently, Amazon SQS can scale transparently to handle the load without any provisioning instructions from you.
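Tying the FAQ's ordering point back to the original question at the top: Kinesis preserves order per partition key, because all records with the same key land on the same shard. A minimal producer sketch (stream name and key are placeholders):

```python
import json
import boto3

kinesis = boto3.client('kinesis')

# Records sharing a partition key go to the same shard, so a consumer
# reads them back in the order they were accepted for that key.
for i in range(10):
    kinesis.put_record(
        StreamName='orders-stream',   # placeholder
        PartitionKey='device-42',     # ordering is preserved per key
        Data=json.dumps({'seq': i}).encode('utf-8'),
    )
```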

AWS: reduce webhook impact on EC2 with a queue

I have a PHP web application running on an EC2 server. The app is integrated with another service, which involves subscribing to a number of webhooks.
The number of requests the server receives from these webhooks has become unmanageable, and I'm looking for a more efficient way to deal with the data coming from them.
My initial thought was to use API Gateway to put these requests into an SQS queue and read from that queue in batches.
However, I would like those batches to be read by the EC2 instance, because the code used to process the webhooks is reused throughout my application.
Is this possible, or am I forced to use a Lambda function with SQS? Is there a better way?
The approach you suggested (API Gateway + SQS) will work just fine. There is no need to use AWS Lambda. You'll want to use the AWS SDK for PHP when writing the application code that receives messages from your SQS queue.
I've used this pattern before and it's a great solution.
. . . am I forced to use a Lambda function with SQS?
SQS plus Lambda is basically free. At this time, you get 1M (million) Lambda calls and 1M (million) SQS requests per month. Remember that each of those SQS requests may contain up to 10 messages, so that's a potential 10M messages, all inside the free tier. Your EC2 instance is likely always on; your Lambda function is not. Even if you only use Lambda to push the SQS data to a data store such as an RDBMS for your EC2 instance to poll periodically, the operation would be bullet-proof and very inexpensive. With the introduction of SQS, you could transition the common EC2 code to Lambda function(s); these now have a maximum run time of 15 minutes.
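If you did go the Lambda route, an SQS event source mapping hands the function a batch of messages in event['Records']; the save_row() helper below is a hypothetical stand-in for real data store code.

```python
import json

def save_row(payload):
    print(payload)  # hypothetical stand-in for a real data store write

def lambda_handler(event, context):
    """Triggered by an SQS event source mapping; each invocation
    receives a batch of messages (up to the configured batch size)."""
    for record in event['Records']:
        save_row(json.loads(record['body']))
    # On a clean return, the event source mapping deletes the
    # processed messages from the queue automatically.
```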
To cite my sources:
SQS pricing for reference: https://aws.amazon.com/sqs/pricing/
Lambda pricing for reference: https://aws.amazon.com/lambda/pricing/

How to handle AWS IoT streaming data in a relational database

Generic information: I am designing a solution for an IoT problem in which data streams continuously from a PLC (programmable logic controller). The PLC has different tags, which are representations of telemetry data, and data streams continuously from these tags. Each device also has alarm tags whose value is 0 or 1; 1 means there is an equipment failure.
Problem statement: I have to read the alarm tags and raise a ticket if any alarm tag's value is 1. I also have to stream these alerts to a dashboard and maintain the ticket history, so the operator can update the ticket status.
My solution: I am using AWS IoT. I receive the data in DynamoDB, then use a DynamoDB Stream to check whether a new item has been added to the alarm table; if so, it triggers a Lambda function (which I have implemented in Java) that opens a new ticket in a relational database using Hibernate.
Problem with my approach: the AWS IoT data streams into the alarm table at a very fast rate, and this opens a lot of connections before they can be closed, which is taking my relational database down.
Please let me know if there is another good design approach I can adopt.
Use Amazon Kinesis Data Analytics to process streaming data. DynamoDB isn't suitable for this.
Just a proposal:
From the Lambda function, do not contact RDS.
Rather, push all alarms into AWS SQS.
Then you can have another Lambda function, scheduled to run every minute using AWS CloudWatch Events rules, that picks up all items from the SQS queue and inserts them into RDS at once.
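A minimal sketch of the first half of that proposal, assuming the Lambda function is triggered by the DynamoDB Stream with a NEW_IMAGE view; the queue URL is a placeholder.

```python
import json
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/alarm-queue'  # placeholder

def lambda_handler(event, context):
    """Forward new alarm items from the DynamoDB Stream into SQS instead of
    opening an RDS connection on every invocation."""
    for record in event['Records']:
        if record['eventName'] != 'INSERT':
            continue
        # NewImage is the DynamoDB-typed attribute map of the inserted item.
        new_image = record['dynamodb']['NewImage']
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(new_image))
```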
I agree with raevilman's design of not letting Lambda contact RDS directly.
Since creating a new ticket is not the only task your Lambda function is doing (you are also streaming these alerts to a dashboard), and depending on the streaming rate and the RDS limitations, you may want to split these tasks across multiple queues.
Generic solution: push the alarm to a fanout exchange, and this exchange will in turn push the alarm to one or more queues as required. You can then batch the alarms and perform multiple writes together without performing the connect/disconnect cycle multiple times.
AWS-specific solution: I haven't used SQS, so I can't really comment on its architecture. Alternatively, you can create an SNS topic and publish these alarms to it. You can then have SQS queues as subscribers to this topic, which in turn will be used for the ticketing and dashboard purposes independently of each other, as sketched below.
Here again, from the ticketing queue, you can poll messages in batches using Lambda or your own scheduler and process the tickets (the frequency depending on how time-critical the alarms are).
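A rough sketch of wiring that fan-out with boto3; the topic and queue names are placeholders, and the SQS access policy that allows SNS to deliver to each queue is omitted for brevity.

```python
import boto3

sns = boto3.client('sns')
sqs = boto3.client('sqs')

topic_arn = sns.create_topic(Name='alarms')['TopicArn']

for name in ('ticketing-queue', 'dashboard-queue'):
    queue_url = sqs.create_queue(QueueName=name)['QueueUrl']
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=['QueueArn']
    )['Attributes']['QueueArn']
    # Each subscribed queue receives its own copy of every published alarm.
    sns.subscribe(TopicArn=topic_arn, Protocol='sqs', Endpoint=queue_arn)

# Publishing one alarm now fans out to both queues independently.
sns.publish(TopicArn=topic_arn, Message='{"device": "42", "alarm": 1}')
```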
You may want to read this tutorial to get some pointers.
You can control the Lambda function's concurrency, and this will reduce the number of Lambda instances that get spun up by the DynamoDB events, thereby reducing the connections to RDS.
https://aws.amazon.com/blogs/compute/managing-aws-lambda-function-concurrency/
Of course, this will throttle the DynamoDB events.
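For reference, reserved concurrency is a single API call; the function name and the limit of 10 below are placeholders.

```python
import boto3

lambda_client = boto3.client('lambda')

# Cap this function at 10 concurrent executions so at most 10 RDS
# connections are open at once; excess stream batches are retried.
lambda_client.put_function_concurrency(
    FunctionName='alarm-ticket-writer',  # placeholder
    ReservedConcurrentExecutions=10,
)
```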