My project requires me to communicate with many devices outside the cloud. If successful, this means millions of devices. These devices will not be running Android or iOS, and will be running behind routers & firewalls (I cannot assume they have an external IP).
I am looking to use SQS to send messages to my users outside the cloud. To allow the servers to message individual users, I am designing the system with one queue per client, which can potentially mean millions (billions?) of queues. While the documentation states that SQS can support unlimited queues, I would like to make sure that I am not abusing the system.
I am aware that SQS can be expensive, but I am using it at this stage
for ease of administration.
As far as I can tell, SNS requires either an iOS/Android client or an HTTP server running on the consumer. This is why I ruled out SNS and am using SQS.
I am going to build a distributed cloud front-end over SQS to handle client connections. This front-end will just be a wrapper that authenticates clients and relays their traffic to the SQS queues.
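For illustration, the relay part would boil down to something like this rough boto3 sketch (the queue naming scheme and the helper are just placeholders, not a final design):

```python
import boto3

sqs = boto3.client("sqs")

def relay_to_client(client_id: str, body: str) -> None:
    """Relay one server-side message to the per-client queue."""
    # create_queue is idempotent for an existing queue with the same
    # attributes, so it doubles as get-or-create. Naming is a placeholder.
    queue_url = sqs.create_queue(QueueName=f"client-{client_id}")["QueueUrl"]
    sqs.send_message(QueueUrl=queue_url, MessageBody=body)
```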
Am I abusing the SQS "unlimited queues" policy (will SQS performance drop)? Is there a simpler design for per device messaging?
Let me break the answer down by the parts of your question.
About your questions:
Am I abusing the SQS "unlimited queues" policy?
AWS services are designed to prevent abuse, and you will pay exactly for what you use, so if you believe this is the right approach, go for it. To remove the uncertainty, I'd advise a preliminary "proof of concept" implementation.
Is there a simpler design for per device messaging?
Probably yes; reconsider SNS and other messaging systems.
About your statements:
I am aware that SQS can be expensive, but I am using it at this stage
for ease of administration.
"Expensive" is a very context-depend classification, considering that a SQS message can cost $0.00000005.
As far as I can tell, SNS requires either an iOS/Android client or an HTTP server running on the consumer. This is why I ruled out SNS and am using SQS.
SNS is a push-based messaging system (SQS is pull-based) that can handle five types of subscriptions: SMTP, SMS, HTTP, mobile push, and SQS, so the two services are not mutually exclusive.
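For example, wiring an SQS queue into an SNS topic is only a few API calls. A rough boto3 sketch (topic/queue names are made up, and the queue access policy step is elided):

```python
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# Fan SNS messages out into an SQS queue (names are illustrative).
topic_arn = sns.create_topic(Name="device-events")["TopicArn"]
queue_url = sqs.create_queue(QueueName="device-events-queue")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# In a real setup the queue also needs a policy allowing SNS to send to it.
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
sns.publish(TopicArn=topic_arn, Message="hello from SNS")
```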
I am going to build a distributed cloud front-end over SQS to handle
client connections. This front-end will just be a wrapper, that will
authenticate clients, and relay them to the SQS queues.
Managing millions of queues can be an overwhelming task for your "distributed cloud front-end over SQS". Unless your project is exactly about queue management, this is probably undifferentiated heavy lifting.
This is about all I can say without knowing your case, but consider that you can use SNS and SQS together, and with other messaging software such as Apache Camel, which may help you build your solution or proof of concept.
I think SQS (or SNS, if you can eventually use it) is still your best bet, especially if you need "quick response" or "near real time"; however, for the sake of having alternatives to compare against...
You can consider a giant DynamoDB table, with each device/client having its own "device-id" and perhaps "message-id" as keys. This way, each device can query its own keys for messages. DynamoDB is meant to handle billions of rows, so this won't stress it much. You should be careful with the querying part, though, as you could use up provisioned read capacity; although, at the aggregate level, your devices may not all respond/query at the same time, so you may still be OK.
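A rough sketch of that table layout (boto3; the table and attribute names are assumptions):

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table: partition key "device_id", sort key "message_id".
table = boto3.resource("dynamodb").Table("device-messages")

def fetch_messages(device_id: str) -> list:
    """Each device queries only its own partition for pending messages."""
    resp = table.query(KeyConditionExpression=Key("device_id").eq(device_id))
    return resp["Items"]
```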
You can also consider a giant S3 bucket, with each folder keyed to the device ID and further keyed into message-id folders. This is a poor man's SQS, but it's guaranteed to scale, both in message quantity and in the number of accesses to it.
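And a sketch of the S3 variant (the bucket name and key scheme are assumptions):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "device-mailboxes"  # hypothetical bucket

def poll_mailbox(device_id: str):
    """List and read this device's pending 'messages' (one object each)."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{device_id}/")
    for obj in resp.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        yield obj["Key"], body
```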
In both #1 and #2, if your devices are registered with Cognito for credentials, there's a straightforward way to write policies so the devices can only access their "own" stuff. However, both alternatives are likely slower than SQS, especially if you use SQS long polling -- with a long poll, you get a response as soon as SQS detects a message has been dropped off, whereas these alternatives require you to wait for the next polling cycle.
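For reference, long polling is just a parameter on the receive call. A minimal boto3 sketch (the queue URL is a placeholder):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

# Long poll: block for up to 20 seconds and return as soon as a message
# arrives, instead of returning an empty response immediately.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20,
                           MaxNumberOfMessages=1)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```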
Related
We have a system that uses Redis pub/sub features to communicate between its different parts. To keep things simple, we used pub/sub channels to implement various features. On both ends (producer and consumer), we have servers running code that I see no way to convert into Lambda functions.
We are migrating to AWS and, among other changes, we are trying to replace the use of Redis with a managed pub/sub solution. The required solution is fairly simple: a managed broker that allows publishing a message from one node and subscribing to its reception from zero or more other nodes.
It seems impossible to achieve this with any of the available solutions:
Kinesis - It is a streaming solution for data ingestion (similar to Apache Pulsar)
SNS - From the documentation, it looks like exactly what we need, until we realize that there is no way to connect a server (as opposed to a Lambda) except via a custom HTTP endpoint.
EventBridge - Same issue as with SNS
SQS - It is a queue, not a pub/sub.
Amazon MQ / RabbitMQ - It is a queue, not a pub/sub. It is also not a SaaS solution, but rather an installation on a node you own.
I see no reason to remove a feature such as subscribing from a server, which is why I was sure it would be present in one or more of the available solutions. But we went through the documentation and attempted to consume from SNS and EventBridge without success. Are we missing something? How can we achieve what we need?
Example
Assume we have an API server layer, deployed on ECS with a load balancer in front. The API has two endpoints: a PUT to update a document, and an SSE endpoint to listen for updates on documents.
Assuming a simple round-robin load balancer, an update for document1 may occur on node1 while a client has an ongoing SSE request for the same document on node2. This can be done with a Redis backbone: node1 publishes on the document1 topic and node2 is subscribed to the same topic. This solution is fast and efficient (in this case, at-most-once delivery is perfectly acceptable).
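For clarity, the Redis backbone amounts to something like this redis-py sketch (the channel name is illustrative):

```python
import redis  # redis-py

r = redis.Redis()

# node1: after handling the PUT, announce the change on the document's channel.
r.publish("document1", "updated")

# node2: the SSE handler subscribes to the same channel and forwards events.
pubsub = r.pubsub()
pubsub.subscribe("document1")
for event in pubsub.listen():
    if event["type"] == "message":
        print(event["data"])  # push this to the open SSE connection
```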
As this is just an example, we will not consider the WebSocket pub/sub API or other ready-made solutions for this specific use case.
Lambda
The subscriber side cannot be a Lambda. Since two distinct contexts are involved (the SSE HTTP request and the SNS event), two distinct Lambdas would fire, with no way to 'stitch' them together.
SNS + SQS
We hesitate to use SQS in conjunction with SNS, since that solution would add a lot of unneeded complexity:
The number of nodes is not known in advance, and they scale, requiring an automated system to increase/reduce the number of SQS queues.
Persistence is not required
Additional latency is introduced
Additional infrastructure cost
HTTP Endpoint
This is the closest thing to a programmatic subscription but suffers from similar issues to the SNS-SQS solution:
The number of nodes is unknown, requiring endpoint subscriptions to be added automatically.
Either we expose one endpoint for each node, or we need special configuration on the load balancer to route each message to the appropriate node.
Additional API endpoints must be exposed, maintained, and secured.
AWS SQS provides long polling and short polling, but why doesn't it provide a push mechanism, like RabbitMQ?
The application could establish a long-lived connection and consume messages pushed from the SQS queue.
This is a design choice, and it makes sense when you consider the use cases for which SQS is designed.
Serverless Computing: SQS is a core service when it comes to designing serverless architectures. In such systems there is no concept of "persistent servers" and hence no need for long-lived connections. This is also why the pricing model of SQS is based primarily on API calls.
REST API Access vs. Connections: To me, this says everything. In a serverless environment, having REST APIs to the microservices is necessary, because I cannot program around when the compute node will be provisioned and deprovisioned; e.g., in Lambda there are no hooks for these actions. This means I would either have to introduce a new layer (connection pools) or live with dangling connections. Otherwise, I would end up opening and closing connections for every single operation (or Lambda invocation), which would give me none of the benefits of a "connection" in the first place. Here, having a REST API makes sense.
This is also why DynamoDB (a database) is made accessible via a REST API, and Aurora now has a serverless alternative which, as you guessed, has a REST API.
Overhead of long-lived connections: Long-lived connections are expensive enough on either end that they would require a completely different architecture. This again ties back to the point above: there are no servers to keep the connections open in the first place.
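To illustrate the REST-style access pattern, here is a minimal Lambda handler sketch (boto3; the queue URL is a placeholder):

```python
import boto3

# Created at module load and reused across warm invocations, but every
# call below is a standalone, signed HTTPS request -- no broker
# connection is ever held open. The queue URL is a placeholder.
sqs = boto3.client("sqs")

def handler(event, context):
    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/work-queue",
        MessageBody=str(event),
    )
    return {"status": "queued"}
```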
Disclaimer: This answer comes from my experience of building architecture on AWS.
I am learning about Apache Kafka as a queue.
I understand that a queue is needed when I run a web server, so that I don't drop burst traffic.
A queue helps avoid dropping data during rush hours.
Without a queue, the only thing I can do is add as many servers as rush-hour traffic requires.
Is that right?
If that is right:
Assume I use AWS API Gateway + Lambda for the web server.
AWS Lambda can auto-scale, so my Lambda web server never drops burst traffic. Does that mean a queue such as Kafka is not needed in this case?
Surely, if I need a pub/sub architecture, Kafka is needed.
Is what I think right?
API Gateway is typically used for cases where you care about the result of the API call and want to do something with the response. In this case, you need to wait for the Lambda function to finish and return the result so it can be passed back to the client. You don't need a queue because Lambda will scale out and add processes for each request. The limit would be the 10,000 requests per second of API Gateway, or the capacity of any downstream systems like a database.
Kafka is designed for real-time data streaming cases; things where you want to process data immediately, such as transcribing video. It is different from pub/sub: consumers request data from Kafka. If the process requires merging data from multiple input sources on an ongoing basis, then Kafka is a good fit. To say this another way, if the size of the input has no upper bound, stream processing is a good choice. A similar service available on AWS is Amazon Kinesis.
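For instance, a Kafka consumer pulls records from the broker rather than having them pushed. A minimal kafka-python sketch (the topic, group, and broker address are made up):

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Consumers pull records from the broker; nothing is pushed to them.
consumer = KafkaConsumer(
    "video-events",
    bootstrap_servers="localhost:9092",
    group_id="transcoders",
)
for record in consumer:
    print(record.value)  # process each pulled record
```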
Pub/sub (such as Amazon SNS, which can easily trigger Lambda functions) is better for use cases where the size of the input, or the size of a useful batch, can be easily defined, but where data should still be processed near real-time. In a pub/sub system, events are published to subscribers rather than subscribers requesting them.
Another option is a queue like Amazon SQS, which can be useful if there is a bottleneck somewhere else in the system, such as database write capacity, or a Lambda concurrency limit. In this architecture, consumers request items from the queue when they are ready to process them, so it is better for use-cases where results are not immediately required.
If I understood the whole concept correctly, the "serverless" architecture assumes that instead of using your own servers or containers, you should use a bunch of AWS services. Usually such an architecture includes Amazon API Gateway, a bunch of Lambda functions, and DynamoDB (or an alternative) for storing data and state, since Lambda can't keep state. Services such as EC2 don't participate in any of this, because EC2 is a virtual server and would diminish all the benefits of a serverless architecture.
All this looks really cool, but I feel like I'm missing something important, because right now this seems inapplicable to cases such as real-time applications.
Say I have 2 users online. One of them performs an action in the app, which triggers changes in the database, which in turn should trigger changes in the second user's app.
The conventional way to send data or a command from server to client is a WebSocket connection. But with a serverless architecture there seems to be no way to establish and maintain a WebSocket connection. So... where did I misunderstand the concept? Or, if I understood everything correctly, how would I implement the interaction between 2 users described above?
Where did I misunderstand the concept?
Your observation is correct. It doesn't work out of the box using API Gateway and Lambda.
An applicable solution, as described here, is to use AWS IoT - yes, another AWS service.
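For the server-to-client push, the server side is a single publish call. A hedged boto3 sketch (the topic name is made up, and clients would subscribe to it over MQTT, optionally via WebSockets):

```python
import json
import boto3

iot = boto3.client("iot-data")

# Push a change notification to any client subscribed to this topic over
# MQTT (optionally via WebSockets). The topic name is illustrative.
iot.publish(
    topic="app/document1/updates",
    qos=0,
    payload=json.dumps({"event": "changed"}),
)
```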
Serverless isn't just a matter of Lambda, API Gateway, and DynamoDB; it's much bigger than that. One of the big advantages of serverless is the operational burden it takes off your plate: no more patching, no more capacity planning, no more config management. Those may seem trivial, but doing them well across a significant fleet of instances is complex, expensive, and time consuming. Another benefit is the economics. Public cloud leverages utility billing: with most AWS services you pay by the hour for whatever you run, whether or not you actually use it, but with Lambda you pay per 100ms of execution. The cheapest EC2 instance running for a full month is about $10/month (double that for redundancy), while $20 in Lambda pricing gets you millions of invocations, so for most cases serverless is significantly cheaper.
Serverless isn't for everything, though; it has its limitations. For example, it's not meant for running binaries: you can't run nginx in Lambda; it's only meant to be a runtime environment for the programming languages it supports. It's also specifically meant for event-based workloads, which is perfect for microservice-based architectures: small, independent, discrete pieces of compute that do their work, send an event to one or more others to do something else, and, if needed, return a response.
To address your concern about real-time processing: depending on what your code is doing, your Lambda function could complete in anywhere from less than 100ms up to 5 minutes. There are strategies to optimize its duration, but in general it's for short-lived work, which is conducive to real-time scenarios.
Your example of 2 users interacting with the web app and the DB could very easily be built with serverless technologies, using one or two functions and a DynamoDB table. The total round-trip time could be milliseconds if not seconds; it all depends on your code and what it's doing. These would all be HTTP calls, so no WebSockets needed. Think of a number of APIs calling each other, with your Lambda code as the orchestrator.
You might want to look at SNS (Simple Notification Service). In your example, if app user 2 is a subscriber to an SNS topic, then when app user 1 makes a change that triggers an SNS message, it will be pushed to the subscriber (app user 2). The message can be pushed over several supported protocols (Amazon, Apple, Google, MS, Baidu) in addition to SMTP or SMS. The SNS message can be triggered by a Lambda function or directly from a DynamoDB stream after an update (a database trigger). It's up to the app developer to select a message protocol and format. The app only has to receive messages through its native channels. This may not exactly be millisecond-latency 'real-time', but it's fast enough for all but the most latency-sensitive applications.
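As a sketch of that stream-triggered path (a Lambda attached to the table's DynamoDB stream, publishing to an assumed SNS topic; the topic ARN is a placeholder):

```python
import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:app-updates"  # placeholder

def handler(event, context):
    # Invoked by the table's DynamoDB stream; fan each update out via SNS.
    for record in event["Records"]:
        if record["eventName"] == "MODIFY":
            sns.publish(TopicArn=TOPIC_ARN,
                        Message=json.dumps(record["dynamodb"]["Keys"]))
```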
I've been working on an AWS serverless application for several months now, and am amazed at the variety of services available. The rate of improvement and new features being added is enough to leave you out of breath.
We need to sync data between different web servers. The idea is very basic: when one entity is created on one server, it should be sent to all the other servers. What's the right way to do it? We are currently evaluating two approaches: Amazon's SQS and SNS services, and a custom implementation with some key-value database (like memcached and memqueue). What are the common pitfalls of custom implementations? Any feedback will be highly appreciated.
SQS would work OK if you create a new queue for each server and write the data to each queue. The biggest downside is that you will need each server to poll for new messages.
SNS would work more efficiently because it allows you to broadcast a message to multiple locations. However, it's a one-shot try: if a machine can't receive its notification when SNS sends it, SNS will not try again.
You don't specify how many messages you are sending or what your performance requirements are, but any SQS/SNS system will likely be much, much slower (mostly due to latencies between sending the message and the servers receiving it) than a local memcache/key-value server solution.
A mixed solution would be to use a persistent store (like SimpleDB) and use SNS to alert the servers that new data is available.