Event-driven message bus architecture with AWS SNS: one or many message buses / Lambda action functions

I am implementing a process in my AWS-based hosting business with an event-driven architecture on AWS SNS. This is largely a learning experience with a new architecture, programming and hosting paradigm for me.
I have considered AWS Step Functions, but have decided to implement a message bus with AWS SNS topic(s), because I want to understand the underlying event-driven programming model.
Nearly all actions are performed by Lambda functions, and steps are coupled via SNS and/or SQS.
I am undecided whether to implement the process with one or many SNS topics, and whether I should subscribe the core logic to the message bus(es) with one or many Lambda functions.
One or many message buses
My core process currently consists of 9 events, of which 2 sets of 2 can run in parallel; the remaining 4 are sequential. Subscribing all of these to the same message bus is easier to set up, but requires each Lambda function to check whether the message is relevant to it, which seems like a waste of resources.
On the other hand, I could have 6 message buses and be sure that a notified resource has something to do with the message.
One or many lambda functions
If all Lambda functions are subscribed to the same message bus, it may be easier to package them all up with a dispatcher function in a single Lambda function. It would also reduce the amount of code to upload to Lambda, although I don't have to pay for that.
On the other hand, I would lose the ability to control the timeout for each Lambda function, and any changes to the order of events would now depend on the dispatcher code.
I would still have the ability to scale each part of the process, as any parts that contain repeating elements are separated by SQS queues.

You should always emit each type of message to its own topic, as this allows other services to consume these events without tightly coupling the two services.
Likewise, each worker that wants to consume messages should have its own queue with its own subscription to the topic.
Doing this allows you to add new message consumers for a given event without having to modify the upstream service. Furthermore, responsibility for each component is clear: the service producing messages to a topic owns that topic (and the message format), whereas the consumer owns its queue and event-handling semantics.
Your consumer can specify a message filter when subscribing to a topic, so that it only receives messages it cares about (documentation).
For example, a process that sends a customer survey after the customer has received their order would subscribe its queue to the Order Status Changed event with a filter set to only receive events where the new_status field is equal to shipment-received.
The above reflects principles of service-oriented architecture, and there's plenty of good material out there elaborating on these points.
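To make the filtering example concrete, here is a minimal boto3 sketch of subscribing a survey service's queue to an order-status topic with a filter policy. The topic/queue ARNs, queue name and the new_status attribute are illustrative assumptions, not anything prescribed by AWS.

import json
import boto3

sns = boto3.client("sns")

# Placeholder ARNs; in a real setup these come from your own resources.
topic_arn = "arn:aws:sns:eu-west-1:123456789012:order-status-changed"
queue_arn = "arn:aws:sqs:eu-west-1:123456789012:customer-survey-queue"

# Subscribe the queue and only deliver messages whose "new_status" attribute matches.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="sqs",
    Endpoint=queue_arn,
    Attributes={"FilterPolicy": json.dumps({"new_status": ["shipment-received"]})},
)

# The producing service publishes with the matching message attribute.
sns.publish(
    TopicArn=topic_arn,
    Message=json.dumps({"order_id": "123", "new_status": "shipment-received"}),
    MessageAttributes={
        "new_status": {"DataType": "String", "StringValue": "shipment-received"},
    },
)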

Related

Google Cloud Pub/Sub with different message types

Within the same application I send different message types that have completely different formats and that are totally unrelated. What is the best practice to tackle this problem?
I see two different approaches here:
Filter at application level, which means I receive all messages on the same puller (same subscription)
Create a new subscription, which means the application will have two pullers running (one for each message type)
You answered your own question with point 2. If the messages have completely different formats and are totally unrelated, they should be separated. There's no advantage to filtering them at the application layer; the topics/subscriptions model is made exactly for this purpose.
The difference between a topic and a subscription might be confusing, so let me describe that as well.
First, the concepts of Pub/Sub:
Topic: A named resource to which messages are sent by publishers. In a pub/sub model, any message published to a topic is immediately received by all of the subscribers to the topic.
Subscription: A named resource representing the stream of messages from a single, specific topic, to be delivered to the subscribing application.
Message: The combination of data and (optional) attributes that a publisher sends to a topic and is eventually delivered to subscribers.
Message attribute: A key-value pair that a publisher can define for a message.
[Diagram: the Pub/Sub model]
The publish-subscribe model allows messages to be broadcast to different parts of a system asynchronously. A sibling to a message queue, a message topic provides a mechanism to broadcast asynchronous event notifications, and endpoints that allow software components to connect to the topic in order to send and receive those messages. To broadcast a message, a component called a publisher simply pushes a message to the topic. Now, the difference between a topic and a subscription is that a topic can have multiple subscriptions, but a given subscription belongs to a single topic.
To sum up:
Use a Topic when you would like to publish messages.
Use a Subscription when you would like to consume messages.
It depends!! As always. But here it depends on how the messages are consumed.
If they are consumed by the same application, use the same subscription.
If the messages are consumed by different applications (because the messages are unrelated and have different structures), use 2 subscriptions.
Use message attributes to differentiate the message types. Thanks to these attributes, you can create subscriptions that accept only certain types of message. This way, you can keep the same topic and customize the dispatch afterward. I wrote an article on this.
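As an illustration of that attribute-based filtering, here is a minimal sketch with the google-cloud-pubsub client; the project, topic and subscription names and the type attribute are made-up examples, not part of any existing setup.

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path("my-project", "events")

# One subscription per message type, each with a filter on the "type" attribute.
for msg_type in ("invoice", "audit-log"):
    subscription_path = subscriber.subscription_path("my-project", f"{msg_type}-sub")
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "filter": f'attributes.type = "{msg_type}"',
        }
    )

Note that a subscription's filter can only be set at creation time.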
There are three ways you can approach this problem:
Publish the messages of different types to different topics, then create a subscription for each topic, and consume the messages from each subscription.
Publish the messages of different types to the same topic, create a single subscription, and consume all of the messages from the single subscription.
Publish the messages of different types to the same topic, create two subscriptions, and filter messages by type on the subscriber for each subscription.
There are tradeoffs with these three options. If you have control over the publisher and can create entirely separate topics for the different message types, this can be a good approach as it keeps different types of messages on completely independent channels. Think of it like having a data structure with a more specific type specified. For example, in Java one would generally prefer a List<String> and List<Integer> over a List<Object> that contains both.
However, this approach may not be feasible if the publisher is owned by someone else. It may also not be feasible if the subscriber has no way of knowing all of the topics that it could be necessary to consume from. Imagine you add another type of message and create a new topic. Processing it would require creating another subscriber. If the number of types of messages could grow very large, you could find yourself with many subscriber clients in a single task.
If choosing between the second and third option, the decision depends on your consumption patterns. Is it the same application that needs to process messages of both types or would it make sense to split this into separate applications? If it could make sense to have separate applications, then separate subscriptions is a good way to go. If the published messages have a way to distinguish their type in the attributes, then you could potentially use Pub/Sub filtering to ensure that subscribers for each subscription only receive the relevant messages.
If all messages are always going to be consumed by the same application, then a single subscription probably makes the most sense. The biggest reason for this is cost: if you have two subscriptions and two subscribers, that means all messages are going to be delivered and paid for twice. With a single subscription and distinguishing between the messages done at the application level, messages are only delivered once (modulo Cloud Pub/Sub's at-least-once delivery guarantee). This last option is particularly useful if the set of message types is unknown to the subscriber and could grow over time.
So if you have control over the publisher and the set of messages can be known in advance, separate topics for each message type is the best option. If that is not the case and processing of the messages could be done by different applications, then different subscriptions using filters is the best option. If processing of all message types will always be done by the same application or the number of types could grow, a single subscription is the best option.
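To complement the filtering option above, here is a small, hypothetical publisher sketch showing how a type attribute could be attached to each message so that either subscription filters or application-level dispatch can distinguish the kinds; the names and payload are invented for illustration.

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "events")

# Keyword arguments to publish() become message attributes (string values only).
future = publisher.publish(
    topic_path,
    json.dumps({"invoice_id": "inv-42", "amount": 99.5}).encode("utf-8"),
    type="invoice",
)
print(future.result())  # the message ID once the publish succeeds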

SNS > AWS Lambda asynchronous invocation queue vs. SNS > SQS > Lambda

Background
This architecture relies solely on Lambda's asynchronous invocation mechanism, as described here:
https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html
I have a collector function that is invoked once a minute and fetches a batch of data that can vary drastically in size (tens of KB to potentially 1-3 MB). The data contains a JSON array with one-to-many records. The collector function segregates these records and publishes them individually to an SNS topic.
A parser function is subscribed to the SNS topic and has a concurrency limit of 3. SNS asynchronously invokes the parser function per record, meaning that the built-in, AWS-managed Lambda asynchronous queue begins to fill up as the parser instances max out at 3. The Lambda queueing mechanism retries with incremental backoff when throttling occurs, until the invocation request can be processed by the parser function.
It is imperative that no record gets lost during this process, as records cannot be resurrected. I will be using dead-letter queues where needed to ensure they ultimately end up somewhere in case of error.
Testing this method resulted in no lost invocations. Everything worked as expected. Lambda reported hundreds of throttle responses, but I'm relying on this to trigger the Lambda retry behaviour for async invocations. My understanding is that this behaviour is effectively the same as what I'd have to develop and initiate myself if I wanted to retry consuming a message coming from SQS.
Questions
1. Is the built-in, AWS-managed Lambda asynchronous queue reliable?
The parser could be subject to a consistent load of 200+ invocations per minute for prolonged periods, so I want to understand whether the Lambda queue can handle this as sensibly as SQS. The part that concerns me most is this statement:
Even if your function doesn't return an error, it's possible for it to receive the same event from Lambda multiple times because the queue itself is eventually consistent. If the function can't keep up with incoming events, events might also be deleted from the queue without being sent to the function. Ensure that your function code gracefully handles duplicate events, and that you have enough concurrency available to handle all invocations.
This implies that an incoming invocation may just be deleted out of thin air. Also in my implementation I'm relying on the retry behaviour when a function throttles.
2. When a message is in the queue, what happens when the message timeout is exceeded?
I can't find a definitive answer, but I'm hoping the message would end up in the configured dead-letter queue.
3. Why would I use SQS over the Lambda queue when SQS presents other problems?
See the articles below for arguments against SQS. Overpulling (described in the second link) is of particular concern:
https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes/
https://medium.com/#zaccharles/lambda-concurrency-limits-and-sqs-triggers-dont-mix-well-sometimes-eb23d90122e0
I can't find any articles or discussions of how the Lambda queue performs.
Thanks for reading!
Quite an interesting question. There's a presentation that covered queues in detail; I can't find it at the moment. The premise is the same as this: queues are leaky buckets.
So what if I add more leaky buckets? Well, you've delayed the leaking, but it's now leaking into another bucket. Have you solved the problem or delayed it?
What if I vibrate the buckets at different frequencies?
Further reading:
operate lambda
message expiry
message timeout
DDIA / DDIA Online
SQS Performance
sqs failure modes
A minimal, complete and verifiable example (MCVE) is missing from this question, so I cannot address the precise problem you are having.
As for an opinion on whether to choose SQS or the Lambda queue, I'll point to the Meta discussion on this:
sqs faq mentions Kinesis streams
sqs sns kinesis comparison
TL;DR;
It depends
I think the biggest advantage of using your own queue is the fact that you, as a user, have visibility into the state of your backpressure.
Using the Lambda async invoke method, you can get throttling exceptions with the 'guarantee' that Lambda will retry over an interval. If you use an SQS source queue instead, you have complete visibility into the state of your message processing at all times, with no ambiguity.
Secondly, regarding overpulling: in theory this is a concern, but in practice it's never happened to me. I've run applications requiring thousands of transactions per second and never once had problems with SQS -> Lambda. Obviously, set your retry policy appropriately and use a DLQ, as transient/unpredictable errors CAN occur.
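For reference, here is a hedged boto3 sketch of the two setups being compared; the function name, ARNs and limits are placeholders rather than recommendations.

import boto3

lambda_client = boto3.client("lambda")

# Option A: keep the built-in async queue, but bound the retry window and send
# events that exhaust it to a dead-letter destination you can inspect.
lambda_client.put_function_event_invoke_config(
    FunctionName="parser",
    MaximumRetryAttempts=2,           # retries after the initial attempt
    MaximumEventAgeInSeconds=3600,    # give up after an hour in the async queue
    DestinationConfig={
        "OnFailure": {"Destination": "arn:aws:sqs:eu-west-1:123456789012:parser-dlq"}
    },
)

# Option B: put an SQS queue in front so the backlog is visible and under your control.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:eu-west-1:123456789012:parser-input",
    FunctionName="parser",
    BatchSize=10,
)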

How to stream events with the GCP platform?

I am looking into building a simple solution where producer services push events to a message queue and then have a streaming service make those available through gRPC streaming API.
Cloud Pub/Sub seems well suited for the job; however, scaling the streaming service means that each copy of that service would need to create its own subscription and delete it before scaling down, which seems unnecessarily complicated and not what the platform was intended for.
On the other hand, Kafka seems to work well for something like this, but I'd like to avoid having to manage the underlying platform itself and instead leverage the cloud infrastructure.
I should also mention that the reason for having a streaming API is to allow streaming towards a frontend (which may not have access to the underlying infrastructure).
Is there a better way to go about doing something like this with the GCP platform without going the route of deploying and managing my own infrastructure?
If you essentially want ephemeral subscriptions, then there are a few things you can set on the Subscription object when you create a subscription:
Set the expiration_policy to a smaller duration. When a subscriber is not receiving messages for that time period, the subscription will be deleted. The tradeoff is that if your subscriber is down due to a transient issue that lasts longer than this period, then the subscription will be deleted. By default, the expiration is 31 days. You can set this as low as 1 day. For pull subscribers, the subscribers simply need to stop issuing requests to Cloud Pub/Sub for the timer on their expiration to start. For push subscriptions, the timer starts based on when no messages are successfully delivered to the endpoint. Therefore, if no messages are published or if the endpoint is returning an error for all pushed messages, the timer is in effect.
Reduce the value of message_retention_duration. This is the time period for which messages are kept in the event a subscriber is not receiving messages and acking them. By default, this is 7 days. You can set it as low as 10 minutes. The tradeoff is that if your subscriber disconnects or gets behind in processing messages by more than this duration, messages older than that will be deleted and the subscriber will not see them.
Subscribers that cleanly shut down could probably just call DeleteSubscription themselves so that the subscription goes away immediately, but for ones that shut down unexpectedly, setting these two properties will minimize the time for which the subscription continues to exist and the number of messages (that will never get delivered) that will be retained.
Keep in mind that Cloud Pub/Sub quotas limit one to 10,000 subscriptions per topic and per project. Therefore, if a lot of subscriptions are created and either active or not cleaned up (manually, or automatically after expiration_policy's ttl has passed), then new subscriptions may not be able to be created.
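A minimal sketch of those two settings with the google-cloud-pubsub client, assuming invented project, topic and subscription names and example durations:

from google.cloud import pubsub_v1
from google.protobuf import duration_pb2

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path("my-project", "stream-events")
subscription_path = subscriber.subscription_path("my-project", "stream-worker-1")

subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        # Delete the subscription after 1 day without subscriber activity.
        "expiration_policy": {"ttl": duration_pb2.Duration(seconds=86400)},
        # Keep unacknowledged messages for at most 10 minutes.
        "message_retention_duration": duration_pb2.Duration(seconds=600),
    }
)

# A worker that shuts down cleanly can delete its subscription immediately.
subscriber.delete_subscription(request={"subscription": subscription_path})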
I think your original idea was better than ephemeral subscriptions, to be honest. I mean, it works, but it feels totally unnatural. It depends on what your requirements are. For example, do clients only need to receive messages while they're connected, or do they all need to get all messages?
Only While Connected
Your original idea was better imo. What I probably would have done is to create a gRPC stream service that clients could connect to. The implementation is essentially an observer pattern. The consumer will receive a message and then iterate through the subscribers to do a "Send" to all of them. From there, any time a client connects to the service, it just registers itself with that observer collection and unregisters when it disconnects. Horizontal scaling is passive since clients are sticky to whatever instance they've connected to.
Everyone always gets the message, even if eventually
The concept is similar to the above, but the client doesn't implicitly un-register from the observer on disconnect. Instead, it would register and un-register explicitly (through a method/command designed to do so). Modify the 'on disconnected' logic to tell the observer list that the client has gone offline. Then the consumer's broadcast logic is slightly different. Now it iterates through the list and says "if online, then send, else queue", and sends the message to an ephemeral queue (that belongs to the client). Then your 'on connect' logic will send all messages that are in the queue to the client before informing the consumer that it's back online. Basically an inbox. Setting up ephemeral, self-deleting queues is really easy in most products like RabbitMQ. I think you'll have to do a bit of managing whether or not it's OK to delete a queue, though. For example, never delete the queue unless the client explicitly unsubscribes or has been inactive for a long time. Fail to do that, and the whole inbox idea falls apart.
The selected answer above is most similar to what I'm describing here, in that the subscription is the queue. If I did this, then I'd probably implement it as an internal bus instead of an observer (since it would be unnecessary): you create a consumer on demand for a connecting client that literally just forwards the message. The message consumer subscribes and unsubscribes based on whether or not the client is connected. As Kamal noted, you'll run into problems if your scale exceeds the maximum number of subscriptions allowed by Pub/Sub. If you find yourself in that position, then you can free yourself from that constraint by implementing the pattern above. It's basically the same pattern, but you shift the responsibility over to your infra, where the only constraint is your own resources.
gRPC makes this mechanism pretty easy. Alternatively, for web, if you're on a Microsoft stack, then SignalR makes this pretty easy too. Clients connect to the hub, and you can publish to all connected clients. The consumer pattern here remains mostly the same, but you don't have to implement the observer pattern by hand.
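As a rough, transport-agnostic illustration of the "only while connected" observer idea (not a full gRPC or SignalR service), here is a minimal sketch; the class and method names are invented.

import threading

class StreamBroadcaster:
    """Fans each consumed message out to the currently connected clients."""

    def __init__(self):
        self._lock = threading.Lock()
        self._clients = {}  # client_id -> callable that pushes to that client's stream

    def register(self, client_id, send):
        # Called from the 'on connect' handler of the streaming service.
        with self._lock:
            self._clients[client_id] = send

    def unregister(self, client_id):
        # Called from the 'on disconnect' handler.
        with self._lock:
            self._clients.pop(client_id, None)

    def on_message(self, message):
        # Called by the queue/Pub/Sub consumer for each incoming message.
        with self._lock:
            targets = list(self._clients.values())
        for send in targets:
            send(message)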

Using Amazon SQS for multiple consumers receiving the same message

I have one primary application sending messages to an SQS queue, and I want 4 consumer applications to consume the same message and process it however they want.
I am not sure which queuing architecture to use for this purpose.
I see the options of Standard SQS, SQS FIFO, (SQS + SNS Topic) & Kinesis.
For the functionality that I want, it seems like either (SQS + SNS Topic) or Kinesis would be the way to go.
But I also have a question regarding Standard SQS & SQS FIFO: is it not possible for all of the consumers to get the same message if I use SQS FIFO or Standard SQS?
I am confused by all the options and overwhelmed by all the information available on the queues, but still unsure which architecture to choose.
My primary sources of information are the Amazon docs and https://www.schibsted.pl/blog/choosing-best-aws-messaging-service/
Some of the questions I went through on stackoverflow:
Link_1: This post answers the question of using multiple consumers with the queue, but I am not sure if it addresses the issue of the same message being consumed by multiple consumers.
Link_2
This one answers why Kinesis can be used for my scenario.
Helpful_Info: I used this article just to understand the differences.
I would really appreciate some help on this. I am trying to read as much as possible, but would definitely appreciate it if someone could help me make the right decision.
This looks like a perfect use case for SNS-SQS fanout notifications: the messages are sent to an SNS "topic", and SNS will deliver them to multiple SQS queues that are "subscribed" to that topic (a setup sketch follows the notes below).
Some notes:
Each consumer application (attached to its own queue) will consume at its own rate; this means it's possible for one or more to "fall behind". In general, that should be OK as long as the consumers are independent: the queue acts as the buffer, so no information is lost.
If you need them to be in sync, that won't work; you should instead use a single queue and a process that synchronously polls the queue and delivers the message to each application.
You can perform similar logic with Kinesis (it's built to have multiple consumers), but the extra development complexity and cost are typically not worthwhile unless you are dealing with very large message volumes.
Kinesis bills by data volume (megabytes), while SQS bills by message count; do the math for your use case.
Don't worry about SQS FIFO unless you need the guarantees it provides around ordering. Plain SQS is already roughly ordered, and will suffice for most use cases.
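As a rough sketch of the fanout wiring with boto3: the topic and queue names are placeholders, and the SQS queue policy that authorizes the topic to send to each queue is omitted for brevity.

import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

topic_arn = sns.create_topic(Name="orders")["TopicArn"]

# One queue per consumer application, each subscribed to the same topic.
for consumer in ("billing", "shipping", "analytics", "email"):
    queue_url = sqs.create_queue(QueueName=f"orders-{consumer}")["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",
        Endpoint=queue_arn,
        # Deliver the raw message body instead of the SNS JSON envelope.
        Attributes={"RawMessageDelivery": "true"},
    )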
For your use case, SNS seems to be a great choice; however, if you want to persist the messages, you can use SQS together with SNS.

Using Amazon SQS with multiple consumers

I have a service-based application that uses Amazon SQS with multiple queues and multiple consumers. I am doing this so that I can implement an event-based architecture and decouple all the services, where the different services react to changes in state of other systems. For example:
Registration Service:
Emits event 'registration-new' when a new user registers.
User Service:
Emits event 'user-updated' when user is updated.
Search Service:
Reads from queue 'registration-new' and indexes user in search.
Reads from queue 'user-updated' and updates user in search.
Metrics Service:
Reads from 'registration-new' queue and sends to Mixpanel.
Reads from queue 'user-updated' and sends to Mixpanel.
I'm having a number of issues:
A message can be received multiple times when doing polling. I can design a lot of the systems to be idempotent, but for some services (such as the metrics service) that would be much more difficult.
A message needs to be manually deleted from the queue in SQS. I have thought of implementing a "message-handling-service" that handles the deletion of messages when all the services have received them (each service would emit a 'message-acknowledged' event after handling a message).
I guess my question is this: what patterns should I use to ensure that I can have multiple consumers for a single queue in SQS, while ensuring that the messages also get delivered and deleted reliably. Thank you for your help.
I think you are doing it wrong.
It looks to me like you are using the same queue to do multiple different things. You are better off using a single queue for a single purpose.
Instead of putting an event into the 'registration-new' queue and then having two different services poll that queue, with BOTH needing to read that message, both doing something different with it, and a third process then needed to delete that message after the other two have processed it, use one queue for one purpose.
Create an 'index-user-search' queue and a 'send-to-mixpanel' queue, so that the search service reads from the search queue, indexes the user and immediately deletes the message, and the Mixpanel service reads from the Mixpanel queue, processes the message and deletes the message.
The registration service, instead of emitting a 'registration-new' event to a single queue, now emits it to two queues.
To take it one step further, add SNS into the mix: have the registration service emit an SNS message to a 'registration-new' topic (not a queue), and then subscribe both of the queues I mentioned above to that topic in a 'fan-out' pattern.
https://aws.amazon.com/blogs/aws/queues-and-notifications-now-best-friends/
Both queues will receive the message, but you only load it into SNS once. If, down the road, a third unrelated service also needs to process 'registration-new' events, you create another queue and subscribe it to the topic as well; it can run with no dependencies on, or knowledge of, what the other services are doing. That is the goal.
The primary use-case for multiple consumers of a queue is scaling-out.
The mechanism that allows for multiple consumers is the Visibility Timeout, which gives a consumer time to process and delete a message without it being consumed concurrently by another consumer.
To address the "At-Least-Once Delivery" property of Standard Queues, the consuming service should be idempotent.
If that isn't possible, one solution is to use FIFO queues, but this mode has a limited message delivery rate and is not compatible with SNS subscriptions.
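To make the visibility timeout and idempotency points concrete, here is a minimal, hypothetical polling consumer; the queue URL is a placeholder and the in-memory set stands in for a real deduplication store.

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/registration-new"  # placeholder
seen_message_ids = set()

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,       # long polling
        VisibilityTimeout=60,     # time this consumer has to process and delete
    )
    for msg in resp.get("Messages", []):
        if msg["MessageId"] not in seen_message_ids:
            seen_message_ids.add(msg["MessageId"])
            print("processing", msg["Body"])  # real work goes here
        # else: duplicate delivery (at-least-once), already handled
        # Delete within the visibility timeout so the message is not redelivered.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])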
AWS even has a tutorial on how to create a fanout scenario using the SNS+SQS combo.
https://aws.amazon.com/getting-started/tutorials/send-fanout-event-notifications/
Too bad it does not support FIFO queues, so you have to be careful to handle out-of-order messages.
It would be nice if they had a consistent hashing solution to have multiple competing consumers while respecting the message order.