AWS Event-Sourcing implementation - amazon-web-services

I'm quite a newbie to microservices and Event-Sourcing, and I'm trying to figure out a way to deploy a whole system on AWS.
As far as I know, there are two ways to implement an event-driven architecture:
Using AWS Kinesis Data Streams
Using AWS SNS + SQS
So my base strategy is that every command is converted to an event, which is stored in DynamoDB, and DynamoDB Streams is used to notify other microservices about the new event. But how? Which of the two solutions above should I use?
The first one has the advantages of:
Message ordering
At-least-once delivery
But the disadvantages are quite problematic:
No built-in autoscaling (you can achieve it using triggers)
No message visibility functionality (apparently; I'm asking to confirm that)
No topic subscriptions
Very strict read throughput limits: you can improve this using multiple shards, but from what I read you must then have an ill-defined number of Lambdas with different invocation priorities and an ill-defined strategy to avoid duplicate processing across multiple instances of the same microservice.
The second one has the advantages of:
Completely managed
Very high TPS
Topic subscriptions
Message visibility functionality
Drawbacks:
SQS messages have only best-effort ordering, and I still have no idea what that means.
It says: "A standard queue makes a best effort to preserve the order of messages, but more than one copy of a message might be delivered out of order."
Does it mean that, given n copies of a message, the first copy is delivered in order while the others are delivered out of order relative to the other messages' copies? Or could "more than one" actually be "all"?
Many thanks for any kind of advice!

I'm quite a newbie to microservices and Event-Sourcing
Review Greg Young's talk Polyglot Data for more insight into what follows.
Sharing events across service boundaries has two basic approaches - a push model and a pull model. For subscribers that care about the ordering of events, a pull model is "simpler" to maintain.
The basic idea is that each subscriber tracks its own high-water mark for how many events in a stream it has processed, and queries an ordered representation of the event list to get updates.
In AWS, you would normally get this representation by querying the authoritative service for the updated event list (the implementation of which could include paging). The service might provide the list of events by querying DynamoDB directly, or by getting the most recent key from DynamoDB and then looking up cached representations of the events in S3.
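As a rough sketch of that pull side, a subscriber might keep its own high-water mark and page through an events table like this (the table and attribute names are purely illustrative, assuming a stream-id partition key and a numeric sequence sort key):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
events_table = dynamodb.Table("events")  # hypothetical table name

def fetch_events_since(stream_id, high_water_mark, page_size=100):
    """Return events with a sequence number greater than the subscriber's
    own high-water mark, in order."""
    response = events_table.query(
        KeyConditionExpression=(
            Key("stream_id").eq(stream_id) & Key("sequence").gt(high_water_mark)
        ),
        Limit=page_size,
        ScanIndexForward=True,  # ascending, i.e. oldest first
    )
    return response["Items"]

# The subscriber processes the page, records the last sequence number it
# handled as its new high-water mark, and asks again when notified (or polls).
```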
In this approach, the "events" that are being pushed out of the system are really just notifications, allowing the subscribers to reduce the latency between the write into Dynamo and their own read.
I would normally reach for SNS (fan-out) for broadcasting notifications. Consumers that need bookkeeping support for which notifications they have handled would use SQS. But the primary channel for communicating the ordered events is pull.
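The notification itself can stay tiny, just a pointer telling subscribers that new events exist. A minimal sketch of publishing such a notification to SNS after the write to Dynamo (the topic ARN and message fields are illustrative assumptions):

```python
import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:order-events"  # hypothetical topic

def notify_subscribers(stream_id, latest_sequence):
    # The payload is only a hint; subscribers still pull the ordered events.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"stream_id": stream_id, "latest_sequence": latest_sequence}),
    )
```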
I myself haven't looked hard at Kinesis - there's some general discussion in earlier questions -- but I think Kevin Sookocheff is onto something when he writes
...if you dig a little deeper you will find that Kinesis is well suited for a very particular use case, and if your application doesn’t fit this use case, Kinesis may be a lot more trouble than it’s worth.
Kinesis’ primary use case is collecting, storing and processing real-time continuous data streams. Data streams are data that are generated continuously by thousands of data sources, which typically send in the data records simultaneously, and in small sizes (order of Kilobytes).
Another thing: the fact that I'm accessing data from another microservice's stream is an anti-pattern, isn't it?
Well, part of the point of dividing a system into microservices is to reduce the coupling between the capabilities of the system. Accessing data across the microservice boundaries increases the coupling. So there's some tension there.
But basically, if I'm using a pull model, I need to read data from other microservices' streams. Is that avoidable?
If you query the service you need for the information, rather than digging it out of the stream yourself, you reduce the coupling -- much like asking a service for data rather than reaching into an RDBMS and querying the tables yourself.
If you can avoid sharing the information between services at all, then you get even less coupling.
(Naive example: order fulfillment needs to know when an order has been paid for; so it needs a correlation id when the payment is made, but it doesn't need any of the other billing details.)

Related

Cloud Storage notifications delivery guarantees

I am using Cloud Storage notifications with Pub/Sub in my streaming pipeline.
I read the documentation about the delivery semantics of Cloud Storage notifications, and it says that it supports at-least-once delivery and does not guarantee that events are delivered in the same order in which the objects were uploaded (as I understand it, this means I can get several events with the same generation. Am I right?).
Notifications are not guaranteed to be published in the order Pub/Sub receives them.
Pub/Sub also offers at-least-once delivery to the recipient, which means that you could receive multiple messages, with multiple IDs, that represent the same Cloud Storage event.
I wrote a stateful DoFn in Apache Beam that keeps, as state, the latest (largest) processed generation, so that I can detect out-of-order generations or duplicates. I tested it by uploading objects to Cloud Storage, one every three seconds, but I didn't catch any duplicated events or out-of-order generations.
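(A simplified sketch of that kind of stateful DoFn; names are illustrative, and the input is assumed to be keyed by object name with the generation as an integer:)

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class CheckGenerationOrder(beam.DoFn):
    # Per-key state holding the largest generation processed so far.
    MAX_GEN = ReadModifyWriteStateSpec('max_generation', VarIntCoder())

    def process(self, element, max_gen=beam.DoFn.StateParam(MAX_GEN)):
        object_name, generation = element
        last_seen = max_gen.read() or 0
        if generation <= last_seen:
            # Duplicate or out-of-order delivery for this object.
            yield beam.pvalue.TaggedOutput('suspect', (object_name, generation))
        else:
            max_gen.write(generation)
            yield (object_name, generation)
```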
My question is: what data volume or data velocity would be needed to actually catch duplicated events or out-of-order generations?
Personally I would not try the exercise you are asking for.
The reason is that you may never catch such events during your tests, yet they may still happen in production. And the other way around: you may see them in tests and they may never occur in production.
That's how it's designed; those duplicates may be very rare, depending on Pub/Sub's running status, usage, network traffic, etc.
You just need to accommodate that behavior by making your event handler's logic idempotent.
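For example, a rough sketch of an idempotent handler, assuming the decoded notification payload and a hypothetical `processed_store` that supports an atomic "insert if absent" operation (a unique constraint or conditional write in whatever database you use):

```python
def handle_notification(event):
    # Fields of the Cloud Storage notification payload.
    bucket = event["bucket"]
    name = event["name"]
    generation = int(event["generation"])

    # Atomically record (bucket, name, generation); returns False if this exact
    # generation was already recorded, i.e. a duplicate delivery.
    # `processed_store` is a hypothetical helper, not a real library API.
    first_time = processed_store.insert_if_absent((bucket, name, generation))
    if not first_time:
        return  # duplicate: safe to ignore, the handler already ran for it

    process_object(bucket, name, generation)  # hypothetical business logic
```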
Also, have a look at the Pub/Sub release notes: they recently introduced an "exactly-once delivery" feature (maybe still in beta).

Is SQS better than DynamoDB for peak loads?

A service runs on ECS and writes the requested URL to a DynamoDB table. Dynamic scaling was activated to keep the costs for DynamoDB from becoming too high. DynamoDB scales more slowly than requests come in at any given time, so some calls are not logged. My question now is whether writing to SQS would be the better way here, because the documentation says:
Standard queues support a nearly unlimited number of API calls per second, per API action (SendMessage, ReceiveMessage, or DeleteMessage).
Of course, the messages would then have to be written back to DynamoDB, but another service can then do that.
Is the throughput of messages per second to SQS really unlimited, so it's definitely cheaper to send messages to SQS instead of increasing DynamoDB's writes per second?
I don't know if this qualifies as a good answer. But remembering a discussion with my architect at the time, we concluded that having a queue for precisely this problem seems like good practice, regardless of load. It keeps requests even if services go down, so there is an added benefit.
SQS and Dynamo fit two very different use cases. It's not so much which is better; it's which is right for what you need.
DynamoDB is a NoSQL document-based database. It is best when you have known access patterns to data that needs to persist over time, that you need to access quickly, but that you probably are not changing very often (or at least the changes do not have to be accessible immediately, sub-5 ms). Each document in DynamoDB is similar (but also very different) to a row in a standard SQL table, in that it has attributes (columns) and keys (partition and sort key) and can be retrieved through a query (though dynamic, on-the-fly queries are NOT a good fit for Dynamo).
SQS is a queue system. It has no long-term persistence (messages are retained for at most 14 days). Payloads of JSON objects are dropped into the queue and then processed by some endpoint: a Lambda, a writer into Dynamo, or something else entirely, depending on your product's use case. It is perfect for when you often receive bursts of data but your system needs some time to handle each individual payload, such as when it has to wait for other systems to finish before it can handle the next one; instead of scaling horizontally (just handling all the payloads in parallel), you scale vertically (handle more payloads over time through a single or only a few workers). You cannot access the data while it is waiting in the queue, and you cannot query it; you can only wait until a message pops off the queue and into processing by whatever system you have set up to receive it.
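For the write-back part the question mentions, a minimal sketch might be a Lambda function with the SQS queue configured as its event source, batching the messages into DynamoDB (the table name and message shape below are illustrative assumptions):

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("request-log")  # hypothetical table name

def handler(event, context):
    # An SQS event source delivers up to the configured batch size per invoke.
    with table.batch_writer() as batch:
        for record in event["Records"]:
            body = json.loads(record["body"])
            batch.put_item(Item={
                "request_id": record["messageId"],
                "url": body["url"],            # illustrative attribute
                "timestamp": body["timestamp"],  # illustrative attribute
            })
```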
The answer to your question is entirely dependent on your use case and your system -- something we here at SO will never really understand or know, simply because we will always be hearing about it through you and never experiencing it directly. As such, to answer it, you need to understand the capabilities of both Dynamo and SQS, the pros and cons of each, and then determine which is best for your product.

Is an SQS needed with a Lambda in this use case?

I'm trying to build a flow that allows a user to enter data, which is then stored in RDS. My question is: do I need to go USER -> SQS -> Lambda -> RDS, or is it better to go straight USER -> Lambda -> RDS and skip the queue entirely? Are there going to be scalability issues with the latter?
I do like that SQS can retry a large number of times to help guarantee the data is stored, but is there a similar way to retry with a Lambda alone? It's important that all of the data is stored, and in a timely manner. I'm struggling to see the tradeoffs between the two scenarios.
If anyone has any input on the situation, that would be amazing.
Are there going to be scalability issues with the latter?
It depends on multiple metrics, including traffic, spikes, size of the database, requests per minute, etc.
Putting SQS in front of the Lambda lets you manage the number of database queries per unit of time according to your needs. It is a "queue" and you are consuming that queue. In some business cases it may not be useful (banking transactions, etc.), but in others (analytic calculations) it may be helpful. Instead of making a single insert whenever the Lambda is invoked, you can set a batch size and insert in batches (e.g. 10 records at once), which reduces the number of queries.
You can also define a dead-letter queue for your problematic data (records that couldn't make it to the database). It will be another queue that you can check later to identify problematic inputs. The documentation can be found here.
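A minimal sketch of that dead-letter queue wiring with boto3 (the queue names and maxReceiveCount are illustrative assumptions):

```python
import json
import boto3

sqs = boto3.client("sqs")

# Create the dead-letter queue and look up its ARN.
dlq_url = sqs.create_queue(QueueName="ingest-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Create the main queue; after 5 failed receives a message moves to the DLQ.
sqs.create_queue(
    QueueName="ingest",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": "5",
        })
    },
)
```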

AWS Lambda - Store state of a queue

I'm currently tasked with building a serverless architecture for communication between government agencies and citizens, and a main component is some form of queue that contains some form of object/pointer to each citizen's request, sorted by priority. The government workers can then process an element when one is available. As Lambda is stateless, I need to save the queue externally in some manner.
For saving state, I've gathered that you can use DynamoDB or S3 buckets and use event triggers to invoke related Lambda functions. Some also suggest using Parameter Store to save some state variables. Storing things globally has also come up, though as you can't guarantee that the Lambda doesn't terminate, it doesn't seem like a good idea.
Finally, I've also read a bit about SQS, though I have no idea if it is at all applicable to this case.
What is the best-practice / suggested approach when working with Lambda in this way? I'm leaning towards S3 Buckets, due to event triggering, and not using DynamoDB as our DB.
Storing things globally has also come up, though as you can't guarantee that the Lambda doesn't terminate, it doesn't seem like a good idea.
Correct -- this is not viable at all. Note that what you are actually referring to when you say "the Lambda" is the process inside the container... and any time your Lambda function is handling more than one invocation concurrently, you are guaranteed that they will not be running in the same container -- so "global" variables are only useful for optimization, not state. Any two concurrent invocations of the same function have two entirely different global environments.
Forgetting all about Lambda for a moment -- I am not saying don't use Lambda; I'm saying that whether or not you use Lambda isn't relevant to the rest of what is written, below -- I would suggest that parallel/concurrent actions in general are perhaps one of the most important factors that many developers tend to overlook when trying to design something like you are describing.
How you will assign work from this work "queue" is extremely important to consider. You can't just "find the next item" and display it to a worker.
You must have a way to do all of these things:
finding the next item that appears to be available
verify that it is indeed available
assign it to a specific worker
mark it as unavailable for assignment
Not only that, but you have to be able to do all of these things atomically -- as a single logical action -- and without collisions.
A naïve implementation runs the risk of assigning the same work item to two or more people, with the first assignment being blindly and silently overwritten by subsequent assignments that happen at almost the same time.
DynamoDB allows conditional updates -- update a record if and only if a certain condition is true. This is a critical piece of functionality that your solution needs to accommodate -- for example, assign work item x to user y if and only if item x is currently unassigned. A conditional update will fail, and changes nothing, if the condition is not true at the instant the update happens and therein lies the power of the feature.
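A minimal sketch of such a conditional assignment with boto3, assuming a hypothetical "work-items" table; the attribute names are illustrative:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("work-items")  # hypothetical table

def try_assign(item_id, worker_id):
    """Assign item_id to worker_id only if it is currently unassigned.
    Returns True on success, False if somebody else got it first."""
    try:
        table.update_item(
            Key={"item_id": item_id},
            UpdateExpression="SET assigned_to = :w, #s = :assigned",
            ConditionExpression="attribute_not_exists(assigned_to)",
            ExpressionAttributeNames={"#s": "status"},  # "status" is a DynamoDB reserved word
            ExpressionAttributeValues={":w": worker_id, ":assigned": "ASSIGNED"},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # lost the race: the item was already assigned
        raise
```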
S3 does not support conditional updates, because unlike DynamoDB, S3 operates only on an eventual-consistency model in most cases. After an object in S3 is updated or deleted, there is no guarantee that the next request to S3 will return the most recent version or that S3 will not return an item that has recently been deleted. This is not a defect in S3 -- it's an optimization -- but it makes S3 unsuited to the "work queue" aspect.
Skip this consideration and you will have a system that appears to work, and works correctly much of the time... but at other times, it "mysteriously" behaves wrongly.
Of course, if your work items have accompanying documents (scanned images, PDF, etc.), it's quite correct to store them in S3... but S3 is the wrong tool for storing "state." SSM Parameter Store is the wrong tool, for the same reason -- there is no way for two actions to work cooperatively when they both need to modify the "state" at the same time.
"Event triggers" are useful, of course, but from your description, the most notable "event" is not from the data, or the creation of the work item, but rather it is when the worker says "I'm ready for my next work item." It is at that point -- triggered by the web site/application code -- when the steps above are executed to select an item and assign it to a worker. (In practice, this could be browser → API Gateway → Lambda). From your description, there may be no need for the creation of a new work item to trigger an "event," or if there is, it is not the most significant among the events.
You will need a proper database for this. DynamoDB is a candidate, as is RDS.
The queues provided by SQS are designed to decouple two parts of your application -- when two processes run at different speeds, SQS is used as a buffer, allowing X to safely store the work needing to be done and then continue with something else until Y is able to do the work. SQS queues are opaque -- you can't introspect what's in the queue, you just take the next message and are responsible for handling it. On its face, that seems to partially describe what you need, but it is not a clean match for this use case. Queues are limited in how long messages can be retained, and once a message is successfully processed, it is completely gone.
Note also that SQS is only a match to your use case with the FIFO queue feature enabled, which guarantees perfect in-order delivery and exactly-once processing -- standard SQS queues, for performance optimization reasons, do not guarantee perfect in-order delivery and may under certain conditions deliver the same message more than once, to the same consumer or a different consumer. But the SQS FIFO queue feature does not coexist with event triggers, which require standard queues.
So SQS may have a role, but you need an authoritative database to store the work and the results of the business process.
If you need to store the message, then SQS is not the best tool here, because your Lambda function would then need to process the message and finally store it somewhere, making SQS nothing but a broker.
The S3 approach gives what you need out of the box, considering you can store the files (messages) in an S3 bucket and then have one Lambda consume its event. Your Lambda would then process this event and the file would remain safe and sound on S3.
If you eventually need multiple consumers for this message, then you can send the S3 event to SNS instead and finally you could subscribe N Lambda Functions to a given SNS topic.
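A minimal sketch of the single-consumer case, assuming a Lambda function triggered directly by the S3 event (the processing function is a placeholder):

```python
import boto3

s3 = boto3.client("s3")

def process_message(body):
    # Placeholder for the real business logic.
    print("received %d bytes" % len(body))

def handler(event, context):
    # One invocation can carry several records; each points at the stored file.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        process_message(body)
```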
You appear to be worrying too much about the infrastructure at this stage and not enough on the application design. The fact that it will be serverless does not change the basic functionality of the application — it will still present a UI to users, they will still choose options that must trigger some business logic and information will still be stored in a database.
The queue you describe is merely a datastore of messages that are in a particular state. The application will have some form of business logic for determining the next message to handle, which could be based on creation timestamp, priority, location, category, user (eg VIP users who get faster response), specialization of the staff member asking for the next message, etc. This is not a "queue" but rather a calculation to be performed against all 'unresolved' messages to determine the next message to assign.
If you wish to go serverless, then the back-end will certainly be using Lambda and a database (eg DynamoDB or Amazon RDS). The application should store everything in the database so that data is available for the application's business logic. There is no need to use SQS since there really isn't a "queue", and Parameter Store is merely a way of sharing parameters amongst application components — it is not meant for core data storage.
Determine the application functionality first, then determine the appropriate architecture to make it happen.

Microservice for notifications

Most of us are familiar with the notifications that pop up at the top right of the Facebook page. After we ack them, we can still see them, ordered latest-first, etc.
I have a question about designing such a thing (I'm also trying to do some reading on the side about the architecture for this). So at a high level, let us say there are these microservices running in the system:
S1
S2
S3
So let us say I have services like that. I am thinking of creating a new service, say S4, that can receive messages from these services. I am less worried about how these services will talk to S4; there are options like SQS, Kafka, etc.
My main question is how S4 can do the following:
Maintain these notifications in what? PostgreSQL?
Have a column with timestamp?
Do I need a timeseries DB for it?
How can I store them based on severity? Fatal, Info, critical?
How can someone ack a notification?
Below are some thoughts, but I don't think there is a one-size-fits-all approach. It really depends on the rest of your architecture, domain requirements, scalability requirements, etc.
Maintain these notifications in what? PostgreSQL?
Do you really need to store the notifications in addition to your event queues? It may depend on whether your queue design allows just-in-time access based on topics and message content. Many queue systems allow you to store a message indefinitely/until processed (check out Apache Pulsar). In that case there is a backlog of unacknowledged messages that can be accessed and processed whenever ready (e.g. when the client confirms the message was read). After acknowledgement you can archive or delete the events.
If you decide that doesn't work and you need an additional database, the usual suspects come into play: document or key-value stores, such as MongoDB, or a relational DB, such as CockroachDB.
Have a column with timestamp?
Most queue systems record a timestamp by default. Sometimes the queue order alone could be enough. Depending on your query requirements, a schema-less format (e.g. JSON or a blob) for the message content and some explicit identifiers for finding the messages (by user, etc.) would be required as a minimum in most cases.
Do I need a timeseries DB for it?
Possibly useful if you need to track a large number of values changing over time, e.g. IoT-related sensor data such as temperature measurements. If you just have a handful of Facebook-style notifications per user, that doesn't seem like a good fit.
How can I store them based on severity? Fatal, Info, critical?
Here you could use separation by topic in your event queues; if you use a separate DB, it again depends on whether you use a schema-based or schema-less DB.
How can someone ack a notification?
As previously mentioned, modern event queue systems have that built in; if you use a separate DB, you'll have to add it to the schema (or use a schema-less DB). In either case you need to define or implement the behavior for message retention after acknowledgment, e.g. whether to delete or archive.
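Tying these pieces together, a rough sketch using MongoDB (one of the database options mentioned above); the collection layout and field names are illustrative assumptions:

```python
from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING

notifications = MongoClient()["app"]["notifications"]  # hypothetical db/collection

# A service (S1/S2/S3) emits a notification; S4 stores it.
result = notifications.insert_one({
    "user_id": "u-123",
    "severity": "critical",  # e.g. fatal / critical / info
    "payload": {"text": "Disk almost full on host-7"},
    "created_at": datetime.now(timezone.utc),
    "acked_at": None,
})

# Latest-first listing for a user (Facebook-style dropdown).
latest = notifications.find({"user_id": "u-123"}).sort("created_at", DESCENDING).limit(20)

# Acknowledging a single notification.
notifications.update_one(
    {"_id": result.inserted_id, "acked_at": None},
    {"$set": {"acked_at": datetime.now(timezone.utc)}},
)
```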