I would like to write a command line tool that, among other things, monitors CloudFormation stack events for a particular stack and pretty-prints them to the console.
The out-of-the box solution is to repeatedly poll describe-stack-events. This is not ideal — you get the latest 200 events every time, and you're limited by the polling rate.
CloudFormation can also log events to an SNS topic. However, as far as I know, SNS is designed to send events to long-lasting destinations like lambdas and email addresses; it's not made for a script to temporarily listen to a topic.
I don't know much about AWS's various event and queueing services. I'm looking for a solution that is:
Cheap/easy.
Low-latency. (I'd like to see events in real time as a deployment is happening.)
Amenable to short-term subscriptions from a Node app or somesuch on my local machine.
I do not need:
Any sort of storage or view into past events.
Very high reliability. (This is a convenience tool for devs.)
Suggestions?
Related
Problem Statement
Informal State
We have some scenarios where the integration layer (a combination of AWS SNS/SQS components and etc.) is also responsible for the data distribution to target systems. Those are mostly async flows. In this case, we send a confirmation to a caller that we have received the data and will take a responsibility for the data delivery. Here, although the data is not originated from the integration layer we are still holding it and need to make sure that the data is not lost, for example, if the consumers are down or if messages, on-error, are sent to the DLQs and hence being automatically deleted after the retention period.
Solution Design
Currently my idea was to proceed with a back-up of the SQS/DLQ queues based upon CloudWatch configured alerts using ApproximateAgeOfOldestMessage metric with some applied threshold (something like the below):
Msg Expiration Event if ApproximateAgeOfOldestMessage / Message retention > Threshold
Now, more I go forward with this idea and more I doubt that this might be actually the right approach…
In particular, I would like to build something unobtrusive that can be "attached" to our SQS queues and dump the messages that are about to expire in some repository, like for example the AWS S3. Then have a procedure to recover the messages from S3 to the same original queue.
The above procedure contains many challenges like: message identification and consumption (receive message is not design to "query" for specific messages), message dump in the repository with a reference to the source queue, etc. which would suggest to me that the above approach might be a complex over-kill.
That being said, I'm aware of other "alternatives" (such as this) but I would appreciate if you could answer to the specific technical details described above, without trying to challenge the "need" instead.
Similar to Mark B's suggestion, you can use the SQS extended client (https://github.com/awslabs/amazon-sqs-java-extended-client-lib) to send all your messages through S3 (which is a configuration knob: https://github.com/awslabs/amazon-sqs-java-extended-client-lib/blob/master/src/main/java/com/amazon/sqs/javamessaging/ExtendedClientConfiguration.java#L189).
The extended client is a drop-in replacement for the AmazonSQS interface so it minimizes the intrusion on business logic - usually it's a matter of just changing your dependency injection.
I am looking into building a simple solution where producer services push events to a message queue and then have a streaming service make those available through gRPC streaming API.
Cloud Pub/Sub seems well suited for the job however scaling the streaming service means that each copy of that service would need to create its own subscription and delete it before scaling down and that seems unnecessarily complicated and not what the platform was intended for.
On the other hand Kafka seems to work well for something like this but I'd like to avoid having to manage the underlying platform itself and instead leverage the cloud infrastructure.
I should also mention that the reason for having a streaming API is to allow for streaming towards a frontend (who may not have access to the underlying infrastructure)
Is there a better way to go about doing something like this with the GCP platform without going the route of deploying and managing my own infrastructure?
If you essentially want ephemeral subscriptions, then there are a few things you can set on the Subscription object when you create a subscription:
Set the expiration_policy to a smaller duration. When a subscriber is not receiving messages for that time period, the subscription will be deleted. The tradeoff is that if your subscriber is down due to a transient issue that lasts longer than this period, then the subscription will be deleted. By default, the expiration is 31 days. You can set this as low as 1 day. For pull subscribers, the subscribers simply need to stop issuing requests to Cloud Pub/Sub for the timer on their expiration to start. For push subscriptions, the timer starts based on when no messages are successfully delivered to the endpoint. Therefore, if no messages are published or if the endpoint is returning an error for all pushed messages, the timer is in effect.
Reduce the value of message_retention_duration. This is the time period for which messages are kept in the event a subscriber is not receiving messages and acking them. By default, this is 7 days. You can set it as low as 10 minutes. The tradeoff is that if your subscriber disconnects or gets behind in processing messages by more than this duration, messages older than that will be deleted and the subscriber will not see them.
Subscribers that cleanly shut down could probably just call DeleteSubscription themselves so that the subscription goes away immediately, but for ones that shut down unexpectedly, setting these two properties will minimize the time for which the subscription continues to exist and the number of messages (that will never get delivered) that will be retained.
Keep in mind that Cloud Pub/Sub quotas limit one to 10,000 subscriptions per topic and per project. Therefore, if a lot of subscriptions are created and either active or not cleaned up (manually, or automatically after expiration_policy's ttl has passed), then new subscriptions may not be able to be created.
I think your original idea was better than ephemeral subscriptions tbh. I mean it works, but it feels totally unnatural. Depending on what your requirements are. For example, do clients only need to receive messages while they're connected or do they all need to get all messages?
Only While Connected
Your original idea was better imo. What I probably would have done is to create a gRPC stream service that clients could connect to. The implementation is essentially an observer pattern. The consumer will receive a message and then iterate through the subscribers to do a "Send" to all of them. From there, any time a client connects to the service, it just registers itself with that observer collection and unregisters when it disconnects. Horizontal scaling is passive since clients are sticky to whatever instance they've connected to.
Everyone always get the message, if eventually
The concept is similar to the above but the client doesn't implicitly un-register from the observer on disconnect. Instead, it would register and un-register explicitly (through a method/command designed to do so). Modify the 'on disconnected' logic to tell the observer list that the client has gone offline. Then the consumer's broadcast logic is slightly different. Now it iterates through the list and says "if online, then send, else queue", and send the message to a ephemeral queue (that belongs to the client). Then your 'on connect' logic will send all messages that are in queue to the client before informing the consumer that it's back online. Basically an inbox. Setting up ephemeral, self-deleting queues is really easy in most products like RabbitMQ. I think you'll have to do a bit of managing whether or not it's ok to delete a queue though. For example, never delete the queue unless the client explicitly unsubscribes or has been inactive for so long. Fail to do that, and the whole inbox idea falls apart.
The selected answer above is most similar to what I'm subscribing here in that the subscription is the queue. If I did this, then I'd probably implement it as an internal bus instead of an observer (since it would be unnecessary) - You create a consumer on demand for a connecting client that literally just forwards the message. The message consumer subscribes and unsubscribes based on whether or not the client is connected. As Kamal noted, you'll run into problems if your scale exceeds the maximum number of subscriptions allowed by pubsub. If you find yourself in that position, then you can unshackle that constraint by implementing the pattern above. It's basically the same pattern but you shift the responsibility over to your infra where the only constraint is your own resources.
gRPC makes this mechanism pretty easy. Alternatively, for web, if you're on a Microsoft stack, then SignalR makes this pretty easy too. Clients connect to the hub, and you can publish to all connected clients. The consumer pattern here remains mostly the same, but you don't have to implement the observer pattern by hand.
(note: arrows in diagram are in the direction of dependency, not data flow)
I am implementing a process in my AWS based hosting business with an event driven architecture on AWS SNS. This is largely a learning experience with a new architecture, programming and hosting paradigm for me.
I have considered AWS Step functions, but have decided to implement a Message Bus with AWS SNS topic(s), because I want to understand the underlying event driven programming model.
Nearly all actions are performed by lambda functions and steps are coupled via SNS and/or SQS.
I am undecided if to implement the process with one or many SNS topics and if I should subscribe the core logic to the message bus(es) with one or many lambda functions.
One or many message buses
My core process currently consist of 9 events which of which 2 sets of 2 can be parallel, the remaining 4 are sequential. Subscribing these all to the same message bus is easier to set up, but requires each lambda function to check if the message is relevant to it, which seems like a waste of resources.
On the other hand I could have 6 message buses and be sure that a notified resource has something to do with the message.
One or many lambda functions
If all lambda functions are subscribed to the same message bus, it may be easier to package them all up with a dispatcher function in a single lambda function. It would also reduce the amount of code to upload to lambda, albeit I don't have to pay for that.
On the other hand I would loose the ability to control the timeout for the lambda function and any changes to the order of events is now dependent on the dispatcher code.
I would still have the ability to scale each process part, as any parts that contain repeating elements are seperated by SQS queues.
You should always emit each type of message to it's own topic, as this allows other services to consume these events without tightly coupling the two services.
Likewise, each worker that wants to consume messages should have it's own queue with it's own subscription to the topic.
Doing the following allows you to add new message consumers for a given event without having to modify the upstream service. Furthermore, responsibility over each component is clear - the service producing messages to a topic owns that topic (and the message format), whereas the consumer owns its queue and event handling semantics.
Your consumer can specify a message filter when subscribing to a topic, so it can only receive messages it cares about (documentation).
For example, a process that sends a customer survey after the customer has received their order would subscribe its queue to the Order Status Changed event with the filter set to only receive events where the new_status field is equal to shipment-received).
The above reflects principles of Service-Oriented architecture - and there's plenty of good material out there elaborating the points above.
So I have been trying to get my hands on Amazon's AWS since my company's whole infrastructure is based of it.
One component I have never been able to understand properly is the Queue Service, I have searched Google quite a bit but I haven't been able to get a satisfactory answer. I think a Cron job and Queue Service are quite similar somewhat, correct me if I am wrong.
So what exactly SQS does? As far as I understand, it stores simple messages to be used by other components in AWS to do tasks & you can send messages to do that.
In this question, Can someone explain to me what Amazon Web Services components are used in a normal web service?; the answer mentioned they used SQS to queue tasks they want performed asynchronously. Why not just give a message back to the user & do the processing later on? Why wait for SQS to do its stuff?
Also, let's just say I have a web app which allows user to schedule some daily tasks, how would SQS would fit in that?
No, cron and SQS are not similar. One (cron) schedules jobs while the other (SQS) stores messages. Queues are used to decouple message producers from message consumers. This is one way to architect for scale and reliability.
Let's say you've built a mobile voting app for a popular TV show and 5 to 25 million viewers are all voting at the same time (at the end of each performance). How are you going to handle that many votes in such a short space of time (say, 15 seconds)? You could build a significant web server tier and database back-end that could handle millions of messages per second but that would be expensive, you'd have to pre-provision for maximum expected workload, and it would not be resilient (for example to database failure or throttling). If few people voted then you're overpaying for infrastructure; if voting went crazy then votes could be lost.
A better solution would use some queuing mechanism that decoupled the voting apps from your service where the vote queue was highly scalable so it could happily absorb 10 messages/sec or 10 million messages/sec. Then you would have an application tier pulling messages from that queue as fast as possible to tally the votes.
One thing I would add to #jarmod's excellent and succinct answer is that the size of the messages does matter. For example in AWS, the maximum size is just 256 KB unless you use the Extended Client Library, which increases the max to 2 GB. But note that it uses S3 as a temporary storage.
In RabbitMQ the practical limit is around 100 KB. There is no hard-coded limit in RabbitMQ, but the system simply stalls more or less often. From personal experience, RabbitMQ can handle a steady stream of around 1 MB messages for about 1 - 2 hours non-stop, but then it will start to behave erratically, often becoming a zombie and you'll need to restart the process.
SQS is a great way to decouple services, especially when there is a lot of heavy-duty, batch-oriented processing required.
For example, let's say you have a service where people upload photos from their mobile devices. Once the photos are uploaded your service needs to do a bunch of processing of the photos, e.g. scaling them to different sizes, applying different filters, extracting metadata, etc.
One way to accomplish this would be to post a message to an SQS queue (or perhaps multiple messages to multiple queues, depending on how you architect it). The message(s) describe work that needs to be performed on the newly uploaded image file. Once the message has been written to SQS, your application can return a success to the user because you know that you have the image file and you have scheduled the processing.
In the background, you can have servers reading messages from SQS and performing the work specified in the messages. If one of those servers dies another one will pick up the message and perform the work. SQS guarantees that a message will be delivered eventually so you can be confident that the work will eventually get done.
I'm implementing a task queue with Amazon SQS ( but i guess the question applies to any task-queue ) , where the workers are expected to take different action depending on how many times the job has been re-tried already ( move it to a different queue, increase visibility timeout, send an alert..etc )
What would be the best way to keep track of failed job count? I'd like to avoid having to keep a centralized db for job:retry-count records. Should i look at time spent in the queue instead in a monitoring process? IMO that would be ugly or un-clean at best, iterating over jobs until i find ancient ones..
thanks!
Andras
There is another simpler way. With your message you can request ApproximateReceiveCount information and base your retry logic on that. This way you won't have to keep it in the database and can calculate it from the message itself.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ReceiveMessage.html
I've had good success combining SQS with SimpleDB. It is "centralized", but only as much as SQS is.
Every job gets a record in simpleDB and a task in SQS. You can put any information you like in SimpleDB like the job creation time. When a worker pulls a job from the queue it can grab the corresponding record from simpleDB to determine it's history. You can see how old the job is, and you can see how many times it has been attempted. Once you're done, you can add worker data to the SimpleDB record (completion time, outcome, logs, errors, stack-trace, whatever) and acknowledge the message from SQS.
I prefer this method because it helps diagnose faults by providing lots of debug info for failed tasks. It also allows workers to handle the job differently depending on how long the job has been queued, how many failures it's had, etc.
It also gives you the ability to query SimpleDB directly and calculate things like average time per task, percent failure rate, etc.
Amazon just released Simple workflow serice (swf) which you can think of as a more sophisticated/flexible version of GAE Task queues.
It will let you monitor your tasks (with hearbeats), configure retry strategies and create complicated workflows. It looks pretty promising abstracting out task dependencies, scheduling and fault tolerance for tasks (esp. asynchronous ones)
Checkout http://docs.amazonwebservices.com/amazonswf/latest/developerguide/swf-dg-intro-to-swf.html for overview.
SQS stands for "Simple Queue Service" which, in concept is the incorrect name for that service. The first and foremost feature of a "Queue" is FIFO (First in, First out), and SQS lacks that. Just wanting to clarify.
Also, Azure Queue Services lacks that as well. For the best cloud Queue service, use Azure's Service Bus since it's a TRUE Queue concept.