I am trying to solve the problem of generating distributed timing events for my application on the amazon cloud:
A server gets a message. As a result the system has to do something within X minutes. My problem is that the system needs to potentially handle millions of messages per second during peak time. Also, during that time interval the server that got the message might crash. So I am looking for a distributed solution that can receive a message, and then fire another message with a guarantee several minutes later.
I could design a sharded system by myself, but I was hoping that some distributed streaming framework can do this easily. But what I found so far are ones that complete transactions immediately.

Your question is somewhat vague, but if you are looking for a fault-tolerant, geographically distributed way of processing messages, then you should take a look at AWS SQS, or SQS in combination with SNS.


Managing SQS Queue Manually vs Lambda Trigger

I'm not sure if this would be better served on ServerFault or Software Engineering, willing to move this post if appropriate.
We have somewhat recently started to move some of our data processing pipeline to use queues to manage individual bits of data, whereas previously we had timed lambdas that would pull all data since last change.
While making this change, we noticed that queues didn't work quite as we had anticipated. We thought lambda would just pull items off the queue as the lambdas had availability. Instead, it seems the AWS-managed lambda trigger grabs a chunk of messages (up to ten) and throws it at the lambda service. If lambda doesn't have availability, the message gets throttled, then replayed after a backoff time, up until our configured replay "error" limit (five). After that, it's thrown into our dead letter queue.
We see a handful of messages per day end up in the dead letter queue as a result of throttling. We then throw these back into the main queue (we have a process to do so every handful of hours). However, we weren't 100% sure throttling was the reason for things being pushed over since nothing indicates why the messages are moved over - we just assumed as much because we weren't getting any error logs for those messages. We contacted Amazon support to ask about this, and they were able to actually confirm the messages were in fact "errored" as a result of throttling.
We asked further into their recommendations for this - this must be a common problem right? They first off suggested upping our replay limit, which seemed an obvious no go. Replays occur for any failure, so that would just hammer our lambdas with bad requests when they came through. Asked also if there's any way to differentiate the errors because we don't care for throttling, we'd happily let those retry a dozen times if needed - but no. The other suggestion they had was to manage the queue ourselves from our lambdas. Build our own code within our lambdas to pull messages and then delete them after processing. This seems really counter-intuitive, though - why would every AWS consumer build their own infrastructure?
So I guess my question is, is this what others are doing? Are you using the built-in lambda triggers? Are you creating your own code for managing queue consumption? Do you see these sorts of throttling, or is there anything we could do differently? Is there any difference with other services to manage this?
Best practice is to handle errors in your code and manually delete messages that have succeeded. That allows you to handle poison messages without reprocessing the good messages again. Throttles shouldn't be ending up in a DLQ that often. This video from re:Invent 2020 has a good explanation of how this works: "Scalable serverless event-driven architectures with SNS, SQS & Lambda". Start at about the 20-minute mark to get into SQS error handling.
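As a rough sketch of "delete what succeeded, fail the rest", here is what a Lambda handler body could look like in Python. It assumes a boto3-style SQS client is passed in (for example `boto3.client("sqs")`); `handle_batch`, `process`, and `queue_url` are hypothetical names used only for illustration, not anything from the talk.

```python
def handle_batch(event, sqs, queue_url, process):
    """Process each record of an SQS-triggered Lambda event.

    event:     the Lambda SQS event (a dict with a "Records" list).
    sqs:       a boto3-style SQS client (assumed, passed in by the caller).
    process:   hypothetical callable holding the business logic.
    """
    failed = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            # Leave the message in the queue so it can be retried.
            failed.append(record["messageId"])
            continue
        # Success: delete explicitly, so a failure elsewhere in the batch
        # does not cause this message to be delivered again.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=record["receiptHandle"],
        )
    if failed:
        # Raising marks the invocation as failed; only the messages that
        # were not deleted become visible again after the visibility timeout.
        raise RuntimeError(f"{len(failed)} message(s) failed: {failed}")
```

With this shape, a single poison message only costs retries for itself rather than for the whole batch of up to ten.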

What are the possible use cases for Amazon SQS or any Queue Service?

So I have been trying to get my hands on Amazon's AWS since my company's whole infrastructure is based on it.
One component I have never been able to understand properly is the Queue Service. I have searched Google quite a bit but I haven't been able to get a satisfactory answer. I think a cron job and a Queue Service are somewhat similar, correct me if I am wrong.
So what exactly does SQS do? As far as I understand, it stores simple messages to be used by other components in AWS to do tasks, and you can send messages to do that.
In this question, Can someone explain to me what Amazon Web Services components are used in a normal web service?; the answer mentioned they used SQS to queue tasks they want performed asynchronously. Why not just give a message back to the user & do the processing later on? Why wait for SQS to do its stuff?
Also, let's just say I have a web app which allows users to schedule some daily tasks, how would SQS fit in that?
No, cron and SQS are not similar. One (cron) schedules jobs while the other (SQS) stores messages. Queues are used to decouple message producers from message consumers. This is one way to architect for scale and reliability.
Let's say you've built a mobile voting app for a popular TV show and 5 to 25 million viewers are all voting at the same time (at the end of each performance). How are you going to handle that many votes in such a short space of time (say, 15 seconds)? You could build a significant web server tier and database back-end that could handle millions of messages per second but that would be expensive, you'd have to pre-provision for maximum expected workload, and it would not be resilient (for example to database failure or throttling). If few people voted then you're overpaying for infrastructure; if voting went crazy then votes could be lost.
A better solution would use some queuing mechanism that decoupled the voting apps from your service where the vote queue was highly scalable so it could happily absorb 10 messages/sec or 10 million messages/sec. Then you would have an application tier pulling messages from that queue as fast as possible to tally the votes.
One thing I would add to @jarmod's excellent and succinct answer is that the size of the messages does matter. For example in AWS, the maximum size is just 256 KB unless you use the Extended Client Library, which increases the max to 2 GB. But note that it uses S3 as temporary storage.
In RabbitMQ the practical limit is around 100 KB. There is no hard-coded limit in RabbitMQ, but the system simply stalls more or less often. From personal experience, RabbitMQ can handle a steady stream of around 1 MB messages for about 1 - 2 hours non-stop, but then it will start to behave erratically, often becoming a zombie and you'll need to restart the process.
SQS is a great way to decouple services, especially when there is a lot of heavy-duty, batch-oriented processing required.
For example, let's say you have a service where people upload photos from their mobile devices. Once the photos are uploaded your service needs to do a bunch of processing of the photos, e.g. scaling them to different sizes, applying different filters, extracting metadata, etc.
One way to accomplish this would be to post a message to an SQS queue (or perhaps multiple messages to multiple queues, depending on how you architect it). The message(s) describe work that needs to be performed on the newly uploaded image file. Once the message has been written to SQS, your application can return a success to the user because you know that you have the image file and you have scheduled the processing.
In the background, you can have servers reading messages from SQS and performing the work specified in the messages. If one of those servers dies another one will pick up the message and perform the work. SQS guarantees that a message will be delivered eventually so you can be confident that the work will eventually get done.
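To make the photo example a bit more concrete, here is a minimal sketch in Python of the two halves: the web tier enqueuing a job and a background worker draining the queue. It assumes a boto3-style SQS client is passed in; `enqueue_photo_job`, `work_once`, `perform`, and the message layout (`photo_key`, `tasks`) are made up for illustration.

```python
import json


def enqueue_photo_job(sqs, queue_url, photo_key, tasks):
    """Web tier: describe the work, then return to the user right away."""
    body = json.dumps({"photo_key": photo_key, "tasks": tasks})
    resp = sqs.send_message(QueueUrl=queue_url, MessageBody=body)
    return resp["MessageId"]


def work_once(sqs, queue_url, perform):
    """Worker: fetch up to ten jobs, run them, delete the finished ones."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling
    )
    done = 0
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])
        # If perform() raises, or the worker dies here, the message is not
        # deleted and becomes visible to other workers again after the
        # visibility timeout.
        perform(job)
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=msg["ReceiptHandle"],
        )
        done += 1
    return done
```

A worker would simply call `work_once` in a loop. Because the same message can occasionally be delivered more than once, `perform` should be safe to run twice on the same job.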

When to use delay queue feature of Amazon SQS?

I understand the concept of delay queue of Amazon SQS, but I wonder why it is useful.
What's the usage of SQS delay queue?
One use case I can think of is in distributed applications that have eventual consistency semantics. The system consuming the message may have a dependency, such as a correlation identifier that must be available, and hence may need to wait for a certain guaranteed duration before it can see the correlation data. In this case, it makes sense for the message to be delayed for that duration.
Like you I was confused as to a use-case for delay queues, until I stumbled across one in my own work. My application needs to have an internal queue with each item waiting at least one minute between each check for completion.
So instead of having to manage a "last-checked-time" on every object, I just shove the object's ID into an SQS queue message with a delay time of 60 seconds, and my main loop then becomes a simple long-poll against the queue.
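Roughly, that loop could look like the sketch below (Python, with a boto3-style SQS client passed in). `schedule_check`, `check_once`, and `is_complete` are hypothetical names; the `DelaySeconds` and `WaitTimeSeconds` parameters are the standard SQS ones.

```python
def schedule_check(sqs, queue_url, object_id, delay_seconds=60):
    """Enqueue an object ID that only becomes visible after the delay."""
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=str(object_id),
        DelaySeconds=delay_seconds,  # SQS accepts 0 to 900 seconds
    )


def check_once(sqs, queue_url, is_complete):
    """One pass of the main loop; returns the IDs that were complete."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long poll instead of busy-waiting
    )
    finished = []
    for msg in resp.get("Messages", []):
        object_id = msg["Body"]
        if is_complete(object_id):
            finished.append(object_id)
        else:
            # Not done yet: look again in a minute.
            schedule_check(sqs, queue_url, object_id)
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=msg["ReceiptHandle"],
        )
    return finished
```

The queue itself keeps track of when each object is next due, so no per-object timestamp is needed.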
A few off the top of my head:
Emails - Let's say you have a service that sends reminder emails triggered from queue messages. You'd have to delay enqueueing the message in that case.
Race conditions - Delivery delays can be used to overcome race conditions in distributed systems. For example, a service could insert a row into a table and send a message about its availability to other services. They can't use the new entry just yet, so you have to delay publishing the SQS message.
Handling retries - Sometimes if a message fails you want to retry with exponential backoffs. This requires re-enqueuing the message with longer delays.
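For the retry case, a minimal sketch of re-enqueuing with exponential backoff might look like this (Python, boto3-style client passed in). The base of 5 seconds, the `attempt` message attribute, and the function names are assumptions for illustration; the 900-second cap is the SQS maximum for `DelaySeconds`.

```python
MAX_SQS_DELAY = 900  # SQS caps DelaySeconds at 15 minutes


def backoff_delay(attempt, base=5, cap=MAX_SQS_DELAY):
    """Seconds to wait before retry number `attempt` (1-based)."""
    return min(cap, base * 2 ** (attempt - 1))


def requeue_with_backoff(sqs, queue_url, body, attempt):
    """Put a failed message back with a delay that doubles per attempt."""
    delay = backoff_delay(attempt)
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        DelaySeconds=delay,
        # Carry the attempt number along so the next failure can increase it.
        MessageAttributes={
            "attempt": {"DataType": "Number", "StringValue": str(attempt)},
        },
    )
    return delay
```

With the defaults, attempts 1, 2, 3, 4 wait 5, 10, 20, 40 seconds, and the delay stops growing once it reaches 900 seconds.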
One use-case can be:
Think of a time-critical operation like a scheduled equity trade order.
Suppose one of your systems fetches all the orders scheduled in the next 60 minutes and puts them in a queue (from which another subsystem will fetch them).
If you send these orders directly, they become visible in the queue immediately and are processed in queue order.
Most likely they will then not execute at the exact time (hour:minute:second) the customer wanted, and this will affect the outcome.
To solve this, the first subsystem adds a delay in seconds (the difference between the current time and the execution time), so each message only becomes visible after that delay, that is, at the time the user wanted.
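The delay computation itself is small; a sketch in Python follows. One caveat worth stating: SQS caps `DelaySeconds` at 900 seconds (15 minutes), so a 60-minute look-ahead cannot be expressed as a single delay. The sketch below simply refuses such orders so that the caller can pick them up in a later pass; that policy and the name `delay_seconds_until` are assumptions, not part of the answer above.

```python
from datetime import datetime, timezone

MAX_SQS_DELAY = 900  # SQS limit for DelaySeconds (15 minutes)


def delay_seconds_until(execute_at, now=None):
    """Delay to put on a message so it becomes visible at `execute_at`.

    Both arguments are timezone-aware datetimes. Orders already due get a
    delay of 0. Orders further away than the SQS limit raise ValueError,
    because clamping them would make them visible too early.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    seconds = int((execute_at - now).total_seconds())
    if seconds > MAX_SQS_DELAY:
        raise ValueError("execution time is beyond the SQS delay limit")
    return max(0, seconds)
```

The result would be passed as `DelaySeconds` when sending the order message. Note that the delay only controls when the message becomes visible; the consumer still has to pick it up promptly.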

does MSMQ have "lock until expire" functionality similar to Amazon SQS?

I've been using AWS SQS, which has a nice feature that when a message is claimed from the queue it locks for a period of time. During this lock if it is processed successfully the message is marked as completed. If the processing fails (and no response is received from the message processor), after a period of time the lock expires and the message is available for another processor to pick up.
Now I have a requirement to use queues outside of SQS (mostly for latency reasons, but potentially for cost reasons too). I'm really looking for a queue provider that has the same characteristic. MSMQ would be the obvious choice for me, since it's already installed and we use it elsewhere, but I can't find any functionality that handles failed messages in the same way.
Does MSMQ allow for this, or is there an easy way to replicate it?
Alternatively, is there another lightweight, open-source messaging service that does?
MSMQ does this already. If you read a message within a transaction and the transaction aborts then the message will reappear in the queue.

SQS/task-queue job retry count strategy?

I'm implementing a task queue with Amazon SQS (but I guess the question applies to any task queue), where the workers are expected to take different actions depending on how many times the job has been retried already (move it to a different queue, increase the visibility timeout, send an alert, etc.).
What would be the best way to keep track of the failed job count? I'd like to avoid having to keep a centralized db for job:retry-count records. Should I look at time spent in the queue instead, in a monitoring process? IMO that would be ugly or unclean at best, iterating over jobs until I find ancient ones.
There is another simpler way. With your message you can request ApproximateReceiveCount information and base your retry logic on that. This way you won't have to keep it in the database and can calculate it from the message itself.
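A minimal sketch of that in Python, assuming a boto3-style SQS client is passed in. `ApproximateReceiveCount` comes back as a string in the message's `Attributes` map when it is requested. The thresholds and action names in `choose_action` are hypothetical.

```python
def receive_with_counts(sqs, queue_url):
    """Receive messages together with how often each has been received."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateReceiveCount"],
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    return [
        (msg, int(msg["Attributes"]["ApproximateReceiveCount"]))
        for msg in resp.get("Messages", [])
    ]


def choose_action(receive_count, alert_after=3, give_up_after=5):
    """Pick a retry strategy from the receive count (1 on first delivery)."""
    if receive_count >= give_up_after:
        return "move_to_failed_queue"
    if receive_count >= alert_after:
        return "process_and_alert"
    return "process"
```

As the name says, the count is approximate, so it is better suited to coarse thresholds like these than to exact bookkeeping.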
I've had good success combining SQS with SimpleDB. It is "centralized", but only as much as SQS is.
Every job gets a record in SimpleDB and a task in SQS. You can put any information you like in SimpleDB, like the job creation time. When a worker pulls a job from the queue it can grab the corresponding record from SimpleDB to determine its history. You can see how old the job is, and you can see how many times it has been attempted. Once you're done, you can add worker data to the SimpleDB record (completion time, outcome, logs, errors, stack trace, whatever) and acknowledge the message from SQS.
I prefer this method because it helps diagnose faults by providing lots of debug info for failed tasks. It also allows workers to handle the job differently depending on how long the job has been queued, how many failures it's had, etc.
It also gives you the ability to query SimpleDB directly and calculate things like average time per task, percent failure rate, etc.
Amazon just released Simple Workflow Service (SWF), which you can think of as a more sophisticated/flexible version of GAE Task Queues.
It will let you monitor your tasks (with heartbeats), configure retry strategies and create complicated workflows. It looks pretty promising, abstracting out task dependencies, scheduling and fault tolerance for tasks (especially asynchronous ones).
