I am trying to perform some parallelization on work which is computationally expensive to process on AWS via lambda functions. Specifically, the current architecture consists of a coordinator lambda which invokes several copies of a worker lambda via SNS with metadata specific to each invocation. These workers take the event from SNS to decide which partition of the data to work on, but I need something a bit more dynamic.
I need each worker to be ready to ingest new messages which affect the state of the worker. The one constraint is that these messages are indexed by a key. Initially it does not matter which worker ingests a message with a particular key. What is important is that once a worker accepts such a message (maybe through an acknowledgment?), it can only accept future messages with that specific key.
The number of possible keys is far, far smaller than the number of workers, but the keys themselves are not known in advance. Usually they are determined after fanning out. The number of messages for a given key is around 2-8, each separated by some time. If it is important to the question, maybe we should distinguish case 1 (exactly one worker can commit to a key) from case 2 (multiple workers can commit to the same key).
Example of Case #1:
Desired example of case #1:
Coordinator creates W1, W2, W3.
Coordinator calculates key K1. Begins sending messages indexed by K1.
W1 receives M1:K1. W1 acknowledges message M1:K1 and commits to only seeing key K1.
W2 receives M1:K2 and acknowledges. Acknowledgement fails since W1 already committed to K1.
W1 processes M2:K1.
Coordinator begins sending messages indexed by K2.
W1 doesn't see (or ignores?) M4:K2.
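The commit rule in the example above can be sketched as a first-claim-wins registry. This is only an in-memory illustration of the semantics for case #1; `KeyRegistry` and `try_claim` are hypothetical names, and on AWS the same check would presumably have to be an atomic conditional write in some shared store rather than a Python dict.

```python
class KeyRegistry:
    """First-claim-wins registry: each key belongs to at most one worker (case #1)."""

    def __init__(self):
        self._owner_by_key = {}   # key -> worker id
        self._key_by_worker = {}  # worker id -> key

    def try_claim(self, worker, key):
        """Return True if `worker` owns `key` after this call, False otherwise."""
        owner = self._owner_by_key.get(key)
        if owner is not None:
            # Key already taken: only the original owner may keep using it.
            return owner == worker
        if worker in self._key_by_worker:
            # Worker is already committed to a different key.
            return False
        self._owner_by_key[key] = worker
        self._key_by_worker[worker] = key
        return True


registry = KeyRegistry()
print(registry.try_claim("W1", "K1"))  # True: W1 commits to K1
print(registry.try_claim("W2", "K1"))  # False: K1 already belongs to W1
print(registry.try_claim("W1", "K1"))  # True: W1 keeps processing K1
print(registry.try_claim("W1", "K2"))  # False: W1 is committed to K1
```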
I'm not really sure how to go about designing this change. For example, in case #1 it would be ideal to have a dedicated SQS queue for the lambda that received and acknowledged the first message of a given key. The problem is that the coordinator would need to create the resource on the fly, get the lambda to read from it, etc., which seems very expensive. Maybe I'm misunderstanding SQS, but it doesn't seem to support routing messages of different keys within a queue. SNS probably won't do since no data is persisted. I'm not sure about EventBridge. Another concern is that there will be a lot of herding, where lambdas that haven't committed to a particular message key send acknowledgements to the coordinator which eventually fail. They will fail on basically all keys since there are so many workers compared to keys.
What I'm not looking for
A system which is long-lasting such as EKS. There are usually only 2-3 messages for any given key and processing each message is fairly cheap.
Preferably, once a worker has committed to a key, it does not need to see messages for different keys. This maybe isn't a problem now, but will probably be one if the # of messages is far greater than 2-10.
I am curious for feedback. Thanks.
Related
I have a system where I publish updates to a shared topic meant for specific consumers.
I noticed messages getting stuck in the queue due to a lack of selective listening in SQS consumers, so messages are being hijacked.
Example:
Given: Message{destination: A, payload: 1234}
Given: ConsumerA, & ConsumerB
I expect Message to be processed by ConsumerA. However, it gets hijacked by ConsumerB continuously. ConsumerB receives the message, then refuses to process it since the destination field doesn't match, so the visibility timeout expires and the message is put back on the queue. But due to the nature of SQS, ConsumerB has an equal chance of picking up the message again.
My question is, what patterns are used to solve this type of issue?
I'm considering creating a queue per consumer, but it has drawbacks specific to the system I'm working on.
If I could only listen for messages with matching attributes, problem solved, but that's seemingly not the case.
Is there any other way?
Sharing a single Amazon SQS queue is not an appropriate architecture for your use-case.
If you want your consumers to be able to 'request' a message from a particular subset, you should either use separate SQS queues or use a database. You could even store objects in Amazon S3 as a form of NoSQL database.
Having consumers grab messages and then 'send them back' to the queue is not compatible with the design of the Amazon SQS service.
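As a rough illustration of the "separate queues" option, the sketch below routes each message to a queue named after its `destination` field, so a consumer only ever receives its own messages. It is an in-memory stand-in, not SQS code; `DestinationRouter` is a hypothetical name, and each deque plays the role of one per-consumer queue.

```python
from collections import defaultdict, deque


class DestinationRouter:
    """Keeps one FIFO queue per destination instead of one shared queue."""

    def __init__(self):
        self._queues = defaultdict(deque)  # destination -> queued messages

    def publish(self, message):
        # The routing decision is made once, at publish time.
        self._queues[message["destination"]].append(message)

    def receive(self, consumer):
        # A consumer reads only from its own queue, so it cannot hijack others.
        queue = self._queues[consumer]
        return queue.popleft() if queue else None


router = DestinationRouter()
router.publish({"destination": "A", "payload": 1234})
print(router.receive("B"))  # None: nothing was addressed to B
print(router.receive("A"))  # {'destination': 'A', 'payload': 1234}
```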
I have a function doWork(id) that I'm offloading to some worker servers using AWS SQS. This function can get called very frequently, but I'd like to throttle it so that for a given id, the work is done no more than once per second.
Is it possible with AWS / are there any services that feature this functionality?
EDIT: Some clarification.
doWork(id) does some expensive work on a record in a database. This work needs to continuously update whenever the user interacts with the record. Thus, I call doWork(id) whenever the user calls a method that edits the record. However, the user may edit the record many times very quickly (I'm building a text editor, so every character is an edit). Rather than run doWork(id) an unnecessary number of times, I'd like to throttle that work so it happens at most once per second.
Because this work is expensive, I enqueue a message in SQS and have a set of "worker" servers that dequeue tasks and run them.
My goal here is to somehow maintain the stateless horizontal scalability of my servers while throttling doWork(id). To make matters a little more complicated, I don't want to throttle the doWork function itself -- I want to throttle the work for each individual record identified by the id passed to doWork.
You could use a Redis instance on ElastiCache and configure your workers to use a distributed rate limiter for keys based on id. There are also many packages for different languages based on this kind of idea that might be ready to run on your workers.
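A minimal sketch of the per-id limiter idea, using an in-memory dict in place of Redis so it runs on its own. `PerIdThrottle` is a hypothetical name. On a shared Redis instance the same check could plausibly be done with a single `SET key value NX PX 1000` command, which only succeeds if no unexpired key exists for that id.

```python
import time


class PerIdThrottle:
    """Allows at most one unit of work per record id per interval."""

    def __init__(self, interval=1.0, clock=time.monotonic):
        self._interval = interval
        self._clock = clock      # injectable so the example is deterministic
        self._expires_at = {}    # record id -> time the current window ends

    def allow(self, record_id):
        now = self._clock()
        if self._expires_at.get(record_id, float("-inf")) > now:
            return False  # work for this id already ran inside the window
        self._expires_at[record_id] = now + self._interval
        return True


fake_now = [100.0]  # stand-in clock, assumption for the demo only
throttle = PerIdThrottle(interval=1.0, clock=lambda: fake_now[0])
print(throttle.allow("doc-1"))  # True
print(throttle.allow("doc-1"))  # False: same id, same second
print(throttle.allow("doc-2"))  # True: other ids are unaffected
fake_now[0] += 1.0
print(throttle.allow("doc-1"))  # True: the window has passed
```

Note that a plain limiter like this drops the calls it rejects, so the final edit in a burst may never trigger work unless something retries it.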
That's interesting. You want to delay the work in case they hit another key within a given time period. If they don't hit another key in that time period, you then want to do the work. You might also want to do it after x seconds even if they continue typing (Auto Save).
The problem is that each keypress sends a message to the queue. When a worker receives the message, they have no idea whether another key has been pressed since the message was sent, and there's no way to look in the queue for other matching messages.
Amazon SQS does have the ability to delay a message, which means it will not be available for receiving for a given period, but this alone can't solve the problem because the worker doesn't know what else has happened.
Bottom line: A traditional queue is not a suitable mechanism for this use-case. You need something akin to a database/cache that can update a "last modified" timestamp each time that a key is pressed. Once that timestamp is more than x seconds old, you should queue the worker.
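The "last modified" approach can be sketched as a small debounce table: every edit overwrites the timestamp, and a periodic sweep queues work only for ids that have been quiet long enough. This is an in-memory illustration; `DebounceTable`, `touch`, and `pop_due` are hypothetical names, and a real deployment would keep the timestamps in a shared database or cache.

```python
class DebounceTable:
    """Tracks the last edit time per record and reports records that went quiet."""

    def __init__(self, quiet_seconds):
        self._quiet = quiet_seconds
        self._last_modified = {}  # record id -> timestamp of the latest edit

    def touch(self, record_id, now):
        # Called on every keypress; later edits simply overwrite the timestamp.
        self._last_modified[record_id] = now

    def pop_due(self, now):
        # Return (and forget) every id whose last edit is at least quiet_seconds old.
        due = [rid for rid, ts in self._last_modified.items()
               if now - ts >= self._quiet]
        for rid in due:
            del self._last_modified[rid]
        return due


table = DebounceTable(quiet_seconds=1.0)
table.touch("a", now=0.0)
table.touch("a", now=0.5)     # user is still typing in record "a"
table.touch("b", now=0.0)
print(table.pop_due(now=1.25))  # ['b']: "a" was edited 0.75 s ago
print(table.pop_due(now=1.5))   # ['a']: now quiet for a full second
```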
In the context of writing a Messenger chat bot in a cloud environment, I'm facing some concurrency issues.
Specifically, I would like to ensure that incoming messages from the same conversation are processed one after the other.
As a constraint, I'm processing the messages with workers in a cloud environment (i.e., the worker pool is of variable size, and worker instances are potentially short-lived and may crash). Also, low latency is important.
So abstracting a little, my requirements are:
I have a stream of incoming messages
each of these messages has a 'topic key' (the conversation id)
the set of topics is not known ahead-of-time and is virtually infinite
I want to ensure that messages of the same topic are processed serially
on a cluster of potentially ephemeral workers
if possible, I would like reliability guarantees, e.g., making sure that each message is processed exactly once.
My questions are:
Is there a name for this concurrency scenario?
Are there technologies (message brokers, coordination services, etc.) which implement this out of the box?
If not, what algorithms can I use to implement this on top of lower-level concurrency tools? (distributed locks, actors, queues, etc.)
I don't know of a widely-accepted name for the scenario, but a common strategy to solve that type of problem is to route your messages so that all messages with the same topic key end up at the same destination. A couple of technologies that will do this for you:
With Apache ActiveMQ, HornetQ, or Apache ActiveMQ Artemis, you could use your topic key as the JMSXGroupId to ensure all messages with the same topic key are processed in-order by the same consumer, with failover
With Apache Kafka, you could use your topic key as the partition key, which will also ensure all messages with the same topic key are processed in-order by the same consumer
Some message broker vendors refer to this requirement as Message Grouping, Sticky Sessions, or Sticky Message Load Balancing.
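The routing idea behind these features can be illustrated with a stable hash of the topic key. This is not the partitioner any of the brokers above actually uses; `partition_for` is a hypothetical helper that only shows why equal keys always land on the same partition, and therefore on the same consumer.

```python
import hashlib


def partition_for(topic_key, num_partitions):
    """Map a topic key to a partition index that is stable across processes."""
    # A cryptographic digest is used because Python's built-in hash() of str
    # is randomized per process and would not be stable between workers.
    digest = hashlib.sha256(topic_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


messages = [("conv-1", "hi"), ("conv-2", "hello"), ("conv-1", "how are you?")]
for conversation_id, text in messages:
    # Both "conv-1" messages print the same partition number.
    print(conversation_id, "->", partition_for(conversation_id, 4))
```

One limitation of this simple scheme is that changing `num_partitions` remaps most keys, so in-flight conversations could switch consumers during a resize.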
Another common strategy on messaging systems with weaker delivery/ordering guarantees (like Amazon SQS) is to simply include a sequence number in the message and leave it up to the destination to resequence and request redelivery of missing messages as needed.
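A minimal sketch of destination-side resequencing, assuming each message carries a consecutive integer sequence number per topic. `Resequencer` is a hypothetical name, and how missing messages get re-requested is left out because it depends on the sender.

```python
class Resequencer:
    """Buffers out-of-order messages and releases them in sequence order."""

    def __init__(self, first_seq=1):
        self._next = first_seq  # next sequence number we are allowed to deliver
        self._pending = {}      # seq -> payload received ahead of its turn

    def accept(self, seq, payload):
        """Store one message; return the payloads that are now deliverable."""
        if seq < self._next or seq in self._pending:
            return []  # duplicate delivery, ignore it
        self._pending[seq] = payload
        ready = []
        while self._next in self._pending:
            ready.append(self._pending.pop(self._next))
            self._next += 1
        return ready

    def missing(self):
        """Sequence numbers below the highest buffered one that have not arrived."""
        if not self._pending:
            return []
        return [s for s in range(self._next, max(self._pending))
                if s not in self._pending]


reseq = Resequencer()
print(reseq.accept(2, "b"))  # []: message 1 has not arrived yet
print(reseq.missing())       # [1]: candidate for a redelivery request
print(reseq.accept(1, "a"))  # ['a', 'b']
print(reseq.accept(1, "a"))  # []: duplicate is dropped
```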
I think you can fix this by using a queue and a set. The idea is to put every message object into the queue and process the queue first in, first out. When you add a message to the queue, also add its topic name to the set, and when you take the message out for processing, remove the topic name from the set.
So if a topic is already in the set, don't add another message object of the same topic to the queue.
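A sketch of one way to read the queue-and-set idea. It deliberately differs from the description in two ways, both assumptions on my part: messages for a busy topic are held back rather than rejected, and the topic leaves the set when processing finishes rather than when the message is taken, since otherwise two workers could handle the same topic at once. `SerialByTopic` is a hypothetical name, and the structure is single-process only.

```python
from collections import deque


class SerialByTopic:
    """Hands out at most one in-flight message per topic at a time."""

    def __init__(self):
        self._ready = deque()    # (topic, message) pairs workers may take
        self._in_flight = set()  # topics that are queued or being processed
        self._waiting = {}       # topic -> messages held back until done()

    def submit(self, topic, message):
        if topic in self._in_flight:
            self._waiting.setdefault(topic, deque()).append(message)
        else:
            self._in_flight.add(topic)
            self._ready.append((topic, message))

    def take(self):
        return self._ready.popleft() if self._ready else None

    def done(self, topic):
        # Called by a worker after it finished the topic's current message.
        waiting = self._waiting.get(topic)
        if waiting:
            self._ready.append((topic, waiting.popleft()))
        else:
            self._in_flight.discard(topic)


dispatcher = SerialByTopic()
dispatcher.submit("c1", "m1")
dispatcher.submit("c1", "m2")  # held back: c1 is busy
dispatcher.submit("c2", "x1")
print(dispatcher.take())  # ('c1', 'm1')
print(dispatcher.take())  # ('c2', 'x1')
print(dispatcher.take())  # None: m2 waits for c1 to finish
dispatcher.done("c1")
print(dispatcher.take())  # ('c1', 'm2')
```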
I hope this will help you. All the best :)
I've set up an S3 bucket to emit an event on PUT object to SQS, and I'm handling the SQS queue in an EB worker tier.
The schema for the message that SQS sends is here: http://docs.aws.amazon.com/AmazonS3/latest/dev/notification-content-structure.html
Records is an array, implying that there can be multiple records sent in one POST to my worker's endpoint. Does this actually happen? Or will my worker only ever receive one record per message?
The worker can only return one response, either 200 (message handled successfully) or non-200 (message not handled successfully, which puts it back into the queue), regardless of how many records in the message it receives.
So if my worker receives multiple records in a message, and it handles some successfully (say by doing something with side effects such as inserting into a database) but fails on one or more, how should I handle that? If I return 200, then the ones that failed will not be retried. But if I return non-200, then the ones that were handled successfully will be retried unnecessarily, and possibly re-inserted. So I'd have to make my worker smart enough to retry only the failed ones -- which is logic I'd prefer not having to write.
This would be much easier if only one record was ever sent per message. So if that's the case in practice, despite records being an array, I'd really like to know!
To be clear, it's not the records that "SQS sends." It's the records that S3 sends to SQS (or to SNS, or to Lambda).
Currently, all S3 event notifications have a single event per notification message. We might include multiple records as we add new event types in the future. This is also a message format that is shared across other AWS services, and other services can include multiple records.
-- https://forums.aws.amazon.com/thread.jspa?messageID=592264
So, for the moment, it appears there's only one record per message.
But... you are making a mistake if you assume your application need not be prepared to handle repeated or duplicate messages. In any massive and distributed system like SQS it is extremely difficult to absolutely guarantee that this can never happen, however unlikely:
Q: How many times will I receive each message?
Amazon SQS is engineered to provide “at least once” delivery of all messages in its queues. Although most of the time each message will be delivered to your application exactly once, you should design your system so that processing a message more than once does not create any errors or inconsistencies.
— http://aws.amazon.com/sqs/faqs/
Incidentally, in my platform, more than one entry in the records array is considered an error, causing the message to be abandoned and sent to the dead letter queue for review.
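Both points above (treat more than one record as an error, and tolerate repeated deliveries) can be sketched in one handler. The field names follow the S3 notification structure linked in the question; everything else is assumed for illustration: `handle_notification`, `MultiRecordError`, the `process` callback, and the `seen` set, which stands in for a durable store such as a table with a unique constraint.

```python
import json


class MultiRecordError(ValueError):
    """Raised when a notification carries anything other than one record."""


def handle_notification(body, seen, process):
    """Process one S3 notification; return False if it was a duplicate."""
    records = json.loads(body)["Records"]
    if len(records) != 1:
        # Fail the whole message so it can end up in a dead letter queue.
        raise MultiRecordError(f"expected 1 record, got {len(records)}")
    s3 = records[0]["s3"]
    bucket, key = s3["bucket"]["name"], s3["object"]["key"]
    # The sequencer distinguishes successive events on the same object key.
    dedup_key = (bucket, key, s3["object"].get("sequencer"))
    if dedup_key in seen:
        return False  # already handled: acknowledge without side effects
    process(bucket, key)
    seen.add(dedup_key)
    return True


body = json.dumps({"Records": [{"s3": {
    "bucket": {"name": "uploads"},
    "object": {"key": "a.txt", "sequencer": "0055AED6DCD90281E5"}}}]})
seen, inserted = set(), []
print(handle_notification(body, seen, lambda b, k: inserted.append((b, k))))  # True
print(handle_notification(body, seen, lambda b, k: inserted.append((b, k))))  # False
print(inserted)  # [('uploads', 'a.txt')]
```

Because `seen` is updated only after `process` returns, a crash in between would still cause a retry, so `process` itself should ideally be safe to repeat as well.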
I understand the concept of delay queue of Amazon SQS, but I wonder why it is useful.
What's the usage of SQS delay queue?
Thanks
One use case I can think of is in distributed applications which have eventual consistency semantics. The system consuming the message may have a dependency, such as a correlation identifier, that must be available, and hence may need to wait for a certain guaranteed duration before it can see the correlation data. In this case, it makes sense for the message to be delayed for that duration.
Like you I was confused as to a use-case for delay queues, until I stumbled across one in my own work. My application needs to have an internal queue with each item waiting at least one minute between each check for completion.
So instead of having to manage a "last-checked-time" on every object, I just shove the object's ID into an SQS queue message with a delay time of 60 seconds, and my main loop then becomes a simple long-poll against the queue.
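The loop described above might look roughly like this. The `sqs` argument is assumed to be a boto3 SQS client (`boto3.client("sqs")`), and the calls used are its `send_message`, `receive_message`, and `delete_message` operations. `schedule_check`, `poll_once`, and the `is_complete` callback are hypothetical names, and error handling is omitted.

```python
def schedule_check(sqs, queue_url, object_id, delay_seconds=60):
    """Enqueue an object id that becomes visible only after the delay."""
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=object_id,
        DelaySeconds=delay_seconds,
    )


def poll_once(sqs, queue_url, is_complete):
    """Long-poll once; re-delay unfinished objects, return the finished ids."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long poll: wait up to 20 s for messages
    )
    finished = []
    for message in response.get("Messages", []):
        object_id = message["Body"]
        if is_complete(object_id):
            finished.append(object_id)
        else:
            # Not done yet: check this object again in another 60 seconds.
            schedule_check(sqs, queue_url, object_id)
        # Delete the received copy so it does not reappear after the
        # visibility timeout.
        sqs.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
    return finished
```

The main loop would then just call `poll_once` repeatedly; no per-object "last-checked-time" is stored anywhere.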
A few off the top of my head:
Emails - Let's say you have a service that sends reminder emails triggered from queue messages. You'd have to delay enqueueing the message in that case.
Race conditions - Delivery delays can be used to overcome race conditions in distributed systems. For example, a service could insert a row into a table and send a message about its availability to other services. They can't use the new entry just yet, so you have to delay publishing the SQS message.
Handling retries - Sometimes if a message fails you want to retry with exponential backoffs. This requires re-enqueuing the message with longer delays.
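For the retry case, the delay for each attempt could be computed like this. The base of 5 seconds is an arbitrary assumption, and `backoff_delay` is a hypothetical helper. The cap reflects the 15-minute (900-second) maximum that SQS allows for a per-message delay.

```python
SQS_MAX_DELAY_SECONDS = 900  # SQS accepts DelaySeconds from 0 to 900


def backoff_delay(attempt, base_seconds=5, cap_seconds=SQS_MAX_DELAY_SECONDS):
    """Delay before retry number `attempt` (0-based), doubling each time."""
    return min(cap_seconds, base_seconds * 2 ** attempt)


print([backoff_delay(n) for n in range(9)])
# [5, 10, 20, 40, 80, 160, 320, 640, 900]
```

The attempt count has to travel with the message (for example in the body or a message attribute) so the worker knows which delay to use when it re-enqueues.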
I've built a suite of APIs to make queue message scheduling easy. You can call our APIs to schedule queue messages, cancel, edit, and check on the status of such messages. Think of it like a scheduler microservice.
www.schedulerapi.com
If you are looking for a solution, let me know. I've built these schedulers before at work for delivering emails at high scale, so I have experience with similar use cases.
One use-case can be:
Think of a time-critical operation such as a scheduled equity trade order.
Suppose one subsystem fetches all the orders scheduled for the next 60 minutes and puts them in a queue, from which another subsystem reads them.
If you send these orders directly, they are visible in the queue immediately and are processed in the order they arrive.
Most likely they will not execute at the exact time (hour:minute:second) the customer wanted, and this will affect the outcome.
To solve this, the first subsystem sets a delay in seconds (the difference between the current time and the execution time) on each message, so the message only becomes visible after that delay, that is, at the time the user wanted.
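The delay calculation can be sketched as below. One caveat worth stating: SQS caps a per-message delay at 900 seconds (15 minutes), so an order more than 15 minutes away cannot be enqueued with its full delay in one step. Returning `None` for that case is my own convention here, and `delay_until` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

SQS_MAX_DELAY_SECONDS = 900  # SQS accepts DelaySeconds from 0 to 900


def delay_until(execute_at, now):
    """Whole seconds to delay a message so it becomes visible at execute_at.

    Returns 0 if the time has already passed, and None if the target is
    further away than SQS allows for a single message delay.
    """
    remaining = int((execute_at - now).total_seconds())
    if remaining <= 0:
        return 0
    if remaining > SQS_MAX_DELAY_SECONDS:
        return None  # too far out: hold the order and enqueue it later
    return remaining


now = datetime(2024, 1, 1, 9, 0, 0, tzinfo=timezone.utc)
print(delay_until(now + timedelta(minutes=5), now))   # 300
print(delay_until(now + timedelta(minutes=60), now))  # None
print(delay_until(now - timedelta(seconds=10), now))  # 0
```

Since the delay is in whole seconds and the consumer still has to poll and process, this brings execution close to the requested time rather than guaranteeing it exactly.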