sending lots of AWS SQS messages---FAST - amazon-web-services

I have an application that may need to send hundreds of thousands of messages each run of my program using SQS. The program takes 1-2 hours/run and I run it 5-10 times/day. So that's roughly 1 million messages/day.
I want to do it fast. Is my best approach to:
1. Send each with its own send-message, but send them from another thread so my main thread doesn't pause?
2. Use send-message-batch, which lets me send 10 messages at a time? (See the sketch below.)
3. OMG. Why am I sending so many messages? Why not write them all into a big object, save the object in S3, and then send a pointer to the object with SQS?
My messages are the stdout and stderr of programs that are running in a distributed system. So the problem with #3 above is that I won't get the output of the program until the batching happens. I suppose that I could batch up every 60 seconds.
I'm sure that this has come up for other people. Is there a clever way to do this in the AWS SQS API that I am missing?
Kinesis is not an option in my environment.
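For reference, option 2 maps onto the send_message_batch call in boto3. A minimal sketch, assuming a placeholder queue URL:
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'  # placeholder

def send_batched(bodies):
    """Send an iterable of message bodies in batches of up to 10 (the SQS per-call limit)."""
    batch = []
    for body in bodies:
        batch.append({'Id': str(len(batch)), 'MessageBody': body})
        if len(batch) == 10:
            sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)
            batch = []
    if batch:
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)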
We are currently sending the messages from Python programs running on Apache Spark workers (about 2000 cores/cluster) and other monitoring systems, across roughly 5-20 clusters. The messages go to an AWS Lambda consumer. The problem is that some of the nodes send a few thousand messages within the course of 10-20 seconds.
We tried using Spark itself to collect this information, storing it in an RDD, saving that RDD in S3, and so on. The problem with that approach was that we didn't get real-time monitoring, and we added several hours to processing time. (We're not entirely sure why it added so much time, but it's possible that Spark ended up re-computing some RDDs because some stuff would no longer fit in RAM or on the spill disks.)

We solved this problem three ways:
We created a work queue with a consumer running in a separate thread. The consumer received messages from the workers and sent them to SQS in batches of 10. If no message arrived within a few seconds, the queue was flushed.
The full code is here: https://github.com/uscensusbureau/DAS_2020_Redistricting_Production_Code/blob/5e619a4b719284ad6af91e85e0548077ce3bfed7/source/programs/dashboard.py
The relevant class is below.
#
# We use a worker running in another thread to collect SQS messages
# and send them asynchronously to SQS in batches of 10 (or when WATCHER_TIME expires).
#
# Note: TRY_SQS_SECOND, YES, NO, Singleton, das_sqs_url(), bcc_https_proxy(), and
# send_message_s3() are defined elsewhere in dashboard.py (see the link above).
#
def sqs_queue():
    return boto3.resource('sqs',
                          config=botocore.config.Config(
                              proxies={'https': bcc_https_proxy().replace("https://", "")}
                          )).Queue(das_sqs_url())

SQS_MAX_MESSAGES = 10   # SQS allows sending up to 10 messages at a time
WATCHER_TIME = 5        # how long to collect SQS messages before sending them
EXIT = 'exit'           # token message to send when program exits

# SQS worker. Collects messages and sends them in batches.
class SQS_Client(metaclass=Singleton):
    """SQS_Client class is a singleton.
    This uses a Python queue to batch up messages that are sent to the AWS queue.
    We batch up to 10 messages at a time, but send every message within 5 seconds.
    """
    def __init__(self):
        """Set up the singleton by:
        - getting a handle to the SQS queue through the BCC proxy;
        - creating the Python queue for batching the requests to the SQS queue;
        - creating a background thread to flush the queue every 5 seconds.
        """
        # Set the default
        if TRY_SQS_SECOND not in os.environ:
            os.environ[TRY_SQS_SECOND] = YES
        self.sqs_queue = sqs_queue()    # queue to send this to SQS
        self.pyqueue = queue.Queue()    # producer/consumer queue used by dashboard.py
        self.worker = threading.Thread(target=self.watcher, daemon=True)
        self.worker.start()
        atexit.register(self.terminate)

    def flush(self, timeout=0.0):
        """Flush the pyqueue. Can be called from the main thread or the watcher thread.
        While there are messages in the queue, grab up to 10, then send them to the sqs_queue.
        Returns the last message processed, which may be EXIT.
        The watcher repeatedly calls flush() until it receives an EXIT.
        """
        entries = []
        msg = None
        t0 = time.time()
        while True:
            try:
                msg = self.pyqueue.get(timeout=timeout, block=True)
            except queue.Empty:
                break
            if msg == EXIT:
                break
            msg['Id'] = str(len(entries))
            entries.append(msg)
            if len(entries) == SQS_MAX_MESSAGES:
                break
            if time.time() - t0 > timeout:
                break
        if entries:
            # Send the 1-10 messages.
            # If this fails, just save them in S3.
            try:
                if os.getenv(TRY_SQS_SECOND) == YES:
                    self.sqs_queue.send_messages(Entries=entries)
                    entries = []
            except botocore.exceptions.ClientError:
                logging.warning("Cannot send by SQS; sending by S3")
                os.environ[TRY_SQS_SECOND] = NO
            if entries:
                assert os.getenv(TRY_SQS_SECOND) == NO   # should only get here if SQS failed above
                for entry in entries:
                    send_message_s3(entry['MessageBody'])
        return msg

    def watcher(self):
        """Repeatedly call flush().
        If the flush gets EXIT, it returns EXIT and we exit.
        """
        while True:
            if self.flush(timeout=WATCHER_TIME) == EXIT:
                return

    def queue_message(self, *, MessageBody, **kwargs):
        self.pyqueue.put({'MessageBody': MessageBody})

    def terminate(self):
        """Tell the watcher to exit."""
        self.flush()
        self.pyqueue.put(EXIT)
However, we were still unsatisfied with this, as draining the SQS queue was also slow and there was poor visibility into the queues.
We developed a system that used S3 as a message queue: create objects under a given bucket and prefix followed by a random string, and have the consumer remove them after processing. Different consumers claimed different prefixes of the random string. (See the sketch below.)
We implemented a traditional system using HTTP REST, with the Python server running under mod_wsgi. This was the most performant.
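A minimal sketch of the S3-as-a-message-queue approach above; the bucket, prefix, and handle() callback are placeholders:
import uuid
import boto3

s3 = boto3.client('s3')
BUCKET = 'my-message-bucket'   # placeholder
PREFIX = 'messages/'           # placeholder

def produce(body):
    # Random key; its first hex character acts as the shard that a consumer claims.
    s3.put_object(Bucket=BUCKET, Key=PREFIX + uuid.uuid4().hex, Body=body.encode())

def consume(shard, handle):
    """Process and delete the messages whose random key starts with `shard` (e.g. '0'..'f')."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX + shard)
    for obj in resp.get('Contents', []):
        body = s3.get_object(Bucket=BUCKET, Key=obj['Key'])['Body'].read()
        handle(body)
        s3.delete_object(Bucket=BUCKET, Key=obj['Key'])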

Related

Pub/Sub testing: message received by client even when ack_deadline is passed

I'm testing Cloud Pub/Sub. According to the Google documentation, the ack_deadline of a pull subscription can be set between 10s and 600s, i.e. the message will be redelivered by Pub/Sub if the ack_deadline is passed.
I'm processing the Pub/Sub message in the subscriber client before acking it. This processing can take ~700s, which exceeds the maximum of 600s.
Reproduction:
create a topic and subscription (by default the acknowledgement deadline is set to 10s)
run the subscriber code (which acks the messages), see below
publish some messages on the topic from the web UI
subscriber code:
import time
import datetime
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1
project_id = "my-project"
subscription_id = "test-sub"
def sub():
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)
def callback(message: pubsub_v1.subscriber.message.Message) -> None:
# My processing code, which takes 700s
time.sleep(700) # sleep function to demonstrate processing
print(f"Received {message}."+ str(datetime.datetime.now()) )
message.ack()
print("msg acked")
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
print(f"Listening for messages on {subscription_path}..\n")
try:
streaming_pull_future.result()
except:
streaming_pull_future.cancel() # Trigger the shutdown.
streaming_pull_future.result() # Block until the shutdown is complete.
subscriber.close()
if __name__ == "__main__":
sub()
Even if the ack_deadline is reached, the message is getting acked, which is weird. According to my understanding, Pub/Sub should redeliver the message, and eventually this code should go into an infinite loop.
Am I missing something here?
The reason that the message is getting acked and not getting redelivered even after the ack deadline specified in the subscription is reached is that the Pub/Sub client libraries internally extend ack deadlines up to a time specified when instantiating the subscriber client. By default, the time is 1 hour. You can change this amount of time by changing the max_lease_duration parameter in the FlowControl object (search for "FlowControl" in the Types page) passed into the subscribe method.
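A sketch of that knob, reusing the project and subscription IDs from the question: capping max_lease_duration makes the client stop extending the ack deadline, so an unacked message becomes eligible for redelivery.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "test-sub")

# Stop extending the ack deadline after 600 seconds.
flow_control = pubsub_v1.types.FlowControl(max_lease_duration=600)

def callback(message):
    # your long-running processing here
    message.ack()

streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control
)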
That's correct. There are several solutions, each with its tradeoffs:
Ack the message immediately and then process it. The problem: if your system has an outage, you lose the message.
Save the message ID's state in a database (Firestore, for instance). (See the sketch below.)
If the message ID is new, start the processing; at the end of the processing, update the message ID's status in the database.
If the message ID already exists, sleep a while (about 90s), then check the status of the message ID in the database. If it is DONE, ack the message. If not, sleep again (at most 6 times), then NACK and start that process again. To break the loop, repeat the process only until the message timestamp is more than 1 hour old.
Save the message in a database, ack the message, and start the processing. In case of an outage, check at startup for messages not yet marked DONE and restart the process for each of them. At the end of the processing, mark them as DONE.
You can also imagine other patterns; none is perfect, it depends on your needs.
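A minimal sketch of the idempotency check at the heart of the second option, assuming a Firestore collection named pubsub_messages (a placeholder) and leaving out the sleep/NACK retry loop:
from google.cloud import firestore

db = firestore.Client()

def callback(message):
    doc_ref = db.collection("pubsub_messages").document(message.message_id)
    snapshot = doc_ref.get()
    if snapshot.exists and snapshot.get("status") == "DONE":
        message.ack()                      # a previous delivery already finished the work
        return
    doc_ref.set({"status": "IN_PROGRESS"})
    process(message)                       # your long-running processing, not shown
    doc_ref.set({"status": "DONE"})
    message.ack()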

Rate-limiting a Worker for a Queue (e.g.: SQS)

Every day, a CRON task will run which populates an SQS queue with a number of tasks that need to be completed. So (for example) at 9AM every morning, an empty queue will receive ~100 messages that need to be processed.
I would like a new worker to be spun up every second until the queue is empty. If any task fails, it's put at the back of the queue to be re-run.
For example, if each task takes up to 1.5 seconds to complete:
after 1 second, 1 worker will have started message A
after 2 seconds, 1 worker may still be running message A and 1 worker will have started running message B
after 100 seconds, 1 worker may still be running message XX and 1 worker will pick up message B again because it failed previously
after 101 seconds, no more workers are spawned until the next morning
Is there any way to have this type of infrastructure configured within AWS lambda?
One way, though I'm not convinced it's optimal:
A Lambda that's triggered by a CloudWatch Event (say every second, or every 10 seconds, depending on your rate limit). It polls SQS to receive (at most) N messages, then "fans out" to another Lambda function with each message.
Some pseudo code:
import boto3

# Assumed setup (not in the original pseudo code); the queue URL is a placeholder.
sqs = boto3.resource('sqs')
queue = sqs.Queue('https://sqs.us-east-1.amazonaws.com/123456789012/my-queue')
lambda_client = boto3.client('lambda')

# Lambda 1 (scheduled by a CloudWatch Event, e.g. CRON)
def handle_cron(event, context):
    # in order to get more messages, we might have to receive several times (loop)
    for message in queue.receive_messages(MaxNumberOfMessages=10):
        # Note: use the 'Event' InvocationType so we don't wait for the response!
        lambda_client.invoke(FunctionName="foo", Payload=message.body, InvocationType='Event')
and
# Lambda 2 (triggered only by the invoke in Lambda 1)
def handle_message(event, context):
    # handle message
    pass
Seems to me you would be better off publishing your messages to SNS instead of SQS, and then having your Lambda functions subscribe to the SNS topic.
Let Lambda worry about how many 'instances' it needs to spin up in response to the load.
Here is one blog post on this method, but Google may help you find one that is closer to your actual use case.
https://aws.amazon.com/blogs/mobile/invoking-aws-lambda-functions-via-amazon-sns/
Why not just have a Lambda function that starts polling SQS at 9AM, getting one message at a time and sleeping for a second between messages? Dead-letter queues can handle retries. Stop execution after not receiving a message from SQS for x seconds. (A sketch follows below.)
It is a unique case where you don't actually want parallel processing.
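A sketch of that single-threaded, rate-limited poller; the queue URL, the 30-second idle cutoff, and process() are placeholders, and the loop must of course finish within the Lambda timeout:
import time
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/daily-tasks'  # placeholder

def handler(event, context):
    last_message = time.time()
    while time.time() - last_message < 30:        # stop after 30s with no messages
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=1)
        messages = resp.get('Messages', [])
        if not messages:
            continue
        last_message = time.time()
        process(messages[0]['Body'])               # your task logic, not shown
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=messages[0]['ReceiptHandle'])
        time.sleep(1)                              # pace to roughly one task per second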

Django celery task duplication: can't lock DB?

My django app allows users to send messages to each other, and I pool some of the recent messages together and send them in an email using celery and redis.
Every time a user sends a message, I add a Message to the db and then trigger an async task to pool that user's messages from the last 60 seconds and send them as an email.
tasks.pushMessagePool.apply_async(args = (fromUser,), countdown = 60)
If the user sends 5 messages in the next 60 seconds, then my assumption is that 5 tasks should be created, but only the first task sends the email, and the other 4 tasks do nothing. I implemented a simple locking mechanism to make sure that messages were only considered a single time and to ensure db locking.
@shared_task
def pushMessagePool(fromUser, ignore_result=True):
    lockCode = randint(0, 10**9)
    # Mark this user's unlocked messages with a random lock code...
    data.models.Messages.objects.filter(fromUser=fromUser, locked=False).update(locked=True, lockCode=lockCode)
    # ...then re-select only the rows that got this lock code and email them.
    M = data.models.Messages.objects.filter(fromUser=fromUser, lockCode=lockCode)
    sendEmail(M, lockCode)
With this setup, I still get occasional (~10%) duplicates. The duplicates will fire within 10ms of each other, and they have different lockCodes.
Why doesn't this locking mechanism work? Does celery refer to an old DB snapshot? That wouldn't make any sense.
Djangojack, here is a similar issue, but for SQS; I'm not sure if it applies to Redis too:
When creating your SQS queue you need to set the Default Visibility Timeout to some time that's greater than the max time you expect a task to run. This is the time SQS will make a message invisible to all other consumers after delivering it to one consumer. I believe the default is 30 seconds. So, if a task takes more than 30 seconds, SQS will deliver the same message to another consumer because it assumes the first consumer died and did not complete the task.
From a comment by @gustavo-ambrozio on this answer.
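If a too-short visibility timeout is indeed the cause, it can be raised on an existing queue. A minimal boto3 sketch, with a placeholder queue URL and a 15-minute value chosen only for illustration:
import boto3

sqs = boto3.client('sqs')
sqs.set_queue_attributes(
    QueueUrl='https://sqs.us-east-1.amazonaws.com/123456789012/email-tasks',  # placeholder
    Attributes={'VisibilityTimeout': '900'}   # must exceed the slowest expected task
)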

Subscribing to AWS SQS Messages

I have a large number of messages in an AWS SQS queue, and more messages are pushed to it constantly by other sources. There is no predictable pattern for how often those messages arrive. Currently, I poll SQS every second to check whether any messages are available. Is there a better way of handling this, such as receiving a notification from SQS or SNS that messages are available, so that I only call SQS when needed instead of constantly polling?
The way to do what you want is to use long polling - rather than constantly poll every second, you open a request that stays open until it either times out or a message comes into the queue. Take a look at the documentation for ReceiveMessageRequest
ReceiveMessageRequest req = new ReceiveMessageRequest()
        .withWaitTimeSeconds(Integer.valueOf(20)); // set long poll timeout to 20 sec
// set other properties on the request as well
ReceiveMessageResult result = amazonSQS.receiveMessage(req);
A common usage pattern for this is to have a background thread running the long poll and pushing the results into an internal queue (such as LinkedBlockingQueue or an ExecutorService) for a worker thread to read from.
PS. Don't forget to call deleteMessage once you're done processing the result so you don't end up receiving it again.
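For the Python users in this thread, the same background-poller pattern might look like the sketch below (the queue URL and handle() are placeholders): a daemon thread long-polls SQS and hands messages to a worker through a queue.Queue, and the worker deletes each message after processing it.
import queue
import threading
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'  # placeholder
work = queue.Queue()

def poller():
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for m in resp.get('Messages', []):
            work.put(m)

def worker():
    while True:
        m = work.get()
        handle(m['Body'])                  # your processing, not shown
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m['ReceiptHandle'])

threading.Thread(target=poller, daemon=True).start()
threading.Thread(target=worker, daemon=True).start()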
You can also use the worker functionality in AWS Elastic Beanstalk. It allows you to build a worker to process each message, and when you use Elastic Beanstalk to deploy it to an EC2 instance, you can define it as subscribed to a specific queue. Each message will then be POSTed to the worker, without your needing to call receive-message on the queue.
It makes your system wiring much easier, as you can also set auto-scaling rules that spawn multiple workers to handle more messages during peak load and scale back down to a single worker when the load is low. It will also delete the message automatically if your worker responds with 200 OK.
See more information about it here: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
You could also have a look at Shoryuken and the property delay:
delay: 25 # The delay in seconds to pause a queue when it's empty
But to be honest, we use delay: 0 here; SQS is inexpensive:
First 1 million Amazon SQS Requests per month are free
$0.50 per 1 million Amazon SQS Requests per month thereafter ($0.00000050 per SQS Request)
A single request can have from 1 to 10 messages, up to a maximum total payload of 256KB.
Each 64KB ‘chunk’ of payload is billed as 1 request. For example, a single API call with a 256KB payload will be billed as four requests.
You will probably spend less than 10 dollars a month polling messages every second, 24x7, from a single host.
One of the advantages of Shoryuken is that it fetches messages in batches, so it saves some money compared with fetch-per-message solutions.

SQS Messages never gets removed/deleted after script runs

I'm having an issue where my SQS messages are never deleted from the SQS queue. They are only removed when the retention period ends, which is 4 days.
So to summarize the app:
Send URL to SQS Queue to wait to be crawled
Send a message to the Elastic Beanstalk app, which crawls the data and stores it in the database
The script seems to be working, in the sense that it does receive the message, crawls it successfully, and stores the data in the database. The only issue is that the messages remain in the queue, stuck at "Messages Available".
So if I, for example, load the queue with 800 messages, it stays at ~800 messages for 4 days and then they are all deleted at once because of the retention period. A few messages do seem to get deleted, because the number changes slightly, but the large majority are never removed from the queue.
So question:
Isn't SQS supposed to remove the message as soon as it has been received and processed by the script?
Is there a manual way for me, in the script itself, to delete the current message? From what I know, the messages only flow one way, from SQS -> App, so I can't do SQS <-> App.
Any ideas?
A web application in a worker environment tier should only listen on the local host. When the web application in the worker environment tier returns a 200 OK response to acknowledge that it has received and successfully processed the request, the daemon sends a DeleteMessage call to the SQS queue so that the message will be deleted from the queue. (SQS automatically deletes messages that have been in a queue for longer than the configured RetentionPeriod.) If the application returns any response other than 200 OK or there is no response within the configured InactivityTimeout period, SQS once again makes the message visible in the queue and available for another attempt at processing.
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
So I guess that answers my question. Some messages do not return HTTP 200 and then they are stuck in an infinite loop.
No, the messages won't get deleted when you read a queue item; they are only hidden for a specific amount of time, called the visibility timeout. The idea behind the visibility timeout is to ensure that if there are multiple consumers on a single queue, no two consumers pick up the same item and start processing it.
This is the change you need to make to your app to get the expected behavior:
Send the URL to the SQS queue to wait to be crawled
Send a message to the Elastic Beanstalk app, which crawls the data and stores it in the database
On a successful crawl, use the receipt handle (not the message ID) to delete the item from the queue (see the sketch below)
AWS Documentation - DeleteMessage
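A minimal sketch of that last step with plain boto3 (outside the Elastic Beanstalk daemon); the queue URL and crawl_and_store() are placeholders:
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue'  # placeholder

resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
for msg in resp.get('Messages', []):
    crawl_and_store(msg['Body'])   # your crawler, not shown
    # Delete using the ReceiptHandle of *this* delivery, not the MessageId.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])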