Read millions of SQS message and persist in RDS - amazon-web-services

I have million of SQS message coming on daily basis. Currently we are reading same from various poller machines and writing same in RDBMS (Aurora PostgreSQL). Architecture has two flaw:
It is taking more than 10 hours to process all SQS messages. We are targeting 2-3 hours for same.
SQS messages are coming from a job. It is not continuous activity. Maintaining poller machines 24 hours is costing us.
We have already configured SQS NumberOfPollers to 20 and MessageFetchSize to 10.
My questions are:
Apart from NumberOfPollers and MessageFetchSize, is there any other SQS configuration parameter we can use to speed the process?
How to calculate correct value of NumberOfPollers and MessageFetchSize? We are just doing try and error in this.
Can we utilize EMR-Spark to do allocate machine on demand and run poller and terminate it after execution, so that we need not to maintain machine 24x7?
Any other suggestion/ways to achieve same

Related

Estimate SQS processing time and load

I am going to use AWS SQS(regular queue, not FIFO) to process different client side metrics.
I’m expect to have ~400 messages per second (worst case).My SQS message will contain S3 location of the file.
I created an application, which will listen to my SQS Queue, and process messages from it.
By process I mean:
read SQS message ->
take S3 location from that SQS message ->
call S3 client ->
Read that file ->
Add a few additional fields —>
Publish data from this file to AWS Kinesis Firehose.
Similar process will be for each SQS message in the Queue. The size of S3 file is small, less than 0,5 KB.
How can calculate if I will be able to process those 400 messages per second? How can I estimate that my solution would handle x5 increase in data?
How can calculate if I will be able to process those 400 messages per second? How can I estimate that my solution would handle x5 increase in data?
Test it! Start with a small scale, and do the math to extrapolate from there. Make your test environment as close to what it will be in production as feasible.
On a single host and single thread, the math is simple:
1000 / AvgTotalTimeMillis = AvgMessagesPerSecond, or
1000 / AvgMessagesPerSecond = AvgTotalTimeMillis
How to approach testing this:
Start with a single thread and host, and generate some timing metrics for each step that you outlined, along with a total time.
Figure out your average/max/min time, and how many messages per second that translates to
400 messages per second on a single thread & host would be under 3ms per message. Hopefully this makes it obvious you need multiple threads/hosts.
Scale up!
Now that you know how much a single thread can handle, figure out how many threads a single host can effectively handle (you'll need to experiment). Consider batching messages where possible - SQS provides batch operations.
Use math to calculate how many hosts you need
If you need 5X that number, go up from there
While you're doing this math, consider any limits of the systems you're using:
Review the throttling limits of SQS / S3 / Firehose / etc. If you plan to use Lambda to do the work instead of EC2, it has limits too. Make sure you're within those limits, and consider contacting AWS support if you are close to exceeding them.
A few other suggestions based on my experience:
Based on your workflow outline & details, using EC2 you can probably handle a decent number of threads per host
M5.large should be more than enough - you can probably go smaller, as the performance bottleneck will likely be networking I/O to fetch and send messages.
Consider using autoscaling to handle message spikes for when you need to increase throughput, though keep in mind autoscaling can take several minutes to kick in.
The only way to determine this is to create a test environment that mirrors your scenario.
If your solution is designed to handle messages in parallel, it should be possible to scale-up your system to handle virtually any workload.
A good architecture would be to use AWS Lambda functions to process the messages. Lambda defaults to 1000 concurrent functions. So, if a function takes 3 seconds to run, it would support 333 messages per second consistently. You can request for the Lambda concurrency to be increased to handle higher workloads.
If you are using Amazon EC2 instead of Lambda functions, then it would just be a matter of scaling-out and adding more EC2 instances with more workers to handle whatever workload you desired.

I need help clarifying a high level use-case of Amazon SQS

So I need a second pair of eyes to correct or confirm my understand standing of Amazon SQS. From my understanding, you can add an unlimited amount of messages to one queue. A message can be 256 KB in size, and if it needs to be larger than that, you can use amazon s3 to store 2 GB. Reading around online, it appears there are many use cases for this queuing service. For example one use case of SQS can act as a database buffer.
But here's what I'm looking to do.. I'm looking to make a real time messaging system. My current functionality acts like more of a message board, so the implementation just inserts into the database then reads the data and packages it into JSON to be inserted on SQLITE mobile phone. That works great, but I'm getting a lot of requests from people to make it real-time.
So what I'm wondering is can I utilize amazon SQS to write and read messages for a chat application? So in my theoretical use case of SQS would have a message queue to write to, and pull from the that queue every second to check for messages on mobile. But here's where I'm confused. Since you cannot "Query" a particular message from the queue, would it make sense to have a queue per user then a generic queue for the app server to read from? Or am I just talking crazy and should spend cognitive resources thinking about implementing an open connection on an Ec2 instance?
Any help would be great,
Thanks!
Have you thought about using Amazon SNS to push the chat messages to your mobile devices? Each user publishes to a topic and the readers subscribe to that topic. You just have to be ok with missing messages if the app isn't running.
If you only have a few (or maybe, less than 100) users, you could have thought of having one SQS queue per user. If that is not so, the solution won't be operationally feasible.
If you were to have one generic queue, SQS won't help because it doesn't allow querying for a given field in all available messages.
I can think of following options for your use case:
Setup one Redis cluster, possibly on Amazon ElastiCache. Have one message List per user.
One Messages table in MySQL, possibly on AWS RDS. This will provide an easy way to query messages for a given user.
You can also use DynamoDB in #2.

What are the possible use cases for Amazon SQS or any Queue Service?

So I have been trying to get my hands on Amazon's AWS since my company's whole infrastructure is based of it.
One component I have never been able to understand properly is the Queue Service, I have searched Google quite a bit but I haven't been able to get a satisfactory answer. I think a Cron job and Queue Service are quite similar somewhat, correct me if I am wrong.
So what exactly SQS does? As far as I understand, it stores simple messages to be used by other components in AWS to do tasks & you can send messages to do that.
In this question, Can someone explain to me what Amazon Web Services components are used in a normal web service?; the answer mentioned they used SQS to queue tasks they want performed asynchronously. Why not just give a message back to the user & do the processing later on? Why wait for SQS to do its stuff?
Also, let's just say I have a web app which allows user to schedule some daily tasks, how would SQS would fit in that?
No, cron and SQS are not similar. One (cron) schedules jobs while the other (SQS) stores messages. Queues are used to decouple message producers from message consumers. This is one way to architect for scale and reliability.
Let's say you've built a mobile voting app for a popular TV show and 5 to 25 million viewers are all voting at the same time (at the end of each performance). How are you going to handle that many votes in such a short space of time (say, 15 seconds)? You could build a significant web server tier and database back-end that could handle millions of messages per second but that would be expensive, you'd have to pre-provision for maximum expected workload, and it would not be resilient (for example to database failure or throttling). If few people voted then you're overpaying for infrastructure; if voting went crazy then votes could be lost.
A better solution would use some queuing mechanism that decoupled the voting apps from your service where the vote queue was highly scalable so it could happily absorb 10 messages/sec or 10 million messages/sec. Then you would have an application tier pulling messages from that queue as fast as possible to tally the votes.
One thing I would add to #jarmod's excellent and succinct answer is that the size of the messages does matter. For example in AWS, the maximum size is just 256 KB unless you use the Extended Client Library, which increases the max to 2 GB. But note that it uses S3 as a temporary storage.
In RabbitMQ the practical limit is around 100 KB. There is no hard-coded limit in RabbitMQ, but the system simply stalls more or less often. From personal experience, RabbitMQ can handle a steady stream of around 1 MB messages for about 1 - 2 hours non-stop, but then it will start to behave erratically, often becoming a zombie and you'll need to restart the process.
SQS is a great way to decouple services, especially when there is a lot of heavy-duty, batch-oriented processing required.
For example, let's say you have a service where people upload photos from their mobile devices. Once the photos are uploaded your service needs to do a bunch of processing of the photos, e.g. scaling them to different sizes, applying different filters, extracting metadata, etc.
One way to accomplish this would be to post a message to an SQS queue (or perhaps multiple messages to multiple queues, depending on how you architect it). The message(s) describe work that needs to be performed on the newly uploaded image file. Once the message has been written to SQS, your application can return a success to the user because you know that you have the image file and you have scheduled the processing.
In the background, you can have servers reading messages from SQS and performing the work specified in the messages. If one of those servers dies another one will pick up the message and perform the work. SQS guarantees that a message will be delivered eventually so you can be confident that the work will eventually get done.

Generating timer events in the cloud

I am trying to solve the problem of generating distributed timing events for my application on the amazon cloud:
A server gets a message. As a result the system has to do something within X minutes. My problem is that the system needs to potentially handle millions of messages per second during peak time. Also, during that time interval the server that got the message might crash. So I am looking for a distributed solution that can receive a message, and then fire another message with a guarantee several minutes later.
I could design a sharded system by myself, but I was hoping that some distributed streaming framework can do this easily. But what I found so far are ones that complete transactions immediately.
Your question is somewhat vague, but if you are looking for a fault-tolerant, geographically distributed way of processing messages, the you should take a look at AWS SQS or SQS in combination with SNS.
http://aws.amazon.com/sqs/
http://aws.amazon.com/sns/

SQS/task-queue job retry count strategy?

I'm implementing a task queue with Amazon SQS ( but i guess the question applies to any task-queue ) , where the workers are expected to take different action depending on how many times the job has been re-tried already ( move it to a different queue, increase visibility timeout, send an alert..etc )
What would be the best way to keep track of failed job count? I'd like to avoid having to keep a centralized db for job:retry-count records. Should i look at time spent in the queue instead in a monitoring process? IMO that would be ugly or un-clean at best, iterating over jobs until i find ancient ones..
thanks!
Andras
There is another simpler way. With your message you can request ApproximateReceiveCount information and base your retry logic on that. This way you won't have to keep it in the database and can calculate it from the message itself.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ReceiveMessage.html
I've had good success combining SQS with SimpleDB. It is "centralized", but only as much as SQS is.
Every job gets a record in simpleDB and a task in SQS. You can put any information you like in SimpleDB like the job creation time. When a worker pulls a job from the queue it can grab the corresponding record from simpleDB to determine it's history. You can see how old the job is, and you can see how many times it has been attempted. Once you're done, you can add worker data to the SimpleDB record (completion time, outcome, logs, errors, stack-trace, whatever) and acknowledge the message from SQS.
I prefer this method because it helps diagnose faults by providing lots of debug info for failed tasks. It also allows workers to handle the job differently depending on how long the job has been queued, how many failures it's had, etc.
It also gives you the ability to query SimpleDB directly and calculate things like average time per task, percent failure rate, etc.
Amazon just released Simple workflow serice (swf) which you can think of as a more sophisticated/flexible version of GAE Task queues.
It will let you monitor your tasks (with hearbeats), configure retry strategies and create complicated workflows. It looks pretty promising abstracting out task dependencies, scheduling and fault tolerance for tasks (esp. asynchronous ones)
Checkout http://docs.amazonwebservices.com/amazonswf/latest/developerguide/swf-dg-intro-to-swf.html for overview.
SQS stands for "Simple Queue Service" which, in concept is the incorrect name for that service. The first and foremost feature of a "Queue" is FIFO (First in, First out), and SQS lacks that. Just wanting to clarify.
Also, Azure Queue Services lacks that as well. For the best cloud Queue service, use Azure's Service Bus since it's a TRUE Queue concept.