I am developing a web application where an HTTP request triggers a long background task. To decouple this work from the HTTP request processing I am using AWS SQS: while handling the HTTP request I enqueue a message describing the background work, and the message is then picked up by a background process which actually does the work. This way the latency of my application is kept low.
Recently I noticed worryingly high latencies when sending messages to SQS. From what I found by googling, normal latency should be in the hundreds of milliseconds.
The problem is that the latency sometimes spikes to over 130,000 ms! The background processing actually takes less time than enqueuing the work.
I am using a Standard queue, which I understand is more of a best-effort service. Is this kind of latency a common thing with AWS SQS? How can I proceed with debugging this issue?
The messages are short JSON documents containing the ID of the object which should be processed in the background.
{"type":"DO_BACKGROUND_WORK","ids":["123456"]}
The obvious explanation would be a networking issue between AWS and my server. However, the SES endpoint does not show such latencies, and neither does the Rollbar error logging.
The application is hosted in Europe (Contabo), not in AWS. The CPU load is normal during the work, and the RAM usage is normal as well.
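One way to narrow this down is to time the SendMessage call itself and cap how long a single attempt may take, so one slow connection cannot stall for minutes. Below is a minimal sketch assuming the AWS SDK for Java v2; the queue URL, region, and timeout values are placeholders you would adapt:

import java.time.Duration;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class SqsSendTimer {
    public static void main(String[] args) {
        // Hypothetical queue URL and region; substitute your own values.
        String queueUrl = "https://sqs.eu-central-1.amazonaws.com/123456789012/background-work";

        SqsClient sqs = SqsClient.builder()
                .region(Region.EU_CENTRAL_1)
                // Fail fast instead of letting a single call hang for minutes.
                .overrideConfiguration(ClientOverrideConfiguration.builder()
                        .apiCallAttemptTimeout(Duration.ofSeconds(2))
                        .apiCallTimeout(Duration.ofSeconds(10))
                        .build())
                .build();

        long start = System.nanoTime();
        sqs.sendMessage(SendMessageRequest.builder()
                .queueUrl(queueUrl)
                .messageBody("{\"type\":\"DO_BACKGROUND_WORK\",\"ids\":[\"123456\"]}")
                .build());
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("SendMessage took " + elapsedMs + " ms");
    }
}

Logging the elapsed time per call (and per retry attempt, if you enable SDK-level logging) should at least tell you whether the spikes come from slow attempts or from retries after timeouts.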
Related
We are currently implementing a distributed Spring Boot microservice architecture on Amazon's AWS, where we use SNS/SQS as our messaging system:
Events are published by a Spring Boot service to an SNS FIFO topic using Spring Cloud AWS. The topic hands over the events to multiple SQS queues subscribed to the topic, and the queues are then in turn consumed by different consumer services (again Spring Boot using Spring Cloud AWS).
Everything works as intended, but we are sometimes seeing very high latency on our production services.
Our product isn't released yet (we are currently in testing), meaning we have very, very low traffic on prod, i.e., only a few messages a day.
Unfortunately, we see very high latency before a message is delivered to its subscribers after a long period of inactivity (typically up to 6 seconds, but it can be as high as 60 seconds). Things speed up considerably afterwards, with message delivery times dropping to below 100 ms for the next messages sent to the topic.
Turning on logging on the SNS topic in AWS revealed that most of the delay for the first message is spent at the SNS part of things, where the SNS dwellTime roughly correlates with the delays we are seeing in message delivery. Spring Cloud AWS seems fine.
Is this something expected? Is there something like a "cold start" time for idle SNS FIFO topics (as seen with AWS Lambdas)? Will this latency simply go away once we increase the load and warm up the topic? Or is there something we missed configuring?
We are using fairly standard SQS subscriptions, btw, no subscription throttling in place. The Spring Boot services run on a Fargate ECS cluster.
It seems like AWS somehow deactivates unused SNS topics. What we are doing now is sending a "dummy" keep-alive message to the topic every ten minutes, which keeps the dwellTime reasonably low for us (<500 ms).
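For reference, such a keep-alive can be a few lines of scheduled code. The sketch below is one possible shape, assuming Spring's @Scheduled (with @EnableScheduling on your configuration) and the AWS SDK for Java v2 SnsClient; the topic ARN and message group ID are placeholders, and your consumers would need to recognise and ignore the KEEP_ALIVE payload:

import java.util.UUID;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.PublishRequest;

@Component
public class SnsKeepAlive {

    // Hypothetical FIFO topic ARN; replace with your own.
    private static final String TOPIC_ARN =
            "arn:aws:sns:eu-central-1:123456789012:events.fifo";

    private final SnsClient sns = SnsClient.create();

    // Publish a dummy message every ten minutes so the topic stays "warm".
    @Scheduled(fixedRate = 600_000)
    public void keepAlive() {
        sns.publish(PublishRequest.builder()
                .topicArn(TOPIC_ARN)
                .message("KEEP_ALIVE")
                .messageGroupId("keep-alive") // required for FIFO topics
                .messageDeduplicationId(UUID.randomUUID().toString()) // needed unless content-based deduplication is enabled
                .build());
    }
}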
I've gone over the documentation and cannot find a clear statement regarding how much latency X-Ray tracing is supposed to add to Lambda function executions (and to other services as well). It should be minimal, but since it's sending out traces, some latency is expected.
Does anyone have the numbers?
The AWS X-Ray SDKs that you use in your application do not send trace segments to the X-Ray service directly. The segments are transmitted over UDP to the X-Ray daemon running on localhost, so the latency your code incurs is only that of in-memory updates to the segment data; only when the segments are complete are they sent over UDP to localhost. Hence, you should expect minimal overhead in your application. The daemon, which runs as a separate process, also does not send segments to the service immediately: it buffers them for a short period and periodically sends them in batches using the PutTraceSegments API call.
If you are interested in digging further, most AWS X-Ray SDKs are open sourced on GitHub; the Java SDK, for example: https://github.com/aws/aws-xray-sdk-java
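To illustrate what the answer describes, here is a small sketch using the X-Ray Java SDK. It assumes a segment is already open (as it is inside an instrumented Lambda invocation); the subsegment name and annotation are made up for the example. All of this only mutates in-memory state in the SDK until the enclosing segment is closed and emitted to the local daemon:

import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.entities.Subsegment;

public class TracedWork {
    public void doWork() {
        // Opening the subsegment only touches in-memory state.
        Subsegment sub = AWSXRay.beginSubsegment("background-work");
        try {
            sub.putAnnotation("objectId", "123456"); // hypothetical annotation
            // ... your actual work here ...
        } finally {
            // The buffered data is sent over UDP to the daemon (localhost:2000
            // by default) only once the enclosing segment is closed.
            AWSXRay.endSubsegment();
        }
    }
}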
I'm using Spring JMS to communicate with Amazon SQS queues. I set up a handful of queues and wired up the listeners, but the app isn't sending any messages through them currently. AWS allows 1 million requests per month for free, which I thought should be no problem, but after a month I got billed a small amount for going over that limit.
Is there a way to tune SQS or Spring JMS to keep the requests down?
I'm assuming a request is counted whenever my app polls the queue to check for new messages. Some queues don't need to be near real-time, so I could definitely reduce those requests. I'd appreciate any insights you can offer into how SQS and Spring JMS communicate.
"Normal" JMS clients, when polled for messages, don't poll the server - the server pushes messages to the client and the poll is just done locally.
If the SQS client polls the server, that would be unusual, to say the least, but if it's using REST calls, I can see why it would happen.
Increasing the container's receiveTimeout (default 1 second) might help, but without knowing what the client is doing under the covers, it's hard to tell.
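If you want to experiment with that, here is a minimal sketch of raising receiveTimeout on the listener container, assuming the amazon-sqs-java-messaging-lib (SQSConnectionFactory) and Spring's DefaultJmsListenerContainerFactory. Whether it actually translates into fewer ReceiveMessage calls depends on what the SQS JMS client does under the covers, as noted above; enabling long polling via ReceiveMessageWaitTimeSeconds on the queue itself is another lever worth checking.

import com.amazon.sqs.javamessaging.ProviderConfiguration;
import com.amazon.sqs.javamessaging.SQSConnectionFactory;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.config.DefaultJmsListenerContainerFactory;

@Configuration
public class JmsConfig {

    @Bean
    public SQSConnectionFactory sqsConnectionFactory() {
        return new SQSConnectionFactory(
                new ProviderConfiguration(),
                AmazonSQSClientBuilder.defaultClient());
    }

    @Bean
    public DefaultJmsListenerContainerFactory jmsListenerContainerFactory(
            SQSConnectionFactory connectionFactory) {
        DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
        factory.setConnectionFactory(connectionFactory);
        // Wait up to 20 seconds per receive instead of the 1-second default,
        // so an idle listener loops (and bills) far less often.
        factory.setReceiveTimeout(20_000L);
        return factory;
    }
}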
I have a first web service which is used to send messages into AWS SQS; it is deployed on a separate EC2 instance and runs under IIS 8. It is able to handle 500 requests per second from each of two machines, i.e., 1000 requests per second in total, and it can handle more.
I have a second web service deployed on another EC2 instance of the same power/configuration. This web service will be used to process the messages stored in SQS. For testing purposes, I am currently only receiving each message from SQS and deleting it.
I have an AWS SNS topic which tells the second web service that a message has arrived in SQS and that it should go and receive that message for processing.
But I observe that my second web service is not as fast as my first web service: every time I run the test, messages are left in SQS, whereas ideally no message should remain there.
Please advise on the possible reasons for this and the areas I should focus on.
Thanks in advance.
The receiver has double the work to do, since it both receives and deletes the message, and those are two separate calls. You may need double the instances to process the sent messages if you have high volume.
How many messages are you receiving at once? I highly recommend setting MaxNumberOfMessages to 10 and then using DeleteMessageBatch with batches of 10. Not only will this greatly increase throughput, it will also cut your SQS bill by about 60%.
Also, I'm confused about the SNS topic. There is no need to have an SNS topic tell the other web service that a message exists; if every message generates a publish to that topic, you are adding a lot of extra work and expense. Instead, you should use long polling: set WaitTimeSeconds to 20 and just always be calling SQS. Even if you get 0 messages for a month, 2 servers constantly long polling will be well within the free tier, and if you are above the free tier, the total cost of 2 servers constantly long polling an SQS queue is $0.13/month.
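Putting those two suggestions together, the consumer loop can look roughly like the sketch below. It uses the AWS SDK for Java v2 for illustration (the same pattern exists in the .NET SDK your services presumably use); the queue URL and the process() body are placeholders:

import java.util.List;
import java.util.stream.Collectors;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageBatchRequest;
import software.amazon.awssdk.services.sqs.model.DeleteMessageBatchRequestEntry;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

public class QueueWorker {
    public static void main(String[] args) {
        SqsClient sqs = SqsClient.create();
        // Hypothetical queue URL; replace with your own.
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue";

        while (true) {
            // Long poll: the call blocks for up to 20 seconds, so an idle worker
            // makes only about 3 requests per minute instead of hammering the API.
            List<Message> messages = sqs.receiveMessage(ReceiveMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .maxNumberOfMessages(10)
                    .waitTimeSeconds(20)
                    .build()).messages();

            if (messages.isEmpty()) {
                continue;
            }

            for (Message m : messages) {
                process(m.body());
            }

            // Delete all processed messages in a single batch call.
            List<DeleteMessageBatchRequestEntry> entries = messages.stream()
                    .map(m -> DeleteMessageBatchRequestEntry.builder()
                            .id(m.messageId())
                            .receiptHandle(m.receiptHandle())
                            .build())
                    .collect(Collectors.toList());
            sqs.deleteMessageBatch(DeleteMessageBatchRequest.builder()
                    .queueUrl(queueUrl)
                    .entries(entries)
                    .build());
        }
    }

    private static void process(String body) {
        System.out.println("processing: " + body); // your actual work goes here
    }
}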
For the sake of HA I'm considering switching from a self-hosted solution (ZeroMQ) to AWS Simple Notification Service for pub/sub in an application. It is the backend for an app and thus should be reasonably real-time.
What latency and throughput can I expect from SNS?
Is the app going to be hosted on EC2? If so, the latency will be far lower, as the communication will travel over Amazon's own network rather than through the internet.
If you are going to call AWS services from boxes not hosted on EC2, here's a cool site that attempts to give you an idea of the amount of latency between you and various AWS services and locations.
How are you measuring the HTTP Ping Request Latency?
We are making an HTTP GET request to AWS Service Endpoints (like EC2, SQS, SNS etc.) for PING and measuring the observed latency for it across all regions.
As for throughput, that is left up to you. You can use various strategies to increase throughput, like multi-threading, batching messages, etc.
Keep in mind that you will have to code for some side effects, like possibly seeing the same message twice (at-least-once delivery) and not being able to rely on FIFO ordering.
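As a rough illustration of handling at-least-once delivery, the subscriber can deduplicate by message ID before doing the work. This is only a sketch: it keeps the seen IDs in memory, whereas a real deployment would use a persistent store shared by all consumer instances, and it assumes the message ID is stable across redeliveries:

import java.util.concurrent.ConcurrentHashMap;

public class IdempotentHandler {

    // In-memory record of processed message IDs. In production this would be a
    // persistent, shared store (database row, DynamoDB item, Redis key with TTL, ...).
    private final ConcurrentHashMap.KeySetView<String, Boolean> seen =
            ConcurrentHashMap.newKeySet();

    public void handle(String messageId, String body) {
        // add() returns false if the ID was already present, i.e. a duplicate delivery.
        if (!seen.add(messageId)) {
            return; // already processed: skip the at-least-once duplicate
        }
        process(body);
    }

    private void process(String body) {
        System.out.println("processing: " + body); // your actual work goes here
    }
}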