AWS ELB - ECS - RDS stack latency - amazon-web-services

This is a bit of share and compare kind of question.
I have the following stack deployed on AWS:
ELB > ECS Fargate > node/express > RDS
I'm (negatively) surprised by some of the latencies observed for some simple requests that involve or not DB queries:
simple requests to /healthcheck would average at 150/200ms
simple SELECT queries done directly to my RDS instance thru pgadmin would average at 400ms (I only have a few entries on the requested table).
I tried to search for benchmark results but couldn't find anything useful. So I'd be grateful for anyone sharing his experience for a similar stack.
Thanks a lot!
Additional info on the deployment:
both ECS and RDS deployed within the same region (eu-west-1)
requests made from Spain (could that be it?)
ECS sits on 256 cpu units and 512 reserved memory
I'm the only one making requests on a dev environment (is there any "cold start" on ELB?)
RDS sits on an db.t2.micro instance and a postgresql v12.4 engine
Thanks #Maurice, I've added the info in the ticket but here a summary:
no utilization issue: single digit CPU utilization and memory at c. 25%; CPU never goes up 10% with several requests and memory always stable.
I instantiate the DB connection via Sequelize when creating the app and reuse it for each request. DB pooling used via Sequelize with 4 max connections
A typical cURL latency analysis on the ELB dns:
❯ curl -kso /dev/null http://be-api-main-elb-uat.wantedtv.com -w "==============\n\n
| dnslookup: %{time_namelookup}\n
| connect: %{time_connect}\n
| appconnect: %{time_appconnect}\n
| pretransfer: %{time_pretransfer}\n
| starttransfer: %{time_starttransfer}\n
| total: %{time_total}\n
| size: %{size_download}\n
| HTTPCode=%{http_code}\n\n"
==============
| dnslookup: 0,003741
| connect: 0,065718
| appconnect: 0,000000
| pretransfer: 0,065813
| starttransfer: 0,155532
| total: 0,155639
| size: 92
| HTTPCode=200

ECS sits on 256 cpu units and 512 reserved memory
It might be worth allocating more resources to see if that has some improvement, especially since there are some weird hidden limitations tied to the different CPU and Memory levels that might not be that apparent at first. Since 0.25 vCPU doesn't even give you a full thread to work with there could be other preempting going on that isn't visible to you.
Outside of that there are other things you can look for-
Is your application pooling requests to RDS, or creating new ones each time? I know you're intending to use pooling but it might be worth confirming it is actually working.
Are you exposing your container directly to the load balancer, or using a sidecar container such as NGINX to handle request buffering?
What happens if you hit the containers directly instead of through the load balancer? This can at least help isolate whether the issue is on the load balancer or on the container side.
How does your application handle concurrency?
How much data is being sent in each request? It's possible that large amounts of data may be locking up threads or processes and making other requests slow down as a result.
Are there any other services involved that aren't obvious? I once had a service crash because the logging service we were using broke, causing the messages to queue up and lock down the services.
The basic idea with a lot of this is to try and isolate the various components to identify the one causing the slow down. I do believe it'll end up being something in the task itself (service container, sidecar, or service) considering you mentioned quick responses from the database server itself.

Related

Eureka Server memory, renew threshold is 0, self preservation issue - AWS

I deployed 2 instances of Eureka server and a total of 12 instances microservices. .
Renews (last min) is as expected 24. But Renew Threshold is always 0. Is this how it supposed to be when self preservation is turned on? Also seeing this error - THE SELF PRESERVATION MODE IS TURNED OFF. THIS MAY NOT PROTECT INSTANCE EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS. What's the expected behavior in this case and how to resolve this if this is a problem?
As mentioned above, I deployed 2 instances of Eureka Server but after running for a while like around 19-20 hours, one instance of Eureka Server always goes down. Why that could be possibly happening? I checked the processes running using top command and found that Eureka Server is taking a lot of memory. What needs to be configured on Eureka Server so that it don't take a lot of memory?
Below is the configuration in the application.properties file of Eureka Server:
spring.application.name=eureka-server
eureka.instance.appname=eureka-server
eureka.instance.instance-id=${spring.application.name}:${spring.application.instance_id:${random.int[1,999999]}}
eureka.server.enable-self-preservation=false
eureka.datacenter=AWS
eureka.environment=STAGE
eureka.client.registerWithEureka=false
eureka.client.fetchRegistry=false
Below is the command that I am using to start the Eureka Server instances.
#!/bin/bash
java -Xms128m -Xmx256m -Xss256k -XX:+HeapDumpOnOutOfMemoryError -Dspring.profiles.active=stage -Dserver.port=9011 -Deureka.instance.prefer-ip-address=true -Deureka.instance.hostname=example.co.za -Deureka.client.serviceUrl.defaultZone=http://example.co.za:9012/eureka/ -jar eureka-server-1.0.jar &
java -Xms128m -Xmx256m -Xss256k -XX:+HeapDumpOnOutOfMemoryError -Dspring.profiles.active=stage -Dserver.port=9012 -Deureka.instance.prefer-ip-address=true -Deureka.instance.hostname=example.co.za -Deureka.client.serviceUrl.defaultZone=http://example.co.za:9011/eureka/ -jar eureka-server-1.0.jar &
Is this approach to create multiple instances of Eureka Server correct?
Deployment is on AWS. Is there any specific configuration needed for Eureka Server on AWS?
Spring Boot version: 2.3.4.RELEASE
I am new to all these, any help or direction will be a great help.
Let me try to answer your question one by one.
Renews (last min) is as expected 24. But Renew Threshold is always 0. Is this how it supposed to be when self-preservation is turned on?
What's the expected behaviour in this case and how to resolve this if this is a problem?
I can see that eureka.server.enable-self-preservation=false in your configuration, This is really needed if you want to remove an already registered application as soon as it fails to renew its lease.
Self-preservation feature is to prevent the above-mentioned situation since it can happen if there are some network hiccups. Say, you have two services A and B, both are registered to eureka and suddenly, B failed to renew its lease because of a temporary network hiccup. If Self-preservation is not there then B will be removed from the registry and A won't be able to reach B despite B is available.
So we can say that Self-preservation is a resiliency feature of eureka.
Renews threshold is the expected renews per minute, Eureka server enters self-preservation mode if the actual number of heartbeats in last minute(Renews) is less than the expected number of renews per minute(Renew Threshold) and
Of course, you can control the Renews threshold. you can configure renewal-percent-threshold (by default it is 0.85)
So in your case,
Total number of application instances = 12
You don't have eureka.instance.leaseRenewalIntervalInSeconds so default value 30s
and eureka.client.registerWithEureka=false
so Renewals(last minute) will be 24
You don't have renewal-percent-threshold configured, so the default value is 0.85
Number of renewals per application instance per minute = 2 (30s each)
so in case of self-preservation is enable Renews threshold will be calculated as 2 * 12 * 0.85 = 21 (rounded)
And in your case self-preservation is turned off, so Eureka won't calculate Renews Threshold
One instance of Eureka Server always goes down. Why that could be possibly happening?
I'm not able to answer this question time being, this can be because of multiple reasons.
You can find the reason mostly from logs, or if you can post logs here it would be great.
What needs to be configured on Eureka Server so that it doesn't take a lot of memory?
From the information that you have provided, I cannot tell about your memory issue and in addition to that you already specified -Xmx256m and I didn't face any memory issues with the eureka servers so far.
But I can say that top is not the right tool for checking the memory consumed by your java process. When JVM starts, It takes some memory from the operating system.
This is the amount of memory you see in tools like ps and top. so better use jstat or jvmtop
Is this approach to create multiple instances of Eureka Server correct?
It seems you are having the same hostname(eureka.instance.hostname) for both instances. Replication won't work if you use the same hostname.
And make sure that you have the same application names in both instances.
Deployment is on AWS. Is there any specific configuration needed for Eureka Server on AWS?
Nothing specifically for AWS as per my knowledge, other than making sure that the instances can communicate with each other.

Seeing RDS Deadlock errors, could this be tied to IOPS limit

I'm seeing some errors on our AWS RDS MySQL server:
General error: 1205 Lock wait timeout exceeded; try restarting transaction
Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction
Looked at the RDS console monitoring tab, and there seems read IOPS is cut off 1, perhaps indicating that the disk IO is not keeping up with the requests. Funny thing is that write IOPS does not seem to be cut off 2. In general there's very few app server requests that fail due to the database error, but would like to get this sorted.
CPU load on the RDS server peaks around 50%. This makes me think the db.t3.small RDS size is sufficient.
The database is tiny, just 20GB and was created some years ago, so it's on magnetic storage. Have read that this means there's a limit of 200 IOPS, which matches the approx 150 + 50 IOPS peaks seen. I am therefore thinking about moving to General Purpose SSD. However for the small db this will only provide 100 IOPS as baseline performance according to the docs, but according to the docs, a burst load of 3000IOPS is possible.
Does this sound like a good move, and any other suggestions on what to do?
I have been running with General Purpose SSD for a couple of days now. The MySQL deadlock errors have not been seen since, so in case someone else finds the question, change from Magnetic to General Purpose SSD in RDS is certainly something to try out if you have similar problems.

Simple HelloWorld app on cloudrun (or knative) seems too slow

I deployed a sample HelloWorld app on Google Cloud Run, which is basically k-native, and every call to the API takes 1.4 seconds at best, in an end-to-end manner. Is it supposed to be so?
The sample app is at https://cloud.google.com/run/docs/quickstarts/build-and-deploy
I deployed the very same app on my localhost as a docker container and it takes about 22ms, end-to-end.
The same app on my GKE cluster takes about 150 ms, end-to-end.
import os
from flask import Flask
app = Flask(__name__)
#app.route('/')
def hello_world():
target = os.environ.get('TARGET', 'World')
return 'Hello {}!\n'.format(target)
if __name__ == "__main__":
app.run(debug=True,host='0.0.0.0',port=int(os.environ.get('PORT', 8080)))
I am a little experience in FaaS and I expect API calls would get faster as I invoked them in a row. (as in cold start vs. warm start)
But no matter how many times I execute the command it doesn't go below 1.4 seconds.
I think the network distance isn't the dominant factor here. The round-trip time via ping to the API endpoint is only 50ms away, more or less
So my questions are as follows:
Is it potentially an unintended bug? Is it a technical difficulty which will be resolved eventually? Or maybe nothing's wrong, it's just the SLA of k-native?
If nothing's wrong with Google Cloud Run and/or k-native, what is the dominant time-consuming factor here for my API call? I'd love to learn the mechanism.
Additional Details:
Where I am located at: Seoul/Asia
The region for my Cloud Run app: us-central1
type of Internet connection I am testing under: Business, Wired
app's container image size: 343.3MB
the bucket location that Container Registry is using: gcr.io
WebPageTest from Seoul/Asia (warmup time):
Content Type: text/html
Request Start: 0.44 s
DNS Lookup: 249 ms
Initial Connection: 59 ms
SSL Negotiation: 106 ms
Time to First Byte: 961 ms
Content Download: 2 ms
WebPageTest from Chicago/US (warmup time):
Content Type: text/html
Request Start: 0.171 s
DNS Lookup: 41 ms
Initial Connection: 29 ms
SSL Negotiation: 57 ms
Time to First Byte: 61 ms
Content Download: 3 ms
ANSWER by Steren, the Cloud Run product manager
We have detected high latency when calling Cloud Run services from
some particular regions in the world. Sadly, Seoul seems to be one of
them.
[Update: This person has a networking problem in his area. I tested his endpoint from Seattle with no problems. Details in the comments below.]
I have worked with Cloud Run constantly for the past several months. I have deployed several production applications and dozens of test services. I am in Seattle, Cloud Run is in us-central1. I have never noticed a delay. Actually, I am impressed with how fast a container starts up.
For one of my services, I am seeing cold start time to first byte of 485ms. Next invocation 266ms, 360ms. My container is checking SSL certificates (2) on the Internet. The response time is very good.
For another service which is a PHP website, time to first byte on cold start is 312ms, then 94ms, 112ms.
What could be factors that are different for you?
How large is your container image? Check Container Registry for the size. My containers are under 100 MB. The larger the container the longer the cold start time.
Where is the bucket located that Container Registry is using? You want the bucket to be in us-central1 or at least US. This will change soon with when new Cloud Run regions are announced.
What type of Internet connection are you testing under? Home based or Business. Wireless or Ethernet connection? Where in the world are you testing from? Launch a temporary Compute Engine instance, repeat your tests to Cloud Run and compare. This will remove your ISP from the equation.
Increase the memory allocated to the container. Does this affect performance? Python/Flask does not require much memory, my containers are typically 128 MB and 256 MB. Container images are loaded into memory, so if you have a bloated container, you might now have enough memory left reducing performance.
What does Stackdriver logs show you? You can see container starts, requests, and container terminations.
(Cloud Run product manager here)
We have detected high latency when calling Cloud Run services from some particular regions in the world. Sadly, Seoul seems to be one of them.
We will explicitly capture this as a Known issue and we are working on fixing this before General Availability. Feel free to open a new issue in our public issue tracker

Scrapy + Splash returns a lot of 504 Time Out errors

I have followed Splash's FAQ for production setups and my system currently looks like this:
1 Scrapy Container with 6 concurrency requests.
1 HAProxy Container that load balance to splash containers
2 Splash Containers with 3 slots each.
I use docker stats to monitor my setup and I never get more than 7% CPU usage or more than 55% Memory usage.
I still get a lot of
DEBUG: Retrying <GET https://the/url/ via http://haproxy:8050/execute> (failed 1 times): 504 Gateway Time-out
For every successful request I get 6-7 of these timeouts.
I have experimented with changing the slots of the splash containers and the amount of concurrency requests. I've also tried running with a single splash container behind the HAProxy. I keep getting these errors.
I'm running on a AWS EC2 t2.micro instance which have 1gb memory.
I suspect that the issue is still related to the splash instance getting flooded. Is there any advice you can give me to reduce the load of the Splash instances? Is there a good ratio between slots and concurrency requests? Should I throttle requests?

Memory issues on RDS PostgreSQL instance / Rails 4

We are running into a memory issues on our RDS PostgreSQL instance i. e. Memory usage of the postgresql server reaches almost 100% resulting in stalled queries, and subsequent downtime of production app.
The memory usage of the RDS instance doesn't go up gradually, but suddenly within a period of 30min to 2hrs
Most of the time this happens, we see that lot of traffic from bots is going on, though there is no specific pattern in terms of frequency. This could happen after 1 week to 1 month of the previous occurence.
Disconnecting all clients, and then restarting the application also doesn't help, as the memory usage again goes up very rapidly.
Running "Full Vaccum" is the only solution we have found that resolves the issue when it occurs.
What we have tried so far
Periodic vacuuming (not full vacuuming) of some tables that get frequent updates.
Stopped storing Web sessions in DB as they are highly volatile and result in lot of dead tuples.
Both these haven't helped.
We have considered using tools like pgcompact / pg_repack as they don't acquire exclusive lock. However these can't be used with RDS.
We now see a strong possibility that this has to do with memory bloat that can happen on postgresql with prepared statements in rails 4, as discussed in following pages:
Memory leaks on postgresql server after upgrade to Rails 4
https://github.com/rails/rails/issues/14645
As a quick trial, we have now disabled prepared statements in our rails database configuration, and are observing the system. If the issue re-occurs, this hypothesis would be proven wrong.
Setup details:
We run our production environment inside Amazon Elastic Beanstalk, with following configuration:
App servers
OS : 64bit Amazon Linux 2016.03 v2.1.0 running Ruby 2.1 (Puma)
Instance type: r3.xlarge
Root volume size: 100 GiB
Number of app servers : 2
Rails workers running on each server : 4
Max number of threads in each worker : 8
Database pool size : 50 (applicable for each worker)
Database (RDS) Details:
PostgreSQL Version: PostgreSQL 9.3.10
RDS Instance type: db.m4.2xlarge
Rails Version: 4.2.5
Current size on disk: 2.2GB
Number of tables: 94
The environment is monitored with AWS cloudwatch and NewRelic.
Periodic vacuum should help in containing table bloat but not index bloat.
1)Have you tried more aggressive parameters of auto-vacuum ?
2)Tried routine reindexing ? If locking is a concern then consider
DROP INDEX CONCURRENTLY ...
CREATE INDEX CONCURRENTLY ...