Experienced problems with our RDS instance - amazon-web-services

We experienced problems with our RDS instance.
RDS stopped running. The instance was shown in a "green" state on the AWS console, but we could not connect to it.
In the Cloud Logs we found the following errors:
2018-03-07 8:52:31 47886953160896 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
InnoDB: Set innodb_force_recovery to ignore this error.
2018-03-07 8:52:32 47886953160896 [ERROR] Plugin 'InnoDB' init function returned error.
2018-03-07 8:53:46 47508779897024 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
InnoDB: Set innodb_force_recovery to ignore this error.
2018-03-07 8:53:46 47508779897024 [ERROR] Plugin 'InnoDB' init function returned error.
When we tried to reboot the RDS instance, it took almost 2 hours to reboot. After the reboot it is working fine again.
Can someone help us find the root cause of this incident?

A t2.small provides 2 GB of RAM. However, as you may know, most DB engines tend to use up to 75% of the memory for caching (queries, temporary tables, table scans) to make things go faster.
For the MariaDB engine, the following parameters are set by default to these pre-optimized values:
innodb_buffer_pool_size (DB instance memory * 3/4 = 1.5 GB)
key_buffer_size (16777216 bytes = 16 MiB, about 16.8 MB)
innodb_log_buffer_size (8388608 bytes = 8 MiB, about 8.4 MB)
Apart from that, the OS and the RDS processes also use some RAM for their own operations. To summarize, approximately 1.6 GB is used by the DB engine, and the memory actually left after subtracting innodb_buffer_pool_size, key_buffer_size and innodb_log_buffer_size is around 400 MB.
Overall, your Freeable Memory dropped as low as ~137 MB. As a result, Swap Usage increased drastically in the same time period, to approximately 152 MB.
So FreeableMemory was quite low and swap utilization was high. Due to this memory pressure (insufficient memory and high swap usage), the RDS internal monitoring system was not able to communicate with the host, which in turn resulted in the underlying host being replaced.
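As a rough check of the arithmetic above, here is a minimal Python sketch. It assumes the 2 GB of RAM is exactly 2 GiB and that the buffer pool takes exactly 3/4 of it. The function name is made up for illustration, and the overhead of the OS and the RDS processes is not modelled.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def remaining_memory_bytes(instance_ram, buffer_pool_fraction, key_buffer, log_buffer):
    """RAM left once the three engine buffers named above are allocated."""
    buffer_pool = instance_ram * buffer_pool_fraction
    return instance_ram - (buffer_pool + key_buffer + log_buffer)


# Parameter values quoted in the answer; the 2 GiB figure is an assumption.
left = remaining_memory_bytes(
    instance_ram=2 * GIB,
    buffer_pool_fraction=0.75,
    key_buffer=16777216,
    log_buffer=8388608,
)
left_mib = left / MIB
print(f"{left_mib:.0f} MiB left before the OS and RDS processes take their share")
# prints "488 MiB left before the OS and RDS processes take their share"
```

Everything the OS, the RDS processes and per-connection buffers need has to come out of that remainder, which fits with Freeable Memory falling to ~137 MB and swap being used.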

Related

Cloud Run, ideal vCPU and memory amount per instance?

When setting up Cloud Run, I am unsure how much memory and how many vCPUs I should set per server instance.
I use Cloud Run for mobile apps.
I am confused about when to increase vCPU and memory instead of the number of server instances, and when to increase the number of server instances instead of vCPU and memory.
How should I calculate it?
There isn't a good answer to that question. You have to know the limits:
The max number of concurrent requests that one instance can handle with up to 4 CPUs and/or 32 GB of memory (up to 1000 concurrent requests)
The max number of instances on Cloud Run (1000)
Then it's a matter of tradeoffs, and it's highly dependent on your use case.
Bigger instances reduce the number of cold starts (and so the high latency when your service scales up). But if you have only 1 request at a time, you will pay for a BIG instance to do a small amount of processing.
Smaller instances allow you to optimize cost and to add only a small slice of resources to your cluster at a time, but you will often have to spawn a new instance and you will have several cold starts to endure.
Optimize for what you prefer and find the right balance. No magic formula!
You can simulate a load of requests against your current settings using k6.io, check the memory and CPU percentage of your container, and adjust them to a lower or higher setting to see if you can get more RPS out of a single container.
Once you are satisfied with a single container instance's throughput, let's say 100 RPS per container instance, you can then use gcloud to set the --min-instances and --max-instances flags, depending of course on the --concurrency flag, which in my explanation would be set to 100.
Also note that concurrency starts at the default of 80 and can go up to 1000.
More info about this can be read on the links below:
https://cloud.google.com/run/docs/about-concurrency
https://cloud.google.com/sdk/gcloud/reference/run/deploy
I would also recommend investigating whether you need to pass the --cpu-throttling or the --no-cpu-throttling flag, depending on your need to adjust for cold starts.
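Once a load test gives you a per-instance throughput, picking a value for --max-instances is mostly a division. The sketch below is only a back-of-the-envelope helper under that assumption: the function name and the traffic numbers are hypothetical, and it ignores cold starts and uneven load.

```python
import math

MAX_INSTANCES = 1000  # instance ceiling quoted in the first answer


def instances_needed(target_rps, rps_per_instance):
    """Instances required to serve target_rps, capped at the platform limit."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return min(MAX_INSTANCES, math.ceil(target_rps / rps_per_instance))


# Hypothetical peak of 2500 RPS with the 100 RPS per instance from the example.
peak_instances = instances_needed(target_rps=2500, rps_per_instance=100)
print(peak_instances)  # 25
```

If the result is close to the ceiling, that is a hint to try bigger instances (more vCPU and memory, higher concurrency) instead of more of them.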

Seeing RDS Deadlock errors, could this be tied to IOPS limit

I'm seeing some errors on our AWS RDS MySQL server:
General error: 1205 Lock wait timeout exceeded; try restarting transaction
Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction
I looked at the RDS console monitoring tab, and read IOPS seems to be cut off, perhaps indicating that the disk I/O is not keeping up with the requests. The funny thing is that write IOPS does not seem to be cut off. In general, very few app server requests fail due to the database error, but I would like to get this sorted.
CPU load on the RDS server peaks around 50%. This makes me think the db.t3.small RDS size is sufficient.
The database is tiny, just 20 GB, and was created some years ago, so it's on magnetic storage. I have read that this means there's a limit of 200 IOPS, which matches the approx. 150 + 50 IOPS peaks seen. I am therefore thinking about moving to General Purpose SSD. For this small DB that will only provide 100 IOPS as baseline performance, but according to the docs a burst load of 3000 IOPS is possible.
Does this sound like a good move, and any other suggestions on what to do?
I have been running with General Purpose SSD for a couple of days now. The MySQL deadlock errors have not been seen since, so in case someone else finds this question: changing from Magnetic to General Purpose SSD in RDS is certainly something to try if you have similar problems.
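For anyone sizing a similar move, here is a small sketch of the baseline figure mentioned in the question. It assumes the gp2 rule of 3 IOPS per GiB with a floor of 100, which is my reading of the docs and not something stated in this thread; only the 100 IOPS baseline for 20 GB and the 3000 IOPS burst come from the question. Check the current RDS storage documentation before relying on it.

```python
BURST_IOPS = 3000  # burst ceiling quoted in the question


def gp2_baseline_iops(size_gib, iops_per_gib=3, floor=100):
    """Baseline IOPS for a gp2 volume under the assumed sizing rule."""
    if size_gib <= 0:
        raise ValueError("size_gib must be positive")
    return max(floor, size_gib * iops_per_gib)


baseline = gp2_baseline_iops(20)  # the 20 GB database from the question
print(baseline, BURST_IOPS)  # 100 3000
```

The baseline is lower than the 200 IOPS limit on magnetic storage, so the improvement here presumably comes from the burst allowance covering the short peaks.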

'Kubelet stopped posting node status' and node inaccessible

I am having some issues with a fairly new cluster where a couple of nodes (always seems to happen in pairs but potentially just a coincidence) will become NotReady and a kubectl describe will say that the Kubelet stopped posting node status for memory, disk, PID and ready.
All of the running pods are stuck in Terminating (can use k9s to connect to the cluster and see this) and the only solution I have found is to cordon and drain the nodes. After a few hours they seem to be being deleted and new ones created. Alternatively I can delete them using kubectl.
They are completely inaccessible via ssh (timeout) but AWS reports the EC2 instances as having no issues.
This has now happened three times in the past week. Everything does recover fine but there is clearly some issue and I would like to get to the bottom of it.
How would I go about finding out what has gone on if I cannot get onto the boxes at all? (Actually just occurred to me to maybe take a snapshot of the volume and mount it so will try that if it happens again, but any other suggestions welcome)
Running kubernetes v1.18.8
The two most common possibilities here, both most likely caused by a large load, are:
Out of Memory error on the kubelet host. This can be solved by adding proper --kubelet-extra-args to BootstrapArguments. For example: --kubelet-extra-args "--kube-reserved memory=0.3Gi,ephemeral-storage=1Gi --system-reserved memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard memory.available<200Mi,nodefs.available<10%"
An issue explained here:
kubelet cannot patch its node status sometimes, ’cos more than 250
resources stay on the node, kubelet cannot watch more than 250 streams
with kube-apiserver at the same time. So, I just adjust kube-apiserver
--http2-max-streams-per-connection to 1000 to relieve the pain.
You can either adjust the values provided above or try to find the cause of the high load/IOPS and tune it down.
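To see what the reservation flags in the first option do to schedulable memory, the sketch below applies the usual node allocatable rule: capacity minus kube-reserved, system-reserved and the hard eviction threshold. The 4 GiB node size and the function name are made-up examples, not values from the question.

```python
def allocatable_mib(capacity_mib, kube_reserved_mib, system_reserved_mib, eviction_hard_mib):
    """Memory (MiB) the scheduler can hand out to pods on one node."""
    reserved = kube_reserved_mib + system_reserved_mib + eviction_hard_mib
    if reserved > capacity_mib:
        raise ValueError("reservations exceed node capacity")
    return capacity_mib - reserved


# Reservations taken from the example flags: 0.3Gi, 0.2Gi and 200Mi.
node = allocatable_mib(
    capacity_mib=4096,  # hypothetical 4 GiB node
    kube_reserved_mib=0.3 * 1024,
    system_reserved_mib=0.2 * 1024,
    eviction_hard_mib=200,
)
print(f"{node:.0f} MiB allocatable to pods")  # prints "3384 MiB allocatable to pods"
```

The point of the reservation is that the kubelet and system daemons keep some headroom, so a memory-hungry pod gets evicted before the whole node stops responding.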
I had the same issue: after 20-30 min my nodes went into NotReady status, and all pods linked to these nodes became stuck in Terminating status.
I tried to connect to my nodes via SSH. Sometimes I faced a timeout, sometimes I could (barely) connect, and I ran the top command to check the running processes.
The most resource-hungry process was kswapd0.
My instance memory and CPU were both full (!), because the instance was swapping a lot (due to a lack of memory), causing the kswapd0 process to consume more than 50% of the CPU!
Root cause:
Some pods consumed 400% of their memory request (defined in the Kubernetes deployment), because they were initially under-provisioned. As a consequence, when my nodes started, Kubernetes placed the pods on nodes with only 32Mi of memory requested per pod (the value I had defined), but that was insufficient.
Solution:
The solution was to increase the containers' requests, replacing:
requests:
  memory: "32Mi"
  cpu: "20m"
limits:
  memory: "256Mi"
  cpu: "100m"
With these values (in my case):
requests:
  memory: "256Mi"
  cpu: "20m"
limits:
  memory: "512Mi"
  cpu: "200m"
Important:
After that I performed a rolling update (cordon > drain > delete) of my nodes, to ensure that Kubernetes reserves enough memory right away for my freshly started pods.
Conclusion:
Regularly check your pods' memory consumption, and adjust your resource requests over time.
The goal is to never let your nodes be surprised by memory saturation, because swapping can be fatal for your nodes.
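A simple way to do that regular check is to compare observed usage against the configured request for each pod. The sketch below assumes you have already collected both numbers (for example from your monitoring); the pod names and figures are invented for illustration.

```python
def over_request(pods, threshold=1.0):
    """Return (pod, usage/request ratio) for pods above threshold, worst first.

    pods maps a pod name to a (usage_mib, request_mib) tuple.
    """
    flagged = [
        (name, usage / request)
        for name, (usage, request) in pods.items()
        if usage > threshold * request
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical measurements in MiB: (observed usage, configured request).
sample = {"api": (128, 32), "worker": (20, 32), "cache": (300, 256)}
report = over_request(sample)
for name, ratio in report:
    print(f"{name}: using {ratio:.0%} of its memory request")
```

In this made-up sample the "api" pod sits at 400% of its request, the same situation as described above, so its request is the first one to raise.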
The answer turned out to be an issue with IOPS as a result of du commands coming from (I think) cadvisor. I have moved to io1 boxes and have had stability since then, so I am going to mark this as closed, with the move of EC2 instance types as the resolution.
Thanks for the help!

Memory issues on RDS PostgreSQL instance / Rails 4

We are running into memory issues on our RDS PostgreSQL instance, i.e. memory usage of the PostgreSQL server reaches almost 100%, resulting in stalled queries and subsequent downtime of the production app.
The memory usage of the RDS instance doesn't go up gradually, but suddenly, within a period of 30 min to 2 hrs.
Most of the time when this happens, we see a lot of traffic from bots, though there is no specific pattern in terms of frequency. It can happen anywhere from 1 week to 1 month after the previous occurrence.
Disconnecting all clients and then restarting the application also doesn't help, as the memory usage goes up again very rapidly.
Running VACUUM FULL is the only solution we have found that resolves the issue when it occurs.
What we have tried so far
Periodic vacuuming (not full vacuuming) of some tables that get frequent updates.
Stopped storing web sessions in the DB, as they are highly volatile and result in a lot of dead tuples.
Neither of these has helped.
We have considered using tools like pgcompact / pg_repack, as they don't acquire an exclusive lock. However, these can't be used with RDS.
We now see a strong possibility that this has to do with memory bloat that can happen on PostgreSQL with prepared statements in Rails 4, as discussed in the following pages:
Memory leaks on postgresql server after upgrade to Rails 4
https://github.com/rails/rails/issues/14645
As a quick trial, we have now disabled prepared statements in our rails database configuration, and are observing the system. If the issue re-occurs, this hypothesis would be proven wrong.
Setup details:
We run our production environment inside Amazon Elastic Beanstalk, with the following configuration:
App servers
OS : 64bit Amazon Linux 2016.03 v2.1.0 running Ruby 2.1 (Puma)
Instance type: r3.xlarge
Root volume size: 100 GiB
Number of app servers : 2
Rails workers running on each server : 4
Max number of threads in each worker : 8
Database pool size : 50 (applicable for each worker)
Database (RDS) Details:
PostgreSQL Version: PostgreSQL 9.3.10
RDS Instance type: db.m4.2xlarge
Rails Version: 4.2.5
Current size on disk: 2.2GB
Number of tables: 94
The environment is monitored with AWS cloudwatch and NewRelic.
Periodic vacuuming should help contain table bloat, but not index bloat.
1) Have you tried more aggressive autovacuum parameters?
2) Have you tried routine reindexing? If locking is a concern, then consider
DROP INDEX CONCURRENTLY ...
CREATE INDEX CONCURRENTLY ...
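One common way to order those statements is to build the replacement index first, then drop the bloated one and rename. The sketch below only generates the SQL text; the table, index and column names are hypothetical, and it assumes a plain index (one that backs a primary key or unique constraint cannot be dropped this way).

```python
def concurrent_reindex_statements(table, index, columns):
    """SQL statements that rebuild `index` without a long exclusive lock."""
    new_index = f"{index}_new"
    column_list = ", ".join(columns)
    return [
        f"CREATE INDEX CONCURRENTLY {new_index} ON {table} ({column_list});",
        f"DROP INDEX CONCURRENTLY {index};",
        f"ALTER INDEX {new_index} RENAME TO {index};",
    ]


# Hypothetical table and index names, for illustration only.
statements = concurrent_reindex_statements(
    table="orders",
    index="index_orders_on_user_id",
    columns=["user_id"],
)
for statement in statements:
    print(statement)
```

Note that the CONCURRENTLY statements cannot run inside a transaction block, so each one has to be executed on its own.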

Elasticsearch percolation dead slow on AWS EC2

Recently we switched our cluster to EC2 and everything is working great... except percolation :(
We use Elasticsearch 2.2.0.
To reindex (and percolate) our data we use a separate EC2 c3.8xlarge instance (32 cores, 60GB, 2 x 160 GB SSD) and tell our index to include only this node in allocation.
Because we'll distribute it amongst the rest of the nodes later, we use 10 shards, no replicas (just for indexing and percolation).
There are about 22 million documents in the index and 15,000 percolators. The index is a tad smaller than 11 GB (and so easily fits into memory).
About 16 PHP processes talk to the REST API doing multi percolate requests with 200 requests in each (we made it smaller because of the performance; it was 1000 per request before).
One percolation request (a real one, tapped off of the php processes running) is taking around 2m20s under load (of the 16 php processes). That would've been ok if one of the resources on the EC2 was maxed out but that's the strange thing (see stats output here but also seen on htop, iotop and iostat): load, cpu, memory, heap, io; everything is well (very well) within limits. There doesn't seem to be a shortage of resources but still, percolation performance is bad.
When we back off the php processes and try the percolate request again, it comes out at around 15s. Just to be clear: I don't have a problem with a 2min+ multi percolate request. As long as I know that one of the resources is fully utilized (and I can act upon it by giving it more of what it wants).
So, ok, it's not the usual suspects, let's try different stuff:
To rule out network, coordination, etc. issues, we also did the same request from the node itself (enabling the client) with the same pressure from the PHP processes: no change.
We upped the processors configuration in elasticsearch.yml and restarted the node to fake our way to a higher usage of resources: no change.
We tried tweaking the percolate and get pool size and queue size: no change.
When we looked at the hot threads, we discovered UsageTrackingQueryCachingPolicy was coming up a lot, so we did as suggested in this issue: no change.
Maybe it's the number of replicas, seeing as Elasticsearch uses those to do searches as well? We upped it to 3 and used more EC2 instances to spread them out: no change.
To determine if we could actually use all resources on EC2, we did stress tests and everything seemed fine, getting it to loads of over 40. Also IO, memory, etc showed no issues under high strain.
It could still be the batch size. Under load we tried a batch of just one percolator in a multi percolate request, directly on the data & client node (dedicated to this index) and found that it used 1m50s. When we tried a batch of 200 percolators (still in one multi percolate request) it used 2m02s (which fits roughly with the 15s result of earlier, without pressure).
This last point might be interesting! It seems that it's stuck somewhere for a loooong time and then goes through the percolate phase quite smoothly.
Can anyone make anything out of this? Anything we have missed? We can provide more data if needed.
Have a look at the thread on the Elastic Discuss forum to see the solution.
TLDR;
Use multiple nodes on one big server to get better resource utilization.