RDS eating all the swap space - amazon-web-services

We have been using MariaDB on RDS and noticed that swap usage keeps growing without ever being reclaimed. Freeable memory, however, seems fine. Please check the attached files.
Instance type : db.t2.micro
Freeable memory : 125 MB
Swap space : increasing by 5 MB every 24h
IOPS : disabled
Storage : 10 GB (SSD)
Soon RDS will eat all the swap space, which will cause lots of issues for the app.
Does anyone have similar issues?
What is the maximum swap space? (didn't find anything in the docs)
Please help!

Does anyone have similar issues?
I had similar issues on different instance types. The swapping trend persists even if you switch to a larger instance type with more memory.
AWS explains it as follows:
Amazon RDS DB instances need to have pages in the RAM only when the pages are being accessed currently, for example, when executing queries. Other pages that are brought into the RAM by previously executed queries can be flushed to swap space if they haven't been used recently. It's a best practice to let the operating system (OS) swap older pages instead of forcing the OS to keep pages in memory. This helps make sure that there is enough free RAM available for upcoming queries.
And the resolution:
Check both the FreeableMemory and the SwapUsage Amazon CloudWatch metrics to understand the overall memory usage pattern of your DB instance. Check these metrics for a decrease in the FreeableMemory metric that occurs at the same time as an increase in the SwapUsage metric. This can indicate that there is pressure on the memory of the DB instance.
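The check described above (FreeableMemory falling at the same time as SwapUsage rises) can be sketched in a few lines. This is only a minimal illustration: it assumes you have already exported both CloudWatch metrics as equally spaced lists of numbers, and the function name and the `min_points` threshold are hypothetical choices, not anything AWS prescribes.

```python
from typing import List, Sequence, Tuple


def memory_pressure_windows(
    freeable: Sequence[float],
    swap: Sequence[float],
    min_points: int = 3,
) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges where freeable memory falls
    while swap usage rises for at least `min_points` consecutive samples."""
    if len(freeable) != len(swap):
        raise ValueError("both series must have the same length")

    windows: List[Tuple[int, int]] = []
    start = None
    for i in range(1, len(freeable)):
        pressured = freeable[i] < freeable[i - 1] and swap[i] > swap[i - 1]
        if pressured:
            if start is None:
                start = i - 1  # the window begins at the previous sample
        else:
            if start is not None and i - start >= min_points:
                windows.append((start, i - 1))
            start = None
    # Close a window that is still open at the end of the series.
    if start is not None and len(freeable) - start >= min_points:
        windows.append((start, len(freeable) - 1))
    return windows
```

With flat freeable memory and slowly growing swap (the pattern in the question), the function returns no windows, which matches the AWS explanation that such swapping alone is not a sign of memory pressure.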
What is the maximum swap space?
By enabling Enhanced Monitoring you should be able to see OS metrics, e.g. The amount of swap memory free, in kilobytes.
See details here
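As a rough sketch of how the Enhanced Monitoring numbers could be used, the snippet below reads the swap figures from one OS-metrics sample and estimates how long the free swap would last at a given growth rate. It assumes the sample is a JSON document with a `swap` object holding `total` and `free` in kilobytes; the field names, the sample values and both function names are assumptions for illustration, so check them against your own monitoring output.

```python
import json
from typing import Dict


def swap_summary(os_metrics_json: str) -> Dict[str, float]:
    """Extract total, free and used swap (in KB) from one monitoring sample."""
    sample = json.loads(os_metrics_json)
    swap = sample["swap"]  # assumed layout: {"total": KB, "free": KB, ...}
    total_kb = float(swap["total"])
    free_kb = float(swap["free"])
    return {"total_kb": total_kb, "free_kb": free_kb, "used_kb": total_kb - free_kb}


def days_until_swap_full(free_kb: float, growth_mb_per_day: float) -> float:
    """Estimate the days left before free swap runs out at a constant growth rate."""
    if growth_mb_per_day <= 0:
        return float("inf")  # swap is not growing, so it never fills up
    return (free_kb / 1024.0) / growth_mb_per_day


# Hypothetical sample, not taken from a real instance.
EXAMPLE_SAMPLE = '{"swap": {"total": 4194300, "free": 4100000, "cached": 1200}}'
```

For example, `days_until_swap_full(swap_summary(EXAMPLE_SAMPLE)["free_kb"], 5)` gives the number of days the hypothetical instance could keep growing by 5 MB per day.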

Enabling Enhanced Monitoring in RDS has made things clearer.
Obviously what we needed to watch was Committed Swap instead of Swap Usage. We were able to see how much Free Swap we had.
I now also believe that MySQL is dumping things in swap just because there is too much space in there, even though it wasn't really in urgent need of memory.

Related

Google Cloud SQL - Database instance storage size increased dramatically everyday

I have a database instance (MySQL 8) on Google Cloud, and for the past 20 days the instance's storage usage has kept increasing (approx. 2GB every single day!).
But I couldn't find out why.
What I have done:
Checked the "Point-in-time recovery" option: it's already disabled.
Binary logging is not enabled.
Checked the actual database size: my database is only 10GB.
No innodb_file_per_table flag is set, so I assume it is "false" by default.
Storage usage chart:
Database flags:
The actual database size is 10GB, now the storage usage takes up to 220GB! That's a lot of money!
I couldn't resolve this issue, so any tips would be appreciated. Thank you!
I had the same thing happen to me about a year ago. I couldn't determine any root cause of the huge increase in storage size. I restarted the server and the problem stopped. None of my databases experienced any significant increase in size. My best guess is that some runaway process causes the binlog to blow up.
It turned out the problem was a WordPress theme function called "related_products", which read and wrote a record for every product a user came across (millions per day) and made the database's physical size blow up.

Getting an error with no resources when creating a vm that is ongoing

I keep getting an error message that says there are not enough resources in the zone to create a VM (Us-Central F). This has been going on for a couple of days. Is there a way to fix this or report this? Any advice and answers would be appreciated!
You can reserve the resources you need, or wait and retry creating the VM. Lowering the VM specs (a smaller machine type, less RAM, and so on) will also increase your chances.
Otherwise you have to use another zone or even another region. There's no way around it: even GCP has limited resources, and due to high demand some of them may not be available. The only difference will be higher latency.

Disabling Swapping of a Redis Instance on AWS's ElastiCache

We are trying to disable swapping RAM to the disk for a Redis instance managed by AWS's Elasticache - but couldn't find the right property to do so.
We also cannot find a way to SSH into it and turn off swapping at the kernel level. Can you please help?
While not a direct answer to your question about disabling swapping, we've been struggling with Redis swapping on ElastiCache as well. What we ended up doing to address swapping is the following:
Followed Leo's suggestion of setting reserved memory
Run a nightly batch job to SCAN all keys in batches of 10,000. The SCAN command will evict any expired keys. This helps by proactively cleaning up the cache before swapping kicks in.
Run another custom batch job which processes entities we know can be evicted. These are entities which aren't as important as others in the cache. We've set up the keys so they contain enough information to easily identify those associated with an entity. Use SCAN with a MATCH pattern to find the keys, then call DEL on each. This batch job alone is saving lots of space in our Redis instance. A word of caution: avoid the KEYS command, as it is slow and blocks other clients while it runs.
We've been using the above for a few weeks now and so far it has been working well. In a few more weeks we'll know how well it works since we have a default TTL of 30 days and the number of cached items is still increasing.
Good luck!
Update
We turned off the job which uses SCAN on all keys. We discovered it was causing swap to slowly creep up (roughly 500k every other day). Once we turned that off, swap started shrinking. The combination of setting reserved memory and flushing objects we know can be expired is working well. When redis starts running out of room, it evicts any expired cached objects to make room for new entries. The only impact we've noticed is a very small increase in CPU usage, which isn't causing any trouble.
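The targeted cleanup job described above (SCAN with a match pattern, then DEL on the results) could look roughly like the sketch below. It assumes a client object exposing `scan(cursor=..., match=..., count=...)` and `delete(*keys)` the way the redis-py client does; the key pattern, the batch size and the `InMemoryStandIn` class are hypothetical and exist only so the loop can be tried without a server.

```python
import fnmatch
from typing import Iterable, List, Tuple


def delete_matching(client, pattern: str, batch_size: int = 1000) -> int:
    """Walk the keyspace with SCAN and delete every key matching `pattern`.

    Unlike KEYS, SCAN returns a small batch per call, so the server is
    not blocked for the whole traversal. Returns the number of keys deleted.
    """
    deleted = 0
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch_size)
        if keys:
            deleted += client.delete(*keys)
        if cursor == 0:  # a zero cursor means the iteration is complete
            break
    return deleted


class InMemoryStandIn:
    """Tiny stand-in for a Redis client (illustration only, not a real API)."""

    def __init__(self, keys: Iterable[str]):
        self._order = list(keys)
        self._live = set(self._order)

    def scan(self, cursor: int = 0, match: str = None, count: int = 10) -> Tuple[int, List[str]]:
        chunk = self._order[cursor:cursor + count]
        next_cursor = cursor + count
        if next_cursor >= len(self._order):
            next_cursor = 0
        found = [
            k for k in chunk
            if k in self._live and (match is None or fnmatch.fnmatchcase(k, match))
        ]
        return next_cursor, found

    def delete(self, *keys: str) -> int:
        removed = sum(1 for k in keys if k in self._live)
        self._live.difference_update(keys)
        return removed

    def keys_left(self) -> List[str]:
        return sorted(self._live)
```

With a real server, the same `delete_matching` call would take a redis-py client instead of the stand-in, with the pattern built from whatever entity information your keys carry.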
I had a similar problem, where ElastiCache (Redis) in AWS suddenly started using swap space even though we use the allkeys-lru eviction policy. For weeks the machine had been using all of its memory without touching swap, until that changed early one morning.
I used the command
redis-cli -h elasticache.service-name memory DOCTOR
The output was:
High allocator fragmentation: This instance has an allocator external fragmentation greater than 1.1. This problem is usually due either to a large peak memory (check if there is a peak memory entry above in the report) or may result from a workload that causes the allocator to fragment memory a lot. You can try enabling 'activedefrag' config option.
Checking with the command
redis-cli -h elasticache.service-name memory STATS
I saw that the fragmentation value was high (1.4).
In the AWS console, I checked the ElastiCache Redis parameters and enabled the defragmentation setting (activedefrag), which had been disabled.
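A fragmentation check like the one above can also be scripted. The sketch below does not parse MEMORY STATS; as a swapped-in alternative it parses the plain `key:value` text returned by `INFO memory` and compares the `allocator_frag_ratio` field with the 1.1 threshold quoted in the MEMORY DOCTOR message. The function names are hypothetical, and whether your Redis version reports that field is an assumption to verify.

```python
from typing import Dict


def parse_info(text: str) -> Dict[str, str]:
    """Parse `INFO`-style output (lines of key:value) into a dictionary."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, section headers such as "# Memory", and junk.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def fragmentation_is_high(info_text: str, threshold: float = 1.1) -> bool:
    """Return True when the allocator fragmentation ratio exceeds `threshold`."""
    ratio = float(parse_info(info_text)["allocator_frag_ratio"])
    return ratio > threshold
```

The input text could come from `redis-cli -h elasticache.service-name info memory`, captured by whatever scheduling or alerting tool you already use.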
It is not possible to connect to Elasticache via SSH.
Are you sure that you are having issues with Redis swapping to disk, or the host running out of memory and crashing (I've seen this happen with the default configuration)? If so, the guidance is to leave about 25% of the system memory available for host processes - http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/redis-memory-management.html

mmap performance of Amazon EBS

I am looking at porting an application to the cloud; more specifically, I am looking at Amazon EC2 or Google GCE.
My app heavily uses Linux's mmap to memory-map large read-only files, and I would like to understand how mmap would actually work when a file is on an EBS volume.
I would specifically like to know what happens when I call mmap as EBS appears to be a black-box. Also, are the benefits negated?
I can speak for GCE Persistent Disks. It behaves pretty much in the same way a physical disk would. At a high level, pages are faulted in from disk as mapped memory is accessed. Depending on your access pattern these pages might be loaded one by one, or in a larger quantity when readahead kicks in. As the file system cache fills up, old pages are discarded to give space to new pages, writing out dirty pages if needed.
One thing to keep in mind with Persistent Disk is that performance is proportional to disk size. So you'd need to estimate your throughput and IOPS requirements to ensure you get a disk with enough performance for your application. You can find more details here: Persistent disk performance.
Is there any aspect of mmap that you're worried about? I would recommend to write a small app that simulates your workload and test it before deciding to migrate your application.
~ Fabricio.
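A small test app of the kind suggested above might start from something like this sketch. It maps a file read-only and reads one byte per page, so each access can fault a page in from the underlying disk, and it reports how long the pass took. The function names and the sample-file helper are hypothetical, and a real test should mimic your application's actual access pattern and use files larger than the page cache.

```python
import mmap
import os
import tempfile
import time
from typing import Tuple


def touch_pages(path: str, stride: int = mmap.PAGESIZE) -> Tuple[int, float]:
    """Map `path` read-only, read one byte every `stride` bytes,
    and return (pages touched, elapsed seconds)."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return 0, 0.0  # an empty file cannot be mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            start = time.perf_counter()
            checksum = 0
            pages = 0
            for offset in range(0, size, stride):
                checksum ^= mapped[offset]  # the read is what faults the page in
                pages += 1
            elapsed = time.perf_counter() - start
    return pages, elapsed


def make_sample_file(num_bytes: int) -> str:
    """Create a temporary file of `num_bytes` bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"x" * num_bytes)
    return path
```

A file that was just written is usually still in the page cache, so timings on a fresh sample file mostly measure memory speed rather than disk speed.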

Can I improve performance of my GCE small instance?

I'm using cloud VPS instances to host very small private game servers. On Amazon EC2, I get good performance on their micro instance (1 vCPU [single hyperthread on a 2.5GHz Intel Xeon], 1GB memory).
I want to use Google Compute Engine though, because I'm more comfortable with their UX and billing. I'm testing out their small instance (1 vCPU [single hyperthread on a 2.6GHz Intel Xeon], 1.7GB memory).
The issue is that even when I configure near-identical instances with the same game using the same settings, the AWS EC2 instances perform much better than the GCE ones. To give you an idea, while the game isn't Minecraft I'll use that as an example. On the AWS EC2 instances, succeeding world chunks would load perfectly fine as players approach the edge of a chunk. On the GCE instances, even on more powerful machine types, chunks fail to load after players travel a certain distance; and they must disconnect from and re-login to the server to continue playing.
I can provide more information if necessary, but I'm not sure what is relevant. Any advice would be appreciated.
Diagnostic protocols to evaluate this scenario may be more complex than you want to deal with. My first thought is that this shared core machine type might have some limitations in consistency. Here are a couple of strategies:
1) Try working backward to the smaller instance. Since you only pay for a minimum of 10 minutes, you could check whether performance is better on larger machine types. If you have consistent performance problems no matter what the size of the box, then I'm guessing it's something to do with the nature of your application and the nature of their virtualization technology.
2) Try measuring the consistency of the performance. I get that it is unacceptable, but is it unacceptable based on how long it's been running? The nature of the workload? Time of day? If the performance is sometimes good, but sometimes bad, then it's probably once again related to the type of your work load and their virtualization strategy.
Something Amazon is famous for is consistency. They work very hard to manage the consistency of the performance. It shouldn't spike up or down.
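One way to put numbers on the consistency question raised above is to time the same piece of work repeatedly and look at the spread rather than only the average. The sketch below is a minimal illustration: the function names are hypothetical, and the workload you pass in is assumed to be a stand-in for whatever the game server does when it loads a chunk.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_workload(work: Callable[[], object], runs: int = 20) -> List[float]:
    """Run `work` repeatedly and return the duration of each run in seconds."""
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        work()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Summarize timings; `cv` (stdev / mean) is a simple consistency score.

    Expects at least one sample.
    """
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    return {
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean if mean else 0.0,
        "min": min(samples),
        "max": max(samples),
    }
```

Running the same measurement at different times of day and after different uptimes, on both providers, would show whether the slowdowns are constant or intermittent.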
My best guess here without all the details is you are using a very small disk. GCE throttles disk performance based on the size. You have two options ... attach a larger disk or use PD-SSD.
See here for details on GCE Disk Performance - https://cloud.google.com/compute/docs/disks
Please post back if this helps.
Anthony F. Voellm (aka Tony the #p3rfguy)
Google Cloud Performance Team