Amazon RDS Enhanced Monitoring

I have an RDS DB with a low number of connections (usually around 30), but it shows high CPU load all the time (about 25%). The DB instance class is r3.2xlarge.
As shown in the Enhanced Monitoring screenshot below, some processes have high CPU and memory utilization. What do the numbers I have marked in rectangles mean? I thought they were the thread IDs of queries, but I can't see those numbers in SHOW PROCESSLIST!
So briefly:
What do those numbers (in rectangles) mean?
Is there any way to know which query is consuming the most CPU and memory (in real time, not via the slow log)?

What do those numbers (in rectangles) mean?
They are just OS process/thread IDs; on their own they don't mean anything.
Is there any way to know which query is consuming the most CPU and memory (in real time, not via the slow log)?
Since you're using the MySQL flavor of RDS, connect to your instance with any MySQL client and run SHOW PROCESSLIST; or SHOW FULL PROCESSLIST; to see the list of running queries.
https://dev.mysql.com/doc/refman/5.7/en/show-processlist.html
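If you also want to tie the OS thread IDs from Enhanced Monitoring back to specific queries, here is a minimal sketch, assuming MySQL 5.7+ (where performance_schema.threads exposes a THREAD_OS_ID column); the endpoint and user below are placeholders:

# map Enhanced Monitoring OS thread IDs to their current statements
$ mysql -h mydb.example.us-east-1.rds.amazonaws.com -u admin -p -e "
    SELECT THREAD_OS_ID, PROCESSLIST_ID, PROCESSLIST_USER,
           PROCESSLIST_TIME, PROCESSLIST_INFO
    FROM performance_schema.threads
    WHERE PROCESSLIST_ID IS NOT NULL
    ORDER BY PROCESSLIST_TIME DESC;"

THREAD_OS_ID should line up with the per-thread IDs in the Enhanced Monitoring process list, and PROCESSLIST_INFO shows the statement each thread is currently running.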

Related

Which EC2 instance size should I use for 100 concurrent users and up to 300 daily users? I need to keep costs low

I have made a business card ordering portal and its backend is in Node.js.
I am currently using a t2.micro and getting around 50 daily users and 15-20 concurrent users, but at some point the user count could go up to 300 daily users and 100 concurrent users. I don't want to spend much either.
It has a single database and we don't use threads.
I am confused about whether I should change my instance type or use Auto Scaling Groups.
I am not a pro at AWS. Please help!
Nobody can give you an answer to your question because every application is different.
Some applications need more CPU (e.g. for encoding/encryption). Some need lots of RAM (for calculations). Some need lots of disk access (e.g. for file manipulation).
Only you know how your application behaves and what resources it would need.
You could either pick something and then monitor it (CPU, RAM, Disk) in production to see where the 'bottleneck' lies, or you could create a test that simulates the load of users and pushes it to breaking point to discover the bottleneck.
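For example, a simple way to push the app to breaking point is ApacheBench; this is only a sketch, and the URL is a placeholder for your own endpoint:

# apache2-utils provides the ab load-testing tool on Debian/Ubuntu
$ sudo apt-get install apache2-utils
# 10,000 requests total, 100 concurrent; watch CPU, RAM and disk while it runs
$ ab -n 10000 -c 100 https://example.com/

Raise -c until response times blow up; whichever resource saturates first is your bottleneck.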

Is there a better way for me to architect this batch processing pipeline?

So I have a large dataset (1.5 billion points) on which I need to perform an I/O-bound transform task (the same task for each point) and place the result into a store that allows fuzzy searching on the transformed fields.
What I currently have is a Step Functions batch job pipeline feeding into RDS. It works like so:
A Lambda splits the input data into X even partitions.
An array Batch job is created with X array elements matching the X partitions.
The Batch jobs (1 vCPU, 2048 MB RAM) run on a number of EC2 Spot Instances, transform the data, and place it into RDS.
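(For context, submitting an array job of that shape looks roughly like the sketch below; the job name, queue, and job definition are placeholders.)

# submit one array job whose 1,600 child jobs each process one partition
$ aws batch submit-job \
    --job-name transform-points \
    --job-queue my-spot-queue \
    --job-definition my-transform-def \
    --array-properties size=1600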
This current solution (with X=1600 workers) runs in about 20-40 minutes, driven mainly by the time it takes to spin up the Spot Instance jobs; the jobs themselves average about 15 minutes of run time. As for total cost, with Spot savings the workers cost ~$40, but the real kicker is the RDS Postgres DB: to handle 1,600 concurrent writes you need at least an r5.xlarge, which is ~$500 a month!
Therein lies my problem. It seems I could run the actual workers faster and more cheaply (due to per-second pricing) by having, say, 10,000 workers, but then I would need an RDS setup that could somehow handle 10,000 concurrent DB connections.
I've looked high and low and can't find a good solution to this scaling wall I am hitting. Below I'll detail some things I've tried and why they haven't worked for me or don't seem like a good fit.
RDS Proxy - I tried creating two proxies, each set to a 50% connection pool, giving even-numbered jobs one proxy and odd-numbered jobs the other, but that didn't help.
DynamoDB - Off the bat this seems to solve my problem: it's hugely concurrent and can definitely handle the write load, but it doesn't allow fuzzy searching like SELECT * WHERE field LIKE Y, which is a key part of my workflow with the batch job results.
(Theory) - Have the jobs write their results to S3, then trigger a Lambda on new bucket entries to insert them into the DB. (This might be a terrible idea, I'm not sure; see the sketch below.)
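Sketching that theory a little further (the bucket, table name, and connection string are placeholders, not what I actually run), the point would be to bulk-load results through one connection instead of holding 1,600 INSERT connections open:

# pull the workers' result files down from S3
$ aws s3 sync s3://my-results-bucket/partitions/ ./staging/
# load each file through a single connection with COPY instead of row-by-row INSERTs
$ for f in ./staging/*.csv; do psql "$DATABASE_URL" -c "\copy results from '$f' with (format csv)"; done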
Anyway, what I'm after is reducing the cost of running this batch pipeline (mainly the DB), reducing the run time (to save on Spot costs), or both! I am open to any feedback or suggestions!
Let me know if I've missed some key piece of info you need.

What is the scale of the CPU utilisation for f1-micro on GCP?

The instance details page for my f1-micro instance shows a graph of CPU utilisation fluctuating between 8% and 15%, but what is the scale? The f1-micro has 0.2 vCPU, so is my max 20%? Or does 100% in the graph mark my 20% share of the CPU? Occasionally the graph has gone above 20%; is it bursting then? Or does bursting start at 100% in the graph?
The recommendation to increase performance is always displayed. Is it just a sales tactic? The VM is a watchdog, so it is not doing much.
I built a small test to answer your question; if you're interested, you can do the same to double-check.
TEST
I created two instances, one f1-micro and one n1-standard-1, and then forced a CPU burst using stress, but you can use any tool of your choice.
$ sudo apt-get install stress
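# run one CPU-bound worker in the background and watch usage from inside the guest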
$ stress --cpu 1 & top
This way we can compare the output of top on the two instances with what the dashboard shows. Since the operating system is not aware that it is sharing the CPU, we expect to see 100% from inside the machine.
RESULTS
While the output of top on both instances showed, as expected, that 99.9% of the CPU was in use, the dashboard output is more interesting:
The n1-standard-1 showed a stable value around 100% the whole time.
The f1-micro showed an initial spike to 250% (because it was using a bigger share of the physical CPU than assigned, i.e. it was running in bursting mode) and then dropped back to 100%.
I repeated the test several times and got the same behaviour each time, so the percentage refers to the share of CPU assigned to you that you are currently using.
This feature is documented here:
"f1-micro machine types offer bursting capabilities that allow instances to use additional physical CPU for short periods of time. Bursting happens automatically when your instance requires more physical CPU than originally allocated"
On the other hand, if you want to know more about those recommendations and how they work, you can check the official documentation.

RDS eating all the swap space

We have been using MariaDB on RDS and we noticed that swap usage keeps climbing without being reclaimed. The freeable memory, however, seems fine. Please check the attached files.
Instance type: db.t2.micro
Freeable memory: 125 MB
Swap space: grows by 5 MB every 24 h
IOPS: disabled
Storage: 10 GB (SSD)
Soon RDS will eat all the swap space, which will cause lots of issues for the app.
Does anyone have similar issues?
What is the maximum swap space? (didn't find anything in the docs)
Please help!
Does anyone have similar issues?
I had similar issues on different instance types. The swapping trend persists even if you switch to a larger instance type with more memory.
You can find an explanation from AWS here:
Amazon RDS DB instances need to have pages in the RAM only when the pages are being accessed currently, for example, when executing queries. Other pages that are brought into the RAM by previously executed queries can be flushed to swap space if they haven't been used recently. It's a best practice to let the operating system (OS) swap older pages instead of forcing the OS to keep pages in memory. This helps make sure that there is enough free RAM available for upcoming queries.
And the resolution:
Check both the FreeableMemory and the SwapUsage Amazon CloudWatch metrics to understand the overall memory usage pattern of your DB instance. Check these metrics for a decrease in the FreeableMemory metric that occurs at the same time as an increase in the SwapUsage metric. This can indicate that there is pressure on the memory of the DB instance.
What is the maximum swap space?
By enabling Enhanced Monitoring you should be able to see OS metrics, e.g. the amount of free swap memory in kilobytes.
See details here
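For illustration, pulling the metric from the CLI looks something like this sketch; the instance identifier and time window are placeholders, and you can swap SwapUsage for FreeableMemory to compare the two:

# average swap usage per hour over one day, for DB instance "mydb"
$ aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name SwapUsage \
    --dimensions Name=DBInstanceIdentifier,Value=mydb \
    --start-time 2020-01-01T00:00:00Z \
    --end-time 2020-01-02T00:00:00Z \
    --period 3600 \
    --statistics Average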
Enabling Enhanced Monitoring in RDS has made things clearer.
Evidently what we needed to watch was Committed Swap rather than Swap Usage; we were able to see how much Free Swap we had.
I now also believe that MySQL is putting pages into swap simply because there is plenty of space there, not because it is in urgent need of memory.

Data-intensive process on EC2 - any tips?

We are trying to run an ETL process on a High I/O instance on Amazon EC2. The same process locally, on a very well equipped laptop (with an SSD), takes about 1/6th the time. The process basically transforms data (30 million rows or so) from flat tables into a third-normal-form schema in the same Oracle instance.
Any ideas on what might be slowing us down?
Another option is to simply move off AWS and rent beefy boxes (raw hardware) with SSDs from somewhere like Rackspace.
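(For what it's worth, one quick sanity check of raw write throughput on the instance, bypassing the page cache; the output path is a placeholder:

# write 1 GiB with direct I/O and note the MB/s figure dd reports
$ dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 oflag=direct

Comparing that figure between the EC2 instance and the laptop would show whether disk I/O explains the 6x gap.)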
We have moved most of our ETL processes off of AWS/EMR. We now host most of them on Rackspace and are getting a lot more CPU/storage/performance for the money. Don't get me wrong, AWS is awesome, but there comes a point where it's not cost-effective. On top of that, you never know how they are really managing/virtualizing the hardware underneath your specific application.
My two cents.