Aurora PostgreSQL Serverless v2 showing high CPU during idle - amazon-web-services

Component: Aurora PostgreSQL Serverless v2 (0.5 - 4 ACUs) - Multi-AZ deployment
After instance startup, CPU utilization stabilizes at around 55-60% on the writer node only and does not come down. The reader node stabilizes at ~19%.
The only processes running on the database, as checked with pg_stat_activity, are as follows:
RDS Replication
Autovacuum
WAL Process
Checkpoint process
Other internal processes
Number of connections to DB : 1
Database processes running in writer node : 13
Kindly advise what else can be checked and the probable cause of the issue.
Tried killing the autovacuum process
Checked the number of processes from pg_stat_activity

Hi, you could enable and review OS-level monitoring stats using Aurora Enhanced Monitoring. It provides details at the process level and helps identify which process is using the CPU.
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/USER_Monitoring.OS.Viewing.html
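If you prefer to script it rather than use the console, a minimal boto3 sketch for turning on Enhanced Monitoring could look like the following. The instance identifier and monitoring role ARN are placeholders you would replace with your own values.

import boto3

# Placeholders - replace with your writer instance and an IAM role that RDS can use for monitoring.
instance_id = "my-aurora-writer-instance"
monitoring_role_arn = "arn:aws:iam::123456789012:role/rds-monitoring-role"

rds = boto3.client("rds")

# Enable Enhanced Monitoring at 60-second granularity on the writer instance.
rds.modify_db_instance(
    DBInstanceIdentifier=instance_id,
    MonitoringInterval=60,  # allowed values: 0, 1, 5, 10, 15, 30, 60 seconds
    MonitoringRoleArn=monitoring_role_arn,
    ApplyImmediately=True,
)

Once enabled, the OS process list (including background PostgreSQL processes) shows up under Enhanced Monitoring in the RDS console, which should reveal which process accounts for the 55-60% CPU.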

Related

ElastiCache Redis shard balance not working

We are in the stabilization phase of using the Redisson client with AWS ElastiCache for Redis in clustered mode with two shards. The Redisson connection endpoint uses the configuration endpoint recommended by AWS. https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Endpoints.html
Looking at the ElastiCache metrics, shard 002 (the other is 001) seems to take most of the written data (69% vs 0.5% on 001, which is a big difference in usage). This can be observed in the DatabaseMemoryUsagePercentage metric.
We are using Redisson's RMapCache:
private RMapCache<byte[], SOME JAVA POJO> redisCache;
Not sure if one shard alone is taking all the Redis hash slots. Any pointers or documentation with recommendations for this problem would help.

How to properly check resource usage of an AWS EMR cluster (master and core nodes) from a notebook

Here are my cluster details:
Master : Running 1 m4.xlarge
Core : Running 3 m4.xlarge
Task : --
Cluster scaling: Not enabled
I am using notebooks to practice PySpark, and I would like to know how the resources are being utilised, to assess whether they are under-utilised or not enough for my tasks. As part of this, when checking RAM/memory usage, here's what I got from the terminal:
notebook#ip-xxx-xxx-xxx-xxx ~$ free -h
total used free shared buff/cache available
Mem: 1.9G 456M 759M 72K 741M 1.4G
Swap: 0B 0B 0B
Each m4.xlarge instance comes with 16 GB of memory. What's happening, and why are only two of the 16 GB being shown? And how do I properly learn how much of my CPU, memory and storage are actually being used? (yes, to reduce costs!!)
If you want to check memory and CPU utilization, you can do that in CloudWatch with the instance ID.
To get the instance ID of each node, go to Hardware -> Instance Group -> Instances in the EMR console.
You can get detailed CPU, memory and I/O metrics for each node.
Another option is the YARN ResourceManager UI; the default URL is http://master-node-ip:8088.
There you can get metrics at the job level as well as the node level.
To start reducing costs you can use m5 or m6g instances and also consider using Spot Instances. You can also use the metrics in the EMR console under the Monitoring tab; the Container pending graph is a good place to start. If you don't have any containers pending and you have executors without tasks running inside, you are wasting resources (CPU).
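If you want to pull those numbers programmatically instead of browsing the ResourceManager UI, YARN also exposes a REST API on the same port. A rough sketch, assuming the placeholder hostname below is replaced with your master node's address and that port 8088 is reachable:

import requests

# Placeholder - use your master node's private DNS name or IP.
rm_url = "http://master-node-ip:8088"

# Cluster-wide memory/vcore totals, plus pending containers.
metrics = requests.get(rm_url + "/ws/v1/cluster/metrics").json()["clusterMetrics"]
print("memory MB: allocated", metrics["allocatedMB"], "/ total", metrics["totalMB"])
print("vcores   : allocated", metrics["allocatedVirtualCores"],
      "/ total", metrics["totalVirtualCores"])
print("containers pending:", metrics["containersPending"])

# Per-node usage.
for node in requests.get(rm_url + "/ws/v1/cluster/nodes").json()["nodes"]["node"]:
    print(node["nodeHostName"], node["usedMemoryMB"], "MB used,",
          node["usedVirtualCores"], "vcores used")

This also explains the free -h output: it reports the memory of whatever small host the notebook shell runs on, not the YARN memory available across your m4.xlarge core nodes.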

Amazon EC2 Servers getting frozen sporadically

I've been working with Amazon EC2 servers for 3+ years and I've noticed a recurring behaviour: some servers freeze sporadically (between 1 and 5 times per year).
When this occurs, I can't connect to the server (tried HTTP, MySQL and SSH connections) until the server is restarted.
The server works again after a restart.
Sometimes the server stays online for 6+ months, sometimes it freezes about a month after a restart.
All servers on which I noticed this behaviour were micro instances (North Virginia and São Paulo).
The servers have an ordinary Apache 2, MySQL 5, PHP 7 environment on Ubuntu 16 or 18. The PHP/MySQL web application is not CPU intensive and is not accessed by more than 30 users/hour.
The same environment and application on DigitalOcean servers does NOT reproduce the behaviour (I have two DigitalOcean servers running uninterrupted for 2+ years).
I like Amazon EC2, mainly because Amazon has a lot of useful additional services (like SES), but this behaviour is really frustrating. Sometimes I get customer calls complaining about systems being down, and all I need to do is restart the instance to solve the problem.
Does anybody have a tip for solving this problem?
UPDATE 1
They are t2.micro instances (1 GB RAM, 1 vCPU).
MySQL SHOW GLOBAL VARIABLES: pastebin.com/m65ieAAb
UPDATE 2
There is a CPU utilization peak in the logs near the time the server went down. It was at 3 AM. At that time there is a daily crontab task that makes a database backup. But, considering this task runs every day, why would it only sometimes freeze the server?
I have not seen this exact issue, but on any cloud platform I assume any instance can fail at any time, so we design for failure. For example, we have Auto Scaling on all customer-facing instances. Any time an instance fails, it is automatically replaced.
If a customer is calling to tell you a server is down, you may need to consider more automated methods of monitoring instance health and taking automated action to recover the instance.
CloudWatch also has instance recovery actions available that can be triggered if certain metric thresholds are reached.
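As an illustration of that last point, here is a rough boto3 sketch of an alarm that triggers EC2's built-in recover action when the system status check fails. The instance ID and region are placeholders; for a hung OS specifically, a similar alarm on StatusCheckFailed_Instance with the reboot action may fit better.

import boto3

instance_id = "i-0123456789abcdef0"   # placeholder
region = "us-east-1"                  # placeholder

cloudwatch = boto3.client("cloudwatch", region_name=region)

cloudwatch.put_metric_alarm(
    AlarmName="recover-" + instance_id,
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    # Built-in action that recovers the instance onto healthy hardware.
    # Use arn:aws:automate:<region>:ec2:reboot with StatusCheckFailed_Instance
    # if the OS itself hangs, as described in the question.
    AlarmActions=["arn:aws:automate:" + region + ":ec2:recover"],
)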

Apache Airflow - how many tasks in a DAG is too many?

I tried having a DAG with 400 tasks (each one calling a remote Spark server to process a separate data file into S3... nothing to do with MySQL), and Airflow (v1.10.3) did the following for the next 15 minutes:
CPU stayed at 99%
did not handle new PuTTY logins or SSH requests to my machine (Amazon Linux)
the Airflow webserver stopped responding and only gave 504 errors
started 130 concurrent connections to the MySQL RDS instance (the Airflow metadata DB)
kept my tasks stuck in the scheduled state
I eventually switched to another EC2 instance but got the same outcome...
I am running the LocalExecutor on a single machine (16 CPUs).
Note that a DAG with 30 tasks runs fine.
There's no actual limit to the number of tasks in a DAG. In your case, you're using the LocalExecutor - Airflow will then use any resources available on the host to execute the tasks. It sounds like you simply overwhelmed your EC2 instance's resources and overloaded the Airflow worker(s)/scheduler. I'd recommend adding more workers to break up the tasks, or lowering the parallelism value in your airflow.cfg.
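To make that concrete: in Airflow 1.10, the [core] parallelism setting in airflow.cfg (default 32) caps how many task instances run at once across the whole installation, and a single DAG can also be capped from its definition. A rough sketch, with illustrative values and a hypothetical DAG name rather than a recommendation:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Illustrative limits only - tune them to your 16-CPU box.
dag = DAG(
    dag_id="heavy_fanout_example",   # hypothetical name
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
    concurrency=8,        # max task instances of this DAG running at once
    max_active_runs=1,    # only one run of this DAG at a time
)

for i in range(400):
    BashOperator(
        task_id="process_file_{}".format(i),
        bash_command="echo 'submit spark job for file {}'".format(i),
        dag=dag,
    )

With caps like these, the 400 tasks still get scheduled, but only a handful execute concurrently, which keeps the scheduler, webserver and metadata DB responsive.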

How to optimise AWS instance types in an Apache Spark and Drill AWS cluster?

I am reading S3 buckets with Drill and writing the data back to S3 as Parquet, in order to read it with Spark DataFrames for further analysis. AWS EMR requires me to have at least 2 core machines.
Will using micro instances for the master and core nodes affect performance?
I don't make use of HDFS as such, so I am thinking of making them micro instances to save money.
All computation will be done in memory by r3.xlarge Spot Instances as task nodes anyway.
And finally, does Spark utilise multiple cores in each machine? Or is it better to launch a fleet of r3.xlarge task nodes on EMR 4.1 so they can be auto-resized?
I don't know how familiar you are with Spark, but there are a couple of things you need to know about core usage (a sketch follows this list):
You can set the number of cores to use for the driver process, but only in cluster mode. It is 1 by default.
You can also set the number of cores to use on each executor (YARN and standalone mode only). It defaults to 1 in YARN mode, and to all available cores on the worker in standalone mode. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided there are enough cores on that worker; otherwise, only one executor per application will run on each worker.
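As a rough illustration of where those two settings live in PySpark, with example values sized loosely for an r3.xlarge task node (the app name and numbers are placeholders, not recommendations):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Example values only - benchmark and size these for your own workload.
conf = (SparkConf()
        .setAppName("drill-parquet-analysis")   # hypothetical app name
        .set("spark.driver.cores", "2")         # driver cores, honoured in cluster mode only
        .set("spark.executor.cores", "4")       # cores per executor (YARN / standalone)
        .set("spark.executor.memory", "20g"))   # example sizing for an r3.xlarge task node

spark = SparkSession.builder.config(conf=conf).getOrCreate()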
Now, to answer both of your questions:
Will using micro instances for the master and core nodes affect performance?
Yes. The driver needs a minimum of resources to schedule jobs, sometimes collect data, etc. Performance-wise, you'll need to benchmark according to your use case to see what suits your usage better, which you can do using Ganglia on AWS, for example.
Does Spark utilise multiple cores in each machine?
Yes, Spark uses multiple cores on each machine.
You can also read up on which instance type is preferred for an AWS EMR cluster running Spark.
Spark support on AWS is fairly new, but it's usually close to all other Spark cluster setups.
I advise you to read the AWS EMR Developer Guide - Plan EMR Instances chapter, along with the official Spark documentation.