Main causes for leader node to be at high CPU - amazon-web-services

I often see the leader node of our Redshift cluster peak at 100% CPU. I've identified one possible cause: many concurrent queries, and therefore many execution plans for the leader to compute. This hypothesis seems likely, as the periods with the most incoming queries appear to be the same periods when we see the leader at 100%.
To fix this properly, we are wondering: are there other common causes of high CPU on the leader?
(To be precise, this is the situation where only the leader node is at high CPU and the workers seem fine.)

The Redshift leader node is the same size and class of compute as the compute nodes. Typically this means the leader is over-provisioned for the role it plays, but since its role is so important, and so impactful if things slow down, it is good that it is over-provisioned. The leader needs to compile and optimize the queries and perform final steps in queries (a final sort, for example). It communicates with the session clients and handles all their requests. If the leader becomes overloaded, all these activities slow down, creating significant performance issues. It is not good that your leader is hitting 100% CPU often enough for you to notice. I bet the cluster seems sluggish when this happens.
There are a number of ways I've seen "leader abuse" happen, and it usually becomes a problem when bad patterns are copied between users. In no particular order:
Large data literals in queries (INSERT ... VALUES ...). This puts your data through the query compiler on the leader node. This is not what it is designed to do and is very expensive for the leader. Use the COPY command to bring data into the cluster. (Just bad, don't do this. See the sketch after this list.)
Overuse of COMMIT. A commit causes an update to the coherent state of the database, has to run through the "commit queue", and creates work for the leader and the compute nodes. Having a COMMIT every other statement can cause this queue, and work in general, to back up.
Too many slots defined in the WLM. Redshift can typically run only between one and two dozen queries at once efficiently. Setting the total slot count very high (like 50) can lead to very inefficient operation and high CPU load. Depending on the workload, this can show up on the compute nodes or, occasionally, the leader node.
Large data output through SELECT statements. SELECTs return data, but when this data is many GB in size, the management of this data movement (and sorting) is done by the leader node. If large amounts of data need to be extracted from Redshift, it should be done with an UNLOAD statement.
Overuse of large cursors. Cursors can be an important tool and are needed by many BI tools, but cursors live on the leader, and overuse can reduce the leader's attention to other tasks.
Many / large UNLOADs with PARALLEL OFF. UNLOADs generally go from the compute nodes straight to S3, but with PARALLEL OFF all the data is routed to the leader node, where it is combined (sorted) and sent to S3.
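To make the first and last items concrete, here is a minimal sketch. The table names, S3 paths, and IAM role ARN are made up for illustration:

    -- Anti-pattern: a large data literal that the leader node must compile.
    INSERT INTO staging_events VALUES
      (1, '2021-01-01', 'click'),
      (2, '2021-01-01', 'view');
    -- ... imagine thousands more rows in that VALUES list ...

    -- Preferred: COPY loads from S3 so the compute nodes do the work.
    COPY staging_events
    FROM 's3://my-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    FORMAT AS CSV;

    -- Preferred for large extracts: UNLOAD writes from the compute nodes to S3.
    -- Leaving PARALLEL ON (the default) keeps the leader out of the data path.
    UNLOAD ('SELECT * FROM big_fact_table')
    TO 's3://my-bucket/exports/big_fact_table_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    PARALLEL ON;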
While none of the above are problems in and of themselves, the leader starts to be impacted when they are overused, used in ways they are not intended, or all used at once. It also comes down to what you intend to do with your cluster - if it supports BI tools then you may have a lot of cursors, but this load on the leader is part of the cluster's intent. Issues often arise when the cluster is intended to be all things to everybody.
If your workload for Redshift is heavy on leader functions and you are using the leader node efficiently (no large literals, using COPY and UNLOAD, etc.), then a high leader workload is what you want - you're getting the most out of the critical resource. However, most people use Redshift to perform analytics on large data, which is the job of the compute nodes. A heavily loaded leader can detract significantly from this mission and needs to be addressed.
Another way the leader can get stressed is when clusters are configured with many smaller nodes instead of fewer bigger nodes. Since the leader is the same size as the compute nodes, many small nodes means you have a small leader doing the work. Something to consider, but I'd make sure you don't have unneeded leader node stressors before investing in a resize.

Whenever you execute commands that require work on the leader node - whether dispatching data, computing statistics, or aggregating results from the workers - such as COPY, UNLOAD, VACUUM, or ANALYZE, you'll see an increase in its CPU usage. More information about this here: https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html

Related

Redshift CPU utilisation is 100 percent most of the time

I have a 96-vCPU, 8-node ra3.4xlarge Redshift cluster, and most of the time the CPU utilisation is 100 percent. It was a 3-node dc2.large cluster before; that was also always at 100 percent, which is why we moved to ra3. We do most of our computation on Redshift, but the data is not that large. I read somewhere that no matter how much compute you add, unless you add a significant amount there will only be a slight improvement in computation. Can anyone explain this?
I can give it a shot. Having 100% CPU for long stretches of time is generally not a good (optimal) thing in Redshift. You see, Redshift is made for performing analytics on massive amounts of structured data. To do this it utilizes several resources - disks/disk IO bandwidth, memory, CPU, and network bandwidth. If your workload is well matched to Redshift, your utilization of all these things will average around 60%: sometimes CPU bound, sometimes memory bound, sometimes network bandwidth bound, etc. Lots of data being read means disk IO bandwidth is at a premium; lots of redistribution of data means network IO bandwidth is constraining. If you are using all these factors above 50% capacity, you are getting what you paid for. Once any of these factors gets to 100%, there is a significant drop-off in performance, as working around the oversubscribed item steals performance.
Now you are in a situation where you are seeing 100% for a significant portion of the operating time, right? This means you have all these other attributes you have paid for but are not using, AND inefficiencies are being incurred to manage through this (though of all the factors, high CPU has the least overhead). The big question is why.
There are a few possibilities, but the most likely, in my experience, is inefficient queries. An example might be the best way to explain this. I've seen queries that are intended to find all the combinations of certain factors from several tables. So they cross join these tables, but this produces lots of repeats, so they add DISTINCT - problem solved. But this still creates all the duplicates and then reduces the set down; all the work is done and most of the results are thrown away. However, if they pared down the factors in each table first, then cross joined them, the total work would be significantly lower. This pattern produces exactly what you are seeing: high CPU as the query spins making repeated combinations and then throwing most of them away.
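A rough sketch of the two query shapes, with made-up table and column names:

    -- "Fat in the middle": build every combination, then throw the duplicates away.
    SELECT DISTINCT o.region, c.product_line
    FROM orders o
    CROSS JOIN catalog c;

    -- Pare each side down first, then combine; far less intermediate work.
    WITH regions AS (SELECT DISTINCT region FROM orders),
         lines   AS (SELECT DISTINCT product_line FROM catalog)
    SELECT r.region, l.product_line
    FROM regions r
    CROSS JOIN lines l;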
If you have many of this type of "fat in the middle" query, where lots of extra data is produced and immediately reduced, you won't get much benefit from adding CPU resources. Things will get 2X faster with 2X the cluster size, but you are buying 2X of all the other resources that aren't helping you. You would expect that buying 2X CPU and 2X memory and 2X disk IO etc. would give you much more than a 2X improvement; being constrained on one thing makes scaling costly. Also, you are unlikely to see the CPU utilization come down, as your queries just "spin the tires" of the CPU. More CPUs will just mean you can run more queries, spinning even more tires.
Now the above is just my #1 guess based on my consulting experience. It could be that your workload just isn't right for Redshift. I've seen people try to put many small database problems into Redshift, thinking that since it's powerful it must be good at this too. They turn up the slot count to try to pump more work into Redshift but just create more issues. Or I've seen people try to run transactional workloads. Or ... If you have the wrong tool for the job, it may not work well. One 6-ton dump truck isn't the same thing as a team of 50 delivery motorcycles - each has its purpose, but they aren't interchangeable.
Another possibility is that you have a very unusual workload but Redshift is still the best tool for the job. You don't need all the strengths of Redshift, but that's OK; you are getting the job done at an appropriate cost. In this case, 100% CPU is just how your workload uses Redshift - it's not a problem, just reality. Now, I doubt this is the case, but it is possible. I'd want to be sure I'm getting all the value from the money I'm spending before assuming everything is OK.

Understanding why AWS Elasticsearch GC (young and old) keeps on rising while memory pressure is not

I am trying to understand whether I have an issue with my AWS Elasticsearch garbage collection time, but all the memory-related issues I find relate to memory pressure, which seems OK-ish.
So while I run a load test on the environment, I observe a constant rise in all GC collection time metrics, for example:
But when looking at memory pressure, I see that I am not passing the 75% mark (but getting near..) which, according to documentation, will trigger a concurrent mark & sweep.
So I fear that once I add more load or run a longer test, I might start seeing real issues that will have an impact on my environment. So, do I have an issue here? How should I approach rising GC time when I can't take memory dumps and see what's going on?
The top graph reports aggregate GC collection time, which is what's available from GarbageCollectorMXBean. It continues to increase because every young generation collection adds to it. And in the bottom graph, you can see lots of young generation collections happening.
Young-generation collections are expected in any web app (which is what an OpenSearch cluster is): you're constantly making requests (queries or updates), and those requests create garbage.
I recommend looking at the major collection statistics. In my experience with OpenSearch, these happen when you're performing large numbers of updates, perhaps as a result of coalescing indexes. However, they should be infrequent unless you're constantly updating your cluster.
If you do experience memory pressure, the only real solution is to move to a larger node size. Adding nodes probably won't help, due to the way that indexes are sharded across nodes.
I sent a query to AWS technical support and, counter to intuition, the Young and Old Collection time and count values in Elasticsearch are cumulative. This means these values keep increasing and do not drop back to 0 until a node drops or restarts.

When to Add Shards to a Distributed Key/Value Store

I've been reading up on distributed systems lately, and I've seen a lot of examples of how to shard key/value stores, like a memcache system or a NoSQL DB.
In general, adding shards makes intuitive sense to me when you want to support more concurrent access to the table, and most of the examples cover that sort of usage. One thing I'm not clear on though is whether you are also supposed to add shards as your total table size grows. For something like a memcache, I'd imagine this is necessary, because you need more nodes with more memory to hold more key/values. But what about databases which also keep the values on some sort of hard drive?
It seems like, if your table size is growing but the amount of concurrent access is not, it would be somewhat wasteful to keep adding nodes just to hold more data. In that case I'd think you could just add more long-term storage. But I suppose the problem is, you are increasing the chance that your data becomes "cold" when somebody needs it, causing more latency for those requests.
Is there a standard approach to scaling nodes vs. storage? Are they always linked? Thanks much for any advice.
I think it is the other way around.
Almost always, shards are added because the data has grown to the point where it cannot be held on one machine.
Sharding makes everything so much more painful that it should only be done when vertical scaling no longer works.

AWS Redshift adding nodes and changing from Single-node to Multi-nodes to increase the disk size

As I am new to Redshift, I have the following questions.
When adding a new node to increase the disk space, which "distribution style" do I need to choose?
As my purpose is to increase the disk space, do I need to consider the "distribution style" or make any changes to already-written queries? (The queries work on a single node without any issues.)
Distribution becomes important as more and more nodes exist.
Each node will have at least 2 slices; depending on how your data is distributed across these slices, your query performance may be affected.
You can distribute in the following ways:
EVEN - Rows are distributed evenly between slices. While this spreads storage evenly across all nodes, you might see a performance hit if you must join to this data from other slices. It works best for denormalized data with no joins, since it gains the CPU of every node for the computation.
KEY - Rows are distributed to slices based on the values in a distribution key, keeping related data together. This really helps when joining tables on that key, but be aware that data can end up spread across slices unevenly (skew).
ALL - Every node gets a full copy of the dataset. Use this option for small datasets (tables under about 10 GB) or data that changes infrequently.
AUTO - Redshift takes care of the distribution style and attempts to select the correct one for the dataset; you have no control over the decisions it makes.
You should think about how you will use the data before deciding, as it will affect both the storage and the query performance you get. A minimal DDL sketch of each option is shown below.
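For illustration only (the table and column names are invented), the distribution style is set when the table is created:

    CREATE TABLE sales_even (sale_id BIGINT, amount DECIMAL(10,2)) DISTSTYLE EVEN;

    CREATE TABLE sales_by_customer (sale_id BIGINT, customer_id BIGINT)
      DISTSTYLE KEY DISTKEY (customer_id);

    CREATE TABLE dim_country (country_code CHAR(2), country_name VARCHAR(64)) DISTSTYLE ALL;

    CREATE TABLE sales_auto (sale_id BIGINT, amount DECIMAL(10,2)) DISTSTYLE AUTO;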

Which Key value, Nosql database can ensure no data loss in case of a power failure?

At present, we are using Redis as an in-memory, fast cache. It is working well. The problem is, once Redis is restarted, we need to re-populate it by fetching data from our persistent store. This overloads our persistent
store beyond its capacity and hence the recovery takes a long time.
We looked at Redis persistence options. The best option (without compromising performance) is to use AOF with 'appendfsync everysec'. But with this option, we can lose the last second of data. That is not acceptable. Using AOF with 'appendfsync always' has a considerable performance penalty.
So we are evaluating single-node Aerospike. Does it guarantee no data loss in case of power failures? That is, once Aerospike reports success to the client in response to a write operation, the data should never be lost, even if I pull the power cable of the server machine. As I mentioned above, I believe Redis can give this guarantee with the 'appendfsync always' option, but we are not considering it because of its considerable performance penalty.
If Aerospike can do it, I would want to understand in detail how persistence works in Aerospike. Please share some resources explaining the same.
We are not looking for a distributed system, as strong consistency is a must for us. The data should not be lost in node failures or split-brain scenarios.
If not aerospike, can you point me to another tool that can help achieve this?
This is not a database problem, it's a hardware and risk problem.
All databases (that have persistence) work the same way: some write the data directly to the physical disk while others tell the operating system to write it. The only way to ensure that every write is safe is to wait until the disk confirms the data is written.
There is no way around this and, as you've seen, it greatly decreases throughput. This is why databases use a memory buffer and write batches of data from the buffer to disk in short intervals. However, this means that there's a small risk that a machine issue (power, disk failure, etc) happening after the data is written to the buffer but before it's written to the disk will cause data loss.
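Redis, from the question, is a concrete example of this trade-off; the interval behavior is controlled in redis.conf with the very settings the question names (this is only an illustration of the trade-off, not a way around it):

    # redis.conf AOF settings referenced in the question
    appendonly yes

    # fsync roughly once per second: fast, but up to ~1 second of
    # acknowledged writes can be lost on power failure
    appendfsync everysec

    # fsync on every write: durable once acknowledged, but much slower
    # appendfsync always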
On a single server, you can buy protection through multiple power supplies, battery backup, and other safeguards, but this gets tricky and expensive very quickly. This is why distributed architectures are so common today for both availability and redundancy. Distributed systems do not mean you lose consistency, rather they can help to ensure it by protecting your data.
The easiest way to solve your problem is to use a database that allows for replication so that every write goes to at least 2 different machines. This way, one machine losing power won't affect the write going to the other machine and your data is still safe.
You will still need to protect against a power outage at a higher level that can affect all the servers (like your entire data center losing power) but you can solve this by distributing across more boundaries. It all depends on what amount of risk is acceptable to you.
Between tweaking the disk-write intervals in your database and using a proper distributed architecture, you can get the consistency and performance requirements you need.
I work for Aerospike. You can choose to have your namespace stored in memory, on disk, or in memory with disk persistence. In all of these scenarios we perform favourably in comparison to Redis in real-world benchmarks.
Considering storage on disk: when a write happens, it hits a buffer before being flushed to disk, and the ack does not go back to the client until that buffer has been successfully written to. It is plausible that if you yank the power cable before the buffer flushes, in a single-node cluster the write might have been acked to the client and subsequently lost.
The answer is to have more than one node in the cluster and a replication-factor >= 2. The write then goes to the buffer on the master and on the replica, and has to succeed on both before being acked to the client as successful. If the power is pulled from one node, a copy still exists on the other node and no data is lost.
So, yes, it is possible to make Aerospike as resilient as it is reasonably possible to be at low cost with minimal latencies. The best thing to do is to download the community edition and see what you think. I suspect you will like it.
I believe Aerospike would serve your purpose. You can configure it for hybrid storage at the namespace (i.e. DB) level in aerospike.conf, which is located at /etc/aerospike/aerospike.conf.
For details please refer to the official documentation here: http://www.aerospike.com/docs/operations/configure/namespace/storage/
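As a rough sketch only (the namespace name, file path, and sizes are made up, and parameter names can vary by Aerospike version, so check the linked docs), a hybrid memory-plus-disk namespace looks something like this:

    # /etc/aerospike/aerospike.conf (sketch, not a tested configuration)
    namespace mycache {
        replication-factor 2        # needs a second node to actually hold the replica
        memory-size 4G
        storage-engine device {
            file /opt/aerospike/data/mycache.dat
            filesize 16G
            data-in-memory true     # serve reads from RAM, persist writes to disk
        }
    }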
I believe you're going to be at the mercy of the latency of whatever the storage medium is, or the latency of the network fabric in the case of a cluster, regardless of which DBMS technology you use, if you must have a guarantee that the data won't be lost. (N.B. Ben Bates' solution won't work if there is a possibility that the whole physical plant loses power, i.e. both nodes lose power. But I would think an inexpensive UPS would substantially, if not completely, mitigate that concern.) And those latencies will cause a dramatic insert/update/delete performance drop compared to a standalone in-memory database instance.
Another option to consider is to use NVDIMM storage, either for the in-memory database itself or for the write-ahead transaction log used for recovery. It will have the absolute lowest latency (comparable to conventional DRAM). And if your in-memory database fits in the available NVDIMM memory, you'll have the fastest recovery possible (no need to replay a transaction log) and performance comparable to the original in-memory performance, because you're back to a single write versus the 2+ writes needed to add a write-ahead log and/or replicate to another node in a cluster. But your in-memory database system has to be able to support direct recovery of an in-memory database (not just recovery from a transaction log). Again, there are two requirements for this to be an option:
1. The entire database must fit in the NVDIMM memory
2. The database system has to be able to support recovery of the database directly after system restart, without a transaction log.
More in this white paper http://www.odbms.org/wp-content/uploads/2014/06/IMDS-NVDIMM-paper.pdf