Redshift CPU utilisation is 100 percent most of the time

I have a 96 vCPU Redshift ra3.4xlarge 8-node cluster, and most of the time the CPU utilisation is at 100 percent. It was a dc2.large 3-node cluster before; that was also always at 100 percent, which is why we moved to ra3. We are doing most of our computation on Redshift, but the data is not that large. I read somewhere that it doesn't matter how much compute you add: unless the increase is significant, there will only be a slight improvement in computation time. Can anyone explain this?

I can give it a shot. Having 100% CPU for long stretches of time is generally not a good (optimal) thing in Redshift. You see, Redshift is made for performing analytics on massive amounts of structured data. To do this it utilizes several resources - disks/disk IO bandwidth, memory, CPU, and network bandwidth. If your workload is well matched to Redshift, your utilization of all these things will average around 60%. Sometimes CPU bound, sometimes memory bound, sometimes network bandwidth bound, etc. Lots of data being read means disk IO bandwidth is at a premium; lots of redistribution of data means network IO bandwidth is constraining. If you are using all these factors above 50% capacity you are getting what you paid for. Once any of these factors gets to 100% there is a significant drop-off in performance, as working around the oversubscribed item steals performance.
Now you are in a situation where you are seeing 100% for a significant portion of the operating time, right? This means you have all these other attributes you have paid for but are not using, AND inefficiencies are being incurred to manage through this (though of all the factors, high CPU has the least overhead). The big question is why.
There are a few possibilities but the most likely, in my experience, is inefficient queries. An example might be the best way to explain this. I've seen queries that are intended to find all the combinations of certain factors from several tables. So they cross join these tables, but this produces lots of repeats, so they add DISTINCT - problem solved. But this still creates all the duplicates and then reduces the set down. All the work is being done and most of the results thrown away. However, if they pared down the factors in the tables first, then cross joined them, the total work would be significantly lower. This example will do exactly what you are seeing: high CPU as it spins making repeat combinations and then throwing most of them away.
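To make that concrete, here is a rough sketch of the two query shapes. The table and column names are made up for illustration; only the structure matters.

```python
# Illustrative only -- table and column names are hypothetical.

# "Fat in the middle": cross join first, deduplicate afterwards.
# Every repeated combination is materialised and then thrown away.
fat_query = """
SELECT DISTINCT a.factor_a, b.factor_b
FROM   table_a a
CROSS JOIN table_b b;
"""

# Pare the factors down first, then cross join the much smaller sets.
lean_query = """
WITH a AS (SELECT DISTINCT factor_a FROM table_a),
     b AS (SELECT DISTINCT factor_b FROM table_b)
SELECT a.factor_a, b.factor_b
FROM   a CROSS JOIN b;
"""
```

Both return the same rows, but the second one never builds the huge intermediate result in the first place.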
If you have many of this type of "fat in the middle" query, where lots of extra data is made and immediately reduced, you won't get a lot of benefit from adding CPU resources. Things will get 2X faster with 2X the cluster size, but you are buying 2X of all these other resources that aren't helping you. You would expect that buying 2X CPU and 2X memory and 2X disk IO etc. would give you much more than a 2X improvement. Being constrained on one thing makes scaling costly. Also, you are unlikely to see the CPU utilization come down, as your queries just "spin the tires" of the CPU. More CPUs will just mean you can run more queries, resulting in more tires being spun.
Now the above is just my #1 guess based on my consulting experience. It could be that your workload just isn't right for Redshift. I've seen people try to put many small database problems into Redshift, thinking that since it's powerful it must be good at this too. They turn up the slot count to try to pump more work into Redshift but just create more issues. Or I've seen people try to run transactional workloads. Or ... If you have the wrong tool for the job it may not work well. One 6-ton dump truck isn't the same thing as a 50-motorcycle delivery team - each has its purpose but they aren't interchangeable.
Another possibility is that you have a very unusual workload but Redshift is still the best tool for the job. You don't need all the strengths of Redshift, but that's ok - you are getting the job done at an appropriate cost. In this case 100% CPU is just how your workload uses Redshift. It's not a problem, just reality. Now I doubt this is the case, but it is possible. I'd want to be sure I'm getting all the value from the money I'm spending before assuming everything is ok.

Related

Vertica PlannedConcurrency

I have been trying to tune the performance of queries running on a Vertica cluster by changing the value of PlannedConcurrency of the general resource pool. We have a cluster of 4 nodes with 32 cores/node.
According to Vertica docs,
Query budget = Queuing threshold of the GENERAL pool / PLANNEDCONCURRENCY
Increasing PlannedConcurrency should reduce the query budget, reserving less memory per query, which might lead to fewer queries being queued up.
Increasing the value of PlannedConcurrency seems to improve query performance.
PlannedConcurrency = 256 gives better performance than 128 which performs better than AUTO.
PlannedConcurrency being the preferred number of concurrently executing queries in the resource pool, how can this number be greater than the number of cores and still give better query performance?
Also, the difference between RESOURCE_ACQUISITIONS.MEMORY_INUSE_KB and QUERY_PROFILES.RESERVED_EXTRA_MEMORY should give the memory in use.
However, this number does not remain constant for a single query when the planned concurrency is changed.
Can someone please help me understand why this memory usage differs with the value of PlannedConcurrency?
Thanks !
References:
https://my.vertica.com/blog/do-you-need-to-put-your-query-on-a-budgetba-p236830/
https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/AdministratorsGuide/ResourceManager/GuidelinesForSettingPoolParameters.htm
It's hard to give an exact answer without the actual queries.
But in general, increasing the planned concurrency means you reserve and allocate fewer resources per query and allow for greater concurrency.
If your use case has lots of small queries which don't require lots of resources, it might improve things.
Also keep in mind that the CPU is not the only resource being used - you have to wait for IO (disks, network, etc.), and this is time you can better spend on running more queries...
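To illustrate the budget formula quoted in the question with some made-up numbers (the pool size here is purely hypothetical):

```python
# Query budget = queuing threshold of the GENERAL pool / PLANNEDCONCURRENCY
# Hypothetical numbers, just to show the direction of the effect.
queuing_threshold_kb = 96 * 1024 * 1024   # assume ~96 GB usable by the GENERAL pool

for planned_concurrency in (32, 128, 256):
    budget_kb = queuing_threshold_kb / planned_concurrency
    print(f"PLANNEDCONCURRENCY={planned_concurrency:>4} -> "
          f"budget per query ~{budget_kb / 1024 / 1024:.2f} GB")

# A smaller per-query budget means each query reserves less memory up front,
# so more small queries can run at once instead of queuing; queries that
# genuinely need more than their budget have to acquire extra memory later.
```

That last point is also why the measured memory in use for a single query can change with PlannedConcurrency: the up-front reservation changes even when the query's real needs do not.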

Hard disk contention using multiple threads

I have not performed any profile testing of this yet, but what would the general consensus be on the advantages/disadvantages of resource loading from the hard disk using multiple threads vs one thread? Note. I am not talking about the main thread.
I would have thought that using more than one "other" thread to do the loading would be pointless, because the HD cannot do two things at once and would therefore surely only cause disk contention.
Not sure which way to go architecturally, appreciate any advice.
EDIT: Apologies, I meant an SSD drive, not a magnetic drive. Both are HDs to me, but I am more interested in the case of a system with a single SSD drive.
As pointed out in the comments, one advantage of using multiple threads is that a large file load will not delay the delivery of a smaller file to the consumer of the loader thread. In my case, this is a big advantage, and so even if it costs a little performance, having multiple threads is desirable.
I know there are no simple answers, but the real question I am asking is: what kind of performance penalty would there be for letting the OS serialise the parallel disk accesses, as opposed to allowing only one resource-loader thread? And what are the factors that drive this? I don't mean platform, manufacturer, etc. I mean, technically, what aspects of the OS/HD interaction influence this penalty (in theory)?
FURTHER EDIT:
My exact use case is texture-loading threads which only exist to load from the HD and then "pass" the textures on to OpenGL, so there is minimal computation in the threads (maybe some type conversion, etc.). In this case, the threads would spend most of their time waiting for the HD (I would have thought), and therefore how the OS-HD interaction is managed is important to understand. My OS is Windows 10.
Note. I am not talking about the main thread.
Main vs non-main thread makes zero difference to the speed of reading a disk.
I would have thought that using more than one "other" thread to do the loading to be pointless because the HD cannot do 2 things at once, and therefore would surely only cause disk contention.
Indeed. Not only are the attempted parallel reads forced to wait for each other (and thus not actually parallel), but they will also make the access pattern of the disk random as opposed to sequential, which is much, much slower due to disk head seek time.
Of course, if you were to deal with multiple hard disks, then one thread dedicated for each drive would probably be optimal.
Now, if you were using a solid state drive instead of a hard drive, the situation isn't quite so clear cut. Multiple threads may be faster, slower, or comparable. There are probably many factors involved such as firmware, file system, operating system, speed of the drive relative to some other bottleneck, etc.
In either case, RAID might invalidate assumptions made here.
It depends on how much processing of the data you're going to do. This will determine whether the application is I/O bound or compute bound.
For example, if all you are going to do to the data is some simple arithmetic, e.g. add 1, then you will end up being I/O bound. The CPU can add 1 to data far quicker than any I/O system can deliver flows of data.
However, if you're going to do a large amount of work on each batch of data, e.g. a FFT, then a filter, then a convolution (I'm picking random DSP routine names here), then it's likely that you will end up being compute bound; the CPU cannot keep up with the data being delivered by the I/O subsystem which owns your SSD.
It is quite an art to judge just how an algorithm should be structured to match the underlying capabilities of the machine, and vice versa. There are profiling tools like FTRACE/Kernelshark and Intel's VTune, which are both useful in analysing exactly what is going on. Google does a lot to measure how many searches-per-Watt their hardware accomplishes, power being their biggest cost.
In general, I/O of any sort, even a big array of SSDs, is painfully slow. Even the main memory in a PC (DDR4) is painfully slow in comparison to what the CPU can consume. Even the L3 and L2 caches are sluggards in comparison to the CPU cores. It's hard to design and multi-threadify an algorithm just right so that the right amount of work is done on each data item whilst it is in L1 cache, so that the L2, L3 caches, DDR4 and I/O subsystems can deliver the next data item to the L1 caches just in time to keep the CPU cores busy. And the ideal software design for one machine is likely hopeless on another with a different CPU, or SSD, or memory DIMMs. Intel designs for good general-purpose computer performance, and actually extracting peak performance from a single program is a real challenge. Libraries like Intel's MKL and IPP are very big helps in doing this.
General Guidance
In general one should look at it in terms of the data bandwidth required by any particular arrangement of threads and the work those threads are doing.
This means benchmarking your program's inner processing loop and measuring how much data it processed and how quickly it managed to do it, choosing a number of data items that makes sense but is much larger than the size of the L3 cache. A single 'data item' is an amount of input data, the amount of corresponding output data, and any variables used in processing the input to the output, the total size of which fits in L1 cache (with some room to spare). And no cheating - use the CPU's SSE/AVX instructions where appropriate; don't forgo them by writing plain C and skipping something like Intel's IPP/MKL. [Though if one is using IPP/MKL, it kinda does all this for you to the best of its ability.]
These days DDR4 memory is going to be good for anything between 20 and 100 GByte/second (depending on what CPU, number of DIMMs, etc.), so long as you're not making random, scattered accesses to the data. By saturating the L3 you are forcing yourself into being bound by the DDR4 speed. Then you can start changing your code, increasing the work done by each thread on a single data item. Keep increasing the work per item and the speed will eventually start increasing; you've reached the point where you are no longer limited by the speed of DDR4, then L3, then L2.
If after this you can still see ways of increasing the work per data item, then keep going. You eventually get to a data bandwidth somewhere near that of the IO subsystems, and only then will you be getting the absolute most out of the machine.
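A very rough sketch of that kind of inner-loop measurement, in Python. The workload here is just a stand-in (repeated square roots over NumPy chunks); substitute your own per-item processing and chunk size.

```python
import time
import numpy as np

# Hypothetical stand-in data set: much larger than any L3 cache (~512 MB of float64).
items = np.random.rand(64, 1024, 1024)

def process(item, work_per_item):
    # Increase work_per_item to raise the compute done per byte touched.
    out = item
    for _ in range(work_per_item):
        out = np.sqrt(out * 1.0001 + 0.5)
    return out.sum()

for work in (1, 4, 16):
    start = time.perf_counter()
    checksum = sum(process(chunk, work) for chunk in items)   # keep the result so the work isn't skipped
    elapsed = time.perf_counter() - start
    gbytes = items.nbytes / 1e9
    print(f"work={work:>2}: {gbytes / elapsed:.2f} GB/s of input processed")
```

As the work per item grows, the effective input bandwidth drops and you move from being memory/IO bound towards being compute bound, which is exactly the trade-off described above.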
It's an iterative process, and experience allows one to short cut it.
Of course, if one runs out of ideas for things to increase the work done per data item then that's the end of the design process. More performance can be achieved only by improving the bandwidth of whatever has ended up being the bottleneck (almost certainly the SSD).
For those of us who like doing this sort of thing, the PS3's Cell processor was a dream. No need to second guess the cache - there was none. One had complete control over what data and code was where and when it was there.
A lot of people will tell you that an HD can't do more than one thing at once. This isn't quite true, because modern IO systems have a lot of indirection. Saturating them is difficult to do with one thread.
Here are three scenarios that I have experienced where multi-threading the IO helps.
Sometimes the IO reading library has a non-trivial amount of computation - think about reading compressed videos, or parity checking after the transfer has happened. One example is using robocopy with multiple threads. It's not unusual to launch robocopy with 128 threads!
Many operating systems are designed so that a single process can't saturate the IO, because this would lead to system unresponsiveness. In one case I got a 3% read-speed improvement because I came closer to saturating the IO. This is doubly true if some system policy exists to stripe the data across different drives, as might be set on a Lustre drive in an HPC cluster. For my application, the optimal number of threads was two.
More complicated IO, like a RAID card, contains a substantial cache that keeps the HD head constantly reading and writing. To get optimal throughput you need to be sure that whenever the disk is spinning it's constantly reading/writing and not just moving. The only way to do this, in practice, is to saturate the card's on-board RAM.
So, many times you can overlap some minor amount of computation by using multiple threads, and stuff starts getting tricky with larger disk arrays.
Not sure which way to go architecturally, appreciate any advice.
Determining the amount of work per thread is the most common architectural optimization. Write code so that it's easy to increase the IO worker count. You're going to need to benchmark.
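A minimal sketch of that shape, with the worker count as the knob to benchmark. The directory and file pattern are placeholders for your own assets.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical directory of assets to load; replace with your own.
FILES = sorted(Path("assets/textures").glob("*.bin"))

def load(path: Path) -> bytes:
    # Pure I/O worker: read the file and hand the bytes back to the caller
    # (e.g. the thread that owns the OpenGL context).
    return path.read_bytes()

def benchmark(worker_count: int) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        total_bytes = sum(len(data) for data in pool.map(load, FILES))
    elapsed = time.perf_counter() - start
    return total_bytes / 1e6 / elapsed    # MB/s

for workers in (1, 2, 4, 8):
    print(f"{workers} worker(s): {benchmark(workers):.1f} MB/s")
```

Because only the worker count changes, you can run this on the actual target drive and let the numbers decide the architecture.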

Is there any usage for letting a process "warm up"?

I recently did some digging into memory and how to use it properly. Of course, I also stumbled upon prefetching and how I can make life easier for the CPU.
I ran some benchmarks to see the actual benefits of proper storage/access of data and instructions. These benchmarks showed not only the expected benefits of helping your CPU prefetch, but also that the process speeds up during runtime. After about 100 program cycles, the CPU seems to have figured it out and has optimized the cache accordingly. This saves me up to 200,000 ticks per cycle; the number drops from around 750,000 to 550,000. I got these numbers using QTestLib.
Now to the question: is there a safe way to use this runtime speedup - letting the process warm up, so to speak? Or should one not factor this in at all and just build faster code from the start?
First of all, there is generally no gain in trying to warm up a process prior to normal execution: that would only speed up the first ~100 program cycles in your case, gaining a total of less than 20,000,000 ticks (at most 100 cycles x 200,000 ticks saved per cycle). That's much less than the roughly 75,000,000 ticks (100 cycles at ~750,000 ticks each) you would have to invest in the warming up itself.
Second, all these gains from warming up a process/cache/whatever are rather brittle. There are a number of events that destroy the warming effect that you generally do not control. Mostly these come from your process not being alone on the system. A process switch can behave pretty much like an asynchronous cache flush, and whenever the kernel needs a page of memory, it may drop a page from the disk cache.
Since these factors make computing time pretty unpredictable, they need to be controlled when running benchmarks that are supposed to produce results of any reliability. Apart from that, these effects are mostly ignored.
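This is also why benchmark harnesses usually throw away the first iterations. A minimal sketch of that pattern; the workload function is just a placeholder for whatever you are measuring.

```python
import statistics
import time

def workload():
    # Placeholder for the code being measured.
    return sum(i * i for i in range(100_000))

def bench(runs=100, warmup=10):
    timings = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        workload()
        elapsed = time.perf_counter() - start
        if i >= warmup:               # discard the cold-cache iterations
            timings.append(elapsed)
    return statistics.median(timings)

print(f"median per run: {bench() * 1e6:.0f} microseconds")
```

The warm-up iterations are measurement hygiene, not something you would budget for in production.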
It is important to note that keeping the CPU busy isn't necessarily a bad thing. Ideally you want your CPU to run anywhere from 60% to 100% because that means that your computer is actually doing "work". Granted, if there is a process that you are unaware of and that process is taking up CPU cycles, that isn't good.
In answer to your question, the machine usually takes care of this.

Reading files directly from the memory of another computer over a network

I am doing a large scale deep learning experiment involving image data of the order of around 800 GB.
The space available on a computational server is only 30 GB, and cannot be extended to match 800 GB.
At present I work around the problem by dividing my data into 30 GB chunks using Python, copying each chunk over with OpenSSH, and processing it. Every time I need another chunk, I delete the current chunk and repeat the process for the next one. Over several epochs of CNN training, this process is repeated hundreds of times.
Though I have not benchmarked it, I am concerned that this is a major performance bottleneck, because CNN training itself takes weeks on data of this scale. Repeated copying might be very costly.
I have never had to face this issue before, so now I am wondering whether it is possible to read files directly from the memory of my storage server for processing.
Specifically my questions are :
Is it possible to read files directly from the memory of another system, as though the files are on the same system, without explicit scp?
What kind of C++ framework(s) are available for doing something like that ?
What techniques are typically used by professional programmers in such a resource-constrained situation ?
I am not a computer science major and this is my first stint where I am faced with such performance-centric issues. Thus, I have almost no practical experience of dealing with such cases, so a little enlightenment or a reference would be great.
It may sound a little bit rude, but you need to realize that you can't do any sort of real-world machine learning on a calculator.
If you have a machine 10 years old or a dial-up internet connection, you cannot analyze big data. The fact that your server has only 30 GB of free disk space, at a time when you can easily buy 1 TB for under $200, means that something is really wrong here.
A lot of machine learning algorithms iterate through the data many, many times before they converge, so any solution that requires downloading/removing the data many times will be significantly (impractically) slower. Even assuming a fast and steady 200 Mb/s connection, it will take hours to download the whole dataset. Now repeat this even 100 times (a NN converging within 100 iterations is mostly impossible) and you will see how bad your situation is.
This brings me to my final remark - if you want to work with big data, upgrade your machine to handle big data.
Which costs more, the explicit cost of copying, or the implicit and hidden cost of reading data with latency?
As a data point, Google just announced a distributed version of TensorFlow, which can do CNN training (see https://www.tensorflow.org/versions/r0.8/tutorials/deep_cnn/index.html for details). In that setup, each machine winds up processing a chunk of data at a time, in a way that is not dissimilar to what you are already doing.
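For what it's worth, it is possible to read remote files without a full local copy, for example by mounting the storage server over NFS/SSHFS or streaming over SFTP. Here is a rough sketch of the streaming approach, assuming the paramiko library and hypothetical host names and paths:

```python
import paramiko

# Hypothetical connection details.
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("storage-server.example.com", username="researcher")

sftp = ssh.open_sftp()
for name in sftp.listdir("/data/images"):
    remote_file = sftp.open(f"/data/images/{name}", "rb")
    payload = remote_file.read()          # bytes arrive over the network
    remote_file.close()
    # ...decode the image and feed it into the training pipeline...

sftp.close()
ssh.close()
```

Note that this mainly removes the 30 GB disk constraint and the copy/delete bookkeeping; the bytes still cross the network every epoch, so the bandwidth concerns raised above still apply.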

several t2.micro better than a single t2.small or t2.medium

I read EC2's docs: instance types, pricing, FAQ, burstable performance and also this about CPU credits.
I even asked AWS support about the following, and the answer wasn't clear.
The thing is, according to the docs (although they're not too clear) and AWS support, all 3 instance types have the same performance while bursting: 100% usage of a certain type of CPU core.
So this is my thought process. Assume t2.micro's RAM is enough and that the software can scale horizontally. Two t2.micros cost the same as one t2.small; assuming the load is distributed equally between them (probably via an AWS LB), they will use the same amount of total CPU and consume the same amount of CPU credits. If they were to fall back to baseline performance, it would be the same.
BUT, while they are bursting, 2 t2.micros can achieve 2x the performance of a t2.small (again, for the same cost). The same concept applies to t2.medium. Also, using smaller instances allows for tighter auto (or manual) scaling, which allows one to save money.
So my question is: given that RAM and horizontal scaling are not a problem, why would one use anything other than t2.micro?
EDIT: After some replies, here are a few notes about them:
I asked AWS support and, supposedly, each vCPU of the t2.medium can achieve 50% of "the full core". This means the same thing I said applies to t2.medium (if what they said is correct).
t2.micro instances CAN be used in production. Depending on the technology and implementation, a single instance can handle over 400 RPS. I do it, and so does this guy.
They do require a closer look to make sure credits don't run low, but I don't accept that as a reason not to use them.
Your analysis seems correct.
While the processor type isn't clearly documented, I typically see my t2.micro instances equipped with one Intel Xeon E5-2670 v2 (Ivy Bridge) core, and my t2.medium instances have two of them.
The micro and small should indeed have the same burst performance for as long as they have a reasonable number of CPU credits remaining. I say "a reasonable number" because the performance is documented to degrade gracefully over a 15 minute window, rather than dropping off sharply like the t1.micro does.
Everything about the three classes (except the core count, which is the same for micro and small) multiplies by two as you step up: baseline, credits earned per hour, and credit cap. Arguably, the medium is very closely equivalent to two smalls when it comes to short term burst performance (with its two cores), but then again, that's also exactly the capability that you have with two micros, as you point out. If memory is not a concern, and traffic is appropriately bursty, your analysis is sensible.
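To put some rough numbers on that comparison (the per-hour credit accrual figures below are the ones published for the t2 family at the time; treat them as illustrative assumptions):

```python
# Illustrative comparison of 2x t2.micro vs 1x t2.small.
# Assumed figures: t2.micro earns 6 credits/hour, t2.small earns 12;
# one CPU credit = one vCPU running at 100% for one minute.
fleet = {
    "2 x t2.micro": {"instances": 2, "credits_per_hour": 6,  "burst_cores": 1},
    "1 x t2.small": {"instances": 1, "credits_per_hour": 12, "burst_cores": 1},
}

for name, spec in fleet.items():
    accrual = spec["instances"] * spec["credits_per_hour"]   # credits earned per hour
    burst_cores = spec["instances"] * spec["burst_cores"]    # cores available at full burst
    burn_rate = burst_cores * 60                              # credits burned per hour at full burst
    print(f"{name}: earns {accrual} credits/h, bursts on {burst_cores} core(s), "
          f"burns {burn_rate} credits/h at full burst")

# Same accrual and (per the doubling noted above) the same total credit pool,
# but the two micros can spend it on two cores at once: twice the peak
# throughput, drained in roughly half the time.
```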
While the t1 class was almost completely unsuited to a production environment, the same thing is not true of the t2 class. They are worlds apart.
If your code is tight and efficient with memory, and your workload is appropriate for the cpu credit-based model, then I concur with your analysis about the excellent value a t2.micro represents.
Of course, that's a huge "if." However, I have systems in my networks that fit this model perfectly -- their memory is allocated almost entirely at startup and their load is relatively light but significantly variable over the course of a day. As long as you don't approach exhaustion of your credit balances, there's nothing I see wrong with this approach.
There are a lot of moving parts here. What are your instances doing?
You said the traffic varies over the day but isn't spiky. If you wish to closely follow the load with a small number of t2.micro instances, you won't be able to rely much on bursting, because each newly scaled-up instance will start with a low CPU credit balance. If most of your instances run only while they are under load, they will never accumulate CPU credits.
You also lose time and money with each startup and with started-but-unused usage hours, so scaling up and down too frequently isn't the most cost-efficient approach.
Last but not least, the operating system and other software have a more or less fixed overhead; running it twice instead of once takes more resources away from your application on a system where you earn CPU credits only below 20% load.
If you want extreme cost efficiency, use spot instances.
The credit balance assigned to each instance varies. So while two micros could provide double the performance of a small during burst, they will only be able to do so for half as long.
I generally prefer at least two instances for availability purposes. But with the burstable model, workload also comes into consideration. Are you looking at sustained load? Or are you expecting random spikes throughout the day?