Strange behaviour of CPU balance in t2.2xlarge AWS EC2 instance [duplicate] - amazon-web-services

I am using a T2.medium instance. For a third of the day I am doing intensive statistical calculations, and I figured that for the remaining 2/3 of the time I would "earn" credits at a rate of 24 per hour.
But that is not happening. This is my usage the last two days:
And this is my credit account:
I hadn't used it for (more than) a day until yesterday at 6 pm. Then I used it intensively for five hours. I would expect my "account" to accumulate 24 credits per hour after that, but for 9-10 hours almost nothing happens; then it accumulates as expected for 9 hours and then goes flat again.
I am unable to figure out what is going on and whether it is a fault. Does anyone have a good explanation?
EDIT: I have included a week of activity below. I still can't figure out the algorithm:

Update: The rules used to calculate t2 CPU credit balances appear to have changed such that the issue prompting this question should no longer have an impact.
Based on customer feedback, we’ve updated T2 instances with a new CPU Credit allocation policy that is the same as or better than the previous policy in all cases.
...
Now, earned CPU Credits do not expire until the instance is terminated or stopped. A T2 instance can still earn up to the same maximum level allowed by the instance size. The CPUCreditBalance will now increase anytime the current CPUCreditUsage is below the baseline and can grow to the maximum allowed for the instance size
https://forums.aws.amazon.com/ann.jspa?annID=5196
h/t: Last Week in AWS for the update.
The original answer follows.
This question has caused me quite a bit of mental anguish over the last few hours, because the graphs almost make sense, based on what I know about t2 instances. Almost, but not quite, and I couldn't put my finger on the problem. That's the worst kind. Particularly being a huge fan of the value proposition offered by t2 machines.
But I did finally figure out what's going on here.
There's one concept of CPU credits the documentation doesn't seem to explain, but the math works out, and the explanation holds up nicely under real-world observations:
The most recently earned CPU credits are spent first, not last.
Does order matter? It does.
For testing, I used a t2.micro (primarily because I had an idle one that had been running for several days, and needed something to do, and I didn't want the extra "initial" credits of a new instance to cloud up the observations) but all instance types in the t2 class have similar behavior.
By way of background: in the t2 class, CPU credits are earned at different rates, but CPU credits are used at the same rate for all instance types in the class:
A CPU Credit provides the performance of a full CPU core for one minute.
The t2.micro and t2.small have only one core, so they can burn up to 1 credit per minute or 60 credits per hour, at 100% CPU utilization. The t2.medium and t2.large are dual core, so they can burn up to 2 credits per minute, or 120 credits per hour, at 100% CPU utilization on both cores.
If 1 credit = 100% of 1 core for 1 minute, then 1 credit is also equal to 20% of 1 core for 5 minutes. Since the Cloudwatch graph interval is in 5 minute increments, I set up the following test:
On a t2.micro that has been running for several weeks with essentially no load, I installed lookbusy, a handy utility that allows you to make a machine "look busy" with parameters you specify -- e.g., keep the CPU at 20% utilization.
$ screen -S eat_cpu              # detachable session so the load keeps running
$ ./lookbusy -v -c 20 -r fixed   # hold the CPU at a fixed 20% utilization
This does exactly what you'd expect, burning 1 CPU credit every 5 minutes. The "CPU Credit Usage" graph confirms this, showing 1 credit being used every 5 minutes. (The CPU Utilization graph, and top, both confirm the 20%.)
But what's happening to my credit balance? It's being depleted by 1 credit every 5 minutes. That seems wrong, doesn't it? I mean, yes, I just said that's how many I'm using, but... I'm also supposed to be earning 6 credits per hour, so I should only be depleting my balance by a net of 0.5 credits every 5 minutes, right?
Hold on... checking the numbers, again: I'm earning 6 per hour, spending 12 per hour, so, yes... that seems like it should be a net decrease of only 6 per hour, not 12... right? Clearly, something doesn't add up the way I expected, because my balance is definitely going down by 12 per hour, and my CPU is definitely only running at 20%.
I seem to be earning no credits to offset my usage. How is that possible?
Unless...
Unused earned credits from a given 5 minute interval expire 24 hours after they are earned
Well, 24 hours ago, my instance was completely idle. During that hour, I earned 6 credits that I... didn't (?) use. Am I not using them now? Shouldn't I be?
any expired credits are removed from the CPU credit balance at that time, before any newly earned credits are added
Crud. Could this be related? This hour, I earned 6 new credits. But right before that, I lost 6 credits from 24 hours ago. Then I spent 12 credits this hour... so my balance went down by 6, up by 6, and down by another 12. Well, that explains the -12 change for the hour, but...
Can that be the reason?
I'm a voracious reader of documentation, so I knew about the expiring credits aspect... but I assumed all along that this was nothing more than the reason an idle instance hovers near its maximum balance, and did not have any other significance. How could it? If I have less than the maximum (6 x 24 = 144 for a t2.micro), then how could I have credits that need to expire?
If my credits from 24 hours ago are always counting against me, wouldn't my balance tend toward zero, regardless of what I do?
Unless...
After tossing and turning most of the night while contemplating sliding around piles of imaginary tokens (representing CPU credits) on an imaginary table top (representing time)... I realized that the "expiration" rule would cause exactly the behavior we observe if, counter-intuitively, credits are not spent in the order in which they are earned (FIFO), but rather in the reverse order (LIFO).
Following that line of reasoning, the explanation for what my 20% CPU test is actually doing is this, where the first hour of my test was "hour 0" --
     | spends 6+6 credits  | expire 6 credits
test | earned this many    | earned this many
hour | hours before hour 0 | hours before hour 0
-----+---------------------+--------------------
  0  | -1, -2              | -24
  1  | -3, -4              | -23
  2  | -5, -6              | -22
  3  | -7, -8              | -21
  4  | -9, -10             | -20
  5  | -11, -12            | -19
  6  | -13, -14            | -18
  7  | -15, -16            | -17
And they meet in the middle.
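To sanity-check the arithmetic, here is a small toy replay of that table in Python -- my own illustration, not anything AWS publishes -- assuming an instance that idled for 24 hours, spending 12 credits per hour drawn newest-first, earning 6 per hour, and expiring whichever credits turn 24 hours old:

from collections import deque

# Toy replay of the table above (my own illustration, not AWS code).
# Starting point: an instance that idled for 24 hours, so it holds 24
# hour-buckets of 6 credits each -- the 144-credit t2.micro maximum.
# Each test hour it spends 12 credits drawn from the newest old buckets
# (LIFO), earns 6 fresh credits, and the bucket that turns 24 hours old
# expires if anything is left in it.

old_stock = deque([6] * 24)   # index 0 = newest hour-bucket, index -1 = oldest
balance = sum(old_stock)      # 144

hour = 0
while old_stock:
    spent_from_old = 0
    while old_stock and spent_from_old < 12:       # spend newest-first
        spent_from_old += old_stock.popleft()
    expired = old_stock.pop() if old_stock else 0  # 24-hour-old bucket expires
    balance += 6 - spent_from_old - expired        # earn 6, spend, lose the expired
    print(f"hour {hour}: spent {spent_from_old} old credits, "
          f"expired {expired}, balance now {balance}")
    hour += 1

# After about 8 hours the two ends meet: nothing is old enough to expire,
# and the balance drains at only the net 6 credits per hour from then on.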
Is this genuine, or am I guessing? I'm not guessing, and here's the evidence:
After 8 hours, my CPU credit usage graph remains solid, still holding steady at 1 credit per 5 minutes, but after the same 8 hours, my CPU credit balance finally begins to deplete at the (slower) rate I originally expected: 0.5 credits every 5 minutes.
Apparently, as I worked backward in time, spending previously earned credits "newest first," I caught up with my old credits that were about to expire, finally reaching the point where I was using them before they had a chance to expire. Now, I have no credits that are 24 hours old, and so no credits are expiring -- so I am no longer losing credits before new credits are earned. I am now able to keep the 6 that I earn per hour, because I used up the old ones, decreasing the net impact to my credit balance to the expected level.
This explains the only reservation I had about the graphs in the question: why, when utilization drops off, does it take so long for the balance to rebound?
The TL;DR answer is this: the balance doesn't rebound immediately, after a burst of heavy utilization, because you still have unused credits from 24 hours prior, which are canceling out your newly-earned credits, until you reach the point in time when you don't have any 24-hour-old unused credits. When that happens, your credit balance increases again.
Leave the instance completely idle for 24 hours and you will eventually see the balance steadily (for the most part) rise to the maximum again, as expected. Anything less than 24 hours completely idle will cause your balance to remain perpetually somewhere below the max.
My test script eventually depleted my credit balance almost all the way down. When I killed the process eating the CPU, the credit balance began to recover immediately, at the expected rate of 6 credits per hour.
Conversely, when I took a different machine that had seen low utilization for 24 hours, ran its CPU to 100% for a few minutes, then took it back to idle, the credits did not begin to accumulate immediately... being offset by old, expiring ones.
Quotes are from http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-instances.html.

Related

Inconsistent Lambda AWS Runtimes

We have created an AWS Lambda function in Python and added code to get the execution time, because we noticed that the runtime varies significantly based on how long it has been since the last Lambda call, and provisioned concurrency seems to have little impact. For all the runs in the table below, we were running the same Lambda with the same inputs, and the program is deterministic, so everything related to our code is held constant. Below is a table of what we found. Interval means how long we waited before making another call to the Lambda; for example, 2 hours means that if we called the Lambda at 1:00 PM, we called it again at 3:00 PM and 5:00 PM etc. (but not in between). # Invocations is the total number of times the Lambda was called in succession for a given interval.
Provisioned Concurrency | Interval   | # Invocations | Avg Duration | p90 Duration
------------------------+------------+---------------+--------------+-------------
off                     | 2 hours    |  42           | 17.9 sec     | 32.0 sec
off                     | 10 minutes | 304           | 14.6 sec     | 21.2 sec
off                     | 1 minute   | 364           |  4.5 sec     |  6.3 sec
on                      | 2 hours    |  50           | 18.1 sec     | 31.2 sec
on                      | 10 minutes |  49           | 10.2 sec     | 29.6 sec
on                      | 1 minute   | 404           |  4.6 sec     |  6.3 sec
So, a few questions that seem odd that maybe someone else has experience or knows something about:
We thought there would be a "hot" versus "cold" state for the Lambdas, where provisioned concurrency would keep the Lambdas "hot" all the time. However, 10 minutes versus 2 hours seems to be a "warm" state (i.e. 10 minutes is faster than 2 hours, but not much faster). Then, at 1 minute, the Lambda appears to be "hot" and consistently fast. Any idea what is going on here?
Also, we thought that provisioned concurrency would essentially make the Lambda stay in the "hot" state regardless of invocation interval; however, that is not the case, as can be seen with the 2-hour interval actually taking slightly longer. Having said that, the 10-minute interval is slightly shorter with provisioned concurrency on.
Any help/insight on this is greatly appreciated.
Extra information based on comments:
The Lambda is triggered via API Gateway from an HTTP POST request.
The runtime is a custom Docker image. We are doing this so we can use a particular version of TensorFlow.
The Lambda does use other services: S3 and DynamoDB. We added logs to time each section, and we have identified that the majority of the time is spent inside our self-contained section. Meaning, it doesn't appear that S3 download/upload or DynamoDB read/write is taking much time at all. Most of the time is spent in our CPU-intensive algorithm, which does image processing with TensorFlow. For those familiar with deep learning, the program loads a trained model, we do inference with that model, and each test used the same image as input.
We've also tried adjusting the CPU/memory of the Lambda and found that it doesn't seem to improve much beyond a certain point, and it does not fix the problem. The experiments in the table were gathered using 4 GB Lambdas, and our logs report that less than 1 GB of memory is ever used.
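For reference, this is roughly how we time the sections inside the handler (a simplified sketch; load_model and run_inference are stand-ins for our actual TensorFlow code):

import json
import time

# Simplified sketch of our timing instrumentation. load_model/run_inference
# are stand-ins for our real TensorFlow code.

def load_model():
    time.sleep(2)                    # stand-in for loading the trained model
    return object()

def run_inference(model, event):
    time.sleep(1)                    # stand-in for the CPU-heavy image processing
    return {"ok": True}

MODEL = None                         # module-level cache, reused on warm invocations

def lambda_handler(event, context):
    global MODEL
    timings = {}

    t0 = time.perf_counter()
    if MODEL is None:                # only paid on a cold (or recycled) worker
        MODEL = load_model()
    timings["model_load_s"] = round(time.perf_counter() - t0, 3)

    t0 = time.perf_counter()
    result = run_inference(MODEL, event)
    timings["inference_s"] = round(time.perf_counter() - t0, 3)

    print(json.dumps(timings))       # appears in CloudWatch Logs per invocation
    return result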

How do I add a time delay to an Alexa Quiz SKill

I am creating a quiz-based Alexa skill. The quiz has three levels (1, 2 and 3). I would like to reduce the amount of time the user has to answer as they progress through the levels.
I'm aware that I cannot extend the 8-second reply time that is fixed with Alexa skills, so here is my current attempt. At level 1, the user will have the initial 8 seconds to respond; if they do not, Alexa will re-prompt them, adding another 8 seconds, so level 1 players have approximately 16 seconds in total to respond. At level 2 I will not allow Alexa to re-prompt the user, but after the 8 seconds will state that they have run out of time and tell the user their score before saving it, so level 2 players have roughly 8 seconds. However, I'm unsure whether or not I can reduce the initial 8 seconds to 5 seconds for level 3. Any help is much appreciated, thanks.
Edit: This is all taking place within an AWS Lambda function.
Unfortunately, this isn't something that can be changed. The time allowed for the user to respond is fixed.
You don't have control over the timeouts aside from 8 seconds plus 8 seconds in the reprompt. The only workaround I can think of is measuring the round trip of the question/answer in the backend and rejecting answers that go above the time limit you want to enforce.
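For example, a rough sketch of that round-trip check in a Python Lambda backend (illustrative only; the per-level time limits are your own rule, not an Alexa setting):

import time

# Rough sketch of the round-trip workaround. When a question is asked, its
# timestamp is stored in the session attributes; when the answer intent
# arrives, answers that took too long are rejected.

TIME_LIMITS = {1: 16, 2: 8, 3: 5}    # seconds allowed per level (our own rule)

def build_question_response(question_text, level, session_attributes):
    session_attributes["asked_at"] = time.time()
    session_attributes["level"] = level
    return {
        "version": "1.0",
        "sessionAttributes": session_attributes,
        "response": {
            "outputSpeech": {"type": "PlainText", "text": question_text},
            "shouldEndSession": False,
        },
    }

def answer_is_in_time(session_attributes):
    elapsed = time.time() - session_attributes.get("asked_at", 0)
    limit = TIME_LIMITS.get(session_attributes.get("level", 1), 16)
    return elapsed <= limit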
There is a workaround.
You can use an audio clip in SSML and prompt the user to precede their answer with the wake word.
<speak>
You have 30 seconds, when you are ready just say: Alexa, and your answer.
<audio src="soundbank://soundlibrary/ui/gameshow/amzn_ui_sfx_gameshow_countdown_loop_32s_full_01"/>
Time is over. Tell me your answer.
</speak>

Reserve instance when running less than 24 hours a day

I would like to know the best solution.
I have an instance of t3.medium type running 6 hours a day.
Does it make sense for me to buy a t3.nano-type Reserved Instance, and if so, how many instances? Or does it not pay to buy a Reserved Instance?
From a purely mathematical viewpoint, in US regions a t3.medium Linux instance would cost:
On-Demand: $0.0416 per Hour x 6 hours per day x 5 days per week x 52 weeks = $64.896 per year (Or ~$90 if 7 days per week)
1-Year upfront Reserved instance: $213 per year
3-Year upfront Reserved instance: $412 = $137 per year
So, the cheapest option is On-Demand.
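As a quick back-of-the-envelope sketch (prices copied from the figures above; check current pricing for your region before relying on them):

# Back-of-the-envelope comparison; prices copied from the figures above,
# so check current pricing for your region before relying on them.

ON_DEMAND_PER_HOUR = 0.0416          # t3.medium Linux, US region (assumed)
HOURS_PER_DAY = 6
WEEKS_PER_YEAR = 52

for days_per_week in (5, 7):
    yearly = ON_DEMAND_PER_HOUR * HOURS_PER_DAY * days_per_week * WEEKS_PER_YEAR
    print(f"On-Demand, {days_per_week} days/week: ${yearly:7.2f} per year")

print(f"1-year all-upfront RI:     ${213:7.2f} per year")
print(f"3-year all-upfront RI:     ${412 / 3:7.2f} per year")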
An alternative is a Scheduled Reserved Instance, which "are a good choice for workloads that do not run continuously, but do run on a regular schedule." However, it seems that this option has been removed from the Management Console in some regions.
A Reserved Instance also includes a capacity reservation in case of capacity constraints, which makes it attractive beyond merely price.
The maximum discount you can get is ~62% for a 3-year, 100% upfront commitment. Since you are running the instance for only 6 hours/day, it makes no financial sense to reserve your instance.

EC2 t2.medium burstable credit "savings" calculation


Metric-based Auto scaling policies in Amazon EC2

I have defined the following policies on t2.micro instance:
Take action A whenever {maximum} of CPU Utilization is >= 80% for at least 2 consecutive period(s) of 1 minute.
Take action B whenever {Minimum} of CPU Utilization is <= 20% for at least 2 consecutive period(s) of 1 minute.
Is my interpretation wrong that if the Min (Max) of CPU drops below (goes above) 20% (80%) for 2 minutes, these rules should be activated?
Because my collected stats show, for example, that the Max of CPU reached 90% in two consecutive 1-minute periods, but I got no alarm!
Cheers
It seems my interpretation is not correct! The policy works based on the Average of the metric for each period. That means the first policy will be triggered if the AVERAGE of the datapoints within a minute is >= 80% for two consecutive 1-minute periods. The reason is simple: CloudWatch does not consider datapoints at finer than 1-minute granularity. So if I go for a 5-minute period, Max and Min show the correct behavior.
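If it helps, here is roughly how such a scale-out alarm could be defined with boto3 (a sketch only; the Auto Scaling group name and policy ARN are placeholders), using the 5-minute period and Maximum statistic described above:

import boto3

# Illustrative sketch only -- the Auto Scaling group name and policy ARN are
# placeholders. Uses a 5-minute period so the Maximum statistic is evaluated
# the way described above.

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="scale-out-on-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-asg"}],
    Statistic="Maximum",                  # max of the datapoints in each period
    Period=300,                           # 5-minute periods
    EvaluationPeriods=2,                  # ...for 2 consecutive periods
    Threshold=80.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:autoscaling:region:account-id:scalingPolicy:..."],
)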