Is the maintenance window burning error budget - sre

Is the maintenance window burning error budget?
Example:
Let's say I have 1h of error budget left and I stop the service for 30 minutes of planned maintenance. Is the remaining error budget still 1h, or is it 30 minutes?
The maintenance window falls at a time when the application receives no traffic, for example 3-5 am for an online retailer that is available in one country.

It is 30 minutes: planned maintenance spends error budget like any other downtime.
“The development team can ‘spend’ this error budget in any way they like. If the product is currently running flawlessly, with few or no errors, they can launch whatever they want, whenever they want. Conversely, if they have met or exceeded the error budget and are operating at or below the defined SLA, all launches are frozen until they reduce the number of errors to a level that allows the launch to proceed.”
from
https://www.atlassian.com/br/incident-management/devops/sre
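The arithmetic behind that answer can be sketched in a few lines. This assumes the budget is tracked as wall-clock downtime, as the question frames it; a budget defined on failed requests would need a different calculation. The function name is made up for illustration.

```python
from datetime import timedelta


def remaining_budget(budget_left: timedelta, downtime: timedelta) -> timedelta:
    """Subtract downtime from a time-based error budget, never going below zero."""
    return max(budget_left - downtime, timedelta(0))


# 1h of budget left, 30 minutes of planned maintenance.
left = remaining_budget(timedelta(hours=1), timedelta(minutes=30))
print(left)  # 0:30:00
```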

Related

GCP alert: Only alerting when threshold is violated for multiple measurement periods

In GCP, I have an alerting policy on database CPU and memory usage. For example, if CPU is over 50% over a 1m period, the alert fires.
The alert is kind of noisy. With other systems, I've been able to alert only if the threshold is violated multiple times, e.g.
If the threshold is violated for 2 consecutive minutes.
If over a 5 minute period, the threshold is violated in 3 of those minutes.
(Note: I don't want to simply change my alignment period to 2 minutes.)
There are a couple of things I've seen in the GCP alert configuration that might help here:
Change the "trigger"
    UI: "Alert trigger: Number of time series violates", "Minimum number of time series in violation: 2".
    JSON: "trigger": {"count": 2}
Change the "retest window"
    UI: "Advanced Options" → "Retest window: 2m"
    JSON: "duration": "120s"
But I can't figure out exactly how these work. Can these be used to achieve the goal?
I think the retest window option is useful in this scenario. I have a similar requirement, which I set up in a GCP alert policy for DB CPU utilisation breaching 70% over a 5-minute rolling window. If the condition clears within 5 minutes it won't alert, but if it reappears and persists for more than 10 minutes, it can trigger the alert.
I have set the retest window to a time limit of 10 minutes.
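The two goals from the question (2 consecutive violations, or 3 violations within 5 minutes) can be written down precisely as a small simulation. This is only a sketch of the intended logic in Python, not GCP's implementation or API; the function names and sample values are made up for illustration.

```python
from typing import Sequence


def violated_consecutively(samples: Sequence[float], threshold: float, n: int) -> bool:
    """True if the most recent n samples all exceed the threshold."""
    if len(samples) < n:
        return False
    return all(value > threshold for value in samples[-n:])


def violated_m_of_n(samples: Sequence[float], threshold: float, m: int, n: int) -> bool:
    """True if at least m of the most recent n samples exceed the threshold."""
    window = samples[-n:]
    return sum(value > threshold for value in window) >= m


# One CPU reading per minute (hypothetical values, in percent).
cpu = [55, 40, 60, 45, 70]
print(violated_consecutively(cpu, 50, 2))  # False: 45 then 70
print(violated_m_of_n(cpu, 50, 3, 5))      # True: 55, 60 and 70 are over 50
```

Under the assumption that the retest window means "the condition must hold continuously for that long", a 2m retest window with a 1m alignment period would correspond to the first function with n=2.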

Maintenance window for GKE clusters

I am aware of the fact that,
“You must allow at least 48 hours of maintenance availability in a 32 day rolling window”
Hence we configured the maintenance window for the cluster to be set dynamically during cluster creation using Terraform as:
maintenance_policy {
  recurring_window {
    start_time = timeadd(timestamp(), "720h")
    end_time   = timeadd(timestamp(), "768h")
    recurrence = "FREQ=MONTHLY"
  }
}
So we are basically setting a monthly maintenance window whose start time is 30 days after cluster creation.
We had not faced any issues with this config before, but when I used it on the 1st of March, Terraform correctly evaluated the start_time as the 31st of March, whereas GKE did not and set the start time to the 2nd of April, which throws an error because it falls outside the 32-day window.
Error: googleapi : Error 400: Error validating maintenance policy: maintenance policy would go longer than 32d without 48h maintenance availability of >=4h contiguous duration (in time range [2021-04-02T04:25:38Z, 2021-05-04T04:25:38Z])., badRequest
We tried hardcoding several values, but saw the same discrepancy whenever the start_time fell on days such as the 30th or 31st of the month.
I found no docs on any exceptions for specific dates, and any leads would be really appreciated!
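One way to sanity-check the dates in the error message is to redo the arithmetic by hand. The sketch below assumes the cluster was created at 2021-03-01T04:25:38Z; that timestamp is inferred by subtracting 768h from the start of the range in the error and is not stated in the question.

```python
import calendar
from datetime import datetime, timedelta, timezone

# Assumed creation time, inferred from the error message (not given in the question).
created = datetime(2021, 3, 1, 4, 25, 38, tzinfo=timezone.utc)

start = created + timedelta(hours=720)   # timeadd(timestamp(), "720h")
end = created + timedelta(hours=768)     # timeadd(timestamp(), "768h")
limit = end + timedelta(days=32)         # end of the 32-day rolling window

print(start.isoformat())  # 2021-03-31T04:25:38+00:00
print(end.isoformat())    # 2021-04-02T04:25:38+00:00
print(limit.isoformat())  # 2021-05-04T04:25:38+00:00

# April has only 30 days, so there is no "31st of April" for a monthly rule to land on.
print(calendar.monthrange(2021, 4)[1])  # 30
```

Under that assumption, the range in the error starts where the first window ends (2nd of April) and runs for 32 days. That would be consistent with a monthly recurrence anchored on the 31st having no occurrence in April, although nothing in the question confirms that this is how GKE evaluates the rule.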
You cannot have a maintenance window longer than 30 days in GKE. If you need more than 30 days, you have to break it up into multiple exclusion windows and make sure that the gap between one window's end time and the next window's start time is at least 48 hours.
So for example if there is a maintenance exclusion between 2021-10-01 and 2021-12-31 it would be defined as such
exclusion-window-1:
  endTime: '2021-10-30T00:00:00Z'
  startTime: '2021-10-01T00:00:00Z'
exclusion-window-2:
  endTime: '2021-11-30T00:00:00Z'
  startTime: '2021-11-01T00:00:00Z'
exclusion-window-3:
  endTime: '2021-12-31T00:00:00Z'
  startTime: '2021-12-02T00:00:00Z'
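The splitting rule in this answer can be sketched as a small helper. The 29-day window length and the 48-hour gap are read off the example above; the function name and defaults are assumptions for illustration, not GKE limits taken from documentation.

```python
from datetime import datetime, timedelta


def split_exclusion(start, end, max_len=timedelta(days=29), gap=timedelta(hours=48)):
    """Split [start, end] into windows of at most max_len, separated by gap."""
    windows = []
    current = start
    while current < end:
        window_end = min(current + max_len, end)
        windows.append((current, window_end))
        current = window_end + gap  # leave room for maintenance before the next window
    return windows


for s, e in split_exclusion(datetime(2021, 10, 1), datetime(2021, 12, 31)):
    print(s.date(), "->", e.date())
# 2021-10-01 -> 2021-10-30
# 2021-11-01 -> 2021-11-30
# 2021-12-02 -> 2021-12-31
```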

Elasticsearch: "Update service software release" in the AWS console stays pending

After pressing "Update service software release" in the AWS console, the following message appeared: "An update to release ******* has been requested and is pending.
Before the update starts, you can cancel it any time."
I have now waited for 1 day and it is still pending.
Any idea how long it takes? Do I need to do anything to move it from pending to updating, and should I expect any downtime during the update process?
I requested the R20210426-P2 update on a Monday and it was completed on the next Saturday so roughly 6 days from request to actual update. It's also worth noting that the update does not show up in the Upgrade tab in the UI, it shows up in the Notifications tab with this:
Service software update R20210426-P2 completed.
[UPDATE 11 Jul 2021] I just proceeded with updates on two additional domains and the updates began within 15 minutes.
[UPDATE 17 Dec 2021 Log4J CVE] I've had variable luck with the R20211203-P2. One cluster updated in a few hours and one took a few days. A third I was sure I started a few days ago but it gave me the option to update today (possibly a timeout?). I'm guessing they limit the number of concurrent updates and things are backed up. I recommend continuing to check the console but have patience, they do eventually get updated. If you have paid support, definitely open a ticket.
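Checking the console repeatedly could also be scripted. A minimal sketch, assuming the domain status is fetched with boto3's `es` client (`describe_elasticsearch_domain`) and that the response carries a `ServiceSoftwareOptions` block with `UpdateStatus`, `CurrentVersion` and `NewVersion` fields. The sample dict, the domain name and the `R20210331` version are hypothetical, not real output.

```python
def software_update_summary(domain_status: dict) -> str:
    """Summarise the service software update state of one domain."""
    opts = domain_status.get("ServiceSoftwareOptions", {})
    status = opts.get("UpdateStatus", "UNKNOWN")
    current = opts.get("CurrentVersion", "?")
    new = opts.get("NewVersion", "?")
    return f"{status}: {current} -> {new}"


# With real credentials the dict would come from something like (assumption):
#   import boto3
#   resp = boto3.client("es").describe_elasticsearch_domain(DomainName="my-domain")
#   print(software_update_summary(resp["DomainStatus"]))
sample = {
    "ServiceSoftwareOptions": {
        "UpdateStatus": "PENDING_UPDATE",  # illustrative value
        "CurrentVersion": "R20210331",     # hypothetical version
        "NewVersion": "R20210426-P2",
    }
}
print(software_update_summary(sample))
```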

AWS Lambda function that has been working for weeks, one day timed out for no apparent reason. Ideas?

I wrote a simple Lambda function (in Python 3.7) that runs once a day and keeps my Glue data catalog updated when new partitions are created. It works like this:
Object creation in a specific S3 location triggers the function asynchronously
From the event, Lambda extracts the key (e.g.: s3://my-bucket/path/to/object/)
Through the AWS SDK, Lambda asks Glue if the partition already exists
If not, it creates the new partition. If yes, it terminates the process.
Also, the function has 3 print statements:
one at the very beginning, saying it started the execution
one in the middle, which says if the partition exists or not
one at the end, upon successful execution.
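The steps and print statements above could look roughly like this. It is a minimal sketch, not the asker's actual code: it assumes Hive-style `name=value` segments in the object key, a Glue client passed in by the caller (for example `boto3.client("glue")`), and hypothetical database and table names. A real partition would usually also need a `StorageDescriptor` in `PartitionInput`.

```python
from urllib.parse import unquote_plus


def partition_values_from_event(event: dict) -> list:
    """Extract partition values from the S3 object key in the event."""
    key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])
    # Keep only "name=value" path segments (assumed key layout).
    return [seg.split("=", 1)[1] for seg in key.split("/") if "=" in seg]


def ensure_partition(glue, database: str, table: str, values: list) -> bool:
    """Create the partition if it does not exist; return True if it was created."""
    print("execution started")
    try:
        glue.get_partition(DatabaseName=database, TableName=table, PartitionValues=values)
        print("partition already exists")
        return False
    except glue.exceptions.EntityNotFoundException:
        print("partition does not exist")
    glue.create_partition(
        DatabaseName=database, TableName=table, PartitionInput={"Values": values}
    )
    print("execution finished successfully")
    return True


def handler(event, context, glue=None):
    if glue is None:
        import boto3  # imported lazily so the helpers above work without it
        glue = boto3.client("glue")
    values = partition_values_from_event(event)
    # "my_database" and "my_table" are placeholders.
    return ensure_partition(glue, "my_database", "my_table", values)
```

Passing the client in makes it easy to exercise the logic with a stub, and to log how long each Glue call takes when chasing a timeout.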
This function has an average execution time of 460ms per invocation, with 128MB RAM allocated, and it cannot have more than about 12 concurrent executions (as 12 is the maximum amount of new partitions that can be generated daily). There are no other lambda functions running at the same time that may steal concurrency capacity. Also, just to be sure, I have set the timeout limit to be 10 seconds.
It has been working flawlessly for weeks, except this morning, 2 of the executions timed out after reaching the 10 seconds limit, which is very odd given it's 20 times larger than the avg. duration.
What surprises me the most, is that in one case only the 1st print statement got logged in CloudWatch, and in the other case, not even that one, as if the function got called but never actually started the process.
I could not figure out what may have caused this. Any idea or suggestion is much appreciated.
Maybe AWS had a problem with their services; I got the same issue.
Not sure if it helps, but you can check at:
https://status.aws.amazon.com
[CloudFront High Error Rate]
4:28 PM PDT We are investigating elevated error rates and elevated latency in multiple edge locations.
5:08 PM PDT We can confirm elevated error rates and high latency accessing content from multiple Edge Locations, which is also contributing to longer than usual propagation times for changes to CloudFront configurations. We have identified the root cause and continue to work toward resolution.
5:54 PM PDT We are beginning to see recovery for the elevated error rates and high latency accessing content from multiple Edge Locations. Error rates have recovered for all locations except for Europe. Additionally, we continue to work toward recovery for the increased delays in propagating configuration changes to Cloudfront Distributions.
6:21 PM PDT Starting 3:18 PM PDT, we experienced elevated error rates and high latency accessing content from multiple Edge Locations. The elevated error rates and elevated latency accessing content were fully recovered at 5:48 PM PDT. During this time, customers may also have experienced longer than usual change propagation delays for CloudFront configurations and invalidations. The backlog of CloudFront configuration changes and invalidations were fully processed by 6:14 PM PDT. All issues have been fully resolved and the system is operating normally

AWS EC2 reaches high CPU usage during nighttime

I set up a few alarms with CloudWatch last week and I noticed strange behavior on EC2 small instances every day between 6h30 and 6h45 (UTC time).
I set up one alarm to warn me when an Auto Scaling group has its average CPU over 50% for 3 minutes, and another alarm to warn me when the same Auto Scaling group goes back to normal, which I defined as average CPU under 30% for 3 minutes. I did that twice: once for zone A and once for zone B.
That looks OK, but something happens between 6h30 and 6h45 that consumes a certain amount of processing for 2 to 5 minutes. The CPU rises, sometimes triggering the "High use" alarm, but always triggering the "returned to normal" alarm. Our system is in the early stages of development, so no users have access to it, and we have no processes, backups, etc. scheduled. We barely have Apache+PHP installed and configured, so I guess it can only be something related to the host machines.
Can anybody explain what is going on and how we can solve it, other than by increasing the sample time or the percentage in the "return to normal" alarm? Folks at the Amazon forum said the Service Team would have a look once they get a chance, but it's been almost a week with no reply.
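Why the "returned to normal" alarm can fire without the "High use" alarm can be seen in a small simulation. This is a sketch of the alarm logic as described in the question, not CloudWatch's implementation: it assumes "3 minutes" means three consecutive 1-minute datapoints and that a notification is sent only on the transition into the alarm state. The CPU values are made up.

```python
def alarm_states(samples, breaching, periods):
    """Return one state per full window: ALARM if all `periods` samples breach."""
    states = []
    for i in range(periods - 1, len(samples)):
        window = samples[i - periods + 1 : i + 1]
        states.append("ALARM" if all(breaching(v) for v in window) else "OK")
    return states


def notifications(states):
    """Count transitions from OK into ALARM."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev == "OK" and cur == "ALARM")


# Idle machine with a 3-minute bump to 45% CPU: above 30%, but below 50%.
cpu = [10, 10, 10, 10, 45, 45, 45, 10, 10, 10, 10]

high_use = alarm_states(cpu, lambda v: v > 50, 3)
back_to_normal = alarm_states(cpu, lambda v: v < 30, 3)

print(notifications(high_use))        # 0: CPU never stayed over 50%
print(notifications(back_to_normal))  # 1: CPU left the under-30% band and came back
```

Any bump that crosses 30% resets the "returned to normal" alarm, so it fires again once the CPU settles, whether or not the 50% alarm was ever reached.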