Retroactive calculation of VM instance total running time (GCP)

I have a number of instances in a GCP project and I want to check retroactively how long they've been in use over the last 30 days, i.e. sum the total time each instance was not stopped or terminated during a specific month.
Does anyone know if this can be calculated, and if so - how?
Or maybe someone has another idea that would allow me to sum the total time an instance was in use?

Based on this other post, I would recommend something like:
fetch gce_instance
| metric 'compute.googleapis.com/instance/uptime'
| align delta(30d)
| every 30d
| group_by [metric.instance_name]
I would also consider creating uptime checks, as one of the answers in said post recommends, for future monitoring, but those wouldn't work retroactively. If you need more info about MQL you can find it here.
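If you want to pull those numbers programmatically instead of through the Metrics Explorer UI, here is a minimal sketch of running that same MQL query with the google-cloud-monitoring Python client (this is my own illustration, not something from the linked post; the project id is a placeholder):

from google.cloud import monitoring_v3

project_id = "my-project"  # placeholder

mql = """
fetch gce_instance
| metric 'compute.googleapis.com/instance/uptime'
| align delta(30d)
| every 30d
| group_by [metric.instance_name]
"""

client = monitoring_v3.QueryServiceClient()
results = client.query_time_series(
    request={"name": f"projects/{project_id}", "query": mql}
)

# Each returned time series is one instance; the single aligned point holds
# the accumulated uptime in seconds over the 30-day window.
for series in results:
    instance_name = series.label_values[0].string_value
    for point in series.point_data:
        print(instance_name, point.values[0].double_value)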

Related

Is there a way to just sum a google cloud metric within a time period?

With the Metrics Explorer in the Google Cloud Platform, is it possible to sum a metric within a defined time period? I have custom gauge metrics set up for various data that I care about, and I am just trying to sum them up over the course of days, weeks, and months. It is important that I count everything between the first of the month and the last of the month. But instead, the aligner seems to pick an arbitrary time to align to (e.g. the 4th of the month) and I can't be certain that I am getting the correct values.
For example, if I try to use delta, rate, or sum within a time window of May 1st 00:00 to June 1st 00:00, with an alignment of 31 days, I will see two numbers. One will be for the 14th of May and one for the 14th of June and they will add up to a very large number.
fetch generic_task
| metric 'custom.googleapis.com/go-metrics/request_counter'
| filter (metric.environment == 'prod')
| align delta(31d)
| every 31d
| group_by [metric.environment],
[value_request_counter_aggregate: aggregate(value.request_counter)]
That isn't so bad, but if I change the alignment (to 7d or 1d, say), the sums of those numbers don't agree with each other. In some cases the total is twice as large, as if it were counting data from outside the time range that I specified. And worse, upon reload of the webpage, it picks different days/times to align on.
To get around this, I have been reduced to setting the alignment to a fine granularity and just tolerating a small amount of error/inconsistency.
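For what it's worth, one way to make that fine-alignment workaround deterministic (my own sketch, not part of the original question; the project id and dates are placeholders) is to skip the UI entirely: request daily aligned points over exactly the calendar window you care about via the Monitoring API and do the final sum yourself:

import datetime
from google.cloud import monitoring_v3

project_id = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()

# Exact window: May 1st 00:00 UTC to June 1st 00:00 UTC.
start = int(datetime.datetime(2021, 5, 1, tzinfo=datetime.timezone.utc).timestamp())
end = int(datetime.datetime(2021, 6, 1, tzinfo=datetime.timezone.utc).timestamp())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": start}, "end_time": {"seconds": end}}
)

# Daily deltas (use ALIGN_SUM instead if the metric really is a gauge),
# reduced across series per environment.
aggregation = monitoring_v3.Aggregation(
    {
        "alignment_period": {"seconds": 86400},
        "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_DELTA,
        "cross_series_reducer": monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
        "group_by_fields": ["metric.labels.environment"],
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "custom.googleapis.com/go-metrics/request_counter" '
                  'AND metric.labels.environment = "prod"',
        "interval": interval,
        "aggregation": aggregation,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Sum the per-day points locally, so the total covers exactly the window above.
total = sum(
    (p.value.double_value or p.value.int64_value)
    for series in results
    for p in series.points
)
print(total)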

Monitor that lambda executes in NewRelic

I'm trying to monitor if my Lambda has been executed within the last 25 hours within New Relic. I want to alert if it hasn't.
I have the following NRQL which gives me the graph I want to see:
SELECT sum(`provider.invocations.Sum`) FROM ServerlessSample WHERE provider.resource = 'my_lambda_name'
I then just want to say that if it dips below 1 for 1500 minutes (25 hours) then alert, but New Relic only lets me set an alert window of up to 120 minutes. Any tips on how to get around this?
Interesting question. From what I have seen on the New Relic discussion page (the Explorers Hub), there might be a solution for your task.
Can you please review this link:
https://discuss.newrelic.com/t/relic-solution-extending-the-functionality-of-nrql-alert-conditions-beyond-a-single-minute/75441
If you think about this for a moment, you might see how NRQL queries using percentile or stddev are a lot less useful than they seem, when used in an alert condition. After all, if you calculate the standard deviation over an hour (or 24 hours), that can be meaningful. But stddev(duration), or percentile(duration,95) calculated over only 60 seconds is less meaningful.
I think that limit is 24 hours, but I haven't tested it yet.
Hope this helps; I will give it a go as well to see whether it works.
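Another option, which is not from the linked thread and is only a hedged sketch on my part: poll the same NRQL yourself from a small script or cron outside New Relic, using the NerdGraph GraphQL endpoint, and fire your own notification when the 25-hour sum drops below 1. The API key, account id, and notification step below are placeholders:

import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"
API_KEY = "NRAK-..."   # placeholder user API key
ACCOUNT_ID = 1234567   # placeholder account id

nrql = ("SELECT sum(`provider.invocations.Sum`) FROM ServerlessSample "
        "WHERE provider.resource = 'my_lambda_name' SINCE 25 hours ago")

graphql = (
    f'{{ actor {{ account(id: {ACCOUNT_ID}) '
    f'{{ nrql(query: "{nrql}") {{ results }} }} }} }}'
)

resp = requests.post(
    NERDGRAPH_URL,
    headers={"API-Key": API_KEY},
    json={"query": graphql},
    timeout=30,
)
resp.raise_for_status()
results = resp.json()["data"]["actor"]["account"]["nrql"]["results"]

# results holds one row for this query; alert if the sum is below 1.
row = results[0] if results else {}
total = next(iter(row.values()), 0) or 0
if total < 1:
    print("ALERT: lambda has not run in the last 25 hours")  # placeholder notification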

AWS CloudWatch interpreting insights graph -- how many read/write IOs will be billed?

Introduction
We are trying to "measure" the cost of usage of a specific use case on one of our Aurora DBs that is not used very often (we use it for staging).
Yesterday at 18:18 hrs. UTC we issued some representative queries to it and today we were examining the resulting graphs via Amazon CloudWatch Insights.
Since we are being billed USD 0.22 per million read/write IOs, we need to know how many of those there were during our little experiment yesterday.
A complicating factor is that in the cost explorer it is not possible to group the final billed costs for read/write IOs per DB instance! Therefore, the only thing we can think of to estimate the cost is from the read/write volume IO graphs on CloudWatch Insights.
So we went to CloudWatch Insights and selected the graphs for read/write IOs. Then we selected the period of time in which we did our experiment. Finally, we examined the graphs with different options: "Number" and "Lines".
Graph with "number"
This shows us the picture below, suggesting a total billable IO count of 266+510=776. Since we have chosen the "Sum" statistic, this, we assume, would indicate a cost of about USD 0.00017 in total.
Graph with "lines"
However, if we choose the "Lines" option, then we see another picture, with 5 points on the line: the first ones around 500 (for read IOs) and the last one at approx. 750, suggesting a total of 5000 read/write IOs.
Our question
We are not really sure which interpretation to go with and the difference is significant.
So our question is now: How much did our little experiment cost us and, equivalently, how to interpret these graphs?
Edit:
Using 5 minute intervals (as suggested in the comments) we get (see below) a horizontal line with points at 255 (read IOs) for a whole hour around the time we did our experiment. But the experiment took less than 1 minute at 19:18 (UTC).
Will the (read) billing be for 12 * 255 IOs, or 255 ... (or something else altogether)?
Note: This question triggered another follow-up question created here: AWS CloudWatch insights graph — read volume IOs are up much longer than actual reading
From the Aurora RDS documentation:
VolumeReadIOPs
The number of billed read I/O operations from a cluster volume within a 5-minute interval.
Billed read operations are calculated at the cluster volume level, aggregated from all instances in the Aurora DB cluster, and then reported at 5-minute intervals. The value is calculated by taking the value of the Read operations metric over a 5-minute period. You can determine the amount of billed read operations per second by taking the value of the Billed read operations metric and dividing by 300 seconds. For example, if the Billed read operations returns 13,686, then the billed read operations per second is 45 (13,686 / 300 = 45.62).
You accrue billed read operations for queries that request database pages that aren't in the buffer cache and must be loaded from storage. You might see spikes in billed read operations as query results are read from storage and then loaded into the buffer cache.
Imagine AWS reports these data points every 5 minutes:
[100, 150, 200, 70, 140, 10]
and you used the "Sum of 15 minutes" statistic, as in your image.
Edit: my first version of this answer wrongly claimed that the "number" visualization represents only the last aggregated group, i.e. (70+140+10). In fact, the "number" visualization represents the whole selected duration aggregated together, which here would be the total (100+150+200+70+140+10) = 670.
The "line" visualization represents all the aggregated groups, which in this case would be 2 points: (100+150+200) = 450 and (70+140+10) = 220.
It can be a little hard to understand at first if you are not used to data points and aggregations, so I suggest setting your "line" chart to "Sum of 5 minutes": read the value of each point (divide by 300 if you want the per-second rate, as the documentation suggests) and sum them all.
Added images for easier visualization
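If you would rather not depend on how the console widgets aggregate at all, here is a minimal sketch of the same calculation through the CloudWatch API (assuming boto3; the cluster identifier, region, and time window are placeholders): pull the per-5-minute Sum datapoints and add them up yourself.

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # placeholder region

# Placeholder window around the experiment.
start = datetime.datetime(2021, 6, 1, 18, 0, tzinfo=datetime.timezone.utc)
end = datetime.datetime(2021, 6, 1, 20, 0, tzinfo=datetime.timezone.utc)

def billed_io(metric_name):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBClusterIdentifier", "Value": "my-staging-cluster"}],
        StartTime=start,
        EndTime=end,
        Period=300,  # the metric is reported per 5-minute interval
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in resp["Datapoints"])

reads = billed_io("VolumeReadIOPs")
writes = billed_io("VolumeWriteIOPs")
total = reads + writes
print(f"billed IOs: {total}, approx cost: USD {total / 1_000_000 * 0.22:.5f}")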

Depth of sys.dm_pdw_exec_requests on Azure SQL Data Warehouse

I am running tests that take many hours to complete on ADW and the amount of SQL involved rolls off the 10,000 row limit of sys.dm_pdw_exec_requests (as documented at https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-service-capacity-limits ) in less than 30 minutes.
Is my only option to create a process that captures the data from sys.dm_pdw_exec_requests into a table in my database every N minutes (where N << 30)?
I'm not sure what your use case is, but perhaps you can get the same useful information out of the audit logs?
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-auditing-overview
You might be able to use something that was already built for that purpose, instead of reinventing the wheel:
https://github.com/andrealibero/Azure_SQL_DWH_Perf_Stats
The PowerShell script can collect the output of DMVs (configured in an XML file) in a loop or for a specified number of iterations.
Given how quickly the DMVs roll over for you, this might help in your scenario.
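If you do end up rolling your own capture process instead, the idea is small enough to sketch. This is only an illustration: the connection string, the dbo.exec_requests_history table, and the 10-minute interval are placeholders you would have to create and adjust, and it only snapshots each request the first time it is seen.

import time
import pyodbc

# Placeholder connection string and polling interval (well under the ~30 minutes
# it takes the DMV to roll past its 10,000-row limit in your workload).
CONN_STR = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myserver.database.windows.net;Database=mydw;Uid=...;Pwd=..."
)
INTERVAL_SECONDS = 600

# Copy rows we have not seen yet into a persisted history table
# (dbo.exec_requests_history is assumed to mirror the DMV's columns).
CAPTURE_SQL = """
INSERT INTO dbo.exec_requests_history
SELECT r.*
FROM sys.dm_pdw_exec_requests AS r
LEFT JOIN dbo.exec_requests_history AS h
    ON h.request_id = r.request_id
WHERE h.request_id IS NULL;
"""

while True:
    with pyodbc.connect(CONN_STR, autocommit=True) as conn:
        conn.execute(CAPTURE_SQL)
    time.sleep(INTERVAL_SECONDS)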

DynamoDB Schema Design

I'm thinking of using Amazon AWS DynamoDB for a project that I'm working on. Here's the gist of the situation:
I'm going to be gathering a ton of energy usage data for hundreds of machines (energy readings are taken around every 5 minutes). Each machine is in a zone, and each zone is in a network.
I'm then going to roll up these individual readings by zone and network, by hour and day.
My thinking is that by doing this, I'll be able to perform one query against DynamoDB on the network_day table, and return the energy usage for any given day quickly.
Here's my schema at this point:
table_name | hash_key | range_key | attributes
______________________________________________________
machine_reading | machine.id | epoch | energy_use
machine_hour | machine.id | epoch_hour | energy_use
machine_day | machine.id | epoch_day | energy_use
zone_hour | machine.id | epoch_hour | energy_use
zone_day | machine.id | epoch_day | energy_use
network_hour | machine.id | epoch_hour | energy_use
network_day | machine.id | epoch_day | energy_use
I'm not seeing very good performance in tests when I run the rollup cron job, so I'm just wondering if someone with more experience could comment on my key design? The only experience I have so far is with RDS, but I'm very much trying to learn about DynamoDB.
EDIT:
Basic structure for the cronjob that I'm using for rollups:
foreach network
  foreach zone
    foreach machine
      add_unprocessed_readings_to_dynamo()
      roll_up_fixture_hours_to_dynamo()
      roll_up_fixture_days_to_dynamo()
    end
    roll_up_zone_hours_to_dynamo()
    roll_up_zone_days_to_dynamo()
  end
  roll_up_network_hours_to_dynamo()
  roll_up_network_days_to_dynamo()
end
I use the previous rollup's values in Dynamo for the next rollup, i.e. I use zone hours to roll up zone days, and I then use zone days to roll up network days.
This is what (I think) is causing a lot of unnecessary reads/writes. Right now I can manage with low throughputs because my sample size is only 100 readings. My concerns begin when this scales to what is expected to contain around 9,000,000 readings.
First things first, time series data in DynamoDB is hard to do right, but not impossible.
DynamoDB uses the hash key to shard the data, so using machine.id means that some of your keys are going to be hot. However, this is really a function of the amount of data and what you expect your IOPS to be. DynamoDB doesn't create a 2nd shard until you push past 1000 read or write IOPS. If you expect to stay well below that level you may be fine, but if you expect to scale beyond it then you may want to redesign; specifically, include a date component in your hash key to break things up, as sketched below.
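For example, a hypothetical composite partition key (my illustration, not the poster's schema) could look like:

import datetime

def reading_partition_key(machine_id: str, epoch_seconds: int) -> str:
    # Spreads one machine's readings across days so a single busy machine
    # doesn't concentrate all writes on one partition.
    day = datetime.datetime.fromtimestamp(
        epoch_seconds, tz=datetime.timezone.utc
    ).strftime("%Y-%m-%d")
    return f"{machine_id}#{day}"  # e.g. "machine-42#2013-04-01"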
Regarding performance, are you hitting your provisioned read or write throughput limits? If so, raise them to some crazy high level and re-run the test until the bottleneck becomes your code. It could be as simple as setting the throughput levels appropriately.
However, regarding your actual code: without seeing the actual DynamoDB queries you are performing, a possible issue would be reading too much data. Make sure you are not reading more data than you need from DynamoDB. Since your range key is a date field, use a range condition (not a filter) to reduce the number of records you need to read.
Also make sure your code executes the rollup using multiple threads. If you are not able to saturate the DynamoDB provisioned capacity, the issue may not be DynamoDB; it may be your code. By performing the rollups in parallel across multiple threads you should see some performance gains.
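To make the range-condition point concrete, here is a minimal sketch with boto3 (the dotted machine.id key from the schema above is written machine_id here, since dots are treated as path separators in expressions; the region, machine id, and epoch values are placeholders):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # placeholder region
table = dynamodb.Table("machine_reading")

day_start = 1364774400        # placeholder: start of the day being rolled up (epoch seconds)
day_end = day_start + 86400

# Key condition (not a filter): DynamoDB only reads the items in this range.
resp = table.query(
    KeyConditionExpression=Key("machine_id").eq("machine-42")
    & Key("epoch").between(day_start, day_end - 1)
)
total_energy = sum(item["energy_use"] for item in resp["Items"])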
What's the provisioned throughput on the tables you are using? How are you performing the rollup? Are you reading everything and filtering, or filtering on range keys, etc.?
Do you need a rollup / a cron job in this situation?
Why not use a table for the readings
machine_reading | machine.id | epoch_timestamp | energy_use
and a table for the aggregates
The hash_key can be the aggregate type and the range key can be the aggregate name.
example:
zone, zone1
zone, zone3
day, 03/29/1940
When getting machine data, dump it into the first table and then use atomic counters to increment the entities in the 2nd table:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
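A minimal sketch of that last step with boto3 (the aggregates table, its key names, and the values are all hypothetical, following the layout described above):

from decimal import Decimal
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # placeholder region
aggregates = dynamodb.Table("aggregates")  # hypothetical table: hash = aggregate type, range = aggregate name

def add_to_aggregate(agg_type: str, agg_name: str, energy_use: Decimal) -> None:
    # ADD is an atomic counter update, so concurrent writers don't clobber each other.
    aggregates.update_item(
        Key={"aggregate_type": agg_type, "aggregate_name": agg_name},
        UpdateExpression="ADD energy_use :inc",
        ExpressionAttributeValues={":inc": energy_use},
    )

# e.g. one incoming reading bumps both the zone-level and day-level rollups:
add_to_aggregate("zone", "zone1", Decimal("12.5"))
add_to_aggregate("day", "03/29/1940", Decimal("12.5"))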