Google Cloud Monitoring Writing Datapoints Faster than Maximum Sampling Period - google-cloud-platform

Context:
I'm attempting to use Google Cloud's monitoring SDK to publish metrics on error status codes, latency and other server-side metrics.
Due to the rate of requests per second on my machines, this will exceed Google's limit of 1 datapoint per 10 seconds per time series.
I am using the instance_id as one of the labels, so the time series are unique per machine, but I will still exceed the 1 datapoint / 10 seconds limit.
Question:
As mentioned in a similar question here, an option would be to log, buffer, and forward the messages. It seems strange to have each customer implement this for common high-rate metric use cases.
Is there an alternative way of recording high-rate metrics such as latency, number of requests, and number of errors with the SDK?
Resources:
https://cloud.google.com/monitoring/quotas
https://cloud.google.com/monitoring/custom-metrics/creating-metrics#monitoring_create_metric-nodejs
Error message: One or more points were written more frequently than the maximum sampling period configured for the metric
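One option, sketched below in Python under stated assumptions, is the buffering approach mentioned above: pre-aggregate in the process and flush each time series at most once per window longer than 10 seconds, so a single written point summarizes many requests. The metric type, labels, resource values, and 15-second flush interval are illustrative assumptions rather than values from the question; the client calls follow the linked creating-metrics sample.

```python
# Sketch: buffer counters in memory and flush each time series at most once
# per window longer than 10s. Metric type, labels, resource values, and the
# 15s flush interval are illustrative assumptions.
import threading
import time
from collections import defaultdict

from google.cloud import monitoring_v3

PROJECT_NAME = "projects/my-project-id"   # assumption: your project path
FLUSH_INTERVAL_SECONDS = 15               # stays above the 10s sampling limit

client = monitoring_v3.MetricServiceClient()
_counts = defaultdict(int)
_lock = threading.Lock()


def record_error(status_code: str) -> None:
    """Hot path: only touches local memory, never calls the API."""
    with _lock:
        _counts[status_code] += 1


def _flush() -> None:
    with _lock:
        snapshot = dict(_counts)
        _counts.clear()
    if not snapshot:
        return
    now = time.time()
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
    )
    series_list = []
    for code, count in snapshot.items():
        series = monitoring_v3.TimeSeries()
        series.metric.type = "custom.googleapis.com/server/error_count"
        series.metric.labels["status_code"] = code
        series.resource.type = "gce_instance"
        series.resource.labels["instance_id"] = "1234567890123456789"  # assumption
        series.resource.labels["zone"] = "us-central1-a"               # assumption
        series.points = [
            monitoring_v3.Point(
                {"interval": interval, "value": {"int64_value": count}}
            )
        ]
        series_list.append(series)
    client.create_time_series(name=PROJECT_NAME, time_series=series_list)


def _flush_loop() -> None:
    while True:
        time.sleep(FLUSH_INTERVAL_SECONDS)
        _flush()


threading.Thread(target=_flush_loop, daemon=True).start()
```

For latency you would flush a per-interval aggregate (for example a mean or a distribution value) rather than individual samples; higher-level layers such as OpenTelemetry perform this kind of aggregation before exporting, which is another way to avoid hand-rolling the buffer.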

Related

How to increase resolution of CpuUtilization metric of ECS cluster past 1 min mark?

I'm trying to create a robust autoscaling process for my ECS cluster but am facing problems with the resolution of the CpuUtilization metric. I have turned on 'Detailed metrics' for 1-minute resolution, but am not able to achieve good scaling results. I am deploying an ML model which takes roughly 1.5s to infer. I am not facing any memory bottleneck and hence am using CpuUtilization for scaling.
I need fast scaling, because when requests start piling up the response time easily shoots up to 3-5s. Currently, with 'Detailed metrics' enabled, the scale-out takes around 3-5 minutes to start, since 3 datapoints are checked for 1-minute resolution metrics. If I had a 5-10s resolution metric, I could look at 6 datapoints within 30s and start the scale-out job faster.
I tried using Lambda, Step Functions and EventBridge from this blog, but I am not able to get CpuUtilization or MemoryUtilization, only the task, service, and container counts.
Is there a way to get CPU and memory metrics directly from ECS? I know we can use cloudwatch.get_metric_statistics(), but that only returns datapoints already reported to CloudWatch, so it is not useful here.
You can't change that; the 1-minute interval is set by AWS. The only thing you can do to get better resolution is to publish your own custom metrics, which can have a resolution of 1 second.
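A minimal sketch of that suggestion, assuming boto3 and an illustrative namespace, metric name, and dimension; 'StorageResolution': 1 is what marks the datapoint as high-resolution:

```python
# Sketch: push a 1-second-resolution custom metric with boto3. The namespace,
# metric name, and dimension are illustrative assumptions.
import boto3
import psutil  # assumption: psutil is available for a local CPU reading

cloudwatch = boto3.client("cloudwatch")


def publish_cpu_sample(service_name: str) -> None:
    cloudwatch.put_metric_data(
        Namespace="Custom/ECS",
        MetricData=[
            {
                "MetricName": "CpuUtilization",
                "Dimensions": [{"Name": "ServiceName", "Value": service_name}],
                "Value": psutil.cpu_percent(interval=None),
                "Unit": "Percent",
                # StorageResolution=1 marks the datapoint as high-resolution,
                # so CloudWatch retains sub-minute granularity for it.
                "StorageResolution": 1,
            }
        ],
    )
```

You would call something like this on a short timer inside the task (or let the CloudWatch agent push it) and then point a sub-minute alarm at the custom metric.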

AWS High Resolution Metrics for faster ECS scaling

I have a complex REST API deployed in AWS ECS. The autoscaling policy for the same is based on RequestCount of 2000.
The scale-out happens when RequestCount is consistently higher than 2000 at the standard resolution of 60 seconds. This takes at least 2 minutes before scaling happens, which becomes a problem during short request surges when the request count increases to 10k and above: the containers start rejecting requests (throttling).
I need the scaling to happen more quickly, within a minute if not within seconds. AWS CloudWatch seems to offer high-resolution metrics, but there is very little information about:
Can I enable high resolution for specific metrics? Is it possible to have request counts resolved at a high granularity of 5 seconds and CPUUtilization at the standard granularity of 1 minute?
How can I enable high resolution on AWS metrics?
The AWS CloudWatch Documentation seems to be insufficient to understand this process.
There are two different things that can be 'high resolution': the alarm and the metric.
A high-resolution metric just means the source is pushing values more frequently. You can't control this if you're using an AWS metric, and most of them don't push more often than once a minute.
A high-resolution alarm is one where the period is less than 60 seconds; it is billed at a higher rate than standard alarms. However, it isn't very useful in most cases if the metric you're basing it on only gets pushed once per minute.
EDIT:
To directly answer your questions
No, I don't think any of the AWS RequestCount metrics for things like ELB have a 'high resolution on/off' toggle (although ELB might push more frequently than 1 minute by default, I'm not sure).
It's based on how often the source pushes data points to CloudWatch. If the AWS metrics don't work for what you need, you would need to add something like the CloudWatch agent (or just a script on your instance) pushing metrics more frequently. Be careful about the CloudWatch API call charges if you do this from a lot of sources at a high frequency, though.
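To make that concrete, a hedged boto3 sketch of a high-resolution alarm pointed at a custom metric that is already pushed at sub-minute frequency; the namespace, dimensions, threshold, and action ARN are assumptions, not values from the question:

```python
# Sketch: a high-resolution alarm (Period of 10 or 30 seconds) pointed at a
# custom metric that is already pushed at sub-minute frequency. The namespace,
# dimensions, threshold, and action ARN are illustrative assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="request-count-surge-fast",
    Namespace="Custom/App",                  # assumption: a metric you push yourself
    MetricName="RequestCount",
    Dimensions=[{"Name": "ServiceName", "Value": "my-api"}],
    Statistic="Sum",
    Period=10,                               # < 60s makes this a high-resolution alarm
    EvaluationPeriods=3,                     # e.g. 30 seconds of sustained breach
    Threshold=2000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:autoscaling:..."],  # assumption: your scaling policy ARN
)
```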

Moving average and trends in cloudwatch metrics

Is it possible to have a moving average or a trend line in AWS CloudWatch metrics?
The idea is to show, for example, the CPU utilization of a server over time and not just the average of the last x minutes, so we can see whether the trend over a long period of time is going up or down.
CloudWatch does not have trend lines built into standard metrics.
If you're looking for this, you would need to set up anomaly detection for the metric.
By enabling this you can build up an overview of the trend for that metric while also configuring its normal/abnormal ranges. If your data goes outside this band, you can have a CloudWatch alarm notify you.
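A minimal boto3 sketch of enabling anomaly detection, assuming an EC2 CPUUtilization metric and an illustrative instance ID:

```python
# Sketch: enable anomaly detection (the expected-range band) for a metric.
# The EC2 CPUUtilization metric and instance ID are illustrative assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_anomaly_detector(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Stat="Average",
)
```

Once the detector has trained, the metric graph shows the expected band, and an anomaly-detection alarm can notify you when values leave it.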

Azure Event Hub - Outgoing message/bytes doesn't increase after increasing throughput units and IEventProcessor instances

I have an EventHub instance with 200 partitions (Dedicated cluster). Originally, I have a consumer group with 70 instances of IEventProcessor + 1 throughput unit.
It appears that I can only get 30M outgoing messages per hour, while incoming is double that amount. So I increased to 20 throughput units and 100 processor instances, but the outgoing messages don't increase beyond 30M. I don't see any throttling messages.
Are there other EventHub limits that I should adjust here?
EDIT 1:
After setting prefetch and batch size to 1000 I still see only a moderate increase (screenshot: Imgur).
A couple of things I can recommend checking and doing:
Increase the batch and prefetch size (a sketch follows this list).
Check client-side resources like CPU and available memory, and see whether high resource utilization is becoming a bottleneck.
If the hosts are in a different region than the Event Hub, network latency can slow down receives. Co-locate the hosts with the Event Hub if so.
Consider creating a support ticket so the product group can do a proper investigation.
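For the first item, a sketch of the batch and prefetch settings; it uses the azure-eventhub Python SDK rather than the .NET IEventProcessor host from the question, so the client calls, parameter names, and values are illustrative assumptions:

```python
# Sketch: the prefetch/batch knobs from the first item, shown with the
# azure-eventhub Python SDK (the question uses the older .NET IEventProcessor
# host, so these parameter names and values are illustrative assumptions).
from azure.eventhub import EventHubConsumerClient

CONNECTION_STR = "<event-hubs-namespace-connection-string>"  # assumption
EVENTHUB_NAME = "my-hub"                                     # assumption


def on_event_batch(partition_context, events):
    for event in events:
        pass  # process each event
    # In production, configure a checkpoint store on the client so this persists.
    partition_context.update_checkpoint()


client = EventHubConsumerClient.from_connection_string(
    CONNECTION_STR,
    consumer_group="$Default",
    eventhub_name=EVENTHUB_NAME,
)

with client:
    client.receive_batch(
        on_event_batch=on_event_batch,
        max_batch_size=1000,  # hand up to 1000 events to the callback at once
        prefetch=1000,        # keep up to 1000 events buffered per partition
        max_wait_time=5,
    )
```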

VPN metrics in Cloudwatch

Recently I configured an alarm in CloudWatch for tracking a VPN tunnel connection. It is well known that 0 indicates the tunnel is DOWN and 1 indicates the tunnel is UP. When the connection is down, I have seen some data points on the graph shown as 0.66 or 0.75.
So what does that mean: is the connection DOWN or UP?
The correct statistic for each metric depends on your use case and the underlying metric.
From CloudWatch Concepts - Statistics:
Statistics are metric data aggregations over specified periods of time. CloudWatch provides statistics based on the metric data points provided by your custom data or provided by other AWS services to CloudWatch. Aggregations are made using the namespace, metric name, dimensions, and the data point unit of measure, within the time period you specify. The following table describes the available statistics.
Given the VPN metric above, try using the Maximum or Minimum statistics for the alarm. You are using the Average statistic, which, as you noted, will not produce meaningful data for your use case.
Minimum
The lowest value observed during the specified period. You can use this value to determine low volumes of activity for your application.
Maximum
The highest value observed during the specified period. You can use this value to determine high volumes of activity for your application.
That happens when your graph shows averages (that is why both of your values are between 0 and 1). In the CloudWatch console, select your metric and then click on the Graphed metrics tab. There you will see the Statistic column, which is most likely set to Average right now.
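Putting the first answer's advice into a hedged boto3 sketch; AWS/VPN and TunnelState are the standard Site-to-Site VPN metric names, while the VPN ID, period, and SNS topic ARN are illustrative assumptions:

```python
# Sketch: alarm on the Site-to-Site VPN TunnelState metric using Minimum, so a
# single 0 datapoint in the period reads as "tunnel was down". The VPN ID,
# period, and SNS topic ARN are illustrative assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="vpn-tunnel-down",
    Namespace="AWS/VPN",
    MetricName="TunnelState",
    Dimensions=[{"Name": "VpnId", "Value": "vpn-0123456789abcdef0"}],
    Statistic="Minimum",            # Average would give fractions like 0.66 or 0.75
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:vpn-alerts"],  # assumption
)
```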