GCP metric fluctuates at small scale - google-cloud-platform

This is my ES workload's metric, and it fluctuates. But when I scale up, the metric becomes stationary, as below:
Each pod's metric is stationary even at small scale:
How can I make it stationary? I can't find any documentation or logs about this. Thanks

On previous GKE versions, the default behavior was to report metrics every 2 minutes.
Version 1.16 uses a different metrics agent to export that data, which is why the graph looks this way: the data points are not exported at the same intervals.
As far as I can tell, the issue is in the graphs, not in the deployment itself.
This is currently a work in progress, but you can follow the resolution of this issue on this link:
GKE Fluctuating Metrics Reported After Upgrade

Related

GCP: Recommendations for reducing time series samples GMP

Looking for suggestions for reducing the samples collected over time.
Using Google Managed Prometheus. These are GKE workloads with autoscaling enabled, and the node pools run on preemptible VMs.
Because of that combination, pods spinning up and down over a 30-day period drive up the unique metric sample count significantly. Most (all) of the time we are not really interested in application metrics at the single-pod level, only at the overall deployment level.
Should I be running a local Prometheus that collects for a day, aggregates, and reports to GMP? Is there any way to scrape aggregated metrics directly from resources? Any pointers appreciated.
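A back-of-the-envelope calculation shows why pod churn inflates the unique series count, and why aggregating away the pod label (e.g., with a Prometheus recording rule) helps. All figures below (metrics per pod, pod counts, churn rate) are made-up illustrative assumptions, not measurements:

```python
def unique_series(metrics_per_pod: int, steady_pods: int,
                  restarts_per_day: int, days: int) -> int:
    """Each pod name is a distinct label value, so every replacement pod
    starts a brand-new set of time series instead of continuing the old ones."""
    unique_pods = steady_pods + restarts_per_day * days
    return metrics_per_pod * unique_pods

# Stable pods: 10 pods x 100 metrics = 1,000 series over the whole month.
stable = unique_series(metrics_per_pod=100, steady_pods=10,
                       restarts_per_day=0, days=30)

# Preemptible churn: ~20 pod replacements/day over 30 days.
churned = unique_series(metrics_per_pod=100, steady_pods=10,
                        restarts_per_day=20, days=30)

print(stable)   # 1000
print(churned)  # 61000
```

Summing over the pod label before storage (e.g., `sum by (deployment)` in a recording rule) collapses the churned series back to roughly one per metric per deployment, regardless of how often pods are replaced.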

How to scale a heavy video rendering server?

I'm working on a video rendering server in Node.js. It spawns multiple headless Chrome instances and uses the Puppeteer library to capture screenshots and feed them to FFmpeg. Afterwards it concatenates all the parts with some post-processing.
Now I want to move it to production, but it's not performing efficiently.
I tried serverless architectures, Cloud Run, etc., but couldn't make them work; they also clearly state that they're not meant for heavy, long-running tasks. The video takes too long to render, even longer than on my laptop.
I tried GCE and the results are satisfactory, but now I'm having a hard time scaling it. The server can only handle one request at a time efficiently. How do I scale horizontally and make sure each instance gets only one request at a time?
Thanks in advance.
To scale the number of identical instances you can use Managed Instance Groups. Have a look at the autoscaling documentation for a better understanding of how it works, but in short:
You can autoscale based on one or more of the following metrics that reflect the load of the instance group:
Average CPU utilization.
HTTP load balancing serving capacity, which can be based on either utilization or requests per second.
Cloud Monitoring metrics.
If you will be autoscaling based on CPU usage, just enable autoscaling and configure it when creating the instance group.
Here's an example gcloud command to do this:
gcloud compute instance-groups managed set-autoscaling example-managed-instance-group \
--max-num-replicas 20 \
--target-cpu-utilization 0.60 \
--cool-down-period 90
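To get an intuition for what `--target-cpu-utilization 0.60` does, here is a simplified model of how a target-utilization autoscaler sizes the group: it scales the replica count so that average utilization moves back toward the target. This is an illustration of the idea, not the exact algorithm GCE uses:

```python
import math

def recommended_replicas(current_replicas: int, avg_utilization: float,
                         target: float, max_replicas: int) -> int:
    # Keep total load constant while bringing per-VM utilization to target,
    # clamped to the configured maximum group size.
    needed = math.ceil(current_replicas * avg_utilization / target)
    return max(1, min(needed, max_replicas))

# 4 VMs running at 90% CPU against a 60% target -> grow to 6 VMs.
print(recommended_replicas(4, 0.90, target=0.60, max_replicas=20))  # 6
```

The `--cool-down-period` then keeps newly booted VMs out of this calculation until they have had time to initialize, so a starting instance's high CPU doesn't trigger further scale-ups.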
You can also use any available metric to scale your group, or even create a new custom metric that will trigger scaling:
You can create custom metrics using Cloud Monitoring and write your own monitoring data to the Monitoring service. This gives you side-by-side access to standard Google Cloud data and your custom monitoring data, with a familiar data structure and consistent query syntax. If you have a custom metric, you can choose to scale based on the data from these metrics.
And last - I've found this example use case that scales a group of VMs based on a Pub/Sub queue, which might be the solution you're looking for.
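The Pub/Sub approach also addresses the "one request at a time" constraint: rather than a load balancer pushing requests at instances, each VM pulls a single job from the queue and only asks for the next one when rendering finishes. A rough sketch of that pull loop (shown in Python for brevity, with a plain in-memory queue standing in for a Pub/Sub subscription; all names are illustrative):

```python
import queue
import threading

# Stand-in for a Pub/Sub subscription: each worker pulls at most one
# message at a time, so a VM never renders two videos concurrently.
jobs: "queue.Queue[str]" = queue.Queue()
finished: list[str] = []

def render(job: str) -> None:
    # Placeholder for the real Chrome/Puppeteer/FFmpeg pipeline.
    finished.append(job)

def worker() -> None:
    while True:
        job = jobs.get()      # blocks until a job is available
        if job == "STOP":
            jobs.task_done()
            return
        try:
            render(job)       # fully finish before pulling the next job
        finally:
            jobs.task_done()  # "ack": only now take another message

for j in ["video-1", "video-2", "video-3"]:
    jobs.put(j)
jobs.put("STOP")

t = threading.Thread(target=worker)
t.start()
jobs.join()
t.join()
print(finished)  # ['video-1', 'video-2', 'video-3']
```

With real Pub/Sub, setting the subscriber's max outstanding messages to 1 gives the same guarantee, and the autoscaler can grow the MIG when the queue backlog builds up.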

Unable to understand GCP bill for Stackdriver Monitoring usage

We have implemented kube-state-metrics (by following the steps in this article, section 4.4.1 Install monitoring components) on one of our Kubernetes clusters on GCP. This created 3 new deployments on the cluster: node-exporter, prometheus-k8s, and kube-state-metrics. After that, we could see all the metrics in Metrics Explorer with the prefix "external/prometheus/".
To check external metrics pricing, we referred to this link and calculated the price accordingly, but when the bill arrived it was a shocking figure. GCP charged a large amount even though we haven't added a single metric to a dashboard or set up monitoring for anything. Judging from the ingested volume (around 1.38 GB/day), these monitoring tools do some background job (reading metrics at regular intervals) that consumed this volume and produced the bill.
We would like to understand how these kube-state-metrics monitoring components work. Will they automatically collect metrics data and drive up ingested volume and the bill this way, or is there a misconfiguration in the setup?
Any guidance on this would be really appreciated!
Thank you.
By default, kube-state-metrics exposes a large number of metrics for resources and events across your cluster.
If you have many frequently-updating resources on your cluster, you may find that a lot of data is ingested into these metrics, which incurs high costs.
You need to configure which metrics you'd like to expose (for example via kube-state-metrics' --metric-allowlist and --metric-denylist flags), and consult the documentation for your Kubernetes environment, to avoid unexpectedly high costs.
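To see how a seemingly modest volume turns into a large bill, here is a rough estimate for the 1.38 GB/day figure above. The $0.258/MiB rate and the 150 MiB free allotment are assumptions taken from a past Cloud Monitoring price list, and "GB" is treated as decimal gigabytes; check the current pricing page before relying on these numbers:

```python
MIB_PER_GB = 1000**3 / 1024**2   # ~953.67 MiB in a decimal gigabyte

def monthly_cost(gb_per_day: float, days: int = 30,
                 rate_per_mib: float = 0.258,
                 free_mib: float = 150.0) -> float:
    """Estimated monthly charge for metric ingestion at a flat per-MiB rate."""
    ingested_mib = gb_per_day * MIB_PER_GB * days
    billable_mib = max(0.0, ingested_mib - free_mib)
    return billable_mib * rate_per_mib

# ~1.38 GB/day works out to roughly $10k/month at the assumed rate.
print(round(monthly_cost(1.38), 2))
```

At per-MiB pricing, ingestion volume alone drives the bill; no dashboards or alerting policies need to exist for the charge to accrue, which matches what the question describes.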

EC2 CPU Spikes in CloudWatch, but not within the VM

During the past couple of weeks, a couple of my server instances keep triggering CloudWatch alarms that I've set for CPU usage.
I can see periodic one-minute spikes only in the EC2 CPUUtilization metric in CloudWatch. I cannot see them on the instance itself using top, atop, the CloudWatch agent, etc.
I cannot find any correlation or corroborating measurement inside the VM to these events.
I've searched and read the documentation and come up empty-handed.
Any thoughts?
At the moment, I'm confident that the deployed app code is behaving. Am I wrong to think that CloudWatch is showing artifacts of something I don't have visibility into? And on only two instances? If it happened across the whole Auto Scaling group, I'd probably write this off as background noise.

AWS AutoScaling CPUUtilization metric not accurate?

I do heavy computation on incoming data traffic, similar to web server requests but not exactly. The computation is mainly CPU-bound; memory and disk I/O are hardly used at all. I deployed this application to an Auto Scaling group. I also have some customized measurements of the system's performance, besides the default AWS Auto Scaling CPUUtilization metric.
The strange thing I found is that the default CPUUtilization metric can sometimes be as high as 95%, while my customized measurements show the system working just fine, as also confirmed by visual checks.
I'm quite confident in my customized measurements and believe they reflect the true performance. But should I consider the high CPUUtilization abnormal in this case, or simply inaccurate?