GCP: Recommendations for reducing time series samples in GMP (Google Managed Prometheus)

Looking for suggestions for reducing the number of samples collected over time.
We are using Google Managed Prometheus. These are GKE workloads with autoscaling enabled, and the node pools run on preemptible VMs.
Because of that combination, pods spin up and down a lot over a 30-day period, which increases the unique time series/sample count significantly. Most (if not all) of the time we are not really interested in application metrics at the individual-pod level, only at the overall deployment level.
Should I be using a local Prometheus that collects for a day, aggregates, and reports to GMP? Is there any way to scrape aggregated metrics directly from the resources? Any pointers appreciated.
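One way to picture the deployment-level aggregation being asked about is a PromQL query that sums away the per-pod labels. The sketch below only aggregates at query time, via GMP's Prometheus-compatible HTTP API; it does not by itself reduce ingested samples (for that, the aggregation has to happen before the data reaches GMP, e.g. in a local Prometheus with recording rules, as suggested in the question). PROJECT_ID, the metric name, and the label names here are placeholders, not values from the original post.

```python
# Minimal sketch: query Google Managed Prometheus's Prometheus-compatible API
# with a deployment-level aggregation so pod-level series never need to be
# charted. PROJECT_ID, metric name, and labels are placeholders.
import google.auth
import google.auth.transport.requests
import requests

PROJECT_ID = "my-project"  # placeholder

# Sum away the per-pod dimensions; keep only workload-level labels.
PROMQL = "sum without (pod, instance) (rate(http_requests_total[5m]))"

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

resp = requests.get(
    f"https://monitoring.googleapis.com/v1/projects/{PROJECT_ID}"
    "/location/global/prometheus/api/v1/query",
    params={"query": PROMQL},
    headers={"Authorization": f"Bearer {credentials.token}"},
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])
```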

Related

GCP cost optimization of compute resources

We have multiple teams using the Google Dataproc service. We would like to analyze the usage of Dataproc resources and determine whether there is any scope for optimization. A few examples are listed below.
A Dataproc cluster is created without a TTL. In this case, even after a job has run, the cluster stays active and keeps incurring cost. If we were able to determine that the cluster is idle, we could recommend that the team stop it when not in use.
A Dataproc cluster is provisioned with a higher configuration (CPU and RAM) than needed, but utilization is very low (e.g. less than 10%). Scaling the resources down might bring the cost down while still serving the requirement to run the job.
We would like to understand whether GCP already has such features. If not, is there a place where we can find the relevant logs and build our own solution for this use case?
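For the first example, Dataproc does have a scheduled-deletion feature (an idle-delete TTL, set at cluster creation), and auditing which clusters lack it can be scripted. Below is a rough sketch using the google-cloud-dataproc client; PROJECT_ID and REGION are placeholders, and the utilization question (second example) would still need Cloud Monitoring metrics or your own logs on top of this.

```python
# Rough audit sketch: list Dataproc clusters in one region and flag clusters
# that were created without an idle-delete TTL. PROJECT_ID/REGION are
# placeholders for your own values.
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"      # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

for cluster in client.list_clusters(project_id=PROJECT_ID, region=REGION):
    # idle_delete_ttl is unset (zero) when the cluster was created without a TTL.
    has_ttl = bool(cluster.config.lifecycle_config.idle_delete_ttl)
    print(
        f"{cluster.cluster_name}: status={cluster.status.state.name}, "
        f"idle_delete_ttl={'set' if has_ttl else 'MISSING'}"
    )
```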

Are Google Cloud preemptible VMs less likely to be interrupted mid-task during the night hours in each respective region?

I want to run a cluster on Google Compute Engine preemptible nodes. I'm wondering if it's more advantageous to pick a region for the cluster nodes where it's night when I typically run my jobs.
Are there any statistics on where it's most advantageous to run a cluster of preemptible nodes by the local start time of jobs?
(There is an old related question, "Which Google Compute Engine Server is least likely to preempt my vms?", but it does not address my question specifically about usage statistics by time of day.)
To address your question: currently there are no usage statistics on what time of day is best or most advantageous for running Preemptible VMs/Spot VMs. In terms of picking a region and zone, keep in mind that communication within a region will generally be cheaper and faster than communication across different regions. At some point in time your instances might experience an unexpected failure, so to mitigate the effects of such events you should duplicate important systems across multiple zones and regions.
For further information you can also check the documentation on Preemptible instances/Spot VMs, Location selection tips, and General best practices for Spot VMs.

Unable to understand GCP bill for Stackdriver Monitoring usage

We have implemented kube-state-metrics (by following the steps mentioned in this article, section 4.4.1 "Install monitoring components") on one of our Kubernetes clusters on GCP. It created three new deployments on our cluster: node-exporter, prometheus-k8s, and kube-state-metrics. After that, we were able to see all the metrics inside Metrics Explorer with the prefix "external/prometheus/".
To check external metrics pricing we referred to this link and calculated the expected price accordingly, but when we received the bill it was a shocking figure. GCP has charged a large amount even though we have not added a single metric to a dashboard or set up monitoring on anything. From the ingested volume (around 1.38 GB/day), it looks like these monitoring tools do some background work (reading metrics on a schedule or similar) which consumed this volume and produced this bill.
We would like to understand how these kube-state-metrics monitoring components work. Do they automatically collect metrics data and increase the ingested volume (and the bill) in this way, or is there a misconfiguration in the setup?
Any guidance on this would be really appreciated!
Thank you.
By default, when installed, kube-state-metrics exposes a large number of metrics about the resources and events across your cluster.
If you have many frequently-updating resources on your cluster, you may find that a lot of data is ingested into these metrics, which incurs high costs.
You need to configure which metrics you'd like to expose, and consult the documentation for your Kubernetes environment, in order to avoid unexpectedly high costs.
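As a starting point for trimming, it can help to enumerate which Prometheus-sourced external metric types actually exist in the workspace, to see what node-exporter and kube-state-metrics are writing. A minimal sketch using the google-cloud-monitoring client library, with PROJECT_ID as a placeholder:

```python
# List every external metric descriptor with the Prometheus prefix, which is
# where the kube-state-metrics / node-exporter data from the question lands.
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()
descriptors = client.list_metric_descriptors(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = starts_with("external.googleapis.com/prometheus/")',
    }
)
for descriptor in descriptors:
    print(descriptor.type)
```

Once you know which metric families dominate, the trimming itself usually happens on the exporter side (kube-state-metrics supports allow/deny lists for metrics) or by dropping metrics at scrape time.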

AWS NetworkOut

Our web application has 5 pages (Signin, Dashboard, Map, Devices, Notification)
We have done the load test for this application, and load test script does the following:
Signin and go to Dashboard page
Click Map
Click Devices
Click Notification
We have a basic free plan in AWS.
While performing the load test we didn't get any errors until about 100 users; please see the image below. NetworkIn and CPUUtilization seemed normal, but NetworkOut showed 846K.
When we reached around 114 users, we started getting errors on the Map page (highlighted in red). During that time only NetworkOut appears to be high. Please see the image below.
We want to know what an acceptable value for NetworkOut is, and, if this number is high, whether there is any way to reduce it.
Please let me know if you need more information. Thanks in advance for your help.
You are using a t2.micro instance.
This instance type has CPU limitations that make it good for bursty workloads, but sustained loads will consume all the available CPU credits. Thus, it might perform poorly under sustained load over long periods.
The instance also has limited network bandwidth that might impact the throughput of the server. While all Amazon EC2 instances have limited allocations of bandwidth, the t2.micro and t2.nano have particularly low bandwidth allocations. You can see this when copying data to/from the instance and it might be impacting your workloads during testing.
The t2 family, especially at the low-end, is not a good choice for production workloads. It is great for workloads that are sometimes high, but not consistently high. It is also particularly low-cost, but please realise that there are trade-offs for such a low cost.
See:
Amazon EC2 T2 Instances – Amazon Web Services (AWS)
CPU Credits and Baseline Performance for Burstable Performance Instances - Amazon Elastic Compute Cloud
Unlimited Mode for Burstable Performance Instances - Amazon Elastic Compute Cloud
That said, the network throughput shown on the graphs is a result of your application. While the t2 might be limiting the throughput, it is not responsible for the spike on the graph. For that, you will need to investigate the resources being used by the application(s) themselves.
NetworkOut simply refers to the volume of outgoing traffic from the instance. To reduce NetworkOut you have to reduce the data the instance sends out, so you may need to see which of Click Map, Click Devices, and Click Notification sends the most traffic out of the instance. It may not necessarily be related only to the number of users, but to a combination of the number of users and the application module.
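To tie the two answers together, it can help to pull NetworkOut and the t2 CPU credit balance for the same window from CloudWatch and see which one moves when the errors start. A rough boto3 sketch; the instance ID and time window are placeholders:

```python
# Fetch NetworkOut (bytes sent per 5-minute period) and CPUCreditBalance for
# one EC2 instance over the load-test window, to separate a genuine traffic
# spike from t2 credit exhaustion. INSTANCE_ID and the window are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)     # assumed load-test window

cw = boto3.client("cloudwatch")

def series(metric_name, stat):
    points = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric_name,
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=[stat],
    )["Datapoints"]
    return sorted(points, key=lambda p: p["Timestamp"])

for out, credit in zip(series("NetworkOut", "Sum"),
                       series("CPUCreditBalance", "Average")):
    print(out["Timestamp"],
          f"out={out['Sum'] / 1e6:.1f} MB",
          f"credits={credit['Average']:.0f}")
```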

Monitoring workers and identifying bottlenecks in a data pipeline

I am using Google Cloud Dataflow. Some of my data pipelines need to be optimized. I need to understand how workers are performing in the Dataflow cluster along the following lines:
1. How much memory is being used? Currently I am logging memory usage from Java code.
2. Is there a bottleneck on disk operations, i.e. would an SSD be required?
3. Is there a bottleneck in vCPUs, i.e. should the vCPUs on the worker nodes be increased?
I know Stackdriver can be used to monitor CPU and disk usage for the cluster. However, it does not provide information on individual workers, or on whether we are hitting a bottleneck in any of these areas.
You are correct that within the Dataflow Stackdriver UI you cannot view individual workers' metrics. However, you can certainly set up a Stackdriver dashboard which gives you the individual worker metrics for all of what you have mentioned. Below is a sample dashboard which shows metrics for CPU, memory, network, read IOPS, and write IOPS.
Since the Dataflow job name will be part of the GCE instance name, here I filter down the GCE instances being monitored by the job name I'm interested in. In this case, my Dataflow job was named "pubsub-to-bigquery", so I filtered down to instance_name ~= pubsub-to-bigquery.*. I did a regex filter to be sure I captured any job names which may be suffixed with additional data in future runs. Setting up a dashboard such as this can inform you when you'd actually benefit from SSDs, more network bandwidth, etc.
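If you prefer to pull the same per-worker view from code rather than a dashboard, something along these lines should work with the Cloud Monitoring API client. PROJECT_ID is a placeholder, and the "pubsub-to-bigquery" prefix simply mirrors the job name used above:

```python
# Per-worker CPU utilization for GCE instances whose names match the Dataflow
# job name prefix, via the Cloud Monitoring API. PROJECT_ID is a placeholder.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now - 3600)},  # last hour
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "compute.googleapis.com/instance/cpu/utilization" '
            "AND metric.labels.instance_name = "
            'monitoring.regex.full_match("pubsub-to-bigquery.*")'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in results:
    if ts.points:
        name = ts.metric.labels["instance_name"]
        latest = ts.points[0].value.double_value  # points are newest-first
        print(f"{name}: cpu={latest:.1%}")
```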
Also be sure to check the Dataflow job graph in the Cloud Console when looking to optimize your pipeline. The wall time shown below each step name can give a good indication of which custom transforms or DoFns should be targeted for optimization.