I use Istio 1.8 for the service mesh and Prometheus to collect metrics from the sidecars. Currently these metrics are provided by the sidecars:
istio_request_bytes_bucket
istio_request_duration_milliseconds_bucket
istio_requests_total
envoy_cluster_upstream_cx_connect_ms_bucket
istio_request_messages_total
istio_response_messages_total
envoy_cluster_upstream_cx_length_ms_bucket
istio_response_bytes_bucket
istio_request_bytes_sum
istio_request_bytes_count
This volume of metrics uses a lot of network bandwidth. (We have around 5k pods.)
All we need for now are istio_requests_total and istio_request_duration_milliseconds_bucket, and only for the inbound direction.
I know how to remove labels with an EnvoyFilter, but I was unable to find documentation on removing a metric.
For better visibility I'm posting my comment as a Community Wiki answer, as it only extends what Peter Claes already mentioned in his answer.
According to the Istio docs:
The metrics section provides values for the metric dimensions as expressions, and allows you to remove or override the existing metric dimensions. You can modify the standard metric definitions using tags_to_remove or by re-defining a dimension. These configuration settings are also exposed as istioctl installation options, which allow you to customize different metrics for gateways and sidecars as well as for the inbound or outbound direction.
Here you can find the info regarding customizing Istio (1.8) metrics:
https://istio.io/v1.8/docs/tasks/observability/metrics/customize-metrics/
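For illustration, here is a minimal IstioOperator sketch in the spirit of that page (the metric name, the added dimension and the removed tag are only examples, not a verified recipe for your exact setup):

    # Sketch only: customizes the inbound sidecar's requests_total metric.
    # The dimension and tags_to_remove values are illustrative.
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    spec:
      values:
        telemetry:
          v2:
            prometheus:
              configOverride:
                inboundSidecar:
                  metrics:
                    - name: requests_total          # exposed as istio_requests_total
                      dimensions:
                        destination_port: string(destination.port)
                      tags_to_remove:
                        - request_protocol          # example tag to drop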
I'm looking to scale my Compute Engine instances based on memory, which is an agent metric in Stackdriver. The caveat is that out of the five states the agent can monitor (buffered, cached, free, slab, used; see the link here), I only want to look at 'used' memory, and if that value is above a certain percentage threshold across the group (or per instance, which would also work for me), I want to autoscale.
I've already installed the Stackdriver Monitoring agent on all the nodes across the Managed Instance Group, and I can successfully visualize 'used' memory in my monitoring dashboard, which I'm well acquainted with.
Unfortunately, I can't do the same for autoscaling. This is what I see when I go to configure it in the autoscaling section of the MIG.
I believe adding filter expressions should work as expected, since the expression works correctly in the Stackdriver console's Monitoring dashboard. Also, it's mentioned here that the syntax is compatible with the Cloud Monitoring filter syntax given here.
I've tried different combinations of syntax in the filter expression field, but none of them have worked. Please help.
I was attempting the exact same configuration, trying to scale based on memory usage. After testing various entries without success, I reached out to Google support. Based on your question I can't tell what kind of instance group you have; it matters, as explained below.
TLDR
Based on input from Google support, only zonal instance groups allow the filter expression entry.
Zonal Instance Group
Only zonal instance groups will allow the metric setting. The setting you are attempting to enter is correct, with metric.state=used, for a zonal instance group. However, that field must be left blank for a regional instance group.
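For reference, here is a hedged sketch of what the equivalent policy might look like when a zonal MIG's autoscaler is configured through the API/gcloud rather than the console (field names follow the Compute Engine Autoscaler API; the metric, filter and target values are illustrative assumptions):

    # Sketch only: zonal MIG autoscaling on the agent's "used" memory state.
    autoscalingPolicy:
      minNumReplicas: 2
      maxNumReplicas: 10
      customMetricUtilizations:
        - metric: agent.googleapis.com/memory/percent_used
          # Restrict the metric to the "used" state; this corresponds to the
          # console's "Additional filter expression" field (metric.state=used above).
          filter: metric.labels.state = "used"
          utilizationTarget: 80
          utilizationTargetType: GAUGE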
Regional Instance Group
As noted above, applying the filter to a regional instance group is not supported. Their documentation states that you should leave that field blank.
In the Additional filter expression section: For a zonal MIG, optionally enter a filter to use individual values from metrics with multiple streams or labels. For more information, see Filtering per-instance metrics. For a regional MIG, leave this section blank.
If you add an entry, you'll receive the message "Regional managed instance groups do not support autoscaling using per-group metrics." when attempting to save your changes.
On the other hand, if you leave the field empty, it will save. However, I found that leaving the field empty and setting almost any number in the Target Utilization field always caused my group to scale to the maximum size.
Summary
Google informed me that they do have a feature request for this. I communicated that it didn't make sense to even have the option to select percent_used if it's not supported. The response was that we should see the documentation updated in the future to clarify that point.
We have implemented kube-state-metrics (by following the steps mentioned in this article, section 4.4.1 Install monitoring components) on one of our Kubernetes clusters on GCP. This created three new deployments on our cluster: node-exporter, prometheus-k8s and kube-state-metrics. After that, we were able to see all metrics inside Metrics Explorer with the prefix "external/prometheus/".
To understand external metrics pricing, we referred to this link and calculated the expected price accordingly, but when we received the bill it was a shockingly high figure. GCP has charged a large amount, even though we haven't added a single metric to a dashboard or set up monitoring for anything. From the ingested volume (around 1.38 GB/day), it looks like these monitoring tools do some background work (reading metrics on a schedule or similar) that consumed this volume and produced this bill.
We would like to understand how these kube-state-metrics monitoring components work. Do they automatically collect metrics data and increase the ingested volume (and the bill) in this way, or is there a misconfiguration in the setup?
Any guidance on this would be really appreciated!
Thank you.
By default, when deployed, kube-state-metrics exposes several metrics for events across your cluster.
If you have a number of frequently updating resources on your cluster, you may find that a lot of data is ingested into these metrics, which incurs high costs.
You need to configure which metrics you'd like to expose, and consult the documentation for your Kubernetes environment, in order to avoid unexpectedly high costs.
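As a hedged illustration (flag names differ between kube-state-metrics releases; newer ones use --resources and --metric-allowlist, older ones --collectors), the Deployment can be told to expose only what you need, for example:

    # Sketch only: restrict kube-state-metrics to a few resources and metrics.
    containers:
      - name: kube-state-metrics
        image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0  # example tag
        args:
          - --resources=deployments,pods,nodes
          - --metric-allowlist=kube_pod_status_phase,kube_deployment_status_replicas_available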
I have a small number of HTTP servers on GCP VMs, with a mixture of different server languages and Linux-based OSes.
Questions
A. Is it possible to use the Stackdriver monitoring service to set alerts at specific percentiles for HTTP response latencies?
B. Can I do this without editing the code of each server process?
C. Will installing the agent on the VM report HTTP latencies?
For example, if the 95th percentile goes over 100ms for a certain time period I want to know.
I know I can do this for CPU utilisation and other hypervisor-provided stats using:
https://console.cloud.google.com/monitoring/alerting
Thanks.
Request latencies are captured by Cloud Load Balancers. As long as you are using a Cloud Load Balancer, you don't need to install the monitoring agent to create alerts based on 95th-percentile metrics.
The monitoring agent captures latencies for some preconfigured systems such as Riak, Cassandra and others. Here's the full list of systems and metrics the monitoring agent supports by default: https://cloud.google.com/monitoring/api/metrics_agent
But if you want anything custom, i.e. you want to measure request latencies from the VM itself, you would need to capture response times yourself and configure the logging agent to create a custom metric, which you can then use to create alerts. As long as you capture them as distribution metrics, you should be able to visualise different percentiles (25th, 50th, 75th, 80th, 90th, 95th, 99th, etc.) and create alerts based on those.
see: https://cloud.google.com/logging/docs/logs-based-metrics/distribution-metrics
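To make that concrete, here is a hedged sketch of such a distribution logs-based metric (field names follow the Cloud Logging LogMetric resource; the log field jsonPayload.latencyMs and the bucket layout are assumptions about how you log response times):

    # Sketch only: a DISTRIBUTION logs-based metric for HTTP response latency.
    name: http_response_latency
    description: HTTP response latency extracted from access logs
    filter: resource.type="gce_instance" AND jsonPayload.latencyMs>0
    valueExtractor: EXTRACT(jsonPayload.latencyMs)
    metricDescriptor:
      metricKind: DELTA
      valueType: DISTRIBUTION
      unit: ms
    bucketOptions:
      exponentialBuckets:
        numFiniteBuckets: 64
        growthFactor: 2
        scale: 0.01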
A. Is it possible to use the Stackdriver monitoring service to set alerts at specific percentiles for HTTP response latencies?
If you simply want to consider network traffic, yes, it is possible. If you are using a load balancer, it's also possible to set alerts on that.
What you want to do should be pretty straightforward from the interface; you can also find more info in the documentation.
If you want to use more advanced metrics on top of Tomcat/Apache2, etc., you should check the list of metrics provided by the Stackdriver monitoring agent here.
B. Can I do this without editing the code of each server process?
Yes, there is no need to update any program; Stackdriver Monitoring works transparently and can fetch basic metrics from GCP VMs, including network traffic and CPU utilization, without the monitoring agent.
C. Will installing the agent on the VM report HTTP latencies?
No, the agent shouldn't cause any HTTP latencies.
I installed Istio on Kubernetes (hosted on AWS) and exposed a service via the Istio ingress. I could only achieve very low throughput, < 2000 requests per minute. When I exposed the service as a standard ELB, I was able to achieve > 600,000 requests per sec. Are there any guides/steps to tune Istio for high performance?
Thanks
Joji
One thing we recommend users do is switch to the non-debug image for the Envoy sidecar, e.g. replace docker.io/istio/proxy_debug with docker.io/istio/proxy in the istio and istio-initializer YAML files and redeploy Istio. We are also working on reducing Mixer traces. Performance is an area we are working on very actively for the next release of Istio, and we welcome contributions!
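For clarity, the change amounts to something like this in the sidecar container spec (the tag is a placeholder; keep whatever version you already run):

    # Sketch only: use the non-debug Envoy sidecar image.
    containers:
      - name: istio-proxy
        # was: docker.io/istio/proxy_debug:<istio-version>
        image: docker.io/istio/proxy:<istio-version>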
I'm looking to set up auto scaling for a service running on AWS ECS. The ECS Auto Scaling docs suggest using SurgeQueueLength to determine whether to trigger an autoscale event. We use an Application Load Balancer, which does not have this metric, and looking through the table of metrics nothing seems equivalent. Am I missing something, or is this just a missing feature in ALBs at present?
Disclaimer: I don't have experience with Application Load Balancers; I'm just deriving these facts from the AWS docs. For a more hands-on read, you might look at the ALB section of this Medium post.
You are correct: in the CloudWatch metrics for Application Load Balancers there is no SurgeQueueLength. This is also confirmed in this thread by an AWS employee. However, these metrics could be used as CloudWatch metrics to trigger auto scaling:
TargetConnectionErrorCount: IMO this corresponds best to SurgeQueueLength, as it indicates that the load balancer tried to open a connection to a backend node and failed.
HTTPCode_ELB_5XX_Count: depending on the backend nodes, this might be an indicator that they refuse new connections because, e.g., their max connection count is reached.
RejectedConnectionCount: this is what the AWS employee suggested in the thread linked above. But the docs say "number of connections that were rejected because the load balancer had reached its maximum number of connections", which seems more like a limit on the AWS side that you cannot really influence (i.e. it is not described in the limits on ALBs).
RequestCountPerTarget: the average number of requests a backend node receives per minute. When you track that over a period of time, you might be able to derive a "healthy threshold" (see the sketch below).
TargetResponseTime: the number of seconds a backend node needs to answer a request. Another candidate for a "healthy threshold" (i.e. "what's the maximum response time you want the end user to experience?").
Overall it seems there is no clear, correct answer to your question; it depends on your situation.
The question that suggests itself is: why are there no queue metrics such as SurgeQueueLength? This is not answered anywhere in the docs. I guess it is either because ALBs are designed differently from ELBs, or because the metric is just not exposed yet.
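If you end up going with RequestCountPerTarget, one way to wire it up is target tracking via Application Auto Scaling. A hedged CloudFormation sketch (the scalable target, resource label and target value are placeholders you would adapt):

    # Sketch only: scale an ECS service on ALB requests per target.
    ServiceScalingPolicy:
      Type: AWS::ApplicationAutoScaling::ScalingPolicy
      Properties:
        PolicyName: ecs-requests-per-target
        PolicyType: TargetTrackingScaling
        ScalingTargetId: !Ref ServiceScalableTarget   # an AWS::ApplicationAutoScaling::ScalableTarget
        TargetTrackingScalingPolicyConfiguration:
          TargetValue: 1000   # requests per target per minute; tune for your service
          PredefinedMetricSpecification:
            PredefinedMetricType: ALBRequestCountPerTarget
            ResourceLabel: app/my-alb/1234567890abcdef/targetgroup/my-tg/0987654321fedcba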
ALBs are designed differently and don't have SurgeQueueLength or SpillOver metrics. Source: AWS Staff.