Why aren't all metrics collected by ArgoCD? - argocd

I set the nodePort value for the argocd-metrics and argocd-server-metrics services, but for some reason the metrics specified in the documentation are not exposed.
For example, the argocd_app_info and argocd_app_labels metrics are missing; only argocd_redis_request_duration_bucket and similar metrics are returned.

I don't know why, but all of the metrics only started being served once I exposed them through an ingress.
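One thing worth checking, going by the ArgoCD metrics documentation: the application-level metrics such as argocd_app_info and argocd_app_labels are served by the application controller endpoint (the argocd-metrics service, port 8082), while the argocd-server-metrics service (port 8083) mostly reports API-server and redis/gRPC request metrics, so scraping only one of the services could explain the missing series. A sketch of a Prometheus scrape config covering both endpoints (the argocd namespace is an assumption; ports follow the docs):

```yaml
# Sketch only: service names and ports follow the ArgoCD metrics docs,
# and the "argocd" namespace is an assumption.
scrape_configs:
  - job_name: argocd-application-controller   # serves argocd_app_info, argocd_app_labels, ...
    static_configs:
      - targets: ['argocd-metrics.argocd.svc:8082']
  - job_name: argocd-server                   # serves API-server / redis request metrics
    static_configs:
      - targets: ['argocd-server-metrics.argocd.svc:8083']
```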

Related

Remove Reported Metrics of Istio Sidecar

I use Istio 1.8 for the service mesh and Prometheus to collect metrics from the sidecars. Currently these metrics are being provided by the sidecars:
istio_request_bytes_bucket
istio_request_duration_milliseconds_bucket
istio_requests_total
envoy_cluster_upstream_cx_connect_ms_bucket
istio_request_messages_total
istio_response_messages_total
envoy_cluster_upstream_cx_length_ms_bucket
istio_response_bytes_bucket
istio_request_bytes_sum
istio_request_bytes_count
This number of metrics uses a lot of network bandwidth (we have around 5k pods).
All we need for now are istio_requests_total and istio_request_duration_milliseconds_bucket, and only from the inbound direction.
I know how to remove labels with an EnvoyFilter, but I was unable to find documentation on removing an entire metric.
For better visibility I'm posting my comment as a Community Wiki answer, as it is only an extension of what Peter Claes already mentioned in his answer.
According to the Istio docs:
The metrics section provides values for the metric dimensions as
expressions, and allows you to remove or override the existing metric
dimensions. You can modify the standard metric definitions using
tags_to_remove or by re-defining a dimension. These configuration
settings are also exposed as istioctl installation options, which
allow you to customize different metrics for gateways and sidecars as
well as for the inbound or outbound direction.
Here you can find the info on customizing Istio (1.8) metrics:
https://istio.io/v1.8/docs/tasks/observability/metrics/customize-metrics/
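The tags_to_remove mechanism from the quoted docs can be expressed as an IstioOperator overlay. The metric name and tag below are illustrative only, and note that this customizes the dimensions of a standard metric rather than dropping the metric entirely:

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    telemetry:
      v2:
        prometheus:
          configOverride:
            inboundSidecar:              # separate overrides exist for outboundSidecar / gateway
              metrics:
                - name: request_bytes    # corresponds to istio_request_bytes_*
                  tags_to_remove:
                    - destination_canonical_service   # illustrative tag
```

If the goal is to stop ingesting whole metric families rather than individual tags, a common workaround is to drop them at scrape time with Prometheus metric_relabel_configs instead.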

Need understanding of guest metrics in Google Cloud

I am collecting metrics from Monitoring in Google Cloud through the REST API. In the API documentation at https://cloud.google.com/monitoring/api/metrics_gcp I see a lot of metrics beginning with guest, like
guest/cpu/usage_time
guest/disk/bytes_used
guest/disk/io_time
I am seeing the same kind of metrics beginning with instance, like
instance/cpu/usage_time
instance/disk/max_read_bytes_count
I have searched the documentation, but I am not getting a clear idea of the difference between guest and instance metrics. Which metrics are preferred? Can anyone give a suggestion? Thanks
The guest/... metrics are used to monitor the system health of Container-Optimized OS (COS) instances, while the instance/... metrics target regular GCE VM instances rather than the COS instance type.
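For completeness, both families live under the compute.googleapis.com metric-type prefix, so one way to compare what each actually contains is to list the metric descriptors per prefix via the Monitoring REST API. A small sketch of building that request (the helper name is mine; authentication and the project id are up to the caller):

```python
# Sketch: listing metric descriptors under the guest/ or instance/ prefix
# through the Monitoring REST API. This helper only builds the request URL;
# supplying credentials and issuing the request are left to the caller.
import urllib.parse


def descriptor_list_url(project_id: str, prefix: str) -> str:
    """metricDescriptors.list URL filtered to one compute.googleapis.com prefix."""
    base = f"https://monitoring.googleapis.com/v3/projects/{project_id}/metricDescriptors"
    flt = f'metric.type = starts_with("compute.googleapis.com/{prefix}")'
    return f"{base}?filter={urllib.parse.quote(flt)}"


# e.g. descriptor_list_url("my-project", "guest/") lists guest/cpu/usage_time etc.
```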

Unable to understand GCP bill for Stackdriver Monitoring usage

We have implemented kube-state-metrics (by following the steps mentioned in this article, section 4.4.1 Install monitoring components) on one of our Kubernetes clusters on GCP. Basically it created three new deployments on our cluster: node-exporter, prometheus-k8s, and kube-state-metrics. After that, we were able to see all metrics inside Metrics Explorer with the prefix "external/prometheus/".
In order to check external metrics pricing, we referred to this link and calculated the price accordingly, but the bill we received was shockingly high. GCP charged a large amount even though we haven't added a single metric to a dashboard or set up monitoring for anything. Judging from the ingested volume (around 1.38 GB/day), these monitoring tools appear to do some background work (reading metrics at specific intervals) that consumed this volume and produced this bill.
We would like to understand how these kube-state-metrics monitoring components work. Will they automatically fetch metrics data and increase the ingested volume and the bill this way, or is there a misconfiguration in the setup?
Any guidance on this would be really appreciated!
Thank you.
By default, when deployed, kube-state-metrics exposes many metrics for resources across your cluster.
If you have a number of frequently-updating resources on your cluster, you may find that a lot of data is ingested into these metrics, which incurs high costs.
You need to configure which metrics you'd like to expose, as well as consult the documentation for your Kubernetes environment, in order to avoid unexpectedly high costs.
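As a sketch of that configuration, kube-state-metrics accepts flags that restrict both which resource types it watches and which metric names it exposes. The flag names below are from kube-state-metrics v2.x (older releases used --collectors and --metric-whitelist instead), and the chosen resources and metric names are only examples:

```yaml
# Deployment container spec fragment; values are illustrative.
containers:
  - name: kube-state-metrics
    args:
      - --resources=deployments,pods,nodes
      - --metric-allowlist=kube_pod_status_phase,kube_deployment_status_replicas_available
```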

How to extract an instance uptime based on incidents?

On Stackdriver, creating an Uptime Check gives you access to the Uptime Dashboard, which shows the uptime percentage of your service.
My problem is that uptime checks are restricted to HTTP/TCP checks. I have other services running, and those services report their health in different ways (for example, by a specific process running). I already have incident policies set up for these services, so I get notified if a service is not running.
Now I want to be able to look back and know how long the service was down over the last hour. Is there a way to do that?
There's no way to programmatically retrieve alerts at the moment, unfortunately. Many resource types expose uptime as a metric, though (e.g., instance/uptime on GCE instances) - could you pull those and do the math on them? Without knowing what resource types you're using, it's hard to give specific suggestions.
Aaron Sher, Stackdriver engineer
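The suggested math can be sketched as follows: compute.googleapis.com/instance/uptime is a DELTA metric reporting the seconds a VM was up during each sampling interval, so downtime over a window is the window length minus the sum of the deltas. The helper name and the sampling assumptions are mine:

```python
def downtime_seconds(uptime_deltas, window_seconds=3600.0):
    """Estimate downtime in a window from instance/uptime DELTA samples.

    Each sample is the number of seconds the VM was up during its
    sampling interval; the downtime is whatever the deltas don't cover.
    """
    return max(0.0, window_seconds - sum(uptime_deltas))


# Fifty full one-minute samples in a one-hour window leave ten minutes
# unaccounted for, i.e. 600 seconds of downtime.
print(downtime_seconds([60.0] * 50))
```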

SurgeQueueLength equivalent for Application Load Balancers

I'm looking to set up auto scaling for a service running on AWS ECS. The ECS Auto Scaling docs suggest using SurgeQueueLength to determine whether to trigger an autoscale event. We use an Application Load Balancer, which does not have this metric, and looking through the table of metrics nothing seems equivalent. Am I missing something, or is this just a missing feature in ALBs at present?
Disclaimer: I don't have experience with Application Load Balancers; I'm just deriving these facts from the AWS docs. For a more hands-on read, you might look at the ALB section of this Medium post.
You are correct: among the CloudWatch metrics for Application Load Balancers there is no SurgeQueueLength. This is also confirmed in this thread by an AWS employee. However, these metrics could be used as CloudWatch metrics to trigger auto scaling:
TargetConnectionErrorCount: IMO this corresponds best to SurgeQueueLength, as it indicates that the load balancer tried to open a connection to a backend node and failed
HTTPCode_ELB_5XX_Count: depending on the backend nodes, this might be an indicator that they refuse new connections because e.g. their maximum connection count is reached
RejectedConnectionCount: this is what the AWS employee suggested in the thread linked above. But the docs say it is the "number of connections that were rejected because the load balancer had reached its maximum number of connections", which seems more like a limit on the AWS side that you cannot really influence (i.e. it is not described in the limits on ALBs)
RequestCountPerTarget: the average number of requests a backend node receives per minute. If you track that over a period of time, you might be able to derive a "healthy threshold"
TargetResponseTime: the number of seconds a backend node needs to answer a request. Another candidate for deriving a "healthy threshold" (i.e. "what's the maximum response time you want the end user to experience?")
Overall it seems there is no clear correct answer to your question; it depends on your situation.
The question which suggests itself is: why are there no queue metrics such as SurgeQueueLength? This is not answered anywhere in the docs. I guess it is either because ALBs are designed differently than ELBs, or because the metric is just not exposed yet.
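If RequestCountPerTarget fits your workload, note that ECS services can scale on it directly through Application Auto Scaling target tracking, where it is exposed as the predefined metric ALBRequestCountPerTarget. A sketch of the policy configuration (the target value, cooldowns, and the load balancer and target group identifiers in ResourceLabel are placeholders):

```json
{
  "TargetValue": 1000.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ALBRequestCountPerTarget",
    "ResourceLabel": "app/my-alb/0123456789abcdef/targetgroup/my-targets/fedcba9876543210"
  },
  "ScaleInCooldown": 300,
  "ScaleOutCooldown": 60
}
```

This JSON is passed to aws application-autoscaling put-scaling-policy as the --target-tracking-scaling-policy-configuration payload.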
ALBs are designed differently and don't have SurgeQueueLength or SpillOver metrics. Source: AWS Staff.