Prometheus Error - context deadline exceeded - wmi

I have multiple targets for my prometheus server but only for one I am getting the error
context deadline exceeded
Even I am able to browse the metrics URL from wmi_exporter and it just results in less a second, I tried increasing the scrape interval for this specific target but no luck.
I cannot understand if I am able to browser the wmi_exporter URL from the same machine where prometheus is running, why prometheus is still showing that error.
Please help

The solution is very easy, you only need to edit the YAML file of daemon set prometheus-node-exporter to delete hostNetwork: true. If you are using Helm chart prometheus-operator to install Prometheus Operator, you can override the hostNetwork setting using a values file with -f option.
prometheus-node-exporter:
hostNetwork: false

Related

Istio External Authorization Error with Istio Operator

We have deployed Istio 1.11.0 using helm-chart in our dev and production environment.
We are using below configuration in istio configmap, which we have updated via istio-control helm-chart.
meshConfig:
extensionProviders:
- name: "ext-authz-grpc"
envoyExtAuthzGrpc:
service: "ext-auth-service.default.svc.cluster.local"
port: "50051"
includeHeadersInCheck: [ "authorization", "ws-protocol" ]
headersToUpstreamOnAllow: [ "authorization", "x-role", "x-id" ]
accessLogFile: /dev/stdout
enablePrometheusMerge: true
Basically we are using grpc service for external authorization server.
Above configuration is working fine.
One of our client has deployed Istio 1.9.8 using operator. (They have their own deployment model for Istio. Not allowing us to deploy istio using helm-chart)
When we try to apply above changes using operator it gives us below error :
2022-04-05T10:23:09.657830Z info installer Loading values from compiled in VFS at path profiles/minimal.yaml
2022-04-05T10:23:09.657837Z info installer Loading values from compiled in VFS at path profiles/default.yaml
2022-04-05T10:23:09.679340Z error installer failed to merge base profile with user IstioOperator CR profile-poc-customized, failed to unmarshall mesh config: unknown field "includeHeadersInCheck" in v1alpha1.MeshConfig_ExtensionProvider_EnvoyExternalAuthorizationGrpcProvider moreInfo=The values in the selected spec.profile could not be merged with the user IstioOperator resource. impact=The operator controller cannot create and act upon the user defined IstioOperator resource. The Istio control plane will not be installed or updated. action=Check that the IstioOperator resource has the correct syntax. If you are sure your configuration is correct, see https://istio.io/latest/about/bugs for possible solutions. likelyCause=The likely cause is an incorrect or badly formatted configuration.Another possible cause could be an issue with the Istio code.
If we directly edit the configmap and make changes then it is able to apply those changes.
But its giving error when we are updating it from operator.
Can anybody help me to understand why its not working with operator?
includeHeadersInCheck is only available for http and not grpc:
https://istio.io/v1.10/docs/reference/config/istio.mesh.v1alpha1/#MeshConfig-ExtensionProvider-EnvoyExternalAuthorizationGrpcProvider

Google Cloud Run not scaling up despite large backlog and available instances

I am seeing something similar to this post. It looked like additional detail was needed to answer that question, so I'm re-asking with my details since those details weren't provided.
I am running a modified version of the Google Cloud Run image processing tutorial example.
I am inserting tasks into a task queue using this create tasks snippet. The tasks from the queue get pushed to my cloud run instance.
The problem is it isn't scaling up and making it through my tasks in a timely manner.
My cloud run service configuration:
I have tried setting a minimum of both 0 and 50 instances
I have tried a maximum of 100 and 1000 instances
I have tried --concurrency=1 and 2, and 8
I have tried with --async and without --async
With 50 instances pre-allocated even with concurrency set to 1, I am typically seeing ~10 active container instances and ~40 idle container instances. I have ~30,000 tasks in the queue and it is getting through ~5 jobs/minute.
My tasks queue has the default settings. My containers aren't using a lot of cpu, but they are using a lot of memory.
A process takes about a minute to complete. I'm only running one process per container instance. What additional parameters should be set to get higher throughput?
Edit - adding additional logs
I enabled the logs for the queue, I'm seeing some errors for some of the jobs. The errors look like this:
{
insertId: "<my_id>"
jsonPayload: {
#type: "type.googleapis.com/google.cloud.tasks.logging.v1.TaskActivityLog"
attemptResponseLog: {
attemptDuration: "19.453155s"
dispatchCount: "1"
maxAttempts: 0
responseCount: "0"
retryTime: "2021-10-20T22:45:51.559121Z"
scheduleTime: "2021-10-20T16:42:20.848145Z"
status: "UNAVAILABLE"
targetAddress: "POST <my_url>"
targetType: "HTTP"
}
task: "<my_task>"
}
logName: "<my_log_name>"
receiveTimestamp: "2021-10-20T22:45:52.418715942Z"
resource: {
labels: {
location: "us-central1"
project_id: "<my_project>"
queue_id: "<my-queue>"
target_type: "HTTP"
}
type: "cloud_tasks_queue"
}
severity: "ERROR"
timestamp: "2021-10-20T22:45:51.459232147Z"
}
I don't see errors in the cloud run logs.
Edit - Additional Debug Information
I tried to take the queue out of the equation to determine if it is cloud run or the queue. Instead I directly used curl to post to the url. Some of the tasks ran successfully, for others I received an error. In the below logs empty lines are successful:
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
This makes me think cloud run isn't handling all of the incoming requests.
Edit - task completion time test
I wanted to test if the time it takes to complete a task causes any issues with CloudRun and the Queue scaling up and keeping up with the tasks.
In place of the task I actually want completed I put a dummy task that just sleeps for n seconds and prints the task details to stdout (which I can read in the cloud run logs).
With n set to 0, 5, 10 seconds I see the number of instances scale up and it keeps up with the tasks being added to the queue. With n set to 20 seconds or more I see that less CloudRun instances are instantiated and items accumulate in the task queue. I see more errors with the Unavailable status in my logs.
According to this post:
Cloud Run offers a longer request timeout duration of up to 60 minutes
So it seems that long running tasks are expected. Is this a Google bug or am I missing setting some parameter?
I do not think this is a Cloud Run Service problem. I think this is an issue with how you have Tasks setup.
The dates in the log entry look odd. Take a look at the receiveTimestamp and the scheduleTime. The task is scheduled for six hours before the receive time. Do you have a timezone problem?
According to the documentation, if the response_time is not set then the task was not attempted. It looks like you are scheduling tasks incorrectly and the tasks never run.
Search for the text The status of a task attempt. in this link:
Types for Google Cloud Tasks

eks fluent-bit to elasticsearch timeout

So I had a working configuration with fluent-bit on eks and elasticsearch on AWS that was pointing on the AWS elasticsearch service but for cost saving purpose, we deleted that elasticsearch and created an instance with a solo elasticsearch, enough for dev purpose. And the aws service doesn't manage well with only one instance.
The issue is that during this migration the fluent-bit seems to have broken, and I get lots of "[warn] failed to flush chunk" and some "[error] [upstream] connection #55 to ES-SERVER:9200 timed out after 10 seconds".
My current configuration:
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_kube.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Ignore_Older 1m
I think the issue is in one of those configuration, if I comment the kubernetes filter I don't have the errors anymore but I'm loosing the fields in the indices...
I tried tweeking some parameters in fluent-bit to no avail, if anyone has a suggestion?
So, the previous logs did not indicate anything, but I finaly found something when activating trace_error in the elasticsearch output:
{"index":{"_index":"fluent-bit-2021.04.16","_type":"_doc","_id":"Xkxy 23gBidvuDr8mzw8W","status":400,"error":{"type":"mapper_parsing_exception","reas on":"object mapping for [kubernetes.labels.app] tried to parse field [app] as o bject, but found a concrete value"}}
Did someone get that error before and knows how to solve it?
So, after looking into the logs and finding the mapping issue I ssem to have resolved the issue. The logs are now corretly parsed and send to the elasticsearch.
To resolve it I had to augment the limit of output retry and add the Replace_Dots option.
[OUTPUT]
Name es
Match *
Host ELASTICSERVER
Port 9200
Index <fluent-bit-{now/d}>
Retry_Limit 20
Replace_Dots On
It seems that at the beginning I had issues with the content being sent, because of that the error seemed to have continued after the changed until a new index was created making me think that the error was still not resolved.

Random “upstream connect error or disconnect/reset before headers” between services with Istio 1.3

So, this problem is happening randomly (it seems) and between different services.
For example we have a service A which needs to talk to service B, and some times we get this error, but after a while, the error goes away. And this error doesn't happen too often.
When this happens, we see the error log in service A throwing the “upstream connect error” message, but none in service B. So we think it might be related with the sidecars.
One thing we notice is that in service B, we get a lot of this error messages in the istio-proxy container:
[src/istio/mixerclient/report_batch.cc:109] Mixer Report failed with: UNAVAILABLE:upstream connect error or disconnect/reset before headers. reset reason: connection failure
And according to documentation when a request comes in, envoy asks Mixer if everything is good (authorization and other things), and if Mixer doesn’t reply, the request is not success. So that’s why exists an option called policyCheckFailOpen.
We have that in false, I guess is a sane default, we don’t want the request to go through if Mixer cannot be reached, but why can’t?
disablePolicyChecks: true
policyCheckFailOpen: false
controlPlaneSecurityEnabled: false
NOTE: istio-policy is running with the istio-proxy sidecar. Is that correct?
We don’t see that error in some other service which can also fail.
Another log that I can see a lot, and this one happens in all the services not running as root with fsGroup defined in the YAML files is:
watchFileEvents: "/etc/certs": MODIFY|ATTRIB
watchFileEvents: "/etc/certs/..2020_02_10_09_41_46.891624651": MODIFY|ATTRIB
watchFileEvents: notifying
One of the leads I'm chasing is about default circuitBreakers values. Could that be related with this?
Thanks
The error you are seeing is because of a failure to establish a connection to istio-policy
Based on this github issue
Community members add two answers here which could help you with your issue
If mTLS is enabled globally make sure you set controlPlaneSecurityEnabled: true
I was facing the same issue, then I read about protocol selection. I realised the name of the port in the service definition should start with for example http-. This fixed the issue for me. And . if you face the issue still you might need to look at the tls-check for the pods and resolve it using destinationrules and policies.
istio-policy is running with the istio-proxy sidecar. Is that correct?
Yes, I just checked it and it's with sidecar.
Let me know if that help.

How to troubleshoot long pod kill time for GKE?

When using helm upgrade --install I'm every so often running into timeouts. The error I get is:
UPGRADE FAILED
Error: timed out waiting for the condition
ROLLING BACK
If I look in the GKE cluster logs on GCP, I see that when this happens its because this step takes an unusually long time to execute:
Killing container with id docker://{container-name}:Need to kill Pod
I've seen it range from a few seconds to 9 minutes. If I go into the log message's metadata to find the specific container and look at its logs, there is nothing in them suggesting a difference between it and a quickly killed container.
Any suggestions on how to keep troubleshooting this?
You could refer this troubleshooting guide for general issues connected with Google Kubernetes Engine.
As mentioned there, you may need to use the 'Troubleshooting Application' guide for further debugging the application pods or its controller objects.
I am assuming that you checked the logs(1) of the container that resides in the respective pod OR described(2)( look at the reason for termination) it using the below commands. If not, you can try these as well to get more valuable information.
1. kubectl logs POD_NAME -c CONTAINER_NAME -p
2. kubectl describe pods POD_NAME
Note: I saw a similar discussion thread reported in github.com about helm upgrade failure. You can have a look over there as well.