Istio: tracking a network request and finding the point of failure

Using Istio 1.2.10-gke.3 on GKE:
curl -v -HHost:user.domain.com --resolve user.domain.com:443:$gatewayIP https://user.domain.com/auth -v -k
returns a 503 after TLS verification:
< date: Tue, 19 May 2020 20:50:29 GMT
< server: istio-envoy
Now I want to trace the request, identify the first point of failure by examining the logs of the components involved, and resolve the issue.
The logs of the istio-ingressgateway pod show nothing. After getting a shell on the pod, I ran top and saw an envoy process running; however, I don't see any Envoy logs in /var/log/.
What am I missing? Am I looking in the wrong place? Or do I need to read the framework's code to be able to use it?
I need to find out which link in the request-processing chain broke first, and why, so that it can be fixed.

Here are some useful links to the Istio documentation for debugging a 503 error:
Istio documentation for Envoy access logs
Istio documentation for Connectivity troubleshooting
istioctl is a useful Envoy debugging tool:
$ istioctl proxy-status
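To actually see what Envoy is doing in the ingress gateway, note that Envoy writes to the container's stdout rather than /var/log, and on Istio 1.2 the per-request access log is only emitted if it has been enabled for the mesh. A rough sketch of where to look (pod names are placeholders, and whether you can change the access-log setting on the managed GKE add-on may be restricted):

# Envoy's output goes to the ingress gateway container's stdout,
# not to /var/log, so read it with kubectl:
kubectl -n istio-system logs -l istio=ingressgateway --tail=200

# On Istio 1.2 the per-request access log is only written if the
# mesh-wide value global.proxy.accessLogFile is set (e.g. /dev/stdout);
# with it unset you will only see Envoy's own runtime messages.

# Check whether the gateway has a route and a healthy cluster for
# user.domain.com; a missing route or an unreachable upstream is a
# common cause of an immediate 503:
istioctl proxy-config routes <ingressgateway-pod>.istio-system
istioctl proxy-config clusters <ingressgateway-pod>.istio-system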
There is also one rarer case where a 503 can show up: when the Envoy sidecar proxy has issues or was not properly injected into the deployment's pod, or when there are mTLS misconfigurations.
Hope it helps.

Related

Airflow web-server produces temporary 502 errors in Cloud Composer

I'm encountering 502 errors on the Airflow (2.0.2) UI hosted in Cloud Composer (1.17.0).
Error: Server Error The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
They last for a few minutes and happen several times a day; after they're gone, everything works fine.
At the moment of errors:
there is a gap in the logs, after which the logs resume with messages about starting gunicorn:
[1133] [INFO] Starting gunicorn 19.10.0
there is a spike in the web server's resource usage
I didn't spot any other suspicious activity in other parts of the system (workers, scheduler, DB).
I think this is the result of an OOM error, because we have DAGs with a large number of tasks (2k).
But I'd like to be sure, and I haven't found a way to connect to the App Engine VM in the tenant project (where the Airflow web server is hosted by default) to get additional logs.
Does anyone know a way to get additional logs from the Airflow web server VMs, or have any other ideas?
The Cloud Composer documentation has a Troubleshooting DAGs section. It shows how to check individual workers' logs, and it even mentions OOM issues (direct link).
The troubleshooting section is generally well documented, so you should be able to find a lot of useful information there. You can also use Cloud Monitoring and Cloud Logging to monitor Composer, but I am not sure how valuable that will be in this use case (reference).
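If you want to pull the web server logs from Cloud Logging rather than trying to reach the tenant-project VM, something along these lines may help. This is only a sketch: the log id (airflow-webserver) and exact labels are assumptions and can differ between Composer versions.

# Recent Airflow web server logs for the environment:
gcloud logging read \
  'resource.type="cloud_composer_environment"
   resource.labels.environment_name="YOUR_ENVIRONMENT"
   logName:"airflow-webserver"' \
  --project=YOUR_PROJECT --limit=100 --freshness=1d

# Look for gunicorn restarts or OOM hints around the 502 window:
gcloud logging read \
  'resource.type="cloud_composer_environment"
   resource.labels.environment_name="YOUR_ENVIRONMENT"
   textPayload:("Starting gunicorn" OR "MemoryError" OR "Killed")' \
  --project=YOUR_PROJECT --limit=50 --freshness=1d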

Making GCP Load-balancer HTTP logs integrate with Cloud Trace (in GCP Logs Explorer)?

Cloud Trace and Cloud Logging integrate quite nicely in most cases, described in https://cloud.google.com/trace/docs/trace-log-integration
Unfortunately, this doesn't seem to include the HTTP request logs generated by a Load Balancer when request logging is enabled.
The LB logs show the traces icon, and are correctly associated with an overall trace in the Cloud Trace system, but the context menu 'show trace details' is greyed out for those log items.
A similar problem arose with my application level logging/tracing, and was solved by setting the traceSampled attribute on the LogEntry, but this can't work for LB logs, since I'm not in control of their generation.
In this instance I'm tracing 100% of requests since the service is M2M and fairly low volume, but in the general case it makes sense that the LB can't know if something is actually generating traces without being told.
I can't find any good references in the docs, but in theory a response header indicating it was sampled could be observed by the LB and cause it to issue the appropriate log.
Any ideas if such a feature exists, in this form or any other?
(A last-ditch workaround might be to use the Logs Router to feed LB logs into a Pub/Sub queue (and exclude them from the normal logging sinks), then resubmit them to the normal sink(s) with the fields appropriately set by a Cloud Function or other Pub/Sub consumer, but that seems like a lot of work and complexity for this purpose.)
There is currently a Feature Request created for this; you can follow its status at the following link 1.
As a workaround, you could work with the target proxies of your load balancer. According to the documentation for a global external HTTP(S) load balancer:
The proxies set HTTP request/response headers as follows:
Via: 1.1 google (requests and responses)
X-Forwarded-Proto: [http | https] (requests only)
X-Cloud-Trace-Context: <trace-id>/<span-id>;<trace-options> (requests only) Contains parameters for Cloud Trace.
X-Forwarded-For: [<supplied-value>,]<client-ip>,<load-balancer-ip> (see X-Forwarded-For header) (requests only)
You can find the complete documentation about external HTTP(S) load balancers and target proxies here 2.
And finally, take a look at the documentation on how to use and configure target proxies here 3.
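If you do end up falling back to the Logs Router idea from the question, a rough sketch of the plumbing could look like the following (topic and sink names are placeholders, and the Pub/Sub consumer that rewrites and resubmits the entries is left out):

# Route LB request logs to Pub/Sub via a Logs Router sink:
gcloud pubsub topics create lb-request-logs

gcloud logging sinks create lb-logs-to-pubsub \
  pubsub.googleapis.com/projects/YOUR_PROJECT/topics/lb-request-logs \
  --log-filter='resource.type="http_load_balancer"'

# Grant the sink's writer identity permission to publish to the topic:
gcloud pubsub topics add-iam-policy-binding lb-request-logs \
  --member="$(gcloud logging sinks describe lb-logs-to-pubsub --format='value(writerIdentity)')" \
  --role="roles/pubsub.publisher"

You would still need an exclusion filter on the normal sinks and a small consumer that rewrites and resubmits the entries, so the feature request is clearly the nicer path.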

APIM 2.6.0 Micro Gateway - Class Cast Exception

I am getting a ClassCastException when trying to set up the Microgateway in APIM 2.6.0. Please advise.
Command executed: ./bin/micro-gw setup hello-world -a HelloWorld -v v1
[2021-01-06 15:19:21,126] DEBUG {org.wso2.apimgt.gateway.cli.rest.RESTAPIServiceImpl} - Retrieving API with name HelloWorld, version v1 was successful.
[2021-01-06 15:19:21,357] ERROR {org.wso2.apimgt.gateway.cli.cmd.Main} - Internal error occurred while executing command.
java.lang.ClassCastException: org.wso2.apimgt.gateway.cli.model.rest.policy.BandwidthLimitDTO cannot be cast to org.wso2.apimgt.gateway.cli.model.rest.policy.RequestCountLimitDTO
at org.wso2.apimgt.gateway.cli.model.template.policy.ThrottlePolicy.buildContext(ThrottlePolicy.java:138)
at org.wso2.apimgt.gateway.cli.codegen.ThrottlePolicyGenerator.generateSubscriptionPolicies(ThrottlePolicyGenerator.java:97)
at org.wso2.apimgt.gateway.cli.codegen.ThrottlePolicyGenerator.generate(ThrottlePolicyGenerator.java:59)
at org.wso2.apimgt.gateway.cli.cmd.SetupCmd.execute(SetupCmd.java:298)
at java.util.Optional.ifPresent(Optional.java:159)
In this case, I had selected an API whose subscription tier used bandwidth-based throttling. I removed that tier and re-published the API, which resolved the issue. I do not need bandwidth-based throttling right now; if someone does need it, this issue will still be there. Thanks

Random “upstream connect error or disconnect/reset before headers” between services with Istio 1.3

So, this problem is happening randomly (it seems) and between different services.
For example, we have a service A that needs to talk to service B, and sometimes we get this error, but after a while it goes away. This error doesn't happen too often.
When it happens, we see the error log in service A with the "upstream connect error" message, but nothing in service B. So we think it might be related to the sidecars.
One thing we notice is that in service B, we get a lot of this error messages in the istio-proxy container:
[src/istio/mixerclient/report_batch.cc:109] Mixer Report failed with: UNAVAILABLE:upstream connect error or disconnect/reset before headers. reset reason: connection failure
According to the documentation, when a request comes in, Envoy asks Mixer whether everything is good (authorization and other things), and if Mixer doesn't reply, the request does not succeed. That's why there is an option called policyCheckFailOpen.
We have that set to false, which I guess is a sane default; we don't want the request to go through if Mixer cannot be reached, but why can't it be reached?
disablePolicyChecks: true
policyCheckFailOpen: false
controlPlaneSecurityEnabled: false
NOTE: istio-policy is running with the istio-proxy sidecar. Is that correct?
We don’t see that error in some other service which can also fail.
Another log that I see a lot, and this one appears in all the services not running as root that have fsGroup defined in their YAML files, is:
watchFileEvents: "/etc/certs": MODIFY|ATTRIB
watchFileEvents: "/etc/certs/..2020_02_10_09_41_46.891624651": MODIFY|ATTRIB
watchFileEvents: notifying
One of the leads I'm chasing is the default circuitBreakers values. Could that be related to this?
Thanks
The error you are seeing is caused by a failure to establish a connection to istio-policy.
Based on this GitHub issue, community members added two answers there that could help with your issue:
If mTLS is enabled globally, make sure you set controlPlaneSecurityEnabled: true.
I was facing the same issue; then I read about protocol selection and realised that the port name in the service definition should start with the protocol, for example http-. This fixed the issue for me. If you still face the issue, you might also need to look at the tls-check for the pods and resolve it using DestinationRules and Policies.
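For the tls-check and port-naming suggestions above, the commands could look roughly like this on Istio 1.3 (pod, service, and namespace names are placeholders):

# Does the client sidecar's mTLS setting match the server's?
# (istioctl authn tls-check was the 1.x command for this)
istioctl authn tls-check <service-a-pod>.<namespace> service-b.<namespace>.svc.cluster.local

# Do the Service port names follow the protocol-selection
# convention (http-, grpc-, tcp-, ...)?
kubectl -n <namespace> get svc service-b -o jsonpath='{range .spec.ports[*]}{.name}{"\n"}{end}'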
istio-policy is running with the istio-proxy sidecar. Is that correct?
Yes, I just checked, and it is running with the sidecar.
Let me know if that helps.

How to troubleshoot long pod kill time for GKE?

When using helm upgrade --install I'm every so often running into timeouts. The error I get is:
UPGRADE FAILED
Error: timed out waiting for the condition
ROLLING BACK
If I look in the GKE cluster logs on GCP, I see that when this happens it's because this step takes an unusually long time to execute:
Killing container with id docker://{container-name}:Need to kill Pod
I've seen it range from a few seconds to 9 minutes. If I go into the log message's metadata to find the specific container and look at its logs, there is nothing in them suggesting a difference between it and a quickly killed container.
Any suggestions on how to keep troubleshooting this?
You could refer to this troubleshooting guide for general issues with Google Kubernetes Engine.
As mentioned there, you may need to use the 'Troubleshooting Application' guide to further debug the application pods or their controller objects.
I am assuming that you have already checked the logs (1) of the container in the respective pod, or described (2) the pod (look at the termination reason) using the commands below. If not, you can try these as well to get more valuable information.
1. kubectl logs POD_NAME -c CONTAINER_NAME -p
2. kubectl describe pods POD_NAME
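Beyond those, a few more things may be worth checking; this is just a sketch with placeholder namespace and pod names:

# Cluster events often show why termination is slow
# (failing preStop hooks, volume detach problems, etc.):
kubectl get events -n NAMESPACE --sort-by=.lastTimestamp

# How long the kubelet waits before force-killing the pod; a long
# grace period plus a process that ignores SIGTERM looks exactly like
# a pod that takes minutes to die:
kubectl get pod POD_NAME -n NAMESPACE -o jsonpath='{.spec.terminationGracePeriodSeconds}'

# Termination details of the previous container (exit code, reason):
kubectl get pod POD_NAME -n NAMESPACE -o jsonpath='{.status.containerStatuses[*].lastState.terminated}'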
Note: I saw a similar discussion thread about helm upgrade failures reported on github.com. You can have a look there as well.