Webhook call failed: URL_REJECTED error in DialogFlow v2 Fulfillments - google-cloud-platform

Error description
Upon calling DialogFlow v2 detectIntent API, we randomly get an internal error with status code 13:
Webhook call failed. Fetch failure with no HTTP status code. Status: State: URL_REJECTED Reason: 67
This error seems to happen randomly. The same request can succeed or fail.
Interestingly, the service has been deteriorating since Friday 23rd August 2019, to the point of failing on almost every call today.
Our investigation
We didn't find anything at all about URL_REJECTED in connection with DialogFlow or Google on the internet.
But we found the meaning of the status code 13 on this page:
Internal errors. This means that some invariants expected by the underlying system have been broken. This error code is reserved for serious errors.
We also checked that we aren't banning Google IPs, and that our load balancing is not misconfigured (we thought of that since it would make sense with random failures).
The webhook is up and running, and we can call it ourselves. The problem seems to happen inside Google's infrastructure, as error code 13 suggests.

(I'm answering immediately because we fixed it before posting the question, but I posted it nevertheless because it may be useful for others.)
The problem was that the webhook was being called over HTTP.
Switching to HTTPS solved the problem.
It seems that Google activated a policy on their servers that rejects insecure (HTTP) webhook calls.
It may have been deployed gradually on their cluster, which would explain the gradual degradation.
We know we should have migrated to HTTPS a long time ago, but we still couldn't find any mention of this policy being applied anywhere online.
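For anyone making the same switch, here is a minimal sketch (assuming Python and the requests library; the URL is a placeholder, not our real endpoint) to check that a fulfillment endpoint answers over HTTPS with a valid certificate before updating it in the Dialogflow console:

# Hypothetical check: confirm the webhook answers over HTTPS with a valid certificate.
import requests

WEBHOOK_URL = "https://example.com/dialogflow/webhook"  # placeholder URL

try:
    # verify=True is the default, so requests validates the TLS certificate,
    # which is what an HTTPS-only webhook policy would require.
    resp = requests.post(WEBHOOK_URL, json={"queryResult": {}}, timeout=5)
    print("HTTPS webhook reachable, status:", resp.status_code)
except requests.exceptions.SSLError as exc:
    print("TLS problem; the webhook URL would likely be rejected:", exc)
except requests.exceptions.RequestException as exc:
    print("Webhook unreachable:", exc)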

Thank you for posting this. I came across the same issue; changing my webhook to HTTPS seems to have fixed the problem.

Related

Mysterious 500 error with AWS Lambda; unable to debug

I have an API that I host using Lambda (Node.js) behind API Gateway. I'm using Serverless to deploy.
Generally things have been fine, but while I was working on a specific function today, I started to receive HTTP 500 errors when hitting the endpoint. However, while there were still API Gateway access logs for the endpoint, there were no CloudWatch logs for the Lambda functions being hit. I was able to verify that the authorizer was being hit successfully and was not returning any issue (if it were, the response would have been a 401). After using CLI tools to invoke the function from the command line, the 500 errors went away and I was able to successfully hit the endpoints again.
Has anyone ever run into this before? If I'm missing a debugging step, I would really like to know. It was really concerning that my API could be generating 500 errors with no paper trail to help me understand what was happening.
You can check your role and permissions; this link could help you: https://aws.amazon.com/premiumsupport/knowledge-center/api-gateway-lambda-stage-variable-500/
You can also debug further with X-Ray: https://docs.aws.amazon.com/lambda/latest/dg/services-xray.html
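As a rough illustration of what X-Ray instrumentation looks like (shown in Python with the aws-xray-sdk package, whereas the question uses the Node.js runtime, so treat this as a sketch only; active tracing also has to be enabled on the function itself):

# Sketch: instrument a Lambda handler so downstream AWS SDK calls appear as X-Ray subsegments.
from aws_xray_sdk.core import patch_all, xray_recorder

patch_all()  # patches boto3/botocore and other supported libraries for tracing

def handler(event, context):
    # custom subsegment around the application logic, visible in the X-Ray console
    with xray_recorder.in_subsegment("business-logic"):
        return {"statusCode": 200, "body": "ok"}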

GCP Cloud functions health check failing

I have been trying to deploy a function to GCP for 2 days and receive the following error each time.
OperationError: code=13, message=Function deployment failed due to a health check failure. This usually indicates that your code was built successfully but failed during test execution. Examine the logs to determine the cause. Try deploying again in a few minutes if it appears to be transient.
The Log Viewer doesn't give a proper explanation of the problem; it just emits the following log entry continuously until the deployment fails.
"Error: function terminated. Recommended action: inspect logs for termination reason. The function cannot be initialized."
Now, the interesting fact is that the same code was working a couple of weeks ago.
This issue really makes me wonder whether it is a bug on GCP's side after the recent Cloud Functions upgrade.
I've been having the same issue. In my case it turned out to be a dependency problem related to the slackclient library, but it seems to originate in the yarl library, and the recommended fix is to pin it as yarl==1.4.2.
Reference:
https://github.com/slackapi/python-slackclient/issues/764
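In practice the pin is just an extra line in the function's requirements.txt (the version comes from the linked issue; adjust it if the upstream fix changes):

# requirements.txt (sketch): keep your existing dependencies and add the explicit pin
yarl==1.4.2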

Random “upstream connect error or disconnect/reset before headers” between services with Istio 1.3

So, this problem happens randomly (it seems) and between different services.
For example, we have a service A which needs to talk to service B, and sometimes we get this error, but after a while the error goes away. This error doesn't happen too often.
When this happens, we see the "upstream connect error" message in service A's logs, but nothing in service B's, so we think it might be related to the sidecars.
One thing we noticed is that in service B we get a lot of these error messages in the istio-proxy container:
[src/istio/mixerclient/report_batch.cc:109] Mixer Report failed with: UNAVAILABLE:upstream connect error or disconnect/reset before headers. reset reason: connection failure
According to the documentation, when a request comes in, Envoy asks Mixer whether everything is good (authorization and other things), and if Mixer doesn't reply, the request does not succeed. That's why there is an option called policyCheckFailOpen.
We have that set to false, which I guess is a sane default; we don't want the request to go through if Mixer cannot be reached, but why can't it be reached?
disablePolicyChecks: true
policyCheckFailOpen: false
controlPlaneSecurityEnabled: false
NOTE: istio-policy is running with the istio-proxy sidecar. Is that correct?
We don't see that error in some other services that can also fail.
Another log entry that I see a lot, and this one appears in all the services that are not running as root and have fsGroup defined in their YAML files, is:
watchFileEvents: "/etc/certs": MODIFY|ATTRIB
watchFileEvents: "/etc/certs/..2020_02_10_09_41_46.891624651": MODIFY|ATTRIB
watchFileEvents: notifying
One of the leads I'm chasing is the default circuitBreakers values. Could that be related to this?
Thanks
The error you are seeing is caused by a failure to establish a connection to istio-policy.
Based on this GitHub issue, community members added two answers which could help you with your issue:
If mTLS is enabled globally, make sure you set controlPlaneSecurityEnabled: true.
I was facing the same issue; then I read about protocol selection. I realised that the name of the port in the service definition should start with the protocol, for example http-. This fixed the issue for me. If you still face the issue, you might need to look at the TLS check for the pods and resolve it using DestinationRules and Policies.
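As a sketch of what the protocol-selection fix looks like (all names below are placeholders), the Service port just needs the protocol prefix in its name:

apiVersion: v1
kind: Service
metadata:
  name: service-b            # placeholder name
spec:
  selector:
    app: service-b
  ports:
    - name: http-api         # the "http-" prefix lets Istio 1.3 select HTTP for this port
      port: 80
      targetPort: 8080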
istio-policy is running with the istio-proxy sidecar. Is that correct?
Yes, I just checked, and it does run with a sidecar.
Let me know if that helps.

Authentication with Cognito - where to find logs

We have 2 React Native apps that use AWS Cognito for authentication. We use the library react-native-aws-cognito-js in our code. The apps were working fine until the last 2 days; they are now experiencing intermittent "Internal Server Error" responses.
How can I find more information about this error? Is there any tool that can help us pinpoint the cause?
Update
In CloudTrail, each API call has a "CreateNetworkInterface" event. Many of these calls have the error code "Client.NetworkInterfaceLimitExceeded". What is the cause of this, and what is the solution?
According to this AWS doc (in Chinese), CloudWatch will not write logs when the error is due to insufficient IPs/ENIs. That explains the increase in errors with no corresponding logs in CloudWatch.
Update 2
We have found a scheduled Lambda job which may have exhausted the IP addresses. We stopped the batch job, but we still can't have many users log in to the server due to the "Client.NetworkInterfaceLimitExceeded" error. I realized that there are many "CreateNetworkInterface" events and few "DeleteNetworkInterface" events. How can I "clean up / reset" all the network interfaces in the VPC?
Short answer: CloudTrail.
Long answer, with a suggestion:
Assuming your application code is fine, the most likely cause of your 500 errors is Cognito's default limits (e.g., number of calls per user): https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html
AWS suggests using CloudTrail for logging API calls.
However, to prove that the limits are the problem, I would first add some logging around the API call yourself; then, in development, call your app/API a large number of times, and most likely you will see the 500 errors caused by those limits.
You could do the following in the terminal:
for i in `seq 1 1000`; do curl --cookie SecureCookie=TokenValueFromAWS http://localhost:desirablePort/SecuredPath; done
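Regarding the "clean up / reset" question in Update 2, here is a hedged sketch (Python with boto3, not a drop-in script) of one way to list network interfaces that are no longer attached and delete them; review the output carefully before deleting anything in your VPC:

# Sketch: find "available" (detached) ENIs and delete them. Pagination is omitted for brevity.
import boto3

ec2 = boto3.client("ec2")

resp = ec2.describe_network_interfaces(
    Filters=[{"Name": "status", "Values": ["available"]}]
)

for eni in resp["NetworkInterfaces"]:
    eni_id = eni["NetworkInterfaceId"]
    print("Deleting detached ENI:", eni_id)
    ec2.delete_network_interface(NetworkInterfaceId=eni_id)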

Google Places API error 502 - The server encountered a temporary error

We run a website that obtains location data through the Google Places API. We have 150k daily searches available, which we haven't reached yet as the website has only been live for a few weeks. We have suddenly started receiving a 502 error. A notification in the Console says: "The server encountered a temporary error and could not complete your request." Is this a temporary error? Are there any suggestions on what we can do? The website hasn't been available for 40 minutes.
When you receive a 5xx status or UNKNOWN_ERROR in the response, you should implement retry logic. Google has the following recommendation in their web services documentation:
In rare cases something may go wrong serving your request; you may receive a 4XX or 5XX HTTP response code, or the TCP connection may simply fail somewhere between your client and Google's server. Often it is worthwhile re-trying the request as the followup request may succeed when the original failed. However, it is important not to simply loop repeatedly making requests to Google's servers. This looping behavior can overload the network between your client and Google causing problems for many parties.
A better approach is to retry with increasing delays between attempts. Usually the delay is increased by a multiplicative factor with each attempt, an approach known as Exponential Backoff.
https://developers.google.com/maps/documentation/directions/web-service-best-practices#exponential-backoff
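A minimal sketch of that pattern in Python (the endpoint and parameters below are placeholders, not your exact Places API call):

# Sketch: retry a request with exponential backoff and jitter on 5xx responses.
import random
import time
import requests

URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"  # example endpoint
PARAMS = {"query": "coffee", "key": "YOUR_API_KEY"}  # placeholder parameters

def fetch_with_backoff(max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        resp = requests.get(URL, params=PARAMS, timeout=10)
        if resp.status_code < 500:
            return resp  # success, or a 4xx error that retrying will not fix
        # exponential backoff: 1s, 2s, 4s, ... plus a little random jitter
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError("Gave up after %d attempts" % max_attempts)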
However, if retry logic with exponential backoff doesn't help and the error persists for a long time, you should file a bug in Google's issue tracker.
I hope this addresses your question!
UPDATE
There was an issue on Google's side yesterday (November 6, 2017); you can refer to the following bug report, which explains the issue:
https://issuetracker.google.com/issues/68938173