Istio Circuit Breaker: who trips it?

I am currently doing research on the service mesh Istio, version 1.6. The data plane (the Envoy proxies) is configured by the control plane.
When I configure a circuit breaker by creating a DestinationRule and the circuit breaker opens, does the client-side sidecar proxy already return the 503, or does the server-side sidecar proxy?
Does the client-side sidecar proxy automatically route the request to another available instance of the service, or does it simply return the 503 to the application container?
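For concreteness, the kind of DestinationRule I mean looks roughly like this (a sketch with a hypothetical host and thresholds, not my exact configuration):
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-cb          # hypothetical name
spec:
  host: my-service             # hypothetical in-mesh service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1              # allow only one TCP connection
      http:
        http1MaxPendingRequests: 1     # allow only one pending request
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutive5xxErrors: 1          # eject an instance after a single 5xx
      interval: 1s
      baseEjectionTime: 30s
Exceeding the connectionPool limits causes upstream overflow (the UO response flag), while outlierDetection ejects instances that return consecutive errors.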
Thanks in advance!

You can inspect the log entries to figure out both ends of the connection that was stopped by the circuit breaker; the IP addresses of both sides of the connection are present in the log message from the istio-proxy container.
{
  insertId: "..."
  labels: {
    k8s-pod/app: "circuitbreaker-jdwa8424"
    k8s-pod/pod-template-hash: "..."
  }
  logName: ".../logs/stdout"
  receiveTimestamp: "2020-06-09T05:59:30.209882320Z"
  resource: {
    labels: {
      cluster_name: "..."
      container_name: "istio-proxy"
      location: "..."
      namespace_name: "circuit"
      pod_name: "circuit-service-a31cb334d-66qeq"
      project_id: "..."
    }
    type: "k8s_container"
  }
  severity: "INFO"
  textPayload: "[2020-06-09T05:59:27.854Z] UO 0 0 0 "-" - - 172.207.3.243:443 10.1.13.216:36774 "
  timestamp: "2020-06-09T05:59:28.071001549Z"
}
The message comes from the istio-proxy container running the Envoy instance that was affected by the circuit breaker policy the request was sent to. It also contains the IP addresses of both the source and the destination of the interrupted connection.
It will return 503. There is an option to configure retries; however, I did not test how retries interact with the circuit breaker, or whether a retry actually goes to a different pod when the previous one returned an error.
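For reference, retries are configured on the VirtualService rather than the DestinationRule. A minimal sketch (the host name and values are hypothetical, and I have not verified how this behaves together with a tripped circuit breaker):
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service-retries     # hypothetical name
spec:
  hosts:
  - my-service                 # hypothetical in-mesh service
  http:
  - route:
    - destination:
        host: my-service
    retries:
      attempts: 3                      # retry a failed request up to 3 times
      perTryTimeout: 2s                # timeout per attempt
      retryOn: 5xx,connect-failure     # conditions that trigger a retry
Whether a retry is sent to a different pod depends on Envoy's retry host selection, which I have not verified here.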
Also take a look at the most detailed explanation of CircuitBreaker I managed to find.
Hope it helps.

Related

Unable to create working subscription for Event Hub topic in Dapr

I have a Dapr application running locally, self-hosted with the Dapr CLI. I've configured a Dapr Component and a Subscription for subscribing to an Azure Event Hub, detailed below:
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: eventhubs-pubsub
spec:
  type: pubsub.azure.eventhubs
  version: v1
  metadata:
  - name: connectionString
    value: "Endpoint=sb://[removed].servicebus.windows.net/;SharedAccessKeyName=[removed];SharedAccessKey=[removed];EntityPath=myhub"
  - name: enableEntityManagement
    value: "false"
  - name: storageAccountName
    value: "[removed]"
  - name: storageAccountKey
    value: "[removed]"
  - name: storageContainerName
    value: "myapp"
scopes:
- myapp
apiVersion: dapr.io/v1alpha1
kind: Subscription
metadata:
  name: myhub-subscription
spec:
  topic: myhub
  route: /EventHubsInput
  pubsubname: eventhubs-pubsub
scopes:
- myapp
I've manually created a consumer group with the name of the Dapr app id - "myapp".
I've called the HTTP endpoint directly - a POST verb returning 200 - and it works fine. It also responds to the OPTIONS verb.
The application starts successfully with no errors or warnings. I can see a logged message saying:
INFO[0000] connectionString provided is specific to event hub "myhub". Publishing or subscribing to a topic that does not match this event hub will fail when attempted. app_id=myapp instance=OldManWaterfall scope=dapr.contrib type=log ver=1.6.0
INFO[0000] component loaded. name: eventhubs-pubsub, type: pubsub.azure.eventhubs/v1 app_id=myapp instance=OldManWaterfall scope=dapr.runtime type=log ver=1.6.0
No other message is logged regarding the pubsub, and there is no message indicating failure or success of the subscription itself. Nothing is created in the storage container. If I remove the storage-related config from the Component, no failure is reported, despite those properties being mandatory. When I put a message on the Hub, unsurprisingly, nothing happens.
What am I doing wrong? Everything I've read seems to indicate this setup should work.
I was able to fix this by exposing my app over HTTP instead of HTTPS. Unfortunately, there was no logging to indicate HTTPS was the issue, even with debug-level logging switched on.

GCP Cloud Scheduler throws ERROR for an HTTP target type

I have created a GCP Cloud Scheduler job to run every 15 minutes. It is supposed to call an API from my Node.js application.
In the console the job definition looks like this:
Description: A job
Frequency: */15 * * * *
Timezone: Central Standard Time
Target: HTTP
URL: https://<company url>/api/email-reminder/
HTTP method: GET
Auth header: Add OIDC token
Service account: xxxxxxxxxxx-compute@developer.gserviceaccount.com
When it runs it returns the following in the logs:
{
  httpRequest: {
  }
  insertId: "15wxxxxxxge1lv"
  jsonPayload: {
    #type: "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished"
    jobName: "projects/<project name>/locations/us-central1/jobs/xxxxxxxxx-scheduler-emailreminders-1"
    status: "UNKNOWN"
    targetType: "HTTP"
    url: "https://<company url>/api/email-reminder/"
  }
  logName: "projects/<project name>/logs/cloudscheduler.googleapis.com%2Fexecutions"
  receiveTimestamp: "2019-11-14T04:45:50.280446452Z"
  resource: {
    labels: {…}
    type: "cloud_scheduler_job"
  }
  severity: "ERROR"
  timestamp: "2019-11-14T04:45:50.280446452Z"
}
How do I find out more information about the error?
The default attempt deadline for Cloud Scheduler jobs is 180s; you can change it via the gcloud command:
gcloud scheduler jobs update http my-super-job --attempt-deadline 540s
You can also see the complete info about your jobs with these commands:
gcloud scheduler jobs list
gcloud scheduler jobs describe my-super-job
I've seen similar problems recently using Cloud Scheduler for HTTPS targets. Every so often, the scheduler will fail, and all I get is a log message like yours.
Viewing in the Logs Viewer, the key parts of the log are, in the log header:
"status":"RESOURCE_EXHAUSTED",
"#type":"type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished"
and in the log data:
httpRequest: {
  status: 429
}
jsonPayload: {
  #type: "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished"
  jobName: "projects/joburlhere"
  status: "RESOURCE_EXHAUSTED"
  targetType: "HTTP"
  url: "https://urlgoeshere"
}
severity: "ERROR"
severity: "ERROR"
"Resource exhausted" is the description of the 429 error code.
There's a description of this code here:
https://cloud.google.com/apis/design/errors
429 RESOURCE_EXHAUSTED Either out of resource quota or reaching rate limiting. The client should look for google.rpc.QuotaFailure error detail for more information.
Given that I'm running this job once an hour, and the receiver is a very modest Cloud Function, I don't think I'm doing anything to cause resource exhaustion. So I think it's a recurring transient problem with Google's cloud infrastructure. I'm guessing that the Cloud Function was unavailable for that particular request, and because I set up the Cloud Function with default settings, the Scheduler did not retry.
Additionally, it is possible to configure a Scheduler Job to retry in case of failure. This functionality is not shown in the web console, but you can control it using the gcloud command.
The default setting is not to retry.
Look at the --max-retry-attempts flag.
https://cloud.google.com/sdk/gcloud/reference/scheduler/jobs/update/http
There are similar controls for Pub/Sub jobs:
https://cloud.google.com/sdk/gcloud/reference/scheduler/jobs/update/pubsub
It's because the HTTP endpoint you specified doesn't return a response within the default attempt deadline. Refer to the links above.

How can I confirm whether circuit breaking (via DestinationRule) is at work or not for an external service (ServiceEntry & VirtualService)?

Summary of Problem
I'm trying to impose circuit breaker parameters for an external endpoint outside of my mesh, hosted somewhere else. However, the parameters I have set don't seem to be enforced, because I am still getting successful HTTP 200 responses when I expect requests to start failing with HTTP 503.
Tool versions are:
Istio 1.2.4
Kubernetes: v1.10.11
Docker Desktop: Version 2.0.0.3
Notable config:
global.outboundTrafficPolicy.mode is REGISTRY_ONLY.
Within the mesh, mTLS is enabled. For external traffic, TLS is DISABLED.
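For reference, a sketch of how this outbound mode is typically expressed in the mesh config, using the newer IstioOperator API (in the Istio 1.2 used here, the equivalent Helm value was global.outboundTrafficPolicy.mode):
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY   # only hosts in the service registry / ServiceEntries are reachable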
Related Resources
ServiceEntry
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-service
spec:
  hosts:
  - external-service.sample.com
  location: MESH_EXTERNAL
  exportTo:
  - "*"
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: DNS
VirtualService
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: external-service-vs
spec:
  hosts:
  - external-service.sample.com
  http:
  - timeout: 200ms
    retries:
      attempts: 1
      perTryTimeout: 200ms
    route:
    - destination:
        host: external-service.sample.com
        port:
          number: 80
DestinationRule
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: external-service-dr
spec:
  host: external-service.sample.com
  trafficPolicy:
    tls:
      mode: DISABLE
    connectionPool:
      tcp:
        maxConnections: 1
        connectTimeout: 200ms
      http:
        http2MaxRequests: 1
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
        maxRetries: 1
        idleTimeout: 200ms
    outlierDetection:
      consecutiveErrors: 1
      interval: 1s
      baseEjectionTime: 10s
      maxEjectionPercent: 100
Testing
I have an application inside the mesh, injected with an Envoy proxy. The app basically just runs concurrent load for HTTP/1.1 GET external-service.sample.com/endpoint.
I adjust the number of concurrent users in the app (1 to 10) and requests per second per user (1 to 20).
I was expecting the responses to start failing as the load ramped up, but that's not the case; I get successes throughout.
Key Asks
If you see something very glaring, please point it out.
I already checked the logs and /stats from my Envoy proxy (outgoing request and response). What other Istio logs do I need to check to understand whether the request was actually subjected to and evaluated against the DestinationRule or not?
Besides the statistical data gathered by Istio Mixer from the Envoy instances, you might consider fetching circuit breaker events from the Envoy access logs.
With access logging enabled across the Istio mesh, you can extract the relevant circuit breaker log entries, distinguished by a specific response flag:
UO: Upstream overflow (circuit breaking), in addition to the 503 response code.
An example record fetched from a container's envoy-proxy access log:
[2019-09-18T09:49:56.716Z] "GET /headers HTTP/1.1" 503 UO "-" "-" 0 81 0 - "-"
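If access logging is not already enabled, a sketch of one way to turn it on via the mesh config (this uses the IstioOperator API from newer Istio releases; in the Istio 1.2 used here, the Helm value global.proxy.accessLogFile played the same role):
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: /dev/stdout   # emit Envoy access logs, including response flags such as UO, to stdout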
I have not really addressed the issue directly.
However, I redid the whole setup from a clean slate, starting from installing Istio, and after that it was already throwing the expected HTTP 503.
It was more challenging than it should have been to find out the state of the circuit breakers. There was supposed to be a ticket logged for this, but it seems development of such a feature is not yet on the horizon.
Nevertheless, when verifying, I did look at some telemetry metrics to understand the circuit breaker state. I think this approach is better, because I only want to know whether the circuit is closed or open at a given moment, not analyze multiple inputs.
Thanks.

GCP - IOT Core Monitoring messages

I'm kind of new to the IoT world and trying to learn a bit about GCP products.
I've made a simple Python app that uses Paho to send a message to an IoT topic (IoT Core in GCP).
Everything apparently works just fine, but I was wondering if I could see, in Stackdriver, the content of a message that the device had sent.
I have already enabled debug logging for it, but the message didn't show up.
Publish log in Stackdriver:
{
  insertId: "78178yfwnl"
  jsonPayload: {
    eventType: "PUBLISH"
    protocol: "MQTT"
    publishFromDeviceTopicType: "EVENTS"
    resourceName: "projects/demoiot/locations/us-central1/registries/iotchicago/devices/2753540639583"
    serviceName: "cloudiot.googleapis.com"
    status: {
      code: 0
    }
  }
  labels: {
    device_id: "us_chi"
  }
  logName: "projects/demoiot/logs/cloudiot.googleapis.com%2Fdevice_activity"
  receiveTimestamp: "2018-11-20T11:10:01.123928203Z"
  resource: {
    labels: {
      device_num_id: "2753540639583"
      device_registry_id: "iotchicago"
      location: "us-central1"
      project_id: "demoiot-223010"
    }
    type: "cloudiot_device"
  }
  severity: "DEBUG"
  timestamp: "2018-11-20T11:10:01.104415969Z"
}
No telemetry data will be logged by our system. The potential for privacy concerns, because permissions on the logs differ from permissions on the telemetry itself, is such that we didn't want to touch that.
You CAN write to Stackdriver explicitly, though, so a way to do that would be to have a Cloud Function tied to the Pub/Sub topic where the telemetry is being written, and then have that function write messages out to Stackdriver with the payload data. You could also do it with Dataflow if Java is more your thing.
A teammate also pointed out that using the /state/ MQTT topic to write out the device's state and checking it in the GCP console is a good way to quickly test. In the device details, there's a "Configuration and State History" tab that will show it.

502 Server Error sometime on Google Compute Engine

I set up a server on Google Compute Engine with Apache server on Ubuntu 16.04.4 LTS. It's protected with IAP.
It was fine for about 6 months, but now some of the users encounter a 502 Server Error.
I already checked the following links
Some 502 errors in GCP HTTP Load Balancing [Changed the Apache KeepAliveTimeout to 620]
502 response coming from errors in Google Cloud LoadBalancer [Removed ajax requests]
But the problem is still there.
Here is the error message from one of the logs.
{
  httpRequest: {…}
  insertId: "170sg34g5fmld90"
  jsonPayload: {
    #type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
    statusDetails: "failed_to_pick_backend"
  }
  logName: "projects/sggc-web01/logs/requests"
  receiveTimestamp: "2018-03-14T07:21:55.807802906Z"
  resource: {…}
  severity: "WARNING"
  spanId: "44a49bf1b3893412"
  timestamp: "2018-03-14T07:21:53.048717425Z"
  trace: "projects/sggc-web01/traces/f35119d8571f20df670b0d53ab6b3210"
}
Please help me to trace and fix the issue. Thank you!
The error is not being caused by the server but by the load balancer.
The statusDetails value "failed_to_pick_backend" means the error occurred because all the instances were unhealthy (or still are) when the load balancer tried to establish the connection.
This can happen because:
1 - The CPU usage of the instances was too high and they weren't able to answer the health check requests from the load balancer, so they appeared unhealthy to it.
2 - The health checks aren't being allowed through the firewall (I doubt this is the reason if it worked before).