How do timeout and retries work together in Istio?

Here is an example of a VirtualService using both timeout and retries:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: sa-logic
spec:
  hosts:
    - sa-logic
  http:
    - route:
        - destination:
            host: sa-logic
            subset: v1
          weight: 50
        - destination:
            host: sa-logic
            subset: v2
          weight: 50
      timeout: 8s
      retries:
        attempts: 3
        perTryTimeout: 3s # perTryTimeout (3s) is different from timeout above (8s)
How does this work? The documentation does not give a clear answer.
I have four guesses:
The timeout is always 8s (timeout overrides perTryTimeout).
The timeout is always 3s (perTryTimeout overrides timeout).
The initial call's timeout is 8s and each retry's timeout is 3s (this contradicts the documentation, which says perTryTimeout applies to the initial call and any retries).
The timeout per try, including the initial call, is always 3s, but the total timeout for all attempts is 8s.

This is correct:
The timeout per try, including the initial call, is always 3s, but the total timeout for all attempts is 8s.
It basically means that:
An attempt is marked as failed if it takes longer than 3 seconds.
There will be a maximum of 3 attempts.
The overall wait for a successful attempt will never be longer than 8 seconds.
perTryTimeout * attempts should not exceed the global timeout; if it does, the retry attempts that fall outside the global timeout are never made (see the sketch below).
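As an illustration of that cap (hypothetical numbers, not from the original question), consider a variant of the http section above where the per-try budgets nominally add up to more than the global timeout:
http:
  - route:
      - destination:
          host: sa-logic
          subset: v1
    timeout: 8s         # global budget for the request, retries included
    retries:
      attempts: 3
      perTryTimeout: 4s # 3 x 4s = 12s nominally, but any attempt still
                        # running at the 8s mark is cut off by the timeout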

Related

How to use retries for requests limited by Envoy's local rate limiter

I want to set up a retry policy for requests that have been rejected by the local rate limiter. The documentation states that you must add envoy-ratelimited to the retry_on field.
But somehow it doesn't work. I do not see the retry statistics in the admin panel increase, and the response time is instantaneous despite the maximum of 4 attempts.
My configuration is:
routes:
  - match:
      prefix: "/app"
    route:
      host_rewrite_literal: app
      prefix_rewrite: "/"
      timeout: 15s
      cluster: app
      retry_policy:
        retry_on: envoy-ratelimited
        num_retries: 4
    typed_per_filter_config:
      envoy.filters.http.local_ratelimit:
        "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
        stat_prefix: app_ratelimit
        token_bucket:
          max_tokens: 5
          tokens_per_fill: 5
          fill_interval: 5s
        filter_enabled:
          runtime_key: local_rate_limit_enabled
          default_value:
            numerator: 100
            denominator: HUNDRED
        filter_enforced:
          runtime_key: local_rate_limit_enforced
          default_value:
            numerator: 100
            denominator: HUNDRED
At the moment, there is no direct way to do this.
You can only use the two-listener setup proposed below:
Retry listener -> rate limit listener -> upstream
You configure the local rate limiter on the listener that sends requests to the upstream. If a request from that listener is limited by the local rate limiter, the retry listener retries it (the retry listener needs to be configured to retry on the 4xx status returned by the rate limiter); a sketch follows.
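A minimal sketch of the retry-listener side, under these assumptions: the names and ports are hypothetical, and a second listener on 127.0.0.1:8081 carries the local_ratelimit configuration from the question. Since the local rate limiter answers with 429, the sketch retries on that specific status rather than a bare 4xx condition:
static_resources:
  listeners:
    - name: retry_listener            # front listener that performs retries
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: retry
                route_config:
                  virtual_hosts:
                    - name: app
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/app" }
                          route:
                            cluster: ratelimit_listener
                            retry_policy:
                              # retry the 429s produced by the rate limit listener
                              retry_on: retriable-status-codes
                              retriable_status_codes: [429]
                              num_retries: 4
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: ratelimit_listener        # loopback hop to the rate limit listener
      connect_timeout: 1s
      type: STATIC
      load_assignment:
        cluster_name: ratelimit_listener
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: 127.0.0.1, port_value: 8081 }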

istio setting request size limits - lookup failed: 'request.size'

I am attempting to limit traffic by request size using Istio. Given that the virtual service does not provide this, I am trying to do it via a mixer policy.
I set up the following:
---
apiVersion: "config.istio.io/v1alpha2"
kind: handler
metadata:
  name: denylargerequest
spec:
  compiledAdapter: denier
  params:
    status:
      code: 9
      message: Request Too Large
---
apiVersion: "config.istio.io/v1alpha2"
kind: instance
metadata:
  name: denylargerequest
spec:
  compiledTemplate: checknothing
---
apiVersion: "config.istio.io/v1alpha2"
kind: rule
metadata:
  name: denylargerequest
spec:
  match: destination.labels["app"] == "httpbin" && request.size > 100
  actions:
    - handler: denylargerequest
      instances: [ denylargerequest ]
Requests are not denied, and I see the following error from istio-mixer:
2020-01-07T15:42:40.564240Z warn input set condition evaluation error: id='2', error='lookup failed: 'request.size''
If I remove the request.size portion of the match, I get the expected behavior, which is a 400 HTTP status with a message about the request size. Of course, I then get it on every request, which is not desired. But that, along with the above error, makes it clear that the request.size attribute is the problem.
I do not see anywhere in Istio's docs which attributes are available to mixer rules.
I am running Istio 1.3.0.
Any suggestions on the mixer rule? Or an alternative way to enforce request size limits via istio?
The rule match mentioned in the question:
match: destination.labels["app"] == "httpbin" && request.size > 100
will not work because of mismatched attribute types.
According to the Istio documentation:
Match is an attribute-based predicate. When Mixer receives a request it evaluates the match expression and executes all the associated actions if the match evaluates to true.
A few example matches:
an empty match evaluates to true
true, a boolean literal; a rule with this match will always be executed
match(destination.service.host, "ratings.*") selects any request targeting a service whose name starts with "ratings"
attr1 == "20" && attr2 == "30" logical AND, OR, and NOT are also available
This means that request.size > 100 compares integer values, which is not supported here.
However, it is possible with the help of the Common Expression Language (CEL).
You can enable CEL in Istio by using the policy.istio.io/lang annotation (set it to CEL).
Then, using the type-conversion functions from CEL's list of standard definitions, you can parse values into different types.
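As a sketch (assuming your Istio version honors the annotation; CEL support in Mixer was experimental), the rule from the question could be annotated like this:
apiVersion: "config.istio.io/v1alpha2"
kind: rule
metadata:
  name: denylargerequest
  annotations:
    policy.istio.io/lang: CEL   # switch the expression language to CEL
spec:
  match: destination.labels["app"] == "httpbin" && request.size > 100
  actions:
    - handler: denylargerequest
      instances: [ denylargerequest ]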
This is just a suggestion for a solution.
An alternative way would be to use an EnvoyFilter, as in this GitHub issue and as sketched below.
According to another related GitHub issue about Envoy's per-connection buffer limit:
The current resolution is to use envoyfilter, reopen if you feel this is a must feature
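For illustration, here is a hedged EnvoyFilter sketch that injects Envoy's buffer filter to reject request bodies over 100 bytes with a 413. The filter names and the v3 type URL follow newer Envoy/Istio conventions and may need adjusting for Istio 1.3:
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: limit-request-size
spec:
  workloadSelector:
    labels:
      app: httpbin
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.http_connection_manager
              subFilter:
                name: envoy.filters.http.router
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.buffer
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.buffer.v3.Buffer
            max_request_bytes: 100   # requests with larger bodies are rejected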
Hope this helps.

AWS CloudWatch alarm keeps sending mails for as long as the evaluation interval lasts

I've set up an AWS CloudWatch alarm with the following parameters:
ActionsEnabled: true
AlarmActions: "some SNS topic"
AlarmDescription: "Too many HTTP 5xx errors"
ComparisonOperator: GreaterThanOrEqualToThreshold
DatapointsToAlarm: 1
Dimensions:
- Name: ApiName
Value: "some API"
EvaluationPeriods: 20
MetricName: 5XXError
Namespace: AWS/ApiGateway
Period: 300
Statistic: Average
Threshold: 0.1
TreatMissingData: ignore
The idea is to receive a mail when there are too many HTTP 5xx errors. I believe the above gives me an alarm that evaluates periods of 5 minutes (300s). If 1 out of 20 data points exceeds the threshold (10% of the requests), I should receive an email.
This works; I receive the email. But even when the error rate drops below the threshold again, I keep receiving emails, more or less for the entire duration of the evaluation interval (1h40min = 20 x 5 minutes). Also, I receive these mails every 5 minutes, which makes me think there must be a connection with my configuration.
This question implies that this shouldn't happen, which seems logical to me. In fact, I'd expect not to receive an email for at least 1 hour and 40 minutes (20 x 5 minutes), even if the threshold is breached again.
This is the graph of my metric/alarm (screenshot omitted).
Correction: I actually received 22 mails.
Have I made an error in my configuration?
Update
I can see that the state is set from Alarm to OK 3 minutes after it was set from OK to Alarm (screenshot omitted).
This is what we found and how we fixed it.
We're evaluating blocks of 5 minutes and taking the average error rate. But AWS evaluates at faster intervals than 5 minutes. The distribution of your errors can be such that, at a given point in time, a 5-minute block has an average of 12%. A bit later, that block can be split in two, giving you two blocks with different averages, possibly below the threshold.
That's what we believe is going on.
We fixed it by changing our Period to 60s and adjusting our DatapointsToAlarm and EvaluationPeriods settings accordingly; a sketch follows.
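The answer doesn't state the final values, so the following is illustrative: keep the same overall 1h40min window but evaluate 1-minute blocks, and require several breaching minutes. The DatapointsToAlarm and EvaluationPeriods values here are assumptions, not the original configuration:
Period: 60              # evaluate 1-minute blocks instead of 5-minute blocks
EvaluationPeriods: 100  # 100 x 60s keeps the same 1h40min window
DatapointsToAlarm: 5    # e.g. require 5 breaching minutes (illustrative value)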

How can I confirm whether Circuit Breaking (via DestinationRule) is at work for an external service (ServiceEntry & VirtualService)?

Summary of Problem
I'm trying to impose circuit breaker parameters for an external endpoint outside of my mesh, hosted somewhere else. However, the parameters I have set don't seem to be enforced, because I am still getting successful HTTP 200 responses when I expect requests to start failing with HTTP 503.
Tool versions are:
Istio 1.2.4
Kubernetes v1.10.11
Docker Desktop 2.0.0.3
Notable config:
global.outboundTrafficPolicy.mode is REGISTRY_ONLY.
mTLS is enabled within the mesh; for external traffic, TLS is DISABLED.
Related Resources
ServiceEntry
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-service
spec:
  hosts:
    - external-service.sample.com
  location: MESH_EXTERNAL
  exportTo:
    - "*"
  ports:
    - number: 80
      name: http
      protocol: HTTP
  resolution: DNS
VirtualService
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: external-service-vs
spec:
  hosts:
    - external-service.sample.com
  http:
    - timeout: 200ms
      retries:
        attempts: 1
        perTryTimeout: 200ms
      route:
        - destination:
            host: external-service.sample.com
            port:
              number: 80
DestinationRule
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: external-service-dr
spec:
  host: external-service.sample.com
  trafficPolicy:
    tls:
      mode: DISABLE
    connectionPool:
      tcp:
        maxConnections: 1
        connectTimeout: 200ms
      http:
        http2MaxRequests: 1
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
        maxRetries: 1
        idleTimeout: 200ms
    outlierDetection:
      consecutiveErrors: 1
      interval: 1s
      baseEjectionTime: 10s
      maxEjectionPercent: 100
Testing
I have an application inside the mesh injected with an Envoy proxy. The app basically just runs concurrent load against HTTP/1.1 GET external-service.sample.com/endpoint.
I adjust the number of concurrent users in the app (1 to 10) and the requests per second per user (1 to 20).
I was expecting the responses to start failing as the load ramped up, but that's not the case; I get successes throughout.
Key Asks
If you see something very glaring, please point it out.
I already checked the logs and /stats from my Envoy proxy (outgoing request and response). What other Istio logs do I need to check to understand whether the request was actually subjected to and evaluated against the DestinationRule?
Besides the statistics gathered by Istio Mixer from the Envoy instances, you might consider fetching circuit breaker events from Envoy's access logs.
With access logging enabled across the Istio mesh, you can extract the relevant circuit breaker log entries, distinguished by a specific response flag:
UO: Upstream overflow (circuit breaking), in addition to the 503 response code.
A matching record fetched from a container's envoy-proxy access logs looks like:
[2019-09-18T09:49:56.716Z] "GET /headers HTTP/1.1" 503 UO "-" "-" 0 81 0 - "-"
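For these records to show up, access logging must be enabled mesh-wide. A hedged sketch of one way to do it (on newer installs this is a mesh config field; on Istio 1.2 it was the global.proxy.accessLogFile Helm value):
# mesh config fragment (newer Istio versions)
meshConfig:
  accessLogFile: /dev/stdout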
I have not really addressed the issue directly.
Instead, I redid the whole setup from a clean slate, starting from installing Istio, and after that it was throwing the expected HTTP 503.
Knowing the state of the circuit breakers was more challenging than it should have been. There was supposed to be a ticket logged for this, but it seems development of such a feature is not yet on the horizon.
Nevertheless, when verifying, I did look at some telemetry metrics to understand the circuit breaker state. I think this approach is better, because I only want to know whether the circuit is open or closed at a given moment, not analyze multiple streams of input data.
Thanks.

How to enable user-level rate limiting in istio

I saw that the Istio site mentions rate limiting support, but I can only find a global rate-limit example.
Is it possible to do this at the user level? For example, if a user is logged in but sends more than 50 requests within a second, I'd like to block that user. Similarly, if a user is not logged in, that device should not be able to send more than 30 requests per second.
Yes, it is possible to conditionally apply rate limits based on arbitrary attributes using a match condition in the quota rule.
apiVersion: config.istio.io/v1alpha2
kind: rule
metadata:
  name: quota
  namespace: istio-system
spec:
  match: source.namespace != destination.namespace
  actions:
    - handler: handler.memquota
      instances:
        - requestcount.quota
The quota will only apply when the source namespace is not equal to the destination namespace. In your case, you probably want to set a match like this:
match:
  request:
    headers:
      cookie:
        regex: "^(.*?;)?(user=jason)(;.*)?$"
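To get per-user limits (e.g. 50 requests per second per logged-in user), one hedged sketch using the Mixer-era memquota adapter: add a user dimension to the quota instance, and memquota keeps a separate counter for each distinct set of dimension values. The names and the cookie-based user extraction below are illustrative assumptions, not from the original answer:
apiVersion: config.istio.io/v1alpha2
kind: instance
metadata:
  name: requestcount
  namespace: istio-system
spec:
  compiledTemplate: quota
  params:
    dimensions:
      # illustrative: derive the user identity from a cookie; anonymous
      # traffic falls into the shared "unknown" bucket
      user: request.headers["cookie"] | "unknown"
---
apiVersion: config.istio.io/v1alpha2
kind: handler
metadata:
  name: memquota
  namespace: istio-system
spec:
  compiledAdapter: memquota
  params:
    quotas:
      - name: requestcount.quota.istio-system
        maxAmount: 50     # 50 requests...
        validDuration: 1s # ...per second, counted per distinct user value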
I made a PR to improve the rate-limiting docs; you can find it here: https://github.com/istio/istio.github.io/pull/1109