Is there a way to configure Istio to blacklist or whitelist a specific error code? I have tried with 500 (Internal Server Error), but the circuit breaker does not open on 500 either.
The Circuit Breaker doesn't have that kind of functionality.
Furthermore, error 500 is not counted by the circuit breaker. There is an issue about this on GitHub, where the reasoning is explained:
We try not to expose the plethora of sometimes confusing Envoy options to end users in the routing API.
Within a mesh, gateway errors (502/503/504) will be more common, while most sensible external services will return a 503 to shed load.
Secondly, we just made outlier detection generic to both TCP and HTTP. The consecutive gateway error applies only to HTTP and makes no sense in a TCP context.
I also feel that the 500 error code is not indicative of overload. The whole idea behind outlier detection is to remove overloaded servers from the load-balancing pool.
We don't have very many users relying on this behavior, I think. We kept it intentionally generic so that we could switch to a more specific error code in the future (which happens to be now).
Hope this helps.
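To make that concrete, here is a toy Python sketch of the consecutive-gateway-error idea (a simplified illustration, not Envoy's actual implementation; the threshold constant is hypothetical). Note that a 500 response resets the streak instead of advancing it, which is why the breaker never opens on a stream of 500s:

```python
# Toy model of consecutive-gateway-error outlier detection.
GATEWAY_ERRORS = {502, 503, 504}  # the only codes this detector counts
EJECTION_THRESHOLD = 5            # hypothetical, plays the role of consecutiveGatewayErrors

class Host:
    def __init__(self):
        self.consecutive_gateway_errors = 0
        self.ejected = False

    def on_response(self, status: int) -> None:
        if status in GATEWAY_ERRORS:
            self.consecutive_gateway_errors += 1
            if self.consecutive_gateway_errors >= EJECTION_THRESHOLD:
                self.ejected = True  # host is removed from the LB pool
        else:
            # Any other code, including 500, resets the counter, so a server
            # that only returns 500s is never ejected by this detector.
            self.consecutive_gateway_errors = 0
```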
Related
I want to know how I can have two different implementations of the same service and switch traffic from one to the other when failures start to occur (active/passive), or have traffic go from a 50%/50% split to a 0%/100% split when service implementation A is not responding. I would expect the 50/50 split to be restored once implementation A starts working again.
For example, I want to have a payment service, with one implementation backed by Cybersource and the other by Stripe (or whatever other provider makes sense). My implementations will start returning 504 when they detect that response times from a provider are above a certain threshold, or good old 500 because a bug occurred. At that point, I want clients to connect only to the fastest (properly working) implementation for a while, and to gradually retry the failed implementation once the health probe gives it a green light.
Similarly, for an active/passive scenario, perhaps I have a search API and I want all traffic to go to implementation A. However, when that implementation starts returning 5XX errors, I want traffic routed to implementation B, which perhaps offers a degraded experience but can be used as a backup.
When I read the Istio documentation, blogs, etc., I don't see the scenarios above covered. Perhaps Istio is not the right choice for this?
I regularly receive these kinds of errors from my server:
Invalid HTTP_HOST header: '139.162.113.11'. You may need to add '139.162.113.11' to ALLOWED_HOSTS.
The problem is that my server works fine, and I don't know where these IP addresses are coming from.
If I try to geolocate the one in the example, it appears to be in Tokyo, which is weird to me, since my server is based in France and serves mainly European customers.
Could this be a suspicious attempt against the server's security? I'm not keen to allow this IP. What is the correct attitude toward this kind of error?
You can "trust" that Django is helping prevent your app from running on disallowed hosts!
However, you can't blindly trust that these IPs should be allowed to host your application. They're typically bot scanning services poking around for vulnerabilities in servers to do nasty things.
Heck, I have a few of these DisallowedHost warnings in my inbox this morning as I wake up!
There is a logger, django.security.DisallowedHost, that you can silence to quiet this issue; however, I keep it on as a barometer for bot activity.
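If you do want to silence it, the Django documentation shows a LOGGING override along these lines (a minimal settings.py sketch):

```python
# settings.py
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "null": {"class": "logging.NullHandler"},
    },
    "loggers": {
        # Send DisallowedHost reports nowhere and stop propagation, so the
        # SuspiciousOperation noise no longer reaches your error handlers.
        "django.security.DisallowedHost": {
            "handlers": ["null"],
            "propagate": False,
        },
    },
}
```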
I understand you can use Istio to open a circuit breaker when a service isn't responding. Instead of returning a 503, is it possible to redirect to a different URL? Same question, but when the original service returns a 500: can we redirect to another URL?
Or is it possible to have an offline-mode response provided by Istio? I assume the easiest way to do this is through URL redirection to an offline-mode service URL, but I'm open to ideas...
can we redirect to another URL?
If I understand correctly, you're asking whether it's possible to do that with Istio alone.
According to the documentation:
While Istio failure recovery features improve the reliability and availability of services in the mesh, applications must handle the failure or errors and take appropriate fallback actions. For example, when all instances in a load balancing pool have failed, Envoy returns an HTTP 503 code. The application must implement any fallback logic needed to handle the HTTP 503 error code.
And from Christian Posta's blog post on dzone.com:
Istio improves the reliability and availability of services in the mesh. However, applications need to handle the errors and take appropriate fallback actions. For example, when all instances in a load balancing pool have failed, Envoy will return HTTP 503. It is the responsibility of the application to implement any fallback logic that is needed to handle the HTTP 503 error code from an upstream service.
With a service mesh, at the moment without specialized libraries for failure context propagation, the failure reasons are more opaque. This doesn’t mean our application cannot take fallbacks (for both transport and client-specific errors). I’d argue it’s very important for the protocol of any application (whether using library-specific frameworks or not) to always adhere to the promises it’s trying to keep for its clients. If it finds that it cannot complete its intended action, it should figure out a way to gracefully degrade. Luckily, you don’t need application-specific frameworks for this. Most languages have built-in error and exception trapping and handling. Fallbacks should be implemented in these exception paths.
Sadly, the answer is no, you can't. You would have to implement that in your application.
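To illustrate what application-side fallback can look like, here is a minimal Python sketch (the service URLs are hypothetical, and the plain requests-based approach is just one way to do it; Istio itself plays no part in the fallback):

```python
import requests

# Hypothetical endpoints; Envoy may answer for the primary with a 503
# when its circuit breaker is open or the whole pool is unhealthy.
PRIMARY_URL = "http://payments-a.default.svc.cluster.local/charge"
FALLBACK_URL = "http://payments-b.default.svc.cluster.local/charge"

def charge(payload: dict) -> dict:
    try:
        resp = requests.post(PRIMARY_URL, json=payload, timeout=2)
        if resp.status_code < 500:
            return resp.json()
        # 5xx from the primary (or from Envoy on its behalf): fall through.
    except requests.RequestException:
        pass  # connection-level failure: fall through to the backup
    resp = requests.post(FALLBACK_URL, json=payload, timeout=2)
    resp.raise_for_status()
    return resp.json()
```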
Additional resources:
Demystifying Istio Circuit Breaking
Istio Circuit Breaker: When Failure Is an Option
I've recently moved my memcache server behind an Elastic Load Balancer in AWS. I'm also using Flask-Cache with this memcache. If I'm not mistaken (and it's totally possible I am), Flask-Cache opens a connection to memcache and holds it open. It also appears that the ELB terminates these long-standing connections after some period of time (I think it's about 60 minutes). This will result in errors like:
SomeErrors: error 19 from flush_all: (0x4ff96f0) CONNECTION FAILURE, ::rec() returned zero, server has disconnected
If there were some way I could catch these errors and reconnect (or some magic setting to "try to reconnect on connection failure"), that would solve this problem.
FWIW, I'm using pylibmc, but don't see anything obvious (to me) that I could pass.
Any help would be greatly appreciated!
Being disconnected from ELB is very common and also very difficult to debug. Here are a few things that might help:
Debugging Ideas
Attempt to debug the problem in a staging environment with only one instance connected to the ELB.
Make sure you have application logging with timestamps, and that if you catch all exceptions in Python (which is generally not a great idea), you log the exception. It is possible you have a subtle, hidden bug that appears to be something else because you are catching all exceptions.
Simulate the failure (i.e., manually remove one instance from the ELB), then look at your logs and make sure the failure manifests there. If you can reproduce the same behavior, you can figure out how to fix it.
Look into an automated web-service testing tool like https://loader.io/. This can be very helpful for simulating the conditions under which the disconnects appear to happen.
Try the same application with a different load balancer, e.g. HAProxy (I would potentially try this last).
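As for the catch-and-reconnect idea from the question, something along these lines may work as a starting point (the ELB endpoint is hypothetical; pylibmc raises subclasses of pylibmc.Error on connection failures):

```python
import pylibmc

SERVERS = ["my-cache-elb.example.com:11211"]  # hypothetical ELB endpoint

def make_client() -> pylibmc.Client:
    return pylibmc.Client(SERVERS, binary=True)

client = make_client()

def cache_get(key):
    """Fetch a key, rebuilding the connection once if the ELB dropped it."""
    global client
    try:
        return client.get(key)
    except pylibmc.Error:
        # The idle connection was likely cut by the ELB; reconnect and retry once.
        client = make_client()
        return client.get(key)
```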
I'm looking for best practices for reporting internal service errors (status code 500) when something unexpected goes wrong with my RESTful web service.
I'm not referring to conditions covered by other status codes, but for truly exceptional, unexpected errors internal to my application.
Displaying detailed error information such as exception details could benefit debugging, but this would expose internal details of my server. This seems like a Bad Thing (tm).
Perhaps it's best to just report a high-level error message with a timestamp? Error details should of course be in the server log.
Any good examples out there for inspiration?
Don't show detailed debug info externally. A good approach is to create a unique hash/id for the error event and surface that. Ideally, that id can be used on your end to look up additional details. Here is an example of how YouTube does it. They go a bit crazy on the length, however.
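A minimal sketch of that pattern in Python with Flask (the framework and field names are just for illustration):

```python
import logging
import uuid

from flask import Flask, jsonify

app = Flask(__name__)
log = logging.getLogger(__name__)

@app.errorhandler(Exception)
def handle_unexpected_error(exc):
    # Create a unique id and log the full traceback server-side under it...
    error_id = uuid.uuid4().hex
    log.exception("Unhandled error %s", error_id)
    # ...then return only a generic message plus the id to the client, so
    # internals stay hidden but support can correlate reports with the log.
    return jsonify(error="Internal server error", error_id=error_id), 500
```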