What is the difference between Request Duration (istio_request_duration_milliseconds) reported by source proxy and destination proxy? - istio

I was looking at the metrics exported by istio. I am especially interested in the Request Duration (istio_request_duration_milliseconds) metric.
I want to understand what value this metric reports from the client istio proxy (reporter="source") versus the server istio proxy (reporter="destination").
Also, which reporter value (source or destination) should I use when I want to know the latency of HTTP requests?
For example, this is one of the queries I can see in the grafana dashboard provided by istio:
label_join(
  (
    histogram_quantile(0.99,
      sum(rate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m]))
      by (le, destination_workload, destination_workload_namespace)
    ) / 1000
  )
  or
  histogram_quantile(0.99,
    sum(rate(istio_request_duration_seconds_bucket{reporter="source"}[1m]))
    by (le, destination_workload, destination_workload_namespace)
  ),
  "destination_workload_var", ".", "destination_workload", "destination_workload_namespace"
)
But this query doesn't work for me. When I change the reporter to destination in the above query, it works.
So I'm confused as to which one is correct. If I have to calculate latency, which should I use: reporter=source or reporter=destination? Or will both report the same latency value?
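For reference, this is the variant with the reporter switched to destination that does return data in my setup; it is just the p99 portion of the dashboard query above with the label changed, not the official dashboard query:
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[1m]))
  by (le, destination_workload, destination_workload_namespace)
) / 1000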

Related

Making GCP Load-balancer HTTP logs integrate with Cloud Trace (in GCP Logs Explorer)?

Cloud Trace and Cloud Logging integrate quite nicely in most cases, as described in https://cloud.google.com/trace/docs/trace-log-integration
Unfortunately, this doesn't seem to include the HTTP request logs generated by a Load Balancer when request logging is enabled.
The LB logs show the traces icon, and are correctly associated with an overall trace in the Cloud Trace system, but the context menu 'show trace details' is greyed out for those log items.
A similar problem arose with my application level logging/tracing, and was solved by setting the traceSampled attribute on the LogEntry, but this can't work for LB logs, since I'm not in control of their generation.
In this instance I'm tracing 100% of requests since the service is M2M and fairly low volume, but in the general case it makes sense that the LB can't know if something is actually generating traces without being told.
I can't find any good references in the docs, but in theory a response header indicating it was sampled could be observed by the LB and cause it to issue the appropriate log.
Any ideas if such a feature exists, in this form or any other?
(A last-ditch workaround might be to use the Logs Router to feed LB logs into a Pub/Sub queue (and exclude them from the normal logging sinks), then resubmit them to the normal sink(s) with the fields set appropriately by a Cloud Function or other Pub/Sub consumer, but that seems like a lot of work and complexity for this purpose.)
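A minimal sketch of that last-ditch workaround, assuming a Logs Router sink exporting the LB logs to a Pub/Sub topic and a Pub/Sub-triggered Cloud Function; the log name and function name are illustrative, and support for the trace_sampled field depends on the google-cloud-logging client version:

# Sketch: re-emit exported LB log entries with traceSampled set.
import base64
import json

from google.cloud import logging

client = logging.Client()
# Illustrative log name for the re-submitted entries.
resubmit_logger = client.logger("lb-requests-resubmitted")

def resubmit_lb_entry(event, context):
    # Decode the exported LogEntry from the Pub/Sub message payload.
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    # Write it back with traceSampled set, preserving the trace id and
    # httpRequest so Logs Explorer can link the entry to its trace.
    resubmit_logger.log_struct(
        entry.get("jsonPayload") or {},
        trace=entry.get("trace"),
        span_id=entry.get("spanId"),
        trace_sampled=True,  # the field the original LB entry lacked
        http_request=entry.get("httpRequest"),
    )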
There is currently a Feature Request open for this; you can follow its status in the following link 1.
As a workaround, you could implement target proxies along with your Load Balancer. According to the documentation for a global external HTTP(S) load balancer:
The proxies set HTTP request/response headers as follows:
Via: 1.1 google (requests and responses)
X-Forwarded-Proto: [http | https] (requests only)
X-Cloud-Trace-Context: <trace-id>/<span-id>;<trace-options> (requests only). Contains parameters for Cloud Trace.
X-Forwarded-For: [<supplied-value>,]<client-ip>,<load-balancer-ip> (see X-Forwarded-For header) (requests only)
You can find the complete documentation about external HTTP(S) load balancers and target proxies here 2.
And finally, take a look at the documentation on how to use and configure target proxies here 3.

Should the django health-check endpoint /ht/ be accessible to everybody?

From the documentation here, I read:
This project checks for various conditions and provides reports when anomalous behavior is detected. The following health checks are bundled with this project: cache, database, storage, disk and memory utilization (via psutil), AWS S3 storage, Celery task queue, Celery ping, RabbitMQ, Migrations.
and from the use case section:
The primary intended use case is to monitor conditions via HTTP(S), with responses available in HTML and JSON formats. When you get back a response that includes one or more problems, you can then decide the appropriate course of action, which could include generating notifications and/or automating the replacement of a failing node with a new one.
And then
The /ht/ endpoint will respond with an HTTP 200 if all checks passed and an HTTP 500 if any of the tests failed.
From a security point of view: should this URL (https://example.com/ht) be reachable by everybody? It seems to give away quite a bit of information about the infrastructure.
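For context, the endpoint in question is wired up along the lines of the project's documented setup; a minimal sketch of that configuration (app names as documented by django-health-check, the rest illustrative):

# settings.py (sketch): enable the bundled checks you need
INSTALLED_APPS = [
    # ... your other apps ...
    "health_check",          # core app
    "health_check.db",       # database check
    "health_check.cache",    # cache check
    "health_check.storage",  # storage check
]

# urls.py (sketch): expose the endpoint at /ht/
from django.urls import include, path

urlpatterns = [
    path("ht/", include("health_check.urls")),
]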

Kamon Prometheus takes long to refresh

I started using Kamon instrumentation recently and am facing issues with the refresh rate of the Kamon Prometheus HTTP endpoint.
Preface:
using "io.kamon" %% "kamon-bundle" % "2.1.4" && "io.kamon" %% "kamon-prometheus" % "2.1.4"
exposing metrics as http endpoint so that prometheus scrapes them and evaluates every 1 sec
created custom Counter, Gauge and Histogram metrics and they are updated 2-3K times per sec inside the Akka actor processing incoming messages
The reason to use Kamon instead of standard prometheus client is to get thread safety
There is configuration kamon.metric.tick-interval 1 second & kamon.prometheus.refresh-interval 1 second related to the rate of refresh
Problem:
Custom metrics exposed at the endpoint (localhost:9095) are not refreshed every second; they are refreshed approximately every 60 seconds.
It's not a Prometheus configuration problem: I'm checking the values on the HTTP endpoint exposed by Kamon, manually refreshing the page.
This was a misconfiguration issue. If you are seeing the same problem, make sure the kamon configuration is at the top level of application.conf, not nested inside the akka {..} block as I had it.
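A minimal sketch of the corrected application.conf layout, using the two settings mentioned above:

# application.conf: kamon must be a top-level block
kamon {
  metric.tick-interval = 1 second
  prometheus.refresh-interval = 1 second
}

akka {
  # akka settings only; do not nest the kamon block in here
}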

API Management - Response Time

We are working on setting up an API Management portal for one of our Web APIs. We are using Event Hubs for logging the events, and we are transferring the event messages to Azure Blob storage using Azure Functions.
We would like to know how we can find the time taken by the API Management layer to provide the response for a request (we are capturing the time taken at the back-end API layer, but not at the API Management layer).
Regards,
John
The simpler solution is to enable Azure Monitor diagnostic logs for the API Management service. You will get raw logs for each request, including:
durationMs - interval between receiving request line and headers from a client and writing last chunk of response body to a client. All writes and reads include network latency.
BackendTime - time spent waiting on backend response
ClientTime - time spent with client for request and response
CacheTime - time spent on fetching from cache
You can also refer to this video.
This is not the canonical way of doing it, but it still gives an idea of how much time each request takes: we can use a context variable to record the start time in the inbound policy section and then compute the elapsed time in the outbound section.
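A rough sketch of that approach as an API Management policy; the variable and header names here are illustrative, not built-ins:

<policies>
    <inbound>
        <base />
        <!-- Record when the gateway received the request -->
        <set-variable name="requestStartTime" value="@(DateTime.UtcNow)" />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
        <!-- Compute gateway-side elapsed time and surface it on the response -->
        <set-header name="X-Gateway-Elapsed-Ms" exists-action="override">
            <value>@(((DateTime.UtcNow - (DateTime)context.Variables["requestStartTime"]).TotalMilliseconds).ToString())</value>
        </set-header>
    </outbound>
</policies>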

WSO2 API Manager 2.1 : Gateway not enforcing Throttling Limits

We have deployed API-M 2.1 in a distributed way (each component, GW, TM, KM, running in its own Docker image) on top of DC/OS 1.9 (Mesos).
We are having trouble getting the gateway to enforce throttling policies (whether subscription tiers or app-level policies). Here is what we have managed to establish so far:
The Traffic Manager itself does its job: it receives the event streams, analyzes them on the fly, and pushes an event onto the JMS topic throttledata.
The Gateway reads the message properly.
So basically we have ruled out a communication issue.
However we found two potential issues:
In the event pushed to the TM component, the value of appTenant is null (instead of carbon.super); we have a single tenant defined.
When the gateway receives the throttling message, it lets the request through because it thinks "stopOnQuotaReach" is set to false, when it is actually set to true (we checked the value in the database).
Digging into the source code, we traced both issues to a single source: both values above are read from the authContext and are apparently set incorrectly. We are stuck and running out of ideas, and would appreciate some pointers to potential sources of the problem and things to check.
Can somebody please help?
Thanks, Isabelle.
Are there two TMs with HA enabled in the system?
If the TMs are HA-enabled, how do the gateways publish data to them? Is the data publishing load-balanced or failover-based?
Did you follow the articles below to configure the environment for your deployment?
http://wso2.com/library/articles/2016/10/article-scalable-traffic-manager-deployment-patterns-for-wso2-api-manager-part-1/
http://wso2.com/library/articles/2016/10/article-scalable-traffic-manager-deployment-patterns-for-wso2-api-manager-part-2/
Is throttling not working at all in your environment?
Have you noticed any JMS connection-related logs on the gateway nodes?
In these tests, we disabled HA to avoid possible complications. Neither subscription nor app-level throttling policies are working, in both cases because parameters that should carry the correct values do not (appTenant, stopOnQuotaReach).
Our scenario is far more basic: even with one instance of each component, it fails as Isabelle described. The only thing we know is that both parameters come from the Authentication Context.
Thank you!