I installed Istio on Kubernetes (hosted on AWS) and exposed a service via the Istio ingress. I could achieve only very low throughput: fewer than 2,000 requests per minute. When I exposed the service through a standard ELB, I was able to achieve more than 600,000 requests per second. Is there a guide or a set of steps for tuning Istio for high performance?
Thanks
Joji
One thing we recommend is switching to the non-debug image for the Envoy sidecar: replace docker.io/istio/proxy_debug with docker.io/istio/proxy in the istio and istio-initializer YAML files and redeploy Istio. We are also working on reducing Mixer traces. Performance is an area we are actively working on for the next release of Istio, and we welcome contributions!
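For anyone who prefers to script the image swap rather than edit the files by hand, here is a minimal sketch. It only does a text substitution on the manifest contents; the function name and the image tag in the sample are hypothetical, and you would still redeploy Istio yourself afterwards.

```python
# Image names taken from the reply above.
DEBUG_IMAGE = "docker.io/istio/proxy_debug"
RELEASE_IMAGE = "docker.io/istio/proxy"


def use_release_proxy(manifest_text: str) -> str:
    """Return the manifest text with every debug proxy image reference
    replaced by the non-debug image. Tags after the image name are kept."""
    return manifest_text.replace(DEBUG_IMAGE, RELEASE_IMAGE)


if __name__ == "__main__":
    # Hypothetical manifest line; the tag is a placeholder, not a real version.
    sample = "image: docker.io/istio/proxy_debug:some-tag\n"
    print(use_release_proxy(sample), end="")
```

Running the function on a file that already uses the non-debug image leaves it unchanged, so it is safe to apply to both YAML files.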
I've deployed a REST API onto an EKS cluster with ExternalDNS, HPA, VPA, Cluster Autoscaler, etc., and I am facing a big issue regarding load.
I did a few load tests by swarming the API with requests. The problems are:
The API seems extremely slow to respond (I have the same API deployed on another platform, where it performs significantly better).
After a few requests, they all start to return 502 timeouts.
I know for a fact that it is not a CPU or memory problem: the pods have 1 vCPU and 1 GB of memory each, and they don't use even 20% of that.
In Grafana I see kube-proxy receiving those requests; some get the expected response, but the others fail.
What could the problem be? Any help or advice is much appreciated.
I use Istio 1.8 for the service mesh and Prometheus to collect metrics from the sidecars. Currently the sidecars provide these metrics:
istio_request_bytes_bucket
istio_request_duration_milliseconds_bucket
istio_requests_total
envoy_cluster_upstream_cx_connect_ms_bucket
istio_request_messages_total
istio_response_messages_total
envoy_cluster_upstream_cx_length_ms_bucket
istio_response_bytes_bucket
istio_request_bytes_sum
istio_request_bytes_count
This number of metrics uses a lot of network bandwidth (we have around 5k pods).
All we need for now are istio_requests_total and istio_request_duration_milliseconds_bucket, and only for inbound traffic.
I know how to remove labels with an EnvoyFilter, but I was unable to find documentation for removing a metric.
For better visibility I'm posting my comment as a Community Wiki answer as it is only the extension of what Peter Claes already mentioned in his answer.
According to the Istio docs:
The metrics section provides values for the metric dimensions as
expressions, and allows you to remove or override the existing metric
dimensions. You can modify the standard metric definitions using
tags_to_remove or by re-defining a dimension. These configuration
settings are also exposed as istioctl installation options, which
allow you to customize different metrics for gateways and sidecars as
well as for the inbound or outbound direction.
Here you can find the information on customizing Istio (1.8) metrics:
https://istio.io/v1.8/docs/tasks/observability/metrics/customize-metrics/
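As a stopgap on the Prometheus side (this is my own assumption, not something the Istio docs quoted above describe), a `metric_relabel_configs` rule with `action: keep` can discard every series except the two that are needed. Note that this filters after the scrape, so it cuts what Prometheus stores but not the bytes each sidecar sends, and it does not restrict to inbound traffic. The sketch below just builds the regex and checks it against the metric list from the question; the helper names are hypothetical.

```python
import re

# Metric names currently exposed by the sidecars (copied from the question).
EXPOSED = [
    "istio_request_bytes_bucket",
    "istio_request_duration_milliseconds_bucket",
    "istio_requests_total",
    "envoy_cluster_upstream_cx_connect_ms_bucket",
    "istio_request_messages_total",
    "istio_response_messages_total",
    "envoy_cluster_upstream_cx_length_ms_bucket",
    "istio_response_bytes_bucket",
    "istio_request_bytes_sum",
    "istio_request_bytes_count",
]
WANTED = ["istio_requests_total", "istio_request_duration_milliseconds_bucket"]


def keep_regex(names):
    """Build an alternation matching exactly the given metric names."""
    return "|".join(re.escape(name) for name in names)


def kept(names, regex):
    """Prometheus anchors relabel regexes, so fullmatch mimics its behaviour."""
    return [name for name in names if re.fullmatch(regex, name)]


if __name__ == "__main__":
    regex = keep_regex(WANTED)
    print("metric_relabel_configs:")
    print("  - source_labels: [__name__]")
    print(f"    regex: {regex}")
    print("    action: keep")
    print("kept:", kept(EXPOSED, regex))
```

The printed fragment would go under the relevant scrape job in the Prometheus configuration.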
We're setting up a Moodle for our LMS and we're designing it to autoscale.
Here are the current stack specifications:
-Moodle Application (App + Data) baked into an image and launched into a Managed Instance Group
-Cloud SQL for database (MySQL 5.7 connected through Cloud SQL Proxy)
-Cloud Load Balancer - HTTPS load balancing with the managed instance group as backend + session affinity turned on
Questions:
Do I still need Redis/Memcached for my session? Or is the load balancer session affinity enough?
I'm thinking of using Cloud Filestore for the Data folder. Is this recommended over another Compute Engine instance?
I'm more concerned about the session cache and content cache as the number of users grows. What would you recommend adding into the mix? Any advice on CI/CD would also be helpful.
So, I can't properly answer these questions without more information about your use case. Anyway, here's my best :)
How bad would it be to force some users to log in again when a machine is removed from the managed instance group? Related to this, how spiky do you expect your traffic to be? How many users can a machine serve before the autoscaler kicks in and adds or removes machines from the pool (i.e., how dynamic does your app need to be)? Answering these questions should give you an idea. Also, why not use Datastore/Firestore for user sessions? The few tens of milliseconds of latency shouldn't compromise the snappy feel of your app.
Cloud Filestore uses NFS, and you might hit some NFS idiosyncrasies. Are you OK with hitting and dealing with those? Also, what is an acceptable latency? How big are the blobs of data you will be saving? If they are small enough, you are very latency sensitive, and you want atomic read/write operations, you can go for Cloud Bigtable. If latency is not that critical, Google Cloud Storage can do the job, but you also lose atomicity.
Google Cloud CDN seems to be what you want, provided that you can set up the headers correctly. It is a managed service, so it has all the goodies without you lifting a finger, and it's cheap compared to serving content from your application/Google Cloud Storage/...
Cloud Builder seems the easy option for CI/CD, unless you need more advanced features that are not yet supported.
Please provide more details so I can edit and focus my answer.
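To put a rough number on the re-login question from the first point, here is a toy sketch. It assumes sessions live only on the instance that served the user (local memory or disk) and it stands in for load balancer affinity with a simple modulo on the user id; real affinity is cookie or IP based, and all names and counts here are hypothetical.

```python
def assign(user_ids, instances):
    """Pin each user to one instance (a stand-in for session affinity)."""
    return {user: instances[user % len(instances)] for user in user_ids}


def users_forced_to_relogin(assignment, removed_instance):
    """Users whose locally stored session was on the removed instance."""
    return sorted(u for u, inst in assignment.items() if inst == removed_instance)


if __name__ == "__main__":
    pinned = assign(range(9), ["vm-a", "vm-b", "vm-c"])
    lost = users_forced_to_relogin(pinned, "vm-b")
    print(f"{len(lost)} of {len(pinned)} users lose their session: {lost}")
```

With a shared session store (Redis, Memcached, Datastore/Firestore) the session is not tied to the instance, so nobody would be logged out when the autoscaler removes a machine.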
There is a study on autoscaling which shows that using Redis on Memorystore gives larger network bandwidth from the cache server compared with a Compute Engine instance with Redis installed.
moodle autoscaling on google cloud platform
Regarding the Moodle data, it shows that a Compute Engine instance with NFS should have enough performance compared with Filestore, which is much more expensive, as the speed also depends on the disk size.
I use this topology for the implementation
Autoscale Topology Moodle on GCP
I created a Kubernetes (v1.6.1) cluster on AWS with one master and two slave nodes, then spun up a MySQL instance using Helm and deployed a simple Django web app that queries the latest five rows from the database and displays them. For my web service I specify 'type: LoadBalancer', which creates an ELB on AWS.
If I use 'weave' networking and scale my web app to at least two replicas, I begin to see inconsistent response times: most of the time they are reasonable (around 0.1-0.2 s), but 20-40% of requests take significantly longer (3-5 s, sometimes even more than 15 s). However, if I switch to 'flannel' networking, everything is fast, even with 20-30 replicas of the web app. All machines have enough resources, so that's not the problem.
I tried debugging to find out what's causing the delay, and the best explanation I have is that AWS ELB doesn't work well with 'weave'. Has anyone experienced similar issues? What could be the problem? Please let me know if I should provide some relevant information.
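In case it helps anyone compare the two network plugins with numbers rather than impressions, this is a minimal sketch for computing the share of slow requests from a list of measured response times. The 1-second threshold and the sample values are hypothetical, not measurements from my cluster.

```python
def slow_fraction(latencies_s, threshold_s=1.0):
    """Fraction of requests whose response time exceeds the threshold."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    slow = sum(1 for t in latencies_s if t > threshold_s)
    return slow / len(latencies_s)


if __name__ == "__main__":
    # Hypothetical samples in seconds, shaped like the behaviour described above.
    samples = [0.1, 0.2, 0.15, 3.5, 0.12, 4.8, 0.18, 0.11, 0.2, 16.0]
    print(f"slow requests: {slow_fraction(samples):.0%}")
```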
P.S. I'm new to using Kubernetes.
I'm planning a high-availability setup with autoscaling for RestComm and have some general doubts about the best way to plan it.
This is what I have now:
Restcomm instance using Amazon ECS (docker), so we can launch more instances very easily.
All of them share the Amazon RDS database.
Workspace is shared and persisted between instances.
To move to the next step, I have some questions:
Amazon Load Balancer isn't an option because it doesn't support UDP, so I'm considering the Telestax LB. Is that correct? Is it possible to deploy it using Docker?
Move the Restcomm MS outside of the Restcomm Docker image so it can scale independently. Restcomm provides env variables to specify the MS, so I would have an LB and several MS instances behind it. Correct?
How much RAM does a Restcomm instance need, and how many concurrent sessions does it support? How can we find out, in real time and programmatically, how many concurrent sessions there are?
Is there an "automatic scaling" mechanism implemented in RestComm? More info would be appreciated. Ubuntu Juju isn't an option for me.
We are considering Graylog2 or Logstash for log management. Any insight here? How do you install the agent in the Docker images?
The only documentation I found was this very good document: https://docs.google.com/document/d/13xlaioF065pDnQUoZgfIpi6Noh0qHfAZ7U6afcPd2Y0/edit
Is there any other doc?
Thanks in advance!
Very good questions:
Yes. See https://hub.docker.com/r/restcomm/load-balancer/
You would have one LB (better to have two, in active/passive mode, to avoid a single point of failure) with X Restcomm instances behind it, speaking to Z Media Servers behind them.
It depends on the complexity of the application on top, but here are some numbers: https://github.com/RestComm/Restcomm-Connect/wiki/Load-Testing-on-Docker
Not yet. You can potentially use Mesos or Kubernetes if Juju is not an option. We have a set of open issues for Kubernetes right now, but Mesos should be working.
You can check https://hub.docker.com/r/restcomm/graylog-restcomm/; it contains a Docker image preloaded with everything needed to poll a Restcomm server and gather metrics.