Communication between sidecar and control plane in Istio

I am currently doing research on the service mesh Istio in version 1.6. The data plane (the Envoy proxies) is configured by the control plane; in particular, Pilot (part of istiod) is responsible for propagating routing rules and configuration to the Envoys. I am wondering how this communication works.
Is it a single gRPC stream that is opened when the sidecar container first starts and stays open for the sidecar's whole lifecycle? If the mesh changes, does Pilot use this stream to inform Envoy about the changes via the xDS API, so that updates are based on a push strategy? Or does the sidecar poll for new configuration at a defined interval?
What is the role of the istio-agent (the former Pilot agent and Citadel agent) in the sidecar container? I am asking especially about the former Pilot agent; I know the Citadel agent is part of the CSR process. Does it poll for new configuration, or does it only bootstrap Envoy? And if so, why is it then always running?
Thanks in advance!

The best explanation of how the Istio-managed Envoy works is in the Envoy documentation. It is actually a lot more complicated than it seems:
Initialization
How Envoy initializes itself when it starts up is complex. This section explains at a high level how the process works. All of the following happens before any listeners start listening and accepting new connections.
During startup, the cluster manager goes through a multi-phase initialization where it first initializes static/DNS clusters, then predefined EDS clusters. Then it initializes CDS if applicable, waits for one response (or failure) for a bounded period of time, and does the same primary/secondary initialization of CDS provided clusters.
If clusters use active health checking, Envoy also does a single active health check round.
Once cluster manager initialization is done, RDS and LDS initialize (if applicable). The server waits for a bounded period of time for at least one response (or failure) for LDS/RDS requests. After which, it starts accepting connections.
If LDS itself returns a listener that needs an RDS response, Envoy further waits for a bounded period of time until an RDS response (or failure) is received. Note that this process takes place on every future listener addition via LDS and is known as listener warming.
After all of the previous steps have taken place, the listeners start accepting new connections. This flow ensures that during hot restart the new process is fully capable of accepting and processing new connections before the draining of the old process begins.
A key design principle of initialization is that an Envoy is always guaranteed to initialize within initial_fetch_timeout, with a best effort made to obtain the complete set of xDS configuration within that subject to the management server availability.
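For reference, here is a minimal, hand-written sketch of the kind of ADS-style bootstrap the istio-agent writes out before launching Envoy. It is not the literal file Istio generates; the node id, cluster name, address and port are illustrative assumptions. It shows the single long-lived gRPC stream (ADS) that LDS/CDS point at, over which the management server pushes updates, and the initial_fetch_timeout mentioned above:

```yaml
# Illustrative sketch only, not the actual Istio-generated bootstrap.
node:
  id: sidecar~10.0.0.1~example-pod.default~default.svc.cluster.local   # placeholder
  cluster: example
dynamic_resources:
  ads_config:
    api_type: GRPC
    grpc_services:
    - envoy_grpc:
        cluster_name: xds-grpc          # one gRPC stream carries all xDS resources
  cds_config:
    ads: {}
    initial_fetch_timeout: 10s          # the bounded wait described above
  lds_config:
    ads: {}
    initial_fetch_timeout: 10s
static_resources:
  clusters:
  - name: xds-grpc                      # where the ADS stream connects
    type: STRICT_DNS
    connect_timeout: 1s
    http2_protocol_options: {}
    load_assignment:
      cluster_name: xds-grpc
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: istiod.istio-system.svc   # illustrative address/port
                port_value: 15012
```

As far as the push-vs-poll part of the question goes: Envoy keeps this one stream open for its whole lifetime and Pilot pushes changes over it; the istio-agent's main jobs are to generate this bootstrap, pass certificates to Envoy, and manage the Envoy lifecycle, as the Istio documentation quoted below also describes.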
As for updating the Envoy configuration:
Runtime configuration
Envoy supports “runtime” configuration (also known as “feature flags” and “decider”). Configuration settings can be altered that will affect operation without needing to restart Envoy or change the primary configuration. The currently supported implementation uses a tree of file system files. Envoy watches for a symbolic link swap in a configured directory and reloads the tree when that happens. This type of system is very commonly deployed in large distributed systems. Other implementations would not be difficult to implement. Supported runtime configuration settings are documented in the relevant sections of the operations guide. Envoy will operate correctly with default runtime values and a “null” provider so it is not required that such a system exists to run Envoy.
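If you are curious what that looks like in the bootstrap, here is an illustrative-only sketch of a layered runtime with a disk layer that watches a symlinked directory (the paths and the example key are assumptions, not Istio's actual layout):

```yaml
# Illustrative sketch only: bootstrap fragment enabling file-system runtime.
layered_runtime:
  layers:
  - name: static_defaults
    static_layer:
      upstream.healthy_panic_threshold: 50   # example runtime key with its default value
  - name: disk
    disk_layer:
      symlink_root: /srv/runtime/current     # Envoy reloads the tree when this symlink is swapped
      subdirectory: envoy
```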
More information about how the Envoy proxy works can be found here.
According to the Istio documentation:
The benefit of consolidation: introducing istiod
Having established that many of the common benefits of microservices didn’t apply to the Istio control plane, we decided to unify them into a single binary: istiod (the ’d’ is for daemon).
Let’s look at the benefits of the new packaging:
Installation becomes easier. Fewer Kubernetes deployments and associated configurations are required, so the set of configuration options and flags for Istio is reduced significantly. In the simplest case, you can start the Istio control plane, with all features enabled, by starting a single Pod.
Configuration becomes easier. Many of the configuration options that Istio has today are ways to orchestrate the control plane components, and so are no longer needed. You also no longer need to change cluster-wide PodSecurityPolicy to deploy Istio.
Using VMs becomes easier. To add a workload to a mesh, you now just need to install one agent and the generated certificates. That agent connects back to only a single service.
Maintenance becomes easier. Installing, upgrading, and removing Istio no longer require a complicated dance of version dependencies and startup orders. For example: To upgrade, you only need to start a new istiod version alongside your existing control plane, canary it, and then move all traffic over to it.
Scalability becomes easier. There is now only one component to scale.
Debugging becomes easier. Fewer components means less cross-component environmental debugging.
Startup time goes down. Components no longer need to wait for each other to start in a defined order.
Resource usage goes down and responsiveness goes up. Communication between components becomes guaranteed, and not subject to gRPC size limits. Caches can be shared safely, which decreases the resource footprint as a result.
istiod unifies functionality that Pilot, Galley, Citadel and the sidecar injector previously performed, into a single binary.
A separate component, the istio-agent, helps each sidecar connect to the mesh by securely passing configuration and secrets to the Envoy proxies. While the agent, strictly speaking, is still part of the control plane, it runs on a per-pod basis. We’ve further simplified by rolling per-node functionality that used to run as a DaemonSet, into that per-pod agent.
Hope it helps.

Related

Using Istio circuit breakers in production

I'd like to start using the http2MaxRequests circuit breaker in my Kubernetes based Istio service mesh, but the semantics are befuddling to me:
on the destination side, it does what I would expect: each Envoy will accept at most the specified number of concurrent requests, and will then reject new requests. Importantly, the configuration is based on the concurrency supported by each instance of my application server; when I scale the number of instances up or down, I don't need to change this configuration.
on the source side, things are a lot messier. The http2MaxRequests setting specifies the number of concurrent requests from each source Envoy to the entire set of destination Envoys. I simply don't understand how this can possibly work in production!
First off, different source applications have wildly different concurrency models; we have an Nginx proxy that supports a massive amount of concurrent requests per instance, and we have internal services that support a handful. Those source applications will themselves be scaled very differently, with e.g. Nginx having very few instances that make lots of requests. I would need a DestinationRule for each source application pertaining to each destination service.
Even then, the http2MaxRequests setting limits the concurrent requests to the entire destination service, not to each individual instance (endpoint). This fundamentally breaks the notion of autoscaling, the entire point of which is to vary the capacity as needed.
What I would love to see is a way to enforce at the source side a destination-instance concurrency limit; that is, to tie the limit not to the entire Envoy cluster, but to each Envoy endpoint instead. That would make it actually useful for me, as I could then base the configuration on the concurrency supported by each instance of my service, which is a constant factor.
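For reference, this is roughly where the setting lives in a DestinationRule; a minimal sketch with placeholder names and values, not a recommendation:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-cb                             # placeholder name
spec:
  host: my-service.default.svc.cluster.local      # placeholder host
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 100                     # the concurrent-request cap discussed above
```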

Istio shifting service implementation on failure or active/active and active/passive services

I want to know how I can have different implementations of the same service and switch traffic from one to the other when failures start to occur (active/passive), or have traffic go from a 50%/50% split to a 0%/100% split when service implementation A is not responding. I would expect the 50/50 split to be restored once implementation A starts working again.
For example, I want to have a payment service and I have one implementation with Cybersource and another with Stripe (or whatever other provider makes sense). My implementations will start returning 504 when they detect that response times from one of the providers are above a certain threshold, or a good old 500 because a bug occurred. At that point, I want the clients to only connect to the fastest (properly working) implementation for a while, and to gradually retry the failed implementation once the health probe gives it a green light.
Similarly for an active/passive scenario perhaps I have a search API and I want all traffic to go to implementation A. However, when that implementation starts returning 5XX, I want traffic to be routed to implementation B which is perhaps offering a degraded experience, but can be used as a backup implementation.
When I read the Istio documentation, blogs, etc., I don't see the scenarios above. Perhaps Istio is not the right choice for that?
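As far as I know, the closest built-in mechanism is outlier detection (passive health checking) on a DestinationRule: endpoints that keep returning 5xx are temporarily ejected from the load-balancing pool and re-admitted later, which gives you the "stop sending traffic to A until it recovers" behaviour at the endpoint level. A hedged sketch with placeholder names and values:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-outlier                           # placeholder name
spec:
  host: payment.default.svc.cluster.local         # placeholder host
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5       # eject an endpoint after 5 consecutive 5xx responses
      interval: 10s                 # how often endpoints are analysed
      baseEjectionTime: 30s         # ejection duration (grows with repeated ejections)
      maxEjectionPercent: 100       # allow ejecting every endpoint of a failing implementation
```

Note that this ejects failing endpoints rather than rewriting VirtualService weights, so an automatic 50/50 to 0/100 shift of declared weights is not something Istio does out of the box; the shift happens implicitly because ejected endpoints receive no traffic until they recover.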

Kubernetes: How to connect one pod to another on an arbitrary port - with or without services?

We are currently transitioning our apps to Kubernetes and I have two apps, appP and appH, that need to communicate with each other over a port unknown at start-up time.
Unlike most of our apps, we don't have a set port for them to communicate over. Before Kubernetes, a third-party app (out of my control) would tell appP to start processing an item, itemA, identified by a unique id, and it would also tell appH to handle the processed data produced by appP.
To coordinate communication between appP and appH, appH would generate a port based on the unique id and publish the host and port info to connect on to an intermediate app (IA). appP, once done with its processing, queries IA for the connection information based on the unique id and sends the data over.
Now we have to adapt this to Kubernetes. Each app runs in its own deployment, as does the IA. So how can I set up appH to accept the connection over a port without being able to specify it in the service definition?
Note: I've seen some posts saying that pods should be able to communicate with any other pods in the cluster regardless of the ports specified in the service definition, but I can't find much information confirming this, and I don't have much free time on our cluster to bang my head against it.
Would it work just fine as is regardless? My biggest worry is the IP resolution. Currently appH grabs its IP based on the host it's running on (using boost). I am not sure how this resolves within a container.
If not, my next thought would be to set up a headless service with a selector for appH in order to allow for IP resolution. What I am unsure of then is whether I could have appP connect to <appH_Service>:<arbitrary_port>?
Would the service even have to be headless in this scenario? I mostly say headless with a selector because I saw in one specific post that it is the only kind for which you don't need a port in the spec. Also, I am unsure whether the connection would go through unless appP was connecting to the actual pod's IP, rather than the service's.
Any info or clarification is appreciated. For the most part, I can't really change the architecture of these apps right now, I just have to get them talking to each other as is and haven't found a ton of clear information on this type of case.
Note: We use helm and coredns if anyone is curious.
The Kubernetes networking model is as follows: a Pod is a group of containers that share a single network identity (a cluster-internal IP address). Any port exposed by a container is thus automatically exposed on the Pod. The model demands that every Pod can communicate with every other Pod.
This means that your current design can work without modifications.
What Services bring to the table is a stable network identity for a group of Pods that is otherwise very volatile. That does not apply to your appP/appH coupling, I think.
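As a sketch of the headless-Service idea from the question (names and labels are placeholders): with clusterIP: None and a selector, the DNS name resolves straight to the Pod IPs, and since Services do not filter traffic, appP can then dial whatever port appH happens to be listening on.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: apph                # placeholder name
spec:
  clusterIP: None           # headless: DNS for "apph" returns the Pod IPs directly
  selector:
    app: apph               # assumed Pod label
  ports:
  - name: placeholder       # listed only for DNS/endpoint records; connections from
    port: 5000              # appP are not restricted to this port
```

Whether you need the Service at all depends on how appH publishes its address to IA: if it publishes its own Pod IP, appP can connect to <podIP>:<arbitrary_port> directly, exactly as described above.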

Coordinating master and worker machines

If this question seems basic to more IT-oriented folks, then I apologize in advance. I'm not sure it falls under the ServerFault domain, but correct me if I'm wrong...
This question concerns some backend operations of a web application, hosted in a cloud environment (Google). I'm trying to assess options for coordinating our various virtual machines. I'll describe what we currently have, and those "in the know" can maybe suggest a better way (I hope!).
In our application there are a number of different analyses that can be run, each of which has different hardware requirements. They are typically very large, and we do NOT want these to be run on the application server (referred to as app_server below).
To that end, when we start one of these analyses, app_server will start a new VM (call this VM1). For some of these analyses, we only need VM1; it performs the analysis and sends an HTTP POST request back to app_server to let it know the work is complete.
For other analyses, VM1 will in turn launch a number of worker machines (worker-1,...,worker-N), which run very similar tasks in parallel. Once the task on a single worker (e.g. worker-K) is complete, it should communicate back to VM1: "hey, this is worker-K and I am done!". Once all the workers (worker-1,...,worker-N) are complete, VM1 does some merging operations and finally communicates back to app_server.
My question is:
Aside from starting a web server on VM1 which listens for POST requests from the workers (worker-1,..), what are the potential mechanisms for having those workers communicate back to VM1? Are there non-webserver ways to listen for HTTP POST requests and do something with the request?
I should note that all of my VMs are operating within the same region/zone on GCE, so they are able to communicate via internal IPs without any special firewall rules, etc. (e.g. running $ ping <other VM's IP addr> works). I obviously do not want any of these VMs (VM1, worker-1, ..., worker-N) to be exposed to the internet.
Thanks!
Sounds like the right use-case for Cloud Pub/Sub. https://cloud.google.com/pubsub
In your case, the workers would be publishing events to the queue and VM1 would be subscribing to them.
Hard to tell from your high-level overview whether it is a match, but take a look at Cloud Composer too: https://cloud.google.com/composer/

Redundancy without central control point?

Is it possible to provide a service to multiple clients whereby, if the server providing this service goes down, another one takes its place, without some sort of centralised "control" that detects whether the main server has gone down and redirects the clients to the new server?
Is it possible to do without having a centralised interface/gateway?
In other words, it's a bit like asking: can you design a node balancer without having a centralised control to direct clients?
Well, you are not giving much information about the "service" you are asking about, so I'll answer in a generic way.
For the first part of my answer, I'll assume you are talking about a "centralized interface/gateway" involving IP addresses. For this, there's CARP (Common Address Redundancy Protocol); quoting from the wiki:
The Common Address Redundancy Protocol or CARP is a protocol which
allows multiple hosts on the same local network to share a set of IP
addresses. Its primary purpose is to provide failover redundancy,
especially when used with firewalls and routers. In some
configurations CARP can also provide load balancing functionality. It
is a free, non patent-encumbered alternative to Cisco's HSRP. CARP is
mostly implemented in BSD operating systems.
Quoting NetBSD's "Introduction to CARP":
CARP works by allowing a group of hosts on the same network segment to
share an IP address. This group of hosts is referred to as a
"redundancy group". The redundancy group is assigned an IP address
that is shared amongst the group members. Within the group, one host
is designated the "master" and the rest as "backups". The master host
is the one that currently "holds" the shared IP; it responds to any
traffic or ARP requests directed towards it. Each host may belong to
more than one redundancy group at a time.
This might solve your question at the network level, by having the backups take over the IP address in order, without a single point of failure.
Now, for the second part of the answer (the application level): with distributed Erlang you can have several nodes (a cluster) that give you fault tolerance and redundancy (so you would not use IP addresses here, but "distributed Erlang", a cluster of Erlang nodes, instead).
You would have lots of nodes lying around with your Distributed Application started, and your application resource file would contain an (ordered) list of nodes where the application can be run.
Distributed Erlang will control which of the nodes is "the master" and will automagically start and stop your application on the different nodes as they go up and down.
Quoting (as little as possible) from http://www.erlang.org/doc/design_principles/distributed_applications.html:
In a distributed system with several Erlang nodes, there may be a need
to control applications in a distributed manner. If the node, where a
certain application is running, goes down, the application should be
restarted at another node.
The application will be started at the first node, specified by the
distributed configuration parameter, which is up and running. The
application is started as usual.
For distribution of application control to work properly, the nodes
where a distributed application may run must contact each other and
negotiate where to start the application.
When started, the node will wait for all nodes specified by
sync_nodes_mandatory and sync_nodes_optional to come up. When all
nodes have come up, or when all mandatory nodes have come up and the
time specified by sync_nodes_timeout has elapsed, all applications
will be started. If not all mandatory nodes have come up, the node
will terminate.
If the node where the application is running goes down, the
application is restarted (after the specified timeout) at the first
node, specified by the distributed configuration parameter, which is
up and running. This is called a failover
distributed = [{Application, [Timeout,] NodeDesc}]
If a node is started, which has higher priority according to
distributed, than the node where a distributed application is
currently running, the application will be restarted at the new node
and stopped at the old node. This is called a takeover.
Ok, that was meant as a general overview, since it can be a long topic :)
For the specific details, it is highly recommended to read the Distributed OTP Applications chapter of learnyousomeerlang (and of course the previous link: http://www.erlang.org/doc/design_principles/distributed_applications.html).
Also, your "service" might depend on other external systems like databases, so you should consider fault tolerance and redundancy there, too. The whole architecture needs to be fault tolerant and distributed for "the service" to work in this way.
Hope it helps!
This answer is a general overview of high availability for networked applications, not specific to Erlang. I don't know too much about what is available in the OTP framework yet because I am new to the language.
There are a few different problems here:
Client connection must be moved to the backup machine
The session may contain state data
How to detect a crash
Problem 1 - Moving client connection
This may be solved in many different ways and on different layers of the network architecture. The easiest thing is to code it right into the client, so that when a connection is lost it reconnects to another machine.
If you need network transparency you may use some technology to sync TCP states between different machines and then reroute all traffic to the new machine, which may be entirely invisible for the client. This is much harder to do than the first suggestion.
I'm sure there are lots of things to do in-between these two.
Problem 2 - State data
You obviously need to transfer the session state from the crashed machine onto the backup machine. This is really hard to do in a reliable way, and you may lose the last few transactions because the crashed machine may not be able to send the last state before the crash. You can use a synchronized call like this to be really sure about not losing state:
Transaction/message comes from the client into the main machine.
Main machine updates some state.
New state is sent to backup machine.
Backup machine confirms arrival of the new state.
Main machine confirms success to the client.
This may potentially be expensive (or at least not responsive enough) in some scenarios since you depend on the backup machine and the connection to it, including latency, before even confirming anything to the client. To make it perform better you can let the client check with the backup machine upon connection what transactions it received and then resend the lost ones, making it the client's responsibility to queue the work.
Problem 3 - Detecting a crash
This is an interesting problem because a crash is not always well-defined. Did something really crash? Consider a network problem that closes the connection between the client and server while both are still up and connected to the network. Or worse, one that makes the client disconnect from the server without the server noticing. Here are some questions to think about:
Should the client connect to the backup machine?
What if the main server updates some state and sends it to the backup machine while the backup has the real client connected? Will there be a data race?
Can both the main and backup machine be up at the same time or do you need to shut down work on one of them and move all sessions?
Do you need some sort of authority on this matter, some protocol to decide which one is master and which one is slave? Who is that authority? How do you decentralise it?
What if your nodes lose the connection between them but both continue to work as expected (this is called a network partition)?
See Google's paper "Chubby lock server" (PDF) and "Paxos made live" (PDF) to get an idea.
Briefly, this solution involves using a consensus protocol to elect a master among a group of servers that handles all the requests. If the master fails, the protocol is used again to elect the next master.
Also, see gen_leader for an example of leader election which handles detecting failures and transferring service ownership.