I'm looking at a new Cloud Foundry architecture that uses multi-CPI with a single BOSH Director deployment. The BOSH Director is deployed in DC-A and manages two vCenters, one in DC-A and the other in DC-B. If DC-A goes offline, what are the options for running BOSH active/standby in DC-B so that it can immediately take over deployments without having to perform a backup and restore?
Yes, multi-DC BOSH deployments using multi-CPI work great! And your question comes up very often when people think about such a multi-DC design.
There is no high availability (HA) for a BOSH Director, and there is no active/passive setup that I'm aware of right now. The reason is that losing a BOSH Director is not a big deal. The nodes that this Director manages will keep running on top of the infrastructure; they just won't be "manageable" until you bring your Director back.
But if we think about the requirements for such an active/passive setup, here is what I would come up with:
They would have to share the exact same CPI installation and setup. Not a big deal.
They would have to share the same SQL database and the same blobstore (object storage). This is not a big deal, but it leads to using both an external SQL database and an external blobstore. The "passive" BOSH Director would then at least have to disable its resurrector plugin, in order not to compete with the resurrector of the "active" BOSH Director. (In fact, the passive BOSH Director would have to be completely stopped, see below.)
They would have to share the same NATS message bus, which usually is co-located on the BOSH Director and thus dedicated to it. It's easy to extract NATS from the Director and run it separately in a high-availability setup. But then the problem would be: which Director consumes NATS messages? Two Directors cannot compete in consuming those messages. That's why the "passive" BOSH Director would require its processes to be monit stop-ed, or the whole instance bosh stop-ed.
This bosh stop requirement cannot be achieved using the bosh CLI alone (v2, including its bosh-init-like create-env feature, which acts just like a local, stripped-down BOSH Director). So these two BOSH Directors would have to be deployed by a "bootstrap" BOSH Director (which is quite common; I've even seen up to four stages of this bootstrap Director pattern on some customer production environments!).
Now imagine you have it all: a bootstrap BOSH Director that deploys a separate HA NATS and two Directors, with the same CPI setup, the same external SQL database, and the same external blobstore. Then it would work! Whenever you lose the active one, you bosh start the passive one and it takes over. But then you should be careful that the previously active one doesn't suddenly pop back up, or it would compete in consuming NATS messages and possibly mess up the database and blobstore. That's where BOSH is missing a "lock" feature to allow only one active Director at a time. Something very simple could be implemented here, based on a database record designating which Director is active and which is passive. Switching this record manually would trigger the passive Director to become active.
This is a very good idea for the next Cloud Foundry hackathon!
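To make the missing "lock" idea more concrete, here is a minimal sketch of what such a database-backed active/passive switch could look like. Everything in it is hypothetical: the table name, columns, director IDs, and the polling loop are assumptions for illustration, not an existing BOSH feature, and the shared external SQL database is stubbed with SQLite.

```python
# Hypothetical sketch of the "single active Director" lock discussed above.
# Table name, columns, and director IDs are made up; BOSH does not ship this today.
import sqlite3
import time

DB_PATH = "director_lock.db"  # stand-in for the shared external SQL database

def init_lock(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS director_lock (
                        id INTEGER PRIMARY KEY CHECK (id = 1),
                        active_director TEXT NOT NULL)""")
    conn.execute("INSERT OR IGNORE INTO director_lock (id, active_director) "
                 "VALUES (1, 'director-a')")
    conn.commit()

def active_director(conn):
    row = conn.execute("SELECT active_director FROM director_lock WHERE id = 1").fetchone()
    return row[0]

def promote(conn, director_id):
    """Manually flip the record, e.g. when DC-A is known to be down."""
    conn.execute("UPDATE director_lock SET active_director = ? WHERE id = 1", (director_id,))
    conn.commit()

def run_director(conn, my_id):
    """Each Director polls the record and only runs its NATS-consuming
    processes (health monitor, resurrector, workers) while it holds the lock."""
    while True:
        if active_director(conn) == my_id:
            print(f"{my_id}: active - consuming NATS, resurrector enabled")
        else:
            print(f"{my_id}: passive - processes stopped (monit stop equivalent)")
        time.sleep(10)

if __name__ == "__main__":
    conn = sqlite3.connect(DB_PATH)
    init_lock(conn)
    promote(conn, "director-b")   # operator decides DC-B should take over
    print("active director is now:", active_director(conn))
```

In a real setup the shared external SQL database would hold this record, and the promotion would stay a deliberate, manual step so the previously active Director cannot race back in and start consuming NATS messages again.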
Here's how the story goes.
We started transforming a monolith, single-machine, e-commerce application (Apache/PHP) to cloud infrastructure. Obviously, the application and the database (MySQL) were on the same machine.
We decided to move to AWS. As the first step of the transformation, we split the database and the application: the application is hosted on a c4.xlarge machine, and the database on RDS Aurora MySQL on a db.r5.large machine with the default options.
This setup performed well; in particular, database performance went way up.
Unfortunately, when traffic spiked, we started experiencing long response times. It looked like RDS, although really fast at executing queries, wasn't returning results fast enough over the network to the EC2 machine.
That was our conclusion after an in-depth analysis of the setup, including Apache/MySQL/PHP tuning parameters: the delayed response time was definitely due to network latency between the EC2 and RDS/Aurora machines, both of which are in the same region.
Before adding additional resources (e.g., ElastiCache), we'd first like to look into any default configuration we can play around with to solve this problem.
What do you think we missed there?
One of the biggest strengths of the cloud is scalability, and you should always design your application to utilise it. It sounds like your RDS instance is getting choked by the number of requests rather than the processing time of the queries. So rather than one big instance doing all the work, go with more small instances behind load balancing. With load balancers you also get away from a single point of failure, because you can have replicas of your database and they can even be placed in different AZs.
Here is a blogpost you can read on the topic:
https://aws.amazon.com/blogs/database/scaling-your-amazon-rds-instance-vertically-and-horizontally/
Good luck on your AWS journey.
The best answer to your question is to use read replicas, but remember that only read requests can be sent to your read replicas, so you would need to design your application that way.
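To illustrate the "design your application that way" point, here is a minimal read/write splitting sketch against Aurora MySQL. The endpoint hostnames, credentials, and the pymysql driver are assumptions; Aurora does expose separate cluster (writer) and reader endpoints, but your actual names will differ.

```python
# Minimal read/write splitting sketch for Aurora MySQL.
# Endpoint hostnames and credentials are placeholders; pymysql is one possible driver.
import pymysql

WRITER_ENDPOINT = "mycluster.cluster-xxxx.us-east-1.rds.amazonaws.com"     # cluster (writer) endpoint
READER_ENDPOINT = "mycluster.cluster-ro-xxxx.us-east-1.rds.amazonaws.com"  # reader endpoint

def connect(host):
    return pymysql.connect(host=host, user="app", password="secret", database="shop")

def run_query(sql, params=None):
    # Very rough heuristic: SELECTs go to the reader endpoint, everything else to the writer.
    host = READER_ENDPOINT if sql.lstrip().upper().startswith("SELECT") else WRITER_ENDPOINT
    conn = connect(host)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params or ())
            return cur.fetchall()
    finally:
        conn.close()

# Reads scale out across replicas, writes still hit the single writer:
# run_query("SELECT id, name FROM products WHERE category = %s", ("books",))
# run_query("UPDATE orders SET status = 'shipped' WHERE id = %s", (42,))
```

In practice you would usually let your framework or a proxy handle this routing, but the principle is the same: the application has to know which statements can tolerate replica lag.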
Also, for some cost savings, you could try Aurora Serverless.
One more option: make sure the traffic between EC2 and RDS goes over the private network (within the VPC) rather than over the public internet. Connecting EC2 to RDS over the public internet can be one of the mistakes behind this kind of latency.
As mentioned in PCF's four levels of high availability, my understanding is that when an instance (process) fails, monit should recognize it, restart it, and then just send a report to BOSH. But if the whole VM goes down, it's BOSH's responsibility to recognize it and restart it.
With this belief I answered one question at: https://djitz.com/guides/pivotal-cloud-foundry-pcf-certification-exam-review-questions-and-answers-set-4-logging-scaling-and-high-availability/
Question and answer
In my view, the answer to this question should be option 3, but it says I'm wrong and the answer should be option 2. Now I'm confused, so please help me if my belief is wrong.
BOSH is responsible for creating a new instance for a failed VM.
I know there is not much information available on the internet about this, but if you get a chance, there is a tutorial on Pluralsight you can enroll in, where the instructor explains high availability very well.
But you can get a high-level idea from the PCF documents as well:
Process Monitoring: PCF uses a BOSH agent, monit, to monitor the processes on the component VMs that work together to keep your applications running, such as nsync, BBS, and Cell Rep. If monit detects a failure, it restarts the process and notifies the BOSH agent on the VM. The BOSH agent notifies the BOSH Health Monitor, which triggers responders through plugins such as email notifications or paging.
Resurrection for VMs: BOSH detects if a VM is present by listening for heartbeat messages that are sent from the BOSH agent every 60 seconds. The BOSH Health Monitor listens for those heartbeats. When the Health Monitor finds that a VM is not responding, it passes an alert to the Resurrector component. If the Resurrector is enabled, it sends the IaaS a request to create a new VM instance to replace the one that failed.
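The quoted behavior can be summed up in a few lines of Python-flavoured pseudocode. This is only a conceptual model of the flow described above (monit on the VM vs. Health Monitor and Resurrector on the Director side), not actual BOSH source code; the 60-second interval comes from the docs, while the missed-beat threshold and all names are illustrative assumptions.

```python
# Conceptual model of the monit / Health Monitor / Resurrector split described above.
# Not real BOSH code; names and thresholds are illustrative.
import time

HEARTBEAT_INTERVAL = 60          # agents heartbeat every 60 seconds (per the docs)
MISSED_BEATS_BEFORE_ALERT = 2    # assumed threshold, purely illustrative

def handle_process_failure(vm, process):
    # Handled on the VM itself: monit restarts the process and notifies
    # the BOSH agent; the Director only receives a report.
    print(f"monit on {vm}: restarting {process}, notifying BOSH agent")

def health_monitor_loop(last_heartbeat, resurrector_enabled=True):
    # Handled by the Director: no heartbeats from a VM -> alert the Resurrector,
    # which asks the IaaS to create a replacement VM.
    for vm, seen_at in last_heartbeat.items():
        if time.time() - seen_at > HEARTBEAT_INTERVAL * MISSED_BEATS_BEFORE_ALERT:
            print(f"Health Monitor: {vm} unresponsive, alerting Resurrector")
            if resurrector_enabled:
                print(f"Resurrector: asking the IaaS to recreate {vm}")
```

This is why a failed process is monit's job, while a failed VM is BOSH's (Health Monitor plus Resurrector) job.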
We have a Kubernetes cluster set up in an AWS VPC with 10+ nodes. We encountered an incident where one node was not reachable from the others (and vice versa) for ~10 minutes, and finding this out took quite a lot of time.
Is there a tool for Kubernetes or AWS to detect these kinds of network problems? Maybe something like a DaemonSet where each pod pings the others in the network and logs it when the ping fails.
If you are mostly interested in being alerted when such a problem happens, I would set up a monitoring system and hook it up with something like Alertmanager. For collecting metrics, you can look at an open source project such as Prometheus. Once you set this up, it is really easy to integrate it with Grafana (for dashboards) and Alertmanager (for alerting based on rules you specify in Prometheus). They are all open source projects.
https://prometheus.io/
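If you do want the DaemonSet-style probe from the question in addition to Prometheus, here is a rough sketch of the kind of script each pod could run. The kubelet port, probe interval, and RBAC assumptions (a service account allowed to list nodes) are all my own; in practice something like the Prometheus blackbox_exporter covers the same ground with less custom code.

```python
# Rough sketch of a per-node connectivity probe that could run as a DaemonSet.
# Assumptions: in-cluster service account with permission to list nodes,
# kubelet reachable on TCP 10250 from pods.
import logging
import socket
import time

from kubernetes import client, config

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("node-probe")

KUBELET_PORT = 10250
PROBE_INTERVAL = 30  # seconds between probe rounds
TIMEOUT = 3          # seconds per connection attempt

def node_internal_ips(v1):
    ips = []
    for node in v1.list_node().items:
        for addr in node.status.addresses:
            if addr.type == "InternalIP":
                ips.append((node.metadata.name, addr.address))
    return ips

def reachable(ip):
    try:
        with socket.create_connection((ip, KUBELET_PORT), timeout=TIMEOUT):
            return True
    except OSError:
        return False

def main():
    config.load_incluster_config()   # use config.load_kube_config() when testing locally
    v1 = client.CoreV1Api()
    while True:
        for name, ip in node_internal_ips(v1):
            if not reachable(ip):
                log.warning("cannot reach node %s (%s:%s)", name, ip, KUBELET_PORT)
        time.sleep(PROBE_INTERVAL)

if __name__ == "__main__":
    main()
```

Feeding these failures into Prometheus as a metric (instead of just logging) is what makes the Alertmanager route above worthwhile.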
I'm planning a high-availability setup with autoscaling for RestComm and have some general doubts about the best way to plan it.
This is what I have now:
Restcomm instance using Amazon ECS (docker), so we can launch more instances very easily.
All of them share the Amazon RDS database.
Workspace is shared and persisted between instances.
To move to the next step, I have some questions:
Amazon Load Balancer isn't an option because it doesn't support UDP, so I'm considering the Telestax LB; is that correct? Is it possible to deploy it using Docker?
Move the Restcomm MS (media server) outside of the Restcomm Docker image so it can scale independently. Restcomm provides env variables to specify the MS, so I would have an LB and several MS behind it. Correct?
How much RAM does a Restcomm instance need, and how many concurrent sessions does it support? How can we know how many concurrent sessions there are, in real time and programmatically?
Is there an "automatic scaling" mechanism implemented in RestComm? More info would be appreciated. Ubuntu Juju isn't an option for me.
We are considering Graylog2 or Logstash for log management. Any insight here? How do you install the agent in the Docker images?
The only documentation I found was this very good document: https://docs.google.com/document/d/13xlaioF065pDnQUoZgfIpi6Noh0qHfAZ7U6afcPd2Y0/edit
Is there any other doc?
Thanks in advance!
Very good questions:
Yes. See, https://hub.docker.com/r/restcomm/load-balancer/
You would have one LB (better to have two in active/passive to avoid a single point of failure) with X Restcomm instances behind it, speaking to Z Media Servers behind them.
It depends on the complexity of the application on top, but here are some numbers: https://github.com/RestComm/Restcomm-Connect/wiki/Load-Testing-on-Docker
Not yet. You can potentially use Mesos or Kubernetes if Juju is not an option. We have a set of open issues for Kubernetes right now, but Mesos should be working.
You can check https://hub.docker.com/r/restcomm/graylog-restcomm/; it contains a Docker image preloaded with everything needed to poll a Restcomm server for gathering metrics.
Curious if this is possible:
We have a web application that, most of the time, works just fine with our single small instance. However, when multiple customers run intense queries simultaneously (we are a cloud scheduling service), our instance bogs way down to near 80% CPU load and becomes pretty unresponsive.
Is there a way to have AWS fire up another small instance (or a few), quickly, only for the times that it's operating under this intense load? But the real question is: how does this work when we make very frequent programming updates to our application? Do we have to manually create a new image every time we upload a code change?
Thanks
You should never be running anything important on a single EC2 instance. Instances can--and do--go offline randomly. Always use an autoscaling (AS) group that spans multiple availability zones. An AS group will automatically bring new instances online when you hit a certain trigger (in your case, CPU utilization). And then it will scale down the instances when traffic subsides. Autoscaling is the heart and soul of AWS and if you're not using it, you might as well be using a cheaper (and more durable) VPS host.
No, you don't want to be creating a new AMI for each code release. Ideally you should use a base AMI (like one of Amazon's official ones) and have it auto-provision at boot. You can use the "user data" field when you launch an instance to bootstrap this process. It can be as simple as a bash script that pulls from your Git repo, or something as sophisticated as Puppet or Chef.
The only time I create custom AMIs is when the provisioning process just takes too long. However, that can almost always be solved by storing the needed files in S3.
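As a sketch of the base-AMI + user-data + CPU-triggered autoscaling approach described above, something like the following could be a starting point. All names, IDs, subnets, and the repo path are placeholders; the bootstrap script assumes a Git-deployed Apache/PHP app, which may not match your setup.

```python
# Sketch of "base AMI + user data bootstrap + autoscaling on CPU" using boto3.
# All names, IDs, subnets, and paths are placeholders for illustration.
import base64
import boto3

USER_DATA = """#!/bin/bash
# Pull the latest application code at boot instead of baking a new AMI per release.
cd /var/www/app && git pull origin main
systemctl restart apache2
"""

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

# Launch template: base AMI plus a user-data bootstrap script.
ec2.create_launch_template(
    LaunchTemplateName="webapp",
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",     # base AMI, e.g. an official Amazon Linux image
        "InstanceType": "t3.small",
        "UserData": base64.b64encode(USER_DATA.encode()).decode(),
    },
)

# Autoscaling group spanning subnets in two availability zones.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="webapp-asg",
    LaunchTemplate={"LaunchTemplateName": "webapp", "Version": "$Latest"},
    MinSize=1,
    MaxSize=4,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",
)

# Scale out when average CPU across the group exceeds roughly 60%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="webapp-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 60.0,
    },
)
```

With this in place, a code release is just a push to the repo (plus, if you want zero-touch rollout, an instance refresh on the group) rather than a new AMI every time.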