Disaster Recovery options on AWS - amazon-web-services

We are running on AWS, where we keep everything in one region and spread our services across AZs. So if an AZ failed, we would still be "up" and serving our customers.
The Reliability Pillar of the AWS Well-Architected documentation suggests that this is enough in the case of a failure:
Unless you require a multi-region strategy, we advise you to meet your
recovery objectives in AWS using multiple Availability Zones within an
AWS Region.
I see tools out there like CloudEndure and Druva CloudRanger, but they seem to be aimed more at on-premises setups or other cloud providers migrating to or recovering on AWS.
My question is this: it is hard to find a definitive answer, but it appears that regions never go down, whereas services within an AZ, or a whole AZ, may go down. So if you are using multiple AZs for your applications and databases and doing backups to S3 (with Cross-Region Replication), is this enough for DR?

Regions may not go down but they can become functionally unusable. There was an outage of eu-west-2a about 3 months ago that rendered large parts of eu-west-2 more-or-less unusable.
If you want redundancy, you should be mirroring infra to at least one other region.
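For the S3 part mentioned in the question, here is a minimal sketch of what enabling Cross-Region Replication could look like with boto3. The bucket names, role ARN and rule ID are placeholders, not anything from the original setup, and versioning has to be enabled on both buckets before S3 accepts the configuration.

```python
def build_replication_config(role_arn, dest_bucket, rule_id="dr-replication"):
    """Return a ReplicationConfiguration dict for S3 put_bucket_replication.

    role_arn, dest_bucket and rule_id are illustrative placeholders.
    """
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # an empty filter applies the rule to all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
            }
        ],
    }


def enable_replication(source_bucket, role_arn, dest_bucket):
    """Apply the replication rule to source_bucket (needs AWS credentials)."""
    import boto3  # third-party dependency, imported only when actually used

    s3 = boto3.client("s3")
    s3.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=build_replication_config(role_arn, dest_bucket),
    )
```

This only covers the backup data. Mirroring compute and databases to a second region, as suggested above, is a separate piece of work.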

Related

AWS Choosing a region or Choose budget

So, as the question title says,
How should we architect the solution using AWS?
Do we need to pick the region first, assuming we might use all the features in future, or should we stick with a region which is near us and migrate to other regions for additional services when needed?
How is this generally decided?

The difference in cost between regions is fairly negligible for most services, but it is obviously worth noting if you're on a very tight budget.
Regarding availability, new services are most commonly available on day one in the following regions:
us-east-1
us-west-1
eu-west-1
You generally find that services are rolled out to other regions within a few weeks or months, with the exception of the China and GovCloud regions, which can see a more significant delay.
New regions are generally launched with a core set of services such as EC2, S3, RDS etc., and the remaining services are added after launch.
If your application is client facing (a client interacts directly with the application, through either a web browser or a service API), then I believe geographical location can be more important than pricing, to a degree. Delivering the best possible experience to the client is, in my opinion, more beneficial. For example, us-east-1 might be cheaper, but that is of little help if your clients are based in Europe.
If you want the cutting edge, the regions listed above will almost always be current. Obviously you need to weigh all of these factors and decide based on what is most important for your use case.

Is there enough capacity in all AWS regions for disaster recovery?

In case of a disaster, when an entire AWS region fails and all its customers want to move their workloads to the next closest region in a disaster recovery scenario, is AWS ready for this?
I imagine millions of servers running in each region. Is AWS ready to provision them in another region the next day? Do they have that capacity at the ready?

AWS global infrastructure uses the concept of Availability Zones inside each region to partition resources, isolate risks and ultimately reduce the blast radius of a failure. AZs are groups of datacenters within a region that are designed to be independent of each other in terms of risk (i.e. different connections to the power grid, redundant and isolated network infrastructure, and isolation from geographical risks such as earthquakes, flooding etc.).
Some services are designed to take advantage of this redundant infrastructure automatically (Amazon S3, Amazon DynamoDB, ELB etc.). Customers do not need to configure anything; redundancy and failover at the regional level are handled by the service. Other services operate at the AZ level (Amazon EC2, EBS, RDS etc.). For these services, the best practice is to design for a multi-AZ architecture and data replication.
In the very unlikely case that a service is not available in an AZ, a well-architected application will transparently fail over to another AZ, without any noticeable customer impact.
Back to your question: the architecture is designed to avoid a region-wide failure of all services. This has never happened since we launched AWS in 2006. And, yes, we have a lot of capacity. I suggest watching this keynote from James Hamilton to learn more about it: https://www.youtube.com/watch?v=AyOAjFNPAbA

How to improve the performance of a Django application across different geographic regions?

I have a Django application that is hosted on an AWS box located in the us-east-1 geographic region using Nginx and django-channels. Recently, I have had some users in the ap-southeast-1 region complain that my app is not very responsive. The app runs fine for me as I am using us-east-1.
1. How can I detect that poor performance is happening in a region before a user complains?
2. What can I do to improve the app performance and user experience in the ap-southeast-1 region?
3. Is there any way to test the performance in another geographic region as part of unit-testing or something similar?
I have a feeling the answer for #2 will have something to do with: (A) Adding another web server in ap-southeast-1 and (B) caching, but I'm keen to hear if there are additional things I should be doing.
However, I have no clue how to detect that slow performance is happening in other regions in the first place, or how to test that it does not happen again in the future.

Yes, optimally you should have a server wherever you have users. However, if multiple servers in different regions have to talk to the same database, you will still have latency issues when the server communicates with the database in another region.
The best solution would be to have your full stack, servers and databases, in all supported regions and use cross-region replication to ensure that all regions share the same data. This is supported for some AWS databases such as DynamoDB and RDS.
As your architecture gets more complex, it may be a good idea to use CloudFormation to manage your stack in each region so that everything is kept up to date.
As for detecting performance problems, CloudWatch is a good tool for monitoring your AWS resources. Depending on which AWS resource you are using for your server, it should expose some metrics that measure response times.
As for testing performance, you could look into creating a dev/test version of your server in another region and using a proxy to access it. Then just use CloudWatch to see how long those requests take.
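To complement the CloudWatch metrics, a small probe run from a machine in the affected region can time real requests against the app. The following is only a sketch using the standard library; the URL, sample count and threshold are assumptions you would replace with your own.

```python
import statistics
import time
import urllib.request


def http_get(url, timeout=10.0):
    """Fetch url and read the whole body so the full response is timed."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def measure_latency(fetch, samples=5, clock=time.perf_counter):
    """Call fetch() `samples` times and return each call's duration in seconds."""
    durations = []
    for _ in range(samples):
        start = clock()
        fetch()
        durations.append(clock() - start)
    return durations


def is_slow(durations, threshold_seconds):
    """Return True when the median duration exceeds the threshold."""
    return statistics.median(durations) > threshold_seconds
```

For example, a cron job on a small instance in ap-southeast-1 could call `measure_latency(lambda: http_get("https://example.com/health/"))` (a hypothetical URL) and raise an alert when `is_slow(durations, 1.0)` is true. The median is used so that a single slow request does not trigger the alert.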

How are Amazon RDS Database Instances Provisioned?

I've been considering moving some databases from self-hosted database instances (e.g. MySQL or PostgreSQL on Linux, either bare-metal or within AWS itself) into Amazon RDS, but it's unclear to me how everything will behave once I've created the database and it's time for maintenance to begin.
For example, I have to choose the type of instance(s) that will be used for the database, which I guess means how responsive everything will be, and there is an option for multi-AZ deployments, but it's not clear how many of those types of instances I'm actually configuring. (Presumably, multi-AZ deployment requires at least two instances).
There are options for Failover, which leads me to believe that I can rely on the service to stay up if there are problems with an instance, but then there is also a section for selecting maintenance windows for automated upgrades, which I find confusing. If I were administering e.g. a two-instance MySQL setup, I'd upgrade one instance and then the other to avoid any downtime. Is that not how RDS behaves?
RDS advertises support for automatic "minor version upgrades" (yes, please), but doesn't say anything about OS upgrades. Presumably, the db engine will be running on Amazon Linux or something similar, and will periodically require updates to those packages. Does that all happen automatically, or do I need to manually perform those upgrades, etc.?
The whole point of using something like RDS is that the service should become something I no longer have to worry about: I don't have to deal with package maintenance, upgrades, failover, or unexpected downtime (as long as I pay enough, of course). But all of the options for the RDS instance are making me skeptical of the advantages provided by RDS over just running everything myself.
Can anyone with experience with AWS RDS comment on their experiences with maintenance, upgrades, and failover?

These were the same concerns we had when we were planning to use RDS. Now that we are using AWS RDS for multiple production workloads, let me try to answer your questions. Hope this helps.
Your Question 1 : I have to choose the type of instance(s) that will be used for the database, which I guess means how responsive everything will be
Answer : Yes. This defines the capacity (CPU, RAM etc.) you will need for your database workload.
Your Question 2 : There is an option for multi-AZ deployments, but it's not clear how many of those types of instances I'm actually configuring.
Answer : Multi-AZ deployments are there to ensure high availability. AZs (Availability Zones) are isolated locations within an AWS Region that provide better protection against disaster scenarios. So when we choose a Multi-AZ deployment, RDS will place 2 instances of your database server in 2 Availability Zones in the region where you are provisioning.
This is done automatically by RDS, and we don't have to set up or maintain 2 servers separately or manually. (Note : Your VPC should have at least 1 subnet in each of the 2 different AZs to provision a Multi-AZ setup.)
Your Question 3: If I were administering e.g. a two-instance MySQL setup, I'd upgrade one instance and then the other to avoid any downtime. Is that not how RDS behaves?
Answer : Yes. RDS does this by itself, without manual intervention, if you enable automatic upgrades while setting up RDS (only if you choose the Multi-AZ option).
Your Question 4 : RDS advertises support for automatic "minor version upgrades" (yes, please), but doesn't say anything about OS upgrades.
Answer : RDS doesn't expose or provide any OS access to us. The underlying OS and its upgrades and other activities are all handled without affecting the RDS services hosted on top of it. We don't have to do anything about the OS of RDS, so we can forget about that part.
Your Question 5 : Regarding failover of an AWS RDS Multi-AZ database
I would classify this into 2 cases.
Case 1 : Failovers required during maintenance or other automatic activities performed on a Multi-AZ RDS instance.
Here, RDS will automatically do the failover one instance at a time. It will first move all the ongoing traffic to the second instance, then upgrade/reboot the first instance, and then do the same with the second instance.
Case 2 : Failovers required during a manual reboot or other manually triggered actions on a Multi-AZ RDS instance.
In this case, during the reboot, AWS RDS provides an option for you to select whether the reboot should be with failover or without one.
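As an illustration of the settings discussed above, here is a rough boto3 sketch that turns on Multi-AZ, automatic minor version upgrades and a maintenance window for an existing instance, and triggers a reboot with failover. The instance identifier and the window are made-up examples.

```python
def multi_az_settings(identifier, maintenance_window):
    """Build keyword arguments for the RDS modify_db_instance call.

    maintenance_window uses the ddd:hh24:mi-ddd:hh24:mi format (UTC),
    for example "sun:03:00-sun:04:00".
    """
    return {
        "DBInstanceIdentifier": identifier,
        "MultiAZ": True,
        "AutoMinorVersionUpgrade": True,
        "PreferredMaintenanceWindow": maintenance_window,
        "ApplyImmediately": False,  # defer the change to the maintenance window
    }


def apply_multi_az(identifier, maintenance_window):
    """Apply the settings to an existing instance (needs AWS credentials)."""
    import boto3  # third-party dependency, imported only when actually used

    rds = boto3.client("rds")
    rds.modify_db_instance(**multi_az_settings(identifier, maintenance_window))


def reboot_with_failover(identifier):
    """Case 2 above: a manual reboot that forces a failover to the standby."""
    import boto3

    rds = boto3.client("rds")
    rds.reboot_db_instance(DBInstanceIdentifier=identifier, ForceFailover=True)
```

Calling `reboot_with_failover` on a test instance is also a simple way to see how your application behaves during a failover before you rely on it in production.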

AWS - HA NFS - Best practices

Anyone have a sound strategy for implementing NFS on AWS in such a way that it's not a SPoF (single point of failure), or at the very least, be able to recover quickly if an instance crashes?
I've read this SO post, relating to the ability to share files with multiple EC2 instances, but it doesn't answer the question of how to ensure HA with NFS on AWS, just that NFS can be used.
A lot of online assets are saying that AWS EFS is available, but it is still in preview mode and only available in the Oregon region, our primary VPC is located in N. Cali., so can't use this option.
Other online assets are saying that GlusterFS is a way to go, but after some research I just don't feel comfortable implementing this solution due to race conditions and performance concerns.
Another option is SoftNAS, but I want to avoid bringing an unknown AMI into a tightly controlled, homogeneous environment.
Which leaves NFS. NFS is what we use in our dev environment and works fine, but it's dev, so if it crashes we go get a couple beers while systems fixes the problem, but on production, this is obviously a no go.
The best solution I can come up with at this point is to create an EBS and two EC2 instances. Both instances will be updated as normal (via puppet) to maintain stack alignment (kernel, nfs libs etc), but only one instance will mount the EBS. We set up a monitor on the active NFS instance, and if it goes down, we are notified and we manually detach and attach to the backup EC2 instance. I'm thinking we also create a network interface that can also be de/re-attached so we only need to maintain a single IP in DNS.
Although I suppose we could do this automatically with keepalived and an IAM policy that allows the automatic detachment/re-attachment.
--UPDATE--
It looks like EBS volumes are tied to specific availability zones, so re-attaching to an instance in another AZ is impossible. The only other option I can think of is:
Create EC2 in each AZ, in public subnet (each have EIP)
Create route 53 healthcheck for TCP:2049
Create route 53 failover policies for nfs-1 (AZ1) and nfs-2 (AZ2)
The only question here is, what's the best way to keep the two NFS servers in-sync? Just cron an rsync script between them?
Or is there a best practice that I am completely missing?
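For reference, the Route 53 steps in the update could be sketched with boto3 roughly as below. The hosted zone ID, record name and IP addresses are placeholders, and the interval and threshold values are just common choices, not recommendations.

```python
import uuid


def tcp_health_check_config(ip_address, port=2049):
    """HealthCheckConfig for a TCP check against the NFS port."""
    return {
        "IPAddress": ip_address,
        "Port": port,
        "Type": "TCP",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    }


def failover_record_change(name, ip_address, role, health_check_id, ttl=60):
    """Build one UPSERT change for a failover A record (PRIMARY or SECONDARY)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"nfs-{role.lower()}",
            "Failover": role,
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip_address}],
            "HealthCheckId": health_check_id,
        },
    }


def apply_failover(hosted_zone_id, name, servers):
    """servers is a list of (role, ip) pairs, one per NFS instance."""
    import boto3  # third-party dependency, imported only when actually used

    route53 = boto3.client("route53")
    changes = []
    for role, ip in servers:
        check = route53.create_health_check(
            CallerReference=str(uuid.uuid4()),
            HealthCheckConfig=tcp_health_check_config(ip),
        )
        changes.append(
            failover_record_change(name, ip, role, check["HealthCheck"]["Id"])
        )
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch={"Changes": changes}
    )
```

This only handles the DNS side. It does nothing about keeping the data on the two servers in sync, which is still the open question.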

There are a few options for building a highly available NFS server. I would prefer EFS or GlusterFS, though, because all of the following solutions have their downsides.
a) DRBD
It is possible to synchronize volumes with the help of DRBD. This allows you to mirror your data. Use two EC2 instances in different availability zones for high availability. Downside: configuration and operation are complex.
b) EBS Snapshots
If an RPO of more than 30 minutes is acceptable, you can use periodic EBS snapshots to be able to recover from an outage in another availability zone. This can be achieved with an Auto Scaling Group running a single EC2 instance, a user-data script and a cronjob for periodic EBS snapshots. Downside: RPO > 30 min.
c) S3 Synchronisation
It is possible to synchronize the state of an EC2 instance acting as NFS server to S3. The standby server uses S3 to stay up to date. Downside: S3 sync of lots of small files will take too long.
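To make option b) a bit more concrete, here is a small sketch of what the cronjob could run. The volume ID is a placeholder and the 30 minute interval simply mirrors the RPO mentioned above; how you track the time of the last snapshot is up to you.

```python
from datetime import datetime, timedelta, timezone


def snapshot_due(last_snapshot_at, now, rpo=timedelta(minutes=30)):
    """Return True when no snapshot exists yet or the last one is older than the RPO."""
    return last_snapshot_at is None or now - last_snapshot_at >= rpo


def snapshot_volume(volume_id):
    """Start an EBS snapshot and return its ID (needs AWS credentials)."""
    import boto3  # third-party dependency, imported only when actually used

    ec2 = boto3.client("ec2")
    response = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="periodic NFS data snapshot",
    )
    return response["SnapshotId"]


def run_once(volume_id, last_snapshot_at):
    """Take a snapshot if one is due; return the new snapshot ID or None."""
    if snapshot_due(last_snapshot_at, datetime.now(timezone.utc)):
        return snapshot_volume(volume_id)
    return None
```

Because snapshots are stored regionally rather than in a single AZ, the replacement instance started by the Auto Scaling Group can create a new volume from the latest snapshot in whichever AZ it lands in.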
I recommend watching this talk from AWS re:Invent: https://youtu.be/xbuiIwEOCAs

AWS has reviewed and approved a number of SoftNAS AMIs, which are available on AWS Marketplace. The jointly published SoftNAS Architecture on AWS White Paper provides more details:
Security (pages 4-11)
HA across AZs (pages 13-14)
You can also try a 30 day free trial to see if it meets your needs.
http://softnas.com/tryaws
Full disclosure: I work for SoftNAS.
