AWS multi-region high availability architecture for serverless stack

I am in the process of coming up with a multi-region high availability (active-active) architecture for my product. A simplified version of our stack: we use Lambda to implement our microservices, which are exposed as APIs using API Gateway. These microservices integrate with downstream services and databases such as DynamoDB and Aurora RDS. So the flow is:
Route 53 >> API Gateway >> Lambda >> Downstream service/Database
I am trying to figure out the best way to configure Route 53 so that it detects when any service in the stack fails and routes incoming requests to another region. For example, if the Lambda service in region-1 fails, that case is easy: I would create health check records pointing to these Lambdas, and once they are unreachable, Route 53 will itself route the next requests to region-2.
However, if a downstream resource that Lambda depends on fails, e.g. RDS, how will Route 53 know this so that it routes the next requests to region-2?
Appreciate any pointers on this.

It depends a bit on your envisioned failover setup.
Let us assume you have two regions: region1 and region2.
Now you could have two failure scenarios:
Lambda fails in region1 => you failover to Lambda in region2
RDS fails in region1 => you failover to RDS in region2
In both cases you need to ask yourself: what do I want to do? If, for example, in case 1 you connect from Lambda in region2 to RDS in region1, then high cross-region transfer costs may occur, so you may want to trigger a failover of RDS to region2 in any case.
Note: Generally it is very advisable not to connect Lambda directly to RDS, but to use RDS Proxy instead (to avoid hammering the database with connections, slowing it down, etc.): https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-proxy.html
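For illustration, connecting through RDS Proxy from Lambda looks the same as connecting to the database directly; only the host changes to the proxy endpoint. A minimal sketch, assuming a MySQL-compatible engine, the pymysql library, and hypothetical endpoint/credential values supplied via environment variables:

```python
import os
import pymysql

# Hypothetical values; in practice these would come from your own
# configuration or AWS Secrets Manager.
PROXY_ENDPOINT = os.environ["DB_PROXY_ENDPOINT"]  # e.g. my-proxy.proxy-xxxx.eu-west-1.rds.amazonaws.com
DB_USER = os.environ["DB_USER"]
DB_PASSWORD = os.environ["DB_PASSWORD"]
DB_NAME = os.environ["DB_NAME"]

def handler(event, context):
    # The proxy pools and reuses backend connections, so many short-lived
    # Lambda connections no longer hammer the database itself.
    conn = pymysql.connect(
        host=PROXY_ENDPOINT,
        user=DB_USER,
        password=DB_PASSWORD,
        database=DB_NAME,
        connect_timeout=5,
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            return {"ok": cur.fetchone() is not None}
    finally:
        conn.close()
```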
Generally, with RDS these region failovers are much more complicated (I can expand on that if needed). It is also not simply a matter of pointing the IP at another region, because usually you need to promote the database (cluster) in the other region to a writer to allow write operations.
For the databases you mentioned (DynamoDB, Aurora), though, there is a solution: use DynamoDB Global Tables and Aurora Global Database, respectively.
A simpler solution could be - depending on your application - to use DynamoDB Global Tables (see https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GlobalTables.html). Clearly, DynamoDB is not a relational database, so it may not fit all cases. Nevertheless, DynamoDB generally works very well with Lambda and is also easier to replicate cross-region. Note: if you encrypt your data using an AWS KMS CMK (recommended), then you need to have this key available in all regions where you plan to use Global Tables (see https://docs.aws.amazon.com/kms/latest/developerguide/multi-region-keys-overview.html).
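To illustrate, with the current version of Global Tables (2019.11.21) you add a replica Region to an existing table with a single UpdateTable call. A hedged boto3 sketch; the table name and Regions are placeholders:

```python
import boto3

# Client in the region where the table already exists (assumption).
dynamodb = boto3.client("dynamodb", region_name="eu-west-1")

# Add a replica of the existing table in a second region.
dynamodb.update_table(
    TableName="orders",  # placeholder table name
    ReplicaUpdates=[
        {"Create": {"RegionName": "us-east-1"}}  # placeholder replica region
    ],
)
```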
Another solution could be Aurora Global Database (https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database.html) - the database is then available in multiple regions and failover is thus easier (cf. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database-disaster-recovery.html).
In the Aurora case you have to detect a region failure yourself (e.g. you could have a Lambda in both regions that regularly tries to connect to the currently active cluster for writing) and automatically promote the cluster in the other region to primary if the original one is unavailable.
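As a sketch of that idea: a scheduled Lambda in the standby region could periodically attempt a write-capable connection to the current primary and promote the secondary cluster when it cannot. Everything below (identifiers, endpoints, credentials) is a placeholder, and it assumes the MySQL-compatible flavor of Aurora. Note the hedge in the comments: failover_global_cluster performs a managed switchover that needs both regions healthy, while a genuine regional outage calls for detaching the secondary with remove_from_global_cluster instead:

```python
import boto3
import pymysql

PRIMARY_ENDPOINT = "my-cluster.cluster-xxxx.eu-west-1.rds.amazonaws.com"  # placeholder
GLOBAL_CLUSTER_ID = "my-global-cluster"  # placeholder
SECONDARY_CLUSTER_ARN = "arn:aws:rds:us-east-1:123456789012:cluster:my-secondary"  # placeholder

def handler(event, context):
    try:
        # Try a short write-capable connection to the active primary.
        conn = pymysql.connect(host=PRIMARY_ENDPOINT, user="health",
                               password="placeholder", database="health",
                               connect_timeout=3)
        conn.close()
        return {"primary": "healthy"}
    except Exception:
        rds = boto3.client("rds", region_name="us-east-1")
        # Managed switchover; only succeeds while both regions are reachable.
        # In a real regional outage you would instead call
        # remove_from_global_cluster on the secondary and let it become a
        # standalone writable cluster.
        rds.failover_global_cluster(
            GlobalClusterIdentifier=GLOBAL_CLUSTER_ID,
            TargetDbClusterIdentifier=SECONDARY_CLUSTER_ARN,
        )
        return {"primary": "unreachable", "action": "failover initiated"}
```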
Do not forget: you need to test the failover regularly, otherwise it is almost certain it will not work when you need it.
Generally, having databases across regions implies transfer costs and additional resource costs compared to a single region - not only during failover, but the entire time data is being written.

With this configuration, I recommend failing over the entire stack (to another Region), rather than failing over individual tiers (components) of the architecture. (This is what you seem to be saying in your question, but just making sure we are on the same page).
Your question comes down to how to configure the health check, and specifically how to implement shallow versus deep (checking dependencies like RDS) health checks.
There is an AWS Well-Architected lab that covers these concepts: Implementing Health Checks and Managing Dependencies to Improve Reliability.
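To make the shallow-versus-deep distinction concrete, here is a minimal sketch of a deep health check handler (e.g. a Lambda behind API Gateway that a Route 53 health check polls). It returns 200 only when its critical dependencies respond, so a dependency failure in one Region fails the health check and shifts traffic to the other Region. The resource names and the choice of checks are assumptions:

```python
import os
import boto3
import pymysql

def handler(event, context):
    checks = {}

    # Deep check 1: can we reach DynamoDB? (table name is a placeholder)
    try:
        boto3.client("dynamodb").describe_table(TableName=os.environ["TABLE_NAME"])
        checks["dynamodb"] = True
    except Exception:
        checks["dynamodb"] = False

    # Deep check 2: can we open a connection to the database?
    try:
        conn = pymysql.connect(host=os.environ["DB_HOST"],
                               user=os.environ["DB_USER"],
                               password=os.environ["DB_PASSWORD"],
                               connect_timeout=3)
        conn.close()
        checks["rds"] = True
    except Exception:
        checks["rds"] = False

    healthy = all(checks.values())
    # Route 53 treats any 2xx/3xx response as healthy, so return 503 on failure.
    return {"statusCode": 200 if healthy else 503, "body": str(checks)}
```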

Related

Cross-region Read Replicas vs One Read Replica with AWS Global Accelerator

I would like to know what is more recommended when one DB instance should be shared across different AWS regions: is it better to use cross-Region read replicas, or a read replica in the region of origin plus AWS Global Accelerator?
Is there some best-practice solution for global applications?
I am not experienced with AWS, and most of these things are pretty new to me, so I know my question may look amateur.
From what I have read, I think one centralized read replica is the better solution due to latency between regions - but if that were the case, why would anyone use cross-region replicas at all?
If your application is hosted in a region, e.g. eu-west-1, the best read performance will always come when it is reading data from eu-west-1.
If you happen to have customers in us-east-1, you have to choose between one of 3 options:
Edge Location
You reduce the latency using edge locations, i.e. CloudFront or Global Accelerator. These improve latency by using the AWS backbone to route to your origins. This is faster than the default, but the application remains in the original region (in this case eu-west-1). You also maintain only one copy of the application.
Latency based routing
This option brings the application closer to the user. Using either Route 53 with latency-based records or Global Accelerator, you can have your domain resolve to the location with the lowest latency for each user. You would have your central region (where the read/write instance lives) and then create cross-region replicas. This provides the best read performance, as reads are done locally (rather than across regions).
In the example, eu-west-1 is the primary region with cross-region replicas in us-east-1. Latency between regions is only observed in the time it takes to write to the read/write instance (in the original region, unless you use Aurora read replica write forwarding). This is by far the most complex and costly option, but it will provide the best performance overall.
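For illustration, latency-based routing in Route 53 amounts to creating one record per Region with the same name, a Region attribute, and a distinct SetIdentifier. A hedged boto3 sketch; the hosted zone, domain, and per-Region endpoints are placeholders:

```python
import boto3

route53 = boto3.client("route53")

# Placeholder hosted zone, domain, and per-region API endpoints.
HOSTED_ZONE_ID = "Z1234567890ABC"
RECORDS = [
    ("eu-west-1", "api-eu.example.com"),
    ("us-east-1", "api-us.example.com"),
]

changes = [
    {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api.example.com",
            "Type": "CNAME",
            "TTL": 60,
            "SetIdentifier": f"api-{region}",
            "Region": region,  # this attribute enables latency-based routing
            "ResourceRecords": [{"Value": endpoint}],
        },
    }
    for region, endpoint in RECORDS
]

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": changes},
)
```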
Do nothing
If you do nothing, requests will use the public internet to route to a host; users who are further away from your application will see higher latency, but this is the cheapest option.
Summary
You essentially need to decide on the importance of cross-region: if it is simply that your user base is in a far-away region, then being as close to them as possible is key. You would not need to think about replicas if your users are in one specific geographical region.
Remember, you can always enhance your infrastructure when demand from other geographical regions increases.

multi-master over multi-region Aurora - possible?

I am relatively experienced with many AWS services, but I do have a large gap around Aurora/RDS.
I'm trying to create a multi-region, multi-master (write replicas) setup.
The purpose is to give users low latency (if each read and write replica is in the user's region) and resilience (if there is a region outage, users can have their requests routed to another region; the latency will be higher, but reduced service is better than no service).
I'm trying to learn about AWS Aurora and have created a toy cluster to experiment with. It seems I can create a cluster that is served out of multiple regions (and Aurora replicates data between regions automatically). I've also read that a multi-master setup is possible, but my toy cluster only had one write partition, and I couldn't work out how to create another write partition in another region, which made me question whether it's possible.
Here is a diagram of what I'm thinking:
https://imgur.com/DzoSpHL
Thank you in advance!
The purpose is to give low latency to users (if each read and write replica is in the user's region)
I couldn't work out how to create another write partition in another region, which made me question if it's possible?
That is not possible (at least not currently) because of multi-master Aurora limitations:
"all DB instances in a multi-master cluster must be in the same AWS Region."
and others, such as:
"you can have a maximum of two DB instances in a multi-master cluster"
"You can't enable cross-Region replicas from multi-master clusters."
You can read more in the Aurora multi-master documentation.
The best thing you can do in your scenario is to create a single master and place read replicas in those additional regions (possibly with some caching if necessary).
As mentioned earlier, it is not possible with Aurora.
However, DynamoDB supports multi-active, multi-region replication:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GlobalTables.html
As others have said, with Amazon Aurora you cannot deploy multi-Region and multi-master. However, you can deploy multi-Region using Aurora Global Database: one writer endpoint lives in one Region, while reader endpoints are available in all the other Regions. You can also use write forwarding (assuming you are using the MySQL flavor of Aurora) in the read-only Regions. I know latency is a concern for you, so note that a forwarded write actually goes back to the primary Region, so writes will incur that extra latency.
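A hedged boto3 sketch of that setup: convert an existing regional cluster into a Global Database, then add a secondary Region with write forwarding enabled. All identifiers, Regions, and the engine version are placeholders:

```python
import boto3

# Promote an existing cluster in the primary region to a global cluster.
rds_primary = boto3.client("rds", region_name="eu-west-1")
rds_primary.create_global_cluster(
    GlobalClusterIdentifier="my-global",  # placeholder
    SourceDBClusterIdentifier="arn:aws:rds:eu-west-1:123456789012:cluster:my-primary",
)

# Add a read-only secondary cluster in another region.
rds_secondary = boto3.client("rds", region_name="us-east-1")
rds_secondary.create_db_cluster(
    DBClusterIdentifier="my-secondary",           # placeholder
    Engine="aurora-mysql",
    EngineVersion="5.7.mysql_aurora.2.10.2",      # placeholder version
    GlobalClusterIdentifier="my-global",
    EnableGlobalWriteForwarding=True,  # forwarded writes still execute in eu-west-1
)
# You would still create a DB instance in the secondary cluster to serve reads.
```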

Is there enough capacity in all AWS regions for disaster recovery

In case of a disaster, when an entire AWS region fails and all its customers want to move their workloads to the next closest region in a disaster recovery scenario, is AWS ready for this?
I imagine millions of servers running in each region. Is AWS ready to provision them in another region the next day? Do they have that capacity at the ready?
AWS global infrastructure uses the concept of Availability Zones inside each region to partition resources, isolate risks, and ultimately reduce the blast radius of an eventual failure. AZs are groups of datacenters within a region that are designed to be independent of each other in terms of risk (i.e. separate connections to the power grid, redundant and isolated network infrastructure, and isolation from geographical risks such as earthquakes, flooding, etc.).
Some services are designed to automatically take advantage of this redundant infrastructure (Amazon S3, Amazon DynamoDB, ELB, etc.); customers do not need to configure anything, as redundancy and failover at the regional level are handled by the service. Other services operate at the AZ level (Amazon EC2, EBS, RDS, etc.). For these services, the best practice is to design a multi-AZ architecture with data replication.
In the very unlikely case that a service becomes unavailable in one AZ, a well-architected application will transparently fail over to another AZ, without any noticeable customer impact.
Back to your question: the architecture is designed to avoid a region-wide failure of all services. That has never happened since we launched AWS in 2006. And yes, we have a lot of capacity. I suggest you watch this keynote from James Hamilton to learn more about it: https://www.youtube.com/watch?v=AyOAjFNPAbA

Connecting Cassandra from AWS Lambda

We are checking the feasibility of migrating one of our applications to Amazon Web Services (AWS). We have decided to use AWS API Gateway to expose the services and AWS Lambda (Java) for back-end data processing. The Lambda function has to fetch a large amount of data from our database.
We currently use Cassandra for data storage, set up on an EC2 instance with no public IP.
Can anyone suggest a way to access Cassandra (EC2) from AWS Lambda using the private IP (10.0.x.x)?
Is it a right choice to use AWS Lambda for large scale applications?
Since your Cassandra instance uses a private IP, you will need to configure your AWS Lambda function's networking to use a VPC. It could be the VPC you run Cassandra in, or a VPC you create for your Lambdas and VPC-peer to your Cassandra VPC. A few things to note from the documentation:
When your Lambda runs in a VPC, it doesn't have internet access by default; you will need to configure a NAT for that.
There is additional latency due to the configuration of the ENI (you only pay that penalty on cold start).
You need to make sure your Lambda has the right permissions to manage the ENI; attach the AWSLambdaVPCAccessExecutionRole managed policy to its execution role.
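For illustration, attaching an existing function to a VPC is a single API call. A hedged boto3 sketch; the function name, subnet IDs, and security group ID are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

# The subnets should be private subnets that can reach the Cassandra instance;
# Cassandra's security group must allow inbound 9042 from this security group.
lambda_client.update_function_configuration(
    FunctionName="my-data-processor",  # placeholder
    VpcConfig={
        "SubnetIds": ["subnet-aaa111", "subnet-bbb222"],  # placeholders
        "SecurityGroupIds": ["sg-ccc333"],                 # placeholder
    },
)
```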
Your plan to use API Gateway / AWS Lambda has at least 3 potential issues which you need to consider carefully:
Cost. API Gateway's per-request cost is higher than AWS Lambda's per-request cost. Make sure you are familiar with the pricing.
Cold start. When AWS starts an underlying container to execute your Lambda, you pay a cold start latency (which gets worse when using a VPC, due to the management of the ENI). If you execute your Lambda concurrently, there will be multiple underlying containers, and each of them incurs this cold start the first time. AWS tends to keep the underlying containers ready for a warm start for a few minutes (users report 5 to 40 minutes). You might try to keep your containers warm by pinging your Lambda, but with multiple containers in parallel this gets tricky.
Cassandra session. You will probably want to avoid creating and destroying your Cassandra session on every Lambda invocation (it is costly). I haven't tried it yet, but there are reports of keeping the session alive in a warm container; you might want to check this SO answer, and see the sketch below.
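A minimal sketch of that pattern with the Python cassandra-driver: the cluster and session are created once per container, outside the handler, so warm invocations reuse them. The host and keyspace values are placeholders:

```python
import os
from cassandra.cluster import Cluster

# Created once per container, outside the handler, so warm invocations
# reuse the same session instead of paying connection setup each time.
# Host and keyspace are placeholders for your own private IP and schema.
cluster = Cluster(contact_points=os.environ.get("CASSANDRA_HOSTS", "10.0.0.10").split(","))
session = cluster.connect(os.environ.get("CASSANDRA_KEYSPACE", "mykeyspace"))

def handler(event, context):
    # Any query works here; this one just proves the session is alive.
    row = session.execute("SELECT release_version FROM system.local").one()
    return {"cassandra_version": row.release_version}
```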
Having said all that, currently the biggest limitations in using AWS Lambda are concurrent execution and cold start latency. For data processing, that's usually fine. For user-facing usage, the proportion of slow cold starts might affect your user experience.

What is the best way to automatically scale AWS RDS?

I want to autoscale AWS RDS automatically with scripts, based on metric monitoring.
RDS doesn't really do this for read-write workloads:
Multi-AZ deployments maintain a standby copy intended for failover from primary to secondary if there is an availability problem; they don't address the problem of performance.
Read replicas can be used to increase performance, but they are read-only.
It might be possible to watch a load metric and use a CloudWatch alarm to start an extra read replica. Read replicas can be reached via an ELB or NLB.
But this probably isn't a good idea: while an existing RDS instance is creating a read replica, performance is degraded, and RDS read replicas are quite slow to come up and become available, so this is unlikely to respond well to transient demand.
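If you still want to experiment with that pattern despite the caveats, a hedged boto3 sketch follows; the identifiers and SNS topic are placeholders, the alarm setup would normally run once, and you would wire the replica-creation function to the alarm (e.g. via SNS) yourself:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
rds = boto3.client("rds")

# One-time setup: alarm on sustained CPU load on the primary instance.
cloudwatch.put_metric_alarm(
    AlarmName="rds-primary-cpu-high",
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-primary"}],  # placeholder
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=75.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:scale-out"],  # placeholder topic
)

def create_replica(event, context):
    # Triggered when the alarm fires; remember replicas take many minutes
    # to become available, as discussed above.
    rds.create_db_instance_read_replica(
        DBInstanceIdentifier="my-replica-2",        # placeholder
        SourceDBInstanceIdentifier="my-primary",    # placeholder
    )
```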
You can make an API call to modify an RDS instance, including changing the instance class.
Amazon RDS will provision a new instance of the desired class and will then re-point the endpoint to the new instance. Existing connections will be terminated, but applications can reconnect, and all the data will be there.
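A hedged boto3 sketch of that call; the instance identifier and target class are placeholders. With ApplyImmediately the change happens right away rather than in the next maintenance window, at the cost of the brief interruption described above:

```python
import boto3

rds = boto3.client("rds")

rds.modify_db_instance(
    DBInstanceIdentifier="my-primary",  # placeholder
    DBInstanceClass="db.r5.xlarge",     # target instance class (placeholder)
    ApplyImmediately=True,              # otherwise waits for the maintenance window
)
```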
Rather than scaling the RDS instance, you could also consider a caching layer, such as Amazon ElastiCache, which supports Redis and Memcached. Most applications are read-heavy, which is ideal for a cache; this can significantly improve application performance without having to scale the database.
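As an illustration of the cache-aside pattern with ElastiCache for Redis (using the redis-py client; the endpoint, key scheme, and query helper are placeholders): read from the cache first, and only hit the database on a miss:

```python
import json
import redis

# Placeholder ElastiCache Redis endpoint.
cache = redis.Redis(host="my-cache.xxxxxx.ng.0001.euw1.cache.amazonaws.com", port=6379)

def get_product(product_id, db_conn):
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no database round trip

    row = query_database(db_conn, product_id)  # hypothetical helper for the DB read
    cache.setex(key, 300, json.dumps(row))     # cache the result for 5 minutes
    return row
```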
Put simply, this is possible with Aurora (MySQL 5.7-compatible) RDS instances only; they provide an option to auto-scale based on CloudWatch metric conditions, e.g. CPU utilization.