AWS read replicas architecture - amazon-web-services

We have a service that runs in 6 AWS regions, and there are some requirements that must be met:
The latency of querying the database must be very low.
It must support a high throughput of queries.
We have observed that the database update process is IO intensive, so it increases query latency due to DB locks.
Delays on the order of seconds between an update and a read are acceptable.
The architecture that we discussed was having one service that updates the master db and one slave in each region (6 slaves total).
We found some problems and some possible solutions with that:
AWS limits us to 5 read replicas per source instance.
To solve this issue we thought of creating read replicas of read replicas. That should give us 25 instances.
There is a limitation in AWS that you cannot create a read replica of a read replica in another region.
To solve this issue we thought of updating 2 master databases from inside the application.
This approach creates the problem that, for a period of time, the databases can be inconsistent.
In the service implementation we can always recreate the data, so there is a job re-updating the data from time to time (that is one of the reasons the update is IO intensive).
Has anyone had a similar problem? How do you handle it? Can we avoid creating and maintaining the databases ourselves?
We are using MySQL, but we are pretty open to using other compatible DBs.

Unfortunately, there is no magical solution when it comes to going inter-region: you pay for it in latency.
I think you have explored pretty much all the solutions from an RDS point of view with what you propose, e.g. read replicas of read replicas (I confirm you cannot do this from another region, but that restriction is there to save you from too high a replica lag).
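For reference, a minimal boto3 sketch of the cross-region fan-out this forces you into, where every replica is created directly from the primary; the instance identifiers, regions and instance class are hypothetical:

```python
import boto3

# Hypothetical source instance ARN; creating a replica in another
# region requires the ARN rather than the plain identifier.
SOURCE_ARN = "arn:aws:rds:us-east-1:123456789012:db:master-db"

# Each replica comes directly off the primary, since a replica of a
# replica cannot be created in another region.
for region in ["eu-west-1", "ap-southeast-1", "sa-east-1"]:
    rds = boto3.client("rds", region_name=region)
    rds.create_db_instance_read_replica(
        DBInstanceIdentifier=f"app-replica-{region}",
        SourceDBInstanceIdentifier=SOURCE_ARN,
        DBInstanceClass="db.r5.large",
    )
```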
Another solution would be to run the databases yourself on EC2 instances, but you would lose all the benefits of RDS (you could protect that replication traffic with an inter-region VPN between VPCs). Bear in mind, however, that too many read replicas will impact your performance.
My advice in your case would be:
to make massive use of caching at every possible level: ElastiCache between the DB and the servers, Varnish for HTTP pages, CloudFront for content delivery (see the cache-aside sketch after this list). If you want so many read replicas, it means you are heavily dependent on reads. This way you would save a lot of reads from hitting your database and gain latency significantly, and maybe 5 read replicas would then be enough.
to consider sharding or using several databases. This is not always a good solution, however, depending on your use case...
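For the caching point, here is a minimal cache-aside sketch, assuming an ElastiCache Redis endpoint and the regional MySQL read replica; the endpoints, credentials, `items` table and TTL are all hypothetical:

```python
import json
import redis
import pymysql

# Hypothetical ElastiCache and read-replica endpoints.
cache = redis.Redis(host="my-cache.xxxxxx.cache.amazonaws.com", port=6379)
db = pymysql.connect(
    host="replica.xxxxxx.us-east-1.rds.amazonaws.com",
    user="reader", password="secret", database="app",
    cursorclass=pymysql.cursors.DictCursor,
)

def get_item(item_id, ttl=30):
    """Serve reads from Redis; fall back to the replica only on a miss."""
    key = f"item:{item_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    with db.cursor() as cur:
        cur.execute("SELECT * FROM items WHERE id = %s", (item_id,))
        row = cur.fetchone()
    if row is not None:
        # A short TTL keeps staleness within the "seconds" budget
        # mentioned in the question.
        cache.set(key, json.dumps(row, default=str), ex=ttl)
    return row
```

With a TTL of a few tens of seconds, most reads never reach the replicas at all, which is usually where the biggest latency win comes from.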

You can request an increase in the number of RDS for MySQL Read Replicas using the form at https://aws.amazon.com/contact-us/request-to-increase-the-amazon-rds-db-instance-limit/
Once the limit has been increased you'll want to test to make sure that the performance of having a large number of Read Replicas is acceptable to your application.
Hal

Related

Can I create a hook or react in some other way to a CloudSQL read replica catching up?

I have a live production system with a Google Cloud SQL Postgres instance. The application will soon be undergoing a long-running database schema modification to accommodate a change to the way the business operates.
We've got a deployment plan that will allow the business to continue to operate during the schema change. It essentially pauses replication to our read replica and queues up API requests that would mutate the database for replay after the schema change is complete. Once the deployment is complete, the last step is to un-pause replication. But while the read replica is catching up, the schema changes will lock tables, causing a lot of failing read requests. So before we un-pause the read replication, we're going to divert all API DB queries to the main instance, which will have just finished the schema changes.
So far so good, but I can't find a way to programmatically tell when the read replica is done catching up, so we can again split our DB queries with writes going to the main instance and reads going to the replica.
Is there a Pub/Sub topic or metric stream our application could subscribe to which would fire when replication catches up? I would also be happy with something that reports the replication lag as a transaction count (or time), which the application could receive; when the trailing average drops below a threshold, it would switch over to reading from the replica again. The least desirable but still okay option would be continuous polling of an API or metric stream.
I know I can do this directly by querying the replica database itself for replication status, but that means we have to implement custom traffic directing in our application. Currently the framework we use allows us to route DB traffic in config. I know there should be metrics that are available from CloudSQL, but I cannot find them.
I know this doesn't fully answer your question, but maybe you will be able to use it. It seems that you might be interested in Cloud Monitoring and the metric:
database/mysql/replication/seconds_behind_master
According to the reference, it reflects the lag of the replica behind the master.
Either that or database/replication/replica_lag should work. I don't think you can get this through Pub/Sub, though. In any case, you should take a look at the reference, as it contains all the available metrics.
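A minimal polling sketch against that metric using the Cloud Monitoring Python client (google-cloud-monitoring); the project ID, replica instance name and the 5-second threshold are hypothetical, and this assumes the metric is exposed as an INT64 gauge:

```python
import time
from google.cloud import monitoring_v3

# Hypothetical project and replica identifiers.
PROJECT = "projects/my-gcp-project"
REPLICA = "my-gcp-project:my-read-replica"

client = monitoring_v3.MetricServiceClient()

def replica_lag_seconds():
    """Return the most recent seconds_behind_master sample for the replica."""
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 300}}
    )
    series = client.list_time_series(
        request={
            "name": PROJECT,
            "filter": (
                'metric.type = "cloudsql.googleapis.com/database/mysql/'
                'replication/seconds_behind_master" AND '
                f'resource.labels.database_id = "{REPLICA}"'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for ts in series:
        return ts.points[0].value.int64_value  # newest point comes first
    return None

# Poll until the replica has (nearly) caught up, then flip the routing config.
while (lag := replica_lag_seconds()) is None or lag > 5:
    time.sleep(30)
print("replica caught up; re-enable reads from it")
```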

AWS DynamoDB vs RDS for Lambda serverless architecture

I am part of a team currently developing a Proof of Concept architecture/application for a communication service between governmental offices and the public (narrowed down to the health-sector for now). The customer has specifically requested a mainly serverless approach through AWS services, and I am in need of advice for how to set up this architecture, namely the Lambda to Database relationship.
Roughly, the architecture would make use of API Gateway to handle requests, which would invoke different Lambdas, as micro-services, that access the DB.
The following image depicts a quick relationship schema. Basically, a Patient inputs a description of their Condition, which forms the basis for a Case. That Case is handled during one or many Sessions by one or many Nurses that take Notes related to the Case. (DB schema image omitted; not enough reputation to embed it.)
From my research, I've gathered that in the case of RDS there is a trade-off between security (keeping the Lambdas outside of a public VPC containing an RDS instance, foregoing security best practices, a no-no for the public sector) and performance (putting the Lambdas in a private VPC with an RDS instance and incurring heavy cold-start times due to ENI provisioning). The cold-start times can however be negated by pinging the Lambdas with CloudWatch, which may or may not be optimal.
In the case of DynamoDB, I am personally very inexperienced (more so than in MySQL) and unsure whether the data is suited to a NoSQL model. If it is, DynamoDB seems like the better approach. From my understanding, though, NoSQL has less support for complex queries that involve JOINs etc., which might eliminate it as an option.
It feels as if SQL/RDS is more appropriate in terms of the data/relations, but DynamoDB gives fewer problems for Lambda/AWS services if a decent data model can be found. So my question is: would it be preferable to go for a private RDS instance and try to negate the cold starts by warming up the most critical Lambdas, or is there a NoSQL model that wouldn't cause headaches for complex queries, among other things? Am I missing any key aspects that could tip the scale?
Let's start by clearing up some rather drastic misconceptions on your part:
From my research, I've gathered that in the case of RDS, there is a trade-off between security (keeping the Lambdas outside of a public RDS instance, foregoing security best-practices, a no-no for public sector) and performance (putting the Lambda in a private RDS instance, and incurring heavy cold-start times). The cold-start times can however be negated by pinging them with CloudWatch, which may or may not be optimal
RDS is a database server. You don't run anything inside or outside of it.
You may be thinking of a VPC, or Virtual Private Cloud. This is an isolated network in which you can run your RDS instances and Lambdas.
Running inside or outside of a VPC has no impact on cold start times. You pay the cold start penalty when AWS has to start a new container to run your Lambda. This can happen either because it hasn't been running recently, or because it needs to scale to meet concurrent requests. The actual cold start time will depend on your language: Java is significantly slower than Python, for example, because it needs to start the JVM and load classes before doing anything.
Now for your actual question
Basically, a Patient inputs a description of his Condition which forms the basis for a Case. That Case is handled during one or many Sessions by one or many Nurses that take Notes related to the Case.
This could be implemented in a NoSQL database such as DynamoDB. Without more information, I would probably make the Session the base document, using case ID as partition key and session ID as the sort key. If you don't understand what those terms mean, and how you would structure a document based around that key, then you probably shouldn't use DynamoDB.
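As a hedged illustration of that key design, a small boto3 sketch against a hypothetical Sessions table with case_id as partition key and session_id as sort key:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table and attribute names, for illustration only.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Sessions")

# Fetch every session document belonging to one case: a single Query
# on the partition key, no scan or join required.
response = table.query(
    KeyConditionExpression=Key("case_id").eq("case-123")
)
for session in response["Items"]:
    print(session["session_id"], session.get("notes"))
```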
A bigger reason to not use DynamoDB has to do with access patterns. Will you ever want to find all cases worked by a given nurse? Or related to a given patient? Those types of queries are what a relational database is designed for.
the case of DynamoDB, I am personally very inexperienced (more so than in MySQL)
Do you have anyone on your team who is familiar with NoSQL databases? If not, then I think you should stick with MySQL. You will have enough challenges learning how to use Lambda.

DynamoDB Global Table Replication System

I am working on benchmarking DynamoDB's performance as part of a project at the university and have been looking for more details on the replication system when setting up global tables, as I want to understand its impact on latency/throughput.
I ended up finding two confusing concepts, regions and availability zones. From what I understood here:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.CrossRegionRepl.html
by creating 2 tables, one in Frankfurt and one in Ireland let's say, I now have 2 multi-master read/write replicas.
But then I found these links:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.Partitions.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html
https://aws.amazon.com/blogs/aws/new-for-amazon-dynamodb-global-tables-and-on-demand-backup/
explaining that the data is stored and automatically replicated across multiple availability zones in an AWS region, but not mentioning the number of replicas, whether they can be used for read/write requests, whether they are also multi-master or slaves, or whether they exist just for recovery purposes.
From what I understood there, going back to the example I am using (Frankfurt/Ireland), I will be having:
3 multi-master read/write replicas in Frankfurt
3 multi-master read/write replicas in Ireland
Please let me know which one is correct. Thanks in advance.
DynamoDB by default replicates your data across multiple availability zones, irrespective of whether it is a global table or not. This is to ensure higher availability in case one zone goes down. However, these partitions are transparent to the user, and the user doesn't get to choose which one to connect to.
Here is a nice video explaining how it works under the hood.
A global table means that data will be replicated across the regions, transparently to the user. I did a benchmark with a table in two regions, Oregon and Ohio, and it typically took ~1.5 seconds for a write to get replicated. Conflict resolution is managed automatically by AWS, and the last writer wins.
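If you want to reproduce that sort of measurement, a hedged sketch is to write an item in one region and poll the other region until it appears; the table name, key attribute and regions below are hypothetical:

```python
import time
import uuid
import boto3

# Hypothetical global table replicated between Oregon and Ohio.
TABLE = "benchmark-global-table"
writer = boto3.resource("dynamodb", region_name="us-west-2").Table(TABLE)
reader = boto3.resource("dynamodb", region_name="us-east-2").Table(TABLE)

item_id = str(uuid.uuid4())
start = time.time()
writer.put_item(Item={"pk": item_id, "written_at": int(start)})

# Poll the other region until the replicated item shows up.
while True:
    resp = reader.get_item(Key={"pk": item_id}, ConsistentRead=True)
    if "Item" in resp:
        break
    time.sleep(0.05)

print(f"replication lag: {time.time() - start:.2f}s")
```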
A personal suggestion here is to write to only one table so that data collisions are minimized, and in the case of a disaster, fail writes over to the other region.

Tracking Usage per API key in a multi region application

I have an app deployed in 5 regions.
The latency between the regions varies from 150ms to 300ms
Currently, we use the method outlined in this article (usage tracking part):
http://highscalability.com/blog/2018/4/2/how-ipdata-serves-25m-api-calls-from-10-infinitely-scalable.html
But we export logs from Stackdriver to Cloud Pub/Sub. Then we use Cloud Dataflow to count the number of requests consumed per API key and update it in Mongo Atlas database which is geo-replicated in 5 regions.
In our app, we only read usage info from the nearest Mongo replica, for low latency. The app never updates any usage data directly in Mongo, as that might incur a latency cost since the data has to be updated on the master, which may be in another region.
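The read path described here amounts to something like this pymongo sketch; the connection string, database and collection names are hypothetical:

```python
from pymongo import MongoClient

# Geo-replicated Atlas cluster; "nearest" read preference lets each
# region's app servers read from their local replica set member.
client = MongoClient(
    "mongodb+srv://app:secret@usage-cluster.example.mongodb.net",
    readPreference="nearest",
)
usage = client.metering.api_usage

def get_usage(api_key):
    # Reads may be slightly stale, which is acceptable for quota checks.
    return usage.find_one({"_id": api_key})
```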
Updating the API key usage counter directly from the app in Mongo doesn't seem feasible because we have traffic coming in at 10,000 RPS, and due to the latency between regions I think it will run into other issues. This is just a hunch; so far I've not tested it. I came to this conclusion based on my reading of https://www.mongodb.com/blog/post/active-active-application-architectures-with-mongodb
One problem is that we end up paying for Cloud Pub/Sub and Dataflow. Are there strategies to avoid this?
I researched on Google but didn't find how other multi-region apps keep track of usage per API key in real time. I am not surprised: from my understanding, most apps operate in a single region for simplicity, and until now it was not feasible to deploy an app in multiple regions without significant overhead.
If you want real time, then the best option is to go with Dataflow. You could change the way data arrives at Dataflow, for example using Stackdriver → Cloud Storage → Dataflow: instead of going through Pub/Sub you would go through Storage, so it's mostly a matter of convenience and of comparing the cost of each product for your use case. Here's an example of how it could look with Cloud Storage.
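As an illustration of the Cloud Storage variant, a hedged Apache Beam sketch of the counting step, assuming the exported logs land in a bucket as newline-delimited JSON with an api_key field; the project, bucket paths and field name are assumptions:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # Hypothetical project, region and bucket locations.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadLogs" >> beam.io.ReadFromText("gs://my-log-bucket/exported/*.json")
            | "ParseJson" >> beam.Map(json.loads)
            | "KeyByApiKey" >> beam.Map(lambda rec: (rec["api_key"], 1))
            | "CountPerKey" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda key, count: f"{key},{count}")
            | "Write" >> beam.io.WriteToText("gs://my-bucket/usage/per_key")
        )

if __name__ == "__main__":
    run()
```

The pipeline shape is the same as the Pub/Sub version; only the source changes, so you can compare the two mostly on price and on how fresh you need the counts to be.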

Cheapest way to insert one billion rows into AWS RDS

I need to import one billion rows of data from local machine to AWS RDS.
The local machine has a high-speed internet connection that goes up to 100 MB/s, so the network is not the problem.
I'm using AWS RDS r3.xlarge with 2000 PIOPS and 300GB of storage.
However, since my PIOPS are capped at 2000, importing one billion rows is going to take something like 7 days.
How can I speed up the process without paying more?
Thanks so much.
Your PIOPS are the underlying IO provisioning for your database instance: that is, how much data per second the OS is guaranteed to be able to send to persistent storage. You might be able to optimize that slightly by using larger write batches (depending on what your DB supports), but fundamentally it limits the number of bytes per second available for your import.
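As a hedged illustration of the batching idea, a pymysql sketch that groups rows into multi-row executemany() batches with a single commit per batch; the endpoint, credentials, table and CSV layout are hypothetical:

```python
import csv
import pymysql

# Hypothetical RDS endpoint and schema.
conn = pymysql.connect(
    host="mydb.xxxxxxxx.us-east-1.rds.amazonaws.com",
    user="admin", password="secret", database="mydb", autocommit=False,
)
BATCH = 5000
sql = "INSERT INTO big_table (col_a, col_b) VALUES (%s, %s)"

with open("rows.csv", newline="") as f, conn.cursor() as cur:
    batch = []
    for row in csv.reader(f):
        batch.append(row)
        if len(batch) >= BATCH:
            cur.executemany(sql, batch)  # sent as a multi-row INSERT
            conn.commit()                # one commit per batch, not per row
            batch.clear()
    if batch:
        cur.executemany(sql, batch)
        conn.commit()
conn.close()
```

If the data is already in CSV form, MySQL's LOAD DATA LOCAL INFILE is usually even faster, but the principle of amortizing IO over large batches is the same.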
You can provision more IOPS just for the import and then scale back down afterwards, though.