Google Cloud Spanner Single-Region Availability Analysis - google-cloud-platform

Single-Region Spanner is advertised with a 99.99% availability SLA. In the US-based configuration, there will be exactly three replicas per node, all in Council Bluffs, Iowa. Can you share information that breaks down why the 99.99% (~one hour of downtime per year) is believable, especially in the case of geographically-local disasters? I assume that Google has done a thorough analysis, or else it would not advertise the SLA, but I cannot find a detailed paper.
In the event of a regional failure, what recovery procedures will Google carry out and with what recovery time / expected data loss?
(I understand that multi-region may be available, and have seen some pricing data, but will not discuss this here).

Spanner automatically replicates data for high availability. As you noted, regional instances keep three full copies of the data. The key is that those copies are replicated across three zones within the region, each with independent power, cooling, networking, and so on. Zones generally fail independently of each other, so the remaining replicas can continue serving reads and writes even if one zone goes down. Multi-region configurations provide even greater availability by replicating across regions.
Zonal failures are very rare and would be transparent to your application; Cloud Spanner automatically reroutes requests to replicas that are able to serve them. It would be even rarer for a region to go down with data loss; Google takes many measures to protect against disasters.
Further out, we will expose managed backups, though these will still be stored within Google data centers. We're also working on a Dataflow connector to help you import and export data should you want to manage your own backups.
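As a concrete illustration of that transparency, here is a minimal read using the Cloud Spanner Python client; the instance and database IDs are placeholders, and the point is that replica routing is handled by the client library and the service, so application code does not change during a zonal failover.

from google.cloud import spanner

# Placeholder instance and database IDs; the client uses
# Application Default Credentials.
client = spanner.Client()
instance = client.instance("my-instance")
database = instance.database("my-db")

# A strong read can be served by any up-to-date replica in the region,
# so a single-zone outage is transparent to this code.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql("SELECT 1")
    for row in rows:
        print(row)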

Related

What is the best backup for databases in AWS?

I have a number of databases in Azure that I want to back up in AWS. What is the best type of storage for databases in AWS?
Can this be automated in Azure?
In the 'old days' before cloud computing, backup typically involved sending data to a secondary disaster recovery location where there was (often inadequate) backup equipment that could take over the activities of the primary data center.
These days, cloud computing providers such as AWS and Azure run multiple data centers in each region. A 'Region' contains multiple 'Availability Zones', each of which is a separate data center.
Also, many services (eg Amazon S3, Azure Blob storage) are 'regional' services that automatically run across multiple Availability Zones. This means that a failure in one AZ does not impact operation or availability of the service. However, individual virtual machines (eg Amazon EC2, Azure VMs) run on single hosts, so each one operates in only a single AZ.
Thus, rather than attempting to copy data to a "different location" or a different cloud service, it is better to take advantage of the backup capabilities offered by the cloud provider.
From Automatic, geo-redundant backups - Azure SQL Database | Microsoft Learn:
By default, Azure SQL Database stores backups in geo-redundant storage blobs that are replicated to a paired region. Geo-redundancy helps protect against outages that affect backup storage in the primary region. It also allows you to restore your databases in a different region in the event of a regional outage.
The storage redundancy mechanism stores multiple copies of your data so that it's protected from planned and unplanned events. These events might include transient hardware failure, network or power outages, or massive natural disasters.
This not only meets your requirement for backing up data to another location, but also makes it quick and easy to restore data when necessary. Compare that to sending data to a different cloud provider, where you would be responsible for converting file formats, launching replacement services and loading data from backup. That type of thing really isn't necessary if you are using a managed database service.
Backing-up data is easy. Restoring is hard!
Bottom line: Use a managed database (eg Azure SQL Database) and use the managed backup options they provide. They will give you the redundancy you seek, while making the process MUCH easier to manage.
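For completeness, the geo-restore described in the quoted documentation can also be scripted. Below is a hedged sketch using the azure-mgmt-sql Python client; the subscription, resource group, server and database names are placeholders, and the exact model fields (create_mode, source_database_id) can vary between SDK versions, so check the current documentation before relying on it.

from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import Database

credential = DefaultAzureCredential()
client = SqlManagementClient(credential, "<subscription-id>")   # placeholder subscription

# Geo-restore: create a new database in a recovery region from the
# geo-replicated backup of the original (create_mode="Recovery").
poller = client.databases.begin_create_or_update(
    resource_group_name="rg-dr",            # placeholder resource group
    server_name="sql-server-secondary",     # placeholder server in another region
    database_name="mydb-restored",          # placeholder name for the restored database
    parameters=Database(
        location="westus2",                 # placeholder recovery region
        create_mode="Recovery",
        # Placeholder resource ID of the geo-backed-up source database:
        source_database_id=(
            "/subscriptions/<subscription-id>/resourceGroups/rg-primary"
            "/providers/Microsoft.Sql/servers/sql-server-primary"
            "/recoverableDatabases/mydb"
        ),
    ),
)
restored = poller.result()
print(restored.name, restored.status)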

Request Limit Per Second on GCP Load Balancer in front of Storage Bucket website

I want to know the limit of requests per second for a Load Balancer on Google Cloud Platform. I couldn't find this information in the documentation.
My project is a static website hosted in a Storage Bucket behind the Load Balancer with CDN enabled.
This website will be featured in a television campaign, and the estimate is 100k requests per second for 5 minutes.
Could anyone help me with this information? Is it necessary to ask Support to pre-warm the load balancer before the campaign starts?
From the front page of GCP Load Balancing:
https://cloud.google.com/load-balancing/
Cloud Load Balancing is built on the same frontend-serving infrastructure that powers Google. It supports 1 million+ queries per second with consistent high performance and low latency. Traffic enters Cloud Load Balancing through 80+ distinct global load balancing locations, maximizing the distance traveled on Google's fast private network backbone.
This seems to say that 1 million+ requests per second is fully supported.
However, with all that said, I wouldn't wait for 'the day' before testing. See if you can rehearse with a representative load. Given that this sounds like a finite event with high visibility (television), I'm sure you don't want to wait for the event only to find out something was wrong in the setup or theory. As for whether 100K requests per second through a load balancer is feasible, the answer appears to be yes.
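As a starting point for such a rehearsal, here is a minimal Python sketch; the URL, request count and concurrency are placeholders. At 100k requests per second you would use a dedicated load-testing tool or a distributed test rather than a single client machine, but even a small script catches configuration mistakes early.

import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://www.example.com/"   # placeholder for the load-balanced site
REQUESTS = 1000                    # placeholder request count
CONCURRENCY = 50                   # placeholder concurrency

def fetch(_):
    start = time.monotonic()
    resp = requests.get(URL, timeout=10)
    return resp.status_code, time.monotonic() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(fetch, range(REQUESTS)))

codes = [code for code, _ in results]
latencies = sorted(lat for _, lat in results)
print("non-200 responses:", sum(1 for c in codes if c != 200))
print("p95 latency (s):", round(latencies[int(0.95 * len(latencies))], 3))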
If you are (or are asking on behalf of) a GCP customer, Google has Technical Account Managers associated with accounts who can be brought into the planning loop, especially if there are questions of 'can we do this?'. One should always be cautious about sudden, high-volume needs for GCP resources. Again, through a Technical Account Manager, it does no harm to pre-warn Google of large resource requests. For example, if you said that you needed an extra 5000 Compute Engine instances, you might be constrained in which regions are available to you given finite existing capacity. Google, just like other public cloud providers, has to schedule and balance resources in its regions. Timing is also very important. If you need a sudden burst of resources and the time that you need them happens to coincide with an event such as Black Friday (US) or Singles Day (China), special preparation may be needed.

Is there enough capacity in all AWS regions for disaster recovery

In case of a disaster, when an entire AWS region fails and all its customers want to move their workloads to the next closest region in a disaster recovery scenario, is AWS ready for this?
I imagine millions of servers running in each region. Is AWS ready to provision them in another region the next day? Do they have that capacity at the ready?
AWS global infrastructure uses the concept of Availability Zones inside each region to partition resources, isolate risks and ultimately reduce the blast radius of an eventual failure. AZs are groups of data centers within a region that are designed to be independent of each other in terms of risk (i.e. separate connections to the power grid, redundant and isolated network infrastructure, and isolation from geographical risks such as earthquakes, flooding, etc.).
Some services are designed to automatically take advantage of this redundant infrastructure (Amazon S3, Amazon DynamoDB, ELB, etc.); customers do not need to configure anything, as redundancy and failover at the regional level are handled by the service. Other services operate at the AZ level (Amazon EC2, EBS, RDS, etc.). For these services, the best practice is to design a multi-AZ architecture with data replication.
In the very unlikely case that a service becomes unavailable in an AZ, a well-architected system will transparently fail over to another AZ, without any noticeable customer impact.
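As a concrete example of that multi-AZ best practice for an AZ-scoped service, here is a hedged boto3 sketch that creates an Amazon RDS instance with Multi-AZ enabled, so a standby replica in a second Availability Zone takes over automatically if the primary AZ fails; the region, identifiers and sizes are placeholders.

import boto3

rds = boto3.client("rds", region_name="us-east-1")   # placeholder region

rds.create_db_instance(
    DBInstanceIdentifier="app-db",          # placeholder identifier
    DBInstanceClass="db.t3.medium",         # placeholder instance class
    Engine="postgres",
    MasterUsername="dbadmin",               # placeholder user
    MasterUserPassword="change-me",         # placeholder; use a secrets manager in practice
    AllocatedStorage=100,
    MultiAZ=True,                           # synchronous standby in a second AZ
)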
Back to your question: the architecture is designed to avoid a region-wide failure of all services. That has never happened since we launched AWS in 2006. And, yes, we have a lot of capacity. I suggest watching this keynote from James Hamilton to learn more about it: https://www.youtube.com/watch?v=AyOAjFNPAbA

Do we have anything similar to Azure "Availability Set" in GCP and AWS

Context :
We are prototyping a multi-cloud deployment of our application (based on microservices).
To balance high availability and co-location, we used the 'Availability Sets' feature in Azure, which more or less ensures that Azure platform/service upgrades don't happen in two distinct sets simultaneously.
Availability sets Azure
Scenario :
I couldn't find anything similar in Google Cloud Platform and AWS. So in this case we have to go with separate "Zones" for high availability.
One argument in favor of availability sets (theoretically) is that they are more tightly co-located than zones, since an availability set sits inside a single data center.
Do we have anything close to 'availability sets' in GCP and AWS? Please share your thoughts.
Regarding GCP, there are several solutions for high availability. In general, it is recommended to follow the guidance in Design Robust Systems and Building Scalable and Resilient Applications.
By designing robust systems you are ensuring that your VMs remain available in the case of a single-instance failure, an instance reboot, or an issue with the zone.
What looks most similar to Availability Sets are Managed Instance Groups (MIGs).
The managed instance group auto-updater allows you to deploy new versions of software to instances in your MIG, supporting different rollout scenarios (rolling updates, canary updates). You can control the speed and scope of deployment as well as the level of disruption to your service.
Also, you can use Regional Persistent Disks, which replicate data across zones (data centers).
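For illustration, here is a hedged sketch that creates a regional managed instance group with the google-cloud-compute Python client, which spreads identical VMs across the zones of a region; the project, region, group name and instance-template path are placeholders, and field names can differ slightly between client-library versions, so treat it as a sketch rather than copy-paste code.

from google.cloud import compute_v1

client = compute_v1.RegionInstanceGroupManagersClient()

mig = compute_v1.InstanceGroupManager(
    name="web-mig",                  # placeholder group name
    base_instance_name="web",
    instance_template=(
        "projects/my-project/global/instanceTemplates/web-template"  # placeholder
    ),
    target_size=3,   # instances are distributed across the region's zones
)

# Returns a long-running operation; poll or wait for completion per the
# client-library documentation.
operation = client.insert(
    project="my-project",            # placeholder project ID
    region="us-central1",            # placeholder region
    instance_group_manager_resource=mig,
)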
It sounds like Placement Groups may be an equivalent feature in AWS. There are a few different configurations where you can ask AWS to cluster your instances close together to maximize network I/O performance, or spread your instances across hardware to reduce correlated failures.
Cluster – packs instances close together inside an Availability Zone. This strategy enables workloads to achieve the low-latency network performance necessary for tightly-coupled node-to-node communication that is typical of HPC applications.
Partition – spreads your instances across logical partitions such that groups of instances in one partition do not share the underlying hardware with groups of instances in different partitions. This strategy is typically used by large distributed and replicated workloads, such as Hadoop, Cassandra, and Kafka.
Spread – strictly places a small group of instances across distinct underlying hardware to reduce correlated failures.
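As an illustration of the 'spread' strategy, here is a hedged boto3 sketch that creates a spread placement group and launches instances into it; the region, AMI ID and instance type are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region

# Create a spread placement group, then launch instances into it so each
# lands on distinct underlying hardware.
ec2.create_placement_group(GroupName="app-spread", Strategy="spread")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",     # placeholder AMI ID
    InstanceType="m5.large",             # placeholder instance type
    MinCount=3,
    MaxCount=3,
    Placement={"GroupName": "app-spread"},
)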
I can't speak for Google Cloud as I am not aware of a similar feature but I am also not nearly as familiar with their offerings.
Hope that helps.

AWS vs GCP Cost Model

I need to make a cost model for AWS vs GCP. Currently, our organization is using AWS. Our biggest services used are:
EC2
RDS
Lambda
API Gateway
S3
Elasticache
Cloudfront
Kinesis
I have very limited knowledge of cloud platforms. However, I have access to:
AWS Simple Monthly Calculator
Google Cloud Platform Pricing Calculator
MAP AWS services to GCP products
I also have access to CloudHealth so that I can get a breakdown of costs per services within our organization.
Of the 8 major services listed above, our main usage and costs go to EC2, S3, and RDS.
Our director of engineering mentioned that I should be most concerned with vCPU and memory.
I would appreciate any insight (big or small) that people have into how I can go about creating this model, any other factors I should consider, which functionalities of the two providers for the services are considered historically "better" or cheaper, etc.
Thanks in advance, and any questions people may have, I am more than happy to answer.
-M
You should certainly cost-optimize your resources. It's so easy to create cloud resources that people don't always think about turning things off or right-sizing them.
Looking at your Top 5...
Amazon EC2
The simplest way to save money with Amazon EC2 is to turn off unused resources. You can even stop instances overnight and on the weekend. If they are only used 8 hours per workday, that is only 40 out of 168 hours, so you can save roughly 75% by turning them off when unused! For example, Dev and Test instances. People have written various types of automated utilities to turn instances on and off based on tags. Try searching the Internet for 'AWS Stopinator'.
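As an illustration of that tag-based approach, here is a minimal boto3 sketch that stops running instances carrying a hypothetical AutoStop=true tag; in practice you would run it on a schedule (for example from a scheduled Lambda function or cron) with a matching script to start the instances again in the morning.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region

# Find running instances tagged AutoStop=true (hypothetical tag).
paginator = ec2.get_paginator("describe_instances")
pages = paginator.paginate(
    Filters=[
        {"Name": "tag:AutoStop", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

instance_ids = [
    instance["InstanceId"]
    for page in pages
    for reservation in page["Reservations"]
    for instance in reservation["Instances"]
]

if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print("Stopping:", instance_ids)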
Another way to save money on Amazon EC2 is to use spot instances. They are a fraction of the price, but have a risk that they might be turned off when demand increases. They are great where it is okay for systems to be terminated sometimes, such as automated testing systems. They are also a great way to supplement existing capacity at a fraction of the price.
If you definitely need the Amazon EC2 instances to keep running all the time, purchase Amazon EC2 Reserved Instances, which also offer a price saving.
Chat with your AWS Account Manager for help with the above options.
Amazon Relational Database Service (RDS)
Again, Amazon RDS instances can be stopped overnight/on weekends and turned on again when needed. You only pay while the instance is running (plus storage costs).
Examine the CloudWatch metrics for your RDS instances and determine whether they can be downsized without impacting applications. You can even resize them when they are used less (eg over weekends). Everything can be scripted, so you could trigger such downsizing and upsizing on a schedule.
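To illustrate both ideas, here is a hedged boto3 sketch that stops one RDS instance and downsizes another; the identifiers and instance class are placeholders, and note that AWS automatically restarts a stopped RDS instance after seven days.

import boto3

rds = boto3.client("rds", region_name="us-east-1")   # placeholder region

# Stop an instance overnight / over the weekend (storage charges still apply).
rds.stop_db_instance(DBInstanceIdentifier="reporting-db")   # placeholder identifier

# Downsize another instance for a low-traffic period; ApplyImmediately
# causes a brief outage while the instance class changes.
rds.modify_db_instance(
    DBInstanceIdentifier="app-db",        # placeholder identifier
    DBInstanceClass="db.t3.medium",       # placeholder (smaller) instance class
    ApplyImmediately=True,
)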
Also look at the Engine used with RDS. Commercial offerings such as Oracle and Microsoft SQL Server are more expensive than open-source offerings like MySQL and PostgreSQL. Yes, your applications might need some changes, but the cost savings can be significant.
AWS Lambda
It is most unusual that Lambda is #3 in your list. In fact, some customers never get a charge for Lambda because it falls in the monthly free usage tier. Having high charges means you're making good use of Lambda (which is saving you EC2 costs), but take a look at which applications are using it the most and see whether they are using it wisely.
When correctly used, a Lambda function should only ever run for a few seconds, so check whether any applications seem to be using it outside this pattern.
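One way to check that pattern is to pull the Duration metric from CloudWatch. Here is a hedged boto3 sketch for a single function over the last day; the function name and region are placeholders.

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")   # placeholder region

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Duration",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],   # placeholder name
    StartTime=now - timedelta(days=1),
    EndTime=now,
    Period=3600,                       # one datapoint per hour
    Statistics=["Average", "Maximum"],
    Unit="Milliseconds",
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])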
AWS API Gateway
Once again, these costs tend to be low ($3.50 per million calls), so I'd recommend trying to figure out how this is being used. If you really need that many calls, it would also explain the high Lambda costs. It would probably be more expensive if you were providing the same functionality via Amazon EC2.
Amazon S3
Consider using different Storage Classes to reduce your costs. Costs can be reduced by:
Moving infrequently-accessed data to a different storage class
Moving data to One-Zone (if you have a copy of the data elsewhere, so don't need the redundancy)
Archiving infrequently-accessed data to Amazon Glacier, which offers much cheaper storage but does not have instant access
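Those transitions can be automated with a lifecycle rule. Here is a hedged boto3 sketch that moves objects to Standard-IA after 30 days and to Glacier after 90 days; the bucket name and thresholds are placeholders.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",                      # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-old-objects",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},         # apply to all objects
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)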
With GCP, you can benefit by receiving discounts such as the Committed Use Discount and the Sustained Use Discount.
With a Committed Use Discount, you can receive a discount of up to 70% if your usage is predictable.
With the Sustained Use Discount, there is an incremental discount if you reach certain usage thresholds.
On your concern with vCPU and memory, you may use predefined machine types, which are cheaper than custom machine types.
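To tie this back to the vCPU-and-memory focus, here is a hedged Python sketch of a simple comparison. The hourly rates below are placeholders, not real prices; replace them with figures from the AWS and GCP pricing calculators, and apply discounts (sustained/committed use, Reserved Instances) as a multiplier.

HOURS_PER_MONTH = 730

# Placeholder rates - replace with values from the pricing calculators.
rates = {
    "aws": {"vcpu_hour": 0.02, "gb_hour": 0.005, "discount": 0.0},
    "gcp": {"vcpu_hour": 0.02, "gb_hour": 0.005, "discount": 0.20},
}

def monthly_cost(provider, vcpus, memory_gb):
    r = rates[provider]
    on_demand = (vcpus * r["vcpu_hour"] + memory_gb * r["gb_hour"]) * HOURS_PER_MONTH
    return on_demand * (1 - r["discount"])

# Example shape: 4 vCPUs and 16 GB RAM, roughly an m5.xlarge / n1-standard-4 class VM.
for provider in rates:
    print(provider, round(monthly_cost(provider, vcpus=4, memory_gb=16), 2))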
Lastly, you can also test the charges by trying out the Google Cloud Platform Free Tier.