As you can see from the [docs][1], Dataflow has a 99.5% SLA, whereas a similar service like AWS EMR offers 99.9% for EACH single region.
That should mean that if I were to create a system using EMR that replicates across regions, I could calculate the "compound SLA" as 1 - (0.001)^n_regions and boost the availability of my service, just as is done with distributed systems in HA mode.
Could I achieve the same thing by deploying multiple Dataflow jobs in several GCP regions?
EDIT: all of these considerations assume (maybe wrongly) that one region's operational status does not affect other regions, i.e. that regions are as independent as possible.
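For reference, a minimal sketch of the compound-availability arithmetic above, assuming fully independent regions and using the SLA figures quoted in the question:

// Compound availability for n independent replicas, each with per-region availability p:
// P(at least one region up) = 1 - (1 - p)^n. Assumes region failures are independent (see EDIT above).
public class CompoundSla {
    public static double compoundAvailability(double perRegionAvailability, int regions) {
        return 1.0 - Math.pow(1.0 - perRegionAvailability, regions);
    }

    public static void main(String[] args) {
        System.out.printf("EMR      (99.9%% per region), 2 regions: %.6f%n", compoundAvailability(0.999, 2)); // 0.999999
        System.out.printf("Dataflow (99.5%% per region), 2 regions: %.6f%n", compoundAvailability(0.995, 2)); // 0.999975
    }
}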
What is the difference between AWS Batch and a SageMaker Training Job when using them to run a Docker image for machine learning training?
Both services are implementations of CaaS, a.k.a. Container-as-a-Service: you don't have to manage clusters and only define a launch configuration. In that regard, both can be used for running training jobs once you have your Docker image ready. Notable differences are:
[Operational complexity] AWS Batch has higher operational complexity than SageMaker training jobs. With the latter you don't need to provision any infrastructure - at most the role that is generated automatically. With the former you would need to deploy infrastructure, although you would have more fine-grained control over it.
[Architecture] AWS Batch is less of a pure CaaS and closer to a managed cluster: it has a job queue, scales the cluster based on queue size, and places jobs on the machines. SageMaker training jobs start a VM per job, and the VM itself is abstracted away from the user. So, for example, you can SSH into an AWS Batch instance but not into a SageMaker one.
[Docker image] SageMaker requires heavier customization of the Docker container to make it work, but in exchange you don't have to implement things like passing hyperparameters, gathering metrics, and saving the model yourself. AWS Batch just runs the container, so any associated business logic has to be implemented by the developer.
[Cost] Both AWS Batch and SageMaker training jobs are free themselves, i.e. you only pay for the underlying infrastructure that was used. SageMaker training jobs use ml.* instances, which are roughly 10-25% more expensive than their on-demand counterparts (e.g. p2.xlarge costs $0.90 per hour while ml.p2.xlarge costs $1.125 per hour). Both services have a way of running spot instances, which lowers the cost.
So to summarize - AWS Batch is a more generalized and customizable tool, while SageMaker Training Jobs is a more focused one with more prebuilt features.
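To make the comparison concrete, here is a minimal sketch (AWS SDK for Java v2) of launching the same training image on each service. The job, queue, role, bucket and image names are placeholders rather than anything from the answer above, and the Batch job queue and job definition are assumed to already exist:

// Minimal sketch: submitting a training container to AWS Batch vs. SageMaker.
import software.amazon.awssdk.services.batch.BatchClient;
import software.amazon.awssdk.services.batch.model.SubmitJobRequest;
import software.amazon.awssdk.services.sagemaker.SageMakerClient;
import software.amazon.awssdk.services.sagemaker.model.*;

public class TrainingJobLaunchers {
    // AWS Batch: the queue and job definition must already exist; Batch just runs the container.
    static void submitBatchJob(BatchClient batch) {
        batch.submitJob(SubmitJobRequest.builder()
                .jobName("my-training-job")
                .jobQueue("my-job-queue")
                .jobDefinition("my-training-job-definition")
                .build());
    }

    // SageMaker: no cluster to manage; one ml.* VM is started per job and the model
    // artifacts are written to S3 for you.
    static void startSageMakerJob(SageMakerClient sagemaker) {
        sagemaker.createTrainingJob(CreateTrainingJobRequest.builder()
                .trainingJobName("my-training-job")
                .roleArn("arn:aws:iam::123456789012:role/SageMakerExecutionRole")
                .algorithmSpecification(AlgorithmSpecification.builder()
                        .trainingImage("123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training:latest")
                        .trainingInputMode(TrainingInputMode.FILE)
                        .build())
                .resourceConfig(ResourceConfig.builder()
                        .instanceType(TrainingInstanceType.ML_P2_XLARGE)
                        .instanceCount(1)
                        .volumeSizeInGB(50)
                        .build())
                .outputDataConfig(OutputDataConfig.builder()
                        .s3OutputPath("s3://my-bucket/training-output/")
                        .build())
                .stoppingCondition(StoppingCondition.builder()
                        .maxRuntimeInSeconds(3600)
                        .build())
                .build());
    }
}

Note how the Batch call only points at a pre-existing job definition, while the SageMaker call carries the instance, image and output configuration itself.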
The Datastore docs say:
the replication between Datastore servers. Replication is managed by
Cloud Bigtable and Megastore, the underlying technologies for
Datastore
The Bigtable docs say:
Replication for Cloud Bigtable enables you to increase the
availability and durability of your data by copying it across multiple
regions or multiple zones within the same region
How can I see in the Datastore UI whether I'm getting any replication? And if I am, how can I tell whether my Datastore entities are replicated cross-region or cross-zone?
(The entities I'm looking at have been populated since 2017 if that's useful.)
The short answer is that if you are in a multi-region location, you can already access your data from multiple regions without worrying about asynchronous replication lag.
If you are really curious about Megastore replication, you can read the Megastore paper. However, what you more likely want is to read about the trade-offs between strong consistency and eventual consistency in Datastore.
The locations for Cloud Datastore currently match those of Cloud Firestore in either mode.
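To make the strong vs. eventual consistency trade-off mentioned above concrete, here is a minimal sketch using the Cloud Datastore Java client; the "Task" kind and key name are made up for illustration:

import com.google.cloud.datastore.*;

public class ConsistencyDemo {
    public static void main(String[] args) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

        // Lookups by key are strongly consistent by default...
        Key key = datastore.newKeyFactory().setKind("Task").newKey("task-1");
        Entity byKey = datastore.get(key);

        // ...but can be explicitly relaxed to eventual consistency.
        Entity maybeStale = datastore.get(key, ReadOption.eventualConsistency());

        // Ancestor queries are strongly consistent; global (non-ancestor) queries
        // like this one are only eventually consistent.
        Query<Entity> globalQuery = Query.newEntityQueryBuilder()
                .setKind("Task")
                .build();
        QueryResults<Entity> results = datastore.run(globalQuery);
        while (results.hasNext()) {
            System.out.println(results.next().getKey());
        }
    }
}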
Cloud Datastore is only a regional service. You can't deploy it in multiple regions in the same project.
Its brother (or sister, I don't know), Firestore, can be deployed in a multi-region configuration.
So Datastore is mono-region only, but multi-zonal within that single region, and the Bigtable replication mechanism is used to achieve this replication. You can't see this; it's serverless and transparent.
We are running on AWS, where we run everything in one region and use AZs for our services, so if an AZ failed we would still be "up" and servicing our customers.
Reading the Reliability Pillar of the AWS Well-Architected documentation suggests that this would be enough in the case of a failure:
Unless you require a multi-region strategy, we advise you to meet your
recovery objectives in AWS using multiple Availability Zones within an
AWS Region.
I see tools out there like Cloud Endure and Druva CloudRange, but they seem to be more for on-premise setups or other cloud providers migrating to or recovering on AWS.
My question is this: it is hard to find out definitively, but it appears regions never go down entirely; rather, services within an AZ, or the AZ itself, go down. So if you are using AZs for your applications and databases and doing backups to S3 (with Cross-Region Replication), is this enough for DR?
Regions may not go down but they can become functionally unusable. There was an outage of eu-west-2a about 3 months ago that rendered large parts of eu-west-2 more-or-less unusable.
If you want redundancy, you should be mirroring infra to at least one other region.
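Since the question already leans on backups to S3 with Cross-Region Replication, here is a minimal sketch of enabling it with the AWS SDK for Java v2. The bucket names and IAM role ARN are placeholders, and both buckets are assumed to already exist with versioning enabled:

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;

public class EnableCrossRegionReplication {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            s3.putBucketReplication(PutBucketReplicationRequest.builder()
                    .bucket("my-primary-bucket") // source bucket; versioning must be enabled
                    .replicationConfiguration(ReplicationConfiguration.builder()
                            .role("arn:aws:iam::123456789012:role/s3-replication-role")
                            .rules(ReplicationRule.builder()
                                    .id("dr-copy")
                                    .status(ReplicationRuleStatus.ENABLED)
                                    .priority(1)
                                    // Empty prefix = replicate every object in the bucket.
                                    .filter(ReplicationRuleFilter.builder().prefix("").build())
                                    .deleteMarkerReplication(DeleteMarkerReplication.builder()
                                            .status(DeleteMarkerReplicationStatus.DISABLED)
                                            .build())
                                    .destination(Destination.builder()
                                            .bucket("arn:aws:s3:::my-dr-bucket-in-another-region")
                                            .build())
                                    .build())
                            .build())
                    .build());
        }
    }
}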
It's evident that preemptible instances are cheaper than non-preemptible instances. On a daily basis, 400-500 Dataflow jobs run in my organisation's project, some of which are time-sensitive and others not. So is there any way I could use preemptible instances for the non-time-constrained jobs, which would lower the overall cost of pipeline execution? Currently I'm running Dataflow jobs with the configuration specified below.
options.setTempLocation("gs://temp/");
options.setRunner(DataflowRunner.class);
options.setTemplateLocation("gs://temp-location/");
options.setWorkerMachineType("n1-standard-4");
options.setMaxNumWorkers(20);
options.setWorkerCacheMb(2000);
I'm not able to find any pipeline option for a preemptible instance setting.
Yes, it is possible to do so with Flexible Resource Scheduling in Cloud Dataflow (docs). Note that there are some things to consider:
Delayed execution: jobs are scheduled and not executed right away (you can see a new QUEUED status for your Dataflow jobs). They are run opportunistically when resources are available within a six-hour window. This makes FlexRS suitable to reduce cost for non-time-critical workloads. Also, be sure to validate your code before sending the job.
Batch jobs: as of now it only accepts batch jobs and requires autoscaling to be enabled:
You cannot set autoscalingAlgorithm=NONE
Dataflow Shuffle: it needs to be enabled. When it is, no data is stored on persistent disks attached to the VMs, so when a preemption happens and resources are reclaimed, there is no need to redistribute the data.
Regions: following from the previous item, only regions where Dataflow Shuffle is supported can be selected. The list is here; turn-up of new regions will be announced in the release notes. As of now, the zone is chosen automatically within the region.
Machine types: FlexRS currently supports n1-standard-2 (default) and n1-highmem-16.
SDK: requires 2.12.0 or newer for Java or Python.
Quota: quota is reserved upfront (i.e. queued jobs also consume quota).
In order to run it, use --flexRSGoal=COST_OPTIMIZED and make sure the rest of the parameters conform to the FlexRS requirements.
A uniform discount rate is applied to FlexRS jobs; you can compare pricing details in the following link.
Note that you might see a Beta disclaimer in the non-English documentation but, as clarified in the release notes, it's Generally Available.
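Since the question configures its job programmatically, the FlexRS goal can also be set in code. A minimal sketch, assuming Beam Java SDK 2.12.0+ and that 'options' is the same PipelineOptions object configured in the question:

// Minimal sketch (assumes Beam Java SDK >= 2.12.0 with the Dataflow runner on the classpath).
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;

DataflowPipelineOptions dataflowOptions = options.as(DataflowPipelineOptions.class);
// Ask for cost-optimized (FlexRS) scheduling instead of regular on-demand workers.
dataflowOptions.setFlexRSGoal(DataflowPipelineOptions.FlexResourceSchedulingGoal.COST_OPTIMIZED);
// Dataflow Shuffle is required for FlexRS (see above); depending on SDK defaults it may
// already be enabled, otherwise the shuffle_mode=service experiment turns it on.
dataflowOptions.setExperiments(java.util.Arrays.asList("shuffle_mode=service"));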
Context:
We are prototyping a multi-cloud deployment of our application (based on microservices).
To balance high availability and co-location, we used the "Availability Sets" feature in Azure, which more or less ensures that Azure platform/service upgrades don't happen in two distinct sets simultaneously.
Availability sets Azure
Scenario:
I couldn't find anything similar in Google Cloud Platform or AWS, so in this case we would have to go with separate "Zones" for high availability.
One argument in favor of Availability Sets (theoretically) is that they are packed more closely than Zones, as the former sit inside a single data center.
Do we have anything close to "availability sets" in GCP and AWS? Please share your thoughts.
Regarding GCP, there are several solutions for high availability. In general it is recommended to Design Robust Systems that are resilient to failures and to follow Building scalable and resilient applications.
By designing robust systems you ensure that your VMs remain available in the case of a single-instance failure, an instance reboot, or an issue with the zone.
What looks most similar to Availability Sets is Managed Instance Groups.
The managed instance group auto-updater allows you to deploy new versions of software to instances in your MIG, supporting different rollout scenarios (rolling updates, canary updates). You can control the speed and scope of deployment as well as the level of disruption to your service.
Also, you can use a Regional Persistent Disk, which replicates data across zones (data centers).
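As an illustration of the Managed Instance Groups suggestion, here is a minimal sketch using the google-cloud-compute Java client; the project, zone, template and group names are placeholders, and the instance template is assumed to exist already:

import com.google.cloud.compute.v1.InstanceGroupManager;
import com.google.cloud.compute.v1.InstanceGroupManagersClient;
import com.google.cloud.compute.v1.Operation;
import java.util.concurrent.TimeUnit;

public class CreateMigSketch {
    public static void main(String[] args) throws Exception {
        String project = "my-project";
        String zone = "europe-west2-a";

        try (InstanceGroupManagersClient migClient = InstanceGroupManagersClient.create()) {
            InstanceGroupManager mig = InstanceGroupManager.newBuilder()
                    .setName("my-service-mig")
                    .setBaseInstanceName("my-service")
                    // Existing instance template that defines the VM shape and image.
                    .setInstanceTemplate("projects/my-project/global/instanceTemplates/my-template")
                    .setTargetSize(3)
                    .build();

            // Creating a MIG is a long-running operation; wait for it to finish.
            Operation done = migClient.insertAsync(project, zone, mig).get(3, TimeUnit.MINUTES);
            System.out.println("MIG creation status: " + done.getStatus());
        }
    }
}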
It sounds like Placement Groups may be the equivalent feature in AWS. There are a few different configurations where you can ask AWS to cluster your instances close together to maximize network I/O performance, or to spread them across hardware to reduce correlated failures (a minimal creation sketch follows the list of strategies below).
Cluster – packs instances close together inside an Availability Zone. This strategy enables workloads to achieve the low-latency network performance necessary for tightly-coupled node-to-node communication that is typical of HPC applications.
Partition – spreads your instances across logical partitions such that groups of instances in one partition do not share the underlying hardware with groups of instances in different partitions. This strategy is typically used by large distributed and replicated workloads, such as Hadoop, Cassandra, and Kafka.
Spread – strictly places a small group of instances across distinct underlying hardware to reduce correlated failures.
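Here is a minimal sketch of the spread strategy with the AWS SDK for Java v2; the group name, AMI ID and instance type are placeholders:

import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.*;

public class SpreadPlacementExample {
    public static void main(String[] args) {
        try (Ec2Client ec2 = Ec2Client.create()) {
            // "spread" places each instance on distinct underlying hardware.
            ec2.createPlacementGroup(CreatePlacementGroupRequest.builder()
                    .groupName("my-spread-group")
                    .strategy(PlacementStrategy.SPREAD)
                    .build());

            // Launch instances into the group so they land on separate hardware.
            ec2.runInstances(RunInstancesRequest.builder()
                    .imageId("ami-0123456789abcdef0")
                    .instanceType(InstanceType.T3_MICRO)
                    .minCount(2)
                    .maxCount(2)
                    .placement(Placement.builder().groupName("my-spread-group").build())
                    .build());
        }
    }
}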
I can't speak for Google Cloud as I am not aware of a similar feature but I am also not nearly as familiar with their offerings.
Hope that helps.