Simple question for GCE users: are persistent boot disks safe to be used or data loss could occur?
I've seen that I can attach additional persistent disks, but what about the standard boot disks (that should be persistent as well) ?
What happens during maintenance, equipment failures and so on ? Are these boot disks stored on hardware with built-in redundancy (raid and so on) ?
In other words, are a compute instance with persistent boot-disk similiar to a non-cloud VM stored on local RAID (from data-loss point of view) ?
Usually cloud instances are volatile, a crash, shutdown, maintenance and so on, will destroy all data stored.
Obvisouly, i'll have backups.
GCE Persistent Disks are designed to be durable and highly-available:
Persistent disks are durable network storage devices that your instances can access like physical disks in a desktop or a server. The data on each persistent disk is distributed across several physical disks. Compute Engine manages the physical disks and the data distribution to ensure redundancy and optimize performance for you.
(emphasis my own, source: Google documentation)
You have a choice of zonal or regional (currently in public beta) persistent disks, on an HDD or SSD-based platform. For boot disks, only zonal disks are supported as of the time of this writing.
As the name suggests, zonal disks are only guaranteed to persist their data within a single zone; outage or failure of that zone may render the data unavailable. Writes to regional disks are replicated to two zones in a region to safeguard against the outage of any one zone. The Google Compute Engine console, "Disks" section will show you that boot disks for your instances are zonal persistent disks.
Irrespective of the durability, it is obviously wise to keep your own backups of your persistent disks in another form of storage to safeguard other mechanisms for data loss, such as corruption in your application or user error by an operator. Snapshots of persistent disks are replicated to other regions; however, be aware of their lifecycle in the event the parent disk is deleted.
In addition to reviewing the comprehensive page linked above, I recommend reviewing the relevant SLA documentation to ascertain the precise guarantees and service levels offered to you.
Usually cloud instances are volatile, a crash, shutdown, maintenance and so on, will destroy all data stored.
The cloud model does indeed prefer instances which are stateless and can be replaced at will. This offers many scalability and robustness advantages, which can be achieved using managed instance groups, for example. However, you can use VMs for persistent storage if desired.
normally the data boot disk should be ok with restart and other maintenance operation. But it will be deleted with the compute by default.
If you use managed-instance-group, preemptible compute... and you want persistent data, you should use another storage system. If you juste use compute as is, it should be safe enough with backup.
I still think an additional persistent disk or another storage system is a better way to do things. But it's only my opinion.
Related
I am asking about the GCP Preemptive Instances. I have read that when an instance is terminated on a preemptive instance, an ACPI Soft Power Off occurs.
I am wondering if the hypervisor pauses the instance so that I can continue on my tasks. Or, the VM is shut off and not paused.
I have used preemptive instances in the past, but I cannot seem to remember if the VM was shut down or paused.
When your instance is turned off (or terminated) as stated in the documentation you can still access your data which are stored on a persistent disk.
Preemption process is described in the documentation and it states:
Preempted instances still appear in your project, but you are not charged for the instance hours while it remains in a TERMINATED state. You can access and recover data from any persistent disks that are attached to the instance, but those disks still incur storage charges until you delete them. As with normal instances, persistent disks that are marked for auto-delete are deleted when you delete the preemptible instance.
You can start your instance later and access your data - and if a proper resources are not available then you can attach the disk to other VM's and still access your data.
In short you will retain the disk data and when new vm comes up then it will automatically attach to it
From what I could find, Google Cloud will only allow me to create a snapshot of a machine disk.
Is it possible in some way to also capture its runtime? i.e RAM and process states.
Unfortunately, snapshots are limited to the persistent disk and not runtime processes and RAM. I would also like to mention that it is not possible to have a snapshot of RAM as this is volatile memory.
I need to deploy Cassandra on AWS but am confused as to what type of AWS storage is most suitable for Cassandra.
The Datastax documentation here:
http://docs.datastax.com/en/cassandra/3.0/cassandra/planning/planPlanningEC2.html
says that EBS volumes are recommended. At the same time the Datastax AMI documentation:
http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installAMI.html
says that:
Uses RAID0 ephemeral disks for data storage and commit logs.
Launches EBS-backed instances for faster start-up, not database
storage.
So which one is the recommended storage type for Cassandra? The EBS storage or the Instance storage?
Many of the new eC2 instances are EBS only (http://www.ec2instances.info/) I am not sure when the cassandra document was written but EBS disk have improved a lot recently and amazon launches new type frequently, so you will be able to find what you're looking for with one of the type
You can check https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html?icmpid=docs_ec2_console and its recommended Provisioned IOPS SSD (io1)
To add a reason why AWS is moving to EBS and why it would be good for cassandra data is because of ephemeral type of data, you might not want your data to disappear if your instance is terminated (because of a crash or a stop you made) at least when your instance is gone, you still have access to your data and can attach the EBS volume to a new instance (really useful also when up/down-grading instances)
I came upon this presentation, which clearly answers the question with a very interesting use case:
https://www.youtube.com/watch?v=1R-mgOcOSd4
To summarize:
EBS has changed a lot since 2011 when major companies like Netflix
had problems with it.
EBS and GP2 are now the recommended storage
for Cassandra and you should not expect any bottlenecks there.
Datastax have recently updated their documentation to also recommend
EBS:
http://docs.datastax.com/en/cassandra/3.0/cassandra/planning/planPlanningEC2.html
No doubt EBS,
Memory optimized boxes are best suited for cassandra
T2
T2 are Burstable Performance Instances that offer a baseline level of CPU performance with the capability to burst above the baseline
M4
M4 instances are the most recent general-purpose instances. The M4 family of instances offers a balance of memory, network, and compute resources, and it is a better option for several applications
C4
These instances are recent additions to the compute-optimized instances that feature maximum performance processors with the lowest compute/price performance in EC2 Instance types.
X1
These instances are best suited for enterprise-class, large-scale, in-memory applications and offer the lowest price for each GiB of RAM among AWS EC2 instance types. The X1 instances are the latest addition to the EC2 memory-optimized instance group and are intended for executing high-scale, in-memory databases and in-memory applications over the AWS cloud.
for pricing and other information
https://aws.amazon.com/ec2/instance-types/
I'm new to AWS and also to Cassandra. I just read about EBS and S3 storage available in AWS. I was trying to figure out if we have Cassandra installed in EC2, which storage would it use? EBS or S3? Or is there other storage? I'm little confused with this. Please help me understand this.
Thanks
Aravind
You shouldn't run Cassandra on EBS, as recommended per Datastax itself :
"EBS volumes are not recommended for Cassandra data volumes for the following reasons:
EBS volumes contend directly for network throughput with standard packets. This means that EBS throughput is likely to fail if you saturate a network link.
EBS volumes have unreliable performance. I/O performance can be exceptionally slow, causing the system to back load reads and writes until the entire cluster becomes unresponsive.
Adding capacity by increasing the number of EBS volumes per host does not scale. You can easily surpass the ability of the system to keep effective buffer caches and concurrently serve requests for all of the data it is responsible for managing."
http://docs.datastax.com/en/cassandra/1.2/cassandra/architecture/architecturePlanningEC2_c.html
The answer above comes from Cassandra 1.2, a relatively old version. Documentation for newer versions of Cassandra indicate that EBS Optimized instances using GP2 SSD can be used for production workloads.
http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningEC2.html
Things that changed since then were the creation of EBS Optimized instances, which reduces and/or eliminates noisy neighbor throughput problems, and using GP2 SSD for EBS storage.
If you are just getting started, I would recommend EBS Optimized. The performance should be pretty good, but you gain a critical ability -> creating snapshots. This reduces the risk of your instance becoming unstable because you would have S3-backed volume snapshots for AWS to rebuild data from if a drive died.
This reduces the need to setup your Cassandra cluster across regions. One of the concerns that you have to build around when using Ephemeral is a whole region potentially going down, which could wipe out your entire cluster if you didn't build a multi-region cluster. With EBS, this isn't really a concern.
For Cassandra you need to use EBS. S3 is an object store with and API to store and retrieve objects, but not easy querying mechanisms. The use cases include backup and archiving, Disaster Recovery, Static Website Hosting, etc
However, you can use S3 for Cassandra backup.
You can also consider ephemeral disks (as Jeff mentions) and storage which comes with AWS instance.
The High I/O instance in EC2 uses SSD. How does one run a database on such an instance while guaranteeing persistance of data?
From my limited understanding, I'm suppose to use Elastic Block Store (EBS) so that even if the machine goes down the data on the disk doesn't disappear. On the other hand the instance storage SSD of a High I/O instance is ephemeral and can't be used for database storage because if, for example, the machine loses power the data image isn't preserved. Is my understanding correct?
Point 1) If your workloads need High IO SSD for DB, then you should have Master Slave setup. Ideally 1 master and 2 slaves spread across 3 AZ's is suggested. Even if there is an outage on single AZ the alternate AZ's can handle the load and serve your High availability needs. Between master - slave you can employ synchronous, semi or async replication depending upon your DB. This solution is costlier.
Point 2) Generally if your DB is OLTP in nature, then Amazon EBS PIOPS + EBS optimized gives you consistent IOPS. A Single EBS Volume can provide 4000 IOPS and you can RAID 0 multiple volumes and gain 10k+ IOPS for performance. Lots of customers are taking this route in AWS. Even though you may use EBS for persistence, it is still recommended to go with Master-Slave architecture for High Availability. I have written detailed articles on this topic in blog, refer them for more information.
It is the same as other ephemeral storage, it does not guarantee persistence. Persistance is handled by replication between instances with at least one instance writing to an EBS volume.
If you want your data to persist, you're going to need to use EBS. Building a database on an ephemeral drive, regardless of performance, seems a dubious design choice.
EBS now offers 4K IOPS volumes, which is, depending on your database requirements, quite possibly more than sufficient.
My next question would really be: Do you want to host/run your own database?
Turnkey products such as RDS and DynamoDB may be sufficient for your needs. Using them is much easier than setting up and managing your own database. RDS is now advertising "You can now provision up to 3TB and 30,000 IOPS per DB Instance". That's enough database horsepower for many, many problem sets.