Google Cloud - Local SSD hardware failure? - google-cloud-platform

We are planning to use Google Cloud Local SSDs because we need better IOPS than persistent SSD disks provide. We want to build a RAID5 array from 4 disks with mdadm (Linux). My question: how can we handle hardware failure with these disks? We can't unplug them, because we don't have physical access to the server. If we remove a disk with mdadm and add a new one, will that solve the problem?
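For reference, here is a quick sketch of the capacity trade-off we are weighing between RAID levels (assuming 375 GB per disk, the fixed size of a GCE Local SSD partition; the function name is ours):

```python
# Usable capacity of common mdadm RAID levels, given n identical disks.
# Assumes 375 GB per disk, the fixed size of a GCE Local SSD partition.

def usable_gb(level: str, n_disks: int, disk_gb: int = 375) -> int:
    """Return usable capacity in GB for a given RAID level."""
    if level == "raid0":
        return n_disks * disk_gb            # striping, no redundancy
    if level == "raid5":
        return (n_disks - 1) * disk_gb      # one disk's worth of parity
    if level == "raid10":
        return n_disks // 2 * disk_gb       # mirrored pairs
    raise ValueError(f"unsupported level: {level}")

print(usable_gb("raid5", 4))   # 4 x 375 GB in RAID5 -> 1125 GB usable
```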

Local SSD is an ephemeral storage space and is not a reliable storage method. For example, should the machine hosting your VM suffer from a hardware failure, your data will be lost and unrecoverable. The same is true if you stop the machine on purpose or accidentally.
RAID does not help, as your instance (and Google for that matter) will lose access to the data you stored on Local SSD once the instance stops running on that machine.
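For completeness: on a machine where you can actually swap drives, replacing a failed RAID5 member with mdadm looks roughly like the sketch below (`/dev/md0` and the device names are placeholders). On GCE this doesn't help you, because Local SSDs can only be attached at instance creation, so there is no fresh disk to `--add`:

```shell
# Sketch only: generic mdadm member replacement on real hardware.
# /dev/md0, /dev/sdc1 and /dev/sde1 are placeholder names.
mdadm --manage /dev/md0 --fail /dev/sdc1      # mark the bad member as failed
mdadm --manage /dev/md0 --remove /dev/sdc1    # remove it from the array
mdadm --manage /dev/md0 --add /dev/sde1       # add the replacement disk
cat /proc/mdstat                              # watch the rebuild progress
```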

Related

Is it possible to attach usb to an instance?

Is it possible to attach USB to an instance? Just curious to know if it is possible to attach a memory stick or an external hard disk to a GCP instance.
It's not possible to attach USB (or any other storage device that you own) to a GCP VM instance.
The only supported storage solutions are described in the official documentation.
It's possible to attach a local SSD disk, but it's only temporary storage: in case of a VM restart (or shutdown), the data is lost (in most cases).
You could mount your own USB stick as a samba share but you'd have to run another machine somewhere where you can physically connect it and share it from.

How can one keep the data on a local SSD between stopping and restarting an instance

In my case I need only CPU compute for a while, and then at the end I need GPUs. So I run the instance only with CPUs, then stop and restart with GPUs added (and CPUs reduced). However, it seems this will lead to the data on the local SSD being erased. Is there any way around that? Could one maybe back it up first, with a snapshot for example, and then restore the data to the local SSD after restarting the instance?
I have not tried out using local SSDs. I want to know what would happen.
Your data may or may not survive a machine restart, depending on how lucky or unlucky you are. Moreover, if your VM crashes (e.g. if the underlying hardware fails) you may also lose the contents of the Local SSD at a random time.
I don't think Local SSD implements snapshots or any sort of data redundancy functionality. You can, however, implement your own: for example, partition your SSD using LVM, take LVM snapshots once in a while, and upload them to e.g. GCS or store them somewhere else.
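The LVM approach above could look roughly like this (a sketch; the volume group, logical volume, and bucket names are all placeholders, and it assumes the Local SSD was set up as an LVM physical volume):

```shell
# Sketch: snapshot an LVM volume on the Local SSD and stream it to GCS.
# "vg0", "data", and the bucket name are placeholders.
lvcreate --size 10G --snapshot --name data-snap /dev/vg0/data
dd if=/dev/vg0/data-snap bs=4M | gzip | \
  gsutil cp - gs://my-backup-bucket/data-snap.img.gz   # streaming upload
lvremove -y /dev/vg0/data-snap                         # drop the snapshot
```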
In my experience, rebooting is typically fine, while shutting down will always result in data purge.
The easiest way I've found to backup and restore is to copy to/from a persistent drive or Google Cloud Storage. gsutil rsync works well for this. I don't believe snapshots work with local SSDs.
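As a sketch of that workflow (mount point and bucket name are placeholders):

```shell
# Back up the Local SSD mount point to GCS before stopping the VM,
# then restore it after restarting with the new machine configuration.
gsutil -m rsync -r /mnt/disks/local-ssd gs://my-backup-bucket/local-ssd
# ... stop the instance, change the machine type / add GPUs, start it ...
gsutil -m rsync -r gs://my-backup-bucket/local-ssd /mnt/disks/local-ssd
```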
From the Google docs:
https://cloud.google.com/compute/docs/disks/local-ssd
Data on local SSDs persist only through the following events:
If you reboot the guest operating system.
If you configure your instance for live migration and the instance goes through a host maintenance event.
If the host system experiences a host error, Compute Engine makes a best effort to reconnect to the VM and preserve the local SSD data, but might not succeed. If the attempt is successful, the VM restarts automatically. However, if the attempt to reconnect fails, the VM restarts without the data. While Compute Engine is recovering your VM and local SSD, which can take up to 60 minutes, the host system and the underlying drive are unresponsive. To configure how your VM instances behave in the event of a host error, see Setting instance availability policies.
Data on Local SSDs does not persist through the following events:
If you shut down the guest operating system and force the instance to stop.
If you configure the instance to be preemptible and the instance goes through the preemption process.
If you configure the instance to stop on host maintenance events and the instance goes through a host maintenance event.
If the host system experiences a host error, and the underlying drive does not recover within 60 minutes, Compute Engine does not attempt to preserve the data on your local SSD. While Compute Engine is recovering your VM and local SSD, which can take up to 60 minutes, the host system and the underlying drive are unresponsive.
If you misconfigure the local SSD so that it becomes unreachable.
If you disable project billing. The instance will stop and your data will be lost.
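Given the list above, it may be worth checking whether your VM is configured to live-migrate on host maintenance (the only stop-like event that preserves Local SSD data). A sketch, with placeholder instance and zone names:

```shell
# Check the maintenance behaviour of a VM (names are placeholders).
gcloud compute instances describe my-instance --zone us-central1-a \
  --format="value(scheduling.onHostMaintenance)"
# "MIGRATE"   -> Local SSD data survives host maintenance events
# "TERMINATE" -> it does not
```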

GCP: Creating a snapshot of a VM including runtime processes

From what I could find, Google Cloud will only allow me to create a snapshot of a machine disk.
Is it possible in some way to also capture its runtime, i.e. RAM and process states?
Unfortunately, snapshots are limited to the persistent disk and not runtime processes and RAM. I would also like to mention that it is not possible to have a snapshot of RAM as this is volatile memory.

Stop a VM instance in GCP with local SSD attached to it

According to the documentation - https://cloud.google.com/compute/docs/disks/#localssds - it is not possible to stop an instance that has a local SSD. It merely acts as cache memory.
Is it possible to get a local SSD for an instance which is persistent and allows us to stop the instance?
Also, are local SSDs and persistent SSDs detected as different hardware types by an instance?
At the moment there's no way to set up a GCE instance with a local SSD and be able to stop it; as mentioned in the documentation, this kind of storage is meant for caches and processing space.
Now, about the hardware differences between a local SSD and a persistent SSD: from the point of view of the GCE instance they look the same, that is, the instance detects both of them as just another mount; however, the technology behind each of them is completely different.
A local SSD, just as the documentation states, is physically attached to the server that hosts the instance, while a persistent SSD is a log-structured volume; in other words, it's not a physical hard drive.
There's a complete explanation about how persistent disks works on Google Cloud at [1].
[1] https://www.youtube.com/watch?v=jA_A-OXsIss
WARNING: stopping your VM will delete all data from the local disk.
You can stop the VM from SSH with the commands
sudo shutdown -h now or sudo poweroff
This is the correct way to stop the VM if it has a local ssd attached.
https://cloud.google.com/compute/docs/instances/stop-start-instance

GCE: persistent boot disk

Simple question for GCE users: are persistent boot disks safe to use, or could data loss occur?
I've seen that I can attach additional persistent disks, but what about the standard boot disks (that should be persistent as well) ?
What happens during maintenance, equipment failures and so on ? Are these boot disks stored on hardware with built-in redundancy (raid and so on) ?
In other words, is a compute instance with a persistent boot disk similar to a non-cloud VM stored on local RAID (from a data-loss point of view)?
Usually cloud instances are volatile: a crash, shutdown, maintenance and so on will destroy all stored data.
Obviously, I'll have backups.
GCE Persistent Disks are designed to be durable and highly-available:
Persistent disks are durable network storage devices that your instances can access like physical disks in a desktop or a server. The data on each persistent disk is distributed across several physical disks. Compute Engine manages the physical disks and the data distribution to ensure redundancy and optimize performance for you.
(emphasis my own, source: Google documentation)
You have a choice of zonal or regional (currently in public beta) persistent disks, on an HDD or SSD-based platform. For boot disks, only zonal disks are supported as of the time of this writing.
As the name suggests, zonal disks are only guaranteed to persist their data within a single zone; outage or failure of that zone may render the data unavailable. Writes to regional disks are replicated to two zones in a region to safeguard against the outage of any one zone. The Google Compute Engine console, "Disks" section will show you that boot disks for your instances are zonal persistent disks.
Irrespective of the durability, it is obviously wise to keep your own backups of your persistent disks in another form of storage to safeguard other mechanisms for data loss, such as corruption in your application or user error by an operator. Snapshots of persistent disks are replicated to other regions; however, be aware of their lifecycle in the event the parent disk is deleted.
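Creating such a snapshot is a one-liner; a sketch, with placeholder disk, zone, and snapshot names:

```shell
# Snapshot a zonal persistent disk (all names are placeholders).
gcloud compute disks snapshot my-boot-disk \
  --zone us-central1-a \
  --snapshot-names my-boot-disk-backup
```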
In addition to reviewing the comprehensive page linked above, I recommend reviewing the relevant SLA documentation to ascertain the precise guarantees and service levels offered to you.
Usually cloud instances are volatile: a crash, shutdown, maintenance and so on will destroy all stored data.
The cloud model does indeed prefer instances which are stateless and can be replaced at will. This offers many scalability and robustness advantages, which can be achieved using managed instance groups, for example. However, you can use VMs for persistent storage if desired.
Normally the data on the boot disk should be fine across restarts and other maintenance operations, but by default it is deleted along with the instance.
If you use managed instance groups, preemptible instances, and so on, and you want persistent data, you should use another storage system. If you just use a plain instance as-is, it should be safe enough with backups.
I still think an additional persistent disk or another storage system is a better way to do things, but it's only my opinion.