How to view the contents of google cloud platform snapshot? - google-cloud-platform

I have a GCP snapshot and I want to read the data of that snapshot (not restore it). How can I do that? What APIs are available in GCP for that?

It's reasonable to want to browse (Persistent Disk) Snapshots but you cannot.
You must restore a (series of) Snapshots to a PD in order to browse the content.
Although more involved, you could create a PD, restore the Snapshot, do what you need and then delete the PD without incurring considerable cost.
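A minimal sketch of that create/browse/delete workflow, assuming the gcloud CLI is configured (the disk, snapshot, VM, zone and device names below are all placeholders for your own):

```shell
# Create a new disk from the snapshot (does not touch the original disk).
gcloud compute disks create inspect-disk \
    --source-snapshot=my-snapshot \
    --zone=us-central1-a

# Attach it to an existing VM.
gcloud compute instances attach-disk my-vm \
    --disk=inspect-disk \
    --zone=us-central1-a

# On the VM: mount read-only and browse (the device name may differ).
sudo mkdir -p /mnt/inspect
sudo mount -o ro /dev/sdb1 /mnt/inspect
ls /mnt/inspect

# When done: unmount, detach, and delete the disk to stop incurring cost.
sudo umount /mnt/inspect
gcloud compute instances detach-disk my-vm --disk=inspect-disk --zone=us-central1-a
gcloud compute disks delete inspect-disk --zone=us-central1-a --quiet
```

Mounting read-only guarantees the restored copy stays byte-identical to the snapshot while you inspect it.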
I assume your use-case is that you'd like to confirm a specific (set of) file(s) has been Snapshotted?
If so, you may wish to consider a formal backup solution. This would provide you with a rich, queryable source of backed-up files, would permit you to move backed-up replicas to different locations or media, and could possibly provide online backups whereby you needn't stop the applications atop your PDs while backups are taken.

Related

Backup of Datastore/Firestore without gcloud import/export

Hello Google Cloud Platform users!
I am interested in a solution for a regular (let's say daily) backup of Datastore/Firestore databases. Typical use: for some reason (bad "manual" operation, bug, whatever), a series of entities have been wrongly modified or destroyed, or the database is corrupted; in that case, the database version from the previous day will be restored.
I know this has been discussed in previous posts, but mostly through gcloud datastore|firestore import|export with files hosted on Google Cloud Storage. The problem is that for large databases (typically professional applications with many thousands of entities), this approach can take a huge amount of time and resources, even if launched as a batch during the night (and it can only get worse as the database grows).
A solution I have thought about would be to copy to another Datastore/Firestore dataset on each upsert, but that seems like overkill, since the Datastore/Firestore services already guarantee replication anyway. But most of all, it does not address the issue of unwanted writes or deletions of entities if this second database is 100% synced with the original one...
Are there best practices to backup Datastore/Firestore entities for this use case?
Any (brilliant) idea is welcome!
Thanks.
You can have a look at this project: https://github.com/Zenika/alpine-firestore-backup
I'm a contributor on it; don't hesitate to reach out if you have questions or want new features.
At the moment a scheduled backup feature is not available for Datastore/Firestore; there is a Feature Request to implement the functionality:
https://issuetracker.google.com/133662510
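Until that lands, the export approach can at least be scripted as a recurring job (via cron, Cloud Scheduler, etc.). A minimal sketch, assuming a Cloud Storage bucket you own (the bucket name is a placeholder):

```shell
# Export the whole database to a date-stamped path in Cloud Storage.
gcloud firestore export gs://my-backup-bucket/$(date +%Y-%m-%d)

# To restore the previous day's version later:
# gcloud firestore import gs://my-backup-bucket/2024-01-01
```

Note this is still the import/export mechanism the question flags as slow for very large databases; scoping the export with `--collection-ids` can reduce the cost if only some collections need protection.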

Create a copy of Redshift production with limited # records in each table

I have a production Redshift cluster with a significant amount of data on it. I would like to create a 'dummy' copy of the cluster that I can use for ad-hoc development and testing of various data pipelines. The copy would have all the schemas/tables of production, but only a small subset of the records in each table (say, limited to 10,000 rows per table).
What would be a good way to create such a copy, and refresh it on a regular basis (in case production schemas change)? Is there a way to create a snapshot of a cluster with limits on each table?
So far my thinking is to create a new cluster and use some of the admin views as defined here to automatically get the DDL of schemas/tables etc. and write scripts that generate UNLOAD statements (with limits on number of records) for each table. I can then use these to populate my dev cluster. However I feel there must be a cleaner solution.
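That script-generation idea might look roughly like this (table names, bucket, and IAM role are placeholders; note that Redshift's UNLOAD does not allow LIMIT in the outer SELECT, so the cap goes in a nested subquery):

```shell
# Generate one UNLOAD statement per table, capped at 10,000 rows each.
tables="users orders events"
for t in $tables; do
  cat <<EOF
UNLOAD ('SELECT * FROM (SELECT * FROM $t LIMIT 10000)')
TO 's3://my-dev-bucket/$t/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role';

EOF
done
```

The emitted statements can then be run against production, and the resulting S3 files loaded into the dev cluster with matching COPY statements.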
I presume your basic goal is cost-saving. This needs to be balanced against administrative effort (how expensive is your time?).
It might be cheaper to produce a full copy (restored from a snapshot) of the cluster but turn it off at night and on weekends to save money. If you automate the restoration process, you could even schedule it to start before you come into work.
That way, you'll have a complete replica of the production system with effectively zero administration overhead (once you write a couple of scripts to create/delete the cluster), and you can save roughly 75% of the costs (running only 40 out of 168 hours per week). Plus, each time you create a new cluster it contains the latest data from the snapshot, so there is no need to keep them "in sync".
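The savings figure is easy to sanity-check:

```shell
# Cluster runs only a 40-hour work week instead of 24/7 (168 hours).
hours_per_week=168
hours_on=40
savings=$(( (hours_per_week - hours_on) * 100 / hours_per_week ))
echo "Compute savings: ${savings}%"   # prints: Compute savings: 76%
```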
The simplest solutions are often the best.

Cassandra Backup with EBS

Currently I am looking into how backup/restore is done in Cassandra. We've set up a three-node cluster in AWS. I understand that using the nodetool snapshot tool we can take a snapshot, but it's a bit of a cumbersome process.
My idea is:
Make use of EBS snapshots, because they're durable and easy to set up, but one problem I see with EBS is inconsistent backups. Hence, my plan is to run a script prior to taking the EBS snapshot which runs the flush command to flush all the memtable data onto disk (SSTables) and then creates hard links to the flushed SSTables.
Once that's done, initiate the EBS snapshot; this way we can address the inconsistency issue which we might face if we only use EBS snapshots.
Please let me know if you see any issue with this approach or share your suggestions.
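A minimal sketch of the flush-then-snapshot script described above (the volume ID is a placeholder; assumes the AWS CLI is configured on the node):

```shell
set -e

# 1. Flush memtables to disk so the SSTables on the volume are current.
nodetool flush

# 2. Take a named Cassandra snapshot, which hard-links the flushed SSTables.
nodetool snapshot -t pre-ebs-$(date +%Y%m%d%H%M)

# 3. Trigger the block-level EBS snapshot of the data volume.
aws ec2 create-snapshot \
    --volume-id vol-0123456789abcdef0 \
    --description "cassandra-$(hostname)-$(date +%F)"
```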
Being immutable, SSTables do help a lot when it comes to backups, indeed.
Your idea sounds OK for situations where everything is healthy on your cluster. Cassandra is consistency-configurable (if I say eventually consistent, some people may be offended here, hehe), and as the system itself may not be fully consistent at a given time, you cannot say your backup will be either. On the other hand, one of the beauties of Cassandra (and NoSQL models) is that it tends to recover pretty well, which is true for Cassandra in most situations (quite the opposite of relational databases, which are very sensitive to data loss). It's very unlikely you'll end up with a bunch of useless data if you have at least fully preserved SSTable files.
Be aware that EBS snapshots are block-level. So, when you have a filesystem on top, that may be a concern as well. Fortunately, any modern filesystem has journaling nowadays and is pretty reliable, so that shouldn't be a problem, but keeping your data in a separate partition is good practice, so the chances of someone else writing to it right after a full flush are smaller.
You may have some lost replicas when you eventually need to restore your cluster, requiring you to run nodetool repair, which, if you have done it before, you know is a bit painful and takes a very long time for large amounts of data. (But repair is recommended to be run regularly anyway, especially if you delete a lot.)
Another thing to consider is hinted handoffs (writes whose row owners are missing, but which are kept by other nodes until the owners come back). I don't know what happens to them when you flush, but I guess they're kept in memory and in the commit logs only.
And, of course, do a full test restore before you assume this will work in the future.
I don't have a lot of experience with Cassandra, but what I have heard about backup solutions for it involves whole-cluster replicas in another region or datacenter, instead of cold backups like snapshots. That's probably more expensive, but also more reliable than the raw disk snapshots you are trying to do.
I am not sure how a backup of a single node will help, because in C* the data is already replicated across nodes.
If a node is dead and has to be replaced, the new node will learn which data it needs to own and stream it from the other nodes, so you might not need to restore from a disk backup.
Would a replication scenario like the following help?
Use two data centers (DC A with 3 nodes, DC B with one node) with replication factors of A:2 and B:1. Allow clients to interact only with nodes in DC A, with a read/write consistency of LOCAL_QUORUM. Since the quorum there is 2, all reads and writes will succeed, and the data will be replicated to DC B. Now you can back up DC B.
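The keyspace replication for that layout could be set roughly like this (the keyspace name is a placeholder, and the data center names must match what your snitch reports, e.g. in "nodetool status"):

```shell
# Two-DC topology: 2 replicas in DC "A", 1 replica in backup DC "B".
cqlsh -e "ALTER KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'A': 2, 'B': 1};"
```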

Cloud hosting - shared storage with direct access

We have an application deployed across AWS with using EC2, EBS services.
The infrastructure is split into layers (independent instances):
application (with load balancer)
database (master-slave standard schema)
media server (streaming)
background processing (redis, delayed_job)
The application and database instances use a number of EBS block storage devices (root, data), which let us attach/detach them and take EBS snapshots to S3. That's pretty much the default way AWS works.
But an EBS volume lives in a specific availability zone and can be attached to only one instance at a time.
The media server is one of our bottlenecks, so we'd like to scale it with a master/slave schema. For the media server storage we'd like to try a distributed file system that can be attached to multiple servers. What do you advise?
If you're not Facebook or Amazon, then you have no real reason to use something as elaborate as Hadoop or Cassandra. When you reach that level of growth, you'll be able to afford engineers who can choose/design the perfect solution to your problems.
In the meantime, I would strongly recommend GlusterFS for distributed storage. It's extremely easy to install, configure and get up and running. Also, if you're currently streaming files from local storage, you'll appreciate that GlusterFS also acts as local storage while remaining accessible by multiple servers. In other words, no changes to your application are required.
I can't tell you the exact configuration options for your specific application, but there are many available such as distributed, replicated, striped data. You can also play with cache settings to avoid hitting disks on every request, etc.
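For example, a minimal two-node replicated volume might be set up like this (hostnames and brick paths are placeholders; with "replica 2", every file is stored on both servers):

```shell
# On server1: join the peers and create a replicated volume.
gluster peer probe server2
gluster volume create media replica 2 \
    server1:/data/brick1/media server2:/data/brick1/media
gluster volume start media

# Any client (including the servers themselves) can then mount it:
mount -t glusterfs server1:/media /mnt/media
```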
One thing to note: since GlusterFS is a layer above the other storage layers (particularly with Amazon), you might not get impressive disk performance. Actually, it might be much worse than what you have now, for the sake of scalability... basically, you could be better off designing your application to serve streaming media from a CDN that already has the correct infrastructure for your type of application. It's something to think about.
HBase/Hadoop
Cassandra
MogileFS
A good, similar question (if I understand correctly):
Lustre, Gluster or MogileFS?? for video storage, encoding and streaming
There are many distributed file systems, just find the one you need.
The above are just the ones I personally know of (I haven't tested them).

Copying/marking state in VMware VM: alternate ways to snapshot

A snapshot is a way to save a VM's state so that the VM can be reverted back to the point in time when the snapshot was taken.
Are there any other ways of doing this? For example, create incremental copies of the VM files and restore those copies as needed. The copies could contain only incremental data. Are there any such alternatives to snapshots? One of my other constraints is to use only VMware tools/technologies.
Thanks,
Vivek.
Snapshots are one of the best tools you have for maintaining virtual machine state.
A snapshot locks the current disk and creates a new delta disk which stores the incremental data.
So when you revert to the snapshot, the same state is restored.
VCB is another way to take backups; it internally uses snapshots to take the backup.
So, AFAIK, taking snapshots is the only available way to maintain the state of a VM.