I have a VM with a 60 GB disk, of which 45 GB was used. It has multiple users.
I have deleted the old users and now about 40 GB is used. I want to download all the data to my machine. How do I do that? The only way I know is scp; the command is something like this: gcloud compute scp --recurse $USER@$INSTANCE:/home/$USER ~/$LOCAL_DIR but it's very slow and might even take a couple of days. Is there a better and easier way to download the data?
The bottleneck here seems to be that you're trying to download a considerable amount of data over SSH which is quite slow.
One thing you can try to speed up the process is to break the download into two parts:
Upload the content of your VM to a Cloud Storage bucket
Download the content from Cloud Storage to your machine
So in that case, from the VM you'll execute:
gsutil cp -R /home/$USER gs://BUCKET_NAME
And then from your local machine:
gsutil cp -R gs://BUCKET_NAME ~/$LOCAL_DIR
gsutil uses parallel composite uploads to speed up uploading large files, in case you have any. And in the first step you'll be doing GCP <-> GCP communication, which will be faster than downloading over SSH directly.
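As a rough sketch (the -m flag and the 150M threshold here are my own illustrative choices, not part of the original answer), you can push the parallelism further when copying from the VM:
# Run on the VM: parallel copy with an explicit composite upload threshold.
gsutil -m -o "GSUtil:parallel_composite_upload_threshold=150M" cp -r /home/$USER gs://BUCKET_NAME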
Related
I have a dumb question.
So I have terabytes of data to rsync between two GCP buckets.
I'm not too sure how gsutil rsync works behind the scenes.
Does it have to download the files locally before uploading them to the destination, or does it just magically move things over from the source bucket to the destination?
The answer to your question is in the gsutil rsync documentation:
Note 2: If you are synchronizing a large amount of data between clouds you might consider setting up a Google Compute Engine account and running gsutil there. Since cross-provider gsutil data transfers flow through the machine where gsutil is running, doing this can make your transfer run significantly faster than running gsutil on your local workstation.
So yes, it downloads the content locally first, then uploads it to the destination.
I performed a test with rsync and the debug flag, and I noticed this behaviour:
When you move an object (using cp or rsync) between buckets, it is not downloaded to your local machine. I used a file of ~4 GB and glances to measure the network usage during the rsync operation; the objects were moved directly to the target bucket.
If you run the following command, you are going to notice that the SDK performs a POST request indicating the move between buckets:
gsutil -d rsync gs://sourcebucket gs://targetbucket
https://storage.googleapis.com/storage/v1/b/sourcebucket/o/bigfile.iso/rewriteTo/b/targetbucket/o/bigfile.iso
The rewriteTo behaviour is documented here.
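As an illustrative sketch (the bucket and object names are the placeholders from the request above, and using gcloud auth print-access-token for the bearer token is just one option), you can issue the same rewrite call directly against the JSON API:
# Server-side copy of an object between buckets via the rewrite API.
# For very large objects the response may include a rewriteToken that has to
# be passed back in follow-up calls until the rewrite is reported as done.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://storage.googleapis.com/storage/v1/b/sourcebucket/o/bigfile.iso/rewriteTo/b/targetbucket/o/bigfile.iso"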
We have terabytes of data in Google Cloud Storage that we want to migrate to AWS S3. What are the best ways to do it? Is there any 3rd-party tool that would be better than a direct transfer?
There are multiple options available, even without using any transfer appliance, for a cloud-to-cloud migration in relatively little time.
Use gsutil to copy data from a Google Cloud Storage bucket to an Amazon S3 bucket, using a command such as:
gsutil -m rsync -r gs://your-gcp-bucket s3://your-aws-s3-bucket
More details are available at https://cloud.google.com/storage/docs/gsutil/commands/rsync
Note: if you run into speed limitations with the default Cloud Shell, you can create a larger Compute Engine instance and execute the above command from there.
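One prerequisite worth noting (a sketch, with placeholder key values): gsutil needs AWS credentials in its boto configuration before it can write to an s3:// URL, for example:
# Add AWS credentials to gsutil's boto config so s3:// URLs work.
cat >> ~/.boto <<'EOF'
[Credentials]
aws_access_key_id = YOUR_AWS_ACCESS_KEY_ID
aws_secret_access_key = YOUR_AWS_SECRET_ACCESS_KEY
EOF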
According to the docs:
Cloud Shell provisions 5 GB of free persistent disk storage mounted as your $HOME directory on the virtual machine instance.
I would need more (paid) storage though that I can access from the Cloud Shell environment and that is persistent across my sessions. It's mostly used to store local clones of git repositories and images. I would be the only one to access these files.
It seems that the 5 GB storage is a hard limit, so it won't expand dynamically and bill me for the excess. It is possible to use Boost Mode, but that does not affect the storage size. And I also can't provision more storage with a custom Cloud Shell environment. I couldn't figure out whether I can mount another GCE persistent disk as my $HOME. I was considering gcsfuse as suggested in this answer, but I'm not sure if it is suitable for git repos.
Is there any way to have more storage available in Cloud Shell?
Google Cloud Shell is a container that runs on a hidden Compute Engine instance managed by Google. You can download, modify, and redeploy this container image to Cloud Shell, or run it yourself in the cloud or on your desktop.
The base image of the container is available at gcr.io/cloudshell-images/cloudshell:latest, per this page.
For your use case, I would use Compute Engine with Container-Optimized OS (COS) and run the Cloud Shell container within COS. You can scale the CPUs, memory, and storage to fit your requirements.
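A minimal sketch of that approach (the instance name, machine type, and disk size are arbitrary examples, not values from the answer):
# Create a Container-Optimized OS VM that runs the Cloud Shell image.
gcloud compute instances create-with-container my-cloudshell-vm \
  --container-image=gcr.io/cloudshell-images/cloudshell:latest \
  --machine-type=e2-standard-4 \
  --boot-disk-size=200GB
# Connect to it over SSH once it is up.
gcloud compute ssh my-cloudshell-vm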
You can also set up a Compute Engine instance, install the CLIs, SDKs, and tools and have a more powerful system.
Notes for future readers based upon the first answer:
Filestore is a great product, but pay attention to costs. The minimum deployment is 1 TB at $200+ per month. You will need to mount the NFS share each time Cloud Shell restarts - this can be put into login scripts. Note: I am not sure if you can actually mount an NFS share from Filestore in Cloud Shell. I have never tested this.
You will have the same remount problem with FUSE, plus you will have bandwidth costs to access Cloud Storage.
Cloud Shell is a great product that is well implemented, but when you need to exceed its capabilities it is better to deploy a small/medium-size GCE instance. This gives you persistent disks, snapshots, etc.
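If you take that route, a minimal sketch (the instance name, machine type, disk size, and image are illustrative choices only):
# Create a small persistent development VM with a 50 GB boot disk.
gcloud compute instances create dev-workstation \
  --machine-type=e2-medium \
  --boot-disk-size=50GB \
  --image-family=debian-12 \
  --image-project=debian-cloud
# Take a snapshot of the boot disk whenever you want a restore point.
gcloud compute disks snapshot dev-workstation --snapshot-names=dev-workstation-backup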
There is another way to get more disk space in Cloud Shell: create a Cloud Storage bucket and mount it as a folder. This way you can store larger files in the bucket and it doesn't require any compute instance.
Go to cloud storage and create a new storage bucket
Copy the storage bucket's name, eg. my_storage_bucket
Go to cloud shell and create a folder in your home folder
mkdir ~/my_bucket_folder
Mount the storage bucket to this folder
gcsfuse my_storage_bucket ~/my_bucket_folder
Change directory to your my_bucket_folder
cd ~/my_bucket_folder
Voila! You now have practically unlimited space!
To unmount please run the following
fusermount -u ~/my_bucket_folder
I'm using gcsfuse and it works fine. You don't have to remount every time if you put the mount command in .customize_environment (run on boot).
#!/bin/sh
# .customize_environment runs in the background as root; wait for your user session to initialize
sleep 20
sudo -u [USER] gcsfuse -o nonempty -file-mode=777 -dir-mode=777 --uid=1000 --debug_gcs [BUCKET_NAME] /home/[USER]/[FOLDER_NAME]
You can read more at Unlimited persistent disk in google cloud shell
There is no way of adding more storage to the Cloud Shell. You can create a VM and install the Cloud SDK and have as much storage as you'd like but it is not currently possible to add storage space to the Cloud Shell.
Depending on how you plan on using the saved repos, Cloud Storage may be ideal, as it has storage classes well suited to archiving.
Filestore will be your best option as it is great for file systems and it is scalable. It fits your needs as you have described.
You can use Cloud Storage with FUSE. Keep in mind that this method, although great, depends on how it will be used, as costs are based on the storage class.
You can see a brief comparison of the Storage solutions the Cloud Platform has to offer here.
I want to download a file of over 20 GB from the internet directly into a Google Cloud Storage bucket. Just like doing the following on a local command line:
wget http://some.url.com/some/file.tar
I refuse to download the file to my own computer and then copy it to the bucket using:
gsutil cp file.tar gs://the-bucket/
For the moment I am trying (just at this very moment) to use Datalab to download the file and then copy it from there to the bucket.
A capability of the Google Cloud Platform as it relates to Google Cloud Storage is the functional area known as the Storage Transfer Service. The documentation for it is available here.
At the highest level, this capability allows you to define a source of data that is external to Google, such as data available at a URL or in AWS S3 storage, and then schedule it to be copied to Google Cloud Storage in the background. This seems to perform exactly the task you want: the data is copied from an Internet source to GCS directly.
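For the URL case, the Storage Transfer Service reads a publicly accessible TSV manifest (a "URL list") describing what to fetch. As a rough sketch only (the URL is the one from the question; check the current documentation for which extra columns, such as object size in bytes and base64 MD5, are required), the manifest looks something like:
TsvHttpData-1.0
http://some.url.com/some/file.tar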
A completely different approach comes from realizing that GCP itself provides compute capabilities. This means you can run your own logic on GCP through simple mechanisms such as a VM, Cloud Functions, or Cloud Run. That helps here because you can execute code within GCP that downloads the Internet-based data to a local temp file and then uploads that file to GCS. The data that ends up in GCS never goes anywhere other than from the source to Google, and once retrieved from the source, the transfer from GCP compute to GCS storage should be optimal, since it passes exclusively over Google's internal high-speed network.
You can run curl http://some.url.com/some/file.tar | gsutil cp - gs://YOUR_BUCKET_NAME/file from inside Cloud Shell on GCP. That way it never uses your own network and stays entirely within GCP.
For large files, one-liners will very often fail, as will the Google Storage Transfer Service. Part two of Kolban's answer is then needed, and I thought I'd add a little more detail, as it can take time to figure out the easiest way of actually downloading to a Google Compute Engine instance and uploading to a bucket.
For many users, I believe the easiest way will be to open a notebook from the Google AI Platform and do the following:
%pip install wget
import wget
from google.cloud import storage  # No install required

# Download the source file to the notebook VM's local disk.
wget.download('source_url', 'temp_file_name')

# Upload the local file to the target bucket.
client = storage.Client()
bucket = client.get_bucket('target_bucket')
blob = bucket.blob('upload_name')
blob.upload_from_filename('temp_file_name')
No need to set up an environment, you get the convenience of notebooks, and the client will automatically have access to your bucket if the notebook is hosted under the same GCP account.
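If you prefer to skip notebooks entirely, the same download-then-upload pattern works from any Compute Engine VM over SSH; a minimal sketch (URL, file name, and bucket are placeholders):
# Run on a GCE VM: download to local disk, then upload to the bucket.
wget -O /tmp/file.tar http://some.url.com/some/file.tar
gsutil cp /tmp/file.tar gs://YOUR_BUCKET_NAME/
rm /tmp/file.tar  # remove the temporary local copy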
I found a similar post where it is explained that you can download a file from the Web and copy it to your bucket with just one command line:
curl http://some.url.com/some/file.tar | gsutil cp - gs://YOUR_BUCKET_NAME/file.tar
I tried it with my own bucket and it works correctly, so I hope this is what you are expecting.
Can someone let me know how to transfer multiple files from a Linux server to AWS?
If you are wanting to copy the data to Amazon S3, the easiest method is to use the AWS Command-Line Interface (CLI), either:
aws s3 cp --recursive or
aws s3 sync
The sync command automatically recurses sub-directories and is generally a better option because it can be re-run and only copies files that have been modified or added since the previous execution. Thus, it can be used to continue the copy after a failure, or the next day when new files have been added.
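A minimal sketch of each (the local path and bucket name are placeholders):
# One-off recursive copy of a local directory to S3.
aws s3 cp --recursive /path/to/local/data s3://your-aws-s3-bucket/data/
# Incremental sync; safe to re-run, only new or changed files are transferred.
aws s3 sync /path/to/local/data s3://your-aws-s3-bucket/data/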
Did you try using scp or sftp to transfer the files? If your local machine is a Linux one, you can use the console; otherwise use PuTTY on a Windows machine.
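For example, a plain scp to an EC2 instance might look like this (the key file, user, host, and paths are placeholders):
# Recursively copy a local directory to an EC2 instance over SSH.
scp -r -i ~/.ssh/my-key.pem /path/to/files ec2-user@EC2_PUBLIC_IP:/home/ec2-user/files/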