Slow file uploading to Google Cloud Storage - google-cloud-platform

I'm noticing quite slow upload speeds to Google Cloud Storage, almost 2.5x slower than uploading the same file to Google Drive. Here is a screencast comparing the two for the upload of a 1 GB file:
https://gyazo.com/c3488bd56b8118043b7df5aab813db01
This is just an example, but I've also tried the gsutil command-line tool, using all the suggestions they give for uploading large files as fast as possible (such as setting parallel_composite_upload_threshold). It is still slower than I'm accustomed to. Much slower.
Is there any way to improve this upload speed? Why is uploading to Drive so much faster than doing the same to GCS?

It took me a day to upload around 30,000 images (~100 KB each) using console.google.cloud.com in the browser, the same way you did. Then I tried gsutil to upload the files from a terminal in Ubuntu, following the instructions here: https://cloud.google.com/storage/docs/uploading-objects
For a single file:
gsutil cp [LOCAL_OBJECT_LOCATION] gs://[DESTINATION_BUCKET_NAME]/
For a directory:
gsutil -m cp -R [DIR_NAME] gs://[DESTINATION_BUCKET_NAME]
Using gsutil was incredibly fast: I uploaded 100,000 images (100–400 KB each) and it took less than 30 minutes.
Honestly, I haven't done much research into why gsutil is so much faster than the console. It's probably because gsutil provides the -m option, which performs a parallel (multi-threaded/multi-processing) copy and can significantly increase upload performance. https://cloud.google.com/storage/docs/composite-objects
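For comparison, a minimal Python sketch of the same idea (many small uploads run in parallel with the google-cloud-storage client and a thread pool); the bucket name and the images/ source directory are placeholders:

# Rough equivalent of gsutil -m cp -R: upload many small files concurrently.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")      # placeholder bucket name
source = Path("images")                  # placeholder local directory

def upload_one(path: Path) -> None:
    # Keep the object name identical to the path relative to the source directory.
    bucket.blob(str(path.relative_to(source))).upload_from_filename(str(path))

files = [p for p in source.rglob("*") if p.is_file()]
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(upload_one, files))    # list() forces completion and surfaces errors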

Well, first of all, these two products serve different purposes. While Drive can be seen more as small-scale file storage in the cloud, Cloud Storage is focused on integration with Google Cloud Platform products, data reliability, accessibility, and availability from small to very large scale.
You need to take into account that when you upload a file to Cloud Storage it is treated as a blob object, which means it goes through some extra steps: for example, the object data is encrypted on upload, and uploaded objects are checked for consistency.
Also, depending on the configuration of your bucket, uploaded objects might have versioning enabled, and the bucket might store the data in several regions at the same time, which can slow down uploads.
I believe some of these points, especially encryption, are what make file uploads slower to Cloud Storage than to Drive.
I would recommend, however, that you have the bucket you are uploading to in the region closest to you, which could make a difference.
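If it helps, a tiny sketch with the google-cloud-storage Python client (bucket names and the region are placeholders) showing how to check where an existing bucket lives and how to create one in a nearby region:

# Check a bucket's location and create a new bucket in a specific region.
from google.cloud import storage

client = storage.Client()
print(client.get_bucket("my-existing-bucket").location)         # e.g. "US" or "EUROPE-WEST1"
client.create_bucket("my-new-bucket", location="europe-west1")  # pick the region closest to you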

Related

Update Hard Drive backup on AWS S3

I would like to run an aws s3 sync command daily to update my hard drive backup on S3. Most of the time there will be no changes. The problem is that the s3 sync command takes days just to check for changes (for a 4 TB HDD). What is the quickest way to update a hard drive backup on S3?
If you want to back up your own computer to Amazon S3, I would recommend using a backup utility that knows how to use S3. These utilities can do smart things like compress data, track files that have changed, and set an appropriate Storage Class.
For example, I use Cloudberry Backup on a Windows computer. It regularly checks for new/changed files and uploads them to S3. If I delete a file locally, it waits 90 days before deleting it from S3. It can also keep multiple versions of files, rather than always overwriting them.
I would recommend only backing up data folders (e.g. My Documents). There is no benefit to backing up your operating system or temporary files, because you would not restore the OS from a remote backup.
While some backup utilities can compress files individually or in groups, experience has taught me never to do so, since it can make restoration difficult if you do not have the original backup software (and remember -- backups last years!). The great thing about S3 is that it is easy to access from many devices -- I have often grabbed documents from my S3 backup via my phone when I'm away from home.
Bottom line: Use a backup utility that knows how to do backups well. Make sure it knows how to use S3.
I would recommend using a backup tool that can synchronize with Amazon S3. For example, for Windows you can use Uranium Backup. It syncs with several clouds, including Amazon S3.
It can be scheduled to perform daily backups and also incremental backups (in case there are changes).
I think this is the best way, considering the tediousness of daily manual syncing. Plus, it runs in the background and notifies you of any error or success logs.
This is the solution I use, I hope it can help you.

How to copy large files from Google Drive to Google Cloud Storage using Colab

I would like to copy a huge file from Google Drive to a Google Cloud Storage bucket (file size ~80 GB).
There is a very nice Google Colab from Philip Lies (found here) that does the job for small files, but it is a problem for huge files, since it seems to create a local cache before copying the file itself to the bucket, and Colab storage is limited.
The copy itself seems to be quick (since everything is in the cloud and moving between Google's services), but because Colab has limited storage, it hits the storage limit before the copy completes. I would not like to use a local Colab runtime, because it would need to download the file to my computer and then upload it to the Google Storage bucket, which would take too long.
As far as I understand, we have two approaches to do the copy of huge files from Google Drive to Google Cloud Storage:
A. Copy the file in chunks
Copy the file in chunks (let's say 150 MB each) from Google Drive to Google Colab, then upload each chunk to the Google bucket.
But I couldn't find how to do this with these storage services. shutil seems able to do the job, but I could not make it work with a Google Storage bucket.
B. Stream the file from Google Drive directly to the GCS bucket (ideally)
If there is a way to "stream" the file from one store to the other, it would be ideal, but I'm not sure whether this approach is possible.
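Not a definitive answer, but one possible sketch of approach B inside Colab, assuming the google-cloud-storage client and placeholder project, bucket, and file names (I have not benchmarked this on an 80 GB file): mount Drive and let the client stream the file to the bucket in fixed-size chunks via a resumable upload, so no full local copy should be needed.

# Sketch of approach B in a Colab notebook: stream from the Drive mount to GCS.
from google.colab import auth, drive
from google.cloud import storage

auth.authenticate_user()             # grant the notebook access to the bucket
drive.mount("/content/drive")        # exposes Drive as a local filesystem

client = storage.Client(project="my-project")              # placeholder project
blob = client.bucket("my-bucket").blob("huge-file.bin")     # placeholder names
blob.chunk_size = 256 * 1024 * 1024  # 256 MB per resumable-upload chunk

# upload_from_filename reads the file in chunk_size pieces through the Drive
# mount instead of copying all ~80 GB onto Colab's local disk first.
blob.upload_from_filename("/content/drive/MyDrive/huge-file.bin")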

What's the quickest way to upload a large CSV file (8 GB) from a local computer to Google Cloud Storage/a BigQuery table?

I have an 8 GB CSV file with 104 million rows sitting on my local hard drive. I need to upload this either directly to BigQuery as a table, or to Google Cloud Storage and then point BigQuery at it. What's the quickest way to accomplish this? I've tried the web console upload and the Google Cloud SDK, and both are quite slow (moving at about 1% progress every few minutes).
Thanks in advance!
All three existing answers are right, but if you have low bandwidth, none of them will help you; you will be physically limited.
My recommendation is to gzip your file before sending it. Text files have a high compression ratio (up to 100x), and you can ingest gzipped files directly into BigQuery without unzipping them.
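As a sketch of that step (file names are placeholders), compressing the CSV with Python's standard library before uploading; a .csv.gz in Cloud Storage can then be loaded by BigQuery without unzipping it first:

# Gzip the CSV locally so far less data has to cross your uplink.
import gzip
import shutil

with open("data.csv", "rb") as src, gzip.open("data.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)   # streams in blocks, never holds the 8 GB in memory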
Using the gsutil tool is going to be much faster and more fault-tolerant than the web console (which will probably time out before finishing anyway). You can find detailed instructions here (https://cloud.google.com/storage/docs/uploading-objects#gsutil), but essentially, once you have the gcloud tools installed on your computer, you'll run:
gsutil cp [OBJECT_LOCATION] gs://[DESTINATION_BUCKET_NAME]/
From there, you can load the file into BigQuery (https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv), which will all happen on Google's network.
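If you prefer the Python client to the bq tool for that load step, a minimal sketch (project, dataset, table, bucket, and object names are placeholders):

# Load a CSV that already sits in Cloud Storage into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,          # infer the schema from the file
    skip_leading_rows=1,      # assuming the CSV has a header row
)
job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",              # gzipped files (data.csv.gz) work too
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
job.result()   # wait for the load job; it runs entirely on Google's side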
The bottleneck you're going to face is your internet upload speed during the initial upload. What we've done in the past to bypass this is spin up a compute instance, run whatever process generated the file there, and have it write its output onto that instance. Then we use the built-in gsutil tool to upload the file to Cloud Storage. This has the benefit of running entirely on Google's network and will be pretty quick.
I would recommend you take a look at this article, where there are several points to take into consideration.
Basically, the best option is to upload your object using gsutil's parallel composite upload feature; in the article you can find this command:
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp ./localbigfile gs://your-bucket
There you will also find several tips to improve your upload, like tuning the chunk size of the objects being uploaded.
Once uploaded, I'd go with the option that dweling has provided for the BigQuery part by looking further at this document.
Have you considered using the BigQuery command-line tool, as per the example provided below?
bq load --autodetect --source-format=CSV PROJECT_ID:DATASET.TABLE ./path/to/local/file/data.csv
The above command will load the contents of the local CSV file data.csv directly into the specified table, with the schema detected automatically. Alternatively, details on how you can customise the load job to your requirements by passing additional flags can be found here: https://cloud.google.com/bigquery/docs/loading-data-local#bq

Download millions of files from an S3 bucket

I have millions of files in different folders in an S3 bucket. The files are very small. I wish to download all the files that are under the folder named VER1. The folder VER1 contains many subfolders, and I wish to download all the millions of files under all the subfolders of VER1 (e.g. VER1 -> sub1 -> file1.txt, VER1 -> sub1 -> subsub1 -> file2.txt, etc.).
What is the fastest way to download all the files?
Using s3 cp? s3 sync?
Is there a way to download all the files located under the folder in parallel?
Use the AWS Command-Line Interface (CLI):
aws s3 sync s3://bucket/VER1 [name-of-local-directory]
From my experience, it will download in parallel but it won't necessarily use the full bandwidth because there is a lot of overhead for each object. (It is more efficient for large objects, since there is less overhead.)
It is possible that aws s3 sync might have problems with a large number of files. You'd have to try it to see whether it works.
If you really wanted full performance, you could write your own code that downloads in massive parallel, but the time saving would probably be lost in the time it takes you to write and test such a program.
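For what it's worth, a minimal sketch of such a program using boto3 and a thread pool (bucket name, prefix, and worker count are placeholders you would want to tune):

# List every object under VER1/ and download them concurrently.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX = "my-bucket", "VER1/"    # placeholders

def download(key: str) -> None:
    target = Path(key)                                # mirror the S3 key layout locally
    target.parent.mkdir(parents=True, exist_ok=True)
    s3.download_file(BUCKET, key, str(target))

keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []) if not obj["Key"].endswith("/"))

with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(download, keys))       # list() forces completion and surfaces errors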
Another option is to use aws s3 sync to download to an Amazon EC2 instance, then zip the files and simply download the zip file. That would reduce bandwidth requirements.

Speeding up a transfer of a file to Google Cloud VM

I am uploading a file to my Google Cloud Platform VM using scp on Linux. However, after initially uploading at a speed of 900 kb/s, it quickly falls to 20 kb/s. My internet upload speed should be around 20 Mbps. I wanted to upload an SQLite database clocking in at 20 GB, but this is unfeasible at this rate.
Right now it has taken me 54 minutes to upload a 94 MB file. Surely it cannot be that slow?
I had the same issue multiple times with GCP. The solution I use is to compress all my files, upload the archive to Dropbox, and then wget the file from there on the VM. The speeds should go back to normal.
This answer should help you as well, though I don't know whether your particular issue is related to GCP, scp, or both:
https://askubuntu.com/questions/760509/how-to-make-scp-go-faster