Google Colab upload time to Google Drive

I am using Google Colab to extract data from a zip file I have stored on the Colab instance's disk. I mounted my Google Drive in the Colab workspace and used the following code to unzip the data to the Drive:
!unzip '/content/g2net-gravitational-wave-detection.zip' -d '/content/drive/MyDrive/'
This took about 6 to 7 hours to run, as there is about 80 GB of data to extract.
I have read online that although Colab appears to extract files straight to the mounted Google Drive, it actually writes them to the Colab machine first and only then uploads them to the real Google Drive.
My question is about that upload. I have checked, and all my data has been successfully extracted on the Colab side and is clearly moving across to Google Drive now, but only about 10 GB of it has arrived so far. Is there any way to see how fast this is transferring, or how much has been transferred? I only have a Pro subscription, so I think that if I idle too long or close the session I will lose all of the extracted data.
Any help is appreciated! Thanks!
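As an aside, the Colab drive helper exposes a flush_and_unmount() call that blocks until everything buffered on the Colab VM has been written out to the real Google Drive, so you can at least make sure the transfer has finished before ending the session. A minimal sketch, assuming the standard /content/drive mount path:

from google.colab import drive

# Mount Drive as usual (prints a notice if it is already mounted).
drive.mount('/content/drive')

# ... run the unzip to /content/drive/MyDrive/ here ...

# Block until all writes buffered on the Colab VM have been flushed
# to Google Drive, then unmount cleanly.
drive.flush_and_unmount()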

Related

How to download the last version of all Google Drive files at scale?

All my Google Drive files got encrypted by ransomware. Google did not help me restore the Drive file versions that existed before the encryption date.
The only option I have found that works is to manually select a file in Google Drive and revert to the previous version by deleting the encrypted current version. Google keeps the previous version of a file in Drive for only 30 days.
I am looking for a script that can revert each file to its immediately previous version by deleting the currently encrypted one, at scale. I have 60 GB of data in Google Drive.
I see in the Google developer documentation that the Google Drive API is open to everyone; using the API, file versions can be marked to be kept forever, and a particular version of a file can be downloaded.
I left coding some 7 years back and am struggling to create the script. If anyone has such a script already, it would help. This is just my personal Google Drive account.
I had the same problem last week, and I created an Apps Script that deletes the new file versions and keeps the old version from before the ransomware affected the Drive.
Contact me for the script; for some reason I can't paste it here. You can reach me on Skype (nickname: gozeril) and I'll give it to you.
Notes:
You need to run it on each root folder one by one, changing only the folder name in the code.
Some folders are very big, so you may have to run the script several times.
The script will run for 30 minutes at most.
Be patient, it works!
I hope you'll find it useful :-)
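For anyone who would rather go through the Drive API route mentioned in the question, here is a rough Python sketch of the same idea (this is not the Apps Script referenced above). It assumes OAuth credentials saved in a token.json from a previous authorization flow, that the affected files are binary (non-Google-Docs) files, and that deleting the encrypted head revision leaves the previous clean revision as current; the folder ID and cutoff date are placeholders:

from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# token.json is a placeholder: credentials saved from a prior OAuth flow
# with the https://www.googleapis.com/auth/drive scope.
creds = Credentials.from_authorized_user_file(
    'token.json', scopes=['https://www.googleapis.com/auth/drive'])
service = build('drive', 'v3', credentials=creds)

FOLDER_ID = 'your-folder-id'        # placeholder root folder
CUTOFF = '2024-01-01T00:00:00Z'     # placeholder: when the ransomware hit

files = service.files().list(
    q=f"'{FOLDER_ID}' in parents and trashed = false",
    fields='files(id, name)', pageSize=1000).execute().get('files', [])

for f in files:
    revs = service.revisions().list(
        fileId=f['id'],
        fields='revisions(id, modifiedTime)').execute().get('revisions', [])
    if len(revs) < 2:
        continue  # nothing older to fall back to
    newest = max(revs, key=lambda r: r['modifiedTime'])
    if newest['modifiedTime'] >= CUTOFF:
        # Deleting the encrypted head revision should make the previous
        # (clean) revision the current one again.
        service.revisions().delete(
            fileId=f['id'], revisionId=newest['id']).execute()
        print('reverted', f['name'])

Pagination and subfolder recursion are left out to keep the sketch short.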

How to copy large files from Google Drive to Google Cloud Storage using Colab

I would like to copy a huge file from Google Drive to a Google Cloud Storage bucket (file size ~80 GB).
There is a very nice Google Colab notebook from Philip Lies (found here) that does the job for small files, but it is a problem for huge files, since it seems to create a local cache before copying the file itself to the bucket, and Colab storage is limited.
The copy itself seems to be quick (since everything is already in the cloud and stays between Google services), but because Colab has limited storage it reaches the storage limit before completing the copy. I would not like to run Colab locally, because that would mean downloading the file to my computer and then uploading it to the Google Storage bucket, which would take too long.
As far as I understand, there are two approaches to copying huge files from Google Drive to Google Cloud Storage:
A. Copy file into chunks
Copy the file in chunks (say, 150 MB each) from Google Drive to Google Colab, then upload each chunk to the Google Cloud Storage bucket.
But I didn't find how to do this between these two storage services. shutil seems to do the job, but I could not make it work with a Google Storage bucket.
B. Stream the file from the G Drive directly to G Bucket (ideally)
If there is a way to "stream" the file from one store to the other it would be ideal, but I'm not sure whether this approach is possible.
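Not a full answer, but a sketch of what approach B can look like from a Colab notebook: read from the Drive mount and write to the bucket through a file-like writer, so only a small buffer lives on the Colab disk at any time. It assumes Drive is already mounted at /content/drive, that you are authenticated to GCP (for example via google.colab.auth.authenticate_user()), that a reasonably recent google-cloud-storage package is installed, and that the project, bucket, and file names below are placeholders:

import shutil
from google.cloud import storage

client = storage.Client(project='your-project-id')   # placeholder project
bucket = client.bucket('your-bucket-name')            # placeholder bucket
blob = bucket.blob('backups/huge-file.zip')           # destination object name

src_path = '/content/drive/MyDrive/huge-file.zip'     # placeholder source file

# blob.open('wb') performs a resumable upload behind the scenes; the
# chunk_size (a multiple of 256 KB) controls how much is sent per request.
with open(src_path, 'rb') as src, \
     blob.open('wb', chunk_size=100 * 1024 * 1024) as dst:
    shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)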

What's the quickest way to upload a large CSV file (8 GB) from a local computer to Google Cloud Storage or a BigQuery table?

I have an 8 GB CSV file of 104 million rows sitting on my local hard drive. I need to upload it either directly to BigQuery as a table, or to Google Cloud Storage and then point BigQuery at it. What's the quickest way to accomplish this? I have tried the web console upload and the Google Cloud SDK; both are quite slow (moving at 1% progress every few minutes).
Thanks in advance!
All three existing answers are right, but if you have low bandwidth none of them will help you; you will be physically limited.
My recommendation is to gzip your file before sending it. Text files have a high compression ratio (up to 100x), and you can ingest gzip files directly into BigQuery without unzipping them.
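For example, compressing the CSV before the upload is only a few lines if you want to stay in Python rather than use the gzip command-line tool (file names are placeholders):

import gzip
import shutil

# Compress data.csv into data.csv.gz; CSV text usually shrinks dramatically.
with open('data.csv', 'rb') as src, gzip.open('data.csv.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)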
Using the gsutil tool is going to be much faster and more fault tolerant than the web console (which will probably time out before finishing anyway). You can find detailed instructions here (https://cloud.google.com/storage/docs/uploading-objects#gsutil), but essentially, once you have the gcloud tools installed on your computer, you'll run:
gsutil cp [OBJECT_LOCATION] gs://[DESTINATION_BUCKET_NAME]/
From there, you can upload the file into BigQuery (https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv) which will all happen on Google's network.
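If you prefer to drive that second step from code rather than the console, the BigQuery Python client can run the same load job from the Cloud Storage object; a gzipped CSV works here too. A sketch, with the URI and table ID as placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    'gs://your-bucket/data.csv.gz',            # placeholder GCS URI
    'your-project.your_dataset.your_table',    # placeholder table ID
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
print(client.get_table('your-project.your_dataset.your_table').num_rows)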
The bottleneck you're going to face is your internet upload speed during the initial upload. What we've done in the past to bypass this is spin up a compute box, run whatever process generates the file there, and have it output onto that box. Then we use the built-in gsutil tool to upload the file to Cloud Storage. This has the benefit of running entirely on Google's network and will be pretty quick.
I would recommend you take a look at this article, where there are several points to take into consideration.
Basically, the best option is to upload your object using gsutil's parallel composite upload feature; in the article you can find this command:
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp ./localbigfile gs://your-bucket
There you will also find several tips to improve your upload, like tuning the chunk size of the objects being uploaded.
Once uploaded, I'd go with the option that dweling has provided for the BigQuery part; look further at this document.
Have you considered using the BigQuery command-line tool, as per the example provided below?
bq load --autodetect --source_format=CSV PROJECT_ID:DATASET.TABLE ./path/to/local/file/data.csv
The above command will load the contents of the local CSV file data.csv directly into the specified table, with the schema detected automatically. Details on how to customise the load job to your requirements by passing additional flags can be found here: https://cloud.google.com/bigquery/docs/loading-data-local#bq
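For completeness, the rough equivalent of that bq command through the BigQuery Python client looks like this (project, table ID, and path are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project='PROJECT_ID')  # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
)

# Stream the local CSV into a load job, then wait for it to complete.
with open('./path/to/local/file/data.csv', 'rb') as f:
    job = client.load_table_from_file(
        f, 'PROJECT_ID.DATASET.TABLE', job_config=job_config)
job.result()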

Slow file uploading to Google Cloud Storage

I'm noticing quite low upload speeds to Google Cloud Storage, almost 2.5x slower than uploading a file to Google Drive. Here is a screencast comparing the two for the upload of a 1 GB file:
https://gyazo.com/c3488bd56b8118043b7df5aab813db01
This is just an example, but I've also tried the gsutil command-line tool with all of the suggestions they give for uploading large files as fast as possible (such as using parallel_composite_upload_threshold). It is still slower than I'm accustomed to. Much slower.
Is there any way to improve this upload speed? Why is upload to Drive so much faster than doing the same to GCS?
It took me a day to upload around 30,000 images (about 100 KB per image) using console.google.cloud.com in the browser, the same as you did. Then I tried gsutil to upload the files from a terminal on Ubuntu, following the instructions here: https://cloud.google.com/storage/docs/uploading-objects
For a single file:
gsutil cp [LOCAL_OBJECT_LOCATION] gs://[DESTINATION_BUCKET_NAME]/
For a directory:
gsutil -m cp -R [DIR_NAME] gs://[DESTINATION_BUCKET_NAME]
Using gsutil was incredibly fast: I uploaded 100,000 images (100 to 400 KB per image) and it took less than 30 minutes.
Honestly, I haven't done much research into why gsutil is so much faster than the console. It is probably because gsutil provides the -m option, which performs a parallel (multi-threaded/multi-processing) copy and can significantly increase upload performance. https://cloud.google.com/storage/docs/composite-objects
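If you ever need the same kind of parallelism from Python instead of gsutil -m, a thread pool over the Cloud Storage client gives a similar effect for many small files. A rough sketch; the bucket name and local directory are placeholders:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('your-bucket-name')  # placeholder bucket
local_dir = Path('./images')                # placeholder local directory

def upload(path: Path) -> None:
    # Use the path relative to local_dir as the object name.
    blob = bucket.blob(str(path.relative_to(local_dir)))
    blob.upload_from_filename(str(path))

files = [p for p in local_dir.rglob('*') if p.is_file()]

# Many small uploads are latency-bound, so threads help a lot,
# which is roughly what gsutil -m is doing.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(upload, files))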
Well, first of all, these two products serve different purposes. While Drive can be seen more as small-scale file storage in the cloud, Cloud Storage is focused on integration with Google Cloud Platform products and on data reliability, accessibility, and availability at anything from small to very large scale.
You need to take into account that when you upload a file to Cloud Storage it is treated as a blob object, which means it has to go through some extra steps: for example, the object data is encrypted when uploaded to Cloud Storage, and uploaded objects are checked for consistency.
Also, depending on the configuration of your bucket, uploaded objects might have version control enabled, and the bucket might store the data in several regions at the same time, which can slow down file uploads.
I believe some of these points, especially encryption, are what make file uploads slower to Cloud Storage than to Drive.
I would recommend, however, having the bucket you are uploading to in the region closest to you, which could make a difference.

Speeding up a transfer of a file to Google Cloud VM

I am uploading a file to my Google Cloud Platform VM using scp on Linux. However, after initially uploading at around 900 KB/s, the speed quickly falls to 20 KB/s. My internet upload speed should be around 20 Mbps. I wanted to upload an SQLite database clocking in at 20 GB, but this is unfeasible at this point.
Right now it has taken me 54 minutes to upload a 94 MB file. Surely it cannot be that slow?
I had the same issue multiple times with GCP. The solution I use is to compress all my files, upload the archive to Dropbox, and then wget the file from there onto the VM. The speeds should go back to normal.
This answer should help you as well, though I don't know if your particular issue is related to GCP, scp, or both:
https://askubuntu.com/questions/760509/how-to-make-scp-go-faster