Uploading a zip to GCP JupyterLab is super slow - google-cloud-platform

I was using a JupyterLab notebook instance on AI Platform in GCP. You can access this by 1) opening the GCP console, 2) searching for "notebook instances" and choosing the entry with the AI Platform subtitle, and 3) creating one.
When I upload a zip file to JupyterLab, the speed is very, very slow.
I don't know what to do. It is very frustrating when it costs a whole day just to upload the data.

Davic at the GCP 24/7 chat support was helpful. After checking a bunch of things such as network speed (http://speedtest.net),
I found that uploading a single file is actually pretty fast, and the network is fine too. Since my dataset is available on Kaggle, I thought: why not download it directly from Kaggle?
So I used the following commands:
pip install kaggle
mkdir -p /home/jupyter/.kaggle
mv kaggle.json /home/jupyter/.kaggle/  # download kaggle.json from your Kaggle profile page, upload it to JupyterLab, then move it here
chmod 600 /home/jupyter/.kaggle/kaggle.json
kaggle datasets download -d {username/dataset-name}
And it was done!! In just about 5 seconds the dataset was deployed!!
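As a hedged follow-up (the archive name below is a placeholder; the kaggle CLI saves the dataset as a zip named after it), you can unpack it straight from the terminal:
unzip -q dataset-name.zip -d /home/jupyter/data   # adjust the zip name and target folder to your dataset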

Related

Copy files to Container-Optimised OS from a GCP Storage bucket

How can one download files from a GCP Storage bucket to a Container-Optimised OS (COS) on instance startup?
I know of the following solutions:
gcloud compute copy-files
SSH through console
SCP
Yet all of these have to be done manually and externally after an instance is started.
There is also cloud-init, yet I can't find any info on how to copy files from a Storage bucket with it. The examples seem to suggest that it's better to include the content of the files directly in the cloud-init file, which is not something I want to do for security reasons. Is it possible to download files from a Storage bucket using cloud-init?
I considered using a startup script, yet COS lacks CLI tools such as gcloud or gsutil, so I can't run any such commands in a startup script.
I know I could copy the files manually and then save the image as a boot disk, but I'm hoping there are solutions that avoid having to do so.
Most of all, I'm assuming I'm not asking for something impossible, given that the COS instance setup allows me to specify Docker volumes that I could mount onto the starting container. This seems to suggest I should be able to have some private files on the instance by the moment COS attempts to run my image on startup. But how?
Trying to execute a startup-script with a cloud-sdk image and copying files there as suggested by Guillaume didn't work for me for a while, showing this log. Eventually I realised that the cloud-sdk image is 2.41GB when uncompressed and takes over 2 minutes to complete pulling. I tried again with an empty COS instance and the startup script completed successfully, downloading the data from a Storage bucket.
However, a 2.41GB image and over 2 minutes of boot time sound like overkill for downloading a 2KB file, don't they?
I'm glad to see a working solution to my question (thanks Guillaume!) although I'm still wondering: isn't there a nicer way to do this? I feel that this method is even less tidy than manually putting the files on the COS instance and then creating a machine image to use in the future.
Based on Guillaume's answer I created and published a gsutil wrapper image, available as voyz/gsutil_wrap. This way I am able to run a startup-script with the following command:
docker run -v /host/path:/container/path \
--entrypoint gsutil voyz/gsutil_wrap \
cp gs://bucket/path /container/path
It's essentially a copy of what Guillaume suggested, except it uses an image containing only the minimum setup required to run gsutil. As a result it weighs 0.22GB and pulls within 10-20 seconds on average, as opposed to 2.41GB and over 2 minutes for the google/cloud-sdk image suggested by Guillaume.
Also, credit to this incredibly useful StackOverflow answer that allows gsutil to use the default service account for authentication.
The startup-script is the correct place to do this. And YES, COS lacks some useful libraries.
BUT you can run containers! And, for example, the Google Cloud SDK container!
So, add this startup-script to the VM metadata:
key -> startup-script
value ->
docker run -v /local/path/to/copy/files:/dummy/container/path \
--entrypoint gsutil google/cloud-sdk \
cp gs://your_bucket/path/to/file /dummy/container/path
Note: the startup script is run as root. Perform a chmod/chown in your startup script if you need to change the file access mode.
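If you prefer to set this metadata from the command line rather than the console, a sketch like the following should work (the instance name, zone, and script file name are placeholders; the script runs on the next boot):
# startup.sh contains the docker run command above
gcloud compute instances add-metadata my-cos-vm \
    --zone us-central1-a \
    --metadata-from-file startup-script=startup.sh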
Let me know if you need more explanation of this command line.
Of course, with a fresh COS image, the startup time is quite long (the container image has to be pulled and extracted).
To reduce the startup time, you can "bake" your image. I mean: start from a COS image, download/install what you want on it (or only perform a docker pull of the google/cloud-sdk container), and create a custom image from it.
This way, all the required dependencies will already be present on the image and the boot will be quicker.
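As a rough sketch of that baking step (instance, image, and zone names are placeholders; by default the boot disk has the same name as the instance), once you have pulled google/cloud-sdk on a COS instance:
# stop the prepared instance, then create a reusable image from its boot disk
gcloud compute instances stop my-cos-builder --zone us-central1-a
gcloud compute images create my-cos-baked \
    --source-disk my-cos-builder \
    --source-disk-zone us-central1-a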

Downloading a public file from Google Cloud to Google colaboratory

I have a dataset which is publicly hosted on Google Cloud at this link. I would like to use this data in a Google Colaboratory notebook by downloading it there. However, all the tutorials I have seen that involve transferring a file from Cloud Storage to Colab require a Project ID, which I don't have since this is not my project. wget also doesn't work with this file. Is there a way to download the files at that link directly to a Colab notebook?
Be careful, your files are very big. They can easily fill up all of Colab's disk space.
First you need to log in (authenticate yourself).
from google.colab import auth
auth.authenticate_user()
Then, you can use gsutil to list the files.
!gsutil ls gs://ravens-matrices/analogies/
And to copy one or more files to the current directory:
!gsutil cp gs://ravens-matrices/analogies/extrapolation.tar.gz .
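Once the copy finishes, you can unpack the archive in the notebook and keep an eye on the remaining disk space (this assumes the file is a standard gzipped tar):
!tar -xzf extrapolation.tar.gz   # unpack into the current directory
!df -h /content                  # check how much Colab disk space is left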
Here's a working notebook

How to set up kaggle api in colab

I am uploading the kaggle.json file for every new session in Google Colab. Is there any way to permanently configure it using Google Drive?
You can save kaggle.json in Google Drive, mount the drive, then copy it from there (see the sketch at the end of this answer).
You can also embed kaggle.json directly in Colab.
!mkdir ~/.kaggle
!echo '{"username":"korakot","key":"xxxxxxxxxxxxxxxx"}' > ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
If your notebook is private (not shared), this is the most convenient way.
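For the Drive approach mentioned first, a minimal sketch looks like this (it assumes kaggle.json sits at the top level of your Drive; adjust the path if your mount shows up as 'My Drive'):
from google.colab import drive
drive.mount('/content/drive')    # authorise access to your Google Drive

!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json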

Wget - Downloading lots of files recursively is taking a long time

Currently I am trying to download a large dataset (200k+ large images). It's all stored on Google Cloud. The authors provide a wget command to download it:
wget -r -N -c -np --user username --ask-password https://alpha.physionet.org/files/mimic-cxr/2.0.0/
It downloads fine, but it's been 2 days and it's still going, and I don't know how long it's going to take. AFAIK it's downloading each file individually. Is there a way for me to download in parallel?
EDIT: I don't have sudo access to the machine doing the downloading. I just have user access.
wget is a great tool, but it is not designed to be efficient at downloading 200K files.
You can either wait for it to finish or find another tool that does parallel downloads, provided you have a fast enough Internet connection to benefit from them; that might roughly halve the time compared to wget.
Since the source is an HTTPS web server, there really is not much you can do to speed this up besides downloading two to four files in parallel. Depending on your Internet speed and distance to the source server, you might not see any improvement at all from parallel downloads.
Note: you do not specify what you are downloading onto. If the destination is a Compute Engine VM and you picked a tiny one (f1-micro), you may be resource limited. For any high-speed data transfer pick at least an n1 instance size.
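If you can build a list of the individual file URLs first, one hedged way to get that two-to-four-way parallelism with standard tools is xargs driving wget (urls.txt and the credentials are placeholders):
# urls.txt: one file URL per line
# -P 4 runs up to four wget processes at once; -N and -c keep wget's resume behaviour
xargs -n 1 -P 4 wget -N -c --user=username --password="$PASSWORD" < urls.txt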
If you don't know the URLs, then use the good old HTTrack website copier to download files in parallel:
httrack -v -w https://user:password@example.com/
The default is 8 parallel connections, but you can use the -cN option to increase it.
If the files are large you can use aria2c, which will download a single file with multiple connections:
aria2c -x 16 url
You could find out whether the files are stored in GCS; if so, you can just use
gsutil -m cp -r <src> <destination>
This will download the files in multithreaded mode.
Take a look at the updated official MIMIC-CXR downloads page at https://mimic-cxr.mit.edu/about/download/downloads.
There you'll find info on how to download via wget (locally) and gsutil (Google Cloud Storage).

How to install Kaggle on Jupyter Notebook services in Google Cloud

I've been using Google Colab for my Kaggle competition, and I decided to take a look at whether it would work faster using services on Google Cloud. I have a *.ipynb file from Google Colab, downloaded it, and tried to upload it to the Google Cloud instance.
I created all connection on Google Colab using this link: https://towardsdatascience.com/setting-up-kaggle-in-google-colab-ebb281b61463 and it worked fine.
Using this tutorial: https://towardsdatascience.com/how-to-use-jupyter-on-a-google-cloud-vm-5ba1b473f4c2 I started a new instance for Jupyter notebooks. I uploaded the *.ipynb file and tried to install Kaggle and run my notebook, but I usually get the following errors:
kaggle: command not found error
ensure that your python binaries are on your path
How can I set everything up to work on the Google Cloud service?
When following the first tutorial, mind that you need to change the root directory path from /content to /home/jupyter/, for example:
import zipfile
zip_ref = zipfile.ZipFile("/home/jupyter/Airbus_competition/input/test_v2.zip", 'r')
zip_ref.extractall("/home/jupyter/Airbus_competition/input/test_v2")
zip_ref.close()
As for the problems with installing kaggle: you don't have access to the root folder from Jupyter notebooks, but you can still install and use the Kaggle API if you change the command from !kaggle to !~/.local/bin/kaggle, for example (commands from the tutorial changed to work on the Google Cloud instance):
!mkdir -p /home/jupyter/.kaggle

import json
token = {"username": "your_username", "key": "your_TOKEN"}  # values from your kaggle.json
with open('/home/jupyter/.kaggle/kaggle.json', 'w') as file:
    json.dump(token, file)

!chmod 600 /home/jupyter/.kaggle/kaggle.json
!~/.local/bin/kaggle config set -n path -v /home/jupyter/Airbus_competition
!~/.local/bin/kaggle competitions download -c airbus-ship-detection -p /home/jupyter/Airbus_competition/input --force
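As a hedged sanity check (assuming the competition files land in the input folder as zip archives), you can list the files through the same wrapper and unpack whatever was downloaded:
!~/.local/bin/kaggle competitions files -c airbus-ship-detection
!unzip -o '/home/jupyter/Airbus_competition/input/*.zip' -d /home/jupyter/Airbus_competition/input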