Downloading a public file from Google Cloud to Google colaboratory - google-cloud-platform

I have a dataset which is publicly hosted on Google Cloud at this link. I would like to use this data in a Google Colaboratory notebook by downloading it there. However, all the tutorials I have seen that involve transferring a file from Cloud Storage to Colab require a Project ID, which I don't have since this is not my project. Wget also doesn't work with this file. Is there a way to download the files at that link directly to a Colab notebook?

Be careful: your files are very big, and they can easily fill up all of Colab's disk space.
First you need to log in (authenticate yourself).
from google.colab import auth
auth.authenticate_user()
Then, you can use gsutil to list the files.
!gsutil ls gs://ravens-matrices/analogies/
And to copy one or more files to the current directory:
!gsutil cp gs://ravens-matrices/analogies/extrapolation.tar.gz .
Here's a working notebook
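Since the bucket is public, you can also skip authentication and fetch an object over its public HTTPS endpoint (https://storage.googleapis.com/<bucket>/<object>). A minimal sketch, assuming the object allows anonymous reads; the object name is taken from the gsutil example above:
import requests
# Stream the public object to a local file (anonymous read access assumed).
url = 'https://storage.googleapis.com/ravens-matrices/analogies/extrapolation.tar.gz'
resp = requests.get(url, stream=True)
resp.raise_for_status()
with open('extrapolation.tar.gz', 'wb') as f:
    for chunk in resp.iter_content(chunk_size=1 << 20):
        f.write(chunk)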

Related

How to copy data from Cloud Storage to my Local Windows Laptop

Google makes it difficult to get your data if you are not experienced in programming. I ran a data export from Google to export all company data (Google Data Export).
It shows the root folder, and to download it I run this command (it enters the command automatically):
gsutil -m cp -r "gs://takeout-export-myUniqueID" .
But I have no idea where it would save the files, since I am not a GCP customer, only a Google Workspace customer. Workspace support won't help because they say gsutil is a GCP product, even though I am exporting from Workspace. Craziness.
Can someone let me know the proper command to run on my local machine with Google's SDK to download this folder? I was able to start the download yesterday, but it said there was an invalid character in the file names, so it killed the export.
Appreciate any help!
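The trailing . in the command above means "the current directory", so the files land wherever gsutil is run from. A hedged sketch with an explicit destination instead (C:\Takeout is only an example path on Windows):
gsutil -m cp -r "gs://takeout-export-myUniqueID" C:\Takeout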

uploading zip to GCP jupyterlab is superslow

I was using a JupyterLab notebook instance on AI Platform in GCP. You can get one by 1) opening the GCP console, 2) searching for "notebook instance" and choosing the entry with the AI Platform subtitle, and 3) creating an instance.
When I upload a zip to JupyterLab, the speed is very, very slow.
I don't know what to do; it is very frustrating when it costs a day just to upload the data.
David at GCP's 24/7 chat support was helpful. After checking a bunch of things such as network speed (http://speedtest.net), I found that uploading a single file is pretty fast and the network is fine too. Since my dataset is available on Kaggle, I thought: why not download it directly from Kaggle?
So I used the following commands:
pip install kaggle
mkdir -p /home/jupyter/.kaggle   # download kaggle.json from your Kaggle profile page and upload it to JupyterLab
mv kaggle.json /home/jupyter/.kaggle/
chmod 600 /home/jupyter/.kaggle/kaggle.json
kaggle datasets download -d {username/dataset-name}
It is done!! In maybe 5 seconds the dataset was there!!
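The Kaggle CLI delivers the dataset as a zip archive; a short sketch for unpacking it (the archive name and target directory are placeholders):
unzip {dataset-name}.zip -d /home/jupyter/data/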

How to set up kaggle api in colab

I am uploading the kaggle.json file for every new session in Google Colab. Is there any way to configure it permanently using Google Drive?
You can save kaggle.json in Google Drive, mount the drive, then copy it from there.
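A minimal sketch of that approach; the /content/drive/MyDrive/kaggle.json path assumes the file sits at the top level of My Drive (adjust it if you keep the file elsewhere):
from google.colab import drive
drive.mount('/content/drive')
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json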
You can also embed kaggle.json directly in Colab.
!mkdir ~/.kaggle
!echo '{"username":"korakot","key":"xxxxxxxxxxxxxxxx"}' > ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
If your notebook is private (not shared), this is the most convenient way.

How to install Kaggle on Jupyter Notebook services in Google Cloud

I've been using Google Colab for my Kaggle competition; now I decided to check whether it will run faster using services on Google Cloud. I have an *.ipynb file from Google Colab; I downloaded it and tried to upload it to the Google Cloud instance.
I set up the whole connection in Google Colab using this link: https://towardsdatascience.com/setting-up-kaggle-in-google-colab-ebb281b61463 and it worked fine.
Using this tutorial: https://towardsdatascience.com/how-to-use-jupyter-on-a-google-cloud-vm-5ba1b473f4c2 I started a new instance for a Jupyter notebook. I uploaded the *.ipynb file and tried to install Kaggle and run my notebook, but I usually get the following errors:
kaggle: command not found error
ensure that your python binaries are on your path
How can I set everything to work on Google Cloud service?
When using the first tutorial, mind changing the root directory path from /content to /home/jupyter/, for example:
import zipfile
zip_ref = zipfile.ZipFile("/home/jupyter/Airbus_competition/input/test_v2.zip", 'r')
zip_ref.extractall("/home/jupyter/Airbus_competition/input/test_v2")
zip_ref.close()
As for the problems with installing kaggle: you don't have access to the root folder from Jupyter notebooks, but you can still install and use the Kaggle API if you change the command from !kaggle to !~/.local/bin/kaggle, for example (commands from the tutorial adjusted to work on Google Cloud):
!mkdir -p ~/.kaggle
import json
token = {"username": "your_USERNAME", "key": "your_TOKEN"}  # values from your kaggle.json
with open('/home/jupyter/.kaggle/kaggle.json', 'w') as file:
    json.dump(token, file)
!cp /home/jupyter/.kaggle/kaggle.json ~/.kaggle/kaggle.json  # only needed if ~ is not /home/jupyter
!~/.local/bin/kaggle config set -n path -v /home/jupyter/Airbus_competition
!chmod 600 /home/jupyter/.kaggle/kaggle.json
!~/.local/bin/kaggle competitions download -c airbus-ship-detection -p /home/jupyter/Airbus_competition/input --force
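If the kaggle: command not found error is simply because the package is not installed yet, a hedged first step is a user-level install; pip then places the executable under ~/.local/bin, which is why the commands above call !~/.local/bin/kaggle:
!pip install --user kaggle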

Accessing data in Google Cloud bucket for a python Tensorflow learning program

I’m working through the Google quick start examples for Cloud Learning / Tensorflow as shown here: https://cloud.google.com/ml/docs/quickstarts/training
I want my python program to access data that I have stored in a Google Cloud bucket such as gs://mybucket. How do I do this inside of my python program instead of calling it from the command line?
Specifically, the quickstart example for cloud learning utilizes data they provided but what if I want to provide my own data that I have stored in a bucket such as gs://mybucket?
I noticed a similar post here: How can I get the Cloud ML service account programmatically in Python? ... but I can’t seem to install the googleapiclient module.
Some posts seem to mention Apache Beam, though I can't tell whether that's relevant to me, and besides I can't figure out how to download or install it.
If I understand your question correctly, you want to programmatically talk to GCS in Python.
The official docs are a good place to start.
First, grab the module using pip:
pip install --upgrade google-cloud-storage
Then:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
# Then do other things...
blob = bucket.get_blob('remote/path/to/file.txt')
print(blob.download_as_string())
blob.upload_from_string('New contents!')
blob2 = bucket.blob('remote/path/storage.txt')
blob2.upload_from_filename(filename='/local/path.txt')
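To pull an object down to a local file instead of into memory, or to see what is in the bucket, a short sketch continuing from the client and bucket objects above (bucket and path names are placeholders):
# List objects under a prefix, then download one to a local file
for blob in bucket.list_blobs(prefix='remote/path/'):
    print(blob.name)
blob3 = bucket.blob('remote/path/to/file.txt')
blob3.download_to_filename('/local/file.txt')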
Assuming you are using Ubuntu/Linux as the OS and already have data in a GCS bucket.
Execute the following commands from a terminal; they can also be run in a Jupyter notebook (just put ! before each command):
--------------------- Installation -----------------
1st, install the storage module; in a terminal, type:
pip install google-cloud-storage
2nd, verify that the gsutil command-line tool is available (it ships with the Google Cloud SDK) by typing:
gsutil
(the output will show the available options)
---------------------- Copy data from GCS bucket --------
Type this command to check whether you are able to get information about the bucket:
gsutil acl get gs://BucketName
Now copy the file from GCS Bucket to your machine:
gsutil cp gs://BucketName/FileName /PathToDestinationDir/
In this way, you will be able to copy data from this bucket to your machine for further processing.
NOTE: all the above commands can be run from a Jupyter notebook; just put ! before each command, e.g.
!gsutil cp gs://BucketName/FileName /PathToDestinationDir/
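If you need a whole folder rather than a single file, gsutil can also copy recursively; a sketch with the same placeholder names:
gsutil -m cp -r gs://BucketName/FolderName /PathToDestinationDir/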