I've been using Google Colab to run my Kaggle competition notebooks, and I recently decided to check whether they would run faster on Google Cloud. I downloaded my *.ipynb file from Google Colab and am trying to upload it to a Google Cloud instance.
I set up the Kaggle connection in Google Colab using this link: https://towardsdatascience.com/setting-up-kaggle-in-google-colab-ebb281b61463 and it worked fine.
Using this tutorial: https://towardsdatascience.com/how-to-use-jupyter-on-a-google-cloud-vm-5ba1b473f4c2 I started a new instance for Jupyter Notebook. I uploaded the *.ipynb file and tried to install Kaggle and run my notebook, but I usually get the following errors:
kaggle: command not found error
ensure that your python binaries are on your path
How can I set everything to work on Google Cloud service?
When following the first tutorial, remember to change the root directory path from /content to /home/jupyter/, for example:
import zipfile
zip_ref = zipfile.ZipFile("/home/jupyter/Airbus_competition/input/test_v2.zip", 'r')
zip_ref.extractall("/home/jupyter/Airbus_competition/input/test_v2")
zip_ref.close()
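The same extraction can also be written with a context manager, so the archive is closed even if extraction fails partway. This is only a minimal sketch: `extract_zip` is a hypothetical helper name, and the paths in the usage comment are the ones from the example above.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the archive's member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if missing
    # The with-block closes the archive automatically, even on errors.
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(dest)
        return zip_ref.namelist()


# Usage on the notebook VM (paths taken from the example above):
# extract_zip("/home/jupyter/Airbus_competition/input/test_v2.zip",
#             "/home/jupyter/Airbus_competition/input/test_v2")
```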
As for the problems with installing Kaggle: you don't have access to the root folder from Jupyter notebooks, but you can still install and use the Kaggle API if you change the command from !kaggle to !~/.local/bin/kaggle. For example (commands from the tutorial, changed to work on Google Cloud):
!mkdir -p ~/.kaggle
import json
# Paste the contents of your own kaggle.json here
token = {"username": "your_USERNAME", "key": "your_TOKEN"}
with open('/home/jupyter/.kaggle/kaggle.json', 'w') as file:
    json.dump(token, file)
# Only needed if ~ is not /home/jupyter (otherwise cp just reports that both paths are the same file)
!cp /home/jupyter/.kaggle/kaggle.json ~/.kaggle/kaggle.json
!~/.local/bin/kaggle config set -n path -v /home/jupyter/Airbus_competition
!chmod 600 /home/jupyter/.kaggle/kaggle.json
!~/.local/bin/kaggle competitions download -c airbus-ship-detection -p /home/jupyter/Airbus_competition/input --force
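The "command not found" error usually means the script was installed into the per-user bin directory (typically ~/.local/bin on Linux), which is not on the notebook's PATH. The sketch below shows one way to locate the executable before calling it; `find_cli` is a hypothetical helper, not part of the Kaggle package.

```python
import os
import shutil
import site
from pathlib import Path


def find_cli(name, user_base=None):
    """Return the full path to a command-line tool, or None if it is not found.

    Looks on PATH first, then in <user_base>/bin, where `pip install --user`
    places scripts on Linux.
    """
    found = shutil.which(name)
    if found:
        return found
    base = Path(user_base) if user_base else Path(site.getuserbase())
    candidate = base / "bin" / name
    if candidate.is_file() and os.access(candidate, os.X_OK):
        return str(candidate)
    return None


kaggle_path = find_cli("kaggle")
print(kaggle_path or "kaggle not found; try: pip install --user kaggle")
```

If this prints a path under ~/.local/bin, that is the path to use after the ! in the notebook cells.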
I have a dataset which is publicly hosted on google cloud at this link. I would like to use this data in a Google colaboratory notebook by downloading it there. However all the tutorials I have seen which involve transferring a file from the Cloud to Colab require a Project ID, which I don't have since this is not my project. Wget also doesn't work with this file. Is there a way to download the files at that link directly to a colab notebook?
Be careful, your files are very big. It can easily fill up all Colab space.
First you need to login (authenticate yourself).
from google.colab import auth
auth.authenticate_user()
Then, you can use gsutil to list the files.
!gsutil ls gs://ravens-matrices/analogies/
And to copy one or more files to the current directory:
!gsutil cp gs://ravens-matrices/analogies/extrapolation.tar.gz .
Here's a working notebook
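If the objects allow anonymous reads, they can also be fetched over plain HTTPS without a project ID, because public Cloud Storage objects are served at https://storage.googleapis.com/<bucket>/<object>. The following is a sketch under that assumption; `public_gcs_url` and `download_public_object` are hypothetical helper names, and if the request is rejected the authenticated gsutil route above is the one to use.

```python
import shutil
import urllib.parse
import urllib.request


def public_gcs_url(gs_uri):
    """Convert gs://bucket/object into its https://storage.googleapis.com URL."""
    if not gs_uri.startswith("gs://"):
        raise ValueError("expected a gs:// URI")
    bucket, _, obj = gs_uri[len("gs://"):].partition("/")
    if not bucket or not obj:
        raise ValueError("expected gs://<bucket>/<object>")
    return f"https://storage.googleapis.com/{bucket}/{urllib.parse.quote(obj)}"


def download_public_object(gs_uri, dest_path):
    """Stream a publicly readable object to dest_path without loading it into memory."""
    url = public_gcs_url(gs_uri)
    with urllib.request.urlopen(url) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest_path


print(public_gcs_url("gs://ravens-matrices/analogies/extrapolation.tar.gz"))
```

Streaming with `copyfileobj` matters here because the files are large; keep an eye on the free disk space in Colab before downloading.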
I was using a JupyterLab notebook instance on AI Platform at GCP. You can get one by 1) opening the GCP console, 2) searching for "notebook instance" and choosing the entry with the subtitle "AI Platform", and 3) creating one.
When I upload a zip file to JupyterLab, the upload is very, very slow.
I didn't know what to do. It is very frustrating when it costs a day just to upload the data.
Davic at GCP 24/7 chat support was helpful. After checking a bunch of things such as network speed (http://speedtest.net), I found that uploading a single file is pretty fast and the network is pretty good too. Since my dataset is available on Kaggle, I thought: why not download it directly from Kaggle?
So I used the following commands:
pip install kaggle
mkdir -p /home/jupyter/.kaggle
mv kaggle.json /home/jupyter/.kaggle/  # download your kaggle.json from your Kaggle profile page, upload it to JupyterLab, and move it here
chmod 600 /home/jupyter/.kaggle/kaggle.json
kaggle datasets download -d {username/dataset-name}
It is done! In just about 5 seconds, I guess, the dataset was there.
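The credential steps can also be done from a notebook cell in Python. This is a minimal sketch assuming the usual kaggle.json layout with "username" and "key" fields; `install_kaggle_json` is a hypothetical helper name.

```python
import json
import os
import shutil
from pathlib import Path


def install_kaggle_json(src, home=None):
    """Copy a downloaded kaggle.json into <home>/.kaggle with owner-only permissions."""
    home_dir = Path(home) if home else Path.home()
    config_dir = home_dir / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / "kaggle.json"
    shutil.copyfile(src, target)
    os.chmod(target, 0o600)  # same effect as: chmod 600 kaggle.json
    # Fail early if the file is not the expected username/key JSON object.
    with open(target) as fh:
        creds = json.load(fh)
    missing = {"username", "key"} - set(creds)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    return target


# Usage after uploading kaggle.json to the JupyterLab working directory:
# install_kaggle_json("kaggle.json")
```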
I am uploading the kaggle.json file for every new session in Google Colab. Is there any way to configure it permanently using Google Drive?
You can save kaggle.json in gdrive, mount it, then download from there.
You can also embed kaggle.json directly in Colab.
!mkdir ~/.kaggle
!echo '{"username":"korakot","key":"xxxxxxxxxxxxxxxx"}' > ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
If your notebook is private (not shared), this is the most convenient way.
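For the Google Drive route, one option is to keep kaggle.json in a Drive folder and point the Kaggle client at that folder through the KAGGLE_CONFIG_DIR environment variable. The sketch below assumes a Drive folder at /content/drive/MyDrive/kaggle, which is only an example path, and `use_kaggle_config_dir` is a hypothetical helper. Set the variable before running any kaggle command in the session.

```python
import os
from pathlib import Path


def use_kaggle_config_dir(config_dir):
    """Point the Kaggle client at a folder that already contains kaggle.json."""
    config_path = Path(config_dir)
    token_file = config_path / "kaggle.json"
    if not token_file.is_file():
        raise FileNotFoundError(f"no kaggle.json in {config_path}")
    # The Kaggle client checks this variable for its config directory.
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_path)
    return token_file


# In Colab (assumed Drive layout; adjust the folder to your own):
# from google.colab import drive
# drive.mount("/content/drive")
# use_kaggle_config_dir("/content/drive/MyDrive/kaggle")
```

The Kaggle client may still print a warning if the file on Drive is readable by other users; copying it to ~/.kaggle and running chmod 600 avoids that.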
I’m working through the Google quick start examples for Cloud Learning / Tensorflow as shown here: https://cloud.google.com/ml/docs/quickstarts/training
I want my python program to access data that I have stored in a Google Cloud bucket such as gs://mybucket. How do I do this inside of my python program instead of calling it from the command line?
Specifically, the quickstart example for cloud learning utilizes data they provided but what if I want to provide my own data that I have stored in a bucket such as gs://mybucket?
I noticed a similar post here: How can I get the Cloud ML service account programmatically in Python? ... but I can’t seem to install the googleapiclient module.
Some posts seem to mention Apache Beam though I can’t tell if that’s relevant to me, but besides I can’t figure out how to download or install that whatever it is.
If I understand your question correctly, you want to programmatically talk to GCS in Python.
The official docs are a good place to start.
First, grab the module using pip:
pip install --upgrade google-cloud-storage
Then:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
# Then do other things...
blob = bucket.get_blob('remote/path/to/file.txt')
print(blob.download_as_string())
blob.upload_from_string('New contents!')
blob2 = bucket.blob('remote/path/storage.txt')
blob2.upload_from_filename(filename='/local/path.txt')
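For training data it is often more practical to copy a whole "folder" of objects to local disk first. A minimal sketch of that idea, using only `list_blobs`, `name`, and `download_to_filename` from the client library; `download_prefix` is a hypothetical helper, and the bucket and paths in the usage comment are placeholders.

```python
import os


def download_prefix(bucket, prefix, dest_dir):
    """Download every object under `prefix` in `bucket` into dest_dir.

    `bucket` is expected to behave like google.cloud.storage.Bucket.
    Returns the list of local file paths written.
    """
    written = []
    for blob in bucket.list_blobs(prefix=prefix):
        if blob.name.endswith("/"):
            continue  # skip zero-byte "folder" placeholder objects
        local_path = os.path.join(dest_dir, *blob.name.split("/"))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        blob.download_to_filename(local_path)
        written.append(local_path)
    return written


# Usage with the real client (requires google-cloud-storage and credentials):
# from google.cloud import storage
# bucket = storage.Client().get_bucket("bucket-id-here")
# download_prefix(bucket, "remote/path/", "/tmp/data")
```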
Assuming you are using Ubuntu/Linux and already have data in a GCS bucket,
run the following commands from a terminal or from a Jupyter Notebook (just put ! before each command):
--------------------- Installation -----------------
First, install the storage module.
In a terminal, type:
pip install google-cloud-storage
Second, check that gsutil is available. Note that gsutil is a separate command-line tool that ships with the Google Cloud SDK; it is not installed by the google-cloud-storage package. Type:
gsutil
(the output will list the available options)
---------------------- Copy data from GCS bucket --------
Type this command to check whether you can get information about the bucket:
gsutil acl get gs://BucketName
Now copy the file from GCS Bucket to your machine:
gsutil cp gs://BucketName/FileName /PathToDestinationDir/
In this way, you will be able to copy data from this bucket to your machine for further processing.
NOTE: all of the above commands can be run from a Jupyter Notebook; just put ! before the command, e.g.
!gsutil cp gs://BucketName/FileName /PathToDestinationDir/
I'm using EMR and wanted to use Jupyter (IPython), so I added this bootstrap action to the cluster:
s3://elasticmapreduce.bootstrapactions/ipython-notebook/install-ipython-notebook
I set up port tunneling to access Jupyter from my local host and it works fine, but it asks for a login password. I tried an empty password and tried hadoop, but no luck. Does anybody know what the Jupyter password is?
I ran into this problem as well when I used the same bootstrap action. I tried adding in Args=[--password, jupyter] which I also could not get working. That was from this aws forum:
Name='Install Jupyter notebook',Path="s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh",Args=[--r,--julia,--toree,--torch,--ruby,--ds-packages,--ml-packages,--python-packages,'ggplot nilearn',--port,8880,--password,jupyter,--jupyterhub,--jupyterhub-port,8001,--cached-install,--notebook-dir,s3://<your-s3-bucket>/notebooks/,--copy-samples]
What I did instead was to follow these instructions for installing anaconda directly in the EMR instance using the CLI. If you follow the first part you should be able to get it up and running. To summarize here:
ssh into your master EMR instance using the .pem file you saved
Once there, you'll want to install Anaconda using superuser privileges: sudo wget http://repo.continuum.io/archive/Anaconda3-4.1.1-Linux-x86_64.sh. Then bash Anaconda3-4.1.1-Linux-x86_64.sh
Make sure you're using the Anaconda version of Python: which python
If you're not, reload your shell configuration: source .bashrc
Now make a jupyter config file: jupyter notebook --generate-config
cd into the jupyter folder: cd ~/.jupyter/
update the config file: vi jupyter_notebook_config.py
In the config file add the following lines:
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 6789  # pick whichever port you want
exit out of the config editor and run jupyter via: jupyter notebook
this should run a notebook with no active kernels (for now). But it will give you the token you're looking for: http://localhost:6789/?token=xxxxxx
Leave this running, and open a new terminal window. Now you'll want to tunnel to the EMR instance per this aws blog post (make the port the same as the one you specified in the config file): ssh -o ServerAliveInterval=10 -i <<credentials.pem>> -N -L 6789:<<master-public-dns-name>>:6789 hadoop@<<master-public-dns-name>>
Opening localhost:6789 in the browser should prompt you with the jupyter page to enter your password or token. Enter the token that was generated in the above step and you should be good to go.
Hope this helps! There might be a less convoluted way, but this is what ended up working for me.
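The two places where this tends to go wrong are the port (it has to match the one in the config file on both sides of the tunnel) and copying the token out of the startup URL. Here is a small sketch that builds the ssh command from one port value and pulls the token out of the URL; `tunnel_command` and `token_from_url` are hypothetical helpers, and the file and host names are placeholders.

```python
import shlex
from urllib.parse import parse_qs, urlparse


def tunnel_command(pem_file, master_dns, port, user="hadoop"):
    """Build the ssh command that forwards localhost:<port> to the master node."""
    return [
        "ssh",
        "-o", "ServerAliveInterval=10",
        "-i", pem_file,
        "-N",
        "-L", f"{port}:{master_dns}:{port}",
        f"{user}@{master_dns}",
    ]


def token_from_url(url):
    """Return the token query parameter from a Jupyter startup URL, or None."""
    values = parse_qs(urlparse(url).query).get("token")
    return values[0] if values else None


cmd = tunnel_command("credentials.pem", "master-public-dns-name", 6789)
print(shlex.join(cmd))  # paste this into the second terminal
print(token_from_url("http://localhost:6789/?token=xxxxxx"))
```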