How to get kaggle competition data via command line on virtual machine? - cookies

I am looking for the easiest way to download the Kaggle competition data (train and test) onto the virtual machine using bash, so that I can train there without uploading the data to git.

Fast-forward three years and you can now use Kaggle's official API via the CLI, for example:
kaggle competitions download favorita-grocery-sales-forecasting
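For reference, a minimal sketch of the full CLI setup; it assumes you have generated an API token (kaggle.json) from your Kaggle account page, and the archive name in the last step may differ depending on the CLI version:
# install the CLI and place the API token where it is expected
pip install kaggle
mkdir -p ~/.kaggle
cp /path/to/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
# download and unpack a competition (you must have accepted its rules on the website)
kaggle competitions download -c favorita-grocery-sales-forecasting
unzip favorita-grocery-sales-forecasting.zip -d data/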

First you need to copy your cookie information for the Kaggle site into a text file. There is a Chrome extension that will help you do this.
Copy the cookie information and save it as cookies.txt.
Now transfer the file to the EC2 instance using the following command:
scp -i /path/my-key-pair.pem /path/cookies.txt user-name@ec2-xxx-xx-xxx-x.compute-1.amazonaws.com:~
Accept the competition rules and copy the URLs of the datasets you want to download from kaggle.com. For example, the URL to download the sample_submission.csv file of the Intel & MobileODT Cervical Cancer Screening competition is: https://kaggle.com/c/intel-mobileodt-cervical-cancer-screening/download/sample_submission.csv.zip
Now, from the terminal, use the following command to download the dataset onto the instance:
wget -x --load-cookies cookies.txt https://kaggle.com/c/intel-mobileodt-cervical-cancer-screening/download/sample_submission.csv.zip
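Because of -x, wget mirrors the URL path locally, so the zip ends up in a nested directory; a small sketch to unpack it afterwards, assuming unzip is available on the instance:
# path mirrors the URL because of wget -x
unzip kaggle.com/c/intel-mobileodt-cervical-cancer-screening/download/sample_submission.csv.zip -d data/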

Install the CurlWget Chrome extension.
Start downloading your Kaggle dataset in the browser; CurlWget will give you the full wget command. Paste this command into the terminal (with sudo if needed); an illustrative placeholder example is shown below.
Job is done.
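For illustration only, the generated command usually looks roughly like the following; the URL, cookie, and user-agent values here are placeholders, not real ones:
wget --header="Cookie: <your-session-cookies>" \
     --user-agent="<your-browser-user-agent>" \
     "https://www.kaggle.com/c/<competition>/download/train.csv.zip" \
     -O train.csv.zip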

Install the cookies.txt extension in Chrome and enable it.
Log in to Kaggle.
Go to the competition page that you want the data from.
Click on the cookies.txt extension icon at the top right; it will download the current page's cookies into a cookies.txt file.
Transfer the file to the remote machine using scp or another method (see the end-to-end sketch after this list).
Copy the data link shown on the Kaggle page (right click and copy link address).
Run wget -x --load-cookies cookies.txt <datalink>
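An end-to-end sketch of those steps, with placeholder host and link values:
# from your local machine: copy the exported cookies to the remote box
scp cookies.txt user@remote-host:~
# on the remote machine: download using the saved cookies
wget -x --load-cookies cookies.txt "https://www.kaggle.com/c/<competition>/download/<file>.zip"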

Related

How to copy data from Cloud Storage to my Local Windows Laptop

Google makes it difficult to get your data if you are not experienced in programming. I did a data export from Google to export all company data (Google Data Export).
It shows the root folder, and to download it I run this command (it enters this command automatically):
gsutil -m cp -r "gs://takeout-export-myUniqueID" .
But I have no idea where it would save it, since I am not a GCP customer, only Google Workspace. Workspace won't help because they say it's a GCP product, but I am exporting from Workspace. Craziness.
Can someone let me know the proper command to run on my local machine with Google's SDK to download this folder? I was able to start the download yesterday, but it said there was an invalid character in the file names, so it killed the export.
Appreciate any help!
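Not a definitive answer, but a sketch of what usually works for pulling an export bucket to a local Windows machine, run from a local PowerShell with the Google Cloud SDK installed; C:\Takeout is just an example destination folder:
# authenticate with the same Google account that owns the export
gcloud auth login
# create a local destination folder, then copy the export bucket into it
mkdir C:\Takeout
gsutil -m cp -r "gs://takeout-export-myUniqueID" C:\Takeout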

Downloading a public file from Google Cloud to Google colaboratory

I have a dataset which is publicly hosted on google cloud at this link. I would like to use this data in a Google colaboratory notebook by downloading it there. However all the tutorials I have seen which involve transferring a file from the Cloud to Colab require a Project ID, which I don't have since this is not my project. Wget also doesn't work with this file. Is there a way to download the files at that link directly to a colab notebook?
Be careful: those files are very big and can easily fill up all of the Colab disk space.
First you need to log in (authenticate yourself):
from google.colab import auth
auth.authenticate_user()
Then, you can use gsutil to list the files.
!gsutil ls gs://ravens-matrices/analogies/
And to copy one or more files to the current directory:
!gsutil cp gs://ravens-matrices/analogies/extrapolation.tar.gz .
Here's a working notebook
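If you then want to unpack that archive inside the notebook, something along these lines should work (extrapolation.tar.gz is just the file copied above):
# extract the downloaded archive into the Colab working directory
!tar -xzf extrapolation.tar.gz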

How to force download files from a Google Storage bucket instead of opening them in the browser?

I have some audio files in a Google Cloud Storage bucket, and I am serving links to those files on a WordPress website.
How do I force those files to download instead of playing in the browser?
Adding &response-content-disposition=attachment; to the end of the URL doesn't work.
I tried, with gsutil: gsutil setmeta -h 'Content-Disposition:attachment' gs://samplebucket/*/*.mp3
I get the error:
CommandException: Invalid or disallowed header (u'content-disposition').
Only these fields (plus x-goog-meta-* fields) can be set or unset:
[u'cache-control', u'content-disposition', u'content-encoding', u'content-language', u'content-type']
As pointed out by robsiemb, I had to invoke these commands in Google Cloud Shell. In my case, the Windows shell turned out to be the culprit.
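For anyone hitting the same quoting issue, a sketch of the command as run from Cloud Shell; the bucket name and path pattern are the example ones from the question:
# set Content-Disposition so browsers download the files instead of playing them
gsutil -m setmeta -h "Content-Disposition:attachment" gs://samplebucket/*/*.mp3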

wget does not download files on Amazon AWS S3

I was trying to download all the slides from the following webpage
https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
The command I was using was
wget --no-check-certificate --no-proxy -r -l 3 'https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html'
I could only download HTML and some PNG files. Those slides are hosted on Amazon S3, but I could not crawl them using the command above.
I could, however, download those slides directly using the command below:
wget http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf
Does anybody know why? How do I download all the slides on that page using a single command?
What you need to do is called "HTML scraping". This means that you take an HTML page and parse the links inside it. After parsing, you can download, catalog, etc. the links found in the document (web page).
This StackOverflow article is very popular for this topic:
Options for HTML scraping?
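As an alternative to writing a scraper, wget itself can follow those links if you let it span hosts (which it does not do by default, and the PDFs live on a different host than the page). A sketch, assuming the slide files are linked directly from that page and hosted on spark-public.s3.amazonaws.com:
wget -r -l 1 -H -D web.stanford.edu,spark-public.s3.amazonaws.com \
     -A pdf,pptx --no-check-certificate \
     'https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html'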

How to copy file from bucket GCS to my local machine

I need to copy files from Google Cloud Storage to my local machine.
I tried this command in the terminal of a Compute Engine instance:
$ sudo gsutil cp -r gs://mirror-bf /var/www/html/mydir
That is my directory on the local machine: /var/www/html/mydir.
I get this error:
CommandException: Destination URL must name a directory, bucket, or bucket
subdirectory for the multiple source form of the cp command.
Where is the mistake?
You must first create the directory /var/www/html/mydir.
Then, you must run the gsutil command on your local machine and not in the Google Cloud Shell. The Cloud Shell runs on a remote machine and can't deal directly with your local directories.
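A minimal sketch of those two steps, run on the local machine (paths taken from the question):
# create the destination directory first, then copy from the bucket
sudo mkdir -p /var/www/html/mydir
gsutil -m cp -r gs://mirror-bf /var/www/html/mydir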
I had a similar problem and went through the painful process of having to figure it out too, so I thought I would provide my step-by-step solution (under Windows; hopefully similar for Unix users) and hope it helps others.
The first thing (as many others have pointed out on various Stack Overflow threads): you have to run a local console (in admin mode) for this to work (i.e., do not use the Cloud Shell terminal).
Here are the steps:
Assuming you already have Python installed on your machine, you will then need to install the gsutil python package using pip from your console:
pip install gsutil
You will then be able to run the gsutil config from that same console:
gsutil config
A .boto file needs to be created; it is needed to make sure you have permission to access your storage.
Also note that you are now given a URL, which is needed in order to get the authorization code (prompted in the console).
Open a browser and paste this URL in, then:
Log in to your Google account (ie. account linked to your Google Cloud)
Google asks you to confirm that you want to give access to gsutil. Click Allow.
You will then be given an authorization code, which you can copy and paste into your console.
Finally, you are asked for a project ID:
Get the project ID of interest from your Google Cloud.
To find these IDs, click on "My First Project" in the Cloud Console; you will then be shown a list of all your projects and their IDs.
Paste that ID into your console, hit Enter, and there you are! You have now created your .boto file. This should be all you need to be able to work with your Cloud Storage.
Console output:
Boto config file "C:\Users\xxxx\.boto" created. If you need to use a proxy to access the Internet please see the instructions in that file.
You will then be able to copy your files and folders from the cloud to your PC using the following gsutil Command:
gsutil -m cp -r gs://myCloudFolderOfInterest/ "D:\MyDestinationFolder"
Files from within "myCloudFolderOfInterest" should then get copied to the destination "MyDestinationFolder" (on your local computer).
gsutil -m cp -r gs://bucketname/ "C:\Users\test"
I had put an r before the file path, i.e. r"C:\Users\test", and got the same error. So I removed the r and it worked for me.
Try prefixing the destination path with '.', i.e. ./var:
$sudo gsutil cp -r gs://mirror-bf ./var/www/html/mydir
or maybe below problem
gsutil cp does not support copying special file types such as sockets, device files, named pipes, or any other non-standard files intended to represent an operating system resource. You should not run gsutil cp with sources that include such files (for example, recursively copying the root directory on Linux that includes /dev ). If you do, gsutil cp may fail or hang.
Source: https://cloud.google.com/storage/docs/gsutil/commands/cp
The syntax that worked for me when downloading to a Mac was:
gsutil cp -r gs://bucketname dir Dropbox/directoryname