wget does not download files on Amazon AWS S3 - amazon-web-services

I was trying to download all the slides from the following webpage
https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
The command I was using was
wget --no-check-certificate --no-proxy -r -l 3 'https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html'
I could only download the HTML and some PNG files. The slides are hosted on Amazon S3, but I could not crawl them with the command above. The message shown on the terminal is:
I could, however, download those slides directly with the command below
wget http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf
Does anybody know why? How do I download all the slides on that page with a single command?

What you need to do is called "HTML scraping": you take an HTML page and parse the links inside it. After parsing, you can download, catalog, or otherwise process the links found in the document (web page).
This StackOverflow article is very popular for this topic:
Options for HTML scraping?
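If all you need are the linked PDFs rather than a general scraper, wget itself can usually manage once host spanning is enabled, since recursive wget never follows links onto another host by default (which is why the S3-hosted slides were skipped). A hedged sketch, not tested against the live page:
# span hosts (-H), restrict recursion to the two domains involved (-D), keep only PDFs (-A);
# add -e robots=off if a robots.txt blocks the crawl
wget -r -l 2 -H -D web.stanford.edu,spark-public.s3.amazonaws.com -A pdf \
     --no-check-certificate 'https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html'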

Related

Text Compression when serving from a Github Trigger

I'm trying to figure out how to serve my js, css and html as compressed gzip from my Google Cloud Storage bucket. I've set up my static site properly, and also built a Cloud Build Trigger to sync the contents from the repository on push. My problem is that I don't want to have gzips of these files on my repository, but rather just serve them from the bucket.
I might be asking too much for such a simple setup, but perhaps there is a command I can add to my cloudbuild.yaml to make this work.
At the moment it is just this:
steps:
- name: gcr.io/cloud-builders/gsutil
args: ["-m", "rsync", "-r", "-c", "-d", ".", "gs://my-site.com"]
As far as I'm aware this just syncs the bucket to the repo. Is there another command that could ensure that the aforementioned files are transferred as gzip? I've seen gsutil cp used for this, but not within this specific Cloud Build pipeline setup from GitHub.
Any help would be greatly appreciated!
The gsutil setmeta command lets you add metadata to your objects that overrides the headers the HTTP server would send by default, which is handy for the Content-Type and Cache-* options.
gsutil setmeta -h "Content-Encoding: gzip" gs://bucket_name/folder/*
For more info about Transcoding with gzip-uploaded files: https://cloud.google.com/storage/docs/transcoding
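Alternatively, if you want the objects stored gzip-compressed in the first place, gsutil cp can compress on upload with -z, which also sets Content-Encoding: gzip on each object. A minimal sketch of the command a Cloud Build step could run instead of the rsync (bucket name taken from the question; note that cp does not delete stale objects the way rsync -d does):
# gzip js/css/html files on upload and set Content-Encoding: gzip automatically
gsutil -m cp -r -z js,css,html . gs://my-site.com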

How to force download files from a Google Storage bucket instead of opening them in the browser?

I have some audio files in a Google Cloud Storage bucket, and I am serving links to those files on a WordPress website.
How do I force those files to download instead of playing in the browser?
Adding &response-content-disposition=attachment; to the end of the url doesn't work.
I tried, in gsutil:
gsutil setmeta -h 'Content-Disposition:attachment' gs://samplebucket/*/*.mp3
I get the error
CommandException: Invalid or disallowed header (u'content-disposition).
Only these fields (plus x-goog-meta-* fields) can be set or unset:
[u'cache-control', u'content-disposition', u'content-encoding', u'content-language', u'content-type']
As pointed out by robsiemb, I had to invoke these commands under Google Cloud Shell. In my case, the Windows shell turned out to be the culprit.
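For reference, a minimal sketch of the same command as it can be run from Cloud Shell, with double quotes around the header and the wildcard path (bucket path taken from the question):
gsutil setmeta -h "Content-Disposition: attachment" "gs://samplebucket/*/*.mp3"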

Extract Tar.gz files from Cloud Storage

I am a newbie to Google Cloud. I need to extract the files with extension "xxxx.tar.gz" in Cloud Storage and load them into BigQuery (multiple files to multiple tables).
I tried a Cloud Function in Node.js using npm modules like "tar.gz" and "jaguar", but neither worked.
Can someone share some input on decompressing the files in other languages like Python or Go as well?
My work so far: I decompressed the files manually, copied them to the target bucket, and loaded them into BigQuery using Node.js background functions.
I appreciate your help.
tar is a Linux tool for archiving a group of files together - e.g., see this manual page. You can unpack a compressed tar file using a command like:
tar xfz file.tar.gz
Mike is right with regard to tar archives. As for the second half of the question in the title: Cloud Storage does not natively support unpacking a tar archive. You'd have to do this yourself (on your local machine or from a Compute Engine VM, for instance).
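As a starting point, here is a hedged sketch of doing the whole thing with the standard gsutil and bq tools from a VM or Cloud Shell; the bucket, dataset, and file names are made up, and it assumes the archives contain CSV files:
# download the archive, unpack it, and load each CSV into its own BigQuery table
gsutil cp gs://source-bucket/xxxx.tar.gz /tmp/
mkdir -p /tmp/extracted && tar xzf /tmp/xxxx.tar.gz -C /tmp/extracted
for f in /tmp/extracted/*.csv; do
  bq load --autodetect --source_format=CSV my_dataset.$(basename "$f" .csv) "$f"
done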

How to install hexo blog in a remote repo to local machine?

I'm using hexo with GitHub Pages. I mistakenly deleted the local files on my local machine. I tried to recreate a local copy by running git clone https://github.com/aaayumi/aaayumi.github.io.git and then ran npm install hexo-cli -g.
I could install all the necessary files, but when I typed hexo deploy, it shows:
hexo deploy
Usage: hexo <command>
Commands:
help Get help on a command.
init Create a new Hexo folder.
version Display version information.
Global Options:
--config Specify config file instead of using _config.yml
--cwd Specify the CWD
--debug Display all verbose messages in the terminal
--draft Display draft posts
--safe Disable all plugins and scripts
--silent Hide output on console
For more help, you can use 'hexo help [command]' for the detailed information
or you can check the docs: http://hexo.io/docs/
Is there a way to use the hexo blog locally again?
The code in https://github.com/aaayumi/aaayumi.github.io is not the source code of your blog, it is just the generated content. What you need are the original markdown files that were inside your source folder.
You will have to recreate the blog with hexo init and rewrite your blog posts. Sorry about that.
Of course, you can look at your website directly (http://ayumi-saito.com/) and rewrite the posts by copy-pasting from there, which should not take that long.
Also, to make sure this does not happen again, you can publish your blog source files in a different repository so that there is always a copy somewhere.
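A minimal sketch of that recovery, assuming a new, hypothetical repository for the source files:
# recreate the blog skeleton locally
npm install -g hexo-cli
hexo init my-blog && cd my-blog && npm install
# rewrite the posts under source/_posts/, then keep the source in its own repo
git init
git remote add origin https://github.com/<user>/blog-source.git   # hypothetical repo
git add . && git commit -m "hexo source"
git push -u origin master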
PS: Thanks for using my theme ;)

How to get kaggle competition data via command line on virtual machine?

I am looking for the easiest way to download the Kaggle competition data (train and test) onto a virtual machine using bash, so that I can train there without uploading the data to git.
Fast-forward three years and you can use Kaggle's API via the CLI, for example:
kaggle competitions download favorita-grocery-sales-forecasting
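For completeness, a minimal setup sketch on the VM; the kaggle.json token comes from your Kaggle account page, and the paths here are assumptions:
pip install kaggle
mkdir -p ~/.kaggle && cp /path/to/kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
kaggle competitions download -c favorita-grocery-sales-forecasting
unzip favorita-grocery-sales-forecasting.zip -d data/   # newer CLI versions bundle everything into one zip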
First you need to copy your cookie information for the Kaggle site into a text file. There is a Chrome extension which will help you do this.
Copy the cookie information and save it as cookies.txt.
Now transfer the file to the EC2 instance using the command
scp -i /path/my-key-pair.pem /path/cookies.txt user-name@ec2-xxx-xx-xxx-x.compute-1.amazonaws.com:~
Accept the competition rules and copy the URLs of the datasets you want to download from kaggle.com. For example, the URL to download the sample_submission.csv file of the Intel & MobileODT Cervical Cancer Screening competition is: https://kaggle.com/c/intel-mobileodt-cervical-cancer-screening/download/sample_submission.csv.zip
Now, from the terminal use the following command to download the dataset into the instance.
wget -x --load-cookies cookies.txt https://kaggle.com/c/intel-mobileodt-cervical-cancer-screening/download/sample_submission.csv.zip
Install the CurlWget Chrome extension.
Start downloading your Kaggle dataset; CurlWget will give you the full wget command. Paste this command into the terminal with sudo.
Job done.
Install the cookies.txt extension in Chrome and enable it.
Log in to Kaggle.
Go to the challenge page that you want the data from.
Click the cookies.txt extension icon in the top right; it will download the current page's cookies into a cookies.txt file.
Transfer the file to the remote machine using scp or another method.
Copy the data link shown on the Kaggle page (right-click and copy the link address).
Run wget -x --load-cookies cookies.txt <datalink> (see the worked example below).
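A worked example of the last three steps, with a hypothetical key path, host, and competition URL:
scp -i ~/.ssh/my-key.pem cookies.txt user@remote-host:~
ssh -i ~/.ssh/my-key.pem user@remote-host
wget -x --load-cookies cookies.txt "https://www.kaggle.com/c/<competition>/download/train.csv.zip"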