Extract Tar.gz files from Cloud Storage - google-cloud-platform

I am new to Google Cloud. I need to extract files named like "xxxx.tar.gz" in Cloud Storage and load them into BigQuery (multiple files into multiple tables).
I tried a Cloud Function in Node.js using npm modules such as "tar.gz" and "jaguar"; neither worked.
Can someone share some pointers on decompressing the files, possibly in another language such as Python or Go?
My work so far: I decompressed the files manually, copied them to the target bucket, and loaded them into BigQuery using Node.js background functions.
I appreciate your help.

tar is a Linux tool for archiving a group of files together - e.g., see this manual page. You can unpack a compressed tar file using a command like:
tar xfz file.tar.gz

Mike is right about tar archives. Regarding the second half of the question in the title, Cloud Storage does not natively support unpacking a tar archive. You'd have to do this yourself (on your local machine or from a Compute Engine VM, for instance).
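Since the question also asks about Python, here is a minimal sketch of the decompression step using only the standard-library tarfile module. The function name iter_tar_gz_members is hypothetical. Fetching the archive bytes from Cloud Storage and loading the results into BigQuery are not shown; the sketch assumes the archive bytes are already in memory.

```python
import io
import tarfile


def iter_tar_gz_members(archive_bytes):
    """Yield (name, content_bytes) for each regular file in a .tar.gz archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, and other non-file entries
            extracted = tar.extractfile(member)
            yield member.name, extracted.read()
```

Each yielded pair could then be written back to a bucket or handed to a load job, one table per file. This reads the whole archive into memory, so it only suits archives that fit within the memory available to the function or VM.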

Related

Trouble opening audio files stored on S3 in SageMaker

I stored about 300 GB of audio data (mostly mp3/wav) on Amazon S3 and am trying to access it in a SageMaker notebook instance to do some data transformations. I'm trying to use either torchaudio or librosa to load a file as a waveform. torchaudio expects a file path as input; librosa can use either a file path or a file-like object. I tried using s3fs to get the URL of the file, but torchaudio doesn't recognize it as a file. And apparently SageMaker has problems installing librosa, so I can't use that. What should I do?
For anyone who has this issue and has to use SageMaker, I found that installing librosa with the following command:
!conda install -y -c conda-forge librosa
rather than via pip allowed me to use it in Sagemaker.
I ended up not using SageMaker for this, but for anybody else having similar problems, I solved this by opening the file using s3fs and writing it to a tempfile.NamedTemporaryFile. This gave me a file path that I could pass into either torchaudio.load or librosa.core.load. This was also important because I wanted the extra resampling functionality of librosa.core.load, but it doesn't accept file-like objects for loading mp3s.
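The temp-file approach described above can be sketched roughly as follows. The helper name as_local_path is hypothetical, and the s3fs bucket and key in the usage comment are placeholders; the helper itself only assumes a readable binary file-like object.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def as_local_path(src, suffix=""):
    """Copy a binary file-like object to a named temp file and yield its path.

    The temp file is removed when the with-block exits.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(src, tmp)
    try:
        yield tmp.name
    finally:
        os.remove(tmp.name)


# Hypothetical usage with s3fs (bucket and key are placeholders):
#   fs = s3fs.S3FileSystem()
#   with fs.open("my-bucket/audio/clip.mp3", "rb") as remote:
#       with as_local_path(remote, suffix=".mp3") as path:
#           waveform, sample_rate = torchaudio.load(path)
```

Keeping the original file extension in suffix is a deliberate choice here, on the assumption that the audio loader may use it to pick a decoder.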

Downloading a public file from Google Cloud to Google colaboratory

I have a dataset which is publicly hosted on google cloud at this link. I would like to use this data in a Google colaboratory notebook by downloading it there. However all the tutorials I have seen which involve transferring a file from the Cloud to Colab require a Project ID, which I don't have since this is not my project. Wget also doesn't work with this file. Is there a way to download the files at that link directly to a colab notebook?
Be careful: your files are very big and can easily fill up all of Colab's disk space.
First you need to log in (authenticate yourself).
from google.colab import auth
auth.authenticate_user()
Then, you can use gsutil to list the files.
!gsutil ls gs://ravens-matrices/analogies/
And to copy one or more files to the current directory:
!gsutil cp gs://ravens-matrices/analogies/extrapolation.tar.gz .
Here's a working notebook

Wget - Downloading lots of files recursively is taking a long time

I am currently trying to download a large dataset (200k+ large images). It's all stored on Google Cloud. The authors provide a wget script to download it:
wget -r -N -c -np --user username --ask-password https://alpha.physionet.org/files/mimic-cxr/2.0.0/
It does download, but it has been running for 2 days and I don't know how much longer it will take. AFAIK it's downloading each file individually. Is there a way for me to download the files in parallel?
EDIT: I don't have sudo access to the machine doing the downloading. I just have user access.
wget is a great tool but it is not designed to be efficient for downloading 200K files.
You can either wait for it to finish or find another tool that does parallel downloads. Provided your Internet connection is fast enough to support them, parallel downloads might cut the time roughly in half compared with wget.
Since the source is an HTTPS web server, there really is not much you can do to speed this up besides downloading two to four files in parallel. Depending on your Internet speed and your distance to the source server, you might not see any improvement from parallel downloads.
Note: You do not specify what you are downloading onto. If the destination is a Compute Engine VM and you picked a tiny one (f1-micro), you may be resource limited. For any high-speed data transfer, pick at least an n1 instance size.
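As a rough illustration of downloading a few files in parallel, here is a minimal sketch using only Python's standard library. It assumes you already have a list of file URLs; the function names fetch and download_all are hypothetical, and the username/password authentication that the dataset requires is not handled here.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def fetch(url):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_all(urls, fetch_one=fetch, max_workers=4):
    """Fetch URLs concurrently; results come back in the same order as urls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, urls))
```

max_workers defaults to 4 to match the "two to four files in parallel" suggestion. For large images you would likely want fetch_one to stream each response to disk rather than return the bytes, since this version holds every result in memory.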
If you don't know the URLs, use the good old httrack website copier to download files in parallel:
httrack -v -w https://user:password@example.com/
The default is 8 parallel connections, but you can use the -cN option to increase it.
If the files are large, you can use aria2c, which downloads a single file over multiple connections:
aria2c -x 16 url
You could find out whether the files are stored in GCS. If so, you can just use
gsutil -m cp -r <src> <destination>
This will download the files in multithreaded mode.
Take a look at the updated official MIMIC-CXR downloads page: https://mimic-cxr.mit.edu/about/download/downloads
There you'll find information on how to download via wget (locally) and gsutil (Google Cloud Storage).

How do I download files within a Sagemaker notebook instance programmatically?

We have a notebook instance within Sagemaker which contains many Jupyter Python scripts. I'd like to write a program which downloads these various scripts each day (i.e. so that I could back them up). Unfortunately I don't see any reference to this in the AWS CLI API.
Is this achievable?
It's not exactly what you want, but it looks like a VCS can fit your needs. You can use GitHub (if you already use it) or CodeCommit (free private repos). Details and additional approaches, such as syncing a target directory with an S3 bucket, are here: https://aws.amazon.com/blogs/machine-learning/how-to-use-common-workflows-on-amazon-sagemaker-notebook-instances/
A semi-automatic way:
!conda install -y -c conda-forge zip
!zip -r -X folder.zip folder-to-zip
Then download that zipfile.
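The zip step can also be done from Python without installing the zip tool. This is a minimal sketch using the standard-library shutil module; the function name backup_folder and the dated file name are assumptions, and scheduling it daily or copying the archive elsewhere is not shown.

```python
import datetime
import shutil


def backup_folder(folder, dest_prefix="notebook-backup"):
    """Zip folder into <dest_prefix>-YYYY-MM-DD.zip and return the archive path."""
    stamp = datetime.date.today().isoformat()
    return shutil.make_archive(f"{dest_prefix}-{stamp}", "zip", root_dir=folder)
```

Calling backup_folder("folder-to-zip") writes the archive to the current working directory, from where it can be downloaded as before.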

Copying/using Python files from S3 to Amazon Elastic MapReduce at bootstrap time

I've figured out how to install python packages (numpy and such) at the bootstrapping step using boto, as well as copying files from S3 to my EC2 instances, still with boto.
What I haven't figured out is how to distribute python scripts (or any file) from S3 buckets to each EMR instance using boto. Any pointers?
If you are using boto, I recommend packaging all your Python files in an archive (.tar.gz format) and then using the cacheArchive directive in Hadoop/EMR to access it.
This is what I do:
1. Put all necessary Python files in a sub-directory, say, "required/" and test it locally.
2. Create an archive of this: cd required && tar czvf required.tgz *
3. Upload this archive to S3: s3cmd put required.tgz s3://yourBucket/required.tgz
4. Add this command-line option to your steps: -cacheArchive s3://yourBucket/required.tgz#required
The last step will ensure that your archive file containing Python code will be in the same directory format as in your local dev machine.
To actually do step #4 in boto, here is the code:
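from boto.emr.step import StreamingStep

# conn is an existing boto EMR connection; jobName and jobID are defined elsewhere.
step = StreamingStep(name=jobName,
                     mapper='...',
                     reducer='...',
                     ...
                     cache_archives=["s3://yourBucket/required.tgz#required"],
                     )
conn.add_jobflow_steps(jobID, [step])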
And to allow for the imported code in Python to work properly in your mapper, make sure to reference it as you would a sub-directory:
import sys

# The archive is unpacked under ./required, so add that directory to the import path.
sys.path.append('./required')
import myCustomPythonClass
# Mapper: do something!