Trouble opening audio files stored on S3 in SageMaker - amazon-web-services

I stored roughly 300 GB of audio data (mostly mp3/wav) on Amazon S3 and am trying to access it in a SageMaker notebook instance to do some data transformations. I'm trying to use either torchaudio or librosa to load a file as a waveform. torchaudio expects a file path as input; librosa can take either a file path or a file-like object. I tried using s3fs to get the URL to the file, but torchaudio doesn't recognize it as a file. And apparently SageMaker has problems installing librosa, so I can't use that. What should I do?

For anyone who has this issue and has to use SageMaker: I found that installing librosa with
!conda install -y -c conda-forge librosa
rather than via pip allowed me to use it in SageMaker.

I ended up not using SageMaker for this, but for anybody else with a similar problem: I solved it by opening the file with s3fs and writing it to a tempfile.NamedTemporaryFile. That gave me a file path I could pass into either torchaudio.load or librosa.core.load. This was also important because I wanted the extra resampling functionality of librosa.core.load, which doesn't accept file-like objects for loading mp3s.
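A minimal sketch of that approach, assuming placeholder bucket/key names and that s3fs, librosa, and torchaudio are installed:

import tempfile

import s3fs
import librosa
import torchaudio

fs = s3fs.S3FileSystem()  # uses the notebook's IAM role / AWS credentials

# "my-bucket/audio/clip.mp3" is a placeholder key
with fs.open("my-bucket/audio/clip.mp3", "rb") as remote, \
        tempfile.NamedTemporaryFile(suffix=".mp3") as local:
    local.write(remote.read())
    local.flush()
    # Either loader works now that we have a real file path
    y, sr = librosa.load(local.name, sr=16000)        # resamples to 16 kHz
    waveform, sample_rate = torchaudio.load(local.name)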

Related

How to read a csv file from s3 bucket using pyspark

I'm using Apache Spark 3.1.0 with Python 3.9.6. I'm trying to read a CSV file from an AWS S3 bucket with something like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
file = "s3://bucket/file.csv"
c = spark.read \
    .csv(file) \
    .count()
print(c)
But I'm getting the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o26.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
I understand that I need to add extra libraries, but I couldn't find clear information on exactly which ones and which versions. I've tried adding something like this to my code, but I'm still getting the same error:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
How can I fix this?
You need to use hadoop-aws version 3.2.0 with Spark 3. Specifying the hadoop-aws library in --packages is enough to read files from S3:
--packages org.apache.hadoop:hadoop-aws:3.2.0
You also need to set the configurations below.
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<access_key>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")
After that you can read the CSV file:
spark.read.csv("s3a://bucket/file.csv")
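Putting the pieces together, a minimal end-to-end sketch might look like this; the package version, bucket name, and credential placeholders are assumptions, and in practice you would normally use an instance profile or environment credentials rather than hard-coding keys:

from pyspark.sql import SparkSession

# Pull in hadoop-aws at session creation time (placeholder version; match your Hadoop build)
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
    .getOrCreate()
)

# Placeholder credentials; prefer an IAM role or environment variables in real code
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<access_key>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")

# Note the s3a:// scheme, not s3://
df = spark.read.option("header", "true").csv("s3a://bucket/file.csv")
print(df.count())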
Thanks Mohana for the pointer! After breaking my head for more than a day, I was finally able to figure it out. Summarizing my learnings:
Make sure you know what version of Hadoop your Spark comes with:
print(f'pyspark hadoop version: {spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}')
or look for it with
ls jars/hadoop*.jar
The issue I was having was an older version of Spark that I had installed a while back; it came with Hadoop 2.7 and was messing everything up.
This should give you an idea of which binaries you need to download. For me it was Spark 3.2.1 and Hadoop 3.3.1, hence I downloaded:
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.3.1
https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.901 # added this just in case
I placed these jar files in the Spark installation dir:
spark/jars/
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 runner.py
Then your code snippet can read from AWS S3.

Why can't my GCP script/notebook find my file?

I have a working script that finds the data file when it is in the same directory as the script. This works both on my local machine and in Google Colab.
When I try it on GCP, though, it cannot find the file. I tried 3 approaches:
PySpark Notebook:
Upload the .ipynb file, which includes a wget command. This downloads the file without error, but I am unsure where it saves it to, and the script cannot find the file either (I assume because I am telling it that the file is in the same directory, and presumably wget on GCP saves it somewhere else by default).
PySpark with bucket:
I did the same as the PySpark notebook above, but first I uploaded the dataset to the bucket and then used the two links provided in the file details when you click the file name inside the bucket on the console (neither worked). I would like to avoid this route anyway, as wget is much faster than downloading over my slow wifi and then re-uploading to the bucket through the console.
GCP SSH:
Create cluster
Access VM through SSH
Upload .py file using the cog icon
wget the dataset and move both into the same folder
Run script using python gcp.py
This just gives me a file-not-found error.
Thanks.
As per your first and third approaches: if you are running PySpark code on Dataproc, irrespective of whether you use an .ipynb file or a .py file, please note the points below:
If you use the wget command to download the file, it will be downloaded into the current working directory where your code is executed.
When you try to access the file through PySpark code, it will look in HDFS by default. If you want to access the downloaded file from the current working directory, use the "file:///" URI with an absolute file path.
If you want to access the file from HDFS, you have to move the downloaded file into HDFS first and then access it there using an absolute HDFS file path. Please refer to the example below:
hadoop fs -put <local file_name> </HDFS/path/to/directory>
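To make the two path schemes concrete, here is a small sketch; the file names and HDFS directory are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a file that wget dropped into the current working directory
# (placeholder absolute path)
local_df = spark.read.csv("file:///home/my_user/dataset.csv", header=True)

# Read the same file after copying it into HDFS with:
#   hadoop fs -put dataset.csv /data/
hdfs_df = spark.read.csv("hdfs:///data/dataset.csv", header=True)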

pyarrow as lambda layer

I need help getting pyarrow set up as a Lambda layer for my Lambda function.
I am trying to read/write a parquet file and am getting the error below:
"errorMessage": "Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.\npyarrow or fastparquet is required for parquet support".
I tried creating the layer myself by installing pyarrow on my EC2 instance with the command below:
pip3 install pandas pyarrow -t build/python/lib/python3.7/site-packages/ --system
but the resulting zip file is over 300 MB, so I cannot use it as a Lambda layer.
Any suggestions or solutions?
Thanks,
Firstly, all the packages need to be in a directory called python, nothing more, nothing less; then you can zip the whole python directory and upload it to Lambda.
Secondly, pandas and pyarrow are pretty big. I did use them both in one Lambda function without any issue, but I'm afraid you may need to split those two packages into two layers to make it work. Do NOT use fastparquet; it is too big and exceeds the 250 MB limit of Lambda.
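Once the layer is in place, a handler along these lines should be able to read and write parquet; this is just a sketch with placeholder bucket/key names, assuming boto3 from the Lambda runtime plus pandas and pyarrow from the layer(s):

import io

import boto3
import pandas as pd  # provided by the layer, with pyarrow as the parquet engine

s3 = boto3.client("s3")

def handler(event, context):
    # Placeholder bucket/key names
    obj = s3.get_object(Bucket="my-bucket", Key="input/data.parquet")
    df = pd.read_parquet(io.BytesIO(obj["Body"].read()), engine="pyarrow")

    # ... transform df ...

    out = io.BytesIO()
    df.to_parquet(out, engine="pyarrow", index=False)
    s3.put_object(Bucket="my-bucket", Key="output/data.parquet", Body=out.getvalue())
    return {"rows": len(df)}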

Extract Tar.gz files from Cloud Storage

I am a newbie to Google Cloud. I need to extract files with the extension "xxxx.tar.gz" in Cloud Storage and load them into BigQuery (multiple files to multiple tables).
I tried a Cloud Function with Node.js using npm modules like "tar.gz" and "jaguar"; neither worked.
Can someone share some input on decompressing the files using other languages like Python or Go as well?
My work so far: I decompressed the files manually, copied them to the target bucket, and loaded them into BigQuery using background functions with Node.js.
Appreciate your help.
tar is a Linux tool for archiving a group of files together - e.g., see this manual page. You can unpack a compressed tar file using a command like:
tar xfz file.tar.gz
Mike is right with respect to tar archives. Regarding the second half of the question in the title: Cloud Storage does not natively support unpacking a tar archive. You'd have to do this yourself (on your local machine or from a Compute Engine VM, for instance).
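If you would rather do it in Python (for example on a Compute Engine VM, or in a Cloud Function with enough memory), a rough sketch could look like this; the bucket and object names are placeholders, and it assumes the google-cloud-storage client library:

import io
import tarfile

from google.cloud import storage

client = storage.Client()
src_bucket = client.bucket("my-archive-bucket")      # placeholder
dst_bucket = client.bucket("my-extracted-bucket")    # placeholder

# Download the archive into memory, then re-upload each member as its own object
data = src_bucket.blob("xxxx.tar.gz").download_as_bytes()
with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
    for member in archive.getmembers():
        if member.isfile():
            extracted = archive.extractfile(member).read()
            dst_bucket.blob(member.name).upload_from_string(extracted)

From there, the extracted objects can be loaded into BigQuery the same way you already do for uncompressed files.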

Accessing data in Google Cloud bucket for a python Tensorflow learning program

I’m working through the Google quick start examples for Cloud Learning / Tensorflow as shown here: https://cloud.google.com/ml/docs/quickstarts/training
I want my python program to access data that I have stored in a Google Cloud bucket such as gs://mybucket. How do I do this inside of my python program instead of calling it from the command line?
Specifically, the quickstart example for cloud learning utilizes data they provided but what if I want to provide my own data that I have stored in a bucket such as gs://mybucket?
I noticed a similar post here: How can I get the Cloud ML service account programmatically in Python? ... but I can’t seem to install the googleapiclient module.
Some posts seem to mention Apache Beam, though I can't tell if that's relevant to me, and in any case I can't figure out how to download or install it.
If I understand your question correctly, you want to programmatically talk to GCS in Python.
The official docs are a good place to start.
First, grab the module using pip:
pip install --upgrade google-cloud-storage
Then:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
# Then do other things...
blob = bucket.get_blob('remote/path/to/file.txt')
print(blob.download_as_string())
blob.upload_from_string('New contents!')
blob2 = bucket.blob('remote/path/storage.txt')
blob2.upload_from_filename(filename='/local/path.txt')
Assuming you are using Ubuntu/Linux as your OS and already have data in a GCS bucket, execute the following commands from a terminal (they can also be run in a Jupyter Notebook by prefixing them with !).
Installation:
First, install the storage module:
pip install google-cloud-storage
Second, verify that gsutil is available by typing:
gsutil
(the output will show the available options)
Copy data from the GCS bucket:
To check whether you are able to get information about the bucket:
gsutil acl get gs://BucketName
Now copy the file from the GCS bucket to your machine:
gsutil cp gs://BucketName/FileName /PathToDestinationDir/
In this way, you will be able to copy data from the bucket to your machine for further processing.
NOTE: all the above commands can also be run from a Jupyter Notebook by prefixing them with !, e.g.
!gsutil cp gs://BucketName/FileName /PathToDestinationDir/