reading files in google cloud machine learning - google-cloud-ml

I tried to run tensorflow-wavenet on the google cloud ml-engine with gcloud ml-engine jobs submit training but the cloud job crashed when it was trying to read the json configuration file:
with open(args.wavenet_params, 'r') as f:
wavenet_params = json.load(f)
arg.wavenet_params is simply a file path to a json file which I uploaded to the google cloud storage bucket. The file path looks like this: gs://BUCKET_NAME/FILE_PATH.json.
I double-checked that the file path is correct and I'm sure that this part is responsible for the crash since I commented out everything else.
The crash log file doesn't give much information about what has happened:
Module raised an exception for failing to call a subprocess Command '['python', '-m', u'gcwavenet.train', u'--data_dir', u'gs://wavenet-test-data/VCTK-Corpus-Small/', u'--logdir_root', u'gs://wavenet-test-data//gcwavenet10/logs']' returned non-zero exit status 1.
I replaced wavenet_params = json.load(f) by f.close() and I still get the same result.
Everything works when I run it locally with gcloud ml-engine local train.
I think the problem is with reading files with gcloud ml-engine in general or that I can't access the google cloud bucket from within a python file with gs://BUCKET_NAME/FILE_PATH.

Python's open function cannot read files from GCS. You will need to use a library capable of doing so. TensorFlow includes one such library:
import tensorflow as tf
from tensorflow.python.lib.io import file_io
with file_io.FileIO(args.wavenet_params, 'r') as f:
wavenet_params = json.load(f)

Related

GCP Serverless pyspark : Illegal character in path at index

I'm trying to run a simple hello world python code on Serverless pyspark on GCP using gcloud (from local windows machine).
if __name__ == '__main__':
print("Hello")
This always results in the error
=========== Cloud Dataproc Agent Error ===========
java.lang.IllegalArgumentException: Illegal character in path at index 38: gs://my-bucket/dependencies\hello.py
at java.base/java.net.URI.create(URI.java:883)
at com.google.cloud.hadoop.services.agent.job.handler.AbstractJobHandler.registerResourceForDownload(AbstractJobHandler.java:592)
The gcloud command:
gcloud dataproc batches submit pyspark hello.py --batch=hello-batch-5 --deps-bucket=my-bucket --region=us-central1
On further analysis, I found that gcloud puts hello.py file in dependencies\hello.py under folder {deps-bucket} and Java considers backward slash '\' as illegal.
Has anyone encountered a similar situation?
As #Ronak mentioned, Can you double check the bucket name ? I have replicated your task, and simply copied your code to my Google Cloud shell. and it ran just fine. for your next run can you delete the dependencies folder and run the batch job again ?
See my replication here:
Dependencies path created after running the job:

Why can't my GCP script/notebook find my file?

I have a working script that finds the data file when it is in the same directory as the script. This works both on my local machine and Google Colab.
When I try it on GCP though it can not find the file. I tried 3 approaches:
PySpark Notebook:
Upload the .ipynb file which includes a wget command. This downloads the file without error but I am unsure where it saves it to and the script can not find the file either (I assume because I am telling it that the file is in the same directory and pressumably using wget on GCP saves it somewhere else by default.)
PySpark with bucket:
I did the same as the PySpark notebook above but first I uploaded the dataset to the bucket and then used the two links provided in the file details when you click the file name inside the bucket on the console (neither worked). I would like to avoid this though as wget is much faster then downloading on my slow wifi then reuploading to the bucket through the console.
GCP SSH:
Create cluster
Access VM through SSH.
Upload .py file using the cog icon
wget the dataset and move both into the same folder
Run script using python gcp.py
Just gives me an error saying file not found.
Thanks.
As per your first and third approach, if you are running a PySpark code on Dataproc, irrespective of whether you use .ipynb file or .py file, please note the below points:
If you use the ‘wget’ command to download the file, then it will be downloaded in the current working directory where your code is executed.
When you try to access the file through the PySpark code, it will check defaultly in HDFS. If you want to access the downloaded file from the current working directory, use the “ file:///” URI with absolute file path.
If you want to access the file from HDFS, then you have to move the downloaded file to HDFS and then access from there using an absolute HDFS file path. Please refer the below example:
hadoop fs -put <local file_name> </HDFS/path/to/directory>

How can I run a Dataflow pipeline with a setup file using Cloud Composer/Apache Airflow?

I have a working Dataflow pipeline the first runs setup.py to install some local helper modules. I now want to use Cloud Composer/Apache Airflow to schedule the pipeline. I've created my DAG file and placed it in the designated Google Storage DAG folder along with my pipeline project. The folder structure looks like this:
{Composer-Bucket}/
dags/
--DAG.py
Pipeline-Project/
--Pipeline.py
--setup.py
Module1/
--__init__.py
Module2/
--__init__.py
Module3/
--__init__.py
The part of my DAG that specifies the setup.py file looks like this:
resumeparserop = dataflow_operator.DataFlowPythonOperator(
task_id="resumeparsertask",
py_file="gs://{COMPOSER-BUCKET}/dags/Pipeline-Project/Pipeline.py",
dataflow_default_options={
"project": {PROJECT-NAME},
"setup_file": "gs://{COMPOSER-BUCKET}/dags/Pipeline-Project/setup.py"})
However, when I look at the logs in the Airflow Web UI, I get the error:
RuntimeError: The file gs://{COMPOSER-BUCKET}/dags/Pipeline-Project/setup.py cannot be found. It was specified in the --setup_file command line option.
I am not sure why it is unable to find the setup file. How can I run my Dataflow pipeline with the setup file/modules?
If you look at the code for DataflowPythonOperator it looks like the main py_file can be a file inside of a GCS bucket and is localized by the operator prior to executing the pipeline. However, I do not see anything like that for the dataflow_default_options. It appears that the options are simply copied and formatted.
Since the GCS dag folder is mounted on the Airflow instances using Cloud Storage Fuse you should be able to access the file locally using the "dags_folder" env var.
i.e. you could do something like this:
from airflow import configuration
....
LOCAL_SETUP_FILE = os.path.join(
configuration.get('core', 'dags_folder'), 'Pipeline-Project', 'setup.py')
You can then use the LOCAL_SETUP_FILE variable for the setup_file property in the dataflow_default_options.
Do you run Composer and Dataflow with the same service account, or are they separate? In the latter case, have you checked whether Dataflow's service account has read access to the bucket and object?

gcloud job can't access my files, either they are in GCS or in my cloud shell

I'm trying to run my code of machine learning from images using tensorflow in Google CloudML. However, it seems the submitted job can't access to my files in my cloud shell or in GCS. Even though it is working fine in my local machine, I get the following error once I submit my job using the command gcloud from the cloud shell:
ERROR 2017-12-19 13:52:28 +0100 service IOError: [Errno 2] No such file or directory: '/home/user/pores-project-googleML/trainer/train.txt'
This folder can be found for sure in cloud shell, and I can check it when I type:
ls /home/user/pores-project-googleML/trainer/train.txt
I tried putting my file train.txt in GCS and access to it from my code (by specifying the path gs://my_bucket/my_path), but once the job submitted, I got a 'No such file or directory' error with the corresponding path.
To check where the job I submitted using gcloud is running, I added print(os.getcwd()) in the beginning of my python code trainer/task.py, which printed as a result in the logs: /user_dir. I couldn't find this path using the cloud shell, not even in GCS. So my question is how can I know in which machine my job is running? If it's in a certain container somewhere, how can I access from it to my files using the cloud shell and in GCS?
Before I do all of this, I succesfully completed the 'Image Classification using Flowers Dataset' tutorial.
The command I used to submit my job is:
gcloud ml-engine jobs submit training $JOB_NAME --job-dir $JOB_DIR --packages trainer-0.1.tar.gz --module-name $MAIN_TRAINER_MODULE --region us-central1
where:
TRAINER_PACKAGE_PATH=/home/use/pores-project-googleML/trainer
MAIN_TRAINER_MODULE="trainer.task"
JOB_DIR="gs://pores/AlexNet_CloudML/job_dir/"
JOB_NAME="census$(date +"%Y%m%d_%H%M%S")"
Regular Python IO library is not able to access files on GCS. Instead, you need to use GCS python client or gstuil cli to access GCS files.
Note that TensorFlow itself has native support of GCS (i.e., it can read GCS files directly).

Accessing data in Google Cloud bucket for a python Tensorflow learning program

I’m working through the Google quick start examples for Cloud Learning / Tensorflow as shown here: https://cloud.google.com/ml/docs/quickstarts/training
I want my python program to access data that I have stored in a Google Cloud bucket such as gs://mybucket. How do I do this inside of my python program instead of calling it from the command line?
Specifically, the quickstart example for cloud learning utilizes data they provided but what if I want to provide my own data that I have stored in a bucket such as gs://mybucket?
I noticed a similar post here: How can I get the Cloud ML service account programmatically in Python? ... but I can’t seem to install the googleapiclient module.
Some posts seem to mention Apache Beam though I can’t tell if that’s relevant to me, but besides I can’t figure out how to download or install that whatever it is.
If I understand your question correctly, you want to programmatically talk to GCS in Python.
The official docs are a good place to start.
First, grab the module using pip:
pip install --upgrade google-cloud-storage
Then:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
# Then do other things...
blob = bucket.get_blob('remote/path/to/file.txt')
print(blob.download_as_string())
blob.upload_from_string('New contents!')
blob2 = bucket.blob('remote/path/storage.txt')
blob2.upload_from_filename(filename='/local/path.txt')
Assuming you are using Ubuntu/Linux as an OS and already having data in GCS bucket
Execute following commands from a terminal or can be executed on Jupyter Notebook(just use ! before commands):
--------------------- Installation -----------------
1st install storage module:
on Terminal type:
pip install google-cloud-storage
2nd to verify storage installed or not type the command:
gsutil
(o/p will show available options)
---------------------- Copy data from GCS bucket --------
type this command: to check whether you are able to get information about bucket
gsutil acl get gs://BucketName
Now copy the file from GCS Bucket to your machine:
gsutil cp gs://BucketName/FileName /PathToDestinationDir/
In this way, you will be able to copy data from this bucket to your machine for further processing purpose.
NOTE: all the above commands can be run from a Jupyter Notebook just use ! before commands, it will run e.g.
!gsutil cp gs://BucketName/FileName /PathToDestinationDir/