Cloud AI Platform Training Fails to Read from Bucket - google-cloud-platform

I'm trying to use Cloud AI Platform for training (gcloud ai-platform jobs submit training).
I created my bucket and am sure the training file is there (gsutil ls gs://sat3_0_bucket/data/train_input.csv).
However, my job is failing with this log message:
File "/root/.local/lib/python3.7/site-packages/ktrain/text/data.py", line 175, in texts_from_csv
with open(train_filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'gs://sat3_0_bucket/data/train_input.csv'
Am I missing something?

The error is probably happening because ktrain tries to auto-detect the character encoding using open(train_filepath, 'rb') which may be problematic with Google Cloud Storage. One solution is to explicitly provide the encoding to texts_from_csv as an argument so this step is skipped (default is None, which means auto-detect).
Alternatively, you can read the data in yourself as a pandas DataFrame using one of these methods. For instance, pandas evidently supports GCS, so you can simply do this: df = pd.read_csv('gs://bucket/your_path.csv')
Then, using ktrain, you can use ktrain.text.texts_from_df (or ktrain.text.texts_from_array) to load and preprocess your data.
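For example, a minimal sketch of the DataFrame route (the column names and preprocessing mode below are placeholders, and it assumes the gcsfs package is installed so pandas can read gs:// URLs):
import pandas as pd
from ktrain import text

# Read the CSV straight from Cloud Storage into a DataFrame (requires gcsfs).
df = pd.read_csv('gs://sat3_0_bucket/data/train_input.csv')

# Let ktrain preprocess the DataFrame instead of opening the gs:// path itself.
# 'text' and 'label' are placeholder column names.
(trn, val, preproc) = text.texts_from_df(df,
                                         text_column='text',
                                         label_columns=['label'],
                                         maxlen=400,
                                         preprocess_mode='bert')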

Related

Cannot read data with Cloud Storage FUSE

In a Vertex AI workbench notebook, I'm trying to read data from Cloud Storage with Cloud Storage FUSE.
The file path to the dataset inside Cloud Storage is:
gs://my_bucket_name/cola_public/raw/in_domain_train.tsv so I can read it into pandas dataframe as follows:
import pandas as pd
# Load the dataset into a pandas dataframe.
df = pd.read_csv("gs://my_bucket_name/cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))
# Display 10 random rows from the data.
df.sample(10)
The previous code works seamlessly. However, I want to update my code to read data with Cloud Storage FUSE (for Vertex AI Training later). Based on Read and write Cloud Storage files with Cloud Storage FUSE and this Codelab, I should be able to load my data using the following code:
df = pd.read_csv("/gcs/my_bucket_name/cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
Unfortunately, it did not work for me. The error message is:
FileNotFoundError: [Errno 2] No such file or directory: '/gcs/my_bucket_name/cola_public/raw/in_domain_train.tsv'
How could I solve this problem?
Thank you in advance!
Thanks to Ayush Sethi for the answer:
"Did you try performing step 5 of the mentioned codelab ? The GCS buckets are mounted on performing step 5. So, the training application code that is containerised in step 4, should be able to access the data present in GCS buckets when run as training job on VertexAI which is described in step 5."

Upload to BigQuery from Cloud Storage

I have ~50k compressed (gzip) JSON files daily that need to be uploaded to BQ with some transformation, no API calls. The files may be up to 1 GB each.
What is the most cost-efficient way to do it?
Will appreciate any help.
The most efficient way is to use Cloud Data Fusion.
I would suggest the approach below:
Create a Cloud Function triggered on every new file upload to uncompress the file (a sketch of this step follows below).
Create a Data Fusion pipeline with the GCS file as source and BigQuery as sink, with the desired transformations.
See my YouTube video below for reference:
https://youtu.be/89of33RcaRw
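A minimal sketch of the uncompress step, assuming a Cloud Function with a Cloud Storage finalize trigger (the staging bucket name is a placeholder):
import gzip

from google.cloud import storage

def decompress(event, context):
    # Triggered by a new object in the landing bucket; writes the uncompressed
    # JSON to a staging bucket for the Data Fusion pipeline to pick up.
    client = storage.Client()
    blob = client.bucket(event['bucket']).blob(event['name'])

    # Download and decompress in memory (very large files may need more function memory).
    raw = gzip.decompress(blob.download_as_bytes())

    # Write the result to the placeholder staging bucket, dropping the .gz suffix.
    dst_name = event['name'].removesuffix('.gz')
    dst_bucket = client.bucket('your-uncompressed-bucket')
    dst_bucket.blob(dst_name).upload_from_string(raw, content_type='application/json')
    print(f"decompressed {event['name']} -> {dst_name}")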
Here is (for example) one way - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json...
... but quickly looking over it, one can see that there are some specific limitations. So perhaps simplicity, customization, and maintainability of the solution should also be factored into your “cost” function.
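For reference, a minimal sketch of that native-load path (the URI and table ID are placeholders; it assumes the files are newline-delimited JSON, which BigQuery can load from GCS with gzip compression but without custom transformation):
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or supply an explicit schema
)

# BigQuery reads gzipped NDJSON directly from Cloud Storage.
load_job = client.load_table_from_uri(
    'gs://your_bucket/path/*.json.gz',
    'your_project.your_dataset.your_table',
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish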
Not knowing some details (for example, the "Limitations" section under the link above, what stack you have or are willing/able to use, the file names, or whether your files have nested fields), my first thought is the Cloud Functions service (https://cloud.google.com/functions/pricing) "listening" (event type = Finalize/Create) to the Cloud Storage bucket where your files land. If you go this route, put your storage and function in the same region if possible, which will make it cheaper.
If you can code in Python, here is some starter code:
main.py
import io

import pandas as pd
from google.cloud import storage

def process(event, context):
    file = event
    # check if it's your file; you can also check for patterns in the name
    if file['name'] == 'YOUR_FILENAME':
        try:
            working_file = file['name']
            storage_client = storage.Client()
            bucket = storage_client.get_bucket('your_bucket_here')
            blob = bucket.blob(working_file)
            # https://stackoverflow.com/questions/49541026/how-do-i-unzip-a-zip-file-in-google-cloud-storage
            zipbytes = io.BytesIO(blob.download_as_string())
            # print for logging
            print(f"file downloaded, {working_file}")
            # read file as df --- see https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
            # if nested, you might need to go text --> dictionary and do some preprocessing first
            df = pd.read_json(zipbytes, compression='gzip')
            # write processed data to BigQuery
            df.to_gbq(destination_table='your_dataset.your_table',
                      project_id='your_project_id',
                      if_exists='append')
            print(f"table bq created, {working_file}")
            # to delete the processed file from storage and save on storage costs, uncomment the 2 lines below
            # blob.delete()
            # print(f"blob deleted, {working_file}")
        except Exception as e:
            print(f"exception occurred {e}, {working_file}")
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-storage
google-cloud-bigquery
pandas
pandas-gbq
PS
Some alternatives include:
1. Starting up a VM, running your script on a schedule, and shutting the VM down once the process is done (for example: Cloud Scheduler --> Pub/Sub --> Cloud Function --> which starts up your VM --> which then runs your script)
2. Using App Engine to run your script (similar)
3. Using Cloud Run to run your script (similar)
4. Using Composer/Airflow (not similar to 1, 2 & 3) [could use all types of approaches, including data transfers etc.; just not sure what stack you are trying to use or what you already have running]
5. Scheduling a Vertex AI workbench notebook (not similar to 1, 2 & 3; basically write up a Jupyter notebook and schedule it to run in Vertex AI)
6. Querying the files directly (https://cloud.google.com/bigquery/external-data-cloud-storage#bq_1) and scheduling that query (https://cloud.google.com/bigquery/docs/scheduling-queries) to run (but again, not sure about your overall pipeline)
The setup for all of these (except #5 & #6) is, to me, not worth the technical debt if you can get away with functions.
Best of luck,

Solve Incorrect encoding to upload CSV to Google Cloud Platform AI Dataset

I just started with GCP AI and tried to import a dataset.
Following the official documentation Preparing your training data
The CSV file can have any filename, must be UTF-8 encoded, and must end with a .csv extension.
I followed the instructions above and verified that the file is correct.
However, after importing the dataset into GCP, the documents lost their encoding.
What should I do to upload the data with the correct text encoding?
I already tried to upload an iso-8859-1 encoded file but I got the following error:
Due to an error, AI Platform was unable to import data into dataset "...".
Additional Details:
Operation State: Failed with errors
Resource Name:
projects/ ...
Error Messages: INTERNAL
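Since the documentation quoted above requires UTF-8, one option is to re-encode the file before importing it; a minimal sketch with placeholder filenames:
import pandas as pd

# Read the iso-8859-1 encoded CSV and write it back out as UTF-8.
df = pd.read_csv('training_data_latin1.csv', encoding='iso-8859-1')
df.to_csv('training_data_utf8.csv', index=False, encoding='utf-8')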

GCP AI Notebook can't access storage bucket

New to GCP. Trying to load a saved model file into an AI Platform notebook. Tried several approaches without success.
Most obvious approach seemed to be to set the value of a variable to the path copied from storage:
model_path = "gs://<my-bucket>/models/3B/export/1600635833/saved_model.pb"
Results: OSError: SavedModel file does not exist at: (the above path)
I know I can connect to the bucket and retrieve contents because I downloaded a csv file from the bucket and printed out the contents.
The OSError sounds to me like you are trying to access the GCS bucket with a regular file system call, which does not support gs:// paths (for example, Python's open() function).
To access files in GCS, I recommend you use the Client Libraries: https://cloud.google.com/storage/docs/reference/libraries
Another option for testing is to connect via SSH and use the gsutil command.
Note: I assume <my-bucket> was edited to replace your real GCS bucket name.
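For illustration, a minimal sketch of the Client Library route (the bucket and object names are the placeholders from the question):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('<my-bucket>')
blob = bucket.blob('models/3B/export/1600635833/saved_model.pb')

# Download the object to the local file system and work with the local copy.
# (For a full SavedModel you would download the rest of the export directory too.)
blob.download_to_filename('saved_model.pb')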
According to the GCP documentation, you are able to access Cloud Storage; that page will guide you through using Cloud Storage with AI Platform Training.

Using AWS S3 and Apache Spark with hdf5/netcdf-4 data

I've got a bunch of atmospheric data stored in AWS S3 that I want to analyze with Apache Spark, but am having a lot of trouble getting it loaded and into an RDD. I've been able to find examples online to help with discrete aspects of the problem:
- using h5py to read locally stored scientific data files via h5py.File(filename) (https://hdfgroup.org/wp/2015/03/from-hdf5-datasets-to-apache-spark-rdds/)
- using boto/boto3 to get data in text-file format from S3 into Spark via get_contents_as_string()
- mapping a set of text files to an RDD via keys.flatMap(mapFunc)
But I can't seem to get these parts to work together. Specifically, how do you load a netCDF file from S3 (using boto or directly; I'm not attached to boto) so that you can then use h5py? Or can you treat the netCDF file as a binary file, load it in, and map it to an RDD using something like sc.binaryFiles?
Here are a couple of related questions that weren't answered fully:
How to read binary file on S3 using boto?
using pyspark, read/write 2D images on hadoop file system
Using the netCDF4 and s3fs modules, you can do:
from netCDF4 import Dataset
import s3fs

s3 = s3fs.S3FileSystem()

filename = 's3://bucket/a_file.nc'
with s3.open(filename, 'rb') as f:
    nc_bytes = f.read()

root = Dataset('inmemory.nc', memory=nc_bytes)
Make sure you are set up to read from S3. For details, see the documentation.