Error Loading Fasttext model in GCP Dataflow from GCS Bucket - google-cloud-platform

I am not able to load the fasttext model in Dataflow. I have the model stored in a bucket and the path is
gs://fasttext_models/model1.bin
Below is the way I call:
model_1= fasttext.load_model('gs://fasttext_models/model1.bin')
I get the below error:
ValueError: gs://fasttext_models/model1.bin cannot be opened for loading!
PS:
I used to get the same error when loading fastText locally, but using an absolute path fixed it. I don't understand how to fix this in GCP.

Likely fasttext.load_model(str) can only load files from the local filesystem. It doesn't look like it can take an arbitrary open file handle, so your best bet is to copy the data to the local filesystem and then load it from there, e.g.
import tempfile
import fasttext
from google.cloud import storage

with tempfile.NamedTemporaryFile() as tmp_file:
    local_model_file = tmp_file.name
    # Copy the model from GCS to the local file, then load it before the file is deleted.
    blob = storage.Client().bucket('fasttext_models').blob('model1.bin')
    blob.download_to_filename(local_model_file)
    model_1 = fasttext.load_model(local_model_file)
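The same copy-then-load idea can be wrapped in a reusable helper. This is only a sketch: the names split_gcs_uri and load_fasttext_from_gcs are made up for illustration, and it assumes fastText reads the whole model into memory so the temporary copy can be removed afterwards.

```python
import os
import tempfile


def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/blob' into ('bucket', 'path/to/blob')."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, blob_path = uri[len(prefix):].partition("/")
    if not bucket or not blob_path:
        raise ValueError(f"URI must name a bucket and an object: {uri!r}")
    return bucket, blob_path


def load_fasttext_from_gcs(uri):
    """Download the model to a local temp directory and load it from there."""
    # Imported lazily so the URI helper works without these packages installed.
    import fasttext
    from google.cloud import storage

    bucket, blob_path = split_gcs_uri(uri)
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, os.path.basename(blob_path))
        storage.Client().bucket(bucket).blob(blob_path).download_to_filename(local_path)
        # Assumption: the model is fully read into memory by load_model.
        return fasttext.load_model(local_path)
```

In a Dataflow pipeline, a natural place to call load_fasttext_from_gcs('gs://fasttext_models/model1.bin') is probably the DoFn's setup method, so the model is loaded once per worker instead of once per element.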

Related

Upload to BigQuery from Cloud Storage

I have ~50k gzip-compressed JSON files per day that need to be loaded into BigQuery with some transformation and no API calls. Files can be up to 1 GB each.
What is the most cost-efficient way to do it?
I would appreciate any help.
The most efficient way is to use Cloud Data Fusion.
I would suggest the approach below:
Create a Cloud Function that is triggered on every new file upload and decompresses the file.
Create a Data Fusion job with the GCS file as the source and BigQuery as the sink, with the desired transformation.
See my YouTube video below:
https://youtu.be/89of33RcaRw
Here is one way, for example: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json...
... but a quick look shows that it has some specific limitations. So simplicity, customization, and maintainability of the solution could also be added to your "cost" function.
Without knowing some details (for example, see the "Limitations" section under the link above, what stack you have or are willing and able to use, the file names, whether your files have nested fields, etc.), my first thought is a Cloud Function ( https://cloud.google.com/functions/pricing ) that "listens" (event type = Finalize/Create) to the Cloud Storage bucket where your files land. If you go this route, put your storage and function in the same zone if possible, which will make it cheaper.
If you can code in Python, here is some starter code:
main.py
import io
import pandas as pd
from google.cloud import storage

def process(event, context):
    file = event
    # Check that it is your file; you can also check for patterns in the name.
    if file['name'] == 'YOUR_FILENAME':
        working_file = file['name']
        try:
            storage_client = storage.Client()
            bucket = storage_client.get_bucket('your_bucket_here')
            blob = bucket.blob(working_file)
            # https://stackoverflow.com/questions/49541026/how-do-i-unzip-a-zip-file-in-google-cloud-storage
            zipbytes = io.BytesIO(blob.download_as_bytes())
            # Print for logging.
            print(f"file downloaded, {working_file}")
            # Read the file as a DataFrame; docs: https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
            # If the JSON is nested, you might need to go text --> dictionary and do some preprocessing first.
            df = pd.read_json(zipbytes, compression='gzip')
            # Write the processed data to BigQuery (uses pandas-gbq).
            df.to_gbq(destination_table='your_dataset.your_table',
                      project_id='your_project_id',
                      if_exists='append')
            print(f"table bq created, {working_file}")
            # To delete the processed file and save on storage costs, uncomment the 2 lines below.
            # blob.delete()
            # print(f"blob delete, {working_file}")
        except Exception as e:
            print(f"exception occurred {e}, {working_file}")
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-storage
google-cloud-bigquery
pandas
pandas-gbq
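The starter code mentions that nested JSON may need a text --> dictionary preprocessing step. Here is a rough sketch of what that could look like using only the standard library. It assumes the files are newline-delimited JSON (one object per line), which may not match your data, and flatten and rows_from_gzip are hypothetical helper names.

```python
import gzip
import json


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts: {'a': {'b': 1}} becomes {'a_b': 1}."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def rows_from_gzip(raw_bytes):
    """Decompress gzip bytes and return one flat dict per non-empty JSON line."""
    text = gzip.decompress(raw_bytes).decode("utf-8")
    return [flatten(json.loads(line)) for line in text.splitlines() if line.strip()]
```

Inside the function above, something like pd.DataFrame(rows_from_gzip(blob.download_as_bytes())) could then replace the pd.read_json call. Lists inside records are left untouched by this sketch.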
PS
Some alternatives include:
1. Starting up a VM, running your script on a schedule, and shutting the VM down once the process is done (for example: Cloud Scheduler --> Pub/Sub --> Cloud Function --> which starts up your VM --> which then runs your script)
2. Using App Engine to run your script (similar)
3. Using Cloud Run to run your script (similar)
4. Using Composer/Airflow (not similar to 1, 2 & 3) [could use all types of approaches, including data transfers etc.; I'm just not sure what stack you are trying to use or what you already have running]
5. Scheduling a Vertex AI notebook (not similar to 1, 2 & 3; basically write up a Jupyter notebook and schedule it to run in Vertex AI)
6. Trying to query the files directly (https://cloud.google.com/bigquery/external-data-cloud-storage#bq_1) and scheduling that query (https://cloud.google.com/bigquery/docs/scheduling-queries) to run (but again, I'm not sure about your overall pipeline)
For all of these except #5 and #6, the setup and the technical debt are, to me, not worth it if you can get away with Cloud Functions.
Best of luck,

GCP AI Notebook can't access storage bucket

New to GCP. Trying to load a saved model file into an AI Platform notebook. Tried several approaches without success.
Most obvious approach seemed to be to set the value of a variable to the path copied from storage:
model_path = "gs://<my-bucket>/models/3B/export/1600635833/saved_model.pb"
Results: OSError: SavedModel file does not exist at: (the above path)
I know I can connect to the bucket and retrieve contents because I downloaded a csv file from the bucket and printed out the contents.
An OSError suggests to me that you are trying to access the GCS bucket through a regular file system API, which does not support GCS paths (for example, Python's open() function).
To access files in GCS, I recommend using the Client Libraries: https://cloud.google.com/storage/docs/reference/libraries
Another option for testing is to connect over SSH and use the gsutil command.
Note: I assume <my-bucket> was edited to replace your real GCS bucket name.
According to the GCP documentation, you are able to access Cloud Storage; the documentation includes a guide to using Cloud Storage with AI Platform Training.
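One more possibility, offered as an assumption and not something the posts above confirm: TensorFlow's SavedModel loaders expect the export directory, while the path in the question points at the saved_model.pb file inside it. A minimal sketch of passing the directory instead, where saved_model_dir and load_saved_model are hypothetical helper names:

```python
import posixpath


def saved_model_dir(path):
    """Return the export directory for a path that may point at saved_model.pb."""
    if posixpath.basename(path) in ("saved_model.pb", "saved_model.pbtxt"):
        return posixpath.dirname(path)
    return path.rstrip("/")


def load_saved_model(path):
    """Load a SavedModel from its export directory (local or gs://)."""
    # Imported lazily so the path helper works without TensorFlow installed.
    import tensorflow as tf

    return tf.saved_model.load(saved_model_dir(path))
```

With the path from the question, saved_model_dir would give "gs://<my-bucket>/models/3B/export/1600635833". Whether this resolves the error depends on how the model is actually being loaded, which the question does not show.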

Django File object and S3

So I have added s3 support to one of my Django projects. (storages and boto3)
I have a model that has a file field with zip-archive with images in it.
At some point I need to access this zip-archive and parse it to create instances of another model with those images from archive. It looks something like this:
I access archive data with zipfile
Get image from it
Put this image to django File object
Add this file object to model field
Save model
It works perfectly fine without S3; with it, however, I get an UnsupportedOperation: seek error.
My guess is that boto3/storages does not support uploading in-memory files to S3. Is that the case? If so, how can I fix or avoid this in this kind of situation?
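One approach that may avoid the seek error is to read each archive member fully into memory and wrap the bytes in Django's ContentFile, which is backed by a seekable in-memory stream. This is a sketch under assumptions: the field name image and the make_instance callable are hypothetical, and it has not been checked against a particular django-storages version.

```python
import zipfile


def iter_zip_members(archive):
    """Yield (name, bytes) for each file in a zip given as a path or file object."""
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                yield info.filename, zf.read(info)


def attach_images(archive, make_instance):
    """Save each archived image onto a new model instance.

    make_instance is a hypothetical callable returning an unsaved model
    instance that has a FileField/ImageField named `image`.
    """
    # Imported lazily so iter_zip_members works without Django installed.
    from django.core.files.base import ContentFile

    for name, data in iter_zip_members(archive):
        instance = make_instance()
        # ContentFile wraps the bytes in a seekable in-memory stream.
        instance.image.save(name, ContentFile(data), save=True)
```

This trades memory for simplicity, since every image is held in memory while it is uploaded.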

Load dataset from amazon S3 to jupyter notebook on EC2

I want to try image segmentation with deep learning using AWS. I have my data stored on Amazon S3 and I'd like to access it from a Jupyter Notebook which is running on an Amazon EC2 instance.
I'm planning on using Tensorflow for segmentation, therefore it seemed appropriate to me to use options provided by Tensorflow themselves (https://www.tensorflow.org/deploy/s3) as it feels that in the end I want my data to be represented in the format of tf.Dataset. However, it didn't quite work out for me. I've tried the following:
filenames = ["s3://path_to_first_image.png", "s3://path_to_second_image.png"]
dataset = tf.data.TFRecordDataset(filenames)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    for i in range(2):
        print(sess.run(next_element))
I get the following error:
OutOfRangeError: End of sequence
[[Node: IteratorGetNext_6 = IteratorGetNext[output_shapes=[[]], output_types=[DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator_6)]]
I'm quite new to tensorflow and have just recently started trying out some stuff with AWS, so I hope that my mistake is gonna be obvious to someone with more experience. I would greatly appreciate any help or advice! Maybe it's even the wrong way and I'm better off with something like boto3 (also stumbled upon it, but thought that tf would be more appropriate in my case) or something else?
P.S. Tensorflow also recommends to test a setup with the following piece:
from tensorflow.python.lib.io import file_io
print (file_io.stat('s3://path_to_image.png'))
For me this leads to an "Object doesn't exist" error, although the object certainly exists and is listed among the others if I use
for obj in s3.Bucket(name=MY_BUCKET_NAME).objects.all():
    print(os.path.join(obj.bucket_name, obj.key))
I also have my credentials filled in ~/.aws/credentials. What might be the problem here?
Not a direct answer to your question but still something I noticed as to why you can't load data using Tensorflow.
The files in your filenames list are .png images, not files in the .tfrecord format, which is a binary storage format. So tf.data.TFRecordDataset(filenames) won't work.
I think the following will work. Note: this is for TF2; I'm not sure whether it is the same for TF1. A similar example can be found on TensorFlow's website.
Step 1
Load your files into a TensorFlow dataset with tf.data.Dataset.list_files.
import tensorflow as tf
list_ds = tf.data.Dataset.list_files(filenames)
Step 2
Write a function and apply it to every element of the TF dataset with map.
def process_path(file_path):
    '''Reads the file at the given path and returns an image.'''
    # Load the raw data from the file as a string.
    byte_string = tf.io.read_file(file_path)
    # Convert the compressed string to a 3D uint8 tensor.
    img = tf.image.decode_png(byte_string, channels=3)
    return img

dataset = list_ds.map(process_path)
Step 3
Check out the image.
import matplotlib.pyplot as plt
for image in dataset.take(1): plt.imshow(image)
You can access S3 data directly from the Ubuntu Deep Learning instance by running:
cd ~/.aws
aws configure
Then update aws key and secret key for the instance, just to make sure. Check awscli version using the command:
aws --version
Read more on configuration
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
Then, in Jupyter, you can run:
import pandas as pd
from smart_open import smart_open
import os
aws_key = 'aws_key'
aws_secret = 'aws_secret'
bucket_name = 'my_bucket'
object_key = 'data.csv'
path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)
df = pd.read_csv(smart_open(path))
Also, each object stored in a bucket has a unique key and can be retrieved using an HTTP URL. For example, if an object with the key
/photos/mygarden.jpg
is stored in the
myawsbucket
bucket, then it is addressable using the URL
http://myawsbucket.s3.amazonaws.com/photos/mygarden.jpg.
If your data is not sensitive, you can use the http option. More details:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonS3.html
You can change the setting of the bucket to public. Hope this helps.
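The URL pattern described above can be sketched as a small helper. The name s3_object_url is made up for illustration, and the sketch assumes the object is publicly readable and addressed in the bucket.s3.amazonaws.com style shown in the example.

```python
from urllib.parse import quote


def s3_object_url(bucket, key, scheme="https"):
    """Build the virtual-hosted-style URL for a publicly readable S3 object."""
    # Keys are often written with a leading slash; the URL path supplies it.
    return f"{scheme}://{bucket}.s3.amazonaws.com/{quote(key.lstrip('/'))}"
```

For a public CSV, something like pd.read_csv(s3_object_url('my_bucket', 'data.csv')) should then work without any credentials.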

Saving a file in AWS filesystem

Hi I am trying out opencv in AWS lambda. I want to save a SVM model in txt file so that I can load it again. Is it possible to save it in tmp directory and load it from there whenever I need it or will I have to use s3?
I am using python and trying to do something like this:
# saving the model
svm.save("/tmp/svm.dat")
# Loading the model
svm = cv2.ml.SVM_load("/tmp/svm.dat")
This is not reliable: the Lambda execution environment is distributed, so the same function might run on several different instances, and a file written to /tmp by one instance is not visible to the others.
The alternative is to save your svm.dat to S3 and then download it every time you start your Lambda function.
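A common middle ground is to treat /tmp as a cache: download from S3 only when the file is missing, so invocations that reuse a warm execution environment skip the download. A minimal sketch, where ensure_local and load_svm are hypothetical names and the bucket and key are yours to fill in:

```python
import os


def ensure_local(path, download):
    """Return path, calling download(path) first only if the file is missing."""
    if not os.path.exists(path):
        download(path)
    return path


def load_svm(bucket, key, path="/tmp/svm.dat"):
    """Load the SVM, fetching it from S3 only when /tmp has no copy yet."""
    # Imported lazily so ensure_local can be used without these packages.
    import boto3
    import cv2

    def download(target):
        boto3.client("s3").download_file(bucket, key, target)

    return cv2.ml.SVM_load(ensure_local(path, download))
```

A cold start still pays for one download, and S3 stays the source of truth for the model.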