Load dataset from Amazon S3 to Jupyter notebook on EC2 - amazon-web-services

I want to try image segmentation with deep learning using AWS. I have my data stored on Amazon S3 and I'd like to access it from a Jupyter Notebook which is running on an Amazon EC2 instance.
I'm planning on using TensorFlow for segmentation, so it seemed appropriate to use the options provided by TensorFlow itself (https://www.tensorflow.org/deploy/s3), since in the end I want my data represented as a tf.data.Dataset. However, it didn't quite work out for me. I've tried the following:
filenames = ["s3://path_to_first_image.png", "s3://path_to_second_image.png"]
dataset = tf.data.TFRecordDataset(filenames)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    for i in range(2):
        print(sess.run(next_element))
I get the following error:
OutOfRangeError: End of sequence
[[Node: IteratorGetNext_6 = IteratorGetNext[output_shapes=[[]], output_types=[DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator_6)]]
I'm quite new to TensorFlow and have only recently started trying things out with AWS, so I hope my mistake will be obvious to someone with more experience. I would greatly appreciate any help or advice! Maybe this is even the wrong approach and I'm better off with something like boto3 (I also stumbled upon it, but thought tf would be more appropriate in my case) or something else?
P.S. TensorFlow also recommends testing the setup with the following snippet:
from tensorflow.python.lib.io import file_io
print (file_io.stat('s3://path_to_image.png'))
For me this leads to an "Object doesn't exist" error, though the object certainly exists and is listed among the others if I use
for obj in s3.Bucket(name=MY_BUCKET_NAME).objects.all():
    print(os.path.join(obj.bucket_name, obj.key))
I also have my credentials filled in ~/.aws/credentials. What might be the problem here?

Not a direct answer to your question, but something I noticed about why you can't load the data using TensorFlow.
The files in your filenames are .png, not in the .tfrecord format (a binary storage format), so tf.data.TFRecordDataset(filenames) won't work.
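(As an aside, if you did want to go through TFRecords, you would first have to serialize the images into a .tfrecord file yourself. A minimal sketch, assuming TF2 and locally readable image paths; the output filename is just a placeholder:)
import tensorflow as tf

with tf.io.TFRecordWriter("images.tfrecord") as writer:
    for path in filenames:
        img_bytes = tf.io.read_file(path).numpy()  # raw PNG bytes
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_bytes]))
        }))
        writer.write(example.SerializeToString())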
I think the following will work. Note: this is for TF2; I'm not sure if it is the same for TF1. A similar example can be found on TensorFlow's website.
Step 1
Load your files into a TensorFlow dataset with tf.data.Dataset.list_files.
import tensorflow as tf
list_ds = tf.data.Dataset.list_files(filenames)
Step 2
Write a function to be applied to each element in the dataset and apply it with map; this runs the function on every element of the TF dataset.
def process_path(file_path):
    '''Reads the path and returns an image.'''
    # load the raw data from the file as a string
    byteString = tf.io.read_file(file_path)
    # convert the compressed string to a 3D uint8 tensor
    img = tf.image.decode_png(byteString, channels=3)
    return img

dataset = list_ds.map(process_path)
Step 3
Check out the image.
import matplotlib.pyplot as plt
for image in dataset.take(1):
    plt.imshow(image)
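Once this works, you would typically batch and prefetch the dataset before feeding it to a model. A minimal sketch (the batch size is arbitrary; older TF2 releases spell the constant tf.data.experimental.AUTOTUNE):
train_ds = dataset.batch(32).prefetch(tf.data.AUTOTUNE)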

You can directly access S3 data from the Ubuntu Deep Learning instance:
cd ~/.aws
aws configure
Then enter the AWS access key and secret key for the instance, just to make sure. Check the awscli version using the command:
aws --version
Read more on configuration
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
Then, in Jupyter, you can run:
import pandas as pd
from smart_open import smart_open
import os
aws_key = 'aws_key'
aws_secret = 'aws_secret'
bucket_name = 'my_bucket'
object_key = 'data.csv'
# smart_open expects the credentials before an '@', i.e. s3://key:secret@bucket/object
path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)
df = pd.read_csv(smart_open(path))
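Alternatively, if you would rather not embed credentials in the URL, a boto3-based sketch (using the same placeholder bucket and object names as above) can read the same object with the credentials from ~/.aws/credentials:
import boto3
import pandas as pd
from io import BytesIO

s3 = boto3.client('s3')  # picks up credentials from ~/.aws/credentials
obj = s3.get_object(Bucket='my_bucket', Key='data.csv')
df = pd.read_csv(BytesIO(obj['Body'].read()))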
Also, objects stored in buckets have a unique key value and are retrieved using an HTTP URL. For example, if an object with the key value /photos/mygarden.jpg is stored in the myawsbucket bucket, then it is addressable using the URL http://myawsbucket.s3.amazonaws.com/photos/mygarden.jpg.
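Such a URL can then be read directly, e.g. with pandas, provided the object is publicly readable; a minimal sketch (the bucket and CSV object names are illustrative):
import pandas as pd

# works only if the bucket/object is publicly readable
df = pd.read_csv('https://my_bucket.s3.amazonaws.com/data.csv')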
If your data is not sensitive, you can use the HTTP option by changing the bucket's settings to public. More details:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonS3.html
Hope this helps.

Related

Upload to BigQuery from Cloud Storage

I have ~50k compressed (gzip) JSON files daily that need to be uploaded to BQ with some transformation, no API calls. The size of the files may be up to 1 GB.
What is the most cost-efficient way to do it?
Will appreciate any help.
The most efficient way is to use Cloud Data Fusion.
I would suggest the approach below:
Create a Cloud Function triggered on every new file upload to uncompress the file.
Create a Data Fusion job with the GCS file as source and BigQuery as sink, with the desired transformation.
Refer to my YouTube video below:
https://youtu.be/89of33RcaRw
Here is (for example) one way - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json...
... but quickly looking it over, one can see that there are some specific limitations. So perhaps simplicity, customization and maintainability of the solution can also be added to your "cost" function.
Not knowing some details (for example, read the "Limitations" section under the link above; what stack you have or are willing/able to use; the file names; whether your files have nested fields; etc.), my first thought is the Cloud Functions service (https://cloud.google.com/functions/pricing) "listening" (event type = Finalize/Create) on the cloud storage bucket where your files land (if you go this route, put your storage and function in the same zone if possible, which will make it cheaper).
If you can code in Python, here is some starter code:
main.py
import pandas as pd
from pandas.io import gbq
from io import BytesIO, StringIO
import numpy as np
from google.cloud import storage, bigquery
import io

def process(event, context):
    file = event
    # check if it's your file; you could also check for patterns in the name
    if file['name'] == 'YOUR_FILENAME':
        try:
            working_file = file['name']
            storage_client = storage.Client()
            bucket = storage_client.get_bucket('your_bucket_here')
            blob = bucket.blob(working_file)
            # https://stackoverflow.com/questions/49541026/how-do-i-unzip-a-zip-file-in-google-cloud-storage
            zipbytes = io.BytesIO(blob.download_as_string())
            # print for logging
            print(f"file downloaded, {working_file}")
            # read file as df --- see https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
            # if nested, you might need to go text --> dictionary and then do some preprocessing
            df = pd.read_json(zipbytes, compression='gzip')
            # write processed data to BigQuery
            df.to_gbq(destination_table='your_dataset.your_table',
                      project_id='your_project_id',
                      if_exists='append')
            print(f"table bq created, {working_file}")
            # to delete the processed file from storage and save on storage costs, uncomment the 2 lines below
            # blob.delete()
            # print(f"blob deleted, {working_file}")
        except Exception as e:
            print(f"exception occurred {e}, {working_file}")
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-storage
google-cloud-bigquery
pandas
# (pandas.io ships with pandas, so no separate package is needed)
pandas-gbq
PS
Some alternatives include:
1. Starting up a VM, running your script on a schedule, and shutting the VM down once the process is done (for example cloud scheduler --> pub/sub --> cloud function --> which starts up your VM --> which then runs your script)
2. Using App Engine to run your script (similar)
3. Using Cloud Run to run your script (similar)
4. Using Composer/Airflow (not similar to 1, 2 & 3) [it could use all kinds of approaches including data transfers etc.; just not sure what stack you are trying to use or what you already have running]
5. Scheduling a Vertex AI workbook (not similar to 1, 2 & 3; basically write up a Jupyter notebook and schedule it to run in Vertex AI)
6. Trying to query the files directly (https://cloud.google.com/bigquery/external-data-cloud-storage#bq_1) and scheduling that query to run (https://cloud.google.com/bigquery/docs/scheduling-queries) (but again, not sure about your overall pipeline)
The setup for all of these (except #5 & #6) is, to me, just not worth the technical debt if you can get away with functions.
Best of luck,

Error Loading Fasttext model in GCP Dataflow from GCS Bucket

I am not able to load the fasttext model in Dataflow. I have the model stored in a bucket and the path is
gs://fasttext_models/model1.bin
Below is how I call it:
model_1= fasttext.load_model('gs://fasttext_models/model1.bin')
I get the below error:
ValueError: gs://fasttext_models/model1.bin cannot be opened for loading!
PS:
I used to get the same error when I was loading fasttext locally, but adding the absolute path fixed that issue. I don't understand how to fix this in GCP.
Likely fasttext.load_model(str) can only load files from the local filesystem. It doesn't look like it can take an arbitrary open file handle, so your best bet is to copy the data to the local filesystem and then load it from there, e.g.
import tempfile

from google.cloud import storage

with tempfile.NamedTemporaryFile() as tmp_file:
    local_model_file = tmp_file.name
    remote_model_file = storage.Client().bucket('fasttext_models').blob('model1.bin')
    remote_model_file.download_to_filename(local_model_file)
    model_1 = fasttext.load_model(local_model_file)
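In a Dataflow pipeline you would typically do this download once per worker rather than per element, for example in a DoFn's setup method. A rough sketch, using the bucket/blob names from the question; everything else (class name, the prediction call) is illustrative:
import tempfile

import apache_beam as beam
import fasttext
from google.cloud import storage

class PredictWithFasttext(beam.DoFn):
    def setup(self):
        # runs once per worker: download the model to a local temp file and load it
        tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.bin')
        storage.Client().bucket('fasttext_models').blob('model1.bin').download_to_filename(tmp_file.name)
        self.model = fasttext.load_model(tmp_file.name)

    def process(self, element):
        # hypothetical use of the loaded model
        yield self.model.predict(element)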

Sagemaker, get spark dataframe from data image url on S3

I am trying to obtain a Spark dataframe which contains the paths and images for all images in my data. The data is stored as follows:
folder/image_category/image_n.jpg
I worked in a local Jupyter notebook and had no problem using the following code:
dataframe = spark.read.format("image").load(path)
I need to do the same exercise using AWS SageMaker and S3. I created a bucket following the same pattern:
s3://my_bucket/folder/image_category/image_n.jpg
I've tried a lot of possible solutions I found online, based on boto3, s3fs and other stuff, but unfortunately I am still unable to make it work (and I am starting to lose faith...).
Would anyone have something reliable I could base my work on?
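Not an authoritative answer, but one common approach is to point the same image reader at an s3a:// path and pass the S3 credentials through the Hadoop configuration. A rough sketch, assuming the hadoop-aws / aws-java-sdk jars are on the Spark classpath; all names and keys below are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("images").getOrCreate()

# hand the S3 credentials to Hadoop's s3a filesystem
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

dataframe = spark.read.format("image").load("s3a://my_bucket/folder/")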

Dataflow: storage.Client() import error or how to list all blobs of a GCP bucket

I have an apache-beam==2.3.0 pipeline written using the Python SDK that works with the DirectRunner locally. When I change the runner to DataflowRunner, I get an error about 'storage' not being global.
Checking my code, I think it's because I am using the credentials stored in my environment. In my Python code I just do:
class ExtractBlobs(beam.DoFn):
    def process(self, element):
        from google.cloud import storage
        client = storage.Client()
        yield list(client.get_bucket(element).list_blobs(max_results=100))
The real issue is that I need the client so I can get the bucket, so I can then list the blobs. Everything I'm doing here is so I can list the blobs.
So if anyone can either point me in the right direction towards using 'storage.Client()' in Dataflow, or show how to list the blobs of a GCP bucket without needing the client.
Thanks in advance!
[+] What I've read: https://cloud.google.com/docs/authentication/production#auth-cloud-implicit-python
Fixed:
Okay, so upon further reading and investigating, it turns out I have the required libraries to run my pipeline locally, but Dataflow needs to know about them in order to download them into the resources it spins up. https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
So all I've done is create a requirements.txt file with my google-cloud-* requirements.
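For reference, a minimal requirements.txt along those lines might look like the following (the exact packages and versions depend on your pipeline; this is just what the code above needs):
google-cloud-storage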
I then spin up my pipeline like this:
python myPipeline.py --requirements_file requirements.txt --save_main_session True
That last flag tells it to keep the imports you do in main.

Copy an image in Amazon S3 from Image URL using Django

I have an image URL (for example: http://www.myexample.come/testImage.jpg) and I would like to upload this image to Amazon S3 using Django.
I haven't found a way to copy the resource to Amazon S3 directly by passing the file URL.
So, I think that I have to implement these steps in my project:
1. Download the file locally from the URL http://www.myexample.come/testImage.jpg; I will then have a local file testImage.jpg.
2. Upload the local file to Amazon S3; I will then have an S3 URL.
3. Delete the local file testImage.jpg.
Is this a good way to build this feature?
Is it possible to improve these steps?
I have to use this feature when I receive a REST request, and I have to respond with the uploaded S3 file URL in the response... Are these steps a good approach in terms of performance?
The easiest way off the top of my head would be to use requests with io from the Python standard library -- this is a bit of code I used a while back; I just tested it with Python 2.7.9 and it works:
>>> requests_image('http://docs.python-requests.org/en/latest/_static/requests-sidebar.png')
and it works with the latest version of requests (2.6.0) - but I should point out that it's just a snippet, and I was in full control of the image URLs being handed to the function, so there's nothing in the way of error checking (you could use Pillow to open the image and confirm it's really a JPEG, etc.)
import requests
from io import open as iopen
from urlparse import urlsplit

def requests_image(file_url):
    suffix_list = ['jpg', 'gif', 'png', 'tif', 'svg']
    file_name = urlsplit(file_url)[2].split('/')[-1]
    file_suffix = file_name.split('.')[1]
    i = requests.get(file_url)
    if file_suffix in suffix_list and i.status_code == requests.codes.ok:
        with iopen(file_name, 'wb') as file:
            file.write(i.content)
    else:
        return False
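To cover steps 2 and 3 of the question (uploading to S3 and removing the local copy), a rough boto3-based sketch could follow the download above; the bucket name and resulting URL format are placeholders, and in a real Django project you might prefer django-storages instead:
import os
import boto3

def upload_and_cleanup(file_name, bucket='my_bucket'):
    s3 = boto3.client('s3')
    # upload the locally downloaded file to S3 under the same key
    s3.upload_file(file_name, bucket, file_name)
    # remove the local copy once the upload has succeeded
    os.remove(file_name)
    # URL of the uploaded object (actual form depends on bucket settings and region)
    return 'https://{}.s3.amazonaws.com/{}'.format(bucket, file_name)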