SageMaker: get a Spark DataFrame from image data URLs on S3 - amazon-web-services

I am trying to obtain a Spark DataFrame that contains the paths and images for all images in my data. The data is stored as follows:
folder/image_category/image_n.jpg
I worked in a local Jupyter notebook and had no problem using the following code:
dataframe = spark.read.format("image").load(path)
I need to do the same exercise using AWS SageMaker and S3. I created a bucket following the same pattern:
s3://my_bucket/folder/image_category/image_n.jpg
I've tried a lot of possible solutions I found online, based on boto3, s3fs and other tools, but unfortunately I am still unable to make it work (and I am starting to lose faith ...).
Would anyone have something reliable I could base my work on?
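For reference, here is a minimal sketch of what the S3 version could look like, assuming the SageMaker notebook's Spark session can load the hadoop-aws (s3a) connector and the notebook's execution role is allowed to read the bucket; the package version and paths below are placeholders to adjust:

from pyspark.sql import SparkSession

# Assumption: hadoop-aws is not already on the classpath; pin a version that
# matches your Spark/Hadoop build.
spark = (
    SparkSession.builder
    .appName("s3-image-load")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .getOrCreate()
)

# Use the s3a:// scheme so the Hadoop S3A connector resolves the paths;
# credentials are picked up from the instance/execution role by default.
dataframe = spark.read.format("image").load("s3a://my_bucket/folder/")
dataframe.select("image.origin", "image.height", "image.width").show(truncate=False)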


Unable to PUT big file (2gb) to aws s3 bucket (nodejs) | RangeError: data is too long

I scoured the internet and everybody gives different advice, but none of it helped me.
I'm currently trying to simply send the file.buffer that arrives at my endpoint directly to the AWS bucket.
I'm using PutObjectCommand and have entered all the details correctly, but there's apparently a problem with using a simple await s3.send(command), because my 2.2 GB video is way too big.
I get this error when attempting to upload the file to the cloud:
RangeError: data is too long
    at Hash.update (node:internal/crypto/hash:113:22)
    at Hash.update (C:\Users\misop\Desktop\sebi\sebi-auth\node_modules\@aws-sdk\hash-node\dist-cjs\index.js:12:19)
    at getPayloadHash (C:\Users\misop\Desktop\sebi\sebi-auth\node_modules\@aws-sdk\signature-v4\dist-cjs\getPayloadHash.js:18:18)
    at SignatureV4.signRequest (C:\Users\misop\Desktop\sebi\sebi-auth\node_modules\@aws-sdk\signature-v4\dist-cjs\SignatureV4.js:96:71)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  code: 'ERR_OUT_OF_RANGE',
  '$metadata': { attempts: 1, totalRetryDelay: 0 }
}
I browsed quite a lot; lots of people say I should be using a presigned URL. I did try that: if I do await getSignedUrl(s3, putCommand, { expires: 3600 }); I do get a generated URL, but no PUT is sent to the cloud. When I read a little more into it, getSignedUrl is only for generating the signed URL, so there's no way for me to run the Put command there, and I'm not sure how to approach this situation.
I'm currently working with:
"@aws-sdk/client-s3": "^3.238.0",
"@aws-sdk/s3-request-presigner": "^3.238.0",
Honestly, I've been testing lots of different approaches I saw online, but I wasn't successful even when following Amazon's official documentation where these things are mentioned, and I really don't want to implement multipart upload for videos smaller than 4-5 GB.
I'd be honored to hear any advice on this topic, thank you.
What I want: advice on how to implement a simple video upload to AWS S3, after my many failed attempts at doing so; there's lots of information out there and the vast majority of it doesn't work.
The solution to my problem was essentially using multer's S3 "addon", which exposes an s3 property and comes with a ready-made solution.
The "multer-s3": "^3.0.1" version worked even with files of 5 GB and such. Solutions such as using the PutObject command with the presigned URL method, or the presigned-post methods, were unable to work with the file.buffer that the Node server receives after the file is submitted.
If you experienced the same problem and want a quick and easy solution, use the multer-s3 npm package.

extract all aws transcribe results using boto3

I have a couple hundred transcription results in AWS Transcribe and I would like to get all the transcribed text and store it in one file.
Is there any way to do this without clicking on each transcription result and copying and pasting the text?
You can do this via the AWS APIs.
For example, if you were using Python, you can use the Python boto3 SDK:
list_transcription_jobs() will return a list of Transcription Job Names
For each job, you could then call get_transcription_job(), which will provide the TranscriptFileUri that is the location where the transcription is stored.
You can then use get_object() to download the file from Amazon S3.
Your program would then need to combine the content from each file into one file.
See how you go with that. If you run into any specific difficulties, post a new Question with the code and an explanation of the problem.
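As a rough sketch of those steps (assuming the transcription jobs wrote their output to a bucket you own, so that TranscriptFileUri can be parsed into a bucket and key, and that your credentials allow the Transcribe list/get calls and s3:GetObject):

import json
from urllib.parse import urlparse

import boto3

transcribe = boto3.client("transcribe")
s3 = boto3.client("s3")

all_text = []
paginator = transcribe.get_paginator("list_transcription_jobs")
for page in paginator.paginate(Status="COMPLETED"):
    for summary in page["TranscriptionJobSummaries"]:
        job = transcribe.get_transcription_job(
            TranscriptionJobName=summary["TranscriptionJobName"])
        uri = job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
        # e.g. https://s3.<region>.amazonaws.com/<bucket>/<key>
        bucket, key = urlparse(uri).path.lstrip("/").split("/", 1)
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        transcript = json.loads(body)
        all_text.append(transcript["results"]["transcripts"][0]["transcript"])

# Combine the content from each file into one file.
with open("all_transcripts.txt", "w") as f:
    f.write("\n\n".join(all_text))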
I put an example on GitHub that shows how to:
run an AWS Transcribe job,
use the Requests package to get the output,
write output to the console.
You ought to be able to refit it pretty easily for your purposes. Here's some of the code, but it'll make more sense if you check out the full example:
job_name_simple = f'Jabber-{time.time_ns()}'
print(f"Starting transcription job {job_name_simple}.")
start_job(
    job_name_simple, f's3://{bucket_name}/{media_object_key}', 'mp3', 'en-US',
    transcribe_client)
transcribe_waiter = TranscribeCompleteWaiter(transcribe_client)
transcribe_waiter.wait(job_name_simple)
job_simple = get_job(job_name_simple, transcribe_client)
transcript_simple = requests.get(
    job_simple['Transcript']['TranscriptFileUri']).json()
print(f"Transcript for job {transcript_simple['jobName']}:")
print(transcript_simple['results']['transcripts'][0]['transcript'])

Load dataset from amazon S3 to jupyter notebook on EC2

I want to try image segmentation with deep learning using AWS. I have my data stored on Amazon S3 and I'd like to access it from a Jupyter Notebook which is running on an Amazon EC2 instance.
I'm planning on using TensorFlow for segmentation, so it seemed appropriate to use the options provided by TensorFlow itself (https://www.tensorflow.org/deploy/s3), since in the end I want my data represented as a tf.data.Dataset. However, it didn't quite work out for me. I've tried the following:
filenames = ["s3://path_to_first_image.png", "s3://path_to_second_image.png"]
dataset = tf.data.TFRecordDataset(filenames)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    for i in range(2):
        print(sess.run(next_element))
I get the following error:
OutOfRangeError: End of sequence
[[Node: IteratorGetNext_6 = IteratorGetNext[output_shapes=[[]], output_types=[DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator_6)]]
I'm quite new to TensorFlow and have only recently started trying things out with AWS, so I hope my mistake will be obvious to someone with more experience. I would greatly appreciate any help or advice! Maybe this is even the wrong approach and I'm better off with something like boto3 (I also stumbled upon it, but thought TF would be more appropriate in my case) or something else?
P.S. TensorFlow also recommends testing the setup with the following snippet:
from tensorflow.python.lib.io import file_io
print (file_io.stat('s3://path_to_image.png'))
For me this leads to an "Object doesn't exist" error, even though the object certainly exists and is listed among the others if I use:
for obj in s3.Bucket(name=MY_BUCKET_NAME).objects.all():
    print(os.path.join(obj.bucket_name, obj.key))
I also have my credentials filled in ~/.aws/credentials. What might be the problem here?
Not a direct answer to your question, but something I noticed about why you can't load the data using TensorFlow.
The files in your filenames list are .png and not in the .tfrecord file format, which is a binary storage format. So tf.data.TFRecordDataset(filenames) shouldn't work.
I think the following will work. Note: this is for TF2; I'm not sure it is the same for TF1. A similar example can be found on TensorFlow's website.
Step 1
Load your files into a TensorFlow dataset with tf.data.Dataset.list_files.
import tensorflow as tf
list_ds = tf.data.Dataset.list_files(filenames)
Step 2
Make a function that will be applied to each element of the dataset via map, which calls the function on every element in the TF dataset.
def process_path(file_path):
    '''Reads the file path and returns an image.'''
    # load the raw data from the file as a string
    byteString = tf.io.read_file(file_path)
    # convert the compressed string to a 3D uint8 tensor
    img = tf.image.decode_png(byteString, channels=3)
    return img

dataset = list_ds.map(process_path)
Step 3
Check out the image.
import matplotlib.pyplot as plt
for image in dataset.take(1): plt.imshow(image)
You can directly access S3 data from the Ubuntu Deep Learning instance by running:
cd ~/.aws
aws configure
Then update the AWS access key and secret key for the instance, just to make sure. Check the awscli version using the command:
aws --version
Read more on configuration
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
Then, in Jupyter, you can run:
import pandas as pd
from smart_open import smart_open
import os
aws_key = 'aws_key'
aws_secret = 'aws_secret'
bucket_name = 'my_bucket'
object_key = 'data.csv'
path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)
df = pd.read_csv(smart_open(path))
Also, objects stored in buckets have a unique key value and are retrieved using an HTTP URL. For example, if an object with the key value /photos/mygarden.jpg is stored in the myawsbucket bucket, then it is addressable using the URL http://myawsbucket.s3.amazonaws.com/photos/mygarden.jpg.
If your data is not sensitive, you can use the http option. More details:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonS3.html
You can change the setting of the bucket to public. Hope this helps.
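If the object really is public, here is a minimal sketch of fetching it over that HTTP URL (this assumes the bucket policy or object ACL allows anonymous reads):

import requests

url = "http://myawsbucket.s3.amazonaws.com/photos/mygarden.jpg"
resp = requests.get(url)
resp.raise_for_status()  # fail loudly if the object is not actually public
with open("mygarden.jpg", "wb") as f:
    f.write(resp.content)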

.csv upload not working in Amazon Web Services Machine Learning - AWS

I have uploaded a simple 10-row CSV file (on S3) to the AWS ML website. It keeps giving me the error:
"We cannot find any valid records for this datasource."
There are records there, and the Y variable is continuous (not binary). I am pretty much stuck at this point because there is only one button to move forward and build the machine learning model. Does anyone know what I should do to fix it? Thanks!
The only way I have been able to upload .csv files I created on my own to S3 is by downloading an existing .csv file from my S3 bucket, modifying the data, uploading it, and then changing the name in the S3 console.
Could you post the first few lines of the .csv file? I am able to upload my own .csv file along with a schema that I have created, and it works. However, I did have issues in that Amazon ML was unable to create the schema for me.
Also, did you try saving the data with something like Sublime, Notepad++, etc. to get a different format? On my Mac with Microsoft Excel the CSV did not work, but when I tried LibreOffice on Windows, the same file worked perfectly.
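For what it's worth, a minimal sketch of uploading a re-saved CSV to S3 with boto3 (the bucket and key names are placeholders), which avoids the download-modify-rename dance in the console:

import boto3

s3 = boto3.client("s3")
# Placeholder bucket/key; the file should be a plain, comma-delimited text
# file (re-saved from a text editor if Excel's output is rejected).
s3.upload_file("data.csv", "my-ml-bucket", "ml-input/data.csv")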

AWS S3 error with PFFiles after importing the exported Parse data

It looks like Parse.com stores PFFile objects on AWS S3 and only keeps a reference to the actual S3 files in Parse for the PFFile object types.
So my problem here is that I only get an AWS S3 link for my PFFile if I export the data using the out-of-the-box Parse.com export functionality. After I import the same data into my Parse application, for some reason the security settings on those PFFiles on S3 are changed in such a way that the PFFiles are no longer accessible to me after the import, due to a security error.
My question is: does anyone know how the security is being set on the PFFiles? Here's a link to PFFile (https://parse.com/docs/osx/api/Classes/PFFile.html), but I guess this is a rather advanced topic and isn't covered on that page.
I'm also looking for a solution to this; all I found is this from their forum:
In this case, the PFFiles are stored in a different app. You might need to download these files and upload them again to the new app and update the pointers. I know this is not a great answer but we're working on making this process more straightforward.
https://www.parse.com/questions/import-pffile-object-not-working-in-iphone-application