In my project, users upload images into an S3 bucket. I have created a TensorFlow ResNet model to interpret the contents of each image. Based on the TensorFlow interpretation, the data is to be stored in an Elasticsearch instance.
For this, I have created an S3 bucket, a Lambda function that gets triggered when an image is uploaded, and an AWS Elasticsearch instance. Since my TF models are large, I have zipped them, put them in an S3 bucket, and provided the S3 URL to the Lambda function.
Issue: since my unzipped files were larger than 266 MB, I could not deploy the Lambda function.
Alternative approach: instead of an S3 bucket, I am thinking of creating an EC2 instance with a larger volume size to store the images, and receiving the images directly on the EC2 instance instead of S3. However, since I will be receiving millions of images within a year, I am not sure this will be scalable.
I can think of two approaches here:
Side-load the app. The Lambda can be a small bootstrap script that downloads your app from S3 and unzips it. This is a popular pattern in serverless frameworks. You pay for this during a cold start of the Lambda, so you will need to keep it warm in a production environment.
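A minimal sketch of that bootstrap pattern, assuming hypothetical bucket/key names for the zipped model and Python with boto3 (note that /tmp is the only writable path in Lambda, 512 MB by default and configurable up to 10 GB, so check the unzipped model fits):

```python
import os
import zipfile
import boto3

# Hypothetical locations for the zipped model; adjust to your bucket/key.
MODEL_BUCKET = "my-model-bucket"
MODEL_KEY = "models/resnet.zip"
MODEL_DIR = "/tmp/model"

s3 = boto3.client("s3")

def _ensure_model():
    """Download and unzip the model once per container, i.e. only on a cold start."""
    if os.path.isdir(MODEL_DIR):
        return
    local_zip = "/tmp/model.zip"
    s3.download_file(MODEL_BUCKET, MODEL_KEY, local_zip)
    with zipfile.ZipFile(local_zip) as zf:
        zf.extractall(MODEL_DIR)
    os.remove(local_zip)  # free /tmp space once extracted

def handler(event, context):
    _ensure_model()
    # ... load the TF model from MODEL_DIR and run inference on the uploaded image ...
    return {"status": "ok"}
```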
You can store the images in S3 itself and create an event notification on image upload with SQS as the destination. Then you can use an EC2 instance to poll SQS periodically for new messages and process them using your TF models.
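A rough sketch of such a consumer running on EC2, assuming a hypothetical queue URL and the standard S3 event notification payload delivered to SQS:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/image-events"  # hypothetical queue URL

def poll_once():
    """Long-poll the queue and process any S3 upload notifications found."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling reduces empty receives
    )
    for msg in resp.get("Messages", []):
        for record in json.loads(msg["Body"]).get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # ... download the image, run the TF model, index the result in Elasticsearch ...
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```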
Related
For an AI project I want to train a model on a dataset of about 300 GB, and I want to use the AWS SageMaker framework.
In the SageMaker documentation, they write that SageMaker can import data from an AWS S3 bucket. Since the dataset is huge, I zipped it (into several zip files) and uploaded it to an S3 bucket, which took several hours. However, in order to use it I need to unzip the dataset. There are several options:
Unzip directly in S3. This might be impossible to do. See refs below.
Upload the uncompressed data directly. I tried this, but it took too much time and stopped partway through, having uploaded only 9% of the data.
Upload the data to an AWS EC2 machine and unzip it there. But can I import the data into SageMaker from EC2?
Many solutions offer a Python script that downloads the data from S3, unzips it locally (on the desktop), and then streams it back to the S3 bucket (see references below). Since I have the original files, I could simply upload them to S3, but this takes too long (see option 2 above).
Added in Edit:
I am now trying to upload the uncompressed data using AWS CLI V2.
References:
How to extract files in S3 on the fly with boto3?
https://community.talend.com/s/question/0D53p00007vCjNSCA0/unzip-aws-s3?language=en_US
https://www.linkedin.com/pulse/extract-files-from-zip-archives-in-situ-aws-s3-using-python-tom-reid
https://repost.aws/questions/QUI8fTOgURT-ipoJmN7qI_mw/unzipping-files-from-s-3-bucket
https://dev.to/felipeleao18/how-to-unzip-zip-files-from-s3-bucket-back-to-s3-29o9
The main strategy, which is also the least expensive (since storage has its own cost per GB), is not to use the local disk of the EC2 instance running the training job, but rather to take advantage of the high transfer rate from the bucket to instance memory.
This assumes the bucket resides in the same region as the EC2 instance; otherwise transfers are slower and incur cross-region data-transfer fees.
You can implement parallel or chunked file reading yourself in your script, but my advice is to use frameworks such as Dask/PySpark/PyArrow (if you need to read dataframes), or to review whether the zipped data can be transformed into a more convenient storage format (e.g., a CSV converted to gzip-compressed Parquet).
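For instance, a minimal sketch of such a conversion with pandas (the S3 paths are hypothetical, and reading/writing "s3://" URLs assumes the s3fs and pyarrow packages are installed):

```python
import pandas as pd

# Hypothetical paths; "s3://" URLs require the s3fs package.
src = "s3://my-dataset-bucket/raw/data.csv"
dst = "s3://my-dataset-bucket/parquet/data.parquet.gzip"

# Read the CSV, then write it back as gzip-compressed Parquet (needs pyarrow or fastparquet).
df = pd.read_csv(src)
df.to_parquet(dst, compression="gzip")
```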
If the nature of the data is different (e.g., images or other binary files), an appropriate lazy data-loading strategy must be identified.
For example, for your zip-archive problem, you can easily get the list of your files from an S3 prefix and read them sequentially.
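A rough sketch of that, assuming a hypothetical bucket/prefix and that each archive is small enough to hold in memory:

```python
import io
import zipfile
import boto3

s3 = boto3.client("s3")
BUCKET = "my-dataset-bucket"  # hypothetical bucket
PREFIX = "zipped/"            # hypothetical prefix holding the archives

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith(".zip"):
            continue
        # Pull the archive into memory and walk its members (assumes each archive fits in RAM).
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        with zipfile.ZipFile(io.BytesIO(body)) as zf:
            for member in zf.namelist():
                with zf.open(member) as f:
                    data = f.read()  # hand `data` to your preprocessing / training pipeline
```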
You already have the data in S3 zipped. What's left is:
Provision a SageMaker notebook instance, or an EC2 instance with enough EBS storage (say 800 GB).
Log in to the instance, open a shell, and copy the data from S3 to the local disk.
Unzip the data.
Copy the unzipped data back to S3 (a sketch of the copy/unzip/copy-back steps follows this list).
Terminate the instance and delete the EBS volume to avoid extra cost.
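Under the assumption of hypothetical bucket/key names and a local EBS-backed scratch directory, those steps might look like this in Python with boto3 (the aws CLI's cp/sync commands would work just as well from the shell):

```python
import os
import zipfile
import boto3

s3 = boto3.client("s3")
BUCKET = "my-dataset-bucket"        # hypothetical bucket
ZIP_KEYS = ["zipped/part1.zip"]     # the archives you uploaded
WORK_DIR = "/home/ec2-user/data"    # local EBS-backed scratch space

os.makedirs(WORK_DIR, exist_ok=True)

for key in ZIP_KEYS:
    # Copy the archive from S3 to the local disk.
    local_zip = os.path.join(WORK_DIR, os.path.basename(key))
    s3.download_file(BUCKET, key, local_zip)

    # Unzip locally.
    extract_dir = os.path.splitext(local_zip)[0]
    with zipfile.ZipFile(local_zip) as zf:
        zf.extractall(extract_dir)

    # Push the extracted files back under an "unzipped/" prefix.
    for root, _, files in os.walk(extract_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, extract_dir)
            s3.upload_file(path, BUCKET, f"unzipped/{rel}")
```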
This should be fast (no less than 250 MB/s), as the instance has high bandwidth to S3 within the same AWS Region.
Assuming you are referring to training when you talk about using the dataset in SageMaker, read this guide on the different storage options for large datasets.
I am new to AWS and trying to find a way to load data from S3 into RDS. In my current approach I am using an EC2 instance (where my app is running) to do that. I was thinking of doing it through Lambda, but my data has around 22 million records and my current approach takes 4 hours, while the Lambda timeout is 15 minutes (so the Lambda approach does not work in this case).
The problem with my current approach is that these data files arrive maybe once a month, and I don't want to keep an EC2 instance running just for this task. Any alternatives in the serverless world would be helpful. Thank you.
Note: the data is loaded from S3 into RDS based on SQS, i.e. my application pulls messages from SQS and then loads the data into RDS.
Please try AWS DMS (Database Migration Service) for this. You need to create a DMS replication task with a source endpoint pointing at your S3 bucket and a target endpoint holding the connection details of your RDS instance.
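A hedged sketch of that setup with boto3 (every identifier, ARN, and the table definition below is a hypothetical placeholder; an S3 source endpoint also needs an external table definition describing the CSV layout, and the replication instance is assumed to already exist):

```python
import json
import boto3

dms = boto3.client("dms")

# Hypothetical external table definition describing the CSV files in the bucket.
table_def = json.dumps({
    "TableCount": "1",
    "Tables": [{
        "TableName": "records",
        "TablePath": "myschema/records/",
        "TableOwner": "myschema",
        "TableColumns": [
            {"ColumnName": "id", "ColumnType": "INT8", "ColumnNullable": "false", "ColumnIsPk": "true"},
            {"ColumnName": "payload", "ColumnType": "STRING", "ColumnLength": "255"},
        ],
        "TableColumnsTotal": "2",
    }],
})

source = dms.create_endpoint(
    EndpointIdentifier="s3-source",
    EndpointType="source",
    EngineName="s3",
    S3Settings={
        "BucketName": "my-data-bucket",  # hypothetical bucket
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-role",
        "ExternalTableDefinition": table_def,
        "CsvDelimiter": ",",
        "CsvRowDelimiter": "\n",
    },
)

target = dms.create_endpoint(
    EndpointIdentifier="rds-target",
    EndpointType="target",
    EngineName="mysql",  # match your RDS engine
    ServerName="mydb.xxxxxxxx.us-east-1.rds.amazonaws.com",
    Port=3306,
    DatabaseName="mydb",
    Username="admin",
    Password="***",
)

dms.create_replication_task(
    ReplicationTaskIdentifier="s3-to-rds-full-load",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE",  # existing replication instance
    MigrationType="full-load",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection", "rule-id": "1", "rule-name": "1",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```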
I have recently joined a company that uses S3 buckets for various projects within AWS. I want to identify and potentially delete S3 objects that are not being accessed (read or write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all the options are workarounds, depending on your use case.
You have a few options:
Tag each S3 object with a last-accessed date (e.g. 2018-10-24). First, turn on object-level logging for your S3 bucket and set up CloudWatch Events for CloudTrail. The tag can then be updated by a Lambda function that runs on a CloudWatch Event fired by a Get request. Finally, create a function that runs on a scheduled CloudWatch Event and deletes all objects with a date tag prior to today (a sketch of the tagging handler follows this list).
Query CloudTrail logs: write a custom function that extracts last-access times from the object-level CloudTrail logs. This could be done with Athena, or by querying the log files in S3 directly.
Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
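A minimal sketch of the tagging handler from the first option, assuming a CloudWatch Events / EventBridge rule that matches CloudTrail GetObject data events (the field names follow the CloudTrail event structure, and note that put_object_tagging replaces the object's whole tag set):

```python
import datetime
import boto3

s3 = boto3.client("s3")

def tag_on_access(event, context):
    """Hypothetical handler for an EventBridge rule matching CloudTrail GetObject data events."""
    params = event["detail"]["requestParameters"]
    bucket = params["bucketName"]
    key = params["key"]
    # put_object_tagging replaces the object's entire tag set.
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": "last-accessed",
                             "Value": datetime.date.today().isoformat()}]},
    )
```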
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using the files, you can probably delete them. But you would be the only person who would know whether they are necessary.
There is a recent AWS blog post, which I found very interesting, that describes a cost-optimized approach to this problem.
Here is the description from the AWS blog:
The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
The Lambda function initiates an S3 Batch Operations job to tag objects in the source bucket that must be expired, using the following logic:
Capture the number of days (x) from the S3 Lifecycle configuration.
Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
Write a manifest file with the list of these objects to an S3 bucket.
Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to 'x' days. They will have the tag given via the S3 batch operation of "delete=True".
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs
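To make the last step of that flow concrete, here is a minimal sketch of such a lifecycle rule via boto3 (the bucket name and the value of 'x' are hypothetical placeholders; the tag key/value match the "delete=True" tag applied by the batch job):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: objects tagged "delete=True" by the S3 Batch Operations job
# are expired once they are older than x days (x = 30 here as a placeholder).
s3.put_bucket_lifecycle_configuration(
    Bucket="source-bucket",  # hypothetical source bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-unaccessed-objects",
                "Filter": {"Tag": {"Key": "delete", "Value": "True"}},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```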
So I have an S3 bucket with over 200 GB of different videos. It would be very time-consuming to manually set up jobs to transcode all of them.
How can I use either the web UI or the AWS CLI to transcode all videos in this bucket to 1080p, replicating the same output path in a different bucket?
I also want any new videos added to the original bucket to be transcoded automatically immediately after upload.
I've seen some posts about Lambda functions, but I don't know anything about this.
A Lambda function is essentially a short-lived compute environment that runs some code.
The sample code in your link is what you are looking for as a solution. You can invoke your Lambda function once for each item in the S3 bucket and kick off concurrent processing of the entire bucket.
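As a hedged sketch of the new-upload case, here is a Lambda handler (Python/boto3) triggered by S3 ObjectCreated events that submits one Elastic Transcoder job per uploaded video; the pipeline ID is a hypothetical placeholder, the preset ID should be verified against the 1080p system preset in your account, and AWS MediaConvert is the newer alternative service:

```python
import urllib.parse
import boto3

transcoder = boto3.client("elastictranscoder")

PIPELINE_ID = "1111111111111-abcde1"   # hypothetical pipeline (defines input/output buckets)
PRESET_1080P = "1351620000001-000001"  # generic 1080p system preset; verify the ID in your account

def handler(event, context):
    """Triggered by S3 ObjectCreated events: submit one transcode job per uploaded video."""
    for record in event["Records"]:
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        transcoder.create_job(
            PipelineId=PIPELINE_ID,
            Input={"Key": key},
            # Same key name, written to the pipeline's output bucket.
            Output={"Key": key, "PresetId": PRESET_1080P},
        )
```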
I'm creating an image upload service using EC2 and S3.
A user uploads an image to EC2 using PHP; EC2 uploads it to S3 and then responds to the user with the image link.
I was wondering how fast the upload between EC2 and S3 in the same region is.
Would it be better to store the image temporarily on EC2, respond to the user first, and upload to S3 later, or to wait for the upload to finish before responding to the user?
I was wondering how fast the upload between EC2 and S3 in the same region is
It's fast. Test it.
You should find that you can upload the image to S3 very quickly, then return the S3 URL to the client, where they'll immediately be able to fetch the image.
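For illustration, a minimal upload-then-link sketch (shown in Python with boto3 for brevity, though the AWS SDK for PHP exposes equivalent calls; the bucket name is hypothetical):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-image-bucket"  # hypothetical bucket in the same region as the EC2 instance

def store_and_link(local_path, key):
    """Upload synchronously, then return a link the client can fetch right away."""
    s3.upload_file(local_path, BUCKET, key)
    # Either serve the object publicly via a bucket policy, or hand back a presigned URL:
    return s3.generate_presigned_url(
        "get_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=3600
    )
```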
Caveat: if you are overwriting an S3 object at the same path, rather than creating a new object, there can be a delay after the time you upload the object before the new object is consistently returned for every request. This delay is unlikely, but possible, due to the eventual consistency model of S3. Deletes are the same way: a deleted object may still be fetchable, briefly, before requests to S3 return 404 or 403. (Note: since December 2020, S3 provides strong read-after-write consistency, so this caveat applies to the older consistency model.)
See What is maximum Amazon S3 replication time on file upload? and note the change you should make to the endpoint if you're working in the US Standard (us-east-1) region to ensure immediate consistency.
It will be plenty fast; the latency between the user and your EC2 instance will be much bigger than the latency between EC2 and S3.
On the other hand, if EC2 is not doing anything to the image before uploading it to S3, why not upload it directly to S3?