I am designing a service that receives large amounts of binary data (proprietary images). I want to accumulate multiple images into a single S3 object and finally upload it to S3. It doesn't make sense to store the accumulation in memory, because I want to be able to horizontally scale the service by running multiple instances. Since S3 is "replace-only", I can't keep re-uploading the object to S3. In such a case, what is the best option for storing this intermediate data before uploading to S3? One option I am considering is Redis, but it limits the size of a single value to 512 MB.
For an AI project I want to train a model on a dataset of about 300 GB. I want to use the AWS SageMaker framework.
The SageMaker documentation says that SageMaker can import data from an AWS S3 bucket. Since the dataset is huge, I zipped it (into several zip files) and uploaded it to an S3 bucket, which took several hours. However, in order to use the dataset I need to unzip it. There are several options:
1. Unzip directly in S3. This might be impossible to do; see the references below.
2. Upload the uncompressed data directly. I tried this, but it takes too much time; it stopped in the middle after uploading only 9% of the data.
3. Upload the data to an AWS EC2 machine and unzip it there. But can I import the data into SageMaker from EC2?
4. Many solutions offer a Python script that downloads the data from S3, unzips it locally (on the desktop), and then streams it back to the S3 bucket (see the references below). Since I have the original files I could simply upload them to S3 directly, but that takes too long (see option 2).
Added in Edit:
I am now trying to upload the uncompressed data using AWS CLI V2.
References:
How to extract files in S3 on the fly with boto3?
https://community.talend.com/s/question/0D53p00007vCjNSCA0/unzip-aws-s3?language=en_US
https://www.linkedin.com/pulse/extract-files-from-zip-archives-in-situ-aws-s3-using-python-tom-reid
https://repost.aws/questions/QUI8fTOgURT-ipoJmN7qI_mw/unzipping-files-from-s-3-bucket
https://dev.to/felipeleao18/how-to-unzip-zip-files-from-s3-bucket-back-to-s3-29o9
The strategy most commonly used, and also the least expensive (since storage has its own cost per GB), is not to use the disk of the EC2 instance running the training job, but rather to take advantage of the high transfer rate from the bucket into instance memory.
This assumes that the bucket resides in the same Region as the EC2 instance; otherwise you have to pay extra to increase the transfer performance.
You can implement parallel or chunked reads yourself in your script, but my advice is to use frameworks such as dask/pyspark/pyarrow (if you need to read dataframes), or to review whether the zipped data can be stored in a more convenient form (e.g., a CSV converted to parquet.gzip).
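For example, a minimal sketch of that kind of conversion, assuming the archives contain CSV files and that pandas with the pyarrow engine is installed (the file names here are placeholders):

```python
import pandas as pd  # assumes pandas with the pyarrow engine installed

# Hypothetical file names, for illustration only.
df = pd.read_csv("images_metadata.csv")

# Parquet with gzip compression is smaller and far quicker to read back
# selectively than the original CSV.
df.to_parquet("images_metadata.parquet.gzip", compression="gzip")
```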
If the nature of the data is different (e.g., images or other binary files), an appropriate lazy data-loading strategy must be identified.
For example, for your zip-file problem, you can easily list the files under an S3 prefix ("folder") and read them sequentially, as sketched below.
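A minimal sketch of that listing-and-reading pattern with boto3 (bucket and prefix names are placeholders):

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-dataset-bucket"  # placeholder bucket name
prefix = "zipped/"            # placeholder "folder" (key prefix)

# The paginator transparently handles listings with more than 1000 keys.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        # Read each object sequentially (or stream the body in chunks).
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
        data = body.read()
        # ... hand the bytes to your loader here
```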
You already have the data in S3 zipped. What's left is:
1. Provision a SageMaker notebook instance, or an EC2 instance with enough EBS storage (say 800 GB).
2. Log in to the instance, open a shell, and copy the data from S3 to the local disk.
3. Unzip the data.
4. Copy the unzipped data back to S3.
5. Terminate the instance and delete the EBS volume to avoid extra cost.
This should be fast (no less than 250 MB/s), as the instance has high bandwidth to S3 within the same AWS Region.
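For steps 2-4, a rough Python sketch with boto3 could look like this (bucket, key, and local paths are placeholders; running aws s3 cp from the shell works just as well):

```python
import zipfile
from pathlib import Path

import boto3

s3 = boto3.client("s3")
bucket = "my-dataset-bucket"          # placeholder bucket name
zip_key = "zipped/part-01.zip"        # placeholder key for one archive
local_zip = Path("/home/ec2-user/SageMaker/part-01.zip")
extract_dir = Path("/home/ec2-user/SageMaker/unzipped")

# Step 2: copy the archive from S3 to the instance's local/EBS storage.
s3.download_file(bucket, zip_key, str(local_zip))

# Step 3: unzip it locally.
extract_dir.mkdir(parents=True, exist_ok=True)
with zipfile.ZipFile(local_zip) as zf:
    zf.extractall(extract_dir)

# Step 4: copy the unzipped files back to S3 under a new prefix.
for path in extract_dir.rglob("*"):
    if path.is_file():
        key = f"unzipped/{path.relative_to(extract_dir)}"
        s3.upload_file(str(path), bucket, key)
```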
Assuming you are referring to training when you talk about using the dataset in SageMaker, read this guide on the different storage options for large datasets.
I have prepared a CNN and would like to deploy it on AWS to do some image classification. However, the images are stored on another server, not on Amazon S3.
Do I have to load all the images into S3 prior to calling the endpoint for inference? Does AWS provide something like a cache so I can fetch images from that server without bringing them into S3?
On another note, is there any alternative way to classify a large number of images? The output should be a JSON file. I'm quite lost with all the AWS features.
Thank you for your help!
At the moment, Batch Transform is limited to S3 as an input source.
To better suggest an inference method, can you please state the size of the input data and your latency requirements?
SageMaker also offers Asynchronous Inference, which allows you to point to an S3 path (input) and select an instance type and count, with an optional scaling policy to scale down to zero to save costs. The maximum input size for Async Inference is 1 GB.
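For reference, invoking an asynchronous endpoint with boto3 looks roughly like this (the endpoint name and S3 input path are placeholders):

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# InputLocation points at a request payload already staged in S3; the
# endpoint writes its result to the output S3 path configured when the
# async endpoint was created.
response = runtime.invoke_endpoint_async(
    EndpointName="my-image-classifier",                     # placeholder endpoint name
    InputLocation="s3://my-bucket/requests/img-001.json",   # placeholder S3 input
    ContentType="application/json",
)
print(response["OutputLocation"])  # S3 URI where the prediction will land
```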
We are fetching binary blobs (PDF, JPG) from SQL Server and adding the objects to Amazon S3 using AWSSDK.S3 (.NET) v3.7.2.2.
Currently the process adds the binary objects to Amazon S3 sequentially (one by one).
Is there any way/API to add more than one object to Amazon S3 in a single request, as this could improve performance?
While adding the binary objects we also have to pass metadata (binary object properties like width, height, extension, etc.).
It is not possible to upload/download multiple objects in one request.
However, Amazon S3 is highly scalable, so you can send multiple requests in parallel. This will also make better use of your available bandwidth, since the overhead of the file transfer protocol limits how much of it a single request can use.
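The same pattern applies with the .NET SDK (e.g., firing several PutObjectAsync calls and awaiting them together). As a minimal illustration of the idea, here is a sketch in Python with boto3, including per-object metadata (bucket, keys, and payloads are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
bucket = "my-blob-bucket"  # placeholder bucket name

# Hypothetical blobs already fetched from SQL Server: (key, payload, metadata).
items = [
    ("docs/invoice-001.pdf", b"%PDF-1.7 ...", {"extension": "pdf"}),
    ("images/photo-001.jpg", b"\xff\xd8\xff...", {"width": "1024", "height": "768", "extension": "jpg"}),
]

def upload_one(item):
    key, payload, metadata = item
    # Each call is still a single-object request; S3 has no batch PutObject.
    s3.put_object(Bucket=bucket, Key=key, Body=payload, Metadata=metadata)

# Running the single-object requests in parallel uses far more of the
# available bandwidth than uploading one object at a time.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(upload_one, items))
```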
Suppose you have to implement a video streaming platform from scratch. It doesn't matter where you store the metadata; your not-very-popular video files will be stored on a file system, or in an object store if you want to use the cloud. If you choose AWS, then to boost S3 read performance you can make multiple read requests against the same video file; see the Performance Guidelines for Amazon S3:
You can use concurrent connections to Amazon S3 to fetch different byte ranges from within the same object. This helps you achieve higher aggregate throughput versus a single whole-object request.
At the same time, as you know, disk I/O is sequential on HDD/SSD drives, so to boost read performance (ignoring the RAM needed to decompress/decrypt each video chunk) you have to read from multiple disks (YouTube uses RAID).
Why would S3 have better performance on concurrent byte-range requests against the same file? Isn't the file stored on a single disk? I suppose S3 may have some replication factor and therefore store the file on multiple disks, does it?
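For concreteness, the byte-range pattern from the guidelines looks roughly like this with boto3 (bucket, key, and part size are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
bucket, key = "my-video-bucket", "videos/episode-01.mp4"  # placeholders
part_size = 8 * 1024 * 1024  # 8 MiB per range request

size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

def fetch_range(start):
    end = min(start + part_size, size) - 1
    # Each worker asks for a different byte range of the same object.
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return start, resp["Body"].read()

# Fetch the ranges concurrently, then reassemble them in order.
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = dict(pool.map(fetch_range, range(0, size, part_size)))
data = b"".join(parts[offset] for offset in sorted(parts))
```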
My client has a service which stores a lot of files, like video or sound files. The service works well; however, long-term file storage looks like quite a challenge, and we would like to use AWS for storing these files.
The problem is the following: the client wants to use AWS Kinesis for transferring every file from our servers to AWS. Is this possible? Can we transfer files using that service? There are a lot of video files, we get more every day, and every file is relatively big.
We would also like to save some details about the files, possibly in DynamoDB; we could use Lambda functions for that.
The most important thing is that we need a reliable data transfer option.
Kinesis would not be the right tool to upload files unless they were all very small, and most videos would almost certainly be over the 1 MB record size limit:
The maximum size of a data blob (the data payload before Base64-encoding) within one record is 1 megabyte (MB).
https://aws.amazon.com/kinesis/streams/faqs/
Use S3 with multipart upload via one of the SDKs. Objects you won't be accessing for 90+ days can be moved to Glacier.
Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.
Amazon Web Services. Amazon Simple Storage Service (S3) Developer Guide (Kindle Locations 4302-4306). Amazon Web Services, Inc.. Kindle Edition.
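With boto3, for example, the transfer manager handles the multipart mechanics automatically once a size threshold is crossed; a minimal sketch (file name, bucket, and thresholds are placeholders):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Use multipart for anything above 100 MB, uploading up to 10 parts at a
# time; an individual part that fails can be retransmitted without
# restarting the whole upload.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=10,
)

s3.upload_file(
    "episode-01.mp4",         # placeholder local file
    "my-video-bucket",        # placeholder bucket
    "videos/episode-01.mp4",  # placeholder object key
    Config=config,
)
```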
To further optimize file upload speed, use transfer acceleration:
Amazon S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket. Transfer Acceleration takes advantage of Amazon CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to Amazon S3 over an optimized network path.
Amazon Web Services. Amazon Simple Storage Service (S3) Developer Guide (Kindle Locations 2060-2062). Amazon Web Services, Inc.. Kindle Edition.
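Transfer Acceleration has to be enabled on the bucket once, and clients then opt into the accelerate endpoint; with boto3 that looks roughly like this (bucket and file names are placeholders):

```python
import boto3
from botocore.config import Config

bucket = "my-video-bucket"  # placeholder bucket name

# One-time setup: enable Transfer Acceleration on the bucket.
boto3.client("s3").put_bucket_accelerate_configuration(
    Bucket=bucket,
    AccelerateConfiguration={"Status": "Enabled"},
)

# Clients then opt into the accelerate endpoint for their transfers.
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("episode-01.mp4", bucket, "videos/episode-01.mp4")
```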
Kinesis has also launched a new service, "Kinesis Video Streams" (https://aws.amazon.com/kinesis/video-streams/), which may be helpful for moving large amounts of data.