Efficient way to upload a huge number of small files to S3

I'm encoding DASH streams locally that I intend to serve through CloudFront afterwards, but when it comes to uploading the whole folder, it gets counted as 4,000+ PUT requests. So I thought I would instead compress it and upload the zip archive, which would count as only 1 PUT request, and then unzip it using Lambda.
My question is: is Lambda still going to incur PUT requests when unzipping the file? And if so, what would be a better / more cost-effective way to achieve this?

No, there is no way around having to pay for the individual PUT/POST requests per file: when Lambda unzips the archive, every extracted object it writes back to S3 is its own PUT.
S3 is expensive, and so is anything related to video streaming; the bandwidth and storage costs will eclipse your HTTP request costs. You might consider a more affordable provider, since AWS is among the most expensive of the S3-compatible hosting options.
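To make the cost point concrete, here is a minimal sketch (my own, not from the answer above) of what such an unzip Lambda could look like in Node.js/TypeScript. It assumes the @aws-sdk/client-s3 package and the third-party adm-zip library, and the bucket name, key prefix, and event shape are hypothetical. The thing to notice is that every extracted file becomes its own PutObjectCommand, i.e. its own billable PUT:

```typescript
// Sketch: unzip an uploaded archive and write each entry back to S3.
// Each PutObjectCommand below is a separate, billable PUT request.
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";
import AdmZip from "adm-zip"; // assumed third-party zip library

const s3 = new S3Client({});

// Simplified event shape for illustration; a real S3 trigger event differs.
export const handler = async (event: { bucket: string; key: string }) => {
  // Download the zip archive that was uploaded with a single PUT.
  const archive = await s3.send(
    new GetObjectCommand({ Bucket: event.bucket, Key: event.key })
  );
  const zip = new AdmZip(Buffer.from(await archive.Body!.transformToByteArray()));

  // Write every segment/manifest out as its own object: one PUT per file.
  for (const entry of zip.getEntries()) {
    if (entry.isDirectory) continue;
    await s3.send(
      new PutObjectCommand({
        Bucket: event.bucket,           // hypothetical target bucket
        Key: `dash/${entry.entryName}`, // hypothetical key prefix
        Body: entry.getData(),
      })
    );
  }
};
```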

Related

Streaming media to files in AWS S3

My problem:
I want to stream media I record on the client (TypeScript code) to my AWS storage (services like YouTube / Twitch / Zoom / Google Meet can live-record and save the recording to their cloud; some of them even tolerate host failure and still produce a file if the host disconnects).
I want each stream to have a different file name so that future triggers can be driven from it.
I tried to save the stream into S3, but maybe there are storage solutions better suited to my problem.
What services I tried:
S3: I tried to stream directly into S3, but it doesn't really support updating files.
I tried multipart uploads, but they are not host-failure tolerant.
I tried to upload each part and have a Lambda merge them (yes, it is very dirty and resource-consuming), but I sometimes had ordering problems (see the multipart sketch after this list).
Kinesis Video: I tried to use Kinesis Video but couldn't enable the saving feature with the SDK.
Testing by hand, I saw that it saved a new file after a period of time or after a size threshold was reached, so maybe it is not the solution I want.
Amazon IVS: I tried it because Twitch recommended it, although it is way beyond my requirements.
I couldn't find a code example of what I want to do with the SDK (only manual, console-based examples).
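For reference, here is a hedged sketch (mine, not from the question) of a plain S3 multipart upload with the AWS SDK for JavaScript v3 (@aws-sdk/client-s3); bucket, key, and chunking are placeholders. It also shows why the ordering and host-failure observations above make sense: each part carries an explicit PartNumber, and nothing becomes visible in the bucket until CompleteMultipartUpload succeeds.

```typescript
// Sketch of a standard S3 multipart upload: each part carries an explicit
// PartNumber, so S3 assembles them in order when the upload is completed.
import {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCommand,
  CompleteMultipartUploadCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({});

export async function uploadRecording(bucket: string, key: string, chunks: Uint8Array[]) {
  const { UploadId } = await s3.send(
    new CreateMultipartUploadCommand({ Bucket: bucket, Key: key })
  );

  const parts = [];
  for (let i = 0; i < chunks.length; i++) {
    const { ETag } = await s3.send(
      new UploadPartCommand({
        Bucket: bucket,
        Key: key,
        UploadId,
        PartNumber: i + 1, // 1-based; S3 orders parts by this number
        Body: chunks[i],   // note: every part except the last must be at least 5 MB
      })
    );
    parts.push({ ETag, PartNumber: i + 1 });
  }

  // Nothing is visible in the bucket until this call succeeds, which is why a
  // crashed host leaves no finished object behind.
  await s3.send(
    new CompleteMultipartUploadCommand({
      Bucket: bucket,
      Key: key,
      UploadId,
      MultipartUpload: { Parts: parts },
    })
  );
}
```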
Questions
Am I looking at the right services?
What can I do with the AWS SDK to make it work?
Is there a good place with code examples for future problems, or maybe a way to search for solutions?
Thank you for your help.

Best way to stream or load audio files into S3 bucket (contact centre recordings)

What is the best way to reliably have our client send audio files to our S3 bucket, which will feed the ML processes that do speech-to-text insights?
The files could be in .wav, .mp3, or other such audio formats. Also, some files may be larger in size.
I'd love to get your best ideas (e.g. API Gateway / Lambda / S3?) and to hear from anyone who may have done this before.
Some questions and answers to give context:
How do users interface with your system? We are looking for an API-based approach vs. a browser-based approach. We can get a browser-based approach to work, but we're not sure that is the right technical / architectural / scalable approach.
Do you require a bulk upload method? Yes. We would need bulk-upload functionality, and some individual files may be larger as well.
Will it be controlled by a human, or do you want it to upload automatically somehow? We certainly want it to happen automatically.
Ultimately, we are building a SaaS solution that will take the audio files and metadata, perform analytics on them, and deliver the results of our analysis through an API back to the app. So the approach we are looking for is something that will work within this context.
I have a similar scenario.
If you intend to use API Gateway / Lambda / S3, then you should know that there is a limit on the payload size that API Gateway and Lambda can accept. Specifically, API Gateway accepts payloads up to 10 MB and Lambda up to 6 MB.
There is a workaround for this limit, though: have clients upload files directly to an S3 bucket and attach a Lambda trigger on object creation.
I'll leave some articles that may point you in the right direction:
Uploading a file using presigned URLs:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/PresignedUrlUploadObject.html
Lambda trigger on s3 object creation: https://medium.com/analytics-vidhya/trigger-aws-lambda-function-to-store-audio-from-api-in-s3-bucket-b2bc191f23ec
A holistic view of the same issue: https://sookocheff.com/post/api/uploading-large-payloads-through-api-gateway/
Related GitHub issue:
https://github.com/serverless/examples/issues/106
So from my point of view, regarding uploading files, the best approach would be to return a pre-signed URL and then have the client upload the file directly to S3. Otherwise, you'll have to implement uploading the file in chunks.
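As an illustration of that pre-signed-URL suggestion (a sketch under assumptions, not a drop-in implementation), a Lambda behind API Gateway could hand out an upload URL like this, using @aws-sdk/client-s3 and @aws-sdk/s3-request-presigner; the bucket name, key prefix, and expiry are hypothetical:

```typescript
// Sketch of a Lambda (behind API Gateway) that hands out a pre-signed PUT URL
// so the client can upload audio directly to S3, bypassing the payload limits.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({});
const BUCKET = "my-audio-uploads"; // hypothetical bucket name

export const handler = async (event: { queryStringParameters?: { filename?: string } }) => {
  const filename = event.queryStringParameters?.filename ?? "recording.wav";

  const url = await getSignedUrl(
    s3,
    new PutObjectCommand({ Bucket: BUCKET, Key: `incoming/${filename}` }),
    { expiresIn: 300 } // URL is valid for 5 minutes
  );

  // The client PUTs the file bytes straight to this URL; an S3 event
  // notification on the bucket can then kick off the ML pipeline.
  return { statusCode: 200, body: JSON.stringify({ url }) };
};
```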

How to upload large file using AWS Gateway and S3 proxy

I have an API Gateway API configured as a proxy for S3 to upload a file to an S3 bucket. I have configured binary media types to support multipart/form-data.
I am able to upload a file of 10 MB or less without any issue. However, when the file is larger than 10 MB, I get a 413 Request Entity Too Large error.
I know that API Gateway has a hard limit of 10 MB on the payload.
Questions
1. Shouldn't adding multipart/form-data solve the 10 MB limit issue? Do I need to configure anything else?
2. Another recommended approach is to create a pre-signed URL. I am assuming that, for this approach to work, the client has to make a call to get the pre-signed URL and then use that URL to upload the file. Is this the only approach for uploading a large file?
Note that I have gone through several SO posts regarding the same issue, but most of them are old, and I am curious whether there are any newer recommendations.
The 10 MB payload limit is hard and cannot be increased [1].
It seems to be possible to split the file into chunks on the client and then put them back together on the server [2] to circumvent the 10 MB limit, but I do not think this is a reasonable approach. The pre-signed URL approach seems better to me, unless you use a client SDK that already provides chunking functionality.
Please note that if you ever decide to move away from S3, you can still implement the very same pre-signed-URL interface on any server yourself. In my opinion, it is therefore the way to go.
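For completeness, here is a hedged sketch of the client half of that flow: fetch a pre-signed URL from your own backend (the /api/presign endpoint here is hypothetical), then PUT the file straight to S3 so API Gateway never sees the body:

```typescript
// Client-side sketch of the pre-signed URL flow: ask your own API for a URL,
// then PUT the file bytes directly to S3, bypassing the 10 MB gateway limit.
async function uploadLargeFile(file: File): Promise<void> {
  // 1. Ask the backend for a pre-signed URL (endpoint path is hypothetical).
  const res = await fetch(`/api/presign?filename=${encodeURIComponent(file.name)}`);
  const { url } = await res.json();

  // 2. Upload straight to S3; the file body never passes through API Gateway.
  const put = await fetch(url, {
    method: "PUT",
    headers: { "Content-Type": file.type || "application/octet-stream" },
    body: file,
  });
  if (!put.ok) {
    throw new Error(`Upload failed with status ${put.status}`);
  }
}
```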

Transferring lots of small files between EC2 and Amazon S3

I'm building a browser game, and I have a lot of small files that need to be transferred between my EC2 instance and S3 when players perform certain key actions.
Although transferring a single big file is fairly fast, transferring many small files is extremely slow. I'm using Amazon's PHP SDK.
Is there a way to overcome this weakness in S3? Thanks.
It looks like combining the two solutions below is the way to go.
http://improve.dk/archive/2011/11/07/pushing-the-limits-of-amazon-s3-upload-performance.aspx
http://gearman.org/
If this transfer has to be made from the EC2 instance to S3, then maybe you can try using s3fuse, which will basically mount your S3 bucket as a storage volume on the EC2 instance.
The performance of S3 is not constant and can be quite slow sometimes. If you need real-time performance for a shared object, I would take a look at AWS's managed memcached service (ElastiCache), although I have not used it.
How exactly are you uploading the files? Is there a multithreaded method in the SDK? I'm asking because I've had to implement my own method for downloading things faster than the SDK.
Do you need to read those files right away? How many events do you have per second? Do you need them ordered?
My first thought would be to make a local buffer that uploads batches every once in a while.
Then, if that's too slow, I'd store them in a fast buffer first instead of S3, and flush it every once in a while. My choices would be simple things like SQS or Redis. SQS has theoretically unlimited throughput for standard queues and 300 batches per second (1 batch = 1-10 messages = up to 256 KB) for FIFO queues, which you can increase further.
Beyond that, there are streams, Lambda, and so on.
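The question uses the PHP SDK, but to illustrate the parallel-upload idea from the first link in a hedged way, here is a TypeScript sketch with @aws-sdk/client-s3 that drains a queue of small objects with a bounded number of concurrent PUTs; the concurrency value is arbitrary:

```typescript
// Sketch of the parallel-upload idea: push many small objects concurrently
// instead of one at a time, using a simple bounded worker pool.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const CONCURRENCY = 20; // arbitrary; tune for your instance and network

export async function uploadMany(
  bucket: string,
  files: { key: string; body: Uint8Array }[]
): Promise<void> {
  const queue = [...files];
  const workers = Array.from({ length: CONCURRENCY }, async () => {
    // Each worker pulls from the shared queue until it is empty.
    for (let file = queue.shift(); file; file = queue.shift()) {
      await s3.send(
        new PutObjectCommand({ Bucket: bucket, Key: file.key, Body: file.body })
      );
    }
  });
  await Promise.all(workers);
}
```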

What is the most efficient S3 GET request method?

I can download a file from S3 using either of the following methods.
s3cmd get s3://bucket_name/DB/company_data/abc.txt
wget http://bucket_name.s3.amazonaws.com/DB/company_data/abc.txt
My questions are:
1) Which one is faster?
2) Which one is cheaper?
According to some past research, the s3cmd GET operation is about 5 times slower than wget. Keep in mind that s3cmd is a utility designed to manage and retrieve files in your S3 storage; rather than issuing a plain, anonymous HTTP GET, it goes through the authenticated S3 API, which adds overhead.
The only time I can see using the s3cmd utility is when you're retrieving files you cannot otherwise fetch with a standard HTTP GET, such as when the objects in S3 don't have public read permissions, or when you're doing maintenance on your S3 buckets.
Based on your question, I'm assuming you're trying to use this utility in a production system; however, that does not appear to be the intention or goal of the utility.
For more details, check out the performance-testing spreadsheet.
As far as costs go, I'm not an expert on Amazon pricing, but I believe they bill based on the actual data transferred, so a 1 GB file would cost the same regardless of whether you downloaded it quickly or slowly. It's like asking which is heavier: ten pounds of bricks or ten pounds of feathers.
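For the case mentioned above where the objects don't have public read permissions, a pre-signed GET URL is one way to keep using a plain HTTP client such as wget or curl. Here is a minimal sketch with @aws-sdk/client-s3 and @aws-sdk/s3-request-presigner, reusing the bucket and key from the question's example commands:

```typescript
// Sketch: pre-sign a GET URL for a private object so it can be fetched with a
// plain HTTP client (curl/wget) instead of an S3-specific tool like s3cmd.
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({});

export async function presignDownload(): Promise<string> {
  return getSignedUrl(
    s3,
    new GetObjectCommand({
      Bucket: "bucket_name",          // from the question's example
      Key: "DB/company_data/abc.txt", // from the question's example
    }),
    { expiresIn: 3600 } // URL is valid for one hour
  );
}
```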