What is the most efficient S3 GET request method?

I can download a file from S3 using either of the following methods.
s3cmd get s3://bucket_name/DB/company_data/abc.txt
wget http://bucket_name.s3.amazonaws.com/DB/company_data/abc.txt
My questions are:
1) Which one is faster?
2) Which one is cheaper?

According to some past research, the s3cmd GET operation is about 5 times slower than wget. Keep in mind that s3cmd is a utility designed to retrieve files from your S3 storage. It doesn't issue a plain anonymous HTTP GET the way wget does; it talks to the S3 REST API with signed requests.
The only time I can see using the s3cmd utility is when you're retrieving files you cannot otherwise fetch with a standard HTTP GET, for example when the objects on S3 don't have public read permissions or you're doing maintenance on your S3 buckets.
Based on your question, I'm assuming you're trying to use this utility in a production system; however, that doesn't appear to be the intended use of the utility.
For more details, check out the performance testing spreadsheet.
As far as costs go, I'm not an expert on Amazon pricing, but I believe they bill based on the actual data transferred, so a 1 GB file costs the same whether you download it quickly or slowly. It's like asking which is heavier, ten pounds of bricks or ten pounds of feathers.
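If you want to measure the difference yourself, here is a minimal Python sketch of the two access paths described above, using boto3 for the S3 API path and a plain HTTP GET for the wget-style path. The bucket, key and URL are the placeholders from the question; it assumes AWS credentials are configured for the API call and that the object is publicly readable for the plain GET.

# Minimal timing sketch: S3 API download (signed request) vs plain HTTP GET.
import time
import urllib.request
import boto3

BUCKET = "bucket_name"                           # placeholder bucket from the question
KEY = "DB/company_data/abc.txt"                  # placeholder key from the question
URL = f"https://{BUCKET}.s3.amazonaws.com/{KEY}"

# 1) Through the S3 API: a signed REST request, roughly what s3cmd does.
s3 = boto3.client("s3")
start = time.perf_counter()
s3.download_file(BUCKET, KEY, "abc_api.txt")
print(f"S3 API download: {time.perf_counter() - start:.2f}s")

# 2) Plain anonymous HTTP GET, which is what wget does; this only works
#    if the object has public read permission.
start = time.perf_counter()
urllib.request.urlretrieve(URL, "abc_http.txt")
print(f"Plain HTTP GET:  {time.perf_counter() - start:.2f}s")

Either way, the bytes transferred (and therefore the data transfer charge) are the same.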

Related

Efficient way to upload huge number of small files in S3

I'm encoding DASH streams locally that I intend to stream through CloudFront afterwards, but uploading the whole folder gets counted as 4000+ PUT requests. So I thought I could instead compress it and upload the zip, which would count as only 1 PUT request, and then unzip it using Lambda.
My question is: is Lambda still going to incur PUT requests for unzipping the file? And if so, what would be a better / more cost-effective way to achieve this?
No, there is no way around having to pay for the individual PUT/POST requests per-file.
S3 is expensive. So is anything related to video streaming. The bandwidth and storage costs will eclipse your HTTP request costs. You might consider a more affordable provider; AWS is among the most expensive of the providers offering S3-compatible hosting.
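To make the first point concrete, here is a hedged sketch of what a Lambda-based unzip would look like: each extracted entry still becomes its own PutObject call, so the request count is unchanged. The event shape assumes an S3 upload trigger, and the destination bucket and the unzipped/ prefix are hypothetical.

# Hypothetical Lambda handler: unzipping still issues one PutObject per file.
import io
import zipfile
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Assumes the function is triggered by the upload of the .zip to S3.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    zip_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            # One PutObject request (and one PUT charge) per extracted file.
            s3.put_object(
                Bucket=bucket,
                Key=f"unzipped/{name}",        # hypothetical destination prefix
                Body=archive.read(name),
            )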

Streaming media to files in AWS S3

My problem:
I want to stream media that I record on the client (TypeScript code) to my AWS storage. (Services like YouTube / Twitch / Zoom / Google Meet can record live and save the recording to their cloud; some of them are even host-failure tolerant and still produce a file if the host disconnects.)
I want each stream to have a different file name so that future triggers can act on it.
I tried to save the stream into S3, but maybe there are more suitable storage solutions for my problem.
What services I tried:
S3: I tried to stream directly into S3, but it doesn't really support updating files.
I tried multipart uploads, but they are not host-failure tolerant.
I tried to upload each part separately and merge them with a Lambda (yes, it is very dirty and resource-consuming), but I sometimes had ordering problems (see the multipart sketch after this question).
Kinesis Video Streams: I tried to use Kinesis Video Streams but couldn't enable the saving feature through the SDK.
Doing it by hand, I saw that it saves a new file after a period of time or once a size threshold is reached, so it may not be the solution I want.
Amazon IVS: I tried it because Twitch recommends it, although it is well beyond my requirements.
I couldn't find a code example with the SDK of what I want to do (only manual, console-based examples).
Questions
Am I looking at the right services?
What can I do with the AWS SDK to make it work?
Is there a good place with code examples for future problems? Or maybe a way to search for solutions?
Thank you for your help.
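As an aside on the "upload each part and merge" approach mentioned under "What services I tried": S3's native multipart upload already orders parts by an explicit PartNumber, which avoids the merge Lambda entirely. Below is a minimal boto3 sketch under assumptions (the bucket, key and chunk source are placeholders, and every part except the last must be at least 5 MB); it does not by itself solve the host-failure question, since an interrupted upload still has to be completed or aborted separately.

# Sketch: S3 multipart upload with explicit part numbers (no merge Lambda needed).
import boto3

s3 = boto3.client("s3")
BUCKET = "my-recordings-bucket"       # hypothetical bucket
KEY = "streams/session-1234.webm"     # hypothetical per-stream key

def iter_chunks():
    # Placeholder for the media chunks arriving from the client; with more
    # than one part, each part except the last must be at least 5 MB.
    yield b"example media bytes"

upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts = []
try:
    for number, chunk in enumerate(iter_chunks(), start=1):
        resp = s3.upload_part(
            Bucket=BUCKET, Key=KEY,
            UploadId=upload["UploadId"],
            PartNumber=number,                # S3 assembles parts by this number
            Body=chunk,
        )
        parts.append({"ETag": resp["ETag"], "PartNumber": number})
    s3.complete_multipart_upload(
        Bucket=BUCKET, Key=KEY,
        UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts},
    )
except Exception:
    # If the host disconnects mid-stream, the upload stays incomplete until
    # it is aborted (or completed server-side with the parts received so far).
    s3.abort_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"])
    raise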

Cloud File Storage with Bandwidth Limits

I want to develop an app for a friend's small business that will store/serve media files. However, I'm afraid of a piece of media going viral or of getting DDoS'd. The bill could climb quite easily with a service like S3, and I really want to avoid surprise expenses like that. Ideally I'd like some kind of max-bandwidth limit.
Now, a solution for S3 has been posted here, but it requires quite a few steps. So I'm wondering if there is a cloud storage solution that makes this simpler, i.e. one where I don't need to create a custom microservice. I've talked to DigitalOcean support and they don't support this either.
So in the interest of saving time, and perhaps for anyone else who finds themselves in a similar dilemma, I want to ask this question here, I hope that's okay.
Thanks!
Not an out-of-the-box solution, but you could:
Keep the content private
When rendering a web page that contains the file or links to the file, have your back-end generate an Amazon S3 pre-signed URL to grant time-limited access to the object (sketched below)
The back-end could keep track of the "popularity" of the file and, if it exceeds a certain rate (e.g. 1,000 requests over 15 minutes), it could instead point to a small file with a "please try later" message
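A minimal Python sketch of that approach, assuming boto3 on the back-end; the bucket name, placeholder object and the 1,000-per-15-minutes threshold are assumptions, and the in-memory counter would need to be replaced by a shared store (e.g. DynamoDB or Redis) in a real deployment.

# Sketch: serve time-limited pre-signed URLs and fall back to a "try later"
# object once a file gets too popular.
import time
from collections import defaultdict
import boto3

s3 = boto3.client("s3")
BUCKET = "media-bucket"                       # hypothetical bucket
PLACEHOLDER_KEY = "static/try-later.html"     # hypothetical "please try later" file
WINDOW_SECONDS = 15 * 60
MAX_REQUESTS_PER_WINDOW = 1000

_requests = defaultdict(list)                 # in-memory; use a shared store in production

def url_for(key):
    now = time.time()
    recent = [t for t in _requests[key] if now - t < WINDOW_SECONDS]
    _requests[key] = recent

    if len(recent) >= MAX_REQUESTS_PER_WINDOW:
        key = PLACEHOLDER_KEY                 # too popular: point at the placeholder
    else:
        recent.append(now)

    # Time-limited access to the private object (expires after 5 minutes).
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=300,
    )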

What is the efficient way of pulling data from S3 among boto3, Athena and AWS command line utils

Can someone please let me know the most efficient way of pulling data from S3? Basically, I want to pull out data for a given time range, apply some filters over the data (JSON), and store it in a DB. I am new to AWS and after a little research found that I can do it via the boto3 API, Athena queries, or the AWS CLI, but I need some advice on which one to go with.
If you are looking for the simplest and most straightforward solution, I would recommend the AWS CLI. It's perfect for running commands to download a file, list a bucket, etc. from the command line or a shell script.
If you are looking for a solution that is a little more robust and integrates with your application, then any of the various AWS SDKs will do fine. The SDKs are a little more feature-rich IMO and much cleaner than running shell commands from your application.
If the application that is pulling the data is written in Python, then I definitely recommend boto3. Make sure to read about the difference between a boto3 client and a boto3 resource.
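For illustration, here is a minimal boto3 sketch of the workflow described in the question: list objects whose LastModified falls in a time range, pull them, and filter the JSON records before writing them to a database. The bucket, prefix, date range, and the assumption that each object holds a JSON array with a status field are all placeholders.

# Sketch: pull S3 objects for a time range and filter their JSON contents.
import json
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"        # hypothetical bucket
PREFIX = "events/"               # hypothetical prefix
start = datetime(2023, 1, 1, tzinfo=timezone.utc)
end = datetime(2023, 1, 2, tzinfo=timezone.utc)

matched = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if not (start <= obj["LastModified"] <= end):
            continue
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        for record in json.loads(body):            # assumes a JSON array per object
            if record.get("status") == "active":   # example filter
                matched.append(record)

print(len(matched), "records matched")  # write `matched` to your DB here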
Some options:
Download and process: Launch a temporary EC2 instance, have a script download the files of interest (e.g. one day's files), then use a Python program to process the data. This gives you full control over what is happening.
Amazon S3 Select: This is a simple way to extract data from CSV files, but it only operates on a single file at a time.
Amazon Athena: Provides an SQL interface to query across multiple files using Presto. Serverless, fast. Charged based on the amount of data read from disk (so it is cheaper on compressed data).
Amazon EMR: Hadoop service that provides very efficient processing of large quantities of data. Highly configurable, but quite complex for new users.
Based on your description (10 files, 300MB, 200k records), I would recommend starting with Amazon Athena since it provides a friendly SQL interface across many data files. Start by running queries against one file (this makes testing faster) and, once you have the desired results, run it across all the data files.
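If you go the Athena route and later want to drive it from Python rather than the console, the query can be submitted through boto3 as well. This is only a sketch: the database, table and column names and the results bucket are assumptions, and the table has to be defined over the S3 JSON files first (for example with a CREATE EXTERNAL TABLE statement or a Glue crawler).

# Sketch: run an Athena query over the S3 data from Python and poll for results.
import time
import boto3

athena = boto3.client("athena")

query = """
    SELECT *
    FROM events   -- hypothetical table defined over the S3 JSON files
    WHERE event_time BETWEEN timestamp '2023-01-01 00:00:00'
                         AND timestamp '2023-01-02 00:00:00'
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_database"},                    # hypothetical
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},    # hypothetical
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(len(rows) - 1, "rows returned")  # the first row is the header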

hadoop fs -du / gsutil du is running slow on GCP

I am trying to obtain the size of directories in a Google Cloud Storage bucket, but the command runs for a very long time.
I have tried with 8 TB of data spread across about 24k subdirectories and files; it takes around 20-25 minutes, whereas the same data on HDFS takes less than a minute to report the size.
The commands I use to get the size:
hadoop fs -du gs://mybucket
gsutil du gs://mybucket
Please suggest how I can do it faster.
The two commands are nearly identical; the only difference is that hadoop fs -du goes through the GCS Connector.
GCS calculates usage by making list requests, which can take a long time if you have a large number of objects.
This article suggests setting up Access Logs as an alternative to gsutil du:
https://cloud.google.com/storage/docs/working-with-big-data#data
However, you will likely still incur the same 20-25 minute cost if you intend to do any analytics on the data. From GCS Best Practices guide:
Forward slashes in objects have no special meaning to Cloud Storage, as there is no native directory support. Because of this, deeply nested directory-like structures using slash delimiters are possible, but won't have the performance of a native filesystem listing deeply nested sub-directories.
Assuming that you intend to analyze this data, you may want to consider benchmarking fetch performance for different file sizes and glob expressions with time hadoop distcp.
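To see why the listing dominates the runtime, here is a small Python sketch (using the google-cloud-storage client; the bucket name and prefix are placeholders) that computes a "directory" size the same way gsutil du and the GCS Connector have to: by listing every object under the prefix and summing their sizes.

# Sketch: compute a "directory" size in GCS by listing objects under a prefix.
from google.cloud import storage

client = storage.Client()
total_bytes = 0
object_count = 0

# There is no directory metadata in GCS, so this paginated listing over all
# objects under the prefix is what takes the time at large object counts.
for blob in client.list_blobs("mybucket", prefix="some/dir/"):
    total_bytes += blob.size
    object_count += 1

print(object_count, "objects,", round(total_bytes / 1024**3, 2), "GiB")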