Is an S3 Object the same as a file? - amazon-web-services

I was looking at the growth of my bucket (I run an image hosting social network), and the number of new objects per month is ten times larger than what I would expect from the number of files people upload through my website.
Of course, I generate a variable number of thumbnails and have other sources of files as well, but a 10x deviation from expectation is quite massive.
If my bucket metrics indicate 500,000 new objects in a month, does that actually mean 500,000 new files?
Thanks!

Related

AWS Redshift and small datasets

I have an S3 bucket to which many different small files are uploaded (2 files of 1 kB each per minute).
Is it good practice to ingest them into Redshift immediately, triggered by a Lambda?
Or would it be better to push them to a staging area such as Postgres and then, at the end of the day, do a batch ETL from the staging area into Redshift?
Or should I build a manifest file containing all of the file names for the day and use the COPY command to ingest them into Redshift?
As Mitch says, #3. Redshift wants to work on large data sets, and if you ingest small batches many times you will need to vacuum the table. Loading many files at once avoids this.
However, there is another potential problem - your files are too small for efficient bulk retrieval from S3. S3 is an object store and each request needs to be translated from a bucket/object-key pair to a location in S3. This takes on the order of 0.5 seconds. That's not an issue when loading a few files at a time, but if you need to load a million of them in series, that's 500K seconds of lookup time. Redshift will run the COPY in parallel, but only up to the number of slices in your cluster - it is still going to take a long time.
So depending on your needs you may need to rethink your use of S3. If so, you may end up with a Lambda that combines small files into bigger ones as part of your solution. You can run this as a parallel process to the Redshift COPY if you only need to load many, many files at once during some recovery process. But an archive of 1 billion 1 kB files will be near useless if they need to be loaded quickly.
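For option #3, a rough sketch of the manifest-plus-COPY pattern might look like the following. The bucket, prefix, table, and IAM role names are placeholders I've made up, not anything from the question:

```python
# Sketch: build a daily manifest of the small S3 files and load them with a
# single COPY, instead of one COPY per file.
import json
import boto3

s3 = boto3.client("s3")
bucket = "my-ingest-bucket"        # placeholder: your landing bucket
prefix = "events/2023-01-15/"      # placeholder: one prefix per day

# Collect the day's object keys.
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

# Write a Redshift COPY manifest listing every file for the day.
manifest = {
    "entries": [
        {"url": f"s3://{bucket}/{key}", "mandatory": True} for key in keys
    ]
}
s3.put_object(
    Bucket=bucket,
    Key="manifests/2023-01-15.manifest",
    Body=json.dumps(manifest),
)

# Then issue one COPY against the manifest (e.g. via psycopg2 or the
# Redshift Data API); the role ARN below is a placeholder.
copy_sql = f"""
COPY my_schema.my_table
FROM 's3://{bucket}/manifests/2023-01-15.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
MANIFEST
FORMAT AS CSV;
"""
```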

If I enable versioning on an S3 bucket, how long does it take for all objects to have a version ID?

Is the time based on the number of objects? The high end of the number of objects I want to do this for is in the tens of millions.
I have tried this on a few buckets, but want a more accurate expectation of the time involved.

Amazon S3 Store Millions of Files

I am trying to find the most cost effective way of doing this, will appreciate any help:
I have 100s of millions of files. Each file is under 1 MB (usually around 100 KB)
In total this is over 5 TB of data as of now, and it will grow weekly
I cannot merge/concatenate the files together. The files must be stored as is
Query and download requirements are basic: around 1 million files to be selected and downloaded per month
I am not worried about S3 storage or Data Retrieval or Data Scan cost.
My question is: when I upload 100s of millions of files, does this count as one PUT request per file (i.e. one per object)? If so, just the cost to upload the data will be massive. If I upload a directory with a million files, is that one PUT request?
What if I zip the 100 million files on-prem, then upload the zip, and use a Lambda to unzip it? Would that count as one PUT request?
Any advice?
You say that you have "100s of millions of files", so I shall assume you have 400 million objects, making 40TB of storage. Please adjust accordingly. I have shown my calculations so that people can help identify my errors.
Initial upload
PUT requests in Amazon S3 are charged at $0.005 per 1,000 requests. Therefore, 400 million PUTs would cost $2000. (.005*400m/1000)
This cost cannot be avoided if you wish to create them all as individual objects.
Future uploads would be the same cost at $5 per million.
Storage
Standard storage costs $0.023 per GB, so storing 400 million 100KB objects would cost $920/month. (.023*400m*100/1m)
Storage costs can be reduced by using lower-cost Storage Classes.
Access
GET requests are $0.0004 per 1,000 requests, so downloading 1 million objects each month would cost 40c/month. (.0004*1m/1000)
If the data is being transferred to the Internet, Data Transfer costs of $0.09 per GB would apply. The Data Transfer cost of downloading 1 million 100KB objects would be $9/month. (.09*1m*100/1m)
Analysis
You seem to be most worried about the initial cost of uploading 100s of millions of objects, at a cost of $5 per million objects.
However, storage costs will also be high, at $2.30/month per million objects ($920/month for 400 million objects). That ongoing cost is likely to dwarf the cost of the initial uploads.
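For reference, a tiny script that reproduces the arithmetic above under the same assumption of 400 million objects of 100 KB each (adjust the inputs to your real counts):

```python
# Back-of-the-envelope S3 cost estimate, using the prices quoted above.
objects = 400_000_000
object_kb = 100
total_gb = objects * object_kb / 1_000_000      # 40,000 GB

put_cost = objects / 1000 * 0.005               # $2,000 one-off upload
storage_monthly = total_gb * 0.023              # $920/month in S3 Standard

monthly_downloads = 1_000_000
get_cost = monthly_downloads / 1000 * 0.0004    # $0.40/month in GET requests
transfer_gb = monthly_downloads * object_kb / 1_000_000
transfer_cost = transfer_gb * 0.09              # $9/month to the Internet

print(put_cost, storage_monthly, get_cost, transfer_cost)
```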
Some alternatives would be:
Store the data on-premises (disk storage is $100/4TB, so 400m files would require $1000 of disks, but you would want extra drives for redundancy), or
Store the data in a database: There are no 'PUT' costs for databases, but you would need to pay for running the database. This might work out cheaper, or
Combine the data in the files (which you say you do not wish to do), but in a way that can be easily split apart again - for example, marking records with an identifier for easy extraction, or
Use a different storage service, such as Digital Ocean, which does not appear to charge for PUTs.

Use case of Amazon S3 Select

I took a look at the link and am trying to understand what S3 Select is.
Most applications have to retrieve the entire object and then filter out only the required data for further analysis. S3 Select enables applications to offload the heavy lifting of filtering and accessing data inside objects to the Amazon S3 service.
Based on the statement above, I am trying to imagine the proper use case.
Is it right that if I have a single Excel file with 100 million rows sitting on S3, I can use S3 Select to query just the rows I need, instead of downloading all 100 million rows?
There are many use cases, but two obvious ones are centralization and time efficiency.
Let's say you have this "single Excel file with 100 million rows" in S3. Now if you have several people/departments/branches that need to access it, all of them would have to download it, store it, and process it. Since each of them downloads it separately, in no time you would end up with all of them either having an old version of the file (a new version could have been uploaded to S3), or simply different versions - one person working on today's version, another on last week's. With S3 Select, all of them query and get data from the one version of the object stored in S3.
Also, if you have 100 million records, retrieving only the selected data can save you a lot of time. Just imagine one person needing only 10 records from this file and another person needing 1,000. Instead of downloading 100 million records, the first person uses S3 Select to find their 10 records, while the other gets just their 1,000 - all without ever downloading the 100 million records.
Even more benefit comes from using S3 Select on Glacier, from which you can't readily download your files on demand.
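As a rough illustration (assuming the data is in a format S3 Select understands, such as CSV exported from the Excel file, and using placeholder bucket/key/column names), a boto3 call might look like this:

```python
# Sketch: pull only the matching rows from a large CSV object in S3 instead
# of downloading the whole file.
import boto3

s3 = boto3.client("s3")
response = s3.select_object_content(
    Bucket="my-data-bucket",      # placeholder
    Key="exports/orders.csv",     # placeholder: CSV export of the big file
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.\"customer_id\" = '12345' LIMIT 10",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The result streams back as events; only the filtered rows cross the wire.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```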

Recommended way to read an entire table (Lambda, DynamoDB/S3)

I'm new to AWS and am working on a serverless application where one function needs to read a large array of data. A single item will never be read from the table, but all the items will routinely be updated by a scheduled function.
What is your recommendation for the most efficient way of handling this scenario? My current implementation uses the Scan operation on a DynamoDB table, but with my limited experience I'm unsure whether this will be performant in production. Would it be better to store the data as a JSON file on S3, perhaps? And if so, would it be as easy to update the values with a scheduled function?
Thanks for your time.
PS: to give an idea of the size of the database, there will be ~1,500 items, each containing an array of up to ~100 strings
It depends on the size of each item - here is how:
First of all, with either DynamoDB or S3 you pay for two things (in your case*):
1- Requests per month
2- Storage per month
If you have small items, the first component (requests) can be up to 577 times cheaper if you read items from DynamoDB instead of S3.
How: S3 charges $0.01 per 1,000 requests, compared to roughly 5.2 million reads (of up to 4 KB each) per month from a single read capacity unit in DynamoDB. On top of that you pay $0.01 per GB for data retrieval from S3, which has to be added to that price. However, your writes into S3 will be free, while you pay for each write into DynamoDB (which is almost 4 times more expensive than a read).
However, if your items are large enough that each read consumes many RCUs, S3 may be cheaper in that case.
And regarding the storage cost, S3 is cheaper, but again you should look at how big your data will be: you pay at most $0.023 per GB per month for S3, while DynamoDB costs $0.25 per GB per month, which is almost 10 times more expensive.
Conclusion:
If you have many requests and your items are small, it is easier and more straightforward to use DynamoDB, as you're not giving up any of the query functionality that DynamoDB offers and that you clearly won't have if you use S3. Otherwise, you can consider storing the objects in S3 and keeping pointers to their locations in DynamoDB.
(*) The costs you pay for tags in S3 or indexes in DynamoDB are other factors to consider in case you need to use them.
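For context, the Scan approach mentioned in the question would look roughly like this with boto3 (the table name is a placeholder); at ~1,500 small items this is only a handful of paginated calls:

```python
# Sketch: read an entire DynamoDB table with a paginated Scan.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("schedules")   # placeholder table name

items = []
response = table.scan()
items.extend(response["Items"])
# Scan returns at most 1 MB per call, so follow LastEvaluatedKey until done.
while "LastEvaluatedKey" in response:
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])

print(f"Read {len(items)} items")
```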
Here is how I would do it:
Schedule Updates:
Lambda (to handle schedule changes) --> DynamoDB --> DynamoDB Stream --> Lambda (read if it exists, apply changes to all objects, and save to a single object in S3)
Read Schedule:
With Lambda, read the single object from S3 and serve all the schedules or a single schedule depending on the request. You can check whether the object has been modified before reading it again, so you don't need to hit S3 on every request and can serve from memory.
Scalability:
If you want to scale, you need to split the data into objects of a certain size so that you never load a single object exceeding the 3 GB Lambda memory limit (Lambda process memory size).
Hope this helps.
EDIT1:
When your serving Lambda cold starts, load the object from S3 first; after that, you can check S3 for an updated object (after a certain time interval or a certain number of requests) using the last-modified date attribute.
You can also keep that data in Lambda memory and serve from memory until the object is updated.
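A minimal sketch of that serving Lambda, assuming a single combined JSON object (bucket and key names are placeholders) and using the object's ETag rather than the last-modified date to decide whether to re-download - same idea, and it checks on every request for simplicity:

```python
# Sketch: cache the S3 object in Lambda memory across warm invocations and
# only re-download when it has changed.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-schedule-bucket"     # placeholder
KEY = "schedules/all.json"        # placeholder: single combined object

_cache = {"etag": None, "data": None}

def _load_if_changed():
    # A HEAD request is cheap; only GET the body when the ETag differs.
    head = s3.head_object(Bucket=BUCKET, Key=KEY)
    if head["ETag"] != _cache["etag"]:
        obj = s3.get_object(Bucket=BUCKET, Key=KEY)
        _cache["data"] = json.loads(obj["Body"].read())
        _cache["etag"] = head["ETag"]

def handler(event, context):
    # On a cold start the cache is empty; on warm invocations we usually
    # serve straight from memory.
    _load_if_changed()
    return _cache["data"]
```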