Amazon S3 Store Millions of Files

I am trying to find the most cost-effective way of doing this and would appreciate any help:
I have hundreds of millions of files. Each file is under 1 MB (usually around 100 KB).
In total this is over 5 TB of data - as of now, and this will grow weekly
I cannot merge/concatenate the files together. The files must be stored as is
Query and download requirements are basic. Around 1 million files will be selected and downloaded per month
I am not worried about S3 storage or Data Retrieval or Data Scan cost.
My question is: when I upload hundreds of millions of files, does this count as one PUT request per file (one per object)? If so, the cost just to upload the data will be massive. If I upload a directory containing a million files, is that one PUT request?
What if I zip the 100 million files on-premises, then upload the zip and use a Lambda to unzip it? Would that count as one PUT request?
Any advice?

You say that you have "100s of millions of files", so I shall assume you have 400 million objects, making 40TB of storage. Please adjust accordingly. I have shown my calculations so that people can help identify my errors.
Initial upload
PUT requests in Amazon S3 are charged at $0.005 per 1,000 requests. Therefore, 400 million PUTs would cost $2000. (.005*400m/1000)
This cost cannot be avoided if you wish to create them all as individual objects.
Future uploads would be the same cost at $5 per million.
Storage
Standard storage costs $0.023 per GB, so storing 400 million 100KB objects would cost $920/month. (.023*400m*100/1m)
Storage costs can be reduced by using lower-cost Storage Classes.
Access
GET requests are $0.0004 per 1,000 requests, so downloading 1 million objects each month would cost 40c/month. (.0004*1m/1000)
If the data is being transferred to the Internet, Data Transfer costs of $0.09 per GB would apply. The Data Transfer cost of downloading 1 million 100KB objects would be $9/month. (.09*1m*100/1m)
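As a sanity check on the figures above, here is a minimal sketch of the same arithmetic in Python. The object count, average size, and unit prices are the assumptions used in this answer (S3 Standard pricing); substitute your own numbers and check the current pricing page before relying on it.

```python
# Rough cost model reproducing the arithmetic in this answer.
OBJECTS        = 400_000_000   # assumed object count
OBJECT_KB      = 100           # assumed average object size in KB
PUT_PER_1000   = 0.005         # $ per 1,000 PUT requests
GET_PER_1000   = 0.0004        # $ per 1,000 GET requests
STORAGE_PER_GB = 0.023         # $ per GB-month, S3 Standard
XFER_PER_GB    = 0.09          # $ per GB transferred out to the Internet

total_gb     = OBJECTS * OBJECT_KB / 1_000_000              # 40,000 GB
upload_cost  = OBJECTS / 1_000 * PUT_PER_1000               # $2,000 one-off
storage_cost = total_gb * STORAGE_PER_GB                    # $920 / month
monthly_gets = 1_000_000
get_cost     = monthly_gets / 1_000 * GET_PER_1000          # $0.40 / month
xfer_cost    = monthly_gets * OBJECT_KB / 1_000_000 * XFER_PER_GB  # $9 / month

print(f"upload ${upload_cost:,.0f}, storage ${storage_cost:,.0f}/month, "
      f"GET ${get_cost:.2f}/month, transfer ${xfer_cost:.0f}/month")
```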
Analysis
You seem to be most fearful of the initial cost of uploading 100s of millions of objects at a cost of $5 per million objects.
However, storage will also be significant, at a cost of $2.30/month per million objects ($920/month for 400 million objects). That ongoing cost is likely to dwarf the cost of the initial uploads.
Some alternatives would be:
Store the data on-premises (disk storage is roughly $100 per 4 TB, so 400 million files would require about $1,000 of disks, but you would want extra drives for redundancy), or
Store the data in a database: there are no 'PUT' costs for databases, but you would need to pay to run the database. This might work out cheaper, or
Combine the data in the files (which you say you do not wish to do), but in a way that can easily be split apart again, for example by marking records with an identifier for easy extraction, or
Use a different storage service, such as DigitalOcean, which does not appear to charge for 'PUT' requests.

Related

AWS Redshift and small datasets

I have an S3 bucket to which many small files are uploaded (two 1 KB files per minute).
Is it good practice to ingest them into Redshift immediately, using a Lambda trigger?
Or would it be better to push them to a staging area such as Postgres and then, at the end of the day, run a batch ETL from the staging area to Redshift?
Or should I build a manifest file containing all of the file names for the day and use the COPY command to ingest them into Redshift?
As Mitch says, #3. Redshift wants to work on large data sets and if you ingest small things many times you will need to vacuum the table. Loading many files at once fixes this.
However, there is another potential problem - your files are too small for efficient bulk retrieval from S3. S3 is an object store, and each request needs to be translated from a bucket/object-key pair to a location in S3. This takes on the order of .5 seconds. That's not an issue when loading a few files at a time, but if you need to load a million of them in series, that's 500K seconds of lookup time. Redshift will do the COPY in parallel, but only up to the number of slices you have in your cluster - it is still going to take a long time.
So depending on your needs you may need to think about a change in your use of S3. If so, you may end up with a Lambda that combines small files into bigger ones as part of your solution. You can do this in parallel with the Redshift COPY if you only need to load many, many files at once during some recovery process. But an archive of 1 billion 1KB files will be near useless if they need to be loaded quickly.
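If you go with the manifest approach, a rough sketch of the plumbing might look like the following. The bucket, prefix, table, and IAM role names are placeholders, not anything from the question; the idea is simply to list a day's worth of small objects in one manifest and load them with a single COPY.

```python
import json
import boto3

s3 = boto3.client("s3")

BUCKET = "my-ingest-bucket"                    # placeholder bucket name
PREFIX = "events/2023-01-15/"                  # placeholder prefix for one day's files
MANIFEST_KEY = "manifests/2023-01-15.manifest"

# List every small file uploaded under the day's prefix and record it in the manifest.
entries = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        entries.append({"url": f"s3://{BUCKET}/{obj['Key']}", "mandatory": True})

s3.put_object(Bucket=BUCKET, Key=MANIFEST_KEY, Body=json.dumps({"entries": entries}))

# The nightly load is then a single COPY referencing the manifest, e.g.:
#   COPY my_table
#   FROM 's3://my-ingest-bucket/manifests/2023-01-15.manifest'
#   IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
#   MANIFEST;
```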

Is an S3 Object the same as a file?

I was looking at the growth of my bucket (I run an image hosting social network), and the number of new objects per month is ten times larger than what I would expect from the number of files people upload through my website.
Of course, I have a variable number of thumbnails and also other sources of files, but a 10x deviation from the expectation is quite massive.
If my bucket metrics indicate 500,000 new objects in a month, does that actually mean 500,000 new files?
Thanks!

BigQuery Data Transfer Service benchmarks for Campaign Manager data

There's some good info here on general transfer times over the wire for data to/from various sources.
Besides the raw data transfer time, I am trying to estimate roughly how long it would take to import ~12TB/day into BigQuery using the BigQuery Data Transfer service for DoubleClick Campaign Manager.
Is this documented anywhere?
In the first link you've shared, there's an image that shows the transfer speed (estimated) depending on the network bandwidth.
So let's say you have a bandwidth of 1 Gbps; then the data will be available in your GCP project in ~30 hours, as you are transferring 12 TB, which is close to 10 TB. That makes it roughly a day and a half to transfer.
If you really want to transfer 12 TB/day because you need that data to be available each day, and increasing bandwidth is not a possibility, I would recommend batching the data and creating a separate transfer service for each batch. As an example:
Split 12TB into 12 batches of 1TB -> 12 transfer jobs of 1TB each
Each batch will take about 3 hours to complete, so you will have 8 of the 12 TB available per day.
This can be applied to smaller batches of data if you want to have a more fine-grained solution.
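For reference, a minimal sketch of the back-of-the-envelope arithmetic behind those figures, assuming the link is fully saturated and ignoring protocol overhead (the estimates quoted above are a bit higher because they allow for such overhead):

```python
# Idealised transfer time: bits to move divided by link speed.
def transfer_hours(terabytes: float, gbps: float) -> float:
    bits = terabytes * 1e12 * 8          # decimal terabytes -> bits
    return bits / (gbps * 1e9) / 3600    # seconds -> hours

print(transfer_hours(12, 1))   # ~26.7 h for 12 TB over 1 Gbps (the "~30 hours" above)
print(transfer_hours(1, 1))    # ~2.2 h per 1 TB batch (the "3 hours" above)
```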

Creating prefixes in S3 to parallelise reads and increase performance

I'm doing some research and I was reading this page
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
It says
Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/POST/DELETE and 5,500 GET requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket. It is simple to increase your read or write performance exponentially. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelise reads, you could scale your read performance to 55,000 read requests per second.
I'm not sure what the last bit means. My understanding is that for the filename 'Australia/NSW/Sydney', the prefix is 'Australia/NSW'. Correct?
How does creating 10 of these improve your read performance? Do you create for example Australia/NSW1/, Australia/NSW2/, Australia/NSW3/, and then map them to a load balancer somehow?
S3 is designed like a Hashtable/HashMap in Java. The prefix forms the hash for the hash-bucket, and the actual files are stored in groups in these buckets.
To find a particular file you need to compare against all files in a hash-bucket, whereas getting to a hash-bucket itself is effectively instant (constant-time).
Thus the more descriptive the keys, the more hash-buckets and hence the fewer items in each bucket, which makes the lookup faster.
Eg. a bucket with tourist attraction details for all countries in the world
Bucket1: placeName.jpg (all files in the bucket no prefix)
Bucket2: countryName/state/placeName.jpg
Now if you are looking for Sydney.info in Australia/NSW, the lookup will be faster in the second bucket.
No, S3 doesn't connect to a load balancer, ever. This article covers the topic; the important highlights:
(...) keys in S3 are partitioned by prefix
(...)
Partitions are split either due to sustained high request rates, or because they contain a large number of keys (which would slow down lookups within the partition). There is overhead in moving keys into newly created partitions, but with request rates low and no special tricks, we can keep performance reasonably high even during partition split operations. This split operation happens dozens of times a day all over S3 and simply goes unnoticed from a user performance perspective. However, when request rates significantly increase on a single partition, partition splits become detrimental to request performance. How, then, do these heavier workloads work over time? Smart naming of the keys themselves!
So Australia/NSW/ could be read from one partition, while Australia/NSW1/ and Australia/NSW2/ might be read from two others. It doesn't have to be that way, but prefixes still give you some control over how the data is partitioned, because you have a better understanding of what kind of reads you will be doing on it. You should aim to have reads spread evenly over the prefixes.
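To make the "parallelise reads across prefixes" idea concrete, here is a minimal sketch that fans GET requests out over several prefixes at once. The bucket name and key layout are made up for illustration; the point is simply that each prefix is read by its own worker, so each one gets its own per-prefix request budget.

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "tourist-attractions"                               # placeholder bucket name
PREFIXES = [f"Australia/NSW{i}/" for i in range(1, 11)]      # the 10 prefixes from the example

def fetch_prefix(prefix):
    """Download every object under one prefix; each prefix runs in its own worker."""
    bodies = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            bodies.append(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())
    return bodies

# One thread per prefix, so the 5,500 GET/s per-prefix figure applies to each independently.
with ThreadPoolExecutor(max_workers=len(PREFIXES)) as pool:
    results = list(pool.map(fetch_prefix, PREFIXES))
```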

Recommended way to read an entire table (Lambda, DynamoDB/S3)

I'm new to AWS and am working on a serverless application where one function needs to read a large array of data. A single item will never be read from the table on its own, but all of the items will routinely be updated by a scheduled function.
What is your recommendation for the most efficient way of handling this scenario? My current implementation uses the Scan operation on a DynamoDB table, but with my limited experience I'm unsure whether this will be performant in production. Would it be better to store the data as a JSON file on S3? And if so, would it be as easy to update the values with a scheduled function?
Thanks for your time.
PS: to give an idea of the size of the database, there will be ~1500 items, each containing an array of up to ~100 strings
It depends on the size of each item. Here is how:
First of all, with either DynamoDB or S3 you pay for two things (in your case*):
1- Requests per month
2- Storage per month
If you have small items, the first case (requests) will be up to 577 times cheaper if you read the items from DynamoDB instead of S3.
How: $0.01 per 1,000 requests for S3, compared to 5.2 million reads (up to 4 KB each) per month for DynamoDB. Plus you pay $0.01 per GB for data retrieval from S3, which should be added to that price. However, your writes into S3 will be free, while you pay for each write into DynamoDB (and writes are almost four times more expensive than reads).
However, if your items need many RCUs per read, S3 might be cheaper.
Regarding storage cost, S3 is cheaper, but again you should look at how big your data will be: you pay at most $0.023 per GB-month for S3, while DynamoDB is $0.25 per GB-month, which is roughly ten times more expensive.
Conclusion:
If you have many requests and your items are small, it is easier and more straightforward to use DynamoDB, as you are not giving up any of the query functionality that DynamoDB provides and that you clearly would not have with S3. Otherwise, you can consider storing the objects in S3 and keeping pointers to their locations in DynamoDB.
(*) The costs of tags in S3 or indexes in DynamoDB are additional factors to consider if you need to use them.
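For the roughly 1,500-item table described in the question, a paginated Scan (the approach the question already uses) is a few pages of results at most. A minimal boto3 sketch, with a placeholder table name:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("schedules")    # placeholder table name

def scan_all_items(table):
    """Page through the whole table; ~1,500 small items is only a handful of pages."""
    items = []
    kwargs = {}
    while True:
        response = table.scan(**kwargs)
        items.extend(response["Items"])
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            return items
        kwargs["ExclusiveStartKey"] = last_key

all_items = scan_all_items(table)
```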
Here is how I would do it:
Schedule Updates:
Lambda (to handle schedule changes) --> DynamoDB --> DynamoDB Stream --> Lambda (read the existing object if there is one, apply the changes to all items, and save the result as a single object in S3)
Read Schedule:
Have a Lambda read the single object from S3 and serve all of the schedules, or a single schedule, depending on the request. You can check whether the object has been modified before reading it again, so you don't need to read from S3 on every request and can serve from memory instead.
Scalability:
If you need to scale, split the object into pieces of a certain size so that you do not load more than the Lambda memory limit (about 3 GB of process memory) at once.
Hope this helps.
EDIT1:
When your serving Lambda cold starts, load the object from S3 first; after that, you can check S3 for an updated object (after a certain time interval, or after a certain number of requests) using its last-modified date attribute.
You can also keep that data in Lambda memory and serve it from memory until the object is updated.
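A minimal sketch of that caching pattern, with placeholder bucket and key names, might look like this. A production version would only issue the HEAD check after a time interval or request count, as described above, rather than on every invocation.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "schedule-bucket", "schedules.json"   # placeholder names

# Module-level cache survives across warm invocations of the same Lambda container.
_cache = {"last_modified": None, "data": None}

def handler(event, context):
    head = s3.head_object(Bucket=BUCKET, Key=KEY)
    if head["LastModified"] != _cache["last_modified"]:
        # Cold start, or the scheduled updater rewrote the object: reload it.
        body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
        _cache["data"] = json.loads(body)
        _cache["last_modified"] = head["LastModified"]
    return _cache["data"]
```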