s3 bucket date path format for faster operations

I was told by one of the consultants from AWS that, when naming folders (objects) in S3 with a date, I should use MM-DD-YYYY for faster S3 operations like GetObject, but I usually use YYYY-MM-DD. I don't understand what difference it makes. Is there a difference, and if so, which one is better?

This used to be a limitation due to the way data was stored on the back end, but it no longer applies (to the original extent; see jellycsc's comment below).
The reason for this recommendation was that, in the past, Amazon Simple Storage Service (S3) partitioned data using the key. With many files sharing the same prefix (e.g. all starting with the same year), this could lead to reduced performance when many files needed to be loaded from the same partition.
However, since 2018, hashing and random prefixing of the S3 key are no longer required to see improved performance: https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/

S3 creates so-called partitions under the hood in order to serve your requests to the bucket. Each partition can serve 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second. S3 partitions the bucket based on the common prefix among the object keys. The MM-DD-YYYY date format would be slightly faster than YYYY-MM-DD because objects named with MM-DD-YYYY will spread across more partitions.
Key takeaway here: more randomness at the beginning of the object keys will likely give you more performance out of the S3 bucket.
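To make the prefix idea concrete, here is a minimal boto3 sketch (the bucket name, paths and the two-character hash scheme are all made up for illustration) contrasting a date-first key layout with one that puts some randomness up front. Per the 2018 announcement above, this trick is generally no longer necessary, but it shows what the old advice meant in practice:

```python
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "my-example-bucket"  # hypothetical bucket name

def date_key(date_str, filename):
    # All objects for a year share one long common prefix,
    # e.g. "2023/01/15/report.csv" -> everything starts with "2023/".
    return f"{date_str}/{filename}"

def hashed_key(date_str, filename):
    # A short hash up front spreads keys across many prefixes,
    # e.g. "3f/2023/01/15/report.csv".
    digest = hashlib.md5(f"{date_str}/{filename}".encode()).hexdigest()[:2]
    return f"{digest}/{date_str}/{filename}"

s3.put_object(Bucket=BUCKET, Key=hashed_key("2023/01/15", "report.csv"), Body=b"...")
```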

Related

Database suggestion for large unstructured datasets to integrate with elasticsearch

A scenario where we have millions of records saved in a database: currently I am using DynamoDB for saving metadata (and also for write, update and delete operations on objects), S3 for storing files (e.g. images, whose associated metadata is stored in DynamoDB) and Elasticsearch for indexing and searching. But due to DynamoDB's 400 KB limit per item (a single object), it is not sufficient for the data that needs to be saved. I thought about saving an object in different versions in DynamoDB itself, but that would be too complicated.
So I was thinking of replacing DynamoDB with better storage:
AWS DocumentDB
S3 for saving the metadata as well, along with the object files
Which of the two is the better option in your opinion, and why? Which is also cost effective? (It should also be easy to sync with Elasticsearch, but this ES syncing is not much of an issue, as it is possible for both somehow.)
If you have any other suggestions better than these two, you can also tell me those.
I would suggest looking at DocumentDB over Amazon S3 based on your use case for the following reasons:
Pricing for storing the data would be $0.023 per GB-month for Standard and $0.0125 per GB-month for Infrequent Access (whereas DocumentDB is $0.10 per GB-month); depending on your data size this could add up greatly. If you use IA, be aware that your retrieval costs could also add up greatly.
Whilst you would not get the data down directly, you would use either Athena or S3 Select to filter it. Depending on the data size being queried, it would take from a few seconds to possibly minutes (not the milliseconds you requested).
Unstructured data storage in S3, and the querying technologies around it, are more targeted at a data lake used for analysis, whereas DocumentDB is more geared towards performance within live applications (it is a MongoDB-compatible data store, after all).
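To illustrate the S3 Select point above, here is a hedged boto3 sketch of filtering a newline-delimited JSON object server-side; the bucket, key and field names are hypothetical, and the object layout is an assumption:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key; assumes the object is newline-delimited JSON.
resp = s3.select_object_content(
    Bucket="my-metadata-bucket",
    Key="images/metadata.jsonl",
    ExpressionType="SQL",
    Expression="SELECT s.id, s.size FROM S3Object s WHERE s.type = 'image'",
    InputSerialization={"JSON": {"Type": "LINES"}},
    OutputSerialization={"JSON": {}},
)

# The response is an event stream; matching records arrive in chunks.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```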

Using Amazon S3 as a limited-database

I have looked into this post on s3 vs database. But I have a different use case and want to know whether s3 is enough. The primary reason for using s3 instead of other databases on cloud is because of cost.
I have multiple scrapers that download data from websites and APIs every day. Most of them return data in JSON format. Currently, I insert it into MongoDB. I then run analysis by querying out data for a specific date, or for specific fields, or for records that match certain criteria. After querying the data, I usually load it into a dataframe and do what is needed.
The data will not be updated. It needs to be stored and be ready for retrieval according to some criteria. I am aware of S3 Select, which may be able to handle the retrieval task.
Any recommendations?
From the use cases you have mentioned above, it seems that you are not using MongoDB's capabilities (or any database capabilities, for that matter) to any great degree.
I think S3 suits your use cases well. In fact, you should go for S3 Infrequent Access with a lifecycle policy to archive and then finally purge the data, to be cost efficient.
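As a rough sketch of such a lifecycle policy (not part of the original answer; the bucket name, prefix and day thresholds are made up and should be adjusted to your retention needs), something like this transitions objects to Infrequent Access, then Glacier, then expires them:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and thresholds: IA after 30 days, Glacier after 180 days,
# deletion after 2 years.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-scraper-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-purge",
                "Status": "Enabled",
                "Filter": {"Prefix": "scraped/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 730},
            }
        ]
    },
)
```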
I hope it helps!
I think your code will be more efficient if you use DynamoDB with all its features. Using S3 as a database or data store will make your code more complex, since you need to retrieve the file from S3 and iterate through it every time. With DynamoDB you can easily query and filter the data that is required. In the end, S3 is file storage and DynamoDB is a database.

Creating prefixes in S3 to parallelise reads and increase performance

I'm doing some research and I was reading this page
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
It says
Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/POST/DELETE and 5,500 GET requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket. It is simple to increase your read or write performance exponentially. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelise reads, you could scale your read performance to 55,000 read requests per second.
I'm not sure what the last bit means. My understanding is that for the filename 'Australia/NSW/Sydney', the prefix is 'Australia/NSW'. Correct?
How does creating 10 of these improve your read performance? Do you create for example Australia/NSW1/, Australia/NSW2/, Australia/NSW3/, and then map them to a load balancer somehow?
S3 is designed like a Hashtable/HashMap in Java. The prefix forms the hash for the hash-bucket, and the actual files are stored in groups in these buckets.
To find a particular file you need to compare all the files in a hash-bucket, whereas getting to a hash-bucket is instant (constant time).
Thus the more descriptive the keys, the more hash-buckets and hence the fewer items in each bucket, which makes the lookup faster.
E.g. a bucket with tourist attraction details for all countries in the world:
Bucket 1: placeName.jpg (all files in the bucket, no prefix)
Bucket 2: countryName/state/placeName.jpg
Now if you are looking for Sydney.info in Australia/NSW, the lookup will be faster in the second bucket.
No, S3 doesn't connect to a load balancer, ever. This article covers the topic, but here are the important highlights:
(...) keys in S3 are partitioned by prefix
(...)
Partitions are split either due to sustained high request rates, or because they contain a large number of keys (which would slow down lookups within the partition). There is overhead in moving keys into newly created partitions, but with request rates low and no special tricks, we can keep performance reasonably high even during partition split operations. This split operation happens dozens of times a day all over S3 and simply goes unnoticed from a user performance perspective. However, when request rates significantly increase on a single partition, partition splits become detrimental to request performance. How, then, do these heavier workloads work over time? Smart naming of the keys themselves!
So Australia/NSW/ could be read from one partition, while Australia/NSW1/ and Australia/NSW2/ might be read from two others. It doesn't have to be that way, but prefixes still allow some control over how the data is partitioned, because you have a better understanding of what kind of reads you will be doing on it. You should aim to have reads spread evenly over the prefixes.
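As a rough sketch of what "parallelising reads across prefixes" can look like in client code (the bucket and prefix names are hypothetical, borrowed from the question), you can fan requests out over the prefixes with a thread pool so each prefix gets its own request-rate allowance:

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "tourism-data"  # hypothetical bucket

# Hypothetical prefixes; each one can end up on its own partition.
PREFIXES = [f"Australia/NSW{i}/" for i in range(1, 11)]

def fetch_keys(prefix):
    # List all object keys under one prefix.
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

# Reads against different prefixes run concurrently.
with ThreadPoolExecutor(max_workers=len(PREFIXES)) as pool:
    results = list(pool.map(fetch_keys, PREFIXES))
```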

AWS hosted data storage for storing simple entities

I need to choose data storage for a simple system. The main purpose of the system is storing events: simple entities with a timestamp, user id and type. No joins. Just a single table.
Stored data will be fetched rarely (compared with writes). I expect the following read operations:
get latest events for a list of users
get latest events of a type for a list of users
I expect about 0.5-1 million writes a day. Data older than 2 years can be removed.
I'm looking for the best-fitting service provided by AWS. I wonder if using Redshift would be like taking a sledgehammer to crack a nut?
For your requirement you can use AWS DynamoDB and define TTL values to remove older items automatically. You get the following advantages:
Fully managed data storage
Able to scale with the need for write throughput (though it can be costly)
Use a sort key with the timestamp to query the latest items (see the sketch below).
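A minimal sketch of that table layout and the "latest events" query, assuming boto3 and a hypothetical table named events with user_id as the partition key and ts (an ISO-8601 timestamp) as the sort key:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("events")  # hypothetical table name

# Partition key: user_id, sort key: ts, plus a TTL attribute so items
# older than 2 years are removed automatically.
def latest_events(user_id, limit=20):
    resp = table.query(
        KeyConditionExpression=Key("user_id").eq(user_id),
        ScanIndexForward=False,  # newest first, thanks to the timestamp sort key
        Limit=limit,
    )
    return resp["Items"]

# "Latest events of a type for a list of users" could filter on a type
# attribute, or use a GSI keyed on the type if it needs to stay cheap.
```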
I would also suggest checking AWS SimpleDB, as at first glance it looks like a better fit for your requirements.
Please refer to this article, which describes some practical user experience:
http://www.masonzhang.com/2013/06/2-reasons-why-we-select-simpledb.html

S3 storing JSON vs DynamoDB

Taking into consideration that DynamoDB is quite a bit more expensive than S3...
Why not store JSON files in S3 instead of using DynamoDB as a store?
One disadvantage of this approach could be querying, filtering or even paging. But let's say the system is very simple and it only queries by id. The id could be the name (or key) of the file.
Another point could be concurrency. But let's say that users only access/write their own data.
Is there any other scenario or fact which will make S3 a really bad choice?
I agree that if the features below, which DynamoDB provides, are ruled out, i.e.
Concurrency
Indexing (for faster access)
Other features like secondary indexes
then S3 can be used, as it is ultimately a key-value object store.
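A minimal sketch of the "query by id only" pattern with S3, where the id is the object key; the bucket and prefix names are made up for illustration:

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "user-json-store"  # hypothetical bucket

def put_record(record_id, record):
    # The id is the key, so lookups need no index.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"records/{record_id}.json",
        Body=json.dumps(record).encode(),
        ContentType="application/json",
    )

def get_record(record_id):
    obj = s3.get_object(Bucket=BUCKET, Key=f"records/{record_id}.json")
    return json.loads(obj["Body"].read())
```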