Given that DynamoDB is considerably more expensive than S3...
Why not store JSON files in S3 instead of using DynamoDB as a store?
One disadvantage of this approach could be querying, filtering, or even paging. But let's say the system is very simple and only queries by id. The id could be the name (or key) of the file.
Another point could be concurrency. But let's say that users only access and write their own data.
Is there any other scenario or fact that would make S3 a really bad choice?
I agree, provided the following DynamoDB features can be ruled out:
Concurrency
Indexing (for faster access)
Other features like secondary indexes
In that case S3 can be used, since it ultimately stores objects as key-value pairs.
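As a rough illustration of the "query by id only" case discussed above, here is a minimal sketch that uses the id as the object key. The class names, the `users/<user_id>/<doc_id>.json` key layout, and the in-memory client are assumptions made for this example; the store only relies on `put_object` and `get_object`, which a boto3 S3 client also provides.

```python
import io
import json


class JsonObjectStore:
    """Stores one JSON document per id, using the id as the object key."""

    def __init__(self, client, bucket):
        # `client` is expected to behave like a boto3 S3 client
        # (put_object / get_object); it is injected so it can be swapped out.
        self.client = client
        self.bucket = bucket

    def _key(self, user_id, doc_id):
        # Hypothetical layout: one key prefix per user, one object per document.
        return f"users/{user_id}/{doc_id}.json"

    def put(self, user_id, doc_id, document):
        body = json.dumps(document).encode("utf-8")
        self.client.put_object(
            Bucket=self.bucket, Key=self._key(user_id, doc_id), Body=body
        )

    def get(self, user_id, doc_id):
        response = self.client.get_object(
            Bucket=self.bucket, Key=self._key(user_id, doc_id)
        )
        return json.loads(response["Body"].read())


class InMemoryClient:
    """Tiny stand-in for the S3 client so the sketch runs without AWS access."""

    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self.objects[(Bucket, Key)])}


store = JsonObjectStore(InMemoryClient(), "my-bucket")
store.put("u1", "42", {"name": "example"})
print(store.get("u1", "42"))
```

Anything beyond lookup by key (filtering, paging, conditional writes) would have to be built on top of this, which is where the trade-off against DynamoDB shows up.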
Consider a scenario with millions of records saved in a database. I currently use DynamoDB for metadata (with write, update, and delete operations on objects), S3 for files (for example images, whose associated metadata is stored in DynamoDB), and Elasticsearch for indexing and searching. But DynamoDB's limit of 400 KB per item (a single row) is not enough for the data I need to save. I thought about splitting an object across several versions within DynamoDB itself, but that would be too complicated.
So I am thinking of replacing DynamoDB with some better storage:
Amazon DocumentDB
S3 for saving metadata as well, along with the object files
Which of the two is the better option in your opinion, and why? It should also be cost effective. (It should also be easy to sync with Elasticsearch, but syncing is not much of an issue since it is possible with both.)
If you have better suggestions than these two, please share them as well.
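To make the 400 KB constraint above concrete, here is a minimal sketch that decides whether a metadata document can stay inline in a DynamoDB item or should be written to S3 with only a pointer kept in the item. The function name, the `metadata/<id>.json` key, and the attribute names are assumptions for illustration, and the JSON byte length is only an approximation of how DynamoDB measures item size.

```python
import json

DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024  # DynamoDB's 400 KB limit per item


def plan_storage(object_id, metadata, limit=DYNAMODB_ITEM_LIMIT_BYTES):
    """Return the item to store in DynamoDB for this object.

    Small metadata stays inline; large metadata is replaced by a pointer
    to a (hypothetical) S3 key where the full document would be written.
    """
    # Approximation: DynamoDB counts attribute names and values itself,
    # so the JSON size is only a rough stand-in for the real item size.
    size = len(json.dumps(metadata).encode("utf-8"))
    if size <= limit:
        return {"id": object_id, "metadata": metadata}
    return {
        "id": object_id,
        "metadata_s3_key": f"metadata/{object_id}.json",
        "metadata_bytes": size,
    }


print(plan_storage("img-1", {"width": 800, "height": 600}))
print(plan_storage("img-2", {"blob": "x" * 500_000}).keys())
```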
I would suggest looking at DocumentDB over Amazon S3 for your use case, with the following considerations:
S3 storage is priced at $0.023 per GB per month for Standard and $0.0125 for Infrequent Access (IA), whereas DocumentDB is $0.10 per GB-month. Depending on your data size, the difference could add up considerably. If you use IA, be aware that retrieval costs can also become significant.
With S3 you would not fetch the data directly; you would use either Athena or S3 Select to filter it. Depending on the size of the data being queried, this would take from a few seconds to possibly minutes (not the milliseconds you requested).
S3 and the querying technologies around it are targeted more at unstructured data in a data lake used for analysis, whereas DocumentDB is geared toward performance within live applications (it is a MongoDB-compatible data store, after all).
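To put the storage prices quoted above in perspective, here is a small calculation. It covers storage only and uses the per-GB figures from this answer as they stand; request, retrieval, and instance costs are deliberately left out, and current AWS pricing may differ.

```python
# Per GB per month, as quoted in the answer above (storage only).
PRICE_PER_GB_MONTH = {
    "s3_standard": 0.023,
    "s3_infrequent_access": 0.0125,
    "documentdb": 0.10,
}


def monthly_storage_cost(size_gb):
    """Monthly storage cost in USD for each option, rounded to cents."""
    return {
        name: round(price * size_gb, 2)
        for name, price in PRICE_PER_GB_MONTH.items()
    }


for name, cost in monthly_storage_cost(1000).items():
    print(f"{name}: ${cost:.2f} per month for 1000 GB")
```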
I was told by an AWS consultant that when naming folders (objects) in S3 with a date, I should use MM-DD-YYYY for faster S3 operations such as GetObject, but I usually use YYYY-MM-DD. I don't understand what difference it makes. Is there a difference, and if so, which one is better?
This used to be a limitation due to the way data was stored in the back end, but it no longer applies (to the original extent, see jellycsc's comment below).
The reason for this recommendation was that, in the past, Amazon Simple Storage Service (S3) partitioned data using the key. With many files sharing the same prefix (for example, all starting with the same year), this could lead to reduced performance when many files needed to be loaded from the same partition.
However, since 2018, hashing and random prefixing the S3 key is no longer required to see improved performance: https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/
S3 creates so-called partitions under the hood in order to serve requests to the bucket. Each partition can serve 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second. S3 partitions the bucket based on the common prefix among the object keys. The MM-DD-YYYY date format would be slightly faster than YYYY-MM-DD because objects named with MM-DD-YYYY spread across more partitions.
Key takeaway: more randomness at the beginning of the object keys will likely give you more performance out of the S3 bucket.
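The point about prefix variety can be illustrated by counting how many distinct leading prefixes each date format produces over one year of daily keys. This is only a sketch of the naming argument: how S3 actually splits partitions is internal to the service, and the helper names here are made up for the example.

```python
from datetime import date, timedelta


def daily_keys(year, fmt):
    """One key per day of `year`, formatted with the given strftime pattern."""
    day = date(year, 1, 1)
    keys = []
    while day.year == year:
        keys.append(day.strftime(fmt))
        day += timedelta(days=1)
    return keys


def distinct_prefixes(keys, length):
    """Number of different leading substrings of the given length."""
    return len({key[:length] for key in keys})


month_first = daily_keys(2023, "%m-%d-%Y")  # e.g. 03-15-2023
year_first = daily_keys(2023, "%Y-%m-%d")   # e.g. 2023-03-15

for length in (2, 5):
    print(
        length,
        distinct_prefixes(month_first, length),
        distinct_prefixes(year_first, length),
    )
```

With year-first keys every object written in the same year shares the same leading characters, while month-first keys already differ in the first two characters.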
I have looked into this post on S3 vs. a database, but I have a different use case and want to know whether S3 is enough. The primary reason for using S3 instead of another cloud database is cost.
I have multiple scrapers that download data from websites and APIs every day. Most of them return data in JSON format. Currently, I insert the data into MongoDB. I then run analyses by querying the data for a specific date, for specific fields, or for records that match certain criteria. After querying the data, I usually load it into a dataframe and do what is needed.
The data will not be updated. It needs to be stored and ready for retrieval according to some criteria. I am aware of S3 Select, which may be able to handle the retrieval task.
Any recommendations?
From the use cases you have mentioned, it seems that you are not making much use of MongoDB's capabilities (or any database capability, for that matter).
I think S3 suits your use case well. In fact, to be cost efficient, you should go for S3 Infrequent Access with a lifecycle policy that archives the data and finally purges it.
I hope this helps!
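A lifecycle policy of the kind suggested above could look roughly like the following. The rule id and the day thresholds (30, 90, 730) are assumptions chosen for illustration. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`, but the sketch only builds it and does not call AWS.

```python
def lifecycle_rules(prefix="", ia_after_days=30, archive_after_days=90,
                    purge_after_days=730):
    """Build a lifecycle configuration: Standard-IA, then Glacier, then delete."""
    if not ia_after_days < archive_after_days < purge_after_days:
        raise ValueError("thresholds must increase: IA < archive < purge")
    return {
        "Rules": [
            {
                "ID": "tier-then-purge",        # hypothetical rule name
                "Filter": {"Prefix": prefix},   # empty prefix = whole bucket
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": archive_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": purge_after_days},
            }
        ]
    }


config = lifecycle_rules(prefix="scraped/")
print(config["Rules"][0]["Transitions"])
```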
I think your code will be more efficient if you use DynamoDB with all its features. Using S3 as a database or data store will make your code more complex, since you need to retrieve the file from S3 and iterate through it every time. With DynamoDB you can easily query and filter for exactly the data you need. In the end, S3 is file storage and DynamoDB is a database.
I need to choose data storage for a simple system. The main purpose of the system is storing events: simple entities with a timestamp, user id, and type. No joins, just a single table.
Stored data will be fetched rarely (compared with writes). I expect the following read operations:
get latest events for a list of users
get latest events of a type for a list of users
I expect about 0.5-1 million writes a day. Data older than 2 years can be removed.
I'm looking for the best-fitting service provided by AWS. I wonder whether using Redshift would be like taking a sledgehammer to crack a nut.
For your requirements you can use AWS DynamoDB and define TTL values to remove older items automatically. You get the following advantages:
Fully managed data storage
Able to scale with the need for write throughput (though it can be costly)
Use sort key with timestamp to query latest items.
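A sketch of what an event item could look like under this suggestion: the user id as partition key, the timestamp as sort key, and a TTL attribute holding the expiry time in epoch seconds (the format DynamoDB TTL expects). The attribute names and the 730-day retention are assumptions derived from the "2 years" in the question.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=730)  # roughly the 2 years mentioned in the question


def build_event_item(user_id, event_type, occurred_at):
    """Build an event item with a TTL attribute for automatic expiry."""
    return {
        "user_id": user_id,                      # partition key (assumed name)
        # Sort key: ISO 8601 strings sort chronologically as long as
        # every timestamp uses the same timezone and format (UTC here).
        "occurred_at": occurred_at.isoformat(),
        "type": event_type,
        # TTL attribute: expiry time as a number of epoch seconds.
        "expires_at": int((occurred_at + RETENTION).timestamp()),
    }


item = build_event_item(
    "user-1", "login", datetime(2024, 1, 1, tzinfo=timezone.utc)
)
print(item)
```

Querying the latest events for a user then becomes a query on the partition key with descending sort order and a limit.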
I would also suggest checking out Amazon SimpleDB, as it looks (at first glance) like a good fit for your requirements.
Please refer to this article, which describes some practical user experience:
http://www.masonzhang.com/2013/06/2-reasons-why-we-select-simpledb.html
We have to work on an IoT system. Basically sensors sending data to the cloud, and users being able to access the data belonging to them.
The amount of data can be pretty substantial, so we need to choose something that covers both security and heavy load.
The structure of the data is pretty straightforward: basically a measurement and its value at a specified time.
The idea was to use DynamoDB for this, with a table containing:
[id of sensor-array]
[id of sensor]
[type of measure]
[value of measure]
[date of measure]
The idea was for the IoT system to put data directly (in Python) into the database.
Our questions are:
In terms of performance:
Will DynamoDB be able to handle a lot of insertions on a daily basis (we may be talking about hundreds of thousands of insertions per minute)?
Will querying the table by the id of the sensor array and a minimum date retrieve the data efficiently?
In terms of security, is it okay to proceed this way?
We have been using NoSQL databases like MongoDB, so we're finding it hard to apply our notions to DynamoDB, where the data seems to be arranged in a pretty simple fashion.
Thanks.
Will DynamoDB be able to handle a lot of insertions on a daily basis (we may be talking about hundreds of thousands of insertions per minute)?
Yes, DynamoDB will sustain (and charge for) all the write throughput you provision for the table. As your data is small, aggregating it and writing in batches (BatchWriteItem) is probably more cost efficient than individual writes.
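As a sketch of the batching mentioned above: a single BatchWriteItem call accepts at most 25 put or delete requests, so measurements have to be chunked before sending. The table name and item fields below are hypothetical, and the function only builds the `RequestItems` payloads rather than calling AWS.

```python
MAX_BATCH_SIZE = 25  # BatchWriteItem accepts at most 25 requests per call


def chunk(items, size=MAX_BATCH_SIZE):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_write_payloads(table_name, items):
    """Build one RequestItems payload per batch of up to 25 items."""
    return [
        {table_name: [{"PutRequest": {"Item": item}} for item in batch]}
        for batch in chunk(items)
    ]


# Hypothetical measurements in DynamoDB's low-level attribute format.
measurements = [
    {"sensor_id": {"S": f"sensor-{i}"}, "value": {"N": str(i)}}
    for i in range(60)
]
payloads = batch_write_payloads("measurements", measurements)
print([len(p["measurements"]) for p in payloads])
```

Each payload would be passed as `RequestItems` to a BatchWriteItem call; a real writer would also need to retry whatever comes back under `UnprocessedItems`.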
Will querying the table by the id of the sensor array and a minimum date retrieve the data efficiently?
Yes, queries by hash key (id) and range key (date) would be very efficient. You may need secondary indexes for more complex queries, though.
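A query by hash and range key as described above could be expressed roughly as follows. The attribute names `array_id` and `measured_at` are assumptions (the latter chosen because `date` is a reserved word in DynamoDB expressions), and the function only builds the keyword arguments that a low-level `query` call would take.

```python
def build_query(table_name, array_id, min_date_iso):
    """Keyword arguments for a DynamoDB Query on hash key + range key."""
    return {
        "TableName": table_name,
        # Equality on the hash key, range condition on the sort key.
        "KeyConditionExpression": (
            "array_id = :array_id AND measured_at >= :min_date"
        ),
        "ExpressionAttributeValues": {
            ":array_id": {"S": array_id},
            ":min_date": {"S": min_date_iso},
        },
    }


query = build_query("measurements", "array-7", "2024-01-01T00:00:00+00:00")
print(query["KeyConditionExpression"])
```

Storing the date as an ISO 8601 string in a single timezone keeps the `>=` comparison on the sort key consistent with chronological order.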
In terms of security, is it okay to proceed this way?
Although data is encrypted in transit and client-side encryption is straightforward, there is a lot left uncovered here. For example, AWS IoT provides TLS mutual authentication with certificates, IAM, or Cognito, among other security features. AWS IoT can store data in DynamoDB using a simple rule.