DynamoDB dynamic schema - amazon-web-services

I'd like to use AWS DynamoDB as a datastore for a data-collection application, where the data schema may vary over time.
For example, initially an Item may represent attributes of people e.g. {name, age}. However, later the schema may be modified to contain {name, age, gender}.
Each schema modification will be tracked and versioned and older data won't need to be migrated - but it may still need to be queried alongside newer data.
Is it an acceptable pattern to store each data-schema change in its own table? Is there a straightforward mechanism to query aggregated data across tables?

Schemas for DynamoDB tables are dynamic in nature. The only thing that must be defined up front is the key attribute name(s) and type(s). You can also add global secondary indexes (indexes with a different partition key) at any time. Local secondary indexes, however, which share the table's partition key but use a different sort key, can only be added at table creation time. Because of this dynamic schema, you can start adding new fields, or stop adding them, at any time.
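For example, roughly like this with boto3 (the People table, personId key and item values are made up for illustration):

    import boto3

    dynamodb = boto3.resource("dynamodb")

    # Only the key schema is declared up front; non-key attributes are per-item.
    table = dynamodb.create_table(
        TableName="People",  # hypothetical table name
        KeySchema=[{"AttributeName": "personId", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "personId", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",
    )
    table.wait_until_exists()

    # An item written before the schema gained a gender attribute...
    table.put_item(Item={"personId": "p1", "name": "Alice", "age": 30})

    # ...and an item written afterwards; older items need no migration.
    table.put_item(Item={"personId": "p2", "name": "Bob", "age": 41, "gender": "male"})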
You need to design tables knowing how you will query them. Queries are quite restricted: you can filter, but that is neither a fast nor a cheap operation. Fast queries rely on existing indexes, and a query can only fetch from a single table. Joins and unions aren't available.
A table scan reads every item; the only way to narrow it down is with filters, which drop items from the returned set only after they have been read from disk, so you still pay for them. Scans are expensive in both cost and time. Queries that pass a key are faster because they fetch data from a single partition. So you might want to design a key with both a partition key (userId, for instance) and a sort key (item id). Composite keys like this are common in DynamoDB.
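A rough boto3 illustration of the difference (the Orders table and its userId/itemId keys are assumptions):

    import boto3
    from boto3.dynamodb.conditions import Attr, Key

    table = boto3.resource("dynamodb").Table("Orders")  # hypothetical table with userId + itemId keys

    # Query: only the partition that owns this userId is read, so it is fast and cheap.
    by_user = table.query(
        KeyConditionExpression=Key("userId").eq("user-123") & Key("itemId").begins_with("2017-")
    )

    # Scan with a filter: every item is read (and billed) before non-matches are dropped.
    everything = table.scan(FilterExpression=Attr("userId").eq("user-123"))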
It is also important to avoid hot spots inside a table; that is, data needs to be fairly evenly distributed across partition keys.
Reference: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BestPractices.html

Related

AWS DynamoDB sorting without partition key

I have a DynamoDB table with a partition key (UUID) and a few attributes (like name, email, created date etc.). Created date is one of the attributes in the item and its format is YYYY-MM-DD. But now there is a requirement change: I have to sort based on created date and bring back the entire data set (that is, I cannot just fetch the data for a particular partition, but need the whole data from all the partitions in sorted order). I know this might take time, as DynamoDB has to fetch data from all the partitions and sort it afterwards. My questions are:
Is this query possible with the current design? I can see that the partition key is required in a query, which is why I am confused: I cannot provide a partition key here.
Is there a better way to redesign the table for such a use case?
Thanks in advance.
Since your table already exists, you can't change its key structure now; you are stuck with UUID as the table's partition key.
There is, however, functionality to create a global secondary index (GSI) for your DynamoDB table.
By using a GSI you can rearrange your data representation so that the created date becomes the partition key of the index instead.
The reason partition keys are important is that data in DynamoDB is distributed across multiple nodes, with each partition stored on a single node. A query that addresses one partition is more efficient, as there is no need to wait on other partitions returning results.
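A rough boto3 sketch of adding such an index to an existing table (the table name, index name and attribute type are placeholders; if the table uses provisioned capacity the new index also needs its own ProvisionedThroughput):

    import boto3

    client = boto3.client("dynamodb")

    client.update_table(
        TableName="MyTable",  # hypothetical table name
        AttributeDefinitions=[
            {"AttributeName": "created_date", "AttributeType": "S"},
        ],
        GlobalSecondaryIndexUpdates=[
            {
                "Create": {
                    "IndexName": "created_date-index",
                    "KeySchema": [{"AttributeName": "created_date", "KeyType": "HASH"}],
                    "Projection": {"ProjectionType": "ALL"},
                    # Required only for provisioned-capacity tables:
                    # "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
                }
            }
        ],
    )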

NoSQL encourages designing database based on access patterns. What to do when the patterns change?

NoSQL encourages designing a database based on its access patterns, and it can perform the queries it was designed for very fast. For other queries, the performance is not so good. But for software, change is the norm. So when new requirements come in and we have to add new features, how can NoSQL databases adapt? Or better yet, how can I design NoSQL databases (preferably DynamoDB) in a way that will allow me to adapt to new feature additions?
The first approach that comes to my mind is to design a new table and migrate all the previous data to it. But considering the table has millions of records, that's probably not very cost-effective.
References:
Rick Houlihan talking about designing dynamodb table based on access patterns
Dynamodb design best practices from aws documentation
DynamoDB is schema-less, so you can add a new attribute at any time without having to do any backfill or migration. Just make sure your application knows what to do if the attribute is not present.
If you need to query that attribute, you can add a new GSI on the attribute. DynamoDB has an initial quota of 20 GSIs per table, but you can request a quota increase if you need more.
If your new use case isn’t satisfied by a GSI, you can create a new table containing your new attribute(s) to use alongside the existing table. If you need a guarantee of consistency between those tables, you can use DynamoDB transactions to keep them in sync.
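A rough boto3 sketch of that last option (the table names, key names and attribute values are all placeholders): with a transaction, either both writes succeed or neither does.

    import boto3

    client = boto3.client("dynamodb")

    client.transact_write_items(
        TransactItems=[
            {
                "Put": {
                    "TableName": "ExistingTable",
                    "Item": {"pk": {"S": "user-123"}, "status": {"S": "active"}},
                }
            },
            {
                "Put": {
                    "TableName": "NewAttributeTable",
                    "Item": {"pk": {"S": "user-123"}, "newAttribute": {"S": "value"}},
                }
            },
        ]
    )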
One way to minimize full table migrations when adapting to new changes is to use generic names for keys and indexes. In the case of DynamoDB, we would have pk as the partition key and sk as the sort key, alongside all the attributes of the item. The values of pk and sk are actually derived from other attributes. More importantly, we add 5 LSIs (the maximum) during table creation and use them when necessary. For example, to store data about a book, a row in the table would have the following fields:
pk, sk, ISBN, data_type, author, created_at, ...other data, lsi1, lsi2, lsi3, lsi4, lsi5
The values for the fields:
pk -> ISBN, sk -> data_type, ISBN -> ISBN, ..., lsi1 -> data_type#created_at, lsi2-lsi5 -> empty
This way, unless there are drastic changes in the requirements, the structure of our table is unlikely to change. One thing to note here is that unless an item being added, deleted or updated actually contains an attribute that belongs to an index, no extra computational or storage cost is incurred for that index in DynamoDB (the index stays sparse).
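A rough sketch of what writing one such book row might look like with boto3 (the AppTable name and attribute values are placeholders):

    import boto3

    table = boto3.resource("dynamodb").Table("AppTable")  # assumed generic table

    isbn = "978-0000000000"          # placeholder ISBN
    data_type = "book"
    created_at = "2020-01-15T09:30:00Z"

    table.put_item(
        Item={
            "pk": isbn,                           # derived: partition key = ISBN
            "sk": data_type,                      # derived: sort key = item type
            "ISBN": isbn,
            "data_type": data_type,
            "author": "Some Author",
            "created_at": created_at,
            "lsi1": f"{data_type}#{created_at}",  # composite value for the first LSI
            # lsi2-lsi5 are omitted entirely, so those unused indexes stay sparse and cost nothing
        }
    )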

Should I use a secondary index or separate ID lookup table in DynamoDB?

I'm migrating a database from MongoDB to DynamoDB and trying to understand best practices, especially around using local secondary indexes and sort keys.
My application pulls in html data from the web, and loads the data into several tables/collections. At the time of extraction it gives each item an extracted_id, unique to the website it's pulled from. Before loading the items, it gives each item a UUID as its primary/partition key.
Problem: In order to avoid assigning different uuids to the same extracted_id I query the db to check if the entity has a preexisting entity_uuid.
Current Solution: Currently in mongodb, I have two sets of tables/collections. One for storing all items, and one for storing an entity's extracted_id(as key) / entity_uuid (as value) lookup table.
Better Solution?: As I move to DynamoDB, would it be better to create only one table, with extracted_id as a local secondary index, so as not to store duplicate data? I'm unsure, as the docs say to use indexes sparingly. I don't use the extracted_id for anything other than providing items with their uuid for a given site.
Hopefully this makes sense, I'm new to AWS / DynamoDB and would appreciate any tips / better solutions to the ones mentioned.
Why not just make extracted_id the partition key of your new DynamoDB table and use a ConditionExpression attribute_not_exists(extracted_id) to prevent your application from writing duplicate entries?
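A rough boto3 sketch of that pattern (the table name and the put_if_new helper are made up for illustration):

    import boto3
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb").Table("Items")  # assumed table name

    def put_if_new(extracted_id, entity_uuid, attrs):
        # Insert the item only if no item with this extracted_id exists yet.
        try:
            table.put_item(
                Item={"extracted_id": extracted_id, "entity_uuid": entity_uuid, **attrs},
                ConditionExpression="attribute_not_exists(extracted_id)",
            )
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return False  # an item with this extracted_id already exists
            raise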

Is it possible to split a Dynamo Db table into more tables (AWS)?

I'm currently working with a client on an IoT project involving sensors, and all of their data is being put into one table. This data is coming from multiple sensor nodes, and they want one table for every sensor node. I want to know whether, with AWS DynamoDB, it is possible to split the data into multiple separate tables using the hash key from an existing table. I have looked into GSIs and LSIs, but this still isn't exactly what my client wants. Also, would having multiple tables even be more effective than using an LSI or GSI? I am new to NoSQL and DynamoDB, so all help is very much appreciated.
DynamoDB does not support splitting data into multiple tables - in the sense that DynamoDB operations themselves, including the atomic conditional checks, can't be performed across table boundaries. But that doesn't mean that splitting data across tables is incompatible with DynamoDB - just that you have to add the logic in your application.
You can definitely do so, as long as the data from the different sensors is isolated enough. That said, since DynamoDB already makes it possible and convenient to partition your data with hash keys and global secondary indexes, a more common scenario is to split data into multiple tables across time boundaries in order to discard or archive old data.
In the end I would say that there is no need to split data into multiple tables on the hash key, and it doesn't really make sense - although it can be done. A more useful case is to split data into multiple tables on some other attribute that is not part of the hash or range key (such as in the time-series example above).
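For the time-boundary variant, the application-side routing can be as simple as deriving the table name from the timestamp. A rough sketch, assuming monthly tables named sensor_data_YYYY_MM (all names and values here are placeholders):

    from datetime import datetime, timezone
    from decimal import Decimal

    import boto3

    dynamodb = boto3.resource("dynamodb")

    def table_for(timestamp):
        # Route a reading to its monthly table, e.g. sensor_data_2017_06,
        # so old months can be archived or dropped by deleting whole tables.
        return dynamodb.Table(f"sensor_data_{timestamp:%Y_%m}")

    now = datetime(2017, 6, 1, tzinfo=timezone.utc)
    table_for(now).put_item(
        Item={"sensorId": "node-42", "ts": now.isoformat(), "value": Decimal("21.5")}
    )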

DynamoDb table design: Single table or multiple tables

I’m quite new to NoSQL and DynamoDB, and I'm used to RDBMSs. I’m designing the database for a game and we're using DynamoDB and AWS Lambda for our backend. I created a table named “Users” for the player profile that contains the user information and resources. Because the game has an inventory system, I also created a table named “UserItems”.
It was all good until I realized DynamoDB doesn't have transactions, so any operation executed on both tables (for example, using an item that increases a resource) has a chance of failing on one table while succeeding on the other, causing inconsistent data that affects our customers.
So I was thinking maybe my multiple-table design is not good, since designing multiple tables is a habit from working with RDBMSs. That led me to think of storing the entire “UserItems” as a hash inside “Users”, but I’m not sure this is good practice either, because the size of a single row in the Users table will be really big (we may have 500 unique items per user), and each time I pull or put data from/to “Users” (when most of the time I don't need the “UserItems” data) the read/write throughput will also be really large.
What should I do: keep the multiple-table design and handle transactions manually, or switch to a single-table design? Or maybe there is a third option?
Updated: more information about my use case
Currently I have 2 tables
Users: UserId (key), Username, Gold
UserItems: UserId (partition key), ItemId (sort key), Name, GoldValue
Scenarios:
User buys an item: Users.Gold will be deducted, and a new UserItem will be added to the UserItems table.
User sells an item: Users.Gold will be increased, and the item will be deleted from the UserItems table.
In both scenarios above I have to do two update operations on two tables; without a transaction, there is a chance one of them fails.
To solve that, I'm considering a single-table solution: a single Users table with four columns, UserId (key), Username, Gold, UserItems. However, there are two things I'm worried about:
Data in UserItems might become too big for a single attribute, because one user could have up to 500 items.
To add/delete an item I have to pull UserItems from DynamoDB, add/delete the item, and then put it back into Users. So I have to do one read and one write operation for each action, and because of issue (1) the read/write data size could become really big.
FWIW, the AWS documentation on NoSQL Design for DynamoDB suggests using a single table:
As a general rule, you should maintain as few tables as possible in a DynamoDB application. As emphasized earlier, most well designed applications require only one table, unless there is a specific reason for using multiple tables.
Exceptions are cases where high-volume time series data are involved, or datasets that have very different access patterns—but these are exceptions. A single table with inverted indexes can usually enable simple queries to create and retrieve the complex hierarchical data structures required by your application.
NoSQL databases are best suited for non-transactional data. If you bring normalization (splitting your data into multiple tables) into NoSQL, then you are defeating the whole purpose of it. If performance is what matters most, then you should consider having only a single table for your use case. DynamoDB supports range keys and also supports secondary indexes. For your use case, it would be better to redesign your table to use range keys.
If you can share more details about your current table, maybe I can help with more input.
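One possible shape for such a redesign (a rough boto3 sketch; the Game table, the Sk attribute and the ITEM# prefix are illustrative choices, not anything prescribed by DynamoDB):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("Game")  # hypothetical single table

    # The profile row and each inventory item share the same partition key (UserId)
    # and are told apart by the range/sort key.
    table.put_item(Item={"UserId": "u1", "Sk": "PROFILE", "Username": "alice", "Gold": 100})
    table.put_item(Item={"UserId": "u1", "Sk": "ITEM#sword-01", "Name": "Sword", "GoldValue": 25})

    # One query returns the profile plus the whole inventory for a user.
    user_data = table.query(KeyConditionExpression=Key("UserId").eq("u1"))

Buying or selling an item then becomes two writes against the same table, and the DynamoDB transactions mentioned in an earlier answer (transact_write_items) can make the gold update and the item write atomic.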