Index on a Boolean attribute in DynamoDB

I am new to DynamoDB schema design. We have a table that stores metadata for a customer, with the HashKey being CustomerId. The table also includes an attribute called "isActive", which is a boolean. If a customer unregisters, we plan to set the 'isActive' attribute to be empty.
We wish to pull a list of all customerIds that are active. I read about sparse indexes, wherein we can create a GSI on the 'isActive' attribute and only records with non-empty values will be populated in the GSI.
However, it appears scanning is the only way to retrieve list of active customerIds. We can either
a) Scan entire table and filter only active customerIds at application layer
b) Scan the GSI which will be smaller than base table, but not necessarily very small (I would expect at least 1000+ records in it).
Are there any better design approaches to solve this by achieving high cardinality?

Sounds like you have a fairly good understanding of your options. Using GSIs to create a sparse index is fairly common for the access pattern you describe. Keep in mind that you can run a query operation against the index (as opposed to a scan), which will make the operation very fast. In the event you have many items, you could always paginate through the results.
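As a rough sketch of the paginated Query mentioned above: the helper below keeps calling a Query-like function until the response no longer carries a LastEvaluatedKey. Here `query_fn` stands in for something like boto3's `client.query`; the table name `Customers`, the index name `GSI1`, and the constant key value `ACTIVE` are assumptions made for illustration, not details from the question.

```python
from typing import Any, Callable, Dict, Iterator


def query_all_pages(
    query_fn: Callable[..., Dict[str, Any]], **params: Any
) -> Iterator[Dict[str, Any]]:
    """Yield every item from a paginated DynamoDB Query.

    query_fn is expected to behave like boto3's client.query: it returns a
    dict with "Items" and, when more pages remain, "LastEvaluatedKey".
    """
    while True:
        response = query_fn(**params)
        yield from response.get("Items", [])
        last_key = response.get("LastEvaluatedKey")
        if last_key is None:
            return
        # Resume the next request where the previous page stopped.
        params["ExclusiveStartKey"] = last_key


# Hypothetical request against the sparse index (all names are illustrative).
ACTIVE_CUSTOMERS_QUERY = {
    "TableName": "Customers",
    "IndexName": "GSI1",
    "KeyConditionExpression": "GSI1PK = :pk",
    "ExpressionAttributeValues": {":pk": {"S": "ACTIVE"}},
}
```

With boto3 installed and credentials configured, this would presumably be called as `query_all_pages(boto3.client("dynamodb").query, **ACTIVE_CUSTOMERS_QUERY)`.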
Keep in mind you can add/remove the GSI Primary Key for the item to include/exclude the item from the index. For example, let's say your table has a GSI with a Partition (Hash) key named GSI1PK. Here's what it could look like with 4 customer items defined:
Notice that only Joe and Jill have a GSI1PK value defined, while Sue and Sam do not. Since I defined a global secondary index on GSI1PK, only items with that attribute defined will get projected into that index. Logically, that index would look like this:
If you want to remove Joe or Jill from GSI1, simply update the item to REMOVE GSI1PK from those items. Likewise, if you want to add Sue or Sam to the index, update the item to SET the GSI1PK attribute on those items.
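A minimal sketch of the include/exclude idea, under some assumptions: the table is called `Customers`, the index key holds the constant string `ACTIVE`, and the requests are shaped for boto3's low-level `client.update_item`. None of these names come from the question. Setting the attribute is written with SET, since the ADD action in an UpdateExpression applies only to numbers and sets. `sparse_index` just mimics, in memory, which items would be projected.

```python
from typing import Any, Dict, List

TABLE_NAME = "Customers"  # hypothetical table name
INDEX_KEY = "GSI1PK"      # partition key of the sparse GSI
ACTIVE_VALUE = "ACTIVE"   # hypothetical constant value for active customers


def activate_request(customer_id: str) -> Dict[str, Any]:
    """UpdateItem parameters that put the customer into the sparse index."""
    return {
        "TableName": TABLE_NAME,
        "Key": {"CustomerId": {"S": customer_id}},
        "UpdateExpression": "SET #k = :v",
        "ExpressionAttributeNames": {"#k": INDEX_KEY},
        "ExpressionAttributeValues": {":v": {"S": ACTIVE_VALUE}},
    }


def deactivate_request(customer_id: str) -> Dict[str, Any]:
    """UpdateItem parameters that take the customer out of the sparse index."""
    return {
        "TableName": TABLE_NAME,
        "Key": {"CustomerId": {"S": customer_id}},
        "UpdateExpression": "REMOVE #k",
        "ExpressionAttributeNames": {"#k": INDEX_KEY},
    }


def sparse_index(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """In-memory picture of the GSI: only items that define the index key."""
    return [item for item in items if INDEX_KEY in item]


customers = [
    {"CustomerId": "Joe", INDEX_KEY: ACTIVE_VALUE},
    {"CustomerId": "Jill", INDEX_KEY: ACTIVE_VALUE},
    {"CustomerId": "Sue"},
    {"CustomerId": "Sam"},
]
```

Passing `customers` to `sparse_index` keeps only Joe and Jill, which is the logical view of GSI1 described above.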

Related

Is this a reasonable way to design this DynamoDB table? Alternatives?

Our team has started to use AWS and one of our projects will require storing approval statuses of various recommendations in a table.
There are various things that identify a single recommendation, let's say they're : State, ApplicationDate, LocationID, and Phase. And then a bunch of attributes corresponding to the recommendation (title, volume, etc. etc.)
The use case will often require grabbing all entries for a given State and ApplicationDate (and then we will look at all the LocationId and Phase items that correspond to it) for review from a UI. Items are added to the table one at a time for a given State, ApplicationDate, LocationId, Phase and updated frequently.
A dev with a little more AWS experience mentioned we should probably use State+ApplicationDate as the partition key, and LocationId+Phase as the sort key. These two pieces combined would make the primary key. I generally understand this, but how does that work if we start getting multiple recommendations for the same primary key? I figure we either are ok with just overwriting what was previously there, OR we have to add some other attribute so we can write a recommendation for the State+ApplicationDate/LocationId+Phase multiple times and get all previous values if we need to... but that would require adding something to the primary key right? Would that be like adding some kind of unique value to the sort key? Or for example, if we need to do status and want to record different values at different statuses, would we just need to add status to the sort key?
Does this sound like a reasonable approach or should I be exploring a different NAWS offering for storing this data?

Use a time-based id property, such as a ULID or KSUID. This will provide randomness to avoid overwriting data, but also provide a time-based sorting of your data when used as part of a sort key.
Because the id value is random, you will want to add it to your sort key for the table or index where you perform your list operations, and reserve the pk for known values that can be specified exactly.
It sounds like the 'State' is a value that can change. You can't update an item's key attributes on the table, so it is more common to use these attributes in a key for a GSI if they are needed to list data.
Given the above, an alternative design is to use the LocationId as the pk, the random id value as the sk, and a GSI with 'State' as the pk and the random id as the sk. Or, if you want to list the items by State -> Phase -> date, the GSI sk could be a concatenation of the Phase and id property. The above pattern gives you another list mechanism using the LocationId + timestamp of the recommendation create time.
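To make the id idea concrete, here is a simplified stand-in for a ULID/KSUID, not an implementation of either format: a zero-padded millisecond timestamp followed by random hex, so string order follows creation time. The function names and the `#` separator in the concatenated sort key are assumptions for illustration.

```python
import os
import time
from typing import Optional


def time_sortable_id(now_ms: Optional[int] = None) -> str:
    """Return an id whose string order follows creation time.

    The 13-digit zero-padded millisecond timestamp sorts lexicographically;
    the random suffix keeps ids created in the same millisecond distinct.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return f"{now_ms:013d}-{os.urandom(8).hex()}"


def gsi_sort_key(phase: str, item_id: str) -> str:
    """Concatenate Phase and id so a GSI lists items by Phase, then by time."""
    return f"{phase}#{item_id}"
```

In practice a maintained ULID or KSUID library is probably the better choice; the point here is only that a time prefix plus randomness gives both uniqueness and ordering within a partition.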

How to sort DynamoDB table by a single column?

I'd like to list records from my DDB table ordered by creation date.
My table has an attribute DateCreated.
All examples I can find describe ordering within some partition.
But I want global ordering.
Am I supposed to create an artificial attribute which will have the same value across all records, just to use it as a partition key? E.g. add new attribute GlobalPartition with value 1 to every record in the table, and create a GSI with partition key GlobalPartition and sort key DateCreated. Isn't there a better way?
Thx!

As you noticed, DynamoDB indeed does not have an option to sort items "globally". In other words, there is no way to Scan the database in sorted partition-key order. You can only sort items inside one partition, sorted by the "sort key".
When you have a small amount of data, you can indeed do what you said: Have a single partition with everything in this partition. However, it's not clear how practical this approach becomes as your single partition grows to gigabytes or terabytes, and how well DynamoDB can load-balance when you have just a single partition (I never saw any DynamoDB documentation which answers this question).
So another option is not to have a single partition but rather have a number of them. For example, consider that you want to sort items by date. Now instead of having a single partition, have a partition per month, i.e., the partition key is the month number. Now, if you want to sort everything within a month, you can do it directly, but if you want to get a sorted list of a full year, you need to Query twelve partitions, in order, getting a sorted list in each one and combining it to a sorted list for the full year. So-called time-series databases are often modeled this way.
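The partition-per-month idea can be sketched with an in-memory stand-in for the table. Everything here is an assumption for illustration: the partition key is written as a `YYYY-MM` string rather than a bare month number, and `query_partition` only mimics what a real Query on one partition would return.

```python
from typing import Dict, List, Tuple

Item = Tuple[str, str]          # (DateCreated as ISO-8601 string, payload)
Table = Dict[str, List[Item]]   # partition key (month) -> items


def month_keys(year: int) -> List[str]:
    """Partition keys for one year, in chronological order."""
    return [f"{year}-{month:02d}" for month in range(1, 13)]


def query_partition(table: Table, partition_key: str) -> List[Item]:
    """Mimic a Query: the items of one partition, ordered by sort key."""
    return sorted(table.get(partition_key, []))


def items_for_year(table: Table, year: int) -> List[Item]:
    """Query the twelve monthly partitions in order and concatenate them."""
    result: List[Item] = []
    for key in month_keys(year):
        result.extend(query_partition(table, key))
    return result
```

Because the partitions are visited in chronological order and each one is already sorted, plain concatenation yields a sorted list for the whole year; no merge step is needed.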

If you want to sort any data in DynamoDB, the attribute you sort on must be the sort key of the table or of an index. If the value is not in the attribute that maps to the table's sort key, or the table does not have a sort key, then you need to create a GSI and make that attribute the GSI's sort key. You can use an LSI too. Any attribute that maps to the sort key of the table, an LSI, or a GSI will do.
Check for more details "ScanIndexForward" param of the query request.
If ScanIndexForward is true, DynamoDB returns the results in the order in which they are stored (by sort key value). This is the default behavior. If ScanIndexForward is false, DynamoDB reads the results in reverse order by sort key value, and then returns the results to the client.
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html#API_Query_RequestSyntax
The console UI has a checkbox for this too.
A "global sort" is not possible: a global read means a Scan operation, which simply runs through all rows in the table and applies any filters, and it has no sorting option. Only a Query on an attribute mapped to a sort key has the ScanIndexForward option to change the sort direction.
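As an illustration of ScanIndexForward, the sketch below builds Query parameters for the single-partition GSI proposed in the question (partition key GlobalPartition with value 1, sort key DateCreated). The table name `Records` and index name `ByDateCreated` are made up; the parameter names follow the Query API linked above.

```python
from typing import Any, Dict


def newest_first_query(limit: int = 20) -> Dict[str, Any]:
    """Query parameters that return items newest first.

    The sort order comes from the index's sort key (assumed to be
    DateCreated); ScanIndexForward=False reverses it.
    """
    return {
        "TableName": "Records",        # hypothetical table name
        "IndexName": "ByDateCreated",  # hypothetical GSI name
        "KeyConditionExpression": "GlobalPartition = :p",
        "ExpressionAttributeValues": {":p": {"N": "1"}},
        "ScanIndexForward": False,     # descending by sort key
        "Limit": limit,
    }
```

Leaving ScanIndexForward out, or setting it to True, returns the same items oldest first.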

Is it possible in DynamoDb to get all the items with a given Sort Key?

As primary key I have an id for a recipe and the sort key is the type of food (breakfast, meal, snack, etc).
Is there a way with scan or query to get all the items with a given sort key?

As others have pointed out in the comments, you can't query a sort key in the sense that there is no operation that gives a list of items that have the same sort key.
In fact, the whole reason for a sort key is generally to order items in a particular partition.
Putting the two together, what you need is a way to partition the items by the food type and then query on that. Enter the Global Secondary Index (GSI).
With the help of a GSI you can index the data in your table in a way that the food type becomes the partition key, and some other attribute becomes the sort key. Then, getting all the items that match a particular food type becomes possible with a Query.
There are a few things to keep in mind:
a GSI is like another table: it consumes capacity that you will be charged for
a GSI is eventually consistent, meaning changes in the table could take a bit of time before being reflected in the GSI
if you end up creating a GSI where the choice of partition key results in very large partitions, it can lead to throttling (reduced throughput) if any one partition receives a lot of requests
Some more guidelines: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-general.html
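A hedged sketch of the GSI approach described above, shaped for boto3's low-level `client.update_table` and `client.query`. The table name `Recipes`, the index name `ByFoodType`, and the attribute names `FoodType` and `RecipeId` are assumptions, since the question does not give them. The request also assumes an on-demand table; a provisioned-capacity table would additionally need a ProvisionedThroughput entry for the new index.

```python
from typing import Any, Dict


def create_food_type_index_request(table_name: str = "Recipes") -> Dict[str, Any]:
    """UpdateTable parameters that add a GSI keyed by food type."""
    return {
        "TableName": table_name,
        # Every attribute used in the new index's key schema must be declared.
        "AttributeDefinitions": [
            {"AttributeName": "FoodType", "AttributeType": "S"},
            {"AttributeName": "RecipeId", "AttributeType": "S"},
        ],
        "GlobalSecondaryIndexUpdates": [
            {
                "Create": {
                    "IndexName": "ByFoodType",
                    "KeySchema": [
                        {"AttributeName": "FoodType", "KeyType": "HASH"},
                        {"AttributeName": "RecipeId", "KeyType": "RANGE"},
                    ],
                    "Projection": {"ProjectionType": "ALL"},
                }
            }
        ],
    }


def recipes_by_food_type_query(food_type: str, table_name: str = "Recipes") -> Dict[str, Any]:
    """Query parameters that fetch all recipes of one food type via the GSI."""
    return {
        "TableName": table_name,
        "IndexName": "ByFoodType",
        "KeyConditionExpression": "FoodType = :ft",
        "ExpressionAttributeValues": {":ft": {"S": food_type}},
    }
```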
But before you start creating GSIs, consider for a moment the schema of your table: your choice of partition key seems less than ideal. On the one hand, using the recipe id as the partition key is great because it probably results in very good spread of data but on the other hand, you have no ability to use queries on your table without creating GSIs.
Instead of recipe id as the partition key, consider creating a partition key composed of food type, and perhaps another attribute. This way, you can actually query on food type, or perhaps issue several queries to retrieve all items of a particular food type.

Guidelines for creating GSI in DynamoDB

Imagine that you need to persist something that can be represented with following schema:
{
type: String
createdDate: String (ISO-8601 date)
userId: Number
data: {
reference: Number,
...
}
}
type and createdDate are always defined/required, everything else such as userId, data and whatever fields within data are optional. Combination of type and createdDate does not guarantee any uniqueness. Number of fields within data (when data exists) may differ.
Now imagine that you need to query against this structure like:
Give me items where type is equal to something
Give me items where userId is equal to something
Give me items where type AND userId are equal to something
Give me items where userId AND data.reference are equal to something
Give me items where userId is equal to something, where type IS IN range of values and where data.reference is equal to something
As it seems to me HashKey needs to be introduced on table level to uniquely match the item. Only choice that i have is to use something like UUID generator. Based on that i can't query anything from table that i need described above. So i need to create several global secondary indexes to cover all five cases above as follows:
For first use case i could create GSI where type can be HashKey and createdDate can be RangeKey. What bothers me from start here as i mentioned, there is high chance for this composite key to NOT be unique.
For second use case i could create GSI where userId can be HashKey and createdDate can be RangeKey
Here probably this composite key can match item uniquely.
For third use case, i have probably two solutions. Either to create third GSI where type can be HashKey and userId can be RangeKey. With that approach i'm losing ability to sort returned data and again same worries, this composite key does not guarantee uniqueness. Another approach would be to use one of two previous GSIs and using FilterExpression, right?
For fourth use case i have only one option. To use previous GSI with userId as HashKey and createdDate as a RangeKey and to use FilterExpression against data.reference. Index can't be created on fields from nested object right?
For fifth use case, because IN operator is only supported via FilterExpression (right?) only option again is to use GSI with userId as HashKey and createdDate as a RangeKey and to use FilterExpression for both type and data.reference?
So as only bright side of this problem i see using GSI with userId as HashKey and createdDate as RangeKey. But again userId is not mandatory field it can be NULL. HashKey can't be NULL right?
Most importantly, if composite key(HashKey and RangeKey) can't guarantee uniqueness that means that saving item with composite key that already exists in index will silently rewrite previous item which means i will lose the data?

The thing about DynamoDB: it is a no-SQL database. On the plus side, it is easy to dump pretty much anything into it so long as you have a unique index, and it will be fairly efficiently stored for retrieval if you have a good partition key that sub-divides your data into chunks. On the downside, any query you do against fields that are not the partition key or an index (primary or secondary) is a slow table scan by definition. DynamoDB is not an SQL database and cannot give SQL-like performance when filtering non-indexed columns. If the performance you see is going to be reasonable, you need to delimit your query results to pre-calculated index values available before doing a query, or you need to know the results you are looking for are delimited to a few partition keys.
First let's consider the delimited partition keys route. Once you have delimited the partition keys as much as you can and there are no more indexes to reference, everything else you ask DynamoDB is not really a query, but a table scan. You can ask DynamoDB to do it for you, but you may well be better off taking the full results from a partition key or index query and doing the filter yourself in whatever language you are using. I use Java for this purpose because it is simple to do a query for the keys I need through the Java->DynamoDB API and easy to then filter the results in Java. If this is interesting to you I can put together some simple examples.
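A minimal sketch of the "filter it yourself" route, in Python rather than the Java mentioned above. It assumes the items have already been fetched by a partition-key or index query and converted to plain dicts using the field names from the question's schema; the function name and signature are made up for illustration.

```python
from typing import Any, Dict, Iterable, List, Optional, Set


def filter_items(
    items: Iterable[Dict[str, Any]],
    types: Optional[Set[str]] = None,
    reference: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Keep items whose type is in `types` and whose data.reference matches.

    A criterion left as None is not applied. Items without a `data` field
    never match a reference criterion.
    """
    matches = []
    for item in items:
        if types is not None and item.get("type") not in types:
            continue
        data = item.get("data") or {}
        if reference is not None and data.get("reference") != reference:
            continue
        matches.append(item)
    return matches
```

For example, `filter_items(results, types={"A", "B"}, reference=7)` would cover the fifth use case once `results` holds everything for one userId.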
If you go the index and filter route, understand that the hash key is still a partition key for the index, which is going to determine how much the GSI can be used in parallel. The bigger your DynamoDB table becomes and the more time sensitive your queries are, the bigger the issue this will become.
So yes, you can make the queries you want with indexes, though it will take some careful planning of those indexes.
1. For first use case i could create GSI where type can be HashKey and
createdDate can be RangeKey. What bothers me from start here as i
mentioned, there is high chance for this composite key to NOT be
unique.
GSI's do not have to be unique. You will receive multiple rows on a query, but nothing will be broken from DynamoDB's perspective. However, if you use type as your partition key (HashKey), the performance of this query will likely be poor unless you have few records for each of your type values.
2. For second use case i could create GSI where userId can be HashKey and
createdDate can be RangeKey Here probably this composite key can match item
uniquely.
No problems here so long as your userId's will be unique on a given day.
3. For third use case, i have probably two solutions. Either to create third
GSI where type can be HashKey and userId can be RangeKey. With that approach
i'm losing ability to sort returned data and again same worries, this
composite key does not guarantee uniqueness. Another approach would be to
use one of two previous GSIs and using FilterExpression, right?
So the RangeKey is your sort key, at least from DynamoDB's perspective. And yes, if you use a GSI and then Filter, you are table scanning the contents of the GSI indexed rows. But yes, if you are combining two GSI's, you either generate a third index in advance or you filter/scan. DynamoDB has no provisions for doing an INNER JOIN on two indexes. And having type as your partition key and then filtering the results has serious performance issues.
4. For fourth use case i have only one option. To use previous GSI with
userId as HashKey and createdDate as a RangeKey and to use FilterExpression
against data.reference. Index can't be created on fields from nested object
right?
I am not sure about your nested object question, but yes, using the previous GSI with a filter/scan will work.
5. For fifth use case, because IN operator is only supported via
FilterExpression (right?) only option again is to use GSI with userId as
HashKey and createdDate as a RangeKey and to use FilterExpression for both
type and data.reference?
Yes, if you want DynamoDB to do the work for you, this is the way to approach your fifth query. But I go back to my original statement: why do this? If you can create a GSI that efficiently gets you to the records you are interested in, use a GSI. But I never use filter expressions: I get the full partition, index or GSI results back from a query and do the filtering myself in my programming language of choice.
If you need to do everything in DynamoDB your methods will work, but they may not be very fast depending on how many rows are being filtered. I beat on the performance issue pretty hard because I have seen lots of work go into a database project and then had the whole thing not get used because poor performance made it unusable.
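For comparison, this is roughly what the server-side version of the fifth query could look like as parameters for boto3's low-level `client.query`. The table name `Events` and index name `ByUserId` are hypothetical. All attribute names go through ExpressionAttributeNames placeholders, which sidesteps DynamoDB's reserved words. Note that a FilterExpression is applied after the key condition has selected the items, so it trims the response but not the amount of data read.

```python
from typing import Any, Dict, List


def fifth_use_case_query(user_id: int, types: List[str], reference: int) -> Dict[str, Any]:
    """Query by userId, then filter on type IN (...) and data.reference."""
    # One value placeholder per type, e.g. :t0, :t1, ...
    type_values = {f":t{i}": {"S": t} for i, t in enumerate(types)}
    values: Dict[str, Any] = {
        ":u": {"N": str(user_id)},
        ":ref": {"N": str(reference)},
    }
    values.update(type_values)
    return {
        "TableName": "Events",    # hypothetical table name
        "IndexName": "ByUserId",  # hypothetical GSI: userId / createdDate
        "KeyConditionExpression": "#u = :u",
        "FilterExpression": f"#t IN ({', '.join(type_values)}) AND #d.#r = :ref",
        "ExpressionAttributeNames": {
            "#u": "userId",
            "#t": "type",
            "#d": "data",
            "#r": "reference",
        },
        "ExpressionAttributeValues": values,
    }
```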

Dynamodb global/local secondary index, searching via category and rating

I have a table of songs in Dynamodb that looks like this:
I wish to return to my app a list of songs by two conditions "Category" and "UserRating"
At present my hash key is "Artist" and rangekey is "Songtitle".
I think that if I made a secondary key "Category" I could search for all the songs in a particular category and similarly I could do this for rating but I don't know how to do this for both?
I also believe I understand the difference between the global and local index.
So what I am thinking (which is probably not correct) is that I need to create a global secondary index on "Category" and do a query on the attribute "UserRating".
Will this work? And even if this works is this the correct way to be doing it?
Thanks

With query you can only search for the Hash (now the partition key) and optionally the range (now the sort key). This has to drive your table and index design.
In your case, if you wish to query Category on its own then you'd create a new GSI with Category as the partition key. If you want to search within a Category for songs with a rating of something, then you'd create that index with a partition key of Category and a sort key of Rating.
If you need to query by rating alone, then you'd have to create a GSI with rating as the partition key. Bear in mind however you can't do anything like "greater than" or "between" on the partition key: you can only do this on the sort key.
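A sketch of the Category-plus-rating query described above, again as parameters for boto3's low-level `client.query`. It assumes a GSI named `CategoryRating` (a made-up name) with Category as the partition key and UserRating as a Number sort key, on a table assumed to be called `Songs`. The equality test is on the partition key and the range comparison is on the sort key, as the answer requires.

```python
from typing import Any, Dict


def songs_in_category_query(category: str, min_rating: int) -> Dict[str, Any]:
    """Query parameters: songs in one category rated at least min_rating."""
    return {
        "TableName": "Songs",           # hypothetical table name
        "IndexName": "CategoryRating",  # hypothetical GSI name
        # Partition key: equality only. Sort key: range operators allowed.
        "KeyConditionExpression": "Category = :c AND UserRating >= :r",
        "ExpressionAttributeValues": {
            ":c": {"S": category},
            ":r": {"N": str(min_rating)},
        },
        "ScanIndexForward": False,      # highest ratings first
    }
```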
One other factor to consider is expected performance. Amazon advise that partition keys have high cardinality. It is called the partition key because it is the means by which the data is physically organised into partitions. If you have an index with x number of rows across only a few categories, then your data will not be well distributed, which causes a potential performance bottleneck. For non-serious projects this won't be noticeable however.
Hope this helps somewhat.
