DynamoDB querying in 2017 - amazon-web-services

There are a few similar questions out there, but they look outdated.
Does DynamoDB still have problems with querying or not?
Use case: table contains users with parameters: name, phone, email, groupId, created, etc...
I want to get all users with groupId = 1, name iLike 'jo' and created > a_year_ago_timestamp.
Looks like this is possible already, according to this.
Or is this another highly expensive scan operation?

As long as you are using the Query API of DynamoDB, it is not an expensive scanning operation. Using the Query API implies that you know the hash key of the item(s) you want.
In the above case, I assume groupId is the hash key of the table. Please note that you can't use CONTAINS or GE (i.e. greater than or equal) on the hash key attribute in a KeyConditionExpression; the hash key only supports equality.
So, groupId must be the hash key in order to use the Query API. Otherwise, you may need to look at a GSI (Global Secondary Index) in order to use the Query API.
Obviously, if you use Scan API with FilterExpression, it would be a costly operation.
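As a rough sketch of the Query described above, the request parameters might look like the following. It assumes groupId is the partition key and created is a numeric sort key holding epoch seconds; the table name "Users" and the helper name are made up for illustration. Note that contains() in DynamoDB is case-sensitive, so it is only an approximation of iLike, and the filter is applied after the key conditions have selected the items.

```python
import time


def build_group_query(group_id, name_fragment, created_after, table_name="Users"):
    """Build keyword arguments for a low-level DynamoDB Query call.

    Assumption: groupId is the partition key and created is a numeric
    sort key (epoch seconds). The table name is hypothetical.
    """
    return {
        "TableName": table_name,
        # Key conditions: equality on the hash key, a range on the sort key.
        "KeyConditionExpression": "#g = :g AND #c > :c",
        # The filter runs after the key conditions have selected items,
        # so filtered-out items still consume read capacity.
        "FilterExpression": "contains(#n, :n)",
        # Placeholders avoid clashes with reserved words such as "name".
        "ExpressionAttributeNames": {"#g": "groupId", "#c": "created", "#n": "name"},
        "ExpressionAttributeValues": {
            ":g": {"N": str(group_id)},
            ":c": {"N": str(created_after)},
            ":n": {"S": name_fragment},
        },
    }


a_year_ago = int(time.time()) - 365 * 24 * 60 * 60
params = build_group_query(1, "jo", a_year_ago)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("dynamodb").query(**params)
```

If groupId is not the table's hash key, the same parameters would need an additional IndexName pointing at a GSI keyed on groupId.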

Related

DynamoDB query with both GT and begins_with for sort key?

I have a single table design where I have chat rooms (PK) with timestamped messages (SK). Since it's a single table design the SK has a MSG# prefix, followed by the message creation timestamp, to keep message entities separate from other entities.
I'd like to retrieve all messages after a certain timestamp. It seems like the key condition should be PK = "<ChatRoomId>" AND begins_with(SK, "MSG#") AND SK GT "MSG#<LastRead>". The first part of the SK condition is to only fetch message entities and the second is to only fetch new messages. Is it possible to have a double condition on the sort key like this? It seems like it should be possible, as it denotes a contiguous range of sort keys.
You can easily achieve that by using between:
PK = "<ChatRoomId>" AND SK BETWEEN "MSG#<YourDate>" AND "MSG#9999-99-99"
This way you will get all messages starting at <YourDate> and no records with other prefixes. This will work unless you're planning very far ahead.
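A minimal sketch of the BETWEEN approach, assuming string keys named PK and SK and ISO-style timestamps after the MSG# prefix. The table name "Chat", the helper name and the sample keys are invented for illustration. The second half mimics the range check in memory; Python string comparison matches DynamoDB's byte ordering for ASCII keys like these.

```python
def message_range_query(chat_room_id, since, table_name="Chat"):
    """Build parameters for a low-level Query using BETWEEN on the sort key."""
    return {
        "TableName": table_name,  # hypothetical table name
        "KeyConditionExpression": "PK = :pk AND SK BETWEEN :lo AND :hi",
        "ExpressionAttributeValues": {
            ":pk": {"S": chat_room_id},
            ":lo": {"S": f"MSG#{since}"},
            # Upper bound that sorts after any realistic timestamp.
            ":hi": {"S": "MSG#9999-99-99"},
        },
    }


range_params = message_range_query("room-1", "2021-02-01")

# In-memory illustration of which sort keys fall inside the range.
sort_keys = sorted([
    "META#room",
    "MSG#2021-01-01T10:00:00Z",
    "MSG#2021-03-05T08:30:00Z",
    "USER#alice",
])
lo, hi = "MSG#2021-02-01", "MSG#9999-99-99"
matched = [sk for sk in sort_keys if lo <= sk <= hi]
```

Only the message written after the lower bound ends up in matched; entities with other prefixes sort outside the range.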
I have exactly the same use case and found this answer. Thanks for the suggestion; it works, but we decided to research further: "between" is inclusive, so we'd have to either waste one read capacity unit or make up a fake value as a workaround.
Turns out, the DynamoDB API provides this feature, it's the exclusive start key: https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html#DDB-Query-request-ExclusiveStartKey
Admittedly, the documentation is not very encouraging and seems to suggest that the parameter is some opaque data that you can only obtain by having a previous query:
The primary key of the first item that this operation will evaluate. Use the value that was returned for LastEvaluatedKey in the previous operation.
But the actual content of that key is very simple and transparent: it's a map like {"PK": {"S": "your_pk"}, "SK": {"S": "exclusive_start_sk"}} ( replace PK/SK with your actual key - if you're doing single table design you're probably using those generic names ). GSIPK/GSISK may be provided instead, if you're querying a GSI instead of the main table. You can do some manual query and observe the returned LastEvaluatedKey to verify what it's expecting.
From there you can combine greater_than and begins_with, with greater_than expressed as a pagination parameter.
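A sketch of the request this answer describes, with begins_with as the key condition and the last-read sort key passed as ExclusiveStartKey. The table name "Chat" and the helper name are assumptions; the key attribute names PK and SK follow the question.

```python
def messages_after(chat_room_id, last_read_sk, table_name="Chat"):
    """Build Query parameters that start strictly after last_read_sk."""
    return {
        "TableName": table_name,  # hypothetical table name
        # Restrict the query to message entities only.
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": {"S": chat_room_id},
            ":prefix": {"S": "MSG#"},
        },
        # Hand-built start key: evaluation begins after this item,
        # which gives the "greater than" behaviour.
        "ExclusiveStartKey": {
            "PK": {"S": chat_room_id},
            "SK": {"S": last_read_sk},
        },
    }


after_params = messages_after("room-1", "MSG#2021-02-01T09:00:00Z")
```

When querying a GSI instead of the table, the start key would also need the index key attributes, so it is worth inspecting a real LastEvaluatedKey first, as suggested above.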

Is this a reasonable way to design this DynamoDB table? Alternatives?

Our team has started to use AWS and one of our projects will require storing approval statuses of various recommendations in a table.
There are various things that identify a single recommendation, let's say they're : State, ApplicationDate, LocationID, and Phase. And then a bunch of attributes corresponding to the recommendation (title, volume, etc. etc.)
The use case will often require grabbing all entries for a given State and ApplicationDate (and then we will look at all the LocationId and Phase items that correspond to it) for review from a UI. Items are added to the table one at a time for a given State, ApplicationDate, LocationId, Phase and updated frequently.
A dev with a little more AWS experience mentioned we should probably use State+ApplicationDate as the partition key, and LocationId+Phase as the sort key. These two pieces combined would make the primary key. I generally understand this, but how does that work if we start getting multiple recommendations for the same primary key? I figure we either are ok with just overwriting what was previously there, OR we have to add some other attribute so we can write a recommendation for the State+ApplicationDate/LocationId+Phase multiple times and get all previous values if we need to... but that would require adding something to the primary key right? Would that be like adding some kind of unique value to the sort key? Or for example, if we need to do status and want to record different values at different statuses, would we just need to add status to the sort key?
Does this sound like a reasonable approach or should I be exploring a different AWS offering for storing this data?
Use a time-based id property, such as a ULID or KSUID. This will provide randomness to avoid overwriting data, but also provide a time-based sorting of your data when used as part of a sort key.
Because the id value is random, you will want to add it to your sort key for the table or index where you perform your list operations, and reserve the pk for known values that can be specified exactly.
It sounds like the 'State' is a value that can change. You can't update an item's key attributes on the table, so it is more common to use these attributes in a key for a GSI if they are needed to list data.
Given the above, an alternative design is to use the LocationId as the pk, the random id value as the sk, and a GSI with 'State' as the pk and the random id as the sk. Or, if you want to list the items by State -> Phase -> date, the GSI sk could be a concatenation of the Phase and id property. The above pattern gives you another list mechanism using the LocationId + timestamp of the recommendation create time.
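To illustrate why a time-based id works in a sort key, here is a small sketch. It is a stand-in that only mimics the time-ordered property, not an implementation of the ULID or KSUID formats; the function names and the "#" separator are assumptions.

```python
import secrets
import time


def sortable_id(now_ms=None):
    """Return a time-prefixed random id (a stand-in, not a real ULID/KSUID)."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    # Zero-padding to 13 digits keeps lexicographic order equal to time order.
    # The random suffix avoids collisions between ids created in the same ms.
    return f"{now_ms:013d}-{secrets.token_hex(8)}"


def gsi_sort_key(phase, item_id):
    """Concatenate Phase and id for a State -> Phase -> date listing."""
    return f"{phase}#{item_id}"


older = sortable_id(now_ms=1_600_000_000_000)
newer = sortable_id(now_ms=1_700_000_000_000)
example_sk = gsi_sort_key("Phase1", older)
```

Because the timestamp leads the id, sorting by sort key returns recommendations in creation order within each Phase prefix.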

DynamoDB 1 big table or multiple small tables?

I'm currently facing some questions regarding my database design. Currently I'm developing an API which lets users do the following:
Create an Account ( 1 User owns 1 Account)
Create a Profile ( 1 Account owns 1-n Profiles)
Let a profile upload 2 types of items ( 1 Profile owns 0-n Items ; the items differ in type and purpose)
Calling the API methods triggers AWS Lambda to perform the requested operations in the DynamoDB tables.
My current plan looks like this:
It should be possible to query items by specifying a time frame and the Profile ID. But I think my design completely defeats the purpose of DynamoDB. AWS documentation says that a well-designed product only requires one table.
What would be a good way to realise this architecture in one table?
Are there any drawbacks on using the current design?
What would you specify as Primary/Partition/sort key/secondary indexes in both the current design and a one-table-approach?
I’m going to give this answer assuming that you need to be able to do the following queries.
Given an Account, find all profiles
Given a Profile, find all Items
Given a Profile and a specific ItemType, find all Items
Given an Item, find the owning Profile
Given a Profile, find the owning account
One of the beauties of DynamoDB (and also a bane, perhaps) is that it is mostly schema-less. You need to have the mandatory Primary Key attributes for every item in the table, but all of the other attributes can be anything you like. In order to have a DynamoDB design with only one table, you usually need to get used to the idea of having mixed types of objects in the same table.
That being said, here’s a possible schema for your use case. My suggestion assumes that you are using something like UUIDs for your identifiers.
The partition key is a field that is simply called pkey (or whatever you want). We’ll also call the sort key skey (but again, it doesn’t really matter). Now, for an Account, the value of pkey is Account-{{uuid}} and the value of skey would be the same. For a Profile, the pkey value is also Account-{{uuid}}, but the skey value is Profile-{{uuid}}. Finally, for an Item, the pkey is Profile-{{uuid}} and the skey is Item-{{type}}-{{uuid}}. For all of the attributes of an item, don’t worry about it, just use whatever attributes you want to use.
Since the “parent” object is always the partition key, you can get any of the “child” objects simply by querying for the ID of the parent. For example, your key condition expression to get all the ‘ItemType2’s for a Profile would be
pkey = "Profile-{{uuid}}" AND begins_with(skey, "Item-Type2")
In this schema, your GSI has the same keys as the table, but reversed. You can query the GSI for ‘Item-{{type}}-{{uuid}}’ to get the owning Profile, and similarly query it for a Profile to get the owning Account.
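The key layout described above can be tried out in memory before touching DynamoDB. The following sketch uses plain dictionaries and two stand-in query functions; the helper names and the short ids (instead of UUIDs) are made up for readability.

```python
def account_item(account_id):
    key = f"Account-{account_id}"
    return {"pkey": key, "skey": key}


def profile_item(account_id, profile_id):
    return {"pkey": f"Account-{account_id}", "skey": f"Profile-{profile_id}"}


def upload_item(profile_id, item_type, item_id):
    return {"pkey": f"Profile-{profile_id}", "skey": f"Item-{item_type}-{item_id}"}


def query(items, pkey, skey_prefix=""):
    """Stand-in for: pkey = X AND begins_with(skey, prefix)."""
    return [i for i in items if i["pkey"] == pkey and i["skey"].startswith(skey_prefix)]


def query_inverted(items, skey):
    """Stand-in for querying the GSI whose partition key is the table's skey."""
    return [i for i in items if i["skey"] == skey]


table = [
    account_item("a1"),
    profile_item("a1", "p1"),
    profile_item("a1", "p2"),
    upload_item("p1", "Type1", "i1"),
    upload_item("p1", "Type2", "i2"),
    upload_item("p2", "Type2", "i3"),
]

type2_of_p1 = query(table, "Profile-p1", "Item-Type2")
profiles_of_a1 = query(table, "Account-a1", "Profile-")
owner_of_p1 = query_inverted(table, "Profile-p1")
```

One design detail worth noting: a prefix such as "Item-Type2" would also match a hypothetical "Item-Type20-...", so including the trailing delimiter in the prefix ("Item-Type2-") is the safer choice when type names can share prefixes.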
What I have illustrated here is the adjacency list pattern. DynamoDB also has an article describing how to use composite sort keys for hierarchical data, which would also be suitable for your data, and depending on your expected queries, it might be more suitable than using the adjacency list.
You don’t have to put everything in a single table. Yes, DynamoDB recommends it, but it is far more important to make sure that your application is correct and maintainable. If having multiple tables means it’s easier to write a defect free application, then use multiple tables.

DynamoDB partition key choice for notes app

I want to create a DynamoDB table that allows me to save notes from users.
The attributes I have:
user_id
note_id (uuid)
type
text
The main queries I will need:
Get all notes of a certain user
Get a specific note
Get all notes of a certain type (the less used query)
I know that in terms of performance and DynamoDB partitions note_id would be the right choice because they are unique and would be distributed equally over the partitions, but on the other hand it is much harder to get all notes of a user without scanning all items or using a GSI. And if they are unique I suppose it doesn't make any sense to have a sort key.
The other option would be to use user_id as partition key and note_id as sort key, but if I have certain users that have a much larger number of notes than others, wouldn't that impact my performance?
Is it better to have a partition key unique (like note_id) to scale well with DynamoDB partitions and use GSIs to create my queries or to use instead a partition key for my main query (user_id)?
Thanks
Possibly the simplest and most cost-effective way would be a single table:
Table Structure
note_id (uuid) / hash key
user_id
type
text
Have two GSIs, one for "Get all notes of a certain user" and one for "Get all notes of a certain type (the less used query)":
GSI for "Get all notes of a certain user"
user_id / hash key
note_id (uuid) / range key
type
text
A little note on this: which of your queries is the most frequent, "Get all notes of a certain user" or "Get a specific note"? If it's the former, then you could swap the GSI keys for the table keys and vice versa (in essence, have your user_id + note_id as the key for your table and the note_id as the GSI key). This also depends upon how you structure your user_id. As I suspect you've already picked up on, make sure your user_id is not sequential: make it a UUID or similar.
GSI for "Get all notes of a certain type (the less used query)"
type / hash key
note_id (uuid) / range key
user_id
text
Depending upon the cardinality of the type field, you'll need to test whether a GSI will actually be of benefit here or not.
If the GSI is of little benefit and you need more performance, another option would be to store the type with an array of note_id in a separate table altogether. Beware of the 400 KB item size limit with this one and the fact that you'll need to perform another query to get the text of the note.
With this table structure and GSIs, you're able to make a single query for the information you're after, rather than making two if you have two tables.
Of course, you know your data best: start with what you think is best and then test it to ensure it meets what you're looking for. DynamoDB is priced by provisioned throughput plus the amount of indexed data stored, so creating "fat" indexes with many projected attributes, as above, has a cost. If there is a lot of data, it could become more cost-effective to perform two queries and store less indexed data.
I would use user_id as your primary partition(hash) key and note_id as your primary range(sort) key.
You have already noted that in an ideal situation, each partition key is accessed with equal regularity to optimise performance (see Design For Uniform Data Access Across Items In Your Tables). The use of user_id is perfectly fine as long as you have a good spread of users who regularly log in. Indeed AWS specifically encourage this option (see the 'Choosing a Partition Key' table in the link above).
This approach will also make your application code much simpler than your alternative approach.
You then have a second choice, which is whether to apply a Global Secondary Index for your get notes by type query. A GSI key, unlike a primary key, does not need to be unique (see the AWS GSI guide). Therefore I suggest you simply use type as your GSI partition key without a range key.
The obvious plus side to using a GSI is a faster result when you perform the note type query. However, you should be aware of the downsides too. A GSI has a separate throughput allowance from your table, so you need to provision this in addition to your table throughput (at extra cost). If you don't provision your GSI with enough read units, it could end up slower than a scan on your table. If you don't provision enough write units, your table writes could be throttled, even if your table had enough write units.
Also, AWS warn that GSIs are updated asynchronously (usually within a fraction of a second but it can be longer). This means queries on your GSI might return the 'wrong' result if you have table writes and index reads very close together. If this was a problem you would need to handle it in your application code.
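The table and index suggested in this answer could be declared roughly as follows. This is a sketch of CreateTable parameters, not a tested deployment: the table name "Notes" and index name "type-index" are invented, and on-demand billing is assumed only to keep the example short (with provisioned capacity, the table and each GSI need their own ProvisionedThroughput, as discussed above).

```python
notes_table = {
    "TableName": "Notes",  # hypothetical name
    # Only attributes used in a key schema are declared here.
    "AttributeDefinitions": [
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "note_id", "AttributeType": "S"},
        {"AttributeName": "type", "AttributeType": "S"},
    ],
    # Primary key: user_id (partition) + note_id (sort).
    "KeySchema": [
        {"AttributeName": "user_id", "KeyType": "HASH"},
        {"AttributeName": "note_id", "KeyType": "RANGE"},
    ],
    # GSI keyed on type alone; GSI keys do not have to be unique.
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "type-index",  # hypothetical name
            "KeySchema": [{"AttributeName": "type", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    "BillingMode": "PAY_PER_REQUEST",  # assumption: on-demand capacity
}

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("dynamodb").create_table(**notes_table)
defined = {a["AttributeName"] for a in notes_table["AttributeDefinitions"]}
```

Projecting ALL attributes keeps the type query to a single request at the price of more indexed storage; KEYS_ONLY would be the leaner alternative.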
I see this as 2 tables: users and notes, with a GSI on the notes table. Not sure how else you could do it. Using user_id as primary key and note_id as sort key requires that you can only retrieve elements when you know both the user_id and the note_id. With DynamoDB, if you're not scanning, you have to satisfy all the elements in the primary key, so both the partition and sort key if there is one. Below is how I would do this.
Get all notes of a certain user
When a user creates a note, I would add it to the users table in the user's notes attribute. When you want to get all of a user's notes, retrieve the user and access the array/list of note_ids stored there.
{ userId: xxx,
notes: [ note_id_1,note_id_2,note_id_3]
}
Get a specific note
A notes table with note_id as the primary key would make that easy.
{
noteId: XXXX,
note: "sfsfsfsfsfsf",
type: "standard_note"
}
Get all notes of a certain type (the less used query)
I would use a GSI on the notes table for this with the attributes of "note_type" and note_id projected onto it.
Update
You can pull this off with one table and a GSI (see the two answers below for how), but I would not do it. Your data model is so simple, why make it more complicated than users and notes?

Dynamodb query operations

Is it possible to work with "OR" and "AND" query operations in DynamoDB?
I need to know if DynamoDB has something like "where fname = xxxx OR lname = xxxx" from SQL queries, in Rails.
Thanks.
In general, no.
DynamoDB only allows efficient lookup by primary(hash) key, plus optionally a range query on the "range key". Other attributes are not indexed.
You can use a Scan request to read an entire table and filter by a set of attributes, but this is a relatively expensive and slow option for large tables.
You can simulate AND by creating a primary key that includes both values to be queried, and OR by creating duplicate tables that each use one attribute as their primary key, and querying both tables in parallel with BatchGetItem.
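The OR simulation amounts to running one lookup per attribute and merging the results. A minimal in-memory sketch of the merge step, assuming every item carries a unique "id" attribute (an assumption, since the question does not name a key) and that the two result lists come from lookups keyed on fname and lname respectively:

```python
def merge_or(results_a, results_b, key="id"):
    """Union two result lists, dropping items that appear in both."""
    seen = set()
    merged = []
    for item in list(results_a) + list(results_b):
        if item[key] not in seen:
            seen.add(item[key])
            merged.append(item)
    return merged


# Hypothetical results of the two separate lookups.
by_fname = [
    {"id": "1", "fname": "xxxx", "lname": "smith"},
    {"id": "2", "fname": "xxxx", "lname": "xxxx"},
]
by_lname = [
    {"id": "2", "fname": "xxxx", "lname": "xxxx"},
    {"id": "3", "fname": "ann", "lname": "xxxx"},
]

people = merge_or(by_fname, by_lname)
```

The deduplication matters because an item matching both conditions is returned by both lookups.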
As BCoates mentioned the answer is NO.
If you want consistent read then you can't use BatchGetItem.
No, it is not possible to use the 'OR' operator.
For example, in
KeyConditionExpression: '#hashkey = :hk_val AND #rangekey > :rk_val',
the AND operator is used to match both the HASH and RANGE keys.
Therefore we can't use OR in a DynamoDB key condition.
Although key conditions have no "OR", you can still simulate such SQL queries using Scan and a FilterExpression.
But remember that a scan reads the whole table and only then applies the filter. A single Scan call also doesn't always cover the whole table: each call returns at most 1 MB of data, so you might miss items unless you keep paginating with LastEvaluatedKey. This method is generally avoided because it is very costly. Still, here is a piece of Python 3 code using the boto3 API that simulates the SQL query you want:
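from boto3.dynamodb.conditions import Attr
# table is assumed to be an existing boto3 Table resource
response = table.scan(
    FilterExpression=Attr('fname').eq('xxxxx') | Attr('lname').eq('xxxxx'))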
FilterExpression also supports other operators, such as & (AND) and ~ (NOT).
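Since one Scan call may return only part of the table, the remaining pages have to be fetched by passing LastEvaluatedKey back as ExclusiveStartKey. A minimal sketch of that loop follows; scan_all is a hypothetical helper that accepts any callable shaped like a boto3 table.scan method, and fake_scan is a stand-in so the example runs without AWS access (a real LastEvaluatedKey is a key map, not a string).

```python
def scan_all(scan, **kwargs):
    """Call scan repeatedly until no LastEvaluatedKey is returned."""
    items = []
    while True:
        page = scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items
        # Resume the next call right after the last evaluated item.
        kwargs["ExclusiveStartKey"] = last_key


def fake_scan(ExclusiveStartKey=None, **_ignored):
    """Two-page stand-in for table.scan, used only for this demonstration."""
    pages = {
        None: {"Items": [{"id": 1}, {"id": 2}], "LastEvaluatedKey": "k1"},
        "k1": {"Items": [{"id": 3}]},
    }
    return pages[ExclusiveStartKey]


all_items = scan_all(fake_scan)
```

With a real table, the call would look like scan_all(table.scan, FilterExpression=...), which still reads (and bills for) every item in the table.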