DynamoDB GSI for boolean value - amazon-web-services

So I have this notifications table with the following columns:
PK: (which stores the userId)
sentAt: (which stores the date the notification was sent)
data: (which stores the data of the notification)
Read: (a boolean value which tells whether the user has read the specific notification)
I wanted to create a GSI to get all the notifications for a specific user that are not read (Read: False).
So the partition key would be userId and the sort key would be Read, but the issue is that I cannot use a boolean value as the sort key, so I cannot query for the notifications a user has not read.
This works with scan, but that is not the result I am trying to achieve. Can anyone help me with this? Thanks
const params = {
  TableName: await this.configService.get('NOTIFICATION_TABLE'),
  FilterExpression: '#PK = :PK AND #Read = :Read',
  ExpressionAttributeNames: {
    '#PK': 'PK',
    '#Read': 'Read',
  },
  ExpressionAttributeValues: {
    ':PK': 'NOTIFICATION#a8a8e4c7-cab0-431e-8e08-1bcf962358b8',
    ':Read': true, // this is causing the error
  },
};
const response = await this.dynamoDB.scan(params).promise();

Yes, a Boolean attribute cannot be used as a DynamoDB partition key or sort key. Key attributes must be of type String, Number, or Binary.
Some alternatives you could consider:
Create a GSI with only a partition key, gsi-userId. When you query, you query by userId and filter by Read. This at least saves some cost, as you do not need to scan the whole table. However, be aware of hot partitions.
Consider changing the Read data type to a string instead, e.g. with only the values Y or N. You can then create a GSI, gsi-userId-Read, with userId as the partition key and Read as the sort key, which would fulfill what you need.
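As a minimal sketch of the second alternative, assuming Read has been migrated to the strings 'Y' / 'N' and that a GSI named gsi-userId-Read exists with PK as its partition key and Read as its sort key, the scan from the question could become a query along these lines. The helper name buildUnreadQuery is made up for illustration.

```javascript
// Builds Query parameters for one user's unread notifications.
// Assumption: Read is stored as the string 'Y' or 'N' (not a boolean), and
// the GSI 'gsi-userId-Read' uses PK as partition key and Read as sort key.
function buildUnreadQuery(tableName, userKey) {
  return {
    TableName: tableName,
    IndexName: 'gsi-userId-Read',
    // Both conditions are key conditions now, so no FilterExpression is needed.
    KeyConditionExpression: '#PK = :PK AND #Read = :Read',
    ExpressionAttributeNames: { '#PK': 'PK', '#Read': 'Read' },
    ExpressionAttributeValues: { ':PK': userKey, ':Read': 'N' },
  };
}

// Usage with a DocumentClient instance such as this.dynamoDB in the question:
// const response = await this.dynamoDB.query(buildUnreadQuery(table, userKey)).promise();
```

Because the condition on Read is part of the key condition, only unread items are read from the index, rather than being read and then filtered out.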

Related

How to compare strings in DynamoDB using Lambda NodeJS?

I have a Lambda function that makes some requests to DynamoDB.
var ddb = new AWS.DynamoDB({apiVersion: '2012-08-10'});
const lookupminutes = 10;
// 60 * 1000 ms per minute, so this is "now minus lookupminutes minutes"
var LookupDate = new Date(Date.now() - 60 * 1000 * lookupminutes);
params = {
  TableName: TableName,
  IndexName: "requestdate-index",
  KeyConditionExpression: "requestdate > :startdate",
  ExpressionAttributeValues: {
    ":startdate": {S: LookupDate.toISOString()}
  },
  ProjectionExpression: "id, requestdate"
};
var results = await ddb.query(params).promise();
When running the Lambda function, I'm getting the error "Query key condition not supported" on the line that runs the query against DynamoDB.
The field requestdate is stored in the table as a string.
Does anyone know what I am doing wrong?
Thanks.
You cannot use anything other than an equals operator on a partition key:
params = {
  TableName: TableName,
  IndexName: "requestdate-index",
  KeyConditionExpression: "requestdate = :startdate",
  ExpressionAttributeValues: {":startdate": {S: LookupDate.toISOString()}},
  ProjectionExpression: "id, requestdate"
};
If you need all of the data from the last 10 minutes, you have two choices, neither of which is very scalable unless you shard your key (1a):
1. Put all the data in your index under the same partition key, with the sort key being the timestamp. Then use a KeyConditionExpression like:
gsipk = 1 AND timestamp > (now - 10 minutes)
As all of the items are under the same partition key, the query will be efficient, but at the cost of scalability, as you will essentially cap your write throughput at 1000 WCU.
1a. Probably the best option if you need to scale beyond 1000 WCU is to do the same as above but use a random number within a range for the partition key. For example, range = 0-9. That gives 10 unique partition keys, allowing you to scale to 10k WCU, but requires 10 Query requests in parallel to retrieve the data.
2. Use a Scan with a FilterExpression on the base table. If you do not want to place everything under the same key on the GSI, you can just Scan and add a filter. This becomes slow and expensive as the table grows.
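The sharded variant (1a) could be sketched roughly as follows. This assumes a GSI whose partition key attribute is gsipk (a number in 0-9) and whose sort key attribute is timestamp (an ISO string), and that documentClient is an AWS.DynamoDB.DocumentClient. The function names and the shard count constant are illustrative, not part of the original answer.

```javascript
const SHARD_COUNT = 10; // assumed range 0-9, as in the example above

// On write: pick a random shard to store in the item's gsipk attribute.
function randomShard() {
  return Math.floor(Math.random() * SHARD_COUNT);
}

// On read: one Query per shard, all sharing the same sort key condition.
// 'timestamp' is a DynamoDB reserved word, hence the #ts placeholder.
function buildShardQueries(tableName, indexName, sinceIso) {
  const queries = [];
  for (let shard = 0; shard < SHARD_COUNT; shard++) {
    queries.push({
      TableName: tableName,
      IndexName: indexName,
      KeyConditionExpression: 'gsipk = :pk AND #ts > :since',
      ExpressionAttributeNames: { '#ts': 'timestamp' },
      ExpressionAttributeValues: { ':pk': shard, ':since': sinceIso },
    });
  }
  return queries;
}

// Runs the per-shard queries in parallel and merges the results.
// Pagination via LastEvaluatedKey is omitted to keep the sketch short.
async function queryRecent(documentClient, tableName, indexName, sinceIso) {
  const pages = await Promise.all(
    buildShardQueries(tableName, indexName, sinceIso)
      .map((params) => documentClient.query(params).promise())
  );
  return pages.flatMap((page) => page.Items);
}
```

The merged result is ordered within each shard only, so it needs a client-side sort by timestamp if a global order matters.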

Scanning With sort_key in DynamoDB

I have a table that will contain < 1300 entries at about 600 bytes each. The goal is to display pages of results ordered by epoch date. Right now, for any given search I request the full list of ids using a filtered scan, then handle paging on the UI side. For each page, I pass a chunk of ids to retrieve the full entry (also currently a filtered scan). Ideally, the list of ids would return sorted, but if I understand the docs correctly, only results that have the same partition key are sorted. My current partition key is a uuid, so all entries are unique.
Current Table Configuration
Do I essentially need to use a throwaway key for the partition just to get results returned by date? Maybe the size of my table makes this unreasonable to begin with? Is there a better way to handle this? I have another field, "is_active" that's currently a boolean and could be used for the partition key if I converted it to numeric, but that might complicate my update method. 95% of the time, every entry in the db will be "active", so this doesn't seem efficient.
Scan Index
let params = {
  TableName: this.TABLE_NAME,
  IndexName: this.INDEX_NAME,
  ScanIndexForward: false,
  ProjectionExpression: "id",
  FilterExpression: filterSqlStatement,
  ExpressionAttributeValues: filterValues,
  ExpressionAttributeNames: {
    "#n": "name"
  }
};
let results = await this.DDB_CLIENT.scan(params).promise();
let finalizedResults = results ? results.Items : [];
Given that your dataset is relatively small you might try a fixed partition key with a sort key of the date and the UUID. You'd query by the partition key (which would be a fixed value) and the results would come back sorted. This isn't the best idea with large data sets, but < 1300 is not large.
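A minimal sketch of that layout, assuming a partition key attribute named pk that always holds the same value and a string sort key built from the epoch date plus the UUID. The attribute name, the constant 'ENTRY' and both helper names are hypothetical.

```javascript
const FIXED_PK = 'ENTRY'; // hypothetical constant partition key value

// Zero-pad the epoch so that string order matches numeric order,
// then append the uuid to keep the sort key unique per entry.
function makeSortKey(epochSeconds, uuid) {
  return String(epochSeconds).padStart(12, '0') + '#' + uuid;
}

// Query parameters for one page of results, newest first.
// Pass the previous page's LastEvaluatedKey to fetch the next page.
function buildPageQuery(tableName, pageSize, exclusiveStartKey) {
  const params = {
    TableName: tableName,
    KeyConditionExpression: 'pk = :pk',
    ExpressionAttributeValues: { ':pk': FIXED_PK },
    ScanIndexForward: false, // descending sort key order (Query only)
    Limit: pageSize,
  };
  if (exclusiveStartKey) params.ExclusiveStartKey = exclusiveStartKey;
  return params;
}
```

Note that ScanIndexForward is a Query parameter, so the ordering only takes effect once the Scan is replaced by a Query like this one.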

How to query DynamoDB GSI with compound conditions

I have a DynamoDB table called 'frank' with a single GSI. The partition key is called PK, the sort key is called SK, the GSI partition key is called GSI1_PK and the GSI sort key is called GSI1_SK. I have a single 'data' map storing the actual data.
Populated with some test data it looks like this:
The GSI partition key and sort key map directly to the attributes with the same names within the table.
I can run a partiql query to grab the results that are shown in the image. Here's the partiql code:
select PK, SK, GSI1_PK, GSI1_SK, data from "frank"."GSI1"
where
  ("GSI1_PK"='tesla')
  and
  (
    ( "GSI1_SK" >= 'A_VISITOR#2021-06-01-00-00-00-000' and "GSI1_SK" <= 'A_VISITOR#2021-06-20-23-59-59-999' )
    or
    ( "GSI1_SK" >= 'B_INTERACTION#2021-06-01-00-00-00-000' and "GSI1_SK" <= 'B_INTERACTION#2021-06-20-23-59-59-999' )
  )
Note how the PartiQL code references "GSI1_SK" multiple times. The PartiQL query works and returns the data shown in the image. All great so far.
However, I now want to move this into a Lambda function. How do I structure an AWS.DynamoDB.DocumentClient query to do exactly what this PartiQL query is doing?
I can get this to work in my Lambda function:
const visitorStart = "A_VISITOR#2021-06-01-00-00-00-000";
const visitorEnd = "A_VISITOR#2021-06-20-23-59-59-999";
var params = {
  TableName: "frank",
  IndexName: "GSI1",
  KeyConditionExpression: "#GSI1_PK=:tmn AND #GSI1_SK BETWEEN :visitorStart AND :visitorEnd",
  ExpressionAttributeNames: { "#GSI1_PK": "GSI1_PK", "#GSI1_SK": "GSI1_SK" },
  ExpressionAttributeValues: {
    ":tmn": lowerCaseTeamName,
    ":visitorStart": visitorStart,
    ":visitorEnd": visitorEnd
  }
};
const data = await documentClient.query(params).promise();
console.log(data);
But as soon as I try a more complex compound condition I get this error:
ValidationException: Invalid operator used in KeyConditionExpression: OR
Here is the more complex attempt:
const visitorStart = "A_VISITOR#2021-06-01-00-00-00-000";
const visitorEnd = "A_VISITOR#2021-06-20-23-59-59-999";
const interactionStart = "B_INTERACTION#2021-06-01-00-00-00-000";
const interactionEnd = "B_INTERACTION#2021-06-20-23-59-59-999";
var params = {
  TableName: "frank",
  IndexName: "GSI1",
  KeyConditionExpression: "#GSI1_PK=:tmn AND (#GSI1_SK BETWEEN :visitorStart AND :visitorEnd OR #GSI1_SK BETWEEN :interactionStart AND :interactionEnd) ",
  ExpressionAttributeNames: { "#GSI1_PK": "GSI1_PK", "#GSI1_SK": "GSI1_SK" },
  ExpressionAttributeValues: {
    ":tmn": lowerCaseTeamName,
    ":visitorStart": visitorStart,
    ":visitorEnd": visitorEnd,
    ":interactionStart": interactionStart,
    ":interactionEnd": interactionEnd
  }
};
const data = await documentClient.query(params).promise();
console.log(data);
The docs say that KeyConditionExpressions don't support 'OR'. So, how do I replicate my more complex partiql query in Lambda using AWS.DynamoDB.DocumentClient?
If you look at the documentation of PartiQL for DynamoDB, it warns you that PartiQL will fall back to a full table scan to get your data if it has to: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ql-reference.select.html#ql-reference.select.syntax
To ensure that a SELECT statement does not result in a full table scan, the WHERE clause condition must specify a partition key. Use the equality or IN operator.
When the WHERE clause does not do that, PartiQL runs a scan and uses a FilterExpression to filter out the data.
Of course, in your example you provided a partition key, so I'd assume that PartiQL runs a query with the partition key and a FilterExpression to apply the rest of the condition.
You could replicate it that way, and depending on the size of your partitions this might work just fine. However, if the partition grows beyond 1 MB and most of the data is filtered out, you'll need to deal with pagination even for pages that return no data.
Because of that, I'd suggest you simply split it up: run each OR condition as a separate query and merge the data on the client.
Unfortunately, DynamoDB does not support OR in a KeyConditionExpression: it accepts an equality condition on the partition key and, optionally, a single condition on the sort key. The PartiQL query you are executing is probably performing a full table scan to return the results.
If you want to replicate the PartiQL query using the DocumentClient, you could use the scan operation. If you want to avoid using scan, you could perform two separate query operations and join the results in your application code.
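The "two separate queries, merged in application code" approach might look like the sketch below. It reuses the table, index and key names from the question and assumes documentClient is an AWS.DynamoDB.DocumentClient. The helper names buildRangeQuery, queryAll and queryVisitorsAndInteractions are made up for illustration.

```javascript
// One key condition per range: partition key equality plus a single BETWEEN.
function buildRangeQuery(teamName, start, end) {
  return {
    TableName: 'frank',
    IndexName: 'GSI1',
    KeyConditionExpression: '#GSI1_PK = :tmn AND #GSI1_SK BETWEEN :start AND :end',
    ExpressionAttributeNames: { '#GSI1_PK': 'GSI1_PK', '#GSI1_SK': 'GSI1_SK' },
    ExpressionAttributeValues: { ':tmn': teamName, ':start': start, ':end': end },
  };
}

// Follows LastEvaluatedKey until the whole range has been read.
async function queryAll(documentClient, params) {
  const items = [];
  let lastKey;
  do {
    const request = lastKey ? { ...params, ExclusiveStartKey: lastKey } : params;
    const page = await documentClient.query(request).promise();
    items.push(...page.Items);
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);
  return items;
}

// Runs both range queries in parallel and joins the results.
async function queryVisitorsAndInteractions(documentClient, teamName) {
  const [visitors, interactions] = await Promise.all([
    queryAll(documentClient, buildRangeQuery(teamName,
      'A_VISITOR#2021-06-01-00-00-00-000', 'A_VISITOR#2021-06-20-23-59-59-999')),
    queryAll(documentClient, buildRangeQuery(teamName,
      'B_INTERACTION#2021-06-01-00-00-00-000', 'B_INTERACTION#2021-06-20-23-59-59-999')),
  ]);
  return visitors.concat(interactions);
}
```

Since every A_VISITOR key sorts before every B_INTERACTION key, concatenating the two result sets in this order keeps the overall GSI1_SK ordering.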

DynamoDB Between query on GSI does not work as expected

It is a jobPosts schema that has a posted_date as one of the attributes. The goal is to query all the job posts between two dates.
Here is the schema for your reference:
{
'job_id': {S: jobInfo.job_id},
'company': {S: jobInfo.company},
'title': {S: jobInfo.title},
'posted_on': {S: jobInfo.posted_on},
}
'posted_on' is an ISO string (2019-11-10T10:52:38.013Z). job_id is the primary key (partition key), and since I need to query by date, I created a GSI with posted_on as its partition key. Here is the query:
const params = {
  TableName: "jobPosts",
  IndexName: 'date_for_filter_purpose-index',
  ProjectionExpression: "job_id, company, title, posted_on",
  KeyConditionExpression: "posted_on BETWEEN :startDate AND :endDate",
  ExpressionAttributeValues: {
    ":startDate": {S: "2019-10-10T10:52:38.013Z"},
    ":endDate": {S: "2019-11-10T10:52:38.013Z"}
  }
};
I have one document in dynamoDB and here it is:
{
job_id:,
company: "xyz",
title: "abc",
posted_on: "2019-11-01T10:52:38.013Z"
}
Now, on executing this, I get the following error:
{
"message": "Query key condition not supported",
"code": "ValidationException",
"time": "2019-11-11T06:15:37.231Z",
"requestId": "J078NON3L8KSJE5E8I3IP9N0IBVV4KQNSO5AEMVJF66Q9ASUAAJG",
"statusCode": 400,
"retryable": false,
"retryDelay": 12.382362030893768
}
I don't know what is wrong with the above query.
Update after Tommy's answer:
I removed the GSI on posted_on and re-created the table with job_id as partition key and posted_on as sort key. I get the following error:
{
"message": "Query condition missed key schema element: job_id",
"code": "ValidationException",
"time": "2019-11-12T11:01:48.682Z",
"requestId": "M9E793UQNJHPN5ULQFJI2NR0BVVV4KQNSO5AEMVJF66Q9ASUAAJG",
"statusCode": 400,
"retryable": false,
"retryDelay": 42.52613025785952
}
As per this SO answer, GSI should be able to query the dates using BETWEEN keyword.
The answer you refer to relates to a query where the partition key has a specific value and the sort key is in a given range. It's analogous to select * from table where status=Z and date between X and Y. That's not what you're trying to do, if I read your question correctly. You want select * from table where date between X and Y. You cannot do this with a DynamoDB query: you cannot query a partition key by range.
If you knew that your max range of query dates was on a given day then you could create a GSI with a partition key set to the computed YYYYMMDD value of the date/time and whose sort key was the full date/time. Then you could query with a key condition expression for a partition key of the computed YYYYMMDD and a sort key between X and Y. For this to work, the YYYYMMDD of X and Y would have to be the same.
If you knew that your max range of query dates was a month then you could create a GSI with partition key set to the computed YYYYMM of the date/time and whose sort key was the full date/time. For this to work, the YYYYMM of X and Y would have to be the same.
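As a rough sketch of the per-day variant, assuming each item also stores a computed posted_day attribute (YYYYMMDD) and that a GSI exists with posted_day as partition key and posted_on as sort key. The attribute name posted_day, the index name and both helper names are hypothetical, and the values are written in DocumentClient form (plain strings rather than {S: ...}).

```javascript
// '2019-11-01T10:52:38.013Z' becomes '20191101'
function dayBucket(isoString) {
  return isoString.slice(0, 10).replace(/-/g, '');
}

// Builds a Query for a range that lies within a single (UTC) day.
function buildSameDayQuery(startIso, endIso) {
  const bucket = dayBucket(startIso);
  if (bucket !== dayBucket(endIso)) {
    throw new Error('start and end must fall on the same day');
  }
  return {
    TableName: 'jobPosts',
    IndexName: 'posted_day-posted_on-index', // hypothetical index name
    KeyConditionExpression: 'posted_day = :day AND posted_on BETWEEN :start AND :end',
    ExpressionAttributeValues: { ':day': bucket, ':start': startIso, ':end': endIso },
  };
}
```

A range that spans several days would need one such query per day bucket, with the results merged on the client.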
I guess it's a little counter-intuitive, but DynamoDB supports only the EQ (equality) condition on partition key attributes.
As per KeyConditions Documentation
You must provide the index partition key name and value as an EQ condition. You can optionally provide a second condition, referring to the index sort key.
Furthermore, in Query API Documentation you can find the following
The condition must perform an equality test on a single partition key value.
The condition can optionally perform one of several comparison tests on a single sort key value. This allows Query to retrieve one item with a given partition key value and sort key value, or several items that have the same partition key value but different sort key values.
That explains the error message you are getting.
One of the solutions might be to create a composite primary key with posted_on attribute as the sort key, instead of the GSI. Then, depending on your use case and access pattern, you'll need to figure out which attribute would work best as the partition key.
This blog should help you to choose the right partition key for your schema.

DynamoDB: Best hash/sort keys for my use case [confusion with AppSync/GraphQL]

I plan on using AWS Cognito for user auth, DynamoDB for persistence and AppSync (and a lot of Mobile Hub) to power the API - a Book Review site.
I'm having a hard time determining which field should be my hash key and which should be my sort key, and which LSI/GSI I should create.
I have a list of Books with details like so:
type Book {
isbn: Int!
year: Int!
title: String!
description: String
front_cover_photo_url: String
genre_ids: [Int]
count_thumbs: Int
us_release_date: String
upcoming_release: Boolean
currently_featured_in_book_stores: Boolean
best_seller: Boolean
reviews: [Review]
}
I also have a review record each time a user writes a review about a book.
type Review {
isbn: Int!
id: ID!
created_at: String!
# The user that submitted the review
user_id: String!
# The number of thumbs out of 5
thumbs: Int!
# Comments on the review
comments: String!
}
Books, in my case, can have multiple genres - e.g."Fantasy" and "Drama". Books also have reviews by Users, whose data is stored in Cognito. We will display the reviews in reverse chronological order next to every book.
QUESTION 1: If I denormalize and use Drama as a genre instead of Genre ID 2, then what if I need to rename the genre later to Dramatic... wouldn't I need to update every item?
I need to be able to answer, at a minimum:
Get all books currently featured in book stores [currently_featured_in_book_stores == True]
Get all books that are "upcoming" [upcoming_release == True]
Get all books sorted by most thumbs [sort by count_thumbs DESC]
Get all books that are in genre "Comedy" [genre_ids contains 123 or "Comedy" depending on answer to Q1]
Query for book(s) named "Harry Potter" [title LIKE '%Harry Potter%']
Get all books with ISBN 1, 2, 3, 4, or 9 [ isbn IN [1,2,3,4,9] ]
QUESTION 2: What's the best way to structure the book data in DynamoDB, and which hash/sort/LSI/GSI would you use?
Since I'm using Cognito, the user profile data is stored outside of DynamoDB.
QUESTION 3: Should I have a User table in DynamoDB and dual write new registrations, so I can use AppSync to populate the review's details when showing their review? If not, how would I get the user's username/first name/last name when populating the book review details?
QUESTION 4: Since we've gone this far, any suggestions for the graphql schema?
I would encourage you to read this answer, which I wrote previously to provide some general background on choosing keys. You should also open the links from that answer, which provide most of the key information AWS makes available on the subject.
Before providing an answer I think I should also give the caveat that data architecture typically takes into account lots of factors. You've put some really good information in the question but inevitably there is not enough to provide a definitive 'best' solution. And indeed even with more information you would get different opinions.
Question 2
That said, here is what I would be thinking about doing in your case. I would be looking at creating a table called Books and a table called BookReviews.
Table: Books
Partition Key: ISBN
Table: BookReviews
Partition Key: ISBN
Sort Key: BookReview-id
I would not be looking to create any GSIs or LSIs.
Most of your queries involve finding 'all books' and ordering them in some way. These lists do not sound time sensitive. For example when a user asks for the most popular 100 books do they need to know the most popular books, including every vote counted up until the last second? I doubt it. Additionally are these lists specific to individual users? It doesn't sound like it.
My general tip is this; store your raw data in DynamoDB, and update it in real time. Create your common lists of books and update them once in a while (perhaps daily), store these lists in a cache. Optionally you could store these lists in DynamoDB in separate tables and query them in the event your cache is destroyed.
Get all books currently featured in book stores
var params = {
  TableName: "Books",
  ExpressionAttributeValues: {
    ":a": {
      BOOL: true
    }
  },
  FilterExpression: "currently_featured_in_book_stores = :a"
};
dynamodb.scan(params, function(err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else console.log(data); // successful response
});
This operation will retrieve all books that are currently featured in book stores. It uses a scan. If you are not already familiar with scan, query and getItem, you should definitely spend some time reading about them.
A scan evaluates every item in a table. For this reason scans sometimes don't scale well on large tables and can be expensive if you are only retrieving a few items. A query uses the partition key to return a set of items and is therefore typically fast and efficient. You can use a sort key in a query to quickly return a range of items from within a partition. GetItem uses the unique primary key and is very efficient.
Read cost depends on the amount of data read, not on the number of items returned. As a simplified illustration that counts one RCU per item: if your table had 100 items, ANY scan you perform would cost 100 RCUs, whereas a query on a partition containing only 2 items would cost 2 RCUs.
If a significant proportion of items in the Books table have currently_featured_in_book_stores=true, I would do a scan. If only a small number of items in the table have currently_featured_in_book_stores=true AND this is a very frequent query, you could consider creating a GSI on the Books table with partition key of currently_featured_in_book_stores and sort key of ISBN.
Imagine your books table has 100 books, and 50 have currently_featured_in_book_stores=true. Doing a scan costs 100 RCUs and won't cost much more than a query. Now imagine only one book has currently_featured_in_book_stores=true: performing a scan would cost 100 RCUs, but a query would only cost 1 RCU. However, you should think hard before adding GSIs: they do not share throughput with the base table, and you have to provision RCUs separately for your GSI. If you under-provision a GSI, it can end up being slower than a scan on a well-provisioned base table.
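The per-item RCU figures in this answer are a simplification. As a back-of-the-envelope sketch, assuming the usual rule that a Scan or Query is billed on the total size of the data it reads, rounded up to 4 KB units, with eventually consistent reads costing half, an estimate could be computed like this. The function name and the 4096-byte reading of "4 KB" are assumptions.

```javascript
const READ_UNIT_BYTES = 4 * 1024; // assumed size of one read unit (4 KB)

// Rough read capacity estimate for a single Scan or Query page.
// totalBytesRead is the combined size of all items evaluated,
// whether or not they survive a FilterExpression.
function estimateReadUnits(totalBytesRead, stronglyConsistent) {
  const units = Math.ceil(totalBytesRead / READ_UNIT_BYTES);
  return stronglyConsistent ? units : units / 2;
}

// Example: 100 small items of about 600 bytes each, read in one scan.
const scanEstimate = estimateReadUnits(100 * 600, true);
```

The point of the comparison still holds: a scan pays for every item in the table, while a query pays only for the items in the partition (and sort key range) it reads.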
A boolean value is a bad partition key, and I would go for a scan here. Note also that a Boolean attribute cannot be used as an index key in DynamoDB, so the GSI above would need the flag stored as a String or Number. Assuming it is stored as the string "true", your query would look like this:
var params = {
  TableName: "Books",
  IndexName: "Index_Books_In_Stores",
  ExpressionAttributeValues: {
    ":v1": {
      S: "true" // key attributes must be String, Number or Binary, not BOOL
    }
  },
  KeyConditionExpression: "currently_featured_in_book_stores = :v1"
};
dynamodb.query(params, function(err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else console.log(data); // successful response
});
Get all books that are upcoming
All of the above still applies. I would do a scan like this
var params = {
  TableName: "Books",
  ExpressionAttributeValues: {
    ":a": {
      BOOL: true
    }
  },
  FilterExpression: "upcoming_release = :a"
};
dynamodb.scan(params, function(err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else console.log(data); // successful response
});
I would do this scan infrequently and cache the results in a temporary store (i.e. in application memory).
Get all books sorted by most thumbs
The important thing here is the 'Get all books...'. That tells you right away that a scan is probably going to be the best approach. You can think of a query as a scan that only looks at one partition. You don't want to look at a partition of books, you want ALL the books, so a scan is the way to go.
The only way DynamoDB will return sorted items is if you perform a query on a table or index that has a sort key. In this case the items would automatically be returned in sorted order based on the sort key. So for this search, you just need to do a scan to get all the books, and then sort them by your chosen attribute (thumbs) client side. The scan simply returns all books and looks like this.
var params = {
  TableName: "Books"
};
dynamodb.scan(params, function(err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else console.log(data); // successful response
});
Again, I would do this scan very infrequently and cache the top books. You can order your cache and just retrieve the number of items you need, perhaps the top 10, 100 or 1000. If the user carried on paging beyond the scope of the cache, you might need to do a new scan. I think more likely you would just limit the number of items and stop the user paging any further.
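A sketch of the scan-then-sort step, assuming a DocumentClient (so items come back as plain objects, unlike the low-level client used in the snippets above) and treating a missing count_thumbs as 0. The function names are illustrative.

```javascript
// Returns the `limit` books with the most thumbs, highest first.
// Copies the array so the cached scan result is not reordered in place.
function topByThumbs(books, limit) {
  return [...books]
    .sort((a, b) => (b.count_thumbs || 0) - (a.count_thumbs || 0))
    .slice(0, limit);
}

// Reads the whole Books table, following LastEvaluatedKey across pages,
// since a single Scan call returns at most 1 MB of data.
async function scanAllBooks(documentClient) {
  const books = [];
  let lastKey;
  do {
    const params = { TableName: 'Books' };
    if (lastKey) params.ExclusiveStartKey = lastKey;
    const page = await documentClient.scan(params).promise();
    books.push(...page.Items);
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);
  return books;
}
```

The result of topByThumbs(await scanAllBooks(documentClient), 100) is what would be stored in the cache and refreshed occasionally.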
Get all books that are in genre "Comedy"
Again, most likely I would do a scan infrequently and cache the list. You could consider adding a GSI with partition key genre and sort key ISBN. Personally I would start with the scan and cache approach and see how you get on. You can always add the GSI at a later date.
Query for book(s) named "Harry Potter"
Clearly you can't cache this one. Do a scan with a FilterExpression on title. Note that contains is case-sensitive.
var params = {
  TableName: "Books",
  ExpressionAttributeValues: {
    ":a": {
      S: "Harry Potter"
    }
  },
  FilterExpression: "contains(title, :a)"
};
dynamodb.scan(params, function(err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else console.log(data); // successful response
});
You can check out the condition operators here
Get all books with ISBN 1, 2, 3, 4, or 9
For this one, do a GetItem on each individual ISBN and add it into a set. The query below gets one book. You would put this in a loop and iterate through the set of ISBNs you want to get.
var params = {
  Key: {
    "ISBN": {
      S: "1"
    }
  },
  TableName: "Books"
};
dynamodb.getItem(params, function(err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else console.log(data); // successful response
});
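As an alternative to looping over GetItem (this is a different technique from the one described above), BatchGetItem can fetch up to 100 keys per request. The sketch below only builds the request objects; it assumes the ISBN key is stored as a string, as in the snippet above, and the helper name is made up.

```javascript
const MAX_BATCH_KEYS = 100; // BatchGetItem accepts at most 100 keys per request

// Splits a list of ISBNs into BatchGetItem requests for the Books table,
// using the low-level attribute value format ({ S: ... }).
function buildBatchGetRequests(isbns) {
  const requests = [];
  for (let i = 0; i < isbns.length; i += MAX_BATCH_KEYS) {
    const keys = isbns
      .slice(i, i + MAX_BATCH_KEYS)
      .map((isbn) => ({ ISBN: { S: String(isbn) } }));
    requests.push({ RequestItems: { Books: { Keys: keys } } });
  }
  return requests;
}

// Usage with the low-level client from the snippets above:
// dynamodb.batchGetItem(buildBatchGetRequests([1, 2, 3, 4, 9])[0], callback);
// data.Responses.Books holds the items; retry anything in data.UnprocessedKeys.
```

Items in a batch response are not returned in any particular order, so match them back to the requested ISBNs by key.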
Question 1
Yes, if you store the genre as a string against each item, and you change the genre name, you would have to update each item. Or as an alternative you would have to update the genre on the item before presenting it to the user.
If you expect to change genre names, the idea of using genre_id mappings seems like a good one. Just have a table of genre names and ids, load it when your application starts and keep it in application memory. You might need an admin function to reload the genre mappings table.
Keeping application parameters in a database is a well used design.
Question 3
Absolutely, have a User table in DynamoDB. That's the way I do it in my application which uses Cognito. I store a minimum set of fields in Cognito relating to user registration, then I have lots of application specific data in DynamoDB in a user table.
Question 4
Regarding GraphQL schemas, I would check out these articles by AWS. Not too sure if that's of help.