Coming from a SQL background, I am trying to understand NoSQL options, particularly DynamoDB. Given this schema:
{
  "publist": [
    {
      "Author": "John Scalzi",
      "Title": "Old Man's War",
      "Publisher": "Tor Books",
      "Tags": [
        "DeepSpace",
        "SciFi"
      ]
    },
    {
      "Author": "Ursula Le Guin",
      "Title": "Wizard of Earthsea",
      "Publisher": "Mifflin Harcourt",
      "Tags": [
        "MustRead",
        "Fantasy"
      ]
    },
    {
      "Author": "Cory Doctorow",
      "Title": "Little Brother",
      "Publisher": "Doherty"
    }
  ]
}
I could give the main table Author/Title as hash/range keys. A global secondary index could be Publisher/Title. What are the best practices here? How can I get a list of all Authors for a publisher without a total table scan? Can't have a secondary index because Publisher/Author is not unique! Also, what are my options if I want all the titles that have a tag of DeepSpace?
EDIT: See RPM & Vikdor answers below. GSI need not be unique, so Publisher/Author is possible. But question remains: is there any workaround for getting all authors by tag, without full table scan?
Can't have a secondary index because Publisher/Author is not unique!
Sure you can. Just make sure your Publisher/Title index has Author as a projection; you can then query by publisher, iterate over the results, and collect the authors.
When you set up your indexes, you can choose which attributes are projected into the index. Having a Publisher or Publisher/Title key doesn't mean you can only view the Publisher or the Publisher and Title; it means you can only query by Publisher, optionally narrowed by Title. So if you have all attributes, or just the Author attribute, projected into your index, you can get a list of authors by publisher using a query rather than a full table scan.
Can't have a secondary index because Publisher/Author is not unique!
The (hash primary key, range primary key) tuple need not be unique for defining a Global Secondary Index. This is only a requirement for the Table level key definitions, i.e. the table cannot have multiple rows with the same values of (hash primary key, range primary key) tuple.
How can I get a list of all Authors for a publisher without a total table scan
You define a GSI on Publisher (Hash PK), Author (Range PK) and use DynamoDB query on the GSI with the Publisher attribute set as the Hash Key Value.
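As a minimal sketch of that query in Python: the functions below only build the request parameters for the low-level Query API and post-process a result, so they run without AWS access. The table name `Books` and the index name `Publisher-Author-index` are assumptions for illustration, not names from the question. Because the GSI key need not be unique, an author can appear once per title, so the result is de-duplicated.

```python
def authors_by_publisher_request(publisher, table="Books",
                                 index="Publisher-Author-index"):
    """Build Query parameters for a GSI keyed on Publisher (hash) / Author (range).

    Table and index names are hypothetical defaults.
    """
    return {
        "TableName": table,
        "IndexName": index,
        # Placeholders avoid any clash with DynamoDB reserved words.
        "KeyConditionExpression": "#pub = :p",
        "ProjectionExpression": "#author",
        "ExpressionAttributeNames": {"#pub": "Publisher", "#author": "Author"},
        "ExpressionAttributeValues": {":p": {"S": publisher}},
    }


def distinct_authors(items):
    """Collect unique authors from Query result items, keeping first-seen order."""
    seen = []
    for item in items:
        author = item["Author"]["S"]
        if author not in seen:
            seen.append(author)
    return seen


# With boto3 installed and configured (an assumption), usage would look like:
#   resp = boto3.client("dynamodb").query(**authors_by_publisher_request("Tor Books"))
#   print(distinct_authors(resp["Items"]))
```

A real caller would also follow `LastEvaluatedKey` if the publisher has more results than fit in one response.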
Unlike SQL, where you can create non-clustered indexes on arbitrary columns as retrieval patterns emerge, DynamoDB limits the number of Local Secondary Indexes (LSIs) and Global Secondary Indexes (GSIs) per table. It is therefore important to list the data retrieval use cases before choosing the Hash Primary Key and Range Primary Key for a table. Leverage LSIs as much as possible: they use the table's read and write capacity and support strongly consistent reads (you can also run eventually consistent queries on LSIs to save capacity). GSIs need their own read and write capacity and are eventually consistent.
Unfortunately, querying by tag is not currently supported in DynamoDB. Unlike MongoDB, DDB does not provide the capability to query on nested documents.
In this situation, consider modelling the data differently and putting the nested document (here, the tags) in a separate table.
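One way to picture the separate table is as one item per (tag, title) pair. The sketch below flattens a book from the question's schema into such items; the idea of a second table (say `BookTags`, a hypothetical name) with `Tag` as hash key and `Title` as range key is an assumption, not something the answer spells out.

```python
def tag_items(book):
    """Return one item per tag for a book; books without Tags yield no items."""
    return [
        {"Tag": tag, "Title": book["Title"], "Author": book["Author"]}
        for tag in book.get("Tags", [])
    ]


old_mans_war = {
    "Author": "John Scalzi",
    "Title": "Old Man's War",
    "Publisher": "Tor Books",
    "Tags": ["DeepSpace", "SciFi"],
}

for item in tag_items(old_mans_war):
    print(item)
```

With items shaped like this, "all titles tagged DeepSpace" becomes a Query with `Tag` equal to `DeepSpace` on the tag table instead of a scan of the main table. The cost is that the application has to keep both tables in sync on writes.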
I have a table with a "userId" column (set as the partition key) and a "createdAt" column (set as the sort key), so together they form a composite primary key.
I also need to find the exact row when I don't have the user ID available, so I added another column, "id", and made it a global secondary index.
In my case, should I make the "id" column the primary key and remove "userId" as the partition key, or would that defeat the purpose of partitioning in DynamoDB?
Similarly, if I need to delete a row from the table, should I send the "createdAt" field from the front end so the exact row can be found? Does this make sense? Sending the row's "id" seems better to me for deleting the row.
You probably don't want to put a timestamp in your user primary keys. Why? You'd need to know the exact time the user was created to fetch a user, which is probably not what you want.
Consider using a partition key of USER#<user_id> and a sort key of something predictable, like A or METADATA or USER#<user_id>. This allows you to fetch/delete a user by their ID.
If you have access patterns around fetching users in order of account creation, you can create a GSI with the sort key set to the createdAt attribute.
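A small sketch of the key shape described above, assuming the key attributes are named `PK` and `SK` (hypothetical names; the answer does not name them) and that `METADATA` is the fixed sort key value:

```python
def user_key(user_id):
    """Build the primary key for a user item from its ID alone."""
    return {"PK": f"USER#{user_id}", "SK": "METADATA"}


print(user_key("42"))
```

Because the key is derived entirely from the user ID, the front end only has to send the ID; with a boto3 `Table` resource (assumed), `table.get_item(Key=user_key("42"))` or `table.delete_item(Key=user_key("42"))` would then address exactly one item.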
I have a DynamoDB table with partition key as userID and no sort key.
The table also has a timestamp attribute in each item. I wanted to retrieve all the items having a timestamp in the specified range (regardless of userID i.e. ranging across all partitions).
After reading the docs and searching Stack Overflow (here), I found that I need to create a GSI for my table.
Hence, I created a GSI with the following keys:
Partition Key: userID
Sort Key: timestamp
I am querying the index with Java SDK using the following code:
String lastWeekDateString = getLastWeekDateString();
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
DynamoDB dynamoDB = new DynamoDB(client);
Table table = dynamoDB.getTable("user table");
Index index = table.getIndex("userID-timestamp-index");
QuerySpec querySpec = new QuerySpec()
.withKeyConditionExpression("timestamp > :v_timestampLowerBound")
.withValueMap(new ValueMap()
.withString(":v_timestampLowerBound", lastWeekDateString));
ItemCollection<QueryOutcome> items = index.query(querySpec);
Iterator<Item> iter = items.iterator();
while (iter.hasNext()) {
Item item = iter.next();
// extract item attributes here
}
I am getting the following error on executing this code:
Query condition missed key schema element: userID
From what I know, I should be able to query the GSI using only the sort key without giving any condition on the partition key. Please help me understand what is wrong with my implementation. Thanks.
Edit: After reading the thread here, it turns out that we cannot query a GSI with only a range on the sort key. So, what is the alternative, if any, to query the entire table by a range query on an attribute? One suggestion I found in that thread was to use year as the partition key. This will require multiple queries if the desired range spans multiple years. Also, this does not distribute the data uniformly across all partitions, since only the partition corresponding to the current year will be used for insertions for one full year. Please suggest any alternatives.
When using the DynamoDB Query operation, you must specify at least the partition key. This is why you get the error that userID is required. From the AWS Query docs:
The condition must perform an equality test on a single partition key value.
The only way to get items without the partition key is a Scan operation (but the results won't be sorted by your sort key!).
If you want to get all the items sorted, you have to create a GSI whose partition key has the same value for all the items you need (e.g. add a new attribute to all items, such as "type": "item"). You can then query the GSI and specify #type = :item:
// "type" and "timestamp" are DynamoDB reserved words, so both are aliased with # placeholders.
QuerySpec querySpec = new QuerySpec()
    .withKeyConditionExpression("#type = :item AND #ts > :v_timestampLowerBound")
    .withNameMap(new NameMap()
        .with("#type", "type")
        .with("#ts", "timestamp"))
    .withValueMap(new ValueMap()
        .withString(":v_timestampLowerBound", lastWeekDateString)
        .withString(":item", "item"));
A good solution for any custom querying requirement in DDB is to design the right primary key scheme for a GSI.
When designing a DDB primary key, the main principle is that the hash key should partition the full set of items, and the sort key should sort items within a partition.
With that in mind, I recommend using the year of the timestamp as the hash key and the month-date as the sort key.
As long as the range spans no more than two calendar years, you need at most 2 queries.
You are right that you should avoid filtering or scanning as much as you can.
For example, if the start date and the end date fall in the same year, you need only one query. A key condition allows only one condition on the sort key, so the range is written with BETWEEN (which is inclusive):
.withKeyConditionExpression("#year = :year and #month_date between :start_month_date and :end_month_date")
Otherwise you need two queries, one like this:
.withKeyConditionExpression("#year = :start_year and #month_date > :start_month_date")
and one like this:
.withKeyConditionExpression("#year = :end_year and #month_date < :end_month_date")
(The placeholder names use underscores because hyphens are not allowed in expression placeholders.)
Finally, union the result sets from both queries.
This takes at most 2 Query requests; the read capacity consumed depends on how much data the queries read, not on the number of requests.
For easier sort key comparison, you might want to use a UNIX timestamp.
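The splitting logic can be sketched in a few lines of Python. The function below is an illustration under stated assumptions: the sort key is a zero-padded `MM-DD` string (so that string order matches date order), and each returned tuple would feed one Query whose key condition is an equality on the year plus an inclusive `BETWEEN` on the month-date. The function name is made up for this example.

```python
from datetime import date


def yearly_queries(start, end):
    """Split the inclusive range [start, end] into one (year, low, high) per year.

    low and high are zero-padded "MM-DD" strings, inclusive on both ends.
    """
    if start > end:
        raise ValueError("start must not be after end")
    queries = []
    for year in range(start.year, end.year + 1):
        low = start if year == start.year else date(year, 1, 1)
        high = end if year == end.year else date(year, 12, 31)
        queries.append((str(year), low.strftime("%m-%d"), high.strftime("%m-%d")))
    return queries


for year, low, high in yearly_queries(date(2023, 12, 20), date(2024, 1, 5)):
    print(year, low, high)
```

A range inside a single year yields one tuple, a range crossing one year boundary yields two, and the results of the individual queries are concatenated in order.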
I have a 500 GB table. I want to transfer data to another table based on timestamps.
The table contains several items, and I want only the latest entry of each item in the other table.
Considering the size of the table, can anyone recommend the best AWS service to get this done quickly and easily?
I have come across AWS Glue and HiveCopyActivity. Are these the best solutions, or is there another service I can use?
(assuming you can still add a global secondary index (GSI) to that table, i.e., you are below the per-table GSI quota)
Define a new GSI on your table. The GSI's partition key will be x and its sort key will be timestamp. Once the GSI is defined, you can query the index with ScanIndexForward set to false to get the most recent item first. You need to supply the value of x you are interested in. In the following example request it is simply set to 'abc':
{
  "TableName": "<your-table-name>",
  "IndexName": "<your-GSI-name>",
  "KeyConditionExpression": "x = :argx",
  "ExpressionAttributeValues": {
    ":argx": {"S": "abc"}
  },
  "ScanIndexForward": false,
  "Limit": 1
}
This query looks at items with a given x value (as set in the ExpressionAttributeValues field) sorted in descending order (by the GSI's sort key, which is the timestamp field) and picks the first one (Limit is set to 1). As long as you do not need filtering (the FilterExpression field is empty) then you will get the result that you need by issuing a single Query request.
If you do want to use filtering you will need to do multiple requests and unset the Limit field (i.e., use its default value). See this answer for further details on those subtleties.
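The reason filtering needs several requests is that `Limit` caps the number of items read before the filter is applied, and a page can come back empty while more data remains. A minimal sketch of the resulting loop follows; `query_page` stands for any callable that accepts Query keyword arguments and returns a response dictionary (for example a boto3 client's `query` method), and the function name is invented for this illustration.

```python
def latest_matching(query_page, params):
    """Return the first item that passes the filter, or None if there is none.

    Drops Limit and follows LastEvaluatedKey until a page contains an item.
    """
    params = dict(params)          # do not mutate the caller's dictionary
    params.pop("Limit", None)      # use the default page size, as described above
    while True:
        response = query_page(**params)
        if response.get("Items"):
            return response["Items"][0]
        last_key = response.get("LastEvaluatedKey")
        if last_key is None:
            return None            # no more pages and nothing matched
        params["ExclusiveStartKey"] = last_key
```

With `ScanIndexForward` set to false in `params`, the first item returned is the most recent one that satisfies the filter.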
I noticed that a DynamoDB query/scan returns only a subset of each document; it appears to be just the key columns.
This means I need to do a separate batch get to fetch the actual documents referenced by those keys.
I am not using a projection expression, and according to the documentation this means the whole item should be returned.
How do I get the query to return the entire document so I don't have to do a separate batch get?
The example code below shows this. It prints the documents it finds, yet they contain only the primary key, the secondary key, and the sort key.
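import json
import boto3

db = boto3.resource('dynamodb')
t1 = db.Table(tname)  # tname: the table name, defined elsewhere
q = {
    'IndexName': 'mysGSI',
    # secKey is the GSI partition key; sortKey is its sort key
    'KeyConditionExpression': "secKey = :val1 AND "
                              "begins_with(sortKey, :status)",
    'ExpressionAttributeValues': {
        ":val1": 'XXX',
        ":status": 'active-',
    }
}
res = t1.query(**q)
for doc in res['Items']:
    # default=str lets json.dumps handle the Decimal values boto3 returns for numbers
    print(json.dumps(doc, default=str))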
This situation is discussed in the documentation for the Select parameter. You have to read quite a lot to find this, which is not ideal.
If you query or scan a global secondary index, you can only request
attributes that are projected into the index. Global secondary index
queries cannot fetch attributes from the parent table.
Basically:
If you query the parent table then you get all attributes by default.
If you query an LSI then you get the projected attributes by default. You can also request attributes that are not projected into the index (for example by setting Select to ALL_ATTRIBUTES); DynamoDB then retrieves them from the base table, which will cost you more reads. If everything you request is already projected into the LSI, that costs nothing extra.
If you query or scan a GSI, you can only request attributes that are projected into the index. GSI queries cannot fetch attributes from the parent table.
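The practical consequence for the question is that the GSI has to project the attributes you want back. A minimal sketch of an UpdateTable request that creates a GSI projecting all attributes is shown below; it only builds the parameter dictionary, so it runs without AWS access. The index name `mysGSI-all` and the table name `mytable` are hypothetical, both key attributes are assumed to be strings, and `ProvisionedThroughput` is left out on the assumption that the table uses on-demand capacity.

```python
def create_gsi_request(table, index, hash_attr, range_attr, projection="ALL"):
    """Build UpdateTable parameters that add a GSI with the given projection type."""
    return {
        "TableName": table,
        # Key attributes must be declared; string type ("S") is an assumption here.
        "AttributeDefinitions": [
            {"AttributeName": hash_attr, "AttributeType": "S"},
            {"AttributeName": range_attr, "AttributeType": "S"},
        ],
        "GlobalSecondaryIndexUpdates": [
            {
                "Create": {
                    "IndexName": index,
                    "KeySchema": [
                        {"AttributeName": hash_attr, "KeyType": "HASH"},
                        {"AttributeName": range_attr, "KeyType": "RANGE"},
                    ],
                    # "ALL" copies every item attribute into the index.
                    "Projection": {"ProjectionType": projection},
                }
            }
        ],
    }


request = create_gsi_request("mytable", "mysGSI-all", "secKey", "sortKey")
# With boto3 (assumed): boto3.client("dynamodb").update_table(**request)
```

As far as I know, the projection of an existing GSI cannot be changed in place, so this would be a new index that replaces the old one. Projecting all attributes increases the index's storage and write cost, so projecting only the attributes the query needs may be the better trade-off.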
Hi, I have a DynamoDB table. I want the service to return all the items in this table, sorted by one attribute.
Do I need to create a global secondary index for this? If so, what should the hash key and the range key be?
(Note that a query on a GSI must specify an "EQ" comparator on the hash key of the GSI.)
If you know the HashKey, then any query will return the items sorted by Range key. From the documentation:
Query results are always sorted by the range key. If the data type of the range key is Number, the results are returned in numeric order. Otherwise, the results are returned in order of UTF-8 bytes. By default, the sort order is ascending. To reverse the order, set the ScanIndexForward parameter to false.
Now, if you need to return all the items, you should use a scan. You cannot order the results of a scan.
Another option is to use a GSI (example). Here, the GSI contains only a hash key. I guess the results will be sorted by this key (I haven't checked this in a program yet!).
As of now, a DynamoDB scan cannot return sorted results.
You need to use a query with a new global secondary index (GSI) with a hashkey and range field. The trick is to use a hashkey which is assigned the same value for all data in your table.
I recommend making a new field for all data and calling it "Status" and set the value to "OK", or something similar.
Then your query to get all the results sorted would look like this:
{
  TableName: "YourTable",
  IndexName: "Status-YourRange-index",
  KeyConditions: {
    Status: {
      ComparisonOperator: "EQ",
      AttributeValueList: [
        "OK"
      ]
    }
  },
  ScanIndexForward: false
}
The docs for how to write GSI queries are found here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.Querying
I solved this problem by creating a Global Secondary Index as shown below. I'm not sure it is the best approach, but I'm posting it in case it is useful to someone.
Hash Key | Range Key
------------------------------------
Date value of CreatedAt | CreatedAt
A limitation is imposed on the HTTP API user: they specify the number of days of data to retrieve, which defaults to 24 hours.
This way, I can always specify the hash key as the current date and use the > and < operators on the range key while retrieving. The data is also spread across multiple shards.
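A window of 24 hours usually touches two calendar days, so it takes one query per day bucket. The sketch below computes those buckets; it assumes the hash key is the ISO date (`YYYY-MM-DD`) of CreatedAt and that CreatedAt is stored as an ISO 8601 string, neither of which the answer states explicitly. The function name is invented for this example.

```python
from datetime import datetime, timedelta


def day_buckets(end, hours=24):
    """Return (date_key, lower, upper) tuples covering the `hours` before `end`.

    One tuple per calendar day; lower and upper are ISO 8601 strings that
    bound CreatedAt within that day.
    """
    start = end - timedelta(hours=hours)
    buckets = []
    day = start.date()
    while day <= end.date():
        day_start = datetime.combine(day, datetime.min.time())
        day_end = day_start + timedelta(days=1)
        lower = max(start, day_start)   # clip the window to this day
        upper = min(end, day_end)
        buckets.append((day.isoformat(), lower.isoformat(), upper.isoformat()))
        day += timedelta(days=1)
    return buckets


for bucket in day_buckets(datetime(2024, 1, 2, 6, 0, 0)):
    print(bucket)
```

Each tuple maps to one Query on the GSI: an equality on the date hash key plus a range condition on CreatedAt, with the per-day results concatenated afterwards.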