Sorting items by dates in DynamoDB - amazon-web-services

I come from a strict SQL background.
Now migrating to DynamoDB, I have a table full of items which I would like to sort by dates.
Here is how I do it:
I set up a secondary index Category-Date-Index. Category is Hash and Date is Range. All items I am sorting will have the same value for category.
The problem I now have is that many items have the same dates. This secondary index automatically drops items with the same Category-Date and keeps only one. This is not the behavior I desire.
What would be the right way to do this?
I would also appreciate pointers to a good reading on how to structure tables and indices in DynamoDB when considering these use cases.

How many items share the same date?
You can always append a suffix to your date, such as a random number in the range 0 to X (assuming your date is an integer, e.g. epoch time). Sorting is preserved only if your range key is a string and you always add the same number of digits.
for example:
original item = (hash, 1234567)
converted item = (hash, '1234567010')
original item2 = (hash, 1234567)
converted item2 = (hash, '1234567900')
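The conversion above can be sketched in Python as follows. This is only an illustration: the function names and the three-digit suffix width are assumptions, and it presumes every timestamp has the same number of digits so that string ordering matches numeric ordering.

```python
import random

def make_unique_range_key(epoch_seconds, suffix_digits=3, rng=random):
    """Append a zero-padded random suffix to an integer timestamp.

    Hypothetical helper: the suffix width must stay constant for all items,
    otherwise string comparison no longer follows the timestamp order.
    """
    suffix = rng.randrange(10 ** suffix_digits)
    return f"{epoch_seconds}{suffix:0{suffix_digits}d}"

def original_timestamp(range_key, suffix_digits=3):
    """Recover the integer timestamp by stripping the fixed-width suffix."""
    return int(range_key[:-suffix_digits])

key = make_unique_range_key(1234567)
print(key, original_timestamp(key))
```

Two items with the same timestamp can still draw the same suffix, so a conditional insert with a retry is still needed for a strict uniqueness guarantee.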
You can also use the 'overwrite' param (set to false) when inserting an item. If the insert fails because the key already exists, add a different extra number to your range key and retry.
you can find good guidelines here:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

Related

Querying a Global Secondary Index of a DynamoDB table without using the partition key

I have a DynamoDB table with partition key as userID and no sort key.
The table also has a timestamp attribute in each item. I wanted to retrieve all the items having a timestamp in the specified range (regardless of userID i.e. ranging across all partitions).
After reading the docs and searching Stack Overflow (here), I found that I need to create a GSI for my table.
Hence, I created a GSI with the following keys:
Partition Key: userID
Sort Key: timestamp
I am querying the index with Java SDK using the following code:
String lastWeekDateString = getLastWeekDateString();
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
DynamoDB dynamoDB = new DynamoDB(client);
Table table = dynamoDB.getTable("user table");
Index index = table.getIndex("userID-timestamp-index");
QuerySpec querySpec = new QuerySpec()
    .withKeyConditionExpression("timestamp > :v_timestampLowerBound")
    .withValueMap(new ValueMap()
        .withString(":v_timestampLowerBound", lastWeekDateString));
ItemCollection<QueryOutcome> items = index.query(querySpec);
Iterator<Item> iter = items.iterator();
while (iter.hasNext()) {
    Item item = iter.next();
    // extract item attributes here
}
I am getting the following error on executing this code:
Query condition missed key schema element: userID
From what I know, I should be able to query the GSI using only the sort key without giving any condition on the partition key. Please help me understand what is wrong with my implementation. Thanks.
Edit: After reading the thread here, it turns out that we cannot query a GSI with only a range on the sort key. So, what is the alternative, if any, to query the entire table by a range query on an attribute? One suggestion I found in that thread was to use year as the partition key. This will require multiple queries if the desired range spans multiple years. Also, this does not distribute the data uniformly across all partitions, since only the partition corresponding to the current year will be used for insertions for one full year. Please suggest any alternatives.
When using the DynamoDB Query operation, you must specify at least the partition key. This is why you get the error that userID is required. From the AWS Query docs:
The condition must perform an equality test on a single partition key value.
The only way to get items without the partition key is a Scan operation (but the results won't be sorted by your sort key!).
If you want to get all the items sorted, you have to create a GSI whose partition key is the same for all the items you need (e.g. create a new attribute on all items, such as "type": "item"). You can then query the GSI and specify #type = :item
QuerySpec querySpec = new QuerySpec()
    // # placeholders alias attribute names, which avoids clashes with DynamoDB reserved words
    .withKeyConditionExpression("#type = :item AND #ts > :v_timestampLowerBound")
    .withNameMap(new NameMap()
        .with("#type", "type")
        .with("#ts", "timestamp"))
    .withValueMap(new ValueMap()
        .withString(":v_timestampLowerBound", lastWeekDateString)
        .withString(":item", "item"));
A good solution for any customised querying requirement with DynamoDB is to design the right primary key scheme for a GSI.
When designing a DynamoDB primary key, the main principle is that the hash key should partition the whole set of items, and the sort key should order the items within a partition.
Having said that, I recommend using the year of the timestamp as the hash key and the month-date as the sort key.
In this case you need at most 2 queries, as long as the range spans no more than two calendar years.
you are right, you should avoid filtering or scanning as much as you can.
For example, if the start date and the end date fall in the same year, you need only one query. A key condition allows a single condition on the sort key, so the range is written with BETWEEN (inclusive), and the placeholder names contain no hyphens:
.withKeyConditionExpression("#year = :year AND #monthDate BETWEEN :startMonthDate AND :endMonthDate")
Otherwise, query the start year:
.withKeyConditionExpression("#year = :startYear AND #monthDate > :startMonthDate")
and the end year:
.withKeyConditionExpression("#year = :endYear AND #monthDate < :endMonthDate")
Finally, you should union the result sets from both queries.
This takes at most 2 Query requests; the read capacity consumed depends on how much data each query reads.
For easier comparison of the sort key, you might want to use a UNIX timestamp.
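Splitting a date range into one query per year could look roughly like the Python sketch below. The function name and the "MM-DD" sort key format are assumptions made for illustration; each returned tuple corresponds to one Query against the year partition.

```python
from datetime import date

def split_range_by_year(start, end):
    """Return one (year, first_month_date, last_month_date) tuple per calendar year.

    Hypothetical helper: assumes the sort key is stored as a zero-padded
    "MM-DD" string, so string comparison follows calendar order.
    """
    if start > end:
        raise ValueError("start must not be after end")
    parts = []
    for year in range(start.year, end.year + 1):
        # Clamp the range to the boundaries of the current year.
        low = start if year == start.year else date(year, 1, 1)
        high = end if year == end.year else date(year, 12, 31)
        parts.append((year, low.strftime("%m-%d"), high.strftime("%m-%d")))
    return parts

for part in split_range_by_year(date(2019, 11, 15), date(2020, 2, 10)):
    print(part)
```

The results of the per-year queries are then concatenated in year order to get the full sorted range.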
Thanks

Sqlite Query to remove duplicates from one column. Removal depends on the second column

Please have a look at the following data example:
In this table, I have multiple columns. There is no PRIMARY KEY, as per the image I attached, there are a few duplicates in STK_CODE. Depending on the (min) column, I want to remove duplicate rows.
According to the image, one stk_code has three different rows. Corresponding to these duplicate stk_codes, value in (min) column is different, I want to keep the row which has minimum value in (min) column.
I am very new at sqlite and I am dealing with (-lsqlite3) to join cpp with sqlite.
Is there any way possible?
Your table has rowid as primary key.
Use it to get the rowids that you don't want to delete:
DELETE FROM comparison
WHERE rowid NOT IN (
  SELECT rowid
  FROM comparison
  GROUP BY STK_CODE
  HAVING (COUNT(*) = 1 OR MIN(CASE WHEN min > 0 THEN min END))
)
This code uses rowid as a bare column and relies on a documented feature of SQLite: when a query uses the MIN() or MAX() aggregate function, the bare columns take their values from the row that contains the min or max value.
See a simplified demo.
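For experimenting with the idea, here is a small self-contained Python sketch using the standard sqlite3 module. Note that it swaps in a different technique: a correlated EXISTS subquery instead of the bare-column feature, and it does not reproduce the min > 0 special case of the query above. The two-column table and sample rows are made up for illustration.

```python
import sqlite3

def dedupe_keep_min(conn):
    """Delete every row for which a 'better' row with the same STK_CODE exists.

    A row is better if it has a smaller min value, or the same min value
    and a smaller rowid (so exactly one row per STK_CODE survives).
    """
    conn.execute(
        """
        DELETE FROM comparison
        WHERE EXISTS (
            SELECT 1
            FROM comparison AS c2
            WHERE c2.STK_CODE = comparison.STK_CODE
              AND (c2."min" < comparison."min"
                   OR (c2."min" = comparison."min"
                       AND c2.rowid < comparison.rowid))
        )
        """
    )
    conn.commit()

# Hypothetical sample data: three rows for code A, one row for code B.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE comparison (STK_CODE TEXT, "min" INTEGER)')
conn.executemany(
    "INSERT INTO comparison VALUES (?, ?)",
    [("A", 5), ("A", 3), ("A", 7), ("B", 2)],
)
dedupe_keep_min(conn)
rows = conn.execute(
    'SELECT STK_CODE, "min" FROM comparison ORDER BY STK_CODE'
).fetchall()
print(rows)
```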

Get latest 3 entries from DynamoDb

I have a dynamo-db table with following schema
{
"id": String [hash key]
"type": String [range key]
}
I have a usecase where I need to fetch last 3 rows for a given id when type is unknown.
Your items need a timestamp attribute. Without that they can’t be sorted or filtered by time. Once you have that, you can define a local secondary index with the id as partition key and the timestamp as the sort key. You can then get the top three items from the index.
Find more information about DynamoDb’s Local Secondary Index here.
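As a rough sketch, the request for the three newest items could be built like this in Python. The table name, index name and helper function are hypothetical; the dictionary keys are the standard Query parameters, with ScanIndexForward set to false for descending order and Limit set to 3.

```python
def latest_items_query(table_name, index_name, item_id, count=3):
    """Build Query parameters for the newest `count` items of one id.

    Assumes an index whose partition key is "id" and whose sort key is a
    timestamp attribute; both names are illustrative.
    """
    return {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "#id = :id",
        "ExpressionAttributeNames": {"#id": "id"},
        "ExpressionAttributeValues": {":id": {"S": item_id}},
        "ScanIndexForward": False,  # newest first
        "Limit": count,
    }

params = latest_items_query("events", "id-timestamp-index", "user-42")
print(params)
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("dynamodb").query(**params)`.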
Add a field to store the timestamp to the schema
Use query to fetch all the records for the given key
Query always returns records sorted by the range key; you cannot sort by another attribute without changing the table's schema, so sort the records by timestamp in your code
Get top 3 records
If you have a lot of records, use filter expressions to drop extra results. E.g. if you know that the latest records will always have a timestamp no older than an hour (day, week or so), you could filter out older records.
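The client-side part of these steps (sort by timestamp, take the top 3) might look like this in Python. The item shape and the "timestamp" attribute name are assumptions; the items stand in for whatever the query returned.

```python
import heapq
from operator import itemgetter

def latest_n(items, n=3, key="timestamp"):
    """Return the n items with the largest timestamp, newest first."""
    return heapq.nlargest(n, items, key=itemgetter(key))

# Hypothetical query results for one id.
items = [
    {"id": "a", "type": "w", "timestamp": 5},
    {"id": "a", "type": "x", "timestamp": 1},
    {"id": "a", "type": "y", "timestamp": 9},
    {"id": "a", "type": "z", "timestamp": 7},
]
top = latest_n(items)
print([item["timestamp"] for item in top])
```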

Query all items with primary key in a given range and secondary key matching a given value

I have a dynamoDB table that stores sentences. Each sentence has a primary key called 'id' (of type int) and other secondary keys for each word in the sentence.
For example, the entry "hello world" would have some integer as id and would have entries "hello"=1 and "world"=1. I need to query all sentences that have id within a given range and that contain a word from a list of given words (words = [word1, word2, word3, word4, word5]). The query I have so far is:
while items == []:
    response = lyric_table.scan(
        FilterExpression=(Key(words[0]).eq(1) |
                          Key(words[1]).eq(1) |
                          Key(words[2]).eq(1) |
                          Key(words[3]).eq(1) |
                          Key(words[4]).eq(1)) &
                         filt,
        ExclusiveStartKey={'id': r},)
    items = response['Items']
where
filt = Key('id').between(r1, r2) | Key('id').between(r3, r4) ...
I am also selecting the ExclusiveStartKey to be a random number chosen from r1, r3, ... in each iteration of the while loop, although I am not sure if this is necessary.
This code is working as expected when "words" contain words that are relatively common in the table, but is taking too much time to run for when "words" contains words that are not too common in the database. In some cases, the scan just runs indefinitely. I also tried using query instead of scan, but had no luck in improving the code with that.
Do you have any suggestions on how to optimize the above code?
The only way to efficiently carry out a range operation in DynamoDB is if the attribute is a sort key (or secondary key) on a table or index. If it's a partition key (or primary key), DynamoDB hashes it to distribute items randomly. This is by design, to allow for read/write scalability. A range operation on the partition key will involve a table scan, so it won't be efficient.
If I understand your question correctly, your data looks something like this:
Id Word SomeValue
101 Hello 1
101 World 2
If you know the complete range of your ids (say, 1-1000), one solution may be to bucket these ids, use the bucket key as the partition key and Id as the sort key:
BucketId Id
1 ..
1 ..
100 101
100 101
100 121
…
200
300
:
1000
And then for a range of (101, 320), you can do 3 queries on ids 100, 200 and 300 with the appropriate filter expressions. This will definitely be much more efficient than a table scan. As for words, not sure what your particular use case is, but if they are limited in number per id, they can be stored as a single map or set attribute.
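The bucket arithmetic can be sketched in Python as below. The function names and the bucket size of 100 are assumptions; buckets are labelled by their lower bound, so ids 101 and 121 both land in bucket 100.

```python
def bucket_id(item_id, bucket_size=100):
    """Map an id to the lower bound of its bucket (e.g. 121 -> 100)."""
    return (item_id // bucket_size) * bucket_size

def buckets_for_range(low, high, bucket_size=100):
    """List every bucket that must be queried to cover ids low..high."""
    first = bucket_id(low, bucket_size)
    last = bucket_id(high, bucket_size)
    return list(range(first, last + 1, bucket_size))

print(buckets_for_range(101, 320))
```

Each returned bucket would be queried separately with the id bounds applied to the sort key, and the results merged.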

Dynamodb scan in sorted order

Hi I have a dynamodb table. I want the service to return me all the items in this table and the order is by sorting on one attribute.
Do I need to create a global secondary index for this? If that is the case, what should be the hash key, what is the range key?
(Note that query on gsi must specify a "EQ" comparator on the hash key of GSI.)
Thanks a lot!
Erben
If you know the HashKey, then any query will return the items sorted by Range key. From the documentation:
Query results are always sorted by the range key. If the data type of the range key is Number, the results are returned in numeric order. Otherwise, the results are returned in order of UTF-8 bytes. By default, the sort order is ascending. To reverse the order, set the ScanIndexForward parameter to false.
Now, if you need to return all the items, you should use a scan. You cannot order the results of a scan.
Another option is to use a GSI (example). Here, you see that the GSI contains only HashKey. The results I guess will be in sorted order of this key (I didn't check this part in a program yet!).
As of now the dynamoDB scan cannot return you sorted results.
You need to use a query with a new global secondary index (GSI) with a hashkey and range field. The trick is to use a hashkey which is assigned the same value for all data in your table.
I recommend making a new field for all data and calling it "Status" and set the value to "OK", or something similar.
Then your query to get all the results sorted would look like this:
{
    TableName: "YourTable",
    IndexName: "Status-YourRange-index",
    KeyConditions: {
        Status: {
            ComparisonOperator: "EQ",
            AttributeValueList: [
                "OK"
            ]
        }
    },
    ScanIndexForward: false
}
The docs for how to write GSI queries are found here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.Querying
The approach I followed to solve this problem was to create a Global Secondary Index as below. I am not sure it is the best approach, but I am posting it in case it is useful to someone.
Hash Key | Range Key
------------------------------------
Date value of CreatedAt | CreatedAt
The HTTP API user is limited to specifying the number of days of data to retrieve, which defaults to 24 hours.
This way, I can always specify the hash key as the current date's day, and the range key can use the > and < operators while retrieving. The data is also spread across multiple shards.
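A sketch of how the hash key values for the requested number of days might be derived in Python. The ISO "YYYY-MM-DD" key format and the function name are assumptions, since the answer does not say how the date value is stored; one Query per returned key would then be issued against the index.

```python
from datetime import date, timedelta

def day_partition_keys(last_day, days=1):
    """Return one hash key per day, newest first, covering `days` days.

    Assumes the hash key is the ISO date of CreatedAt, e.g. "2020-03-01".
    """
    return [
        (last_day - timedelta(days=offset)).isoformat()
        for offset in range(days)
    ]

print(day_partition_keys(date(2020, 3, 1), days=3))
```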