DynamoDB: scan items where map contains a key

I have a table that contains a field (not a key field), called appsMap, and it looks like this:
appsMap = { "qa-app": "abc", "another-app": "xyz" }
I want to scan all rows whose appsMap contains the key "qa-app" (the value is not important, just the key). I tried something like this but it doesn't work in the way I need:
FilterExpression = '#appsMap.#app <> :v',
ExpressionAttributeNames = {
    "#app": "qa-app",
    "#appsMap": "appsMap"
},
ExpressionAttributeValues = {
    ":v": { "NULL": True }
},
ProjectionExpression = "deviceID"
What's the correct syntax?
Thanks.

There is a discussion on the subject here:
https://forums.aws.amazon.com/thread.jspa?threadID=164470
You might be missing this part from the example:
ExpressionAttributeValues: {":name":{"S":"Jeff"}}
However, I just wanted to echo what has already been said: Scan is an expensive operation that reads through every item in the table, and thus makes your database hard to scale.
Unlike with other databases, you have to do plenty of setup with DynamoDB to get it to perform at its best. Here is a suggestion:
1) Convert this into a root value; for example, add a top-level attribute qaExist with possible values of 0|1 or true|false.
2) Create a secondary index on the newly created attribute.
3) Query the new index, specifying 0 as the search parameter (a sketch follows below).
This will make your system very fast and very scalable regardless of how many records you add later on.
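As a rough sketch of step 3, assuming the flag attribute is named qaExist and the new index is named qaExist-index (both names are illustrative, as is the table name), the query might look like this with the Node.js DocumentClient:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

const params = {
    TableName: 'MyTable',                   // assumed table name
    IndexName: 'qaExist-index',             // assumed GSI on the qaExist attribute
    KeyConditionExpression: 'qaExist = :v',
    ExpressionAttributeValues: { ':v': 0 }, // the flag value you are searching for
    ProjectionExpression: 'deviceID'
};

docClient.query(params).promise()
    .then(data => console.log(data.Items))
    .catch(err => console.error(err));

A Query against the index reads only the matching items, so the cost stays proportional to the result set rather than to the table size.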

If I understand the question correctly, you can do the following:
FilterExpression = 'attribute_exists(#0.#1)',
ExpressionAttributeNames = {
    "#0": "appsMap",
    "#1": "qa-app"
},
ProjectionExpression = "deviceID"
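That expression drops straight into a scan call. A minimal Node.js sketch with the aws-sdk DocumentClient (the table name is assumed; the question's snippet is boto3, but the expression syntax is identical):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

const params = {
    TableName: 'Devices',                        // assumed table name
    FilterExpression: 'attribute_exists(#0.#1)', // items whose appsMap has a qa-app key
    ExpressionAttributeNames: {
        '#0': 'appsMap',
        '#1': 'qa-app'
    },
    ProjectionExpression: 'deviceID'
};

docClient.scan(params).promise()
    .then(data => console.log(data.Items))
    .catch(err => console.error(err));

With the DocumentClient there is no need for typed values like { "NULL": true }; plain JavaScript values are marshalled automatically.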

Since you're being a bit vague about your expectations and about what's actually happening ("I tried something like this but it doesn't work in the way I need"), I'd like to point out that a scan with a filter is very different from a query.
Filters are applied on the server, but only after the scan has read the items. The operation still iterates over all the data in your table; instead of returning every item it reads, it applies the filter to each page of results, saving you some network bandwidth but potentially returning empty pages as you page through your entire table.
You could look into creating a GSI on the table if this is a query you expect to run often.
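To make the paging behaviour concrete, here is a sketch of a full scan loop under the same assumptions as the snippet above; any individual page can legitimately come back with zero Items even though LastEvaluatedKey says there is more table left to read:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function scanAll(params) {
    const items = [];
    let lastKey;
    do {
        const page = await docClient
            .scan({ ...params, ExclusiveStartKey: lastKey })
            .promise();
        items.push(...page.Items);       // may be empty if the filter matched nothing on this page
        lastKey = page.LastEvaluatedKey; // undefined once the whole table has been read
    } while (lastKey);
    return items;
}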

Related

Scanning With sort_key in DynamoDB

I have a table that will contain < 1300 entries at about 600 bytes each. The goal is to display pages of results ordered by epoch date. Right now, for any given search I request the full list of ids using a filtered scan, then handle paging on the UI side. For each page, I pass a chunk of ids to retrieve the full entry (also currently a filtered scan). Ideally, the list of ids would return sorted, but if I understand the docs correctly, only results that have the same partition key are sorted. My current partition key is a uuid, so all entries are unique.
(Screenshot of the current table configuration omitted.)
Do I essentially need to use a throwaway key for the partition just to get results returned by date? Maybe the size of my table makes this unreasonable to begin with? Is there a better way to handle this? I have another field, "is_active" that's currently a boolean and could be used for the partition key if I converted it to numeric, but that might complicate my update method. 95% of the time, every entry in the db will be "active", so this doesn't seem efficient.
Scan Index
let params = {
    TableName: this.TABLE_NAME,
    IndexName: this.INDEX_NAME,
    ScanIndexForward: false,
    ProjectionExpression: "id",
    FilterExpression: filterSqlStatement,
    ExpressionAttributeValues: filterValues,
    ExpressionAttributeNames: {
        "#n": "name"
    }
};
let results = await this.DDB_CLIENT.scan(params).promise();
let finalizedResults = results ? results.Items : [];
Given that your dataset is relatively small, you might try a fixed partition key with a sort key composed of the date and the UUID. You'd query by the partition key (which would be a fixed value) and the results would come back sorted by date. This isn't the best idea with large data sets, but < 1300 items is not large.
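A sketch of what that could look like, assuming a GSI named date-index whose partition key is a constant attribute pk written onto every item and whose sort key is the epoch date (all names are illustrative):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

const params = {
    TableName: 'Entries',                           // assumed table name
    IndexName: 'date-index',                        // assumed GSI: pk (constant) + epochDate (sort key)
    KeyConditionExpression: 'pk = :fixed',
    ExpressionAttributeValues: { ':fixed': 'ALL' }, // every item carries this same literal value
    ScanIndexForward: false,                        // newest first
    Limit: 25,                                      // one UI page
    ProjectionExpression: 'id'
};

docClient.query(params).promise()
    .then(data => console.log(data.Items));

Note that ScanIndexForward is only honoured by Query, not Scan, so the snippet in the question cannot return sorted results; moving to Query is what makes the ordering work.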

Best method to extract data from dynamoDb and move it to another table

I have a table of 500 GB. I want to transfer the data to another table based on the timestamps.
There are several entries per item in the table, and I want only the latest entry of every item in the other table.
Considering the size of the table, can anyone recommend the best AWS service to get this done fast and easily?
I have come across AWS Glue and HiveCopyActivity. Are these the best solutions, or is there another service I can use?
(Assuming you can still add a Global Secondary Index (GSI) to that table, that is, you currently have fewer than 5 GSIs.)
Define a new GSI on your table. The GSI's partition key will be x. The GSI's sort key will be timestamp. Once you have that GSI defined, you can run a Query on that index with ScanIndexForward set to false to get the most recent item first. You need to supply the value of x you are interested in; in the following example request it is simply set to 'abc':
{
    "TableName": "<your-table-name>",
    "IndexName": "<your-GSI-name>",
    "KeyConditionExpression": "x = :argx",
    "ExpressionAttributeValues": {
        ":argx": {"S": "abc"}
    },
    "ScanIndexForward": false,
    "Limit": 1
}
This query looks at items with a given x value (as set in the ExpressionAttributeValues field), sorted in descending order (by the GSI's sort key, which is the timestamp field), and picks the first one (Limit is set to 1). As long as you do not need filtering (the FilterExpression field is empty), you will get the result you need by issuing a single Query request.
If you do want to use filtering, you will need to issue multiple requests and unset the Limit field (i.e., use its default value). See this answer for further details on those subtleties.
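A sketch of that filtered variant under the same assumptions as the request above: because the filter runs after each page is read, you drop the Limit and page until the first match, which, thanks to the descending sort, is the most recent matching item. The filter expression here is purely illustrative:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function latestMatching(argx) {
    let lastKey;
    do {
        const page = await docClient.query({
            TableName: '<your-table-name>',
            IndexName: '<your-GSI-name>',
            KeyConditionExpression: 'x = :argx',
            ExpressionAttributeValues: { ':argx': argx },
            FilterExpression: 'attribute_exists(someField)', // illustrative filter
            ScanIndexForward: false,                         // newest first
            ExclusiveStartKey: lastKey                       // no Limit: let each page fill up
        }).promise();
        if (page.Items.length > 0) return page.Items[0];     // first survivor = most recent match
        lastKey = page.LastEvaluatedKey;
    } while (lastKey);
    return null; // nothing matched the filter
}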

DynamoDB QuerySpec {MaxResultSize + filter expression}

From the DynamoDB documentation
The Query operation allows you to limit the number of items that it returns in the result. To do this, set the Limit parameter to the maximum number of items that you want.
For example, suppose you Query a table, with a Limit value of 6, and without a filter expression. The Query result will contain the first six items from the table that match the key condition expression from the request.
Now suppose you add a filter expression to the Query. In this case, DynamoDB will apply the filter expression to the six items that were returned, discarding those that do not match. The final Query result will contain 6 items or fewer, depending on the number of items that were filtered.
Looks like the following query should return (at least sometimes) 0 records.
In summary, I have a UserLogins table. A simplified version is:
1. UserId - HashKey
2. DeviceId - RangeKey
3. ActiveLogin - Boolean
4. TimeToLive - ...
Now, let's say UserId = X has 10,000 inactive logins in different DeviceIds and 1 active login.
However, when I run this query against my DynamoDB table:
QuerySpec{
    hashKey: null,
    rangeKeyCondition: null,
    queryFilters: null,
    nameMap: {"#0" -> "UserId"}, {"#1" -> "ActiveLogin"}
    valueMap: {":0" -> "X"}, {":1" -> "true"}
    exclusiveStartKey: null,
    maxPageSize: null,
    maxResultSize: 10,
    req: {
        TableName: UserLogins,
        ConsistentRead: true,
        ReturnConsumedCapacity: TOTAL,
        FilterExpression: #1 = :1,
        KeyConditionExpression: #0 = :0,
        ExpressionAttributeNames: {#0=UserId, #1=ActiveLogin},
        ExpressionAttributeValues: {:0={S: X,}, :1={BOOL: true}}
    }
}
I always get 1 row. The 1 active login for UserId=X. And it's not happening just for 1 user, it's happening for multiple users in a similar situation.
Are my results contradicting the DynamoDB documentation?
It looks like a contradiction: if maxResultSize=10, DynamoDB should read only the first 10 items (out of 10,001) and then apply the filter active=true to just those, which might well return 0 results. It seems very unlikely that the one record with active=true happened to be among the first 10 records DynamoDB read.
This is happening for hundreds of customers that are running similar queries. It works great, when according to the documentation it shouldn't be working.
I can't see any obvious problem with the Query. Are you sure about your premise that users have 10,000 items each?
Your keys are UserId and DeviceId. That means that if a user logs in again with the same device, the existing item is overwritten. Put another way, for your premise to hold, your users would each need 10,000 different devices (unless the DeviceId rotates in some way).
In your shoes I would just remove the FilterExpression and print the results to the log to see what you're getting in your 10 results. Then remove the Limit too and see what results you get with that.
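In DocumentClient terms those two debugging steps might look like this (table, key, and attribute names taken from the question; docClient is an AWS.DynamoDB.DocumentClient instance):

// Step 1: same query, no filter - what are the first 10 items for this user?
const base = {
    TableName: 'UserLogins',
    KeyConditionExpression: '#0 = :0',
    ExpressionAttributeNames: { '#0': 'UserId' },
    ExpressionAttributeValues: { ':0': 'X' },
    ConsistentRead: true,
    Limit: 10
};
docClient.query(base).promise().then(d => console.log(d.Items));

// Step 2: drop the Limit as well and compare Count with ScannedCount
// (paginate via LastEvaluatedKey if the user really has 10,000+ items).
const { Limit, ...noLimit } = base;
docClient.query(noLimit).promise()
    .then(d => console.log(d.Count, d.ScannedCount));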

How to use MapReduce when extracting a group of document id's by some criteria from CouchDB

I'm in my first week of CouchDB experimentation and trying to stop thinking in SQL. I have a collection of documents (5000 event files) that all have some ID value that will be common to groups of documents. So there might be 10 that all have TheID: 'foobar'.
(In case someone asks - TheID is not an auto-increment value from a relational database - it is a unique id assigned by a partner company of ours. I cannot redesign my source data to identify itself some other way, I have to use this TheID field to recognise groups of documents.)
I want to query my list of documents:
{ _id: 'document1', Message: { TheID: 'foobar' } }
{ _id: 'document2', Message: { TheID: 'xyz' } }
{ _id: 'document3', Message: { TheID: 'xyz' } }
{ _id: 'document4', Message: { TheID: 'foobar' } }
{ _id: 'document5', Message: { TheID: 'wibble' } }
{ _id: 'document6', Message: { TheID: 'foobar' } }
I want the results:
'foobar': [ 'document1', 'document4', 'document6' ]
'xyz': [ 'document2', 'document3' ]
'wibble': [ 'document5' ]
The aim is to represent groups of documents on our UI grouped by TheID, so the user can see all documents for a specific TheID together, and select that TheID to drill into the data querying just by that TheID value. Yes, the string id of each document is useful - in our case, the _id value of each document is the source event identifier, so it is a unique and useful value that the user is going to want to see in the list on screen.
In SQL one might order by or group by the TheID field and iterate the result set appropriately. I doubt this thinking is any use at all with a CouchDB query.
I know that I can use a map function to extract the TheID value for each document, for example:
function (doc) {
    emit(doc.Message.TheID, 1);
}
or perhaps
function (doc) {
    emit(doc._id, doc.Message.TheID);
}
I'm not sure exactly what I should emit as the key and value. Even if this is useful, I'm getting the feeling that I should not use a reduce function to try to 'reduce' the large map output (1 result row per document in the database) to what I want (3 results each with a list of document id's).
http://guide.couchdb.org/draft/views.html says "A common mistake new CouchDB users make is attempting to construct complex aggregate values with a reduce function. Full reductions should result in a scalar value, like 5, and not, for instance, a JSON hash with a set of unique keys and the count of each."
I thought I might be able to use reduce to scan the results of the map and somehow collect all results that have a common TheID value into a single result object. What I see when reading the reduce documentation is that it will be given arrays of keys and values that contain fairly unpredictable collections, driven by the structure of the btree underlying the map results. It won't be given arrays guaranteed to contain all similar TheID values that I could scan for. This approach seems completely broken.
So, is a map/reduce pair the right thing to do here? Should I look at using a 'show' or 'list' instead? I'm intending to build a mustache based HTML template engine around the results, so 'list' seems the wrong way to go.
Thanks in advance for any guidance.
EDIT I have done some local dev and come up with what I think is a broken solution. Hopefully this will show you the direction I'm trying to go in. See a public cloud based CouchDB I created at https://neek.iriscouch.com/_utils/database.html?test/_design/test/_view/collectByTheID
This is public. If you would like to play, please copy it to a new view, don't pollute this one in case others come in and want to see the original.
map function:
function(doc) {
    emit(doc.Message.TheID, doc._id);
}
reduce function:
function(keys, values, rereduce) {
    if (!rereduce) {
        return values;
    } else {
        var ret = [];
        values.forEach(function (ar) {
            ret.concat(ar);
        });
        return ret;
    }
}
Results:
"foobar" ["document6", "document4", "document1"]
"wibble" ["document5"]
"xyz" ["document3", "document2"]
The reduce function first leaves the array of values alone, and on the second pass concatenates them together. However, when I run this on my larger 5000+ document database, it comes up with some TheID values that have empty document id arrays. I believe this suffers from the problem I mentioned before: the arrays of values passed to reduce are built depending on the btree structure of the map they are extracted from, and are not guaranteed to contain a complete set of values for a given key.
Make use of the group_level feature:
Map:
emit([doc.Message.TheID, doc._id], null)
Reduce:
You must include a reduce function to use group_level; it can be trivial, as below, or something else, e.g. the built-in _count:
function(keys, values) {
    return null;
}
A query with group_level=1 would return:
/_design/d/_view/v?group_level=1
[
    {key: ["foobar"], value: null},
    {key: ["xyz"], value: null},
    {key: ["wibble"], value: null}
]
You would use this query to populate the top level in your grouping UI. When the user expands a category, you would do another query with group_level 2 and start and end keys:
/_design/d/_view/v?group_level=2&startkey=["foobar"]&endkey=["foobar",{}]
[
    {key: ["foobar", "document6"], value: null},
    {key: ["foobar", "document4"], value: null},
    {key: ["foobar", "document1"], value: null}
]
This doesn't produce the output exactly as you requested; however, I think you'll find it flexible enough.
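For completeness, a sketch of driving those two queries from Node against CouchDB's HTTP view API (the database name events and the design document and view names are illustrative; startkey/endkey must be URL-encoded JSON):

// Node 18+: global fetch is available.
const base = 'http://localhost:5984/events/_design/d/_view/v';

(async () => {
    // Top level of the grouping UI: one row per TheID.
    const top = await (await fetch(`${base}?group_level=1`)).json();
    console.log(top.rows.map(r => r.key[0])); // ['foobar', 'wibble', 'xyz']

    // Expanding one category: every document id under a given TheID.
    const qs = 'group_level=2' +
        `&startkey=${encodeURIComponent('["foobar"]')}` +
        `&endkey=${encodeURIComponent('["foobar",{}]')}`;
    const docs = await (await fetch(`${base}?${qs}`)).json();
    console.log(docs.rows.map(r => r.key[1])); // the document ids
})();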

DynamoDB: How to perform conditional write to enforce unique Hash + Range key

I am using DynamoDB to store events.
They are stored in one event table with a hash key 'Source ID' and a range key 'version'. Every time a new event occurs for a source, I want to add a new item with the source ID and an incremented version number.
Is it possible to specify a conditional write so that a duplicate item (same hash key and same range key) can never exist? And if so, how would you do this?
I have done this successfully for tables with just a Hash Key:
Map<String, ExpectedAttributeValue> expected = new HashMap<String, ExpectedAttributeValue>();
expected.put("key", new ExpectedAttributeValue().withExists(false));
But not sure how to handle hash + range keys....
I don't know the Java SDK well, but you can specify "Exists=false" on both the range key and the hash key.
Maybe a better idea would be to use a timestamp instead of a version number? Otherwise, there are also techniques to generate unique ids.
I was trying to enforce a unique combination of hash and range keys and came across this post. I found that it didn't completely answer my question but certainly pointed me in the right direction. This is an attempt to tidy up the loose ends.
It seems that DynamoDB actually enforces a unique combination of hash and range key by design. I quote
"All items in the table must have a value for the primary key attribute and Amazon DynamoDB ensures that the value for that name is unique"
from http://aws.amazon.com/dynamodb/ under the section with the heading Primary Key.
In my own tests using putItem with the aws-sdk for nodejs I was able to post two identical items without generating an error. When I checked the database, only one item was actually inserted. It seems that the second call to putItem with the same hash and range key combination is treated like an update to the original item.
I too received the error "Cannot expect an attribute to have a specified value while expecting it to not exist" when I tried to set the Exists=false option on the hash key and range key while also supplying their values. To resolve this error, I removed the values from the expected hash and range keys, and it started to generate a validation error when I tried to insert the same key twice.
So, my insert command looks like this (it will be different for Java, but hopefully you get the idea):
{
    "TableName": "MyTableName",
    "Item": {
        "HashKeyFieldName": {
            "S": HashKeyValue
        },
        "RangeKeyFieldName": {
            "N": currentTime.getTime().toString()
        },
        "OtherField": {
            "N": "61404032632"
        }
    },
    "Expected": {
        "HashKeyFieldName": { "Exists": false },
        "RangeKeyFieldName": { "Exists": false }
    }
}
Whereas originally I was trying to do a conditional insert checking whether there was a hash value and a range value matching what I was about to insert, now I just need to check whether the HashField and RangeField exist at all on the item with that key. If they exist, it means I am updating an existing item rather than inserting a new one.
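On current SDKs the same guard is usually written with a ConditionExpression instead of the legacy Expected map. A Node.js sketch (field names borrowed from the JSON above; key values are illustrative). Checking attribute_not_exists on the hash key alone is sufficient, because the condition is evaluated against the stored item with the exact hash+range key being written, so it fails only when that full key combination already exists:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

const params = {
    TableName: 'MyTableName',
    Item: {
        HashKeyFieldName: 'source-123', // illustrative key values
        RangeKeyFieldName: Date.now(),
        OtherField: 61404032632
    },
    // Rejects the write if an item with this exact hash+range key exists.
    ConditionExpression: 'attribute_not_exists(HashKeyFieldName)'
};

docClient.put(params).promise()
    .then(() => console.log('inserted'))
    .catch(err => {
        if (err.code === 'ConditionalCheckFailedException') {
            console.log('duplicate hash+range key - item already exists');
        } else {
            throw err;
        }
    });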