I'm using DynamoDB and I need to update a specific attribute on multiple records. Writing my requirement in pseudo-language I would like to do an update that says "update table persons set relationshipStatus = 'married' where personKey IN (key1, key2, key3, ...)" (assuming that personKey is the KEY in my DynamoDB table).
In other words, I want to do an update with an IN-clause, or I suppose one could call it a batch update. I have found this link that asks explicitly if an operation like a batch update exists and the answer there is that it does not. It does not mention IN-clauses, however. The documentation shows that IN-clauses are supported in ConditionalExpressions (100 values can be supplied at a time). However, I am not sure if such an IN-clause is suitable for my situation because I still need to supply a mandatory KEY attribute (which expects a single value it seems - I might be wrong) and I am worried that it will do a full table scan for each update.
So my question is: how do I achieve an update on multiple DynamoDB records at the same time? At the moment it almost looks like I will have to call an update statement for each Key one-by-one and that just feels really wrong...
As you noted, DynamoDB does not support a batch update operation. You would need to query for, and obtain the keys for all the records you want to update. Then loop through that list, updating each item one at a time.
You can use the TransactWriteItems action to update multiple records in a DynamoDB table.
The official documentation is available here; you can also find a TransactWriteItems JavaScript/Node.js example here.
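For reference, a TransactWriteItems call with the JavaScript DocumentClient could look roughly like this (a minimal sketch using the 'persons' table, 'personKey' key and 'relationshipStatus' attribute from the question; error handling kept minimal):

// Sketch: update several items in one atomic transaction (assumes AWS SDK v2).
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

const personKeys = ['key1', 'key2', 'key3'];   // per-call limit: 25 actions originally, 100 in newer API versions

const params = {
  TransactItems: personKeys.map(personKey => ({
    Update: {
      TableName: 'persons',
      Key: { personKey },
      UpdateExpression: 'SET relationshipStatus = :status',
      ExpressionAttributeValues: { ':status': 'married' }
    }
  }))
};

docClient.transactWrite(params, (err, data) => {
  if (err) console.error(err);            // the whole transaction succeeds or fails together
  else console.log('all updates applied');
});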
I don't know if it has changed since the answer was given, but it's possible now.
See the docs:
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
I have used it like this in JavaScript (mapping the new blocks to an array of objects with the wanted structure):
// Build the BatchWriteItem request (requires lodash as _ and an initialized DocumentClient).
let tableName = 'Blocks';
let params = { RequestItems: {} };

params.RequestItems[tableName] = _.map(newBlocks, block => {
  return {
    PutRequest: {
      Item: {
        'org_id': orgId,
        'block_id': block.block_id,
        'block_text': block.block_text
      }
      // Note: BatchWriteItem does not accept ConditionExpression on a PutRequest,
      // so per-item conditional checks can't be expressed here.
    }
  };
});

docClient.batchWrite(params, function(err, data) {
  // ... and do stuff with the result
});
You can even mix puts and deletes
And if you're using dynogels, you can't mix puts and deletes (dynogels doesn't support that), but for updating you can use create, because behind the scenes it is cast to the batchWrite function as puts:
var item1 = {email: 'foo1@example.com', name: 'Foo 1', age: 10};
var item2 = {email: 'foo2@example.com', name: 'Foo 2', age: 20};
var item3 = {email: 'foo3@example.com', name: 'Foo 3', age: 30};

Account.create([item1, item2, item3], function (err, accounts) {
  console.log('created 3 accounts in DynamoDB', accounts);
});
Note this DynamoDB limitation (from the docs):
The BatchWriteItem operation puts or deletes multiple items in one or more tables. A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB.
If I remember correctly, dynogels chunks the requests into chunks of 25 before sending them off and then collects the results in one promise (though I'm not 100% certain of this); otherwise a wrapper function would be pretty simple to assemble.
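Such a wrapper could look roughly like this (a minimal sketch reusing docClient from the snippet above; it fires the batches in parallel and ignores UnprocessedItems, which a production version should retry):

// Sketch: split write requests into batches of 25 and send them with batchWrite.
const chunk = (arr, size) =>
  arr.reduce((acc, item, i) => (i % size ? acc : [...acc, arr.slice(i, i + size)]), []);

function batchWriteAll(tableName, writeRequests) {
  const batches = chunk(writeRequests, 25);   // BatchWriteItem limit per call
  return Promise.all(
    batches.map(batch =>
      docClient.batchWrite({ RequestItems: { [tableName]: batch } }).promise()
    )
  );
}

// usage: batchWriteAll(tableName, params.RequestItems[tableName]).then(...).catch(...)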
DynamoDB is not a relational database and was not designed around native transactions. It is better to design the schema so that you avoid the need for multiple simultaneous updates in the first place; if that is not practical in your case, keep in mind that you may be able to improve this when you restructure the design.
The only way to update multiple items at the same time is the TransactWriteItems operation provided by DynamoDB. But it comes with limitations (for example, at most 25 items per transaction), so you will probably need to enforce some limits in your application as well. Despite being fairly costly (the implementation involves a consensus algorithm), it is still much faster than a simple loop, and it gives you the ACID properties, which are probably what you need most. Think about a loop-based approach: if one of the updates fails, how do you deal with the failure? Is it possible to roll back all changes without causing a race condition? Are the updates idempotent? It really depends on the nature of your application, of course. Be careful.
Another option is to use a thread pool for the network I/O, which can definitely save a lot of time, but it has the same failure-and-rollback issues to think about.
I want to truncate a DynamoDB table which can have 3 to 4 million records. What is the best way?
Right now I am using a scan, which does not give good performance (I have tried it while deleting only a few records: 3):
DynamoDB dynamoDB = new DynamoDB(amazonDynamoDBClient);
Table table = dynamoDB.getTable("table-test");
ItemCollection<ScanOutcome> resultItems = table.scan();
Iterator<Item> itemsItr = resultItems.iterator();

while (itemsItr.hasNext()) {
    Item item = itemsItr.next();
    String itemPk = (String) item.get("PK");
    String itemSk = (String) item.get("SK");
    DeleteItemSpec deleteItemSpec = new DeleteItemSpec().withPrimaryKey("PK", itemPk, "SK", itemSk);
    table.deleteItem(deleteItemSpec);
}
The best way is to delete your table, and create new one of the same name. This is how clearing all data from DynamoDB is usually performed.
As Marcin already answered, the best way is to delete your table and create a new one. It is certainly the cheapest way - because any other way would require scanning the entire table and paying for the read capacity units required to do it.
In some cases, however, you might want to delete old items while the table is still actively used. In that case you can use a Scan like you wanted, but you can do it much more efficiently than you did:
First, don't run individual DeleteItem requests sequentially, waiting for one delete to complete before sending the next one. You can send batches of 25 deletes in one BatchWriteItem request, and you can also send multiple BatchWriteItem requests in parallel. Finally, for even faster deletion, you can parallelize your Scan across multiple threads or even machines - see the parallel scan section of the DynamoDB documentation.
Just don't forget that if you delete items while the table is still being written to, you need a way to tell the old items you want to delete apart from the new items you don't want to delete, as the scan may start returning those new items as well.
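For reference, a paginated scan-and-batch-delete could look roughly like this in JavaScript with the DocumentClient (a sketch only; it reuses the PK/SK key names from the question and, for brevity, does not parallelize the scan or retry UnprocessedItems):

// Sketch: page through the table, deleting items in batches of 25.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function deleteAllItems(tableName) {
  let lastKey;
  do {
    const page = await docClient.scan({
      TableName: tableName,
      ProjectionExpression: 'PK, SK',      // only read the key attributes
      ExclusiveStartKey: lastKey
    }).promise();

    for (let i = 0; i < page.Items.length; i += 25) {
      const batch = page.Items.slice(i, i + 25).map(item => ({
        DeleteRequest: { Key: { PK: item.PK, SK: item.SK } }
      }));
      await docClient.batchWrite({ RequestItems: { [tableName]: batch } }).promise();
    }

    lastKey = page.LastEvaluatedKey;
  } while (lastKey);
}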
Finally, if you find yourself often clearing old data from a table, you should consider whether you can use DynamoDB's TTL feature, where DynamoDB automatically looks for expired items (based on an expiration-time attribute on each item) and deletes them, at no cost to you.
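Enabling TTL is a one-time configuration call; with the low-level client it could look roughly like this (a sketch; the 'expiresAt' attribute name is an assumption, and each item must then store its expiry as an epoch-seconds number under that attribute):

// Sketch: tell DynamoDB which attribute holds the expiration time.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

dynamodb.updateTimeToLive({
  TableName: 'table-test',
  TimeToLiveSpecification: { AttributeName: 'expiresAt', Enabled: true }
}, (err, data) => {
  if (err) console.error(err);
  else console.log('TTL enabled', data);
});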
I want to add an item only if it does not exist. I am not sure how to do it. Currently I am adding successfully without checking the condition (it adds regardless of whether the item exists). The code is:
const params = {
  TableName: MY_TABLE,
  Item: myItem
};

documentClient.put(params).promise()
  .catch(err => { Logger.error(`DB Error: put in table failed: ${err}`) });
What do I need to add in order to make the code check if the item exists and if it does, just return?
Note: I do not want to use the database mapper. I want the code to be written using the AWS.DynamoDB class.
DynamoDB supports conditional writes, allowing you to define a check that needs to succeed for the item to be inserted/updated.
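Applied to the put from the question, a conditional write could look roughly like this (a sketch; it assumes the table's partition key attribute is called 'id' - use your actual key attribute there):

// Sketch: only insert if no item with this key exists yet.
const params = {
  TableName: MY_TABLE,
  Item: myItem,
  ConditionExpression: 'attribute_not_exists(id)'   // 'id' stands in for your partition key attribute
};

documentClient.put(params).promise()
  .catch(err => {
    if (err.code === 'ConditionalCheckFailedException') {
      return;   // the item already exists, just return as requested
    }
    Logger.error(`DB Error: put in table failed: ${err}`);
  });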
DynamoDB is not an SQL database (hopefully you know this...) and does not offer the full set of ACID guarantees. One of the things you can't do with it is an atomic "check and write" - all you can do is fetch the item, see if it exists and then put the item. However, if some other process writes the item to the table between your "get" and your "write", you won't know anything about it.
If you absolutely need this kind of behaviour, DynamoDB is not the right solution.
I have an item (a number) in a DynamoDB table. This value is read by a service, incremented and updated back to the table. There are multiple machines with multiple threads doing this simultaneously.
My problem here is to be able to read the correct consistent value, and update with the correct value.
I tried doing the increment and update in a java synchronized block.
However, I still noticed inconsistencies in the count in the end. It doesn't seem to be updating in a consistent manner.
"My problem here is to be able to read the correct consistent value, and update with the correct value."
To read/write the correct consistent value
Read consistency in DynamoDB (you can set it in your query via the ConsistentRead parameter):
There are two types of reads:
Eventually consistent read: if you read data right after changes to the table, the result might be stale and you may have to wait a bit for it to become consistent.
Strongly consistent read: it returns the most up-to-date data, so you don't have to worry about stale data.
ConditionExpression (specified in your query):
In your query you can specify that the value should be updated only if some condition is true (for example, the current value in the DB is the same as the value you read before, meaning no one updated it in between); otherwise DynamoDB returns a ConditionalCheckFailedException and you need to handle it in your code and retry.
So, to answer your question: first you need to do a strongly consistent read to get the current counter value from the DB. Then, to update it, your query should look like this (unnecessary parameters removed), and you should handle ConditionalCheckFailedException in your code:
"TableName": "counters",
"ReturnValues": "UPDATED_NEW",
"ExpressionAttributeValues": {
":a": currentValue,
":bb": newValue
},
"ExpressionAttributeNames": {
"#currentValue": "currentValue"
},
**// current value is what you ve read
// by Strongly Consistent **
"ConditionExpression": "(#currentValue = :a)",
"UpdateExpression": "SET #currentValue = :bb", // new counter value
With every record, store a uuid (long random string) sort of value. Whenever you try to update the record, send an update request that only succeeds if the stored uuid equals the value you read, and set a new uuid value as part of the update.
A synchronized block will not work if you are trying to write from multiple machines at the same time.
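A sketch of that idea with the DocumentClient (the 'version' attribute, the 'counters' table/key names and Node's crypto.randomUUID are illustrative assumptions):

// Sketch: optimistic locking with a random token per record.
const { randomUUID } = require('crypto');   // Node 14.17+

async function conditionalUpdate(docClient, counterId, versionReadEarlier, newValue) {
  return docClient.update({
    TableName: 'counters',
    Key: { counterId },
    ConditionExpression: '#ver = :expected',                  // fails if someone updated since our read
    UpdateExpression: 'SET currentValue = :v, #ver = :newVer',
    ExpressionAttributeNames: { '#ver': 'version' },
    ExpressionAttributeValues: {
      ':expected': versionReadEarlier,                        // uuid read together with the record
      ':v': newValue,
      ':newVer': randomUUID()                                 // rotate the token on every write
    }
  }).promise();
}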
I have the following table structure:
ID string `dynamodbav:"id,omitempty"`
Type string `dynamodbav:"type,omitempty"`
Value string `dynamodbav:"value,omitempty"`
Token string `dynamodbav:"token,omitempty"`
Status int `dynamodbav:"status,omitempty"`
ActionID string `dynamodbav:"action_id,omitempty"`
CreatedAt time.Time `dynamodbav:"created_at,omitempty"`
UpdatedAt time.Time `dynamodbav:"updated_at,omitempty"`
ValidationToken string `dynamodbav:"validation_token,omitempty"`
and I have 2 Global Secondary Indexes, for the Value field (ValueIndex) and the Token field (TokenIndex). Later, somewhere in the internal logic, I perform an update of this entity and an immediate read of it by one of these indexes (ValueIndex or TokenIndex), and I see the expected problem that the data is not ready (I mean not yet updated). I can't use ConsistentRead for these cases, because they are Global Secondary Indexes and they don't support that option. As a result I can't run my load tests over this logic, because the data is not ready when the tests run in 10-20-30 threads. So my question: is it possible to solve this problem somehow? Or should I reorganize my table and split it into 2-3 different tables, moving fields like Value and Token to the HASH key or SORT key?
GSIs are updated asynchronously from the table they are indexing. The updates to a GSI typically occur in well under a second, but if you're after an immediate read of a GSI after an insert/update/delete, there is the potential to get stale data. This is how GSIs work - nothing you can do about that. However, you need to be really mindful of three things:
Make sure you keep your GSI lean - that is, only project the absolute minimum attributes that you need. Less data to write will make it quicker. (This point and the next show up in the table-definition sketch after this list.)
Ensure that your GSIs have the correct provisioned throughput. If they don't, they may not be able to keep up with activity in the table and you'll see long delays in the GSI being kept in sync.
If an update causes the keys in the GSI to be updated, you'll need 2 units of throughput provisioned per update. In essence, DynamoDB will delete the item then insert a new item with the keys updated. So, even though your table has 100 provisioned writes, if every single write causes an update to your GSI key, you'll need to provision 200 write units.
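To make the first two points concrete, here is roughly what a lean, provisioned GSI looks like in a table definition (a sketch with the low-level client; the table name, key names and capacity numbers are illustrative, with TokenIndex taken from the question):

// Sketch: a GSI that projects only the keys and has its own provisioned throughput.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

dynamodb.createTable({
  TableName: 'entities',
  AttributeDefinitions: [
    { AttributeName: 'id', AttributeType: 'S' },
    { AttributeName: 'token', AttributeType: 'S' }
  ],
  KeySchema: [{ AttributeName: 'id', KeyType: 'HASH' }],
  ProvisionedThroughput: { ReadCapacityUnits: 100, WriteCapacityUnits: 100 },
  GlobalSecondaryIndexes: [{
    IndexName: 'TokenIndex',
    KeySchema: [{ AttributeName: 'token', KeyType: 'HASH' }],
    Projection: { ProjectionType: 'KEYS_ONLY' },   // keep the index lean
    // if writes often change the 'token' value, budget roughly double the write capacity here
    ProvisionedThroughput: { ReadCapacityUnits: 100, WriteCapacityUnits: 200 }
  }]
}, (err, data) => {
  if (err) console.error(err);
});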
Once you've tuned your DynamoDB setup and you still absolutely cannot handle the brief delay in GSIs, you'll probably need to use a different technology. For example, even if you decided to split your table into multiple tables, it would have the same (if not worse) impact: you'll update one table, then try to read the data from another table before the values have been written there.
I suspect that once you tune DynamoDB for your situation, you'll get pretty damn close to what you want.
I'm building an app where two users can connect with each other and I need to store that connection (e.g. a friendship) in a DynamoDB table. Basically, the connection table has two fields:
userIdA (hash key)
userIdB (sort key)
I was thinking of adding an index on userIdB to query on both fields. Should I store a connection with one record (ALICE, BOB) or two records (ALICE, BOB; BOB, ALICE)? The first option needs one write operation and less space, but I have to query twice to get all connections of a user. The second option needs two write operations and more space, but I only have to query once for the userId.
The user table has details like name and email:
userId (hash key)
name (sort key)
email
In my app, I want to show all connections of a certain user with user details in a listview. That means I have two options:
Store the user details of the connected users also in the connection table, e.g. add two name fields to that table. This is fast, but if the user name changes (name and email are retrieved from Facebook), the details are invalid and I need to update all entries.
Query the user details of each userId with a Batch Get request to read multiple items. This may be slower, but I always have up-to-date user details and don't need to store them in the connection table. (A sketch of such a request is shown below.)
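For reference, option 2 with the DocumentClient could look roughly like this (a sketch; note that BatchGetItem needs the full primary key, so this assumes the Users table can be addressed by userId alone - with the composite key above you would also have to supply the name sort key for each user):

// Sketch: fetch the details of several connected users in one batch (up to 100 keys per call).
const params = {
  RequestItems: {
    Users: {
      Keys: connectedUserIds.map(userId => ({ userId })),
      ProjectionExpression: 'userId, #n, email',
      ExpressionAttributeNames: { '#n': 'name' }   // 'name' is a DynamoDB reserved word
    }
  }
};

docClient.batchGet(params, (err, data) => {
  if (err) console.error(err);
  else console.log(data.Responses.Users);          // array of user items
});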
So what is the better solution, or are there any other advantages/disadvantages that I may have overlooked?
EDIT
After some google research regarding friendship tables with NoSQL databases, I found the following two links:
How does Facebook maintain a list of friends for each user? Does it maintain a separate table for each user?
NoSQL Design Patterns for Relational Data
The first link suggests storing the connection (or friendship) in both directions with two records, because that makes it easier and faster to query:
Connections:
1 userIdA userIdB
2 userIdB userIdA
The second link suggests saving a subset of duplicated data ("summary") in the tables so that it can be read faster with just one query. That would mean saving the user details in the connection table as well, and saving the userIds in an attribute of the user table:
Connections:
#   userIdA   userIdB   userDetails                      status
1   123       456       { userId: 456, name: "Bob" }     connected
2   456       123       { userId: 123, name: "Alice" }   connected

Users:
#   userId   name    connections
1   123      Alice   { 456 }
2   456      Bob     { 123 }
This database model makes it pretty easy to query connections, but it seems difficult to update if some user details change. Also, I'm not sure if I need the userIds within the user table again, because I can easily query on a userId.
What do you think about that database model?
In general, NoSQL databases come with a couple of assumptions:
Eventual consistency is acceptable. That is, it's often acceptable in application design if some of the intermediate answers aren't right during an update. For example, it might be fine if, for a few seconds while Alice is becoming Bob's friend, "Is Alice Bob's friend?" returns true while "Is Bob Alice's friend?" returns false.
Performance is important. If you're using NoSQL it's generally because performance matters to you. It's also almost certainly because you care about the performance of the operations that happen most commonly. (It's possible that you have a problem where the performance of some uncommon operation is so bad that you can't do it; NoSQL is not generally the answer in that situation.)
You're willing to make uncommon operations slower to improve the performance of common operations.
So, how does that apply to your question? First, it suggests that ultimately the answer depends on performance. That is, no matter what people say here, the right answer depends on what you observe in practice. You can try multiple options and see what results you get.
With regard to the specific options you enumerated:
Assuming that performance is enough of a concern that nosql is a reasonable solution for your application, it's almost certainly query rather than update performance you care about. You probably will be happy if you make updates slower and more expensive so that queries can be faster. That's kind of the whole point.
You can likely handle updates out of band - that is, eventual consistency likely works for you. You could submit update operations to an SQS queue rather than handling them during your page load. So if someone clicks a confirm-friend button, you could queue a request to actually update your database. It is OK even if that involves rebuilding their user row, rebuilding the friend rows, and even updating some counts about how many friends they have.
It probably does make sense to store a friend row in each direction so you only need one query.
It probably does make sense to store the user information like Name and picture that you typically display in a friend list duplicated in the friendship rows. Note that whenever the name or picture changes you'll need to go update all those rows.
It's less clear that storing the friends in the user table makes sense. That could get big. Also, it could be tricky to guarantee eventual consistency. Consider what happens if you are processing updates to two users' friendships at the same time. It's very important that you not end up with inconsistency once all the dust has settled.
Whenever you have non-normalized data such as duplicating rows in each direction, or copying user info into friendship tables, you want some way to revalidate and fix your data. You want to write code that in the background can go scan your system for inconsistencies caused by bugs or crashed activities and fix them.
I suggest you have the following fields in the table:
userId (hash key)
name (sort key)
email
connections (comma-separated or an array of userIds, assuming a user can have multiple connections)
This structure can ensure consistency across your data.
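A sketch of maintaining that connections attribute as a string set with the DocumentClient (the key values are illustrative; the userId/name key schema follows the post):

// Sketch: add a newly connected userId to the user's 'connections' string set.
docClient.update({
  TableName: 'Users',
  Key: { userId: '123', name: 'Alice' },            // hash + sort key of the user row
  UpdateExpression: 'ADD #conn :newConnection',
  ExpressionAttributeNames: { '#conn': 'connections' },
  ExpressionAttributeValues: {
    ':newConnection': docClient.createSet(['456'])  // string set containing the new connection
  }
}, (err, data) => {
  if (err) console.error(err);
});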