We have a huge DynamoDB table (~4 billion items) and one of the columns is some kind of category (string). We would like to either map this column to a new one, category_id (integer), or update the existing one from string to int. Is there a way to do this efficiently without creating a new table and populating it from the beginning? In other words, can we update the existing table in place?
Is there a way to do this efficiently
Not in DynamoDB, that use case is not what it's designed for...
Also note, unless you're talking about the hash or sort key (of the table or of an existing index), DDB doesn't have columns.
You'd run Scan() (in a loop since it only returns 1MB of data)...
Then update each item one at a time with UpdateItem. (Note: there is no batch update API; BatchWriteItem can write up to 25 items per request, but it replaces whole items rather than updating attributes, and it only saves network overhead; the writes are still individual.)
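A minimal boto3 sketch of that Scan-and-UpdateItem loop (the table name, key name, and string-to-int mapping are all hypothetical):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("MyHugeTable")  # hypothetical table name

# Assumed mapping from category string to integer id.
CATEGORY_IDS = {"books": 1, "electronics": 2}

scan_kwargs = {
    "ProjectionExpression": "#pk, #cat",
    "ExpressionAttributeNames": {"#pk": "id", "#cat": "category"},
}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        # One UpdateItem per item; at 4 billion items this is a lot of writes.
        table.update_item(
            Key={"id": item["id"]},
            UpdateExpression="SET category_id = :cid",
            ExpressionAttributeValues={":cid": CATEGORY_IDS[item["category"]]},
        )
    # Scan returns at most 1 MB per call, so keep paginating.
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]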
If the attribute in question is used as a key in the table or an existing index...then a new table is your only option. Here's a good article with a strategy for migrating a production table.
Create a new table (let us call this NewTable), with the desired key structure, LSIs, GSIs.
Enable DynamoDB Streams on the original table
Associate a Lambda to the Stream, which pushes the record into NewTable. (This Lambda should trim off the migration flag added in Step 5; a sketch of such a Lambda follows these steps.)
[Optional] Create a GSI on the original table to speed up scanning items. Ensure this GSI only has attributes: Primary Key, and Migrated (See Step 5).
Scan the GSI created in the previous step (or entire table) and use the following Filter:
FilterExpression = "attribute_not_exists(Migrated)"
Update each item in the table with a migrate flag (i.e. "Migrated": { "S": "0" }), which sends it to DynamoDB Streams (use the UpdateItem API to ensure no data loss occurs).
NOTE You may want to increase write capacity units on the table during the updates.
The Lambda will pick up all items, trim off the Migrated flag, and push them into NewTable.
Once all items have been migrated, repoint the code to the new table
Remove the original table and the Lambda function once you're happy that all is good.
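A minimal sketch of the stream-handler Lambda from steps 3 and 6 (NewTable comes from the steps above; everything else is a placeholder):

import boto3
from boto3.dynamodb.types import TypeDeserializer

dynamodb = boto3.resource("dynamodb")
new_table = dynamodb.Table("NewTable")
deserializer = TypeDeserializer()

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            continue
        # Stream images arrive in DynamoDB JSON ({"S": ...}, {"N": ...}).
        image = record["dynamodb"]["NewImage"]
        item = {k: deserializer.deserialize(v) for k, v in image.items()}
        item.pop("Migrated", None)  # trim off the migration flag (step 5)
        new_table.put_item(Item=item)

This assumes the stream is configured with NEW_IMAGE (or NEW_AND_OLD_IMAGES) so the full item is present in each record.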
I have a DynamoDB table with the following structure:
partitionKey - userId+keyName
sortKey - keyName+itemId
itemData - any object
createdAt - long value
updatedAt - long value
In this table I want to save a list of items, let's say all unique edible items found in a shop. As per the requirement, I need to find out the count of items in a particular shop. From my findings, I came across three ways to do this:
Use Query to fetch the count as per this link, without explicitly saving a count value.
Use transactions while saving items and store/update the count explicitly. [We want to add/remove multiple items in a single request.] Then later get the count using the GetItem API.
Use DynamoDB Streams to trigger SNS and eventually store an explicit count in the same table/a different table. Then later get the count using the GetItem API.
Note
Latency is important here along with the cost.
You can assume this DynamoDB table can have millions of items.
Eventual consistency is fine.
In my view the 3rd option looks more efficient in terms of cost and latency, but I want to know if my thoughts are correct.
Using Dynamo streams to write aggregate data back to Dynamo is definitely the way to go!
This will of course be eventually consistent by its nature, as updating your item and waiting for the stream to update the aggregate are two different non-atomic operations.
A fourth option would be to have something like an ElasticSearch index updated (also by using streams), which allows you to do arbitrary ad-hoc queries.
If you need consistency for your aggregates, you have to use transactions for this, with all the limitations imposed.
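For illustration, a rough boto3 sketch of the transactional variant (option 2); the table name, key values, and counter attribute are placeholders loosely following the schema above:

import boto3

client = boto3.client("dynamodb")

def add_items_with_count(shop_id, item_ids):
    # Insert the items and bump the shop's counter in one transaction,
    # so readers never see the count out of sync with the items.
    actions = [
        {"Put": {
            "TableName": "ShopItems",  # placeholder table name
            "Item": {
                "partitionKey": {"S": f"{shop_id}+item"},
                "sortKey": {"S": f"item+{item_id}"},
            },
        }}
        for item_id in item_ids
    ]
    actions.append({"Update": {
        "TableName": "ShopItems",
        "Key": {
            "partitionKey": {"S": f"{shop_id}+meta"},
            "sortKey": {"S": "count"},
        },
        "UpdateExpression": "ADD itemCount :n",
        "ExpressionAttributeValues": {":n": {"N": str(len(item_ids))}},
    }})
    # TransactWriteItems allows up to 100 actions per call.
    client.transact_write_items(TransactItems=actions)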
As stated in this question, I've assumed that you can't have something like an updated date as the sort key of a table, because if you update it you will create a duplicate record.
Further, I've always assumed that the same thing applied to a GSI using updated date. But in my scenario I have the updated date as a sort key on a GSI, and no new records are created when I update the original item.
To recap, the attributes and key schema are:
Attributes:
Id
MySortKey
MyComputedField
UpdatedDate
Table:
PartitionKey: Id
SortKey: MySortKey
GSI:
PartitionKey: MyComputedField
SortKey: UpdatedDate
My question is, am I indirectly affecting the performance of the index by doing this? Or are there any other issues caused by this pattern that I'm not aware of?
Global Secondary Indexes are separate tables under the hood, and changed items from the base table are replicated to them.
As you observed correctly, you can use a changing attribute as the sort key in a GSI without that resulting in duplicates once you write to the base table.
Note, that there is no guarantee of uniqueness in the GSI, i.e. you can have more than one item with the same key attributes.
In addition to that you can only do eventually consistent reads from GSIs.
GSIs also have their own read and write capacity units that you need to provision and if you change items in the base table that need to be replicated, the operation will consume write capacity units on the GSI.
Reads are separate from that.
The RCUs on the GSI remain unaffected by the writes to the table.
But if you often change items, you may see some inconsistencies for a very brief period of time (that's why only eventually consistent reads are possible).
That means you can use the pattern if you can live with the side effects I mentioned.
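For reference, a boto3 sketch of the table-plus-GSI layout from the question (the index name and capacity numbers are arbitrary placeholders):

import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="MyTable",  # placeholder
    AttributeDefinitions=[
        {"AttributeName": "Id", "AttributeType": "S"},
        {"AttributeName": "MySortKey", "AttributeType": "S"},
        {"AttributeName": "MyComputedField", "AttributeType": "S"},
        {"AttributeName": "UpdatedDate", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "Id", "KeyType": "HASH"},
        {"AttributeName": "MySortKey", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[{
        "IndexName": "MyComputedField-UpdatedDate-index",  # placeholder
        "KeySchema": [
            {"AttributeName": "MyComputedField", "KeyType": "HASH"},
            # Mutable sort key: when UpdatedDate changes, the item is
            # moved within the GSI rather than duplicated.
            {"AttributeName": "UpdatedDate", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
        # The GSI has its own capacity, consumed by replicated writes.
        "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    }],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)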
I'd like to list records from my DDB table ordered by creation date.
My table has an attribute DateCreated.
All examples I can find describe ordering within some partition.
But I want global ordering.
Am I supposed to create an artificial attribute which will have the same value across all records, just to use it as a partition key? E.g. add new attribute GlobalPartition with value 1 to every record in the table, and create a GSI with partition key GlobalPartition and sort key DateCreated. Isn't there a better way?
Thx!
As you noticed, DynamoDB indeed does not have an option to sort items "globally". In other words, there is no way to Scan the database in sorted partition-key order. You can only sort items inside one partition, sorted by the "sort key".
When you have a small amount of data, you can indeed do what you said: have a single partition with everything in it. However, it's not clear how practical this approach remains as your single partition grows to gigabytes or terabytes, and how well DynamoDB can load-balance when you have just a single partition (I never saw any DynamoDB documentation that answers this question).
So another option is not to have a single partition but rather a number of them. For example, consider that you want to sort items by date. Now instead of having a single partition, have a partition per month, i.e., the partition key is the month number. If you want to sort everything within a month, you can do it directly, but if you want a sorted list for a full year, you need to Query twelve partitions, in order, getting a sorted list from each one and combining them into a sorted list for the full year. So-called time-series databases are often modeled this way.
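A rough boto3 sketch of that per-month layout (assuming a hypothetical table with partition key Month and sort key DateCreated):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Events")  # hypothetical table

def items_for_year(year):
    # Query the twelve month partitions in order; each Query returns
    # its items sorted by the DateCreated sort key, so the
    # concatenation is globally sorted for the year.
    results = []
    for month in range(1, 13):
        kwargs = {
            "KeyConditionExpression": Key("Month").eq(f"{year}-{month:02d}")
        }
        while True:
            page = table.query(**kwargs)
            results.extend(page["Items"])
            if "LastEvaluatedKey" not in page:
                break
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return results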
If you want to sort data in DynamoDB, you need a sort key on that attribute. If the value is not in an attribute that maps to the table's sort key, or the table does not have a sort key, then you need to create a GSI and put the GSI's sort key on that attribute. You can use an LSI too. It works with any attribute that maps to the "Sort Key" of any index: table, LSI, or GSI.
For more details, check the "ScanIndexForward" parameter of the Query request.
If ScanIndexForward is true, DynamoDB returns the results in the order in which they are stored (by sort key value). This is the default behavior. If ScanIndexForward is false, DynamoDB reads the results in reverse order by sort key value, and then returns the results to the client.
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html#API_Query_RequestSyntax
The console UI has a checkbox for this too.
"Global sort" is not possible, while "global" would mean scan operation and it just runs through all rows in database and filters by filters, yet it does not have sorting option. On query on attribute mapped to sort key has ScanIndexForward option to change sort direction.
I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables, different options can be used:
if a table explicitly contains a reference to the last change time, this can be used
if a table has guaranteed update characteristics (e.g. no in-place update and monotone ID values), this could be used, e.g. read all records where the ID is larger than the last processed ID
if the table does not provide intrinsic information about change time, then one could maintain a copy of the table that contains only the records processed so far. This copy can then be used to compare against the current table and compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
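A toy Python sketch of that last snapshot-comparison idea (database access omitted; both snapshots are plain dicts keyed by primary key):

def diff_snapshots(previous, current):
    # Compare a saved snapshot with the current table contents.
    # Both arguments map primary key -> row tuple. Returns the keys
    # that were inserted, updated, or deleted since the snapshot.
    inserted = current.keys() - previous.keys()
    deleted = previous.keys() - current.keys()
    updated = {
        pk for pk in current.keys() & previous.keys()
        if current[pk] != previous[pk]
    }
    return inserted, updated, deleted

# Example: pk -> (salary,)
previous = {1: (1000,), 2: (2000,), 3: (1500,)}
current = {1: (1000,), 2: (2500,), 4: (1200,)}
print(diff_snapshots(previous, current))
# ({4}, {2}, {3}) -> row 4 inserted, row 2 updated, row 3 deleted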
In my experience, efforts to try to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming, and expensive than using the corresponding features of ETL tools.
It is possible to create a log table and organize its columns according to your needs, so that by creating a trigger on your database tables you can write a log record with timestamp values. Then you can query your log table to determine which records were inserted, updated or deleted in your source tables.
For example, the following is from one of my test triggers:
CREATE TRIGGER "A00077387"."SALARY_A_UPD"
AFTER UPDATE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW, NEW ROW MYNEWROW
FOR EACH ROW
BEGIN
    -- write one log record per updated row; CURRENT_TIMESTAMP (rather
    -- than CURRENT_DATE) records the full change time for delta queries
    INSERT INTO SalaryLog (Employee, Salary, Operation, DateTime)
    VALUES (:mynewrow.Employee, :mynewrow.Salary, 'U', CURRENT_TIMESTAMP);
END;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE one.
You can organize your log table so that you can track more than one source table if you wish, just by keeping the table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use separate log tables for each table.
I am new to nosql / DynamoDB.
I have a list of ~10,000 container-item records, which is updated every 6 hours:
[
{ containerId: '1a3z5', items: ['B2a3', 'Z324', 'D339', 'M413'] },
{ containerId: '42as1', items: ['YY23', 'K132'] },
...
]
(primary key = containerId)
Is it viable to just delete the table, and recreate with new values?
Or should I loop through every item of the new list and conditionally update/write/delete the current DynamoDB records (using BatchWriteItem)?
For this scenario a batch update is the better approach. You have 2 cases:
If you need to update only certain records, then a batch update is more efficient. You can scan the whole table, iterate through the records, and only update the ones that changed.
If you need to update all the records every 6 hours, a batch update will also be more efficient, because if you drop and recreate the table, that also means you have to recreate the indexes, and this is not a very fast process. And after you recreate the table you still have to do the inserts, and in the meantime you have to keep all the records in another database or in memory.
One scenario where deleting the whole table is a good approach is when you need to delete all the data from a table with thousands or more records; then it's much faster to recreate the table than to delete all the records through the API.
And one more suggestion: have you considered alternatives? Your problem does not look like a good use case for DynamoDB. For example, MongoDB and Cassandra support update-by-query out of the box.
If the update touches some but not all existing items, and if partial update of 'items' is possible, then you have no choice but to do a per-record operation. And this would be true even with a more capable database.
You can perhaps speed it up by retrieving only the existing containerIds first, so based on that set you know which records to update versus insert. Alternatively, you can do a batch retrieve by id using the ids from the set of updates: whichever ones do not return a result are the ones you have to insert, and the ones that do are the ones to update.
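A rough boto3 sketch of that id-diff approach, assuming the table from the question (the table name is a placeholder; error handling and key chunking are glossed over):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Containers")  # placeholder table name

def sync_containers(new_records):
    # new_records is the full 6-hourly list of
    # {containerId, items} dicts from the question.
    # Which containerIds already exist? (BatchGetItem takes at most
    # 100 keys per call; chunking is omitted in this sketch.)
    keys = [{"containerId": r["containerId"]} for r in new_records]
    resp = dynamodb.batch_get_item(
        RequestItems={"Containers": {"Keys": keys,
                                     "ProjectionExpression": "containerId"}}
    )
    existing = {i["containerId"] for i in resp["Responses"]["Containers"]}
    new_ids = {r["containerId"] for r in new_records}
    print(f"{len(new_ids & existing)} updates, {len(new_ids - existing)} inserts")

    # batch_writer wraps BatchWriteItem (25 writes per request) and
    # retries unprocessed items; PutItem overwrites, so inserts and
    # updates are handled the same way here.
    with table.batch_writer() as batch:
        for r in new_records:
            batch.put_item(Item=r)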