I have a DynamoDB table with a created date/time column that indicates when the record/item was inserted into the table. I have about 20 years' worth of data in this table (records were migrated from a previous database), and I would now like to truncate anything older than 6 months, on an ongoing basis.
The obvious thing to do here would be to set a TTL on the table for 6 months; however, my understanding is that AWS TTLs only go back a certain number of years (please correct me if you know otherwise!). So my understanding is that if I set a 6-month TTL on 20 years of data, I might delete records starting at 6 months old and going back maybe 3 to 5 years, but then there'd be a whole lot of really old data left over, unaffected by the TTL (again, please correct me if you know otherwise!). So I guess I'm looking for:
The ability to do a manual, one-time deletion of data older than 6 months; and
The ability to set a 6 month TTL moving forward
For the first one, I need to execute something like DELETE FROM mytable WHERE created < '2018-06-25', however I can't figure out how to do this from the AWS/DynamoDB management console. Any ideas?
For the second part, when I go to Manage TTL in the DynamoDB console:
I'm not actually seeing where I would set the 6-month expiry. Is it the date/time fields at the very bottom of that dialog? That seems strange to me: if that were the case then the TTL wouldn't be a rolling 6-month window, it would just be a hardcoded point in time which I'd need to keep updating manually so that data is never more than 6 months old...
You are correct about how far back in time TTL goes: it's actually 5 years (items whose TTL value is more than 5 years in the past are not deleted). The way it works is by comparing your TTL attribute value with the current timestamp. If your item's TTL value is older than the current timestamp, it's scheduled for deletion within roughly the next 48 hours (it's not immediate). So, if you use the item's creation timestamp as the TTL attribute, everything will be scheduled for deletion as soon as you insert it, and that's not what you want.
The way you manage the 6-month expiry policy is in your application. When you create an item, set a TTL attribute to a timestamp 6 months ahead of the creation time and just leave it there; DynamoDB will take care of deleting it in 6 months. For your "legacy" data, I can't see a way around scanning or querying, looping through each item, and setting the TTL attribute on each of them manually.
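For illustration, here's a rough sketch of both parts in Python with boto3 (the table name, key name, and attribute names are placeholders, and it assumes the created attribute is stored as epoch seconds):

import time
import boto3
from boto3.dynamodb.conditions import Attr

SIX_MONTHS = 183 * 24 * 60 * 60                      # roughly six months, in seconds
table = boto3.resource("dynamodb").Table("mytable")  # placeholder table name

# New items: write an expiry six months ahead and let TTL do the rest.
now = int(time.time())
table.put_item(Item={
    "id": "some-id",                  # placeholder key
    "created": now,
    "expires_at": now + SIX_MONTHS,   # the attribute TTL is enabled on
})

# Legacy items: scan and backfill expires_at = created + six months.
scan_kwargs = {"FilterExpression": Attr("expires_at").not_exists()}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        table.update_item(
            Key={"id": item["id"]},
            UpdateExpression="SET expires_at = :t",
            ExpressionAttributeValues={":t": int(item["created"]) + SIX_MONTHS},
        )
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]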
Deleting old records directly and updating their TTL attribute so that DynamoDB can delete them later both consume the same write capacity. Either way, you'll need to scan or query and then delete (or update) records one by one.
If, let's say, 90% of your data is old, the most cost- and time-efficient way of deleting it is to move the remaining 10% to a new table and delete the old one.
Another, non-standard way I see is to choose an existing timestamp field you can sacrifice (for instance, an audit field such as the creation date), stop writing it on new records, and use it as the TTL attribute to delete the old ones. This lets you do what you need more cheaply and without switching to another table (which may require multi-step changes in your application), but it requires the field to (a) not be in use, (b) be in the past, and (c) be a UNIX timestamp in seconds. If you don't want to lose it permanently, you can copy it to another attribute and copy it back after all old records have been deleted and TTL on that field has been turned off (or switched to another attribute). It will not work for records whose timestamp is more than 5 years in the past.
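Pointing TTL at the sacrificed attribute (and turning it off again later) is a single API call each way; a minimal boto3 sketch, with the table and attribute names as placeholders:

import boto3

client = boto3.client("dynamodb")

# Enable TTL on the existing attribute (it must hold epoch seconds).
client.update_time_to_live(
    TableName="mytable",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "created"},
)

# Later, disable it again (or re-enable it on a different attribute).
client.update_time_to_live(
    TableName="mytable",
    TimeToLiveSpecification={"Enabled": False, "AttributeName": "created"},
)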
I want to truncate a DynamoDB table which can have up to 3 to 4 million records. What is the best way?
Right now I am using a scan, which does not give good performance (I have tried deleting only a few records: 3):
DynamoDB dynamoDB = new DynamoDB(amazonDynamoDBClient);
Table table = dynamoDB.getTable("table-test");

// Full table scan, then one sequential DeleteItem call per item
ItemCollection<ScanOutcome> resultItems = table.scan();
Iterator<Item> itemsItr = resultItems.iterator();
while (itemsItr.hasNext()) {
    Item item = itemsItr.next();
    String itemPk = item.getString("PK");
    String itemSk = item.getString("SK");
    DeleteItemSpec deleteItemSpec = new DeleteItemSpec()
            .withPrimaryKey("PK", itemPk, "SK", itemSk);
    table.deleteItem(deleteItemSpec);
}
The best way is to delete your table and create a new one with the same name. This is how clearing all data from a DynamoDB table is usually done.
As Marcin already answered, the best way is to delete your table and create a new one. It is certainly the cheapest way - because any other way would require scanning the entire table and paying for the read capacity units required to do it.
In some cases, however, you might want to delete old items while the table is still actively used. In that case you can use a Scan like you wanted, but you can do it much more efficiently than you did. First, don't run individual DeleteItem requests sequentially, waiting for one delete to complete before sending the next one: you can send batches of 25 deletes in one BatchWriteItem request, and you can also send multiple BatchWriteItem requests in parallel. Finally, for even faster deletion, you can parallelize your Scan across multiple threads or even machines - see the parallel scan section of the DynamoDB documentation. Just don't forget that if you delete items while the table is still actively written to, you need a way to tell the old items you want to delete apart from the new items you don't, as the scan may start returning those new items as well.
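As a sketch of that batched approach in Python with boto3 (the filter attribute and cutoff are placeholders for however you tell old items apart; batch_writer groups the deletes into BatchWriteItem calls of up to 25 items and retries unprocessed items for you):

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("table-test")
cutoff = 1609459200                                     # placeholder epoch-seconds boundary

scan_kwargs = {
    "ProjectionExpression": "PK, SK",                   # keys only: smaller responses
    "FilterExpression": Attr("created").lt(cutoff),     # placeholder "is old" test
}
with table.batch_writer() as batch:                     # sends BatchWriteItem(25) under the hood
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            batch.delete_item(Key={"PK": item["PK"], "SK": item["SK"]})
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

For the parallel variant, run several copies of this loop, each passing its own Segment and TotalSegments to table.scan(). Note that the filter does not reduce the read cost: a scan is billed for everything it reads, matched or not.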
Finally, if you find yourself often clearing old data from a table - you should consider whether you can use DynamoDB's TTL feature, where DynamoDB automatically looks for expired items (based on an expiration-time attribute on each item) and deletes them - at no cost to you.
I feel like I'm thinking myself in circles here. Maybe you all can help :)
Say I have this simple table design in DynamoDB:
Id | Employee | Created | SomeExtraMetadataColumns... | LastUpdated
Say my only use case is to find all the rows in this table where LastUpdated < (now - 2 hours).
Assume that 99% of the data in the table will not meet this criteria. Assume there is some job running every 15 minutes that updates the LastUpdated column.
Assume there are, say, 100,000 rows and the table grows by maybe 1,000 rows a day (no need for large write capacity).
Assume a single entity will be performing this 'read' use case (no need for large read capacity).
Options I can think of:
Do a scan.
Pro: can leverage parallel scans to scale in the future.
Con: wastes a lot of money reading rows that do not match the filter criteria.
Add a new column called 'Constant' that would always have the value 'Foo', and make a GSI with a partition key of 'Constant' and a sort key of LastUpdated. Then execute a query on this index for Constant = 'Foo' and LastUpdated < (now - 2 hours).
Pro: Only queries the rows matching the filter. No wasted money.
Con: In theory this would be plagued by the 'hot partition' problem if writes scale up, but I am unsure how much of a problem it would be, as AWS has described that problem as largely a thing of the past.
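Something like this is what I have in mind for that query (a rough boto3 sketch; the table and index names are placeholders, and it assumes LastUpdated is stored as epoch seconds):

import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")   # placeholder table name
cutoff = int(time.time()) - 2 * 60 * 60                # now - 2 hours

resp = table.query(
    IndexName="Constant-LastUpdated-index",            # GSI: PK = Constant, SK = LastUpdated
    KeyConditionExpression=Key("Constant").eq("Foo") & Key("LastUpdated").lt(cutoff),
)
stale_rows = resp["Items"]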
Honestly, I'm leaning toward the latter option, but I'm curious what the community's thoughts are on this. Perhaps I am missing something.
Based on the assumption that the last_updated field is the only field you need to query against, I would do something like this:
PK: EMPLOYEE::{emp_id}
SK: LastUpdated
Attributes: Employee, ..., Created
PK: EMPLOYEE::UPDATE
SK: LastUpdated::{emp_id}
Attributes: Employee, ..., Created
By denormalising your data here you can create an additional 'update' row which can be queried with PK = EMPLOYEE::UPDATE and SK between 'datetime' and 'datetime'. This assumes you store the datetime as something like 2020-10-01T00:00:00Z.
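A rough boto3 sketch of that query (the table name is a placeholder and the bounds are example datetimes; they compare lexicographically against the full SK string, including the ::{emp_id} suffix):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")   # placeholder table name

resp = table.query(
    KeyConditionExpression=Key("PK").eq("EMPLOYEE::UPDATE")
    & Key("SK").between("2020-10-01T00:00:00Z", "2020-10-01T02:00:00Z"),
)
updates = resp["Items"]        # SK values look like '2020-10-01T01:23:45Z::1234'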
You can either insert this additional row yourself, or you could consider utilising DynamoDB Streams to stream update events to Lambda and then add the row from there (see the sketch below). You can set a TTL on the 'update' row, which will expire it somewhere between 0 and 48 hours after the TTL you set, keeping the table clean. It doesn't need to be removed instantly because you're querying based on the PK and SK anyway.
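If you go the Streams route, the Lambda handler stays small. A sketch, assuming a DynamoDB stream trigger with the NEW_IMAGE view and the key shapes described above (the table name and the 'ttl' attribute name are placeholders):

import time
import boto3

table = boto3.resource("dynamodb").Table("my-table")    # placeholder table name

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        new_image = record["dynamodb"]["NewImage"]       # typed DynamoDB JSON
        pk = new_image["PK"]["S"]
        if pk == "EMPLOYEE::UPDATE":                     # ignore the rows we insert ourselves
            continue
        emp_id = pk.split("::")[1]
        last_updated = new_image["SK"]["S"]              # SK holds LastUpdated in this design
        table.put_item(Item={
            "PK": "EMPLOYEE::UPDATE",
            "SK": f"{last_updated}::{emp_id}",
            "ttl": int(time.time()) + 48 * 60 * 60,      # TTL-enabled attribute (placeholder name)
        })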
A scan is an absolute no-no on a table that size so I would definitely recommend against that. If it increases by 1,000 per day like you say then before long your scan would be unmanageable and would not scale. Even at 100,000 rows a scan is very bad.
You could also utilise DynamoDB Streams to stream your data out to data stores which are better suited to analytics, which is what I assume you're trying to achieve here. For example, you could stream the data to Redshift, RDS, etc. Those require a few extra steps and could benefit from Kinesis depending on the scale of updates, but it's something else to consider.
Ultimately there are quite a lot of options here. I'd start by investigating the denormalisation and then investigate other options. If you're trying to do analytics in DynamoDB I would advise against it.
PS: I nearly always call my PK and SK attributes PK and SK and have them as strings so I can easily add different types of data or denormalisations to a table easily.
Definitely stay away from scan...
I'd look at a GSI with
PK: YYYY-MM-DD-HH
SK: MM-SS.mmmmmm
Now, to get the records updated in the last two hours, you only need to make three queries: the current hour's partition plus the two before it.
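A rough boto3 sketch of those three queries (the index and attribute names are placeholders; the GSI hash attribute holds the YYYY-MM-DD-HH bucket and has to be written on every update):

from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")    # placeholder table name
now = datetime.now(timezone.utc)

items = []
for hours_back in range(3):                             # current hour bucket plus the two before it
    bucket = (now - timedelta(hours=hours_back)).strftime("%Y-%m-%d-%H")
    resp = table.query(
        IndexName="UpdatedBucket-index",                # placeholder GSI name
        KeyConditionExpression=Key("UpdatedHour").eq(bucket),
    )
    items.extend(resp["Items"])

# items now spans up to three hour buckets; drop anything older than two hours
# client-side, or add a sort-key condition on the oldest bucket.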
I have a table ActivityLog to which new data is added every second.
I am querying this table every 5 seconds using an API in the following way:
logs = ActivityLog.objects.prefetch_related('activity').filter(login=login_obj,read_status=0)
Now let's say that when I queried this table at time 13:20:05 I got 5 objects in logs, and after my query 5 more rows were added to the table at 13:20:06. When I try to update only the queried logs queryset using logs.update(read_status=1), it also updates the newly added rows in the table. That is, instead of updating 5 objects it updates 10. How can I update only the 5 objects that I queried, without looping through them?
Take a look at select_for_update. QuerySets are lazy, so logs.update(read_status=1) re-runs the filter at update time and picks up the newly inserted rows as well; you need to pin down which rows you actually fetched. Just be aware that the rows will be locked in the meanwhile.
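A minimal sketch: wrap the fetch in a transaction, evaluate (and lock) the rows you got, then update only those primary keys:

from django.db import transaction

with transaction.atomic():                  # select_for_update requires a transaction
    logs = (ActivityLog.objects
            .select_for_update()
            .prefetch_related('activity')
            .filter(login=login_obj, read_status=0))
    ids = [log.pk for log in logs]          # evaluates the queryset and locks exactly these rows
    ActivityLog.objects.filter(pk__in=ids).update(read_status=1)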
Question
What is the best way to update a column of a table with tens of millions of rows?
1) I saw creating a new table and renaming the old one when finished.
2) I saw updating in batches using a temp table.
3) I saw a single transaction (I don't like this one though).
4) I never listen to cursor solutions for a problem like this, and I don't think it's worth trying.
5) I read about loading data from a file (using BCP), but I have not read whether the performance is better or not. It was not clear if it is just for copying, or if it would allow joining a big table with something and then bulk copying.
I would really like to have some advice here.
Priority is performance.
At the moment I'm testing solution 2) and exploring solution 5).
Additional Information (UPDATE)
Thank you for the critical thinking here.
The operation can be done in downtime.
The UPDATE will not cause row forwarding.
All the tables have indexes, 5 on average, although a few tables have around 13 indexes.
The probability that the target column is present in one of the table's indexes is something like 50%.
Some tables can be rebuilt and replaced; others can't, because they are part of a software solution and we might lose support for those.
Some of those tables have triggers.
I'll need to do this for more than 600 tables, of which ~150 range from 0.8 million to 35 million rows.
The update always targets the same column across the various tables.
References
BCP for data transfer
Actually it depends:
on the number of indexes the table contains
the size of the row before and after the UPDATE operation
type of UPDATE - would it be in place? does it need to modify the row length?
does the operation cause row forwarding?
how big is the table?
how big would the transaction log of the UPDATE command be?
does the table contain triggers?
can the operation be done in downtime?
will the table be modified during the operation?
are minimal logging operations allowed?
would the whole UPDATE transaction fit in the transaction log?
can the table be rebuilt & replaced with a new one?
what was the timing of the operation on the test environment?
what about free space in the database - is there enough space for a copy of the table?
what kind of UPDATE operation is to be performed? do additional SELECT commands have to be run to calculate the new value for every row, or is it a static change?
Depending on the answers and the results of the operation in the test environment we could consider the fastest operations to be:
minimal logging copy of the table
an in place UPDATE operation preferably in batches
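For the batched in-place UPDATE, a rough sketch driven from Python via pyodbc (the connection string, table and column names, and the new value are all placeholders; it assumes a static change, and the same loop can be written as a plain T-SQL WHILE loop):

import pyodbc

CONN_STR = "DSN=mydb"                       # placeholder connection details
NEW_VALUE = "new value"                     # placeholder static value

conn = pyodbc.connect(CONN_STR)
conn.autocommit = False
cur = conn.cursor()

BATCH_SIZE = 50000
while True:
    cur.execute(
        """
        UPDATE TOP (?) dbo.BigTable          -- placeholder table name
        SET TargetColumn = ?                 -- placeholder column name
        WHERE TargetColumn <> ? OR TargetColumn IS NULL
        """,
        BATCH_SIZE, NEW_VALUE, NEW_VALUE,
    )
    rows = cur.rowcount
    conn.commit()                            # commit per batch keeps the transaction log small
    if rows == 0:
        break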
In our Sitecore (6.6) implementation we use Lucene indexing. On our PROD server the index building process is very slow; at the moment it has 5,000+ entries waiting in the index queue.
Queries I used (in the master database):
select * from Properties -- check the index last run time
select * from History where created > 'last index updated time'
As a result of this delay, newly created data does not have its changes reflected on the website. Also, this queue keeps increasing. When the site is taken offline, index building catches up after a while.
It's a heavily read-intensive website.
We encountered high-CPU issues, but they have now been sorted. We thought index building was lagging because of the high CPU, but now the CPU is running at around 30-40%. Still, the Lucene indexing queue keeps growing quickly.
How can I solve this issue? Please help.
You need to set up a database maintenance task so that you regularly flush your History table. If you have sites that are index-heavy, this table can grow excessively large. I think the default job cleans out everything in this table that is older than 30 days; you could set this much lower, like 1 day or a couple of days.
This article on SDN covers most of the standard maintenance tasks: http://sdn.sitecore.net/Articles/Administration/Database%20Maintenance.aspx
More general information about searching, indexing and performance here: http://sdn.sitecore.net/upload/sitecore6/65/sitecore_search_and_indexing_sc60-65-a4.pdf#search=%22clean%22
I think you need to take a step back and ask why such a large number of entries is being added to the history table to begin with, before looking at what configuration changes can be made to Sitecore.
You should trace through your code in your development environment based on each of the use cases for your implementation, to find all calls to the Sitecore API where an item is:
Added into the Sitecore Tree
Edited - the changing of any of the item's fields, including security, presentation, workflow, publishing restrictions, etc.
Duplicated
Deleted from the Sitecore Tree
Moved to a new location.
Has a new version added
Has a version removed
As you go through, make sure that all edit actions on an item are performed within a single Sitecore.Data.Items.Item.Editing.BeginEdit() and Sitecore.Data.Items.Item.Editing.EndEdit() call whenever possible, so that the changes are performed as a single edit action instead of multiple. Every time Sitecore.Data.Items.Item.Editing.EndEdit() is called, a new record is inserted into the history table, so unnecessary edits will only cause the history table to grow.
If you are duplicating an item using the Sitecore.Data.Items.Item.CopyTo() method, remember that all versions of the item will be duplicated, as well as the item's descendants. This means that the history table will have a record in it for every version of the item that was copied. If you only require the latest version and therefore remove older versions from the new item after it is created, again be aware that removing a version from an item will result in a record being inserted into the history table for each version deleted.
If you have minimized all of the above actions to the bare minimum that is required to make the system functional, you should find that the Lucene Indexing will keep up-to-date pretty well without having to change Sitecore's default index configuration.