I have an existing DynamoDB table with some data in it already, and I want to create an index on it. But after creating the index, I still cannot see any of the existing data in the index in the console. Any idea why?
So, doesn't it backfill the old data in the index? If so, how can I do that?
Backfilling should take place automatically, but it takes time: the more data you have in your table, the longer the backfill process will take to complete. You can speed it up by provisioning additional capacity on the index (which, naturally, you'll have to pay for).
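If you want to know when the backfill has finished, the index reports an `IndexStatus` of `CREATING` until it's done. A minimal polling check, sketched against a boto3-style DynamoDB client (the table and index names are placeholders):

```python
def index_is_active(client, table_name, index_name):
    """Return True once the GSI has finished backfilling.

    `client` is anything exposing DynamoDB's DescribeTable API (e.g. a
    boto3 DynamoDB client). A GSI reports IndexStatus 'CREATING' while
    the backfill is running and 'ACTIVE' once it is complete.
    """
    table = client.describe_table(TableName=table_name)["Table"]
    for gsi in table.get("GlobalSecondaryIndexes", []):
        if gsi["IndexName"] == index_name:
            return gsi["IndexStatus"] == "ACTIVE"
    raise ValueError(f"No index named {index_name} on {table_name}")
```

You could call this in a loop with a sleep between attempts until it returns `True`.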
Related
I have a streaming app that is putting data actively into DynamoDB.
I want to store the last 100 added items and delete the older ones; it seems that the TTL feature will not work in this case.
Any suggestions?
There is no feature within Amazon DynamoDB that enforces keeping only the last n items.
Enforce the 100-item limit within your application instead, perhaps by storing and maintaining a running counter.
I'd do this via a Lambda function with a trigger on the DynamoDB table in question.
The Lambda would then delete the older entries each time a change is made to the table. You'd need some sort of high-water mark (HWM) for the table items and some way to keep track of it; I'd keep it in a secondary DynamoDB table. Each new item put to the main table would increment that HWM, store it as a field on the item, and update it in the secondary table. This basically implements an auto-increment field, since those don't exist in DynamoDB. The Lambda function could then delete any item whose auto-increment id is HWM - 100 or less.
There may be better ways but this would achieve the goal.
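The deletion rule above is easy to sketch as a pure function (the auto-increment field and the 100-item window are as described; the actual deletes would be one DeleteItem call per returned id):

```python
def ids_to_delete(high_water_mark, existing_ids, keep=100):
    """Given the current high-water mark (the auto-increment id of the
    newest item) and the ids currently in the table, return the ids
    that fall outside the most recent `keep` items."""
    cutoff = high_water_mark - keep
    return sorted(i for i in existing_ids if i <= cutoff)
```

For example, after item 150 is written with `keep=100`, items 1 through 50 are due for deletion.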
In my use case, I need to periodically update a Dynamo table (like once per day). And considering lots of entries need to be inserted, deleted or modified, I plan to drop the old table and create a new one in this case.
How can I keep the table queryable while I recreate it? Which API should I use? It's fine if queries keep hitting the old table during the rebuild, so that customers won't experience any outage.
Is it possible I have something like version number of the table so that I could perform rollback quickly?
I would suggest table names with a common suffix (some people use a date, others use a version number).
Store the usable DynamoDB table name in a configuration store (if you are not already using one, you could use Secrets Manager, SSM Parameter Store, another DynamoDB table, a Redis cluster or a third party solution such as Consul).
Automate the creation of, and insertion of data into, a new DynamoDB table. Then update the config store with the name of the newly created table. Allow enough time for the switchover, then remove the previous DynamoDB table.
You could do the final part with Step Functions, automating the workflow with a Wait of a few hours to ensure that nothing is still using the old table. In fact, you could even add a Lambda function that validates whether any traffic is still hitting the old DynamoDB table.
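A sketch of the config-store piece, assuming SSM Parameter Store as the store (the parameter name is made up, and `ssm_client` stands in for a boto3 SSM client):

```python
ACTIVE_TABLE_PARAM = "/myapp/active-dynamo-table"  # hypothetical name

def active_table(ssm_client, param=ACTIVE_TABLE_PARAM):
    """Look up which DynamoDB table the application should use right now."""
    return ssm_client.get_parameter(Name=param)["Parameter"]["Value"]

def promote_table(ssm_client, new_table, param=ACTIVE_TABLE_PARAM):
    """Point readers at the freshly built table. Keeping the previous
    value around (e.g. in a second parameter) gives you a quick rollback."""
    ssm_client.put_parameter(Name=param, Value=new_table,
                             Type="String", Overwrite=True)
```

Your application reads `active_table()` on startup (or periodically), so flipping the parameter switches all readers to the new table without a deploy.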
I am new to Redshift and struggling to update a column in a Redshift table. I have a huge table and added an empty column to it. I am trying to fill this empty column by joining with another table using the UPDATE command. What worries me is that even though there is 291 GB of space left, the temporary blocks created by this UPDATE statement produce a DISK FULL error. Any solutions or suggestions are appreciated. Thanks in advance!
It is not recommended to perform a large UPDATE command on Amazon Redshift tables.
The reason is that updating even just one column in a row causes the following:
The existing row will be marked as Deleted, but still occupies disk space until the table is VACUUMed
A new row is added to the end of the table storage, which is then out of sort order
If you are updating every row in the table, this means that the storage required for the table is twice as much, possibly more due to less-efficient compression. This is possibly what is consuming your disk space.
The suggested alternative is to SELECT the joined data into a new table. Yes, this will also require more disk space, but the result will be more efficiently organized. You can then delete the original table and rename the new table to the old table name.
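The rebuild boils down to three statements. A small generator for them (the table name and join query are placeholders; note there is a brief window with no table between the DROP and the RENAME, so run them inside a transaction if that matters):

```python
def deep_copy_statements(table, select_sql):
    """Build the SQL for a Redshift 'deep copy': create a fresh, fully
    sorted table from a SELECT, then swap it into place."""
    staging = f"{table}_new"
    return [
        f"CREATE TABLE {staging} AS {select_sql};",
        f"DROP TABLE {table};",
        f"ALTER TABLE {staging} RENAME TO {table};",
    ]
```

The SELECT here would be the same join you were feeding the UPDATE, e.g. `SELECT t.*, s.extra FROM big_table t JOIN src s ON t.id = s.id`.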
Some resources:
Updating and Inserting New Data - Amazon Redshift
How to Improve Amazon Redshift Upload Performance
I need to remove the range key (sort key) from an existing DynamoDB table without impacting the data.
You won't be able to do this on the existing table. You will need to do a data migration into a new table that is configured the way you want.
If the data migration can be done offline, you simply need to Scan all of the data out of the original table and PutItem it into the new table.
Protip: You can have multiple workers Scan a table in parallel if you have a large table. You simply need to assign each worker a Segment. Make sure your solution is robust enough to handle workers going down by starting a new worker and reassigning it the same Segment number.
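One such worker might look like this, assuming a boto3-style client; each worker pages through its assigned Segment until `LastEvaluatedKey` runs out, so a crashed worker can simply be restarted with the same segment number:

```python
def scan_segment(client, table_name, segment, total_segments):
    """Scan one Segment of a parallel DynamoDB Scan, following
    pagination to the end of the segment."""
    items, start_key = [], None
    while True:
        kwargs = {
            "TableName": table_name,
            "Segment": segment,
            "TotalSegments": total_segments,
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = client.scan(**kwargs)
        items.extend(page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if not start_key:  # segment fully scanned
            return items
```

With, say, four workers you would run `scan_segment(client, table, s, 4)` for `s` in 0..3, each feeding PutItem calls against the new table.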
Doing a live data migration isn't too bad either. You will need to create a DynamoDB Stream on the original table and attach a Lambda that essentially replays the changes onto the new table. The basic strategy is when an item is deleted, call DeleteItem on the new table and when an item is inserted or updated, call PutItem with the NEW_IMAGE on the new table. This will capture any live activity. Once that's set up, you need to copy over the data the same way you would in the offline case.
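The replay Lambda is a small dispatch on the stream record's `eventName`. A sketch (the client, table name, and key attributes are placeholders; note the key used for DeleteItem has to match the new table's key schema, which no longer includes the range key):

```python
def replay_record(record, client, new_table, new_key_attrs):
    """Mirror one DynamoDB Stream record onto the new table."""
    if record["eventName"] == "REMOVE":
        # Stream records always carry the old table's Keys; keep only
        # the attributes that form the new table's (hash-only) key.
        keys = record["dynamodb"]["Keys"]
        key = {k: keys[k] for k in new_key_attrs}
        client.delete_item(TableName=new_table, Key=key)
    else:  # INSERT or MODIFY
        client.put_item(TableName=new_table,
                        Item=record["dynamodb"]["NewImage"])
```

This assumes the stream is configured with the NEW_IMAGE view type, as described above, so `NewImage` is present on inserts and updates.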
No matter what you do, you will be "impacting" the data. Removing the range key will fundamentally change the way the data is organized. Keep in mind it will also mean that you have a different uniqueness constraint on your data.
I think of Redshift as a relational database, and I have one scenario I'd like to understand: given an existing Redshift database, how easy is it, and how long would it take, to add a column? Does it take a lock on the table when adding a new field?
Amazon Redshift is a columnar database. This means each column is stored separately on disk, which is why it operates so quickly.
Adding a column is therefore a relatively simple operation.
If you are worried about it, you could use the CREATE TABLE AS command to create a new table based on the current one, then add a column and see how long it takes.
As in any RDBMS, the column will be added in entry order, i.e. as the last column (not in the middle or wherever you like).
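The dry run suggested above amounts to three statements; a small generator for them (all names are placeholders):

```python
def add_column_dry_run(table, column, col_type):
    """SQL for timing an ADD COLUMN against a throwaway copy of the
    table, per the CREATE TABLE AS suggestion above."""
    copy = f"{table}_ddl_test"
    return [
        f"CREATE TABLE {copy} AS SELECT * FROM {table};",
        f"ALTER TABLE {copy} ADD COLUMN {column} {col_type};",  # time this one
        f"DROP TABLE {copy};",
    ]
```

Run the middle statement with timing enabled in your SQL client to see how long the ALTER takes on a table of your size, without touching the real table.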