I am new to Redshift and struggling to update a column in a Redshift table. I have a huge table and added an empty column to it. I am trying to fill this empty column by joining the table with another table using an UPDATE command. My concern is that even though there is 291 GB of space left, the temporary blocks created by this UPDATE statement produce a DISK FULL error. Any solutions or suggestions are appreciated. Thanks in advance!
It is not recommended to perform a large UPDATE command in Amazon Redshift tables.
The reason is that updating even just one column in a row causes the following:
The existing row will be marked as Deleted, but still occupies disk space until the table is VACUUMed
A new row is added to the end of the table storage, which is then out of sort order
If you are updating every row in the table, the storage required for the table roughly doubles, possibly more due to less-efficient compression. This is probably what is consuming your disk space.
The suggested alternate method is to select the joined data into a new table. Yes, this will also require more disk space, but it will be more efficiently organized. You can then delete the original table and rename the new table to the old table name.
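For example, a deep copy along these lines (my_table, other_table, id and new_col are placeholder names for your own tables and columns):

CREATE TABLE my_table_new AS
SELECT t.*, o.new_col
FROM my_table t
LEFT JOIN other_table o ON t.id = o.id;

DROP TABLE my_table;
ALTER TABLE my_table_new RENAME TO my_table;

Note that CREATE TABLE AS does not carry over the original table's distribution key or sort key, so specify them in the CREATE TABLE AS statement if you need them.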
Some resources:
Updating and Inserting New Data - Amazon Redshift
How to Improve Amazon Redshift Upload Performance
Using Impala, I have noticed a deterioration in performance when I repeatedly perform truncate and insert operations on internal tables.
The question is: can refreshing the tables avoid the problem?
So far I have only used refresh on external tables, every time I copied files to HDFS to be loaded into those tables.
Many thanks in advance!
Moreno
You can use compute stats instead of refresh.
refresh is normally used when you add a data file or change something in the table metadata, such as adding a column or partition or changing a column definition. It quickly reloads the metadata. There is a related command, invalidate metadata, but it is more expensive than refresh and forces Impala to reload the table's metadata the next time the table is queried.
compute stats recomputes statistics for the table or its columns, and is worth running when roughly 30% of the data has changed. It is an expensive operation, but effective when you frequently truncate and reload a table.
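For reference, the three commands discussed above look like this (my_table is a placeholder name):

REFRESH my_table;              -- cheap: reloads file and block metadata for this table
INVALIDATE METADATA my_table;  -- heavier: discards cached metadata, reloaded on the table's next query
COMPUTE STATS my_table;        -- recomputes table and column statistics after large data changes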
I want a table to store the history of an object for a week and then replace it with the history of the next week. What would be the best way to achieve this in AWS?
The data is a weekly dump stored in JSON format in S3. The pipeline runs the script once a week and dumps the data into S3 for analysis. For the next run of the script I do not need the previous week's (week-1) data, so it needs to be replaced with the new week's (week-2) data. The schema of the table remains constant, but the data changes every week.
I would recommend using data partitioning to solve your issue without deleting the underlying S3 files from previous weeks (which is not possible via an Athena query anyway).
The idea is to use a partition key based on the date, and then use this partition key in the WHERE clause of your Athena query, which will cause Athena to ignore the previous files (those that are not under the latest partition).
For example, if you use the file dump date as partition key (let's say we chose to name it dump_key), your files will have to be stored in subfolders like
s3://your-bucket/subfolder/dump_key=2021-01-01-13-00/files.csv
s3://your-bucket/subfolder/dump_key=2021-01-07-13-00/files.csv
Then, during your data processing, you'll first need to create your table and specify a partition key with the PARTITIONED BY option.
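A minimal sketch of that DDL, assuming CSV files like the ones above; the columns id and payload are placeholders for your real schema:

CREATE EXTERNAL TABLE my_table (
  id string,
  payload string
)
PARTITIONED BY (dump_key string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket/subfolder/';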
Then, you'll have to make sure you add a new partition using the ALTER TABLE ... ADD PARTITION command every time it's necessary for your use case:
ALTER TABLE my_table ADD PARTITION (dump_key='2021-01-07-13-00') LOCATION 's3://your-bucket/subfolder/dump_key=2021-01-07-13-00/';
Then you'll be able to query your table by filtering previous data using the right WHERE clause:
SELECT * FROM my_table WHERE dump_key >= '2021-01-05-00-00';
This will cause Athena to ignore files in previous partitions when querying your table.
Documentation here:
https://docs.aws.amazon.com/athena/latest/ug/partitions.html
Background: In Redshift, I want to add a distribution key to an existing table that has an identity column, just like in this question.
I am confused by the answer to that question -- I thought that to have my table data stored according to a certain dist key, I had to INSERT the data, and couldn't just COPY or APPEND from an undistributed table. Is this different when COPYing from S3?
There are some interesting methods in the question you linked!
You cannot add a Distribution Key to an existing table. You would need to create a new table, then copy the data across. This can be done via INSERT INTO new-table SELECT * FROM old-table.
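A minimal sketch of such a deep copy, using placeholder table and column names (CREATE TABLE AS also lets you declare the new distribution key directly):

CREATE TABLE new_table
DISTKEY (customer_id)
AS
SELECT * FROM old_table;

ALTER TABLE old_table RENAME TO old_table_backup;
ALTER TABLE new_table RENAME TO old_table;

Keep in mind that CREATE TABLE AS will not preserve an IDENTITY column definition; if you need that, create the new table with explicit DDL and then use INSERT INTO new_table SELECT ... instead.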
When data is loaded into an Amazon Redshift table, it ALWAYS honors the Distribution Key because the DISTKEY determines which slice stores the data. Whether you use COPY (which is preferred) or an INSERT, data will always be distributed according to the DISTKEY.
The SORTKEY will also be used when data is loaded via COPY, but existing data will not be re-sorted. For example, if you have a column of data already loaded in alphabetical order, then newly-loaded rows will be added to the end of the existing data. This new data will be sorted, but the column as a whole will not be sorted. Use a VACUUM command to re-sort the whole table.
Whenever possible, you should use the COPY command to load data into a Redshift table. This allows Redshift to load the data in parallel across all nodes. Try to minimize the amount of data loaded via INSERT; if you must use it, insert multiple rows per statement rather than single rows, which is very inefficient in Redshift compared to bulk loading.
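For example, a typical bulk load followed by a re-sort might look like this (the table name, bucket, prefix and IAM role ARN are placeholders):

COPY my_table
FROM 's3://my-bucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;

VACUUM my_table;
ANALYZE my_table;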
I need to remove the range-key in an existing Dynamo-DB table, without impacting the data.
You won't be able to do this on the existing table. You will need to do a data migration into a new table that is configured the way you want.
If the data migration can be done offline, you simply need to Scan all of the data out of the original table and PutItem it into the new table.
Protip: You can have multiple workers Scan a table in parallel if you have a large table. You simply need to assign each worker a Segment. Make sure your solution is robust enough to handle workers going down by starting a new worker and reassigning it the same Segment number.
Doing a live data migration isn't too bad either. You will need to create a DynamoDB Stream on the original table and attach a Lambda that essentially replays the changes onto the new table. The basic strategy is: when an item is deleted, call DeleteItem on the new table, and when an item is inserted or updated, call PutItem with the NEW_IMAGE on the new table. This will capture any live activity. Once that's set up, you need to copy over the existing data the same way you would in the offline case.
No matter what you do, you will be "impacting" the data. Removing the range key will fundamentally change the way the data is organized. Keep in mind it will also mean that you have a different uniqueness constraint on your data.
I think of Redshift as a relational database. I have one scenario I wanted to ask about:
If there is an existing Redshift database, how easy would it be, and how much time would it take, to add a column? Does it take a lock on the table when adding a new field?
Amazon Redshift is a columnar database. This means each column is stored separately on disk, which is why it operates so quickly.
Adding a column is therefore a relatively simple operation.
If you are worried about it, you could use the CREATE TABLE AS command to create a new table based on the current one, then add a column and see how long it takes.
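A quick way to try that on a copy (my_table, new_col and the column type are placeholders for your own):

CREATE TABLE my_table_copy AS SELECT * FROM my_table;
ALTER TABLE my_table_copy ADD COLUMN new_col VARCHAR(50);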
As in any RDBMS, the column will be added at the end of the column list (not in the middle or wherever you like).