We have a large table, that we need to do a DEEP COPY on it.
Since we don't have enough empty disk space to make it in one statements I've tried to make it in batches.
But the batches seem to run very very slowly.
I'm running something like this:
INSERT INTO new_table
SELECT * FROM old_table
WHERE creation_date between '2018-01-01' AND '2018-02-01'
Even though the query returns small amount of lines ~ 1K
SELECT * FROM old_table
WHERE creation_date between '2018-01-01' AND '2018-02-01'
The INSERT query take around 50 minutes to complete.
The old_table has ~286M rows and ~400 columns
creation_date is one of the SORTKEYs
Explain plan looks like:
XN Seq Scan on old_table (cost=0.00..4543811.52 rows=178152 width=136883)
Filter: ((creation_date <= '2018-02-01'::date) AND (creation_date >= '2018 01-01'::date))
My question is:
What may be the reason for INSERT query to take this long?
In my opinion, following are two possibilities--- though if you could add more details to your question will be great.
As #John stated in comments, your SORTKEY matters a lot in RedShift, is creation_date sortkey?
Did you do lot of updates to your old_table, if so, you must to vacuum first do VACUUM DELETE Only old_table then, do select queries.
Other option, you might be doing S3 way, but not sure do you want to do it.
Related
Workflow
In a data import workflow, we are creating a staging table using CREATE TABLE LIKE statement.
CREATE TABLE abc_staging (LIKE abc INCLUDING DEFAULTS);
Then, we run COPY to import CSV data from S3 into the staging table.
The data in CSV is incomplete. Namely, there are fields partition_0, partition_1, partition_2 which are missing in the CSV file; we fill them in like this:
UPDATE
abc_staging
SET
partition_0 = 'BUZINGA',
partition_1 = '2018',
partition_2 = '07';
Problem
This query seems expensive (takes ≈20 minutes oftentimes), and I would like to avoid it. That could have been possible if I could configure DEFAULT values on these columns when creating the abc_staging table. I did not find any method as to how that can be done; nor any explicit indication that is impossible. So perhaps this is still possible but I am missing how to do that?
Alternative solutions I considered
Drop these columns and add them again
That would be easy to do, but ALTER TABLE ADD COLUMN only adds columns to the end of the column list. In abc table, they are not at the end of the column list, which means the schemas of abc and abc_staging will mismatch. That breaks ALTER TABLE APPEND operation that I use to move data from staging table to the main table.
Note. Reordering columns in abc table to alleviate this difficulty will require recreating the huge abc table which I'd like to avoid.
Generate the staging table creation script programmatically with proper columns and get rid of CREATE TABLE LIKE
I will have to do that if I do not find any better solution.
Fill in the partition_* fields in the original CSV file
That is possible but will break backwards compatibility (I already have perhaps hundreds thousands of files in there). Harder but manageable.
As you are finding you are not creating a table exactly LIKE the original and Redshift doesn't let you ALTER a column's default value. Your proposed path is likely the best (define the staging table explicitly).
Since I don't know your exact situation other paths might be better so me explore a bit. First off when you UPDATE the staging table you are in fact reading every row in the table, invalidating that row, and writing a new row (with new information) at the end of the table. This leads to a lot of invalidated rows. Now when you do ALTER TABLE APPEND all these invalidated rows are being added to your main table. Unless you vacuum the staging table before hand. So you may not be getting the value you want out of ALTER TABLE APPEND.
You may be better off INSERTing the data onto your main table with an ORDER BY clause. This is slower than the ALTER TABLE APPEND statement but you won't have to do the UPDATE so the overall process could be faster. You could come out further ahead because of reduced need to VACUUM. Your situation will determine if this is better or not. Just another option for your list.
I am curious about your UPDATE speed. This just needs to read and then write every row in the staging table. Unless the staging table is very large it doesn't seem like this should take 20 min. Other activity could be creating this slowdown. Just curious.
Another option would be to change your main table to have these 3 columns last (yes this would be some work). This way you could add the columns to the staging table and things would line up for ALTER TABLE APPEND. Just another possibility.
The easiest solution turned to be adding the necessary partition_* fields to the source CSV files.
After employing that change and removing the UPDATE from the importer pipeline, the performance has greatly improved. Imports now take ≈10 minutes each in total (that encompasses COPY, DELETE duplicates and ALTER TABLE APPEND).
Disk space is no longer climbing up to 100%.
Thanks everyone for help!
I feel like I'm thinking my self in circles here. Maybe you all can help :)
Say I have this simple table design in DynamoDB:
Id | Employee | Created | SomeExtraMetadataColumns... | LastUpdated
Say my only use case is to find all the rows in this table where LastUpdated < (now - 2 hours).
Assume that 99% of the data in the table will not meet this criteria. Assume there is a some job running every 15 mins that is updating the LastUpdated column.
Assume there are say 100,000 rows and grows maybe 1000 rows a day. (no need to large write capacity).
Assume a single entity will be performing this 'read' use case (no need for large read capacity).
Options I can think of:
Do a scan.
Pro: can leverage parallels scans to scale in the future.
Con: wastes a lot of money reading rows that do not match the filter criteria.
Add a new column called 'Constant' that would always have the value of 'Foo' and make a GSI with the Partition Key of 'Constant' and a Sort Key of LastUpdated. Then execute a query on this index for Constant = 'Foo' and LastUpdated < (now - 2hours).
Pro: Only queries the rows matching the filter. No wasted money.
Con: In theory this would be plagued by the 'hot partition' problem if writes scale up. But I am unsure how much of a problem it will be as aws outlined this problem to be a thing of the past.
Honestly, I leaning toward the latter option. But I'm curious what the communities thoughts are on this. Perhaps I am missing something.
Based on the assumption that the last_updated field is the only field you need to query against, I would do something like this:
PK: EMPLOYEE::{emp_id}
SK: LastUpdated
Attributes: Employee, ..., Created
PK: EMPLOYEE::UPDATE
SK: LastUpdated::{emp_id}
Attributes: Employee, ..., Created
By denormalising your data here you have the ability to create an update record with an update row which can be queried with PK = EMPLOYEE::UPDATE and SK between 'datetime' and 'datetime'. This is assuming you store the datetime as something like 2020-10-01T00:00:00Z.
You can either insert this additional row here or you could consider utilising DynamoDB streams to stream update events to Lambda and then add the row from there. You can set a TTL on the 'update' row which will expire somewhere between 0 and 48 hours from the TTL you set keeping the table clean. It doesn't need to be instantly removed because you're querying based on the PK and SK anyway.
A scan is an absolute no-no on a table that size so I would definitely recommend against that. If it increases by 1,000 per day like you say then before long your scan would be unmanageable and would not scale. Even at 100,000 rows a scan is very bad.
You could also utilise DynamoDB Streams to stream your data out to data stores which are suitable for analytics which is what I assume you're trying to achieve here. For example you could stream the data to redshift, RDS etc etc. Those require a few extra steps and could benefit from kinesis depending on the scale of updates but it's something else to consider.
Ultimately there are quite a lot of options here. I'd start by investigating the denormalisation and then investigate other options. If you're trying to do analytics in DynamoDB I would advise against it.
PS: I nearly always call my PK and SK attributes PK and SK and have them as strings so I can easily add different types of data or denormalisations to a table easily.
Definitely stay away from scan...
I'd look at a GSI with
PK: YYYY-MM-DD-HH
SK: MM-SS.mmmmmm
Now to get the records updated in the last two hours, you need only make three queries.
I have identical tables in Postgres9 and Postgres10, except that the Postgres10 table is partitioned by state. There are about 80 million records in both tables
When I make a query like this, it is about 10 times faster on the partitioned table than on the Postgres9 unpartitioned table. hooray!
# This is fast with a partition, and slow without
parcels = Parcel.objects.filter(state='15', boundary__intersects=polygon)
However, when I try to make an update through Django, it is about 1000 times slower on the partitioned table (it takes like 2 minutes) than on the Postgres9 version:
for parcel in parcels:
do something
# This is slow with a partition, and fast without
parcel.save()
But when I make an update directly through psql, it is super fast on the partitioned table in Postgres10, and quite slow on the unpartitioned table in Postgres9:
# This is fast with a partition, and slow without
UPDATE parcel SET field=42 WHERE state='15' AND parcel_id='someid';
Why would my call to save in Django be so much slower than updating directly through psql? Is there an equivalent to QuerySet.explain() for save operations?
The reason that Postgres10 was slower than Postgres9 was because partitioned tables do not allow a primary key on the parent. So my call to parcel.save() was trying to run a query like UPDATE parcel SET field=42 WHERE "id" = 'someid';, but querying by id was extremely slow on the partitioned table.
I fixed this by adding a primary key to each of the child tables. E.g. ALTER TABLE parcel_01 ADD PRIMARY KEY (id);. When I added it for every child, it had a similar effect, and my update time went down by 99.9%. Overall, using a partitioned table with Postgres10 reduced the total runtime of the tool that I am working on by 75% (from 40 minutes to about 10)
I want to thank #2ps for suggesting that I look at the generated SQL, and for suggesting a way to force the query that I wanted, without using a primary key.
Currently I'm loading data from Google Storage to stage_table_orders using WRITE_APPEND. Since this load both new and existed order there could be a case where same order has more than one version the field etl_timestamp tells which row is the most updated one.
then I WRITE_TRUNCATE my production_table_orders with query like:
select ...
from (
SELECT * , ROW_NUMBER() OVER
(PARTITION BY date_purchased, orderid order by etl_timestamp DESC) as rn
FROM `warehouse.stage_table_orders` )
where rn=1
Then the production_table_orders always contains the most updated version of each order.
This process is suppose to run every 3 minutes.
I'm wondering if this is the best practice.
I have around 20M rows. It seems not smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Suggestion?
We are doing the same. To help improve performance though, try to partition the table by date_purchased and cluster by orderid.
Use a CTAS statement (to the table itself) as you cannot add partition after fact.
EDIT: use 2 tables and MERGE
Depending on your particular use case i.e. the number of fields that could be updated between old and new, you could use 2 tables, e.g. stage_table_orders for the imported records and final_table_orders as destination table and do
a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
F.date_purchased = S.date_purchased
WHEN MATCHED THEN
UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
INSERT (field1, field2, ...) VALUES(S.field1, S.field2, ...)
Pro: efficient if few rows are "upserted", not millions (although not tested) + pruning partitions should work.
Con: you have to explicitly list the fields in the update and insert clauses. A one-time effort if schema is pretty much fixed.
There are may ways to de-duplicate and there is no one-size-fits-all. Search in SO for similar requests using ARRAY_AGG, or EXISTS with DELETE or UNION ALL,... Try them out and see which performs better for YOUR dataset.
Question
What is the Best way to update a column of a table of tens of millions of rows?
1)
I saw creating a new table and rename the old one when finish
2)
I saw update in batches using a temp table
3)
I saw single transaction (don't like this one though)
4)
never listen to cursor solution for a problema like this and I think it's not worthy to try
5) I read about loading data from file (Using BCP), but have not read if the performance is better or not. was not clear if it is just to copy or if it would allow join a big table with something and then bull copy.
really would like have some advice here.
Priority is performance
At the momment I'm testing solution 2) and Exploring solution 5)
Additional Information (UPDATE)
thank you for the critical thinking in here.
The operation be done in downtime.
UPDATE Will not cause row forwarding
All the tables go indexes, average 5 indexes, although few tables got
like 13 indexes.
the probability of target column is present in one of the table
indexes something like 50%.
Some tables can be rebuilt and replace, others don't because they
make part of a software solution, and we might lose support to those.
from those tables some got triggers.
I'll need to do this for more than 600 tables where ~150 range from
0.8 Million to 35 Million rows
The update is always in the same column in the various fields
References
BCP for data transfer
Actually it depends:
on the number of indexes the table contains
the size of the row before and after the UPDATE operation
type of UPDATE - would it be in place? does it need to modify the row length
does the operation cause row forwarding?
how big is the table?
how big would the transaction log of the UPDATE command be?
does the table contain triggers?
can the operation be done in downtime?
will the table be modified during the operation?
are minimal logging operations allowed?
would the whole UPDATE transaction fit in the transaction log?
can the table be rebuilt & replaced with a new one?
what was the timing of the operation on the test environment?
what about free space in the database - is there enough space for a copy of the table?
what kind of UPDATE operation is to be performed? does additional SELECT commands have to be done to calculate the new value of every row? or is it a static change?
Depending on the answers and the results of the operation in the test environment we could consider the fastest operations to be:
minimal logging copy of the table
an in place UPDATE operation preferably in batches