I have an apparently simple task to perform: I have to convert several table columns from a string to a new entity (integer FOREIGN KEY) value.
My DB has 10 tables with a column called "app_version", which at the moment are of type VARCHAR. Since I'm about to do a small project refactor, I'd like to convert those VARCHAR columns to a new column that contains an ID representing the newly mapped value, so:
V1 -> ID: 1
V2 -> ID: 2
and so on
I've prepared a Doctrine migration (I'm using Symfony 3.4) which performs the conversion by DROPPING the old column and adding the new ID column for the AppVersion table.
Of course, I need to preserve my existing data.
I know about preUp and postUp, but I can't figure out how to do it without hitting DB performance too much. I could collect the data via SELECT in preUp, store it in some PHP variables and use them later in postUp to write the new values to the DB, but since I have 10 tables with many rows this becomes a disaster really fast.
Do you have any suggestions to make this smooth and easy?
Please don't ask why I have to do this refactor now instead of setting up the DB correctly in the first place. :D
Keywords for ideas: transaction? Bulk query? Avoiding PHP variable storage? Writing an SQL file? Anything could work.
I feel dumb, but the solution was much simpler: I created a custom migration with all the "ALTER TABLE [table_name] DROP app_version" statements, to be executed AFTER one that simply does:
UPDATE [table_name] SET app_version_id = 1 WHERE app_version = 'V1'
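For reference, a minimal sketch of the full statement order for one table; [table_name] stays a placeholder as above, while the app_version lookup table and the constraint name are my assumptions, with MySQL-flavoured syntax:
-- 1. Add the new nullable FK column.
ALTER TABLE [table_name] ADD app_version_id INT DEFAULT NULL;
-- 2. Backfill it from the old VARCHAR column, one UPDATE per mapped value.
UPDATE [table_name] SET app_version_id = 1 WHERE app_version = 'V1';
UPDATE [table_name] SET app_version_id = 2 WHERE app_version = 'V2';
-- 3. Add the FK constraint, then drop the old column.
ALTER TABLE [table_name] ADD CONSTRAINT fk_app_version FOREIGN KEY (app_version_id) REFERENCES app_version (id);
ALTER TABLE [table_name] DROP app_version;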
Workflow
In a data import workflow, we create a staging table using a CREATE TABLE LIKE statement.
CREATE TABLE abc_staging (LIKE abc INCLUDING DEFAULTS);
Then, we run COPY to import CSV data from S3 into the staging table.
The data in the CSV is incomplete. Namely, the fields partition_0, partition_1, and partition_2 are missing from the CSV file; we fill them in like this:
UPDATE abc_staging
SET partition_0 = 'BUZINGA',
    partition_1 = '2018',
    partition_2 = '07';
Problem
This query seems expensive (it often takes ≈20 minutes), and I would like to avoid it. That would be possible if I could configure DEFAULT values on these columns when creating the abc_staging table. I did not find any method for doing that, nor any explicit indication that it is impossible. So perhaps it is still possible and I am just missing how to do it?
Alternative solutions I considered
Drop these columns and add them again
That would be easy to do, but ALTER TABLE ADD COLUMN only adds columns to the end of the column list. In the abc table they are not at the end of the column list, which means the schemas of abc and abc_staging would mismatch. That breaks the ALTER TABLE APPEND operation that I use to move data from the staging table to the main table.
Note: reordering the columns in the abc table to alleviate this difficulty would require recreating the huge abc table, which I'd like to avoid.
Generate the staging table creation script programmatically with proper columns and get rid of CREATE TABLE LIKE
I will have to do that if I do not find any better solution.
Fill in the partition_* fields in the original CSV file
That is possible, but it would break backwards compatibility (I already have perhaps hundreds of thousands of files in there). Harder, but manageable.
As you are finding, you are not creating a table exactly LIKE the original, and Redshift doesn't let you ALTER a column's default value. Your proposed path is likely the best: define the staging table explicitly.
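For illustration, a minimal sketch of such an explicitly defined staging table; the non-partition columns (col1, col2), their types, and the S3 location and IAM role in the COPY are placeholders, not taken from the question:
CREATE TABLE abc_staging (
    col1        VARCHAR(256),              -- stand-ins for abc's real columns
    col2        INTEGER,
    partition_0 VARCHAR(16) DEFAULT 'BUZINGA',
    partition_1 VARCHAR(4)  DEFAULT '2018',
    partition_2 VARCHAR(2)  DEFAULT '07'
);
-- Columns omitted from the COPY column list are filled from their DEFAULT expressions.
COPY abc_staging (col1, col2)
FROM 's3://my-bucket/my-prefix/'                          -- placeholder S3 location
IAM_ROLE 'arn:aws:iam::123456789012:role/my-copy-role'    -- placeholder credentials
CSV;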
Since I don't know your exact situation, other paths might be better, so let me explore a bit. First off, when you UPDATE the staging table you are in fact reading every row, invalidating it, and writing a new row (with the new information) at the end of the table. This leads to a lot of invalidated rows. Then, when you do ALTER TABLE APPEND, all of these invalidated rows are added to your main table, unless you vacuum the staging table beforehand. So you may not be getting the value you want out of ALTER TABLE APPEND.
You may be better off INSERTing the data into your main table with an ORDER BY clause. This is slower than the ALTER TABLE APPEND statement, but you won't have to do the UPDATE, so the overall process could be faster. You could come out further ahead because of the reduced need to VACUUM. Your situation will determine if this is better or not. Just another option for your list.
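A rough sketch of that INSERT path; the column order of abc and the sort key column sort_col are my assumptions, and supplying the partition values as constants in the SELECT is what removes the need for the UPDATE:
INSERT INTO abc (col1, col2, partition_0, partition_1, partition_2)
SELECT col1, col2, 'BUZINGA', '2018', '07'
FROM abc_staging
ORDER BY sort_col;   -- order by abc's sort key so the appended rows arrive mostly sorted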
I am curious about your UPDATE speed. This just needs to read and then write every row in the staging table. Unless the staging table is very large it doesn't seem like this should take 20 min. Other activity could be creating this slowdown. Just curious.
Another option would be to change your main table to have these 3 columns last (yes this would be some work). This way you could add the columns to the staging table and things would line up for ALTER TABLE APPEND. Just another possibility.
The easiest solution turned out to be adding the necessary partition_* fields to the source CSV files.
After making that change and removing the UPDATE from the importer pipeline, performance has greatly improved. Imports now take ≈10 minutes each in total (which encompasses COPY, DELETE of duplicates, and ALTER TABLE APPEND).
Disk space is no longer climbing up to 100%.
Thanks, everyone, for the help!
Currently I'm loading data from Google Cloud Storage into stage_table_orders using WRITE_APPEND. Since this loads both new and existing orders, there can be cases where the same order has more than one version; the field etl_timestamp tells which row is the most up to date.
Then I WRITE_TRUNCATE my production_table_orders with a query like:
select ...
from (
  SELECT *, ROW_NUMBER() OVER
    (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
where rn = 1
Then production_table_orders always contains the most up-to-date version of each order.
This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice.
I have around 20M rows. It doesn't seem smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Any suggestions?
We are doing the same. To help improve performance, though, try partitioning the table by date_purchased and clustering by orderid.
Use a CTAS statement (writing to the table itself), as you cannot add partitioning after the fact.
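A minimal sketch of such a CTAS, assuming date_purchased is a DATE column (partition by DATE(date_purchased) if it is a TIMESTAMP); the _v2 name is just a placeholder for the rebuilt table, which you would then swap in for the original:
-- Rebuild the staging table with partitioning and clustering.
CREATE TABLE `warehouse.stage_table_orders_v2`
PARTITION BY date_purchased
CLUSTER BY orderid
AS
SELECT * FROM `warehouse.stage_table_orders`;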
EDIT: use 2 tables and MERGE
Depending on your particular use case, i.e. the number of fields that could change between the old and new versions, you could use 2 tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid
   AND F.date_purchased = S.date_purchased
WHEN MATCHED THEN
  UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
  INSERT (field1, field2, ...) VALUES (S.field1, S.field2, ...)
Pro: efficient if only a few rows are "upserted", not millions (although not tested), and partition pruning should work.
Con: you have to explicitly list the fields in the UPDATE and INSERT clauses. A one-time effort if the schema is pretty much fixed.
There are many ways to de-duplicate and there is no one-size-fits-all. Search SO for similar requests using ARRAY_AGG, or EXISTS with DELETE, or UNION ALL, ... Try them out and see which performs better for YOUR dataset.
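For example, the ARRAY_AGG variant of the de-duplication could look roughly like this (a sketch only, written against the staging table from the question):
SELECT latest.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY etl_timestamp DESC LIMIT 1)[OFFSET(0)] AS latest  -- keep the newest row per order
  FROM `warehouse.stage_table_orders` AS t
  GROUP BY date_purchased, orderid
);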
Suppose I have a DynamoDB table named Person, which has 2 fields: name (string) and age (int). Let's assume it has a TB worth of data and experiences a small amount of read throughput but a ton of write throughput. Now I want to add a new field called Phone (string). What is the best way to go about moving the data from one table to another?
Note: DynamoDB doesn't let you rename tables, and fields cannot be null.
Here are the options I think I have:
Dump the table to .csv and run a script (overnight, probably, since it's a TB worth of data) to add a default phone number to the file. (Not ideal; it would also lose all new data written to the old table, unless I take the service offline to perform the migration, which is not an option in this case.)
Use the Scan API call. (Scan will read all values, and inserting all the old data into the new table will then consume significant write throughput.)
How can I perform a DynamoDB migration on a large table without significant data loss?
You don't need to do anything. This is NoSQL, not SQL (i.e., there is no idiomatic way to do this, as you normally don't need migrations for NoSQL).
Just start writing entries with the additional key.
Records you read back that were written before will not have this key. What you normally do is use a default value when it is missing.
If you want to backfill, just go through and read each item, then put it back with the additional field. You can do this in one run via a scan, or again do it lazily when accessing the data.
I am coding in C and using SQLite3 as the database. I want to ask how we can check whether the number of columns in a table has changed. The situation is that I am going to run the application with a new executable, according to which new columns will be added to the table. So when the DB is opened again, the application should check whether the table schema is the same, and if it is not, it should create the table according to the new schema. I am developing the application for an embedded environment (specifically for a device).
When I change the number of columns of a table in the DB and run the new executable on the device, the new tables are not created because the old tables are already present; but when I delete the old DB and create fresh tables, the changes appear. So how do I handle this situation?
Platform: Linux, gcc compiler
Thanks in advance.
Please guide me like this (assuming the old DB is already present):
First, check the schema of the old DB, and if some of the tables have changed (like new columns added or deleted), then create the new DB accordingly.
Use Versioning and Explicit Column References
You can make use of database versioning to help with this sort of problem.
Create a separate table with only one column and one record to store the database version.
Whenever you upgrade your database, set the version number in the separate table.
Design your insert queries to specify the columns.
Define default values for new columns so that old programs insert default values.
Examples
UPDATE databaseVersion SET version=2;
Version 1 Query
INSERT INTO MyTable (id, var1, var2) VALUES (2, '5', '6');
Version 2 Query
INSERT INTO MyTable (id, var1, var2, var3) VALUES (3, '5', '6', '7');
This way, your queries should still be compatible with the new DB when using the old program.
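A minimal sketch of the corresponding start-up logic in SQLite SQL, assuming the databaseVersion table from the bullets above and that version 2 is the schema which adds var3 to MyTable (the default value '0' is illustrative):
-- Create the version table with a single record on first run.
CREATE TABLE IF NOT EXISTS databaseVersion (version INTEGER NOT NULL);
INSERT INTO databaseVersion (version)
SELECT 1 WHERE NOT EXISTS (SELECT 1 FROM databaseVersion);
-- If SELECT version FROM databaseVersion returns 1, upgrade to version 2:
ALTER TABLE MyTable ADD COLUMN var3 TEXT DEFAULT '0';  -- assumed default for rows from old programs
UPDATE databaseVersion SET version = 2;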
I was wondering what the "best" way is to verify the structure of my database with SQLite in Qt / C++. I'm using SQLite, so there is a file which contains my database, and I want to make sure that, when launching the program, the database is structured the way it should be, i.e., it has X tables, each with its own Y columns, appropriately named, etc. Could someone point me in the right direction? Thanks so much!
You can get a list of all the tables in the database with this query:
select tbl_name from sqlite_master where type = 'table';
And then, for each table returned, run this query to get the column information:
pragma table_info(my_table);
For the pragma, each row of the result set will contain: a column index, the column name, the column's declared type, whether the column may be NULL, the column's default value, and whether the column is part of the primary key.
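As a hypothetical example (the person table below is made up, not from the question), the pragma output looks like this:
-- Given a table created as:
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- PRAGMA table_info(person) returns one row per column,
-- as (cid, name, type, notnull, dflt_value, pk):
--   0 | id   | INTEGER | 0 | NULL | 1
--   1 | name | TEXT    | 1 | NULL | 0
PRAGMA table_info(person);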
(I'm assuming here that you know how to run SQL queries against your database in the SQLite C interface.)
If you have Qt and thus QtSql at hand, you can also use the QSqlDatabase::tables() method (API doc) to get the tables and QSqlDatabase::record(tablename) to get the field names. It can also give you the primary key(s), but for further details you will have to follow pkh's advice and use the table_info pragma.