I am trying to do an upsert with a stage table when copying data from S3.
I want to do this because I want to be able to backfill the data (or launch the process more than once), and right now it creates duplicate rows.
I see a bunch of responses that show how to do a DELETE from {table} USING {stage_table} WHERE {table.primarykey} = {stage_table.primarykey}
The thing is that I want to do that with a generic function, which means: how can I access the primary key 'automatically'? "primarykey" or "primaryKey", as I read in many places, does not work; I am guessing it is just pseudo-code.
Any help would be appreciated. Thanks!
EDIT
The idea is to execute the upsert like this:
[transaction]
connection.execute("CREATE TEMP TABLE {stage_table} (like {table});".format(stage_table=stage_table, table=text(self.table)))
connection.execute(self.clean(self.compile_query(copy)))
connection.execute("DELETE FROM {table} USING {stage_table} WHERE {table}.primarykey = {stage_table}.primarykey;".format(stage_table=stage_table, table=text(self.table)))
connection.execute("INSERT INTO {table} SELECT * FROM {stage_table};".format(stage_table=stage_table, table=text(self.table)))
connection.execute("DROP TABLE {stage_table};".format(stage_table=stage_table))
[end transaction]
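To make that DELETE generic, one option is to look the key columns up from the catalog at run time and splice them into the statement. A sketch only (not from the original post), assuming the target table was created with a declared PRIMARY KEY, which Redshift keeps as metadata even though it does not enforce it; 'my_table' is a placeholder:
SELECT kcu.column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON kcu.constraint_name = tc.constraint_name
 AND kcu.table_schema = tc.table_schema
WHERE tc.constraint_type = 'PRIMARY KEY'
  AND tc.table_name = 'my_table';
The returned column names can then be formatted into the WHERE clause of the DELETE ... USING statement, joining on every key column if the key is composite.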
Related
Workflow
In a data import workflow, we are creating a staging table using a CREATE TABLE LIKE statement.
CREATE TABLE abc_staging (LIKE abc INCLUDING DEFAULTS);
Then, we run COPY to import CSV data from S3 into the staging table.
The data in the CSV is incomplete. Namely, there are fields partition_0, partition_1, partition_2 which are missing from the CSV file; we fill them in like this:
UPDATE
abc_staging
SET
partition_0 = 'BUZINGA',
partition_1 = '2018',
partition_2 = '07';
Problem
This query is expensive (it often takes ≈20 minutes), and I would like to avoid it. That would be possible if I could configure DEFAULT values for these columns when creating the abc_staging table. I did not find any method for doing so, nor any explicit indication that it is impossible. So perhaps this is still possible and I am just missing how to do it?
Alternative solutions I considered
Drop these columns and add them again
That would be easy to do, but ALTER TABLE ADD COLUMN only adds columns to the end of the column list. In the abc table they are not at the end of the column list, which means the schemas of abc and abc_staging would mismatch. That breaks the ALTER TABLE APPEND operation that I use to move data from the staging table to the main table.
Note: reordering the columns in the abc table to alleviate this difficulty would require recreating the huge abc table, which I'd like to avoid.
Generate the staging table creation script programmatically with proper columns and get rid of CREATE TABLE LIKE
I will have to do that if I do not find any better solution.
Fill in the partition_* fields in the original CSV file
That is possible but would break backwards compatibility (I already have perhaps hundreds of thousands of files in there). Harder, but manageable.
As you are finding, you are not creating a table exactly LIKE the original, and Redshift doesn't let you ALTER a column's default value. Your proposed path is likely the best: define the staging table explicitly.
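For illustration, a sketch of such an explicit definition. The real column list of abc is not shown in the question, so id and payload are placeholder columns; the point is that DEFAULTs declared at CREATE time are applied by COPY to any column omitted from its column list, so the UPDATE step goes away:
CREATE TABLE abc_staging (
    id          BIGINT,
    payload     VARCHAR(1024),
    partition_0 VARCHAR(32) DEFAULT 'BUZINGA',
    partition_1 VARCHAR(8)  DEFAULT '2018',
    partition_2 VARCHAR(8)  DEFAULT '07'
);
-- COPY only the columns present in the CSV; the omitted partition_* columns
-- are filled from their DEFAULT expressions (bucket path and role ARN are placeholders).
COPY abc_staging (id, payload)
FROM 's3://my-bucket/my-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-copy-role'
CSV;
Keep the column order identical to abc so that ALTER TABLE APPEND still lines up.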
Since I don't know your exact situation, other paths might be better, so let me explore a bit. First off, when you UPDATE the staging table you are in fact reading every row in the table, invalidating that row, and writing a new row (with the new information) at the end of the table. This leads to a lot of invalidated rows. Now when you do ALTER TABLE APPEND, all these invalidated rows are added to your main table, unless you vacuum the staging table beforehand. So you may not be getting the value you want out of ALTER TABLE APPEND.
You may be better off INSERTing the data into your main table with an ORDER BY clause. This is slower than the ALTER TABLE APPEND statement, but you won't have to do the UPDATE, so the overall process could be faster. You could come out further ahead because of the reduced need to VACUUM. Your situation will determine if this is better or not. Just another option for your list.
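A sketch of that route, again with placeholder column names (only the table names and the partition_* values come from the question); listing the target columns explicitly sidesteps the column-order problem, and the constants replace the separate UPDATE:
INSERT INTO abc (id, payload, partition_0, partition_1, partition_2)
SELECT id, payload, 'BUZINGA', '2018', '07'
FROM abc_staging
ORDER BY id;  -- ideally order by abc's sort key so the new rows arrive pre-sorted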
I am curious about your UPDATE speed. This just needs to read and then write every row in the staging table. Unless the staging table is very large it doesn't seem like this should take 20 min. Other activity could be creating this slowdown. Just curious.
Another option would be to change your main table to have these 3 columns last (yes this would be some work). This way you could add the columns to the staging table and things would line up for ALTER TABLE APPEND. Just another possibility.
The easiest solution turned out to be adding the necessary partition_* fields to the source CSV files.
After making that change and removing the UPDATE from the importer pipeline, performance improved greatly. Imports now take ≈10 minutes each in total (that encompasses COPY, DELETE of duplicates, and ALTER TABLE APPEND).
Disk space is no longer climbing up to 100%.
Thanks everyone for help!
I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps. I have to pick the latest among them for each customer. I am planning to achieve this as below:
Read files
Group by customer id
Apply DoFn to compare timestamp of records in each group and have only latest one from them
Flatten it, convert it to table rows, and insert into BQ.
But I am unable to proceed with the grouping step: I see GroupByKey.create() but am unable to make it use the customer id as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you GroupByKey you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much, you'd do the following:
PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...);
// Once we have the data in JsonObjects, we key by customer ID.
// With a lambda, MapElements needs .into(...) so Beam can infer the output type:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(JsonObject.class)))
            .via(elm -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check timestamps and discard all but the most recent, as you were thinking.
Note that you will need to set coders, etc - if you get stuck with that we can iterate.
As a hint / tip, you can consider this example of a Json Coder.
I have an apparently simple task to perform: I have to convert a column in several tables from a string to a new entity (integer FOREIGN KEY) value.
I have 10 DB tables with a column called "app_version", which at the moment is of VARCHAR type. Since I'm about to do a little project refactor, I'd like to convert those VARCHAR columns to a new column that contains an ID representing the newly mapped value, so:
V1 -> ID: 1
V2 -> ID: 2
and so on
I've prepared a Doctrine Migration (I'm using Symfony 3.4) which performs the conversion by DROPPING the old column and adding the new id column for the AppVersion table.
Of course I need to preserve my current existing data.
I know about preUp and postUp, but I can't figure out how to do it without hitting DB performance too much. I could collect the data via SELECT in preUp, store it in some PHP vars, and use it later inside postUp to write the new values to the DB, but since I have 10 tables with many rows this becomes a disaster real fast.
Do you guys have any suggestions I could apply to make this smooth and easy?
Please do not ask why I have to do this refactor now and didn't set up the DB correctly in the first place. :D
Keywords for ideas: transaction? bulk query? avoiding PHP var storage? writing an SQL file? Anything could work.
I feel dumb, but the solution was much simpler: I created a custom migration with all the "ALTER TABLE [table_name] DROP app_version" statements, to be executed AFTER one that simply does:
UPDATE [table_name] SET app_version_id = 1 WHERE app_version = 'V1';
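For reference, a set-based variant that avoids one hard-coded UPDATE per version string. This assumes MySQL and a lookup table app_version(id, name) that already contains the mapped values; some_table stands in for each of the 10 tables:
UPDATE some_table st
JOIN app_version av ON av.name = st.app_version
SET st.app_version_id = av.id;
ALTER TABLE some_table DROP COLUMN app_version;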
I have an Athena table of data in S3 that acts as a source table, with columns id, name, event. For every unique name value in this table, I would like to output a new table with all of the rows corresponding to that name value, and save it to a different bucket in S3. This will result in n new files stored in S3, where n is the number of unique name values in the source table.
I have tried single Athena queries in Lambda using PARTITION BY and CTAS queries, but can't seem to get the result I want. It seems that AWS Glue may be able to produce my expected result, but I've read online that it's more expensive, and that I may be able to get the same result using Lambda.
How can I store a new file (JSON format, preferably) that contains all rows corresponding to each unique name in S3?
Preferably I would run this once a day to update the data stored by name, but the question above is the main concern for now.
When you write your Spark/Glue code you will need to partition the data using the name column. However, this will result in a path having the format below:
s3://bucketname/folder/name=value/file.json
This should give a separate set of files for each name value, but if you want to access each one as a separate table you might need to get rid of that = sign from the key before you crawl the data and make it available via Athena.
If you do use a Lambda, the operation involves going through the data, similar to what Glue does, and partitioning it.
I guess it all depends on the volume of data that needs to be processed. Glue, if using Spark, may have a little bit of extra start-up time; Glue Python shells have comparatively better start-up times.
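For the CTAS route mentioned in the question, a sketch of what that can look like in Athena; the table and bucket names are placeholders, the output format is JSON, and the partition column has to come last in the SELECT list:
CREATE TABLE events_by_name
WITH (
    format            = 'JSON',
    external_location = 's3://target-bucket/events-by-name/',
    partitioned_by    = ARRAY['name']
) AS
SELECT id, event, name
FROM source_table;
This writes one name=<value>/ prefix per distinct name, matching the path layout described above.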
I'm having some trouble getting this table creation query to work, and I'm wondering if I'm running into a limitation of Redshift.
Here's what I want to do:
I have data that I need to move between schemas, and I need to create the destination tables for the data on the fly, but only if they don't already exist.
Here are queries that I know work:
create table if not exists temp_table (id bigint);
This creates a table if it doesn't already exist, and it works just fine.
create table temp_2 as select * from temp_table where 1=2;
So that creates an empty table with the same structure as the previous one. That also works fine.
However, when I do this query:
create table if not exists temp_2 as select * from temp_table where 1=2;
Redshift chokes and says there is an error near as (for the record, I did try removing "as", and then it says there is an error near select).
I couldn't find anything in the redshift docs, and at this point I'm just guessing as to how to fix this. Is this something I just can't do in redshift?
I should mention that I absolutely can separate out the queries that selectively create the table and populate it with data, and I probably will end up doing that. I was mostly just curious if anyone could tell me what's wrong with that query.
EDIT:
I do not believe this is a duplicate. The post linked to offers a number of solutions that rely on user-defined functions... Redshift doesn't support UDFs. They did recently implement a Python-based UDF system, but my understanding is that it's in beta, and we don't know how to implement it anyway.
Thanks for looking, though.
I couldn't find anything in the redshift docs, and at this point I'm just guessing as to how to fix this. Is this something I just can't do in redshift?
Indeed, this combination of CREATE TABLE ... AS SELECT and IF NOT EXISTS is not possible in Redshift (per the documentation). In PostgreSQL, it has been possible since version 9.5.
On SO, this is discussed here: PostgreSQL: Create table if not exists AS. The accepted answer provides options that don't require any UDF or procedural code, so they're likely to work with Redshift too.
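As the question already notes, the creation and the population can simply be split into two statements, which Redshift does accept. A sketch, where WHERE 1=2 just mirrors the empty-copy example above:
CREATE TABLE IF NOT EXISTS temp_2 (LIKE temp_table);
INSERT INTO temp_2 SELECT * FROM temp_table WHERE 1=2;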