Redshift does not support Arrays, however my source database has several Array columns that I need in Redshift.
How should this field type be handled when trying to migrate it into Redshift?
While Redshift does not support arrays in the PostgreSQL-sense, it provides some JSON functions you might want to have a look at: http://docs.aws.amazon.com/redshift/latest/dg/json-functions.html
You can insert arrays into varchar columns:
create temporary table _test (col1 varchar(20));
insert into _test values ('[1,2,3]');
Then using json_extract_array_element_text() would yield:
db=# select json_extract_array_element_text(col1, 2) from _test;
json_extract_array_element_text
---------------------------------
3
(1 row)
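On the migration side, the source arrays have to be turned into JSON text before they are loaded into the varchar column. Here is a minimal sketch of that step in Python; the helper name to_json_array_text and the max_len default (matching the varchar(20) above) are assumptions for illustration, not part of any Redshift tooling.

```python
import json


def to_json_array_text(values, max_len=20):
    """Serialize a source array to compact JSON text for a varchar column.

    max_len mirrors the varchar(20) column in the example above (assumption).
    json.dumps escapes non-ASCII by default, so len() equals the byte length.
    """
    text = json.dumps(list(values), separators=(",", ":"))
    if len(text) > max_len:
        raise ValueError(f"{text!r} does not fit in varchar({max_len})")
    return text


cell = to_json_array_text([1, 2, 3])
print(cell)                  # the text to load into col1
print(json.loads(cell)[2])   # positions are zero-based, like the query above
```

Size the varchar column for the longest array you expect, since the whole array has to fit in one value.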
Another option is to store the array as a delimited string in a varchar column: surround each value with a specific character that never appears in the field's values.
Example:
field = |value1|value2|value3|
And when querying it you just do this:
where field like '%|value1|%'
Also keep in mind that LIKE queries with a leading wildcard are expensive and can decrease your cluster's performance.
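A small sketch of how the delimited encoding could be produced during the migration, and how the matching LIKE pattern is built. The function names encode_delimited and like_pattern are hypothetical helpers, not part of any library.

```python
def encode_delimited(values, delim="|"):
    """Join values as |v1|v2|v3| so every element is wrapped in delimiters."""
    items = [str(v) for v in values]
    for item in items:
        if delim in item:
            # The scheme only works if the delimiter never occurs in the data.
            raise ValueError(f"value {item!r} contains the delimiter {delim!r}")
    return delim + delim.join(items) + delim


def like_pattern(value, delim="|"):
    """Build the LIKE pattern that matches one whole element.

    Note: % and _ inside value are LIKE wildcards and would need escaping.
    """
    return f"%{delim}{value}{delim}%"


print(encode_delimited(["value1", "value2", "value3"]))
print(like_pattern("value1"))
```

Wrapping both ends in the delimiter is what keeps a search for value1 from also matching value10.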
I have an apparently simple task to perform: I have to convert a column in several tables from a string to a new entity (integer FOREIGN KEY) value.
My DB has 10 tables with a column called "app_version", which at the moment are all VARCHAR columns. Since I'm going to do a little project refactor, I'd like to convert those VARCHAR columns to a new column which contains an ID representing the newly mapped value, so:
V1 -> ID: 1
V2 -> ID: 2
and so on
I've prepared a Doctrine Migration (I'm using Symfony 3.4) which performs the conversion by DROPPING the old column and adding the new id column for the AppVersion table.
Of course I need to preserve my existing data.
I know about preUp and postUp, but I can't figure out how to do it without hitting the DB performance too much. I can collect the data via SELECT in preUp and store it in some PHP vars to use later inside postUp to write the new values to the DB, but since I have 10 tables with many rows this becomes a disaster real fast.
Do you guys have any suggestion i could apply to make this smooth and easy?
Please do not ask why i have to do this refactor now and i didn't setup the DB correctly in the first time. :D
Keywords for ideas: transaction? bulk query? avoid php vars storage? write sql file? everything can be good
I feel dumb, but the solution was much simpler: I created a custom migration containing all the "ALTER TABLE [table_name] DROP app_version" statements, to be executed AFTER one that simply does, for each table and each version:
UPDATE [table_name] SET app_version_id = 1 WHERE app_version = 'V1'
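To illustrate the order of operations (backfill the new column first, drop the old one only in a later migration), here is a rough, self-contained sketch using Python's sqlite3 module rather than Doctrine. The table name sessions and the VERSION_IDS mapping are made up for the example; the point is only that the data never has to pass through application variables.

```python
import sqlite3

# Hypothetical mapping from old string values to the new AppVersion ids.
VERSION_IDS = {"V1": 1, "V2": 2}


def backfill_version_ids(conn, tables, version_ids):
    """Run one UPDATE per (table, version) pair inside a single transaction."""
    with conn:  # commits on success, rolls back on error
        for table in tables:
            for label, version_id in version_ids.items():
                # Table names cannot be bound as parameters, so `tables`
                # must be a trusted, hard-coded list.
                conn.execute(
                    f"UPDATE {table} SET app_version_id = ? WHERE app_version = ?",
                    (version_id, label),
                )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions ("
    "id INTEGER PRIMARY KEY, app_version TEXT, app_version_id INTEGER)"
)
conn.executemany(
    "INSERT INTO sessions (app_version) VALUES (?)", [("V1",), ("V2",), ("V1",)]
)
backfill_version_ids(conn, ["sessions"], VERSION_IDS)
rows = conn.execute(
    "SELECT app_version, app_version_id FROM sessions ORDER BY id"
).fetchall()
print(rows)
```

Only once every table has been backfilled and checked would the separate migration with the DROP statements run.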
Currently I'm loading data from Google Storage to stage_table_orders using WRITE_APPEND. Since this loads both new and existing orders, the same order can have more than one version; the field etl_timestamp tells which row is the most recent one.
Then I WRITE_TRUNCATE my production_table_orders with a query like:
select ...
from (
SELECT * , ROW_NUMBER() OVER
(PARTITION BY date_purchased, orderid order by etl_timestamp DESC) as rn
FROM `warehouse.stage_table_orders` )
where rn=1
Then the production_table_orders always contains the most updated version of each order.
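For readers less familiar with the ROW_NUMBER() pattern, the same "keep the newest row per key" logic can be sketched in plain Python. This is only an illustration of what the query computes, not something to run against 20M rows; the function name latest_versions and the sample status field are hypothetical.

```python
def latest_versions(rows):
    """Keep, for each (date_purchased, orderid), the row with the largest
    etl_timestamp. This mirrors ROW_NUMBER() ... ORDER BY etl_timestamp DESC
    followed by WHERE rn = 1."""
    latest = {}
    for row in rows:
        key = (row["date_purchased"], row["orderid"])
        current = latest.get(key)
        if current is None or row["etl_timestamp"] > current["etl_timestamp"]:
            latest[key] = row
    return list(latest.values())


stage = [
    {"date_purchased": "2019-01-01", "orderid": 1, "etl_timestamp": 1, "status": "new"},
    {"date_purchased": "2019-01-01", "orderid": 2, "etl_timestamp": 1, "status": "new"},
    {"date_purchased": "2019-01-01", "orderid": 1, "etl_timestamp": 2, "status": "paid"},
]
production = latest_versions(stage)
print(production)
```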
This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice.
I have around 20M rows. It seems not smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Suggestion?
We are doing the same. To help improve performance, though, try to partition the table by date_purchased and cluster it by orderid.
Use a CTAS statement (selecting from the table itself) to do this, as partitioning cannot be added to an existing table after the fact.
EDIT: use 2 tables and MERGE
Depending on your particular use case i.e. the number of fields that could be updated between old and new, you could use 2 tables, e.g. stage_table_orders for the imported records and final_table_orders as destination table and do
a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
F.date_purchased = S.date_purchased
WHEN MATCHED THEN
UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
INSERT (field1, field2, ...) VALUES(S.field1, S.field2, ...)
Pro: efficient if few rows are "upserted", not millions (although not tested) + pruning partitions should work.
Con: you have to explicitly list the fields in the update and insert clauses. A one-time effort if schema is pretty much fixed.
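One way to soften the "Con" is to generate the statement from a column list instead of writing it by hand. The sketch below assumes a hypothetical helper build_merge_sql; it only builds the SQL text in the same shape as the MERGE above and does not talk to BigQuery. It also assumes the staging data has at most one row per key, since a MERGE is expected to fail when several source rows match the same target row.

```python
def build_merge_sql(target, source, key_cols, update_cols):
    """Build a MERGE statement shaped like the one above from column lists."""
    on_clause = " AND ".join(f"F.{c} = S.{c}" for c in key_cols)
    set_clause = ", ".join(f"{c} = S.{c}" for c in update_cols)
    all_cols = list(key_cols) + list(update_cols)
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"S.{c}" for c in all_cols)
    return (
        f"MERGE {target} F\n"
        f"USING {source} S\n"
        f"ON {on_clause}\n"
        f"WHEN MATCHED THEN\n"
        f"  UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN\n"
        f"  INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


sql = build_merge_sql(
    "final_table_orders",
    "stage_table_orders",
    key_cols=["orderid", "date_purchased"],
    update_cols=["field_that_change"],
)
print(sql)
```

The column names should come from a trusted source (for example the table schema), since they are interpolated directly into the statement.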
There are many ways to de-duplicate and there is no one-size-fits-all. Search SO for similar requests using ARRAY_AGG, or EXISTS with DELETE, or UNION ALL, etc. Try them out and see which performs better for YOUR dataset.
I have an ingestion time partitioned table that's getting a little large. I wanted to group by the values in one of the columns and use that to split it into multiple tables. Is there an easy way to do that while retaining the original _PARTITIONTIME values in the set of new ingestion time partitioned tables?
Also I'm hoping for something that's relatively simple/cheap. I could do something like copy my table a bunch of times and then delete the data for all but one value on each copy, but I'd get charged a huge amount for all those DELETE operations.
Also I have enough unique values in the column I want to split on that saving a "WHERE column = value" query result to a table for every value would be cost prohibitive. I'm not finding any documentation that mentions whether this approach would even preserve the partitions, so even if it weren't cost prohibitive it may not work.
The case you describe would require two-level partitioning, which is not supported yet.
You can create a column-partitioned table: https://cloud.google.com/bigquery/docs/creating-column-partitions
You then have to populate the partitioning column yourself, as needed, before inserting. In this case, however, you lose the _PARTITIONTIME value.
Based on additional clarification: I had a similar problem, and my solution was to write a Python application that reads the source table (reading, not querying, is the important part here, since it is free), splits the data based on your criteria, and then either streams the data into the target tables (simple, but not free) or generates JSON/CSV files and loads them into the target tables (also free, but with some limits on the number of such operations). The second route requires more coding and exception handling.
You can also do it via Dataflow; it will definitely be more expensive than a custom solution, but potentially more robust.
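Example using the google-cloud-bigquery Python library (source_table_ref, target_project, target_dataset, start_index, max_results, target_table and records_to_stream are placeholders you have to define):
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")
t1 = client.get_table(source_table_ref)
target_schema = t1.schema[1:]  # drop the first column, which is the key to split on
ds_target = client.dataset(project=target_project, dataset_id=target_dataset)
# list_rows reads the table directly instead of running a query
rows_to_process_iter = client.list_rows(t1, start_index=start_index, max_results=max_results)
# convert to list
rows_to_process = list(rows_to_process_iter)
# do something with the records
# stream records to the destination
# (this call was named create_rows in older releases of the library)
errors = client.insert_rows(target_table, records_to_stream)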
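The "doing something with records" step is where the split happens. A minimal sketch of that step, using plain dicts so it runs without BigQuery; the name split_rows_by_key is hypothetical, and rows returned by the client library are assumed to be convertible with dict(row).

```python
from collections import defaultdict


def split_rows_by_key(rows, key_field):
    """Group rows by the value of key_field and drop that field from each
    record, since each target table then only holds one key value."""
    groups = defaultdict(list)
    for row in rows:
        record = dict(row)            # copy, so the input rows are not mutated
        key = record.pop(key_field)
        groups[key].append(record)
    return dict(groups)


sample = [
    {"customer": "a", "value": 1},
    {"customer": "b", "value": 2},
    {"customer": "a", "value": 3},
]
groups = split_rows_by_key(sample, "customer")
print(groups)
```

Each entry of groups can then be streamed or written to a file for its own target table.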
BigQuery now supports clustered partitioned tables, which allow you to specify additional columns that the data should be split by.
What happens when after creating the table in AWS Athena for files on S3, the structure of the files on S3 change?
For example:
If the files previously had 5 columns when the table was created and later the new files started getting 1 more column:
a) at the end?
b) in between?
What happens when some columns are not available in new files?
What happens when the columns remain the same but the column order changes?
Can we alter Athena tables to adjust to these changes?
1 - Athena is not a NoSQL solution, and it does not have a dynamic schema either. If you change the schema, all your files in a particular folder should reflect that change. Athena won't magically update the table to include the new column.
2 - Then it'll be a problem and it'll break. You should include NULL or an empty field (,,) in place of the missing column to keep the files consistent.
3 - Athena picks columns up by order, not really by name. If your column order changes, it'll probably break (different types).
4 - Yes. You can always easily recreate an Athena table by dropping it and creating a new one.
If you have variable length files, then you should insert them into different folders so that each folder represents one consistent schema. You can then unify this later on in Athena with a union or similar to create a condensed, simplified table that you can apply the consistent schema to.
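For delimited text files, the "include ,," advice from point 2 could be applied with a small preprocessing step before the files land in S3. This is only a sketch for the simplest case, where the new column was added at the end; pad_rows is a hypothetical helper and it does not handle reordered columns.

```python
import csv
import io


def pad_rows(csv_text, expected_columns):
    """Pad short CSV rows with empty trailing fields so every row has the
    same number of columns as the table definition."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) > expected_columns:
            raise ValueError(f"row has {len(row)} columns, expected {expected_columns}")
        writer.writerow(row + [""] * (expected_columns - len(row)))
    return out.getvalue()


old_file = "1,a\n2,b\n"
print(pad_rows(old_file, 3))
```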
It depends on the files format you are using and the setup (if the schema is by field order or by field name). All the details are here: https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html
Note in particular that these schema updates do not work if the data is nested or in arrays. To quote from that page:
Schema updates described in this section do not work on tables with complex or nested data types, such as arrays and structs.
I want to clear all the values from a table. It has a few secondary indexes. I tried to do this via committing a transaction with Mutation.delete("MyTable", KeySet.all()) (see docs here). But I got an error:
error:INVALID_ARGUMENT: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: The transaction contains too many mutations.
How can I efficiently clear my table contents?
Cloud Spanner does not support such a truncate command. If your table had no secondary indexes, then you could specify KeySet.all() in your Delete as specified above, but this can fail if your table has secondary indexes and is large.
The best way to do what you want is to issue an updateDdl RPC including the following statements:
1) For each secondary index on MyTable, include a corresponding DROP INDEX statement
2) DROP TABLE MyTable
3) If necessary, re-create your table and indexes via the CREATE TABLE and CREATE INDEX statements, respectively.
Note that you are allowed and encouraged to include all of these statements in a single updateDdl RPC. The advantage of this is that it gives you atomic ("all-or-nothing") semantics.
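The three steps above can be sketched as a function that assembles the statement list in the right order. build_reset_ddl and the example index and table definitions are hypothetical; the resulting list is what you would hand to a single updateDdl call through whichever client you use.

```python
def build_reset_ddl(table, index_names, recreate_statements=()):
    """Return DDL statements in the order described above:
    drop each secondary index, drop the table, then optionally re-create."""
    statements = [f"DROP INDEX {name}" for name in index_names]
    statements.append(f"DROP TABLE {table}")
    statements.extend(recreate_statements)
    return statements


ddl = build_reset_ddl(
    "MyTable",
    ["MyTableByName"],  # hypothetical index name
    recreate_statements=[
        # Hypothetical schema, shown only to complete the example.
        "CREATE TABLE MyTable (Id INT64 NOT NULL, Name STRING(MAX)) PRIMARY KEY (Id)",
        "CREATE INDEX MyTableByName ON MyTable(Name)",
    ],
)
for statement in ddl:
    print(statement)
```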