Alter sort and distribution for dependent tables - amazon-web-services

This is the query I am using to change the sort and distribution keys on a Redshift table:
CREATE TABLE new_dummy
DISTKEY (id)
SORTKEY (account_id,created_at)
AS (SELECT * FROM dummy);
ALTER TABLE dummy RENAME TO old_dummy;
ALTER TABLE new_dummy RENAME TO dummy;
DROP TABLE old_dummy;
It throws the below error:
ERROR: cannot drop table old_dummy because other objects depend on it
HINT: Use DROP ... CASCADE to drop the dependent objects too.
So is it not possible to change the keys for dependent tables?

It appears that you have VIEWs that are referencing the original (dummy) table.
When a table is renamed, the VIEW continues to point to that same table, regardless of its new name. Therefore, trying to drop the table results in an error.
You will need to drop the view before dropping the table. You can then recreate the view to point to the new dummy table.
So, the flow would be:
Create new_dummy and load data
Drop view
Drop dummy
Rename new_dummy to dummy
Create view
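A rough SQL sketch of that flow, assuming a single dependent view named my_view (a placeholder; recreate whatever views you actually have):
CREATE TABLE new_dummy
DISTKEY (id)
SORTKEY (account_id, created_at)
AS (SELECT * FROM dummy);
DROP VIEW my_view;
DROP TABLE dummy;
ALTER TABLE new_dummy RENAME TO dummy;
CREATE VIEW my_view AS SELECT id, account_id, created_at FROM dummy;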
You might think that this is bad, but it's actually a good feature, because renaming a table will not break any views. The view will automatically stay with the correct table.
UPDATE:
Based on Joe's comment below, the flow would be:
CREATE VIEW ... WITH NO SCHEMA BINDING
Then, for each reload:
Create new_dummy and load data
Drop dummy
Rename new_dummy to dummy
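A minimal sketch of that late-binding view (my_view and the column list are placeholders; note that a late-binding view must reference the table by its schema-qualified name):
CREATE VIEW my_view AS
SELECT id, account_id, created_at
FROM public.dummy
WITH NO SCHEMA BINDING;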

This answer is based upon the fact that you have foreign key references within the table definition, which are not compatible with the process of renaming and dropping tables.
Given this situation, I would recommend that you load data as follows:
Start a transaction
DELETE all rows from the table
Load data with INSERT INTO
End the transaction
This means you are totally reloading the contents of the table. Wrapping it in a transaction means that there is no period where the table will appear 'empty'.
However, this leaves the table in a bit of a messy state, requiring a VACUUM to delete the old data.
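A sketch of that reload, assuming the fresh data is staged in a table called staging_dummy (a placeholder):
BEGIN;
DELETE FROM dummy;
INSERT INTO dummy SELECT * FROM staging_dummy;
COMMIT;
VACUUM dummy;  -- VACUUM must run outside the transaction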
Alternatively, you could:
TRUNCATE the table
Load data with INSERT INTO
TRUNCATE does not require a cleanup since it clears all data associated with the table (not just marking it for deletion). However, TRUNCATE immediately commits the transaction, so there will be a gap where the table will be empty.
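A minimal version of that alternative, again using the placeholder staging_dummy as the source:
TRUNCATE dummy;
INSERT INTO dummy SELECT * FROM staging_dummy;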

Related

Big Query - Convert an int column into float

I would like to convert a column called lauder from int to float in Big Query. My table is called historical. I have been able to use this SQL query
SELECT *, CAST(lauder as float64) as temp
FROM sandbox.dailydev.historical
The query works but the changes are not saved into the table. What should I do?
If you use SELECT * you will scan the whole table, and that is what you will be billed for. If the table is small this shouldn't be a problem, but if it is big enough for cost to be a concern, below is another approach:
apply ALTER TABLE ADD COLUMN to add a new column of the needed data type
apply UPDATE to populate the new column
UPDATE table
SET new_column = CAST(old_column as float64)
WHERE true
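Applied to the question's table, that would look roughly like this (lauder_float is just an illustrative name for the new column):
ALTER TABLE sandbox.dailydev.historical ADD COLUMN lauder_float FLOAT64;
UPDATE sandbox.dailydev.historical
SET lauder_float = CAST(lauder AS FLOAT64)
WHERE true;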
Do you want to save the results so you can use them later?
You can put them in a WITH clause like below and then refer to "temp" in the rest of the query:
with temp as
( SELECT *, CAST(lauder as float64)
FROM sandbox.dailydev.historical)
SELECT * FROM temp
You cannot change a column's data type in a table:
https://cloud.google.com/bigquery/docs/manually-changing-schemas#changing_a_columns_data_type
What you can do is either:
Create a view to sit on top and handle the data type conversion
Create a new column and set the data type to float64 and insert values into it
Overwrite the table
Options 2 and 3 are outlined well including pros and cons in the link I shared above.
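As a rough sketch of option 1 (the view name is illustrative):
CREATE VIEW sandbox.dailydev.historical_view AS
SELECT * EXCEPT (lauder), CAST(lauder AS FLOAT64) AS lauder
FROM sandbox.dailydev.historical;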
Your statement is correct, but table columns in Big Query are immutable. You need to run your query and save the results to a new table with the modified column.
Click "More" > "Query settings", and in "Destination" select "Set a destination table for query results" and fill the table name. You can even select if you want to overwrite the existing table with generated one.
After these settings are set, just "Run" your query as usual.
You can use CREATE OR REPLACE TABLE to write the structural changes along with the data into the same table:
CREATE OR REPLACE TABLE sandbox.dailydev.historical
AS SELECT *, CAST(lauder as float64) as temp FROM sandbox.dailydev.historical;
In this example, the historical table is recreated with an additional column, temp.
In some cases you can change column types:
CREATE TABLE mydataset.mytable(c1 INT64);
ALTER TABLE mydataset.mytable ALTER COLUMN c1 SET DATA TYPE NUMERIC;
Check the conversion rules in the Google docs.

How to insert Billing Data from one Table into another Table in BigQuery

I have two tables, both containing GCP billing data, in two different regions. I want to insert one table into the other. Both tables are partitioned by day, and the larger one is being written to by GCP for billing exports, which is why I want to insert the data into the larger table.
I am attempting the following:
Export the smaller table to Google Cloud Storage (GCS) so it can be imported into the other region.
Import the table from GCS into Big Query.
Use Big Query SQL to run INSERT INTO dataset.big_billing_table SELECT * FROM dataset.small_billing_table
However, I am running into a lot of issues, as it won't just let me insert (there are repeated fields in the schema, etc.). An example of the dataset can be found here: https://bigquery.cloud.google.com/table/data-analytics-pocs:public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1
Thanks :)
## Update ##
So the issue was exporting and importing the data with the Avro format and using the auto-detect schema when importing the table back in (Timestamps were getting confused with integer types).
Solution
Export the small table in JSON format to GCS, use GCS to do the regional transfer of the files, and then import the JSON file into a BigQuery table without schema auto-detect (i.e. specify the schema manually). Then you can use INSERT INTO with no problems.
I was able to reproduce your case with the example data set you provided. I used dummy tables, generated from the queries below, to verify the approaches:
Table 1: billing_bigquery
SELECT * FROM `data-analytics-pocs.public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1`
where service.description ='BigQuery' limit 1000
Table 2: billing_pubsub
SELECT * FROM `data-analytics-pocs.public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1`
where service.description ='Cloud Pub/Sub' limit 1000
I will propose two methods for performing this task. However, I must point out that the target and the source tables must have the same column names, at least for the ones you are going to insert.
First, I used the INSERT INTO method. However, I would like to stress that, according to the documentation, if your table is partitioned you must include the column names that will be used to insert new rows. Therefore, using the dummy data already shown, it looks like the following:
INSERT INTO `billing_bigquery` ( billing_account_id, service, sku, usage_start_time, usage_end_time, project, labels, system_labels, location, export_time, cost, currency, currency_conversion_rate, usage, credits )  # invoice, cost_type omitted
SELECT billing_account_id, service, sku, usage_start_time, usage_end_time, project, labels, system_labels, location, export_time, cost, currency, currency_conversion_rate, usage, credits
FROM `billing_pubsub`
Notice that for nested fields I just write down the top-level field name, for instance service and not service.description, because the nested fields are included with it. Furthermore, I did not select all of the columns in the target table, but every column I selected for the target table must also be present in the source table's selection.
For the second method, you can simply use the Query settings button to append the small_billing_table to the big_billing_table. In the BigQuery Console, click More >> Query settings. When the settings window appears, go to Destination table, check Set a destination table for query results, and fill in the fields Project name, Dataset name and Table name (these describe the destination table). Then, under Destination table write preference, check Append to table, which according to the documentation:
Append to table — Appends the query results to an existing table
Then you run the following query:
Select * from <project.dataset.source_table>
After running it, the source table's data should be appended to the target table.

Is there anyway to keep only one week data in redshift table

I have a source from which a file is populated every day, and every day it is loaded into a Redshift table.
But I want to keep only one week of data in the table; after one week, the older data should be deleted.
Please suggest a way to do that.
A common method is:
Load each day's data into a separate table
Use CREATE VIEW to create a combined view of the past week's tables
For example:
CREATE VIEW data
AS
SELECT * FROM monday_table
UNION ALL
SELECT * FROM tuesday_table
UNION ALL
SELECT * FROM wednesday_table
...etc
Your users can simply use the View as a normal table.
Then, each day when new data has arrived, DROP or TRUNCATE the oldest table and load the new data
Either load the new data in the same-named table as the one dropped/truncated, or re-create the View to include this new table and not the dropped one
There is no automatic process to do the above steps, but you could make it part of the script that runs your Load process.
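As a sketch of that daily step, using the first option above (reloading into the same-named table so the view keeps working); the S3 path and IAM role are placeholders:
TRUNCATE monday_table;
COPY monday_table
FROM 's3://my-bucket/billing/2023-01-02.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS CSV;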

How to create table based on minimum date from other table in DAX?

I want to create a second table from the first table using filters with dates and other variables as follows. How can I create this?
Following are the expected table and the original table.
Go to Edit Queries. Let's say our base table is named RawData. Add a blank query and use this expression to copy your RawData table:
=RawData
The new table will be RawDataGrouped. Now select the new table and go to Home > Group By and use the following settings:
The result will be the following table. Note that I didn't use exactly the values you used, in order to keep this sample to a minimum of effort:
You can also now create a relationship between these two tables (by the Index column) to use cross-filtering between them. You could show the grouped data and use the relationship to display the RawData in a subreport (or custom tooltip), for example.
I assume you are looking for a calculated table. Below is a workaround for that.
In the Query Editor you can create a duplicate of the existing (original) table and select the Date Filters -> Is Earliest option by clicking the right corner of the Date column in the new duplicate table. Now your table should contain only the rows that have the minimum date for that column.
Note: This table is dynamic and will reflect subsequent changes to the data in the original table, but you have to refresh both tables.
Original Table:
Desired Table:
When I added a new column to it, after refreshing the dataset I got the below result (this implies it is recalculating based on each data change in the original source).
New data entry:
Output:

How to remove fields from big query table?

I updated the schema of a production BigQuery table and added a RECORD type field, with a few repeated fields inside another RECORD type field. Now I have to drop this column. I can't delete the whole table as it already has more than 30 TB of data. Is there any way to drop a particular column in a BigQuery table?
See:
https://cloud.google.com/bigquery/docs/manually-changing-schemas#deleting_a_column_from_a_table_schema
There are two ways to manually delete a column:
Using a SQL query — Choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.
Recreating the table — Choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
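As a sketch of the SQL-query option, with placeholder table and column names (extra_record stands in for the column you want to drop):
CREATE OR REPLACE TABLE mydataset.mytable AS
SELECT * EXCEPT (extra_record)
FROM mydataset.mytable;
Keep in mind this scans and rewrites the whole table, which is the cost trade-off the documentation mentions.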