I have created a BQ table with two columns:
col1 string nullable
col2 string required
Then I populated the table with some dummy data:
insert into `test.test` values ('val1', 'val2')
insert into `test.test` values (null, 'val2')
insert into `test.test` values ('val1', 'val2')
After that I dropped a single column:
Alter table `test.test` drop column col2
And after that I would like to add a new column:
alter table `test.test` add column col3 string
HINT: The new column I am trying to add is named differently from the one I deleted.
BQ raises an error:
Column `col2` was recently deleted in the table `test`. Deleted column name is reserved for up to the time travel duration, use a different column name instead.
This does not seem right. I know that deleted columns are still kept somewhere in the BQ world, but I am trying to add a column with a different name than the deleted one.
Any idea?
This is a known issue that is being worked on by the BigQuery engineering team. You may click +1 to bring more attention to the issue and star it so that you are notified of updates.
Meanwhile, as a workaround, you can try adding the new field via one of the following (see the sketch after the list):
BigQuery UI
bq command line
API or Client Library
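For example, a rough sketch of the bq command-line route (table name taken from the question; appending the new field to the schema file is the part you do by hand):
bq show --schema --format=prettyjson test.test > schema.json
# edit schema.json and append: {"name": "col3", "type": "STRING", "mode": "NULLABLE"}
bq update test.test schema.json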
I would like to know whether my data modeling works for Power BI.
The dataset I am using covers training courses for students and corporate staff. The original data has 3 tables, separated by individual program. The purpose of my visualization is to analyze the 3 programs across all students in a single dashboard.
Here is the original data after being imported to Power BI:
Here is the data pre-processing:
Remove unneeded columns
DPT table
Remove columns – No, Date, Quarter
DTP table
Remove columns – Count, Email, Date
LLD table
Remove columns – Email, To calculate, Learning Hours
Rename columns & impute missing values with “Not given”
DPT table
Trainee = Name, Training Provider = Provider, Course name = Course, Focus area = Domain
DTP table
Participant Name = Name, Event/Training Name = Course, Training providers = Provider
Create new columns, impute them with “Not given”, and put them in the same position (to append the tables later)
DTP table
Level
LLD table
Company, Provider, Level
Create a new column called Program and impute its value as the program name for each row.
Tables after cleaning:
After appending the 3 tables and calling it Master:
Duplicate the Master table to create the Student, Provider and Program tables. In each table, remove irrelevant columns, remove duplicates, and create a unique ID.
Final data model:
The focus is on the Program, Provider and Student tables. The relationships of the remaining tables will be deactivated when creating calculated columns and measures, before I make any corrections to the data model.
Is there a proper approach to building the data model?
Based on my data model in the last picture, does that mean the Provider table is a fact table while the Student and Program tables are dimensions?
I agree with removing unneeded columns, renaming columns for a better look, and substituting 'Not Given' in place of NULLs (caution here: measures and dimensions handle nulls differently; for dimensional values, substituting is okay).
If modeling in Power BI is a must, then consider the following strategy (a rough sketch follows the list):
The dimensions can be Students, Programs, Providers
A factless fact table (FactProgram or something like that). It will have dimension keys to Students, Programs, Providers (and additional measures that you can create or take from Master)
Remove unnecessary columns from the dimensions, so that Remove Duplicates will give you what you want. For instance, Student and Program currently have the same columns coming from Master (Company, Course, Domain, Level, Program, Provider). Make it clear which columns belong to which dimensions, and optionally create new dimensions (maybe DimCompany)
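For illustration only, here is a rough sketch of that target structure expressed as SQL-style DDL (all table and column names are assumptions about your data; in Power BI you would build the equivalent tables in Power Query rather than with DDL):
-- Dimension tables: one row per student / program / provider, each with a unique ID
CREATE TABLE DimStudent  (StudentID  INT64, Name     STRING, Company STRING);
CREATE TABLE DimProgram  (ProgramID  INT64, Program  STRING, Level   STRING);
CREATE TABLE DimProvider (ProviderID INT64, Provider STRING);

-- Factless fact table: one row per training record, carrying the dimension keys
-- (plus any measures you create or take from Master)
CREATE TABLE FactProgram (
  StudentID  INT64,   -- key to DimStudent
  ProgramID  INT64,   -- key to DimProgram
  ProviderID INT64,   -- key to DimProvider
  Course     STRING
);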
I would like to convert a column called lauder from int to float in BigQuery. My table is called historical. I have been able to use this SQL query:
SELECT *, CAST(lauder as float64) as temp
FROM sandbox.dailydev.historical
The query works but the changes are not saved into the table. What should I do?
If you use SELECT * you will scan the whole table, and that is what you will be charged for. If the table is small this shouldn't be a problem, but if it is big enough for cost to be a concern, below is another approach:
apply ALTER TABLE ADD COLUMN to add a new column of the needed data type
apply UPDATE to populate the new column
UPDATE table
SET new_column = CAST(old_column as float64)
WHERE true
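Applied to the table from the question, a minimal sketch of both steps might look like this (lauder_float is an assumed name for the new column):
ALTER TABLE sandbox.dailydev.historical ADD COLUMN lauder_float FLOAT64;

UPDATE sandbox.dailydev.historical
SET lauder_float = CAST(lauder AS FLOAT64)
WHERE true;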
Do you want to save the results in a temporary table to use later?
You can save them to a temporary table like below and then refer to "temp":
with temp as (
  SELECT *, CAST(lauder as float64) as lauder_float
  FROM sandbox.dailydev.historical
)
-- then refer to temp in the rest of the query, for example:
select * from temp
You cannot change a column's data type in a table:
https://cloud.google.com/bigquery/docs/manually-changing-schemas#changing_a_columns_data_type
What you can do is either:
Create a view to sit on top and handle the data type conversion (a sketch follows below)
Create a new column with the data type float64 and insert values into it
Overwrite the table
Options 2 and 3 are outlined well, including pros and cons, in the link I shared above.
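For option 1, a minimal sketch of such a view (the view name historical_v is an assumption):
CREATE OR REPLACE VIEW sandbox.dailydev.historical_v AS
SELECT * EXCEPT (lauder),
       CAST(lauder AS FLOAT64) AS lauder
FROM sandbox.dailydev.historical;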
Your statement is correct, but table columns in BigQuery are immutable. You need to run your query and save the results to a new table with the modified column.
Click "More" > "Query settings", and under "Destination" select "Set a destination table for query results" and fill in the table name. You can even choose whether to overwrite the existing table with the generated one.
After these settings are set, just "Run" your query as usual.
You can use CREATE OR REPLACE TABLE to write the structural changes, along with the data, into the same table:
CREATE OR REPLACE TABLE sandbox.dailydev.historical
AS SELECT *, CAST(lauder as float64) as temp FROM sandbox.dailydev.historical;
In this example, the historical table will be restructured with an additional column, temp.
In some cases you can change column types:
CREATE TABLE mydataset.mytable(c1 INT64);
ALTER TABLE mydataset.mytable ALTER COLUMN c1 SET DATA TYPE NUMERIC;
Check the conversion rules and the Google docs.
I want to create a second table from the first table using filters on dates and other variables, as follows. How can I create this?
Following are the expected table and the original table:
Go to Edit Queries. Let's say our base table is named RawData. Add a blank query and use this expression to copy your RawData table:
=RawData
The new table will be RawDataGrouped. Now select the new table and go to Home > Group By and use the following settings:
The result will be the following table. Note that I didn't use the exact values you used, to keep this sample minimal:
You can also now create a relationship between these two tables (by the Index column) to use cross-filtering between them.
For example, you could show the grouped data and use the relationship to display the RawData in a subreport (or custom tooltip).
I assume you are looking for a calculated table. Below is a workaround for this:
In the Query Editor you can create a duplicate of the existing (original) table and select the Date Filters -> Is Earliest option by clicking the right corner of the Date column in the new duplicate table. Now your table should contain only the rows that have the minimum date for that column.
Note: this table is dynamic and will give subsequent results based on data changes in the original table, but you have to refresh both tables.
Original Table:
Desired Table:
When I added a new entry to it, after refreshing the dataset I got the result below (this implies it recalculates based on each data change in the original source):
New data entry:
Output:
I have a table in a Redshift cluster with ~1 billion rows. I have a job that tries to update some column values based on some filter. Updating anything at all in this table is incredibly slow. Here's an example:
SELECT col1, col2, col3
FROM SOMETABLE
WHERE col1 = 'a value of col1'
AND col2 = 12;
The above query returns in less than a second, because I have sortkeys on col1 and col2. There is only one row that meets these criteria, so the result set is just one row. However, if I run:
UPDATE SOMETABLE
SET col3 = 20
WHERE col1 = 'a value of col1'
AND col2 = 12;
This query takes an unknown amount of time (I stopped it after 20 minutes). Again, it should be updating one column value of one row.
I have also tried to follow the documentation here: http://docs.aws.amazon.com/redshift/latest/dg/merge-specify-a-column-list.html, which talks about creating a temporary staging table to update the main table, but got the same results.
Any idea what is going on here?
You didn't mention what percentage of the table you're updating, but it's important to note that an UPDATE in Redshift is a 2-step process:
Each row that will be changed must be first marked for deletion
Then a new version of the data must be written for each column in the table
If you have a large number of columns and/or are updating a large number of rows then this process can be very labor intensive for the database.
You could experiment with using a CREATE TABLE AS statement to create a new "updated" version of the table, then dropping the existing table and renaming the new one, as sketched below. This has the added benefit of leaving you with a fully sorted table.
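A rough sketch of that approach, using the table and columns from the question (the CASE expression stands in for whatever update logic you need; the sort key is assumed to match the original table):
CREATE TABLE sometable_new
SORTKEY (col1, col2)
AS
SELECT col1,
       col2,
       CASE WHEN col1 = 'a value of col1' AND col2 = 12 THEN 20 ELSE col3 END AS col3
FROM sometable;

ALTER TABLE sometable RENAME TO sometable_old;
ALTER TABLE sometable_new RENAME TO sometable;
DROP TABLE sometable_old;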
Actually, I don't think Redshift is designed for bulk updates; Redshift is designed for OLAP rather than OLTP, and update operations are inefficient on Redshift by nature.
In this use case, I would suggest doing INSERT instead of UPDATE, adding another TIMESTAMP column, and then, when you do analysis on Redshift, using extra logic on the latest TIMESTAMP to eliminate possible duplicate data entries.
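A minimal sketch of that insert-only pattern (the updated_at column and the choice of (col1, col2) as the key are assumptions):
-- write a new version of the row instead of updating it in place
INSERT INTO sometable (col1, col2, col3, updated_at)
VALUES ('a value of col1', 12, 20, GETDATE());

-- at analysis time, keep only the latest version of each key
SELECT col1, col2, col3
FROM (
    SELECT col1, col2, col3,
           ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY updated_at DESC) AS rn
    FROM sometable
) t
WHERE rn = 1;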
In my current project I'm monitoring a MySQL database: the database is updated by another program, so my C++ program needs to select only the new rows. It is not going to be a small table (>10,000 rows), so I do not want to search each row, i.e. checking a column like isNew = 0 or 1. I already found:
Query to find tables modified in the last hour
http://www.codediesel.com/mysql/how-to-check-when-a-mysql-table-was-last-updated/
However, in these examples you can only find out which table was updated. How can I only select the new rows from a table?
How can I only select the new rows from a table?
Assuming new rows means newly inserted rows, and if you can change the database schema, you could use an auto-increment column. By remembering the largest value each time your program selects a result set, it can use that value in the next query:
select * from table where id > 123
I would recommend adding an isNew column to the table with a default value of 1 and adding an index on it. The index will prevent your query from checking all rows. After you have processed a row, set its isNew to 0. For example:
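A minimal sketch of that approach (the table name mytable is an assumption):
-- one-time schema change
ALTER TABLE mytable ADD COLUMN isNew TINYINT NOT NULL DEFAULT 1;
CREATE INDEX idx_mytable_isnew ON mytable (isNew);

-- the C++ program selects only unprocessed rows ...
SELECT * FROM mytable WHERE isNew = 1;

-- ... and marks them as processed afterwards
-- (in practice, mark only the ids you actually processed)
UPDATE mytable SET isNew = 0 WHERE isNew = 1;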