Power BI : Convert Duplicate Table into a Reference Table - powerbi

Is there an automated way to convert a Duplicate Table(With all its steps) into a Reference Table preserving all the steps in Query Editor ?

Short answer, not really, but it's possibly trivial to do manually for one query.
Reference Table and Duplicate Table are GUI operations, which like other GUI operations, simply insert M code into the query. You can see the entire query in the Advanced Editor.
Reference Table just inserts the name of the other query; the effect is branching the data processing pipelines. If you change the original query, it affects all downstream queries.
Duplicate Table copies all of the steps; the effect is creating a separate query. You can change them at any point later. There is no link to where the steps came from even if they aren't changed.
So, it seems that you just want to convert duplicated steps to references. There is no automated way of doing it. But if you know two queries start with the same steps, try this: Duplicate to a base query and remove final steps that are not in common. Mark the new query to not load to the report by: Click All Properties; Uncheck Enable load to report. Then you can replace the duplicated initial steps in the other queries with a reference by a step like Source = BaseQuery in the Advanced Editor.
Also, if you find yourself duplicating steps in the middle of a query, you can create a query used as a function.

Related

Insert many items from list into SQLite

I have a list of lots of data (will be near 1000). I want to add it all in one go to a row. Is this straight forward like a for loop over list with multiple inserts?multiple commits? Is this bad practice?thanks
I haven’t tried yet as just setting up table columns which is many so need to know if feasible thanks
If you're using SQL to insert:
INSERT INTO 'tablename' ('column1', 'column2') VALUES
('data1', 'data2'),
('data1', 'data2'),
('data1', 'data2'),
('data1', 'data2');
If you're using code... generate that above query using a for loop then run it.
For a more efficient approach consider a union as shown in: Is it possible to insert multiple rows at a time in an SQLite database?
insert into 'tablename' ('column1','column2')
select data1 as 'column1',data2 as 'column2'
union select data3,data4
union...
In sqlite you don't have network latency, so it does not really matter performance wise to issue many small requests toward the engine. For more reference about that you can read this page from the official documentation: https://www.sqlite.org/np1queryprob.html
But in write mode (insert or update), each individual query will have to pay the cost of an implicit transaction. To avoid that you need to gather your insert queries in an explicit transaction. Depending of your programming language, how you do that may vary. Here is a code sample on how to do that in go. I've simplified error code management, to have a better view of the gist.
tx, _ := db.Begin()
for _, item := range items {
tx.Exec(`INSERT INTO testtable (col1, col2) VALUES (?, ?)`, item.Field1, item.Field2)
}
tx.Commit()
If you detect an error in your loop instead calling tx.Commit() you need to call tx.Rollback() in order to cancel all previous writes to your database so that the final state is as if no insert query has been issued at all.

Simultaneously `CREATE TABLE LIKE` in AWS Redshift and change a few of columns' default values

Workflow
In a data import workflow, we are creating a staging table using CREATE TABLE LIKE statement.
CREATE TABLE abc_staging (LIKE abc INCLUDING DEFAULTS);
Then, we run COPY to import CSV data from S3 into the staging table.
The data in CSV is incomplete. Namely, there are fields partition_0, partition_1, partition_2 which are missing in the CSV file; we fill them in like this:
UPDATE
abc_staging
SET
partition_0 = 'BUZINGA',
partition_1 = '2018',
partition_2 = '07';
Problem
This query seems expensive (takes ≈20 minutes oftentimes), and I would like to avoid it. That could have been possible if I could configure DEFAULT values on these columns when creating the abc_staging table. I did not find any method as to how that can be done; nor any explicit indication that is impossible. So perhaps this is still possible but I am missing how to do that?
Alternative solutions I considered
Drop these columns and add them again
That would be easy to do, but ALTER TABLE ADD COLUMN only adds columns to the end of the column list. In abc table, they are not at the end of the column list, which means the schemas of abc and abc_staging will mismatch. That breaks ALTER TABLE APPEND operation that I use to move data from staging table to the main table.
Note. Reordering columns in abc table to alleviate this difficulty will require recreating the huge abc table which I'd like to avoid.
Generate the staging table creation script programmatically with proper columns and get rid of CREATE TABLE LIKE
I will have to do that if I do not find any better solution.
Fill in the partition_* fields in the original CSV file
That is possible but will break backwards compatibility (I already have perhaps hundreds thousands of files in there). Harder but manageable.
As you are finding you are not creating a table exactly LIKE the original and Redshift doesn't let you ALTER a column's default value. Your proposed path is likely the best (define the staging table explicitly).
Since I don't know your exact situation other paths might be better so me explore a bit. First off when you UPDATE the staging table you are in fact reading every row in the table, invalidating that row, and writing a new row (with new information) at the end of the table. This leads to a lot of invalidated rows. Now when you do ALTER TABLE APPEND all these invalidated rows are being added to your main table. Unless you vacuum the staging table before hand. So you may not be getting the value you want out of ALTER TABLE APPEND.
You may be better off INSERTing the data onto your main table with an ORDER BY clause. This is slower than the ALTER TABLE APPEND statement but you won't have to do the UPDATE so the overall process could be faster. You could come out further ahead because of reduced need to VACUUM. Your situation will determine if this is better or not. Just another option for your list.
I am curious about your UPDATE speed. This just needs to read and then write every row in the staging table. Unless the staging table is very large it doesn't seem like this should take 20 min. Other activity could be creating this slowdown. Just curious.
Another option would be to change your main table to have these 3 columns last (yes this would be some work). This way you could add the columns to the staging table and things would line up for ALTER TABLE APPEND. Just another possibility.
The easiest solution turned to be adding the necessary partition_* fields to the source CSV files.
After employing that change and removing the UPDATE from the importer pipeline, the performance has greatly improved. Imports now take ≈10 minutes each in total (that encompasses COPY, DELETE duplicates and ALTER TABLE APPEND).
Disk space is no longer climbing up to 100%.
Thanks everyone for help!

How to use a query in another query?

If I want to use a query in another query what should I use :
reference
duplicate
and what's the difference between the 2 functions ?
I'm not sure where you see "related" but the other two show up if you right-click a query in the Power Query Editor.
Duplicate is like copy and pasting a query. It creates another query with all of the same steps and code. It's independent and modifying one query doesn't affect the other.
Reference does not reproduce the query. The new query starts exactly where the referenced query ended and is therefore dependent on it. If you change the referenced query, then the new query is affected too.
Duplicate creates a new copy with all the existing steps. The new copy will be isolated from the original query. You can make changes in the original or the new query, and they will NOT affect each other.
Reference, on the other hand, is a new copy with only one single step: getting data from the original query.

Does a referenced query result in fetching the data again?

In Power query I have a query fetching data from tblSales and performing row filtering on amount>100.
I want to reuse this query for another query. There are 2 options i have: duplicate or reference.
In duplicate the steps from the query are brought into the new query.
Where as in reference, the new query references the original query.
I right clicked on the original query and selected reference. Then added additional row filters. This new query that is generated, does it reuse the data fetched by original query?
Yes it will run the query 'again'.
If you want to only load the data once, you need to wrap your base query in one of the Buffer functions.
Table.Buffer()
Binary.Buffer()
List.Buffer()
See here for more details
In this scenario, the data is loaded twice:
In this Scenario, the data is loaded once ( Table.Buffer() on the base table ):

Detecting delta records for nightly capture?

I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables different options can be used:
if a table explicitly contains a reference to the last change time, this can be used
if a table has guaranteed update characteristics (e.g. no in-place update and monotone ID values), this could be used. E.g.
read all records where ID is larger than the last processed ID
if the table does not provide intrinsic information about change time then one could maintain a copy of the table that contains
only the records processed so far. This copy can then be used to
compare the current table and compute the difference. SAP HANA's
Smart Data Integration (SDI) flowgraphs support this approach.
In my experience, efforts to try "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming and expensive than using the corresponding features of ETL tools.
It is possible to create a Log table and organize columns according to your needs so that by creating a trigger on your database tables you can create a log record with timestamp values. Then you can query your log table to determine which records are inserted, updated or deleted from your source tables.
For example, following is from one of my test trigger codes
CREATE TRIGGER "A00077387"."SALARY_A_UPD" AFTER UPDATE ON "A00077387"."SALARY" REFERENCING OLD ROW MYOLDROW,
NEW ROW MYNEWROW FOR EACH ROW
begin INSERT
INTO SalaryLog ( Employee,
Salary,
Operation,
DateTime ) VALUES ( :mynewrow.Employee,
:mynewrow.Salary,
'U',
CURRENT_DATE )
;
end
;
You can create AFTER INSERT and AFTER DELETE triggers as well similar to AFTER UPDATE
You can organize your Log table so that so can track more than one table if you wish just by keeping table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use seperate Log tables for each table.