Efficiently Set Up Reference Table in Power BI

I have two 300MB base files on the network and several reference tables, which only use one or two columns from the base files. The first step of each reference table is to bring in the source table (all columns), and the second step is to remove all the unneeded columns. However, this is extremely inefficient: every time I do a data refresh after altering my table queries, it takes 5-10 minutes for every reference table to load the entire dataset.
Is there a more efficient way of doing this which would lead to faster load times? I am assuming that instead of the reference tables I could have a new table which selects only the one or two columns needed.
Thanks

What you can do is create your base table by importing the source (all columns) and naming it "BaseTable".
Now create a new blank query and type = BaseTable. After pressing Enter, you can remove the columns you do not need. Repeat this for all the other tables. That way the source is imported only once, and each table takes just the columns it needs from it without repeating the import of your source data.
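As a small illustration, the M for one of the referenced queries might look like this (the column names are placeholders for whichever one or two columns that table actually needs):

let
    // "BaseTable" is the query that imports the full source file once.
    Source      = BaseTable,
    // Keep only the columns this reference table actually uses.
    KeptColumns = Table.SelectColumns(Source, {"CustomerID", "SalesAmount"})
in
    KeptColumns

If BaseTable itself isn't needed in the model, you can also right-click it and untick "Enable load" so that only the trimmed tables end up loaded.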

Related

Transform two source DynamoDB tables into a new DynamoDB using AWS

So I have two source tables, let's call them table1 and table2, and a destination table, table3. Inside these tables there is information that needs to be extracted from columns of one table and columns of the other, and then combined to give entries for the columns of the new table.
Think of it as a complex transformation; for example:
partial text in column1 extracted from table1 and complete text in column1 of table2, combined into 4 rows of column1 (depending on the JSON of column1 in table1) in the new transformed table.
So it's not a 1-to-1 mapping between one table and another, but a 1-to-many mapping, where one source row, built from a mix of one row from each of the two source tables, translates to many rows in the new destination table.
Is this something that Glue jobs can accomplish, or am I better off just writing a throwaway Python script? You can assume that the size of the tables is not a concern.
Provided you plan to run this process at some frequency, this is a perfect use case for Glue. If this is just a one-off, Glue is still a fine choice, but Glue is primarily designed for repeated use.
In your Glue script I expect you will end up joining the two tables and then selecting the new result columns and rows by combining your existing columns. Typically the pattern to follow is to convert the dynamic frames (created by Glue) into PySpark data frames, work with PySpark from there, and convert back to a dynamic frame before writing to the database.
Note that depending on your design you may not need to add rows; it of course depends on the outcome you are seeking, but DynamoDB does support some nifty hierarchical approaches that may remove the need for multiple rows.
If you have more specific examples of schema and the outcomes you are seeking, I could show you a bit of example code.

Power BI - how to load from a source based on min/max values of another already-loaded source

I am attempting to build a data model in Power BI from a data mart (star schema) in SQL Server. This data mart has a fact table and several dimension tables. One of the dimension tables is a date table. I want to load all rows from the fact table. However, I only want to load a subset of the date table. In particular, I want those dates (rows) between the min and max dates in my fact table. This way, when I create slicers and such, I don't have unnecessary dates appearing.
In other BI tools (e.g., Qlik Sense), the usual solution is to first load the fact table into memory, compute and load its min/max dates into another table (also in-memory), set variables from this other table, load the date dimension table (into memory) based on the min/max variables, and finally drop the temporary table from memory so that it doesn't stay in the model and cause problems. This seems like the most efficient solution to me. It only reads the required (as opposed to unnecessary) data from the source dimension table, it doesn't need to perform any joins in the source, and it only reads each table once (as opposed to 2+ times).
How can I achieve this in Power BI? Or, more importantly, is this solution method even possible in Power BI?
I found this solution, but it seems inefficient, as it creates 2 queries (instead of just 1) for the min/max and, moreover, it performs the dimension table filtering after all rows have already been fetched from the source. (In my particular case, this isn't too bad. But, it could be problematic in other situations in which my dimension table is large.)
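For illustration only, the load-then-filter pattern described above could be sketched in Power Query M roughly as follows; the server, database, table, and column names are assumptions, and whether the date filter folds back to SQL Server will depend on the source:

let
    // Assumed connection to the data mart and navigation to the date dimension.
    Source   = Sql.Database("MyServer", "MyDataMart"),
    DimDate  = Source{[Schema = "dbo", Item = "DimDate"]}[Data],
    // Min/max dates taken from the already-defined fact query (here called FactSales).
    MinDate  = List.Min(FactSales[OrderDate]),
    MaxDate  = List.Max(FactSales[OrderDate]),
    // Keep only the dimension rows covered by the fact table.
    Filtered = Table.SelectRows(DimDate, each [Date] >= MinDate and [Date] <= MaxDate)
in
    Filtered

Note that referencing FactSales this way can cause Power Query to evaluate the fact query again during refresh, so this is a sketch of the pattern rather than a guaranteed single-read solution.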

How to create a Fact table from multiple different tables in Pentaho

I have been following a tutorial on creating a data warehouse using Pentaho Data Integration/Kettle.
The tutorial is based on a CSV file, but I am practicing with the Northwind database and PostgreSQL, and I am trying to figure out how to select values from more than one table and then output them into a single table.
My ETL process goes like this: I have several stages for each table; values are selected from each table and stored in a stage table for each table in the database. From there I have my dimension tables set up, but I am trying to figure out the step between the stages and the dimensions, which is where I need to select the values that update the dimension tables.
I have several stages set up for each of my tables. At this point I am not sure if I should create a separate values table for each table or a single values table. Any help would be greatly appreciated. Thanks
When I try to select values from multiple tables I get an error that says "we detected rows with varying number of fields". It seems I would need to create separate tables with
In Kettle, the metadata structure of the data stream cannot change. As such, if row 1 has 3 columns (one integer and two strings, for example), all rows must have the same structure.
If you're combining rows coming from different sources, you must ensure the structure is the same. That error is telling you that some of the incoming streams of data have a different number of fields.

Power BI - how to manage unrelated dimensions

I'm attempting to create a shared date dimension between two fact tables in Power BI, based on a relational data source.
Currently, if I include an unrelated dimension in the report, I get numbers duplicated across multiple rows, where they don't really apply.
I'm wondering if there is any way to tell Power BI that certain dimensions cannot be used with certain fact tables, similar to using IgnoreUnrelatedDimensions in SSAS.
Currently the only solution I can find is to create a separate date dimension, so that the two fact tables have no relationship that could be used to join them; however, this would mean forfeiting the ability to do any time-based comparisons.
Create a combined view of the fact tables with only the compatible columns, to be used for time-based comparisons:
In Query Editor, create new queries for your fact tables by referencing, i.e. right-click the original query and select "Reference". Then in those "copies" cut out the incompatible dimensions.
Rename columns to align terminology (e.g. Sales Date ==> Transaction Date, Payment Date ==> Transaction Date).
Use the "Merge Queries" function to combine the copies using a Full Outer Join.
Join this merged view to your date dimension.
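A rough M sketch of those steps, assuming the two referenced and trimmed copies are called SalesRef and PaymentsRef (all query and column names here are placeholders):

let
    // Align terminology so both copies expose a "Transaction Date" column.
    Sales    = Table.RenameColumns(SalesRef, {{"Sales Date", "Transaction Date"}}),
    Payments = Table.RenameColumns(PaymentsRef, {{"Payment Date", "Transaction Date"}}),
    // Equivalent of the "Merge Queries" step with a Full Outer join.
    Merged   = Table.NestedJoin(Sales, {"Transaction Date"}, Payments, {"Transaction Date"}, "Payments", JoinKind.FullOuter),
    // Pull through whichever payment measures are still needed.
    Expanded = Table.ExpandTableColumn(Merged, "Payments", {"Payment Amount"})
in
    Expanded

The result can then be related to the shared date dimension on Transaction Date.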

Power Query Formula Language - Detect type of columns

In Power BI, I've got some query tables generated from imported data. All the data comes in as type 'Any', and I'm trying to automatically detect the type of the data in each column.
Some of the queries generate tables with columns based on the incoming data - I don't know what the columns are going to be until the query runs and sets up the table (the data comes from an Azure blob). As I will have quite a few tables to maintain, whose columns can change (possibly with new columns being added) on any data refresh, it would be unmanageable to go through all of them each time and press 'Detect Data Type' on the columns.
So I'm trying to figure out how I can do a 'Detect Data Type' in the query formula language, to attach to the end of the query that generates the table columns. I've tried grabbing the first entry in a column and doing Value.Type(column{0}); however, this seems to come out as 'Text' for a column which has integers in it. Pressing 'Detect Data Type' does, however, correctly identify the type as 'Whole Number'.
Does anyone know how to detect a column's entry types?
P.S. I'm not too worried about a column possibly holding values of different data types
You seem to have multiple issues here, and your solution will be fragile; there's a better way. But let's first deal with column type detection.

Power Query uses the 'any' data type as its go-to data type. You can write a function that samples the rows of a column in a table, does a best-match data type detection, and then explicitly sets the data type of the column. This is probably messy and tricky, since you need to do it once per column. This might be workable for a fixed schema, but for a dynamic schema you'll run into a couple of things very quickly. First, you'll need to write some crazy PQ code to list all the columns and run your function on each. This will work the first time, but might break in subsequent refreshes because data model changes are not allowed during refresh. If you're using a tool like Power BI Desktop, you'll be able to fix things up. If you publish your report to the Power BI service, you'll just see refresh errors.
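As a rough illustration of that sampling idea, a helper along these lines could guess a type from the first non-null value in each column and apply all the guesses with Table.TransformColumnTypes; the set of types tried and the detection order are assumptions, not a robust implementation:

let
    // Guess a type from a single sample value; a best-effort sketch only.
    GuessType = (value as any) as type =>
        if value = null then type text
        else if not (try Number.From(value))[HasError] then type number
        else if not (try Date.From(value))[HasError] then type date
        else type text,
    // Build one {column name, type} pair per column and apply them in one step.
    DetectColumnTypes = (tbl as table) as table =>
        Table.TransformColumnTypes(
            tbl,
            List.Transform(
                Table.ColumnNames(tbl),
                (name) => {name, GuessType(List.First(List.RemoveNulls(Table.Column(tbl, name)), null))}))
in
    DetectColumnTypes

You could then call DetectColumnTypes as the last step of each query that builds a table, which is exactly the kind of per-column work (and refresh-time model change) described above.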
Dynamic schemas will suffer from the same data model change issue I mentioned above.
The alternative solution that you won't have these problems with is using a DirectQuery data source instead of Power Query. If you load your data into Azure SQL or a Tabular model, the reporting layer will pick up the updated fields automatically, so you don't have to work around this in PQ.