I would like to know whether my data modeling is working for Power BI.
The dataset I am using is training course for students and corporate. The original data has 3 tables separated by individual program. The purpose of my visualization is to analyze the 3 programs from all students in a single dashboard.
Here is the original data after being imported to Power BI:
Here is the data pre-processing:
Remove unneeded column
DPT table
Remove column – No, Date, Quarter
DTP table
Remove column – Count, Email, Date
LLD table
Remove column – Email, To calculate, Learning Hours
Rename column name & impute missing value with “Not given”
DPT table
Trainee = Name, Training Provider = Provider, Course name = Course, Focus area = Domain
DTP table
Participant Name = Name, Event/Training Name = Course, Training providers = Provider
Create new column and impute them with “No given” and put in same
position (for append tables later)
DTP table
Level
LLD table
Company, Provider, Level
Create a new column called Program and impute value as it program
name for each row.
Post cleaned table:
After appending the 3 tables and calling it Master:
. Duplicate the Master table to create Student, Provider and Program table. In each table remove irrelevant columns, remove duplicates and create unique ID.
Final data model:
The focus is the Program, Provider and Student tables. The rest of the tables will be deactivated the relationship when creating calculated columns and measures before I make any correction to the data model.
Is there any proper approach to build the data model?
From my data model in the last picture, does it mean that the Provider table is a fact table while the Student and Program tables are dimensions?
I agree with removing unneeded columns, renaming columns for a better look, and substituting 'Not Given' in place of NULLS (Caution here: Measures and Dimensions handle nulls differently, for dimensional values substituting is okay)
If modeling using PowerBI is a must, then the following strategy should happen:
The dimensions can be Students, Programs, Providers
A Factless Fact table (FactProgram or something like that). It will have dimensional keys to Students, Programs, Providers ( and additional measures that you can create or can be taken from Master )
Remove unnecessary columns from the dimensions, so that the Remove Duplicates will give you what you want. For instance, Student and Program currently have same columns coming from master (Company, Course, Domain, Level, Program, Provider). Make it clear which columns belong to which dimensions and optionally create new dimensions (maybe, DimCompany)
Related
I have a data base with the following tables:
Customers, Invoices, Salesman, Target.
The ones concerned about my question are Customers, Invoices.
There are customersIDs used in the Invoices but doesn't exist in the Customers table.
If I used only the customers from Customers Table, my customer dimension would be incomplete.
My solution is to append these IDs from Invoices to Customers and fill other columns in the Customers table with nulls.
I don't know if this is the best approche according to Kimball?
also, if it is a good solution, how can I add accomplish it with Power bi desktop?
Customers table: "generated Data"
Invoice table:
..... just a sample the table is thousands of rows.
There's two points here:
Firstly, (in import mode at least) PBI already creates the "blank row" for items present in your fact table but missing from your dimension table for precisely this scenario. If you don't need the granularity of each individual missing customer id, then you don't need to do anything.
Secondly, if you need to to retain that granularity then your approach is the correct one. The way to do this in Power Query is as follows:
Create a new query which takes your customer dimension table and does a left outer join on customer id with your invoice fact table.
Expand the newly joined table but retain only the new customer id column.
Remove all columns apart from the new customer id column.
Remove duplicates
You now have a list of missing customer ids. Ensure the column name is the same as the column name of you customer id in the customer dimension table. Append this to the original customer dimension query and the nulls will be filled in automatically for the missing columns.
Please keep in mind that It is Kimball, not Kimble.
There are 4 steps of DWH Methodology:
1) Understand Business Process (What your process is actually measuring?)
2) Deciding the grain (It means what every row in your fact table actually represents?)
3) Deciding Dimensions (Ask Where-What-Who-Where-How-HowMany-HowMuch to your grain declaration formed together with business processing)
4) Define Facts (Metrics)
According to this order, You define Dimension tables before building your fact tables: If your dimension table , Customer table in this case, is missing in terms of customers available in your fact table, My biggest biggest advice according to the DWH Dimensional Modeling is to set your customer table right!!! Define every piece of customer in your dimension table!!!! Then populate your Fact table with records:
[Customer ID] in Customer Table : PRIMARY KEY
[CustomerID] in Invoice Table : FOREIGN KEY
SQL and Power BI reacts very differently in your problem:
1) Power BI has no referential integrity concept: It adds a blank row to your dimension table in such a case.
2) SQL gives referential integrity error, and you can't even add rows to your fact table. I support SQL in this case personally!!!!
Finally: Use some ETL tool(SSIS, Talend, ODI or even Power Query) to make your dimension table as accurate as possible:
For example:
Do not leave any column value as null!
If an unknown date exists, put a default date value like '1900-12-31'
If an unknown textual property, put in keywords 'unknown','not available' etc..
Because dimensional table are sources of querying in SQL statements; and different SQL Vendors (SQL Server, Oracle, MySQL) has to deal with NULL values in a different way, and this cause problems in terms of performance wise!!
I'm trying to figure out the best way to build a relationship from a table that has records in a daily format (one record represents a single date) to a table that contains records in a date-range format (one record has a start-date and end-date, consequentially representing a period or range of dates).
Since my actual datafiles contain work-related information, I created 2 demo tables that contain dummy data that reflects the date columns in question.
Here is my DailyDate table
Here is my DateRanges table
Here is the current model view
I would like to be able to have a relationship built between the tables so that if I were to have 2 tables/matrices in the Report view, with one table showing the Daily Date Data and the other table showing the Ranged Data Data, I would be able to select a record in the Daily table and Power BI's highlight functionality would filter the records in the Ranged table so that only date ranges containing my selected date appear, and vice-versa (if possible).
For example, referencing this screenshot, if I were to select index 0 in the 'Daily Data Data' table, the 'Ranged Date Data' table should be filtered to only show records with index of 0 in the other table. If I were to select Index 2 (01/03/2022) in the Daily Date Data table then the Ranged Date Data table should be filtered to only show indices (0, 1).
In the model view, when trying to build this relationship, I can create a relationship from DailyDates.Date to DateRanges.StartDate and then from DailyDates.Date to DateRanges.EndDate; however, only a single relationship can be active so the highlight and slicer functionality will not give me the results I'm looking for.
As you can see from this demo, the datasets are small; however, my actual datasets contain around 50 million records in the Daily table and 10+ million records in teh Ranged table, so I'm hoping there can be an efficient method of getting this functionality that will not be too much of a load on memory.
Any advice into how to accomplish this would be greatly appreciated.
I have the following tables
I have several tables on coal consumption, natural gas, .....
I want to have a power BI model that can allow viewing these data with years (x-axis) while filtering on countries.
I have transposed the data to have a column with years and countries on the columns. But I cannot filter on countries.
Another thing I have done is to unpivot all the data and have four columns (country, year, value, fuel type) but the problem is that I could not manage to create suitable relationships between the tables as there is no primary key.
I have thought on putting all the data fro the different energy sources in one table. But how can I manage to link it to more data per country at year as well.
Another thing I have done is to unpivot all the data and have four columns (country, year, value, fuel type)
This is totally the right approach.
The next step required in minimum is to combine all unpivotted tables vertically into one EnergyConsumption table. You can utilize Append Queries command in Power Query Editor, or Table.Combine function in M language.
Additionally, you should consider to create three tables: Years, Countries, and FuelTypes, which have unique values of the respective dimensions, and establish one to many relationship with EnergyConsumption table.
I'm trying to create a column that has a total of values between 3 columns from 3 tables. How would I go about doing this?
The 2 tables are tables of values that share an id, and they are both linked to a table of account by Id. The goal is to add up 3 columns, and place it into a table grouped by the Id.
I've attempted summing them, trying to use the USERELATIONSHIP function, and creating a relationship between them. It seems to give very inaccurate results, as if it's summing all of the totals together, and passing them to each Id. That, or it won't let me use the column, as if it never existed.
EDIT: General Idea of what I'm trying to do (Lines should be pointing to Account's Id column, but I messed up the lines)
EDIT 2: I also forgot to illustrate or mention. There are more columns with information in each table that can't be summarized for each account preventing me from just merging the table together.
Make sure your data model looks like this (change names as you please, but the structure must be the same):
In dimensional modeling, your table "Account" is a Dimension, and both fee tables are Fact tables. The operation of combining data from multiple fact tables that share the same dimension is called "drill-across", and it's a standard functionality of Power BI.
To combine fees from these tables, you just need to use measures, not columns. This article explains the difference:
Calculated Columns and Measures in DAX
First, create 2 measures for the fees:
Fee1 Amount = SUM(Fee_1[Amount])
Fee2 Amount = SUM(Fee_2[Amount])
Then, create a third measure to combine them:
Total Fee Amount = [Fee1 Amount] + [Fee2 Amount]
Create matrix visual, and place Account_ID from the Account table on the rows. Then drop all these measures into the matrix values area, like this:
Result:
Of course, you don't have to have all these measure in the matrix, I just showed them for your convenience, to validate the results. If you remove them, the last measure still works:
I have two tables from Azure SQL in PowerBI, using direct query:
EMP(empID PK)
contactInfo(contactID PK, empID FK, contactDetail)
which have an obvious one-to-many relationship from EMP.empID to contactInfo.empID. The foreign key constraint is successfully enforced.
However I can only create a many-to-one relationship (contactInfo.empID to EMP.empID) in PowerBI. If I ever try the opposite, PowerBI always automatically converts the relationship to many-to-one (by swapping the from and to column), which prevents me from creating visuals. Does PowerBI think the two are equivalent?
Update:
What I'm doing is to just create a table in PowerBI showing the join results of these two tables. The foreign key constraint is contactInfo.empID REFERENCES EMP.empID, which is many-to-one. That should not be a problem, I guess, since I can directly query the join using SQL.
Please also suggest if I should create the foreign key in the opposite direction.
More info on failure to create visual
The exact error message is:
Can't display the data because Power BI can't determine
the relationship between two or more fields.
Version: 2.43.4647.541 (PBIDesktop)
To reproduce the error:
DB schema is as follows:
What I want is a table in PowerBI showing contact and sales info of am employee, that is, joining all the four tables. The error will occur when VALUES of the table visual contains "empName, contactDetail, contactType, productName", however, error will NOT occur if I only include "empName, contactDetail, contactType" or "empName, productName". At first I thought the problem may lie in the relationship between contactInfo and emp, but it now seems to be more complicated. I guess it may be caused by multiple one-to-many relationships?
Expanding my comments to make an answer:
Root of the Problem
In your data model, a single employee can have multiple contacts and multiple sales. But, there's no way for Power BI to know which contactDetail corresponds to which productName, or vice versa (which it needs to know to display them together in a table).
Deeper Explanation
Let's say you have 1 emp row, that joins to 10 rows in the sales table, and 13 rows in the contactInfo table. In SQL, if you start from the emp row and outer join to the other 2 tables, you'll get back (1*10)*(1*13) rows (130 rows in total). Each row in the contactInfo table is repeated for each row in the sales table.
That repetition can be a problem if you do something like sum the sales and don't realize a single sales record is repeated 13 times but might be fine otherwise (e.g. if you just want a list of sales and all associated contacts).
Power BI vs. SQL
Power BI works slightly differently. Power BI is designed primarily to aggregate numbers, and then break them down by different attributes. E.g. sales by product. Sales by contact. Sales by day. In order to do this, Power BI needs to know 100% how to divide numbers up between the attributes on your table.
At this point, I'll note that your database diagram doesn't include any obvious metrics that you'd use Power BI to aggregate. However, Power BI doesn't know that. It behaves the same whether you have metrics to aggregate or not. (And failing all else, Power BI can always count your rows to make a metric.)
Let's say that you have a metric on your sales table called Amt Sold. If you bring in the empName, productName, and Amt Sold columns, Power BI will know exactly how to divide up Amt Sold between empName and ProductName. There's no problem.
Now add in contactDetail. Using your database diagram, Power BI has no way of knowing how an Amt Sold metric in the sales table relates to a given contactDetail. It might know that $100 belongs to empID 27. And that empID 27 corresponds to 3 records in the contactInfo table. But it has no way of knowing how to divide up the $100 between those 3 contacts.
In SQL, what you'd get is 3 contacts, each showing the $100 amount sold. But in Power BI, that would imply $300 was sold, which isn't the case. Even equally dividing the $100 up would be misleading. What if the $100 belonged entirely to 1 contact? So instead, Power BI shows the error you're seeing.
My Recommendations
If you can, I recommend changing your data model before your bring it in. Power BI works best with a single fact table, which would contain your metrics (like amount sold). You then join this fact table to as many lookup tables as you like (e.g. customer, product, etc.), directly. This allows you to slice & dice your metrics with any combination of attributes from any of the lookup tables. I would recommend checking out the star schema data model and the concept of lookup tables: powerpivotpro.com/2016/02/data-modeling-power-pivot-power-bi
At the very least, you would want to flatten your tables (i.e. merge the contactInfo and sales tables into a single table before importing them into your data model.
It may be that Power BI isn't the best tool for what you're trying to accomplish. If all you want is a table showing all sales & contact info for an employee, without any associated metrics, a regular reporting tool + SQL query might be a better way to go.
Side Note: You can't reverse a many:one relationship to get past this error. The emp table contains one row per empID. Both the contactInfo and sales tables contain multiple rows with the same empID. This means the emp table is necessarily the "one" side of the relationship to both those tables. You can't arbitrarily change that.