I’m building a model in Azure Analysis Services. The model should contain only data for the last 3 months and is processed every day.
I have a separate date dimension that has a relationship with a fact table on a date key. I’m using Power Query to load only the last 3 months into the date dimension. In the query that loads the fact table I used Table.NestedJoin to load only the rows that have a matching value in the date table.
When I do this, processing the model takes forever. After some troubleshooting I saw that the query Analysis Services uses to retrieve data from the SQL database retrieves all rows. So, am I correct in saying that AS loads all the data before it merges the rows? Is there a way to change this? Or is there a better way to achieve my goal?
Kind regards,
Joins are super slow in Power Query. You should avoid them if you can do the join in the data source instead, or use normal relationships in the data model.
Also, you can set up the date dimension in DAX and dynamically populate it to contain only the dates present in the fact table.
As for all of the data being loaded, it could be because the join prevents the query from folding: the data is fetched as-is, and only then does Power Query apply the transformations (the join).
You can modify the query in the Power Query Editor / Advanced Editor to add a WHERE clause directly to the source query.
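For example, a minimal sketch of what the fact table query could look like so that the 3-month filter folds back to the source as a WHERE clause instead of being handled by a join in Power Query (the server, database, table and column names are assumptions; adjust them to your model):

    let
        // Assumed SQL source and fact table names
        Source    = Sql.Database("myserver.database.windows.net", "MyDW"),
        FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],

        // Cutoff for the last 3 months, expressed as an integer yyyymmdd date key (assumed format)
        Cutoff    = Date.AddMonths(Date.From(DateTime.LocalNow()), -3),
        CutoffKey = Date.Year(Cutoff) * 10000 + Date.Month(Cutoff) * 100 + Date.Day(Cutoff),

        // A plain comparison on a source column can fold to "WHERE DateKey >= ..." at the source,
        // whereas the Table.NestedJoin against the date dimension was evaluated locally
        Filtered  = Table.SelectRows(FactSales, each [DateKey] >= CutoffKey)
    in
        Filtered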
I have 3 OLTP databases, all using the same database schema. Each db represents one department.
I am exploring Power BI as a solution for reporting at the company level, so all departments combined.
What is the approach to combine data from multiple dbs into a data warehouse? For example - do I need SSIS to combine the 3 dbs into 1 data warehouse?
Another option could be to have one shared dataset per database, and then have the final report connect to and combine multiple live datasets. Or is there another way to do this with Power BI?
Are there any reference links, in case someone has done this before?
"Or is there another way with Power BI?"
Yes. Simply create a single import model and load data from all three databases into it. So for each table in your Power BI model you would have three source queries set not to load into the model, and you would append them in a query that is used to load your model. See e.g.: https://learn.microsoft.com/en-us/power-query/append-queries
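A minimal sketch of that pattern, assuming three SQL Server sources with the same schema (all server, database and column names below are placeholders):

    // In practice each of the three source steps would be its own query with "Enable load"
    // turned off, and only the final appended query would load into the model.
    let
        CustomerDeptA = Sql.Database("Server1", "DeptA_OLTP"){[Schema = "dbo", Item = "Customer"]}[Data],
        CustomerDeptB = Sql.Database("Server2", "DeptB_OLTP"){[Schema = "dbo", Item = "Customer"]}[Data],
        CustomerDeptC = Sql.Database("Server3", "DeptC_OLTP"){[Schema = "dbo", Item = "Customer"]}[Data],

        // Optionally tag each row so you can still tell the departments apart after the append
        TaggedA  = Table.AddColumn(CustomerDeptA, "Department", each "Dept A", type text),
        TaggedB  = Table.AddColumn(CustomerDeptB, "Department", each "Dept B", type text),
        TaggedC  = Table.AddColumn(CustomerDeptC, "Department", each "Dept C", type text),

        // Table.Combine is what the Append Queries UI generates behind the scenes
        Customer = Table.Combine({TaggedA, TaggedB, TaggedC})
    in
        Customer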
Best practice would be to:
Extract the data into a single database (DWH or reporting schema)
Build the necessary items there for your data model, be it a reporting schema or star/snowflake schemas
Connect Power BI to that schema.
Combining datasets is going to be tricky, since you may have the same measures in each of the datasets. Combining in the database, with added columns to indicate the department, is the best option in terms of supporting updating/adding/removing items. For example, if the schema changes in the databases you change it in one place, not in three datasets. The toolset in the database/SSIS is also better suited to the heavy lifting of moving the data into one location.
You would use SSIS to extract the data if it is on-prem, or Azure Data Factory for Azure databases. Extract to a staging schema, then convert/transform the data into its final form under a new schema that defines what it is (facts/dimensions, or other schema names such as reporting, depending on the data model you wish to build). Most of this is covered by the standard ETL pattern of OLTP to an OLAP database.
I have a SQL database linked in which I have the complete history of products and users. I want the user to be able to select a year on a slicer and have the data automatically show active products, expired products and new products added in that year (or snapshot).
Is there a way this can be done? I am not able to find a measure that does this for me.
I recommend creating a date dimension table first; I usually call mine Calendar. Please read this useful post by Radacad, which will show you how to create one: https://radacad.com/power-bi-date-or-calendar-table-best-method-dax-or-power-query
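If you go the Power Query route described in that post, a minimal sketch of a Calendar query could look like this (the start date and the extra columns are just examples; extend it with whatever attributes you need for slicing):

    let
        // Assumed date range - in practice derive StartDate/EndDate from your product dates
        StartDate = #date(2015, 1, 1),
        EndDate   = Date.From(DateTime.LocalNow()),
        DayCount  = Duration.Days(EndDate - StartDate) + 1,

        Dates     = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
        Calendar  = Table.FromList(Dates, Splitter.SplitByNothing(), {"Date"}, null, ExtraValues.Error),
        Typed     = Table.TransformColumnTypes(Calendar, {{"Date", type date}}),

        // A couple of typical attribute columns
        WithYear  = Table.AddColumn(Typed, "Year", each Date.Year([Date]), Int64.Type),
        WithMonth = Table.AddColumn(WithYear, "Month", each Date.Month([Date]), Int64.Type)
    in
        WithMonth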
Once it's done, create relationships between your fact tables and the Calendar table on the key dates of when your products are active or expired (I'm making a huge assumption that that's what your tables store).
Your calendar table will then act as a single time/date point of truth and should be used to slice and dice your fact table.
Hope this helps!
Connection type: DirectQuery to multiple sources, so there is limited DAX available, especially in the Power Query load.
Data model question: the data model is not a perfect star schema, but there is an attempt to separate tables into business processes and lookup tables. There are probably a few issues to discuss with the current data model, but I only have one question at this time.
My current goal is to generate a single summarised customer table to replace the current two tables, with some measures I need such as the number of app customers, the total number of customers, the date a customer first accessed the app, etc.
So I cannot merge the two customer tables and add calculated columns and measures at the import stage, as Power BI does not support or allow it, and SQL is out as I am using DirectQuery. My plan is to create a summarised customer table using the DAX SUMMARIZE function on the front-end visual page, containing only the app customers, and then add measures like the total number of customers, etc. Is this best practice, or is there a better way of approaching this? I understand you would ideally do this in SQL or Power Query, but in these circumstances I think this is the best way; I just wanted a second view.
Is there a reason to use DirectQuery over Import? If you are in Import mode, you can easily append the two client tables together in Power Query.
Treb Gatte, Power BI MVP
Google BigQuery (BQ) allows you to create a partition using timestamp or date types only.
99% of my data has a very clear selector, idClient. I've created views for my customers with a predicate like idClient = code, so privacy is guaranteed.
The problem with this strategy is that there are customers with 5M rows and others with 200K, and as BQ does not have indexes, their queries are always scanning each other's data (and the costs are rising).
I intend to create a timestamp field where each customer gets a different fixed timestamp value, repeated for every insert into each customer-sensitive table, so that I can filter by that timestamp just as I would with a standard ID.
Does this make any sense? If BQ were an indexed database I'd be concerned about skewed data, but as it always does a full table scan, I think I'd have only benefits and no downsides.
The solution to your problem is to add a clustering field to your table, which is the equivalent of an index in other databases.
This link provides the basics of how to use a clustering field:
Clustering can improve the performance of certain types of queries, such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns.
Note: when using a clustering field, the BigQuery dryRun doesn't show the cost improvement, which can only be seen post-execution.
I have multiple databases which have similar schemas. I need to combine the data from all these databases and do reporting over it.
For example -
Customer table in AdventureWorks in Server 1
Customer table in AdventureWorks in Server 2
Customer table in AdventureWorks in Server 3
Now in Power BI I will have a dataset called Customer. The data for this needs to come from all three servers mentioned above. I know I can do it using merge queries in Power BI, but that means I will have to pull the data from the different servers as different datasets in Power BI and merge them, which I want to avoid.
Do let me know if there is any other way to do this.
You must use Edit Queries -> Append Queries (or Append Queries as New) to combine the data from these data sources into one table.
I find Append Queries as New more meaningful in this case. It will create a new (fourth) table, which will contain all the rows from the other three tables. Select Three or more tables and select all Customer tables from your data sources.
Then use this new table in the report.
You should not worry that this duplicates the data and increases the size of the model; the data itself will be reused and not copied twice.
Your model is the right place to combine data from different data sources. Maybe this should not be your report, but one dataset shared between your reports, or an SSAS model. Combining the data at the SQL Server level (e.g. a partitioned view over the 3 databases) is not a good idea. Combining them in your model also gives you the option to combine data from different types of data sources, e.g. Customers 1 is in SQL Server, Customers 2 is a .csv file in OneDrive, and Customers 3 comes from a web service call.
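To illustrate that last point, here is a minimal sketch of appending customer rows that come from three different kinds of sources (all URLs, names and the JSON shape are made up for the example):

    let
        // Customers 1: SQL Server (assumed server/database/table names)
        Customers1 = Table.SelectColumns(
            Sql.Database("sqlserver01", "Sales"){[Schema = "dbo", Item = "Customer"]}[Data],
            {"CustomerId", "CustomerName"}),

        // Customers 2: a .csv file shared from OneDrive (assumed direct-download URL)
        CsvRaw     = Csv.Document(Web.Contents("https://contoso-my.sharepoint.com/personal/someone/Documents/customers.csv")),
        Customers2 = Table.SelectColumns(Table.PromoteHeaders(CsvRaw), {"CustomerId", "CustomerName"}),

        // Customers 3: a web service returning a JSON array of customer records (assumed endpoint)
        JsonRaw    = Json.Document(Web.Contents("https://api.example.com/customers")),
        Customers3 = Table.SelectColumns(Table.FromRecords(JsonRaw), {"CustomerId", "CustomerName"}),

        // Append all three into the Customers table used by the model
        Customers  = Table.Combine({Customers1, Customers2, Customers3})
    in
        Customers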