In the PBI service, there is a refresh option for dataflows. What does a refresh operation for dataflows actually do?
A Power BI Dataflow is essentially a data storage component in its own right (internally it uses Azure Data Lake), and a refresh simply updates the data from the connected data source by applying all the predefined ETL steps.
The biggest advantage of Dataflows is that a Power BI Dataset can connect to more than one of them at a time, so you can define your ETL steps in one place only and feed the results into several datasets, avoiding code duplication.
Another advantage is that you can author your ETL code directly in the online service, without Power BI Desktop.
Be aware that refreshing a Dataset does not trigger a refresh of the connected Dataflows; those refreshes have to be scheduled separately.
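Because of that, a common pattern is to chain the two programmatically: trigger the dataflow refresh first, then the dataset refresh, from whatever scheduler you use. A minimal sketch against the Power BI REST API, where the workspace/dataflow/dataset IDs and the token are placeholders:

import requests

BASE = "https://api.powerbi.com/v1.0/myorg"
WORKSPACE_ID = "<workspace-guid>"   # placeholder
DATAFLOW_ID = "<dataflow-guid>"     # placeholder
DATASET_ID = "<dataset-guid>"       # placeholder
TOKEN = "<aad-access-token>"        # acquired from Azure AD for the Power BI API
headers = {"Authorization": f"Bearer {TOKEN}"}

# 1) Kick off the dataflow refresh (re-runs the stored ETL steps).
requests.post(f"{BASE}/groups/{WORKSPACE_ID}/dataflows/{DATAFLOW_ID}/refreshes",
              headers=headers, json={"notifyOption": "NoNotification"}).raise_for_status()

# 2) Once the dataflow has finished (in practice, poll its refresh transactions first),
#    refresh the dataset that reads from it.
requests.post(f"{BASE}/groups/{WORKSPACE_ID}/datasets/{DATASET_ID}/refreshes",
              headers=headers).raise_for_status()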
Dataflows are essentially the cloud version of M queries in Power Query / Query Editor. A Dataflow is the ETL layer that connects to the data sources, extracts and transforms the data, then stores the result as a table.
When you refresh a Dataflow, it's just like refreshing a query in a Power BI model. It re-connects to the underlying data sources, pulls in the data as it exists at the time of refresh, and stores the transformed result, which can then be used in data models.
Things are a bit more complex with DirectQuery, linked tables, and incremental refresh, which I'm ignoring here for the sake of simplicity.
Resources:
https://learn.microsoft.com/en-us/power-bi/transform-model/dataflows/dataflows-introduction-self-service
https://radacad.com/dataflow-vs-dataset-what-are-the-differences-of-these-two-power-bi-components
Related
I'm trying to establish if my planned way of working is correct.
I have two data sources: a MySQL and an MSSQL database. I need to combine these data sources and expose the data for Power BI to consume.
I've decided to use Azure Synapse Analytics for the ETL and would like to understand if there is anything in the process I can simplify or do better.
The process is as follows:
MySQL & MSSQL data is delta-loaded into ASA in Parquet format and stored in Azure Data Lake Storage Gen2.
Once the copy pipeline is complete, a subsequent data flow unions the data from the two sources and inserts it into MSSQL storage in ASA.
Power BI consumes from this workspace / data source.
I'm not sure whether I should be staging the data from the sources in ADLS Gen2, or whether I should just perform the transform and insert from the source straight into the MSSQL storage. Any thoughts or suggestions would be greatly appreciated.
The pattern that you're following is the data lake pattern, where data is moved between 3 zones:
Raw
Enriched
Curated
The Raw zone keeps an original copy of the data before transformation. The benefit of storing the data this way (as parquet files, here) is so that you can troubleshoot a problem with the transformation or create a different transformation to address a new need.
The Enriched zone is where you have done some transformation, like UNIONing your data, or providing some other clean up steps, maybe removing unneeded columns, correcting addresses, etc. You have done this by inserting the data into a SQL database, but this might also be accomplished by using views in the serverless pool, if the transformations are simple enough: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-views
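If the transformations really are that simple, a serverless view over the raw Parquet folders can replace the load into MSSQL storage entirely. A rough sketch run against the Synapse serverless SQL endpoint; the endpoint, database, storage account and folder paths below are illustrative placeholders:

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"   # serverless endpoint (placeholder)
    "Database=enriched;"
    "Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)
# One view that UNION ALLs the two raw Parquet folders written by the copy pipeline.
conn.execute("""
CREATE VIEW dbo.vw_orders AS
SELECT * FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/raw/mysql/orders/*.parquet',
    FORMAT = 'PARQUET') AS mysql_rows
UNION ALL
SELECT * FROM OPENROWSET(
    BULK 'https://<storage>.dfs.core.windows.net/raw/mssql/orders/*.parquet',
    FORMAT = 'PARQUET') AS mssql_rows;
""")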
The Curated zone is a place to transform your data into a form that BI applications will do well with, i.e. a star schema. Even if this is a very simple dataset, it will be well worth incorporating a date dimension, which will yield a lot of benefits in Power BI. The bottom line here is that Power BI is optimized to work with star schemas, so that's what you should give it.
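As a concrete illustration of that last point, a date dimension can be generated in a few lines and landed in the curated zone alongside the fact data. A sketch with pandas (table and column names are arbitrary, and writing Parquet needs pyarrow installed):

import pandas as pd

dates = pd.date_range("2015-01-01", "2030-12-31", freq="D")
dim_date = pd.DataFrame({
    "DateKey": dates.year * 10000 + dates.month * 100 + dates.day,  # e.g. 20240131
    "Date": dates,
    "Year": dates.year,
    "Quarter": dates.quarter,
    "MonthName": dates.strftime("%B"),
    "DayOfWeek": dates.day_name(),
    "IsWeekend": dates.dayofweek >= 5,
})
dim_date.to_parquet("dim_date.parquet", index=False)  # write next to the other curated tables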
You do not need to use data lake technologies to follow this pattern and still get the benefits. Whether what you are doing is good comes down to how everything performs versus how simple you can keep it.
Here's more on the topic: https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/best-practices/data-lake-overview
"Once the copy pipeline is complete, a subsequent data flow unions the data from the two sources and inserts it into MSSQL storage in ASA"
What is the use of the MSSQL storage? Is it only used by Power BI to create reports? If yes, then you can use ADLS Gen2 instead, as it will be cheaper (basically in line with what Mark said above about the "curated" zone).
Just one more thing to consider: Power BI can read data from both sources and then do the transformation within itself.
My scenario is:
I have 3 Dataflows:
Recent Data (from SQL Server. Refreshes 8 times a day)
Historical Data (does not refresh, just once initially)
Sharepoint Excel file Data
In my Dataset, I want to have a single Fact table that UNION ALLs all 3 sources.
Instead of an Append transformation, I want to create 3 custom partitions (well explained here: https://www.youtube.com/watch?v=6CRqdsLjHNA&t=127s).
I want to somehow tell the scheduled refresh to process only the Recent Data and Excel Data partitions.
The reasoning is that if I use Append, the dataset will reprocess the Historical Data on every refresh.
Now 2 questions:
How do I tell the scheduled refresh to process only two of the 3 partitions? (I can do it manually via the XMLA endpoint, but I need it scheduled)
What if I change something in my report (like visuals) - how do I deploy the changes without needing to recreate the partitions?
See Advanced Refresh Scenarios which includes Metadata Only Deployment, and Automate Premium workspace and dataset tasks with service principals.
The easiest way to generate the TMSL scripts for the advanced refresh scenarios is with SQL Server Management Studio (SSMS), which has wizards for configuring refresh and can generate the script for you. You then run the script through PowerShell cmdlets or ADOMD.NET, which in turn can be automated with Azure Automation or an Azure Function.
If you don't need full TMSL scripting capabilities, Power Automate has connectors that hit the Power BI REST APIs, but they don't currently support partition-based refresh.
But you can call the REST Refresh API directly through any programming language, or the Power Automate HTTP Action.
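For the two-out-of-three-partitions case specifically, the enhanced refresh flavour of that API lets you name the table and partitions to process, so the historical partition is simply never touched. A sketch where the workspace/dataset IDs, table and partition names and the token are placeholders (requires a Premium or PPU workspace):

import requests

WORKSPACE_ID = "<workspace-guid>"
DATASET_ID = "<dataset-guid>"
TOKEN = "<aad-access-token>"
body = {
    "type": "full",
    "commitMode": "transactional",
    "objects": [  # only these partitions are processed
        {"table": "Fact", "partition": "Recent Data"},
        {"table": "Fact", "partition": "Excel Data"},
    ],
}
requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}/datasets/{DATASET_ID}/refreshes",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=body,
).raise_for_status()  # returns 202; schedule this from Azure Automation, an Azure Function, etc.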
Also you should take a look at the new (Preview) Hybrid Tables feature which would enable you to have the recent data in a DirectQuery partition, while the historical data is in Import mode.
We have a requirement to generate reports in Power BI for real-time transactions. We have roughly 2,000,000 transactions flowing in per day, and we would like reports generated for at least that number of rows.
I understand that the push/streaming API has a limit of 200,000 rows for FIFO datasets and 5,000,000 rows for "none retention policy" datasets (link).
My questions are as follows:
If we create a streaming dataset via the push API in the Power BI service, what dataset is created by default in the background: FIFO or the none-retention-policy dataset?
For a none-retention-policy dataset, what happens when we cross the 5,000,000-row limit? If there is a failure, does that mean we need to delete old rows via an API call on a frequent basis? An example API call to do this would help. Deleting all rows is not an option, as the business would like reports such as KPIs over the last 24 hours, for example.
If we use Azure Stream Analytics to push data to Power BI, what are the limitations on data storage in Power BI in this case?
I'm afraid you have misunderstood the idea of Power BI. Power BI is not a database! Do not try to use it as such; there are better options out there. That's why you are having a hard time trying to work around these limitations.
What I'm trying to say is that you should store and process your data somewhere else and use Power BI only for visualizing it. In that case, if we say that you want real-time streaming updated every second, you need to send only 86,400 records per day (way less than the 200,000-record limit of a FIFO dataset). If you do not want a real-time streaming dashboard but a normal Power BI report, then why are you looking at push datasets at all? So collect your data somewhere, aggregate the results, and then push the aggregated data to Power BI.
And to answer your questions anyway:
If you create a dataset using the Power BI REST API without specifying the retention policy, it will create a push dataset with the "none" retention policy - basicFIFO must be enabled explicitly.
If you reach the limit of 5M rows, you will get an error when trying to push more rows to the dataset. Your only option is to delete all rows - there is no way to delete only some of them, because Power BI is not a database. That's why your data should be stored somewhere else and this is the idea behind basicFIFO retention policy and Power BI's streaming dataset.
Power BI's limits don't change based on the data source. It doesn't matter whether you are pushing data through Azure Stream Analytics or through a service you wrote yourself - the Power BI dataset is the same.
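To make those points concrete, here is a rough sketch of the relevant REST calls: creating the push dataset with basicFIFO requested explicitly, pushing (ideally pre-aggregated) rows, and the all-or-nothing delete. Dataset/table/column names and the token are placeholders:

import requests

BASE = "https://api.powerbi.com/v1.0/myorg"
HEADERS = {"Authorization": "Bearer <aad-access-token>"}

# 1) Create the push dataset; without ?defaultRetentionPolicy=basicFIFO you get "none".
dataset_def = {
    "name": "Transactions",
    "defaultMode": "Push",
    "tables": [{"name": "Sales", "columns": [
        {"name": "Timestamp", "dataType": "DateTime"},
        {"name": "Amount", "dataType": "Double"},
    ]}],
}
resp = requests.post(f"{BASE}/datasets?defaultRetentionPolicy=basicFIFO",
                     headers=HEADERS, json=dataset_def)
resp.raise_for_status()
dataset_id = resp.json()["id"]

# 2) Push rows (aggregate before pushing to stay well under the row limits).
requests.post(f"{BASE}/datasets/{dataset_id}/tables/Sales/rows", headers=HEADERS,
              json={"rows": [{"Timestamp": "2024-01-01T00:00:00Z", "Amount": 123.45}]}
              ).raise_for_status()

# 3) With the "none" retention policy, the only clean-up is deleting ALL rows.
requests.delete(f"{BASE}/datasets/{dataset_id}/tables/Sales/rows",
                headers=HEADERS).raise_for_status()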
As far as I know, deploying a Power BI report from Power BI Desktop results in two items: the report itself and the dataset. Deploying a new report that uses the same dataset will deploy the new report plus a second copy of the same dataset to the Power BI service. That is not what I want. To avoid confusing end users and others, I want only a single, unique dataset deployed.
I want to make use of Azure DevOps to deploy to the Power BI service in a Dev, Test and Prod way. The dataset will be an Azure Analysis Services data model, but the principle should be the same. I need to reduce the datasets to exactly one, and all reports must relate to that data model. I have heard of a REST API or PowerShell scripting that can come to the rescue here.
So if any of you have done this or know of a good article that describes how to do this, I would be grateful.
Regards Geir
The best option is to separate the Power BI report into a front end and a back end. If you are importing data, you create a file purely for the dataset, with no reports built on it. You can then create the reports using the service connection to the dataset, or in Power BI Desktop via the 'Power BI datasets' connection option. Both will use Live Connection mode, so you cannot add any other data sources to the model, for example a CSV file or a SQL database.
If you are connecting to an Azure Analysis Services data model, you can use this approach too; however, as it is only a connection, not a full imported dataset, having copies of it is not really an issue. Having copies of a dataset only matters when you are importing data; in that case it is best to move things to dataflows, use the same front-end/back-end method, and plan the scheduling of the dataflows and then the datasets.
You can use the REST API to move reports and the datasets they connect to, and to move items to new workspaces. If you have Power BI Premium, it includes a lifecycle tool to move items between dev/test/live workspaces.
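If you go the REST API route, the key call for keeping a single dataset is rebinding an uploaded report to the shared one. A minimal sketch, with placeholder IDs and token:

import requests

WORKSPACE_ID = "<workspace-guid>"
REPORT_ID = "<report-guid>"
SHARED_DATASET_ID = "<shared-dataset-guid>"

# Point the report at the one shared dataset, so no duplicate dataset is needed.
requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}/reports/{REPORT_ID}/Rebind",
    headers={"Authorization": "Bearer <aad-access-token>"},
    json={"datasetId": SHARED_DATASET_ID},
).raise_for_status()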
If you create a report in Desktop and choose a 'Power BI dataset' live connection to work over, then when you upload the report to the same workspace, it will only upload the report and connect it to the existing dataset.
https://radacad.com/power-bi-shared-datasets-what-is-it-how-does-it-work-and-why-should-you-care#:~:text=A%20shared%20dataset%20is%20a%20dataset%20that%20shared%20between%20multiple,tenant%20in%20Power%20BI%20environment.
We currently have a problem loading data when refreshing the report against the database, since it has too many records and it takes forever to load everything. The question is how I can load only the data from the last year, to avoid the long load. As far as I can see, the Cosmos DB connector dialog lets me enter a SQL query, but I don't know how to write one for this type of non-relational database.
Example
Power BI has an incremental refresh feature. You should be able to refresh the current year only.
If that still doesn't meet expectations, I would look at a preview feature called Azure Synapse Link, which automatically pulls all Cosmos DB updates out into analytical storage that you can query much faster from Azure Synapse Analytics, in order to refresh Power BI faster.
Depending on the volume of the data, you will hit a number of issues. The first is that you may exceed your RU limit, slowing down the extraction of the data from Cosmos DB. The second is transforming the data from JSON into a structured format.
I would try to write a query that specifies only the fields and items that you need. That will reduce the time spent processing and getting the data.
For SQL queries, it will be something like:
SELECT * FROM c WHERE c.partitionEntity = 'guid'
For more information on the CosmosDB SQL API syntax please see here to get you started.
You can use the query window in the Azure portal, or Azure Storage Explorer, to run the SQL commands and test the query, then move it to Power BI.
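If you would rather test from code than from the portal, a quick check with the azure-cosmos Python SDK looks roughly like this; the endpoint, key, database/container names and the partition value are placeholders:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")
# Same query as above, parameterised on the partition value.
for item in container.query_items(
        query="SELECT * FROM c WHERE c.partitionEntity = @pe",
        parameters=[{"name": "@pe", "value": "<guid>"}],
        enable_cross_partition_query=True):
    print(item)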
It is highly recommended to extract the data into a place where it can be transformed into a structured format, like a table or a CSV file.
For example, use Azure Databricks to extract the data, then turn the JSON into a table-formatted object.
You have the option of running Databricks notebook queries against Cosmos DB, or running Azure Databricks in its own instance. One other option would be to use the change feed with an Azure Function to send and shred the data into Blob Storage, and query it from there using Power BI, Databricks, Azure SQL Database, etc.
In the Source step of your query, you can filter based on the Cosmos DB _ts system property, like:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > 1609455599"
In this case, 1609455599 is the timestamp which corresponds to 31.12.2020, 23:59:59. So, only data from 2021 will be selected.
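Rather than hard-coding that number, you can compute the cutoff (for example, the start of last year) and splice it into the same query. A small sketch, reusing the XYZ alias from the example above:

from datetime import datetime, timezone

# Unix timestamp for 1 January of last year, UTC.
cutoff = datetime(datetime.now(timezone.utc).year - 1, 1, 1, tzinfo=timezone.utc)
ts = int(cutoff.timestamp())  # e.g. 1609459200 for 2021-01-01 00:00:00 UTC
query = f"SELECT * FROM XYZ AS t WHERE t._ts >= {ts}"
print(query)  # paste the result into the Power BI source query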