My scenario is:
I have 3 Dataflows:
Recent Data (from SQL Server. Refreshes 8 times a day)
Historical Data (does not refresh, just once initially)
Sharepoint Excel file Data
In my Dataset, I want to have a single Fact table that "union all" all 3 sources.
Instead of Append transformation, I want to create 3 custom Partition (well explained here: https://www.youtube.com/watch?v=6CRqdsLjHNA&t=127s).
I want to somehow tell the schedule refresh to only process the Recent Data and Excel Data partitions only.
The reasoning is - if I do Append, then the dataset will each time process the Historical Data again and again.
Now 2 questions:
How do I tell the scheduled refresh to only process two of 3 partitions? (I can do it manually via XMLA endpoint, but I need it scheduled)
What if I change something in my report (like visuals) - how do I deploy the changes without needing to recreate the partitions?
See Advanced Refresh Scenarios which includes Metadata Only Deployment, and Automate Premium workspace and dataset tasks with service principals.
The easiest way to generate the TMSL scripts for the advanced refresh scenarios is with SQL Server Management Studio (SSMS) which has wizards for configuring refresh, and can generate the script for you. Then you use the script through PowerShell cmdlets or using ADOMD.NET, which in turn can be automated with Azure Automation or an Azure Function.
If you don't need full TMSL scripting capabilities, Power Automate has connectors that hit the Power BI REST APIs, but doesn't support partition-based refresh currently.
But you can call the REST Refresh API directly through any programming language, or the Power Automate HTTP Action.
Also you should take a look at the new (Preview) Hybrid Tables feature which would enable you to have the recent data in a DirectQuery partition, while the historical data is in Import mode.
Related
I'm trying to find the best approach to delivering a BI solution to 400+ customers which each have their own database.
I've got PowerBI Embedded working using service principal licensing and I have the PowerBI service connected to my data through the On Premise Data Gateway.
I've build my first report pointing to 1 of the customer databases. Which works lovely.
What I want to do next, when embedding the report, is to tell PowerBI, for this session, to get the database from a different database.
I'm struggling to find somewhere where this is explained, or to understand if this is even possible.
I'm trying to avoid creating 400+ WorkSpaces or 400+ Data Sets.
If someone could point me in the right direction, it would be appreciated.
You can configure the report to use parameters and these parameters can be used to configure the source for your dataset:
https://www.phdata.io/blog/how-to-parameterize-data-sources-power-bi/
These parameters can be set by the app hosting the embedded report:
https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/update-parameters-in-group
Because the app is setting the parameter, each user will only see their own data. Since this will be a live connection, you would need to think about how the underlying server can support the workload.
An alternative solution would be to consolidate the customer databases into a single database (just the relevant tables) and use row level security to restrict access for each customer. The advantage to this design is that you take the burden off of the underlying SQL instance and push it into a PBI dataset that is made to handle huge datasets with sub-second response times.
More on that here: https://learn.microsoft.com/en-us/power-bi/enterprise/service-admin-rls
In the PBI service, there is a refresh option for dataflows. What does a refresh operation for dataflows actually do?
A Power BI Dataflow is much like a data storage component on its own (internally using Azure Data Lake) and and a refresh will simply update data from the connected data source by applying all the predefined ETL steps.
The biggest advantage of Dataflows is that a Power BI Dataset can connect to more than one of them at a time so that you can define your ETL steps in one place only and feed the results into serveral datasets, avoiding code duplication.
Another advantage is probably that you can author your ETL code directly in the Online Service w/o a PBIDesktop.exe
When refreshing Datasets be aware that they do not trigger a refresh of the connected Dataflows. This has to be scheduled separately.
Dataflows are essentially the cloud version of M queries in Power Query / Query Editor. A Dataflow is the ETL layer that connects to the data sources, extracts and transforms the data, then stores the result as a table.
When you refresh a Datafow, it's just like refreshing a query in a Power BI model. It re-connects to the underlying data sources and pulls in the data from those sources as they exist at the time of refresh and stores the transformed data which can then be used in data models.
Things are a bit more complex with DirectQueries, linked tables, and incremental refreshes, which I'm choosing to ignore for the sake of simplicity.
Resources:
https://learn.microsoft.com/en-us/power-bi/transform-model/dataflows/dataflows-introduction-self-service
https://radacad.com/dataflow-vs-dataset-what-are-the-differences-of-these-two-power-bi-components
This is what I am trying to do: I have various SQL server databases with data. I created views in all of them. All views will need to be imported, and I specify their relationships. I want this to be refreshed nightly. I want to build various reports of the same data source.
Do I have to use a PowerBI desktop application to import data into PowerBI Report Service? [I have done this so far, but then can create new reports in the cloud on existing data. It would make sense to connect directly from PowerBI report service to my SQL servers.]
Once I uploaded data using a desktop application (as I have done so far), how can I view the data model in the report service once it is uploaded in the cloud?
In order to get routinely refreshed data I need to setup a gateway. Is the local PowerBI desktop application still involved in this process, or could I [in theory] delete the local desktop application that pushed the data in initially?
For your questions:
You have two options, use PBI Desktop to connect to the data using import/direct query, then load it to the service. You can use dataflows to create an import based on your views, but you will then need to create reports from those. Using dataflows, you'll have to set up a refresh schedule, then for the dataset(s) built on top of those, you'll have to set another refresh schedule.
You will be limited to the dataset sizes of 1GB for the workspace if importing data. You cannot use direct query on dataflows (unless you have enhanced compute with PBI premium). Once the dataset is loaded, you can then create new reports in the service or via desktop on top of that dataset. If possible it is recommended to use direct query.
To see the data model, you can use desktop to connect to PBI Service Dataset. This will connect in 'Live Connection' mode, and will be limited to that one dataset, you can't add others to it, Excel, CSV, SQL etc. You can also use Analyse in Excel, a plugin for Excel, that can connect to the data model. You can create new reports in the service for existing data models as well.
When creating the report in PBI Desktop it does not use the Gateway, you connect to your data sources as normal, then once you load the dataset to Power BI it will match the data sources in the file to the ones set up in the Gateway Admin settings. So you will still need PBI Desktop to create reports, but the gateway is there for the refreshing. The Desktop is not used in the process for refreshing. You could delete the workbook or application, but if you have to make changes, what will you refer to? (You could download a copy of the report from the service).+ It is easier to make changes in the desktop app, then the service, as there is a feature difference between dataset creation in the desktop vs service.
Currently we have a problem with loading data when updating the report data with respect to the DB, since it has too many records and it takes forever to load all the data. The issue is how can I load only the data from the last year to avoid taking so long to load everything. As I see, trying to connect to the COSMO DB in the box allows me to place an SQL query, but I don't know how to do it in this type of non-relational database.
Example
Power BI has an incremental refresh feature. You should be able to refresh the current year only.
If that still doesn’t meet expectations I would look at a preview feature called Azure Synapse Link which automatically pulls all Cosmos DB updates out into analytical storage you can query much faster in Azure Synapse Analytics in order to refresh Power BI faster.
Depending on the volume of the data you will hit a number of issues. First is you may exceed your RU limit, slowing down the extraction of the data from CosmosDB. The second issue will be the transforming of the data from JSON format to a structured format.
I would try to write a query to specify the fields and items that you need. That will reduce the time of processing and getting the data.
For SQL queries it will be some thing like
SELECT * FROM c WHERE c.partitionEntity = 'guid'
For more information on the CosmosDB SQL API syntax please see here to get you started.
You can use the query window in Azure to run the SQL commands, or Azure Storage Explorer to test the query, then move it to Power BI.
What is highly recommended is to extract the data into a place where is can be transformed into a strcutured format like a table or csv file.
For example use Azure Databricks to extract, then turn the JSON format into a table formatted object.
You do have the option of using running Databricks notebook queries in CosmosDB, or Azure DataBricks in its own instance. One other option would to use change feed to send the data and an Azure Function to send and shred the data to Blob Storage and query it from there, using Power BI, DataBricks, Azure SQL Database etc.
In the Source of your Query, you can make a select based on the CosmosDB _ts system property, like:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > 1609455599"
In this case, 1609455599 is the timestamp which corresponds to 31.12.2020, 23:59:59. So, only data from 2021 will be selected.
For context, we would like to visualize our data in google data studio - this dataset receives more entries each week. I have tried hosting our data sets in google drive, but it seems that they're too large and this slows down google data studio (the file is only 50 mb, am I doing something wrong?).
I have loaded our data into google cloud storage --> google bigquery, and connected my google data studio to my bigquery table. This has allowed me to use the google data studio dashboard much quicker!
I'm not sure what is the best way to update our data weekly in google cloud/bigquery. I have found a slow way to do this by uploading the new weekly data to google cloud, then appending the data to my table manually in bigquery, but I'm wondering if there's a better way to do this (or at least a more automated way)?
I'm open to any suggestions, and if you think that bigquery/google cloud storage is not the answer for me, please let me know!
If I understand your question correctly, you want to automate the query that populate your table, which is connected to Data Studio.
If this is the case, then you can use Scheduled Query from BigQuery. Scheduled query allow you to define a query which results can be inserted in a new table. Particularly you can specify different rules for repetition (minimum each 15 minutes) and execution, as well as destination writing options (destination table, writing mode: append, truncate).
In order to use Scheduled Queries your account must have the right permissions. You can have a look at the following documentation to better understand how to use Scheduled Query [1].
Also, please note that at the front end the updated data in the BigQuery table will be seen updated in Datastudio at each refresh (click on refresh button in Datastudio). To automatically refresh the front-end visualization you can use the following plugin [2] or automate the click on the refresh button through Browser console commands.
[1] https://cloud.google.com/bigquery/docs/scheduling-queries
[2] https://chrome.google.com/webstore/detail/data-studio-auto-refresh/inkgahcdacjcejipadnndepfllmbgoag?hl=en