Is there a way to connect PBI to a Databricks cluster that is not running?

In my scenario, Databricks performs read and write transformations on Delta tables. We have PBI connected to the Databricks cluster, which needs to be running most of the time and is therefore expensive.
Given that the Delta tables live in a storage container, what would be the best option in terms of cost vs. performance for feeding PBI from the Delta tables?

If your dataset is under the maximum size allowed in Power BI (100 GB, I believe) and a daily refresh is enough, you can just load everything into your Power BI model:
https://blog.gbrueckl.at/2021/01/reading-delta-lake-tables-natively-in-powerbi/
If you want to save costs and don't actually need transactions, you can save the data as CSV in the data lake; then loading everything into Power BI and refreshing daily is really easy.
If you want to save costs but still query newly arriving data all the time using DirectQuery, consider Azure SQL. It has really competitive prices, starting from about 5 EUR/USD per month, and the integration with Databricks is also very good: just write to it in append mode and it does all the magic.
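A minimal PySpark sketch of that last option, assuming the SQL Server JDBC driver bundled with Databricks; the server, container, table and secret names are placeholders:

```python
# `spark` and `dbutils` are provided by the Databricks notebook; all names below are placeholders.
df = spark.read.format("delta").load("abfss://gold@mylake.dfs.core.windows.net/sales")

(df.write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=reporting")
   .option("dbtable", "dbo.sales")
   .option("user", "loader")
   .option("password", dbutils.secrets.get("kv", "sql-loader-pwd"))  # keep secrets out of code
   .mode("append")   # only add the newly arrived rows; Power BI then DirectQueries Azure SQL
   .save())
```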

Another option to consider is to create an Azure Synapse workspace and use serverless SQL compute to query the Delta Lake files. This is a pay-per-TB-consumed pricing model, so you don't have to keep your Databricks cluster running all the time. It's a great way to load Power BI import models.
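For illustration, here is a rough sketch of querying a Delta folder through a Synapse serverless SQL endpoint from Python with pyodbc; the endpoint, credentials and storage path are placeholders, and the workspace needs permission to read the storage account. Power BI would connect to the same endpoint, this is just the equivalent query from code:

```python
import pyodbc

# Serverless (on-demand) SQL endpoint of the Synapse workspace - placeholder values
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=master;UID=sqladminuser;PWD=<password>"
)

# Serverless SQL reads Delta Lake folders directly with OPENROWSET(..., FORMAT = 'DELTA'),
# billed per TB scanned rather than per cluster hour
sql = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mylake.dfs.core.windows.net/gold/sales/',
    FORMAT = 'DELTA'
) AS sales
"""
for row in conn.cursor().execute(sql):
    print(row)
```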

Related

Scalable Power BI data model on delta lake ADLS Gen2

Our current architecture for reporting and dashboarding is similar to the following:
[Sql Azure] <-> [Azure Analysis Services (AAS)] <-> [Power BI]
We have almost 30 Power BI Pro Licenses (no Premium Tier)
As we migrate our on-premises data feeds to ADLS Gen2 with Data Factory and Databricks (in the long run we will decommission the SQL Azure DBs), we are investigating how to connect Power BI to the Delta tables.
Several approaches suggest using SQL Databricks endpoints for this purpose:
https://www.youtube.com/watch?v=lkI4QZ6FKbI&t=604s
IMHO this is fine as long as you only have a few reports. But what if you have, say, 20-30? Is there a middle layer between the ADLS Gen2 Delta tables and Power BI for a scalable and efficient tabular model? How can we define measures, calculated tables, and relationships efficiently without the hassle of doing this from scratch in every single .pbix?
[ADLS Gen2] <-> [?] <-> [Power BI]
As far as I can tell, AAS DirectQuery is not supported in this scenario:
https://learn.microsoft.com/en-us/azure/analysis-services/analysis-services-datasource
Is there a workaround to avoid the use of Azure Synapse Analytics? We are not using it, and I am afraid we will not include it in the roadmap.
Thanks in advance for your invaluable advice.
Is there a middle layer between ADLS Gen2 delta tables and Power BI for a scalable and efficient tabular model?
If you want to build Power BI Import Models from Delta tables without routing through Databricks SQL or Spark, you can look into the new Delta Sharing Connector for Power BI. Or run a Spark job to export the model data to a Data Lake format that Power BI/AAS can read directly.
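As a rough illustration of the "export with a Spark job" option (all paths are placeholders, and `spark` is the Databricks notebook session):

```python
# Read the curated Delta table once...
df = spark.read.format("delta").load("abfss://gold@mylake.dfs.core.windows.net/dim_customer")

# ...and write it out as plain Parquet (or CSV) that Power BI / AAS can import straight
# from ADLS Gen2, so no cluster has to run at refresh time.
(df.coalesce(8)                       # a few reasonably sized files instead of many tiny ones
   .write
   .mode("overwrite")
   .parquet("abfss://export@mylake.dfs.core.windows.net/powerbi/dim_customer"))
```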
If you want DirectQuery models, Synapse SQL Pool or Synapse Serverless would be the path, as these expose the data as SQL Server endpoints, for which Power BI and AAS support DirectQuery.
How to define measures, calculated tables, manage relationships efficiently without the hassle of doing this from scratch in every single .pbix?
Define them in an AAS Tabular Model or a Power BI Shared Data Set.

What does refreshing a Dataflow in the PBI service actually do?

In the PBI service, there is a refresh option for dataflows. What does a refresh operation for dataflows actually do?
A Power BI Dataflow is much like a standalone data storage component (internally it uses Azure Data Lake), and a refresh simply updates the data from the connected data source by applying all the predefined ETL steps.
The biggest advantage of Dataflows is that a Power BI Dataset can connect to more than one of them at a time, so you can define your ETL steps in one place only and feed the results into several datasets, avoiding code duplication.
Another advantage is that you can author your ETL code directly in the online service, without PBIDesktop.exe.
Be aware that refreshing a Dataset does not trigger a refresh of the connected Dataflows; that has to be scheduled separately.
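If you want to chain the two yourself, one option is the Power BI REST API ("Refresh Dataflow" and "Refresh Dataset"). A minimal sketch, assuming you already have an Azure AD access token and the workspace/dataflow/dataset IDs (all placeholders):

```python
import requests

token = "<AAD access token>"          # placeholder - acquire via MSAL / a service principal
headers = {"Authorization": f"Bearer {token}"}
group = "<workspace-id>"

# 1) refresh the Dataflow: re-runs the ETL and updates the tables it stores
requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{group}/dataflows/<dataflow-id>/refreshes",
    headers=headers,
    json={"notifyOption": "NoNotification"},
).raise_for_status()

# 2) refresh the Dataset that reads from it (in practice, wait or poll until the
#    dataflow refresh has finished before triggering this call)
requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{group}/datasets/<dataset-id>/refreshes",
    headers=headers,
).raise_for_status()
```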
Dataflows are essentially the cloud version of M queries in Power Query / Query Editor. A Dataflow is the ETL layer that connects to the data sources, extracts and transforms the data, then stores the result as a table.
When you refresh a Dataflow, it's just like refreshing a query in a Power BI model. It re-connects to the underlying data sources, pulls in the data from those sources as they exist at the time of refresh, and stores the transformed data, which can then be used in data models.
Things are a bit more complex with DirectQueries, linked tables, and incremental refreshes, which I'm choosing to ignore for the sake of simplicity.
Resources:
https://learn.microsoft.com/en-us/power-bi/transform-model/dataflows/dataflows-introduction-self-service
https://radacad.com/dataflow-vs-dataset-what-are-the-differences-of-these-two-power-bi-components

BigQuery vs Cloud SQL for audit data

Which database is better for transactional data: Cloud SQL or Google BigQuery?
I have a requirement wherein, from multiple jobs, I need to load data into a single table. Which database will be better for this: Google BigQuery or Cloud SQL?
I know that in terms of cost-effectiveness Cloud SQL is the better choice, but are there any other pointers apart from this?
CloudSQL is a managed database for transactional loads (OLTP). It can be tweaked to work with OLAP (analytical); but it is intended to be a transactional database.
BigQuery is for analytical data (OLAP), that is, data that won't change. Think of it as data "at rest" that is not going to be modified.
If some of your transactions are not finalized (there are in-flight transactions, or your end-to-end process needs more steps), you need a transactional database, like the ones from Cloud SQL.
If your data won't change, use BigQuery.
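For the "multiple jobs loading into a single table" requirement on the BigQuery side, a minimal sketch with the official Python client; the project, dataset, table and GCS paths are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.audit.events"   # the single target table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # each job appends its rows
    autodetect=True,
)

# Every job (run in parallel or on a schedule) appends into the same table
load_job = client.load_table_from_uri(
    "gs://my-bucket/audit/2024-01-01/*.json", table_id, job_config=job_config
)
load_job.result()  # wait for the load to finish
```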

Connect BigQuery with Power BI - Best Practice and Cost-Effective Way

What is a cost-effective way to connect Google BigQuery with Power BI? Is an intermediate layer required between GCP and Power BI?
You can access BigQuery directly from DataStudio using a custom query or loading the whole table. Technically nothing is necessary between BigQuery and DataStudio.
Regarding best practices, if your dashboard reads a lot of data and is used constantly, it can lead to high costs. In that case a "layer" makes sense.
If this is your case, you could pre-aggregate your data in BigQuery to avoid a large amount of data being read many times by DataStudio. My suggestion is:
Create a process (it could be a scheduled query) that periodically aggregates your data and saves it to another table (see the sketch after this list)
In DataStudio, read your data from the aggregated table
These steps can help you reduce costs and make your dashboards load faster. The downside is that, if you are working with streaming data, this approach generally won't show the most recent records unless you run the aggregation process very frequently.
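A minimal sketch of that aggregation step with the BigQuery Python client (table and column names are placeholders; in practice this could simply be a BigQuery scheduled query):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="my-project.reporting.daily_sales_agg",            # small, cheap-to-read table
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,    # rebuild it on every run
)

sql = """
SELECT DATE(order_ts) AS order_date, country, SUM(amount) AS total_amount
FROM `my-project.raw.sales`
GROUP BY order_date, country
"""
client.query(sql, job_config=job_config).result()
# The dashboard then reads only daily_sales_agg, scanning far less data per view.
```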

Optimize data load from Azure Cosmos DB to Power BI

Currently we have a problem loading data when refreshing the report against the database, since it has too many records and it takes forever to load everything. The question is how I can load only the data from the last year, to avoid such long load times. As far as I can see, the Cosmos DB connection dialog allows me to enter a SQL query, but I don't know how to write one for this type of non-relational database.
Power BI has an incremental refresh feature. You should be able to refresh the current year only.
If that still doesn't meet expectations, I would look at a preview feature called Azure Synapse Link, which automatically replicates all Cosmos DB updates into an analytical store that you can query much faster in Azure Synapse Analytics, in order to refresh Power BI faster.
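For reference, reading the Cosmos DB analytical store from a Synapse Spark pool looks roughly like this; the linked service and container names are placeholders, and Synapse Link must be enabled on the account:

```python
# Runs inside a Synapse Spark notebook; `spark` is the provided session
df = (spark.read
        .format("cosmos.olap")                                  # analytical store, no RU impact
        .option("spark.synapse.linkedService", "CosmosDbLinkedService")
        .option("spark.cosmos.container", "orders")
        .load())

df.createOrReplaceTempView("orders")
spark.sql("SELECT COUNT(*) FROM orders WHERE orderDate >= '2021-01-01'").show()
```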
Depending on the volume of the data, you will hit a number of issues. The first is that you may exceed your RU limit, slowing down the extraction of the data from Cosmos DB. The second is transforming the data from JSON into a structured format.
I would try to write a query that specifies only the fields and items that you need. That will reduce the time spent processing and retrieving the data.
For SQL queries it will be something like:
SELECT * FROM c WHERE c.partitionEntity = 'guid'
For more information on the Cosmos DB SQL API syntax, see the documentation to get you started.
You can use the query window in Azure to run the SQL commands, or Azure Storage Explorer to test the query, then move it to Power BI.
What is highly recommended is to extract the data into a place where it can be transformed into a structured format, like a table or CSV file.
For example, use Azure Databricks to extract the data and then turn the JSON documents into a table-formatted object.
You have the option of running Databricks notebook queries against Cosmos DB, or using Azure Databricks in its own instance. Another option would be to use the change feed together with an Azure Function to shred the data into Blob Storage and query it from there, using Power BI, Databricks, Azure SQL Database, etc.
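A minimal sketch of the Databricks extraction route with the Azure Cosmos DB Spark 3 (OLTP) connector; it assumes the connector library is installed on the cluster, and the endpoint, key, database, container, column and path names are placeholders:

```python
# `spark` and `dbutils` are provided by the Databricks notebook; all values below are placeholders
cfg = {
    "spark.cosmos.accountEndpoint": "https://myaccount.documents.azure.com:443/",
    "spark.cosmos.accountKey": dbutils.secrets.get("kv", "cosmos-key"),
    "spark.cosmos.database": "mydb",
    "spark.cosmos.container": "orders",
}

df = spark.read.format("cosmos.oltp").options(**cfg).load()

# Flatten/shape the JSON documents, then persist them in a structured, columnar format
(df.select("id", "customerId", "orderDate", "total")
   .write.mode("overwrite")
   .format("delta")
   .save("abfss://staging@mylake.dfs.core.windows.net/cosmos/orders"))
```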
In the Source of your query, you can filter on the Cosmos DB _ts system property, like:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > 1609455599"
In this case, 1609455599 is the timestamp which corresponds to 31.12.2020, 23:59:59. So, only data from 2021 will be selected.
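As a quick check of that cutoff value: Cosmos DB stores _ts as epoch seconds in UTC, so the result depends on which timezone your "23:59:59" refers to; here a UTC+1 offset reproduces the number above:

```python
from datetime import datetime, timezone, timedelta

cutoff = datetime(2020, 12, 31, 23, 59, 59, tzinfo=timezone(timedelta(hours=1)))
print(int(cutoff.timestamp()))  # 1609455599
```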