Optimize data load from Azure Cosmos DB to Power BI

Currently we have a problem with loading data when refreshing the report against the database: it has too many records, so it takes forever to load everything. How can I load only the data from the last year, to avoid waiting for the full load? I can see that the Cosmos DB connector dialog lets me enter a SQL query, but I don't know how to write one for this type of non-relational database.

Power BI has an incremental refresh feature, which should let you refresh the current year only.
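For context, incremental refresh works by filtering on two reserved datetime parameters, RangeStart and RangeEnd, that you define in Power Query; the service then partitions the table and refreshes only the recent partitions. A minimal sketch against a Cosmos DB source, assuming hypothetical account, database, and container names (how efficiently the filter folds depends on the connector):
let
    // RangeStart/RangeEnd are the reserved parameters that drive incremental
    // refresh; convert them to epoch seconds to compare against _ts
    StartTs = Number.RoundDown(Duration.TotalSeconds(RangeStart - #datetime(1970, 1, 1, 0, 0, 0))),
    EndTs = Number.RoundDown(Duration.TotalSeconds(RangeEnd - #datetime(1970, 1, 1, 0, 0, 0))),
    Source = DocumentDB.Contents(
        "https://myaccount.documents.azure.com:443/",
        "mydatabase",
        "mycontainer",
        [Query = "SELECT * FROM c WHERE c._ts >= " & Number.ToText(StartTs)
            & " AND c._ts < " & Number.ToText(EndTs)]
    )
in
    Source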
If that still doesn't meet expectations, I would look at a preview feature called Azure Synapse Link, which automatically replicates Cosmos DB updates into an analytical store that you can query much faster in Azure Synapse Analytics, in order to refresh Power BI faster.

Depending on the volume of the data, you will hit a number of issues. The first is that you may exceed your RU limit, slowing down the extraction of the data from Cosmos DB. The second is the cost of transforming the data from JSON into a structured format.
I would try to write a query that specifies only the fields and items that you need. That will reduce the time spent processing and retrieving the data.
For SQL queries it will be something like:
SELECT * FROM c WHERE c.partitionEntity = 'guid'
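For instance, here is a sketch of how a projected query (the field names are hypothetical) plugs into the Cosmos DB connector in Power Query, so only those fields ever leave the database:
let
    Source = DocumentDB.Contents(
        "https://myaccount.documents.azure.com:443/", // hypothetical account
        "mydatabase",
        "mycontainer",
        [Query = "SELECT c.id, c.orderDate, c.total FROM c WHERE c.partitionEntity = 'guid'"]
    )
in
    Source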
For more information on the Cosmos DB SQL API syntax, please see here to get you started.
You can use the query window in Azure to run the SQL commands, or Azure Storage Explorer to test the query, then move it to Power BI.
What is highly recommended is to extract the data into a place where it can be transformed into a structured format, like a table or CSV file.
For example, use Azure Databricks to extract the data, then turn the JSON into a table-formatted object.
You have the option of running Databricks notebook queries against Cosmos DB, or running Azure Databricks in its own instance. Another option would be to use the change feed with an Azure Function to send and shred the data into Blob Storage, then query it from there using Power BI, Databricks, Azure SQL Database, etc.

In the Source step of your query, you can write a SELECT based on the Cosmos DB _ts system property, like:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > 1609455599"
In this case, 1609455599 is the timestamp corresponding to 31.12.2020, 23:59:59 local time (_ts is stored as Unix epoch seconds, UTC), so only data from 2021 will be selected.
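If you'd rather not hard-code the number, you can compute it in Power Query. A small sketch; note that because _ts is epoch seconds in UTC, the exact value depends on the time zone your cutoff is expressed in:
let
    // End-of-2020 cutoff; _ts is Unix epoch seconds (UTC)
    Cutoff = #datetime(2020, 12, 31, 23, 59, 59),
    Epoch = Duration.TotalSeconds(Cutoff - #datetime(1970, 1, 1, 0, 0, 0))
in
    Epoch // 1609459199 for a UTC cutoff; 1609455599 above corresponds to 23:59:59 UTC+1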

Related

Create PowerBI Datamart from Azure Analysis Service

I am trying to create a Power BI datamart from Azure Analysis Services. There is a data model available in Azure Analysis Services, and I can connect using the URL and database name. The data model has ~100 tables in it, with relationships already set up. So my question is: if I want to create a Power BI datamart from the Azure Analysis Services data model, do I need to use the Get Data option of the Power BI datamart, connect to Azure Analysis Services, and select tables and fields 100 times to get all the tables of the data model into my datamart? Is there an import function available where I can import all the tables in a single step?
Why do you want to copy data from AAS into a database?
The reason you find it difficult is that it's an odd thing to do. The query designer for AAS/SSAS generates MDX queries, which are intended for aggregate queries that return a handful of rows and are wholly unsuitable for extracting whole tables. If you try, the queries will just run forever and fail.
It is possible to extract data from AAS/SSAS tabular models, but you must use DAX, not MDX, so you need to use the Power Query ("Transform Data") window and the advanced editor.
Each query to load a table should look like this, e.g. to load the 'Customer' table:
let
    Dax = "evaluate Customer",
    Source = AnalysisServices.Database("asazure://southcentralus.asazure.windows.net/myserver", "mydatabase", [Query=Dax])
in
    Source
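If, as in the question, you have ~100 tables, one way to avoid hand-writing each query is to wrap the pattern in an M function and invoke it once per table name. A sketch, reusing the server and database names from the snippet above:
let
    // Builds "evaluate 'TableName'"; the single quotes keep table names
    // containing spaces valid in DAX
    LoadTable = (tableName as text) as table =>
        AnalysisServices.Database(
            "asazure://southcentralus.asazure.windows.net/myserver",
            "mydatabase",
            [Query = "evaluate '" & tableName & "'"]
        )
in
    LoadTable
You still end up with one query per table, but each one is a one-line invocation such as LoadTable("Customer").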

Power BI develop query with cached dataset

When developing a query in Power BI with a database data source, making any change causes the query editor to 'start from scratch' and re-query the database.
I am wondering if there is a workaround that avoids these repeated long waits, e.g. by downloading a temporary local flat file of the full dataset, developing the query offline against that file, and then swapping it out for the live database connection once you are happy with the query.
Importing the data once, exporting it as a CSV from a Power BI table visualisation, and re-importing it as a new data source would work, but maybe there's a simpler way?
Thanks
There are two approaches you can use.
If your database supports query folding, make the first step take just the top 200 records while you develop your query, then remove the FirstN filter once you're happy with it (see the sketch below).
Alternatively, load the entire table to the model, export it to a CSV using DAX Studio, develop your query against the CSV, and then switch back to the DB once you're happy with it.
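A minimal sketch of the first approach against a SQL Server source (the server, database, and table names are placeholders). Because Table.FirstN folds, the server only ever returns 200 rows while you develop:
let
    Source = Sql.Database("myserver", "mydb"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    // Development-only step: folds to a TOP 200 on the server; remove when done
    Sample = Table.FirstN(Orders, 200)
in
    Sample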

What does refreshing a Dataflow in the PBI service actually do?

In the PBI service, there is a refresh option for dataflows. What does a refresh operation for dataflows actually do?
A Power BI Dataflow is much like a standalone data storage component (internally using Azure Data Lake), and a refresh simply updates the data from the connected data source by applying all the predefined ETL steps.
The biggest advantage of Dataflows is that a Power BI Dataset can connect to more than one of them at a time, so you can define your ETL steps in one place only and feed the results into several Datasets, avoiding code duplication.
Another advantage is that you can author your ETL code directly in the online service, without PBIDesktop.exe.
Be aware that refreshing a Dataset does not trigger a refresh of the connected Dataflows; that has to be scheduled separately.
Dataflows are essentially the cloud version of M queries in Power Query / Query Editor. A Dataflow is the ETL layer that connects to the data sources, extracts and transforms the data, then stores the result as a table.
When you refresh a Dataflow, it's just like refreshing a query in a Power BI model: it re-connects to the underlying data sources, pulls in the data from those sources as it exists at the time of refresh, and stores the transformed result, which can then be used in data models.
Things are a bit more complex with DirectQuery, linked tables, and incremental refresh, which I'm choosing to ignore for the sake of simplicity.
Resources:
https://learn.microsoft.com/en-us/power-bi/transform-model/dataflows/dataflows-introduction-self-service
https://radacad.com/dataflow-vs-dataset-what-are-the-differences-of-these-two-power-bi-components

PowerBi - Connection Type (DIRECT QUERY or IMPORT DATA) Question

I am working on a Power BI project and I need some advice on the best way to approach it. I am tasked with creating a dashboard for employee metrics pulled from an on-site SQL Server database. The managers here are going to have access to the Power BI cloud, so I will end up uploading this to the cloud. There are 10 or so metrics that need to be shown on the dashboard, and we have 5000+ employees. My first thought was to dump all the metrics into a table and set the Power BI report to import the data, but that seems excessive and a waste of space, because not every manager needs access to every employee: they may only want to see one or two employees' metrics on the dashboard.
My second thought is to create (if this is possible) a stored procedure that takes an employee ID and outputs a dataset for Power BI to build a visual from. The dashboard would show a list of employees, and when a manager selects one, Power BI would call the stored procedure with that employee ID and render the returned dataset into a visual based on my measures. I guess I would set the Power BI report connection type to DIRECT QUERY?
Here are my questions:
Is this possible? Is it possible to do what I am describing in my second plan? Is this how DIRECT QUERY works?
If so, how does DIRECT QUERY work with the Power BI cloud?
What is the setup like? Do I just install and configure the Power BI data gateway, as with IMPORT DATA, and Power BI does the rest?
A couple of questions:
What is the frequency of data updates?
If it is a batch job, it is preferable to import the data from the source into the Power BI model and report on the imported data, because:
a) performance will be quicker;
b) there is no to-and-fro of data between the on-prem database and the cloud;
c) the source is not hit constantly.
So is the ask to have RLS, wherein managers should only see the employees under them?
That is much easier to implement with an imported model than with DirectQuery.
Also, you won't be able to pass parameters to stored procedures, and you can't execute them in DirectQuery mode. You can, however, create table-valued functions, which give you the ability to use table variables and perform other, more complex operations in DirectQuery mode (see the sketch after the link below).
You can refer to this for additional details:
https://community.powerbi.com/t5/Desktop/Can-i-call-Stored-Procedure-with-Direct-Query/m-p/267141#:~:text=%40Pallavi%20you%20won't%20be,nature%20in%20Direct%20Query%20mode.
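To make the table-valued-function idea concrete, here is a minimal sketch; dbo.EmployeeMetrics, the server, and the database are all hypothetical, and the hard-coded 42 illustrates that you still cannot wire a slicer selection into the function argument at runtime:
let
    // The SQL Server connector accepts a native statement (wrapped as a
    // subquery), so a TVF can act as the source table even in DirectQuery mode
    Source = Sql.Database("myserver", "mydb",
        [Query = "SELECT * FROM dbo.EmployeeMetrics(42)"])
in
    Source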

Which data is stored in Power BI - the one after query or the one after modelling?

In Power BI, we first get source data, then add multiple query steps to filter the data, remove columns, and so on. Then we add relationships and model the data.
We can have calculated columns, which are stored in the data, and measures, which are not stored but calculated on the fly.
Which data is stored in Power BI: the data after the query steps, or the data after modelling?
Power BI has three connection types for data access: import, DirectQuery, and live connection.
If we use the import method, the data is imported into the Power BI file using Power BI Desktop, so all of the data persists on disk, and during queries and refreshes it is held in memory. This is the data we use for querying and modeling. When we save our work, the file is saved with a .pbix extension, and the data is compressed and stored inside that file.
In DirectQuery mode, the data stays in the remote source. Each time we refresh or change a slicer, a request goes to the data source and brings the results back to Power BI. In this mode we don't import the data itself, but we can still create a data model.
A live connection is another method, supported for only a few data sources. In this mode the data is not stored in the computer's memory, and you can't create a data model in Power BI Desktop.
Power BI is very well documented. Many of the questions you've recently asked are answered in that resource, so please take a look. I get the feeling that you are using this community because you don't want to read the manual. I strongly suggest you take a look at the documentation, because everything we write in answer to your questions has already been written and documented, and SO is not meant to be a shadow user guide for well documented systems.
Depending on the data source you use, Power BI supports query folding, which pushes as much of the data processing as possible back to the source (for example, SQL Server).
If query folding is not possible because the source does not support it, the source data is loaded first and the query steps are then applied locally.
Read more about query folding here: https://learn.microsoft.com/en-us/power-bi/guidance/power-query-folding
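As a small illustration (the table and column names are hypothetical), both steps below fold into the generated SQL, so the filtering and column selection happen on the server; you can verify this by right-clicking the last step and checking whether View Native Query is enabled:
let
    Source = Sql.Database("myserver", "mydb"),
    Sales = Source{[Schema = "dbo", Item = "Sales"]}[Data],
    // Folds to a WHERE clause in the generated SQL
    Filtered = Table.SelectRows(Sales, each [Year] = 2021),
    // Folds to the SELECT column list
    Trimmed = Table.SelectColumns(Filtered, {"Year", "Amount"})
in
    Trimmed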
Any additional modelling you perform after the Power Query results are loaded, i.e. creating tables with DAX, adding calculated columns, etc., is evaluated when the PBIX file is published to the Power BI service, and again each time the data is refreshed through the data gateway.