What is the difference between writing multiple dataflows and combining them via a datamart, versus connecting to the data source directly via a datamart? - powerbi

I'm exploring Power BI dataflows and datamarts.
A dataflow allows you to connect to a data source.
A datamart also allows you to connect to data sources, and even to dataflows.
What is the benefit of having dataflows and connecting to them via a datamart, versus using the datamart to connect to the data source directly?

A Data Flow uses Power Query to copy data into a Data Lake in .CSV format.
A Data Mart uses Power Query to copy data into a relational database where it's stored in a highly-compressed columnstore.
What is the benefit of having data flows and connecting to them via data mart versus directly using datamart to connect to data source?
So this is essentially the same question as whether to load a data warehouse directly, or to load a Data Lake first, and load the warehouse from there.
The data would be available for other uses. So you might copy the raw data with a Data Flow, and then transform while loading into a Datamart. Another user might use that same raw data to build a Dataset.
On the other hand, it's an additional copy of the data, and you've got two refreshes to worry about.

Related

What is the difference between using normal vs CDM tables in data flows?

I am creating a Power BI dataflow, and there is a button called Map to standard. I understand this stores the data in CDM format.
If I don't click this button, then where does the data get stored? Is it in the Azure data lake?
What is the difference between using non-CDM (data lake, I assume) and CDM tables in data flows? For example, does either improve performance?

Correct approach to consume and prepare data from multiple sources in Power BI

I'm trying to establish if my planned way of working is correct.
I have two data sources: a MySQL and an MSSQL database. I need to combine these data sources and expose the data for Power BI to consume.
I've decided to use Azure Synapse Analytics for the ETL and would like to understand if there is anything in the process I can simplify or do better.
The process is as follows:
MySQL & MSSQL data is delta-loaded into ASA in parquet format, stored in Azure Gen2 storage.
Once the copy pipeline is complete, a subsequent data flow unions the data from the two sources and inserts it into MSSQL storage in ASA.
Power BI consumes from this workspace / data source.
I'm not sure whether I should be staging the data from the sources into Azure Gen2 storage, or whether I should just perform the transform and insert straight from the source into MSSQL storage. Any thoughts or suggestions would be greatly appreciated.
The pattern that you're following is the data lake pattern, where data is moved between 3 zones:
Raw
Enriched
Curated
The Raw zone keeps an original copy of the data before transformation. The benefit of storing the data this way (as parquet files, here) is so that you can troubleshoot a problem with the transformation or create a different transformation to address a new need.
The Enriched zone is where you have done some transformation, like UNIONing your data, or providing some other clean up steps, maybe removing unneeded columns, correcting addresses, etc. You have done this by inserting the data into a SQL database, but this might also be accomplished by using views in the serverless pool, if the transformations are simple enough: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-views
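For illustration, here is a minimal sketch of that view-based approach in a Synapse serverless SQL pool; the storage account, container, folder paths, and column names are hypothetical and would need to match your raw zone layout.
CREATE VIEW dbo.vw_CustomersEnriched AS
-- Union the raw parquet extracts from the two source systems without copying
-- them into a dedicated SQL table.
SELECT CustomerId, CustomerName, 'mysql' AS SourceSystem
FROM OPENROWSET(
    BULK 'https://mystorageacct.dfs.core.windows.net/raw/mysql/customers/*.parquet',
    FORMAT = 'PARQUET'
) AS m
UNION ALL
SELECT CustomerId, CustomerName, 'mssql' AS SourceSystem
FROM OPENROWSET(
    BULK 'https://mystorageacct.dfs.core.windows.net/raw/mssql/customers/*.parquet',
    FORMAT = 'PARQUET'
) AS s;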
The Curated zone is a place to transform your data into a form that BI applications will do well with, i.e. a star schema. Even if this is a very simple dataset, it will be well worth incorporating a date dimension, which will yield a lot of benefits in Power BI. The bottom line here is that Power BI is optimized to work with star schemas, so that's what you should give it.
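A minimal sketch of such a date dimension in plain T-SQL follows; the table, column names, and date range are illustrative, and in a dedicated SQL pool you may need to adjust constraints and distribution options.
CREATE TABLE dbo.DimDate (
    DateKey    int         NOT NULL,  -- yyyymmdd surrogate key
    [Date]     date        NOT NULL,
    [Year]     smallint    NOT NULL,
    [Month]    tinyint     NOT NULL,
    MonthName  varchar(10) NOT NULL,
    DayOfWeek  tinyint     NOT NULL
);

-- Populate one row per day for the range the fact data covers.
DECLARE @d date = '2020-01-01';
WHILE @d <= '2030-12-31'
BEGIN
    INSERT INTO dbo.DimDate (DateKey, [Date], [Year], [Month], MonthName, DayOfWeek)
    VALUES (YEAR(@d) * 10000 + MONTH(@d) * 100 + DAY(@d), @d,
            YEAR(@d), MONTH(@d), DATENAME(month, @d), DATEPART(weekday, @d));
    SET @d = DATEADD(day, 1, @d);
END;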
You do not need to use data lake technologies to follow this pattern and still get the benefits. Whether what you are doing is good will come down to how everything performs versus how simple you can keep it.
Here's more on the topic: https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/best-practices/data-lake-overview
Once the copy pipeline is complete, a subsequent data flow unions the data
from the two sources and inserts it into MSSQL storage in ASA
What is the use of MSSQL storage? Is it only used by Power BI to create reports? If yes, then you can use ADLS Gen2 instead, as it will be cheaper (basically very much in line with what Mark said above as the "curated" zone).
Just one more thing to consider: Power BI can read data from both sources and then do the transformation within itself.

Which data is stored in Power BI - the one after query or the one after modelling?

In Power BI, we first get the source data. Then we add multiple query steps to filter data, remove columns, etc. Then we add relationships and model the data.
We can have calculated columns that are stored in the data, and measures that are not stored in the data but are calculated on the fly.
Which data is stored in Power BI - the one after query or the one after modelling?
Power BI has three connection types for data access: import, DirectQuery, and live connection.
If you use the import method, the data is imported into the Power BI file using Power BI Desktop, so all of the data is persisted on disk; during queries and refreshes it is held in memory, and this is the data you query and model. When you save your work, the Power BI file is saved with a .pbix extension, and the data is compressed and stored inside that file.
In DirectQuery mode, the data stays at the remote source. Each time you refresh or change a slicer, a request goes to the data source and brings the results back to Power BI. In this method you can't import the data, but you can still create a data model.
A live connection is another method, supported for only a few data sources. In this method the data is not stored in memory and you can't create a data model in Power BI Desktop.
Power BI is very well documented. Many of the questions you've recently asked are answered in that resource, so please take a look. I get the feeling that you are using this community because you don't want to read the manual. I strongly suggest you take a look at the documentation, because everything we write in answer to your questions has already been written and documented, and SO is not meant to be a shadow user guide for well documented systems.
Depending on the data source you use in Power BI Desktop, Power BI supports query folding, which pushes as much of the processing of the data as possible down to the source (for example, SQL Server).
If query folding is not possible because the source does not support it, then the source data is loaded before the query steps are applied.
Read more about query folding here: https://learn.microsoft.com/en-us/power-bi/guidance/power-query-folding
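As a hypothetical illustration of what folding means in practice: a "Remove Other Columns" step and a "Filter Rows" step against a SQL Server table can fold into a single native query like the one below (table and column names are made up), so the filtering happens in the database and only the reduced result set travels to Power BI.
-- Hypothetical native query produced when the Power Query steps fold to the source.
SELECT OrderID, CustomerID, OrderDate, Amount
FROM dbo.Sales
WHERE OrderDate >= '2023-01-01';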
When you perform additional modelling after the Power Query results are loaded, i.e. creating tables with DAX, adding calculated columns, etc., these operations are performed when the PBIX file is published to the Power BI service, and again each time the data is refreshed through the data gateway.

Optimize data load from Azure Cosmos DB to Power BI

Currently we have a problem with loading data when updating the report data with respect to the DB, since it has too many records and it takes forever to load everything. The issue is how I can load only the data from the last year to avoid taking so long to load everything. As I see, when connecting to the Cosmos DB, the dialog box allows me to enter a SQL query, but I don't know how to write one for this type of non-relational database.
Power BI has an incremental refresh feature. You should be able to refresh the current year only.
If that still doesn't meet expectations, I would look at a preview feature called Azure Synapse Link, which automatically pulls all Cosmos DB updates out into analytical storage that you can query much faster in Azure Synapse Analytics, in order to refresh Power BI faster.
Depending on the volume of the data, you may hit a number of issues. The first is that you may exceed your RU limit, slowing down the extraction of the data from Cosmos DB. The second is transforming the data from JSON into a structured format.
I would try to write a query to specify the fields and items that you need. That will reduce the time of processing and getting the data.
For SQL queries it will be something like
SELECT * FROM c WHERE c.partitionEntity = 'guid'
For more information on the Cosmos DB SQL API syntax, please see here to get you started.
You can use the query window in Azure to run the SQL commands, or Azure Storage Explorer to test the query, then move it to Power BI.
What is highly recommended is to extract the data into a place where it can be transformed into a structured format like a table or CSV file.
For example, use Azure Databricks to extract the data, then turn the JSON into a table-formatted object.
You have the option of running Databricks notebook queries against Cosmos DB, or using Azure Databricks in its own instance. One other option would be to use the change feed with an Azure Function to send and shred the data to Blob Storage, and query it from there using Power BI, Databricks, Azure SQL Database, etc.
In the Source step of your query, you can write a select based on the Cosmos DB _ts system property, like:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > 1609455599"
In this case, 1609455599 is the timestamp which corresponds to 31.12.2020, 23:59:59. So, only data from 2021 will be selected.
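If you prefer a rolling window over a hard-coded cutoff, here is a sketch of the same idea using Cosmos DB's built-in GetCurrentTimestamp() function, which returns milliseconds since the Unix epoch while _ts is stored in seconds.
-- Rolling "last 365 days" filter on the _ts system property.
-- 31536000 = 365 * 24 * 60 * 60 seconds; divide GetCurrentTimestamp() by 1000
-- to convert milliseconds to seconds before comparing with _ts.
SELECT * FROM c WHERE c._ts > GetCurrentTimestamp() / 1000 - 31536000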

Backup/Remove Datastore entities to Big Query Automatically at some interval

I am looking to stream some data into BigQuery and had a question around step 3 of Google's best practices for streaming data into BigQuery. The process makes sense at a high level, but I'm struggling with the implementation for step 3. (I am looking to use Datastore as my transactional data store.) Step 3 says to "append the reconciled data from the transactional data store and truncate the unreconciled data table." My question is this: if my reconciled data is in Google Datastore, is there a way to automate the backup and deletion of this data without manual intervention?
I know I could achieve this recommended practice by using the Datastore Admin. I could:
1) Pause all writes to the datastore
2) Backup the datastore table to Cloud Storage
3) Delete all entities in the table I just backed up.
4) Import the backup into Big Query
Is there a way I can automate this so I don't have to do it manually at regular intervals?
Real-time dashboards and queries
In certain situations, streaming data into BigQuery enables real-time analysis over transactional data. Since streaming data comes with a possibility of duplicated data, ensure that you have a primary, transactional data store outside of BigQuery.
You can take a few precautions to ensure that you'll be able to perform analysis over transactional data, and also have an up-to-the-second view of your data:
1) Create two tables with an identical schema. The first table is for the reconciled data, and the second table is for the real-time, unreconciled data.
2) On the client side, maintain a transactional data store for records.
Fire-and-forget insertAll() requests for these records. The insertAll() request should specify the real-time, unreconciled table as the destination table.
3) At some interval, append the reconciled data from the transactional data store and truncate the unreconciled data table.
4) For real-time dashboards and queries, you can select data from both tables. The unreconciled data table might include duplicates or dropped records.
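To make steps 3 and 4 concrete, here is a rough sketch in BigQuery standard SQL; the project, dataset, table, and column names (record_id, inserted_at) are hypothetical, and the reconciled batch is assumed to be loaded from the transactional store by a separate load job.
-- Step 3: after the reconciled batch has been loaded into the reconciled table,
-- clear the streaming (unreconciled) table.
TRUNCATE TABLE `my-project.analytics.events_unreconciled`;

-- Step 4: an up-to-the-second view over both tables. The streaming table may
-- contain duplicates, so keep only the latest row per record key.
SELECT * FROM `my-project.analytics.events_reconciled`
UNION ALL
SELECT * EXCEPT (rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY record_id ORDER BY inserted_at DESC) AS rn
  FROM `my-project.analytics.events_unreconciled`
)
WHERE rn = 1;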