Data Warehouse and connection to Power BI on AWS structure

I work for a startup where I have created several dashboards in Power BI using tables stored in an AWS RDS MySQL instance. To create additional columns for visualizations, I created views of the tables in MySQL and used DAX to add some extra columns for the Power BI visualizations.
However, my team now wants to use the AWS structure and build a data lake to store the raw data and a data warehouse to store the transformed data. After researching, I believe I should create the additional columns in either Athena or Redshift. My question is, which option is best for our needs?
I think the solution is to connect to the RDS using AWS Glue to perform the necessary transformations and deposit the transformed data in either Athena or Redshift. Then, we can connect to the chosen platform using Power BI. Please let me know if I am misunderstanding anything.
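Roughly, I imagine the Glue job looking something like the sketch below (a PySpark sketch only; the catalog database, table, derived column and S3 path are made-up placeholders, and writing Parquet to S3 is what Athena would query, while loading into Redshift would need a Redshift/JDBC connection instead):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the raw fact table from the Glue Data Catalog (crawled from the RDS MySQL source)
facts = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="fact_sales"
)

# Add the derived columns that currently live in MySQL views / DAX
df = facts.toDF()
df = df.withColumn("revenue", df["quantity"] * df["unit_price"])

# Write the transformed data as Parquet to S3, where Athena can query it
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df, glueContext, "facts_out"),
    connection_type="s3",
    connection_options={"path": "s3://my-datalake/curated/fact_sales/"},
    format="parquet",
)
job.commit()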
To give a rough idea of the volume of data I'm handling: the fact tables get about 10,000 new records every month.
Thank you in advance!

Related

Power BI paginated report using a Power BI dataset from multiple Azure SQL servers

Just looking for a pointer as to the best way to go about this.
I'm comfortable with Power BI Report Builder (SSRS experience), but am pretty much a Power BI novice.
Basically, we have to create a Paginated (non-interactive) report for client consumption. It's going to be large, have multiple datasets, and use parameters / presence of data in the data sets to group data and/or turn sections on or off.
Not too much visualisation - some illustrative graphs and tables here and there - and quite a bit of text, some of it with data / text inserted via placeholders from the various datasets.
There are 3 Azure SQL databases I need to combine data from for this (split roughly into config, data and results).
In SSRS / SQL Server, I would have used one of my databases as the data source, and written a stored procedure per SSRS data set, joining to tables in other databases in the stored procedure query.
Then in Report builder just set up the data sets joining to the stored procs and gone from there.
On Azure SQL Server, I think I've got 2 options:
Write elastic queries so I can bring in the data I need from each database, while only querying one database.
Build a Power BI model / dataset that joins the relevant tables from the 3 databases together, publish it to the Power BI Service and use that as my data source.
What's the best solution for my reporting scenario?
Cheers

Power BI and shared datasets: how to allow users to create new measures and reports and publish them

We are having difficulty finding a method of sharing a dataset and allowing users to use that dataset to create and publish their own reports. This would include the ability to create new measures (DAX) and then publish them themselves. Using the "service" live connection does not seem to allow that, and without it there seems to be an issue with refreshing the data once that dataset is downloaded and modified with new columns/measures, etc.
Greatly appreciate any help on this. So far I have seen nothing that shows how to do any of this, so I have to assume it may not be possible? Thank you.
Live Connect to a Power BI Dataset allows for local measures.
If you need more modeling changes when working with a remote dataset, the DirectQuery for Power BI Datasets and AAS feature (currently in preview) enables you to mash up remote dataset tables with local tables, and allows adding calculated columns to remote tables.
But you should use this with some care, as the query processing is split between the local model and the remote model(s), which can cause performance issues.

Power BI streaming dataset limitations

We have a requirement to generate reports in Power BI for real-time transactions. We have roughly 2,000,000 transactions flowing in per day, and we would like reports generated for at least this number of rows.
I understand that the push streaming API has a limitation of 200,000 rows for FIFO datasets and 5,000,000 rows for the "none" retention policy (link).
My questions are as follows:
If we create a streaming dataset via the push API in the Power BI service, what dataset is created by default in the background: FIFO or the none-retention-policy dataset?
For a none-retention-policy dataset, what happens when we cross the 5,000,000-row limit? If there is a failure, does that mean we need to delete old rows via an API call on a frequent basis? An example API call to do this would help. Deleting all rows is not an option, as the business would like reports such as KPIs over the last 24 hrs, for example.
If we use Azure Stream Analytics to push data to Power BI, what are the limitations of data storage in Power BI in this case?
I'm afraid you misunderstood the idea of Power BI. Power BI is not a database! Do not try to use it as such; there are better options out there. That's why you are having a hard time trying to work around these limitations.
What I'm trying to say is that you should store and process your data somewhere else and use Power BI only for visualizing it. In that case, if we say that you want real-time streaming that is updated every second, you need to send only 86,400 records per day (one per second), which is way below the 200,000-record limit of a FIFO dataset. If you do not want a real-time streaming dashboard, but a normal Power BI report, then why are you looking at push datasets at all? So collect your data somewhere, aggregate the results, and then push the aggregated data to Power BI.
And to answer your questions anyway:
If you create a dataset using the Power BI REST API without specifying the retention policy, it will create a push dataset with no retention policy; basicFIFO must be enabled explicitly.
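As a rough illustration (not production code; the dataset/table schema and names are made up, and acquiring the AAD access token is out of scope), creating the dataset with basicFIFO enabled looks roughly like this:

# Hypothetical sketch: create a push dataset with the basicFIFO retention policy enabled.
import requests

access_token = "<AAD access token with Dataset.ReadWrite.All>"
headers = {"Authorization": f"Bearer {access_token}"}

body = {
    "name": "TransactionsStream",          # made-up dataset name
    "defaultMode": "Push",
    "tables": [{
        "name": "Transactions",            # made-up table name
        "columns": [
            {"name": "TransactionTime", "dataType": "DateTime"},
            {"name": "Amount", "dataType": "Double"},
        ],
    }],
}

# Without ?defaultRetentionPolicy=basicFIFO the dataset is created with no retention policy.
resp = requests.post(
    "https://api.powerbi.com/v1.0/myorg/datasets?defaultRetentionPolicy=basicFIFO",
    headers=headers,
    json=body,
)
resp.raise_for_status()
dataset_id = resp.json()["id"]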
If you reach the limit of 5M rows, you will get an error when trying to push more rows to the dataset. Your only option is to delete all rows - there is no way to delete only some of them, because Power BI is not a database. That's why your data should be stored somewhere else and this is the idea behind basicFIFO retention policy and Power BI's streaming dataset.
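And if you do need to clear a push dataset, the REST API can only truncate a whole table, something like this sketch (dataset id and table name are placeholders, token acquisition omitted):

# Hypothetical sketch: delete ALL rows from one table of a push dataset.
import requests

access_token = "<AAD access token with Dataset.ReadWrite.All>"
dataset_id = "<dataset id returned when the dataset was created>"

resp = requests.delete(
    f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/tables/Transactions/rows",
    headers={"Authorization": f"Bearer {access_token}"},
)
resp.raise_for_status()  # the table is now empty; there is no call to delete only some rows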
Power BI's limits don't change based on the data source. It doesn't matter whether you are pushing data through Azure Stream Analytics or through a service written by you - the Power BI dataset is the same.

How to deploy Power BI reports and connect them to a single Power BI Dataset

As far as I know, deploying a Power BI report from Power BI Desktop results in two items, the report itself and the dataset. Deploying a new report that uses the same dataset will deploy the new report plus a second copy of the same dataset to the Power BI Service. That is not what I want. To avoid confusing end users and others, I want only a single dataset deployed.
I want to make use of Azure DevOps, deploying to the Power BI Service in a Dev, Test and Prod way. The dataset will be an Azure Analysis Services data model, but the principle should be the same. I need to reduce the dataset to exactly one, and all reports must relate to that data model. I have heard of a REST API or PowerShell scripting that can come to the rescue here.
So if any of you have done this or know of a good article that describes how to do this, I would be grateful.
Regards Geir
The best option is to separate the Power BI report into a frontend and a backend. You create a file purely for the dataset if you are importing data, with no reports built on it. You can then create the reports using the service connection to the dataset, or with Power BI Desktop via the 'Power BI Dataset' connection option. Both will use 'Live Connection' mode, so you cannot add any other data sources to the model, for example a CSV file or a SQL database.
If you are connecting to an Azure Analysis Services data model, you can use this approach; however, as it is only a connection, not a full-fat dataset, it should not be an issue to have copies of the dataset, since each copy is just the connection. Having copies of the dataset is only an issue if you are importing data; in that case it is best to move things to dataflows, use the same front-end/back-end method, and plan the scheduling of the dataflows and then the datasets.
You can use the REST API to move reports and the datasets that they connect to, and to move items to new workspaces. If you have Power BI Premium, it has a lifecycle tool to move items between dev/test/live workspaces.
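For example, rebinding an already-published report to the single shared dataset can be done with the report Rebind endpoint; a rough sketch (all IDs are placeholders, token acquisition omitted):

# Hypothetical sketch: point an existing report at the one shared dataset.
import requests

access_token = "<AAD access token with Report.ReadWrite.All>"
workspace_id = "<target workspace id>"
report_id = "<report to rebind>"
shared_dataset_id = "<the single shared dataset / AAS-model dataset>"

resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/reports/{report_id}/Rebind",
    headers={"Authorization": f"Bearer {access_token}"},
    json={"datasetId": shared_dataset_id},
)
resp.raise_for_status()  # the report now uses the shared dataset instead of its own copy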
If you create a report in Power BI Desktop and choose 'Power BI Dataset' as a live connection to work over, then when you upload the report to the same workspace, it will only upload the report and connect it to the existing dataset.
https://radacad.com/power-bi-shared-datasets-what-is-it-how-does-it-work-and-why-should-you-care

Optimize data load from Azure Cosmos DB to Power BI

Currently we have a problem with loading data when refreshing the report data against the DB, since it has too many records and it takes forever to load everything. The issue is how I can load only the data from the last year, to avoid taking so long. As far as I can see, the Cosmos DB connector dialog allows me to enter an SQL query, but I don't know how to write one for this type of non-relational database.
Power BI has an incremental refresh feature. You should be able to refresh the current year only.
If that still doesn't meet expectations, I would look at a preview feature called Azure Synapse Link, which automatically pulls all Cosmos DB updates out into analytical storage that you can query much faster in Azure Synapse Analytics, in order to refresh Power BI faster.
Depending on the volume of the data, you will hit a number of issues. The first is that you may exceed your RU limit, slowing down the extraction of the data from Cosmos DB. The second is transforming the data from JSON into a structured format.
I would try to write a query to specify the fields and items that you need. That will reduce the time of processing and getting the data.
For SQL queries it will be something like:
SELECT * FROM c WHERE c.partitionEntity = 'guid'
For more information on the Cosmos DB SQL API syntax, please see the Cosmos DB SQL query documentation to get you started.
You can use the query window in Azure to run the SQL commands, or Azure Storage Explorer to test the query, then move it to Power BI.
What is highly recommended is to extract the data into a place where it can be transformed into a structured format, like a table or CSV file.
For example, use Azure Databricks to extract the data, then turn the JSON into a table-formatted object.
You have the option of running Databricks notebook queries against Cosmos DB, or running Azure Databricks in its own instance. Another option would be to use the change feed and an Azure Function to send and shred the data to Blob Storage, and query it from there using Power BI, Databricks, Azure SQL Database, etc.
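As a rough sketch of the Databricks route (assuming the Azure Cosmos DB Spark 3 connector is attached to the cluster; the account, database, container, field and table names are placeholders):

# Hypothetical sketch: read a Cosmos DB container with the Spark 3 connector and
# persist it as a structured table that Power BI can read quickly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in a Databricks notebook

raw = (
    spark.read.format("cosmos.oltp")
    .option("spark.cosmos.accountEndpoint", "https://<account>.documents.azure.com:443/")
    .option("spark.cosmos.accountKey", "<account key>")
    .option("spark.cosmos.database", "<database>")
    .option("spark.cosmos.container", "<container>")
    .load()
)

# Keep only the fields the report needs, then write a table Power BI can query directly
raw.select("id", "partitionEntity", "_ts").write.mode("overwrite").saveAsTable("curated.transactions")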
In the Source step of your query, you can make a select based on the Cosmos DB _ts system property, like:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > 1609455599"
In this case, 1609455599 is the timestamp which corresponds to 31.12.2020, 23:59:59, so only data from 2021 onwards will be selected.
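Since _ts is stored as Unix epoch seconds, the cutoff for any date can be computed up front; for example (a small helper sketch, and note that 1609455599 evaluates to 31.12.2020, 23:59:59 in a UTC+1 time zone):

# Compute a Cosmos DB _ts cutoff (Unix epoch seconds) for a given local date/time.
from datetime import datetime, timezone, timedelta

cutoff = datetime(2020, 12, 31, 23, 59, 59, tzinfo=timezone(timedelta(hours=1)))
print(int(cutoff.timestamp()))  # 1609455599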