How to test data load performance in Power BI

Report authoring in Power BI is done in Power BI Desktop, which is installed on users' workstations. Report sharing is done in the Power BI cloud service (either shared or dedicated capacity). This means that different resources (i.e., memory, CPU, disk) are available during report authoring and report sharing, particularly for data load (dataset refresh). So it seems impossible to test a report's data load / ETL performance prior to releasing it to production (i.e., publishing to the cloud service). And data load performance is usually faster in the cloud service than in Desktop. Because my reports contain a lot of data and transformations, data loads in Desktop can take a long time. How can I make the resources available to Desktop identical to the resources in the cloud service, so that I can reduce data load times in Desktop (during development) and predict performance in the cloud service?
Perhaps a better question to ask is: should I even be doing this? That is, should I be trying to predict (in Desktop) a report's refresh performance in the cloud service, and/or load production-level data volumes into Desktop during development?

Microsoft does not specify what CPU and memory hardware is used in the Power BI service. It is also a shared service, so more than one Power BI tenancy could be hosted on the same cluster. Microsoft does mention that you may suffer from noisy-neighbour issues: if another tenancy is hitting the cluster hard, your performance may suffer.
I know from experience that the memory available in the shared service is greater than 25GB, as queries that would not run on Premium P1 nodes (which have 25GB of memory) have run OK in the service. With dedicated nodes, you can use the admin reports to see what's going on in the background: query times, refresh times, and CPU/memory usage.
There are a few issues with trying to performance-test Desktop vs the service. For example, a SQL query in Desktop will run twice: first to check the structure and data, then again to get the data. This doesn't happen when the report is deployed to the service, so in that example your load will be quicker.
If you are accessing on-premises data, it will be quicker in Desktop than in the service, as the service has to go via a gateway. Conversely, if you are connecting to an Azure SQL Database, the connection and bandwidth between Azure services will be slightly quicker once you deploy to the service than from a Desktop connection to an Azure service, because in the latter case the data has to travel outside the data centre to get to you.
So for imported datasets, you can look at the dataset refresh start and end times and work out how long it took.
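If you want to pull those timings programmatically rather than reading them off the portal, here is a minimal sketch against the Power BI REST API's refresh-history endpoint. It assumes you already have an Azure AD access token with dataset read permissions; the token and dataset ID below are placeholders:

```python
import requests
from datetime import datetime

# Placeholders (hypothetical values): supply your own token and dataset ID.
ACCESS_TOKEN = "<your-azure-ad-token>"
DATASET_ID = "<your-dataset-id>"

url = f"https://api.powerbi.com/v1.0/myorg/datasets/{DATASET_ID}/refreshes?$top=10"
resp = requests.get(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
resp.raise_for_status()

for refresh in resp.json()["value"]:
    if refresh["status"] != "Completed":
        continue
    # Timestamps are ISO 8601, e.g. "2023-01-01T05:00:00.000Z"
    start = datetime.fromisoformat(refresh["startTime"].replace("Z", "+00:00"))
    end = datetime.fromisoformat(refresh["endTime"].replace("Z", "+00:00"))
    print(f"{start:%Y-%m-%d %H:%M} took {(end - start).total_seconds():.0f}s")
```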
For a baseline test, generate 1 million rows of data; it doesn't have to be complex. Test the load time in Desktop a few times to get an average, deploy, and then try it in the service. Then keep adding 1 million rows to see whether there is a linear relationship between the amount of data and the time taken.
However, it will not be a fully like-for-like comparison, as it depends on the type of data, its location, and network speed, but it should give you a fair indication of any performance increase you may get when using the service, and help you balance Desktop spec against the service.
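As a rough illustration of generating that baseline data (the column names and value ranges here are arbitrary), a small script that writes a simple 1-million-row CSV you could import in both Desktop and the service:

```python
import csv
import random

ROWS = 1_000_000  # grow this in 1M increments for each test run

with open("baseline_test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "category", "amount", "flag"])
    for i in range(ROWS):
        writer.writerow([
            i,
            random.choice(["A", "B", "C", "D"]),
            round(random.uniform(0, 1000), 2),
            random.randint(0, 1),
        ])
```

Load the resulting file in Desktop, time a few refreshes to get an average, publish, and compare against the service's refresh history.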

At some point I developed a tool that uses Microsoft's PowerBI-Tools-For-Capacities under the hood.

Related

Power BI - reports embedded, row level security & refresh rate for customers

My team plans to build a web platform which gathers data in a DB about different crypto transactions. I am planning to use Power BI to get that data from the DB and build some reports which will be embedded into the web platform; the reports will be accessed by users who log in to the web platform.
Is this possible, taking into consideration the following aspects?
I want to apply row-level security so that users who log in to the web platform will be able to see only data related to them.
Should I assign a Power BI Pro license to each user who registers on the platform in order to be able to see the data, or is there another solution?
How often may I set up data refreshes/updates? Every 30 minutes?
I am looking to apply row-level security and have users access the reports based on their web platform login credentials. Hopefully this is possible. I read something about embedding Power BI reports for customers using App Owns Data. Is this the right solution?
For App Owns Data, you will be building a portal on top of an embedded capacity. I assume that you will be using an 'A' SKU.
I want to apply row-level security so that users who log in to the web platform will be able to see only data related to them.
Yes, you can use RLS to control which users see which data in an embedded context (see here).
Should I assign a Power BI Pro license to each user who registers on the platform in order to be able to see the data, or is there another solution?
No, you don't need a Power BI Pro license for each user of your platform; this is handled by the capacity. You'll only need Pro for those who are developing the reports. Your other users, handled by your web portal, will be 'read only'.
How often may I set up data refreshes/updates? Every 30 minutes?
You can set up the refresh schedule as normal in the portal; with a capacity-backed Power BI dataset you can refresh up to 48 times per day.
I would take a look at the MS documentation here for more details on what embedded can do, and also at capacity planning for your users.
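For a concrete feel of the RLS part, here is a hedged sketch of generating an embed token that carries an effective identity, using the Power BI REST API's GenerateToken endpoint. The workspace, report, and dataset IDs, the username, and the role name are all placeholders, and it assumes you already hold an Azure AD token for the master account or service principal:

```python
import requests

# All IDs and the access token below are placeholders (hypothetical values).
ACCESS_TOKEN = "<azure-ad-token-for-master-account>"
WORKSPACE_ID = "<workspace-id>"
REPORT_ID = "<report-id>"
DATASET_ID = "<dataset-id>"

url = (f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}"
       f"/reports/{REPORT_ID}/GenerateToken")

# The effective identity tells Power BI which username and RLS role
# to apply when rendering the report for this platform user.
body = {
    "accessLevel": "View",
    "identities": [{
        "username": "customer42@yourplatform.example",  # hypothetical user
        "roles": ["CustomerRole"],                      # hypothetical RLS role
        "datasets": [DATASET_ID],
    }],
}

resp = requests.post(url, json=body,
                     headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
resp.raise_for_status()
embed_token = resp.json()["token"]  # pass this to the client-side embed code
```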

Need help building an uptime dashboard for a distributed system

I have a product for which I would like to create a dashboard to show its availability/uptime over time and display any outages.
Specifically, I am looking for:
the ability to report historical information on service uptime
details on any service outages
The product runs on a fleet of Linux servers and connects to a DB running on a separate instance; we also have some dedicated instances that run nightly batch jobs. The system also relies on some external services to provide additional functionality for select customers. There is also a Redis cache for caching data for multiple customers.
We replicate the entire setup above (application servers, DB, job servers, Redis cache, etc.) into dedicated clusters for large customers. Small customers are put on one of the shared clusters to keep costs low.
Currently we run health checks on the application servers only and provide that information on a simple HTML page. This is the go-to page for end users/customers and support teams.
Since the product is composed of multiple systems/services, our current HTML page often says that the system is up and running fine while it is actually experiencing issues with some of its components or external services.
The current health check uses a simple HTTP request and looks for a 200 status code. This check runs every minute, and we plot the data in a simple chart showing the last 30 days. We also show a list of outages with timestamps and additional static information that is added manually.
We would like to build a more robust solution that monitors much more than the HTTP port and gives us more detail: which part of the system is having issues, how those issues are impacting the system, and which customers are impacted.
Appreciate any guidance or help. We would prefer to build the solution using open-source tools since we don't have much budget. The goal is to improve things for my team members, who are already overloaded.
I'm not sure if this will be overkill for your setup, given that I don't know your product, but have a look at the ELK Stack and see if you can use some components, or at least some ideas, from there:
What is the ELK Stack?
The Complete Guide to the ELK Stack
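If you go that route, the missing piece is a health check that probes each dependency rather than just the front-end HTTP port. A minimal sketch (the component names, hosts, and endpoints are all hypothetical) that emits one structured JSON document per component, which a shipper like Logstash or Filebeat could then feed into Elasticsearch:

```python
import json
import socket
import time
from datetime import datetime, timezone
from urllib.request import urlopen

def _tcp_ok(host, port):
    # Cheap liveness probe: can we open a TCP connection?
    with socket.create_connection((host, port), timeout=5):
        return True

# Hypothetical components; replace with your real hosts/ports/URLs.
CHECKS = {
    "app_server": lambda: urlopen("http://app.internal/health", timeout=5).status == 200,
    "database": lambda: _tcp_ok("db.internal", 5432),
    "redis_cache": lambda: _tcp_ok("cache.internal", 6379),
    "external_api": lambda: urlopen("https://partner.example/ping", timeout=5).status == 200,
}

def run_checks():
    for name, probe in CHECKS.items():
        started = time.monotonic()
        try:
            ok = probe()
        except OSError:  # covers connection errors, timeouts, HTTP errors
            ok = False
        yield {
            "@timestamp": datetime.now(timezone.utc).isoformat(),
            "component": name,
            "status": "up" if ok else "down",
            "latency_ms": round((time.monotonic() - started) * 1000, 1),
        }

if __name__ == "__main__":
    for doc in run_checks():
        print(json.dumps(doc))  # one JSON line per component, ready to ship
```

Per-component documents like these would let you chart uptime by component and by customer cluster, rather than a single up/down flag for the whole product.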

SQL DB on AWS with Power BI Embedded

I need your help.
We have a plan to run a SQL DB and web services on AWS, and we need to publish the Power BI report by embedding it into the web service running on AWS.
Do you think this is a possible scenario? If yes, how can I achieve it?
You can't embed Power BI in a web service, so I will assume you want to embed it in a web application.
You need at least three components in such an architecture: a place to store your data (assuming it will be some kind of SQL Server), Power BI (assuming Power BI Service), and a web application.
The database can be managed by your cloud provider (e.g. Amazon RDS) or a "normal" instance running in a VM in the cloud. Of course, it could be something else (not SQL Server), or even be in a different cloud (e.g. Azure), or on-premises. The point is that you store your data there and use it as the data source for your reports.
Then you need Power BI to create reports. Assuming that you will use Power BI Service (the online portal), you will design your reports in Power BI Desktop, getting data from your data source, and publish these reports to Power BI Service. At this point you can view the reports in the portal using the browser; Power BI Service will render them using shared resources. For embedding and relatively heavy usage, you should buy a capacity. Think of capacities as resources (CPU, memory) dedicated only to you; they are not shared with other Power BI users. There are different licensing models and ways to buy a capacity: you can buy Power BI Premium or Azure SKUs. This FAQ tries to explain the differences, but in general an A SKU means "pay for what you use, stop at any moment, without any commitments", while EM and P SKUs are for bigger-scale projects with a monthly or yearly commitment. When you buy a capacity, you can assign it to a workspace containing your reports, and they will then be rendered using your own dedicated resources (which should give you better performance).
And the last part is your application (assuming a web application, which you can host in Amazon web hosting or in a VM), where you want to embed your reports. Generally speaking, there are two scenarios: "user owns data" and "app owns data". In the first, each of your users needs an Azure AD account; using this account, they get the same access to the reports and data as they have in the Power BI Service itself. In the second scenario, your app uses one "master" account to access Power BI, so your users don't need their own accounts in Azure AD, and you can use your own authentication in your app. Embedding Power BI is quite a large topic and your question isn't specific, so I recommend starting with the Embedding with Power BI article, taking a look at the Power BI Embedded Playground, and reviewing the samples.
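To make the "app owns data" flow a bit more concrete, here is a hedged server-side sketch: it acquires an Azure AD token as a service principal via MSAL and fetches the report's embed URL, which your web application would hand to the client-side embedding code together with an embed token. The tenant ID, client credentials, and workspace/report IDs are all placeholders:

```python
import msal
import requests

# Placeholder credentials and IDs (hypothetical values).
TENANT_ID = "<azure-ad-tenant-id>"
CLIENT_ID = "<app-registration-client-id>"
CLIENT_SECRET = "<app-registration-secret>"
WORKSPACE_ID = "<workspace-id>"
REPORT_ID = "<report-id>"

# Acquire an Azure AD token for the Power BI REST API as the app itself.
app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)
result = app.acquire_token_for_client(
    scopes=["https://analysis.windows.net/powerbi/api/.default"]
)
access_token = result["access_token"]

# Fetch the report metadata; embedUrl is what the front end needs,
# together with an embed token generated via the GenerateToken API.
url = (f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}"
       f"/reports/{REPORT_ID}")
report = requests.get(
    url, headers={"Authorization": f"Bearer {access_token}"}
).json()
print(report["embedUrl"])
```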

Why is BigQuery so slow on non-large data sizes?

We have found BigQuery to work great on data sets larger than 100M rows, where the 'initialization time' doesn't really come into effect (or is negligible compared to the rest of the query).
However, on anything under that, the performance is quite poor, which makes it (1) ill-suited to working in an interactive BI tool; and (2) inferior to other products, such as Redshift or even ElasticSearch, where the data size is under 100M rows. In fact, an engineer at our organization who was evaluating technologies for querying data sizes between 1M and 100M rows for an analytics product with about 1000 users gave the feedback that he could not believe how slow BigQuery was.
Without asking for a defense of the BigQuery product, I was wondering whether there are any plans to improve:
The speed of BigQuery -- especially its initialization time -- on queries of non-massive data sets?
Will BigQuery ever be able to deliver sub-second response times on 'regular' queries (such as a simple aggregation with a GROUP BY) on datasets under a certain size?
The time is spent on metadata/initiation; the actual execution time is very small. We have work in progress that will address this, but some of the changes are complicated and will take a while.
You can imagine that in its infancy, BigQuery could have central systems for managing jobs, metadata, etc. that performed very well for all N0 entities using the service. Once you get to N1 entities, however, it may be necessary to rearchitect some things to keep latency as low as possible. For notification about new features (which is also where we would announce API improvements related to start-up latency), keep an eye on our release notes, which you can also subscribe to as an RSS feed.
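You can observe this split yourself by comparing end-to-end job latency with the execution window BigQuery reports. A small sketch using the google-cloud-bigquery client library (assuming default application credentials; the query runs against a public sample dataset):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default application credentials

job = client.query(
    "SELECT word, SUM(word_count) AS n "
    "FROM `bigquery-public-data.samples.shakespeare` "
    "GROUP BY word ORDER BY n DESC LIMIT 10"
)
job.result()  # wait for completion

# Total wall-clock time includes job creation/metadata overhead;
# started/ended bracket the actual execution inside BigQuery.
total = (job.ended - job.created).total_seconds()
execution = (job.ended - job.started).total_seconds()
print(f"total: {total:.2f}s, execution: {execution:.2f}s, "
      f"startup overhead: {total - execution:.2f}s")
```

On small datasets the startup overhead tends to dominate the total, which is exactly the effect the question describes.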
Exactly 4 years after this question was asked, there is great news for BigQuery users! As stated in this BI Engine release note from 2021-02-25:
The BI Engine SQL interface expands BI Engine to integrate with other business intelligence (BI) tools such as Looker, Looqbox, Tableau, Power BI, and custom applications to accelerate data exploration and analysis. This page provides an overview of the BI Engine SQL interface, and the expanded capabilities that it brings to this preview version of BI Engine.
I believe this can solve the query latency issue mentioned in David542's question.

Are Azure WebJobs appropriate for importing large amounts of data

I am working on an application where we receive CSV files from a government department that have approx. 1.5 million rows, monthly. We have to get this data into Azure Table Storage. We are trying to avoid having to provision VMs for this and are wondering if WebJobs are a good choice for such a large dataset?
Thanks.
Yes, they should work. WebJobs are nothing more than a process running on the website machine.
You'll probably want to turn on the "Always On" feature if your WebJob will take a long time to complete.
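For a sense of what the job's core loop might look like (WebJobs can run arbitrary scripts; the connection string, table name, and CSV columns below are hypothetical), here is a hedged sketch that batches rows into Azure Table Storage with the azure-data-tables SDK:

```python
import csv
from itertools import islice
from azure.data.tables import TableServiceClient

# Hypothetical connection string and schema; adjust to your account and file.
CONN_STR = "<azure-storage-connection-string>"
service = TableServiceClient.from_connection_string(CONN_STR)
table = service.create_table_if_not_exists("GovMonthly")

def entities(reader):
    for row in reader:
        yield {
            "PartitionKey": row["region"],   # hypothetical column
            "RowKey": row["record_id"],      # hypothetical column
            **row,
        }

with open("monthly.csv", newline="") as f:
    rows = entities(csv.DictReader(f))
    while True:
        chunk = list(islice(rows, 100))
        if not chunk:
            break
        # A table transaction is limited to 100 operations and must target
        # a single partition, so split each chunk by PartitionKey.
        for pk in {e["PartitionKey"] for e in chunk}:
            ops = [("upsert", e) for e in chunk if e["PartitionKey"] == pk]
            table.submit_transaction(ops)
```

Batching 100 entities per transaction keeps the number of round trips manageable for a 1.5-million-row file, which is well within what a WebJob can handle if "Always On" is enabled.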