AAS tabular model in DirectQuery mode performance benefits - powerbi

Suppose you have 10 pretty big fact tables (each 50-100 GB) that should be queried with Power BI. They don't fit into Azure Analysis Services RAM at a reasonable price. So in order to use a tabular model and AAS, you have to stay with the following schema:
(1) Power BI Desktop -> Azure Analysis Services -> [DirectQuery] -> SQL Database
But as far as I know from this article, an AAS tabular model doesn't cache any aggregated results (meaning it won't apply any additional performance optimizations). Moreover, AFAIK, Power BI (PowerPivot) already has an embedded Analysis Services engine.
As an alternative, I can query the SQL data source directly from Power BI:
(2) Power BI Desktop -> [DirectQuery] -> SQL Database
Does the 1st schema (using AAS) provide any performance benefits over the 2nd schema (not using AAS)?
P.S. My question isn't about the pros and cons of a semantic layer; for that, see this article. This question isn't the same as this question, because it asks only about the performance aspect of AAS DirectQuery.

The performance benefits will require testing and depend on your workload and other factors.
Caveat: this answer is based on my own and my colleagues' experience and testing.
Service Standard:
From a service point of view, the main difference between Azure Analysis Services (AAS) and the Power BI Service (PBIS) is that AAS is a known set of hardware/performance, whereas PBIS is a shared capacity and can suffer from 'noisy neighbour' issues: if another customer is on the same cluster and using it heavily, it will have an impact on your report performance.
Performance:
Essentially, PBI and AAS are doing the same thing: translating DAX to a SQL query and then returning the data. From my experience of building both, there isn't much performance difference between the two. The bottleneck tends to be the gateway to an on-prem SQL Server and the capacity of the SQL Server itself, whether on-prem or in the cloud. For better performance you can, for example, use clustered columnstore indexes to keep the fact tables in a compressed, scan-friendly format, and it is easier to increase/decrease the Azure SQL Database DTUs/capacity during business hours.
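As a minimal sketch (the table name is hypothetical), adding a clustered columnstore index to a fact table looks like this:

    -- Stores the fact table column-wise with heavy compression, which speeds
    -- up the large scans and aggregations that DirectQuery generates.
    CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
        ON dbo.FactSales;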
At the moment AAS doesn't have the Aggregation Mode that Power BI does, which can reduce the number of queries sent back to the source and is a bit quicker, but which has the drawback that the aggregations have to be refreshed at some point.
I would recommend testing using for example DAX Studio to see what variability you may get in performance. My own testing has shown differences in the millisecond to 1 second range in favour of AAS.
However, the benefits of the semantic layer are a powerful consideration.
Connections:
AAS supports other connections such as Excel, SSMS, SSRS, etc. better than Power BI. Excel can connect to Power BI models with an additional plugin.
Maintainability:
Maintaining the data model across its life cycle is a lot easier to do in Visual Studio/SSDT with Azure DevOps, Git, etc. than it is in Power BI Desktop. With AAS you can also use Calculation Groups for Time Intelligence calculations, rather than multiple measures or workarounds for YTD, Parallel Period, MTD, etc.
Even if a pure Power BI approach were slightly faster, I would still use AAS because of the non-performance benefits; it would have to show significantly better performance before I switched.
Hope that helps

Metadata in Power BI

Years ago I used a BI product called Hyperion Interactive Reporting. It allowed me to connect to a data source and create data models from which I would create reports. So far, sounds like Power BI right?
It also had the ability to connect to a metadata repository database. This database would contain data that mapped the actual, often cryptic, table and column names in the database to human-readable, business terms. For example a column that I saw in Hyperion as "Cost Center" may have been in the database as costCenter, work_order, or PROJECT-NUMBER. (It would also allow me to define the default join paths, but let's keep this question smaller.) This provided a way to make report development easier.
In Power BI, I see that I can manually rename columns, one-at-a-time. (And each time I touch something minor like this, Power BI takes several seconds to validate the entire file.) But I also see the need for many models that use the same data sources. So I may be defining the "Cost Center" column a few hundred times (a handful of reports per data set to answer a specific type of question, a few data sets that need Cost Center because the transformations in the model will be different for each type of question, several different combinations of data sources that include Cost Center, etc.)
Is there a way to connect Power BI to a metadata repository? Is there a way in Power BI to say, "Across all of my models/datasets, if I'm using the costCenter column from the Financial database, display Cost Center to the user"?
With about 20,000 columns in my data warehouse and 20,000 reports in my current reporting system, this could become a big deal if we intend to migrate to Power BI.
TL;DR: There isn't an easy way to achieve this. What you have now is probably better than you could achieve without a ton of work. If you do try it, use SSAS instead of Power BI Desktop to author models.
Does Power BI have a metadata repository? No. There are tools that can get metadata from Power BI models, but you would have to build the metadata repository yourself. If you want a centrally managed environment like this, I would highly recommend using SQL Server Analysis Services (SSAS) on premises, or even better, Azure SSAS in the cloud. (Azure SSAS will get new features sooner than SSAS installed on premises.) While Power BI Desktop is a great self-service tool, I wouldn't author in it if I needed to control or report across the environment. There just aren't easy ways to corral all of the Power BI models in a report, and it's a much more manual process. SSAS needs IT support, has a higher cost, and you will hit more issues than with Power BI Desktop, but you will need it if you need central control. It's possible that more management controls will be added to the PowerBI.com service over time, but as of November 2021, you can't do this easily.
So what's the difference between Power BI Desktop and SSAS? The same Power BI engine in Power BI Desktop also exists in SSAS. When you start Power BI Desktop, it actually starts an SSAS instance behind the scenes. Using SSAS directly just makes it easier to connect to the database behind the scenes and see all the models in the environment from one place, while Power BI Desktop doesn't let you peek behind the scenes and only loads a single model at a time.
How do you get the metadata? It is easy to get SSAS metadata by using Power Query (or any SQL tool) to pull Dynamic Management View (DMV) data. DMVs are management views that hold all of the metadata of the model, and you just use SQL commands to get the data. Search on "SSAS DMV" to learn more. I have an Excel file that uses Power Query to pull all the key DMV views for all our models on our servers. It makes it easy to do the kind of analysis in your example.
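For example, these two standard tabular DMV queries return the table and column metadata of a deployed model; run them from an SSMS query window against the model, or reference them from Power Query:

    -- Tables defined in the tabular model
    SELECT * FROM $SYSTEM.TMSCHEMA_TABLES

    -- Columns, including the source column name and the friendly name
    -- exposed to report authors
    SELECT * FROM $SYSTEM.TMSCHEMA_COLUMNS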
For Power BI Desktop, you can connect to the hidden SSAS instance and do the same thing, but the report has to be open to do it, and there is no easy way to refresh the data--you pretty much just repeat the process each time. You will connect via localhost:port_number, and the port number is randomly created each time you start Power BI making it impossible to refresh the data pull. There are External Tools such as DAX Studio, Power BI Helper, and dataMarc's Document Model that make that easier, but there's no easy way to automate building the metadata repository for Power BI Desktop files. I would use SSAS directly rather than trying to automate building a large metadata repository.
What about making changes to models? To my knowledge, there aren't any tools that make it easy to make changes across models, though again, you could build them manually. I don't think I would trust my own tool to automate changes across models, though. There's just too much that could go wrong. But if you had the need and the budget, you could build it. Look at tools like Tabular Editor, ALM Toolkit, and Microsoft's SSMS, and read up on DevOps pipelines for automating updates. These tools work against SSAS and Power BI Desktop, but again, you have to open the Power BI files to work with those models, which makes automation that much harder to do.
Note that all the external tools I've mentioned are free except Tabular Editor v3 (Tabular Editor v2 is still free). PowerBI.tips is a great place to install all these tools from a single installer.

How to understand an OLAP cube in S3 or Cassandra?

In this repository, its author mentions that we can stage OLAP cubes in Cassandra or S3:
Once the data is in Redshift, our chief goal is for the BI apps to be able to connect to Redshift cluster and do some analysis. The BI apps can either directly connect to the Redshift cluster or go through an intermediate stage where data is in the form of aggregations represented by OLAP cubes.
How is it possible? How would that work? Am I missing any essential concept? As I understand it, OLAP cubes are a special data structure that exists in OLAP databases. Does the author maybe mean specific pre-calculated combinations of dimensions and facts stored in an OLTP-oriented database, like Cassandra?
Key features of OLAP are:
pivoting
slicing
dicing
drilling
And Redshift can do this.
Its architecture is aimed at solving OLAP and BI tasks. See the amazon-redshift-developer-guide:
Amazon Redshift is specifically designed for online analytic processing (OLAP) and business intelligence (BI) applications, which require complex queries against large datasets. Because it addresses very different requirements, the specialized data storage schema and query execution engine that Amazon Redshift uses are completely different from the PostgreSQL implementation. For example, where online transaction processing (OLTP) applications typically store data in rows, Amazon Redshift stores data in columns, using specialized data compression encodings for optimum memory usage and disk I/O. Some PostgreSQL features that are suited to smaller-scale OLTP processing, such as secondary indexes and efficient single-row data manipulation operations, have been omitted to improve performance.
But the line between the terms is quite blurry.
As Diana Shealy said:
Stop Abusing OLTP as OLAP
There’s a lot of confusion in the market between OLTP and OLAP, and due to the high price of commercial OLAPs, startups and budget-constrained developers have gone on to abuse an OLTP database as an OLAP database. The abuse falls into two categories:
An often multi-shard MySQL database with application layer scripting to perform historical event data analysis. Although this setup is extremely common, it is one of the least productive ways to approach analytics. MySQL is not optimized in any way for reading large ranges of data and its support for analytic functions is weak. As there are multiple alternatives, avoid this “inexpensive” solution because you’ll be paying the price in other places eventually.
Using PostgreSQL as an OLAP layer. This is a more legitimate choice than above for starting an analytics platform because of Postgres’s solid analytic User Defined Functions (UDFs). Also, thanks to its c-store extension, PostgreSQL can be turned into a columnar database, making it an affordable alternative to commercial OLAPs.
Finally, if you are considering moving from OLTPs abused as OLAPs to “real” OLAPs like Redshift, I encourage you to learn how to use Redshift’s COPY Command so that you can start seeing your data inside Redshift.
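As an illustration of that COPY command (outside the quote; the table, bucket path and IAM role below are placeholders), a typical Redshift load looks like:

    -- Bulk-load staged S3 files into a Redshift table in parallel.
    COPY sales_facts
    FROM 's3://my-bucket/staging/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
    FORMAT AS CSV
    IGNOREHEADER 1;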
As for your questions:
How is it possible?
It's possible due to the Redshift architecture (a columnar database) and analytical features such as the following (a small query sketch follows the list):
Window functions
Data Warehouse System Architecture
Performance
Columnar Storage
Internal Architecture and System Operation
Workload Management
Aggregate functions
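For illustration, here is the kind of analytical query those features enable; the table and column names are made up:

    -- Running total of sales per product, computed at query time over a
    -- columnar, distributed table (table and columns are hypothetical).
    SELECT
        product_id,
        sale_date,
        SUM(amount) OVER (
            PARTITION BY product_id
            ORDER BY sale_date
            ROWS UNBOUNDED PRECEDING
        ) AS running_total
    FROM sales;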
How would that work?
See System and Architecture Overview for a detailed explanation of the Amazon Redshift data warehouse system architecture.
(Some links are already mentioned before in this post)
Essential concept?
Am I missing any essential concept?
I'd suggest relying more on the technical details of a specific solution than on marketing terms. In the end, practical tasks are solved not by software naming or marketing, but by its real functionality.
What's really important in the DB landscape is to consider two theorems:
CAP theorem
According to the iron triangle of the CAP theorem, you can choose two of three DB architecture properties:
* consistency
* availability
* partition tolerance
PIE theorem
Rick Houlihan of Amazon gave a talk on choosing a DB architecture. In addition to the CAP theorem, he also presented the PIE theorem:
The PIE theorem posits that you can choose two out of three desirable features in a data system:
Pattern Flexibility
Efficiency
Infinite Scale
And Redshift sits on the P-I edge of the PIE triangle (pattern flexibility and infinite scale).
Data structure
As I understand it, OLAP cubes are a special data structure that exists in OLAP databases. Does the author maybe mean specific pre-calculated combinations of dimensions and facts stored in an OLTP-oriented database, like Cassandra?
Both OLAP aggregated data structures and Redshift distribution styles are aimed at one goal: making queries faster.
A columnar DB, distribution, parallel queries, and other such features are good for analytical tasks.
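To make the "distribution styles" point concrete, here is a hedged sketch of declaring distribution and sort keys on a Redshift table (all names are hypothetical):

    -- Distribute rows by the common join key and sort by date so that
    -- co-located joins and date-range scans touch less data.
    CREATE TABLE fact_sales (
        sale_date   DATE          NOT NULL,
        customer_id INTEGER       NOT NULL,
        product_id  INTEGER       NOT NULL,
        amount      DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (sale_date);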
UPD
In the comments you asked whether Cassandra can work as an OLAP service.
Cassandra and S3 can be used as storage for pre-calculated, aggregated data over the dimensions.

Power BI in the context of Data warehousing

My company is currently building an enterprise data warehouse in SQL server. We are looking at using PowerBI but I'm struggling to see how PowerBI works in the context of a data warehouse.
For instance, what would it offer us, other than nicer-looking reports, that Cognos, which we are using now, doesn't? How does it handle immense amounts of data?
In the context of the Enterprise Data Warehouse Power BI has a number of options.
1) It can be the visualisation layer over your SSAS data models: users can connect and quickly create reports, as it sits over the model rather than importing data into the Power BI report. Data processing is done on the server side, and it can access huge data models/databases.
2) Rather than creating SSAS data models, Power BI can create a semi-semantic layer itself, as it is a branch of SSAS Tabular technology. Your users can quickly deploy reports based directly on the database. You can use it in DirectQuery mode; as with option 1, this sits over the database and query processing is done on the server side. You can also import data, but you will be limited to 1 GB dataset sizes, and all report queries are then served from the imported dataset, not the server. With Aggregation Mode you can combine import and DirectQuery to sit over large databases.
The real benefit is enabling self-service BI, getting users to create their own reports, so you can mix strategic reports (built by the business) and tactical reports (user built). Power BI allows a quick process to mix and match data sources, for example data under your organisation's domain (databases, cubes, Excel files, etc.) and data not under your domain (web pages, APIs, other sources).
You can also have Power BI on-prem or in the cloud. On-prem will depend on the SQL Server license type, or it will be another cost. Power BI also fully integrates with O365 and Azure, so depending on your application/tech stack, that may be a benefit. It also integrates very well with Power Apps and Power Automate, so power users can build solutions without requests to IT or others.
This is from my personal experience. I have had a number of projects for enterprise-scale customers that have moved from Cognos (and other tech like Tableau), fully or in part, due to cost and the integration of Power BI into O365. End users liked the large knowledge base, the support from MS, and the rapid updating/roadmap of the technology. The most common question is: can it replace X tech? The answer is maybe; it will depend on your report requirements and how it will integrate with your data sources. Another trend I've noticed is that some work has moved from IT/BI to the power users, particularly with the Power Apps/Automate functionality.
Power BI is also a lightweight ETL and modelling tool, so it is not just a visualisation tool. There are a number of blogs and articles that compare Power BI to Cognos, but they tend to be biased, so it will be tricky to find an objective answer.

Aggregate tables vs real-time analytics [closed]

I've been researching different approaches to streaming data to a real-time dashboard. One way that I have done in the past is using a star schema/dimension and fact tables. This would be an implementation of aggregate tables. For example, the dashboard would contain multiple charts, one being the total sales for the day, total sales by product, total sales by manufacturer, etc. etc.
But what if this needed to be real-time? What if the data needs to stream to these charts and do the analytical processing real-time?
I've been looking into solutions like Kinesis streams and Kafka, but I may be missing something obvious. For example, consider the following example. A company runs a website where they sell pies. The company has a backend dashboard where they keep track of all data and analytics related to sales, users, orders, etc.
Customer places order through website
The relational (mysql) database receives this new order
The charts and analytical data update in real time on the backend, for example total sales for the day, or total sales for the year by user.
If the scenario is that this data needs to be streamed, what is the best approach to this? Aggregate tables seem like the obvious but it seems that would be periodic and not real-time. Kinesis/Kafka feels like it would fit somewhere in here. The other option would be something like Redshift but it's pretty pricey and still may not be the best way to address the issue and scale.
Here is an example of a chart that would need to be updated in real time, and that could suffer from simply running plain aggregate SQL queries when there are tons and tons of rows to parse.
In case of "always up-to-date" reports like this (sales, users, orders etc) that don't need live updates with near-zero-latency streaming processing might be overkill, and ROLAP-like approach seems to be more optimal in meaning of efforts/result.
You mentioned Redshift; if you are already ready to mirror your data for analytics purposes and the only problem is the price, you can consider free open-source alternatives that can handle OLAP (aggregate) queries in real time (like Yandex ClickHouse, or maybe MongoDB in some cases).
A lot depends on the dataset size; unless you have really big data that needs to be aggregated (hundreds of GB), you can try to keep using MySQL and use some tricks:
use a separate slave MySQL server with high IOPS for analytics and replicate only the tables needed to build your reports; possibly use another table engine that is more suitable for analytical queries. Set up indexes specifically for these queries, to avoid full table scans when you only need numbers for the last weeks.
pre-calculate metrics for previous periods (a materialized-view-like approach) and refresh them on a schedule (say, daily), then combine the pre-calculated aggregates with on-the-fly aggregates for the last period only, to get up-to-date report data without scanning the whole fact table each time (a small sketch follows this list).
use a data visualization backend that can efficiently cache report data in memory to prevent SQL DB overload from many similar queries (if the same report or dashboard is displayed to 100 users, the SQL DB load stays the same as for 1). BTW, I develop a solution like that (I can't advertise it here as it is a commercial product).
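A minimal sketch of that second trick, assuming a daily_sales summary table refreshed nightly and an orders fact table (both names are hypothetical):

    -- Pre-aggregated history combined with an on-the-fly aggregate for today,
    -- so the full fact table is never scanned at report time.
    SELECT sales_date, total_amount
    FROM daily_sales
    WHERE sales_date < CURRENT_DATE
    UNION ALL
    SELECT DATE(order_ts) AS sales_date, SUM(amount) AS total_amount
    FROM orders
    WHERE order_ts >= CURRENT_DATE
    GROUP BY DATE(order_ts);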
This is a typical trade-off for most architects. Amazon Redshift offers exemplary read optimisations, but the AWS stack comes at a price. You may try using Cassandra, but it comes with its own set of challenges. When it comes to analytics, I never recommend going real time, for the reasons elaborated below.
Doing analytics in real time is not desirable, especially using MySQL
The solution to the above comes from segregating the transactional and analytical infrastructure. This involves cost, but it will make sure you don't have to spend time on housekeeping once you scale. MySQL is a row-based RDBMS mostly used for storing transactional data. Being row-based, it optimises writes, i.e. the writes are almost real time, and it thus compromises on reads. When I say this, I refer to a typical analytics dataset running into millions of records per day. If your dataset is not that voluminous, you might still be able to render a graph showing transactional status. But since you're referring to Kafka, I assume the dataset is very large.
A real-time dashboard with visualisations gives a bad customer experience
Considering the above point, even if you go for a warehouse / read-optimised infrastructure, you need to understand how the visualisations work. If 100 people access the dashboard at the same time, 100 connections will be made to the database, all fetching the same data, putting it in memory, applying the calculations, parameters and filters defined in your dashboard, fitting the refined dataset into the visualisation and then rendering the dashboard. Until all this is done, the dashboard will simply freeze. A poorly constructed query, inefficient use of indexes, etc. will make matters even worse.
The above problems will only amplify as your dataset grows. Good practices to achieve what you need would be:
Have an almost-real-time system (a delay of 1 hr, 30 mins, 15 mins, etc.) rather than an absolute real-time system. This will let you create a flat file with the data already fetched into memory. Your dashboard will simply read this data and will be extremely fast in terms of responses to filters, etc. Also, multiple connections to databases will be avoided.
Have a data structure, database/warehouse optimised for reads.
For these types of operational analytics use-cases where the real-time nature of the data is critical, you're completely correct that most "traditional" methods can be quite clumsy, especially as your data size increases. A quick overview of your options:
Historical Approach (TLDR– Meh)
Up until about 5 years ago, the de facto way to do this looked something like
Set up a primary OLTP database that will handle the data in its raw form and have stricter guarantees on performance or ACID properties. Usually this is something SQL-esque, i.e. MySQL, PostgreSQL.
Set up a secondary OLAP database that is meant for serving offline (aka non user-facing) queries. This could also be a SQL-esque db but its schema would be drastically different because it stores the data in enriched form.
Set up some mechanism by which you can keep these 2 in sync. This pretty much boils down to either a) changing your application to always write to both databases and performing the necessary data enrichment or b) building a stand-alone application that reads from your OLTP database, performs the necessary transformations and enrichment and writes to your OLAP database
Plug your dashboard into your OLAP database which will have a schema and indexes optimized for the kind of queries you want.
Using your example about the pie store, the OLTP database would be used to store the purchases of all the pies and reference things like customer ids, billing information, delivery information, etc. In contrast, the OLAP database might just maintain a table with a schema
purchase_totals(day: Date, weekNumber: int, dayOfWeek: int, year: int, total: float)
While weekNumber, dayOfWeek, and year are technically redundant, they make your queries faster! With the proper indexes on these fields, your dashboard turns into 5 simple (and fast!) aggregation queries with a group by and a sum, and the differences week-over-week or year-over-year can be computed on the client side. As long as your dashboard refreshes every minute or so, you have near-real-time data at your fingertips.
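For example, the daily-totals chart could be served by a query like this (a sketch against the schema above; the year filter is a placeholder):

    -- Total sales per week and day of week for one year; an index on
    -- (year, weekNumber, dayOfWeek) keeps this a fast range scan.
    SELECT weekNumber, dayOfWeek, SUM(total) AS total_sales
    FROM purchase_totals
    WHERE year = 2018
    GROUP BY weekNumber, dayOfWeek
    ORDER BY weekNumber, dayOfWeek;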
Current Approach (TLDR– Ok)
The recent trends in computing, database technologies, and data science/analytics have led to improvements to the above process, namely by replacing certain components of it. The changes include
Making the OLTP db, the OLAP db, or both a NoSQL database (Mongo usually being the most popular). The pro here is that you have a more flexible schema which won't break if something upstream changes (say, you start selling cakes in addition to pies).
Keeping the SQL db but shifting to cloud provider solution like AWS RDS or Google Cloud SQL. This fundamentally doesn't change anything about the architecture, but it does significantly reduce your operational burden.
Using hard-to-maintain ETL pipelines on top of streaming platforms like Kafka or AWS Kinesis to act as the middle layer between OLAP and OLTP.
Using dedicated tools for data cleaning and transformation as you plan out how to do your ETL
Using dedicated visualization tools on top of your OLAP db (think Tableau)
Using a pull-based approach for getting data out of your OLTP db or your application directly instead of waiting for it to eventually reach your OLAP db. This is helpful for online services because it actually gives you both the data you want AND confirmation that the service is alive and running well (because it just served your request for data). Systems like Prometheus are quite popular for this now.

Why is BigQuery so slow on non-large data sizes?

We have found BigQuery to work great on data sets larger than 100M rows, where the 'initialization time' doesn't really come into effect (or is negligible compared to the rest of the query).
However, on anything under that, the performance is quite poor, which makes it (1) ill-suited to use in an interactive BI tool; and (2) inferior to other products, such as Redshift or even ElasticSearch, where the data size is under 100M rows. Actually, we had an engineer at our organization who was evaluating a technology for doing queries on data sizes between 1M and 100M rows for an analytics product that has about 1000 users, and his feedback was that he could not believe how slow BigQuery was.
Without a defense of the BigQuery product, I was wondering if there were any plans on improving:
The speed of BigQuery -- especially its initialization time -- on queries of non-massive data sets?
Will BigQuery ever be able to deliver sub-second response times on 'regular' queries (such as a simple aggregation group by) on datasets under a certain size?
The time is spent on metadata/initiation; the actual execution time is very small. We have work in progress that will address this, but some of the changes are complicated and will take a while.
You can imagine that in its infancy, BigQuery could have central systems for managing jobs, metadata, etc. in a manner that performed very well for all N0 entities using the service. Once you get to N1 entities, however, it may be necessary to rearchitect some things to make them have as little latency as possible. For notification about new features--which is also where we would announce API improvements related to start-up latency--keep an eye on our release notes, which you can also subscribe to as an RSS feed.
Exactly 4 years after this question was asked, we have amazing news for BigQuery users! As stated in this BI Engine release note from 2021-02-25:
The BI Engine SQL interface expands BI Engine to integrate with other business intelligence (BI) tools such as Looker, Looqbox, Tableau, Power BI, and custom applications to accelerate data exploration and analysis. This page provides an overview of the BI Engine SQL interface, and the expanded capabilities that it brings to this preview version of BI Engine.
I believe this can solve the query latency issue mentioned in David542's question.