I want to import a 500 GB dataset into Power BI, but Power BI is limited to 1 GB per dataset. How can I get the data into Power BI?
Thanks.
For 500 GB I'd definitely recommend Direct Query mode (as Joe recommends) or a live connection to an SSAS cube. In these scenarios, the data model is hosted in a separate location (such as a database server), and Power BI sends its queries to that location and displays the returned results.
However, I'll add that the 1GB limit is the limit after compression. (Meaning you can fit more than 1GB of uncompressed data into the advertised 1GB dataset limit.)
While it would be incredibly difficult to reduce a 500GB dataset to 1GB (even with compression), there are things you can do once you understand how the compression works in Power BI.
In Power BI, compression is done by column, not by row. So a column with 800 million rows of identical values can see significant compression. Conversely, a column with a different value in every row cannot be compressed much at all.
Therefore:
Do not import columns you do not absolutely need for analysis (particularly identity columns, GUIDs, free-form text fields, or binary data such as images)
Look at columns with a high degree of variability and see if you can also eliminate them.
Reduce the variability of a column where possible. For example, if you only need a date and not a time, do not import the time component. If you only need a whole number, do not import 7 decimal places.
Bring in fewer rows. If you cannot eliminate high-variability columns, then importing 1 year of data instead of 17 (for example) will also reduce the data model size (see the sketch after this list).
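As a rough illustration, here is a minimal Power Query (M) sketch of those steps, assuming a hypothetical dbo.FactSales source table with columns named RowGuid, Comments, OrderTimestamp and Amount (adjust the names and the cutoff date to your own schema):

let
    // Hypothetical SQL Server connection and table
    Source = Sql.Database("MyServer", "MyDb"),
    Sales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // 1. Drop columns that aren't needed for analysis (IDs, GUIDs, free text, binaries)
    RemovedCols = Table.RemoveColumns(Sales, {"RowGuid", "Comments"}),
    // 2. Reduce variability: keep only the date part and round amounts to 2 decimals
    DateOnly = Table.TransformColumns(RemovedCols, {{"OrderTimestamp", DateTime.Date, type date}}),
    Rounded = Table.TransformColumns(DateOnly, {{"Amount", each Number.Round(_, 2), type number}}),
    // 3. Bring in fewer rows: keep only recent data (example cutoff)
    Recent = Table.SelectRows(Rounded, each [OrderTimestamp] >= #date(2023, 1, 1))
in
    Recent

Doing this in Power Query (or, better still, in a view at the source) keeps the high-cardinality data out of the model before compression even starts.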
Marco Russo & the SQLBI team have a number of good resources for further optimizing the size of a data model (SSAS tabular, Power Pivot & Power BI all use the same underlying modelling engine). For example: Optimizing Multi-Billion Row Tables in Tabular
If possible given your source data, you could use Direct Query mode. The 1 GB limit does not apply to Direct Query. There are some limitations to Direct Query mode, so check the documentation to make sure that it will meet your needs.
Some documentation can be found here.
1) Aggregate the data on the SQL side to reduce its size.
2) Import only the columns you actually need, which also reduces the size.
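A small Power Query (M) sketch of both points, assuming a hypothetical dbo.FactSales table; when the source supports query folding, the column selection and grouping below fold back to SQL Server as a single GROUP BY query:

let
    // Hypothetical SQL Server connection and table
    Source = Sql.Database("MyServer", "MyDb"),
    Sales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // Import only the useful columns
    Slim = Table.SelectColumns(Sales, {"OrderDate", "ProductKey", "SalesAmount"}),
    // Aggregate on the SQL side (folds to a GROUP BY when the source supports it)
    Grouped = Table.Group(Slim, {"OrderDate", "ProductKey"},
        {{"TotalSales", each List.Sum([SalesAmount]), type number}})
in
    Grouped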
My goal is to quickly & dynamically visualize a big data set (> 500 M rows) using QuickSight. To achieve quick query times, it's necessary to load all of the data into SPICE. However, AWS currently has a hard limit for the maximum number of rows that can be imported into SPICE for a single data set, which is 500 M rows. I currently don't see any option that could be used to visualize all of the data. Here are things that I already considered:
Splitting the full data set into individual QS datasets: the problem with this approach is that QuickSight requires that each visual has a single dataset as an input, so values from multiple datasets cannot be shown in the same visual. I'm aware that multiple datasets can be used within one dashboard but that would not suit the use-case of having a single plot visualizing the data.
Pivoting the table: the input table has a lot of rows, so changing the format from a long to a wide table would circumvent the SPICE row limitation. However, QuickSight doesn't seem to support using an array of columns as y-values to be plotted.
Creating a dataset per visualization: certain visualizations can theoretically be defined using fewer values than are in the original data set. For example, to create a box plot over a set of groups, we mainly need the quartile values for each group rather than the full data set, which would put us below the SPICE limitation. However, QuickSight doesn't allow creating custom plots, such as a box plot where the quartiles have already been pre-computed.
Currently, the only viable approach I see is to create a dashboard per user, since most users would only be interested in a subset of rows from the full data set.
Irrespective of the approach taken, this limitation unfortunately forces some compromises.
Depending on the number of users, creating a dataset per user might become a headache to manage. So, I would suggest that, if possible, you use datasets that capture groups of users (for example, by user group or by the user's country).
Pivoting the table might make it harder to build some visuals. As you said, if you pivot multiple values from different rows into an array field, you would not be able to extract them easily in analyses (you could use string functions to extract them that way, but there are limitations around this approach too).
Creating a dataset per visualisation also has maintenance overhead, in that you would need to update and re-ingest the dataset most times you change a visualisation.
Some other approaches you might consider:
Aggregate multiple rows together. For example, if your dataset has multiple rows for each user within the same minute, you could aggregate these into one row by summing the values within that minute. The aggregation period should be as large as possible, but keep in mind that this will affect the time granularity in your analyses/dashboards.
Prune old data. If you are more interested in recent data, you could add a filter to keep only, say, one month of activity. You could then have other non-SPICE (Direct Query) datasets that do not have this restriction, but reports on older data would be slower.
Cache in an external database. You could load your data into a data warehousing database (such as AWS Redshift) and then not use SPICE in QuickSight. Of course, this will probably be more expensive.
What happens when I filter data in Power BI?
I am connecting to Analysis Services and loading data from a cube and then filtering it on the Year column = "2022".
What happens to previous years' data? While the historical data is not used in the report, will loading all the data from the source cause performance issues, or does filtering restrict the load to only the rows that match the filter criteria?
It depends on where you have filtered.
If you filter the other years out in Power Query, only 2022 will be loaded into Power BI. This may also reduce import time a little.
Power BI itself works with subsets. If you use a page filter for the year 2022, it creates a subset containing only the 2022 rows, so the other years won't affect query performance there. But the file will be bigger, and opening it may take a bit longer compared with filtering the other years out in Power Query. The advantage: on other pages you still have the full dataset, including the years before 2022.
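To illustrate the Power Query route, a minimal M sketch, assuming the data has already been imported into a query (here called CubeData) with a Year column:

let
    Source = CubeData,                                       // hypothetical query holding the imported cube data
    // Keep only 2022 before the data is loaded into the model
    Filtered = Table.SelectRows(Source, each [Year] = 2022)  // use "2022" instead if Year is stored as text
in
    Filtered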
I'm using a calculated column that computes an average, and the result is above the range of possible values, which should be impossible. The column calculates the average star rating (on a 1-5 scale), yet the value on a visual comes up as 6, which shouldn't be possible even if every rating were 5 stars, and they aren't. So there must be an outlier pushing the average above the possible range, but it isn't in the original data source that Power BI pulls from. The original source shows an average of 4.1, which is within the expected range, so Power BI's dataset has introduced an outlier (or data is missing) that pushed the average up to 6.
I can elaborate on the DAX below, but what I want to do is pull the dataset down from Power BI to figure out why it's calculating the average that way. Looking at the source data, the average is 4.1 and there are no outliers, so the source data isn't the problem. Basically, I want to find the outlier that's causing the average rating to differ in Power BI.
Avg Rating = IF(SUM(data[Total Reviews]) = 0, BLANK(), SUM(data[Monthly Stars])/SUM(data[Total Reviews]))
Here's a screencap that shows the two relevant columns.
Notice that I had to calculate the average of these two columns manually (eyeballing the columns and typing the values into a calculator), which came out to ~4.6. I'm trying to download this dataset to explore it in more detail without having to eyeball it, since the source data doesn't show this discrepancy.
To get to the data, you have a number of options.
Create a new report in Power BI Desktop and use the Connect to Power BI dataset option to access that data in, for example, a table visual. You can also create your own report based on the dataset in the service.
Use Analyze in Excel, which lets you explore the data in an Excel pivot table.
Use the Export data option on the visual; this lets you download up to 30,000 rows to a .csv file or up to 150,000 rows to .xlsx.
Please note that these options may not be available to you if you do not have the right permissions in the workspace, or if they have been turned off in the Power BI admin tenant settings.
I have a Power BI report that pulls a current inventory of system statuses from an Excel spreadsheet. Let's keep it simple and say I have a single measure that reads "40% complete".
If I refresh the Power BI dataset and it now says "60%", is there any way to have a KPI automatically show +20%? Every example I've found requires another dataset that keeps the historical data, and that's not really an option in this situation. Is there any way to calculate or store it within the Power BI query itself?
Power BI is not designed to store historical data. This is what a database is for.
In order to calculate that 20% difference, you need to store historical data somewhere, but Power BI's purpose is to connect to sources, load data and then visualize it, not to act as a data repository.
I am attempting to build a data model in Power BI from a data mart (star schema) in SQL Server. This data mart has a fact table and several dimension tables. One of the dimension tables is a date table. I want to load all rows from the fact table. However, I only want to load a subset of the date table. In particular, I want those dates (rows) between the min and max dates in my fact table. This way, when I create slicers and such, I don't have unnecessary dates appearing.
In other BI tools (e.g., Qlik Sense), the usual solution is to first load the fact table into memory, compute and load its min/max dates into another table (also in-memory), set variables from this other table, load the date dimension table (into memory) based on the min/max variables, and finally drop the temporary table from memory so that it doesn't stay in the model and cause problems. This seems like the most efficient solution to me. It only reads the required (as opposed to unnecessary) data from the source dimension table, it doesn't need to perform any joins in the source, and it only reads each table once (as opposed to 2+ times).
How can I achieve this in Power BI? Or, more importantly, is this solution method even possible in Power BI?
I found this solution, but it seems inefficient, as it creates 2 queries (instead of just 1) for the min/max and, moreover, it performs the dimension table filtering after all rows have already been fetched from the source. (In my particular case, this isn't too bad. But, it could be problematic in other situations in which my dimension table is large.)
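In M, I imagine the equivalent would look roughly like this (assuming hypothetical queries named FactSales, with an OrderDate column, and DimDate, with a Date column), though I'm not sure whether the filter would fold back to the source:

let
    // Read the date range from the fact query (defined elsewhere in the file)
    MinDate = List.Min(FactSales[OrderDate]),
    MaxDate = List.Max(FactSales[OrderDate]),
    // Keep only the dimension rows between the fact table's min and max dates
    Filtered = Table.SelectRows(DimDate, each [Date] >= MinDate and [Date] <= MaxDate)
in
    Filtered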