I'm new to PowerBI, and am working on a large database. I am attempting to prepare the data in the PowerQuery Editor.
I would like to code as many steps as possible, as analysing each column manually is extremely time consuming.
My coding goals (in order of priority):
For each query I would like to get their column quality.
Ideally, I would like to export the header names with the column quality, so that I can determine which are relevant. Furthermore, I can also use the column names to determine which column relationships might be relevant. The database is huge, so simply just importing all the data and trying to work with it from their is not feasible, in fact PowerBI comes up with the error that I don't have enough free memory.
I have VBA and some SQL experience.
I know I have a lot to learn w.r.t. PowerBI, and I am working on it, but need some guidance and direction, also on what is possible/feasible.
Any contructive hints, advice, or feedback would be appreciated - thank you!
Use Table.Profile() on each table and load to the data model.
https://learn.microsoft.com/en-us/powerquery-m/table-profile
Related
Guys this is a bit of newbie question. Ive tried to google it and understand how they work but im not having much luck. I have a datasets created by colleague that connect to one of our systems.
I want to look at using it and trying to make some changes. I can see Its create a .pibx file when i saved a copy of the dataset. i want to look at model section and see if i can pull some further fields(column) into table on the dataset that already links some corresponding data from two other tables. Id like to add more fields(columns) that are not currently in that table
However i don't want to affect other datasets and or the data on the system it is communicating with.
Can anyone advise me if this is the case.
As i really only want to test things for now and not make any changes that might affect other people
You can create duplicate for original table( Query ) then play with that duplicat
When I get data into Power BI I can edit the query as well as perform edit to the model.
What is difference between edit performed in query edit vs during modelling?
When you edit the query, you use Power Query, with its own Query Editor user interface. The steps you apply are recorded in the "M" language. Use Power Query to extract, transform, and finally load data into the Data Model.
Once the data is in the Data Model, you use DAX to create measures that you use in visuals. You can also use DAX to add more columns or even tables to the data model.
Whether to use Power Query or DAX to add columns or tables to the data model depends on a variety of factors. Some things are dead easy to do in Power Query, but harder to achieve with DAX, and vice versa. If you create a column with a formula that depends on a DAX measure, then you can only do that with DAX, because Power Query is not aware of the measures that are created after the load into the data model.
Power Query is very powerful, but the M code syntax is very different to the Excel formula syntax, or the VBA macro language. Learning to write advanced M code can be quite challenging.
DAX, on the other hand, behaves very similar to Excel formulas. Many Excel functions can even be used in DAX verbatim. If you know Excel, you've already got a head start on DAX and you can ease your way into it by learning additional functions and then expanding into more complex formulas.
The latter is probably the reason why many data manipulations are done in DAX, even though they could as well have been done in Power Query.
There are also some efficiencies with data storage and performance. Power Query makes use of query folding with SQL queries, for example, where its transformations are actually performed at the data source, i.e. on the SQL server side, and not in desktop client, and only the final query result is transferred to the desktop client.
Edit after comment: When the data is loaded into the data model, an algorithm processes the data and sorts it in a way that is most efficient for maximum compression and minimum storage. I don't have any concreate examples, but adding a column in Power Query will result in a smaller footprint than adding the same column with DAX. Read more about the compression algorithm VertiPaq here: https://towardsdatascience.com/inside-vertipaq-in-power-bi-compress-for-success-68b888d9d463
But apart from that, it mainly comes down to personal preference based on skill and experience.
By the way, many of your questions can be answered by reading through the Microsoft documentation, e.g. https://learn.microsoft.com/en-us/power-bi/guidance/import-modeling-data-reduction
I've been working in a practice PowerBI document and was able to get a measure to work, but now when I try to re-create the measure in the non-practice document, I get this error:
'The column 'Department Names[DESL]' either doesn't exist or doesn't have a relationship to any table available in the current context.'
The data and tables are exactly the same for both files.
I've compared the relationship in the model view and the relationship is the same.
Where else can I troubleshoot to figure out why the measure doesn't work? I feel like PowerBI did something automatically in my practice document that I need to implement.
Also, any great training suggestions would be swell.
I found a problem with my database connection, which would have been nice if PowerBI had told me directly, instead of only throwing an error when I tried to connect the 2 sources.
I'm not done fixing the problem yet, but suspect that once that connection is good, I'll be able to relate the 2 columns in the 2 tables.
the problem I'm solving has many simple solutions but what I need is to find the way to reduce the time and memory needed for the process.
On the one side I have a table with a few hundred ID's and on the other 40 monthly tables and counting.
Each of the tables has between 500 000 to 1 mln records each for unique id. Each table has few thoustand variables but i only need 10-20 of them.
I need to lookup the tables to find the latest table when particular id from base table occur and get variable values that I need.
The newest month table is being calculated every day so many id's from previous months may occur again so I cannot just create indexed dictionary (last.id and variables) once. Also I can't afford creating new dictionary based on all tables every day.
Visual description
I came up with some ideas but I need your help to find the most efficient concept:
Concatenate all monthly tables with variables needed, sort ascending ID and month, select last.id using data step. Use join or merge with base table.
Problem: too much memory needed to set all tables.
Alternatively I used proc append in loop. Unfortunately not very time and memory efficient.
Inner join with all of the tables separately in loop:
Low memory use but very time consuming.
Create dictionary based on all months besides the latest and update it every day.
Problem: Large dictionary table.
Now I'm looking for smart concepts how to solve this kind of problem. Maybe hash objects.. but how?
I would greatly appreciate it if you give me some feedback on this case.
Thank you!
If someone was to write some code to generate some dummy data based on your specs they may be able to provide a more specific answer to your question. But without sample data it's hard to know the best way without trial and error.
Instead I've paraphrased some of my old answers into a more comprehensive list of things you can check.
Below are some ways to boost performance (roughly in order of performance improvement, YMMV):
Index the fields in each table that you will be joining on or using in a where clause. Not all fields are good candidates for indexes so do a little research on how to determine this before indexing.
Reduce the number of rows as early in the process as possible (ie. use a where clause to get rid of anything you don't care about).
If the joins are still time consuming, consider replacing them with hash table lookups.
Compression. When you build the datasets make sure you use the compress=yes option if you're not already. This will shrink the size of the table on disk resulting in less disk I/O (the slowest part of querying).
If the steps are IO intensive, consider using views rather than creating temporary tables.
Make sure you are using proc append to append datasets together to reduce IO (sounds like you are, just adding this for completeness). Append the smaller dataset to the larger dataset. Alternatively use a view to 'append' them without duplicating overhead.
Limit the columns you are processing by using a keep statement (reduces IO).
Check column lengths - make sure you're not using a field length of $255 to store something that only needs a length of $20 etc...
Use the SAS SPDE (Scalable Performance Data Engine). It allows you to partition your SAS datasets into multiple files and optionally spread them across different disks. Once your SAS datasets reach a certain size you can see performance improvements. I generally tend to use SPD libnames any time a dataset grows > 10G. No additional SAS modules are requires - this is enabled as part of Base SAS.
In Power BI, I've got some query tables generated from imported data. All the data comes in as type 'Any', and I'm trying to automatically detect the type of the data in each column.
Some of the queries generate tables with columns based on the in-coming data - I don't know what the columns are going to be until the query runs and sets up the table (data comes from an Azure blob). As I will have quite a few tables to maintain, which columns can change (possibly new columns being added) with any data refresh, it would be unmanageable to go through all of them each time and press 'Detect Data Type' on the columns.
So I'm trying to figure out how I can do a 'Detect Data Type' in the query formula language to attach to the end of the query that generates the table columns. I've tried grabbing the first entry in a column and do Value.Type(column{0}), however this seems to come out as 'Text' for a column which has integers in it. Pressing 'Detect Data Type' does however correctly identifies the type as 'Whole Number'.
Does anyone know how to detect a column's entry types?
P.S. I'm not too worried about a column possibly holding values of different data types
You seem to have multiple issues here. And your solution will be fragile, there's a better way. But let's first deal with column type detection. Power Query uses the 'any' data type as it's go to data type. You can write a function that samples the rows of a column in a table does a best match data type detection then explicitly sets the data type of the column. This is probably messy and tricky since you need to do it once per column. This might be workable for a fixed schema but for a dynamic schema you'll run into a couple of things very quickly. First you'll need to write some crazy PQ code to list all the columns and run you function on each. This will work the first time, but might break in subsequent refreshes because data model changes are not allowed during refresh. If you're using a tool like Power BI Desktop, you'll be able to fix things up. If you publish your report to the Power BI service, you'll just see refresh errors.
Dynamic Schemas will suffer the same data model change issue I mentioned above.
The alternate solution that you won't have problems with is using a Direct Query data source instead of using Power Query. If you load your data into Azure SQL or a Tabular Model, the reporting layer will get the updated fields automatically so you don't have to try to work around using PQ.