I am looking at an Excel file that will be imported into Power BI. I am not allowed to access the database itself for employment reasons, so they gave me an Excel file to work with that I will then upload into Power BI.
On one of the fact "tables", they have data that looks like this:
s-ID  success%  late%  on-time%  schedule
1     10%       2%     5%        calculus-1;algebra-2
1     5%        10%    27%       Calculus-1
1     5%        3%     80%       algebra-2
2     33%       50%    3%        null
5     5%        34%    8%        English-1;English-10;theatre;art
I realize the numbers do not make any sense, but that is basically how the data is structured. There are also roughly 100,000 records in this fact "table".
I have a dimension for courses, but I'm not sure how to handle this schedule column. If I split the column vertically, the measure columns will be double counted.
How can I model this and put the schedule into a dimension intelligently in Power BI?
My goal is to model the data as follows:
Split the schedule into separate rows, but without double counting any of the measure values.
Also show when an s-ID record has the student taking a class that combines calculus-1 and algebra-2.
Sometimes the professors schedule 2 classes together as 1 class when they are covering topics that apply to both. There could be 2 classes combined, or as many as 8, or anything in between.
Is this a scenario where a bridge table would be appropriate?
You can use a bridge table. In a classic dimensional schema, each dimension attached to a fact table has a single value consistent with the fact table's grain, but there are a number of situations in which a dimension is legitimately multivalued, as in your example, where a student can be enrolled in many courses.
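To sketch the shape this could take (a rough illustration only; in practice the split would usually be done in Power Query, and the surrogate-key and column names below are assumptions based on your sample data): keep the measures once per fact row, give each distinct schedule string a key, and build a bridge table with one row per (schedule key, course).

import pandas as pd

# Fact table as it arrives: one row per s-ID + schedule combination.
fact = pd.DataFrame({
    "s_id": [1, 1, 1, 2],
    "success_pct": [10, 5, 5, 33],
    "late_pct": [2, 10, 3, 50],
    "on_time_pct": [5, 27, 80, 3],
    "schedule": ["calculus-1;algebra-2", "Calculus-1", "algebra-2", None],
})

# Surrogate key per distinct schedule string; the fact keeps this key and its
# measures exactly once, so nothing gets double counted.
fact["schedule_key"] = fact["schedule"].fillna("(none)").astype("category").cat.codes

# Bridge table: one row per (schedule_key, course).
bridge = (
    fact[["schedule_key", "schedule"]]
    .dropna()
    .drop_duplicates()
    .assign(course=lambda d: d["schedule"].str.lower().str.split(";"))
    .explode("course")[["schedule_key", "course"]]
)
print(bridge)

In the Power BI model, the course dimension filters the bridge and the bridge filters the fact through schedule_key (via a many-to-many relationship or an intermediate schedule-group table with bidirectional filtering), so each measure value is stored only once while still being sliceable by course.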
I receive 100 rows every day from an application. Good practice in my company suggests partitioning every table by day. I don't think it is good to do this on the new table I will create to insert a hundred rows into daily. I want to partition the data by year instead; is that a good idea?
How many rows per partition are needed for the best performance?
It also really depends on the queries you are going to execute on this table, that is, what kind of date filters you are going to use and which columns you join on. Refer to the answers below, which will really help you decide on this.
Answer1
Answer2
Keep in mind that the number of partitions is limited (to 4,000). Therefore partitioning is great for low-cardinality fields. Per day is perfect (4,000 days is about 11 years).
If you have higher cardinality, customer ID for example (and I hope you have more than 4,000 customers!), clustering is the solution to speed up your queries.
When you partition and cluster your data, you create small buckets. The less data you have to process (load, read, store in cache, ...), the faster your query will be! Of course, on only 100 rows a day, you won't see any difference.
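For reference, here is a minimal sketch of that setup; the 4,000-partition limit and clustering suggest BigQuery, so it uses the google-cloud-bigquery Python client, and the project, dataset, table, and column names are placeholders, not anything from the question.

from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("payload", "STRING"),
]

table = bigquery.Table("my-project.my_dataset.events", schema=schema)

# Partition by day on the DATE column: ~365 partitions per year,
# comfortably below the 4,000-partition limit for roughly a decade.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
)

# Cluster within each partition on the high-cardinality column.
table.clustering_fields = ["customer_id"]

table = client.create_table(table)
print("Created", table.full_table_id)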
I need to profile data coming from Snowflake in Databricks. The data is just a sample of 100 rows but contains 3k+ columns, and it will eventually have more rows. When I reduce the number of columns, the profiling is done very fast, but the more columns there are, the longer it takes. I tried profiling the sample and after more than 10 hours I had to cancel the job.
Here is the code I use:
import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

df_sf = spark.read.format('snowflake').options(**sfOptions).option('query', f'select * from {db_name}').load()
df_ge = ge.dataset.SparkDFDataset(df_sf)
BasicDatasetProfiler.profile(df_ge)
You can test this with any data having a lot of columns. Is this normal or am I doing something wrong?
Basically, GE computes metrics for each column individually, so it triggers a Spark action (probably a collect) for each column and each metric it computes. Collects are among the most expensive operations you can run in Spark, so it is almost expected that the more columns you have, the longer it takes.
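If it helps to see that explanation turned into a workaround, here is a sketch of my own (reusing the names from the question's snippet, not something Great Expectations does for you): because the sample is only 100 rows, caching it keeps each per-column action from going back to Snowflake, and profiling the columns in batches bounds the size of each job. The batch size is an arbitrary assumption.

# Cache the tiny sample once so every per-column collect reads from memory.
df_sf = df_sf.cache()
df_sf.count()  # materialize the cache

batch_size = 200  # hypothetical; tune to your cluster
columns = df_sf.columns
for i in range(0, len(columns), batch_size):
    batch = columns[i:i + batch_size]
    df_ge = ge.dataset.SparkDFDataset(df_sf.select(*batch))
    BasicDatasetProfiler.profile(df_ge)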
I have a couple of different tables in my report. For demonstration purposes, let's say I have one data source that is actual invoice amounts and another table that is forecasted amounts. Each table has several dimensions that are the same between them, let's say Country, Region, Product Classification and Product.
What I want is to be able to display a table/matrix that pulls information from both of these data sources, like this:
Description               Invoice  Forecast  vs Forecast
USA                           300       325          92%
  East                        150       175          86%
    Product Grouping 1        125       125         100%
      Product 1                50        75          67%
      Product 2                75        50         150%
    Product Grouping 3         25        50          50%
      Product 3                25        50          50%
  West                        150       150         100%
    Product Grouping 1         75       100          75%
      Product 1                25        50          50%
      Product 2                50        50         100%
    Product Grouping 3         75        50         150%
      Product 3                75        50         150%
I have not been able to figure out a way to combine the information from the multiple data sources into a single matrix table, so any help would be appreciated. The one thing that I did find was someone who hard-coded the structure of the rows into a separate data source and then used DAX expressions to pull the pieces of information into the columns, but I don't like this solution because the structure of the rows is not constant.
What you're asking about is a common part of the star schema: combining facts from different fact tables together into a single visual or report.
What Not To Do (That You Might Be Tempted To)
What you don't want to do is combine the 2 fact tables into a single table in your Power BI data model. That's a lot of work and there's absolutely no need, especially since there are likely dimensions that the 2 fact tables do not have in common (e.g. actual amounts might be associated with a customer dimension, but forecast amounts wouldn't be).
What you also don't want to do is relate the 2 fact tables to each other in any way. Again, that's a lot of work. (Especially since there's no natural way to relate them at the row level.)
What To Do
Generally, how you handle 2 fact tables is the same as you handle a single fact table. First, you have your dimensions (country, region, classification, product, date, customer). Then you load your fact tables, and join them to the dimensions. You do not join your fact tables to each other. You then create measures (i.e. DAX expressions).
When you want to combine measures from the two facts together in a single matrix, you only use rows/columns that are meaningful to both fact tables. For example, actual amounts might be associated with a customer, but forecast amounts aren't. So you can't include customer information in the matrix. Another possibility is that actual amounts are recorded each day, whereas forecasts were done for the whole month. In this situation, you could put month in your matrix (since that's meaningful to both), but you wouldn't want to use date because Power BI wouldn't know how to divide up forecasts to individual dates.
As long as you're only using dimensions & attributes that are meaningful to both fact tables, you can easily create a matrix as you envision above. Simply drag on the attributes you want, then add the measures (i.e. DAX expressions).
The Invoice & Forecast columns would both be measures. The two measures from the different fact tables can then be combined into a third measure for the vs Forecast column. Everything will work as long as you're only using dimensions/attributes that mean something to both fact tables.
I don't see anything in your proposed pivot table that strikes me as problematic.
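To make the measure logic concrete, here is a small pandas sketch; in Power BI itself these would be DAX measures over the shared dimensions, and the numbers and names below are just lifted from the example matrix. Each fact is aggregated to a grain both tables share, and the vs Forecast figure is simply one measure divided by the other.

import pandas as pd

# Hypothetical mini fact tables that share the region/product dimensions.
actuals = pd.DataFrame({
    "region":  ["East", "East", "West"],
    "product": ["Product 1", "Product 2", "Product 1"],
    "invoice": [50, 75, 25],
})
forecast = pd.DataFrame({
    "region":  ["East", "East", "West"],
    "product": ["Product 1", "Product 2", "Product 1"],
    "forecast": [75, 50, 50],
})

# Aggregate each fact to the shared grain, then combine the measures.
grain = ["region", "product"]
combined = (
    actuals.groupby(grain)["invoice"].sum().to_frame()
    .join(forecast.groupby(grain)["forecast"].sum())
)
combined["vs_forecast"] = combined["invoice"] / combined["forecast"]
print(combined)

The key point is that the two facts never get joined row to row; they only meet at the grain they both share.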
Other Situations
If you have a situation where forecasts are at a month level and actual is at a date level, then you may be wondering how you'd relate them both to the same date dimension. This situation is called having different granularities, and there's a good article here I'd recommend reading that has advice: https://www.daxpatterns.com/handling-different-granularities/. Indeed, there's a whole section on comparing budget with revenue that you might find useful.
Finally, you mention that someone hard-coded the structure of the rows and used DAX expressions to build everything. This does, admittedly, sound like overkill. The goal with Power BI is flexibility. Once you have your facts, measures & dimensions, you can combine them in any way that makes sense. Hard-coding the rows eliminates that flexibility, and is a good clue that something isn't right. (Another good clue that something isn't right is when DAX expressions seem really complicated for something that should be easy.)
I hope my answer helps. It's a general answer since your question is general. If you have specific questions about your specific situation, definitely post additional questions. (Sample data, a description of the model, the problem you're seeing, and what you want to see is helpful to get a good answer.)
If you're brand new to Power BI, data models, and the star schema, Alberto Ferrari and Marco Russo have an excellent book that I'd recommend reading to get a crash course: https://www.sqlbi.com/books/analyzing-data-with-microsoft-power-bi-and-power-pivot-for-excel/
I want to import a 500 GB dataset into Power BI, but Power BI is limited to 1 GB. How can I get the data into Power BI?
Thanks.
For 500 GB I'd definitely recommend Direct Query mode (as Joe recommends) or a live connection to an SSAS cube. In these scenarios, the data model is hosted in a separate location (such as a database server) and Power BI sends its queries to that location and displays the returned results.
However, I'll add that the 1GB limit is the limit after compression. (Meaning you can fit more than 1GB of uncompressed data into the advertised 1GB dataset limit.)
While it would be incredibly difficult to reduce a 500GB dataset to 1GB (even with compression), there are things you can do once you understand how the compression works in Power BI.
In Power BI, compression is done by columns, not rows. So a column that has 800 million rows with identical values can see significant compression. Conversely, a column with a different value in every row cannot be compressed much at all.
Therefore:
Do not import columns you do not absolutely need for analysis (particularly identity columns, GUIDs, free-form text fields, or binary data such as images)
Look at columns with a high degree of variability and see if you can also eliminate them.
Reduce the variability of a column where possible. E.g. if you only need a date & not a time, do not import the time. If you only need the whole number, do not import 7 decimal places. (A small sketch of this kind of trimming follows this list.)
Bring in fewer rows. If you cannot eliminate high-variability columns, then importing 1 year of data instead of 17 (for example) will also reduce the data model size.
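As referenced above, here is a rough sketch of that kind of cardinality trimming done before import. It is written in pandas purely for illustration (in practice you would do the same steps in Power Query or in the source query), and every file and column name below is made up.

import pandas as pd

df = pd.read_csv("sales_extract.csv")  # hypothetical extract

# Drop high-cardinality columns that aren't needed for analysis.
df = df.drop(columns=["RowGuid", "FreeTextNote"])

# Keep only the date part; the time component multiplies distinct values.
df["OrderDate"] = pd.to_datetime(df["OrderDateTime"]).dt.date
df = df.drop(columns=["OrderDateTime"])

# Fewer distinct numeric values compress far better in a columnar store.
df["Amount"] = df["Amount"].round(2)

df.to_csv("sales_trimmed.csv", index=False)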
Marco Russo & the SQLBI team have a number of good resources for further optimizing the size of a data model (SSAS tabular, Power Pivot & Power BI all use the same underlying modelling engine). For example: Optimizing Multi-Billion Row Tables in Tabular
If possible given your source data, you could use Direct Query mode. The 1 GB limit does not apply to Direct Query. There are some limitations to Direct Query mode, so check the documentation to make sure that it will meet your needs.
Some documentation can be found here.
1) Aggregate the data on the SQL side to reduce the size.
2) Import only the useful columns, which also reduces the size.
I'm having huge performance issues with a SAS DI job that I need to get up and running. Therefore I'm looking for clever ways to optimize the job.
One thing in particular that I thought of is that I should perhaps permute the order of some joins and an append. Currently, my job is configured as follows:
There are several similarly structured source tables. I first apply a date filter to each (to reduce the number of rows) and sort them on two fields, say a and b. Then I left join each table to an account table on the same fields a and b (I'd like to create indexes for these if possible, but I don't know how to do that for temporary work tables in SAS DI). After each of these joins is complete, I append the resulting tables into one dataset.
It occurs to me that I could first append and then do just one join, but I have no notion of which approach is faster, or, if the answer is "it depends", what it depends on (though I'd guess it depends on the size of the constituent tables).
So, is it better to do many joins then append, or to append then do one join?
EDIT
Here is an update with some relevant information (requested by user Robert Penridge).
The number of source tables here is 7, and the size of these tables ranges from 1,500 to 5.2 million rows; 10,000 is typical. The number of columns is 25. These tables are each being joined with the same table, which has about 5,000 rows and 8 columns.
I estimate that the unique key partitions the tables into subsets of roughly equal size; the size reduction here should be between 8% and 30% (the difference is due to the fact that some of the source tables carry much more historical data than others, adding to the percentage of the table grouped into the same number of groups).
I have limited the number of columns to the exact minimum amount required (21).
By default SAS DI creates all temporary datasets as views, and I have not changed that.
The code for the append and joins are auto-generated by SAS DI after constructing them with GUI elements.
The final dataset is not sorted; my reason for sorting the data which feeds the joins is that the section of this link on join performance (page 35) mentions that it should improve performance.
As I mentioned, I'm not sure if one can put indexes on temporary work tables or views in SAS DI.
I cannot say whether the widths of the fields are larger than absolutely necessary, but if so, I doubt it is egregious. I hesitate to change this since it would have to be done manually, on several tables, and when new data comes in it might need that extra column width.
Much gratitude
Performance in SAS is mainly about reducing IO (ie. reading/writing to the disk).
Without additional details it's difficult to help but some additional things you can consider are:
limit the columns you are processing by using a keep statement (reduces IO)
if the steps performing the joins are IO intensive, consider using views rather than creating temporary tables
if the joins are still time consuming, consider replacing them with hash table lookups
make sure you are using proc append to append the 2 datasets together to reduce the IO. Append the smaller dataset to the larger dataset.
consider not sorting the final dataset but placing an index on it for consumers of the data.
ensure you are using some type of dataset compression, or ensure your column widths are set appropriately for all columns (ie. you don't have a width of 200 on a field that uses a width of 8)
reduce the number of rows as early in the process as possible (you are already doing this, just listing it here for completeness)
Adjusting the order of left-joins and appends probably won't make as much difference as doing the above.
As per your comments, it seems that:
1. There are 7 input source tables
2. Join these 7 source tables to 1 table
3. Append the results
In SAS DI Studio, use a Lookup to perform the above much faster:
1. Connect the 7 input tables to a Lookup transformation (let's call them SRC 1-7).
2. The table with 5,000 records is the table the lookup is performed on, using keys a and b (let's call this LKUP-1).
3. Take the relevant columns from LKUP-1 to propagate into the TARGET tables.
This will be much faster, and you don't have to perform joins in this case. I suspect you are doing a many-to-many join, which is degrading performance in SAS DI Studio.