Power BI DAX Functions not Working Because of Star Schema Relationships - powerbi

I've been trying to create my first star schema based on Google Classroom data for a week. I put a description of the tables from my most recent attempt below. I didn't list descriptive fields not relevant to my question.
I have a table visual that shows CourseName, StudentsEnrolled (it works)
StudentsEnrolled = CALCULATE(DISTINCTCOUNT(gc_FactSubmissions[StudentID]))
I am trying to create a table visual that shows StudentName, CourseWorkTitle, PointsEarned, MarkPct.
MarkPct =
divide(sum(gc_FactSubmissions[PointsEarned]),sum(gc_DimCourseWork[MaxPoints]))
When I try to add StudentName to the visual, I end up with incorrect results (some blank student names and incorrect totals). When I try to use DAX Related(), I can only select fields in the Submissions table.
I’ve spent countless hours of Googling sites/pages like the following one and others:
https://exceleratorbi.com.au/the-optimal-shape-for-power-pivot-data/
I think the problem is the gc_DimStudents table because it contains a student record for every student that is enrolled in a gc_DimCourses. Not all students enrolled have submitted assignments, so if I limited the gc_DimStudents to only the StudentIDs in gc_FactSubmissions, then I won’t be able to get a count of StudentsEnrolled in courses.
I’m not sure how to resolve this. Should gc_DimCourses also be made into a fact table? With a gc_DimCourseStudents and a gc_DimSubmissionStudents? Then I’d have to create a surrogate key to join gc_FactSubmissions to the new gc_FactCourses? If that is true, then as I add more fact tables to my model, is it normal to have many DimAnotherStudentList in many places in a Star Schema model?
I want to keep building on this star schema because we want reports/dashboards that relate things like online marks, to attendance, to disciplinary actions, etc., etc. So I want to get the relationships correct this time.
Any help is very much appreciated.
Thanks,
JMC
gc_FactSubmissions (contains one record for every combination of the 4 ID fields, no blanks)
CourseID (many to 1 join to gc_Dimcourses.CourseID )
OwnerID (many to 1 join to gc_DimOwners.OwnerID)
CourseWorkID (many to 1 join to gc_DimCourseWork.CourseWorkID)
StudentID (many to 1 join to gc_DimStudents)
SubmissionID
SubmissionName
PointsEarned (int, default to sum)
(other descriptive fields)
gc_DimCourseWork (one CourseWorkID for each gc_FactSubmissions.CourseWorkID)
CourseWorkID (it is distinct, no blanks)
CourseWorkName
MaxPoints (int, default to sum)
(other descriptive fields)
gc_DimCourses (one CourseID for each gc_FactSubmissions.Course CourseID)
CourseID (it is distinct, no blanks)
CourseName
(other descriptive fields)
gc_DimOwners (one OwnerID for each gc_DimOwners.OwnerID)
OwnerID (it is distinct, no blanks)
OwnerName
(other descriptive fields)
gc_DimStudents (one StudentID for each gc_FactSubmissions.Course CourseID)
StudentID (distinct, no blanks)
StudentName
(other descriptive fields)

A Snowflake Schema is one where Dimensions are related to each other directly rather than via a Fact table - so no, adding another fact table to your model doesn't make it a Snowflake.
An Enrolment fact would have FKs to any Dimensions that are relevant to Enrolments - so Course, Student, probably at least 1 date and whatever other enrolment attributes there may be.
As an additional comment, while there are many incorrect ways of modelling a star schema there can also be many correct ways of modelling it: there is rarely one correct answer. For example, for your Submissions Star you could denormalise your Course data into your CourseWork Dim and possibly also include the Owner data (I assume Owner is Owner of the course?). The fewer joins there are in any query the better the performance. If another fact, such as Enrolment, needed to be related to a Course Dim (rather than to Coursework) then you'd need to consider the trade-off in performance of having fewer joins to one fact and having to maintain the course data in two different Dims (Course and Coursework).
As a star schema is denormalised there is no issue with the same data appearing in multiple tables (within reason). The most common example is a Date Dim that has date, week, month and year attributes and a Month Dim that has just month and year attributes.

Related

Join two records from same model in django queryset

Been searching the web for a couple hours now looking for a solution but nothing quite fits what I am looking for.
I have one model (simplified):
class SimpleModel(Model):
name = CharField('Name', unique=True)
date = DateField()
amount = FloatField()
I have two dates; date_one and date_two.
I would like a single queryset with a row for each name in the Model, with each row showing:
{'name': name, 'date_one': date_one, 'date_two': date_two, 'amount_one': amount_one, 'amount_two': amount_two, 'change': amount_two - amount_one}
Reason being I would like to be able to find the rank of amount_one, amount_two, and change, using sort or filters on that single queryset.
I know I could create a list of dictionaries from two separate querysets then sort on that and get the ranks from the index values ...
but perhaps nievely I feel like there should be a DB solution using one queryset that would be faster.
union seemed promising but you cannot perform some simple operations like filter after that
I think I could perhaps split name into its own Model and generate queryset with related fields, but I'd prefer not to change the schema at this stage. Also, I only have access to sqlite.
appreciate any help!
Your current model forces you to have ONE name associated with ONE date and ONE amount. Because name is unique=True, you literally cannot have two dates associated with the same name
So if you want to be able to have several dates/amounts associated with a name, there are several ways to proceed
Idea 1: If there will only be 2 dates and 2 amounts, simply add a second date field and a second amount field
Idea 2: If there can be an infinite number of days and amounts, you'll have to change your model to reflect it, by having :
A model for your names
A model for your days and amounts, with a foreign key to your names
Idea 3: You could keep the same model and simply remove the unique constraint, but that's a recipe for mistakes
Based on your choice, you'll then have several ways of querying what you need. It depends on your final model structure. The best way to go would be to create custom model methods that query the 2 dates/amount, format an array and return it

How to generate Sum (and other aggregates) in Django where aggregate depends on values from related tables

My model consists of a Portfolio, a Holding, and a Company. Each Portfolio has many Holdings, and each Holding is of a single Company (a Company may be connected to many Holdings).
Portfolio -< Holding >- Company
I'd like the Portfolio query to return the sum of the product of the number of Holdings in the Portfolio, and the value of the Company.
Simplified model:
class Portfolio(model):
some fields
class Company(model):
closing = models.DecimalField(max_digits=10, decimal_places=2)
class Holding(model):
portfolio = models.ForeignKey(Portfolio)
company = models.ForeignKey(Company)
num_shares = models.IntegerField(default=0)
I'd like to be able to query:
Portfolio.objects.some_function()
and have each row annotated with the value of the Portfolio, where the value is equal to the sum of the product of the related Company.closing, and Holding.num_shares. ie something like:
annotate(value=Sum('holding__num_shares * company__closing'))
I'd also like to obtain a summary row, which contains the sum of the values of all of a user's Portfolios, and a count of the number of holdings. ie something like:
aggregate(Sum('holding__num_shares * company__closing'), Count('holding__num_shares'))
I would like to do have a similar summary row for a single Portfolio, which would be the sum of the values of each holding, and a count of the total number of holdings in the portfolio.
I managed to get part of the way there using extra:
return self.extra(
select={
'value': 'select sum(h.num_shares * c.closing) from portfolio_holding h '
'inner join portfolio_company as c on h.company_id = c.id '
'where h.portfolio_id = portfolio_portfolio.id'
}).annotate(Count('holding'))
but this is pretty ugly, and extra seems to be frowned upon, for obvious reasons.
My question is: is there a more Djangoistic way to summarise and annotate queries based on multiple fields, and across related tables?
These two options seem to move in the right direction:
Portfolio.objects.annotate(Sum('holding__company__closing'))
(ie this demonstrates annotation/aggregation over a field in a related table)
Holding.objects.annotate(Sum('id', field='num_shares * id'))
(this demonstrates annotation/aggregation over the product of two fields)
but if I attempt to combine them: eg
Portfolio.objects.annotate(Sum('id', field='holding__company__closing * holding__num_shares'))
I get an error: "No such column 'holding__company__closing'.
So far I've looked at the following related questions, but none of them seem to capture this precise problem:
Annotating django QuerySet with values from related table
Product of two fields annotation
Do I just need to bite the bullet and use raw / extra? I'm hoping that Django ORM will prove the exception to the rule that ORMs really only work as designed for simple queries / models, and anything beyond the most basic ones require either seriously gnarly tap-dancing, or stepping out of the abstraction, which somewhat defeats the purpose...
Thanks in advance!

optimal django manytomany query

I'm having trouble reducing the number of queries for a particular view. It's a fairly heavy one but I'm sure it can be reduced:
Profile:
name = CharField()
Officers:
club= ManyToManyField(Club, related_name='officers')
title= CharField()
Club:
name = CharField()
members = ManyToManyField(Profile)
Election:
club = ForeignKey(Club)
elected = ForeignKey(Profile)
title= CharField()
when = DateTimeField()
Clubs have members and officers (president, tournament director). People can be members of multiple clubs etc...
Officers are elected at elections, the results of which are stored.
Given a player how can I find out the most recently elected officer at each of the players clubs?
At the moment I have
clubs = Club.objects.filter(members=me).prefetch_related('officers')
for c in clubs:
officers = c.officers.all()
most_recent = Elections.objects.filter(club=c).filter(elected__in=officers).order_by('-when')[:1].get()
print(c.name + ' elected ' + most_recent.name + ' most recently')
Problem is the looped query, it's nice and fast if you're a member of 1 club but if you join fifty my database crawls.
Edit:
The answer from Nil does what I want but doesn't get the object. I don't really need the object but I do need another field as well as the datetime. If it's helpful the query:
Club.objects.annotate(last_election=Max('election__when'))
produces the raw SQL
SELECT "organisation_club"."id", "organisation_club"."name", MAX("organisation_election"."when") AS "last_election"
FROM "organisation_club"
LEFT OUTER JOIN "organisation_election" ON ( "organisation_club"."id" = "organisation_election"."club_id" )
GROUP BY "organisation_club"."id", "organisation_club"."name"
I'd really like an ORM answer if at all possible (or a 'mostly' ORM answer).
I believe this is what you're looking for:
from django.db.models import Max, F
Election.objects.filter(club__members=me) \
.annotate(max_date=Max('club__election_set__when')) \
.filter(when=F('max_date')).select_related('elected')
Relations can be followed forwards and backwards again in a single statement, allowing you to annotate the max_date for any election related to the club of the current election. The F class allows you to filter a queryset based on selected fields in SQL, including any extra fields added through annotation, aggregation, joins etc.
What you want is defined here in SQL term: query the Election table, group them by Club and keep only the last election of each club.
Now, how can we translate that in Django ORM? Looking at the documentation, we learn that we can do it with an annotation. The trick is that you need to think in reverse. You want to annotate (add a new data) each club with its last election. This gives us:
Club.objects.annotate(last_election=Max('election__when'))
# Use it in a for loop like that
for club in Club.objects.annotate(last_election=Max('election__when')):
print(club, club.last_election)
Sadly, this only adds the date, which doesn't answer your question! You want the name or the complete Club object. I searched and I still don't know how to do it properly. If everything fails though, you can still use a raw SQL query in Django using a query like in the first link.
The simplest way I can think of is filtering partially at the application level
If you do
e = Election.objects.filter(club__members=me).select_related('elected')
or
e = me.club_set.election_set.select_related('elected')
This is a single query and it should get back all the elections that happened for the all the clubs that the member me is in. Then you can use python to just get the most recent date. Of course, if you have many elections per club, you end up fetching much more data than will be used.
Another way which should do it in two queries:
# Get all member's clubs & most recent election
clubs = Club.objects.filter(members=me).annotate(last_election=Max('election__when'))
# Create filters for election based on the club id and the latest election time
election_Q = [Q(club__id=c.id) & Q(when=c.last_election) for c in clubs]
# Combine filters with an OR
election_filter = reduce(lambda f1, f2: f1 | f2, election_Q)
# Get elections restricting by specific clubs & election date
elections = Election.objects.filter(election_filter).select_related('elected')
for e in elections:
print '%s elected %s most recently at %s' % (e.club.name, e.elected, e.when)
This builds upon #Nil's method and uses its result to build a query in python, then feeds it into the second query. However, there is a limit with the size of a SQL statement and if there are a lot of clubs that a member is in, then you may hit the limit. The limit is fairly high though and I've only ever reached it when importing large datasets in a single INSERT statement so I think it should be fine for your purpose.
Sorry I cannot think of a way that the Django ORM can link them together using a single SQL query. The Django ORM is actually quite limited for complex queries so if you really need the efficiency I think it's probably best to write the raw SQL query.

Select no fields from table in Korma

I'm trying to do a join across a number of tables (three plus a join table in the middle). I think korma is lazily evaluating the last join. What I'm trying to do is to add a condition that restricts the results of the first table in the join, but I'm only interested in fields from the last table in the join.
So for example say I've got clubs, people and hobbies tables, and a people-to-hobbies join table for the last two. Each club can have many people, and each person can have many hobbies.
I'm trying to get the full details of all the hobbies of people who belong to a specific club, but I don't want any fields from the club table. The join table means that korma will create two queries, one to get all the people that are in a specific club, and another to retrieve the hobbies for that person via the people-to-hobbies join table.
My korma query looks something like this:
(select clubs
(with people
(with hobbies
(fields :hobby-name :id)
(where {:clubs.name "korma coders"}))))
The problem is that I haven't specified which fields I want from clubs and people, and the default is to select *. How can I include no fields from these tables? Is this possible, or does the fact that the hobbies are lazily loaded mean korma has to return some results in the first query (which gets me a filtered list of people), so that when I come to interrogate it later for the hobbies, it has the ids it needs to run the second query?
I'd use join macro in such case:
(select clubs
(join people)
(join people-to-hobbies)
(join hobbies)
(fields :hobbies.hobby-name :hobbies.id)
(where {:clubs.name "korma coders"}))))
It's a bit more explicit, but as an additional benefit it's run with a single query.

Grouping Custom Attributes in a Query

I have an application that allows for "contacts" to be made completely customized. My method of doing that is letting the administrator setup all of the fields allowed for the contact. My database is as follows:
Contacts
id
active
lastactive
created_on
Fields
id
label
FieldValues
id
fieldid
contactid
response
So the contact table only tells whether they are active and their identifier; the fields tables only holds the label of the field and identifier, and the fieldvalues table is what actually holds the data for contacts (name, address, etc.)
So this setup has worked just fine for me up until now. The client would like to be able to pull a cumulative report, but say state of all the contacts in a certain city. Effectively the data would have to look like the following
California (from fields table)
Costa Mesa - (from fields table) 5 - (counted in fieldvalues table)
Newport 2
Connecticut
Wallingford 2
Clinton 2
Berlin 5
The state field might be id 6 and the city field might be id 4. I don't know if I have just been looking at this code way to long to figure it out or what,
The SQL to create those three tables can be found at https://s3.amazonaws.com/davejlong/Contact.sql
You've got an Entity Attribute Value (EAV) model. Use the field and fieldvalue tables for searching only - the WHERE caluse. Then make life easier by keeping the full entity's data in a CLOB off the main table (e.g. Contacts.data) in a serialized format (WDDX is good for this). Read the data column out, deserialize, and work with on the server side. This is much easier than the myriad of joins you'd need to do otherwise to reproduce the fully hydrated entity from an EAV setup.