I'm trying to do a join across a number of tables (three plus a join table in the middle). I think korma is lazily evaluating the last join. What I'm trying to do is to add a condition that restricts the results of the first table in the join, but I'm only interested in fields from the last table in the join.
So for example say I've got clubs, people and hobbies tables, and a people-to-hobbies join table for the last two. Each club can have many people, and each person can have many hobbies.
I'm trying to get the full details of all the hobbies of people who belong to a specific club, but I don't want any fields from the club table. The join table means that korma will create two queries, one to get all the people that are in a specific club, and another to retrieve the hobbies for that person via the people-to-hobbies join table.
My korma query looks something like this:
(select clubs
(with people
(with hobbies
(fields :hobby-name :id)
(where {:clubs.name "korma coders"}))))
The problem is that I haven't specified which fields I want from clubs and people, and the default is to select *. How can I include no fields from these tables? Is this possible, or does the fact that the hobbies are lazily loaded mean korma has to return some results in the first query (which gets me a filtered list of people), so that when I come to interrogate it later for the hobbies, it has the ids it needs to run the second query?
I'd use join macro in such case:
(select clubs
(join people)
(join people-to-hobbies)
(join hobbies)
(fields :hobbies.hobby-name :hobbies.id)
(where {:clubs.name "korma coders"}))))
It's a bit more explicit, but as an additional benefit it's run with a single query.
Related
I've been trying to create my first star schema based on Google Classroom data for a week. I put a description of the tables from my most recent attempt below. I didn't list descriptive fields not relevant to my question.
I have a table visual that shows CourseName, StudentsEnrolled (it works)
StudentsEnrolled = CALCULATE(DISTINCTCOUNT(gc_FactSubmissions[StudentID]))
I am trying to create a table visual that shows StudentName, CourseWorkTitle, PointsEarned, MarkPct.
MarkPct =
divide(sum(gc_FactSubmissions[PointsEarned]),sum(gc_DimCourseWork[MaxPoints]))
When I try to add StudentName to the visual, I end up with incorrect results (some blank student names and incorrect totals). When I try to use DAX Related(), I can only select fields in the Submissions table.
I’ve spent countless hours of Googling sites/pages like the following one and others:
https://exceleratorbi.com.au/the-optimal-shape-for-power-pivot-data/
I think the problem is the gc_DimStudents table because it contains a student record for every student that is enrolled in a gc_DimCourses. Not all students enrolled have submitted assignments, so if I limited the gc_DimStudents to only the StudentIDs in gc_FactSubmissions, then I won’t be able to get a count of StudentsEnrolled in courses.
I’m not sure how to resolve this. Should gc_DimCourses also be made into a fact table? With a gc_DimCourseStudents and a gc_DimSubmissionStudents? Then I’d have to create a surrogate key to join gc_FactSubmissions to the new gc_FactCourses? If that is true, then as I add more fact tables to my model, is it normal to have many DimAnotherStudentList in many places in a Star Schema model?
I want to keep building on this star schema because we want reports/dashboards that relate things like online marks, to attendance, to disciplinary actions, etc., etc. So I want to get the relationships correct this time.
Any help is very much appreciated.
Thanks,
JMC
gc_FactSubmissions (contains one record for every combination of the 4 ID fields, no blanks)
CourseID (many to 1 join to gc_Dimcourses.CourseID )
OwnerID (many to 1 join to gc_DimOwners.OwnerID)
CourseWorkID (many to 1 join to gc_DimCourseWork.CourseWorkID)
StudentID (many to 1 join to gc_DimStudents)
SubmissionID
SubmissionName
PointsEarned (int, default to sum)
(other descriptive fields)
gc_DimCourseWork (one CourseWorkID for each gc_FactSubmissions.CourseWorkID)
CourseWorkID (it is distinct, no blanks)
CourseWorkName
MaxPoints (int, default to sum)
(other descriptive fields)
gc_DimCourses (one CourseID for each gc_FactSubmissions.Course CourseID)
CourseID (it is distinct, no blanks)
CourseName
(other descriptive fields)
gc_DimOwners (one OwnerID for each gc_DimOwners.OwnerID)
OwnerID (it is distinct, no blanks)
OwnerName
(other descriptive fields)
gc_DimStudents (one StudentID for each gc_FactSubmissions.Course CourseID)
StudentID (distinct, no blanks)
StudentName
(other descriptive fields)
A Snowflake Schema is one where Dimensions are related to each other directly rather than via a Fact table - so no, adding another fact table to your model doesn't make it a Snowflake.
An Enrolment fact would have FKs to any Dimensions that are relevant to Enrolments - so Course, Student, probably at least 1 date and whatever other enrolment attributes there may be.
As an additional comment, while there are many incorrect ways of modelling a star schema there can also be many correct ways of modelling it: there is rarely one correct answer. For example, for your Submissions Star you could denormalise your Course data into your CourseWork Dim and possibly also include the Owner data (I assume Owner is Owner of the course?). The fewer joins there are in any query the better the performance. If another fact, such as Enrolment, needed to be related to a Course Dim (rather than to Coursework) then you'd need to consider the trade-off in performance of having fewer joins to one fact and having to maintain the course data in two different Dims (Course and Coursework).
As a star schema is denormalised there is no issue with the same data appearing in multiple tables (within reason). The most common example is a Date Dim that has date, week, month and year attributes and a Month Dim that has just month and year attributes.
If we have 2 models A, B with a many to many relation.
I want to obtain a sql query similar to this:
SELECT *
FROM a LEFT JOIN ab_relation
ON ab_relation.a_id = a.id
JOIN b ON ab_relation.b_id = b.id;
So in django when I try:
A.objects.prefetch_related('bees')
I get 2 queries similar to:
SELECT * FROM a;
SELECT ab_relation.a_id AS prefetch_related_val_a_id, b.*
FROM b JOIN ab_relation ON b.id = ab_relation.b_id
WHERE ab_relation.a_id IN (123, 456... list of all a.id);
Given that A and B have moderately big tables, I find the way django does it too slow for my needs.
The question is: Is it possible to obtain the left join manually written query through the ORM?
Edits to answer some clarifications:
Yes a LEFT OUTER JOIN would be preferable to get all A's in the queryset, not only those with a relation with B (updated sql).
Moderately big means ~4k rows each, and too slow means ~3 seconds (on first load, before redis cache.) Keep in mind there are other queries on the page.
Actually yes, we need only B.one_field but having tried with Prefetch('bees', queryset=B.objects.values('one_field')) an error said you can't use values in a prefetch.
The queryset will be used as options for a multi-select form-field, where we will need to represent A objects that have a relation with B with an extra string from the B.field.
For the direct answer skip to point 6)
Let'ts talk step by step.
1) N:M select. You say you want a query like this:
SELECT *
FROM a JOIN ab_relation ON ab_relation.a_id = a.id
JOIN b ON ab_relation.b_id = b.id;
But this is not a real N:M query, because you are getting only A-B related objects The query should use outer joins. At least like:
SELECT *
FROM a left outer JOIN
ab_relation ON ab_relation.a_id = a.id left outer JOIN
b ON ab_relation.b_id = b.id;
In other cases you are getting only A models with a related B.
2) Read big tables You say "moderately big tables". Then, are you sure you want to read the whole table from database? This is not usual on a web environment to read a lot of data, and, in this case, you can paginate data. May be is not a web app? Why you need to read this big tables? We need context to answer your question. Are you sure you need all fields from both tables?
3) Select * from Are you sure you need all fields from both tables? May be if you read only some values this query will run faster.
A.objects.values( "some_a_field", "anoter_a_field", "Bs__some_b_field" )
4) As summary. ORM is a powerful tool, two single read operations are "fast". I write some ideas but perhaps we need more context to answer your question. What means moderate big tables, wheat means slow, what are you doing with this data, how many fields or bytes has each row from each table, ... .
Editedd Because OP has edited the question.
5) Use right UI controls. You say:
The queryset will be used as options for a multi-select form-field, where we will need to represent A objects that have a relation with B with an extra string from the B.field.
It looks like an anti-pattern to send to client 4k rows for a form. I suggest to you to move to a live control that loads only needed data. For example, filtering by some text. Take a look to django-select2 awesome project.
6) You say
The question is: Is it possible to obtain the left join manually written query through the ORM?
The answer is: Yes, you can do it using values, as I said it on point 3. Sample: Material and ResultatAprenentatge is a N:M relation:
>>> print( Material
.objects
.values( "titol", "resultats_aprenentatge__codi" )
.query )
The query:
SELECT "material_material"."titol",
"ufs_resultataprenentatge"."codi"
FROM "material_material"
LEFT OUTER JOIN "material_material_resultats_aprenentatge"
ON ( "material_material"."id" =
"material_material_resultats_aprenentatge"."material_id" )
LEFT OUTER JOIN "ufs_resultataprenentatge"
ON (
"material_material_resultats_aprenentatge"."resultataprenentatge_id" =
"ufs_resultataprenentatge"."id" )
ORDER BY "material_material"."data_edicio" DESC
I'm having trouble reducing the number of queries for a particular view. It's a fairly heavy one but I'm sure it can be reduced:
Profile:
name = CharField()
Officers:
club= ManyToManyField(Club, related_name='officers')
title= CharField()
Club:
name = CharField()
members = ManyToManyField(Profile)
Election:
club = ForeignKey(Club)
elected = ForeignKey(Profile)
title= CharField()
when = DateTimeField()
Clubs have members and officers (president, tournament director). People can be members of multiple clubs etc...
Officers are elected at elections, the results of which are stored.
Given a player how can I find out the most recently elected officer at each of the players clubs?
At the moment I have
clubs = Club.objects.filter(members=me).prefetch_related('officers')
for c in clubs:
officers = c.officers.all()
most_recent = Elections.objects.filter(club=c).filter(elected__in=officers).order_by('-when')[:1].get()
print(c.name + ' elected ' + most_recent.name + ' most recently')
Problem is the looped query, it's nice and fast if you're a member of 1 club but if you join fifty my database crawls.
Edit:
The answer from Nil does what I want but doesn't get the object. I don't really need the object but I do need another field as well as the datetime. If it's helpful the query:
Club.objects.annotate(last_election=Max('election__when'))
produces the raw SQL
SELECT "organisation_club"."id", "organisation_club"."name", MAX("organisation_election"."when") AS "last_election"
FROM "organisation_club"
LEFT OUTER JOIN "organisation_election" ON ( "organisation_club"."id" = "organisation_election"."club_id" )
GROUP BY "organisation_club"."id", "organisation_club"."name"
I'd really like an ORM answer if at all possible (or a 'mostly' ORM answer).
I believe this is what you're looking for:
from django.db.models import Max, F
Election.objects.filter(club__members=me) \
.annotate(max_date=Max('club__election_set__when')) \
.filter(when=F('max_date')).select_related('elected')
Relations can be followed forwards and backwards again in a single statement, allowing you to annotate the max_date for any election related to the club of the current election. The F class allows you to filter a queryset based on selected fields in SQL, including any extra fields added through annotation, aggregation, joins etc.
What you want is defined here in SQL term: query the Election table, group them by Club and keep only the last election of each club.
Now, how can we translate that in Django ORM? Looking at the documentation, we learn that we can do it with an annotation. The trick is that you need to think in reverse. You want to annotate (add a new data) each club with its last election. This gives us:
Club.objects.annotate(last_election=Max('election__when'))
# Use it in a for loop like that
for club in Club.objects.annotate(last_election=Max('election__when')):
print(club, club.last_election)
Sadly, this only adds the date, which doesn't answer your question! You want the name or the complete Club object. I searched and I still don't know how to do it properly. If everything fails though, you can still use a raw SQL query in Django using a query like in the first link.
The simplest way I can think of is filtering partially at the application level
If you do
e = Election.objects.filter(club__members=me).select_related('elected')
or
e = me.club_set.election_set.select_related('elected')
This is a single query and it should get back all the elections that happened for the all the clubs that the member me is in. Then you can use python to just get the most recent date. Of course, if you have many elections per club, you end up fetching much more data than will be used.
Another way which should do it in two queries:
# Get all member's clubs & most recent election
clubs = Club.objects.filter(members=me).annotate(last_election=Max('election__when'))
# Create filters for election based on the club id and the latest election time
election_Q = [Q(club__id=c.id) & Q(when=c.last_election) for c in clubs]
# Combine filters with an OR
election_filter = reduce(lambda f1, f2: f1 | f2, election_Q)
# Get elections restricting by specific clubs & election date
elections = Election.objects.filter(election_filter).select_related('elected')
for e in elections:
print '%s elected %s most recently at %s' % (e.club.name, e.elected, e.when)
This builds upon #Nil's method and uses its result to build a query in python, then feeds it into the second query. However, there is a limit with the size of a SQL statement and if there are a lot of clubs that a member is in, then you may hit the limit. The limit is fairly high though and I've only ever reached it when importing large datasets in a single INSERT statement so I think it should be fine for your purpose.
Sorry I cannot think of a way that the Django ORM can link them together using a single SQL query. The Django ORM is actually quite limited for complex queries so if you really need the efficiency I think it's probably best to write the raw SQL query.
Good morning. I have been looking all over trying to answer this question.
If you have a table that has foreign keys to another table, and you want results from both tables, using basic sql you would do an inner join on the foreign key and you would get all the resulting information that you requested. When you generate your JPA entities on your foreign keys you get a #oneToone annotation, #oneToMany, #ManyToMany, #ManyToOne, etc over your foreign key columns. I have #oneToMany over the foreign keys and a corresponding #ManyToOne over the primary key in the related table column I also have a #joinedON annotation over the correct column... I also have a basic named query that will select everything from the first table. Will I need to do a join to get the information from both tables like I would need to do in basic sql? Or will the fact that I have those annotations pull those records back for me? To be clear if I have table A which is related to Table B based on a foreign key relationship and I want the records from both tables I would join table A to B based on the foreign key or
Select * From A inner Join B on A.column2 = B.column1
Or other some-such non-sense (Pardon my sql if it is not exactly correct, but you get the idea)...
That query would have selected all column froms A and B where those two selected column...
Here is my named query that I am using....
#NamedQuery(name="getQuickLaunch", query = "SELECT q FROM QuickLaunch q")
This is how I am calling that in my stateless session bean...
try
{
System.out.println("testing 1..2..3");
listQL = emf.createNamedQuery("getQuickLaunch").getResultList();
System.out.println("What is the size of this list: number "+listQL.size());
qLaunchArr = listQL.toArray(new QuickLaunch[listQL.size()]);
}
Now that call returns all the columns of table A, but it lack's the column's of table B. My first instinct would be to change the query to join the two tables... But that kind of makes me think what is the point of using JPA then if I am just writing the same queries that I would be writing anyway, just in a different place. Plus, I don't want to overlook something simple. So what say you stack overflow enthusiasts? How does one get back all the data of joined query using JPA?
Suppose you have a Person entity with a OneToMany association to the Contact entity.
When you get a Person from the entityManager, calling any method on its collection of contacts will lazily load the list of contacts of that person:
person.getContacts().size();
// triggers a query select * from contact c where c.personId = ?
If you want to use a single query to load a person and all its contacts, you need a fetch in the SQL query:
select p from Person p
left join fetch p.contacts
where ...
You can also mark the association itself as eager-loaded, using #OneToMany(lazy = false), but then every time a person is loaded (vie em.find() or any query), its contacts will also be loaded.
A relevant image of my model is here: http://i.stack.imgur.com/xzsVU.png
I need to make a queryset that contains all cats who have an associated person with a role of "owner" and a name of "bob".
The sql for this would be shown below.
select * from cat where exists
(select 1 from person inner join role where
person.name="bob" and role.name="owner");
This problem can be solved in two sql queries with the following django filters.
people = Person.objects.filter(name="bob", role__name="owner")
ids = [p.id for p in people]
cats = Cat.objects.filter(id__in=ids)
My actual setup is more complex than this and is dealing with a large dataset. Is there a way to do this with one query? If it is impossible, what is the efficient alternative?
I'm pretty sure this is your query:
cats = Cat.objects.filter(person__name='bob', person__role__name='owner')
read here about look ups spanning relationships