What kind of relation is required to store a user's membership in multiple groups, so that I can recover:
the history of a user's participation in groups (date joined, date quit)
the list of a user's current groups (in join order), to determine their current status
the list of users who were participating in a given group during a given period of time
I guess it is ManyToMany (or an ugly kind of OneToMany), but I can't figure out how to use it; I need a minimal example, preferably for Django's models.
Also, which consistency problems should be expected when a group or user needs to be deleted?
members
groups
groupmembers
Groupmembers is your join table and has such things as:
member (one member has many groupmember records)
group (one group has many groupmember records)
create date
remove date (leave null until applicable)
So for your requirements (a minimal Django sketch follows the list):
query groupmembers with a group and sort by date
query groupmembers with a member (sort by create date)
query groupmembers with a group and remove is null (or inside a date range)
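A minimal Django sketch of that layout might look like this (model and field names are illustrative, not from the original post; the on_delete choices are where the deletion consistency question gets decided):

from django.db import models


class Member(models.Model):
    name = models.CharField(max_length=100)


class Group(models.Model):
    name = models.CharField(max_length=100)
    # ManyToMany through an explicit join model, so each membership
    # can carry its own dates
    members = models.ManyToManyField(Member, through="GroupMember")


class GroupMember(models.Model):
    # on_delete controls consistency when a user/group is deleted:
    # CASCADE silently drops the participation history with it,
    # PROTECT forces you to deal with the history rows first.
    member = models.ForeignKey(Member, on_delete=models.PROTECT)
    group = models.ForeignKey(Group, on_delete=models.PROTECT)
    create_date = models.DateTimeField(auto_now_add=True)
    remove_date = models.DateTimeField(null=True, blank=True)

The three queries above then map to the ORM roughly as follows (m, g, start and end are assumed to be a Member, a Group and two datetimes):

# history of m's participation (date joined, date quit)
GroupMember.objects.filter(member=m).order_by("create_date")

# m's current groups, in join order
GroupMember.objects.filter(member=m, remove_date__isnull=True).order_by("create_date")

# members of g during the period [start, end]
GroupMember.objects.filter(group=g, create_date__lte=end).filter(
    models.Q(remove_date__isnull=True) | models.Q(remove_date__gte=start)
)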
I've been trying to create my first star schema based on Google Classroom data for a week. I put a description of the tables from my most recent attempt below. I didn't list descriptive fields not relevant to my question.
I have a table visual that shows CourseName and StudentsEnrolled, and it works:
StudentsEnrolled = CALCULATE(DISTINCTCOUNT(gc_FactSubmissions[StudentID]))
I am trying to create a table visual that shows StudentName, CourseWorkTitle, PointsEarned, MarkPct.
MarkPct =
DIVIDE ( SUM ( gc_FactSubmissions[PointsEarned] ), SUM ( gc_DimCourseWork[MaxPoints] ) )
When I try to add StudentName to the visual, I end up with incorrect results (some blank student names and incorrect totals). When I try to use DAX RELATED(), I can only select fields in the Submissions table.
I've spent countless hours Googling sites/pages like the following one, among others:
https://exceleratorbi.com.au/the-optimal-shape-for-power-pivot-data/
I think the problem is the gc_DimStudents table, because it contains a student record for every student enrolled in a course in gc_DimCourses. Not all enrolled students have submitted assignments, so if I limited gc_DimStudents to only the StudentIDs in gc_FactSubmissions, I wouldn't be able to get a count of StudentsEnrolled in courses.
I'm not sure how to resolve this. Should gc_DimCourses also be made into a fact table, with a gc_DimCourseStudents and a gc_DimSubmissionStudents? Then I'd have to create a surrogate key to join gc_FactSubmissions to the new gc_FactCourses? If so, then as I add more fact tables to my model, is it normal to have many DimAnotherStudentList tables in many places in a star schema model?
I want to keep building on this star schema because we want reports/dashboards that relate things like online marks to attendance, to disciplinary actions, and so on. So I want to get the relationships correct this time.
Any help is very much appreciated.
Thanks,
JMC
gc_FactSubmissions (contains one record for every combination of the 4 ID fields, no blanks)
CourseID (many to 1 join to gc_DimCourses.CourseID)
OwnerID (many to 1 join to gc_DimOwners.OwnerID)
CourseWorkID (many to 1 join to gc_DimCourseWork.CourseWorkID)
StudentID (many to 1 join to gc_DimStudents.StudentID)
SubmissionID
SubmissionName
PointsEarned (int, default to sum)
(other descriptive fields)
gc_DimCourseWork (one CourseWorkID for each gc_FactSubmissions.CourseWorkID)
CourseWorkID (it is distinct, no blanks)
CourseWorkName
MaxPoints (int, default to sum)
(other descriptive fields)
gc_DimCourses (one CourseID for each gc_FactSubmissions.CourseID)
CourseID (it is distinct, no blanks)
CourseName
(other descriptive fields)
gc_DimOwners (one OwnerID for each gc_FactSubmissions.OwnerID)
OwnerID (it is distinct, no blanks)
OwnerName
(other descriptive fields)
gc_DimStudents (one StudentID for each gc_FactSubmissions.StudentID)
StudentID (distinct, no blanks)
StudentName
(other descriptive fields)
A Snowflake Schema is one where Dimensions are related to each other directly rather than via a Fact table - so no, adding another fact table to your model doesn't make it a Snowflake.
An Enrolment fact would have FKs to any Dimensions that are relevant to Enrolments - so Course, Student, probably at least one Date, and whatever other enrolment attributes there may be.
As an additional comment: while there are many incorrect ways of modelling a star schema, there can also be many correct ways - there is rarely one right answer. For example, for your Submissions star you could denormalise your Course data into your CourseWork Dim, and possibly also include the Owner data (I assume Owner is the owner of the course?). The fewer joins there are in any query, the better the performance. If another fact, such as Enrolment, needed to be related to a Course Dim (rather than to CourseWork), then you'd need to weigh the performance gain of fewer joins to one fact against the cost of maintaining the course data in two different Dims (Course and CourseWork).
As a star schema is denormalised there is no issue with the same data appearing in multiple tables (within reason). The most common example is a Date Dim that has date, week, month and year attributes and a Month Dim that has just month and year attributes.
I have 3 tables in my DB:
user (_id, name)
event (_id, name, ...)
events_partecipants(user_id, event_id)
I have two Doctrine entities which map those tables and their relations, and everything works (e.g. I'm able to get all the participants for a specific event).
Now I want to retrieve the number of events each user has joined. In pure SQL the query would be:
SELECT user_id, COUNT(*) as count
FROM events_partecipants
GROUP BY user_id
ORDER BY count DESC
In the result I also want to retrieve the name of each user, so that the JSON I send back identifies each user by name and not just by ID.
How can this be achieved with Doctrine? I can't find a clean way to do it.
I am using a Django backend with postgresql.
Let's say I have a database with a table called Employees with about 20,000 records.
I need to allow multiple users to edit and verify the Area Code field for every record in Employees.
I'd prefer to allow a user to view the records, say, 30 at a time (to reduce burnout).
How can I select 30 records at a time from Employees to send to the front end UI for editing, without letting multiple users edit the same records, or re-selecting a record that has already been verified?
I don't need comments on the content of the database (these are example table and field names).
One way to do this would be to add two more fields to your table, say assigned_to and verified. You can update assigned_to, which can be a foreign key to the verifying user, when you let that user view an Employee. This creates a record preventing the Employee from being chosen twice, and assigned_to can also double as a record of who verified the Employee, for future reference.
verified could simply be a Boolean field which keeps track of whether the Employee has already been verified, and can be updated when the user confirms the verification.
The actual selects can be done like this:
employees = Employee.objects.filter(assigned_to=None, verified=False)[:30]
Then
for emp in employees:
    emp.assigned_to = user
    emp.save()
Note: this can still cause a race condition if two users make the request at exactly the same time. To avoid that, one possibility is to partition the employee table into non-overlapping groups, one per user, which would ensure that no two users ever get the same employees; a locking-based alternative is sketched below.
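If partitioning isn't practical, row-level locking is another way to close that window. A hedged sketch, assuming PostgreSQL and Django 2.0+ (skip_locked needs both; claim_employees is just an illustrative name):

from django.db import transaction

def claim_employees(user, batch_size=30):
    with transaction.atomic():
        # FOR UPDATE SKIP LOCKED: concurrent callers silently skip
        # rows another transaction has already locked
        employees = list(
            Employee.objects.select_for_update(skip_locked=True)
            .filter(assigned_to=None, verified=False)[:batch_size]
        )
        for emp in employees:
            emp.assigned_to = user
            emp.save(update_fields=["assigned_to"])
    return employees

Two simultaneous calls then receive two disjoint batches of 30, because the second transaction skips the rows the first one has locked.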
I am trying to create a table to store invoice line items in DynamoDB. Let's say the item is defined by CompanyCode, InvoiceNumber and LineItemId, amount and other line item details.
A unique item is defined by the combination of the first three attributes; any two of those attributes can be the same for different items. What should I select as the Hash Attribute and the Range Attribute?
Some Intro
For efficiency I would propose a totally different design. With NoSQL databases (and DynamoDB is no different) we always need to consider the access patterns first. Also, if possible, we should strive to fit all our data into a single table and a few indexes. From what we have from the OP and his comments, these are the two access patterns:
For a company X, get complete invoice Y (including all items or a range of items) [based on this comment]
Get all invoices for company X [based on this comment]
We now wonder: what is a good Primary Key? That translates to the questions of what is a good Partition Key (PK), what is a good Sort Key (SK), and which secondary indexes we need to create and of what kind (local or global). Some reminders:
A Primary Key can be on one column or composite
A composite primary key consists of a Partition Key and a Sort Key
The Partition Key is used as input to the hashing function that determines the partition in which the item is stored
The Sort Key can also be composite, which allows us to model one-to-many relationships in DynamoDB, as given in one of the comments' links: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html
When creating a query on the table or an index, you always need to use the '=' operator on the Partition Key
When querying ranges on the Sort Key, KeyConditionExpression provides you with a set of operators for sorting and everything in between (one of them being the function begins_with(a, substr))
You are also allowed to use a FilterExpression if you need to further refine the query results (filtering on the projected attributes)
Local Secondary Indexes (LSI) have the same Partition Key but a different Sort Key than your original table, giving you a different view of your data, organized according to an alternative Sort Key
Global Secondary Indexes (GSI) have a different Partition Key and a different Sort Key than your original table, giving you a completely different view of the data
All items with the same partition key are stored together and, for composite Primary Keys, are ordered by the sort key value. DynamoDB splits partitions by sort key if the collection size grows beyond 10 GB.
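As a concrete illustration of a composite Primary Key plus a GSI, a boto3 table definition along these lines might look as follows (the generic PK/SK/GSI1 attribute names anticipate the design below; the table name is just for demonstration):

import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="InvoicesDemo",
    AttributeDefinitions=[
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
        {"AttributeName": "GSI1PK", "AttributeType": "S"},
        {"AttributeName": "GSI1SK", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "PK", "KeyType": "HASH"},   # Partition Key
        {"AttributeName": "SK", "KeyType": "RANGE"},  # Sort Key
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "GSI1",
            "KeySchema": [
                {"AttributeName": "GSI1PK", "KeyType": "HASH"},
                {"AttributeName": "GSI1SK", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)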
Back To Modeling
It is obvious that we are dealing with multiple entities that need to be modeled and fit into the same table. To satisfy the condition of the Partition Key being unique on the table, CompanyCode comes as a natural Partition Key - so I would ensure that it is unique. If it is not, then you need to ask yourself how you can model the second access pattern.
Assuming we have established the uniqueness of CompanyCode, let's simplify and say that it comes in the form of an e-mail address (it could be a domain or just a code, but I will use an email for demonstration).
Relationship between Company and Invoices is always 1:many.
Relationship between Invoice and Items is always 1:many.
I propose the design below:
With PK being CompanyCode and SK being InvoiceNumber, I can store all attributes about that invoice for that company.
Nothing prevents me from also adding a record where the SK is Customer, which allows me to store all attributes about the company itself.
With GSI1, we create a reverse lookup where GSI1PK is my table's SK (InvoiceNumber) and GSI1SK is my table's PK (CompanyCode).
I use the same table to store line items, with PK being LineItemId and SK being CompanyCode (still unique).
For Item entity records, GSI1PK is still InvoiceNumber and GSI1SK is LineItemId, which is the table's PK - the same arrangement as for the Invoice entity records.
Now the access patterns supported by this:
If I want to get invoice Y for company X and all its items (access pattern 1): query the table where CompanyCode = X and use KeyConditionExpression with the = operator on the Sort Key InvoiceNumber. If I want only the items tied to that invoice, I project the Items attribute using a ProjectionExpression.
Having retrieved all the items with the previous query for company X and invoice Y, I can then run a BatchGetItem API call (using my unique composite key LineItemId+CompanyCode) on the table to get all items belonging to that particular invoice of that particular customer (this comes with some constraints of the BatchGetItem API).
To support access pattern 2, I query with CompanyCode = X on the PK and use KeyConditionExpression on the SK with the begins_with(a, substr) function to get only the invoices for company X, and not the metadata about that company. That gives me all invoices for the given company/customer.
Additionally, with the GSI1 above, for any given InvoiceNumber I can easily select all the line items that belong to that particular invoice. Remember: the key values in a global secondary index do not need to be unique - so in GSI1 I could easily have invoice_1 -> (item_1, item_2) and then another invoice_1 -> (item_1, item_2), but the difference between the two items in the GSI would be in the SK (each associated with a different CompanyCode; for demonstration purposes I used invoice_1 and invoice_2).
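A sketch of those access patterns with boto3 (the PK/SK/GSI1PK attribute names follow the design above; the company code and invoice number values are made up):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("InvoicesDemo")

# access pattern 1: invoice Y for company X
table.query(
    KeyConditionExpression=Key("PK").eq("company@x.com") & Key("SK").eq("invoice_1")
)

# access pattern 2: all invoices for company X, skipping the Customer
# metadata record (assumes invoice sort keys share an "invoice_" prefix)
table.query(
    KeyConditionExpression=Key("PK").eq("company@x.com")
    & Key("SK").begins_with("invoice_")
)

# reverse lookup on GSI1: all line items belonging to invoice_1
table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("invoice_1"),
)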
I believe the first option offered by @georgeaf99 won't work, because if you do it that way then CompanyCode has to be unique in the table, and there would therefore only be one item allowed per company. I think the second solution is the only real way to do it.
You can use CompanyCode as the Hash Key, and then all other fields that combine to make the item unique (in this case InvoiceNumber and LineItemId) need to be combined into one value (such as a concatenation with a field delimiter), which becomes your Range Key. That is admittedly ugly, but it's the nature of a NoSQL database like DynamoDB, and it will let you store records with the correct uniqueness. When reading the records back, if you don't want to parse the combined field into its individual parts, you'll have to add separate InvoiceNumber and LineItemId fields as well.
If you don't have a large number of invoices per company, you can query by the Hash Key alone and do the filtering on the client side. If you do have a large number of invoices per company and need to be able to query only the items for a single invoice, then I would create a secondary index on CompanyCode and InvoiceNumber.
As I'm sure you have figured out, you cannot have more than two attributes forming your primary key (hash + range). Thus, depending on the type of queries you will be performing and the size of your data, you can structure your table in different ways.
(Optimized for the query types you mentioned above: by CompanyCode alone, and by all three attributes.)
Best solution for small/medium size data sets:
Hash Key: CompanyCode
Perform the query using only CompanyCode, then filter your results on the other two attributes
Optimal solution for large data sets:
Hash Key: CompanyCode
Range Key: InvoiceNumber+LineItemId
This allows you to query only on an index, but the table structure is pretty ugly
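Both answers come down to concatenating the two remaining attributes into the Range Key. A minimal boto3 sketch of that idea (the table name, delimiter and attribute names are illustrative):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("InvoiceLineItems")

# write: the Range Key is "InvoiceNumber#LineItemId"; keeping the parts
# as separate attributes as well spares readers from parsing the key
table.put_item(
    Item={
        "CompanyCode": "ACME",
        "InvoiceLineItem": "INV-001#7",
        "InvoiceNumber": "INV-001",
        "LineItemId": "7",
        "Amount": 125,
    }
)

# read: all line items of a single invoice via begins_with on the Range Key
table.query(
    KeyConditionExpression=Key("CompanyCode").eq("ACME")
    & Key("InvoiceLineItem").begins_with("INV-001#")
)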
I have a model that has four fields. How do I remove duplicate objects from my database?
Daniel Roseman's answer to this question seems appropriate, but I'm not sure how to extend it to a situation where there are four fields to compare per object.
Thanks,
W.
from django.db import models


def remove_duplicated_records(model, fields):
    """
    Removes records from `model` duplicated on `fields`,
    while keeping the most recent one (biggest `id`).
    """
    duplicates = model.objects.values(*fields)

    # override any model-specific ordering (needed for `.annotate()`)
    duplicates = duplicates.order_by()

    # group by identical values of `fields`; count how many rows match
    duplicates = duplicates.annotate(
        max_id=models.Max("id"), count_id=models.Count("id")
    )

    # keep only the groups which are actually duplicated
    duplicates = duplicates.filter(count_id__gt=1)

    for duplicate in duplicates:
        to_delete = model.objects.filter(**{x: duplicate[x] for x in fields})
        # spare the latest duplicated record
        # (use `Min` instead if you wish to keep the first record)
        to_delete = to_delete.exclude(id=duplicate["max_id"])
        to_delete.delete()
You shouldn't need to do this often. Use a unique_together constraint on the database instead (see the sketch below).
This leaves the record with the biggest id in the DB. If you want to keep the original record (the first one), modify the code a bit with models.Min. You can also use a completely different field, like a creation date or something.
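For reference, a constraint that would prevent such duplicates from appearing in the first place might look like this (MyModel and the field names stand in for your four fields; Meta.constraints requires Django 2.2+, older versions use unique_together):

from django.db import models

class MyModel(models.Model):
    field_1 = models.CharField(max_length=100)
    field_2 = models.CharField(max_length=100)
    field_3 = models.CharField(max_length=100)
    field_4 = models.CharField(max_length=100)

    class Meta:
        constraints = [
            # enforced by the database itself, not just by Django
            models.UniqueConstraint(
                fields=["field_1", "field_2", "field_3", "field_4"],
                name="unique_mymodel_four_fields",
            )
        ]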
Underlying SQL
When annotating, the Django ORM uses a GROUP BY statement on all the model fields used in the query - hence the use of the .values() method. GROUP BY groups all records having identical values in those fields. The duplicated ones (more than one id for the unique fields) are later filtered out in the HAVING statement generated by .filter() on the annotated QuerySet.
SELECT
    field_1,
    …
    field_n,
    MAX(id) AS max_id,
    COUNT(id) AS count_id
FROM
    app_mymodel
GROUP BY
    field_1,
    …
    field_n
HAVING
    count_id > 1
The duplicated records are later deleted in the for loop, with an exception for the most recent one in each group.
Empty .order_by()
Just to be sure, it's always wise to add an empty .order_by() call before aggregating a QuerySet.
The fields used for ordering the QuerySet are also included in the GROUP BY statement. An empty .order_by() overrides the columns declared in the model's Meta, so they are not included in the SQL query (a default sort by date, for example, could ruin the results).
You might not need to override it at the moment, but someone might add a default ordering later and thereby ruin your precious delete-duplicates code without even knowing it. Yes, I'm sure you have 100% test coverage…
Just add empty .order_by() to be safe. ;-)
https://docs.djangoproject.com/en/3.2/topics/db/aggregation/#interaction-with-default-ordering-or-order-by
Transaction
Of course you should consider doing it all in a single transaction.
https://docs.djangoproject.com/en/3.2/topics/db/transactions/#django.db.transaction.atomic
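Putting it together for the original question's four-field case, a call might look like this (MyModel and the field names are hypothetical):

from django.db import transaction

# run the whole clean-up atomically, so a failure mid-way
# doesn't leave some groups deduplicated and others not
with transaction.atomic():
    remove_duplicated_records(
        MyModel, ["field_1", "field_2", "field_3", "field_4"]
    )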
If you want to delete duplicates on single or multiple columns, you don't need to iterate over millions of records.
Fetch all the unique columns (don't forget to include the primary key column):
fetch = Model.objects.all().values("id", "skuid", "review", "date_time")
Read the result using pandas (I used pandas instead of an ORM query):
import pandas as pd
df = pd.DataFrame.from_dict(fetch)
Drop duplicates on the unique columns:
uniq_df = df.drop_duplicates(subset=["skuid", "review", "date_time"])
# don't include the primary key column in `subset`
Now you'll have the unique records, from which you can pick the primary keys:
primary_keys = uniq_df["id"].tolist()
Finally, it's show time: exclude those ids from the records and delete the rest of the data.
records = Model.objects.all().exclude(pk__in=primary_keys).delete()