I have a dilemma designing a database for my application. Basically, I want to store US addresses. I'm using Django, but this is more of a database design question.
Say, I have models for State, City & ZipCode:
class State(models.Model):
short_name = models.CharField(_('state short name'), max_length=2, primary_key=True)
name = models.CharField(_('state full name'), max_length=50)
class City(models.Model):
name = models.CharField(_('city name'), max_length=100)
state = models.ForeignKey(State)
class ZipCode(models.Model):
code = models.CharField(_('zip code'), max_length=6)
city = models.ForeignKey(City)
Then, I want to store a single Address. Here is my dilemma: should I use Foreign Keys (or just a single one) or store the whole address as CharFields? That is, should I use the 1st, 2nd or 3rd version of the Address model:
1st version:
class Address(models.Model):
street = models.CharField(_('street address'), max_length=300)
city = models.ForeignKey(City)
zip_code = models.ForeignKey(ZipCode)
state = models.ForeignKey(State)
counter = models.IntegerField()
2nd version:
class Address(models.Model):
street = models.CharField(_('street address'), max_length=300)
city = models.CharField(_('city'), max_length=300)
zip_code = models.CharField(_('zip code'), max_length=6)
state = models.CharField(_('state'), max_length=50)
counter = models.IntegerField()
3rd version:
class Address(models.Model):
street = models.CharField(_('street address'), max_length=300)
zip_code = models.ForeignKey(ZipCode)
counter = models.IntegerField()
My specific use case is that every user search will either generate a new Address (if one doesn't exist) with counter = 0 or update an existing Address (say, increment the counter field; this is just an example). Assume 1 search per second with ~30% redundant searches.
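As a rough sketch (assuming Option 3's fields; record_search is just an illustrative name), each search would do something like:
from django.db.models import F

def record_search(street, zip_code):
    # Create the Address on first sight, otherwise bump its counter atomically.
    address, created = Address.objects.get_or_create(
        street=street,
        zip_code=zip_code,
        defaults={'counter': 0},
    )
    if not created:
        Address.objects.filter(pk=address.pk).update(counter=F('counter') + 1)
    return address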
My notes of different versions:
1st:
overhead with creating a new record (worst case: a new City & ZipCode need to be created; States will already be populated)
more connected data (not sure if that's a pro/con?)
2nd:
fast creation of new Address record
less "connected" data (not sure if that's a pro/con?)
3rd:
a ZipCode is already assigned to a City, which is already assigned to a State, so there's no need to copy this data
I'm just not sure which schema is better and why. For now I've been using "plain" data, that is, no Foreign Keys on the Address, just CharFields, and it works OK. But my site is growing and I want to have a solid foundation. Also, I'm really curious how to approach such a problem.
Thank you for taking the time to read this.
Thinking about it conceptually, does this hold true?
A state has one or more cities.
A city has one or more zip codes.
A zip code has one or more street addresses.
There's a fairly clear hierarchy here. If you reflect it in the database, then you'd have the following:
Address holding a foreign key to ZipCode.
ZipCode holding a foreign key to City.
City holding a foreign key to State.
So your design for State, City, and ZipCode looks right; you should complete it by choosing Option 3.
Here are some benefits to this design:
You'll avoid update anomalies. You won't ever get into a situation where an Address holds/is related to a Zip Code from California while also holding/being related to the state of Wyoming.
You'll not be holding the string "Illinois" over and over again - aside from saving space, if you realise you accidentally typed "Ilinois" three years down the line, you won't need to carry out a huge update script on the Address table of your live database to correct the problem.
If a state border changed and a city which used to be a part of Arizona became part of New Mexico (OK, this is unlikely, but bear with me for the sake of sticking with your example!), you'd only have to update the foreign key on a single record in the City table.
If there's ever a different need for this same data (Reporting? Business intelligence/analytics? A new website feature?), having a solid structure like this with each data item held in only one place and without spurious foreign keys will make it clear which data to use, will help avoid the need for time consuming and potentially problematic data cleansing, and will reduce development time. Duplicated and inconsistent data in source systems takes up a huge amount of my time as a business intelligence/data warehousing developer.
You have the right idea in looking ahead and thinking about whether your current database design can stand up to your website's growth. The sooner you resolve issues like this, the easier they'll be to change and the less disruption you're likely to suffer.
If you're currently working with something more like Option 2, then I'm guessing you might well have used a similar pattern elsewhere in your database. If this is the case, and you'd like to avoid the issues I've mentioned above (and others), then it's really worth doing some reading or training on database design, and specifically how to carry out normalization.
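To give a concrete flavour of Option 3 in use: since each Address hangs off a single ZipCode, reading the full denormalized view back is one query with select_related (a sketch, not from the original post; some_id is a placeholder):
address = Address.objects.select_related('zip_code__city__state').get(pk=some_id)
print(address.street,
      address.zip_code.code,
      address.zip_code.city.name,
      address.zip_code.city.state.short_name)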
Related
So I'm trying to create a good way of modelling both "houses" and "house groups".
Houses and house groups are extremely similar in that they both carry a description and have related pricing information.
However, "bookings" can only be assigned to Houses and not to HouseGroups.
At the moment, my model looks like this:
class Houselike(models.Model):
max_guests = models.IntegerField()
name = models.CharField(max_length=20)
description = models.TextField(blank=True)
class House(Houselike):
pass
class HouseGroup(Houselike):
houses = models.ManyToManyField(House)
Semantically, this is actually very close to what I want. However, in the database, this leads to there being two tables that both only have a single field "houselike_ptr_id" referring to the "Houselike" base object.
Checking whether a Houselike object is a House or a Housegroup thus involves looking in two different tables.
A more efficient alternative would be to do:
class Houselike(models.Model):
max_guests = models.IntegerField()
name = models.CharField(max_length=20)
description = models.TextField(blank=True)
is_group = models.BooleanField()
houses = models.ManyToManyField('self', symmetrical=False)
This results in only 1 extra field in the "houselike" table, and the other table containing the related houses is only hit if we actually look them up. This is the best solution from a storage point of view IMHO.
However, this isn't quite as good from a semantic point of view: Houses and Housegroups are similar, but different objects.
Also, this allows for stuff like house groups containing other house groups and non-groups containing houses; things I would have to check for manually.
I also really like being able to explicitly work with House and HouseGroup objects. Representing them both with the same class just feels wrong.
Is there a better way to do this?
EDIT:
I forgot to mention that pricing information (as well as other entities) can be associated with either a House or a Housegroup, and is implemented (roughly) as follows:
class PricePeriod(models.Model):
house = models.ForeignKey(Houselike, on_delete=models.CASCADE)
arrival_date = models.DateField()
# Date of last departure date
departure_date = models.DateField()
price = models.DecimalField(max_digits = 10, decimal_places=2)
This is why I don't simply make Houselike an abstract model: these other objects are related to it.
Turns out, this is something called "single table inheritance", which is perfect in my case.
And, this being the Internet, there's an app for that: https://github.com/craigds/django-typed-models
from django.db import models
from typedmodels.models import TypedModel
# Create your models here.
class Houselike(TypedModel):
max_guests = models.IntegerField()
name = models.CharField(max_length=20)
description = models.TextField(blank=True)
class House(Houselike):
pass
class HouseGroup(Houselike):
houses = models.ManyToManyField(House)
This resulted in pretty much exactly what I was asking: a single table in the database, and an explicit, semantically-correct model in Python/Django.
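A quick usage sketch of what that buys you (assuming django-typed-models behaves as documented; the object names here are made up):
# Both subclasses store their rows in the single houselike table,
# and the base queryset hands back correctly-typed instances.
cottage = House.objects.create(name="Cottage", max_guests=4)
farm = HouseGroup.objects.create(name="Farm", max_guests=10)
farm.houses.add(cottage)

for obj in Houselike.objects.all():
    print(type(obj).__name__, obj.name)  # House Cottage, then HouseGroup Farm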
Now I just need to fix my awful naming...
I have the following models:
class Deal(models.Model):
date = models.DateTimeField(auto_now_add=True)
retailer = models.ForeignKey(Retailer, related_name='deals')
description = models.CharField(max_length=255)
...etc
class CustomerProfile(models.Model):
saved_deals = models.ManyToManyField(Deal, related_name='saved_by_customers', null=True, blank=True)
dismissed_deals = models.ManyToManyField(Deal, related_name='dismissed_by_customers', null=True, blank=True)
What I want to do is retrieve deals for a customer, but I don't want to include deals that they have dismissed.
I'm having trouble wrapping my head around the many-to-many relationship and am having no luck figuring out how to do this query. I'm assuming I should use an exclude on Deal.objects() but all the examples I see for exclude are excluding one item, not what amounts to multiple items.
When I naively tried just:
deals = Deal.objects.exclude(customer.saved_deals).all()
I get the error: "'ManyRelatedManager' object is not iterable"
If I say:
deals = Deal.objects.exclude(customer.saved_deals.all()).all()
I get "Too many values to unpack" (though I feel I should note there are only 5 deals and 2 customers in the database right now)
Our client expects to have thousands of customers and tens of thousands of deals in the future, so I'd like to stay performance-oriented as best I can. If this setup is incorrect, I'd love to know a better way.
Also, I am running django 1.5 as this is deployed on App Engine (using CloudSQL)
Where am I going wrong?
I'd suggest using customer.dismissed_deals to get the list of deal ids to exclude (use values_list to quickly convert it to a flat list).
This saves you from excluding on a field in a joined table.
deals = Deal.objects.exclude(id__in=customer.dismissed_deals.values_list('id', flat=True))
You'd want to change this:
deals = Deal.objects.exclude(customer.saved_deals).all()
To something like this:
deals = Deal.objects.exclude(dismissed_by_customers__id__in=[customer.id]).all()
Basically, the many-to-many relationship is reached from Deal through its related_name (dismissed_by_customers here), so you can't pass the related manager itself to exclude(); you have to exclude on the relation's fields instead.
Saved deals and dismissed deals are two fields describing almost the same thing. There is also a risk of too many columns being used in the database if these two fields are allowed to store NULL values. It's worth considering removing dismissed_deals altogether and using a single saved field that stores True or False.
Another thing to think about is moving saved_deals out of the CustomerProfile class and into the Deal class. Saved deals are about deals, so the field arguably belongs on the Deal model.
class Deal(models.Model):
saved = models.BooleanField()
...
A real deal would have been made by one customer/buyer rather than a few. A real customer can have millions of deals, so relating each deal to a customer would be a good way to model it.
class Deal(models.Model):
saved = models.BooleanField()
customer = models.ForeignKey(CustomerProfile)
....
What I want to do is retrieve deals for a customer, but I don't want to include deals that they have dismissed.
deals_for_customer = Deal.objects.filter(customer__name="John")
There is a double underscore between customer and name (customer__name), which lets you filter across the relation: customer refers to the related CustomerProfile model, and name is a field on that model (assuming the CustomerProfile class has a name attribute).
deals_saved = deals_for_customer.filter(saved=True)
That's it. I hope I could help. Let me know if not.
I am trying to get all Horse objects which fall within a specific from_date and to_date range on a related listing object. eg.
Horse.objects.filter(listings__to_date__lt=to_date.datetime,
listings__from_date__gt=from_date.datetime)
Now, as I understand it, this database query creates an inner join, which then enables me to find all my horse objects based on the related listing dates.
My question is how exactly this works; it probably comes down to a major lack of understanding of how inner joins actually work. Would this query need to "check" each and every horse object first to ascertain whether or not it has a related listing object? I'd imagine this could prove quite inefficient, because you might have 5 million horse objects with no related listing object, yet you would still have to check each and every one first.
Alternatively I could start with my Listings and do something like this first:
Listing.objects.filter(to_date__lt=to_date.datetime,
from_date__gt=from_date.datetime)
And then:
for listing in listing_objs:
    if listing.horse:
        horses.append(listing.horse)
But this seems like a rather odd way of achieving my results too.
If anyone could help me understand how queries work in Django and which is the most efficient way to go about doing such a query it would be a great help!
This is my current model setup:
class Listing(models.Model):
to_date = models.DateTimeField(null=True, blank=True)
from_date = models.DateTimeField(null=True, blank=True)
promoted_to_date = models.DateTimeField(null=True, blank=True)
promoted_from_date = models.DateTimeField(null=True, blank=True)
# Relationships
horse = models.ForeignKey('Horse', related_name='listings', null=True, blank=True)
class Horse(models.Model):
created_date = models.DateTimeField(null=True, blank=True, auto_now=True)
type = models.CharField(max_length=200, null=True, blank=True)
name = models.CharField(max_length=200, null=True, blank=True)
age = models.IntegerField(null=True, blank=True)
colour = models.CharField(max_length=200, null=True, blank=True)
height = models.IntegerField(null=True, blank=True)
The way you write your query really depends on what information you want back most of the time. If you are interested in the horses, then query from Horse. If you're interested in listings then you should query from Listing. That's generally the correct thing to do, especially when you're working with simple foreign keys.
Your first query is probably the better one with regards to Django. I've used slightly simpler models to illustrate the differences. I've created an active field rather than using datetimes.
In [18]: qs = Horse.objects.filter(listings__active=True)
In [19]: print(qs.query)
SELECT
"scratch_horse"."id",
"scratch_horse"."name"
FROM "scratch_horse"
INNER JOIN "scratch_listing"
ON ( "scratch_horse"."id" = "scratch_listing"."horse_id" )
WHERE "scratch_listing"."active" = True
The inner join in the query above will ensure that you only get horses that have a listing. (Most) databases are very good at using joins and indexes to filter out unwanted rows.
If Listing were very small and Horse were rather large, then I would hope the database would only look at the Listing table, and then use an index to fetch the correct parts of Horse without doing a full table scan (inspecting every horse). You will need to run the query and check what your database is doing, though. EXPLAIN (or your database's equivalent) is extremely useful. If you're guessing what the database is doing, you're probably wrong.
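For a quick check from Python, newer Django versions expose the plan directly (QuerySet.explain() was added in Django 2.1; on older versions, copy str(qs.query) and run EXPLAIN on it in your database shell):
qs = Horse.objects.filter(listings__active=True)
print(qs.explain())  # shows whether the database uses an index or scans the whole table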
Note that if you need to access the listings of each horse, you'll be executing another query each time you access horse.listings. prefetch_related can help here: it fetches the related listings with a single extra query and caches them, so accessing them later doesn't hit the database again.
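For instance (a sketch using the simplified models above):
# One query for the horses, one extra query for all of their listings;
# the inner loop is then served from the prefetch cache.
horses = Horse.objects.filter(listings__active=True).prefetch_related('listings')
for horse in horses:
    for listing in horse.listings.all():
        print(horse.name, listing.active)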
Now, your second query:
In [20]: qs = Listing.objects.filter(active=True).select_related('horse')
In [21]: print(qs.query)
SELECT
"scratch_listing"."id",
"scratch_listing"."active",
"scratch_listing"."horse_id",
"scratch_horse"."id",
"scratch_horse"."name"
FROM "scratch_listing"
LEFT OUTER JOIN "scratch_horse"
ON ( "scratch_listing"."horse_id" = "scratch_horse"."id" )
WHERE "scratch_listing"."active" = True
This does a LEFT join, which means that the right hand side can contain NULL. The right hand side is Horse in this instance. This would perform very poorly if you had a lot of listings without a Horse, because it would bring back every single active listing, whether or not a horse was associated with it. You could fix that with .filter(active=True, horse__isnull=False) though.
See that I've used select_related, which joins the tables so that you're able to access listing.horse without incurring another query.
Now I should probably ask why all your fields are nullable. That's usually a terrible design choice, especially for ForeignKeys. Will you ever have a listing that's not associated with a horse? If not, get rid of the null. Will you ever have a horse that won't have a name? If not, get rid of the null.
So the answer is, do what seems natural most of the time. If you know a particular table is going to be large, then you must inspect the query planner (EXPLAIN), look into adding/using indexes on filter/join conditions, or querying from the other side of the relation.
I have a probably quite basic question: I am currently setting up a database for students and their marks in my courses. I currently have two main classes in my models.py: Student (containing their name, id, email address etc) and Course (containing an id, the year it is running in and the assessment information - for example "Essay" "40%" "Presentation" "10%" "Exam" "50%"). And, of course, Student has a ManyToMany field so that I can assign students to courses and vice versa. I have to be able to add and modify these things.
Now, obviously, I would like to be able to add the marks for the students in the different assignments (which are different from course to course). As I am very inexperienced in database programming, I was hoping one of you could give me a tip on how to set this up within my models.
Thanks,
Tobi
Perhaps the way to go about it is to have a separate class for assignments, something like this:
class Assignment(models.Model):
ASSIGNMENT_TYPES = (
('essay', "Essay"),
...
)
ASSIGNMENT_GRADES = (
('a+', "A+"),
('a', "A"),
...
)
student = models.ForeignKey("Student")
course = models.ForeignKey("Course")
assignment_type = models.CharField(choices=ASSIGNMENT_TYPES, max_length=15, default='essay')
progress = models.IntegerField()
grade = models.CharField(choices=ASSIGNMENT_GRADES, max_length=3, default="a+")
This way you have one assignment connected to one student and one course. It can be modified relatively easily if you have multiple students per assignment, by adding another class (for example StudentGroup) and including it in the model.
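For example, recording a mark with that model might look like this (a sketch; student and course stand for existing Student and Course instances):
# One row per (student, course, assignment type), with its grade.
Assignment.objects.create(
    student=student,
    course=course,
    assignment_type='essay',
    progress=100,
    grade='a',
)

# All essay marks for a course:
essays = Assignment.objects.filter(course=course, assignment_type='essay')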
Hope that this helps :)
Create a model called "Assessment", which has a foreign key to Course. In addition, create a field called "Assessment Type", another called "Assessment Result" and a final one called "Assessment Date". It should look like this:
ASSESSMENTS = (('E', 'Essay'), ('P', 'Presentation'))
class Assessment(models.Model):
    course = models.ForeignKey('Course')
    assessment = models.CharField(max_length=1, choices=ASSESSMENTS)
    result = models.CharField(max_length=250)
    taken_on = models.DateField()
    overall_result = models.BooleanField()
    is_complete = models.BooleanField()
Each time there is an exam, you fill in a record in this table for each assessment taken. You can use the overall result as a flag to see if the student has passed or failed, and the is_complete to see if there are any exams pending for a course.
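As a sketch of filling in such a record (course stands for an existing Course instance):
from datetime import date

# One Assessment row per exam taken on the course.
Assessment.objects.create(
    course=course,
    assessment='E',        # 'E' = Essay, per the ASSESSMENTS choices
    result='65%',
    taken_on=date.today(),
    overall_result=True,   # passed
    is_complete=True,
)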
You should look at the models.py file of classcomm,
a content management system written in Django for delivering and managing courses on the Web.
It has the following models:
Department
Course
Instructor
Mentor
Enrollment
Assignment
DueDateOverride
Submission
Grade
ExtraCredit
Information
Resource
Announcement
You may not need such a complex relationship for your case, but it's worth looking into its models design.
You can find more details on the homepage of the project.
What's the best way to ensure that transactions are always balanced in double-entry accounting?
I'm creating a double-entry accounting app in Django. I have these models:
class Account(models.Model):
TYPE_CHOICES = (
('asset', 'Asset'),
('liability', 'Liability'),
('equity', 'Equity'),
('revenue', 'Revenue'),
('expense', 'Expense'),
)
num = models.IntegerField()
type = models.CharField(max_length=20, choices=TYPE_CHOICES, blank=False)
description = models.CharField(max_length=1000)
class Transaction(models.Model):
date = models.DateField()
description = models.CharField(max_length=1000)
notes = models.CharField(max_length=1000, blank=True)
class Entry(models.Model):
TYPE_CHOICES = (
('debit', 'Debit'),
('credit', 'Credit'),
)
transaction = models.ForeignKey(Transaction, related_name='entries')
type = models.CharField(max_length=10, choices=TYPE_CHOICES, blank=False)
account = models.ForeignKey(Account, related_name='entries')
amount = models.DecimalField(max_digits=11, decimal_places=2)
I'd like to enforce balanced transactions at the model level, but there don't seem to be hooks in the right place. For example, Transaction.clean won't work because transactions get saved first, then entries are added due to the Entry.transaction ForeignKey.
I'd like balance checking to work within admin also. Currently, I use an EntryInlineFormSet with a clean method that checks balance in admin but this doesn't help when adding transactions from a script. I'm open to changing my models to make this easier.
(Hi Ryan! -- Steve Traugott)
It's been a while since you posted this, so I'm sure you're way past this puzzle. For others and posterity, I have to say yes, you need to be able to split transactions, and no, you don't want to take the naive approach and assume that transaction legs will always be in pairs, because they won't. You need to be able to do N-way splits, where N is any positive integer greater than 1. Ryan has the right structure here.
What Ryan calls Entry I usually call Leg, as in transaction leg, and I'm usually working with bare Python on top of some SQL database. I haven't used Django yet, but I'd be surprised (shocked) if Django doesn't support something like the following: Rather than use the native db row ID for transaction ID, I instead usually generate a unique transaction ID from some other source, store that in both the Transaction and Leg objects, do my final check to ensure debits and credits balance, and then commit both Transaction and Legs to the db in one SQL transaction.
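In Django terms, that pattern might look roughly like this (a sketch, not the poster's code; create_balanced_transaction is an illustrative name):
from django.db import transaction as db_transaction

def create_balanced_transaction(date, description, legs):
    # legs: an iterable of (type, account, amount) tuples,
    # e.g. ('debit', cash_account, Decimal('10.00')).
    legs = list(legs)
    debits = sum(amount for type_, _, amount in legs if type_ == 'debit')
    credits = sum(amount for type_, _, amount in legs if type_ == 'credit')
    if debits != credits:
        raise ValueError("Unbalanced transaction: debits != credits")
    # Commit the Transaction and all of its Entry rows atomically, or not at all.
    with db_transaction.atomic():
        txn = Transaction.objects.create(date=date, description=description)
        Entry.objects.bulk_create(
            Entry(transaction=txn, type=type_, account=account, amount=amount)
            for type_, account, amount in legs
        )
    return txn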
Ryan, is that more or less what you wound up doing?
This may sound terribly naive, but why not just record each transaction in a single record containing "to account" and "from account" foreign keys that link to an accounts table, instead of trying to create two records for each transaction? From my point of view, it seems that the essence of "double-entry" is that transactions always move money from one account to another. There is no advantage to using two records to store such transactions, and many disadvantages.
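For reference, a sketch of that single-record alternative (field names here are illustrative, not from the original post):
class Transfer(models.Model):
    date = models.DateField()
    description = models.CharField(max_length=1000)
    from_account = models.ForeignKey(Account, related_name='transfers_out')
    to_account = models.ForeignKey(Account, related_name='transfers_in')
    amount = models.DecimalField(max_digits=11, decimal_places=2)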