Boolean attribute or new table (Django + PostgreSQL) - django

Situation: I have a set of books. Each book is one of three types: "Test", "Premium" or "Common". Share of the data: 2%, 15%, 83%. Share of queries per unit of time: 40%, 20%, 40%.
I see several ways to model this in the database:
1. Boolean fields: is_test, is_premium. If we need only the "Test" books: Book.objects.filter(is_test=True). This could be a proxy model, for example. Likewise for premium books.
2. Separate tables: books_test, books_premium, books_common.
3. Choice field: a string in ['Test', 'Premium', 'Common'].
4. A combination of 1 and 2: a books_test table and a books table with an 'is_premium' attribute.
We need to query this data efficiently. All three book variants are needed on one page. The queryset combinations that occur are: only tests, only common, common + premium, only premium.
With option 1 or 3: one endpoint with a specific filter.
With option 2: one of three endpoints without filters (the frontend has to know which endpoint to use), or one endpoint with some conditions checked by the backend. Either way, the logic needs extending.
Which way is more correct, and why?

If you need to mix different types on one page, separate models/tables would complicate things for no good reason. The same goes for mapping more than two exclusive states to a combination of boolean fields.
This leaves you with a choice field or a separate BookType model containing the choices.
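A minimal sketch of the single-table choice-field approach, written against the standard-library sqlite3 module rather than the Django ORM so that it runs standalone. The table layout, the lowercase type values and the helper name books_of_types are assumptions for illustration. In Django this would roughly correspond to a CharField with choices, queried with filter(book_type__in=[...]).

```python
import sqlite3

# Hypothetical schema: one books table with a constrained book_type column.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        book_type TEXT NOT NULL
            CHECK (book_type IN ('test', 'premium', 'common'))
    )
    """
)
# An index on the type column can help the selective filters
# (for example the small share of test books).
conn.execute("CREATE INDEX books_type_idx ON books (book_type)")
conn.executemany(
    "INSERT INTO books (title, book_type) VALUES (?, ?)",
    [("A", "test"), ("B", "premium"), ("C", "common"), ("D", "common")],
)


def books_of_types(*types):
    """Return titles of books whose type is one of `types`, ordered by id."""
    placeholders = ", ".join("?" for _ in types)
    rows = conn.execute(
        f"SELECT title FROM books WHERE book_type IN ({placeholders}) ORDER BY id",
        types,
    )
    return [title for (title,) in rows]


# Every combination from the question is one filter on the same table.
only_tests = books_of_types("test")
common_and_premium = books_of_types("common", "premium")
```

All four combinations from the question go through the same function, so a single endpoint with a type filter is enough.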

Related

AWS Personalize items attributes

I'm trying to implement personalization and having problems with Items schema.
Imagine I'm Amazon: I have products, their brands and their categories. In what kind of Items schema should I include this information?
Should I include the brand name as a string categorical field? Should I rather include the brand ID as a string or a number? Or should I include both?
What about categories? I have the same questions.
Metadata Fields Metadata includes string or non-string fields that
aren't required or don't use a reserved keyword. Metadata schemas have
the following restrictions:
Users and Items schemas require at least one metadata field,
Users and Interactions datasets can contain up to five metadata
fields. An Items dataset can contain up to 50 metadata fields.
If you add your own metadata field of type string, it must include the
categorical attribute. Otherwise, Amazon Personalize won't use the
field when training a model.
https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html
There are simply 2 ways to include your metadata in Items/Users datasets:
If it can be represented as a number value, then provide the actual value if it makes sense.
If it can be represented as string, then provide the string value and make sure, that categorical is set to true.
But let's look at why Personalize needs you to mark string metadata as categorical. The answer is pretty simple.
Let's start with an example.
If your Items were Amazon.com products and you wanted to provide a ratings metadata field, then:
1. You could take all of the ratings, including the full review text sent by customers, and simply put that in a metadata field.
2. You could take just the star ratings, calculate the average, and put that in a metadata field.
The second option probably makes more sense in general. Having arbitrary, long product reviews as metadata changes pretty much nothing. Personalize doesn't understand whether the review itself is good or bad, or whether the author also recommends another product, so it doesn't really add anything to the recommendations.
However, if you reduce your dataset to the average rating, as in option 2, then it makes a lot more sense. Maybe some of our customers like crappy products? Maybe they want to buy them because they are famous YouTubers and they create videos about that? Based on their previous interactions and much more, Personalize will be able to perform slightly better, because now it knows that this product has a rating of 5/5 or 3/5.
I wanted to show you that in some cases, providing Items metadata as a free-form string makes no sense. That's why your string metadata must be categorical. It means that it should come from a finite set of values, so that it adds some knowledge for Personalize about a given Item and why some people might want to interact with it.
Going back to your question:
Should I include brand name as string as categorical field? Should I rather include brand ID as string or numeric? or should I include both?
I would simply go with the brand ID as a string. You could also go with the brand name, but a brand can be renamed while still being the same brand, so the ID is more stable. Also, two different brands could have the same name because they operate in different markets, and using the ID solves that too.
The "categorical": true switch in your schema just tells Personalize:
Hey, do you see that string field? It's categorised, finite set of values. If you train a model for me, please include this one during the training, it's important!
And as the documentation says, if you provide a string metadata field that is not marked as categorical, then Personalize will "think":
Hmm.. this field is a string, it has pretty random values and it's not marked as categorical. It's probably just a leftover from Items export job. Let's ignore that.
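As a rough sketch of what such an Items schema could look like, here it is built as a Python dict and serialized to JSON. The overall record layout follows the Avro-style format described in the documentation linked above. The metadata field names BRAND_ID, CATEGORY and AVERAGE_RATING, and the helper uncategorised_string_fields, are assumptions made up for this example.

```python
import json

# Field names BRAND_ID, CATEGORY and AVERAGE_RATING are illustrative assumptions.
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        # String metadata: marked categorical so it is used in training.
        {"name": "BRAND_ID", "type": "string", "categorical": True},
        {"name": "CATEGORY", "type": "string", "categorical": True},
        # Numeric metadata: provided as the actual value.
        {"name": "AVERAGE_RATING", "type": "float"},
    ],
    "version": "1.0",
}


def uncategorised_string_fields(schema, reserved=("ITEM_ID",)):
    """Return names of string metadata fields that lack "categorical": true."""
    return [
        field["name"]
        for field in schema["fields"]
        if field["type"] == "string"
        and field["name"] not in reserved
        and not field.get("categorical", False)
    ]


schema_json = json.dumps(items_schema, indent=2)
```

Running uncategorised_string_fields over a schema before uploading it is a cheap way to spot string fields that would otherwise be ignored.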

What are the trade-offs in Cloud Datastore for list property vs multiple properties vs ancestor key?

My application has models such as the following:
import attr

@attr.s
class Employee:
    name = attr.ib(type=str)
    department = attr.ib(type=int)
    organization_unit = attr.ib(type=int)
    pay_class = attr.ib(type=int)
    cost_center = attr.ib(type=int)
It works okay, but I'd like to refactor my application towards a microkernel (plugin) pattern, where there is a core Employee model that might have just the name, and plugins can add other properties. I imagine one possible solution might be:
@attr.s
class Employee:
    name = attr.ib(type=str)
    labels = attr.ib(type=list)
An employee might look like this:
Employee(
    name='John Doe',
    labels=['department:123',
            'organization_unit:456',
            'pay_class:789',
            'cost_center:012']
)
Perhaps another solution would be to just create an entity for each "label" with the core employee as the ancestor key. One concern with this solution is that currently writes to an entity group are limited to 1 per second, although that limitation will go away (hopefully soon) once Google upgrades existing Datastores to the new "Cloud Firestore in Datastore mode":
https://cloud.google.com/datastore/docs/firestore-or-datastore#in_native_mode
I suppose an application-level trade-off between the list property and ancestor keys approaches is that the list approach more tightly couples plugins with the core, whereas the ancestor key has a somewhat more decoupled data scheme (though not entirely).
Are there any other trade-offs I should be concerned with, performance or otherwise?
Personally I would go with multiple properties, for many reasons, but it's possible to mix all of these solutions for varying degrees of flexibility as required by the app. The main trade-offs are:
a) You can't do joins in Datastore, so storing related data in multiple entities will prevent querying with complex where clauses (ancestor key approach).
b) You can't do range queries if you store numeric and date fields as labels (list property approach).
c) The index could be large and expensive if you index your labels field and only a small set of the labels actually needs to be indexed.
So, one way to think of mixing all three is:
a) For your static data and application logic, use multiple properties.
b) For dynamic data that is not going to be used for querying, you can use a list of labels.
c) For pluggable data that a plugin needs to query on but doesn't need to join with the static data, you can create another entity that again uses a) and b), so the plugin stores all related data together.
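Trade-off b) can be illustrated with a small plain-Python sketch. It does not use the Datastore client. It only mimics the fact that "key:value" labels are compared as strings, whereas a typed property is compared as a number. The helper label and the sample employees are made up for the example.

```python
def label(key, value):
    """Encode a property as a 'key:value' string label."""
    return f"{key}:{value}"


employees = [
    {"name": "A", "pay_class": 789, "labels": [label("pay_class", 789)]},
    {"name": "B", "pay_class": 1000, "labels": [label("pay_class", 1000)]},
]

# Typed property: a numeric range filter finds the employee with pay_class 1000.
by_property = [e["name"] for e in employees if e["pay_class"] >= 800]

# Label strings: the same "range" is evaluated lexicographically, so
# 'pay_class:1000' sorts before 'pay_class:800' and the match is lost.
threshold = label("pay_class", 800)
by_label = [
    e["name"]
    for e in employees
    if any(
        lbl.startswith("pay_class:") and lbl >= threshold
        for lbl in e["labels"]
    )
]
```

Zero-padding the values would restore the ordering for numbers of a known maximum width, at the cost of every plugin having to agree on the encoding.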

Postgresql + Django: What is better between same strings or a many to many relationship?

I'm currently designing my database using PostgreSQL with Django and I was wondering: what is best practice, having several instances of the same model with the same value, or a many-to-many relationship?
Let me elaborate. Let's say I'm designing a store. The store sells items. Items can have one or many statuses (e.g. ordered, shipped, delivered, paid, pre-ordered etc.).
What would be a better practice:
Relating the items to their statuses via a many-to-many relationship, which will lead to one status having hundreds of thousands and later millions of relations? Will so many relations become problematic?
Or is it better for each item to have a foreignkey to their statuses? So that each status only has one item. And if I would like to query all the items that have the same status (e.g. shipped), I would have to iterate over all statuses with a common name.
What would be better, especially for the long term?
I would recommend going with a many-to-many relationship.
Hundreds of thousands or even millions of relations should not be a problem. The many-to-many relationship is stored as a table with id, item_id, status_id. SQL will be performant at querying the table either by status_id or item_id even if the table gets big. This is exactly the kind of thing it was built to handle.
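The join table described above can be sketched with the standard-library sqlite3 module (used here instead of PostgreSQL and the Django ORM so that the example runs standalone). The table and column names are assumptions that mirror the id, item_id, status_id layout. Django would generate a similar table for a ManyToManyField.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE status (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- The many-to-many "through" table: only ids, one row per relation.
    CREATE TABLE item_status (
        id INTEGER PRIMARY KEY,
        item_id INTEGER NOT NULL REFERENCES item (id),
        status_id INTEGER NOT NULL REFERENCES status (id),
        UNIQUE (item_id, status_id)
    );
    -- The UNIQUE constraint indexes lookups by item_id; this covers status_id.
    CREATE INDEX item_status_status_idx ON item_status (status_id);
    INSERT INTO item (id, name) VALUES (1, 'potato'), (2, 'banana');
    INSERT INTO status (id, name) VALUES (1, 'ordered'), (2, 'shipped');
    INSERT INTO item_status (item_id, status_id) VALUES (1, 1), (1, 2), (2, 1);
    """
)


def items_with_status(status_name):
    """All item names that currently have the given status."""
    rows = conn.execute(
        """
        SELECT item.name
        FROM item
        JOIN item_status ON item_status.item_id = item.id
        JOIN status ON status.id = item_status.status_id
        WHERE status.name = ?
        ORDER BY item.id
        """,
        (status_name,),
    )
    return [name for (name,) in rows]


def statuses_of_item(item_id):
    """All status names attached to one item."""
    rows = conn.execute(
        """
        SELECT status.name
        FROM status
        JOIN item_status ON item_status.status_id = status.id
        WHERE item_status.item_id = ?
        ORDER BY status.id
        """,
        (item_id,),
    )
    return [name for (name,) in rows]
```

Both directions are a single indexed lookup on the join table, which is why its row count on its own is not a concern.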
Let me elaborate. Let's say I'm designing a store. The store sells
items. Items can have one or many statuses (e.g. ordered, shipped,
delivered, paid, pre-ordered etc.).
If many users will hold many items, you should use a many-to-many relation. It is better to let Django manage this "third table": since it only holds ids, you can iterate over the related objects using a reverse lookup. I prefer many-to-many over plain foreign keys.
In your case, how will you handle a User that holds many items? What if my User buys one potato and two bananas? Would you duplicate the row in your User table to say "in this row he has the potato, and in this second one he has the banana"? You would then depend on DISTINCT while still polluting your main User table.
from django.db import models
...
class Item(models.Model):
    ...
class User(models.Model):
    items = models.ManyToManyField(Item)
So when I query Item or User, each query only returns the attributes of that model, whereas if you put the item inside the User model you will have multiple rows for the same user.
So instead of user.items.all() you would have to run User.objects.filter(id=id) and then items = [user.item for user in User.objects.filter(id=id)].
Look how complex this gets, and how messy it makes your database.

AWS DynamoDB Table Design: Store two UserIDs and Details in Table

I'm building an app where two users can connect with each other and I need to store that connection (e.g. a friendship) in a DynamoDB table. Basically, the connection table has two fields:
userIdA (hash key)
userIdB (sort key)
I was thinking of adding an index on userIdB so I can query on both fields. Should I store a connection as one record (ALICE, BOB) or two records (ALICE, BOB; BOB, ALICE)? The first option needs one write operation and less space, but I have to query twice to get all connections of a user. The second option needs two write operations and more space, but I only have to query once for the userId.
The user table has details like name and email:
userId (hash key)
name (sort key)
email
In my app, I want to show all connections of a certain user with user details in a listview. That means I have two options:
Store the user details of the connected users also in the connection table, e.g. add two name fields to that table. This is fast, but if the user name changes (name and email are retrieved from Facebook), the details are invalid and I need to update all entries.
Query the user details of each userId with a Batch Get request to read multiple items. This may be slower, but I always have up to date user details and don't need to store them in the connection table.
So what is the better solution, or are there any other advantages/disadvantages that I may have overlooked?
EDIT
After some google research regarding friendship tables with NoSQL databases, I found the following two links:
How does Facebook maintain a list of friends for each user? Does it maintain a separate table for each user?
NoSQL Design Patterns for Relational Data
The first link suggests to store the connection (or friendship) in a two way direction with two records, because it makes it easier and faster to query:
Connections:
1 userIdA userIdB
2 userIdB userIdA
The second link suggests saving a subset of duplicated data (a "summary") in the tables so it can be read faster with just one query. That would mean saving the user details in the connection table as well, and saving the userIds in an attribute of the user table:
Connections:
#  userIdA  userIdB  userDetails                     status
1  123      456      { userId: 456, name: "Bob" }    connected
2  456      123      { userId: 123, name: "Alice" }  connected
Users:
#  userId  name   connections
1  123     Alice  { 456 }
2  456     Bob    { 123 }
This database model makes it pretty easy to query connections, but seems to be difficult to update if some user details may change. Also, I'm not sure if I need the userIds within the user table again because I can easily query on a userId.
What do you think about that database model?
In general, nosql databases are often combined with a couple of assumptions:
Eventual consistency is acceptable. That is, it is often acceptable in application design if some intermediate answers are wrong during an update. For example, for a few seconds while Alice is becoming Bob's friend, it might be fine if "Is Alice Bob's friend?" returns true while "Is Bob Alice's friend?" returns false.
Performance is important. If you're using nosql it's generally because performance matters to you. It's also almost certainly because you care about the performance of operations that happen most commonly. (It's possible that you have a problem where the performance of some uncommon operation is so bad that you can't do it; nosql is not generally the answer in that situation)
You're willing to make uncommon operations slower to improve the performance of common operations.
So, how does that apply to your question? First, it suggests that ultimately the answer depends on performance. That is, no matter what people say here, the right answer depends on what you observe in practice. You can try multiple options and see what results you get.
With regard to the specific options you enumerated.
Assuming that performance is enough of a concern that nosql is a reasonable solution for your application, it's almost certainly query rather than update performance you care about. You probably will be happy if you make updates slower and more expensive so that queries can be faster. That's kind of the whole point.
You can likely handle updates out of band; that is, eventual consistency likely works for you. You could submit update operations to an SQS queue rather than handling them during your page load. So if someone clicks a confirm friend button, you could queue a request to actually update your database. It is OK even if that involves rebuilding their user row, rebuilding the friend rows, and even updating some counts about how many friends they have.
It probably does make sense to store a friend row in each direction so you only need one query.
It probably does make sense to store the user information like Name and picture that you typically display in a friend list duplicated in the friendship rows. Note that whenever the name or picture changes you'll need to go update all those rows.
It's less clear that storing the friends in the user table makes sense. That could get big. Also, it could be tricky to guarantee eventual consistency. Consider what happens if you are processing updates to two users' friendships at the same time. It's very important that you not end up with inconsistency once all the dust has settled.
Whenever you have non-normalized data such as duplicating rows in each direction, or copying user info into friendship tables, you want some way to revalidate and fix your data. You want to write code that in the background can go scan your system for inconsistencies caused by bugs or crashed activities and fix them.
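The points above (one row per direction, duplicated display details, and an update pass when a name changes) can be sketched in plain Python. An in-memory list stands in for the connection table. No DynamoDB client is used, and the helper names connection_items, connections_of and rename_user are hypothetical. With a real table these would become item writes, a query on the hash key, and a series of item updates.

```python
def connection_items(user_a, user_b, status="connected"):
    """Build the two connection rows, one per direction."""

    def row(owner, friend):
        return {
            "userIdA": owner["userId"],  # hash key: whose list this row is in
            "userIdB": friend["userId"],  # sort key: the connected user
            # Duplicated summary so a friend list needs only one query.
            "userDetails": {"userId": friend["userId"], "name": friend["name"]},
            "status": status,
        }

    return [row(user_a, user_b), row(user_b, user_a)]


def connections_of(table, user_id):
    """One 'query' on the hash key returns displayable friend names."""
    return [r["userDetails"]["name"] for r in table if r["userIdA"] == user_id]


def rename_user(table, user_id, new_name):
    """Refresh the duplicated name in every row pointing at user_id.

    Returns the number of rows touched; this is the work that could be
    queued and done out of band.
    """
    updated = 0
    for r in table:
        if r["userIdB"] == user_id:
            r["userDetails"]["name"] = new_name
            updated += 1
    return updated


alice = {"userId": 123, "name": "Alice"}
bob = {"userId": 456, "name": "Bob"}
table = connection_items(alice, bob)
```

The cost of a rename grows with the number of connections the user has, which is the price paid for the single-query read.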
I suggest you have the following fields in the table:
userId (hash key)
name (sort key)
email
connections (Comma separated or an array of userId assuming you have multiple connections for a user)
This structure can ensure consistency across your data.

Query and paginate three types of models at the same time in django

In django I have three models:
SimpleProduct
ConfigurableProduct - Instead of showing several variations of SimpleProducts, the user will see one product with options like color.
GroupProduct - Several SimpleProducts that are sold together.
First I create all the SimpleProducts, then I create ConfigurableProducts from several products that are variations of the same product, and last GroupProducts, which are combinations of several SimpleProducts.
When a user navigates to a category I need to show them all three types. If a SimpleProduct is part of a ConfigurableProduct I don't want to show it twice.
How do I make the query? Do I have to create three separate queries?
How do I use pagination on three models at the same time?
Can I somehow use inheritance?
Thanks
I think this question is tough to answer without understanding your business logic a little more clearly. Here are my assumptions:
Configurable options are ad hoc, i.e., you sell balls in red, blue, and yellow, shirts in small, medium, and large, etc. There is no way to represent these options abstractly because they don't transcend categories. (If they did, your database design is all wrong. If everything had custom color options, you would just make that a column in your database table.)
Each configuration option has a pre-existing business identity at your company. There's some sku associated with red balls or something like that. For whatever reason, it is necessary to have a database row for each possible configuration option. (If it isn't, then again, you're doing it all wrong.)
If this is the case, my simplest recommendation would be to have some base class that all products inherit from with a field: representative_product_id. The idea is that for every product, there is a representative version that gets shown on the category page, or anywhere else in your catalog. In your database, this will look like:
Name          id  representative_id
red_ball      1   1
blue_ball     2   1
green_ball    3   1
small_shirt   4   4
medium_shirt  5   4
large_shirt   6   4
unique_thing  7   7
As for django queries, I would use F objects if you have version 1.1 or later. Just:
SimpleProduct.objects.filter(representative_id=F('id'))
That will return a queryset whose representative ids match their own ids.
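To make the idea concrete, here is a sketch of the same filter in plain SQL through the standard-library sqlite3 module, with simple page-based pagination added. The product table and the helper category_page are assumptions for illustration. The WHERE clause is the SQL counterpart of comparing representative_id with F('id').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        representative_id INTEGER NOT NULL REFERENCES product (id)
    )
    """
)
# Same rows as the table above.
conn.executemany(
    "INSERT INTO product (id, name, representative_id) VALUES (?, ?, ?)",
    [
        (1, "red_ball", 1),
        (2, "blue_ball", 1),
        (3, "green_ball", 1),
        (4, "small_shirt", 4),
        (5, "medium_shirt", 4),
        (6, "large_shirt", 4),
        (7, "unique_thing", 7),
    ],
)


def category_page(page, per_page=2):
    """Return one page of representative products (pages start at 1)."""
    offset = (page - 1) * per_page
    rows = conn.execute(
        """
        SELECT name
        FROM product
        WHERE representative_id = id  -- a product that represents itself
        ORDER BY id
        LIMIT ? OFFSET ?
        """,
        (per_page, offset),
    )
    return [name for (name,) in rows]
```

Because every product type lives behind one filter, pagination only has to deal with a single query.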
At this point, someone will clamor for data integrity. The main condition is that representative_id must in all cases point to an object whose representative_id matches its id. There are ways to enforce this directly, such as with a pre_save validator or something like that. You could also do effectively the same thing by factoring out a ProductType table that contains a representative_id column. I.e.:
Products
Name          id  product_type
_________________________________
red_ball      1   ball
blue_ball     2   ball
green_ball    3   ball
small_shirt   4   shirt
medium_shirt  5   shirt
large_shirt   6   shirt
unique_thing  7   thing
Types
Name   representative_id
_______________________________
ball   1
shirt  4
thing  7
This doesn't replace the need to enforce integrity with some validator, but it makes it a little more abstract.
Go with Django's multi-table inheritance, with a base class you won't instantiate directly. The base class still has a manager you can run queries against, and the results will contain the base attributes of any subclass instance.
To tackle your question about configurable products that must not be displayed redundantly, I think you have two options:
Make configurable products a multiple choice of ConfigurableProductChoice (unrelated to SimpleProduct). Have the ConfigurableProductChoice extend the ConfigurableProduct. That way you'll have a single ConfigurableProduct in your results and no redundancy.
Make configurable products be associated to various options, and design a rule to compute the price from what options are selected. A simple addition would be fine. Your product IDs will need to encode what options are selected. You still have no redundancy, because you didn't involve SimpleProduct.