Django Designing Models for Dating App Matches - django

I’m working on a dating app for a hackathon project. We have a series of questions that users fill out, and then every few days we are going to send suggested matches. If anyone has a good tutorial for these kinds of matching algorithms, it would be very appreciated. One idea is to assign a point value for each question and then to do a
def comparison(person_a, person_b) function where you iterate through these questions, and where there’s a common answer, you add in a point. So the higher the score, the better the match. I understand this so far, but I’m struggling to see how to save this data in the database.
In python, I could take each user and then iterate through all the other users with this comparison function and make a dictionary for each person that lists all the other users and a score for them. And then to suggest matches, I iterate through the dictionary list and if that person hasn’t been matched up already with that person, then make a match.
person1_dictionary_of_matches = {‘person2’: 3, ‘person3’: 5, ‘person4’: 10, ‘person5’: 12, ‘person6’: 2,……,‘person200’:10}
person_1_list_of_prior_matches = [‘person3’, 'person4']
I'm struggling on how to represent this in django. I could have a bunch of users and make a Match model like:
class Match(Model):
person1 = models.ForeignKey(User)
person2 = models.ForeignKey(User)
score = models.PositiveIntegerField()
Where I do the iteration and save all the pairwise scores.
and then do
person_matches = Match.objectsfilter(person1=sarah, person2!=sarah).order_by('score').exclude(person2 in list_of_past_matches)
But I’m worried with 1000 users, I will have 1000000 rows in my table if do this. Will this be brutal to have to save all these pairwise scores for each user in the database? Or does this not matter if I run it at like Sunday night at 1am or just cache these responses once and use the comparisons for a period of months? Is there a better way to do this than matching everyone up pairwise? Should I use some other data structure to capture the people and their compatibility score? Thanks so much for any guidance!

Interesting question. In machine learning's current paradigm you work with sparse matrices that means that you would not have to perform every single match evaluation. The sparsity may come from two alternatives:
Create a batch offline analysis of your data to perform some clustering (fancy solution).
Filter the individuals by some key attributes: a) gender/sexual preference, b) geographical location, c) dating status etc. (simple solution)
After the filtering you could perform a function for estimating appropriate matches for the new user. Based on the selected choices of the user adscribe selected matches into the database for future queries. However, if you get serious about this problem I suggest you give Spark a try. This is not a problem for an SQL database but for a Big Data Engine.

Related

DynamoDB query all users sorted by name

I am modelling the data of my application to use DynamoDB.
My data model is rather simple:
I have users and projects
Each user can have multiple projects
Users can be millions, project per users can be thousands.
My access pattern is also rather simple:
Get a user by id
Get a list of paginated users sorted by name or creation date
Get a project by id
get projects by user sorted by date
My single table for this data model is the following:
I can easily implement all my access patterns using table PK/SK and GSIs, but I have issues with number 2.
According to the documentation and best practices, to get a sorted list of paginated users:
I can't use a scan, as sorting is not supported
I should not use a GSI with a PK that would put all my users in the same partition (e.g. GSI PK = "sorted_user", SK = "name"), as that would make my single partition hot and would not scale
I can't create a new entity of type "organisation", put all users in there, and query by PK = "org", as that would have the same hot partition issue as above
I could bucket users and use write sharding, but I don't really know how I could practically query paginated sorted users, as bucket PKs would need to be possibly random, and I would have to query all buckets to be able to sort all users together. I also thought that bucket PKs could be alphabetical letters, but that could crated hot partitions as well, as the letter "A" would probably be hit quite hard.
My application model is rather simple. However, after having read all docs and best practices and watched many online videos, I find myself stuck with the most basic use case that DynamoDB does not seem to be supporting well. I suppose it must be quite common to have to get lists of users in some sort of admin panel for practically any modern application.
What would others would do in this case? I would really want to use DynamoDB for all the benefits that it gives, especially in terms of costs.
Edit
Since I have been asked, in my app the main use case for 2) is something like this: https://stackoverflow.com/users?tab=Reputation&filter=all.
As to the sizing, it needs to scale well, at least to the tens of thousands.
I also thought that bucket PKs could be alphabetical letters, but
that could create hot partitions as well, as the letter "A" would
probably be hit quite hard.
I think this sounds like a reasonable approach.
The US Social Security Administration publishes data about names on its website. You can download the list of name data from as far back as 1879! I stumbled upon a website from data scientist and linguist Joshua Falk that charted the baby name data from the SSA, which can give us a hint of how names are distributed by their first letter.
Your users may not all be from the US, but this can give us an understanding of how names might be distributed if partitioned by the first letter.
While not exactly evenly distributed, perhaps it's close enough for your use case? If not, you could further distribute the data by using the first two (or three, or four...) letters of the name as your partition key.
1 million names likely amount to no more than a few MBs of data, which isn't very much. Partitioning based on name prefixes seems like a reasonable way to proceed.
You might also consider using a tool like ElasticSearch, which could support your second access pattern and more.

How to find entity in search query in Elasticsearch?

I'm using Elasticsearch to build search for ecommerece site.
One index will have products stored in it, in products index I'll store categories in it's other attributes along with. Categories can be multiple but the attribute will have single field value. (E.g. color)
Let's say user types in Black(color) Nike(brand) shoes(Categories)
I want to process this query so that I can extract entities (brand, attribute, etc...) and I can write Request body search.
I have tought of following option,
Applying regex on query first to extract those entities (But with this approach not sure how Fuzzyness would work, user may have typo in any of the entity)
Using OpenNLP extension (But this one only works on indexation time, in above scenario we want it on query side)
Using NER of any good NLP framework. (This is not time & cost effective because I'll have millions of products in engine also they get updated/added on frequent basis)
What's the best way to solve above issue ?
Edit:
Found couple of libraries which would allow fuzzy text matching in regex. But the entities to find will be many, so what's the best solution to optimise that ?
Still not sure about OpenNLP
NER won't work in this case because there are fixed number of entities so prediction is not right when there are no entity available in the query.
If you cannot achieve desired results with tuning of built-in ElasticSearch scoring/boosting most likely you'll need some kind of 'natural language query' processing:
Tokenize free-form query. Regex can be used for splitting lexems, however very often it is better to write custom tokenizer for that.
Perform named-entity recognition to determine possible field(s) for each keyword. At this step you will get associations like (Black -> color), (Black -> product name) etc. In fact you don't need OpenNLP for that as this should be just an index (keyword -> field(s)), and you can try to use ElasticSearch 'suggest' API for this purpose.
(optional) Recognize special phrases or combinations like "released yesterday", "price below $20"
Generate possible combinations of matches, and with help of special scoring function determine 'best' recognition result. Scoring function may be hardcoded (reflect 'common sense' heuristics) or it this may be a result of machine learning algorithm.
By recognition result (matches metadata) produce formal query to produce search results - this may be ElasticSearch query with field hints, or even SQL query.
In general, efficient NLQ processing needs significant development efforts - I don't recommend to implement it from scratch until you have enough resources & time for this feature. As alternative, you can try to find existing NLQ solution and integrate it, but most likely this will be commercial product (I don't know any good free/open-source NLQ components that really ready for production use).
I would approach this problem as NER tagging considering you already have corpus of tags. My approach for this problem will be as below:
Create a annotated dataset of queries with each word tagged to one of the tags say {color, brand, Categories}
Train a NER model (CRF/LSTMS).
This is not time & cost effective because I'll have millions of
products in engine also they get updated/added on frequent basis
To handle this situation I suggest dont use words in the query as features but rather use the attributes of the words as features. For example create an indicator function f(x',y) for word x with context x' (i.e the word along with the surrounding words and their attributes) and tag y which will return a 1 or 0. A sample indicator function will be as below
f('blue', 'y') = if 'blue' in `color attribute` column of DB and words previous to 'blue' is in `product attribute` column of DB and 'y' is `colors` then return 1 else 0.
Create lot of these indicator functions also know as features maps.
These indicator functions are then used to train a models using CRFS or LSTMS. Finially we use viterbi algorithm to find the best tagging sequence for your query. For CRFs you can use packages like CRFSuite or CRF++. Using these packages all you have go do is create indicator functions and the package will train a model for you. Once trained you can use this model to predict the best sequence for your queries. CRFs are very fast.
This way of training without using vector representation of words will generalise your model without the need of retraining. [Look at NER using CRFs].

Reducing the number of calls to MongoDB with mongoengine

I'm working to optimize a Django application that's (mainly) backed by MongoDB. It's dying under load testing. On the current problematic page, New Relic shows over 700 calls to pymongo.collection:Collection.find. Much of the code was written by junior coders and normally I would look for places to add indicies, make smarter joins and remove loops to reduce query calls, but joins aren't an option here. What I have done (after adding indicies based on EXPLAINs) is tried to reduce the cost in loops by making a general query and then filtering that smaller set in the loops*. While I've gotten the number down from 900 queries, 700 still seems insane even with the intense amount of work being done on the page. I thought perhaps find was called even when filtering an existing queryset, but the code suggests it's always a database query.
I've added some logging to mongoengine to see where the queries come from and to look at EXPLAIN statements, but I'm not having a ton of luck sifting through the wall of info. mongoengine itself seems to be part of the performance problem: I switched to mongomallard as a test and got a 50% performance improvement on the page. Unfortunately, I got errors on a bunch of other pages (as best I can tell it appears Mallard doesn't do well when filtering an existing queryset; the error complains about a call to deepcopy that's happening in a generator, which you can't do-- I hit a brick wall there). While Mallard doesn't seem like a workable replacement for us, it does suggest a lot of the proessing time is spent converting objects to and from Python in mongoengine.
What can I do to further reduce the calls? Or am I focusing on the wrong thing and should be attacking the problem somewhere else?
EDIT: providing some code/ models
The page in question displays the syllabus for a course, showing all the modules in the course, their lessons and the concepts under the lessons. For each concept, the user's progress in the concept is also shown. So there's a lot of looping to get the hierarchy teased out (and it's not stored according to any of the patterns the Mongo docs suggest).
class CourseVersion(Document):
...
course_instances = ListField(ReferenceField('CourseInstance'))
courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))
class CoursewareContainer(EmbeddedDocument):
id = UUIDField(required=True, binary=False, default=uuid.uuid4)
....
courseware_containers = ListField(EmbeddedDocumentField('self'))
teaching_element_instances = ListField(StringField())
The course's modules, lessons and concepts are stored in courseware_containers; we need to get all of the concepts so we can get the list of ids in teaching_element_instances to find the most recent one the user has worked on (if any) for that concept and then look up their progress.
* Just to be clear, I am using a profiler and looking at times and doings things The Right Way as best I know, not simply changing things and hoping for the best.
The code sample isn't bad per-sae but there are a number of areas that should be considered and may help improve performance.
class CourseVersion(Document):
...
course_instances = ListField(ReferenceField('CourseInstance'))
courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))
class CoursewareContainer(EmbeddedDocument):
id = UUIDField(required=True, binary=False, default=uuid.uuid4)
....
courseware_containers = ListField(EmbeddedDocumentField('self'))
teaching_element_instances = ListField(StringField())
Review
Unbounded lists.
course_instances, courseware_containers, teaching_element_instances
If these fields are unbounded and continuously grow then the document will move on disk as it grows, causing disk contention on heavily loaded systems. There are two patterns to help minimise this:
a) Turn on Power of two sizes. This will cost disk space but should lower the amount of io churn as the document grows
b) Initial Padding - custom pad the document on insert so it gets put into a larger extent and then remove the padding. Really an anti pattern but it may give you some mileage.
The final barrier is the maximum document size - 16MB you can't grow your data bigger than that.
Lists of ReferenceFields - course_instances
MongoDB doesn't have joins so it costs an extra query to look up a ReferenceField - essentially they are an in app join. Which isn't bad per-sae but its important to understand the tradeoff. By default mongoengine won't automatically dereference the field only doing course_version.course_instances will it do another query and then populate the whole list of references. So it can cost you another query - if you don't need the data then exclude() it from the query to stop any leaking queries.
EmbeddedFields
These fields are part of the document, so there is no cost for them, other than the wire costs of transmitting and loading the data. **As they are part of the document, you don't need select_related to get this data.
teaching_element_instances
Are these a list of id's? It says its a StringField in the code sample above. Either way, if you don't need to dereference the whole list then storing the _ids as a StringField and manually dereferencing may be more efficient if coded correctly - especially if you just need the latest (last?) id.
Model complexity
The CoursewareContainer is complex. For any given CourseVersion you have n CoursewareContainers with themselves have a list of n containers and those each have n containers and on...
Finding the most recent instances
We need to get all of the concepts so we can get the list of ids in
teaching_element_instances to find the most recent one the user has
worked on (if any) for that concept and then look up their progress.
I'm unsure if there is a single instance you are after or one per Container or one per Course. Either way - the logic for querying the data should be examined. If its a single instance you are after - then that could be stored against the user so to simplify the logic of looking this up. If its per course or container then to improve performance ensure you minimise the number of queries - if possible collect all the ids and then at the end issue a single $in query, rather than doing a query per container.
Mongoengine costs
Currently, there is a performance cost to loading the data into Mongoengine classes - if you don't need the classes and are happy to work with simple dictionaries then either issue a raw pymongo query or use as_pymongo.
Schema design
The schema looks logical enough but is it suitable for the use case - in essence is it using MongoDB's strengths or is it putting a relational peg in a document database shaped hole? I can't answer than for you but I do know the way to the happy path with MongoDB is design the schema based on its use case. With relational databases schema design from the outset is simple - you normalise, with document databases how the data is used is a primary factor.
MongoDB best practices
There are many other best practices and mongodb have a guide which might be of interest: MongoDB Operations Best Practices.
Feel free to contact me via the Mongoengine mailing list to discuss further and if needs be discuss in private.
Ross

Designing MySQL table for Achievements system

I am creating a database for an achievement system (like something you would see in a Blizzard game). I would like to have a GUI that displays the current progress of all achievements in the game which means I will need to query the progress of all achievements for a user in order to populate the GUI. I plan on having somewhere around 100 achievements.
This brings about a design question. What is the best way to design the database and querying code to query the progress of ~100 bit fields?
It seems like the brute force method would be to get the entire row of achievements and then for each field in the row do some hardcoded string comparison to determine which achievement we are dealing with.
Another possible solution may be to have a big switch statement based on the column index of the table and handle each achievement for each case (requires not modifying the table or you have to refactor a lot of C++ code).
I'm curious to hear any other designs you guys may have for this.
Thanks!
I suggest building a solution using 3 tables. These tables are users, achievements and user_achievements. A user would be identified with a u_id in the users table. An achievement would be identified with a a_id in the achievements table. You would then keep track of users achievements by inserting a row in the user_achievements table that includes a u_id to identify the user and a a_id to identify the achievement. The user_achievements table would also contain a column that would specify the % completion of that achievement for the given user.
Came across this question and even though it's 5 years old, perhaps someone would be interested in following approach.
Achievements are usually broken down to numbers (the rest, like Name, Description of each achievement can be put to site/app core to avoid bloating the DB).
lets be simple, we are not FB and don't need separate table for them, so in "users" table we add just 1 single column: "Achievements" it is a varchar(50). Number in brackets (50) will depend on your actual needs to this column (i.e. how much data it stores).
so you end up having in each cell of the Achievements column a numerical sequence: 10982039482084109384
Read this line of digits as follows, from left to right: user has reached "1098 profile views", received "2039 likes", etc. Optionally, add a separator for easier distinction + to instantly handle cases when as first user had 25 likes, then 125, then 2039 (2 digits, 3 digits, 4 digits - or another alternative is to use 0025 then 0125 then 2039 given you know max digits is 4 per achievement). But still lets say we decide to use separators, i.e. a comma:
1098,2039,4820,8410,9384
Then once you need a data, just SELECT achievements belonging to specific userID and subsequently (if you added a separator)
explode (',', $array)
then your site php core knows that first 4 digits stand for "profile views" and lets say this means that he has a level 10 badge for profile views (1 badge for 100 views).
Thereon, you can easily do operations with no further need for SQL queries. Example, user wants to know his progress on achieving a level 20 badge, you display: he has a 1098/2000 (or 55%) progress.
At that, achievement Description, Name, level information is stored in site core, while percentage is calculated on the go.
Hope the logic is clear and may be useful to any1 in community out there.
Cheers!

Solr + Haystack searching

I am trying to implement a search engine for a new app.
The app allows people to rate items (+1 or -1) - Giving the items a +ve or -ve score.
When people search for items, I'd like to take into account their rating and to order the results accordingly. If the item is a match, it should show up. But if it's a match with a high score it should be boosted up the results a bit.
A really good match should win over a fairly good match with a high score, so it needs to be weighted along with the rest of it (i.e. I boosted my titles a bit).
Not stuck on Solr by any means, only just started playing today.
With Solr, you can maintain a field with the document which holds the difference.
The difference can be between the total +1ve's and the -1ve's.
Solr allows you to boost on field values using function queries.
So you can query with the boost on the difference field, with documents with better difference scoring over others.
From indexing front, as this difference would change quite often, the respective document needs to be updated everytime.
Solr does not allow the updation of the single field, so you need to handle the incremental updates of the difference field.
If that would be a concern to you, can try using ExternalFileField.
This allows mapping of certain fields of documents such as ranking, popularity external to the index in a separate file.
The file can be updated and index committed to reflect the changes.
The field can also be used with function queries to boost the results as needed, however have lot of limitations.
You can order your results by a field that stores the ranking.
sqs.filter(content='blah').order_by('rating')