Use fuzzy matching in a Django queryset filter

Is there a way to use fuzzy matching in a django queryset filter?
I'm looking for something along the lines of:
Object.objects.filter(fuzzymatch(namevariable)__gt=.9)
or is there a way to use lambda functions, or something similar, in Django queries? If so, how much would it affect performance, given that I have a stable set of ~6000 objects in my database that I want to match against?
(realized I should probably put my comments in the question)
I need something stronger than contains, something along the lines of difflib. I'm basically trying to get around doing a Object.objects.all() and then a list comprehension with fuzzy matching.
(although I'm not necessarily sure that doing that would be much slower than trying to filter based on a function, so if you have thoughts on that I'm happy to listen)
also, even though it's not exactly what I want, I'd be open to some kind of tokenized opposite-contains, like:
Object.objects.filter(['Virginia', 'Tech']__in=Object.name)
Where something like "Virginia Technical Institute" would be returned. Case-insensitive matching would be preferable.
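Even though it's not what I ultimately need, that tokenized "opposite-contains" idea can be approximated with Q objects. A minimal sketch, assuming the Object model and name field from the question (filter_all_tokens is just a hypothetical helper), ANDing one case-insensitive __icontains lookup per token:
from functools import reduce
import operator
from django.db.models import Q

def filter_all_tokens(queryset, field, text):
    # AND together one case-insensitive __icontains lookup per whitespace-separated token
    q = reduce(operator.and_,
               (Q(**{field + "__icontains": token}) for token in text.split()),
               Q())
    return queryset.filter(q)

# "Virginia Technical Institute" would match the tokens 'Virginia' and 'Tech', case-insensitively
matches = filter_all_tokens(Object.objects.all(), "name", "Virginia Tech")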

When you're using the ORM, the thing to understand is that everything you do converts to SQL commands and it's the performance of the underlying queries on the underlying database that matter. Case in point:
SELECT COUNT (*) ...
Is that fast? It depends on whether your database stores any records to give you that information - MySQL/MyISAM does, MySQL/InnoDB does not. In plain terms, this is one lookup in MyISAM and a scan of n rows in InnoDB.
Next thing - in order to do exact match lookups efficiently in SQL you have to tell it when you create the table - you can't just expect it to understand. For this purpose SQL has the INDEX statement - in django, use db_index=True in the field options of your model. Bear in mind that this has an added performance hit on writes (to create the index) and obviously extra storage is needed (for the datastructure) so you cannot "INDEX all the things". Also, I don't think it will help for fuzzy matching - but it's worth noting anyway.
Next consideration - how do we do fuzzy matching in SQL? Well apparently LIKE and CONTAINS allow a certain amount of searching and wildcard-results to be executed in SQL. These are T-SQL links - translate for your database server :) You can achieve this via Model.objects.get(fieldname__contains=value) which will produce LIKE SQL, or similar. There are a number of options available there for different lookups.
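For illustration, a hedged sketch of what that looks like through the ORM, assuming a model named Object with a CharField called name as in the question:
qs = Object.objects.filter(name__icontains="virginia")  # case-insensitive LIKE '%virginia%' (exact SQL depends on the backend)
print(qs.query)  # prints the SQL the ORM will send - useful for checking what LIKE it builds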
This may or may not be powerful enough for you - I'm not sure.
Now, for the big question: performance. Chances are if you're doing a contains search that the SQL server will have to hit all of the rows in the database - don't take my word on that, but it would be my bet - even with indexing on. With 6000 rows this might not take all that long; then again, if you're doing this on a per-connection-to-your-app basis it's probably going to create a slowdown.
Next thing to understand about the ORM: if you do this:
Model.objects.get(fieldname__contains=value)
Model.objects.get(fieldname__contains=value)
You will issue two queries to the database server. In other words, the ORM doesn't always cache the results - so you might just want to do an .all() and search in memory. Do read about caching and querysets.
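A minimal sketch of the "fetch once, fuzzy match in memory" approach the question mentions, using difflib against the ~6000 rows (Object and name are the question's placeholders):
import difflib

rows = list(Object.objects.all())   # one query; list() evaluates and caches the queryset
names = [row.name for row in rows]
# ratio-based matching, roughly the fuzzymatch(...)__gt=.9 idea from the question
close = difflib.get_close_matches("Virginia Tech", names, n=5, cutoff=0.9)
matches = [row for row in rows if row.name in close]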
Further on that page, you'll also see Q objects - useful for more complicated queries.
So in summary then:
SQL contains some basic fuzzy matching-like parameters.
Whether or not these are sufficient depends on your needs.
How they perform depends on your SQL server - definitely measure it.
Whether you can cache these results in memory depends on how likely you are to scale - it may be worth measuring the memory cost, whether you can share the cache between instances, and how often the cache would be invalidated (if it's invalidated frequently, don't do it).
Ultimately, I'd start by getting your fuzzy matching working, then measure, then tweak, then measure until you work out how to improve performance. 99% of this I learnt doing exactly that :)

With PostgreSQL as the database, you can use TrigramSimilarity to do fuzzy search and rank your results by similarity as well. Here is the link to the documentation:
https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/search/#trigram-similarity
For full text search you can refer to https://czep.net/17/full-text-search.html
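A minimal sketch of the trigram approach, assuming django.contrib.postgres is in INSTALLED_APPS and the pg_trgm extension is installed (Object and name are placeholders for your model and field):
from django.contrib.postgres.search import TrigramSimilarity

matches = (Object.objects
           .annotate(similarity=TrigramSimilarity("name", "Virginia Tech"))
           .filter(similarity__gt=0.3)   # tune the threshold to taste
           .order_by("-similarity"))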

If you need something stronger than contains lookup, have a look at regex lookups: https://docs.djangoproject.com/en/1.0/ref/models/querysets/#regex
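For example, a case-insensitive regex lookup that requires both tokens in order (the pattern is just an illustration):
Object.objects.filter(name__iregex=r"virginia.*tech")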

Related

How to order django query set filtered using '__icontains' such that the exactly matched result comes first

I am writing a simple app in Django that searches for records in a database.
Users input a name in the search field, and that query is used to filter records on a particular field, like:
Result = Users.objects.filter(name__icontains=query_from_searchbox)
E.g. -
The database consists of names - Shiv, Shivam, Shivendra, Kashiva, Varun, etc.
A search query 'shiv' returns records in the following order:
Kashiva, Shivam, Shiv and Shivendra
(ordered by primary key).
My question is: how can I achieve the order
Shiv, Shivam, Shivendra and Kashiva,
i.e. the most relevant results first, then the less relevant ones?
It's not possible to do that with standard Django as that type of thing is outside the scope & specific to a search app.
When you're interacting with the ORM consider what you're actually doing with the database - it's all just SQL queries.
If you wanted to rearrange the results you'd have to manipulate the queryset, check exact matches, then use regular expressions to check for partial matches.
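A hedged sketch of that rearranging, done in Python on the small result set from the question (Users, name and query as above): exact matches first, then prefix matches, then anything that merely contains the query:
results = list(Users.objects.filter(name__icontains=query))

def rank(user):
    name = user.name.lower()
    q = query.lower()
    if name == q:
        return 0   # exact match first ("Shiv")
    if name.startswith(q):
        return 1   # then prefix matches ("Shivam", "Shivendra")
    return 2       # then plain substring matches ("Kashiva")

results.sort(key=lambda u: (rank(u), u.name))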
Search isn't really the kind of thing that is best suited to the ORM, however, so you may wish to consider looking at dedicated search applications. They will usually maintain an index, which avoids database hits, and may also offer the percentage-match ordering you're looking for.
A good place to start may be with Haystack

Reducing the number of calls to MongoDB with mongoengine

I'm working to optimize a Django application that's (mainly) backed by MongoDB. It's dying under load testing. On the current problematic page, New Relic shows over 700 calls to pymongo.collection:Collection.find. Much of the code was written by junior coders, and normally I would look for places to add indices, make smarter joins and remove loops to reduce query calls, but joins aren't an option here. What I have done (after adding indices based on EXPLAINs) is try to reduce the cost in loops by making a general query and then filtering that smaller set in the loops*. While I've gotten the number down from 900 queries, 700 still seems insane even with the intense amount of work being done on the page. I thought perhaps find was called even when filtering an existing queryset, but the code suggests it's always a database query.
I've added some logging to mongoengine to see where the queries come from and to look at EXPLAIN statements, but I'm not having a ton of luck sifting through the wall of info. mongoengine itself seems to be part of the performance problem: I switched to mongomallard as a test and got a 50% performance improvement on the page. Unfortunately, I got errors on a bunch of other pages (as best I can tell, Mallard doesn't do well when filtering an existing queryset; the error complains about a call to deepcopy happening in a generator, which you can't do - I hit a brick wall there). While Mallard doesn't seem like a workable replacement for us, it does suggest a lot of the processing time is spent converting objects to and from Python in mongoengine.
What can I do to further reduce the calls? Or am I focusing on the wrong thing and should be attacking the problem somewhere else?
EDIT: providing some code/ models
The page in question displays the syllabus for a course, showing all the modules in the course, their lessons and the concepts under the lessons. For each concept, the user's progress in the concept is also shown. So there's a lot of looping to get the hierarchy teased out (and it's not stored according to any of the patterns the Mongo docs suggest).
class CourseVersion(Document):
    ...
    course_instances = ListField(ReferenceField('CourseInstance'))
    courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))

class CoursewareContainer(EmbeddedDocument):
    id = UUIDField(required=True, binary=False, default=uuid.uuid4)
    ....
    courseware_containers = ListField(EmbeddedDocumentField('self'))
    teaching_element_instances = ListField(StringField())
The course's modules, lessons and concepts are stored in courseware_containers; we need to get all of the concepts so we can get the list of ids in teaching_element_instances to find the most recent one the user has worked on (if any) for that concept and then look up their progress.
* Just to be clear, I am using a profiler and looking at times and doings things The Right Way as best I know, not simply changing things and hoping for the best.
The code sample isn't bad per se, but there are a number of areas that should be considered and may help improve performance.
class CourseVersion(Document):
    ...
    course_instances = ListField(ReferenceField('CourseInstance'))
    courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))

class CoursewareContainer(EmbeddedDocument):
    id = UUIDField(required=True, binary=False, default=uuid.uuid4)
    ....
    courseware_containers = ListField(EmbeddedDocumentField('self'))
    teaching_element_instances = ListField(StringField())
Review
Unbounded lists.
course_instances, courseware_containers, teaching_element_instances
If these fields are unbounded and continuously grow then the document will move on disk as it grows, causing disk contention on heavily loaded systems. There are two patterns to help minimise this:
a) Turn on Power of two sizes. This will cost disk space but should lower the amount of io churn as the document grows
b) Initial Padding - custom pad the document on insert so it gets put into a larger extent and then remove the padding. Really an anti pattern but it may give you some mileage.
The final barrier is the maximum document size of 16MB - you can't grow your data bigger than that.
Lists of ReferenceFields - course_instances
MongoDB doesn't have joins, so it costs an extra query to look up a ReferenceField - essentially they are an in-app join. That isn't bad per se, but it's important to understand the tradeoff. By default mongoengine won't automatically dereference the field; only when you access course_version.course_instances will it do another query and populate the whole list of references. So it can cost you another query - if you don't need the data then exclude() it from the query to stop any leaking queries.
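For instance, if a page never dereferences those references, a sketch like this avoids the extra query (version_id is just a hypothetical lookup value):
# skip pulling the list of references this page never dereferences
course_version = CourseVersion.objects.exclude("course_instances").get(id=version_id)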
EmbeddedFields
These fields are part of the document, so there is no cost for them, other than the wire costs of transmitting and loading the data. As they are part of the document, you don't need select_related to get this data.
teaching_element_instances
Are these a list of ids? The code sample above says it's a StringField. Either way, if you don't need to dereference the whole list, then storing the _ids as a StringField and manually dereferencing may be more efficient if coded correctly - especially if you just need the latest (last?) id.
Model complexity
The CoursewareContainer is complex. For any given CourseVersion you have n CoursewareContainers, which themselves have a list of n containers, and those each have n containers, and so on...
Finding the most recent instances
We need to get all of the concepts so we can get the list of ids in teaching_element_instances to find the most recent one the user has worked on (if any) for that concept and then look up their progress.
I'm unsure if there is a single instance you are after, or one per Container, or one per Course. Either way, the logic for querying the data should be examined. If it's a single instance you are after, then that could be stored against the user to simplify the logic of looking it up. If it's per course or container, then to improve performance ensure you minimise the number of queries - if possible collect all the ids and then at the end issue a single $in query, rather than doing a query per container.
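As a sketch of that collect-then-$in idea against the models above (TeachingElementInstance is a stand-in for whatever document those ids refer to, and course_version for the loaded CourseVersion):
all_ids = []

def collect_ids(container):
    # walk the nested container tree, gathering every teaching element id
    all_ids.extend(container.teaching_element_instances)
    for child in container.courseware_containers:
        collect_ids(child)

for top_level in course_version.courseware_containers:
    collect_ids(top_level)

# one $in query at the end instead of one query per container
instances = TeachingElementInstance.objects.filter(id__in=all_ids)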
Mongoengine costs
Currently, there is a performance cost to loading the data into Mongoengine classes - if you don't need the classes and are happy to work with simple dictionaries then either issue a raw pymongo query or use as_pymongo.
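For example, a hedged one-liner against the models above that returns plain dicts instead of full documents:
# raw dicts straight from pymongo - no Mongoengine class construction
raw_versions = CourseVersion.objects.only("courseware_containers").as_pymongo()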
Schema design
The schema looks logical enough, but is it suitable for the use case - in essence, is it playing to MongoDB's strengths, or is it putting a relational peg in a document-database-shaped hole? I can't answer that for you, but I do know the way to the happy path with MongoDB is to design the schema based on its use case. With relational databases, schema design from the outset is simple - you normalise; with document databases, how the data is used is a primary factor.
MongoDB best practices
There are many other best practices, and MongoDB has a guide which might be of interest: MongoDB Operations Best Practices.
Feel free to contact me via the Mongoengine mailing list to discuss further and if needs be discuss in private.
Ross

Add Indexes (db_index=True)

I'm reading a book about coding style in Django and one thing they discuss is db_index=True. Ever since I started using Django, I've never used this option because I'm not really sure what it does.
So my question is: when should I consider adding indexes?
This is not really django specific; more to do with databases. You add indexes on columns when you want to speed up searches on that column.
Typically, only the primary key is indexed by the database. This means look ups using the primary key are optimized.
If you do a lot of lookups on a secondary column, consider adding an index to that column to speed things up.
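For example (Customer and its fields are just placeholders), a frequently-searched secondary column gets db_index=True:
from django.db import models

class Customer(models.Model):
    name = models.CharField(max_length=100)    # rarely searched: no index
    email = models.EmailField(db_index=True)   # frequently filtered on: index it
    # the primary key (id) is indexed automatically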
Keep in mind, like most problems of scale, these only apply if you have a statistically large number of rows (10,000 is not large).
Additionally, every time you do an insert, indexes need to be updated. So be careful about which columns you add indexes to.
As always, you can only optimize what you can measure - so use the EXPLAIN statement and your database logs (especially any slow query logs) to find out where indexes can be useful.
The above answer is correct, but in cases where the search is done on columns that only have a varchar datatype (such as email), you need to add the index yourself. One way of doing that is to declare an Index in the model's Meta.indexes, for example:
Index(name='covering_index', fields=['headline'], include=['pub_date'])
Reference: https://docs.djangoproject.com/en/3.2/ref/models/indexes/

Design pattern for caching dynamic user content (in django)

On my website I'm going to provide points for some activities, similarly to Stack Overflow. I would like to calculate the value based on many factors, so each computation for each user will take, for instance, 10 SQL queries.
I was thinking about caching it:
in memcache,
in the user's row in the database (so that wherever I need to get the user from the database, I can easily show the points)
Storing it in the database seems easy, but on the other hand it's redundant information, so I decided to ask, since maybe there is an easier and prettier solution I've missed.
I'd highly recommend this app for storing the calculated values in the model: https://github.com/initcrash/django-denorm
Memcache is faster than the db... but if you already have to retrieve the record from the db anyway, having the calculated values cached in the rows you're retrieving (as a 'denormalised' field) is even faster, plus it's persistent.
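A minimal sketch of the memcache option using Django's cache framework (compute_points and the key format are assumptions; the denormalised-field approach stores the same number on the user's row instead):
from django.core.cache import cache

def user_points(user):
    # compute_points() stands in for the expensive ~10-query calculation
    key = "user_points:%s" % user.pk
    return cache.get_or_set(key, lambda: compute_points(user), timeout=3600)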

How I can encode/escape a varchar to be more secure without using cfqueryparam?

How can I encode/escape a varchar to be more secure without using cfqueryparam? I want to implement the same behaviour without using <cfqueryparam> to get around the "Too many parameters were provided in this RPC request. The maximum is 2100" problem. See: http://www.bennadel.com/blog/1112-Incoming-Tabular-Data-Stream-Remote-Procedure-Call-Is-Incorrect.htm
Update:
I want the validation / security part, without generating a prepared-statement.
What's the strongest encode/escape I can do to a varchar inside <cfquery>?
Something similar to mysql_real_escape_string() maybe?
As others have said, that length-related error originates at a deeper level, not within the queryparam tag. And it offers some valuable protection and therefore exists for a reason.
You could always either insert those values into a temporary table and join against that one or use the list functions to split that huge list into several smaller lists which are then used separately.
SELECT name,
       .....,
       createDate
FROM   somewhere
WHERE  (someColumn IN (a, b, c, d, e)
    OR  someColumn IN (f, g, h, i, j)
    OR  someColumn IN (.........));
cfqueryparam performs multiple functions.
It verifies the datatype. If you say integer, it makes sure the value is an integer, and if not, it does not allow it to pass.
It separates the data of a SQL script from the executable code (this is where you get protection from SQL injection). Anything passed as a param cannot be executed.
It creates bind variables at the DB engine level to help improve performance.
That is how I understand cfqueryparam to work. Did you look into the option of making several small calls vs one large one?
It is a security issue: it stops SQL injection.
Adobe recommends that you use the cfqueryparam tag within every cfquery tag, to help secure your databases from unauthorized users. For more information, see Security Bulletin ASB99-04, "Multiple SQL Statements in Dynamic Queries," at www.adobe.com/devnet/security/security_zone/asb99-04.html, and "Accessing and Retrieving Data" in the ColdFusion Developer's Guide.
The first thing I'd be asking myself is "how the heck did I end up with more than 2100 params in a single query?". Because that in itself should be a very very big red flag to you.
However if you're stuck with that (either due to it being outwith your control, or outwith your motivation levels to address ;-), then I'd consider:
the temporary table idea mentioned earlier
for values over a certain length just chop 'em in half and join 'em back together with a string concatenator, eg:
SELECT *
FROM tbl
WHERE col IN ('a', ';DROP DATABAS'+'E all_my_data', 'good', 'etc' [...])
That's a bit grim, but then again your entire query sounds grim, so that might not be such a concern.
param values that are over a certain length or have stop words in them or something. This is also quite a grim suggestion.
SERIOUSLY go back over your requirement and see if there's a way to not need 2100+ params. What is it you're actually needing to do that requires all this???
The problem does not reside with cfqueryparam, but with MSSQL itself:
Every SQL batch has to fit in the Batch Size Limit: 65,536 * Network Packet Size.
Maximum size for a SQL Server Query? IN clause? Is there a Better Approach
And
http://msdn.microsoft.com/en-us/library/ms143432.aspx
The few times that I have come across this problem I have been able to rewrite the query using subselects and/or table joins. I suggest trying to rewrite the query like this in order to avoid the parameter max.
If it is impossible to rewrite (e.g. all of the multiple parameters are coming from an external source) you will need to validate the data yourself. I have used the following regex in order to perform a safe validation:
<cfif ReFindNoCase("[^a-z0-9_\ \,\.]", arguments.InputText) IS NOT 0>
    <cfthrow type="Application" message="Invalid characters detected">
</cfif>
The code will force an error if any special character other than a comma, underscore, or period is found in a text string. (You may want to handle the situation more cleanly than just throwing an error.) I suggest you modify this as necessary based on the expected or allowed values in the fields you are validating. If you are validating a string of comma-separated integers, you may switch to a more limiting regex like "[^0-9\ \,]", which will only allow numbers, commas, and spaces.
This answer will not escape the characters, it will not allow them in the first place. It should be used on any data that you will not use with <cfqueryparam>. Personally, I have only found a need for this when I use a dynamic sort field; not all databases will allow you to use bind variables with the ORDER BY clause.