Add Indexes (db_index=True) - django

I'm reading a book about coding style in Django and one thing they discuss is db_index=True. Ever since I started using Django, I've never used this function because I'm not really sure what it does.
So my question is, when to consider adding indexes?

This is not really django specific; more to do with databases. You add indexes on columns when you want to speed up searches on that column.
Typically, only the primary key is indexed by the database. This means look ups using the primary key are optimized.
If you do a lot of lookups on a secondary column, consider adding an index to that column to speed things up.
Keep in mind, like most problems of scale, these only apply if you have a statistically large number of rows (10,000 is not large).
Additionally, every time you do an insert, indexes need to be updated. So be careful on which column you add indexes.
As always, you can only optimize what you can measure - so use the EXPLAIN statement and your database logs (especially any slow query logs) to find out where indexes can be useful.

The above answer is correct but in some cases where the search is being done on columns that have only varchar datatype like email. There you need to add an index.
Following is the way of doing that:
Index(name='covering_index', fields=['headline'], include=['pub_date'])
reference from https://docs.djangoproject.com/en/3.2/ref/models/indexes/

Related

Using column versions for time series

In the official documentation there is a text for which I can't totally understand the reason:
When working with time series, do not leverage the transactional behavior of rows. Changes to data in an existing row should be stored as a new, separate row, not changed in the existing row. This is an easier model to construct, and it enables you to maintain a history of activity without relying upon column versions.
The last sentence is not obvious and concrete, so it doesn't convince me. For now, using versioning for updating the cell's data still looks to me like a good fit for the update task. At least versions are managed by BigTable, so it's simplier solution.
Can anybody please provide more obvious explanation of why the versioning shouldn't be used in that use case?
Earlier in that page under Patterns for row key design, a bit more detail is explained. The high level view being that using row keys instead of column versions will:
Make it easier to run queries across your data, allowing for scanning of less data.
Avoid going over the recommended maximum row size.
The one caveat being:
It is acceptable to use versions of a column where the use case is
actually amending a value, and the value's history is important. For
example, suppose you did a set of calculations based on the closing
price of ZXZZT, and initially the data was mistakenly entered as
559.40 for the closing price instead of 558.40. In this case, it might be important to know the value's history in case the incorrect value
had caused other miscalculations.

Better method for querying DynamoDB table randomly?

I've included some links along with our approaches to other answers, which seem to be the most optimal on the web right now.
Our records need to be categorized (eg. "horror", "thriller", "tv"), and randomly accessible both in specific categories and across all/some categories. We generally need to access about 20 - 100 items at a time. We also have a smallish number of categories (less than 100).
We write to the database for uploading/removing content, although this is done in batches and does not need to be real time.
We have tried two different approaches, with two different data structures.
Approach 1
AWS DynamoDB - Pick a record/item randomly?
Help selecting nth record in query.
In short, using the category as a hash key, and a UUID as the sort key. Generate a random UUID, query Dynamo using greater than or less than, and limit to 1. This is even suggested by an AWS employee in the second link. (We've also tried increasing the limit to the number of items we need, but this increases the probability of the query failing the first time around).
Issues with this approach:
First query can fail if it is greater than/less than any of the UUIDs
Querying on any specific category will cause throttling at scale (Small number of partitions)
We've also considered adding a suffix to each category to artificially increase the number of partitions we have, as pointed out in the following link.
AWS Database Blog
Choosing the Right DynamoDB Partition Key
Approach 2
Amazon Web Services: How do we get random item from the dynamoDb's table?
Doing something similar to this, where we concatenate the category with a sequential number, and use this as the hash key. e.g. horror-000001.
By knowing the number of records in each category, we're able to perform random queries across our entire data set, while also avoiding hot partitions/keys.
Issues with this approach
We need a secondary data structure to manage the sequential counts across each category
Writing (especially deleting) is significantly more complex, although this doesn't need to happen in real time.
Conclusion
Both approaches solve our main use case of random queries on category/categories, but the cons they offer are really deterring us from using them. We're leaning more towards approach #1 using suffixes to solve the hot partitioning issue, although we would need the additional retry logic for failed queries.
Is there a better way of approaching this problem? Specifically looking for solutions capable of scaling well (No scan), without requiring extra resources be implemented. #1 fits the bill, but needing to manage suffixes and failed attempts really deters us from using it, especially when it is being called inside a lambda (billed for time used).
Thanks!
Follow Up
After more research and testing, my team has decided to move towards MySQL hosted on RDS for these tables. We learned that this is one of the few use cases were DynamoDB does not fit, and requires rewriting your use case to fit the DB (Bad).
We felt that the extra complexity required to integrate random sampling on DynamoDB wasn't worth it, and we were unable to come up with any comparable solutions. We are, however, sticking with DynamoDB for our tables that do not need random accessibility due to the price and response times.
For anyone wondering why we chose MySQL, it was largely due to the Nodejs library available, great online resources (which DynamoDB definitely lacks), easy integration via RDS with our Lambdas, and the option to migrate to Amazons Aurora database.
We also looked at PostgreSQL, but we weren't as happy with the client library or admin tools, and we believe that MySQL will suit our needs for these tables.
If anybody has anything else they'd like to add or a specific question please leave a comment or send me a message!
This was too long for a comment, and I guess it's pretty much a full fledged answer now.
Approach 2
I've found that my typical time to get a single item from dynamodb to a host in the same region is <10ms. As long as you're okay with at most 1-2 extra calls, you can quite easily implement approach 2.
If you use a keys only GSI where the category is your hash key and the primary key of the table is your range key, you can quickly find the largest numbered single item within a category.
When you add a new item, find the largest number for that category from the GSI and then write the new item to the table with sequence number n+1.
When you delete, find the item with the largest sequence number for that category from the GSI, overwrite the item you are deleting, and then delete the now duplicated item from its position at the highest sequence number.
To randomly get an item, query the GSI to find the highest numbered item in the category, and then randomly pick a number since you now know the valid range.
Approach 1
I'm not sure exactly what you mean when you say "without requiring extra resources to be implemented". If you're okay with using a managed resource (no dev work to implement), you can also make Approach 1 work by putting a DAX cluster in front of your dynamodb table. Then you can query to your heart's content without really worrying about hot partitions. (Though the caching layer means that new/deleted items won't be reflected right away.)

nosql/dynamodb hash and range use case

It's my first time using a NoSQL database so I'm really confused. I'd really appreciate any help I can get.
I want to store data comprising announcements in my table. Essentially, each announcement has an ID, a date, and a text.
So for example, an announcement might have ID of 1, date of 2014/02/26, and text of "This is a sample announcement". Newer announcements always have a greater ID value than older announcements, since they are added to the table later.
There are two types of queries I want to run on this table:
I want to retrieve the text of the announcements sorted in order of date.
I want to retrieve the text and dates of the x most recent announcements (say, the 3 most recent announcements).
So I've set up the table with the following attributes:
ID (number) as primary key, and
date (string) as range
Is this appropriate for what my use cases? And if so, what kind of query/reads/requests/scans/whatever (I'm really confused about the terminology here too) should I be running to accomplish the two types of queries I want to make?
Any help will be very much appreciated. Thanks!
You are on the right track.
As far as sorting, DynamoDB will sort by the range key, so date will work but I'd recommend storing it as a number, perhaps milliseconds since the Unix epoch, rather than a String. This will make it trivial to get the announcements in ascending or descending order based on their created date.
See this answer for an overview of local vs global secondary indexes and what capabilities they provide: Optional secondary indexes in DynamoDB
As far as retrieving all items, you would need to perform a scan. Scans are not as efficient as queries, but since all of Dynamo is on SSD's they're still relatively quick. You don't get the single digit millisecond performance with a scan that you get with a query, so if there's a way to associate announcements with a user ID, you might get better performance than with a scan.
Note that you cannot modify the table schema (hash key, range key, and indexes) after you create the table. There are ways to manually migrate a table or import/export it, but the point is that you should think hard about current and future query requirements up front and design the table to support them. It's very easy to add or stop storing non-key or non-item attributes though, which provides nice flexibility.
Finally, try to avoid thinking of Dynamo as relational. With Dynamo, in a lot of cases you may well be better off de normalizing or duplicating some of the data in exchange for fast query performance.

use fuzzy matching in django queryset filter

Is there a way to use fuzzy matching in a django queryset filter?
I'm looking for something along the lines of:
Object.objects.filter(fuzzymatch(namevariable)__gt=.9)
or is there a way to use lambda functions, or something similar in django queries, and if so, how much would it affect performance time (given that I have a stable set of ~6000 objects in my database that I want to match to)
(realized I should probably put my comments in the question)
I need something stronger than contains, something along the lines of difflib. I'm basically trying to get around doing a Object.objects.all() and then a list comprehension with fuzzy matching.
(although I'm not necessarily sure that doing that would be much slower than trying to filter based on a function, so if you have thoughts on that I'm happy to listen)
also, even though it's not exactly what I want, I'd be open to some kind of tokenized opposite-contains, like:
Object.objects.filter(['Virginia', 'Tech']__in=Object.name)
Where something like "Virginia Technical Institute" would be returned. Although case insensitive, preferably.
When you're using the ORM, the thing to understand is that everything you do converts to SQL commands and it's the performance of the underlying queries on the underlying database that matter. Case in point:
SELECT COUNT (*) ...
Is that fast? Depends on whether your database stores any records to give you that information - MySQL/MyISAM does, MySQL/InnoDB does not. In English - this is one lookup in MYISAM, and n in InnoDB.
Next thing - in order to do exact match lookups efficiently in SQL you have to tell it when you create the table - you can't just expect it to understand. For this purpose SQL has the INDEX statement - in django, use db_index=True in the field options of your model. Bear in mind that this has an added performance hit on writes (to create the index) and obviously extra storage is needed (for the datastructure) so you cannot "INDEX all the things". Also, I don't think it will help for fuzzy matching - but it's worth noting anyway.
Next consideration - how do we do fuzzy matching in SQL? Well apparently LIKE and CONTAINS allow a certain amount of searching and wildcard-results to be executed in SQL. These are T-SQL links - translate for your database server :) You can achieve this via Model.objects.get(fieldname__contains=value) which will produce LIKE SQL, or similar. There are a number of options available there for different lookups.
This may or may not be powerful enough for you - I'm not sure.
Now, for the big question: performance. Chances are if you're doing a contains search that the SQL server will have to hit all of the rows in the database - don't take my word on that, but it would be my bet - even with indexing on. With 6000 rows this might not take all that long; then again, if you're doing this on a per-connection-to-your-app basis it's probably going to create a slowdown.
Next thing to understand about the ORM: if you do this:
Model.objects.get(fieldname__contains=value)
Model.objects.get(fieldname__contains=value)
You will issue two queries to the database server. In other words, the ORM doesn't always cache the results - so you might just want to do an .all() and search in memory. Do read about caching and querysets.
Further on on that last page, you'll also see Q objects - useful for more complicated queries.
So in summary then:
SQL contains some basic fuzzy matching-like parameters.
Whether or not these are sufficient depends on your needs.
How they perform depends on your SQL server - definitely measure it.
Whether you can cache these results in memory depends on how likely scaling is - again might be worth measuring the memory commit as a result - if you can share between instances and if the cache will be frequently invalidated (if it will be, don't do it).
Ultimately, I'd start by getting your fuzzy matching working, then measure, then tweak, then measure until you work out how to improve performance. 99% of this I learnt doing exactly that :)
with postgres as database, you can use TrigramSimilarity to do fuzzy search and rank your results on different weight as well. Here is the link to documentation :
https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/search/#trigram-similarity
For full text search you can refer to https://czep.net/17/full-text-search.html
If you need something stronger than contains lookup, have a look at regex lookups: https://docs.djangoproject.com/en/1.0/ref/models/querysets/#regex

Solr + Haystack searching

I am trying to implement a search engine for a new app.
The app allows people to rate items (+1 or -1) - Giving the items a +ve or -ve score.
When people search for items, I'd like to take into account their rating and to order the results accordingly. If the item is a match, it should show up. But if it's a match with a high score it should be boosted up the results a bit.
A really good match should win over a fairly good match with a high score, so it needs to be weighted along with the rest of it (i.e. I boosted my titles a bit).
Not stuck on Solr by any means, only just started playing today.
With Solr, you can maintain a field with the document which holds the difference.
The difference can be between the total +1ve's and the -1ve's.
Solr allows you to boost on field values using function queries.
So you can query with the boost on the difference field, with documents with better difference scoring over others.
From indexing front, as this difference would change quite often, the respective document needs to be updated everytime.
Solr does not allow the updation of the single field, so you need to handle the incremental updates of the difference field.
If that would be a concern to you, can try using ExternalFileField.
This allows mapping of certain fields of documents such as ranking, popularity external to the index in a separate file.
The file can be updated and index committed to reflect the changes.
The field can also be used with function queries to boost the results as needed, however have lot of limitations.
You can order your results by a field that stores the ranking.
sqs.filter(content='blah').order_by('rating')