Sphinx Search powered app - wrong design?

Sphinx Search powered app - wrong design? - django

I'm building web app and using django and Sphinx for free text search. I need to apply additional restrictions before making request to searchd, consider 2 tables:
Entity
id
title
description
created_by_id
updated_by_id
created_date
updated_date
and
EntityUser
id
entity_id [FK to the table above]
joining_user_id
is_approved
created_by_id
updated_by_id
created_date
updated_date
I've built RT index for main table Entity, all works fine, but then I want to make a query only on those entities to which user has joined, i.e. where for specific user_id & entity_id exists record in EntityUser with is_approved=1. Problem is that I can't index EntityUser, because there are no string fields - this table only holds integers/timestamps as you see. Not sure if I could make a query in SphinxQL containing subquery to another idex even if I could build index for that table. Knowing that Sphinx was used for quite big projects with great success, I doubt it's a limitation of Sphinx - is it bad design of DB/application or leak of knowledge how to build proper RT index? Can I somehow extend existing index so that I can use restriction above?
I was thinking that I could apply the additional restrictions after Sphinx returns IDs of records on MySQL side, but that's not going to work: N records with highest weight would be returned, but after applying additional restrictions the result could be empty. So I need to get an area of search and then perform query only on those entities user can possibly see.

Adapting the example from http://sphinxsearch.com/docs/current.html#attributes, you might be able to use something like this in your conf:
...
sql_query = SELECT app_entity.id as id,
app_entity.title as title,
app_entity.description as description,
app_entityuser.id as userid
FROM app_entity, app_entityuser
WHERE app_entity.id = app_entityuser.entity_id AND app_entityuser.is_approved = 1
sql_attr_uint = id
sql_attr_uint = userid
...
I should provide a disclaimer: I have not tried this.
I did find a related SO post, but it doesn't look like they quite solved it: Django-sphinx result filtering using attributes?
Good luck!

Actually I've found the answer and it has nothing to do with the design of application or DB.
In fact that's simple - I just need to use MVA for RT index as I would do for plain one (rt_attr_multi or rt_attr_multi_64). In configuration file I will have to do something like this:
...
rt_attr_multi = entity_users
}
and then populate it with IDs of users which have joined the Entity and have been approved. Problem was that I couldn't understand how to use MVA with RT index, but not it's clear. There are not enough real-word examples with RT indexes and MVA I think, so I've shared this to help to solve similar problems.
UPDATE: was fighting last hour to generate RT index and always was getting "unknown column: 'entity_users'". Finally found the reason - if you add MVA to RT index (don't know if that's the same for plain), you've got to not only restart searchd daemon (service), but also DELETE everything you have in "data" folder (or where you have stored your index)!

Related

Converting SQL to something that Django can use

I am working on converting some relatively complex SQL into something that Django can play with. I am trying not to just use the raw SQL, since I think playing with the standard Django toolkit will help me learn more about Django.
I have already managed to break up parts of the sql into chunks, and am tackling them piecemeal to make things a little easier.
Here is the SQL in question:
SELECT i.year, i.brand, i.desc, i.colour, i.size, i.mpn, i.url,
COALESCE(DATE_FORMAT(i_eta.eta, '%M %Y'),'Unknown')
as eta
FROM i
JOIN i_eta ON i_eta.mpn = i.mpn
WHERE category LIKE 'kids'
ORDER BY i.brand, i.desc, i.colour, FIELD(size, 'xxl','xl','l','ml','m','s','xs','xxs') DESC, size+0, size
Here is what I have (trying to convert line by line):
(grabbed automatically when performing filters)
(have to figure out django doc on coalesce for syntax)
db alias haven't been able to find yet - it is crucial since there is a db view that requires it
already included in the original q
.select_related?
.filter(category="kids")
.objects.order_by('brand','desc','colour') - don't know how to deal with SQL FIELDS
Any advice would be appreciated!

Here's how I would structure this.
First, I'm assuming your models for i and i_eta look something like this:
class I(models.Model):
mpn = models.CharField(max_length=30, primary_key=True)
year = models.CharField(max_length=30)
brand = models.CharField(max_length=30)
desc = models.CharField(max_length=100)
colour = models.CharField(max_length=30)
size = models.CharField(max_length=3)
class IEta(models.Model):
i = models.ForeignKey(I, on_delete=models.CASCADE)
eta = models.DateField()
General thoughts:
To write the coalesce in Django: I would not replace nulls with "Unknown" in the ORM. This is a presentation-layer concern: it should be dealt with in a template.
For date formatting, you can do date formatting in Python.
Not sure what a DB alias is.
For using multiple tables together, you can use either select_related(), prefetch_related(), or do nothing.
select_related() will perform a join.
prefect_related() will get the foreign key ID's from the first queryset, then generate a query like SELECT * FROM table WHERE id in (12, 13, 14).
Doing nothing will work automatically, but has the disadvantage of the SELECT N+1 problem.
I generally prefer prefetch_related().
For customizing the sort order of the size field, you have three options. My preference would be option 1, but any of the three will work.
Denormalize the sort criteria. Add a new field called size_numeric. Override the save() method to populate this field when saving new instances, giving xxl the value 1, xl the value 2, etc.
Sort in Python. Essentially, you use Python's built-in sorting methods to do the sort, rather than sorting it in the database. This doesn't work well if you have thousands of results.
Invoke the MySQL function. Essentially, using annotate(), you add the output of a function to the queryset. order_by() can sort by that function.

Solving a slow query with a Foreignkey that has isnull=False and order_by in a Django ListView

I have a Django ListView that allows to paginate through 'active' People.
The (simplified) models:
class Person(models.Model):
name = models.CharField()
# ...
active_schedule = models.ForeignKey('Schedule', related_name='+', null=True, on_delete=models.SET_NULL)
class Schedule(models.Model):
field = models.PositiveIntegerField(default=0)
# ...
person = models.ForeignKey(Person, related_name='schedules', on_delete=models.CASCADE)
The Person table contains almost 700.000 rows and the Schedule table contains just over 2.000.000 rows (on average every Person has 2-3 Schedule records, although many have none and a lot have more). For an 'active' Person, the active_schedule ForeignKey is set, of which there are about 5.000 at any time.
The ListView is supposed to show all active Person's, sorted by field on Schedule (and some other conditions, that don't seem to matter for this case).
The query then becomes:
Person.objects
.filter(active_schedule__isnull=False)
.select_related('active_schedule')
.order_by('active_schedule__field')
Specifically the order_by on the related field makes this query terribly slow (that is: it takes about a second, which is too slow for a web app).
I was hoping the filter condition would select the 5000 records, which then become relatively easily sortable. But when I run explain on this query, it shows that the (Postgres) database is messing with many more rows:
Gather Merge (cost=224316.51..290280.48 rows=565366 width=227)
Workers Planned: 2
-> Sort (cost=223316.49..224023.19 rows=282683 width=227)
Sort Key: exampledb_schedule.field
-> Parallel Hash Join (cost=89795.12..135883.20 rows=282683 width=227)
Hash Cond: (exampledb_person.active_schedule_id = exampledb_schedule.id)
-> Parallel Seq Scan on exampledb_person (cost=0.00..21263.03 rows=282683 width=161)
Filter: (active_schedule_id IS NOT NULL)
-> Parallel Hash (cost=67411.27..67411.27 rows=924228 width=66)
-> Parallel Seq Scan on exampledb_schedule (cost=0.00..67411.27 rows=924228 width=66)
I recently changed the models to be this way. In a previous version I had a model with just the ~5.000 active Person's in it. Doing the order_by on this small table was considerably faster! I am hoping to achieve the same speed with the current models.
I tried retrieving just the fields needed for the Listview (using values) which does help a little, but not much. I also tried setting the related_name on active_schedule and approaching the problem from Schedule, but that makes no difference. I tried putting a db_index on the Schedule.field, but that seems only to make things slower. Conditional queries also did not help (although I probably have not tried all possibilities). I'm at a loss.
The SQL statement generated by the ORM query:
SELECT
"exampledb_person"."id",
"exampledb_person"."name",
...
"exampledb_person"."active_schedule_id",
"exampledb_person"."created",
"exampledb_person"."updated",
"exampledb_schedule"."id",
"exampledb_schedule"."person_id",
"exampledb_schedule"."field",
...
"exampledb_schedule"."created",
"exampledb_schedule"."updated"
FROM
"exampledb_person"
INNER JOIN
"exampledb_schedule"
ON ("exampledb_person"."active_schedule_id" = "exampledb_schedule"."id")
WHERE
"exampledb_person"."active_schedule_id" IS NOT NULL
ORDER BY
"exampledb_schedule"."field" ASC
(Some fields were left out, for simplicity.)
Is it possible to speed up this query, or should I revert back to using a special Model for the active Person's?
EDIT: When I change the query, just for comparison/testing, to sort on an UNindexed field on Person, the query is equally show. However, if I then add an index to that field, the query is fast! I had to try this, as the SQL statement indeed shows that it's ordering on "exampledb_schedule"."field" - a field without index, but like I said: adding an index on the field makes no difference.
EDIT: I suppose it's also worth noting that when trying a much simpler sort query directly on Schedule, either on an indexed field or not, it's MUCH faster. For instance, for this test I've added an index to Schedule.field, then the following query is blazing fast:
Schedule.objects.order_by('field')
Somewhere in here lies the solution...

The comments by #guarav and my edits pointed me in the direction of the solution, which was staring in my face for a while...
The filter clause in my questions - filter(active_schedule__isnull=False) - seems to invalidate the database indexes. I wasn't aware of this, and had hoped a database expert would point me in this direction.
The solution is to filter on Schedule.field, which is 0 for inactive Person records and >0 for active ones:
Person.objects
.select_related('active_schedule')
.filter(active_schedule__field__gte=1)
.order_by('active_schedule__field')
This query properly uses the indexes and is fast (20ms opposed to ~1000ms).

django subquery with a join in it

I've got django 1.8.5 and Python 3.4.3, and trying to create a subquery that constrains my main data set - but the subquery itself (I think) needs a join in it. Or maybe there is a better way to do it.
Here's a trimmed down set of models:
class Lot(models.Model):
lot_id = models.CharField(max_length=200, unique=True)
class Lot_Country(models.Model):
lot = models.ForeignKey(Lot)
country = CountryField()
class Discrete(models.Model):
discrete_id = models.CharField(max_length=200, unique=True)
master_id = models.ForeignKey(Inventory_Master)
location = models.ForeignKey(Location)
lot = models.ForeignKey(Lot)
I am filtering on various attributes of Discrete (which is discrete supply) and I want to go "up" through Lot, over the Lot_Country, meaning "I only want to get rows from Discrete if the Lot associated with that row has an entry in Lot_Country for my appropriate country (let's say US.)
I've tried something like this:
oklots=list(Lot_Country.objects.filter(country='US'))
But, first of all that gives me the str back, which I don't really want (and changed it to be lot_id, but that's a hack.)
What's the best way to constrain Discrete through Lot and over to Lot_Country? In SQL I would just join in the subquery (or even in the main query - maybe that's what I need? I guess I don't know how to join up to a parent then down into that parent's other child...)
Thanks in advance for your help.

I'm not sure what you mean by "it gives me the str back"... Lot_Country.objects.filter(country='US') will return a queryset. Of course if you print it in your console, you will see a string.
I also think your models need refactoring. The way you have currently defined it, you can associate multiple Lot_Countrys with one Lot, and a country can only be associated with one lot.
If I understand your general model correctly that isn't what you want - you want to associate multiple Lots with one Lot_Country. To do that you need to reverse your foreign key relationship (i.e., put it inside the Lot).
Then, for fetching all the Discrete lots that are in a given country, you would do:
discretes_in_us = Discrete.objects.filter(lot__lot_country__country='US')
Which will give you a queryset of all Discretes whose Lot is in the US.

Searching a many to many database using Google Cloud Datastore

I am quite new to google app engine. I know google datastore is not sql, but I am trying to get many to many relationship behaviour in it. As you can see below, I have Gif entities and Tag entities. I want my application to search Gif entities by related tag. Here is what I have done;
class Gif(ndb.Model):
author = ndb.UserProperty()
link = ndb.StringProperty(indexed=False)
class Tag(ndb.Model):
name = ndb.StringProperty()
class TagGifPair(ndb.Model):
tag_id = ndb.IntegerProperty()
gif_id = ndb.IntegerProperty()
#classmethod
def search_gif_by_tag(cls, tag_name)
query = cls.query(name=tag_name)
# I am stuck here ...
Is this a correct start to do this? If so, how can I finish it. If not, how to do it?

You can use repeated properties https://developers.google.com/appengine/docs/python/ndb/properties#repeated the sample in the link uses tags with entity as sample but for your exact use case will be like:
class Gif(ndb.Model):
author = ndb.UserProperty()
link = ndb.StringProperty(indexed=False)
# you store array of tag keys here you can also just make this
# StringProperty(repeated=True)
tag = ndb.KeyProperty(repeated=True)
#classmethod
def get_by_tag(cls, tag_name):
# a query to a repeated property works the same as if it was a single value
return cls.query(cls.tag == ndb.Key(Tag, tag_name)).fetch()
# we will put the tag_name as its key.id()
# you only really need this if you wanna keep records of your tags
# you can simply keep the tags as string too
class Tag(ndb.Model):
gif_count = ndb.IntegerProperty(indexed=False)

Maybe you want to use list? I would do something like this if you only need to search gif by tags. I'm using db since I'm not familiar with ndb.
class Gif(db.Model):
author = db.UserProperty()
link = db.StringProperty(indexed=False)
tags = db.StringListProperty(indexed=True)
Query like this
Gif.all().filter('tags =', tag).fetch(1000)

There's different ways of doing many-to-many relationships. Using ListProperties is one way. The limitation to keep in mind if using ListProperties is that there's a limit to the number of indexes per entity, and a limit to the total entity size. This means that there's a limit to the number of entities in the list (depending on whether you hit the index count or entity size first). See the bottom of this page: https://developers.google.com/appengine/docs/python/datastore/overview
If you believe the number of references will work within this limit, this is a good way to go. Considering that you're not going to have thousands of admins for a Page, this is probably the right way.
The other way is to have an intermediate entity that has reference properties to both sides of your many-to-many. This method will let you scale much higher, but because of all the extra entity writes and reads, this is much more expensive.

Select DISTINCT individual columns in django?

I'm curious if there's any way to do a query in Django that's not a "SELECT * FROM..." underneath. I'm trying to do a "SELECT DISTINCT columnName FROM ..." instead.
Specifically I have a model that looks like:
class ProductOrder(models.Model):
Product = models.CharField(max_length=20, promary_key=True)
Category = models.CharField(max_length=30)
Rank = models.IntegerField()
where the Rank is a rank within a Category. I'd like to be able to iterate over all the Categories doing some operation on each rank within that category.
I'd like to first get a list of all the categories in the system and then query for all products in that category and repeat until every category is processed.
I'd rather avoid raw SQL, but if I have to go there, that'd be fine. Though I've never coded raw SQL in Django/Python before.

One way to get the list of distinct column names from the database is to use distinct() in conjunction with values().
In your case you can do the following to get the names of distinct categories:
q = ProductOrder.objects.values('Category').distinct()
print q.query # See for yourself.
# The query would look something like
# SELECT DISTINCT "app_productorder"."category" FROM "app_productorder"
There are a couple of things to remember here. First, this will return a ValuesQuerySet which behaves differently from a QuerySet. When you access say, the first element of q (above) you'll get a dictionary, NOT an instance of ProductOrder.
Second, it would be a good idea to read the warning note in the docs about using distinct(). The above example will work but all combinations of distinct() and values() may not.
PS: it is a good idea to use lower case names for fields in a model. In your case this would mean rewriting your model as shown below:
class ProductOrder(models.Model):
product = models.CharField(max_length=20, primary_key=True)
category = models.CharField(max_length=30)
rank = models.IntegerField()

It's quite simple actually if you're using PostgreSQL, just use distinct(columns) (documentation).
Productorder.objects.all().distinct('category')
Note that this feature has been included in Django since 1.4

User order by with that field, and then do distinct.
ProductOrder.objects.order_by('category').values_list('category', flat=True).distinct()

The other answers are fine, but this is a little cleaner, in that it only gives the values like you would get from a DISTINCT query, without any cruft from Django.
>>> set(ProductOrder.objects.values_list('category', flat=True))
{u'category1', u'category2', u'category3', u'category4'}
or
>>> list(set(ProductOrder.objects.values_list('category', flat=True)))
[u'category1', u'category2', u'category3', u'category4']
And, it works without PostgreSQL.
This is less efficient than using a .distinct(), presuming that DISTINCT in your database is faster than a python set, but it's great for noodling around the shell.
Update:
This is answer is great for making queries in the Django shell during development. DO NOT use this solution in production unless you are absolutely certain that you will always have a trivially small number of results before set is applied. Otherwise, it's a terrible idea from a performance standpoint.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js