How to speed up Gensim Word2vec model by filtering out some words?

Suppose I have filtered into a list the words that I want to use in my next word2vec model load. How can I construct my own KeyedVectors that contains only this filtered word list?
I tried:
w2v_model_keyed = w2v_model.wv
w2v_model_keyed.drop(word)
for a given word, but I get the following error:
AttributeError: 'KeyedVectors' object has no attribute 'drop'
Thank you

The gensim KeyedVectors class doesn't support incremental expansion or modification (such as with a .drop() method). You'll need to construct a new instance of just the right size/contents.
You should look at the gensim KeyedVectors source code, and especially the .load_word2vec_format() method, to learn how existing instances are created in gensim, and mimic that to create one with just the size/words you need.

Starting from version 4.1.2, the Gensim KeyedVectors object supports a .vectors_for_all() method that takes a list of words and creates a new KeyedVectors object with the corresponding vectors for those words only.
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format("model.bin", binary=True)
words_to_keep = ["salmon", "tuna", "cod"]
smaller_model = model.vectors_for_all(words_to_keep)
smaller_model.save_word2vec_format("smaller_model.bin", binary=True)

Related

Django ORM: Count individual entries in related field with condition

I have a model Game and a model Line. Line has a foreign key to Game and a DateTimeField called created which records when the line was created.
I would like to annotate a queryset of Game to count all Lines in each game that were created after a certain date.
Something like:
games = Game.objects.all().annotate(
    recent_lines=Count('lines', filter=Q(lines__created__gt=date_to_check)))
This does not work, as it counts ALL the lines as valid...
How can I achieve what I am trying to achieve?
I suspect the issue is that you're using Django < 2.0 which doesn't support the filter argument to Count (it basically just ignores it). The filter argument was added in Django 2.0.
For older versions of Django, you have to use the Case and When conditional expressions which are a bit more verbose unfortunately, but should do the job.
from django.db.models import Case, Count, IntegerField, When
games = Game.objects.all().annotate(
    recent_lines=Count(Case(
        When(lines__created__gt=date_to_check, then=1),
        output_field=IntegerField())))
If you want to count the lines for a single game, you can do it like this:
game = Game.objects.get(id=1)
Line.objects.filter(game=game, created__gte=datetime.date(2018, 2, 12)).count()

How to use a tsvector field to perform ranking in Django with postgresql full-text search?

I need to perform a ranking query using postgresql full-text search feature and Django with django.contrib.postgres module.
According to the doc, it is quite easy to do this using the SearchRank class by doing the following:
>>> from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector
>>> vector = SearchVector('body_text')
>>> query = SearchQuery('cheese')
>>> Entry.objects.annotate(rank=SearchRank(vector, query)).order_by('-rank')
This probably works well, but it is not exactly what I want, since I have a field in my table which already contains tsvectorized data that I would like to use (instead of recomputing the tsvector at each search query).
Unfortunately, I can't figure out how to provide this tsvector field to the SearchRank class instead of a SearchVector object on a raw data field.
Is anyone able to indicate how to deal with this?
Edit:
Of course, simply trying to instantiate a SearchVector from the tsvector field does not work, and fails with this error (approximately, since I translated it from French):
django.db.utils.ProgrammingError: ERROR: function to_tsvector(tsvector) does not exist
If your model has a SearchVectorField like so:
from django.contrib.postgres.search import SearchVectorField

class Entry(models.Model):
    ...
    search_vector = SearchVectorField()
you would use the F expression:
from django.db.models import F
...
Entry.objects.annotate(
    rank=SearchRank(F('search_vector'), query)
).order_by('-rank')
I've been seeing mixed answers here on SO and in the official documentation. F Expressions aren't used in the documentation for this. However it may just be that the documentation doesn't actually provide an example for using SearchRank with a SearchVectorField.
Looking at the output of .explain(analyze=True):
Without the F Expression:
Sort Key: (ts_rank(to_tsvector(COALESCE((search_vector)::text, ''::text))
When the F Expression is used:
Sort Key: (ts_rank(search_vector, ...)
In my experience, the only difference between using an F expression and the field name in quotes is that the F expression returns much faster but is sometimes less accurate depending on how you structure the query; it can be useful to enforce it with a COALESCE in some cases. In my case it's about a 3-5x speed boost to use the F expression with my SearchVectorField.
Ensuring your SearchQuery has a config kwarg also improves things dramatically.

Django Array contains a field

I am using Django with mongoengine. I have a model Classes with an inscriptions list, and I want to get the docs that have an id in that list.
classes = Classes.objects.filter(inscriptions__contains=request.data['inscription'])
Here's a general explanation of querying ArrayField membership:
Per the Django ArrayField docs, the __contains operator checks if a provided array is a subset of the values in the ArrayField.
So, to filter on whether an ArrayField contains the value "foo", you pass in a length 1 array containing the value you're looking for, like this:
# matches rows where myarrayfield is something like ['foo','bar']
Customer.objects.filter(myarrayfield__contains=['foo'])
The Django ORM produces the @> ("contains") postgres operator, as you can see by printing the query:
print(Customer.objects.filter(myarrayfield__contains=['foo']).only('pk').query)
SELECT "website_customer"."id" FROM "website_customer" WHERE "website_customer"."myarrayfield" @> ['foo']::varchar(100)[]
If you provide something other than an array, you'll get a cryptic error like DataError: malformed array literal: "foo" DETAIL: Array value must start with "{" or dimension information.
Perhaps I'm missing something...but it seems that you should be using .filter():
classes = Classes.objects.filter(inscriptions__contains=request.data['inscription'])
This answer is in reference to your comment on rnevius' answer.
In the Django ORM, whenever you make a database call, it will generally return either a QuerySet or something else, such as a model instance if you are using get(), or a number if you are using count(), etc., depending on which function you use.
The result of a QuerySet method can be refined further, for example by applying order_by() or keeping only distinct() rows. QuerySets are lazy, which means they only hit the database when they are actually evaluated, not when they are assigned. You can find more information about them here.
The functions that don't return a QuerySet, on the other hand, cannot be chained in this way.
Take time to go through the QuerySet documentation; it provides a more in-depth explanation with examples. Understanding this behavior is useful for making your application more efficient.

Get object from list of objects without extra database calls - Django

I have an import of objects where I want to check against the database whether each one has already been imported earlier; if it has, I will update it, and if not, I will create a new one. What is the best way of doing this?
Right now I have this:
old_books = Book.objects.filter(foreign_source="import")
for book in new_books:
    try:
        old_book = old_books.get(id=book.id)
        # update book
    except Book.DoesNotExist:
        # create book
        pass
But that creates a database call for each book in new_books. So I am looking for a way where it will only make one call to the database, and then just fetch objects from that queryset.
PS: I'm not looking for a get_or_create kind of thing, as the update and create functions are more complex than that :)
--- EDIT ---
I guess I haven't been good enough in my explanation, as the answers do not reflect what the problem is. So to make it more clear (I hope):
I want to pick out a single object from a queryset, based on the id of that object. I want the full object so I can update it and save it with its changed values. So let's say I have a queryset with 3 objects, A, B and C. Then I want a way to ask if the queryset has object B, and if it has, to get it without an extra database call.
Assuming new_books is another queryset of Book, you can filter on its ids:
old_books = Book.objects.filter(foreign_source="import").filter(id__in=[b.id for b in new_books])
With this, old_books contains the books that have already been created.
You can use values_list('id', flat=True) to get all the ids in a single DB call (it is much faster than working with full querysets). Then you can use sets to find the intersections.
new_book_ids = new_books.values_list('id', flat=True)
old_book_ids = Book.objects.filter(foreign_source="import") \
.values_list('id', flat=True)
to_update_ids = set(new_book_ids) & set(old_book_ids)
to_create_ids = set(new_book_ids) - to_update_ids
-- EDIT (to include the updated part) --
I guess the problem you are facing is in bulk updating rather than bulk fetch.
If the updates are simple, then something like this might work:
old_book_ids = set(Book.objects.filter(foreign_source="import")
                   .values_list('id', flat=True))
to_update = []
to_create = []
for book in new_books:
    if book.id in old_book_ids:
        # ids of books to update
        to_update.append(book.id)
    else:
        # book objects to create, e.g. Book(**details)
        to_create.append(book)
# Update and create in two bulk queries
Book.objects.filter(id__in=to_update).update(field='new_value')
Book.objects.bulk_create(to_create)
But if the updates are complex (update fields depend upon related fields), then you can look at the INSERT ... ON DUPLICATE KEY UPDATE option in MySQL and a custom manager for Django.
Please leave a comment if the above is completely off the track.
You'll have to do more than one query. You need two groups of objects; you can't fetch them both and split them up at the same time arbitrarily like that. There's no bulk_get_or_create method.
However, the example code you've given will do a query for every object, which really isn't very efficient (or Djangoic, for that matter). Instead, use the __in clause to create smart subqueries, and then you can limit database hits to only two queries:
old_to_update = Book.objects.filter(foreign_source="import", pk__in=new_books)
old_to_create = Book.objects.filter(foreign_source="import").exclude(pk__in=new_books)
Django is smart enough to know how to use that new_books queryset in that context (it can also be a regular list of ids).
update
Queryset objects are just a sort of list of objects. So all you need to do now is loop over the objects:
for book in old_to_update:
    ...  # update book

for book in old_to_create:
    ...  # create book
At this point it's fetching the books from the QuerySet, not from the database, which is a lot more efficient than using .get() for each and every one of them, and you get the same result: on each iteration you get to work with an object, the same as if you got it from a direct .get() call.
The best solution I have found is using the Python next() function.
First evaluate the queryset into a set, and then pick the book you need with next():
old_books = set(Book.objects.filter(foreign_source="import"))
old_book = next((book for book in old_books if book.id == new_book.id), None)
That way the database is not queried every time you need to get a specific book from the queryset. And then you can just do:
if old_book:
    # update book
    old_book.save()
else:
    # create new book
    ...
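The lookup pattern itself is plain Python, so it can be illustrated without a database; FakeBook below is a hypothetical stand-in for a model instance:

```python
class FakeBook:
    """Hypothetical stand-in for a Book model instance."""
    def __init__(self, id):
        self.id = id

# Pretend this set came from: set(Book.objects.filter(...))
old_books = {FakeBook(1), FakeBook(2), FakeBook(3)}

# Pick out the object with id == 2, or None if it is absent.
old_book = next((book for book in old_books if book.id == 2), None)
missing = next((book for book in old_books if book.id == 99), None)
```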
In Django 1.7 there is an update_or_create() method that might solve this problem in a better way: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.update_or_create

GeoDjango distance query for a ForeignKey Relationship

I have the following models (simplified):
from django.contrib.gis.db import models as geomodels
from django.db import models

class modelB(geomodels.Model):
    objects = geomodels.GeoManager()

class modelA(geomodels.Model):
    point = geomodels.PointField(unique=True)
    mb = models.ForeignKey(modelB, related_name='modela')
    objects = geomodels.GeoManager()
I am trying to find all modelB objects and sort them by distance from a given location (where distance is defined as distance between a given location and the point object of associated modelA). When I try to run the query
modelB.objects.distance(loc, field_name='modela__point')
I get an error saying
TypeError: ST_Distance output only available on GeometryFields.
Note that loc is a Point object. However, when I run the query
modelB.objects.filter(modela__point__distance_lte = (loc, 1000))
this query works without error and as expected.
Any idea what the mistake could be? I am using Django 1.2.4, PostGIS 1.5.2, and Postgres 8.4.
Thanks.
For the first one, you will need to change it to:
modelB.objects.all().distance(loc, field_name='modela__point')
if you want to see all modelB objects.
The ".distance" is used to calculate and add a distance field to each resulting row of the QuerySet (or GeoQuerySet).