I have a Django and Django REST Framework powered RESTful API (talking to a PostgreSQL DB backend) which supports filtering on a specific model.
Now I want to add full-text search functionality.
Is it possible to use Elasticsearch for full-text search and then apply my existing API filters on top of these search results?
I would suggest you consider using PostgreSQL only to do what you asked for.
In my opinion it is the best solution because you will have the data and the search indexes directly inside PostgreSQL and you will not be forced to install and maintain additional software (such as Elasticsearch) and keep the data and indexes in sync.
This is the simplest code example you can have to perform a full-text search in Django with PostgreSQL:
Entry.objects.filter(body_text__search='Cheese')
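To apply the existing API filters on top of the search, the two `filter()` calls can simply be chained, since both run in the same PostgreSQL query. Below is a minimal sketch, assuming `django.contrib.postgres` is in `INSTALLED_APPS` (the `__search` lookup needs it) and that the model has a `body_text` field as in the example above. The helper name `search_and_filter` and the `api_filters` dict are hypothetical, not part of Django or DRF.

```python
def search_and_filter(queryset, search_term, api_filters):
    """Apply a full-text search, then the existing API filters, to a queryset.

    `queryset` is any Django queryset whose model has a `body_text` field;
    `api_filters` is a dict of ordinary ORM lookups (hypothetical name).
    """
    if search_term:
        # PostgreSQL full-text search via the `search` lookup.
        queryset = queryset.filter(body_text__search=search_term)
    # The usual API filters narrow the search results further.
    return queryset.filter(**api_filters)

# Example call (the `author_id` filter is only an illustration):
# search_and_filter(Entry.objects.all(), "Cheese", {"author_id": 1})
```

In a DRF view this could be called from `get_queryset()`, with the search term read from a query parameter.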
For all the basic documentation on using the full-text search in Django with PostgreSQL you can use the official documentation: "Full text search"
If you want to go deeper, you can read an article that I wrote on the subject:
"Full-Text Search in Django with PostgreSQL"
Your question is too broad to be answered with code, but it's definitely possible.
You can easily search Elasticsearch for rows matching your full-text criteria.
Then get those rows' PK fields (or any other candidate key that uniquely identifies rows in your PostgreSQL DB), and filter your Django ORM-backed models for PKs matching those you found in Elasticsearch.
Pseudocode would be:
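from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster; adjust to your setup

def get_chunk(l, length):
    # Yield successive slices of at most `length` items from the list.
    for i in range(0, len(l), length):
        yield l[i:i + length]

res = es.search(index="index", body={"query": {"match": ...}})
# Matching documents are under res['hits']['hits']; this assumes each indexed
# document stores the PostgreSQL primary key in a 'pk' field.
pks = [hit['_source']['pk'] for hit in res['hits']['hits']]

results = []
for chunk_10k in get_chunk(pks, 10000):
    results.extend(DjangoModel.objects.filter(pk__in=chunk_10k, **the_rest_of_your_api_filters))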
EDIT
To handle the case where your Elasticsearch query finds a very large number of PKs, you can define a generator that yields the results in successive chunks of 10K rows, so that you don't exceed your DB query limit and each query performs well. I've defined it above as a function called get_chunk.
A similar approach would work with alternatives such as Redis, MongoDB, etc.
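One thing to watch with this approach: a `pk__in` filter gives no guarantee that rows come back in the order Elasticsearch ranked them. Below is a minimal, pure-Python sketch of chunking the PKs and then restoring the search order afterwards. The names `chunked` and `in_search_order` are hypothetical, and the rows are plain dicts standing in for model instances.

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def in_search_order(ranked_pks, rows, key=lambda row: row["pk"]):
    """Sort DB rows by the position of their pk in the search ranking."""
    position = {pk: index for index, pk in enumerate(ranked_pks)}
    return sorted(rows, key=lambda row: position[key(row)])

# PKs as ranked by the search engine (best match first).
ranked_pks = [42, 7, 19, 3]
# Rows as the DB might return them after the API filters dropped pk 19.
db_rows = [{"pk": 3}, {"pk": 7}, {"pk": 42}]

ordered = in_search_order(ranked_pks, db_rows)
print([row["pk"] for row in ordered])  # [42, 7, 3]
```

With real model instances, `key=lambda obj: obj.pk` would be passed instead of the dict-based default.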
Related
I am building a service that will hold millions of rows of data, and we want good search on it, e.g. searching by certain field values. Each row will be structured as follows:
{
  "field1" : "value1",
  "field2" : "value2",
  "field3" : {
    "field4": "value4",
    "field5": "value5"
  }
}
Also, the structure of field3 can vary: field4 is sometimes present and sometimes not.
We want to filter on the following fields: field1, field2 and field4. We can create indexes in DynamoDB to do that, but I am not sure whether we can easily create an index on field4 in DynamoDB without flattening the JSON.
Now, my question is: should we use Elasticsearch as the datastore for this? As far as I know, it will create indexes on every field in the document, so one can then search on every field. Is that right? Or should we use DynamoDB, or some other data store entirely?
Please provide some suggestions.
If search is a key requirement for your application, then use a search product, not a database. DynamoDB is great for a lot of things, but ad hoc search is not one of them: you are going to end up running lots of very expensive (slow) scans if you go with DynamoDB. This is what ES was built for.
I have decent working experience with DynamoDB and extensive working experience with Elasticsearch (ES).
Let's first understand the key difference between these two:
DynamoDB:
"Amazon DynamoDB is a key-value and document database"
while Elasticsearch:
"Elasticsearch is a distributed, open source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured data."
Now, coming to the question, let's discuss how these systems work internally and how that affects performance.
DynamoDB is great for fetching documents by key but not for filtering and searching. Just as you create indexes on columns in a relational database to speed up these operations, you have to create indexes in DynamoDB, because it is a database, not a search engine. Creating indexes on fields on the fly is a pain, and it's not cached in DynamoDB.
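Since DynamoDB index keys have to be top-level scalar attributes, indexing field4 would mean flattening the document before writing it, as the question suspects. Below is a minimal sketch of such a flattening step. The `flatten` helper and the `_` separator are my own assumptions, not anything DynamoDB prescribes.

```python
def flatten(doc, parent_key="", sep="_"):
    """Flatten nested dicts into a single level, joining keys with `sep`."""
    flat = {}
    for key, value in doc.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects such as field3.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

row = {
    "field1": "value1",
    "field2": "value2",
    "field3": {"field4": "value4", "field5": "value5"},
}
print(flatten(row))
# {'field1': 'value1', 'field2': 'value2', 'field3_field4': 'value4', 'field3_field5': 'value5'}
```

Rows where field4 is absent simply produce no `field3_field4` attribute, so they would be left out of an index keyed on it.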
Elasticsearch stores data differently: it builds an inverted index for every indexed field (which is the default, as the OP mentioned), and filtering on these fields is very fast if you use the filter context, which is exactly the use case here. See https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html#filter-context in the official ES docs for more information and an example. Also, because these filters are not used for score calculation and are cached by Elasticsearch, their performance (both read and write) is very fast compared to DynamoDB, and you can benchmark that as well.
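As a rough illustration of the filter context for the OP's fields, here is a sketch that builds the request body as a plain dict. It assumes the default dynamic mapping, where string fields get a `.keyword` sub-field suitable for exact matching. The `build_filter_query` helper is hypothetical.

```python
def build_filter_query(filters):
    """Build a bool query that matches documents on exact field values.

    `filters` maps field paths (dots for nested objects) to values.
    Clauses under "filter" do not contribute to scoring.
    """
    clauses = [
        {"term": {f"{path}.keyword": value}}
        for path, value in filters.items()
    ]
    return {"query": {"bool": {"filter": clauses}}}

body = build_filter_query({
    "field1": "value1",
    "field2": "value2",
    "field3.field4": "value4",  # nested field, no flattening needed
})
print(body["query"]["bool"]["filter"][2])
# {'term': {'field3.field4.keyword': 'value4'}}
```

Documents without field4 are simply not matched by the last clause, so the varying structure of field3 needs no special handling.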
What is the correct approach to adding aggregations to Django ElasticSearch DSL DRF? The codebase contains some empty filter backends
https://github.com/barseghyanartur/django-elasticsearch-dsl-drf/tree/a2be5842e36a102ad66988f5c74dec984c31c89b/src/django_elasticsearch_dsl_drf/filter_backends/aggregations
Should I create a custom backend or is there a way to add them directly to my viewset?
In particular, I want to calculate the sum of an IntegerField across all results in a particular facet.
Elasticsearch has more than one type of aggregation. Simple aggregations in django-elasticsearch-dsl-drf are implemented in the FacetedSearchFilterBackend. Read the docs or simply run the example project to experiment.
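For the "sum of an IntegerField per facet" case, the underlying Elasticsearch request is a terms aggregation with a nested sum aggregation. The sketch below shows only the raw request body and how the response could be read. It does not use the django-elasticsearch-dsl-drf API, and the field names `category` and `quantity` and both helper names are hypothetical.

```python
def facet_with_sum(facet_field, sum_field, size=10):
    """Request body: bucket documents by facet_field and sum sum_field per bucket."""
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            "by_facet": {
                "terms": {"field": facet_field, "size": size},
                "aggs": {"total": {"sum": {"field": sum_field}}},
            }
        },
    }

def totals_from_response(response):
    """Map each facet value to its summed total from a search response."""
    buckets = response["aggregations"]["by_facet"]["buckets"]
    return {bucket["key"]: bucket["total"]["value"] for bucket in buckets}

body = facet_with_sum("category", "quantity")
print(body["aggs"]["by_facet"]["aggs"])
```

This assumes the facet field is mapped as a keyword (or otherwise aggregatable) field.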
I am using django-oscar == 1.6.1 for a project.
I am trying to add recommended products to an individual product in the dashboard. Currently the recommended product field is empty. How do I populate it and set the ranking?
It's a streaming (search-as-you-type) field: whatever you type is searched for in your existing product database.
For example, if you type <search_term>, it ultimately queries (after several intermediate queries for substrings) http://localhost:8000/dashboard/catalogue/product-lookup/?q=<search_term>, the view for which can be found here. As you can see, it searches product titles only; if you need something else, you can always modify it.
By the looks of it, you haven't populated your products database yet, or there's something else wrong with your installation or setup.
I've been using Cloudant Query to find documents and found it works really well.
The documentation says "Cloudant Query indexes can also be built using MapReduce Views". I cannot see where the index definition syntax allows specifying a view.
Is there an example of an index or query that uses a view?
The documentation does indeed say that "Cloudant Query indexes can also be built using MapReduce Views" but it is referring to the technology which underpins the Cloudant Query service.
Cloudant Query indexes can take two forms:
type: "text" is an index built on Apache Lucene which is suitable for fielded searches and full-text queries
type: "json" is an index built on MapReduce where materialized views are created to be able to answer the queries you supply
The sentence you refer to in the documentation is intended to convey that Cloudant Query indexes can be specified as type: "json", which results in MapReduce views being created behind the scenes.
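To make the two forms concrete, here is a sketch of the request bodies that would be POSTed to the database's `_index` endpoint, built as Python dicts. The field name `lastname` and the index names are hypothetical.

```python
import json

# type "json": backed by a MapReduce view that Cloudant creates for you.
json_index = {
    "index": {"fields": ["lastname"]},
    "name": "lastname-json-index",
    "type": "json",
}

# type "text": backed by a Lucene index, suited to fielded and full-text search.
text_index = {
    "index": {"fields": [{"name": "lastname", "type": "string"}]},
    "name": "lastname-text-index",
    "type": "text",
}

# Either index can serve a selector like this one, sent to the _find endpoint.
query = {"selector": {"lastname": "Smith"}}

for body in (json_index, text_index, query):
    print(json.dumps(body))
```

In neither case do you name a view yourself; the view behind a "json" index is an implementation detail.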
I am trying to use Django's database routers to shard my database, but I have not been able to find a way to do it.
I'd like to define two databases, create the same table in both and then save the even rows in one db, the odd ones in the other one. The examples in the documentation show how to write to a master db and to read from the readonly slaves, which is not what I want, because I don't want to store the whole dataset in both dbs.
Do you know of any webpage explaining what I am trying to do?
Thank you
PS: I am using PostgreSQL and I know there are tools to achieve the same goal at the DB level. My goal is to study whether it can also be done in Django and to explore whether there are any advantages to doing so.
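For what it's worth, a router is just a plain class with a few methods, so an even/odd split could be sketched as below. This is only a starting point under several assumptions: the aliases `shard_even` and `shard_odd` are hypothetical entries in `DATABASES`, primary keys are integers assigned by the application before saving (an auto-generated PK is still `None` when the router is consulted), and the router is applied to every model.

```python
SHARDS = ("shard_even", "shard_odd")  # hypothetical aliases from settings.DATABASES

def shard_for(pk):
    """Pick the database alias from the parity of an integer primary key."""
    return SHARDS[pk % 2]

class ParityShardRouter:
    """Route by pk parity when an instance with a known pk is available."""

    def _route(self, hints):
        instance = hints.get("instance")
        pk = getattr(instance, "pk", None)
        # Returning None means "no opinion"; Django then tries the next
        # router or falls back to the default database.
        return shard_for(pk) if pk is not None else None

    def db_for_read(self, model, **hints):
        return self._route(hints)

    def db_for_write(self, model, **hints):
        return self._route(hints)

    def allow_relation(self, obj1, obj2, **hints):
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # No opinion, so the same tables are created on both shards.
        return None
```

Plain queries carry no instance hint, so reads would have to pick the shard explicitly, for example `MyModel.objects.using(shard_for(pk)).get(pk=pk)` (with `MyModel` standing in for your model), and anything that is not a lookup by PK would have to query both shards and merge the results.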