How to find entity in search query in Elasticsearch? - regex

I'm using Elasticsearch to build search for an ecommerce site.
One index will have products stored in it; in the products index I'll store categories along with the other attributes. A product can have multiple categories, but each attribute will have a single field value (e.g. color).
Let's say a user types in Black (color) Nike (brand) shoes (category).
I want to process this query so that I can extract the entities (brand, attribute, etc.) and write a request-body search.
I have thought of the following options:
Applying a regex to the query first to extract those entities (but with this approach I'm not sure how fuzziness would work; the user may have a typo in any of the entities).
Using the OpenNLP extension (but that one only works at index time; in the above scenario we want it on the query side).
Using the NER of any good NLP framework (this is not time- and cost-effective because I'll have millions of products in the engine, and they get updated/added on a frequent basis).
What's the best way to solve the above issue?
Edit:
Found a couple of libraries that allow fuzzy text matching in regex (see the sketch below). But there will be many entities to find, so what's the best way to optimise that?
Still not sure about OpenNLP.
NER won't work in this case because there is a fixed set of entities, so its predictions go wrong when no entity is present in the query.
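For reference, a minimal sketch of the kind of fuzzy regex matching I mean, assuming the Python regex module (a third-party replacement for re that supports fuzzy matching) and a made-up brand list:

```python
import regex  # third-party: pip install regex

# Hypothetical entity lexicon; in practice this would come from the index.
brands = ["nike", "adidas", "puma"]

query = "black niki shoes"

for brand in brands:
    # {e<=1} allows up to one edit (insert/delete/substitute),
    # so a single typo in the brand still matches.
    m = regex.search(rf"(?:{brand}){{e<=1}}", query, regex.IGNORECASE)
    if m:
        print(brand, "->", m.group())  # nike -> niki
```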

If you cannot achieve the desired results by tuning the built-in Elasticsearch scoring/boosting, most likely you'll need some kind of 'natural language query' (NLQ) processing:
Tokenize the free-form query. A regex can be used to split lexemes, but very often it is better to write a custom tokenizer for that.
Perform named-entity recognition to determine the possible field(s) for each keyword. At this step you will get associations like (Black -> color), (Black -> product name), etc. In fact you don't need OpenNLP for that, as this can be just an index (keyword -> field(s)); you can also try the Elasticsearch 'suggest' API for this purpose (see the sketch after this list).
(Optional) Recognize special phrases or combinations like "released yesterday" or "price below $20".
Generate the possible combinations of matches, and with the help of a special scoring function determine the 'best' recognition result. The scoring function may be hardcoded (reflecting 'common sense' heuristics) or it may be the result of a machine learning algorithm.
From the recognition result (match metadata), produce a formal query that fetches the search results - this may be an Elasticsearch query with field hints, or even a SQL query.
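A minimal sketch of the keyword -> field(s) index from the recognition step, using only the Python standard library; the lexicon and field names are made up, and difflib stands in for a real fuzzy matcher or the 'suggest' API:

```python
import difflib

# Hypothetical keyword -> field(s) index built from the product data.
entity_index = {
    "black": ["color"],
    "white": ["color"],
    "nike": ["brand"],
    "shoes": ["category"],
}

def recognize(query, cutoff=0.8):
    """Map each token to candidate fields, tolerating small typos."""
    matches = []
    for token in query.lower().split():
        # Closest known keyword, so 'blak' still resolves to 'black'.
        close = difflib.get_close_matches(token, list(entity_index), n=1, cutoff=cutoff)
        if close:
            matches.append((token, close[0], entity_index[close[0]]))
    return matches

print(recognize("blak nike shoes"))
# [('blak', 'black', ['color']), ('nike', 'nike', ['brand']), ('shoes', 'shoes', ['category'])]
```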
In general, efficient NLQ processing needs significant development effort - I don't recommend implementing it from scratch unless you have enough resources and time for this feature. As an alternative, you can try to find an existing NLQ solution and integrate it, but most likely it will be a commercial product (I don't know of any good free/open-source NLQ components that are really ready for production use).

I would approach this problem as NER tagging, considering you already have a corpus of tags. My approach for this problem would be as below:
Create an annotated dataset of queries with each word tagged to one of the tags, say {color, brand, category}.
Train an NER model (CRF/LSTMs).
> This is not time- and cost-effective because I'll have millions of products in the engine, and they get updated/added on a frequent basis
To handle this situation I suggest you don't use the words in the query as features, but rather use the attributes of the words as features. For example, create an indicator function f(x', y) for word x with context x' (i.e. the word along with the surrounding words and their attributes) and tag y, which returns 1 or 0. A sample indicator function would be as below:
f('blue', y) = 1 if 'blue' is in the `color attribute` column of the DB, the word previous to 'blue' is in the `product attribute` column of the DB, and y is `colors`; else 0.
Create a lot of these indicator functions, also known as feature maps.
These indicator functions are then used to train a model using CRFs or LSTMs. Finally, we use the Viterbi algorithm to find the best tagging sequence for your query. For CRFs you can use packages like CRFsuite or CRF++. Using these packages, all you have to do is create the indicator functions and the package will train a model for you. Once trained, you can use the model to predict the best tag sequence for your queries. CRFs are very fast.
This way of training, without using vector representations of words, will generalise your model without the need for retraining. [Look at NER using CRFs]
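A minimal sketch of this feature-map idea, assuming the sklearn-crfsuite package (a Python wrapper around CRFsuite); the lexicons, feature names, and tiny training set are made up for illustration:

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

# Hypothetical lexicons pulled from the DB's attribute columns.
COLORS = {"black", "blue"}
BRANDS = {"nike", "adidas"}
CATEGORIES = {"shoes", "shirt"}

def word_features(tokens, i):
    """Indicator-style features: column membership, not the word itself."""
    w = tokens[i].lower()
    return {
        "in_color_column": w in COLORS,
        "in_brand_column": w in BRANDS,
        "in_category_column": w in CATEGORIES,
        "prev_in_color_column": i > 0 and tokens[i - 1].lower() in COLORS,
        "is_first": i == 0,
    }

def featurize(sentence):
    tokens = sentence.split()
    return [word_features(tokens, i) for i in range(len(tokens))]

# A real dataset would have many annotated queries; two are shown for shape.
X = [featurize("black nike shoes"), featurize("blue adidas shirt")]
y = [["color", "brand", "category"], ["color", "brand", "category"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict([featurize("blue nike shoes")]))  # [['color', 'brand', 'category']]
```

Because the features test column membership rather than word identity, newly added products only require updating the lexicons, not retraining the model.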

Related

How to order django query set filtered using '__icontains' such that the exactly matched result comes first

I am writing a simple app in Django that searches for records in a database.
A user inputs a name in the search field, and that query is used to filter records using a particular field, like:
Result = Users.objects.filter(name__icontains=query_from_searchbox)
E.g.:
The database consists of the names Shiv, Shivam, Shivendra, Kashiva, Varun... etc.
A search query 'shiv' returns records in the following order:
Kashiva, Shivam, Shiv and Shivendra
(ordered by primary key).
My question is: how can I achieve the order
Shiv, Shivam, Shivendra and Kashiva,
i.e. the most relevant result first, then the less relevant ones?
It's not possible to do that with standard Django, as that type of thing is outside its scope and specific to a search app.
When you're interacting with the ORM, consider what you're actually doing with the database - it's all just SQL queries.
If you wanted to rearrange the results you'd have to manipulate the queryset: check for exact matches first, then use prefix or regular-expression checks for partial matches (a sketch of this follows).
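A minimal sketch of that kind of queryset manipulation, assuming the Users model from the question and Django 1.8+ conditional expressions; rows are ranked exact match first, then prefix match, then everything else:

```python
from django.db.models import Case, IntegerField, Value, When

query = "shiv"
results = (
    Users.objects.filter(name__icontains=query)
    .annotate(
        rank=Case(
            When(name__iexact=query, then=Value(0)),       # Shiv
            When(name__istartswith=query, then=Value(1)),  # Shivam, Shivendra
            default=Value(2),                              # Kashiva
            output_field=IntegerField(),
        )
    )
    .order_by("rank", "name")
)
```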
Search isn't really the kind of thing that is best suited to the ORM, however, so you may wish to consider looking at dedicated search applications. They usually maintain an index, which avoids database hits, and may also offer the kind of percentage-match ordering you're looking for.
A good place to start may be with Haystack.

I have large file contents that I want to make searchable on AWS CloudSearch but the maximum document size is 1MB - how do I deal with this?

I could split the file contents up into separate search documents, but then I would have to manually identify this in the results and only show one result to the user - otherwise it will look like there are two files that match their search when in fact there is only one.
Also the relevancy score would be incorrect. Any ideas?
So the response from AWS support was to split the files up into separate documents. In response to my concerns regarding relevancy scoring and multiple hits, they said the following:
You do raise two very valid concerns here for your more challenging use case. With regard to relevance, you already face a very significant problem in that it is harder to establish a strong 'signal' and degrees of differentiation with large bodies of text. If the documents you have are much like reports or whitepapers, a potential workaround may be to index the first X number of characters (or the first identified paragraph) into a "thesis" field. This field could be weighted to better indicate the document's subject matter without manual review.
With regard to result duplication, this will require post-processing on your end if you wish to filter it. You can create a new field holding a unique "parent" id that is shared by every chunk of the whole document. The post-processing can check whether this parent id has already been returned (the first result should be seen as the most relevant) and, if it has, filter out the subsequent results. What is doubly useful in such a scenario is that you can include a refinement link in your results that filters on all matches within that particular parent id.
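A minimal sketch of that post-processing step, assuming each hit carries the parent id described above (the hit structure is illustrative, not the exact CloudSearch response format):

```python
def dedupe_by_parent(hits):
    """Keep only the first (most relevant) chunk per parent document."""
    seen = set()
    deduped = []
    for hit in hits:  # hits are assumed to be sorted by relevance already
        parent = hit["fields"]["parent_id"]
        if parent not in seen:
            seen.add(parent)
            deduped.append(hit)
    return deduped

# Example: three chunks from two documents collapse to two results.
hits = [
    {"id": "doc1-chunk0", "fields": {"parent_id": "doc1"}},
    {"id": "doc2-chunk3", "fields": {"parent_id": "doc2"}},
    {"id": "doc1-chunk2", "fields": {"parent_id": "doc1"}},
]
print([h["id"] for h in dedupe_by_parent(hits)])  # ['doc1-chunk0', 'doc2-chunk3']
```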

What is the effect of using a filtered classifier over a normal classifier in Weka

I have used Weka for text classification. First I used the StringToWordVector filter, and the filtered data were used with an SVM classifier (LibSVM) for cross validation. Later I read a blog post here.
It said that it is not suitable to apply the filter first and then perform cross validation, and proposes using FilteredClassifier instead. The justification is:
Two weeks ago, I wrote a post on how to chain filters and classifiers in WEKA, in order to avoid misleading results when performing experiments with text collections. The issue was that, when using N Fold Cross Validation (CV) in your data, you should not apply the StringToWordVector (STWV) filter on the full data collection and then perform the CV evaluation on your data, because you would be using words that are present in your test subset (but not in your training subset) for each run.
I cannot understand the reason behind this. Does anyone know it?
When you apply the filter before N-fold cross validation, you filter every word that appears in every instance, regardless of whether it is a test instance or a train instance. At that point the filter has no way of knowing whether an instance is a test instance or a train instance. So if you are using StringToWordVector with TFTransform or any similar operation, words in the test instances may affect the transformed values (simply put, if you are implementing a bag of words, you would be taking the test instances into consideration too). This is not acceptable, since the training parameters should not be affected by the testing data. Instead, you can do the filtering on the fly during each run - that is FilteredClassifier (see the sketch below).
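The same pitfall, sketched in scikit-learn terms rather than Weka (the vectorizer plays the role of StringToWordVector and the pipeline the role of FilteredClassifier); this is an analogy, not Weka code:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["good product", "bad service", "great shoes", "terrible fit"]
labels = [1, 0, 1, 0]

# Leaky: the vocabulary is built from ALL documents, including future test folds.
X_all = CountVectorizer().fit_transform(docs)
leaky_scores = cross_val_score(LinearSVC(), X_all, labels, cv=2)

# Correct: the vectorizer is refit inside each fold on the training split only,
# which is what FilteredClassifier does with StringToWordVector in Weka.
pipeline = make_pipeline(CountVectorizer(), LinearSVC())
honest_scores = cross_val_score(pipeline, docs, labels, cv=2)
```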
To get an idea of how N-fold cross validation works, please refer to Rushdi Shams's answer to the following question. Please let me know whether you understood it or not. Cheers..!!
Cross Validation in Weka

SOLR query exclusions

I'm having an issue with querying an index where a common search term also happens to be part of a company name that appears throughout most of the documents. How do I exclude the business name from results without affecting the ranking of a search that includes part of the business name?
Example: Bobs Automotive Supply is the business name.
How can I return relevant results when someone searches for 'automotive' or 'supply' without returning every document in the index?
I tried "-'Bobs Automotive Supply' +'search term'" but this seems to exclude any document containing Bobs Automotive Supply, and isn't very effective when searching for 'supply' or 'automotive'.
Thanks in advance.
Second answer here, based on additional clarification from the first answer.
A few options.
Add the business name's words as stopwords via the stop word filter. This will stop Solr from indexing them at all. Searches that use them will only really search for the words that aren't in the business name.
Rely on the inherent scoring that Solr will apply due to term frequency. It sounds like these terms will appear frequently in the index. Queries for them will still return the documents, but if the user queries for other, less common terms, those will get a higher score.
Apply a low query boost (not quite negative, but less than for other documents) to documents that contain the business name. This is covered in the Solr Relevancy FAQ: http://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F
Do you know that an article is tied to the business name, or do you derive this? If so, you could create another field and then just exclude entities that match the business name using a filter query. Something like:
q=search_term&fq=business_name:(NOT search_term)
It may be helpful to use subqueries for this or to just boost down rather than filter out results.
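A minimal sketch of issuing that kind of filtered query from Python with the requests library; the core name, field names, and URL are made up for illustration:

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/products/select"  # hypothetical core

params = {
    "q": "supply",                  # the user's search term
    "fq": "-business_name:supply",  # filter out matches on the company-name field
    "fl": "id,title,score",
    "wt": "json",
}
docs = requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]
```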
EDIT: An update to the question makes this irrelevant. Leaving it here for posterity. :)
This is why Solr documents have different fields.
In this case, it sounds like there is a "Footer" field that is separate from the "Body" field in your documents. When searches are performed, they would only be done against the Body, which won't include data from the Footer. You could even have a third field, "OriginalContent", which contains the original copy for display purposes. You wouldn't search that, just store it for later.
The important part is to create the two separate fields in your schema and make sure that you index the fields that you want to be able to search.

Solr + Haystack searching

I am trying to implement a search engine for a new app.
The app allows people to rate items (+1 or -1), giving the items a positive or negative score.
When people search for items, I'd like to take their rating into account and order the results accordingly. If an item is a match, it should show up. But if it's a match with a high score, it should be boosted up the results a bit.
A really good match should win over a fairly good match with a high score, so the rating needs to be weighted along with the rest of it (i.e. I boosted my titles a bit).
I'm not stuck on Solr by any means; I only just started playing with it today.
With Solr, you can maintain a field on the document which holds the difference.
The difference can be between the total +1s and the -1s.
Solr allows you to boost on field values using function queries.
So you can query with a boost on the difference field, so that documents with a larger difference score higher than others (see the sketch below).
On the indexing front, as this difference will change quite often, the respective document needs to be updated every time.
Solr does not allow updating a single field, so you need to handle the incremental updates of the difference field yourself.
If that is a concern for you, you can try using ExternalFileField.
It allows keeping certain per-document values, such as ranking or popularity, external to the index in a separate file.
The file can be updated and the index committed to reflect the changes.
The field can also be used with function queries to boost the results as needed, though it has a lot of limitations.
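A minimal sketch of such a function-query boost using the edismax parser, issued from Python with the requests library; the core name and the rating_diff field are assumptions for illustration:

```python
import requests

params = {
    "q": "running shoes",
    "defType": "edismax",
    "qf": "title^2 body",  # weight titles a bit, as in the question
    # Additive boost function: a larger (+1s minus -1s) difference bumps
    # the score, while textual relevance still dominates.
    "bf": "rating_diff",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/items/select", params=params)
docs = resp.json()["response"]["docs"]
```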
You can order your results by a field that stores the ranking.
sqs.filter(content='blah').order_by('rating')