How is word association mining a generalization of the n-gram language model? - data-mining

I am working on text mining (reading a book...). The author said word association mining is actually a generalization of the n-gram language model. Can you please explain how word association mining generalizes the n-gram language model?
To me, word association mining means finding syntagmatic relations (co-occurring words), while an n-gram language model compares all n words in a query to suggest or return relevant documents.

Association rule mining tries to cover frequent co-occurrences of arbitrary length.
If you apply this to text (not just two-term correlations), you would indeed find n-grams without a fixed n.
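As a rough illustration (plain Python, with a made-up toy corpus), counting frequent word sequences for every length up to some maximum shows how mining frequent co-occurrences subsumes counting n-grams for a fixed n:

from collections import Counter

# Toy corpus (hypothetical); in practice these would be tokenized documents.
docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the quick brown fox is quick".split(),
]

def frequent_ngrams(docs, max_n=4, min_count=2):
    """Count word sequences of every length up to max_n and keep the frequent ones."""
    counts = Counter()
    for tokens in docs:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

print(frequent_ngrams(docs))
# Frequent patterns of arbitrary length emerge, e.g. ('the', 'quick', 'brown', 'fox')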

Related

When training an entity recognizer on Comprehend using an entity list of abbreviations, would it start identifying other abbreviations not on the list?

Basically:
If I train a custom entity recognizer on Comprehend using an entity list of gene abbreviations, would it potentially start identifying other abbreviations that weren't in the entity list as gene names?
The model has identified the abbreviation UVB as a gene name; however, it's not in any of my dictionaries and I'm not sure why this would have happened.
Correct, Comprehend's entity list mode is an easy way to provide a non-exhaustive list of entities while training a model. The model will learn the context in documents with these entities and subsequently generalize to unseen entities in similar context.

How to find entities in a search query in Elasticsearch?

I'm using Elasticsearch to build search for an e-commerce site.
One index will have products stored in it; in the products index I'll store categories along with its other attributes. There can be multiple categories, but each attribute will have a single field value (e.g. color).
Let's say the user types in Black (color) Nike (brand) shoes (category).
I want to process this query so that I can extract the entities (brand, attribute, etc.) and write a request body search.
I have thought of the following options:
Applying regex on the query first to extract those entities (but with this approach I'm not sure how fuzziness would work; the user may have a typo in any of the entities).
Using the OpenNLP extension (but this only works at indexing time; in the above scenario we want it on the query side).
Using NER from any good NLP framework (this is not time & cost effective because I'll have millions of products in the engine, and they get updated/added on a frequent basis).
What's the best way to solve the above issue?
Edit:
Found a couple of libraries which would allow fuzzy text matching in regex. But there will be many entities to find, so what's the best way to optimise that?
Still not sure about OpenNLP.
NER won't work in this case because there is a fixed number of entities, so the prediction is not right when no entity is present in the query.
If you cannot achieve the desired results by tuning Elasticsearch's built-in scoring/boosting, most likely you'll need some kind of 'natural language query' processing:
Tokenize the free-form query. Regex can be used for splitting lexemes; however, very often it is better to write a custom tokenizer for that.
Perform named-entity recognition to determine possible field(s) for each keyword. At this step you will get associations like (Black -> color), (Black -> product name) etc. In fact you don't need OpenNLP for that as this should be just an index (keyword -> field(s)), and you can try to use ElasticSearch 'suggest' API for this purpose.
(optional) Recognize special phrases or combinations like "released yesterday", "price below $20"
Generate possible combinations of matches and, with the help of a scoring function, determine the 'best' recognition result. The scoring function may be hardcoded (reflecting 'common sense' heuristics) or it may be the result of a machine learning algorithm.
From the recognition result (match metadata), produce a formal query to fetch the search results; this may be an Elasticsearch query with field hints, or even an SQL query. (A rough sketch of steps 1, 2 and 4 follows this list.)
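A minimal sketch of steps 1, 2 and 4 (the keyword_index, field names and scoring heuristic below are hypothetical placeholders, not a drop-in implementation):

from itertools import product

# Hypothetical keyword -> candidate-field index, built offline from the product catalogue
# (or served through Elasticsearch's suggest API).
keyword_index = {
    "black": ["color", "product_name"],
    "nike": ["brand"],
    "shoes": ["category"],
}

def tokenize(query):
    # Step 1: a trivial tokenizer; real queries usually need a custom one.
    return query.lower().split()

def best_interpretation(tokens):
    # Step 2: look up the possible fields for each token.
    options = [[(tok, field) for field in keyword_index.get(tok, ["full_text"])]
               for tok in tokens]

    # Step 4: enumerate combinations and pick the best one with a simple heuristic score.
    def score(combo):
        # Prefer interpretations that hit specific fields over generic full-text matches.
        return sum(1 for _, field in combo if field != "full_text")

    return max(product(*options), key=score)

print(best_interpretation(tokenize("Black Nike shoes")))
# -> (('black', 'color'), ('nike', 'brand'), ('shoes', 'category')), which can then be
#    turned into an Elasticsearch bool query with one match clause per recognized field.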
In general, efficient NLQ processing requires significant development effort; I don't recommend implementing it from scratch unless you have enough resources & time for this feature. As an alternative, you can try to find an existing NLQ solution and integrate it, but most likely this will be a commercial product (I don't know of any good free/open-source NLQ components that are really ready for production use).
I would approach this problem as NER tagging, considering you already have a corpus of tags. My approach for this problem would be as below:
Create an annotated dataset of queries with each word tagged to one of the tags, say {color, brand, category}.
Train an NER model (CRF/LSTM).
"This is not time & cost effective because I'll have millions of products in the engine, and they get updated/added on a frequent basis."
To handle this situation, I suggest you don't use the words in the query themselves as features but rather use the attributes of the words as features. For example, create an indicator function f(x', y) for word x with context x' (i.e. the word along with the surrounding words and their attributes) and tag y, which returns 1 or 0. A sample indicator function is given below:
f('blue', y) = 1 if 'blue' is in the `color attribute` column of the DB, the word before 'blue' is in the `product attribute` column of the DB, and y is `colors`; otherwise 0.
Create lots of these indicator functions, also known as feature maps.
These indicator functions are then used to train a model using CRFs or LSTMs. Finally, we use the Viterbi algorithm to find the best tagging sequence for your query. For CRFs you can use packages like CRFsuite or CRF++. Using these packages, all you have to do is create the indicator functions and the package will train a model for you. Once trained, you can use this model to predict the best tag sequence for your queries. CRFs are very fast.
This way of training, without using vector representations of the words, will let your model generalise without the need for retraining. [Look at NER using CRFs.]
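As a hedged sketch of that pipeline, here is what the indicator-style features and CRF training could look like with the sklearn-crfsuite package; the dictionaries, feature names and tiny training set are hypothetical placeholders for lookups against your product DB:

import sklearn_crfsuite

# Hypothetical dictionaries extracted from the product DB columns.
COLORS = {"black", "blue", "red"}
BRANDS = {"nike", "adidas"}
CATEGORIES = {"shoes", "shirts"}

def word_features(tokens, i):
    """Indicator-style features built from attributes of the word, not the word itself."""
    w = tokens[i].lower()
    return {
        "in_color_column": w in COLORS,
        "in_brand_column": w in BRANDS,
        "in_category_column": w in CATEGORIES,
        "prev_in_color_column": i > 0 and tokens[i - 1].lower() in COLORS,
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }

def featurize(query):
    tokens = query.split()
    return [word_features(tokens, i) for i in range(len(tokens))]

# A tiny, made-up annotated dataset of queries.
X_train = [featurize("Black Nike shoes"), featurize("red adidas shirts")]
y_train = [["color", "brand", "category"], ["color", "brand", "category"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([featurize("blue nike shoes")]))  # best tag sequence via Viterbi decoding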

What is the conceptual difference between topic extraction and text categorization?

I'm confused that very similar services for text mining have different names, like topic extraction and text categorization/classification. What is the conceptual difference between them?
Topic extraction example:
https://www.uclassify.com/browse/uclassify/topics?input=Text
Categorization example:
https://dandelion.eu/semantic-text/text-classification-demo/
Topic model approaches (topic extraction) are unsupervised. So you don't need to know which categories (classes) each document belongs to. [https://en.wikipedia.org/wiki/Topic_model]
Latent Dirichlet allocation (LDA) is a method for topic modeling. LDA represents each document as a mixture of topics, where each topic is a distribution over words; the top words of a topic are typically used to label it. [https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation]
Like clustering methods, topic models need the number of output topics specified up front, but each output cluster comes with a characterizing word distribution rather than a predefined class label.
In contrast to topic model approaches, document classification approaches (categorization) are supervised, so they need class labels. [https://en.wikipedia.org/wiki/Document_classification]
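To make the contrast concrete, here is a minimal scikit-learn sketch with made-up toy documents and labels: the topic model is fit without any labels, while the classifier requires them.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import MultinomialNB

docs = ["the match ended in a draw", "stocks fell after the report",
        "the striker scored twice", "markets rallied on earnings"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

# Topic extraction: unsupervised; only the number of topics is specified.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topic_mixtures = lda.transform(X)  # per-document topic proportions

# Text categorization: supervised; class labels are required for training.
labels = ["sports", "finance", "sports", "finance"]
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["the striker scored again"])))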

Django Designing Models for Dating App Matches

I’m working on a dating app for a hackathon project. We have a series of questions that users fill out, and then every few days we are going to send suggested matches. If anyone has a good tutorial for these kinds of matching algorithms, it would be very appreciated. One idea is to assign a point value for each question and then to do a
def comparison(person_a, person_b) function where you iterate through these questions, and where there’s a common answer, you add in a point. So the higher the score, the better the match. I understand this so far, but I’m struggling to see how to save this data in the database.
In Python, I could take each user, iterate through all the other users with this comparison function, and make a dictionary for each person that lists all the other users and a score for them. Then, to suggest matches, I iterate through that dictionary and, if the two people haven't already been matched up, make a match.
person1_dictionary_of_matches = {'person2': 3, 'person3': 5, 'person4': 10, 'person5': 12, 'person6': 2, ..., 'person200': 10}
person_1_list_of_prior_matches = ['person3', 'person4']
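A minimal, self-contained sketch of that scoring idea (the answer dictionaries below are made up; in practice they would come from the questionnaire responses):

def comparison(answers_a, answers_b):
    """Count the questions where two users gave the same answer."""
    return sum(1 for question, answer in answers_a.items()
               if answers_b.get(question) == answer)

# Hypothetical answer dicts keyed by question id.
person1_answers = {"q1": "hiking", "q2": "dogs", "q3": "tea"}
person2_answers = {"q1": "hiking", "q2": "cats", "q3": "tea"}

print(comparison(person1_answers, person2_answers))  # 2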
I'm struggling with how to represent this in Django. I could have a bunch of users and make a Match model like:
class Match(models.Model):
    # related_name avoids the reverse-accessor clash caused by two FKs to User
    person1 = models.ForeignKey(User, related_name='matches_as_person1', on_delete=models.CASCADE)
    person2 = models.ForeignKey(User, related_name='matches_as_person2', on_delete=models.CASCADE)
    score = models.PositiveIntegerField()
Where I do the iteration and save all the pairwise scores.
and then do
person_matches = Match.objects.filter(person1=sarah).exclude(person2=sarah).exclude(person2__in=list_of_past_matches).order_by('-score')
But I'm worried that with 1000 users, I will have 1,000,000 rows in my table if I do this. Will it be brutal to save all these pairwise scores for each user in the database? Or does this not matter if I run it at, say, 1 am on a Sunday night, or just cache these responses once and use the comparisons for a period of months? Is there a better way to do this than matching everyone up pairwise? Should I use some other data structure to capture the people and their compatibility scores? Thanks so much for any guidance!
Interesting question. In machine learning's current paradigm you work with sparse matrices, which means you would not have to perform every single match evaluation. The sparsity may come from two alternatives:
Create a batch offline analysis of your data to perform some clustering (fancy solution).
Filter the individuals by some key attributes: a) gender/sexual preference, b) geographical location, c) dating status etc. (simple solution)
After the filtering you could run a function to estimate appropriate matches for the new user. Based on the user's selected choices, write the selected matches into the database for future queries. However, if you get serious about this problem, I suggest you give Spark a try: at that scale it becomes less a problem for an SQL database and more one for a big data engine.
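A hedged Django sketch of the simple solution: filter candidates by a few key attributes first, then score only that subset with the question-by-question comparison from the question. The Profile model and its fields here are hypothetical; adjust them to whatever your questionnaire actually stores.

from django.conf import settings
from django.db import models

class Profile(models.Model):
    # Hypothetical profile fields used only for pre-filtering.
    user = models.OneToOneField(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    gender = models.CharField(max_length=20)
    seeking = models.CharField(max_length=20)
    city = models.CharField(max_length=100)
    past_matches = models.ManyToManyField("self", symmetrical=False, blank=True)

def candidate_profiles(profile):
    """Filter by key attributes first so only a small subset is scored pairwise."""
    return (Profile.objects
            .filter(gender=profile.seeking, city=profile.city)
            .exclude(pk=profile.pk)
            .exclude(pk__in=profile.past_matches.values_list("pk", flat=True)))

def suggest_matches(profile, score_fn, limit=5):
    """score_fn is the pairwise comparison function described in the question."""
    candidates = candidate_profiles(profile)
    ranked = sorted(candidates, key=lambda other: score_fn(profile, other), reverse=True)
    return ranked[:limit]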

SOLR query exclusions

I'm having an issue with querying an index where a common search term also happens to be part of a company name interspersed throughout most of the documents. How do I exclude the business name from results without affecting the ranking on a search that includes part of the business name?
example: Bobs Automotive Supply is the business name.
How can I include relevant results when someone searches automotive or supply without returning every document in the index?
I tried "-'Bobs Automotive Supply' +'search term'", but this seems to exclude any document with Bobs Automotive Supply and isn't very effective when searching for 'supply' or 'automotive'.
Thanks in advance.
Second answer here, based on additional clarification from first answer.
A few options.
Add the business name terms as stop words (e.g. via Solr's StopFilterFactory). This will stop Solr from indexing them at all. Searches that use them will only really search for the words that aren't in the business name.
Rely on the inherent scoring that Solr will apply due to Term frequency. It sounds like these terms will be in the index frequently. Queries for them will still return the documents, but if the user queries for other, less common terms, those will get a higher score.
Apply a low query boost (not quite negative, but less than other documents) to documents that contain the business name. This is covered in the Solr Relevancy FAQ http://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F
Do you know that the article is tied to the business name, or can you derive this? If so, you could create another field and then just exclude documents that match on the business name using a filter query. Something like
q=search_term&fq=business_name:(NOT search_term)
It may be helpful to use subqueries for this, or to just boost down rather than filter out results.
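A hedged example of issuing that kind of filter query from Python with the requests library; the core name and field names are hypothetical and would need to match your schema:

import requests

# Hypothetical Solr core and fields.
SOLR_SELECT = "http://localhost:8983/solr/products/select"

params = {
    "q": "body:supply",
    # Keep every document except those whose business_name field matches the term.
    "fq": "*:* -business_name:supply",
    "wt": "json",
}
results = requests.get(SOLR_SELECT, params=params).json()
print(results["response"]["numFound"])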
EDIT: An update to the question makes this irrelevant. Leaving it here for posterity. :)
This is why Solr Documents have different fields.
In this case, it sounds like there is a "Footer" field that is separate from your "Body" field in your documents. When searches are performed, they would only be done against the Body, which won't include data from the Footer. You could even have a third field, "OriginalContent", which contains the original copy for display purposes. You wouldn't search that, just store it for later.
The important part is to create the two separate fields in your schema and make sure that you index the fields that you want to be able to search.
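A hedged sketch of what that looks like from Python with the pysolr client; the core name, field names and sample document are hypothetical:

import pysolr

# Hypothetical core with separate body, footer and original-content fields.
solr = pysolr.Solr("http://localhost:8983/solr/articles")

solr.add([{
    "id": "doc-1",
    "body": "A beginner's guide to automotive paint repair",
    "footer": "Bobs Automotive Supply - all rights reserved",
    "original_content": "A beginner's guide to automotive paint repair ... Bobs Automotive Supply",
}])
solr.commit()  # make the new document visible to searches

# Search only the body field, so footer text never affects matching or ranking.
for doc in solr.search("body:automotive"):
    print(doc["id"])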