What is the conceptual difference between topic extraction and text categorization?

I'm confused because very similar text-mining services go by different names, such as topic extraction and text categorization/classification. What is the conceptual difference between them?
Topic extraction example:
https://www.uclassify.com/browse/uclassify/topics?input=Text
Categorization example:
https://dandelion.eu/semantic-text/text-classification-demo/

Topic model approaches (topic extraction) are unsupervised, so you don't need to know in advance which categories (classes) each document belongs to [https://en.wikipedia.org/wiki/Topic_model].
Latent Dirichlet allocation (LDA) is one method for topic modeling. LDA represents each document as a mixture of topics, where each topic is a distribution over words; human-readable names are usually assigned to the topics afterwards by inspecting their most probable words. [https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation]
Like clustering methods, topic models need the number of output clusters (topics) to be specified up front, but unlike plain clustering, each cluster comes with the word distribution that characterizes it.
In contrast to topic model approaches, document classification approaches (categorization) are supervised, so they need class labels for training. [https://en.wikipedia.org/wiki/Document_classification]
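To make the contrast concrete, here is a minimal sketch with scikit-learn (the toy corpus and labels are invented for illustration): the topic-extraction side needs only a topic count, while the categorization side cannot be trained without labels.

# Minimal sketch contrasting unsupervised topic extraction with
# supervised text categorization; toy data invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = [
    "the striker scored a late goal in the match",
    "the team won the league after a tense game",
    "the central bank raised interest rates again",
    "markets fell after the inflation report",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)

# Topic extraction (unsupervised): no labels, we only choose the topic count.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # per-document topic mixtures; the topics are unnamed

# Text categorization (supervised): class labels must be supplied for training.
labels = ["sports", "sports", "finance", "finance"]
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))  # predicts one of the predefined class labels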

Related

How do you give negative feedback to Amazon Personalize? I.e. tell Personalize that a user doesn't really like a certain item?

I watched the Amazon Personalize Deep Dive series on YouTube. At timestamp 8:33 in the video, it is mentioned that 'Personalize does not understand negative feedback.' and that any interaction you submit is assumed to be a positive one.
But I think that giving negative feedback could improve the recommendations we give on the whole. If Personalize knew that a user does not like a given item 'A', it could ensure that it does not recommend items similar to 'A' in the future.
Is there any way in which we can give negative feedback (e.g. the user doesn't like items x, y, z) to Amazon Personalize?
A possible way to give negative feedback that I thought of:
Let's say users can rate movies out of 5. Every time a user gives a rating >= 3, we add the interaction to the dataset twice (i.e. interactions.csv contains two identical rows saying the user rated the movie >= 3 instead of just one). However, if the user gives a rating <= 2 (meaning they probably don't like the movie), we keep only the single interaction in interactions.csv.
Would this in any way help convey to Personalize that ratings <= 2 are not as important, i.e. that the user did not like those items?
Negative feedback, where the user explicitly indicates that they dislike an item, is currently not supported as training input for Amazon Personalize. Furthermore, there is currently no way to add weight/reward to specific interactions by event type or event value (see this answer for details).
With that said, you can use impressions with your interactions to indicate items that were seen by the user but that they chose not to interact with. Impressions are only supported by the User-Personalization recipe. From the docs:
Unlike other recipes, which solely use positive interactions (clicking, watching, or purchasing), the User-Personalization recipe can also use impressions data. Impressions are lists of items that were visible to a user when they interacted with (clicked, watched, purchased, and so on) a particular item.
Using this information, a solution created with the User-Personalization recipe can calculate the suitability of new items based on how frequently an item has been ignored, and change recommendations accordingly. For more information see Impressions data.
Impressions are not the same as an explicitly negative interaction but they do imply that the impressed items were considered less relevant/important to the user than the item that they chose to interact with.
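For illustration, explicit impressions in the bulk interactions CSV are given as a pipe-separated list of the items that were visible to the user alongside the item they interacted with (the IDs below are invented):

USER_ID,ITEM_ID,TIMESTAMP,IMPRESSION
u1,movie_123,1616000000,movie_123|movie_456|movie_789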
Another approach that can be used to consider negative interactions when making recommendations is to have two event types: one event type for positive intent (e.g., "like", "watch", or "purchase") and one event type for dislike (e.g., "dislike"). Then create a Personalize solution that only trains on the positive event type. Finally, at inference time, use a Personalize filter to exclude items that the user has recently disliked.
EXCLUDE ItemID WHERE Interactions.EVENT_TYPE IN ("dislike")
In this case, Personalize is still training only on positive interactions but with the filter you won't recommend items that the user disliked.
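A rough sketch of that flow with boto3 (the tracking ID, ARNs, user and item IDs below are placeholders you would substitute for your own resources):

import time
import boto3

# Record a "dislike" event; the solution trains only on the positive
# event type, so this event only feeds the exclusion filter.
events = boto3.client("personalize-events")
events.put_events(
    trackingId="YOUR_TRACKING_ID",  # placeholder
    userId="user-1",
    sessionId="session-1",
    eventList=[{"eventType": "dislike", "itemId": "item-42", "sentAt": time.time()}],
)

# One-time setup: a filter that excludes items the user has disliked.
personalize = boto3.client("personalize")
personalize.create_filter(
    name="exclude-dislikes",
    datasetGroupArn="YOUR_DATASET_GROUP_ARN",  # placeholder
    filterExpression='EXCLUDE ItemID WHERE Interactions.EVENT_TYPE IN ("dislike")',
)

# At inference time, pass the filter ARN with the recommendation request.
runtime = boto3.client("personalize-runtime")
runtime.get_recommendations(
    campaignArn="YOUR_CAMPAIGN_ARN",  # placeholder
    userId="user-1",
    filterArn="YOUR_FILTER_ARN",  # placeholder
)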

Can I use pre-labeled data in AWS SageMaker Ground Truth NER?

Let's say I have some text data that has already been labeled in SageMaker. This data could have been labeled either by humans or by an NER model. Now let's say I want a human to go back over the dataset, either to label a new entity class or to correct existing labels. How would I set up a labeling job to allow this? I tried using an output manifest from another labeling job, but none of the documents that were already labeled can be accessed by workers to re-label.
Yes, this is possible. What you are looking for is custom labeling workflows. You can also apply either Majority Voting (MV) or MDS to evaluate the accuracy of the job.
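As a rough illustration of the idea, you would read the previous job's output manifest and carry the existing entity spans into your custom template. The field names below follow the general shape of Ground Truth NER output, but the exact schema depends on your job, so check your own manifest ("my-ner-job" is a placeholder label attribute name):

import json

# Read the output manifest of the previous NER job (one JSON object per line)
# and pull out the existing entity spans so a custom template can
# pre-populate them for the re-labeling job.
with open("output.manifest") as f:
    for line in f:
        record = json.loads(line)
        text = record["source"]
        entities = (
            record.get("my-ner-job", {}).get("annotations", {}).get("entities", [])
        )
        for e in entities:
            print(text[e["startOffset"]:e["endOffset"]], "->", e["label"])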

Model Post and Topic through DynamoDB

Here's the relation I'm trying to model in DynamoDB:
My service contains posts and topics. A post may belong to multiple topics. A topic may have multiple posts. All posts have an interest value which would be adjusted based on a combination of likes and time since posted, interest measures the popularity of a post at the current moment. If a post gets too old, its interest value will be 0 and stay that way forever (archival).
The REST API endpoints work like this:
GET /posts/{id} returns a post object containing the title, text, author name, a link to the author's REST endpoint (doesn't matter for this example), and the number of likes (the interest value is not included)
GET /topics/{name} should return an object with both a list of the N newest posts of the topic and a list of the N currently most interesting posts
POST /posts/ creates a new post where multiple topics can be specified
POST /topics/ creates a new topic
POST /likes/ creates a like for a specified post (does not actually create an object, just adds the user to the given post-object's list of likers, which is invisible to the users)
The problem now becomes: how do I create a relationship between topics and posts in DynamoDB's NoSQL model?
I thought about adding a list of copies of posts to tag entries in DynamoDB, where every tag has a list of both the newest and the most interesting posts.
One way I could do this is by creating a CloudWatch job that would run every 10 minutes and loop through every topic object, finding both the most interesting and the newest entries and then replacing the topic's old lists.
Another job would also have to regularly update the "interest" value of every non-archived post (keep in mind both likes and time affect the interest value).
One problem with this is that a lot of posts in the tag lists would be out of date for up to 10 minutes whenever the user edits or deletes a post. Likes would also not be properly tracked in the tags' post lists. This could perhaps be solved with transactions, although DynamoDB is limited to 10 objects per transaction.
Another problem is that the add-posts-to-tags job would have to load all the non-archived posts into memory in order to manually sort them by both time and interest, split them up by tag, and then add the first N of both sets to the tag lists every 10 minutes.
I also had another idea: by limiting the number of tags allowed per post to 1, I could use the tag as the partition key with the post time as the sort key, and use a GSI to add interest as a second sort key.
This does have several downsides though:
very popular tags may be limited to a single partition since all their posts share a single partition key
Tag limit is 1
A CloudWatch job to adjust the interest value of posts may still be required
It would require use of a GSI which may lead to dangerous race conditions
But it would have the advantage that there are no replications of the post objects aside from the GSI. It would also allow basically infinite paging of all posts by date instead of being limited to just the N newest posts.
So what is a good approach here? It seems both of my solutions have horrible dealbreakers. Is this just one of those problems that NoSQL simply can't solve?
You are trying to model relational data using a non-relational DB. To do this I would use two types of databases.
I would store the post information in DynamoDB; in your example that covers:
GET /posts/{id}
POST /posts/
POST /likes/
For the topic-related information I would use Elasticsearch (Amazon Elasticsearch Service):
GET /topics/{name}: the search index would store the full topic info as well as the post IDs, plus the fields you want to search on (in your case the post date, to get the most recent posts).
What this entails is a background process (in DynamoDB this can be done via streams) that picks up changes to the DynamoDB table (new posts, updates to like counts, etc.) and populates the search index.
Note: this could also be solved with a graph DB, but for scaling purposes it is better to separate the source of the data (posts) from the data relations (topics).
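A minimal sketch of such a stream consumer (the index name, field names, and Elasticsearch endpoint are assumptions; shown with the elasticsearch 8.x Python client):

from elasticsearch import Elasticsearch

es = Elasticsearch("https://YOUR_ES_ENDPOINT:9200")  # placeholder endpoint

def handler(event, context):
    # Lambda entry point for a DynamoDB Streams trigger: mirror post
    # changes into the "posts" search index so topics can be queried there.
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            img = record["dynamodb"]["NewImage"]  # values in DynamoDB JSON form
            es.index(
                index="posts",
                id=img["postId"]["S"],
                document={
                    "topics": [t["S"] for t in img.get("topics", {}).get("L", [])],
                    "likes": int(img["likes"]["N"]),
                    "createdAt": img["createdAt"]["S"],
                },
            )
        elif record["eventName"] == "REMOVE":
            es.delete(index="posts", id=record["dynamodb"]["Keys"]["postId"]["S"])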

How to find entity in search query in Elasticsearch?

I'm using Elasticsearch to build search for an ecommerce site.
One index will have the products stored in it; in the products index I'll store categories along with the other attributes. A product can belong to multiple categories, but each attribute will have a single field value (e.g. color).
Let's say the user types in Black (color) Nike (brand) shoes (category).
I want to process this query so that I can extract the entities (brand, attribute, etc.) and write a request body search.
I have thought of the following options:
Applying a regex to the query first to extract those entities (but with this approach I'm not sure how fuzziness would work; the user may have a typo in any of the entities)
Using the OpenNLP extension (but this one only works at index time; in the above scenario we want it on the query side)
Using the NER of any good NLP framework (this is not time- and cost-effective because I'll have millions of products in the engine, and they get updated/added on a frequent basis)
What's the best way to solve the above issue?
Edit:
Found a couple of libraries that allow fuzzy text matching in regex. But there will be many entities to find, so what's the best way to optimise that?
Still not sure about OpenNLP.
NER won't work in this case because there is a fixed set of entities, so predictions are unreliable when no entity is present in the query.
If you cannot achieve the desired results by tuning built-in Elasticsearch scoring/boosting, most likely you'll need some kind of 'natural language query' (NLQ) processing:
Tokenize the free-form query. A regex can be used for splitting lexemes, though very often it is better to write a custom tokenizer for that.
Perform named-entity recognition to determine the possible field(s) for each keyword. At this step you will get associations like (Black -> color), (Black -> product name), etc. In fact you don't need OpenNLP for this, as it should be just an index (keyword -> field(s)); you can try the Elasticsearch 'suggest' API for this purpose.
(Optional) Recognize special phrases or combinations like "released yesterday" or "price below $20".
Generate the possible combinations of matches and, with the help of a special scoring function, determine the 'best' recognition result. The scoring function may be hardcoded (reflecting 'common sense' heuristics) or it may be the result of a machine learning algorithm.
From the recognition result (match metadata), produce a formal query that returns the search results; this may be an Elasticsearch query with field hints, or even a SQL query (a rough sketch of steps 1, 2 and 5 follows below).
In general, efficient NLQ processing needs significant development effort, so I don't recommend implementing it from scratch unless you have enough resources and time for this feature. As an alternative, you can try to find an existing NLQ solution and integrate it, but most likely this will be a commercial product (I don't know of any good free/open-source NLQ components that are really production-ready).
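Here is a minimal sketch of steps 1, 2 and 5 under the simplest possible assumptions (the keyword index, field names, and fallback field are invented; real recognition would handle ambiguity and multi-word phrases):

# Toy keyword -> field index standing in for the NER/lookup step.
FIELD_INDEX = {
    "black": ["color"],
    "nike": ["brand"],
    "shoes": ["category"],
}

def build_query(user_query: str) -> dict:
    # Step 1: naive whitespace tokenization.
    tokens = user_query.lower().split()
    # Step 2: look up candidate fields for each token.
    clauses = []
    for tok in tokens:
        for field in FIELD_INDEX.get(tok, ["name"]):  # fall back to a full-text field
            clauses.append({"match": {field: {"query": tok, "fuzziness": "AUTO"}}})
    # Step 5: emit an Elasticsearch bool query from the recognized matches.
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}

print(build_query("Black Nike shoes"))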
I would approach this problem as NER tagging, considering you already have a corpus of tags. My approach for this problem would be as follows:
Create an annotated dataset of queries with each word tagged to one of the tags, say {color, brand, category}
Train an NER model (CRF/LSTMs).
This is not time & cost effective because I'll have millions of products in engine also they get updated/added on frequent basis
To handle this situation, I suggest you don't use the words in the query as features, but rather the attributes of the words. For example, create an indicator function f(x', y) for word x with context x' (i.e. the word along with the surrounding words and their attributes) and tag y, which returns 1 or 0. A sample indicator function would be:
f('blue', y) = 1 if 'blue' is in the `color attribute` column of the DB, the word previous to 'blue' is in the `product attribute` column of the DB, and y is `color`; otherwise 0.
Create a lot of these indicator functions, also known as feature maps.
These indicator functions are then used to train a model using CRFs or LSTMs. Finally, we use the Viterbi algorithm to find the best tagging sequence for the query. For CRFs you can use packages like CRFsuite or CRF++. Using these packages, all you have to do is create the indicator functions and the package will train a model for you. Once trained, you can use the model to predict the best sequence for your queries. CRFs are very fast.
This way of training, without using vector representations of the words themselves, lets your model generalise without the need for retraining. [Look at NER using CRFs]
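A small sketch of this with sklearn-crfsuite (the gazetteers and training queries are toy stand-ins for the DB-backed attribute lookups described above):

import sklearn_crfsuite

COLORS, BRANDS, CATEGORIES = {"black", "blue"}, {"nike", "adidas"}, {"shoes", "shirt"}

def word_features(sent, i):
    # Attribute-based features (gazetteer membership), not the word identity,
    # so the model can generalise to products added after training.
    w = sent[i].lower()
    return {
        "in_colors": w in COLORS,
        "in_brands": w in BRANDS,
        "in_categories": w in CATEGORIES,
        "is_first": i == 0,
        "prev_in_brands": i > 0 and sent[i - 1].lower() in BRANDS,
    }

train_sents = [["Black", "Nike", "shoes"], ["blue", "adidas", "shirt"]]
train_tags = [["color", "brand", "category"], ["color", "brand", "category"]]

X = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_tags)

# Viterbi decoding of the best tag sequence for a new query.
query = ["black", "nike", "shirt"]
print(crf.predict_single([word_features(query, i) for i in range(len(query))]))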

How is word association mining a generalization of the n-gram language model?

I am working on text mining (reading a book), and the author said that word association mining is actually a generalization of the n-gram language model. Can you please explain how word association mining is a generalization of the n-gram language model?
To me, word association mining is finding syntagmatic relations (co-occurring words), while an n-gram language model compares all n words in a query to suggest or return relevant documents.
Association rule mining tries to cover frequent co-occurrences of arbitrary length.
If you apply this to text (not just two-term correlations), you would indeed find n-grams without a fixed n.
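A toy sketch of the contrast (the corpus and support threshold are invented; a real association miner such as Apriori would prune the candidate sets instead of brute-forcing all subsets):

from collections import Counter
from itertools import combinations

docs = [
    "new york city traffic",
    "new york city marathon",
    "city traffic report",
]

# n-gram model: fixed window of exactly n adjacent words.
n = 2
bigrams = Counter(
    tuple(words[i:i + n])
    for words in (d.split() for d in docs)
    for i in range(len(words) - n + 1)
)
print(bigrams.most_common(3))

# Association mining: frequent co-occurring word sets of *any* size,
# regardless of adjacency; here brute-forced over all subsets.
itemsets = Counter(
    subset
    for d in docs
    for k in range(2, len(set(d.split())) + 1)
    for subset in combinations(sorted(set(d.split())), k)
)
min_support = 2
print([(s, c) for s, c in itemsets.items() if c >= min_support])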