How can I see if a document contains any of a set of words using ML.NET?

I am learning ML.NET and want to submit the text of a document to see if it contains one or more specific words. In my example I want to categorize my document by day of the week. For example, if it contains the word "Monday" or "first", then I want it to be categorized as "Monday".

You will need to create a training dataset. This should consist of documents that contain the words you are looking for AND documents that don't, labeled with a column that contains Yes or No as the ground truth.
Then use the Model Builder interface and select a classification task.
It's best to start with a small training set, just to check that you have the data in the right format.
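For example, a minimal training file could look like this (column names and rows are illustrative; Model Builder just needs a text column plus a label column):

    Text,Label
    Please attend the meeting on Monday,Yes
    The first day is reserved for setup,Yes
    This report covers quarterly revenue,No

Once the binary case works, the same file layout extends to multiclass classification by putting the day of the week in the Label column instead of Yes/No.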

Related

How to find entity in search query in Elasticsearch?

I'm using Elasticsearch to build search for an e-commerce site.
One index will have products stored in it; in the products index I'll store categories along with the product's other attributes. There can be multiple categories, but each attribute will have a single field value (e.g. color).
Let's say a user types in Black (color) Nike (brand) shoes (category).
I want to process this query to extract the entities (brand, attribute, etc.) so that I can write a Request Body Search.
I have thought of the following options:
Applying a regex to the query first to extract those entities (but with this approach I'm not sure how fuzziness would work; the user may have a typo in any of the entities).
Using the OpenNLP extension (but this only works at indexing time; in the above scenario we want it at query time).
Using the NER of any good NLP framework (but this is not time- and cost-effective, because I'll have millions of products in the engine and they get updated/added frequently).
What's the best way to solve this?
Edit:
I found a couple of libraries that allow fuzzy text matching in regex. But there will be many entities to find, so what's the best way to optimize that?
I'm still not sure about OpenNLP.
NER won't work in this case because there is a fixed set of entities, so predictions go wrong when no entity is present in the query.
If you cannot achieve the desired results by tuning Elasticsearch's built-in scoring/boosting, most likely you'll need some kind of 'natural language query' processing:
Tokenize the free-form query. A regex can be used for splitting lexemes, but very often it is better to write a custom tokenizer.
Perform named-entity recognition to determine the possible field(s) for each keyword. At this step you will get associations like (Black -> color), (Black -> product name), etc. In fact you don't need OpenNLP for this, as it should just be an index of (keyword -> field(s)); you can also try Elasticsearch's 'suggest' API for this purpose (see the sketch after this list).
(Optional) Recognize special phrases or combinations like "released yesterday" or "price below $20".
Generate the possible combinations of matches and, with the help of a scoring function, determine the 'best' recognition result. The scoring function may be hardcoded (reflecting 'common sense' heuristics) or it may be the result of a machine learning algorithm.
From the recognition result (match metadata), produce a formal query that yields the search results - this may be an Elasticsearch query with field hints, or even a SQL query.
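As a rough illustration of steps 1, 2 and 5, here is a minimal Python sketch; the keyword-to-field index and all index/field names are assumptions, and a real implementation would add the scoring step in between:

    # Sketch: tokenize, look up candidate fields per keyword, and emit an
    # Elasticsearch bool query. KEYWORD_FIELDS is the (keyword -> fields)
    # index described in step 2; all names here are illustrative.
    import re

    KEYWORD_FIELDS = {
        "black": ["color", "name"],
        "nike": ["brand"],
        "shoes": ["category"],
    }

    def tokenize(query):
        return re.findall(r"[a-z0-9$]+", query.lower())

    def build_es_query(query):
        clauses = []
        for tok in tokenize(query):
            fields = KEYWORD_FIELDS.get(tok, ["name"])  # fallback: full-text field
            clauses.append({
                "multi_match": {
                    "query": tok,
                    "fields": fields,
                    "fuzziness": "AUTO",  # tolerates typos in entity values
                }
            })
        return {"query": {"bool": {"must": clauses}}}

    print(build_es_query("Black Nike shoes"))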
In general, efficient NLQ processing requires significant development effort - I don't recommend implementing it from scratch unless you have enough resources and time for this feature. As an alternative, you can try to find an existing NLQ solution and integrate it, but most likely this will be a commercial product (I don't know of any good free/open-source NLQ components that are really ready for production use).
I would approach this problem as NER tagging, considering you already have a corpus of tags. My approach would be as follows:
Create an annotated dataset of queries with each word tagged as one of the tags, say {color, brand, category}.
Train an NER model (CRFs/LSTMs).
You said: "This is not time- and cost-effective because I'll have millions of products in the engine and they get updated/added frequently."
To handle this situation, I suggest not using the words in the query themselves as features, but rather the attributes of the words. For example, create an indicator function f(x', y) for word x with context x' (i.e. the word along with the surrounding words and their attributes) and tag y, which returns 1 or 0. A sample indicator function:
f('blue', y) = 1 if 'blue' is in the `color attribute` column of the DB, the word before 'blue' is in the `product attribute` column of the DB, and y is `color`; otherwise 0.
Create lots of these indicator functions, also known as feature maps.
These indicator functions are then used to train a model with CRFs or LSTMs, and finally the Viterbi algorithm finds the best tagging sequence for your query. For CRFs you can use packages like CRFsuite or CRF++; with these packages, all you have to do is create the indicator functions and the package will train a model for you. Once trained, you can use this model to predict the best sequence for your queries. CRFs are very fast.
Training this way, without using vector representations of the words, will generalize your model without the need for retraining. (Look at NER using CRFs.)
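A minimal sketch of such gazetteer-style features using the Python bindings for CRFsuite (the sklearn-crfsuite package); the gazetteer sets stand in for lookups against your product DB columns, and the tiny training set is purely illustrative:

    import sklearn_crfsuite

    COLOR_GAZETTEER = {"black", "blue", "red"}   # from the color column
    BRAND_GAZETTEER = {"nike", "adidas"}         # from the brand column

    def word_features(tokens, i):
        w = tokens[i].lower()
        return {
            "in_color_col": w in COLOR_GAZETTEER,   # attribute of the word,
            "in_brand_col": w in BRAND_GAZETTEER,   # not the word itself
            "prev_in_brand_col": i > 0 and tokens[i - 1].lower() in BRAND_GAZETTEER,
        }

    # Annotated queries: one feature dict per token, one tag per token.
    train_queries = [["Black", "Nike", "shoes"]]
    X = [[word_features(q, i) for i in range(len(q))] for q in train_queries]
    y = [["color", "brand", "category"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)
    test = ["blue", "shoes"]
    print(crf.predict([[word_features(test, i) for i in range(len(test))]]))

Because the features are gazetteer lookups rather than the word identities themselves, newly added products are covered as soon as the gazetteers are refreshed, with no retraining.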

Modeling a multi-language data warehouse

I need your help.
I work for a survey company, and I am responsible for creating the architecture and modeling a data warehouse that analyzes the results of an international survey (50 countries).
For the architecture, we decided to create a tabular model in Power BI to analyze our data and create our reports.
Here is the model as I designed it (model diagram omitted).
However, I have a design problem.
Since the survey is international, the labels of my dimensions differ from country to country.
My 1st question:
Would it make more sense to create only one Power BI embedded model for all countries, or 50 Power BI reports?
My 2nd question:
My model must be multilingual.
Across my 50 countries I have 5 languages, and for the same language I have several variants: the British English labels differ from the US English labels.
For example, in the Response dimension, for France IdReponse = 1 has the label 'Vrai', while for the USA the label is 'True' and for Britain it is 'OK'.
Do you know how to model multiple languages in a data warehouse?
About question #1 - It's always better if there is only one model; it will be much easier to maintain. It isn't clear from your question whether these 50 reports will show the same data (excluding the internationalization of texts like Vrai/True/OK), or whether each report/country should show its own subset of the data. If all reports will show the same data, then it is definitely better to make one common model and have all reports use it. You can do this in Power BI by making one "master" report and publishing it; the rest of your "per country" reports then use it as a data source. You will still need separate reports per country, because you will need to translate the texts (column names, static texts, etc.).
About question #2 - You can create lookup tables in your model (maybe even in the database; it's up to you). The key value (1) will be linked to the key of the table, and there will be a column per language. Depending on the language of the current report, you select the appropriate column (e.g. French, British, etc.), and you can even fall back to, say, US English when there is no translation for the current language (e.g. via a computed column). Making a separate lookup table per language is also an option, but I think that would be more cumbersome to maintain.
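For example, a single lookup table along these lines (the column names and the second row are illustrative):

    IdReponse | fr_FR | en_US | en_GB
    ----------+-------+-------+--------
    1         | Vrai  | True  | OK
    2         | Faux  | False | Not OK

Each report then reads the column matching its language, falling back to en_US where a cell is empty.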
About question #1: Yes, you need only one data model.
About question #2: You load a question in the language it is asked, and store the response as-is in the Response DIM. You should then create a new column in your Response DIM, such as Clean_response, where you transform the original response into a uniform value. For example, "Vrai", "OK", and "True" have the same meaning, so you may choose to put "Yes" in the Clean_response column. You can also convert different variations of "No" ("Nada", "noops", "nah") to a clean value of "No", but keep the original value too.
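For example (illustrative values):

    Original response | Clean_response
    ------------------+---------------
    Vrai              | Yes
    OK                | Yes
    True              | Yes
    Nada              | No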
Labeling a column in the report should be handled in the report code. For example, a report written in French would use your DIM column named "Question" but show it with the heading "interroger" on the report.

How to determine if a given word or phrase from a list is within an anchor tag?

We have a ColdFusion-based site that involves a large number of 'document authors' who have little or no knowledge of HTML. The 'documents' they create consist of HTML stored in a table in the database; the authors work in a CKEditor interface, and the content they create is output into a specific area of the page. The documents frequently contain lots of technical terms that readers may not be familiar with, and we would like tooltips to show up for these automatically.
The other programmer and I want to have some code insert 'tooltip' markup into the page based on a list of words in a table on our SQL Server. The 'dictionary' table in our database has a unique ID, the word/phrase to look for, and the corresponding definition to be displayed in the tooltip.
For instance, one of the words/phrases we will be looking for is 'Scrum Master'. If it occurs in the document area, we need to insert code around the words to create a tooltip. To do that, we need to check certain conditions. Are the words within an anchor tag? If yes, is there already a title value for the tag (the title is used to hold the text displayed in the tooltip)? If a title exists, don't do anything. If the words are not in an anchor tag, we put anchor tags around the words, along with a title containing the definition.
The tooltip code we use is jQuery's (http://jqueryui.com/tooltip/). It is quick and simple to use; we just need to figure out how to apply it dynamically based on our dictionary table.
Do you have any suggestions on how to go about this?
I was hoping that jSoup might have a function I could use, but it doesn't seem to be the right technology for what I want to do; I could be wrong, though, and I am happy to be corrected!
We have a large number of these documents, so manually inserting and maintaining the tooltip code is simply not an option.
Update your content with something like:

    strOut = ReplaceList(strIn, ValueList(qryTT.find), ValueList(qryTT.replace));

Since words are delimited by spaces, the values in qryTT.find need to include the spaces, and the replace column is going to need to include some of the original content. You will also have to be careful with words followed by a comma or a period.
I would cache the results, because I would expect this to be memory intensive.
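For the anchor-tag check itself, here is a hedged sketch in Python with BeautifulSoup (standing in for the jSoup idea; the DICTIONARY dict represents rows from your SQL table, and all names are illustrative):

    import re
    from bs4 import BeautifulSoup

    DICTIONARY = {"Scrum Master": "The facilitator of a Scrum team."}

    def add_tooltips(html):
        soup = BeautifulSoup(html, "html.parser")
        for term, definition in DICTIONARY.items():
            pattern = re.compile(re.escape(term))
            for node in soup.find_all(string=pattern):
                anchor = node.find_parent("a")
                if anchor is not None:
                    # Already inside an anchor: add a title only if none exists.
                    if not anchor.get("title"):
                        anchor["title"] = definition
                    continue
                # Not inside an anchor: wrap each occurrence of the term.
                wrapped = pattern.sub(
                    lambda m: '<a href="#" title="%s">%s</a>' % (definition, m.group(0)),
                    str(node),
                )
                node.replace_with(BeautifulSoup(wrapped, "html.parser"))
        return str(soup)

    print(add_tooltips('<p>Ask the Scrum Master. <a href="/x">Scrum Master</a></p>'))

The same traversal logic would port to jSoup if you prefer to stay on the JVM.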

Efficiently processing all data in a Cassandra Column Family with a MapReduce job

I want to process all of the data in a column family in a MapReduce job. Ordering is not important.
One approach is to iterate over all the row keys of the column family and use them as the input. This could potentially be a bottleneck, and it could be replaced with a parallel method.
I'm open to other suggestions, or for someone to tell me I'm wasting my time with this idea. I'm currently investigating the following:
A potentially more efficient way is to assign token ranges to the input instead of iterating over all row keys (before the mapper starts). Since I am using RandomPartitioner, is there a way to specify a range to query based on the MD5?
For example, I want to split the task into 16 jobs. Since RandomPartitioner is MD5-based (from what I have read), I'd like to query everything starting with 'a' for the first range. In other words, how would I do a get_range over MD5s that starts at 'a' and ends before 'b', e.g. a0000000000000000000000000000000 - afffffffffffffffffffffffffffffff?
I'm using the pycassa API (Python), but I'm happy to see Java examples.
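For reference, a sketch of the split being described, assuming a pycassa version whose get_range accepts the start_token/finish_token arguments; with RandomPartitioner a token is (roughly) the decimal form of the key's MD5, in the range 0 to 2**127. Keyspace and CF names are illustrative:

    import pycassa

    pool = pycassa.ConnectionPool("MyKeyspace", ["localhost:9160"])
    cf = pycassa.ColumnFamily(pool, "MyCF")

    RING = 2 ** 127   # RandomPartitioner token space
    N_SPLITS = 16

    def scan_split(i):
        # Tokens are passed as decimal strings.
        start = str(RING * i // N_SPLITS)
        finish = str(RING * (i + 1) // N_SPLITS)
        return cf.get_range(start_token=start, finish_token=finish)

    for key, columns in scan_split(10):  # one of the 16 token slices
        pass  # feed each row to a mapper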
I'd cheat a little:
Create new rows job_(n), with each column representing a row key in the range you want.
Pull all columns from that specific row to find which rows you should pull from the CF (see the sketch below).
I do this with users: users from a particular country get a column in the country-specific row, and users of a particular age are also added to a specific row.
This lets me quickly pull the rows I need based on the criteria I want, and it is a little more efficient than pulling everything.
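A hedged pycassa sketch of that index-row trick (keyspace, CF, and key names are all illustrative):

    import pycassa

    pool = pycassa.ConnectionPool("MyKeyspace", ["localhost:9160"])
    data_cf = pycassa.ColumnFamily(pool, "Users")
    index_cf = pycassa.ColumnFamily(pool, "JobIndex")

    # Maintained as rows are written: column name = row key to process.
    index_cf.insert("job_0", {"user_123": "", "user_456": ""})

    # A worker pulls its row keys from its job row, then fetches those rows.
    row_keys = [col for col, _ in index_cf.xget("job_0")]
    for key, columns in data_cf.multiget(row_keys).items():
        pass  # hand each row to the MR job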
This is how the Mahout CassandraDataModel example works:
https://github.com/apache/mahout/blob/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/cassandra/CassandraDataModel.java
Once you have the data and can pull the rows you are interested in, you can hand them off to your MR job(s).
Alternatively, if speed isn't an issue, look into using Pig: How to use Cassandra's Map Reduce with or w/o Pig?

Blocks of similar text for test data

For testing purposes I need to create sets of text files that have similar but not identical text. Each set needs to be different from the other sets but also share some commonality.
For example, I may need to create 10 sets of 20 documents each, for a total of 200 documents. Each document needs about 250 words in it.
If one set of documents is about dogs, then it would be appropriate for the other sets' documents to be about animals more generally, so that there is a weak link between the sets (in this case, animals) and a strong link between the documents within a set (such as dogs in one set and cats in another).
The words in the documents do not need to be in any particular order, nor do they need to form sentences or make sense.
Does anybody know how I can generate or obtain this type of data for my unit tests?
How about grabbing some text from Project Gutenberg?
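Alternatively, since the words don't need to make sense, you could generate the corpus. A minimal Python sketch along those lines (all word pools are illustrative): a shared pool supplies the weak link between sets, and a per-set pool supplies the strong link within a set.

    import os
    import random

    SHARED = ["animal", "fur", "tail", "pet", "food", "sleep"]   # weak link
    SET_POOLS = {
        "dogs": ["dog", "bark", "leash", "bone", "puppy"],
        "cats": ["cat", "meow", "litter", "whisker", "kitten"],
    }

    def make_doc(pool, n_words=250, shared_ratio=0.3):
        words = [random.choice(SHARED) if random.random() < shared_ratio
                 else random.choice(pool) for _ in range(n_words)]
        return " ".join(words)

    for set_name, pool in SET_POOLS.items():
        os.makedirs(set_name, exist_ok=True)
        for i in range(20):
            with open(os.path.join(set_name, "doc_%02d.txt" % i), "w") as f:
                f.write(make_doc(pool))

Scaling up to 10 sets of 20 documents is just a matter of adding more pools.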
I needed a test data set for text indexing, to benchmark Solr indexing speed.
I downloaded source code from GitHub as a zip file (the "Download ZIP" button). This one, for example, is huge:
https://github.com/spring-projects/spring-framework