Removing instances in Weka to normalize result classes

I'm trying to construct a classifier, but one class is heavily overrepresented.
I have tried weighting, but the data set is large and I need to subsample it.
So I thought I might remove some instances from the overrepresented class instead.
Are there any filters I could use to remove a subset of only one of the classes?

Take a look at the SpreadSubsample filter among the supervised instance filters. It should do the job. Good luck.
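For example, a minimal sketch with the python-weka-wrapper3 bindings (the file name is a placeholder, and -M 1.0 asks for a uniform class distribution; adjust both to your data):

    import weka.core.jvm as jvm
    from weka.core.converters import Loader
    from weka.filters import Filter

    jvm.start()
    data = Loader(classname="weka.core.converters.ArffLoader").load_file("data.arff")
    data.class_is_last()  # SpreadSubsample is supervised, so the class must be set

    # -M 1.0 => subsample so that every class keeps the same number of instances
    spread = Filter(classname="weka.filters.supervised.instance.SpreadSubsample",
                    options=["-M", "1.0"])
    spread.inputformat(data)
    balanced = spread.filter(data)
    print(balanced.num_instances)
    jvm.stop()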

How to find entity in search query in Elasticsearch?

I'm using Elasticsearch to build search for an ecommerce site.
One index will store the products; alongside a product's other attributes I'll store its categories. A product can have multiple categories, but each attribute will hold a single value (e.g. color).
Let's say a user types in Black (color) Nike (brand) shoes (category).
I want to process this query so that I can extract the entities (brand, attribute, etc.) and write a Request Body Search.
I have thought of the following options:
Applying a regex to the query first to extract those entities (but with this approach I'm not sure how fuzziness would work; the user may have a typo in any of the entities).
Using the OpenNLP extension (but it only works at indexing time; in the scenario above we want it on the query side).
Using the NER component of a good NLP framework (this is not time- and cost-effective because I'll have millions of products in the engine, and they get updated/added frequently).
What's the best way to solve this?
Edit:
I found a couple of libraries that allow fuzzy text matching in regexes. But there will be many entities to find, so what's the best way to optimise that?
Still not sure about OpenNLP.
NER won't work in this case because there is only a fixed set of entities, so the predictions go wrong when the query contains no entity at all.
If you cannot achieve the desired results by tuning Elasticsearch's built-in scoring/boosting, you will most likely need some kind of 'natural language query' (NLQ) processing:
1. Tokenize the free-form query. A regex can be used to split lexemes, but very often it is better to write a custom tokenizer.
2. Perform named-entity recognition to determine the possible field(s) for each keyword. At this step you will get associations like (Black -> color), (Black -> product name), etc. In fact you don't need OpenNLP for this, since it is just an index of (keyword -> field(s)); you can also try Elasticsearch's 'suggest' API for this purpose.
3. (Optional) Recognize special phrases or combinations like "released yesterday" or "price below $20".
4. Generate the possible combinations of matches and, with the help of a scoring function, determine the 'best' recognition result. The scoring function may be hardcoded (reflecting common-sense heuristics) or it may be the output of a machine-learning algorithm.
5. From the recognition result (the match metadata), produce a formal query that yields the search results; this may be an Elasticsearch query with field hints, or even an SQL query.
In general, efficient NLQ processing requires significant development effort; I don't recommend implementing it from scratch unless you have enough resources and time for this feature. As an alternative, you can try to find an existing NLQ solution and integrate it, but most likely that will be a commercial product (I don't know of any good free/open-source NLQ components that are really production-ready).
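To make steps 1 and 2 concrete, here is a rough sketch (the gazetteer contents and field names are invented for illustration; a real keyword -> field index would be built offline from the product database):

    import re

    # step 2's (keyword -> field) index, built offline from the product DB
    GAZETTEER = {
        "black": ["color", "name"],
        "nike": ["brand"],
        "shoes": ["category"],
    }

    def tokenize(query):
        # step 1: a custom tokenizer can go far beyond this single-regex split
        return re.findall(r"[a-z0-9$]+", query.lower())

    def candidate_fields(query):
        # one entry per token, listing every field it could belong to
        return {tok: GAZETTEER.get(tok, []) for tok in tokenize(query)}

    print(candidate_fields("Black Nike shoes"))
    # {'black': ['color', 'name'], 'nike': ['brand'], 'shoes': ['category']}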
I would approach this as an NER tagging problem, given that you already have a corpus of tags. My approach would be as follows:
Create an annotated dataset of queries, with each word tagged with one of the tags, say {color, brand, category}.
Train an NER model (CRF/LSTM).
You wrote: "this is not time- and cost-effective because I'll have millions of products in the engine, and they get updated/added frequently".
To handle this, I suggest not using the words in the query themselves as features, but rather the attributes of the words. For example, create an indicator function f(x', y) for a word x with context x' (i.e., the word along with the surrounding words and their attributes) and tag y, which returns 1 or 0. A sample indicator function:
f('blue', y) = 1 if 'blue' is in the `color attribute` column of the DB, the word before 'blue' is in the `product attribute` column of the DB, and y is `color`; otherwise 0.
Create lots of these indicator functions, also known as feature maps.
These indicator functions are then used to train a model with CRFs or LSTMs. Finally, we use the Viterbi algorithm to find the best tag sequence for your query. For CRFs you can use packages like CRFsuite or CRF++; with these packages, all you have to do is create the indicator functions and the package will train a model for you. Once trained, you can use the model to predict the best tag sequence for your queries. CRFs are very fast.
Training this way, without vector representations of the words, generalises the model without the need for retraining. (Look up NER using CRFs.)
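For illustration, a minimal sketch of such gazetteer-style indicator features with sklearn-crfsuite (the gazetteer sets, tags, and toy data are all invented; in practice the gazetteers would come from your product DB):

    import sklearn_crfsuite

    # hypothetical gazetteers pulled from the product database
    COLORS = {"black", "blue", "red"}
    BRANDS = {"nike", "adidas"}

    def word_features(tokens, i):
        w = tokens[i].lower()
        prev = tokens[i - 1].lower() if i > 0 else "<s>"
        return {
            "in_color_gazetteer": w in COLORS,   # indicator: word is a known color
            "in_brand_gazetteer": w in BRANDS,   # indicator: word is a known brand
            "prev_in_brand_gazetteer": prev in BRANDS,
            "suffix3": w[-3:],                   # generalises to unseen words
        }

    # toy training data: a single tagged query
    X = [[word_features(["black", "nike", "shoes"], i) for i in range(3)]]
    y = [["color", "brand", "category"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)
    print(crf.predict([[word_features(["blue", "adidas", "shirt"], i) for i in range(3)]]))

Because the features test properties of a word rather than its identity, the model can tag words it never saw during training, which is the generalisation described above.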

What is the effect of using a filtered classifier over a normal classifier in Weka

I have used Weka for text classification. First I applied the StringToWordVector filter, and the filtered data were used with an SVM classifier (LibSVM) for cross-validation. Later I read a blog post here.
It said that it is not appropriate to apply the filter first and then perform cross-validation; instead it proposes using FilteredClassifier. The justification is:
Two weeks ago, I wrote a post on how to chain filters and classifiers in WEKA, in order to avoid misleading results when performing experiments with text collections. The issue was that, when using N Fold Cross Validation (CV) in your data, you should not apply the StringToWordVector (STWV) filter on the full data collection and then perform the CV evaluation on your data, because you would be using words that are present in your test subset (but not in your training subset) for each run.
I cannot understand the reason behind this. Does anyone know it?
When you apply the filter before N-fold cross-validation, you filter every word that appears in every instance, regardless of whether it is a test instance or a training instance. At that point the filter has no way of knowing which instances will be used for testing and which for training. So if you are using StringToWordVector with TFTransform or any similar operation, words in the test instances may affect the transformed values. (Simply put: if you are building a bag of words, you would be including the test instances too.) This is not acceptable, since the training parameters should not be affected by the test data. Instead, you can do the filtering on the fly; that is what FilteredClassifier does.
To get an idea of how N-fold cross-validation works, please refer to Rushdi Shams's answer to the following question. Please let me know whether this makes sense. Cheers!
Cross Validation in Weka
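For reference, a minimal sketch of this setup with the python-weka-wrapper3 bindings (the file name is a placeholder, and SMO is used here as the base classifier instead of the LibSVM package from the question):

    import weka.core.jvm as jvm
    from weka.core.converters import Loader
    from weka.classifiers import Classifier, Evaluation
    from weka.core.classes import Random

    jvm.start()
    data = Loader(classname="weka.core.converters.ArffLoader").load_file("texts.arff")
    data.class_is_last()

    # FilteredClassifier re-applies StringToWordVector inside every CV fold,
    # so the vocabulary is built from the training folds only
    fc = Classifier(classname="weka.classifiers.meta.FilteredClassifier", options=[
        "-F", "weka.filters.unsupervised.attribute.StringToWordVector",
        "-W", "weka.classifiers.functions.SMO",
    ])

    evl = Evaluation(data)
    evl.crossvalidate_model(fc, data, 10, Random(1))
    print(evl.summary())
    jvm.stop()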

Add new attribute calculated based on other attributes

I'm getting started with WEKA and want to achieve the following.
I have a file with 2 attributes: user_id, user_age.
I can successfully load the data using the WEKA API and get an Instances object.
Now I want to compute a new attribute, user_age_range, e.g. (0-18) -> 0, (19-25) -> 1, etc.
Is there a way to compute this attribute using WEKA filters?
I would also prefer not to iterate over all instances manually, but rather to define a method that operates on a single Instance and use some filter (or other abstraction) that applies the corresponding "transformation" to all instances.
Please advise on how I could achieve this.
Thanks in advance.
After looking through the docs, I found a couple of filters that you could use together to achieve what you want.
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/Copy.html
Use Copy to create a copy of the attribute that you will transform.
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/NumericTransform.html
NumericTransform takes a class and a method as options; you could write your own class that bins the ages into the ranges you want, and supply that class and method as the options.
Hope this helps
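As a variation on the same idea, here is a sketch with the python-weka-wrapper3 bindings that chains AddExpression (to copy user_age into a new attribute) and MathExpression (to bin it). The file name, the attribute index a2, and the cut-offs are assumptions, and check the -R/-V semantics of MathExpression in your Weka version (in some versions -R lists columns to ignore and -V inverts the selection):

    import weka.core.jvm as jvm
    from weka.core.converters import Loader
    from weka.filters import Filter

    jvm.start()
    data = Loader(classname="weka.core.converters.ArffLoader").load_file("users.arff")

    # duplicate user_age (attribute a2) into a new last attribute user_age_range
    copy_age = Filter(classname="weka.filters.unsupervised.attribute.AddExpression",
                      options=["-E", "a2", "-N", "user_age_range"])
    copy_age.inputformat(data)
    data = copy_age.filter(data)

    # bin the copy in place: 0 for 0-18, 1 for 19-25, 2 otherwise
    binner = Filter(classname="weka.filters.unsupervised.attribute.MathExpression",
                    options=["-E", "ifelse(A<19,0,ifelse(A<26,1,2))", "-R", "last"])
    binner.inputformat(data)
    data = binner.filter(data)
    jvm.stop()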
With a CSV file you can do this in Excel.
If you are using ARFF files, convert to CSV first; then add as many columns as you need for the new attributes, fill in the formula (using one or more attributes) in the first row, extend it to all rows, and it's done.
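Or script the same thing; a sketch with pandas (the column names and cut-offs are assumptions):

    import pandas as pd

    users = pd.read_csv("users.csv")
    # 0 for ages 0-18, 1 for 19-25, 2 for 26 and up
    users["user_age_range"] = pd.cut(users["user_age"],
                                     bins=[0, 18, 25, 200],
                                     labels=[0, 1, 2],
                                     include_lowest=True)
    users.to_csv("users_with_range.csv", index=False)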

Changing a nominal variable to remove one particular label with zero instances in Weka

I have a very simple question, but I am confused by the user interface and could not find the answer in the documentation.
One feature in my dataset is nominal. It used to have 4 labels, but I deleted all instances of one label. Now I want to classify based on this feature.
BUT, in the Preprocess window the attribute is still listed as having 4 labels, one of which has 0 instances. Classification works as it should, but the result contains a row/column for the empty class in the confusion matrix and accuracy table.
Is there a way to remove the label with zero instances, so that Weka thinks the feature only has three values?
Thanks!
OK, got it. I opened the Weka (ARFF) file in a text editor; there I could change the feature definition.
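Concretely, the fix is to edit the @attribute line in the ARFF header so it no longer lists the empty label. A sketch that scripts the same edit (the attribute name and labels are made up):

    # remove the unused label "d" from a nominal attribute declaration
    with open("data.arff") as fh:
        arff = fh.read()
    arff = arff.replace("@attribute myfeature {a,b,c,d}",
                        "@attribute myfeature {a,b,c}")
    with open("data_fixed.arff", "w") as fh:
        fh.write(arff)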

Custom Date Aggregate Function

I want to sort my Store models by their opening times. The Store model has an is_open method, which checks the Store's opening-time ranges and returns a boolean saying whether it is currently open. The problem is that I don't want to sort my queryset manually, because of the efficiency cost. I thought that if I wrote a custom annotate function, I could filter the query more efficiently.
So I googled and found that I can extend Django's Aggregate class. From what I understood, I have to use predefined SQL functions like MAX, AVG, etc. The thing is, I want to check whether today's date falls within a given list of time intervals. So can anyone tell me which SQL function I should use?
Edit
I'd like to put the code here, but it's really spaghetti. A page-long function does nothing but generate the time intervals and pick the suitable one.
I want to avoid :
alg = lambda s: not (s.is_open() and s.reachable)
sorted(stores,key=alg)
and replace with :
Store.objects.annotate(is_open = CheckOpen(datetime.today())).order_by('is_open')
But I'm totally lost as to how to write CheckOpen...
Have a look at the docs for extra.
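For illustration, a sketch with extra() (the opens and closes columns are assumptions; the raw SQL fragment has to match however the opening times are actually stored, and the alias is_open_now avoids clashing with the is_open() method):

    from datetime import datetime

    now = datetime.now().time()
    stores = Store.objects.extra(
        # computed per row in SQL, so the ordering happens in the database
        select={"is_open_now": "opens <= %s AND closes >= %s"},
        select_params=(now, now),
    ).order_by("-is_open_now")

On newer Django versions, conditional expressions (Case/When with annotate()) are the preferred way to build this kind of computed ordering key.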