Features combinations

Features combinations - weka

I have a list of features set (40 features) and my idea firstly was to evaluate the classifier on all the combinations that I can get. However, after I did some calculations I found that the combinations will reach millions!! Thus, it will take forever!!!!
I read about the ability of using random search method to chose random features. However, each time I run the random search I got the same features sets. Do I need to change the seed number or any option??
Also, Is using random search effective and can substitute the approach of choosing all combinations???
I would appreciate your help experts.
Many thanks in advance,
Ahmad

When you want to perform an attribute selection in WEKA, yo should take into account 2 algorithms, the searcher and the attribute evaluator (I will talk about it later).
As you said, maybe you cannot try an Exhaustive search because it takes so long, there are greedy alternatives to get good results (depending on the problem) like Best first (based on hill climbing). The option that you comment (Random search) is another approach to make the selection subsets, it makes random iterations to select subsets that will be evaluated.
Why are you getting the same subset of selected attributes? Because the Random search is selecting always the same subsets and the evaluator determines the best one (final output). But if I change the seed parameter it should change. Maybe or... maybe not. Why? Because if the algorithm performs an enough number of iterations (although it starts with a different seed) it will get the same subsets than the previous one (convergence) and the evaluator will choose the same subset as the previous execution.
If you do not want to get convergence in the selector output, just change the seed, but choose a smaller search percent to limit the exploration and get different results.
But, in my opinion, if you are getting always the same results is because the evaluator (I do not know what algorithm are you using) has determined that this subset is "the best" given your dataset. I also recommend you to try another selector like Best first or a Genetic search as your search method.

Related

Evolutionary Algorithm without an objective function

I'm currently trying to find good parameters for my program (about 16 parameters and execution of the program takes about a minute). Evolutionary algorithms seemed like a nice idea and I wanted to see how they perform.
Unfortunately I don't have a good fitness function because the variance of my objective function is very high (I can not run it often enough without waiting until 2016). I can, however, compute which set of parameters is better (test two configurations against each other). Do you know if there are evolutionary algorithms that only use that information? Are there other optimization techniques more suitable? For this project I'm using C++ and MATLAB.
// Update: Thank you very much for the answers. Both look promising but I will need a few days to evaluate them. Sorry for the delay.

If your pairwise test gives a proper total ordering, i.e. if a >= b, and b >= c implies a >= c, and some other conditions . Then maybe you can construct a ranking objective on the fly, and use CMA-ES to optimize it. CMA-ES is an evolutionary algorithm and is invariant to order preserving transformation of function value, and angle-preserving transformation of inputs. Furthermore because it's a second order method, its convergence is very fast comparing to other derivative-free search heuristics, especially in higher dimensional problems where random search like genetic algorithms take forever.

If you can compare solutions in a pairwise fashion then some sort of tournament selection approach might be good. The Wikipedia article describes using it for a genetic algorithm but it is easily applied to an evolutionary algorithm. What you do is repeatedly select a small set of solutions from the population and have a tournament among them. For simplicity the tournament size could be a power of 2. If it was 8 then pair those 8 up at random and compare them, selecting 4 winners. Pair those up and select 2 winners. In a final round -- select an overall tournament winner. This solution can then be mutated 1 or more times to provide member(s) for the next generation.

appropriate minimum support for itemset?

Please suggest me for any kind material about appropriate minimum support and confidence for itemset!
::i use apriori algorithm to search frequent itemset. i still don't know appropriate support and confidence for itemset. i wish to know what kinds of considerations to decide how big is the support.

The answer is that the appropriate values depends on the data.
For some datasets, the best value may be 0.5. But for some other datasets it may be 0.05. It depends on the data.
But if you set minsup =0 and minconf = 0, some algorithms will run out of memory before terminating, or you may run out of disk space because there is too many patterns.
From my experience, the best way to choose minsup and minconf is to start with a high value and then to lower them down gradually until you find enough patterns.
Alternatively, if you don't want to have to set minsup, you can use a top-k algorithms where instead of specifying minsup, you specify for example that you want the k most frequent rules. For example, k = 1000 rules.
If you are interested by top-k association rule mining, you can check my Java code here:
http://www.philippe-fournier-viger.com/spmf/
The algorithm is called TopKRules and the article describing it will be published next month.
Besides that, you need to know that there is many other interestingness measures beside the support and confidence: lift, all-confidence, ... To know more about this, you can read this article: "On selecting interestingness measures for association rules" and "A Survey of Interestingness Measures for Association Rules" Basically, all measures have some problems in some cases... no measure is perfect.
Hope this helps!

In any association rule mining algorithm, including Apriori, it is up to the user to decide what support and confidence values they want to provide. Depending on your dataset and your objectives you decide the minSup and minConf.
Obviously, if you set these values lower, then your algorithm will take longer to execute and you will get a lot of results.

The minimum support and minimum confidence parameters are a user preference. If you want a larger quantity of results (with lower statistical confidence), choose the parameters appropriately. In theory you can set them to 0. The algorithm will run, but it will take a long time, and the result will not be particularly useful, as it contains just about anything.
So choose them so that the result suit your needs. Mathematically, any value is "correct".

Graph - strongly connected components

Is there any fast way to determine the size of the largest strongly connected component in a graph?
I mean, like, the obvious approach would mean determining every SCC (could be done using two DFS calls, I suppose) and then looping through them and taking the maximum.
I'm pretty sure there has to be some better approach if I only need to have the size of that component and only the largest one, but I can't think of a good solution. Any ideas?
Thanks.

Let me answer your question with another question -
How can you determine which value in a set is the largest without examining all of the values?

Firstly you could use Tarjan's algorithm which needs only one DFS instead of two. If you understand the algorithm clearly, the SCCs form a DAG and this algo finds them in the reverse topological sort order. So if you have a sense of the graph (like a visual representation) and if you know that relative big SCCs occur at end of the DAG then you could stop the algorithm once first few SCCs are found.

What is the best autocomplete/suggest algorithm,datastructure [C++/C]

We see Google, Firefox some AJAX pages show up a list of probable items while user types characters.
Can someone give a good algorithm, data structure for implementing autocomplete?

A trie is a data structure that can be used to quickly find words that match a prefix.
Edit: Here's an example showing how to use one to implement autocomplete http://rmandvikar.blogspot.com/2008/10/trie-examples.html
Here's a comparison of 3 different auto-complete implementations (though it's in Java not C++).
* In-Memory Trie
* In-Memory Relational Database
* Java Set
When looking up keys, the trie is marginally faster than the Set implementation. Both the trie and the set are a good bit faster than the relational database solution.
The setup cost of the Set is lower than the Trie or DB solution. You'd have to decide whether you'd be constructing new "wordsets" frequently or whether lookup speed is the higher priority.
These results are in Java, your mileage may vary with a C++ solution.

For large datasets, a good candidate for the backend would be Ternary search trees. They combine the best of two worlds: the low space overhead of binary search trees and the character-based time efficiency of digital search tries.
See in Dr. Dobbs Journal: http://www.ddj.com/windows/184410528
The goal is the fast retrieval of a finite resultset as the user types in. Lets first consider that to search "computer science" you can start typing from "computer" or "science" but not "omputer". So, given a phrase, generate the sub-phrases starting with a word. Now for each of the phrases, feed them into the TST (ternary search tree). Each node in the TST will represent a prefix of a phrase that has been typed so far. We will store the best 10 (say) results for that prefix in that node. If there are many more candidates than the finite amount of results (10 here) for a node, there should be a ranking function to resolve competition between two results.
The tree can be built once every few hours, depending on the dynamism of the data. If the data is in real time, then I guess some other algorithm will give a better balance. In this case, the absolute requirement is the lightning-fast retrieval of results for every keystroke typed which it does very well.
More complications will arise if the suggestion of spelling corrections is involved. In that case, the edit distance algorithms will have to be considered as well.
For small datasets like a list of countries, a simple implementation of Trie will do. If you are going to implement such an autocomplete drop-down in a web application, the autocomplete widget of YUI3 will do everything for you after you provide the data in a list. If you use YUI3 as just the frontend for an autocomplete backed by large data, make the TST based web services in C++, and then use script node data source of the autocomplete widget to fetch data from the web service instead of a simple list.

Segment trees can be used for efficiently implementing auto complete

If you want to suggest the most popular completions, a "Suggest Tree" may be a good choice:
Suggest Tree

For a simple solution : you generate a 'candidate' with a minimum edit (Levenshtein) distance (1 or 2) then you test the existence of the candidate with a hash container (set will suffice for a simple soltion, then use unordered_set from the tr1 or boost).
Example:
You wrote carr and you want car.
arr is generated by 1 deletion. Is arr in your unordered_set ? No. crr is generated by 1 deletion. Is crr in your unordered_set ? No. car is generated by 1 deletion. Is car in your unordered_set ? Yes, you win.
Of course there's insertion, deletion, transposition etc...
You see that your algorithm for generating candidates is really where you’re wasting time, especially if you have a very little unordered_set.

How to test methods that may not always give correct answer

Imagine that you have an internally controlled list of vendors. Now imagine that you want to match unstructured strings against that list. Most will be easy to match, but some may be reasonably impossible. The algorithm will assign a confidence to each match, but a human needs to confirm all matches produced.
How could this algorithm be unit tested? The only idea I have had so far is to take a sample of pairs matched by humans and make sure the algorithm is able to successfully match those, omitting strings that I couldn't reasonably expect our algorithm to handle. Is there a better way?

i'd try some 'canonical' pairs, both "should match" and "shouldn't match" pairs, and test only if the confidence is above (or below) a given threshold.
maybe you can also do some ordering checks, such as "no pair should have greater confidence than the one from the exact match pair", or "the pair that matches all consonants should be >= the only vowels one".

You can also test if the confidence of strings your algorithm won't handle well is sufficiently low. In this way you can see if there is a threshold over which you can trust your algorithm as well.

An interesting exercise would be to store the human answers that correct your algorithm and try to see if you could improve your algorithm to not get them wrong.
If you can, add the new matches to the unit tests.

I don't think there's a better way than what you describe; effectively, you're just using a set of predefined data to test that the algorithm does what you expect. For any very complicated algorithm which has very nonlinear inputs and outputs, that's about the best you can do; choose a good test set, and assure that you run properly against that set of known values. If other values come up which need to be tested in the future, you can add them to the set of tested values.

That sound fair. If it's possible (given time constraints) get as large of a sample of human matches as possible, you could get a picture of how well your algorithm is doing. You could design specific unit tests which pass if they're within X% of correctness.
Best of luck.

I think there are two issues here: The way your code behaves according to the algorithm, and the way the algorithm is successful (i.e does not accept answers which a human later rejects, and does not reject answers a human would accept).
Issue 1 is regular testing. Issue 2 I would go with previous result sets (i.e. compare the algorithm's results to human ones).

What you describe is the best way because it is subjective what is the best match, only a human can come up with the appropriate test cases.

It sounds as though you are describing an algorithm which is deterministic, but one which is sufficiently difficult that your best initial guess at the correct result is going to be whatever your current implementation delivers to you (aka deterministic implementation to satisfy fuzzy requirements).
For those sorts of circumstances, I will use a "Guru Checks Changes" pattern. Generate a collection of inputs, record the outputs, and in subsequent runs of the unit tests, verify that the outcome is consistent with the previous results. Not so great for ensuring that the target algorithm is implemented correctly, but it is effective for ensuring that the most recent refactoring hasn't changed the behavior in the test space.
A variation of this - which may be more palatable for your circumstance, is to start from the same initial data collection, but rather than trying to preserve precisely the same result every time you instead predefine some buckets, and flag any time an implementation change moves a test result from one confidence bucket to another.
Samples that have clearly correct answers (exact matches, null matches, high value corner cases) should be kept in a separate test.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js