attribute selection+weka+Naive Bayes - weka

I wonder which of the following three methods is the best way to perform attribute selection:
using a meta-classifier
the filter approach
the native approach, using the attribute selection classes directly
The classifier that I'm using is Naive Bayes.
Could anyone guide me to find the best choice?

There is a theorem called No Free Lunch. You should try all three of them on your problem and measure the results in your domain.

Well, there is no single answer.
1. You can use a decision tree classifier (or a tree ensemble, such as bagged trees) and select the attributes on which the classifier makes its branching decisions. You can inspect the tree to see each branch and the attribute it splits on; these attributes are the important ones.
2. You can use the forward selection or backward elimination technique.
(a) In forward selection, start with the single feature for which the error on the validation set is lowest. Then, with this feature included in your feature pool, try the rest of the features one at a time and add the one that again gives you the lowest error.
(b) In backward elimination, start with all the features and measure the error rate. Then remove each feature in turn, one at a time, and drop from your feature pool the one whose removal gives the largest decrease in error.
Continue the process until you are satisfied with your number of features (the stopping criterion).
3. I personally use the Ranker algorithm with the InfoGain attribute evaluator to rank the attributes first, and then use either 2(a) or 2(b) to select attributes.
4. For the error measure, you can consider root mean squared error. Other measures can work well too.
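The forward selection procedure in (a) can be sketched roughly as follows. This is a minimal illustration, not Weka code: the names forward_selection, error_of and toy_error are hypothetical, and in practice error_of would train Naive Bayes on the given feature subset and return its validation error.

```python
from typing import Callable, List, Sequence


def forward_selection(
    features: Sequence[str],
    error_of: Callable[[List[str]], float],
    max_features: int,
) -> List[str]:
    """Greedily add the feature that gives the lowest error at each step."""
    selected: List[str] = []
    best_error = float("inf")
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Score each candidate when it is added to the current feature pool.
        err, feat = min((error_of(selected + [f]), f) for f in candidates)
        if err >= best_error:
            break  # stopping criterion: no candidate improves the error
        selected.append(feat)
        best_error = err
    return selected


def toy_error(subset: List[str]) -> float:
    """Made-up error function standing in for a real validation run."""
    gain = {"a": 0.3, "b": 0.2}  # "noise" contributes nothing
    return 1.0 - sum(gain.get(f, 0.0) for f in subset) + 0.01 * len(subset)


chosen = forward_selection(["a", "b", "noise"], toy_error, max_features=3)
print(chosen)
```

Backward elimination follows the same pattern, starting from the full pool and removing the feature whose removal lowers the error the most.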

Related

Best way to feature select using PCA (discussion)

Terminology:
Component: PC
loading-score[i,j]: the loading of feature j in PC[i]
Question:
I know that questions about feature selection have been asked several times here at StackOverflow (SO) and on other tech pages, with different answers and discussions proposed. That is why I want to open a discussion of the different solutions, rather than post it as a general question, since that has been done.
Different methods are proposed for feature selection using PCA. For instance, one uses the dot product between the original features and the components (here) to get their correlation; a discussion at SO here suggests that you can only talk about important features as loading-scores in a component (and not use that importance in the input space); and another discussion at SO (which I cannot find at the moment) suggests that the importance of feature[j] would be sum(abs(loading_score[:,j])), i.e. the sum of the absolute values of loading_score[i,j] over all components i.
I personally would think that a way to get the importance of a feature would be an absolute sum where each loading_score[i,j] is weighted by the explained variance of component i, i.e.
imp_feature[j] = sum_i (abs(loading_score[i,j]) * explained_variance[i]).
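To make the proposed weighting concrete, here is a minimal sketch in plain Python. The function name weighted_importance and the small loadings matrix are made up for illustration; loadings[i][j] is assumed to hold the loading of feature j in component i.

```python
from typing import List


def weighted_importance(
    loadings: List[List[float]], explained_variance: List[float]
) -> List[float]:
    """imp_feature[j] = sum_i abs(loadings[i][j]) * explained_variance[i]."""
    n_components = len(loadings)
    n_features = len(loadings[0])
    return [
        sum(abs(loadings[i][j]) * explained_variance[i] for i in range(n_components))
        for j in range(n_features)
    ]


# Hypothetical 2-component, 2-feature example.
loadings = [[0.8, -0.6], [0.6, 0.8]]
explained_variance = [3.0, 1.0]
importance = weighted_importance(loadings, explained_variance)
print(importance)
```

Here feature 0 scores higher than feature 1 because it loads more heavily on the component with the larger explained variance.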
Well, there is no universal way to select features; it totally depends on the dataset and some insights available about the dataset. I will provide some examples which might be helpful.
Since you asked about PCA: it decomposes the dataset into orthogonal components ordered by the variance they explain. ICA (Independent Component Analysis), on the other hand, is able to extract multiple independent features simultaneously. Consider the following example.
In this example, we mix three independent signals and try to separate them using ICA and PCA. In this case, ICA does it better than PCA. In general, if you search for Blind Source Separation (BSS), you may find more information about this. Besides, in this example we know the independent components, so separation is easy. In general, we do not know the number of components. Hence, you may have to guess based on some prior information about the dataset. You may also use LDA (Linear Discriminant Analysis) to reduce the number of features.
Once you have extracted components using any of these techniques, you can visualize them by treating the extracted components as random variables, i.e., x, y, z.
For more information, you may refer to the original source, from which I took the two figures.
Coming back to your proposition,
imp_feature[j] = sum_i (abs(loading_score[i,j]) * explained_variance[i])
I would not recommend this way, for the following reasons:
abs(loading_score[i,j]): when you take absolute values, you may lose the positive or negative correlations of the features considered.
explained_variance[i] may be used to find the correlation between features, but multiplying by it does not make any sense.
Edit:
In PCA, each component has its own explained variance. The explained variance ratio is the ratio between an individual component's variance and the total variance (the sum of all the individual components' variances). The significance of an extracted feature (component) can be measured by the magnitude of its explained variance.
All in all, what I want to say is that feature selection depends entirely on the dataset and the significance of its features. PCA is just one technique. First understand the properties of the features and the dataset. Then try to extract features. Hope this helps. If you can provide us with an exact example, we may be able to provide more insights.
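As a small worked illustration of that ratio (the function name and the variances are made up for the example):

```python
from typing import List


def explained_variance_ratio(component_variances: List[float]) -> List[float]:
    """Each component's variance divided by the sum of all component variances."""
    total = sum(component_variances)
    return [v / total for v in component_variances]


# Two components with variances 3.0 and 1.0.
ratios = explained_variance_ratio([3.0, 1.0])
print(ratios)
```

With these numbers the first component accounts for three quarters of the total variance.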

SCIP and Branch and Price

I have a general question about SCIP. I need to use SCIP as a branch-and-price framework for my problem. I code in C++, so I used the VRP example as a template. On some instances, the code stops at a fractional solution and returns it as the optimal solution. I think something is wrong: do I have to set some parameters to tell SCIP to look for an integer solution, or did I make a mistake? I believe it should not stop, but instead branch on the fractional solution until it reaches an integer solution (with no remaining negative reduced cost columns). I also solve the subproblem to optimality. Any comments?
If you define your variables to be continuous and just add a pricer, SCIP will solve the master problem to optimality (i.e., solve the restricted master, add improving columns, solve the updated restricted master, and so on, until no more improving columns are found).
There is no reason for SCIP to check if the solution is integral, because you explicitly said that you don't mind whether the values of the variables are integral or not (by defining them to be continuous). On the other hand, if you define the variables to be of integral (or binary) type, SCIP will do exactly as I described before, but at the end check whether all integral variables have an integral value and branch if this is not the case.
However, you should note that all branching rules in SCIP branch on variables, i.e., they take an integer variable with a fractional value and split its domain; a binary variable would be fixed to 0 in one child node and to 1 in the other. This is typically a bad idea for branch-and-price. First of all, it's quite unbalanced: you have a huge number of variables, of which only a few will have value 1 in the end and most will be 0. Fixing a variable to 1 therefore has a high impact, while fixing it to 0 has almost no impact. But more importantly, you need to take the branching decision into account in your pricing problem. If you fixed a variable to 0, you have to keep the pricer from generating a copy of the forbidden column (which would probably improve the LP solution, because it was part of the former optimal solution). In order to do this, you might need to look for the 2nd-best (or, later, the k-th best) solution. Since you are solving the pricing problems as a MIP with SCIP, you might just add a constraint forbidding this solution (logicor (linear) for binary variables or bounddisjunction (not linear) for general integer variables).
I would recommend implementing your own branching rule, one that takes into account that you are doing branch-and-price and branches in a way that is more balanced and does not harm your pricing too much. For an example, check out the Ryan&Foster branching rule, which is the standard for binary problems with a set-partitioning master structure. This rule is implemented in the Binpacking as well as the Coloring example shipped with SCIP.
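To show the idea behind Ryan&Foster branching, here is a minimal sketch in plain Python. It is not SCIP's API and not the code from the Binpacking or Coloring examples; the function names and the toy LP solution are hypothetical. Columns are assumed to be sets of items in a set-partitioning master, mapped to their LP values. The rule looks for two items whose joint coverage is fractional, then creates one child where they must share a column and one where they must not.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Optional, Tuple


def find_ryan_foster_pair(
    columns: Dict[FrozenSet[str], float], eps: float = 1e-6
) -> Optional[Tuple[str, str]]:
    """Return items (r, s) whose joint coverage by the LP solution is fractional."""
    items = sorted(set().union(*columns))
    for r, s in combinations(items, 2):
        # Total LP value of the columns that contain both r and s.
        together = sum(x for col, x in columns.items() if r in col and s in col)
        if eps < together < 1.0 - eps:
            return r, s
    return None  # no such pair found


def allowed_in_branch(col: FrozenSet[str], r: str, s: str, same: bool) -> bool:
    """Check whether a column respects the 'same' or the 'differ' branch."""
    if same:
        return (r in col) == (s in col)  # both items or neither
    return not (r in col and s in col)   # never both together


# Hypothetical fractional LP solution: every item is covered with total value 1.
lp_solution = {
    frozenset({"A", "B"}): 0.5,
    frozenset({"B", "C"}): 0.5,
    frozenset({"A", "C"}): 0.5,
}
pair = find_ryan_foster_pair(lp_solution)
print(pair)
```

The pricing problem then has to enforce the same restriction, so that it only generates columns for which allowed_in_branch holds at the current node.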
Please also check out the SCIP FAQ, where there is a whole section about branch-and-price which also covers the topic of branching (in particular, how branching decisions can be stored and enforced by a constraint handler, which is something you need to do for Ryan&Foster branching): http://scip.zib.de/doc/html/FAQ.php
There have also been a lot of questions about branch-and-price on the SCIP mailing list: http://listserv.zib.de/mailman/listinfo/scip/. If you want to search it, you can use Google and search for "site:listserv.zib.de scip search-string".
Finally, I recommend having a look at the GCG project: http://www.or.rwth-aachen.de/gcg/
It is an extension of SCIP to a generic branch-cut-and-price solver, i.e., you do not need to implement anything; you just put in an original formulation of your model, which is then reformulated by a Dantzig-Wolfe decomposition and solved via branch-cut-and-price. You can supply the structure for the reformulation, pricing problems are solved as a MIP (as you also do), and there are different branching rules as well. GCG is also part of the SCIP optimization suite and can easily be built within the suite.

Which Data Mining task to retrieve a unique instance

I'm working with data mining, and I'm familiar with classification, clustering and regression tasks. In classification, one can have a lot of instances (e.g. animals), their features (e.g. number of legs) and a class (e.g. mammal, reptile).
But what I need to accomplish is, given some attributes, including the class attribute, to determine which unique instance I'm referring to (e.g. giraffe). I can supply all known attributes that I have, and if the model can’t figure out the answer, it can ask for another attribute – just analogous to a 20 questions style of game.
So, my question is: does this specific task have a name? It seems to be similar to classification, where the class is unique to each instance, but this wouldn’t fit on the current training models, except perhaps for a decision tree model.
Your inputs, denoted features in machine learning, are tuples of species (what I think you mean by "instance"), and physical attributes. Your outputs are broader taxonomic ranks. Thus, assigning one to each input is a classification problem. Since your features are incomplete, you want to perform ... classification with incomplete data, or impute missing features. Searching for these terms will give you enough leads.
(And the other task is properly called clustering.)
IMHO you are simply looking for a decision tree.
Except that you don't train it on your categorical attribute (your "class"), but on the individual instance label.
You need to choose the splitting measure carefully, though, as many measures depend on class sizes, and all your classes have size 1 now. Finding a good split for the decision tree may involve planning some splits ahead to get an optimally balanced tree. A random-forest-like approach may be of use to improve the chance of finding a good tree.
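A rough sketch of that idea, assuming every instance is its own class: at each step, ask about the attribute whose answer leaves the smallest worst-case group of candidates. The animals, attribute names and function names below are made up for illustration, and this greedy rule is just one possible splitting measure.

```python
from collections import Counter
from typing import Dict, Optional

Instance = Dict[str, str]


def remaining(instances: Dict[str, Instance], known: Dict[str, str]) -> Dict[str, Instance]:
    """Keep only the instances consistent with the attributes known so far."""
    return {
        name: attrs
        for name, attrs in instances.items()
        if all(attrs.get(k) == v for k, v in known.items())
    }


def next_question(candidates: Dict[str, Instance], known: Dict[str, str]) -> Optional[str]:
    """Pick the unknown attribute whose largest answer group is smallest."""
    best_attr, best_worst = None, len(candidates)
    unknown = sorted({a for inst in candidates.values() for a in inst} - set(known))
    for attr in unknown:
        counts = Counter(inst.get(attr) for inst in candidates.values())
        worst = max(counts.values())
        if worst < best_worst:
            best_attr, best_worst = attr, worst
    return best_attr  # None when no attribute narrows the candidates further


animals = {
    "giraffe": {"class": "mammal", "legs": "4", "neck": "long"},
    "horse": {"class": "mammal", "legs": "4", "neck": "short"},
    "bat": {"class": "mammal", "legs": "2", "neck": "short"},
    "dolphin": {"class": "mammal", "legs": "0", "neck": "short"},
    "snake": {"class": "reptile", "legs": "0", "neck": "short"},
}

known = {"class": "mammal"}
first = next_question(remaining(animals, known), known)
known.update({"legs": "4", "neck": "long"})
final = sorted(remaining(animals, known))
print(first, final)
```

Looking several questions ahead instead of one would correspond to the "planning some splits ahead" mentioned above.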

algorithm for data quality in a data warehouse

I'm looking for a good algorithm / method to check the data quality in a data warehouse.
Therefore I want to have some algorithm that "knows" the possible structure of the values, checks whether each value fits this structure, and then decides whether it is correct or not.
I thought about defining a regexp and then checking each value to see whether it matches.
Is this a good way? Are there some good alternatives? (Any research papers?)
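The regexp idea could look roughly like this in Python. The function name invalid_values and the five-digit rule are assumptions made for the example; a real warehouse would need one rule per column.

```python
import re
from typing import Iterable, List


def invalid_values(values: Iterable[str], pattern: str) -> List[str]:
    """Return the values that do not match the expected structure as a whole."""
    rule = re.compile(pattern)
    return [v for v in values if rule.fullmatch(v) is None]


# Hypothetical rule: a code must consist of exactly five digits.
bad = invalid_values(["12345", "1234a", "98765", ""], r"\d{5}")
print(bad)
```

Using fullmatch rather than search ensures that the whole value has to fit the structure, not just a part of it.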
I have seen some authors suggest adding a special dimension, called a data quality dimension, to describe each fact table record further.
Typical values in a data quality dimension could then be “Normal value,” “Out-of-bounds value,” “Unlikely value,” “Verified value,” “Unverified value,” and “Uncertain value.”
I would recommend using a dedicated data quality tool, like DataCleaner (http://datacleaner.eobjects.org), which I have been doing quite a lot of work on.
You need a tool that not only checks strict rules like constraints, but also gives you a profile of your data and makes it easy for you to explore and identify inconsistencies on your own. Try for example the "Pattern finder", which will tell you the patterns of your string values, something that will often reveal outliers and erroneous values. You can also use the tool for actually cleansing the data, by transforming values, extracting information from them or enriching them using third-party services. Good luck improving your data quality!
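The general idea of pattern profiling can be sketched as follows. This is not DataCleaner's implementation, only an illustration under simple assumptions: letters are mapped to "A" or "a", digits to "9", and patterns that occur rarely are treated as suspects. The function names and sample values are made up.

```python
import re
from collections import Counter
from typing import Iterable


def value_pattern(value: str) -> str:
    """Abstract a string into its shape: upper -> A, lower -> a, digit -> 9."""
    pattern = re.sub(r"[A-Z]", "A", value)
    pattern = re.sub(r"[a-z]", "a", pattern)
    return re.sub(r"[0-9]", "9", pattern)


def pattern_profile(values: Iterable[str]) -> Counter:
    """Count how often each shape occurs in a column."""
    return Counter(value_pattern(v) for v in values)


profile = pattern_profile(["AB-1234", "CD-5678", "ef-9012", "XY-12"])
rare = sorted(p for p, n in profile.items() if n == 1)
print(profile.most_common(1), rare)
```

Unlike a hand-written regexp, this needs no prior knowledge of the expected structure; the dominant pattern emerges from the data and the rare ones point to values worth inspecting.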

appropriate minimum support for itemset?

Please suggest any kind of material about appropriate minimum support and confidence for itemsets!
I use the Apriori algorithm to search for frequent itemsets. I still don't know the appropriate support and confidence for itemsets. I would like to know what considerations go into deciding how large the support should be.
The answer is that the appropriate values depend on the data.
For some datasets, the best value may be 0.5. But for some other datasets it may be 0.05. It depends on the data.
But if you set minsup = 0 and minconf = 0, some algorithms will run out of memory before terminating, or you may run out of disk space because there are too many patterns.
From my experience, the best way to choose minsup and minconf is to start with high values and then lower them gradually until you find enough patterns.
Alternatively, if you don't want to have to set minsup, you can use a top-k algorithm where, instead of specifying minsup, you specify for example that you want the k most frequent rules. For example, k = 1000 rules.
If you are interested in top-k association rule mining, you can check my Java code here:
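The "start high and lower gradually" strategy can be sketched like this. The miner below is a brute-force counter used only to keep the example short; it is not Apriori and not the SPMF code, and the function names, thresholds and transactions are made up. Minimum support is expressed as an integer percentage to avoid floating-point drift while stepping down.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple


def frequent_itemsets(
    transactions: List[Set[str]], minsup_pct: int, max_size: int = 3
) -> Dict[FrozenSet[str], float]:
    """Brute-force search for itemsets with support >= minsup_pct percent."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    found: Dict[FrozenSet[str], float] = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count * 100 >= minsup_pct * n:
                found[frozenset(combo)] = count / n
    return found


def lower_until_enough(
    transactions: List[Set[str]], wanted: int,
    start_pct: int = 90, step_pct: int = 10, floor_pct: int = 10,
) -> Tuple[int, Dict[FrozenSet[str], float]]:
    """Lower minsup step by step until at least `wanted` itemsets are found."""
    pct, found = start_pct, {}
    for pct in range(start_pct, floor_pct - 1, -step_pct):
        found = frequent_itemsets(transactions, pct)
        if len(found) >= wanted:
            break
    return pct, found


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
minsup_pct, itemsets = lower_until_enough(transactions, wanted=3)
print(minsup_pct, sorted(sorted(s) for s in itemsets))
```

On this toy data the loop stops at 60 percent, the first threshold at which three itemsets are frequent.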
http://www.philippe-fournier-viger.com/spmf/
The algorithm is called TopKRules and the article describing it will be published next month.
Besides that, you need to know that there are many other interestingness measures besides support and confidence: lift, all-confidence, ... To learn more about this, you can read these articles: "On selecting interestingness measures for association rules" and "A Survey of Interestingness Measures for Association Rules". Basically, all measures have some problems in some cases... no measure is perfect.
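For reference, support, confidence and lift for a single rule can be computed as in the following sketch. The function name rule_measures and the transactions are made up for the example.

```python
from typing import List, Set, Tuple


def rule_measures(
    transactions: List[Set[str]], antecedent: Set[str], consequent: Set[str]
) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_both / n
    confidence = n_both / n_a
    lift = confidence / (n_c / n)
    return support, confidence, lift


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
support, confidence, lift = rule_measures(transactions, {"bread"}, {"milk"})
print(support, confidence, lift)
```

In this toy data the rule bread -> milk has a fairly high confidence but a lift below 1, because milk is even more frequent overall than it is among the bread transactions. This is one example of why no single measure is sufficient.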
Hope this helps!
In any association rule mining algorithm, including Apriori, it is up to the user to decide what support and confidence values they want to provide. Depending on your dataset and your objectives you decide the minSup and minConf.
Obviously, if you set these values lower, then your algorithm will take longer to execute and you will get a lot of results.
The minimum support and minimum confidence parameters are a user preference. If you want a larger quantity of results (with lower statistical confidence), choose lower values. In theory you can set them to 0. The algorithm will run, but it will take a long time, and the result will not be particularly useful, as it contains just about anything.
So choose them so that the results suit your needs. Mathematically, any value is "correct".