Could anyone please clarify the difference between an input attribute and a predictable attribute for the decision tree algorithm in data mining?
Thanks.
These are concepts pertaining to Naive Bayesian models. Basically, an input attribute is an attribute that is given, i.e. something that you know as a fact from the outside, something you can observe. A predictable attribute is an attribute that cannot be observed directly but can hopefully be computed or otherwise derived from a combination of various input attributes.
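For instance, here is a toy sketch of the distinction using scikit-learn's CategoricalNB (my choice of library for illustration, not the SQL Server tooling from the link below; the customer attributes are made up):

```python
# Hypothetical example: predict "buys_bike" (the predictable attribute)
# from observed columns (the input attributes) with a Naive Bayes model.
from sklearn.naive_bayes import CategoricalNB

# Input attributes: [age_group, income_level, has_children], integer-encoded.
X = [
    [0, 2, 1],
    [1, 1, 0],
    [2, 0, 1],
    [0, 1, 0],
]
# Predictable attribute: 1 = buys a bike, 0 = does not.
y = [1, 0, 1, 0]

model = CategoricalNB().fit(X, y)

# For a new customer we only observe the input attributes;
# the predictable attribute is derived from them by the model.
print(model.predict([[1, 2, 1]]))
```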
This is a decent explanation of a Naive Bayesian model implementation: http://technet.microsoft.com/en-us/library/ms174806.aspx
Hope it helps.
I am new to samplers and don't understand why we should use a weighted random sampler. Can anyone explain it to me? Also, should we use a weighted random sampler for the validation set?
This is very much a PyTorch-independent question and, as such, might appear a bit off-topic.
Take a classification task: your dataset may contain more instances of a certain class, making that class overrepresented. This can lead to problems, because during training your model will be presented with more instances from that class than from the others, and it can become biased towards the prominent class.
To counter that, you can use a weighted sampler that effectively levels the unequal numbers of instances so that, on average, during one epoch the model sees roughly as many examples from each of your classes. This gives you balanced learning across classes, regardless of how many instances you actually have per class.
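A minimal sketch of how this is usually wired up in PyTorch (the toy dataset and the 90/10 class split are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy imbalanced dataset: 90 samples of class 0, 10 samples of class 1.
labels = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])
data = torch.randn(100, 8)
dataset = TensorDataset(data, labels)

# Weight each sample by the inverse frequency of its class,
# so both classes are drawn roughly equally often per epoch.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(dataset),
                                replacement=True)

loader = DataLoader(dataset, batch_size=16, sampler=sampler)

# Each epoch now yields an approximately class-balanced stream of batches.
for x, y in loader:
    pass  # training step would go here
```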
To answer your second question, I don't think you should be using a weighted sampler on your validation set. There is no need to adopt a specific sampling policy there: the point of validation is to measure the performance of your fixed model on unseen data. The same goes for the test set, where you won't have access to the class statistics needed to build a weighted sampler anyway.
I would like to know if there is a way to take a nonlinear model in Pyomo format from the user and load it into the Pyomo optimization code. I'm asking this since I'm dealing with nonlinear models and need to develop a generic code to implement a particular type of optimization using these models. Kindly help me out.
Thank you
I'm working with data mining, and I'm familiar with classification, clustering and regression tasks. In classification, one can have a lot of instances (e.g. animals), their features (e.g. number of legs) and a class (e.g. mammal, reptile).
But what I need to accomplish is, given some attributes, including the class attribute, to determine which unique instance I'm referring to (e.g. giraffe). I can supply all the known attributes I have, and if the model can't figure out the answer, it can ask for another attribute, analogous to a 20-questions style of game.
So, my question is: does this specific task have a name? It seems similar to classification where the class is unique to each instance, but that wouldn't fit the usual training models, except perhaps a decision tree model.
Your inputs, called features in machine learning, are tuples of species (what I think you mean by "instance") and physical attributes. Your outputs are broader taxonomic ranks, so assigning one to each input is a classification problem. Since your features are incomplete, you want to perform classification with incomplete data, or impute the missing features first. Searching for these terms will give you enough leads.
(And the other task is properly called clustering.)
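If you go the imputation route, a minimal scikit-learn sketch could look like the following (the animal features and the choice of classifier are placeholders, not something from the question):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Features with missing values encoded as NaN (e.g. an unknown attribute).
X = np.array([
    [4.0, 1.0, np.nan],
    [2.0, np.nan, 1.0],
    [0.0, 0.0, 0.0],
    [4.0, 1.0, 1.0],
])
y = ["mammal", "bird", "reptile", "mammal"]

# Fill in missing features (here with the column mean) before classifying.
clf = make_pipeline(SimpleImputer(strategy="mean"), DecisionTreeClassifier())
clf.fit(X, y)

# The pipeline imputes the NaN at prediction time as well.
print(clf.predict([[2.0, np.nan, 0.0]]))
```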
IMHO you are simply looking for a decision tree.
Except that you don't train it on your categorical attribute (your "class"), but on the individual instance label.
You need to carefully choose the splitting measure though, as many measures rely on class sizes, and all your classes now have size 1. Finding a good split for the decision tree may require planning some splits ahead to get a well-balanced tree. A random-forest-like approach may help improve the chance of finding a good tree.
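A minimal sketch of that idea with scikit-learn (the animal features are made up; the key point is that each instance is its own label):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row is one unique instance; features are [legs, long_neck, lays_eggs].
X = [
    [4, 1, 0],   # giraffe
    [4, 0, 0],   # dog
    [2, 0, 1],   # chicken
    [0, 0, 1],   # snake
]
# Train on the instance label itself instead of a broader class.
y = ["giraffe", "dog", "chicken", "snake"]

tree = DecisionTreeClassifier().fit(X, y)

# The resulting tree is effectively a "20 questions" script:
# each internal node asks about one attribute.
print(export_text(tree, feature_names=["legs", "long_neck", "lays_eggs"]))
print(tree.predict([[4, 1, 0]]))  # -> ['giraffe']
```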
Similar to this question Pivot Table in c#, I'm looking for an implementation of a pivot table in C++. Due to the project requirements, speed is fairly critical, and the rest of the performance-critical part of the project is written in C++, so an implementation in C++ or callable from C++ would be highly desirable. Does anyone know of implementations of a pivot table similar to the one found in Excel or OpenOffice?
I'd rather not have to code such a thing from scratch, but if I were to do it, how should I go about it? What algorithms and data structures would be good to be aware of? Any links to an algorithm would be greatly appreciated.
I am sure you are not asking for the full feature set of Excel's pivot tables. I think you want a simple statistics table based on discrete explanatory variables and a given statistic. If so, I think this is a case where writing it from scratch might be faster than studying other implementations.
Just update a std::map (or similar data structure) whose key represents the combination of explanatory variables and whose value holds the given statistic, while the program reads each data point.
Once the reading is done, it's just a matter of organizing the output table from the map, which might be trivial depending on your goal.
I believe most of the C# examples in the question you linked take this approach anyway.
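A compact Python sketch of that map-of-buckets idea (the shipment fields are made up; it translates almost one-to-one to a std::map keyed by a tuple or struct in C++):

```python
from collections import defaultdict

# Raw records: (week, warehouse, quantity_shipped)
records = [
    ("2023-W01", "east", 10),
    ("2023-W01", "west", 4),
    ("2023-W02", "east", 7),
    ("2023-W01", "east", 3),
]

# Key = combination of explanatory variables, value = running statistic (a sum here).
totals = defaultdict(int)
for week, warehouse, qty in records:
    totals[(week, warehouse)] += qty

# Organizing the output table from the map is then straightforward.
for (week, warehouse), total in sorted(totals.items()):
    print(week, warehouse, total)
```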
I'm not aware of an existing implementation that would suit your needs, so, assuming you were to write one...
I'd suggest using SQLite to store your data and SQL to compute the aggregates (note: SQL won't do median; I suggest an abstraction at some stage to allow such behaviour). The benefit of using SQLite is that it's pretty flexible and extremely robust, plus it lets you take advantage of their hard work in storing and manipulating data. Wrapping the interface you expect from your pivot table around this concept seems like a good way to start and should save you quite a lot of time.
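For illustration, here is roughly what that aggregation looks like with SQLite (shown through Python's sqlite3 bindings for brevity; the table and columns are invented, and in C++ you would go through the SQLite C API instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (week TEXT, warehouse TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO shipments VALUES (?, ?, ?)",
    [("2023-W01", "east", 10), ("2023-W01", "west", 4), ("2023-W02", "east", 7)],
)

# Let SQL do the pivot-style aggregation; the wrapper around your
# pivot-table interface would generate queries like this one.
rows = conn.execute(
    "SELECT week, warehouse, SUM(qty) FROM shipments GROUP BY week, warehouse"
).fetchall()
print(rows)
```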
You could then combine this with a model-view-controller architecture for UI components; I anticipate that would work like a charm. I'm a very satisfied user of Qt, so in this regard I'd suggest using Qt's QTableView in combination with QStandardItemModel (if I can get away with it) or QAbstractItemModel (if I have to). Not sure if you wanted this suggestion, but it's there if you want it :).
Hope that gives you a starting point, any questions or additions, don't hesitate to ask.
I think the reason your question didn't get much attention is that it's not clear what your input data is, nor which pivot table options you want to support.
A pivot table, in its basic form, runs through the data and aggregates values into buckets. For example, say you want to see how many items you shipped each week from each warehouse over the last few weeks:
You would create a multi-dimensional array of buckets (rows are weeks, columns are warehouses) and run through the data, deciding which bucket each record belongs to, adding that record's amount to the bucket, and moving on to the next record.
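A short sketch of that bucket grid, with weeks as rows and warehouses as columns (the labels and records are made up):

```python
# Pre-defined row and column labels for the bucket grid.
weeks = ["2023-W01", "2023-W02"]
warehouses = ["east", "west"]

# Multi-dimensional array of buckets, all starting at zero.
buckets = [[0 for _ in warehouses] for _ in weeks]

records = [("2023-W01", "east", 10), ("2023-W01", "west", 4), ("2023-W02", "east", 7)]

# Decide which bucket each record belongs to and add its amount.
for week, warehouse, qty in records:
    buckets[weeks.index(week)][warehouses.index(warehouse)] += qty

for week, row in zip(weeks, buckets):
    print(week, row)
```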
I'm looking for a good algorithm / method to check the data quality in a data warehouse.
Therefore I want some algorithm that "knows" the possible structure of the values, checks whether each value conforms to that structure, and then decides whether it is correct or not.
I thought about defining a regexp and then checking whether each value matches it.
Is this a good way? Are there some good alternatives? (Any research papers?)
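The regexp idea from the question might look roughly like this (the attribute names and patterns are just examples):

```python
import re

# One pattern per attribute describing the expected structure of its values.
patterns = {
    "zip_code": re.compile(r"^\d{5}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "amount": re.compile(r"^-?\d+(\.\d{1,2})?$"),
}

record = {"zip_code": "1234", "email": "john@example.com", "amount": "19.99"}

# Flag each value as correct / not correct depending on whether it matches.
for field, value in record.items():
    ok = bool(patterns[field].match(value))
    print(field, value, "OK" if ok else "SUSPECT")
```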
I have seen some authors suggest adding a special dimension called a data quality dimension to further describe each fact table record.
Typical values in a data quality dimension could then be “Normal value,” “Out-of-bounds value,” “Unlikely value,” “Verified value,” “Unverified value,” and “Uncertain value.”
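A toy sketch of assigning such a quality flag to each fact record (the thresholds, fields, and sample facts are made up for illustration):

```python
# Classify each fact record's measure into a data quality dimension value.
def quality_flag(amount, lower=0.0, upper=10_000.0, unlikely=5_000.0):
    if amount < lower or amount > upper:
        return "Out-of-bounds value"
    if amount > unlikely:
        return "Unlikely value"
    return "Normal value"

facts = [{"order_id": 1, "amount": 120.0},
         {"order_id": 2, "amount": 7_500.0},
         {"order_id": 3, "amount": -3.0}]

# Each fact row then carries (a key into) its data quality dimension value.
for fact in facts:
    fact["dq_flag"] = quality_flag(fact["amount"])
    print(fact)
```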
I would recommend using a dedicated data quality tool, like DataCleaner (http://datacleaner.eobjects.org), which I have been doing quite a lot of work on.
You need a tool that not only checks strict rules like constraints, but also gives you a profile of your data and makes it easy for you to explore and identify inconsistencies on your own. Try for example the "Pattern finder", which will tell you the patterns of your string values, something that will often reveal outliers and erroneous values. You can also use the tool for actually cleansing the data, by transforming values, extracting information from them, or enriching them using third-party services. Good luck improving your data quality!