Only one ranked attribute, but selected two? InfoGain Ranker in weka

I've run an InfoGain evaluation on my dataset, with a Ranker on threshold 0.1.
My output via the GUI says:
Search Method:
Attribute ranking.
Threshold for discarding attributes: 0.1
Attribute Evaluator (supervised, Class (nominal): 23 class):
Information Gain Ranking Filter
Ranked attributes:
0.141 2 nr_visits
Selected attributes: 2 : 1
In my java implementation, I do the same thing:
Ranker ranker = new Ranker();
ranker.setGenerateRanking(true);
ranker.setThreshold(0.1);
AttributeSelection attsel = new AttributeSelection();
InfoGainAttributeEval eval = new InfoGainAttributeEval();
attsel.setEvaluator(eval);
attsel.setSearch(ranker);
attsel.SelectAttributes(instances);
int[] ranked_attr = attsel.selectedAttributes();
double[][] rawscores = attsel.rankedAttributes();
Where I get similar output:
my ranked_attr is [1, 21] (with 1 being the nr_visits feature, and 21 another)
my rawscores double array does NOT contain ANY entry for 21. It has the 1, and then another feature with a score lower than my threshold.
What gives? Are there one or two selected features? Is this a bug in weka 3.8.4?

Thanks to Eibe on the mailing list:
AFAIK, the set of indices returned by selectedAttributes() includes the index of the class attribute. I assume that attribute 22 in your data is the class attribute. There is no score for the class attribute because it is the attribute that we are trying to predict.
Because yes, the 21 was indeed my class index, which is zero-based in code, one-based in the GUI, which is why I didn't immediately notice.
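The index bookkeeping behind this answer can be checked with a small sketch. It is written in Python purely for illustration and does not call Weka; the helper name drop_class_index and the hard-coded indices are assumptions taken from the numbers in the question.

```python
def drop_class_index(selected, class_index):
    """Return the selected indices without the class attribute's index."""
    return [i for i in selected if i != class_index]

# Zero-based indices as reported in the question's Java output.
selected = [1, 21]
class_index = 21  # zero-based class index, per the question

features = drop_class_index(selected, class_index)

# The GUI numbers attributes from one, so add 1 to compare with its output.
gui_numbers = [i + 1 for i in features]
print(features, gui_numbers)
```

Under these assumptions only one real feature remains, and its one-based number matches the "2 nr_visits" line in the GUI output.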

What is the data type returned for an IF function (SharePoint Calculated Column)?

I have a rather long calculated column which uses the drop-downs of two other columns to return a 'score' based on their values.
A section of it looks like this:
IF(AND([Current Impact]="High",[Current Probability]="Low"),17)
This would return the value 17 if the impact is high and the probability is low in those other columns.
This seems to return the 17 as a string rather than a number because sorting on my calculated column (score) from high to low produces a result sorted on the first digit e.g. 5, 4, 32, 2, 18
I have a workaround to this - I enclose the whole formula in a (an?) =value() function.
However, I'm curious as to why the IF function returns the string data types instead of a number?
As far as I know, the column settings let you specify the data type returned by the calculated formula.
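The sorting symptom in the question is what you get when numbers are compared as text. A minimal Python sketch (not SharePoint code; the values are the ones quoted in the question) shows the difference between sorting the scores as strings and as numbers:

```python
scores_as_text = ["18", "2", "32", "4", "5"]

# Text comparison looks at the first character first, so "5" outranks "32".
text_order = sorted(scores_as_text, reverse=True)

# Converting to numbers first gives the ordering the asker expected.
numeric_order = sorted((int(s) for s in scores_as_text), reverse=True)

print(text_order)
print(numeric_order)
```

The text ordering reproduces the 5, 4, 32, 2, 18 sequence from the question, which is consistent with the column being treated as text until it is converted or its return type is set to a number.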

SAS Action to provide the class probability statistics

I have a vector of nominal values and I need to know the probability of each nominal value occurring. Basically, I need those to obtain the min, max, mean, and std of the probability of observing the nominal values, and to get the Class Entropy value.
For example, let's assume there is a data-set in which the target to predict is 0, 1, or 2. In the training data-set, we can count the number of records whose target is 1 and call it n_1, and similarly define n_0 and n_2. Then, the probability of observing class 1 in the training data-set is simply p_1 = n_1/(n_0 + n_1 + n_2). Once p_0, p_1, and p_2 are obtained, one can get the min, max, mean, and std of these probabilities.
It is easy to get that in Python with pandas, but I want to avoid reading the data-set twice. I was wondering if there is any CAS action in SAS that can provide it to me. Note that I use the Python API of SAS through swat and I need to have the API in Python.
I found the following solution and it works fine. It uses s.dataPreprocess.highcardinality to get the number of classes and then uses s.dataPreprocess.binning to obtain the number of observations within each class. Then, there is just some straightforward calculation.
import numpy as np
import pandas as pd
import swat
# connect to a CAS server (server and port are assumed to be defined elsewhere)
s = swat.CAS(server, port)
# load the action set before calling any of its actions
s.loadactionset(actionset="dataPreprocess")
# load the table
tbl_name = 'hmeq'
s.upload("./data/hmeq.csv", casout=dict(name=tbl_name, replace=True))
# call to get the number of classes (target_var is the name of the target column, assumed to be defined elsewhere)
cardinality_result = s.dataPreprocess.highcardinality(table=tbl_name, vars=[target_var])
cardinality_result_df = pd.DataFrame(cardinality_result["HighCardinalityDetails"])
number_of_classes = int(cardinality_result_df["CardinalityEstimate"].iloc[0])
# call dataPreprocess.binning action to get the number of observations in each class
result_binning = s.dataPreprocess.binning(table=tbl_name, vars=[target_var], nBinsArray=[number_of_classes])
result_binning_df = pd.DataFrame(result_binning["BinDetails"])
# turn the per-class counts into probabilities
probs = result_binning_df["NInBin"]/result_binning_df["NInBin"].sum()
prob_min = probs.min()
prob_max = probs.max()
prob_mean = probs.mean()
prob_std = probs.std()
# class entropy in bits
entropy = -np.sum(probs*np.log2(probs))
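For a quick local cross-check of the final calculation, here is a minimal sketch that uses only the Python standard library on a small in-memory list of labels. It does not touch CAS; the function name class_probability_stats and the toy labels are assumptions for illustration. It uses the sample standard deviation, which is also the pandas default for std().

```python
import math
import statistics
from collections import Counter

def class_probability_stats(labels):
    """Compute class probabilities, their summary statistics, and class entropy."""
    counts = Counter(labels)
    total = sum(counts.values())
    # p_k = n_k / (sum of all class counts)
    probs = {cls: n / total for cls, n in counts.items()}
    values = list(probs.values())
    return {
        "probs": probs,
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        # sample standard deviation; undefined for a single class, so fall back to 0.0
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        # entropy in bits
        "entropy": -sum(p * math.log2(p) for p in values),
    }

stats = class_probability_stats([0, 0, 1, 2])
print(stats)
```

With the toy labels, class 0 has probability 0.5 and classes 1 and 2 have 0.25 each, so the entropy is 1.5 bits.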

specify moment at which to change value after converting (tri)annual-subject to month-subject observations (Stata)

My goal is to convert a subject-triannual data set to one with subject-month observations, and specify the month at which one string variable (named "strvar" below) should change value, according to the var called "exact_time".
I have a data set with four records per subject (subject-year observations, aka a multiple-record-per-subject data set). Information was recorded every three years for each subject as follows:
Table with tri-annual-subject obs. & exact_time var
"strvar" changes its value every three years. The variable "exact_time" records the exact (month.day.year) moment at which "strvar" changes its value. Once "strvar" changes, it keeps the same value for the following months, until the moment indicated by the next value of "exact_time".
I want Stata to change the value of "strvar" according to the variable "exact_time". For instance, subject 1 changed value of "strvar" on April 1, 1992; hence, I want Stata to assign the new value of "strvar" in April 1992. The value of "strvar" for subject 1 should remain the same until "exact_time" changes value (November 30, 1995); hence, starting in November 1995, subject 1 should adopt the new value of "strvar". In 1998, "strvar" of subject 1 changed value once again, this time at the beginning of the following year (January 1, 1999); hence, "strvar" will adopt a new value starting in January 1999, until subject 1's last observation (December 2002). As follows:
table with monthly-subject obs, example
I believe this can be achieved in two steps, the second of which I need your support with:
1. Expand each tri-annual observation 36 times, so as to have monthly-subject observations, i.e., generate var "new_time". I guess this can be achieved through:
expandcl 36, generate(new_time) cluster(subject)
2. Instruct Stata to change the value of "strvar" according to the date specified by "exact_time", which I have no idea how to do, and for which I would appreciate your support.
Thank you in advance!
For future questions, please provide your failed attempts in the form of code. They show that you have done your part in trying to solve the problem.
Also, please provide example data that can easily be copied/pasted by other users. Linking images is not the best option, for several reasons.
Find example code below.
clear
set more off
*----- example data -----
input ///
id str1 strvar str22 xtime
1 z "april 1, 1992"
1 u "november 30, 1995"
1 a "january 1, 1999"
2 b "january 15, 1989"
2 z "june 15, 1992"
2 c "august 30, 1995"
end
gen xtime2 = date(xtime, "MDY")
format %td xtime2
list, sepby(id)
*----- what you want -----
xtset id xtime2
tsfill
gen strvar2 = strvar
replace strvar2 = strvar2[_n-1] if missing(strvar2)
browse
tsfill facilitates the job. See also help xtset, help subscripting and help datetime.
Think about whether you actually need this. You are not adding any new information to the dataset, so what's the point of having a blown-up version of the original?
(The output doesn't exactly match the one in your image; but this really is meant to be an example.)
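The carry-forward logic can also be sketched outside Stata. The snippet below is a minimal Python illustration, not a translation of the Stata code above: it builds one row per month and keeps the most recent value whose change date falls in or before that month. The function name monthly_values, the tuple layout, and the December 2002 end month (taken from the question) are assumptions.

```python
from datetime import date

def monthly_values(changes, end):
    """Expand (change_date, value) pairs into one (year, month, value) row per month.

    changes: list of (datetime.date, value) pairs
    end: (year, month) of the last month to emit, inclusive
    """
    changes = sorted(changes)
    rows = []
    year, month = changes[0][0].year, changes[0][0].month
    idx = 0
    while (year, month) <= end:
        # Move to the latest change whose month is on or before the current month.
        while (idx + 1 < len(changes)
               and (changes[idx + 1][0].year, changes[idx + 1][0].month) <= (year, month)):
            idx += 1
        rows.append((year, month, changes[idx][1]))
        # Step to the next calendar month.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return rows

# Subject 1 from the example data.
subject_1 = [
    (date(1992, 4, 1), "z"),
    (date(1995, 11, 30), "u"),
    (date(1999, 1, 1), "a"),
]
rows = monthly_values(subject_1, end=(2002, 12))
print(rows[0], rows[-1], len(rows))
```

A change dated anywhere within a month takes effect for that whole month, which matches the question's request that the November 30, 1995 change apply starting in November 1995.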

Can you get the selected leaf from a DecisionTreeRegressor in scikit-learn

just reading this great paper and trying to implement this:
... We treat each individual
tree as a categorical feature that takes as value the
index of the leaf an instance ends up falling in. We use 1-
of-K coding of this type of features. For example, consider
the boosted tree model in Figure 1 with 2 subtrees, where
the first subtree has 3 leafs and the second 2 leafs. If an
instance ends up in leaf 2 in the first subtree and leaf 1 in
second subtree, the overall input to the linear classifier will
be the binary vector [0, 1, 0, 1, 0], where the first 3 entries
correspond to the leaves of the first subtree and last 2 to
those of the second subtree ...
Anyone know how I can predict a bunch of rows and for each of those rows get the selected leaf for each tree in the ensemble? For this use case I don't really care what the node represents, just its index really. Had a look at the source and I could not quickly see anything obvious. I can see that I need to iterate the trees and do something like this:
for sample in X_test:
    for tree in gbc.estimators_:
        leaf = tree.leaf_index(sample) # This is the function I need but don't think exists.
        ...
Any pointers appreciated.
The following function goes beyond identifying the selected leaf from the Decision Tree and implements the application in the referenced paper. Its use is the same as the referenced paper, where I use the GBC for feature engineering.
import numpy as np

def makeTreeBins(gbc, X):
    '''
    Takes in a fitted GradientBoostingClassifier object (gbc) and a data frame (X).
    Returns a numpy array of dim (rows(X), num_estimators), where each row represents the set of terminal nodes
    that the record X[i] falls into across all estimators in the GBC.
    Note, each tree produces at most 2^max_depth terminal nodes. I append a prefix to the terminal node id in each incremental
    estimator so that I can use these as feature ids in other classifiers.
    '''
    for i, dt_i in enumerate(gbc.estimators_):
        prefix = (i + 2)*100 #Must be an integer
        # leaf id of every row in X for this tree, offset by the per-tree prefix
        nds = prefix + dt_i[0].tree_.apply(np.array(X).astype(np.float32))
        if i == 0:
            nd_mat = nds.reshape(len(nds), 1)
        else:
            # append this tree's leaf ids as a new column
            nd_mat = np.hstack((nd_mat, nds.reshape(len(nds), 1)))
    return nd_mat
DecisionTreeRegressor has a tree_ property, which gives you access to the underlying decision tree. It has an apply method, which returns the id of the leaf each sample ends up in:
dt.tree_.apply(X)
Note that apply expects its input to have type float32.
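Once the leaf index per tree is known, the 1-of-K coding quoted in the question is a small step. The sketch below is plain Python and deliberately independent of scikit-learn; the function name one_hot_leaves and the one-based leaf numbering (as in the quoted passage) are assumptions for illustration. With real tree_.apply output, the node ids would first need to be mapped to consecutive leaf positions per tree.

```python
def one_hot_leaves(leaf_indices, leaves_per_tree):
    """Concatenate a 1-of-K vector per tree into one binary feature vector.

    leaf_indices: one-based index of the leaf the instance falls into, per tree
    leaves_per_tree: number of leaves in each tree, in the same order
    """
    vector = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        if not 1 <= leaf <= n_leaves:
            raise ValueError("leaf index out of range for its tree")
        block = [0] * n_leaves
        block[leaf - 1] = 1  # mark the selected leaf within this tree's block
        vector.extend(block)
    return vector

# Example from the quoted paper: leaf 2 of a 3-leaf tree, leaf 1 of a 2-leaf tree.
encoded = one_hot_leaves([2, 1], [3, 2])
print(encoded)
```

This reproduces the binary vector [0, 1, 0, 1, 0] given in the quoted passage.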

Create prioritization log in Excel - Two lists

I am trying to create a prioritization list. I have 6 distinct values that the user inputs into a worksheet (by way of a VBA GUI). Excel calculates these values and creates a prioritization number. I need to list them (through a function(s)) in two tables. The problem comes into play when there are duplicate values (ie ProjA = 23 and ProjB = 23).
I don't care which one is listed first, but everything I have tried has secondary issues. There are two sheets to my work book. The first is where the "raw" data is entered and the second is where I would like the two lists to be located. *I do not want to use pivots for these lists.
Priority Number Proj Name
57 Project Alpha c
57 DUI Button Project
56 asdf
57 asdfsdfg
56 asdfasdf
56 Project Alpha a
56 Project Alpha b
18 Project BAS
List A (would include a value range of 1-20) and
List B (would include a value range of 20 - inf)
So, I want it to look like this:
Table 1 (High Priority)    Table 2 (Low Priority)
Project BAS                Project Alpha c
                           DUI Button Project
Etc.
Generally these open-ended questions aren't well received on StackOverflow. You should make an attempt to demonstrate what you've tried so far, and exactly where you're becoming confused. Otherwise people are doing your work for you, rather than trying to solve specific errors.
However, because you're new here, I've made an exception.
You can begin solving your issue by looping through the priority list and copying the values into the appropriate lists. For starters, I assumed that priority values begin at cell A2 and project names begin at cell B2 (the cells A1 and B1 would be the headers). I also assumed we're using a worksheet called Sheet1.
Now I need to know the length of the priority/project name list. I can determine this by using a variable called maxRows, calculated by Worksheets("Sheet1").Cells(1, 1).End(xlDown).Row. This gives the number of rows in the main table (including the header, A1).
I continue by setting the columns for each priority list (high/low). In my example, I set these to columns 3 and 4. Then I clear these columns to remove any values that already existed there.
Then I create some tracking variables that will help me determine how many items I've already added to each list (highPriorityCount and lowPriorityCount).
Finally, I loop through the original list and check if the priority value is low (< 20) or high (the else condition). The project names are placed into the appropriate column, using the tracking variables I created above.
Note: Anywhere that uses a 2 as an offset is due to the fact that I am accounting for the header cells (row 1).
Option Explicit
Sub CreatePriorityTables()
    With Worksheets("Sheet1")
        ' Determine the length of the main table
        ' (Long rather than Integer, since row numbers can exceed 32767)
        Dim maxRows As Long
        maxRows = .Cells(1, 1).End(xlDown).Row
        ' Set the location of the priority lists
        Dim highPriorityColumn As Integer
        Dim lowPriorityColumn As Integer
        highPriorityColumn = 3
        lowPriorityColumn = 4
        ' Empty the priority lists
        .Columns(highPriorityColumn).Clear
        .Columns(lowPriorityColumn).Clear
        ' Create headers for priority lists
        .Cells(1, highPriorityColumn).Value = "Table 1 (High Priority)"
        .Cells(1, lowPriorityColumn).Value = "Table 2 (Low Priority)"
        ' Create some useful counts to track
        Dim highPriorityCount As Long
        Dim lowPriorityCount As Long
        highPriorityCount = 0
        lowPriorityCount = 0
        ' Loop through all values and copy into priority lists
        Dim i As Long
        For i = 2 To maxRows
            ' Determine column by priority value
            If (.Cells(i, 1) < 20) Then
                .Cells(lowPriorityCount + 2, lowPriorityColumn).Value = .Cells(i, 2)
                lowPriorityCount = lowPriorityCount + 1
            Else
                .Cells(highPriorityCount + 2, highPriorityColumn).Value = .Cells(i, 2)
                highPriorityCount = highPriorityCount + 1
            End If
        Next i
    End With
End Sub
This should produce the expected behavior.
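The same split can be sketched in a few lines of Python to check how duplicate priority values behave. This is an illustration only, not a replacement for the VBA macro; the function name split_by_priority is an assumption, and it follows the answer's convention that values below 20 go to the low-priority list and everything else to the high-priority list.

```python
def split_by_priority(rows, threshold=20):
    """Split (priority, name) rows into high and low lists, keeping input order."""
    high, low = [], []
    for priority, name in rows:
        if priority < threshold:
            low.append(name)
        else:
            high.append(name)
    return high, low

# Sample rows from the question, including duplicate priority values.
rows = [
    (57, "Project Alpha c"),
    (57, "DUI Button Project"),
    (56, "asdf"),
    (57, "asdfsdfg"),
    (56, "asdfasdf"),
    (56, "Project Alpha a"),
    (56, "Project Alpha b"),
    (18, "Project BAS"),
]
high, low = split_by_priority(rows)
print(high)
print(low)
```

Because each row is appended to its list in input order, duplicate values such as the three 57s cause no conflict; they simply appear in the order they were entered.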