Weka Text Mining Naive Bayes

I have a question about text mining in Weka. I have 4 different categories, and I want the data to be classified into those categories. In addition, I want each instance to be predicted as positive, negative, or neutral.
So here is my training data before using any filter:
@relation QueryResult
@attribute class {Qualität,Bord,Kite,Harness}
@attribute text {evo,foil,end,fin,edg}
@data
Qualität,evo
Bord,foil
Kite,end
Harness,fin
Qualität,edg
This is my Java code:
Instances train = new Instances(loadInstancesForWeka("root", "", sqlCommand));
train.setClassIndex(train.numAttributes() - 2);

// convert the nominal text attribute into a string attribute
NominalToString filter1 = new NominalToString();
filter1.setInputFormat(train);
train = Filter.useFilter(train, filter1);

// turn the string attribute into a bag-of-words representation
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(train);
train = Filter.useFilter(train, filter);

NaiveBayes naive = new NaiveBayes();
naive.buildClassifier(train);

// test2 are the testing instances
for (int i = 0; i < test2.numInstances(); i++) {
    double index = naive.classifyInstance(test2.instance(i));
}
So far, my data are classified into the four categories Qualität, Bord, Kite, and Harness.
How can I now use Naive Bayes to also classify them as positive/negative/neutral?
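One possible approach (my own sketch, not something from the post) is to train a second, independent Naive Bayes model on the same texts, using a sentiment class attribute {positive,negative,neutral} instead of the four topic categories. A minimal sketch, assuming hypothetical helpers loadSentimentTrain() and loadSentimentTest() that return instances with a string text attribute and the sentiment class as the first attribute:

import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Hypothetical second training set: the same texts, but labelled positive/negative/neutral.
Instances sentTrain = loadSentimentTrain();   // assumed helper, not from the original post
sentTrain.setClassIndex(0);                   // sentiment class attribute comes first

// FilteredClassifier applies StringToWordVector to training and test data consistently.
FilteredClassifier sentimentModel = new FilteredClassifier();
sentimentModel.setFilter(new StringToWordVector());
sentimentModel.setClassifier(new NaiveBayes());
sentimentModel.buildClassifier(sentTrain);

// Classify the test texts a second time, now for sentiment.
Instances sentTest = loadSentimentTest();     // assumed helper, same raw string format as sentTrain
sentTest.setClassIndex(0);
for (int i = 0; i < sentTest.numInstances(); i++) {
    double sentiment = sentimentModel.classifyInstance(sentTest.instance(i));
    System.out.println(sentTest.classAttribute().value((int) sentiment));
}

Each text then gets two predictions: the topic from the existing model and the sentiment from this second model; the two classifiers stay completely independent.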

Related

How could I use 3 random variables instead of one with shape=3 in PyMC3

I am new to PyMC3. I am trying to implement an unpooled model with 3 random variables (b0, b1, b2 ~ Normal) instead of one variable (b) with shape=3. The code below is from the book Bayesian Modeling and Computation in Python. I think this will help me implement a more complex multilevel model later.
customers = sales_df.loc[:, "customers"].values
sales_observed = sales_df.loc[:, "sales"].values
food_category = pd.Categorical(sales_df["Food_Category"])

with pm.Model() as model_sales_unpooled:
    s = pm.HalfNormal("s", 20, shape=3)
    b = pm.Normal("b", mu=10, sigma=10, shape=3)
    m = pm.Deterministic("m", b[food_category.codes] * customers)
    sales = pm.Normal("sales", mu=m, sigma=s[food_category.codes],
                      observed=sales_observed)
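Not part of the original question, but here is a minimal sketch of what I understand the goal to be: define b0, b1 and b2 as separate scalar Normals and stack them into a length-3 vector with pm.math.stack, so the rest of the model can index it exactly like the shape=3 version.

import pymc3 as pm

with pm.Model() as model_sales_unpooled_split:
    s = pm.HalfNormal("s", 20, shape=3)
    # three separate scalar coefficients instead of one vector variable with shape=3
    b0 = pm.Normal("b0", mu=10, sigma=10)
    b1 = pm.Normal("b1", mu=10, sigma=10)
    b2 = pm.Normal("b2", mu=10, sigma=10)
    b = pm.math.stack([b0, b1, b2])  # behaves like the original length-3 b
    m = pm.Deterministic("m", b[food_category.codes] * customers)
    sales = pm.Normal("sales", mu=m, sigma=s[food_category.codes],
                      observed=sales_observed)

Because the stacked tensor can still be indexed with food_category.codes, m and sales are unchanged; only the prior is now three named scalar variables.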

How do I get rid of the gap in a table footnote generated by kable in R Markdown?

As you can see from the picture below, there is a gap between the footnote and the table content.
The code I used is below. Any ideas on how to get rid of the gap?
kable(summarize(df.sum, group = "Experiment", test = T, digits = 1, show.NAs = F),
row.names = F, caption = 'Summary Statistics for Treated and Control Groups',
booktabs = T) %>% kable_styling(latex_options = c('striped', 'hold_position')) %>%
footnote(general = 'DM8OZ indicates the daily max 8-hour ozone concentration;
Daily_PM2.5 is the daily average of PM2.5; Tavg is the daily average temperature;
Prcp is the daily accumulated precipitation. The last column in the table represents the testing results of null
hypotheses that the treated and control groups are not statistically different. ',
footnote_as_chunk = T, threeparttable = T, fixed_small_size = T)

0 DF in regression in SAS Enterprise Guide

I created dummies in SAS (part of the code is below) and ran a regression (dropping M23). It was working fine. But then I tried to group the ages into bands, since we don't have enough members. I ran it the same way and dropped one age group (M20to24, since this group has the highest membership). Now some of my variables have 0 DF. Does anyone know what went wrong?
I got this message: Note: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased. The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.
data Table;
set Table;
M0=(AgeGender = '0M');
M1=(AgeGender = '1M');
M2=(AgeGender = '2M');
M3=(AgeGender = '3M');
M4=(AgeGender = '4M');
M5to9=(AgeGender = ' 5to9M');
M10to14=(AgeGender = '10to14M');
M15to19=(AgeGender = '15to19M');
M20to24=(AgeGender = '20to24M');
M25to29=(AgeGender = '25to29M');
M30to34=(AgeGender = '30to34M');
M35to39=(AgeGender = '35to39M');
M40to44=(AgeGender = '40to44M');
M45to49=(AgeGender = '45to49M');
M50to54=(AgeGender = '50to54M');
M55to59=(AgeGender = '55to59M');
M60to64=(AgeGender = '60to64M');
M65Plus=(AgeGender = '65+M');
F0=(AgeGender = '0F');
F1=(AgeGender = '1F');
F2=(AgeGender = '2F');
F3=(AgeGender = '3F');
F4=(AgeGender = '4F');
F5to9=(AgeGender = ' 5to9F');
F10to14=(AgeGender = '10to14F');
F15to19=(AgeGender = '15to19F');
F20to24=(AgeGender = '20to24F');
F25to29=(AgeGender = '25to29F');
F30to34=(AgeGender = '30to34F');
F35to39=(AgeGender = '35to39F');
F40to44=(AgeGender = '40to44F');
F45to49=(AgeGender = '45to49F');
F50to54=(AgeGender = '50to54F');
F55to59=(AgeGender = '55to59F');
F60to64=(AgeGender = '60to64F');
F65Plus=(AgeGender = '65+F');
Dep = (Relationship = 'Dep');
Mandatory = (Mand_Vo = 'Mandatory');
run;
ods output ParameterEstimates=Parameter_Estimates;
proc reg data= Table;
model logPMPM =
M0
M1
M2
M3
M4
M5to9
M10to14
M15to19
M25to29
M30to34
M35to39
M40to44
M45to49
M50to54
M55to59
M60to64
M65Plus
F0
F1
F2
F3
F4
F5to9
F10to14
F15to19
F20to24
F25to29
F30to34
F35to39
F40to44
F45to49
F50to54
F55to59
F60to64
F65Plus;
weight Membership;
run;
ods output close;
By definition your dummies shouldn't overlap or be identical/complementary, but in your data this is likely happening by chance, which is harder to find. You can usually track it down by crossing variables that you suspect may be related, or by doing a pairwise scatter plot (PROC SGSCATTER) and seeing which two overlap almost identically; see the sketch below.
You're correct that you wouldn't get this behaviour with continuous values, because they're continuous and far less likely to overlap exactly. In general, it's considered best practice NOT to categorize/bin variables when you can keep them continuous. The boundaries are artificial: does a 34-year-old really differ from a 36-year-old? What if everyone in one age group happens to be 34, compared to everyone at 36 in the 35-to-39 group? You may not find a difference, but if the distribution were everyone at 31 versus everyone at 39 you might find more of one. Keeping the data continuous avoids these manufactured issues.
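To make that concrete, here is a small sketch (my addition; the variable pairs are arbitrary examples, not taken from your model) of how you might cross suspect dummies or plot them pairwise to spot ones that coincide by chance:

/* Cross-tabulate two dummies you suspect are related: if one of them is 1   */
/* only when the other is 1 (or an entire cell is empty), they are aliased.  */
proc freq data=Table;
   tables M65Plus*Dep / norow nocol nopercent;   /* example pair, chosen arbitrarily */
run;

/* Pairwise scatter plots of a few suspect variables, as suggested above */
proc sgscatter data=Table;
   matrix M0 M1 F0 F1 Dep Mandatory;             /* example variables, chosen arbitrarily */
run;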

Weka classification and predicted class

I'm trying to classify an unlabelled string using Weka. I'm not an expert in data mining, so I have been struggling with the different terms. What I'm doing is providing the training data, setting the unlabelled string, and then running the M5Rules classifier. I'm actually getting an output, but I have no idea what it means:
run:
{17 1,35 1,64 1,135 1,205 1,214 1,215 1,284 1,288 1,309 1,343 1,461 1,493 1,500 1,552 1,806 -0.038168} | -0.03816793850062397
-0.03816793850062397 ->
Results
======
Correlation coefficient 0
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 1
BUILD SUCCESSFUL (total time: 1 second)
The source code is as follows:
public Categorizer() {
    try {
        //*** READ ARFF FILES ***//
        //BufferedReader trainReader = new BufferedReader(new FileReader("c:/Users/Yehia A.Salam/Desktop/dd/training-data.arff")); // File with text examples
        //BufferedReader classifyReader = new BufferedReader(new FileReader("c:/Users/Yehia A.Salam/Desktop/dd/test-data.arff")); // File with text to classify

        // Create training data instances
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("c:/Users/Yehia A.Salam/Desktop/dd/training-data"));
        Instances dataRaw = loader.getDataSet();
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(dataRaw);
        Instances dataTraining = Filter.useFilter(dataRaw, filter);
        dataTraining.setClassIndex(dataRaw.numAttributes() - 1);

        // Create test data instances
        loader.setDirectory(new File("c:/Users/Yehia A.Salam/Desktop/dd/test-data"));
        dataRaw = loader.getDataSet();
        Instances dataTest = Filter.useFilter(dataRaw, filter);
        dataTest.setClassIndex(dataTest.numAttributes() - 1);

        // Classify
        FilteredClassifier model = new FilteredClassifier();
        model.setFilter(new StringToWordVector());
        model.setClassifier(new M5Rules());
        model.buildClassifier(dataTraining);
        for (int i = 0; i < dataTest.numInstances(); i++) {
            dataTest.instance(i).setClassMissing();
            double cls = model.classifyInstance(dataTest.instance(i));
            dataTest.instance(i).setClassValue(cls);
            System.out.println(dataTest.instance(i).toString() + " | " + cls);
            System.out.println(cls + " -> " + dataTest.instance(i).classAttribute().value((int) cls));

            // evaluate classifier and print some statistics
            Evaluation eval = new Evaluation(dataTraining);
            eval.evaluateModelOnce(cls, dataTest.instance(i));
            System.out.println(eval.toSummaryString("\nResults\n======\n", false));
        }
    } catch (FileNotFoundException e) {
        System.err.println(e.getMessage());
    } catch (IOException i) {
        System.err.println(i.getMessage());
    } catch (Exception o) {
        System.err.println(o.getMessage());
    }
}
And finally, a couple of screenshots in case I set up the folder hierarchy wrong:
tl;dr:
You set the class index to a random feature
You have to use a classifier, not a regression algorithm
The problem is how you initialize your data sets. Although Weka usually puts the class in the last column, the TextDirectoryLoader doesn't. In fact, you don't need to set the class index manually; it is already set, so remove the lines
dataTraining.setClassIndex(dataRaw.numAttributes() - 1);
dataTest.setClassIndex(dataTest.numAttributes() - 1);
(The first line is wrong anyway, because you use the number of attributes from the raw data set, but choose the column of the already filtered data set.)
If you then run your code, you will get this:
weka.classifiers.functions.LinearRegression: Cannot handle binary class!
As I already guessed, M5Rules is not a classifier but a regression algorithm. If you use a classifier like J48 or RandomForest, you will get a more sensible output. Just change the line
model.setClassifier(new M5Rules());
to
model.setClassifier(new RandomForest());
As for your output, here is what I make of it:
{17 1,35 1,64 1,135 1,205 1,214 1,215 1,284 1,288 1,309 1,343 1,461 1,493 1,500 1,552 1,806 -0.038168} | -0.03816793850062397
-0.03816793850062397 ->
is the result of the lines
System.out.println(dataTest.instance(i).toString() + " | " + cls);
System.out.println(cls + " -> " + dataTest.instance(i).classAttribute().value((int) cls));
So you see the features of your instance serialized as sparse ARFF followed by | and the class.
Usually the class would be an integer index of the predicted label, but from the documentation of M5Rules I gather that it is an algorithm for regression problems, so you won't get discrete classes but continuous values, in your case -0.03816793850062397.
Since you (incorrectly) set a numerical feature as class label, M5Rules didn't complain and gave you an output. If you use an actual classifier, you will get your labels "health" or "travel".
The rest are standard statistics about the classifier's performance, but they are pretty useless when computed on a single test instance (see the sketch after the numbers below). It looks like the one sample was classified correctly, so all errors are zero.
Correlation coefficient 0
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 1
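If you want statistics that actually mean something, one option (my addition, not part of the original answer) is to evaluate the model on the whole test set at once instead of instance by instance, roughly like this:

import weka.classifiers.Evaluation;

// Evaluate on all test instances at once; assumes dataTest keeps its true labels
// (i.e. don't call setClassMissing() on the instances first).
Evaluation eval = new Evaluation(dataTraining);
eval.evaluateModel(model, dataTest);
System.out.println(eval.toSummaryString("\nResults\n======\n", false));
System.out.println(eval.toMatrixString());   // confusion matrix; only valid for a nominal class

With a nominal class and a real classifier such as RandomForest, this prints accuracy and a confusion matrix instead of the regression-style error measures above.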
Just in case someone else gets the same error with M5P: check whether the ARFF file is just a header or is empty.
Otherwise try
model.buildClassifier(....)
instead of
model.setClassifier(....);
That solved it for me.

implementing Bags of Words object recognition using VLFEAT

I am trying to implement BoW object recognition code in MATLAB. The process is slightly complicated and I've had a lot of trouble finding proper documentation on the procedure. So could someone double-check whether my plan below makes sense?
I'm using the VLFeat library extensively here.
Training:
1. Extract SIFT image descriptor with VLSIFT
2. Quantize the descriptors with k-means(vl_hikmeans)
3. Take quantized descriptors and create histogram(VL_HIKMEANSHIST)
4. Create SVM from histograms(VL_PEGASOS?)
I understand steps 1-3, but I'm not quite sure whether the function for the SVM is correct.
VL_PEGASOS takes the following:
W = VL_PEGASOS(X, Y, LAMBDA)
How exactly do I use this function with the histogram that I create?
Finally during the recognition stage, how do I match the image with a class defined by the SVM?
Did you look at their Caltech 101 example code? That is a full implementation of a BoW approach.
Here is the part where they classify with PEGASOS and evaluate the results:
% --------------------------------------------------------------------
% Train SVM
% --------------------------------------------------------------------
lambda = 1 / (conf.svm.C * length(selTrain)) ;
w = [] ;
for ci = 1:length(classes)
  perm = randperm(length(selTrain)) ;
  fprintf('Training model for class %s\n', classes{ci}) ;
  y = 2 * (imageClass(selTrain) == ci) - 1 ;
  data = vl_maketrainingset(psix(:,selTrain(perm)), int8(y(perm))) ;
  [w(:,ci) b(ci)] = vl_svmpegasos(data, lambda, ...
                                  'MaxIterations', 50/lambda, ...
                                  'BiasMultiplier', conf.svm.biasMultiplier) ;
end

model.b = conf.svm.biasMultiplier * b ;
model.w = w ;

% --------------------------------------------------------------------
% Test SVM and evaluate
% --------------------------------------------------------------------
% Estimate the class of the test images
scores = model.w' * psix + model.b' * ones(1, size(psix,2)) ;
[drop, imageEstClass] = max(scores, [], 1) ;

% Compute the confusion matrix
idx = sub2ind([length(classes), length(classes)], ...
              imageClass(selTest), imageEstClass(selTest)) ;
confus = zeros(length(classes)) ;
confus = vl_binsum(confus, ones(size(idx)), idx) ;