Model building and reading a sparse ARFF file in WEKA

I have the following sparse ARFF file in Weka, and I want to build a classifier from it (the training dataset) using the Weka Java API. The program reads the file (it does not throw any exception) but is not able to read the instances: when I print the number of instances, the program prints 0. Thanks in advance for inputs.
@RELATION ample
@ATTRIBUTE T1 numeric
@ATTRIBUTE T2 numeric
@ATTRIBUTE T3 numeric
@ATTRIBUTE T4 numeric
@ATTRIBUTE T5 numeric
@ATTRIBUTE C1 {0, 1}
@DATA
{0 3, 1 2, 2 1, 6 1}
{3 3, 4 2, 6 0}
ArffLoader loader = new ArffLoader();
loader.setFile(new File("C:\\SAMPLE-01.arff"));
Instances data = loader.getStructure();
data.setClassIndex(data.numAttributes() - 1);
System.out.println("Number of Attributes : " + data.numAttributes());
System.out.println("Number of Instances : " + data.numInstances());

I believe that your sparse data is not properly formatted in your ARFF file. With six attributes, the valid sparse indices are 0 through 5, so index 6 is out of range. It should be something like this:
@RELATION ample
@ATTRIBUTE T1 numeric
@ATTRIBUTE T2 numeric
@ATTRIBUTE T3 numeric
@ATTRIBUTE T4 numeric
@ATTRIBUTE T5 numeric
@ATTRIBUTE C1 {0, 1}
@DATA
{0 3, 1 2, 2 1, 5 1}
{3 3, 4 2, 5 0}
And then you can use a class somewhat like mine:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SO_Test {
    DataSource source = null;
    Instances data = null;

    public void setDataset(String trainingFile) {
        try {
            source = new DataSource(trainingFile);
        } catch (Exception e) {
            e.printStackTrace();
        }
        try {
            data = source.getDataSet();
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (data.classIndex() == -1)
            data.setClassIndex(data.numAttributes() - 1);
    }

    public static void main(String[] args) throws Exception {
        SO_Test s = new SO_Test();
        s.setDataset("1.arff");
        System.out.println("Number of Attributes : " + s.data.numAttributes());
        System.out.println("Number of Instances : " + s.data.numInstances());
    }
}
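As a side note on the original snippet: ArffLoader.getStructure() reads only the ARFF header, which is why numInstances() came out as 0; getDataSet() reads the data rows as well. A minimal sketch of the original approach with that one change (same file path as in the question):
import java.io.File;

import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class LoadSparseArff {
    public static void main(String[] args) throws Exception {
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("C:\\SAMPLE-01.arff"));
        // getDataSet() reads the header *and* the data section;
        // getStructure() stops after the header, leaving 0 instances.
        Instances data = loader.getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Number of Attributes : " + data.numAttributes());
        System.out.println("Number of Instances : " + data.numInstances());
    }
}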


Unable to understand lists in Dart: consider the provided example

I am developing an app in Flutter, for which I am using lists of maps, but there is something that I am unable to understand. Consider the following cases:
SCENARIO 1
void main() {
  List<Map<String, String>> _reminders = [];
  Map<String, String> _tempMap = {};
  for (int i = 0; i < 5; i++) {
    _tempMap.clear();
    _tempMap.putIfAbsent('M', () => 'm ' + i.toString());
    _tempMap.putIfAbsent('D', () => 'd : ' + i.toString());
    _reminders.add(_tempMap);
    // or _reminders.insert(i, _tempMap);
  }
  print(_reminders.toString());
  return;
}
which gives the following result:
[{M: m 4, D: d : 4}, {M: m 4, D: d : 4}, {M: m 4, D: d : 4}, {M: m 4, D: d : 4}, {M: m 4, D: d : 4}]
SCENARIO 2
void main() {
  List<Map<String, String>> _reminders = [];
  for (int i = 0; i < 5; i++) {
    Map<String, String> _tempMap = {};
    _tempMap.putIfAbsent('M', () => 'm ' + i.toString());
    _tempMap.putIfAbsent('D', () => 'd : ' + i.toString());
    _reminders.add(_tempMap);
  }
  print(_reminders.toString());
  return;
}
which gives the following result:
[{M: m 0, D: d : 0}, {M: m 1, D: d : 1}, {M: m 2, D: d : 2}, {M: m 3, D: d : 3}, {M: m 4, D: d : 4}]
As far as I understand, these scenarios should give similar results. Also, in my use case, scenario 2 is the correct way, as it gives the result that I want. Please note that the above examples have been simplified for the question; the usage in my original code is much more complex.
Dart, like many other programming languages including Java, stores objects as references, not as contiguous memory blocks. In the first case, every iteration of the loop added the same map via _reminders.add(_tempMap). Your intuition that "every time I add the map, a copy of its current state is created and appended to the list" is incorrect.
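The same pitfall is easy to reproduce in Java (a minimal illustration of reference semantics, not code from the thread): the list ends up holding five references to one and the same map object.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SharedReferenceDemo {
    public static void main(String[] args) {
        List<Map<String, String>> reminders = new ArrayList<>();
        Map<String, String> tempMap = new HashMap<>(); // one object, reused
        for (int i = 0; i < 5; i++) {
            tempMap.clear();        // wipes the map behind every entry added so far
            tempMap.put("M", "m " + i);
            reminders.add(tempMap); // stores a reference, not a copy
        }
        // Prints the i = 4 content five times, just like scenario 1:
        System.out.println(reminders);
    }
}
Moving the map's declaration inside the loop (as in scenario 2) creates a fresh object per iteration, which is why it prints distinct values.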
From my understanding, the two scenarios are indeed different.
The problem is with _tempMap.clear() in scenario 1. You used a single map object declared outside the loop, and when you call clear() inside the loop, all the previously added entries are removed and the map becomes empty.
when i = 0 => {} => clear() => all entries cleared => new item inserted.
when i = 1 => {items inserted in the 0th iteration} => clear() => all entries cleared => new item inserted.
So on every iteration the map is cleared and holds only that iteration's values. After the loop completes, every list entry shows only the values from the last iteration (i = 4), since the shared map variable is cleared each time a new iteration starts.
EDIT:
You can print the map values inside the for loop and check for yourself:
for (int i = 0; i < 5; i++) {
  print('\n $i => ${_tempMap} \n');
  // ... rest of the loop body ...
}

How do I group by into struct in BigQuery?

I have the following data:
player_id  level  talent_id
1          1      a
1          2      b
1          3      c
2          1      d
2          2      e
And I want to group by player_id and have rows as structs, with level values as struct field names:
player_id  data
1          {_1 = a, _2 = b, _3 = c}
2          {_1 = d, _2 = e, _3 = null}
The level column always comes from the set {1, 2, 3}, but some levels might be missing (null).
What I've got so far is aggregation by player_id with an attached array of results:
talents as (
  select
    p.player_id,
    array_agg(struct(p.level, p.talent_id)) as talents
  from source.player_talent p
  group by player_id
),
player_id  data
1          [{1, a}, {2, b}, {3, c}]
2          [{1, d}, {2, e}]
Now I need to map this array to a struct with the fixed property names _1, _2, _3.
This returns the expected results:
WITH Players AS (
  SELECT 1 AS player_id, 1 AS level, 'a' AS talent_id UNION ALL
  SELECT 1, 2, 'b' UNION ALL
  SELECT 1, 3, 'c' UNION ALL
  SELECT 2, 1, 'd' UNION ALL
  SELECT 2, 2, 'e'
)
SELECT
  player_id,
  STRUCT(
    MAX(IF(level = 1, talent_id, NULL)) AS _1,
    MAX(IF(level = 2, talent_id, NULL)) AS _2,
    MAX(IF(level = 3, talent_id, NULL)) AS _3) AS data
FROM Players
GROUP BY player_id
The technique here is known as pivoting (converting rows to columns).
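To spell out the conditional-aggregation idea behind that query, here is the same pivot over in-memory rows in plain Java (an illustrative sketch only; names like Row and PivotDemo are made up for this example):
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PivotDemo {
    // One row of the question's table: (player_id, level, talent_id).
    record Row(int playerId, int level, String talentId) {}

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row(1, 1, "a"), new Row(1, 2, "b"), new Row(1, 3, "c"),
                new Row(2, 1, "d"), new Row(2, 2, "e"));

        // Group by player_id; slot each talent into the column for its level.
        // This mirrors MAX(IF(level = n, talent_id, NULL)) in the SQL above:
        // missing levels simply stay null.
        Map<Integer, String[]> data = new TreeMap<>();
        for (Row r : rows) {
            data.computeIfAbsent(r.playerId(), k -> new String[3])[r.level() - 1] = r.talentId();
        }
        data.forEach((id, t) -> System.out.println(
                id + " {_1 = " + t[0] + ", _2 = " + t[1] + ", _3 = " + t[2] + "}"));
    }
}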

Weka Text Mining Naive Bayes

I have a question about text mining in Weka. I have 4 different categories, and I want the data to be classified into those categories. In addition, I want each item to be predicted as positive, negative, or neutral.
So here is my training data before using any filter:
@relation QueryResult
@attribute class {Qualität,Bord,Kite,Harness}
@attribute text {evo,foil,end,fin,edg}
@data
Qualität,evo
Bord,foil
Kite,end
Harness,fin
Qualität,edg
This is my Java code:
Instances train = new Instances(loadInstancesForWeka("root", "", sqlCommand));
train.setClassIndex(train.numAttributes() - 2);

// Convert the nominal text attribute into a string attribute
NominalToString filter1 = new NominalToString();
filter1.setInputFormat(train);
train = Filter.useFilter(train, filter1);

// Turn the string attribute into a bag-of-words vector
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(train);
train = Filter.useFilter(train, filter);

// test2 are the testing instances
naive.buildClassifier(train);
for (int i = 0; i < test2.numInstances(); i++) {
    double index = naive.classifyInstance(test2.instance(i));
}
So by now my data are classified into the four categories Qualität, Bord, Kite, and Harness.
How can I now use naive Bayes to also classify them as positive, negative, or neutral?
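No answer was posted in this thread, but one common approach (a sketch under assumptions, not something the thread confirms) is to train a second, independent model: a Weka classifier predicts exactly one class attribute, so the four categories and the positive/negative/neutral sentiment need two labelled copies of the text and two classifiers. Wrapping the filter in a FilteredClassifier keeps the StringToWordVector vocabulary consistent between training and prediction; loadCategoryData() and loadSentimentData() below are hypothetical stand-ins for however the two labelled datasets are loaded:
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TwoModelSketch {
    public static void main(String[] args) throws Exception {
        // Same text twice: once labelled with the four categories,
        // once labelled {positive, negative, neutral}.
        Instances categoryTrain = loadCategoryData();   // hypothetical loader
        Instances sentimentTrain = loadSentimentData(); // hypothetical loader

        FilteredClassifier categoryModel = buildTextModel(categoryTrain);
        FilteredClassifier sentimentModel = buildTextModel(sentimentTrain);
        // At prediction time, run each instance through both models:
        // categoryModel.classifyInstance(...) and sentimentModel.classifyInstance(...).
    }

    static FilteredClassifier buildTextModel(Instances train) throws Exception {
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector()); // filter is applied inside the classifier
        fc.setClassifier(new NaiveBayes());
        fc.buildClassifier(train);
        return fc;
    }

    // Hypothetical stubs; replace with your own data loading (class index set).
    static Instances loadCategoryData() { throw new UnsupportedOperationException(); }
    static Instances loadSentimentData() { throw new UnsupportedOperationException(); }
}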

weka sparse arff file

I am making a sparse ARFF file, but it will not load into Weka. I get an error saying that I have the wrong number of values on the @attribute class line: it expects 1 but receives 12. What am I doing wrong? My file looks like this:
%ARFF file for questions data
%
@relation brazilquestions
@attribute att0 numeric
@attribute att1 numeric
@attribute att2 numeric
@attribute att3 numeric
%there are 469 attributes which represent my bag of words
@attribute class {Odontologia_coletiva, Periodontia, Pediatria, Estomatologia,
Dentistica, Ortodontia, Endodontia, Cardiologia, Terapeutica,
Terapeutica_medicamentosa, Odontopediatria, Cirurgia}
@data
{126 1, 147 1, 199 1, 56 1, 367 1, 400 1, Estomatologia}
{155 1, 76 1, 126 1, 78 1, 341 1, 148 1, Odontopediatria}
%and then 81 more instances of data
Any ideas about what is wrong with my syntax? I followed the example exactly from the book Data Mining by Witten/Frank/Hall. Thanks in advance!
The problem is in the data section: you must include the index of the class attribute. For example:
{126 1, 147 1, 199 1, 56 1, 367 1, 400 1, Estomatologia}
should be corrected like the following:
{126 1, 147 1, 199 1, 56 1, 367 1, 400 1, 470 Estomatologia}
In your file you declared 5 attributes, but in @data you are adding 7 values, so you should complete the rest of the values in @data. You can see this in the manual.
The attribute name for the instance class value needs to be listed, too. (See the Sparse ARFF file description.)
Your file:
@attribute myclass {Odontologia_coletiva, Periodontia, Pediatria, Estomatologia,
Dentistica, Ortodontia, Endodontia, Cardiologia, Terapeutica,
Terapeutica_medicamentosa, Odontopediatria, Cirurgia}
@data
{126 1, 147 1, 199 1, 56 1, 367 1, 400 1, Estomatologia}
Should be:
@data
{126 1, 147 1, 199 1, 56 1, 367 1, 400 1, myclass Estomatologia}
Try using
@ATTRIBUTE class string
instead of
@attribute class {Odontologia_coletiva, Periodontia, Pediatria, Estomatologia, Dentistica, Ortodontia, Endodontia, Cardiologia, Terapeutica, Terapeutica_medicamentosa, Odontopediatria, Cirurgia}

Sorting a number of lists according to indexes and priority

I have a collection of lists, each containing around 6 to 7 values, like:
list1 = 2,4,7,4,9,5
list2 = 4,3,7.3,9,8,1.2
list3 = 2,2.4,7,9,8,5
list4 = 9,1.6,4,3,4,1
list5 = 2,5,7,9,1,4
list6 = 6,8,7,2,1,5
list7 = 4,2,5,2,1,3
Now I want to sort these with index1 as primary, index3 as secondary, index2 as tertiary, and so on. That is, the output should be:
2,2.4,7,9,8,5
2,4,7,4,9,5
2,5,7,9,1,4
4,2,5,2,1,3
6,8,7,2,1,5
9,1.6,4,3,4,1
I want the lists to be sorted on index1 first; if the values at index1 are the same, sorting is done on index3, and if those are also the same, on index2. Here the number of lists is small, but it can grow to 20, and the number of indexes can grow to 20 as well.
The algorithm I want is the same as iTunes song sorting, in which songs are sorted by album first, then by artist, then by rank, then by name. That is, if the album names are the same, sorting is done on the artist; if those are the same too, then by rank, and so on. The code can be in C/C++/Tcl/shell.
sort -t ',' -k1,1n -k3,3n -k2,2n
Feed the lists as individual lines into it.
To do this in Tcl, assuming there aren't huge amounts of data (a few MB wouldn't be “huge”), the easiest way would be:
# Read the values in from stdin, break into lists of lists
foreach line [split [read stdin] "\n"] {
    lappend records [split $line ","]
}
# Sort once per key, least-significant key first (lsort is _stable_):
# tertiary key is index 1, secondary is index 2, primary is index 0
set records [lsort -index 1 -real $records]
set records [lsort -index 2 -real $records]
set records [lsort -index 0 -real $records]
# Write the values back out to stdout
foreach record $records {
    puts [join $record ","]
}
If you're using anything more complex than simple numbers, consider using the csv package in Tcllib for parsing and formatting, as it will deal with many syntactic issues that crop up in Real Data. If you're dealing with a lot of data (where “lot” depends on how much memory you deploy with) then consider using a more stream-oriented method for handling the data (and there are a few other optimizations in the memory handling) and you might also want to use the -command option to lsort to supply a custom comparator so you can sort only once; the performance hit of a custom comparator is quite high, alas, but for many records the reduced number of comparisons will win out. Or shove the data into a database like SQLite or Postgres.
You can use the STL's sort, and then all you have to do is write a comparison function that does what you want (the example in the link should be good enough).
Since you asked for a Tcl solution:
set lol {
    {2 4 7 4 9 5}
    {4 3 7.3 9 8 1.2}
    {2 2.4 7 9 8 5}
    {9 1.6 4 3 4 1}
    {2 5 7 9 1 4}
    {6 8 7 2 1 5}
    {4 2 5 2 1 3}
}

set ::EPS 10e-6

proc compareLists {ixo e1 e2} {
    foreach ix $ixo {
        set d [expr {[lindex $e1 $ix] - [lindex $e2 $ix]}]
        if {abs($d) > $::EPS} {
            return [expr {($d>0)-($d<0)}]
        }
    }
    return 0
}

foreach li [lsort -command [list compareLists {0 2 1}] $lol] {
    puts $li
}
Hope that helps.
Here is a C++ solution:
#include <iostream>
#include <vector>
#include <algorithm>

template <typename Array, typename CompareOrderIndex>
struct arrayCompare
{
private:
    size_t size;
    CompareOrderIndex index;

public:
    arrayCompare( CompareOrderIndex idx ) : size( idx.size() ), index( idx ) { }

    // Compare the remaining `num` keys, most significant first.
    bool helper( const Array &a1, const Array &a2, unsigned int num ) const
    {
        size_t i = index[ size - num ];
        if( a1[i] < a2[i] )
        {
            return true;
        }
        if( a2[i] < a1[i] )
        {
            return false;
        }
        if( 1 != num )
        {
            return helper( a1, a2, num - 1 );
        }
        // All keys equal: return false so the comparison is a strict weak
        // ordering, as std::sort requires.
        return false;
    }

    bool operator()( const Array &a1, const Array &a2 ) const
    {
        return helper( a1, a2, size );
    }
};

int main()
{
    std::vector< std::vector<float> > lists = { { 2, 4,   7,   4, 9, 5   },
                                                { 4, 3,   7.3, 9, 8, 1.2 },
                                                { 2, 2.4, 7,   9, 8, 5   },
                                                { 4, 2,   5,   2, 1, 3   },
                                                { 9, 1.6, 4,   3, 4, 1   },
                                                { 2, 5,   7,   9, 1, 4   },
                                                { 6, 8,   7,   2, 1, 5   },
                                                { 4, 2,   5,   2, 1, 1   } };
    //
    // Specify the column indexes to compare and the order to compare.
    // In this case it will first compare column 1, then 3, and finally 2.
    //
    //std::vector<int> indexOrder = { 0, 2, 1, 3, 4, 5 };
    std::vector<int> indexOrder = { 0, 2, 1 };

    std::sort( lists.begin(), lists.end(),
               arrayCompare< std::vector<float>, std::vector<int> >( indexOrder ) );

    // Print the sorted rows with their columns in the original order.
    for( auto &p : lists )
    {
        for( unsigned int i = 0; i < p.size(); ++i )
        {
            std::cout << p[i] << ", ";
        }
        std::cout << std::endl;
    }
}