how to declare" class hierarchy atrribute " in weka - weka

I am trying to use Weka to create an .arff file and run it on CLUS.
But I have a problem with the hierarchical attribute:
@attribute 'class hierarchical' {Dummy,Top/Arts/Animation,Top/Arts}
I create the .arff file with this code:
// 1. set up attributes
attributes = new FastVector();
// - numeric
int NumericAttSize = 0;
for (String word : ListOfWord) {
    if (word.length() > 1) {
        attributes.addElement(new Attribute(word));
        NumericAttSize++;
    }
}
// - nominal
attVals = new FastVector();
attVals.addElement("Dummy");
for (String branch : ListOfBranch) {
    attVals.addElement(branch);
}
attributes.addElement(new Attribute("class hierarchical", attVals));
// 2. create Instances object
dataSet = new Instances("training", attributes, 0);
// 3. fill with data
for (String DocID : indexTFIDF.keySet()) {
    values = new double[dataSet.numAttributes()];
    for (String word : ListOfWord) {
        int index = ListOfWord.indexOf(word);
        if (indexTFIDF.get(DocID).containsKey(word))
            values[index] = indexTFIDF.get(DocID).get(word);
    }
    String Branch = DocDetail.get(DocID).get("1");
    values[NumericAttSize] = ListOfBranch.indexOf(Branch) + 1;
    dataSet.add(new Instance(1.0, values));
}
ArffSaver arffSaverInstance = new ArffSaver();
arffSaverInstance.setInstances(dataSet);
arffSaverInstance.setFile(new File("training.arff"));
arffSaverInstance.writeBatch();
Then, when I run "training.arff" in CLUS, I get this error message:
Error: Classes value not in tree hierarchy: Top/Arts/Animation (lookup: Animation, term: Top/Arts, subterms: Animation})
I think the problem is that I declare the hierarchical attribute as a nominal attribute, but I have no other idea how to declare it.
Every suggestion would be helpful. Thanks in advance.

According to an example in the Clus manual (found in the Clus distribution zip under /Clus/docs/clus-manual.pdf), a hierarchical attribute should be formatted as follows:
@ATTRIBUTE class hierarchical rec/sport/swim,rec/sport/run,rec/auto,alt/atheism
So in your case you should remove the quotes around 'class hierarchical' and the curly braces {} around your values, resulting in:
@ATTRIBUTE class hierarchical Dummy,Top/Arts/Animation,Top/Arts
Also, if you have multi-label data (i.e., multiple labels per data sample), you can separate multiple hierarchical values using @, as follows:
@DATA
1,...,1,rec/sport/run@rec/sport/swim
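Note that Weka's ArffSaver has no notion of a CLUS hierarchical attribute, so it will always write that column as a nominal attribute with quotes and curly braces. One option is to post-process the saved file and rewrite that single header line into the CLUS format. Below is a minimal sketch, assuming the generated header line looks exactly like the one shown above and that the file fits in memory:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FixClusHeader {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("training.arff"), StandardCharsets.UTF_8);
        List<String> fixed = new ArrayList<String>();
        for (String line : lines) {
            // Weka writes: @attribute 'class hierarchical' {Dummy,Top/Arts/Animation,Top/Arts}
            // CLUS wants:  @ATTRIBUTE class hierarchical Dummy,Top/Arts/Animation,Top/Arts
            if (line.startsWith("@attribute 'class hierarchical'")) {
                String values = line.substring(line.indexOf('{') + 1, line.lastIndexOf('}'));
                fixed.add("@ATTRIBUTE class hierarchical " + values);
            } else {
                fixed.add(line);
            }
        }
        Files.write(Paths.get("training.arff"), fixed, StandardCharsets.UTF_8);
    }
}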

Using split command to split a string

I am looking to split a string and put it into a set. The string to be split is an element of a tuple.
This string element in the tuple (pitblockSet) takes values such as:
{"P499,P376,P490,P366,P129,"}
{"P388,P491,P367,"}
{"P500,P377,P479,P355,"}
and so on. Each set refers to a Path Id (the string name of the path).
The tuple was defined as :
tuple Path {
    string id;
    string source;
    string dest;
    {string} pitblockSet;
    {string} roadPoints;
    {string} dumpblockSet;
    {string} others;
    float dist;
};
The sets shown above that need to be split refer to the element {string} pitblockSet;
I now have to split the pitblockSet. I am using the following:
{Path} Pbd = {};
// Not putting the code to populate Pbd as it is irrelevant here
// there are several lines here for the purpose of creating set Pbd...
{string} splitPitBlocksPath[Pathid];
{string} splitDumpBlocksPath[Pathid];
execute {
    for (var p in Pbd) {
        var splitPitBlocksPath[p.id] = p.pitblockSet.split(",");
        var splitDumpBlocksPath[p.id] = p.dumpblockSet.split(",");
    }
}
The problem is that when I execute it, I get the following error on the above two lines, appearing 4 times:
Scripting parser error: missing ';' or newline between statements.
I am not able to understand where I am going wrong.
===============Added after Alex's Answer =============================
Thank you for the answer again; it worked perfectly with some minor changes.
I might not have explained the issue properly above, hence I am adding the following. My actual code is much bigger and these are only extracts.
Pbd in my case is of type {Path} (a set of Path tuples); Path is described above. Pbd reads about 20,000 records from Excel, and each element has tuple fields like id, source, dest, pitblockSet, dumpblockSet, etc. These are all read from Excel and populated into Pbd; this part is working fine. The 3 lines I mentioned above were just examples of Pbd.pitblockSet for 3 records out of the 20,000.
p.pitblockSet is a set, but it contains only one string. The requirement is to break this string into a set. For example, if p.pitblockSet has the value {"P499,P376,P490,P366,P129,"} for, say, p.id = "PT129", the expected result for this p.id is {"P499" "P376" "P490" "P366" "P129"}. Then, for example, for p.id = "PT1" where p.pitblockSet is {"P4,"}, the expected result is a set with only one element, {"P4"}. As mentioned earlier there are several such records of p, and the above two are just examples.
I have therefore modified the suggested code to some extent to fit the problem. However, I am still getting an issue with the split command.
{string} result[Pbd];
int MaxS = 10;
execute {
    for (var p in Pbd) {
        var stringSet = Opl.item(p.pitblockSet, 0);
        var split = new Array(MaxS);
        split = stringSet.split(",");
        for (var i = 0; i <= MaxS; i++)
            if ((split[i] != 'null') && (split[i] != '')) result[p].add(split[i]);
        writeln("result:", p.id, result[p]);
    }
}
The answers look like this:
result:PT1 {"P4"}
result:PT2 {"P5"}
result:PT3 {"P6"}
result:PT4 {"P7"}
result:PT5 {"P8"}
result:PT6 {"P8" "P330" "P455" "P341"}
result:PT7 {"P326"}
result:PT8 {"P327"}
result:PT9 {"P328"}
... and so on ...
result:PT28097 {"P500" "P377" "P479" "P355"}
result:PT28098 {"P501" "P378" "P139"}
result:PT28099 {"P501" "P388" "P491" "P367"}
result:PT28100 {"P501" "P378" "P480"}
result:PT28101 {"P501" "P378" "P139"}
result:PT28102 {"P502"}
result:PT28103 {"P503"}
Unfortunately, I'm afraid you have run into a product limitation.
See: https://www.ibm.com/support/knowledgecenter/SSSA5P_12.10.0/ilog.odms.ide.help/refjsopl/html/intro.html?view=kc#1037020
Regards,
Chris.
int MaxS = 10;
{string} splitDumpBlocksPath = {"P499,P376,P490,P366,P129,"} union
                               {"P388,P491,P367,"} union
                               {"P500,P377,P479,P355,"};
range Pbd = 0..card(splitDumpBlocksPath)-1;
{string} result[Pbd];
execute {
    for (var p in Pbd) {
        var stringSet = Opl.item(splitDumpBlocksPath, p);
        writeln(stringSet);
        var split = new Array(MaxS);
        split = stringSet.split(",");
        for (var i = 0; i <= MaxS; i++)
            if ((split[i] != 'null') && (split[i] != '')) result[p].add(split[i]);
    }
    writeln(result);
}
works fine and gives:
P499,P376,P490,P366,P129,
P388,P491,P367,
P500,P377,P479,P355,
[{"P499" "P376" "P490" "P366" "P129"} {"P388" "P491" "P367"} {"P500" "P377"
"P479" "P355"}]

IcCube - Treemap Chart with duplicate names

In the Google Treemap Chart every node has to have a unique id, but two nodes can have the same name (https://groups.google.com/d/msg/google-visualization-api/UDLD-a-0PCM/IwVCGzsWOg8J).
I used the schema from the parent/child demo (http://www.iccube.com/support/documentation/user_guide/schemas_cubes/dim_parentchild.php)
Using the following MDX statement in the treemap works, as long as the names of the nodes are unique:
WITH
  MEMBER [parent_name] as IIF( [dim (ALL)].[Hierarchy].currentmember
      is [dim (ALL)].[Hierarchy].[ALL], '',
      [dim (ALL)].[Hierarchy].currentmember.parent.name )
SELECT
  {[parent_name],[Measures].[value]} on 0,
  non empty [dim (ALL)].[Hierarchy].members on 1
FROM
  [Cube]
If I add the following line to the in-memory table in icCube's schema:
7,4,Spain, 2, 32
then the name Spain appears twice when rendering the Treemap. To support duplicate names, a child definition in the GVI table should be something like this:
{v:'uniqueID-Spain', f:'Spain'}
As a workaround you can use the following code that modifies GviTable processing for the google tree widget. Check the example here:
https://drive.google.com/file/d/0B3kSph_LgXizSVhvSm15Q1hIdW8/view?usp=sharing
Report JavaScript:
function consumeEvent( context, event ) {
    if (event.name == 'ic3-report-init') {
        if (!_.isFunction(ic3.originalProcessGviTable)) {
            ic3.originalProcessGviTable = viz.charts.GenericGoogleWidget.prototype.processGviTable
        }
        viz.charts.GenericGoogleWidget.prototype.processGviTable = function(gviTable) {
            if (this.props.ic3chartType === "TreeMap") {
                gviTable = gviTable || this.gviTable();
                var underlying = _.cloneDeep(gviTable.getUnderlyingGviTable());
                _.each(underlying.rows, function(row) {
                    // Replace id with parent prefixed
                    if (_.isObject(row.c[0]) && !_.isString(row.c[0].f)) {
                        row.c[0].f = row.c[0].v;
                        if (_.isObject(row.c[0].p) && _.isString(row.c[0].p.mun)) {
                            row.c[0].v = row.c[0].p.mun;
                        }
                    }
                });
                gviTable = viz.GviTable.fromSnapshot(underlying);
                this.startColumnSelection = gviTable.getNumberOfHeaderColumns() - 1;
                return viz.charts.toGoogleDataTableOneRowHeader(gviTable);
            } else {
                return ic3.originalProcessGviTable.apply(this, gviTable);
            }
        }
    }
}
For a query like:
WITH
  MEMBER [parent_name] as
    IIF( [dim (ALL)].[Hierarchy].currentmember.isAll(),
         '',
         ([dim (ALL)].[Hierarchy].currentmember.parent.uniqueName)
    )
SELECT
  {[parent_name],[Measures].[value]} on 0,
  non empty [dim (ALL)].[Hierarchy].members on 1
FROM
  [Cube]
This is a limitation of the Google Treemap chart, which uses the same column for the id and the label. Besides changing the names to ensure they are unique (e.g. adding the parent), I don't see a workaround for this.
An option would be to use another Treemap chart (e.g. one from D3) that does not have this limitation.
--- icCube Schema ---
The schema is working (just use , instead of ; as the separator).
--- icCube Reporting ---
The issue using the Treemap is that you have two rows with the same id (Germany); see the fiddle.
This fiddle is a running example of the treemap.

Univocity - parse each TSV file row to different Type of class object

I have a TSV file which has fixed rows, but each row is mapped to a different Java class.
For example:
recordType recordValue1
recordType recordValue1 recordValue2
For the first row I have the following class:
public class FirstRow implements ItsvRecord {
    @Parsed(index = 0)
    private String recordType;
    @Parsed(index = 1)
    private String recordValue1;

    public FirstRow() {
    }
}
and for the second row I have:
public class SecondRow implements ItsvRecord {
    @Parsed(index = 0)
    private String recordType;
    @Parsed(index = 1)
    private String recordValue1;
    @Parsed(index = 2)
    private String recordValue2;

    public SecondRow() {
    }
}
I want to parse the TSV file directly into the respective objects, but I am falling short of ideas.
Use an InputValueSwitch. This will match a value in a particular column of each row to determine what RowProcessor to use. Example:
Create two (or more) processors for each type of record you need to process:
final BeanListProcessor<FirstRow> firstProcessor = new BeanListProcessor<FirstRow>(FirstRow.class);
final BeanListProcessor<SecondRow> secondProcessor = new BeanListProcessor<SecondRow>(SecondRow.class);
Create an InputValueSwitch:
//0 means that the first column of each row has a value that
//identifies what is the type of record you are dealing with
InputValueSwitch valueSwitch = new InputValueSwitch(0);
//assigns the first processor to rows whose first column contain the 'firstRowType' value
valueSwitch.addSwitchForValue("firstRowType", firstProcessor);
//assigns the second processor to rows whose first column contain the 'secondRowType' value
valueSwitch.addSwitchForValue("secondRowType", secondProcessor);
Parse as usual:
TsvParserSettings settings = new TsvParserSettings(); //configure...
// your row processor is the switch
settings.setProcessor(valueSwitch);
TsvParser parser = new TsvParser(settings);
Reader input = new StringReader("" +
        "firstRowType\trecordValue1\n" +
        "secondRowType\trecordValue1\trecordValue2");
parser.parse(input);
Get the parsed objects from your processors:
List<FirstRow> firstTypeObjects = firstProcessor.getBeans();
List<SecondRow> secondTypeObjects = secondProcessor.getBeans();
The output will be*:
[FirstRow{recordType='firstRowType', recordValue1='recordValue1'}]
[SecondRow{recordType='secondRowType', recordValue1='recordValue1', recordValue2='recordValue2'}]
*Assuming you have a sane toString() implemented in your classes.
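For reference, a minimal toString() in FirstRow along these lines (the exact format is entirely up to you) would produce the output shown above:
@Override
public String toString() {
    return "FirstRow{recordType='" + recordType + "', recordValue1='" + recordValue1 + "'}";
}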
If you want to manage associations among the objects that are parsed:
If your FirstRow should contain the elements parsed for records of type SecondRow, simply override the rowProcessorSwitched method:
InputValueSwitch valueSwitch = new InputValueSwitch(0) {
    @Override
    public void rowProcessorSwitched(RowProcessor from, RowProcessor to) {
        if (from == secondProcessor) {
            List<FirstRow> firstRows = firstProcessor.getBeans();
            FirstRow mostRecentRow = firstRows.get(firstRows.size() - 1);
            mostRecentRow.addRowsOfOtherType(secondProcessor.getBeans());
            secondProcessor.getBeans().clear();
        }
    }
};
The above assumes your FirstRow class has an addRowsOfOtherType method that takes a list of SecondRow as a parameter.
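That method is not part of the library; a minimal sketch of it inside FirstRow (the field name secondRows is an assumption here) could look like this:
// inside FirstRow (requires java.util.List and java.util.ArrayList imports)
private final List<SecondRow> secondRows = new ArrayList<SecondRow>();

public void addRowsOfOtherType(List<SecondRow> rows) {
    // keep a copy, because the caller clears the processor's list right afterwards
    secondRows.addAll(rows);
}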
And that's it!
You can even mix and match other types of RowProcessor. There's another example here that demonstrates this.
Hope this helps.

Interpretation of classification in Weka

I would like to use Weka to solve my classification problem.
I have a set of instances as my training data. Let's say the data looks like this:
@relation Relation1
@attribute att1 {val11, val12}
@attribute att2 {val21, val22}
@attribute class {class1, class2, class3}
@data
val11, val21, class1
val11, val22, class2
val12, val21, class3
In my code I read the training set from the file. I train the J48 tree and try to classify an instance. However, I have no idea how to interpret the results of the classification.
My code is the following:
try {
    DataSource source = new DataSource("trainingset.arff");
    Instances data = source.getDataSet();
    if (data.classIndex() == -1) {
        data.setClassIndex(data.numAttributes() - 1);
    }
    Instance xyz = new Instance(data.numAttributes());
    xyz.setDataset(data);
    xyz.setValue(data.attribute(0), "val11");
    xyz.setValue(data.attribute(1), "val21");
    String[] options = new String[1];
    options[0] = "-U";          // unpruned tree
    J48 tree = new J48();       // new instance of tree
    tree.setOptions(options);   // set the options
    tree.buildClassifier(data); // build classifier
    double[] distributionForInstance = tree.distributionForInstance(xyz);
    System.out.println(distributionForInstance[0]);
    System.out.println(distributionForInstance[1]);
    System.out.println(distributionForInstance[2]);
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
As an output I get:
0.3333333333333333
0.3333333333333333
0.3333333333333333
I also tried another way of classifying the instance:
double classifyInstance = tree.classifyInstance(xyz);
System.out.println(classifyInstance);
In this case the output is:
0.0
Could you explain how I should interpret the outputs of the distributionForInstance and classifyInstance methods?
My aim is to create a classifier that tells me to which class a given instance belongs.
Have a look at the javadoc. The distributionForInstance method returns an array of class membership probabilities (the first element is the probability of the instance belonging to the first class, and so on), and classifyInstance returns the class as an ID; think of it as an index into the array of class labels.
Use the value method of Attribute to get the class label:
double classifyInstance = tree.classifyInstance(xyz);
String classStr = data.classAttribute().value((int) classifyInstance);
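If you prefer to work with the probability array instead, the predicted class is simply the index of the largest value. A short sketch building on the xyz, tree and data objects above:
double[] dist = tree.distributionForInstance(xyz);
int predictedIndex = weka.core.Utils.maxIndex(dist);                 // index of the most probable class
String predictedLabel = data.classAttribute().value(predictedIndex); // e.g. "class1"
System.out.println(predictedLabel + " (p = " + dist[predictedIndex] + ")");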

Same Instances header ( arff ) for all my database queries

I am using InstanceQuery (SQL queries) to construct my Instances. But my query results do not always come back in the same order, as is normal with SQL.
Because of this, Instances constructed from different SQL queries have different headers. A simple example can be seen below. I suspect my results change because of this behavior.
Header 1
@attribute duration numeric
@attribute protocol_type {tcp,udp}
@attribute service {http,domain_u}
@attribute flag {SF}
Header 2
@attribute duration numeric
@attribute protocol_type {tcp}
@attribute service {pm_dump,pop_2,pop_3}
@attribute flag {SF,S0,SH}
My question is: how can I give the correct header information to the Instances construction?
Is a workflow like the one below possible?
get pre-prepared header information from an arff file or another place
give this header information to the Instances construction
call the SQL function and get Instances (header + data)
I am using the following SQL function to get instances from the database.
public static Instances getInstanceDataFromDatabase(String pSql,
        String pInstanceRelationName) {
    try {
        DatabaseUtils utils = new DatabaseUtils();
        InstanceQuery query = new InstanceQuery();
        query.setUsername(username);
        query.setPassword(password);
        query.setQuery(pSql);
        Instances data = query.retrieveInstances();
        data.setRelationName(pInstanceRelationName);
        if (data.classIndex() == -1) {
            data.setClassIndex(data.numAttributes() - 1);
        }
        return data;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
I tried various approaches to my problem, but it seems that the Weka internal API does not allow a solution to this problem right now. I modified the weka.core.Instances append command-line code for my purposes. This code is also given in this answer.
Based on this, here is my solution. I created a SampleWithKnownHeader.arff file, which contains the correct header values, and read it with the following code.
public static Instances getSampleInstances() {
    Instances data = null;
    try {
        BufferedReader reader = new BufferedReader(new FileReader(
                "datas\\SampleWithKnownHeader.arff"));
        data = new Instances(reader);
        reader.close();
        // setting class attribute
        data.setClassIndex(data.numAttributes() - 1);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    return data;
}
After that, I use the following code to create the instances. I had to use a StringBuilder and the string values of each instance, and then save the resulting string to a file.
public static void main(String[] args) {
    Instances SampleInstance = MyUtilsForWeka.getSampleInstances();
    DataSource source1 = new DataSource(SampleInstance);
    Instances data2 = InstancesFromDatabase
            .getInstanceDataFromDatabase(DatabaseQueries.WEKALIST_QUESTION1);
    MyUtilsForWeka.saveInstancesToFile(data2, "fromDatabase.arff");
    DataSource source2 = new DataSource(data2);
    Instances structure1;
    Instances structure2;
    StringBuilder sb = new StringBuilder();
    try {
        structure1 = source1.getStructure();
        sb.append(structure1);
        structure2 = source2.getStructure();
        while (source2.hasMoreElements(structure2)) {
            String elementAsString = source2.nextElement(structure2)
                    .toString();
            sb.append(elementAsString);
            sb.append("\n");
        }
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
    MyUtilsForWeka.saveInstancesToFile(sb.toString(), "combined.arff");
}
My code for saving instances to a file is below.
public static void saveInstancesToFile(String contents, String filename) {
    FileWriter fstream;
    try {
        fstream = new FileWriter(filename);
        BufferedWriter out = new BufferedWriter(fstream);
        out.write(contents);
        out.close();
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}
This solves my problem, but I wonder if a more elegant solution exists.
I solved a similar problem with the Add filter, which allows adding attributes to Instances. You need to add a correct Attribute with the proper list of values to both datasets (in my case, to the test dataset only):
Load train and test data:
/* "train" contains labels and data */
/* "test" contains data only */
CSVLoader csvLoader = new CSVLoader();
csvLoader.setFile(new File(trainFile));
Instances training = csvLoader.getDataSet();
csvLoader.reset();
csvLoader.setFile(new File(predictFile));
Instances test = csvLoader.getDataSet();
Set a new attribute with Add filter:
Add add = new Add();
/* the name of the attribute must be the same as in "train"*/
add.setAttributeName(training.attribute(0).name());
/* getValues returns a String with comma-separated values of the attribute */
add.setNominalLabels(getValues(training.attribute(0)));
/* put the new attribute to the 1st position, the same as in "train"*/
add.setAttributeIndex("1");
add.setInputFormat(test);
/* result - a compatible with "train" dataset */
test = Filter.useFilter(test, add);
As a result, the headers of both "train" and "test" are the same (compatible for Weka machine learning).
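The getValues helper used above is not shown in the answer; a minimal sketch that builds the comma-separated string expected by Add.setNominalLabels from a nominal attribute could look like this:
/* Builds "val1,val2,..." from a nominal attribute, for use with Add.setNominalLabels(). */
public static String getValues(weka.core.Attribute attribute) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < attribute.numValues(); i++) {
        if (i > 0) {
            sb.append(",");
        }
        sb.append(attribute.value(i));
    }
    return sb.toString();
}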