How to use Isolationforest in weka? - weka

I am trying to use isolationforest in weka ,but I cannot find a easy example which shows how to use it ,who can help me ?thanks in advance
import weka.classifiers.misc.IsolationForest;
public class Test2 {
public static void main(String[] args) {
IsolationForest isolationForest = new IsolationForest();
.....................................................
}
}

I strongly suggest you to study a little bit the implementation for IslationForest.
The following code work loading a CSV file with first column with Class (note: a single class value will produce only (1-anomaly score) if it's binary you will get the anomaly score too. Otherwise it just return an error). Note I skip the second column (that in my case is the uuid that is not needed for anomaly detection)
private static void findOutlier(File in, File out) throws Exception {
CSVLoader loader = new CSVLoader();
loader.setSource(new File(in.getAbsolutePath()));
Instances data = loader.getDataSet();
// setting class attribute if the data format does not provide this information
// For example, the XRFF format saves the class attribute information as well
if (data.classIndex() == -1)
data.setClassIndex(0);
String[] options = new String[2];
options[0] = "-R"; // "range"
options[1] = "2"; // first attribute
Remove remove = new Remove(); // new instance of filter
remove.setOptions(options); // set options
remove.setInputFormat(data); // inform filter about dataset **AFTER** setting options
Instances newData = Filter.useFilter(data, remove); // apply filter
IsolationForest randomForest = new IsolationForest();
randomForest.buildClassifier(newData);
// System.out.println(randomForest);
FileWriter fw = new FileWriter(out);
final Enumeration<Attribute> attributeEnumeration = data.enumerateAttributes();
for (Attribute e = attributeEnumeration.nextElement(); attributeEnumeration.hasMoreElements(); e = attributeEnumeration.nextElement()) {
fw.write(e.name());
fw.write(",");
}
fw.write("(1 - anomaly score),anomaly score\n");
for (int i = 0; i < data.size(); ++i) {
Instance inst = data.get(i);
final double[] distributionForInstance = randomForest.distributionForInstance(inst);
fw.write(inst + ", " + distributionForInstance[0] + "," + (1 - distributionForInstance[0]));
fw.write(",\n");
}
fw.flush();
}
The previous function will add at the CSV at last column the anomaly values. Please note I'm using a single class so for getting the corresponding anomaly I do 1 - distributionForInstance[0] otherwise you ca do simply distributionForInstance[1] .
A sample input.csv for getting (1-anomaly score):
Class,ignore, feature_0, feature_1, feature_2
A,1,21,31,31
A,2,41,61,81
A,3,61,37,34
A sample input.csv for getting (1-anomaly score) and anomaly score:
Class,ignore, feature_0, feature_1, feature_2
A,1,21,31,31
B,2,41,61,81
A,3,61,37,34

Related

Unit Test for Apex Trigger that Concatenates Fields

I am trying to write a test for a before trigger that takes fields from a custom object and concatenates them into a custom Key__c field.
The trigger works in the Sandbox and now I am trying to get it into production. However, whenever I try and do a System.assert/assertEquals after I create a purchase and perform DML, the value of Key__c always returns null. I am aware I can create a flow/process to do this, but I am trying to solve this with code for my own edification. How can I get the fields to concatenate and return properly in the test? (the commented out asserts are what I have tried so far, and have failed when run)
trigger Composite_Key on Purchases__c (before insert, before update) {
if(Trigger.isBefore)
{
for(Purchases__c purchase : trigger.new)
{
String eventName = String.isBlank(purchase.Event_name__c)?'':purchase.Event_name__c+'-';
String section = String.isBlank(purchase.section__c)?'':purchase.section__c+'-';
String row = String.isBlank(purchase.row__c)?'':purchase.row__c+'-';
String seat = String.isBlank(String.valueOf(purchase.seat__c))?'':String.valueOf(purchase.seat__c)+'-';
String numseats = String.isBlank(String.valueOf(purchase.number_of_seats__c))?'':String.valueOf(purchase.number_of_seats__c)+'-';
String adddatetime = String.isBlank(String.valueOf(purchase.add_datetime__c))?'':String.valueOf(purchase.add_datetime__c);
purchase.Key__c = eventName + section + row + seat + numseats + adddatetime;
}
}
}
#isTest
public class CompositeKeyTest {
public static testMethod void testPurchase() {
//create a purchase to fire the trigger
Purchases__c purchase = new Purchases__c(Event_name__c = 'test', section__c='test',row__c='test', seat__c=1.0,number_of_seats__c='test',add_datetime__c='test');
Insert purchase;
//System.assert(purchases__c.Key__c.getDescribe().getName() == 'testesttest1testtest');
//System.assertEquals('testtesttest1.0testtest',purchase.Key__c);
}
static testMethod void testbulkPurchase(){
List<Purchases__c> purchaseList = new List<Purchases__c>();
for(integer i=0 ; i < 10; i++)
{
Purchases__c purchaserec = new Purchases__c(Event_name__c = 'test', section__c='test',row__c='test', seat__c= i+1.0 ,number_of_seats__c='test',add_datetime__c='test');
purchaseList.add(purchaserec);
}
insert purchaseList;
//System.assertEquals('testtesttest5testtest',purchaseList[4].Key__c,'Key is not Valid');
}
}
You need to requery the records after inserting them to get the updated data from the triggers/database

Running BeamSql WithoutCoder or Making Coder Dynamic

I am reading data from file and converting it to BeamRecord But While i am Doing Query on that it Show Error-:
Exception in thread "main" java.lang.ClassCastException: org.apache.beam.sdk.coders.SerializableCoder cannot be cast to org.apache.beam.sdk.coders.BeamRecordCoder
at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.registerTables(BeamSql.java:173)
at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.expand(BeamSql.java:153)
at org.apache.beam.sdk.extensions.sql.BeamSql$QueryTransform.expand(BeamSql.java:116)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:533)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:465)
at org.apache.beam.sdk.values.PCollectionTuple.apply(PCollectionTuple.java:160)
at TestingClass.main(TestingClass.java:75)
But When I am Providing it a Coder Then It Runs Perfectly.
I am little confused that if I am reading data from a file the file data schema changes on every run because I am using templates so is there any way I can use Default Coder or Without Coder, i can Run the Query.
For Reference Code is Below Please Check.
PCollection<String> ReadFile1 = PBegin.in(p).apply(TextIO.read().from("gs://Bucket_Name/FileName.csv"));
PCollection<BeamRecord> File1_BeamRecord = ReadFile1.apply(new StringToBeamRecord()).setCoder(new Temp().test().getRecordCoder());
PCollection<String> ReadFile2= p.apply(TextIO.read().from("gs://Bucket_Name/FileName.csv"));
PCollection<BeamRecord> File2_Beam_Record = ReadFile2.apply(new StringToBeamRecord()).setCoder(new Temp().test1().getRecordCoder());
new Temp().test1().getRecordCoder() --> Returning HardCoded BeamRecordCoder Values Which I need to fetch at runtime
Conversion From PColletion<String> to PCollection<TableRow> is Below-:
Public class StringToBeamRecord extends PTransform<PCollection<String>,PCollection<BeamRecord>> {
private static final Logger LOG = LoggerFactory.getLogger(StringToBeamRecord.class);
#Override
public PCollection<BeamRecord> expand(PCollection<String> arg0) {
return arg0.apply("Conversion",ParDo.of(new ConversionOfData()));
}
static class ConversionOfData extends DoFn<String,BeamRecord> implements Serializable{
#ProcessElement
public void processElement(ProcessContext c){
String Data = c.element().replaceAll(",,",",blank,");
String[] array = Data.split(",");
List<String> fieldNames = new ArrayList<>();
List<Integer> fieldTypes = new ArrayList<>();
List<Object> Data_Conversion = new ArrayList<>();
int Count = 0;
for(int i = 0 ; i < array.length;i++){
fieldNames.add(new String("R"+Count).toString());
Count++;
fieldTypes.add(Types.VARCHAR); //Using Schema I can Set it
Data_Conversion.add(array[i].toString());
}
LOG.info("The Size is : "+Data_Conversion.size());
BeamRecordSqlType type = BeamRecordSqlType.create(fieldNames, fieldTypes);
c.output(new BeamRecord(type,Data_Conversion));
}
}
}
Query is -:
PCollectionTuple test = PCollectionTuple.of(
new TupleTag<BeamRecord>("File1_BeamRecord"),File1_BeamRecord)
.and(new TupleTag<BeamRecord>("File2_BeamRecord"), File2_BeamRecord);
PCollection<BeamRecord> output = test.apply(BeamSql.queryMulti(
"Select * From File1_BeamRecord JOIN File2_BeamRecord "));
Is thier anyway i can make Coder Dynamic or I can Run Query with Default Coder.

Univocity - parse each TSV file row to different Type of class object

I have a tsv file which has fixed rows but each row is mapped to different Java Class.
For example.
recordType recordValue1
recordType recordValue1 recordValue2
for First row I have follofing class:
public class FirstRow implements ItsvRecord {
#Parsed(index = 0)
private String recordType;
#Parsed(index = 1)
private String recordValue1;
public FirstRow() {
}
}
and for second row I have:
public class SecondRow implements ItsvRecord {
#Parsed(index = 0)
private String recordType;
#Parsed(index = 1)
private String recordValue1;
public SecondRow() {
}
}
I want to parse the TSV file directly to the respective objects but I am falling short of ideas.
Use an InputValueSwitch. This will match a value in a particular column of each row to determine what RowProcessor to use. Example:
Create two (or more) processors for each type of record you need to process:
final BeanListProcessor<FirstRow> firstProcessor = new BeanListProcessor<FirstRow>(FirstRow.class);
final BeanListProcessor<SecondRow> secondProcessor = new BeanListProcessor<SecondRow>(SecondRow.class);
Create an InputValueSwitch:
//0 means that the first column of each row has a value that
//identifies what is the type of record you are dealing with
InputValueSwitch valueSwitch = new InputValueSwitch(0);
//assigns the first processor to rows whose first column contain the 'firstRowType' value
valueSwitch.addSwitchForValue("firstRowType", firstProcessor);
//assigns the second processor to rows whose first column contain the 'secondRowType' value
valueSwitch.addSwitchForValue("secondRowType", secondProcessor);
Parse as usual:
TsvParserSettings settings = new TsvParserSettings(); //configure...
// your row processor is the switch
settings.setProcessor(valueSwitch);
TsvParser parser = new TsvParser(settings);
Reader input = new StringReader(""+
"firstRowType\trecordValue1\n" +
"secondRowType\trecordValue1\trecordValue2");
parser.parse(input);
Get the parsed objects from your processors:
List<FirstRow> firstTypeObjects = firstProcessor.getBeans();
List<SecondRow> secondTypeObjects = secondProcessor.getBeans();
The output will be*:
[FirstRow{recordType='firstRowType', recordValue1='recordValue1'}]
[SecondRow{recordType='secondRowType', recordValue1='recordValue1', recordValue2='recordValue2'}]
Assuming you have a sane toString() implemented in your classes
If you want to manage associations among the objects that are parsed:
If your FirstRow should contain the elements parsed for records of type SecondRow, simply override the rowProcessorSwitched method:
InputValueSwitch valueSwitch = new InputValueSwitch(0) {
#Override
public void rowProcessorSwitched(RowProcessor from, RowProcessor to) {
if (from == secondProcessor) {
List<FirstRow> firstRows = firstProcessor.getBeans();
FirstRow mostRecentRow = firstRows.get(firstRows.size() - 1);
mostRecentRow.addRowsOfOtherType(secondProcessor.getBeans());
secondProcessor.getBeans().clear();
}
}
};
The above assumes your FirstRow class has a addRowsOfOtherType method that takes a list of SecondRow as parameter.
And that's it!
You can even mix and match other types of RowProcessor. There's another example here that demonstrates this.
Hope this helps.

How to get all rows containing (or equaling) a particular ID from an HBase table?

I have a method which select the row whose rowkey contains the parameter passed into.
HTable table = new HTable(Bytes.toBytes(objectsTableName), connection);
public List<ObjectId> lookUp(String partialId) {
if (partialId.matches("[a-fA-F0-9]+")) {
// create a regular expression from partialId, which can
//match any rowkey that contains partialId as a substring,
//and then get all the row with the specified rowkey
} else {
throw new IllegalArgumentException(
"query must be done with hexadecimal values only");
}
}
I don't know how to finish code above.
I just know the following code can get the row with specified rowkey in Hbase.
String rowkey = "123";
Get get = new Get(Bytes.toBytes(rowkey));
Result result = table.get(get);
You can use RowFilter filter with RegexStringComparator to do that. Or, if it is just to fetch the rows which match a given substring you can use RowFilter with SubstringComparator. This is how you use HBase filters :
public static void main(String[] args) throws IOException {
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "demo");
Scan s = new Scan();
Filter f = new RowFilter(CompareOp.EQUAL, new SubstringComparator("abc"));
s.setFilter(f);
ResultScanner rs = table.getScanner(s);
for(Result r : rs){
System.out.println("RowKey : " + Bytes.toString(r.getRow()));
//rest of your logic
}
rs.close();
table.close();
}
The above piece of code will give you all the rows which contain abc as a part of their rowkeys.
HTH

Skip feature when classifying, but show feature in output

I've created a dataset which contains +/- 13000 rows with +/- 50 features. I know how to output every classification result: prediction and actual, but I would like to be able to output some sort of ID with those results. So i've added a ID column to my dataset but I don't know how disregard the ID when classifying while still being able to output the ID with every prediction result. I do know how to select features to output with every prediction.
Use FilteredClassifier. See this and this .
Let's say follwoing are the attributes in the bbcsport.arff that you want to remove and is in a file attributes.txt line by line..
serena
serve
service
sets
striking
tennis
tiebreak
tournaments
wimbledon
..
Here is how you may include or exclude the attributes by setting true or false. (mutually elusive) remove.setInvertSelection(false)
BufferedReader datafile = new BufferedReader(new FileReader("bbcsport.arff"));
BufferedReader attrfile = new BufferedReader(new FileReader("attributes.txt"));
Instances data = new Instances(datafile);
List<Integer> myList = new ArrayList<Integer>();
String line;
while ((line = attrfile.readLine()) != null) {
for (n = 0; n < data.numAttributes(); n++) {
if (data.attribute(n).name().equalsIgnoreCase(line)) {
if(!myList.contains(n))
myList.add(n);
}
}
}
int[] attrs = myList.stream().mapToInt(i -> i).toArray();
Remove remove = new Remove();
remove.setAttributeIndicesArray(attrs);
remove.setInvertSelection(false);
remove.setInputFormat(data); // init filter
Instances filtered = Filter.useFilter(data, remove);
'filtered' has the final attributes..
My blog .. http://ojaslabs.com/include-exclude-attributes-in-weka