Use pre-computed model for text classification in Weka

I have a sentiment analysis task. I have tweets (labelled as negative or positive) as training data. I created a model from them using StringToWordVector and NaiveBayesMultinomial.
Code:
try {
    TextDirectoryLoader loader = new TextDirectoryLoader();
    loader.setDirectory(new File("./train/"));
    Instances dataRaw = loader.getDataSet();
    System.out.println(loader.getStructure());

    StringToWordVector filter = new StringToWordVector();
    filter.setInputFormat(dataRaw);
    Instances dataFiltered = Filter.useFilter(dataRaw, filter);
    System.out.println("\n\nFiltered data:\n\n" + dataFiltered);

    // train Multinomial NaiveBayes classifier and output model
    NaiveBayesMultinomial classifier = new NaiveBayesMultinomial();
    classifier.buildClassifier(dataFiltered);
    //System.out.println("\n\nClassifier model:\n\n" + classifier);

    // save the model
    weka.core.SerializationHelper.write("./model/naviebayesmodel/", classifier);
} catch (Exception ex) {
    ex.printStackTrace();
}
Now I want to test this model on new tweets. I am unable to work out the testing part of the classifier. I tried the following code, but no instances are captured. How can I use the existing model to test new tweets?
Code:
try {
    Classifier cls = (Classifier) weka.core.SerializationHelper.read("./model/naviebayesmodel");
    //Instances ins = (Instances) weka.core.SerializationHelper.read("./model/naviebayesmodel");
    //System.out.println(ins);

    TextDirectoryLoader loader = new TextDirectoryLoader();
    loader.setDirectory(new File("./test/-1/"));
    Instances dataRaw = loader.getDataSet();
    //String data = "hello, I am your test case. This is a great clasifier :) !!";

    StringToWordVector filter = new StringToWordVector();
    filter.setInputFormat(dataRaw);
    //Instances unlabeled = new Instances(new BufferedReader(new FileReader("./test/test.txt")));
    Instances dataFiltered = Filter.useFilter(dataRaw, filter);
    dataRaw.setClassIndex(dataRaw.numAttributes() - 1);
    //Instances dataFiltered = Filter.useFilter(unlabeled, filter);

    for (int i = 0; i < dataRaw.numInstances(); i++) {
        double clsLabel = cls.classifyInstance(dataRaw.instance(i));
        System.out.println(clsLabel);
    }
    //System.out.println(dataRaw.numInstances());
} catch (Exception ex) {
    ex.printStackTrace();
}

Related

Multilayer Perceptron: only one value is predicted during data prediction

I am using Weka for machine learning.
I would like to predict different behaviors using a multilayer perceptron. I then apply min-max normalization and shuffle the order of the data (Randomize).
I did this for the whole data set in Weka (not programmed in Java; the code given here is only an example of how it would look for the training data).
Then I split the data: 60% training data, 20% cross-validation data and 20% test data. After that I create the multilayer perceptron model:
public static void main(String[] args) throws Exception {
    String filepath = "...Training60%.arff";
    FileReader trainreader = new FileReader(filepath);
    Instances train = new Instances(trainreader);
    train.setClassIndex(train.numAttributes() - 1);

    /**
     * Min-max normalization of the attributes to values between 0 and 1
     */
    Normalize normalize = new Normalize();
    normalize.setInputFormat(train);
    Instances normalizedData = Filter.useFilter(train, normalize);
    FileWriter fwriter1 = new FileWriter(
            "...OutputJavaNormalize.arff");
    fwriter1.write(normalizedData.toString());
    fwriter1.close();
    System.out.println("Fertig");

    /**
     * Randomly shuffles the order of the passed (normalized) instances.
     */
    Randomize randomize = new Randomize();
    randomize.setInputFormat(normalizedData);
    Instances randomizedData = Filter.useFilter(normalizedData, randomize);
    FileWriter fwriter2 = new FileWriter(
            "...OutputJavaRandomize.arff");
    System.out.println("Ende");
    fwriter2.write(randomizedData.toString());
    fwriter2.close();
Then I create the multilayer perceptron model and do the cross-validation:
    /**
     * MultilayerPerceptron model
     */
    MultilayerPerceptron mlp = new MultilayerPerceptron();
    // Setting parameters
    mlp.setLearningRate(0.1);
    mlp.setMomentum(0.2);
    mlp.setTrainingTime(2000);
    mlp.setSeed(1);
    mlp.setValidationThreshold(20);
    mlp.setHiddenLayers("9");
    mlp.buildClassifier(randomizedData);
    weka.core.SerializationHelper.write(".../MLPa753", mlp);
    System.out.println("ModelErstellt");

    Instances datapredict = new Instances(new BufferedReader(new FileReader(
            "...CrossValid_20%.arff")));
    datapredict.setClassIndex(datapredict.numAttributes() - 1);
    Evaluation eval = new Evaluation(randomizedData);
    eval.crossValidateModel(mlp, datapredict, 5, new Random(1));
After that I load the test data and predict the value and probability for it and save it.
    // Evaluation/prediction of unlabeled data (20% of the whole data set)
    Instances datapredict1 = new Instances(new BufferedReader(new FileReader(
            "D:...TestSet_20%.arff")));
    datapredict1.setClassIndex(datapredict1.numAttributes() - 1);
    Instances predicteddata1 = new Instances(datapredict1);
    FileWriter fwriter11 = new FileWriter(
            ".../output.arff");
    for (int i1 = 0; i1 < datapredict1.numInstances(); i1++) {
        double clsLabel1 = mlp.classifyInstance(datapredict1.instance(i1));
        predicteddata1.instance(i1).setClassValue(clsLabel1);
        String s = train.instance(i1) + "," + clsLabel1;
        fwriter11.write(s.toString());
        System.out.println(train.instance(i1) + "," + clsLabel1);
    }
    fwriter11.close();

    System.out.println(eval.toClassDetailsString());
    System.out.println(eval.toMatrixString());
    System.out.println(eval.toSummaryString()); // Summary of Training
    System.out.println(Arrays.toString(mlp.getOptions()));
    }
}
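A side note on the prediction loop above: it writes train.instance(i1) rather than the test instance that was just classified, and classifyInstance only returns the index of the predicted class. If the goal is to output both the predicted value and its probability for each test instance, distributionForInstance returns the full class distribution. A minimal sketch (a hypothetical helper, not part of the question's code):

import java.util.Arrays;
import weka.classifiers.Classifier;
import weka.core.Instances;

public class PredictWithProbabilities {

    // Prints the predicted class label and the full class distribution for each instance.
    // Assumes a nominal class attribute in the last position.
    static void predict(Classifier model, Instances test) throws Exception {
        test.setClassIndex(test.numAttributes() - 1);
        for (int i = 0; i < test.numInstances(); i++) {
            double label = model.classifyInstance(test.instance(i));
            double[] dist = model.distributionForInstance(test.instance(i));
            System.out.println(test.instance(i) + " -> "
                    + test.classAttribute().value((int) label)
                    + " " + Arrays.toString(dist));
        }
    }
}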
When I look at the confusion matrix, the model looks quite OK, and the overview (screenshots not included here) looks OK too.
But in the output file where the predictions are stored, "Value1" is always predicted for all records. What is the reason for this? How can I change this?

How to use IsolationForest in Weka?

I am trying to use IsolationForest in Weka, but I cannot find an easy example that shows how to use it. Who can help me? Thanks in advance.
import weka.classifiers.misc.IsolationForest;

public class Test2 {
    public static void main(String[] args) {
        IsolationForest isolationForest = new IsolationForest();
        .....................................................
    }
}
I strongly suggest you study the implementation of IsolationForest a little.
The following code works by loading a CSV file whose first column is the class (note: with a single class value it will only produce the (1 - anomaly score) column; if the class is binary you get the anomaly score column too, otherwise it just returns an error). Note that I skip the second column (which in my case is a UUID that is not needed for anomaly detection).
private static void findOutlier(File in, File out) throws Exception {
    CSVLoader loader = new CSVLoader();
    loader.setSource(new File(in.getAbsolutePath()));
    Instances data = loader.getDataSet();

    // setting class attribute if the data format does not provide this information
    // For example, the XRFF format saves the class attribute information as well
    if (data.classIndex() == -1)
        data.setClassIndex(0);

    String[] options = new String[2];
    options[0] = "-R";           // "range"
    options[1] = "2";            // the second attribute (the uuid column)
    Remove remove = new Remove();            // new instance of filter
    remove.setOptions(options);              // set options
    remove.setInputFormat(data);             // inform filter about dataset **AFTER** setting options
    Instances newData = Filter.useFilter(data, remove); // apply filter

    IsolationForest isolationForest = new IsolationForest();
    isolationForest.buildClassifier(newData);
    // System.out.println(isolationForest);

    FileWriter fw = new FileWriter(out);
    // write a header line: all non-class attribute names, then the score columns
    Enumeration<Attribute> attributeEnumeration = data.enumerateAttributes();
    while (attributeEnumeration.hasMoreElements()) {
        fw.write(attributeEnumeration.nextElement().name());
        fw.write(",");
    }
    fw.write("(1 - anomaly score),anomaly score\n");

    for (int i = 0; i < data.size(); ++i) {
        Instance inst = data.get(i);
        final double[] distributionForInstance = isolationForest.distributionForInstance(inst);
        fw.write(inst + ", " + distributionForInstance[0] + "," + (1 - distributionForInstance[0]));
        fw.write(",\n");
    }
    fw.flush();
}
The previous function appends the anomaly values as the last columns of the CSV. Please note that I am using a single class, so to get the corresponding anomaly score I compute 1 - distributionForInstance[0]; otherwise you can simply use distributionForInstance[1].
A sample input.csv for getting (1-anomaly score):
Class,ignore, feature_0, feature_1, feature_2
A,1,21,31,31
A,2,41,61,81
A,3,61,37,34
A sample input.csv for getting (1-anomaly score) and anomaly score:
Class,ignore, feature_0, feature_1, feature_2
A,1,21,31,31
B,2,41,61,81
A,3,61,37,34
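For completeness, a minimal (hypothetical) driver for the helper above could look like this, assuming findOutlier is declared in the same class and the input CSV follows one of the layouts shown:

import java.io.File;

public class RunIsolationForest {
    public static void main(String[] args) throws Exception {
        // Hypothetical file names; the output CSV gets the two score columns appended.
        findOutlier(new File("input.csv"), new File("input_scored.csv"));
    }
}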

Google Custom Metrics tracking of latency data

I'm using an external service based on REST calls, and I want to track the time the service takes to respond to my requests. My code is written in C# (.NET Core 2.2).
I plan to time all the HTTP requests (with Stopwatch) and keep this information in a List<long>. Every 60 seconds I will write the tracked information from the list to Google Custom Metrics.
In the end, I expect to see the average execution time in a graph.
This is my code so far:
public class CustomMetricsWritter
{
    public CustomMetricsWritter(string projectId)
    {
        this.Client = MetricServiceClient.Create();
        this.ProjectId = projectId;
    }

    private MetricServiceClient Client { get; set; }

    public string ProjectId { get; private set; }

    public object CreateMetric(string metricType, string title, string description, string unit)
    {
        // Prepare custom metric descriptor.
        MetricDescriptor metricDescriptor = new MetricDescriptor();
        metricDescriptor.DisplayName = title;
        metricDescriptor.Description = description;
        metricDescriptor.MetricKind = MetricKind.Gauge;
        metricDescriptor.ValueType = MetricDescriptor.Types.ValueType.Double;
        metricDescriptor.Type = metricType;
        metricDescriptor.Unit = unit;

        CreateMetricDescriptorRequest request = new CreateMetricDescriptorRequest
        {
            ProjectName = new ProjectName(this.ProjectId),
        };
        request.MetricDescriptor = metricDescriptor;

        // Make the request.
        return Client.CreateMetricDescriptor(request);
    }

    public static readonly DateTime UnixEpoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    public async Task WriteTimeSeriesDataAsync(string metricDescriptor, TypedValue[] points, string machineName)
    {
        // Initialize request argument(s).
        ProjectName name = new ProjectName(this.ProjectId);

        // Prepare a data point.
        Timestamp timeStamp = new Timestamp();
        timeStamp.Seconds = (long)(DateTime.UtcNow - UnixEpoch).TotalSeconds;
        TimeInterval interval = new TimeInterval();
        interval.EndTime = timeStamp;

        // Prepare monitored resource.
        MonitoredResource resource = new MonitoredResource();
        resource.Type = "global";
        resource.Labels.Add("project_id", this.ProjectId);

        // Add newly created time series to list of time series to be written.
        List<TimeSeries> timeSeries = new List<TimeSeries>(points.Length);

        // Prepare custom metric.
        Metric metric = new Metric();
        metric.Type = metricDescriptor;
        metric.Labels.Add("machine", machineName);

        // Create a new time series using inputs.
        TimeSeries timeSeriesData = new TimeSeries();
        timeSeriesData.Metric = metric;
        timeSeriesData.Resource = resource;

        foreach (var point in points)
        {
            Point dataPoint = new Point();
            dataPoint.Value = point;
            dataPoint.Interval = interval;
            timeSeriesData.Points.Add(dataPoint);
        }
        timeSeries.Add(timeSeriesData);

        // Write time series data.
        await this.Client.CreateTimeSeriesAsync(name, timeSeries).ConfigureAwait(false);
    }
}
I run this class with the following code (it creates the metric and then fills it with dummy values):
try
{
    CustomMetricsWritter customMetricsWriter = new CustomMetricsWritter(Consts.GOOGLE_CLOUD_PROJECT);
    string metric = "custom.googleapis.com/web/latency";
    customMetricsWriter.CreateMetric(metric, "Execution Latency", "Calling REST service (MS).", "{INT64}");

    // Exception thrown in the next line ----->
    await customMetricsWriter.WriteTimeSeriesDataAsync(
        metric,
        new TypedValue[] {
            new TypedValue(){ Int64Value = 150},
            new TypedValue(){ Int64Value = 250},
            new TypedValue(){ Int64Value = 350},
        },
        "my-machine-type");
}
catch (Exception ex)
{
    Console.WriteLine(ex.ToString());
    throw;
}
I get back this exception:
Grpc.Core.RpcException: Status(StatusCode=InvalidArgument, Detail="One or more TimeSeries could not be written: Field timeSeries[0].points had an invalid value: Only one point can be written per TimeSeries per request.: timeSeries[0]")
at Google.Api.Gax.Grpc.ApiCallRetryExtensions.<>c__DisplayClass0_0`2.<<WithRetry>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at ***.CustomMetricsWritter.WriteTimeSeriesDataAsync(String metricDescriptor, TypedValue[] points, String machineName) in ***\GoogleCloud\CustomMetricsWritter.cs:line 131
at Test.Program.MainAsync() in ***\Test\Program.cs:line 156
What am I doing wrong?
I'm not a C# expert, but there are examples here:
https://github.com/GoogleCloudPlatform/dotnet-docs-samples/blob/master/monitoring/api/QuickStart/QuickStart.cs

Writing Jena Models to Tar.Gz archives

I am working with RDF models at the moment. I query data from a database, generate models using Apache Jena, and work with them. However, I don't want to have to query the models every time I use them, so I thought about storing them locally. The models are quite big, so I'd like to compress them using Apache Commons Compress. This works so far (try-catch blocks omitted):
public static void write(Map<String, Model> models, String file) {
    logger.info("Writing models to file " + file);
    TarArchiveOutputStream tarOutput = null;
    TarArchiveEntry entry = null;
    tarOutput = new TarArchiveOutputStream(new GzipCompressorOutputStream(new FileOutputStream(new File(file))));

    for (Map.Entry<String, Model> e : models.entrySet()) {
        logger.info("Packing model " + e.getKey());

        // Convert model
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        RDFDataMgr.write(baos, e.getValue(), RDFFormat.RDFXML_PRETTY);

        // Prepare entry
        entry = new TarArchiveEntry(e.getKey());
        entry.setSize(baos.size());
        tarOutput.putArchiveEntry(entry);

        // write into file and close
        tarOutput.write(baos.toByteArray());
        tarOutput.closeArchiveEntry();
    }
    tarOutput.close();
}
But when I try the other direction, I get weird NullPointerExceptions. Is this a bug in the GZip implementation, or is my understanding of streams wrong?
public static Map<String, Model> read(String file) {
    logger.info("Reading models from file " + file);
    Map<String, Model> models = new HashMap<>();
    TarArchiveInputStream tarInput = new TarArchiveInputStream(new GzipCompressorInputStream(new FileInputStream(file)));

    for (TarArchiveEntry currentEntry = tarInput.getNextTarEntry(); currentEntry != null; currentEntry = tarInput.getNextTarEntry()) {
        logger.info("Processing model " + currentEntry.getName());

        // Read the current model
        Model m = ModelFactory.createDefaultModel();
        m.read(tarInput, null);

        // And add it to the output
        models.put(currentEntry.getName(), m);
        tarInput.close();
    }
    return models;
}
This is the stack trace:
Exception in thread "main" java.lang.NullPointerException
at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:271)
at java.io.InputStream.skip(InputStream.java:224)
at org.apache.commons.compress.utils.IOUtils.skip(IOUtils.java:106)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.skipRecordPadding(TarArchiveInputStream.java:345)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:272)
at de.mem89.masterthesis.rdfHydra.StorageHelper.read(StorageHelper.java:88)
at de.mem89.masterthesis.rdfHydra.StorageHelper.main(StorageHelper.java:124)
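Two things stand out in the read method: tarInput.close() is called inside the loop, so the next getNextTarEntry() runs on an already closed stream (which matches the NullPointerException inside GzipCompressorInputStream.read), and Jena's Model.read may also buffer or close the shared stream itself. A sketch of one workaround under those assumptions: copy each entry into memory, parse the model from that buffer, and close the tar stream only once at the end.

import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.commons.compress.utils.IOUtils;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class ModelArchiveReader {

    public static Map<String, Model> read(String file) throws Exception {
        Map<String, Model> models = new HashMap<>();
        try (TarArchiveInputStream tarInput = new TarArchiveInputStream(
                new GzipCompressorInputStream(new FileInputStream(file)))) {
            for (TarArchiveEntry entry = tarInput.getNextTarEntry();
                 entry != null;
                 entry = tarInput.getNextTarEntry()) {
                // Copy only this entry; the tar stream signals end-of-entry to IOUtils.toByteArray.
                byte[] content = IOUtils.toByteArray(tarInput);
                Model m = ModelFactory.createDefaultModel();
                m.read(new ByteArrayInputStream(content), null); // RDF/XML by default
                models.put(entry.getName(), m);
            }
        }
        return models;
    }
}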

Skip feature when classifying, but show feature in output

I've created a dataset which contains roughly 13000 rows with roughly 50 features. I know how to output every classification result (prediction and actual), but I would like to output some sort of ID with those results. So I've added an ID column to my dataset, but I don't know how to disregard the ID when classifying while still being able to output the ID with every prediction result. I do know how to select features to output with every prediction.
Use FilteredClassifier. See this and this.
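A minimal sketch of that FilteredClassifier approach (the column layout and the base classifier are assumptions: here the ID is the first attribute, the class is the last, and J48 stands in for whatever learner is being used). The Remove filter inside the FilteredClassifier hides the ID from the learner, while the original instances keep it, so it can be printed with every prediction:

import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.Remove;

public class SkipIdButKeepIt {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        Remove removeId = new Remove();
        removeId.setAttributeIndices("1"); // drop the ID column for learning only

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(removeId);
        fc.setClassifier(new J48()); // any base classifier works here
        fc.buildClassifier(data);

        // The original instances still contain the ID, so it can be printed with each prediction.
        for (int i = 0; i < data.numInstances(); i++) {
            double pred = fc.classifyInstance(data.instance(i));
            System.out.println(data.instance(i).toString(0) + " -> "
                    + data.classAttribute().value((int) pred));
        }
    }
}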
Let's say the following are the attributes in bbcsport.arff that you want to remove, listed line by line in a file attributes.txt:
serena
serve
service
sets
striking
tennis
tiebreak
tournaments
wimbledon
..
Here is how you may include or exclude the attributes by setting remove.setInvertSelection(...) to true or false (the two options are mutually exclusive).
BufferedReader datafile = new BufferedReader(new FileReader("bbcsport.arff"));
BufferedReader attrfile = new BufferedReader(new FileReader("attributes.txt"));
Instances data = new Instances(datafile);

List<Integer> myList = new ArrayList<Integer>();
String line;
while ((line = attrfile.readLine()) != null) {
    for (int n = 0; n < data.numAttributes(); n++) {
        if (data.attribute(n).name().equalsIgnoreCase(line)) {
            if (!myList.contains(n))
                myList.add(n);
        }
    }
}

int[] attrs = myList.stream().mapToInt(i -> i).toArray();
Remove remove = new Remove();
remove.setAttributeIndicesArray(attrs);
remove.setInvertSelection(false);
remove.setInputFormat(data); // init filter
Instances filtered = Filter.useFilter(data, remove);
'filtered' now contains the final set of attributes.
My blog: http://ojaslabs.com/include-exclude-attributes-in-weka