ARFF output in Weka differs depending on whether it is saved incrementally - weka

Below is a program that shows how strings are output incorrectly when the ARFF saver from Weka writes in incremental mode. The program runs in incremental mode if a parameter is passed to it, and in batch mode if no parameter is passed.
Note that in batch mode, the ARFF file contains strings ... normal operation.
In incremental mode, the ARFF file contains integers in place of strings ... strange!
Any ideas on how to get the ARFF formatter to output strings in incremental mode?
import java.io.File;
import java.io.IOException;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.Saver;
public class ArffTest {
    static Instances instances;
    static ArffSaver saver;
    static boolean flag = false;

    public static void addData(String ticker, double price) throws IOException {
        int numAttr = instances.numAttributes(); // same for
        double[] vals = new double[numAttr];
        int i = 0;
        vals[i++] = instances.attribute(0).addStringValue(ticker);
        vals[i++] = price;
        Instance instance = new Instance(1.0, vals);
        if (flag)
            saver.writeIncremental(instance);
        else
            instances.add(instance);
    }

    public static void main(String[] args) {
        if (args.length > 0) {
            flag = true;
        }
        FastVector atts = new FastVector(); // attributes
        atts.addElement(new Attribute("Ticker", (FastVector) null)); // symbol
        atts.addElement(new Attribute("Price")); // price that order exited at.
        instances = new Instances("Samples", atts, 0); // create header
        saver = new ArffSaver();
        saver.setInstances(instances);
        if (flag)
            saver.setRetrieval(Saver.INCREMENTAL);
        try {
            saver.setFile(new File("test.arff"));
            addData("YY", 23.0);
            addData("XY", 24.0);
            addData("XX", 29.0);
            if (flag)
                saver.writeIncremental(null);
            else
                saver.writeBatch();
        } catch (Exception e) {
            System.out.println("Exception");
        }
    }
}

You forgot to attach the newly created Instance to the dataset.
Instance instance = new DenseInstance(1.0, vals);
instance.setDataset(instances); // Add instance!
if (flag)
    saver.writeIncremental(instance);
else
    instances.add(instance);
The Instance must have access to the dataset to retrieve the String
attribute; if it doesn't, it just writes out the index.
Besides that, I recommend using Weka 3.7.6, where Instance is now an
interface with two implementations.
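Putting it together, here is a minimal sketch of what the corrected addData could look like on the 3.7.x API with DenseInstance; the field names follow the program above, and it is an illustration rather than the original poster's final code.
// Requires weka.core.DenseInstance (Weka 3.7+); instances, saver and flag are
// the static fields from the program above.
public static void addData(String ticker, double price) throws IOException {
    double[] vals = new double[instances.numAttributes()];
    vals[0] = instances.attribute(0).addStringValue(ticker); // index of the stored string
    vals[1] = price;
    Instance instance = new DenseInstance(1.0, vals);
    instance.setDataset(instances); // lets the saver resolve the string value
    if (flag)
        saver.writeIncremental(instance);
    else
        instances.add(instance);
}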
cheers,
Muki


How to count the number of rows in the input file in Google Dataflow file processing?

I am trying to count the number of rows in an input file, and I am using the Cloud Dataflow runner for creating the template. In the code below, I read the file from a GCS bucket, process it, and then store the output in a Redis instance.
But I am unable to count the number of lines in the input file.
Main Class
public static void main(String[] args) {
/**
* Constructed StorageToRedisOptions object using the method PipelineOptionsFactory.fromArgs to read options from command-line
*/
StorageToRedisOptions options = PipelineOptionsFactory.fromArgs(args)
.withValidation()
.as(StorageToRedisOptions.class);
Pipeline p = Pipeline.create(options);
p.apply("Reading Lines...", TextIO.read().from(options.getInputFile()))
.apply("Transforming data...",
ParDo.of(new DoFn<String, String[]>() {
@ProcessElement
public void TransformData(@Element String line, OutputReceiver<String[]> out) {
String[] fields = line.split("\\|");
out.output(fields);
}
}))
.apply("Processing data...",
ParDo.of(new DoFn<String[], KV<String, String>>() {
@ProcessElement
public void ProcessData(@Element String[] fields, OutputReceiver<KV<String, String>> out) {
if (fields[RedisIndex.GUID.getValue()] != null) {
out.output(KV.of("firstname:"
.concat(fields[RedisIndex.FIRSTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));
out.output(KV.of("lastname:"
.concat(fields[RedisIndex.LASTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));
out.output(KV.of("dob:"
.concat(fields[RedisIndex.DOB.getValue()]), fields[RedisIndex.GUID.getValue()]));
out.output(KV.of("postalcode:"
.concat(fields[RedisIndex.POSTAL_CODE.getValue()]), fields[RedisIndex.GUID.getValue()]));
}
}
}))
.apply("Writing field indexes into redis",
RedisIO.write().withMethod(RedisIO.Write.Method.SADD)
.withEndpoint(options.getRedisHost(), options.getRedisPort()));
p.run();
}
Sample Input File
xxxxxxxxxxxxxxxx|bruce|wayne|31051989|444444444444
yyyyyyyyyyyyyyyy|selina|thomas|01051989|222222222222
aaaaaaaaaaaaaaaa|clark|kent|31051990|666666666666
Command to execute the pipeline
mvn compile exec:java \
-Dexec.mainClass=com.viveknaskar.DataFlowPipelineForMemStore \
-Dexec.args="--project=my-project-id \
--jobName=dataflow-job \
--inputFile=gs://my-input-bucket/*.txt \
--redisHost=127.0.0.1 \
--stagingLocation=gs://pipeline-bucket/stage/ \
--dataflowJobFile=gs://pipeline-bucket/templates/dataflow-template \
--runner=DataflowRunner"
I have tried to use the below code from a StackOverflow solution, but it doesn't work for me.
PipelineOptions options = ...;
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
PCollection<Long> countPC =
p.apply(TextIO.Read.from("gs://..."))
.apply(Count.<String>globally());
DirectPipelineRunner.EvaluationResults results = runner.run(p);
long count = results.getPCollection(countPC).get(0);
I have gone through the Apache Beam documentation as well but didn't find anything helpful. Any help on this will be really appreciated.
I resolved this issue by adding Count.globally() and applying it to the PCollection<String> after the pipeline reads the file.
I have added the below code:
PCollection<String> lines = p.apply("Reading Lines...", TextIO.read().from(options.getInputFile()));
lines.apply(Count.globally()).apply("Count the total records", ParDo.of(new RecordCount()));
where I have created a new class (RecordCount.java) that extends DoFn<Long, Void> and just logs the count.
RecordCount.java
import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class RecordCount extends DoFn<Long, Void> {

    private static final Logger LOGGER = LoggerFactory.getLogger(RecordCount.class);

    @ProcessElement
    public void processElement(@Element Long count) {
        // SLF4J needs the {} placeholder, otherwise the count is silently dropped
        LOGGER.info("The total number of records in the input file is: {}", count);
    }
}
The proper way to do this is to write the count to a storage system using a Beam connector (or a Beam ParDo). The pipeline result is not directly available to the main program, since the Beam runner may parallelize the computation and execution may not happen on the same machine.
For example (pseudocode):
p.apply(TextIO.Read.from("gs://..."))
.apply(Count.<String>globally())
.apply(ParDo(MyLongToStringParDo()))
.apply(TextIO.Write.to("gs://..."));
If you need to handle output directly in the main program, you can read from GCS using a client library after the Beam program ends (make sure to specify p.run().waitUntilFinish() in this case). Alternatively, you can move your computation (that needs the count) into a Beam PTransform and make it part of your pipeline.
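As an illustration of the connector approach, here is a minimal sketch for the Beam Java 2.x SDK; the output path is a placeholder and the transform names are arbitrary.
// Requires org.apache.beam.sdk.transforms.Count, org.apache.beam.sdk.transforms.MapElements,
// org.apache.beam.sdk.values.TypeDescriptors and org.apache.beam.sdk.io.TextIO.
p.apply("Reading Lines...", TextIO.read().from(options.getInputFile()))
        .apply("Counting records", Count.globally())
        .apply("Formatting count", MapElements
                .into(TypeDescriptors.strings())
                .via((Long count) -> "Total records: " + count))
        // withoutSharding() produces a single output file containing the count
        .apply("Writing count", TextIO.write().to("gs://my-output-bucket/record-count")
                .withoutSharding());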

How to pass dynamic parameters in a Google Cloud Dataflow pipeline

I have written code to load a CSV file from GCS into BigQuery with a hardcoded ProjectID, Dataset, Table name, and GCS Temp & Staging location.
I am looking for code that reads the
ProjectID
Dataset
Table name
GCS Temp & Staging location parameters
parameters from a BigQuery table (dynamic parameters).
Code:
public class DemoPipeline {
public static TableReference getGCDSTableReference() {
TableReference ref = new TableReference();
ref.setProjectId("myprojectbq");
ref.setDatasetId("DS_Emp");
ref.setTableId("emp");
return ref;
}
static class TransformToTable extends DoFn<String, TableRow> {
@ProcessElement
public void processElement(ProcessContext c) {
String input = c.element();
String[] s = input.split(",");
TableRow row = new TableRow();
row.set("id", s[0]);
row.set("name", s[1]);
c.output(row);
}
}
public interface MyOptions extends PipelineOptions {
/*
* Param
*
*/
}
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
options.setTempLocation("gs://demo-xxxxxx/temp");
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply("Read From Storage", TextIO.read().from("gs://demo-xxxxxx/student.csv"));
PCollection<TableRow> rows = lines.apply("Transform To Table",ParDo.of(new TransformToTable()));
rows.apply("Write To Table",BigQueryIO.writeTableRows().to(getGCDSTableReference())
//.withSchema(BQTableSemantics.getGCDSTableSchema())
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
p.run();
}
}
Even to read from an initial table (project ID / dataset / table names) where the other data is contained, you need to hardcode that information somewhere. Properties files, as Haris recommended, are a good approach; consider the following suggestions:
Java Properties file. Used when parameters have to be changed or tuned; in general, changes that don't require recompilation. It's a file that has to live with, or be attached to, your Java classes. Reading this file from GCS is feasible, but an odd option.
Pipeline Execution Parameters. Custom parameters can be a workaround for your question; please check Creating Custom Options to understand how this can be accomplished. A small example is sketched below.
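A small sketch of custom options (the option names here are hypothetical, not taken from the question's code):
// Hypothetical custom options; requires org.apache.beam.sdk.options.Description
// and org.apache.beam.sdk.options.Validation.
public interface MyOptions extends PipelineOptions {

    @Description("BigQuery dataset to write to")
    @Validation.Required
    String getDataset();
    void setDataset(String value);

    @Description("BigQuery table to write to")
    @Validation.Required
    String getTable();
    void setTable(String value);
}
They are then supplied on the command line, e.g. --dataset=DS_Emp --table=emp, and read in the pipeline via options.getDataset() and options.getTable() after PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class).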

Creating internal accounts in SAS Metadata Server programmatically with SAS Base

I'm trying to create internal accounts programmatically by using proc metadata.
The code section below creates a person with an external login.
put"<Person Name=%str(%')&&PersonName&i.%str(%')>";
put"<Logins>";
put"<Login Name=%str(%')Login.&&PersonName&i.%str(%') Password=%str(%')&&word&i.%str(%')/>";
put"</Logins>";
put"</Person>";
To create an ExternalLogin we can set the Password attribute, and in SAS Metadata it will be encrypted automatically.
But to create an InternalLogin type of object it is necessary to compute the hash value of the password and the salt. I know about the standard sas002 encryption method, but when using proc pwencode, how do I obtain the value of the salt?
Is it possible to create an InternalLogin by using SAS Base?
Thanx.
I found an article that describes how to create a Stored Process for this problem. My answer is an addition to that article.
The approach is based on executing Java methods from a SAS program.
1. Prepare the setPasswd.java class
I've modified the class from the article, separating the code that connects to the metadata server from the code that creates the InternalLogin.
import java.rmi.RemoteException;
import com.sas.metadata.remote.AssociationList;
import com.sas.metadata.remote.CMetadata;
import com.sas.metadata.remote.Person;
import com.sas.metadata.remote.MdException;
import com.sas.metadata.remote.MdFactory;
import com.sas.metadata.remote.MdFactoryImpl;
import com.sas.metadata.remote.MdOMIUtil;
import com.sas.metadata.remote.MdOMRConnection;
import com.sas.metadata.remote.MdObjectStore;
import com.sas.metadata.remote.MetadataObjects;
import com.sas.metadata.remote.PrimaryType;
import com.sas.metadata.remote.Tree;
import com.sas.meta.SASOMI.ISecurity_1_1;
import com.sas.iom.SASIOMDefs.VariableArray2dOfStringHolder;
public class setPasswd {
String serverName = null;
String serverPort = null;
String serverUser = null;
String serverPass = null;
MdOMRConnection connection = null;
MdFactoryImpl _factory = null;
ISecurity_1_1 iSecurity = null;
MdObjectStore objectStore = null;
Person person = null;
public int connectToMetadata(String name, String port, String user, String pass){
try {
serverName = name;
serverPort = port;
serverUser = user;
serverPass = pass;
_factory = new MdFactoryImpl(false);
connection = _factory.getConnection();
connection.makeOMRConnection(serverName, serverPort, serverUser, serverPass);
iSecurity = connection.MakeISecurityConnection();
return 0;
}catch(Exception e){
return 1;
}
}
public setPasswd(){};
public int changePasswd(String IdentityName, String IdentityPassword) {
try
{
//
// This block obtains the person metadata ID that is needed to change the password
//
// Defines the GetIdentityInfo 'ReturnUnrestrictedSource' option.
final String[][] options ={{"ReturnUnrestrictedSource",""}};
// Defines a stringholder for the info output parameter.
VariableArray2dOfStringHolder info = new VariableArray2dOfStringHolder();
// Issues the GetInfo method for the provided iSecurity connection user.
iSecurity.GetInfo("GetIdentityInfo","Person:"+IdentityName, options, info);
String[][] returnArray = info.value;
String personMetaID = new String();
for (int i=0; i< returnArray.length; i++ )
{
System.out.println(returnArray[i][0] + "=" + returnArray[i][1]);
if (returnArray[i][0].compareTo("IdentityObjectID") == 0) {
personMetaID = returnArray[i][1];
}
}
objectStore = _factory.createObjectStore();
person = (Person) _factory.createComplexMetadataObject(objectStore, IdentityName, MetadataObjects.PERSON, personMetaID);
iSecurity.SetInternalPassword(IdentityName, IdentityPassword);
person.updateMetadataAll();
System.out.println("Password has been changed.");
return 0; // success
}
catch (MdException e)
{
Throwable t = e.getCause();
if (t != null)
{
String ErrorType = e.getSASMessageSeverity();
String ErrorMsg = e.getSASMessage();
if (ErrorType == null)
{
// If there is no SAS server message, write a Java/CORBA message.
}
else
{
// If there is a message from the server:
System.out.println(ErrorType + ": " + ErrorMsg);
}
if (t instanceof org.omg.CORBA.COMM_FAILURE)
{
// If there is an invalid port number or host name:
System.out.println(e.getLocalizedMessage());
}
else if (t instanceof org.omg.CORBA.NO_PERMISSION)
{
// If there is an invalid user ID or password:
System.out.println(e.getLocalizedMessage());
}
}
else
{
// If we cannot find a nested exception, get message and print.
System.out.println(e.getLocalizedMessage());
}
// If there is an error, print the entire stack trace.
e.printStackTrace();
}
catch (RemoteException e)
{
// Unknown exception.
e.printStackTrace();
}
catch (Exception e)
{
// Unknown exception.
e.printStackTrace();
}
System.out.println("Failure: Password has NOT been changed.");
return 1; // failure
}
}
2. Resolve dependencies
Pay attention to the imports in the class. To be able to execute the code, you need to set the CLASSPATH environment variable.
On Linux you can add the following command to %SASConfig%/Lev1/level_env_usermods.sh:
export CLASSPATH=$CLASSPATH:%pathToJar%
On Windows you can add or change the environment variable via Advanced system settings.
Where should you look for the jar files? They are in the folder:
%SASHome%/SASVersionedJarRepository/eclipse/plugins/
Which files should be included in the path?
I've included all the jars that are used in OMI (Open Metadata Interface). I've also added log4j.jar (it does not work without this jar; hints as to why would be welcome):
sas.oma.joma.jar
sas.oma.joma.rmt.jar
sas.oma.omi.jar
sas.svc.connection.jar
sas.core.jar
sas.entities.jar
sas.security.sspi.jar
log4j.jar
setPasswd.jar (YOUR JAR FROM THE NEXT STEP!)
Choose files from the nearest release. Example:
Here I use the files from v940m3f (a fix release).
Other ways are described here.
3. Compile setPasswd.jar
I tried to use the internal javac.exe that ships with SAS, but it did not work properly, so you need to download a JDK to compile the jars. I created a bat file:
"C:\Program Files\Java\jdk1.8.0_121\bin\javac.exe" -source 1.7 -target 1.7 setPasswd.java
"C:\Program Files\Java\jdk1.8.0_121\bin\jar" -cf setPasswd.jar setPasswd.class
The -source and -target parameters are helpful if your JDK version is higher than the one used by SAS. You can check the version of the Java used by SAS with:
PROC javainfo all;
run;
Look for the following string in the log:
java.vm.specification.version = 1.7
4. Finally: the SAS Base call
Now we can call the Java code as follows (all methods are available here):
data test;
dcl javaobj j ("setPasswd");
j.callIntMethod("connectToMetadata", "%SERVER%", "%PORT%", "%ADMIN%", "%{SAS002}HASHPASSORPASS%", rc1);
j.callIntMethod("changePasswd", "testPassLogin", "pass1", rc2);
j.delete();
run;
In log:
UserClass=Normal
AuthenticatedUserid=Unknown
IdentityName=testPass
IdentityType=Person
IdentityObjectID=A56RQPC2.AP00000I
Password has been changed.
Now it's time to test. Create a new user with no password.
Execute the code:
data test;
dcl javaobj j ("setPasswd");
j.callIntMethod("connectToMetadata", "&server.", "&port.", "&adm", "&pass", rc1);
j.callIntMethod("changePasswd", "TestUserForStack", "Overflow", rc2);
j.delete();
run;
Now our user has an InternalLogin object.
Thanx.

Best way to split log files

Need help with this, as it seems like such a common task:
We have huge hourly log files containing many different events.
We have been using Hive to split these events into different files, in a hard-coded way:
from events
insert overwrite table specificevent1
where events.event_type='specificevent1'
insert overwrite table specificevent2
where events.event_type='specificevent2'
...;
This is problematic as the code must change for each new event that we add.
We tried to use dynamic partitioning to do the splitting automatically, but we ran into problems:
If my partition schema is /year/month/day/hour/event, then we cannot recover partitions for more than a day, as the monthly number would be ~(30 days)(24 hours)(~100 events) = ~72k, which is far too many to work with.
If my schema is event/year/month/day/hour, then since the event is the dynamic part, it forces the following partitions to be dynamic as well, and this causes the splitting to take more time as the number of partitions grows.
Is there a better way to do this (Hive and non-Hive solutions)?
Hope this will help others...
I found that Hive is not the way to go if you want to split a log file into many different files (one file per event_type).
Dynamic partitions offered by Hive have too many limitations IMHO.
What I ended up doing is writing a custom map-reduce jar.
I also found the old Hadoop API much more suitable, as it offers the MultipleTextOutputFormat abstract class, which lets you implement generateFileNameForKeyValue(). (The new Hadoop API offers a different multiple-output mechanism, MultipleOutputs, which is great if you have predefined output locations; I did not figure out how to generate them on the fly from the key-value. A hedged sketch with the newer API follows the example code below.)
example code:
/*
Run example:
hadoop jar DynamicSplit.jar DynamicEventSplit.DynamicEventSplitMultifileMapReduce /event/US/incoming/2013-01-01-01/ event US 2013-01-01-01 2 "[a-zA-Z0-9_ ]+" "/event/dynamicsplit1/" ","
*/
package DynamicEventSplit;
import java.io.*;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.mapred.lib.*;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;
public class DynamicEventSplitMultifileMapReduce
{
static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text>
{
private String event_name;
private String EventNameRegexp;
private int EventNameColumnNumber;
private String columndelimeter=",";
public void configure(JobConf job)
{
EventNameRegexp=job.get("EventNameRegexp");
EventNameColumnNumber=Integer.parseInt(job.get("EventNameColumnNumber"));
columndelimeter=job.get("columndelimeter");
}
public void map(LongWritable key, Text value,OutputCollector<Text, Text> output, Reporter reporter) throws IOException
{
//check that expected event_name field exists
String [] dall=value.toString().split(columndelimeter);
if (dall.length<EventNameColumnNumber)
{
return;
}
event_name=dall[EventNameColumnNumber-1];
//check that expected event_name is valid
if (!event_name.matches(EventNameRegexp))
{
return;
}
output.collect(new Text(dall[1]),value);
}
}
static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text>
{
public void reduce(Text key, Iterator<Text> values,OutputCollector<Text, Text> output, Reporter reporter) throws IOException
{
while (values.hasNext())
{
output.collect(key, values.next());
}
}
}
static class MultiFileOutput extends MultipleTextOutputFormat<Text, Text>
{
private String event_name;
private String site;
private String event_date;
private String year;
private String month;
private String day;
private String hour;
private String basepath;
public RecordWriter<Text,Text> getRecordWriter(FileSystem fs, JobConf job,String name, Progressable arg3) throws IOException
{
RecordWriter<Text,Text> rw=super.getRecordWriter(fs, job, name, arg3);
site=job.get("site");
event_date=job.get("date");
year=event_date.substring(0,4);
month=event_date.substring(5,7);
day=event_date.substring(8,10);
hour=event_date.substring(11,13);
basepath=job.get("basepath");
return rw;
}
protected String generateFileNameForKeyValue(Text key, Text value,String leaf)
{
event_name=key.toString();
return basepath+"event="+event_name+"/site="+site+"/year="+year+"/month="+month+"/day="+day+"/hour="+hour+"/"+leaf;
}
protected Text generateActualKey(Text key, Text value)
{
return null;
}
}
public static void main(String[] args) throws Exception
{
String InputFiles=args[0];
String OutputDir=args[1];
String SiteStr=args[2];
String DateStr=args[3];
String EventNameColumnNumber=args[4];
String EventNameRegexp=args[5];
String basepath=args[6];
String columndelimeter=args[7];
Configuration mycon=new Configuration();
JobConf conf = new JobConf(mycon,DynamicEventSplitMultifileMapReduce.class);
conf.set("site",SiteStr);
conf.set("date",DateStr);
conf.setOutputKeyClass(Text.class);
conf.setMapOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(MultiFileOutput.class);
conf.setMapSpeculativeExecution(false);
conf.setReduceSpeculativeExecution(false);
FileInputFormat.setInputPaths(conf,InputFiles);
FileOutputFormat.setOutputPath(conf,new Path("/"+OutputDir+SiteStr+DateStr+"/"));
conf.set("EventNameColumnNumber",EventNameColumnNumber);
conf.set("EventNameRegexp",EventNameRegexp);
conf.set("basepath",basepath);
conf.set("columndelimeter",columndelimeter);
JobClient.runJob(conf);
}
}
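For comparison only (and not what is used above), the newer mapreduce API's MultipleOutputs does accept a per-record baseOutputPath through its write(key, value, baseOutputPath) overload. Below is a rough sketch of a reducer using it, under the assumption that the key is the event name; the class name is illustrative.
// A rough sketch with the newer org.apache.hadoop.mapreduce API; not the approach used above.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class EventSplitReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text eventName, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // The base output path is derived from the record's key on the fly.
            mos.write(eventName, value, "event=" + eventName.toString() + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}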

Same Instances header (ARFF) for all my database queries

I am using InstanceQuery and SQL queries to construct my Instances. But my query results do not always come back in the same order, which is normal in SQL.
Because of this, Instances constructed from different SQL queries have different headers. A simple example can be seen below. I suspect my results change because of this behavior.
Header 1
@attribute duration numeric
@attribute protocol_type {tcp,udp}
@attribute service {http,domain_u}
@attribute flag {SF}
Header 2
@attribute duration numeric
@attribute protocol_type {tcp}
@attribute service {pm_dump,pop_2,pop_3}
@attribute flag {SF,S0,SH}
My question is: how can I give correct header information to the Instance construction?
Is something like the workflow below possible?
get pre-prepared header information from an ARFF file or another place
give this header information to the Instance construction
call the SQL function and get Instances (header + data)
I am using the following SQL function to get instances from the database.
public static Instances getInstanceDataFromDatabase(String pSql
,String pInstanceRelationName){
try {
DatabaseUtils utils = new DatabaseUtils();
InstanceQuery query = new InstanceQuery();
query.setUsername(username);
query.setPassword(password);
query.setQuery(pSql);
Instances data = query.retrieveInstances();
data.setRelationName(pInstanceRelationName);
if (data.classIndex() == -1)
{
data.setClassIndex(data.numAttributes() - 1);
}
return data;
} catch (Exception e) {
throw new RuntimeException(e);
}
}
I tried various approaches to my problem, but it seems that the Weka internal API does not allow a solution to this problem right now. I modified the weka.core.Instances append command-line code for my purposes. This code is also given in this answer.
Based on this, here is my solution. I created a SampleWithKnownHeader.arff file, which contains the correct header values. I read this file with the following code.
public static Instances getSampleInstances() {
Instances data = null;
try {
BufferedReader reader = new BufferedReader(new FileReader(
"datas\\SampleWithKnownHeader.arff"));
data = new Instances(reader);
reader.close();
// setting class attribute
data.setClassIndex(data.numAttributes() - 1);
}
catch (Exception e) {
throw new RuntimeException(e);
}
return data;
}
After that, I use the following code to create instances. I had to use a StringBuilder and the string values of each instance, then save the resulting string to a file.
public static void main(String[] args) {
Instances SampleInstance = MyUtilsForWeka.getSampleInstances();
DataSource source1 = new DataSource(SampleInstance);
Instances data2 = InstancesFromDatabase
.getInstanceDataFromDatabase(DatabaseQueries.WEKALIST_QUESTION1);
MyUtilsForWeka.saveInstancesToFile(data2, "fromDatabase.arff");
DataSource source2 = new DataSource(data2);
Instances structure1;
Instances structure2;
StringBuilder sb = new StringBuilder();
try {
structure1 = source1.getStructure();
sb.append(structure1);
structure2 = source2.getStructure();
while (source2.hasMoreElements(structure2)) {
String elementAsString = source2.nextElement(structure2)
.toString();
sb.append(elementAsString);
sb.append("\n");
}
} catch (Exception ex) {
throw new RuntimeException(ex);
}
MyUtilsForWeka.saveInstancesToFile(sb.toString(), "combined.arff");
}
My code for saving the instances to a file is below.
public static void saveInstancesToFile(String contents, String filename) {
    FileWriter fstream;
    try {
        fstream = new FileWriter(filename);
        BufferedWriter out = new BufferedWriter(fstream);
        out.write(contents);
        out.close();
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}
This solves my problem, but I wonder if a more elegant solution exists.
I solved a similar problem with the Add filter, which allows adding attributes to Instances. You need to add a correct Attribute with the proper list of values to both datasets (in my case, to the test dataset only):
Load train and test data:
/* "train" contains labels and data */
/* "test" contains data only */
CSVLoader csvLoader = new CSVLoader();
csvLoader.setFile(new File(trainFile));
Instances training = csvLoader.getDataSet();
csvLoader.reset();
csvLoader.setFile(new File(predictFile));
Instances test = csvLoader.getDataSet();
Set a new attribute with Add filter:
Add add = new Add();
/* the name of the attribute must be the same as in "train"*/
add.setAttributeName(training.attribute(0).name());
/* getValues returns a String with comma-separated values of the attribute */
add.setNominalLabels(getValues(training.attribute(0)));
/* put the new attribute to the 1st position, the same as in "train"*/
add.setAttributeIndex("1");
add.setInputFormat(test);
/* result - a compatible with "train" dataset */
test = Filter.useFilter(test, add);
As a result, the headers of both "train" and "test" are the same (compatible for Weka machine learning)
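The getValues helper used above is not a Weka built-in; a minimal sketch of what it might look like, given the comment that it returns a comma-separated list of the attribute's values:
// A minimal sketch of the getValues helper referenced above (not a Weka built-in):
// builds a comma-separated list of an attribute's nominal values.
private static String getValues(weka.core.Attribute attribute) {
    StringBuilder values = new StringBuilder();
    for (int i = 0; i < attribute.numValues(); i++) {
        if (i > 0) {
            values.append(",");
        }
        values.append(attribute.value(i));
    }
    return values.toString();
}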