Same Instances header (ARFF) for all my database queries - Weka

I am using InstanceQuery with SQL queries to construct my Instances. But my query results do not always come back in the same order, which is normal for SQL.
Because of this, the Instances constructed from different SQL queries have different headers. A simple example can be seen below. I suspect my results change because of this behavior.
Header 1
@attribute duration numeric
@attribute protocol_type {tcp,udp}
@attribute service {http,domain_u}
@attribute flag {SF}
Header 2
@attribute duration numeric
@attribute protocol_type {tcp}
@attribute service {pm_dump,pop_2,pop_3}
@attribute flag {SF,S0,SH}
My question is: how can I give the correct header information to the Instances construction?
Is a workflow like the one below possible (a rough sketch of the idea follows the list)?
get pre-prepared header information from an ARFF file or another place
give this header information to the Instances construction
call the SQL function and get Instances (header + data)
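Roughly, I imagine something like the sketch below. The file name and the copy loop are just placeholders for the idea; it assumes the query returns attributes in the same order as the header ARFF, that the header already lists every nominal value, and the newer (3.7+) Weka API:

import java.io.File;

import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public static Instances queryWithKnownHeader(String pSql) throws Exception {
    // 1. load only the header (structure) from a pre-prepared ARFF file
    ArffLoader loader = new ArffLoader();
    loader.setFile(new File("SampleWithKnownHeader.arff"));
    Instances withHeader = loader.getStructure();

    // 2. fetch the data with the existing SQL helper
    Instances fromDb = getInstanceDataFromDatabase(pSql, "fromDatabase");

    // 3. copy each row into the known header, mapping nominal values by name
    for (int i = 0; i < fromDb.numInstances(); i++) {
        Instance src = fromDb.instance(i);
        Instance dst = new DenseInstance(withHeader.numAttributes());
        dst.setDataset(withHeader);
        for (int j = 0; j < withHeader.numAttributes(); j++) {
            if (withHeader.attribute(j).isNumeric()) {
                dst.setValue(j, src.value(j));
            } else {
                dst.setValue(j, src.stringValue(j)); // fails if the value is missing from the header
            }
        }
        withHeader.add(dst);
    }
    return withHeader;
}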
I am using the following function to get Instances from the database.
public static Instances getInstanceDataFromDatabase(String pSql,
        String pInstanceRelationName) {
    try {
        DatabaseUtils utils = new DatabaseUtils();
        InstanceQuery query = new InstanceQuery();
        query.setUsername(username);
        query.setPassword(password);
        query.setQuery(pSql);
        Instances data = query.retrieveInstances();
        data.setRelationName(pInstanceRelationName);
        if (data.classIndex() == -1) {
            data.setClassIndex(data.numAttributes() - 1);
        }
        return data;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

I tried various approaches to my problem, but it seems that the Weka internal API does not currently allow a solution to it. I modified the weka.core.Instances append command-line code for my purposes. This code is also given in this answer.
Based on this, here is my solution. I created a SampleWithKnownHeader.arff file, which contains the correct header values. I read this file with the following code.
public static Instances getSampleInstances() {
    Instances data = null;
    try {
        BufferedReader reader = new BufferedReader(new FileReader(
                "datas\\SampleWithKnownHeader.arff"));
        data = new Instances(reader);
        reader.close();
        // setting class attribute
        data.setClassIndex(data.numAttributes() - 1);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    return data;
}
After that, I use the following code to create the Instances. I had to use a StringBuilder and the string values of each Instance, then save the resulting string to a file.
public static void main(String[] args) {
    Instances SampleInstance = MyUtilsForWeka.getSampleInstances();
    DataSource source1 = new DataSource(SampleInstance);
    Instances data2 = InstancesFromDatabase
            .getInstanceDataFromDatabase(DatabaseQueries.WEKALIST_QUESTION1);
    MyUtilsForWeka.saveInstancesToFile(data2, "fromDatabase.arff");
    DataSource source2 = new DataSource(data2);
    Instances structure1;
    Instances structure2;
    StringBuilder sb = new StringBuilder();
    try {
        structure1 = source1.getStructure();
        sb.append(structure1);
        structure2 = source2.getStructure();
        while (source2.hasMoreElements(structure2)) {
            String elementAsString = source2.nextElement(structure2)
                    .toString();
            sb.append(elementAsString);
            sb.append("\n");
        }
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
    MyUtilsForWeka.saveInstancesToFile(sb.toString(), "combined.arff");
}
My code for saving instances to a file is below.
public static void saveInstancesToFile(String contents, String filename) {
    FileWriter fstream;
    try {
        fstream = new FileWriter(filename);
        BufferedWriter out = new BufferedWriter(fstream);
        out.write(contents);
        out.close();
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}
This solves my problem, but I wonder if a more elegant solution exists.
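One small cleanup, independent of the header merging itself: Weka's ArffSaver (weka.core.converters.ArffSaver) can write an Instances object straight to an ARFF file, so the manual FileWriter code above could, if I understand the API correctly, be replaced by something like:

import java.io.File;

import weka.core.Instances;
import weka.core.converters.ArffSaver;

public static void saveInstancesToArffFile(Instances data, String filename) {
    try {
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);          // dataset to write (header + data)
        saver.setFile(new File(filename)); // destination .arff file
        saver.writeBatch();                // writes the complete ARFF in one go
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}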

I solved a similar problem with the Add filter, which allows adding attributes to Instances. You need to add a correct Attribute with the proper list of values to both datasets (in my case, to the test dataset only):
Load train and test data:
/* "train" contains labels and data */
/* "test" contains data only */
CSVLoader csvLoader = new CSVLoader();
csvLoader.setFile(new File(trainFile));
Instances training = csvLoader.getDataSet();
csvLoader.reset();
csvLoader.setFile(new File(predictFile));
Instances test = csvLoader.getDataSet();
Set a new attribute with the Add filter:
Add add = new Add();
/* the name of the attribute must be the same as in "train"*/
add.setAttributeName(training.attribute(0).name());
/* getValues returns a String with comma-separated values of the attribute */
add.setNominalLabels(getValues(training.attribute(0)));
/* put the new attribute to the 1st position, the same as in "train"*/
add.setAttributeIndex("1");
add.setInputFormat(test);
/* result - a compatible with "train" dataset */
test = Filter.useFilter(test, add);
As a result, the headers of both "train" and "test" are the same (compatible for Weka machine learning).
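The getValues helper referenced in the snippet is not shown above; a minimal sketch of what it could look like (assuming the attribute is nominal) is:

import java.util.ArrayList;
import java.util.List;

import weka.core.Attribute;

/* Builds the comma-separated value list expected by Add.setNominalLabels() */
static String getValues(Attribute attribute) {
    List<String> values = new ArrayList<>();
    for (int i = 0; i < attribute.numValues(); i++) {
        values.add(attribute.value(i));
    }
    return String.join(",", values);
}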

Related

how to pass dynamic parameters in google cloud dataflow pipeline

I have written code to inject a CSV file from GCS into BigQuery with a hardcoded ProjectID, Dataset, Table name, and GCS Temp & Staging location.
I am looking for code that reads the following parameters from a BigQuery table (dynamic parameters):
ProjectID
Dataset
Table name
GCS Temp & Staging location
Code:
public class DemoPipeline {

    public static TableReference getGCDSTableReference() {
        TableReference ref = new TableReference();
        ref.setProjectId("myprojectbq");
        ref.setDatasetId("DS_Emp");
        ref.setTableId("emp");
        return ref;
    }

    static class TransformToTable extends DoFn<String, TableRow> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String input = c.element();
            String[] s = input.split(",");
            TableRow row = new TableRow();
            row.set("id", s[0]);
            row.set("name", s[1]);
            c.output(row);
        }
    }

    public interface MyOptions extends PipelineOptions {
        /*
         * Param
         *
         */
    }

    public static void main(String[] args) {
        MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
        options.setTempLocation("gs://demo-xxxxxx/temp");
        Pipeline p = Pipeline.create(options);

        PCollection<String> lines = p.apply("Read From Storage", TextIO.read().from("gs://demo-xxxxxx/student.csv"));
        PCollection<TableRow> rows = lines.apply("Transform To Table", ParDo.of(new TransformToTable()));
        rows.apply("Write To Table", BigQueryIO.writeTableRows().to(getGCDSTableReference())
                //.withSchema(BQTableSemantics.getGCDSTableSchema())
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

        p.run();
    }
}
Even to read from an initial table (project ID / dataset / table names) where the other data is contained, you need to hardcode that information somewhere. A properties file, as Haris recommended, is a good approach; look at the following suggestions:
Java Properties file. Used when parameters have to be changed or tuned; in general, changes that don't require recompilation. It's a file that has to live with, or be attached to, your Java classes. Reading this file from GCS is feasible but an odd option.
Pipeline Execution Parameters. Custom parameters can be a workaround for your question; please check Creating Custom Options to understand how this can be accomplished. A small sketch is given below.
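A rough sketch of such custom options, assuming the Beam 2.x SDK used in the question (the option names projectId, dataset and gcsTempLocation are only illustrative):

import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

public interface MyOptions extends PipelineOptions {
    @Description("BigQuery project id")
    String getProjectId();
    void setProjectId(String value);

    @Description("BigQuery dataset name")
    @Default.String("DS_Emp")
    String getDataset();
    void setDataset(String value);

    @Description("GCS temp/staging location")
    String getGcsTempLocation();
    void setGcsTempLocation(String value);
}

// Usage: read the values at pipeline construction time instead of hardcoding them, e.g.
// MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
// TableReference ref = new TableReference().setProjectId(options.getProjectId())
//                                          .setDatasetId(options.getDataset());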

Issue with Weka core DenseInstance

I'm building up an Instances object, adding Attributes, and then adding data in the form of Instance objects.
When I go to write it out, the toString() method throws an IndexOutOfBoundsException and is unable to evaluate the data in the Instances. I receive the error when I try to print out the data, and I can see the exception being thrown in the debugger, which shows it cannot evaluate toString() for the data object.
The only clue I have is that the error message seems to take the first data element (StudentId) and use it as an index. I'm confused as to why.
The code:
// Set up the attributes for the Weka data model
ArrayList<Attribute> attributes = new ArrayList<>();
attributes.add(new Attribute("StudentIdentifier", true));
attributes.add(new Attribute("CourseGrade", true));
attributes.add(new Attribute("CourseIdentifier"));
attributes.add(new Attribute("Term", true));
attributes.add(new Attribute("YearCourseTaken", true));

// Create the data model object - I'm not happy that capacity is required and fixed? But that's another issue
Instances dataSet = new Instances("Records", attributes, 500);
// Set the attribute that will be used for prediction purposes - that will be CourseIdentifier
dataSet.setClassIndex(2);

// Pull back all the records in this term range, create Weka Instance objects for each and add to the data set
List<Record> records = recordsInTermRangeFindService.find(0, 10);
int count = 0;
for (Record r : records) {
    Instance i = new DenseInstance(attributes.size());
    i.setValue(attributes.get(0), r.studentIdentifier);
    i.setValue(attributes.get(1), r.courseGrade);
    i.setValue(attributes.get(2), r.courseIdentifier);
    i.setValue(attributes.get(3), r.term);
    i.setValue(attributes.get(4), r.yearCourseTaken);
    dataSet.add(i);
}
System.out.println(dataSet.size());

BufferedWriter writer = null;
try {
    writer = new BufferedWriter(new FileWriter("./test.arff"));
    writer.write(dataSet.toString());
    writer.flush();
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}
The error message:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1010, Size: 0
I finally figured it out. I was creating the attributes as string attributes with the 'true' second parameter in the constructors, but the values coming out of the database table were integers. I needed to change my lines to convert the integers to strings:
i.setValue(attributes.get(0), Integer.toString(r.studentIdentifier));
However, that created a different set of issues for me, since things like the Apriori algorithm don't work on strings! I'm continuing to plug along learning Weka.
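For what it's worth, one option that seems to play better with algorithms like Apriori is to declare such fields as nominal attributes with an explicit value list instead of string attributes (a sketch only; the grade values below are made up):

import java.util.ArrayList;
import java.util.Arrays;

import weka.core.Attribute;

ArrayList<Attribute> attributes = new ArrayList<>();
// nominal attribute with an explicit (hypothetical) set of values,
// instead of new Attribute("CourseGrade", true), which creates a string attribute
attributes.add(new Attribute("CourseGrade", Arrays.asList("A", "B", "C", "D", "F")));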

Adding stored procedures to In-Memory DB using SqLite

I am using an in-memory database (via ServiceStack.OrmLite.Sqlite.Windows) for unit testing in a ServiceStack based web API. I want to test the service endpoints that depend on stored procedures through the in-memory database, for which I have gone through the link Servicestack Ormlite SqlServerProviderTests. The unit test class that I am using for the test is as follows:
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using NUnit.Framework;
using ServiceStack.Text;
using ServiceStack.Configuration;
using ServiceStack.Data;

namespace ServiceStack.OrmLite.Tests
{
    public class DummyTable
    {
        public int Id { get; set; }
        public string Name { get; set; }
    }

    [TestFixture]
    public class SqlServerProviderTests
    {
        private IDbConnection db;
        protected readonly ServiceStackHost appHost;

        public SqlServerProviderTests()
        {
            appHost = TestHelper.SetUp(appHost).Init();
            db = appHost.Container.Resolve<IDbConnectionFactory>().OpenDbConnection("inventoryDb");
            if (bool.Parse(System.Configuration.ConfigurationManager.AppSettings["IsMock"]))
                TestHelper.CreateInMemoryDB(appHost);
        }

        [TestFixtureTearDown]
        public void TearDown()
        {
            db.Dispose();
        }
        [Test]
        public void Can_SqlColumn_StoredProc_returning_Column()
        {
            var sql = @"CREATE PROCEDURE dbo.DummyColumn
                @Times integer
                AS
                BEGIN
                SET NOCOUNT ON;
                CREATE TABLE #Temp
                (
                    Id integer NOT NULL,
                );
                declare @i int
                set @i=1
                WHILE @i < @Times
                BEGIN
                    INSERT INTO #Temp (Id) VALUES (@i)
                    SET @i = @i + 1
                END
                SELECT * FROM #Temp;
                DROP TABLE #Temp;
                END;";
            db.ExecuteSql("IF OBJECT_ID('DummyColumn') IS NOT NULL DROP PROC DummyColumn");
            db.ExecuteSql(sql);

            var expected = 0;
            10.Times(i => expected += i);

            var results = db.SqlColumn<int>("EXEC DummyColumn @Times", new { Times = 10 });
            results.PrintDump();
            Assert.That(results.Sum(), Is.EqualTo(expected));

            results = db.SqlColumn<int>("EXEC DummyColumn 10");
            Assert.That(results.Sum(), Is.EqualTo(expected));

            results = db.SqlColumn<int>("EXEC DummyColumn @Times", new Dictionary<string, object> { { "Times", 10 } });
            Assert.That(results.Sum(), Is.EqualTo(expected));
        }
    }
}
When I tried to execute this against the live DB, it worked fine, but when I tried the in-memory DB I got exceptions as follows:
System.Data.SQLite.SQLiteException : SQL logic error or missing database near "IF": syntax error
near the code line:
db.ExecuteSql("IF OBJECT_ID('DummyColumn') IS NOT NULL DROP PROC DummyColumn");
I commented out the above line and executed the test case, but I am still getting the following exception:
System.Data.SQLite.SQLiteException : SQL logic error or missing database near "IF": syntax error
for the code line:
db.ExecuteSql(sql);
The in-memory DB is created as follows, and it works fine for the remaining cases:
public static void CreateInMemoryDB(ServiceStackHost appHost)
{
    using (var db = appHost.Container.Resolve<IDbConnectionFactory>().OpenDbConnection("ConnectionString"))
    {
        db.DropAndCreateTable<DummyData>();
        TestDataReader<TableList>("Reservation.json", "InMemoryInput").Reservation.ForEach(x => db.Insert(x));
        db.DropAndCreateTable<DummyTable>();
    }
}
Why are we facing this exception? Is there any other way to add and run stored procedures in an in-memory DB with SQLite?
The error is because you're trying to run SQL Server-specific TSQL queries against an in-memory version of SQLite, i.e. a completely different, embeddable database. As the name suggests, SqlServerProviderTests only works on SQL Server; I'm confused why you would try to run this against SQLite.
SQLite doesn't support stored procedures, TSQL, etc., so trying to execute SQL Server TSQL statements will always result in an error. The only thing you can do is fake it with a custom Exec Filter, where you can catch the exception and return whatever custom result you like, e.g.:
public class MockStoredProcExecFilter : OrmLiteExecFilter
{
    public override T Exec<T>(IDbConnection dbConn, Func<IDbCommand, T> filter)
    {
        try
        {
            return base.Exec(dbConn, filter);
        }
        catch (Exception ex)
        {
            if (dbConn.GetLastSql() == "exec sp_name @firstName, @age")
                return (T)(object)new Person { FirstName = "Mocked" };
            throw;
        }
    }
}

OrmLiteConfig.ExecFilter = new MockStoredProcExecFilter();

org.apache.calcite.sql.validate.SqlValidatorException

I'm using Apache Calcite to parse a simple SQL statement and return its relational tree. I obtain a database schema using a JDBC connection to a simple SQLite database. The schema is then added using FrameworkConfig. The parser configuration is then modified to handle identifier quoting and case (not sensitive). However, the SQL validator is unable to find the quoted table identifier in the SQL statement. Somehow the parser ignores the configuration settings and converts the table name to upper case. A SqlValidatorException is raised, stating that the table name is not found. I suspect the configuration is not being updated correctly? I have already validated that the table name is correctly included in the schema's metadata.
public class ParseSQL {

    public static void main(String[] args) {
        try {
            // register the JDBC driver
            String sDriverName = "org.sqlite.JDBC";
            Class.forName(sDriverName);

            JsonObjectBuilder builder = Json.createObjectBuilder();
            builder.add("jdbcDriver", "org.sqlite.JDBC")
                    .add("jdbcUrl", "jdbc:sqlite://calcite/students.db")
                    .add("jdbcUser", "root")
                    .add("jdbcPassword", "root");
            Map<String, JsonValue> JsonObject = builder.build();

            // argument for JdbcSchema.Factory().create(....)
            Map<String, Object> operand = new HashMap<String, Object>();

            // explicitly extract JsonString(s) and load into operand map
            for (String key : JsonObject.keySet()) {
                JsonString value = (JsonString) JsonObject.get(key);
                operand.put(key, value.getString());
            }

            final SchemaPlus rootSchema = Frameworks.createRootSchema(true);
            Schema schema = new JdbcSchema.Factory().create(rootSchema, "students", operand);
            rootSchema.add("students", schema);

            // build a FrameworkConfig using defaults where values aren't required
            Frameworks.ConfigBuilder configBuilder = Frameworks.newConfigBuilder();
            // set defaultSchema
            configBuilder.defaultSchema(rootSchema);
            // build configuration
            FrameworkConfig frameworkdConfig = configBuilder.build();

            // use SQL parser config builder to ignore case of quoted identifier
            SqlParser.configBuilder(frameworkdConfig.getParserConfig()).setQuotedCasing(Casing.UNCHANGED).build();
            // use SQL parser config builder to set SQL case sensitive = false
            SqlParser.configBuilder(frameworkdConfig.getParserConfig()).setCaseSensitive(false).build();

            // get planner
            Planner planner = Frameworks.getPlanner(frameworkdConfig);

            // parse SQL statement
            SqlNode sql_node = planner.parse("SELECT * FROM \"Students\" WHERE age > 15.0");
            System.out.println("\n" + sql_node.toString());

            // validate SQL
            SqlNode sql_validated = planner.validate(sql_node);
            // get associated relational expression
            RelRoot relationalExpression = planner.rel(sql_validated);
            relationalExpression.toString();
        } catch (SqlParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (RelConversionException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ValidationException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    } // end main
} // end class
***** ERROR MESSAGE ******
Jan 20, 2016 8:54:51 PM org.apache.calcite.sql.validate.SqlValidatorException
SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table 'Students' not found
This is a case-sensitivity issue, similar to table not found with apache calcite. Because you enclosed the table name in quotes in your SQL statement, the validator is looking for a table called "Students", and the error message attests to this. If your table is called "Students", I am surprised that Calcite can't find it.
There is a problem with how you are using the SqlParser.ConfigBuilder: when you call build(), you are not using the SqlParser.Config object that it creates. If you passed that object to Frameworks.ConfigBuilder.parserConfig, I think you would get the behavior you want. A rough sketch is below.
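A rough sketch of that change, building on the snippet in the question (not tested against your schema):

// build the parser config once, with the desired casing behaviour
SqlParser.Config parserConfig = SqlParser.configBuilder()
        .setQuotedCasing(Casing.UNCHANGED)
        .setUnquotedCasing(Casing.UNCHANGED)
        .setCaseSensitive(false)
        .build();

// pass it to the FrameworkConfig instead of discarding it
FrameworkConfig frameworkConfig = Frameworks.newConfigBuilder()
        .defaultSchema(rootSchema)
        .parserConfig(parserConfig)
        .build();

Planner planner = Frameworks.getPlanner(frameworkConfig);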

Interpretation of classification in Weka

I would like to use Weka to solve my classification problem.
I have a set of instances for my training data. Let's say that the data looks like this:
@relation Relation1
@attribute att1 {val11, val12}
@attribute att2 {val21, val22}
@attribute class {class1, class2, class3}
@data
val11, val21, class1
val11, val22, class2
val12, val21, class3
In my code I read the training set from the file, train a J48 tree, and try to classify an instance. However, I have no idea how to interpret the results of the classification.
My code is following:
try {
    DataSource source = new DataSource("trainingset.arff");
    Instances data = source.getDataSet();
    if (data.classIndex() == -1) {
        data.setClassIndex(data.numAttributes() - 1);
    }

    Instance xyz = new Instance(data.numAttributes());
    xyz.setDataset(data);
    xyz.setValue(data.attribute(0), "val11");
    xyz.setValue(data.attribute(1), "val21");

    String[] options = new String[1];
    options[0] = "-U";          // unpruned tree
    J48 tree = new J48();       // new instance of tree
    tree.setOptions(options);   // set the options
    tree.buildClassifier(data); // build classifier

    double[] distributionForInstance = tree.distributionForInstance(xyz);
    System.out.println(distributionForInstance[0]);
    System.out.println(distributionForInstance[1]);
    System.out.println(distributionForInstance[2]);
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
As an output I get:
0.3333333333333333
0.3333333333333333
0.3333333333333333
I also tried another way of classifying the instance:
double classifyInstance = tree.classifyInstance(xyz);
System.out.println(classifyInstance);
In this case the output is:
0.0
Could you explain how I should interpret the outputs of the distributionForInstance and classifyInstance methods?
My aim is to be able to create the classifier which would tell me to which class does the given instance belong.
Have a look at the javadoc. The distributionForInstance method returns an array of class membership probabilities (the first element is the probability of the instance belonging to the first class, and so on), and classifyInstance returns the class as an ID, i.e. an index into the array of class labels.
Use the value method of Attribute to get the class label (note that value expects an int index, so the double returned by classifyInstance needs a cast):
double classifyInstance = tree.classifyInstance(xyz);
String classStr = data.classAttribute().value((int) classifyInstance);
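Putting the two together, each entry of the distributionForInstance array lines up with the class attribute's values, so the per-class probabilities can be printed with their labels (a small sketch using the variables from the question):

double[] dist = tree.distributionForInstance(xyz);
for (int k = 0; k < dist.length; k++) {
    // class label at index k and its predicted membership probability
    System.out.println(data.classAttribute().value(k) + ": " + dist[k]);
}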