I'm building up an Instances object, adding Attributes, and then adding data in the form of Instance objects.
When I go to write it out, the toString() method throws an IndexOutOfBoundsException and cannot evaluate the data in the Instances. I receive the error when I try to print out the data, and I can also see the exception in the debugger, which shows that toString() cannot be evaluated for the data object.
The only clue I have is that the error message seems to take the first data element (StudentId) and use it as an index. I'm confused as to why.
The code:
// Set up the attributes for the Weka data model
ArrayList<Attribute> attributes = new ArrayList<>();
attributes.add(new Attribute("StudentIdentifier", true));
attributes.add(new Attribute("CourseGrade", true));
attributes.add(new Attribute("CourseIdentifier"));
attributes.add(new Attribute("Term", true));
attributes.add(new Attribute("YearCourseTaken", true));
// Create the data model object - I'm not happy that capacity is required and fixed, but that's another issue
Instances dataSet = new Instances("Records", attributes, 500);
// Set the attribute that will be used for prediction purposes - that will be CourseIdentifier
dataSet.setClassIndex(2);
// Pull back all the records in this term range, create Weka Instance objects for each and add them to the data set
List<Record> records = recordsInTermRangeFindService.find(0, 10);
for (Record r : records) {
    Instance i = new DenseInstance(attributes.size());
    i.setValue(attributes.get(0), r.studentIdentifier);
    i.setValue(attributes.get(1), r.courseGrade);
    i.setValue(attributes.get(2), r.courseIdentifier);
    i.setValue(attributes.get(3), r.term);
    i.setValue(attributes.get(4), r.yearCourseTaken);
    dataSet.add(i);
}
System.out.println(dataSet.size());
BufferedWriter writer = null;
try {
    writer = new BufferedWriter(new FileWriter("./test.arff"));
    writer.write(dataSet.toString());
    writer.flush();
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}
The error message:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1010, Size: 0
I figured it out finally. I was making the attributes strings by passing 'true' as the second constructor parameter, but the values coming out of the database table were integers. For a string attribute, Instance.setValue(Attribute, double) interprets the number as an index into the attribute's internal list of string values, which is why the first StudentId (1010) was being used as an index into an empty list (Index: 1010, Size: 0). I needed to change my lines to convert the integers to strings:
i.setValue(attributes.get(0), Integer.toString(r.studentIdentifier));
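The other string attributes backed by integer columns needed the same treatment (a sketch; this assumes courseGrade, term, and yearCourseTaken are also integers in the database):
i.setValue(attributes.get(1), Integer.toString(r.courseGrade));
i.setValue(attributes.get(3), Integer.toString(r.term));
i.setValue(attributes.get(4), Integer.toString(r.yearCourseTaken));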
However, that created a different set of issues for me, as things like the Apriori algorithm don't work on strings! I'm continuing to plug along learning Weka.
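One way around the string limitation, offered as a hedged sketch rather than something from the original post: Weka's StringToNominal filter converts string attributes to nominal ones, which algorithms like Apriori can work with (this assumes a Weka 3.7+ API where the filter exposes setAttributeRange):
// Convert every string attribute to nominal so association-rule learners
// like Apriori can operate on the data set; non-string attributes in the
// range are left untouched.
StringToNominal toNominal = new StringToNominal();
toNominal.setAttributeRange("first-last");
toNominal.setInputFormat(dataSet);
Instances nominalData = Filter.useFilter(dataSet, toNominal);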
I am using univocity-parsers version 2.7.3. I have a CSV file that has 1 million records and might grow in the future. I am reading only a few specific columns from the file, and below are my requirements:
DO NOT store the CSV contents in memory at any point
Ignore/skip bean creation if either the latitude or the longitude column in the CSV is null/blank
To meet these requirements, I tried implementing CsvRoutines so that the CSV data is not copied over to memory. I am using the @Validate annotation on both the "Latitude" and "Longitude" fields, and I have an error handler that swallows the exception so that the record is skipped on validation failure.
Sample CSV:
#version:1.0
#timestamp:2017-05-29T23:22:22.320Z
#brand:test report
network_name,location_name,location_category,location_address,location_zipcode,location_phone_number,location_latitude,location_longitude,location_city,location_state_name,location_state_abbreviation,location_country,location_country_code,pricing_type,wep_key
"1 Free WiFi","Test Restaurant","Cafe / Restaurant","Marktplatz 18","1233","+41 263 34 05","1212.15","7.51","Basel","test","BE","India","DE","premium",""
"2 Free WiFi","Test Restaurant","Cafe / Restaurant","Zufikerstrasse 1","1111","+41 631 60 00","11.354","8.12","Bremgarten","test","AG","China","CH","premium",""
"3 Free WiFi","Test Restaurant","Cafe / Restaurant","Chemin de la Fontaine 10","1260","+41 22 361 69","12.34","11.23","Nyon","Vaud","VD","Switzerland","CH","premium",""
"!.oist*~","HoistGroup Office","Office","Chemin de I Etang","CH-1211","","","","test","test","GE","Switzerland","CH","premium",""
"test","tess's Takashiro","Cafe / Restaurant","Test 1-10","870-01","097-55-1808","","","Oita","Oita","OITA","Japan","JP","premium","1234B"
TestDTO.java
@Data
@NoArgsConstructor
@AllArgsConstructor
@JsonIgnoreProperties(ignoreUnknown = true)
public class TestDTO implements Serializable {
    @Parsed(field = "location_name")
    private String name;
    @Parsed(field = "location_address")
    private String addressLine1;
    @Parsed(field = "location_city")
    private String city;
    @Parsed(field = "location_state_abbreviation")
    private String state;
    @Parsed(field = "location_country_code")
    private String country;
    @Parsed(field = "location_zipcode")
    private String postalCode;
    @Parsed(field = "location_latitude")
    @Validate
    private Double latitude;
    @Parsed(field = "location_longitude")
    @Validate
    private Double longitude;
    @Parsed(field = "network_name")
    private String ssid;
}
Main.java
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.detectFormatAutomatically();
parserSettings.setLineSeparatorDetectionEnabled(true);
parserSettings.setHeaderExtractionEnabled(true);
parserSettings.setSkipEmptyLines(true);
parserSettings.selectFields("network_name", "location_name", "location_address", "location_zipcode",
        "location_latitude", "location_longitude", "location_city", "location_state_abbreviation", "location_country_code");
parserSettings.setProcessorErrorHandler(new RowProcessorErrorHandler() {
    @Override
    public void handleError(DataProcessingException error, Object[] inputRow, ParsingContext context) {
        // do nothing
    }
});
CsvRoutines parser = new CsvRoutines(parserSettings);
ResultIterator<TestDTO, ParsingContext> iterator = parser.iterate(TestDTO.class, new FileReader("c:\\users\\...\\test.csv")).iterator();
int i = 0;
while (iterator.hasNext()) {
    TestDTO dto = iterator.next();
    if (dto.getLongitude() == null || dto.getLatitude() == null)
        i++;
}
System.out.println("count==" + i);
Problem:
I actually expected the count to be zero, since I added the error handler and did not rethrow the data validation exception, but it seems that is not the case. I thought @Validate would throw an exception when it encounters a record with either Latitude or Longitude null (both columns may be null in the same record as well), which would then be handled and ignored/skipped by the error handler.
Basically, I do not want univocity to create and map unnecessary DTO objects on the heap (and lead to an out-of-memory error), since there is a chance that the incoming CSV file has more than 200k or 300k records with a null Latitude/Longitude.
I even tried adding a custom validator in @Validate, but in vain.
Could someone please let me know what I am missing here?
Author of the library here. You are doing everything right. This is a bug, and I have just opened an issue here to have it resolved today.
The bug appears when you select fields: the reordering of values makes the validation run against something else (in my test, it validated the city instead of the latitude).
In your case, just add the following line of code and it will work fine:
parserSettings.setColumnReorderingEnabled(false);
This makes the rows come out with nulls in the positions of unselected fields, instead of removing the nulls and reordering the values in the parsed row. It avoids the bug and also makes your program run slightly faster.
You will also need to test for null in the iteration bit:
TestDTO dto = iterator.next();
if (dto != null) { // dto may come back null here due to validation
    if (dto.getLongitude() == null || dto.getLatitude() == null)
        i++;
}
Hope this helps and thank you for using our parsers!
I'm using univocity-parsers' bean iterator to read each line of a file and get the bean. I have observed a weird behavior in the library when attempting to read the same file multiple times.
Code when passing the File object to CsvParser instance:
private static void testBeanIterator() throws Exception {
    try {
        File sampleFile = generateFile(0);
        /*
        System.out.println("Sample file content = " + FileUtils.readFileToString(sampleFile,
                Charset.defaultCharset()));
        */
        for (int i = 0; i < 1000; i++) {
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(new FileInputStream(sampleFile),
                            StandardCharsets.UTF_8));
            AtomicInteger atomicInteger = new AtomicInteger();
            final BeanProcessor<CustomerSegmentMapping> rowProcessor =
                    new BeanProcessor<CustomerSegmentMapping>(CustomerSegmentMapping.class) {
                        @Override
                        public void beanProcessed(@Nonnull final CustomerSegmentMapping customerSegmentMapping,
                                                  @Nonnull final ParsingContext context) {
                            try {
                                System.out.println(OBJECT_MAPPER.writeValueAsString(customerSegmentMapping));
                                atomicInteger.getAndAdd(1);
                            } catch (Exception ex) {
                                throw new RuntimeException("error");
                            }
                        }
                    };
            rowProcessor.setStrictHeaderValidationEnabled(true);
            final CsvParserSettings parserSettings = new CsvParserSettings();
            parserSettings.setRowProcessor(rowProcessor);
            parserSettings.setHeaderExtractionEnabled(true);
            final CsvParser parser = new CsvParser(parserSettings);
            //parser.parse(reader);
            parser.parse(sampleFile);
            System.out.println("Finished parser");
            if (atomicInteger.get() != 10) {
                throw new Exception("mismatch");
            }
            reader.close();
        }
    } catch (Exception ex) {
        throw new RuntimeException("exception = " + ex.getMessage(), ex);
    } finally {
    }
}
On executing the code, following is the console output:
{"customerId":"6bc12a7a-2c28-4aea-a7be-6be45e16ffb2","segmentId":"S1"}
{"customerId":"da736310-e508-47ff-92b8-59d490e37a72","segmentId":"S1"}
{"customerId":"9a5d4454-e6d4-49a5-bb04-8354154d0493","segmentId":"S1"}
{"customerId":"ec2ed5cc-cd18-443b-bd69-e56fc09ba0f5","segmentId":"S1"}
{"customerId":"94ea24b0-0c83-4039-a391-1d2439c88be8","segmentId":"S1"}
{"customerId":"2baef5f9-d8cd-451d-b579-a626cb58b284","segmentId":"S1"}
{"customerId":"022a184b-1b06-49aa-b1c4-b94a6f343b04","segmentId":"S1"}
{"customerId":"bcb3984c-0495-4da8-b146-9af3983cc158","segmentId":"S1"}
{"customerId":"feef62de-1aaf-43d4-a83b-afe053db97cf","segmentId":"S1"}
{"customerId":"5825c924-55d5-4fd6-8468-ca36d47a7cae","segmentId":"S1"}
Finished parser
{"customerId":"6bc12a7a-2c28-4aea-a7be-6be45e16ffb2","segmentId":"S1"}
{"customerId":"da736310-e508-47ff-92b8-59d490e37a72","segmentId":"S1"}
{"customerId":"9a5d4454-e6d4-49a5-bb04-8354154d0493","segmentId":"S1"}
{"customerId":"ec2ed5cc-cd18-443b-bd69-e56fc09ba0f5","segmentId":"S1"}
{"customerId":"94ea24b0-0c83-4039-a391-1d2439c88be8","segmentId":"S1"}
{"customerId":"2baef5f9-d8cd-451d-b579-a626cb58b284","segmentId":"S1"}
{"customerId":"022a184b-1b06-49aa-b1c4-b94a6f343b04","segmentId":"S1"}
{"customerId":"bcb3984c-0495-4da8-b146-9af3983cc158","segmentId":"S1"}
{"customerId":"feef62de-1aaf-43d4-a83b-afe053db97cf","segmentId":"S1"}
{"customerId":"5825c924-55d5-4fd6-8468-ca36d47a7cae","segmentId":"S1"}
Finished parser
{"customerId":"6bc12a7a-2c28-4aea-a7be-6be45e16ffb2","segmentId":"S1"}
{"customerId":"da736310-e508-47ff-92b8-59d490e37a72","segmentId":"S1"}
{"customerId":"9a5d4454-e6d4-49a5-bb04-8354154d0493","segmentId":"S1"}
{"customerId":"ec2ed5cc-cd18-443b-bd69-e56fc09ba0f5","segmentId":"S1"}
{"customerId":"94ea24b0-0c83-4039-a391-1d2439c88be8","segmentId":"S1"}
{"customerId":"2baef5f9-d8cd-451d-b579-a626cb58b284","segmentId":"S1"}
{"customerId":"022a184b-1b06-49aa-b1c4-b94a6f343b04","segmentId":"S1"}
{"customerId":"bcb3984c-0495-4da8-b146-9af3983cc158","segmentId":"S1"}
{"customerId":"feef62de-1aaf-43d4-a83b-afe053db97cf","segmentId":"S1"}
{"customerId":"5825c924-55d5-4fd6-8468-ca36d47a7cae","segmentId":"S1"}
Finished parser
Exception in thread "main" java.lang.RuntimeException: exception = Could not find fields [CustomerId]' in input. Names found: [ustomerId, SegmentId]
Internal state when error was thrown: line=2, column=0, record=1, charIndex=60, headers=[ustomerId, SegmentId]
at com.poppins.cube.common.UnivocityNahiHatanaHai.testBeanIterator(UnivocityNahiHatanaHai.java:95)
at com.poppins.cube.common.UnivocityNahiHatanaHai.main(UnivocityNahiHatanaHai.java:37)
Caused by: com.univocity.parsers.common.DataProcessingException: Could not find fields [CustomerId]' in input. Names found: [ustomerId, SegmentId]
Internal state when error was thrown: line=2, column=0, record=1, charIndex=60, headers=[ustomerId, SegmentId]
at com.univocity.parsers.common.processor.core.BeanConversionProcessor.mapFieldIndexes(BeanConversionProcessor.java:414)
at com.univocity.parsers.common.processor.core.BeanConversionProcessor.mapValuesToFields(BeanConversionProcessor.java:340)
at com.univocity.parsers.common.processor.core.BeanConversionProcessor.createBean(BeanConversionProcessor.java:508)
at com.univocity.parsers.common.processor.core.AbstractBeanProcessor.rowProcessed(AbstractBeanProcessor.java:54)
at com.univocity.parsers.common.Internal.process(Internal.java:21)
at com.univocity.parsers.common.AbstractParser.rowProcessed(AbstractParser.java:596)
at com.univocity.parsers.common.AbstractParser.parse(AbstractParser.java:133)
at com.univocity.parsers.common.AbstractParser.parse(AbstractParser.java:605)
at com.poppins.cube.common.UnivocityNahiHatanaHai.testBeanIterator(UnivocityNahiHatanaHai.java:83)
... 1 more
Process finished with exit code 1
Following is the content of the file:
CustomerId,SegmentId
6bc12a7a-2c28-4aea-a7be-6be45e16ffb2,S1
da736310-e508-47ff-92b8-59d490e37a72,S1
9a5d4454-e6d4-49a5-bb04-8354154d0493,S1
ec2ed5cc-cd18-443b-bd69-e56fc09ba0f5,S1
94ea24b0-0c83-4039-a391-1d2439c88be8,S1
2baef5f9-d8cd-451d-b579-a626cb58b284,S1
022a184b-1b06-49aa-b1c4-b94a6f343b04,S1
bcb3984c-0495-4da8-b146-9af3983cc158,S1
feef62de-1aaf-43d4-a83b-afe053db97cf,S1
5825c924-55d5-4fd6-8468-ca36d47a7cae,S1
From what I can understand, the issue arises because I'm passing a File object to CsvParser. CsvParser internally creates an InputStream object which is not closed.
If I pass a BufferedReader object instead of a File object, the issue does not arise.
I'm not able to tell whether this is a known issue with univocity-parsers or whether there is something I'm missing in my understanding.
Author of the library here. I can see from your exception that it got the header ustomerId instead of CustomerId.
This looks like a bug introduced in version 2.5.0 that was fixed in version 2.5.6, if I'm not mistaken. It plagued me for a while, as it was an internal concurrency issue that was hard to track down. Basically, when you pass a File without an explicit encoding, the parser tries to find a UTF BOM marker in the input (effectively consuming the first character) to determine the encoding automatically. This happened only for InputStreams and Files.
Anyway, this has been fixed, so simply updating to the latest version should get rid of the problem for you (please let me know if you are not using version 2.5.something).
If you want to remain on the version you currently have, the error will be gone if you call
parser.parse(sampleFile, Charset.defaultCharset());
This will prevent the parser from trying to discover whether there's a BOM marker in your file, therefore avoiding that pesky bug.
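Alternatively (a sketch, not part of the original answer), you can hand the parser a Reader you construct yourself with an explicit charset, much like the commented-out line in the question; the parser then never has to guess the encoding:
// Supplying the Reader directly bypasses the parser's own BOM/encoding detection.
Reader reader = new InputStreamReader(new FileInputStream(sampleFile), StandardCharsets.UTF_8);
parser.parse(reader);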
Hope this helps
I'm trying to create an initial DB state in DB Unit like this...
public function getDataSet() {
    $primary = new \PHPUnit\DbUnit\DataSet\CompositeDataSet();
    $fixturePaths = [
        "test/Seeds/Upc/DB/UpcSelect.xml",
        "test/Seeds/Generic/DB/ProductUpcSelect.xml"
    ];
    foreach ($fixturePaths as $fixturePath) {
        $dataSet = $this->createXmlDataSet($fixturePath);
        $primary->addDataSet($dataSet);
    }
    return $primary;
}
Then after my query I'm attempting to call this user-defined function...
protected function compareDatabase(string $seedPath, string $table) {
    $expected = $this->createFlatXmlDataSet($seedPath)->getTable($table);
    $result = $this->getConnection()->createQueryTable($table, "SELECT * FROM $table");
    $this->assertTablesEqual($expected, $result);
}
The idea here is that I have an initial DB state, I run my query, and then I compare the actual table state with an XML data set representing what I expect the table to look like. This process is described in PHPUnit's documentation for DbUnit, but I keep having an exception thrown...
PHPUnit\DbUnit\InvalidArgumentException: There is already a table named upc with different table definition
Test example...
public function testDeleteByUpc() {
    $mapper = new UpcMapper($this->getPdo());
    $mapper->deleteByUpc("someUpcCode1");
    $this->compareDatabase("test/Seeds/Upc/DB/UpcAfterDelete.xml", 'upc');
}
I seem to be following the docs...how is this supposed to be done?
This was actually unrelated to creating a second XML data set. The exception was thrown because the two fixtures loaded in my getDataSet() method both contained a table definition for upc, with differing definitions.
I am looking at the documentation and the provided examples to find out how I can report invalid data while processing data with Google's Dataflow service.
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.named("ReadMyFile").from(options.getInput()))
        .apply(new SomeTransformation())
        .apply(TextIO.Write.named("WriteMyFile").to(options.getOutput()));
p.run();
In addition to the actual input/output, I want to produce a second output file that contains records that are considered invalid (e.g. missing data, malformed data, values too high). I want to troubleshoot those records and process them separately.
Input: gs://.../input.csv
Output: gs://.../output.csv
List of invalid records: gs://.../invalid.csv
How can I redirect those invalid records into a separate output?
You can use PCollectionTuples to return multiple PCollections from a single transform. For example,
TupleTag<String> mainOutput = new TupleTag<>("main");
TupleTag<String> missingData = new TupleTag<>("missing");
TupleTag<String> badValues = new TupleTag<>("bad");
Pipeline p = Pipeline.create(options);
PCollectionTuple all = p
        .apply(TextIO.Read.named("ReadMyFile").from(options.getInput()))
        .apply(new SomeTransformation());
all.get(mainOutput)
        .apply(TextIO.Write.named("WriteMyFile").to(options.getOutput()));
all.get(missingData)
        .apply(TextIO.Write.named("WriteMissingData").to(...));
...
PCollectionTuples can either be built up directly out of existing PCollections, or emitted from ParDo operations with side outputs, e.g.
PCollectionTuple partitioned = input.apply(ParDo
        .of(new DoFn<String, String>() {
            public void processElement(ProcessContext c) {
                if (checkOK(c.element())) {
                    // Shows up in partitioned.get(mainOutput).
                    c.output(...);
                } else if (hasMissingData(c.element())) {
                    // Shows up in partitioned.get(missingData).
                    c.sideOutput(missingData, c.element());
                } else {
                    // Shows up in partitioned.get(badValues).
                    c.sideOutput(badValues, c.element());
                }
            }
        })
        .withOutputTags(mainOutput, TupleTagList.of(missingData).and(badValues)));
Note that in general the various side outputs need not have the same type, and data can be emitted any number of times to any number of side outputs (rather than the strict partitioning we have here).
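For example (an illustrative sketch, not from the original answer), a side output could carry a differently-typed element than the main output, say a (reason, record) pair for each bad row:
// Hypothetical tag: errors annotated with a reason string.
TupleTag<KV<String, String>> annotatedErrors = new TupleTag<>("annotatedErrors");
// Inside the DoFn:
// c.sideOutput(annotatedErrors, KV.of("missing-field", c.element()));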
Your SomeTransformation class could then look something like:
class SomeTransformation extends PTransform<PCollection<String>,
                                            PCollectionTuple> {
    public PCollectionTuple apply(PCollection<String> input) {
        // Filter into good and bad data.
        PCollectionTuple partitioned = ...
        // Process the good data.
        PCollection<String> processed =
                partitioned.get(mainOutput)
                        .apply(...)
                        .apply(...)
                        ...;
        // Repackage everything into a new output tuple.
        return PCollectionTuple.of(mainOutput, processed)
                .and(missingData, partitioned.get(missingData))
                .and(badValues, partitioned.get(badValues));
    }
}
Robert's suggestion of using sideOutputs is great, but note that this will only work if the bad data is identified by your ParDos. There currently isn't a way to identify bad records hit during initial decoding (where the error is hit in Coder.decode). We've got plans to work on that soon.
I am using InstanceQuery (SQL queries) to construct my Instances, but my query results do not always come back in the same order, as is normal for SQL.
Because of this, Instances constructed from different SQL queries have different headers. A simple example can be seen below. I suspect my results change because of this behavior.
Header 1
@attribute duration numeric
@attribute protocol_type {tcp,udp}
@attribute service {http,domain_u}
@attribute flag {SF}
Header 2
@attribute duration numeric
@attribute protocol_type {tcp}
@attribute service {pm_dump,pop_2,pop_3}
@attribute flag {SF,S0,SH}
My question is: how can I give the correct header information to the Instances construction?
Is something like the workflow below possible?
get pre-prepared header information from an ARFF file or another place
give this header information to the Instances construction
call the SQL function and get the Instances (header + data)
I am using the following SQL function to get instances from the database.
public static Instances getInstanceDataFromDatabase(String pSql,
        String pInstanceRelationName) {
    try {
        InstanceQuery query = new InstanceQuery();
        query.setUsername(username);
        query.setPassword(password);
        query.setQuery(pSql);
        Instances data = query.retrieveInstances();
        data.setRelationName(pInstanceRelationName);
        if (data.classIndex() == -1) {
            data.setClassIndex(data.numAttributes() - 1);
        }
        return data;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
I tried various approaches to my problem, but it seems that the Weka internal API does not currently allow a solution to it. I modified the weka.core.Instances append command-line code for my purposes. This code is also given in this answer.
Accordingly, here is my solution. I created a SampleWithKnownHeader.arff file, which contains the correct header values. I read this file with the following code.
public static Instances getSampleInstances() {
    Instances data = null;
    try {
        BufferedReader reader = new BufferedReader(new FileReader(
                "datas\\SampleWithKnownHeader.arff"));
        data = new Instances(reader);
        reader.close();
        // setting class attribute
        data.setClassIndex(data.numAttributes() - 1);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    return data;
}
After that, I use the following code to create the instances. I had to use a StringBuilder with the string values of each instance, and then save the resulting string to a file.
public static void main(String[] args) {
    Instances sampleInstances = MyUtilsForWeka.getSampleInstances();
    DataSource source1 = new DataSource(sampleInstances);
    Instances data2 = InstancesFromDatabase
            .getInstanceDataFromDatabase(DatabaseQueries.WEKALIST_QUESTION1);
    MyUtilsForWeka.saveInstancesToFile(data2, "fromDatabase.arff");
    DataSource source2 = new DataSource(data2);
    Instances structure1;
    Instances structure2;
    StringBuilder sb = new StringBuilder();
    try {
        structure1 = source1.getStructure();
        sb.append(structure1);
        structure2 = source2.getStructure();
        while (source2.hasMoreElements(structure2)) {
            String elementAsString = source2.nextElement(structure2)
                    .toString();
            sb.append(elementAsString);
            sb.append("\n");
        }
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
    MyUtilsForWeka.saveInstancesToFile(sb.toString(), "combined.arff");
}
My code for saving instances to a file is below.
public static void saveInstancesToFile(String contents, String filename) {
    FileWriter fstream;
    try {
        fstream = new FileWriter(filename);
        BufferedWriter out = new BufferedWriter(fstream);
        out.write(contents);
        out.close();
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}
This solves my problem, but I wonder if a more elegant solution exists.
I solved a similar problem with the Add filter, which allows adding attributes to Instances. You need to add a correct Attribute with the proper list of values to both datasets (in my case, to the test dataset only):
Load train and test data:
/* "train" contains labels and data */
/* "test" contains data only */
CSVLoader csvLoader = new CSVLoader();
csvLoader.setFile(new File(trainFile));
Instances training = csvLoader.getDataSet();
csvLoader.reset();
csvLoader.setFile(new File(predictFile));
Instances test = csvLoader.getDataSet();
Set a new attribute with Add filter:
Add add = new Add();
/* the name of the attribute must be the same as in "train"*/
add.setAttributeName(training.attribute(0).name());
/* getValues returns a String with comma-separated values of the attribute */
add.setNominalLabels(getValues(training.attribute(0)));
/* put the new attribute to the 1st position, the same as in "train"*/
add.setAttributeIndex("1");
add.setInputFormat(test);
/* result - a compatible with "train" dataset */
test = Filter.useFilter(test, add);
As a result, the headers of both "train" and "test" are the same, and the datasets are compatible for Weka machine learning.
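The getValues helper referenced in the comments above is not part of the original answer; a minimal sketch could join an attribute's nominal values into the comma-separated string that setNominalLabels expects:
// Hypothetical helper: builds "v1,v2,..." from an attribute's nominal values.
private static String getValues(Attribute attribute) {
    StringBuilder values = new StringBuilder();
    for (int i = 0; i < attribute.numValues(); i++) {
        if (i > 0) {
            values.append(',');
        }
        values.append(attribute.value(i));
    }
    return values.toString();
}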