I am using univocity-parsers version 2.7.3. I have a CSV file that has 1 million records and might grow in the future. I am reading only a few specific columns from the file, and below are my requirements:
DO NOT store the CSV contents into memory at any point
Ignore/skip bean creation if either the latitude or longitude column in the CSV is null/blank
To meet these requirements, I tried implementing CsvRoutines so that the CSV data is not copied over to memory. I am using the @Validate annotation on both the "Latitude" and "Longitude" fields, and I have used an error handler that does not rethrow the exception, so that the record is skipped on validation failure.
Sample CSV:
#version:1.0
#timestamp:2017-05-29T23:22:22.320Z
#brand:test report
network_name,location_name,location_category,location_address,location_zipcode,location_phone_number,location_latitude,location_longitude,location_city,location_state_name,location_state_abbreviation,location_country,location_country_code,pricing_type,wep_key
"1 Free WiFi","Test Restaurant","Cafe / Restaurant","Marktplatz 18","1233","+41 263 34 05","1212.15","7.51","Basel","test","BE","India","DE","premium",""
"2 Free WiFi","Test Restaurant","Cafe / Restaurant","Zufikerstrasse 1","1111","+41 631 60 00","11.354","8.12","Bremgarten","test","AG","China","CH","premium",""
"3 Free WiFi","Test Restaurant","Cafe / Restaurant","Chemin de la Fontaine 10","1260","+41 22 361 69","12.34","11.23","Nyon","Vaud","VD","Switzerland","CH","premium",""
"!.oist*~","HoistGroup Office","Office","Chemin de I Etang","CH-1211","","","","test","test","GE","Switzerland","CH","premium",""
"test","tess's Takashiro","Cafe / Restaurant","Test 1-10","870-01","097-55-1808","","","Oita","Oita","OITA","Japan","JP","premium","1234B"
TestDTO.java
import java.io.Serializable;

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.univocity.parsers.annotations.Parsed;
import com.univocity.parsers.annotations.Validate;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@NoArgsConstructor
@AllArgsConstructor
@JsonIgnoreProperties(ignoreUnknown = true)
public class TestDTO implements Serializable {

    @Parsed(field = "location_name")
    private String name;
    @Parsed(field = "location_address")
    private String addressLine1;
    @Parsed(field = "location_city")
    private String city;
    @Parsed(field = "location_state_abbreviation")
    private String state;
    @Parsed(field = "location_country_code")
    private String country;
    @Parsed(field = "location_zipcode")
    private String postalCode;
    @Parsed(field = "location_latitude")
    @Validate
    private Double latitude;
    @Parsed(field = "location_longitude")
    @Validate
    private Double longitude;
    @Parsed(field = "network_name")
    private String ssid;
}
Main.java
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.detectFormatAutomatically();
parserSettings.setLineSeparatorDetectionEnabled(true);
parserSettings.setHeaderExtractionEnabled(true);
parserSettings.setSkipEmptyLines(true);
parserSettings.selectFields("network_name", "location_name", "location_address", "location_zipcode",
        "location_latitude", "location_longitude", "location_city", "location_state_abbreviation", "location_country_code");

parserSettings.setProcessorErrorHandler(new RowProcessorErrorHandler() {
    @Override
    public void handleError(DataProcessingException error, Object[] inputRow, ParsingContext context) {
        // do nothing
    }
});

CsvRoutines parser = new CsvRoutines(parserSettings);
ResultIterator<TestDTO, ParsingContext> iterator = parser.iterate(TestDTO.class, new FileReader("c:\\users\\...\\test.csv")).iterator();
int i = 0;
while (iterator.hasNext()) {
    TestDTO dto = iterator.next();
    if (dto.getLongitude() == null || dto.getLatitude() == null)
        i++;
}
System.out.println("count==" + i);
Problem:
I expected the count to be zero, since I added the error handler and did not rethrow the data validation exception, but that does not seem to be the case. I thought @Validate would throw an exception whenever it encounters a record with either latitude or longitude as null (both columns may be null in the same record as well), which would then be handled and ignored/skipped by the error handler.
Basically, I do not want univocity to create and map unnecessary DTO objects on the heap (and risk running out of memory), since the incoming CSV file might have more than 200k or 300k records with latitude/longitude as null.
I even tried adding a custom validator in @Validate, but in vain.
Could someone please let me know what I am missing here?
Author of the library here. You are doing everything right. This is a bug and I just opened this issue here to be resolved today.
The bug appears when you select fields: the reordering of values makes the validation run against something else (in my test, it validated the city instead of latitude).
In your case, just add the following line of code and it will work fine:
parserSettings.setColumnReorderingEnabled(false);
This will make the rows be generated with nulls where fields were not selected, instead of removing the nulls and reordering the values in the parsed row. It will avoid the bug and also make your program run slightly faster.
You will also need to test for null in the iteration bit:
while (iterator.hasNext()) {
    TestDTO dto = iterator.next();
    if (dto != null) { // dto may come back null here due to a validation failure
        if (dto.getLongitude() == null || dto.getLatitude() == null)
            i++;
    }
}
Hope this helps and thank you for using our parsers!
Related
The following is my constructor for a Student object. I will be using a list of Students. I need to store the list so that even if the program is turned off, I can still access all of its contents. The only way I could think of was to use a reader/writer and a text file.
1) Is there a more efficient way to store this information?
2) If not, how can I use reader/writer to store each field?
public Student(String firstName, String lastName, String gender, String state,
               String school, String lit, String wakeUp, String sleep,
               String social, String contactInfo, String country, String major) {
    this.firstName = firstName;
    this.lastName = lastName;
    this.gender = gender;
    this.state = state;
    this.school = school;
    this.lit = lit;
    this.wakeUp = wakeUp;
    this.sleep = sleep;
    this.social = social;
    this.contactInfo = contactInfo;
    this.country = country;
    this.major = major;
}
The possibilities are really project-specific and subjective.
Some possibilities include:
A CSV file, which makes it easy to export to other programs and to parse the data
An online server, which allows access from any computer that has the program and an internet connection
A text file, which works for local devices that won't require many additions
It really just depends on how you want to implement it and what method suits your needs best.
To use reader/writer to store your fields, you could use the accessor methods of each variable to store them line by line in your text file. Below is some sample code to get you started on writing to the file:
PrintWriter outputStream = null;
try {
    outputStream = new PrintWriter(new FileOutputStream(FILE_LOCATION));
}
catch (FileNotFoundException ex) {
    JOptionPane optionPane = new JOptionPane("Unable to write to file\n " + FILE_LOCATION, JOptionPane.ERROR_MESSAGE);
    JDialog dialog = optionPane.createDialog("Error!");
    dialog.setAlwaysOnTop(true);
    dialog.setVisible(true);
    System.exit(0);
}

Iterator<YOUR_OBJECT> i = this.List.iterator();
YOUR_OBJECT temp = null;
while (i.hasNext()) {
    temp = i.next();
    if (temp instanceof YOUR_OBJECT) {
        outputStream.println(temp.getAttribute());
    }
}
outputStream.close();
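To get the list back after the program restarts, reading is just the reverse of the writing above. A minimal sketch, assuming each of the twelve fields was written on its own line in constructor order (loadStudents is an illustrative name, not from the original answer):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical loader: rebuilds the student list from the text file,
// assuming each student was written as twelve lines in constructor order.
public static List<Student> loadStudents(String fileLocation) throws FileNotFoundException {
    List<Student> students = new ArrayList<>();
    try (Scanner in = new Scanner(new File(fileLocation))) {
        while (in.hasNextLine()) {
            // Arguments are evaluated left to right, so the fields are read back
            // in the same order they were written.
            students.add(new Student(
                    in.nextLine(), in.nextLine(), in.nextLine(), in.nextLine(),
                    in.nextLine(), in.nextLine(), in.nextLine(), in.nextLine(),
                    in.nextLine(), in.nextLine(), in.nextLine(), in.nextLine()));
        }
    }
    return students;
}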
I'm building up an Instances object, adding Attributes, and then adding data in the form of Instance objects.
When I go to write it out, the toString() method throws an IndexOutOfBoundsException and is unable to evaluate the data in the Instances. I get the error when I try to print out the data, and I can also see the exception being thrown in the debugger, where it shows it can't evaluate toString() for the data object.
The only clue I have is that the error message seems to take the first data element (StudentId) and use it as an index. I'm confused as to why.
The code:
// Set up the attributes for the Weka data model
ArrayList<Attribute> attributes = new ArrayList<>();
attributes.add(new Attribute("StudentIdentifier", true));
attributes.add(new Attribute("CourseGrade", true));
attributes.add(new Attribute("CourseIdentifier"));
attributes.add(new Attribute("Term", true));
attributes.add(new Attribute("YearCourseTaken", true));

// Create the data model object - I'm not happy that capacity is required and fixed? But that's another issue
Instances dataSet = new Instances("Records", attributes, 500);

// Set the attribute that will be used for prediction purposes - that will be CourseIdentifier
dataSet.setClassIndex(2);

// Pull back all the records in this term range, create Weka Instance objects for each and add to the data set
List<Record> records = recordsInTermRangeFindService.find(0, 10);
int count = 0;
for (Record r : records) {
    Instance i = new DenseInstance(attributes.size());
    i.setValue(attributes.get(0), r.studentIdentifier);
    i.setValue(attributes.get(1), r.courseGrade);
    i.setValue(attributes.get(2), r.courseIdentifier);
    i.setValue(attributes.get(3), r.term);
    i.setValue(attributes.get(4), r.yearCourseTaken);
    dataSet.add(i);
}
System.out.println(dataSet.size());

BufferedWriter writer = null;
try {
    writer = new BufferedWriter(new FileWriter("./test.arff"));
    writer.write(dataSet.toString());
    writer.flush();
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}
The error message:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1010, Size: 0
I finally figured it out. I was declaring the Attributes as strings with the 'true' second parameter in the constructors, but the values were integers coming out of the database table. I needed to change my lines to convert the integers to strings:
i.setValue(attributes.get(0), Integer.toString(r.studentIdentifier));
However, that created a different set of issues for me as things like the Apriori algorithm don't work on strings! I'm continuing to plug along learning Weka.
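A hedged side note, not from the original answer: if string attributes are the blocker, Weka's StringToNominal filter can convert them into nominal attributes, which algorithms such as Apriori can consume. A minimal sketch, assuming Weka 3.7+ (the same version range that provides DenseInstance):

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToNominal;

// Sketch: convert every string attribute in the range to a nominal attribute
// so that nominal-only algorithms (e.g. Apriori) can be applied afterwards.
static Instances stringsToNominal(Instances data) throws Exception {
    StringToNominal filter = new StringToNominal();
    filter.setAttributeRange("first-last"); // assumed: apply to all attributes; non-string ones are left untouched
    filter.setInputFormat(data);
    return Filter.useFilter(data, filter);
}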
I am using InstanceQuery and SQL queries to construct my Instances. But my query results do not always come back in the same order, which is normal for SQL.
Because of this, Instances constructed from different SQL queries have different headers. A simple example can be seen below. I suspect my results change because of this behavior.
Header 1
@attribute duration numeric
@attribute protocol_type {tcp,udp}
@attribute service {http,domain_u}
@attribute flag {SF}
Header 2
@attribute duration numeric
@attribute protocol_type {tcp}
@attribute service {pm_dump,pop_2,pop_3}
@attribute flag {SF,S0,SH}
My question is: how can I give the correct header information to the Instances construction?
Is something like the workflow below possible?
Get pre-prepared header information from an ARFF file or another place.
Give this header information to the Instances construction.
Call the SQL function and get Instances (header + data).
I am using the following SQL function to get Instances from the database.
public static Instances getInstanceDataFromDatabase(String pSql, String pInstanceRelationName) {
    try {
        DatabaseUtils utils = new DatabaseUtils();
        InstanceQuery query = new InstanceQuery();
        query.setUsername(username);
        query.setPassword(password);
        query.setQuery(pSql);
        Instances data = query.retrieveInstances();
        data.setRelationName(pInstanceRelationName);
        if (data.classIndex() == -1) {
            data.setClassIndex(data.numAttributes() - 1);
        }
        return data;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
I tried various approaches to my problem, but it seems that the Weka internal API does not allow a solution to this problem right now. I modified the weka.core.Instances append command-line code for my purposes. This code is also given in this answer.
Accordingly, here is my solution. I created a SampleWithKnownHeader.arff file, which contains the correct header values. I read this file with the following code.
public static Instances getSampleInstances() {
    Instances data = null;
    try {
        BufferedReader reader = new BufferedReader(new FileReader(
                "datas\\SampleWithKnownHeader.arff"));
        data = new Instances(reader);
        reader.close();
        // setting class attribute
        data.setClassIndex(data.numAttributes() - 1);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    return data;
}
After that, I use the following code to create the Instances. I had to use a StringBuilder with the string values of each instance, and then save the resulting string to a file.
public static void main(String[] args) {
    Instances sampleInstances = MyUtilsForWeka.getSampleInstances();
    DataSource source1 = new DataSource(sampleInstances);
    Instances data2 = InstancesFromDatabase
            .getInstanceDataFromDatabase(DatabaseQueries.WEKALIST_QUESTION1);
    MyUtilsForWeka.saveInstancesToFile(data2.toString(), "fromDatabase.arff");
    DataSource source2 = new DataSource(data2);
    Instances structure1;
    Instances structure2;
    StringBuilder sb = new StringBuilder();
    try {
        structure1 = source1.getStructure();
        sb.append(structure1);
        structure2 = source2.getStructure();
        while (source2.hasMoreElements(structure2)) {
            String elementAsString = source2.nextElement(structure2).toString();
            sb.append(elementAsString);
            sb.append("\n");
        }
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
    MyUtilsForWeka.saveInstancesToFile(sb.toString(), "combined.arff");
}
My code for saving instances to a file is below.
public static void saveInstancesToFile(String contents, String filename) {
    FileWriter fstream;
    try {
        fstream = new FileWriter(filename);
        BufferedWriter out = new BufferedWriter(fstream);
        out.write(contents);
        out.close();
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}
This solves my problem, but I wonder if a more elegant solution exists.
I solved a similar problem with the Add filter, which allows adding attributes to Instances. You need to add a correct Attribute with the proper list of values to both datasets (in my case, to the test dataset only):
Load train and test data:
/* "train" contains labels and data */
/* "test" contains data only */
CSVLoader csvLoader = new CSVLoader();
csvLoader.setFile(new File(trainFile));
Instances training = csvLoader.getDataSet();
csvLoader.reset();
csvLoader.setFile(new File(predictFile));
Instances test = csvLoader.getDataSet();
Set a new attribute with Add filter:
Add add = new Add();
/* the name of the attribute must be the same as in "train"*/
add.setAttributeName(training.attribute(0).name());
/* getValues returns a String with comma-separated values of the attribute */
add.setNominalLabels(getValues(training.attribute(0)));
/* put the new attribute to the 1st position, the same as in "train"*/
add.setAttributeIndex("1");
add.setInputFormat(test);
/* result - a compatible with "train" dataset */
test = Filter.useFilter(test, add);
As a result, the headers of both "train" and "test" are the same (compatible for Weka machine learning)
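The getValues helper used above is not shown in the answer; a minimal sketch of what it might look like, assuming it simply joins the attribute's nominal values with commas as the comment describes:

import weka.core.Attribute;

// Hypothetical helper: builds the comma-separated value list that
// Add.setNominalLabels(...) expects, from an existing nominal attribute.
private static String getValues(Attribute attribute) {
    StringBuilder values = new StringBuilder();
    for (int i = 0; i < attribute.numValues(); i++) {
        if (i > 0) {
            values.append(",");
        }
        values.append(attribute.value(i));
    }
    return values.toString();
}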
I am trying to write an update statement to save data from an ASP.NET GridView to a SQL Server 2005 database, but it is showing me an error. Please tell me how to solve it.
cmdUpdate.CommandText = String.Format("Update Products SET ProductName={0},UnitsInStock={1},UnitsOnOrder={2},ReorderLevel={3} WHERE ProductID={4} AND SupplierID={5}",
    "productname.Text, unitsinstock.Text, unitsonorder.Text, recorderlevel.Text, employeeid.Text, supplierid.Text");
The error is:
Index (zero based) must be greater than or equal to zero and less than the size of the argument list.
Your syntax for String.Format is incorrect - each parameter after the string template should be passed on its own, without the double quotes surrounding them all...
This will work (notice I've removed the double quotes from just before 'productname.Text' and after 'supplierid.Text'):
String.Format("Update Products SET ProductName={0}, UnitsInStock={1}, UnitsOnOrder={2}, ReorderLevel={3} WHERE ProductID={4} AND SupplierID={5}",
productname.Text, unitsinstock.Text, unitsonorder.Text,
recorderlevel.Text, employeeid.Text, supplierid.Text);
You missed the arguments. For instance:
str=String.Format("{0} {1}",arg1,arg2);
Do not use hard-coded SQL strings. Try to learn/use parameterized queries.
EDIT:
string ConnectionString = "put_connection_string";
using (SqlConnection con = new SqlConnection(ConnectionString))
{
    using (SqlCommand cmd = new SqlCommand())
    {
        string sql = @"Update Products SET
                        ProductName=@ProductName,
                        UnitsInStock=@UnitsInStock,
                        UnitsOnOrder=@UnitsOnOrder,
                        ReorderLevel=@ReorderLevel
                        WHERE ProductID=@ProductID AND SupplierID=@SupplierID";
        cmd.CommandText = sql;
        cmd.Connection = con;
        cmd.Parameters.Add("@ProductName", System.Data.SqlDbType.VarChar, 50).Value = productname.Text;
        cmd.Parameters.Add("@UnitsInStock", System.Data.SqlDbType.Int).Value = unitsinstock.Text;
        cmd.Parameters.Add("@UnitsOnOrder", System.Data.SqlDbType.Int).Value = unitsonorder.Text;
        cmd.Parameters.Add("@ReorderLevel", System.Data.SqlDbType.Int).Value = recorderlevel.Text;
        cmd.Parameters.Add("@ProductID", System.Data.SqlDbType.Int).Value = producteid.Text;
        cmd.Parameters.Add("@SupplierID", System.Data.SqlDbType.Int).Value = supplierid.Text;
        con.Open();
        cmd.ExecuteNonQuery();
        con.Close();
    }
}
EDIT: What is C# using block?
If the type implements IDisposable, it automatically disposes it
Provides a convenient syntax that ensures the correct use of IDisposable objects.
Avoiding Problems with the Using Statement
Currently, I'm using Conversion Studio to bring in a CSV file and store the contents in an AX table. This part is working. I have a block defined and the fields are correctly mapped.
The CSV file contains several comment columns, such as Comments-1, Comments-2, etc. There are a fixed number of these. The public comments are labeled Comments-1...5, and the private comments are labeled Private-Comment-1...5.
The desired result would be to bring the data into the AX table (as is currently working) and either concatenate the comment fields or store them as separate comments into the DocuRef table as internal or external notes.
Would it not just require setting up a new block in the Conversion Studio project that I already have set up? Can you point me to a resource that shows a similar procedure or how to do this?
Thanks in advance!
After chasing the rabbit down the deepest of rabbit holes, I discovered that the easiest way to do this is like so:
Override the onEntityCommit method of your Document Handler (that extends AppDataDocumentHandler), like so:
AppEntityAction onEntityCommit(AppDocumentBlock documentBlock, AppBlock fromBlock, AppEntity toEntity)
{
    AppEntityAction ret;
    int64 recId; // Should point to the record currently being imported into CMCTRS
    ;
    ret = super(documentBlock, fromBlock, toEntity);
    recId = toEntity.getRecord().recId;

    // Do whatever you need to do with the recId now

    return ret;
}
Here is my method to insert the notes, in case you need that too:
private static boolean insertNote(RefTableId _tableId, int64 _docuRefId, str _note, str _name, boolean _isPublic)
{
    DocuRef docuRef;
    boolean insertResult = false;
    ;
    if (_docuRefId)
    {
        try
        {
            docuRef.clear();
            ttsbegin;
            docuRef.RefCompanyId = curext();
            docuRef.RefTableId = _tableId;
            docuRef.RefRecId = _docuRefId;
            docuRef.TypeId = 'Note';
            docuRef.Name = _name;
            docuRef.Notes = _note;
            docuRef.Restriction = (_isPublic) ? DocuRestriction::External : DocuRestriction::Internal;
            docuRef.insert();
            ttscommit;
            insertResult = true;
        }
        catch
        {
            ttsabort;
            error("Could not insert " + ((_isPublic) ? "public" : "private") + " comment:\n\n\t\"" + _note + "\"");
        }
    }
    return insertResult;
}