How to read structured data from a file in C++?

I have a data file where I need to read a value from each line and store it. Then, depending on the value of one of those fields, I need to store the data in an array so that I can calculate the median of all of the values.
Each line of data is demographic information, including the geographic location and address of a person. I need to capture their age and then find the median age of the people that live on a particular street, for example.
The data set is 150,000 records and each record has 26 fields; many of those fields are segments of an address, and the other fields are just numbers: age, street number, and that sort of thing.
So what I need to do is read through each line, and if a particular field in the record meets a certain condition, capture a field from the record and store it in an array so that I can calculate the median age of the people that live on "Oak Street", for example.
I have the conditional logic and can work the sort out, but I'm uncomfortable with the iostream objects in C++, like instantiating an ifstream object and then reading from the file itself.
Oh, I forgot to mention that the data is a comma-separated value file.

For comma-delimited input:
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
using namespace std;

ifstream file("data.csv"); // the stream must actually be opened; the name here is a placeholder
string line;
while (getline(file, line)) {
    istringstream stream(line);
    string data[3];
    for (size_t ii = 0; ii < sizeof data / sizeof data[0]; ++ii)
        if (!getline(stream, data[ii], ','))
            throw std::runtime_error("invalid data");
    // process data here
}
For whitespace-delimited input (original answer):
using namespace std;

ifstream file("data.txt"); // again, a placeholder name; the stream must be opened
string line;
while (getline(file, line)) {
    int datum1;
    string datum2;
    double datum3;
    istringstream stream(line);
    if (!(stream >> datum1 >> datum2 >> datum3)) // read from the stream, not the line
        throw std::runtime_error("invalid data");
    // process data here
}
These methods won't win any prizes for performance, but hopefully they're fairly reliable and easy to understand.
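Tying this back to the question, here is a hedged sketch of filtering by street and computing the median age; the filename and the field positions are assumptions, since the real records have 26 fields in an unspecified order:
#include <algorithm>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

int main() {
    std::ifstream file("people.csv"); // filename is a placeholder
    std::string line;
    std::vector<int> ages;
    while (getline(file, line)) {
        std::istringstream stream(line);
        std::string field[26];
        for (std::size_t ii = 0; ii < sizeof field / sizeof field[0]; ++ii)
            if (!getline(stream, field[ii], ','))
                throw std::runtime_error("invalid data");
        // hypothetical positions: field[3] = street name, field[7] = age
        if (field[3] == "Oak Street")
            ages.push_back(std::stoi(field[7]));
    }
    std::sort(ages.begin(), ages.end());
    if (!ages.empty()) {
        std::size_t n = ages.size();
        double median = n % 2 ? ages[n / 2]
                              : (ages[n / 2 - 1] + ages[n / 2]) / 2.0;
        // ... use median ...
    }
}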

This sounds like a perfect problem for an SQLite-style embedded database. Then you could have any number of standard SQL features without having to reinvent the wheel.


parse huge csv file with C++

In order to simulate my network I am using a trace file (csv file) with a size between 5 and 30 GB.
The csv file is row-based, where each row contains multiple fields delimited by a space, together forming the information for a network packet:
3 53 4 12 1 1 2 6
Since the file's size could reach several GBs (millions of lines), is it better to divide it into small chunks myfile00.csv, myfile01.csv..., or can I process the entire file on the hard drive without loading it into memory?
I want to read the file line by line at a specific time, which is the clock cycle of the simulation, and get all the information in the line to create an omnet++ message.
packet MyTrace::getpacket() {
    int id;                   // first field
    int cycle;                // second field
    int source;               // third field
    int destination;          // fourth field
    int numberofDep;          // fifth field
    std::list<int> listofDep; // remaining fields
    if (traceFile.is_open()) { // read only while the file is open
        // get id
        // get cycle
        // ....
    }
    // ... build and return the packet ...
}
Any suggestion would be helpful.
EDIT:
string line;
ifstream myfile("BlackSmall.csv");
int currentline = 0;
if (myfile.is_open())
{
    while (getline(myfile, line)) {
        istringstream ss(line);
        string request;
        int id, cycle, source, dest, srcType, destType, packetSize, dependency;
        int listdep;
        std::list<int> dep;
        ss >> id;
        ss >> cycle;
        ss >> source;
        ss >> dest;
        ss >> request;
        ss >> srcType;
        ss >> destType;
        ss >> packetSize;
        ss >> dependency;
        while (ss >> listdep) dep.push_back(listdep);
        // Create my packet
    }
    myfile.close();
}
else cout << "Unable to open file";
With the above code, I can get all the information that I need from a line.
The problem is that I need to use this code inside a class which, when I call it, returns just one line's information. Is there a way to point to a specific line when I call this class?
It seems like your application requires a single sequential pass through the input, so processing a file that is 1 GB or 100 GB is perhaps just a matter of patience, and perhaps parallelism.
The approach should be to translate records line by line. You should avoid strategies that attempt to read the entire file into memory. The standard library offers the easy-to-use std::ifstream class, and the free function std::getline reads one line at a time into a std::string, ready to be converted.
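For the class-based, one-line-per-call usage the question asks about, here is a minimal sketch (a simplification, not a full implementation: it assumes the stream lives as a class member, so each call simply picks up where the last one stopped):
#include <fstream>
#include <sstream>
#include <string>

class MyTrace {
    std::ifstream traceFile;
public:
    explicit MyTrace(const char *name) : traceFile(name) {}
    // Returns false when the trace is exhausted; otherwise parses one line.
    bool getpacket() {
        std::string line;
        if (!std::getline(traceFile, line)) return false; // the stream remembers its position
        std::istringstream ss(line);
        int id, cycle, source, dest;
        ss >> id >> cycle >> source >> dest;
        // ... read the remaining fields as in the EDIT above, then build the packet ...
        return true;
    }
};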
If you are feeling more ambitious and want to control the amount of data read or buffered more carefully then you would not be the first developer to roll-your-own code to implement a buffered reader. This is a fairly empowering exercise and will help you think through some corner cases with reading partial lines and such. But in the end, it probably will not give you a significant boost toward your goal. I suspect the ifstream approach will get you up and running without the hassle and will not ultimately be the bottleneck in processing these files.
If you were really concerned about optimizing execution time then having multiple files might help you launch parallel processing tasks.
// define a class to hold your custom record
class Record {
};

// create a parser function to convert a line of text into the record
bool parse(std::string const &line, Record &record) {
    // ... fill in the field-by-field conversion ...
    return true;
}

// create a translator method to convert a record into the desired output
bool write(Record const &record, std::ofstream &os) {
    // ... fill in the formatting of the output ...
    return true;
}

// actually open streams for the input and output files (names are placeholders)
std::ifstream is("trace.csv");
std::ofstream os("out.csv");
std::string line;
while (std::getline(is, line)) {
    Record record;
    if (!parse(line, record)) break;
    if (!write(record, os)) break;
}
You can re-use the Record instance by moving it outside the while loop, so long as you are careful to reset the variable so that information from preceding records does not taint the current record. You can also dive head first into the C++ ecosystem by writing stream input and output operators (operator>> and operator<<), but I personally find this approach to be more confusing than it is worth.
Perhaps the best approach for you would be to import your CSV file into an SQLite database.
Once you import it and add some indexes, you can easily and very efficiently query the necessary rows from that database. SQLite has lots of ready-to-use C/C++ client libraries available; you can start with the default one at https://www.sqlite.org/cintro.html.
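As an illustration only, here is a minimal sketch of querying such a database through the C API; the database file, table, and column names here are all assumptions:
#include <sqlite3.h>
#include <iostream>

int main() {
    sqlite3 *db = nullptr;
    if (sqlite3_open("trace.db", &db) != SQLITE_OK) return 1; // db file is a placeholder
    sqlite3_stmt *stmt = nullptr;
    // hypothetical table "packets" with an indexed "cycle" column
    const char *sql = "SELECT id, source, dest FROM packets WHERE cycle = ?";
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) == SQLITE_OK) {
        sqlite3_bind_int(stmt, 1, 42); // fetch the rows for one clock cycle
        while (sqlite3_step(stmt) == SQLITE_ROW)
            std::cout << sqlite3_column_int(stmt, 0) << '\n';
        sqlite3_finalize(stmt);
    }
    sqlite3_close(db);
}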

How does one correctly store data into an array struct with stringstream? [duplicate]

This question already has answers here:
Why does reading a record struct fields from std::istream fail, and how can I fix it?
(9 answers)
Closed 6 years ago.
I was wondering how to store data from a CSV file into a structured array. I realize I need to use getline and such and so far I have come up with this code:
This is my struct:
struct csvData //creating a structure
{
    string username; //the username field
    float gpa;       //the gpa field
    int age;         //the age field
};
This is my data reader and the part that stores the data:
csvData arrayData[10];
string data;
ifstream infile; //creating object with ifstream
infile.open("datafile.csv"); //opening file
if (infile.is_open()) //error check
int i=0;
while(getline(infile, data));
{
stringstream ss(data);
ss >> arrayData[i].username;
ss >> arrayData[i].gpa;
ss >> arrayData[i].age;
i++;
}
Further, this is how I was attempting to print out the information:
for (int z = 0; z<10; z++)
{
cout<<arrayData[z].username<<arrayData[z].gpa<<arrayData[z].age<<endl;
}
However, when running this command, I get a cout of what seem to be random numbers:
1.83751e-0383 03 4.2039e-0453 1.8368e-0383 07011688
I assume this has to be the array not storing the variables correctly, and thus I am reading out random memory slots; however, I am unsure.
Lastly, here is the CSV file I am attempting to read.
username,gpa,age
Steven,3.2,20
Will,3.4,19
Ryan,3.6,19
Tom,3,19
There's nothing in your parsing code that actually attempts to parse the single line into the individual fields:
while(getline(infile, data));
{
This correctly reads a single line from the input file into the data string.
stringstream ss(data);
ss >> arrayData[i].username;
ss >> arrayData[i].gpa;
ss >> arrayData[i].age;
You need to try to explain to your rubber duck how this is supposed to take a single line of comma-separated values, like the one you showed in your question:
Steven,3.2,20
and separate that string into the individual values, by commas. There's nothing about the >> operator that will do this. operator>> separates input using whitespace, not commas. Your suspicions were correct: you were not parsing the input correctly.
This is a task that you have to do yourself. I am presuming that you would like, as a learning experience or as a homework assignment, to do this yourself, manually. Well, then, do it yourself. You have a single line in data. Use any of the tools that C++ gives you: std::string's find() method, or std::find() from <algorithm>, to find each comma in the data string, then extract each individual portion of the string that's between the commas. Then you still need to convert the two numeric fields into the appropriate datatypes. That's when you put each one of them into a std::istringstream and use operator>> to convert them to numeric types.
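For example, here is a hedged sketch of that manual approach, reusing the csvData struct from the question and assuming exactly three fields per line:
#include <sstream>
#include <string>

bool parseLine(const std::string &line, csvData &out) {
    // locate the two commas that separate the three fields
    std::string::size_type first = line.find(',');
    std::string::size_type second = line.find(',', first + 1);
    if (first == std::string::npos || second == std::string::npos)
        return false;
    out.username = line.substr(0, first);
    // convert the two numeric pieces via istringstream
    std::istringstream gpa(line.substr(first + 1, second - first - 1));
    std::istringstream age(line.substr(second + 1));
    return (gpa >> out.gpa) && (age >> out.age);
}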
But, having said all that, there's an alternative dirty trick, to solve this problem quickly. Recall that the original line in data contains
Steven,3.2,20
All you have to do is replace the commas with spaces, turning it into:
Steven 3.2 20
Replacing commas with spaces is trivial with std::replace(), or with a small loop. Then, you can stuff the result into a std::istringstream, and use operator>> to extract the individual whitespace-delimited values into the discrete variables, using the code that you've already written.
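In code, the trick might look like this (reusing the names from the question):
#include <algorithm>
// ...
std::replace(data.begin(), data.end(), ',', ' '); // "Steven,3.2,20" -> "Steven 3.2 20"
stringstream ss(data);
ss >> arrayData[i].username;
ss >> arrayData[i].gpa;
ss >> arrayData[i].age;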
Just a small word of warning: if this was indeed your homework assignment, to write code to manually parse and extract comma-delimited values, it's not guaranteed that your instructor will give you the full grade for taking the dirty-trick approach...
Nice try, and a nice, complete question. Here is the answer:
1) You have a semicolon after the loop:
while(getline(infile, data));
delete it.
How did I figure that out easily? I compiled with all the warnings enabled, like this:
C02QT2UBFVH6-lm:~ gsamaras$ g++ -Wall main.cpp
main.cpp:24:33: warning: while loop has empty body [-Wempty-body]
while(getline(infile, data));
^
main.cpp:24:33: note: put the semicolon on a separate line to silence this warning
1 warning generated.
In fact, you should get that warning without -Wall as well, but get into the habit of using it; it will do you good! :)
2) Then, you read some elements, but not 10, so why do you print 10? Print only as many as you actually read, i.e. i.
When you try to print all 10 elements of your array, you print elements that are not initialized, since you didn't initialize your array of structs.
Moreover, the number of lines in datafile.csv was less than 10. So you started populating your array, but you stopped, when the file didn't have more lines. As a result, some of the elements of your array (the last 6 elements) remained uninitialized.
Printing uninitialized data, causes Undefined Behavior, that's why you see garbage values.
3) Also this:
if (infile.is_open()) //error check
could be written like this:
if (!infile.is_open())
cerr << "Error Message by Mr. Tom\n";
Putting them all together:
WILL STILL NOT WORK, BECAUSE ss >> arrayData[i].username; eats the entire input line and the next two extractions fail, as Pete Becker said, but I leave it here so that others won't make the same attempt!
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
using namespace std;

struct csvData //creating a structure
{
    string username; //the username field
    float gpa;       //the gpa field
    int age;         //the age field
};

int main()
{
    csvData arrayData[10];
    string data;
    ifstream infile;             //creating object with ifstream
    infile.open("datafile.csv"); //opening file
    if (!infile.is_open()) { cerr << "File is not opened..\n"; }
    int i = 0;
    while (getline(infile, data))
    {
        stringstream ss(data);
        ss >> arrayData[i].username;
        ss >> arrayData[i].gpa;
        ss >> arrayData[i].age;
        i++;
    }
    for (int z = 0; z < i; z++)
    {
        cout << arrayData[z].username << arrayData[z].gpa << arrayData[z].age << endl;
    }
    return 0;
}
Output:
C02QT2UBFVH6-lm:~ gsamaras$ g++ -Wall main.cpp
C02QT2UBFVH6-lm:~ gsamaras$ ./a.out
username,gpa,age00
Steven,3.2,2000
Will,3.4,1900
Ryan,3.6,1900
Tom,3,1900
But wait a minute, so now it works, but why this:
while(getline(infile, data));
{
...
}
didn't?
Because putting a semicolon after a loop is equivalent to giving it an empty body:
while(getline(infile, data))
{
    ; // empty body: the loop reads (and discards) every line
}
because, as you probably already know, loops with only one statement as a body do not require curly brackets.
And what happened to what you thought was the body of the loop (i.e. the part where you use std::stringstream)?
It got executed! But only once!
You see, a pair of curly brackets alone means something: it's an anonymous scope/block.
So this:
{
    stringstream ss(data);
    ss >> arrayData[i].username;
    ss >> arrayData[i].gpa;
    ss >> arrayData[i].age;
    i++;
}
functioned on its own, without being part of the while loop as you intended!
And why did it work?! Because you had declared i before the loop! ;)

Store Input from Text File into Arrays or Variables?

My Problem:
I am very new to programming and am trying to write a program in C++. I have a text file. In the text file are stored a student's name, grade, and grade letter. I want them stored as different types in arrays.
I want to store them individually, each in its own array....
ie text file would look like this:
Jill Hamming A 96
Steven Jenning A 94
Tim Sutton B 89
Dillon Crass C 76
Sammy Salsa D 54
Karen Poulk D 49
I would like to store all the first names in one array, the last names in another, and so on.
These arrays will later be assigned to objects for the students. There may be up to 500 students.
So the question:
How do I store input from a text document into arrays instead of using 500 variables?
i.e., here is my attempt:
int main()
{
    // the input is all different types: strings, ints and chars
    string my_First_Name[500], my_Last_Name[500];
    int my_grade[500];
    char my_letter[500];
    ifstream myfile("input.txt");
    if (myfile.is_open()){
        for (int i = 0; i < 500; i++) {
            myfile >> my_First_Name[i] >> my_Last_Name[i] >> my_grade[i] >> my_letter[i];
        }
        // much later and irrelevant part, but showing it because this is what
        // I want to do, where a Student class exists somewhere else (yet to be programmed)
        Student myStudent[500];
        for (int i = 0; i < 500; i++) {
            myStudent[i].Grades = my_grade[i];
            myStudent[i].LetterGrade = my_letter[i];
        }
    }
    myfile.close();
    //Exit
    system("pause");
    return 0;
}
When I went to print out what I had, I got all negative and weird numbers, which means the data was not initialized. Where did I go wrong?
When you have multiple arrays of the same size, it is usually a sign of poor design.
The rule of thumb is to have a record (or line of data) represented by a class or structure:
struct Record
{
    string first_name;
    string last_name;
    int grade_value; // Can this go negative?
    string grade_text;
};
If you know you are going to have 500, you can create an array for the data:
#define ARRAY_CAPACITY (500)
Record grades[ARRAY_CAPACITY];
This is not a great solution, because you waste space if there are fewer than 500 records and overrun the buffer if you read more than 500.
Thus a better solution is to use std::vector, which allows you to append records as you read them. The std::vector will expand as necessary to contain the records. With an array, you would have to allocate a new array, copy old records to new array, then delete the old array.
Also, a good solution will include methods, inside Record, to read the data members from an input stream. Research "overloading stream extraction operator".
Your input loop should look something like this (note that it relies on an overloaded extraction operator for Record; a sketch follows below):
Record r;
while (input_file >> r)
{
    student_grades.push_back(r);
}
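Here is a minimal sketch of the extraction operator that the loop above relies on, assuming whitespace-separated fields as in the sample file:
#include <istream>

std::istream &operator>>(std::istream &in, Record &r)
{
    // reads the four fields of one record; the stream state reports failure
    return in >> r.first_name >> r.last_name >> r.grade_value >> r.grade_text;
}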
You can find this information by searching stackoverflow. A good beginning search is "stackoverflow c++ read file space separated" or "comma separated" or "stackoverflow c++ read file structure".
Reading a record
You could expand the loop to something like this:
std::vector<Record> student_grades;
Record r;
while (input >> r.first_name >> r.last_name >> r.grade_value >> r.grade_text)
{
    student_grades.push_back(r);
}
If you are allergic or limited to arrays, you will need to use a counter with the array.
#define MAXIMUM_RECORDS (500)
Record r;
Record grade_book[MAXIMUM_RECORDS];
unsigned int record_count = 0U;
while (input >> r.first_name >> r.last_name >> r.grade_value >> r.grade_text)
{
    grade_book[record_count] = r;
    ++record_count;
    if (record_count >= MAXIMUM_RECORDS)
    {
        break;
    }
}
A for loop is not used because we don't know how many records are in the file, only a maximum that the program will read. If the file has 600 records, only 500 will be read.

Efficiently read CSV file with optional columns

I'm trying to write a program that reads in a CSV file (no need to worry about escaping anything, it's strictly formatted with no quotes) but any numeric item with a value of 0 is instead just left blank. So a normal line would look like:
12,string1,string2,3,,,string3,4.5
instead of
12,string1,string2,3,0,0,string3,4.5
I have some working code using vectors but it's way too slow.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

int main(int argc, char** argv)
{
    string filename("path\\to\\file.csv");
    string outname("path\\to\\outfile.csv");
    ifstream infile(filename.c_str());
    if(!infile)
    {
        cerr << "Couldn't open file " << filename.c_str();
        return 1;
    }
    vector<vector<string>> records;
    string line;
    while( getline(infile, line) )
    {
        vector<string> row;
        string item;
        istringstream ss(line);
        while(getline(ss, item, ','))
        {
            row.push_back(item);
        }
        records.push_back(row);
    }
    return 0;
}
Is it possible to overload operator<< of ostream similar to How to use C++ to read in a .csv file and output in another form? when fields can be blank?
Would that improve the performance?
Or is there anything else I can do to get this to run faster?
Thanks
The time spent reading the string data from the file is greater than the time spent parsing it. You won't make significant time savings in the parsing of the string.
To make your program run faster, read bigger "chunks" into memory, getting more data per read. Research memory-mapped files.
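As an illustration only (not the answer's own code), reading in fixed-size chunks might look like this:
#include <fstream>
#include <vector>

std::ifstream in("file.csv", std::ios::binary); // the path is a placeholder
std::vector<char> chunk(1 << 20);               // 1 MiB per read
while (in.read(chunk.data(), chunk.size()) || in.gcount() > 0) {
    std::streamsize got = in.gcount();
    // parse the 'got' bytes in chunk here, taking care of lines that
    // straddle chunk boundaries
}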
One alternative way to get better performance is to read the whole file into a buffer, then go through the buffer and set pointers to where each value starts; when you find a ',' or end of line, put in a '\0'.
e.g. https://code.google.com/p/csv-routine/
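Here is a hedged sketch of that in-place tokenization idea, assuming the entire file already sits in a writable, NUL-terminated buffer:
#include <vector>

std::vector<char*> split_in_place(char *buffer)
{
    std::vector<char*> fields;
    char *start = buffer;
    for (char *p = buffer; *p; ++p) {
        if (*p == ',' || *p == '\n') {
            *p = '\0';               // terminate the current field in place
            fields.push_back(start); // remember where it began
            start = p + 1;           // the next field starts after the delimiter
        }
    }
    if (*start)
        fields.push_back(start);     // trailing field with no final newline
    return fields;
}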

How to read in a data file of unknown dimensions in C/C++

I have a data file which contains data in row/colum form. I would like a way to read this data in to a 2D array in C or C++ (whichever is easier) but I don't know how many rows or columns the file might have before I start reading it in.
At the top of the file is a commented line giving a series of numbers relating to what each column holds. Each row holds the data for each number at a point in time, so an example data file (a small one - the ones I'm using are much bigger!) could look like:
# 1 4 6 28
21.2 492.1 58201.5 586.2
182.4 1284.2 12059. 28195.2
.....
I am currently using Python to read in the data with numpy.loadtxt, which conveniently splits the data into row/column form whatever the array size, but this is getting quite slow. I want to be able to do this reliably in C or C++.
I can see some options:
1. Add a header tag with the dimensions from my extraction program:
# 1 4 6 28
# xdim, ydim
21.2 492.1 58201.5 586.2
182.4 1284.2 12059. 28195.2
.....
but this requires rewriting my extraction programs and the programs which use the extracted data, which is quite intensive.
2. Store the data in a database file, e.g. MySQL, SQLite etc. Then the data could be extracted on demand. This might be a requirement further along in the development process, so it might be good to look into anyway.
3. Use Python to read in the data and wrap C code for the analysis. This might be easiest in the short run.
4. Use wc on linux to find the number of lines, and the number of words in the header to find the dimensions:
echo $((`cat FILE | wc -l` - 1)) # get number of rows (-1 for header line)
echo $((`cat FILE | head -n 1 | wc -w` - 1)) # get number of columns (-1 for '#' character)
5. Use C/C++ code.
This question is mostly related to point 5 - whether there is an easy and reliable way to do this in C/C++. Otherwise, any other suggestions would be welcome.
Thanks
Create the table as a vector of vectors:
std::vector<std::vector<double> > table;
Inside an infinite (while(true)) loop:
Read a line:
std::string line;
std::getline(ifs, line);
If something went wrong (probably EOF), exit the loop:
if(!ifs)
break;
Skip that line if it's a comment:
if(line[0] == '#')
continue;
Read the row contents into a vector (note: parse the line you just read, not the file stream):
std::istringstream ss(line);
std::vector<double> row;
std::copy(std::istream_iterator<double>(ss),
          std::istream_iterator<double>(),
          std::back_inserter(row));
Add the row to the table:
table.push_back(row);
By the time you're out of the loop, "table" contains the data:
table.size() is the number of rows
table[i] is row i
table[i].size() is the number of cols. in row i
table[i][j] is the element at the j-th col. of row i
How about:
Load the file.
Count the number of rows and columns.
Close the file.
Allocate the memory needed.
Load the file again.
Fill the array with data.
Every .obj (3D model file) loader I've seen uses this method. :)
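Here is a hedged sketch of that two-pass approach with one flat allocation; the filename is a placeholder, and every data row is assumed to have the same number of columns:
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::size_t rows = 0, cols = 0;
    std::string line;
    {   // pass 1: count rows, and columns from the first data line
        std::ifstream in("data.txt");
        while (std::getline(in, line)) {
            if (line.empty() || line[0] == '#') continue;
            if (cols == 0) {
                std::istringstream ss(line);
                double d;
                while (ss >> d) ++cols;
            }
            ++rows;
        }
    }
    std::vector<double> table(rows * cols); // one allocation for everything
    {   // pass 2: fill the array
        std::ifstream in("data.txt");
        std::size_t r = 0;
        while (std::getline(in, line)) {
            if (line.empty() || line[0] == '#') continue;
            std::istringstream ss(line);
            for (std::size_t c = 0; c < cols; ++c) ss >> table[r * cols + c];
            ++r;
        }
    }
}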
Figured out a way to do this. Thanks go mostly to Manuel as it was the most informative answer.
std::vector< std::vector<double> > readIn2dData(const char* filename)
{
    /* Function takes a char* filename argument and returns a
     * 2d dynamic array containing the data
     */
    std::vector< std::vector<double> > table;
    std::ifstream ifs;
    /* open file */
    ifs.open(filename);
    while (true)
    {
        std::string line;
        double buf;
        getline(ifs, line);
        if (!ifs)
            // mainly catch EOF
            break;
        if (line.empty() || line[0] == '#')
            // catch empty lines or comment lines
            continue;
        std::stringstream ss(line);
        std::vector<double> row;
        while (ss >> buf)
            row.push_back(buf);
        table.push_back(row);
    }
    ifs.close();
    return table;
}
Basically, create a vector of vectors. The only difficulty was splitting by whitespace, which is taken care of with the stringstream object. This may not be the most efficient way of doing it, but it certainly works in the short term!
Also I'm looking for a replacement for the deprecated atof function, but nevermind. Just needs some memory leak checking (it shouldn't have any since most of the objects are std objects) and I'm done.
Thanks for all your help
Do you need a square or a ragged matrix? If the latter, create a structure like this:
std::vector< std::vector<double> > data;
Now read one line at a time into a:
vector<double> d;
and add the vector to the ragged matrix:
data.push_back( d );
All data structures involved are dynamic, and will grow as required.
I've seen your answer, and while it's not bad, I don't think it's ideal either. At least as I understand your original question, the first comment basically specifies how many columns you'll have in each of the remaining rows. e.g. the one you've given ("1 4 6 28") contains four numbers, which can be interpreted as saying each succeeding line will contain 4 numbers.
Assuming that's correct, I'd use that information to optimize reading the data. In particular, after that, (again, as I understand it) the file just contains row after row of numbers. That being the case, I'd put all the numbers together into a single vector, and use the number of columns from the header to index into the rest:
class matrix {
    std::vector<double> data;
    int columns;
public:
    // a matrix is 2D, with a fixed number of columns and an arbitrary number of rows.
    matrix(int cols) : columns(cols) {}

    // just read raw data from the stream into the vector:
    std::istream &read(std::istream &stream) {
        std::copy(std::istream_iterator<double>(stream),
                  std::istream_iterator<double>(),
                  std::back_inserter(data));
        return stream;
    }

    // Do 2D addressing by converting rows/columns to a linear address.
    // If you want to check subscripts, use vector.at(x) instead of vector[x].
    double operator()(size_t row, size_t col) {
        return data[row*columns+col];
    }
};
This is all pretty straightforward -- the matrix knows how many columns it has, so you can do x,y indexing into the matrix, even though it stores all its data in a single vector. Reading the data from the stream just means copying that data from the stream into the vector. To deal with the header, and simplify creating a matrix from the data in a stream, we can use a simple function like this:
matrix read_data(std::string name) {
    // read one line from the stream.
    std::ifstream in(name.c_str());
    std::string line;
    std::getline(in, line);

    // break that up into space-separated groups:
    std::istringstream temp(line);
    std::vector<std::string> counter;
    std::copy(std::istream_iterator<std::string>(temp),
              std::istream_iterator<std::string>(),
              std::back_inserter(counter));

    // the number of columns is the number of groups, -1 for the leading '#'.
    matrix m(counter.size()-1);

    // Read the remaining data into the matrix.
    m.read(in);
    return m;
}
As it's written right now, this depends on your compiler implementing the "Named Return Value Optimization" (NRVO). Without that, the compiler will copy the entire matrix (probably a couple of times) when it's returned from the function. With the optimization, the compiler pre-allocates space for a matrix, and has read_data() generate the matrix in place.
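Hypothetical usage of the helpers above (the filename is a placeholder):
matrix m = read_data("data.txt");
double x = m(1, 2); // element at row 1, column 2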