Reading a CSV into vectors in objects - C++

I'm trying to write code that will, on a line-by-line basis, pass numerical data from a CSV to an object's vector. The object's structure is as follows: the object itself (let's call it CS) is an enclosed space, within which resides a vector of objects (called Points), each of which has a vector of objects (Features) with 3 variables. The first two variables in these Features are descriptors of the feature and the third is the actual value taken by a specific Point[i].Feature[j]. Each point has the same set of Features, and aside from the third value being different, the descriptors are likewise identical. (Edit: sadly I can't change this structure, as it's part of a larger framework which is out of my hands.)
Currently, my CSV has one column per feature, with the first two rows holding the descriptors that apply to all points and each remaining row holding an individual point's third feature value. It's been a while since my introductory C++ course and I'm finding it hard to think of a fast way to implement this, as my CSVs could become fairly large (my current upper limit is 50,000 points with 2,000 features each, and this will probably grow) and I wouldn't want to do something silly like rereading the first two lines for every point. I've looked around, and most CSV solutions involve string CSVs, which I don't have to deal with, and simpler array objects in which the CSV is stored. The problem for me is simply going up a level each time I reach the end of a line and restarting the procedure for the next point, and I can't think of anything. Any tips?
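For concreteness, the structure being described might look roughly like the following sketch (the names are illustrative only; as noted, the real framework's types are fixed):

#include <vector>

// Illustrative sketch of the structure described above -- not the real framework types.
struct Feature {
    char descriptorA;   // first descriptor, identical across all points
    bool descriptorB;   // second descriptor, identical across all points
    char value;         // the per-point value
};

struct Point {
    std::vector<Feature> features;   // every point carries the same feature set
};

struct CS {
    std::vector<Point> points;       // up to 50,000 points x 2,000 features currently
};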

You could just create a temporary array of Descriptor objects which holds the two descriptors for each column, read in your first row, and create your Point objects from that. Afterwards you can just copy the descriptors from the Point a row above, e.g. Point[i - csvWidth], and deallocate the Descriptor array.

I guess I was nearly there, just used the wrong kind of variable to read in.
fstream myFile;
myFile.open(filePath.c_str());
if (!myFile) {
    cout << "File \"" << filePath << "\" doesn't exist, exiting program." << endl;
    exit(EXIT_FAILURE);
}
string line, line2, line3;
Points.clear();
// gets the range row
getline(myFile, line);
istringstream lineStream(line);
// gets the nomin row
getline(myFile, line2);
istringstream lineStream2(line2);
// gets the first person's traits
getline(myFile, line3);
istringstream lineStream3(line3);
CultVec originalCultVec = CultVec(RNG);
int val, val2, val3, val4;
// build the first point, storing both descriptors along with its own trait value
while (lineStream >> val && lineStream2 >> val2 && lineStream3 >> val3) {
    Feature feature;
    feature.Range = (char)val;
    feature.Nomin = (bool)val2;
    feature.Trait = (char)val3;
    originalCultVec.addFeature(feature);
} // while
Points.push_back(originalCultVec);
// every later point copies the first point's descriptors and overwrites
// only the trait value, so the two header rows are read exactly once
while (getline(myFile, line)) {
    int i = 0;
    CultVec newVec = CultVec(RNG);
    istringstream lineStream4(line);
    while (lineStream4 >> val4) {
        Feature newFeat = originalCultVec.getFeature(i);
        newFeat.Trait = (char)val4;
        newVec.addFeature(newFeat);
        i++;
    }
    Points.push_back(newVec);
}

Related

Reading key-value pairs as fast as possible in C++ from file

I have a file with roughly 2 million lines like this:
2s,3s,4s,5s,6s 100000
2s,3s,4s,5s,8s 101
2s,3s,4s,5s,9s 102
The first comma separated part indicates a poker result in Omaha, while the latter score is an example "value" of the cards. It is very important for me to read this file as fast as possible in C++, but I cannot seem to get it to be faster than a simple approach in Python (4.5 seconds) using the base library.
Using the Qt framework (QHash and QString), I was able to read the file in 2.5 seconds in release mode. However, I do not want to have the Qt dependency. The goal is to allow quick simulations using those 2 million lines, i.e. some_container["2s,3s,4s,5s,6s"] to yield 100 (though if applying a translation function or any non-readable format will allow for faster reading that's okay as well).
My current implementation is extremely slow (8 seconds!):
std::map<std::string, int> get_file_contents(const char *filename)
{
    std::map<std::string, int> outcomes;
    std::ifstream infile(filename);
    std::string c;
    int d;
    while (infile.good())
    {
        infile >> c;
        infile >> d;
        //std::cout << c << d << std::endl;
        outcomes[c] = d;
    }
    return outcomes;
}
What can I do to read this data into some kind of a key/value hash as fast as possible?
Note: The first 16 characters are always going to be there (the cards), while the score can go up to around 1 million.
Some further information gathered from various comments:
sample file: http://pastebin.com/rB1hFViM
RAM restriction: 750 MB
initialization time restriction: 5 s
computation time per hand restriction: 0.5 s
As I see it, there are two bottlenecks in your code.
Bottleneck 1
I believe that the file reading is the biggest problem there. Having a binary file is the fastest option. Not only can you read it directly into an array with a raw istream::read in a single operation (which is very fast), you can even map the file into memory if your OS supports it. Here is a link that's very informative on how to use memory-mapped files.
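For the first point, a minimal sketch of the single-read approach (the filename and the absence of error handling are for illustration only):

#include <fstream>
#include <vector>

// Read an entire file into memory with one bulk istream::read call.
std::vector<char> slurp(const char *filename)
{
    std::ifstream in(filename, std::ios::binary | std::ios::ate); // open at end
    std::streamsize size = in.tellg();      // position == file size
    in.seekg(0, std::ios::beg);
    std::vector<char> buf(size);
    in.read(buf.data(), size);              // one read instead of millions of >>
    return buf;
}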
Bottleneck 2
std::map is usually implemented with a self-balancing BST that stores all the data in order. This makes insertion an O(log n) operation. You can change it to std::unordered_map, which uses a hash table instead. A hash table has constant-time insertion as long as the number of collisions is low. Since the number of elements to read is known, you can reserve a suitable number of buckets before inserting the elements. Keep in mind that you need more buckets than the number of elements that will be inserted into the hash to minimize the number of collisions.
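A sketch of that change, reusing the question's function (the reserve figure comes from the roughly 2 million lines mentioned above; the exact bucket sizing is a judgment call):

#include <fstream>
#include <string>
#include <unordered_map>

std::unordered_map<std::string, int> get_file_contents(const char *filename)
{
    std::unordered_map<std::string, int> outcomes;
    outcomes.reserve(2000000);         // pre-size buckets to avoid rehashing
    std::ifstream infile(filename);
    std::string c;
    int d;
    while (infile >> c >> d)           // also avoids the .good() loop pattern
        outcomes[c] = d;
    return outcomes;
}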
Ian Medeiros already mentioned the two major bottlenecks.
A few thoughts about data structures:
The number of different cards is known: 4 suits of 13 cards each -> 52 cards.
So a card requires less than 6 bits to store. Your current file format uses 24 bits per card (including the comma).
So by simply enumerating the cards and omitting the comma, you save about 2/3 of the file size and can determine a card by reading only one character per card.
If you want to keep the file text-based, you could use a-m, n-z, A-M and N-Z for the four suits.
Another thing that bugs me is the string-based map. String operations are inefficient.
One hand contains 5 cards.
That means 52^5 possibilities if we keep it simple and do not consider the already-drawn cards.
-> 52^5 = 380,204,032 < 2^32
That means we can enumerate every possible hand with a uint32 number. By defining a canonical ordering of the cards (since order is irrelevant), we can assign a number to each hand and use this number as the key in our map, which is a lot faster than using strings.
If we have enough memory (1.5 GB), we do not even need a map; we can simply use an array.
Of course most cells are unused, but access can be very fast. We can even omit the ordering of the cards, since the cells are present whether we fill them or not, so we can use them all. But in that case you should not forget to fill every possible permutation of each hand read from the file.
With this scheme we can (maybe) optimize our file-reading speed further, if we store only the hand's number and the rating, so that only 2 values need to be parsed.
In fact we can optimize the required storage space by using a more complex addressing scheme for the different hands, since in reality there are only 52*51*50*49*48 = 311,875,200 possible hands. Additionally, the ordering is irrelevant as mentioned, but I think that saving is not worth the increased complexity of encoding the hands.
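A minimal sketch of that hand-to-number encoding (it assumes the cards have already been mapped to 0..51 while reading; sorting gives the canonical order):

#include <algorithm>
#include <cstdint>

// Encode a 5-card hand as a single number in [0, 52^5), usable as an array
// index or map key. Sorting first makes all permutations of a hand collide.
uint32_t encode_hand(int cards[5])
{
    std::sort(cards, cards + 5);
    uint32_t key = 0;
    for (int i = 0; i < 5; ++i)
        key = key * 52 + static_cast<uint32_t>(cards[i]);
    return key;                        // 52^5 = 380,204,032 fits in 32 bits
}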
A simple idea might be to use the C API, which is considerably simpler:
#include <cstdio>

int n;
char s[128];
while (std::fscanf(stdin, "%127s %d", s, &n) == 2)
{
    outcomes[s] = n;
}
A rough test showed a considerable speedup for me compared to the iostreams library.
Further speedups may be achieved by storing the data in a contiguous array, e.g. a vector of std::pair<std::string, int>; it depends on whether your data is already sorted and how you need to access it later.
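For instance, a sorted contiguous array with binary-search lookup might look like this sketch (assuming the container is filled once and sorted before any lookups):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;

// After loading, sort once: std::sort(v.begin(), v.end());
// Lookups are then O(log n) over contiguous, cache-friendly storage.
int lookup(const std::vector<Entry> &v, const std::string &key)
{
    auto it = std::lower_bound(v.begin(), v.end(), key,
        [](const Entry &e, const std::string &k) { return e.first < k; });
    return (it != v.end() && it->first == key) ? it->second : -1; // -1: not found
}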
For a serious solution, though, you should probably step back further and think of a better way to represent your data. For example, a fixed-width, binary encoding would be much more space-efficient and faster to parse, since you won't need to look ahead for line endings or parse strings.
Update: From some quick experimentation I've found it fairly fast to first read the entire file into memory and then perform alternating strtok calls with either " " or "\n" as the delimiter; whenever a pair of calls succeed, apply strtol on the second pointer to parse the integer. Here's a skeleton:
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main()
{
    std::vector<char> data;

    // Read entire file to memory
    {
        data.reserve(100000000);
        char buf[4096];
        for (std::size_t n; (n = std::fread(buf, 1, sizeof buf, stdin)) > 0; )
        {
            data.insert(data.end(), buf, buf + n);
        }
        data.push_back('\0');
    }

    // Tokenize the in-memory data
    char * p = &data.front();
    for (char * q = std::strtok(p, " "); q; q = std::strtok(nullptr, " "))
    {
        if (char * r = std::strtok(nullptr, "\n"))
        {
            char * e;
            errno = 0;
            int const n = std::strtol(r, &e, 10);
            if (*e != '\0' || errno != 0) { continue; }

            // At this point we have data:
            // * the string is "q"
            // * the integer is "n"
        }
    }
}

Clarification required regarding Arrays, Vectors and Maps in usage of a C++ Application

I want to know the right algorithm and container class for my application. I am trying to build a client-server communication system where the server contains a group of files (.txt). The file structure (prototype) is:
A|B|C|D....|Z$(some integer value)#(some integer value). The contents of A to Z are in turn a1_a2_a3_a4......aN|b1_b2_b3_b4......bN|......|z1_z2_z3_z4.....zN. So what I want to do is: when the server application starts, it has to load these files one by one, save the contents of each file in a container class, and then split the contents of the file into particular variables based on the delimiters, i.e.
for (int i=0; i< (Number of files); i++)
{
1) Load the file[0] in Container class[0];
2) Read the Container class[0] search for occurences of delimiters "_" and "|"
3) Till next "|" occurs, save the value occurred at "_" to an array or variable (save it in a buffer)
4) Do this till the file length completes or reaches EOF
5) Next read the second file, save it in Container class[1] and follow the steps as in 2),3) and 4)
}
I want to know whether a vector or a map suits my requirement, as I need to search for occurrences of the delimiters, push_back the values, and access them as the need arises.
Can I read each whole file as a block and manipulate the buffer, or should I push values onto a stack while reading the file with seekg? Which will be better and easier to implement? What are the possibilities of using regex?
According to the format of input, and its size, I'd suggest doing something along these lines for reading and parsing the input:
#include <cassert>
#include <istream>
#include <string>
#include <utility>
#include <vector>

void ParseOneFile (std::istream & inp)
{
    std::vector<std::vector<std::string>> data;
    int some_int_1 = 0, some_int_2 = 0;
    std::string temp;

    data.push_back ({});
    while (true)
    {
        int c = inp.get();
        if (std::istream::traits_type::eof() == c)
            break;                      // guard against input missing the '$'
        if ('$' == c)
        {
            data.back().emplace_back (std::move(temp));
            break;
        }
        else if ('|' == c)
        {
            data.back().emplace_back (std::move(temp));
            data.push_back ({});
        }
        else if ('_' == c)
            data.back().emplace_back (std::move(temp));
        else
            temp += char(c);
    }

    char sharp;
    inp >> some_int_1 >> sharp >> some_int_2;
    assert ('#' == sharp);

    // Here, you have your data and your two integers...
}
The above function does not return the information it extracts, so you will want to change that. But it does read one of your files into a vector of vectors of strings called data, plus two integers (some_int_1 and some_int_2). It uses C++11 and does this reading and parsing quite efficiently, both in terms of processing and memory.
Note that the above code does not check for errors or inconsistent formatting in the input file.
Now, for your data structure problem. Since I have no idea about the nature of your data, I can't say for sure. All I can say is that a two-dimensional array and two integers on the side feels like a natural fit for this data. Since you have several files, you can store them all in another vector (or perhaps in a map, mapping a file name to a data structure) like the following:
struct OneFile
{
    vector<vector<string>> data;
    int i1, i2;
};

vector<OneFile> all_files;
// or...
// map<string, OneFile> all_files;
The above function would fill one instance of the OneFile struct above.
As an example, all_files[0].data[0][0] will be a string referring to data item A0 in the first file, and all_files[7].data[25][3] will be another string referring to data item Z3 in the 8th file.
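To make that concrete, here is one way ParseOneFile might be reworked to return a OneFile (a sketch; it assumes the OneFile struct above and repeats the same parsing logic, and LoadAll/file_names are hypothetical names for however the server discovers its files):

#include <fstream>
#include <istream>
#include <string>
#include <utility>
#include <vector>

OneFile ParseOneFile(std::istream &inp)
{
    OneFile result;
    result.data.push_back({});
    std::string temp;
    for (int c = inp.get(); c != std::istream::traits_type::eof(); c = inp.get())
    {
        if ('$' == c) { result.data.back().emplace_back(std::move(temp)); break; }
        if ('|' == c) { result.data.back().emplace_back(std::move(temp)); temp.clear(); result.data.push_back({}); }
        else if ('_' == c) { result.data.back().emplace_back(std::move(temp)); temp.clear(); }
        else temp += char(c);
    }
    char sharp;
    inp >> result.i1 >> sharp >> result.i2;
    return result;
}

// Hypothetical loader filling the collection, one file at a time.
void LoadAll(const std::vector<std::string> &file_names, std::vector<OneFile> &all_files)
{
    for (const std::string &name : file_names)
    {
        std::ifstream inp(name.c_str());
        all_files.push_back(ParseOneFile(inp));
    }
}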

search for specific row c++ tab delimited

AccountNumber Type Amount
15 checking 52.42
23 savings 51.51
11 checking 12.21
is my tab-delimited file.
I would like to be able to search for rows by the account number. Say if I put in 23, I want to get that specific row. How would I do that?
Also, more advanced: if I wanted to change a specific value, say amount 51.51 in account 23, how do I fetch that value and replace it with a new one?
So far I'm just reading it in row by row:
string line;
ifstream is("account.txt");
if (is.is_open())
{
    while (std::getline(is, line)) // read one line at a time
    {
        string value;
        std::istringstream iss(line);
        cout << line << endl; // echo the whole row
        while (iss >> value) // read one value at a time from the line
        {
            //cout << value << " "; // do something with the value
        }
    }
    is.close();
}
else
    cout << "File can't be opened" << endl;
return 0;
Given that each line is of variable length, there is no way to index to a particular row without first parsing the entire file.
But I suspect your program will want to manipulate random rows and columns. So I'd start by parsing out the entire file. Put each row into its own data structure in an array, then index that row in the array.
You can use "strtok" to split the input up into rows, and then strtok again to split each row into fields.
If I were to do this, I would first write a few functions that parse the entire file and store the data in an appropriate data structure (such as an array or std::map). Then I would use the data structure for the required operations (such as searching or editing). Finally, I would write the data structure back to a file if there are any modifications.
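As a rough sketch of that approach (the Account struct and the hard-coded values are illustrative; the rewrite is whole-file, which is the usual approach for small text files):

#include <fstream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical record type for one row of the tab-delimited file.
struct Account
{
    std::string type;
    double amount;
};

int main()
{
    std::map<int, Account> accounts;
    std::string header, line;

    // Parse the entire file into the map, keyed by account number.
    std::ifstream is("account.txt");
    std::getline(is, header);                    // keep the column-header row
    while (std::getline(is, line))
    {
        std::istringstream iss(line);
        int number;
        Account acc;
        if (iss >> number >> acc.type >> acc.amount)
            accounts[number] = acc;
    }
    is.close();                                  // close before rewriting the file

    // Search and edit by account number (inserts a default row if 23 is absent).
    accounts[23].amount = 60.00;

    // Write the whole structure back out.
    std::ofstream os("account.txt");
    os << header << '\n';
    for (std::map<int, Account>::const_iterator it = accounts.begin(); it != accounts.end(); ++it)
        os << it->first << '\t' << it->second.type << '\t' << it->second.amount << '\n';
    return 0;
}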

Initializing a Vector of Objects from a .txt file

#include <iostream>
#include <vector>
#include <fstream>
#include "stock.h"

int main()
{
    double balance = 0, tempPrice = 0;
    string tempStr;
    vector<Stock> portfolio;
    typedef vector<Stock>::iterator StockIt;
    ifstream fileIn("Results.txt");

    for (StockIt i = portfolio.begin(); i != portfolio.end(); i++)
    {
        while (!fileIn.eof())
        {
            getline(fileIn, tempStr);
            i->setSymbol(tempStr);
            fileIn >> tempPrice;
            i->setPrice(tempPrice);
            getline(fileIn, tempStr);
            i->setDate(tempStr);
        }
        fileIn.close();
    }

    for (StockIt i = portfolio.begin(); i != portfolio.end(); i++)
    {
        cout << i->getSymbol() << endl;
        cout << i->getPrice() << endl;
        cout << i->getDate() << endl;
    }
    return 0;
}
Sample text file, Results.txt:
GOOG 569.964 11/17/2010
MSFT 29.62 11/17/2010
YHOO 15.38 11/17/2010
AAPL 199.92 11/17/2010
Now obviously, I want this program to create a vector of Stock Objects which has the appropriate set/get functionality for object: Stock(string, double, string).
Once that is done, I want to print out each individual member of each Object in the vector.
One thing that boggles my mind about fstream is how it can decipher spaces and ends of lines, and intelligently read strings/ints/doubles and place them into the appropriate data type. Maybe it can't... and I have to add entirely new functionality?
Now it would seem that I'm not actually creating a new object for each iteration of the loop. I think I would need to do something along the lines of:
portfolio.push_back(Stock(string, double, string));? I'm just not entirely sure how to get to that point.
Also, this code should be interchangeable with std::list as well as std::vector. This program compiles without error; however, there is zero output.
First of all, iterating over the vector only makes sense when it isn't empty. So remove the line:
for(StockIt i = portfolio.begin(); i != portfolio.end(); i++)
because otherwise the contents of this loop will never be executed.
Second, you have problems with your input reading: you use getline for the first field, which would read the values of all 3 fields on the line into the tempStr variable.
Third, you shouldn't use while (!fileIn.eof()) - the eof function only returns true after you have tried to read past the end of the file. Instead, use:
while (fileIn >> symbol >> price >> date) {
    // here you should create a Stock object and call push_back on the vector
}
This will read the three fields, which are separated by spaces.
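Putting those fixes together, the reading part might look like this (a sketch, assuming Stock has the Stock(string, double, string) constructor mentioned in the question):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "stock.h"

int main()
{
    std::vector<Stock> portfolio;
    std::ifstream fileIn("Results.txt");

    std::string symbol, date;
    double price;
    while (fileIn >> symbol >> price >> date)             // stops cleanly at EOF
        portfolio.push_back(Stock(symbol, price, date));  // create and store each object

    for (std::vector<Stock>::iterator i = portfolio.begin(); i != portfolio.end(); ++i)
        std::cout << i->getSymbol() << '\n'
                  << i->getPrice() << '\n'
                  << i->getDate() << '\n';
    return 0;
}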
A few issues in your code:
The first for loop runs on an empty portfolio vector; since the vector is not initialized (no objects are pushed to it), begin() and end() are equal, so the loop body never runs.
You should read line by line from the fstream until EOF, then push objects to the vector.
Each line you read should be split (tokenized) into the 3 parts, and a new Stock object created and pushed to the vector.
As a side note, whenever you use an STL iterator in a for loop, prefer ++itr; it can run faster.

How to read in a data file of unknown dimensions in C/C++

I have a data file which contains data in row/colum form. I would like a way to read this data in to a 2D array in C or C++ (whichever is easier) but I don't know how many rows or columns the file might have before I start reading it in.
At the top of the file is a commented line giving a series of numbers relating to what each column holds. Each row is holding the data for each number at a point in time, so an example data file (a small one - the ones i'm using are much bigger!) could be like:
# 1 4 6 28
21.2 492.1 58201.5 586.2
182.4 1284.2 12059. 28195.2
.....
I am currently using Python to read in the data using numpy.loadtxt which conveniently splits the data in row/column form whatever the data array size, but this is getting quite slow. I want to be able to do this reliably in C or C++.
I can see some options:
Add a header tag with the dimensions from my extraction program
# 1 4 6 28
# xdim, ydim
21.2 492.1 58201.5 586.2
182.4 1284.2 12059. 28195.2
.....
but this requires rewriting my extraction programs and programs which use the extracted data, which is quite intensive.
Store the data in a database file eg. MySQL, SQLite etc. Then the data could be extracted on demand. This might be a requirement further along in the development process so it might be good to look into anyway.
Use Python to read in the data and wrap C code for the analysis. This might be easiest in the short run.
Use wc on Linux to find the number of lines and the number of words in the header to find the dimensions.
echo $((`cat FILE | wc -l` - 1)) # get number of rows (-1 for header line)
echo $((`cat FILE | head -n 1 | wc -w` - 1)) # get number of columns (-1 for '#' character)
Use C/C++ code
This question is mostly related to point 5 - whether there is an easy and reliable way to do this in C/C++. Otherwise, any other suggestions would be welcome.
Thanks
Create the table as a vector of vectors:
std::vector<std::vector<double> > table;
Inside an infinite (while (true)) loop:
Read line:
std::string line;
std::getline(ifs, line);
If something went wrong (probably EOF), exit the loop:
if(!ifs)
break;
Skip the line if it's empty or a comment (test for emptiness first, so line[0] is safe):
if (line.empty() || line[0] == '#')
    continue;
Read the row contents into a vector. Note: parse from the line you just read, not from ifs itself, or the istream_iterator will swallow every remaining number in the file into a single row:
std::vector<double> row;
std::istringstream iss(line);
std::copy(std::istream_iterator<double>(iss),
          std::istream_iterator<double>(),
          std::back_inserter(row));
Add the row to the table:
table.push_back(row);
Once you're out of the loop, table contains the data:
table.size() is the number of rows
table[i] is row i
table[i].size() is the number of columns in row i
table[i][j] is the element in the j-th column of row i
How about:
Load the file.
Count the number of rows and columns.
Close the file.
Allocate the memory needed.
Load the file again.
Fill the array with data.
Every .obj (3D model file) loader I've seen uses this method. :)
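A sketch of that two-pass approach for the question's format (whitespace-separated doubles with one '#' header line; the filename is a placeholder):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    std::ifstream in("data.txt");
    std::string line;

    // Pass 1: count rows and columns.
    std::size_t rows = 0, cols = 0;
    while (std::getline(in, line))
    {
        if (line.empty() || line[0] == '#') continue;    // skip header/blank lines
        if (cols == 0)
        {
            std::istringstream iss(line);
            double v;
            while (iss >> v) ++cols;                     // columns in first data row
        }
        ++rows;
    }

    // Allocate exactly what is needed.
    std::vector<std::vector<double> > table(rows, std::vector<double>(cols));

    // Pass 2: rewind and fill the array.
    in.clear();                                          // clear the EOF flag first
    in.seekg(0);
    std::size_t r = 0;
    while (std::getline(in, line))
    {
        if (line.empty() || line[0] == '#') continue;
        std::istringstream iss(line);
        for (std::size_t c = 0; c < cols; ++c)
            iss >> table[r][c];
        ++r;
    }
}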
Figured out a way to do this. Thanks go mostly to Manuel as it was the most informative answer.
std::vector< std::vector<double> > readIn2dData(const char* filename)
{
    /* Function takes a char* filename argument and returns a
     * 2d dynamic array containing the data
     */
    std::vector< std::vector<double> > table;
    std::fstream ifs;

    /* open file */
    ifs.open(filename);

    while (true)
    {
        std::string line;
        getline(ifs, line);
        if (!ifs)
            // mainly catch EOF
            break;
        if (line.empty() || line[0] == '#')
            // catch empty lines or comment lines (test empty first!)
            continue;

        std::istringstream ss(line);
        std::vector<double> row;
        double buf;
        while (ss >> buf)
            row.push_back(buf);
        table.push_back(row);
    }

    ifs.close();
    return table;
}
Basically, create a vector of vectors. The only difficulty was splitting by whitespace, which is taken care of by the stringstream object. This may not be the most efficient way of doing it, but it certainly works in the short term!
Also, I'm looking for a replacement for the deprecated atof function, but never mind. It just needs some memory-leak checking (it shouldn't have any, since most of the objects are standard library objects) and I'm done.
Thanks for all your help
Do you need a square or a ragged matrix? If the latter, create a structure like this:
std::vector < std::vector <double> > data;
Now read one line at a time into a:
vector <double> d;
and add the vector to the ragged matrix:
data.push_back( d );
All data structures involved are dynamic, and will grow as required.
I've seen your answer, and while it's not bad, I don't think it's ideal either. At least as I understand your original question, the first comment basically specifies how many columns you'll have in each of the remaining rows. e.g. the one you've given ("1 4 6 28") contains four numbers, which can be interpreted as saying each succeeding line will contain 4 numbers.
Assuming that's correct, I'd use that data to optimize reading the data. In particular, after that, (again, as I understand it) the file just contains row after row of numbers. That being the case, I'd put all the numbers together into a single vector, and use the number of columns from the header to index into the rest:
class matrix {
    std::vector<double> data;
    int columns;
public:
    // a matrix is 2D, with a fixed number of columns and an arbitrary number of rows.
    matrix(int cols) : columns(cols) {}

    // just read raw data from the stream into the vector:
    std::istream &read(std::istream &stream) {
        std::copy(std::istream_iterator<double>(stream),
                  std::istream_iterator<double>(),
                  std::back_inserter(data));
        return stream;
    }

    // Do 2D addressing by converting rows/columns to a linear address.
    // If you want to check subscripts, use vector.at(x) instead of vector[x].
    double operator()(size_t row, size_t col) {
        return data[row * columns + col];
    }
};
This is all pretty straightforward -- the matrix knows how many columns it has, so you can do x,y indexing into the matrix, even though it stores all its data in a single vector. Reading the data from the stream just means copying that data from the stream into the vector. To deal with the header, and to simplify creating a matrix from the data in a stream, we can use a simple function like this:
matrix read_data(std::string name) {
    std::ifstream in(name.c_str());

    // read one line from the stream.
    std::string line;
    std::getline(in, line);

    // break that up into space-separated groups:
    std::istringstream temp(line);
    std::vector<std::string> counter;
    std::copy(std::istream_iterator<std::string>(temp),
              std::istream_iterator<std::string>(),
              std::back_inserter(counter));

    // the number of columns is the number of groups, -1 for the leading '#'.
    matrix m(counter.size() - 1);

    // Read the remaining data into the matrix.
    m.read(in);
    return m;
}
As it's written right now, this depends on your compiler implementing the "Named Return Value Optimization" (NRVO). Without that, the compiler will copy the entire matrix (probably a couple of times) when it's returned from the function. With the optimization, the compiler pre-allocates space for a matrix, and has read_data() generate the matrix in place.
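Usage then reduces to a single call plus 2D indexing (the filename is a placeholder):

#include <iostream>
// ... matrix class and read_data() as defined above ...

int main()
{
    matrix m = read_data("data.txt");   // hypothetical file in the format shown
    std::cout << m(0, 0) << '\n';       // row 0, column 0
    std::cout << m(1, 3) << '\n';       // row 1, column 3
}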