Reading key-value pairs as fast as possible in C++ from file

I have a file with roughly 2 million lines like this:
2s,3s,4s,5s,6s 100000
2s,3s,4s,5s,8s 101
2s,3s,4s,5s,9s 102
The first comma-separated part indicates a poker result in Omaha, while the latter score is an example "value" of the cards. It is very important for me to read this file as fast as possible in C++, but I cannot seem to get it to be faster than a simple approach in Python (4.5 seconds) using the base library.
Using the Qt framework (QHash and QString), I was able to read the file in 2.5 seconds in release mode. However, I do not want to have the Qt dependency. The goal is to allow quick simulations using those 2 million lines, i.e. some_container["2s,3s,4s,5s,6s"] to yield 100000 (though applying a translation function or using a non-readable format is fine if it allows faster reading).
My current implementation is extremely slow (8 seconds!):
std::map<std::string, int> get_file_contents(const char *filename)
{
    std::map<std::string, int> outcomes;
    std::ifstream infile(filename);
    std::string c;
    int d;
    while (infile >> c >> d)
    {
        //std::cout << c << d << std::endl;
        outcomes[c] = d;
    }
    return outcomes;
}
What can I do to read this data into some kind of a key/value hash as fast as possible?
Note: The first 16 characters are always going to be there (the cards), while the score can go up to around 1 million.
Some further information gathered from various comments:
sample file: http://pastebin.com/rB1hFViM
RAM restriction: 750 MB
initialization time restriction: 5 s
computation time per hand restriction: 0.5 s

As I see it, there are two bottlenecks in your code.
Bottleneck 1
I believe that the file reading is the biggest problem. Having a binary file is the fastest option: not only can you read it directly into an array with a raw istream::read in a single operation (which is very fast), but you can even map the file into memory if your OS supports it. Here is a link that's very informative on how to use memory-mapped files.
Bottleneck 2
std::map is usually implemented as a self-balancing BST that stores all the data in order. This makes insertion an O(log n) operation. You can change it to std::unordered_map, which uses a hash table instead. A hash table has constant-time insertion as long as the number of collisions is low. Since the number of elements to read is known, you can reserve a suitable number of buckets before inserting the elements. Keep in mind that you need more buckets than the number of elements that will be inserted in order to keep collisions low.
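For instance, a minimal sketch of that change, keeping the question's function shape (the reserve count assumes the ~2 million lines mentioned above):
#include <fstream>
#include <string>
#include <unordered_map>

std::unordered_map<std::string, int> get_file_contents(const char *filename)
{
    std::unordered_map<std::string, int> outcomes;
    outcomes.reserve(2000000);     // pre-size the bucket count for ~2 million entries
    std::ifstream infile(filename);
    std::string key;
    int value;
    while (infile >> key >> value) // stops cleanly at EOF or on bad input
        outcomes[key] = value;
    return outcomes;
}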

Ian Medeiros already mentioned the two major bottlenecks.
A few thoughts about data structures:
The number of different cards is known: 4 suits of 13 cards each -> 52 cards.
So a card requires less than 6 bits to store. Your current file format uses 24 bits per card (including the comma).
So by simply enumerating the cards and omitting the comma you can save ~2/3 of the file size, and you can determine a card by reading only one character per card.
If you want to keep the file text-based, you may use a-m, n-z, A-M and N-Z for the four suits.
Another thing that bugs me is the string-based map. String operations are inefficient.
One hand contains 5 cards.
That means 52^5 possibilities if we keep it simple and do not consider the already-drawn cards.
-> 52^5 = 380,204,032 < 2^32
That means we can enumerate every possible hand with a uint32 number. By defining a special sorting scheme of the cards (since order is irrelevant), we can assign a number to the hand and use this number as the key in our map, which is a lot faster than using strings.
If we have enough memory (1.5 GB), we do not even need a map; we can simply use an array.
Of course most cells are unused, but access may be very fast. We can even omit the ordering of the cards, since the cells are present whether we fill them or not, so we can use them all. But in this case you should not forget to fill all possible permutations of the hand read from the file.
With this scheme we may also be able to further optimize file reading speed if we only store the hand's number and the rating, so that only 2 values need to be parsed.
In fact, we can optimize the required storage space by using a more complex addressing scheme for the different hands, since in reality there are only 52*51*50*49*48 = 311,875,200 possible hands. Additionally, the ordering is irrelevant as mentioned, but I think that this saving is not worth the increased complexity of encoding the hands.
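Sticking with the simpler scheme, here's a minimal sketch of the enumeration idea. The rank and suit characters are assumptions based on the sample file, and the five cards are assumed to be pre-sorted so every ordering of the same hand produces the same key:
#include <cstdint>

// Map a two-character card like "2s" to an index in 0..51.
std::uint32_t card_index(char rank, char suit)
{
    static const char ranks[] = "23456789TJQKA"; // assumed rank characters
    static const char suits[] = "shdc";          // assumed suit characters
    std::uint32_t r = 0, s = 0;
    while (ranks[r] != rank) ++r; // assumes valid input
    while (suits[s] != suit) ++s;
    return s * 13 + r;
}

// Pack five sorted card indices into one key: 5 * 6 bits = 30 bits < 32.
std::uint32_t hand_key(const std::uint32_t cards[5])
{
    std::uint32_t key = 0;
    for (int i = 0; i < 5; ++i)
        key = (key << 6) | cards[i]; // 6 bits per card
    return key;
}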

A simple idea might be to use the C API, which is considerably simpler:
#include <cstdio>

int n;
char s[128];
while (std::fscanf(stdin, "%127s %d", s, &n) == 2)
{
    outcomes[s] = n;
}
A rough test showed a considerable speedup for me compared to the iostreams library.
Further speedups may be achieved by storing the data in a contiguous array, e.g. a vector of std::pair<std::string, int>; it depends on whether your data is already sorted and how you need to access it later.
For a serious solution, though, you should probably step back further and think of a better way to represent your data. For example, a fixed-width, binary encoding would be much more space-efficient and faster to parse, since you won't need to look ahead for line endings or parse strings.
Update: From some quick experimentation, I've found it fairly fast to first read the entire file into memory and then perform alternating strtok calls with either " " or "\n" as the delimiter; whenever a pair of calls succeeds, apply strtol to the second pointer to parse the integer. Here's a skeleton:
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
int main()
{
    std::vector<char> data;

    // Read entire file to memory
    {
        data.reserve(100000000);
        char buf[4096];
        for (std::size_t n; (n = std::fread(buf, 1, sizeof buf, stdin)) > 0; )
        {
            data.insert(data.end(), buf, buf + n);
        }
        data.push_back('\0');
    }

    // Tokenize the in-memory data
    char * p = &data.front();
    for (char * q = std::strtok(p, " "); q; q = std::strtok(nullptr, " "))
    {
        if (char * r = std::strtok(nullptr, "\n"))
        {
            char * e;
            errno = 0;
            int const n = std::strtol(r, &e, 10);
            if (*e != '\0' || errno != 0) { continue; }

            // At this point we have a data pair:
            // * the string is "q"
            // * the integer is "n"
        }
    }
}

Related

C++: How to read a lot of data from formatted text files into program?

I'm writing a CFD solver for specific fluid problems. So far the mesh is generated every time the simulation runs, and when the geometry or the fluid properties change, the program needs to be recompiled.
For small problems with a low number of cells this works just fine, but for cases with over 1 million cells where the fluid properties need to be changed very often, it is quite inefficient.
Obviously, we need to store the simulation setup data in a config file and the geometry information in a formatted mesh file.
Simulation.config file
% Dimension: 2D or 3D
N_Dimension= 2
% Number of fluid phases
N_Phases= 1
% Fluid density (kg/m3)
Density_Phase1= 1000.0
Density_Phase2= 1.0
% Kinematic viscosity (m^2/s)
Viscosity_Phase1= 1e-6
Viscosity_Phase2= 1.48e-05
...
Geometry.mesh file
% Dimension: 2D or 3D
N_Dimension= 2
% Points (index: x, y, z)
N_Points= 100
x0 y0
x1 y1
...
x99 y99
% Faces (Lines in 2D: P1->p2)
N_Faces= 55
0 2
3 4
...
% Cells (polygons in 2D: Cell-Type and Points clock-wise). 6: triangle; 9: quad
N_Cells= 20
9 0 1 6 20
9 1 3 4 7
...
% Boundary Faces (index)
Left_Faces= 4
0
1
2
3
Bottom_Faces= 6
7
8
9
10
11
12
...
It's easy to write config and mesh information to formatted text files. The problem is: how do we read these data efficiently into the program? I wonder if there is an easy-to-use C++ library to do this job.
Well, well.
You can implement your own API based on a finite elements collection, a dictionary, some regex and, after all, apply best practice according to some international standard.
Or you can take a look at these:
GMSH_IO
OpenMesh
I just used OpenMesh in my last C++ OpenGL project.
As a first-iteration solution to just get something tolerable, take @JosmarBarbosa's suggestion and use an established format for your kind of data, which probably also has free, open-source libraries you can use. One example is OpenMesh, developed at RWTH Aachen. It supports:
Representation of arbitrary polygonal (the general case) and pure triangle meshes (providing more efficient, specialized algorithms)
Explicit representation of vertices, halfedges, edges and faces.
Fast neighborhood access, especially the one-ring neighborhood (see below).
[Customization]
But if you really need to speed up your mesh data reading, consider doing the following:
Separate the limited-size meta-data from the larger, unlimited-size mesh data;
Place the limited-size meta-data in a separate file and read it whichever way you like, it doesn't matter.
Arrange the mesh data as several arrays of fixed-size elements or fixed-size structures (e.g. cells, faces, points, etc.).
Store each of the fixed-width arrays of mesh data in its own file, without streaming individual values anywhere: just read or write the whole array as-is, directly. A sketch of how such a read could look follows this list. You'll know the appropriate size of the read either from the file size or from the metadata.
Finally, you could avoid explicit reading altogether and use memory-mapping for each of the data files. See:
fastest technique to read a file into memory?
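Here's a sketch of what such an as-is read could look like; the Point record and the element-count-from-file-size logic are illustrative assumptions:
#include <cstddef>
#include <fstream>
#include <vector>

struct Point { double x, y; }; // hypothetical fixed-size record

std::vector<Point> read_points(const char *path)
{
    // Open at the end to learn the file size, then derive the element count.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    const std::size_t bytes = static_cast<std::size_t>(in.tellg());
    std::vector<Point> pts(bytes / sizeof(Point));

    // One single read for the whole array - no per-value streaming.
    in.seekg(0);
    in.read(reinterpret_cast<char *>(pts.data()),
            static_cast<std::streamsize>(bytes));
    return pts;
}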
Notes/caveats:
If you write and read binary data on systems with different memory layouts for certain values (e.g. little-endian vs. big-endian), you'll need to shuffle the bytes around in memory. See also this SO question about endianness.
It might not be worth optimizing the reading speed as much as possible. Consider Amdahl's law: only optimize to the point where reading is no longer a significant fraction of your overall execution time. It's better to lose a few percentage points of execution time but get human-readable data files which can be used with other tools that support an established format.
In the following answer I assume:
That if the first character of a line is %, then the line shall be ignored as a comment.
Any other line is structured exactly as follows: identifier= value.
The code I present will correctly parse a config file following the mentioned assumptions. This is the code (I hope that all needed explanation is in the comments):
#include <fstream>       //required for file IO
#include <iostream>      //required for console IO
#include <string>        //required for std::string and std::stod
#include <unordered_map> //required for creating a hashtable to store the identifiers

int main()
{
    std::unordered_map<std::string, double> identifiers;
    std::string configPath;
    std::cout << "Enter config path: ";
    std::cin >> configPath;
    std::ifstream config(configPath); //open the specified file
    if (!config.is_open()) //error if failed to open file
    {
        std::cerr << "Cannot open config file!";
        return -1;
    }
    std::string line;
    while (std::getline(config, line)) //read each line of the file
    {
        if (line.empty() || line[0] == '%') //line is empty or a comment
            continue;
        std::size_t identifierLength = 0;
        while (line[identifierLength] != '=')
            ++identifierLength;
        identifiers.emplace(
            line.substr(0, identifierLength),
            std::stod(line.substr(identifierLength + 1)) //stod skips leading whitespace
        ); //add entry to identifiers
    }
    for (const auto& entry : identifiers)
        std::cout << entry.first << " = " << entry.second << '\n';
}
After reading the identifiers you can, of course, do whatever you need to do with them; I just print them as an example to show how to fetch them. For more information about std::unordered_map look here. For a lot of very good information about writing parsers, have a look here instead.
If you want your program to process input faster, insert the following line at the beginning of main: std::ios_base::sync_with_stdio(false);. This desynchronizes C++ IO from C IO and, as a result, makes it faster.
Assuming:
you don't want to use an existing format for meshes
you don't want to use a generic text format (json, yml, ...)
you don't want a binary format (even though you want something efficient)
In a nutshell, you really need your own text format.
You can use any parser generator to get started. While you could probably parse your config file as it is using only regexps, they can be really limiting in the long run, so I'll suggest a context-free grammar parser generated with Boost Spirit X3.
AST
The Abstract Syntax Tree will hold the final result of the parser.
#include <string>
#include <utility>
#include <vector>
#include <variant>
namespace AST {
using Identifier = std::string; // Variable name.
using Value = std::variant<int,double>; // Variable value.
using Assignment = std::pair<Identifier,Value>; // Identifier = Value.
using Root = std::vector<Assignment>; // Whole file: all assignments.
}
Parser
Grammar description:
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/home/x3.hpp>
namespace x3 = boost::spirit::x3;

namespace Parser {
using namespace x3;
// Line: Identifier = value
const x3::rule<class assignment, AST::Assignment> assignment = "assignment";
// Line: comment
const x3::rule<class comment> comment = "comment";
// Variable name
const x3::rule<class identifier, AST::Identifier> identifier = "identifier";
// File
const x3::rule<class root, AST::Root> root = "root";
// Any valid value in the config file
const x3::rule<class value, AST::Value> value = "value";
// Semantic action
auto emplace_back = [](const auto& ctx) {
x3::_val(ctx).emplace_back(x3::_attr(ctx));
};
// Grammar
const auto assignment_def = skip(blank)[identifier >> '=' >> value];
const auto comment_def = '%' >> omit[*(char_ - eol)];
const auto identifier_def = lexeme[alpha >> +(alnum | char_('_'))];
const auto root_def = *((comment | assignment[emplace_back]) >> eol) >> omit[*blank];
const auto value_def = double_ | int_;
BOOST_SPIRIT_DEFINE(root, assignment, comment, identifier, value);
}
Usage
// Takes iterators on string/stream...
// Returns the AST of the input.
template<typename IteratorType>
AST::Root parse(IteratorType& begin, const IteratorType& end) {
AST::Root result;
bool parsed = x3::parse(begin, end, Parser::root, result);
if (!parsed || begin != end) {
throw std::domain_error("Parser received an invalid input.");
}
return result;
}
Evolutions
To change where blank spaces are allowed, add/move x3::skip(blank) in the xxxx_def expressions.
Currently the file must end with a newline. Rewriting the root_def expression can fix that.
You'll certainly want to know why the parsing failed on invalid inputs. See the error handling tutorial for that.
You're just a few rules away from parsing more complicated things:
// 100 X_n Y_n
const auto point_def = lit("N_Points") >> ':' >> int_ >> eol >> *(double_ >> double_ >> eol);
If you don't need a specific text file format, but have a lot of data and care about performance, I recommend using an existing data serialization framework instead.
E.g. Google Protocol Buffers allow efficient serialization and deserialization with very little code. The file is binary, so it is typically much smaller than a text file, and binary serialization is much faster than parsing text. It also supports structured data (arrays, nested structs), data versioning, and other goodies.
https://developers.google.com/protocol-buffers/
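As a rough sketch of how little code that takes in C++ (the Mesh message and mesh.pb.h are hypothetical, generated by protoc from a .proto schema you would define):
#include <fstream>
#include "mesh.pb.h" // hypothetical header generated by protoc

// Serialize a mesh to a compact binary file.
bool save_mesh(const Mesh& mesh, const char* path)
{
    std::ofstream out(path, std::ios::binary);
    return mesh.SerializeToOstream(&out); // protobuf's built-in serialization
}

// Parse it back; much faster than parsing a text format.
bool load_mesh(Mesh* mesh, const char* path)
{
    std::ifstream in(path, std::ios::binary);
    return mesh->ParseFromIstream(&in);
}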

Index large txt file

I have a large file (500 million records).
The file has two columns (tab-delimited), as follows:
1 4590
3 1390
4 4590
5 4285
7 8902
8 9000
...
All values in the first column are ordered numerically, but with gaps (e.g. 1, then 3, then 4...).
I would like to index that file to be able to access the value in column 2 based on the value from column 1 (which I will call the key).
For example, if I submit 8, it should return 9000.
I have started by creating an index as follows:
// Record each entry into a structure
struct Record{
    int gi;   //first column
    int taxa; //second column
};

Record buffer;
ofstream BinaryFile("large_file_indexed.bin", ios::binary);
ifstream inputFile("infile.dat");

//Write to binary file
while( inputFile >> buffer.gi >> buffer.taxa ){
    BinaryFile.write( (char *) &buffer, sizeof(Record) );
}
BinaryFile.close();
OK, what I'm doing above is just creating a binary index file for the entries and saving it to disk. This is working as expected.
The problem comes now, and since I'm not an expert I would appreciate your advice.
The idea is to read the binary file and get a specific record:
//Read binary file
ifstream ReadBinary("large_file_indexed.bin", ios::binary);
int idx = 8; // Which key do we search for?
while( ReadBinary.read( (char *) &buffer, sizeof(Record) ) )
{
    if(idx == buffer.gi) // If we find the key, print the corresponding value
    {
        cout << "Found key " << buffer.gi << " Taxa:" << buffer.taxa << endl;
        break;
    }
}
This returns the expected value: since we are asking for the value corresponding to key 8, it returns 9000.
The thing is that it still takes too long to get the value, and I was wondering how I can make it faster. With seekg I can jump to a specific position, but I don't know which position corresponds to the key I want. So, in other words, can I jump directly to the position where the key is and get the corresponding value? I'm confused about how to get the position for a particular key and jump to that position in the binary file. Maybe I should index my input file differently, or I'm missing something?
Thanks for your comments.
If you can't use a database or a b-tree library, and don't want to invest in developing yet another b-tree library, you could consider one of the following two approaches.
Both assume that the binary index file is sorted, and both take advantage of the fixed-size records.
1. Simple heuristic approach
If there would be no gap, to find the n-th record (numbering starting at one) you would do:
if (ReadBinary.seekg(sizeof(Record)*(n-1))
    && ReadBinary.read( (char*)&buffer, sizeof(Record))) {
    // process record
}
else {
    // record not found (certainly beyond eof)
}
But you can have gaps. This means that, if there are no duplicates, element n would be at this position or before it. So just read and rewind as long as necessary:
if (! ReadBinary.seekg(sizeof(Record)*(n-1))) {   // try to position
    ReadBinary.clear();                           // if we couldn't position,
    ReadBinary.seekg(-static_cast<std::streamoff>(sizeof(Record)),
                     ios_base::end);              // go to the last record
}
while (ReadBinary.read( (char*)&buffer, sizeof(Record)) && buffer.gi > n) {
    ReadBinary.seekg(-2*static_cast<std::streamoff>(sizeof(Record)),
                     ios_base::cur);              // rewind two records
}
if (ReadBinary && buffer.gi == n) {
    // record found
}
else {
    // record not found
}
2. Dichotomic approach
Of course, if you have many gaps, this heuristic approach will quickly become too slow as the number searched for increases.
You could therefore opt for a dichotomic search (a.k.a. binary search): with seekg(), go to the end of the file and use tellg() to determine the size of the file, which you can translate into a number of records.
Cut that number in two, position on the record in the middle, read it, check whether the searched number is smaller or bigger than the number read, and restart with the new bounds of the search until you find the right position. This is the same principle you would use to search in an array.
This is very efficient, as you need at most log(n)/log(2) reads to find any number. So for any of the 500,000,000 numbers, you'd need at most 29 reads!
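A minimal sketch of that dichotomic search over the fixed-size records, reusing the question's Record struct and assuming the file is sorted by gi:
#include <fstream>
#include <ios>

struct Record {
    int gi;   // key (first column)
    int taxa; // value (second column)
};

// Returns true and fills 'out' if 'key' is present in the sorted binary file.
bool find_record(std::ifstream& in, int key, Record& out)
{
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    std::streamoff lo = 0;
    std::streamoff hi = size / static_cast<std::streamoff>(sizeof(Record)) - 1;
    while (lo <= hi) {
        const std::streamoff mid = lo + (hi - lo) / 2;
        in.seekg(mid * static_cast<std::streamoff>(sizeof(Record)));
        if (!in.read(reinterpret_cast<char*>(&out), sizeof(Record)))
            return false;                // I/O error
        if (out.gi == key) return true;  // found it
        if (out.gi < key)  lo = mid + 1; // continue in the upper half
        else               hi = mid - 1; // continue in the lower half
    }
    return false;                        // key falls in a gap
}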
3. Conclusions
Of course there are other feasible approaches as well. But in the end, this is already pretty good, even though it would be outperformed by any database or a well-crafted b-tree library, because b-trees reduce disk head movement by an astute regrouping of nodes into blocks that are optimized to be read at once with minimal disk overhead. This reduces the number of disk accesses to log(n)/log(b), where b is the number of nodes in a block. For example, with b=10, searching the 500,000,000 elements would require at most 9 reads from disk.

How to get more performance when reading file

My program downloads files from a site (via curl, every 30 minutes). It is possible for these files to reach 150 MB in size.
So I thought that getting data from these files could be inefficient (I search for a line every 5 seconds).
These files can have ~10,000 lines.
To parse such a file (whose values are separated by ",") I use a regex:
regex wzorzec("(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)");
There are 8 values.
Now I have to push them to a vector:
allys.push_back({ std::stoi(std::string(wynik[1])), nick, tag, stoi(string(wynik[4])), stoi(string(wynik[5])), stoi(string(wynik[6])), stoi(string(wynik[7])), stoi(string(wynik[8])) });
I use std::async to do that, but for 3 files (~7 MB) the processor jumps to 80% and the operation takes about 10 seconds. I read from an SSD, so slow I/O is not the culprit.
I'm reading the data line by line with fstream.
How can I speed this operation up?
Maybe I should parse these values and push them to SQL?
Best Regards
You can probably get some performance boost by avoiding regex and using something along the lines of std::strtok, or else just hard-coding a search for commas in your data. Regex has more power than you need just to look for commas. Next, if you use vector::reserve before you begin a sequence of push_back calls for any given vector, you will save a lot of time in both reallocation and moving memory around. If you are expecting a large vector, reserve room for it up front; a quick sketch follows.
This may not cover all available performance ideas, but I'd bet you will see an improvement.
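A rough sketch of both ideas, with an illustrative record type for the question's 8 comma-separated fields (no error handling for malformed lines):
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

struct Ally { int id; std::string nick; std::string tag; int v4, v5, v6, v7, v8; };

// Parse one writable, comma-separated line; strtok modifies the buffer.
Ally parse_line(char *line)
{
    Ally a;
    a.id   = std::atoi(std::strtok(line, ","));
    a.nick =           std::strtok(nullptr, ",");
    a.tag  =           std::strtok(nullptr, ",");
    a.v4   = std::atoi(std::strtok(nullptr, ","));
    a.v5   = std::atoi(std::strtok(nullptr, ","));
    a.v6   = std::atoi(std::strtok(nullptr, ","));
    a.v7   = std::atoi(std::strtok(nullptr, ","));
    a.v8   = std::atoi(std::strtok(nullptr, ","));
    return a;
}

// Usage idea: reserve before the push_back loop to avoid reallocations.
// std::vector<Ally> allys;
// allys.reserve(10000); // ~10,000 lines per file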
Your problem here is most likely the additional overhead introduced by the regular expression, since you're using many variable-length and greedy matches (the regex engine will try different alignments of the matches to find the largest matching result).
Instead, you might want to try parsing the lines manually. There are many different ways to achieve this. Here's one quick and dirty example (it's not flexible and has quite some duplicate code, but there's lots of room for optimization). It should explain the basic idea though:
#include <iostream>
#include <sstream>
#include <string>
#include <cstdlib>

const char *input = "1,Mario,Stuff,4,5,6,7,8";

struct data {
    int id;
    std::string nick;
    std::string tag;
} myData;

int main(int argc, char **argv){
    char buffer[256];
    std::istringstream in(input);

    // Read an entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.id = atoi(buffer); // convert and store

    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.nick = buffer; // store

    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.tag = buffer; // store

    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Some test output
    std::cout << "id: " << myData.id << "\nnick: " << myData.nick << "\ntag: " << myData.tag << std::endl;
    return 0;
}
Note that there isn't any error handling in case entries are too long or too short (or broken in some other way).
Console output:
id: 1
nick: Mario
tag: Stuff

Reading a CSV to vectors in objects

I'm trying to write code that will, on a line-by-line basis, pass numerical data from a CSV to an object's vector. The object's structure is as follows: the object itself (let's call it CS) is an enclosed space, within which resides a vector of objects (called Points), each of which has a vector of objects (Features) with 3 variables. The first two variables in these Features are descriptors of the feature, and the third is the actual value taken by a specific Point[i].Feature[j]. Each Point has the same set of Features, and aside from the third value being different, the descriptors are likewise identical. (Edit: sadly I can't change this structure, as it's part of a larger framework which is out of my hands.)
Currently, my CSV has one column per feature; the first two rows are the descriptors, which apply to all points, and the rest of the rows are each individual point's third feature value. It's been a while since my introductory C++ course and I'm finding it hard to think of a fast way to implement this, as my CSVs could become fairly large (my current upper limit is 50000 points with 2000 features each, and this will probably grow), and I wouldn't want to do something silly like rereading the first two lines for each point. I've looked around, and most CSV solutions involve string CSVs, which I don't have to deal with, and simpler array objects in which the CSV is stored. The problem for me is simply going up a level each time I reach the end of a line and restarting the procedure for the next point, and I can't think of a way to do it. Any tips?
You could just create a temporary array of Descriptor objects which holds the two descriptors for each column and then read in your first row and create your Point objects from that. Afterwards you can just copy the descriptors from the Point a row above, e.g. Point[i-csvWidth], and deallocate the Descriptor array.
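A small sketch of that idea, with hypothetical Descriptor/Feature types standing in for the framework's real ones:
#include <cstddef>
#include <vector>

struct Descriptor { char range; bool nomin; };             // shared per-column descriptors
struct Feature    { char range; bool nomin; char trait; }; // per-point feature

// Build one point's features by stamping the shared column descriptors
// onto the per-point trait values read from one CSV row.
std::vector<Feature> make_features(const std::vector<Descriptor>& cols,
                                   const std::vector<char>& traits)
{
    std::vector<Feature> feats;
    feats.reserve(cols.size());
    for (std::size_t j = 0; j < cols.size(); ++j)
        feats.push_back(Feature{cols[j].range, cols[j].nomin, traits[j]});
    return feats;
}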
I guess I was nearly there, just used the wrong kind of variable to read in.
fstream myFile;
myFile.open(filePath.c_str());
if(!myFile){
    cout << "File \"" << filePath << "\" doesn't exist, exiting program." << endl;
    exit(EXIT_FAILURE);
}
string line,line2,line3;
Points.clear();

//gets the range row
getline(myFile,line);
istringstream lineStream(line);
//gets the nomin row
getline(myFile,line2);
istringstream lineStream2(line2);
//gets the first person's traits
getline(myFile,line3);
istringstream lineStream3(line3);

CultVec originalCultVec = CultVec(RNG);
int val,val2,val3,val4;
while (lineStream >> val && lineStream2 >> val2 && lineStream3 >> val3) {
    Feature feature;
    feature.Range = (char)val;
    feature.Nomin = (bool)val2;
    feature.Trait = (char)val3;
    originalCultVec.addFeature(feature);
} // while
Points.push_back(originalCultVec);

while (getline(myFile,line)) {
    int i = 0;
    CultVec newVec = CultVec(RNG);
    istringstream lineStream4(line);
    while ( lineStream4 >> val4 ) {
        Feature newFeat = originalCultVec.getFeature(i);
        newFeat.Trait = (char)val4;
        newVec.addFeature(newFeat);
        i++;
    }
    Points.push_back(newVec);
}

How to read in a data file of unknown dimensions in C/C++

I have a data file which contains data in row/column form. I would like a way to read this data into a 2D array in C or C++ (whichever is easier), but I don't know how many rows or columns the file might have before I start reading it.
At the top of the file is a commented line giving a series of numbers relating to what each column holds. Each row holds the data for each number at a point in time, so an example data file (a small one; the ones I'm using are much bigger!) could look like:
# 1 4 6 28
21.2 492.1 58201.5 586.2
182.4 1284.2 12059. 28195.2
.....
I am currently using Python to read in the data with numpy.loadtxt, which conveniently splits the data in row/column form whatever the data array size, but this is getting quite slow. I want to be able to do this reliably in C or C++.
I can see some options:
Add a header tag with the dimensions from my extraction program
# 1 4 6 28
# xdim, ydim
21.2 492.1 58201.5 586.2
182.4 1284.2 12059. 28195.2
.....
but this requires rewriting my extraction programs and programs which use the extracted data, which is quite intensive.
Store the data in a database file eg. MySQL, SQLite etc. Then the data could be extracted on demand. This might be a requirement further along in the development process so it might be good to look into anyway.
Use Python to read in the data and wrap C code for the analysis. This might be easiest in the short run.
Use wc on linux to find the number of lines and number of words in the header to find the dimensions.
echo $((`cat FILE | wc -l` - 1)) # get number of rows (-1 for header line)
echo $((`cat FILE | head -n 1 | wc -w` - 1)) # get number of columns (-1 for '#' character)
Use C/C++ code
This question is mostly related to point 5: whether there is an easy and reliable way to do this in C/C++. Otherwise, any other suggestions would be welcome.
Thanks
Create the table as a vector of vectors:
std::vector<std::vector<double> > table;
Inside an infinite (while(true)) loop:
Read line:
std::string line;
std::getline(ifs, line);
If something went wrong (probably EOF), exit the loop:
if(!ifs)
break;
Skip the line if it's empty or a comment:
if(line.empty() || line[0] == '#')
    continue;
Read the row contents into a vector, parsing the line rather than the stream so that one row corresponds to one line:
std::vector<double> row;
std::istringstream is(line);
std::copy(std::istream_iterator<double>(is),
          std::istream_iterator<double>(),
          std::back_inserter(row));
Add the row to the table:
table.push_back(row);
Once you're out of the loop, "table" contains the data:
table.size() is the number of rows
table[i] is row i
table[i].size() is the number of cols. in row i
table[i][j] is the element at the j-th col. of row i
How about:
Load the file.
Count the number of rows and columns.
Close the file.
Allocate the memory needed.
Load the file again.
Fill the array with data.
Every .obj (3D model file) loader I've seen uses this method. :)
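A minimal sketch of that two-pass approach for the whitespace-separated format above, assuming comment lines start with '#':
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

std::vector< std::vector<double> > load_two_pass(const char *path)
{
    std::ifstream in(path);
    std::string line;

    // Pass 1: count the data rows so we can allocate once.
    std::size_t rows = 0;
    while (std::getline(in, line))
        if (!line.empty() && line[0] != '#')
            ++rows;

    std::vector< std::vector<double> > table;
    table.reserve(rows);

    // Pass 2: rewind and fill the table.
    in.clear();
    in.seekg(0);
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#')
            continue;
        std::istringstream ss(line);
        std::vector<double> row;
        for (double v; ss >> v; )
            row.push_back(v);
        table.push_back(row);
    }
    return table;
}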
Figured out a way to do this. Thanks go mostly to Manuel, as his was the most informative answer.
std::vector< std::vector<double> > readIn2dData(const char* filename)
{
    /* Function takes a char* filename argument and returns a
     * 2d dynamic array containing the data
     */

    std::vector< std::vector<double> > table;
    std::fstream ifs;

    /* open file */
    ifs.open(filename);

    while (true)
    {
        std::string line;
        double buf;

        getline(ifs, line);
        if (!ifs)
            // mainly catch EOF
            break;

        if (line.empty() || line[0] == '#')
            // catch empty lines or comment lines
            continue;

        std::stringstream ss(line);

        std::vector<double> row;
        while (ss >> buf)
            row.push_back(buf);

        table.push_back(row);
    }

    ifs.close();

    return table;
}
Basically, create a vector of vectors. The only difficulty was splitting by whitespace, which is taken care of by the stringstream object. This may not be the most effective way of doing it, but it certainly works in the short term!
Also, I'm looking for a replacement for the deprecated atof function, but never mind. It just needs some memory-leak checking (it shouldn't have any, since most of the objects are standard-library objects) and then I'm done.
Thanks for all your help
Do you need a square or a ragged matrix? If the latter, create a structure like this:
std::vector < std::vector <double> > data;
Now read one line at a time into a:
vector <double> d;
and add the vector to the ragged matrix:
data.push_back( d );
All data structures involved are dynamic, and will grow as required.
I've seen your answer, and while it's not bad, I don't think it's ideal either. At least as I understand your original question, the first comment line basically specifies how many columns you'll have in each of the remaining rows, e.g. the one you've given ("1 4 6 28") contains four numbers, which can be interpreted as saying each succeeding line will contain four numbers.
Assuming that's correct, I'd use that data to optimize reading. In particular, after that (again, as I understand it) the file just contains row after row of numbers. That being the case, I'd put all the numbers together into a single vector and use the number of columns from the header to index into it:
class matrix {
    std::vector<double> data;
    int columns;
public:
    // a matrix is 2D, with a fixed number of columns and an arbitrary number of rows.
    matrix(int cols) : columns(cols) {}

    // just read raw data from the stream into the vector:
    std::istream &read(std::istream &stream) {
        std::copy(std::istream_iterator<double>(stream),
                  std::istream_iterator<double>(),
                  std::back_inserter(data));
        return stream;
    }

    // Do 2D addressing by converting rows/columns to a linear address.
    // If you want to check subscripts, use vector.at(x) instead of vector[x].
    double operator()(size_t row, size_t col) {
        return data[row*columns+col];
    }
};
This is all pretty straightforward: the matrix knows how many columns it has, so you can do x,y indexing into the matrix, even though it stores all its data in a single vector. Reading the data from the stream just means copying that data from the stream into the vector. To deal with the header, and to simplify creating a matrix from the data in a stream, we can use a simple function like this:
matrix read_data(std::string name) {
    // read one line from the stream.
    std::ifstream in(name.c_str());
    std::string line;
    std::getline(in, line);

    // break that up into space-separated groups:
    std::istringstream temp(line);
    std::vector<std::string> counter;
    std::copy(std::istream_iterator<std::string>(temp),
              std::istream_iterator<std::string>(),
              std::back_inserter(counter));

    // the number of columns is the number of groups, -1 for the leading '#'.
    matrix m(counter.size()-1);

    // Read the remaining data into the matrix.
    m.read(in);
    return m;
}
}
As it's written right now, this depends on your compiler implementing the "Named Return Value Optimization" (NRVO). Without that, the compiler will copy the entire matrix (probably a couple of times) when it's returned from the function. With the optimization, the compiler pre-allocates space for a matrix, and has read_data() generate the matrix in place.