C++: How to read a lot of data from formatted text files into a program?

I'm writing a CFD solver for specific fluid problems. So far the mesh is generated every time the simulation runs, and whenever the geometry or fluid properties change, the program needs to be recompiled.
For small problems with a low number of cells this works just fine, but for cases with over 1 million cells where the fluid properties change very often, it is quite inefficient.
Obviously, we need to store the simulation setup in a config file and the geometry information in a formatted mesh file.
Simulation.config file
% Dimension: 2D or 3D
N_Dimension= 2
% Number of fluid phases
N_Phases= 1
% Fluid density (kg/m3)
Density_Phase1= 1000.0
Density_Phase2= 1.0
% Kinematic viscosity (m^2/s)
Viscosity_Phase1= 1e-6
Viscosity_Phase2= 1.48e-05
...
Geometry.mesh file
% Dimension: 2D or 3D
N_Dimension= 2
% Points (index: x, y, z)
N_Points= 100
x0 y0
x1 y1
...
x99 y99
% Faces (Lines in 2D: P1->P2)
N_Faces= 55
0 2
3 4
...
% Cells (polygons in 2D: Cell-Type and Points clock-wise). 6: triangle; 9: quad
N_Cells= 20
9 0 1 6 20
9 1 3 4 7
...
% Boundary Faces (index)
Left_Faces= 4
0
1
2
3
Bottom_Faces= 6
7
8
9
10
11
12
...
It's easy to write config and mesh information to formatted text files. The problem is how to read this data efficiently into the program. I wonder if there is an easy-to-use C++ library to do this job.

Well, well
You can implement your own API based on a finite-element collection, a dictionary, and some regex, and then apply best practice according to some international standard.
Or you can take a look at these:
GMSH_IO
OpenMesh
I just used OpenMesh in my last implementation, a C++ OpenGL project.

As a first-iteration solution to just get something tolerable, take @JosmarBarbosa's suggestion and use an established format for your kind of data, which also probably has free, open-source libraries for you to use. One example is OpenMesh, developed at RWTH Aachen. It supports:
Representation of arbitrary polygonal meshes (the general case) and pure triangle meshes (providing more efficient, specialized algorithms)
Explicit representation of vertices, halfedges, edges and faces.
Fast neighborhood access, especially the one-ring neighborhood (see below).
[Customization]
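For a sense of scale, reading a mesh with OpenMesh boils down to a few lines. A minimal sketch, assuming OpenMesh is installed; the mesh type and file name here are placeholders:
#include <iostream>
#include <OpenMesh/Core/IO/MeshIO.hh> // must come before the mesh kernel header
#include <OpenMesh/Core/Mesh/PolyMesh_ArrayKernelT.hh>

typedef OpenMesh::PolyMesh_ArrayKernelT<> Mesh; // general polygonal mesh

int main()
{
    Mesh mesh;
    // read_mesh() picks a reader based on the file extension (.off, .obj, .stl, ...)
    if (!OpenMesh::IO::read_mesh(mesh, "geometry.off"))
    {
        std::cerr << "Cannot read mesh file\n";
        return 1;
    }
    std::cout << "Read " << mesh.n_vertices() << " vertices, "
              << mesh.n_faces() << " faces\n";
}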
But if you really need to speed up your mesh data reading, consider doing the following:
Separate the limited-size meta-data from the larger, unlimited-size mesh data;
Place the limited-size meta-data in a separate file and read it whichever way you like, it doesn't matter.
Arrange the mesh data as several arrays of fixed-size elements or fixed-size structures (e.g. cells, faces, points, etc.).
Store each of the fixed-width arrays of mesh data in its own file, without streaming individual values anywhere: just read or write the array as-is, directly (a minimal sketch of such a read follows this list). You'll know the appropriate size of the read either from the file size or from the meta-data.
Finally, you could avoid explicit reading altogether and use memory-mapping for each of the data files. See
fastest technique to read a file into memory?
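Here is a minimal sketch of that bulk-read idea for the points array. The struct layout and function name are illustrative, and it assumes the file was written from an array of this exact struct on a compatible platform:
#include <cstddef>
#include <fstream>
#include <vector>

struct Point { double x, y, z; };

std::vector<Point> read_points(const char* path, std::size_t n_points)
{
    std::vector<Point> points(n_points); // n_points comes from the meta-data file
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(points.data()),
            points.size() * sizeof(Point)); // one bulk read, no per-value parsing
    return points;
}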
Notes/caveats:
If you write and read binary data on systems with a different memory layout for certain values (e.g. little-endian vs. big-endian), you'll need to shuffle the bytes around in memory. See also this SO question about endianness.
It might not be worth optimizing the reading speed as much as possible. You should consider Amdahl's law, and only optimize to the point where reading is no longer a significant fraction of your overall execution time. It's better to lose a few percentage points of execution time but keep human-readable data files which can be used with other tools supporting an established format.

In the following answer I assume:
that if the first character of a line is %, the whole line shall be ignored as a comment;
that any other line is structured exactly as follows: identifier= value.
The code I present will parse a config file following these assumptions correctly. This is the code (I hope that all needed explanation is in the comments):
#include <fstream>       // required for file IO
#include <iostream>      // required for console IO
#include <string>
#include <unordered_map> // required for creating a hashtable to store the identifiers

int main()
{
    std::unordered_map<std::string, double> identifiers;
    std::string configPath;
    std::cout << "Enter config path: ";
    std::cin >> configPath;

    std::ifstream config(configPath); // open the specified file
    if (!config.is_open()) // error if failed to open file
    {
        std::cerr << "Cannot open config file!";
        return -1;
    }

    std::string line;
    while (std::getline(config, line)) // read each line of the file
    {
        if (line.empty() || line[0] == '%') // skip empty lines and comments
            continue;
        const std::size_t identifierLength = line.find('=');
        if (identifierLength == std::string::npos) // skip lines without '='
            continue;
        identifiers.emplace(
            line.substr(0, identifierLength),
            std::stod(line.substr(identifierLength + 1)) // stod skips leading spaces
        ); // add entry to identifiers
    }

    for (const auto& entry : identifiers)
        std::cout << entry.first << " = " << entry.second << '\n';
}
After reading the identifiers you can, of course, do whatever you need to do with them. I just print them as an example to show how to fetch them. For more information about std::unordered_map look here. For a lot of very good information about making parsers have a look here instead.
If you want to make your program process input faster, insert the following line at the beginning of main: std::ios_base::sync_with_stdio(false);. This desynchronizes C++ IO from C IO and, as a result, makes it faster.

Assuming:
you don't want to use an existing format for meshes
you don't want to use a generic text format (json, yml, ...)
you don't want a binary format (even though you want something efficient)
In a nutshell, you really need your own text format.
You can use any parser generator to get started. While you could probably parse your config file as it is using only regexes, they can become really limiting in the long run. So I'll suggest a context-free grammar parser, generated with Boost spirit::x3.
AST
The Abstract Syntax Tree will hold the final result of the parser.
#include <string>
#include <utility>
#include <vector>
#include <variant>

namespace AST {
    using Identifier = std::string;                  // Variable name.
    using Value = std::variant<int, double>;         // Variable value.
    using Assignment = std::pair<Identifier, Value>; // Identifier = Value.
    using Root = std::vector<Assignment>;            // Whole file: all assignments.
}
Parser
Grammar description:
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/home/x3.hpp>

namespace Parser {
    namespace x3 = boost::spirit::x3; // alias needed before "using namespace x3"
    using namespace x3;

    // Line: Identifier = value
    const x3::rule<class assignment, AST::Assignment> assignment = "assignment";
    // Line: comment
    const x3::rule<class comment> comment = "comment";
    // Variable name
    const x3::rule<class identifier, AST::Identifier> identifier = "identifier";
    // File
    const x3::rule<class root, AST::Root> root = "root";
    // Any valid value in the config file
    const x3::rule<class value, AST::Value> value = "value";

    // Semantic action
    auto emplace_back = [](const auto& ctx) {
        x3::_val(ctx).emplace_back(x3::_attr(ctx));
    };

    // Grammar
    const auto assignment_def = skip(blank)[identifier >> '=' >> value];
    const auto comment_def = '%' >> omit[*(char_ - eol)];
    const auto identifier_def = lexeme[alpha >> +(alnum | char_('_'))];
    const auto root_def = *((comment | assignment[emplace_back]) >> eol) >> omit[*blank];
    const auto value_def = double_ | int_;

    BOOST_SPIRIT_DEFINE(root, assignment, comment, identifier, value);
}
Usage
#include <stdexcept>

// Takes iterators on a string/stream...
// Returns the AST of the input.
template<typename IteratorType>
AST::Root parse(IteratorType& begin, const IteratorType& end) {
    AST::Root result;
    bool parsed = boost::spirit::x3::parse(begin, end, Parser::root, result);
    if (!parsed || begin != end) {
        throw std::domain_error("Parser received an invalid input.");
    }
    return result;
}
Live demo
Evolutions
To change where blank spaces are allowed, add/move x3::skip(blank) in the xxxx_def expressions.
Currently the file must end with a newline. Rewriting the root_def expression can fix that.
You'll certainly want to know why the parsing failed on invalid inputs. See the error handling tutorial for that.
You're just a few rules away from parsing more complicated things:
// "N_Points: 100" followed by X_n Y_n coordinate lines
const auto point_def = lit("N_Points") >> ':' >> int_ >> eol >> *(double_ >> double_ >> eol);

If you don't need a specific text file format, but have a lot of data and care about performance, I recommend using an existing data serialization framework instead.
E.g. Google protocol buffers allow efficient serialization and deserialization with very little code. The file is binary, so typically much smaller than a text file, and binary serialization is much faster than parsing text. It also supports structured data (arrays, nested structs), data versioning, and other goodies.
https://developers.google.com/protocol-buffers/
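To make that concrete, here is a hedged sketch: the schema, message name, and file names below are made up for illustration, but SerializeToOstream()/ParseFromIstream() are standard methods on every protobuf-generated message class:
// Hypothetical mesh.proto:
//   message Mesh {
//     repeated double points = 1;
//     repeated int32  faces  = 2;
//   }
#include <fstream>
#include "mesh.pb.h" // generated by: protoc --cpp_out=. mesh.proto

int main()
{
    Mesh mesh;
    mesh.add_points(0.0); // fill the message from your data structures
    mesh.add_points(1.5);
    {
        std::ofstream out("geometry.pb", std::ios::binary);
        mesh.SerializeToOstream(&out); // compact binary write
    }
    Mesh loaded;
    std::ifstream in("geometry.pb", std::ios::binary);
    loaded.ParseFromIstream(&in); // binary read back, much faster than text parsing
}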

Related

Insert into array specific strings from text file

The ArticlesDataset.txt file contains all the metadata of the documents. unigramCount contains all unique words and their number of occurrences for each document. There are 1500 publications recorded in the txt file. Here is an example entry for a document:
{"creator":["Romain Allais","Julie Gobert"],
"datePublished":"2018-05-30",
"docType":"article",
"doi":"10.1051\/mattech\/2018010",
"id":"ark:\/\/27927\/phz10hn2bh3",
"isPartOf":"Mat\u00e9riaux & Techniques",
"issueNumber":"5-6",
"language":["eng"],
"outputFormat":["unigram","bigram","trigram"],
"pageCount":7,
"pagination":"pp. null-null",
"provider":"portico",
"publicationYear":2018,
"publisher":"EDP Sciences",
"sequence":3.0,
"tdmCategory":["Applied sciences -Engineering"],
"title":"Environmental assessment of PSS",
"url":"http:\/\/doi.org\/10.1051\/mattech\/2018010",
"volumeNumber":"105",
"wordCount":4446,
"unigramCount":{"others":1,"air":1,"networks,":1,"conventional":1,"IEEE":1}}
My purpose is to pull out the unigram counts for each document and store them in a suitable array. How can I do it using the fstream library?
How can I improve the code below to reach my goal?
std::string dummy;
std::ifstream data("PublicationsDataSet.txt");
while (data.good())
{
    getline(data, dummy, ',');
}
Your question delves into two different topics: one is parsing the data, the other is storing it in memory.
On the first point, you'll need a parser. You can either write one yourself, which involves a syntax parser to convert the "key words" into tokens and an interpreter to compile them into a data object based on the token that precedes or follows the data, e.g.:
'[' = start an array; every value after this is part of the array
']' = end of an array; return to the previous parsing state
':' = separates key and value; left-hand side is the key, right-hand side is the value
...
This is a fine exercise to sharpen one's skills, but it is way too arduous and a potentially never-ending road of bug fixing. As recommended in other comments, finding an already-made library is probably the easier road in a time pinch or in a project time-crunch scenario.
Another thing to point out: plain arrays in C++ are fixed in size, so since you are parsing the values you'll most likely use std::vector, which allows insertion. Once you are done processing the file, if you really intend to hand the data back as an array you can copy it out of the vector directly:
std::vector<YourObjectType> parsedObject;
YourObjectType* arr = new YourObjectType[parsedObject.size()];
std::copy(parsedObject.begin(), parsedObject.end(), arr);
This is pseudo-code; a lot depends on the implementation, but it gives the idea.
A starting point for writing a parser is this article, which goes into great detail on how one works and on its components. Mind you, every parser implements its own language (yes, just like C++ and other languages, which are all parsed), so you'll need to expand on the concept with your own commands:
expression parser
Here's a simplified solution of what you could do using std::regex:
Read the lines of a stream (std::cin in this case) one by one.
Check if the line contains a unigramCount element.
If that's the case, walk the different entries within the unigramCount element.
About the regular expressions used:
"unigramCount":{}, allowing:
zero or more whitespaces basically everywhere, and
zero or more characters within the braces.
"<key>":<value>, where:
<key> is one or more characters other than a double quote,
<value> is one or more digits, and
you could have whitespaces at both sides of the :.
A good data structure for storing your unigramCount entries could be a std::map.
[Demo]
#include <iostream> // cout
#include <map>
#include <regex> // regex_match, regex_search, sregex_iterator
#include <string> // stoi

int main()
{
    std::string line{};
    std::map<std::string, int> unigram_counts{};

    while (std::getline(std::cin, line))
    {
        const std::regex unigram_count_pattern{R"(^\s*\"unigramCount\"\s*:\s*\{.*\}\s*$)"};
        if (std::regex_match(line, unigram_count_pattern))
        {
            const std::regex entry_pattern{R"(\"([^\"]+)\"\s*:\s*([0-9]+))"};
            for (auto entry_it{std::sregex_iterator(line.cbegin(), line.cend(), entry_pattern)};
                 entry_it != std::sregex_iterator{};
                 ++entry_it)
            {
                auto matches{*entry_it};
                auto& key{matches[1]};
                auto& value{matches[2]};
                unigram_counts[key] = std::stoi(value);
            }
        }
    }

    for (auto& [key, value] : unigram_counts)
    {
        std::cout << "'" << key << "' : " << value << "\n";
    }
}
// Outputs:
//
// 'IEEE' : 1
// 'air' : 1
// 'conventional' : 1
// 'networks,' : 1
// 'others' : 1

Sort .csv in multidimensional arrays

I'm trying to read specific values (i.e. the value at coordinate XY) from a .csv file and I'm struggling with a proper way to define multidimensional arrays within that .csv.
Here's an example of the form from my .csv file
NaN,NaN,1.23,2.34,9.99
1.23,NaN,2.34,3.45,NaN
NaN,NaN,1.23,2.34,9.99
1.23,NaN,2.34,3.45,NaN
1.23,NaN,2.34,3.45,NaN
NaN,NaN,1.23,2.34,9.99
1.23,NaN,2.34,3.45,NaN
NaN,NaN,1.23,2.34,9.99
1.23,NaN,2.34,3.45,NaN
1.23,NaN,2.34,3.45,NaN
NaN,NaN,1.23,2.34,9.99
1.23,NaN,2.34,3.45,NaN
NaN,NaN,1.23,2.34,9.99
1.23,NaN,2.34,3.45,NaN
1.23,NaN,2.34,3.45,NaN
...
Ok, in reality this file becomes very large. You can interpret rows as latitudes and columns as longitudes, and thus each block is an hourly measured coordinate map. The blocks usually have a size of 361 rows by 720 columns, and the time period can range up to 20 years (= 24*365*20 blocks), just to give you an idea of the data size.
To structure this, I thought of scanning through the .csv and defining each block as a vector t, which I can access by choosing the desired timestep t=0,1,2,3...
Then, within this block I would like to go to a specific line (i.e. latitude) and define it as a vector longitudeArray.
The outcome shall be a specified value from coordinate XY at time Z.
As you might guess, my coding experience is rather limited and this is why my actual question might be very simple: How can I arrange my vectors in order to be able to call any random value?
This is my code so far (sadly it is not much, cause I don't know how to continue...)
#include <fstream>
#include <iostream>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>
#include <algorithm>
using namespace std;

int main()
{
    int longitude, latitude;       // coordinates used to specify the desired value
    string value;
    vector<string> t;              // vector of blocks; each block belongs to a time t=0,1,2,3... (hourly measured data)
    vector<string> longitudeArray; // one line of a block, i.e. one latitude
    ifstream file("swh.csv");      // open file
    if (!file.is_open())           // check if the file is opened; if not, report it
    {
        cout << "File could not open..." << endl;
        return 1;
    }
    string line;
    while (getline(file, line))    // scan the .csv line by line (intent: a blank line should delimit blocks)
    {
        longitudeArray.clear();
        stringstream ss(line);
        while (getline(ss, value, ',')) // break the line into comma-delimited fields
        {
            longitudeArray.push_back(value); // add each field to the 1D array, i.e. one latitude line
        }
        t.push_back(/*BLOCK*/);    // add each block to a distinct vector t (placeholder)
    }
    cout << t(longitudeArray[5])[6] << endl; // intended output: 5th element of longitudeArray in my 6th block (indexing unresolved)
    return 0;
}
If you have any hint, especially if there is a better way handling large .csv files, I'd be very grateful.
Ps: C++ is inevitable for this project...
Tüdelüü,
jtotheakob
As usual you should first think in terms of data and data usage. Here you have floating-point values (that can be NaN) that should be accessible as a 3D thing along latitude, longitude and time.
If you can accept simple (integer) indexes, the standard ways in C++ would be raw arrays, std::array and std::vector. The rule of thumb says: if the sizes are known at compile time, arrays (or std::array if you want whole-array operations) are fine; else go with vectors. And if unsure, std::vector is your workhorse.
So you will probably end up with a std::vector<std::vector<std::vector<double>>> data, that you would use as data[timeindex][latindex][longindex]. If everything is static you could use a double data[NTIMES][NLATS][NLONGS] that you would access more or less the same way. Beware if the array is large: most compilers will choke if you declare it inside a function (including main), but it can be a global inside one compilation unit (C-ish but still valid in C++).
So read the file line by line, feeding values into your container. If you use statically defined arrays, just assign each new value to its position; if you use vectors, you can dynamically add new elements with push_back.
This is too far from your current code for me to show you more than trivial code.
The static (C-ish) version could contain:
#define NTIMES (24*365*20)
#define NLATS 361
#define NLONGS 720
double data[NTIMES][NLATS][NLONGS];
...
int t, lat, lon; // "long" is a C++ keyword, so the longitude index is named lon
for (t = 0; t < NTIMES; t++) {
    for (lat = 0; lat < NLATS; lat++) {
        for (lon = 0; lon < NLONGS; lon++) {
            std::cin >> data[t][lat][lon];
            for (;;) { // skip separators: whitespace and commas
                if (! std::cin) break;
                char c = std::cin.peek();
                if (std::isspace(c) || (c == ',')) std::cin.get();
                else break;
            }
            if (! std::cin) break;
        }
        if (! std::cin) break;
    }
    if (! std::cin) break;
}
if (t != NTIMES) {
    // Not enough values or read error
    ...
}
A more dynamic version using vectors could be:
int ntimes = 0;
const int nlats = 361;  // may be a non-compile-time value
const int nlongs = 720; // ditto
vector<vector<vector<double>>> data;
int lat, lon;
for (;;) {
    data.push_back(vector<vector<double>>());
    for (lat = 0; lat < nlats; lat++) {
        data[ntimes].push_back(vector<double>(nlongs));
        for (lon = 0; lon < nlongs; lon++) {
            std::cin >> data[ntimes][lat][lon];
            for (;;) { // skip separators: whitespace and commas
                if (! std::cin) break;
                char c = std::cin.peek();
                if (std::isspace(c) || (c == ',')) std::cin.get();
                else break;
            }
            if (! std::cin) break;
        }
        if (! std::cin) break;
    }
    if (! std::cin) break;
    if (lat != nlats || lon != nlongs) {
        // Not enough values or read error
        ...
    }
    ntimes += 1;
}
Beware that stream extraction into a double does not actually accept the literal "NaN" (std::stod does convert it), and this code does not check the number of fields per line. To handle both, read a line with std::getline and use a std::istringstream to parse it, as in the sketch below.
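A minimal sketch of that per-line approach; the expected field count nlongs is assumed from above, and std::stod turns "NaN" into a quiet NaN:
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    const std::size_t nlongs = 720;
    std::string line, field;
    while (std::getline(std::cin, line))
    {
        std::istringstream ss(line);
        std::vector<double> row;
        while (std::getline(ss, field, ',')) // split the line on commas
            row.push_back(std::stod(field)); // "NaN" becomes a quiet NaN here
        if (!line.empty() && row.size() != nlongs)
            std::cerr << "Bad line: expected " << nlongs
                      << " fields, got " << row.size() << '\n';
    }
}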
Thanks, I tried to transfer both versions to my code, but I couldn't make them run.
Guess my poor coding skills aren't able to see what's obvious to everyone else. Can you name the additional libs I might require?
For std::isspace I do need #include <cctype>; is anything else missing which is not mentioned in my code from above?
Can you also explain how if (std::isspace(c) || (c == ',')) std::cin.get(); works? From what I understand, it checks whether c (the next input character?) is a whitespace, and if so, the whole condition becomes "true" because of ||? What consequence results from that?
At last, is if (! std::cin) break used to stop the loop after we reach the specified data[t][lat][lon]?
Anyhow, thanks for your response. I really appreciate it and I have now an idea how to define my loops.
Again thank you all for your ideas.
Unfortunately, I was not able to get the code running... but my task has changed slightly, so the need to read very large arrays is not there anymore.
However, I've got an idea of how to structure such operations and most probably will transfer it to my new task.
You may close this topic now ;)
Cheers
jtothekaob

What is the most efficient way to import an .STL file in C++?

What is the most efficient strategy for parsing a .STL file?
A critical part of my code is importing a .STL file, (a common CAD file format) and this is limiting performance overall.
The .STL file format is summarized here- https://en.wikipedia.org/wiki/STL_(file_format)
Using ASCII format is required for this application.
The generic format is:
solid name
facet normal ni nj nk
outer loop
vertex v1x v1y v1z
vertex v2x v2y v2z
vertex v3x v3y v3z
endloop
endfacet
endsolid
However, I've noticed that there are no strict formatting requirements, so the import function must do at least a minimal amount of error checking. I've done some performance measuring (using chrono), which for a 43,000-line file gives:
import_stl() - 1.177568 s
parsing loop - 3.894250 s
Parsing loop:
cout << "Importing " << stl_path << "... ";
auto file_vec = import_stl(stl_path);
for (auto& l : file_vec) {
trim(l);
if (solid_state) {
if (facet_state) {
if (starts_with(l, "vertex")) {
//---------ADD FACE----------//
l.erase(0, 6);
trim(l);
vector<string> strs;
split(strs, l, is_any_of(" "));
point p = { stod(strs[0]), stod(strs[1]), stod(strs[2]) };
facet_points.push_back(p);
//---------------------------//
}
else {
if (starts_with(l, "endfacet")) {
facet_state = false;
}
}
}
else {
if (starts_with(l, "facet")) {
facet_state = true;
//assert(facet_points.size() == 0);
//---------------------------//
// Normals can be ignored //
//---------------------------//
}
if (starts_with(l, "endsolid")) {
solid_state = false;
}
}
}
else {
if (starts_with(l, "solid")) {
solid_state = true;
}
}
if (facet_points.size() == 3) {
triangle facet(facet_points[0], facet_points[1], facet_points[2]);
stl_solid.add_facet(facet);
facet_points.clear();
//check normal
facet.normal();
}
}
The import_stl function is:
std::vector<std::string> import_stl(const std::string& file_path)
{
    std::ifstream infile(file_path);
    SkipBOM(infile);
    std::vector<std::string> file_vec;
    std::string line;
    while (std::getline(infile, line))
    {
        file_vec.push_back(line);
    }
    return file_vec;
}
I have searched for ways to optimize file reading, etc. And, I see that using mmap may improve file read speed.
Fast textfile reading in c++
This question is an inquiry as to what the best parsing strategy for a .STL file is.
Without data which can be used for measuring where the time is spent, it is hard to determine what actually improves the performance. A decent library already doing the job may be the easiest approach. However, the current code uses a few approaches where there may be easy wins. Here is what I spotted:
The streams library is quite good at skipping leading whitespace. Instead of first reading spaces followed by trimming them off, you may want to use std::getline(infile >> std::ws, line): the std::ws manipulator skips leading whitespaces.
Instead of using starts_with() with string literals, I'd rather read each line into a "command" and the tail of the line and compare the commands against std::string const objects: instead of a character comparison it may be sufficient to compare the size.
Instead of split()ing a std::string into a std::vector<std::string> on whitespace I'd rather reset a suitable stream (probably an std::istringstream but to prevent copying possibly a custom memory stream) and read directly from that:
std::istringstream in; // declared outside the reading loop
// ...
point p;
in.clear(); // get rid of potentially existing errors
in.str(line);
if (in >> p.x >> p.y >> p.z) {
    facet_points.push_back(p);
}
This approach has the added advantage of allowing format checking: I always distrust any input received, even when it is from a trusted source.
If you insist on adjusting the character sequence and/or splitting it into subsequences, I'd strongly recommend using std::string_view (or, in case this C++17 class isn't available, a similar class) to avoid moving characters around.
Assuming the file is of a significant size, I'd recommend against reading the file into a std::vector<std::string> and then parsing it. Instead, I'd parse the file on the fly: this way the hot memory is immediately reused instead of moving it out of cache for later post-processing. This way dealing with an auxiliary stream (see point 3 above) can be avoided. To prevent an overly complex reading loop I'd split nested sections into appropriate functions, returning from them on closing tags. In addition I'd define input functions for structures like point to simply read them off the stream.
Depending on the system you are working on, you may want to call std::ios_base::sync_with_stdio(false) before reading the file: there used to be at least one often used implementation of streams which would benefit from this call.
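As an illustration of the suggestion to define input functions for structures like point, here is a minimal sketch (the struct layout is assumed) that keeps the on-the-fly parsing loop simple:
#include <istream>

struct point { double x, y, z; };

std::istream& operator>>(std::istream& in, point& p)
{
    return in >> p.x >> p.y >> p.z; // whitespace between coordinates is skipped
}

// Hypothetical use inside the reading loop:
//   point p;
//   if (in >> p) facet_points.push_back(p);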

How to get more performance when reading a file

My program downloads files from a site (via curl, every 30 minutes). (It is possible that the size of these files can reach 150 MB.)
So I thought that getting data from these files could be inefficient (searching for a line every 5 seconds).
These files can have ~10,000 lines.
To parse this file (values are separated by ",") I use a regex:
regex wzorzec("(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)");
There are 8 values.
Now I have to push it to a vector:
allys.push_back({ std::stoi(std::string(wynik[1])), nick, tag, stoi(string(wynik[4])), stoi(string(wynik[5])), stoi(string(wynik[6])), stoi(string(wynik[7])), stoi(string(wynik[8])) });
I use std::async to do that, but for 3 files (~7 MB) the processor jumps to 80% and the operation takes about 10 seconds. I read from an SSD, so this is not a slow-I/O problem.
I'm reading the data line by line with fstream.
How do I speed this operation up?
Maybe I should parse these values and push them to SQL?
Best Regards
You can probably get some performance boost by avoiding regex and using something along the lines of std::strtok, or else just hard-coding a search for commas in your data. Regex has more power than you need just to look for commas. Next, if you call vector::reserve before you begin a sequence of push_back calls for any given vector, you will save a lot of time in both reallocation and moving memory around. If you are expecting a large vector, reserve room for it up front.
This may not cover all available performance ideas, but I'd bet you will see an improvement. A quick sketch of both ideas follows.
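A hedged sketch only, assuming 8 comma-separated fields per line as in the question:
#include <cstring>
#include <string>
#include <vector>

int main()
{
    char line[] = "1,Mario,Stuff,4,5,6,7,8"; // std::strtok modifies its input in place
    std::vector<std::string> fields;
    fields.reserve(8); // one up-front allocation instead of repeated growth
    for (char* tok = std::strtok(line, ","); tok != nullptr;
         tok = std::strtok(nullptr, ","))
        fields.emplace_back(tok);
}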
Your problem here is most likely the additional overhead introduced by the regular expression, since you're using many variable-length and greedy matches (the regex engine will try different alignments for the matches to find the largest matching result).
Instead, you might want to try to parse the lines manually. There are many different ways to achieve this. Here's one quick and dirty example (it's not flexible and has quite some duplicate code in there, but there's lots of room for optimization). It should explain the basic idea though:
#include <iostream>
#include <sstream>
#include <string>
#include <cstdlib>

const char *input = "1,Mario,Stuff,4,5,6,7,8";

struct data {
    int id;
    std::string nick;
    std::string tag;
} myData;

int main(int argc, char **argv) {
    char buffer[256];
    std::istringstream in(input);

    // Read an entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.id = atoi(buffer); // convert and store
    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.nick = buffer; // store
    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.tag = buffer; // store
    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Some test output
    std::cout << "id: " << myData.id << "\nnick: " << myData.nick << "\ntag: " << myData.tag << std::endl;
    return 0;
}
Note that there isn't any error handling in case entries are too long or too short (or broken in some other way).
Console output:
id: 1
nick: Mario
tag: Stuff

Reading key-value pairs as fast as possible in C++ from file

I have a file with roughly 2 million lines like this:
2s,3s,4s,5s,6s 100000
2s,3s,4s,5s,8s 101
2s,3s,4s,5s,9s 102
The first comma separated part indicates a poker result in Omaha, while the latter score is an example "value" of the cards. It is very important for me to read this file as fast as possible in C++, but I cannot seem to get it to be faster than a simple approach in Python (4.5 seconds) using the base library.
Using the Qt framework (QHash and QString), I was able to read the file in 2.5 seconds in release mode. However, I do not want to have the Qt dependency. The goal is to allow quick simulations using those 2 million lines, i.e. some_container["2s,3s,4s,5s,6s"] to yield 100 (though if applying a translation function or any non-readable format will allow for faster reading that's okay as well).
My current implementation is extremely slow (8 seconds!):
std::map<std::string, int> get_file_contents(const char *filename)
{
    std::map<std::string, int> outcomes;
    std::ifstream infile(filename);
    std::string c;
    int d;
    while (infile.good())
    {
        infile >> c;
        infile >> d;
        //std::cout << c << d << std::endl;
        outcomes[c] = d;
    }
    return outcomes;
}
What can I do to read this data into some kind of a key/value hash as fast as possible?
Note: The first 16 characters are always going to be there (the cards), while the score can go up to around 1 million.
Some further information gathered from various comments:
sample file: http://pastebin.com/rB1hFViM
ram restrictions: 750MB
initialization time restriction: 5s
computation time per hand restriction: 0.5s
As I see it, there are two bottlenecks in your code.
Bottleneck 1
I believe that the file reading is the biggest problem there. Having a binary file is the fastest option: not only can you read it directly into an array with a raw istream::read in a single operation (which is very fast), but you can even map the file into memory if your OS supports it. Here is a link that's very informative on how to use memory-mapped files.
Bottleneck 2
The std::map is usually implemented as a self-balancing BST that stores all the data in order. This makes insertion an O(log n) operation. You can change it to std::unordered_map, which uses a hash table instead. A hash table has constant-time insertion if the number of collisions is low. As the number of elements to read is known, you can reserve a suitable number of buckets before inserting the elements. Keep in mind that you need more buckets than the number of elements that will be inserted, to keep collisions low. A minimal sketch of this second fix follows.
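The entry count here is assumed from the question (~2 million lines):
#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::string, int> outcomes;
    outcomes.reserve(2000000); // pre-allocate buckets; avoids rehashing during the fill
    outcomes.emplace("2s,3s,4s,5s,6s", 100000);
}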
Ian Medeiros already mentioned the two major bottlenecks.
A few thoughts about data structures:
The number of different cards is known: 4 suits of 13 cards each -> 52 cards.
So a card requires less than 6 bits to store. Your current file format currently uses 24 bits per card (including the comma).
So by simply enumerating the cards and omitting the comma you can save ~2/3 of the file size, and you can determine a card by reading only one character per card.
If you want to keep the file text based, you may use a-m, n-z, A-M and N-Z for the four suits.
Another thing that bugs me is the string-based map. String operations are inefficient.
One hand contains 5 cards.
That means 52^5 possibilities if we keep it simple and do not consider the already-drawn cards.
--> 52^5 = 380,204,032 < 2^32
That means we can enumerate every possible hand with a uint32 number. By defining a special sorting scheme of the cards (since order is irrelevant), we can assign a number to the hand and use this number as the key in our map, which is a lot faster than using strings.
If we have enough memory (1.5 GB) we do not even need a map but can simply use an array.
Of course most cells are then unused, but access may be very fast. We can even omit the ordering of the cards, since the cells are present whether we fill them or not, so we can use them. But in that case you should not forget to fill all possible permutations of the hand read from the file.
With this scheme we can (maybe) further optimize our file reading speed, if we only store the hand's number and the rating, so that only 2 values need to be parsed.
In fact we can optimize the required storage space by using a more complex addressing scheme for the different hands, since in reality there are only 52*51*50*49*48 = 311,875,200 possible hands. Additionally the ordering is irrelevant as mentioned, but I think this saving is not worth the increased complexity of the encoding of the hands.
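A sketch of the enumeration idea (function name illustrative): number the 52 cards 0..51, sort the hand since order is irrelevant, and pack it into a single uint32 key, which works because 52^5 < 2^32:
#include <algorithm>
#include <array>
#include <cstdint>

std::uint32_t encode_hand(std::array<std::uint8_t, 5> cards) // cards numbered 0..51
{
    std::sort(cards.begin(), cards.end()); // canonical order -> one key per hand
    std::uint32_t key = 0;
    for (std::uint8_t c : cards)
        key = key * 52 + c; // treat the cards as base-52 digits
    return key;
}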
A simple idea might be to use the C API, which is considerably simpler:
#include <cstdio>

int n;
char s[128];
while (std::fscanf(stdin, "%127s %d", s, &n) == 2)
{
    outcomes[s] = n;
}
A rough test showed a considerable speedup for me compared to the iostreams library.
Further speedups may be achieved by storing the data in a contiguous array, e.g. a vector of std::pair<std::string, int>; it depends on whether your data is already sorted and how you need to access it later.
For a serious solution, though, you should probably step back further and think of a better way to represent your data. For example, a fixed-width, binary encoding would be much more space-efficient and faster to parse, since you won't need to look ahead for line endings or parse strings.
Update: From some quick experimentation I've found it fairly fast to first read the entire file into memory and then perform alternating strtok calls with either " " or "\n" as the delimiter; whenever a pair of calls succeeds, apply strtol to the second pointer to parse the integer. Here's a skeleton:
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main()
{
    std::vector<char> data;

    // Read entire file to memory
    {
        data.reserve(100000000);
        char buf[4096];
        for (std::size_t n; (n = std::fread(buf, 1, sizeof buf, stdin)) > 0; )
        {
            data.insert(data.end(), buf, buf + n);
        }
        data.push_back('\0');
    }

    // Tokenize the in-memory data
    char * p = &data.front();
    for (char * q = std::strtok(p, " "); q; q = std::strtok(nullptr, " "))
    {
        if (char * r = std::strtok(nullptr, "\n"))
        {
            char * e;
            errno = 0;
            int const n = std::strtol(r, &e, 10);
            if (*e != '\0' || errno != 0) { continue; }
            // At this point we have data:
            // * the string is "q"
            // * the integer is "n"
        }
    }
}
}