I wrote a simple reader and parser for a graph file format. The problem is that it is incredibly slow. Here are the relevant methods:
Graph METISGraphReader::read(std::string path) {
METISParser parser(path);
std::pair<int64_t, int64_t> header = parser.getHeader();
int64_t n = header.first;
int64_t m = header.second;
Graph G(n);
node u = 0;
while (parser.hasNext()) {
u += 1;
std::vector<node> adjacencies = parser.getNext();
for (node v : adjacencies) {
if (! G.hasEdge(u, v)) {
G.insertEdge(u, v);
}
}
}
return G;
}
std::vector<node> METISParser::getNext() {
std::string line;
bool comment = false;
do {
comment = false;
std::getline(this->graphFile, line);
// check for comment line starting with '%'
if (line[0] == '%') {
comment = true;
TRACE("comment line found");
} else {
return parseLine(line);
}
} while (comment);
}
static std::vector<node> parseLine(std::string line) {
std::stringstream stream(line);
std::string token;
char delim = ' ';
std::vector<node> adjacencies;
// split string and push adjacent nodes
while (std::getline(stream, token, delim)) {
node v = atoi(token.c_str());
adjacencies.push_back(v);
}
return adjacencies;
}
To diagnose why it is so slow, I ran it in a profiler (Apple Instruments). The results were surprising: It's slow because of locking overhead. The program spends over 90% of its time in pthread_mutex_lock and _pthread_cond_wait.
I have no idea where the locking overhead comes from, but I need to get rid of it. Can you suggest next steps?
EDIT: See the call stack expanded for _pthread_cond_wait. I cannot figure out the source of the locking overhead by looking at it.
Expand the call stack on the _pthread_cond_wait and pthread_mutex_lock calls to find out where the locking calls are invoked from.
As a guess, I'm going to say it's all the unnecessary heap allocations you're doing. The heap is a thread-safe resource, and on this platform that thread safety could be provided via mutexes.
All functions that read data from an istream will lock a mutex, read data from a streambuf and unlock the mutex. To eliminate that overhead, read the file directly from the streambuf instead of the istream and don't use stringstream to parse the data.
Here is a version of getline that uses streambuf instead of istream
bool fastGetline(std::streambuf* sb, std::string& t) // needs <streambuf>; EOF comes from <cstdio>
{
t.clear();
for(;;) {
int c = sb->sbumpc();
switch (c) {
case '\n':
return true;
case '\r':
if(sb->sgetc() == '\n')
sb->sbumpc();
return true;
case EOF:
return !t.empty();
default:
t += (char)c;
}
}
}
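The same idea applies to the parsing side: a minimal sketch of how the question's parseLine could avoid the stringstream and the per-token strings, using std::strtol from <cstdlib> (assuming node is an integral type, as in the question):
static std::vector<node> parseLine(const std::string& line) {
    std::vector<node> adjacencies;
    const char* p = line.c_str();
    char* end = nullptr;
    for (;;) {
        long v = std::strtol(p, &end, 10);   // skips leading whitespace itself
        if (end == p)                        // no digits left on this line
            break;
        adjacencies.push_back(static_cast<node>(v));
        p = end;                             // continue right after the parsed number
    }
    return adjacencies;
}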
I have a problem: when I try getline, instead of scanning a line, the program chooses to take the whole text. I tried adding a delimiter, and I have included <string> for this, but it still doesn't seem to work, and the vector is just one big string.
void convert_wishlist(ifstream& file, vector<string>& wishlist, int& budget){
assert(file.is_open());
string g;
file>>budget;
while (!file.fail()) {
getline(file, g);
wishlist.push_back(g);
}
}
A while back, I had to write code ready to deal with text files coming from either Windows or Linux, so I wrote my own version of getline that was ready for either. This has worked for me:
template<class CT, class TT, class AT>
inline
std::basic_istream<CT, TT>& getline(
std::basic_istream<CT, TT>& instr,
std::basic_string<CT, TT, AT>& str)
{
using is_type = std::basic_istream<CT, TT>;
using sentry_type = typename is_type::sentry;
std::ios_base::iostate state = std::ios_base::goodbit;
auto changed = false;
const sentry_type sentry(instr, true);
if (sentry)
{
// State okay, extract characters
try
{
auto sb = instr.rdbuf();
str.erase();
const auto delimNL = TT::to_int_type('\n');
const auto delimCR = TT::to_int_type('\r');
auto ch = sb->sgetc();
for (; ; ch = sb->snextc())
{
if (TT::eq_int_type(TT::eof(), ch))
{
// End of file, quit
state |= std::ios::eofbit;
break;
}
else if (TT::eq_int_type(ch, delimNL))
{
// Got a newline, discard it and quit.
changed = true; // We did read something successfully
sb->sbumpc(); // Discard this character
break; // Quit
}
else if (TT::eq_int_type(ch, delimCR))
{
// Got a carriage return.
changed = true; // We did read something successfully
sb->sbumpc(); // Discard this character
// Next character might be a newline. If so, do the
// same for that
if (TT::eq_int_type(sb->sgetc(), delimNL))
sb->sbumpc();
break; // We are done
}
else if(str.max_size() <= str.size())
{
// String too large, quit
state |= std::ios_base::failbit;
break;
}
else
{
// Got a character, add it to string
str += TT::to_char_type(ch);
changed = true;
}
}
}
catch(...)
{
// Some exception. Set badbit and rethrow
instr.setstate(std::ios_base::badbit);
throw;
}
}
if (!changed)
state |= std::ios_base::failbit;
instr.setstate(state);
return instr;
}
In a console program I am creating, I have a bit of code that parses through a file. After parsing each line, it is checked for syntax errors. If there is a syntax error, the program stops reading the file and goes to the next part of the program. The problem is, it is very messy, as my only solution so far is a series of nested if statements or a chain of if statements. The problem with nested ifs is that it gets very messy very fast, and a chain of if statements has the program testing for several things that don't need to be tested. Here's some pseudo code of my problem (note I am NOT using a return statement).
Pseudo code shown instead of real code, as it is very large
Nested if:
open file;
read line;
//Each if is testing something different
//Every error is different
if (line is valid)
{
read line;
if (line is valid)
{
read line;
if (line is valid)
{
do stuff;
}
else
error;
}
else
error;
}
else
error;
code that must be reached, even if there was an error;
Non-nested ifs:
bool fail = false;
open file;
read line;
//Each if is testing something different
//Every error is different
if (line is valid)
read line;
else
{
error;
fail = true;
}
if (!fail && line is valid)
read line;
else
{
error;
fail = true;
}
if (!fail && line is valid)
do stuff;
else
error;
//Note how fail is constantly checked, even after it has already been set
code that must be reached, even if there was an error;
I have looked at many different sites, but their answers didn't quite fit my problem. This code does work at runtime, but as you can see it is not very elegant. Does anyone have a more readable/efficient approach to my problem? Any help is appreciated :)
Two options come to mind:
Option 1: chain reads and validations
This is similar to how std::istream extraction operators work. You could do something like this:
void your_function() {
std::ifstream file("some_file");
std::string line1, line2, line3;
if (std::getline(file, line1) &&
std::getline(file, line2) &&
std::getline(file, line3)) {
// do stuff
} else {
// error
}
// code that must be reached, even if there was an error;
}
Option 2: split into different functions
This can get a little long, but if you split things out right (and give everything a sane name), it can actually be very readable and debuggable.
bool step3(const std::string& line1,
const std::string& line2,
const std::string& line3) {
// do stuff
return true;
}
bool step2(std::ifstream& file,
const std::string& line1,
const std::string& line2) {
std::string line3;
return std::getline(file, line3) && step3(line1, line2, line3);
}
bool step1(std::ifstream& file,
const std::string& line1) {
std::string line2;
return std::getline(file, line2) && step2(file, line1, line2);
}
bool step0(std::ifstream& file) {
std::string line1;
return std::getline(file, line1) && step1(file, line1);
}
void your_function() {
std::ifstream file("some_file");
if (!step0(file)) {
// error
}
// code that must be reached, even if there was an error;
}
This example code is a little too trivial. If the line validation that occurs in each step is more complicated than std::getline's return value (which is often the case when doing real input validation), then this approach has the benefit of making that more readable. But if the input validation is as simple as checking std::getline, then the first option should be preferred.
Is there [...] a more readable/efficient approach to my problem
Step 1. Look around for a classical example of a text parser
Answer: a compiler, which parses text files and produces different kinds of results.
Step 2. Read some theory on how compilers work
There are lots of approaches and techniques. Books, online and open source examples. Simple and complicated.
Sure, you might just skip this step if you are not that interested.
Step 3. Apply the theory to your problem
Looking through the theory, you will not miss such terms as "state machine", "automata", etc. Here is a brief explanation on Wikipedia:
https://en.wikipedia.org/wiki/Automata-based_programming
There is basically a ready to use example on the Wiki page:
#include <stdio.h>
enum states { before, inside, after };
void step(enum states *state, int c)
{
if(c == '\n') {
putchar('\n');
*state = before;
} else
switch(*state) {
case before:
if(c != ' ') {
putchar(c);
*state = inside;
}
break;
case inside:
if(c == ' ') {
*state = after;
} else {
putchar(c);
}
break;
case after:
break;
}
}
int main(void)
{
int c;
enum states state = before;
while((c = getchar()) != EOF) {
step(&state, c);
}
if(state != before)
putchar('\n');
return 0;
}
Or a C++ example with state machine:
#include <stdio.h>
class StateMachine {
enum states { before = 0, inside = 1, after = 2 } state;
struct branch {
unsigned char new_state:2;
unsigned char should_putchar:1;
};
static struct branch the_table[3][3];
public:
StateMachine() : state(before) {}
void FeedChar(int c) {
int idx2 = (c == ' ') ? 0 : (c == '\n') ? 1 : 2;
struct branch *b = & the_table[state][idx2];
state = (enum states)(b->new_state);
if(b->should_putchar) putchar(c);
}
};
struct StateMachine::branch StateMachine::the_table[3][3] = {
/* ' ' '\n' others */
/* before */ { {before,0}, {before,1}, {inside,1} },
/* inside */ { {after, 0}, {before,1}, {inside,1} },
/* after */ { {after, 0}, {before,1}, {after, 0} }
};
int main(void)
{
int c;
StateMachine machine;
while((c = getchar()) != EOF)
machine.FeedChar(c);
return 0;
}
Sure, instead of chars you should feed lines.
This technique scales up to complicated compilers, proven by tons of implementations. So if you are looking for a "right" approach, here it is.
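As a rough sketch of what feeding lines (rather than characters) into a state machine could look like for the original problem; the state names and the is_valid check are made up for illustration:
#include <fstream>
#include <string>

enum class State { Header, Body, Error, Done };

// Placeholder for the question's "line is valid" checks.
static bool is_valid(const std::string& line, State) { return !line.empty(); }

void parse_file(std::ifstream& file) {
    State state = State::Header;
    std::string line;
    while (state != State::Done && state != State::Error && std::getline(file, line)) {
        if (!is_valid(line, state)) {
            state = State::Error;        // error; stop reading further lines
            break;
        }
        switch (state) {
        case State::Header:
            // handle the header line, then expect body lines
            state = State::Body;
            break;
        case State::Body:
            // do stuff with a body line; set state = State::Done when finished
            break;
        default:
            break;
        }
    }
    // code that must be reached, even if there was an error
}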
A common modern practice is an early return with RAII. Basically it means that the code that must happen should be in the destructor of a class, and your function will have a local object of that class. Now when you have an error, you exit early from the function (either with an exception or just a plain return) and the destructor of that local object will handle the code that must happen.
The code will look something like this:
class Guard
{
...
Guard();
~Guard() { /* code that must happen */ }
...
};
void someFunction()
{
Guard localGuard;
...
open file;
read line;
//Each if is testing something different
//Every error is different
if (!line)
{
return;
}
read line;
if (!line)
{
return;
}
...
}
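A concrete, compilable version of the same idea might look like this; the message printed in the destructor is just a stand-in for whatever must always run:
#include <fstream>
#include <iostream>
#include <string>

struct Guard {
    // code that must happen, no matter how the function is exited
    ~Guard() { std::cout << "cleanup that must always run\n"; }
};

void someFunction(const std::string& path) {
    Guard localGuard;                    // destructor runs on every exit path
    std::ifstream file(path);
    std::string line;
    if (!std::getline(file, line)) {     // first "line is valid" check
        std::cerr << "error in line 1\n";
        return;                          // early return; ~Guard() still runs
    }
    if (!std::getline(file, line)) {     // second check
        std::cerr << "error in line 2\n";
        return;
    }
    // do stuff
}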
By preprocessing the file I found some lines for further processing; now I want to read those lines. Is there any faster solution than reading lines one by one using ifstream::getline(...)?
For example, I know that I want only lines whose number is a multiple of 4 (0-4-8-12-16-...), or specific line numbers stored in a vector...
Now I'm doing this:
string line;
int counter = 0;
while( getline(ifstr,line) ){
if (counter++ % 4 == 0) {
// some code working with line
}
}
But I want something like this (if it's faster):
while(getline(ifstr,line)){
// some code working with line
while(++counter%4 !=0){ // or checking on index vector
skipline(ifstr)
}
}
Let me mention again that I have a set of line indices (sorted, but not this regular); I just use the multiple-of-4 example for simplicity.
Edit: I also want to jump to a given line at the beginning: for example, I know that I need to read from line number 2000, so how do I skip 1999 lines quickly?
Thanks all
Because #caps said this left him with the feeling there's nothing in the standard library to help with this kind of task, I felt compelled to demonstrate otherwise :)
Live On Coliru
template <typename It, typename Out, typename Filter = std::vector<int> >
Out retrieve_lines(It begin, It const end, Filter lines, Out out, char const* delim = "\\n") {
if (lines.empty())
return out;
// make sure input is orderly
assert(std::is_sorted(lines.begin(), lines.end()));
assert(lines.front() >= 0);
std::regex re(delim);
std::regex_token_iterator<It> line(begin, end, re, -1), eof;
// make lines into incremental offsets
std::adjacent_difference(lines.begin(), lines.end(), lines.begin());
// iterate advancing by each offset requested
auto advanced = [&line, eof](size_t n) { while (line!=eof && n--) ++line; return line; };
for (auto offset = lines.begin(); offset != lines.end() && advanced(*offset) != eof; ++offset) {
*out++ = *line;
}
return out;
}
This is noticeably more generic. The trade-off (for now) is that the tokenizing iterator requires a random-access iterator. I find this a good trade-off because "random access" on files really asks for memory-mapped files anyway.
Live Demo 1: from string to vector<string>
Live On Coliru
int main() {
std::vector<std::string> output_lines;
std::string is(" a b c d e\nf g hijklmnop\nqrstuvw\nxyz");
retrieve_lines(is.begin(), is.end(), {0,3,999}, back_inserter(output_lines));
// for debug purposes
for (auto& line : output_lines)
std::cout << line << "\n";
}
Prints
a b c d e
xyz
Live Demo 2: From file to cout
Live On Coliru
#include <boost/iostreams/device/mapped_file.hpp>
int main() {
boost::iostreams::mapped_file_source is("/etc/dictionaries-common/words");
retrieve_lines(is.begin(), is.end(), {13,784, 9996}, std::ostream_iterator<std::string>(std::cout, "\n"));
}
Prints e.g.
ASL's
Apennines
Mercer's
The use of boost::iostreams::mapped_file_source can easily be replaced with straight up ::mmap but I found it uglier in the presentation sample.
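For completeness, here is a rough sketch of what the straight ::mmap variant might look like (POSIX only, minimal error handling, reusing retrieve_lines from above):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <iostream>
#include <iterator>

int main() {
    int fd = ::open("/etc/dictionaries-common/words", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (::fstat(fd, &st) != 0 || st.st_size == 0) { ::close(fd); return 1; }
    void* p = ::mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { ::close(fd); return 1; }
    const char* data = static_cast<const char*>(p);
    // data .. data + st.st_size is now one contiguous, random-access range
    retrieve_lines(data, data + st.st_size, {13, 784, 9996},
                   std::ostream_iterator<std::string>(std::cout, "\n"));
    ::munmap(p, st.st_size);
    ::close(fd);
}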
Store std::fstream::streampos instances corresponding to the line beginnings of your file into a std::vector, and then you can access a specific line using an index into this vector. A possible implementation follows:
class file_reader {
public:
// load file streampos offsets during construction
explicit file_reader(const std::string& filename)
: fs(filename) { cache_file_streampos(); }
std::size_t lines() const noexcept { return line_streampos_vec.size(); }
// get a std::string representation of specific line in file
std::string read_line(std::size_t n) {
if (n >= line_streampos_vec.size())
throw std::out_of_range("out of bounds");
navigate_to_line(n);
std::string rtn_str;
std::getline(fs, rtn_str);
return rtn_str;
}
private:
std::fstream fs;
std::vector<std::fstream::streampos> line_streampos_vec;
const std::size_t max_line_length = // some sensible value
// store file streampos instances in vector
void cache_file_streampos() {
std::string s;
s.reserve(max_line_length);
std::fstream::streampos pos = fs.tellg(); // start of the line about to be read
while (std::getline(fs, s)) {
line_streampos_vec.push_back(pos);
pos = fs.tellg();
}
}
// go to specific line in file stream
void navigate_to_line(std::size_t n) {
fs.clear();
fs.seekg(line_streampos_vec[n]);
}
};
Then you can read a specific line of your file via:
file_reader fr("filename.ext");
for (std::size_t i = 0; i < fr.lines(); ++i) {
if (!(i % 4)) {
std::string line_contents = fr.read_line(i); // then do something with the string
}
}
ArchbishopOfBanterbury's answer is nice, and I would agree with him that you will get cleaner code and better efficiency by just storing the character positions of the beginning of each line when you do your preprocessing.
But, supposing that is not possible (perhaps the preprocessing is handled by some other API, or is from user input), there is a solution that should do the minimal amount of work necessary to read in only the specified lines.
The fundamental problem is that, given a file with variable line lengths, you cannot know where each line begins and ends, since a line is defined as a sequence of characters that end in '\n'. So, you must parse every character to check and see if it is '\n' or not, and if so, advance your line counter and read in the line if the line counter matches one of your desired inputs.
auto retrieve_lines(std::ifstream& file_to_read, std::vector<int> line_numbers_to_read) -> std::vector<std::string>
{
auto begin = std::istreambuf_iterator<char>(file_to_read);
auto end = std::istreambuf_iterator<char>();
auto current_line = 0;
auto next_line_num = std::begin(line_numbers_to_read);
auto output_lines = std::vector<std::string>();
output_lines.reserve(line_numbers_to_read.size()); //this may be a silly "optimization," since all the strings are still separate unreserved buffers
//we can bail if we've reached the end of the lines we want to read, even if there are lines remaining in the stream
//we *must* bail if we've reached the end of the stream, even if there are supposedly lines left to read; that input must have been incorrect
while(begin != end && next_line_num != std::end(line_numbers_to_read))
{
if(current_line == *next_line_num)
{
auto matching_line = std::string();
if(*begin != '\n')
{
//potential optimization: reserve matching_line to something that you expect will fit most/all of your input lines
while(begin != end && *begin != '\n')
{
matching_line.push_back(*begin++);
}
}
output_lines.emplace_back(matching_line);
++next_line_num;
}
else
{
//skip this "line" by finding the next '\n'
while(begin != end && *begin != '\n')
{
++begin;
}
}
//either code path in the previous if/else leaves us staring at the '\n' at the end of a line,
//which is not the right state for the next iteration of the loop.
//So skip this '\n' to get to the beginning of the next line
if (begin != end && *begin == '\n')
{
++begin;
}
++current_line;
}
return output_lines;
}
Here it is live on Coliru, along with the input I tested it with. As you can see, it correctly handles empty lines as well as correctly handling being told to grab more lines than are in the file.
My current parser is given below. Reading a ~10MB CSV into an STL vector takes ~30 secs, which is too slow for my liking, given I've got over 100MB that needs to be read in every time the program is run. Can anyone give some advice on how to improve performance? Indeed, would it be faster in plain C?
int main() {
std::vector<double> data;
std::ifstream infile( "data.csv" );
infile >> data;
std::cin.get();
return 0;
}
std::istream& operator >> (std::istream& ins, std::vector<double>& data)
{
data.clear();
// Reserve data vector
std::string line, field;
std::getline(ins, line);
std::stringstream ssl(line), ssf;
std::size_t rows = 1, cols = 0;
while (std::getline(ssl, field, ',')) cols++;
while (std::getline(ins, line)) rows++;
std::cout << rows << " x " << cols << "\n";
ins.clear(); // clear bad state after eof
ins.seekg(0);
data.reserve(rows*cols);
// Populate data
double f = 0.0;
while (std::getline(ins, line)) {
ssl.str(line);
ssl.clear();
while (std::getline(ssl, field, ',')) {
ssf.str(field);
ssf.clear();
ssf >> f;
data.push_back(f);
}
}
return ins;
}
NB: I have also have openMP at my disposal, and the contents will eventually be used for GPGPU computation with CUDA.
You could halve the time by reading the file once and not twice.
While presizing the vector is beneficial, it will never dominate the runtime, because I/O will always be slower by orders of magnitude.
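For the first point, here is a sketch (untested) of the same operator>> restructured to make a single pass, counting columns from the first line only and estimating the row count from the stream length purely for the reserve() call; it assumes ins is a seekable file stream:
std::istream& operator >> (std::istream& ins, std::vector<double>& data)
{
    data.clear();
    std::string line, field;
    if (!std::getline(ins, line))
        return ins;
    // count columns from the first line only
    std::size_t cols = 1 + std::count(line.begin(), line.end(), ','); // needs <algorithm>
    // estimate rows from the stream length; this only affects reserve()
    std::streampos here = ins.tellg();
    ins.seekg(0, std::ios::end);
    std::streamoff total = ins.tellg();
    ins.seekg(here);
    std::size_t approx_rows = static_cast<std::size_t>(total) / (line.size() + 1) + 1;
    data.reserve(approx_rows * cols);
    // single parsing pass, starting with the line already read;
    // atof keeps the sketch short, any field-to-double conversion works here
    do {
        std::stringstream ssl(line);
        while (std::getline(ssl, field, ','))
            data.push_back(std::atof(field.c_str())); // needs <cstdlib>
    } while (std::getline(ins, line));
    return ins;
}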
Another possible optimization could be reading without a string stream. Something like (untested)
int c = 0;
while (ins >> f) {
data.push_back(f);
if (++c < cols) {
char comma;
ins >> comma; // skip comma
} else {
c = 0; // end of line, start next line
}
}
If you can omit the , and separate the values by whitespace only, it could even be
while (ins >> f)
data.push_back(f);
or
std::copy(std::istream_iterator<double>(ins), std::istream_iterator<double>(),
std::back_inserter(data));
On my machine, your reserve code takes about 1.1 seconds and your populate code takes 8.5 seconds.
Adding std::ios::sync_with_stdio(false); made no difference to my compiler.
The below C code takes 2.3 seconds.
// assumes `file` is an already-open FILE* and `data` is a large enough array
int i = 0;
int j = 0;
while( true ) {
float x;
j = fscanf( file, "%f", & x );
if( j == EOF ) break;
data[i++] = x;
// skip ',' or '\n'
int ch = getc(file);
}
Try calling
std::ios::sync_with_stdio(false);
at the start of your program. This disables the (allegedly quite slow) synchronization between cin/cout and scanf/printf (I have never tried this myself, but have often seen the recommendation, such as here). Note that if you do this, you cannot mix C++-style and C-style IO in your program.
(In addition, Olaf Dietsche is completely right about only reading the file once.)
Apparently, file I/O is a bad idea; just map the whole file into memory and access the CSV file as a continuous VM block. This incurs only a few syscalls.
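Sketched out, the parsing half of that could look like the hypothetical parse_block below, assuming the file's bytes are already available as one contiguous NUL-terminated buffer (however it was mapped or read in):
#include <cstdlib>
#include <vector>

// Walk a contiguous, NUL-terminated buffer and pull out every number.
// strtod skips whitespace itself; other separators are stepped over explicitly.
std::vector<double> parse_block(const char* p)
{
    std::vector<double> data;
    char* end = nullptr;
    for (;;) {
        double v = std::strtod(p, &end);
        if (end == p)            // nothing left to convert
            break;
        data.push_back(v);
        p = end;
        while (*p == ',' || *p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
            ++p;                 // skip separators before the next number
    }
    return data;
}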
I'm reading a file, line by line, and extracting integers from it. Some noteworthy points:
the input file is not in binary;
I cannot load up the whole file in memory;
file format (only integers, separated by some delimiter):
x1 x2 x3 x4 ...
y1 y2 y3 ...
z1 z2 z3 z4 z5 ...
...
Just to add context, I'm reading the integers, and counting them, using an std::unordered_map<unsigned int, unsigned int>.
Simply looping through lines, and allocating useless stringstreams, like this:
std::fstream infile(<inpath>, std::ios::in);
while (std::getline(infile, line)) {
std::stringstream ss(line);
}
gives me ~2.7s for a 700MB file.
Parsing each line:
unsigned int item;
std::fstream infile(<inpath>, std::ios::in);
while (std::getline(infile, line)) {
std::stringstream ss(line);
while (ss >> item);
}
Gives me ~17.8s for the same file.
If I change the operator to a std::getline + atoi:
unsigned int item;
std::fstream infile(<inpath>, std::ios::in);
while (std::getline(infile, line)) {
std::stringstream ss(line);
while (std::getline(ss, token, ' ')) item = atoi(token.c_str());
}
It gives ~14.6s.
Is there anything faster than these approaches? I don't think it's necessary to speed up the file reading, just the parsing itself -- though speeding up both wouldn't hurt (:
This program
#include <iostream>
int main ()
{
int num;
while (std::cin >> num) ;
}
needs about 17 seconds to read a file. This code
#include <iostream>
int main()
{
int lc = 0;
int item = 0;
char buf[2048];
do
{
std::cin.read(buf, sizeof(buf));
int k = std::cin.gcount();
for (int i = 0; i < k; ++i)
{
switch (buf[i])
{
case '\r':
break;
case '\n':
item = 0; lc++;
break;
case ' ':
item = 0;
break;
case '0': case '1': case '2': case '3':
case '4': case '5': case '6': case '7':
case '8': case '9':
item = 10*item + buf[i] - '0';
break;
default:
std::cerr << "Bad format\n";
}
}
} while (std::cin);
}
needs 1.25 seconds for the same file. Make what you want of it...
Streams are slow. If you really want to do stuff fast, load the entire file into memory and parse it in memory. If you really can't load it all into memory, load it in chunks, making those chunks as large as possible, and parse the chunks in memory.
When parsing in memory, replace the spaces and line endings with nulls so you can use atoi to convert to integer as you go.
Oh, and you'll get problems at the end of chunks because you don't know whether the chunk end cuts off a number or not. To solve this easily, stop a small distance (16 bytes should do) before the chunk end and copy this tail to the start before loading the next chunk after it.
Have you tried input iterators?
It skips the creation of the strings:
std::istream_iterator<int> begin(infile);
std::istream_iterator<int> end;
int item = 0;
while(begin != end)
item = *begin++;
Why don't you skip the stream and the line buffers and read from the file stream directly?
template<class T, class CharT, class CharTraits>
std::vector<T> read(std::basic_istream<CharT, CharTraits> &in) {
std::vector<T> ret;
T x;
while (in >> x) // stops at EOF or on a malformed value
ret.push_back(x);
return ret;
}
http://ideone.com/FNJKFa
Following up Jack Aidley's answer (can't put code in the comments), here's some pseudo-code:
vector<char> buff( chunk_size );
roffset = 0;
char* chunk = &buff[0];
while( not done with file )
{
fread( chunk + roffset, ... ); // Read a sizable chunk into memory, filling in after roffset
roffset = find_last_eol(chunk); // find where the last full line ends
parse_in_mem( chunk, chunk_size - roffset ); // process up to the last full line
move_unprocessed_to_front( chunk, roffset ); // don't re-read what's already in mem
}