My current parser is given below. Reading a ~10MB CSV into an STL vector takes ~30 seconds, which is too slow for my liking, given that I've got over 100MB that needs to be read in every time the program is run. Can anyone give some advice on how to improve performance? Indeed, would it be faster in plain C?
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::istream& operator>>(std::istream& ins, std::vector<double>& data);

int main() {
    std::vector<double> data;
    std::ifstream infile("data.csv");
    infile >> data;
    std::cin.get();
    return 0;
}
std::istream& operator>>(std::istream& ins, std::vector<double>& data)
{
    data.clear();

    // Reserve data vector
    std::string line, field;
    std::getline(ins, line);
    std::stringstream ssl(line), ssf;
    std::size_t rows = 1, cols = 0;
    while (std::getline(ssl, field, ',')) cols++;
    while (std::getline(ins, line)) rows++;
    std::cout << rows << " x " << cols << "\n";
    ins.clear(); // clear bad state after eof
    ins.seekg(0);
    data.reserve(rows * cols);

    // Populate data
    double f = 0.0;
    while (std::getline(ins, line)) {
        ssl.str(line);
        ssl.clear();
        while (std::getline(ssl, field, ',')) {
            ssf.str(field);
            ssf.clear();
            ssf >> f;
            data.push_back(f);
        }
    }
    return ins;
}
NB: I also have OpenMP at my disposal, and the contents will eventually be used for GPGPU computation with CUDA.
You could halve the time by reading the file once instead of twice.
While presizing the vector is beneficial, it will never dominate the runtime, because I/O will typically be slower by an order of magnitude.
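For illustration, the whole single-pass version might look like this (a rough, untested sketch; the reserve() figure is a made-up placeholder, and std::stod assumes C++11 and throws on malformed fields):
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Single pass: parse while reading, no separate counting pass.
std::istream& operator>>(std::istream& ins, std::vector<double>& data)
{
    data.clear();
    data.reserve(1u << 20); // made-up estimate; derive from the file size if you can
    std::string line, field;
    while (std::getline(ins, line)) {
        std::stringstream ssl(line);
        while (std::getline(ssl, field, ','))
            data.push_back(std::stod(field)); // throws on malformed fields
    }
    return ins;
}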
Another possible optimization could be reading without a string stream. Something like (untested)
int c = 0;
while (ins >> f) {
    data.push_back(f);
    if (++c < cols) {
        char comma;
        ins >> comma; // skip comma
    } else {
        c = 0; // end of line, start next line
    }
}
If you can omit the commas and separate the values by whitespace only, it could be even simpler:
while (ins >> f)
    data.push_back(f);
or
std::copy(std::istream_iterator<double>(ins), std::istream_iterator<double>(),
          std::back_inserter(data));
On my machine, your reserve code takes about 1.1 seconds and your populate code takes 8.5 seconds.
Adding std::ios::sync_with_stdio(false); made no difference to my compiler.
The below C code takes 2.3 seconds.
/* Assumes `file` is an open FILE* and `data` is a float array large
   enough to hold every value. */
int i = 0;
int j = 0;
while (1) {
    float x;
    j = fscanf(file, "%f", &x);
    if (j == EOF) break;
    data[i++] = x;
    getc(file); /* skip ',' or '\n' */
}
Try calling
std::ios::sync_with_stdio(false);
at the start of your program. This disables the (allegedly quite slow) synchronization between cin/cout and scanf/printf (I have never tried this myself, but have often seen the recommendation, such as here). Note that if you do this, you cannot mix C++-style and C-style IO in your program.
(In addition, Olaf Dietsche is completely right about only reading the file once.)
Apparently file I/O is a bad idea; just map the whole file into memory and access the CSV file as a contiguous VM block. This incurs only a few syscalls.
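A minimal sketch of what that might look like on POSIX systems (untested; error handling is trimmed, and parse_mapped is a hypothetical name):
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

void parse_mapped(const char* path) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    // The whole file becomes one contiguous, read-only char range.
    const char* begin = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    const char* end = begin + st.st_size;
    // ... scan [begin, end) in place, e.g. with strtod or std::from_chars ...
    munmap(const_cast<char*>(begin), st.st_size);
    close(fd);
}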
Related
I'm looking to read from std::cin with input of the form below (it is always int, int, int, char[]/str). What would be the fastest way to parse the data into an int array[3] and either a string or char array?
#NumberOfLines(i.e.10000000)
1,2,2,'abc'
2,2,2,'abcd'
1,2,3,'ab'
...1M+ to 10M+ more lines, always in the form of (int,int,int,str)
At the moment, I'm doing something along the lines of:
// unsync stdio
std::ios_base::sync_with_stdio(false);
std::cin.tie(NULL);

// read from cin, once per line in stdin
for (int line = 0; line < amountOfLines && getline(cin, str); ++line) {
    // the first three comma-separated fields are ints
    for (int i = 0; i < 3; ++i) {
        int commaindex = str.find(',');
        string substring = str.substr(0, commaindex);
        array[i] = atoi(substring.c_str());
        str.erase(0, commaindex + 1);
    }
    // whatever remains is the label
    label = str;
    // assign array and label to other stuff and do other stuff, repeat
}
I'm quite new to C++ and recently learned profiling with Visual Studio, though I'm not the best at interpreting the results. IO takes up 68.2% and the kernel takes 15.8% of CPU usage. getline() covers 35.66% of the elapsed inclusive time.
Is there any way I can do something similar to reading large chunks at once to avoid calling getline() as much? I've been told fgets() is much faster; however, I'm unsure of how to use it when I cannot predict the number of characters to specify.
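For what it's worth, the usual fgets() pattern doesn't require predicting line lengths: you pick a buffer cap, and lines longer than the cap simply arrive in pieces (a small untested sketch; the buffer size is an arbitrary assumption):
#include <cstdio>
#include <cstring>

int main() {
    char buf[1 << 16]; // arbitrary cap; most lines here are short
    while (std::fgets(buf, sizeof buf, stdin)) {
        std::size_t len = std::strlen(buf);
        bool complete = len > 0 && buf[len - 1] == '\n';
        // parse buf here; if !complete, the rest of this line
        // arrives in the next std::fgets() call
        (void)complete;
    }
}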
I've attempted to use scanf as follows; however, it was slower than the getline method. I have also used stringstreams, but that was incredibly slow.
scanf("%i,%i,%i,%s",&array[0],&array[1],&array[2],str);
Also, if it matters, it is run on a server with low memory available. I think reading the entire input into a buffer would not be viable?
Thanks!
Update: Using Ted Lyngmo's approach, I gathered the results below.
time wc datafile
real 4m53.506s
user 4m14.219s
sys 0m36.781s
time ./a.out < datafile
real 2m50.657s
user 1m55.469s
sys 0m54.422s
time ./a.out datafile
real 2m40.367s
user 1m53.523s
sys 0m53.234s
You could use std::from_chars (and reserve() the approximate amount of lines you have in the file, if you store the values in a vector for example). I also suggest adding support for reading directly from the file. Reading from a file opened by the program is (at least for me) faster than reading from std::cin (even with sync_with_stdio(false)).
Example:
#include <algorithm> // std::for_each
#include <cctype>    // std::isspace
#include <charconv>  // std::from_chars
#include <cstdint>   // std::uintmax_t
#include <cstdio>    // std::perror
#include <fstream>
#include <iostream>
#include <iterator>  // std::istream_iterator
#include <limits>    // std::numeric_limits
#include <string>    // std::string

struct foo {
    int a[3];
    std::string s;
};
std::istream& operator>>(std::istream& is, foo& f) {
    if(std::getline(is, f.s)) {
        std::from_chars_result fcr{f.s.data(), {}};
        const char* end = f.s.data() + f.s.size();

        // extract the numbers
        for(unsigned i = 0; i < 3 && fcr.ptr < end; ++i) {
            fcr = std::from_chars(fcr.ptr, end, f.a[i]);
            if(fcr.ec != std::errc{}) {
                is.setstate(std::ios::failbit);
                return is;
            }
            // find next non-whitespace
            do ++fcr.ptr;
            while(fcr.ptr < end &&
                  std::isspace(static_cast<unsigned char>(*fcr.ptr)));
        }

        // extract the string
        if(++fcr.ptr < end)
            f.s = std::string(fcr.ptr, end - 1);
        else
            is.setstate(std::ios::failbit);
    }
    return is;
}
std::ostream& operator<<(std::ostream& os, const foo& f) {
    for(int i = 0; i < 3; ++i) {
        os << f.a[i] << ',';
    }
    return os << '\'' << f.s << "'\n";
}
int main(int argc, char* argv[]) {
    std::ifstream ifs;
    if(argc >= 2) {
        ifs.open(argv[1]); // if a filename is given as argument
        if(!ifs) {
            std::perror(argv[1]);
            return 1;
        }
    } else {
        std::ios_base::sync_with_stdio(false);
        std::cin.tie(nullptr);
    }
    std::istream& is = argc >= 2 ? ifs : std::cin;

    // ignore the first line - it's of no use in this demo
    is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');

    // read all `foo`s from the stream
    std::uintmax_t co = 0;
    std::for_each(std::istream_iterator<foo>(is), std::istream_iterator<foo>(),
                  [&co](const foo& f) {
                      // Process each foo here
                      // Just counting them for demo purposes:
                      ++co;
                  });
    std::cout << co << '\n';
}
My test runs on a file with 1'000'000'000 lines with content looking like below:
2,2,2,'abcd'
2, 2,2,'abcd'
2, 2, 2,'abcd'
2, 2, 2, 'abcd'
Unix time wc datafile
1000000000 2500000000 14500000000 datafile
real 1m53.440s
user 1m48.001s
sys 0m3.215s
time ./my_from_chars_prog datafile
1000000000
real 1m43.471s
user 1m28.247s
sys 0m5.622s
From this comparison, I think one can see that my_from_chars_prog is able to successfully parse all entries pretty fast. It was consistently faster than wc, a standard Unix tool whose only purpose is to count lines, words and characters.
If I want to process a text file char by char before using it, what method is most efficient?
I can do this:
ifstream ifs("the_file.txt", ios_base::in);
char c;
while (ifs >> noskipws >> c) {
// process c ...
}
ifs.close();
and this:
ifstream ifs("the_file.txt", ios_base::in);
stringstream sstr;
sstr << ifs.rdbuf();
string txt = sstr.str();
for (string::iterator iter = txt.begin(); iter != txt.end(); ++iter) {
// process *iter ...
}
The final output will be a string split based on the chars found while iterating.
Which is faster? Or maybe there's another, more efficient way? Do I need to flush the stringstream for every character (I read somewhere that flushing affects performance)?
a) Measure (I'd guess the first one should be faster, as it avoids an extra allocation, but that is just a guess)
b) While it can indeed be a Really Bad case of premature optimization, if you really need the very best performance, try something along the lines of:
int f = open(...);
// error handling here
char buf[256];
while(1) {
    int rd = read(f, buf, 256);
    if( rd == 0 ) break;
    for(const char* p = buf; p < buf + rd; ++p) {
        // process *p; note that this loop can be entered more than once
    }
}
close(f);
I'm pretty sure that it will be very difficult to beat this code performance-wise (unless going into very low-level non-standard IO); however, it might easily happen that ifstream will produce comparable results. Or it might not.
NB: for C++ the difference provided by this technique (read fixed-size buffer, then scan buffer) is small and usually negligible, but for other languages it might easily provide up to 2x difference (has been observed on Java).
Based on a crude test with a 20-megabyte file, this method loads the file into one string in 0.1 seconds, versus 0.5 seconds for the rdbuf method you had earlier. So basically there is no difference unless you are accessing lots and lots of files.
ifstream ifs(filename, ios::binary);
string txt;
unsigned int cursor = 0;
const unsigned int readsize = 4096;
while (ifs.good())
{
    txt.resize(cursor + readsize);
    ifs.read(&txt[cursor], readsize);
    cursor += (unsigned int)ifs.gcount();
}
txt.resize(cursor);
I have the following code, which reads text from a file and stores the characters in a vector. However, this code is not reading spaces and pushing them into the vector. I tried to use myRF >> noskipws but it's not working.
int a;
int b;
int outp;
if (myRF.is_open())
{
    while (!myRF.eof())
    {
        myRF >> a;
        myRF >> b;
        // myRF >> noskipws
        for (int i = 0; i < a; i++)
        {
            vector<char> col;
            for (int j = 0; j < b; j++)
            {
                myRF >> outp;
                col.push_back(outp);
            }
            grid.push_back(col);
        }
    }
}
myRF.close();
When you enable std::noskipws, leading whitespace isn't skipped. However, you try to read an int, which can't start with a space! You should read a variable of type char to read, well, chars. That should just work.
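A minimal self-contained sketch of that fix (the filename is a placeholder; note the values are read into a char, not an int):
#include <fstream>
#include <iostream>

int main() {
    std::ifstream myRF("grid.txt"); // placeholder filename
    int a, b;
    myRF >> a >> b;          // ints: skipping whitespace here is fine
    myRF >> std::noskipws;   // keep whitespace in the stream from now on
    char c;
    while (myRF >> c) {      // c receives every character, spaces included
        std::cout << c;
    }
}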
Note that it is much faster to read chars using std::istreambuf_iterator<char>:
std::istream::sentry kerberos(myRF);
if (kerberos) {
    std::istreambuf_iterator<char> it(myRF), end;
    while (it != end /* && other condition */) {
        char c = *it;
        ++it;
        // do other stuff
    }
}
BTW, do not use myRF.eof() to control the loop! That doesn't work because the stream cannot predict what you will try to read! The eof() member is only useful to determine why a read failed, distinguishing between a legit reason (having reached the end of the file) and broken input. Instead, read and check the result, e.g.
while (myRF >> a >> b) {
    // ...
}
The problem is that the >> operator uses whitespace to determine when to end the stream extraction. If you want to grab every character from a file and store them separately, then you would use something like this:
std::vector<char> letters;
std::ifstream fin("someFile.txt");
char ch;
while (fin.get(ch))
    letters.push_back(ch);
void graph::fillTable()
{
    ifstream fin;
    char X;
    int slot = 0;
    fin.open("data.txt");
    while (fin.good()) {
        fin >> Gtable[slot].Name;
        fin >> Gtable[slot].Out;
        cout << Gtable[slot].Name << endl;
        for (int i = 0; i <= Gtable[slot].Out - 1; i++)
        {
            // can't get here
            fin >> X;
            cout << X << endl;
            Gtable[slot].AdjacentOnes.addFront(X);
        }
        slot++;
    }
    fin.close();
}
That's my code; basically it does exactly what I want it to, but it keeps reading when the file is not good anymore. It'll input and output all the things I'm looking for, and then when the file is at an end, fin.good() apparently isn't returning false. Here is the text file.
A 2 B F
B 2 C G
C 1 H
H 2 G I
I 3 A G E
F 2 I E
and here is the output
A
B
F
B
C
G
C
H
H
G
I
I
A
G
E
F
I
E
Segmentation fault
Here is Gtable's type.
struct Gvertex : public slist
{
    char Name;
    int VisitNum;
    int Out;
    slist AdjacentOnes; // linked list from slist
};
I'm expecting it to stop after outputting 'E', which is the last char in the file. The program never gets into the for loop again after reading the last char. I can't figure out why the while isn't breaking.
Your condition in the while loop is wrong. ios::eof() isn't predictive; it will only be set once the stream has attempted (internally) to read beyond end of file. You have to check after each input.
The classical way of handling your case would be to define a >> function for GTable, along the lines of:
std::istream&
operator>>( std::istream& source, GTable& dest )
{
    std::string line;
    while ( std::getline( source, line ) && line.empty() ) {
    }
    if ( source ) {
        std::istringstream tmp( line );
        char name; // Name is a single char in Gvertex
        int count;
        if ( !(tmp >> name >> count) ) {
            source.setstate( std::ios::failbit );
        } else {
            std::vector< char > adjactentOnes;
            char ch;
            while ( tmp >> ch ) {
                adjactentOnes.push_back( ch );
            }
            if ( !tmp.eof() || adjactentOnes.size() != count ) {
                source.setstate( std::ios::failbit );
            } else {
                dest.Name = name;
                dest.Out = count;
                for ( int i = 0; i < count; ++ i ) {
                    dest.AdjacentOnes.addFront( adjactentOnes[ i ] );
                }
            }
        }
    }
    return source;
}
(This was written rather hastily. In real code, I'd almost certainly factor the inner loop out into a separate function.)
Note that:
We read line by line, in order to verify the format (and to allow resynchronization in case of error).
We set failbit in the source stream in case of an input error.
We skip empty lines (since your input apparently contains them).
We do not modify the target element until we are sure that the input is correct.
Once we have this, it is easy to loop over all of the elements:
int slot = 0;
while ( slot < GTable.size() && fin >> GTable[ slot ] ) {
    ++ slot;
}
if ( slot != GTable.size() )
    // ... error ...
EDIT:
I'll point this out explicitly, because the other people responding seem to have missed it: it is absolutely imperative to ensure that you have the place to read into before attempting the read.
EDIT 2:
Given the number of wrong answers this question is receiving, I would like to stress:
Any use of fin.eof() before the input is known to fail is wrong.
Any use of fin.good(), period, is wrong.
Any use of one of the values read before having tested that the input has succeeded is wrong. (This doesn't prevent things like fin >> a >> b, as long as neither a nor b is used before the success is tested.)
Any attempt to read into Gtable[slot] without ensuring that slot is in bounds is wrong.
With regards to eof() and good():
The base class of istream and ostream defines three “error” bits: failbit, badbit and eofbit. It's important to understand when these are set: badbit is set in case of a non-recoverable hardware error (practically never, in fact, since most implementations can't or don't detect such errors); and failbit is set in any other case the input fails, either because no data is available (end of file) or because of a format error ("abc" when inputting an int, etc.).
eofbit is set anytime the streambuf returns EOF, whether this causes the input to fail or not! Thus, if you read an int, and the stream contains "123" without trailing white space or newline, eofbit will be set (since the stream must read ahead to know where the int ends); if the stream contains "123\n", eofbit will not be set. In both cases, however, the input succeeds, and failbit will not be set.
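That behaviour is easy to demonstrate (a small self-contained example; the stream contents are made up):
#include <iostream>
#include <sstream>

int main() {
    std::istringstream a("123"), b("123\n");
    int x, y;
    a >> x; // succeeds; a.eof() is true  (read-ahead hit the end of the buffer)
    b >> y; // succeeds; b.eof() is false (the newline ended the number)
    std::cout << a.eof() << ' ' << b.eof() << '\n';   // prints: 1 0
    std::cout << a.fail() << ' ' << b.fail() << '\n'; // prints: 0 0
}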
To read these bits, there are the following functions (as code, since I don't know how to get a table otherwise):
eof():            returns eofbit
bad():            returns badbit
fail():           returns failbit || badbit
good():           returns !failbit && !badbit && !eofbit
operator!():      returns fail()
operator void*(): returns fail() ? NULL : this
                  (typically; all that's guaranteed is that !fail() returns non-null)
Given this: the first check must always be fail() or one of the operators (which are based on fail()). Once fail() returns true, we can use the other functions to determine why:
if ( fin.bad() ) {
    // Serious problem, disk read error or such.
} else if ( fin.eof() ) {
    // End of file: there was no data there to read.
} else {
    // Formatting error: something like "abc" for an int.
}
Practically speaking, any other use is an error (and any use of good() is an error; don't ask me why the function is there).
Slightly slower but cleaner approach:
void graph::fillTable()
{
    ifstream fin("data.txt");
    char X;
    int slot = 0;
    std::string line;
    while (std::getline(fin, line))
    {
        if (line.empty()) // skip empty lines
            continue;
        std::istringstream sin(line);
        if (sin >> Gtable[slot].Name >> Gtable[slot].Out && Gtable[slot].Out > 0)
        {
            std::cout << Gtable[slot].Name << std::endl;
            for (int i = 0; i < Gtable[slot].Out; ++i)
            {
                if (sin >> X)
                {
                    std::cout << X << std::endl;
                    Gtable[slot].AdjacentOnes.addFront(X);
                }
            }
            slot++;
        }
    }
}
If you still have issues, it's not with file reading...
The file won't fail until you actually read past the end of the file. This won't occur until the fin >> Gtable[slot].Name; line. Since your check is before this, good() can still return true.
One solution would be to add additional checks for failure and break out of the loop if so.
fin >> Gtable[slot].Name;
fin >> Gtable[slot].Out;
if (!fin) break;
This still does not handle formatting errors in the input file very nicely; for that you should be reading line by line as mentioned in some of the other answers.
Try moving the first two reads into the while condition:
// assuming Gtable has at least size of 1
while( fin >> Gtable[slot].Name && fin >> Gtable[slot].Out ) {
    cout << Gtable[slot].Name << endl;
    for(int i = 0; i <= Gtable[slot].Out - 1; i++) {
        fin >> X;
        cout << X << endl;
        Gtable[slot].AdjacentOnes.addFront(X);
    }
    slot++;
    // EDIT:
    if (slot == table_size) break;
}
Edit: As per James Kanze's comment, you're taking an address past the end of the Gtable array, which is what causes the segfault. You could pass the size of Gtable as an argument to your fillTable() function (e.g. void fillTable(int table_size)) and check that slot is in bounds before each read.
Edited in response to James' comment - the code now uses a good() check instead of a !eof() check, which will allow it to catch most errors. I also threw in an is_open() check to ensure the stream is associated with the file.
Generally, you should try to structure your file reading in a loop as follows:
ifstream fin("file.txt");
char a = '\0';
int b = 0;
char c = '\0';
if (!fin.is_open())
return 1; // Failed to open file.
// Do an initial read. You have to attempt at least one read before you can
// reliably check for EOF.
fin >> a;
// Read until EOF
while (fin.good())
{
// Read the integer
fin >> b;
// Read the remaining characters (I'm just storing them in c in this example)
for (int i = 0; i < b; i++)
fin >> c;
// Begin to read the next line. Note that this will be the point at which
// fin will reach EOF. Since it is the last statement in the loop, the
// file stream check is done straight after and the loop is exited.
// Also note that if the file is empty, the loop will never be entered.
fin >> a;
}
fin.close();
This solution is desirable (in my opinion) because it does not rely on adding random breaks inside the loop, and the loop condition is a simple good() check. This makes the code easier to understand.
I need to read the nth line of a text file (e.g. textfile.findline(0) would find the first line of the text file loaded with ifstream textfile). Is this possible?
I don't need to put the contents of the file in an array/vector; I just need to assign a specific line of the text file to a variable (specifically an int).
P.S. I am looking for the simplest solution that would not require me to use any big external library (e.g. Boost)
Thanks in advance.
How about this?
std::string ReadNthLine(const std::string& filename, int N)
{
    std::ifstream in(filename.c_str());

    std::string s;
    // for performance
    s.reserve(some_reasonable_max_line_length);

    // skip N lines
    for(int i = 0; i < N; ++i)
        std::getline(in, s);

    std::getline(in, s);
    return s;
}
If you want to read the start of the nth line, you can use std::istream::ignore to skip over the first n-1 lines, then read from the next line to assign to the variable.
template<typename T>
void readNthLine(istream& in, int n, T& value) {
    for (int i = 0; i < n - 1; ++i) {
        in.ignore(numeric_limits<streamsize>::max(), '\n');
    }
    in >> value;
}
Armen's solution is the correct answer, but I thought I'd throw out an alternative, based on jweyrich's caching idea. For better or for worse, this reads in the entire file at construction, but only saves the newline positions (doesn't store the entire file, so it plays nice with massive files.) Then you can simply call ReadNthLine, and it will immediately jump to that line, and read in the one line you want. On the other hand, this is only optimal if you want to get only a fraction of the lines at a time, and the line numbers are not known at compile time.
#include <fstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

class TextFile {
    std::ifstream file_stream;
    std::vector<std::ifstream::streampos> linebegins;
    TextFile& operator=(TextFile& b) = delete;
public:
    TextFile(std::string filename)
        : file_stream(filename)
    {
        // this chunk stolen from Armen's,
        std::string s;
        // for performance
        s.reserve(some_reasonable_max_line_length);
        while (file_stream) {
            linebegins.push_back(file_stream.tellg());
            std::getline(file_stream, s);
        }
    }
    TextFile(TextFile&& b)
        : file_stream(std::move(b.file_stream)),
          linebegins(std::move(b.linebegins))
    {}
    TextFile& operator=(TextFile&& b)
    {
        file_stream = std::move(b.file_stream);
        linebegins = std::move(b.linebegins);
        return *this;
    }
    std::string ReadNthLine(int N) {
        if (N >= (int)linebegins.size() - 1)
            throw std::runtime_error("File doesn't have that many lines!");
        std::string s;
        // clear EOF and error flags
        file_stream.clear();
        file_stream.seekg(linebegins[N]);
        std::getline(file_stream, s);
        return s;
    }
};
It's certainly possible. There are (n-1) '\n' characters preceding the nth line. Read lines until you reach the one you're looking for. You can do this on the fly, without storing anything except the current line being considered.
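A minimal sketch of that on-the-fly approach (getNthLine and the filename are made up for the example; std::stoi throws on a missing or non-numeric line):
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical helper: returns the nth (0-based) line, or an empty
// string if the file has fewer than n+1 lines.
std::string getNthLine(std::istream& in, int n) {
    std::string line;
    for (int i = 0; std::getline(in, line); ++i)
        if (i == n)
            return line;
    return std::string();
}

int main() {
    std::ifstream in("textfile.txt");                // placeholder filename
    int value = std::stoi(getNthLine(in, 4));        // the 5th line, as an int
    std::cout << value << '\n';
}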