Big CSV file C++ parsing performance

I have a big CSV file (25 MB) that represents a symmetric graph (about 1.8k × 1.8k). While parsing it into an array of vectors, I analyzed the code (with the VS2012 analyzer) and it shows that the problem with the parsing efficiency (about 19 seconds total) occurs while reading each character (getline → basic_string::operator+=).
This leaves me frustrated, because in Java, with simple buffered line reading and a tokenizer, I achieve the same thing in less than half a second.
My code uses only the standard library:
int allColumns = initFirstRow(file, secondRow);
// secondRow has been initialized with one value
int column = 1; // don't forget, first column is 0
VertexSet* rows = new VertexSet[allColumns];
rows[1] = secondRow;
string vertexString;
long double vertexDouble;
for (int row = 1; row < allColumns; row++) {
    // don't do the last row
    for (; column < allColumns; column++) {
        // don't do the last column
        getline(file, vertexString, ',');
        vertexDouble = stold(vertexString);
        if (vertexDouble > _TH) {
            rows[row].add(column);
        }
    }
    // do the last in the column
    getline(file, vertexString);
    vertexDouble = stold(vertexString);
    if (vertexDouble > _TH) {
        rows[row].add(++column);
    }
    column = 0;
}
initLastRow(file, rows[allColumns - 1], allColumns);
initFirstRow and initLastRow basically do the same thing as the loop above, but initFirstRow also counts the number of columns.
VertexSet is basically a vector of indexes (int). Each value read (separated by ',') is no more than 7 characters long (values are between -1 and 1).
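VertexSet itself isn't shown in the post; a minimal hypothetical stand-in consistent with that description (not the asker's actual class) might look like:

```cpp
#include <vector>
#include <cstddef>

// Hypothetical stand-in for the VertexSet described above: a thin wrapper
// around a vector of column indexes.
class VertexSet {
    std::vector<int> indexes;
public:
    void add(int column) { indexes.push_back(column); }
    std::size_t size() const { return indexes.size(); }
    int operator[](std::size_t i) const { return indexes[i]; }
};
```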

At 25 megabytes, I'm going to guess that your file is machine generated. As such, you (probably) don't need to worry about things like verifying the format (e.g., that every comma is in place).
Given the shape of the file (i.e., each line is quite long) you probably won't impose a lot of overhead by putting each line into a stringstream to parse out the numbers.
Based on those two facts, I'd at least consider writing a ctype facet that treats commas as whitespace, then imbuing the stringstream with a locale using that facet to make it easy to parse out the numbers. Overall code length would be a little greater, but each part of the code would end up pretty simple:
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <time.h>
#include <stdlib.h>
#include <locale>
#include <sstream>
#include <algorithm>
#include <iterator>
class my_ctype : public std::ctype<char> {
    // The base class needs a table that outlives the facet; a function-local
    // static sidesteps the base-before-member initialization order problem
    // (a member vector would not yet be constructed when the base reads it).
    static mask *make_table() {
        static std::vector<mask> table(classic_table(),
                                       classic_table() + table_size);
        table[','] = (mask)space;
        return table.data();
    }
public:
    my_ctype(size_t refs = 0) : std::ctype<char>(make_table(), false, refs) {}
};
template <class T>
class converter {
    std::stringstream buffer;
    my_ctype *m;
    std::locale l;
public:
    converter() : m(new my_ctype), l(std::locale::classic(), m) { buffer.imbue(l); }

    std::vector<T> operator()(std::string const &in) {
        buffer.clear();
        buffer << in;
        return std::vector<T>{std::istream_iterator<T>(buffer),
                              std::istream_iterator<T>()};
    }
};
int main() {
    std::ifstream in("somefile.csv");
    std::vector<std::vector<double>> numbers;
    std::string line;
    converter<double> cvt;

    clock_t start = clock();
    while (std::getline(in, line))
        numbers.push_back(cvt(line));
    clock_t stop = clock();
    std::cout << double(stop - start) / CLOCKS_PER_SEC << " seconds\n";
}
To test this, I generated a 1.8K x 1.8K CSV file of pseudo-random doubles like this:
#include <iostream>
#include <stdlib.h>
int main() {
    for (int i = 0; i < 1800; i++) {
        for (int j = 0; j < 1800; j++)
            std::cout << rand() / double(RAND_MAX) << ",";
        std::cout << "\n";
    }
}
This produced a file around 27 megabytes. After compiling the reading/parsing code with gcc (g++ -O2 trash9.cpp), a quick test on my laptop showed it running in about 0.18 to 0.19 seconds. It never seems to use (even close to) all of one CPU core, indicating that it's I/O bound, so on a desktop/server machine (with a faster hard drive) I'd expect it to run faster still.

The inefficiency here is in Microsoft's implementation of std::getline, which is being used in two places in the code. The key problems with it are:
It reads from the stream one character at a time
It appends to the string one character at a time
The profile in the original post shows that the second of these problems is the biggest issue in this case.
I have written more about the inefficiency of std::getline elsewhere.
GNU's implementation of std::getline, i.e. the version in libstdc++, is much better.
Sadly, if you want your program to be fast and you build it with Visual C++ you'll have to use lower level functions than std::getline.

The debug runtime library in VS is very slow because it does a lot of debug checks (for out-of-bounds accesses and things like that) and calls lots of very small functions that are not inlined when you compile in Debug.
Running your program in Release should remove all these overheads.
My bet on the next bottleneck is string allocation.

I would try reading bigger chunks of memory at once and then parsing them all.
For example: read a full line, then parse that line using pointers and specialized functions.
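As a rough sketch of that idea (my illustration, not the answerer's code): once a whole line sits in a buffer, walk it with strtod so no temporary string is created per value.

```cpp
#include <cstdlib>
#include <vector>

// Parse comma-separated doubles from a NUL-terminated line using raw
// pointers; strtod advances `end` past each number it converts.
std::vector<double> parse_line(const char *p) {
    std::vector<double> values;
    char *end;
    for (;;) {
        double v = std::strtod(p, &end);
        if (end == p)                      // no number found: we're done
            break;
        values.push_back(v);
        p = (*end == ',') ? end + 1 : end; // step over the separator
    }
    return values;
}
```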

Good answers here. It took me a while, but I had the same problem. After this fix my write-and-process time went from 38 seconds to 6 seconds.
Here's what I did.
First get the data using boost's mmap. Then you can use boost threads to speed up processing of the const char* that boost mmap returns. Something like this (the multithreading differs depending on your implementation, so I excluded that part):
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/thread/thread.hpp>
#include <boost/lockfree/queue.hpp>
#include <cstdlib>
#include <string>
#include <vector>
using namespace std;

void foo(const string &path)
{
    boost::iostreams::mapped_file mmap(path, boost::iostreams::mapped_file::readonly);
    auto chars = mmap.const_data();    // pointer to the mapped char data
    auto eofile = chars + mmap.size(); // used to detect end of file
    string next = "";                  // used to read in chars
    vector<double> data;               // store the data
    for (; chars && chars != eofile; chars++) {
        if (chars[0] == ',' || chars[0] == '\n') { // end of value
            data.push_back(atof(next.c_str()));    // add value
            next = "";                             // clear
        }
        else
            next += chars[0]; // add to the current value string
    }
}

Related

Read a line of a file c++

I'm just trying to use the fstream library, and I want to read a given row.
I came up with this, but I don't know if it is the most efficient way.
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main(){
    fstream input2;
    string line;
    int countLine = 0;
    input2.open("theinput.txt");
    if (input2.is_open()) {
        while (getline(input2, line)) {
            countLine++;
            if (countLine == 1) { // 1 is the line I want to read.
                cout << line << endl;
            }
        }
    }
}
Is there another way?
This does not appear to be the most efficient code, no.
In particular, you're currently reading the entire input file even though you only care about one line of the file. Unfortunately, doing a good job of skipping a line is somewhat difficult. Quite a few people recommend using code like:
your_stream.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
...for this job. This can work, but has a couple of shortcomings. First and foremost, if you try to use it on a non-text file (especially one that doesn't contain new-lines) it can waste inordinate amounts of time reading an entire huge file, long after you've read enough that you would normally realize that there must be a problem. For example, if you're reading a "line", that's a pretty good indication that you're expecting a text file, and you can pretty easily set a much lower limit on how long that first line could reasonably be, such as (say) a megabyte, and usually quite a lot less than that.
You also usually want to detect whether it stopped reading because it reached that maximum, or because it got to the end of the line. Skipping a line "succeeded" only if a new-line was encountered before reaching the specified maximum. To do that, you can use gcount() to compare against the maximum you specified. If you stopped reading because you reached the specified maximum, you typically want to stop processing that file (and log the error, print out an error message, etc.)
With that in mind, we might write code like this:
bool skip_line(std::istream &in) {
    size_t max = 0xfffff;
    in.ignore(max, '\n');
    return in.gcount() < max;
}
Depending on the situation, you might prefer to pass the maximum line size as a parameter (probably with a default) instead:
bool skip_line(std::istream &in, size_t max = 0xfffff) {
    // body identical to the version above, minus the definition of `max`
With this, you can skip up to a megabyte by default, but if you want to specify a different maximum, you can do so quite easily.
Either way, with that defined, the remainder becomes fairly trivial, something like this:
int main(){
    std::ifstream in("theinput.txt");
    if (!skip_line(in)) {
        std::cerr << "Error reading file\n";
        return EXIT_FAILURE;
    }
    // copy the second line:
    std::string line;
    if (std::getline(in, line))
        std::cout << line;
}
Of course, if you want to skip more than one line, you can do that pretty easily as well by putting the call to skip_line in a loop--but note that you still usually want to test the result from it, and break the loop (and log the error) if it fails. You don't usually want something like:
for (int i=0; i<lines_to_skip; i++)
skip_line(in);
With that, you'd lose one of the basic benefits: assurance that your input really is what you expected and that you're not producing garbage.
I think you can condense your code to this. if (input) is sufficient to check for failure.
#include <iostream>
#include <fstream>
#include <limits>
int main()
{
    std::ifstream input("file.txt");
    int row = 5;
    int count = 0;
    if (input)
    {
        while (count++ < row)
            input.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
        std::string line;
        std::getline(input, line);
        std::cout << line;
    }
}

How to get more performance when reading file

My program downloads files from a site (via curl, every 30 minutes). (It is possible for these files to reach 150 MB in size.)
So I thought that getting data from these files could be inefficient. (I search for a line every 5 seconds.)
These files can have ~10,000 lines.
To parse a file (values are separated by ',') I use a regex:
regex wzorzec("(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)");
There are 8 values.
Now I have to push it to a vector:
allys.push_back({ std::stoi(std::string(wynik[1])), nick, tag, stoi(string(wynik[4])), stoi(string(wynik[5])), stoi(string(wynik[6])), stoi(string(wynik[7])), stoi(string(wynik[8])) });
I use std::async to do that, but for 3 files (~7 MB) the processor jumps to 80% and the operation takes about 10 seconds. I read from an SSD, so slow I/O is not at fault.
I'm reading the data line by line with fstream.
How can I speed this operation up?
Maybe I should parse these values and push them to SQL?
Best Regards
You can probably get a performance boost by avoiding regex and using something along the lines of std::strtok, or else just hard-coding a search for commas in your data. Regex has more power than you need just to look for commas. Next, if you call vector::reserve before a sequence of push_back calls on a given vector, you will save a lot of time on reallocation and moving memory around. If you are expecting a large vector, reserve room for it up front.
This may not cover all available performance ideas, but I'd bet you will see an improvement.
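As an illustrative sketch of both suggestions combined (my example, not the answerer's code): a hand-rolled comma search plus an up-front reserve; the field count of 8 matches the question's format.

```cpp
#include <string>
#include <vector>
#include <cstddef>

// Split a line on commas without a regex, reserving the expected number of
// fields up front to avoid repeated reallocation.
std::vector<std::string> split_fields(const std::string &line) {
    std::vector<std::string> fields;
    fields.reserve(8); // the question's rows have 8 values
    std::size_t start = 0, comma;
    while ((comma = line.find(',', start)) != std::string::npos) {
        fields.push_back(line.substr(start, comma - start));
        start = comma + 1;
    }
    fields.push_back(line.substr(start)); // last field has no trailing comma
    return fields;
}
```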
Your problem here is most likely the additional overhead introduced by the regular expression, since you're using many variable-length and greedy matches (the regex engine will try different alignments of the matches to find the largest matching result).
Instead, you might want to parse the lines manually. There are many different ways to achieve this. Here's one quick and dirty example (it's not flexible and has quite a bit of duplicate code, but there's lots of room for optimization). It should explain the basic idea, though:
#include <iostream>
#include <sstream>
#include <cstdlib>
const char *input = "1,Mario,Stuff,4,5,6,7,8";
struct data {
int id;
std::string nick;
std::string tag;
} myData;
int main(int argc, char **argv){
    char buffer[256];
    std::istringstream in(input);

    // Read an entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.id = atoi(buffer); // convert and store
    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.nick = buffer; // store
    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.tag = buffer; // store
    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Some test output
    std::cout << "id: " << myData.id << "\nnick: " << myData.nick << "\ntag: " << myData.tag << std::endl;
    return 0;
}
Note that there isn't any error handling in case entries are too long or too short (or broken in some other way).
Console output:
id: 1
nick: Mario
tag: Stuff

C++ Read file into Array / List / Vector

I am currently working on a small program to join two text files (similar to a database join). One file might look like:
269ED3
86356D
818858
5C8ABB
531810
38066C
7485C5
948FD4
The second one is similar:
hsdf87347
7485C5
rhdff
23487
948FD4
Both files have over 1,000,000 lines and are not limited to a specific number of characters. What I would like to do is find all matching lines in both files.
I have tried a few things, Arrays, Vectors, Lists - but I am currently struggling with deciding what the best (fastest and memory easy) way.
My code currently looks like:
#include <iostream>
#include <fstream>
#include <string>
#include <ctime>
#include <list>
#include <algorithm>
#include <iterator>
using namespace std;

int main()
{
    string line;
    clock_t startTime = clock();
    list<string> data;
    //read first file
    ifstream myfile ("test.txt");
    if (myfile.is_open())
    {
        while (getline(myfile, line)) {
            data.push_back(line);
        }
        myfile.close();
    }
    list<string> data2;
    //read second file
    ifstream myfile2 ("test2.txt");
    if (myfile2.is_open())
    {
        while (getline(myfile2, line)) {
            data2.push_back(line);
        }
        myfile2.close();
    }
    // ... comparison logic (garbled in the original post) ...
    return 0;
}
My thinking is: with a vector, random access to elements is very difficult and jumping to the next element is not optimal (not in the code, but I hope you get the point). It also takes a long time to read the file into a vector by using push_back and adding the lines one by one. With arrays the random access is easier, but reading >1,000,000 records into an array will be very memory-intensive and take a long time as well. Lists can read the files faster, but random access is expensive again.
Eventually I will not only look for exact matches, but also for the first 4 characters of each line.
Can you please help me deciding, what the most efficient way is? I have tried arrays, vectors and lists, but am not satisfied with the speed so far. Is there any other way to find matches, that I have not considered? I am very happy to change the code completely, looking forward to any suggestion!
Thanks a lot!
EDIT: The output should list the matching values / lines. In this example the output is supposed to look like:
7485C5
948FD4
Reading 2 million lines won't be too slow; what might be slowing you down is your comparison logic.
Use std::set_intersection:
std::sort(data1.begin(), data1.end()); // N1 log(N1), assuming data1/data2 are vectors
std::sort(data2.begin(), data2.end()); // N2 log(N2)
std::vector<std::string> v; // receives the matching elements
std::set_intersection(data1.begin(), data1.end(),
                      data2.begin(), data2.end(),
                      std::back_inserter(v));
// Does at most 2*(N1+N2)-1 comparisons (worst case)
You can also try using a std::set: insert the lines from one file into it (the set keeps only unique elements), then look up each line of the other file in it.
If the values in the first file are unique, this becomes trivial when exploiting the O(log n) lookup characteristics of a set. The following stores all lines of the first file (passed as a command-line argument) in a set, then performs an O(log n) search for each line in the second file.
EDIT: Added 4-char-only preamble searching. To do this, the set contains only the first four chars of each line, and the search from the second file looks only at the first four chars of each search line. The second-file line is printed in its entirety if there is a match. Printing the first file's full line in its entirety would be a bit more challenging.
#include <iostream>
#include <fstream>
#include <string>
#include <set>
#include <cstdlib>

int main(int argc, char *argv[])
{
    if (argc < 3)
        return EXIT_FAILURE;

    // load set with first file
    std::ifstream inf(argv[1]);
    std::set<std::string> lines;
    std::string line;
    while (std::getline(inf, line))
        lines.insert(line.substr(0, 4));

    // load second file, identifying all matching entries.
    std::ifstream inf2(argv[2]);
    while (std::getline(inf2, line))
    {
        if (lines.find(line.substr(0, 4)) != lines.end())
            std::cout << line << std::endl;
    }
    return 0;
}
One solution is to read the entire file at once.
Use istream::seekg and istream::tellg to figure the size of the two files. Allocate a character array large enough to store them both. Read both files into the array, at appropriate location, using istream::read.
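A sketch of that approach for a single file (an illustration of the idea, with minimal error handling; it assumes the file fits in memory):

```cpp
#include <fstream>
#include <string>
#include <cstddef>

// Measure the file with seekg/tellg, then pull it into one buffer with a
// single read call.
std::string read_whole_file(const char *path) {
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return std::string();
    in.seekg(0, std::ios::end);
    std::streamsize size = in.tellg();   // position at end == file size
    in.seekg(0, std::ios::beg);
    std::string buffer(static_cast<std::size_t>(size), '\0');
    in.read(&buffer[0], size);
    return buffer;
}
```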

Trouble getting string to print random line from text file

I picked up this bit of code a while back as a way to select a random line from a text file and output the result. Unfortunately, it only seems to output the first letter of the line that it selects, and I can't figure out why it's doing so or how to fix it. Any help would be appreciated.
#include "stdafx.h"
#include <stdio.h>
#include <string.h> // for strcpy
#include <iostream>
#include <fstream>
#include <string>
#include <time.h>
using namespace std;
#define MAX_STRING_SIZE 1000
string firstName()
{
    string firstName;
    char str[MAX_STRING_SIZE], pick[MAX_STRING_SIZE];
    FILE *fp;
    int readCount = 0;
    fp = fopen("firstnames.txt", "r");
    if (fp)
    {
        if (fgets(pick, MAX_STRING_SIZE, fp) != NULL)
        {
            readCount = 1;
            while (fgets(str, MAX_STRING_SIZE, fp) != NULL)
            {
                if ((rand() % ++readCount) == 0)
                {
                    strcpy(pick, str);
                }
            }
        }
    }
    fclose(fp);
    firstName = *pick;
    return firstName;
}

int main()
{
    srand(time(NULL));
    int n = 1;
    while (n < 10)
    {
        string fn = firstName();
        cout << fn << endl;
        ++n;
    }
    system("pause");
}
firstName = *pick;
I am guessing this is the problem.
pick here is essentially a pointer to the first element of the array (a char*), so of course *pick is of type char, i.e. the first character of the array.
Another way to see it is that *pick == *(pick +0) == pick[0]
There are several ways to fix it. Simplest is to just do the below.
return pick;
The constructor will automatically make the conversion for you.
Since you didn't specify the format of your file, I'll cover both cases: fixed record length and variable record length; assuming each text line is a record.
Reading Random Names, Fixed Length Records
This one is straight forward.
Determine the index (random) of the record you want.
Calculate the file position = record length * index.
Set file to the position.
Read text from file, using std::getline.
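The fixed-length steps above might be sketched like this (my illustration; it assumes record_length counts each line's characters plus its newline, and that every record really is the same length):

```cpp
#include <fstream>
#include <string>
#include <cstddef>

// Seek straight to record_length * index and read one line; only valid when
// every record in the file has exactly the same length.
std::string read_fixed_record(std::ifstream &in, std::size_t record_length,
                              std::size_t index) {
    in.clear(); // clear a possible eof flag from an earlier read
    in.seekg(static_cast<std::streamoff>(record_length * index));
    std::string line;
    std::getline(in, line);
    return line;
}
```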
Reading Random Names, Variable Length Records
This assumes that the length of the text lines vary. Since they vary, you can't use math to determine the file position.
To randomly pick a line from a file you will either have to put each line into a container, or put the file offset of the beginning of the line into a container.
After you have your container established, determine the random name number and use it as an index into the container. If you stored the file offsets, position the file to the offset and read the line. Otherwise, pull the text from the container.
Which container should be used? It depends. Storing the text is faster but takes up memory (you are essentially storing the file into memory). Storing the file positions takes up less room but you will end up reading each line twice (once to find the position, second to fetch the data).
Augmentations to these algorithms is to memory-map the file, which is an exercise for the reader.
Edit 1: Example
#include <iostream>
#include <fstream>
#include <vector>
#include <string>

// Create a container for the file positions.
std::vector<std::streampos> file_positions;
// Create a container for the text lines.
std::vector<std::string> text_lines;

// Load both containers.
// The number of lines is the size of either vector.
void
Load_Containers(std::ifstream& inp)
{
    std::string text_line;
    std::streampos file_pos;
    file_pos = inp.tellg();
    while (std::getline(inp, text_line))
    {
        file_positions.push_back(file_pos);
        file_pos = inp.tellg();
        text_lines.push_back(text_line);
    }
}
}

How to enhance the speed of my C++ program in reading delimited text files?

I'll show you C# and C++ code that do the same job: read a text file delimited by '|' and save it as '#'-delimited text.
When I execute C++ program, the time elapsed is 169 seconds.
UPDATE 1: Thanks to Seth (compilation with: cl /EHsc /Ox /Ob2 /Oi) and GWW for changing the positions of string s outside the loops, the elapsed time was reduced to 53 seconds. I updated the code also.
UPDATE 2: Do you have any other suggestions to enhance the C++ code?
When I execute the C# program, the elapsed time is 34 seconds!
The question is: how can I make the C++ version as fast as the C# one?
C++ Program:
int main ()
{
    Timer t;
    cout << t.ShowStart() << endl;
    ifstream input("in.txt");
    ofstream output("out.txt", ios::out);
    char const row_delim = '\n';
    char const field_delim = '|';
    string s1, s2;
    while (input)
    {
        if (!getline(input, s1, row_delim))
            break;
        istringstream iss(s1);
        while (iss)
        {
            if (!getline(iss, s2, field_delim))
                break;
            output << s2 << "#";
        }
        output << "\n";
    }
    t.Stop();
    cout << t.ShowEnd() << endl;
    cout << "Executed in: " << t.ElapsedSeconds() << " seconds." << endl;
    return 0;
}
C# program:
static void Main(string[] args)
{
    long i;
    Stopwatch sw = new Stopwatch();
    Console.WriteLine(DateTime.Now);
    sw.Start();
    StreamReader sr = new StreamReader("in.txt", Encoding.Default);
    StreamWriter wr = new StreamWriter("out.txt", false, Encoding.Default);
    object[] cols = new object[0]; // allocates more elements automatically when filling
    string line;
    while (!string.Equals(line = sr.ReadLine(), null)) // Fastest way
    {
        cols = line.Split('|'); // Faster than using a List<>
        foreach (object col in cols)
            wr.Write(col + "#");
        wr.WriteLine();
    }
    sw.Stop();
    Console.WriteLine("Count took {0} secs", sw.Elapsed);
    Console.WriteLine(DateTime.Now);
}
UPDATE 3:
Well, I must say I am very happy for the help received and because the answer to my question has been satisfied.
I changed the text of the question a little to be more specific, and I tested the solutions kindly offered by molbdnilo and Bo Persson.
Keeping Seth indications for the compile command (i.e. cl /EHsc /Ox /Ob2 /Oi pgm.cpp):
Bo Persson's solution took 18 seconds on average to complete, a really good result considering that the code stays close to what I like.
molbdnilo's solution took 6 seconds on average, really amazing! (Thanks to Constantine also.)
Never too late to learn, and I learned valuable things with my question.
My best regards.
As Constantine suggests, read large chunks at a time using read.
I cut the time from ~25s to ~3s on a 129M file with 5M "entries" (26 bytes each) in 100,000 lines.
#include <iostream>
#include <fstream>
#include <sstream>
#include <algorithm>
using namespace std;
int main ()
{
    ifstream input("in.txt");
    ofstream output("out.txt", ios::out);
    const size_t size = 512 * 1024;
    char buffer[size];
    while (input) {
        input.read(buffer, size);
        size_t readBytes = input.gcount();
        replace(buffer, buffer + readBytes, '|', '#');
        output.write(buffer, readBytes);
    }
    input.close();
    output.close();
    return 0;
}
How about this for the central loop
while (getline(input, s1, row_delim))
{
    for (string::iterator c = s1.begin(); c != s1.end(); ++c)
        if (*c == field_delim)
            *c = '#';
    output << s1 << '\n';
}
It seems to me that your slow part is within getline. I don't have precise documentation to support this, but that's how it looks to me. You should try using read instead. Because getline takes a delimiter, it must check every symbol for the delimiter, which looks like many separate input operations: your program fetches a symbol from the file, then writes it into your program's memory; in other words, time is consumed by disk head movement. If you use the read function instead, you copy a whole block of symbols at once and then work on it within the program's memory, which may reduce the time consumed.
P.S. Again, I don't have documentation on getline's internals, but I'm sure about read; I hope this is helpful.
If you know the maximum line length, you can use stdio + fgets and null-terminated strings; it will rock.
For C#, if the file fits in memory (probably not if it takes 34 seconds), I'd be curious to see how IO.File.WriteAllText("out.txt", IO.File.ReadAllText("in.txt").Replace("|","#")); performs!
I'd be really surprised if this beat molbdnilo's version, but it's probably the second fastest, and (I would posit) the simplest and cleanest:
#include <fstream>
#include <string>
#include <sstream>
#include <algorithm>

int main() {
    std::ifstream in("in.txt");
    std::ostringstream buffer;
    buffer << in.rdbuf();
    std::string s(buffer.str());
    std::replace(s.begin(), s.end(), '|', '#');
    std::ofstream out("out.txt");
    out << s;
    return 0;
}
Based on past experience with this method, I'd expect it to be no worse than half the speed of what molbdnilo posted, which should still be around triple the speed of your C# version and over ten times as fast as your original C++ version. [Edit: I just wrote a file generator, and on a file a little over 100 megabytes, it's even closer than I expected: I'm getting 4.4 seconds, versus 3.5 for molbdnilo's code.] The combination of reasonable speed with really short, simple code is often quite a decent trade-off. Of course, that's all predicated on having enough physical RAM to hold the entire file content in memory, but that's generally a fairly safe assumption these days.