Efficiently read large spreadsheet file in C++

I normally use the method described in csv parser to read spreadsheet files. However, when reading a 64MB file which has around 40 columns and 250K rows of data, it takes about 4 minutes. In the original method, a CSVRow class is used to read the file row by row, and a private vector is used to store all the data in a row.
Several things to note:
I did reserve enough capacity for the vector, but it did not help much.
I also need to create instances of some classes when reading each line, but even when the code just reads in the data without creating any instances, it still takes a long time.
The file is tab-delimited instead of comma-delimited, but I don't think it matters.
Since some columns in that file are not useful, I changed the method to store the whole line in a private string member and then find the positions of the (n-1)th and nth delimiters to get the useful data (of course, there are many useful columns). By doing so, I avoid some push_back operations and cut the time to a little more than 2 minutes. However, that still seems too long to me.
Here are my questions:
1. Is there a way to read such a spreadsheet file more efficiently?
2. Should I read the file in buffers instead of line by line? If so, how do I read by buffer and still use the CSVRow class?
3. I haven't tried Boost.Tokenizer; is that more efficient?
Thank you for your help!

It looks like you're being bottlenecked by I/O. Instead of reading the file line by line, read it in blocks of maybe 8 MB. Parse the block for records and determine whether the end of the block is a partial record. If it is, copy that portion of the last record from the block and prepend it to the next block. Repeat until the file is all read. This way, for a 64 MB file you're only making 8 I/O requests. You can experiment with the block size to determine what gives the best performance vs. memory usage.
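A minimal sketch of that idea, assuming newline-terminated records (parseRecord is a hypothetical placeholder for whatever per-row handling you do; the sketch still uses std::string for the carry-over, trading some allocations for clarity):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical per-record handler; stands in for your CSVRow-style parsing.
void parseRecord(const std::string& /*record*/) {}

void readInBlocks(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    const std::size_t blockSize = 8 * 1024 * 1024; // 8 MB; tune as needed
    std::vector<char> block(blockSize);
    std::string leftover; // partial record carried over from the previous block

    while (in) {
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0)
            break;

        std::string chunk = leftover + std::string(block.data(), static_cast<std::size_t>(got));
        std::size_t start = 0, nl;
        while ((nl = chunk.find('\n', start)) != std::string::npos) {
            parseRecord(chunk.substr(start, nl - start)); // complete record
            start = nl + 1;
        }
        leftover = chunk.substr(start); // prepend to the next block
    }
    if (!leftover.empty())
        parseRecord(leftover); // file may not end in a newline
}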

If reading the whole data into memory is acceptable (and apparently it is), then I'd do this:
Read the whole file into a std::vector<char>
Populate a vector<vector<vector<char>::size_type>> which contains the positions of all newline and comma characters in the data. These positions denote the start/end of each cell.
Some code sketch to demonstrate the idea:
vector<vector<vector<char>::size_type> > rows;
for ( vector<char>::size_type i = 0; i < data.size(); ++i ) {
    vector<vector<char>::size_type> currentRow;
    currentRow.push_back( i ); // start of the line, i.e. of the first cell
    while ( i < data.size() && data[i] != '\n' ) {
        if ( data[i] == ',' ) { // XXX consider comma at end of line
            currentRow.push_back( i ); // position of each delimiter
        }
        ++i;
    }
    rows.push_back( currentRow );
}
// XXX consider files which don't end in a newline
Thus, you know the positions of all newlines and all commas, and you have the complete CSV data available as one contiguous memory block. So you can easily extract a cell's text like this:
// XXX error checking omitted for simplicity
string getCellText( int row, int col )
{
    // XXX Needs handling for last cell of a line
    vector<char>::size_type start = rows[row][col];
    const vector<char>::size_type end = rows[row][col + 1];
    if ( col > 0 )
        ++start; // rows[row][col] is the position of the delimiter itself
    return string( data.begin() + start, data.begin() + end );
}
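For the first step (getting the whole file into the contiguous data buffer used above), a minimal sketch, assuming an illustrative file name and a stream opened in binary mode:

#include <fstream>
#include <iterator>
#include <vector>

// Hypothetical file name; slurps the whole file into the contiguous buffer
// that the indexing loop and getCellText() above operate on.
std::ifstream in("file.csv", std::ios::binary);
std::vector<char> data((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());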

This article should be helpful.
In short:
1. Either use memory-mapped files OR read the file in 4 KB blocks to access the data. Memory-mapped files will be faster (see the sketch below).
2. Try to avoid using push_back, std::string operations (like +), and similar routines from the STL within the parsing loop. They are nice, but they ALL use dynamically allocated memory, and dynamic memory allocation is slow. Anything that is frequently dynamically allocated will make your program slower. Try to preallocate all buffers before parsing. Counting all tokens in order to preallocate memory for them shouldn't be difficult.
3. Use a profiler to identify what causes the slowdown.
4. You may want to avoid iostream's << and >> operators and parse the file yourself.
In general, an efficient C/C++ parser implementation should be able to parse a 20 MB text file within 3 seconds.
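A minimal sketch of the memory-mapping option on a POSIX system (the file name is illustrative; Windows would use CreateFileMapping/MapViewOfFile instead):

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    const char* path = "data.tsv"; // hypothetical file name
    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat sb;
    if (fstat(fd, &sb) < 0) { std::perror("fstat"); return 1; }

    // Map the whole file read-only; the kernel pages it in on demand.
    void* p = mmap(nullptr, static_cast<size_t>(sb.st_size), PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }
    const char* data = static_cast<const char*>(p);

    // Example pass over the data: count rows without any per-line allocation.
    size_t rows = 0;
    for (off_t i = 0; i < sb.st_size; ++i)
        if (data[i] == '\n')
            ++rows;
    std::printf("%zu rows\n", rows);

    munmap(p, static_cast<size_t>(sb.st_size));
    close(fd);
    return 0;
}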

Related

How to read multiple integer arrays from a file in C++

I have a file with multiple variable-length arrays, like this:
15
1 5 2 7
8 4 9
53 21 60 4 342 4321
...
Let's say the first number (15) gives the number of arrays, so that it would be easier to understand (everything is random, though).
How can I read all the numbers from the file in C++ and put them into a variable, let's say x[100][100], so that when I code x[1][1] + x[2][2] it will give me 14 (5 + 9)? I thought about reading until the end of the line, but I don't know how to keep track of columns.
If you have e.g. int x[100][100] and only need a few of those elements, you have quite a large amount of wasted memory that won't be used (but still exists and must be initialized).
The solution to that is pointers and dynamic allocation. Allocating the correct number of "sub-arrays" is easy once you have read the first line. The problem comes with how to handle the sub-arrays, since they all seem to have a variable number of elements. You can allocate a fixed number of elements for each sub-array and hope you will not need more (which brings back the issue of wasted memory that needs to be initialized). Some of the problems can be mitigated if you do two passes over the input: one to get the maximum number of elements in any line, and a second to actually read the data.
A second option is to read and dynamically allocate just enough elements for each line. This requires you to parse the input so you know when the line ends, and also to use reallocation as you add new numbers. You also need to keep track of the number of elements in each sub-array so you don't risk going out of bounds.
To keep track of the number of elements in each line you can either use a second array with the counts, or you can use an array of structures instead, where each structure contains the number of elements and the sub-array for each line.
A better solution (now that I noticed this was a C++-tagged question and not C) is to use std::vector, or rather a vector of vectors (of int).
When you have read the first line and parsed its number, you know how many sub-vectors you need and can preallocate them.
Then it's just a matter of reading the rest of the data, which is very easy in C++ with std::getline and std::istringstream and std::istream_iterator.
Perhaps something like this:
std::string line;
// Get the first line, the amount of extra lines to read
std::getline(input_file, line);
// Create the vector (of vectors)
std::vector<std::vector<int>> data;
size_t number_of_sub_vectors = std::stoi(line);
// Preallocate memory for the sub-vectors
data.reserve(number_of_sub_vectors);
// Now read the data for each line
for (size_t i = 0; i < number_of_sub_vectors; ++i)
{
    // Get the data for the current line
    std::getline(input_file, line);
    // And put into an input string stream for parsing
    std::istringstream iss(line);
    // Create the sub-vector in-place, and populate it with the data from the file
    data.emplace_back(std::istream_iterator<int>(iss),
                      std::istream_iterator<int>());
}
Of course the above example doesn't have any kind of error handling, which is really needed.

C++ slows over time reading 70,000 files

I have a program which needs to analyze 100,000 files spread over multiple filesystems.
After processing around 3000 files it starts to slow down. I ran it through gprof, but since the slow down doesn't kick in until 30-60 seconds into the analysis, I don't think it tells me much.
How would I track down the cause? top doesn't show high CPU and the process memory does not increase over time, so I/O?
At top level, we have:
scanner.init(); // build a std::vector<std::string> of pathnames.
scanner.scan(); // analyze those files
Now, init() completes in 1 second. It populates the vector with 70,000 actual filenames and 30,000 symbolic links.
scan() traverses the entries in the vector, looks at the file names, reads the contents (say 1KB of text), and builds a "segment list" [1]
I've read conflicting views on the evils of using std::string, especially passing strings as arguments. All the functions pass references (&) for std::strings, structures, etc.
But it does use a lot of string processing to parse filenames, extract substrings and search for substrings (and if strings were evil, the program should always be slow, not just slow down after a while).
Could that be a reason for slowing down over time?
The algorithm is very straightforward and doesn't have any new / delete operators...
Abbreviated, scan():
while (tsFile != mFileMap.end())
{
    curFileInfo.filePath = tsFile->second;
    mpUtils->parseDateTimeString(tsFile->first, curFileInfo.start);
    // Ignore files too small
    size_t fs = mpFileActions->fileSize(curFileInfo.filePath);
    mDvStorInfo.tsSizeBytes += fs;
    if (fileNum++ % 200 == 0)
    {
        usleep(LONGNAPUSEC); // long nap to give others a turn
    }
    // collect file information
    curFileInfo.locked = isLocked(curFileInfo.filePath);
    curFileInfo.sizeBytes = mpFileActions->fileSize(curFileInfo.filePath);
    getTsRateAndPktSize(curFileInfo.filePath, curFileInfo.rateBps, curFileInfo.pktSize);
    getServiceIdList(curFileInfo.filePath, curFileInfo.svcIdList);
    std::string fileBasePath;
    fileBasePath = mpUtils->strReplace(".ts", "", curFileInfo.filePath.c_str());
    fileBasePath = mpUtils->strReplace(".lockts", "", fileBasePath.c_str()); // chained replace
    // Extract the last part of the filename, ie. /mnt/das.b/20160327.104200.to.20160327.104400
    getFileEndTimeAndDuration(fileBasePath, curFileInfo);
    // Update machine info for both actual ts duration and span including gaps
    mDvStorInfo.tsDurationSec += curFileInfo.durSec;
    if (!firstTime)
    {
        // beef is here.
        if (hasGap(curFileInfo, prevFileInfo) ||
            lockChanged(curFileInfo, prevFileInfo) ||
            svcIdListChanged(curFileInfo, prevFileInfo) ||
            lastTsFile(tsFile))
        {
            // This current file differs from those before it so
            // close off previous segment and push to list
            curSegInfo.prevFileStart = curFileInfo.start;
            mSegmentList.push_back(curSegInfo);
            prevFileInfo = curFileInfo; // do this before resetting everything!
            // initialize the new segment
            resetSegmentInfo(curSegInfo);
            copyValues(curSegInfo, curFileInfo);
            resetFileInfo(curFileInfo);
        }
        else
        {
            // still running. Update current segment info
            curSegInfo.durSec += curFileInfo.durSec;
            curSegInfo.sizeBytes += curFileInfo.sizeBytes;
            curSegInfo.end = curFileInfo.end;
            curSegInfo.prevFileStart = prevFileInfo.start;
            prevFileInfo = curFileInfo;
        }
    }
    else // first time
    {
        firstTime = false;
        prevFileInfo = curFileInfo;
        copyValues(curSegInfo, curFileInfo);
        resetFileInfo(curFileInfo);
    }
    ++tsFile;
}
where:
curFileInfo/prevFileInfo are plain structs. The other functions do string processing, returning references to std::strings.
fileSize is calculated by calling stat()
getServiceIdList opens the file with fopen, reads each line and closes the file.
UPDATE
Removing the push_back to the container did not change the performance at all. However, rewriting to use C functions (e.g. strstr(), strcpy(), etc.) now shows constant performance.
The culprit was the std::strings: despite passing them as references, I guess there were too many construct/destroy/copy operations.
[1] the file names are named by YYYYMMDD.HHMMSS date/time, eg 20160612.093200. The purpose of the program is to look for time gaps within the names of the 70,000 files and build a list of contiguous time segments.
This could be a heap fragmentation issue. Over time, the heap can turn into Swiss cheese making it much harder for the memory manager to allocate blocks, and potentially forcing swap even if there is free RAM because there aren't any large-enough contiguous free blocks. Here's an MSDN article about heap fragmentation.
You mentioned using std::vector which guarantees contiguous memory and therefore can be a major culprit in heap fragmentation, as it must free and reallocate each time the collection grows beyond a boundary. If you don't require the contiguous guarantee, you might try a different container.
the file names are named by YYYYMMDD.HHMMSS date/time, eg 20160612.093200. The purpose of the program is to look for time gaps within the names of the 70,000 files and build a list of contiguous time segments
Comparing strings is slow; O(N). Comparing integers is fast; O(1). Rather than storing the filenames as strings, consider storing them as integers (or pairs of integers).
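For example, a sketch of turning the YYYYMMDD.HHMMSS names from the question into a single integer key (the function name is illustrative and error handling is omitted):

#include <cstdint>
#include <string>

// e.g. "20160612.093200" -> 20160612093200; ordering and gap arithmetic
// then become plain integer comparisons instead of string comparisons.
std::uint64_t timestampKey(const std::string& name)
{
    const std::uint64_t date = std::stoull(name.substr(0, 8)); // YYYYMMDD
    const std::uint64_t time = std::stoull(name.substr(9, 6)); // HHMMSS
    return date * 1000000ULL + time;
}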
And I strongly suggest that you use hash maps, if possible. See std::unordered_set and std::unordered_map. These will greatly cut down on the number of comparisons.
Removing the push_back to the container did not change the performance at all. However, rewriting to use C functions (eg. strstr(), strcpy() etc) now shows constant performance.
std::set<char*> is sorting pointer addresses, not the strings that they contain.
And don't forget to std::move your strings to cut down on allocations.
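A tiny illustration of that last point (names are illustrative):

#include <string>
#include <utility>
#include <vector>

void collect(std::vector<std::string>& names, std::string s)
{
    // Taking 's' by value and moving it into the container means at most one
    // copy is made at the call site, and none if the caller also moves.
    names.push_back(std::move(s));
}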

Iterating through an array or getting characters from an open file - are there any advantages of one over the other?

I'm just wondering if say...
ifstream fin(xxx)
then
char c;
fin.get(c)
is any better than putting the entire text into an array and iterating through the array instead of getting characters from the loaded file.
I guess there's the extra step to put the input file into an array.
If the file is 237 GB, then iterating over it is more feasible than copying it to a memory array.
If you iterate, you still want to do the actual disk I/O in page-sized chunks (not go to the device for every byte). But streams usually provide that kind of buffering.
So what you want is a mix of both.
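A sketch of that mix, assuming an illustrative file name and chunk size: let the stream do the page-sized disk I/O, but pull the bytes out in larger chunks rather than one fin.get(c) call per character:

#include <cstddef>
#include <fstream>
#include <vector>

void processFile(const char* path) // 'path' and the chunk size are illustrative
{
    std::ifstream fin(path, std::ios::binary);
    std::vector<char> buf(64 * 1024); // 64 KB chunk; tune as needed
    while (fin.read(buf.data(), static_cast<std::streamsize>(buf.size())) || fin.gcount() > 0) {
        const std::streamsize n = fin.gcount();
        for (std::streamsize i = 0; i < n; ++i) {
            const char c = buf[static_cast<std::size_t>(i)];
            // ... process c exactly as you would after fin.get(c) ...
            (void)c;
        }
    }
}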

Dynamic allocation of file data in C++

To be frank, I have an assignment that says, quite vaguely,
"If the file exists, the one-argument constructor allocates memory for the number of records contained in the file and copies them into memory."
Now, in considering this instruction, it would seem I am to allocate the dynamic memory /before/ copying the data over, and this seems, in principle, impossible.
To dynamically allocate memory, to my knowledge, you require runtime definition of the size of the block to be reserved.
Given that the file size, or number of 'entries' is unknown, how can one possibly allocate that much memory? Does not the notion defeat the very purpose of dynamic allocation?
Solution-wise, it would seem the only option is to parse the entire file to determine its size, allocate the proper amount of memory afterward, and then read through the file again, copying the data into the allocated memory.
Given that this must be a common operation in any program that reads file data, I wonder: What is the proper, or most efficient way of loading a file into RAM?
The notion of reading once to determine the size, and then again to copy seems very inefficient. I assume there is a way to jump to the end of the file to determine max length, which would make the process faster. Or perhaps using a static buffer and loading that in blocks to RAM?
Is it possible to read all of the data, and then move it into dynamic memory using the move operator? Or perhaps more efficient to use a linked list of some kind?
The most efficient method is to have the operating system map the file to memory. Search your OS API for "mmap" or "memory mapping".
Another approach is to seek to the end of the file and get the position (tellg()). This is the size of the file. Allocate an array in dynamic memory or create a std::vector reserving at least this amount of space.
Some operating systems have an API you can call to get the size of a file (without having to seek to the end). You could use this method, then dynamically allocate the memory or use std::vector<char>.
You will need to come up with a plan if the file doesn't fit into memory.
If you need to read the entire file into memory, you could use istream::read using the file length.
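A minimal sketch of the seek-to-end approach with std::vector and istream::read (the file name is illustrative, and a plan for files that don't fit in memory is still needed):

#include <cstddef>
#include <fstream>
#include <vector>

std::vector<char> loadFile(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(0, std::ios::end);
    const std::streamsize size = in.tellg(); // file size in bytes
    in.seekg(0, std::ios::beg);

    std::vector<char> buffer(static_cast<std::size_t>(size));
    in.read(buffer.data(), size); // error checking omitted
    return buffer;
}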
It all depends on the file format. One way to store records is to first write how many records are stored in the file. If you have two phone numbers, your file might look like this:
2
Jon
555-123
Mary
555-456
In this case the solution is straightforward:
// ...
is >> count;
record_type *record = new record_type[count];
for ( int i = 0; i < count; ++i )
    is >> record[i].name >> record[i].number; // stream checks omitted
// ...
If the file does not store the number of records (I wouldn't do this), you will have to count them first, and then use the above solution:
// ...
int count = 0;
std::string dummy;
while ( is >> dummy >> dummy )
    ++count;
is.clear();
is.seekg( 0 );
// ...
A second solution for the second case, would be to write a dynamic container (I assume you are not allowed to use standard containers) and push the records as you read them:
// ...
list_type list;
record_type r;
while ( is >> r.name >> r.number )
    list.push_back( r );
// ...
The solutions are ordered by complexity. I did not compile the examples above.

How to delete parts from a binary file in C++

I would like to delete parts from a binary file, using C++. The binary file is about 5-10 MB.
What I would like to do:
Search for an ANSI string "something"
Once I have found this string, I would like to delete the following n bytes, for example the following 1 MB of data. I would like to delete those characters, not fill them with NULs, so that the file becomes smaller.
I would like to save the modified file into a new binary file, which is the same as the original file except for the missing n bytes that I have deleted.
Can you give me some advice / best practices how to do this the most efficiently? Should I load the file into memory first?
How can I search efficiently for an ANSI string? I mean I may have to skip a few megabytes of data before I find that string. >> I have been told I should ask this in another question, so it's here:
How to look for an ANSI string in a binary file?
How can I delete n bytes and write it out to a new file efficiently?
OK, I don't need it to be super efficient; the file will not be bigger than 10 MB, and it's OK if it runs for a few seconds.
There are a number of fast string search routines that perform much better than testing each and every character. For example, when trying to find "something", only every 9th character needs to be tested.
Here's an example I wrote for an earlier question: code review: finding </body> tag reverse search on a non-null terminated char str
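If C++17 is available, a similar effect can be had with the standard library's Boyer-Moore searcher instead of a hand-rolled routine. A sketch (it works on binary data because it never relies on null terminators; the function name is illustrative):

#include <algorithm>  // std::search
#include <cstddef>
#include <functional> // std::boyer_moore_searcher (C++17)
#include <string>
#include <vector>

// Returns the offset of the first occurrence of 'needle' in 'buffer',
// or buffer.size() if it is not found.
std::size_t findPattern(const std::vector<char>& buffer, const std::string& needle)
{
    auto it = std::search(buffer.begin(), buffer.end(),
                          std::boyer_moore_searcher(needle.begin(), needle.end()));
    return static_cast<std::size_t>(it - buffer.begin());
}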
For a 5-10 MB file I would have a look at writev() if your system supports it. Read the entire file into memory since it is small enough. Scan for the bytes you want to drop. Pass writev() the list of iovecs (which will just be pointers into your read buffer and lengths) and then you can rewrite the entire modified contents in a single system call.
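A sketch of that writev() approach on POSIX (the names and the two-iovec layout are illustrative; it assumes a single region to drop and omits error handling):

#include <cstddef>
#include <fcntl.h>
#include <sys/uio.h> // writev, struct iovec
#include <unistd.h>
#include <vector>

// 'data' holds the whole original file; [cutStart, cutStart + cutLen) is the
// part to drop. Two iovecs describe the kept ranges; writev() writes both in
// a single system call.
void writeWithoutRegion(const std::vector<char>& data,
                        std::size_t cutStart, std::size_t cutLen,
                        const char* outPath)
{
    int fd = open(outPath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return; // error handling omitted

    iovec iov[2];
    iov[0].iov_base = const_cast<char*>(data.data());
    iov[0].iov_len  = cutStart;
    iov[1].iov_base = const_cast<char*>(data.data() + cutStart + cutLen);
    iov[1].iov_len  = data.size() - cutStart - cutLen;

    writev(fd, iov, 2); // may write less than requested; check in real code
    close(fd);
}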
First, if I understand your meaning in your "How can I search efficiently" subsection, you cannot just skip a few megabytes of data in the search if the target string might be in those first few megabytes.
As for loading the file into memory, if you do that, don't forget to make sure you have enough space in memory for the entire file. You will be frustrated if you go to use your utility and find that the 2GB file you want to use it on can't fit in the 1.5GB of memory you have left.
I am going to assume you will load into memory or memory map it for the following.
You did specifically say this was a binary file, so this means that you cannot use the normal C++ string searching/matching, as the null characters in the file's data will confuse it (end it prematurely without a match). You might instead be able to use memchr to find the first occurrence of the first byte in your target, and memcmp to compare the next few bytes with the bytes in the target; keep using memchr/memcmp pairs to scan through the entire thing until found. This is not the most efficient way, as there are better pattern-matching algorithms, but this is a sort of efficient way, I suppose.
To "delete" n bytes you have to actually move the data after those n bytes, copying the entire thing up to the new location.
If you actually copy the data from disk to memory, then it'd be faster to manipulate it there and write to the new file. Otherwise, once you find the spot on the disk you want to start deleting from, you can open a new file for writing, read in X bytes from the first file, where X is the file pointer position into the first file, and write them right into the second file, then seek into the first file to X+n and do the same from there to file1's eof, appending that to what you've already put into file2.
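A sketch of the in-memory variant described above: scan with memchr/memcmp, then write the kept ranges to a new file (names are illustrative, error handling omitted):

#include <algorithm> // std::min
#include <cstring>   // std::memchr, std::memcmp
#include <fstream>
#include <vector>

void cutAfterPattern(const std::vector<char>& data,
                     const char* pattern, std::size_t patLen,
                     std::size_t n, const char* outPath)
{
    // Scan for the first occurrence of 'pattern' using memchr/memcmp.
    std::size_t pos = data.size(); // "not found" sentinel
    const char* p = data.data();
    const char* const end = p + data.size();
    while (p && p < end) {
        p = static_cast<const char*>(std::memchr(p, pattern[0], static_cast<std::size_t>(end - p)));
        if (!p)
            break;
        if (static_cast<std::size_t>(end - p) >= patLen && std::memcmp(p, pattern, patLen) == 0) {
            pos = static_cast<std::size_t>(p - data.data());
            break;
        }
        ++p;
    }

    std::ofstream out(outPath, std::ios::binary);
    if (pos == data.size()) { // no match: copy the file unchanged
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
        return;
    }
    const std::size_t keepUpTo = pos + patLen;                        // keep the matched string itself
    const std::size_t resumeAt = std::min(data.size(), keepUpTo + n); // drop the n bytes after it
    out.write(data.data(), static_cast<std::streamsize>(keepUpTo));
    out.write(data.data() + resumeAt, static_cast<std::streamsize>(data.size() - resumeAt));
}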