HUGE .cpp file better than reading from text file? - c++

Is it a legitimate optimisation to simply create a really HUGE source file which initialises a vector with hundreds of thousands of values manually, rather than parsing a text file with the same values into a vector?
Sorry, that could probably be worded better. The function that parses the text file in is very slow due to C++'s stream reading being very slow (it takes about 6 minutes, as opposed to about 6 seconds in the C# version).
Would making a massive array initialisation file be a legitimate solution? It doesn't seem elegant, but if it's faster then I suppose it's better?
This is the file reading code:
//parses the text path vector into the engine
void Level::PopulatePathVectors(string pathTable)
{
    // Read the file line by line.
    ifstream myFile(pathTable);

    for (unsigned int i = 0; i < nodes.size(); i++)
    {
        pathLookupVectors.push_back(vector<vector<int>>());

        for (unsigned int j = 0; j < nodes.size(); j++)
        {
            string line;

            if (getline(myFile, line)) //enter if a line is read successfully
            {
                stringstream ss(line);
                istream_iterator<int> begin(ss), end;
                pathLookupVectors[i].push_back(vector<int>(begin, end));
            }
        }
    }
    myFile.close();
}
A sample line from the text file (of which there are about half a million, of similar format but varying length):
0 5 3 12 65 87 n

First, make sure you're compiling with the highest optimization level available, then add the lines marked below and test again. I doubt this will fix the problem, but it may help. Hard to say until I see the results.
//parses the text path vector into the engine
void Level::PopulatePathVectors(string pathTable)
{
    // Read the file line by line.
    ifstream myFile(pathTable);

    pathLookupVectors.reserve(nodes.size()); // HERE

    for (unsigned int i = 0; i < nodes.size(); i++)
    {
        pathLookupVectors.push_back(vector<vector<int> >());
        pathLookupVectors[i].reserve(nodes.size()); // HERE

        for (unsigned int j = 0; j < nodes.size(); j++)
        {
            string line;

            if (getline(myFile, line)) //enter if a line is read successfully
            {
                stringstream ss(line);
                istream_iterator<int> begin(ss), end;
                pathLookupVectors[i].push_back(vector<int>(begin, end));
            }
        }
    }
    myFile.close();
}

6 minutes vs 6 seconds!! There must be something wrong with your C++ code. Optimize it using good old methods before you resort to such an extreme "optimization" as the one mentioned in your post.
Also know that reading from a file allows you to change the vector contents without changing the source code. If you do it the way you mention, you'll have to re-code, compile and link all over again every time the data changes.

It depends on whether the data changes. If the data can or needs to be changed after compile time, then the only option is to load it from a text file. If not, well, I don't see any harm in compiling it in.

I was able to get the following result with Boost.Spirit 2.5:
$ time ./test input
real 0m6.759s
user 0m6.670s
sys 0m0.090s
'input' is a file containing 500,000 lines, each with 10 random integers between 0 and 65535.
Here's the code:
#include <vector>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/classic_file_iterator.hpp>

using namespace std;
namespace spirit = boost::spirit;
namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;

typedef vector<int> ragged_matrix_row_type;
typedef vector<ragged_matrix_row_type> ragged_matrix_type;

template <class Iterator>
struct ragged_matrix_grammar : qi::grammar<Iterator, ragged_matrix_type()> {
    ragged_matrix_grammar() : ragged_matrix_grammar::base_type(ragged_matrix_) {
        ragged_matrix_ %= ragged_matrix_row_ % qi::eol;
        ragged_matrix_row_ %= qi::int_ % ascii::space;
    }
    qi::rule<Iterator, ragged_matrix_type()> ragged_matrix_;
    qi::rule<Iterator, ragged_matrix_row_type()> ragged_matrix_row_;
};

int main(int argc, char** argv){
    typedef spirit::classic::file_iterator<> ragged_matrix_file_iterator;

    ragged_matrix_type result;
    ragged_matrix_grammar<ragged_matrix_file_iterator> my_grammar;
    ragged_matrix_file_iterator input_it(argv[1]);

    qi::parse(input_it, input_it.make_end(), my_grammar, result);

    return 0;
}
At this point, result contains the ragged matrix, which can be confirmed by printing its contents. In my case the 'ragged matrix' isn't so ragged (it's a 500000 x 10 rectangle), but it won't matter because I'm pretty sure the grammar is correct. I got even better results when I read the entire file into memory before parsing (~4 sec), but the code for that is longer and it's generally undesirable to copy large files into memory in their entirety.
Note: my test machine has an SSD, so I don't know if you'll get the same numbers I did (unless your test machine has an SSD as well).
HTH!

I wouldn't consider compiling static data into your application to be bad practice. If there is little conceivable need to change your data without a recompilation, parsing the file at compile time not only improves runtime performance (since your data have been pre-parsed by the compiler and are in a usable format at runtime), but also reduces risks (like the data file not being found at runtime or any other parse errors).
Make sure that users won't have need to change the data (or have the means to recompile the program), document your motivation and you should be absolutely fine.
That said, you could make the iostream version a lot faster if necessary.
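For example, here is a rough sketch of what a faster version might look like. This is illustrative only, not the poster's actual code: it keeps the reserve() calls from the other answer, reads one line per row with getline(), and parses each line with strtol() (from <cstdlib>) instead of constructing a stringstream per line:
//parses the text path vector into the engine (illustrative sketch)
void Level::PopulatePathVectors(string pathTable)
{
    ifstream myFile(pathTable);
    string line;

    pathLookupVectors.reserve(nodes.size());

    for (unsigned int i = 0; i < nodes.size(); i++)
    {
        pathLookupVectors.push_back(vector<vector<int> >());
        pathLookupVectors[i].reserve(nodes.size());

        for (unsigned int j = 0; j < nodes.size(); j++)
        {
            if (!getline(myFile, line)) //stop if a line cannot be read
                return;

            vector<int> path;
            const char* p = line.c_str();
            char* end = 0;

            // strtol stops at the first character that is not part of a
            // number (for example the trailing 'n' in the sample line);
            // in that case end == p and the loop terminates.
            for (long v = strtol(p, &end, 10); p != end; v = strtol(p, &end, 10))
            {
                path.push_back(static_cast<int>(v));
                p = end;
            }

            pathLookupVectors[i].push_back(path);
        }
    }
}
The point is simply to avoid creating a stream object per line; whether this is fast enough for your data is something only a measurement on your machine can tell.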

Using a huge array in a C++ file is a perfectly acceptable option, depending on the case.
You must consider whether the data will change and how often.
If you put it in a C++ file, that means you will have to recompile your program each time the data change (and distribute it to your customers each time!), so that wouldn't be a good solution if you have to distribute the program to other people.
Now if a compilation is allowed for every data change, then you can have the best of both worlds: just use a small script (for example in Python or Perl) which takes your .txt and generates a C++ file, so the file parsing only has to be done once for each data change. You can even integrate this step into your build process with automatic dependency management.
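As a rough illustration (the file names and output layout here are made up, and the generated braced initializer assumes C++11 or later), such a generator could itself be a tiny C++ program:
#include <fstream>
#include <sstream>
#include <string>

// Reads paths.txt (one row of integers per line) and emits path_data.cpp,
// which can then be compiled and linked into the main program.
int main()
{
    std::ifstream in("paths.txt");
    std::ofstream out("path_data.cpp");

    out << "#include <vector>\n";
    out << "const std::vector<std::vector<int> > pathData = {\n";

    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream ss(line);
        int value;
        bool first = true;

        out << "    {";
        while (ss >> value)          // stops at any non-numeric marker such as 'n'
        {
            if (!first) out << ", ";
            out << value;
            first = false;
        }
        out << "},\n";
    }

    out << "};\n";
}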
Good luck!

Don't use the std input stream; it's extremely slow.
There are better alternatives.
Since people decided to downvote my answer because they are too lazy to use Google, here:
http://accu.org/index.php/journals/1539

Related

Dynamically Allocating Array With Datafile

On a C++ project, I have been trying to use an array to store data from a textfile that I would later use. I have been having problems initializing the array without a size. Here is a basic sample of what I have been doing:
#include <iostream>
#include <string>
#include <fstream>

using namespace std;

int main()
{
    int i = 0;
    ifstream usern;
    string data;
    string otherdata;
    string *users = nullptr;

    usern.open("users.txt");

    while(usern >> otherdata)
        i++;

    users = new (nothrow) string[i];

    for (int n = 0; usern >> data; n++)
    {
        users[n] = data;
    }

    usern.close();
    return 0;
}
This is a pretty rough example that I threw together. Basically I try to read the items from a text file called users.txt and store them in an array. I used pointers in the example that I included (which probably wasn't the best idea considering I don't know too much about pointers). When I run this program, regardless of the data in the file, I do not get any result when I try to test the values by including cout << *(users + 1). It just leaves a blank line in the window. I am guessing my error is in my use of pointers or in how I am assigning values in the pointers themselves. I was wondering if anybody could point me in the right direction on how to get the correct values into an array. Thanks!
Try reopening usern after
while(usern >> otherdata)
    i++;
Perhaps try putting in
usern.close();
ifstream usern2;
usern2.open("users.txt");
right after that.
There may be other issues, but this seems like the most likely one to me. Let me know if you find success with this. To me it appears like usern is already reaching eof, and then you try to read from it a second time.
One thing that helps me a lot in finding such issues is to just put a cout << "looping"; or something inside the for loop so you know that you're at least getting in that for loop.
You can also do the same thing with usern.clear(); followed by usern.seekg(0, ios::beg);
What I think has happened in your code is that you have moved the pointer in the file that marks where the file is being read from. This happened when you counted the number of strings to be read in using the code below.
while(usern >> otherdata)
    i++;
This, however, left the file pointer at the end of the file and the stream in a failed state. This means that in order to re-read the file you need to clear the stream's error flags and move the file pointer back to the beginning of the file before you read it into your array of strings that you allocated with size i. This can be achieved by adding usern.clear(); followed by usern.seekg(0, ios::beg); after your while loop, as shown below. (For a good tutorial on file pointers see here.)
while(usern >> otherdata)
    i++;
// Reset the error flags and return the file pointer to the beginning of the file.
usern.clear();
usern.seekg(0, ios::beg);
// The rest of your code.
Warning: I am unsure how safe dynamically allocating arrays of STL containers is; I have previously run into issues with code similar to yours and would recommend staying away from this in functional code.
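Putting the pieces together, a minimal sketch of the two-pass approach (using std::vector rather than new[], and with the clear() call the stream needs after the counting pass hits end-of-file) might look like this:
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::ifstream usern("users.txt");
    std::string word;

    // First pass: count the whitespace-separated entries.
    int count = 0;
    while (usern >> word)
        ++count;

    // The first pass left the stream in a failed state at EOF, so the
    // error flags must be cleared before the read position can be rewound.
    usern.clear();
    usern.seekg(0, std::ios::beg);

    // Second pass: read the entries into storage sized from the count.
    std::vector<std::string> users;
    users.reserve(count);
    while (usern >> word)
        users.push_back(word);

    for (std::size_t i = 0; i < users.size(); ++i)
        std::cout << users[i] << '\n';
}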

Dumping a File into a String Array

I am using Visual C++ with an MFC program, using Visual Studio 2008, and I will be creating or appending to an XML file.
If the file doesn't exist, it will be created and there are no worries, but it's when the file already exists and I have to append to it that there seems to be an issue.
What I was instructed, and found through some research, was to read the file into a string, back up a bit, and write to the end of the string. My idea for that was to read the file into an array of strings.
bool WriteXMLHeader(string header, ofstream xmlFile)
{
    int fileSize = 1;

    while(!xmlFile.eof())
    {
        fileSize++;
    }

    string entireFile[fileSize];

    for(int i = 0; i < fileSize; i++)
    {
        xmlFile >> entireFile[i];
    }

    //Processing code to add more to the end

    //Save the File

    return true;
}
However, this causes an error where entireFile is of unknown size, and constant errors keep popping up.
I am not allowed to use any third party software (already looked into TinyXML and RapidXML).
What would be a better way to append to the end of an XML file above an unknown amount of closing tags?
Edit: My boss keeps talking about sending in a path to a node, and writing after the last instance of the node. He wants this to be capable of processing XML files with a million indents if needed. (Impossible for one man to accomplish?)
std::vector<std::string>
"I mentioned that, and my boss said no and to focus on strings"
Well, this is the most preferred, easiest, and least error-prone solution.
Keeping aside your XML parsing (if any), and coming to the question/confusion, whatever it is:
Consider the following:
#include <vector>
#include <string>
#include <fstream>
//...
std::ifstream xmlFile( "file.xml" );   // assuming the file is opened for reading
std::string line;
std::vector< std::string > entireFile;

while ( std::getline( xmlFile, line ) )
{
    entireFile.push_back( line );
}
xmlFile.close( );

// entireFile now contains all lines from the xml file.
// To iterate, it's just like a simple array:
for( std::size_t i = 0; i < entireFile.size( ); ++i )
{
    // entireFile[i]
}
Note: with <algorithm> and <iterator> you can achieve this in even fewer lines of code.
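For instance, here is a sketch of that <algorithm>/<iterator> version. The small Line helper below is just one common way to make istream_iterator extract whole lines rather than whitespace-separated words, and the file name is a placeholder:
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Helper so that istream_iterator extracts whole lines instead of words.
struct Line {
    std::string text;
    friend std::istream& operator>>(std::istream& is, Line& l) {
        return std::getline(is, l.text);
    }
    operator std::string() const { return text; }
};

int main() {
    std::ifstream xmlFile("file.xml");          // placeholder file name
    std::vector<std::string> entireFile;

    std::copy(std::istream_iterator<Line>(xmlFile),
              std::istream_iterator<Line>(),
              std::back_inserter(entireFile));
}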
Suggested Reading: Why is iostream::eof inside a loop condition considered wrong?
If your boss says no, ask him, with courtesy, why.
There can't be any valid reason, unless you're tied to a specific environment/platform with limited capabilities.

Keep a text file from wiping in a function but keep ability to write to it? C++

I have a function that swaps two chars in a file at a time, which works. However, if I try to use the function more than once, the previous swap I made is wiped from the text file and the original text is back in, so the second change looks like my first. How can I resolve this?
void swapping_letters()
{
    ifstream inFile("decrypted.txt");
    ofstream outFile("swap.txt");
    char a;
    char b;

    vector<char> fileChars;

    if (inFile.is_open())
    {
        cout<<"What is the letter you want to replace?"<<endl;
        cin>>a;
        cout<<"What is the letter you want to replace it with?"<<endl;
        cin>>b;

        while (inFile.good())
        {
            char c;
            inFile.get(c);
            fileChars.push_back(c);
        }

        replace(fileChars.begin(),fileChars.end(),a,b);
    }
    else
    {
        cout<<"Please run the decrypt."<<endl;
    }

    for(int i = 0; i < fileChars.size(); i++)
    {
        outFile<<fileChars[i];
    }
}
What you probably want to do is to parameterize your function:
void swapping_letters(string inFileName, string outFileName)
{
    ifstream inFile(inFileName);
    ofstream outFile(outFileName);
    ...
Because you don't have parameters, calling it twice is equivalent to:
swapping_letters("decrypted.txt", "swap.txt");
swapping_letters("decrypted.txt", "swap.txt");
But "decrypted.txt" wasn't modified after the first call, because you don't change the input file. So if you wanted to use the output of the first operation as the input to the second you'd have to write:
swapping_letters("decrypted.txt", "intermediate.txt");
swapping_letters("intermediate.txt", "swap.txt");
There are other ways of approaching this problem. By reading the file one character at a time, you are making quite a number of function calls...a million-byte file will involve 1 million calls to get() and 1 million calls to push_back(). Most of the time the internal buffering means this won't be too slow, but there are better ways:
Read whole ASCII file into C++ std::string
Note that if this is the actual problem you're solving, you don't actually need to read the whole file into memory. You can read the file in blocks (or character-by-character as you are doing) and do your output without holding the entire file.
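For reference, a minimal sketch of the whole-file-into-a-string approach mentioned above (read everything, replace, then write it back out; the two letters are hard-coded here in place of the user prompts):
#include <algorithm>
#include <fstream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream inFile("decrypted.txt");
    std::ofstream outFile("swap.txt");

    // Slurp the whole file into a single string in one shot.
    std::ostringstream buffer;
    buffer << inFile.rdbuf();
    std::string contents = buffer.str();

    // Replace every 'a' with 'b' (standing in for the letters the user typed).
    std::replace(contents.begin(), contents.end(), 'a', 'b');

    outFile << contents;
}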
An advanced idea that you may be interested in at some point is memory-mapped files. This lets you treat a disk file like it's a big array and easily modify it in memory, while letting the operating system worry about details of how much of the file to page in or page out at a time. They're a good fit for some problems, and there's a C++ platform-independent API for memory-mapped files in the boost library:
http://en.wikipedia.org/wiki/Memory-mapped_file
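As a rough sketch of what that looks like with Boost.Iostreams' mapped_file_source (one of the Boost options; read-only here, and it needs to be linked against boost_iostreams):
#include <boost/iostreams/device/mapped_file.hpp>
#include <iostream>

int main()
{
    // Map the file read-only; the OS pages it in on demand.
    boost::iostreams::mapped_file_source file("decrypted.txt");

    const char* data = file.data();
    std::size_t size = file.size();

    // Count newlines as a trivial example of scanning the mapped bytes.
    std::size_t lines = 0;
    for (std::size_t i = 0; i < size; ++i)
        if (data[i] == '\n')
            ++lines;

    std::cout << lines << " lines\n";
}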

std::ifstream buffer caching

In my application I'm trying to merge sorted files (keeping them sorted, of course), so I have to iterate through each element in both files to write the minimal one to the third file. This works pretty slowly on big files, and as far as I can see there is no other choice (the iteration has to be done), so I'm trying to optimize the file loading. I can use some amount of RAM for buffering. I mean, instead of reading 4 bytes from both files every time, I can read something like 100Mb once and work with that buffer until there is no element left in it, then refill the buffer. But I guess ifstream is already doing that; will it give me more performance, and is there any reason to do it myself? If fstream does buffer, maybe I can change the size of that buffer?
Added:
My current code looks like this (pseudocode):
// this is done in a loop
int i1 = input1.read_integer();
int i2 = input2.read_integer();

if (!input1.eof() && !input2.eof())
{
    if (i1 < i2)
    {
        output.write(i1);
        input2.seek_back(sizeof(int));
    }
    else
    {
        input1.seek_back(sizeof(int));
        output.write(i2);
    }
}
else
{
    if (input1.eof())
        output.write(i2);
    else if (input2.eof())
        output.write(i1);
}
What I don't like here is:
seek_back - I have to seek back to the previous position, as there is no way to peek 4 bytes
too much reading from the file
if one of the streams reaches EOF, the loop still keeps checking that stream instead of copying the contents of the other stream directly to the output, but this is not a big issue, because the chunk sizes are almost always equal.
Can you suggest improvements for that?
Thanks.
Without getting into the discussion on stream buffers, you can get rid of the seek_back and generally make the code much simpler by doing:
using namespace std;

merge(istream_iterator<int>(file1), istream_iterator<int>(),
      istream_iterator<int>(file2), istream_iterator<int>(),
      ostream_iterator<int>(cout, " "));   // " " separates the numbers in the output
Edit:
Added binary capability
#include <algorithm>
#include <iterator>
#include <fstream>
#include <iostream>

struct BinInt
{
    int value;
    operator int() const { return value; }
    friend std::istream& operator>>(std::istream& stream, BinInt& data)
    {
        return stream.read(reinterpret_cast<char*>(&data.value), sizeof(int));
    }
};

int main()
{
    // Open in binary mode, since the values are read as raw bytes.
    std::ifstream file1("f1.txt", std::ios::binary);
    std::ifstream file2("f2.txt", std::ios::binary);

    std::merge(std::istream_iterator<BinInt>(file1), std::istream_iterator<BinInt>(),
               std::istream_iterator<BinInt>(file2), std::istream_iterator<BinInt>(),
               std::ostream_iterator<int>(std::cout, " "));
}
In decreasing order of performance (best first):
memory-mapped I/O
OS-specific ReadFile or read calls.
fread into a large buffer
ifstream.read into a large buffer
ifstream and extractors
A program like this should be I/O bound, meaning it should be spending at least 80% of its time waiting for completion of reading or writing a buffer, and if the buffers are reasonably big, it should be keeping the disk heads busy. That's what you want.
Don't assume it is I/O bound without proof. A way to prove it is by taking several stackshots. If it is, most of the samples will show the program waiting for I/O completion.
It is possible that it is not I/O bound, meaning you may find other things going on in some of the samples that you never expected. If so, then you know what to fix to speed it up. I have seen some code like this spending much more time than necessary in the merge loop, testing for end-of-file, getting data to compare, etc. for example.
You can just use the read function of an ifstream to read large blocks.
http://www.cplusplus.com/reference/iostream/istream/read/
The second parameter is the number of bytes. You should make this a multiple of 4 in your case, maybe 4096? :)
Simply read a chunk at a time and work on it.
As martin-york said, this may not have any beneficial effect on your performance, but try it and find out.
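A sketch of that idea, assuming the file holds raw binary ints (the buffer size and file name are arbitrary):
#include <fstream>
#include <vector>

int main()
{
    std::ifstream in("f1.txt", std::ios::binary);

    // 1024 ints = 4096 bytes per read when int is 4 bytes.
    std::vector<int> buffer(1024);

    while (in.read(reinterpret_cast<char*>(buffer.data()),
                   buffer.size() * sizeof(int)) || in.gcount() > 0)
    {
        std::size_t count = static_cast<std::size_t>(in.gcount()) / sizeof(int);

        // The merge logic would consume buffer[0] .. buffer[count - 1] here
        // instead of extracting one int from the stream at a time.
        (void)count;
    }
}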
I think it is very likely that you can improve performance by reading big chunks.
Try opening the file with ios::binary as an argument, then use istream::read to read the data.
If you need maximum performance, I would actually suggest skipping iostreams altogether, and using cstdio instead. But I guess this is not what you want.
Unless there is something very special about your data it is unlikely that you will improve on the buffering that is built into the std::fstream object.
The std::fstream objects are designed to be very efficient for general purpose file access. It does not sound like you are doing anything special by accessing the data 4 bytes at a time. You can always profile your code to see where the actual time is spent.
Maybe if you share the code with us we could spot some major inefficiencies.
Edit:
I don't like your algorithm. Seeking back and forth may be hard on the stream, especially if the number lies over a buffer boundary. I would only read one number each time through the loop.
Try this:
Note: This is not optimal (and it assumes stream input of numbers, while yours looks binary), but I am sure you can use it as a starting point.
#include <fstream>
#include <iostream>

// Return the current val (that was the smaller value)
// and replace it with the next value in the stream.
int getNext(int& val, std::istream& str)
{
    int result = val;
    str >> val;
    return result;
}

int main()
{
    std::ifstream f1("f1.txt");
    std::ifstream f2("f2.txt");
    std::ofstream re("result");

    int v1;
    int v2;

    f1 >> v1;
    f2 >> v2;

    // While there are values in both streams,
    // output one value and replace it using getNext().
    // Note the parentheses: << binds tighter than ?:
    while(f1 && f2)
    {
        re << ((v1 < v2) ? getNext(v1, f1) : getNext(v2, f2)) << ' ';
    }

    // At this point one (or both) stream(s) is(are) empty.
    // So dump the other stream.
    for(;f1;f1 >> v1)
    {
        // Note if the stream is at the end it will
        // never enter the loop
        re << v1 << ' ';
    }
    for(;f2;f2 >> v2)
    {
        re << v2 << ' ';
    }
}

How to speed-up loading of 15M integers from file stream?

I have an array of precomputed integers; it's a fixed size of 15M values. I need to load these values at program start. Currently it takes up to 2 minutes to load, and the file size is ~130MB. Is there any way to speed up loading? I'm free to change the save process as well.
std::array<int, 15000000> keys;
std::string config = "config.dat";

// how the array is saved
std::ofstream out(config.c_str());
std::copy(keys.cbegin(), keys.cend(),
          std::ostream_iterator<int>(out, "\n"));

// load of the array
std::ifstream in(config.c_str());
std::copy(std::istream_iterator<int>(in),
          std::istream_iterator<int>(), keys.begin());
in.close();
Thanks in advance.
SOLVED. I used the approach proposed in the accepted answer. Now it takes just a blink.
Thanks all for your insights.
You have two issues regarding the speed of your write and read operations.
First, std::copy cannot do a block copy optimization when writing to an output_iterator because it doesn't have direct access to the underlying target.
Second, you're writing the integers out as ASCII and not binary, so on each iteration of your write the output_iterator creates an ASCII representation of your int, and on read it has to parse the text back into integers. I believe this is the brunt of your performance issue.
The raw storage of your array (assuming a 4 byte int) should only be 60MB, but since each character of an integer in ASCII is 1 byte, any ints with more than 4 characters are going to be larger than the binary storage, hence your 130MB file.
There is not an easy way to solve your speed problem portably (so that the file can be read on machines with a different endianness or int size) or when using std::copy. The easiest way is to just dump the whole array to disk and then read it all back using fstream's write and read; just remember that it's not strictly portable.
To write:
std::fstream out(config.c_str(), std::ios::out | std::ios::binary);
out.write( reinterpret_cast<const char*>(keys.data()), keys.size() * sizeof(int) );
And to read:
std::fstream in(config.c_str(), std::ios::in | std::ios::binary);
in.read( reinterpret_cast<char*>(keys.data()), keys.size() * sizeof(int) );
----Update----
If you are really concerned about portability, you could easily use a portable format (like your initial ASCII version) in your distribution artifacts; then, when the program is first run, it could convert that portable format to a locally optimized version for use during subsequent executions.
Something like this perhaps:
std::array<int, 15000000> keys;

// data.txt holds the ascii values and data.bin is the binary version
if(!file_exists("data.bin")) {
    std::ifstream in("data.txt");
    std::copy(std::istream_iterator<int>(in),
              std::istream_iterator<int>(), keys.begin());
    in.close();

    std::fstream out("data.bin", std::ios::out | std::ios::binary);
    out.write( reinterpret_cast<const char*>(keys.data()), keys.size() * sizeof(int) );
} else {
    std::fstream in("data.bin", std::ios::in | std::ios::binary);
    in.read( reinterpret_cast<char*>(keys.data()), keys.size() * sizeof(int) );
}
If you have an install process this preprocessing could also be done at that time...
Attention, reality check ahead:
Reading integers from a large text file is an IO bound operation unless you're doing something completely wrong (like using C++ streams for this). Loading 15M integers from a text file takes less than 2 seconds on an AMD64 @ 3GHz when the file is already buffered (and only a bit longer if it had to be fetched from a sufficiently fast disk). Here's a quick & dirty routine to prove my point (that's why I do not check for all possible errors in the format of the integers, nor close my files at the end, because I exit() anyway).
$ wc nums.txt
15000000 15000000 156979060 nums.txt
$ head -n 5 nums.txt
730547560
-226810937
607950954
640895092
884005970
$ g++ -O2 read.cc
$ time ./a.out <nums.txt
=>1752547657
real 0m1.781s
user 0m1.651s
sys 0m0.114s
$ cat read.cc
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <vector>

int main()
{
    int c;        // int, not char, so EOF can be detected reliably
    int num=0;
    int pos=1;
    int line=1;
    std::vector<int> res;

    while(c=getchar(),c!=EOF)
    {
        if (c>='0' && c<='9')
            num=num*10+c-'0';
        else if (c=='-')
            pos=0;
        else if (c=='\n')
        {
            res.push_back(pos?num:-num);
            num=0;
            pos=1;
            line++;
        }
        else
        {
            printf("I've got a problem with this file at line %d\n",line);
            exit(1);
        }
    }

    // make sure the optimizer does not throw the vector away; also a check.
    unsigned sum=0;
    for (unsigned i=0;i<res.size();i++)
    {
        sum=sum+(unsigned)res[i];
    }
    printf("=>%d\n",sum);
}
UPDATE: and here's my result when reading the text file (not binary) using mmap:
$ g++ -O2 mread.cc
$ time ./a.out nums.txt
=>1752547657
real 0m0.559s
user 0m0.478s
sys 0m0.081s
code's on pastebin:
http://pastebin.com/NgqFa11k
What do I suggest:
1-2 seconds is a realistic lower bound for a typical desktop machine to load this data. 2 minutes sounds more like a 60 MHz microcontroller reading from a cheap SD card. So either you have an undetected/unmentioned hardware condition, or your implementation of C++ streams is somehow broken or unusable. I suggest establishing a lower bound for this task on your machine by running my sample code.
If the integers are saved in binary format and you're not concerned with endian problems, try reading the entire file into memory at once (fread) and casting the pointer to int *.
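A sketch of that, assuming the file was written as raw native-endian 4-byte ints (for example by the write() snippet in the accepted answer):
#include <cstdio>
#include <vector>

int main()
{
    std::FILE* f = std::fopen("config.dat", "rb");
    if (!f) return 1;

    // Read all 15M ints in one call; shrink if the file turns out shorter.
    std::vector<int> keys(15000000);
    std::size_t got = std::fread(keys.data(), sizeof(int), keys.size(), f);
    keys.resize(got);

    std::fclose(f);
    return 0;
}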
You could precompile the array into a .o file, which wouldn't need to be recompiled unless the data changes.
thedata.hpp:
static const int NUM_ENTRIES = 5;
extern int thedata[NUM_ENTRIES];
thedata.cpp:
#include "thedata.hpp"
int thedata[NUM_ENTRIES] = {
10
,200
,3000
,40000
,500000
};
To compile this:
# make thedata.o
Then your main application would look something like:
#include "thedata.hpp"
using namespace std;
int main() {
for (int i=0; i<NUM_ENTRIES; i++) {
cout << thedata[i] << endl;
}
}
Assuming the data doesn't change often, and that you can process the data to create thedata.cpp, then this is effectively instant load time. I don't know if the compiler would choke on such a large literal array though!
Save the file in a binary format.
Write the file by taking a pointer to the start of your int array and converting it to a char pointer. Then write the 15000000*sizeof(int) chars to the file.
And when you read the file, do the same in reverse: read the file as a sequence of chars, take a pointer to the beginning of the sequence, and convert it to an int*.
Of course, this assumes that endianness isn't an issue.
For actually reading and writing the file, memory mapping is probably the most sensible approach.
If the numbers never change, preprocess the file into a C++ source and compile it into the application.
If the numbers can change and you thus have to keep them in a separate file that you load on startup, then avoid reading them number by number using C++ IO streams. C++ IO streams are a nice abstraction, but there is too much of it for such a simple task as loading a bunch of numbers fast. In my experience, a huge part of the run time is spent parsing the numbers and another part in accessing the file char by char.
(Assuming your file is more than a single long line.) Read the file line by line using std::getline() and parse the numbers out of each line not with streams but with std::strtol(). This avoids a huge part of the overhead. You can get more speed out of the streams by crafting your own variant of std::getline() that reads the input ahead (using istream::read()); the standard std::getline() also reads input char by char.
Use a buffer of 1000 (or even 15M; you can modify this size as you please) integers, not integer after integer. Not using a buffer is clearly the problem in my opinion.
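A minimal sketch of the getline()/strtol() suggestion above, using the one-number-per-line format from the question (file name taken from the question; any error handling is omitted):
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::ifstream in("config.dat");
    std::vector<int> keys;
    keys.reserve(15000000);

    std::string line;
    while (std::getline(in, line))
    {
        // strtol skips leading whitespace and stops at the first
        // non-numeric character, so one call parses the whole line.
        keys.push_back(static_cast<int>(std::strtol(line.c_str(), 0, 10)));
    }
}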
If the data in the file is binary and you don't have to worry about endianness, and you're on a system that supports it, use the mmap system call. See this article on IBM's website:
High-performance network programming, Part 2: Speed up processing at both the client and server
Also see this SO post:
When should I use mmap for file access?