How to efficiently write a vector of structs to file? - c++

I have code that writes a vector of more than 10 million elements to a text file. I used clock() to time the writefile function and it's the slowest part of my program. Is there a better way to write to file than the method below?
void writefile(vector<fields>& fieldsvec, ofstream& sigfile, ofstream& noisefile)
/* Writes clean and noise data to respective files
 *
 * fieldsvec: vector of clean data
 * noisevec: vector of noise data
 * sigfile: file to store clean data
 * noisefile: file to store noise data
 */
{
    for(unsigned int i=0; i<fieldsvec.size(); i++)
    {
        if(fieldsvec[i].nflag==false)
        {
            sigfile << fieldsvec[i].timestamp << ";" << fieldsvec[i].price << ";" << fieldsvec[i].units;
            sigfile << endl;
        }
        else
        {
            noisefile << fieldsvec[i].timestamp << ";" << fieldsvec[i].price << ";" << fieldsvec[i].units;
            noisefile << endl;
        }
    }
}
where my struct is:
struct fields
// Stores a parsed line of a file
{
public:
    string timestamp;
    float price;
    float units;
    bool nflag; //flag if noise (TRUE=NOISE)
};

I suggest getting rid of the endl. This effectively flushes the buffer every time and thus greatly increases the number of syscalls.
Writing '\n' instead of endl should be a very good improvement.
And by the way, the code can be simplified:
ofstream* files[2] = { &sigfile, &noisefile };  // an array of references is illegal, so use pointers; nflag indexes as 0 or 1
for(unsigned int i=0; i<fieldsvec.size(); i++)
    *files[fieldsvec[i].nflag] << fieldsvec[i].timestamp << ';' << fieldsvec[i].price << ';' << fieldsvec[i].units << '\n';

You could write your file in binary format instead of text format to increase the writing speed, as suggested in the first answer of this SO question:
file.open(filename.c_str(), ios_base::binary);
...
// The following writes a vector into a file in binary format
vector<double> v;
const char* pointer = reinterpret_cast<const char*>(&v[0]);
size_t bytes = v.size() * sizeof(v[0]);
file.write(pointer, bytes);
From the same link, the OP reported:
replacing std::endl with \n increased his code speed by 1%
concatenating all the content to be written in a stream and writing everything in the file at the end increased the code speed by 7%
the change of text format to binary format increased his code speed by 90%.
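Note that the asker's fields struct holds a std::string, which can't be dumped byte-for-byte. One way to apply the binary idea is to copy each element into a plain-old-data record first; here is a minimal sketch, assuming the fields struct from the question and that timestamps fit a fixed width (the record layout and the 32-byte limit are my own choices, not from the question):
#include <cstring>
#include <fstream>
#include <vector>

struct record            // hypothetical fixed-size, plain-old-data version of fields
{
    char  timestamp[32]; // 32 is an assumed upper bound on the timestamp length
    float price;
    float units;
};

// Sketch only: assumes 'out' was opened with ios_base::binary.
void write_binary(const std::vector<fields>& fieldsvec, std::ofstream& out)
{
    std::vector<record> recs(fieldsvec.size());
    for (size_t i = 0; i < fieldsvec.size(); ++i)
    {
        std::strncpy(recs[i].timestamp, fieldsvec[i].timestamp.c_str(), sizeof(recs[i].timestamp) - 1);
        recs[i].timestamp[sizeof(recs[i].timestamp) - 1] = '\0';
        recs[i].price = fieldsvec[i].price;
        recs[i].units = fieldsvec[i].units;
    }
    if (!recs.empty())
        out.write(reinterpret_cast<const char*>(recs.data()), recs.size() * sizeof(record));
}
The reader then needs to know the same record layout (and the element count, e.g. stored as a small header) to get the data back.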

A significant speed-killer is that you are converting your numbers to text.
As for the raw file output, the buffering on an ofstream is supposed to be pretty efficient by default.
You should pass your array as a const reference. That might not be a big deal, but it does allow certain compiler optimizations.
If you think the stream is messing things up because of repeated writes, you could try creating a string with sprintf or snprintf and writing it once. Only do this if your timestamp is a known size. Of course, that would mean extra copying, because the string must then be put in the output buffer. Experiment.
Otherwise, it's going to start getting dirty. When you need to tweak out the performance of files, you need to start tailoring the buffers to your application. That tends to get down to using no buffering or cache, sector-aligning your own buffer, and writing large chunks.
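As a rough sketch of the format-into-a-buffer-and-write-once idea mentioned above, assuming the fields struct from the question and an assumed upper bound of 64 characters per formatted line:
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Sketch only: snprintf each record into a small buffer, accumulate into one big string per file,
// then write each string with a single call.
void writefile_buffered(const std::vector<fields>& fieldsvec, std::ofstream& sigfile, std::ofstream& noisefile)
{
    std::string sig, noise;
    char line[64]; // assumed upper bound on one formatted line
    for (size_t i = 0; i < fieldsvec.size(); ++i)
    {
        int n = std::snprintf(line, sizeof(line), "%s;%f;%f\n",
                              fieldsvec[i].timestamp.c_str(), fieldsvec[i].price, fieldsvec[i].units);
        if (n > 0 && static_cast<size_t>(n) < sizeof(line))
            (fieldsvec[i].nflag ? noise : sig).append(line, n);
    }
    sigfile.write(sig.data(), sig.size());
    noisefile.write(noise.data(), noise.size());
}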

Related

C++ takes a lot of time when reading Blocks (4kb) out of a file. [std::ios::binary]

I'm currently working on a project where I try to store files in one (binary) file.
The files are stored block by block and a block looks like this:
struct Block_FS {
    uint32_t ID;
    char nutzdaten[_blockSize]; // blockSize is 4096; "Nutzdaten" means raw data
};
So there is no issue with writing it into the file. That works cleanly and fast.
But when I try to read the original file back out of the (binary) file, it takes a long time (about 3-4 seconds or more for 10 blocks).
My readBlock function looks like this:
Block_FS getBlock(long blockID, std::fstream & iso, long blockPosition, Superblock_FS superblock) {
    iso.seekg(blockPosition);
    for (long i=0; i<superblock.blocksCount; i++) { // blocksCount is just all blocks that are currently stored
        Block_FS block;
        std::string line;
        std::getline(iso, line); // read the id
        block.ID = stoi(line);
        iso.read(&block.nutzdaten[0], superblock.blockSize); // read the raw data (4 KB)
        getline(iso, line); // skipping. maybe I should use null terminating to avoid this getline()...?
        if (block.ID == blockID) { // if the current block has the right id I'm returning the whole block.
            return block;
        }
    }
    std::cerr << "Unexpected behavior: Block -> " << blockID << " not found." << std::endl;
    Block_FS block;
    block.ID = -1;
    return block;
}
I don't think that there's much more you need to see from my code.
Somewhere in this function must be something that takes a lot of time. Blocks are always stored on the disk and not in the cache. Only the one I'm currently writing or reading is in the cache.
Currently I'm kind of iterating through the file until I find the right id. I thought maybe that was causing the time issue. But when I tried the other way, saving the position of the block with tellg() and jumping right to that point with seekg(), it still took the same amount of time to read blocks. Do you see anything I don't see?
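For what it's worth, direct seeking only pays off if every record has the same size on disk. One variant is to store each block as fixed-size binary data (a 4-byte ID immediately followed by the block data), so the offset of block i is simply i times the record size. A minimal sketch under that assumption (note this changes the on-disk layout; it is not the format used in the question):
#include <cstdint>
#include <fstream>

constexpr std::size_t kBlockSize = 4096; // matches _blockSize from the question

struct Block_FS
{
    std::uint32_t ID;
    char nutzdaten[kBlockSize];
};

// Sketch only: read the record at a known index, assuming fixed-size binary records.
bool readBlockAt(std::fstream& iso, long index, long blockPosition, Block_FS& out)
{
    const std::streamoff recordSize = sizeof(std::uint32_t) + kBlockSize;
    iso.seekg(blockPosition + index * recordSize);
    iso.read(reinterpret_cast<char*>(&out.ID), sizeof(out.ID));
    iso.read(out.nutzdaten, kBlockSize);
    return static_cast<bool>(iso);
}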

Reading and writing binary files using structures

I am attempting to read from a binary file and dump the information into a structure. Before I read from it, I write into the file from a vector of structures. Unfortunately I am not able to get the new structure to receive the information from the file.
I have tried switching between vectors and individual structures. I also tried messing with the file pointer, moving it back and forth and leaving it as-is, to see if that was the problem. I'm using vectors because they are supposed to hold an unlimited number of values; they also let me test what the output should look like when I look up a specific structure in the file.
struct Department{
    string departmentName;
    string departmentHead;
    int departmentID;
    double departmentSalary;
};

int main()
{
    //...
    vector<Employee> emp;
    vector<Department> dept;
    vector<int> empID;
    vector<int> deptID;
    if(response==1){
        addDepartment(dept, deptID);
        fstream output_file("departments.dat", ios::in|ios::out|ios::binary);
        output_file.write(reinterpret_cast<char *>(&dept[counter-1]), sizeof(dept[counter-1]));
        output_file.close();
    }
    else if(response==2){
        addEmployee(emp, dept, empID);
    }
    else if(response==3){
        Department master;
        int size=dept.size();
        int index;
        cout << "Which record to EDIT:\n";
        cout << "Please choose one of the following... 1"<< " to " << size << " : ";
        cin >> index;
        fstream input_file("departments.dat", ios::in|ios::out|ios::binary);
        input_file.seekg((index-1) * sizeof(master), ios::beg);
        input_file.read(reinterpret_cast<char *>(&master), sizeof(master));
        input_file.close();
        cout<< "\n" << master.departmentName;
    }
    else if(response==4){
    }
    //...
Files are streams of bytes. If you want to write something to a file and read it back reliably, you need to define the contents of the file at the byte level. Have a look at the specifications for some binary file formats (such as GIF) to see what such a specification looks like. Then write code to convert between your class instance and a chunk of bytes.
Otherwise, it will be hit or miss and, way too often, miss. Punch "serialization C++" into your favorite search engine for lots of ideas on how to do this.
Your code can't possibly work for an obvious reason. A string can contain a million bytes of data. But you're only writing sizeof(string) bytes to your file. So you're not writing anything that a reader can make sense out of.
Say sizeof(string) is 32 on your platform but the departmentHead is more than 32 bytes. How could the file's contents possibly be right? This code makes no attempt to serialize the data into a stream of bytes suitable for writing to a file which is ... a stream of bytes.
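To make the serialization point concrete, here is a minimal sketch that writes each string as a 32-bit length prefix followed by its characters (the layout is my choice for illustration, not a standard format):
#include <cstdint>
#include <istream>
#include <ostream>
#include <string>

// Sketch only: byte-level serialization for the Department struct from the question.
void writeString(std::ostream& out, const std::string& s)
{
    std::uint32_t len = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));
    out.write(s.data(), len);
}

void readString(std::istream& in, std::string& s)
{
    std::uint32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof(len));
    s.resize(len);
    in.read(&s[0], len);
}

void writeDepartment(std::ostream& out, const Department& d)
{
    writeString(out, d.departmentName);
    writeString(out, d.departmentHead);
    out.write(reinterpret_cast<const char*>(&d.departmentID), sizeof(d.departmentID));
    out.write(reinterpret_cast<const char*>(&d.departmentSalary), sizeof(d.departmentSalary));
}
Because the strings make each record variable-length, the seekg((index-1) * sizeof(master)) trick no longer applies; you either scan records sequentially or keep an index of offsets.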

getline while reading a file vs reading whole file and then splitting based on newline character

I want to process each line of a file on a hard disk. Is it better to load the file as a whole and then split it on the newline character (using boost), or is it better to use getline()? My question is: does getline() read a single line each time it is called (resulting in multiple hard disk accesses), or does it read the whole file and hand it out line by line?
getline will call read() as a system call somewhere deep in the guts of the C library. Exactly how many times it is called, and how it is called, depends on the C library design. But most likely there is no distinct difference in reading a line at a time vs. the whole file, because the OS at the bottom layer will read (at least) one disk block at a time, and most likely at least a "page" (4KB), if not more.
Further, unless you do nearly nothing with your string after you have read it (e.g. you are writing something like "grep", so mostly just scanning the file to find a string), it is unlikely that the overhead of reading a line at a time is a large part of the time you spend.
But the "load the whole file in one go" has several, distinct, problems:
You don't start processing until you have read the whole file.
You need enough memory to read the entire file into memory - what if the file is a few hundred GB in size? Does your program fail then?
Don't try to optimise something unless you have used profiling to prove that it's part of why your code is running slow. You are just causing more problems for yourself.
Edit: So, I wrote a program to measure this, since I think it's quite interesting.
And the results are definitely interesting. To make the comparison fair, I created three large files of 1297984192 bytes each (by copying all source files in a directory with about a dozen different source files, then copying this file several times over to "multiply" it), until it took over 1.5 seconds to run the test, which is how long I think you need to run things to make sure the timing isn't too susceptible to a random "network packet came in" or some other outside influence taking time out of the process.
I also decided to measure the system and user time used by the process.
$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.98 (user:1.83 system: 0.14)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.68 system: 0.389)
Lines=24812608
Wallclock time for readwhole is 2.52 (user:1.79 system: 0.723)
$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.96 (user:1.83 system: 0.12)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.67 system: 0.392)
Lines=24812608
Wallclock time for readwhole is 2.48 (user:1.76 system: 0.707)
Here are the three different functions that read the file (there is some code to measure time and such too, of course, but to reduce the size of this post I chose not to include all of it; I also played around with the ordering to see if that made any difference, so the results above are not in the same order as the functions here):
void func_readwhole(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());
    if (!f)
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }
    f.seekg(0, ios::end);
    streampos size = f.tellg();
    f.seekg(0, ios::beg);
    char* buffer = new char[size];
    f.read(buffer, size);
    if (f.gcount() != size)
    {
        cerr << "Read failed ...\n";
        exit(1);
    }
    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);
    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }
    f.close();
    cout << "Lines=" << lines << endl;
    delete [] buffer;
}
void func_getline(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());
    if (!f)
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }
    string str;
    int lines = 0;
    while(getline(f, str))
    {
        lines++;
    }
    cout << "Lines=" << lines << endl;
    f.close();
}
void func_mmap(const char *name)
{
    char *buffer;
    string fullname = string("bigfile_") + name;
    int f = open(fullname.c_str(), O_RDONLY);
    off_t size = lseek(f, 0, SEEK_END);
    lseek(f, 0, SEEK_SET);
    buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);
    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);
    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }
    munmap(buffer, size);
    cout << "Lines=" << lines << endl;
}
The OS will read a whole block of data (depending on how the disk is formatted, typically 4-8k at a time) and do some of the buffering for you. Let the OS take care of it for you, and read the data in the way that makes sense for your program.
The fstreams are buffered reasonably. The underlying accesses to the hard disk by the OS are buffered reasonably. The hard disk itself has a reasonable buffer. You most surely will not trigger more hard disk accesses if you read the file line by line. Or character by character, for that matter.
So there is no reason to load the whole file into a big buffer and work on that buffer, because it already is in a buffer. And there often is no reason to buffer one line at a time, either. Why allocate memory to buffer something in a string that is already buffered in the ifstream? If you can, work on the stream directly, don't bother tossing everything around twice or more from one buffer to the next. Unless it supports readability and/or your profiler told you that disc access is slowing your program down significantly.
I believe the C++ idiom would be to read the file line-by-line, and create a line-based container as you read the file. Most likely the iostreams (getline) will be buffered enough that you won't notice a significant difference.
However, for very large files you may get better performance by reading larger chunks of the file (not the whole file at once) and splitting internally as newlines are found.
If you want to know specifically which method is faster and by how much, you'll have to profile your code.
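For example, a minimal timing harness with std::chrono might look like this (the file name and the line-counting loop are just placeholders for whichever strategy you want to measure):
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    auto start = std::chrono::steady_clock::now();

    std::ifstream in("input.txt"); // placeholder file name
    std::string line;
    long lines = 0;
    while (std::getline(in, line))
        ++lines;

    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::cout << lines << " lines in " << elapsed.count() << " s\n";
}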
It's better to fetch all the data if it can be accommodated in memory, because whenever you request I/O your program loses the processor and is put in a wait queue.
However, if the file is big, it's better to read only as much data at a time as the processing requires, because a bigger read operation takes much longer to complete than small ones, and the CPU's process-switching time is much smaller than the time it takes to read the entire file.
If it's a small file on disk, it's probably more efficient to read the entire file and parse it line by line vs. reading one line at a time - that would take lots of disk accesses.

How do I go back to the end of the first line in a csv file?

I am using ofstream to write a csv file.
Currently, I am writing it left to right using "<<" operator, which is easy.
For example,
Shape,Area,Min,Max
Square,10,2,11
Rectangle,20,3,12
I want to change so that it looks like
Shape,Square,Rectangle
Area,10,20
Min,2,3
Max,11,12
I know I can use the "<<" operator and just write it that way, but I am using some loops and it's not possible to use the "<<" operator to write it like that.
So I am looking for a way to write in the order, for example
Shape,
Area,
Min,
Max,
Then becomes
Shape,Square
Area,10
Min,2
Max,11
So It's basically going from top to bottom rather than left to right.
How do I use ofstream to code this? I am guessing I have to use seekp, but I'm not sure how.
Thank you very much.
You can't insert other than at the end of an ostream without overwriting already written data. For something like what you're trying to do, you probably have to collect each row in a separate string (perhaps using ostringstream to write it), then output the rows. Something like:
std::ostringstream label;
label << "Shape";
std::ostringstream area;
area << "Area";
std::ostringstream min;
min << "Min";
std::ostringstream max;
max << "Max";
for (std::vector<Shape>::const_iterator it = shapes.begin();
     it != shapes.end();
     ++ it)
{
    label << ',' << it->TypeName();
    area << ',' << it->Area();
    min << ',' << it->min();
    max << ',' << it->max();
}
dest << label.str() << '\n';
dest << area.str() << '\n';
dest << min.str() << '\n';
dest << max.str() << '\n';
You can use the old FILE* API, seeking as needed. IOStreams also allow you to seekp and seekg. But manipulating files like this will be difficult. If you write out, say, 100 bytes, seekp to the beginning and start writing more data, you're going to overwrite what's already there (it doesn't automatically get shifted around for you). You're likely going to need to read in the file's contents, manipulate them in memory, and write them to disk in one shot.
Even though it is inefficient, it could be done by writing fixed-size lines (40 characters?) padded with extra spaces, so you can go to a line and a (fixed) position by seeking to line*40+position (or looking for the comma) and overwriting the spaces.
Now that you have this knowledge, go for the approach mentioned by Martin.
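If you did want to try the fixed-width route anyway, a minimal sketch might look like this (the 40-byte line width, the field position, and the padding convention are assumptions for illustration):
#include <fstream>
#include <string>

// Sketch only: every line occupies exactly kLineWidth bytes (padded with spaces,
// including the trailing '\n'), so a field can be overwritten in place.
constexpr std::streamoff kLineWidth = 40;

void overwriteField(std::fstream& file, int line, int column, const std::string& value)
{
    file.seekp(line * kLineWidth + column);
    file.write(value.data(), value.size()); // value must fit within the padded space
}
The file would need to be opened with ios::in | ios::out so the existing contents are preserved while individual fields are overwritten.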

mixing cout and printf for faster output

After performing some tests I noticed that printf is much faster than cout. I know that it's implementation dependent, but on my Linux box printf is 8x faster. So my idea is to mix the two printing methods: I want to use cout for simple prints, and I plan to use printf for producing huge outputs (typically in a loop). I think it's safe to do as long as I don't forget to flush before switching to the other method:
cout << "Hello" << endl;
cout.flush();
for (int i=0; i<1000000; ++i) {
    printf("World!\n");
}
fflush(stdout);
cout << "last line" << endl;
cout << flush;
Is it OK like that?
Update: Thanks for all the valuable feedback. Summary of the answers: if you want to avoid tricky solutions, simply stick with cout but don't use endl, since it implicitly flushes the buffer (slowing the process down). Use "\n" instead. This can make a real difference if you produce large outputs.
The direct answer is that yes, that's okay.
A lot of people have thrown around various ideas of how to improve speed, but there seems to be quite a bit of disagreement over which is most effective. I decided to write a quick test program to get at least some idea of which techniques did what.
#include <iostream>
#include <string>
#include <sstream>
#include <time.h>
#include <iomanip>
#include <algorithm>
#include <iterator>
#include <stdio.h>
char fmt[] = "%s\n";
static const int count = 3000000;
static char const *const string = "This is a string.";
static std::string s = std::string(string) + "\n";
void show_time(void (*f)(), char const *caption) {
    clock_t start = clock();
    f();
    clock_t ticks = clock()-start;
    std::cerr << std::setw(30) << caption
              << ": "
              << (double)ticks/CLOCKS_PER_SEC << "\n";
}

void use_printf() {
    for (int i=0; i<count; i++)
        printf(fmt, string);
}

void use_puts() {
    for (int i=0; i<count; i++)
        puts(string);
}

void use_cout() {
    for (int i=0; i<count; i++)
        std::cout << string << "\n";
}

void use_cout_unsync() {
    std::cout.sync_with_stdio(false);
    for (int i=0; i<count; i++)
        std::cout << string << "\n";
    std::cout.sync_with_stdio(true);
}

void use_stringstream() {
    std::stringstream temp;
    for (int i=0; i<count; i++)
        temp << string << "\n";
    std::cout << temp.str();
}

void use_endl() {
    for (int i=0; i<count; i++)
        std::cout << string << std::endl;
}

void use_fill_n() {
    std::fill_n(std::ostream_iterator<char const *>(std::cout, "\n"), count, string);
}

void use_write() {
    for (int i = 0; i < count; i++)
        std::cout.write(s.data(), s.size());
}

int main() {
    show_time(use_printf, "Time using printf");
    show_time(use_puts, "Time using puts");
    show_time(use_cout, "Time using cout (synced)");
    show_time(use_cout_unsync, "Time using cout (un-synced)");
    show_time(use_stringstream, "Time using stringstream");
    show_time(use_endl, "Time using endl");
    show_time(use_fill_n, "Time using fill_n");
    show_time(use_write, "Time using write");
    return 0;
}
I ran this on Windows after compiling with VC++ 2013 (both x86 and x64 versions). Output from one run (with output redirected to a disk file) looked like this:
Time using printf: 0.953
Time using puts: 0.567
Time using cout (synced): 0.736
Time using cout (un-synced): 0.714
Time using stringstream: 0.725
Time using endl: 20.097
Time using fill_n: 0.749
Time using write: 0.499
As expected, results vary, but there are a few points I found interesting:
printf/puts are much faster than cout when writing to the NUL device
but cout keeps up quite nicely when writing to a real file
Quite a few proposed optimizations accomplish little
In my testing, fill_n is about as fast as anything else
By far the biggest optimization is avoiding endl
cout.write gave the fastest time (though probably not by a significant margin)
I've recently edited the code to force a call to printf. Anders Kaseorg was kind enough to point out that g++ recognizes the specific sequence printf("%s\n", foo); as equivalent to puts(foo);, and generates code accordingly (i.e., generates code to call puts instead of printf). Moving the format string to a global array and passing that as the format string produces identical output, but forces it to be produced via printf instead of puts. Of course, it's possible they might optimize around this some day as well, but at least for now (g++ 5.1) a test with g++ -O3 -S confirms that it's actually calling printf (where the previous code compiled to a call to puts).
Sending std::endl to the stream appends a newline and flushes the stream. The subsequent invocation of cout.flush() is superfluous. If this was done when timing cout vs. printf then you were not comparing apples to apples.
By default, the C and C++ standard output streams are synchronised, so that writing to one causes a flush of the other, so explicit flushes are not needed.
Also, note that the C++ stream is synced to the C stream.
Thus it does extra work to stay in sync.
Another thing to note is to make sure you flush the streams an equal amount. If you continuously flush the stream on one system and not the other that will definitely affect the speed of the tests.
Before assuming that one is faster than the other you should:
un-sync C++ I/O from C I/O (see sync_with_stdio() ).
Make sure the amount of flushes is comparable.
You can further improve the performance of printf by increasing the buffer size for stdout:
setvbuf (stdout, NULL, _IOFBF, 32768); // any value larger than 512 that is also
                                       // a multiple of the system i/o buffer size is an improvement
The number of calls to the operating system to perform i/o is almost always the most expensive component and performance limiter.
Of course, if cout output is intermixed with stdout, the buffer flushes defeat the purpose of an increased buffer size.
You can use sync_with_stdio to make C++ IO faster.
cout.sync_with_stdio(false);
This should improve your output performance with cout.
Don't worry about the performance between printf and cout. If you want to gain performance, separate formatted output from non-formatted output.
puts("Hello World\n") is much faster than printf("%s", "Hellow World\n"). (Primarily due to the formatting overhead). Once you have isolated the formatted from plain text, you can do tricks like:
const char hello[] = "Hello World\n";
cout.write(hello, sizeof(hello) - sizeof('\0'));
To speed up formatted output, the trick is to perform all formatting to a string, then use block output with the string (or buffer):
const unsigned int MAX_BUFFER_SIZE = 256;
char buffer[MAX_BUFFER_SIZE];
sprintf(buffer, "%d times is a charm.\n", 5);
unsigned int text_length = strlen(buffer); // strlen already excludes the terminating '\0'
fwrite(buffer, 1, text_length, stdout);
To further improve your program's performance, reduce the quantity of output. The less stuff you output, the faster your program will be. A side effect will be that your executable size will shrink too.
Well, I can't think of any reason to actually use cout, to be honest. It's completely insane to have a huge bulky template to do something so simple that will be in every file. Also, it's like it's designed to be as slow to type as possible, and after the millionth time of typing << and then typing the value in between and accidentally getting something like >variableName>> I never want to do that again.
Not to mention that if you include the std namespace the world will eventually implode, and if you don't, your typing burden becomes even more ridiculous.
However I don't like printf a lot either. For me, the solution is to create my own concrete class and then call whatever io stuff is necessary within that. Then you can have really simple io in any manner you want and with whatever implementation you want, whatever formatting you want, etc (generally you want floats to always be one way for example, not to format them 800 ways for no reason, so putting in formatting with every call is a joke).
So all I type is something like
dout+"This is more sane than "+cPlusPlusMethod+" of "+debugIoType+". IMO at least";
dout++;
but you can have whatever you want. With lots of files it's surprising how much this improves compile time, too.
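For illustration, a bare-bones version of such a wrapper might look like this (the DebugOut/dout names and the +/++ conventions just follow the pattern described above; it is a sketch, not a drop-in library):
#include <cstdio>
#include <string>

// Sketch only: operator+ appends to an internal buffer, operator++ writes it out in one call.
struct DebugOut
{
    std::string buffer;

    DebugOut& operator+(const std::string& s) { buffer += s; return *this; }
    DebugOut& operator+(double v)             { buffer += std::to_string(v); return *this; }

    void operator++(int) // dout++ flushes the buffered line
    {
        buffer += '\n';
        std::fwrite(buffer.data(), 1, buffer.size(), stdout);
        buffer.clear();
    }
};

DebugOut dout;

// usage: dout + "value is " + 3.14; dout++;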
Also, there's nothing wrong with mixing C and C++; it should just be done judiciously, and if you are using the things that cause problems with using C in the first place, it's safe to say the least of your worries is trouble from mixing C and C++.
Mixing C++ and C I/O methods was recommended against by my C++ books, FYI. I'm pretty sure the C functions trample on the state expected/held by the C++ streams.