Why are std::fstreams so slow? - c++

I was working on a simple parser, and when profiling I observed that the bottleneck is in... file read! I extracted a very simple test to compare the performance of fstreams and FILE* when reading a big blob of data:
#include <stdio.h>
#include <string.h> // memset
#include <chrono>
#include <fstream>
#include <iostream>
#include <functional>
#include <string>
void measure(const std::string& test, std::function<void()> function)
{
auto start_time = std::chrono::high_resolution_clock::now();
function();
auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start_time);
std::cout<<test<<" "<<static_cast<double>(duration.count()) * 0.000001<<" ms"<<std::endl;
}
#define BUFFER_SIZE (1024 * 1024 * 1024)
int main(int argc, const char * argv[])
{
auto buffer = new char[BUFFER_SIZE];
memset(buffer, 123, BUFFER_SIZE);
measure("FILE* write", [buffer]()
{
FILE* file = fopen("test_file_write", "wb");
fwrite(buffer, 1, BUFFER_SIZE, file);
fclose(file);
});
measure("FILE* read", [buffer]()
{
FILE* file = fopen("test_file_read", "rb");
fread(buffer, 1, BUFFER_SIZE, file);
fclose(file);
});
measure("fstream write", [buffer]()
{
std::ofstream stream("test_stream_write", std::ios::binary);
stream.write(buffer, BUFFER_SIZE);
});
measure("fstream read", [buffer]()
{
std::ifstream stream("test_stream_read", std::ios::binary);
stream.read(buffer, BUFFER_SIZE);
});
delete[] buffer;
}
The results of running this code on my machine are:
FILE* write 1388.59 ms
FILE* read 1292.51 ms
fstream write 3105.38 ms
fstream read 3319.82 ms
fstream write/read are about 2 times slower than FILE* write/read! And this is while reading a big blob of data, without any parsing or other fstream features. I'm running the code on Mac OS, Intel i7 2.6 GHz, 16 GB 1600 MHz RAM, SSD drive. Please note that when running the same code again, the time for FILE* read is very low (about 200 ms), probably because the file gets cached... This is why the files opened for reading are not created by the code.
Why is reading just a blob of binary data with fstream so slow compared to FILE*?
EDIT 1: I updated the code and the times. Sorry for the delay!
EDIT 2: I added command line and new results (very similar to previous ones!)
$ clang++ main.cpp -std=c++11 -stdlib=libc++ -O3
$ ./a.out
FILE* write 1417.9 ms
FILE* read 1292.59 ms
fstream write 3214.02 ms
fstream read 3052.56 ms
Following the results for the second run:
$ ./a.out
FILE* write 1428.98 ms
FILE* read 196.902 ms
fstream write 3343.69 ms
fstream read 2285.93 ms
It looks like the file gets cached when reading for both FILE* and stream, as the time drops by about the same amount for both of them.
EDIT 3: I reduced the code to this:
FILE* file = fopen("test_file_write", "wb");
fwrite(buffer, 1, BUFFER_SIZE, file);
fclose(file);
std::ofstream stream("test_stream_write", std::ios::binary);
stream.write(buffer, BUFFER_SIZE);
And started the profiler. It seems like stream spends lots of time in xsputn function, and the actual write calls have the same duration (as it should be, it's the same function...)
Running Time Self Symbol Name
3266.0ms 66.9% 0,0 std::__1::basic_ostream<char, std::__1::char_traits<char> >::write(char const*, long)
3265.0ms 66.9% 2145,0 std::__1::basic_streambuf<char, std::__1::char_traits<char> >::xsputn(char const*, long)
1120.0ms 22.9% 7,0 std::__1::basic_filebuf<char, std::__1::char_traits<char> >::overflow(int)
1112.0ms 22.7% 2,0 fwrite
1127.0ms 23.0% 0,0 fwrite
EDIT 4: For some reason this question was marked as a duplicate. I wanted to point out that I don't use printf at all; I use only std::cout to write the time. The files used in the read part are the output from the write part, copied under a different name to avoid caching.

It would seem that, on Linux, for this large set of data, the implementation of fwrite is much more efficient, since it uses write rather than writev.
I'm not sure WHY writev is so much slower than write, but that appears to be where the difference is. And I see absolutely no real reason as to why the fstream needs to use that construct in this case.
This can easily be seen by using strace ./a.out (where a.out is the program testing this).
Output:
Fstream:
clock_gettime(CLOCK_REALTIME, {1411978373, 114560081}) = 0
open("test", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
writev(3, [{NULL, 0}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1073741824}], 2) = 1073741824
close(3) = 0
clock_gettime(CLOCK_REALTIME, {1411978386, 376353883}) = 0
write(1, "fstream write 13261.8 ms\n", 25fstream write 13261.8 ms) = 25
FILE*:
clock_gettime(CLOCK_REALTIME, {1411978386, 930326134}) = 0
open("test", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1073741824) = 1073741824
clock_gettime(CLOCK_REALTIME, {1411978388, 584197782}) = 0
write(1, "FILE* write 1653.87 ms\n", 23FILE* write 1653.87 ms) = 23
I don't have them fancy SSD drives, so my machine will be a bit slower on that - or something else is slower in my case.
As pointed out by Jan Hudec, I'm misinterpreting the results. I just wrote this:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>
#include <iostream>
#include <cstdlib>
#include <cstring>
#include <functional>
#include <chrono>
void measure(const std::string& test, std::function<void()> function)
{
auto start_time = std::chrono::high_resolution_clock::now();
function();
auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start_time);
std::cout<<test<<" "<<static_cast<double>(duration.count()) * 0.000001<<" ms"<<std::endl;
}
#define BUFFER_SIZE (1024 * 1024 * 1024)
int main()
{
auto buffer = new char[BUFFER_SIZE];
memset(buffer, 0, BUFFER_SIZE);
measure("writev", [buffer]()
{
int fd = open("test", O_CREAT|O_WRONLY);
struct iovec vec[] =
{
{ NULL, 0 },
{ (void *)buffer, BUFFER_SIZE }
};
writev(fd, vec, sizeof(vec)/sizeof(vec[0]));
close(fd);
});
measure("write", [buffer]()
{
int fd = open("test", O_CREAT|O_WRONLY);
write(fd, buffer, BUFFER_SIZE);
close(fd);
});
}
The result is pretty much identical for both cases, and faster than both the fstream and FILE* variants in the question. So it is the actual fstream implementation that does something daft - probably copying the whole data in small chunks, somewhere and somehow, or something like that. I will try to find out further.
Edit:
It would seem like, on my machine, right now, if you add fclose(file) after the write, it takes approximately the same amount of time for both fstream and FILE* - on my system, around 13 seconds to write 1GB - with old style spinning disk type drives, not SSD.
I can however write MUCH faster using this code:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>
#include <iostream>
#include <cstdlib>
#include <cstring>
#include <functional>
#include <chrono>
void measure(const std::string& test, std::function<void()> function)
{
auto start_time = std::chrono::high_resolution_clock::now();
function();
auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start_time);
std::cout<<test<<" "<<static_cast<double>(duration.count()) * 0.000001<<" ms"<<std::endl;
}
#define BUFFER_SIZE (1024 * 1024 * 1024)
int main()
{
auto buffer = new char[BUFFER_SIZE];
memset(buffer, 0, BUFFER_SIZE);
measure("writev", [buffer]()
{
int fd = open("test", O_CREAT|O_WRONLY, 0660);
struct iovec vec[] =
{
{ NULL, 0 },
{ (void *)buffer, BUFFER_SIZE }
};
writev(fd, vec, sizeof(vec)/sizeof(vec[0]));
close(fd);
});
measure("write", [buffer]()
{
int fd = open("test", O_CREAT|O_WRONLY, 0660);
write(fd, buffer, BUFFER_SIZE);
close(fd);
});
}
gives times of about 650-900 ms.
I can also edit the original program to give a time of approximately 1000ms for fwrite - simply remove the fclose.
I also added this method:
measure("fstream write (new)", [buffer]()
{
std::ofstream* stream = new std::ofstream("test", std::ios::binary);
stream->write(buffer, BUFFER_SIZE);
// Intentionally no delete.
});
and then it takes about 1000 ms here too.
So, my conclusion is that, somehow, sometimes, closing the file makes it flush to disk. In other cases, it doesn't. I still don't understand why...
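One way to check where the time actually goes is to time the write, the flush to disk and the close separately. Below is a minimal POSIX sketch of that idea, not the code used for the numbers above. The names PhaseTimes and time_write_phases are made up for illustration, and fsync() is used as an explicit "push it to the device" step.

```cpp
#include <chrono>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical result type: milliseconds spent in each phase.
struct PhaseTimes {
    double write_ms = 0.0;
    double fsync_ms = 0.0;
    double close_ms = 0.0;
    bool ok = false;
};

// Hypothetical helper: writes `size` bytes to `path` and times
// write(), fsync() and close() individually.
PhaseTimes time_write_phases(const char* path, const char* data, std::size_t size)
{
    using Clock = std::chrono::steady_clock;
    auto to_ms = [](Clock::duration d) {
        return std::chrono::duration<double, std::milli>(d).count();
    };

    PhaseTimes t;
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return t;

    auto t0 = Clock::now();
    std::size_t done = 0;
    while (done < size) {  // write() may transfer fewer bytes than requested
        ssize_t n = ::write(fd, data + done, size - done);
        if (n <= 0) { ::close(fd); return t; }
        done += static_cast<std::size_t>(n);
    }
    auto t1 = Clock::now();
    bool synced = (::fsync(fd) == 0);  // ask the OS to flush cached data to the device
    auto t2 = Clock::now();
    bool closed = (::close(fd) == 0);
    auto t3 = Clock::now();

    t.write_ms = to_ms(t1 - t0);
    t.fsync_ms = to_ms(t2 - t1);
    t.close_ms = to_ms(t3 - t2);
    t.ok = synced && closed;
    return t;
}
```

If most of the time shows up in fsync_ms, the write call itself only filled the OS cache, which would fit the observation that the numbers change depending on whether the file is closed inside the measured region.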

TL;DR: Try adding this to your code before doing the writing:
const size_t bufsize = 256*1024;
char buf[bufsize];
mystream.rdbuf()->pubsetbuf(buf, bufsize);
When working with large files with fstream, make sure to use a stream buffer.
Counterintuitively, disabling stream buffering dramatically reduces performance. At least the MSVC implementation copies 1 char at a time to the filebuf when no buffer was set (see streambuf::xsputn()), which can make your application CPU-bound, which will result in lower I/O rates.
NB: You can find a complete sample application here.
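For reference, here is a minimal self-contained sketch of the same idea. The helper name write_with_big_buffer and the default size are my own choices for illustration. It installs the buffer before open(), because what pubsetbuf does on an already-open file is implementation-defined and some standard libraries ignore it in that case.

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// Hypothetical helper: write `size` bytes through an ofstream that uses a
// caller-sized buffer. Returns true if everything was written and closed.
bool write_with_big_buffer(const char* path, const char* data, std::size_t size,
                           std::size_t bufsize = 256 * 1024)
{
    std::vector<char> buf(bufsize);  // declared first so it outlives the stream
    std::ofstream out;
    // Install the buffer before the file is opened.
    out.rdbuf()->pubsetbuf(buf.data(), static_cast<std::streamsize>(buf.size()));
    out.open(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(data, static_cast<std::streamsize>(size));
    out.close();                     // flushes; sets failbit if that fails
    return !out.fail();
}
```

Whether a bigger buffer helps depends on the standard library, so it is worth measuring with a few sizes instead of assuming 256 KB is the right number.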

A side note for anyone interested.
The main keywords are Windows Server 2016 / CloseHandle.
In our app we discovered a NASTY bug on Windows Server 2016.
Our standard code, under EVERY other Windows version, takes (in ms):
time CreateFile/SetFilePointer 1 WriteFile 0 CloseHandle 0
On Windows Server 2016 we got:
time CreateFile/SetFilePointer 1 WriteFile 0 CloseHandle 275
And the time grows with the size of the file, which is ABSURD.
After a LOT of investigation (we first found that CloseHandle is the culprit...) we discovered that under Windows Server 2016 MS attached a "hook" to the close function that triggers Windows Defender to scan the WHOLE file and prevents the call from returning until the scan is done. (In other words the scan is synchronous, which is PURE MADNESS.)
When we added an exclusion in Defender for our file, everything worked fine.
I think this is BAD design: no other antivirus blocks normal file activity INSIDE program space to scan files. (MS can do it as they have the power to do so.)

Contrary to other answers, a big issue with large file reads comes from buffering by the C standard library. Try using low-level read/write calls in large chunks (1024 KB) and see the performance jump.
File buffering by the C library is useful for reading or writing small chunks of data (smaller than disk block size).
On Windows I got almost a 3x performance boost dropping file buffering when reading and writing raw video streams.
I also opened the file using native OS (win32) API calls and told the OS not to cache the file as this involves yet another copy.
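As an illustration of the "low level calls in large chunks" part, here is a minimal sketch using POSIX open/read instead of the Win32 calls I actually used. The helper name read_in_chunks is made up, and the 1 MiB default is just the chunk size mentioned above.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

// Hypothetical helper: read a whole file with raw read() calls, `chunk`
// bytes at a time. Returns the number of bytes read, or -1 on error.
long long read_in_chunks(const char* path, std::size_t chunk = 1024 * 1024)
{
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;

    std::vector<char> buf(chunk);
    long long total = 0;
    for (;;) {
        ssize_t n = ::read(fd, buf.data(), buf.size());
        if (n < 0) { ::close(fd); return -1; }
        if (n == 0) break;  // end of file
        total += n;         // the bytes buf[0..n) would be processed here
    }
    ::close(fd);
    return total;
}
```

This bypasses the C library's own buffer, so each call goes straight to the OS. It does not disable the OS file cache; that part is platform-specific and not shown here.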

The stream is somehow broken on the Mac: an old implementation or setup.
An old setup could cause the FILE to be written in the exe directory and the stream in the user directory; this shouldn't make any difference unless you have 2 disks or some other differing setting.
On my lousy Vista I get
Normal buffer+Uncached:
C++ 201103
FILE* write 4756 ms
FILE* read 5007 ms
fstream write 5526 ms
fstream read 5728 ms
Normal buffer+Cached:
C++ 201103
FILE* write 4747 ms
FILE* read 454 ms
fstream write 5490 ms
fstream read 396 ms
Large Buffer+cached:
C++ 201103
5th run:
FILE* write 4760 ms
FILE* read 446 ms
fstream write 5278 ms
fstream read 369 ms
This shows that the FILE write is faster than the fstream, but slower in read than fstream ... but all numbers are within ~10% of each other.
Try adding some more buffering to your stream to see if that helps.
const int MySize = 1024*1024;
char MrBuf[MySize];
stream.rdbuf()->pubsetbuf(MrBuf, MySize);
The equivalent for FILE is
const int MySize = 1024*1024;
// setvbuf returns 0 on success and nonzero on failure.
if (setvbuf ( file , NULL , _IOFBF , MySize ) != 0)
DieInDisgrace();

Related

Why is data corrupt when reading back from a file as it's being written with O_DIRECT

I have a C++ program that uses the POSIX API to write a file opened with O_DIRECT. Concurrently, another thread is reading back from the same file via a different file descriptor. I've noticed that occasionally the data read back from the file contains all zeroes, rather than the actual data I wrote. Why is this?
Here's an MCVE in C++17. Compile with g++ -std=c++17 -Wall -otest test.cpp or equivalent. Sorry I couldn't seem to make it any shorter. All it does is write 100 MiB of constant bytes (0x5A) to a file in one thread and read them back in another, printing a message if any of the read-back bytes are not equal to 0x5A.
WARNING, this MCVE will delete and rewrite any file in the current working directory named foo.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
constexpr size_t CHUNK_SIZE = 1024 * 1024;
constexpr size_t TOTAL_SIZE = 100 * CHUNK_SIZE;
int main(int argc, char *argv[])
{
::unlink("foo");
std::thread write_thread([]()
{
int fd = ::open("foo", O_WRONLY | O_CREAT | O_DIRECT, 0777);
if (fd < 0) std::exit(-1);
uint8_t *buffer = static_cast<uint8_t *>(
std::aligned_alloc(4096, CHUNK_SIZE));
std::fill(buffer, buffer + CHUNK_SIZE, 0x5A);
size_t written = 0;
while (written < TOTAL_SIZE)
{
ssize_t rv = ::write(fd, buffer,
std::min(TOTAL_SIZE - written, CHUNK_SIZE));
if (rv < 0) { std::cerr << "write error" << std::endl; std::exit(-1); }
written += rv;
}
});
std::thread read_thread([]()
{
int fd = ::open("foo", O_RDONLY, 0);
if (fd < 0) std::exit(-1);
uint8_t *buffer = new uint8_t[CHUNK_SIZE];
size_t checked = 0;
while (checked < TOTAL_SIZE)
{
ssize_t rv = ::read(fd, buffer, CHUNK_SIZE);
if (rv < 0) { std::cerr << "read error" << std::endl; std::exit(-1); }
for (ssize_t i = 0; i < rv; ++i)
if (buffer[i] != 0x5A)
std::cerr << "readback mismatch at offset " << checked + i << std::endl;
checked += rv;
}
});
write_thread.join();
read_thread.join();
}
(Details such as proper error checking and resource management are omitted here for the sake of the MCVE. This is not my actual program but it shows the same behavior.)
I'm testing on Linux 4.15.0 with an SSD. About 1/3 of the time I run the program, the "readback mismatch" message prints. Sometimes it doesn't. In all cases, if I examine foo after the fact I find that it does contain the correct data.
If you remove O_DIRECT from the ::open() flags in the write thread, the problem goes away and the "readback mismatch" message never prints.
I could understand why my ::read() might return 0 or something to indicate that I've already read everything that has been flushed to disk so far. But I can't understand why it would perform what appears to be a successful read, but with data other than what I wrote. Clearly I'm missing something, but what is it?
So, O_DIRECT has some additional constraints that might not make it what you're looking for:
Applications should avoid mixing O_DIRECT and normal I/O to the same
file, and especially to overlapping byte regions in the same file.
Even when the filesystem correctly handles the coherency issues in
this situation, overall I/O throughput is likely to be slower than
using either mode alone.
Instead, I think O_SYNC might be better, since it does provide the expected guarantees:
O_SYNC provides synchronized I/O file integrity completion, meaning
write operations will flush data and all associated metadata to the
underlying hardware. O_DSYNC provides synchronized I/O data
integrity completion, meaning write operations will flush data to the
underlying hardware, but will only flush metadata updates that are
required to allow a subsequent read operation to complete
successfully. Data integrity completion can reduce the number of
disk operations that are required for applications that don't need
the guarantees of file integrity completion.

How can I read multiple files faster?

In my program I want to read several text files (more than ~800 files), each with 256 lines, with filenames running from 1.txt to n.txt, and store them in a database after several processing steps. My problem is the speed at which the data is read. I was able to roughly double the program's speed by using OpenMP multithreading for the reading loop. Is there a way to speed it up a bit more? My actual code is
std::string CCD_Folder = CCDFolder; //CCDFolder is a pointer to a char array
int b = 0;
int PosCounter = 0;
int WAVENUMBER, WAVELUT;
std::vector<std::string> tempstr;
std::string inputline;
//Input
omp_set_num_threads(YValue);
#pragma omp parallel for private(WAVENUMBER) private(WAVELUT) private(PosCounter) private(tempstr) private(inputline)
for(int i = 1; i < (CCD_Filenumbers+1); i++)
{
//std::cout << omp_get_thread_num() << ' ' << i << '\n';
//Convert the number and build the file name, then open the input stream
std::string CCD_Filenumber = boost::lexical_cast<string>(i);
std::string CCD_Filename = CCD_Folder + '\\' + CCD_Filenumber + ".txt";
std::ifstream datain(CCD_Filename, std::ifstream::in);
//Test the result of getline itself; looping on !eof() processes a stale line after the last read fails
while(std::getline(datain, inputline))
{
//Processing
};
};
All variables which are not defined here are defined somewhere else in my code, and it works. So is there a way to speed this code up a bit more?
Thank you very much!
Some experiment:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <Windows.h>
void generateFiles(int n) {
char fileName[32];
char fileStr[1032];
for (int i=0;i<n;i++) {
sprintf( fileName, "c:\\t\\%i.txt", i );
FILE * f = fopen( fileName, "w" );
for (int j=0;j<256;j++) {
int lineLen = rand() % 1024;
memset(fileStr, 'X', lineLen );
fileStr[lineLen] = 0x0D;
fileStr[lineLen+1] = 0x0A;
fileStr[lineLen+2] = 0x00;
fwrite( fileStr, 1, lineLen+2, f );
}
fclose(f);
}
}
void readFiles(int n) {
char fileName[32];
for (int i=0;i<n;i++) {
sprintf( fileName, "c:\\t\\%i.txt", i );
FILE * f = fopen( fileName, "r" );
fseek(f, 0L, SEEK_END);
int size = ftell(f);
fseek(f, 0L, SEEK_SET);
char * data = (char*)malloc(size);
fread(data, size, 1, f);
free(data);
fclose(f);
}
}
DWORD WINAPI readInThread( LPVOID lpParam )
{
int * number = (int *)lpParam;
char fileName[32];
sprintf( fileName, "c:\\t\\%i.txt", *number );
FILE * f = fopen( fileName, "r" );
fseek(f, 0L, SEEK_END);
int size = ftell(f);
fseek(f, 0L, SEEK_SET);
char * data = (char*)malloc(size);
fread(data, size, 1, f);
free(data);
fclose(f);
return 0;
}
int main(int argc, char ** argv) {
long t1 = GetTickCount();
generateFiles(256);
printf("Write: %li ms\n", GetTickCount() - t1 );
t1 = GetTickCount();
readFiles(256);
printf("Read: %li ms\n", GetTickCount() - t1 );
t1 = GetTickCount();
const int MAX_THREADS = 256;
int pDataArray[MAX_THREADS];
DWORD dwThreadIdArray[MAX_THREADS];
HANDLE hThreadArray[MAX_THREADS];
for( int i=0; i<MAX_THREADS; i++ )
{
// The thread receives a pointer to this array slot, so no separate heap allocation is needed.
pDataArray[i] = i;
hThreadArray[i] = CreateThread(
NULL,
0,
readInThread,
&pDataArray[i],
0,
&dwThreadIdArray[i]);
}
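// WaitForMultipleObjects accepts at most MAXIMUM_WAIT_OBJECTS (64) handles per call, so wait in batches.
for( int i=0; i<MAX_THREADS; i+=MAXIMUM_WAIT_OBJECTS )
{
int count = MAX_THREADS - i;
if( count > MAXIMUM_WAIT_OBJECTS ) count = MAXIMUM_WAIT_OBJECTS;
WaitForMultipleObjects(count, &hThreadArray[i], TRUE, INFINITE);
}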
printf("Read (threaded): %li ms\n", GetTickCount() - t1 );
}
The first function is just an ugly way to generate a test dataset (I know it can be done much better, but I honestly have no time).
1st experiment - sequential read
2nd experiment - read all in parallel
Results:
256 files:
Write: 250 ms
Read: 140 ms
Read (threaded): 78 ms
1024 files:
Write: 1250 ms
Read: 547 ms
Read (threaded): 843 ms
I think the second attempt clearly shows that in the long run 'dumb' thread creation just makes things worse. Of course it could be improved with preallocated workers, a thread pool, etc., but I think that with an operation as fast as reading 100-200k from disk there is no real benefit in moving this functionality into a thread. I have no time to write a more 'clever' solution, but I doubt it would be much faster, because you would have to add system calls for mutexes etc...
Going to extremes, you could think of preallocating memory pools etc., but as mentioned before, the code you posted is just wrong. Reading these files is a matter of milliseconds, certainly not seconds.
800 files (20 chars per line, 256 lines)
Write: 250 ms
Read: 63 ms
Read (threaded): 500 ms
Conclusion:
ANSWER IS:
Your reading code is wrong: you are reading the files so slowly that there is a significant increase in speed when you make the tasks run in parallel. In the code above, reading a file is actually faster than the expense of spawning a thread.
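To make the "read the file in one go" idea concrete in C++ stream terms, here is a minimal sketch. The helper name read_lines is made up for illustration. It pulls the whole file into memory with a single bulk copy and only then splits it into lines, so the per-line work never touches the file.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: load a whole text file into memory, then split it
// into lines. Returns an empty vector if the file cannot be opened.
std::vector<std::string> read_lines(const std::string& path)
{
    std::vector<std::string> lines;
    std::ifstream in(path, std::ios::binary);
    if (!in) return lines;

    std::ostringstream whole;
    whole << in.rdbuf();  // one bulk copy of the file contents

    std::istringstream text(whole.str());
    std::string line;
    while (std::getline(text, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        lines.push_back(line);
    }
    return lines;
}
```

For files of 256 short lines this keeps the disk access to a single sequential read per file, which is the point of the experiment above.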
Your primary bottleneck is physically reading from the hard disk.
Unless you have the files on separate drives, the drive can only read data from one file at a time. Your best bet is to read each file as a whole rather than read a portion of one file, tell the drive to locate to another file, read from there, and repeat. Repositioning the drive head to other locations, especially other files, is usually more expensive than letting the drive finish reading the single file.
The next bottleneck is the data channel between the processor and the hard drive. If your hard drives share any kind of communications channel, you will see a bottleneck, as data from each drive must come through the communications channel to your processor. Your processor will be sending commands to the drive(s) through this communications channel (PATA, SATA, USB, etc.).
The objective of the next steps is to reduce the overhead of the "middle men" between your program's memory and the hard drive communications interface. The most efficient is to access the controller directly; less efficient is using the OS functions, then the "C" functions (fread and family), and least efficient are the C++ streams. With increased efficiency comes tighter coupling with the platform and reduced safety (and simplicity).
I suggest the following:
1. Create multiple buffers in memory, large enough to save time, small enough to prevent the OS from paging the memory to the hard drive.
2. Create a thread that reads the files into memory, as necessary. Search the web for "double buffering". As long as there is space in the buffer, this thread will read data.
3. Create multiple "outgoing" buffers.
4. Create a second thread that removes data from memory and "processes" it, and inserts into the "outgoing" buffers.
5. Create a third thread that takes the data in the "outgoing" buffers and sends to the databases.
6. Adjust the size of the buffers for the best efficiency within the limitations of memory.
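A rough sketch of the reader/processor part of this scheme, simplified to two stages: one thread reads whole files into a small bounded queue, and the calling thread "processes" them (here it just counts newlines). The database stage is left out. BoundedQueue and count_lines_pipelined are names invented for this example, and the capacity of 4 is an arbitrary assumption to tune.

```cpp
#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical bounded queue: push blocks when full, pop blocks when empty.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(std::string item) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < capacity_; });
        q_.push(std::move(item));
        not_empty_.notify_one();
    }

    // Signals that no more items will be pushed.
    void close() {
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        not_empty_.notify_all();
    }

    // Returns false once the queue is closed and fully drained.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return true;
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::queue<std::string> q_;
    std::size_t capacity_;
    bool closed_ = false;
};

// Reader thread loads whole files; the caller's thread counts the newlines.
std::size_t count_lines_pipelined(const std::vector<std::string>& paths)
{
    BoundedQueue queue(4);  // at most 4 file contents buffered at once
    std::thread reader([&] {
        for (const std::string& p : paths) {
            std::ifstream in(p, std::ios::binary);
            std::ostringstream whole;
            whole << in.rdbuf();  // unreadable files simply yield an empty string
            queue.push(whole.str());
        }
        queue.close();
    });

    std::size_t lines = 0;
    std::string content;
    while (queue.pop(content))
        for (char c : content)
            if (c == '\n') ++lines;

    reader.join();
    return lines;
}
```

The single reader thread keeps disk access sequential, while the bounded queue limits how much file data sits in memory at once.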
If you can access the DMA channels, use them to read from the hard drive into the "read buffers".
Next, you can optimize your code to use the processor's data cache efficiently. For example, set up your "processing" so that the data structures do not exceed a cache line. Also, optimize your code to use registers (either specify the register keyword or use statement blocks so that the compiler knows when variables can be reused).
Other optimizations that may help:
Align data to the processor's native word size, pad if necessary. For example, prefer using 32 bytes instead of 13 or 24.
Fetch data in quantities of the processor's word size. For example, access 4 octets (bytes) at a time on a 32-bit processor rather than 4 accesses of 1 byte.
Unroll loops - more instructions inside the loop, as branch instructions slow down processing.
You are probably hitting the read limit of your disks, which means your options are somewhat limited. If this is a constant problem you could consider a different RAID structure, which will give you greater read throughput because more than one read head can access data at the same time.
To see if disk access really is the bottleneck, run your program with the time command:
>> /usr/bin/time -v <my program>
In the output you'll see how much CPU time you were utilizing compared to the amount of time required for things like disk access.
I would try going with C code for reading the file. I suspect that it'll be faster.
FILE* f = ::fopen( CCD_Filename.c_str(), "rb" );
if( f == NULL )
{
return;
}
::fseek( f, 0, SEEK_END );
const long lFileBytes = ::ftell( f );
::fseek( f, 0, SEEK_SET );
char* fileContents = new char[lFileBytes + 1];
const size_t numObjectsRead = ::fread( fileContents, lFileBytes, 1, f );
::fclose( f );
if( numObjectsRead < 1 )
{
delete [] fileContents;
return;
}
fileContents[lFileBytes] = '\0';
// assign char buffer of file contents here
delete [] fileContents;

Why can 'dd' read from a pipe faster than my own program using ifstream?

I have two programs that pass data to each other via linux pipes (named or otherwise). I need to hit a transfer rate of ~2600 MB/s between the two programs, but am currently seeing a slower rate of about ~2200 MB/s. However, I found that if I replace my 2nd process with 'dd' instead, the transfer rate jumps to over 3000 MB/s. Is there something about the way my program is reading from the pipe that is less efficient than the way 'dd' does it? What can I do to improve this throughput? Is 'ifstream' inherently slower than other methods of reading binary data from pipe?
To summarize the two scenarios:
Scenario 1:
Program 1 -> [named pipe] -> Program 2
Yields ~2200 MB/s transfer rate
Scenario 2:
Program 1 -> [named pipe] -> 'dd if=pipename of=/dev/null bs=8M'
Yields ~3000 MB/s transfer rate
Here is the way my Program 2 currently reads from pipe:
ifstream inputFile;
inputFile.open(inputFileName.c_str(), ios::in | ios::binary);
while (keepLooping)
{
inputFile.read(&buffer[0], 8*1024*1024);
bytesRead = inputFile.gcount();
//Do something with data
}
Update:
I have now tried using 'read(fd, &buffer[0], 8*1024*1024)' instead of istream, which seemed to show a mild improvement (but not as much as dd).
I also tried using stream->rdbuf()->sgetn(&buffer[0], 8*1024*1024) instead of stream->read(), which did not help.
The difference appears to be due to using an array instead of std::vector, which I still have a hard time believing. My two sets of code are shown below for comparison. The first can ingest from Program 1 at a rate of about 2500 MB/s. The second can ingest at a rate of 3100 MB/s.
Program 1 (2500 MB/s)
int main(int argc, char **argv)
{
int fd = open("/tmp/fifo2", O_RDONLY);
std::vector<char> buf(8*1024*1024);
while(1)
{
read(fd, &buf[0], 8*1024*1024);
}
}
Program 2 (3100 MB/s)
int main(int argc, char **argv)
{
int fd = open("/tmp/fifo2", O_RDONLY);
char buf[8*1024*1024];
while(1)
{
read(fd, &buf[0], 8*1024*1024);
}
}
Both are compiled with -O3 using gcc version 4.4.6. If anyone can explain the reason for this I'd be very interested (since I understand std::vector to basically be a wrapper around an array).
Edit: I just tested Program 3, below, which uses ifstream and runs at 3000 MB/s. So it appears that using ifstream instead of 'read()' incurs only a very slight performance degradation, much less than the hit taken from using std::vector.
Program 3 (3000 MB/s)
int main(int argc, char **argv)
{
ifstream file("/tmp/fifo2", ios::in | ios::binary);
char buf[8*1024*1024];
while(1)
{
file.read(&buf[0], 32*1024);
}
}
Edit 2:
I modded Program 2's code to use malloc'd memory instead of memory on the stack and the performance dropped to match the vector performance. Thanks, ipc, for keying me onto this.
This code compiled with g++ -Ofast:
int main(int argc, char *argv[])
{
if (argc != 2) return -1;
std::ifstream in(argv[1]);
std::vector<char> buf(8*1024*1024);
in.rdbuf()->pubsetbuf(&buf[0], buf.size());
std::ios_base::sync_with_stdio(false);
std::cout << in.rdbuf();
}
does not perform badly at all.
$ time <<this program>> <<big input file>> >/dev/null
0.20s user 3.50s system 9% cpu 40.548 total
$ time dd if=<<big input file>> bs=8M > /dev/null
0.01s user 3.84s system 9% cpu 40.786 total
You have to consider that, by default, std::cout is synchronized with C's stdout, which is really time-consuming if not switched off. So call std::ios_base::sync_with_stdio(false); if you want speed and do not intend to use C's input/output functions (which are slower anyway).
Also, for raw and fast input/output in C++, use the methods from streambuf, obtained by rdbuf().

working of fwrite in c++

I am trying to simulate race conditions in writing to a file. This is what I am doing.
Opening a.txt in append mode in process1
writing "hello world" in process1
prints the ftell in process1 which is 11
put process1 in sleep
open a.txt again in append mode in process2
writing "hello world" in process2 (this correctly appends to the end of the file)
prints the ftell in process2 which is 22 (correct)
writing "bye world" in process2 (this correctly appends to the end of the file).
process2 quits
process1 resumes, and prints its ftell value, which is 11.
writing "bye world" by process1 --- i assume as the ftell of process1 is 11, this should overwrite the file.
However, the write of process1 is writing to the end of the file and there is no contention in writing between the processes.
I am using fopen as fopen("./a.txt", "a+")
Can anyone tell me why this behavior occurs, and how I can simulate a race condition in writing to the file?
The code of process1:
#include <iostream>
#include <fstream>
#include <string>
#include <stdio.h>
#include <unistd.h> // sleep
#include "time.h"
using namespace std;
int main()
{
FILE *f1= fopen("./a.txt","a+");
cout<<"opened file1"<<endl;
string data ("hello world");
fwrite(data.c_str(), sizeof(char), data.size(), f1);
fflush(f1);
cout<<"file1 tell "<<ftell(f1)<<endl;
cout<<"wrote file1"<<endl;
sleep(3);
string data1 ("bye world");;
cout<<"wrote file1 end"<<endl;
cout<<"file1 2nd tell "<<ftell(f1)<<endl;
fwrite(data1.c_str(), sizeof(char), data1.size(), f1);
cout<<"file1 2nd tell "<<ftell(f1)<<endl;
fflush(f1);
return 0;
}
In process2, I have commented out the sleep statement.
I am using the following script to run:
./process1 &
sleep 2
./process2 &
Thanks for your time.
The writer code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BLOCKSIZE 1000000
int main(int argc, char **argv)
{
FILE *f = fopen("a.txt", "a+");
char *block = malloc(BLOCKSIZE);
if (argc < 2)
{
fprintf(stderr, "need argument\n");
return 1; // argv[1] is used below, so bail out when it is missing
}
memset(block, argv[1][0], BLOCKSIZE);
for(int i = 0; i < 3000; i++)
{
fwrite(block, sizeof(char), BLOCKSIZE, f);
}
fclose(f);
}
The reader function:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BLOCKSIZE 1000000
int main(int argc, char **argv)
{
FILE *f = fopen("a.txt", "r");
int c;
int oldc = 0;
int rl = 0;
while((c = fgetc(f)) != EOF)
{
if (c != oldc)
{
if (rl)
{
printf("Got %d of %c\n", rl, oldc);
}
oldc = c;
rl = 0;
}
rl++;
}
fclose(f);
}
I ran ./writefile A & ./writefile B then ./readfile
I got this:
Got 1000999424 of A
Got 999424 of B
Got 999424 of A
Got 4096 of B
Got 4096 of A
Got 995328 of B
Got 995328 of A
Got 4096 of B
Got 4096 of A
Got 995328 of B
Got 995328 of A
Got 4096 of B
Got 4096 of A
Got 995328 of B
Got 995328 of A
Got 4096 of B
Got 4096 of A
Got 995328 of B
Got 995328 of A
Got 4096 of B
Got 4096 of A
Got 995328 of B
Got 995328 of A
As you can see, there are nice long runs of A and B, but they are not exactly 1000000 characters long, which is the size I wrote them in. The whole file, after a trial run with a smaller size first, is just short of 7GB.
For reference: Fedora Core 16, with my own compiled 3.7rc5 kernel, gcc 4.6.3, x86-64, and ext4 on top of lvm, AMD PhenomII quad core processor, 16GB of RAM
Writing in append mode is an atomic operation. This is why it doesn't break.
Now... how to break it?
Try memory mapping the file and writing in the memory from the two processes. I'm pretty sure this will break it.
I'm pretty sure you can't RELY on this behaviour, but it may well work reliably on some systems. Writing to the same file from two different processes is likely to cause problems sooner or later, if you "try hard enough". And sod's law says that that's exactly when your boss is checking if the software works, when your customer takes delivery of the system you've sold, or when you are finalizing your report that took ages to produce, or some other important time.
The behavior you're trying to break or observe depends on which OS you are working on, as writing to a file is a system call.
As for the first process not overwriting what the second process wrote: because both processes opened the file in append mode, the file position is probably moved to the end of the file just before each write actually happens, whatever ftell reported earlier.
Did you try to do the same with the standard open and write functions? Might be interesting as well.
EDIT: The C++ Reference doc explains about the fopen append option here:
"append/update: Open a file for update (both for input and output) with all output operations writing data at the end of the file. Repositioning operations (fseek, fsetpos, rewind) affects the next input operations, but output operations move the position back to the end of file."
This explains the behavior you observed.
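The same effect can be reproduced inside a single process with two independent FILE handles, which makes it easier to experiment with. A minimal sketch follows; the function name append_demo is made up, and it deletes and rewrites the file it is given.

```cpp
#include <cstdio>
#include <string>

// Hypothetical demo: two FILE handles opened in append mode on the same
// file. Returns the final file contents (empty string if opening fails).
std::string append_demo(const char* path)
{
    std::remove(path);  // start from a missing file; the result is ignored

    std::FILE* f1 = std::fopen(path, "a");
    if (!f1) return "";
    std::fputs("hello ", f1);
    std::fflush(f1);          // f1 has now written 6 bytes

    std::FILE* f2 = std::fopen(path, "a");
    if (!f2) { std::fclose(f1); return ""; }
    std::fputs("world ", f2);
    std::fclose(f2);          // the file is now 12 bytes long

    std::fputs("bye", f1);    // append mode: goes to the current end of file
    std::fclose(f1);

    std::string content;
    if (std::FILE* in = std::fopen(path, "r")) {
        char buf[64];
        std::size_t n;
        while ((n = std::fread(buf, 1, sizeof buf, in)) > 0)
            content.append(buf, n);
        std::fclose(in);
    }
    return content;
}
```

The last write through f1 lands after "world " even though f1 itself only ever wrote 6 bytes, which matches the quoted description of append mode: output always moves to the end of the file first.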

zlib gzgets extremely slow?

I'm doing stuff related to parsing huge globs of text files, and was testing what input method to use.
There is not much of a difference between C++ std::ifstream and C FILE.
According to the zlib documentation, it supports uncompressed files, and will read such a file without decompression.
I'm seeing a difference from 12 seconds without zlib to more than 4 minutes using zlib.h.
I've tested this with multiple runs, so it's not a disk cache issue.
Am I using zlib in some wrong way?
Thanks
#include <zlib.h>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#define LENS 1000000
size_t fg(const char *fname){
fprintf(stderr,"\t-> using fgets\n");
FILE *fp =fopen(fname,"r");
size_t nLines =0;
char *buffer = new char[LENS];
while(NULL!=fgets(buffer,LENS,fp))
nLines++;
fprintf(stderr,"%lu\n",nLines);
return nLines;
}
size_t is(const char *fname){
fprintf(stderr,"\t-> using ifstream\n");
std::ifstream is(fname,std::ios::in);
size_t nLines =0;
char *buffer = new char[LENS];
while(is.getline(buffer,LENS))
nLines++;
fprintf(stderr,"%lu\n",nLines);
return nLines;
}
size_t iz(const char *fname){
fprintf(stderr,"\t-> using zlib\n");
gzFile fp =gzopen(fname,"r");
size_t nLines =0;
char *buffer = new char[LENS];
while(0!=gzgets(fp,buffer,LENS))
nLines++;
fprintf(stderr,"%lu\n",nLines);
return nLines;
}
int main(int argc,char**argv){
if(atoi(argv[2])==0)
fg(argv[1]);
if(atoi(argv[2])==1)
is(argv[1]);
if(atoi(argv[2])==2)
iz(argv[1]);
}
I guess you are using zlib-1.2.3. In this version, gzgets() effectively calls gzread() for each byte. Calling gzread() in this way has a big overhead. You can compare the CPU time of calling gzread(gzfp, buffer, 4096) once with that of calling gzread(gzfp, buffer, 1) 4096 times. The result is the same, but the CPU time is hugely different.
What you should do is implement buffered I/O for zlib, reading ~4KB of data in a chunk with one gzread() call (like what fread() does for read()). The latest zlib-1.2.5 is said to be significantly improved on gzread/gzgetc/.... You may try that as well. As it was released very recently, I have not tried it personally.
EDIT:
I have tried zlib-1.2.5 just now. gzgetc and gzgets in 1.2.5 are much faster than those in 1.2.3.