C/C++ best way to send a number of bytes to stdout - c++

Profiling my program and the function print is taking a lot of time to perform. How can I send "raw" byte output directly to stdout instead of using fwrite, and making it faster (need to send all 9bytes in the print() at the same time to the stdout) ?
void print(){
unsigned char temp[9];
temp[0] = matrix[0][0];
temp[1] = matrix[0][1];
temp[2] = matrix[0][2];
temp[3] = matrix[1][0];
temp[4] = matrix[1][1];
temp[5] = matrix[1][2];
temp[6] = matrix[2][0];
temp[7] = matrix[2][1];
temp[8] = matrix[2][2];
fwrite(temp,1,9,stdout);
}
Matrix is defined globally to be a unsigned char matrix[3][3];

IO is not an inexpensive operation. It is, in fact, a blocking operation, meaning that the OS can preempt your process when you call write to allow more CPU-bound processes to run, before the IO device you're writing to completes the operation.
The only lower level function you can use (if you're developing on a *nix machine), is to use the raw write function, but even then your performance will not be that much faster than it is now. Simply put: IO is expensive.

The top rated answer claims that IO is slow.
Here's a quick benchmark with a sufficiently large buffer to take the OS out of the critical performance path, but only if you're willing to receive your output in giant blurps. If latency to first byte is your problem, you need to run in "dribs" mode.
Write 10 million records from a nine byte array
Mint 12 AMD64 on 3GHz CoreDuo under gcc 4.6.1
340ms to /dev/null
710ms to 90MB output file
15254ms to 90MB output file in "dribs" mode
FreeBSD 9 AMD64 on 2.4GHz CoreDuo under clang 3.0
450ms to /dev/null
550ms to 90MB output file on ZFS triple mirror
1150ms to 90MB output file on FFS system drive
22154ms to 90MB output file in "dribs" mode
There's nothing slow about IO if you can afford to buffer properly.
#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include <string.h>
int main (int argc, char* argv[])
{
int dribs = argc > 1 && 0==strcmp (argv[1], "dribs");
int err;
int i;
enum { BigBuf = 4*1024*1024 };
char* outbuf = malloc (BigBuf);
assert (outbuf != NULL);
err = setvbuf (stdout, outbuf, _IOFBF, BigBuf); // full line buffering
assert (err == 0);
enum { ArraySize = 9 };
char temp[ArraySize];
enum { Count = 10*1000*1000 };
for (i = 0; i < Count; ++i) {
fwrite (temp, 1, ArraySize, stdout);
if (dribs) fflush (stdout);
}
fflush (stdout); // seems to be needed after setting own buffer
fclose (stdout);
if (outbuf) { free (outbuf); outbuf = NULL; }
}

The rawest form of output you can do is the probable the write system call, like this
write (1, matrix, 9);
1 is the file descriptor for standard out (0 is standard in, and 2 is standard error). Your standard out will only write as fast as the one reading it at the other end (i.e. the terminal, or the program you're pipeing into) which might be rather slow.
I'm not 100% sure, but you could try setting non-blocking IO on fd 1 (using fcntl) and hope the OS will buffer it for you until it can be consumed by the other end. It's been a while, but I think it works like this
fcntl (1, F_SETFL, O_NONBLOCK);
YMMV though. Please correct me if I'm wrong on the syntax, as I said, it's been a while.

Perhaps your problem is not that fwrite() is slow, but that it is buffered.
Try calling fflush(stdout) after the fwrite().
This all really depends on your definition of slow in this context.

All printing is fairly slow, although iostreams are really slow for printing.
Your best bet would be to use printf, something along the lines of:
printf("%c%c%c%c%c%c%c%c%c\n", matrix[0][0], matrix[0][1], matrix[0][2], matrix[1][0],
matrix[1][1], matrix[1][2], matrix[2][0], matrix[2][1], matrix[2][2]);

As everyone has pointed out IO in tight inner loop is expensive. I have normally ended up doing conditional cout of Matrix based on some criteria when required to debug it.
If your app is console app then try redirecting it to a file, it will be lot faster than doing console refreshes. e.g app.exe > matrixDump.txt

What's wrong with:
fwrite(matrix,1,9,stdout);
both the one and the two dimensional arrays take up the same memory.

Try running the program twice. Once with output and once without. You will notice that overall, the one without the io is the fastest. Also, you could fork the process (or create a thread), one writing to a file(stdout), and one doing the operations.

So first, don't print on every entry. Basically what i am saying is do not do like that.
for(int i = 0; i<100; i++){
printf("Your stuff");
}
instead allocate a buffer either on stack or on heap, and store you infomration there and then just throw this bufffer into stdout, just liek that
char *buffer = malloc(sizeof(100));
for(int i = 100; i<100; i++){
char[i] = 1; //your 8 byte value goes here
}
//once you are done print it to a ocnsole with
write(1, buffer, 100);
but in your case, just use write(1, temp, 9);

I am pretty sure you can increase the output performance by increasing the buffer size. So you have less fwrite calls. write might be faster but I am not sure. Just try this:
❯ yes | dd of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 2.18338 s, 234 MB/s
vs
> yes | dd of=/dev/null count=100000 bs=50KB iflag=fullblock
100000+0 records in
100000+0 records out
5000000000 bytes (5.0 GB, 4.7 GiB) copied, 2.63986 s, 1.9 GB/s
The same applies to your code. Some tests during the last days show that probably good buffer sizes are around 1 << 12 (=4096) and 1<<16 (=65535) bytes.

You can simply:
std::cout << temp;
printf is more C-Style.
Yet, IO operations are costly, so use them wisely.

Related

Fastest way to read millions of integers from stdin C++?

I am working on a sorting project and I've come to the point where a main bottleneck is reading in the data. It takes my program about 20 seconds to sort 100,000,000 integers read in from stdin using cin and std::ios::sync_with_stdio(false); but it turns out that 10 of those seconds is reading in the data to sort. We do know how many integers we will be reading in (the count is at the top of the file we need to sort).
How can I make this faster? I know it's possible because a student in a previous semester was able to do counting sort in a little over 3 seconds (and that's basically purely read time).
The program is just fed the contents of a file with integers separated by newlines like $ ./program < numstosort.txt
Thanks
Here is the relevant code:
std::ios::sync_with_stdio(false);
int max;
cin >> max;
short num;
short* a = new short[max];
int n = 0;
while(cin >> num) {
a[n] = num;
n++;
}
This will get your data into memory about as fast as possible, assuming Linux/POSIX running on commodity hardware. Note that since you apparently aren't allowed to use compiler optimizations, C++ IO is not going to be the fastest way to read data. As others have noted, without optimizations the C++ code will not run anywhere near as fast as it can.
Given that the redirected file is already open as stdin/STDIN_FILENO, use low-level system call/C-style IO. That won't need to be optimized, as it will run just about as fast as possible:
struct stat sb;
int rc = ::fstat( STDIN_FILENO, &sb );
// use C-style calloc() to get memory that's been
// set to zero as calloc() is often optimized to be
// faster than a new followed by a memset().
char *data = (char *)::calloc( 1, sb.st_size + 1 );
size_t totalRead = 0UL;
while ( totalRead < sb.st_size )
{
ssize_t bytesRead = ::read( STDIN_FILENO,
data + totalRead, sb.st_size - totalRead );
if ( bytesRead <= 0 )
{
break;
}
totalRead += bytesRead;
}
// data is now in memory - start processing it
That code will read your data into memory as one long C-style string. And the lack of compiler optimizations won't matter one bit as it's all almost bare-metal system calls.
Using fstat() to get the file size allows allocating all the needed memory at once - no realloc() or copying data around is necessary.
You'll need to add some error checking, and a more robust version of the code would check to be sure the data returned from fstat() actually is a regular file with an actual size, and not a "useless use of cat" such as cat filename | YourProgram, because in that case the fstat() call won't return a useful file size. You'll need to examine the sb.st_mode field of the struct stat after the call to see what the stdin stream really is:
::fstat( STDIN_FILENO, &sb );
...
if ( S_ISREG( sb.st_mode ) )
{
// regular file...
}
(And for really high-performance systems, it can be important to ensure that the memory pages you're reading data into are actually mapped in your process address space. Performance can really stall if data arrives faster than the kernel's memory management system can create virtual-to-physical mappings for the pages data is getting dumped into.)
To handle a large file as fast as possible, you'd want to go multithreaded, with one thread reading data and feeding one or more data processing threads so you can start processing data before you're done reading it.
Edit: parsing the data.
Again, preventing compiler optimizations probably makes the overhead of C++ operations slower than C-style processing. Based on that assumption, something simple will probably run faster.
This would probably work a lot faster in a non-optimized binary, assuming the data is in a C-style string read in as above:
char *next;
long count = ::strtol( data, &next, 0 );
long *values = new long[ count ];
for ( long ii = 0; ii < count; ii++ )
{
values[ ii ] = ::strtol( next, &next, 0 );
}
That is also very fragile. It relies on strtol() skipping over leading whitespace, meaning if there's anything other than whitespace between the numeric values it will fail. It also relies on the initial count of values being correct. Again - that code will fail if that's not true. And because it can replace the value of next before checking for errors, if it ever goes off the rails because of bad data it'll be hopelessly lost.
But it should be about as fast as possible without allowing compiler optimizations.
That's what crazy about not allowing compiler optimizations. You can write simple, robust C++ code to do all your processing, make use of a good optimizing compiler, and probably run almost as fast as the code I posted - which has no error checking and will fail spectacularly in unexpected and undefined ways if fed unexpected data.
You can make it faster if you use a SolidState hard drive. If you want to ask something about code performance, you need to post how are you doing things in the first place.
You may be able to speed up your program by reading the data into a buffer, then converting the text in the buffer to internal representation.
The thought behind this is that all stream devices like to keep streaming. Starting and stopping the stream wastes time. A block read transfers a lot of data with one transaction.
Although cin is buffered, by using cin.read and a buffer, you can make the buffer a lot bigger than cin uses.
If the data has fixed width fields, there are opportunities to speed up the input and conversion processes.
Edit 1: Example
const unsigned int BUFFER_SIZE = 65536;
char text_buffer[BUFFER_SIZE];
//...
cin.read(text_buffer, BUFFER_SIZE);
//...
int value1;
int arguments_scanned = snscanf(&text_buffer, REMAINING_BUFFER_SIZE,
"%d", &value1);
The tricky part is handling the cases where the text of a number is cut off at the end of the buffer.
Can you ran this little test in compare to your test with and without commented line?
#include <iostream>
#include <cstdlib>
int main()
{
std::ios::sync_with_stdio(false);
char buffer[20] {0,};
int t = 0;
while( std::cin.get(buffer, 20) )
{
// t = std::atoi(buffer);
std::cin.ignore(1);
}
return 0;
}
Pure read test:
#include <iostream>
#include <cstdlib>
int main()
{
std::ios::sync_with_stdio(false);
char buffer[1024*1024];
while( std::cin.read(buffer, 1024*1024) )
{
}
return 0;
}

Fast C++ String Output

I have a program that outputs the data from an FPGA. Since the data changes EXTREMELY fast, I'm trying to increase the speed of the program. Right now I am printing data like this
for (int i = 0; i < 100; i++) {
printf("data: %d\n",getData(i));
}
I found that using one printf greatly increases speed
printf("data: %d \n data: %d \n data: %d \n",getData(1),getData(2),getData(3));
However, as you can see, its very messy and I can't use a for loop. I tried concatenating the strings first using sprintf and then printing everything out at once, but it's just as slow as the first method. Any suggestions?
Edit:
I'm already printing to a file first, because I realized the console scrolling would be an issue. But its still too slow. I'm debugging a memory controller for an external FPGA, so the closer to the real speed the better.
If you are writing to stdout, you might not be able to influence this all.
Otherwise, set buffering
setvbuf http://en.cppreference.com/w/cpp/io/c/setvbuf
std::nounitbuf http://en.cppreference.com/w/cpp/io/manip/unitbuf
and untie the input output streams (C++) http://en.cppreference.com/w/cpp/io/basic_ios/tie
std::ios_base::sync_with_stdio(false) (thanks #Dietmar)
Now, Boost Karma is known to be pretty performant. However, I'd need to know more about your input data.
Meanwhile, try to buffer your writes manually: Live on Coliru
#include <stdio.h>
int getData(int i) { return i; }
int main()
{
char buf[100*24]; // or some other nice, large enough size
char* const last = buf+sizeof(buf);
char* out = buf;
for (int i = 0; i < 100; i++) {
out += snprintf(out, last-out, "data: %d\n", getData(i));
}
*out = '\0';
printf("%s", buf);
}
Wow, I can't believe I didn't do this earlier.
const int size = 100;
char data[size];
for (int i = 0; i < size; i++) {
*(data + i) = getData(i);
}
for (int i = 0; i < size; i++) {
printf("data: %d\n",*(data + i));
}
As I said, printf was the bottleneck, and sprintf wasn't much of an improvement either. So I decided to avoid any sort of printing until the very end, and use pointers instead
How much data? Store it in RAM until you're done, then print it. Also, file output may be faster. Depending on the terminal, your program may be blocking on writes. You may want to select for write-ability and write directly to STDOUT, instead.
basically you can't do lots of synchronous terminal IO on something where you want consistent, predictable performance.
I suggest you format your text to a buffer, then use the fwrite function to write the buffer.
Building off of dasblinkenlight's answer, use fwrite instead of puts. The puts function is searching for a terminating nul character. The fwrite function writes as-is to the console.
char buf[] = "data: 0000000000\r\n";
for (int i = 0; i < 100; i++) {
// int portion starts at position 6
itoa(getData(i), &buf[6], 10);
// The -1 is because we don't want to write the nul character.
fwrite(buf, 1, sizeof(buf) - 1, stdout);
}
You may want to read all the data into a separate raw data buffer, then format the raw data into a "formatted" data buffer and finally blast the entire "formatted" data buffer using one fwrite call.
You want to minimize the calls to send data out because there is an overhead involved. The fwrite function has about the same overhead for writing 1 character as it does writing 10,000 characters. This is where buffering comes in. Using a 1024 buffer of items would mean you use 1 function call to write 1024 items versus 1024 calls writing one item each. The latter is 1023 extra function calls.
Try printing an \r at the end of your string instead of the usual \n -- if that works on your system. That way you don't get continuous scrolling.
It depends on your environment if this works. And, of course, you won't be able to read all of the data if it's changing really fast.
Have you considered printing only every n th entry?

How can I read multiple files faster?

in my program I want to read several text files (more than ~800 files), each with 256 lines and their filenames starting from 1.txt to n.txt, and store them into a database after several processing steps. My problem is the data's reading speed. I could speed the program up to about twice the speed it had before by using OpenMP multithreading for the reading loop. Is there a way to speed it up a bit more? My actual code is
std::string CCD_Folder = CCDFolder; //CCDFolder is a pointer to a char array
int b = 0;
int PosCounter = 0;
int WAVENUMBER, WAVELUT;
std::vector<std::string> tempstr;
std::string inputline;
//Input
omp_set_num_threads(YValue);
#pragma omp parallel for private(WAVENUMBER) private(WAVELUT) private(PosCounter) private(tempstr) private(inputline)
for(int i = 1; i < (CCD_Filenumbers+1); i++)
{
//std::cout << omp_get_thread_num() << ' ' << i << '\n';
//Umwandlung und Erstellung des Dateinamens, Öffnen des Lesekanals
std::string CCD_Filenumber = boost::lexical_cast<string>(i);
std::string CCD_Filename = CCD_Folder + '\\' + CCD_Filenumber + ".txt";
std::ifstream datain(CCD_Filename, std::ifstream::in);
while(!datain.eof())
{
std::getline(datain, inputline);
//Processing
};
};
All variables which are not defined here are defined somewhere else in my code, and it is working. So is there a possibility to speed this code a bit more up?
Thank you very much!
Some experiment:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <Windows.h>
void generateFiles(int n) {
char fileName[32];
char fileStr[1032];
for (int i=0;i<n;i++) {
sprintf( fileName, "c:\\t\\%i.txt", i );
FILE * f = fopen( fileName, "w" );
for (int j=0;j<256;j++) {
int lineLen = rand() % 1024;
memset(fileStr, 'X', lineLen );
fileStr[lineLen] = 0x0D;
fileStr[lineLen+1] = 0x0A;
fileStr[lineLen+2] = 0x00;
fwrite( fileStr, 1, lineLen+2, f );
}
fclose(f);
}
}
void readFiles(int n) {
char fileName[32];
for (int i=0;i<n;i++) {
sprintf( fileName, "c:\\t\\%i.txt", i );
FILE * f = fopen( fileName, "r" );
fseek(f, 0L, SEEK_END);
int size = ftell(f);
fseek(f, 0L, SEEK_SET);
char * data = (char*)malloc(size);
fread(data, size, 1, f);
free(data);
fclose(f);
}
}
DWORD WINAPI readInThread( LPVOID lpParam )
{
int * number = (int *)lpParam;
char fileName[32];
sprintf( fileName, "c:\\t\\%i.txt", *number );
FILE * f = fopen( fileName, "r" );
fseek(f, 0L, SEEK_END);
int size = ftell(f);
fseek(f, 0L, SEEK_SET);
char * data = (char*)malloc(size);
fread(data, size, 1, f);
free(data);
fclose(f);
return 0;
}
int main(int argc, char ** argv) {
long t1 = GetTickCount();
generateFiles(256);
printf("Write: %li ms\n", GetTickCount() - t1 );
t1 = GetTickCount();
readFiles(256);
printf("Read: %li ms\n", GetTickCount() - t1 );
t1 = GetTickCount();
const int MAX_THREADS = 256;
int pDataArray[MAX_THREADS];
DWORD dwThreadIdArray[MAX_THREADS];
HANDLE hThreadArray[MAX_THREADS];
for( int i=0; i<MAX_THREADS; i++ )
{
pDataArray[i] = (int) HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY,
sizeof(int));
pDataArray[i] = i;
hThreadArray[i] = CreateThread(
NULL,
0,
readInThread,
&pDataArray[i],
0,
&dwThreadIdArray[i]);
}
WaitForMultipleObjects(MAX_THREADS, hThreadArray, TRUE, INFINITE);
printf("Read (threaded): %li ms\n", GetTickCount() - t1 );
}
first function just ugly thing to make a test dataset ( I know it can be done much better, but I honestly have no time )
1st experiment - sequential read
2nd experiment - read all in parallel
results:
256 files:
Write: 250 ms
Read: 140 ms
Read (threaded): 78 ms
1024 files:
Write: 1250 ms
Read: 547 ms
Read (threaded): 843 ms
I think second attempt clearly shows that on a long run 'dumb' thread creation just makes things even worse. Of course it needs improvements in a sense of preallocated workers, some thread pool etc, but I think with such fast operation as reading 100-200k from disk there is no really benefit of moving this functionality into thread. I have no time to write more 'clever' solution, but I have my doubts that it will be much faster because you will have to add system calls for mutexes etc...
going extreme you could think of preallocating memory pools etc.. but as being mentioned before code your posted just wrong.. it's a matter of milliseconds, but for sure not seconds
800 files (20 chars per line, 256 lines)
Write: 250 ms
Read: 63 ms
Read (threaded): 500 ms
Conclusion:
ANSWER IS:
Your reading code is wrong, you reading files so slow that there is a significant increase in speed then you make tasks runs in parallel. In the code above reading is actually faster then an expenses to spawn a thread
Your primary bottleneck is physically reading from the hard disk.
Unless you have the files on separate drives, the drive can only read data from one file at a time. Your best bet is to read each file as a whole rather read a portion of one file, tell the drive to locate to another file, read from there, and repeat. Repositioning the drive head to other locations, especially other files, is usually more expensive than letting the drive finish reading the single file.
The next bottle neck is the data channel between the processor and the hard drive. If your hard drives share any kind of communications channel, you will see a bottleneck, as data from each drive must come through the communications channel to your processor. Your processor will be sending commands to the drive(s) through this communications channel (PATA, SATA, USB, etc.).
The objective of the next steps is to reduce the overhead of the "middle men" between your program's memory and the hard drive communications interface. The most efficient is to access the controller directly; lesser efficient are using the OS functions; the "C" functions (fread and familiy) and least is the C++ streams. With increased efficiency comes tighter coupling with the platform and reduced safety (and simplicity).
I suggest the following:
Create multiple buffers in memory, large enough to save time, small
enough to prevent the OS from paging the memory to the hard drive.
Create a thread that reads the files into memory, as necessary.
Search the web for "double buffering". As long as there is space in
the buffer, this thread will read data.
Create multiple "outgoing" buffers.
Create a second thread that removes data from memory and "processes"
it, and inserts into the "outgoing" buffers.
Create a third thread that takes the data in the "outgoing" buffers
and sends to the databases.
Adjust the size of the buffers for the best efficiency within the
limitations of memory.
If you can access the DMA channels, use them to read from the hard drive into the "read buffers".
Next, you can optimize your code to efficiently use the data cache of the processor. For example, set up your "processing" so the data structures to not exceed a data line in the cache. Also, optimize your code to use registers (either specify the register keyword or use statement blocks so that the compiler knows when variables can be reused).
Other optimizations that may help:
Align data to the processors native word size, pad if necessary. For
example, prefer using 32 bytes instead of 13 or 24.
Fetch data in quantities of the processor's word size. For example,
access 4 octets (bytes) at a time on a 32-bit processor rather than 4
accesses of 1 byte.
Unroll loops - more instructions inside the loop, as branch
instructions slow down processing.
You are probably hitting the read limit of your disks, which means your options are somewhat limited. If this is a constant problem you could consider a different RAID structure, which will give you greater read throughput because more than one read head can access data at the same time.
To see if disk access really is the bottleneck, run your program with the time command:
>> /usr/bin/time -v <my program>
In the output you'll see how much CPU time you were utilizing compared to the amount of time required for things like disk access.
I would try going with C code for reading the file. I suspect that it'll be faster.
FILE* f = ::fopen( CCD_Filename.c_str(), "rb" );
if( f == NULL )
{
return;
}
::fseek( f, 0, SEEK_END );
const long lFileBytes = ::ftell( f );
::fseek( f, 0, SEEK_SET );
char* fileContents = new char[lFileBytes + 1];
const size_t numObjectsRead = ::fread( fileContents, lFileBytes, 1, f );
::fclose( f );
if( numObjectsRead < 1 )
{
delete [] fileContents;
return;
}
fileContents[lFileBytes] = '\0';
// assign char buffer of file contents here
delete [] fileContents;

Reading file with fread() in reverse order causes memory leak?

I have a program that basically does this:
Opens some binary file
Reads the file backwards (by backwards, I mean it starts near EOF, and ends reading at beginning of file, i.e. reads the file "right-to-left"), using 4MB chunks
Closes the file
My question is: why memory consumption looks like below, even though there are no obvious memory leaks in my attached code?
Here's the source of program that was run to obtain above image:
#include <stdio.h>
#include <string.h>
int main(void)
{
//allocate stuff
const int bufferSize = 4*1024*1024;
FILE *fileHandle = fopen("./input.txt", "rb");
if (!fileHandle)
{
fprintf(stderr, "No file for you\n");
return 1;
}
unsigned char *buffer = new unsigned char[bufferSize];
if (!buffer)
{
fprintf(stderr, "No buffer for you\n");
return 1;
}
//get file size. file can be BIG, hence the fseeko() and ftello()
//instead of fseek() and ftell().
fseeko(fileHandle, 0, SEEK_END);
off_t totalSize = ftello(fileHandle);
fseeko(fileHandle, 0, SEEK_SET);
//read the file... in reverse order. This is important.
for (off_t pos = totalSize - bufferSize, j = 0;
pos >= 0;
pos -= bufferSize, j ++)
{
if (j % 10 == 0)
{
fprintf(stderr,
"reading like crazy: %lld / %lld\n",
pos, totalSize);
}
/*
* below is the heart of the problem. see notes below
*/
//seek to desired position
fseeko(fileHandle, pos, SEEK_SET);
//read the chunk
fread(buffer, sizeof(unsigned char), bufferSize, fileHandle);
}
fclose(fileHandle);
delete []buffer;
}
I have also following observations:
Even though RAM usage jumps by 1GB, the whole program uses only 5MB thorough whole execution.
Commenting call to fread() out makes memory leak go away. This is weird, since I don't allocate anything anywhere near it, that could trigger memory leak...
Also, reading the file normally instead of backwards (= commenting call to fseeko() out), makes memory leak go away as well. This is the ultra-weird part.
Further information...
Following doesn't help:
Checking results of fread() - yields nothing out of ordinary.
Switching to normal, 32-bit fseek and ftell.
Doing stuff like setbuf(fileHandle, NULL).
Doing stuff like setvbuf(fileHandle, NULL, _IONBF, *any integer*).
Compiled with g++ 4.5.3 on Windows 7 via cygwin and mingw; without any optimalizations, just g++ test.cpp -o test. Both present such behaviour.
The file used in tests was 4GB long, full of zeros.
The weird pause in the middle of the chart could be explained with some kind of temporary I/O hangup, unrelated to this question.
Finally, if I wrap reading in infinite loop... the memory usage stops increasing after first iteration.
I think it has to do with some kind of internal cache building up till it's filled with whole file. How does it really work behind the scenes? How can I prevent that in a portable way??
I think, this is more an OS issue (or even an OS resource use reporting issue) than an issue with your program. Of course, it only uses 5 MB of memory: 1 MB for itself (libs, stack etc.) and 4 MB for the buffer. Whenever you do a fread(), the OS seems to "bind" part of the file to your process, and seems to release it not at the same speed. As memory use on your machine is low, this is not a problem: The OS just keeps the already read data "hanging around" longer than necessary, probably assuming, that your application might read it again, soon, and then it doesn't have to do that binding again.
If memory pressure was higher, than the OS is very likely to unbind the memory faster, so that jump on your memory usage history would be smaller.
I had the exact same problem, although in Java but it doesn't matter in this context. I solved it by reading much bigger chunks at a time. I also read 4Mb size chunks, but when I increased it to 100-200 Mb the problem went away. Perhaps it'll do that for you as well. I'm on Windows 7.

how many bytes actually written by ostream::write?

suppose I send a big buffer to ostream::write, but only the beginning part of it is actually successfully written, and the rest is not written
int main()
{
std::vector<char> buf(64 * 1000 * 1000, 'a'); // 64 mbytes of data
std::ofstream file("out.txt");
file.write(&buf[0], buf.size()); // try to write 64 mbytes
if(file.bad()) {
// but suppose only 10 megabyte were available on disk
// how many were actually written to file???
}
return 0;
}
what ostream function can tell me how many bytes were actually written?
You can use .tellp() to know the output position in the stream to compute the number of bytes written as:
size_t before = file.tellp(); //current pos
if(file.write(&buf[0], buf.size())) //enter the if-block if write fails.
{
//compute the difference
size_t numberOfBytesWritten = file.tellp() - before;
}
Note that there is no guarantee that numberOfBytesWritten is really the number of bytes written to the file, but it should work for most cases, since we don't have any reliable way to get the actual number of bytes written to the file.
I don't see any equivalent to gcount(). Writing directly to the streambuf (with sputn()) would give you an indication, but there is a fundamental problem in your request: write are buffered and failure detection can be delayed to the effective writing (flush or close) and there is no way to get access to what the OS really wrote.