I am working on a sorting project and I've come to the point where a main bottleneck is reading in the data. It takes my program about 20 seconds to sort 100,000,000 integers read in from stdin using cin and std::ios::sync_with_stdio(false); but it turns out that 10 of those seconds is reading in the data to sort. We do know how many integers we will be reading in (the count is at the top of the file we need to sort).
How can I make this faster? I know it's possible because a student in a previous semester was able to do counting sort in a little over 3 seconds (and that's basically purely read time).
The program is just fed the contents of a file with integers separated by newlines like $ ./program < numstosort.txt
Thanks
Here is the relevant code:
std::ios::sync_with_stdio(false);
int max;
cin >> max;
short num;
short* a = new short[max];
int n = 0;
while(cin >> num) {
a[n] = num;
n++;
}
This will get your data into memory about as fast as possible, assuming Linux/POSIX running on commodity hardware. Note that since you apparently aren't allowed to use compiler optimizations, C++ IO is not going to be the fastest way to read data. As others have noted, without optimizations the C++ code will not run anywhere near as fast as it can.
Given that the redirected file is already open as stdin/STDIN_FILENO, use low-level system call/C-style IO. That won't need to be optimized, as it will run just about as fast as possible:
struct stat sb;
int rc = ::fstat( STDIN_FILENO, &sb );
// use C-style calloc() to get memory that's been
// set to zero as calloc() is often optimized to be
// faster than a new followed by a memset().
char *data = (char *)::calloc( 1, sb.st_size + 1 );
size_t totalRead = 0UL;
while ( totalRead < sb.st_size )
{
ssize_t bytesRead = ::read( STDIN_FILENO,
data + totalRead, sb.st_size - totalRead );
if ( bytesRead <= 0 )
{
break;
}
totalRead += bytesRead;
}
// data is now in memory - start processing it
That code will read your data into memory as one long C-style string. And the lack of compiler optimizations won't matter one bit as it's all almost bare-metal system calls.
Using fstat() to get the file size allows allocating all the needed memory at once - no realloc() or copying data around is necessary.
You'll need to add some error checking, and a more robust version of the code would check to be sure the data returned from fstat() actually is a regular file with an actual size, and not a "useless use of cat" such as cat filename | YourProgram, because in that case the fstat() call won't return a useful file size. You'll need to examine the sb.st_mode field of the struct stat after the call to see what the stdin stream really is:
::fstat( STDIN_FILENO, &sb );
...
if ( S_ISREG( sb.st_mode ) )
{
// regular file...
}
(And for really high-performance systems, it can be important to ensure that the memory pages you're reading data into are actually mapped in your process address space. Performance can really stall if data arrives faster than the kernel's memory management system can create virtual-to-physical mappings for the pages data is getting dumped into.)
To handle a large file as fast as possible, you'd want to go multithreaded, with one thread reading data and feeding one or more data processing threads so you can start processing data before you're done reading it.
Edit: parsing the data.
Again, preventing compiler optimizations probably makes the overhead of C++ operations slower than C-style processing. Based on that assumption, something simple will probably run faster.
This would probably work a lot faster in a non-optimized binary, assuming the data is in a C-style string read in as above:
char *next;
long count = ::strtol( data, &next, 0 );
long *values = new long[ count ];
for ( long ii = 0; ii < count; ii++ )
{
values[ ii ] = ::strtol( next, &next, 0 );
}
That is also very fragile. It relies on strtol() skipping over leading whitespace, meaning if there's anything other than whitespace between the numeric values it will fail. It also relies on the initial count of values being correct. Again - that code will fail if that's not true. And because it can replace the value of next before checking for errors, if it ever goes off the rails because of bad data it'll be hopelessly lost.
But it should be about as fast as possible without allowing compiler optimizations.
That's what crazy about not allowing compiler optimizations. You can write simple, robust C++ code to do all your processing, make use of a good optimizing compiler, and probably run almost as fast as the code I posted - which has no error checking and will fail spectacularly in unexpected and undefined ways if fed unexpected data.
You can make it faster if you use a SolidState hard drive. If you want to ask something about code performance, you need to post how are you doing things in the first place.
You may be able to speed up your program by reading the data into a buffer, then converting the text in the buffer to internal representation.
The thought behind this is that all stream devices like to keep streaming. Starting and stopping the stream wastes time. A block read transfers a lot of data with one transaction.
Although cin is buffered, by using cin.read and a buffer, you can make the buffer a lot bigger than cin uses.
If the data has fixed width fields, there are opportunities to speed up the input and conversion processes.
Edit 1: Example
const unsigned int BUFFER_SIZE = 65536;
char text_buffer[BUFFER_SIZE];
//...
cin.read(text_buffer, BUFFER_SIZE);
//...
int value1;
int arguments_scanned = snscanf(&text_buffer, REMAINING_BUFFER_SIZE,
"%d", &value1);
The tricky part is handling the cases where the text of a number is cut off at the end of the buffer.
Can you ran this little test in compare to your test with and without commented line?
#include <iostream>
#include <cstdlib>
int main()
{
std::ios::sync_with_stdio(false);
char buffer[20] {0,};
int t = 0;
while( std::cin.get(buffer, 20) )
{
// t = std::atoi(buffer);
std::cin.ignore(1);
}
return 0;
}
Pure read test:
#include <iostream>
#include <cstdlib>
int main()
{
std::ios::sync_with_stdio(false);
char buffer[1024*1024];
while( std::cin.read(buffer, 1024*1024) )
{
}
return 0;
}
Related
I'm new to C++ and am making an app that uses a lot of putc to write data in output which is file. Because of high writes its being slowed down, I used to code in Delphi, so I know how to solve it, like make a memory stream and write into it every time we need to write into output, and if size of memory stream is larger than buffer size we want, write it into output and clear the memory stream. How should I do this with C++ or any better solution?
putc is already buffered, 4 KB is default you can use setvbuf for changing that value :D
setvbuf
Writing to a file should be very quick. It is usually the emptying of the buffer that takes some time. Consider using the character \n instead of std::endl.
I think a good answer to your question is here: Writing a binary file in C++ very fast
Where the answer is:
#include <stdio.h>
const unsigned long long size = 8ULL*1024ULL*1024ULL;
unsigned long long a[size];
int main()
{
FILE* pFile;
pFile = fopen("file.binary", "wb");
for (unsigned long long j = 0; j < 1024; ++j){
//Some calculations to fill a[]
fwrite(a, 1, size*sizeof(unsigned long long), pFile);
}
fclose(pFile);
return 0;
}
The most important thing in your case is to write as much data you can, with the least possible I/O requests.
I have a program that outputs the data from an FPGA. Since the data changes EXTREMELY fast, I'm trying to increase the speed of the program. Right now I am printing data like this
for (int i = 0; i < 100; i++) {
printf("data: %d\n",getData(i));
}
I found that using one printf greatly increases speed
printf("data: %d \n data: %d \n data: %d \n",getData(1),getData(2),getData(3));
However, as you can see, its very messy and I can't use a for loop. I tried concatenating the strings first using sprintf and then printing everything out at once, but it's just as slow as the first method. Any suggestions?
Edit:
I'm already printing to a file first, because I realized the console scrolling would be an issue. But its still too slow. I'm debugging a memory controller for an external FPGA, so the closer to the real speed the better.
If you are writing to stdout, you might not be able to influence this all.
Otherwise, set buffering
setvbuf http://en.cppreference.com/w/cpp/io/c/setvbuf
std::nounitbuf http://en.cppreference.com/w/cpp/io/manip/unitbuf
and untie the input output streams (C++) http://en.cppreference.com/w/cpp/io/basic_ios/tie
std::ios_base::sync_with_stdio(false) (thanks #Dietmar)
Now, Boost Karma is known to be pretty performant. However, I'd need to know more about your input data.
Meanwhile, try to buffer your writes manually: Live on Coliru
#include <stdio.h>
int getData(int i) { return i; }
int main()
{
char buf[100*24]; // or some other nice, large enough size
char* const last = buf+sizeof(buf);
char* out = buf;
for (int i = 0; i < 100; i++) {
out += snprintf(out, last-out, "data: %d\n", getData(i));
}
*out = '\0';
printf("%s", buf);
}
Wow, I can't believe I didn't do this earlier.
const int size = 100;
char data[size];
for (int i = 0; i < size; i++) {
*(data + i) = getData(i);
}
for (int i = 0; i < size; i++) {
printf("data: %d\n",*(data + i));
}
As I said, printf was the bottleneck, and sprintf wasn't much of an improvement either. So I decided to avoid any sort of printing until the very end, and use pointers instead
How much data? Store it in RAM until you're done, then print it. Also, file output may be faster. Depending on the terminal, your program may be blocking on writes. You may want to select for write-ability and write directly to STDOUT, instead.
basically you can't do lots of synchronous terminal IO on something where you want consistent, predictable performance.
I suggest you format your text to a buffer, then use the fwrite function to write the buffer.
Building off of dasblinkenlight's answer, use fwrite instead of puts. The puts function is searching for a terminating nul character. The fwrite function writes as-is to the console.
char buf[] = "data: 0000000000\r\n";
for (int i = 0; i < 100; i++) {
// int portion starts at position 6
itoa(getData(i), &buf[6], 10);
// The -1 is because we don't want to write the nul character.
fwrite(buf, 1, sizeof(buf) - 1, stdout);
}
You may want to read all the data into a separate raw data buffer, then format the raw data into a "formatted" data buffer and finally blast the entire "formatted" data buffer using one fwrite call.
You want to minimize the calls to send data out because there is an overhead involved. The fwrite function has about the same overhead for writing 1 character as it does writing 10,000 characters. This is where buffering comes in. Using a 1024 buffer of items would mean you use 1 function call to write 1024 items versus 1024 calls writing one item each. The latter is 1023 extra function calls.
Try printing an \r at the end of your string instead of the usual \n -- if that works on your system. That way you don't get continuous scrolling.
It depends on your environment if this works. And, of course, you won't be able to read all of the data if it's changing really fast.
Have you considered printing only every n th entry?
i wrote an application which processes data on the GPU. Code works well, but i have the problem that the reading part of the input file (~3GB, text) is the bottleneck of my application. (The read from the HDD is fast, but the processing line by line is slow).
I read a line with getline() and copy line 1 to a vector, line2 to a vector and skip lines 3 and 4. And so on for the rest of the 11 mio lines.
I tried several approaches to get the file at the best time possible:
Fastest method I found is using boost::iostreams::stream
Others were:
Read the file as gzip, to minimize IO, but is slower than directly
reading it.
copy file to ram by read(filepointer, chararray, length)
and process it with a loop to distinguish the lines (also slower than boost)
Any suggestions how to make it run faster?
void readfastq(char *filename, int SRlength, uint32_t blocksize){
_filelength = 0; //total datasets (each 4 lines)
_SRlength = SRlength; //length of the 2. line
_blocksize = blocksize;
boost::iostreams::stream<boost::iostreams::file_source>ins(filename);
in = ins;
readNextBlock();
}
void readNextBlock() {
timeval start, end;
gettimeofday(&start, 0);
string name;
string seqtemp;
string garbage;
string phredtemp;
_seqs.empty();
_phred.empty();
_names.empty();
_filelength = 0;
//read only a part of the file i.e the first 4mio lines
while (std::getline(in, name) && _filelength<_blocksize) {
std::getline(in, seqtemp);
std::getline(in, garbage);
std::getline(in, phredtemp);
if (seqtemp.size() != _SRlength) {
if (seqtemp.size() != 0)
printf("Error on read in fastq: size is invalid\n");
} else {
_names.push_back(name);
for (int k = 0; k < _SRlength; k++) {
//handle special letters
if(seqtemp[k]== 'A') ...
else{
_seqs.push_back(5);
}
}
_filelength++;
}
}
EDIT:
The source-file is downloadable under https://docs.google.com/open?id=0B5bvyb427McSMjM2YWQwM2YtZGU2Mi00OGVmLThkODAtYzJhODIzYjNhYTY2
I changed the function readfastq to read the file, because of some pointer problems. So if you call readfastq the blocksize (in lines) must be bigger than the number of lines to read.
SOLUTION:
I found a solution, which get the time for read in the file from 60sec to 16sec. I removed the inner-loop which handeles the special characters and do this in GPU. This decreases the read-in time and only minimal increases the GPU running time.
Thanks for your suggestions.
void readfastq(char *filename, int SRlength) {
_filelength = 0;
_SRlength = SRlength;
size_t bytes_read, bytes_expected;
FILE *fp;
fp = fopen(filename, "r");
fseek(fp, 0L, SEEK_END); //go to the end of file
bytes_expected = ftell(fp); //get filesize
fseek(fp, 0L, SEEK_SET); //go to the begining of the file
fclose(fp);
if ((_seqarray = (char *) malloc(bytes_expected/2)) == NULL) //allocate space for file
err(EX_OSERR, "data malloc");
string name;
string seqtemp;
string garbage;
string phredtemp;
boost::iostreams::stream<boost::iostreams::file_source>file(filename);
while (std::getline(file, name)) {
std::getline(file, seqtemp);
std::getline(file, garbage);
std::getline(file, phredtemp);
if (seqtemp.size() != SRlength) {
if (seqtemp.size() != 0)
printf("Error on read in fastq: size is invalid\n");
} else {
_names.push_back(name);
strncpy( &(_seqarray[SRlength*_filelength]), seqtemp.c_str(), seqtemp.length()); //do not handle special letters here, do on GPU
_filelength++;
}
}
}
First instead of reading the file into memory you may work with file mappings. You just have to build your program as 64-bit to fit 3GB of virtual address space (for 32-bit application only 2GB is accessible in the user mode). Or alternatively you may map & process your file by parts.
Next, it sounds to me that your bottleneck is "copying a line to a vector". Dealing with vectors involves dynamic memory allocation (heap operations), which in a critical loop hits the performance very seriously). If this is the case - either avoid using vectors, or make sure they're declared outside the loop. The latter helps because when you reallocate/clear vectors they do not free memory.
Post your code (or a part of it) for more suggestions.
EDIT:
It seems that all your bottlenecks are related to string management.
std::getline(in, seqtemp); reading into an std::string deals with the dynamic memory allocation.
_names.push_back(name); This is even worse. First the std::string is placed into the vector by value. Means - the string is copied, hence another dynamic allocation/freeing happens. Moreover, when eventually the vector is internally reallocated - all the contained strings are copied again, with all the consequences.
I recommend using neither standard formatted file I/O functions (Stdio/STL) nor std::string. To achieve better performance you should work with pointers to strings (rather than copied strings), which is possible if you map the entire file. Plus you'll have to implement the file parsing (division into lines).
Like in this code:
class MemoryMappedFileParser
{
const char* m_sz;
size_t m_Len;
public:
struct String {
const char* m_sz;
size_t m_Len;
};
bool getline(String& out)
{
out.m_sz = m_sz;
const char* sz = (char*) memchr(m_sz, '\n', m_Len);
if (sz)
{
size_t len = sz - m_sz;
m_sz = sz + 1;
m_Len -= (len + 1);
out.m_Len = len;
// for Windows-format text files remove the '\r' as well
if (len && '\r' == out.m_sz[len-1])
out.m_Len--;
} else
{
out.m_Len = m_Len;
if (!m_Len)
return false;
m_Len = 0;
}
return true;
}
};
if _seqs and _names are std::vectors and you can guess the final size of them before processing the whole 3GB of data, you can use reserve to avoid most of the memory re-allocation during pushing back the new elements in the loop.
You should be aware of the fact that the vectors effectively produce another copy of parts of the file in main memory. So unless you have a main memory sufficiently large to store the text file plus the vector and its contents, you will probably end up with a number of page faults that also have a significant influence on the speed of your program.
You are apparently using <stdio.h> since using getline.
Perhaps fopen-ing the file with fopen(path, "rm"); might help, because the m tells (it is a GNU extension) to use mmap for reading.
Perhaps setting a big buffer (i.e. half a megabyte) with setbuffer could also help.
Probably, using the readahead system call (in a separate thread perhaps) could help.
But all this are guesses. You should really measure things.
General suggestions:
Code the simplest, most straight-forward, clean approach,
Measure,
Measure,
Measure,
Then if all else fails:
Read raw bytes (read(2)) in page-aligned chunks. Do so sequentially, so kernel's read-ahead plays to your advantage.
Re-use the same buffer to minimize cache flushing.
Avoid copying data, parse in place, pass around pointers (and sizes).
mmap(2)-ing [parts of the] file is another approach. This also avoids kernel-userland copy.
Depending on your disk speed, using a very fast de compression algorithm might help, like fastlz (there are at least two other that might be more efficient, but under GPL, so licence can be a problem).
Also, using C++ data structures and functions car increase the speed as you can maybe achieve a better compiler-time optimization. Going the C way isn't always the fastes! In some bad conditions, using char* you need to parse the whole string to reach the \0 yielding desastrous performances.
For parsing your data, using boost::spirit::qi is also probably the most optimized approach http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html
I have got a huge text file loaded into a CMemFile object and would like to parse it line by line (separated by newline chars).
Originally it is a zip file on disk, and I unzip it into memory to parse it, therefore the CMemFile.
One working way to read line by line is this (m_file is a smart pointer to a CMemFile):
CArchive archive(m_file.get(), CArchive::load);
CString line;
while(archive.ReadString(line))
{
ProcessLine(string(line));
}
Since it takes a lot of time, I tried to write my own routine:
const UINT READSIZE = 1024;
const char NEWLINE = '\n';
char readBuffer[READSIZE];
UINT bytesRead = 0;
char *posNewline = NULL;
const char* itEnd = readBuffer + READSIZE;
ULONGLONG currentPosition = 0;
ULONGLONG newlinePositionInBuffer = 0;
do
{
currentPosition = m_file->GetPosition();
bytesRead = m_file->Read(&readBuffer, READSIZE);
if(bytesRead == 0) break; // EOF
posNewline = std::find(readBuffer, readBuffer + bytesRead, NEWLINE);
if(posNewline != itEnd)
{
// found newline
ProcessLine(string(readBuffer, posNewline));
newlinePositionInBuffer = posNewline - readBuffer + 1; // +1 to skip \r
m_file->Seek(currentPosition + newlinePositionInBuffer, CFile::begin);
}
} while(true);
Measuring the performance showed both methods take about the same time...
Can you think of any performance improvements or a faster way to do the parsing?
Thanks for any advice
A few notes and comments that may be useful:
Profiling is the only way for sure to know what the code is doing and how long it takes. Often the bottleneck is not obvious from the code itself. One basic method would be to time the loading, the uncompressing, and the parsing individually.
The actual loading of the file from disk, and in your case the uncompressing, may actually take significantly more time than the parsing, especially if your ProcessFile() function is a nop. If your parsing only takes 1% of the total time then you're never going to get much from trying to optimize that 1%. This is something profiling your code would tell you.
A general way to optimize a load/parse algorithm is to look at how many times a particular byte is read/parsed. The minimum, and possibly fastest, algorithm must read and parse each byte only once. Looking at your algorithms each byte appears to be copied a half-dozen times and potentially parsed a similar number. Reducing these numbers may help reduce the overall algorithm time, although the relative gain may not be much overall.
Using a profiler showed that 75 % of process time was wasted in this line of code:
ProcessLine(string(readBuffer, posNewline));
Mainly the creation of the temporary string caused a big overhead (many allocs). The ProcessLine function itself contains no code. By changing the declaration from:
void ProcessLine(const std::string &);
to:
inline void ProcessLine(const char*, const char*);
process time used was reduced by a factor of five.
You could run both the decompression and the parsing in separate threads. Each time the decompression produces some data, you should pass it to the parsing thread using a message mechanism to parse.
This allows both to run in parallel, and also result in a smaller memory overhead since you work in blocks rather than the entire decompressed file (which would result in less page faults and swaps to virtual memory).
I think your problem might be that you are reading in too much and reseeking to new line.
If you file was
foo
bar
etc
Say 10 bytes average on a line. You will read 10 lines...and read the 9 lines again.
Profiling my program and the function print is taking a lot of time to perform. How can I send "raw" byte output directly to stdout instead of using fwrite, and making it faster (need to send all 9bytes in the print() at the same time to the stdout) ?
void print(){
unsigned char temp[9];
temp[0] = matrix[0][0];
temp[1] = matrix[0][1];
temp[2] = matrix[0][2];
temp[3] = matrix[1][0];
temp[4] = matrix[1][1];
temp[5] = matrix[1][2];
temp[6] = matrix[2][0];
temp[7] = matrix[2][1];
temp[8] = matrix[2][2];
fwrite(temp,1,9,stdout);
}
Matrix is defined globally to be a unsigned char matrix[3][3];
IO is not an inexpensive operation. It is, in fact, a blocking operation, meaning that the OS can preempt your process when you call write to allow more CPU-bound processes to run, before the IO device you're writing to completes the operation.
The only lower level function you can use (if you're developing on a *nix machine), is to use the raw write function, but even then your performance will not be that much faster than it is now. Simply put: IO is expensive.
The top rated answer claims that IO is slow.
Here's a quick benchmark with a sufficiently large buffer to take the OS out of the critical performance path, but only if you're willing to receive your output in giant blurps. If latency to first byte is your problem, you need to run in "dribs" mode.
Write 10 million records from a nine byte array
Mint 12 AMD64 on 3GHz CoreDuo under gcc 4.6.1
340ms to /dev/null
710ms to 90MB output file
15254ms to 90MB output file in "dribs" mode
FreeBSD 9 AMD64 on 2.4GHz CoreDuo under clang 3.0
450ms to /dev/null
550ms to 90MB output file on ZFS triple mirror
1150ms to 90MB output file on FFS system drive
22154ms to 90MB output file in "dribs" mode
There's nothing slow about IO if you can afford to buffer properly.
#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include <string.h>
int main (int argc, char* argv[])
{
int dribs = argc > 1 && 0==strcmp (argv[1], "dribs");
int err;
int i;
enum { BigBuf = 4*1024*1024 };
char* outbuf = malloc (BigBuf);
assert (outbuf != NULL);
err = setvbuf (stdout, outbuf, _IOFBF, BigBuf); // full line buffering
assert (err == 0);
enum { ArraySize = 9 };
char temp[ArraySize];
enum { Count = 10*1000*1000 };
for (i = 0; i < Count; ++i) {
fwrite (temp, 1, ArraySize, stdout);
if (dribs) fflush (stdout);
}
fflush (stdout); // seems to be needed after setting own buffer
fclose (stdout);
if (outbuf) { free (outbuf); outbuf = NULL; }
}
The rawest form of output you can do is the probable the write system call, like this
write (1, matrix, 9);
1 is the file descriptor for standard out (0 is standard in, and 2 is standard error). Your standard out will only write as fast as the one reading it at the other end (i.e. the terminal, or the program you're pipeing into) which might be rather slow.
I'm not 100% sure, but you could try setting non-blocking IO on fd 1 (using fcntl) and hope the OS will buffer it for you until it can be consumed by the other end. It's been a while, but I think it works like this
fcntl (1, F_SETFL, O_NONBLOCK);
YMMV though. Please correct me if I'm wrong on the syntax, as I said, it's been a while.
Perhaps your problem is not that fwrite() is slow, but that it is buffered.
Try calling fflush(stdout) after the fwrite().
This all really depends on your definition of slow in this context.
All printing is fairly slow, although iostreams are really slow for printing.
Your best bet would be to use printf, something along the lines of:
printf("%c%c%c%c%c%c%c%c%c\n", matrix[0][0], matrix[0][1], matrix[0][2], matrix[1][0],
matrix[1][1], matrix[1][2], matrix[2][0], matrix[2][1], matrix[2][2]);
As everyone has pointed out IO in tight inner loop is expensive. I have normally ended up doing conditional cout of Matrix based on some criteria when required to debug it.
If your app is console app then try redirecting it to a file, it will be lot faster than doing console refreshes. e.g app.exe > matrixDump.txt
What's wrong with:
fwrite(matrix,1,9,stdout);
both the one and the two dimensional arrays take up the same memory.
Try running the program twice. Once with output and once without. You will notice that overall, the one without the io is the fastest. Also, you could fork the process (or create a thread), one writing to a file(stdout), and one doing the operations.
So first, don't print on every entry. Basically what i am saying is do not do like that.
for(int i = 0; i<100; i++){
printf("Your stuff");
}
instead allocate a buffer either on stack or on heap, and store you infomration there and then just throw this bufffer into stdout, just liek that
char *buffer = malloc(sizeof(100));
for(int i = 100; i<100; i++){
char[i] = 1; //your 8 byte value goes here
}
//once you are done print it to a ocnsole with
write(1, buffer, 100);
but in your case, just use write(1, temp, 9);
I am pretty sure you can increase the output performance by increasing the buffer size. So you have less fwrite calls. write might be faster but I am not sure. Just try this:
❯ yes | dd of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 2.18338 s, 234 MB/s
vs
> yes | dd of=/dev/null count=100000 bs=50KB iflag=fullblock
100000+0 records in
100000+0 records out
5000000000 bytes (5.0 GB, 4.7 GiB) copied, 2.63986 s, 1.9 GB/s
The same applies to your code. Some tests during the last days show that probably good buffer sizes are around 1 << 12 (=4096) and 1<<16 (=65535) bytes.
You can simply:
std::cout << temp;
printf is more C-Style.
Yet, IO operations are costly, so use them wisely.