Synchronisation issue calling ofstream write with custom buffer (pubsetbuf) - c++

I am writing potentially large (multi-gigabyte) binary files to a network location using std::ofstream. Previously I was flushing the buffer at frequent intervals, which hurt performance especially at peak times - reducing the flush frequency has helped. I am interested in finding other ways to further improve write performance.
According to various online sources, writing directly to a buffer associated with the stream's streambuf object via stream.rdbuf()->pubsetbuf(buffer) avoids the write overhead described here.
Instead of passing each value in turn to ofstream::write, I populate and pass a byte buffer which has a fixed size of 8192 bytes. I had expected that directly populating the buffer would avoid the need to call write, however, omitting write results in nothing being written to file. In any case, my tests show that the use of pubsetbuf with write is significantly faster than not using pubsetbuf and writing one 8-byte value at a time.
Unfortunately, if I do not flush the stream following each write call, the content of the buffer is both appended to the file AND written at the start, overwriting the data previously at that position. If I flush immediately following the write call, the resulting file is correct. I have checked that the stream pointer is correct every time a write happens.
Alternatively, if I don't bother calling stream.rdbuf()->pubsetbuf(buffer) at all and continue to pass the pre-populated byte buffer to write, I avoid the need to flush. I'm currently profiling which of the two approaches performs better:
pubsetbuf with write + flush every time (suspect poor performance over network)
no pubsetbuf, write and a single flush at the end
This illustrates the overwrite issue with a much reduced data set (comment stream.flush() back in to have it behave correctly):
vector<double> data;
for (int i = 0; i < 400; ++i) data.push_back(111.0);
for (int i = 0; i < 400; ++i) data.push_back(222.0);
for (int i = 0; i < 400; ++i) data.push_back(333.0);
bool lastPoint;
size_t numPoints = data.size();
size_t bytesPerValue = sizeof(data[0]);
// Create a byte buffer
static const size_t BUFFER_SIZE = 8192;
vector<unsigned char> buffer;
buffer.resize(BUFFER_SIZE);
size_t bufferIndex = 0;
bool bufferFull = false;
// Create output stream
ofstream stream("\\\\server\\networkpath\\output.dat", ios::binary);
// Set stream buffer to avoid write calls
stream.rdbuf()->pubsetbuf(reinterpret_cast<char*>(&buffer[0]), BUFFER_SIZE);
for (size_t i = 0; i < numPoints; i++)
{
lastPoint = i == (numPoints - 1);
// Write to buffer
memcpy(&buffer[bufferIndex], reinterpret_cast<const unsigned char*>(&data[i]), bytesPerValue);
// Find next empty index
bufferIndex += bytesPerValue;
bufferFull = ((bufferIndex + bytesPerValue) > BUFFER_SIZE);
if (lastPoint || bufferFull)
{
stream.write(reinterpret_cast<const char*>(&buffer[0]), bufferIndex);
//stream.flush();
// If the line above is commented back in, the content of the output file is correct.
// However, I wish to avoid flushing often when writing across network.
// Without commenting back in, the file contains the expected number of bytes (9600) but the data in the second write call
// (bytes 8193-9600) appears both at the start and end of the file.
bufferIndex = 0;
}
}
stream.flush();
stream.close();
Is there a way to use the buffering technique via pubsetbuf without the need to flush every time a write is performed?

Related

Fastest way to read millions of integers from stdin C++?

I am working on a sorting project and I've come to the point where a main bottleneck is reading in the data. It takes my program about 20 seconds to sort 100,000,000 integers read in from stdin using cin and std::ios::sync_with_stdio(false); but it turns out that 10 of those seconds is reading in the data to sort. We do know how many integers we will be reading in (the count is at the top of the file we need to sort).
How can I make this faster? I know it's possible because a student in a previous semester was able to do counting sort in a little over 3 seconds (and that's basically purely read time).
The program is just fed the contents of a file with integers separated by newlines like $ ./program < numstosort.txt
Thanks
Here is the relevant code:
std::ios::sync_with_stdio(false);
int max;
cin >> max;
short num;
short* a = new short[max];
int n = 0;
while(cin >> num) {
a[n] = num;
n++;
}
This will get your data into memory about as fast as possible, assuming Linux/POSIX running on commodity hardware. Note that since you apparently aren't allowed to use compiler optimizations, C++ IO is not going to be the fastest way to read data. As others have noted, without optimizations the C++ code will not run anywhere near as fast as it can.
Given that the redirected file is already open as stdin/STDIN_FILENO, use low-level system call/C-style IO. That won't need to be optimized, as it will run just about as fast as possible:
struct stat sb;
int rc = ::fstat( STDIN_FILENO, &sb );
// use C-style calloc() to get memory that's been
// set to zero as calloc() is often optimized to be
// faster than a new followed by a memset().
char *data = (char *)::calloc( 1, sb.st_size + 1 );
size_t totalRead = 0UL;
while ( totalRead < sb.st_size )
{
ssize_t bytesRead = ::read( STDIN_FILENO,
data + totalRead, sb.st_size - totalRead );
if ( bytesRead <= 0 )
{
break;
}
totalRead += bytesRead;
}
// data is now in memory - start processing it
That code will read your data into memory as one long C-style string. And the lack of compiler optimizations won't matter one bit as it's all almost bare-metal system calls.
Using fstat() to get the file size allows allocating all the needed memory at once - no realloc() or copying data around is necessary.
You'll need to add some error checking, and a more robust version of the code would check to be sure the data returned from fstat() actually is a regular file with an actual size, and not a "useless use of cat" such as cat filename | YourProgram, because in that case the fstat() call won't return a useful file size. You'll need to examine the sb.st_mode field of the struct stat after the call to see what the stdin stream really is:
::fstat( STDIN_FILENO, &sb );
...
if ( S_ISREG( sb.st_mode ) )
{
// regular file...
}
(And for really high-performance systems, it can be important to ensure that the memory pages you're reading data into are actually mapped in your process address space. Performance can really stall if data arrives faster than the kernel's memory management system can create virtual-to-physical mappings for the pages data is getting dumped into.)
To handle a large file as fast as possible, you'd want to go multithreaded, with one thread reading data and feeding one or more data processing threads so you can start processing data before you're done reading it.
Edit: parsing the data.
Again, preventing compiler optimizations probably makes the overhead of C++ operations slower than C-style processing. Based on that assumption, something simple will probably run faster.
This would probably work a lot faster in a non-optimized binary, assuming the data is in a C-style string read in as above:
char *next;
long count = ::strtol( data, &next, 0 );
long *values = new long[ count ];
for ( long ii = 0; ii < count; ii++ )
{
values[ ii ] = ::strtol( next, &next, 0 );
}
That is also very fragile. It relies on strtol() skipping over leading whitespace, meaning if there's anything other than whitespace between the numeric values it will fail. It also relies on the initial count of values being correct. Again - that code will fail if that's not true. And because it can replace the value of next before checking for errors, if it ever goes off the rails because of bad data it'll be hopelessly lost.
But it should be about as fast as possible without allowing compiler optimizations.
That's what crazy about not allowing compiler optimizations. You can write simple, robust C++ code to do all your processing, make use of a good optimizing compiler, and probably run almost as fast as the code I posted - which has no error checking and will fail spectacularly in unexpected and undefined ways if fed unexpected data.
You can make it faster if you use a SolidState hard drive. If you want to ask something about code performance, you need to post how are you doing things in the first place.
You may be able to speed up your program by reading the data into a buffer, then converting the text in the buffer to internal representation.
The thought behind this is that all stream devices like to keep streaming. Starting and stopping the stream wastes time. A block read transfers a lot of data with one transaction.
Although cin is buffered, by using cin.read and a buffer, you can make the buffer a lot bigger than cin uses.
If the data has fixed width fields, there are opportunities to speed up the input and conversion processes.
Edit 1: Example
const unsigned int BUFFER_SIZE = 65536;
char text_buffer[BUFFER_SIZE];
//...
cin.read(text_buffer, BUFFER_SIZE);
//...
int value1;
int arguments_scanned = snscanf(&text_buffer, REMAINING_BUFFER_SIZE,
"%d", &value1);
The tricky part is handling the cases where the text of a number is cut off at the end of the buffer.
Can you ran this little test in compare to your test with and without commented line?
#include <iostream>
#include <cstdlib>
int main()
{
std::ios::sync_with_stdio(false);
char buffer[20] {0,};
int t = 0;
while( std::cin.get(buffer, 20) )
{
// t = std::atoi(buffer);
std::cin.ignore(1);
}
return 0;
}
Pure read test:
#include <iostream>
#include <cstdlib>
int main()
{
std::ios::sync_with_stdio(false);
char buffer[1024*1024];
while( std::cin.read(buffer, 1024*1024) )
{
}
return 0;
}

Fast C++ String Output

I have a program that outputs the data from an FPGA. Since the data changes EXTREMELY fast, I'm trying to increase the speed of the program. Right now I am printing data like this
for (int i = 0; i < 100; i++) {
printf("data: %d\n",getData(i));
}
I found that using one printf greatly increases speed
printf("data: %d \n data: %d \n data: %d \n",getData(1),getData(2),getData(3));
However, as you can see, its very messy and I can't use a for loop. I tried concatenating the strings first using sprintf and then printing everything out at once, but it's just as slow as the first method. Any suggestions?
Edit:
I'm already printing to a file first, because I realized the console scrolling would be an issue. But its still too slow. I'm debugging a memory controller for an external FPGA, so the closer to the real speed the better.
If you are writing to stdout, you might not be able to influence this all.
Otherwise, set buffering
setvbuf http://en.cppreference.com/w/cpp/io/c/setvbuf
std::nounitbuf http://en.cppreference.com/w/cpp/io/manip/unitbuf
and untie the input output streams (C++) http://en.cppreference.com/w/cpp/io/basic_ios/tie
std::ios_base::sync_with_stdio(false) (thanks #Dietmar)
Now, Boost Karma is known to be pretty performant. However, I'd need to know more about your input data.
Meanwhile, try to buffer your writes manually: Live on Coliru
#include <stdio.h>
int getData(int i) { return i; }
int main()
{
char buf[100*24]; // or some other nice, large enough size
char* const last = buf+sizeof(buf);
char* out = buf;
for (int i = 0; i < 100; i++) {
out += snprintf(out, last-out, "data: %d\n", getData(i));
}
*out = '\0';
printf("%s", buf);
}
Wow, I can't believe I didn't do this earlier.
const int size = 100;
char data[size];
for (int i = 0; i < size; i++) {
*(data + i) = getData(i);
}
for (int i = 0; i < size; i++) {
printf("data: %d\n",*(data + i));
}
As I said, printf was the bottleneck, and sprintf wasn't much of an improvement either. So I decided to avoid any sort of printing until the very end, and use pointers instead
How much data? Store it in RAM until you're done, then print it. Also, file output may be faster. Depending on the terminal, your program may be blocking on writes. You may want to select for write-ability and write directly to STDOUT, instead.
basically you can't do lots of synchronous terminal IO on something where you want consistent, predictable performance.
I suggest you format your text to a buffer, then use the fwrite function to write the buffer.
Building off of dasblinkenlight's answer, use fwrite instead of puts. The puts function is searching for a terminating nul character. The fwrite function writes as-is to the console.
char buf[] = "data: 0000000000\r\n";
for (int i = 0; i < 100; i++) {
// int portion starts at position 6
itoa(getData(i), &buf[6], 10);
// The -1 is because we don't want to write the nul character.
fwrite(buf, 1, sizeof(buf) - 1, stdout);
}
You may want to read all the data into a separate raw data buffer, then format the raw data into a "formatted" data buffer and finally blast the entire "formatted" data buffer using one fwrite call.
You want to minimize the calls to send data out because there is an overhead involved. The fwrite function has about the same overhead for writing 1 character as it does writing 10,000 characters. This is where buffering comes in. Using a 1024 buffer of items would mean you use 1 function call to write 1024 items versus 1024 calls writing one item each. The latter is 1023 extra function calls.
Try printing an \r at the end of your string instead of the usual \n -- if that works on your system. That way you don't get continuous scrolling.
It depends on your environment if this works. And, of course, you won't be able to read all of the data if it's changing really fast.
Have you considered printing only every n th entry?

How to access chunk of memory from memory mapped file using boost?

I am trying to read a large file in x,y,z. Typically it runs into gbs of data.
I have created memory mapped file using Boost. However, I am still not very clear as how to access a chunk of memory from this file.
Boost provides function char* data() that returns pointer to the first byte of the buffer.(I get entire data as buffer).
Is there a way by which I can read the data chunk by chunk. Ideally, I would like to read the data in chunks of say 10,000.
Following is code.
boost::iostreams::mapped_file_source file;
std::string filename("MyFile.pts");
unsigned size = 58678952192;
file.open(filename, size);
int numBytes = size*sizeof(float)*3;
cl_float3 *data = new cl_float3[size];
float * tmp = (float*)file.data();
for(int i = 0; i < size;i++)
{
data[i].x = tmp[i*3];
data[i].y = tmp[i*3+1];
data[i].z = tmp[i*3+2];
}
delete[] tmp;
boost::iostreams behave just like std::basic_iostream
We can use unformatted IO
char buff[10000];
file.read(buff,10000);

how many bytes actually written by ostream::write?

suppose I send a big buffer to ostream::write, but only the beginning part of it is actually successfully written, and the rest is not written
int main()
{
std::vector<char> buf(64 * 1000 * 1000, 'a'); // 64 mbytes of data
std::ofstream file("out.txt");
file.write(&buf[0], buf.size()); // try to write 64 mbytes
if(file.bad()) {
// but suppose only 10 megabyte were available on disk
// how many were actually written to file???
}
return 0;
}
what ostream function can tell me how many bytes were actually written?
You can use .tellp() to know the output position in the stream to compute the number of bytes written as:
size_t before = file.tellp(); //current pos
if(file.write(&buf[0], buf.size())) //enter the if-block if write fails.
{
//compute the difference
size_t numberOfBytesWritten = file.tellp() - before;
}
Note that there is no guarantee that numberOfBytesWritten is really the number of bytes written to the file, but it should work for most cases, since we don't have any reliable way to get the actual number of bytes written to the file.
I don't see any equivalent to gcount(). Writing directly to the streambuf (with sputn()) would give you an indication, but there is a fundamental problem in your request: write are buffered and failure detection can be delayed to the effective writing (flush or close) and there is no way to get access to what the OS really wrote.

very fast text file processing (C++)

i wrote an application which processes data on the GPU. Code works well, but i have the problem that the reading part of the input file (~3GB, text) is the bottleneck of my application. (The read from the HDD is fast, but the processing line by line is slow).
I read a line with getline() and copy line 1 to a vector, line2 to a vector and skip lines 3 and 4. And so on for the rest of the 11 mio lines.
I tried several approaches to get the file at the best time possible:
Fastest method I found is using boost::iostreams::stream
Others were:
Read the file as gzip, to minimize IO, but is slower than directly
reading it.
copy file to ram by read(filepointer, chararray, length)
and process it with a loop to distinguish the lines (also slower than boost)
Any suggestions how to make it run faster?
void readfastq(char *filename, int SRlength, uint32_t blocksize){
_filelength = 0; //total datasets (each 4 lines)
_SRlength = SRlength; //length of the 2. line
_blocksize = blocksize;
boost::iostreams::stream<boost::iostreams::file_source>ins(filename);
in = ins;
readNextBlock();
}
void readNextBlock() {
timeval start, end;
gettimeofday(&start, 0);
string name;
string seqtemp;
string garbage;
string phredtemp;
_seqs.empty();
_phred.empty();
_names.empty();
_filelength = 0;
//read only a part of the file i.e the first 4mio lines
while (std::getline(in, name) && _filelength<_blocksize) {
std::getline(in, seqtemp);
std::getline(in, garbage);
std::getline(in, phredtemp);
if (seqtemp.size() != _SRlength) {
if (seqtemp.size() != 0)
printf("Error on read in fastq: size is invalid\n");
} else {
_names.push_back(name);
for (int k = 0; k < _SRlength; k++) {
//handle special letters
if(seqtemp[k]== 'A') ...
else{
_seqs.push_back(5);
}
}
_filelength++;
}
}
EDIT:
The source-file is downloadable under https://docs.google.com/open?id=0B5bvyb427McSMjM2YWQwM2YtZGU2Mi00OGVmLThkODAtYzJhODIzYjNhYTY2
I changed the function readfastq to read the file, because of some pointer problems. So if you call readfastq the blocksize (in lines) must be bigger than the number of lines to read.
SOLUTION:
I found a solution, which get the time for read in the file from 60sec to 16sec. I removed the inner-loop which handeles the special characters and do this in GPU. This decreases the read-in time and only minimal increases the GPU running time.
Thanks for your suggestions.
void readfastq(char *filename, int SRlength) {
_filelength = 0;
_SRlength = SRlength;
size_t bytes_read, bytes_expected;
FILE *fp;
fp = fopen(filename, "r");
fseek(fp, 0L, SEEK_END); //go to the end of file
bytes_expected = ftell(fp); //get filesize
fseek(fp, 0L, SEEK_SET); //go to the begining of the file
fclose(fp);
if ((_seqarray = (char *) malloc(bytes_expected/2)) == NULL) //allocate space for file
err(EX_OSERR, "data malloc");
string name;
string seqtemp;
string garbage;
string phredtemp;
boost::iostreams::stream<boost::iostreams::file_source>file(filename);
while (std::getline(file, name)) {
std::getline(file, seqtemp);
std::getline(file, garbage);
std::getline(file, phredtemp);
if (seqtemp.size() != SRlength) {
if (seqtemp.size() != 0)
printf("Error on read in fastq: size is invalid\n");
} else {
_names.push_back(name);
strncpy( &(_seqarray[SRlength*_filelength]), seqtemp.c_str(), seqtemp.length()); //do not handle special letters here, do on GPU
_filelength++;
}
}
}
First instead of reading the file into memory you may work with file mappings. You just have to build your program as 64-bit to fit 3GB of virtual address space (for 32-bit application only 2GB is accessible in the user mode). Or alternatively you may map & process your file by parts.
Next, it sounds to me that your bottleneck is "copying a line to a vector". Dealing with vectors involves dynamic memory allocation (heap operations), which in a critical loop hits the performance very seriously). If this is the case - either avoid using vectors, or make sure they're declared outside the loop. The latter helps because when you reallocate/clear vectors they do not free memory.
Post your code (or a part of it) for more suggestions.
EDIT:
It seems that all your bottlenecks are related to string management.
std::getline(in, seqtemp); reading into an std::string deals with the dynamic memory allocation.
_names.push_back(name); This is even worse. First the std::string is placed into the vector by value. Means - the string is copied, hence another dynamic allocation/freeing happens. Moreover, when eventually the vector is internally reallocated - all the contained strings are copied again, with all the consequences.
I recommend using neither standard formatted file I/O functions (Stdio/STL) nor std::string. To achieve better performance you should work with pointers to strings (rather than copied strings), which is possible if you map the entire file. Plus you'll have to implement the file parsing (division into lines).
Like in this code:
class MemoryMappedFileParser
{
const char* m_sz;
size_t m_Len;
public:
struct String {
const char* m_sz;
size_t m_Len;
};
bool getline(String& out)
{
out.m_sz = m_sz;
const char* sz = (char*) memchr(m_sz, '\n', m_Len);
if (sz)
{
size_t len = sz - m_sz;
m_sz = sz + 1;
m_Len -= (len + 1);
out.m_Len = len;
// for Windows-format text files remove the '\r' as well
if (len && '\r' == out.m_sz[len-1])
out.m_Len--;
} else
{
out.m_Len = m_Len;
if (!m_Len)
return false;
m_Len = 0;
}
return true;
}
};
if _seqs and _names are std::vectors and you can guess the final size of them before processing the whole 3GB of data, you can use reserve to avoid most of the memory re-allocation during pushing back the new elements in the loop.
You should be aware of the fact that the vectors effectively produce another copy of parts of the file in main memory. So unless you have a main memory sufficiently large to store the text file plus the vector and its contents, you will probably end up with a number of page faults that also have a significant influence on the speed of your program.
You are apparently using <stdio.h> since using getline.
Perhaps fopen-ing the file with fopen(path, "rm"); might help, because the m tells (it is a GNU extension) to use mmap for reading.
Perhaps setting a big buffer (i.e. half a megabyte) with setbuffer could also help.
Probably, using the readahead system call (in a separate thread perhaps) could help.
But all this are guesses. You should really measure things.
General suggestions:
Code the simplest, most straight-forward, clean approach,
Measure,
Measure,
Measure,
Then if all else fails:
Read raw bytes (read(2)) in page-aligned chunks. Do so sequentially, so kernel's read-ahead plays to your advantage.
Re-use the same buffer to minimize cache flushing.
Avoid copying data, parse in place, pass around pointers (and sizes).
mmap(2)-ing [parts of the] file is another approach. This also avoids kernel-userland copy.
Depending on your disk speed, using a very fast de compression algorithm might help, like fastlz (there are at least two other that might be more efficient, but under GPL, so licence can be a problem).
Also, using C++ data structures and functions car increase the speed as you can maybe achieve a better compiler-time optimization. Going the C way isn't always the fastes! In some bad conditions, using char* you need to parse the whole string to reach the \0 yielding desastrous performances.
For parsing your data, using boost::spirit::qi is also probably the most optimized approach http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html