I'm currently implementing a ping/pong buffering scheme to safely write a file to disk. I'm using C++/Boost on a Linux/CentOS machine. Now I'm facing the problem of forcing the actual write of the file to disk. Is it possible to do so irrespective of all the caching policies of the filesystem (ext3/ext4) / OS custom rules / RAID controller / hard disk controller?
Is it best to use plain fread()/fwrite(), C++ ostream or Boost filesystem?
I've heard that simply flushing the file (fflush()) doesn't guarantee the actual write.
Use fflush (for FILE*) or std::flush (for IOStream) to force your program to send the data to the OS.
POSIX has
sync(2), which asks the system to schedule writing its buffers but can return before the writing is done (Linux waits until the data has been sent to the hardware before returning).
fsync(2), which is guaranteed to wait for the data to be sent to the hardware, but needs a file descriptor (you can get one from a FILE* with fileno(3); I know of no standard way to get one from an IOStream).
O_SYNC as a flag to open(2).
In all cases, the hardware may have its own buffers (but if the system has control over them, a good implementation will try to flush them as well, and ISTR that some disks use capacitors so that they can flush whatever happens to the power), and network file systems have their own caveats.
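As a rough illustration of the O_SYNC option listed above, here is a minimal sketch. The helper name write_with_o_sync is made up for this example, and it assumes a POSIX system. With O_SYNC each write(2) is expected to return only after the kernel has handed the data to the device, subject to the hardware caveats just mentioned.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Hypothetical helper (not from the original answer): write `data` to `path`
// through a descriptor opened with O_SYNC. Returns true on success.
bool write_with_o_sync(const std::string& path, const std::string& data)
{
    int fd = open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    if (fd == -1) return false;

    const char* p = data.data();
    std::size_t left = data.size();
    while (left > 0) {
        ssize_t n = write(fd, p, left);   // may write fewer bytes than asked
        if (n == -1) {
            if (errno == EINTR) continue; // interrupted by a signal: retry
            close(fd);
            return false;
        }
        p += n;
        left -= static_cast<std::size_t>(n);
    }
    return close(fd) == 0;
}
```

Because every write is synchronous, many small writes will be slow; batching the data into fewer, larger writes is usually the better trade-off.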
You can use fsync()/fdatasync() to force (Note 1) the data onto the storage.
These require a file descriptor, as given by e.g. open().
The Linux man page has more Linux-specific info, particularly on the difference between fsync and fdatasync.
If you don't use file descriptors directly, many abstractions will contain internal buffers residing in your process.
E.g. if you use a FILE*, you first have to flush the data out of your application.
//... open and write data to a FILE *myfile
fflush(myfile);
fsync(fileno(myfile));
Note 1: These calls force the OS to ensure that any data in any OS cache is written to the drive, and that the drive acknowledges that fact. Many hard drives lie to the OS about this, and might keep the data in cache memory on the drive.
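The fflush/fsync sequence above ignores return values. A slightly fuller sketch with error checking might look like the following. The helper names flush_and_sync and save_durably are invented for this example, and a POSIX system is assumed.

```cpp
#include <cassert>
#include <stdio.h>
#include <string>
#include <unistd.h>

// Hypothetical helper: move everything written to `fp` out of the stdio
// buffer, then ask the kernel to commit it to the device.
bool flush_and_sync(FILE* fp)
{
    if (fflush(fp) != 0) return false;   // stdio buffer -> kernel
    return fsync(fileno(fp)) == 0;       // kernel cache -> device
}

// Hypothetical helper: write `data` to `path` and sync it before closing.
bool save_durably(const std::string& path, const std::string& data)
{
    FILE* fp = fopen(path.c_str(), "wb");
    if (fp == NULL) return false;
    bool ok = fwrite(data.data(), 1, data.size(), fp) == data.size()
           && flush_and_sync(fp);
    // Always close, but report failure if any step failed.
    return (fclose(fp) == 0) && ok;
}
```

Whether the drive itself has really persisted the data is still subject to Note 1.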
Not in standard C++. You'll have to use some sort of system specific
IO, like open with the O_SYNC flag under Unix, and then write.
Note that this is partially implied by the fact that ostream (and in
C, FILE*) are buffered. If you don't know exactly when something is
written to disk, then it doesn't make much sense to insist on the
transactional integrity of the write. (It wouldn't be too hard to
design a streambuf which only writes when you do an explicit flush,
however.)
EDIT:
As a simple example:
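#include <fcntl.h>     // open, O_WRONLY, O_CREAT, O_SYNC
#include <unistd.h>    // write, close
#include <streambuf>
#include <string>
#include <vector>
class SynchronizedStreambuf : public std::streambuf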
{
int myFd;
std::vector<char> myBuffer;
protected:
virtual int overflow( int ch );
virtual int sync();
public:
SynchronizedStreambuf( std::string const& filename );
~SynchronizedStreambuf();
};
int SynchronizedStreambuf::overflow( int ch )
{
if ( myFd == -1 ) {
return traits_type::eof();
} else if ( ch == traits_type::eof() ) {
return sync() == -1 ? traits_type::eof() : 0;
} else {
if ( myBuffer.empty() ) {
// first call: allocate the buffer and install the put area
myBuffer.resize( 1000 );
setp( &myBuffer[0], &myBuffer[0] + myBuffer.size() );
} else if ( sync() == -1 ) {
// buffer was full: sync() writes it out and resets the put area
return traits_type::eof();
}
return sputc( traits_type::to_char_type( ch ) );
}
}
int SynchronizedStreambuf::sync()
{
if ( myFd == -1 ) {
return -1;
}
// everything between pbase() and pptr() is pending output
size_t toWrite = pptr() - pbase();
int result = (toWrite == 0
|| write( myFd, pbase(), toWrite ) == static_cast<ssize_t>( toWrite ) ? 0 : -1);
if ( result == -1 ) {
close( myFd );
setp( NULL, NULL );
myFd = -1;
} else if ( !myBuffer.empty() ) {
setp( &myBuffer[0], &myBuffer[0] + myBuffer.size() );
}
return result;
}
SynchronizedStreambuf::SynchronizedStreambuf( std::string const& filename )
: myFd( open( filename.c_str(), O_WRONLY | O_CREAT | O_SYNC, 0664 ) )
{
}
SynchronizedStreambuf::~SynchronizedStreambuf()
{
sync();
close( myFd );
}
(This has only been superficially tested, but the basic idea is there.)
Why does the below code not stop the compiler from flushing the buffer automatically?
cout.sync_with_stdio(false);
cin.tie(nullptr);
cout << "hello";
cout << "world";
int a;
cin >> a;
output:
helloworld
I'm using Visual Studio 2012 Ultimate
AFAIK, the stream can be flushed whenever the implementation likes to do so, i.e. there's no guarantee that the stream will be flushed after an insert operation. However, you could use one of these manipulators to ensure your stream gets flushed (these are the only ones I know of so if someone is aware of others, please comment):
std::endl - inserts a newline into the stream and flushes it,
std::flush - just flushes the stream,
std::(no)unitbuf - enables/disables flushing the stream after each insert operation.
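To see what these manipulators do without depending on the console, here is a small sketch that counts how often the stream asks its buffer to flush. CountingBuf, FlushCounts and count_flushes are made-up names for this illustration. The streambuf has no internal buffer, so only explicit flush requests reach sync().

```cpp
#include <cassert>
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical test double: stores characters in a string and counts the
// flush requests it receives (each one arrives as a call to sync()).
class CountingBuf : public std::streambuf {
public:
    std::string data;
    int syncCalls;
    CountingBuf() : syncCalls(0) {}
protected:
    virtual int_type overflow(int_type ch) {
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            data.push_back(traits_type::to_char_type(ch));
        return traits_type::not_eof(ch);
    }
    virtual int sync() { ++syncCalls; return 0; }
};

// Number of sync() calls observed after each step.
struct FlushCounts { int plain, afterFlush, afterEndl, afterUnitbuf; };

FlushCounts count_flushes()
{
    CountingBuf buf;
    std::ostream os(&buf);
    FlushCounts c;
    os << "hello" << 42;               // no manipulator: no flush is requested
    c.plain = buf.syncCalls;
    os << std::flush;                  // one explicit flush
    c.afterFlush = buf.syncCalls;
    os << std::endl;                   // newline plus one flush
    c.afterEndl = buf.syncCalls;
    os << std::unitbuf << "x" << "y";  // one flush after each insertion
    c.afterUnitbuf = buf.syncCalls;
    return c;
}
```

Note that this only shows when the stream requests a flush. As said above, the implementation remains free to write data out earlier.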
The standard allows an implementation to flush any time it feels
like it, but from a quality of implementation point of view, one
really doesn't expect a flush here. You might try adding
a setbuf, telling std::cout to use a buffer you specify (setbuf
itself is protected; the public entry point is pubsetbuf):
std::cout.rdbuf()->pubsetbuf( buffer, sizeof(buffer) );
Again, the standard doesn't guarantee anything, but if this isn't
respected, I'd consider the quality bad enough to warrant a bug
report.
Finally, if worst comes to worst, you can always insert
a filtering streambuf which does the buffering you want. You
shouldn't have to, but it won't be the first time we've had to
write extra code to work around a lack of quality in compilers
or libraries. If all you're doing is straightforward output (no
seeks or anything), something like the following should do the
trick:
class BufferingOutStreambuf : public std::streambuf
{
std::streambuf* myDest;
std::ostream* myOwner;
std::vector<char> myBuffer;
static size_t const bufferSize = 1000;
protected:
virtual int overflow( int ch )
{
// called when the buffer is full: pass it on, then store ch
if ( sync() == -1 ) {
return EOF;
}
return ch == EOF ? 0 : sputc( ch );
}
virtual int sync()
{
int results = 0;
if ( pptr() != pbase() ) {
if ( myDest->sputn( pbase(), pptr() - pbase() )
!= pptr() - pbase() ) {
results = -1;
}
}
setp( &myBuffer[0], &myBuffer[0] + myBuffer.size() );
return results;
}
public:
BufferingOutStreambuf( std::streambuf* dest )
: myDest( dest )
, myOwner( NULL )
, myBuffer( bufferSize )
{
setp( &myBuffer[0], &myBuffer[0] + myBuffer.size() );
}
BufferingOutStreambuf( std::ostream& dest )
: myDest( dest.rdbuf() )
, myOwner( &dest )
, myBuffer( bufferSize )
{
setp( &myBuffer[0], &myBuffer[0] + myBuffer.size() );
myOwner->rdbuf( this );
}
~BufferingOutStreambuf()
{
if ( myOwner != NULL ) {
myOwner->rdbuf( myDest );
}
}
};
Then just do:
BufferingOutStreambuf buffer( std::cout );
as the first line in main. (One could argue that iostreams
should have been designed to work like this from the start, with
filtering streambuf for buffering and code translation. But
it wasn't, and this shouldn't be necessary with a decent
implementation.)
I have to write a program in C (or C++) on Linux that tests write and read speed on different file systems. I have to be sure that all data is written to the disk (not just to the cache).
So my first question: what function should I use to open a new file? Previously I used open with the O_DIRECT and O_SYNC flags and everything was fine except one thing: writing small files like 1KB was extremely slow, something like 0.01MB/s.
I tried to use fopen instead of open, with fflush to be sure that all data is written directly to the disk, and I tested it first on a FAT32 file system. 1000 files of 1KB were written to disk (here an SD card) in 5 sec., something like 0.18MB/s, and I think that is correct.
Now the problem occurs when testing the EXT4 and NTFS file systems. On EXT4, 1KB files were written at something like 12MB/s (wrong), and when testing 100KB files the transfer rate was 180MB/s (terribly wrong, my SD card has a transfer rate of only 20MB/s).
My current code for writing files looks like this:
clock_gettime(CLOCK_REALTIME, &ts);
for ( int i = 0; i < amount; ++i)
{
p = fopen(buffer2, "w+");
fwrite(buff, size*1024, 1, p);
if ( fflush(p) != 0 ) { cout << "fflush error"; return 0; }
fclose(p);
}
clock_gettime(CLOCK_REALTIME, &ts2);
time2 = diff2(ts,ts2);
This works well only for the FAT32 file system. The second code (used before) looks like this:
for ( int i = 0; i < amount; ++i)
{
int fd = open(buffer2, O_WRONLY | O_CREAT, 0777);
if ( error(fd, "open") ) return false;
if ( (write(fd, buff, size*1024)) < 0 ) { perror("write error"); return 0; }
if ( (fsync(fd)) == -1 ) { perror("fsync"); return 0; }
close(fd);
}
This works for all file systems, but small files are written extremely slowly.
Maybe I should use different code for different file systems? Any ideas?
EDIT:
I have found why writing small files is slow. It is because of the fsync function, and on different file systems it takes a different amount of time. I am calling fsync after every write, so that is the problem.
Is there any way to call it at the end, when all files are written? Or maybe every few seconds? Do I have to use a different thread?
See "How do I ensure data is written to disk before closing fstream?", but I don't think you can ensure that data is actually on disk rather than in a cache in the disk controller or even in the drive's onboard cache.
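On the question of calling fsync once at the end: one possible approach, sketched here under assumptions, is to write all files first and sync them in a second pass. The helper name write_then_sync_all and its `prefix` parameter are made up for this example. It keeps every descriptor open until the sync pass, so `count` has to stay below the process's open-file limit, and it treats a short write as a failure for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <unistd.h>
#include <vector>

// Hypothetical benchmark helper: create `count` files named prefix0,
// prefix1, ... containing `payload`, then fsync each one in a second pass.
// Returns the number of files that were written and synced successfully.
int write_then_sync_all(const std::string& prefix, int count,
                        const std::string& payload)
{
    std::vector<int> fds;
    for (int i = 0; i < count; ++i) {
        std::string path = prefix + std::to_string(i);
        int fd = open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd == -1) break;
        ssize_t n = write(fd, payload.data(), payload.size());
        if (n != static_cast<ssize_t>(payload.size())) { close(fd); break; }
        fds.push_back(fd);              // keep open for the sync pass
    }
    int synced = 0;
    for (std::size_t i = 0; i < fds.size(); ++i) {
        if (fsync(fds[i]) == 0) ++synced;   // waits for this file's data
        close(fds[i]);
    }
    return synced;
}
```

For a benchmark, the timed region would have to include the sync pass. Otherwise the measurement again reflects only how fast data reaches the page cache, which is probably what the 180MB/s figure shows.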
I have a function to get the size of a file. I am running this on WinCE. Here is my current code, which seems particularly slow:
int Directory::GetFileSize(const std::string &filepath)
{
int filesize = -1;
#ifdef linux
struct stat fileStats;
if(stat(filepath.c_str(), &fileStats) != -1)
filesize = fileStats.st_size;
#else
std::wstring widePath;
Unicode::AnsiToUnicode(widePath, filepath);
HANDLE hFile = CreateFile(widePath.c_str(), 0, FILE_SHARE_READ | FILE_SHARE_WRITE, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
if (hFile > 0)
{
filesize = ::GetFileSize( hFile, NULL);
}
CloseHandle(hFile);
#endif
return filesize;
}
At least for Windows, I think I'd use something like this:
__int64 Directory::GetFileSize(std::wstring const &path) {
WIN32_FIND_DATAW data;
HANDLE h = FindFirstFileW(path.c_str(), &data);
if (h == INVALID_HANDLE_VALUE)
return -1;
FindClose(h);
return data.nFileSizeLow | (__int64)data.nFileSizeHigh << 32;
}
If the compiler you're using supports it, you might want to use long long instead of __int64. You probably do not want to use int though, as that will only work correctly for files up to 2 gigabytes, and files larger than that are now pretty common (though perhaps not so common on a WinCE device).
I'd expect this to be faster than most other methods though. It doesn't require opening the file itself at all, just finding the file's directory entry (or, in the case of something like NTFS, its master file table entry).
Your solution is already rather fast to query the size of a file.
Under Windows, at least for NTFS and FAT, the file system driver will keep the file size in the cache, so it is rather fast to query it. The most time-consuming work involved is switching from user-mode to kernel-mode, rather than the file system driver's processing.
If you want to make it even faster, you have to use your own cache policy in user-mode, e.g. a special hash table, to avoid switching from user-mode to kernel-mode. But I don't recommend doing that, because you will gain little performance.
PS: You'd better avoid the statement Unicode::AnsiToUnicode(widePath, filepath); in your function body. This function is rather time-consuming.
Just an idea (I haven't tested it), but I would expect
GetFileAttributesEx to be fastest at the system level. It
avoids having to open the file, and logically, I would expect it
to be faster than FindFirstFile, since it doesn't have to
maintain any information for continuing the search.
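A sketch of that idea, untested on WinCE: the function name file_size is made up, and the narrow-character GetFileAttributesExA call is an assumption made for brevity. As far as I know, WinCE only offers the wide-character variant, so the path would have to be converted there. The non-Windows branch simply mirrors the stat call from the question.

```cpp
#include <cassert>
#include <string>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/stat.h>
#endif

// Hypothetical helper: size of `path` in bytes, or -1 if it cannot be queried.
long long file_size(const std::string& path)
{
#ifdef _WIN32
    WIN32_FILE_ATTRIBUTE_DATA info;
    // Reads the attributes without opening the file.
    if (!GetFileAttributesExA(path.c_str(), GetFileExInfoStandard, &info))
        return -1;
    // Combine the two 32-bit halves into one 64-bit size.
    return (static_cast<long long>(info.nFileSizeHigh) << 32) | info.nFileSizeLow;
#else
    struct stat st;
    if (stat(path.c_str(), &st) != 0)
        return -1;
    return static_cast<long long>(st.st_size);
#endif
}
```

Returning a 64-bit type avoids the 2 gigabyte limit of int mentioned in the other answer.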
You could roll your own, but I don't see why your approach is slow:
int Get_Size( string path )
{
// #include <cstdio>
FILE *pFile = NULL;
// get the file stream; give up if the file cannot be opened
if ( fopen_s( &pFile, path.c_str(), "rb" ) != 0 || pFile == NULL )
return -1;
// set the file pointer to end of file
fseek( pFile, 0, SEEK_END );
// get the file size
int Size = ftell( pFile );
// return the file pointer to begin of file if you want to read it
// rewind( pFile );
// close stream and release buffer
fclose( pFile );
return Size;
}
I have been doing research on creating my own ostream and, along with that, a streambuf to handle the buffer for my ostream. I actually have most of it working: I can insert (<<) into my stream and get strings with no problem. I do this by implementing the virtual function xsputn. However, if I insert (<<) a float or an int into the stream instead of a string, xsputn never gets called.
I have walked through the code and I see that the stream is calling do_put, then f_put, which eventually tries to put the float one character at a time into the buffer. I can get it to call my implementation of the virtual function overflow(int c) if I leave my buffer with no space, and thereby get the data for the float and the int.
Now here is the problem: I need to know when the float is done being put into the buffer. Or to put it another way, I need to know when overflow has been called for the last time for a particular value being streamed in. The reason xsputn works for me is that I get the whole value up front, along with its length. So I can copy it into the buffer and then call out to the function waiting for the buffer to be full.
I am admittedly abusing the ostream design in that I need to cache the output and then send it all at once for each inserted value (<<).
Anyway, to be clear, I will restate what I am shooting for in another way. There is a very good chance I am just going about it the wrong way.
I want to use an inherited ostream and streambuf so I can insert values into it and let it handle my type conversion for me; then I want to ferry that information off to another object, to which I pass a handle down through the streambuf. That object has expensive I/O, so I don't want to send the data one char at a time.
Sorry in advance if this is unclear. And thank you for your time.
It's not too clear what you're doing, although it sounds roughly
right. Just to be sure: all your ostream does is provide
convenience constructors to create and install your streambuf,
a destructor, and possibly an implementation of rdbuf to
handle buffers of the right type. Supposing that's true:
defining xsputn in your streambuf is purely an optimization.
The key function you have to define is overflow. The simplest
implementation of overflow just takes a single character, and
outputs it to the sink. Everything beyond that is optimization:
you can, for example, set up a buffer using setp; if you do
this, then overflow will only be called when the buffer is
full, or a flush was requested. In this case, you'll have to
output the buffer as well (use pbase and pptr to get the
addresses). (The streambuf base class initializes the
pointers to create a 0 length buffer, so overflow will be
called for every character.) Other functions which you might
want to override in (very) specific cases:
imbue: If you need the locale for some reason. (Remember that
the current character encoding is part of the locale.)
setbuf: To allow client code to specify a buffer. (IMHO, it's
usually not worth the bother, but you may have special
requirements.)
seekoff: Support for seeking. I've never used this in any of
my streambufs, so I can't give any information beyond what
you could read in the standard.
sync: Called on flush, should output any characters in the
buffer to the sink. If you never call setp (so there's no
buffer), you're always in sync, and this can be a no-op.
overflow can call this one, or both can call some
separate function. (About the only difference between sync
and overflow is that, once a buffer is installed, overflow is
only called when the buffer is full, never when it is empty;
sync will be called whenever the client code flushes the stream.)
When writing my own streams, unless performance dictates
otherwise, I'll keep it simple, and only override overflow.
If performance dictates a buffer, I'll usually put the code to
flush the buffer into a separate write(address, length)
function, and implement overflow and sync along the lines
of:
int MyStreambuf::overflow( int ch )
{
if ( pbase() == NULL ) {
// save one char for next overflow:
setp( buffer, buffer + bufferSize - 1 );
if ( ch != EOF ) {
ch = sputc( ch );
} else {
ch = 0;
}
} else {
char* end = pptr();
if ( ch != EOF ) {
*end ++ = ch;
}
if ( write( pbase(), end - pbase() ) == failed ) {
ch = EOF;
} else if ( ch == EOF ) {
ch = 0;
}
setp( buffer, buffer + bufferSize - 1 );
}
return ch;
}
int MyStreambuf::sync()
{
return (pptr() == pbase()
|| write( pbase(), pptr() - pbase() ) != failed)
? 0
: -1;
}
Generally, I'll not bother with xsputn, but if your client
code is outputting a lot of long strings, it could be useful.
Something like this should do the trick:
std::streamsize MyStreambuf::xsputn(char const* p, std::streamsize n)
{
std::streamsize results = 0;
if ( pptr() == pbase()
|| write( pbase(), pptr() - pbase() ) != failed ) {
if ( write(p, n) != failed ) {
results = n;
}
}
setp( buffer, buffer + bufferSize - 1 );
return results;
}
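The fragments above leave buffer, bufferSize, write and failed undefined. As a self-contained sketch of the same overflow/sync pattern, the following uses a made-up ChunkSink that merely records each chunk it receives, standing in for the expensive I/O object from the question. ChunkStreambuf and all member names are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical sink: records every chunk it is handed.
struct ChunkSink {
    std::vector<std::string> chunks;
    bool write(const char* p, std::streamsize n) {
        chunks.push_back(std::string(p, static_cast<std::size_t>(n)));
        return true;
    }
};

class ChunkStreambuf : public std::streambuf {
    ChunkSink& mySink;
    std::vector<char> myBuffer;

    // Send [pbase(), pptr()) to the sink, then reset the put area. One slot
    // is held back so overflow() can always store its extra character.
    bool flushBuffer() {
        std::streamsize n = pptr() - pbase();
        bool ok = (n == 0) || mySink.write(pbase(), n);
        setp(&myBuffer[0], &myBuffer[0] + myBuffer.size() - 1);
        return ok;
    }

protected:
    virtual int_type overflow(int_type ch) {
        if (!traits_type::eq_int_type(ch, traits_type::eof())) {
            *pptr() = traits_type::to_char_type(ch);  // uses the spare slot
            pbump(1);
        }
        if (!flushBuffer()) return traits_type::eof();
        return traits_type::not_eof(ch);
    }
    virtual int sync() { return flushBuffer() ? 0 : -1; }

public:
    explicit ChunkStreambuf(ChunkSink& sink, std::size_t size = 64)
        : mySink(sink), myBuffer(size < 2 ? 2 : size) {
        setp(&myBuffer[0], &myBuffer[0] + myBuffer.size() - 1);
    }
    ~ChunkStreambuf() { flushBuffer(); }
};
```

With this in place, `os << 3.5f << ' ' << 42 << std::flush` hands the sink a single chunk. If one chunk per inserted value is what is wanted, setting std::unitbuf on the stream makes it call sync() after every insertion, which may be a simpler route than trying to detect the last overflow call.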