While troubleshooting some performance problems in our apps, I found out that C's stdio.h functions (and, at least for our vendor, C++'s fstream classes) are threadsafe. As a result, every time I do something as simple as fgetc, the RTL has to acquire a lock, read a byte, and release the lock.
This is not good for performance.
What's the best way to get non-threadsafe file I/O in C and C++, so that I can manage locking myself and get better performance?
MSVC provides _fputc_nolock, and GCC's runtime provides the unlocked_stdio functions (getc_unlocked and friends) as well as flockfile, but I can't find any similar functions in my compiler (CodeGear C++Builder).
I could use the raw Windows API, but that's not portable and I assume would be slower than an unlocked fgetc for character-at-a-time I/O.
I could switch to something like the Apache Portable Runtime, but that could potentially be a lot of work.
How do others approach this?
Edit: Since a few people wondered, I had tested this before posting. fgetc doesn't make system calls if it can satisfy reads from its buffer, but it does still lock, so locking ends up taking an enormous percentage of the time (hundreds of lock acquire/release pairs for a single block of data read from disk). Not doing character-at-a-time I/O would be a solution, but C++Builder's fstream classes unfortunately use fgetc (so if I want to use the iostream classes, I'm stuck with it), and I have a lot of legacy code that uses fgetc and friends to read fields out of record-style files (which would be reasonable if it weren't for the locking issue).
I'd simply avoid doing I/O a character at a time wherever that is sensible performance-wise.
fgetc is almost certainly not reading a byte each time you call it (where by 'reading' I mean invoking a system call to perform I/O). Look somewhere else for your performance bottleneck, as this is probably not the problem, and using unsafe functions is certainly not the solution. Any lock handling you do will probably be less efficient than the handling done by the standard routines.
The easiest way would be to read the entire file in memory, and then provide your own fgetc-like interface to that buffer.
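As a rough sketch of that idea (my own illustration, not tested with C++Builder): slurp the file once with ordinary stdio, then hand out characters from the in-memory buffer with no locking at all.
#include <cstdio>
#include <cstddef>
#include <vector>

// Hypothetical helper: reads the whole file once, then serves bytes
// from memory with no locking at all.
class MemFile {
public:
    explicit MemFile(const char* path) : pos_(0) {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return;
        std::fseek(f, 0, SEEK_END);
        long size = std::ftell(f);
        std::fseek(f, 0, SEEK_SET);
        if (size > 0) {
            buf_.resize(static_cast<std::size_t>(size));
            std::fread(&buf_[0], 1, buf_.size(), f);
        }
        std::fclose(f);
    }
    // fgetc-like interface: returns EOF at the end of the buffer.
    int getc() {
        if (pos_ >= buf_.size()) return EOF;
        return static_cast<unsigned char>(buf_[pos_++]);
    }
private:
    std::vector<char> buf_;
    std::size_t pos_;
};

// Usage: MemFile in("records.dat"); int c; while ((c = in.getc()) != EOF) { /* ... */ }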
Why not just memory-map the file? Memory mapping is available on every major platform (although Windows Vista makes you jump through some extra hoops to use it now). Map your file into memory, and do your own locking (or no locking) on the resulting memory region.
The OS handles all the locking required to actually read from the disk - you'll never be able to eliminate that overhead. But your processing, on the other hand, won't be affected by any extraneous locking other than what you do yourself.
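Since the question is about Windows, a minimal sketch of the Win32 route might look like this (my example; error handling omitted and the file name is a placeholder):
#include <windows.h>

// Sketch: map a file read-only and walk it byte by byte,
// with no per-character locking.
void process_mapped_file(const char* path)
{
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    const char* base = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    DWORD size = GetFileSize(file, NULL);

    for (DWORD i = 0; i < size; ++i) {
        char c = base[i];   // "fgetc" is now just an array access
        (void)c;            // ... process c ...
    }

    UnmapViewOfFile(base);
    CloseHandle(mapping);
    CloseHandle(file);
}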
The multi-platform approach is pretty simple: avoid the functions and operators for which the standard specifies a sentry. sentry is an inner class of the iostream classes that ensures stream consistency, and in a multi-threaded environment it locks the stream's mutex for each character being output. This avoids race conditions at a low level, but it still makes the output unreadable, since strings from two threads might be interleaved, as the following example shows:
thread 1 should write: abc
thread 2 should write: def
The output might look like adebcf instead of abcdef or defabc, because the sentry is implemented to lock and unlock per character.
The standard requires this for all functions and operators dealing with istream or ostream. The only way to avoid it is to use stream buffers and your own locking (per string, for example).
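To illustrate the per-string locking idea (this sketch is mine and separate from the test program below): format the whole line into a thread-private stringstream first, then push it to the shared stream buffer under a single lock.
#include <ostream>
#include <sstream>
#include <string>
#include <boost/thread/mutex.hpp>

boost::mutex stream_mutex;  // guards the shared output stream

void write_line(std::ostream& os, const std::string& text)
{
    std::ostringstream line;        // thread-private formatting, no contention
    line << text << '\n';
    std::string s = line.str();

    boost::mutex::scoped_lock lock(stream_mutex);   // one lock per string...
    os.rdbuf()->sputn(s.data(), static_cast<std::streamsize>(s.size()));
}   // ...instead of one per character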
I have written an app which outputs some data to a file and measures the speed. If you add a function here that outputs using the fstream directly, without the explicit buffer and flush, you will see the speed difference. It uses Boost, but I hope that is not a problem for you. Try removing all the stream buffers and see the difference with and without them. In my case the performance penalty was roughly a factor of 2-3.
The following article by N. Myers explains how locales and sentry work in C++ IOStreams. And you should certainly look up in the ISO C++ standard document which functions use a sentry.
Good Luck,
Ovanes
#include <vector>
#include <fstream>
#include <iterator>
#include <algorithm>
#include <iostream>
#include <cassert>
#include <cstdlib>
#include <boost/progress.hpp>
#include <boost/shared_ptr.hpp>

double do_copy_via_streambuf()
{
    const size_t len = 1024*2048;
    const size_t factor = 5;
    ::std::vector<char> data(len, 1);
    std::vector<char> buffer(len*factor, 0);
    ::std::ofstream
        ofs("test.dat", ::std::ios_base::binary|::std::ios_base::out);
    noskipws(ofs);
    std::streambuf* rdbuf = ofs.rdbuf()->pubsetbuf(&buffer[0], buffer.size());
    ::std::ostreambuf_iterator<char> oi(rdbuf);
    boost::progress_timer pt;
    for(size_t i=1; i<=250; ++i)
    {
        ::std::copy(data.begin(), data.end(), oi);
        if(0==i%factor)
            rdbuf->pubsync();
    }
    ofs.flush();
    double rate = 500 / pt.elapsed();
    std::cout << rate << std::endl;
    return rate;
}
void count_average(const char* op_name, double (*fct)())
{
    double av_rate = 0;
    const size_t repeat = 1;
    std::cout << "doing " << op_name << std::endl;
    for(size_t i=0; i<repeat; ++i)
        av_rate += fct();
    std::cout << "average rate for " << op_name << ": " << av_rate/repeat
              << "\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n"
              << std::endl;
}

int main()
{
    count_average("copy via streambuf iterator", do_copy_via_streambuf);
    return 0;
}
One thing to consider is to build a custom runtime. Most compilers provide the source to the runtime library (I'd be surprised if it weren't in the C++ Builder package).
This could end up being a lot of work, but maybe they've localized the thread support to make something like this easy. For example, with the embedded system compiler I'm using, it's designed for this - they have documented hooks to add the lock routines. However, it's possible that this could be a maintenance headache, even if it turns out to be relatively easy initially.
Another similar route would be to talk to someone like Dinkumware about using a 3rd party runtime that provides the capabilities you need.
Related
Is there a way to force MPI to always block on send? This might be useful when looking for deadlocks in a distributed algorithm which otherwise depends on the buffering MPI might choose to do on send.
For example, the following program (run with 2 processes) works without problems on my machine:
// C++
#include <iostream>
#include <thread>

// Boost
#include <boost/mpi.hpp>

namespace mpi = boost::mpi;

int main() {
    using namespace std::chrono_literals;
    mpi::environment env;
    mpi::communicator world;

    auto me = world.rank();
    auto other = 1 - me;

    char buffer[10] = {0};

    while (true) {
        world.send(other, 0, buffer);
        world.recv(other, 0, buffer);

        std::cout << "Node " << me << " received" << std::endl;
        std::this_thread::sleep_for(200ms);
    }
}
But if I change the size of the buffer to 10000 it blocks indefinitely.
For pure MPI codes, what you describe is exactly what MPI_Ssend() gives you. However, here you are not using pure MPI; you are using boost::mpi, and unfortunately, according to boost::mpi's documentation, MPI_Ssend() isn't supported.
That said, maybe boost::mpi offers another way, but I doubt it.
If you want blocking behavior, use MPI_Ssend. It will block until a matching receive has been posted, without buffering the request. The amount of buffering provided by MPI_Send is (intentionally) implementation specific. The behavior you get for a buffer of 10000 may differ when trying a different implementation.
I don't know whether you can actually tweak the buffering configuration, and I wouldn't try, because it would not be portable. Instead, I'd use the MPI_Ssend variant in a debug configuration and the default MPI_Send when the best performance is needed.
(disclaimer: I'm not familiar with boost's implementation, but MPI is a standard. Also, I saw Gilles comment after posting this answer...)
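For reference, a minimal sketch of what that looks like with the plain MPI C API (my own example; boost::mpi itself does not expose MPI_Ssend). Note that the original send-then-receive pattern on both ranks would deadlock immediately under MPI_Ssend, which is precisely the kind of latent deadlock the question wants to surface, so the sketch orders the calls to avoid it:
#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int me = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    int other = 1 - me;
    char buffer[10000] = {0};

    // MPI_Ssend completes only once the matching receive has started,
    // regardless of message size or eager limits.
    if (me == 0) {
        MPI_Ssend(buffer, sizeof buffer, MPI_CHAR, other, 0, MPI_COMM_WORLD);
        MPI_Recv(buffer, sizeof buffer, MPI_CHAR, other, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(buffer, sizeof buffer, MPI_CHAR, other, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Ssend(buffer, sizeof buffer, MPI_CHAR, other, 0, MPI_COMM_WORLD);
    }
    std::printf("Node %d received\n", me);

    MPI_Finalize();
    return 0;
}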
You can consider tuning the eager limit value (http://blogs.cisco.com/performance/what-is-an-mpi-eager-limit) to force that send operation to block on any message size. The way to establish the eager limit, depends on the MPI implementation. On Intel MPI you can use the I_MPI_EAGER_THRESHOLD environment variable (see https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/Reference_Manual/Communication_Fabrics_Control.htm), for instance.
I have to write a program for school that calculates current, voltage, and efficiency. I have almost finished the program, but now I want to write the results to a logfile. I have already read some threads, but they didn't really help.
Here is the part that I want to write to a logfile:
cout<<"Die spannung U1 betraegt"<<U1<<"Ohm."<<endl;
I would really appreciate help. Thanks.
Simply using File I/O in C++ locally should solve your issue:
#include <fstream>
//...
ofstream fout("logfile.txt");
if (fout) {
    fout << "Die spannung U1 betraegt" << U1 << "Ohm." << endl;
    fout.close();
}
However, logging can become very cumbersome, so people have come up with all kinds of solutions for loggers. I found this article on logfiles (In context of the Singleton design pattern) to be very useful.
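As a rough illustration of that singleton-logger idea (the names are mine, not taken from the linked article):
#include <fstream>
#include <string>

class Logger {
public:
    static Logger& instance() {
        static Logger log("logfile.txt");   // created on first use
        return log;
    }
    void write(const std::string& line) {
        out_ << line << '\n';
        out_.flush();                       // keep the log current on disk
    }
private:
    explicit Logger(const char* path) : out_(path, std::ios::app) {}
    Logger(const Logger&);                  // non-copyable (pre-C++11 style)
    Logger& operator=(const Logger&);
    std::ofstream out_;
};

// Usage: Logger::instance().write("Die Spannung U1 betraegt 12 Ohm.");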
I would recommend using FILE and fprintf.
http://pic.dhe.ibm.com/infocenter/tpfhelp/current/index.jsp?topic=%2Fcom.ibm.ztpf-ztpfdf.doc_put.cur%2Fgtpc2%2Fcpp_fprintf-printf-sprintf.html
Remember: if you have threads, you need to protect the FILE object yourself; don't forget to fflush() once the content is meaningful, and to fclose() when you're done.
There are other ways to do it - I personally like the bare-bones approach the most.
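A bare-bones sketch of that approach (the file name and values are placeholders):
#include <cstdio>

int main()
{
    double U1 = 12.0;                        // placeholder value
    std::FILE* log = std::fopen("logfile.txt", "a");
    if (log) {
        std::fprintf(log, "Die Spannung U1 betraegt %f Ohm.\n", U1);
        std::fflush(log);                    // flush once the entry is complete
        std::fclose(log);
    }
    return 0;
}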
Is there a canonical / public / freely available variant of std::stringstream where I don't pay for a full string copy each time I call str()? (Possibly by providing a direct c_str() member in the ostream class?)
I've found two questions here:
C++ stl stringstream direct buffer access (yeah, it's basically the same title, but note that its accepted answer doesn't fit this question at all).
Stream from std::string without making a copy? (Again, the accepted answer doesn't match this question.)
And "of course" the deprecated std::strstream class does allow direct buffer access, although its interface is really quirky (apart from it being deprecated).
It also seems one can find several code samples that do explain how one can customize std::streambuf to allow direct access to the buffer -- I haven't tried it in practice, but it seems quite easily implemented.
My question here is really two fold:
Is there any deeper reason why std::[o]stringstream (or, rather, basic_stringbuf) does not allow direct buffer access, but only access through an (expensive) copy of the whole buffer?
Given that it seems easy, but not trivial, to implement this, is there any variant available via boost or other sources that packages this functionality?
Note: The performance hit of the copy that str() makes is very measurable(*), so it seems weird to have to pay for this when the use cases I have seen so far really never need a copy returned from the stringstream. (And if I'd need a copy I could always make it at the "client side".)
(*): With our platform (VS 2005), the results I measure in the release version are:
// tested in a tight loop:
// variant stream: run time : 100%
std::stringstream msg;
msg << "Error " << GetDetailedErrorMsg() << " while testing!";
DoLogErrorMsg(msg.str().c_str());
// variant string: run time: *** 60% ***
std::string msg;
((msg += "Error ") += GetDetailedErrorMsg()) += " while testing!";
DoLogErrorMsg(msg.c_str());
So using a std::string with += (which obviously only works when I don't need custom/number formatting) is 40% faster than the stream version, and as far as I can tell this is only due to the completely superfluous copy that str() makes.
I will try to provide an answer to my first bullet:
Is there any deeper reason why std::ostringstream does not allow direct buffer access
Looking at how a streambuf / stringbuf is defined, we can see that the buffer character sequence is not NULL terminated.
As far as I can see, a (hypothetical) const char* std::ostringstream::c_str() const; function, providing direct read-only buffer access, can only make sense when the valid buffer range would always be NULL terminated -- i.e. (I think) when sputc would always make sure that it inserts a terminating NULL after the character it inserts.
I wouldn't think that this is a technical hindrance per se, but given the complexity of the basic_streambuf interface, I'm not at all sure it would be correct in all cases.
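To make the "customize the streambuf" route from the question a bit more concrete, here is an untested sketch of my own: a derived stringbuf that exposes its put area directly, so the caller can read the characters written so far without the copy that str() makes (the range is not NUL-terminated, which is exactly the limitation discussed above).
#include <cstddef>
#include <iostream>
#include <sstream>

struct peek_stringbuf : std::stringbuf {
    // Valid only until the next insertion; [pbase(), pptr()) holds the
    // characters written so far (sequential writes only, no seeking).
    const char* data() const { return pbase(); }
    std::size_t size() const { return static_cast<std::size_t>(pptr() - pbase()); }
};

int main()
{
    peek_stringbuf buf;
    std::ostream os(&buf);
    os << "Error " << 42 << " while testing!";
    std::cout.write(buf.data(), static_cast<std::streamsize>(buf.size()));
    std::cout << std::endl;
    return 0;
}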
As for the second bullet
Given that it seems easy, but not trivial, to implement this, is there any variant available via boost or other sources that packages this functionality?
There is Boost.Iostreams and it even contains an example of how to implement an (o)stream Sink with a string.
I came up with a little test implementation to measure it:
#include <string>
#include <boost/iostreams/stream.hpp>
#include <libs/iostreams/example/container_device.hpp> // container_sink
namespace io = boost::iostreams;
namespace ex = boost::iostreams::example;
typedef ex::container_sink<std::wstring> wstring_sink;
struct my_boost_ostr : public io::stream<wstring_sink> {
    typedef io::stream<wstring_sink> BaseT;
    std::wstring result;

    my_boost_ostr() : BaseT(result)
    { }

    // Note: This is non-const for flush.
    // Suboptimal, but OK for this test.
    const wchar_t* c_str() {
        flush();
        return result.c_str();
    }
};
In the tests I did, using this with its c_str() helper ran slightly faster than a normal ostringstream with its copying str().c_str() version.
I do not include the measuring code here. Performance in this area is very brittle; make sure to measure your own use case! (For example, the constructor overhead of a string stream is non-negligible.)
I'm working with some existing code which is deserializing objects stored in text files (I potentially need to read tens of millions of these). The contents of the file are first read into a wstring and then it makes a wistringstream from that. Running the Very Sleepy profiler on the program shows that it is spending about 20% of its time in the following call stacks:
Mtxlock or RtlEnterCriticalSection
std::_Mutex::_Lock
std::flush
std::basic_istream<wchar_t, std::char_traits<wchar_t> >::get
<rest of my program>
and similar ones with std::_Mutex::_Unlock. I'm using Visual C++ 2008.
Looking in istream, I see that it constructs a sentry object which calls the _Lock and _Unlock methods on the underlying basic_streambuf. These in turn just call _Lock and _Unlock on a _Mutex associated with that buffer, which are defined as follows:
#if _MULTI_THREAD
    // actually defines non-empty _Lock() and _Unlock() methods
#else /* _MULTI_THREAD */
    void _Lock()
    {   // do nothing
    }

    void _Unlock()
    {   // do nothing
    }
#endif /* _MULTI_THREAD */
It looks like _MULTI_THREAD is set in yvals.h as
#define _MULTI_THREAD 1 /* nontrivial locks if multithreaded */
Now, I know there will never be another thread trying to access this buffer, but it looks to me like there's no way around this locking while using the standard iostreams, which seems both odd and frustrating. Am I missing something? Is there a workaround for this?
Check the value for Runtime Library in Project properties, C/C++, Code Generation. If it's multi-threaded, change it to a non-multithreaded version.
In any version after Visual C++ 7.1 (!), you are out of luck as it's been removed, and you are stuck with the multithreaded CRT.
The std::flush seems senseless in your case. I can't see how you'd flush an istream, so I suspect it's the result of a tie. You may want to untie, i.e. call tie(NULL) on your wistringstream. That should also reduce the number of locks taken.
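For what it's worth, the un-tie call suggested here is a one-liner (a sketch, not verified against VC++ 2008; 'contents' stands for the wstring the file was read into):
std::wistringstream in(contents);
in.tie(NULL);   // no tied ostream to flush before each input operation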
It turned out that accessing the underlying buffer directly, by replacing things like
c = _text_in->get();
with things like this
c = _text_in->rdbuf()->sbumpc();
fixed the problem and provided a big boost to performance.
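In loop form, the streambuf-level read looks roughly like this (my sketch, with end-of-stream handled via the traits class):
typedef std::wistringstream::traits_type traits;
std::wstreambuf* buf = _text_in->rdbuf();   // bypasses the sentry and its lock
for (traits::int_type c = buf->sbumpc(); c != traits::eof(); c = buf->sbumpc()) {
    wchar_t ch = traits::to_char_type(c);
    /* ... process ch as before ... */
}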
When trying to come up with an answer to this question, I wrote this little test-program:
#include <iostream>
#include <fstream>
#include <vector>
#include <iterator>
#include <algorithm>
void writeFile() {
    int data[] = {0,1,2,3,4,5,6,7,8,9,1000};
    std::basic_ofstream<int> file("test.data", std::ios::binary);
    std::copy(data, data+11, std::ostreambuf_iterator<int>(file));
}

void readFile() {
    std::basic_ifstream<int> file("test.data", std::ios::binary);
    std::vector<int> data(std::istreambuf_iterator<int>(file),
                          (std::istreambuf_iterator<int>()));
    std::copy(data.begin(), data.end(),
              std::ostream_iterator<int>(std::cout, " "));
    std::cout << std::endl;
}

int main()
{
    writeFile();
    readFile();
    return 0;
}
It works as expected, writing the data to the file, and after reading the file, it correctly prints:
0 1 2 3 4 5 6 7 8 9 1000
However, I am not sure whether there are any pitfalls (endianness issues aside; you always have those when dealing with binary data). Is this allowed?
It works as expected.
I'm not sure what you are expecting...
Is this allowed?
That's probably not portable. Streams rely on char_traits and on facets which are defined in the standard only for char and wchar_t. An implementation can provide more, but my bet would be that you are relying on a minimal default implementation of those templates rather than on a conscious implementation for int. I wouldn't be surprised if more in-depth use led to problems.
Instantiating any of the iostream classes, or basic_string, on anything but char or wchar_t, without providing a specific custom traits class, is undefined behavior; most of the libraries I've seen do define it to do something, but that definition often isn't specified, and is different between VC++ and g++ (the two cases I've looked at). If you define and use your own traits class, some of the functionality should work.
For just about all of the formatted inserters and extractors (the << and >> operators), istream and ostream delegate to various facets in the locale; if any of these are used, you'll have to take steps to ensure that these work as well. (This usually means providing a new numpunct facet.)
Even if you only use the streambuf (as in your example), filebuf uses the codecvt facet. And an implementation isn't required to provide a codecvt, and if it does, can do pretty much whatever it wants in it. And since filebuf always writes and reads char to and from the file, this translation must do something. I'm actually rather surprised that your code worked, because of this. But you still don't know what was actually on the disk, which means you can't document it, which means that you won't be able to read it sometime in the future.
If your goal is to write binary data, your first step should be to define the binary format, then write read and write functions which implement it. Possibly using the iostream << and >> syntax, and probably using a basic_streambuf<char> for the actual input and output; a basic_streambuf<char> that you've carefully imbued with the "C" locale. Or rather than define your own binary format, just use an existing one, like XDR. (All of this paragraph supposes that you want to keep the data, and read it later. If these are just temporary files, for spilling temporary internal data to disk during a single run, and will be deleted at the end of the program execution, simpler solutions are valid.)
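To make that last suggestion concrete, here is a small sketch (mine, not from the answer) that defines the on-disk format explicitly - each int stored as four little-endian bytes - and reads and writes it through an ordinary char-based stream buffer:
#include <fstream>
#include <iostream>

// Fixed, documented format: each value is four little-endian bytes,
// written and read through the plain char streambuf of an ofstream/ifstream.
void write_u32(std::streambuf& sb, unsigned long v)
{
    for (int i = 0; i < 4; ++i)
        sb.sputc(static_cast<char>((v >> (8 * i)) & 0xFF));
}

unsigned long read_u32(std::streambuf& sb)
{
    unsigned long v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<unsigned long>(
                 static_cast<unsigned char>(sb.sbumpc())) << (8 * i);
    return v;
}

int main()
{
    const int data[] = {0,1,2,3,4,5,6,7,8,9,1000};
    {
        std::ofstream out("test.data", std::ios::binary);
        for (int i = 0; i < 11; ++i)
            write_u32(*out.rdbuf(), static_cast<unsigned long>(data[i]));
    }
    std::ifstream in("test.data", std::ios::binary);
    for (int i = 0; i < 11; ++i)
        std::cout << read_u32(*in.rdbuf()) << ' ';
    std::cout << std::endl;
    return 0;
}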