Limit filesize of a filestream? - c++

My application currently logs in a very simple way:
void Log::create( const std::string& path, bool append )
{
if(append)
m_log.open(path.c_str(),std::ios_base::out | std::ios_base::app);
else
m_log.open(path.c_str(),std::ios_base::out | std::ios_base::trunc);
}
std::ofstream& Log::get()
{
return m_log;
}
void Log::write( const std::string& what )
{
get() << "[" << TimeOfDay::getDate() << "] ";
get() << what << std::endl;
}
void Log::write( const std::string& where, const std::string& what )
{
get() << "[" << TimeOfDay::getDate() << "] ";
get() << "[" << where << "] " << what << std::endl;
}
std::ofstream& Log::write()
{
get() << "[" << TimeOfDay::getDate() << "] ";
return get();
}
std::ofstream Log::m_log;
This application runs on a server. Now, if the log exceeds a certain file size, I want to stop logging.
Is there a way to do this without boost or other libraries?
Thanks

You can create a filtering stream buffer which is set up to write a file but stops writing if it has written more than a specified amount of data. Something like this:
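#include <algorithm> // std::min
#include <cstddef>   // size_t
#include <fstream>
#include <streambuf>
class limitbuf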
: public std::streambuf {
std::streambuf* sbuf;
size_t size;
size_t limit;
char buffer[1024];
public:
limitbuf(std::streambuf* sbuf, size_t limit)
: sbuf(sbuf), size(0), limit(limit) // same order as the member declarations
{
this->setp(buffer, buffer + 1023);
}
int overflow(int c) {
if (c != std::char_traits<char>::eof()) {
*this->pptr() = std::char_traits<char>::to_char_type(c); // store into the spare slot
this->pbump(1);
}
return this->sync() == 0
? std::char_traits<char>::not_eof(c)
: std::char_traits<char>::eof();
}
int sync() {
if (this->size < limit) {
this->size += this->sbuf->sputn(this->pbase(),
std::min(
size_t(this->pptr() - this->pbase()),
this->limit - this->size)
);
this->sbuf->pubsync();
}
this->setp(this->pbase(), this->epptr());
return 0;
}
};
Just install this stream buffer as a filter on your log file and it will limit the output to the chosen size:
std::ofstream out("some.log");
limitbuf sbuf(out.rdbuf(), 16384);
std::ostream log(&sbuf);
The basic idea of this stream buffer is fairly simple: data is buffered internally and written upon a buffer overflow or upon a flush:
When the buffer set up with setp() overflows, the stream calls overflow(c) with the next character to be written (or, potentially, with std::char_traits<char>::eof()). Since the stream buffer was told about a buffer one character smaller than actually available, the overflowing character is added to the buffer and the overall buffer is flushed.
When the buffer is flushed (e.g. by using std::endl on the std::ostream writing to this buffer) the function sync() ends up being called. Its job is to write the characters currently buffered. The code simply checks whether there is still space for anything to be written and writes as many characters as still fit. The size member tracks how many characters have been written and limit indicates how much data may be written in total.
If the stream buffer should do more than just limit the output, it may be necessary to modify the logic of what is happening if there is no more space. For example, if there are left-over characters which can't be written the stream buffer could decide to open a new file (and possibly move other files).
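As a point of comparison, here is a minimal unbuffered sketch of the same filtering idea. It is not the code from the answer above: the names capped_buf, full() and capped_demo() are made up for illustration, and a std::ostringstream stands in for the log file so the effect of the limit can be inspected directly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical filter: forwards at most `limit` characters to `dest`
// and silently drops everything after that.
class capped_buf : public std::streambuf {
    std::streambuf* dest_;
    std::size_t written_;
    std::size_t limit_;
public:
    capped_buf(std::streambuf* dest, std::size_t limit)
        : dest_(dest), written_(0), limit_(limit) {}
    // Hook for "what to do when the limit is reached" (e.g. rotate files).
    bool full() const { return written_ >= limit_; }
protected:
    // Called for every single character, because no put area is set up.
    int_type overflow(int_type c) override {
        if (traits_type::eq_int_type(c, traits_type::eof()))
            return traits_type::not_eof(c);
        if (written_ < limit_) {
            int_type r = dest_->sputc(traits_type::to_char_type(c));
            if (traits_type::eq_int_type(r, traits_type::eof()))
                return traits_type::eof();
            ++written_;
        }
        return c; // report success even when the character was dropped
    }
    // Called for block writes such as operator<< on a string.
    std::streamsize xsputn(const char* s, std::streamsize n) override {
        std::size_t room = limit_ - written_; // written_ never exceeds limit_
        std::size_t take = std::min(static_cast<std::size_t>(n), room);
        written_ += static_cast<std::size_t>(
            dest_->sputn(s, static_cast<std::streamsize>(take)));
        return n; // pretend everything was consumed so the stream stays good
    }
    int sync() override { return dest_->pubsync(); }
};

// Writes 17 characters through a 10-character cap and returns what arrived.
std::string capped_demo() {
    std::ostringstream sink;
    capped_buf buf(sink.rdbuf(), 10);
    std::ostream log(&buf);
    log << "0123456789ABCDEF" << std::endl;
    return sink.str();
}
```

Because the filter keeps reporting success after the limit, the stream never enters a failed state; a caller that wants to react (for example by opening a new file) could poll full() after each write.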

Related

Unexpected results from ios_base::xalloc() and ostream::iword() (iomanip)

My goal is to have an iomanip inserter with parameter that can be used to determine if a message will be printed to the stream (or not.) The idea is that a static mask will include bits set for the categories of messages that should be streamed (and bits cleared for messages to be discarded.) The inserter will be used to specify what category (or categories) a message belongs to and if the mask anded with the presented categories is not zero, the message would be streamed out. I have this working but with file scope mask and categories. It seems to me that (at least the category) could be stored with the stream using xalloc() to provide an index and iword() to store/retrieve values at that index but that seems not to be working for me. I have read various Internet references for these functions and my expectation is that sequential calls to xalloc() should return increasing values. In the code below the value returned is always 4. My second puzzlement is where the storage for the iword() backing store is held. Is this static for ostream? Part of every ostream object?
Code follows
#include <iostream>
#include <sstream>
// from http://stackoverflow.com/questions/2212776/overload-handling-of-stdendl
//
// g++ -o blah blah.cpp
//
// Adding an iomanip with argument as in
// http://stackoverflow.com/questions/20792101/how-to-store-formatting-settings-with-an-iostream
//
using namespace std;
// don't really want file scope variables... Can these be stored in stream?
static int pri=0; // value for a message
static int mask=1; // mask for enabled output (if pri&mask => output)
static int priIDX() { // find index for storing priority choice
static int rc = ios_base::xalloc();
return rc;
}
class setPri // Store priority in stream (but how to retrieve when needed?)
{
size_t _n;
public:
explicit setPri(size_t n): _n(n) {}
size_t getn() const {return _n;}
friend ostream& operator<<(ostream& os, const setPri& obj)
{
size_t n = obj.getn();
int ix = priIDX();
pri = os.iword(ix) = n; // save in stream (?) and to file scope variable
os << "setPri(" << n << ") ix:" << ix << " "; // indicate update
return os;
}
};
class MyStream: public ostream
{
// Write a stream buffer that discards if mask & pri not zero
class MyStreamBuf: public stringbuf
{
ostream& output;
public:
MyStreamBuf(ostream& str)
:output(str)
{}
// When we sync the stream with the output.
// 1) report priority mask (temporary)
// 2) Write output if same bit set in mask and priority
// 3) flush the actual output stream we are using.
virtual int sync ( )
{
int ix = priIDX();
int myPri(output.iword(ix));
output << "ix:" << ix << " myPri:" << myPri << '\n';
if( mask & pri) // can't use (myPri&mask)
output << ' ' << str();
str("");
output.flush();
return 0;
}
};
// My Stream just uses a version of my special buffer
MyStreamBuf buffer;
public:
MyStream(ostream& str)
:buffer(str)
{
rdbuf(&buffer);
}
};
int main()
{
MyStream myStream(cout);
myStream << setPri(1) << " this should output" << endl;
myStream << setPri(2) << " this should not output" << endl;
myStream << setPri(3) << " this should also output" << endl;
}
Note that in sync() the code tries to fetch the value from the stream but the returned value is always 0 as if it was not set to begin with.
In my searches to get to this point I have seen comments that it is not a good idea to subclass an ostream. Feel free to suggest a better alternative! (That I can understand. ;) )
Thanks!
static int priIDX() { // find index for storing priority choice
static int rc = ios_base::xalloc();
return rc;
}
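This will always return the same value because rc is a static local: it is initialized only once, on the first call, and every later call returns that cached index. Calling ios_base::xalloc() directly each time does return increasing values.
The storage for the iword data is dynamic and allocated separately by each stream object whenever something is stored there.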
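The per-stream nature of that storage can be checked with a small sketch. The helper names below (slot, read_from_other_stream, and so on) are invented for the example. It also suggests why sync() in the question sees 0: the manipulator appears to store the value in myStream, while sync() reads it back from output, which is a different stream object (std::cout).

```cpp
#include <cassert>
#include <ios>
#include <sstream>

// One index shared by all streams; the static is initialized exactly once.
int slot() {
    static const int idx = std::ios_base::xalloc();
    return idx;
}

// Store on stream a, read from stream b: b has its own (zeroed) storage.
long read_from_other_stream() {
    std::ostringstream a, b;
    a.iword(slot()) = 3;
    return b.iword(slot());
}

// Store and read on the same stream: the value is found again.
long read_from_same_stream() {
    std::ostringstream a;
    a.iword(slot()) = 3;
    return a.iword(slot());
}

// Two direct calls to xalloc() hand out two different indices.
bool indices_differ() {
    int first = std::ios_base::xalloc();
    int second = std::ios_base::xalloc();
    return first != second;
}
```

So the index is global, but the value stored under it lives in each stream object separately.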

How to speed up counting the occurences of a word in large files?

I need to count the occurrences of the string "<page>" in a 104gb file, for getting the number of articles in a given Wikipedia dump. First, I've tried this.
grep -F '<page>' enwiki-20141208-pages-meta-current.xml | uniq -c
However, grep crashes after a while. Therefore, I wrote the following program. However, it only processes 20mb/s of the input file on my machine which is about 5% workload of my HDD. How can I speed up this code?
#include <iostream>
#include <fstream>
#include <string>
int main()
{
// Open up file
std::ifstream in("enwiki-20141208-pages-meta-current.xml");
if (!in.is_open()) {
std::cout << "Could not open file." << std::endl;
return 0;
}
// Statistics counters
size_t chars = 0, pages = 0;
// Token to look for
const std::string token = "<page>";
size_t token_length = token.length();
// Read one char at a time
size_t matching = 0;
while (in.good()) {
// Read one char at a time
char current;
in.read(&current, 1);
if (in.eof())
break;
chars++;
// Continue matching the token
if (current == token[matching]) {
matching++;
// Reached full token
if (matching == token_length) {
pages++;
matching = 0;
// Print progress
if (pages % 1000 == 0) {
std::cout << pages << " pages, ";
std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
}
}
}
// Start over again
else {
matching = 0;
}
}
// Print result
std::cout << "Overall pages: " << pages << std::endl;
// Cleanup
in.close();
return 0;
}
Assuming there are no insanely large lines in the file using something like
for (std::string line; std::getline(in, line); ) {
// find the number of "<page>" strings in line
}
is bound to be a lot faster! Reading each character as a string of one character is about the worst thing you can possibly do. It is really hard to get any slower. For each character, the stream will do something like this:
Check if there is a tie()ed stream which needs flushing (there isn't, i.e., that's pointless).
Check if the stream is in good shape (except when having reached the end it is but this check can't be omitted entirely).
Call xsgetn() on the stream's stream buffer.
This function first checks if there is another character in the buffer (that's similar to the eof check but different; in any case, doing the eof check only after the buffer was empty removes a lot of the eof checks)
Transfer the character to the read buffer.
Have the stream check if it reached all (1) characters and set stream flags as needed.
There is a lot of waste in there!
I can't really imagine why grep would fail except that some line blows massively over the expected maximum line length. Although the use of std::getline() and std::string() is likely to have a much bigger upper bound, it is still not effective to process huge lines. If the file may contain lines which are massive, it may be more reasonable to use something along the lines of this:
for (std::istreambuf_iterator<char> it(in), end;
(it = std::find(it, end, '<')) != end; ) {
// match "<page>" at the start of the sequence [it, end)
}
For a bad implementation of streams that's still doing too much. Good implementations will do the calls to std::find(...) very efficiently and will probably check multiple characters at once, adding a check and loop only for something like every 16th loop iteration. I'd expect the above code to turn your CPU-bound implementation into an I/O-bound implementation. A bad implementation may still be CPU-bound but it should still be a lot better.
In any case, remember to enable optimizations!
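One way to fill in the "match" step of the std::istreambuf_iterator loop is sketched below. The function names count_token and count_in_text are made up here, and the matching logic relies on an assumption that holds for "<page>": the first character of the token does not occur again later in the token, so a failed partial match never needs to back up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Counts occurrences of `token` in `in` using a single forward pass.
// Assumption: token[0] does not reappear in the rest of the token.
std::size_t count_token(std::istream& in, const std::string& token) {
    if (token.empty())
        return 0;
    std::size_t count = 0;
    std::istreambuf_iterator<char> it(in), end;
    while ((it = std::find(it, end, token[0])) != end) {
        ++it; // consume the first character of a candidate match
        std::size_t matched = 1;
        // Consume characters only while they keep matching; a mismatching
        // character stays in the stream so std::find can inspect it next.
        while (matched < token.size() && it != end && *it == token[matched]) {
            ++it;
            ++matched;
        }
        if (matched == token.size())
            ++count;
    }
    return count;
}

// Convenience wrapper for trying the function on an in-memory string.
std::size_t count_in_text(const std::string& text) {
    std::istringstream in(text);
    return count_token(in, "<page>");
}
```

With a file, the same function would be called as count_token(in, "<page>") on the std::ifstream from the question.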
I'm using this file to test with: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-current1.xml-p000000010p000010000.bz2
It takes roughly 2.4 seconds versus 11.5 using your code. The total character count is slightly different due to not counting newlines, but I assume that's acceptable since it's only used to display progress.
void parseByLine()
{
// Open up file
std::ifstream in("enwiki-latest-pages-meta-current1.xml-p000000010p000010000");
if(!in)
{
std::cout << "Could not open file." << std::endl;
return;
}
size_t chars = 0;
size_t pages = 0;
const std::string token = "<page>";
std::string line;
while(std::getline(in, line))
{
chars += line.size();
size_t pos = 0;
for(;;)
{
pos = line.find(token, pos);
if(pos == std::string::npos)
{
break;
}
pos += token.size();
if(++pages % 1000 == 0)
{
std::cout << pages << " pages, ";
std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
}
}
}
// Print result
std::cout << "Overall pages: " << pages << std::endl;
}
Here's an example that adds each line to a buffer and then processes the buffer when it reaches a threshold. It takes 2 seconds versus ~2.4 from the first version. I played with several different thresholds for the buffer size and also processing after a fixed number (16, 32, 64, 4096) of lines and it all seems about the same as long as there is some batching going on. Thanks to Dietmar for the idea.
int processBuffer(const std::string& buffer)
{
static const std::string token = "<page>";
int pages = 0;
size_t pos = 0;
for(;;)
{
pos = buffer.find(token, pos);
if(pos == std::string::npos)
{
break;
}
pos += token.size();
++pages;
}
return pages;
}
void parseByMB()
{
// Open up file
std::ifstream in("enwiki-latest-pages-meta-current1.xml-p000000010p000010000");
if(!in)
{
std::cout << "Could not open file." << std::endl;
return;
}
const size_t BUFFER_THRESHOLD = 16 * 1024 * 1024;
std::string buffer;
buffer.reserve(BUFFER_THRESHOLD);
size_t pages = 0;
size_t chars = 0;
size_t progressCount = 0;
std::string line;
while(std::getline(in, line))
{
buffer += line;
if(buffer.size() > BUFFER_THRESHOLD)
{
pages += processBuffer(buffer);
chars += buffer.size();
buffer.clear();
}
if((pages / 1000) > progressCount)
{
++progressCount;
std::cout << pages << " pages, ";
std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
}
}
if(!buffer.empty())
{
pages += processBuffer(buffer);
chars += buffer.size();
std::cout << pages << " pages, ";
std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
}
}

can I gain performance by controlling the buffer flush in std::ofstream?

I am writing my own logger. Hoping to gain some performance, I added a piece of code to control the number of flushes to std::ofstream. To do this, I used a temporary buffer of type std::stringstream. Log operations are written into this buffer first and then flushed into the std::ofstream at the right time (see void flushLog()):
#include<iostream>
#include<sstream>
#include<fstream>
class BasicLogger
{
std::stringstream out;
std::ofstream logFile;
typedef std::basic_ostream<char, std::char_traits<char> > CoutType;
typedef CoutType& (*StandardEndLine)(CoutType&);
public:
BasicLogger(std::string id_){
std::string path = id_ + ".txt";
if(path.size()){
logFile.open(path.c_str());
if ((logFile.is_open() && logFile.good())){
}
}
}
BasicLogger& operator<<(StandardEndLine manip) {
std::cout << "Blogger:call to cout type oprtor" << std::endl;
manip(out);
return *this;
}
template <typename T>
BasicLogger & operator<< (const T& val)
{
std::cout << "Blogger:call to oprtor" << std::endl;
out << val;
if(out.tellp() > 512000/*500KB*/){// by some googling this estimated hardcode value promises less cycles to write to a file
flushLog();
}
return *this;
}
void flushLog()
{
if ((logFile.is_open() && logFile.good()))
{
logFile << out.str();
logFile.flush();
out.str(std::string());
}
}
};
Having learned that std::ofstream already has its own buffer, I need to rethink whether manipulating the buffer was the right thing to do.
You need to consider that a log file shouldn't have too much write latency, as you need to be able to examine recent log entries in case of a failure.
Output buffering has diminishing returns as you increase the buffer size. Going from zero to 1 byte doubles your speed, as it cuts system calls in half; going from 1 to 4096 or 8192 is a major win as it corresponds to a disk block size; going from 8192 to 512k isn't likely to make much difference at all. I suggest you benchmark this before you over-commit yourself.
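Before adding a second buffer, it can help to see how often the stream is actually being flushed. The sketch below counts calls to sync() on a stream buffer; counting_buf and the two helper functions are names invented for the example, and a std::stringbuf stands in for the file buffer. It illustrates that std::endl forces one flush per entry, while '\n' leaves the flushing to the buffer until flush() is called explicitly.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>

// Hypothetical helper: a string buffer that counts how often it is synced.
class counting_buf : public std::stringbuf {
    int syncs_ = 0;
public:
    int syncs() const { return syncs_; }
protected:
    int sync() override {
        ++syncs_;
        return std::stringbuf::sync();
    }
};

// One std::endl per entry: every entry triggers a flush.
int flushes_with_endl(int lines) {
    counting_buf buf;
    std::ostream os(&buf);
    for (int i = 0; i < lines; ++i)
        os << "entry " << i << std::endl;
    return buf.syncs();
}

// '\n' per entry and a single explicit flush at the end.
int flushes_with_newline(int lines) {
    counting_buf buf;
    std::ostream os(&buf);
    for (int i = 0; i < lines; ++i)
        os << "entry " << i << '\n';
    os.flush();
    return buf.syncs();
}
```

For a logger this is the trade-off from the answer above: fewer flushes mean fewer system calls, but also more log entries that may be lost if the process dies before the next flush.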

how to get char pointer from ostringstream without a copy in c++

I have a ostringstream variable which contains some data.
I want to set a char * pointer to the data inside the ostringstream.
If I do the following:
std::ostringstream ofs;
.....
const char *stam = (ofs.str()).c_str();
There is a copy of the content of the string in ofs.
I want to get a pointer to that content without a copy.
Is there a way to do so?
This actually answers the question... took a while but I wanted to do it for the same reasons (efficiency vs portability is fine for my situation):
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>
class mybuf : public std::stringbuf {
public:
// expose the terribly named end/begin pointers
char *eback() {
return std::streambuf::eback();
}
char *pptr() {
return std::streambuf::pptr();
}
};
class myos : public std::ostringstream {
mybuf d_buf;
public:
myos() {
// replace buffer
std::basic_ostream<char>::rdbuf(&d_buf);
}
char *ptr();
};
char *myos::ptr() {
// assert contiguous
assert ( tellp() == (d_buf.pptr()-d_buf.eback()) );
return d_buf.eback();
}
int main() {
myos os;
os << "hello";
std::cout << "size: " << os.tellp() << std::endl;
std::string dat(os.ptr(),os.tellp());
std::cout << "data: " << dat << std::endl;
}
This points to, yet again, the deeper, underlying problem with the standard library - a confusion between contracts and "safety". When writing a messaging service, I need a library with efficient contracts... not safety. Other times, when writing a UI, I want strong safety - and care less about efficiency.
Although you can't get a pointer to the character buffer in the ostringstream, you can get access to its characters without copying them if you switch to using stringstream. A stringstream allows input and output (reading from and writing to the stream), whereas ostringstream allows only output (writing to the stream). Example:
std::stringstream ss;
ss << "This is a test.";
// Read stringstream from index 0. Use different values to look at any character index.
ss.seekg(0);
char ch;
while (ss.get(ch)) { // loop getting single characters
std::cout << ch;
}
ss.clear(); // Clear eof bit in case you want to read more from ss
This site has a pretty good overview of stringstreams and what you can do with them.
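If the goal is simply to format into memory and then read the characters through a pointer, another option is a stream buffer that writes into storage you own. This is a different technique from both answers above, sketched under the assumption that a fixed capacity is acceptable; fixed_buf and fixed_demo are hypothetical names, and the 256-byte capacity is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical buffer: the put area is a plain array owned by this object,
// so data() points straight at the formatted characters.
class fixed_buf : public std::streambuf {
    char data_[256];
public:
    fixed_buf() { setp(data_, data_ + sizeof data_); }
    const char* data() const { return pbase(); }
    std::size_t size() const {
        return static_cast<std::size_t>(pptr() - pbase());
    }
    // overflow() is not overridden: once the array is full, further
    // writes fail and the stream goes into a failed state.
};

// Formats into the array; the std::string is built only to show the result.
std::string fixed_demo() {
    fixed_buf buf;
    std::ostream os(&buf);
    os << "hello" << 42;
    return std::string(buf.data(), buf.size());
}
```

The characters are not null-terminated, so the pointer has to be used together with size().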

How to easily indent output to ofstream?

Is there an easy way to indent the output going to an ofstream object? I have a C++ character array that is null-terminated and includes newlines. I'd like to output this to the stream but indent each line with two spaces. Is there an easy way to do this with stream manipulators, the way you can change the base for integer output with special directives to the stream, or do I have to process the array manually and insert the extra spaces at each line break detected?
Seems like the std::right manipulator is close:
http://www.cplusplus.com/reference/iostream/manipulators/right/
Thanks.
-William
This is the perfect situation to use a facet.
A custom version of the codecvt facet can be imbued onto a stream.
So your usage would look like this:
int main()
{
/* Imbue std::cout before it is used */
std::ios::sync_with_stdio(false);
std::locale indentLocale(std::locale::classic(), new IndentFacet());
std::cout.imbue(indentLocale);
std::cout << "Line 1\nLine 2\nLine 3\n";
/* You must imbue a file stream before it is opened. */
std::ofstream data;
data.imbue(indentLocale);
data.open("PLOP");
data << "Loki\nUses Locale\nTo do something silly\n";
}
The definition of the facet is slightly complex.
But the whole point is that somebody using the facet does not need to know anything about the formatting. The formatting is applied independent of how the stream is being used.
#include <locale>
#include <algorithm>
#include <iostream>
#include <fstream>
class IndentFacet: public std::codecvt<char,char,std::mbstate_t>
{
public:
explicit IndentFacet(size_t ref = 0): std::codecvt<char,char,std::mbstate_t>(ref) {}
typedef std::codecvt_base::result result;
typedef std::codecvt<char,char,std::mbstate_t> parent;
typedef parent::intern_type intern_type;
typedef parent::extern_type extern_type;
typedef parent::state_type state_type;
int& state(state_type& s) const {return *reinterpret_cast<int*>(&s);}
protected:
virtual result do_out(state_type& tabNeeded,
const intern_type* rStart, const intern_type* rEnd, const intern_type*& rNewStart,
extern_type* wStart, extern_type* wEnd, extern_type*& wNewStart) const
{
result res = std::codecvt_base::noconv;
for(;(rStart < rEnd) && (wStart < wEnd);++rStart,++wStart)
{
// 0 indicates that the last character seen was a newline,
// so a tab is written before the next character. Skip the tab
// if the next character is also a newline.
if ((state(tabNeeded) == 0) && (*rStart != '\n'))
{
res = std::codecvt_base::ok;
state(tabNeeded) = 1;
*wStart = '\t';
++wStart;
if (wStart == wEnd)
{
res = std::codecvt_base::partial;
break;
}
}
// Copy the next character.
*wStart = *rStart;
// If the character copied was a '\n' mark that state
if (*rStart == '\n')
{
state(tabNeeded) = 0;
}
}
if (rStart != rEnd)
{
res = std::codecvt_base::partial;
}
rNewStart = rStart;
wNewStart = wStart;
return res;
}
// Override so the do_out() virtual function is called.
virtual bool do_always_noconv() const throw()
{
return false; // Sometime we add extra tabs
}
};
See: Tom's notes below
Well this is not the answer I'm looking for, but in case there is no such answer, here is a way to do this manually:
void
indentedOutput(std::ostream &outStream, const char *message, bool &newline)
{
while (char cur = *message) {
if (newline) {
outStream << " ";
newline = false;
}
outStream << cur;
if (cur == '\n') {
newline = true;
}
++message;
}
}
A way to add such a feature would be to write a filtering streambuf (i.e. a streambuf which forwards the IO operation to another streambuf but manipulates the data transferred) which adds the indentation as part of its filter operation. I gave an example of writing a streambuf here and boost provides a library to help in that.
In your case, the overflow() member would simply test for '\n' and then add the indent just after if needed (exactly what you have done in your indentedOutput function, except that newline would be a member of the streambuf). You could probably have a setting to increase or decrease the indent size (perhaps accessible via a manipulator; the manipulator would have to do a dynamic_cast to ensure that the streambuf associated with the stream is of the correct type; there is a mechanism to add user data to a stream -- basic_ios::xalloc, iword and pword -- but here we want to act on the streambuf).
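The filtering streambuf described above could look roughly like the following sketch. The class name indent_buf and the helper indent_demo are made up for the example, the indent width is fixed at construction, and a std::ostringstream is used as the destination so the result is easy to inspect.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical filter: forwards every character to `dest` and inserts
// `width` spaces in front of each non-empty line.
class indent_buf : public std::streambuf {
    std::streambuf* dest_;
    std::string indent_;
    bool at_line_start_;
public:
    indent_buf(std::streambuf* dest, std::size_t width)
        : dest_(dest), indent_(width, ' '), at_line_start_(true) {}
protected:
    // No put area is set up, so every character arrives here.
    int_type overflow(int_type c) override {
        if (traits_type::eq_int_type(c, traits_type::eof()))
            return traits_type::not_eof(c);
        const char ch = traits_type::to_char_type(c);
        if (at_line_start_ && ch != '\n') {
            const std::streamsize n =
                static_cast<std::streamsize>(indent_.size());
            if (dest_->sputn(indent_.data(), n) != n)
                return traits_type::eof();
        }
        at_line_start_ = (ch == '\n'); // the `newline` flag from the question
        return dest_->sputc(ch);
    }
    int sync() override { return dest_->pubsync(); }
};

// Runs some text through a two-space indenting filter.
std::string indent_demo(const std::string& text) {
    std::ostringstream sink;
    indent_buf buf(sink.rdbuf(), 2);
    std::ostream out(&buf);
    out << text;
    out.flush();
    return sink.str();
}
```

To indent a file, the same filter would wrap the file stream's buffer, for example indent_buf buf(file.rdbuf(), 2); followed by std::ostream out(&buf);.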
I've had good success with Martin's codecvt facet based suggestion, but I had problems using it on std::cout on OSX, since by default this stream uses a basic_streambuf based streambuf which ignores the imbued facet. The following line switches std::cout and friends to use a basic_filebuf based streambuf, which will use the imbued facet.
std::ios::sync_with_stdio(false);
With the associated side effect that the iostream standard stream objects may operate independently of the standard C streams.
Another note: since this facet does not have a static std::locale::id of its own, calling std::has_facet<IndentFacet> on the locale always returned true. Adding a std::locale::id meant that the facet was not used, since basic_filebuf looks for the base class template.
There is no simple way, but a lot has been written about the complex
ways to achieve this. Read this article for a good explanation of
the topic. Here is another article, unfortunately in German. But
its source code should help you.
For example you could write a function which logs a recursive structure. For each level of recursion the indentation is increased:
std::ostream& operator<<(std::ostream& stream, Parameter* rp)
{
stream << "Parameter: " << std::endl;
// Get current indent
int w = format::get_indent(stream);
stream << "Name: " << rp->getName();
// ... log other attributes as well
if ( rp->hasParameters() )
{
stream << "subparameter (" << rp->getNumParameters() << "):\n";
// Change indent for sub-levels in the hierarchy
stream << format::indent(w+4);
// write sub parameters
stream << rp->getParameters();
}
// Now reset indent
stream << format::indent(w);
return stream;
}
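The format::indent and format::get_indent helpers used above come from the linked article and are not shown here. As a rough guess at how such helpers can be built, the sketch below stores the indent width in the stream itself via ios_base::xalloc() and iword(). Everything in the format_sketch namespace, including the pad manipulator, is an assumption made for illustration and not the article's code.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

namespace format_sketch {

// Index of the per-stream slot that holds the indent width.
inline int indent_slot() {
    static const int idx = std::ios_base::xalloc();
    return idx;
}

// Manipulator with an argument: `os << indent{4}` stores the width.
struct indent { int width; };

inline std::ostream& operator<<(std::ostream& os, indent in) {
    os.iword(indent_slot()) = in.width;
    return os;
}

// Reads the width back; a stream that was never set reports 0.
inline int get_indent(std::ostream& os) {
    return static_cast<int>(os.iword(indent_slot()));
}

// Manipulator without an argument: `os << pad` writes the current indent.
inline std::ostream& pad(std::ostream& os) {
    return os << std::string(static_cast<std::size_t>(get_indent(os)), ' ');
}

// Small demonstration on an in-memory stream.
inline std::string demo() {
    std::ostringstream os;
    os << indent{4} << pad << "Name: x\n";
    os << indent{0} << pad << "done";
    return os.str();
}

} // namespace format_sketch
```

Unlike a filtering streambuf, this only stores the setting; each line still has to ask for the padding explicitly with pad.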
I have generalized Loki Astarti's solution to work with arbitrary indentation levels. The solution has a nice, easy to use interface, but the actual implementation is a little fishy. It can be found on GitHub: https://github.com/spacemoose/ostream_indenter
There's a more involved demo in the github repo, but given:
#include "indent_facet.hpp"
#include <iostream>
int main()
{
/// This probably has to be called once for every program:
// http://stackoverflow.com/questions/26387054/how-can-i-use-stdimbue-to-set-the-locale-for-stdwcout
std::ios_base::sync_with_stdio(false);
// This is the demo code:
std::cout << "I want to push indentation levels:\n" << indent_manip::push
<< "To arbitrary depths\n" << indent_manip::push
<< "and pop them\n" << indent_manip::pop
<< "back down\n" << indent_manip::pop
<< "like this.\n" << indent_manip::pop;
}
It produces the following output:
I want to push indentation levels:
To arbitrary depths
and pop them
back down
like this.
I would appreciate any feedback as to the utility of the code.
Simple whitespace manipulator
struct Whitespace
{
Whitespace(int n)
: n(n)
{
}
int n;
};
std::ostream& operator<<(std::ostream& stream, const Whitespace &ws)
{
for(int i = 0; i < ws.n; i++)
{
stream << " ";
}
return stream;
}