Is there a 'catch' with FastFormat? - c++

I just read about the FastFormat C++ i/o formatting library, and it seems too good to be true: Faster even than printf, typesafe, and with what I consider a pleasing interface:
// prints: "This formats the remaining arguments based on their order - in this case we put 1 before zero, followed by 1 again"
fastformat::fmt(std::cout, "This formats the remaining arguments based on their order - in this case we put {1} before {0}, followed by {1} again", "zero", 1);
// prints: "This writes each argument in the order, so first zero followed by 1"
fastformat::write(std::cout, "This writes each argument in the order, so first ", "zero", " followed by ", 1);
This looks almost too good to be true. Is there a catch? Have you had good, bad or indifferent experiences with it?

Is there a 'catch' with FastFormat?
Last time I checked, there was one annoying catch:
You can only use either the narrow string version or the wide string version of this library. (The functions for wchar_t and char are the same -- which type is used is a compile time switch.)
With iostreams, stdio or Boost.Format you can use both.

Found one "catch", though for most people it will never manifest. From the project page:
Atomic operation. It doesn't write out statement elements one at a time, like the IOStreams, so has no atomicity issues
The only way I can see this happening is if it buffers the whole write() call's output itself, then writes it out to the ostream in one step. This means it needs to allocate memory, and if an object passed into the write() call produces a lot of output (several megabytes or more), it can consume up to twice that much memory in internal buffers (assuming it uses the grow-a-buffer-by-doubling-its-size-each-time trick).
If you're just using it for logging, and not, say, dumping huge amounts of XML, you'll never see this problem.
The only other "catch" I'm seeing is:
Highly portable. It will work with all good modern C++ compilers; it even works with Visual C++ 6!
So it won't work with an old C++ compiler, like cfront, whereas iostreams is backward compatible to the late 80's. Again, I'd be surprised if anyone ever had a problem with this.

Although FastFormat is a good library there are a number of issues with it:
Limited formatting support, in particular the following features are not supported:
Leading zeros (or any other non-space padding)
Octal/hexadecimal encoding
Runtime width/alignment specification
The library is quite big for a relatively small task of formatting and has even bigger dependency (STLSoft).

It looks pretty interesting indeed! Good tip regardless, and +1 for that!
I've been playing with it for a bit. The main drawback I see is that FastFormat supports less formatting options for the output. This is I think a direct consequence of the way the higher typesafety is achieved, and a good tradeoff depending on your circumstances.

If you look in detail at his performance benchmark page, you'll notice that good old C printf-family functions are still winning on Linux. In fact, the only test case where they perform poorly is the test case that should be static string concatenations, where I would expect printf to be wasteful. Moreover, GCC provides static type-checking on printf-style function calls, so the benefit of type-safety is reduced. So: if you are running on Linux and if you need the absolute best performance, FastFormat is probably not the optimal solution.

The library depends on a couple of environment variables, as mentioned in the docs.
That might be no biggie to some people, but I'd prefer my code to be as self-contained as possible. If I check it out from source control, it should work and compile. It won't, if it requires you to set environment variables.

Related

A better replacement for istrstream?

istrstream was perfect for my needs - basically, take a fixed char buffer, and give me a simple way to extract lines getline() and test for eof()
I'm switching our projects to C++ 17 compliance - which has deprecated istrsteam - apparently because there are too many C++ programmers who cannot fathom fixed buffer memory management (are you serious?!)
At any rate, the istringstream provides the same use semantics, but it imposes the need to now copy the entire fixed character buffer at construction time.
This is an anti-pattern.
What I am looking for is either a way to use a string_view in place of a string for the istringstream, or alternately a better replacement for stringstream which itself handles externally managed fixed buffer (it need only point into it, it never need worry about managing that resource, just as strstream did).
Currently, in VS 2017, this is illegal, and if I understand things correctly, is illegal everywhere in the current state-of-art of C++ (I'm sure you'll correct me if I'm wrong!)
std::string_view raw_view(reinterpret_cast<const char *>(raw_buffer.get()), raw_buffer.size());
std::istringstream raw_stream(raw_view);
So - ideas?
Note: Peter Sommerlad has a proposal for this exact idea here for the C++ standards body:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0448r1.pdf
Continue using istrstream for the time being. It likely won't be removed until either P0448 (using std::span<char> as the source/destination of a stream buffer) or P0408 (the ability to move data into/outof stringstreams) is adopted by the standard. Either of those would serve your needs well.
That being said, if all you're trying to do is get substrings between \ns, it would be far more efficient (even with the above proposals) to just use a regex search. Or just a regular search, since you're just looking for \n. That would give you a pair of iterators that represents a line. Using iostreams for line-by-line processing of an already-loaded character buffer is overkill and will never be as efficient as the alternative.

Optimal way of reading a complete file to a string using fstream?

Many other posts, like " Read whole ASCII file into C++ std::string " explain what some of the options are but do not describe pro and cons of various methods in any depth. I want to know why one method is preferable over another?
All of these use std::fstream to read the file into a std::string. I am unsure what the costs and benefits of each method. Lets assume this is for the common case where the read files are known to be of some smallish size memory can easily accommodate, clearly reading a multi-terrabyte file into an memory is a bad idea no matter how you do it.
The most common way after a few googles searches to read a whole file into an std::string involves using std::getline and appending a newline character to it after each line. This seems needless to me, but is there some performance or compatibility reason that this is ideal?
std::string Results;
std::ifstream ResultReader("file.txt");
while(ResultReader)
{
std::getline(ResultReader, Results);
Results.push_back('\n');
}
Another way I pieced together is to change the getline delimiter so it is something not in the file. The EOF char is seems unlikely to be in the middle of the file so that seems a likely candidate. This includes a cast so there is at least one reason not to do it, but this does read a file at once with no string concatenation. Presumably there is still some cost for the delimiter checks. Are there any other good reasons not to do this?
std::string Results;
std::ifstream ResultReader("file.txt");
std::getline(ResultReader, Results, (char)std::char_traits<char>::eof());
The cast means that on systems that define std::char_traits::eof() as something other than -1 might have problems. Is this a practical reason to not choose this over other methods that use std::getline and string::push_pack('\n').
How does these compare to other ways of reading the file at once like in this question: Read whole ASCII file into C++ std::string
std::ifstream ResultReader("file.txt");
std::string Results((std::istreambuf_iterator<char>(ResultReader)),
std::istreambuf_iterator<char>());
It would seem this would be best. It offloads almost all the work onto the standard library which ought to be heavily optimized for the given platform. I see no reason for checks other than stream validity and the end of the file. Is this ideal or are there problems with this that are unseen.
Does the standard or do details of some implementation provide reasons to prefer some method over another? Have I missed some method that might prove ideal in a wide variety of circumstances?
What is a simplest, most idiomatic, best performing and standard compliant way of reading a whole file into an std::string?
EDIT - 2
This question has prompted me to write a small suite of benchmarks. They are MIT license and available on github at: https://github.com/Sqeaky/CppFileToStringExperiments
Fastest - TellSeekRead and CTellSeekRead- These have the system provide an easy to get the size and reads the file in one go.
Faster - Getline Appending and Eof - The checking of chars does not seem to impose any cost.
Fast - RdbufMove and Rdbuf - The std::move seems to make no difference in release.
Slow - Iterator, BackInsertIterator and AssignIterator - Something is wrong with iterators and input streams. The work great in memory, but not here. That said some of these are faster than others.
I have added every method suggested so far, including those in links. I would appreciate if someone could run this on windows and with other compilers. I currently do not have access to a machine with NTFS and it has been noted that this and compiler details could be important.
As for measuring simplicity and idiomatic-ness how do we measure these objectively? Simplicity seems doable, perhaps use something line LOCs and Cyclomatic complexity, but how idiomatic something is seems purely subjective.
What is a simplest, most idiomatic, best performing and standard
compliant way of reading a whole file into an std::string?
those are pertty much contradicting requests, one most likely to lessen the other. simpler code won't be the fastest, or more idiomatic.
after exploring this area for a while I've come to some conclusions:
1) the most performance penalty causing is the IO action itself - the less IO actions taken - the fastest the code
2) memory allocations also quite expensive, but not as expensive as the IO
3) reading as binary is faster than reading as text
4) using the OS API will probably be faster than C++ streams
5) std::ios_base::sync_with_stdio doesn't really effect the performence, it's an urban legend.
using std::getline is probably not the best choice if performence is needed because of these reasons: it will make N IO actions and N allocations for N lines.
A compromise which is fast, standard and elegant is to get the file size, allocate all the memory in one time, then reading the file in one time:
std::ifstream fileReader(<your path here>,std::ios::binary|std::ios::ate);
if (fileReader){
auto fileSize = fileReader.tellg();
fileReader.seekg(std::ios::beg);
std::string content(fileSize,0);
fileReader.read(&content[0],fileSize);
}
move the content around to prevent un-needed copies.
This website has a good comparison on several different methods for doing that. The one I currently use is:
std::string read_sequence() {
std::ifstream f("sequence.fasta");
std::ostringstream ss;
ss << f.rdbuf();
return ss.str();
}
If your text files are separated by newlines, this will keep them. If you want to remove that, for instance (which is my case most of the times), you can just add a call to something such as
auto s = ss.str();
s.erase(std::remove_if(s.begin(), s.end(),
[](char c) { return c == '\n'; }), s.end());
There are two big difficulties with your question. First, the Standard doesn't mandate any particular implementation (yes, nearly everybody started with the same implementation; but they've been modifying it over time, and the optimal I/O code for NTFS, say, will be different than the optimal I/O code for ext4), so it is possible (although somewhat unlikely) for a particular approach to be fastest on one platform, but not another. Second, there's a little difficulty in defining "optimal"; I assume you mean "fastest," but that's not necessarily the case.
There are approaches that are idiomatic, and perfectly fine C++, but unlikely to give wonderful performance. If your goal is to end up with a single std::string, using std::getline(std::ostream&, std::string&) very likely to be slower than necessary. The std::getline() call has to look for the '\n', and you'll occasionally reallocate and copy the destination std::string. Even so, it's ridiculously simple, and easy to understand. That could be optimal from a maintenance perspective, assuming you don't need the absolute fastest performance possible. This will also be a good approach if you don't need the whole file in one giant std::string at one time. You'll be very frugal with memory.
An approach that is likely more efficient is to manipulate the read buffer:
std::string read_the_whole_file(std::ostream& ostr)
{
std::ostringstream sstr;
sstr << ostr.rdbuf();
return sstr.str();
}
Personally, I'm just as likely to use std::fopen() and std::fread() (and std::unique_ptr<FILE>) because, on Windows at least, you'll get a better error message when std::fopen() fails than when constructing a file stream object fails. I consider the better error message an important factor when deciding which approach is optimal.

How do I associate changed lines with functions in a git repository of C code?

I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.
As a start, I examined the output of
git log --patch -M --find-renames --find-copies-harder --function-context -- *.c
I looked at using Language.C from Hackage, but it seems to want a complete translation unit—expanded headers and all—rather being able to cope with a source fragment.
The --function-context option is new since version 1.7.8. The foundation of the implementation in v1.7.9.4 is a regex:
PATTERNS("cpp",
/* Jump targets or access declarations */
"!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
/* C/++ functions/methods at top level */
"^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
/* compound type at top level */
"^((struct|class|enum)[^;]*)$",
/* -- */
"[a-zA-Z_][a-zA-Z0-9_]*"
"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.
This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?
I realise this suggestion is a bit tangential, but it may help in order to clarify and rank requirements. This would work for C or C++ ...
Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.
This might not be feasible for you, but IIRC there is an option on gcc to compile so that each function is compiled to an 'independent chunk' within the generated object code file. The linker can pull each 'chunk' into a program. (It is getting pretty late here, so I will look this up in the morning, if you are interested in the idea. )
So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'
I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false positive (changes).
I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.
So it might be interesting to see if two alternative implementations of the same function (i.e. from different a change set) are the same instructions :-)
This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.
So, the regular expressions might be kept very simple :-)
Assuming all of this is straightforward, what might this approach fail to give you?
Side Note: This basic strategy could work for any language which targets machine code, as well as VM instruction sets like the Java VM Bytecode, .NET CLR code, etc too.
It might be worth considering building a simple parser, using one of the common tools, rather than just using regular expressions. Clearly it is better to choose something you are familiar with, or which your organisation already uses.
For this problem, a parser doesn't actually need to validate the code (I assume it is valid when it is checked in), and it doesn't need to understand the code, so it might be quite dumb.
It might throw away comments (retaining new lines), ignore the contents of text strings, and treat program text in a very simple way. It mainly needs to keep track of balanced '{' '}', balanced '(' ')' and all the other valid program text is just individual tokens which can be passed 'straight through'.
It's output might be a separate file/function to make tracking easier.
If the language is C or C++, and the developers are reasonably disciplined, they might never use 'non-syntactic macros'. If that is the case, then the files don't need to be preprocessed.
Then a parser is mostly just looking for a the function name (an identifier) at file scope followed by ( parameter-list ) { ... code ... }
I'd SWAG it would be a few days work using yacc & lex / flex & bison, and it might be so simple that their is no need for the parser generator.
If the code is Java, then ANTLR is a possible, and I think there was a simple Java parser example.
If Haskell is your focus, their may be student projects published which have made a reasonable stab at a parser.

complexity of parsing C++

Out of curiosity, I was wondering what were some "theoretical" results about parsing C++.
Let n be the size of my project (in LOC, for example, but since we'll deal with big-O it's not very important)
Is C++ parsed in O(n) ? If not, what's the complexity?
Is C (or Java or any simpler language in the sense of its grammar) parsed in O(n)?
Will C++1x introduce new features that will be even harder to parse?
References would be greatly appreciated!
I think the term "parsing" is being interpreted by different people in different ways for the purposes of the question.
In a narrow technical sense, parsing is merely verifying the the source code matches the grammar (or perhaps even building a tree).
There's a rather widespread folk theorem that says you cannot parse C++ (in this sense) at all because you must resolve the meaning of certain symbols to parse. That folk theorem is simply wrong.
It arises from the use of "weak" (LALR or backtracking recursive descent) parsers, which, if they commit to the wrong choice of several possible subparse of a locally ambiguous part of text (this SO thread discusses an example), fail completely by virtue of sometimes making that choice. The way those that use such parser resolve the dilemma is collect symbol table data as parsing occurs and mash extra checking into the parsing process to force the parser to make the right choice when such choice is encountered. This works at the cost of significantly tangling name and type resolution with parsing, which makes building such parsers really hard. But, at least for legacy GCC, they used LALR which is linear time on parsing and I don't think that much more expensive if you include the name/type capture that the parser does (there's more to do after parsing because I don't think they do it all).
There are at least two implementations of C++ parsers done using "pure" GLR parsing technology, which simply admits that the parse may be locally ambiguous and captures the multiple parses without comment or significant overhead. GLR parsers are linear time where there are no local ambiguities. They are more expensive in the ambiguity region, but as a practical matter, most the of source text in a standard C++ program falls into the "linear time" part. So the effective rate is linear, even capturing the ambiguities. Both of the implemented parsers do name and type resolution after parsing and use inconsistencies to eliminate the incorrect ambiguous parses.
(The two implementations are Elsa and our (SD's) C++ Front End. I can't speak for Elsa's current capability (I don't think it has been updated in years), but ours does all of C++11 [EDIT Jan 2015: now full C++14 EDIT Oct 2017: now full C++17] including GCC and Microsoft variants).
If you take the hard computer science definition that a language is extensionally defined as an arbitrary set of strings (Grammars are supposed to be succinct ways to encode that intensionally) and treating parsing as "check the the syntax of the program is correct" then for C++ you have expand the templates to verify that each actually expands completely. There's a Turing machine hiding in the templates, so in theory checking that a C++ program is valid is impossible (no time limits). Real compilers (honoring the standard) place fixed constraints on how much template unfolding they'll do, and so does real memory, so in practice C++ compilers finish. But they can take arbitrarily long to "parse" a program in this sense. And I think that's the answer most people care about.
As a practical matter, I'd guess most templates are actually pretty simple, so C++ compilers can finish as fast as other compilers on average. Only people crazy enough to write Turing machines in templates pay a serious price. (Opinion: the price is really the conceptual cost of shoehorning complicated things onto templates, not the compiler execution cost.)
Depends what you mean by "parsed", but if your parsing is supposed to include template instantiation, then not in general:
[Shortcut if you want to avoid reading the example - templates provide a rich enough computational model that instantiating them is, in general, a halting-style problem]
template<int N>
struct Triangle {
static const int value = Triangle<N-1>::value + N;
};
template<>
struct Triangle<0> {
static const int value = 0;
};
int main() {
return Triangle<127>::value;
}
Obviously, in this case the compiler could theoretically spot that triangle numbers have a simple generator function, and calculate the return value using that. But otherwise, instantiating Triangle<k> is going to take O(k) time, and clearly k can go up pretty quickly with the size of this program, as far as the limit of the int type.
[End of shortcut]
Now, in order to know whether Triangle<127>::value is an object or a type, the compiler in effect must instantiate Triangle<127> (again, maybe in this case it could take a shortcut since value is defined as an object in every template specialization, but not in general). Whether a symbol represents an object or a type is relevant to the grammar of C++, so I would probably argue that "parsing" C++ does require template instantiation. Other definitions might vary, though.
Actual implementations arbitrarily cap the depth of template instantiation, making a big-O analysis irrelevant, but I ignore that since in any case actual implementations have natural resource limits, also making big-O analysis irrelevant...
I expect you can produce similarly-difficult programs in C with recursive #include, although I'm not sure whether you intend to include the preprocessor as part of the parsing step.
Aside from that, C, in common with plenty of other languages, can have O(not very much) parsing. You may need symbol lookup and so on, which as David says in his answer cannot in general have a strict O(1) worst case bound, so O(n) might be a bit optimistic. Even an assembler might look up symbols for labels. But for example dynamic languages don't necessarily even need symbol lookup for parsing, since that might be done at runtime. If you pick a language where all the parser needs to do is establish which symbols are keywords, and do some kind of bracket-matching, then the Shunting Yard algorithm is O(n), so there's hope.
Hard to tell if C++ can be "just parsed", as - contrary to most languages - it cannot be analysed syntactically without performing semantic analysis at the same time.

Why is strncpy insecure?

I am looking to find out why strncpy is considered insecure. Does anybody have any sort of documentation on this or examples of an exploit using it?
Take a look at this site; it's a fairly detailed explanation. Basically, strncpy() doesn't require NUL termination, and is therefore susceptible to a variety of exploits.
The original problem is obviously that strcpy(3) was not a memory-safe operation, so an attacker could supply a string longer than the buffer which would overwrite code on the stack, and if carefully arranged, could execute arbitrary code from the attacker.
But strncpy(3) has another problem in that it doesn't supply null termination in every case at the destination. (Imagine a source string longer than the destination buffer.) Future operations may expect conforming C nul-terminated strings between equally sized buffers and malfunction downstream when the result is copied to yet a third buffer.
Using strncpy(3) is better than strcpy(3) but things like strlcpy(3) are better still.
To safely use strncpy, one must either (1) manually stick a null character onto the result buffer, (2) know that the buffer ends with a null beforehand, and pass (length-1) to strncpy, or (3) know that the buffer will never be copied using any method that won't bound its length to the buffer length.
It's important to note that strncpy will zero-fill everything in the buffer past the copied string, while other length-limited strcpy variants will not. This may at some cases be a performance drain, but in other cases be a security advantage. For example, if one used strlcpy to copy "supercalifragilisticexpalidocious" into a buffer and then to copy "it", the buffer would hold "it^ercalifragilisticexpalidocious^" (using "^" to represent a zero byte). If the buffer gets copied to a fixed-sized format, the extra data might tag along with it.
The question is based on a "loaded" premise, which makes the question itself invalid.
The bottom line here is that strncpy is not considered insecure and has never been considered insecure. The only claims of "insecurity" that can be attached to that function are the broad claims of general insecurity of C memory model and C language itself. (But that is obviously a completely different topic).
Within the realm of C language the misguided belief of some kind of "insecurity" inherent in strncpy is derived from the widespread dubious pattern of using strncpy for "safe string copying", i.e. something this function does not do and has never been intended for. Such usage is indeed highly error prone. But even if you put an equality sign between "highly error prone" and "insecure", it is still a usage problem (i.e. a lack of education problem) not a strncpy problem.
Basically, one can say that the only problem with strncpy is a unfortunate naming, which makes newbie programmers assume that they understand what this function does instead of actually reading the specification. Looking at the function name an incompetent programmer assumes that strncpy is a "safe version" of strcpy, while in reality these two functions are completely unrelated.
Exactly the same claim can be made against the division operator, for one example. As most of you know, one of the most frequently-asked questions about C language goes as "I assumed that 1/2 will evaluate to 0.5 but I got 0 instead. Why?" Yet, we don't claim that the division operator is insecure just because language beginners tend to misinterpret its behavior.
For another example, we don't call pseudo-random number generator functions "insecure" just because incompetent programmers are often unpleasantly surprised by the fact that their output is not truly random.
That is exactly how it is with strncpy function. Just like it takes time for beginner programmers to learn what pseudo-random number generators actually do, it takes them time to learn what strncpy actually does. It takes time to learn that strncpy is a conversion function, intended for converting zero-terminated strings to fixed-width strings. It takes time to learn that strncpy has absolutely nothing to do with "safe string copying" and can't be meaningfully used for that purpose.
Granted, it usually takes much longer for a language student to learn the purpose of strncpy than to sort things out with the division operator. However, this is a basis for any "insecurity" claims against strncpy.
P.S. The CERT document linked in the accepted answer is dedicated to exactly that: to demonstrate the insecurities of the typical incompetent abuse of strncpy function as a "safe" version of strcpy. It is not in any way intended to claim that strncpy itself is somehow insecure.
A pathc of Git 2.19 (Q3 2018) finds that it is too easy to misuse system API functions such as strcat(); strncpy(); ... and forbids those functions in this codebase.
See commit e488b7a, commit cc8fdae, commit 1b11b64 (24 Jul 2018), and commit c8af66a (26 Jul 2018) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit e28daf2, 15 Aug 2018)
banned.h: mark strcat() as banned
The strcat() function has all of the same overflow problems as strcpy().
And as a bonus, it's easy to end up accidentally quadratic, as each subsequent call has to walk through the existing string.
The last strcat() call went away in f063d38 (daemon: use
cld->env_array when re-spawning, 2015-09-24, Git 2.7.0).
In general, strcat() can be replaced either with a dynamic string
(strbuf or xstrfmt), or with xsnprintf if you know the length is bounded.