Technical background to the C++ fmt::print vs. fmt::format_to naming?

Technical background to the C++ fmt::print vs. fmt::format_to naming? - c++

Why is it fmt::format_to(OutputIt, ...) and not fmt::print(OutputIt, ...)??
I'm currently familiarizing myself with {fmt}, a / the modern C++ formatting library.
While browsing the API, I found the naming a bit disjoint, but given my little-to-no experience with the library (and my interest in API design), I would like to get behind these naming choices: (fmt core API reference)
There's fmt::format(...) -> std::string which makes sense, it returns a formatted string.
Then we have void fmt::print([stream, ] ...) which also makes sense naming wise (certainly given the printf legacy).
But then we have fmt::format_to(OutputIt, ...) -> OutputIt which resembles, apart from the return type, what print does with streams.
Now obviously, one can bike shed names all day, but here the question is not on why we have format vs. print (which is quite explainable to me), but why a function that clearly(?) behaves like the write-to-stream-kind has been bundled with the format_... naming style.
So, as the question title already asks, is there a technical difference in how fmt::print(stream, ...) behaves when formatting to a streams vs. how fmt::format_to(OutputIt, ...) behaves when formatting to an output iterator?
Or was/is this purely a style choice? Also, given that the GitHube repo explicitly lists the fmt tag here, I was hoping that we could get a authoritative answer on this from the original API authors.

While it would be possible to name format_to print, the former is closer conceptually to format than to print. format_to is a generalization of format that writes output via an output iterator instead of std::string. Therefore the naming reflects that.
print without a stream argument on the other hand writes to stdout and print with an argument generalizes that to an arbitrary stream. Writing to a stream is fundamentally different from writing to an output iterator because it involves additional buffering, synchronization, etc. Other languages normally use "print" for this sort of functionality so this convention is followed in {fmt}.
This becomes blurry because you can have an output iterator that writes to a stream but even there the current naming reflects what high-level APIs are being used.
In other words format_to is basically a fancy STL algorithm (there is even format_to_n similarly to copy/copy_n) while print is a formatted output function.

Related

Fortran READ(,), WRITE(,) arguments

This question has been covered somewhat in previous SO questions. However, previous discussions seem somewhat incomplete.
Fortran has several I/O statements. There is READ(*,*) and WRITE(*,*), etc. The first asterisk (*) is the standard asterisk designating an input or output from the keyboard to/from the screen. My question is about the second asterisk:
The second asterisk designates the format of the I/O elements, the data TYPE which is being used. If this asterisk is left unchanged, the fortran complier uses the default format (whatever that may be, based on the compiler). Users must use a number of format descriptors to designate the data type, precision, and so forth.
(1) Are these format descriptors universal for all Fortran compilers and for all versions of Fortran?
(2) Where can I find the standard list of these format descriptors? For example, F8.3 means that the number should be printed using fixed point notation with field width 8 and 3 decimal places.
EDIT: A reference for edit descriptors can be found here: http://fortranwiki.org/fortran/show/Edit+descriptors

First, as a clarification, the 1st asterisk in the READ/WRITE statement has a slightly different meaning than you state. For write, it means write to the default file unit (in linux world generally standard out), for read it means read from the default file unit (in linux world generally standard in), either of which may not necessarily be connected to a terminal screen or a keyboard.
The 2nd asterisk means use list directed IO. For read statements this is generally useful because you don't need a specified format for your input. It breaks up the line into fields separated by space or comma (maybe a couple others that aren't commonly used), and reads each field in turn into the variable associated with that field in the argument list, ignoring unread fields, and continuing onto the next line if not enough fields were read in (unless a line termination character \ is explicitly included).
For writes, it means the compiler is allowed to determine what format to write the variables out (I believe with no separator). I believe it is allowed to do this at run time, so that you are all but guaranteed that the value it is trying to write will fit into the format specifier used, so you can be assured that you won't get ******* written out. The down side is you have to manually include a separator character in your argument list, or all your numbers will run together.
In general, using list directed read is more of a convenience to the user, so they don't have to fit their inputs into rigidly defined fields, and list directed writes are a convenience to the programmer, in case they're not sure what the output will look like.

When you have a data transfer statement like read(*,*) ... it's helpful to understand exactly what this means. read(*,*) is equivalent to the more verbose read(unit=*, fmt=*). This second asterisk, as you have it, makes this read statement (or corresponding write statement) list-directed.
List-directed input/output, as described elsewhere, is a convenience for the programmer. The Fortran standards specify lots of constraints that the compiler must follow, but this language has things like "reasonable values", so allowing output to vary by compiler, settings, and so on.
Again, as described elsewhere, fine user control over the output (or input) comes with giving a format specification. Instead of read(*,fmt=*), something like read(*,fmt=1014) or read(*,fmt=format_variable_or_literal). I take it your question is: what is this format specification?
I won't go into details of all of the possible edit descriptors, but I will say in response to (2): you can find the list of those edit descriptors in the Fortran standard (Clause 10 of Fortran 2008 goes into all the detail) or a good reference book.
To answer (1): no, edit descriptors are not universal. Even across Fortran standards. Of note are:
The introduction of I0 (and other minimal-width specifiers) for output in Fortran 95;
The removal of the H edit descriptor in Fortran 95;
The introduction of the DT edit descriptor in Fortran 2003.

scanf on an istream object

NOTE: I've seen the post What is the cin analougus of scanf formatted input? before asking the question and the post doesn't solve my problem here. The post seeks for C++-way to do it, but as I mentioned already, it is inconvenient to just use C++-way to do it sometimes and I have clear examples for that.
I am trying to read data from an istream object, and sometimes it is inconvenient to just use C++-style ways such as operator>>, e.g. the data are in special form 123:456 so you have to imbue to make ':' as space (which is very hacky, as opposed to %d:%d in scanf), or 00123 where you want to read as string and convert decimal instead of octal (as opposed to %d in scanf), and possibly many other cases.
The reason I chose istream as interface is because it can be derived and therefore more flexible. For example, we can create in-memory streams, or some customized streams that generated on the fly, etc. C-style FILE*, on the other hand, is very limited, at least in a standard-compliant way, on creating customized streams.
So my questions is, is there a way to do scanf-like data extraction on istream object? I think fscanf internally read character by character from FILE* using fgetc, while istream also provides such interface. So it is possible by just copying and pasting the code of fscanf and replace the FILE* with the istream object, but that's very hacky. Is there a smarter and cleaner way, or is there some existing work on this?
Thanks.

You should never, under any circumstances, use scanf or its relatives for anything, for three reasons:
Many format strings, including for instance all the simple uses of %s, are just as dangerous as gets.
It is almost impossible to recover from malformed input, because scanf does not tell you how far in characters into the input it got when it hit something unexpected.
Numeric overflow triggers undefined behavior: yes, that means scanf is allowed to crash the entire program if a numeric field in the input has too many digits.
Prior to C++11, the C++ specification defined istream formatted input of numbers in terms of scanf, which means that last objection is very likely to apply to them as well! (In C++11 the specification is changed to use strto* instead and to do something predictable if that detects overflow.)
What you should do instead is: read entire lines of input into std::string objects with getline, hand-code logic to split them up into fields (I don't remember off the top of my head what the C++-string equivalent of strsep is, but I'm sure it exists) and then convert numeric strings to machine numbers with the strtol/strtod family of functions.
I cannot emphasize this enough: THE ONLY 100% RELIABLE WAY TO CONVERT STRINGS TO NUMBERS IN C OR C++, unless you are lucky enough to have a C++ runtime that is already C++11-conformant in this regard, IS WITH THE strto* FUNCTIONS, and you must use them correctly:
errno = 0;
result = strtoX(s, &ends, 10); // omit 10 for floats
if (s == ends || *ends || errno)
parse_error();
(The OpenBSD manpages, linked above, explain why you have to do this fairly convoluted thing.)
(If you're clever, you can use ends and some manual logic to skip that colon, instead of strsep.)

I do not recommend you to mix C++ input output and C input output. No that they are really incompatible but they could just plain interoperate wrong.
For example Oracle docs recommend not to mix it http://www.oracle.com/technetwork/articles/servers-storage-dev/mixingcandcpluspluscode-305840.html
But no one stops you from reading data into the buffer and parsing it with standard c functions like sscanf.
...
string curString;
int a, b;
...
std::getline(inputStream, curString);
int sscanfResult == sscanf(curString.cstr(), "%d:%d", &a, &b);
if (2 != sscanfResult)
throw "error";
...
But it won't help in some situations when your stream is just one long contiguous sequence of symbols(like some string turned into memory stream).
Making your own fscanf from scratch or porting(?) the original CRT function actually isn't the worst possible idea. Just make sure you have tested it thoroughly(low level custom char manipulation was always a source of pain in C).
I've never really tried the boost\spirit and such parsing infrastructure could really be an overkill for your project. But boost libraries are usually well tested and designed. You could at least try to use it.

Based on #tmyklebu's comment, I implemented streamScanf which wraps istream as FILE* via fopencookie: https://github.com/likan999/codejam/blob/master/Common/StreamScanf.cpp

Named parameter string formatting in C++

I'm wondering if there is a library like Boost Format, but which supports named parameters rather than positional ones. This is a common idiom in e.g. Python, where you have a context to format strings with that may or may not use all available arguments, e.g.
mouse_state = {}
mouse_state['button'] = 0
mouse_state['x'] = 50
mouse_state['y'] = 30
#...
"You clicked %(button)s at %(x)d,%(y)d." % mouse_state
"Targeting %(x)d, %(y)d." % mouse_state
Are there any libraries that offer the functionality of those last two lines? I would expect it to offer a API something like:
PrintFMap(string format, map<string, string> args);
In Googling I have found many libraries offering variations of positional parameters, but none that support named ones. Ideally the library has few dependencies so I can drop it easily into my code. C++ won't be quite as idiomatic for collecting named arguments, but probably someone out there has thought more about it than me.
Performance is important, in particular I'd like to keep memory allocations down (always tricky in C++), since this may be run on devices without virtual memory. But having even a slow one to start from will probably be faster than writing it from scratch myself.

The fmt library supports named arguments:
print("You clicked {button} at {x},{y}.",
arg("button", "b1"), arg("x", 50), arg("y", 30));
And as a syntactic sugar you can even (ab)use user-defined literals to pass arguments:
print("You clicked {button} at {x},{y}.",
"button"_a="b1", "x"_a=50, "y"_a=30);
For brevity the namespace fmt is omitted in the above examples.
Disclaimer: I'm the author of this library.

I've always been critic with C++ I/O (especially formatting) because in my opinion is a step backward in respect to C. Formats needs to be dynamic, and makes perfect sense for example to load them from an external resource as a file or a parameter.
I've never tried before however to actually implement an alternative and your question made me making an attempt investing some weekend hours on this idea.
Sure the problem was more complex than I thought (for example just the integer formatting routine is 200+ lines), but I think that this approach (dynamic format strings) is more usable.
You can download my experiment from this link (it's just a .h file) and a test program from this link (test is probably not the correct term, I used it just to see if I was able to compile).
The following is an example
#include "format.h"
#include <iostream>
using format::FormatString;
using format::FormatDict;
int main()
{
std::cout << FormatString("The answer is %{x}") % FormatDict()("x", 42);
return 0;
}
It is different from boost.format approach because uses named parameters and because
the format string and format dictionary are meant to be built separately (and for
example passed around). Also I think that formatting options should be part of the
string (like printf) and not in the code.
FormatDict uses a trick for keeping the syntax reasonable:
FormatDict fd;
fd("x", 12)
("y", 3.141592654)
("z", "A string");
FormatString is instead just parsed from a const std::string& (I decided to preparse format strings but a slower but probably acceptable approach would be just passing the string and reparsing it each time).
The formatting can be extended for user defined types by specializing a conversion function template; for example
struct P2d
{
int x, y;
P2d(int x, int y)
: x(x), y(y)
{
}
};
namespace format {
template<>
std::string toString<P2d>(const P2d& p, const std::string& parms)
{
return FormatString("P2d(%{x}; %{y})") % FormatDict()
("x", p.x)
("y", p.y);
}
}
after that a P2d instance can be simply placed in a formatting dictionary.
Also it's possible to pass parameters to a formatting function by placing them between % and {.
For now I only implemented an integer formatting specialization that supports
Fixed size with left/right/center alignment
Custom filling char
Generic base (2-36), lower or uppercase
Digit separator (with both custom char and count)
Overflow char
Sign display
I've also added some shortcuts for common cases, for example
"%08x{hexdata}"
is an hex number with 8 digits padded with '0's.
"%026/2,8:{bindata}"
is a 24-bit binary number (as required by "/2") with digit separator ":" every 8 bits (as required by ",8:").
Note that the code is just an idea, and for example for now I just prevented copies when probably it's reasonable to allow storing both format strings and dictionaries (for dictionaries it's however important to give the ability to avoid copying an object just because it needs to be added to a FormatDict, and while IMO this is possible it's also something that raises non-trivial problems about lifetimes).
UPDATE
I've made a few changes to the initial approach:
Format strings can now be copied
Formatting for custom types is done using template classes instead of functions (this allows partial specialization)
I've added a formatter for sequences (two iterators). Syntax is still crude.
I've created a github project for it, with boost licensing.

The answer appears to be, no, there is not a C++ library that does this, and C++ programmers apparently do not even see the need for one, based on the comments I have received. I will have to write my own yet again.

Well I'll add my own answer as well, not that I know (or have coded) such a library, but to answer to the "keep the memory allocation down" bit.
As always I can envision some kind of speed / memory trade-off.
On the one hand, you can parse "Just In Time":
class Formater:
def __init__(self, format): self._string = format
def compute(self):
for k,v in context:
while self.__contains(k):
left, variable, right = self.__extract(k)
self._string = left + self.__replace(variable, v) + right
This way you don't keep a "parsed" structure at hand, and hopefully most of the time you'll just insert the new data in place (unlike Python, C++ strings are not immutable).
However it's far from being efficient...
On the other hand, you can build a fully constructed tree representing the parsed format. You will have several classes like: Constant, String, Integer, Real, etc... and probably some subclasses / decorators as well for the formatting itself.
I think however than the most efficient approach would be to have some kind of a mix of the two.
explode the format string into a list of Constant, Variable
index the variables in another structure (a hash table with open-addressing would do nicely, or something akin to Loki::AssocVector).
There you are: you're done with only 2 dynamically allocated arrays (basically). If you want to allow a same key to be repeated multiple times, simply use a std::vector<size_t> as a value of the index: good implementations should not allocate any memory dynamically for small sized vectors (VC++ 2010 doesn't for less than 16 bytes worth of data).
When evaluating the context itself, look up the instances. You then parse the formatter "just in time", check it agaisnt the current type of the value with which to replace it, and process the format.
Pros and cons:
- Just In Time: you scan the string again and again
- One Parse: requires a lot of dedicated classes, possibly many allocations, but the format is validated on input. Like Boost it may be reused.
- Mix: more efficient, especially if you don't replace some values (allow some kind of "null" value), but delaying the parsing of the format delays the reporting of errors.
Personally I would go for the One Parse scheme, trying to keep the allocations down using boost::variant and the Strategy Pattern as much I could.

Given that Python it's self is written in C and that formatting is such a commonly used feature, you might be able (ignoring copy write issues) to rip the relevant code from the python interpreter and port it to use STL maps rather than Pythons native dicts.

I've writen a library for this puporse, check it out on GitHub.
Contributions are wellcome.

Generalized stream parsing?

Are there any libraries or technologies(in any language) that provide a regular-expression-like tool for any sort of stream-like or list-like data(as opposed to only character strings)?
For example, suppose you were writing a parser for your pet programming language. You've already got it lexed into a list of Common Lisp objects representing the tokens.
You might use a pattern like this to parse function calls(using C-style syntax):
(pattern (:var (:class ident)) (:class left-paren)
(:optional (:var object)) (:star (:class comma) (:var :object)) (:class right-paren))
Which would bind variables for the function name and each of the function arguments(actually, it would probably be implemented so that this pattern would probably bind a variable for the function name, one for the first argument, and a list of the rest, but that's not really an important detail).
Would something like this be useful at all?

I don't know how many replies you'll receive on a subject like this, as most languages lack the sort of robust stream APIs you seem to have in mind; thus, most of the people reading this probably don't know what you're talking about.
Smalltalk is a notable exception, shipping with a rich hierarchy of Stream classes that--coupled with its Collection classes--allow you to do some pretty impressive stuff. While most Smalltalks also ship with regex support (the pure ST implementation by Vassili Bykov is a popular choice), the regex classes unfortunately are not integrated with the Stream classes in the same way the Collection classes are. This means that using streams and regexes in Smalltalk usually involves reading character strings from a stream and then testing those strings separately with regex patterns--not the sort "read next n characters up until a pattern matches," or "read next n characters matching this pattern" type of functionally you likely have in mind.
I think a powerful stream API coupled with powerful regex support would be great. However, I think you'd have trouble generalizing about different stream types. A read stream on a character string would pose few difficulties, but file and TCP streams would have their own exceptions and latencies that you would have to handle gracefully.

Try looking at scala.util.regexp, both the API documentation, and the code example at http://scala.sygneca.com/code/automata. I think would allow a computational linguist to match strings of words by looking for part of speech patterns, for example.

This is the principle behind most syntactic parsers, which operate in two phases. The first phase is the lexer, where identifiers, language keywords, and other special characters (arithmetic operators, braces, etc) are identified and split into Token objects that typically have a numeric field indicating the type of the lexeme, and optionally another field indicating the text of the lexeme.
In the second phase, a syntactic parser operates on the Token objects, matching them by magic number alone, to parse phrases. (Software for doing this includes Antlr, yacc/bison, Scala's cala.util.parsing.combinator.syntactical library, and plenty of others). The two phases don't entirely have to depend on each other -- you can get your Token objects from anywhere else that you like. The magic number aspect seems to be important, though, because the magic numbers are assigned to constants, and they're what make it easy to express your grammar in a readable language.
And remember, that anything you can accomplish with a regular expression can also be accomplished with a context-free grammar (usually just as easily).

Is there a 'catch' with FastFormat?

I just read about the FastFormat C++ i/o formatting library, and it seems too good to be true: Faster even than printf, typesafe, and with what I consider a pleasing interface:
// prints: "This formats the remaining arguments based on their order - in this case we put 1 before zero, followed by 1 again"
fastformat::fmt(std::cout, "This formats the remaining arguments based on their order - in this case we put {1} before {0}, followed by {1} again", "zero", 1);
// prints: "This writes each argument in the order, so first zero followed by 1"
fastformat::write(std::cout, "This writes each argument in the order, so first ", "zero", " followed by ", 1);
This looks almost too good to be true. Is there a catch? Have you had good, bad or indifferent experiences with it?

Is there a 'catch' with FastFormat?
Last time I checked, there was one annoying catch:
You can only use either the narrow string version or the wide string version of this library. (The functions for wchar_t and char are the same -- which type is used is a compile time switch.)
With iostreams, stdio or Boost.Format you can use both.

Found one "catch", though for most people it will never manifest. From the project page:
Atomic operation. It doesn't write out statement elements one at a time, like the IOStreams, so has no atomicity issues
The only way I can see this happening is if it buffers the whole write() call's output itself, then writes it out to the ostream in one step. This means it needs to allocate memory, and if an object passed into the write() call produces a lot of output (several megabytes or more), it can consume up to twice that much memory in internal buffers (assuming it uses the grow-a-buffer-by-doubling-its-size-each-time trick).
If you're just using it for logging, and not, say, dumping huge amounts of XML, you'll never see this problem.
The only other "catch" I'm seeing is:
Highly portable. It will work with all good modern C++ compilers; it even works with Visual C++ 6!
So it won't work with an old C++ compiler, like cfront, whereas iostreams is backward compatible to the late 80's. Again, I'd be surprised if anyone ever had a problem with this.

Although FastFormat is a good library there are a number of issues with it:
Limited formatting support, in particular the following features are not supported:
Leading zeros (or any other non-space padding)
Octal/hexadecimal encoding
Runtime width/alignment specification
The library is quite big for a relatively small task of formatting and has even bigger dependency (STLSoft).

It looks pretty interesting indeed! Good tip regardless, and +1 for that!
I've been playing with it for a bit. The main drawback I see is that FastFormat supports less formatting options for the output. This is I think a direct consequence of the way the higher typesafety is achieved, and a good tradeoff depending on your circumstances.

If you look in detail at his performance benchmark page, you'll notice that good old C printf-family functions are still winning on Linux. In fact, the only test case where they perform poorly is the test case that should be static string concatenations, where I would expect printf to be wasteful. Moreover, GCC provides static type-checking on printf-style function calls, so the benefit of type-safety is reduced. So: if you are running on Linux and if you need the absolute best performance, FastFormat is probably not the optimal solution.

The library depends on a couple of environment variables, as mentioned in the docs.
That might be no biggie to some people, but I'd prefer my code to be as self-contained as possible. If I check it out from source control, it should work and compile. It won't, if it requires you to set environment variables.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js