I'm writing a C++ parser for a custom option file for an application. I have a loop that reads lines of the form option=value from a text file, where value must be converted to double. In pseudocode it does the following:
while(not EOF)
statement <- read_from_file
useful_statement <- remove whitespaces, comments, etc from statement
equal_position <- find '=' in useful_statement
option_str <- useful_statement[0:equal_position)
value_str <- useful_statement[equal_position+1:end)
find_option(option_str) <- double(value_str)
To handle the string splitting and passing around to functions, I use std::string_view because it avoids excessive copying and clearly states the intent of viewing segments of a pre-existing std::string. I've done everything to the point where std::string_view value_str points to the exact part of useful_statement that contains the value I want to extract, but I can't figure out the way to read a double from an std::string_view.
I know of std::stod which doesn't work with std::string_view. It allows me to write
double value = std::stod(std::string(value_str));
However, this is ugly because it converts to a string which is not actually needed, and even though it will presumably not make a noticeable difference in my case, it could be too slow if one had to read a huge amount of numbers from a text file.
On the other hand, atof won't work because I can't guarantee a null terminator. I could hack it by adding \0 to useful_statement when constructing it, but that will make the code confusing to a reader and make it too easy to break if the code is altered/refactored.
So, what would be a clean, intuitive and reasonably efficient way to do this?
Since you marked your question with C++1z, then that (theoretically) means you have access to from_chars. It can handle your string-to-number conversion without needing anything more than a pair of const char*s:
#include <charconv>
double dbl;
auto result = std::from_chars(value_str.data(), value_str.data() + value_str.size(), dbl);
Of course, this requires that your standard library provide an implementation of from_chars.
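As a minimal sketch of how the call might be wrapped with error checking: the helper name parse_double is hypothetical, and the floating-point overload of std::from_chars is assumed to be available (it arrived late in several standard libraries). Note that std::from_chars does not skip leading whitespace and does not accept a leading '+'.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper (not part of any library): parses the whole view as a
// double and returns std::nullopt on any error or on leftover characters.
std::optional<double> parse_double(std::string_view text) {
    double value = 0.0;
    const char* first = text.data();
    const char* last = text.data() + text.size();
    // from_chars never reads past `last`, so no null terminator is needed.
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc() || ptr != last) {
        return std::nullopt;  // not a number, out of range, or trailing junk
    }
    return value;
}
```

With this, something like parse_double(value_str) yields an empty optional for malformed input instead of silently returning a partial result.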
Headers:
#include <boost/convert.hpp>
#include <boost/convert/strtol.hpp>
Then:
std::string x { "aa123.4"};
const std::string_view y(x.c_str()+2, 5); // Window that views the characters "123.4".
auto value = boost::convert<double>(y, boost::cnv::strtol());
if (value.has_value())
{
std::cout << value.get() << "\n"; // Prints: 123.4
}
Tested Compilers:
MSVC 2017
p.s. Can easily install Boost using vcpkg (defaults to 32-bit, second command is for 64-bit):
vcpkg install boost-convert
vcpkg install boost-convert:x64-windows
Update: Apparently, many Boost functions use string streams internally, which has a lock on the global OS locale. So they have terrible multi-threaded performance**.
I would now recommend something like stoi() with substr instead. See: Safely convert std::string_view to int (like stoi or atoi)
** This strange quirk of Boost renders most of Boost string processing absolutely useless in a multi-threaded environment, which is a strange paradox indeed. This is the voice of hard-won experience talking - measure it for yourself if you have any doubts. A 48-core machine runs no faster with many Boost calls compared to a 2-core machine. So now I avoid certain parts of Boost like the proverbial plague, as anything can have a dependency on that damn global OS locale lock.
Related
In my code the following line gives me data that performs the task it's meant for:
const char *key = "\xf1`\xf8\a\\\x9cT\x82z\x18\x5\xb9\xbc\x80\xca\x15";
The problem is that it gets converted at compile time according to rules that I don't fully understand. How does "\x" work in a String?
What I'd like to do is to get the same result but from a string exactly like that fed in at run time. I have tried a lot of things and looked for answers but none that match closely enough for me to be able to apply.
I understand that \x denotes a hex number. But I don't know in which form that gets 'baked out' by the compiler (gcc).
What does that ` translate into?
Does the "\a" do something similar to "\x"?
This is indeed done by the compiler, but that functionality is not part of the standard library. That leaves you with three options:
dynamically write a C++ source file containing the string, whose program writes the string to its standard output. Compile it and (provided popen is available) execute it from your main program and read its output. Pretty ugly, isn't it...
use the source of an existing compiler, or directly its internal libraries. Clang is probably a good starting point because it has been designed to be modular. But it could require a good amount of work to find where that damned specific point is coded and how to use that...
just mimic what the compiler does, and write your own parser by hand. It is not that hard, and it will teach you why tests are useful...
If it was not clear until here, I strongly urge you to use the third way ;-)
If you want to translate "escape" codes in strings that you get as input at run-time then you need to do it yourself, explicitly.
One way is to read the input into one string. Then copy the characters from that source string into a new destination string, one by one. If you see a backslash, discard it and fetch the next character; if it's an x, you can use e.g. std::stoi with base 16 to convert the hex digits that follow into the corresponding integer value, and append the character with that value to the destination string (not its decimal text, which is what std::to_string or an output string stream's << on an int would produce).
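The copy loop described above could be sketched roughly as follows. The names hex_value and decode_escapes are hypothetical, only a handful of simple escapes are handled, and octal escapes are left out. As an assumption modelled on the compiler's behaviour, \x consumes every hex digit that follows and the result is rejected if it does not fit in one byte.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: decodes \x.. and a few simple escapes at run time.
std::string decode_escapes(std::string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\') { out += in[i]; continue; }   // ordinary character
        if (++i == in.size()) throw std::invalid_argument("trailing backslash");
        switch (in[i]) {
            case 'a':  out += '\a'; break;
            case 'n':  out += '\n'; break;
            case 't':  out += '\t'; break;
            case '\\': out += '\\'; break;
            case '"':  out += '"';  break;
            case 'x': {
                int value = 0, digits = 0;
                // Greedily consume hex digits, as a compiler would.
                while (i + 1 < in.size() && hex_value(in[i + 1]) >= 0) {
                    value = value * 16 + hex_value(in[++i]);
                    if (value > 0xFF) throw std::out_of_range("hex escape too large");
                    ++digits;
                }
                if (digits == 0) throw std::invalid_argument("\\x without digits");
                out += static_cast<char>(value);   // append the byte itself
                break;
            }
            default: throw std::invalid_argument("unsupported escape");
        }
    }
    return out;
}
```

A backtick is not an escape at all, so it is copied through unchanged, and \a becomes the single alert character (value 7).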
So I've looked around for how to convert a string to a short and found a lot on how to convert a string to an integer. I would leave a question as a comment on those threads, but I don't have enough reputation. So, what I want to do is convert a string to a short, because the number should never go above three or below zero and shorts save memory (as far as I'm aware).
To be clear, I'm not referring to ASCII codes.
Another thing I want to be able to do is to check whether the conversion of the string to the short fails, because I'll be using a string which consists of a user's input.
I know I can do this with a while loop, but if there's a built in function to do this in C++ that would be just as, or more, efficient than a while loop, I would love to hear about it.
Basically, an std::stos function is missing for unknown reasons, but you can easily roll your own. Use std::stoi to convert to int, check value against short boundaries given by e.g. std::numeric_limits<short>, throw std::range_error if it's not in range, otherwise return that value. There.
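A minimal sketch of such a roll-your-own function, following the steps just described. The name stos is hypothetical, and rejecting trailing characters is an extra assumption on my part rather than something std::stoi does by itself.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper named after the "missing" std::stos.
short stos(const std::string& text) {
    std::size_t used = 0;
    // std::stoi throws std::invalid_argument or std::out_of_range on failure.
    const int value = std::stoi(text, &used);
    if (used != text.size()) {
        throw std::invalid_argument("stos: trailing characters");
    }
    if (value < std::numeric_limits<short>::min() ||
        value > std::numeric_limits<short>::max()) {
        throw std::range_error("stos: value does not fit in short");
    }
    return static_cast<short>(value);
}
```

Because every failure surfaces as an exception, a caller validating user input only needs one try block around the call.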
If you already have the Boost library installed you might use boost::lexical_cast for convenience, but otherwise I would avoid it (mainly for the verbosity and library dependency, and it's also a little inefficient).
Earlier boost::lexical_cast was known for not being very efficient, I believe because it was based internally on stringstreams, but as reported in comments here the modern version is faster than conversion via stringstream, and for that matter than via scanf.
An efficient way is to use boost::lexical_cast:
short myShort = boost::lexical_cast<short>(myString);
You will need to install boost library and the following include: #include <boost/lexical_cast.hpp>
You should catch bad_lexical_cast in case the cast fails:
try
{
short myShort = boost::lexical_cast<short>(myString);
}
catch (const boost::bad_lexical_cast&)
{
// Do something
}
You can also use sscanf with the %hi format specifier.
Example:
short port;
char szPort[] = "80";
sscanf(szPort, "%hi", &port);
the number should never go above three or below zero
If you really really need to save memory, then this will also fit in a char (regardless whether char is signed or unsigned).
Another 'extreme' trick: if you can trust there are no weird things like "002" then what you have is a single character string. If that is the case, and you really really need performance, try:
char result = (char)( *ptr_c_string - '0' );
I have been using sscanf() in my parser to read some CSS-like tokens such as color codes; some variations are below:
#FDC69A
#ff0
orange
Example code:
unsigned int r, g, b; // %x expects a pointer to unsigned int
const char* s = "#FAFAFA";
if(sscanf(s, "#%02x%02x%02x", &r, &g, &b) == 3){
// color code ok
}
My preferred language for the current project is C++. I think sscanf can be faster than regular character-by-character parsing, and the overall code will be minimal and less bug-prone, though it may have portability issues across different compilers.
One thing I noticed is that popular open source projects do not use sscanf for tokenizing input buffers; instead they do it char by char. Is the sscanf-based parsing I am following a bad programming practice?
The biggest problem with sscanf (as well as scanf and fscanf) is that numeric overflow causes undefined behavior. For example:
const char *s = "999999999999999999999999999999";
int n;
sscanf(s, "%d", &n);
The C standard says exactly nothing about how this code behaves. It might set n to some arbitrary value, it might report an error, or it might crash.
(In practice, existing implementations are likely to behave sensibly, for some value of "sensibly".)
if(sscanf(s, "#%02x%02x%02x", &r, &g, &b) == 3) is robust... nothing to worry about there.
Historically, the big concern with those functions was that someone might specify a format flag that doesn't match the argument (e.g. %d not given an int*)... many modern compilers have enough validation to avoid accidents like that.
Still, C++ has iostreams, and people tend to use those for many I/O and parsing operations as the stream destructors automatically flush and close files and release descriptors, they're type safe, extensible to user-defined types, you can generally reuse parsing/output code for any type of stream, and they're often convenient. They'd be significantly more tedious for your specific test above though.
If you've noticed lots of OSS programs scanning character by character, it may be because:
They're doing more complex parsing - where they want to branch to different parsing logic after reading individual characters, or
In your code you have a firm expectation of what to expect, so it's reasonable to do a sscanf to test that, but if you were writing, say, a compiler it'd be too slow to try a huge if/else list of hundreds of sscanf attempts to recognise tokens.
Relevant for scanf, fscanf but not sscanf - avoid scanning too far so they can ungetc, which (from memory) is only portably guaranteed to work for 1 character.
NOTE: I saw the post What is the cin analougus of scanf formatted input? before asking this question, and it doesn't solve my problem. That post asks for the C++ way to do it, but as I mentioned already, the C++ way is sometimes inconvenient, and I have clear examples of that.
I am trying to read data from an istream object, and sometimes it is inconvenient to just use C++-style ways such as operator>>, e.g. the data are in special form 123:456 so you have to imbue to make ':' as space (which is very hacky, as opposed to %d:%d in scanf), or 00123 where you want to read as string and convert decimal instead of octal (as opposed to %d in scanf), and possibly many other cases.
The reason I chose istream as interface is because it can be derived and therefore more flexible. For example, we can create in-memory streams, or some customized streams that generated on the fly, etc. C-style FILE*, on the other hand, is very limited, at least in a standard-compliant way, on creating customized streams.
So my question is: is there a way to do scanf-like data extraction on an istream object? I think fscanf internally reads character by character from FILE* using fgetc, and istream provides such an interface too. So it would be possible to copy and paste the code of fscanf and replace the FILE* with the istream object, but that's very hacky. Is there a smarter and cleaner way, or is there some existing work on this?
Thanks.
You should never, under any circumstances, use scanf or its relatives for anything, for three reasons:
Many format strings, including for instance all the simple uses of %s, are just as dangerous as gets.
It is almost impossible to recover from malformed input, because scanf does not tell you how far in characters into the input it got when it hit something unexpected.
Numeric overflow triggers undefined behavior: yes, that means scanf is allowed to crash the entire program if a numeric field in the input has too many digits.
Prior to C++11, the C++ specification defined istream formatted input of numbers in terms of scanf, which means that last objection is very likely to apply to them as well! (In C++11 the specification is changed to use strto* instead and to do something predictable if that detects overflow.)
What you should do instead is: read entire lines of input into std::string objects with getline, hand-code logic to split them up into fields (I don't remember off the top of my head what the C++-string equivalent of strsep is, but I'm sure it exists) and then convert numeric strings to machine numbers with the strtol/strtod family of functions.
I cannot emphasize this enough: THE ONLY 100% RELIABLE WAY TO CONVERT STRINGS TO NUMBERS IN C OR C++, unless you are lucky enough to have a C++ runtime that is already C++11-conformant in this regard, IS WITH THE strto* FUNCTIONS, and you must use them correctly:
errno = 0;
result = strtoX(s, &ends, 10); // omit 10 for floats
if (s == ends || *ends || errno)
parse_error();
(The OpenBSD manpages, linked above, explain why you have to do this fairly convoluted thing.)
(If you're clever, you can use ends and some manual logic to skip that colon, instead of strsep.)
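As a sketch of the pattern above applied with std::strtol: the helper names parse_long and parse_pair are hypothetical. The second function illustrates the remark about using the end pointer to step over the colon in input such as 123:456. Base 10 is passed explicitly, so a field like 00123 is read as decimal rather than octal.

```cpp
#include <cerrno>
#include <cstdlib>
#include <optional>
#include <string>
#include <utility>

// Hypothetical helper: strict base-10 parse of an entire string.
std::optional<long> parse_long(const std::string& text) {
    const char* begin = text.c_str();
    char* end = nullptr;
    errno = 0;
    const long value = std::strtol(begin, &end, 10);
    // No digits consumed, trailing characters, or overflow (ERANGE) all fail.
    if (begin == end || *end != '\0' || errno != 0) return std::nullopt;
    return value;
}

// Hypothetical helper: parses "A:B" by resuming where the first strtol stopped.
std::optional<std::pair<long, long>> parse_pair(const std::string& text) {
    const char* p = text.c_str();
    char* end = nullptr;
    errno = 0;
    const long a = std::strtol(p, &end, 10);
    if (p == end || *end != ':' || errno != 0) return std::nullopt;
    p = end + 1;                       // step over the colon
    errno = 0;
    const long b = std::strtol(p, &end, 10);
    if (p == end || *end != '\0' || errno != 0) return std::nullopt;
    return std::make_pair(a, b);
}
```

Lines would typically come from std::getline on the istream, so the stream interface stays in place and only the field conversion uses the C functions.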
I do not recommend mixing C++ input/output with C input/output. Not that they are really incompatible, but they could simply interoperate incorrectly.
For example Oracle docs recommend not to mix it http://www.oracle.com/technetwork/articles/servers-storage-dev/mixingcandcpluspluscode-305840.html
But no one stops you from reading data into the buffer and parsing it with standard c functions like sscanf.
...
std::string curString;
int a, b;
...
std::getline(inputStream, curString);
int sscanfResult = sscanf(curString.c_str(), "%d:%d", &a, &b);
if (2 != sscanfResult)
throw "error";
...
But it won't help in situations where your stream is just one long contiguous sequence of symbols (like some string turned into a memory stream).
Making your own fscanf from scratch, or porting(?) the original CRT function, actually isn't the worst possible idea. Just make sure you test it thoroughly (low-level custom char manipulation has always been a source of pain in C).
I've never really tried Boost.Spirit, and such a parsing infrastructure could well be overkill for your project. But Boost libraries are usually well tested and designed. You could at least try it.
Based on @tmyklebu's comment, I implemented streamScanf which wraps istream as FILE* via fopencookie: https://github.com/likan999/codejam/blob/master/Common/StreamScanf.cpp
I'm wondering if there is a library like Boost Format, but which supports named parameters rather than positional ones. This is a common idiom in e.g. Python, where you have a context to format strings with that may or may not use all available arguments, e.g.
mouse_state = {}
mouse_state['button'] = 0
mouse_state['x'] = 50
mouse_state['y'] = 30
#...
"You clicked %(button)s at %(x)d,%(y)d." % mouse_state
"Targeting %(x)d, %(y)d." % mouse_state
Are there any libraries that offer the functionality of those last two lines? I would expect it to offer an API something like:
PrintFMap(string format, map<string, string> args);
In Googling I have found many libraries offering variations of positional parameters, but none that support named ones. Ideally the library has few dependencies so I can drop it easily into my code. C++ won't be quite as idiomatic for collecting named arguments, but probably someone out there has thought more about it than me.
Performance is important, in particular I'd like to keep memory allocations down (always tricky in C++), since this may be run on devices without virtual memory. But having even a slow one to start from will probably be faster than writing it from scratch myself.
The fmt library supports named arguments:
print("You clicked {button} at {x},{y}.",
arg("button", "b1"), arg("x", 50), arg("y", 30));
And as a syntactic sugar you can even (ab)use user-defined literals to pass arguments:
print("You clicked {button} at {x},{y}.",
"button"_a="b1", "x"_a=50, "y"_a=30);
For brevity the namespace fmt is omitted in the above examples.
Disclaimer: I'm the author of this library.
I've always been critical of C++ I/O (especially formatting) because in my opinion it is a step backward with respect to C. Formats need to be dynamic, and it makes perfect sense, for example, to load them from an external resource such as a file or a parameter.
However, I had never tried to actually implement an alternative before, and your question made me attempt it, investing some weekend hours in the idea.
Sure, the problem was more complex than I thought (for example, just the integer formatting routine is 200+ lines), but I think that this approach (dynamic format strings) is more usable.
You can download my experiment from this link (it's just a .h file) and a test program from this link (test is probably not the correct term, I used it just to see if I was able to compile).
The following is an example
#include "format.h"
#include <iostream>
using format::FormatString;
using format::FormatDict;
int main()
{
std::cout << FormatString("The answer is %{x}") % FormatDict()("x", 42);
return 0;
}
It is different from the boost.format approach because it uses named parameters and because
the format string and format dictionary are meant to be built separately (and, for
example, passed around). Also, I think that formatting options should be part of the
string (like printf) and not in the code.
FormatDict uses a trick for keeping the syntax reasonable:
FormatDict fd;
fd("x", 12)
("y", 3.141592654)
("z", "A string");
FormatString is instead just parsed from a const std::string& (I decided to preparse format strings but a slower but probably acceptable approach would be just passing the string and reparsing it each time).
The formatting can be extended for user defined types by specializing a conversion function template; for example
struct P2d
{
int x, y;
P2d(int x, int y)
: x(x), y(y)
{
}
};
namespace format {
template<>
std::string toString<P2d>(const P2d& p, const std::string& parms)
{
return FormatString("P2d(%{x}; %{y})") % FormatDict()
("x", p.x)
("y", p.y);
}
}
after that a P2d instance can be simply placed in a formatting dictionary.
Also it's possible to pass parameters to a formatting function by placing them between % and {.
For now I only implemented an integer formatting specialization that supports
Fixed size with left/right/center alignment
Custom filling char
Generic base (2-36), lower or uppercase
Digit separator (with both custom char and count)
Overflow char
Sign display
I've also added some shortcuts for common cases, for example
"%08x{hexdata}"
is a hex number with 8 digits padded with '0's.
"%026/2,8:{bindata}"
is a 24-bit binary number (as required by "/2") with digit separator ":" every 8 bits (as required by ",8:").
Note that the code is just an idea, and for example for now I just prevented copies when probably it's reasonable to allow storing both format strings and dictionaries (for dictionaries it's however important to give the ability to avoid copying an object just because it needs to be added to a FormatDict, and while IMO this is possible it's also something that raises non-trivial problems about lifetimes).
UPDATE
I've made a few changes to the initial approach:
Format strings can now be copied
Formatting for custom types is done using template classes instead of functions (this allows partial specialization)
I've added a formatter for sequences (two iterators). Syntax is still crude.
I've created a github project for it, with boost licensing.
The answer appears to be, no, there is not a C++ library that does this, and C++ programmers apparently do not even see the need for one, based on the comments I have received. I will have to write my own yet again.
Well I'll add my own answer as well, not that I know (or have coded) such a library, but to answer to the "keep the memory allocation down" bit.
As always I can envision some kind of speed / memory trade-off.
On the one hand, you can parse "Just In Time":
class Formater:
    def __init__(self, format): self._string = format
    def compute(self, context):
        for k, v in context.items():
            while self.__contains(k):
                left, variable, right = self.__extract(k)
                self._string = left + self.__replace(variable, v) + right
This way you don't keep a "parsed" structure at hand, and hopefully most of the time you'll just insert the new data in place (unlike Python, C++ strings are not immutable).
However it's far from being efficient...
On the other hand, you can build a fully constructed tree representing the parsed format. You will have several classes like: Constant, String, Integer, Real, etc... and probably some subclasses / decorators as well for the formatting itself.
I think however that the most efficient approach would be to have some kind of a mix of the two.
explode the format string into a list of Constant, Variable
index the variables in another structure (a hash table with open-addressing would do nicely, or something akin to Loki::AssocVector).
There you are: you're done with only 2 dynamically allocated arrays (basically). If you want to allow a same key to be repeated multiple times, simply use a std::vector<size_t> as a value of the index: good implementations should not allocate any memory dynamically for small sized vectors (VC++ 2010 doesn't for less than 16 bytes worth of data).
When evaluating the context itself, look up the instances. You then parse the formatter "just in time", check it against the current type of the value with which to replace it, and process the format.
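A rough sketch of this mixed scheme, under several assumptions of my own: the class name NamedFormat and the {name} placeholder syntax are hypothetical, values are plain strings with no per-variable formatting options, and std::map stands in for the open-addressing hash table suggested above.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch: the format is split once into constant and variable
// pieces, and an index maps each variable name to the pieces that use it.
class NamedFormat {
    struct Piece { std::string text; bool is_variable; };
    std::vector<Piece> pieces_;
    std::map<std::string, std::vector<std::size_t>> index_;

public:
    explicit NamedFormat(std::string_view format) {
        std::size_t pos = 0;
        while (pos < format.size()) {
            const std::size_t open = format.find('{', pos);
            if (open == std::string_view::npos) {       // trailing constant text
                pieces_.push_back({std::string(format.substr(pos)), false});
                break;
            }
            const std::size_t close = format.find('}', open);
            if (close == std::string_view::npos)
                throw std::invalid_argument("unterminated {name}");
            if (open > pos)
                pieces_.push_back({std::string(format.substr(pos, open - pos)), false});
            pieces_.push_back({std::string(format.substr(open + 1, close - open - 1)), true});
            index_[pieces_.back().text].push_back(pieces_.size() - 1);
            pos = close + 1;
        }
    }

    // Walks the context, fills the slots found through the index, then joins.
    std::string render(const std::map<std::string, std::string>& args) const {
        std::vector<const std::string*> slots(pieces_.size(), nullptr);
        for (const auto& [name, value] : args) {
            const auto found = index_.find(name);
            if (found == index_.end()) continue;         // unused argument
            for (std::size_t i : found->second) slots[i] = &value;
        }
        std::string out;
        for (std::size_t i = 0; i < pieces_.size(); ++i) {
            if (!pieces_[i].is_variable) out += pieces_[i].text;
            else if (slots[i] != nullptr) out += *slots[i];
            else throw std::out_of_range("missing argument: " + pieces_[i].text);
        }
        return out;
    }
};
```

The same context map can be rendered through several NamedFormat objects, and arguments a given format does not mention are simply ignored.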
Pros and cons:
- Just In Time: you scan the string again and again
- One Parse: requires a lot of dedicated classes, possibly many allocations, but the format is validated on input. Like Boost it may be reused.
- Mix: more efficient, especially if you don't replace some values (allow some kind of "null" value), but delaying the parsing of the format delays the reporting of errors.
Personally I would go for the One Parse scheme, trying to keep the allocations down using boost::variant and the Strategy Pattern as much as I could.
Given that Python itself is written in C and that formatting is such a commonly used feature, you might be able (ignoring copyright issues) to rip the relevant code from the Python interpreter and port it to use STL maps rather than Python's native dicts.
I've written a library for this purpose; check it out on GitHub.
Contributions are welcome.