C++ string parsing ideas


I have the output of another program that was intended to be human readable rather than machine readable, but I'm going to parse it anyway. It's nothing too complex.
Yet, I'm wondering what the best way to do this in C++ is. This is more of a 'general practice' type of question.
I looked into Boost.Spirit, and even got it working a bit. That thing is crazy! If I was designing the language that I was reading, it might be the right tool for the job. But as it is, given its extreme compile-times, the several pages of errors from g++ when I do anything wrong, it's just not what I need. (I don't have much need for run-time performance either.)
I thought about using the C++ stream extraction operator >>, but that seems worthless. If my file has lines like "John has 5 widgets", and others like "Mary works at 459 Ramsy street", how can I even make sure I have a line of the first type in my program, and not the second type? I guess I have to read the whole line and then use things like string::find and string::substr.
And that leaves sscanf. It would handle the above cases beautifully:
if( sscanf( str, "%s has %d widgets", chararr, & intvar ) == 2 )
// then I know I matched "foo has bar" type of string,
// and I now have the parameters too
So I'm just wondering if I'm missing something or if C++ really doesn't have much built-in alternative.

sscanf does indeed sound like a pretty good fit for your requirements:
you may do some redundant parsing, but you don't have performance requirements prohibiting that
it localises the requirements on the different input words and allows parsing of non-string values directly into typed variables, making the different input formats easy to understand
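As a minimal sketch of that approach (the function name parse_widget_line and the 31-character limit on the name are assumptions made for illustration), each line format gets one sscanf call. One caveat is worth showing: the return value of sscanf counts only the conversions, so literal text after the last conversion (here " widgets") is not verified by the == 2 test alone. A trailing %n, which is only reached if everything before it matched, closes that gap.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: true only if the whole line looks like
// "<name> has <count> widgets".
bool parse_widget_line(const std::string& line, std::string& name, int& count)
{
    char buf[32];        // %31s leaves room for the terminating '\0'
    int value = 0;
    int consumed = -1;   // stays -1 unless sscanf actually reaches %n

    const int fields = std::sscanf(line.c_str(), "%31s has %d widgets%n",
                                   buf, &value, &consumed);
    if (fields != 2 || consumed < 0)
        return false;    // a conversion failed, or " widgets" did not match
    if (static_cast<std::size_t>(consumed) != line.size())
        return false;    // extra text after "widgets"

    name = buf;
    count = value;
    return true;
}
```

With this check, "John has 5 gadgets" is rejected even though both conversions succeed, and a line of another type simply falls through to the next format you try.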
A potential problem is that it's error prone, and if you have lots of oft-changing parsing phrases then the testing effort and risk can be worrying. Keeping the spirit of sscanf but using istream for type safety:
#include <cstring>   // strcmp
#include <iostream>
#include <sstream>
#include <string>
// Str captures a string literal and consumes the same from an istream...
// (for non-literals, better to have `std::string` member to guarantee lifetime)
class Str
{
public:
Str(const char* p) : p_(p) { }
const char* c_str() const { return p_; }
private:
const char* p_;
};
bool operator!=(const Str& lhs, const Str& rhs)
{
return strcmp(lhs.c_str(), rhs.c_str()) != 0;
}
std::istream& operator>>(std::istream& is, const Str& str)
{
std::string s;
if (is >> s)
if (s.c_str() != str)
is.setstate(std::ios_base::failbit);
return is;
}
// sample usage...
int main()
{
std::stringstream is("Mary has 4 cats");
int num_dogs, num_cats;
if (is >> Str("Mary") >> Str("has") >> num_dogs >> Str("dogs"))
{
std::cout << num_dogs << " dogs\n";
}
else if (is.clear(), is.seekg(0), // "reset" the stream...
(is >> Str("Mary") >> Str("has") >> num_cats >> Str("cats")))
{
std::cout << num_cats << " cats\n";
}
}

The tools flex and GNU bison are very powerful and along the lines of Spirit, but (according to some people) easier to use, partially because the error reporting is a bit better since the tools have their own compilers. This, or Spirit, or some other parser generator, is the "correct" way to go with this because it affords you the greatest flexibility in your approach.
If you're thinking about using strtok, you might want to instead take a look at stringstream, which splits on whitespace and lets you do some nice formatting conversions between strings, primitives, etc. It can also be plugged into the STL algorithms, and avoids all the messy details of raw C-style string memory management.
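A minimal sketch of the stringstream idea, assuming the "John has 5 widgets" line format from the question (the function name parse_widget_line is made up for illustration): the stream splits on whitespace and converts the number, and the fixed words are compared afterwards.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: splits the line on whitespace and checks the shape
// "<name> has <count> widgets". On failure the out-parameters may have
// been partially overwritten.
bool parse_widget_line(const std::string& line, std::string& name, int& count)
{
    std::istringstream in(line);
    std::string verb, noun, extra;

    if (!(in >> name >> verb >> count >> noun))
        return false;                 // too few tokens, or count is not a number
    if (in >> extra)
        return false;                 // unexpected trailing tokens
    return verb == "has" && noun == "widgets";
}
```

Because the extraction into an int fails on a word such as "at", a line like "Mary works at 459 Ramsy street" is rejected without any manual character handling.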

I've written extensive parsing code in C++. It works just great for that, but I wrote the code myself and didn't rely on more general code written by someone else. C++ doesn't come with extensive code already written, but it's a great language to write such code in.
I'm not sure what your question is beyond just that you'd like to find code someone has already written that will do what you need. Part of the problem is that you haven't really described what you need, or asked a question for that matter.
If you can make the question more specific, I'd be happy to try and offer a more specific answer.

I've used Boost.Regex (Which I think is also tr1::regex). Easy to use.
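For illustration, here is a small sketch using std::regex, the C++11 standard-library successor of tr1::regex (Boost.Regex offers a very similar interface). The pattern and the function name match_widgets are assumptions based on the example lines in the question, not something from this answer.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: recognises "<name> has <count> widgets" and
// extracts both fields through capture groups.
bool match_widgets(const std::string& line, std::string& name, int& count)
{
    static const std::regex re(R"((\w+) has (\d+) widgets)");
    std::smatch m;
    if (!std::regex_match(line, m, re))   // the whole line must match
        return false;

    name = m[1].str();
    count = std::stoi(m[2].str());        // throws std::out_of_range for huge numbers
    return true;
}
```

One regex per line type lets you both identify the kind of line and pull out its parameters in a single step.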

there is always strtok() I suppose

Have a look at strtok.

Depending on exactly what you want to parse, you may well want a regular expression library.
See msdn or earlier question.
Personally, again depending on the exact format, I'd consider using Perl to do an initial conversion into a more machine-readable format (e.g. variable record CSV) and then import it into C++ much more easily.
If sticking to C++, you need to:
Identify a record - hopefully just a line
Determine the type of the record - use regex
Parse the record - scanf is fine
A base class on the lines of:
class Handler
{
public:
Handler(const std::string& regexExpr)
: regex_(regexExpr)
{}
bool match(const std::string& s)
{
return std::tr1::regex_match(s,regex_);
}
virtual bool process(const std::string& s) = 0;
virtual ~Handler() {} // handlers are owned and used through Handler*
private:
std::tr1::basic_regex<char> regex_;
};
Define a derived class for each record type, stick an instance of each in a set and search for matches.
class WidgetOwner : public Handler
{
public:
WidgetOwner()
: Handler(".* has .* widgets")
{}
virtual bool process(const std::string& s)
{
char name[32];
int widgets= 0;
// width 31, not 32: %s also writes a terminating '\0' into name[32]
int fieldsRead = sscanf( s.c_str(), "%31s has %d widgets", name, & widgets) ;
if (fieldsRead == 2)
{
std::cout << "Found widgets in " << s << std::endl;
}
return fieldsRead == 2;
}
};
struct Pred
{
Pred(const std::string& record)
: record_(record)
{}
bool operator()(Handler* handler)
{
return handler->match(record_);
}
std::string record_;
};
std::set<Handler*> handlers_;
handlers_.insert(new WidgetOwner);
handlers_.insert(new WorkLocation);
Pred pred(line);
std::set<Handler*>::iterator handlerIt =
std::find_if(handlers_.begin(), handlers_.end(), pred);
if (handlerIt != handlers_.end())
(*handlerIt)->process(line);

Related

How to replace #include <optional>

I come to you today with another question that my brain can't process by itself:
I got a cpp file that includes optional as a header file. Unfortunately, this works only on c++17 forwards, and I'm trying to compile it in c++14. This cpp file uses optional like this
std::optional<std::string> GetStringPropertyValueFromJson(const std::string& Property, const web::json::value& Json)
{
if (Json.has_field(utility::conversions::to_string_t(Property)))
{
auto& propertyValue = Json.at(utility::conversions::to_string_t(Property));
if (propertyValue.is_string())
{
return std::optional<std::string>{utility::conversions::to_utf8string(propertyValue.as_string())};
}
}
return std::nullopt;
}
and then the function is used to assign values like this:
std::string tokenType = GetStringPropertyValueFromJson("token_type", responseContent).value_or("");
std::string accessToken = GetStringPropertyValueFromJson("access_token", responseContent).value_or("");
Please help me with a proper substitution for OPTIONAL. Thanks and much love
PS: From what i've read, you can replace optional with pair somehow in order to get a similar result, but I don't really know how exactly.
PPS: I am new here so any tips on how to better write my questions or anything else are greatly appreciated :)

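In C++14, some standard libraries ship the Library Fundamentals TS version of optional, which can be included with #include <experimental/optional>. The type then lives in namespace std::experimental (std::experimental::optional, std::experimental::nullopt), so the code needs those names instead of std::optional and std::nullopt.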

Change your method signature to
std::string GetStringPropertyValueFromJson(const std::string& Property, const web::json::value& Json)
and in the end just return the empty string
return "";
Then later in your code use it without std::optional::value_or:
std::string tokenType = GetStringPropertyValueFromJson("token_type", responseContent);
The logic is exactly the same and you don't use std::optional.
I see now your other question about possibility to use std::pair. Yes, you could also change your method to:
std::pair<std::string, bool> GetStringPropertyValueFromJson(const std::string& Property, const web::json::value& Json)
and return std::make_pair(valueFromJson, true) in case your json property has been found, or std::make_pair("", false) in case it was not. This also solves the problem with empty (but existing) json property.
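A minimal sketch of the pair variant, which compiles as C++14. To keep it self-contained, a std::map stands in for the web::json::value from the question; the alias Fields and the function name get_string_property are hypothetical.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the JSON object in the question.
using Fields = std::map<std::string, std::string>;

// Returns {value, true} if the property exists, {"", false} otherwise.
std::pair<std::string, bool> get_string_property(const std::string& property,
                                                 const Fields& fields)
{
    const auto it = fields.find(property);
    if (it != fields.end())
        return std::make_pair(it->second, true);
    return std::make_pair(std::string(), false);
}
```

The caller checks .second before using .first, which is what distinguishes a missing property from one that exists but holds an empty string.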

A poor man's optional string that should be sufficient for your code is this:
struct my_nullopt {};
struct my_optional {
private:
std::string value;
bool has_value = false;
public:
my_optional(my_nullopt) {}
my_optional(const std::string& v) : value(v),has_value(true) {}
std::string value_or(const std::string& v) const {
return has_value ? value : v;
}
};
It's a rather limited interface; for example, it is not possible to set the value after construction. But it appears that you do not need that.
Alternatively you can use boost/optional.
Note that the tip you got about using a pair is just what I did above: the value and a bool. std::pair is meant for cases where you cannot give better names than first and second (e.g. in generic code), but here it is simple to provide a better interface than std::pair does. With a pair holding the value first and the flag second, the value_or would be something along the lines of x.second ? x.first : "".
PS: Only in the end I realized that the code you present does not actually make use of what std::optional has to offer. As you are calling value_or(""), you cannot distinguish between a field with value "" or "" because the optional had no value. Because of that, the most simple solution is to use a plain std::string and return "" instead of std::nullopt.

Should single-use values be inline, function-level const variables, or class-level static const variables?

I have a function that performs a few string comparisons based on an argument. The strings that are being compared against are not used elsewhere. My instinct is to declare all of the strings as consts at the beginning of the function. However, they could just be inline, or declared on the class level. What is preferred?
Here is the gist of the function:
void MyType::parse(const wstring& input)
{
if (input == value1) { do1; }
else if (input == value2) { do2; }
}
Possible options for the values:
A. Inline values:
if (input == L"foo") { do1; }
B. Function-level values:
void MyType::parse(const wstring& input)
{
const wstring foo = L"foo";
if (input == foo) { do1; }
...
}
C. Class-level static constants:
.h
class MyType
{
private:
static const std::wstring kFoo;
};
.cpp
const wstring MyType::kFoo(L"foo");
...
void MyType::parse(const wstring& input)
{
if (input == kFoo) { do1; }
...
}
There are probably other options as well. Now, opinions differ as to readability, so while those are important, it's impossible to have a definite answer about that. So, when I ask, "which is preferred?" I'm asking about which performs best and has the lowest complexity.

My personal preference:
Keep the literal as close to the point of use as possible - so options A or B, but not C.
To choose between A and B, ask yourself "Does the literal itself make sense to someone else reading this code?". If it does, go for A and the code is still self-documenting. If it doesn't, option B gives you the opportunity to provide a meaningful name for the literal.
Examples:
// option A
void MyType::parse(const wstring& input)
{
if (input == L"QUIT") { quit(); }
else if (input == L"CONTINUE") { read_next(); }
}
// option B
void MyType::parse(const wstring& input)
{
static const wstring quit_command = L"*34!";
static const wstring continue_command = L"*17!";
if (input == quit_command) { quit(); }
else if (input == continue_command) { read_next(); }
}

What do you prefer?
They're not all equivalent of course.
If you give them (named) namespace or global scope, they can get external visibility, meaning you can define them in a separate TU and even change their definition without recompiling (just linking). If that TU is in a dynamic library that linking might be at runtime.
Also, function locals are usually not separately documented. However if these values have significant meaning, you might want to document them. If you don't wish to imply external linkage, make them file-static, e.g.:
namespace /*local to TU*/ {
/** @brief the file pattern is used when ...
*/
constexpr char const* file_pattern = "......";
}
That way, your class declaration doesn't leak implementation details and doesn't need to change if those details change.
So, it's up to you. But consider your needs for testing, maintainability and documentation.

The question you have to ask is: do you see this string being changed/modified in the future? If so, inline will not work. If you know this string will never be changed, received through a get() function, or modified, then I would say inline is best since you do not have to declare space in memory to hold the variable (and save a line of code).

I personally would go with the last variant. You have one single point of change if you require your "Magic String" to be modified, which is always a good idea. Even if you only use the string once, I would suggest that you still have one constant somewhere, otherwise you will do it the one way for this scenario, but the other way in another, which is inconsistent.
Just my 2 ct.

Generally, you should refer to your employer's coding standard.
If that does not explain which to use, ask your team lead.
If he/she does not care, your instincts are fine.
My experience has been varied ... I prefer the const std::string defined close to the first time used.
Edit: (some now missing comment apparently thought the above was incomplete)
Should single-use values be
inline function-level const variables,
class-level static const variables
As I previously stated;
I prefer the single-use value as close to the first time use as possible.
and thus not in the class-level constants (neither static nor otherwise)
I generally prefer them on their own line, so perhaps this means not in-line, and not anonymous. I suppose this is related to the idea of "no magic numbers in your code." (even though this is not a number.)

C++ regex on partial data

I have a callback function that provides a pointer to data and its size. I don't know what the size will be next time or which call will be the last. I need to match the incoming data against a regex and save the matches.
Something like that.
class data_filter
{
public:
data_filter(const std::string& re)
: re_(re)
{}
public:
// callback func. It will be called many times with data parts
void process(const char* data, const size_t len)
{
re_.match(data, len, m_); // if found match, add it to matches
}
public:
void print_matches()
{
for(size_t i = 0; i < m_.size(); ++i)
{
std::cout << m_[i] << std::endl;
}
}
private:
some_cool_regex re_;
cool_regex_matches m_;
};
If absolutely necessary I can provide some fixed buffer for regex backtracking, but I would like to avoid it.
I already had a brief look at boost::regex with the match_partial flag. As far as I understood at first glance, it can provide such functionality, but the user has to manage a temporary buffer manually.
So, should I stick with Boost, or are there libraries that match my needs more closely?
Thanks.

Since, indeed, there could be a need for backtracking, your options for streaming are limited or non-existent.
Boost Spirit "solves" the same issue by using the multi_pass<> iterator adapter around input iterators. The adapter is able to maintain a buffer of previously read data for backtracking, freeing it as soon as it is no longer required (e.g. due to an expectation point).
If you shared some details about "some cool regex" then I could probably show you how to do this.
UPDATE Just found this library: https://github.com/openresty/sregex
libsregex - A non-backtracking regex engine library for large data streams
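If a bounded buffer is acceptable after all, the manual bookkeeping can be sketched with std::regex alone. This is only an illustration under stated assumptions, not a general streaming matcher: the class name chunked_filter and the max_tail parameter are made up, the pattern must not match the empty string, no match may be longer than max_tail characters, and a match that ends before the end of the buffered data must not be extendable by later data (true for simple greedy patterns such as the one in the usage note, false for patterns with optional multi-character tails or lookahead).

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical sketch: buffers incoming chunks and records matches that
// can no longer change when more data arrives.
class chunked_filter
{
public:
    chunked_filter(const std::string& re, std::size_t max_tail)
        : re_(re), max_tail_(max_tail) {}

    // Callback: may be called many times with consecutive parts of the data.
    void process(const char* data, std::size_t len)
    {
        buf_.append(data, len);
        scan(false);
    }

    // Call once after the last chunk to flush a match touching the end.
    void finish()
    {
        scan(true);
        buf_.clear();
    }

    const std::vector<std::string>& matches() const { return matches_; }

private:
    void scan(bool last)
    {
        std::size_t consumed = 0;
        for (std::sregex_iterator it(buf_.begin(), buf_.end(), re_), end;
             it != end; ++it)
        {
            const std::size_t stop =
                static_cast<std::size_t>(it->position() + it->length());
            if (!last && stop == buf_.size())
                break;                      // might grow with the next chunk
            matches_.push_back(it->str());
            consumed = stop;
        }
        buf_.erase(0, consumed);            // drop everything already matched
        if (!last && buf_.size() > max_tail_)
            buf_.erase(0, buf_.size() - max_tail_);  // keep a bounded tail
    }

    std::regex re_;
    std::size_t max_tail_;
    std::string buf_;
    std::vector<std::string> matches_;
};
```

For example, with the pattern id=[0-9]+ and the chunks "id=12" and "34; id=5;", the first call records nothing because the candidate match touches the end of the buffer, and the second call records "id=1234" and "id=5".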

How can I get a std::string from an enum type?

I have some error codes that I would like represent as strings:
enum class ErrorCode
{
OK,
InvalidInput,
BadAlloc,
Other
};
I want to create an intuitive and simple way of getting strings that represent these errors. The simple solutions is:
std::string const ErrorCode2Str(ErrorCode errorCode)
{
switch (errorCode)
{
case ErrorCode::OK: // enumerators of an enum class must be qualified
return "OK";
case ErrorCode::InvalidInput:
return "Invalid Input";
case ErrorCode::BadAlloc:
return "Allocation Error";
case ErrorCode::Other:
return "Other Error";
default:
throw Something;
}
}
Is there a better way? Can I overload an ErrorCode to string cast somehow? Can I create a ErrorCode::str() function? Is there a standard solution to this problem?

One possibility is a map:
class to_str {
std::unordered_map<ErrorCode, std::string> strings;
public:
to_str() {
strings[ErrorCode::OK] = "Ok";
strings[ErrorCode::InvalidInput] = "Invalid Input";
strings[ErrorCode::BadAlloc] = "Allocation Error";
strings[ErrorCode::Other] = "Other";
}
std::string operator()(ErrorCode e) {
return strings[e];
}
};
// ...
auto e = foo(some_input);
if (e != ErrorCode::OK)
std::cerr << to_str()(e);
It's obviously not a huge difference, but I find it at least marginally more readable, and think it's probably a bit more maintainable in the long term.

There is no perfect solution to this, and a lot of libraries out there do what you are currently doing.
But if you want a different way of doing it, you can turn the error into a class like so:
#include <iostream>
#include <string>
class Error
{
public:
Error(int key, std::string message) : key(key), message(message){}
int key;
std::string message;
operator int(){return key;}
operator std::string(){ return message; }
bool operator==(Error rValue){return this->key == rValue.key; }
};
int main()
{
Error e(0, "OK");
int errorCode = e;
std::string errorMessage = e;
std::cout << errorCode << " " << errorMessage;
}

Although many simple ways to do an enum-to-string or string-to-enum conversion exist, I would like to consider a more general approach here.
Why doesn't C++ provide a native construct for it? There are mainly two reasons:
The first is technical: C++ doesn't have any reflection mechanism. Compiled symbols simply cease to exist (and become just numbers), and since they don't exist, you cannot get them back.
The second is more of a programming issue: enumerators are "shared" between the compiler and the programmer, while string literals are shared between the program and the end user, who may not be a programmer and may not speak English (and we don't know what language they speak).
A general way to solve the problem is therefore to split it into two parts: one at stream level, and the other at localization level.
What happens when you write std::cout << 42?
The operator<< implementation for int in fact calls use_facet<num_put<char> >(cout.getloc()).put(...), which forwards to the virtual do_put and eventually uses the numpunct facet that defines how to handle the decimal separator and digit group separators.
The standard way to handle enumerator output is therefore to implement an ostream << enumerator operator that gets a facet and calls a method on it to actually write that string.
Such a facet can then be implemented a number of times and made available for each supported language.
That's not easy or straightforward, but that's how C++ I/O is conceived.
Once you have done all that, the idiomatic way to get a string is to use a stringstream imbued with a locale that supports the facets required by all the enums and classes.
Too complex? Maybe. But if you think this is too complicated, stop teaching std::cout << "Hello world" << std::endl; and write a simpler "output library".
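As a rough sketch of just the stream-level half of that idea, the snippet below overloads operator<< for the ErrorCode enum from the question and obtains the string through a std::ostringstream. It deliberately leaves out the facet and localization layer, so the English texts are hard-coded; the helper name error_to_string and the "Unknown Error" fallback are assumptions.

```cpp
#include <ostream>
#include <sstream>
#include <string>

enum class ErrorCode { OK, InvalidInput, BadAlloc, Other };

// Stream-level output for the enum; a localized version would fetch the
// text from a facet of os.getloc() instead of hard-coding it here.
std::ostream& operator<<(std::ostream& os, ErrorCode e)
{
    switch (e)
    {
    case ErrorCode::OK:           return os << "OK";
    case ErrorCode::InvalidInput: return os << "Invalid Input";
    case ErrorCode::BadAlloc:     return os << "Allocation Error";
    case ErrorCode::Other:        return os << "Other Error";
    }
    return os << "Unknown Error"; // value outside the declared enumerators
}

// Hypothetical helper: any streamable enum can be turned into a string
// the same way.
std::string error_to_string(ErrorCode e)
{
    std::ostringstream out;
    out << e;
    return out.str();
}
```

With the operator in place, std::cerr << ErrorCode::BadAlloc works directly, and the string form is needed only where a std::string is actually required.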

C++ stream second insertion operator

Is it possible to define a second insertion operator to have two modes of outputting a class? Say e.g. one that outputs all members and one that just outputs some basic unique identifier that is grep-able in a log? If so, is there an operator that is usually chosen? I would guess as analogy to << one might use <<< if that is legal?
Thanks

If you want to output only the id, then the best idea is probably to provide a method to get the id in a type that's streamable (e.g. std::string id() const;). That's much more intuitive to other people working on the code than some strange operator use.
Your suggestion of <<< (it's not possible to create new operators in C++, but ignoring that for a moment) reveals that you're happy for there to be different code at the point of call. Therefore, the only benefit you'd get would be the saving of a few characters of source code; it isn't worth the obfuscation.
By way of contrast, there are situations where you want the same streaming notation to invoke different behaviours, such as switching between id-only and full data, or different representations such as tag/value, CSV, XML, and binary. These alternatives are usually best communicated by either:
using different stream types (e.g. XMLStream rather than std::ostream), and defining XMLStream& operator<<(XMLStream&, const My_Type&) etc, and/or
using stream manipulators - you can create your own - random Google result: http://www.informit.com/articles/article.aspx?p=171014&seqNum=2

There's no such thing already defined or in use by convention.
Also, you cannot define your own operators in C++, you have to use one of the ones already in the language and overloadable, and <<< isn't an operator in C++, so it is out anyway.
I'd strongly recommend you don't use some other operator for this. (See rule #1 here for a more thorough explanation.) If you have subtle differences between output operations, well-chosen functions names go a long way for making better code than unclear operators arbitrarily picked.

No. You can't define your own operators (<<< doesn't exist in C++). But you can define an id() method returning a string and output that.

There is no such operator as <<< in C++.
You are, however, free to implement, for example operator <(ostream&,Object&), which would do what you want. The problem is, code may get unreadable when you try to chain < and << together.

You can use operator | for instance. Another way of doing this is to define small tag classes for which the operator is overloaded; example (pretty simplistic but you get the point):
template< class T >
struct GrepTag
{
    GrepTag( const T& x ) : value( x ) {}
    T value;
};
template< class T >
GrepTag< T > MakeGrepTag( const T& x )
{
    return GrepTag< T >( x );
}
template< class T >
MyClass& MyClass::operator << ( const GrepTag< T >& g )
{
    //output g.value here
    return *this;
}
// std::string rather than a raw literal, so that T is a copyable type
MyClass() << MakeGrepTag( std::string( "text" ) );
Yet another way, more like the standard streams, is to use a tag as well but keep some state internally:
struct GrepTag
{
};
MyClass& MyClass::operator << ( const GrepTag& g )
{
    grepState = true;   // the next value is written in grep-able form
    return *this;
}
template< class T >
MyClass& MyClass::operator << ( const T& )
{
    if( grepState )
    {
        //output special
        grepState = false;
    }
    else
    {
        //output normal
    }
    return *this;
}
MyClass() << GrepTag() << "text";

You cannot define your own operators in C++. You can only overload those that exist.
So I recommend not using an operator for outputting a basic unique identifier that is grep-able in a log. This doesn't correspond to any existing operator role. Use a method instead, such as exportToLog().
