C++ regex on partial data - c++

I have a callback function, which provides pointer to data and it's size. I don't know what size will be next time and which call will be the last. And I need to match incoming data with regex and save matches.
Something like that.
class data_filter
{
public:
data_filter(const std::string& re)
: re_(re)
{}
public:
// callback func. It will be called many times with data parts
void process(const char* data, const size_t len)
{
re_.match(data, len, m_); // if found match, add it to matches
}
public:
void print_matches()
{
for(size_t i = 0; i < m_.size(); ++i)
{
std::cout << m_[i] << std::endl;
}
}
private:
some_cool_regex re_;
cool_regex_matches m_;
};
If absolutely neccessary i can provide some fixed buffer for regex backtracking, but i would like to avoid it.
I already had a brief look at boost::regex with partial_match option. As far as i understood from a first glance it can provide such functionality, but user should manually deal with temporary buffer.
So, should i stick with boost or there are some libraries that match my needs closer?
Thanks.

Since, indeed, there could be a need for backtracking, your options for streaming are limited or non-existent.
Boost Spirit "solves" the same issue by using the multi_pass_iterator<> adapter around input iterators. The adapter is able to maintain a buffer of previously read data for backtracking, freeing it as soon as it is no longer required (e.g. due to an expectation point).
If you shared some details about "some cool regex" then I could probably show you how to do this.
UPDATE Just found this library: https://github.com/openresty/sregex
libsregex - A non-backtracking regex engine library for large data streams

Related

Preferred way of extracting and matching protobuf message typename from an Any package

I've been using Any to package dynamic messages for protobuf.
On the receiver end, I use Any.type_url() to match the enclosed message types.
Correct me if I'm wrong, knowing that .GetDescriptor() is unavailable with Any, I still want to make the matching less of a clutter. I tried to extract the message type with brute force like this:
MyAny pouch;
// unpacking pouch
// Assume the message is "type.googleapis.com/MyTypeName"
....
const char* msgPrefix = "type.googleapis.com/";
auto lenPrefix = strlen(msgPrefix);
const std::string& msgURL = pouch.msg().type_url();
MamStr msgName = msgURL.substr(lenPrefix, msgURL.size());
if (msgName == "MyTypeName") {
// do stuff ...
}
But I'm still wondering if there are cleaner way of skipping the prefix to get the "basename" of the type URL.
Thanks!
You could try
std::string getBaseName(std::string const & url) {
return url.substr(url.find_last_of("/\\") + 1);
}
if it suits you well.
Although there are some cases, it may not explode it correctly.
Let's say you have two params as basename: http://url.com/example/2
This will get the latest, which is 2...
If you are not looking for cross-platform support, you can always go for https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/splitpath-wsplitpath?view=vs-2019

check if something was serialized in std::ostream

Is there an easy way to check if something was serialized in stl::ostream. I am looking for something like:
some preparation
// ... a very complex code that may result in adding or not to the stream,
// that I will prefer not to change
check if the stream has something added
Note that this will need to works recursively. Is using register_callback is a good idea or there is easier way?
First the immediate question: register_callback() is intended to deal with appropriate copying and releasing of resources stored in pword() and will have operations only related to that (i.e., copying, assigning, and releasing plus observing std::locale changes). So, no, that won't help you at all.
What you can do, however, is to create a filtering stream buffer which observes if there was a write to the stream, e.g., something like this:
class changedbuf: std::streambuf {
std::streambuf* d_sbuf;
bool d_changed;
int_type overflow(int_type c) {
if (!traits_type::eq_int_type(c, traits_type::eof())) {
this->d_changed = true;
}
return this->d_sbuf->sputc(c);
}
public:
changedbuf(std::streambuf* sbuf): d_sbuf(d_sbuf), d_changed() {}
bool changed() const { return this->d_changed; }
}
You can use this in place of the std::ostream you already have, e.g.:
void f(std::ostream& out) {
changedbuf changedbuf(out.rdbuf());
std::ostream changedout(&changedbuf);
// use changedout instead of out; if you need to use a global objects, you'd
// replace/restore the used stream buffer using the version of rdbuf() taking
// an argument
if (changedbuf.change()) {
std::cout << "there was a change\n";
}
}
A real implementation would actually provide a buffer and deal with proper flushing (i.e., override sync()) and sequence output (i.e., override xsputn()). However, the above version is sufficient as a proof-of-concept.
Others are likely to suggest the use of std::ostringstream. Depending on the amount of data written, this can easily become a performance hog, especially compared to an advanced version of changedbuf which appropriately deals with buffering.
Are you passing the stream into the complex code, or is it globally visible? Can it be any kind of ostream or can you constrain the type to ofstream or ostringstream?
You may be able to use tellp to determine whether the file position has changed since your preparation code, if your ostream type supports it (such as with most fstreams). Or, if you're passing the stream in, you could pass an empty ostringstream in and check that it's not empty when the string is extracted to be printed out.
It's not entirely obvious which solution, if any, would be appropriate for you without knowing more about the context of your code and the specifics of your problem. The best answer may be to return (or set as a by-reference out-parameter) a flag indicating whether the stream was inserted into.

boost::program_options : iterating over and printing all options

I have recently started to use boost::program_options and found it to be highly convenient. That said, there is one thing missing that I was unable to code myself in a good way:
I would like to iterate over all options that have been collected in a boost::program_options::variables_map to output them on the screen. This should become a convenience function, that I can simply call to list all options that were set without the need to update the function when I add new options or for each program.
I know that I can check and output individual options, but as said above, this should become a general solution that is oblivious to the actual options. I further know that I can iterate over the contents of variables_map since it is simply an extended std::map. I could then check for the type containd in the stored boost::any variable and use .as<> to convert it back to the appropriate type. But this would mean coding a long switch block with one case for each type. And this doesn't look like good coding style to me.
So the question is, is there a better way to iterate over these options and output them?
As #Rost previously mentioned, Visitor pattern is a good choice here. To use it with PO you need to use notifiers for your options in such a way that if option is passed notifier will fill an entry in your set of boost::variant values. The set should be stored separately. After that you could iterate over your set and automatically process actions (i.e. print) on them using boost::apply_visitor.
For visitors, inherit from boost::static_visitor<>
Actually, I made Visitor and generic approach use more broad.
I created a class MyOption that holds description, boost::variant for value and other options like implicit, default and so on. I fill a vector of objects of the type MyOption in the same way like PO do for their options (see boost::po::options_add()) via templates. In the moment of passing std::string() or double() for boosts::variant initialization you fill type of the value and other things like default, implicit.
After that I used Visitor pattern to fill boost::po::options_description container since boost::po needs its own structures to parse input command line. During the filling I set notifyer for each option - if it will be passed boost::po will automatically fill my original object of MyOption.
Next you need to execute po::parse and po::notify. After that you will be able to use already filled std::vector<MyOption*> via Visitor pattern since it holds boost::variant inside.
What is good about all of this - you have to write your option type only once in the code - when filling your std::vector<MyOption*>.
PS. if using this approach you will face a problem of setting notifyer for an option with no value, refer to this topic to get a solution: boost-program-options: notifier for options with no value
PS2. Example of code:
std::vector<MyOptionDef> options;
OptionsEasyAdd(options)
("opt1", double(), "description1")
("opt2", std::string(), "description2")
...
;
po::options_descripton boost_descriptions;
AddDescriptionAndNotifyerForBoostVisitor add_decr_visitor(boost_descriptions);
// here all notifiers will be set automatically for correct work with each options' value type
for_each(options.begin(), options.end(), boost::apply_visitor(add_descr_visitor));
It's a good case to use Visitor pattern. Unfortunately boost::any doesn't support Visitor pattern like boost::variant does. Nevertheless there are some 3rd party approaches.
Another possible idea is to use RTTI: create map of type_info of known types mapped to type handler functor.
Since you are going to just print them out anyway you can grab original string representation when you parse. (likely there are compiler errors in the code, I ripped it out of my codebase and un-typedefed bunch of things)
std::vector<std::string> GetArgumentList(const std::vector<boost::program_options::option>& raw)
{
std::vector<std::string> args;
BOOST_FOREACH(const boost::program_options::option& option, raw)
{
if(option.unregistered) continue; // Skipping unknown options
if(option.value.empty())
args.push_back("--" + option.string_key));
else
{
// this loses order of positional options
BOOST_FOREACH(const std::string& value, option.value)
{
args.push_back("--" + option.string_key));
args.push_back(value);
}
}
}
return args;
}
Usage:
boost::program_options::parsed_options parsed = boost::program_options::command_line_parser( ...
std::vector<std::string> arguments = GetArgumentList(parsed.options);
// print
I was dealing with just this type of problem today. This is an old question, but perhaps this will help people who are looking for an answer.
The method I came up with is to try a bunch of as<...>() and then ignore the exception. It's not terribly pretty, but I got it to work.
In the below code block, vm is a variables_map from boost program_options. vit is an iterator over vm, making it a pair of std::string and boost::program_options::variable_value, the latter being a boost::any. I can print the name of the variable with vit->first, but vit->second isn't so easy to output because it is a boost::any, ie the original type has been lost. Some should be cast as a std::string, some as a double, and so on.
So, to cout the value of the variable, I can use this:
std::cout << vit->first << "=";
try { std::cout << vit->second.as<double>() << std::endl;
} catch(...) {/* do nothing */ }
try { std::cout << vit->second.as<int>() << std::endl;
} catch(...) {/* do nothing */ }
try { std::cout << vit->second.as<std::string>() << std::endl;
} catch(...) {/* do nothing */ }
try { std::cout << vit->second.as<bool>() << std::endl;
} catch(...) {/* do nothing */ }
I only have 4 types that I use to get information from the command-line/config file, if I added more types, I would have to add more lines. I'll admit that this is a bit ugly.

Several specific methods or one generic method?

this is my first question after long time checking on this marvelous webpage.
Probably my question is a little silly but I want to know others opinion about this. What is better, to create several specific methods or, on the other hand, only one generic method? Here is an example...
unsigned char *Method1(CommandTypeEnum command, ParamsCommand1Struct *params)
{
if(params == NULL) return NULL;
// Construct a string (command) with those specific params (params->element1, ...)
return buffer; // buffer is a member of the class
}
unsigned char *Method2(CommandTypeEnum command, ParamsCommand2Struct *params)
{
...
}
unsigned char *Method3(CommandTypeEnum command, ParamsCommand3Struct *params)
{
...
}
unsigned char *Method4(CommandTypeEnum command, ParamsCommand4Struct *params)
{
...
}
or
unsigned char *Method(CommandTypeEnum command, void *params)
{
switch(command)
{
case CMD_1:
{
if(params == NULL) return NULL;
ParamsCommand1Struct *value = (ParamsCommand1Struct *) params;
// Construct a string (command) with those specific params (params->element1, ...)
return buffer;
}
break;
// ...
default:
break;
}
}
The main thing I do not really like of the latter option is this,
ParamsCommand1Struct *value = (ParamsCommand1Struct *) params;
because "params" could not be a pointer to "ParamsCommand1Struct" but a pointer to "ParamsCommand2Struct" or someone else.
I really appreciate your opinions!
General Answer
In Writing Solid Code, Steve Macguire's advice is to prefer distinct functions (methods) for specific situations. The reason is that you can assert conditions that are relevant to the specific case, and you can more easily debug because you have more context.
An interesting example is the standard C run-time's functions for dynamic memory allocation. Most of it is redundant, as realloc can actually do (almost) everything you need. If you have realloc, you don't need malloc or free. But when you have such a general function, used for several different types of operations, it's hard to add useful assertions and it's harder to write unit tests, and it's harder to see what's happening when debugging. Macquire takes it a step farther and suggests that, not only should realloc just do _re_allocation, but it should probably be two distinct functions: one for growing a block and one for shrinking a block.
While I generally agree with his logic, sometimes there are practical advantages to having one general purpose method (often when operations is highly data-driven). So I usually decide on a case by case basis, with a bias toward creating very specific methods rather than overly general purpose ones.
Specific Answer
In your case, I think you need to find a way to factor out the common code from the specifics. The switch is often a signal that you should be using a small class hierarchy with virtual functions.
If you like the single method approach, then it probably should be just a dispatcher to the more specific methods. In other words, each of those cases in the switch statement simply call the appropriate Method1, Method2, etc. If you want the user to see only the general purpose method, then you can make the specific implementations private methods.
Generally, it's better to offer separate functions, because they by their prototype names and arguments communicate directly and visibly to the user that which is available; this also leads to more straightforward documentation.
The one time I use a multi-purpose function is for something like a query() function, where a number of minor query functions, rather than leading to a proliferation of functions, are bundled into one, with a generic input and output void pointer.
In general, think about what you're trying to communicate to the API user by the API prototypes themselves; a clear sense of what the API can do. He doesn't need excessive minutae; he does need to know the core functions which are the entire point of having the API in the first place.
First off, you need to decide which language you are using. Tagging the question with both C and C++ here makes no sense. I am assuming C++.
If you can create a generic function then of course that is preferable (why would you prefer multiple, redundant functions?) The question is; can you? However, you seem to be unaware of templates. We need to see what you have omitted here to tell if you if templates are suitable however:
// Construct a string (command) with those specific params (params->element1, ...)
In the general case, assuming templates are appropriate, all of that turns into:
template <typename T>
unsigned char *Method(CommandTypeEnum command, T *params) {
// more here
}
On a side note, how is buffer declared? Are you returning a pointer to dynamically allocated memory? Prefer RAII type objects and avoid dynamically allocating memory like that if so.
If you are using C++ then I would avoid using void* as you don't really need to. There is nothing wrong with having multiple methods. Note that you don't actually have to rename the function in your first set of examples - you can just overload a function using different parameters so that there is a separate function signature for each type. Ultimately, this kind of question is very subjective and there are a number of ways of doing things. Looking at your functions of the first type, you would perhaps be well served by looking into the use of templated functions
You could create a struct. That's what I use to handle console commands.
typedef int (* pFunPrintf)(const char*,...);
typedef void (CommandClass::*pKeyFunc)(char *,pFunPrintf);
struct KeyCommand
{
const char * cmd;
unsigned char cmdLen;
pKeyFunc pfun;
const char * Note;
long ID;
};
#define CMD_FORMAT(a) a,(sizeof(a)-1)
static KeyCommand Commands[]=
{
{CMD_FORMAT("one"), &CommandClass::CommandOne, "String Parameter",0},
{CMD_FORMAT("two"), &CommandClass::CommandTwo, "String Parameter",1},
{CMD_FORMAT("three"), &CommandClass::CommandThree, "String Parameter",2},
{CMD_FORMAT("four"), &CommandClass::CommandFour, "String Parameter",3},
};
#define AllCommands sizeof(Commands)/sizeof(KeyCommand)
And the Parser function
void CommandClass::ParseCmd( char* Argcommand )
{
unsigned int x;
for ( x=0;x<AllCommands;x++)
{
if(!memcmp(Commands[x].cmd,Argcommand,Commands[x].cmdLen ))
{
(this->*Commands[x].pfun)(&Argcommand[Commands[x].cmdLen],&::printf);
break;
}
}
if(x==AllCommands)
{
// Unknown command
}
}
I use a thread safe printf pPrintf, so ignore it.
I don't really know what you want to do, but in C++ you probably should derive multiple classes from a Formatter Base class like this:
class Formatter
{
virtual void Format(unsigned char* buffer, Command command) const = 0;
};
class YourClass
{
public:
void Method(Command command, const Formatter& formatter)
{
formatter.Format(buffer, command);
}
private:
unsigned char* buffer_;
};
int main()
{
//
Params1Formatter formatter(/*...*/);
YourClass yourObject;
yourObject.Method(CommandA, formatter);
// ...
}
This removes the resposibility to handle all that params stuff from your class and makes it closed for changes. If there will be new commands or parameters during further development you don't have to modifiy (and eventually break) existing code but add new classes that implement the new stuff.
While not full answer this should guide you in correct direction: ONE FUNCTION ONE RESPONSIBILITY. Prefer the code where it is responsible for one thing only and does it well. The code whith huge switch statement (which is not bad by itself) where you need cast void * to some other type is a smell.
By the way I hope you do realise that according to standard you can only cast from void * to <type> * only when the original cast was exactly from <type> * to void *.

C++ string parsing ideas

I have the output of another program that was more intended to be human readable than machine readable, but yet am going to parse it anyway. It's nothing too complex.
Yet, I'm wondering what the best way to do this in C++ is. This is more of a 'general practice' type of question.
I looked into Boost.Spirit, and even got it working a bit. That thing is crazy! If I was designing the language that I was reading, it might be the right tool for the job. But as it is, given its extreme compile-times, the several pages of errors from g++ when I do anything wrong, it's just not what I need. (I don't have much need for run-time performance either.)
Thinking about using C++ operator <<, but that seems worthless. If my file has lines like "John has 5 widgets", and others "Mary works at 459 Ramsy street" how can I even make sure I have a line of the first type in my program, and not the second type? I have to read the whole line and then use things like string::find and string::substr I guess.
And that leaves sscanf. It would handle the above cases beautifully
if( sscanf( str, "%s has %d widgets", chararr, & intvar ) == 2 )
// then I know I matched "foo has bar" type of string,
// and I now have the parameters too
So I'm just wondering if I'm missing something or if C++ really doesn't have much built-in alternative.
sscanf does indeed sound like a pretty good fit for your requirements:
you may do some redundant parsing, but you don't have performance requirements prohibiting that
it localises the requirements on the different input words and allows parsing of non-string values directly into typed variables, making the different input formats easy to understand
A potential problem is that it's error prone, and if you have lots of oft-changing parsing phrases then the testing effort and risk can be worrying. Keeping the spirit of sscanf but using istream for type safety:
#include <iostream>
#include <sstream>
// Str captures a string literal and consumes the same from an istream...
// (for non-literals, better to have `std::string` member to guarantee lifetime)
class Str
{
public:
Str(const char* p) : p_(p) { }
const char* c_str() const { return p_; }
private:
const char* p_;
};
bool operator!=(const Str& lhs, const Str& rhs)
{
return strcmp(lhs.c_str(), rhs.c_str()) != 0;
}
std::istream& operator>>(std::istream& is, const Str& str)
{
std::string s;
if (is >> s)
if (s.c_str() != str)
is.setstate(std::ios_base::failbit);
return is;
}
// sample usage...
int main()
{
std::stringstream is("Mary has 4 cats");
int num_dogs, num_cats;
if (is >> Str("Mary") >> Str("has") >> num_dogs >> Str("dogs"))
{
std::cout << num_dogs << " dogs\n";
}
else if (is.clear(), is.seekg(0), // "reset" the stream...
(is >> Str("Mary") >> Str("has") >> num_cats >> Str("cats")))
{
std::cout << num_cats << " cats\n";
}
}
The GNU tools flex and bison are very powerful tools you could use that are along the lines of Spirit but (according to some people) easier to use, partially because the error reporting is a bit better since the tools have their own compilers. This, or Spirit, or some other parser generator, is the "correct" way to go with this because it affords you the greatest flexibility in your approach.
If you're thinking about using strtok, you might want to instead take a look at stringstream, which splits on whitespace and lets you do some nice formatting conversions between strings, primitives, etc. It can also be plugged into the STL algorithms, and avoids all the messy details of raw C-style string memory management.
I've written extensive parsing code in C++. It works just great for that, but I wrote the code myself and didn't rely on more general code written by someone else. C++ doesn't come with extensive code already written, but it's a great language to write such code in.
I'm not sure what your question is beyond just that you'd like to find code someone has already written that will do what you need. Part of the problem is that you haven't really described what you need, or asked a question for that matter.
If you can make the question more specific, I'd be happy to try and offer a more specific answer.
I've used Boost.Regex (Which I think is also tr1::regex). Easy to use.
there is always strtok() I suppose
Have a look at strtok.
Depending on exactly what you want to parse, you may well want a regular expression library.
See msdn or earlier question.
Personally, again depending the exact format, I'd consider using perl to do an initial conversion into a more machine readable format (E.g. variable record CSV) and then import into C++ much more easily.
If sticking to C++, you need to:
Identify a record - hopefully just a
line
Determine the type of the record - use regex
Parse the record - scanf is fine
A base class on the lines of:
class Handler
{
public:
Handler(const std::string& regexExpr)
: regex_(regexExpr)
{}
bool match(const std::string& s)
{
return std::tr1::regex_match(s,regex_);
}
virtual bool process(const std::string& s) = 0;
private:
std::tr1::basic_regex<char> regex_;
};
Define a derived class for each record type, stick an instance of each in a set and search for matches.
class WidgetOwner : public Handler
{
public:
WidgetOwner()
: Handler(".* has .* widgets")
{}
virtual bool process(const std::string& s)
{
char name[32];
int widgets= 0;
int fieldsRead = sscanf( s.c_str(), "%32s has %d widgets", name, & widgets) ;
if (fieldsRead == 2)
{
std::cout << "Found widgets in " << s << std::endl;
}
return fieldsRead == 2;
}
};
struct Pred
{
Pred(const std::string& record)
: record_(record)
{}
bool operator()(Handler* handler)
{
return handler->match(record_);
}
std::string record_;
};
std::set<Handler*> handlers_;
handlers_.insert(new WidgetOwner);
handlers_.insert(new WorkLocation);
Pred pred(line);
std::set<Handler*>::iterator handlerIt =
std::find_if(handlers_.begin(), handlers_.end(), pred);
if (handlerIt != handlers_.end())
(*handlerIt)->process(line);