Parse string into and unknown amount of regex groups in C++ - c++

I know the exact format of the text I should be getting. In particular, it should match a regex with a variable number of groups.
I want to use the C++ regex library to determine (a) if it is valid text, and (b) to parse those groups into a vector. How can I do this? I can find examples online to do (a), but not (b).
#include <string>
#include <regex>
#include <vector>
bool parse_this_text(std::string & text, std::vector<std::string> & group) {
// std::string text_regex = "^([a-z]*)(,[0-9]+)*$"
// if the text matches the regex, return true and parse each group into the vector
// else return false
???
}
Such that the following lines of code return the expected results.
std::vector<std::string> group;
parse_this_text("green,1", group);
// should return true with group = {"green", ",1"};
parse_this_text("yellow", group);
// should return true with group = {"yellow"};
parse_this_text("red,1,2,3", group);
// should return true with group = {"red", ",1", ",2", ",3"};
parse_this_text("blue,1.0,3.0,1,a", group);
// should return false (since it doesn't match the regex)
Thanks!

(?=^([a-zA-Z]*)(?:\,\d+)+$)^.*?(?:((?:\,\d+)+)).*?$
You can use this.This will first validate using lookahead and then return 2 groups.
1) containing name
2) containing all the rest of integers (This can be easily split) or you can use re.findall here
Though it doesnot answer your question fully , it might be of help.
Have a look.
http://regex101.com/r/wE3dU7/3

One option is to scan the string twice, the first time to check for validity and the second time to split it into fields. With the example in the OP, you don't really need regexen to split the line, once you know that it is correct; you can simply split on commas. But for the sake of exposition, you could use a std::regex_token_iterator (assuming you have a C++ library which supports those), something like this:
bool parse_this_text(const std::string& s, std::vector<std::string>& result) {
static const std::regex check("[[:alpha:]][[:alnum:]]*(,[[:digit:]])*",
std::regex_constants::nosubs);
static const std::regex split(",");
if (!std::regex_match(s, check))
return false;
std::sregex_token_iterator tokens(s.begin(), s.end(), split, -1);
result.clear();
std::copy(tokens, std::sregex_token_iterator(), std::back_inserter(result));
return true;
}
For more complicated cases, or applications in which the double scan is undesired, you can tokenize using successive calls to std::regex_search(), supplying the end of the previous match as the starting point, and std::regex_constants::continuous as the match flags; that will anchor each search to the character after the previous match. You could, in that case, use a std::regex_iterator, but I'm not convinced that the resulting code is any simpler.

Related

Checking if a string contains more than just keywords C++

Thank you for clicking on my question.
After countless hours of searching, I have not come across a solution and its quite difficult to search for something you don't know how to properly phrase in a search. Please help me out, I would appreciate it.
The data of the string would be like:
std::string keyword 1 "Hello";
std::string keyword 2 "Ola";
std::string test = Keyword1+Keyword2+keyword2;
Example of what I'm trying to achieve as a pseudocode:
if(test.contains(more then the 2 keywords))
I wanna make sure the string has other text than just the keywords above.
You can remove all instances of these keywords from your data and see what's left. It's not terribly efficient but shouldn't matter for reasonably sized inputs.
bool contains_more_than(std::vector<std::string> const& keywords, std::string sample) {
for (std::string const& keyword: keywords) {
size_t pos;
while ((pos = sample.find(keyword)) != sample.npos) {
sample.replace(pos, keyword.size(), "");
}
}
return !sample.empty();
}
Note that this might fail if some keyword is a substring of another:
contains_more_than({"123", "12345"}, "12345") returns True.
To avoid this you can first sort your keywords by std::string::size:
std::string(keywords.begin(), keywords.end(),
[](std::string const& s1, std::string const& s2) {
return s1.size() > s2.size();
});
Now:
contains_more_than({"12345", "123"}, "12345") returns False
A possible solution: expressed as a regular expression, you are testing whether the string matches ^(Hello|Ola)*$. That is, does the whole string match any number of repeats of "Hello" and/or "Ola" (and with nothing else)? You can use the regex standard library to match regular expressions in C++.

How to match a vector of regular expression with a one string?

If I want to verify one string is completely matches with the any one in the vector of strings then i will use
std::find(vectOfStrings.begin(), vectOfStrings.end(), "<targetString>") != v.end()
If the target string matches with any of the string in the vector then it will return true.
But what if i want to check one string is matches with any one of the vector of regular expressions?
Is there any standard library i can use to make it work like
std::find(vectOfRegExprsns.begin(), vectOfRegExprsns.end(), "<targetString>") != v.end()?
Any suggestions would be highly appreciated.
How about using std::find_if() with a lambda?
std::find_if(
vectOfRegExprsns.begin(), vectOfRegExprsns.end(),
[](const std::string& item) { return regex_match(item, std::regex(targetString))});

Understanding c++ regex by a simple example

I wrote the following simple example:
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string str("1231");
std::regex r("^(\\d)");
std::smatch m;
std::regex_search(str, m, r);
for(auto v: m) std::cout << v << std::endl;
}
DEMO
and got confused by its behavior. If I understood the purpose of the match_result from there correctly, the only one 1 should have been printed. Actually:
If successful, it is not empty and contains a series of sub_match
objects: the first sub_match element corresponds to the entire match,
and, if the regex expression contained sub-expressions to be matched ([...])
The string passed to the function doesn't match the regex, therefore we should not have had the entire match.
What did I miss?
You still get the entire match but the entire match does not fit the entire string it fits the entire regex.
For example consider this:
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string str("1231");
std::regex r("^(\\d)\\d"); // entire match will be 2 numbers
std::smatch m;
std::regex_search(str, m, r);
for(auto v: m)
std::cout << v << std::endl;
}
Output:
12
1
The entire match (first sub_match) is what the entire regex matches against (part of the string).
The second sub_match is the first (and only) capture group
Looking at your original regex
std::regex r("^(\\d)");
|----| <- entire expression (sub_match #0)
std::regex r("^(\\d)");
|---| <- first capture group (sub_match #1)
That is where the two sub_matches come from.
From here
Returns whether **some** sub-sequence in the target sequence (the subject)
matches the regular expression rgx (the pattern). The target sequence is
either s or the character sequence between first and last, depending on
the version used.
So regex_search will search for anything in the input string that matches the regex. The whole string doesnt have to match, just part of it.
However, if you were to use regex_match, then the entire string must match.

trouble with case insensitive partial matching two strings?

I am trying to partial match two strings without case sensitivity. I do not want to use the boost libraries as most people don't have them on their compilers. I tried .find() that is in the standard c++ library, but it only checks if the user inputted string is in the first word of the string that is already there. like, if I have a dvd named Harry_Potter_Goblet, if I search for "goblet" or "Goblet", the program doesnt show Harry_Potter_Goblet as a result, only if I do case sensitive search for "Harry", then the resul shows a match. What am I doing wrong here? Here is my code.
Define a case-insensitive character comparison function:
#include <cctype>
bool case_insensitive_comp(char lhs, char rhs)
{
return std::toupper(lhs) == std::toupper(rhs);
}
Then, use std::search to find the sub-string within the larger string.
#include <algorithm>
....
std::string s1="Harry_Potter_Goblet";
std::string s2 = "goblet";
bool found = std::search(s1.begin(), s1.end(), s2.begin(), s2.end(), case_insensitive_comp) != s1.end();

boost::regex_search - boost kills my brain cells, again

Good programmers keep simple things easy right?
And it's not like the boost documentation makes your life less uneasy...
All I want is an implementation for:
// fulfils the function of a regex matching where the pattern may match a
// substring instead of the entire string
bool search( std::string, std::string, SomeResultType )
So it can be used as in:
std::string text, pattern;
SomeResultsType match;
if( search( text, pattern, match ) )
{
std::string result = match[0];
if( match[1].matched )
// where this is the second capture group, not recapturing the same group
std::string secondMatch = match[1];
}
I want my client code not to be bothered with templates and iterators... I know, I'm a wuss. After peering for an hour over the template spaghetti in the boost docs for doing something so simple, I feel like my productivity is seriously getting hampered and I don't feel like I've learned anything from it.
boost::regex_match does it pretty simple with boost::cmatch, except that it only matches the whole string, so I've been adapting all my patterns to match the whole strings, but I feel that it is a dirty hack and would prefer some more proper solution. If I would have known it would take this long, I would have stuck with regex_match
Also welcome, a copy of Reading boost documentation for dummies
Next week in Keep it simple and easy with boost, function binders! No, just kidding, I wouldn't do that to anyone.
Thanks for all help
I think you want regex_search: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/ref/regex_search.html
Probably this overload is the one you want:
bool regex_search(const basic_string& s,
match_results::const_iterator, Allocator>& m,
const basic_regex& e,
match_flag_type flags = match_default);
That seems to match what you wanted - SomeResultsType is smatch, and you need to convert your pattern to a regex first.
On Windows, you can use the .NET Regex class:
Example (copied from the linked page):
#using <System.dll>
using namespace System;
using namespace System::Text::RegularExpressions;
int main()
{
// Define a regular expression for repeated words.
Regex^ rx = gcnew Regex( "\\b(?<word>\\w+)\\s+(\\k<word>)\\b",static_cast<RegexOptions>(RegexOptions::Compiled | RegexOptions::IgnoreCase) );
// Define a test string.
String^ text = "The the quick brown fox fox jumped over the lazy dog dog.";
// Find matches.
MatchCollection^ matches = rx->Matches( text );
// Report the number of matches found.
Console::WriteLine( "{0} matches found.", matches->Count );
// Report on each match.
for each (Match^ match in matches)
{
String^ word = match->Groups["word"]->Value;
int index = match->Index;
Console::WriteLine("{0} repeated at position {1}", word, index);
}
}