Constructing boost regex - regex

I want to match every single number in the following string:
-0.237522264173E+01 0.110011117918E+01 0.563118085683E-01 0.540571836345E-01 -0.237680494785E+01 0.109394729137E+01 -0.237680494785E+01 0.109394729137E+01 0.392277532367E+02 0.478587433035E+02
However, for some reason the following boost::regex doesn't work:
(.*)(-?\\d+\\.\\d+E\\+\\d+ *){10}(.*)
What's wrong with it?
EDIT: posting relevant code:
std::ifstream plik("chains/peak-summary.txt");
std::string mystr((std::istreambuf_iterator<char>(plik)), std::istreambuf_iterator<char>());
plik.close();
boost::cmatch what;
boost::regex expression("(.*)(-?\\d+\\.\\d+E\\+\\d+ *){10}(.*)");
std::cout << "String to match against: \"" << mystr << "\"" << std::endl;
if(regex_match(mystr.c_str(), what, expression))
{
std::cout << "Match!";
std::cout << std::endl << what[0] << std::endl << what[1] << std::endl;
} else {
std::cout << "No match." << std::endl;
}
output:
String to match against: " -0.237555275450E+01 0.109397523269E+01 0.560420828508E-01 0.556732715285E-01 -0.237472295761E+01 0.110192835331E+01 -0.237472295761E+01 0.110192835331E+01 0.393040553508E+02 0.478540190640E+02
"
No match.
Also posting the contents of file read into the string:
[dare2be#schroedinger multinest-peak]$ cat chains/peak-summary.txt
-0.237555275450E+01 0.109397523269E+01 0.560420828508E-01 0.556732715285E-01 -0.237472295761E+01 0.110192835331E+01 -0.237472295761E+01 0.110192835331E+01 0.393040553508E+02 0.478540190640E+02

The (.*) around your regex match and consume all text at the start and end of the string, so if there are more than ten numbers, the first ones won't be matched.
Also, you're not allowing for negative exponents.
(-?\\d\\.\\d+E[+-]\\d+ *){10,}
should work.
This will match all of the numbers in a single string; if you want to match each number separately, you have to use (-?\\d\\.\\d+E[+-]\\d+) iteratively.

Try with:
(-?[0-9]+\\.[0-9]+E[+-][0-9]+)
Your (.*) in the beggining matches greedy whole string.

Related

std::regex - lookahead assertion not always working

I'm writing a module that's making some string substitutions into text to give to a scripting language. The language's syntax is vaugely lisp-y, so expressions are bounded by parentheses and symbols separated by spaces, most of them starting with '$'. A regular expression like this seems like it should give matches at the appropriate symbol boundaries:
auto re_match_abc = std::regex{ "(?=.*[[:space:]()])\\$abc(?=[()[:space:]].*)" };
But in my environment (Visual C++ 2017, 15.9.19, targetting C++-17) it can match strings without a suitable boundary in front of them:
std::cout << " $abc -> " << std::regex_replace(" $abc ", re_match_abc, "***") << std::endl;
std::cout << " ($abc) -> " << std::regex_replace("($abc)", re_match_abc, "***") << std::endl;
std::cout << "xyz$abc -> " << std::regex_replace("xyz$abc ", re_match_abc, "***") << std::endl;
std::cout << " $abcdef -> " << std::regex_replace(" $abcdef", re_match_abc, "***") << std::endl;
// Result from VC++ 2017:
//
// $abc -> ***
// ($abc) -> (***)
// xyz$abc -> xyz*** <= What's going wrong here?
// $abcdef -> $abcdef
Why is that regex ignoring the positive-lookahead requirement to have at least one space or parenthesis before the matching text?
[I realize that there are other ways to do this job and to do it really robustly maybe I should use something to turn the string into a token stream, but for the immediate job I have (and because the person authoring the strings that get processed is sitting next to me, so we can coordinate) I thought that regex replacements would do for now.]
You need to use a positive lookbehind instead. What you really want is this:
auto re_match_abc = std::regex{ "(?<=[[:space:]()])\\$abc(?=[()[:space:]])" };
You can try it out on a website like https://regex101.com/ (just remove the escaped backslash that's required for the C++ string). It explains what each piece of the regex is doing and shows you everything that matches.
Keep in mind that this will also match things like )$abc)
Edit: std::regex apparently does not support lookbehind. For you specific case you might try something like this:
auto re_match_abc = std::regex{ "([[:space:]()])\\$abc(?=[()[:space:]])" };
std::cout << " $abc -> " << std::regex_replace(" $abc ", re_match_abc, "$1***") << std::endl;
std::cout << " ($abc) -> " << std::regex_replace("($abc)", re_match_abc, "$1***") << std::endl;
std::cout << "xyz$abc -> " << std::regex_replace("xyz$abc ", re_match_abc, "$1***") << std::endl;
std::cout << " $abcdef -> " << std::regex_replace(" $abcdef", re_match_abc, "$1***") << std::endl;
output:
$abc -> ***
($abc) -> (***)
xyz$abc -> xyz$abc
$abcdef -> $abcdef
try it here
Here instead of a lookbehind we have a normal capture group. In the replacement we're emitting whatever we captured (a parenthesis or space) followed by the actual string we want to replace $abc with.

PCRE does not match

I suppose it's something very stupid, however this does not match, and I have no idea why.
I compiles successfully and everything, but it just doesn't match.
I've already used RE(".*") but it doesn't work as well.
The system is OS X (installed pcre using brew).
std::string s;
if (pcrecpp::RE("h.*o").FullMatch("hello", &s))
{
std::cout << "Successful match " << s << std::endl;
}
You are trying to extract one subpattern (in &s), but have not included any parentheses to capture that subpattern. Try this (untested, note parentheses).
std::string s;
if (pcrecpp::RE("(h.*o)").FullMatch("hello", &s))
{
std::cout << "Successful match " << s << std::endl;
}
The documentation at http://www.pcre.org/original/doc/html/pcrecpp.html has a similar example, stating:
Example: fails because there aren't enough sub-patterns:
!pcrecpp::RE("\w+:\d+").FullMatch("ruby:1234", &s);

Boost regex don't match tabs

I'm using boost regex_match and I have a problem with matching no tab characters.
My test application looks as follows:
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_regex.hpp>
int
main(int args, char** argv)
{
boost::match_results<std::string::const_iterator> what;
if(args == 3) {
std::string text(argv[1]);
boost::regex expression(argv[2]);
std::cout << "Text : " << text << std::endl;
std::cout << "Regex: " << expression << std::endl;
if(boost::regex_match(text, what, expression, boost::match_default) != 0) {
int i = 0;
std::cout << text;
if(what[0].matched)
std::cout << " matches with regex pattern!" << std::endl;
else
std::cout << " does not match with regex pattern!" << std::endl;
for(boost::match_results<std::string::const_iterator>::const_iterator it=what.begin(); it!=what.end(); ++it) {
std::cout << "[" << (i++) << "] " << it->str() << std::endl;
}
} else {
std::cout << "Expression does not match!" << std::endl;
}
} else {
std::cout << "Usage: $> ./boost-regex <text> <regex>" << std::endl;
}
return 0;
}
If I run the program with these arguments, I don't get the expected result:
$> ./boost-regex "`cat file`" "(?=.*[^\t]).*"
Text : This text includes some tabulators
Regex: (?=.*[^\t]).*
This text includes some tabulators matches with regex pattern!
[0] This text includes some tabulators
In this case I would have expected that what[0].matched is false, but it's not.
Is there any mistake in my regular expression?
Or do I have to use other format/match flag?
Thank you in advance!
I am not sure what you want to do. My understanding is, you want the regex to fail as soon as there is a tab in the text.
Your positive lookahead assertion (?=.*[^\t]) is true as soon as it finds a non tab, and there are a lot of non tabs in your text.
If you want it to fail, when there is a tab, go the other way round and use a negative lookahead assertion.
(?!.*\t).*
this assertion will fail as soon as it find a tab.

Extract IP address from a string using boost regex?

I was wondering if anyone can help me, I've been looking around for regex examples but I still can't get my head over it.
The strings look like this:
"User JaneDoe, IP: 12.34.56.78"
"User JohnDoe, IP: 34.56.78.90"
How would I go about to make an expression that matches the above strings?
The question is how exactly do you want to match these, and what else do you want to exclude?
It's trivial (but rarely useful) to match any incoming string with a simple .*.
To match these more exactly (and add the possibility of extracting things like the user name and/or IP), you could use something like: "User ([^,]*), IP: (\\d{1,3}(\\.\\d{1,3}){3})". Depending on your input, this might still run into a problem with a name that includes a comma (e.g., "John James, Jr."). If you have to allow for that, it gets quite a bit uglier in a hurry.
Edit: Here's a bit of code to test/demonstrate the regex above. At the moment, this is using the C++0x regex class(es) -- to use Boost, you'll need to change the namespaces a bit (but I believe that should be about all).
#include <regex>
#include <iostream>
void show_match(std::string const &s, std::regex const &r) {
std::smatch match;
if (std::regex_search(s, match, r))
std::cout << "User Name: \"" << match[1]
<< "\", IP Address: \"" << match[2] << "\"\n";
else
std::cerr << s << "did not match\n";
}
int main(){
std::string inputs[] = {
std::string("User JaneDoe, IP: 12.34.56.78"),
std::string("User JohnDoe, IP: 34.56.78.90")
};
std::regex pattern("User ([^,]*), IP: (\\d{1,3}(\\.\\d{1,3}){3})");
for (int i=0; i<2; i++)
show_match(inputs[i], pattern);
return 0;
}
This prints out the user name and IP address, but in (barely) enough different format to make it clear that it's matching and printing out individual pieces, not just passing entire strings through.
#include <string>
#include <iostream>
#include <boost/regex.hpp>
int main() {
std::string text = "User JohnDoe, IP: 121.1.55.86";
boost::regex expr ("User ([^,]*), IP: (\\d{1,3}(\\.\\d{1,3}){3})");
boost::smatch matches;
try
{
if (boost::regex_match(text, matches, expr)) {
std::cout << matches.size() << std::endl;
for (int i = 1; i < matches.size(); i++) {
std::string match (matches[i].first, matches[i].second);
std::cout << "matches[" << i << "] = " << match << std::endl;
}
}
else {
std::cout << "\"" << expr << "\" does not match \"" << text << "\". matches size(" << matches.size() << ")" << std::endl;
}
}
catch (boost::regex_error& e)
{
std::cout << "Error: " << e.what() << std::endl;
}
return 0;
}
Edited: Fixed missing comma in string, pointed out by Davka, and changed cmatch to smatch

How can I access all matches of a repeated capture group, not just the last one?

My code is:
#include <boost/regex.hpp>
boost::cmatch matches;
boost::regex_match("alpha beta", matches, boost::regex("([a-z])+"));
cout << "found: " << matches.size() << endl;
And it shows found: 2 which means that only ONE occurrence is found… How to instruct it to find THREE occurrences? Thanks!
You should not call matches.size() before verifying that something was matched, i.e. your code should look rather like this:
#include <boost/regex.hpp>
boost::cmatch matches;
if (boost::regex_match("alpha beta", matches, boost::regex("([a-z])+")))
cout << "found: " << matches.size() << endl;
else
cout << "nothing found" << endl;
The output would be "nothing found" because regex_match tries to match the whole string. What you want is probably regex_search that is looking for substring. The code below could be a bit better for you:
#include <boost/regex.hpp>
boost::cmatch matches;
if (boost::regex_search("alpha beta", matches, boost::regex("([a-z])+")))
cout << "found: " << matches.size() << endl;
else
cout << "nothing found" << endl;
But will output only "2", i.e. matches[0] with "alpha" and matches[1] with "a" (the last letter of alpha - the last group matched)
To get the whole word in the group you have to change the pattern to ([a-z]+) and call the regex_search repeatedly as you did in your own answer.
Sorry to reply 2 years late, but if someone googles here as I did, then maybe it will be still useful for him...
This is what I've found so far:
text = "alpha beta";
string::const_iterator begin = text.begin();
string::const_iterator end = text.end();
boost::match_results<string::const_iterator> what;
while (regex_search(begin, end, what, boost::regex("([a-z]+)"))) {
cout << string(what[1].first, what[2].second-1);
begin = what[0].second;
}
And it works as expected. Maybe someone knows a better solution?
This works for me, maybe somebody will find it usefull..
std::string arg = "alpha beta";
boost::sregex_iterator it{arg.begin(), arg.end(), boost::regex("([a-z])+")};
boost::sregex_iterator end;
for (; it != end; ++it) {
std::cout << *it << std::endl;
}
Prints:
alpha
beta