How to define a regular expression in Boost? - C++

I have a section in file:
[Source]
[Source.Ia32]
[Source.Ia64]
I have created the expression as:
const boost::regex source_line_pattern ("(Sources)(.*?)");
Now, I am trying to match the string, but I am not able to match; it is always returning 0.
if (boost::regex_match ( sToken, source_line_pattern ) )
return TRUE;
Please note that sToken takes the values [Source], [Source.Ia32], and so on.
Thanks,

There are at least two problems with your code. First, the
regular expression you give contains the literal string
"Sources", and not "Source", which is what you seem to be
trying to match. Second, boost::regex_match is anchored: it
succeeds only if the pattern matches the entire string. What
you seem to want is boost::regex_search. Depending on what you
are doing, however, it might be better to match the entire
string with "\\[Source(?:\\.(\\w+))?\\]\\s*", which captures
the trailing part, if present, but not the leading "Source"
(there is no point, in general, in capturing something that is
a constant).
Note too that the trailing ".*?" does little here. In Boost's
default Perl syntax, "*?" is a non-greedy quantifier, so under
regex_search a final ".*?" matches the empty string. Other
syntaxes (POSIX extended, for example) do not define what
a '?' after a (non-escaped) '*' means.

The issue is that boost::regex_match only returns true if the entire input string is matched by the regex. So the '[' and ']' are not matched by your current regex, and it will fail.
Your options are either to use boost::regex_search, which will search for a substring of the input that matches your regex, or modify your regex to accept the entire string being passed in.
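To make the difference between the two calls concrete, here is a minimal sketch in Python rather than Boost: re.fullmatch plays the role of boost::regex_match and re.search the role of boost::regex_search. The helper name section_arch and the "[Other]" input are made up for illustration, and LOOSE is the asker's pattern with the "Sources" typo corrected.

```python
import re

# Whole-header pattern from the answer above; the optional ".Arch" suffix is captured.
SECTION = re.compile(r"\[Source(?:\.(\w+))?\]\s*")
# The asker's pattern with "Sources" corrected to "Source" (assumption).
LOOSE = re.compile(r"(Source)(.*?)")


def section_arch(token):
    """Return (matched, arch) for a header such as "[Source.Ia32]"."""
    m = SECTION.fullmatch(token)  # must cover the whole string, like regex_match
    return (m is not None, m.group(1) if m else None)


for token in ["[Source]", "[Source.Ia32]", "[Source.Ia64]", "[Other]"]:
    whole = LOOSE.fullmatch(token) is not None   # fails: "[" and "]" are not covered
    anywhere = LOOSE.search(token) is not None   # succeeds wherever "Source" occurs
    print(token, section_arch(token), whole, anywhere)
```

The loose pattern only succeeds with a substring search, while the whole-header pattern works with a full match and also hands back the architecture suffix when there is one.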

Related

Regular expressions with an alternative if the first one doesn't match

I need a regular expression that takes a function signature as input and returns the name of the function. For example, I may have the following input:
FUNCTION(A,B,C)
and after applying the following regular expression:
^(.*?)(?=\()
I correctly obtain the word "FUNCTION" as expected.
However, sometimes I can get the name of the function WITHOUT parentheses (and therefore without parameters), like this:
FUNCTION
In this case, the previous regex fails and doesn't capture the name. Is there any way to define a regex that, if the first expression cannot match, tries another one? (In this case, that would mean taking the whole input.)
From what I see, you want to match the first n characters other than (, ) and space.
Thus, it is much more efficient to use
^[^()\s]+
^(.*?)(?=\(|\s*$|\s)
This should do it for you. You need to use the | (or) operator.
\s*$ === stop if you have 0 or more spaces and then string ends
\s ==== stop at the first instance of space
^([^(]+)\s*\(?
Could do what you want.
Explanation :
([^(]+) : one or more character that is not (
\s* : maybe some blank spaces
\(? : optional parenthesis
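As a quick check of the suggestions above, here is a small Python sketch (Python's re module is an assumption; the question does not name a regex flavor). It compares the negated-class pattern with the lookahead-plus-alternation pattern; the helper names are hypothetical.

```python
import re

# Negated character class: everything up to the first "(", ")" or whitespace.
NAME = re.compile(r"^[^()\s]+")
# Lookahead with alternation: stop before "(", before trailing spaces, or at a space.
NAME_LOOKAHEAD = re.compile(r"^(.*?)(?=\(|\s*$|\s)")


def function_name(signature):
    """Return the leading name, or None if the input starts with "(" or is empty."""
    m = NAME.search(signature)
    return m.group(0) if m else None


def function_name_lookahead(signature):
    m = NAME_LOOKAHEAD.search(signature)
    return m.group(1) if m else None


for sig in ["FUNCTION(A,B,C)", "FUNCTION", "FUNCTION (A, B)"]:
    print(sig, "->", function_name(sig), function_name_lookahead(sig))
```

Both variants agree on these inputs. The negated class is the simpler of the two because it needs neither a lookahead nor an alternation.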

Lua pattern similar to regex positive lookahead?

I have a string which can contain any number of the delimiter §\n. I would like to remove all delimiters from a string, except the last occurrence which should be left as-is. The last delimiter can be in three states: \n, §\n or §§\n. There will never be any characters after the last variable delimiter.
Here are 3 examples with the different state delimiters:
abc§\ndef§\nghi\n
abc§\ndef§\nghi§\n
abc§\ndef§\nghi§§\n
I would like to remove all delimiters except the last occurrence.
So the result of gsub for the three examples above should be:
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n
Using regular expressions, one could use §\\n(?=.), which matches properly for all three cases using positive lookahead, as there will never be any characters after the last variable delimiter.
I know I could check if the string has the delimiter at the end, and then after a substitution using the Lua pattern §\n I could add the delimiter back onto the string. That is however a very inelegant solution to a problem which should be possible to solve using a Lua pattern alone.
So how could this be done using a Lua pattern?
str:gsub( '§\\n(.)', '%1' ) should do what you want. This deletes the delimiter whenever it is followed by another character, and puts that character back into the string.
Test code
local str = {
'abc§\\ndef§\\nghi\\n',
'abc§\\ndef§\\nghi§\\n',
'abc§\\ndef§\\nghi§§\\n',
}
for i = 1, #str do
print( ( str[ i ]:gsub( '§\\n(.)', '%1' ) ) )
end
yields
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n
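For comparison, here is a sketch of both approaches in Python's re module (not Lua, so this only illustrates the idea): the lookahead from the question and the capture-and-reinsert trick from the gsub call above. As in the question, the delimiter ends in a literal backslash followed by n, so raw strings are used. The function names and the extra input with two adjacent delimiters are my own additions.

```python
import re


def strip_with_lookahead(s):
    # Delete the delimiter only when at least one more character follows it.
    return re.sub(r"§\\n(?=.)", "", s)


def strip_with_capture(s):
    # Same idea as gsub('§\\n(.)', '%1'): consume the next character, then put it back.
    return re.sub(r"§\\n(.)", r"\1", s)


samples = [
    r"abc§\ndef§\nghi\n",
    r"abc§\ndef§\nghi§\n",
    r"abc§\ndef§\nghi§§\n",
]
for s in samples:
    print(strip_with_lookahead(s), strip_with_capture(s))

# Hypothetical edge case: two delimiters back to back. The capture version
# consumes the second "§" as its "next character", so that delimiter survives.
adjacent = r"a§\n§\nb\n"
print(strip_with_lookahead(adjacent), strip_with_capture(adjacent))
```

The two versions agree on the three examples from the question. They differ only when delimiters are directly adjacent, which the question does not say can happen.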
EDIT: This answer doesn't work specifically for lua, but if you have a similar problem and are not constrained to lua you might be able to use it.
So if I understand correctly, you want a regex replace to make the first example look like the second. This:
/(.*?)§\\n(?=.*\\n)/g
will eliminate the non-last delimiters when replaced with
$1
in PCRE, at least. I'm not sure what flavor Lua follows, but the example below shows it in action.
REGEX:
/(.*?)§\\n(?=.*\\n)/g
TEST STRING:
abc§\ndef§\nghi\n
abc§\ndef§\nghi§\n
abc§\ndef§\nghi§§\n
SUBSTITUTION:
$1
RESULT:
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n

Regex tokenizing with Boost only getting last letters of words

I am trying to parse a simple sentence structure with Boost. This is my first time using Boost, so I could be doing this completely wrong. What I want to do is only accept strings in this format:
Must start with a letter (case insensitive)
May contain:
Alphabetic characters
Numeric characters
Underscores
Hyphens
All other characters serve as delimiters
Since I don't know what characters are my delimiters (there could be tons), I have tried to make a regex that is sensitive to that. The only problem is, I am only getting the last letter of each word. This leads me to believe that my regex is correct, but my use of boost is not. Here's my code:
boost::regex regexp("[A-Za-z]([A-Za-z]|[0-9]|_|-)*", boost::regex::normal | boost::regbase::icase);
boost::sregex_token_iterator i(text.begin(), text.end(), regexp, 1);
boost::sregex_token_iterator j;
while(i != j){
cout << *i++ << std::endl;
}
I modeled this after what I found on the Boost website. I used the last example (at the bottom of the page) as a template to build my code. In this instance, text is an object of type string.
Is my regex correct? Am I using boost correctly?
Change your regex to: ([A-Za-z][-A-Za-z0-9_]*)
By putting the parentheses around the whole expression, the entire thing will be captured, not just the last character matched. Putting the - in front causes it to be a matched character and not a range specifier.
You're requesting the first submatch for each RE match. That refers to this subexpression: ([A-Za-z]|[0-9]|_|-) and you're getting the last thing that matched (notice that it's qualified by a *) for each match. Hence, the last character. I think you should pass 0 for the submatch number, or just omit that parameter. When I modify your code to do that, it does what I think you're wanting it to do.
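The "last character only" effect is not specific to Boost. A repeated capturing group keeps only its final iteration in most regex engines, and the Python sketch below reproduces it. Here finditer with group(1) stands in for a token iterator asked for submatch 1, and group(0) for submatch 0. The sample text and the helper tokens are made up for illustration.

```python
import re

# The asker's pattern: group 1 sits under "*", so it keeps only its last iteration.
ORIGINAL = re.compile(r"[A-Za-z]([A-Za-z]|[0-9]|_|-)*", re.IGNORECASE)
# Suggested rewrite: a single character class, with "-" first so it is literal.
FIXED = re.compile(r"[A-Za-z][-A-Za-z0-9_]*")


def tokens(pattern, text, group=0):
    """Collect one submatch per match; group 0 is the whole match."""
    return [m.group(group) for m in pattern.finditer(text)]


text = "foo-bar, baz_42; qux"
print(tokens(ORIGINAL, text, 1))  # submatch 1: last character of each word
print(tokens(ORIGINAL, text, 0))  # submatch 0: whole words
print(tokens(FIXED, text))        # rewritten pattern, whole words
```

Either fix works on its own: ask for submatch 0 with the original pattern, or rewrite the pattern so the part you want is the whole match.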

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
any and all help greatly appreciated.
Similar question on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
This matches everything between B= and ; unless it is a ;. So it matches everything between B= and the first ; thereafter.
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
String s = "A=abc;B=def_3%^123+-;C=123;";
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
System.out.println(m.group(1));
}
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.
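Since the question's language is unknown, here is the same "find, don't replace" idea as a Python sketch. inner_string mirrors the asker's pseudo-code getInnerString; running the tokens through re.escape is my own addition so that tokens containing characters like "+" or "." are treated literally.

```python
import re


def inner_string(text, start_token, end_token):
    """Return the text between start_token and the next end_token, or None."""
    pattern = re.escape(start_token) + r"(.*?)" + re.escape(end_token)
    m = re.search(pattern, text)
    return m.group(1) if m else None


s = "A=abc;B=def_3%^123+-;C=123;"

print(inner_string(s, "B=", ";"))             # lazy: stops at the first ";"
print(re.search(r"B=(.+);", s).group(1))      # greedy: runs on to the last ";"
print(re.search(r"B=([^;]+);", s).group(1))   # negated class: cannot cross a ";"
```

The lazy quantifier and the negated character class give the same result here. The negated class has the advantage that it cannot backtrack past a semicolon at all.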

A regex that validates a web address and matches an empty string?

The current expression validates a web address (HTTP), how do I change it so that an empty string also matches?
(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?
If you want to modify the expression to match either an entirely empty string or a full URL, you will need to use the anchor metacharacters ^ and $ (which match the beginning and end of a line respectively).
^(|https?:\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)$
As dirkgently pointed out, you can simplify your match for the protocol a little, so I've included that for you too.
Though, if you are using this expression from within a program or script, it may be simpler to use the language's own means of checking whether the input is empty.
// in no particular language...
if input.length > 0 then
if input matches <regex> then
input is a URL
else
input is invalid
else
input is empty
Put the whole expression in parentheses and mark it as optional (the "?" quantifier: zero or one repetition).
((http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)?
Use expression markers ^$ around your expression and add |^$ to the end. This way you're using the | or operator with two expressions showing that you have two different match cases.
^(https?:\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)$|^$
The key here is that |^$ means "or match blank".
Also, that expression will only work in JavaScript if you use a template string.
(Expr)? where Expr is your URL matcher, just as I would write https? to cover both http and https. The ? is known as a quantifier; you can look it up. From Wikipedia:
? The question mark indicates there is zero or one of the preceding element.
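To try the "optional group" idea, here is a minimal Python sketch. The question does not name a language, so Python's re module is an assumption. The URL part is copied from the expression in the question, and re.fullmatch takes the place of the ^ and $ anchors; is_empty_or_url is a hypothetical helper name.

```python
import re

# URL part taken from the question, with the protocol shortened to "https?".
URL = r"https?:\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?"
# Wrap the whole thing in a group and make the group optional.
EMPTY_OR_URL = re.compile(r"(?:" + URL + r")?")


def is_empty_or_url(value):
    # fullmatch requires the entire input to match, so "" passes via the "?".
    return EMPTY_OR_URL.fullmatch(value) is not None


for value in ["", "http://example.com", "https://example.com/a?b=c", "example.com", "http://"]:
    print(repr(value), is_empty_or_url(value))
```

Without the whole-string requirement, an optional group matches the empty string at the start of any input, so the anchors (or fullmatch) are what make the check meaningful.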