Regex tokenizing with Boost only getting last letters of words - c++

I am trying to parse a simple sentence structure with Boost. This is my first time using Boost, so I could be doing this completely wrong. What I want to do is only accept strings in this format:
Must start with a letter (case insensitive)
May contain:
Alphabetic characters
Numeric characters
Underscores
Hyphens
All other characters serve as delimiters
Since I don't know what characters are my delimiters (there could be tons), I have tried to make a regex that is sensitive to that. The only problem is, I am only getting the last letter of each word. This leads me to believe that my regex is correct, but my use of boost is not. Here's my code:
boost::regex regexp("[A-Za-z]([A-Za-z]|[0-9]|_|-)*", boost::regex::normal | boost::regbase::icase);
boost::sregex_token_iterator i(text.begin(), text.end(), regexp, 1);
boost::sregex_token_iterator j;
while(i != j){
    cout << *i++ << std::endl;
}
I modeled this after what I found on the Boost website. I used the last example (at the bottom of the page) as a template to build my code. In this instance, text is an object of type string.
Is my regex correct? Am I using boost correctly?

Change your regex to: ([A-Za-z][-A-Za-z0-9_]*)
By putting the parentheses around the whole expression, the entire thing will be captured, not just the last character matched. Putting the - first inside the character class makes it a literal character to match rather than a range specifier.

You're requesting the first submatch for each RE match. That refers to this subexpression: ([A-Za-z]|[0-9]|_|-) and you're getting the last thing that matched (notice that it's qualified by a *) for each match. Hence, the last character. I think you should pass 0 for the submatch number, or just omit that parameter. When I modify your code to do that, it does what I think you're wanting it to do.
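To make the difference between submatch 0 and submatch 1 concrete, here is a minimal sketch. It uses Python's re module rather than Boost (swapped in purely for illustration), and the helper name tokens is hypothetical. A capture group under a * behaves the same way there: it keeps only the last iteration it matched.

```python
import re

# The pattern from the question. Group 1 sits under a '*', so after each
# match it holds only the last single character it captured.
PATTERN = re.compile(r"[A-Za-z]([A-Za-z]|[0-9]|_|-)*")


def tokens(text, group):
    """Return the requested group of every match found in text."""
    return [m.group(group) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "foo_bar, baz-9 qux!"
    print(tokens(sample, 1))  # group 1: the last character of each word
    print(tokens(sample, 0))  # group 0: the whole match
```

Asking for group 0 here plays the role of passing 0 (or nothing) as the submatch argument in the question's code.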

Related

regex to match specific pattern of string followed by digits

Sample input:
___file___name___2000___ed2___1___2___3
DIFFERENT+FILENAME+(2000)+1+2+3+ed10
Desired output (i.e., all runs of letters, all 4-digit numbers, and the literal 'ed' followed immediately by a number of arbitrary length):
file name 2000 ed2
DIFFERENT FILENAME 2000 ed10
I am using:
[A-Za-z]+|[\d]{4}|ed\d+ which only returns:
file name 2000 ed
DIFFERENT FILENAME 2000 ed
I see that there is a related Q+A here: Regular Expression to match specific string followed by number?
E.g., using ed[0-9]* would match ed#, but I am unsure why it does not match in the above.
Each piece of your regex is correct on its own. Remember, however, that a regex tries its alternatives from left to right. Your ed\d+ is never going to match, because the ed was already consumed by your [A-Za-z]+ alternative. Reorder your regex and it'll work just fine:
ed\d+|[a-zA-Z]+|\d{4}
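As a quick check of the ordering effect, the following sketch runs both the original and the reordered alternation over the sample input from the question. It assumes Python's re module as the engine, and the helper name extract is made up for the example.

```python
import re

SAMPLES = [
    "___file___name___2000___ed2___1___2___3",
    "DIFFERENT+FILENAME+(2000)+1+2+3+ed10",
]

ORIGINAL = r"[A-Za-z]+|[\d]{4}|ed\d+"   # letters alternative is tried first
REORDERED = r"ed\d+|[a-zA-Z]+|\d{4}"    # 'ed' + digits is tried first


def extract(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)


if __name__ == "__main__":
    for sample in SAMPLES:
        print(extract(ORIGINAL, sample))
        print(extract(REORDERED, sample))
```

With the original order, the letters alternative wins at the "e" of "ed2" and stops before the digit, so the ed\d+ alternative never gets a chance.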
Nick's answer is right, but because in-order matching can be a less readable "gotcha", the best (order-insensitive) ways to do this kind of search are 1) with specified delimiters, and 2) by making each search term unique.
Jan's answer handles #1 well. But you would have to specify each specific delimiter, including its length (e.g. ___). It sounds like you may have some unusual delimiters, so this may not be ideal.
For #2, then, you can make each search term unique. (That is, you want the thing matching "file" and "name" to be distinct from the thing matching "2000", and both to be distinct from the thing matching "ed2".)
One way to do this is [A-Za-z]+(?![0-9a-zA-Z])|[\d]{4}|ed\d+. This is saying that for the first type of search term, you want an alphabetic string which is not followed by an alphanumeric character. This keeps it distinct from the third search term, which is an alphabetic string followed by some digit(s). This also allows you to specify any range of delimiters inside of that negative lookahead.
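A small sketch of the order-insensitivity claim, assuming Python's re module and the two sample strings from the question (the helper names extract and order_insensitive are hypothetical). It tries every ordering of the three alternatives and checks that they all produce the same matches on a given input.

```python
import re
from itertools import permutations

ALTERNATIVES = [
    r"[A-Za-z]+(?![0-9a-zA-Z])",  # letters not followed by an alphanumeric
    r"[\d]{4}",                   # four digits
    r"ed\d+",                     # literal 'ed' followed by digits
]


def extract(text, alternatives=ALTERNATIVES):
    """Join the alternatives with '|' in the given order and find all matches."""
    return re.findall("|".join(alternatives), text)


def order_insensitive(text):
    """True if every ordering of the alternatives yields the same matches."""
    results = {tuple(extract(text, order)) for order in permutations(ALTERNATIVES)}
    return len(results) == 1


if __name__ == "__main__":
    sample = "___file___name___2000___ed2___1___2___3"
    print(extract(sample))
    print(order_insensitive(sample))
```

This only demonstrates the property on the inputs it is given; other inputs may still need their own check.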
You might very well use (just grab the first capturing group):
(?:^|___|[+(]) # delimiter before
([a-zA-Z0-9]{2,}) # the actual content
(?=$|___|[+)]) # delimiter afterwards
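The commented pattern above can be run more or less as written in an engine that has a free-spacing mode. The sketch below assumes Python's re.VERBOSE flag for that; the function name delimited_terms is hypothetical.

```python
import re

# Same three parts as above; re.VERBOSE ignores the layout whitespace and
# treats '#' outside a character class as the start of a comment.
DELIMITED = re.compile(
    r"""
    (?:^|___|[+(])        # delimiter before
    ([a-zA-Z0-9]{2,})     # the actual content
    (?=$|___|[+)])        # delimiter afterwards
    """,
    re.VERBOSE,
)


def delimited_terms(text):
    """Return capturing group 1 of every match."""
    return DELIMITED.findall(text)


if __name__ == "__main__":
    print(delimited_terms("___file___name___2000___ed2___1___2___3"))
    print(delimited_terms("DIFFERENT+FILENAME+(2000)+1+2+3+ed10"))
```

The {2,} quantifier is what drops the single digits 1, 2 and 3, and the trailing lookahead leaves each closing delimiter available to open the next match.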

regex needed for parsing string

I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the FDA.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours to find a regex solution that retrieves both the delimiter and the value that follows it and, though there seem to be posts that handle this, the code found in those posts hasn't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance, delimiters such as "=", "=,", and "/=" are used. In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>"
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
The idea is that the first group contains the delimiters and the second group the content. I assume the content consists of word characters (letters, digits, and underscore). You then have to check, for every match, which of the two capturing groups was filled.
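A minimal sketch of how the two groups might be consumed, assuming Python's re module. The function name split_tokens and the "delimiter"/"content" labels are made up for the example.

```python
import re

# Group 1 catches a known delimiter, group 2 a run of word characters.
TOKEN = re.compile(r"(=/|=>|&,1)|(\w+)")


def split_tokens(text):
    """Label each match as a delimiter (group 1) or content (group 2)."""
    result = []
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:
            result.append(("delimiter", m.group(1)))
        else:
            result.append(("content", m.group(2)))
    return result


if __name__ == "__main__":
    for kind, value in split_tokens("=/A9999XYZ=>100T0479&,1Blah"):
        print(kind, value)
```

Only one of the two groups is set per match, so testing group 1 against None is enough to tell them apart.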
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
If your values can contain non-alphanumeric characters, it gets complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
Code the delimiters longest first, eg "=," before "=", otherwise the alternation, which matches left to right, will match "=" and the comma will become part of the value.
This uses a negative look ahead to stop matching past the next delimiter.
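To illustrate why the order of the delimiters matters, here is a sketch that builds the same kind of pattern from a list of delimiters, tried in the order given. It assumes Python's re module; build_pattern, pairs and the input "=,AB=CD" are invented for the example.

```python
import re


def build_pattern(delimiters):
    """Build (delimiter)(value); the delimiters are tried in the order given."""
    alt = "|".join(re.escape(d) for d in delimiters)
    # The value is every character not followed by a delimiter, plus one more.
    return re.compile(rf"({alt})((?:.(?!{alt}))+.)")


def pairs(text, delimiters):
    """Return (delimiter, value) tuples found in text."""
    return build_pattern(delimiters).findall(text)


LONGEST_FIRST = ["=/", "=>", "=,", "=", "&,1"]   # the order used in the answer
SHORTEST_FIRST = ["=", "=,", "=/", "=>", "&,1"]  # "=" wrongly tried before "=,"

if __name__ == "__main__":
    print(pairs("=/A9999XYZ=>100T0479&,1Blah", LONGEST_FIRST))
    print(pairs("=,AB=CD", LONGEST_FIRST))
    print(pairs("=,AB=CD", SHORTEST_FIRST))
```

With "=" listed before "=,", the comma ends up inside the value. Note also that this value pattern needs at least two characters per value.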

Regex, Word contains x but doesn't start with x

I'm not sure how to do this. I want a string to contain "end" but not start with end. An example would be the word "ascending". I've tried ^[^(end)]*end.*, but this finds "end" too.
So how would this be done correctly?
Your attempt didn't work because you confused a negated character class (defined with [^...], but able to cover just a single character by definition) with a negative lookahead (defined with (?!...)).
Had you written it properly, it would have looked like this:
^(?!end).+end.*$
(... with + instead of * right after the first . so that at least one character has to precede end.)
Note that the more straightforward attempt, ^.+end.*$, while looking pretty normal, actually fails on words like endendless (i.e., when the target string contains an end substring further in, yet also starts with end).
Also, if you're looking for words in the target string, and not just validating it, you should adjust the regex accordingly, using word boundary anchors instead of string anchors, like this:
\b(?!end)\w+end\w*
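The three patterns from this answer can be compared side by side. This is a sketch assuming Python's re module; the function name contains_but_not_starts and the test sentence are invented for the example.

```python
import re

VALIDATE = re.compile(r"^(?!end).+end.*$")   # whole string, lookahead guard
NAIVE = re.compile(r"^.+end.*$")             # no guard against a leading 'end'
WORDS = re.compile(r"\b(?!end)\w+end\w*")    # words inside a longer text


def contains_but_not_starts(word):
    """True if word contains 'end' but does not start with it."""
    return VALIDATE.match(word) is not None


if __name__ == "__main__":
    for word in ["ascending", "end", "endless", "endendless", "bend"]:
        print(word, contains_but_not_starts(word), NAIVE.match(word) is not None)
    print(WORDS.findall("the ascending trend ended endlessly"))
```

The naive pattern accepts endendless because its second "end" has characters in front of it; the lookahead version rejects it at the very first position.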

Lua pattern similar to regex positive lookahead?

I have a string which can contain any number of the delimiter §\n. I would like to remove all delimiters from a string, except the last occurrence which should be left as-is. The last delimiter can be in three states: \n, §\n or §§\n. There will never be any characters after the last variable delimiter.
Here are 3 examples with the different state delimiters:
abc§\ndef§\nghi\n
abc§\ndef§\nghi§\n
abc§\ndef§\nghi§§\n
I would like to remove all delimiters except the last occurrence.
So the result of gsub for the three examples above should be:
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n
Using regular expressions, one could use §\\n(?=.), which matches properly for all three cases using positive lookahead, as there will never be any characters after the last variable delimiter.
I know I could check if the string has the delimiter at the end, and then after a substitution using the Lua pattern §\n I could add the delimiter back onto the string. That is however a very inelegant solution to a problem which should be possible to solve using a Lua pattern alone.
So how could this be done using a Lua pattern?
str:gsub( '§\\n(.)', '%1' ) should do what you want. This deletes the delimiter only when it is followed by another character, and puts that character back into the string.
Test code
local str = {
    'abc§\\ndef§\\nghi\\n',
    'abc§\\ndef§\\nghi§\\n',
    'abc§\\ndef§\\nghi§§\\n',
}
for i = 1, #str do
    print( ( str[ i ]:gsub( '§\\n(.)', '%1' ) ) )
end
yields
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n
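For comparison, the capture-and-put-back trick and the lookahead from the question can be checked against each other. The sketch below uses Python's re module, not Lua patterns, so it only illustrates that the two approaches agree. As in the test code above, the sample strings contain a literal backslash followed by n, and both function names are hypothetical.

```python
import re


def strip_with_capture(s):
    """Consume the delimiter plus one following character, then re-insert it."""
    return re.sub(r"§\\n(.)", r"\1", s)


def strip_with_lookahead(s):
    """Remove the delimiter only if some character follows it."""
    return re.sub(r"§\\n(?=.)", "", s)


SAMPLES = [
    "abc§\\ndef§\\nghi\\n",
    "abc§\\ndef§\\nghi§\\n",
    "abc§\\ndef§\\nghi§§\\n",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(strip_with_capture(sample), strip_with_lookahead(sample))
```

The capture version consumes the character after the delimiter, so unlike the lookahead it cannot start a new match on that character.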
EDIT: This answer doesn't work specifically for lua, but if you have a similar problem and are not constrained to lua you might be able to use it.
So if I understand correctly, you want a regex replace to make the first example look like the second. This:
/(.*?)§\\n(?=.*\\n)/g
will eliminate the non-last delimiters when replaced with
$1
in PCRE, at least. I'm not sure what flavor Lua follows, but you can see the example in action here.
REGEX:
/(.*?)§\\n(?=.*\\n)/g
TEST STRING:
abc§\ndef§\nghi\n
abc§\ndef§\nghi§\n
abc§\ndef§\nghi§§\n
SUBSTITUTION:
$1
RESULT:
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n

how to define a regular expression in boost?

I have a section in file:
[Source]
[Source.Ia32]
[Source.Ia64]
I have created the expression as:
const boost::regex source_line_pattern ("(Sources)(.*?)");
Now, I am trying to match the string, but I am not able to match; it is always returning 0.
if (boost::regex_match ( sToken, source_line_pattern ) )
    return TRUE;
Please note that sToken takes the values [Source], [Source.Ia32], and so on.
Thanks,
There are at least two problems with your code. First, the regular expression you give contains the literal string "Sources", and not "Source", which is what you seem to be trying to match. The second is that boost::regex_match is anchored: it must match the entire string. What you seem to want is boost::regex_search. Depending on what you are doing, however, it might be better to try to match the entire string: "\\[Source(?:\\.(\\w+))?\\]\\s*". This provides for capture of the trailing part, if present (but not the leading "Source"; there is no point, in general, in capturing something that is a constant).
Note too that the trailing ".*?" is dubious. In Boost's default (Perl-style) syntax, a '?' after a '*' makes the repetition non-greedy, and at the very end of a pattern that serves little purpose.
The issue is that boost::regex_match only returns true if the entire input string is matched by the regex. So the '[' and ']' are not matched by your current regex, and it will fail.
Your options are either to use boost::regex_search, which will search for a substring of the input that matches your regex, or modify your regex to accept the entire string being passed in.
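The whole-string versus substring distinction can be sketched as follows. This uses Python's re module as a stand-in, with re.fullmatch playing the role of boost::regex_match and re.search that of boost::regex_search (an analogy, not the Boost API); the function name section_suffix is hypothetical.

```python
import re

# The whole-line pattern suggested in the answer above.
SECTION = re.compile(r"\[Source(?:\.(\w+))?\]\s*")


def section_suffix(token):
    """Return the part after 'Source.' ('' if absent), or None if no match."""
    m = SECTION.fullmatch(token)
    if m is None:
        return None
    return m.group(1) or ""


if __name__ == "__main__":
    # Whole-string matching fails: the brackets are not covered by the pattern.
    print(re.fullmatch(r"(Source)(.*?)", "[Source.Ia32]"))
    # Substring search succeeds on the same input.
    print(re.search(r"(Source)(.*?)", "[Source.Ia32]") is not None)
    for token in ["[Source]", "[Source.Ia32]", "[Source.Ia64]", "[Sources]"]:
        print(token, section_suffix(token))
```

Matching the whole token has the side benefit of rejecting near misses such as [Sources], which a plain substring search for Source would accept.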