Boost regex does not match - c++

I made a python regular expression and now I'm supposed to code the program in C++.
I was told to use boost's regex by the respective person.
It is supposed to match a group of 1 to 80 lowercase alphanumeric characters (underscore included), followed by a slash, then another group of 1 to 80 lowercase alphanumeric characters (again including underscore), and finally a question mark. The total string must be at least 1 character long and is not allowed to exceed 256.
Here is my python regex:
^((?P<grp1>[a-z0-9_]{1,80})/(?P<grp2>[a-z0-9_]{1,80})([?])){1,256}$
My current boost regex is:
^(([a-z0-9_]{1,80})\/([a-z0-9_]{1,80})([?])){1,256}$
Cut down basically my code would look like this:
boost::cmatch match;
bool isMatch;
boost::regex myRegex = "^(([a-z0-9_]{1,80})\/([a-z0-9_]{1,80})([?])){1,256}$";
isMatch = boost::regex_match(str.c_str(), match, myRegex);
Edit: whoops totally forgot the question xDD. My problem is quite simple: The regex doesn't match though it's supposed to.
Example matches would be:
some/more?
object/value?
devel42/version_number?

The last requirement
The total string must be at least 1 character long and is not allowed to exceed 256.
is always true, as your string is already limited to between 4 and 162 characters. You only have to keep the first part of your regex:
^[a-z0-9_]{1,80}/[a-z0-9_]{1,80}\?$
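A quick way to sanity-check the simplified pattern is to run it against the example strings from the question. This is only a sketch in Python's re module rather than Boost (this particular pattern uses no syntax that differs between the two, as far as I know), and the names QUERY_RE and is_query are made up for illustration.

```python
import re

# Simplified pattern from the answer above: name/name? with 1-80 chars per name.
QUERY_RE = re.compile(r"[a-z0-9_]{1,80}/[a-z0-9_]{1,80}\?")


def is_query(text: str) -> bool:
    """Return True if the whole string has the form name/name?."""
    # fullmatch requires the entire string to match, like boost::regex_match.
    return QUERY_RE.fullmatch(text) is not None


# Shortest and longest strings the pattern can accept.
SHORTEST = "a/b?"                            # 1 + 1 + 1 + 1 characters
LONGEST = "a" * 80 + "/" + "b" * 80 + "?"    # 80 + 1 + 80 + 1 characters
```

Since the longest accepted string is well under 256 characters, no separate length check is needed.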

My g++ gives me the warning "unknown escape sequence: '\/'"; that means you should use "\\/" instead of "\/". You need a backslash character stored in the string, so that the regex parser can then consume it as an escape character.
By the way, my boost also requires a constructor invocation, so
boost::regex myRegex("^(([a-z0-9_]{1,80})\\/([a-z0-9_]{1,80})([?])){1,256}$");
seems to work.
You can also use a C++11 raw string literal to avoid C++ escaping:
boost::regex myRegex(R"(^(([a-z0-9_]{1,80})\/([a-z0-9_]{1,80})([?])){1,256}$)");
By the way, testing <regex> in libstdc++ svn is welcome. It should come with GCC 4.9 ;)

The actual error was a newline sent to the server by the client on entering the respective string, which then ended up in the string being compared.
Funny how the error's root is rarely where you expect it to be.
Anyways, thank you all for your answers. They gave me the ability to clean up my regular expressions.
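For anyone hitting the same thing, the effect of a trailing newline on a whole-string match can be reproduced in a few lines. This is a sketch in Python, not the original C++ server code; the helper names are hypothetical, and stripping "\r\n" is an assumption about what the client sends.

```python
import re

QUERY_RE = re.compile(r"[a-z0-9_]{1,80}/[a-z0-9_]{1,80}\?")


def is_query(raw: str) -> bool:
    # Whole-string match: a trailing "\n" or "\r\n" makes this fail.
    return QUERY_RE.fullmatch(raw) is not None


def is_query_line(raw: str) -> bool:
    # Drop line terminators added by the client before matching.
    return is_query(raw.rstrip("\r\n"))
```

Stripping the terminator before matching keeps the regex itself strict.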

Related

Java regex for definite or any character less than 11

I am a Rails developer but I need a regular expression that can allow a shortcode or any set of characters not more than 11 in total.
I was thinking something like:
(7575|[0-9a-zA-Z& ]*{11})
However, it has not worked.
I don't know what function you are using (this matters because find and matches behave differently), but to make things unambiguous, you can use the following:
^(7575|[0-9a-zA-Z& ]{1,11})$
The above means either match 7575 or match between 1 and 11 characters from the character set 0-9a-zA-Z& . If you want to allow an empty string as well, you will have to use {0,11} instead.
A slightly more memory efficient one would be ^(?:7575|[0-9a-zA-Z& ]{1,11})$ (since there are no capture groups).
^ matches the beginning of the string and $ matches the end of the string, thus ensuring there are no more characters before or after the matched part.
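The difference the anchors make can be seen by comparing an anchored and an unanchored search. The sketch below uses Python's re module instead of Java (re.search plays roughly the role of Java's find here); the names CODE_RE, UNANCHORED_RE, is_valid and finds_somewhere are made up for illustration.

```python
import re

# Anchored: the whole input must be 7575 or 1-11 allowed characters.
CODE_RE = re.compile(r"^(?:7575|[0-9a-zA-Z& ]{1,11})$")
# Unanchored: any 1-11 character run inside a longer input is enough.
UNANCHORED_RE = re.compile(r"7575|[0-9a-zA-Z& ]{1,11}")


def is_valid(text: str) -> bool:
    return CODE_RE.search(text) is not None


def finds_somewhere(text: str) -> bool:
    return UNANCHORED_RE.search(text) is not None
```

A 12-character input is rejected by is_valid but still "found" by the unanchored search, which is why the function being used matters.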
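A shorter variant, if word characters are what you want to allow:
"^(7575|\\w{1,11})$"
where \w is a word character, short for [a-zA-Z_0-9]. Note that this is not the same set as above: it accepts the underscore and rejects & and the space. In a Java string literal the backslash has to be doubled, as shown.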

Regex to Match All Except a String

Given the string beginend where begin and end are both optional, I want to match the whole string and back-reference only begin. Begin is unknown but alpha-numeric; end is literally end. How would I go about doing this?
In case it matters, I'd be using this in a Textpad macro to replace "beginend" with something else including "begin".
To match a string of "alpha-numeric" characters that does not contain "end" you can use something like:
(?:(?!end)[A-Za-z\d])+
An expression like this would do what you ask:
^((?:(?!end)[A-Za-z0-9])+)(?:end)?\z
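To see what this expression captures, here is a small sketch in Python's re module. Two things are assumptions of the sketch and not part of the answer: the helper name extract_begin, and the use of \Z, which is Python's spelling of the end-of-string anchor written \z in Perl-style engines.

```python
import re

# Each character of "begin" is only accepted if "end" does not start there.
BEGIN_RE = re.compile(r"^((?:(?!end)[A-Za-z0-9])+)(?:end)?\Z")


def extract_begin(text: str):
    """Return the captured begin part, or None if the pattern does not match."""
    m = BEGIN_RE.match(text)
    return m.group(1) if m else None
```

Because the lookahead forbids "end" anywhere inside the captured part, a string whose begin part itself contains "end" is rejected outright.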
EDITED (see after blockquote)
I don't have commenting privileges, so I can't comment on his
solution, but Qtax's solution will not work because it assumes that
begin will never contain the substring "end", e.g., it wouldn't
match the string "sendingend".
My solution:
^([A-Za-z0-9]*)(?:end)?$
Of course, it also depends on what you mean by alphanumeric. My
example has the strictest definition, i.e., just upper- and lower-case
letters plus digits. You'd need to add in other characters if you want
them. If you want to include the underscore as well as those
characters, you can replace the whole bulky [A-Za-z0-9] with \w
(equivalent to [A-Za-z0-9_]). Add \s if you want whitespace.
Since you said your regex knowledge is limited, I'll explain the rest
of the solution to you and whoever else comes along.
^ and $ match the beginning and the end of the string, respectively. By including the $ in particular, you're
guaranteeing that the last "end" you encounter is really at the end.
Without the anchors, a match would be allowed to cover only part of
the input, and the rest of your program could wrongly think it had
found that "end" at the end. With them, the pattern still matches
"sendingsending", because any alphanumeric characters are allowed (see
below) and the trailing "end" is optional. Note that because * is
greedy, the ([A-Za-z0-9]*) group captures the entire string, whether
or not it finishes with "end". You therefore need another regex to
check for the presence or absence of "end"...so you'd do something like
(end)$ to locate it.
([A-Za-z0-9]*): the square brackets contain the specific characters that are allowed (you should definitely read up on this if
you don't know). The * means it will match one of those characters 0
or more times, so this allows for no string (i.e., just "end") as well
as super-long strings. The parentheses are capturing that pattern so
you can back-reference it.
(?:end)?: the last ? makes it match this pattern 0 or 1 times (i.e., makes it optional). The (?:string) structure allows you to
group characters together as you would with parentheses but the ?:
makes it not save that pattern, so it uses less memory. In your
case, that memory would be negligible, but it's nice to know for
future use.
If you need more help, try Googling 'regex'. There's tons of good
references. You can also test them out. My personal favourite tester
is called My Regex Tester.
Good luck!
I just tried looking up TextPad macros, and you might run into a problem. As I've explained above, to verify the presence of "end" at the end of the string, you'll need something separate. I was envisioning some kind of conditional, something like IF (end)$ THEN replace with ^([A-Za-z0-9]*)(?:end)?$ ELSE use the whole string. However, I don't know if you can do that with these macros...it's hard to say, because I'm not a TextPad user and there's next to no documentation. If you can't, then I think you're going to have to put some restrictions on it. One idea is to not allow "end" to be anywhere in the begin substring (which is how Qtax's solution did it). But now I'm wondering...if "end" is going to be optional, and if conditionals aren't allowed, what's the point of having it at all? ...perhaps I'm overthinking things. I await your reply.
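One possible way to avoid the separate check is a lazy quantifier on the capture group, so that it stops before a trailing "end" instead of swallowing it. This is a sketch in Python's re module; whether TextPad's regex engine supports lazy quantifiers is an assumption that has not been verified here, and the names GREEDY_RE, LAZY_RE, begin_greedy and begin_lazy are made up for illustration.

```python
import re

GREEDY_RE = re.compile(r"^([A-Za-z0-9]*)(?:end)?$")   # pattern from the answer
LAZY_RE = re.compile(r"^([A-Za-z0-9]*?)(?:end)?$")    # same, with a lazy *?


def begin_greedy(text: str):
    m = GREEDY_RE.match(text)
    return m.group(1) if m else None


def begin_lazy(text: str):
    m = LAZY_RE.match(text)
    return m.group(1) if m else None
```

With the greedy group the capture includes the trailing "end"; with the lazy group the optional (?:end)? gets the chance to consume it first.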
Try using a positive look-ahead. This is a zero-width assertion, so it won't be included in the match. It also allows for the substring end to be present within the alpha-numeric string:
([a-z0-9]*)(?=end)
What this is saying is: match an alpha-numeric string only if it is immediately followed by end.
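A short sketch of how this look-ahead behaves, using Python's re module (the helper name begin_before_end is hypothetical). Note that with this pattern the trailing end is required, not optional.

```python
import re

LOOKAHEAD_RE = re.compile(r"([a-z0-9]*)(?=end)")


def begin_before_end(text: str):
    """Return the text captured before a following 'end', or None."""
    m = LOOKAHEAD_RE.search(text)
    return m.group(1) if m else None
```

The greedy group backtracks from the end of the string, so it stops at the last "end" and an earlier "end" inside the string stays part of the capture.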

Regex to check if a string contains at least A-Za-z0-9 but not an &

I am trying to check if a string contains at least A-Za-z0-9 but not an &.
My experience with regexes is limited, so I started with the easy part and got:
.*[a-zA-Z0-9].*
However I am having trouble combining this with the "does not contain an &" portion.
I was thinking along the lines of ^(?=.*[a-zA-Z0-9].*)(?![&()]).* but that does not seem to do the trick.
Any help would be appreciated.
I'm not sure if this is what you meant, but here is a regular expression that will match any string that:
contains at least one alpha-numeric character
does not contain a &
This expression ensures that the entire string is always matched (the ^ and $ at beginning and end), and that none of the characters matched are a "&" sign (the [^&]* sections):
^[^&]*[a-zA-Z0-9][^&]*$
However, it might be clearer in code to simply perform two checks, if you are not limited to a single expression.
Also, check out the \w class in regular expressions (it might be the better solution for catching alphanumeric chars if you want to allow non-ASCII characters).
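Both approaches, the single expression and the two separate checks, can be compared side by side. This is a sketch in Python; the function names are made up for illustration.

```python
import re

NO_AMP_RE = re.compile(r"^[^&]*[a-zA-Z0-9][^&]*$")
ALNUM_RE = re.compile(r"[a-zA-Z0-9]")


def valid_single(text: str) -> bool:
    # One expression: at least one alphanumeric, and no & anywhere.
    return NO_AMP_RE.search(text) is not None


def valid_two_checks(text: str) -> bool:
    # Two plain checks that say the same thing more directly.
    return "&" not in text and ALNUM_RE.search(text) is not None
```

The two-check version is arguably easier to read and to extend if more forbidden characters are added later.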

Regex for a string up to 20 chars long with a comma

I need to define a regex for a string with the following requirements:
Maximum 20 characters
Must be in the form Name,Surname
No numbers and special characters allowed (again, it's a name&surname)
I already tried something like ^[^1-9\?\*\.\?\$\^\_]{1,20}[,][^1-9\?\*\.\?\$\^\_\-]{1,20}$ but as you can find, it also matches a 40 chars long string.
How can I check for the whole string's maximum length and at the same time impose 1 comma inside of it and obviously not at the borders?
Thank you
Try the regex:
^(?=[^,]+,[^,]+$)[a-zA-Z,]{1,20}$
Rubular Link
Explanation:
^ : Start anchor
(?=[^,]+,[^,]+$) : Positive lookahead to ensure string has exactly one comma
surrounded by at least one non-comma character on both sides.
[a-zA-Z,]{1,20} : Ensure entire string is of length max 20 and has only
letters and comma
$ : End anchor
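The behaviour described in the explanation can be checked with a few inputs. A minimal sketch in Python's re module; the names NAME_RE and is_name are hypothetical.

```python
import re

# Lookahead enforces exactly one comma with text on both sides;
# the main part enforces letters/comma only and a total length of 1-20.
NAME_RE = re.compile(r"^(?=[^,]+,[^,]+$)[a-zA-Z,]{1,20}$")


def is_name(text: str) -> bool:
    return NAME_RE.search(text) is not None
```

The length limit applies to the whole string, comma included, because the {1,20} quantifier covers everything between the anchors.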
You can do this using a negative lookahead assertion:
^(?!.{21})[A-Za-z]+,[A-Za-z]+$
The regex now contains two parts: the actual definition, and an assertion at the start saying that, from that point on, there are not 21 (or more) characters.
So for the definition as stated above, the regex becomes
^(?!.{21})[^1-9\?*\.\?\$\^_\,]+,[^1-9\?*\.\?\$\^_\,]+$
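The length assertion can be exercised on its own with the simpler letters-only form of the pattern. A sketch in Python's re module; the names SHORT_NAME_RE and is_short_name are made up for illustration.

```python
import re

# (?!.{21}) fails as soon as 21 characters are available from the start,
# so only strings of at most 20 characters reach the rest of the pattern.
SHORT_NAME_RE = re.compile(r"^(?!.{21})[A-Za-z]+,[A-Za-z]+$")


def is_short_name(text: str) -> bool:
    return SHORT_NAME_RE.search(text) is not None
```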
The obvious answer would be: Don't ask for name and surname in the same input field.
If you still want to do it: There's no easy way that I know of, but here is a possibility. To see the principle, read X as your [^1-9\?\*\.\?\$\^\_\,] (I added the \, since it's kind of important :-)).
^(?:X{1},X{1,18}|X{2},X{1,17}|...|X{18},X{1})$
Quite ugly, but should work.
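The enumeration does not have to be typed out by hand; it can be generated. The sketch below is one reading of the idea, with assumptions that are not in the answer: X is taken to be [A-Za-z] for brevity, the counts are chosen so that the total length, comma included, stays at or under 20, and the alternation is wrapped in a group so that ^ and $ apply to every branch. All names are hypothetical.

```python
import re

X = "[A-Za-z]"   # assumption: stand-in for the character class in the question
MAX_TOTAL = 20   # maximum length of the whole string, comma included


def build_pattern(x: str = X, max_total: int = MAX_TOTAL):
    # One branch per length of the first part; the second part may use
    # whatever is left after the first part and the comma.
    branches = [
        f"{x}{{{a}}},{x}{{1,{max_total - 1 - a}}}"
        for a in range(1, max_total - 1)
    ]
    return re.compile("^(?:" + "|".join(branches) + ")$")


ENUM_RE = build_pattern()


def is_name_enum(text: str) -> bool:
    return ENUM_RE.search(text) is not None
```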
On a different note: You don't capture nearly all special characters with your exclusive range. But it's probably still better than an inclusive range.
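As I say, I think that, stated the way you have it, this is awkward to match with a single plain regular expression: the length limit spans both sides of the comma, so the pattern has to enumerate the combinations (or use a lookahead).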
However, you could always split on ',' and match each substring, then total.
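A minimal sketch of the split-and-check approach in Python. The letters-only class [A-Za-z] and the function name is_name_split are assumptions made for the example.

```python
import re

PART_RE = re.compile(r"[A-Za-z]+")  # assumption: letters only in each part


def is_name_split(text: str, max_total: int = 20) -> bool:
    # Check the total length first, then each side of the comma separately.
    if len(text) > max_total:
        return False
    parts = text.split(",")
    return len(parts) == 2 and all(PART_RE.fullmatch(p) for p in parts)
```

Keeping the length check outside the regex makes each rule visible on its own line.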
Have you tried your example, but removing the
{1,20}
in the middle, leaving this to try:
^[[^1-9\?\*\.\?\$\^\_],[^1-9\?\*\.\?\$\^\_\-]]{1,20}$
Use:
[[a-zA-Z],[a-zA-Z]]{1,20}

Regex to *not* match any characters

I know this is quite a weird goal, but as a quick and dirty fix for one of our systems we need to not filter any input and let the corruption go into the system.
My current regex for this is "\^.*"
The problem with that is that it does not match characters as planned ... but for one match it does work. The string that makes it not work is ^#jj (basically anything that has ^ ... ).
What would be the best way to not match any characters now? I was thinking of removing the \ but only doing this will transform the "not" into a "start with" ...
The ^ character doesn't mean "not" except inside a character class ([]). If you want to not match anything, you could use a negative lookahead that matches anything: (?!.*).
A simple and cheap regex that will never match anything is to match against something that is simply unmatchable, for example: \b\B.
It's simply impossible for this regex to match, since it's a contradiction.
References
regular-expressions.info/Word Boundaries
\B is the negated version of \b. \B matches at every position where \b does not.
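That \b\B can never match is easy to confirm on a handful of inputs, and the same check works for the (?!.*) lookahead suggested earlier. This is a sketch in Python's re module; the sample strings and the helper name any_match are made up for illustration.

```python
import re

NEVER_PATTERNS = [r"(?!.*)", r"\b\B"]
SAMPLES = ["", "^#jj", "hello world", "a\nb", "   "]


def any_match(pattern: str, samples=SAMPLES) -> bool:
    """Return True if the pattern matches anywhere in any sample."""
    rx = re.compile(pattern)
    return any(rx.search(s) is not None for s in samples)
```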
Another well-supported and fast pattern that fails to match anything, and does so in constant time:
$unmatchable pattern $anything goes here etc.
$ of course indicates the end-of-line. No characters could possibly go after $ so no further state transitions could possibly be made. An additional advantage is that your pattern is intuitive, self-descriptive and readable as well!
tldr; The most portable and efficient regex to never match anything is $- (end of line followed by a char)
Impossible regex
The most reliable solution is to create an impossible regex. There are many impossible regexes but not all are as good.
First you want to avoid "lookahead" solutions because some regex engines don't support it.
Then you want to make sure your "impossible regex" is efficient and won't take too many computation steps to match... nothing.
I found that $- has a constant computation time ( O(1) ) and only takes two steps to compute regardless of the size of your text (https://regex101.com/r/yjcs1Z/3).
For comparison:
$^ and $. both take 36 steps to compute -> O(1)
\b\B takes 1507 steps on my sample, and the count increases with the number of characters in your string -> O(n)
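The never-matches property of $- can be checked in Python's re module as below. This sketch does not reproduce the regex101 step counts, only the matching behaviour, and the sample strings are made up. It also shows one caveat that holds at least in Python: $^ does match the empty string, so it is not a safe "impossible" pattern there.

```python
import re

END_THEN_DASH = re.compile(r"$-")
END_THEN_DASH_MULTILINE = re.compile(r"$-", re.MULTILINE)
END_THEN_START = re.compile(r"$^")

SAMPLES = ["", "-", "a-b", "^#jj", "line\n", "-\n-"]


def ever_matches(rx, samples=SAMPLES) -> bool:
    """Return True if the compiled pattern matches anywhere in any sample."""
    return any(rx.search(s) is not None for s in samples)
```

Since $ can only succeed at the end of the input or just before a newline, the character that follows it is never a dash, with or without multiline mode.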
Empty regex (alternative solution)
If your regex engine accepts it, the best and simplest regex to never match anything might be: an empty regex .
Instead of trying to not match any characters, why not just match all characters? ^.*$ should do the trick. If you have to not match any characters then try ^\j$ (assuming, of course, that your regular expression engine will not throw an error when you provide it an invalid character class). If it does, try ^()$. A quick test with RegexBuddy suggests that this might work.
^ only means "not" when it's inside a character class (such as [^a-z], meaning anything but a-z). You've turned it into a literal ^ with the backslash.
What you're trying to do is [^]*, but that's not legal. You could try something like
" {10000}"
which would match exactly 10,000 spaces; if that's longer than your maximum input, it should never be matched.
((?iLmsux))
Try this, it matches only if the string is empty.
Interesting ... the most obvious and simple variant:
~^
https://regex101.com/r/KhTM1i/1
It usually requires only one computation step (failing directly at the start, and being computationally expensive only if the matched string begins with a long series of ~), yet it is not mentioned among all the other answers ... for 12 years.
You want to match nothing at all? Negative lookarounds seem obvious, but can be slow; perhaps ^$ (matches the empty string only) as an alternative?