Regex not completely matching the desired patterns - regex

I've been trying to write a regex to strictly match the following patterns -
3+, 3 months +, 3+ months, 3m+, 3 years +, 3y+, 3+ years.
I've written a pattern - [0-9]{1,2}\s?(m|months?|y|years?)?\s?\+ which works for most of the cases except 3+ months and 3+ years; there it matches just the 3+ part and not the months part. I want to use the matched string somewhere, and this causes an issue. To accommodate this I tried adding another group to make the regex look like [0-9]{1,2}\s?(m|months?|y|years?)?\s?\+ (m|months?|y|years?)\s|$ but this also matches 6+ math. Can someone help me with what the issue is here? Also, can this regex be improved?
I could use multiple regexes to cover the different cases, but I wanted to achieve this using only one regex.
Thanks

UPDATED:
I overlooked the strictness requirement. Something that helps me figure out these problems is breaking the pattern down into a logic tree of sorts, like so:
Only XX months allowed
Only XX years allowed - \d{1,2}
If ## followed by a +
It must be followed by \s*(months|years)
If ## followed by a [my]
It must be followed by \s*\+
Close it off with $
That gets us on the right track to what we want. Again, it still permits undesired cases, but just revisit that thought exercise, tinker with the conditionals, and try to find common components that restrict the full regex to neat grouping of stricter patterns.
This should be closer to the strict solution you’re looking for:
\d{1,2}\s*([my]?\+|\+?\s*(months|years)|(months|years)\s*\+?)\s*$
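A quick way to check the strict pattern above is to run it over the inputs listed in the question. The sketch below uses Python's re module purely as a test harness (an assumption, since the question does not name a language); the variable names are illustrative.

```python
import re

# Strict pattern from the answer above, copied unchanged.
STRICT = re.compile(
    r"\d{1,2}\s*([my]?\+|\+?\s*(months|years)|(months|years)\s*\+?)\s*$"
)

# The seven inputs the question wants to match.
wanted = ["3+", "3 months +", "3+ months", "3m+", "3 years +", "3y+", "3+ years"]

# re.match anchors at the start of the string; the trailing $ anchors the end.
accepted = [s for s in wanted if STRICT.match(s)]

# "6+ math" was the false positive reported in the question.
rejects_math = STRICT.match("6+ math") is None

# As the answer warns, some unwanted inputs still pass, e.g. one with no "+".
still_too_loose = STRICT.match("3 months") is not None

print(accepted == wanted, rejects_math, still_too_loose)
```

Running a list of known-good and known-bad inputs like this after every tweak makes it easier to see which branch of the alternation is letting an unwanted case through.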
===========================
ORIGINAL POST:
Here’s a first pass at a condensed version of what you want:
\d{1,2}[\+my]?\s*(months|years)?\s*\+?
Here’s a breakdown of the approach I took:
(\d{1,2})
^ Accommodate any one- or two-digit number (your approach is fine; \d means any digit 0-9 and saves a few characters)
([\+my]?\s*)
^ The characters following the given number may be m, y, or + followed by any number of spaces.
(months|years)?
^ We’ve accounted for all spaces with the previous piece of regex, so let's just say there might be months|years at this point.
(\s*\+?)
^ Last potential symbol is a +, but it might have several spaces in front of it.

Try the following regex
((?:\d{1,2}\+ (?:months|years))|(?:\d{1,2}\+)|(?:\d{1,2} (?:months|years) \+)|(?:\d{1,2}(?:m|y)\+))

You may use this regex with optional matches:
\d{1,2}\+?(?:\s?(?:months?|years?|[my])\b\s?\+?)?
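As a rough check of the pattern above, the following sketch runs it in Python (an assumed stand-in for whatever engine the asker uses) against the inputs from the question. It also shows what the \b buys you on "6+ math" and one side effect of the trailing \s?.

```python
import re

# Pattern from the answer above, copied unchanged.
OPTIONAL = re.compile(r"\d{1,2}\+?(?:\s?(?:months?|years?|[my])\b\s?\+?)?")

wanted = ["3+", "3 months +", "3+ months", "3m+", "3 years +", "3y+", "3+ years"]

# fullmatch requires the whole input to be consumed by the pattern.
full = [s for s in wanted if OPTIONAL.fullmatch(s)]

# The \b after [my] stops the "m" of "math" from being taken as a unit,
# so only the "6+" part is matched.
partial = OPTIONAL.search("6+ math").group(0)

# The trailing \s? can pull one following space into the match.
trailing = OPTIONAL.search("3+ months left").group(0)

print(full == wanted, repr(partial), repr(trailing))
```

If the matched string is reused elsewhere, stripping it (for example with str.strip()) avoids carrying that extra space along.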

Related

Regex to #h #min format

I'm trying to create a case-insensitive regex for the "#h|H\s#min|MIN" format. For example, 1H 2MIn and 2h 03min should match; 1H 60MIN and 02h 100min should not match. Thanks to Jesse for pointing this out: there is no limit on the number of digits in the hours. 60min is supposed to be an hour, so anything above 59 minutes should not match.
Currently, I got:
/^[0-9]H|h\s([0-5]?\d)(MIN)|(min)$/
It doesn't work for cases like 02h 100min.
So which part did I get wrong?
I thought ([0-5]?\d) was supposed to match at most two digits.
Thank you for any help!
Edit:
I think I figured this out.
/^\d+h\s[0-5]?\dmin$/i
worked in this case.
Thanks again
The Problem
The main reason your expression isn't working as expected is because of how you're using the alternative (|) operator. This operator tells the engine that it can either match the pattern on the left side, or the pattern on the right side, but not both. The problem here is that these options aren't confined to a group, so your options aren't what you think they are.
Your entire expression is actually split into three alternative expressions. The engine is being told to match:
^[0-9]H OR
h\s([0-5]?\d)(MIN) OR
(min)$
This explains the strange behavior. This can be fixed by confining what you actually want to check for to a group. In this case, since you're wanting to match H or h and MIN or min, we can create the groups (H|h) and (MIN|min). Pretty simple. After these changes your expression would look like this:
^[0-9](H|h)\s([0-5]?\d)(MIN|min)$
However, this expression has another problem. It only matches one digit in the hour section. This can be fixed by adding a quantifier to it. In this case we can use +. This tells the engine to match the previous token between 1 and unlimited times. We can apply this to [0-9] to get [0-9]+, which will match one or more digits from 0 to 9. After this change your expression would look like this:
^[0-9]+(H|h)\s([0-5]?\d)(MIN|min)$
Now you have a working expression, but it can be simplified quite a bit, so let's talk about that next.
Improvements
First, [0-9] can be replaced with \d. In most flavors these do the same thing and match any digit between 0 and 9 (some Unicode-aware engines also let \d match non-ASCII digits).
^\d+(H|h)\s([0-5]?\d)(MIN|min)$
Second, since this problem was caused by your need to match both upper and lower case, let's eliminate this problem entirely by using the case insensitivity (i) flag. In many flavors the easiest way to apply this is by using (?i) at the beginning of your expression. Not every engine accepts the inline form, though; JavaScript, for example, takes the flag outside the pattern, as in /.../i. With this flag, we no longer have to worry about matching upper and lower case letters. We can replace (H|h) with just h, and (MIN|min) with just min.
(?i)^\d+h\s([0-5]?\d)min$
This expression should do the job. You can save 3 more characters by replacing \s with (space) and by removing the parentheses around [0-5]?\d, but those are insignificant changes so I'll leave that up to you.
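To see the alternation problem concretely, here is a small sketch in Python (an assumption; the question uses /.../ literals, which suggests another language). It runs the asker's original expression and the final one from this answer over the examples in the question.

```python
import re

# Original expression: the two top-level | split it into three alternatives.
BROKEN = re.compile(r"^[0-9]H|h\s([0-5]?\d)(MIN)|(min)$")

# Final expression from the answer; Python accepts (?i) at the start.
FIXED = re.compile(r"(?i)^\d+h\s([0-5]?\d)min$")

should_match = ["1H 2MIn", "2h 03min"]
should_not_match = ["1H 60MIN", "02h 100min"]

# The third alternative, (min)$, is satisfied by the tail of the string alone.
stray = BROKEN.search("02h 100min").group(0)

fixed_ok = all(FIXED.search(s) for s in should_match)
fixed_rejects = not any(FIXED.search(s) for s in should_not_match)

print(repr(stray), fixed_ok, fixed_rejects)
```

Printing the matched text, rather than only testing for a match, is what exposes the problem: the broken expression reports a match on "02h 100min", but the match is just the trailing "min".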

regex substitution no global modifiers available

I'm using software with built-in regex implementation that does not support global modifiers, so I have to get it working without /g
My test string is (the number of sections can be unlimited):
aaa%2dbbb%2dccc%2dddd%2deee
I want it to be: aaa-bbb-ccc-ddd-eee
normally I would write (%2d) and g flag and substitute with -
I managed to write this to match unlimited number of occurrences
(\w)((%2d)(\w+))+
but I have problems with substitution rule, because my group 2 has 2 subgroups and I cannot find out how to handle them,
can anyone help with substitution rule?
As the comments ended up reaching the same conclusion I had before posting the question, I decided to post an answer to close the question nicely (instead of deleting it, because even a negative answer is an answer and may save someone an hour or more of research, as happened to me). The general conclusion is that it's not possible to solve this with regex. I'm quoting the two best comments by @ltux here:
This problem can't be solved with regular expression in one go. If capture group is used with quantifier such as +, the content of the capture group will always be the last match found. In your case, the content of the 2nd capture group will be %2deee, and you can't get %2dbbb, %2dccc and so on, so there is no chance for you to substitute it. – ltux 2 days ago
Regular expression can't solve your problem. You have to try to bypass the limitations of the software by yourself, unless you tell us which software you are using. – ltux 2 days ago
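The point about quantified capture groups can be observed directly. The sketch below uses Python's re module as a stand-in (an assumption, since the software in question is never named) with the pattern and test string from the question.

```python
import re

s = "aaa%2dbbb%2dccc%2dddd%2deee"

# Pattern from the question: group 2 sits under a + quantifier.
m = re.search(r"(\w)((%2d)(\w+))+", s)

# After the match, each repeated group holds only its final iteration;
# the earlier %2dbbb, %2dccc and %2dddd captures have been overwritten.
groups = (m.group(1), m.group(2), m.group(3), m.group(4))

print(groups)
```

Because only the last iteration survives, a replacement string built from group references has nothing to refer to for the middle sections.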
Create a file containing the line type that you want to process:
cat << EOF >> abcde.txt
aaa%2dbbb%2dccc%2dddd%2deee
EOF
Use this sed snippet as follows using the global substitution you mention as being the way you usually perform such a substitution.
sed -e "s#%2d#-#g" abcde.txt
aaa-bbb-ccc-ddd-eee
Basically you don't have to think about the characters that appear around the encoded '%2d' sequence; concentrate only on the sequence itself. Replacing it everywhere it occurs solves the issue quite simply. In other words, pattern matching around the text you want to change is not necessary. This is a common trap that many of us fall into when dealing with regular expressions.
Basically the substitution is saying: find the first occurrence of '%2d', replace it with a hyphen '-', and repeat for the rest of the string.
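If the host tool can be scripted, the "replace once, then repeat" idea can be emulated without a global flag. This is a minimal sketch in Python, which is an assumption because the asker's software is unknown; replace_all_without_global is a hypothetical helper name. Since %2d is the percent-encoding of a hyphen, the standard library's urllib.parse.unquote is shown as an alternative.

```python
import re
from urllib.parse import unquote


def replace_all_without_global(text, pattern, repl):
    """Apply a single (non-global) substitution until nothing is left to replace.

    Only safe when the replacement cannot itself match the pattern;
    otherwise the loop would never terminate.
    """
    while True:
        text, n = re.subn(pattern, repl, text, count=1)
        if n == 0:
            return text


s = "aaa%2dbbb%2dccc%2dddd%2deee"

looped = replace_all_without_global(s, r"%2d", "-")

# Decoding every percent-escape at once; this also decodes escapes other than %2d.
decoded = unquote(s)

print(looped, decoded)
```

The loop is the closer analogue of a tool limited to one substitution per pass; unquote is the better fit when the input really is URL-encoded text.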

Mod Rewrite RegEx To Match Only If Previous Subset Matched

I am trying to make what I think is a simple regex for use with mod_rewrite.
I've tried various expressions, many of which I thought were promising, but all of which ultimately failed for one reason or another. They all also seem to fail once I add start/end string delimiters.
For example, ^user/(\d{1,10})(?=/)$ was one I tried, but among other things, it seems to group the trailing slash, and I only want to group the digits. I think I need to use a positive lookbehind, but I'm having difficulty because it's looking behind at a group.
What I am trying to match is strings that 1) begin with "user/" and 2) possibly end with (\d{1,10})/ (1 to 10 digits followed by a single slash)
Should Match:
user/
user/123/
user/1234567890/
Should not match:
user
user//
user/-4/
user/35.5/
user/123
user/123//
user/123/5/
user/12345678901/
Edit: Sorry about the formatting; I do not understand how to format anything via this markdown. Those examples are preceded by 4 spaces which I thought should make a code block, but obviously I thought wrong.
^user/(?:([0-9]{1,10})/)?$ should work just fine.
This: ^user(?=/)(/\d{1,10})?/$ Edit: if you want to group digits, ^user(?=/)(?:/(\d{1,10}))?/$
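The first suggestion above can be checked against the lists in the question. mod_rewrite uses PCRE rather than Python, so the sketch below is only an approximation under that assumption, but the constructs involved behave the same way in both engines.

```python
import re

# First pattern suggested above: optional, non-capturing "digits + slash"
# section with a capture group around the digits only.
USER_PATH = re.compile(r"^user/(?:([0-9]{1,10})/)?$")

should_match = ["user/", "user/123/", "user/1234567890/"]
should_not_match = [
    "user", "user//", "user/-4/", "user/35.5/", "user/123",
    "user/123//", "user/123/5/", "user/12345678901/",
]

# Group 1 holds the digits, or None when the optional section is absent.
captured = {s: USER_PATH.match(s).group(1) for s in should_match}
rejected = [s for s in should_not_match if USER_PATH.match(s) is None]

print(captured)
print(rejected == should_not_match)
```

Wrapping the optional section in (?:...) is what keeps the trailing slash out of the captured value.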

Regex to match a string that does not contain 'xxx'

One of my homework questions asked to develop a regex for all strings over x,y,z that did not contain xxx
After doing some reading I found out about negative lookahead and made this which works great:
(x(?!xx)|y|z)*
Still, in the spirit of completeness, is there any way to write this without negative lookahead?
Reading I have done makes me think it can be done with some combination of carets (^), but I cannot get the right combination so I am not sure.
Taking it a step further, is it possible to exclude a string like xxx using only the or (|) operator, but still check the strings in a recursive fashion?
EDIT 9/6/2010:
I think I answered my own question. I messed with this some more, trying to make this regex with only or (|) statements, and I am pretty sure I figured it out... and it isn't nearly as messy as I thought it would be. If someone else has time to verify this with a human eye I would appreciate it.
(xxy|xxz|xy|xz|y|z)*(xxy|xxz|xx|xy|xz|x|y|z)
Try this:
^(x{0,2}(y|z|$))*$
The basic idea is this: match at most 2 x's, followed by another letter or the end of the string.
When you reach a point where you have 3 X's, the regex has no rule that allows it to keep matching, and it fails.
Working example: http://rubular.com/r/ePH0fHlZxL
A less compact way to write the same is (with free spaces, usually the /x flag):
^(
y| # y is ok
z| # so is z
x(y|z|$)| # a single x, not followed by x
xx(y|z|$) # 2 x's, not followed by x
)*$
Based on the latest edit, here's an even flatter version of the pattern. I'm not entirely sure I understand your fascination with the pipe, but you can eliminate some more options: by allowing an empty match on the second group you don't need to repeat permutations from the first group. That regex also allows ε, which I think is included in your language.
^(xxy|xxz|xy|xz|y|z)*(xx|x|)$
Basically you have the right answer already - well done you. :)
A caret (^) in a set such as [^abc] only matches a single character that is not in that set, so its usefulness for excluding sequences of characters (i.e. strings) is limited.
Regex has numeric quantifiers {n} and {a,b}, which let you match a defined number of repetitions of a pattern. That would work for this specific pattern (because it's 'x' repeated), but it's not particularly expressive of the problem you're trying to solve (even for regex!) and is a bit brittle (it wouldn't be appropriate for a negative match on 'xyx', for example).
An or pattern again would be verbose and rather unexpressive but it could be done as the fragment:
(x|xx)[^x] // x OR xx followed by NOT x
Obviously you can do this with an iterative algorithm but that's highly inefficient compared to a regex.
Well done for thinking beyond the solution though.
I know you don't want to use lookahead, but here's another way to solve this:
^(?:(?!xxx)[xyz])*$
will match any line of characters x, y or z as long as it doesn't contain the string xxx.
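Since the asker wanted someone to verify these patterns by eye, a brute-force check is an easy supplement. The sketch below, in Python (an assumed choice; the question names no language), enumerates every string over x, y, z up to a small length and compares three of the anchored patterns from the answers with a plain substring test. The function name disagreements is hypothetical.

```python
import re
from itertools import product

# Anchored patterns taken from the answers above.
PATTERNS = {
    "quantifier": r"^(x{0,2}(y|z|$))*$",
    "alternation": r"^(xxy|xxz|xy|xz|y|z)*(xx|x|)$",
    "lookahead": r"^(?:(?!xxx)[xyz])*$",
}


def disagreements(max_len=6):
    """Return (pattern name, string) pairs where a pattern gets it wrong."""
    bad = []
    for n in range(max_len + 1):
        for chars in product("xyz", repeat=n):
            s = "".join(chars)
            expected = "xxx" not in s  # ground truth
            for name, pat in PATTERNS.items():
                if (re.search(pat, s) is not None) != expected:
                    bad.append((name, s))
    return bad


print(disagreements())
```

An empty list means all three patterns agree with the substring test on every string up to the chosen length. That is evidence rather than proof, but it catches most slips in hand-built alternations.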

Regular expression greedy match not working as expected

I have a very basic regular expression that I just can't figure out why it's not working so the question is two parts. Why does my current version not work and what is the correct expression.
Rules are pretty simple:
Must have minimum 3 characters.
If a % character is the first character must be a minimum of 4 characters.
So the following cases should work out as follows:
AB - fail
ABC - pass
ABCDEFG - pass
% - fail
%AB - fail
%ABC - pass
%ABCDEFG - pass
%%AB - pass
The expression I am using is:
^%?\S{3}
Which to me means:
^ - Start of string
%? - Greedy check for 0 or 1 % character
\S{3} - 3 other characters that are not white space
The problem is, the %? for some reason is not doing a greedy check. It's not eating the % character if it exists so the '%AB' case is passing which I think should be failing. Why is the %? not eating the % character?
Someone please show me the light :)
Edit: The answer I used was Dav below: ^(%\S{3}|[^%\s]\S{2})
It was really a two-part answer, though, and Alan's made me understand why. I didn't use his version, ^(?>%?)\S{3}, because it worked elsewhere but not in the JavaScript implementation. Both are great answers and a lot of help.
The word for the behavior you described isn't greedy, it's possessive. Normal, greedy quantifiers match as much as they can originally, but back off if necessary to allow the whole regex to match (I like to think of them as greedy but accommodating). That's what's happening to you: the %? originally matches the leading percent sign, but if there aren't enough characters left for an overall match, it gives up the percent sign and lets \S{3} match it instead.
Some regex flavors (including Java and PHP) support possessive quantifiers, which never back off, even if that causes the overall match to fail. .NET doesn't have those, but it has the next best thing: atomic groups. Whatever you put inside an atomic group acts like a separate regex--it either matches at the position where it's applied or it doesn't, but it never goes back and tries to match more or less than it originally did just because the rest of the regex is failing (that is, the regex engine never backtracks into the atomic group). Here's how you would use it for your problem:
^(?>%?)\S{3}
If the string starts with a percent sign, the (?>%?) matches it, and if there aren't enough characters left for \S{3} to match, the regex fails.
Note that atomic groups (or possessive quantifiers) are not necessary to solve this problem, as @Dav demonstrated. But they're very powerful tools which can easily make the difference between impossible and possible, or too damn slow and slick as can be.
Regex will always try to match the whole pattern if it can - "greedy" doesn't mean "will always grab the character if it exists", but instead means "will always grab the character if it exists and a match can be made with it grabbed".
Instead, what you probably want is something like this:
^(%\S{3}|[^%\s]\S{2})
Which will match either a % followed by 3 characters, or a non-%, non-whitespace followed by 2 more.
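The backtracking behaviour is easy to observe by running both expressions over the table from the question. This is a sketch in Python (an assumption; the asker mentions JavaScript, but nothing used here is flavor-specific), and the helper name failures is hypothetical.

```python
import re

# Expected results from the question (True = should pass).
cases = {
    "AB": False, "ABC": True, "ABCDEFG": True, "%": False,
    "%AB": False, "%ABC": True, "%ABCDEFG": True, "%%AB": True,
}

ORIGINAL = re.compile(r"^%?\S{3}")            # asker's expression
DAV = re.compile(r"^(%\S{3}|[^%\s]\S{2})")    # expression from this answer


def failures(pattern):
    """List the inputs whose result differs from the expected one."""
    return [s for s, expected in cases.items()
            if bool(pattern.search(s)) != expected]


# %? first takes the "%", then gives it back so that \S{3} can match "%AB".
print(failures(ORIGINAL), ORIGINAL.search("%AB").group(0))
print(failures(DAV))
```

The second pattern avoids the problem by making the two cases mutually exclusive: the branch without a leading % cannot start on a % at all, so there is nothing for the engine to give back.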
I always love to look at RE questions to see how much time people spend on them to "Save time"
str.len() >= (str[0] == '%' ? 4 : 3)
Although in real life I'd be more explicit, I just wrote it that way because for some reason some people consider code brevity an advantage (I'd call it an anti-advantage, but that's not a popular opinion right now)
Try the regex modified a little based on Dav's original one:
^(%\S{3,}|[^%\s]\S{2,})
with the regex option "^ and $ match at line breaks" on.