Regex to #h #min format - regex

I'm trying to create a case-insensitive regex for the format "#h|H\s#min|MIN". For example, 1H 2MIn and 2h 03min should match, while 1H 60MIN and 02h 100min should not. Thanks to Jesse for pointing this out: there is no limit on the number of digits in the hours. 60min is supposed to be an hour, so anything above 59 minutes should not match.
Currently, I got:
/^[0-9]H|h\s([0-5]?\d)(MIN)|(min)$/
It doesn't work for cases like 02h 100min.
So which part did I get wrong?
I thought ([0-5]?\d) was supposed to match at most two digits.
Thank you for any help!
Edit:
I think I figured this out.
/^\d+h\s[0-5]?\dmin$/i
worked in this case.
Thanks again

The Problem
The main reason your expression isn't working as expected is how you're using the alternation (|) operator. This operator tells the engine that it can match either the pattern on its left or the pattern on its right. The problem here is that these options aren't confined to a group, so your options aren't what you think they are.
Your entire expression is actually split into three alternative expressions. The engine is being told to match:
^[0-9]H OR
h\s([0-5]?\d)(MIN) OR
(min)$
This explains the strange behavior. This can be fixed by confining what you actually want to check for to a group. In this case, since you're wanting to match H or h and MIN or min, we can create the groups (H|h) and (MIN|min). Pretty simple. After these changes your expression would look like this:
^[0-9](H|h)\s([0-5]?\d)(MIN|min)$
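The three-way split described above can be observed directly. The following is a minimal sketch in Python (the question does not name a language, so Python's re module is an assumption here, and the variable names are illustrative):

```python
import re

# The asker's pattern, unchanged: the top-level | splits it into three alternatives.
ungrouped = re.compile(r"^[0-9]H|h\s([0-5]?\d)(MIN)|(min)$")
# The same pattern with each alternation confined to its own group.
grouped = re.compile(r"^[0-9](H|h)\s([0-5]?\d)(MIN|min)$")

text = "02h 100min"
hit = ungrouped.search(text)
print(hit.group(0))          # prints: min  (only the third alternative, (min)$, matched)
print(grouped.search(text))  # prints: None (the grouped pattern rejects this input)
```

The ungrouped pattern reports a match even though only the trailing "min" was consumed, which is why 02h 100min appeared to pass.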
However, this expression has another problem: it only matches one digit in the hour section. This can be fixed by adding a quantifier, in this case +, which tells the engine to match the previous token one or more times. Applying it to [0-9] gives [0-9]+, which matches one or more digits. After this change your expression would look like this:
^[0-9]+(H|h)\s([0-5]?\d)(MIN|min)$
Now you have a working expression, but it can be simplified quite a bit, so let's talk about that next.
Improvements
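First, [0-9] can be replaced with \d. For ASCII input these do the same thing: both match any digit from 0 to 9. (In Unicode-aware flavors such as .NET or Python 3, \d also matches digits from other scripts, so keep [0-9] if that matters.)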
^\d+(H|h)\s([0-5]?\d)(MIN|min)$
Second, since this problem was caused by your need to match both upper and lower case, let's eliminate it entirely by using the case-insensitivity (i) flag. One way to apply it is to put (?i) at the beginning of your expression. This works in many flavors (PCRE, Java, .NET and Python, for example) but not all of them; JavaScript does not accept a leading (?i) and takes the flag after the closing delimiter instead, as in /.../i. With this flag, we no longer have to worry about matching upper and lower case letters. We can replace (H|h) with just h, and (MIN|min) with just min.
(?i)^\d+h\s([0-5]?\d)min$
This expression should do the job. You can save 3 more characters by replacing \s with (space) and by removing the parentheses around [0-5]?\d, but those are insignificant changes so I'll leave that up to you.
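As a quick check of the final expression against the examples from the question, here is a small sketch in Python. The choice of Python and the helper name is_valid_duration are assumptions for illustration only:

```python
import re

# Final expression from the answer, with the inline case-insensitivity flag.
DURATION = re.compile(r"(?i)^\d+h\s([0-5]?\d)min$")

def is_valid_duration(text: str) -> bool:
    """Return True if text looks like '<hours>h <0-59>min', ignoring case."""
    return DURATION.search(text) is not None

# The four examples given in the question.
for sample in ["1H 2MIn", "2h 03min", "1H 60MIN", "02h 100min"]:
    print(f"{sample!r}: {is_valid_duration(sample)}")
```

The first two samples are accepted and the last two are rejected, matching the requirements stated in the question.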

Related

Is there any upper limit for number of groups used or the length of the regex in Notepad++?

I am new to using regex. I am trying to use the regex find and replace option in Notepad++.
I have used the following regex:
((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))(/)((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))
For the following text:
2/2
+2/+2
-2/-2
2+/2+
2-/2-
But I am able to get matches only for the first three. For the last two, it gives only partial matches, excluding the last "+" and "-". I am wondering if there is any upper limit on the number of groups that can be used (which I doubt) or on the maximum length of the regex. I am not sure why my regex is failing. If there is anything wrong with my regex, please correct it.
This is not an issue with Notepad++'s regex engine. The problem is that when you have alternations like (?:)|(\+)|(-), the regex engine will attempt to match the different options in the order they are specified. Since you specified an empty group first, it will attempt to match an empty string first, only matching the + or - if it needs to backtrack. This essentially makes the alternation lazy—it will never match any character unless it has to.
vks's answer works perfectly well, but just in case you actually needed those capturing groups separated out, you can do the same thing just by rewriting your alternations like this:
((\+)|(-)|(?:))(\d)((\+)|(-)|(?:))(/)((\+)|(-)|(?:))(\d)((\+)|(-)|(?:))
or even more simply, like this:
((\+)|(-)|)(\d)((\+)|(-)|)(/)((\+)|(-)|)(\d)((\+)|(-)|)
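The effect of alternative order is not specific to Notepad++. A minimal sketch in Python, whose re module also tries alternatives from left to right (the variable names are illustrative, not from the original posts):

```python
import re

# Empty alternative listed first: each sign group prefers to match nothing.
empty_first = r"((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))(/)((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))"
# Empty alternative listed last: a sign is tried before falling back to nothing.
empty_last = r"((\+)|(-)|)(\d)((\+)|(-)|)(/)((\+)|(-)|)(\d)((\+)|(-)|)"

for text in ["2/2", "+2/+2", "-2/-2", "2+/2+", "2-/2-"]:
    first = re.search(empty_first, text).group(0)
    last = re.search(empty_last, text).group(0)
    print(f"{text!r}: empty-first matched {first!r}, empty-last matched {last!r}")
```

With the empty alternative first, the trailing sign in 2+/2+ and 2-/2- is left out of the match, because nothing after that group forces the engine to backtrack into it.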
([-+]?)(\d)([-+]?)(/)([-+]?)(\d)([-+]?)
You can use this simple regex to match all cases. See here:
https://www.regex101.com/r/fG5pZ8/19

RegEx to find credit card numbers with embedded spaces

We currently have a content compliance check in place that flags anything containing a credit card number with no spaces (e.g. 5100080000000000).
What we need is a regex that also picks up credit card numbers entered with a space every 4 digits (e.g. 5100 0800 0000 0000).
We've been looking at alternative regexes but have not yet found one that works for both scenarios mentioned above.
The regex we currently use is below:
^((4\d{3})|(5[1-5]\d{2})|(6011)|(34\d{1})|(37\d{1}))-?\d{4}-?\d{4}-?\d{4}|3[4,7][\d\s-]{15}$
Just add an optional \s? wherever you already have the optional -?
So your regex becomes
^((4\d{3})|(5[1-5]\d{2})|(6011)|(34\d{1})|(37\d{1}))-?\s?\d{4}-?\s?\d{4}-?\s?\d{4}|3[4,7][\d\s-]{15}$
It seems that you already accept a dash every four characters. Thus you can simply replace -? with [- ]? everywhere.
If you require the dashes or spaces to be consistent - that is, allow no grouping at all, or a dash every four characters, or a space every four characters, you can use a back reference to force the repetitions to be identical to the first match:
^(?:4\d{3}|5[1-5]\d{2}|6011|3[47]\d{2})([- ]?)\d{4}\1\d{4}\1\d{4}$
You will notice I removed the final 3[4,7]... which looked like an erroneous addition, apparently made when attempting to solve this problem partially. Also I changed the parentheses to non-grouping ones (?:...) or simply removed them where no grouping seemed necessary or useful, mainly because this makes it easier to see what the backreference \1 refers to. Finally, the 34.. and 37.. patterns had \d{1} where apparently \d{2} was intended (or if those particular series are only three digits before the first dash, the repetition {1} was just superfluous, but then the 3[4,7]... part would have been even more wrong!)
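To illustrate how the backreference enforces a consistent separator, here is a short sketch in Python. The language and the sample numbers are assumptions for illustration (the first two samples come from the question, the others are made-up variations):

```python
import re

# Pattern from the answer: group 1 captures the first separator ('' , '-' or ' ')
# and \1 requires every later separator to be identical to it.
CARD = re.compile(r"^(?:4\d{3}|5[1-5]\d{2}|6011|3[47]\d{2})([- ]?)\d{4}\1\d{4}\1\d{4}$")

samples = [
    "5100080000000000",     # no separators
    "5100 0800 0000 0000",  # spaces throughout
    "5100-0800-0000-0000",  # dashes throughout
    "5100-0800 0000-0000",  # mixed separators
    "5100 08000000 0000",   # a separator missing in the middle
]
for sample in samples:
    print(f"{sample!r}: {CARD.search(sample) is not None}")
```

The first three samples match and the last two do not, since the separator captured after the first block has to repeat unchanged.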
Won't all these ideas blow up on you as soon as someone uses an AMEX card and enters 3 or 5 digits instead of 4 in any one 'block'?
((\d+) *(\d+) *(\d+) *(\d+))
That would be the general idea (and it even works!), you can polish it if you want. There is a great page to test your regexp live - http://rubular.com/
Try this:
(\d{4} *\d{4} *\d{4} *\d{4})

Number Regular Expression Help

I am learning regular expressions and I am trying to create one that will validate either a whole number or a decimal.
I have created this regular expression:
^(\d+)|([\d+][\.{1}][\d+])$
It almost works, but it says that inputs like:
12.
12..
12..67
are matches.
I thought
([\d+][\.{1}][\d+])
meant it had to have one or more numbers, followed by a dot (and only one), followed by one or more numbers.
Can someone explain what I am doing wrong?
As a learning process I'm interested in what I am doing wrong rather than what is another way of doing it. I tried following the syntax examples but I have missed something.
The problem is in this part:
([\d+][\.{1}][\d+])
With the square brackets you are creating character classes. That means:
[\d+] matches a single character that is either a digit or a +.
[\.{1}] matches a single character that is a ., a {, a 1 or a }.
To get the behaviour you expect, remove the square brackets:
(\d+\.{1}\d+)
This will match at least one digit, then a ., followed by one or more digits.
The other problem is that the ^ belongs only to the first alternative of your expression and the $ belongs only to the last alternative. So you should put brackets around the complete alternation:
^((\d+)|(\d+\.{1}\d+))$
If you don't need the matches in capturing groups, you can remove the brackets around the individual alternatives:
^(\d+|\d+\.{1}\d+)$
As a last point, as Jens noted, {1} is redundant: \.{1} is the same as \.
That leaves us with:
^(\d+|\d+\.\d+)$
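Both problems described above (character classes instead of quantified tokens, and anchors that apply to only one alternative) can be reproduced with a few lines of Python. This is a sketch for illustration; the question does not name a language, and the variable names are not from the original posts:

```python
import re

original = re.compile(r"^(\d+)|([\d+][\.{1}][\d+])$")  # the asker's pattern
fixed = re.compile(r"^(\d+|\d+\.\d+)$")                # the pattern the answer arrives at

# Inside [...] the characters + { 1 } are literals, not quantifiers.
print(re.fullmatch(r"[\d+]", "+") is not None)    # True: '+' is a member of the class
print(re.fullmatch(r"[\.{1}]", "{") is not None)  # True: '{' is a member of the class

# In the original, ^(\d+) alone is enough for a match, so trailing junk is ignored.
for sample in ["12", "12.5", "12.", "12..", "12..67"]:
    before = original.search(sample) is not None
    after = fixed.search(sample) is not None
    print(f"{sample!r}: original={before}, fixed={after}")
```

The original pattern accepts all five samples because its first alternative is not anchored at the end, while the fixed pattern accepts only 12 and 12.5.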
You can try with:
^(\d+(\.\d+)?)$
Your regex is nearly there. You just need to remove the square brackets and group the alternation so that ^ and $ apply to both alternatives:
^((\d+)|(\d+\.{1}\d+))$
That should work for what you want.

Regex to match a string that does not contain 'xxx'

One of my homework questions asked to develop a regex for all strings over x,y,z that did not contain xxx
After doing some reading I found out about negative lookahead and made this which works great:
(x(?!xx)|y|z)*
Still, in the spirit of completeness, is there anyway to write this without negative lookahead?
Reading I have done makes me think it can be done with some combination of carets (^), but I cannot get the right combination so I am not sure.
Taking it a step further, is it possible to exclude a string like xxx using only the or (|) operator, but still check the strings in a recursive fashion?
EDIT 9/6/2010:
Think I answered my own question. I messed with this some more, trying make this regex with only or (|) statements and I am pretty sure I figured it out... and it isn't nearly as messy as I thought it would be. If someone else has time to verify this with a human eye I would appreciate it.
(xxy|xxz|xy|xz|y|z)*(xxy|xxz|xx|xy|xz|x|y|z)
Try this:
^(x{0,2}(y|z|$))*$
The basic idea is this: match at most two x's, followed by another letter or the end of the string.
When you reach a point where you have three x's in a row, the regex has no rule that allows it to keep matching, and it fails.
Working example: http://rubular.com/r/ePH0fHlZxL
A less compact way to write the same is (with free spaces, usually the /x flag):
^(
y| # y is ok
z| # so is z
x(y|z|$)| # a single x, not followed by x
xx(y|z|$) # 2 x's, not followed by x
)*$
Based on the latest edit, here's an even flatter version of the pattern. I'm not entirely sure I understand your fascination with the pipe, but you can eliminate some more options: by allowing an empty match on the second group you don't need to repeat permutations from the first group. That regex also allows ε, which I think is included in your language.
^(xxy|xxz|xy|xz|y|z)*(xx|x|)$
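Since the asker wanted someone to verify the pattern by eye, a brute-force check is a reasonable alternative. The sketch below, in Python (an assumption, as the thread names no language), compares several patterns from this thread against a plain substring test for every string over x, y, z up to a small length. The helper name disagreements and the length limit of 6 are illustrative choices:

```python
import re
from itertools import product

# Patterns quoted from the thread; all are expected to accept exactly the
# strings over {x, y, z} that do not contain "xxx".
patterns = {
    "lookahead": r"(x(?!xx)|y|z)*",
    "bounded": r"^(x{0,2}(y|z|$))*$",
    "alternation": r"^(xxy|xxz|xy|xz|y|z)*(xx|x|)$",
    "tempered": r"^(?:(?!xxx)[xyz])*$",
}

def disagreements(pattern: str, max_len: int = 6) -> list[str]:
    """Return every string up to max_len on which the regex and the substring test differ."""
    bad = []
    for n in range(max_len + 1):
        for chars in product("xyz", repeat=n):
            s = "".join(chars)
            expected = "xxx" not in s
            actual = re.fullmatch(pattern, s) is not None
            if expected != actual:
                bad.append(s)
    return bad

for name, pattern in patterns.items():
    print(name, disagreements(pattern))
```

An empty list for a pattern means it agreed with the substring test on every string checked. This is evidence rather than a proof, since only strings up to the chosen length are tested.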
Basically you have the right answer already - well done you. :)
A caret (^) in a set such as [^abc] only matches a single character that is not in that set, so its use for excluding sequences of characters (i.e. strings) is limited.
Regex has numeric quantifiers {n} and {a,b}, which let you match a defined number of repetitions of a pattern. That would work for this specific pattern (because it's 'x' repeated), but it's not particularly expressive of the problem you're trying to solve (even for regex!) and is a bit brittle (it wouldn't be appropriate for a negative match on 'xyx', for example).
An or pattern again would be verbose and rather unexpressive but it could be done as the fragment:
(x|xx)[^x] // x OR xx followed by NOT x
Obviously you can do this with an iterative algorithm but that's highly inefficient compared to a regex.
Well done for thinking beyond the solution though.
I know you don't want to use lookahead, but here's another way to solve this:
^(?:(?!xxx)[xyz])*$
will match any line of characters x, y or z as long as it doesn't contain the string xxx.

Regular expression greedy match not working as expected

I have a very basic regular expression and I just can't figure out why it's not working, so the question has two parts: why does my current version not work, and what is the correct expression?
Rules are pretty simple:
Must have minimum 3 characters.
If a % character is the first character must be a minimum of 4 characters.
So the following cases should work out as follows:
AB - fail
ABC - pass
ABCDEFG - pass
% - fail
%AB - fail
%ABC - pass
%ABCDEFG - pass
%%AB - pass
The expression I am using is:
^%?\S{3}
Which to me means:
^ - Start of string
%? - Greedy check for 0 or 1 % character
\S{3} - 3 other characters that are not white space
The problem is, the %? for some reason is not doing a greedy check. It's not eating the % character if it exists so the '%AB' case is passing which I think should be failing. Why is the %? not eating the % character?
Someone please show me the light :)
Edit: The answer I used was Dav below: ^(%\S{3}|[^%\s]\S{2})
It was really a two-part answer, though, and Alan's made me understand why. I didn't use his version, ^(?>%?)\S{3}, because although it worked elsewhere, it did not work in the JavaScript implementation. Both are great answers and were a lot of help.
The word for the behavior you described isn't greedy, it's possessive. Normal, greedy quantifiers match as much as they can originally, but back off if necessary to allow the whole regex to match (I like to think of them as greedy but accommodating). That's what's happening to you: the %? originally matches the leading percent sign, but if there aren't enough characters left for an overall match, it gives up the percent sign and lets \S{3} match it instead.
Some regex flavors (including Java and PHP) support possessive quantifiers, which never back off, even if that causes the overall match to fail. .NET doesn't have those, but it has the next best thing: atomic groups. Whatever you put inside an atomic group acts like a separate regex--it either matches at the position where it's applied or it doesn't, but it never goes back and tries to match more or less than it originally did just because the rest of the regex is failing (that is, the regex engine never backtracks into the atomic group). Here's how you would use it for your problem:
^(?>%?)\S{3}
If the string starts with a percent sign, the (?>%?) matches it, and if there aren't enough characters left for \S{3} to match, the regex fails.
Note that atomic groups (or possessive quantifiers) are not necessary to solve this problem, as @Dav demonstrated. But they're very powerful tools which can easily make the difference between impossible and possible, or too damn slow and slick as can be.
Regex will always try to match the whole pattern if it can - "greedy" doesn't mean "will always grab the character if it exists", but instead means "will always grab the character if it exists and a match can be made with it grabbed".
Instead, what you probably want is something like this:
^(%\S{3}|[^%\s]\S{2})
Which will match either a % followed by 3 characters, or a non-%, non-whitespace followed by 2 more.
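The backtracking behaviour and the fix can be checked against the cases listed in the question. A minimal sketch in Python (the asker was using JavaScript; Python is used here only as an assumption for illustration, and the variable names are made up):

```python
import re

original = re.compile(r"^%?\S{3}")             # the asker's pattern
fixed = re.compile(r"^(%\S{3}|[^%\s]\S{2})")   # the alternation-based pattern

# Expected outcomes, copied from the question (True = pass, False = fail).
cases = {
    "AB": False, "ABC": True, "ABCDEFG": True,
    "%": False, "%AB": False, "%ABC": True,
    "%ABCDEFG": True, "%%AB": True,
}

for text, expected in cases.items():
    got_original = original.search(text) is not None
    got_fixed = fixed.search(text) is not None
    print(f"{text!r}: expected={expected}, original={got_original}, fixed={got_fixed}")
```

The only disagreement is %AB: the original pattern accepts it because %? gives the percent sign back so that \S{3} can match all three characters, whereas the alternation-based pattern rejects it.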
I always love to look at RE questions to see how much time people spend on them to "Save time"
str.len() >= (str[0] == '%' ? 4 : 3)
Although in real life I'd be more explicit, I just wrote it that way because for some reason some people consider code brevity an advantage (I'd call it an anti-advantage, but that's not a popular opinion right now)
Try the regex modified a little based on Dav's original one:
^(%\S{3,}|[^%\s]\S{2,})
with the regex option "^ and $ match at line breaks" on.