Special way of forming regex? - regex

I've come across this regex and I was wondering how this is used:
^.*(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$
I want to know what the individual section of the regex mean, not only what the regex in its whole does.
With the knowledge of regexes I have, I think it matches any input (at least 10 chars long) that contains a digit (0-9), a lowercase letter and an uppercase letter, but I need confirmation that this is correct.
Edit
I also don't know what it is meant to validate, but looking at what I think it does, is it right that the regex can be simplified to:
[\d|[a-zA-Z]]{10,}
Edit 2
I've noticed my replacement regex doesn't make sure I have at least one of every requirement (at least one digit, one uppercase and one lowercase letter). Is there any way to adjust it so the regex does that as well, or is that only possible with the original regex?

I can explain what the parts of the regex do, but in general I find this quite odd:
^.*(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$
Basically what you said is true - there is no other magic in the regex.
^.* - match the beginning of the line and 0+ characters then ensure that
The following parts only assert; none of them consumes or captures anything. The construct is called a positive lookahead, if you want to look it up. If all of them evaluate to true, the last part of the regex does the rest:
(?=.{10,}) - from where the first matching stops (could be after the beginning of the line) there is a string of 10+ chars (any chars)
(?=.*\d) - and there is at least one digit in the whole string ahead
(?=.*[a-z]) - and a lower case letter
(?=.*[A-Z]) - and an upper case letter
If all that is true, then:
.*$ - match everything till the end of the line
Note: if any of the asserts fail, nothing will be matched.
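As a quick check of the breakdown above, here is a minimal Python sketch. Python's re module is an assumption (the question does not name a regex flavor), and the sample strings are made up so that each one trips a different condition.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"^.*(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$")

def is_accepted(candidate: str) -> bool:
    """Return True if the pattern matches the candidate string."""
    return PATTERN.match(candidate) is not None

# Hypothetical sample inputs: one valid, then one failing each condition.
samples = ["Password123", "password123", "PASSWORD123", "Passwordabc", "Pass1"]
results = {s: is_accepted(s) for s in samples}
for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

Only the first sample satisfies all four lookaheads; the others lack an uppercase letter, a lowercase letter, a digit, or the minimum length, in that order.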
To your edit
I don't think so - it's not the same to say that there is an upper and lower case letter and a digit somewhere in the string, and to say that the string consists of 10+ characters of which all are either digits or letters (upper or lower case) or both. Your regex would match a string that consists of only digits as well as only letters or a mix of both - the original regex ensures that each of these classes is represented at least once. It seems that someone might have used it to validate a user password or something like that.
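The difference shows up on a digits-only string. The sketch below uses Python and assumes the intent of the proposed replacement was a single anchored class, written here as ^[\da-zA-Z]{10,}$ (in many flavors the nested brackets of [\d|[a-zA-Z]] do not form one class, so the literal spelling is not used).

```python
import re

ORIGINAL = re.compile(r"^.*(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$")
# Assumed intent of the replacement: 10+ characters, each a digit or a letter.
SIMPLIFIED = re.compile(r"^[\da-zA-Z]{10,}$")

# Hypothetical inputs: digits only, a valid mix, and a mix containing a symbol.
samples = ("1234567890", "abcDEF1234", "abcDEF123!")

simplified_hits = [bool(SIMPLIFIED.match(s)) for s in samples]
original_hits = [bool(ORIGINAL.match(s)) for s in samples]
print(simplified_hits)
print(original_hits)
```

The two patterns disagree in both directions: the simplified one accepts the digits-only string, while the original accepts the string with a symbol because it never restricts which characters may appear.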

This is probably used to validate candidate passwords - it
Checks that it is at least 10 characters long
Checks that it contains at least one digit
Checks that it contains at least one lower case letter
Checks that it contains at least one upper case letter
Your replacement regex is not identical because it just ORs the above conditions - the long nasty regex ANDs them. Also there is no order to the above conditions; the letters or digits can occur anywhere in the string.
I don't see a way of simplifying it much further actually - you might perhaps remove the .* at the beginning and .*$ at the end since they don't really serve any purpose. But otherwise, that long regex does a good job of conjunctively imposing those conditions without imposing an order.

I think this is used for ensuring password strength: it has to be at least 10 chars long, with at least 1 digit, at least 1 lowercase letter, and at least 1 uppercase letter.
The most important part of the whole regex is the (?=...) operator, which matches, but does NOT consume the part of the string it matches. Multiple (?=...) next to one another, therefore, acts as an AND operator.
(?=.{10,}) matches any sequence of at least 10 chars.
(?=.*\d) matches a single digit that follows anything.
(?=.*[a-z]) matches a lowercase char that follows anything.
(?=.*[A-Z]) matches an uppercase char that follows anything.
So this regex will match any string that has a substring that is at least 10-char long, has at least a digit, a lowercase char, and an uppercase char.
You can see that it sounds more complicated than it should, especially the substring part. Indeed, the .* part right after ^ is not necessary, and we can simplify this as
^(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$
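To make the "matches but does not consume" point concrete, here is a small sketch in Python (chosen only for illustration; the inputs are made up). It shows that a lookahead leaves the match position where it was, which is why several of them in a row behave like AND.

```python
import re

# A lone lookahead succeeds without moving the match position.
m = re.match(r"(?=.*\d)", "abc1")
print(m.span())  # start and end of the (empty) match

# Stacked lookaheads all start from the same position, so both must hold.
both = re.compile(r"^(?=.*\d)(?=.*[A-Z])")
has_both = bool(both.match("a1B"))
missing_upper = bool(both.match("a1b"))
print(has_both, missing_upper)
```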

It's a password strength validation regex as others have said, but that .* at the beginning should not be there. As it is, the .* initially consumes the whole string, then backtracks until it reaches a position where all four lookaheads can match. It works, but why make the regex do so much work if it doesn't have to?
^(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$
With the leading .* removed, the regex never has to backtrack (unless you count returning to the starting position after a successful lookahead backtracking). As for the .*$ at the end, it might not be necessary, but it does no harm either. I would leave it in, just in case someone tries to use the result of the match for something instead of the original string.
One more point: you could make the regex more concise by removing the first lookahead and putting the .{10,} in place of the .*:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{10,}$
The reason it's written the way it is, is to work around a long-standing bug in Internet Explorer. The bug finally got fixed in IE8 or IE9, but I would leave it the way it is, just in case.
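A rough way to convince yourself that the three spellings accept the same strings is to run them side by side. This is only a sketch in Python with a handful of made-up inputs, not a proof, and it says nothing about the Internet Explorer behavior mentioned above.

```python
import re

VARIANTS = {
    "original": r"^.*(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$",
    "no leading .*": r"^(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$",
    "concise": r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{10,}$",
}

# Hypothetical inputs; the last two are exactly 10 and 9 characters long.
samples = ["Password123", "password123", "PASSWORD123", "Pass1",
           "aB3" + "x" * 7, "aB3" + "x" * 6]

def verdicts(pattern):
    """Return one accept/reject flag per sample for the given pattern."""
    compiled = re.compile(pattern)
    return [compiled.match(s) is not None for s in samples]

table = {name: verdicts(p) for name, p in VARIANTS.items()}
for name, row in table.items():
    print(f"{name:>14}: {row}")
```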

Related

Regex to allow Strings starting with a letter and not having a specific set of characters

I need a regex that ensures two things -
My string must start with a letter. The letter can be small or capital.
The string must not contain certain specified characters.
Since there are two conditions involved, I tried designing my regex with the positive lookahead operator in regex (?=).
My regex for the String is
(?=^[a-zA-Z]$)(?=.[^"/',?%$#!#%^&+=|{}<>])
Where the first condition is to ensure that my string starts with a letter and the second condition is to ensure that the characters defined in the second condition are blocked. It still doesn't work for me. What am I missing? Is there a better way to approach this?
I don't know why having two conditions make you think that you should use lookaheads. In this case, 2 character classes should do:
^[a-zA-Z][^"\/',?%$#!#%^&*+=|{}<>]*$
The first character class matches the start (only letters), and the second matches the rest (no symbols).
You have a couple of problems:
your first lookahead asserts that the string is only one character long (because of the $ at the end); and
the second lookahead only asserts that the second character is not one of the blocked ones (because you have no quantifier after the character class).
This would work better:
(?=^[a-zA-Z])(?=[^"/',?%$#!#%^&+=\`|{}<>]+$)
Note that since [a-zA-Z] is not part of the blocked group, you don't need the . to skip the first character in the second lookahead.
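For reference, here is how the two-character-class version behaves on a few inputs. This is a sketch in Python (an assumption, since the question does not say which language is used), with hypothetical test strings.

```python
import re

# A leading letter, then any run of characters outside the blocked set.
PATTERN = re.compile(r"""^[a-zA-Z][^"\/',?%$#!#%^&*+=|{}<>]*$""")

def is_allowed(text: str) -> bool:
    """Return True if text starts with a letter and has no blocked character."""
    return PATTERN.match(text) is not None

samples = ["a", "hello", "Hello World 2", "1abc", "ab?c", ""]
results = [is_allowed(s) for s in samples]
print(results)
```

Note that spaces and digits are accepted after the first letter, because the second class only excludes the listed symbols.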

Is there a better way to validate multiple regex conditionals than gigantic "or" statements?

I am practicing regular expressions and to test myself, I was trying to make some sort of simplified password validation expression. Basically, it would accept [0-9A-Za-z], but...
1) it needed to have a symbol (for simplicity's sake, I only used [#&#!&*%$])
2) it needed to have a capital letter
In my mind, the best way to do this was with positive lookahead statements. The only problem was the validation before and after the symbol. If I had the capital letter lookahead at the beginning, it would only validate if the capital letter came before the symbol, and the same for the end. The only way I could counter this was to make a massive OR statement with the entire thing copied, but one having the lookahead at the beginning, and one having it at the end. This is the monstrosity that I came up with:
/^[0-9A-Za-z#&#!&*%$]*(?=[A-Z]+)[0-9A-Za-z#&#!&*%$]*(?=[#&#!&*%$])[#&#!&*%$][A-Za-z|#&#!&*%$]*|[0-9A-Za-z#&#!&*%$]*(?=[#&#!&*%$])[#&#!&*%$](?=[A-Z]+)[0-9A-Za-z#&#!&*%$]*$/
I'll try to break it down into parts that make sense to me (and hopefully to you guys as well).
First part of the OR statement
The beginning can be [0-9A-Za-z#&#!&*%$]*, so that's what I start with
Then comes the first positive lookahead, ensuring that there is a capital [A-Z]
Then comes the second lookahead ensuring that one of the symbols in [#&#!&*%$] is present.
Then, it allows any of those necessary symbols to come next
The first part ends with another allowance of [A-Za-z|#&#!&*%$]*
Second part of the OR statement
The second part is much like the first. Well, almost an entire copy and paste. I put an | OR symbol in place, but then instead of having the (?=[A-Z]+) lookahead before the symbol, I check for it after.
All in all, I put in a good amount of effort into something that works (for the most part). I did some extensive Googling, but nothing really seemed to answer my question. Is there an easier way to go about what I am looking to do?
You need to anchor the lookaheads at the start of the string (to just run them once) and add a .* or .*? before the required subpatterns in the lookaheads to allow the search anywhere on the line (note that . usually does not match line breaks, but your main pattern does not match them, so . is enough).
So, that said, you may use
^(?=.*[A-Z])(?=.*[#&#!&*%$])[0-9A-Za-z#&#!&*%$]*$
Details:
^ - start of string
(?=.*[A-Z]) - there must be an uppercase ASCII letter somewhere after any 0+ chars other than line breaks
(?=.*[#&#!&*%$]) - there must be a special char from the character class somewhere after any 0+ chars other than line breaks
[0-9A-Za-z#&#!&*%$]* - 0+ chars from the defined ranges or chars
$ - end of string.
To make it more efficient, use the principle of contrast, replacing each .* inside the lookaheads with a negated character class ([^A-Z]* and [^#&#!&*%$]*):
^(?=[^A-Z]*[A-Z])(?=[^#&#!&*%$]*[#&#!&*%$])[0-9A-Za-z#&#!&*%$]*$
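Both patterns from this answer can be compared on a few inputs. The sketch below uses Python purely for illustration (the question shows a /.../ literal, so the original flavor is probably different), and the sample strings are made up.

```python
import re

SIMPLE = re.compile(r"^(?=.*[A-Z])(?=.*[#&#!&*%$])[0-9A-Za-z#&#!&*%$]*$")
CONTRAST = re.compile(
    r"^(?=[^A-Z]*[A-Z])(?=[^#&#!&*%$]*[#&#!&*%$])[0-9A-Za-z#&#!&*%$]*$"
)

# Capital before symbol, symbol before capital, no capital, no symbol,
# and a string containing a character outside the allowed set (a space).
samples = ["abcD#", "#abcD", "abcd#", "ABCD", "aB# c"]

simple_hits = [bool(SIMPLE.match(s)) for s in samples]
contrast_hits = [bool(CONTRAST.match(s)) for s in samples]
print(simple_hits)
print(contrast_hits)
```

The first two samples show that the order of the capital letter and the symbol no longer matters, which is what the big OR in the question was trying to achieve.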

RegEx: Non-repeating patterns?

I'm wrestling with how to write a specific regex, and thought I'd come here for a little guidance.
What I'm looking for is an expression that does the following:
Character length of 7 or more
Any single character is one of four patterns (uppercase letters, lowercase letters, numbers and a specific set of special characters. Let's say #$%#).
(Now, here's where I'm having problems):
Another single character would also match with one of the patterns described above EXCEPT for the pattern that was already matched. So, if the first pattern matched is an uppercase letter, the second character match should be a lowercase letter, number or special character from the pattern.
To give you an example, the string AAAAAA# would match, as would the string AAAAAAa. However, the string AAAAAAA would not, nor would the string AAAAAA& (as the ampersand is not part of the special character pattern).
Any ideas? Thanks!
If you only need two different kinds of characters, you can use the possessive quantifier feature (available in Objective C):
^(?:[a-z]++|[A-Z]++|[0-9]++|[#$%#]++)[a-zA-Z0-9#$%#]+$
or more concise with an atomic group:
^(?>[a-z]+|[A-Z]+|[0-9]+|[#$%#]+)[a-zA-Z0-9#$%#]+$
Since each branch of the alternation is a character class with a possessive quantifier, you can be sure that the first character matched by [a-zA-Z0-9#$%#]+ is from a different class.
As for the string size, check it separately first with the appropriate function; if the string is too short, you avoid the cost of a regex check.
First you need to do a negative lookahead to make sure the entire string doesn't consist of characters from a single group:
(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)
Then check that it does contain at least 7 characters from the list of legal characters (and nothing else):
^[a-zA-Z0-9#$%#]{7,}$
Combining them (thanks to Shlomo for pointing that out):
^(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)[a-zA-Z0-9#$%#]{7,}$
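Here is the combined pattern run against the examples from the question, as a sketch in Python (an assumption; the thread itself mentions Objective C). The last sample is an extra, made-up input that is too short.

```python
import re

PATTERN = re.compile(
    r"^(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)[a-zA-Z0-9#$%#]{7,}$"
)

# The four examples from the question, plus one that is too short.
samples = ["AAAAAA#", "AAAAAAa", "AAAAAAA", "AAAAAA&", "Aa1#"]
results = [PATTERN.match(s) is not None for s in samples]
print(results)
```

The negative lookahead rejects AAAAAAA because the whole string comes from one group, and the final character class rejects AAAAAA& because the ampersand is not in the allowed set.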

How to include special chars in this regex

First of all I am a total noob to regular expressions, so this may be optimized further, and if so, please tell me what to do. Anyway, after reading several articles about regex, I wrote a little regex for my password matching needs:
(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(^[A-Z]+[a-z0-9]).{8,20}
What I am trying to do is: it must start with an uppercase letter, must contain a lowercase letter, must contain at least one number, must contain at least one special character, and must be between 8 and 20 characters in length.
The above somehow works, but it doesn't force special chars (. seems to match any character, but I don't know how to use it with the positive lookahead), and the min length seems to be 10 instead of 8. What am I doing wrong?
PS: I am using http://gskinner.com/RegExr/ to test this.
Let's strip away the assertions and just look at your base pattern alone:
(^[A-Z]+[a-z0-9]).{8,20}
This will match one or more uppercase Latin letters, followed by a single lowercase Latin letter or decimal digit, followed by 8 to 20 of any character. So yes, at minimum this will require 10 characters, but there's no maximum number of characters it will match (e.g. it will allow 100 uppercase letters at the start of the string). Furthermore, since there's no end anchor ($), this pattern would allow any trailing characters after the matched substring.
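These three observations are easy to reproduce. The sketch below uses Python's re.search as a stand-in for the RegExr tester mentioned in the question (an assumption), with made-up inputs.

```python
import re

# The base pattern with the lookaheads stripped away.
BASE = re.compile(r"(^[A-Z]+[a-z0-9]).{8,20}")

nine_chars = bool(BASE.search("Ab1234567"))    # 1 + 1 + 7 characters
ten_chars = bool(BASE.search("Ab12345678"))    # 1 + 1 + 8 characters
long_prefix = bool(BASE.search("A" * 100 + "b" + "12345678"))

# Without an end anchor the match simply stops early and ignores the rest.
trailing = "Ab12345678" + "x" * 50
m = BASE.search(trailing)
print(nine_chars, ten_chars, long_prefix)
print(m.end(), len(trailing))
```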
I'd recommend a pattern like this:
^(?=.*[a-z])(?=.*[0-9])(?=.*[!##$])[A-Z]+[A-Za-z0-9!##$]{7,19}$
Where !##$ is a placeholder for whatever special characters you want to allow. Don't forget to escape special characters if necessary (\, ], ^ at the beginning of the character class, and - in the middle).
Using POSIX character classes, it might look like this:
^(?=.*[[:lower:]])(?=.*[[:digit:]])(?=.*[[:punct:]])[[:upper:]]+[[:alnum:][:punct:]]{7,19}$
Or using Unicode character classes, it might look like this:
^(?=.*[\p{Ll}])(?=.*\d)(?=.*[\p{P}\p{S}])[\p{Lu}]+[\p{L}\d\p{P}\p{S}]{7,19}$
Note: each of these considers a different set of 'special characters', so they aren't identical to the first pattern.
The following should work:
^(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])[A-Z].{7,19}$
I removed the (?=.*[A-Z]) because the requirement that you must start with an uppercase character already covers that. I added (?=.*[^a-zA-Z0-9]) for the special characters, this will only match if there is at least one character that is not a letter or a digit. I also tweaked the length checking a little bit, the first step here was to remove the + after the [A-Z] so that we know exactly one character has been matched so far, and then changing the .{8,20} to .{7,19} (we can only match between 7 and 19 more characters if we already matched 1).
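A few boundary cases for this pattern, as a sketch in Python (the question used an online tester, so the language here is an assumption). The inputs are hypothetical and include strings of exactly 20 and 21 characters to check the upper bound.

```python
import re

PATTERN = re.compile(
    r"^(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])[A-Z].{7,19}$"
)

samples = [
    "Passw0rd!",          # 9 characters, meets every rule
    "passw0rd!",          # does not start with an uppercase letter
    "Passw0rd",           # no special character
    "Pa0!",               # too short
    "Pa0!" + "x" * 16,    # exactly 20 characters
    "Pa0!" + "x" * 17,    # 21 characters, one too many
]
results = [PATTERN.match(s) is not None for s in samples]
print(results)
```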
Well, here is how I would write it, if I had such requirements - excepting situations where it's absolutely not possible or practical, I prefer to break up complex regular expressions. Note that this is English-specific, so a Unicode or POSIX character class (where supported) may make more sense:
/^[A-Z]/ && /[a-z]/ && /[0-9]/ && /[whatever special]/ && ofCorrectLength(x)
That is, I would avoid trying to incorporate all the rules at once.
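Spelled out in Python, the split-up approach might look like the sketch below. The function name is_valid_password is hypothetical, and "special" is assumed here to mean any character that is not an ASCII letter or digit, since the answer leaves that set open.

```python
import re

def is_valid_password(candidate: str) -> bool:
    """Apply each rule as its own small check instead of one large regex."""
    return (
        re.search(r"^[A-Z]", candidate) is not None           # starts with uppercase
        and re.search(r"[a-z]", candidate) is not None        # has a lowercase letter
        and re.search(r"[0-9]", candidate) is not None        # has a digit
        and re.search(r"[^A-Za-z0-9]", candidate) is not None  # assumed "special" set
        and 8 <= len(candidate) <= 20                          # length rule
    )

samples = ["Passw0rd!", "passw0rd!", "Passw0rd", "Pa0!"]
results = [is_valid_password(s) for s in samples]
print(results)
```

Each rule can then be changed, reordered, or reported on separately, which is the main appeal over a single combined pattern.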

what's wrong with this regex for password rules

I'm trying for at least 2 letters, at least 2 non letters, and at least 6 characters in length:
^.*(?=.{6,})(?=[a-zA-Z]*){2,}(?=[0-9##$%^&+=]*){2,}.*$
but that misses the mark on many levels, yet I'm not sure why. Any suggestions?
While this type of test can be done with a regex, it may be easier and more maintainable to do a non-regex check. The regex to achieve this is fairly complex and a bit unreadable, but the code to run this test is fairly straightforward. For example, take the following method as an implementation of your requirements (language: C#):
public bool IsValid(string password) {
    // Null check omitted. Requires: using System; using System.Linq;
    return password.Length >= 6 &&
        password.Where(Char.IsLetter).Count() >= 2 &&            // at least 2 letters
        password.Where(x => !Char.IsLetter(x)).Count() >= 2;     // at least 2 non-letters
}
To answer the question in the title, here's what's wrong with your regex:
First, the .* (dot-star) at the beginning consumes the whole string. Then the first lookahead, (?=.{6,}) is applied and fails because the match position is at the end of the string. So the regex engine starts backtracking, "taking back" characters by moving the match position backward one character at a time and reapplying the lookahead. When it's taken back six characters, the first lookahead succeeds and the next one is applied.
The second lookahead is (?=[a-zA-Z]*), which means "at the current match position, try to match zero or more ASCII letters." The match position is still six characters back from the end of the string, but it doesn't matter; the lookahead will always succeed no matter where you apply it, because it can legally match zero characters. Also, the letters can be anywhere in the string, so the lookahead has to accommodate whatever intervening non-letters there might be.
Then you have {2,}. It's not part of the lookahead subexpression because it's outside the parentheses. In that position, it means the lookahead has to succeed two or more times, which makes no sense. If it succeeded once, it will succeed any number of times, because it's being applied at the same position every time. Some regex flavors treat it as an error when you apply a quantifier to a lookahead (or to any other zero-width assertion, e.g. lookbehind, word boundary, line anchors). Most flavors seem to ignore the quantifier.
Then you have another lookahead that will always succeed, and another useless quantifier. Finally, the dot-star at the end re-consumes the six characters the first dot-star had to relinquish.
I think this is what you were trying for:
^
(?=.{6})
(?=(?:[^A-Za-z]*[A-Za-z]){2})
(?=(?:[^0-9##$%^&+=]*[0-9##$%^&+=]){2})
.*$
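To try the corrected pattern, the lines above can be joined into a single expression. The sketch below does that in Python (an assumption; the question does not name a language) and runs it on a few made-up passwords.

```python
import re

# The free-spacing pattern above, joined into one expression.
PATTERN = re.compile(
    r"^(?=.{6})"
    r"(?=(?:[^A-Za-z]*[A-Za-z]){2})"
    r"(?=(?:[^0-9##$%^&+=]*[0-9##$%^&+=]){2})"
    r".*$"
)

samples = [
    "ab12cd",   # 2+ letters and 2+ characters from the digit/symbol set
    "12ab+=",   # same, with symbols from the set
    "abcdef",   # no digits or listed symbols
    "a12345",   # only one letter
    "ab12",     # too short
    "ab!!cd",   # '!' is not in the listed set
]
results = [PATTERN.match(s) is not None for s in samples]
print(results)
```

The last sample shows that "non letters" here really means the characters listed in the class, not any non-letter.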
If you really want to use regular expressions, try this one:
(?=.{6})(?=[^a-zA-Z]*[a-zA-Z][^a-zA-Z]*[a-zA-Z])(?=[^0-9##$%^&+=]*[0-9##$%^&+=][^0-9##$%^&+=]*[0-9##$%^&+=])^.+$
This matches anything that is at least six characters long ((?=.{6})), contains at least two alphabetic characters ((?=[^a-zA-Z]*[a-zA-Z][^a-zA-Z]*[a-zA-Z])), and contains at least two characters from the character set [0-9##$%^&+=] ((?=[^0-9##$%^&+=]*[0-9##$%^&+=][^0-9##$%^&+=]*[0-9##$%^&+=])).