What's wrong with this regex for password rules?

I'm trying for at least 2 letters, at least 2 non letters, and at least 6 characters in length:
^.*(?=.{6,})(?=[a-zA-Z]*){2,}(?=[0-9##$%^&+=]*){2,}.*$
but that misses the mark on many levels, yet I'm not sure why. Any suggestions?

While this type of test can be done with a regex, it may be easier and more maintainable to do a non-regex check. The regex to achieve this is fairly complex and hard to read, but the code to run this test is fairly straightforward. For example, take the following method as an implementation of your requirements (in C#):
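// Requires: using System.Linq;
public bool IsValid(string password) {
    // arg null check omitted
    return password.Length >= 6 &&                              // at least 6 characters
        password.Where(Char.IsLetter).Count() >= 2 &&           // at least 2 letters
        password.Where(x => !Char.IsLetter(x)).Count() >= 2;    // at least 2 non-letters
}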

To answer the question in the title, here's what's wrong with your regex:
First, the .* (dot-star) at the beginning consumes the whole string. Then the first lookahead, (?=.{6,}) is applied and fails because the match position is at the end of the string. So the regex engine starts backtracking, "taking back" characters by moving the match position backward one character at a time and reapplying the lookahead. When it's taken back six characters, the first lookahead succeeds and the next one is applied.
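The backtracking described above can be observed directly. The following is a minimal Python sketch (Python's re module is an assumption here, since the question does not name a flavor, and the sample string is made up). It matches only the leading part of the pattern and reports where the dot-star ends up.

```python
import re

# Only the leading portion of the pattern from the question:
# a greedy dot-star followed by the length lookahead.
prefix = re.compile(r"^.*(?=.{6,})")

sample = "abcdefghij"  # hypothetical 10-character input
m = prefix.match(sample)

# The dot-star first grabs all 10 characters, then gives them back one at a
# time until six characters remain for the lookahead to see.
stop = m.end()
print(stop, repr(sample[stop:]))
```

The match position ends up six characters from the end of the string, which is where the remaining lookaheads are then applied.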
The second lookahead is (?=[a-zA-Z]*), which means "at the current match position, try to match zero or more ASCII letters." The match position is still six characters back from the end of the string, but it doesn't matter; the lookahead will always succeed no matter where you apply it, because it can legally match zero characters. Also, the letters can be anywhere in the string, so the lookahead has to accommodate whatever intervening non-letters there might be.
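As a quick check of the "always succeeds" claim, here is a small Python sketch (again assuming Python's re module, with an invented sample string that contains no letters at all). It applies the lookahead at every position of the string.

```python
import re

# A lookahead whose body can match zero characters.
maybe_letters = re.compile(r"(?=[a-zA-Z]*)")

sample = "123456"  # hypothetical input with no letters

# Try the lookahead at every position, including the very end of the string.
results = [maybe_letters.match(sample, pos) is not None
           for pos in range(len(sample) + 1)]
print(results)
```

Every entry comes out true, so this lookahead places no constraint on the input.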
Then you have {2,}. It's not part of the lookahead subexpression because it's outside the parentheses. In that position, it means the lookahead has to succeed two or more times, which makes no sense. If it succeeded once, it will succeed any number of times, because it's being applied at the same position every time. Some regex flavors treat it as an error when you apply a quantifier to a lookahead (or to any other zero-width assertion, e.g., lookbehind, word boundary, line anchors). Most flavors seem to ignore the quantifier.
Then you have another lookahead that will always succeed, and another useless quantifier. Finally, the dot-star at the end re-consumes the six characters the first dot-star had to relinquish.
I think this is what you were trying for:
^
(?=.{6})
(?=(?:[^A-Za-z]*[A-Za-z]){2})
(?=(?:[^0-9##$%^&+=]*[0-9##$%^&+=]){2})
.*$
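The corrected pattern can be tried out with a short Python sketch. Python's re module, the helper name is_valid, and the sample passwords are all assumptions made for illustration; the character classes are copied verbatim from the pattern above.

```python
import re

# The corrected pattern, joined onto one logical line.
password_re = re.compile(
    r"^(?=.{6})"                                  # at least 6 characters
    r"(?=(?:[^A-Za-z]*[A-Za-z]){2})"              # at least 2 letters, anywhere
    r"(?=(?:[^0-9##$%^&+=]*[0-9##$%^&+=]){2})"    # at least 2 listed non-letters
    r".*$"
)

def is_valid(password: str) -> bool:
    """Hypothetical helper: True if the password satisfies all three rules."""
    return password_re.match(password) is not None

for candidate in ["ab12cd", "ab#$xy", "abcdef", "123456", "a1b2"]:
    print(candidate, is_valid(candidate))
```

Note that "non-letters" here means only the characters listed in the class, so a password such as "ab  cd" (two spaces) is rejected.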

If you really want to use regular expressions, try this one:
(?=.{6})(?=[^a-zA-Z]*[a-zA-Z][^a-zA-Z]*[a-zA-Z])(?=[^0-9##$%^&+=]*[0-9##$%^&+=][^0-9##$%^&+=]*[0-9##$%^&+=])^.+$
This matches anything that is at least six characters long ((?=.{6})), contains at least two alphabetic characters ((?=[^a-zA-Z]*[a-zA-Z][^a-zA-Z]*[a-zA-Z])), and contains at least two characters from the set [0-9##$%^&+=] ((?=[^0-9##$%^&+=]*[0-9##$%^&+=][^0-9##$%^&+=]*[0-9##$%^&+=])).

Related

Regex to allow Strings starting with a letter and not having a specific set of characters

I need a regex that ensures two things -
My string must start with a letter. The letter can be small or capital.
The string must not contain certain specified characters.
Since there are two conditions involved, I tried designing my regex with the positive lookahead operator in regex (?=).
My regex for the String is
(?=^[a-zA-Z]$)(?=.[^"/',?%$#!#%^&+=|{}<>])
Where the first condition is to ensure that my string starts with a letter and the second condition is to ensure that the characters defined in the second condition are blocked. It still doesn't work for me. What am I missing? Is there a better way to approach this?
I don't know why having two conditions makes you think that you should use lookaheads. In this case, two character classes should do:
^[a-zA-Z][^"\/',?%$#!#%^&*+=|{}<>]*$
The first character class matches the start (only letters), and the second matches the rest (no symbols).
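A minimal Python sketch of the two-character-class approach follows. Python's re module, the helper name is_allowed, and the sample strings are assumptions for illustration; the pattern itself is copied from above.

```python
import re

# Triple quotes let the pattern contain both quote characters unescaped.
name_re = re.compile(r'''^[a-zA-Z][^"\/',?%$#!#%^&*+=|{}<>]*$''')

def is_allowed(s: str) -> bool:
    """Hypothetical helper: starts with a letter, no blocked characters."""
    return name_re.match(s) is not None

for s in ["hello world", "a", "Abc-123", "1abc", "ab$c", ""]:
    print(repr(s), is_allowed(s))
```

Strings that start with a digit, contain a blocked character, or are empty are rejected; everything else that starts with a letter passes.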
You have a couple of problems:
your first lookahead asserts that the string is only one character long (because of the $ at the end); and
the second lookahead only asserts that the second character is not one of the blocked ones (because you have no quantifier after the character class).
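Both problems can be seen by testing the two lookaheads from the question separately. This is a Python sketch under the assumption that Python's re module behaves like the asker's engine for these constructs; the sample strings are made up.

```python
import re

# The two lookaheads from the question, compiled separately.
first = re.compile(r"(?=^[a-zA-Z]$)")
second = re.compile(r'''(?=.[^"/',?%$#!#%^&+=|{}<>])''')
both = re.compile(first.pattern + second.pattern)

# Problem 1: the first lookahead only accepts a one-character string.
print(bool(first.match("a")), bool(first.match("ab")))

# Problem 2: the second lookahead only inspects the second character,
# so a blocked character later in the string goes unnoticed.
print(bool(second.match("ab$")), bool(second.match("a$b")))

# Combined, one lookahead wants exactly one character and the other wants
# at least two, so ordinary inputs like these can never satisfy both.
print(bool(both.match("a")), bool(both.match("ab")))
```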
This would work better:
(?=^[a-zA-Z])(?=[^"/',?%$#!#%^&+=\`|{}<>]+$)
Note that since [a-zA-Z] is not part of the blocked group, you don't need the . to skip the first character in the second lookahead.

Perl Regex Negative Lookbehind Incorrect Match (SAS)

In SAS, I am setting up PRXPARSE functions to extract meaningful information from free text answers from a survey. For the most part, I have done this without issue. However, I've started needing lookarounds and now I am getting an incorrect match despite my best efforts.
Here is the expression that is being evaluated:
hlhx=PRXPARSE('/yes|(?<!no).*homeless.*(for|in|year|age)|at\sage|couch|was\shomeless|multiple|
lived.*streets|(?<!\bnot).*at\srisk|has\sbeen|high\srisk|currently\shomeless|
liv(es|ing|ed).*car|many|(?<!\bno).*(hx|history|h.?o)|(?<!\bno)(?<!low).+risk/ox');
A couple of responses should not match this expression, but do:
no hx of homelessness and low risk of homelessness
owns home, no h/o homelessness; low risk for homelessness
no and little risk
Obviously I have not properly specified my lookbehinds. Any help would be greatly appreciated.
EDIT: To put a finer point on it, what part of the expression is causing a match with entries like those in the list?
Best,
Lauren
Here's how your regex matches no and little risk:
One of the branches in your regex is ...|(?<!\bno)(?<!low).+risk.
The regex engine starts by attempting a match at every position within the target string, starting at the beginning:
no and little risk
^
The first constraint is that the current position cannot be preceded by a word boundary followed by "no" (due to (?<!\bno)). This condition is satisfied: The beginning of the target string is not preceded by anything.
The second constraint is that the current position cannot be preceded by "low" (due to (?<!low)). This condition is also satisfied (see above).
Then we match one or more non-newline characters, but as many as possible of them (this is the .+ part). Here we initially consume the whole string:
no and little risk
------------------^
But then the regex requires a match of risk, which fails (there are no more characters left in the target string). This causes .+ to backtrack and consume fewer and fewer characters, until this happens:
no and little risk
--------------^
At this point, risk successfully matches and the regex finishes.
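The walkthrough above can be reproduced with a few lines of Python. This is only a sketch: Python's re module stands in for the SAS PRX engine, which is an assumption, although the constructs used in this branch behave the same way in both.

```python
import re

# The last branch of the alternation, taken on its own.
branch = re.compile(r"(?<!\bno)(?<!low).+risk")

m = branch.search("no and little risk")
print(m.span(), repr(m.group()))
```

The match starts at position 0 and covers the whole string, because both lookbehinds are checked only once, at the start position, before .+ is allowed to run across "no".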
The basic problem is that what you want to do is (?<!\bno.+)(?<!low.+)risk, but what you wrote is (?<!\bno)(?<!low).+risk. These are two very different things!
The former means "match 'risk', but only if it's not preceded by 'no' or 'low' anywhere in the string (up to 1 character before 'risk')". The latter means "match any non-empty substring followed by 'risk', as long as it's not preceded by either 'no' or 'low'". This gives the regex engine the freedom to look for any matching position in the string, as long as it's not immediately preceded by "no" or "low" and is followed by ".+risk" somewhere.
Unfortunately, (?<!\bno.+) is not a valid regex in this flavor, because its look-behind assertions must have a fixed length.
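Python's re module is a different engine from the Perl-style one behind the SAS PRX functions, but it imposes the same fixed-width restriction on look-behind, so it can be used to illustrate the point. A minimal sketch:

```python
import re

try:
    # A look-behind containing an unbounded quantifier.
    re.compile(r"(?<!\bno.+)risk")
    compiled = True
except re.error as exc:
    compiled = False
    print("rejected:", exc)
```

The pattern is rejected at compile time, before any text is ever matched.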
One possible workaround is to do the following:
^(?!.*(?:\bno|low).+risk).*risk
This says: Starting from the beginning of the string, first make sure there is no "no" or "low" followed by "risk" anywhere, then match "risk" anywhere within the string.
This is not quite equivalent to the (hypothetical) variable-width look-behind version, because that one would have matched
risk no risk
^^^^
due to the presence of "risk" without "no" preceding it, whereas this workaround first finds
risk no risk
^^^^^^^
and immediately rejects the whole string.
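The workaround can be checked against a few sample responses. This is a Python sketch with Python's re module standing in for the SAS engine (an assumption); the first three samples are taken from or modeled on the examples above, and the last two are invented.

```python
import re

workaround = re.compile(r"^(?!.*(?:\bno|low).+risk).*risk")

samples = [
    "no and little risk",
    "low risk for homelessness",
    "risk no risk",
    "high risk",
    "at risk",
]
for s in samples:
    print(repr(s), workaround.search(s) is not None)
```

Only the last two samples match. "risk no risk" is rejected as a whole, which is the difference from the hypothetical variable-width look-behind described above.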

Is there a better way to validate multiple regex conditionals than gigantic "or" statements?

I am practicing regular expressions and to test myself, I was trying to make some sort of simplified password validation expression. Basically, it would accept [0-9A-Za-z], but...
1) it needed to have a symbol (for simplicity's sake, I only used [#&#!&*%$])
2) it needed to have a capital letter
In my mind, the best way to do this was with positive lookahead statements. The only problem was the validation before and after the symbol. If I had the capital letter lookahead at the beginning, it would only validate if the capital letter came before the symbol, and the same for the end. The only way I could counter this was to make a massive OR statement with the entire thing copied, but one having the lookahead at the beginning, and one having it at the end. This is the monstrosity that I came up with:
/^[0-9A-Za-z#&#!&*%$]*(?=[A-Z]+)[0-9A-Za-z#&#!&*%$]*(?=[#&#!&*%$])[#&#!&*%$][A-Za-z|#&#!&*%$]*|[0-9A-Za-z#&#!&*%$]*(?=[#&#!&*%$])[#&#!&*%$](?=[A-Z]+)[0-9A-Za-z#&#!&*%$]*$/
I'll try to break it down into parts that make sense to me (and hopefully to you guys as well).
First part of the OR statement
The beginning can be [0-9A-Za-z#&#!&*%$]*, so that's what I start with
Then comes the first positive lookahead, ensuring that there is a capital [A-Z]
Then comes the second lookahead ensuring that one of the symbols in [#&#!&*%$] is present.
Then, it allows any of those necessary symbols to come next
The first part ends with another allowance of [A-Za-z|#&#!&*%$]*
Second part of the OR statement
The second part is much like the first. Well, almost an entire copy and paste. I put an | OR symbol in place, but then instead of having the (?=[A-Z]+) lookahead before the symbol, I check for it after.
All in all, I put in a good amount of effort into something that works (for the most part). I did some extensive Googling, but nothing really seemed to answer my question. Is there an easier way to go about what I am looking to do?
You need to anchor the lookaheads at the start of the string (to just run them once) and add a .* or .*? before the required subpatterns in the lookaheads to allow the search anywhere on the line (note that . usually does not match line breaks, but your main pattern does not match them, so . is enough).
So, that said, you may use
^(?=.*[A-Z])(?=.*[#&#!&*%$])[0-9A-Za-z#&#!&*%$]*$
Details:
^ - start of string
(?=.*[A-Z]) - there must be an uppercase ASCII letter somewhere after any 0+ chars other than line breaks
(?=.*[#&#!&*%$]) - there must be a special char from the character class somewhere after any 0+ chars other than line breaks
[0-9A-Za-z#&#!&*%$]* - 0+ chars from the defined ranges or chars
$ - end of string.
To make it more efficient, use the principle of contrast:
^(?=[^A-Z]*[A-Z])(?=[^#&#!&*%$]*[#&#!&*%$])[0-9A-Za-z#&#!&*%$]*$
    ^^^^^^^         ^^^^^^^^^^^^
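Both versions of the pattern can be compared side by side. The following Python sketch assumes Python's re module (the question uses JavaScript-style /.../ delimiters, but nothing in these patterns is flavor-specific); the sample strings are invented.

```python
import re

# The dot-star version and the "principle of contrast" version.
dot_star = re.compile(r"^(?=.*[A-Z])(?=.*[#&#!&*%$])[0-9A-Za-z#&#!&*%$]*$")
contrast = re.compile(
    r"^(?=[^A-Z]*[A-Z])(?=[^#&#!&*%$]*[#&#!&*%$])[0-9A-Za-z#&#!&*%$]*$"
)

samples = ["Passw0rd!", "!A", "password!", "Password", "Pass word!"]
verdicts = [(bool(dot_star.match(s)), bool(contrast.match(s))) for s in samples]
for s, v in zip(samples, verdicts):
    print(repr(s), v)
```

The two patterns agree on every sample; the order of the capital letter and the symbol does not matter, and no alternation is needed.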

Regex for one word in sentence

So I am trying to match a (any) word(s) that would have:
At least one upper case letter
At least one lower case letter
At least one number
I currently got to this using lookaheads
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).+$
But I am not able to get this to match on one word. I tried to use \b around the lookaheads but it doesn't work. The thing is, the word that I am trying to match can have the above conditions in any order. Example: aB5 OR Ba5 OR 5Ba etc. Need some pointers.
The main problem is that . includes spaces. You need to change your .'s to be restricted to word-characters only, i.e. \w. Note that \w is (mostly) [A-Za-z0-9_], if you wish to exclude some of these or include more, you should make the appropriate changes.
Another thing is that if you're looking for words in a string, you need to remove ^ and $ because these mean the start and end of the string respectively.
Since all your requirements are "at least" (as opposed to "at most"), you don't really need \b: matching happens left-to-right, so you can never get part of a word.
Regex:
(?=\w*\d)(?=\w*[a-z])(?=\w*[A-Z])\w+
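A short Python sketch shows the pattern picking out individual words. Python's re module and the sample sentence are assumptions for illustration.

```python
import re

word_re = re.compile(r"(?=\w*\d)(?=\w*[a-z])(?=\w*[A-Z])\w+")

sentence = "foo aB5 bar Ba5 5Ba baz ABC5 abc"  # hypothetical input
words = word_re.findall(sentence)
print(words)
```

Only the words containing a digit, a lowercase letter, and an uppercase letter are returned, regardless of the order in which those characters appear.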
I currently got to this using lookaheads
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).+$
But I am not able to get this to match on one word.
Lookaheads are the correct approach, but if you want to find single words only you must not allow every character (.) in between but only word-characters (like \w). So
/(?=\w*\d)(?=\w*[a-z])(?=\w*[A-Z])\w+/g
should do it. Of course you're free to allow more letters than only \w, maybe even \S.

Special way of forming regex?

I've come across this regex and I was wondering how this is used:
^.*(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$
I want to know what the individual section of the regex mean, not only what the regex in its whole does.
With the knowledge of regexes I have, I think it matches any input (at least 10 chars long) that contains a digit (0-9), lowercase and uppercase letters, but I need confirmation that this is correct.
Edit
I also don't know what it is meant to validate, but looking at what I think it does, is it right that the regex can be simplified to:
[\d|[a-zA-Z]]{10,}
Edit 2
I've noticed my replacement regex doesn't make sure I have at least one of every requirements (at least a digit, upcase and lowcase letter). Any way to adjust it so the regex does that as well, or is that only possible with the original regex?
I can explain what the parts of the regex do, but in general I find this quite odd:
^.*(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$
Basically what you said is true - there is no other magic in the regex.
^.* - match the beginning of the line and 0+ characters then ensure that
The following just assert - none of them matches/captures anything. It's called a positive lookahead, if you want to look it up. If all of them evaluate to true, the last part of the regex will do the rest:
(?=.{10,}) - from where the first matching stops (could be after the beginning of the line) there is a string of 10+ chars (any chars)
(?=.*\d) - and there is at least one digit in the whole string ahead
(?=.*[a-z]) - and a lower case letter
(?=.*[A-Z]) - and an upper case letter
If all that is true, then:
.*$ - match everything till the end of the line
Note: if any of the asserts fail, nothing will be matched.
To your edit
I don't think so - it's not the same to say that there is an upper and lower case letter and a digit somewhere in the string, and to say that the string consists of 10+ characters of which all are either digits or letters (upper or lower case) or both. Your regex would match a string that consists of only digits as well as only letters or a mix of both - the original regex ensures that each of these classes is represented at least once. It seems that someone might have used it to validate a user password or something like that.
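The difference between the two patterns can be checked with a small Python sketch. The asker's replacement is written here in a cleaned-up, anchored form, ^[\da-zA-Z]{10,}$, which is an assumption about what was intended rather than the literal text from the question; Python's re module and the sample strings are also assumptions.

```python
import re

original = re.compile(r"^.*(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$")
# Assumed intent of the proposed replacement: 10+ letters or digits.
replacement = re.compile(r"^[\da-zA-Z]{10,}$")

for s in ["aaaaaaaaaa", "Abcdefgh1!"]:
    print(repr(s), bool(original.match(s)), bool(replacement.match(s)))
```

The all-lowercase string passes the replacement but not the original, while the string with a punctuation mark passes the original but not the replacement, so the two are not interchangeable.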
This is probably used to validate candidate passwords - it
Checks that it is at least 10 characters long
Checks that it contains at least one digit
Checks that it contains at least one lower case letter
Checks that it contains at least one upper case letter
Your replacement regex is not identical because it just ORs the above conditions - the long nasty regex ANDs them. Also there is no order to the above conditions; the letters or digits can occur anywhere in the string.
I don't see a way of simplifying it much further actually - you might perhaps remove the .* at the beginning and .*$ at the end since they don't really serve any purpose. But otherwise, that long regex does a good job of conjunctively imposing those conditions without imposing an order.
I think this is used for ensuring password strength: it has to be at least 10 chars long, with at least 1 digit, at least 1 lowercase letter, and at least 1 uppercase letter.
The most important part of the whole regex is the (?=...) operator, which matches, but does NOT consume the part of the string it matches. Multiple (?=...) next to one another, therefore, acts as an AND operator.
(?=.{10,}) matches any sequence of at least 10 chars.
(?=.*\d) matches a single digit that follows anything.
(?=.*[a-z]) matches a lowercase char that follows anything.
(?=.*[A-Z]) matches an uppercase char that follows anything.
So this regex will match any string that has a substring that is at least 10-char long, has at least a digit, a lowercase char, and an uppercase char.
You can see that it sounds more complicated than it should, especially the substring part. Indeed, the .* part right after ^ is not necessary, and we can simplify this as
^(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$
It's a password strength validation regex as others have said, but that .* at the beginning should not be there. As it is, the .* initially consumes the whole string, then backtracks until it reaches a position where all four lookaheads can match. It works, but why make the regex do so much work if it doesn't have to?
^(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$
With the leading .* removed, the regex never has to backtrack (unless you count returning to the starting position after a successful lookahead as backtracking). As for the .*$ at the end, it might not be necessary, but it does no harm either. I would leave it in, just in case someone tries to use the result of the match for something instead of the original string.
One more point: you could make the regex more concise by removing the first lookahead and putting the .{10,} in place of the .*:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{10,}$
The reason it's written the way it is, is to work around a long-standing bug in Internet Explorer. The bug finally got fixed in IE8 or IE9, but I would leave it the way it is, just in case.
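The three forms discussed above (the original, the one without the leading .*, and the concise one) can be compared on a few inputs. This is a Python sketch; Python's re module, the variable names, and the sample strings are assumptions for illustration.

```python
import re

patterns = {
    "original": r"^.*(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$",
    "no_leading": r"^(?=.{10,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$",
    "concise": r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{10,}$",
}
compiled = {name: re.compile(p) for name, p in patterns.items()}

samples = ["Abcdefghi1", "Abcdefgh1", "abcdefghi1", "ABCDEFGHI1", "Abcdefghij"]

# For each sample, collect the verdict of every pattern.
table = {s: [bool(rx.match(s)) for rx in compiled.values()] for s in samples}
for s, row in table.items():
    print(repr(s), row)
```

All three patterns give the same verdict on each sample: only the first string, which is ten characters long and contains a digit, a lowercase letter, and an uppercase letter, is accepted.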