Negative regex for 4 input conditions

I have positive expression:
^(?=.*[A-Z])(?=.*[!##$%^&*()-_=+])(?=.*[0-9])(?=.*[a-z]).*$
Here the expression requires at least one uppercase letter, one lowercase letter, one digit and one special character (no match: 8$t or 4EE; match: R7e+ or #6Jw).
I need to make a negative expression for this, i.e. one that matches any string in which not all four of these character types occur at the same time (no match: A1a+ or %2Ht; match: 1+a or w2B).
How can I negate this?

Apart from the fact that it is not pretty, what's wrong with
^(?!(?=.*[A-Z])(?=.*[-!##$%^&*()_=+])(?=.*[0-9])(?=.*[a-z])).*$
Also be careful, - is a special character inside of a character class. To treat it literally you need to put it at the beginning or the end of the class.
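For example, a quick sanity check of that negated pattern in Java (the class name is mine; the test strings are the ones from the question) might look like this:

import java.util.regex.Pattern;

public class NegatedPasswordCheck {
    public static void main(String[] args) {
        // The negated pattern from above, with '-' moved to the front of the
        // character class so it is treated literally.
        Pattern notAllFour = Pattern.compile(
                "^(?!(?=.*[A-Z])(?=.*[-!##$%^&*()_=+])(?=.*[0-9])(?=.*[a-z])).*$");

        // Strings missing at least one of the four categories match...
        System.out.println(notAllFour.matcher("1+a").matches());  // true
        System.out.println(notAllFour.matcher("w2B").matches());  // true
        // ...while strings containing all four do not.
        System.out.println(notAllFour.matcher("A1a+").matches()); // false
        System.out.println(notAllFour.matcher("%2Ht").matches()); // false
    }
}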
Performance
On a side note, if you want improved performance, you might want to take advantage of possessive quantifiers, in order to fail faster when you need to.
The positive regex you posted could be changed into
^(?=[^A-Z]*+[A-Z])(?=[^-!##$%^&*()_=+]*+[-!##$%^&*()_=+])(?=[^0-9]*+[0-9])(?=[^a-z]*+[a-z]).*$
See the improved results here from regex101 (the debugger is launched from the bug-like icon on the left): with possessive quantifiers vs no possessive quantifier.
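For instance, in a flavour that supports possessive quantifiers (Java does; not every flavour does), a minimal sketch of the faster variant might look like this:

import java.util.regex.Pattern;

public class PossessiveCheck {
    public static void main(String[] args) {
        // Same four conditions, but each lookahead skips the characters that
        // cannot satisfy it with a possessive quantifier (*+), so the engine
        // gives up immediately instead of backtracking when a condition fails.
        Pattern allFour = Pattern.compile(
                "^(?=[^A-Z]*+[A-Z])(?=[^-!##$%^&*()_=+]*+[-!##$%^&*()_=+])"
                + "(?=[^0-9]*+[0-9])(?=[^a-z]*+[a-z]).*$");

        System.out.println(allFour.matcher("R7e+").matches()); // true
        System.out.println(allFour.matcher("#6Jw").matches()); // true
        System.out.println(allFour.matcher("8$t").matches());  // false: no uppercase
        System.out.println(allFour.matcher("4EE").matches());  // false: no lowercase, no special
    }
}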
You can obviously negate that one as well in the same way we did the first one.


Is this regex vulnerable to ReDoS attacks?

Regex:
^\d+(\.\d+)*$
I tried to break it with:
1234567890.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1x]
that is, 200 repetitions of ".1"
I have read about ReDoS attacks in:
Preventing Regular Expression Denial of Service (ReDoS)
Runaway Regular Expressions: Catastrophic Backtracking
However, I am not too confident in my skills to prepare a ReDoS attack on an expression. I tried to trigger catastrophic backtracking due to "Nested Quantifiers".
Is that expression breakable? If so, what input breaks it, and how did you come up with it?
"Nested quantifiers" isn't inherently a problem. It's just a simple way to refer to a problem which is actually quite a bit more complicated. The problem is "quantifying over a sub-expression which can, itself, match in many ways at the same position". It just turns out that you almost always need a quantifier in the inner sub-expression to provide a rich enough supply of matches, and so quantifiers inside quantifiers serve as a red flag that indicates the possibility of trouble.
(.*)* is problematic because .* has maximum symmetry — it can match anything between zero and all of the remaining characters at any point of the input. Repeating this leads to a combinatorial explosion.
([0-9a-f]+\d+)* is problematic because at any point in a string of digits, there will be many possible ways to allocate those digits between an initial substring of [0-9a-f]+ and a final substring of \d+, so it has the same exact issue as (.*)*.
(\.\d+)* is not problematic because \. and \d match completely different things. A digit isn't a dot and a dot isn't a digit. At any given point in the input there is only one possible way to match \., and only one possible way to match \d+ that leaves open the possibility of another repetition (consume all of the digits, because if we stop before a digit, the next character is certainly not a dot). Therefore (\.\d+)* is no worse, backtracking-wise, than a \d* would be in the same context, even though it contains nested quantifiers.
Your regex is safe, but only because of "\."
Testing on regex101.com shows that there are no combinations of inputs that create runaway checks - but your regex is VERY close to being vulnerable, so be careful when modifying it.
As you've read, catastrophic backtracking happens when two quantifiers are right next to each other. In your case, the regex expands to \d+\.\d+\.\d+\.\d+\. ... and so on. Because you make the dot required for every single match between \d+, your regex grows by only three steps for each period-number you add. (This translates to 4 steps per period-number if you put an invalid character at the end.) That's a linear growth rate, so your regex is fine. Demo
However, if you make the \. optional, accidentally forget the escape character to make it plain ol' ., or remove it altogether, then you're in trouble. Such a regex would allow catastrophic backtracking; an invalid character at the end approximately doubles the runtime with every additional number you add before it. That's an exponential growth rate, and it's enough to time out the regex101 engine's default settings with just 18 digits and 1 invalid character. Demo
As written, your regex is fine, and will remain so as long as you make sure there's something "solid" between the first \d+ and the second \d+, as well as something "solid" between the second \d+ and the * outside its capture group.
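Here is a minimal Java sketch of the same experiment (the class name and timing printout are mine, not from the question); the safe pattern fails almost instantly even on the 200-repetition input:

import java.util.regex.Pattern;

public class RedosCheck {
    public static void main(String[] args) {
        // The safe pattern from the question.
        Pattern safe = Pattern.compile("^\\d+(\\.\\d+)*$");

        // Roughly the attack input from the question: a number followed by
        // 200 copies of ".1" and an invalid trailing character.
        StringBuilder sb = new StringBuilder("1234567890");
        for (int i = 0; i < 200; i++) {
            sb.append(".1");
        }
        sb.append('x');
        String attack = sb.toString();

        long start = System.nanoTime();
        boolean matched = safe.matcher(attack).matches();
        long micros = (System.nanoTime() - start) / 1_000;
        // Fails almost instantly: the work grows only linearly with the
        // number of ".1" groups.
        System.out.println("matched=" + matched + " in about " + micros + " microseconds");

        // By contrast, making the dot optional re-introduces the ambiguity:
        //   ^\d+(\.?\d+)*$
        // Do NOT run that variant against a long non-matching input like the
        // one above; its runtime roughly doubles with every extra digit.
    }
}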

Why is the regular expression .* slower in one place and faster in another?

Lately I have been using a lot of regular expressions in Java/Groovy. For testing I routinely use regex101.com. Obviously I am looking at the regular expressions' performance too.
One thing I noticed is that using .* properly can significantly improve the overall performance. In particular, using .* in the middle of the regular expression, rather than at the end, is a performance killer.
For example, in this regular expression the required number of steps is 27:
If I change first .* to \s*, it will reduce the steps required significantly to 16:
However, if I change second .* to \s*, it does not reduce the steps any further:
I have a few questions:
Why the above? I don't want to compare \s and .*; I know the difference. I want to know why \s and .* cost differently depending on their position in the complete regex, and, more generally, which characteristics of a regex make its cost depend on its position in the overall regex (or on any aspect other than position, if there is any).
Does the step counter given on this site really give any indication of regex performance?
What other simple or similar (position-related) regex performance observations do you have?
The following is output from the debugger.
The big reason for the difference in performance is that .* will consume everything until the end of the string (except the newline). The pattern will then continue, forcing the regex to backtrack (as seen in the first image).
The reason that \s* and .* perform equally well at the end of the pattern is that greedily consuming everything vs. consuming only whitespace makes no difference when there's nothing else left to match (besides whitespace).
If your test string didn't end in whitespace, there would be a difference in performance, much like you saw in the first pattern - the regex would be forced to backtrack.
EDIT
You can see the performance difference if you end with something besides whitespace:
Bad:
^myname.*mahesh.*hiworld
Better:
^myname.*mahesh\s*hiworld
Even better:
^myname\s*mahesh\s*hiworld
The way regex engines work with the * quantifier, aka the greedy quantifier, is to consume everything in the input that matches, then:
1. Try the next term in the regex. If it matches, proceed on.
2. Otherwise, "unconsume" one character (move the pointer back one), aka backtrack, and go to step 1.
Since . matches anything (almost), the first state after encountering .* is to move the pointer to the end of input, then start moving back through the input one char at a time trying the next term until there's a match.
With \s*, only whitespace is consumed, so the pointer is initially moved exactly where you want it to be - no backtracking required to match the next term.
Something you should try is using the reluctant quantifier .*?, which will consume one char at a time until the next term matches. This should have the same time complexity as \s*, but be slightly more efficient because no check of the current char is required.
\s* and .* at the end of the expression will perform similarly, because both will consume everything at the end of the input that matches, which leaves the pointer in the same position for both expressions.
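To make the difference concrete, here is a small Java sketch (the test string is made up) exercising the three variants discussed above:

import java.util.regex.Pattern;

public class GreedyVsWhitespace {
    public static void main(String[] args) {
        String input = "myname  mahesh  hiworld";

        // .* first shoots to the end of the input, then backtracks one
        // character at a time until "mahesh" can match.
        Pattern greedy = Pattern.compile("^myname.*mahesh.*hiworld");
        // \s* only consumes the whitespace, so the engine is already standing
        // right where "mahesh" begins - no backtracking needed here.
        Pattern tight = Pattern.compile("^myname\\s*mahesh\\s*hiworld");
        // .*? is the reluctant variant: it advances one character at a time,
        // checking the next term after each step.
        Pattern reluctant = Pattern.compile("^myname.*?mahesh.*?hiworld");

        // All three accept the same input; they differ only in how much work
        // the engine does to get there.
        System.out.println(greedy.matcher(input).find());    // true
        System.out.println(tight.matcher(input).find());     // true
        System.out.println(reluctant.matcher(input).find()); // true
    }
}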

Regex Expressions

I am trying to learn to handle regular expressions and got some exercises, but no solutions to them. One question is: match all lower-case words except 'if'.
Can I do this one like this:
[a-z][a-z]^[if] | [a-z][a-z][a-z]+
I expect that a word has at least two characters, so every word with three or more is okay.
Well... the full, real solution would be something like this:
\b(?!if\b)\p{Ll}+\b
Demo
But I suppose it's, well, "higher level" regex that you haven't learned yet.
So, let's keep things simple. If you can accept ignoring words of fewer than 3 characters, you can write this:
\b[a-hj-z][a-eg-z][a-z]+|i[a-z]{2,}
Demo
The first two character classes are just [a-z] without i and f respectively.
If you want to include words of less than 3 characters, this will do:
\b(?:i|if[a-z]+|i[a-eg-z][a-z]*|[a-hj-z][a-z]*)\b
Demo
But it gets complicated at this point...
All sequences of two or more lower-case letters, except "if":
[a-hj-z][a-z]+|i(?:[a-eg-z][a-z]*|f[a-z]+)
With negative look-ahead, you can also do:
(?!if\b)[a-z]{2,}
A simple solution would be to place what you want to ignore on the left side of the alternation operator, and what you want to match in a capturing group on the right side, as you were attempting.
\bif\b|([a-z]{2,})
Note: The caret ^ outside of a character class does not mean negation; it asserts the position at the start of the string. And unless you are using the x (free-spacing) modifier, you need to remove the spaces around the alternation operator.
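To make both approaches concrete, here is a small Java sketch (the sample sentence and class name are made up) that extracts the words once with the lookahead version and once with the alternation trick:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExceptIf {
    public static void main(String[] args) {
        String text = "if it fits it sits";

        // Lookahead version: refuse to start a match where the word is exactly "if".
        Matcher m = Pattern.compile("(?!if\\b)[a-z]{2,}").matcher(text);
        while (m.find()) {
            System.out.println(m.group()); // it, fits, it, sits
        }

        // Alternation trick: "if" is matched and thrown away on the left,
        // everything we actually want lands in group 1 on the right.
        Matcher t = Pattern.compile("\\bif\\b|([a-z]{2,})").matcher(text);
        while (t.find()) {
            if (t.group(1) != null) {      // null means the "if" branch matched
                System.out.println(t.group(1));
            }
        }
    }
}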

What do we need Lookahead/Lookbehind Zero Width Assertions for?

I've just learned about these two concepts in more detail. I've always been good with regex, and it seems I've never seen the need for these two zero-width assertions.
I'm pretty sure I'm wrong, but I do not see why these constructs are needed. Consider this example:
Match a 'q' which is not followed by a 'u'.
2 strings will be the input:
Iraq
quit
With negative lookahead, the regex looks like this:
q(?!u)
Without it, it looks like this:
q[^u]
For the given input, both of these regexes give the same results (i.e. matching Iraq but not quit), as tested with Perl. The same idea applies to lookbehinds.
Am I missing a crucial feature that makes these assertions more valuable than the classic syntax?
Why your test probably worked (and why it shouldn't)
The reason you were able to match Iraq in your test might be that your string contained a \n at the end (for instance, if you read it from the shell). If you have a string that ends in q, then q[^u] cannot match it as the others said, because [^u] matches a non-u character - but the point is there has to be a character.
What do we actually need lookarounds for?
Obviously in the above case, lookaheads are not vital. You could work around this by using q(?:[^u]|$). So we match only if q is followed by a non-u character or the end of the string. There are much more sophisticated uses for lookaheads though, which become a pain if you do them without lookaheads.
This answer tries to give an overview of some important standard situations which are best solved with lookarounds.
Let's start with looking at quoted strings. The usual way to match them is with something like "[^"]*" (not with ".*?"). After the opening ", we simply repeat as many non-quote characters as possible and then match the closing quote. Again, a negated character class is perfectly fine. But there are cases, where a negated character class doesn't cut it:
Multi-character delimiters
Now what if we don't have double-quotes to delimit our substring of interest, but a multi-character delimiter? For instance, we are looking for ---sometext---, where single and double - are allowed within sometext. Now you can't just use [^-]*, because that would forbid single -. The standard technique is to use a negative lookahead at every position, and only consume the next character if it is not the beginning of ---. Like so:
---(?:(?!---).)*---
This might look a bit complicated if you haven't seen it before, but it's certainly nicer (and usually more efficient) than the alternatives.
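For illustration, a small Java sketch (the input string is made up) of that pattern in action:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiCharDelimiter {
    public static void main(String[] args) {
        // Single and double '-' are allowed inside; only a full '---' terminates.
        String text = "before ---some-text--with--dashes--- after";

        Matcher m = Pattern.compile("---(?:(?!---).)*---").matcher(text);
        while (m.find()) {
            System.out.println(m.group()); // ---some-text--with--dashes---
        }
    }
}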
Different delimiters
You get a similar case, where your delimiter is only one character but could be one of two (or more) different characters. For instance, say in our initial example, we want to allow for both single- and double-quoted strings. Of course, you could use '[^']*'|"[^"]*", but it would be nice to treat both cases without an alternative. The surrounding quotes can easily be taken care of with a backreference: (['"])[^'"]*\1. This makes sure that the match ends with the same character it began with. But now we're too restrictive - we'd like to allow " in single-quoted and ' in double-quoted strings. Something like [^\1] doesn't work, because a backreference will in general contain more than one character. So we use the same technique as above:
(['"])(?:(?!\1).)*\1
That is after the opening quote, before consuming each character we make sure that it is not the same as the opening character. We do that as long as possible, and then match the opening character again.
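And the same idea with the backreference version, again as a small Java sketch with an invented input:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MixedQuotes {
    public static void main(String[] args) {
        // A double-quoted string containing a single quote, and a
        // single-quoted string containing a double quote.
        String text = "say \"it's fine\" or 'a \"quoted\" word'";

        Matcher m = Pattern.compile("(['\"])(?:(?!\\1).)*\\1").matcher(text);
        while (m.find()) {
            System.out.println(m.group());
        }
        // Prints:
        //   "it's fine"
        //   'a "quoted" word'
    }
}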
Overlapping matches
This is a (completely different) problem that can usually not be solved at all without lookarounds. If you search for a match globally (or want to regex-replace something globally), you may have noticed that matches can never overlap. I.e. if you search for ... in abcdefghi you get abc, def, ghi and not bcd, cde and so on. This can be a problem if you want to make sure that your match is preceded (or surrounded) by something else.
Say you have a CSV file like
aaa,111,bbb,222,333,ccc
and you want to extract only fields that are entirely numerical. For simplicity, I'll assume that there is no leading or trailing whitespace anywhere. Without lookarounds, we might go with capturing and try:
(?:^|,)(\d+)(?:,|$)
So we make sure that we have the start of a field (start of string or ,), then only digits, and then the end of a field (, or end of string). Between that we capture the digits into group 1. Unfortunately, this will not give us 333 in the above example, because the , that precedes it was already part of the match ,222, - and matches cannot overlap. Lookarounds solve the problem:
(?<=^|,)\d+(?=,|$)
Or if you prefer double negation over alternation, this is equivalent to
(?<![^,])\d+(?![^,])
In addition to being able to get all matches, we get rid of the capturing which can generally improve performance. (Thanks to Adrian Pronk for this example.)
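For illustration, here is a small Java sketch (class name mine) applying the double-negation form to the sample CSV line:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericFields {
    public static void main(String[] args) {
        String csv = "aaa,111,bbb,222,333,ccc";

        // Double-negation form: the digits must not be preceded or followed
        // by anything that is not a comma (start and end of string qualify too).
        Matcher m = Pattern.compile("(?<![^,])\\d+(?![^,])").matcher(csv);
        while (m.find()) {
            System.out.println(m.group()); // 111, 222, 333
        }
    }
}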
Multiple independent conditions
Another very classic example of when to use lookarounds (in particular lookaheads) is when we want to check multiple conditions on an input at the same time. Say we want to write a single regex that makes sure our input contains a digit, a lower case letter, an upper case letter, a character that is none of those, and no whitespace (say, for password security). Without lookarounds you'd have to consider all permutations of digit, lower case/upper case letter, and symbol. Like:
\S*\d\S*[a-z]\S*[A-Z]\S*[^0-9a-zA-Z]\S*|\S*\d\S*[A-Z]\S*[a-z]\S*[^0-9a-zA-Z]\S*|...
Those are only two of the 24 necessary permutations. If you also want to ensure a minimum string length in the same regex, you'd have to distribute those in all possible combinations of the \S* - it simply becomes impossible to do in a single regex.
Lookahead to the rescue! We can simply use several lookaheads at the beginning of the string to check all of these conditions:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)
Because the lookaheads don't actually consume anything, after checking each condition the engine resets to the beginning of the string and can start looking at the next one. If we wanted to add a minimum string length (say 8), we could simply append (?=.{8}). Much simpler, much more readable, much more maintainable.
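For illustration, a quick Java check of the combined pattern, with the minimum-length lookahead appended (the class name and test strings are made up):

import java.util.regex.Pattern;

public class PasswordRules {
    public static void main(String[] args) {
        // All five conditions checked by independent lookaheads at position 0,
        // plus the minimum length of 8 suggested above. Note find() rather than
        // matches(): the pattern itself consumes nothing.
        Pattern rules = Pattern.compile(
                "^(?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\\s)(?=.{8})");

        System.out.println(rules.matcher("Abcdef1!").find());  // true
        System.out.println(rules.matcher("Abcdef1").find());   // false: no symbol, too short
        System.out.println(rules.matcher("Abc def1!").find()); // false: contains whitespace
    }
}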
Important note: This is not the best general approach to check these conditions in any real setting. If you are making the check programmatically, it's usually better to have one regex for each condition, and check them separately - this lets you return a much more useful error message. However, the above is sometimes necessary, if you have some fixed framework that lets you do validation only by supplying a single regex. In addition, it's worth knowing the general technique, if you ever have independent criteria for a string to match.
I hope these examples give you a better idea of why people would like to use lookarounds. There are a lot more applications (another classic is inserting commas into numbers), but it's important that you realise that there is a difference between (?!u) and [^u] and that there are cases where negated character classes are not powerful enough at all.
q[^u] will not match "Iraq", because it will look for another character after the q.
q(?!u), however, will match "Iraq":
regex = /q[^u]/
regex.test("Iraq")   // false
regex.test("Iraqf")  // true

regex = /q(?!u)/
regex.test("Iraq")   // true
Well, another thing along with what others mentioned: with the negative lookahead you can negate a sequence of characters, e.g. you can negate ui, while with [^...] you cannot negate ui, only u or i individually; and if you try [^ui]{2}, you will also reject uu, ii and iu.
The whole point is to not "consume" the next character(s), so that it can be e.g. captured by another expression that comes afterwards.
If they're the last expression in the regex, then what you've shown are equivalent.
But e.g. q(?!u)([a-z]) would let the non-u character be part of the next group.
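For instance, a small Java sketch (the test string qatar is just a convenient example) shows what ends up in the capturing group in each case:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadCapture {
    public static void main(String[] args) {
        Matcher consuming = Pattern.compile("q[^u]([a-z])").matcher("qatar");
        if (consuming.find()) {
            // [^u] has already eaten the 'a', so group 1 only gets the 't'.
            System.out.println(consuming.group(1)); // t
        }

        Matcher lookahead = Pattern.compile("q(?!u)([a-z])").matcher("qatar");
        if (lookahead.find()) {
            // The lookahead consumes nothing, so the 'a' right after 'q'
            // is still available to the capturing group.
            System.out.println(lookahead.group(1)); // a
        }
    }
}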

Regular expression to validate a field: the field must contain at least 2 alphanumeric characters

I need to validate a VARCHAR field.
The condition is: the field must contain at least 2 alphanumeric characters.
Please, can anyone give me a regular expression for the above condition?
I wrote the expression below, but it checks that at least 2 characters are alphanumeric and that nothing else is present; if my input has anything other than alphanumeric characters, it does not validate.
'^[a-zA-Z0-9]{2,}$'
Please help.
[a-zA-Z0-9].*[a-zA-Z0-9]
The easy way: at least two alphanumeric characters anywhere in the string.
Answer to comments
I never did (nor intended to do) any benchmarking. Therefore - and given that we know nothing of OP's environment - I am not one to judge whether a non-greedy version ([a-zA-Z0-9].*?[a-zA-Z0-9]) will be more efficient. I do believe, however, that the performance impact is totally negligible :)
I would probably use this regular expression:
[a-zA-Z0-9][^a-zA-Z0-9]*[a-zA-Z0-9]
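Used as a partial match, for example from Java (the class name and test strings are illustrative), that would look like this:

import java.util.regex.Pattern;

public class AtLeastTwoAlnum {
    public static void main(String[] args) {
        // Unanchored, so use find(): we only require that two alphanumerics
        // occur somewhere; any other characters are allowed around them.
        Pattern p = Pattern.compile("[a-zA-Z0-9][^a-zA-Z0-9]*[a-zA-Z0-9]");

        System.out.println(p.matcher("a-1").find());      // true
        System.out.println(p.matcher("!! x7 !!").find()); // true
        System.out.println(p.matcher("a").find());        // false: only one
        System.out.println(p.matcher("#-+!").find());     // false: none
    }
}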
How broad is your definition of alphanumeric? For US ASCII, see the answers above. For a more cosmopolitan view, use one of
[[:alnum:]].*[[:alnum:]]
or
[^\W_].*[^\W_]
The latter works because \w matches a "word character," alphanumerics and underscore. Use a double-negative to exclude the underscore: "not not-a-word-character and not underscore."
As simple as
'\w.*\w'
In response to a comment, here's a performance comparison for the greedy [a-zA-Z0-9].*[a-zA-Z0-9] and the non-greedy [a-zA-Z0-9].*?[a-zA-Z0-9].
The greedy version will find the first alphanumeric, match all the way to the end, and backtrack to the last alphanumeric, finding the longest possible match. For a long string, it is the slowest version. The non-greedy version finds the first alphanumeric, and tries not to match the following symbols until another alphanumeric is found (that is, for every character it matches the empty string, tries to match [a-zA-Z0-9], fails, and matches .).
Benchmarking (empirical results):
In case the alphanumerics are very far apart, the greedy version is faster (even faster than Gumbo's version).
In case the alphanumerics are close to each other, the greedy version is significantly slower.
The test: http://jsbin.com/eletu/4
Compares 3 versions:
[a-zA-Z0-9].*?[a-zA-Z0-9]
[a-zA-Z0-9][^a-zA-Z0-9]*[a-zA-Z0-9]
[a-zA-Z0-9].*[a-zA-Z0-9]
Conclusion: none. As always, you should check against typical data.