Catastrophic backtracking issue with regular expression on long names - regex

I am currently trying to validate a regex pattern for a names list among other things.
It works so far, except when I test the limits. If the name is long (a maximum of 128 characters is allowed) and ends with a character defined in an inner group, such as a separator (e.g. a space or a punctuation mark), catastrophic backtracking occurs. I don't quite understand why, because I would assume that the first group (?:[\p{L}\p{Nd}\p{Ps}])+ must occur at least once, the group (?:\p{Zs}\p{P}|\p{P}\p{Zs}|[\p{P}\p{Zs}])? is optional, and if it is present it must be followed by the final group (?:[\p{L}\p{Nd}\p{Pe}.]). The last two groups can repeat.
Full pattern
^(?!.{129})(?!.["])(?:[\p{L}\p{Nd}\p{Ps}])+(?:(?:\p{Zs}\p{P}|\p{P}\p{Zs}|[\p{P}\p{Zs}])?(?:[\p{L}\p{Nd}\p{Pe}.]))*$
Tests & Samples
https://regex101.com/r/6E0Khd/1

You need to rephrase the pattern so that consecutive regex parts cannot match at the same location inside the string.
You can use
^(?!.{129})(?!.")[\p{L}\p{Nd}\p{Ps}][\p{L}\p{Nd}\p{Pe}.]*(?:(?:\p{Zs}\p{P}?|\p{P}\p{Zs}?)[\p{L}\p{Nd}\p{Pe}.]+)*$
See the regex demo.
Your regex was ^<Lookahead_1><Lookahead_2><P_I>+(?:<OPT_SEP>?<P_II>)*$. You need to make sure your string only starts with a char that matches <P_I> pattern, the rest of the chars can match <P_II> pattern. So, it should look like ^<Lookahead_1><Lookahead_2><P_I><P_II>*(?:<SEP><P_II>+)*$. Note the P_I pattern is used to match the first char only, P_II pattern is added right after P_I to match zero or more chars matching that pattern, SEP pattern is now obligatory and P_II pattern is quantified with +.
I also shrunk the (?:\p{Zs}\p{P}|\p{P}\p{Zs}|[\p{P}\p{Zs}]) pattern into (?:\p{Zs}\p{P}?|\p{P}\p{Zs}?): it matches either a space separator followed by an optional punctuation character, or a punctuation character followed by an optional space separator.
Note that \p{Zs} does not match a TAB char; if tabs should be accepted, you may want to use [\p{Zs}\t] instead.
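The ^<Lookahead><P_I><P_II>*(?:<SEP><P_II>+)*$ shape can be tried out with a minimal Python sketch. Python's built-in re module does not support \p{...} classes, so the ASCII stand-ins P_I, P_II and SEP below are assumptions made purely for illustration, and the (?!.") lookahead is left out.

```python
import re

# Hypothetical ASCII stand-ins for the Unicode classes used in the answer.
P_I = r"[A-Za-z0-9(]"          # first character: letter, digit or opening bracket
P_II = r"[A-Za-z0-9).]"        # later characters: letter, digit, closing bracket or dot
SEP = r"(?: [,'-]?|[,'-] ?)"   # space + optional punctuation, or punctuation + optional space

# ^<Lookahead><P_I><P_II>*(?:<SEP><P_II>+)*$
NAME = re.compile(rf"^(?!.{{129}}){P_I}{P_II}*(?:{SEP}{P_II}+)*$")


def is_valid_name(name):
    """Return True if the name fits the unrolled pattern."""
    return NAME.match(name) is not None
```

With this shape, a 128-character name that ends in a separator is rejected after a roughly linear number of steps, because a separator can only be consumed by SEP and SEP must be followed by at least one P_II character.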

Related

Abort regex execution when pattern found in negative lookahead syntax

While struggling trying to validate SQL Server's connection string pattern using regex I've achieved the following result:
^(?!.*?(?<=^|\;)[a-zA-Z]+( [a-zA-Z]+)*(\=[^\;]+?\=[^\;]*)?(\;|$))+([a-zA-Z]+( [a-zA-Z]+)*\=[^\;]+\;?)+$
Sample string used was:
option=value;missingvalue;multiple assignment=123=456
* (hosted and tested in regex101)
And, as expected, the string didn't match. The issue is that I think this may not be standard, recommended nor optimal regex implementation — especially at the negative lookahead part, considering it's just going through the whole string even after a successful match.
I'll try to break down how it works below:
Negative Lookahead
1. ^(?!.*?(?<=^|;)
Negative lookahead pattern starting either at the beginning of the string or recursively throughout just after the semi colon character
2. [a-zA-Z]+( [a-zA-Z]+)*(=[^;]+?=[^;]*)?(;|$))+
Matching the simple or composite option names — that is, just [a-zA-Z]+ (mandatory) or, additionally, ( [a-zA-Z]+)* any number of times; afterwards there's an optional group that tries to match when there's more than one consecutive value assignment for any given option; finally it ends with either ; or $ (end of string) — in case of the first one, the lookahead pattern restarts from the beginning (recursion)
Regular Pattern Matching
([a-zA-Z]+( [a-zA-Z]+)*=[^;]+;?)+$
Not much new to say here other than that this is the pattern which should actually match the string after the initial Negative Lookahead thorough scan/validation.
I can't deny that it's kinda working for what I intended, but I can't hold back the feeling that I'm misunderstanding something about regex's workings.
Is there an easier way to do this while avoiding having to recursively look ahead using the pattern described above multiple times?
EDIT: As requested, some closer to real life examples would be the following — for both valid and invalid formatting:
VALID
Database=somedb;Username=admin;Password=P#ssword!23;Port=1433
INVALID
missing delimiter between Username and Password options
Database=somedb;Username=adminPassword=P#ssword!23;Port=1433
missing value for Port option
Database=somedb;Port;Username=admin;Password=P#ssword!23
The following pattern accepts only letters for the names. For the purposes of testing, it accepts any character except equals and semicolon in the values. This would need to be tightened, as characters like line endings and tabs would need to be excluded.
The negated character class [^=;] forbids a second equals sign in a value, and every name=value pair must end with a semicolon. Please note that your "correct" example is found to be wrong because there is no semicolon at the end.
If we try to block the other way round, it becomes impossible to match the regex.
I've added an optional single space in the name to match "Connection Timeout" and similar.
/^(\s*[a-zA-Z]+ ?[a-zA-Z]+=[^=;]+;)+$/gm
I have also allowed spaces before the name.
Our pattern is made up of
^ beginning of line
( start group
\s* optional whitespace before the name
[a-zA-Z]+ ?[a-zA-Z]+ name containing at least one letter before and after an optional space. This means at least two letters
= an equals sign
[^=;]+ any character except equals and semicolon, at least once
; a literal semicolon
)+ close the group and repeat it at least once
$ end of line
Thank you Casimir et Hippolyte for the improvement. I was using look-aheads and look-backs following the question but your syntax is much cleaner.
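A minimal Python sketch of how the pattern behaves on the strings from the question. The function name is_valid_connection_string is a hypothetical helper, and the /gm delimiters and flags are dropped, so the pattern is applied to a single string rather than line by line.

```python
import re

# Pattern from the answer, without the /.../gm delimiters and flags.
PAIRS = re.compile(r"^(\s*[a-zA-Z]+ ?[a-zA-Z]+=[^=;]+;)+$")


def is_valid_connection_string(text):
    """Return True if every name=value pair is well formed and ends with ';'."""
    return PAIRS.match(text) is not None
```

For example, is_valid_connection_string("Database=somedb;Port;Username=admin;") returns False because Port has no value, and the question's "valid" sample is accepted only once a trailing semicolon is appended.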

(PowerShell) Why is this regular expression so slow for the given input? [duplicate]

Using Java, I want to detect whether a line starts with words and separators followed by "myword", but this regex takes too long. What is incorrect?
^\s*(\w+(\s|/|&|-)*)*myword
The pattern ^\s*(\w+(\s|/|&|-)*)*myword is not efficient due to the nested quantifier. \w+ requires at least one word character and (\s|/|&|-)* can match zero or more of some characters. When the * is applied to the group and the input string has no separators in between word characters, the expression becomes similar to a (\w+)* pattern, which is a classic catastrophic backtracking pattern.
Just a small illustration of \w+ and (\w+)* performance:
[Figure: \w+ compared with (\w+)*]
Your pattern is even more complicated and involves even more of those backtracking steps. To avoid such issues, a pattern should not have optional subpatterns inside quantified groups. That is, create a group with obligatory subpatterns and apply the necessary quantifier to the group.
In this case, you can unroll the group you have as
String rx = "^\\s*(\\w+(?:[\\s/&-]+\\w+)*)[\\s/&-]+myword";
See IDEONE demo
Here, (\w+(\s|/|&|-)*)* is unrolled as (\w+(?:[\s/&-]+\w+)*) (I kept the outer parentheses to produce capture group #1; you may remove them if you are not interested in it). \w+ matches one or more word characters (so it is an obligatory subpattern), and the (?:[\s/&-]+\w+)* subpattern matches zero or more (*, thus this whole group is optional) sequences of one or more characters from the defined character class [\s/&-]+ (so it is obligatory) followed by one or more word characters \w+.
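As a rough sketch, the same two patterns can be compared in Python, whose re module is also a backtracking engine. The sample lines and variable names are made up for illustration, and the slow pattern is deliberately run only on a short input.

```python
import re

# Original shape: a quantified group whose separator part may match nothing.
SLOW = re.compile(r"^\s*(\w+(\s|/|&|-)*)*myword")
# Unrolled shape from the answer: each repetition must consume a separator.
FAST = re.compile(r"^\s*(\w+(?:[\s/&-]+\w+)*)[\s/&-]+myword")

line = "foo / bar & myword and more"
match = FAST.match(line)
# Group 1 holds the words and separators found before the last separator run.
words_before = match.group(1) if match else None

# A long line without "myword" is rejected quickly by the unrolled pattern.
# SLOW is not run on it, because that attempt may take a very long time.
long_line = "word " * 200
fast_rejects_long_line = FAST.match(long_line) is None
```

Note that the two patterns are not strictly equivalent: the unrolled one requires at least one word and one separator before myword, which is what the question asks for.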

How to only match a single instance of a character?

Not quite sure how to go about this, but basically what I want to do is match a character, say a for example. In this case all of the following would not contain matches (i.e. I don't want to match them):
aa
aaa
fooaaxyz
Whereas the following would:
a (obviously)
fooaxyz (this would only match the letter a part)
My knowledge of RegEx is not great, so I am not even sure if this is possible. Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
^[^\sa]*\Ka(?=[^\sa]*$)
DEMO
\K discards the previously matched characters, and the lookahead asserts whether a match is possible or not. So the above matches only the letter a that satisfies the conditions.
OR
a{2,}(*SKIP)(*F)|a
DEMO
You may use a combination of a lookbehind and a lookahead:
(?<!a)a(?!a)
See the regex demo and the regex graph:
Details
(?<!a) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a a char
a - an a char
(?!a) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a a char.
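A small Python sketch of the lookaround version, assuming the goal is to locate every isolated a. The helper name lone_a_positions is hypothetical.

```python
import re

# An 'a' with no 'a' directly before it and no 'a' directly after it.
LONE_A = re.compile(r"(?<!a)a(?!a)")


def lone_a_positions(text):
    """Return the index of every 'a' that is not adjacent to another 'a'."""
    return [m.start() for m in LONE_A.finditer(text)]
```

For instance, lone_a_positions("fooaxyz") gives [3], while "aa", "aaa" and "fooaaxyz" all give an empty list.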
You need two things:
a negated character class: [^a] (all except "a")
anchors (^ and $) to ensure that the limits of the string are reached (in other words, that the pattern matches the whole string and not only a substring):
Result:
^[^a]*a[^a]*$
Once you know there is only one "a", you can extract, replace or remove it in whatever way suits the language you use.
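A minimal Python sketch of this whole-string check; the helper name has_exactly_one_a is an assumption. Unlike the lookaround approach, it rejects any string that contains a second a anywhere, even a non-adjacent one.

```python
import re

# Zero or more non-'a' characters, one 'a', then zero or more non-'a' characters.
EXACTLY_ONE_A = re.compile(r"^[^a]*a[^a]*$")


def has_exactly_one_a(text):
    """Return True if the whole string contains exactly one 'a'."""
    return EXACTLY_ONE_A.match(text) is not None
```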

Regular expression to match non-integer values in a string

I want to match the following rules:
One dash is allowed at the start of a number.
Only values between 0 and 9 should be allowed.
I currently have the following regex pattern. I'm matching the inverse so that I can throw an exception upon finding a match that doesn't follow the rules:
[^-0-9]
The downside to this pattern is that it works for all cases except a hyphen in the middle of the String will still pass. For example:
"-2304923" is allowed correctly but "9234-342" is also allowed and shouldn't be.
Please let me know what I can do to specify the first character as [^-0-9] and the rest as [^0-9]. Thanks!
This regex will work for you:
^-?\d+$
Explanation: ^ anchors the start of the string, then comes an optional (?) hyphen -, then the digit \d repeated one or more times (+), and the string must end there ($).
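A short Python sketch of validating with this pattern and raising on bad input, as the question describes. The function name parse_integer is hypothetical, and re.ASCII is an added assumption so that \d is limited to 0-9 rather than all Unicode digits.

```python
import re

# Optional leading hyphen followed by one or more ASCII digits.
INTEGER = re.compile(r"^-?\d+$", re.ASCII)


def parse_integer(text):
    """Return the integer value of text, or raise ValueError if it is malformed."""
    if INTEGER.match(text) is None:
        raise ValueError(f"not an integer: {text!r}")
    return int(text)
```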
You can do this:
(?:^|\s)(-?\d+)(?:["'\s]|$)
(?:^|\s) non-capturing group for start of line or space
(-?\d+) capture number
(?:["'\s]|$) non-capturing group for end of line, space or quote
See it work
This will capture all strings of numbers in a line with an optional hyphen in front.
-2304923" "9234-342" 1234 -1234
-2304923: captured
9234-342: NOT captured
1234: captured
-1234: captured
I don't understand how your pattern [^-0-9] is matching the strings you mention. That pattern is the opposite of what you want: the caret (^) at the beginning negates the character class, so it matches any character except a hyphen and the digits.
Anyway, for your requirement, you first need to match an optional hyphen at the beginning, so keep it outside the character class. Then, to match the digits that follow, you can use [0-9]+ or \d+.
So, your pattern to match the required format should be:
-?[0-9]+ // or -?\d+
The above regex finds the pattern within a larger string. If you want the entire string to match this pattern, add anchors at both ends of the regex:
^-?[0-9]+$
For a regular expression like this, it's sometimes helpful to think of it in terms of two cases.
Is the first character messed up somehow?
If not, are any of the other characters messed up somehow?
Combine these with |
(^[^-0-9]|^.+?[^0-9])
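A minimal Python sketch of this inverse check, where a match means the input breaks the rules. The helper name is_invalid is an assumption.

```python
import re

# First alternative: a bad first character.
# Second alternative: any non-digit after the first character.
INVALID = re.compile(r"(^[^-0-9]|^.+?[^0-9])")


def is_invalid(text):
    """Return True if the text violates the 'optional dash, then digits' rule."""
    return INVALID.search(text) is not None
```

Note that this inverse pattern does not flag an empty string or a lone hyphen, so those cases would need a separate check.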

Regex to match [integer][colon][alphanum][colon][integer]

I am attempting to match a string formatted as [integer][colon][alphanum][colon][integer]. For example, 42100:ZBA01:20. I need to split these by colon...
I'd like to learn regex, so if you could, tell me what I'm doing wrong:
This is what I've been able to come up with...
^(\d):([A-Za-z0-9_]):(\d)+$
^(\d+)$
^[a-zA-Z0-9_](:)+$
^(:)(\d+)$
At first I tried matching parts of the string, then the entire string. As you can tell, I'm not very familiar with regular expressions.
EDIT: The regex is for input into a desktop application. I was not certain what 'language' or 'type' of regex to use, so I assumed .NET.
I need to be able to identify each of those grouped characters, split by colon. So Group #1 should be the first integer, Group #2 should be the alphanumeric group, Group #3 should be an integer (ranging 1-4).
Thank you in advance,
Darius
I assume the semicolons (;) are meant to be colons (:)? All right, a bit of the basics.
^ matches the beginning of the input. That is, the regular expression will only match if it finds a match at the start of the input.
Similarly, $ matches the end of the input.
^(\d+)$ will match a string consisting only of one or more numbers. This is because the match needs to start at the beginning of the input and stop at the end of the input. In other words, the whole input needs to match (not just a part of it). The + denotes one or more matches.
With this knowledge, you'll notice that ^(\d):([A-Za-z0-9_]):(\d)+$ was actually very close to being right. This expression indicates that the whole input needs to match:
1. one digit;
2. a colon;
3. one word character (or an alphanumeric character as you call it);
4. a colon;
5. one or more digits.
The problem is clearly in 1 and 3. You need to add a + quantifier there to match one or more times instead of just once. Also, you want to place these quantifiers inside the capturing groups in order to get the multiple matches inside one capturing group as opposed to receiving multiple capturing groups containing single matches.
^(\d+):([A-Za-z0-9_]+):(\d+)$
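As an illustration, here is a minimal Python sketch of reading the three groups. The question assumes .NET, so the use of Python and the helper name split_record are assumptions, although the pattern itself is unchanged.

```python
import re

# Three capturing groups separated by literal colons.
RECORD = re.compile(r"^(\d+):([A-Za-z0-9_]+):(\d+)$")


def split_record(text):
    """Return (integer, alphanumeric, integer) as strings, or None if no match."""
    m = RECORD.match(text)
    return m.groups() if m else None
```

Calling split_record("42100:ZBA01:20") returns ("42100", "ZBA01", "20"), with each part available as its own group.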
You need to use quantifiers:
^(\d+):([A-Za-z0-9_]+):(\d+)$
The + inside each of the three groups is a quantifier that matches the preceding pattern one or more times.
Now you can read the values from the individual groups.