Why the [^A] does not work? - regex

Why the regular expression:
changes\s*=\s*[^A].*
matches
changes = AssignDictionary(out
I want to match only when the word following the spaces (\s*) does not start with the character "A" ([^A]), so the pattern is not supposed to match that line. What am I doing wrong?

The [^A] does not work because of backtracking. \s* first matches all the whitespace after =, [^A] then fails on the A, and the engine backtracks: \s* gives up its last whitespace character so that [^A] can match it. Since a whitespace character is not A, there is a match.
See Step 12 & 13 (regex demo):
If you want to fail the match when there is an A after =, you need a negative lookahead:
changes\s*=(?!\s*A)\s*.*
           ^^^^^^^^
See another demo
Or another PCRE variation: changes\s*=\s*+(?!A).* (check if the character is not A after all whitespaces after =).
If your regex engine supports atomic groups or possessive quantifiers, you can make your regex work by preventing backtracking into the \s* construct:
changes\s*=\s*+[^A].*
             ^^ (possessive quantifier)
changes\s*=(?>\s*)[^A]\s*.*
            ^^   ^ - atomic group
And in case your engine supports neither atomic groups nor possessive quantifiers, you can disable backtracking with a lookahead plus capture group/backreference combination (to emulate an atomic group):
changes\s*=(?=(\s*))\1[^A].*
See this demo.
Still, the first solution with a lookahead is preferable since it seems the most universal one. The fastest looks to be the one with the possessive quantifier.
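The behaviour described in this answer can be checked with a minimal Python sketch. It assumes Python's re module as a stand-in for the asker's engine; the second sample line and all variable names are made up for illustration. The possessive and atomic forms are left out because Python's re only accepts them from version 3.11 on.

```python
import re

line_a = "changes = AssignDictionary(out"   # value starts with "A"
line_b = "changes = BuildDictionary(out"    # hypothetical line that should match

original = re.compile(r"changes\s*=\s*[^A].*")
lookahead = re.compile(r"changes\s*=(?!\s*A)\s*.*")
emulated_atomic = re.compile(r"changes\s*=(?=(\s*))\1[^A].*")

# The original pattern matches: \s* gives back the space and [^A] takes it.
original_hits_a = original.search(line_a) is not None

# Each tuple is (matches line_a, matches line_b).
lookahead_hits = (lookahead.search(line_a) is not None,
                  lookahead.search(line_b) is not None)
emulated_hits = (emulated_atomic.search(line_a) is not None,
                 emulated_atomic.search(line_b) is not None)

print(original_hits_a, lookahead_hits, emulated_hits)
# True (False, True) (False, True)
```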

It is also possible to get that with plain regex. Just indicate which characters are not valid after the arbitrary number of spaces. As you indicated, this is "not A", but of course it must also be "not a space".
Otherwise backtracking would allow a space preceding an A in that position to be matched by the "not A" class and defeat your intentions.
Using changes\s*=\s*[^A\s].* will match only when the first character after the spaces following the equals sign is neither an A nor whitespace (and extend the match to end-of-line/end-of-input).
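As a quick check of the plain-regex variant, here is a small Python sketch (the extra spaces and the second sample line are assumptions added for illustration):

```python
import re

plain = re.compile(r"changes\s*=\s*[^A\s].*")

# Several spaces before the value: backtracking cannot hand a space to [^A\s].
rejected = plain.search("changes =   AssignDictionary(out")
accepted = plain.search("changes =   BuildDictionary(out")

print(rejected is None)   # True
print(accepted.group(0))  # changes =   BuildDictionary(out
```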

Related

(PowerShell) Why is this regular expression so slow for the given input? [duplicate]

Using Java, I want to detect whether a line starts with words and separators followed by "myword", but this regex takes too long. What is incorrect?
^\s*(\w+(\s|/|&|-)*)*myword
The pattern ^\s*(\w+(\s|/|&|-)*)*myword is not efficient due to the nested quantifier. \w+ requires at least one word character and (\s|/|&|-)* can match zero or more of some characters. When the * is applied to the group and the input string has no separators in between word characters, the expression becomes similar to a (\w+)* pattern, which is a classic catastrophic backtracking pattern.
Just a small illustration of \w+ and (\w+)* performance:
[Images: \w+ versus (\w+)*]
Your pattern is even more complicated and involves even more backtracking steps. To avoid such issues, a pattern should not have optional subpatterns inside quantified groups. That is, create a group with obligatory subpatterns and apply the necessary quantifier to the group.
In this case, you can unroll the group you have as
String rx = "^\\s*(\\w+(?:[\\s/&-]+\\w+)*)[\\s/&-]+myword";
See IDEONE demo
Here, (\w+(\s|/|&|-)*)* is unrolled as (\w+(?:[\s/&-]+\w+)*) (I kept the outer parentheses to produce capture group #1; you may remove them if you are not interested in the captured value). \w+ matches one or more word characters (so it is an obligatory subpattern), and the (?:[\s/&-]+\w+)* subpattern matches zero or more (*, thus this whole group is optional) sequences of one or more characters from the character class [\s/&-]+ (so it is obligatory) followed by one or more word characters \w+.
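A rough Python translation of the unrolled pattern may help to see it at work. This is a sketch only: the sample inputs are invented, and Python's re is assumed to behave like Java's engine for this particular pattern.

```python
import re

# Python form of the Java pattern string shown above.
unrolled = re.compile(r"^\s*(\w+(?:[\s/&-]+\w+)*)[\s/&-]+myword")

m = unrolled.search("foo bar/baz & myword and more")
prefix = m.group(1) if m else None
print(prefix)  # foo bar/baz

# A long run of word characters with no separators and no "myword":
# every subpattern inside the quantified group is obligatory, so the
# engine gives up after a linear number of steps.
no_match = unrolled.search("a" * 5000 + "!")
print(no_match)  # None
```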

Regex - How to prevent any string that starts with "de" but cannot use lookahead or lookbehind?

I have a regex
[a-zA-Z][a-z]
I have to change this regex so that it does not accept strings that start with "de", "DE", "dE" or "De". I cannot use lookbehind or lookahead because my system does not support them.
There's a solution without a lookahead or lookbehind, but you need to be able to use groups.
The idea there is to create a sort of "honeypot" that will match your negative results and keep only the results that do interest you.
In your case, that would write:
[dD][eE].*|(<your-regex>)
If the proposition is de<anything> (case insensitive here), it will match, but group(1) will be null.
On the other hand, diZ, for instance, would not match what is before the | and would therefore fall into group(1).
Finally, if the proposition doesn't start with de and doesn't match your regex, well, there will be no groups to get at all.
If you need to be sure that your proposition will match the whole provided string, you can update the regex thus:
^(?:[dD][eE].*|(<your-regex>))$
Note that ?: is not a lookahead of any kind, it serves to mark the group as non-capturing, so that <your-regex> will still be captured by group(1) (would become group(2) otherwise and the capture of a group is not always a transparent operation, performance-wise).
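The "honeypot" idea can be sketched in Python as follows. The asker's [a-zA-Z][a-z] is substituted for <your-regex>, and the helper name accepted is hypothetical; Python returns None where the answer says null.

```python
import re

# Anchored honeypot: "de..." inputs match the first branch and leave group 1 empty.
HONEYPOT = re.compile(r"^(?:[dD][eE].*|([a-zA-Z][a-z]))$")

def accepted(text):
    """Return the wanted match, or None for "de..." input and for non-matching input."""
    m = HONEYPOT.match(text)
    return m.group(1) if m else None

print(accepted("di"))     # di
print(accepted("Dexyz"))  # None (caught by the honeypot branch)
print(accepted("1x"))     # None (no match at all)
```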
Simply ignore those characters:
[a-ce-z][a-df-z][a-gi-kwxyzWZXZ]
Make sure the flag is set to case insensitive. Also, [a-gi-kwxyzWZXZ] can then be modified to [a-gi-kwxyz].
EDIT:
As pointed out in this comment, the regex here won't support other words that start with d but are not followed by e. In this case, negative lookahead is a possible solution:
^(?!de)[a-z]+
This matches anything not starting with "DE" (case insensitive, without look arounds, allowing leading whitespace):
^ *+(?:[^Dd].|.[^Ee])<your regex for rest of input>
See live demo.
The possessive quantifier *+ used for whitespace prevents [^Dd] from being allowed to match a space via backtracking, making this regex hardened against leading spaces.
You can use an alternation that either excludes d and D as the first character or excludes e as the second character.
Note that the pattern [a-zA-Z][a-z] matches at least 2 characters, so will the following pattern:
^(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z]).*
^ Start of string
(?: Non capture group
[abce-zABCE-Z][a-z] Match a char a-zA-Z without d and D followed by a lowercase char a-z
| or
[a-zA-Z][a-df-z] Match a char a-zA-Z followed by a lowercase char a-z without e
) Close non capture group
.* Match 0+ times any char except a newline
Regex demo
Another option is to use word boundaries \b instead of an anchor ^
\b(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z])[a-zA-Z]*\b
Regex demo
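Both alternation-based patterns can be tried out with a short Python sketch (the test words and sentence are invented for illustration):

```python
import re

anchored = re.compile(r"^(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z]).*")
bounded = re.compile(r"\b(?:[abce-zABCE-Z][a-z]|[a-zA-Z][a-df-z])[a-zA-Z]*\b")

# Inputs starting with any casing of "de" should never match.
leaked = [s for s in ("de", "De", "dE", "DE", "dear") if anchored.match(s)]
kept = [s for s in ("da", "ae", "idea") if anchored.match(s)]

# With word boundaries, words starting with "de" are skipped inside a sentence.
words = bounded.findall("the deal is done")

print(leaked)  # []
print(kept)    # ['da', 'ae', 'idea']
print(words)   # ['the', 'is', 'done']
```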

Which would be better non-greedy regex or negated character class?

I need to match #anything_here# from the string #anything_here#dhhhd#shdjhjs#, so I used the following regex:
^#.*?#
or
^#[^#]*#
Both ways work, but I would like to know which one is the better solution: the regex with non-greedy repetition or the regex with a negated character class?
Negated character classes should usually be preferred over lazy matching, if possible.
If the regex is successful, ^#[^#]*# can match the content between #s in a single step, while ^#.*?# needs to expand for each character between #s.
When failing (for the case of no ending #) most regex engines will apply a little magic and internally treat [^#]* as [^#]*+, as there is a clear cut border between # and non-#, thus it will match to the end of the string, recognize the missing # and not backtrack, but instantly fail. .*? will expand character for character as usual.
When used in larger contexts, [^#]* will also never expand over the borders of the ending # while this is very well possible for the lazy matching. E.g. ^#[^#]*a[^#]*# won't match #bbbb#a# while ^#.*?a.*?# will.
Note that [^#] will also match newlines, while . doesn't (in most regex engines and unless used in singleline mode). You can avoid this by adding the newline character to the negation - if it is not wanted.
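The differences listed above can be reproduced with a minimal Python sketch (the multi-line sample string is an assumption added for illustration; Python's re is used without the DOTALL flag):

```python
import re

s = "#anything_here#dhhhd#shdjhjs#"
negated = re.match(r"^#[^#]*#", s).group(0)
lazy = re.match(r"^#.*?#", s).group(0)
print(negated, lazy)  # #anything_here# #anything_here#

# Larger context: the negated class cannot cross a "#", the lazy dot can.
strict = re.match(r"^#[^#]*a[^#]*#", "#bbbb#a#")
loose = re.match(r"^#.*?a.*?#", "#bbbb#a#")
print(strict)          # None
print(loose.group(0))  # #bbbb#a#

# Newlines: [^#] matches them, "." does not.
multi = "#line one\nline two#"
negated_multi = re.match(r"^#[^#]*#", multi)
lazy_multi = re.match(r"^#.*?#", multi)
print(negated_multi is not None, lazy_multi is not None)  # True False
```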
It is clear the ^#[^#]*# option is much better.
The negated character class is quantified greedily, which means the regex engine grabs 0 or more chars other than # right away, as many as possible. See this regex demo.
When you use a lazy dot matching pattern, the engine matches #, then tries to match the trailing # (skipping the .*?). It does not find the # at Index 1, so the .*? matches the a char. This .*? pattern expands as many times as there are chars other than # up to the first #.
See the lazy dot matching based pattern demo.

Correct match using RegEx but it should work without substitution

I have <autorpodpis>([^;,<\n\r]*?)\s*[;,<\n\r] to catch everything inside
<autorpodpis>_this_is_an_example_of_what_I'd_like_to_match<
If there is a space, a comma (,) or a semicolon (;), or a space before a comma or a semicolon, my RegEx catches everything, but including these characters (see my link). It works as expected.
Overall, the RegEx works fine with substitution \1 (in AutoHotKey I use $1). But I'd like to match without using substitution.
You seem to mix the terms substitution (regex based replacement operation) and capturing (storing a part of the matched value captured with a part of a pattern enclosed with a pair of unescaped parentheses inside a numbered or named stack).
If you want to just match a substring in specific context without capturing any subvalues, you might consider using lookarounds (lookbehind or lookahead).
In your case, since you need to match a string after some known string, you need a lookbehind. A lookbehind tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there.
So, you could use
pos := RegExMatch(input, "(?<=<autorpodpis>)\p{L}+(?:\s+\p{L}+)*", Res)
So, the Res should have WOJCIECH ZAŁUSKA if you supply <autorpodpis>WOJCIECH ZAŁUSKA</autorpodpis> as input.
Explanation:
(?<=<autorpodpis>) - check if there is <autorpodpis> right before the currently tested location. If there is none, fail this match, go on to the next location in string
\p{L}+ - 1+ Unicode letters
(?:\s+\p{L}+)* - 0+ sequences of 1+ whitespaces followed with 1+ Unicode letters.
However, in most cases you should use capturing instead, and always in cases like this one: the pattern inside the lookbehind is known, the lookbehind is unanchored (say, it is the first subpattern in the pattern), and you do not need overlapping matches.
The version with capturing in place:
pos := RegExMatch(input, "<autorpodpis>(\p{L}+(?:\s+\p{L}+)*)", Res)
And then Res[1] will hold the WOJCIECH ZAŁUSKA value. Capturing is in most cases (96%) faster.
Now, your regex - <autorpodpis>([^;,<\n\r]*?)\s*[;,<\n\r] - is not efficient, as [^;,<\n\r] also matches what \s matches, and \s also matches some of [;,<\n\r] (\n and \r). My regex is linear: each subsequent subpattern cannot match what the previous one matches.
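For comparison outside AutoHotKey, the lookbehind and capturing variants can be sketched in Python. Note the assumption: Python's re module has no \p{L}, so [^\W\d_] is used here as a stand-in for "a letter"; the variable names are illustrative.

```python
import re

# Stand-in for \p{L}+(?:\s+\p{L}+)* (re has no \p{L}; [^\W\d_] approximates a letter).
LETTERS = r"[^\W\d_]+(?:\s+[^\W\d_]+)*"
text = "<autorpodpis>WOJCIECH ZAŁUSKA</autorpodpis>"

# Lookbehind: the whole match is the wanted text.
with_lookbehind = re.search(r"(?<=<autorpodpis>)" + LETTERS, text).group(0)

# Capturing: the wanted text is in group 1.
with_capture = re.search(r"<autorpodpis>(" + LETTERS + r")", text).group(1)

print(with_lookbehind)  # WOJCIECH ZAŁUSKA
print(with_capture)     # WOJCIECH ZAŁUSKA
```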

how to get sub-string using regex if I specify start and end, without start characters?

I have string like this:
12abcc?p_auth=123ABC&ABC&s
The start marker is "p_auth=" and the end is the first "&" symbol.
P.S. The '&' and 'p_auth=' must not be included.
I have written this regex:
(p_auth).+?(?=&)
OK, that works well; it gets this sub-string:
p_auth=123ABC
But how do I get the string without 'p_auth'?
Use look-arounds:
(?<=p_auth=).*?(?=&)
See regex demo
The look-behind (?<=p_auth=) and the look-ahead (?=&) do not consume characters as they are zero-width assertions. They just check for the substring presence either before or after a certain subpattern.
A couple more words about (?<=p_auth=). It is a positive look-behind. Positive because it requires the pattern inside it to appear on the left, before the "main" subpattern. If the look-behind subpattern is found, the result is just "true" and the regex goes on checking the rest of the subpatterns. If not, the match fails at that position and the engine goes on looking for another match at the next index.
Here is some description from regular-expressions.info:
It [the look-behind] tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind. It doesn't match cab, but matches the b (and only the b) in bed or debt. (?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt.
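The look-around pattern and the quoted cab/bed/debt examples can be verified with a short Python sketch (Python's re is assumed here; variable names are illustrative):

```python
import re

s = "12abcc?p_auth=123ABC&ABC&s"

# Neither "p_auth=" nor "&" is consumed, so the match is just the value.
token = re.search(r"(?<=p_auth=).*?(?=&)", s).group(0)
print(token)  # 123ABC

samples = ("cab", "bed", "debt")
negative = [re.findall(r"(?<!a)b", w) for w in samples]
positive = [re.findall(r"(?<=a)b", w) for w in samples]
print(negative)  # [[], ['b'], ['b']]
print(positive)  # [['b'], [], []]
```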
In most cases, you do not really need look-arounds. In this case, you could just use
p_auth=(.*?)&
And get the first capturing group value.
The .*? pattern will look for any number of characters other than a newline, but as few as possible that are required to find a match. It is called lazy dot matching, because the ? symbol makes the * quantifier stop before the first symbol that is matched by the subsequent subpattern in the regular expression.
The .*& would match all the substring until the last & because * quantifier is greedy - it will consume as many characters it can match as possible.
See more at Repetition with Star and Plus regular-expressions.info page.
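A small Python sketch of the capturing approach, contrasting the lazy and greedy quantifiers. It assumes the literal prefix includes the = sign so that only the value lands in group 1:

```python
import re

s = "12abcc?p_auth=123ABC&ABC&s"

# Lazy: stops at the first "&".
lazy = re.search(r"p_auth=(.*?)&", s).group(1)

# Greedy: runs to the end, then backtracks to the last "&".
greedy = re.search(r"p_auth=(.*)&", s).group(1)

print(lazy)    # 123ABC
print(greedy)  # 123ABC&ABC
```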
p_auth=(.+?)(?=&)
Simply use this and grab the group 1 or capture 1.