Which would be better non-greedy regex or negated character class? - regex

I need to match #anything_here# from the string #anything_here#dhhhd#shdjhjs#. So I used the following regex:
^#.*?#
or
^#[^#]*#
Both work, but I would like to know which is the better solution: a regex with non-greedy repetition or a regex with a negated character class?

Negated character classes should usually be preferred over lazy matching, if possible.
If the regex is successful, ^#[^#]*# can match the content between #s in a single step, while ^#.*?# needs to expand for each character between #s.
When failing (for the case of no ending #), most regex engines optimize internally and treat [^#]* as the possessive [^#]*+, because there is a clear-cut border between # and non-#. The class matches to the end of the string, the engine recognizes the missing # and fails immediately instead of backtracking. .*? expands character by character as usual.
When used in larger contexts, [^#]* will also never expand over the borders of the ending # while this is very well possible for the lazy matching. E.g. ^#[^#]*a[^#]*# won't match #bbbb#a# while ^#.*?a.*?# will.
Note that [^#] will also match newlines, while . doesn't (in most regex engines and unless used in singleline mode). You can avoid this by adding the newline character to the negation - if it is not wanted.
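The points above can be checked with a short sketch. It uses Python's re module as an assumption (the question names no engine), and the variable names are only illustrative. It compares both patterns on the question's input, on the #bbbb#a# example, and on a string containing a newline.

```python
import re

s = "#anything_here#dhhhd#shdjhjs#"

lazy = re.compile(r"^#.*?#")
negated = re.compile(r"^#[^#]*#")

# Both patterns return the same first field on the question's input.
lazy_match = lazy.match(s).group()
negated_match = negated.match(s).group()

# In a larger pattern the lazy dot may cross a '#'; the negated class cannot.
crossing_lazy = re.match(r"^#.*?a.*?#", "#bbbb#a#")
crossing_negated = re.match(r"^#[^#]*a[^#]*#", "#bbbb#a#")

# Without re.DOTALL, '.' stops at a newline while [^#] does not.
multiline = "#ab\ncd#"
dot_over_newline = lazy.match(multiline)
class_over_newline = negated.match(multiline)
# Adding \n to the negation restores the single-line behaviour.
class_no_newline = re.match(r"^#[^#\n]*#", multiline)

print(lazy_match, negated_match)
print(crossing_lazy is not None, crossing_negated is not None)
print(dot_over_newline is not None, class_over_newline is not None,
      class_no_newline is not None)
```

Python does not expose the number of matching steps, so this only shows the differences in what is matched, not the performance difference.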

It is clear the ^#[^#]*# option is much better.
The negated character class is quantified greedily, which means the regex engine grabs 0 or more chars other than # right away, as many as possible.
When you use a lazy dot matching pattern, the engine matches #, then tries to match the trailing # (skipping the .*?). It does not find the # at index 1, so the .*? matches one char and the engine tries the trailing # again. The .*? pattern expands this way once for every char other than # up to the first #.


Regex Pattern to Match except when the clause enclosed by the tilde (~) on both sides

I want to extract matches of the clauses match-this that is enclosed with anything other than the tilde (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch these ~match-this~ occurrences, but when I tried its negation, (?!(~match-this)), it only caught empty matches.
When I tried the pattern [^~]match-this[^~], it caught only one match (the 2nd match from above). When I added an asterisk to either negated tilde class, [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk on both, it caught every match-this, including those enclosed by tildes (~).
Is it possible to achieve this with only one regex test, or does it need more?
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char, as long as match-this does not start at that position
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or #, as long as match-this does not start at that position
) Close group 1
See a regex demo and a Python demo.
Example
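import re

# Alternative 1 consumes ~...~ spans; alternative 2 captures the wanted text in group 1.
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
# findall returns group 1 for every match; drop the empty strings left by alternative 1.
res = [m for m in re.findall(pattern, s) if m]
print(res)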
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
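As a minimal sketch of the "discard what is not captured" idea, assuming Python's re module and the sample string from the question, the following keeps only the matches where capture group 1 took part. The variable names are illustrative.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = re.compile(r"~match-this~|(\bmatch-this\b)")

# group(1) is None whenever the first alternative (~match-this~) matched,
# so those matches are skipped and only the captured ones are kept.
kept_starts = [m.start(1) for m in pattern.finditer(s) if m.group(1) is not None]
kept_texts = [m.group(1) for m in pattern.finditer(s) if m.group(1) is not None]

print(kept_starts)
print(kept_texts)
```

The start offsets make it easy to see which of the seven occurrences of match-this were skipped because they sit inside a ~...~ pair.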
For this you need both kinds of lookarounds. The following matches the 5 spots you want; there is a reason why it only works this way and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts or it must be a character not matching this. We could do the same for ending texts, but this is a dead end anyway.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
It's the difference between telling the engine to "not match" a character and telling it to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What puzzles me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing all the matches, since they also come with starting positions and lengths: a first match at position 0 for 10 characters means you look at input text positions -1 and 10.
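The suggestion to read the neighbouring characters from the match positions can be sketched as follows. This assumes Python's re module, which supports the fixed-width lookbehinds used here; the helper name neighbours is hypothetical.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = re.compile(
    r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"
)


def neighbours(text, m):
    """Return the characters just before and just after a match ('' at the text edges)."""
    before = text[m.start() - 1] if m.start() > 0 else ""
    after = text[m.end()] if m.end() < len(text) else ""
    return before, after


matches = list(pattern.finditer(s))
spans = [m.span() for m in matches]
context = [neighbours(s, m) for m in matches]

print(spans)
print(context)
```

Because lookarounds do not consume the tilde, a character that ends one candidate can still be inspected as the start of the next one.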

Is there a way to optimize this case of catastrophic regex backtracking?

So I have come up with the following regex:
([^\s\\]+(?:\\.[^\s\\]*)*)(?:.*?)(\S+\.php\b)
Test link: https://regex101.com/r/NV6Bk4/4
It matches the binary and the script name of a command line. Example:
php --strict myscript.php --arg=value
matches php and myscript.php in group(1) and group(2).
The problem is this part in the middle: (?:.*?), it leads to combinatorial explosion, slowing down the regex for large inputs. Is there a way to optimize this? Since there is no pattern I can't think of anything.
To clarify, the rule that I'm trying to match is:
Match any path to a command, possibly containing escaped whitespace. Ignore any arguments following it. Match a file ending in .php, ignore anything that follows it. The command should be in group(1), the filename should be in group(2).
You may use the following "fix" with Matcher#matches():
([^\s\\]*+(?:\\.[^\s\\]*)*).*?(\S+\.php\b).*
In Java
String regex = "([^\\s\\\\]*+(?:\\\\.[^\\s\\\\]*)*).*?(\\S+\\.php\\b).*";
See the regex demo. Note that a literal . outside of a character class must be escaped. Compile the pattern with Pattern.DOTALL if the string may have line breaks.
As you see, the .*? part matches any char, the (?:\\.[^\s\\]*)* before it can match zero times (so it is effectively optional), and the pattern adjoining .*? on the left is then [^\s\\]+, which can match the same chars as .*?. That means the regex engine may backtrack into the first subpattern, which creates a lot of ways to match the string. This is commonly known as catastrophic backtracking.
If you disallow backtracking into the first negated character class with the *+ possessive quantifier, it will already work much more reliably.
Add .* at the end to make it work with .matches() as this method requires a full string match.
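For readers who want to try the same idea outside Java, here is a hedged Python sketch. Python's re only gained possessive quantifiers in 3.11, so this version swaps in a different but equivalent technique: a lookahead that captures the run, followed by a backreference, which also prevents backtracking into the first class. The helper capture shifts the group numbers (command is group 1, script is group 3), and the names CMD_AND_SCRIPT and split_command are hypothetical.

```python
import re

# (?=([^\s\\]*))\2 stands in for the possessive [^\s\\]*+ : the lookahead
# captures the longest run once and the backreference consumes exactly that text.
# Group 2 is only the helper capture; group 1 is the command, group 3 the script.
CMD_AND_SCRIPT = re.compile(
    r"((?=([^\s\\]*))\2(?:\\.[^\s\\]*)*).*?(\S+\.php\b).*",
    re.DOTALL,
)


def split_command(line):
    """Return (command, script) or None when the line has no .php file."""
    m = CMD_AND_SCRIPT.fullmatch(line)  # full-string match, like Matcher#matches()
    return (m.group(1), m.group(3)) if m else None


print(split_command("php --strict myscript.php --arg=value"))
print(split_command(r"/usr/my\ bin/php -f script.php"))
print(split_command("php --strict notes.txt"))
```

On Python 3.11 or later the original `*+` form should work as written in the answer, without the helper group.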

Match asterisk followed by space in PCRE

I'm just having trouble figuring out how to regex properly. What I need is to match an asterisk followed by a space followed by any amount of characters that aren't \n. (Similar to reddit list formatting)
Example:
* Test
* Test2
* Test3
The closest I got was this, but it wasn't working.
/^[*][ ](.*?)/s
Can anyone familiar with PCRE help me?
You should not use a lazy dot pattern at the end of the regex because it will never match a single char: the engine first tries to skip it, and since nothing follows it in the pattern, .*? matches the empty string.
Use the greedy dot pattern:
^\* (.*)
See the regex demo
Other notes: you may use \h to match any horizontal whitespace instead of the regular space in the pattern. To match the start of each line with ^, use the m modifier. Only use the s modifier if you need . to match any char including a newline (and carriage return, depending on the PCRE verbs that are active).
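The difference between the lazy and the greedy dot at the end of a pattern can be reproduced with a small sketch. It uses Python's re module as a stand-in for PCRE (an assumption; re.MULTILINE and re.DOTALL correspond to the m and s modifiers, and \h is not available in Python).

```python
import re

text = "* Test\n* Test2\n* Test3"

# Greedy dot with the multiline flag: each list item is captured up to the line end.
greedy = re.findall(r"^\* (.*)", text, flags=re.MULTILINE)

# Lazy dot at the very end of the pattern: it is satisfied with the empty string.
lazy = re.findall(r"^\* (.*?)", text, flags=re.MULTILINE | re.DOTALL)

print(greedy)
print(lazy)
```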

RegEx expression not allowing only spaces?

I have this regex, which allows only spaces, letters and dashes. I'd like to modify it so that it doesn't allow a string made ONLY of spaces. Can someone help me?
/^([A-zăâîșțĂÂÎȘȚ-\s])+$/
You can use a negative lookahead to restrict this generic pattern:
/^(?!\s+$)[A-Za-zăâîșțĂÂÎȘȚ\s-]+$/
  ^^^^^^^^
See the regex demo
The (?!\s+$) lookahead is executed once at the very beginning and fails the match if the string consists of nothing but 1 or more whitespace chars up to the end of the string.
Also, your regex contained a classical issue of [A-z] that matches more than just ASCII letters, you need to replace this with [A-Za-z] (or just [a-z] and use the /i case insensitive modifier).
Also, the - inside a character class is usually placed at the end so as not to escape it, and it will be parsed as a literal hyphen (however, you might want to escape it if another developer will have to update this pattern by adding more symbols to the character class).
And just in case this is a regex engine that does not support lookarounds:
^[A-Za-zăâîșțĂÂÎȘȚ\s-]*[A-Za-zăâîșțĂÂÎȘȚ-][A-Za-zăâîșțĂÂÎȘȚ\s-]*$
It requires at least 1 non-space character from the allowed set (also matching 1 obligatory symbol).
Another regex demo
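Both variants can be compared on a few inputs with a short sketch. It assumes Python's re module rather than the JavaScript-style literal in the question, and the sample strings are made up for illustration. The last line also shows the [A-z] issue mentioned above.

```python
import re

with_lookahead = re.compile(r"^(?!\s+$)[A-Za-zăâîșțĂÂÎȘȚ\s-]+$")
without_lookahead = re.compile(
    r"^[A-Za-zăâîșțĂÂÎȘȚ\s-]*[A-Za-zăâîșțĂÂÎȘȚ-][A-Za-zăâîșțĂÂÎȘȚ\s-]*$"
)

# Illustrative inputs: two valid names, spaces only, empty, and a disallowed digit.
samples = ["Ana-Maria", "limba română", "   ", "", "abc1"]
results = {
    s: (bool(with_lookahead.match(s)), bool(without_lookahead.match(s)))
    for s in samples
}

# [A-z] spans the ASCII range 65..122, which also contains [ \ ] ^ _ and `.
a_to_z_matches_underscore = re.fullmatch(r"[A-z]", "_") is not None

for s, outcome in results.items():
    print(repr(s), outcome)
print(a_to_z_matches_underscore)
```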

Do not repeat placeholders in the same string regex

I made a regex to validate arrays that contain variable placeholders surrounded by { and }:
^(\/?(([a-zA-Z0-9\-\_]+)|(\{[a-zA-Z][a-zA-Z0-9]*\}))\/?)*$
It will validate strings like test/{a}/{b} and /some-text/{a}/{a}/ and it's working fine. Here is the test: https://regex101.com/r/nP1tB2/2
Is it possible to block duplicated placeholders?
For example, in the 2nd string, {a} appears twice, but I would like to "block" (regex that doesn't match) it.
You may use a negative lookahead to restrict the matching process:
^(?!.*{([\w-]+)}.*{\1})(\/?(([\w-]+)|(\{[a-zA-Z][a-zA-Z0-9]*\}))\/?)*$
 ^^^^^^^^^^^^^^^^^^^^^^
It means that right after the beginning of the string is detected, (?!.*{([\w-]+)}.*{\1}) checks whether there are 0+ characters other than a newline, followed by a {...} substring (containing only letters, digits, underscores or hyphens), followed later by the same {...} substring again (the \1 backreference repeats the captured name). If that is found, the whole match fails.
See the regex demo
Note that \w is equal to [A-Za-z0-9_] unless the pattern is Unicode aware (as it is by default in .NET, unless RegexOptions.ECMAScript is set). That is why I replaced [a-zA-Z0-9\-\_] with [\w-] in your pattern. Otherwise, restore that subpattern in both the lookahead and the main pattern.
Also, [a-zA-Z] can also be expressed as [^\W\d_] or \p{L} (or even [[:alpha:]]) and [a-zA-Z0-9] as [^\W_] (or [[:alnum:]], [\p{L}\p{N}]). These subpatterns are handy if you need to make the pattern Unicode aware. A lot depends on the regex flavor.
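A minimal sketch of the duplicate check, assuming Python's re module: the braces in the lookahead are escaped here to keep them unambiguous, re.ASCII keeps \w equal to [A-Za-z0-9_], and the names NO_DUPLICATES and is_valid are hypothetical.

```python
import re

NO_DUPLICATES = re.compile(
    r"^(?!.*\{([\w-]+)\}.*\{\1\})"                      # reject a repeated {name}
    r"(/?(([\w-]+)|(\{[a-zA-Z][a-zA-Z0-9]*\}))/?)*$",   # original path grammar
    flags=re.ASCII,
)


def is_valid(path):
    """True when the path fits the grammar and no placeholder occurs twice."""
    return NO_DUPLICATES.match(path) is not None


for path in ["test/{a}/{b}", "/some-text/{a}/{a}/", "/{ab}/{a}", "/x/{a}/y/{a}"]:
    print(path, is_valid(path))
```

The /{ab}/{a} case shows that the backreference compares whole placeholder names, so a name that is merely a prefix of another one is not reported as a duplicate.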