Do not repeat placeholders in the same string regex - regex

I made a regex to validate strings that contain variable placeholders surrounded by { and }:
^(\/?(([a-zA-Z0-9\-\_]+)|(\{[a-zA-Z][a-zA-Z0-9]*\}))\/?)*$
It validates strings like test/{a}/{b} and /some-text/{a}/{a}/, and it's working fine. Here is the test: https://regex101.com/r/nP1tB2/2
Is it possible to block duplicated placeholders?
For example, in the 2nd string, {a} appears twice, and I would like to "block" it (that is, the regex should not match it).

You may use a negative lookahead to restrict the matching process:
^(?!.*{([\w-]+)}.*{\1})(\/?(([\w-]+)|(\{[a-zA-Z][a-zA-Z0-9]*\}))\/?)*$
 ^^^^^^^^^^^^^^^^^^^^^^
It means that right after the beginning of the string is detected, (?!.*{([\w-]+)}.*{\1}) checks whether there are 0+ characters other than a newline followed by a {...} substring (containing only letters, digits, underscores or hyphens) and then, somewhere further on, the very same {...} substring again (that is what the \1 backreference enforces). If such a pair is found, the whole match fails.
See the regex demo
Note that if you do not use a Unicode-aware pattern (and it is not .NET without RegexOptions.ECMAScript), \w is equal to [A-Za-z0-9_]. So, I replaced [a-zA-Z0-9\-\_] with [\w-] in your pattern. Otherwise, restore that subpattern in both the lookahead and the main pattern.
Also, [a-zA-Z] can also be expressed as [^\W\d_] or \p{L} (or even [:alpha:]) and [a-zA-Z0-9] as [^\W_] (or [:alnum:], [\p{L}\p{N}]). These subpatterns are handy if you need to make the pattern Unicode aware. A lot depends on the regex flavor.
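To make the effect of the lookahead concrete, here is a minimal Python sketch of the same check. The function name is_valid_path is made up for illustration, the braces are escaped for readability, and re.ASCII is passed on the assumption that \w should stay equal to [A-Za-z0-9_], as discussed above.

```python
import re

# Sketch only: the lookahead rejects any string in which the same {name} occurs twice.
# re.ASCII keeps \w equal to [A-Za-z0-9_] (Python 3 str patterns are Unicode-aware by default).
PATH_RE = re.compile(
    r"^(?!.*\{([\w-]+)\}.*\{\1\})"                    # no repeated placeholder anywhere
    r"(/?(([\w-]+)|(\{[a-zA-Z][a-zA-Z0-9]*\}))/?)*$",  # the original path pattern
    re.ASCII,
)

def is_valid_path(path: str) -> bool:
    """Return True when the path matches and contains no duplicated placeholder."""
    return PATH_RE.match(path) is not None

print(is_valid_path("test/{a}/{b}"))         # True
print(is_valid_path("/some-text/{a}/{a}/"))  # False
```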

Related

Regex match between if present [duplicate]

I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.
Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.
In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
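The difference between the bracket expression and the group can be checked quickly. This is only an illustrative Python sketch using the two sample strings from the question; the variable names are arbitrary.

```python
import re

urls = ["index.php?test=1&list=UL", "index.php?list=UL&more=1"]

# Inside [...] the characters &, | and $ are literals, so "end of string" is never tested.
bracket = re.compile(r"[&|\?]list=.*?([&|$])")
# Inside a group, | is alternation and $ is the end-of-string anchor.
grouped = re.compile(r"(&|\?)list=.*?(&|$)")

bracket_hits = [bool(bracket.search(u)) for u in urls]
grouped_hits = [bool(grouped.search(u)) for u in urls]

print(bracket_hits)  # [False, True]
print(grouped_hits)  # [True, True]

# [$] matches a literal dollar sign, not the end of the string.
print(bool(re.search(r"[$]", "cost: $5")))  # True
```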
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
If you need to check for the "absolute", unambiguous string end anchor, you need to remember that in various regex flavors, it is expressed with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Go, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Android
[&?]list=(.*?)(?=&|\Z) - OK for Python
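As a small illustration of why the "absolute" anchor can matter, the following Python sketch compares $ and \Z on an input that ends with a newline. The sample string is an assumption made up for the demonstration.

```python
import re

s = "index.php?test=1&list=UL\n"  # note the trailing newline

# In Python, $ (without re.MULTILINE) also matches just before a final newline.
with_dollar = re.search(r"[&?]list=(.*?)(?=&|$)", s)
# \Z matches only at the very end of the string, and . cannot cross the newline.
with_abs_end = re.search(r"[&?]list=(.*?)(?=&|\Z)", s)

print(with_dollar.group(1))  # UL
print(with_abs_end)          # None
```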
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases, it is recommended to use a negated character class (or a negated bracket expression in POSIX terms):
[&?]list=([^&]*)
See demo. Details
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
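A possible way to use this pattern from code is sketched below in Python. The helper name list_value is hypothetical and only serves the example.

```python
import re

# Negated character class: capture everything after list= up to the next & (or the end).
LIST_RE = re.compile(r"[&?]list=([^&]*)")

def list_value(url: str):
    """Return the value of the list parameter, or None when it is absent."""
    m = LIST_RE.search(url)
    return m.group(1) if m else None

print(list_value("index.php?test=1&list=UL"))  # UL
print(list_value("index.php?list=UL&more=1"))  # UL
print(list_value("index.php?test=1"))          # None
```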
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only return true or false depending on whether their patterns match. They are crucial when consecutive matches that may start and end with the same char are expected (see the original pattern, it may match a string starting and ending with &). Although it is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negative character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to the matching procedure. The OP solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
See this regex demo and another one here.
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).
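The consecutive-match scenario mentioned above can be reproduced with a short Python sketch. The input string with three list parameters is an assumption chosen to force adjacent matches that share the & delimiter.

```python
import re

s = "?list=A&list=B&list=C"

# Consuming the trailing & "eats" the delimiter that the next match needs to start with.
consuming = re.findall(r"[&?]list=(.*?)(?:&|$)", s)
# Lookaheads only test for the delimiter, so it stays available for the next match.
positive = re.findall(r"[&?]list=(.*?)(?=&|$)", s)
negative = re.findall(r"[&?]list=(.*?)(?![^&])", s)

print(consuming)  # ['A', 'C']
print(positive)   # ['A', 'B', 'C']
print(negative)   # ['A', 'B', 'C']
```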

This regex to match a word surrounded by {} does not work

So here's my regex to match a word after "define" or "define:"
((?<=define |define: )\w+)
That part works well and all. But when I add the part where it also should match word between {} if it can, it matches everything.
((?<=define |define: )\w+)|([^{][A-Z]+[^}])
The regex with the examples
The thing that I noticed is that when I add ^ at first [{] then it ruins everything and I don't understand why.
Why does using [^{] not work?
By using [^{], your regex becomes:
[^{][A-Z]+[^}]
In words, this translates to:
character that's not a {
a bunch of letters
character that's not a }
Note how nothing in your regex enforces the idea that the "a bunch of letters" part has to be between {}s. It just says that it has to be after a character that is not {, and before a character that is not }. By this logic, even something like ABC would match because A is not {, B is the bunch of letters, and C is not }.
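This behavior is easy to confirm. Below is a small Python sketch with made-up sample inputs; note that the neighboring characters are consumed and become part of the match.

```python
import re

loose = re.compile(r"[^{][A-Z]+[^}]")

# "A" is the non-{ character, "B" the letters, "C" the non-} character.
abc_matches = loose.findall("ABC")
# The surrounding spaces are consumed as the "not {" and "not }" characters.
hello_matches = loose.findall("say HELLO now")

print(abc_matches)    # ['ABC']
print(hello_matches)  # [' HELLO ']
```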
How to match a word between {}?
You can use this regex:
{([A-Z]+)}
And get group 1.
I don't think that you should combine this with the regex that matches a word after define. You should use 2 separate regexes because these are two completely different things.
So split it into two regexes:
(?<=define |define: )\w+
and
{([A-Z]+)}
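Here is a rough Python sketch of running the two regexes separately. One adaptation is an assumption on my side: Python's re module requires fixed-width lookbehinds, so (?<=define |define: ) is split into two lookbehinds of different widths. The sample text is invented for the demonstration.

```python
import re

text = "define FOO and define: BAR then {BAZ}"

# Python's re needs fixed-width lookbehinds, so the two prefixes get one lookbehind each.
after_define = re.compile(r"(?:(?<=define )|(?<=define: ))\w+")
# Braces are escaped for clarity; group 1 holds the word between them.
in_braces = re.compile(r"\{([A-Z]+)\}")

defined_words = after_define.findall(text)
braced_words = in_braces.findall(text)

print(defined_words)  # ['FOO', 'BAR']
print(braced_words)   # ['BAZ']
```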
You are using negated character classes the way we would use positive lookbehind (?<=) and positive lookahead (?=). They are fundamentally different and, as opposed to lookbehind or lookahead, character classes consume characters.
Hence:
[^{][A-Z] matches a capital letter that is preceded by a character other than {.
[A-Z][^}] matches a capital letter that is followed by a character other than }.
So if you try to match the letters in {OO} with the regex [^{][A-Z]+[^}], it is totally normal that your regex won't match anything because you have two letters, one preceded by a {, the other followed by a }.

RegEx expression not allowing only spaces?

I have this regEx expression which allows only spaces, letters and dashes. I'd like to modify it so it wouldn't allow ONLY spaces too. Can someone help me ?
/^([A-zăâîșțĂÂÎȘȚ-\s])+$/
You can use a negative lookahead to restrict this generic pattern:
/^(?!\s+$)[A-Za-zăâîșțĂÂÎȘȚ\s-]+$/
  ^^^^^^^^
See the regex demo
The (?!\s+$) lookahead is executed once at the very beginning and fails the match if the string consists solely of 1 or more whitespace characters up to its end.
Also, your regex contained the classic [A-z] issue: that range matches more than just ASCII letters, so you need to replace it with [A-Za-z] (or just [a-z] and use the /i case insensitive modifier).
Also, the - inside a character class is usually placed at the end so as not to escape it, and it will be parsed as a literal hyphen (however, you might want to escape it if another developer will have to update this pattern by adding more symbols to the character class).
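To see what [A-z] lets through, the extra characters in that range can be listed with a few lines of Python (purely illustrative):

```python
import re

# Characters between "A" and "z" in ASCII order that are not letters.
extra = [chr(c) for c in range(ord("A"), ord("z") + 1) if not chr(c).isalpha()]
print(extra)  # ['[', '\\', ']', '^', '_', '`']

# [A-z] accepts an underscore, [A-Za-z] does not.
print(bool(re.fullmatch(r"[A-z]+", "a_b")))     # True
print(bool(re.fullmatch(r"[A-Za-z]+", "a_b")))  # False
```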
And just in case this is a regex engine that does not support lookarounds:
^[A-Za-zăâîșțĂÂÎȘȚ\s-]*[A-Za-zăâîșțĂÂÎȘȚ-][A-Za-zăâîșțĂÂÎȘȚ\s-]*$
It requires at least 1 non-space character from the allowed set (also matching 1 obligatory symbol).
Another regex demo
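Both variants can be compared side by side. The Python sketch below builds them from a shared LETTERS string (a name introduced only for this example) and uses ASCII test inputs that I made up.

```python
import re

LETTERS = "A-Za-zăâîșțĂÂÎȘȚ"  # allowed letters, copied from the pattern above

# Variant 1: a negative lookahead rejects whitespace-only strings.
with_lookahead = re.compile(rf"^(?!\s+$)[{LETTERS}\s-]+$")
# Variant 2: no lookaround, at least one allowed non-space character is required.
without_lookaround = re.compile(rf"^[{LETTERS}\s-]*[{LETTERS}-][{LETTERS}\s-]*$")

def check(value: str):
    """Return the verdicts of both variants for the same input."""
    return bool(with_lookahead.match(value)), bool(without_lookaround.match(value))

print(check("Ana-Maria Ionescu"))  # (True, True)
print(check("   "))                # (False, False)
print(check("Ana1"))               # (False, False)
```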

Correct match using RegEx but it should work without substitution

I have <autorpodpis>([^;,<\n\r]*?)\s*[;,<\n\r] to catch everything inside
<autorpodpis>_this_is_an_example_of_what_I'd_like_to_match< If there is a space, a comma (,) or a semicolon (;), or a space before a comma or a semicolon, my RegEx catches everything, but it also includes these characters in the match (see my link). It works as expected.
Overall, the RegEx works fine with substitution \1 (or $1, which is what I use in AutoHotKey). But I'd like to match without using substitution.
You seem to mix the terms substitution (regex based replacement operation) and capturing (storing a part of the matched value captured with a part of a pattern enclosed with a pair of unescaped parentheses inside a numbered or named stack).
If you want to just match a substring in specific context without capturing any subvalues, you might consider using lookarounds (lookbehind or lookahead).
In your case, since you need to match a string after some known string, you need a lookbehind. A lookbehind tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there.
So, you could use
pos := RegExMatch(input, "(?<=<autorpodpis>)\p{L}+(?:\s+\p{L}+)*", Res)
So, the Res should have WOJCIECH ZAŁUSKA if you supply <autorpodpis>WOJCIECH ZAŁUSKA</autorpodpis> as input.
Explanation:
(?<=<autorpodpis>) - check if there is <autorpodpis> right before the currently tested location. If there is none, fail this match, go on to the next location in string
\p{L}+ - 1+ Unicode letters
(?:\s+\p{L}+)* - 0+ sequences of 1+ whitespaces followed with 1+ Unicode letters.
However, in most cases, and always in cases like this one, where the pattern in the lookbehind is known, the lookbehind is unanchored (say, it is the first subpattern in the pattern), and you do not need overlapping matches, use capturing instead.
The version with capturing in place:
pos := RegExMatch(input, "<autorpodpis>(\p{L}+(?:\s+\p{L}+)*)", Res)
And then Res[1] will hold the WOJCIECH ZAŁUSKA value. Capturing is in most cases (96%) faster.
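For comparison outside AutoHotkey, the same two approaches can be sketched in Python. This is an approximation: Python's re module does not support \p{L}, so [^\W\d_] is used as a stand-in for "any letter", and the variable names are arbitrary.

```python
import re

text = "<autorpodpis>WOJCIECH ZAŁUSKA</autorpodpis>"
LETTER = r"[^\W\d_]"  # stand-in for \p{L}, which Python's re does not support

# Lookbehind: the whole match is the name, nothing is captured.
lookbehind = re.search(rf"(?<=<autorpodpis>){LETTER}+(?:\s+{LETTER}+)*", text)
# Capturing: the tag is part of the match, the name is in group 1.
capture = re.search(rf"<autorpodpis>({LETTER}+(?:\s+{LETTER}+)*)", text)

print(lookbehind.group(0))  # WOJCIECH ZAŁUSKA
print(capture.group(1))     # WOJCIECH ZAŁUSKA
```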
Now, your regex - <autorpodpis>([^;,<\n\r]*?)\s*[;,<\n\r] - is not efficient because [^;,<\n\r] also matches what \s matches, and \s in turn matches some of the chars in [;,<\n\r] (\n and \r). My regex is linear: each subsequent subpattern cannot match what the previous one matches.