? character in a regular expression [duplicate] - regex

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
I have the following regex :
.*(?:(?:(?<!a)cc|string).*number).*
I am trying to understand what the ? at the beginning of the string between parentheses means. I know that a? means the preceding character 'a' can occur zero or one time. But what does ? mean when it appears at the beginning of a group?

The answer requires a little history lesson. When Larry Wall wanted to add new features to regexes in Perl, he couldn't just change the meaning of existing metacharacters, or assign special meanings to characters that didn't have them. That would have broken a lot of regexes that had been working. Instead, he had to look for character sequences that would never appear in a regex.
Originally there was only one kind of group: what we now call capturing groups. The opening parenthesis was a metacharacter, so it would make no sense to follow it with a quantifier. You could match a literal open-paren zero or one time with \(?, or you could match (and capture) a literal question mark with (\?), but if you tried to use (? in a regex it would be rejected as a syntax error.
Larry changed the rule so (? could appear in a regex, but it must form the beginning of a special-group construct, which requires at least one more character. So, to answer your question, the string doesn't start with ?. The sequence (?: forms a single token, representing the beginning of a non-capturing group. We also have (?= and (?! for positive and negative lookaheads, (?<= and (?<! for lookbehinds, and so on.

(?:...) is a non-capturing group. It groups part of the pattern for matching only and does not capture anything.
(?<!...) is a negative lookbehind.
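To see both constructs in action, here is a minimal Python sketch that runs the regex from the question through the standard re module. The helper name matches and the sample strings are made up for illustration.

```python
import re

# The regex from the question, unchanged.
PATTERN = re.compile(r".*(?:(?:(?<!a)cc|string).*number).*")


def matches(text: str) -> bool:
    """Return True if the whole text matches the question's regex."""
    return PATTERN.fullmatch(text) is not None


# Non-capturing groups do not create numbered groups.
print(PATTERN.groups)                 # count of capturing groups in the pattern

print(matches("xcc then number"))     # "cc" is not preceded by "a"
print(matches("acc then number"))     # the only "cc" is preceded by "a"
print(matches("a string, a number"))  # the "string" alternative is used
```

Because every group in the pattern starts with (?: or (?<!, the compiled pattern reports zero capturing groups, and the negative lookbehind is what rejects the second sample.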


How to build a regular expression which prohibits hyphens from appearing at the start and end of a string? [duplicate]

This question already has answers here:
RegEx for allowing alphanumeric at the starting and hyphen thereafter
(4 answers)
Closed 5 years ago.
I want to build a regular expression which only matches [A-Za-z0-9\-] with an additional rule that hyphens (-) are not allowed to appear at the start and at the end.
For example:
my-site is matched.
m is matched.
mysite- is not matched.
-mysite is not matched.
Currently, I've come up with ^[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]+$.
But this doesn't match m.
How can I change my regular expression so that it fits my needs?
Use lookarounds:
^(?!-)[A-Za-z0-9-]*(?<!-)$
This works because lookarounds don't consume input, so the lookahead and the lookbehind can both assert on the same character, which is what lets the single-character input m match.
Note that you don't need to escape the dash within a character class if it's the first or last character.
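The four examples from the question can be checked with a short Python sketch. The names HYPHEN_RULE and is_valid are hypothetical, and Python's re module is assumed as the engine.

```python
import re

# The lookaround-based regex from the answer, unchanged.
HYPHEN_RULE = re.compile(r"^(?!-)[A-Za-z0-9-]*(?<!-)$")


def is_valid(s: str) -> bool:
    """Return True if s uses only allowed characters and has no edge hyphen."""
    return HYPHEN_RULE.fullmatch(s) is not None


for candidate in ["my-site", "m", "mysite-", "-mysite", ""]:
    print(repr(candidate), is_valid(candidate))
```

One thing the sketch makes visible: because the character class is quantified with *, the empty string is also accepted. If at least one character is required, replacing * with + should be enough.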

The use of ".*" in regex for password validation [duplicate]

I came across this regex used for password validation:
(?=.*[a-z])(?=.*[A-Z])(?=.*[\d])(?=.*[^a-zA-Z\d])(?=\S+$).{8,}
There are only two things that are unclear to me about this regex:
What is .* used for, and why doesn't this regex work without it?
What is the difference or benefit of using [\d] instead of \d, given that the regex works just fine in both cases?
.* matches any sequence of characters: . matches any character (other than newline, which is not relevant here) and * matches zero or more of the preceding pattern. It is used in the lookaheads to search for a match anywhere in the password. Lookaheads don't consume input, so every one of them is tested at the same position, the start of the password. Without .*, each lookahead would test only the first character, which would have to be a lowercase letter, an uppercase letter, a digit, and a symbol all at once, so the regex could never match. With .*, the password must contain at least one of each, but they can be anywhere in the password.
There's no difference between \d and [\d]. Whoever wrote this might have used the brackets out of habit, or perhaps to make it easier to add other characters to the character class later.
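A small Python sketch can confirm the role of .* in the lookaheads. The sample passwords and the names PASSWORD, NO_DOT_STAR, and is_strong are invented for illustration.

```python
import re

# The password regex from the question, unchanged.
PASSWORD = re.compile(
    r"(?=.*[a-z])(?=.*[A-Z])(?=.*[\d])(?=.*[^a-zA-Z\d])(?=\S+$).{8,}"
)

# Two of the lookaheads with .* removed: both now test the first character.
NO_DOT_STAR = re.compile(r"(?=[a-z])(?=[A-Z]).{8,}")


def is_strong(password: str) -> bool:
    """Return True if the whole password satisfies every lookahead."""
    return PASSWORD.fullmatch(password) is not None


print(is_strong("Passw0rd!"))   # all four character kinds, no whitespace
print(is_strong("0!dwssaP"))    # same kinds in a different order
print(is_strong("password1!"))  # no uppercase letter
print(is_strong("Pass w0rd!"))  # contains whitespace

# No first character is both lowercase and uppercase, so this never matches.
print(NO_DOT_STAR.match("aBcdefgh"))
print(NO_DOT_STAR.match("Abcdefgh"))

# [\d] and \d find the same characters.
print(re.findall(r"[\d]", "a1b2") == re.findall(r"\d", "a1b2"))
```

The order of the character kinds does not matter for PASSWORD, which is exactly what the .* inside each lookahead buys.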

Difference between "(\S+)\.|" and "(\S+) |" in Perl [duplicate]

I have no idea what the difference between (\S+) | and (\S+)\.| is.
(\S+)\.| will match and capture any number (one or more) of non-space characters, followed by a dot character.
(\S+) | will match and capture any number (one or more) of non-space characters, followed by a space character (assuming the regular expression isn't modified with a /x flag).
In both cases, these constructs appear to be one component of an alternation.
Breaking it down:
(....) : Group and capture.
\S : Non-space character.
+ : One or more.
\. : A dot character (without the backslash escape, the dot has special meaning).
(space) : Just an ordinary single space.
| : Alternation (similar to logical or).
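The difference between the two fragments shows up clearly on a sample string. The following Python sketch uses the re module rather than Perl, with re.VERBOSE standing in for Perl's /x flag; the sample text is invented.

```python
import re

text = "foo.bar baz"

# (\S+)\. : non-space characters followed by a literal dot.
dot = re.search(r"(\S+)\.", text)

# (\S+) followed by a literal space.
space = re.search(r"(\S+) ", text)

# With the verbose flag, the unescaped space in the pattern is ignored.
verbose = re.search(r"(\S+) ", text, re.VERBOSE)

print(dot.group(1))             # foo
print(space.group(1))           # foo.bar
print(repr(space.group(0)))     # 'foo.bar '
print(repr(verbose.group(0)))   # 'foo.bar'
```

In the first search the greedy \S+ initially grabs foo.bar and then backtracks until a dot follows it, which is why only foo is captured.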
See perlretut for a crash course in Perl's regular expressions. Also perlintro is a good starting point for learning Perl, and perlre is the canonical explanation of Perl's regular expressions. There are many other useful documents in Perl's documentation, but these would get you moving in the right direction.
If you want to learn everything there was to know in 2005 about common regular expression flavors, Mastering Regular Expressions, 3rd Edition is unparalleled. And despite being a few years old, it's still one of the best resources anywhere on regular expressions.

How do regex positive look-behinds work?

I have been working through old questions on Stack Overflow to improve my regex knowledge. Since I have a basic knowledge of regex, most of them were easy, but the regex in this question is tough.
It asks for a regex that extracts from this kind of string ou=persons,ou=(.*),dc=company,dc=org the last string immediately preceded by a comma not followed by (.*). In this case, this should give dc=company,dc=org.
The solution is (?<=,(?!.*\Q(.*)\E)).* but I cannot understand its flow. I understood the (?!.*\Q(.*)\E) portion, but the rest is still a mystery to me, especially ?<=, which is a positive look-behind. Does it search from the end of the string? Can anyone explain it to me like I am a 7-year-old kid? And please note that http://regex101.com/ is not helping.
The look-behind portion of the regex (?<=,(?!.*\Q(.*)\E)).* works like this:
1. Start at the beginning of the string, at the first character.
2. Can we match the thing we are looking for, ,(?!.*\Q(.*)\E), immediately before the current position? That is, is the previous character a comma, with no literal (.*) anywhere after it?
3. If we can't: move forward one character, go to step 2, and check again.
4. If we can: match all the remaining characters with .* (or, in general, try to match the rest of the regex from here).
For a wordier explanation, consider reading Lookahead and Lookbehind Zero-Length Assertions.
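The left-to-right scan described in the steps above can be tried in Python, with two adaptations that are assumptions of this sketch rather than part of the original solution. Python's re module has no \Q...\E, so re.escape produces the literal (.*) instead. The negative lookahead is also moved out of the look-behind and placed right after it; this should be equivalent, because the lookahead is evaluated at the position just after the comma either way.

```python
import re

# Stand-in for \Q(.*)\E: the four characters "(.*)" matched literally.
LITERAL = re.escape("(.*)")

# (?<=,)   the previous character must be a comma
# (?!...)  no literal "(.*)" may appear anywhere after this position
# .*       then take the rest of the string
PATTERN = re.compile(r"(?<=,)(?!.*" + LITERAL + r").*")

dn = "ou=persons,ou=(.*),dc=company,dc=org"
m = PATTERN.search(dn)

print(m.group())  # dc=company,dc=org
print(m.start())  # 19
```

The search starts at index 0 and moves forward. The position after the first comma is rejected because (.*) still lies ahead, and the position after the second comma (index 19) is the first one where both assertions hold.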
A lookbehind allows you to specify a context just before the actual match.
You can say ,(dc=) and only return the capture group, or ,\Kdc=, or (?<=,)dc= to return the match on dc= but require that the comma is present just before the match.
The facility also allows for multiple lookbehinds, so in an engine that supports variable-length lookbehind you could write (?<=a.*)(?<=b.*)c to match c only if it is preceded by both a and b somewhere in the input.
A lookbehind is basically syntactic sugar, in that you can usually rephrase your conditions using some other regex construct. It can be really handy when you have multiple unanchored constraints, like in the last example.
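As a rough illustration of the capture-group and lookbehind variants, here is a Python sketch on an invented sample string. The \K form is left out because Python's re module does not support it.

```python
import re

dn = "ou=persons,dc=company,dc=org"

# Variant 1: match the comma too, and read the wanted part from group 1.
with_group = re.search(r",(dc=)", dn)

# Variant 2: require the comma with a lookbehind; the match itself is dc=.
with_lookbehind = re.search(r"(?<=,)dc=", dn)

print(with_group.group(0), with_group.group(1))  # ,dc= dc=
print(with_lookbehind.group(0))                  # dc=

# Both variants locate dc= at the same position in the input.
print(with_group.start(1) == with_lookbehind.start())  # True
```

The practical difference is what the overall match contains: the capture-group version includes the comma in group 0, while the lookbehind version leaves it out.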

What does (^?)* mean in this regex?

I have this regex:
^(^?)*\?(.*)$
If I understand correctly, this is the breakdown of what it does:
^ - start matching from the beginning of the string
(^?)* - I don't know, but it stores it in $1
\? - matches a question mark
(.*)$ - matches anything until the end of the string
So what does (^?)* mean?
The (^?) is simply looking for the literal character ^. The ^ character in a regex pattern only has special meaning when used as the first character of the pattern or the first character in a character class []. When used outside those two positions, the ^ is interpreted literally, meaning it looks for the ^ character in the input string.
Note: Whether or not ^ outside of those two positions is interpreted literally is regex-engine specific; in Perl and PCRE, for example, an unescaped ^ outside a character class is always an anchor. I'm not familiar enough with Lua to state which it does.
Lua does not have a conventional regexp language, it has Lua patterns in its place. While they look a lot like regexp, Lua patterns are a distinct language of their own that has a simpler set of rules and most importantly lacks grouping and alternation features.
Interpreted as a Lua pattern, the example will surprise a longtime regexp user, since so many details are different.
Lua patterns are described in PiL, and at first glance are similar enough to a conventional regexp to cause confusion. The biggest differences are probably these: there is no alternation operator |, parentheses are only used to mark captures, quantifiers (?, -, +, and *) only apply to a character or character class, and % is the escape character, not \. A big clue that this example was probably not written with Lua in mind is the lack of the Lua pattern quoting character % applied to any (or ideally, all) of the non-alphanumeric characters in the pattern string, and the suspicious use of \?, which smells like a conventional regexp to match a single literal ?.
The simple answer to the question asked is: (^?)* is not a recommended form, and would match ^* or *, capturing the presence or absence of the caret. If that were the intended effect, then I would write it as (%^?)%* to make that clearer.
To see why this is the case, let's take the pattern given and analyze it as a Lua pattern. The entire pattern is:
^(^?)*\?(.*)$
Handed to string.match(), it would be interpreted as follows:
^ anchors the match to the beginning of the string.
( marks the beginning of the first capture.
^ is not at the beginning of the pattern or a character class, so it matches a literal ^ character. For clarity that should likely have been written as %^.
? matches exactly zero or one of the previous character.
) marks the end of the first capture.
* is not after something that can be quantified so it matches a literal * character. For clarity that should likely have been written as %*.
\ in a pattern matches itself; it is not an escape character in the pattern language. However, it is an escape character in a Lua short string literal, making the following character not special to the string literal parser, which in this case is moot because the ? that follows was not special to it in any case. So if the pattern were enclosed in double or single quotes, then the \ would be absorbed by string parsing. If written in a long string (as [[^(^?)*\?(.*)$]]), the backslash would survive the string parser and appear in the pattern.
? matches exactly zero or one of the previous character.
( marks the beginning of the second capture.
. matches any character at all, effectively a synonym for the class [\000-\255] (remember, in Lua numeric escapes are in decimal not octal as in C).
* matches zero or more of the previous character, greedily.
) marks the end of the second capture.
$ anchors the pattern to the end of the string.
So it matches and captures an optional ^ at the beginning of the string, followed by *, then an optional \ which is not captured, and captures the entire rest of the string. string.match would return two strings on success (either or both of which might be zero length), or nil on failure.
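The interpretation above can be approximated with a conventional regex. The sketch below is a hand-made Python translation of the long-string form of the Lua pattern, not Lua code: each literal the walk-through identifies (^, *, \) is escaped explicitly, and re.DOTALL is used because a Lua . matches any character. The names TRANSLATED and lua_like_match are hypothetical.

```python
import re

# Hand translation of the Lua pattern [[^(^?)*\?(.*)$]]:
#   ^        anchor at the start
#   (\^?)    capture an optional literal caret
#   \*       a literal asterisk
#   \\?      an optional literal backslash, not captured
#   (.*)$    capture the rest of the string
TRANSLATED = re.compile(r"^(\^?)\*\\?(.*)$", re.DOTALL)


def lua_like_match(s: str):
    """Return the two captures as a tuple, or None when there is no match."""
    m = TRANSLATED.match(s)
    return m.groups() if m else None


print(lua_like_match("^*\\rest"))  # ('^', 'rest')
print(lua_like_match("*rest"))     # ('', 'rest')
print(lua_like_match("rest"))      # None
```

As in the Lua reading, the asterisk is mandatory while the caret and backslash are optional, and either capture may come back as an empty string.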
Edit: I've fixed some typos, and corrected an error in my answer, noticed by Egor in a comment. I forgot that in patterns, special symbols lose their specialness when in a spot where it can't apply. That makes the first asterisk match a literal asterisk rather than be an error. The cascade of that falls through most of the answer.
Note that if you really want a true regexp in Lua, there are libraries available that will provide it. That said, the built-in pattern language is quite powerful. If it is not sufficient, then you might be best off adopting a full parser, and use LPeg which can do everything a regexp can and more. It even comes with a module that provides a complete regexp syntax that is translated into an LPeg grammar for execution.
In this case, the (^?) refers to the previous string "^" meaning the literal character ^ as Jared has said. Check out regexlib for any further deciphering.
For all your Regex needs: http://regexlib.com/CheatSheet.aspx
It looks to me like the intent of the creator of the expression was to match any number of ^ before the question mark, but only wanted to capture the first instance of ^. However, it may not be a valid expression depending on the engine, as others have stated.