Three or five repetitions in PCRE - regex

In PCRE, is there a better way than the following to specify that a character repeats 3 or 5 times?
/(\d{3})|(\d{5})/
Is there a special repetition quantifier for it to have a shorter expression?

I guess you could also do:
Edit2: Corrected greedy specifier \d{2}? to be optional quantifier.
/(\d{3}(?:\d{2})?)/
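To try this out, the pattern needs anchoring, otherwise the \d{3} part will happily match inside a longer run of digits. A minimal sketch using Python's re module rather than PCRE itself (the syntax used here is the same in both, as far as I know; the helper name is made up for illustration):

```python
import re

# Same pattern as above, minus the capturing group.
THREE_OR_FIVE = re.compile(r"\d{3}(?:\d{2})?")

def is_three_or_five_digits(s: str) -> bool:
    """Hypothetical helper: True only for exactly 3 or exactly 5 digits."""
    # fullmatch anchors the pattern at both ends of the string
    return THREE_OR_FIVE.fullmatch(s) is not None

for sample in ["12", "123", "1234", "12345", "123456"]:
    print(sample, is_three_or_five_digits(sample))
```

Only "123" and "12345" should come out as True.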
Additional comment on curly-brace quantifiers -
Consider this:
An exact count is specified in the form {n,n} or {n}.
What is the impact of a greediness specifier on it?
Perl docs say
{n}? Match exactly n times, not greedily
{n}+ Match exactly n times and give nothing back (redundant)
The impact is nothing, it has no connotation other than the exact amount must match.
Obviously there is a HOLE here.
There is also a syntax hole in the form of {,n}, that is not parsed as a quantifier,
but as a literal.
Pretty confusing since no error is ever generated in these cases.
There is a need, and use, for a quantifier construct that fits perfectly in this form:
{n}? Match exactly n times, or match nothing
{,n,o,p, ..} Match the most either n,o,p, times (where n < o < p)
{,n,o,p, ..}? Match the least either n,o,p, times (where n < o < p)
If this syntax 'hole' were used/filled, the OP's question could be answered by a
simple /(\d{,3,5})/
In a broader context, this would add immense power to regular expressions.
And it would probably reduce backtracking dramatically, because this list-like syntax is atomic
in nature.
An edit for #Tim Pietzcker -
Sorry I'm late with a response. I don't think there is an "exact" construct in curly-brace quantifiers; the notation is a shortcut. There is only min/max, as you suggest (correct me if I'm wrong). I can only go by what a Perl regex debugger shows: it suggests that {n} is parsed as {n,n} by the engine, and no optimizations are changed. I think this is bad and confusing notation with regard to the lazy quantifier, which according to all the docs remains 'least possible' even on the exact shorthand notation. I think the shorthand {n}? should be eliminated, or actually used for something useful, such as shorthand for (?:x{n})?. That is my point, though it is kind of unrelated to your comment.

/\d{3,5}?/ maybe? Do you really need the capturing brackets?
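One caveat with this suggestion: {3,5} also admits four repetitions, and the trailing ? only makes the quantifier lazy. It does not restrict the count to 3 or 5. A small sketch in Python's re (not PCRE, but I'd expect the same behaviour for these constructs):

```python
import re

lazy_range = re.compile(r"\d{3,5}?")

# Unanchored and lazy: the match stops as soon as three digits are consumed.
first = lazy_range.match("12345").group()

# Anchored: laziness no longer matters, and four digits are accepted as well.
accepts_four = lazy_range.fullmatch("1234") is not None
accepts_five = lazy_range.fullmatch("12345") is not None

print(first, accepts_four, accepts_five)
```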

Related

How to find changing date format from a string [duplicate]

I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit              Shorthand
(something){0,1}      (something)?
(something){1}        (something)
(something){0,}       (something)*
(something){1,}       (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?
To my knowledge they are identical. I think there may be a few engines out there that don't support the numbered syntax, but I'm not sure which. I vaguely recall a question on SO a few days ago where explicit notation wouldn't work in Notepad++.
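If you'd rather check than take it on faith, here is a rough sketch for one engine, Python's re. It compares each explicit/shorthand pair from the question, plus the lazy variants, on a handful of inputs. The sample strings and the helper name are my own, and passing here says nothing about other engines:

```python
import re

# (explicit, shorthand) pairs, with "ab" standing in for "something".
PAIRS = [
    (r"(ab){0,1}", r"(ab)?"),
    (r"(ab){1}", r"(ab)"),
    (r"(ab){0,}", r"(ab)*"),
    (r"(ab){1,}", r"(ab)+"),
    # lazy (reluctant) variants
    (r"(ab){0,1}?", r"(ab)??"),
    (r"(ab){0,}?", r"(ab)*?"),
    (r"(ab){1,}?", r"(ab)+?"),
]

SAMPLES = ["", "ab", "abab", "ababab", "xab", "aba"]

def same_behaviour(explicit: str, shorthand: str) -> bool:
    """Hypothetical helper: compare match span and captures on every sample."""
    for s in SAMPLES:
        m1 = re.match(explicit, s)
        m2 = re.match(shorthand, s)
        r1 = (m1.span(), m1.groups()) if m1 else None
        r2 = (m2.span(), m2.groups()) if m2 else None
        if r1 != r2:
            return False
    return True

for explicit, shorthand in PAIRS:
    print(explicit, shorthand, same_behaviour(explicit, shorthand))
```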
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these especially when the repeated pattern is more than a few characters. If you have to match 3 numbers, some people like to write: \d\d\d but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
If that criterion isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to code review their pattern, and that's when I would suggest changing those occurrences to shorthand notation to save space and, IMO, improve readability.
I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.
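To illustrate the backreference case and a couple of the "other ways", here is a sketch in Python's re. Note that the handling of \10 differs between engines: Python reads it as a reference to group 10 and refuses to compile if there is no such group, while other flavors may treat it as an octal escape instead, so take the last part as Python-specific:

```python
import re

# Goal: group 1, then the same text again, then a literal "0".
with_braces = re.compile(r"(a)\1{1}0")   # the {1} trick from the text
with_group = re.compile(r"(a)(?:\1)0")   # wrap the backreference in a group
with_class = re.compile(r"(a)\1[0]")     # put the digit in a character class

# Written naively, the digit is swallowed into the group number.
try:
    re.compile(r"(a)\10")
    naive_compiles = True
except re.error:
    naive_compiles = False

for p in (with_braces, with_group, with_class):
    print(p.pattern, bool(p.fullmatch("aa0")))
print("naive form compiles:", naive_compiles)
```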
They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.
They're equivalent (and you'll find out if they're available by testing your context.)
The problem I'd anticipate is when you may not be the only person ever needing to work with your code.
Regexes are difficult enough for most people. Anytime someone uses an unusual syntax, the question
arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"

How does the ? make a quantifier lazy in regex

I've been looking into regex lately and figured out that the ? operator makes *, + or ? lazy. My question is: how does it do that? Is *?, for example, a special operator, or does the ? have an effect on the *? In other words, does regex recognize *? as one operator in itself, or as the two separate operators * and ?? If *? is recognized as two separate operators, how does the ? affect the * to make it lazy? If ? means that the * is optional, shouldn't this mean that the * doesn't have to exist at all? If so, then in an expression like .*?, wouldn't regex just match separate letters and the whole string instead of the shorter string? Please explain, I'm desperate to understand. Many thanks.
? can mean a lot of different things in different contexts.
Following a normal regex token (a character, a shorthand, a character class, a group...), it means "Match the previous item 0-1 times".
Following a quantifier like ?, *, +, {n,m}, it takes on a different meaning: "Make the previous quantifier lazy instead of greedy" (if greedy is the default; that can be changed, though. For example, in PHP the /U modifier makes all quantifiers lazy by default, so the additional ? makes them greedy).
Right after an opening parenthesis, it marks the start of a special construct such as
a) (?s): mode modifiers ("turn on dotall mode")
b) (?:...): make the group non-capturing
c) (?=...) or (?!...): lookahead assertion
d) (?<=...) or (?<!...): lookbehind assertion
e) (?>...): atomic group
f) (?<foo>...): named capturing group
g) (?#comment): inline comments, ignored by the regex engine
h) (?(?=if)then|else): conditionals
and others. Not all constructs are available in all regex flavors.
Within a character class ([?]), it simply matches a verbatim ?.
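A few of these roles side by side, as a sketch in Python's re. One flavor difference shows up right away: Python spells named groups (?P<name>...) rather than (?<name>...), at least in the versions I've used. The sample strings are arbitrary:

```python
import re

# 1. After an ordinary token: "match the previous item 0-1 times".
optional = re.fullmatch(r"colou?r", "color") is not None

# 2. After a quantifier: make that quantifier lazy.
greedy = re.match(r"a+", "aaa").group()
lazy = re.match(r"a+?", "aaa").group()

# 3. Right after an opening parenthesis: start of a special construct.
non_capturing = re.match(r"(?:ab)+", "abab").groups()    # nothing is captured
lookahead = re.match(r"foo(?=bar)", "foobar").group()    # "bar" is not consumed
named = re.match(r"(?P<word>\w+)", "hello").group("word")

# 4. Inside a character class: a verbatim question mark.
literal = re.findall(r"[?]", "why? how?")

print(optional, greedy, lazy, non_capturing, lookahead, named, literal)
```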
I think a little history will make it easier to understand. When Larry Wall wanted to grow regex syntax to support new features, his options were severely limited. He couldn't just decree (for example) that % is now a metacharacter that supports new feature "XYZ". That would break the millions of existing regexes that happened to use % to match a literal percent sign.
What he could do is take an already-defined metacharacter and use it in such a way that its original function wouldn't make sense. For example, any regex that contained two quantifiers in a row would be invalid, so it was safe to say a ? after another quantifier now turns it into a reluctant quantifier (a much better name than "lazy" IMO; "non-greedy" is good too). So the answer to your question is that ? doesn't modify the *; *? is a single entity: a reluctant quantifier. The same is true of the + in possessive quantifiers (*+, {0,2}+ etc.).
A similar process occurred with group syntax. It would never make sense to have a quantifier after an unescaped opening parenthesis, so it was safe to say (? now marks the beginning of a special group construct. But the question mark alone would only support one new feature, so the ? itself has to be followed by at least one more character to indicate which kind of group it is ((?:...), (?<!...), etc.). Again, the (?: is a single entity: the opening delimiter of a non-capturing group.
I don't know offhand why he used the question mark both times. I do know Perl 6 Rules (a bottom-up rewrite of Perl 5 regexes) has done away with all that crap and uses an infinitely more sensible syntax.
Imagine you have the following text:
BAAAAAAAAD
The following regexes will match:
/B(A+)/ => 'BAAAAAAAA'
/B(A+?)/ => 'BA'
/B(A*)/ => 'BAAAAAAAA'
/B(A*?)/ => 'B'
The addition of the "?" to the + and * operators makes them "lazy", i.e. they will match the absolute minimum required for the expression to be true. By default, the * and + operators are "greedy" and try to match AS MUCH AS POSSIBLE for the expression to be true.
Remember + means "one or more" so the minimum will be "one if possible, more if absolutely necessary" whereas the maximum will be "all if possible, one if absolutely necessary".
And * means "zero or more" so the minimum will be "nothing if possible, more if absolutely necessary" whereas the maximum will be "all if possible, zero if absolutely necessary".
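The four patterns above can be run directly; this is a small sketch in Python's re, where group(0) is the whole match (what the list above shows) and group(1) is what the parentheses captured. The last pattern is my own addition to show the "more if absolutely necessary" part:

```python
import re

text = "BAAAAAAAAD"

for pattern in [r"B(A+)", r"B(A+?)", r"B(A*)", r"B(A*?)"]:
    m = re.search(pattern, text)
    print(pattern, repr(m.group(0)), repr(m.group(1)))

# With a "D" required afterwards, the lazy quantifier is forced to take
# every "A" after all, because nothing shorter lets the overall match succeed.
forced = re.search(r"B(A+?)D", text)
print(repr(forced.group(1)))
```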
This very much depends on the implementation, I guess. But since every quantifier I am aware of can be modified with ? it might be reasonable to implement it that way.

Is there any difference between `{m}` and `{m}?` quantifiers?

Please give an example where the difference between greedy and lazy versions of "repeat-exactly-m-times" quantifier can be seen.
The question arose from here and here.
If there is no difference, then why does the {m}? quantifier exist?
I don't believe there is any real difference between {m} and {m}? since each specifies exactly m times. However, there is a difference between {m,} and {m,}? (and {m,}+, while we're at it). It's appropriate and needed for quantifiers in general, even if it isn't needed for that particular case.
As said in comments, the Oniguruma regexp engine treats it differently, as an exception: {m}? is not a non-greedy exact m (which is the same as greedy exact m), but 0-or-m. All the other engines I tried did as other posters say: it makes no difference.
The reason for the non-greedy exact m to exist: if it didn't, that would be an exception. Exceptions are harder to remember and harder to implement (it's extra work), and in this case, since the semantics are identical, it doesn't hurt anyone.
I love Oniguruma, and appreciate they might have wanted to change the unneeded bit into something more usable and efficient, but this looks like a bug waiting to happen. Fortunately, no-one sane writes non-greedy exact m...
It doesn't make a difference for an exact match {m}.
However, it will make a difference with {m,}, as greedy quantifiers match as many characters as possible, whereas lazy quantifiers match as few as possible.
Given the string "Baaaaaaaaaaaa"
The regex (B[a]{2,}?) would match "Baa"
The regex (B[a]{2,}) would match "Baaaaaaaaaaaa"
Whereas, with the exact match {m}:
The regex (B[a]{2}?) would match "Baa"
The regex (B[a]{2}) would also match "Baa"
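These four cases are easy to confirm in an engine that treats {m}? as a plain lazy exact count. The sketch below uses Python's re, which does so as far as I can tell; an engine with the Oniguruma behaviour mentioned earlier may give a different answer for the third pattern:

```python
import re

text = "Baaaaaaaaaaaa"

lazy_open = re.match(r"(B[a]{2,}?)", text).group(1)
greedy_open = re.match(r"(B[a]{2,})", text).group(1)
lazy_exact = re.match(r"(B[a]{2}?)", text).group(1)
greedy_exact = re.match(r"(B[a]{2})", text).group(1)

print(lazy_open, greedy_open, lazy_exact, greedy_exact)
```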
