Capture a dot with postgres regexp - regex

I have these strings :
3 FD160497. 2016 abcd
3 FD160497 2016 abcd
I want to capture "FD", the digits, then the dot if it is present.
I tried this :
SELECT
sqn[1] AS letters,
sqn[2] AS digits,
sqn[3] AS dot
FROM (
SELECT
regexp_matches(string, '.*?(FD)([0-9]{6})(\.)?.*') as sqn
FROM
mytable
) t;
(PostgreSQL 9.5.3)
"dot" column is NULL in both cases, and I really don't know why.
It works well on regex101.

The first lazy pattern made all quantifiers in the current branch lazy, so your pattern became equivalent to
.*?(FD)([0-9]{6})(\.)??.*?
                     ^^  ^
See its demo at regex101.com
See the 9.7.3.1. Regular Expression Details excerpt:
...matching is done in such a way that the branch, or whole RE, matches the longest or shortest possible substring as a whole. Once the length of the entire match is determined, the part of it that matches any particular subexpression is determined on the basis of the greediness attribute of that subexpression, with subexpressions starting earlier in the RE taking priority over ones starting later.
You need to use quantifiers consistently within one branch:
regexp_matches(string, '.*(FD)([0-9]{6})(\.)?.*') as sqn
or
regexp_matches(string, '.*[[:blank:]](FD)([0-9]{6})(\.)?.*') as sqn
See the regex demo
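For contrast, here is a minimal sketch of how a Perl-style backtracking engine treats the original pattern. It uses Python's re module as a stand-in for what regex101 shows (an assumption; it was not run against PostgreSQL), and the variable names are illustrative only. In such an engine greediness is applied per quantifier, which would explain why the pattern looked fine outside PostgreSQL:

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'.*?(FD)([0-9]{6})(\.)?.*')

strings = ["3 FD160497. 2016 abcd", "3 FD160497 2016 abcd"]

# In a backtracking engine, (\.)? stays greedy even though .*? is lazy,
# so the dot is captured whenever it is present.
results = [pattern.match(s).groups() for s in strings]
for groups in results:
    print(groups)
```

The third group comes back as '.' for the first string and None for the second, which is the behavior the question expected from PostgreSQL.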

Related

Regex Pattern to Match except when the clause enclosed by the tilde (~) on both sides

I want to extract matches of the clauses match-this that is enclosed with anything other than the tilde (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch these ~match-this~ clauses, but when I tried its negation, (?!(~match-this)), it matched only empty strings.
When I tried the pattern [^~]match-this[^~], it caught only one match (the 2nd match from above). When I added an asterisk wildcard to either negated tilde, [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk on both, it caught every match-this, including those enclosed by tildes.
Is it possible to achieve this with a single regex test, or does it need more?
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char if not directly followed by match-this
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or # if not directly followed by match-this
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
res = [m for m in re.findall(pattern, s) if m]
print(res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
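The "discard what is not captured" idea can be sketched in Python as follows. This is only an illustration under the assumption that all five results may be the bare text match-this; the names s, pattern and kept are made up for the example:

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# The left alternative consumes the unwanted ~match-this~ without capturing;
# only the right alternative fills group 1.
pattern = re.compile(r"~match-this~|(\bmatch-this\b)")

# Keep a match only when group 1 took part in it.
kept = [m.group(1) for m in pattern.finditer(s) if m.group(1) is not None]
print(kept)
```

Matches where group 1 is None are the tilde-enclosed occurrences that the engine skipped over.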
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts here or the preceding character must not be a tilde. We could do the same for the end of the text, but this is a dead end anyway, because the character is still consumed.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
It's the difference between telling the engine to "not match" a character and telling it to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What puzzles me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing all the matches, since they also come with starting positions and lengths: a first match at position 0 for 10 characters means you look at input text positions -1 and 10.
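As a rough sketch of that last suggestion, the following Python snippet runs the three-way lookaround pattern and then reads the neighboring characters from the input using each match's position. The helper name neighbours is hypothetical, and Python's re is used here only because it supports both lookahead and fixed-width lookbehind:

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

pattern = re.compile(
    r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"
)

def neighbours(text, m):
    """Return the characters just before and just after a match (None at the edges)."""
    before = text[m.start() - 1] if m.start() > 0 else None
    after = text[m.end()] if m.end() < len(text) else None
    return before, after

# Each entry: (start position, character before, character after).
found = [(m.start(), *neighbours(s, m)) for m in pattern.finditer(s)]
for entry in found:
    print(entry)
```

The lookarounds never consume the surrounding characters, so the prefix and suffix are recovered afterwards from the start and end positions rather than from the match text.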

Regular expression for all abbreviated keyword variations

I need to search for a keyword, such as "abcdef", which can also be in an abbreviated version with a dot at the end. All valid variants are:
abcdef
abcde.
abcd.
abc.
ab.
a.
I have a regular expression for this, which is clear:
abcdef|abcde\.|abcd\.|abc\.|ab\.|a\.
Another regular expression where the keyword characters are not repeated:
a(b(c(d(e(f|\.)|\.)|\.)|\.)|\.)
I'm looking for a more compact expression where not even a dot will be repeated.
I use .NET syntax.
You can use a conditional construct:
a(b(c(d(e(?<f>f)?)?)?)?)?(?(f)|\.?)
See the regex demo. Here, (?<f>f)? is an optional named group matching f one or zero times. If the group matches, the f group is not empty, and (?(f)|\.?) then matches an empty string. If it is empty, \.? matches an optional dot.
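The same conditional idea can be tried outside .NET. Below is a sketch in Python, whose re module also has a conditional construct but spells named groups as (?P<f>...); the candidate list is made up for illustration:

```python
import re

# Same structure as the .NET pattern, with Python's named-group syntax.
pattern = re.compile(r"a(b(c(d(e(?P<f>f)?)?)?)?)?(?(f)|\.?)")

candidates = ["abcdef", "abcde.", "abcd.", "abc.", "ab.", "a.", "abcdef."]

# fullmatch requires the whole candidate to be consumed.
accepted = [c for c in candidates if pattern.fullmatch(c)]
print(accepted)

# Because the dot is optional (\.?), a bare prefix is accepted as well.
bare_prefix_ok = pattern.fullmatch("abc") is not None
print(bare_prefix_ok)
```

Note that, as written, the pattern treats the dot as optional, so a prefix such as abc without a trailing dot also matches.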
In the PCRE flavor, you could use
a(b(c(d(e(f(*ACCEPT))?)?)?)?)?\.?
where the (*ACCEPT) verb inside an optional group stops processing the current regex and returns the value matched so far (so the last \.? is not tried if f is found). See this regex demo.
As a variant:
a(bcdef|(bcde|bcd|bc|b|)\.)
Factoring out 2 letters (a bit shorter):
a(bcdef|(b(cde|cd|c|)|)\.)
Factoring out 3 letters (the same length):
a(bcdef|(b(c(de|d|)|)|)\.)
Factoring out 4 letters gives the shortest, 25 symbols:
a(bcdef|(b(c(de?|)|)|)\.)
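A quick way to check that these variants really accept the same strings is to run them all against the valid list plus a few strings that should be rejected. This is a sketch in Python rather than .NET (the patterns use only syntax common to both); the invalid samples and the helper accepts_exactly are made up for the test:

```python
import re

valid = {"abcdef", "abcde.", "abcd.", "abc.", "ab.", "a."}
invalid = {"abcdef.", "abc", "a", "abcdf.", "."}  # hypothetical negative samples

patterns = [
    r"abcdef|abcde\.|abcd\.|abc\.|ab\.|a\.",
    r"a(b(c(d(e(f|\.)|\.)|\.)|\.)|\.)",
    r"a(bcdef|(bcde|bcd|bc|b|)\.)",
    r"a(bcdef|(b(cde|cd|c|)|)\.)",
    r"a(bcdef|(b(c(de|d|)|)|)\.)",
    r"a(bcdef|(b(c(de?|)|)|)\.)",
]

def accepts_exactly(p):
    """True if p fully matches every valid variant and none of the invalid samples."""
    return (all(re.fullmatch(p, v) for v in valid)
            and not any(re.fullmatch(p, i) for i in invalid))

checks = [accepts_exactly(p) for p in patterns]
print(checks)
print(len(patterns[-1]))  # length of the shortest variant
```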

Regex101 vs Oracle Regex

My regex:
^\+?(-?)0*([[:digit:]]+,[[:digit:]]+?)0*$
It removes a leading + and the leading and trailing 0s from a decimal number.
I have tested it in regex101
For input: +000099,8420000 and substitution \1\2 it returns 99,842
I want the same result in Oracle database 11g:
select REGEXP_REPLACE('+000099,8420000','^\+?(-?)0*([[:digit:]]+,[[:digit:]]+?)0*$','\1\2') from dual;
But it returns 99,8420000 (the trailing 0s are still present).
What am I missing?
EDIT
It behaves as if a greedy quantifier, not a lazy one, were used at the end of the regex, but I definitely set a lazy one.
The problem is well-known for all those who worked with Henry Spencer's regex library implementations: lazy quantifiers should not be mixed up with greedy quantifiers in one and the same branch since that leads to undefined behavior. The TRE regex engine used in R shows the same behavior. While you may mix the lazy and greedy quantifiers to some extent, you must always make sure you get a consistent result.
The solution is to only use lazy quantifiers inside the capturing group:
select REGEXP_REPLACE('+000099,8420000', '^\+?(-?)0*([0-9]+?,[0-9]+?)0*$','\1\2') as Result from dual
See the online demo
The [0-9]+?,[0-9]+? part matches 1 or more digits but as few times as possible followed with a comma and then 1 or more digits, as few as possible.
Some more tests (select REGEXP_REPLACE('+00009,010020','[0-9]+,[0-9]+?([1-9])','\1') from dual yields +20) prove that the first quantifier in a group sets the quantifier greediness type. In the case above, Group 0 quantifier greediness is set to greedy by the first ? quantifier, and Group 1 (i.e. ([0-9]+?,[0-9]+?)) greediness type is set with the first +? (which is lazy).
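For comparison, here is a sketch of both patterns in a backtracking engine (Python's re, used as a stand-in for what regex101 shows; it was not run against Oracle). [[:digit:]] is replaced with [0-9] because Python's re does not support POSIX bracket classes. If the reasoning above holds, the consistently lazy pattern should be safe in both kinds of engine:

```python
import re

value = "+000099,8420000"

# Pattern from the question (greedy + mixed with lazy +? inside the group).
original = r"^\+?(-?)0*([0-9]+,[0-9]+?)0*$"
# Pattern from the answer (only lazy quantifiers inside the group).
consistent = r"^\+?(-?)0*([0-9]+?,[0-9]+?)0*$"

results = {name: re.sub(p, r"\1\2", value)
           for name, p in (("original", original), ("consistent", consistent))}
print(results)
```

In Python both patterns give the same result, so switching to the consistent form does not change behavior where the original already worked.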

python regex non-capture group handling

(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
used to match string
123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23
How come this works on regexr.com but Python 3.5.1 can't find a match?
r'(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))'
can match up to
123 FEX-1-80 Online N2K-C2248TP
but the second hyphen - in group(4) is not matched
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
Just a comment, not really an answer but for the sake of clarity I have put it as an answer.
Being relatively new to regular expressions, one should use the verbose mode. With this, your expression becomes much much more readable:
(1[0-9]{2})\s+ # three digits, the first one needs to be 1
(\w+(?:-\w+)+)\s+ # a word character (wc), followed by - and wcs
(\w+)\s+ # another word
(\w+(?:-\w+)+)\s+ # same expression as above
(\w+) # another word
Also, check if your (second and fourth) expressions could be rewritten as [\w-]+ (it is not the same as yours and will match other substrings), and try to avoid nested parentheses in general.
Concerning your question, the second string cannot be matched as you made all of your expressions mandatory (and group 5 is missing in the second example, so it will fail).
See a demo on regex101.com.
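In Python, the verbose layout above can be used directly by passing re.VERBOSE, which ignores whitespace in the pattern and allows # comments. A minimal sketch; the variable names are illustrative:

```python
import re

pattern = re.compile(r"""
    (1[0-9]{2})\s+       # three digits, the first one must be 1
    (\w+(?:-\w+)+)\s+    # word characters followed by one or more hyphenated parts
    (\w+)\s+             # another word
    (\w+(?:-\w+)+)\s+    # same shape as group 2
    (\w+)                # another word
""", re.VERBOSE)

line = "123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23"
match = pattern.search(line)
print(match.groups())
```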
This regular expression matches the full input string:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
This one doesn't:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))
The latter is missing a + after the last non-capturing group, and it's missing the \s+(\w+) at the end that matches the SSDFDFWFw23r23 at the end of the input string.
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
I'm not sure I follow. A non-capturing group is really just there to group a part of a regular expression.
(?:-\w+) or just -\w+ will both match a hyphen (-) followed by one or more "word" characters (\w+). It doesn't matter whether that regular expression is in a non-capturing group or not. If you want to match repetitions of that pattern, you can use the + modifier after the non-capturing group, e.g. (?:-\w+)+. That pattern will match a string like -foo-bar-baz.
So the reason your second regular expression doesn't match the repeated pattern is because it's lacking the + modifier.
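The effect of the missing + can be seen in isolation on the fourth field of the input. A short sketch, with made-up variable names:

```python
import re

field = "N2K-C2248TP-1GE"

# Without +, the non-capturing group matches exactly one "-word" part.
once = re.search(r"\w+(?:-\w+)", field).group()
# With +, the group may repeat, so every "-word" part is consumed.
repeated = re.search(r"\w+(?:-\w+)+", field).group()

print(once)
print(repeated)
```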

Does greediness of first quantifier override greediness of all next quantifiers?

I'm working with pattern matching in Postgresql 9.4. I run this query:
select regexp_matches('aaabbb', 'a+b+?')
and I expect it to return 'aaab' but instead it returns 'aaabbb'. Shouldn't the b+? atom match only one 'b' since it is not greedy? Is the greediness of the first quantifier setting the greediness for the whole regular expression?
Here is what I've found in postgresql 9.4's documentation:
Once the length of the entire match is determined, the part of it that matches any particular subexpression is determined on the basis of the greediness attribute of that subexpression, with subexpressions starting earlier in the RE taking priority over ones starting later.
and
If the RE could match more than one substring starting at that point, either the longest possible match or the shortest possible match will be taken, depending on whether the RE is greedy or non-greedy.
An example of what this means:
SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
Result: 123
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
Result: 1
In the first case, the RE as a whole is greedy because Y* is greedy. It can match beginning at the Y, and it matches the longest possible string starting there, i.e., Y123. The output is the parenthesized part of that, or 123. In the second case, the RE as a whole is non-greedy because Y*? is non-greedy. It can match beginning at the Y, and it matches the shortest possible string starting there, i.e., Y1. The sub-expression [0-9]{1,3} is greedy but it cannot change the decision as to the overall match length; so it is forced to match just 1.
Meaning that the greediness of an operator is determined by the ones defined prior to it.
I guess you have to use a+?b+? for achieving what you want.
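To see how this differs from the per-quantifier model most people expect, the same patterns can be run in a backtracking engine. The sketch below uses Python's re for that purpose; the PostgreSQL results quoted above come from the question and the documentation and are not reproduced by this code:

```python
import re

# Each quantifier keeps its own greediness in a backtracking engine.
py_ab = re.search(r"a+b+?", "aaabbb").group()

# The two documentation examples, run through Python instead of PostgreSQL.
py_greedy_y = re.search(r"Y*([0-9]{1,3})", "XY1234Z").group(1)
py_lazy_y = re.search(r"Y*?([0-9]{1,3})", "XY1234Z").group(1)

print(py_ab, py_greedy_y, py_lazy_y)
```

Here the lazy Y*? does not affect [0-9]{1,3}, so Python captures 123 in both cases, whereas the documentation reports 1 for the second query in PostgreSQL.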