Regex101 vs Oracle Regex - regex

My regex:
^\+?(-?)0*([[:digit:]]+,[[:digit:]]+?)0*$
It is removing leading + and leading and tailing 0s in decimal number.
I have tested it in regex101
For input: +000099,8420000 and substitution \1\2 it returns 99,842
I want the same result in Oracle database 11g:
select REGEXP_REPLACE('+000099,8420000','^\+?(-?)0*([[:digit:]]+,[[:digit:]]+?)0*$','\1\2') from dual;
But it returns 99,8420000 (tailing 0s are still present...)
What I'm missing?
EDIT
It works like greedy quantifier * at the end of regex, not lazy *? but I definitely set lazy one.

The problem is well-known for all those who worked with Henry Spencer's regex library implementations: lazy quantifiers should not be mixed up with greedy quantifiers in one and the same branch since that leads to undefined behavior. The TRE regex engine used in R shows the same behavior. While you may mix the lazy and greedy quantifiers to some extent, you must always make sure you get a consistent result.
The solution is to only use lazy quantifiers inside the capturing group:
select REGEXP_REPLACE('+000099,8420000', '^\+?(-?)0*([0-9]+?,[0-9]+?)0*$','\1\2') as Result from dual
See the online demo
The [0-9]+?,[0-9]+? part matches 1 or more digits but as few times as possible followed with a comma and then 1 or more digits, as few as possible.
Some more tests (select REGEXP_REPLACE('+00009,010020','[0-9]+,[0-9]+?([1-9])','\1') from dual yields +20) prove that the first quantifier in a group sets the quantifier greediness type. In the case above, Group 0 quantifier greediness is set to greedy by the first ? quantifier, and Group 1 (i.e. ([0-9]+?,[0-9]+?)) greediness type is set with the first +? (which is lazy).

Related

RegEx: How to make greedy quantifier really really greddy (and never give back)?

For example I have this RegEx:
([0-9]{1,4})([0-9])
Which gives me these matching groups when testing with string "3041":
As you can see, group2 is filled before group1 even if the quantifier is greedy.
How can I instead make sure to fill group1 before group2?
EDIT1: I want to have the same regEx, but have "3041" in group1 and group2 empty.
EDIT2: I want to have "3041" in group1 and group2 empty. And, yes, I want the regEx to not match!,
For an input "1234", the pattern: ([0-9]{1,4})([0-9]) is being as greedy as possible.
The first capture group cannot contain four characters, otherwise the last part of the pattern would not match.
Perhaps what you're looking for is:
([0-9]{1,4})([0-9]?)
By making the second group optionally empty, the first group can contain all four characters.
Edit:
I want the regEx to not match!, I want only 5 digits strings to match the whole RegEx.
In this case, your pattern should not really be "1-4 characters" in the first group, since you only want to match a group of 4:
([0-9]{4})([0-9])
In some regex flavours (i.e. not all languages support this), it is also possible to make quantifiers possessive (although this is unnecessary in your case, as shown above). For example:
([0-9]{1,4}+)([0-9])
This will force the first group to match as far as it can (i.e. 4 characters), so a 3-character match does not get attempted and the overall pattern fails to match.
Edit2:
Is "possessiveness" available in Javascript? If not, any workarounds?
Unfortunately, possessive quantifiers are not available in JavaScript.
However, you can emulate the behaviour (in a slightly ugly way) with a lookahead:
(?=([0-9]{1,4}))\1([0-9])
In general, a possessive quantifier a++ can be emulated as: (?=(a+))\1.
As it stands you only need anchors:
^([0-9]{4})([0-9])$
This will only match five digits strings and will fail on any other string.

Regex: Greedy matching some patterns ~-11 or ~6 etc?

I have some text with following pattern
~, and optionally a - then some digits
So I can have (can be part of some larger text)
~7
~-6
~-11534
~-0
for example my text can be:
New Zealand~1 expenditure~-900
Right now I am using this pattern:
[~-]*[0-9]*[0-9]
It seems to does the work but I know [0-9]*[0-9] is greedy matching (0 to unlimited times)
I am wondering if there is a better pattern?
EDIT: Actually, to match your requirements better, I suggest
~-?[0-9]+
This way we specify that ~ required, and it can be followed by - and digits.
The question mark after a quantifier says to take as little as possible, making it non-greedy, but in current example it is unnecessary.
EDIT 2: I noticed rather lately that the digits are not entirely optional, and changed 'zero or more' quantifier * to 'one or more' quantifier +.
EDIT 3: To talk about 'greedy' and 'non-greedy'. A non-greedy algorithm would return as little as possible, and in case of multiple digits at the end of string it would include only the first one into result, which is not what you're looking for.
A bit more about greedy and non-greedy algorithms, thanks to Wiktor Stribiżew for excellently formulated explanation:
The lazy quantifier at the end of the pattern will match 0 (if *? is used) or 1 (if +? is used) chars. That happens because the lazily quantified patterns only grab what they must match first (the minimum number of occurrences) and are skipped so that the subsequent patterns could be checked. Only if there is no match the engine goes back to the lazily quantified subpattern to expand it one more char and retry.

Capture a dot with postgres regexp

I have these strings :
3 FD160497. 2016 abcd
3 FD160497 2016 abcd
I want to capture "FD", the digits, then the dot if it is present.
I tried this :
SELECT
sqn[1] AS letters,
sqn[2] AS digits,
sqn[3] AS dot
FROM (
SELECT
regexp_matches(string, '.*?(FD)([0-9]{6})(\.)?.*') as sqn
FROM
mytable
) t;
(PostgreSQL 9.5.3)
"dot" column is NULL in both cases, and I really don't know why.
It works well on regex101.
The first lazy pattern made all quantifiers in the current branch lazy, so your pattern became equivalent to
.*?(FD)([0-9]{6})(\.)??.*?
^^ ^
See its demo at regex101.com
See the 9.7.3.1. Regular Expression Details excerpt:
...matching is done in such a way that the branch, or whole RE, matches the longest or shortest possible substring as a whole. Once the length of the entire match is determined, the part of it that matches any particular subexpression is determined on the basis of the greediness attribute of that subexpression, with subexpressions starting earlier in the RE taking priority over ones starting later.
You need to use quantifiers consistently within one branch:
regexp_matches(string, '.*(FD)([0-9]{6})(\.)?.*') as sqn
or
regexp_matches(string, '.*[[:blank:]](FD)([0-9]{6})(\.)?.*') as sqn
See the regex demo

What is the difference between .*? and .* regular expressions?

I'm trying to split up a string into two parts using regex. The string is formatted as follows:
text to extract<number>
I've been using (.*?)< and <(.*?)> which work fine but after reading into regex a little, I've just started to wonder why I need the ? in the expressions. I've only done it like that after finding them through this site so I'm not exactly sure what the difference is.
On greedy vs non-greedy
Repetition in regex by default is greedy: they try to match as many reps as possible, and when this doesn't work and they have to backtrack, they try to match one fewer rep at a time, until a match of the whole pattern is found. As a result, when a match finally happens, a greedy repetition would match as many reps as possible.
The ? as a repetition quantifier changes this behavior into non-greedy, also called reluctant (in e.g. Java) (and sometimes "lazy"). In contrast, this repetition will first try to match as few reps as possible, and when this doesn't work and they have to backtrack, they start matching one more rept a time. As a result, when a match finally happens, a reluctant repetition would match as few reps as possible.
References
regular-expressions.info/Repetition - Laziness instead of Greediness
Example 1: From A to Z
Let's compare these two patterns: A.*Z and A.*?Z.
Given the following input:
eeeAiiZuuuuAoooZeeee
The patterns yield the following matches:
A.*Z yields 1 match: AiiZuuuuAoooZ (see on rubular.com)
A.*?Z yields 2 matches: AiiZ and AoooZ (see on rubular.com)
Let's first focus on what A.*Z does. When it matched the first A, the .*, being greedy, first tries to match as many . as possible.
eeeAiiZuuuuAoooZeeee
\_______________/
A.* matched, Z can't match
Since the Z doesn't match, the engine backtracks, and .* must then match one fewer .:
eeeAiiZuuuuAoooZeeee
\______________/
A.* matched, Z still can't match
This happens a few more times, until finally we come to this:
eeeAiiZuuuuAoooZeeee
\__________/
A.* matched, Z can now match
Now Z can match, so the overall pattern matches:
eeeAiiZuuuuAoooZeeee
\___________/
A.*Z matched
By contrast, the reluctant repetition in A.*?Z first matches as few . as possible, and then taking more . as necessary. This explains why it finds two matches in the input.
Here's a visual representation of what the two patterns matched:
eeeAiiZuuuuAoooZeeee
\__/r \___/r r = reluctant
\____g____/ g = greedy
Example: An alternative
In many applications, the two matches in the above input is what is desired, thus a reluctant .*? is used instead of the greedy .* to prevent overmatching. For this particular pattern, however, there is a better alternative, using negated character class.
The pattern A[^Z]*Z also finds the same two matches as the A.*?Z pattern for the above input (as seen on ideone.com). [^Z] is what is called a negated character class: it matches anything but Z.
The main difference between the two patterns is in performance: being more strict, the negated character class can only match one way for a given input. It doesn't matter if you use greedy or reluctant modifier for this pattern. In fact, in some flavors, you can do even better and use what is called possessive quantifier, which doesn't backtrack at all.
References
regular-expressions.info/Repetition - An Alternative to Laziness, Negated Character Classes and Possessive Quantifiers
Example 2: From A to ZZ
This example should be illustrative: it shows how the greedy, reluctant, and negated character class patterns match differently given the same input.
eeAiiZooAuuZZeeeZZfff
These are the matches for the above input:
A[^Z]*ZZ yields 1 match: AuuZZ (as seen on ideone.com)
A.*?ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
A.*ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
Here's a visual representation of what they matched:
___n
/ \ n = negated character class
eeAiiZooAuuZZeeeZZfff r = reluctant
\_________/r / g = greedy
\____________/g
Related topics
These are links to questions and answers on stackoverflow that cover some topics that may be of interest.
One greedy repetition can outgreed another
Regex not being greedy enough
Regular expression: who's greedier
It is the difference between greedy and non-greedy quantifiers.
Consider the input 101000000000100.
Using 1.*1, * is greedy - it will match all the way to the end, and then backtrack until it can match 1, leaving you with 1010000000001.
.*? is non-greedy. * will match nothing, but then will try to match extra characters until it matches 1, eventually matching 101.
All quantifiers have a non-greedy mode: .*?, .+?, .{2,6}?, and even .??.
In your case, a similar pattern could be <([^>]*)> - matching anything but a greater-than sign (strictly speaking, it matches zero or more characters other than > in-between < and >).
See Quantifier Cheat Sheet.
Let's say you have:
<a></a>
<(.*)> would match a></a where as <(.*?)> would match a.
The latter stops after the first match of >. It checks for one
or 0 matches of .* followed by the next expression.
The first expression <(.*)> doesn't stop when matching the first >. It will continue until the last match of >.

What do 'lazy' and 'greedy' mean in the context of regular expressions?

What are these two terms in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
Greedy quantifier
Lazy quantifier
Description
*
*?
Star Quantifier: 0 or more
+
+?
Plus Quantifier: 1 or more
?
??
Optional Quantifier: 0 or 1
{n}
{n}?
Quantifier: exactly n
{n,}
{n,}?
Quantifier: n or more
{n,m}
{n,m}?
Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy i.e lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o output: stackoverflow
lazy reg expression : s.*?o output: stackoverflow
Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.
As far as I know, most regex engine is greedy by default. Add a question mark at the end of quantifier will enable lazy match.
As #Andre S mentioned in comment.
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String args[]){
String money = "100000000999";
String greedyRegex = "100(0*)";
Pattern pattern = Pattern.compile(greedyRegex);
Matcher matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
}
String lazyRegex = "100(0*?)";
pattern = Pattern.compile(lazyRegex);
matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
}
}
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken From www.regular-expressions.info
Greediness: Greedy quantifiers first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: Lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.
From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? at the end of the pattern.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS sudden becomes non-greedy - and return as little as possible: i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-needy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything - as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means it will consume your pattern until there are none of them left and it can look no further.
Lazy will stop as soon as it will encounter the first pattern you requested.
One common example that I often encounter is \s*-\s*? of a regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is classified as greedy because of * and will look as many white spaces as possible after the digits are encountered and then look for a dash character "-". Where as the second \s*? is lazy because of the present of *? which means that it will look the first white space character and stop right there.
Best shown by example. String. 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the 1st octet but is actually matches against the whole string. Why? Because the.+ is greedy and a greedy match matches every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Now it starts to backtrack one character at a time until it finds a match for the 3rd token (\b).
If the string a 4GB text file and 192.168.1.1 was at the start you could easily see how this backtracking would cause an issue.
To make a regex non greedy (lazy) put a question mark after your greedy search e.g
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.
To give extra clarification on Laziness, here is one example which is maybe not intuitive on first look but explains idea of "gradually expands the match" from Suganthan Madhavan Pillai answer.
input -> some.email#domain.com#
regex -> ^.*?#$
Regex for this input will have a match. At first glance somebody could say LAZY match(".*?#") will stop at first # after which it will check that input string ends("$"). Following this logic someone would conclude there is no match because input string doesn't end after first #.
But as you can see this is not the case, regex will go forward even though we are using non-greedy(lazy mode) search until it hits second # and have a MINIMAL match.
try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // ""