RegEx - Extracting similar groups - regex

I have to interpret a bunch of files, where each line stands for some maximum float value.
{...}
SomeMaximumVal: 630.0 (AB300: 420.0) (AB301: 220.0)
SomeOtherMaximumVal: 610.0 (AB300: 410.0) (AB301: 210.0)
{...}
A single line can contain just a common value, e.g.
SomeMaximumVal: 630.0
or a common value and one application specific value, e.g.
SomeMaximumVal: 630.0 (AB300: 420.0)
or a common value and more than one application specific value, e.g.
SomeMaximumVal: 630.0 (AB300: 420.0) (AB301: 220.0)
or no common value and just one or more application-specific values, e.g.
SomeMaximumVal: (AB300: 420.0) (AB301: 220.0)
Now I'd like to extract those values via the regular expression
\s*(?:(\S*)\s*:\s*([0-9\.-]*)(?:\s*\(\s*(\S*)\s*:\s*([0-9\.-]+)\)))
but e.g. the results for the file
SomeMaximumVal: 630.0 (AB300: 420.0) (AB301: 220.0)
SomeOtherMaximumVal: 610.0 (AB300: 410.0) (AB301: 210.0)
are:
Match 1
Full match 0-36 SomeMaximumVal: 630.0 (AB300: 420.0)
Group 1. 0-14 SomeMaximumVal
Group 2. 16-21 630.0
Group 3. 23-28 AB300
Group 4. 30-35 420.0
Match 2
Full match 52-94 SomeOtherMaximumVal: 610.0 (AB300: 410.0)
Group 1. 53-72 SomeOtherMaximumVal
Group 2. 74-79 610.0
Group 3. 81-86 AB300
Group 4. 88-93 410.0
which includes only the first application-specific value on each line.
The question is: How can I extend the RegEx to include the further values, too?

You may use
(\w+)\s*:\s*(-?[0-9]+(?:\.[0-9]+)?)
See the regex demo.
Details
(\w+) - Group 1: one or more word chars
\s*:\s* - a colon enclosed with 0+ whitespaces
(-?[0-9]+(?:\.[0-9]+)?) - Group 2:
-? - 1 or 0 hyphens
[0-9]+ - 1 or more digits
(?:\.[0-9]+)? - 1 or 0 occurrences of:
\. - a dot
[0-9]+ - 1 or more digits.
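As a rough illustration of how this pattern could be applied, here is a minimal Python sketch. It assumes Python's `re` module (the question does not name a regex flavor), and `extract_pairs` is a hypothetical helper name. Because the pattern matches any single `name: number` pair, collecting all matches on a line returns the common value together with every application-specific value.

```python
import re

# Pattern from the answer above: a word, a colon, then an optionally negative number.
PAIR = re.compile(r"(\w+)\s*:\s*(-?[0-9]+(?:\.[0-9]+)?)")

def extract_pairs(line):
    """Return every (name, value) pair found on one line (hypothetical helper)."""
    return [(name, float(value)) for name, value in PAIR.findall(line)]

print(extract_pairs("SomeMaximumVal: 630.0 (AB300: 420.0) (AB301: 220.0)"))
print(extract_pairs("SomeMaximumVal: (AB300: 420.0) (AB301: 220.0)"))
```

Note that on a line without a common value the line label itself is not returned, because no number directly follows its colon. If the label is needed, it would have to be read separately, for example from the text before the first colon.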

Related

Substitute one group with another group

<P1 x="-0,36935" y="0,26315"/><P2 x="4,29731" y="0,26315"/><P3 x="5,29731" y="-0,40351"/><P4 x="-0,36935" y="-0,40351"/>
<P1 x="4,64065" y="0,26315"/><P2 x="5,97398" y="0,26315"/><P3 x="5,30731" y="-0,40351"/><P4 x="4,64065" y="-0,40351"/>
I want to copy the x value of P3 into the x value of P2.
So far I have a somewhat working solution, though it is not the prettiest:
((?:P2 x=\")(.*?[^\"]+)(?=.+((?:P3 x=\")(.*?[^\"]+))))
It forces me to use P2 x="\4 as the substitution instead of simply \4.
https://regex101.com/r/iua3p0/1
What I am trying to do is separate Group 2 from Group 1 and Group 4 from Group 3,
in a way that still lets me substitute the value of Group 4 for Group 2.
You may try this regex:
(?<=<P2 x=")[\d,-]+(?=.*<P3 x="([\d,-]+)")
Substitution:
\1
(?<=<P2 x=") // positive lookbehind, the following rule must be preceded by <P2 x="
[\d,-]+ // a set of characters formed by digits, comma and -
(?=.*<P3 x="([\d,-]+)") // positive lookahead, find the x value of P3 and store it in group 1
See the proof
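To see the substitution in action, here is a minimal Python sketch. It assumes Python's `re` module, which accepts this pattern because the lookbehind has a fixed width; `copy_p3_x_into_p2` is a hypothetical helper name. Only the P2 x value is consumed by the match, so replacing it with `\1` is enough.

```python
import re

# Pattern from the answer: match only P2's x value and capture P3's x value
# (found by the lookahead) in group 1.
P2_X = re.compile(r'(?<=<P2 x=")[\d,-]+(?=.*<P3 x="([\d,-]+)")')

def copy_p3_x_into_p2(line):
    """Replace the x value of P2 with the x value of P3 (hypothetical helper)."""
    return P2_X.sub(r"\1", line)

line = ('<P1 x="-0,36935" y="0,26315"/><P2 x="4,29731" y="0,26315"/>'
        '<P3 x="5,29731" y="-0,40351"/><P4 x="-0,36935" y="-0,40351"/>')
print(copy_p3_x_into_p2(line))
```

Since `.` does not match a newline by default, the lookahead only finds a P3 on the same line as the P2 it follows.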

How to create a matching regex pattern for "greater than 10-000-000 and lower than 150-000-000"?

I'm trying to make
09-546-943
fail against the regex pattern below.
^[0-9]{2,3}[- ]{0,1}[0-9]{3}[- ]{0,1}[0-9]{3}$
The passing criteria are:
greater than 10-000-000 or 010-000-000, and
less than 150-000-000
The example "09-546-943" currently passes, but it should fail.
How can I write a regex that rejects this example?
You may use
^(?:(?:0?[1-9][0-9]|1[0-4][0-9])-[0-9]{3}-[0-9]{3}|150-000-000)$
See the regex demo.
The pattern was partially generated with an online number range regex generator: I set the minimum to 10 and the maximum to 150, then merged the branches that match 1-8 and 9 (the tool does a bad job here). I then added 0? before the two-digit numbers to allow an optional leading 0, appended -[0-9]{3}-[0-9]{3} to the 10-149 part, and -000-000 to 150.
Details
^ - start of string
(?: - start of a container non-capturing group making the anchors apply to both alternatives:
(?:0?[1-9][0-9]|1[0-4][0-9]) - an optional 0 followed by a number from 10 to 99, or 1 followed by a digit from 0 to 4 and then any digit (100 to 149)
-[0-9]{3}-[0-9]{3} - a hyphen and three digits, repeated twice (equivalent to (?:-[0-9]{3}){2})
| - or
150-000-000 - a 150-000-000 value
) - end of the non-capturing group
$ - end of string.
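A quick way to check the boundaries is a small Python sketch like the one below. It assumes Python's `re` module, and `in_range` is a hypothetical helper name. As the details above describe, this pattern accepts both endpoints, 10-000-000 and 150-000-000.

```python
import re

# Pattern from the answer above, unchanged.
RANGE = re.compile(
    r"^(?:(?:0?[1-9][0-9]|1[0-4][0-9])-[0-9]{3}-[0-9]{3}|150-000-000)$"
)

def in_range(value):
    """True if value lies between (0)10-000-000 and 150-000-000 (hypothetical helper)."""
    # fullmatch also rejects a trailing newline, which a bare $ would tolerate.
    return RANGE.fullmatch(value) is not None

for sample in ["09-546-943", "10-000-000", "010-000-000",
               "149-999-999", "150-000-000", "150-000-001"]:
    print(sample, in_range(sample))
```

If the endpoints must be excluded, they could be rejected with a separate comparison before or after the regex check.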
This expression, or a slightly modified version of it, might work:
^[1][0-4][0-9]-[0-9]{3}-[0-9]{3}$|^[1][0]-[0-9]{3}-[0-9]{2}[1-9]$
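It also rejects 10-000-000 and 150-000-000. Note that, as written, it only covers 100-000-000 to 149-999-999 plus 10-xxx-xxx values whose last digit is not 0, so values such as 11-000-000 or 10-000-010 are rejected as well.
The demo explains the expression, if you are interested.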
This pattern:
((0?[1-9])|(1[0-4]))[0-9]-[0-9]{3}-[0-9]{3}
matches the range from (0)10-000-000 to 149-999-999 inclusive. To keep the regex simple, you may need to handle the extremes ((0)10-000-000 and 150-000-000) separately, depending on whether you need them included or excluded. The pattern is unanchored, so wrap it in ^ and $ when validating a whole string.
Test here.
This regex:
((0?[1-9])|(1[0-4]))[0-9][- ]?[0-9]{3}[- ]?[0-9]{3}
also accepts a space or nothing in place of each -.
Test here.

Replace nested double brace pair with single [duplicate]

Any ideas how to replace:
..((....))..
With:
..(...)..
Be aware, it is not a straight replace of "((" with "(". The expression must determine that the child brace pair being removed is contained directly within the parent pair, with no other content.
Bonus points if anyone can figure out how to make it work recursively, e.g. "(((...)))" to "(...)"
You can use this:
([(]*)(?:\([^)]*\))([)]*)
You then strip the same number of parentheses from both sides: if group 1 and group 2 have the same length, replace both with an empty string; otherwise strip only as many as the shorter of the two contains.
Test:
(ABC)
((ABC))
(((ABC)))
((ABC)a)
Match Information:
Match 1
Full match 0-5 `(ABC)`
Group 1. 0-0 ``
Group 2. 5-5 ``
--> Hence, no update required
Match 2
Full match 6-13 `((ABC))`
Group 1. 6-7 `(`
Group 2. 12-13 `)`
--> As Group 1 and Group 2 have the same size, replace both with '', resulting in '(ABC)'
Match 3
Full match 14-23 `(((ABC)))`
Group 1. 14-16 `((`
Group 2. 21-23 `))`
--> Same in this case as well
Match 4
Full match 24-30 `((ABC)`
Group 1. 24-25 `(`
Group 2. 30-30 ``
--> As Group 1 and Group 2 differ in size, reduce to the smaller one, which is Group 2 (size 0); hence no update is required, leaving '((ABC)a)'
Demo
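The "strip the smaller count from both sides" rule can be sketched in Python with a replacement callback. This is only an illustration under assumptions: it uses Python's `re` module, `collapse` is a hypothetical helper name, and it assumes the innermost content contains no parentheses of its own (mixed nesting such as `((a(b))c)` is not handled correctly by this pattern).

```python
import re

# Pattern from the answer: extra opening parens, one innermost pair, extra closing parens.
NESTED = re.compile(r"([(]*)(?:\([^)]*\))([)]*)")

def collapse(text):
    """Remove redundant wrapping parentheses around each innermost pair (hypothetical helper)."""
    def strip_redundant(match):
        # Only parens present on BOTH sides are redundant, so use the smaller count.
        extra = min(len(match.group(1)), len(match.group(2)))
        whole = match.group(0)
        return whole[extra:len(whole) - extra]
    return NESTED.sub(strip_redundant, text)

for sample in ["(ABC)", "((ABC))", "(((ABC)))", "((ABC)a)"]:
    print(sample, "->", collapse(sample))
```

Because the leading and trailing groups capture any number of parentheses, the recursive case from the question, "(((...)))" to "(...)", is covered in a single pass.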

Why is this regex performing partial matches?

I have the following raw data:
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...
I'm using this regex to remove duplicates:
([^.]+)(.[ ]*\1)+
which results in the following:
1.2.4.5.9.115.16.19 ...
The problem is how the regex handles 1.1 in the substring .11.15. What should be 9.11.15.16 becomes 9.115.16. How do I fix this?
The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.
The regex is being used within Oracle's REGEXP_REPLACE
The decimal point is a delimiter. I've tried commas and pipes but that doesn't fix the problem.
Oracle's REGEX does not work the way you intended. You could split the string and find distinct rows using the general method described in "Splitting string into multiple rows in Oracle". Another option is to use XMLTABLE, which works for numbers and also for strings with proper quoting.
SELECT LISTAGG(n, '.') WITHIN GROUP (ORDER BY n) AS n
FROM (
  SELECT DISTINCT TO_NUMBER(column_value) AS n
  FROM XMLTABLE(replace('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
);
Demo
Unfortunately, Oracle doesn't provide a token that matches a word boundary position: neither the familiar \b nor the older [[:<:]] and [[:>:]].
But on this specific set you can use:
(\d+\.)(\1)+
Note: you forgot to escape the dot.
Your regex caught:
a 1 - the second digit in 11,
then a dot,
and finally 1 - the first digit in 15.
So your regex failed to catch the whole sequence of digits.
The most natural way to write a regex catching the whole sequence of digits would be to use:
a lookbehind for either the start of the string or a dot,
then catch a sequence of digits,
and finally a lookahead for a dot.
But as I am not sure whether Oracle supports lookarounds, I wrote the regex another way:
(^|\.)(\d+)(\.(\2))+
Details:
(^|\.) - Either start of the string or a dot (group 1), instead of the lookbehind.
(\d+) - A sequence of digits (group 2).
( - Start of group 3, containing:
\.(\2) - A dot and the same sequence of digits that group 2 caught.
)+ - End of group 3, it may occur multiple times.
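The pattern can be tried out with a short Python sketch. Two things here are assumptions rather than part of the answer: Python's `re` module stands in for Oracle's engine, and the replacement `\1\2` (leading delimiter plus one copy of the digits) is a guess, since the answer does not state one. `dedupe` is a hypothetical helper name.

```python
import re

# Pattern from the answer above, unchanged.
DUPES = re.compile(r"(^|\.)(\d+)(\.(\2))+")

def dedupe(values):
    """Collapse runs of equal dot-separated numbers (hypothetical helper)."""
    # Assumed replacement: keep the leading delimiter (group 1)
    # and a single copy of the digits (group 2).
    return DUPES.sub(r"\1\2", values)

print(dedupe("1.1.2.2.4.4.4.5.5.9.11.15.16.16.19"))
```

On the sample data this keeps 11 and 15 apart. Nothing in the pattern checks what follows the repeated digits, though, so in this sketch an input such as 1.15 would still collapse to 15.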
Group the repeating pattern and remove it
As revo has indicated, a big source of your difficulties came with not escaping the period. In addition, the resulting string having a 115 included can be explained as follows (Valdi_Bo made a similar observation earlier):
([^.]+)(.[ ]*\1)+ will match inside 11.15 as follows (the same happens with the dot escaped, as in this query):
SELECT
  '11.15' val,
  regexp_replace('11.15','([^.]+)(\.[ ]*\1)+','\1') deduplicated
FROM
  dual;

VAL   DEDUPLICATED
11.15 115
Here is a similar approach to address those problems:
matching pattern composition
- Look for a non-period matching list of length 0 to N (the subexpression is referenced by \1).
'19' which matches ([^.]*)
- Look for the repeats, which form our second matching list, associated with subexpression 2 and referenced by \2.
'19.19.19' which matches ([^.]*)([.]\1)+
- Look for either a period or the end of the string. This matching list is referenced by \3. It prevents '11.15' from being turned into '115'.
([.]|$)
replacement string
I replace the matched pattern with a replacement string composed of the first instance of the non-period matching list, followed by the trailing period (if any).
\1\3
Solution
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')
Here is an example using some permutations of your examples:
WITH tst AS (
  SELECT
    '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val
  FROM
    dual
  UNION ALL
  SELECT
    '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val
  FROM
    dual
  UNION ALL
  SELECT
    '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val
  FROM
    dual
) SELECT
  val,
  regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3') deduplicate
FROM
  tst;

VAL                                               DEDUPLICATE
------------------------------------------------  ---------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19                1.2.4.5.9.11.15.16.19
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19  1.2.4.5.9.11.15.16.19
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19          1.2.4.5.9.11.15.16.19
My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).
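For readers who want to experiment outside the database, the same pattern and replacement can be tried in Python. This is a sketch under assumptions: Python's `re` module is not Oracle's regex engine, so it only illustrates the matching logic, and `deduplicate` is a hypothetical helper name. Like the query above, it relies on the values being sorted.

```python
import re

# Same pattern as the regexp_replace call above:
# a value, one or more ".value" repeats, then a period or end of string.
DUPES = re.compile(r"([^.]*)([.]\1)+([.]|$)")

def deduplicate(val):
    """Keep the first instance of each repeated value (hypothetical helper)."""
    # \1 is the value itself, \3 restores the trailing period when there is one.
    return DUPES.sub(r"\1\3", val)

samples = [
    "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19",
    "1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19",
    "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19",
]
for val in samples:
    print(deduplicate(val))
```

Requiring a period or the end of the string after the repeats is what stops the leading 1 of 15 from being treated as a repeat of the trailing 1 of 11.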

Extract filename and id from its name

I have a file with text
# co2a0000123.rd
# co2c0000124.rd
I need a regex that captures co2a0000123 in group 1 and the letter after co2 (a or c) in group 2.
I have tried
(\B[a|c])([a-z0-9]+).(?:[a-z]+)
The part ([a-z0-9]+).(?:[a-z]+) on its own gives co2a0000123 in group 1 as desired, but as soon as I add (\B[a|c]) at the beginning or end, group 1 changes from co2a0000123 to co2a and group 2 gives 'a'.
Try for example \s(\w+?([ac])\w*)\.
Group 1 will be the part between a space and a dot.
Group 2 will be the first a or c within Group 1, excluding its first character.
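Here is a minimal Python sketch of that suggestion, assuming Python's `re` module (the question does not name a flavor); `extract_ids` is a hypothetical helper name. The lazy `\w+?` must consume at least one character, which is why the leading c of co2c is skipped and the later c is captured.

```python
import re

# Pattern from the answer: group 1 is the name between the space and the dot,
# group 2 is the first a or c after the name's first character.
NAME = re.compile(r"\s(\w+?([ac])\w*)\.")

def extract_ids(text):
    """Return (name, letter) tuples for every file name in text (hypothetical helper)."""
    return NAME.findall(text)

text = "# co2a0000123.rd\n# co2c0000124.rd"
print(extract_ids(text))
```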