Substitute one group with another group - regex

<P1 x="-0,36935" y="0,26315"/><P2 x="4,29731" y="0,26315"/><P3 x="5,29731" y="-0,40351"/><P4 x="-0,36935" y="-0,40351"/>
<P1 x="4,64065" y="0,26315"/><P2 x="5,97398" y="0,26315"/><P3 x="5,30731" y="-0,40351"/><P4 x="4,64065" y="-0,40351"/>
I want to put the x value of P3 into the x attribute of P2.
So far I have a somewhat working solution, though it is not the prettiest:
((?:P2 x=\")(.*?[^\"]+)(?=.+((?:P3 x=\")(.*?[^\"]+))))
It forces me to use P2 x="\4 as the substitution instead of simply \4.
https://regex101.com/r/iua3p0/1
What I am trying to do is separate Group 2 from Group 1 and Group 4 from Group 3,
in a way that still lets me use the value of Group 4 in place of Group 2.

You may try this regex:
(?<=<P2 x=")[\d,-]+(?=.*<P3 x="([\d,-]+)")
Substitution:
\1
(?<=<P2 x=") // positive lookbehind: the following rule must be preceded by <P2 x="
[\d,-]+ // a set of characters formed by digits, comma and -
(?=.*<P3 x="([\d,-]+)") // positive lookahead: find the x value of P3 and store it in group 1
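For anyone who wants to try this outside regex101, here is a minimal sketch in Python, assuming Python's re module (the question does not say which engine is used). The function name copy_p3_into_p2 is made up for illustration. It works in Python because the lookbehind has a fixed width.

```python
import re

# Pattern from the answer: match only the x value of P2 (lookbehind keeps
# '<P2 x="' out of the match) and capture the x value of P3 in group 1.
PATTERN = re.compile(r'(?<=<P2 x=")[\d,-]+(?=.*<P3 x="([\d,-]+)")')


def copy_p3_into_p2(line: str) -> str:
    """Replace the x value of P2 with the x value of P3 on a single line."""
    return PATTERN.sub(r"\1", line)


line = ('<P1 x="-0,36935" y="0,26315"/><P2 x="4,29731" y="0,26315"/>'
        '<P3 x="5,29731" y="-0,40351"/><P4 x="-0,36935" y="-0,40351"/>')
print(copy_p3_into_p2(line))
```

Since . does not match a newline by default, each line of a multi-line input is handled on its own.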

Replace nested double brace pair with single [duplicate]

Any ideas how to replace:
..((....))..
With:
..(...)..
Be aware, it is not a straight replace of "((" with "(". The expression must determine that the child brace pair being removed is contained directly within the parent pair, with no other content.
Bonus points if anyone can figure out how to make it work recursively, e.g. "(((...)))" to "(...)"
You can use this:
([(]*)(?:\([^)]*\))([)]*)
For each match, compare the sizes of the first and second groups: if they are equal, replace both groups with an empty string; otherwise remove only as many braces from each side as the smaller group contains.
Test:
(ABC)
((ABC))
(((ABC)))
((ABC)a)
Match Information:
Match 1
Full match 0-5 `(ABC)`
Group 1. 0-0 ``
Group 2. 5-5 ``
--> Hence, no update required
Match 2
Full match 6-13 `((ABC))`
Group 1. 6-7 `(`
Group 2. 12-13 `)`
--> As Group 1 and Group 2 are the same size, replace those values with '', resulting in '(ABC)'
Match 3
Full match 14-23 `(((ABC)))`
Group 1. 14-16 `((`
Group 2. 21-23 `))`
--> Same here: both groups are the same size, so replace them with '', resulting in '(ABC)'
Match 4
Full match 24-30 `((ABC)`
Group 1. 24-25 `(`
Group 2. 30-30 ``
--> Group 1 and Group 2 are not the same size, so reduce to the smaller one, which is Group 2 (size 0). Hence no update is required, leaving '((ABC)a)'
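The rule described above can be sketched in Python roughly like this. It is only an illustration, not the answerer's code: the function name collapse is made up, and the middle group is turned into a capturing group (an assumption on my part) so the replacement callback can put the inner pair back.

```python
import re

# Same shape as the regex in the answer, but the inner "(...)" is captured
# so it can be re-emitted by the callback.
PATTERN = re.compile(r"([(]*)(\([^)]*\))([)]*)")


def collapse(text: str) -> str:
    """Drop redundant brace pairs that directly wrap another pair."""
    def repl(match: re.Match) -> str:
        opens, inner, closes = match.groups()
        # Only min(len(opens), len(closes)) pairs are truly redundant.
        redundant = min(len(opens), len(closes))
        return opens[redundant:] + inner + closes[redundant:]

    return PATTERN.sub(repl, text)


for sample in ["(ABC)", "((ABC))", "(((ABC)))", "((ABC)a)"]:
    print(sample, "->", collapse(sample))
```

Because the leading and trailing groups take every adjacent brace in one go, the nested case "(((...)))" is handled in a single pass.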

Why is this regex performing partial matches?

I have the following raw data:
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...
I'm using this regex to remove duplicates:
([^.]+)(.[ ]*\1)+
which results in the following:
1.2.4.5.9.115.16.19 ...
The problem is how the regex handles 1.1 in the substring .11.15. What should be 9.11.15.16 becomes 9.115.16. How do I fix this?
The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.
The regex is being used within Oracle's REGEXP_REPLACE
The decimal is a delimiter. I've tried commas and pipes but that doesn't fix the problem.
Oracle's regex engine does not work the way you intended here. You could split the string and find the distinct rows using the general method described in "Splitting string into multiple rows in Oracle". Another option is to use XMLTABLE, which works for numbers and also for strings with proper quoting.
SELECT LISTAGG(n, '.') WITHIN GROUP (ORDER BY n) AS n
FROM (
  SELECT DISTINCT TO_NUMBER(column_value) AS n
  FROM XMLTABLE(replace('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
);
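The same split, distinct and re-join idea, sketched outside the database in Python for comparison. This is not Oracle code; the function name distinct_sorted is hypothetical, and it assumes every token is an integer, as in the sample data.

```python
def distinct_sorted(values: str, sep: str = ".") -> str:
    """Split on the delimiter, keep distinct numbers, re-join in numeric order."""
    numbers = {int(token) for token in values.split(sep)}
    return sep.join(str(n) for n in sorted(numbers))


print(distinct_sorted("1.1.2.2.4.4.4.5.5.9.11.15.16.16.19"))
```

Unlike the regex approaches, this does not depend on the duplicates being adjacent.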
Unfortunately, Oracle doesn't provide a token to match a word boundary position: neither the familiar \b token nor the older [[:<:]] and [[:>:]].
But on this specific set you can use:
(\d+\.)(\1)+
Note: you forgot to escape the dot.
Your regex caught:
a 1 - the second digit in 11,
then a dot,
and finally 1 - the first digit in 15.
So your regex failed to catch the whole sequence of digits.
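The partial match is easy to reproduce. A small sketch using Python's re module rather than Oracle (so engine details may differ, although the result matches the one in the question):

```python
import re

raw = "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19"

# The regex from the question, with the dot left unescaped.
original = re.compile(r"([^.]+)(.[ ]*\1)+")

# Every substring the regex treats as "a value followed by its duplicates".
matches = [m.group(0) for m in original.finditer(raw)]
print(matches)

# Replacing each match with group 1 glues 11 and 15 together.
result = original.sub(r"\1", raw)
print(result)
```

The second '1.1' in the list of matches is the one taken from the middle of 11.15.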
The most natural way to write a regex catching the whole sequence
of digits would be to use:
a lookbehind for either the start of the string or a dot,
then catch a sequence of digits,
and finally a lookahead for a dot.
But as I am not sure whether Oracle supports lookarounds, I wrote
the regex another way:
(^|\.)(\d+)(\.(\2))+
Details:
(^|\.) - Either the start of the string or a dot (group 1), used instead of the lookbehind.
(\d+) - A sequence of digits (group 2).
( - Start of group 3, containing:
\.(\2) - A dot and the same sequence of digits that group 2 caught.
)+ - End of group 3; it may occur multiple times.
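A quick way to check this pattern is a sketch in Python's re module (not Oracle). The answer does not state the replacement string; \1\2 is my assumption, keeping the leading dot (if any) and one copy of the number. The name dedupe is made up.

```python
import re

# Pattern from the answer above.
DEDUP = re.compile(r"(^|\.)(\d+)(\.(\2))+")


def dedupe(text: str) -> str:
    """Keep the leading delimiter (group 1) and one copy of the number (group 2)."""
    return DEDUP.sub(r"\1\2", text)


print(dedupe("1.1.2.2.4.4.4.5.5.9.11.15.16.16.19"))
```

One caveat: the pattern has no right-hand boundary, so in Python an input such as 1.11 is collapsed to 11. The sample data does not contain that case.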
Group the repeating pattern and remove it
As revo has indicated, a big source of your difficulties came with not escaping the period. In addition, the resulting string having a 115 included can be explained as follows (Valdi_Bo made a similar observation earlier):
Even with the period escaped, ([^.]+)(\.[ ]*\1)+ still matches the 1.1 inside 11.15:
SELECT
  '11.15' val,
  regexp_replace('11.15','([^.]+)(\.[ ]*\1)+','\1') deduplicated
FROM
  dual;
VAL    DEDUPLICATED
11.15  115
Here is a similar approach that addresses those problems:
Matching pattern composition
- Look for a non-period matching list of length 0 to N (the subexpression referenced by \1).
  '19' matches ([^.]*)
- Look for the repeats, which form the second subexpression, referenced by \2.
  '19.19.19' matches ([^.]*)([.]\1)+
- Look for either a period or the end of the string. This is the subexpression referenced by \3. It prevents '11.15' from being rewritten as '115'.
  ([.]|$)
Replacement string
I replace the matched pattern with the first instance of the non-period matching list, followed by whatever \3 matched.
  \1\3
Solution
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')
Here is an example using some permutations of your examples:
WITH tst AS (
  SELECT '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val FROM dual
  UNION ALL
  SELECT '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val FROM dual
  UNION ALL
  SELECT '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val FROM dual
)
SELECT
  val,
  regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3') deduplicate
FROM
  tst;
VAL                                               DEDUPLICATE
------------------------------------------------------------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19                1.2.4.5.9.11.15.16.19
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19  1.2.4.5.9.11.15.16.19
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19          1.2.4.5.9.11.15.16.19
My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).
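For readers without an Oracle instance at hand, the same pattern and replacement can be tried in Python. This is only a sketch under the assumption that Python's re module treats this particular pattern the way REGEXP_REPLACE does; the function name deduplicate is made up.

```python
import re

# Pattern and replacement taken from the regexp_replace call above.
PATTERN = re.compile(r"([^.]*)([.]\1)+([.]|$)")


def deduplicate(val: str) -> str:
    """Collapse runs of identical period-delimited values into one."""
    return PATTERN.sub(r"\1\3", val)


samples = [
    "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19",
    "1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19",
    "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19",
]
for val in samples:
    print(val, "->", deduplicate(val))
```

The trailing ([.]|$) is what stops the 1.1 inside 11.15 from matching: after .1 the next character is 5, which is neither a period nor the end of the string.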

Extract filename and id from its name

I have a file with text
# co2a0000123.rd
# co2c0000124.rd
I need a regex that captures co2a0000123 in group 1 and the a or c after co2 in group 2.
I have tried
(\B[a|c])([a-z0-9]+).(?:[a-z]+)
On its own, the ([a-z0-9]+).(?:[a-z]+) part gives co2a0000123 in group 1, as desired. But as soon as I add (\B[a|c]) at the beginning or end, group 1 changes from co2a0000123 to co2a, and group 2 gives 'a'.
Try for example \s(\w+?([ac])\w*)\.
Group 1 will be the part between a space and a dot.
Group 2 will be the first a or c anywhere except the first letter within Group 1.
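Here is a small check of that pattern, sketched in Python (the question does not name a language, so the choice of Python's re module and the function name extract are my assumptions):

```python
import re

# Group 1: everything between the space and the dot.
# Group 2: the first a or c after the first character of group 1.
PATTERN = re.compile(r"\s(\w+?([ac])\w*)\.")


def extract(line: str):
    """Return (name, letter) for a line like '# co2a0000123.rd', or None."""
    match = PATTERN.search(line)
    return (match.group(1), match.group(2)) if match else None


for line in ["# co2a0000123.rd", "# co2c0000124.rd"]:
    print(extract(line))
```

Nesting group 2 inside group 1 is what lets group 1 keep the whole name while group 2 picks out the single letter.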

RegEx - Extracting similar groups

I have to interpret a bunch of files, where each line stands for some maximum float value.
{...}
SomeMaximumVal: 630.0 (AB300: 420.0) (AB301: 220.0)
SomeOtherMaximumVal: 610.0 (AB300: 410.0) (AB301: 210.0)
{...}
A single line can contain either just a common value, e.g.
SomeMaximumVal: 630.0
or a common value and one application specific value, e.g.
SomeMaximumVal: 630.0 (AB300: 420.0)
or a common value and more than one application specific value, e.g.
SomeMaximumVal: 630.0 (AB300: 420.0) (AB301: 220.0)
or no common value and just one or more application specific values, e.g.
SomeMaximumVal: (AB300: 420.0) (AB301: 220.0)
Now I'd like to extract those values via the regular expression
\s*(?:(\S*)\s*:\s*([0-9\.-]*)(?:\s*\(\s*(\S*)\s*:\s*([0-9\.-]+)\)))
but e.g. the results for the file
SomeMaximumVal: 630.0 (AB300: 420.0) (AB301: 220.0)
SomeOtherMaximumVal: 610.0 (AB300: 410.0) (AB301: 210.0)
are:
Match 1
Full match 0-36 SomeMaximumVal: 630.0 (AB300: 420.0)
Group 1. 0-14 SomeMaximumVal
Group 2. 16-21 630.0
Group 3. 23-28 AB300
Group 4. 30-35 420.0
Match 2
Full match 52-94 SomeOtherMaximumVal: 610.0 (AB300: 410.0)
Group 1. 53-72 SomeOtherMaximumVal
Group 2. 74-79 610.0
Group 3. 81-86 AB300
Group 4. 88-93 410.0
which include only the first of each application specific value.
The question is: How can I extend the RegEx to include the further values, too?
You may use
(\w+)\s*:\s*(-?[0-9]+(?:\.[0-9]+)?)
Details
(\w+) - Group 1: one or more word chars
\s*:\s* - a colon enclosed with 0+ whitespaces
(-?[0-9]+(?:\.[0-9]+)?) - Group 2:
  -? - 1 or 0 hyphens
  [0-9]+ - 1 or more digits
  (?:\.[0-9]+)? - 1 or 0 occurrences of:
    \. - a dot
    [0-9]+ - 1 or more digits.
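The pattern above matches one name/value pair at a time, so the idea is to collect all matches on a line rather than to extend a single match. A minimal sketch, assuming Python's re.findall (the question does not say which language is used; parse_line is a made-up name):

```python
import re

# One match per "name: number" pair, wherever it appears on the line.
PAIR = re.compile(r"(\w+)\s*:\s*(-?[0-9]+(?:\.[0-9]+)?)")


def parse_line(line: str):
    """Return every (name, value) pair found on one line, in order."""
    return [(name, float(value)) for name, value in PAIR.findall(line)]


print(parse_line("SomeMaximumVal: 630.0 (AB300: 420.0) (AB301: 220.0)"))
print(parse_line("SomeMaximumVal: (AB300: 420.0) (AB301: 220.0)"))
```

Note that when a line has no common value, the line's own name (SomeMaximumVal) is not returned, because no number follows its colon. If you need it, it has to be read separately.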

Matching a group that may or may not exist

My regex needs to parse an address which looks like this:
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
-------------------- ----- -------- -----
1                    2     3        4*
Groups one, two and three will always exist in an address. Group 4 may not exist. I've written a regex that helps me get the first, second and third part but I would also need the fourth part. Part 4 is the country name and can either be FINLAND or SUOMI. If the fourth part didn't exist in an address the fourth group would be empty. This is my regex so far but the third group captures the country too. Any help?
(.*?)\s(\d{5})\s(.*)$
(I'm going to be using this with Oracle's REGEXP functions.)
Change the regex to:
(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$
Making group three non-greedy lets the regex match the optional space and country choices after it. If group 4 doesn't match, I think it will be uninitialized rather than blank, but that depends on the language.
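To see how the optional group behaves, here is a sketch in Python rather than Oracle (so what an unmatched group looks like is specific to Python; parse_address is a hypothetical name):

```python
import re

# Regex from the answer above: group 4 (the country) is optional.
ADDRESS = re.compile(r"(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$")


def parse_address(text: str):
    """Return (street, postcode, city, country); country is None if absent."""
    match = ADDRESS.match(text)
    return match.groups() if match else None


print(parse_address("BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI"))
print(parse_address("BLOOKKOKATU 20 A 773 00810 HELSINKI"))
```

In Python the missing country comes back as None, which is the "uninitialized rather than blank" case mentioned above.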
To match a character (or in your case a group) that may or may not exist, you need to use ? after the character, subpattern or class in question. I'm answering now because regex is complicated and should be explained: posting only the fix without an explanation isn't enough!
A question mark matches zero or one of the preceding character, class, or subpattern. Think of this as "the preceding item is optional". For example, colou?r matches both color and colour because the "u" is optional.
Above quote from http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm
Try this:
(.*?)\s(\d{5})\s(.*?)\s?([^\s]*)?$
This will match your input more tightly and each of your groups is in its own regex group:
(\w+\s\d+\s\w\s\d+)\s(\d+)\s(\w+)\s(\w*)
or if space is OK instead of "whitespace":
(\w+ \d+ \w \d+) (\d+) (\w+) (\w*)
Group 1: BLOOKKOKATU 20 A 773
Group 2: 00810
Group 3: HELSINKI
Group 4: SUOMI (optional - doesn't have to match)
(.*?)\s(\d{5})\s(\w+)\s(\w*)
An example:
with t as
( select 'BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI' text from dual
)
select text
, regexp_replace(text,'(.*?)\s(\d{5})\s(\w+)\s(\w*)','\1**\2**\3**\4') new_text
from t
/
TEXT
-----------------------------------------
NEW_TEXT
-----------------------------------------------------------------------------------------
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
BLOOKKOKATU 20 A 773**00810**HELSINKI**SUOMI
1 row selected.
Regards,
Rob.