Extract filename and id from its name - regex

I have a file with text
# co2a0000123.rd
# co2c0000124.rd
I need to use regex and extract co2a0000123 in group 1 and a or c as highlighted in group 2 of regex expression
I have tried
(\B[a|c])([a-z0-9]+).(?:[a-z]+)
What happens is ([a-z0-9]+).(?:[a-z]+) this part of regex gives co2a0000123 in group 1 as desired but as soon as I add (\B[a|c]) in the beginning or end co2a0000123 changes to co2a in group 1 and gives 'a' in Group 2.

Try for example \s(\w+?([ac])\w*)\.
Group 1 will be the part between a space and a dot.
Group 2 will be the first a or c anywhere except the first letter within Group 1.

Related

Substitute one group with another group

<P1 x="-0,36935" y="0,26315"/><P2 x="4,29731" y="0,26315"/><P3 x="5,29731" y="-0,40351"/><P4 x="-0,36935" y="-0,40351"/>
<P1 x="4,64065" y="0,26315"/><P2 x="5,97398" y="0,26315"/><P3 x="5,30731" y="-0,40351"/><P4 x="4,64065" y="-0,40351"/>
I want to put a value of P3(x) into P2(x).
So far I have a somewhat working solution, that is not the prettiest;
((?:P2 x=\")(.*?[^\"]+)(?=.+((?:P3 x=\")(.*?[^\"]+))))
It forces me to use P2 x="\4 substitution instead of simply \4
https://regex101.com/r/iua3p0/1
What I am trying is to separate Group 2 from Group 1, and Group 4 from Group 3,
that at the same time would allow me to use value of Group 4 in Group 2
You may try this regex:
(?<=<P2 x=")[\d,-]+(?=.*<P3 x="([\d,-]+)")
Substitution:
\1
(?<=<P2 x=") // positive lookbehind, the following rule must be preceded by <P2 x="
[\d,-]+ // a set of characters formed by digits, comma and -
(?=.*<P3 x="([\d,-]+)) // positive lookahead, find the x value of P3 and store it in group 1
See the proof

Regex - Match n occurences of substring within any m-lettered window

I am facing some issues forming a regex that matches at least n times a given pattern within m characters of the input string.
For example imagine that my input string is:
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
I want to detect all cases where an 1 appears at least 7 times (not necessarily consecutively) in the input string, but within a window of up to 20 characters.
So far I have built this expression:
(1[^1]*?){7,}
which detects all cases where an 1 appears at least 7 times in the input string, but this now matches both the:
11000000011101111
and the
1100000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011
parts whereas I want only the first one to be kept, as it is within a substring composed of less than 20 characters.
It tried to combine the aforementioned regex with:
(?=(^[01]{0,20}))
to also match only parts of the string containing either an '1' or a '0' of length up to 20 characters but when I do that it stops working.
Does anyone have an idea gow to accomplish this?
I have put this example in regex101 as a quick reference.
Thank you very much!
This is not something that can be done with regex without listing out every possible string. You would need to iterate over the string instead.
You could also iterate over the matches. Example in Python:
import re
matches = re.finditer(r'(?=((1[^1]*?){7}))', string)
matches = [match.group(1) for match in matches if len(match.group(1)) <= 20]
The next Python snippet is an attempt to get the desired sequences using only the regular expression.
import re
r = r'''
(?mx)
( # the 1st capturing group will contain the desired sequence
1 # this sequence should begin with 1
(?=(?:[01]{6,19}) # let's see that there are enough 0s and 1s in a line
(.*$)) # the 2nd capturing group will contain all characters to the end of a line
(?:0*1){6}) # there must be six more 1s in the sequence
(?=.{0,13} # complement the 1st capturing group to 20 characters
\2) # the rest of a line should be 2nd capturing group
'''
s = '''
0000000
101010101010111111100000000000001
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
1111111
111111
'''
print([m.group(1) for m in re.finditer(r, s)])
Output:
['1010101010101', '11111100000000000001', '110000000111011', '1111111']
You can find an exhaustive explanation of this regular expression on RegEx101.

Replace nested double brace pair with single [duplicate]

This question already has answers here:
Can regular expressions be used to match nested patterns? [duplicate]
(11 answers)
Closed 4 years ago.
Any ideas how to replace:
..((....))..
With:
..(...)..
Be aware, it is not a straight up replace of "((" with "(". The expression must determine that the child brace pair being removed is contained directly with the parent pair, with no other content.
Bonus points if anyone can figure out how to function recursively, e.g. "(((...)))" to "(...)"
You can use this:
([(]*)(?:\([^)]*\))([)]*)
You just need to replace groups with empty string if even first group size is equal to second group or else use the minimum one.
Test:
(ABC)
((ABC))
(((ABC)))
((ABC)a)
Match Information:
Match 1
Full match 0-5 `(ABC)`
Group 1. 0-0 ``
Group 2. 5-5 ``
--> Hence, no update required
Match 2
Full match 6-13 `((ABC))`
Group 1. 6-7 `(`
Group 2. 12-13 `)`
--> As Group 1 and Group 2 size is same, replace those values with '' resulting to '(ABC)
Match 3
Full match 14-23 `(((ABC)))`
Group 1. 14-16 `((`
Group 2. 21-23 `))`
--> Same in this case as well
Match 4
Full match 24-30 `((ABC)`
Group 1. 24-25 `(`
Group 2. 30-30 ``
--> As group 1 and group 2 are not of same size, reduce to the min one which is group 2 (size 0) and hence no update required leaving it to '((ABC)A)'
Demo

RegEx multiple capture groups replaced in a string

I have a string of data...
"123456712J456","D","TEST1~TEST2~TEST3~TEST4~TEST5"
I want to take the following string and make 5 strings.
"123456712J456","D","TEST1"
"123456712J456","D","TEST2"
"123456712J456","D","TEST3"
"123456712J456","D","TEST4"
"123456712J456","D","TEST5"
I currently have the following regex...
//In a program like Textpad
<FIND> "\(.\{13\}\)","D","\([^~]*\)~\(.*\)
<REPLACE> "\1","D","\2"\n"\1","D","\3
//On the regex101 site
"(.{13})","D","([^~]*)~(.*)
Now if I run this 5 times it would work fine. The problem is there is an unknown number of lines to be made. For example...
"123456712J456","D","TEST1~TEST2~TEST3~TEST4~TEST5"
"123456712J457","D","TEST1~TEST2~TEST3"
"123456712J458","D","TEST1~TEST2"
"123456712J459","D","TEST1~TEST2~TEST3~TEST4"
I was hoping to be able to use a MULTI capture group to make this work. I found this PAGE talking about the common mistake between repeating a capturing group and capturing a repeated group. I need to capture a repeated group. For some reason I just could not make mine work right though. Anyone else have an idea?
RESOURCES:
http://www.regular-expressions.info/captureall.html
http://regex101.com/
Try this.See demo.Just club match1 and rest of the matches.
http://regex101.com/r/yR3mM3/17
RegEx:
(.*,)|([^"~]+)
Example:
"1234567123456","T","TEST1~TEST2~TEST3~TEST4~TEST5"
Results:
MATCH 1
1. [0-20] `"1234567123456","T",`
MATCH 2
2. [21-26] `TEST1`
MATCH 3
2. [27-32] `TEST2`
MATCH 4
2. [33-38] `TEST3`
MATCH 5
2. [39-44] `TEST4`
MATCH 6
2. [45-50] `TEST5`

Matching a group that may or may not exist

My regex needs to parse an address which looks like this:
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
-------------------- ----- -------- -----
1 2 3 4*
Groups one, two and three will always exist in an address. Group 4 may not exist. I've written a regex that helps me get the first, second and third part but I would also need the fourth part. Part 4 is the country name and can either be FINLAND or SUOMI. If the fourth part didn't exist in an address the fourth group would be empty. This is my regex so far but the third group captures the country too. Any help?
(.*?)\s(\d{5})\s(.*)$
(I'm going to be using this Oracles REGEXP function)
Change the regex to:
(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$
Making group three none greedy will let you match the optional space + country choices. If group 4 doesn't match I think it will be uninitialized rather than blank, that depends on language.
To match a character (or in your case group) that may or may not exist, you need to use ? after the character/subpattern/class in question. I'm answering now because RegEx is complicated and should be explained: only posting the fix without the answer isn't enough!
A question mark matches zero or one of the preceding character, class, or subpattern. Think of this as "the preceding item is optional". For example, colou?r matches both color and colour because the "u" is optional.
Above quote from http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm
Try this:
(.*?)\s(\d{5})\s(.*?)\s?([^\s]*)?$
This will match your input more tightly and each of your groups is in its own regex group:
(\w+\s\d+\s\w\s\d+)\s(\d+)\s(\w+)\s(\w*)
or if space is OK instead of "whitespace":
(\w+ \d+ \w \d+) (\d+) (\w+) (\w*)
Group 1: BLOOKKOKATU 20 A 773
Group 2: 00810
Group 3: HELSINKI
Group 4: SUOMI (optional - doesn't have to match)
(.*?)\s(\d{5})\s(\w+)\s(\w*)
An example:
SQL> with t as
2 ( select 'BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI' text from dual
3 )
4 select text
5 , regexp_replace(text,'(.*?)\s(\d{5})\s(\w+)\s(\w*)','\1**\2**\3**\4') new_text
6 from t
7 /
TEXT
-----------------------------------------
NEW_TEXT
-----------------------------------------------------------------------------------------
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
BLOOKKOKATU 20 A 773**00810**HELSINKI**SUOMI
1 row selected.
Regards,
Rob.