I had composed almost the whole question when I found the answer, so I am posting it here in Q&A style anyway, because the described behaviour seems surprising to me.
This regex works correctly and splits the string into three parts, a numeric part surrounded by letter parts:
select regexp_replace('abc12345def', '^(.*?)([0-9]+)(.*)$', '{first="\1" second="\2" third="\3"}');
{first="abc" second="12345" third="def"}
However, after removing the ^ and $ anchors I get:
select regexp_replace('abc12345def', '(.*?)([0-9]+)(.*)', '{first="\1" second="\2" third="\3"}');
{first="abc" second="1" third=""}2345def
Because groups 2 and 3 have greedy quantifiers, I expect them to match 12345 and def, respectively, and hence to return the same string. Equivalent Java code behaves this way:
System.out.println("abc12345def".replaceFirst("(.*?)([0-9]+)(.*)", "{first='$1' second='$2' third='$3'}"));
System.out.println("abc12345def".replaceFirst("^(.*?)([0-9]+)(.*)$", "{first='$1' second='$2' third='$3'}"));
{first='abc' second='12345' third='def'}
{first='abc' second='12345' third='def'}
Why doesn't it work?
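In Postgres, greediness is determined for the regular expression as a whole, not per quantifier: a branch takes the greediness of the first quantified atom in it that has a greediness attribute, which here is the non-greedy .*?. Essentially the same example as the one above can be found in the documentation, with an explanation: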
SELECT regexp_match('abc01234xyz', '(.*?)(\d+)(.*)');
Result: {abc,0,""}
That didn't work either, because now the RE as a whole is non-greedy and so it ends the overall match as soon as possible. We can get what we want by forcing the RE as a whole to be greedy:...
The solution is to force the whole regex to be greedy:
select regexp_replace('abc12345def', '(?:(.*?)([0-9]+)(.*)){1,1}', '{first="\1" second="\2" third="\3"}');
{first="abc" second="12345" third="def"}
Slightly counterintuitive to me, but it works.
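For comparison, here is a minimal sketch of the same replacement in Python. Python's re module, like Java, decides greediness per quantifier, so the unanchored pattern already gives the expected result there. The variable names are only illustrative.

```python
import re

text = "abc12345def"
replacement = r'{first="\1" second="\2" third="\3"}'

# Unanchored: in Python the lazy (.*?) does not make the later groups lazy.
unanchored = re.sub(r"(.*?)([0-9]+)(.*)", replacement, text, count=1)

# Anchored: same pattern with ^ and $, as in the first Postgres query.
anchored = re.sub(r"^(.*?)([0-9]+)(.*)$", replacement, text, count=1)

print(unanchored)
print(anchored)
```

Both calls print the same string, which is the behaviour the question expected from Postgres.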
I'm editing some data, and my end goal is to conditionally substitute , (comma) chars with .(dot). I have a crude solution working now, so this question is strictly for suggestions on better methods in practice, and determining what is possible with a regex engine outside of an enhanced programming environment.
I gave it a good college try, but 6 hours is enough mental grind for a Saturday, and I'm throwing in the towel. :)
I've been through about 40 SO posts on regex recursion, substitution, etc, the wiki.org on the definitions and history of regex and regular language, and a few other tutorial sites. The majority is centered around Python and PHP.
The working, crude regex (facilitating loops / search and replace by hand):
(^.*)(?<=\()(.*?)(,)(.*)(?=\))(.*$)
A snip of the input:
room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),
room_ass=01:macro_id=02: name=Right, pgm_audio=1, usb=1, list=(2*,4,6,8,),
room_ass=01:macro_id=03: name=All, pgm_audio=1, list=(1,2*,3,4,5,6,7,8,),
And the desired output:
room_ass=01: macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*.3.5.7.),
room_ass=01: macro_id=02: name=Right, pgm_audio=1, usb=1, list=(2*.4.6.8.),
room_ass=01: macro_id=03: name=All, pgm_audio=1, list=(1.2*.3.4.5.6.7.8.),
That's all. Just replace the , with ., but only inside ( ).
This is one conceptual (not working) method I'd like to see, where the middle group<3> would loop recursively:
(^.*)(?<=\()([^,]*)([,|\d|\*]\3.*)(?=\))(.*$)
( ^ )
..where each recursive iteration would shift across the data, either 1 char or 1 comma at a time:
room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),
iter 1-| ^ |
2-| ^ |
3-| ^ |
4-| ^|
or
A much simpler approach would be to just tell it to mask/select all , between the (), but I struck out on figuring that one out.
I use text editors a lot for little data editing tasks like this, so I'd like to verify that SublimeText can't do it before I dig into Python.
All suggestions and criticisms welcome. Be gentle. <--#n00b
Thanks in advance!
-B
Not much magic needed. Just check whether there's a closing ) ahead without any ( in between.
,(?=[^)(]*\))
See this demo at regex101
However, it does not check for an opening (. It's a common approach, and this is probably a duplicate.
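If the editor route does not work out, the same lookahead can be tried in Python. This is a small sketch run against the sample lines from the question; the names COMMA_BEFORE_CLOSE, lines and converted are made up for the example.

```python
import re

# A comma that is followed, with no parenthesis in between, by a closing ")".
COMMA_BEFORE_CLOSE = re.compile(r",(?=[^)(]*\))")

lines = [
    "room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),",
    "room_ass=01:macro_id=02: name=Right, pgm_audio=1, usb=1, list=(2*,4,6,8,),",
    "room_ass=01:macro_id=03: name=All, pgm_audio=1, list=(1,2*,3,4,5,6,7,8,),",
]

# Replace only the commas inside the parentheses; the others fail the lookahead.
converted = [COMMA_BEFORE_CLOSE.sub(".", line) for line in lines]

for line in converted:
    print(line)
```

The commas before list=( are left alone because the lookahead runs into ( before it can reach ), and the comma after the closing ) has no ) ahead of it on the line.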
This is a complete guess because I don't use SublimeText; the assumption here is that SublimeText uses PCRE regular expressions.
Note that you mention "recursive"; I don't believe you mean Regular Expression Recursion, which doesn't fit the problem here.
Something like this might work...
You'll need to test to make sure this isn't matching other things in your document, and to see if SublimeText even supports this...
This is based on using the \K operator to "keep" what comes before it out of the match; you can find other uses of it as a PCRE alternative (workaround) to variable-length look-behinds not being supported by PCRE.
Regular Expression
\((?:(?:[^,\)]+),)*?(?:[^,\)]+)\K,
Regex Description
Match the opening parenthesis character \(
Match the regular expression below (?:(?:[^,\)]+),)*?
    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) *?
    Match the regular expression below (?:[^,\)]+)
        Match any single character NOT present in the list below [^,\)]+
            Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
            The literal character “,” ,
            The closing parenthesis character \)
    Match the character “,” literally ,
Match the regular expression below (?:[^,\)]+)
    Match any single character NOT present in the list below [^,\)]+
        Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
        The literal character “,” ,
        The closing parenthesis character \)
Keep the text matched so far out of the overall regex match \K
Match the character “,” literally ,
I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work off is the linked image. The creators can't even confirm which flavour of regex it is; they believe it's Python, but I'm unsure.
Regex Image
Update
Unfortunately, none of the patterns below have worked so far. I tested each one in the live system (rather than only in a regex tester, thinking it might behave differently), but they still didn't work.
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
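The two steps above can be sketched in Python as follows, assuming a replace operation is available at all (the question suggests it may not be in the target tool). The names POSITIONER, TARGET and find_ids are invented for the example.

```python
import re

POSITIONER = re.compile(r"\?[0-9]{2}")  # step 1: the ?NN positioners
TARGET = re.compile(r"T[0-9]{7}")       # step 2: T followed by seven digits


def find_ids(line):
    """Strip the positioners, then return every T####### match in the line."""
    cleaned = POSITIONER.sub("", line)
    return TARGET.findall(cleaned)


samples = ["T123?214567", "T?211234567"]
results = [find_ids(s) for s in samples]
print(results)
```

Both sample lines from the question come out as T1234567.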
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
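A rough Python sketch of the concatenation step, assuming at most one positioner per line as stated above; join_groups and the constant name are hypothetical helpers, not part of any tool mentioned in the question.

```python
import re

SPLIT_AROUND_POSITIONER = re.compile(r"(T.*?)\?\d{2}(.*)")


def join_groups(line):
    """Return group 1 + group 2, or None if the line has no positioner."""
    m = SPLIT_AROUND_POSITIONER.search(line)
    if m is None:
        return None
    return m.group(1) + m.group(2)


print(join_groups("T123?214567"))
print(join_groups("T?211234567"))
```

A line without any ?NN sequence does not match this pattern at all, so it would need to be handled separately.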
First, match the ?21 and replace it with a distinctive character, such as #:
\?21
Demo
and you may try this regex to find what you want
(T(?:\d{7}|[\#\d]{8}))\s
Demo, in which the target string is captured in group 1 (or \1).
Finally, replace # with ?21 or something you like.
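A Python script may look like this:
import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre = re.compile(r'\?21')
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
# print the matches with each ?21 shown as #
# (because of the trailing \s, the last line, which has no whitespace after it, is not matched)
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# print the same matches with # turned back into ?21
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))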
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group 1 followed by group 2: \1\2.
Using word boundaries \b (or assertions for the start and the end of the line, ^ and $), this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python
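As a minimal sketch of how this could look with Python's re.sub, assuming the tool really does use Python-style regex and offers replacement (neither is confirmed in the question); strip_positioner is a name made up here.

```python
import re

PATTERN = re.compile(r"\b(T\d*)(?:\?\d{2})?(\d+)\b")


def strip_positioner(text):
    """Rebuild each match from group 1 and group 2, dropping the optional ?NN."""
    return PATTERN.sub(r"\1\2", text)


print(strip_positioner("T123?214567"))
print(strip_positioner("T?211234567"))
print(strip_positioner("T0000001"))
```

Because the ?NN part is optional, a plain T0000001 also matches and is returned unchanged.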
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2
Usage Example:
Is it possible with regex to match a particular sequence repeating itself, rather than a number of letters? I would like to be able to match cn.cn. or ti.ti. or xft.xft. but not vv.pp. or aa.bb., and I don't seem to be able to do that with (\w\w.)+ as opposed to \w+.\w+. In the first case I in fact want to keep only one occurrence, like cn. or ti.; in the second I want to keep v.p. or a.b.
thanks for any help.
Depending on your flavor of regex, you can use backreferences in your regex to match an earlier group. Your question title and question body disagree, however, on what exactly is supposed to be matched. I'll answer in Python as that's the flavor I'm most familiar with.
# match vv.pp., no match cn.cn.
re.match(r"(\w)\1\.(\w)\2\.", some_text)
# match cn.cn., no match vv.pp.
re.match(r"(\w{2})\.\1\.", some_text)
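Building on the two patterns above, here is a hedged sketch that also performs the replacements described in the question. It swaps (\w{2}) for (\w+) so that xft.xft. is covered as well, which is a change from the answer's pattern; collapse and the constant names are illustrative only.

```python
import re

# A token repeated twice, e.g. cn.cn. or xft.xft. (note: \w+ instead of \w{2})
REPEATED_TOKEN = re.compile(r"(\w+)\.\1\.")
# Two doubled letters, e.g. vv.pp. or aa.bb.
DOUBLED_LETTERS = re.compile(r"(\w)\1\.(\w)\2\.")


def collapse(text):
    """cn.cn. -> cn.   and   vv.pp. -> v.p."""
    text = REPEATED_TOKEN.sub(r"\1.", text)
    return DOUBLED_LETTERS.sub(r"\1.\2.", text)


for sample in ["cn.cn.", "xft.xft.", "vv.pp.", "aa.bb."]:
    print(sample, "->", collapse(sample))
```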
I have a string:
readiness/dir ABTrickToTrade
I want to match everything after AB. So I want the output to be TrickToTrade.
So far the regex I have come up with matches the whole of ABTrickToTrade:
/(AB(.*))/g
How do I get it to match everything after and not the whole thing?
It depends on the language/tool you are using. For the above, most regex engines will capture ABTrickToTrade and TrickToTrade as groups 1 and 2, respectively. In fact, you don't need the outer parentheses. In JavaScript, for example:
matches = str.match(/AB(.*)/);
matches[1]; // TrickToTrade
It seems that regexr.com doesn't support capturing parentheses out of the box (at least not from what I see), but other sites do: http://rubular.com/r/gbZ7NAoNeA
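The same idea in Python, as a small sketch. The second variant uses a lookbehind so that the whole match is already the wanted text; that is an alternative not mentioned above, and it only works in engines that support lookbehind.

```python
import re

text = "readiness/dir ABTrickToTrade"

# Capture group: group 1 holds everything after AB.
captured = re.search(r"AB(.*)", text).group(1)

# Lookbehind alternative: AB must precede the match but is not part of it.
lookbehind = re.search(r"(?<=AB).*", text).group(0)

print(captured)
print(lookbehind)
```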