Regex practicing groups - regex

Hello again, Stack Overflow. As I mentioned in my last post, I am trying to get better at regular expressions. I am going through my book's chapters tonight and decided to see whether I could, if it is even possible, create multiple groups. I am fully aware that regex is not the answer to everything; this is purely for me to learn. I am using VB.NET.
Example input(s):
MyTokenName{%#example1%, %#example2%}
MyTokenName{example1, example2}
Now, this is a completely made-up input of my own to test against. The consistent parts of the expression are Name{ }: there will always be a name consisting of only a-z first, followed by curly brackets. The main delimiter that separates the two groups is a comma. Each group starts with an optional %# and ends with an optional %.
So to summarize, I only want to match the groups defined between the curly brackets, each consisting of only a-z, unlimited times.
MyTokenName{%#example%, %#example%} ----- Would match Two groups example1 and example2
MyTokenName{example, example} --- Would match Two groups example1 and example2
My attempt, which is not working:
(?<=[a-zA-Z]+\{[^a-zA-Z#]+?)[a-zA-Z, ]+(?=%?})
Any advice would be amazing. Thanks, guys, for such a great forum. Please remember I am only trying to practice regex. I can do this with other .NET methods.

One interesting approach might be this one:
/(?i)(?<=\{|\G|\{%#|\G%#)([a-z0-9]+)(?:%?\s*(?:,\s*|\}))/g
http://regex101.com/r/bU0zY5
Here's also a structural view of it:
Debuggex Demo
By interesting I mean the use of a lookbehind with \G ;) and it should match all your examples.

This variable-length lookbehind is expensive performance-wise and of no real value in this case, when all you want to do is capture what you're interested in.
This might work (the spaces are only there for readability):
[a-zA-Z]+ { \s*(?:%#)? ([a-z]+) %?\s* , \s*(?:%#)? ([a-z]+) %?\s* }
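For readers who want to try the capture-group idea outside .NET, here is a minimal sketch in Python rather than VB.NET. It assumes the spaces in the pattern above are only for readability, so it uses `re.VERBOSE` (where `#` has to be escaped). The name `extract_pair` and the sample inputs are made up for illustration.

```python
import re

# Verbose-mode port of the capture-group pattern. Whitespace inside the
# pattern is ignored, and '#' is escaped so it is not read as a comment.
TOKEN = re.compile(
    r"""
    [a-zA-Z]+ \{                  # token name and opening brace
    \s* (?:%\#)? ([a-z]+) %? \s*  # first value, wrapper is optional
    ,                             # main delimiter
    \s* (?:%\#)? ([a-z]+) %? \s*  # second value, wrapper is optional
    \}                            # closing brace
    """,
    re.VERBOSE,
)


def extract_pair(text):
    """Return the two captured values, or None if the text does not match."""
    m = TOKEN.search(text)
    return m.groups() if m else None


print(extract_pair("MyTokenName{%#foo%, %#bar%}"))
print(extract_pair("MyTokenName{foo, bar}"))
```

Because the groups are limited to a-z, a value containing a digit (such as example1) does not match; widening the class to `[a-z0-9]+` would be one way to allow that.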

Does the pattern (\w+) serve your purpose here?
It'll match MyTokenName, example1, and example2 in both sample cases.
If you always wanted to ignore MyTokenName you could just refer to any matches other than the first match in the list.
Like:
Imports System.Text.RegularExpressions

Dim txt = "MyTokenName{%#example1%, %#example2%}"
Dim matches = Regex.Matches(txt, "(\w+)")
' Start at index 1 so we skip over the first match, MyTokenName
For i As Integer = 1 To matches.Count - 1
    DoSomethingWith(matches(i).Value)
Next
Something like that.

Related

RegEx Replace - Remove Non-Matched Values

Firstly, apologies; I'm fairly new to the world of RegEx.
Secondly (more of an FYI), I'm using an application that only has RegEx Replace functionality, therefore I'm potentially going to be limited on what can/can't be achieved.
The Challenge
I have a free text field (labelled Description) that primarily contains "useless" text. However, some records will contain either one or multiple IDs that are useful and I would like to extract said IDs.
Every ID will have the same three-letter prefix (APP) followed by a five digit numeric value (e.g. 12911).
For example, I have the following string in my Description Field;
APP00001Was APP00002TEST APP00003Blah blah APP00004 Apple APP11112OrANGE APP
THE JOURNEY
I've managed to very crudely put together an expression that is close to what I need (although, I actually need the reverse);
/!?APP\d{1,5}/g
Result;
THE STRUGGLE
However, on the Replace, I'm only able to retain the non-matched values;
Was TEST Blah blah Apple OrANGE APP
THE ENDGAME
I would like the output to be;
APP00001 APP00002 APP00003 APP00004 APP11112
Apologies once again if this is somewhat of a 'noddy' question; but any help would be much appreciated and all ideas welcome.
Many thanks in advance.
You could use an alternation (|) to either capture the ID pattern, starting at a word boundary, in group 1, or match 1+ word characters followed by optional whitespace characters.
Whatever is captured in group 1 can be used as the replacement; everything else that is matched is dropped from the result.
Note that !? matches an optional exclamation mark. You could prepend it to the pattern, but it does not occur in the example data.
\b(APP\d{1,5})\w*|\w+\s*
See a regex demo
In the replacement, use capture group 1, typically written as $1 or \1.
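As a rough illustration of how this replacement behaves, here is a small Python sketch. The asker's application is unknown, so Python's `re.sub` merely stands in for its Replace feature; the function name `keep_ids` and the final `strip()` are my own additions.

```python
import re

# Pattern from the answer: IDs are captured in group 1, any other run of
# word characters (plus trailing whitespace) is matched without a capture.
ID_OR_NOISE = re.compile(r"\b(APP\d{1,5})\w*|\w+\s*")


def keep_ids(description):
    # When the second alternative matches, group 1 is unset and is
    # substituted as an empty string (Python 3.5 and later).
    return ID_OR_NOISE.sub(r"\1", description).strip()


text = "APP00001Was APP00002TEST APP00003Blah blah APP00004 Apple APP11112OrANGE APP"
print(keep_ids(text))  # APP00001 APP00002 APP00003 APP00004 APP11112
```

The spaces that directly follow an ID are not matched by either alternative, which is why they survive and separate the IDs in the output.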

Regex Conditional Matching in One Capture Group

I have a string that may come in the form:
"filename.first_order.png"
"filename.second_order.png"
"filename.png"
"filename.(jpg|tif|etc)"
I need to match the first part of the string containing the name, and the extension - however, if the string is a first/second order type, I need to match "first_order"/"second_order" as the second group, and not "png", and I can't get those two conditions to co-exist in one capture group. Example matches:
imageondisk.first_order.png -> [imageondisk, first_order]
anotherfile.png -> [anotherfile, png]
meetingminutes.jpeg -> [meetingminutes, jpeg]
I feel like I've used all sorts of combinations of lookaheads, lookbehinds and ?s, which must look like a desperate, uneducated mess, but whatever I do, I can never get a result where they don't conflict when I join them together. The combined pattern would look something like
(.+)\.(tif|jpg|<png when not preceded by first/second_order>|<first/second_order, ignoring the .png on the end>)
Except I just went down a frustrating rabbit hole of non-capturing groups and lookarounds that all seemed to end in the same place, and I feel like I know less regex than before.
Help would be hugely appreciated.
You could use this regex, which captures the filename in group 1, any first/second order string in group 2 and the extension in group 3:
^([^.]+)(?:\.(.+))?\.(png|jpg|tif)$
Demo on regex101
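The answer returns three groups rather than the two the question asks for, so the caller still has to pick between the order string and the extension. A minimal Python sketch of that step follows; the helper name `split_name` is hypothetical, and the extension list is widened with `jpe?g` as my own assumption so that the `.jpeg` example from the question also matches.

```python
import re

# Same shape as the answer's pattern; 'jpe?g' is an assumed extension so
# that 'meetingminutes.jpeg' from the question is covered as well.
FILENAME = re.compile(r"^([^.]+)(?:\.(.+))?\.(png|jpe?g|tif)$")


def split_name(filename):
    """Return (name, order_or_extension), or None if there is no match."""
    m = FILENAME.match(filename)
    if m is None:
        return None
    name, order, ext = m.groups()
    # Prefer the middle part (e.g. first_order) when it is present.
    return (name, order if order is not None else ext)


for f in ("imageondisk.first_order.png", "anotherfile.png", "meetingminutes.jpeg"):
    print(f, "->", split_name(f))
```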

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can only describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work off is the linked image. The creators can't even confirm which flavour of regex it is; they believe it's Python, but I'm unsure.
Regex Image
Update
Unfortunately, none of the suggestions below have worked so far. I tested each one in the live system (rather than only in a regex tester, thinking it might behave differently), but unfortunately they still didn't work.
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
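The two operations can be sketched in Python as follows. This assumes a replace step is available, which the asker's tool may not offer, and the function name `find_code` is made up for illustration.

```python
import re


def find_code(line):
    # Step 1: remove every positioner (a question mark and two digits).
    cleaned = re.sub(r"\?[0-9]{2}", "", line)
    # Step 2: look for T followed by exactly seven digits.
    m = re.search(r"T[0-9]{7}", cleaned)
    return m.group(0) if m else None


print(find_code("T123?214567"))
print(find_code("T?211234567"))
```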
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
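A minimal Python sketch of the concatenation step, for the case where only matching (no replacing) is possible. The helper name `join_around_positioner` is hypothetical, and returning the line unchanged when no positioner is found is my own choice.

```python
import re

# Group 1 is the text before the positioner, group 2 the text after it.
AROUND = re.compile(r"(T.*?)\?\d{2}(.*)")


def join_around_positioner(line):
    m = AROUND.search(line)
    if m is None:
        return line  # assumption: lines without a positioner pass through
    return m.group(1) + m.group(2)


print(join_around_positioner("T123?214567"))
print(join_around_positioner("T?211234567"))
```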
First, match ?21 and replace it with a distinctive character such as #:
\?21
Demo
Then you can try this regex to find what you want:
(T(?:\d{7}|[\#\d]{8}))\s
Demo, in which the target string is captured in group 1 (or \1).
Finally, replace # with ?21 or whatever you like.
A Python script might look like this:
import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre = re.compile(r'\?21')
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
# First pass: print the matches with ?21 replaced by the placeholder #
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Second pass: print the same matches with # turned back into ?21
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))
Output is
T123#4567
T#1234567
T1234567
T1234434#

T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group 1 followed by group 2: \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python
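The linked example is not reproduced here; as a stand-in, the following is my own minimal Python sketch of the replacement described above. The function name `strip_positioner` is made up, and it assumes Python's `re` module even though the asker's regex flavour is unconfirmed.

```python
import re

# Group 1: T plus any digits before the optional positioner.
# Group 2: the digits after it.
CODE = re.compile(r"\b(T\d*)(?:\?\d{2})?(\d+)\b")


def strip_positioner(text):
    # Rebuild each match from group 1 followed by group 2.
    return CODE.sub(r"\1\2", text)


print(strip_positioner("T123?214567"))
print(strip_positioner("T?211234567"))
print(strip_positioner("T0000001"))
```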
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

Regex capture into group everything from string except part of string

I'm trying to create a regex which will capture everything from a string except for specific parts of it. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a continuous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But this common kind of task is solved very easily by replacing the unwanted words with empty strings, like
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
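A rough Python equivalent of the `\b` variant, as a sketch only; the function name `remove_words` is hypothetical and `re.sub` plays the role of `s///g`.

```python
import re

# Each alternative removes the unwanted word together with one adjacent
# hyphen; \b keeps longer words such as 'subproduction' intact.
UNWANTED = re.compile(r"-production\b|\bproduction-|-public\b|\bpublic-")


def remove_words(name):
    return UNWANTED.sub("", name)


print(remove_words("california-public-local-card-production"))  # california-local-card
print(remove_words("production-nevada-public"))                 # nevada
```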
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times to retrieve the result.

Regex, continue matching after lookaround

I'm having trouble with lookaround in regex.
Here is the problem: I have a big file I want to edit. I want to replace one function with another, keeping the first parameter but removing the second one.
Let's say we have:
func1(paramIWantToKeep, paramIDontWant)
or
func1(func3(paramIWantToKeep), paramIDontWant)
I want to change it to:
func2(paramIWantToKeep) in both cases.
So I tried using a positive lookahead:
func1\((?=.+), paramIDontWant\)
For now, I am just trying not to select the first parameter (then I'll manage to do the same with the parentheses).
But it doesn't work. It appears that my regex, after evaluating the positive lookahead (.+), looks for , paramIDontWant\) at the same position it was at before the lookahead (that is, right after the opening parenthesis).
So my question is: how do I continue a regex after a matching group, here after (.+)?
Thanks.
PS: Sorry for the english and/or the bad construction of my question.
Edit : I use Sublime Text
The first thing you need to understand is that a regex will always match a consecutive string. There will never be gaps.
Therefore, if you want to replace 123abc456 with abc, you can't simply match 123456 and remove it.
Instead, you can use a capturing group. This will allow you to remember a section of the regex for later.
For example, to replace 123abc456 with abc, you could replace this regex:
\d+([a-z]+)\d+
with this string:
$1
What that does is replace the whole match with the contents of the first capturing group. In this case, the capturing group was ([a-z]+), which matches abc. Thus, the entire match is replaced with just abc.
An example you may find more useful:
Given:
func1(foo, bar)
replacing this regex:
\w+\((\w+),\s*\w+\)
with this string:
func2($1)
results in:
func2(foo)
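Both replacements above can be tried in Python as a quick sketch. The asker uses Sublime Text, where the replacement is written `$1`; Python's `re.sub` writes the same backreference as `\1`.

```python
import re

# Replace the whole match with the contents of capturing group 1.
print(re.sub(r"\d+([a-z]+)\d+", r"\1", "123abc456"))  # abc

# Keep the first argument (group 1) and drop the second one.
renamed = re.sub(r"\w+\((\w+),\s*\w+\)", r"func2(\1)", "func1(foo, bar)")
print(renamed)  # func2(foo)
```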
import re
t = "func1(paramKeep,paramLose)"
t1 = "func1(paramKeep,((paramLose(dog,cat))))"
t2 = "func1(func3(paramKeep),paramDont)"
t3 = "func1(func3(paramKeep),paramDont,((i)),don't,want,these)"
reg = r'(\w+\(.*?(?=,))(,.*)(\))'
keep,lose,end = re.match(reg,t).groups()
print(keep+end)
keep,lose,end = re.match(reg,t1).groups()
print(keep+end)
keep,lose,end = re.match(reg,t2).groups()
print(keep+end)
keep,lose,end = re.match(reg,t3).groups()
print(keep+end)
Produces
>>>
func1(paramKeep)
func1(paramKeep)
func1(func3(paramKeep))
func1(func3(paramKeep))
Apply these two regexps in this order:
s/(func1)([^,]*)(, )?(paramIDontWant)(.)/func2$2$5/;
s/(func2\()(func3\()(paramIWantToKeep).*/$1$3)/;
These cope with the two examples you gave. I guess that the real-world code you are editing is slightly more complicated, but the general idea of applying a series of regexps might be helpful.
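A sketch of the same two-step idea in Python, with `re.sub` standing in for the Perl-style `s///` and `\N` in place of `$N`. The function name `rewrite_call` is hypothetical, and the parameter names are the placeholders from the question.

```python
import re


def rewrite_call(line):
    # Step 1: rename func1 to func2 and drop ', paramIDontWant'.
    line = re.sub(r"(func1)([^,]*)(, )?(paramIDontWant)(.)", r"func2\2\5", line)
    # Step 2: unwrap an inner func3( ... ) around the parameter to keep.
    line = re.sub(r"(func2\()(func3\()(paramIWantToKeep).*", r"\1\3)", line)
    return line


print(rewrite_call("func1(paramIWantToKeep, paramIDontWant)"))
print(rewrite_call("func1(func3(paramIWantToKeep), paramIDontWant)"))
```

Both sample lines end up as func2(paramIWantToKeep). The second step only fires when func3 is present, so the order of the two substitutions matters.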