Regex: a number vs. a backreference to a capture group - regex

I've been studying regular expressions, and I'm scratching my head on this one. On this page (https://www.regular-expressions.info/conditional.html) I see that, in a conditional regex, a reference to a numbered backreference is just a number. For example,
(a)?b(?(1)c|d)
How does regex know that we aren't supposed to match the number "1" instead of the backreference to the 1st capture group? Previously in the lessons I had learned that a backreference would be escaped, such as \1, \2, etc.

As per the regex tutorial you're following:
A special construct (?ifthen|else) allows you to create conditional regular expressions. If the if part evaluates to true, then the regex engine will attempt to match the then part. Otherwise, the else part is attempted instead. The syntax consists of a pair of parentheses. The opening bracket must be followed by a question mark, immediately followed by the if part, immediately followed by the then part. This part can be followed by a vertical bar and the else part. You may omit the else part, and the vertical bar with it.
Alternatively, you can check in the if part whether a capturing group has taken part in the match thus far. Place the number of the capturing group inside parentheses, and use that as the if part.
RegEx Demo of \b(a)?b(?(1)c|d)\b
Note that I have added word boundaries to avoid partially matching a string like abd.
Your second question is this:
What if someone actually wanted to match the literal 1 this way?
valid input: 1c or d
invalid input: 1d
That would be:
\b(1)?(?(1)c|d)\b
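For what it's worth, Python's re module also supports this conditional syntax, so both patterns can be checked with a minimal sketch like the one below. The helper name matches is made up for illustration.

```python
import re

# Inside (?(1)...|...), the 1 is a group number to test, not a literal "1".
cond_a = re.compile(r'\b(a)?b(?(1)c|d)\b')
# Here group 1 captures a literal "1", and the condition tests that group.
cond_1 = re.compile(r'\b(1)?(?(1)c|d)\b')

def matches(pattern, text):
    """Hypothetical helper: True if the pattern matches anywhere in text."""
    return pattern.search(text) is not None

for text in ('abc', 'bd', 'abd', 'bc'):
    print(text, matches(cond_a, text))

for text in ('1c', 'd', '1d'):
    print(text, matches(cond_1, text))
```

The number is only read as a group reference because it sits directly inside the (?( ... ) construct; anywhere else in the pattern a bare 1 is still a literal character.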

Related

Regex: how do I match a character before other capture characters?

I'm trying to match on a list of strings where I want to make sure the first character is not the equals sign, don't capture that match. So, for a list (excerpted from pip freeze) like:
ply==3.10
powerline-status===2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-
psutil==4.0.0
ptyprocess==0.5.1
I want the captured output to look like this:
==3.10
==4.0.0
==0.5.1
I first thought using a negative lookahead (?![^=]) would work, but with a regular expression of (?![^=])==[0-9]+.* it ends up capturing the line I don't want:
==3.10
==2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-
==4.0.0
==0.5.1
I also tried using a non-capturing group (?:[^=]) with a regex of (?:[^=])==[0-9]+.* but that ends up capturing the first character which I also don't want:
y==3.10
l==4.0.0
s==0.5.1
So the question is this: How can one match but not capture a string before the rest of the regex?
A negative lookbehind would be the way to go:
(?<!=)==[0-9.]+
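As a rough check, assuming Python's re module as the engine (the question does not say which one is in use), the pattern can be run against the sample lines like this:

```python
import re

lines = [
    'ply==3.10',
    'powerline-status===2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-',
    'psutil==4.0.0',
    'ptyprocess==0.5.1',
]

# (?<!=) asserts that the character just before "==" is not another "=".
pattern = re.compile(r'(?<!=)==[0-9.]+')

found = []
for line in lines:
    m = pattern.search(line)
    if m:
        found.append(m.group(0))

print(found)
```

The === line is skipped because the only position where the lookbehind succeeds is followed by a third = rather than a digit or dot.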
Also, here is the site I like to use:
http://www.rubular.com/
Of course, it does sometimes help if you say which engine/software you are using, so we know what limitations there might be.
If you want to remove the version numbers from the text, you could capture a character that is not an equals sign ([^=]) in the first capturing group, followed by matching == and the version numbers \d+(?:\.\d+)+. Then in the replacement you would use your capturing group.
Regex
([^=])==\d+(?:\.\d+)+
Replacement
Group 1 $1
Note
You could also use ==[0-9]+.* or ==[0-9.]+ to match the double equals signs and version numbers but that would be a very broad match. The first would also match ====1test and the latter would also match ==..
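A minimal sketch of this replacement approach, assuming Python's re.sub (where the group reference is written \1 rather than $1):

```python
import re

lines = [
    'ply==3.10',
    'powerline-status===2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-',
    'psutil==4.0.0',
    'ptyprocess==0.5.1',
]

# Group 1 holds the character before "=="; putting it back in the
# replacement keeps that character and drops "==" plus the version.
pattern = re.compile(r'([^=])==\d+(?:\.\d+)+')

stripped = [pattern.sub(r'\1', line) for line in lines]
print(stripped)
```

The === line is left unchanged, because no non-equals character there is followed by exactly == and a digit.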
There's another regex operator called a 'lookbehind assertion' (also called positive lookbehind) ?<= - and in my above example using it in the expression (?<=[^=])==[0-9]+.* results in the expected output:
==3.10
==4.0.0
==0.5.1
At the time of this writing, it took me a while to discover this - notably the lookbehind assertion currently isn't supported in the popular regex tool regexr.
If there are alternatives to using a lookbehind to solve this, I'd love to hear them.
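To compare the three attempts side by side, here is a small sketch assuming Python's re module. The labels in the dictionary are only for illustration.

```python
import re

lines = [
    'ply==3.10',
    'powerline-status===2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-',
    'psutil==4.0.0',
    'ptyprocess==0.5.1',
]

attempts = {
    'negative lookahead': r'(?![^=])==[0-9]+.*',
    'non-capturing group': r'(?:[^=])==[0-9]+.*',
    'positive lookbehind': r'(?<=[^=])==[0-9]+.*',
}

results = {}
for label, pattern in attempts.items():
    hits = (re.search(pattern, line) for line in lines)
    results[label] = [m.group(0) for m in hits if m]

for label, matched in results.items():
    print(label, matched)
```

The non-capturing group still consumes the character it matches, which is why that character shows up in the overall match. A lookbehind only inspects it.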

python regex non-capture group handling

(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
used to match string
123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23
How come this works in regexr.com but Python 3.5.1 can't find a match
r'(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))'
can match up to
123 FEX-1-80 Online N2K-C2248TP
but the second hyphen - in group(4) is not matched
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
Just a comment, not really an answer but for the sake of clarity I have put it as an answer.
If you are relatively new to regular expressions, consider using verbose mode. With it, your expression becomes much more readable:
(1[0-9]{2})\s+ # three digits, the first one needs to be 1
(\w+(?:-\w+)+)\s+ # a word character (wc), followed by - and wcs
(\w+)\s+ # another word
(\w+(?:-\w+)+)\s+ # same expression as above
(\w+) # another word
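In Python, verbose mode is enabled with the re.VERBOSE flag. A minimal sketch using the input string from the question:

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts a comment.
pattern = re.compile(r'''
    (1[0-9]{2})\s+       # three digits, the first one must be 1
    (\w+(?:-\w+)+)\s+    # word characters followed by one or more -word parts
    (\w+)\s+             # another word
    (\w+(?:-\w+)+)\s+    # same expression as above
    (\w+)                # another word
''', re.VERBOSE)

text = '123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23'
match = pattern.search(text)
groups = match.groups() if match else ()
print(groups)
```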
Also, check whether your second and fourth expressions could be rewritten as [\w-]+. It is not the same as yours and will match other substrings, but try to avoid nested parentheses in general.
Concerning your question, the second string cannot be matched as you made all of your expressions mandatory (and group 5 is missing in the second example, so it will fail).
See a demo on regex101.com.
This regular expression matches the full input string:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
This one doesn't:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))
The latter is missing a + after the last non-capturing group, and it's missing the \s+(\w+) at the end that matches the SSDFDFWFw23r23 at the end of the input string.
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
I'm not sure I follow. A non-capturing group is really just there to group a part of a regular expression.
(?:-\w+) or just -\w+ will both match a hyphen (-) followed by one or more "word" characters (\w+). It doesn't matter whether that regular expression is in a non-capturing group or not. If you want to match repetitions of that pattern, you can use the + modifier after the non-capturing group, e.g. (?:-\w+)+. That pattern will match a string like -foo-bar-baz.
So the reason your second regular expression doesn't match the repeated pattern is because it's lacking the + modifier.
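The effect of the missing + can be seen in a short Python sketch. The variable names are illustrative only.

```python
import re

text = '123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23'

# Identical up to group 4; only the quantifier on the last (?:-\w+) differs.
with_plus = r'(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)'
without_plus = r'(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))'

group4_with = re.search(with_plus, text).group(4)
group4_without = re.search(without_plus, text).group(4)
print(group4_with, group4_without)

# The same difference on the smaller example.
single = re.match(r'(?:-\w+)', '-foo-bar-baz').group(0)
repeated = re.match(r'(?:-\w+)+', '-foo-bar-baz').group(0)
print(single, repeated)
```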

Regular expression to find specific string and add characters when they're not already there in notepad++

Okay, I have zero knowledge of regular expressions so if someone can direct me to a better way to figure this out then by all means please do.
I figured out that a series of files are missing a particular naming convention for the database they will write to. So some might be dbname1, dbname2, dbname3, abcdbname4, abcdbname5 and they all need to have that abc in the beginning. I want to write a regular expression that will find all tags in the file that are not immediately followed by abc and add in abc. Any ideas how I can do this?
Again, forgive me if this is poorly worded/expressed. I really have absolutely zero knowledge of regular expressions. I can't find any questions that are asking this. I know that there are questions asking how to add strings to lines but not how to add only to lines that are missing the string when some already have it.
I thought I had written this in but I'm looking at lines that look like this
<Name>dbname</Name>
or
<Name>abcdbname</Name>
and I need to get them all to have that abc at the beginning
Cameron's answer will work, but so will this. It's called a negative lookbehind.
(?<!abc)(dbname\d+)
This regex looks for dbname followed by 1 or more digits, and not prefixed by abc. So it will capture dbname113.
This looks for any occurrence of dbname not immediately prefixed by the string "abc". The original name is in the capture group \1, so you can replace this regex with abc\1 and all your files will be properly prefixed.
Not every program/language that implements regex supports lookbehinds (JavaScript famously did not until ES2018), but most do and Notepad++ certainly does. Lookarounds (lookbehinds / lookaheads) are exceedingly handy once you get the hang of them.
?<! negative lookbehind, ?<= positive lookbehind / lookbehind, ?! negative lookahead, and ?= lookahead all must be used within parentheses as I did above, but they're not used in capturing so they do not create capture groups, hence why the second set of parentheses can be referenced as \1 (or $1 depending on the language).
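Outside Notepad++, the same find/replace can be sketched in Python. The sample names below are assumptions based on the question; the pattern requires at least one digit after dbname.

```python
import re

names = ['<Name>dbname1</Name>', '<Name>abcdbname4</Name>']

# The lookbehind creates no group, so (dbname\d+) is group 1.
fixed = [re.sub(r'(?<!abc)(dbname\d+)', r'abc\1', name) for name in names]
print(fixed)
```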
Edit: Given some better example criteria, this is possibly more what you're looking for.
Find: (<Name>)(.*?(?<!abc)dbname\d+)(</Name>)
Replace: \1abc\2\3
Alternatively, something a bit easier to understand, you can do this or something like this:
Find: (<Name>)(abc)?(dbname\d+)(</Name>)
Replace: \1abc\3\4
What this does is:
Matches <Name>, captures as backreference 1.
Looks for abc and captures it, if it's there as backreference 2, otherwise 2 contains nothing. The ? after (abc) means match 0 or 1 times.
Looks for the dbname and captures it as backreference 3.
Matches </Name>, captures as backreference 4.
By replacing with \1abc\3\4, you kind of drop abc off dbname if it exists and replace dbname with abcdbname in all instances.
You can take this a step further and prefix the abc with ?: to create a noncapturing group, so the backreferences for replacing are sequential:
Find: (<Name>)(?:abc)?(dbname\d+)(</Name>)
Replace: \1abc\2\3
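The two optional-group variants can be compared with a small Python sketch. The sample names are assumptions, and re.sub stands in for Notepad++'s replace.

```python
import re

names = ['<Name>dbname1</Name>', '<Name>abcdbname4</Name>']

# Capturing (abc)? shifts the later groups to 3 and 4.
capturing = [
    re.sub(r'(<Name>)(abc)?(dbname\d+)(</Name>)', r'\1abc\3\4', name)
    for name in names
]
# Non-capturing (?:abc)? keeps the group numbers sequential: 1, 2, 3.
noncapturing = [
    re.sub(r'(<Name>)(?:abc)?(dbname\d+)(</Name>)', r'\1abc\2\3', name)
    for name in names
]
print(capturing)
print(noncapturing)
```

Both variants drop an existing abc and write it back, so every name ends up prefixed exactly once.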
Replace \bdbname(\d+) with abcdbname\1.
The \b means "word boundary", so it won't match the abc versions, but will match the others. The (...) parentheses represent a capturing group, which capture everything that's matched in-between into a numbered variable that can be later referenced (there's only one here so it goes in \1). The \d+ matches one or more digit characters.
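A quick sketch of the word-boundary variant in Python, with sample input assumed from the question:

```python
import re

text = 'dbname2 abcdbname5 <Name>dbname3</Name>'

# \b fails between "c" and "d" in abcdbname5, so that name is left alone.
result = re.sub(r'\bdbname(\d+)', r'abcdbname\1', text)
print(result)
```

Note that \b also fails after any other word character, so a name such as xyz_dbname1 would be skipped as well.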

Capturing two groups out of a string with a regex

I don't know anything about regular expressions and I don't really have the time to study them at the moment.
I have a string like this:
test (22/22/22)
I need to capture the test and the date 22/22/22 in an array.
the test string could also be a multiple words string:
test test(1) tes-t (22/22/22)
should capture test test(1) tes-t and 22/22/22
I have no idea how to get started on this. I managed to capture the date string with the parentheses by doing:
(\(.*)
but that really doesn't get me anywhere.
Could someone help me out here and provide an explanation of how I should go about capturing this? I'm kinda lost.
Thanks
To explain the given regular expression : (.*)\(([^)]+)\)
(.*) will match anything, and capture it (the parentheses capture what their inner expression matches)
\( is an escaped parenthesis. That's what you'll write when you want to match a literal parenthesis.
[^)]+ means anything but a closing parenthesis (most special characters do not need to be escaped within square brackets), one or more times.
([^)]+) captures what's explained above
\) matches a closing parenthesis
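Because .* is greedy, it backtracks only as far as the last opening parenthesis, so this regex still captures the intended strings when the first words contain a parenthesis. A lazy (.*?) would stop at the first one and capture the wrong strings, as in: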
test test(1) tes-t (22/22/22)
I'd recommend thinking about what information you want to capture, and how you separate it from the rest of your string. Once that is done, it will be much easier to build an effective regular expression.
Try this
^(.*)\(([^)]*)\)
See it here online on Regexr
While hovering with the mouse over the blue colored matches, you can see the content of the capturing groups.
Explanation
^ BeginOfLine
(.*) CapturingGroup 1 AnyCharacterExcept\n, zero or more times
\(([^)]*)\) a literal (, then CapturingGroup 2: AnyCharNotIn[)] zero or more times, then a literal )
This needle works on your example input:
(.*)\(([^)]+)\)
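As a rough check in Python (the question does not name a language, and the function name split_label_and_date is made up for illustration), the pattern can be tried on both inputs from the question:

```python
import re

pattern = re.compile(r'(.*)\(([^)]+)\)')

def split_label_and_date(text):
    """Hypothetical helper: return (label, date), or None if nothing matches."""
    m = pattern.search(text)
    if m is None:
        return None
    # Greedy .* backtracks to the last "(", so earlier parentheses stay in group 1.
    return m.group(1).strip(), m.group(2)

simple = split_label_and_date('test (22/22/22)')
nested = split_label_and_date('test test(1) tes-t (22/22/22)')
print(simple)
print(nested)
```

The strip() call is an addition here: group 1 otherwise keeps the space before the opening parenthesis.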

Why is the rightmost character captured in backreference when using a character class with quantifiers?

If I have pattern ([a-z]){2,4} and string "ab", what would I expect to see in backreference \1 ?
I'm getting "b", but why "b" rather than "a"?
I'm sure there is a valid explanation, but reading around various sites explaining regexes, I haven't found one. Anybody?
I'm not sure why nobody put this as an answer, but just for anyone hitting this page with a similar question, the answer is essentially that this regex:
([a-z]){2,4}
will match a single character between a and z at least 2 and as many as 4 times. It will match each character separately, overwriting anything previously matched and stored into the backreference (that is, whatever is between the () characters in the expression).
A similar expression (suggested in the comments on the question):
([a-z]{2,4})
moves the back-reference to surround the entire match (2-4 characters a-z) instead of a single character.
The parentheses represent a capture into a back-reference. When the repetition is inside the capture (the second example), it will capture all characters that make up that repetition. When the repetition is outside the capture (the first example), it will capture one letter, then repeat the process, capturing the next letter into the same back-reference, thus overwriting it. In this case, it will then repeat that process up to 2 more times, overwriting it each time.
So, matching against the target abc will result in \1 equaling c. Matching against the target abcd will result in \1 equaling d. With more letters, and depending upon the function (and language) used to run the regular expression, the target abcde might fail to match, or might result in the back-reference \1 equaling d (because the e is not part of the match).
The first example expression can be used to get abc or abcd if you use the whole match back-reference (often times $& or $0, but also \& or \0 and in Tcl, just an & character) - this returns the entire string matched by the entire regular expression.
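The behaviour described above can be reproduced with a short Python sketch. Here re.match and re.fullmatch stand in for "the function used to run the regular expression", and group(0) is Python's whole-match reference.

```python
import re

outside = re.compile(r'([a-z]){2,4}')  # repetition outside the capture
inside = re.compile(r'([a-z]{2,4})')   # repetition inside the capture

# Each repetition overwrites group 1, so only the last letter survives.
last_ab = outside.fullmatch('ab').group(1)
last_abcd = outside.fullmatch('abcd').group(1)
whole_ab = inside.fullmatch('ab').group(1)

# With five letters, fullmatch fails, while match stops after four.
full_abcde = outside.fullmatch('abcde')
prefix = outside.match('abcde')
prefix_whole, prefix_last = prefix.group(0), prefix.group(1)

print(last_ab, last_abcd, whole_ab, full_abcde, prefix_whole, prefix_last)
```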