Regex substring matching on capture group - regex

I have an advanced regex question (unless I am overthinking this).
With my basic knowledge of regex, it is trivial to match a static capture group further down in the string.
P(.): D:\1
Correctly matches
Pb: D:b
Pa: D:a
and (correctly) does not match
Pa: D:b
So far so good. However, what I need is to capture a run of [a-z]+ after the P and then match any one character from that run after D:. So these should also match:
Pabc: D:c
Pabc: D:a
Pba: D:b
Pba: D:a
but not
Pabc: D:x
Pba: D:g
I started going down the path of writing separate patterns like so (spaces added around the alternation for clarity):
P(.): D:\1 | P(.)(.): D:(\1|\2) | P(.)(.)(.): D:(\1|\2|\3)
But I cannot make even this clumsy solution work in JavaScript regex.
Is there an elegant, correct way to do this? Can it be done with JavaScript's limited engine?

The following regex will do it:
P.*(.).*: D:\1
.*(.).* will match one or more characters, capturing one of them.
If the captured character matches the character after D:, then the regex matches.
If the captured character doesn't match, backtracking will ensure that it tries again with a different captured character, until all combinations have been tried.
See regex101.com for a running example.
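The backtracking behaviour described in this answer can be checked with a small Python sketch. Python's re module is used here only as a convenient test bed (the question is about JavaScript), and the helper name matches and the sample list are illustrative, not part of the original post. re.fullmatch is an added assumption so that the whole line has to fit the pattern.

```python
import re

# Pattern from the answer: capture any single character between "P" and ":",
# then require that same character right after "D:".
PATTERN = re.compile(r"P.*(.).*: D:\1")

def matches(line):
    """Return True if the whole line fits the pattern."""
    return PATTERN.fullmatch(line) is not None

samples = ["Pabc: D:c", "Pabc: D:a", "Pba: D:b", "Pba: D:a", "Pabc: D:x", "Pba: D:g"]
for line in samples:
    # The first four lines print True, the last two print False.
    print(line, matches(line))
```

Note that (.) accepts any character, not only [a-z]; tightening it to ([a-z]) with [a-z]* on either side would be a reasonable variation if the input may contain other characters.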

Related

Match same string twice within certain characters

I need to write a regex that matches patterns like this:
[[string|string]]
It's the same string twice within that specific syntax (I don't want to match the brackets themselves). I managed to come up with this:
(?<=\[\[)(.*)(?=\|)\|\1\]\]
However, it's not matching for some reason and I don't understand where my mistake is.
UPDATE: Turns out it wasn't working because my code was dirty and there were some ● characters in the first string, so both strings weren't equal: https://regexr.com/3n7ni
Removing those extraneous characters made the regex match, although it still needed tweaks (like not matching the closure brackets): https://regexr.com/3n7o7
See regex in use here
\[{2}([^|\]]+)\|\1]{2}
\[{2} Matches [ literally, twice
([^|\]]+) Captures one or more of any character except | or ] into capture group 1
\| Matches | literally
\1 Matches the text most recently captured into capture group 1
]{2} Matches ] literally, twice
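The breakdown above can be tried out with a minimal Python sketch. The function name find_doubled and the sample strings are hypothetical; the pattern itself is the one from the answer, which Python's re module accepts unchanged.

```python
import re

# \[{2}      two literal opening brackets
# ([^|\]]+)  group 1: one or more characters that are neither | nor ]
# \|         a literal pipe
# \1         the same text that group 1 captured
# ]{2}       two literal closing brackets
DOUBLED = re.compile(r"\[{2}([^|\]]+)\|\1]{2}")

def find_doubled(text):
    """Return the repeated string for every [[x|x]] occurrence in text."""
    return DOUBLED.findall(text)

print(find_doubled("[[foo|foo]] and [[foo|bar]]"))  # ['foo']
print(find_doubled("[[a●|a]]"))                     # []
```

The second call mirrors the situation from the question's update: a stray ● in the first string makes the two halves unequal, so nothing matches.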
To match the full pattern you can update your regex to include the first 2 brackets:
\[\[(.*)\|\1\]\]
I think you could also do without this positive lookahead (?=\|).
Your problem is the use of a greedy match (.*) (consume as much as possible). You should be using a reluctant match (.*?) (consume as little as possible):
\[\[(.*?)\|\1\]\]
See live demo.
Note that your look ahead (?=\|) is useless.

Mixing Lookahead and Lookbehind in 1 Regexp

I'm trying to match first occurrence of window.location.replace("http://stackoverflow.com") in some HTML string.
Especially I want to capture the URL of the first window.location.replace entry in whole HTML string.
So for capturing the URL I formulated these 2 rules:
it should be after this string: window.location.redirect("
it should be before this string ")
To achieve it I think I need to use lookbehind (for 1st rule) and lookahead (for 2nd rule).
I ended up with this regex:
.+(?<=window\.location\.redirect\(\"?=\"\))
It doesn't work. I'm not even sure that it is legal to mix both rules like I did.
Can you please help me translate my rules into a regex? Other ways of doing this (without lookahead or lookbehind) are also appreciated.
The pattern you wrote is really not the one you need as it matches something very different from what you expect: text window.location.redirect("=") in text window.location.redirect("=") something. And it will only work in PCRE/Python if you remove the ? from before \" (as lookbehinds should be fixed-width in PCRE). It will work with ? in .NET regex.
If it is JS, note that lookbehind was only added to the language in ES2018, so older JavaScript engines do not support it at all.
Instead, use a capturing group around the unknown part you want to get:
/window\.location\.redirect\("([^"]*)"\)/
or
/window\.location\.redirect\("(.*?)"\)/
See the regex demo
Omitting the /g modifier means only the first occurrence is matched. Access the value you need in Group 1.
The ([^"]*) captures 0+ characters other than a double quote (URLs you need should not have it). If these URLs you have contain a ", you should use the second approach as (.*?) will match any 0+ characters other than a newline up to the first ").
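As a rough illustration of the capturing-group approach, here is the same pattern in Python rather than JavaScript. The function name first_redirect_url and the sample HTML are made up for the example; re.search stops at the first match, which plays the role of leaving out /g.

```python
import re

# Literal prefix, then group 1 = everything up to the next double quote,
# then the closing quote and parenthesis.
REDIRECT = re.compile(r'window\.location\.redirect\("([^"]*)"\)')

def first_redirect_url(html):
    """Return the URL of the first redirect call, or None if there is none."""
    m = REDIRECT.search(html)
    return m.group(1) if m else None

html = ('<script>window.location.redirect("http://stackoverflow.com");'
        'window.location.redirect("http://example.com");</script>')
print(first_redirect_url(html))  # http://stackoverflow.com
```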

Regular expression to find specific string and add characters when they're not already there in Notepad++

Okay, I have zero knowledge of regular expressions so if someone can direct me to a better way to figure this out then by all means please do.
I figured out that a series of files are missing a particular naming convention for the database they will write to. So some might be dbname1, dbname2, dbname3, abcdbname4, abcdbname5 and they all need to have that abc in the beginning. I want to write a regular expression that will find all tags in the file that do not follow immediately by abc and add in abc. Any ideas how I can do this?
Again, forgive me if this is poorly worded/expressed. I really have absolutely zero knowledge of regular expressions. I can't find any questions that are asking this. I know that there are questions asking how to add strings to lines but not how to add only to lines that are missing the string when some already have it.
I thought I had written this in but I'm looking at lines that look like this
<Name>dbname</Name>
or
<Name>abcdbname</Name>
and I need to get them all to have that abc at the beginning
Cameron's answer will work, but so will this. It's called a negative lookbehind.
(?<!abc)(dbname\d+)
This regex looks for dbname followed by 1 or more digits, and not prefixed by abc. So it will capture dbname113.
This looks for any occurrence of dbname not immediately prefixed by the string "abc". The original name is in capture group \1, so you can replace this regex with abc\1 and all your files will be properly prefixed.
Not every program/language that implements regex supports lookbehinds (JavaScript, famously, lacked them before ES2018), but most do and Notepad++ certainly does. Lookarounds (lookbehinds / lookaheads) are exceedingly handy once you get the hang of them.
?<! (negative lookbehind), ?<= (positive lookbehind), ?! (negative lookahead), and ?= (positive lookahead) must all be used within parentheses as I did above, but they do not capture and so do not create capture groups. That is why the second set of parentheses can be referenced as \1 (or $1, depending on the language).
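To see the negative lookbehind replacement in action outside Notepad++, here is a minimal Python sketch. The function name add_prefix and the sample text are assumptions for illustration; Python's re module supports fixed-width lookbehinds such as (?<!abc).

```python
import re

def add_prefix(text):
    """Prefix every dbnameN that is not already preceded by abc."""
    # (?<!abc) asserts that the three characters before the match are not "abc";
    # it consumes nothing and creates no group, so (dbname\d+) is group 1.
    return re.sub(r"(?<!abc)(dbname\d+)", r"abc\1", text)

sample = "<Name>dbname1</Name> <Name>abcdbname4</Name>"
print(add_prefix(sample))
# <Name>abcdbname1</Name> <Name>abcdbname4</Name>
```

Running the function a second time leaves the text unchanged, since every name is then already prefixed.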
Edit: Given some better example criteria, this is possibly more what you're looking for.
Find: (<Name>)(.*?(?<!abc)dbname\d+)(</Name>)
Replace: \1abc\2\3
Alternatively, something a bit easier to understand, you can do this or something like this:
Find: (<Name>)(abc)?(dbname\d+)(</Name>)
Replace: \1abc\3\4
What this does is:
Matches <Name>, captures as backreference 1.
Looks for abc and, if it's there, captures it as backreference 2; otherwise 2 contains nothing. The ? after (abc) means match 0 or 1 times.
Looks for the dbname and captures it as backreference 3.
Matches </Name>, captures as backreference 4.
By replacing with \1abc\3\4, you effectively drop abc off dbname if it exists and replace dbname with abcdbname in all instances.
You can take this a step further and prefix the abc with ?: to create a non-capturing group, so the backreferences used in the replacement stay sequential:
Find: (<Name>)(?:abc)?(dbname\d+)(</Name>)
Replace: \1abc\2\3
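The optional-group variant can be sketched in Python as follows. The name normalize is hypothetical, and \g<1> is Python's explicit spelling of the \1 used in Notepad++; it is chosen here only to keep the group numbers visually separate from the literal abc.

```python
import re

# (?:abc)? consumes an existing prefix without creating a group,
# so the three capturing groups are numbered 1, 2, 3 in order.
NAME_TAG = re.compile(r"(<Name>)(?:abc)?(dbname\d+)(</Name>)")

def normalize(text):
    """Rewrite every <Name> tag so its dbname carries exactly one abc prefix."""
    return NAME_TAG.sub(r"\g<1>abc\g<2>\g<3>", text)

print(normalize("<Name>dbname1</Name>"))     # <Name>abcdbname1</Name>
print(normalize("<Name>abcdbname5</Name>"))  # <Name>abcdbname5</Name>
```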
Replace \bdbname(\d+) with abcdbname\1.
The \b means "word boundary", so it won't match the abc versions, but will match the others. The (...) parentheses represent a capturing group, which capture everything that's matched in-between into a numbered variable that can be later referenced (there's only one here so it goes in \1). The \d+ matches one or more digit characters.
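A short Python sketch of the word-boundary version, with the caveat it implies: \b only requires that the preceding character is not a word character, so it is slightly looser than checking for abc specifically. The function name prefix_by_boundary and the inputs are illustrative assumptions.

```python
import re

def prefix_by_boundary(text):
    """Add abc in front of dbnameN wherever dbname starts a word."""
    return re.sub(r"\bdbname(\d+)", r"abcdbname\1", text)

# There is no boundary between "c" and "d" in abcdbname4, so it is left alone.
print(prefix_by_boundary("dbname1 abcdbname4"))  # abcdbname1 abcdbname4

# A hyphen is not a word character, so a boundary exists after it.
print(prefix_by_boundary("abc-dbname2"))         # abc-abcdbname2
```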

Regex Matching with Space

I have a very simple question about regex matching. I want to match "string" (ignoring case):
in this case: "thisisastring", nothing should be returned
in this case: "this is a string" a single match on "string" should be returned
So far I have #"([S|s][T|t][R|r][I|i][N|n][G|g])" as the regex; however, it doesn't work correctly in the first case.
How should I write this regex?
Thanks in advance!
[S|s] does not match what you seem to think
Please note that [S|s] does not mean "match a S or a s". It means "match one character that is either a S, a | or a s". That's how things work inside a [character class]. To express an OR, you can use a non-capturing group: (?:S|s). But [Ss] is all you need, and case-insensitivity is even better.
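The point about | inside a character class is easy to verify. This is a tiny Python check (the variable names are made up, and Python stands in for whichever engine the question uses; character classes behave the same way here).

```python
import re

# Inside [...] the pipe is an ordinary character, not an alternation operator.
pipe_matches_with_bar = re.fullmatch(r"[S|s]", "|") is not None
pipe_matches_without_bar = re.fullmatch(r"[Ss]", "|") is not None

print(pipe_matches_with_bar)     # True
print(pipe_matches_without_bar)  # False
```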
Case-Insensitivity
I'm going to assume case-insensitive mode so we end up with a simpler regex. It looks like you're using a verbatim string, so I assume you're in C#, where the inline modifier (?i) will work. Another way to set case-insensitivity in C# is RegexOptions.IgnoreCase.
Option 1: boundary (close but no cigar)
(?i)\bstring
This no longer matches string in astring. However, it matches string in ##string, which you do not want.
Option 2: lookbehind
(?i)(?<=[ ])string
The lookbehind ensures that string is preceded by a space character. The brackets are optional; they just make the space visible.
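Options 1 and 2 can be compared side by side. The sketch below uses Python instead of the C# the answer assumes; both patterns are accepted unchanged by Python's re module. The helper name hits is hypothetical.

```python
import re

BOUNDARY = re.compile(r"(?i)\bstring")           # Option 1
LOOKBEHIND = re.compile(r"(?i)(?<=[ ])string")   # Option 2

def hits(pattern, text):
    """Return the text of every match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]

print(hits(LOOKBEHIND, "thisisastring"))     # []
print(hits(LOOKBEHIND, "this is a String"))  # ['String']
print(hits(BOUNDARY, "##string"))            # ['string']
print(hits(LOOKBEHIND, "##string"))          # []
```

One consequence of Option 2 worth keeping in mind: a "string" at the very start of the input has no space before it, so the lookbehind version will not match it.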
Option 3: \K (but not in C#)
For engines that support it (Perl, PCRE, Ruby 2+):
(?i)[ ]\Kstring
The \K tells the engine to drop what was matched so far from the final match it returns.

Regex negation?

I'm playing Regex Golf (http://regex.alf.nu/) and I'm doing the Abba hole. I have the following regex that matches the wrong side entirely (which is what I was trying to do):
(([\w])([\w])\3\2)
However, I'm trying to negate it now so it matches the other side. I can't seem to figure that part out. I tried:
(?!([\w])([\w])\3\2)
But that didn't work. Any tips from the regex masters?
You can make it much shorter (and get more points) by simply using . and removing unnecessary parens:
^(?!.*(.)(.)\2\1)
It just makes sure that there's no "abba" ("abba" here means 4 letters in that particular order we don't want to match) in any part of the string without having to match the whole word.
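A minimal Python sketch of this anchored negative lookahead follows. The function name has_no_abba and the test words are assumptions for illustration, not the actual Regex Golf word lists.

```python
import re

# ^ pins the check to the start; the lookahead then scans the whole string
# for any two characters followed by their mirror image.
NO_ABBA = re.compile(r"^(?!.*(.)(.)\2\1)")

def has_no_abba(word):
    """Return True if word contains no xyyx pattern anywhere."""
    return NO_ABBA.search(word) is not None

for word in ["hello", "abba", "afternoon"]:
    # Prints True for hello, False for abba and afternoon ("noon").
    print(word, has_no_abba(word))
```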
Using the explanation here: https://stackoverflow.com/a/406408/584663
I came up with: ^((?!((\w)(\w)\4\3)).)*$
The key here turns out to be the leading caret, ^, and the .*
(?! ...) is a look-ahead construct, and so does not advance the regex processing engine.
/(?! ...)/ on its own will correctly return a negative result for items matching the expression within; but for items which do not match (...) the regex engine continues processing. However if your regex only contains the (?! ) there is nothing left to process, and the regex processing position never advances. (See this great answer).
Apparently since the remaining regex is empty, it matches any zero-width segment of a string, i.e. it matches any string.
[begin SWAG]
With the caret ^ present, the regex engine is able to recognize that you are looking for a real answer and that you do not want it to tell you the string contains zero-width components.
[end SWAG]
Thus it is able to correctly fail to match when the (?! ) succeeds.
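The difference between the bare and the anchored lookahead can be observed directly. In this Python sketch (variable names are illustrative), the unanchored pattern still finds a zero-width match in "abba" because the engine is free to try later starting positions, while the two anchored forms fail as intended.

```python
import re

word = "abba"

# Unanchored: fails at position 0, but succeeds with an empty match at position 1,
# where the remaining text "bba" no longer forms an xyyx pattern.
bare = re.search(r"(?!(\w)(\w)\2\1)", word)

# Anchored forms: the check can only happen at the start of the string.
anchored = re.search(r"^(?!.*(\w)(\w)\2\1)", word)
tempered = re.search(r"^((?!((\w)(\w)\4\3)).)*$", word)

print(bare.start(), repr(bare.group(0)))  # 1 ''
print(anchored)                           # None
print(tempered)                           # None
```

This suggests the caret matters less because of any special recognition by the engine and more because it removes the other starting positions at which an empty match could succeed.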