Notepad++ regex group capture - regex

I have a text file like this:
ххх.prontube.ru
salo.ru
bbb.antichat.ru
yyy.ru
xx.bb.prontube.ru
zzz.com
srfsf.jwbefw.com.ua
I'm trying to strip all subdomains with this regex:
Find: .+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$
Replace with: \1
I get:
prontube.ru
salo.ru
antichat.ru
yyy.ru
prontube.ru
zzz.com
com.ua
Why does the last line become com.ua instead of jwbefw.com.ua?

This works without look around:
Find: [a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$
Replace: \1\.\2
It finds something with at least 2 periods and only letters, numbers, and dashes following the last two periods; then it replaces it with the last 2 parts. More intuitive, in my opinion.
There's something funny going on with that leading xxx. It doesn't appear to be plain ASCII. For the sake of this question, I'm going to assume that's just something funny with this site and not representative of your real data.
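To check what the pattern above does to the sample hosts, here is a minimal sketch in Python's re module instead of Notepad++ (an assumption made only for illustration). The helper name keep_last_two_labels is hypothetical, the sample uses a plain ASCII xxx, and the replacement is written \1.\2 because Python does not need the dot escaped in a replacement string.

```python
import re

# Pattern from the answer above, copied verbatim.
PATTERN = re.compile(r'[a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$')


def keep_last_two_labels(host: str) -> str:
    """Reduce a host with three or more labels to its last two labels."""
    return PATTERN.sub(r'\1.\2', host)


hosts = [
    'xxx.prontube.ru',
    'salo.ru',
    'xx.bb.prontube.ru',
    'srfsf.jwbefw.com.ua',
]
for host in hosts:
    print(f'{host} -> {keep_last_two_labels(host)}')
```

Because the pattern always keeps exactly two labels, srfsf.jwbefw.com.ua is still reduced to com.ua, so a two-part suffix needs the TLD list from the question.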
Incorrect
Interestingly, I previously had an incorrect answer here that accumulated a lot of upvotes. So I think I should preserve it:
Find: [a-zA-Z0-9-]+\.([a-zA-Z0-9-]+)\.(.+)$
Replace: \1\.\2
It just finds a host name with at least 2 periods in it, then replaces it with everything after the first dot.

The .+ part is matching as much as possible. Try using .+? instead, and it will capture the least possible, allowing the com.ua option to match.

.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
This answer still uses the specific domain names from the original question. Because some TLDs (top-level domains) contain a period, and a list could in theory include several levels of subdomains, whitelisting the TLDs in the regex is a good idea if it works with your data set. Both current answers (from 2013) fail to handle the difference between "xx.bb.prontube.ru" and "srfsf.jwbefw.com.ua" correctly.
Here is a quick explanation of why psnig's original regex isn't working as intended:
The + is greedy.
.+ will zip all the way to the right at the end of the line capturing everything,
then work its way backwards (to the left) looking for a match from here:
(ru|ua|com\.ua|com|net|info)
With srfsf.jwbefw.com.ua the regex engine will first fail to match a,
then it will move the token one place to the left to look at "ua"
At that point, ua from the regex (the second option) is a match.
The engine will not keep looking to find "com.ua" because ".ua" met that requirement.
Niet the Dark Absol's answer tells the regex to be "lazy"
.+? will match any character (at least one) and then try to match the next part of the regex. If that fails, .+? consumes one more character and the rest of the regex is evaluated again.
The .+? will eventually consume: srfsf.jwbefw before matching the period, and then matching com.ua.
But the implementation of ? also creates issues.
Adding the question mark makes that first .+ lazy, but it then causes group 1 to match bb.prontube.ru instead of prontube.ru for xx.bb.prontube.ru.
This is because the first period (the one after xx) will match, and then (.*?) inside group 1 will match bb.prontube before \.(ru|ua|com\.ua|com|net|info))$ matches .ru
To avoid this, change the second group from (.*?) to ([\w-]*?) so it cannot match a period, only letters, digits, underscores, or a dash.
Resulting regex:
.+?\.(([\w-]*?)\.(ru|ua|com\.ua|com|net|info))$
Note that you don't need to capture any group other than the first. Dropping the inner group and adding ?: to the TLD options makes them non-capturing.
Final version:
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
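The three stages discussed above can be compared side by side. This is a sketch using Python's re module rather than Notepad++ (an assumption for illustration only); the constant names and the helper strip_subdomains are hypothetical, and the patterns are copied from the question and this answer.

```python
import re

# Greedy prefix, as in the question.
ORIGINAL = re.compile(r'.+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$')
# Lazy prefix, but the middle group can still cross a period.
LAZY_PREFIX = re.compile(r'.+?\.((.*?)\.(ru|ua|com\.ua|com|net|info))$')
# Lazy prefix plus a middle group limited to a single label.
FINAL = re.compile(r'.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$')


def strip_subdomains(pattern: re.Pattern, host: str) -> str:
    """Replace the whole match with capture group 1."""
    return pattern.sub(r'\1', host)


for host in ('xx.bb.prontube.ru', 'srfsf.jwbefw.com.ua', 'salo.ru'):
    for name, pattern in (('original', ORIGINAL),
                          ('lazy prefix', LAZY_PREFIX),
                          ('final', FINAL)):
        print(f'{name:12} {host} -> {strip_subdomains(pattern, host)}')
```

Hosts without a subdomain, such as salo.ru, do not match any of the three patterns and are left unchanged.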

Search what: .+?\.(\w+\.(?:ru|com|com\.au))
Replace with: $1
The original answer included a picture that color-coded which part of the regex each capture refers to, so that no further regex explanation is needed.

Related

Not understanding a RegEx with Non-Capturing Group

I need help in understanding why a specific regular expression works and another one doesn't .. here's the background:
I want to use the ansible lineinfile module to add/modify the -u ntpd:ntpd option to the OPTIONS="" field in /etc/sysconfig/ntpd.
I see three cases:
No line with OPTIONS at all --> add OPTIONS="-u ntpd:ntpd"
OPTIONS line already there, but not with that specific option --> add the option to the existing options
OPTIONS line existing, -u option existing but with wrong parameters --> change the parameters
My first shot was this:
- name: Make sure ntpd runs as user:group ntpd:ntpd
  lineinfile:
    path: /etc/sysconfig/ntpd
    regexp: 'OPTIONS=\"(.*)(-u\s[A-Za-z0-9\-_]+:[A-Za-z0-9\-_]+)?(.*)\"'
    line: 'OPTIONS="\1 \3 -u ntpd:ntpd'
    backrefs: yes
But the first group then contains everything between the quotes (example at regex101).
Changing the 1st and 3rd groups to lazy makes everything between the quotes belong to the third group (example at regex101).
After some experimenting with a colleague, we came up with this regular expression, which does what we want:
OPTIONS=\"(?:(.*)(-u [a-z]+:[a-z]+))?(.*?)\"
Example at regex101
But honestly, we do not understand why. Anyone out there who can shed some light on this?
Your culprits are the wildcards in your first and third matching groups! Let's have a look at each one. We'll use OPTIONS="-x -u wrong:wrong -c blah" as the string to test.
OPTIONS=\"(.*)(-u\s[A-Za-z0-9\-_]+:[A-Za-z0-9\-_]+)?(.*)\"
This regex starts with a greedy wildcard, and in a way, that's where it ends, too. Between your two quote marks must appear:
First capture group: as much of anything as it wants. This group can, will, and does eat your entire string, leaving nothing else behind. It'll give back if it has to, but...
Second capture group: the interesting part of the regex. Normally, this would capture what you want to see. However, since it's marked with the ? quantifier, it's allowed to exist zero or one times - and the greedy quantifier is happy to keep its catch if the second capture group is not required to exist.
Third capture group: same issue here. Since .* means "zero or more" of any character, and the first greedy wildcard ate them all, it's happy to match nothing, which is why your first regex example has the dotted red line just before the closing quote - it signifies group 3's empty match.
Result: the first greedy wildcard ate everything, and none of the other capture groups made it a requirement to give something back.
OPTIONS=\"(.*?)(-u\s[A-Za-z0-9\-_]+:[A-Za-z0-9\-_]+)?(.*?)\"
In contrast, your second regex contains lazy wildcards. These won't eat anything if they don't have to.
First CG: doesn't capture anything if it doesn't specifically have to. It won't capture anything unless we get to the very end of the string and find no matches, at which point it wakes up and starts eating things until a match can be found.
Second CG: again, it can exist, but because of the ? quantifier, it doesn't have to. Since it can't instantly match the -x at the start of the string, it decides to try the final CG.
Third CG: doesn't capture anything if it doesn't specifically have to.
Except... now there's no match. At this point, the regex engine starts working backwards.
Third CG: starts capturing things, increasing until it hits a match. In this case, since your second group doesn't immediately match, it assumes it won't match at all (instead of checking to see if it can make something work with the first wildcard CG) and instead eats all the text. This counts as a match, and the engine is satisfied.
Second CG - tries to match, but doesn't if it can't immediately do so. Note that on the second line, where the first text after OPTIONS=" is a match, this group does activate.
First CG: never reached, the third one had it handled.
Result: you're getting closer, but because the lazy wildcard doesn't capture unless it has to, and the engine is more willing to let your third CG eat everything than it is to try and force a match between the second and first groups, the third group gets it all.
OPTIONS=\"(?:(.*)(-u [a-z]+:[a-z]+))?(.*?)\"
Now you got it. The non-capture group at the start treats the first two capture groups as the same entity. Let's look at what happens now:
First NCG:
First nested CG: matches the entire string.
Second nested CG: forces the first nested CG to surrender some text so it can match. There's no ? quantifier this time, so it's not an option: the second CG must match. Because the entire NCG must try to find a match before the rest of the regex can continue matching, if a match exists, it's guaranteed to be properly found.
Third CG: cleans up whatever the NCG left behind.
Result: you get the string. Woo!
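The three walkthroughs above can be reproduced by printing the capture groups for the test string. This is a sketch in Python's re module (regex101 or Ansible may differ in details, which is an assumption worth checking); the dictionary keys and the helper capture_groups are hypothetical names.

```python
import re

SAMPLE = 'OPTIONS="-x -u wrong:wrong -c blah"'

# The three regexes from the question, copied verbatim.
PATTERNS = {
    'greedy': r'OPTIONS=\"(.*)(-u\s[A-Za-z0-9\-_]+:[A-Za-z0-9\-_]+)?(.*)\"',
    'lazy': r'OPTIONS=\"(.*?)(-u\s[A-Za-z0-9\-_]+:[A-Za-z0-9\-_]+)?(.*?)\"',
    'grouped': r'OPTIONS=\"(?:(.*)(-u [a-z]+:[a-z]+))?(.*?)\"',
}


def capture_groups(pattern: str, text: str):
    """Return the tuple of capture groups, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.groups() if match else None


for name, pattern in PATTERNS.items():
    print(name, capture_groups(pattern, SAMPLE))
```

A group that did not take part in the match shows up as None, while a group that matched the empty string shows up as ''.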
Let's take a closer look at your non-capturing group and see why it works.
(?:(.*)(-u [a-z]+:[a-z]+))?
(.*)(-u [a-z]+:[a-z]+)
Notice something? Your ? quantifier is outside the non capture group. This means that inside the NCG, your second capture group is no longer optional. The regex is forced to try to match your second CG, and if it can't, the ? quantifier outside the NCG cancels the entire thing, including the first wildcard CG. This means that the first wildcard CG can't be used to gobble text if the engine feels like ignoring your second CG - it can now only be used to assist the second CG in matching, as was likely your original intent.
The ? quantifier on the second CG was necessary, because there was no guarantee the command would exist. However, this gave the engine the option of being lazy and just ignoring it altogether - which it will gladly do, especially if you place it right next to a wildcard CG.
As an aside, if you change your regex like so, and put the ? quantifier back where it was, next to your second CG...
OPTIONS=\"(?:(.*)(-u [a-z]+:[a-z]+))?(.*?)\"    works
OPTIONS=\"(?:(.*)(-u [a-z]+:[a-z]+)?)(.*?)\"    uh oh
                                   ^^
...It ignores your second group again, and you're right back where you started.
Good luck ;)
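The effect of moving the ? quantifier can be checked directly. A small sketch, again assuming Python's re module; WORKS and BROKEN are hypothetical names for the two variants shown in the aside.

```python
import re

SAMPLE = 'OPTIONS="-x -u wrong:wrong -c blah"'

# ? applies to the whole non-capturing group: the -u part is mandatory inside it.
WORKS = r'OPTIONS=\"(?:(.*)(-u [a-z]+:[a-z]+))?(.*?)\"'
# ? applies only to the -u group: the greedy (.*) may keep everything.
BROKEN = r'OPTIONS=\"(?:(.*)(-u [a-z]+:[a-z]+)?)(.*?)\"'

works_groups = re.search(WORKS, SAMPLE).groups()
broken_groups = re.search(BROKEN, SAMPLE).groups()

print('works: ', works_groups)
print('broken:', broken_groups)
```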
First regex : the first capturing group's .* is greedy, it will match as much as possible, and it is possible to match everything up to the double-quote since everything else that follows is optional.
Second regex : the first capturing group's .*? is lazy, it will match as little as possible. It first tries matching nothing, which is possible since the last capturing group will be able to match the rest of the string.
Third regex : the non-capturing group is greedy and will try to match if possible, and its -u group is mandatory. The regex engine will match everything with .*, then backtrack until it's possible to match the -u group. If there's no -u, it leaves the whole thing to match to the third group.
Note that the laziness of the third group isn't necessary in your third regex, and that making the first .* lazy will improve performance: https://regex101.com/r/xvYOoF/7
As an alternative, the following regex makes sure the first capturing group can't match a -u option. It might be easier to understand and should perform better, especially on long option strings (not that it matters much here):
OPTIONS="((?:-[^u]|[^-])*)(-u \w+:\w+)?(.*)"
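A short sketch of how this alternative splits the options, assuming Python's re module; the helper split_options and the sample strings are hypothetical.

```python
import re

# Alternative regex from the answer above, copied verbatim.
ALTERNATIVE = re.compile(r'OPTIONS="((?:-[^u]|[^-])*)(-u \w+:\w+)?(.*)"')


def split_options(line: str):
    """Return (before -u, the -u option or None, after -u), or None if no match."""
    match = ALTERNATIVE.search(line)
    return match.groups() if match else None


print(split_options('OPTIONS="-x -u wrong:wrong -c blah"'))
print(split_options('OPTIONS="-x -c blah"'))
```

The first group can only consume a dash when the next character is not u, so it stops by itself in front of a -u option instead of relying on backtracking.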

How to extract characters from a string with optional string afterwards using Regex?

I am in the process of learning Regex and have been stuck on this case. I have a url that can be in two states EXAMPLE 1:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA
OR EXAMPLE 2:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA
I need to extract the 1HYcYZCOpaLjg51qUg8ilA ID
So far I am using this: (?<=track\/)(.*)(?=\?)? which works well for Example 2 but it includes the ?si=Nf5w1q9MTKu3zG_CJ83RWA when matching with Example 1.
BUT if I remove the ? at the end of the expression then it works for Example 1 but not Example 2! Doesn't that mean that last group (?=\?) is optional and should match?
Where am I going wrong?
Thanks!
I searched a handful of "Questions that may already have your answer" suggestions from SO, and didn't find this case, so I hope asking this is okay!
The capturing group in your regular expression is trying to match anything (.) as much as possible due to the greediness of the quantifier (*).
When you use:
(?<=track\/)(.*)(?=\?)
only 1HYcYZCOpaLjg51qUg8ilA from the first example is captured, as there is no question mark in your second example.
When using:
(?<=track\/)(.*)(?=\??)
You are effectively making the positive lookahead optional, so the capturing group will try to match as much as possible (including the question mark), so that 1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA and 1HYcYZCOpaLjg51qUg8ilA are matched, which is not the desired output.
Rather than matching anything, it is perhaps more appropriate for you to match alphanumerical characters \w only.
(?<=track\/)(\w*)(?=\??)
Alternatively, if you are expecting other characters, let's say a hyphen - or an underscore _, you may use a character class.
(?<=track\/)([a-zA-Z0-9_-]*)(?=\??)
Or you might want to capture everything except a question mark ? with a negated character class.
(?<=track\/)([^?]*)(?=\??)
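To compare the greedy version with the negated character class on both example URLs, here is a minimal sketch assuming Python's re module; GREEDY, NEGATED, and track_id are hypothetical names.

```python
import re

WITH_QUERY = 'spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA'
WITHOUT_QUERY = 'spotify.com/track/1HYcYZCOpaLjg51qUg8ilA'

# Greedy capture: the optional lookahead cannot stop .* at the question mark.
GREEDY = re.compile(r'(?<=track\/)(.*)(?=\??)')
# Negated class: the capture itself refuses to cross a question mark.
NEGATED = re.compile(r'(?<=track\/)([^?]*)(?=\??)')


def track_id(pattern: re.Pattern, url: str):
    """Return capture group 1, or None if the pattern does not match."""
    match = pattern.search(url)
    return match.group(1) if match else None


for url in (WITH_QUERY, WITHOUT_QUERY):
    print('greedy: ', track_id(GREEDY, url))
    print('negated:', track_id(NEGATED, url))
```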
As pointed out by gaganso, a look-behind is not necessary in this situation (or indeed the lookahead), however it is indeed a good idea to start playing around with them. The look-around assertions do not actually consume the characters in the string. As you can see here, the full match for both matches only consists of what is captured by the capture group. You may find more information here.
This should work:
track\/(\w+)
Please see here.
Since track is part of both the strings, and the ID is formed from alphanumeric characters, the above regex which matches the string "track/" and captures the alphanumeric characters after that string, should provide the required ID.
Regex : (\w+(?=\?))|(\w+$)
See the demo for the regex, https://regexr.com/3s4gv .
This first looks for a word that is immediately followed by '?'; if that fails, it takes the last word.

Regular expression to exclude tag groups or match only (.*) in between tags

I am struggling with this regex for a while now.
I need to match the text which is in between the <ns3:OutputData> data</ns3:OutputData>.
Note: after ns there could be 1 or 2 digits
Note: the data is in one line just as in the example
Note: the ... preceding and ending is just to mention there are more tags nested
My regex so far: (ns\d\d?:OutputData>)\b(.*)(\/\1)
Sample text:
...<ns3:OutputData>foo bar</ns3:OutputData>...
I have tried (?:(ns\d\d?:OutputData>)\b)(.*)(?:(\/\1)) in an attempt to exclude group 1 and 3.
I want to exclude the matched start and end tags from the result.
Any help is much appreciated.
EDIT
There might be a regex interpretation issue in Grep Console for IntelliJ, which is where I intend to use the regex.
Your regex is almost there. All you need to do is to make the inside-matcher non-greedy. I.e. instead of (.*) you can write (.*?).
Another, xml-specific alternative is the negated character-class: ([^<]*).
So, this is the regex: (ns\d\d?:OutputData>)\b(.*?)(\/\1) You can experiment with it here.
Update
To make sure that the only group is the one that matches the text, then you have to make it work without backreferences: (?:ns\d\d?:OutputData>)\b(.*?)<
Update 2
It's possible to match only the required parts using lookbehind. Check the regex here:
(?<=ns\d:OutputData>)\b([^<]*)|(?<=ns\d\d:OutputData>)\b([^<]*)
Explanation:
The two alternatives are almost identical. The only difference is the number of digits. This is important because some flavors support only fixed-length lookbehinds.
Checking alternative one, we put the starting tag into one lookbehind (?<=...) so it won't be included into the full match.
Then we match every character that is not < greedily: [^<]*. This stops matching at the first closing tag.
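The two fixed-length lookbehind alternatives can be tried out as follows. This is a sketch assuming Python's re module, which is one of the flavors that only accepts fixed-width lookbehinds; the two-digit sample string and the helper output_data are hypothetical.

```python
import re

SAMPLE = '...<ns3:OutputData>foo bar</ns3:OutputData>...'
SAMPLE_TWO_DIGITS = '...<ns12:OutputData>baz</ns12:OutputData>...'

# One alternative per tag length, because each lookbehind must have a fixed width.
LOOKBEHIND = re.compile(
    r'(?<=ns\d:OutputData>)\b([^<]*)|(?<=ns\d\d:OutputData>)\b([^<]*)'
)


def output_data(text: str):
    """Return the full match (the text between the tags), or None."""
    match = LOOKBEHIND.search(text)
    return match.group(0) if match else None


print(output_data(SAMPLE))
print(output_data(SAMPLE_TWO_DIGITS))
```

The full match is used here rather than a numbered group, since which group is filled depends on which alternative matched.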
Essentially, you need a look behind and a look ahead with a back reference to match just the content, but variable length look behinds are not allowed. Fortunately, you have only 2 variations, so an alternation deals with that:
(?<=<(ns\d:OutputData>)).*?(?=<\/\1)|(?<=<(ns\d\d:OutputData>)).*?(?=<\/\2)
The entire match is the target content between the tags, which may contain anything (including left angle brackets etc).
Note also the reluctant quantifier .*?, so the match stops at the next matching end tag, rather than greedy .* that would match all the way to the last matching end tag.
See live demo.
This was the answer in my case:
(?<=(ns\d:OutputData)>)(.*?)(?=<\/\1)
The answer is based on the three solutions #WiktorStribiżew gave in the comments.
The last one worked, and I made a slight modification to it.
Thanks all for the effort and especially #WiktorStribiżew!
EDIT
Ok, yes #Bohemian it does not match 2-digits, I forgot to update:
(?<=(ns\d{0,2}:OutputData)>)(.*?)(?=<\/\1)

regex negative lookbehind - pcre

I'm trying to write a rule to match on a top level domain followed by five digits. My problem arises because my existing pcre matches what I have described, but much later in the URL than where I want it to. I want it to match on the first occurrence of a TLD, not anywhere else. The easy way to check for this is to match on the TLD only when it has not been preceded at some point by the "/" character. I tried using negative lookbehind, but that doesn't work because it only looks back one single character.
e.g.: How it is currently working
domain.net/stuff/stuff=www.google.com/12345
matches .com/12345 even though I do not want this match because it is not the first TLD in the URL
e.g.: How I want it to work
domain.net/12345/stuff=www.google.com/12345
matches on .net/12345 and ignores the later match on .com/12345
My current expression
(\.[a-z]{2,4})/\d{5}
EDIT: rewrote it so perhaps the problem is clearer in case anyone in the future has this same issue.
You're pretty close :)
You just need to be sure that before matching what you're looking for (i.e: (\.[a-z]{2,4})/\d{5}), you haven't met any / since the beginning of the line.
I would suggest simply prepending ^[^\/]*\. to your current regex.
Thus, the resulting regex would be:
^[^\/]*\.([a-z]{2,4})/\d{5}
How does it work?
^ asserts that this is the beginning of the tested String
[^\/]* accepts any sequence of characters that doesn't contain /
\.([a-z]{2,4})/\d{5} is the pattern you want to match (a . followed by 2 to 4 lowercase letters, then a / and 5 digits).
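The difference between the original and the anchored pattern shows up on the two URLs from the question. A minimal sketch, assuming Python's re module instead of PCRE; UNANCHORED, ANCHORED, and first_tld are hypothetical names.

```python
import re

LATE_MATCH = 'domain.net/stuff/stuff=www.google.com/12345'
EARLY_MATCH = 'domain.net/12345/stuff=www.google.com/12345'

# Pattern from the question: may match anywhere in the URL.
UNANCHORED = re.compile(r'(\.[a-z]{2,4})/\d{5}')
# Pattern from this answer: no "/" may appear before the TLD.
ANCHORED = re.compile(r'^[^\/]*\.([a-z]{2,4})/\d{5}')


def first_tld(pattern: re.Pattern, url: str):
    """Return capture group 1, or None if the pattern does not match."""
    match = pattern.search(url)
    return match.group(1) if match else None


for url in (LATE_MATCH, EARLY_MATCH):
    print('unanchored:', first_tld(UNANCHORED, url))
    print('anchored:  ', first_tld(ANCHORED, url))
```

Note that the anchored pattern captures the TLD without its leading dot, because the dot was moved outside the group.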
Here is a permalink to a working example on regex101.
Cheers!
You can use this regex:
'|^(\w+://)?([\w-]+\.)+\w+/\d{5}|'
Online Demo: http://regex101.com/

Regex PCRE: Validate string to match first set of string instead of last

I tried quite a few things, but I'm stuck with my regex whenever the input meets the criteria 2 consecutive times. In this case it treats the input as one expression instead of 2.
\[ame\=[^\.]+(.+)youtube\.(.+)v\=([^\]\&\"]+)[\]\'\"\&](.+)\[\/ame\]
E.g.
[ame="http://www.youtube.com/watch?v=brfr5CD2qqY"][B][COLOR=yellow]http://www.youtube.com/watch?v=brfrx5D2qqY[/COLOR][/B][/ame][/U]
[B][COLOR=yellow]or[/COLOR][/B] [B][COLOR=yellow]B[/COLOR][/B]
[ame="http://www.youtube.com/watch?v=M9ak3rKIBAU"][B][COLOR=yellow]http://www.youtube.com/watch?v=M9a3arKIBAU[/COLOR][/B][/ame]
[B][COLOR=yellow]or[/COLOR][/B] [B][COLOR=yellow]C[/COLOR][/B]
[ame="http://www.youtube.com/watch?v=7vh--3pyq5U"][COLOR=yellow]http://www.youtube.com/watch?v=7vh--3pyq5U[/COLOR][/ame]
In that case, instead of matching all 3 options separately, this regex treats them as one match.
Any ideas how to write an expression that matches only up to the first "[/ame]"?
The problem is the use of .+ - they are "greedy", meaning they will consume as much input as possible and still match.
Change them to reluctant quantifiers: .+?, which won't skip forward over the end of the first match to match the end of the last match.
I'm not sure what your objective is (you haven't made that clear yet)
But this will match and capture out the youtube URL for you, ensuring you only match each single instance between [ame= and [/ame]
/\[ame=["'](.*?)["'](.*?)\/ame\]/i
Here's a working example, and a great sandbox to play around in: http://regex101.com/r/jR4lK2
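The greedy versus reluctant behavior can be seen by counting matches on a line that contains two [ame] blocks. This is a sketch assuming Python's re module rather than PCRE; the shortened sample text and the names GREEDY and RELUCTANT are hypothetical, and the greedy variant is only the answer's regex with .*? replaced by .* for comparison.

```python
import re

# Two [ame] blocks on a single line, shortened from the question's sample.
TEXT = (
    '[ame="http://www.youtube.com/watch?v=brfr5CD2qqY"]first[/ame] or '
    '[ame="http://www.youtube.com/watch?v=M9ak3rKIBAU"]second[/ame]'
)

# Greedy: runs from the first [ame= to the last /ame].
GREEDY = re.compile(r'''\[ame=["'](.*)["'](.*)\/ame\]''', re.IGNORECASE)
# Reluctant: stops at the first closing quote and the first /ame].
RELUCTANT = re.compile(r'''\[ame=["'](.*?)["'](.*?)\/ame\]''', re.IGNORECASE)

greedy_matches = GREEDY.findall(TEXT)
reluctant_matches = RELUCTANT.findall(TEXT)

print(len(greedy_matches), 'greedy match(es)')
print([url for url, _ in reluctant_matches])
```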