Repeating groups regex url path, node.js - regex

I am trying to extract Express route named parameters with a regex.
For example, given:
www.test.com/something/:var/else/:var2
I am trying this regex:
.*\/?([:]+\w+)+
but I only get the last matched group.
Does anyone know how to match both :var and :var2?

The first problem is that .* is greedy, so it consumes everything up to the final match. This means that the first :var is skipped.
However, as you are searching for a variable number of capture groups (with thanks to @MichaelTang), I recommend using two regexes in sequence. First, use
^(?:.*?\/?\:\w+)+$
to detect which lines contain colon-elements...
...and then search that line repeatedly for, simply
\/:(\w+)
This places the text post-colon into capture group one.
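In Node.js, the two steps above could be sketched roughly as follows. The variable names are illustrative only, and String.prototype.matchAll assumes Node 12 or newer.

```javascript
// Sketch of the two-step approach, using the sample route from the question.
const route = "www.test.com/something/:var/else/:var2";

// Step 1: the gate pattern from the answer. Note that it only accepts
// lines that end with a ":name" segment.
const hasParams = /^(?:.*?\/?:\w+)+$/.test(route);

// Step 2: with the g flag, matchAll visits every occurrence;
// capture group 1 holds the text after the colon.
const paramNames = hasParams
  ? Array.from(route.matchAll(/\/:(\w+)/g), (m) => m[1])
  : [];

console.log(paramNames);
```

Because each match carries its own capture group, this avoids the "only the last group survives" behaviour of a repeated group in a single pattern.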

Here is how you can match both of them:
'www.test.com/something/:var/else/:var2'.match(/:(\w+)/g)
[":var", ":var2"]

Related

Excluding 3dots additional to other characters with regex in a string

I have such an http-url detector regex:
(?:http|https)(?::\/{2}[\w]+)(?:[\/|\.]?)(?:[^\s<"]*)
It works pretty well for the following url representation:
http://www.acer.com/clearfi/download/
What modification can I make to extract
http://schemas.microsoft.com/office/word/2003/wordml2450
from
Huanghhttp://schemas.microsoft.com/office/word/2003/wordml2450...)()()()()()
?
You can modify it to capture:
group of http stuff
followed by (group of) subdomain stuff
followed by as many as possible groups of:
one point or slash
followed by a group of characters (non-point, non-space, non-", non-<)
(?:http|https)(?::\/{2}[\w]+)([\/|\.][^\s<"\.]+)*
I made capturing groups to visualize the results
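A quick way to check this idea in JavaScript is sketched below. It assumes the colon after the scheme is kept, as in the question's original pattern, and it makes the repeated group non-capturing because only the whole match is needed here. The helper name extractUrl is made up for the example.

```javascript
// Scheme, "://", host start, then any number of "separator + non-dot run" groups.
// A run of three dots cannot satisfy "separator followed by a non-dot character",
// so the match stops right before the trailing "...".
const urlPattern = /(?:http|https)(?::\/{2}[\w]+)(?:[\/|.][^\s<".]+)*/;

// Hypothetical helper: returns the first URL-like match or null.
const extractUrl = (text) => {
  const m = text.match(urlPattern);
  return m ? m[0] : null;
};

const noisy = "Huanghhttp://schemas.microsoft.com/office/word/2003/wordml2450...)()()()()()";
const clean = "http://www.acer.com/clearfi/download/";

console.log(extractUrl(noisy));
console.log(extractUrl(clean));
```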
I've changed your expression here and there: (.*)(https?:\/{2}[\w]+[\/|\.]?[^\s<"]*)(\.{3}.*) and take only the second capturing group from it. See the example here: https://regex101.com/r/0viPC5/2
This expression can probably be simplified further, but I don't know your exact input and search criteria, so let's stick with what you already wrote.

Regular expression to exclude tag groups or match only (.*) in between tags

I have been struggling with this regex for a while now.
I need to match the text which is in between the <ns3:OutputData> data</ns3:OutputData>.
Note: after ns there could be 1 or 2 digits
Note: the data is in one line just as in the example
Note: the ... preceding and ending is just to mention there are more tags nested
My regex so far: (ns\d\d?:OutputData>)\b(.*)(\/\1)
Sample text:
...<ns3:OutputData>foo bar</ns3:OutputData>...
I have tried (?:(ns\d\d?:OutputData>)\b)(.*)(?:(\/\1)) in an attempt to exclude group 1 and 3.
I want to exclude the tags which are matched.
Any help is much appreciated.
EDIT
There might be a regex interpretation issue with Grep Console for IntelliJ, where I intend to use the regex.
Your regex is almost there. All you need to do is to make the inside-matcher non-greedy. I.e. instead of (.*) you can write (.*?).
Another, xml-specific alternative is the negated character-class: ([^<]*).
So, this is the regex: (ns\d\d?:OutputData>)\b(.*?)(\/\1) You can experiment with it here.
Update
To make sure that the only group is the one that matches the text, you have to make it work without backreferences: (?:ns\d\d?:OutputData>)\b(.*?)<
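The difference between the greedy, lazy and negated-class variants only shows up when a line holds more than one matching tag. The sketch below uses a made-up line with two OutputData elements (the question's sample has only one) and runs in JavaScript, so behaviour in Grep Console may differ.

```javascript
// Hypothetical input with two elements on one line.
const line =
  "<ns3:OutputData>foo</ns3:OutputData><ns3:OutputData>bar</ns3:OutputData>";

// Greedy: (.*) runs to the end of the line and backtracks to the LAST "<".
const greedy = line.match(/(?:ns\d\d?:OutputData>)\b(.*)</)[1];

// Lazy: (.*?) stops at the FIRST "<" that lets the rest of the pattern match.
const lazy = line.match(/(?:ns\d\d?:OutputData>)\b(.*?)</)[1];

// Negated class: [^<]* cannot cross a "<" at all.
const negated = line.match(/(?:ns\d\d?:OutputData>)\b([^<]*)</)[1];

console.log({ greedy, lazy, negated });
```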
Update 2
It's possible to match only the required parts using lookbehind:
(?<=ns\d:OutputData>)\b([^<]*)|(?<=ns\d\d:OutputData>)\b([^<]*)
Explanation:
The two alternatives are almost identical. The only difference is the number of digits. This is important because some flavors support only fixed-length lookbehinds.
Checking alternative one, we put the starting tag into one lookbehind (?<=...) so it won't be included into the full match.
Then we match every character other than < greedily: [^<]*. This stops matching at the first closing tag.
Essentially, you need a look behind and a look ahead with a back reference to match just the content, but variable length look behinds are not allowed. Fortunately, you have only 2 variations, so an alternation deals with that:
(?<=<(ns\d:OutputData>)).*?(?=<\/\1)|(?<=<(ns\d\d:OutputData>)).*?(?=<\/\2)
The entire match is the target content between the tags, which may contain anything (including left angle brackets etc).
Note also the reluctant quantifier .*?, so the match stops at the next matching end tag, rather than greedy .* that would match all the way to the last matching end tag.
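As a rough check, the lookaround pattern above can be run in JavaScript, assuming an engine with lookbehind support (for example Node 10 or newer). The second sample string is invented to show a two-digit prefix and content containing a left angle bracket.

```javascript
// The alternation covers one-digit and two-digit prefixes separately,
// mirroring the fixed-length lookbehind restriction of some flavors.
const contentPattern =
  /(?<=<(ns\d:OutputData>)).*?(?=<\/\1)|(?<=<(ns\d\d:OutputData>)).*?(?=<\/\2)/;

const oneDigit = "...<ns3:OutputData>foo bar</ns3:OutputData>...";
const twoDigits = "...<ns12:OutputData>a < b</ns12:OutputData>..."; // hypothetical

// The whole match (index 0) is the content; the tags are only asserted.
const first = oneDigit.match(contentPattern)[0];
const second = twoDigits.match(contentPattern)[0];

console.log(first, "|", second);
```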
This was the answer in my case:
(?<=(ns\d:OutputData)>)(.*?)(?=<\/\1)
The answer is based on the three solutions @WiktorStribiżew gave in the comments.
The last one worked and I have made a slight modification of it.
Thanks all for the effort and especially @WiktorStribiżew!
EDIT
Ok, yes @Bohemian, it does not match 2 digits; I forgot to update:
(?<=(ns\d{0,2}:OutputData)>)(.*?)(?=<\/\1)

Is there any upper limit for number of groups used or the length of the regex in Notepad++?

I am new to using regex. I am trying to use the regex find and replace option in Notepad++.
I have used the following regex:
((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))(/)((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))
For the following text:
2/2
+2/+2
-2/-2
2+/2+
2-/2-
But I am able to get full matches only for the first three. For the last two, it gives only partial matches, excluding the trailing "+" and "-". I am wondering if there is any upper limit on the number of groups (which I doubt) or on the maximum length of the regex. I am not sure why my regex is failing. If there is anything wrong with my regex, please correct it.
This is not an issue with Notepad++'s regex engine. The problem is that when you have alternations like (?:)|(\+)|(-), the regex engine will attempt to match the different options in the order they are specified. Since you specified an empty group first, it will attempt to match an empty string first, only matching the + or - if it needs to backtrack. This essentially makes the alternation lazy—it will never match any character unless it has to.
vks's answer works perfectly well, but just in case you actually needed those capturing groups separated out, you can do the same thing just by rewriting your alternations like this:
((\+)|(-)|(?:))(\d)((\+)|(-)|(?:))(/)((\+)|(-)|(?:))(\d)((\+)|(-)|(?:))
or even more simply, like this:
((\+)|(-)|)(\d)((\+)|(-)|)(/)((\+)|(-)|)(\d)((\+)|(-)|)
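The effect of alternation order can be reproduced outside Notepad++. The sketch below uses JavaScript, which is also a backtracking engine that tries alternatives left to right; it is an illustration rather than a test of Notepad++ itself, and the variable names are made up.

```javascript
// Empty alternative first: each sign group prefers to match nothing.
const emptyFirst =
  /((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))(\/)((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))/;

// Empty alternative last: each sign group tries "+" and "-" before giving up.
const emptyLast =
  /((\+)|(-)|(?:))(\d)((\+)|(-)|(?:))(\/)((\+)|(-)|(?:))(\d)((\+)|(-)|(?:))/;

const sample = "2+/2+";

// The final group has nothing after it to force backtracking,
// so with emptyFirst it settles for the empty string.
const partial = sample.match(emptyFirst)[0];
const full = sample.match(emptyLast)[0];

console.log(partial, full);
```

The middle groups still pick up their sign with emptyFirst, because the following literal forces the engine to backtrack; only the last group is left with nothing to push it.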
You can use this simple regex to match all cases:
([-+]?)(\d)([-+]?)(/)([-+]?)(\d)([-+]?)
See here: https://www.regex101.com/r/fG5pZ8/19

Is it possible to say in Regex "if the next word does not match this expression"?

I'm trying to detect occurrences of words italicized with *asterisks* around them. However, I want to ensure they are not within a link. So it should find "text" in here is some *text* but not within http://google.com/hereissome*text*intheurl.
My first instinct was to use look aheads, but it doesn't seem to work if I use a URL regex such as John Gruber's:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
And put it in a look ahead at the beginning of the pattern, followed by the rest of the pattern.
(?=URLPATTERN)\*[a-zA-Z\s]\*
So how would I do this?
You can use an alternation to first match, on the left-hand side, everything that you want to discard. Then, on the right-hand side, use a capture group to match the desired text.
https?:\/\/\S*|(\*\S+\*)
You can then use captured group #1 for your emphasized text.
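In JavaScript, this discard-by-alternation technique might look like the sketch below. The sample sentence is assembled from the question's two examples, and the variable names are illustrative.

```javascript
const text =
  "here is some *text* but not http://google.com/hereissome*text*intheurl";

// Left branch swallows whole URLs (and leaves group 1 unset);
// right branch captures *emphasized* words in group 1.
const pattern = /https?:\/\/\S*|(\*\S+\*)/g;

// Keep only the matches where group 1 actually participated.
const emphasized = Array.from(text.matchAll(pattern))
  .filter((m) => m[1] !== undefined)
  .map((m) => m[1]);

console.log(emphasized);
```

The URL is still matched, but because it is consumed by the left branch, the asterisks inside it are never offered to the right branch.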
The following regexp:
^(?!http://google.com/hereissome.*text.*intheurl).*
matches everything but http://google.com/hereissome*text*intheurl. This is called a negative lookahead. Some regexp libraries may not support it; Python's does.
Here is a link to Mastering Lookahead and Lookbehind.
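A small JavaScript sketch of the negative lookahead, for illustration. Unlike the pattern above, the dots in the host name are escaped here so that they match only literal dots; that is a choice made for this example.

```javascript
// Succeeds at the start of the string only if the excluded URL does NOT follow.
const notThatUrl = /^(?!http:\/\/google\.com\/hereissome.*text.*intheurl).*/;

const rejected = notThatUrl.test("http://google.com/hereissome*text*intheurl");
const accepted = notThatUrl.test("here is some *text*");

console.log(rejected, accepted);
```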

Notepad++ regex group capture

I have a text file like this:
ххх.prontube.ru
salo.ru
bbb.antichat.ru
yyy.ru
xx.bb.prontube.ru
zzz.com
srfsf.jwbefw.com.ua
I am trying to delete all subdomains with this regex:
Find: .+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$
Replace with: \1
Result:
prontube.ru
salo.ru
antichat.ru
yyy.ru
prontube.ru
zzz.com
com.ua
Why does the last line become com.ua instead of jwbefw.com.ua?
This works without look around:
Find: [a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$
Replace: \1\.\2
It finds something with at least 2 periods and only letters, numbers, and dashes following the last two periods; then it replaces it with the last 2 parts. More intuitive, in my opinion.
There's something funny going on with that leading xxx. It doesn't appear to be plain ASCII. For the sake of this question, I'm going to assume that's just something funny with this site and not representative of your real data.
Incorrect
Interestingly, I previously had an incorrect answer here that accumulated a lot of upvotes. So I think I should preserve it:
Find: [a-zA-Z0-9-]+\.([a-zA-Z0-9-]+)\.(.+)$
Replace: \1\.\2
It just finds a host name with at least 2 periods in it, then replaces it with everything after the first dot.
The .+ part is matching as much as possible. Try using .+? instead, and it will capture the least possible, allowing the com.ua option to match.
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
This answer still uses the specific domain names that the original question was looking at. As some TLD (top level domains) have a period in them, and you could theoretically have a list including multiple subdomains, whitelisting the TLD in the regex is a good idea if it works with your data set. Both current answers (from 2013) will not handle the difference between "xx.bb.prontube.ru" and "srfsf.jwbefw.com.ua" correctly.
Here is a quick explanation of why psnig's original regex isn't working as intended:
The + is greedy.
.+ will zip all the way to the right at the end of the line capturing everything,
then work its way backwards (to the left) looking for a match from here:
(ru|ua|com\.ua|com|net|info)
With srfsf.jwbefw.com.ua the regex engine will first fail to match a,
then it will move the token one place to the left to look at "ua"
At that point, ua from the regex (the second option) is a match.
The engine will not keep looking to find "com.ua" because ".ua" met that requirement.
Niet the Dark Absol's answer tells the regex to be "lazy"
.+? will match any character (at least one) and then try to find the next part of the regex. If that fails, it will advance the token, .+ matching one more character and then evaluating the rest of the regex again.
The .+? will eventually consume: srfsf.jwbefw before matching the period, and then matching com.ua.
But the implementation of ? also creates issues.
Adding the question mark makes that first .+ lazy, but then causes group 1 to match bb.prontube.ru instead of prontube.ru
This is because the first period (the one after xx) will match, and then inside group 1 (.*?) will match bb.prontube before \.(ru|ua|com\.ua|com|net|info))$ matches .ru
To avoid this, change the inner group from (.*?) to ([\w-]*?) so that it cannot match a period, only letters, digits, underscores, or a dash.
Resulting regex:
.+?\.(([\w-]*?)\.(ru|ua|com\.ua|com|net|info))$
Note that you don't need to capture any groups other than the first. Adding ?: makes the TLD options non-capturing.
Final version:
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
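To compare the original and final patterns side by side, here is a sketch in JavaScript. It applies each regex per line and uses $1 as the replacement; a few ASCII host names from the question serve as input, and the helper name stripWith is invented for the example. Notepad++ uses a different engine, so treat this as an illustration of the backtracking behaviour rather than a guarantee.

```javascript
const hosts = [
  "bbb.antichat.ru",
  "salo.ru",
  "xx.bb.prontube.ru",
  "zzz.com",
  "srfsf.jwbefw.com.ua",
];

// Original pattern: greedy .+ pushes as far right as possible,
// so the "ua" alternative wins on the last host.
const original = /.+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$/;

// Final pattern: lazy prefix plus a dot-free label before the TLD.
const final = /.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$/;

// Hypothetical helper: hosts that do not match are left unchanged.
const stripWith = (re) => hosts.map((h) => h.replace(re, "$1"));

const before = stripWith(original);
const after = stripWith(final);

console.log(before);
console.log(after);
```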
Search what: .+?\.(\w+\.(?:ru|com|com\.au))
Replace with: $1