Regex and angle brackets - regex

I have the following regular expression on nginx:
^(?<subdomain>.+)\.test\.com$
If parentheses are for grouping, then how does it match 'something.test.com' or 'foobar.test.com'?
I was expecting it to match only the literal word 'subdomain'. I think I am not understanding the ? and the <> symbols. Also, I can't see the use for the .+ at the end.

(?<name>.+) is a named capture group. The only pattern part of this group is the .+
The benefit to using named capture groups is that you can reference them by name rather than number, so in this case "something" or "foobar" can be referenced using the subdomain capture group.
The .+ at the end just means to match one or more of any character except newline characters.
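A minimal sketch of the same idea in Python, as an illustration only. Note the assumption: Python's re module spells named groups as (?P<name>...), whereas nginx (PCRE) also accepts the (?<name>...) form from the question. The host names are the ones from the question plus one non-matching case.

```python
import re

# Python spelling of the named group; nginx/PCRE also accepts (?<subdomain>.+)
pattern = re.compile(r"^(?P<subdomain>.+)\.test\.com$")

for host in ["something.test.com", "foobar.test.com", "test.com"]:
    m = pattern.match(host)
    # The group name is only a label; the group holds whatever .+ matched
    print(host, "->", m.group("subdomain") if m else None)
```

The name "subdomain" never has to appear in the input; it is just the handle used to read back the text that .+ matched.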

Related

Regex - find the param in a url in any position in the string

I am trying to match a URL param whose position in the URI is not fixed. It can show up right after the ? or after an &. I need to match the vr=359821 param in the URIs below. How can I do this?
Example urls:
/br/col/aon/11631?vr=359821&cId=9113
/br/col/aon/11631?cId=9113&vr=359821
/br/col/aon/11631?cId=9113&vr=359821&grid=2&page=something
Some things I tried:
I tried to use backreferences (not sure if this is the right approach) but was not successful.
I was trying to group them and maybe use a backreference to find the string within that group.
(\/br\/col\/aon\/11631)(\?cId=9113&(vr=359821)) # this matches second url above but not others.
(\/br\/col\/aon\/11631)(\?cId=9113&(vr=359821)).+?\3 # this is wrong I know.
(\/br\/col\/aon\/11631)(\?cId=9113&(vr=359821)).*?\2[vr=359821] # this is wrong
The above regexes are wrong, but my idea was to make it a group and match vr=359821 in that group. I don't know if this is even possible in regex.
Why I am doing this:
The final goal is to redirect this URL to a different URL, with all the params from the original request, in nginx.
In the last 2 patterns that you tried, you are using a backreference like \2 and \3. But a backreference will match the same data that was already captured in the corresponding group.
In this case, that is not the desired behaviour. Instead, you want to match a key value pair in the uri, which does not have to exist in the content before.
Therefore you can match the start of the pattern followed by a non greedy quantifier (as it can also occur right after the question mark) to match the first occurrence of vr= followed by 1 or more digits.
In the comments I suggested this pattern \/br\/col\/aon\/11631\b.*?[?&](vr=\d+), but (depending on the regex delimiters) you don't have to escape the forward slash.
The pattern could be
/br/col/aon/11631\b.*?[?&](vr=\d+)
The pattern matches
/br/col/aon/11631\b Match the start of the pattern followed by a word boundary
.*? Match any character, as few times as possible
[?&] Match either ? or &
(vr=\d+) Capture group 1, match vr= followed by 1+ digits
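As a quick check, the pattern can be run against the three example URLs from the question. This is a sketch using Python's re module rather than nginx itself, so it only illustrates the matching logic.

```python
import re

pattern = re.compile(r"/br/col/aon/11631\b.*?[?&](vr=\d+)")

urls = [
    "/br/col/aon/11631?vr=359821&cId=9113",
    "/br/col/aon/11631?cId=9113&vr=359821",
    "/br/col/aon/11631?cId=9113&vr=359821&grid=2&page=something",
]

# Group 1 is the key=value pair, wherever it sits in the query string
captured = [pattern.search(u).group(1) for u in urls]
print(captured)
```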
From what I read, nginx uses PCRE. To get a more specific pattern, one option could be:
/br/col/aon/11631\?.*?(?<=[?&])(vr=\d+)(?=\&|$)
This pattern matches
/br/col/aon/11631\? Match the start of the pattern followed by the question mark
.*? Match any character, as few times as possible
(?<=[?&]) Positive lookbehind, assert what is directly to the left is either ? or &
(vr=\d+) Capture group 1, match vr= followed by 1+ digits
(?=\&|$) Positive lookahead, assert what is directly to the right is & or the end of the string to prevent a partial match
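The effect of the lookahead can be sketched in Python, whose re module supports the same lookaround syntax. The fourth URL below, with a trailing "x" after the digits, is a hypothetical input added only to show the partial match being rejected.

```python
import re

strict = re.compile(r"/br/col/aon/11631\?.*?(?<=[?&])(vr=\d+)(?=\&|$)")

urls = [
    "/br/col/aon/11631?vr=359821&cId=9113",
    "/br/col/aon/11631?cId=9113&vr=359821",
    "/br/col/aon/11631?cId=9113&vr=359821&grid=2&page=something",
]
strict_hits = [strict.search(u).group(1) for u in urls]

# Hypothetical URL: the value is not purely digits, so the lookahead rejects it
partial = strict.search("/br/col/aon/11631?cId=9113&vr=359821x")

print(strict_hits)
print(partial)
```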

python regex non-capture group handling

(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
used to match string
123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23
How come this works in regexr.com but Python 3.5.1 can't find a match?
r'(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))'
can match up to
123 FEX-1-80 Online N2K-C2248TP
but the second hyphen - in group(4) is not matched
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
Just a comment, not really an answer but for the sake of clarity I have put it as an answer.
If you are relatively new to regular expressions, consider using verbose mode. With it, your expression becomes much more readable:
(1[0-9]{2})\s+ # three digits, the first one needs to be 1
(\w+(?:-\w+)+)\s+ # a word character (wc), followed by - and wcs
(\w+)\s+ # another word
(\w+(?:-\w+)+)\s+ # same expression as above
(\w+) # another word
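In Python, verbose mode is enabled with the re.VERBOSE flag. A minimal sketch using the input line from the question:

```python
import re

pattern = re.compile(r"""
    (1[0-9]{2})\s+        # three digits, the first one must be 1
    (\w+(?:-\w+)+)\s+     # word characters followed by one or more -word parts
    (\w+)\s+              # another word
    (\w+(?:-\w+)+)\s+     # same shape as group 2
    (\w+)                 # another word
""", re.VERBOSE)

line = "123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23"
m = pattern.match(line)
print(m.groups())
```

In this mode, whitespace inside the pattern is ignored and # starts a comment, so a literal space or # would have to be escaped or placed in a character class.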
Also, check whether your second and fourth expressions could be rewritten as [\w-]+. It is not the same as yours and will match other substrings, but try to avoid nested parentheses in general.
Concerning your question, the second string cannot be matched as you made all of your expressions mandatory (and group 5 is missing in the second example, so it will fail).
See a demo on regex101.com.
This regular expression matches the full input string:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)
This one doesn't:
(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))
The latter is missing a + after the last non-capturing group, and it's missing the \s+(\w+) at the end that matches the SSDFDFWFw23r23 at the end of the input string.
From what I understand, non-capture group character can appear more than once in the group, what went wrong here?
I'm not sure I follow. A non-capturing group is really just there to group a part of a regular expression.
(?:-\w+) or just -\w+ will both match a hyphen (-) followed by one or more "word" characters (\w+). It doesn't matter whether that regular expression is in a non-capturing group or not. If you want to match repetitions of that pattern, you can use the + modifier after the non-capturing group, e.g. (?:-\w+)+. That pattern will match a string like -foo-bar-baz.
So the reason your second regular expression doesn't match the repeated pattern is because it's lacking the + modifier.
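The difference between the two patterns can be checked directly in Python. This sketch uses the input line and both expressions from the question; the variable names are illustrative only.

```python
import re

line = "123 FEX-1-80 Online N2K-C2248TP-1GE SSDFDFWFw23r23"

full = r"(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+)+)\s+(\w+)"
short = r"(1[0-9]{2})\s+(\w+(?:-\w+)+)\s+(\w+)\s+(\w+(?:-\w+))"

m_full = re.search(full, line)
m_short = re.search(short, line)

print(m_full.group(4))   # group 4 with the + after the non-capturing group
print(m_short.group(4))  # group 4 without it: only one "-word" part is consumed
print(m_short.group(0))  # the overall match stops where group 4 stops
```

Without the trailing +, the non-capturing group is applied exactly once, so the match ends after the first hyphenated part.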

Non capturing group included in capture?

This text
"dhdhd89(dd)"
Matched against this regex
.+?(?:\()
..returns "dhdhd89(".
Why is the start parenthesis included in the capture?
Two different tools, as well as the .NET Regex class, return the same result. So I gather there is something I don't understand about this.
The way I read my regex is.
Match any character, at least one occurrence. As few as possible.
The matched string should be followed by a start parenthesis, but not to be included in the capture.
I can find a workaround, but I still want to know what is going on.
Just turn the non-capturing group to positive lookahead assertion.
.+?(?=\()
.+? is a non-greedy match of one or more characters, here followed by an opening parenthesis. An assertion does not consume any characters; it only checks whether a match is possible at that position. A non-capturing group, by contrast, does consume the characters it matches.
You can just use this negation based regex to capture only text before a literal (:
^([^(]+)
When you use:
.+?(?:\()
The regex engine does match the ( after the initial text; it just doesn't return it to you in a captured group.
You haven't defined any capture groups, so I guess you are displaying the whole match (group 0). You can do:
(.+?)(?:\()
and the string you want is in group 1
or use a lookahead as @AvinashRaj said.
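The suggestions in this thread can be compared side by side. The question used .NET; the sketch below uses Python's re module, which behaves the same way for these particular patterns.

```python
import re

text = "dhdhd89(dd)"

# Non-capturing group: the ( is still consumed, so it is part of the whole match
whole = re.search(r".+?(?:\()", text).group(0)

# Lookahead: the ( is only asserted, never consumed
lookahead = re.search(r".+?(?=\()", text).group(0)

# Capture group: read group 1 instead of the whole match (group 0)
captured = re.search(r"(.+?)(?:\()", text).group(1)

# Negated character class: everything up to the first (
negated = re.search(r"^([^(]+)", text).group(1)

print(whole, lookahead, captured, negated)
```

"Non-capturing" only means the group gets no group number of its own; it does not exclude the text from the overall match.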

Keep only the strings in between quotes in Notepad++

In Notepad++, I use the expression (?<=").*(?=") to find all strings in between quotes. It would then seem rather trivial to be able to keep only those results. However, I cannot find an easy solution for this.
I think the problem is that Notepad++ is not able to make multiple selections. But there must be some kind of workaround, right? Perhaps I must invert the regex and then find/replace those results to end up with the strings I want.
For example:
blablabla "Important" blabla
blabla "Again important" blablabla
I want to keep:
Important
Again important
There is no great solution for this and depending on your use case I would recommend writing a quick script that actually uses your first expression and creates a new file with all of the matches (or something like this). However, if you just want something quick and dirty, this expression should get you started:
[^"]*(?:"([^"]*)")?
\1\n
Explanation:
[^"]* # 0+ non-" characters
(?: # Start non-capturing group
" # " literally
( # Start capturing group
[^"]* # 0+ non-" characters
) # End capturing group
" # " literally
)? # End non-capturing group AND make it optional
The reason the optional non-capturing group is used is because the end of your file may very well not have a string in quotes, so this isn't a necessary match (we're more interested in the first [^"]* that we want to remove).
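The "quick script" route mentioned above might look like the following Python sketch. It is an assumption about what such a script could be, not part of the original answer, and it swaps the lookaround expression for the simpler "([^"]*)" so that several quoted strings on one line are kept separate. The function name quoted_strings is hypothetical.

```python
import re

def quoted_strings(text):
    """Return the contents of every double-quoted string in text."""
    return re.findall(r'"([^"]*)"', text)

sample = 'blablabla "Important" blabla\nblabla "Again important" blablabla'
print("\n".join(quoted_strings(sample)))
```

To process a real file, the text would be read from disk and the joined result written to a new file.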
Try something like this:
[^"\r\n]+"([^"]+)"[^"\r\n]+
And replace with $1. The above regex assumes there will be only 2 double quotes in each line.
[^"]+ matches non-quote characters.
[^"\r\n]+ matches non-quote, non newline characters.
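The same find-and-replace can be tried outside Notepad++. In this Python sketch the replacement is written as \1, since Python's re.sub does not use the $1 form; the sample text is the one from the question.

```python
import re

sample = 'blablabla "Important" blabla\nblabla "Again important" blablabla'

# Each line is replaced by the text between its pair of double quotes
result = re.sub(r'[^"\r\n]+"([^"]+)"[^"\r\n]+', r"\1", sample)
print(result)
```

Because both outer parts use +, a line only matches when there is at least one character before the opening quote and after the closing quote.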
Hard to be certain from your post, but I think you may want : SEE BELOW
<(?<=")(.*)(?=")
The part you keep will be captured as \2.
(?<=")(.*)(?=")
\1 \2 \3
Your original regex string uses parentheses to group characters for evaluation. Parentheses ALSO group characters for capturing. That is what I added.
Update:
The regex pattern you provided doesn't seem to work correctly. Won't this work?
\"(.*)\"
\1 now captures the content.

regular expression match word or absence of word

I am trying to match the word "group", or match the absence of the word "group".
http://rubular.com/r/TKJPFvnzZ0
I can match a space but I would like it to actually match nothing. I am struggling with finding the correct syntax.
Match group 3 should contain either group or empty string.
Thanks!
Not sure if I understood you correctly, but would this solve your problem:
^I post a "(.*?)" to the "(.*?)"(?: (group))? which the entire world can see$
?
Basically says that group is optional.
The ?: inside the parentheses marks that group as a "non-capturing group", which means that we're only enclosing that part of the expression in parentheses to group it, but we don't want to capture the content to use afterwards. The word group is simply enclosed in parentheses because we want to capture that match as a group.
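A small Python sketch of the optional group. The question used Rubular (Ruby), and the two sample sentences are assumptions, since the original test strings are only available behind the link. In Python an optional group that did not participate in the match is reported as None, so the sketch falls back to an empty string.

```python
import re

pattern = re.compile(
    r'^I post a "(.*?)" to the "(.*?)"(?: (group))? which the entire world can see$'
)

# Hypothetical inputs, one with the word "group" and one without
with_group = pattern.match(
    'I post a "message" to the "wall" group which the entire world can see'
)
without_group = pattern.match(
    'I post a "message" to the "wall" which the entire world can see'
)

print(with_group.groups())
print(without_group.groups())

# Normalise the missing third group to an empty string
third = without_group.group(3) or ""
```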