What is the purpose of this non-capturing group? - regex

I am trying to understand the inner workings of express.js, but I'm having a little trouble with one thing.
If you add a new route, like so:
app.get("/hello/darkness/myold/:name", ...)
The string I provided internally becomes a regular expression. Now, I worked out what I thought the regex should be internally, and I came up with:
^\/hello\/darkness\/myold\/([^\/]+?)\/?$
The ([^\/]+?) will capture the name parameter, \/? is present if strict routing is disabled, and the whole thing is wrapped in ^...$. However, when I looked at what is actually stored inside express, I found this:
^\/hello\/darkness\/myold\/(?:([^\/]+?))\/?$
As you can see, there is a non-capturing group around the capturing group. My question is: what is the purpose of this non-capturing group?
The method I used to see what regex express.js was using internally was simply to make an invalid regex and view the error console:
app.get('/hello/darkness/myold/:friend/[', function(req, res){});
yields
SyntaxError: Invalid regular expression: ^\/hello\/darkness\/myold\/(?:([^\/]+?))\/[\/?$

The answer to this question is that the non-capturing group is a relic of the case where a parameter is optional. Consider the difference between the following two routes:
/hello/:world/goodbye
/hello/:world?/goodbye
They will generate, respectively:
^\/hello\/(?:([^\/]+?))\/goodbye\/?$
^\/hello(?:\/([^\/]+?))?\/goodbye\/?$
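Note the important but subtle change to the non-capturing group when the parameter is optional: the group now also contains the preceding \/ and is followed by ?, so the slash and the parameter can be omitted together. A bare capturing group could not make the slash optional without also capturing it as part of the parameter.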
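The difference can be checked by running both generated patterns against sample paths. This is only a sketch: it uses Python's re module rather than the JavaScript engine express runs on (the two behave the same for these particular patterns), and the names required, optional and param are made up for illustration.

```python
import re

# Patterns copied from the two routes above.
required = re.compile(r"^\/hello\/(?:([^\/]+?))\/goodbye\/?$")
optional = re.compile(r"^\/hello(?:\/([^\/]+?))?\/goodbye\/?$")

def param(pattern, path):
    """Return the captured parameter, or 'no match' if the path is rejected."""
    m = pattern.match(path)
    return m.group(1) if m else "no match"

print(param(required, "/hello/world/goodbye"))  # world
print(param(required, "/hello/goodbye"))        # no match
print(param(optional, "/hello/world/goodbye"))  # world
print(param(optional, "/hello/goodbye"))        # None (matched, parameter absent)
```

In the optional case the path without the parameter still matches, and the capture group is simply unset.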

Related

NGINX location block regex and proxy pass

I hope all of you are well.
I am a beginner with NGINX and I am trying to understand the following NGINX config file block. I would be really grateful if someone could help me understand this block.
location ~ ^/search/google(/.*)?$ {
set $proxy_uri $1$is_args$args;
proxy_pass http://google.com$proxy_uri;
}
From the following SO article (https://stackoverflow.com/a/59846239), I understand that:
For the location ~ ^/search/google(/.*)?$
~ means that it will perform regex search (case sensitive)
^/search/google means that the route should start with /search/google (e.g. http://<ip or domain>/search/google). Is there any difference if we have a trailing / at the end (e.g. http://<ip or domain>/search/google/ instead of http://<ip or domain>/search/google)?
(/.*)?$ this is the part that I'm a bit confused.
why use () group in this case? What's the common use case of using group?
why use ? in this case? Isn't .* already includes any char zero or more, why do we still need ?
Can we simply remove () and ? such as /search/google/.*$ to get the same behavior as the original one?
set $proxy_uri $1$is_args$args;
I understand that we are setting a user-defined var called proxy_uri
what will $1 be replaced with, sometimes someone also include $2 and so on?
I think $is_args$args means that if there's a query string (e.g. http://<ip or domain>/search/google?fruit=apple), $is_args$args will be replaced with ?fruit=apple
proxy_pass http://google.com$proxy_uri
I would assume it just redirects the user to http://google.com$proxy_uri??? same as http redirect 301???
Thank you very much in advance!
Being a non-native English speaker, I thought someone would answer your question in better English than mine, but since no one has done so in the last five days, I will try to do it myself.
~ means that it will perform regex search (case sensitive)
I think the more correct term is "perform matching against a regex pattern".
^/search/google means that the route should start with /search/google (e.g. http://<ip or domain>/search/google). Is there any difference if we have a trailing / at the end (e.g. http://<ip or domain>/search/google/ instead of http://<ip or domain>/search/google)?
Will be answered below.
why use () group in this case? What's the common use case of using group?
This is a numbered capturing group. The part of the string matched by this group can be referenced later as $1. A second numbered capture group, if present in the regex pattern, can be referenced as $2, and so on. There are also named capture groups, which let you use your own variable name instead of $1, $2, etc. A good example of using named capture groups is given at this ServerFault thread.
BTW the answer you are referencing mentions numbered capture groups (but not the named capture groups).
why use ? in this case? Isn't .* already includes any char zero or more, why do we still need ?
Did you notice that our capture group is (/.*), not (.*)? This way it will match /search/google/<any suffix> but not /search/googles etc. The question mark makes this capturing group optional (so /search/google will match our regex pattern too).
Can we simply remove () and ? such as /search/google/.*$ to get the same behavior as the original one?
No, as we need that $1 value later. If you understand all the above information correctly, you should see it can be /<any suffix> or an empty string.
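To see which value $1 would take for different request URIs, the location pattern can be tried out in isolation. The sketch below uses Python's re module as a stand-in for the PCRE engine nginx uses, and first_capture is a hypothetical helper, not anything nginx provides. It assumes, as described above, that an unmatched optional group behaves like an empty string.

```python
import re

# The pattern from the location block.
location = re.compile(r"^/search/google(/.*)?$")

def first_capture(uri):
    """Return what $1 would hold, or None if the location would not match."""
    m = location.match(uri)
    if m is None:
        return None           # this location block would not be selected
    return m.group(1) or ""   # unset optional group treated as an empty string

for uri in ["/search/google", "/search/google/", "/search/google/maps/x", "/search/googles"]:
    print(uri, "->", repr(first_capture(uri)))
```

Note that /search/google/ and /search/google both match, but $1 is / in the first case and empty in the second.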
what will $1 be replaced with, sometimes someone also include $2 and so on?
Already answered.
I think $is_args$args means that if there's a query string (e.g. http://<ip or domain>/search/google?fruit=apple), $is_args$args will be replaced with ?fruit=apple
Yes, exactly.
I would assume it just redirects the user to http://google.com$proxy_uri??? same as http redirect 301???
Totally wrong. The difference is briefly described here although that answer doesn't mention you can additionally modify the response before sending it to the client (for example, using the sub_filter module).

what would be regex pattern for chained functions?

I tried various combinations but was unsuccessful at figuring out the correct regex pattern.
Basically I want to capture patterns like examples below:
{{variable}}
{{variable.function1{param1}}}
{{variable.function1{param1}.function2{param2}}}
and so on..
I wanted to capture variable,function1,param1,function2,param2 from this
So far I have below regex which does not work completely
\{\{([^{}.]+)(\.([^{}]+)\{([^{}]+)\})*\}\}
If I try to apply above pattern on example 3, I get below groups
Group#1 - variable
Group#2 - .function2{param2}
Group#3 - function2
Group#4 - param2
I was expecting something like the below:
Group#1 - variable
Group#2 - .function1{param1}
Group#3 - function1
Group#4 - param1
Group#5 - .function2{param2}
Group#6 - function2
Group#7 - param2
PS: you can check without writing code at http://regexr.com/3e4st
Okay, so the reason your pattern doesn't work is that a single match only captures one instance of each group, which means each capture group can only return one value. So what's needed is the global flag, or its equivalent in whatever language you're using.
Example: https://regex101.com/r/pO8xN2/3
The number of groups in a regex match is fixed. See an older post of mine with more explanation. In your case that number is 4.
When a group matches repeatedly, you will usually only be able to access the value of the last occurrence in the string. That's what you see with your issue: function2, param2.
Some regex engines allow accessing previous group captures (for example the one in .NET). The majority don't. Whether you can solve your issue easily or not strictly depends on your regex engine.
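One common workaround in engines that keep only the last capture is to match in two passes: first validate the whole expression and capture the chain as a single string, then scan that string for each call. The following is a minimal sketch in Python, which is an assumption since the question does not name a language; OUTER, CALL and parse are illustrative names only.

```python
import re

# Pass 1: the variable name, then the whole ".func{param}" chain as ONE group.
OUTER = re.compile(r"\{\{([^{}.]+)((?:\.[^{}.]+\{[^{}]+\})*)\}\}")
# Pass 2: a single ".func{param}" link, applied repeatedly to the chain.
CALL = re.compile(r"\.([^{}.]+)\{([^{}]+)\}")

def parse(template):
    """Return (variable, [(function, param), ...]) or None if it does not match."""
    m = OUTER.fullmatch(template)
    if m is None:
        return None
    variable, chain = m.groups()
    return variable, CALL.findall(chain)

print(parse("{{variable}}"))
print(parse("{{variable.function1{param1}}}"))
print(parse("{{variable.function1{param1}.function2{param2}}}"))
```

The group count stays fixed in both patterns; the variable number of calls is recovered by findall rather than by extra groups.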

Find replace named groups regexp in Geany

I am trying to change public methods to protected for methods that have a particular comment.
This is because I am using phpunit to test some of those methods, but they really don't need to be public, so I'd like to switch them on the production server and switch back when testing.
Here is the method declaration:
public function extractFile($fileName){ //TODO: change to protected
This is the regexp:
(?<ws>^\s+)(?<pb>public)(?<fn>[^/\n]+)(?<cm>//TODO: change to protected)
If I replace it with:
\1protected\3\//TODO: change back to public for testing
It seems to be working, but what I cannot get to work is referring to the groups by name in the replacement. I have to use \1 to get the first group. Why name the groups if you can't access them in the replacement text? I tried things like <ws> and $ws, but that doesn't work.
What is the replacing text if I want to replace \1 with the <ws> named group?
The ?<ws> named group syntax is the same as that used by .NET/Perl. For those regex engines the replacement string reference for the named group is ${ws}. This means your replacement string would be:
${ws}protected${fn}\//TODO: change back to public for testing
The \k<ws> reference mentioned by m.buettner is only used for backreferences in the actual regex.
Extra Information:
It seems like Geany also allows use of Python style named groups:
?P<ws> is the capturing syntax
\g<ws> is the replacement string syntax
(?P=ws) is the regex backreference syntax
EDIT:
It looks like my hope for a solution didn't pan out. From the manual,
A subpattern can be named in one of three ways: (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python. References to capturing parentheses from other parts of the pattern, such as backreferences, recursion, and conditions, can be made by name as well as by number.
And further down:
Back references to named subpatterns use the Perl syntax \k<name> or \k'name' or the Python syntax (?P=name).
and
A subpattern that is referenced by name may appear in the pattern before or after the reference.
So, my inference of the syntax for using named groups was correct. Unfortunately, they can only be used in the matching pattern. That answers your question "Why name groups...?".
How stupid is this? If you go to all the trouble to implement named groups and their usage in the matching pattern, why not also implement usage in the replacement string?
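For comparison, this is how the Python-style syntax listed above behaves in Python itself, where \g<name> does work in the replacement string. It is only a sketch of a possible workaround outside Geany (for example a small script run before deployment), not something Geany does; the sample line and the variable names are taken from the question or made up for illustration.

```python
import re

# Same pattern as in the question, rewritten with Python's (?P<name>...) syntax.
pattern = re.compile(
    r"(?P<ws>^\s+)(?P<pb>public)(?P<fn>[^/\n]+)(?P<cm>//TODO: change to protected)",
    re.MULTILINE,
)

# \g<name> refers to a named group inside the replacement string.
replacement = r"\g<ws>protected\g<fn>//TODO: change back to public for testing"

source = "    public function extractFile($fileName){ //TODO: change to protected"
result = pattern.sub(replacement, source)
print(result)
```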

I want a regular expression that only matches domain names with one period in them

I want it to catch things like somedomain.com/folder/path, but not something like domain.sub.other.com. The regex I have so far is almost complete, it just doesn't sift out the multi-domain urls:
^(.*)://(?!(.{2,3})\.(.*)(.{2,3})(.*)
Is there any way to sift out on multiple periods?
Instead of .{2,3}, you want something like this: [^.]{2,3} - this excludes the period (no need to escape as it has no special meaning in this context in a regular expression) from that particular match. Overall you'd have something like:
://[^.]+\.[^.]{2,3}(/.*)?
Except obviously you're missing things like *.info by doing that....
Found a solution that is working given a variety of test scenarios:
^(.*)://([^.]+)\.([^(\?|/|\r|\n|\.)]+)((/|\?|$)+)(.*)$
Here, the 2nd to the last group is matching against a potential forward slash, question mark or end of string, working together with the group before it which does not allow matches which include '.'
So the final effect is that it only matches URLs with a two-part domain such as 'domain.com' and there aren't any limits placed on string length.
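A somewhat tighter variant of the same idea is sketched below in Python (an assumption, since the question does not say which language or tool is used). It assumes "one period" refers only to the host part, so dots in the path or query string are allowed. The name has_two_part_domain is hypothetical.

```python
import re

# scheme :// label . label, then optionally a path, query or fragment.
# Excluding / ? # from the labels keeps the match inside the host part.
TWO_PART = re.compile(r"^[^:/?#]+://[^./?#]+\.[^./?#]+(?:[/?#].*)?$")

def has_two_part_domain(url):
    """True if the host consists of exactly two dot-separated labels."""
    return TWO_PART.match(url) is not None

for url in [
    "http://somedomain.com/folder/path",
    "http://somedomain.com/file.txt",
    "http://domain.sub.other.com",
    "http://example.info?x=1",
]:
    print(url, has_two_part_domain(url))
```

Unlike [^.]{2,3}, this places no limit on the length of the last label, so hosts such as example.info are accepted.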

Regex capturing named groups in a language that doesn't support them using a meta regex?

I am using Haskell and I don't seem to find a REGEX package that supports Named Groups so I have to implement it somehow myself.
basically a user of my api would use some regex with named groups to get back captured groups in a map
so
/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj on /foo/hhhh/bar/jjj
would give
[("name","foo"),("surname","bar")]
I am writing a trivial implementation of a specification, working with relatively small strings, so for now performance is not a main issue.
To solve this, I thought I'd write a meta regex that will apply on the user's regex
/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj
to extract the names of groups and replace them with nothing to get
0 -> name
1 -> surname
and the regex becomes
/([a-z]*)/hhhh/([a-z]*)/jjj
then apply it to the string and use the index to group names with matched.
Two questions:
does it seem like a good idea?
what is the meta regex that I need to capture and replace the named groups syntax
for those unfamiliar with named groups http://www.regular-expressions.info/named.html
note: all what I need from named groups is that the user give names to matches, so a subset of named groups that only gives me this is ok.
The more generally you want to apply your solution, the more complex your problem becomes. For instance, in your approach, you want to remove the named groups and use the indexes (indices?) to match. This seems like a good start, but you have to consider a few things:
If you replace the (?<name>blah) with (blah) then you also have to replace the /name with /1 or /2 or whatever.
What happens if the user includes non named groups as well? for eg: ([a-z]{3})/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj on /foo/hhhh/bar/jjj. In this case, your numbering will not work b/c group 1 is the user defined non named group.
See this post for some inspiration, as it seems others have successfully tried the same (albeit in Java):
Regex Named Groups in Java
Perhaps you should use parser combinators. This looks sufficiently complicated that it would be cleaner and more maintainable to step out and use Parsec or Attoparsec, instead of trying to push regexes further towards parsing.
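For the simple subset described in the question, the meta-regex approach can be sketched as follows. The sketch is in Python purely for illustration (Python already supports named groups, so this only demonstrates the logic, which would need porting to Haskell). It counts every capturing group, named or not, so unnamed groups do not shift the numbering. It assumes character classes in the user's pattern do not contain parentheses, and all function names are hypothetical.

```python
import re

# One token per alternative: an escaped character (copied through unchanged),
# a named group opener, or a plain capturing group opener.
_OPENER = re.compile(r"\\.|\(\?P<(?P<name>[A-Za-z_]\w*)>|\((?!\?)")

def strip_named_groups(pattern):
    """Return (pattern with plain groups, {group index: name}), 1-based indices."""
    names, index, out, pos = {}, 0, [], 0
    for m in _OPENER.finditer(pattern):
        out.append(pattern[pos:m.start()])
        if m.group(0).startswith("\\"):
            out.append(m.group(0))      # escaped char, e.g. \( is not a group
        else:
            index += 1                  # every capturing group is counted
            if m.group("name"):
                names[index] = m.group("name")
            out.append("(")
        pos = m.end()
    out.append(pattern[pos:])
    return "".join(out), names

def match_named(pattern, text):
    """Match text and return [(name, captured)] for the named groups only."""
    plain, names = strip_named_groups(pattern)
    m = re.fullmatch(plain, text)
    if m is None:
        return None
    return [(name, m.group(i)) for i, name in sorted(names.items())]

print(match_named("/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj", "/foo/hhhh/bar/jjj"))
```

Indices here start at 1 because group 0 conventionally refers to the whole match. A pattern with nested or more exotic group syntax would quickly outgrow this scanner, which is where a real parser becomes the cleaner option.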