IIS Web.config Regex forward slash restraint - regex

I'm using web.config to rewrite URLs in IIS 8.5
This is my regex:
match url="^((?:[a-z]{2}\/{1}){1,2})?listen$"
This will successfully match the following:
en/gb/listen
en/listen
listen
However, the part I can't get to work is restricting the forward slash in each optional group to a single character:
\/{1}
Interestingly this example does work on https://regex101.com/r/VNwejt/1
Any help would be appreciated.

You may restrict the whole pattern using a negative lookahead at the start:
^(?!.*//)<PATTERN_GOES_HERE>
See the regex demo.
The (?!.*//) lookahead fails the match if there is a // substring anywhere on a line of text.
However, in this case the lookahead is redundant, as your consuming pattern already does not allow two consecutive slashes anywhere in the string: ^(?!.*//)((?:[a-z]{2}/){1,2})?listen$. Check the other options in your configuration file.
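The redundancy claim can be checked quickly. The sketch below uses Python's re module as a stand-in for the IIS rewrite engine (an assumption: IIS uses a different regex engine, although nothing in this pattern is engine-specific). Only the first three sample URLs come from the question; the ones containing // are made up for illustration.

```python
import re

# Pattern from the question, without the unneeded \/ escaping and {1}.
PATTERN = re.compile(r"^((?:[a-z]{2}/){1,2})?listen$")
# Same pattern with the negative lookahead prepended.
GUARDED = re.compile(r"^(?!.*//)((?:[a-z]{2}/){1,2})?listen$")

# First three samples are from the question; the rest are hypothetical.
samples = [
    "en/gb/listen",
    "en/listen",
    "listen",
    "en//listen",
    "en//gb/listen",
    "en/gb//listen",
]

# For each URL, record whether the plain and the guarded pattern match.
results = {
    url: (bool(PATTERN.search(url)), bool(GUARDED.search(url)))
    for url in samples
}

for url, (plain, guarded) in results.items():
    print(f"{url!r}: plain={plain} guarded={guarded}")
```

Both patterns agree on every sample, which suggests that if a double slash is still getting through, the cause lies elsewhere in the rewrite configuration.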

Related

Match specific pattern that does not contain other pattern in one expression

I'm looking for a regex to use in nginx location matching, that would match a specified end pattern not being anywhere preceded by a specified other pattern.
Like, I have files:
webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.wasm.framework.unityweb
webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.data.unityweb
webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.wasm.framework.unityweb
webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.data.unityweb
I want to match all \.unityweb except those that are anywhere preceded by dev. Basically, I need to match the last two lines. I cannot hardcode it, as the files/directories might be named arbitrarily.
The usual ((?!dev\/).)*$ doesn't suffice, because it still matches the ends. (?<!dev) also cannot be added anywhere, as it will only match directly before.
I am out of clues and also out of regex fu!
The solution does not have to be strictly regex, might be nginx based too.
It might have been asked before, but I cannot seem to know the correct keywords to find it.
Try
^(?!.*?dev\/.*).+\.unityweb$
See the demo here
Description:
^ From the start of the line
(?! _______ ) Negative Lookahead
.*?dev\/ Match any character any amount of times, until you reach dev followed by a slash
.* Match any characters any amount of times
Negative lookahead closes
.+ Match any character, one or more times
\.unityweb - until you reach .unityweb
$ End of the line
Use the full match for what you need
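As a rough check of the breakdown above, here is the negative-lookahead pattern run against the four file names from the question. It uses Python's re module as a stand-in for nginx's PCRE (an assumption; this pattern uses no syntax that differs between the two).

```python
import re

# Negative-lookahead pattern from the answer; the slash needs no escape here.
NOT_DEV = re.compile(r"^(?!.*?dev/.*).+\.unityweb$")

# The four file names listed in the question.
files = [
    "webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.wasm.framework.unityweb",
    "webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.data.unityweb",
    "webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.wasm.framework.unityweb",
    "webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.data.unityweb",
]

# Keep only the names the pattern accepts.
matched = [name for name in files if NOT_DEV.search(name)]

for name in matched:
    print(name)
```

Only the last two names are kept, because the first two contain dev/ somewhere before the extension.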
EDIT
Just realised that you also state a contradiction in your question, as you say you don't want to match anything preceded by dev/ but you also want to match the first two examples you gave.
That can be done by changing the negative lookahead to a positive lookahead:
^(?=.*?dev\/.*).+\.unityweb$
See the demo here
You can use this
^(?!.*dev.*\.unityweb)(?=.*\.unityweb).*$
Demo

Match a url that does not contain certain word

I need some help with a regular expression that does not match URLs like this one:
/Common/Download.php?file=/path/to/file.pdf
and instead matches static URLs like this:
/path/to/file.pdf
I have read many posts (also on this site) but nothing seems to work as expected.
Thanks for your help.
Lorenzo.
UPDATE
Sorry if this post is not so complete. I post more information to obtain a better help.
The regular expression must work with Apache's mod_rewrite (and, if possible, with the rewrite module of IIS, which as far as I know is compatible with Apache's, though that may not be its right name) and must redirect the matching static URLs (only the second type from my post) to a specific page.
Thanks again.
Lorenzo.
Without knowing more about your programming language and regex parser, I'm keeping my regex really generic, but something like this should get you close:
^/([A-Za-z0-9]+/)+[A-Za-z0-9]+\.[A-Za-z0-9]{3,4}$
This matches a string starting with a slash, one or more directories separated by slashes, and ending with a filename with a three or four character file extension.
This means /path/to/some/really/buried/file.html would match too.
Using an interactive regular expression evaluator is a great way to rapidly write and debug regular expressions, especially if you are new to them. I really like The Regex Coach for this.
Another option could be to repeat the forward slash lowercase characters pattern in a non capturing group and repeat that. Then match the file extension .pdf
^(?:/[a-z]+){3}\.pdf$
Explanation
From the beginning of the string ^
Non capturing group (?:
Match a forward slash followed by one or more lowercase characters /[a-z]+
Close the non capturing group and match 3 times ){3}
Match a dot \. and pdf
The end of the string $
Or repeat the group 2 times and for the filename use \w+
^(?:/[a-z]+){2}/\w+\.pdf$
If you want to match your example static url and maybe longer or shorter paths like /path/file.pdf or /dir/path/to/file.pdf you could for example use:
^(?:/\w+)+\.\w+$
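A small sketch of how that last pattern behaves on the URLs from the question, using Python's re module as a stand-in for the mod_rewrite engine (an assumption). The /Common/Download.php entry is added for illustration and is not one of the asker's examples.

```python
import re

# Last pattern from the answer: one or more /segment parts, then an extension.
STATIC_URL = re.compile(r"^(?:/\w+)+\.\w+$")

urls = [
    "/Common/Download.php?file=/path/to/file.pdf",  # from the question
    "/path/to/file.pdf",                            # from the question
    "/path/file.pdf",                               # from the answer
    "/dir/path/to/file.pdf",                        # from the answer
    "/Common/Download.php",                         # illustrative: path without query string
]

# Map each URL to whether the pattern accepts it.
results = {url: bool(STATIC_URL.search(url)) for url in urls}

for url, ok in results.items():
    print(f"{url}: {'match' if ok else 'no match'}")
```

The full Download.php URL is rejected only because of its query string. In Apache mod_rewrite a RewriteRule pattern is tested against the URL path without the query string, so there the request would be seen as /Common/Download.php, which this pattern accepts; excluding it would need an additional condition or a narrower pattern such as the .pdf variants above.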

Clean and extract Subdomains & Domains from URLs using Regex Notepad++

This is simple text file.
The URL:
Can have https:// or http://
Eliminate both as well as trailing url/ file paths
Extract only domains and/or subdomains
I have Notepad++ and EditPlus
open to other Suggestions?
Examples:
https://appspace.com
http://appspace.com/
http://ayurfit.ning.com/main/authorization/signIn
http://bangalore.olx.in/login.php
http://birthdayshoes.com/forum/index.php
http://birthdayshoes.com/forum/register/
http://forums.virtualbox.org/ucp.php
Tries:
/(?!.{253})((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.){1,126}+[A-Za-z]{2,6}/
^(?:https?://)?([^/.]+(?=\.)|)(\.?[^/.]+\.[^/]+)/?(.+|)$
https://regex101.com/r/hZ4cL4/4
Tried many on other machine as examples from Regex101
Found this little nugget as well. I'll post how it's different once I understand it.
Regular Expression - Extract subdomain & domain
For the links that start with protocol, you can use the following regex:
(?<=://)[\w-]+(?:\.[\w-]+)+\b
See demo
The (?<=://) look-behind makes sure there is :// before the value we want to match, and the whole matched text consists of sequences of 1 or more word characters or hyphens ([\w-]+) that are eventually separated with periods.
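For anyone open to a scripted alternative to the editor, the same pattern can be tried outside Notepad++. This is a minimal sketch in Python (an assumption about tooling; the asker only mentioned Notepad++ and EditPlus), run on the example URLs from the question.

```python
import re

# Pattern from the answer: host name that directly follows "://".
HOST = re.compile(r"(?<=://)[\w-]+(?:\.[\w-]+)+\b")

# Example URLs copied from the question.
text = """https://appspace.com
http://appspace.com/
http://ayurfit.ning.com/main/authorization/signIn
http://bangalore.olx.in/login.php
http://birthdayshoes.com/forum/index.php
http://birthdayshoes.com/forum/register/
http://forums.virtualbox.org/ucp.php
"""

# findall returns the full matches because the pattern has no capturing group.
hosts = HOST.findall(text)

# Drop duplicates while keeping first-seen order.
unique_hosts = list(dict.fromkeys(hosts))

for host in unique_hosts:
    print(host)
```

The look-behind keeps file names such as login.php out of the results, since they are not directly preceded by ://.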
You could simply extract anything that is between two . Additionally
you could use lookbehinds for http(s) and lookahead for the filepath
to fine tune your results.

regex negative lookbehind - pcre

I'm trying to write a rule to match on a top-level domain followed by five digits. My problem arises because my existing PCRE matches what I have described, but much later in the URL than where I want it to. I want it to match on the first occurrence of a TLD, not anywhere else. The easy way to check for this is to match on the TLD only when it has not been preceded at some point by the "/" character. I tried using negative lookbehind, but that doesn't work because it only looks back one single character.
e.g.: How it is currently working
domain.net/stuff/stuff=www.google.com/12345
matches .com/12345 even though I do not want this match because it is not the first TLD in the URL
e.g.: How I want it to work
domain.net/12345/stuff=www.google.com/12345
matches on .net/12345 and ignores the later match on .com/12345
My current expression
(\.[a-z]{2,4})/\d{5}
EDIT: rewrote it so perhaps the problem is clearer in case anyone in the future has this same issue.
You're pretty close :)
You just need to be sure that before matching what you're looking for (i.e: (\.[a-z]{2,4})/\d{5}), you haven't met any / since the beginning of the line.
I would suggest simply prepending ^[^\/]* to your current regex (with the \. moved out of the capture group).
Thus, the resulting regex would be:
^[^\/]*\.([a-z]{2,4})/\d{5}
How does it work?
^ asserts that this is the beginning of the tested String
[^\/]* accepts any sequence of characters that doesn't contain /
\.([a-z]{2,4})/\d{5} is the pattern you want to match (a . followed by 2 to 4 lowercase characters, then a / and five digits).
Here is a permalink to a working example on regex101.
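The effect of the anchor can also be sketched in a few lines. This uses Python's re module as a stand-in for PCRE (an assumption; the syntax involved is the same in both) and the two URLs from the question.

```python
import re

# Original expression from the question: matches a TLD anywhere in the URL.
ANYWHERE = re.compile(r"(\.[a-z]{2,4})/\d{5}")
# Anchored version from the answer: no "/" may precede the TLD.
FIRST_TLD = re.compile(r"^[^/]*\.([a-z]{2,4})/\d{5}")

unwanted = "domain.net/stuff/stuff=www.google.com/12345"
wanted = "domain.net/12345/stuff=www.google.com/12345"

# The unanchored pattern finds the late .com/12345 in the first URL.
late_hit = ANYWHERE.search(unwanted)
# The anchored pattern rejects the first URL and accepts the second.
anchored_unwanted = FIRST_TLD.search(unwanted)
anchored_wanted = FIRST_TLD.search(wanted)

print(late_hit.group(0))
print(anchored_unwanted)
print(anchored_wanted.group(0), anchored_wanted.group(1))
```

Because [^/]* cannot cross a slash, the only TLD the anchored pattern can reach is the one in the host part.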
Cheers!
You can use this regex:
'|^(\w+://)?([\w-]+\.)+\w+/\d{5}|'
Online Demo: http://regex101.com/

Regex - not in the list but matched anyway

This is a bit hard to sum up in a title, but here is my problem:
(?:(?:http|https):\\/\\/)?(?:\\/\\/www\\.)?youtube.com\\/watch\\?(?:.*)v=(\\w{11}).*
Given the expression above, I really don't understand why ftp://www.youtube.com/watch?v=F5eScJmYZZ8 matches. I unsuccessfully tried adding ^ to the beginning of the expression, but then it does not match anything anymore (this is done in Java, which explains the doubled backslashes).
How can ftp be accepted when it is clearly not listed in (http|https)?
EDIT
To be accurate, here is what is allowed:
http(s)://www.[...]
http(s)://[...]
www.[...]
[...]
and nothing else.
Because the ? after the http part means that it is optional. Use + instead of ?.
Also, you are checking for // after http twice.
\s* allows whitespace at the beginning. If you don't want to allow whitespace (i.e., the input text will contain only 1 match), use ^ instead.
Here is the working regex that meets all of your added requirements:
\s*(?:(http|https)\:\/\/)?(?:www\.)?youtube.com\/watch\?(?:.*)v=(\w{11}).*
Because the leading (?:(?:http|https):\\/\\/)? is optional. That's what the question mark at the end of the group signifies (match at most one, i.e. match only if it exists).
A leading ^ should prevent the match with ftp though. Can you post the failing regex you tried (with the ^)?
UPDATE:
Aha! It matches without the ^ since the http group is optional, and anything can come before the match (e.g. cheeseyoutube.com/... would match). Adding a ^ to the beginning of the regex fixes this, but there's another problem with your regex: the www group is trying to match two slashes (as first pointed out in Justin's answer), which it can't once the http group has already matched those slashes. So the www group fails to match (fine, since it's optional), but then the youtube part can't match since there's an unmatched www in the way!
This should fix your problem:
^(?:(?:http|https):\\/\\/)?(?:www\\.)?youtube.com\\/watch\\?(?:.*)v=(\\w{11}).*
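To see the difference between the original and the corrected expression, here is a minimal sketch in Python rather than Java (an assumption made for brevity; with raw strings the backslashes are not doubled). The https and bare-host URLs are made-up variations of the one in the question, following the list of allowed forms.

```python
import re

# Original expression: unanchored, and the www group expects its own "//".
ORIGINAL = re.compile(
    r"(?:(?:http|https)://)?(?://www\.)?youtube.com/watch\?(?:.*)v=(\w{11}).*"
)
# Corrected expression: anchored at the start, www group without slashes.
FIXED = re.compile(
    r"^(?:(?:http|https)://)?(?:www\.)?youtube.com/watch\?(?:.*)v=(\w{11}).*"
)

ftp_url = "ftp://www.youtube.com/watch?v=F5eScJmYZZ8"  # from the question
allowed = [  # hypothetical variations covering the allowed forms
    "http://www.youtube.com/watch?v=F5eScJmYZZ8",
    "https://youtube.com/watch?v=F5eScJmYZZ8",
    "www.youtube.com/watch?v=F5eScJmYZZ8",
    "youtube.com/watch?v=F5eScJmYZZ8",
]

# The unanchored original finds a match inside the ftp URL.
original_on_ftp = ORIGINAL.search(ftp_url)
# The anchored version rejects it but accepts every allowed form.
fixed_on_ftp = FIXED.search(ftp_url)
video_ids = [FIXED.search(url).group(1) for url in allowed]

print(original_on_ftp is not None, fixed_on_ftp, video_ids)
```

The unanchored search succeeds on the ftp URL because the match simply starts at //www., skipping the scheme entirely.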