Clean and extract subdomains and domains from URLs using regex in Notepad++

This is a simple text file.
The URLs:
Can start with https:// or http://
I want to strip both, as well as any trailing slash or file path,
and extract only the domains and/or subdomains.
I have Notepad++ and EditPlus,
but I am open to other suggestions.
Examples:
https://appspace.com
http://appspace.com/
http://ayurfit.ning.com/main/authorization/signIn
http://bangalore.olx.in/login.php
http://birthdayshoes.com/forum/index.php
http://birthdayshoes.com/forum/register/
http://forums.virtualbox.org/ucp.php
What I tried:
/(?!.{253})((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.){1,126}+[A-Za-z]{2,6}/
^(?:https?://)?([^/.]+(?=\.)|)(\.?[^/.]+\.[^/]+)/?(.+|)$
https://regex101.com/r/hZ4cL4/4
I also tried many examples from Regex101 on another machine.
I found this little nugget as well. I'll post how it's different once I understand it.
Regular Expression - Extract subdomain & domain

For links that start with a protocol, you can use the following regex:
(?<=://)[\w-]+(?:\.[\w-]+)+\b
The (?<=://) lookbehind makes sure there is :// directly before the value we want to match, and the matched text itself consists of sequences of one or more word characters or hyphens ([\w-]+) separated by periods.
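As a rough illustration outside Notepad++, the same expression can be tried in Python against the example URLs from the question. The helper name extract_host is made up for this sketch, and it assumes every URL starts with a scheme, as the answer does.

```python
import re

# Pattern from the answer: a host name that directly follows "://".
HOST_AFTER_SCHEME = re.compile(r"(?<=://)[\w-]+(?:\.[\w-]+)+\b")


def extract_host(url):
    """Return the host of a URL that starts with a scheme, or None.

    extract_host is a hypothetical helper name used only in this sketch.
    """
    match = HOST_AFTER_SCHEME.search(url)
    return match.group(0) if match else None


urls = [
    "https://appspace.com",
    "http://appspace.com/",
    "http://ayurfit.ning.com/main/authorization/signIn",
    "http://bangalore.olx.in/login.php",
    "http://birthdayshoes.com/forum/index.php",
    "http://birthdayshoes.com/forum/register/",
    "http://forums.virtualbox.org/ucp.php",
]

hosts = [extract_host(url) for url in urls]
for host in hosts:
    print(host)
```

Lines without a scheme return None here, because the lookbehind requires "://" in front of the host.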

You could simply extract anything that is between two . characters. Additionally,
you could use lookbehinds for http(s):// and a lookahead for the file path
to fine-tune your results.

Related

Regex for finding domains in a sentence but not IP addresses

I am trying to write a regular expression that will match domains in a sentence.
I found this post, which was very useful and helped me create the following pattern to match domains, but unfortunately it also matches IP addresses, which I do not want:
((?!-))(xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]{0,1}\.(xn--)?([a-z0-9\._-]{1,61}|[a-z0-9-]{1,30})
I want to update my expression so that the following can still be found in a sentence, between brackets, etc.:
www.example.com
subdomain.example.com
subdomain.example.co.uk
But not:
192.168.0.0
127.0.0.1
Is there a way to do this?
We could use a simple negative lookahead that rejects any match starting with a digit or a dot: (?![\d.]+)
(?![\d.]+)((?!-))(xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]{0,1}\.(xn--)?([a-z0-9\._-]{1,61}|[a-z0-9-]{1,30})
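The answer from @wp78de is correct; however, it will not detect domains that start with digits, e.g. 123reg.com.
If such domains matter more to you than excluding IP addresses, remove the first group in the regex like this (note that this is the original pattern again, so it also matches IP addresses):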
((?!-))(xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]{0,1}\.(xn--)?([a-z0-9\._-]{1,61}|[a-z0-9-]{1,30})
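The trade-off between the two variants can be checked with a small Python sketch. The sample sentence and the helper name find_all are invented for this illustration; the two patterns are copied from the answers.

```python
import re

# Domain pattern from the question, unchanged.
BASE = (
    r"((?!-))(xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]{0,1}"
    r"\.(xn--)?([a-z0-9\._-]{1,61}|[a-z0-9-]{1,30})"
)

WITHOUT_LOOKAHEAD = re.compile(BASE)
# Same pattern with the negative lookahead from the first answer in front.
WITH_LOOKAHEAD = re.compile(r"(?![\d.]+)" + BASE)


def find_all(pattern, text):
    """Return the full text of every match (hypothetical helper)."""
    return [match.group(0) for match in pattern.finditer(text)]


# Made-up sample sentence for this sketch.
text = (
    "see www.example.com (subdomain.example.co.uk), "
    "123reg.com, 192.168.0.0 and 127.0.0.1"
)

with_lookahead = find_all(WITH_LOOKAHEAD, text)
without_lookahead = find_all(WITHOUT_LOOKAHEAD, text)

print(with_lookahead)     # IP addresses are gone, but 123reg.com loses its digits
print(without_lookahead)  # 123reg.com is complete, but IP addresses match again
```

With the lookahead, a match can never begin on a digit, so the engine skips ahead and picks up reg.com instead of 123reg.com.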

Match a url that does not contain certain word

I need help with a regular expression that does not match URLs like this one:
/Common/Download.php?file=/path/to/file.pdf
and instead matches static URLs like this one:
/path/to/file.pdf
I have read many posts (also on this site), but nothing seems to work as expected.
Thanks for your help.
Lorenzo.
UPDATE
Sorry if this post was incomplete. Here is more information so that I can get better help.
The regular expression must work with Apache's mod_rewrite module and, if possible, with the rewrite module of IIS (that may not be its right name), which as far as I know is compatible with Apache's. It must redirect the matching static URLs (only the second type from my post) to a specific page.
Thanks again.
Lorenzo.
Without knowing more about your programming language and regex parser, I'm keeping my regex really generic, but something like this should get you close:
^/([A-Za-z0-9]+/)+[A-Za-z0-9]+\.[A-Za-z0-9]{3,4}$
This matches a string starting with a slash, one or more directories separated by slashes, and ending with a filename with a three or four character file extension.
This means /path/to/some/really/buried/file.html would match too.
Using an interactive regular expression evaluator is a great way to rapidly write and debug regular expressions, especially if you are new to them. I really like The Regex Coach for this.
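If no interactive evaluator is at hand, a few lines of Python can serve the same purpose. This is only a sketch: is_static_file is a made-up name, and it assumes the whole string, query string included, is what gets tested, which may not hold for a given rewrite module.

```python
import re

# Generic pattern from the answer above.
STATIC_FILE = re.compile(r"^/([A-Za-z0-9]+/)+[A-Za-z0-9]+\.[A-Za-z0-9]{3,4}$")


def is_static_file(url):
    """True if the whole string looks like /dir/.../name.ext (hypothetical helper)."""
    return STATIC_FILE.search(url) is not None


for url in (
    "/path/to/file.pdf",
    "/path/to/some/really/buried/file.html",
    "/Common/Download.php?file=/path/to/file.pdf",
    "/file.pdf",
):
    print(url, is_static_file(url))
```

The last URL is rejected because the pattern requires at least one directory before the file name.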
Another option is to put the pattern "forward slash followed by lowercase characters" in a non-capturing group and repeat that group. Then match the file extension .pdf
^(?:/[a-z]+){3}\.pdf$
Explanation
From the beginning of the string ^
Non-capturing group (?:
Match a forward slash and one or more lowercase characters /[a-z]+
Close the non-capturing group and repeat it 3 times ){3}
Match a dot \. and pdf
The end of the string $
Or repeat the group 2 times and for the filename use \w+
^(?:/[a-z]+){2}/\w+\.pdf$
If you want to match your example static URL and also longer or shorter paths like /path/file.pdf or /dir/path/to/file.pdf, you could for example use:
^(?:/\w+)+\.\w+$
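To compare the three variants side by side, here is a minimal Python sketch. The dictionary keys and the helper matching_patterns are names invented for this example; the patterns themselves are the ones given above.

```python
import re

# The three patterns from this answer, keyed by made-up labels.
PATTERNS = {
    "three_segments": re.compile(r"^(?:/[a-z]+){3}\.pdf$"),
    "two_dirs_then_file": re.compile(r"^(?:/[a-z]+){2}/\w+\.pdf$"),
    "any_depth": re.compile(r"^(?:/\w+)+\.\w+$"),
}


def matching_patterns(url):
    """Return the labels of all patterns that match the whole string."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(url)]


for url in (
    "/path/to/file.pdf",
    "/path/file.pdf",
    "/dir/path/to/file.pdf",
    "/Common/Download.php?file=/path/to/file.pdf",
):
    print(url, matching_patterns(url))
```

Only the last, most general pattern accepts paths of a different depth; none of them accepts the Download.php URL when the query string is part of the tested string.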

IIS Web.config Regex forward slash restraint

I'm using web.config to rewrite URLs in IIS 8.5
This is my regex:
match url="^((?:[a-z]{2}\/{1}){1,2})?listen$"
This will successfully match the following:
en/gb/listen
en/listen
listen
However, the part that I can't get to work is restricting the forward slash in each optional group to a single character:
\/{1}
Interestingly this example does work on https://regex101.com/r/VNwejt/1
Any help would be appreciated.
You may restrict the whole pattern using a negative lookahead at the start:
^(?!.*//)<PATTERN_GOES_HERE>
The (?!.*//) lookahead fails the match if there is a // substring anywhere on a line of text.
However, in this case the lookahead is redundant, because your consuming pattern already does not allow two consecutive slashes anywhere in the string: ^(?!.*//)((?:[a-z]{2}/){1,2})?listen$. Check the other options in your configuration file.
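The redundancy claim can be checked outside IIS with a short Python sketch. The candidate strings beyond the three from the question are made up for the test, and Python's re module stands in for the IIS regex engine here, which is an assumption.

```python
import re

# Pattern from the question (slashes unescaped) and the guarded variant.
PLAIN = re.compile(r"^((?:[a-z]{2}/){1,2})?listen$")
GUARDED = re.compile(r"^(?!.*//)((?:[a-z]{2}/){1,2})?listen$")

candidates = [
    "en/gb/listen",
    "en/listen",
    "listen",
    "en//listen",      # doubled slash
    "en/gb//listen",   # doubled slash after the second group
    "english/listen",  # more than two letters
]

# For each candidate: (matches PLAIN, matches GUARDED)
results = {
    url: (bool(PLAIN.search(url)), bool(GUARDED.search(url)))
    for url in candidates
}

for url, (plain, guarded) in results.items():
    print(f"{url!r}: plain={plain} guarded={guarded}")
```

Both patterns give the same verdict for every candidate, which is consistent with the lookahead adding nothing for this particular pattern.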

Regex on picking S3 bucket name from S3 URI

I have been trying to write a regex pattern to obtain the S3 bucket name from an S3 URI, but have had no luck.
Example: s3://example-bucket/file-name.filetype
The closest I could get is this: \/\/([^\/].*[^\/])\/ but I'm not sure how to exclude the slashes from the result.
The part ([^/]+) looks for a sequence of characters that are not slashes.
Keeping close to what you had, you could write //([^/]+)/ but the group captures the same text as
//([^/]+)
Optionally, you could use a lookbehind (?<=//) and/or a lookahead (?=/):
(?<=//)([^/]+)(?=/)
(Depending on your use case, a couple of different lookahead expressions are possible.)
Lookaround, especially lookbehind, is not supported in all regex dialects; for example, JavaScript engines older than ES2018 have no lookbehind.
If you don't have the lookbehind option (e.g. in older JavaScript engines), and your URL structure is consistent enough, then perhaps this pattern will be useful:
[^\/]+(?=\/[^\/])
In English: "match one or more non-forward slashes, followed (but not actually matched, via the lookahead) by one forward slash and a non-forward slash".
(your other option would be to access your match groups in order to get rid of the slashes that you matched)
https://regex101.com/r/XM05Sw/2/
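The three approaches from the answers can be compared in a short Python sketch. The function names are invented for this illustration; the patterns are the ones quoted above, with slashes left unescaped because Python does not need the escaping.

```python
import re

uri = "s3://example-bucket/file-name.filetype"


def bucket_via_group(s3_uri):
    """Match the slashes, but read only capture group 1."""
    match = re.search(r"//([^/]+)", s3_uri)
    return match.group(1) if match else None


def bucket_via_lookaround(s3_uri):
    """Lookbehind and lookahead keep the slashes out of the match itself."""
    match = re.search(r"(?<=//)([^/]+)(?=/)", s3_uri)
    return match.group(0) if match else None


def bucket_via_lookahead_only(s3_uri):
    """Variant without lookbehind: non-slashes followed by '/' and a non-slash."""
    match = re.search(r"[^/]+(?=/[^/])", s3_uri)
    return match.group(0) if match else None


print(bucket_via_group(uri))
print(bucket_via_lookaround(uri))
print(bucket_via_lookahead_only(uri))
```

One difference worth knowing: for a URI without an object key, such as s3://example-bucket, only the capture-group version still finds the bucket, because both lookahead variants require a slash after the name.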

preg_replace words not inside a url

I am using preg_replace to replace a list of words in a text that may contain some urls.
The problem is that I don't want to replace these words if they're part of a url.
These examples should be ignored:
foo.com
foo.com/foo
foo.com/foo/foo
For a basic example (written in PHP), I tried to ignore strings containing .com and optional slashes and characters, using a negative lookahead assertion, but with no success:
preg_replace("/(\b)foo(\b)/", "$1bar$2(?!(\w+\.\w+)*(\.com)([\.\/]\w+)*)", $text);
This call works, but it only ignores the word before .com.
Any help would be really appreciated.
In cases like this, it is much easier to think of the problem inverted. You want to match words that are not in a URL; instead, think of it as matching both the URLs and the words. Your expression would then look like this: url_match_here|(?:my|words|here). This lets the regex engine consume a URL first and only then try to match the words, so you never have to worry about matching a word inside a URL. If you want to maintain the text structure, you can use preg_replace with the expression (url_match_here)|(?:my|words|here) and replace with \1, which puts a matched URL back unchanged. Note that \1 is empty when one of the words matched, so to substitute another word you need a callback (preg_replace_callback) that decides per match.
I hope this helps.
Good luck.
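The inverted approach can be sketched in Python, where re.sub with a function plays the role that a callback-based replace would play in PHP. The URL pattern below is a deliberately loose stand-in for url_match_here and is an assumption of this sketch, as are the names replace_outside_urls and substitute.

```python
import re

# Hypothetical, deliberately loose URL pattern (an assumption of this sketch):
# dotted host name, optionally followed by a path up to the next whitespace.
URL = r"\b[\w-]+(?:\.[\w-]+)+(?:/\S*)?"
# The word list to replace; only "foo" here, as in the question.
WORDS = r"\b(?:foo)\b"

# URL alternative first and captured, words second and uncaptured.
COMBINED = re.compile(rf"({URL})|{WORDS}")


def replace_outside_urls(text, replacement="bar"):
    """Replace the listed words unless they are part of a URL."""

    def substitute(match):
        # Group 1 is set only when the URL alternative matched: keep it as is.
        if match.group(1) is not None:
            return match.group(1)
        return replacement

    return COMBINED.sub(substitute, text)


print(replace_outside_urls("foo visits foo.com/foo/foo and foo.com for foo"))
```

Because the URL alternative comes first, the engine consumes foo.com/foo/foo as a whole before the word alternative gets a chance at any of the inner "foo" occurrences.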