Regex for finding domains in a sentence but not IP addresses - regex

I am trying to write a regular expression that will match domains in a sentence.
I found this post, which was very useful and helped me create the following expression to match domains. Unfortunately, it also matches IP addresses, which I do not want:
((?!-))(xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]{0,1}\.(xn--)?([a-z0-9\._-]{1,61}|[a-z0-9-]{1,30})
I want to update my expression so that the following can still be found (in a sentence, between brackets, etc.):
www.example.com
subdomain.example.com
subdomain.example.co.uk
But not:
192.168.0.0
127.0.0.1
Is there a way to do this?

We could use a simple lookahead that excludes combinations of numbers and dots only: (?![\d.]+)
(?![\d.]+)((?!-))(xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]{0,1}\.(xn--)?([a-z0-9\._-]{1,61}|[a-z0-9-]{1,30})
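To see what the lookahead does in practice, here is a minimal Python sketch. The pattern is copied unchanged from the answer; the names PATTERN, first_match, samples and results are only illustrative, and it assumes Python's re module behaves like the demo's flavor for these inputs.

```python
import re

# The pattern from the answer above, unchanged (split over two raw strings).
PATTERN = re.compile(
    r"(?![\d.]+)((?!-))(xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]{0,1}"
    r"\.(xn--)?([a-z0-9\._-]{1,61}|[a-z0-9-]{1,30})"
)

def first_match(text):
    """Return the first matched substring, or None if nothing matches."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

samples = ["www.example.com", "subdomain.example.co.uk", "192.168.0.0", "127.0.0.1"]
results = {s: first_match(s) for s in samples}
```

With these inputs both domains are returned whole and both IP addresses yield None, because the lookahead refuses to start a match at any digit or dot.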

The answer from #wp78de is correct; however, it would not detect domains that start with numerical digits, e.g. 123reg.com.
So remove the first group in the regex, like this:
((?!-))(xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]{0,1}\.(xn--)?([a-z0-9\._-]{1,61}|[a-z0-9-]{1,30})
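A different way to keep digit-leading domains while still skipping IP addresses is to require the last label to be alphabetic. This is not the method from either answer, only a sketch under that assumption; the simplified DOMAIN pattern below is hypothetical and ignores punycode TLDs.

```python
import re

# Assumption: the final label (the TLD) is 2+ ASCII letters. An IPv4 address
# ends in digits, so it can never satisfy this. Punycode TLDs are out of scope.
DOMAIN = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}\b",
    re.IGNORECASE,
)

sentence = (
    "Try www.example.com (or subdomain.example.co.uk) and 123reg.com, "
    "but not 192.168.0.0 or 127.0.0.1"
)

# Only non-capturing groups are used, so findall returns whole matches.
found = DOMAIN.findall(sentence)
print(found)
```

For this sentence the three domains are found, including 123reg.com, and neither IP address is.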

Related

Regular expression which matches a domain with only two ending characters?

I'm trying to stop spammers who are using short domains (bit.ly, etc.). The domains they post all seem to end in a two-character TLD (not .com, etc.).
I've used this:
\.[a-z][a-z]$
But it has two problems:
It matches .co.uk.
If anything follows the domain (a space or a slash, for example bit.ly/2231), it doesn't match.
Could someone assist me with a regex that would accomplish this, please?
Both patterns below match against the whole URL and rely on the domain appearing before the first forward slash after the protocol. The first matches only when the host contains a single dot and ends in a two-character TLD. The second uses a negative lookbehind to make sure the host does not end in something like .co.uk.
https://regex101.com/r/5acu56/2
^(https?:\/\/)?[^\/.]+\.[a-z][a-z](\/|\s*$)
https://regex101.com/r/p8Ajw9/2
^(https?:\/\/)?[^\/]+(?<!\.[a-z][a-z])\.[a-z][a-z](\/\s*|\s*$)
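The difference between the two patterns is easiest to see side by side. The sketch below runs both, unchanged, through Python's re module; the sample URLs and the names ONE_DOT, ANY_DEPTH and verdicts are made up for illustration.

```python
import re

# Both patterns are copied from the answer above.
ONE_DOT = re.compile(r"^(https?:\/\/)?[^\/.]+\.[a-z][a-z](\/|\s*$)")
ANY_DEPTH = re.compile(r"^(https?:\/\/)?[^\/]+(?<!\.[a-z][a-z])\.[a-z][a-z](\/\s*|\s*$)")

urls = [
    "bit.ly/2231",
    "http://bit.ly",
    "sub.bit.ly/x",
    "example.co.uk",
    "www.example.com/page",
]

# For each URL: (matched by first pattern, matched by second pattern)
verdicts = {
    u: (ONE_DOT.search(u) is not None, ANY_DEPTH.search(u) is not None)
    for u in urls
}
for url, verdict in verdicts.items():
    print(url, verdict)
```

Only sub.bit.ly/x separates them: the first pattern rejects it because the host has two dots, while the second accepts it. Both reject example.co.uk and www.example.com/page.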

Regex validation works without domain

So I have the following Regex URL validator:
[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,})+
It works perfectly well for my needs, except that it accepts URLs without a domain; for example, www.test is accepted.
How can I modify it to validate for a domain? (Any domain should be accepted, not just .com.)
Just require the last group in your regex to appear two or more times:
[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,}){2,}
As a disclaimer, and as #Wiktor will probably comment, you might want to use a regex pattern for validating URLs which already has been tested thoroughly. While this answer may fix your immediate problem, there are most likely other edge cases which exist.
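A quick check of the change, as a sketch in Python. The accepts helper is hypothetical, and using fullmatch to anchor the pattern to the whole string is an assumption, since the original pattern has no anchors.

```python
import re

ORIGINAL = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,})+")
FIXED = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,}){2,}")

def accepts(pattern, text):
    """True when the whole string matches (fullmatch is an added assumption)."""
    return pattern.fullmatch(text) is not None

for candidate in ["www.test", "www.test.com", "test.com"]:
    print(candidate, accepts(ORIGINAL, candidate), accepts(FIXED, candidate))
```

One of the edge cases the disclaimer hints at shows up immediately: with {2,}, www.test is rejected as intended, but a bare test.com is rejected too, because it has only one dotted group after the first label.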
You could do it like this to account for unicode:
^\p{L}+\.\p{L}+(\.\p{L}{2,})+
\p{L} or \p{Letter}: any kind of letter from any language.
So with this we match a group of one or more letters (subdomain), followed by a ., followed by a group of one or more letters (main domain), followed by one or more groups of a . with two or more letters (domain suffix).
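Not every engine supports \p{L}; Python's built-in re module, for one, does not. A rough stand-in there is [^\W\d_] (a word character that is neither a digit nor an underscore). The sketch below assumes that approximation is acceptable; it is slightly broader than \p{L}, and the sample hostnames are invented.

```python
import re

# Approximation of \p{L} for Python's re: word characters minus digits and "_".
LETTER = r"[^\W\d_]"

# Same structure as the answer's pattern: letters . letters (. 2+ letters)+
UNICODE_DOMAIN = re.compile(
    "^" + LETTER + r"+\." + LETTER + r"+(\." + LETTER + "{2,})+"
)

hosts = ["sub.münchen.de", "поддомен.пример.рф", "www.test"]
matched = {h: UNICODE_DOMAIN.match(h) is not None for h in hosts}
print(matched)
```

The first two hosts match and www.test does not, since it lacks a second dotted suffix. Like the original, the pattern has no end anchor, so trailing text after the match is not checked.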

Why doesn't this regexp for IPv4 work?

So this is the regex I've made:
^(([01]?\d{1,2})|(2(([0-4]\d)|(5[0-5])))\.){3}(([01]?\d{1,2})|(2(([0-4]\d)|(5[0-5]))))$
I have used several sites to break it down and it seems that it should work, but it doesn't. The desired result is to match any IPv4 - four numbers between 0 and 255 delimited by dots.
As an example, 1.1.1.1 won't give you a match.
The purpose of this question is not to find out a regex for IPv4 address, but to find out why this one, which seems correct, is not.
The literal . is only part of the 200-255 section of the capture group.
Here's (([01]?\d{1,2})|(2([0-4]\d)|(5[0-5]))\.) formatted differently to help you spot the reason:
(
([01]?\d{1,2})
|
(2([0-4]\d)|(5[0-5])) \.
)
You're matching 0-199 or 200-255 with a dot. The dot is conditional on matching 200-255.
Additionally, as #SebastianProske pointed out, 2([0-4]\d)|(5[0-5]) matches 200-249 or 50-55, not 200-255.
You can fix your regex by adding capturing groups, but ultimately I would recommend not reinventing the wheel and using A) a pre-existing regex solution or B) parsing the IPv4 address by splitting on dots. The latter method is easier to read and understand.
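Option B could look roughly like this in Python. It is a minimal sketch, not a full validator: the function name is_ipv4 is made up, and it deliberately accepts leading zeros such as 01.

```python
def is_ipv4(text):
    """Check a dotted-quad string by splitting on dots (hypothetical helper)."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # isascii() + isdigit() rejects empty parts, signs, spaces,
        # and non-ASCII digit characters.
        if not (part.isascii() and part.isdigit()):
            return False
        if int(part) > 255:
            return False
    return True

for candidate in ["1.1.1.1", "255.255.255.255", "256.1.1.1", "1.1.1", "1..1.1"]:
    print(candidate, is_ipv4(candidate))
```

Each rule (four parts, digits only, at most 255) is a separate line, which is what makes this easier to read than the equivalent alternation.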
To fix yours up, just account for the "decimal" (the dot) after each of the first three groups:
((2[0-4]\d|25[0-5]|[01]?\d{1,2})\.){3}(2[0-4]\d|25[0-5]|[01]?\d{1,2})
(Note that I reversed the order of the 2xx vs 1xx tests as well: prefer SPECIAL|...|NORMAL, i.e. more restrictive first, when using alternations like this.)
see it in action
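To compare the question's pattern with the repaired one, here is a small Python sketch. BROKEN is the regex from the question and FIXED is the one just above; using fullmatch for FIXED is an assumption, since that pattern carries no anchors of its own.

```python
import re

# The question's pattern: the dot belongs only to the 200-255 alternative.
BROKEN = re.compile(
    r"^(([01]?\d{1,2})|(2(([0-4]\d)|(5[0-5])))\.){3}"
    r"(([01]?\d{1,2})|(2(([0-4]\d)|(5[0-5]))))$"
)

# The repaired pattern: every one of the first three octets is followed by a dot.
FIXED = re.compile(
    r"((2[0-4]\d|25[0-5]|[01]?\d{1,2})\.){3}(2[0-4]\d|25[0-5]|[01]?\d{1,2})"
)

broken_hits = {s: BROKEN.match(s) is not None for s in ["1.1.1.1", "1111", "200.200.200.1"]}
fixed_hits = {
    s: FIXED.fullmatch(s) is not None
    for s in ["1.1.1.1", "255.255.255.255", "256.1.1.1", "1111"]
}
print(broken_hits)
print(fixed_hits)
```

The broken pattern rejects 1.1.1.1 yet accepts 1111 (four dot-less 0-199 parts) and 200.200.200.1 (where each dot follows a 2xx number), which is exactly the misplaced-dot behavior described earlier.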

Regex: How can I match third IPv4 address?

I'm a regex noob and for the life of me I can't figure out how to match the third IPv4 address on a line that contains three IPv4 addresses.
The line in question:
ip route 214.25.48.547 255.255.255.255 16.48.75.46 name Chicago-VPN
The regex I have so far that matches one IP:
([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})
If I put a {3} at the end of it, it breaks. I think it has something to do with the spaces between the addresses but I can't figure out how to handle that. I need to capture the third address.
https://regex101.com/r/mN3cR6/1
You just need to add the global modifier (g), so that every occurrence is matched and not only the first.
Your new code should look like this:
/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})/g
See this demo https://regex101.com/r/mN3cR6/2
Try
([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\s?)+
This should match one, two, or three, or even more "IPs".
Or
([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})\s([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})\s([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})
for exactly 3.
Or
([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\s?){3}
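for a shorter formula, with caveats: because \s? is optional it also accepts addresses that run together with no separator, and the capture group keeps only the last repetition.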
Note that the basic idea is problematic too, as it matches "999.999.999.999" when it is definitely not a valid IP address.
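If validity matters, one option is to let the regex find candidates and let a parser decide. The sketch below swaps in Python's standard ipaddress module for that second step; the helper name is_valid_ipv4 is made up, and the line is the one from the question.

```python
import ipaddress
import re

LINE = "ip route 214.25.48.547 255.255.255.255 16.48.75.46 name Chicago-VPN"
CANDIDATE = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")

def is_valid_ipv4(text):
    """True when the standard library accepts the text as an IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False

candidates = CANDIDATE.findall(LINE)                    # everything IP-shaped
valid = [c for c in candidates if is_valid_ipv4(c)]     # only real addresses
print(candidates)
print(valid)
```

On the sample line the regex finds three candidates, but 214.25.48.547 is dropped by the validity check because 547 is above 255.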
The following should match the third IP:
(?:[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\s){2}([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})
It's possible to be more compact depending on what language you're using. For instance, in Ruby
string.scan(/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})/)[2]
would give you what you want. You could also collapse the repeated [0-9]{1,3}\. instances using non-capturing groups and counts.
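Both ideas from this answer, sketched in Python rather than Ruby. The collapsed IP sub-pattern and the variable names are illustrative; the line is the one from the question.

```python
import re

LINE = "ip route 214.25.48.547 255.255.255.255 16.48.75.46 name Chicago-VPN"

# Collapsed form: one number, then three ".number" repetitions (non-capturing).
IP = r"[0-9]{1,3}(?:\.[0-9]{1,3}){3}"

# Skip two addresses without capturing them, then capture the third.
third = re.search("(?:" + IP + r"\s){2}(" + IP + ")", LINE).group(1)

# Counterpart of the Ruby scan(...)[2]: collect all matches, take index 2.
third_by_index = re.findall(IP, LINE)[2]

print(third, third_by_index)
```

Both routes give 16.48.75.46 for the sample line. The indexing version raises IndexError when fewer than three addresses are present, so it assumes the line format is known.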
The problem is that the regex needs to contain not only the IPs but also the spaces between them.
So adding a space into the repeated group should do the trick:
([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3} ){3}
If you don't want that space in the final match, make it non-greedy using ?? (or *?):
([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3} ??){3}
Also note that your regex matches more than just valid IPs; e.g. 999.999.999.999 would match nicely.
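Here is how the two variants behave on the line from the question, as a Python sketch (variable names are illustrative). A repeated capture group keeps only its last repetition, which is what makes group 1 end up holding the third address.

```python
import re

LINE = "ip route 214.25.48.547 255.255.255.255 16.48.75.46 name Chicago-VPN"
IP = r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"

# Mandatory space inside the repeated group.
with_space = re.search("(" + IP + " ){3}", LINE)

# Lazy optional space: taken only when the next repetition needs it.
lazy_space = re.search("(" + IP + " ??){3}", LINE)

print(repr(with_space.group(0)), repr(with_space.group(1)))
print(repr(lazy_space.group(0)), repr(lazy_space.group(1)))
```

With the mandatory space, both the whole match and group 1 end in a trailing space. With the lazy space, the match ends right after the third address and group 1 is exactly 16.48.75.46.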
You are already matching all three IPs with that regex.
([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})
Match 1
214.25.48.547
Match 2
255.255.255.255
Match 3
16.48.75.46
You can test it here:
http://rubular.com/
The problem may be with how you are trying to access them.
In Ruby, your regex works perfectly:
regex = /([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})/
"ip route 214.25.48.547 255.255.255.255 16.48.75.46 name Chicago-VPN".scan(regex)
=> [["214.25.48.547"], ["255.255.255.255"], ["16.48.75.46"]]

Clean and extract Subdomains & Domains from URLs using Regex Notepad++

This is simple text file.
The URL:
Can have https:// or http://
Eliminate both as well as trailing url/ file paths
Extract only domains and/or subdomains
I have Notepad++ and EditPlus
open to other Suggestions?
Examples:
https://appspace.com
http://appspace.com/
http://ayurfit.ning.com/main/authorization/signIn
http://bangalore.olx.in/login.php
http://birthdayshoes.com/forum/index.php
http://birthdayshoes.com/forum/register/
http://forums.virtualbox.org/ucp.php
Tries:
/(?!.{253})((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.){1,126}+[A-Za-z]{2,6}/
^(?:https?://)?([^/.]+(?=\.)|)(\.?[^/.]+\.[^/]+)/?(.+|)$
https://regex101.com/r/hZ4cL4/4
Tried many on other machine as examples from Regex101
Found this little nugget as well. I'll post how it's different once I understand it.
Regular Expression - Extract subdomain & domain
For the links that start with protocol, you can use the following regex:
(?<=://)[\w-]+(?:\.[\w-]+)+\b
See demo
The (?<=://) look-behind makes sure there is :// before the value we want to match, and the whole matched text consists of sequences of 1 or more word characters or hyphens ([\w-]+) that are eventually separated with periods.
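Outside Notepad++, the same pattern can be tried in a few lines of Python. This is only a sketch: the URL list is copied from the question, and the names HOST, hosts and unique_hosts are made up.

```python
import re

# The look-behind pattern from the answer above, unchanged.
HOST = re.compile(r"(?<=://)[\w-]+(?:\.[\w-]+)+\b")

urls = [
    "https://appspace.com",
    "http://appspace.com/",
    "http://ayurfit.ning.com/main/authorization/signIn",
    "http://bangalore.olx.in/login.php",
    "http://birthdayshoes.com/forum/index.php",
    "http://birthdayshoes.com/forum/register/",
    "http://forums.virtualbox.org/ucp.php",
]

# Every sample URL has a protocol, so search() always finds a host here.
hosts = [HOST.search(u).group(0) for u in urls]
unique_hosts = sorted(set(hosts))
print(unique_hosts)
```

File names such as login.php are not picked up, because the look-behind only allows a match to start directly after ://.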
You could simply extract anything that is between two . Additionally
you could use lookbehinds for http(s) and lookahead for the filepath
to fine tune your results.
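Since the question is open to other suggestions: if a scripting language is available, a URL parser avoids the regex altogether. The sketch below uses urlsplit from Python's standard library, which is a different technique from the answers above; the host_of helper is hypothetical.

```python
from urllib.parse import urlsplit

def host_of(url):
    """Return the host part of a URL, or None when no protocol is present."""
    return urlsplit(url).hostname

urls = [
    "https://appspace.com",
    "http://ayurfit.ning.com/main/authorization/signIn",
    "http://forums.virtualbox.org/ucp.php",
]
hosts = [host_of(u) for u in urls]
print(hosts)

# Without "http://" or "https://" the parser sees no host at all.
no_scheme = host_of("appspace.com/forum")
```

This only works for lines that include the protocol; for a bare appspace.com/forum the result is None, so such lines would need the protocol added first.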