Validate URL domain using regex

I'm trying to validate a URL in multiple formats.
EG:
https://www.google.com OK
http://www.google.com OK
www.google.com OK
htt://www.google.com
://www.google.com
https://google.com OK
http\\www.google.com
http:\\www.google.com
http:\\www.google.com
http://computerName/abc/MenuItems.aspx OK
https://computerName/abc/MenuItems.aspx OK
http://www.a.com/abc/ms.aspx?Id=13&(Not.Licensed.For.Production)= OK
http://www.a.com/abc/ms.aspx?Id=13 OK
I'm using this regex
^(?:https?:\/\/(?:www\.)?|www\.)[a-z0-9]+(?:[a-zA-Z0-9_\-/\.]+)*(?::[0-9]{1,5})?(?:\/[a-z0-9]+)*(?:\.[a-z]{2,5})?$
But the last two items aren't matched. How can I also validate URLs without www and without a suffix such as .com?
I tried removing the part that I think validates the .com, but without success.
It was almost working, but I found new cases (the last two) that don't work.
Here's my sample https://regex101.com/r/w3dQSl/5

For your example links, you could use
^https?:\/\/[a-z0-9]+(?:[-.][a-z0-9]+)*(?::[0-9]{1,5})?(?:\/[^\/\r\n]+)*\.[a-z]{2,5}(?:[?#]\S*)?$
The pattern will match:
^ Start of string
https?:\/\/ Match the protocol with optional s and ://
[a-z0-9]+ Match 1+ times any of the listed
(?:[-.][a-z0-9]+)* Repeat 0+ times any of the listed preceded by a - or .
(?::[0-9]{1,5})? Optionally match : and 1-5 digits
(?:\/[^\/\r\n]+)* Repeat 0+ times / and any char except /
\.[a-z]{2,5} Match a . and 2-5 times a char a-z
(?:[?#]\S*)? Optionally match either ? or # and 0+ times any non whitespace char
$ End of string
Regex demo
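To try the pattern above outside regex101, here is a minimal Python sketch. The name `is_valid_url` and the `re.IGNORECASE` flag are assumptions made for illustration; the flag is needed because the sample URLs contain upper-case letters (for example "computerName") while the character classes only list a-z.

```python
import re

# The answer's pattern, unchanged. IGNORECASE is an assumption so that
# hosts and paths with upper-case letters can match.
URL_RE = re.compile(
    r"^https?:\/\/[a-z0-9]+(?:[-.][a-z0-9]+)*(?::[0-9]{1,5})?"
    r"(?:\/[^\/\r\n]+)*\.[a-z]{2,5}(?:[?#]\S*)?$",
    re.IGNORECASE,
)


def is_valid_url(url):
    """Return True if the whole string matches the pattern (hypothetical helper)."""
    return URL_RE.match(url) is not None


samples = [
    "https://www.google.com",
    "https://google.com",
    "http://computerName/abc/MenuItems.aspx",
    "http://www.a.com/abc/ms.aspx?Id=13",
    "http://www.a.com/abc/ms.aspx?Id=13&(Not.Licensed.For.Production)=",
    "www.google.com",          # no scheme, so this pattern rejects it
    "htt://www.google.com",
    r"http:\\www.google.com",
]

for s in samples:
    print(s, "->", is_valid_url(s))
```

Note that this pattern requires the http:// or https:// prefix, so a bare www.google.com is not accepted by it.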

You ought to think about which URLs you are trying to match, but for your examples, this does the job and is also simpler than the initial regex you provided:
^(https?://)?(www\.)?[a-zA-Z0-9-]+[a-zA-Z0-9_\-/\.\?=\&\(\)]+$
Optionally you can make some of the groups non-capturing by adding ?: after the opening parentheses.
Debuggex Demo
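As a rough illustration of the capturing versus non-capturing remark, the Python sketch below compiles both forms of the pattern. The names `CAPTURING`, `NON_CAPTURING` and `matches` are hypothetical and only used for this example.

```python
import re

# The pattern as given, with two capturing groups.
CAPTURING = re.compile(
    r"^(https?://)?(www\.)?[a-zA-Z0-9-]+[a-zA-Z0-9_\-/\.\?=\&\(\)]+$"
)

# The same pattern with ?: added after each opening parenthesis.
NON_CAPTURING = re.compile(
    r"^(?:https?://)?(?:www\.)?[a-zA-Z0-9-]+[a-zA-Z0-9_\-/\.\?=\&\(\)]+$"
)


def matches(url):
    """Return True if the URL matches the non-capturing form (hypothetical helper)."""
    return NON_CAPTURING.match(url) is not None


url = "https://www.google.com"
print(CAPTURING.match(url).groups())      # the scheme and www. parts are captured
print(NON_CAPTURING.match(url).groups())  # no groups are captured

for s in ["www.google.com", "htt://www.google.com", "://www.google.com",
          r"http:\\www.google.com",
          "http://www.a.com/abc/ms.aspx?Id=13&(Not.Licensed.For.Production)="]:
    print(s, "->", matches(s))
```

Both forms accept and reject the same strings; the only difference is whether the optional prefix parts are stored as groups.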

Related

Regex to properly match urls with a particular domain and also if there is a subdomain added

I have the following regex:
(^|^[^:]+:\/\/|[^\.]+\.)hello\.net
Which seems to work for most cases such as these:
http://hello.net
https://hello.net
http://www.hello.net
https://www.hello.net
http://domain.hello.net
https://solutions.hello.net
hello.net
www.hello.net
However it still matches this which it should not:
hello.net.domain.com
You can see it here:
https://regex101.com/r/fBH112/1
I am basically trying to check if a URL is part of hello.net, so hello.net and any subdomains such as sub.hello.net should all match.
It should also match hello.net/bye, so anything after hello.net is irrelevant.
You may fix your pattern by adding (?:\/.*)?$ at the end:
(^|^[^:]+:\/\/|[^.]+\.)hello\.net(?:\/.*)?$
See the regex demo. The (?:\/.*)?$ matches an optional sequence of / and any 0 or more chars and then the end of string.
You might consider a "cleaner" pattern like
^(?:\w+:\/\/)?(?:[^\/.]+\.)?hello\.net(?:\/.*)?$
See the regex demo. Details:
^ - start of string
(?:\w+:\/\/)? - an optional occurrence of 1+ word chars, and then the :// char sequence
(?:[^\/.]+\.)? - an optional occurrence of any 1 or more chars other than / and . and then .
hello\.net - hello.net
(?:\/.*)?$ - an optional occurrence of / and then any 0+ chars and then end of string
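A small Python sketch of the "cleaner" pattern, checked against the URLs from the question. The function name `is_hello_net` is an assumption for illustration only.

```python
import re

# The "cleaner" pattern from the answer, unchanged.
HELLO_RE = re.compile(r"^(?:\w+:\/\/)?(?:[^\/.]+\.)?hello\.net(?:\/.*)?$")


def is_hello_net(url):
    """Return True if the URL is hello.net or one subdomain level of it (hypothetical helper)."""
    return HELLO_RE.match(url) is not None


should_match = [
    "http://hello.net",
    "https://www.hello.net",
    "https://solutions.hello.net",
    "hello.net",
    "www.hello.net",
    "hello.net/bye",
]
should_not_match = [
    "hello.net.domain.com",
    "evilhello.net",
]

for url in should_match + should_not_match:
    print(url, "->", is_hello_net(url))
```

Because the subdomain part is a single optional group, this pattern accepts at most one subdomain label before hello.net.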

problems with regex url validator

I'm trying to create a regex to test if a url is valid or not. I had a good example to work off of, but I had to tweak it a bit to make it fit my purpose:
^(https?:\/\/)(www\.)?(\w*\.)+([\w\-_~:/?#[\]#!$&'()*+,;=.])*$
It works fine for the most part, but it matches the following, which drives me nuts:
http://www..example..com
I tried forever and I just can't get the magical combination of characters to get it to ignore the above use case. What am I doing wrong?
Here's a list of things I want the regex to match (all of them are matched):
http://www.example.com
https://www.example.com
https://www.example.com/
https://example.com/
https://blog.example.com/
https://my.blog.example.com/
https://my.blog.example.co.uk/
https://www.example.com/#test
https://www.example.com#test
https://www.example.com/test.php
https://www.example.com/test.php?test=yes&testmore=yesevenmore
https://www.example.com/test.php#test
https://www.example.com/test.php?test=yes&testmore2=yesevenmore&whatnumber=42#test
https://www.example.com/test
https://www.example.com/test/
https://www.example.com/test/?test=yes&testmore2=yesevenmore&whatnumber=42
https://www.example.com/test/#test
https://www.example.com/test/?test=yes&testmore=yesevenmore&whatnumber=42#test
https://www.example.com/test/?test=yes&testmore=yesevenmore&whatnumber=42#test
https://www.blog.example.com/test/?test=yes&testmore=yesevenmore&whatnumber=42#test
https://www.my.blog.example.com/test/?test=yes&testmore=yesevenmore&whatnumber=42#test
https://my.blog.example.co.uk/?test=yes&testmore=yesevenmore&whatnumber=42#test
http://255.255.255.255
http://www.example.com:8008
http://www.example.com:8008/test/?test=yes&testmore=yesevenmore&whatnumber=42#test
Here's a list of things I DON'T want it to match:
www.example.com
example.com
*http://www.blog..example..com
*http://www..example.com
*http://www...example.com
*http://www..example..com
http://www.example.com | not valid
http://www.example.com|
255.255.255.255
* still matched
How can I prevent regex from matching the multidots?
Your pattern matches a dot both literally via \. and inside the character class, which is repeated as a group. In addition, (\w*\.)+ matches consecutive dots because \w* can match zero characters.
You could shorten the character class, as some characters do not have to be escaped and \w also matches _
Using the characters from your character class that you accept as valid, you could repeat a group that matches 1+ of those characters (excluding the dot) followed by a single dot, and then match 1+ of those characters without a dot at the end:
^https?:\/\/(?:[-\w~:/?#[\]#!$&'()*+,;=]+\.)*[-\w~:/?#[\]#!$&'()*+,;=]+$
That will match
^ Start of string
https?:\/\/ Match http:// or https://
(?: Non capturing group
[-\w~:/?#[\]#!$&'()*+,;=]+\. Match 1+ times any of listed, then match a .
)* Close group and repeat 0+ times
[-\w~:/?#[\]#!$&'()*+,;=]+ Match any of the listed 1+ times (note that there is no .)
$ End of string
Regex demo
A more specific variant:
^https?:\/\/\w+(?:\.\w+)*(?:[/#:][-\w~:/?#[\]#!$&'()*+,;=.]*)?$
Regex demo
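Both patterns from this answer can be compared in a short Python sketch. The names `STRICT_RE`, `SPECIFIC_RE` and `check` are hypothetical; the patterns themselves are copied as given.

```python
import re

# First pattern: no dot in the character class, dots only between groups.
STRICT_RE = re.compile(
    r"^https?:\/\/(?:[-\w~:/?#[\]#!$&'()*+,;=]+\.)*[-\w~:/?#[\]#!$&'()*+,;=]+$"
)

# More specific variant: host made of \w+ labels, then an optional tail.
SPECIFIC_RE = re.compile(
    r"^https?:\/\/\w+(?:\.\w+)*(?:[/#:][-\w~:/?#[\]#!$&'()*+,;=.]*)?$"
)


def check(url):
    """Return a (strict, specific) pair of booleans for one URL (hypothetical helper)."""
    return (STRICT_RE.match(url) is not None,
            SPECIFIC_RE.match(url) is not None)


for url in [
    "http://www.example.com",
    "https://www.example.com/test.php",
    "http://www.example.com:8008",
    "http://255.255.255.255",
    "http://www..example..com",
    "http://www.example.com|",
    "www.example.com",
]:
    print(url, "->", check(url))
```

The more specific variant still allows dots in the part after the host, so it only guards against consecutive dots in the host name itself.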

Regular expression for relative URLs (segment after the domain name) [& no querystring]

The relative URLs that I need to match are as follows:
/
/asdf
/asdf.php
/sdfsf/
/asdfsdf/asdf
/asdfsdf/s-_df.jpg
/
/asdf#
/asdf.php#
/sdfsf/
/asdfsdf/asdf#
/asdfsdf/s-_df.jpg#
I have tried a number of patterns but seem to be hitting the wall -
https://regex101.com/r/GnK43b/4
https://regex101.com/r/GnK43b/1
As the result, I need regex groups that capture the segments:
For example:
/sdfsf/ => Group 1: sdfsf
/asdfsdf/asdf => Group 1: asdfsdf; Group 2: asdf
It's not very clear how exactly you expect the results, but let's imagine you want to get these matches:
URL GROUP1 GROUP2
----------------------------------------------------------
/
/asdf asdf
/asdf.php asdf.php
/sdfsf/ sdfsf
/asdfsdf/asdf asdfsdf asdf
/asdfsdf/s-_df.jpg asdfsdf s-_df.jpg
/asdfsdf/s-_df.jpg# asdfsdf s-_df.jpg
then you could use this regex:
/^/(?:([0-9a-zA-Z._]+)#?)(?:/([0-9a-zA-Z._]+#?)?)?$/gmU
https://regex101.com/r/GUTGn4/2
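The two-group idea can be sketched in Python as follows. This is an adaptation rather than the exact pattern above: it has no PCRE delimiters or flags, allows - inside a segment, makes the first segment optional so that a bare / matches, and keeps the trailing # outside the captures so that the groups line up with the table. The names `SEGMENTS_RE` and `split_segments` are assumptions.

```python
import re

# Adapted pattern (an assumption, see the note above): up to two
# segments, each optionally followed by "#", captured without the "#".
SEGMENTS_RE = re.compile(
    r"^/(?:([0-9a-zA-Z._-]+)#?)?(?:/(?:([0-9a-zA-Z._-]+)#?)?)?$"
)


def split_segments(path):
    """Return (group1, group2) for a relative URL, or None if it does not match."""
    m = SEGMENTS_RE.match(path)
    return m.groups() if m else None


for path in ["/", "/asdf", "/asdf.php", "/sdfsf/", "/asdfsdf/asdf",
             "/asdfsdf/s-_df.jpg", "/asdfsdf/s-_df.jpg#"]:
    print(path, "->", split_segments(path))
```

A group that did not take part in the match is reported as None, which corresponds to an empty cell in the table.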
To match all those values, you could match the leading forward slash, then repeat 0+ times a group of 1+ allowed characters followed by a forward slash.
At the end, match 0+ times any of the characters listed in the final character class, followed by an optional #
^/(?:[0-9a-zA-Z_]+/)*[0-9a-zA-Z_.-]*#?$
^/ Start of string and forward slash
(?:[0-9a-zA-Z_]+/)* Match 0+ times any of the listed and forward slash
[0-9a-zA-Z_.-]* Match 0+ times any of the listed
#? Match optional #
$ End of string
Regex demo
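For validation only, the pattern above can be exercised in Python like this. The name `is_relative_url` is hypothetical, and the query-string sample is an extra negative case added for illustration.

```python
import re

# The validation pattern from the answer, unchanged.
RELATIVE_RE = re.compile(r"^/(?:[0-9a-zA-Z_]+/)*[0-9a-zA-Z_.-]*#?$")


def is_relative_url(path):
    """Return True if the path matches the pattern (hypothetical helper)."""
    return RELATIVE_RE.match(path) is not None


for path in ["/", "/asdf", "/asdf.php", "/sdfsf/", "/asdfsdf/asdf",
             "/asdfsdf/s-_df.jpg", "/asdf#", "/asdfsdf/s-_df.jpg#",
             "/asdf?x=1", "asdf", "//"]:
    print(path, "->", is_relative_url(path))
```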

Issue matching exact word

I am building a website validator regex that can match a url.
Thing is, it 90% works! It goes in and out of my string match which is where the issue is.
My regex: (http(s?)://www.|www.|http(s?)://)+[a-z0-9]+([-.]{1}[a-z0-9]+).[a-z]{2,5}(:[0-9]{1,5})?(/.)?
My string to test with:
1)(This should fail, but it passes) https://www.xy
2)(This should pass, which it does) https://www.xy.com
It keeps going into my group (http(s?)://) instead of the group (http(s?)://www.)
Any idea on how to solve this?
URLs I want to pass:
http://www.test.com
http://test.com
https://test.com
https://www.test.com
URLs I want to fail:
http://www.bla
https://www.ggg
So, if it matches https://www. or http://www., it should use the correct group and then apply the rest of the regex, where it checks that the URL contains a domain such as test.com.
You may use
^(?:https?:\/\/)?(?!www\.[^.]+$)(?:www\.)?[a-z0-9]+(?:[-.][a-z0-9]+)*\.[a-z]{2,5}(?::[0-9]{1,5})?(\/.*)?$
See the regex demo
Details
^ - start of string
(?:https?:\/\/)? - an optional http:// or https://
(?!www\.[^.]+$) - a negative lookahead that fails the match if immediately to the right of the current position there is www. and then any 1+ chars other than dot to the end of the string
(?:www\.)? - an optional www.
[a-z0-9]+ - 1+ lowercase letters and digits
(?:[-.][a-z0-9]+)* - 0 or more repetitions of - or . and then 1+ lowercase letters and digits
\. - a .
[a-z]{2,5} - two to five lowercase letters
(?::[0-9]{1,5})? - an optional sequence of : and 1 to 5 digits
(\/.*)? - an optional sequence of / and the rest of the line
$ - end of the string.
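The effect of the negative lookahead can be seen in a short Python sketch run against the URLs from the question. The name `is_website` is an assumption for illustration.

```python
import re

# The answer's pattern, unchanged. The lookahead (?!www\.[^.]+$) rejects
# inputs where "www." is followed by a single dot-free label up to the
# end of the string, so "www" cannot act as the domain name.
SITE_RE = re.compile(
    r"^(?:https?:\/\/)?(?!www\.[^.]+$)(?:www\.)?[a-z0-9]+"
    r"(?:[-.][a-z0-9]+)*\.[a-z]{2,5}(?::[0-9]{1,5})?(\/.*)?$"
)


def is_website(url):
    """Return True if the URL matches the pattern (hypothetical helper)."""
    return SITE_RE.match(url) is not None


for url in ["http://www.test.com", "http://test.com", "https://test.com",
            "https://www.test.com", "https://www.xy.com",
            "http://www.bla", "https://www.ggg", "https://www.xy"]:
    print(url, "->", is_website(url))
```

Without the lookahead, https://www.xy would match because www could be read as the domain name and xy as the top-level part.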

Regex match if certain string is contained after last occurrence of specific character

For example, I want to check if the web url contains 'foo' after last slash, and match the entire url. So the following url should be a match:
https://www.facebook.com/messages/new/foobar
https://www.facebook.com/messages/t/barfoo
https://www.facebook.com/bfooar
https://foobar.com
https://foobar.com/foo
But the following shouldn't:
https://random.com/random
https://foobar.com/something
https://foobar.com/foo/bar
My approach is ((\\.*)*\\.*foo.*), but it doesn't seem to work for any URL that contains foo before the last slash. Is this pattern even doable in regex, or do I have to use something like split('/') in the code to achieve this?
Thanks
You can use this regex:
^.*/[^/]*foo[^/]*$
RegEx Demo
Breakdown:
^ - Start
.* - Match 0 or more characters (greedy)
/ - Match a /
[^/]* - Match 0 or more non-/ characters
foo - match foo
[^/]* - Match 0 or more non-/ characters
$ - End
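A minimal Python sketch of the regex above, next to the split-based alternative mentioned in the question. The function names are hypothetical. The two approaches agree on the sample URLs, but they differ for strings with no slash at all, since the regex requires at least one /.

```python
import re

# The answer's pattern, unchanged.
FOO_RE = re.compile(r"^.*/[^/]*foo[^/]*$")


def foo_after_last_slash(url):
    """Regex version: 'foo' must appear after the last '/' (hypothetical helper)."""
    return FOO_RE.match(url) is not None


def foo_after_last_slash_split(url):
    """Split version: take the text after the last '/' and search it (hypothetical helper)."""
    return "foo" in url.rsplit("/", 1)[-1]


urls = [
    "https://www.facebook.com/messages/new/foobar",
    "https://www.facebook.com/messages/t/barfoo",
    "https://www.facebook.com/bfooar",
    "https://foobar.com",
    "https://foobar.com/foo",
    "https://random.com/random",
    "https://foobar.com/something",
    "https://foobar.com/foo/bar",
]

for url in urls:
    print(url, "->", foo_after_last_slash(url), foo_after_last_slash_split(url))
```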