problems with regex url validator - regex

I'm trying to create a regex to test if a url is valid or not. I had a good example to work off of, but I had to tweak it a bit to make it fit my purpose:
^(https?:\/\/)(www\.)?(\w*\.)+([\w\-_~:/?#[\]#!$&'()*+,;=.])*$
It works fine for the most part, but it matches the following, which drives me nuts:
http://www..example..com
I tried forever and I just can't get the magical combination of characters to get it to ignore the above use case. What am I doing wrong?
Here's a list of things I want the regex to match (all of them are matched):
http://www.example.com
https://www.example.com
https://www.example.com/
https://example.com/
https://blog.example.com/
https://my.blog.example.com/
https://my.blog.example.co.uk/
https://www.example.com/#test
https://www.example.com#test
https://www.example.com/test.php
https://www.example.com/test.php?test=yes&testmore=yesevenmore
https://www.example.com/test.php#test
https://www.example.com/test.php?test=yes&testmore2=yesevenmore&whatnumber=42#test
https://www.example.com/test
https://www.example.com/test/
https://www.example.com/test/?test=yes&testmore2=yesevenmore&whatnumber=42
https://www.example.com/test/#test
https://www.example.com/test/?test=yes&testmore=yesevenmore&whatnumber=42#test
https://www.example.com/test/?test=yes&testmore=yesevenmore&whatnumber=42#test
https://www.blog.example.com/test/?test=yes&testmore=yesevenmore&whatnumber=42#test
https://www.my.blog.example.com/test/?test=yes&testmore=yesevenmore&whatnumber=42#test
https://my.blog.example.co.uk/?test=yes&testmore=yesevenmore&whatnumber=42#test
http://255.255.255.255
http://www.example.com:8008
http://www.example.com:8008/test/?test=yes&testmore=yesevenmore&whatnumber=42#test
Here's a list of things I DON'T want it to match:
www.example.com
example.com
*http://www.blog..example..com
*http://www..example.com
*http://www...example.com
*http://www..example..com
http://www.example.com | not valid
http://www.example.com|
255.255.255.255
* still matched
How can I prevent regex from matching the multidots?

In your pattern, \w* can match zero characters, so the repeated group (\w*\.)+ also matches consecutive dots. The character class at the end contains a dot as well, so it accepts them too.
You could shorten the character class, as some characters do not have to be escaped and \w also matches _.
Using the characters that you accept as valid, you could repeat a group that matches one or more of those characters (excluding the dot) followed by a single dot, and then match the same characters once more at the end:
^https?:\/\/(?:[-\w~:/?#[\]#!$&'()*+,;=]+\.)*[-\w~:/?#[\]#!$&'()*+,;=]+$
That will match
^ Start of string
https?:\/\/ Match http:// or https://
(?: Non capturing group
[-\w~:/?#[\]#!$&'()*+,;=]+\. Match 1+ times any of listed, then match a .
)* Close group and repeat 0+ times
[-\w~:/?#[\]#!$&'()*+,;=]+ Match any of the listed 1+ times (note that there is no .)
$ End of string
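To try the pattern against a few entries from the lists in the question, here is a minimal Python sketch. The helper name is_valid_url and the choice of samples are assumptions made for illustration; the pattern is the one above, written as a raw string.

```python
import re

# Pattern from the answer above. The character class has no dot, so every
# dot in the URL must be preceded by at least one allowed character.
URL_PATTERN = re.compile(
    r"^https?:\/\/(?:[-\w~:/?#[\]#!$&'()*+,;=]+\.)*[-\w~:/?#[\]#!$&'()*+,;=]+$"
)


def is_valid_url(url: str) -> bool:
    """Return True when the string matches URL_PATTERN (hypothetical helper)."""
    return URL_PATTERN.match(url) is not None


samples = [
    "http://www.example.com",
    "https://my.blog.example.co.uk/?test=yes&testmore=yesevenmore&whatnumber=42#test",
    "http://www.example.com:8008",
    "http://www..example..com",
    "http://www...example.com",
    "www.example.com",
    "http://www.example.com | not valid",
]

for url in samples:
    print(url, "->", is_valid_url(url))
```

The first three samples should be accepted and the remaining four rejected, including the multi-dot hosts.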
A more specific variant:
^https?:\/\/\w+(?:\.\w+)*(?:[/#:][-\w~:/?#[\]#!$&'()*+,;=.]*)?$
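What "more specific" means here can be seen by running both patterns side by side. This is a sketch only; the names LOOSE and STRICT and the sample strings are assumptions made for this comparison.

```python
import re

# First pattern from the answer: any allowed character may appear in the host.
LOOSE = re.compile(
    r"^https?:\/\/(?:[-\w~:/?#[\]#!$&'()*+,;=]+\.)*[-\w~:/?#[\]#!$&'()*+,;=]+$"
)

# More specific variant: the host is word characters separated by single dots,
# optionally followed by a part that starts with /, # or :.
STRICT = re.compile(
    r"^https?:\/\/\w+(?:\.\w+)*(?:[/#:][-\w~:/?#[\]#!$&'()*+,;=.]*)?$"
)


def check(url: str) -> tuple:
    """Return (loose_result, strict_result) for one URL."""
    return (LOOSE.match(url) is not None, STRICT.match(url) is not None)


for url in [
    "http://www.example.com:8008/test/?test=yes#test",
    "http://www..example.com",
    "http://???",
]:
    print(url, check(url))
```

Both patterns reject the double dot, but only the second one rejects a host made of punctuation such as `http://???`.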

Related

Regex to match after double backslash slash until .com

I've been trying to regex the following message:
Netlogon has failed an additional 130 authentication requests in the
last 30 minutes. The requests timed out before they could be sent to
domain controller \\AA-SRV85.xx.acme.com in domain XX. Please see
http://support.microsoft.com/kb/2654097 for more information.
So far, I've managed to understand that if I use the following, I will find a match for \\AA-SRV85.xx.acme.com
\\\\AA\-SRV85\.xx\.acme\.com
But the thing is, I have multiple servers in my environment and the server name will certainly vary.
Can someone please explain how this should be done?
My goal is to match everything after the double backslash until the end of the domain (.com).
This will match the double backslash and everything after it until the end of the domain (.com), as requested in your question:
\\\\.*?\.com
You may want to modify it a bit to match upper or lower case COM:
\\\\.*?\.[Cc][Oo][Mm]
Here is how it works:
\\ matches \
. matches any character (except for line terminators)
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\. matches a period
[Cc] matches C or c
[Oo] matches O or o
[Mm] matches M or m
Replacing the . with [^ ] so that it won't match spaces between
the \\ and the .com is probably an improvement also...
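As a minimal sketch of that last suggestion in Python: the pattern below uses [^ ] instead of . and is case-insensitive. The capture group, which returns the name without the leading backslashes, and the helper name find_server are additions made for illustration and are not part of the answer.

```python
import re

# The log message from the question; the raw-string piece keeps both backslashes.
message = (
    "Netlogon has failed an additional 130 authentication requests in the "
    "last 30 minutes. The requests timed out before they could be sent to "
    r"domain controller \\AA-SRV85.xx.acme.com in domain XX. Please see "
    "http://support.microsoft.com/kb/2654097 for more information."
)

# \\\\ matches the two backslashes, [^ ]*? lazily matches non-space
# characters, and \.com ends the match at the first ".com".
SERVER_PATTERN = re.compile(r"\\\\([^ ]*?\.com)", re.IGNORECASE)


def find_server(text: str):
    """Return the server name after the double backslash, or None."""
    match = SERVER_PATTERN.search(text)
    return match.group(1) if match else None


print(find_server(message))
```

Because the pattern is not anchored, it finds the server name anywhere in the message, whatever the name is.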
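^\\\\[a-zA-Z0-9.-]+\.com$
Here is how it works:
^ anchors the match at the start of the string.
\\ matches one \, so \\\\ matches the two backslashes before the server name.
[a-zA-Z0-9.-] matches one character from a-z, A-Z, 0-9, a period, or a dash.
+ repeats the preceding character class one or more times.
\.com matches the literal .com (the dot has to be escaped, otherwise it matches any character).
$ anchors the match at the end of the string.
Because of ^ and $, this only matches when the string is the server name alone; drop the anchors to find the name inside the full message.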
/\\\\\w+-\w+\.xx\.\w+\.com/ig
Here we have:
\\ matches a \
\w+ matches one or more word characters
- matches a -
\. matches a .
com matches the exact string com
g searches globally in the string
i makes the match case-insensitive

Regex to properly match urls with a particular domain and also if there is a subdomain added

I have the following regex:
(^|^[^:]+:\/\/|[^\.]+\.)hello\.net
Which seems to work for most cases such as these:
http://hello.net
https://hello.net
http://www.hello.net
https://www.hello.net
http://domain.hello.net
https://solutions.hello.net
hello.net
www.hello.net
However it still matches this which it should not:
hello.net.domain.com
You can see it here:
https://regex101.com/r/fBH112/1
I am basically trying to check if a URL is part of hello.net, so hello.net and any subdomains such as sub.hello.net should all match.
It should also match hello.net/bye, so anything after hello.net is irrelevant.
You may fix your pattern by adding (?:\/.*)?$ at the end:
(^|^[^:]+:\/\/|[^.]+\.)hello\.net(?:\/.*)?$
See the regex demo. The (?:\/.*)?$ matches an optional sequence of / and any 0 or more chars and then the end of string.
You might consider a "cleaner" pattern like
^(?:\w+:\/\/)?(?:[^\/.]+\.)?hello\.net(?:\/.*)?$
See the regex demo. Details:
^ - start of string
(?:\w+:\/\/)? - an optional occurrence of 1+ word chars, and then a :// char sequence
(?:[^\/.]+\.)? - an optional occurrence of any 1 or more chars other than / and . and then .
hello\.net - hello.net
(?:\/.*)?$ - an optional occurrence of / and then any 0+ chars and then end of string
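A short Python sketch of the "cleaner" pattern, run against the samples from the question. The helper name is_hello_net is an assumption made for illustration.

```python
import re

# Optional scheme, at most one subdomain label, then hello.net, then an
# optional path; the end anchor rejects trailing domains.
HELLO_PATTERN = re.compile(r"^(?:\w+:\/\/)?(?:[^\/.]+\.)?hello\.net(?:\/.*)?$")


def is_hello_net(url: str) -> bool:
    """Return True when the URL belongs to hello.net (hypothetical helper)."""
    return HELLO_PATTERN.match(url) is not None


for url in [
    "http://hello.net",
    "https://solutions.hello.net",
    "www.hello.net",
    "hello.net/bye",
    "hello.net.domain.com",
]:
    print(url, "->", is_hello_net(url))
```

Note that the optional group allows a single subdomain label only, so a host such as a.b.hello.net would not match this pattern as written.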

Validate URL domain using regex

I'm trying to validate a URL in multiple formats.
EG:
https://www.google.com OK
http://www.google.com OK
www.google.com OK
htt://www.google.com
://www.google.com
https://google.com OK
http\\www.google.com
http:\\www.google.com
http:\\www.google.com
http://computerName/abc/MenuItems.aspx OK
https://computerName/abc/MenuItems.aspx OK
http://www.a.com/abc/ms.aspx?Id=13&(Not.Licensed.For.Production)= OK
http://www.a.com/abc/ms.aspx?Id=13 OK
I'm using this regex
^(?:https?:\/\/(?:www\.)?|www\.)[a-z0-9]+(?:[a-zA-Z0-9_\-/\.]+)*(?::[0-9]{1,5})?(?:\/[a-z0-9]+)*(?:\.[a-z]{2,5})?$
But the last two items don't match. How can I also validate URLs without www and without, e.g., .com?
I tried to remove the part that I think validates the .com, but without success.
It's almost working, but I found new cases (the last two) that don't work.
Here's my sample https://regex101.com/r/w3dQSl/5
For your example links, you could use
^https?:\/\/[a-z0-9]+(?:[-.][a-z0-9]+)*(?::[0-9]{1,5})?(?:\/[^\/\r\n]+)*\.[a-z]{2,5}(?:[?#]\S*)?$
The pattern will match:
^ Start of string
https?:\/\/ Match the protocol with optional s and ://
[a-z0-9]+ Match 1+ times any of the listed
(?:[-.][a-z0-9]+)* Repeat 0+ times any of the listed preceded by a - or .
(?::[0-9]{1,5})? Optionally match : and 1-5 digits
(?:\/[^\/\r\n]+)* Repeat 0+ times / and any char except /
\.[a-z]{2,5} Match a . and 2-5 times a char a-z
(?:[?#]\S*)? Optionally match either ? or # and 0+ times any non whitespace char
$ End of string
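The pattern can be tried in Python as sketched below. Two assumptions are made here: the re.IGNORECASE flag is used because the character classes only list lowercase letters while the samples contain computerName and MenuItems, and the helper name matches_link is made up for the example.

```python
import re

# Pattern from the answer, compiled case-insensitively (an assumption, see above).
LINK_PATTERN = re.compile(
    r"^https?:\/\/[a-z0-9]+(?:[-.][a-z0-9]+)*(?::[0-9]{1,5})?"
    r"(?:\/[^\/\r\n]+)*\.[a-z]{2,5}(?:[?#]\S*)?$",
    re.IGNORECASE,
)


def matches_link(url: str) -> bool:
    """Return True when the whole string matches LINK_PATTERN."""
    return LINK_PATTERN.match(url) is not None


for url in [
    "https://www.google.com",
    "http://computerName/abc/MenuItems.aspx",
    "http://www.a.com/abc/ms.aspx?Id=13&(Not.Licensed.For.Production)=",
    "htt://www.google.com",
    r"http:\\www.google.com",
    "www.google.com",
]:
    print(url, "->", matches_link(url))
```

Since the pattern starts with a mandatory https?:\/\/, the bare www.google.com is rejected here even though the question marks it as OK.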
You ought to think about which URLs you try to match, but for your examples, this does the job and is also simpler than the initial regex you provided:
^(https?://)?(www\.)?[a-zA-Z0-9-]+[a-zA-Z0-9_\-/\.\?=\&\(\)]+$
Optionally you can make some of the groups non-capturing by adding ?: after the opening parentheses.
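For illustration, this is the same pattern with both groups made non-capturing, checked in Python against the samples from the question. The helper name looks_valid is an assumption.

```python
import re

# Same pattern as above, with ?: added to both groups.
SIMPLE_PATTERN = re.compile(
    r"^(?:https?://)?(?:www\.)?[a-zA-Z0-9-]+[a-zA-Z0-9_\-/\.\?=\&\(\)]+$"
)


def looks_valid(url: str) -> bool:
    """Return True when the whole string matches SIMPLE_PATTERN."""
    return SIMPLE_PATTERN.match(url) is not None


ok_samples = [
    "https://www.google.com",
    "www.google.com",
    "https://google.com",
    "http://computerName/abc/MenuItems.aspx",
    "http://www.a.com/abc/ms.aspx?Id=13&(Not.Licensed.For.Production)=",
]
bad_samples = [
    "htt://www.google.com",
    "://www.google.com",
    r"http\\www.google.com",
    r"http:\\www.google.com",
]

for url in ok_samples + bad_samples:
    print(url, "->", looks_valid(url))
```

The pattern is permissive by design, so it is best read as a filter for these specific samples rather than a general URL validator.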

Regular expression for relative URLs (segment after the domain name) [& no querystring]

The relative URLs that I need to match are as follows:
/
/asdf
/asdf.php
/sdfsf/
/asdfsdf/asdf
/asdfsdf/s-_df.jpg
/
/asdf#
/asdf.php#
/sdfsf/
/asdfsdf/asdf#
/asdfsdf/s-_df.jpg#
I have tried a number of patterns but seem to be hitting the wall -
https://regex101.com/r/GnK43b/4
https://regex101.com/r/GnK43b/1
In the outcome, I need regex groups such that I get the segments.
For example:
/sdfsf/ => Group 1: sdfsf
/asdfsdf/asdf => Group 1: asdfsdf; Group 2: asdf
It's not very clear how exactly you expect the results, but let's imagine you want to get these matches:
URL                    GROUP1     GROUP2
----------------------------------------------------------
/
/asdf                  asdf
/asdf.php              asdf.php
/sdfsf/                sdfsf
/asdfsdf/asdf          asdfsdf    asdf
/asdfsdf/s-_df.jpg     asdfsdf    s-_df.jpg
/asdfsdf/s-_df.jpg#    asdfsdf    s-_df.jpg
then you could use this regex:
/^/(?:([0-9a-zA-Z._]+)#?)(?:/([0-9a-zA-Z._]+#?)?)?$/gmU
https://regex101.com/r/GUTGn4/2
To match all those values, you could match the leading forward slash, followed by 0+ repetitions of a group that matches 1+ of the characters you allow and then a forward slash.
At the end, match 0+ of the characters listed in the second character class and an optional #:
^/(?:[0-9a-zA-Z_]+/)*[0-9a-zA-Z_.-]*#?$
^/ Start of string and forward slash
(?:[0-9a-zA-Z_]+/)* Match 0+ times any of the listed and forward slash
[0-9a-zA-Z_.-]* Match 0+ times any of the listed
#? Match optional #
$ End of string
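This pattern validates the path but has no capture groups, so one way to also get the segments is to split the path after it has matched. The sketch below does that in Python; the split step and the helper name path_segments are assumptions added for illustration and are not part of the answer.

```python
import re

# Pattern from the answer above: validates the relative URL as a whole.
PATH_PATTERN = re.compile(r"^/(?:[0-9a-zA-Z_]+/)*[0-9a-zA-Z_.-]*#?$")


def path_segments(path: str):
    """Return the list of segments for a valid path, or None if it is invalid."""
    if PATH_PATTERN.match(path) is None:
        return None
    # Drop the optional trailing '#', then keep the non-empty pieces.
    return [part for part in path.rstrip("#").split("/") if part]


for path in ["/", "/asdf.php#", "/sdfsf/", "/asdfsdf/s-_df.jpg#", "/asdf?x=1"]:
    print(path, "->", path_segments(path))
```

Splitting after validation avoids having to guess the maximum number of groups in advance.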

Regex match if certain string is contained after last occurrence of specific character

For example, I want to check if the web url contains 'foo' after last slash, and match the entire url. So the following url should be a match:
https://www.facebook.com/messages/new/foobar
https://www.facebook.com/messages/t/barfoo
https://www.facebook.com/bfooar
https://foobar.com
https://foobar.com/foo
But the following shouldn't:
https://random.com/random
https://foobar.com/something
https://foobar.com/foo/bar
My approach is ((\\.*)*\\.*foo.*), but it doesn't seem to work for any URL that contains foo before the last slash. Is this pattern even doable in regex, or do I have to use something like split('\') in the code to achieve what I want?
Thanks
You can use this regex:
^.*/[^/]*foo[^/]*$
Breakup:
^ - Start
.* - Match 0 or more characters (greedy)
/ - Match a /
[^/]* - Match 0 or more non-/ characters
foo - match foo
[^/]* - Match 0 or more non-/ characters
$ - End
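The breakup above can be checked in Python with the URLs from the question. The sketch also includes a split-based alternative, since the question asks about one; both helper names are assumptions made for illustration.

```python
import re

# Pattern from the answer: 'foo' must appear after the last slash.
FOO_PATTERN = re.compile(r"^.*/[^/]*foo[^/]*$")


def foo_after_last_slash(url: str) -> bool:
    """Regex version: True when 'foo' occurs after the last '/'."""
    return FOO_PATTERN.match(url) is not None


def foo_after_last_slash_split(url: str) -> bool:
    """Split version: take the text after the last '/' and look for 'foo'."""
    return "foo" in url.rsplit("/", 1)[-1]


should_match = [
    "https://www.facebook.com/messages/new/foobar",
    "https://www.facebook.com/messages/t/barfoo",
    "https://www.facebook.com/bfooar",
    "https://foobar.com",
    "https://foobar.com/foo",
]
should_not_match = [
    "https://random.com/random",
    "https://foobar.com/something",
    "https://foobar.com/foo/bar",
]

for url in should_match + should_not_match:
    print(url, foo_after_last_slash(url), foo_after_last_slash_split(url))
```

The two versions agree on these samples; they differ only for strings with no slash at all, which the regex rejects because it requires a /.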