MFC: How do I construct a good regular expression that validates URLs?

MFC: How do I construct a good regular expression that validates URLs? - regex

Here's the regular expression I use, and I parse it using CAtlRegExp of MFC :
(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))://)?([a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+[\.]+[a-zA-Z0-9])
It works fine except with one flaw. When URL is preceded by characters, it still accepts it as a URL.
ex inputs:
this is a link www.google.com (where I can just tokenize the spaces and validate each word)
is...www.google.com (this string still matches the RegEx above :( )
Please help...
Thanks...

Use the IgnoreCase flag instead of catering for each case.
Stick a ^ at the beginning if you want the start of the string to be the start of the URL
You're missing a lot of characters from possible, valid URLs.

You need to tell the regex to only match at the start and end of the string. I'm not sure how you do that in VC++ - in most regexs you enclose the pattern with ^ and $. The ^ says "the start of the string" and the $ says "the end of the string."
^(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))\://)?([a-zA-Z0-9]+[\\.]+[a-zA-Z0-9]+[\\.]+[a-zA-Z0-9])$
The second is matching because the string still contains a valid URL.

How about using CUrl (that is, 'C-Url', in ATL, not curl as in libcurl) which can 'parse' urls with CUrl::CrackUrl . If that function returns FALSE you assume it's not a valid URL.
That said, decomposing URL is sufficiently complex to warrant a proper parser, not a regex based decomposition. Cfr. rfc 2396 etc. for an overview on the complexities.

Start the regex with ^ to and end it with $ to have the regex match only if the entire sting matches (if that's what you want):
^(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))\://)?([a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+[\.]+[a-zA-Z0-9])$

What about this one: (((f|ht)tp://)[-a-zA-Z0-9#:%_\+.~#?&//=]+) ?

This Regular Expression has been tested to work for the following
http|https://host[:port]/[?][parameter=value]*
public static final String URL_PATTERN = "(https?|ftp)://(www\\.)?(((([a-zA-Z0-9.-]+\\.){1,}[a-zA-Z]{2,4}|localhost))|((\\d{1,3}\\.){3}(\\d{1,3})))(:(\\d+))?(/([a-zA-Z0-9-._~!$&'()*+,;=:#/]|%[0-9A-F]{2})*)?(\\?([a-zA-Z0-9-._~!$&'()*+,;=:/?#]|%[0-9A-F]{2})*)?(#([a-zA-Z0-9._-]|%[0-9A-F]{2})*)?";
PS. It also validates on localhost link.
(Thoroughly written by me :-))

Related

Fluentvalidation 6.4.1.0 support me with Incorrect regex

In my case, i want to validate for url image, some url is valid but result is wrong.
Eg: link image is "https://fuvitech.online/wpcontent/uploads/2021/02/bta16600brg.jpg" or "https://fuvitech.online/wp-content/uploads/2021/02/bta16-600brg.jpg" reponse "The image link is not in the correct format".
My code here:
RuleFor(product => product.Images)
.Length(1, 3000).WithMessage(Labels.importProduct_ExceedDescription, p => ImportHelpers.GetColumnName(typeof(ProductEntity).GetProperty(nameof(p.Images))))
.Matches(#"^(http:\/\/|https:\/\/){1}?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$").WithMessage(Labels.importProduct_UrlNotCorrect, p => ImportHelpers.GetColumnName(typeof(ProductEntity).GetProperty(nameof(p.Images))));
Please help me where the above regex is wrong. Thank you.

Try this:
NOTE the following regex pattern may trigger false positives and also may ignore valid image URLs, because it is very difficult to validate whether a given URL is valid.
^https?:\/\/(?:(?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+|[A-Za-z0-9]{2,})\.)+[A-Za-z]{2,}(?::\d+)?\/(?:(?:[A-Za-z0-9]+(?:(?:-[A-Za-z0-9]+)+)?\/)+|)[\w-]+\.(?:jpg|jpeg|png)$
Explanation
^ the start of a line/string.
https?:\/\/ match http with an optional letter s, followed by ://.
(?:(?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+|[A-Za-z0-9]{2,})\.)+ This will match things like foo-foo.bar-bar., foo.bar-bar. and foo.
[A-Za-z]{2,} this will match the TLD part, e.g., com, org, this part with the previous part will match things like foo-foo.bar-bar.com, foo.bar-bar.com or foo.com.
(?::\d+)? optional group of (a colon : followed by one or more digits) for port part.
\/(?:(?:[A-Za-z0-9]+(?:(?:-[A-Za-z0-9]+)+)?\/)+|) this check for two things, the first one is /uploads/public-images/, /uploads/images/, the second one is a single /.
[\w-]+ this part for the file name, e.g., bta16-600brg.
\.(?:jpg|jpeg|png) you can add here multiple extensions, you can allow uppercase letters by using for example, [Jj][Pp][Gg] for jpg.
$ the end of the line/string.
See regex demo

Thanks #SaSkY answer my question.
I found my mistake.
This source [.[a-z]{2,5}] only allows domain extensions from 2-5 characters. Example [.com] is valid. But in my case [.online] was not valid.
I changed to [.[a-z]{1,10}].

How can I write a opposite regex to this regex?

this is a regex of a proxy, if I add this to my proxy:
(.*\.|)(abc|google)\.(org|net)
my proxy will not transmit the abc.org, abc.net, google.org, google.net's traffic.
how can I write a regex opposite to this regex? I mean only transmit the abc.org, abc.net, google.org, google.net's traffic.
EDIT-01
My thought is just want to transmit abc.org or www.abc.org, how can I do with that?

Try this:
^(?!(www\.)?(?:abc|google)\.(?:net|org)).*
Demo: https://regex101.com/r/WOnFx8/3/
I used ?! to reverse the matching of your regex. This way, it will match any domain except these specific 4 domains.
Another way to do it is by using this code to include anything before the desired domains:
^(?!(.*\.|)(?:abc|google)\.(?:net|org)).*
demo: https://regex101.com/r/WOnFx8/4/

Your regex you write
(.*\.|)(abc|google)\.(org|net)
mean any string is one of abc.org, gooogle.org, abc.net, google.net, with optional prefix string ends with dot (.)
Like: test.google.org, sub.abc.net,...
I think you want to match string like test.yahoo.com, but not test.google.org. If you can use negative look ahead, this is the answer:
^(.*\.|)(?!(abc|google)\.(org|net))\w+\.\w+$
Explain:
^ and $ to be sure your match is entire url string
Negative look ahead is to check the url is not something like abc.org, abc.net, google.org, google.net
And \w+\.\w+ to check the remain string is kind of URL type (something likes yahoo.com, etc...)

Im going to assume you have lookaheads, if so then you can simply use -
(^.*?\.(?!(abc|google))\w+\.(?:org|net)$)
Demo - https://regex101.com/r/5eC41R/3
What this does is -
Looks for the start of the url (till the first .)
Checks that next part is not abc or google
looks for the next section (till the next .)
Looks for a closing org or net
Note that since it is a lookahead it will be slow compared to other regex matches

Regular expression to match only domain from URL

I'm struggling with forming a regex that would match:
Just domain in case of URL
Whole string in case of no URL
Acceptance test (regex should match bold text):
http://mozart.co.uk
https://avocado.si/hmm
http://www.qwe123qwe.com
Starbucks
Benchmark 123
So far I've come up with this:
([^\/\/]+)(?:,|$)
It works fine, but not for URLs with trailing slash on the end. How can I modify the expression to include full path (everything on the right side of http(s)://) as well? Thank you.

This regex will match them if it starts with http:// or https:// until the next slash. If it doesn't start with http:// nor https:// then it will match the whole string. Close enough?
(?:^https?:\/\/([^\/]+)(?:[\/,]|$)|^(.*)$)
I should note that most languages have functions built in to properly parse URLs and these are preferable.
You should note that I've got 2 sets of capturing parentheses, so depending on your language that may be significant.

Maybe that ^(http[s]?:\/\/)?(.*)$. Play here: https://regex101.com/r/iZ2vL4/1

This will have Matching groups, the domain you want will be in the 4th matching group.
/^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{1,3}(\.[^:\/\s\.]{1,2})?(:\d+)?)($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$/mg
Regex101.com workbench to check out your URLs just paste them in the "TEST STRING" Textbox to test it out.
Don't recall where I got this... so I don't know who to credit. But it's pretty slick!

Exclude part of the string with regex

I'm quite bad with regex, and I'm looking to match a criteria.
This is a regex expression that should go emmbed into the url for a firewall, so It will block any url that is not like the list at the end.
This is what Im currently using but its not working:
http://www.youtube.com/(*.*)list=UUFwtOm4N5djdcuTAlNIWJaQ
This is the example url (to be blocked):
http://www.youtube.com/watch?NR=1&feature=fvwp&v=P1b5VY_Bp_o&list=UUFwtOm4N5djdcuTAlNIWJaQ
I'm trying to make a regex that will Success fully match when NR=1 or feature=fvwp
are NOT present, I asume I can do it like this: (?!^feature=fvwp$) but the v= and list=UUFwtOm4N5djdcuTAlNIWJaQ are allowed.
Also the v= should be limited to any character (uppercase and lowercase) and 11 length, I assume its: /^[a-z0-9]{11}$/
How can I build all that together and make it work so it would allow and match only on this urls excluding from allowing the previous criterias that I explained:
http://www.youtube.com/watch?v=4eK_RWpTgcc&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ
http://www.youtube.com/watch?v=TLRl85TJwZM&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ
http://www.youtube.com/watch?v=QEV9yqrpxkc&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ

Can you block based on matching by regex? If so, just use
(.*)www\.youtube\.com/watch\?NR=1&feature=fvwp and block whatever matches that.

Regular Expression for some email rules

I was using a regular expression for email formats which I thought was ok but the customer is complaining that the expression is too strict. So they have come back with the following requirement:
The email must contain an "#" symbol and end with either .xx or .xxx ie.(.nl or .com). They are happy with this to pass validation. I have started the expression to see if the string contains an "#" symbol as below
^(?=.*[#])
this seems to work but how do I add the last requirement (must end with .xx or .xxx)?

A regex simply enforcing your two requirements is:
^.+#.+\.[a-zA-Z]{2,3}$
However, there are email validation libraries for most languages that will generally work better than a regex.

I always use this for emails
^([a-zA-Z0-9_\-\.]+)#((\[[0-9]{1,3}" +
#"\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\" +
#".)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
Try http://www.ultrapico.com/Expresso.htm as well!

It is not possible to validate every E-Mail Adress with RegEx but for your requirements this simple regex works. It is neither complete nor does it in any way check for errors but it exactly meets the specs:
[^#]+#.+\.\w{2,3}$
Explanation:
[^#]+: Match one or more characters that are not #
#: Match the #
.+: Match one or more of any character
\.: Match a .
\w{2,3}: Match 2 or 3 word-characters (a-zA-Z)
$: End of string

Try this :
([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4})\be(\w*)s\b
A good tool to test our regular expression :
http://gskinner.com/RegExr/

You could use
[#].+\.[a-z0-9]{2,3}$

This should work:
^[^#\r\n\s]+[^.#]#[^.#][^#\r\n\s]+\.(\w){2,}$
I tested it against these invalid emails:
#exampleexample#domaincom.com
example#domaincom
exampledomain.com
exampledomain#.com
exampledomain.#com
example.domain#.#com
e.x+a.1m.5e#em.a.i.l.c.o
some-user#internal-email.company.c
some-user#internal-ema#il.company.co
some-user##internal-email.company.co
#test.com
test#asdaf
test#.com
test.#com.co
And these valid emails:
example#domain.com
e.x+a.1m.5e#em.a.i.l.c.om
some-user#internal-email.company.co
edit
This one appears to validate all of the addresses from that wikipedia page, though it probably allows some invalid emails as well. The parenthesis will split it into everything before and after the #:
^([^\r\n]+)#([^\r\n]+\.?\w{2,})$
niceandsimple#example.com
very.common#example.com
a.little.lengthy.but.fine#dept.example.com
disposable.style.email.with+symbol#example.com
other.email-with-dash#example.com
user#[IPv6:2001:db8:1ff::a0b:dbd0]
"much.more unusual"#example.com
"very.unusual.#.unusual.com"#example.com
"very.(),:;<>[]\".VERY.\"very#\\ \"very\".unusual"#strange.example.com
postbox#com
admin#mailserver1
!#$%&'*+-/=?^_`{}|~#example.org
"()<>[]:,;#\\\"!#$%&'*+-/=?^_`{}| ~.a"#example.org
" "#example.org
üñîçøðé#example.com
üñîçøðé#üñîçøðé.com

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

MFC: How do I construct a good regular expression that validates URLs? - regex

Use the IgnoreCase flag instead of catering for each case. Stick a ^ at the beginning if you want the start of the string to be the start of the URL You're missing a lot of characters from possible, valid URLs.

Start the regex with ^ to and end it with $ to have the regex match only if the entire sting matches (if that's what you want): ^(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))\://)?([a-zA-Z0-9]+[\.]+[a-zA-Z0-9]+[\.]+[a-zA-Z0-9])$

What about this one: (((f|ht)tp://)[-a-zA-Z0-9#:%_\+.~#?&//=]+) ?

Related

Fluentvalidation 6.4.1.0 support me with Incorrect regex

How can I write a opposite regex to this regex?

Regular expression to match only domain from URL

Exclude part of the string with regex

Regular Expression for some email rules

Categories

Resources