Gruber URL Regex tweak to capture "domain.com"

I found an updated version of John Gruber's regex for url matching in this post by user GianPac, which states it's been adapted to recognize url without protocol or the www part:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/?)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))
Whilst this works in most cases, I found it does not match "google.com". It does match "google.comm" and "google.co.uk", so this must be a small oversight.
The trouble is, I literally hate regex. It's the bane of my life. I just want to try and tweak this one more time to allow for "google.com" - can anyone throw me a pointer? I (think) it's something to do with this part of the code:
+[.][a-z]{2,4}/?)
?

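Change it from {2,4} to {1,4} and it will match. The two groups that follow the domain part each need at least one more character. With {2,4}, the extension must consume at least "co", which leaves only "m" for two groups, so "google.com" fails; with {1,4} it can stop at "c" and leave "o" and "m" for them.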
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{1,4}/?)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))
It's still completely incomprehensible though, and I'm not sure I'd trust a regex URL checker that doesn't match google.com to begin with! Most languages have something built in for parsing URLs; that's a better option if possible anyway.
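As a rough sketch of the built-in route, Python's urllib.parse can pull the host out of a candidate string without any regex. The helper name extract_host and the rule of prefixing "//" to scheme-less input are assumptions made for illustration; note that this only parses the string and does not check that the host actually exists.

```python
from urllib.parse import urlsplit


def extract_host(candidate):
    """Return the lowercased host of a URL-like string, or None if there is none."""
    # urlsplit only treats text as a host when it follows "//",
    # so bare input such as "google.com" gets a "//" prefix first.
    if "://" not in candidate and not candidate.startswith("//"):
        candidate = "//" + candidate
    return urlsplit(candidate).hostname


for text in ("google.com", "http://google.co.uk/search?q=regex", "www.google.com/maps"):
    print(text, "->", extract_host(text))
```

Once the host is isolated this way, any further check (an allow-list, a suffix test) can work on a plain string instead of one large pattern.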

Related

Regex for youtube URL

I am using the following regex for validating youtube video share url's.
var valid = /^(http\:\/\/)?(youtube\.com|youtu\.be)+$/;
alert(valid.test(url));
return false;
I want the regex to support the following URL formats:
http://youtu.be/cCnrX1w5luM
http://youtube/cCnrX1w5luM
www.youtube.com/cCnrX1w5luM
youtube/cCnrX1w5luM
youtu.be/cCnrX1w5luM
I tried different regex but I am not getting a suitable one for share links. Can anyone help me to solve this.
Here's a regex I use to match and capture the important bits of YouTube URLs with video codes:
^((?:https?:)?\/\/)?((?:www|m)\.)?((?:youtube(-nocookie)?\.com|youtu.be))(\/(?:[\w\-]+\?v=|embed\/|v\/)?)([\w\-]+)(\S+)?$
Works with the following URLs:
https://www.youtube.com/watch?v=DFYRQ_zQ-gk&feature=featured
https://www.youtube.com/watch?v=DFYRQ_zQ-gk
http://www.youtube.com/watch?v=DFYRQ_zQ-gk
//www.youtube.com/watch?v=DFYRQ_zQ-gk
www.youtube.com/watch?v=DFYRQ_zQ-gk
https://youtube.com/watch?v=DFYRQ_zQ-gk
http://youtube.com/watch?v=DFYRQ_zQ-gk
//youtube.com/watch?v=DFYRQ_zQ-gk
youtube.com/watch?v=DFYRQ_zQ-gk
https://m.youtube.com/watch?v=DFYRQ_zQ-gk
http://m.youtube.com/watch?v=DFYRQ_zQ-gk
//m.youtube.com/watch?v=DFYRQ_zQ-gk
m.youtube.com/watch?v=DFYRQ_zQ-gk
https://www.youtube.com/v/DFYRQ_zQ-gk?fs=1&hl=en_US
http://www.youtube.com/v/DFYRQ_zQ-gk?fs=1&hl=en_US
//www.youtube.com/v/DFYRQ_zQ-gk?fs=1&hl=en_US
www.youtube.com/v/DFYRQ_zQ-gk?fs=1&hl=en_US
youtube.com/v/DFYRQ_zQ-gk?fs=1&hl=en_US
https://www.youtube.com/embed/DFYRQ_zQ-gk?autoplay=1
https://www.youtube.com/embed/DFYRQ_zQ-gk
http://www.youtube.com/embed/DFYRQ_zQ-gk
//www.youtube.com/embed/DFYRQ_zQ-gk
www.youtube.com/embed/DFYRQ_zQ-gk
https://youtube.com/embed/DFYRQ_zQ-gk
http://youtube.com/embed/DFYRQ_zQ-gk
//youtube.com/embed/DFYRQ_zQ-gk
youtube.com/embed/DFYRQ_zQ-gk
https://www.youtube-nocookie.com/embed/DFYRQ_zQ-gk?autoplay=1
https://www.youtube-nocookie.com/embed/DFYRQ_zQ-gk
http://www.youtube-nocookie.com/embed/DFYRQ_zQ-gk
//www.youtube-nocookie.com/embed/DFYRQ_zQ-gk
www.youtube-nocookie.com/embed/DFYRQ_zQ-gk
https://youtube-nocookie.com/embed/DFYRQ_zQ-gk
http://youtube-nocookie.com/embed/DFYRQ_zQ-gk
//youtube-nocookie.com/embed/DFYRQ_zQ-gk
youtube-nocookie.com/embed/DFYRQ_zQ-gk
https://youtu.be/DFYRQ_zQ-gk?t=120
https://youtu.be/DFYRQ_zQ-gk
http://youtu.be/DFYRQ_zQ-gk
//youtu.be/DFYRQ_zQ-gk
youtu.be/DFYRQ_zQ-gk
https://www.youtube.com/HamdiKickProduction?v=DFYRQ_zQ-gk
The captured groups are:
protocol
subdomain
domain
path
video code
query string
https://regex101.com/r/vHEc61/1
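A minimal sketch of using this pattern from Python to pull out the video code. The function name video_code is hypothetical. One thing to watch: as written, "(-nocookie)?" is itself a capturing group, so the pattern has seven groups and the video code ends up in group 6.

```python
import re

# Pattern copied from the answer above. "(-nocookie)?" is a capturing group,
# which shifts the path to group 5 and the video code to group 6.
YOUTUBE_RE = re.compile(
    r"^((?:https?:)?\/\/)?((?:www|m)\.)?"
    r"((?:youtube(-nocookie)?\.com|youtu.be))"
    r"(\/(?:[\w\-]+\?v=|embed\/|v\/)?)([\w\-]+)(\S+)?$"
)


def video_code(url):
    """Return the video code from a YouTube URL, or None if it does not match."""
    match = YOUTUBE_RE.match(url)
    return match.group(6) if match else None


print(video_code("https://www.youtube.com/watch?v=DFYRQ_zQ-gk&feature=featured"))
print(video_code("youtu.be/DFYRQ_zQ-gk"))
```

Changing "(-nocookie)?" to "(?:-nocookie)?" would make the group numbers line up with the six-item list above, if that is preferred.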
You're missing www in your regex
The second \. should be optional if you want to match both youtu.be and youtube (but I didn't change this, since plain youtube isn't actually a valid domain; see the note below)
+ in your regex allows for one or more of (youtube\.com|youtu\.be), not one or more wild-cards.
You need to use a . to indicate a wild-card, and + to indicate you want one or more of them.
Try:
^(https?\:\/\/)?(www\.youtube\.com|youtu\.be)\/.+$
If you want it to match URLs with or without the www., just make it optional:
^(https?\:\/\/)?((www\.)?youtube\.com|youtu\.be)\/.+$
Invalid alternatives:
If you want www.youtu.be/... to also match (at the time of writing, this doesn't appear to be a valid URL format), put the optional www. outside the brackets:
^(https?\:\/\/)?(www\.)?(youtube\.com|youtu\.be)\/.+$
youtube/cCnrX1w5luM (with or without http://) isn't a valid URL, but the question explicitly mentions that the regex should support that. To include this, replace youtu\.be with youtu\.?be in any regex above.
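To see how the variants in this answer differ, here is a small Python sketch that runs the optional-www pattern and its youtu\.?be variant against the five inputs from the question. The names STRICT, LENIENT, SAMPLES and matches are made up for this example.

```python
import re

# Optional "www." on youtube.com only, as in the second pattern above.
STRICT = re.compile(r"^(https?\:\/\/)?((www\.)?youtube\.com|youtu\.be)\/.+$")
# Same pattern with youtu\.?be, which also lets the bare word "youtube" through.
LENIENT = re.compile(r"^(https?\:\/\/)?((www\.)?youtube\.com|youtu\.?be)\/.+$")

SAMPLES = [
    "http://youtu.be/cCnrX1w5luM",
    "http://youtube/cCnrX1w5luM",
    "www.youtube.com/cCnrX1w5luM",
    "youtube/cCnrX1w5luM",
    "youtu.be/cCnrX1w5luM",
]


def matches(pattern, urls):
    """Return one boolean per URL saying whether the pattern accepts it."""
    return [bool(pattern.match(url)) for url in urls]


for url, strict, lenient in zip(SAMPLES, matches(STRICT, SAMPLES), matches(LENIENT, SAMPLES)):
    print(f"{url:32} strict={strict} lenient={lenient}")
```

The strict pattern rejects only the two "youtube/..." inputs, which are the ones the answer describes as not being valid URLs.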
I know I'm like 2 years late to the party, but I needed to write something up anyway, and this seems to fit every test case that I can throw at it. You should be able to reference the first capture group ($1) to get the ID. It matches http, https, www and non-www, youtube.com, youtu.be, /watch? and /watch.php? on youtube.com (youtu.be does not use these), and it supports matching even when there are other variables in the URL string (?t= for time, ?list= for playlists, etc.).
(?:https?:\/\/)?(?:youtu\.be\/|(?:www\.|m\.)?youtube\.com\/(?:watch|v|embed)(?:\.php)?(?:\?.*v=|\/))([a-zA-Z0-9\_-]+)
Format for YouTube videos has changed. This regex works for all cases:
^(http(s)??\:\/\/)?(www\.)?((youtube\.com\/watch\?v=)|(youtu.be\/))([a-zA-Z0-9\-_])+
Based on so many other regex; this is the best I have got:
((http(s)?:\/\/)?)(www\.)?((youtube\.com\/)|(youtu.be\/))[\S]+
Test:
http://regexr.com/3bga2
Try this:
((http://)?)(www\.)?((youtube\.com/)|(youtu\.be)|(youtube)).+
http://regexr.com?36o7a
I took one of the answers from here and added support for a few edge cases that I noticed in my dataset. This should work for pretty much any valid url.
^(?:https?:)?(?:\/\/)?(?:youtu\.be\/|(?:www\.|m\.)?youtube\.com\/(?:watch|v|embed)(?:\.php)?(?:\?.*v=|\/))([a-zA-Z0-9\_-]{7,15})(?:[\?&][a-zA-Z0-9\_-]+=[a-zA-Z0-9\_-]+)*(?:[&\/\#].*)?$
I tried this one and it works fine for me.
(?:http(?:s)?:\/\/)?(?:www\.)?(?:youtu\.be\/|youtube\.com\/(?:(?:watch)?\?(?:.*&)?v(?:i)?=|(?:embed|v|vi|user)\/))([^\?&\"'<> #]+)
You can check here https://regex101.com/r/Kvk0nB/1
https://regexr.com/62kgd
^((http|https)\:\/\/)?(www\.youtube\.com|youtu\.?be)\/((watch\?v=)?([a-zA-Z0-9]{11}))(&.*)*$
https://www.youtube.com/watch?v=YPz9zqakRbk
https://www.youtube.com/watch?v=YPz9zqakRbk&t=11
http://youtu.be/cCnrX1w5luM&y=12
http://youtu.be/cCnrX1w5luM
http://youtube/cCnrXswsluM
www.youtube.com/cCnrX1w5luM
youtube/cCnrX1w5luM
Check this pattern instead:
r'(?i)(http.//|https.//)*[A-Za-z0-9._%+-]+\.\w+'

RegEx match all website links except those containing admin

I'm setting up URL Rewrite on IIS and I need to match the following URLs using regex.
http://sub.mysite.com
sub.mysite.com
sub.mysite.com/
sub.mysite.com/Site1
sub.mysite.com/Site1/admin
but not:
sub.mysite.com/admin
sub.mysite.com/admin/somethingelse
sub.mysite.com/admin/admin
The site itself (sub.mysite.com) should not be "hardcoded" in the expression. Instead, it should be matched by something like .*.
I'm really blank on this one. I did find solutions to match the different URLs, but once I try to combine them, either none of them match or all of them do.
I hope someone can help me.
For your specific case, assuming you are matching the part after the domain (REQUEST_URI):
(?!/admin).*
(?!...) is a negative lookahead. I am not sure whether the IIS URL Rewrite engine supports it. If it does not, the alternative is to take the complementary approach:
Or, as @kirilloid said, just match /admin/? and discard (pay attention to slashes).
BTW. if you want to quickly test RegExps with a "visual" feedback, I highly recommend http://gskinner.com/RegExr/
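A small sketch of the lookahead idea, using Python's re module purely for illustration (whether the IIS engine accepts the same syntax is left open, as above). It assumes the input is only the path part of the request. Unlike the one-liner above, this variant is anchored and only rejects "/admin" as a whole first segment; the names NOT_ADMIN and is_allowed are made up here.

```python
import re

# Anchored negative lookahead: refuse paths whose first segment is exactly
# "admin", but still allow e.g. "/administrator" or "/Site1/admin".
NOT_ADMIN = re.compile(r"^(?!/admin(?:/|$)).*$")


def is_allowed(path):
    """Return True when the path does not start with the /admin segment."""
    return NOT_ADMIN.match(path) is not None


for path in ("", "/", "/Site1", "/Site1/admin", "/admin", "/admin/somethingelse"):
    print(repr(path), is_allowed(path))
```

Anchoring with ^ matters: without it, a search could simply start matching one character later and accept "/admin" after all.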
([A-Za-z0-9]+.)+.com(?!/admin)/?([A-Za-z0-9]+/?)*
this should do the trick

Using RegEx to match domain.com and www.domain.com but NO OTHER subdomains?

Sorry if this has been asked elsewhere, I've been looking and can't find it for the life of me. I am attempting to tackle regular expressions; I've ALWAYS had problems with the more advanced scenarios... well, others find them quite easy, so maybe there's something wrong with me.
Anyway, I am attempting to write a RegEx that matches www.domain.com OR domain.com but NO OTHER SUBDOMAINS or anything. The only two strings I want to pass the regex are "domain.com" and "www.domain.com" and I haven't been able to find exactly what I am looking for other than including all subdomain matching, which I find easy.
The closest I have come is this: regex for matching something if it is not preceded by something else. In that case, though, it's failing only for one preceding string, whereas I want it to succeed for only one preceding string/subdomain. Note that "domain.com" will always be static, meaning it will always be that exact string "domain.com" and not various domains.
Thanks so much for shedding light on this!
Tyler
Just put the optional part in a non-capturing group, and make it optional.
/^(?:www\.)?example\.com$/
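The same idea as a short Python sketch, using the literal "domain.com" from the question. The dots are escaped so that "." only matches a real dot, and re.fullmatch is used so nothing may come before or after. The names DOMAIN_RE and is_main_domain are hypothetical.

```python
import re

# Optional "www." prefix in a non-capturing group, then the fixed domain.
DOMAIN_RE = re.compile(r"(?:www\.)?domain\.com")


def is_main_domain(host):
    """Return True only for "domain.com" and "www.domain.com"."""
    return DOMAIN_RE.fullmatch(host) is not None


for host in ("domain.com", "www.domain.com", "mail.domain.com", "domainxcom"):
    print(host, is_main_domain(host))
```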

Regex for checking a body of text for a URL?

I have a regex pattern for URLs that I use to check for links in a body of text. The only problem is that the pattern will match this link
stackoverflow.com
And this sentence
I'm a sentence.Next Sentence.
This makes sense, because my pattern doesn't strictly check the domain extension (.com, .co.uk, .com.au, etc.).
I want it to match stackoverflow.com and not the latter.
As I'm no Regex expert, does anyone know of any good Regex patterns for checking for all types of URLs in a body of text, while not matching sentences like the one above?
If I have to strictly check the domain extension, I suppose I'll have to settle.
Here's my pattern, but I don't think it will help.
(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?
I would definitely suggest finding a working regex that someone else has made (which would probably include a strong check on the domain extension), but here is one possible way to just modify your existing regex.
It requires the assumption that links usually do not mix case in the domain extension: you might see .COM or .com, but probably not .Com. If you only match domain extensions that don't mix case, you avoid matching most sentences.
In the middle of your regex you have [\w]{2,4}, try changing this to ([A-Z]{2,4}|[a-z]{2,4}) (or (?:[A-Z]{2,4}|[a-z]{2,4}) if you don't want a new captured group).
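To illustrate the effect, here is a sketch in Python with a deliberately simplified stand-in for the long pattern in the question: a host-like word, a dot, and then the case-consistent extension group suggested above. The pattern and the names CANDIDATE_RE and find_candidates are assumptions for this example only.

```python
import re

# Simplified stand-in: word characters, a dot, then 2-4 letters that are
# either all upper case or all lower case (no re.IGNORECASE on purpose).
CANDIDATE_RE = re.compile(r"\b[\w-]+\.(?:[A-Z]{2,4}|[a-z]{2,4})\b")


def find_candidates(text):
    """Return the substrings of text that look like bare domain names."""
    return CANDIDATE_RE.findall(text)


print(find_candidates("I'm a sentence.Next Sentence."))
print(find_candidates("see stackoverflow.com or EXAMPLE.COM"))
```

This heuristic still matches a sentence boundary such as "sentence.next" where the following word starts in lower case, so it reduces false positives rather than removing them.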

Regex Problem (newbie)

i'm writing a little app for spam-checking and i'm having problems with a regex.
let's say i'm having this spam-url:
http://hosting.tyumen.ru/tip.html
so i want to check its url for having 2 full stops (subdomain+ending), a slash, a word, full stop and "html".
here's what i got so far:
(http://.*?\..*?..*?/.*?.html)
might look like rubbish but it works - the problem: it's really slow and freezing my app.
any hints on how to optimize it?
thx.re
The reason it's slow is that the non-greedy quantifier .*?, used this way, is prone to catastrophic backtracking.
Instead of saying "any amount of anything, but only to an extent where it doesn't conflict with later requirements", which is effectively what .*? is saying, try asking for "as much as possible, that isn't a double quote, which would terminate the href ":
\1
I also added a back-reference (\1) to your first capturing group, inside the <a>...</a>, so that you don't have to do the exact same matching all over again.
Note that this regex will be broken if, say, the a has a class name, an id, or anything else in its body. I left it like this because I wanted to give you what you asked for with as few changes as possible, and as to-the-point as possible.
(http://[\w.-]+/.+?\.html) - this may work, but perhaps for your case only.
Or possibly a faster one:
(http://[\w.-]+/[^.]+\.html)
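A quick sketch of trying the second pattern from Python. The negated class [^.]+ can only stop at a dot, so the engine has far fewer ways to split the input than with chained .*? pieces. The names SPAM_RE and looks_like_spam are made up for this example, and the second test URL is an invented non-matching sample.

```python
import re

# Host made of word characters, dots and hyphens, then a path with no dot
# in it, followed by a literal ".html".
SPAM_RE = re.compile(r"http://[\w.-]+/[^.]+\.html")


def looks_like_spam(url):
    """Return True when the URL fits the host/path/.html shape."""
    return SPAM_RE.search(url) is not None


print(looks_like_spam("http://hosting.tyumen.ru/tip.html"))
print(looks_like_spam("http://example.com/index.php"))
```

Note that this pattern does not require two full stops in the host; if that matters, the host part would need to be tightened separately.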
Since you claim to be a regexp newbie, I will offer more general advice on creating and debugging regular expressions. When they get pretty complicated, I find using Regexp Coach a must.
It's freeware and really saves a lot of headaches. Not to mention that you don't have to build and run your application every minute just to see whether the regexp works the way you wanted.
In Python, a simple way to match URLs ending in .html or .htm is to use
import re

url_re = re.compile(
    r'https?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:\S+\.html?)+',  # path ending in .html or .htm (literal dot)
    re.IGNORECASE)
which is a modified version of Django's URLField regex.
This will match any URL ending with .html or .htm, whether the host is localhost, an IP address, or a domain.
#http://[-a-zA-Z0-9]+\.[-a-zA-Z0-9]+\.[-a-zA-Z]+/\w+\.html#