How do I write a regular expression for a URL without the scheme? - regex

How can I write a RE which validates the URLs without the scheme:
Pass:
www.example.com
example.com
Fail:
http://www.example.com

^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$
string must start with an ASCII letter or number
ASCII letters, numbers, dots and dashes follow (no slashes or colons allowed)
optional: a port is allowed (":8080")
optional: anything after a slash may follow (since you said "URL")
then the end of the string
Thoughts:
no line breaks allowed
no validity or sanity checking
no support for "internationalized domain names" (IDNs)
leave off the "optional:" parts if you like, but be sure to include the final "$"
If your regex flavor supports it, you can shorten the above to:
^[A-Za-z\d][\w.-]+(:\d+)?(/.*)?$
Be aware that \w may include Unicode characters in some regex flavors. Also, \w includes the underscore, which is invalid in host names. An explicit approach like the first one would be safer.
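As a rough illustration, the explicit pattern could be exercised in Python like this. The helper name is_schemeless_url is made up for this sketch. It uses re.fullmatch so that a trailing line break is rejected, since in Python "$" on its own also matches just before a final newline.

```python
import re

# The explicit (first) pattern from the answer above, unchanged.
SCHEMELESS_URL = re.compile(r'^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$')

def is_schemeless_url(text):
    """Return True if text has the shape host[:port][/path] with no scheme.

    Hypothetical helper for illustration; it checks shape only, not validity.
    """
    return SCHEMELESS_URL.fullmatch(text) is not None

for candidate in ("www.example.com", "example.com",
                  "example.com:8080/a/b", "http://www.example.com"):
    print(candidate, is_schemeless_url(candidate))
```

The last candidate is rejected because the colon is not followed by digits, so neither the port group nor the path group can consume "://".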

If you're trying to do this for some real code, find the URL parsing library for your language and use that. If you don't want to use it, look inside to see what it does.
The thing that you are calling "resource" is known as a "scheme". It's documented in RFC 1738 which says:
[2.1] ... In general, URLs are written as follows:
<scheme>:<scheme-specific-part>
A URL contains the name of the scheme being used (<scheme>) followed
by a colon and then a string (the <scheme-specific-part>) whose
interpretation depends on the scheme.
And, later in the BNF,
scheme = 1*[ lowalpha | digit | "+" | "-" | "." ]
So, if a scheme is there, you can match it with:
/^[a-z0-9+.-]+:/i
If that matches, you have what the URL syntax considers a scheme and your validation fails. If you have strings with port numbers, like www.example.com:80, then things get messy. In practice, I haven't dealt with schemes with - or ., so you might add a real world fudge to get around that until you decide to use a proper library.
Anything beyond that, like checking for existing and reachable domains and so on, is better left to a library that's already figured it all out.
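A minimal Python sketch of the scheme check described above, mainly to show the host:port ambiguity. The function names are invented for this example. The second pattern is only one possible "real world fudge" (dropping "-" and "." from the scheme characters), not something the RFC sanctions.

```python
import re

# Scheme syntax as quoted from RFC 1738, matched case-insensitively.
SCHEME = re.compile(r'^[a-z0-9+.-]+:', re.IGNORECASE)

# Assumed fudge: disallow "-" and "." in the scheme so dotted hosts with ports pass.
SCHEME_FUDGED = re.compile(r'^[a-z0-9+]+:', re.IGNORECASE)

def has_scheme(text):
    """True if text starts with something the URL syntax considers a scheme."""
    return SCHEME.match(text) is not None

def has_scheme_fudged(text):
    """Same check with the narrower, non-standard scheme alphabet."""
    return SCHEME_FUDGED.match(text) is not None

print(has_scheme("http://www.example.com"))      # True
print(has_scheme("www.example.com"))             # False
print(has_scheme("www.example.com:80"))          # True: host:port looks like scheme:rest
print(has_scheme_fudged("www.example.com:80"))   # False
```

Even with the fudge, a dotless host with a port such as localhost:80 still looks like a scheme, which is part of why a proper library is the safer route.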

URL syntax is quite complex, so you need to narrow it down a bit. If matching anything.ext is enough, you can use:
^[a-zA-Z0-9.]+\.[a-zA-Z]{2,4}$

My guess is
/^[\p{Alnum}-]+(\.[\p{Alnum}-]+)+$/
In more primitive RE syntax that would be
/^[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)+$/
Or even more primitive still:
/^[0-9A-Za-z-][0-9A-Za-z-]*\.[0-9A-Za-z-][0-9A-Za-z-]*(\.[0-9A-Za-z-][0-9A-Za-z-]*)*$/
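For what it's worth, Python's standard re module does not understand \p{Alnum}, so a sketch there would use the ASCII-only form. The helper name is_hostlike is an assumption of this example.

```python
import re

# ASCII-only form of the pattern above: labels separated by at least one dot.
HOSTLIKE = re.compile(r'[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)+')

def is_hostlike(text):
    """True if text is two or more dot-separated runs of letters, digits, hyphens."""
    return HOSTLIKE.fullmatch(text) is not None

for candidate in ("www.example.com", "example.com",
                  "http://www.example.com", "localhost"):
    print(candidate, is_hostlike(candidate))
```

Note that this requires at least one dot, so a bare name like localhost is rejected, and it does not stop labels from starting or ending with a hyphen.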

Thanks guys, I think I have a Python and a PHP solution. Here they are:
Python Solution:
import re
url = 'http://www.foo.com'
p = re.compile(r'^(?!https?://)[A-Za-z][A-Za-z0-9.-]+(:\d+)?(/.*)?$')
m = p.search(url)
print(m)  # m is a match object if url is valid, otherwise None
PHP Solution:
$url = 'http://www.foo.com';
preg_match('/^(?!https?:\/\/)[A-Za-z][A-Za-z0-9.-]+(:\d+)?(\/.*)?$/', $url);

Related

How to only match before the first dot?

I have the following regex.
^((?!example).)*$#Subdomain is reserved (example).
I would like to validate <subdomain>.example.org. However, since the domain name contains example, a match is occurring.
The validation should not match when the address is www.example.org
The validation should match when the address is example.example.org
Looks like you're missing the escape character from the period
^(example)\..*$
should work
It seems that a simple
^example\.
is enough. Or use string methods, depending on your language:
url.indexOf('example.') === 0
If input such as example.org is also possible, you can use
^example\..+\.
to force the appearance of two dots. But this would still fail for example.co.uk. It depends on your input.
A simple way might be to break it up into two checks:
1) ^.+\.example\.org$
2) ^(www\.)?example\.org$
If 1) matches and 2) does not, it's a subdomain of example.org; otherwise, it's not. (Technically www is a subdomain too, but you get the idea.)
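The two-check idea could look roughly like this in Python. The function name and the exact patterns are this sketch's own choices, written for the example.org domain from the question.

```python
import re

# Check 1: something, then ".example.org".
ANY_SUBDOMAIN = re.compile(r'.+\.example\.org')
# Check 2: the www host, which should not count as a "real" subdomain here.
WWW_ONLY = re.compile(r'www\.example\.org')

def is_real_subdomain(host):
    """True if host is under example.org and is not just www.example.org."""
    return (ANY_SUBDOMAIN.fullmatch(host) is not None
            and WWW_ONLY.fullmatch(host) is None)

for host in ("example.example.org", "www.example.org", "example.org"):
    print(host, is_real_subdomain(host))
```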

Adding http:// to all links without a protocol

I use VB.NET and would like to add http:// to all links that don't already start with http://, https://, ftp:// and so on.
"I want to add http here Google,
but not here Google."
It was easy when I just had the links, but I can't find a good solution for an entire string containing multiple links. I guess RegEx is the way to go, but I wouldn't even know where to start.
I can find the RegEx myself, it's the parsing and prepending I'm having problems with. Could anyone give me an example with Regex.Replace() in C# or VB.NET?
Any help appreciated!
Quote RFC 1738:
"Scheme names consist of a sequence of characters. The lower case letters "a"--"z", digits, and the characters plus ("+"), period ("."), and hyphen ("-") are allowed. For resiliency, programs interpreting URLs should treat upper case letters as equivalent to lower case in scheme names (e.g., allow "HTTP" as well as "http")."
Excellent! A regex to match:
/^[a-zA-Z0-9+.-]+:\/\//
If that matches your href string, continue on. If not, prepend "http://". Remaining sanity checks are yours unless you ask for specific details. Do note the other commenters' thoughts about relative links.
EDIT: I'm starting to suspect that you've asked the wrong question... that you perhaps don't have anything that splits the text up into the individual tokens you need to handle it. See Looking for C# HTML parser
EDIT: As a blind try at ignoring all and just attacking the text, using case insensitive matching,
/(<a +href *= *")(.*?)(" *>)/
If the second back-reference matches /^[a-zA-Z0-9+.-]+:\/\//, do nothing. If it does not match, replace it with
$1 + "http://" + $2 + $3
This isn't C# syntax, but it should translate across without too much effort.
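To make the two-step idea concrete, here is a small Python sketch (not the C# or VB.NET the question asks for, but the structure should carry over to Regex.Replace with a match evaluator). The function name add_default_scheme and the sample text are assumptions of this example.

```python
import re

# Step 1: find anchors and capture the text around and inside the href value.
ANCHOR = re.compile(r'(<a +href *= *")(.*?)(" *>)', re.IGNORECASE)
# Step 2: the scheme test from above.
HAS_SCHEME = re.compile(r'^[a-zA-Z0-9+.-]+://')

def add_default_scheme(html, default="http://"):
    """Prepend default to every <a href="..."> value that has no scheme://."""
    def fix(match):
        opening, href, closing = match.groups()
        if HAS_SCHEME.match(href):
            return match.group(0)          # already has a scheme, leave untouched
        return opening + default + href + closing
    return ANCHOR.sub(fix, html)

text = ('I want to add http here <a href="www.google.com">Google</a>, '
        'but not here <a href="https://www.google.com">Google</a>.')
print(add_default_scheme(text))
```

As with the regex itself, relative links and schemes written without "//" (mailto:, for instance) would also get the prefix, so those need separate handling.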
In PHP (should translate somewhat easily)
$text = preg_replace('/href="(?![a-z][a-z0-9+.-]*:\/\/)([^"]*)"/i', 'href="http://$1"', $text);
C#
result = Regex.Replace(input, "(href=\")(?![a-z][a-z0-9+.-]*://)", "${1}http://", RegexOptions.IgnoreCase);
If you aren't concerned with potentially messing up local links, and you can always guarantee that the strings will be fully qualified domain names, then you can simply use the contains method:
Dim myUrl as string = "someUrlString".ToLower()
If Not myUrl.Contains("http://") AndAlso Not myUrl.Contains("https://") AndAlso Not myUrl.Contains("ftp://") Then
'Execute your logic to prepend the proper protocol
myUrl = "http://" & myUrl
End If
Keep in mind this leaves a lot of holes: it does not check which protocol should be added, or whether the URL is relative.
Edit: I chose specifically not to offer a RegEx solution since this is a simple check and RegEx is a little heavy for it (IMO).
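The same regex-free check, sketched in Python for comparison. One deliberate difference from the VB snippet is that this tests the start of the string rather than using a contains-style test, so a URL that merely embeds "http://" somewhere in its query string still gets a prefix. The names below are invented for the example.

```python
# Schemes treated as "already qualified"; extend as needed.
KNOWN_PREFIXES = ("http://", "https://", "ftp://")

def ensure_scheme(url, default="http://"):
    """Return url unchanged if it starts with a known scheme, else prepend default."""
    if url.lower().startswith(KNOWN_PREFIXES):
        return url
    return default + url

print(ensure_scheme("www.google.com"))
print(ensure_scheme("HTTPS://www.google.com"))
```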

Regular expression to add base domain to directory

10 websites need to be cached. When caching: photos, css, js, etc are not displayed properly because the base domain isn't attached to the directory. I need a regex to add the base domain to the directory. examples below
base domain: http://www.example.com
the problem occurs when reading cached pages with img src="thumb/123.jpg" or src="/inc/123.js".
they would display correctly if it was img src="http://www.example.com/thumb/123.jpg" or src="http://www.example.com/inc/123.js".
regex something like: if (src=") isn't followed by the base domain then add the base domain
without knowing the language, you can use the (maybe most portable) substitute modifier:
s/^(src=")([^"]+")$/$1www\.example\.com\/$2/
This should do the following:
1. the string 'src="' (and capture it in variable $1)
2. one or more non-double-quote (") character followed by " (and capture it in variable $2)
3. Substitutes 'www.example.com/' in between the two capture groups.
Depending on the language, you can wrap this in a conditional that checks for the existence of the domain and substitutes if it isn't found.
to check for domain: /www\.example\.com/i should do.
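Since the language isn't known, here is one way the substitute-if-missing idea might look in Python. It leans on urllib.parse.urljoin to decide whether a value is already absolute, which replaces the explicit domain check above. The function name absolutize and the sample markup are assumptions of this sketch.

```python
import re
from urllib.parse import urljoin

BASE = "http://www.example.com/"   # base domain from the question

# Capture the attribute opener, its value, and the closing quote.
ATTR = re.compile(r'((?:src|href)=")([^"]*)(")', re.IGNORECASE)

def absolutize(html, base=BASE):
    """Rewrite relative src/href values so they point at base."""
    def fix(match):
        prefix, value, quote = match.groups()
        # urljoin leaves absolute URLs alone and resolves relative ones against base.
        return prefix + urljoin(base, value) + quote
    return ATTR.sub(fix, html)

page = '<img src="thumb/123.jpg"><script src="/inc/123.js"></script>'
print(absolutize(page))
```

A regex over raw HTML will miss single-quoted or unquoted attributes, so a real HTML parser is still the more robust choice for anything beyond a quick fix.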
EDIT: See comments:
For PHP, I would do this a bit differently. I would probably use simplexml. I don't think that will translate well, though, so here's a regex one...
$html = file_get_contents('/path/to/file.html');
// Match src="..." or href="..." values that do not already start with the base domain
$regex_match = '/(src="|href=")(?!http:\/\/www\.example\.com\/)\/?([^"]+")/i';
$regex_substitute = '${1}http://www.example.com/$2';
$html = preg_replace($regex_match, $regex_substitute, $html);
Note: I haven't actually run this to debug it, it's just off the cuff. I would be concerned about 3 things. First, the / characters inside the pattern have to be escaped because / is also the delimiter. I don't think you're concerned with this, though, unless VB has a similar problem. Second, if there's a chance that line breaks would get in the way, I might change the regex. Third, I added the (?!http:\/\/www\.example\.com\/) negative lookahead. This should restrict the match to any src or href that doesn't already start with the base domain, but lookaheads depend on the type of regex being used (they work in PCRE, not in POSIX).
The rest of the changes should be fine: I added href=" and made it case-insensitive (the i modifier). preg_replace replaces every match by default, so PHP needs no global (g) modifier.
I hope that helps.
Matching regular expression:
(?:src|href)="(http://www\.example\.com/)?.+

The Hostname Regex

I'm looking for the regex to validate hostnames. It must completely conform to the standard. Right now, I have
^[0-9a-z]([0-9a-z\-]{0,61}[0-9a-z])?(\.[0-9a-z]([0-9a-z\-]{0,61}[0-9a-z])?)*$
but it allows successive hyphens and hostnames longer than 255 characters. If the perfect regex is impossible, say so.
Edit/Clarification: a Google search didn't reveal that this is a solved (or proven unsolvable) problem. I want to create the definitive regex so that nobody has to write their own ever again. If dialects matter, I want a version for each one in which this can be done.
^(?=.{1,255}$)[0-9A-Za-z](?:(?:[0-9A-Za-z]|-){0,61}[0-9A-Za-z])?(?:\.[0-9A-Za-z](?:(?:[0-9A-Za-z]|-){0,61}[0-9A-Za-z])?)*\.?$
The approved answer validates invalid hostnames containing multiple dots (example..com). Here is a regex I came up with that I think exactly matches what is allowable under RFC requirements (minus an ending "." supported by some resolvers to short-circuit relative naming and force FQDN resolution).
Spec:
<hname> ::= <name>*["."<name>]
<name> ::= <letter-or-digit>[*[<letter-or-digit-or-hyphen>]<letter-or-digit>]
Regex:
^([a-zA-Z0-9](?:(?:[a-zA-Z0-9-]*|(?<!-)\.(?![-.]))*[a-zA-Z0-9]+)?)$
I've tested quite a few permutations myself, I think it is accurate.
This regex also does not do length validation. Length constraints on labels between dots and on names are required by the RFC, but lengths can easily be checked in second and third passes after validating against this regex: check the full string length, then split on "." and validate the length of every substring. E.g., in JavaScript, label length validation might look like: "example.com".split(".").reduce(function (prev, curr) { return prev && curr.length <= 63; }, true).
Alternative Regex (without negative lookbehind, courtesy of the HTML Living Standard):
^[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$
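The multi-pass approach described above (syntax first, lengths separately) might be sketched in Python as follows. The per-label pattern is this example's own simplification of the spec quoted above, and the 253-character overall limit is an assumption; other answers here use 255.

```python
import re

# One label: letter or digit, optionally more letters/digits/hyphens, ending in letter or digit.
LABEL = re.compile(r'[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?')

MAX_NAME_LENGTH = 253    # assumed limit for the textual form; adjust if you prefer 255
MAX_LABEL_LENGTH = 63

def is_valid_hostname(name):
    """Check overall length, then each dot-separated label's length and syntax."""
    if not 1 <= len(name) <= MAX_NAME_LENGTH:
        return False
    return all(len(label) <= MAX_LABEL_LENGTH and LABEL.fullmatch(label) is not None
               for label in name.split("."))

for name in ("example.com", "example..com", "-bad.example.com", "a" * 64 + ".com"):
    print(name[:20], is_valid_hostname(name))
```

Splitting on "." makes the empty-label case (example..com) fall out naturally, because an empty string never matches the label pattern.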
Your answer was relatively close.
But see
RFC 2396 Section 3.2.2
JaredPar's reference to this answer is referring to Regexp/Common/URI/RFC2396.pm source.
For a hostname RE, that perl module produces
(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)
I would modify to be more accurate as:
(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]{0,61})?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]{0,61}[a-zA-Z0-9]|[a-zA-Z])[.]?)
Optionally anchoring the ends with ^$ to ONLY match hostnames.
I don't think a single RE can accomplish a full validation because, according to Wikipedia, there is a 255-character length restriction, which I don't think can be included within that same RE, at least not without a ton of changes. It's easy enough to just check that the length is <= 255 before running the RE.
Take a look at the following question. A few of the answers have regex expressions for host names
Regular expression to match DNS hostname or IP Address?
Could you specify what language you want to use this regex in? Most languages / systems have slightly different regex implementations that will affect people's answers.
I tried all the answers with the examples below and unfortunately none of them passed the test.
ec2-11-111-222-333.cd-blahblah-1.compute.amazonaws.com
domaine.com
subdomain.domain.com
12533d5.dkkkd.com
2dotsextension.co
1dotextension.c
ekkej_dhh.com
12552.2225
112.25.25
12345.com
12345.123.com
domaine.123
whatever
9999-ee.99
email@domain.com
.jjdj.kkd
-subdomain.domain.com
@subdomain.domain.com
112.25.25
Here is a better solution.
^[A-Za-z0-9][A-Za-z0-9-.]*\.\D{2,4}$
Please post any other case I haven't considered at https://regex101.com/r/89zZkW/1
What about:
^(?=.{1,255})([0-9A-Za-z]|_{1}|\*{1}$)(?:(?:[0-9A-Za-z]|\b-){0,61}[0-9A-Za-z])?(?:\.[0-9A-Za-z](?:(?:[0-9A-Za-z]|\b-){0,61}[0-9A-Za-z])?)*\.?$
This allows only one '_' at the beginning (for some SRV records) and only one * (in case of a label for a DNS wildcard).
According to the relevant internet RFCs and assuming you have lookahead and lookbehind positive and negative assertions:
If you want to validate a local/leaf hostname for use in an internet hostname (e.g. - FQDN), then:
^(?!-)[-a-zA-Z0-9]{1,63}(?<!-)$
That ^^^ is also the general check that a label component inside an internet hostname is valid.
If you want to validate an internet hostname (e.g. - FQDN), then:
^(?=.{1,253}\.?$)(?:(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.)*(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.?$
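Python's re module supports both lookahead and fixed-width lookbehind, so the FQDN pattern can be tried there directly. This is only a sketch for experimenting with it. The helper name is_fqdn is made up, and fullmatch is used so a trailing newline does not slip past "$".

```python
import re

# The FQDN pattern from above, split across lines for readability.
FQDN = re.compile(
    r'^(?=.{1,253}\.?$)'                        # overall length, optional trailing dot
    r'(?:(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.)*'     # zero or more "label." parts
    r'(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.?$'        # final label, optional trailing dot
)

def is_fqdn(name):
    """True if name satisfies the lookahead/lookbehind hostname pattern."""
    return FQDN.fullmatch(name) is not None

for name in ("example.com", "example.com.", "-bad.com", "bad-.com", "a..b"):
    print(name, is_fqdn(name))
```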

Question about URL Validation with Regex [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 14 years ago.
I have the following regex that does a great job matching urls:
((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)
However, it does not handle urls without a prefix, ie. stackoverflow.com or www.google.com do not match. Anyone know how I can modify this regex to not care if there is a prefix or not?
EDIT: Is my question too vague? Does it need more details?
(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\)))?[\w\d:#@%/;$()~_?\+-=\\\.&]*)
I added a ()? around the protocols like Vinko Vrsalovic suggested, but now the regex will match nearly any string, as long as it has valid URL characters.
My use case: I manage a database in which one field holds either plain text, a phone number, a URL or an email address. I was looking for an easy way to validate the input so I can format it properly, i.e. creating anchor tags for URLs and emails, and formatting phone numbers the way the other numbers are formatted throughout the site. Any suggestions?
The below regex is from the wonderful Mastering Regular Expressions book. If you are not familiar with the free spacing/comments mode, I suggest you get familiar with it.
\b
# Match the leading part (proto://hostname, or just hostname)
(
# ftp://, http://, or https:// leading part
(ftp|https?)://[-\w]+(\.\w[-\w]*)+
|
# or, try to find a hostname with our more specific sub-expression
(?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
# Now ending .com, etc. For these, require lowercase
(?-i: com\b
| edu\b
| biz\b
| gov\b
| in(?:t|fo)\b # .int or .info
| mil\b
| net\b
| org\b
| name\b
| coop\b
| aero\b
| museum\b
| [a-z][a-z]\b # two-letter country codes
)
)
# Allow an optional port number
( : \d+ )?
# The rest of the URL is optional, and begins with / . . .
(
/
# The rest are heuristics for what seems to work well
[^.!,?;"'<>()\[\]{}\s\x7F-\xFF]*
(?:
[.!,?]+ [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]+
)*
)?
To explain this regex briefly (for a full explanation get the book) - URLs have one or more dot separated parts ending with either a limited list of final bits, or a two letter country code (.uk .fr ...). In addition the parts may have any alphanumeric characters or hyphens '-', but hyphens may not be the first or last character of the parts. Then there may be a port number, and then the rest of it.
To extract this from the website, go to http://regex.info/listing.cgi?ed=3&p=207 It is from page 207 of the 3rd edition.
And the page says "Copyright © 2008 Jeffrey Friedl" so I'm not sure what the conditions for use are exactly, but I would expect that if you own the book you could use it so ... I'm hoping I'm not breaking the rules putting it here.
If you read section 5 of the URL specification (http://www.isi.edu/in-notes/rfc1738.txt) you'll see that the syntax of a URL is at a minimum:
scheme ':' schemepart
where scheme is 1 or more characters and schemepart is 0 or more characters. Therefore if you don't have a colon, you don't have a URL.
That said, /users/ don't care if they've given you a url, to them it looks like one. So here's what I do:
BEFORE validation, if there isn't a colon in it, prepend http://, then run it through whatever validator you want. This turns any legitimate hostname (which may not include domain info, after all) into something that looks like a URL.
frob -> http://frob
(Nearly) the only rule for the host part is that it can't begin with a digit if it contains no dots. Now, there are specific validations that should be performed for specific schemes, which none of the regexes given thus far accomplish. But, spec compliance is probably not what you want to 'validate'. Therefore a dns query on the hostname portion may be useful, but unless you're using the same resolver in the same context as your user, it isn't going to work in all cases.
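The prepend-before-validating step is small enough to sketch in Python. The function names are invented for this example, and urllib.parse.urlsplit stands in for "whatever validator you want".

```python
from urllib.parse import urlsplit

def normalize(text, default_scheme="http"):
    """If there is no colon at all, treat text as a bare host and prepend a scheme."""
    if ":" not in text:
        text = default_scheme + "://" + text
    return text

def host_of(text):
    """Return the hostname a URL parser sees after normalization (may be None)."""
    return urlsplit(normalize(text)).hostname

print(normalize("frob"))                   # http://frob
print(host_of("www.example.com/path"))     # www.example.com
```

Input such as www.example.com:80 already contains a colon, so this rule leaves it alone; handling host:port without a scheme needs an extra check of its own.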
Your regexp matches everything starting with one of those protocols, including a lot of things that cannot possibly be existing URLs. If you relax the protocol part (making it optional with ?), you'll just be matching almost everything, including the empty string.
In other words, it does a great job matching URLs because it matches almost anything starting with http://,https://,ftp:// and so on. Well, it also matches ftp:\\ and ms-help://, but let's ignore that.
It may make sense, depending on actual usage, because the other regexp approach of whitelisting valid domains becomes non maintainable quickly enough, but making the protocol part optional does not make sense.
An example (with the relaxed protocol part in place):
>>> r = re.compile('(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)?[\w\d:##%/;$()~_?\+-=\\\.&]*)')
>>> r.search('oompaloompa_is_not_an_ur%&%%l').groups()[0]
'oompaloompa_is_not_an_ur%&%%l' #Matches!
>>> r.search('oompaloompa_isdfjakojfsdi.sdnioknfsdjknfsdjk.fsdnjkfnsdjknfsdjk').groups()[0]
'oompaloompa_isdfjakojfsdi.sdnioknfsdjknfsdjk.fsdnjkfnsdjknfsdjk' #Matches!
>>>
Given your edit, I suggest you either let the user select what they are adding (stored in an enum column), or create a simpler regex that checks for at least one dot, besides the valid characters and maybe some common domains.
A third alternative, which will be VERY SLOW and should only be used when URL validation is REALLY REALLY IMPORTANT, is actually accessing the URL with a HEAD request; if you get a host not found or an error, you know it's not valid. For emails you could check whether the MX host exists and has port 25 open. If both fail, it's plain text. (I'm not suggesting this either.)
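A sketch of the "simpler regex" option, requiring at least one dot in the host part. The pattern and the name looks_like_url are this example's own guesses at what such a check might be, not a tested production rule.

```python
import re

URLISH = re.compile(
    r'(?:[a-z][a-z0-9+.-]*://)?'       # optional scheme
    r'[a-z0-9-]+(?:\.[a-z0-9-]+)+'     # host with at least one dot
    r'(?::\d+)?'                       # optional port
    r'(?:/\S*)?',                      # optional path, no whitespace
    re.IGNORECASE,
)

def looks_like_url(text):
    """Heuristic: True if the whole field has a dotted host, with or without a scheme."""
    return URLISH.fullmatch(text.strip()) is not None

for value in ("stackoverflow.com", "www.google.com",
              "http://www.google.com/search?q=x", "test",
              "oompaloompa_is_not_an_ur%&%%l"):
    print(value, looks_like_url(value))
```

It will still accept plenty of dotted strings that are not real hosts, and it rejects dotless hosts such as http://localhost/, so it is a formatting hint rather than validation.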
You can surround the prefix part in brackets and match 0 or 1 occurrences
((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)?
So the whole regex will become
(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)?[\w\d:#@%/;$()~_?\+-=\\\.&]*)
The problem with that is it's going to match more or less any word. For example "test" would also be a match.
Where are you going to use that regex? Are you trying to validate a hostname or are you trying to find hostnames inside a paragraph?
Just use:
.*
i.e. match everything.
The things you want to match are just hostnames, not URLs (technically).
There's no structure you can use to definitively identify hostnames.
Perhaps you could look for things that end in ".com" but then you'll miss any .co.uk, .net, .org, etc.
Edit:
In other words: if you remove the requirement that the URL-like things start with a protocol, you won't have anything to match on.
Depending on what you are using the regular expression on:
Treat everything as a URL
Keep the requirement for a protocol
Hack in checks for common hostname endings (e.g. .com, .net, .org) and accept that you'll miss some.