Question about URL Validation with Regex

I have the following regex that does a great job matching URLs:
((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:##%/;$()~_?\+-=\\\.&]*)
However, it does not handle URLs without a prefix, e.g. stackoverflow.com or www.google.com do not match. Does anyone know how I can modify this regex to not care whether there is a prefix or not?
EDIT: Is my question too vague? Does it need more details?
(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\)))?[\w\d:##%/;$()~_?\+-=\\\.&]*)
I added a ()? around the protocols like Vinko Vrsalovic suggested, but now the regex will match nearly any string, as long as it has valid URL characters.
Some context for this: I have a database whose contents I manage, and it has a field that holds either plain text, a phone number, a URL, or an email address. I was looking for an easy way to validate the input so I can have it properly formatted, i.e. creating anchor tags for the URL/email, and formatting the phone number the way I have the other numbers formatted throughout the site. Any suggestions?

The regex below is from the wonderful Mastering Regular Expressions book. If you are not familiar with free-spacing/comments mode, I suggest you get familiar with it.
\b
# Match the leading part (proto://hostname, or just hostname)
(
# ftp://, http://, or https:// leading part
(ftp|https?)://[-\w]+(\.\w[-\w]*)+
|
# or, try to find a hostname with our more specific sub-expression
(?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
# Now ending .com, etc. For these, require lowercase
(?-i: com\b
| edu\b
| biz\b
| gov\b
| in(?:t|fo)\b # .int or .info
| mil\b
| net\b
| org\b
| name\b
| coop\b
| aero\b
| museum\b
| [a-z][a-z]\b # two-letter country codes
)
)
# Allow an optional port number
( : \d+ )?
# The rest of the URL is optional, and begins with / . . .
(
/
# The rest are heuristics for what seems to work well
[^.!,?;"'<>()\[\]{}\s\x7F-\xFF]*
(?:
[.!,?]+ [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]+
)*
)?
To explain this regex briefly (for a full explanation, get the book): URLs have one or more dot-separated parts ending with either a limited list of final bits, or a two-letter country code (.uk .fr ...). In addition, the parts may have any alphanumeric characters or hyphens '-', but hyphens may not be the first or last character of a part. Then there may be a port number, and then the rest of it.
To extract this from the website, go to http://regex.info/listing.cgi?ed=3&p=207. It is from page 207 of the 3rd edition.
And the page says "Copyright © 2008 Jeffrey Friedl" so I'm not sure what the conditions for use are exactly, but I would expect that if you own the book you could use it so ... I'm hoping I'm not breaking the rules putting it here.
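For reference, here is a minimal usage sketch (mine, not from the book) showing how the pattern can be compiled in Python with re.VERBOSE so the whitespace and # comments are ignored. The only tweak is that the book's (?-i:...) wrapper around the TLD list is written as a plain group, since lowercase-only matching is already the default when IGNORECASE is off.
import re

URL_PATTERN = r"""
\b
(
    (ftp|https?)://[-\w]+(\.\w[-\w]*)+          # proto://hostname
  |
    (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
    ( com\b | edu\b | biz\b | gov\b | in(?:t|fo)\b | mil\b | net\b | org\b
    | name\b | coop\b | aero\b | museum\b
    | [a-z][a-z]\b                              # two-letter country codes
    )
)
( : \d+ )?                                      # optional port number
(
    /
    [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]*
    (?: [.!,?]+ [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]+ )*
)?
"""

finder = re.compile(URL_PATTERN, re.VERBOSE)
text = "See www.example.com or http://example.org/page?id=1 for details."
print([m.group(0) for m in finder.finditer(text)])
# ['www.example.com', 'http://example.org/page?id=1']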

If you read section 5 of the URL specification (http://www.isi.edu/in-notes/rfc1738.txt) you'll see that the syntax of a URL is at a minimum:
scheme ':' schemepart
where scheme is 1 or more characters and schemepart is 0 or more characters. Therefore if you don't have a colon, you don't have a URL.
That said, users don't care whether they've given you a URL; to them it looks like one. So here's what I do:
BEFORE validation, if there isn't a colon in it, prepend http://, then run it through whatever validator you want. This turns any legitimate hostname (which may not include domain info, after all) into something that looks like a URL.
frob -> http://frob
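A rough sketch of that idea (the helper name is mine, not something from the answer):
from urllib.parse import urlparse

def normalize_candidate(s):
    # If there is no colon, assume the scheme is missing and prepend "http://"
    # before handing the string to whatever validator you use.
    return s if ":" in s else "http://" + s

for raw in ("frob", "www.google.com", "http://stackoverflow.com"):
    candidate = normalize_candidate(raw)
    print(raw, "->", candidate, "| host:", urlparse(candidate).netloc)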
(Nearly) the only rule for the host part is that it can't begin with a digit if it contains no dots. Now, there are specific validations that should be performed for specific schemes, which none of the regexes given thus far accomplish. But spec compliance is probably not what you want to 'validate'. A DNS query on the hostname portion may therefore be useful, but unless you're using the same resolver in the same context as your user, it isn't going to work in all cases.

Your regexp matches everything starting with one of those protocols, including a lot of things that cannot possibly be existing URLs. If you relax the protocol part (making it optional with ?), then you'll just be matching almost everything, including the empty string.
In other words, it does a great job matching URLs because it matches almost anything starting with http://, https://, ftp:// and so on. Well, it also matches ftp:\\ and ms-help://, but let's ignore that.
It may make sense, depending on actual usage, because the other regexp approach of whitelisting valid domains becomes unmaintainable quickly enough, but making the protocol part optional does not make sense.
An example (with the relaxed protocol part in place):
>>> r = re.compile('(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)?[\w\d:##%/;$()~_?\+-=\\\.&]*)')
>>> r.search('oompaloompa_is_not_an_ur%&%%l').groups()[0]
'oompaloompa_is_not_an_ur%&%%l' #Matches!
>>> r.search('oompaloompa_isdfjakojfsdi.sdnioknfsdjknfsdjk.fsdnjkfnsdjknfsdjk').groups()[0]
'oompaloompa_isdfjakojfsdi.sdnioknfsdjknfsdjk.fsdnjkfnsdjknfsdjk' #Matches!
>>>
Given your edit, I suggest you either make the user select what he is adding (adding an enum column), or create a simpler regex that checks for at least a dot, besides the valid characters and maybe some common domains.
A third alternative, which will be VERY SLOW and should only be used when URL validation is REALLY REALLY IMPORTANT, is actually accessing the URL and doing a HEAD request on it; if you get a host-not-found or an error, you know it's not valid. For emails you could try to see if the MX host exists and has port 25 open. If both fail, it'll be plain text. (I'm not suggesting this either.)
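For completeness, a very rough sketch of that slow approach (the function names are mine, and the MX test is approximated by a plain DNS lookup):
import socket
import urllib.request

def url_responds(url, timeout=5):
    # HEAD request: any successful response means the URL is reachable.
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except Exception:
        return False

def mail_domain_resolves(address):
    # Best-effort check that the mail domain resolves at all; a real check
    # would query MX records and try port 25.
    domain = address.rsplit("@", 1)[-1]
    try:
        socket.getaddrinfo(domain, 25)
        return True
    except socket.gaierror:
        return False

print(url_responds("http://stackoverflow.com"))
print(mail_domain_resolves("someone@example.com"))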

You can surround the prefix part in brackets and match 0 or 1 occurrences
(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)?
So the whole regex will become
(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)?[\w\d:##%/;$()~_?\+-=\\\.&]*)
The problem with that is it's going to match more or less any word. For example "test" would also be a match.
Where are you going to use that regex? Are you trying to validate a hostname or are you trying to find hostnames inside a paragraph?

Just use:
.*
i.e. match everything.
The things you want to match are just hostnames, not URLs (technically).
There's no structure you can use to definitively identify hostnames.
Perhaps you could look for things that end in ".com", but then you'll miss any .co.uk, .net, .org, etc.
Edit:
In other words: if you remove the requirement that the URL-like things start with a protocol, you won't have anything to match on.
Depending on what you are using the regular expression on:
Treat everything as a URL
Keep the requirement for a protocol
Hack in checks for common endings for hostnames (e.g. .com, .net, .org) and accept that you'll miss some.

Related

Find last occurrence of period with regex

I'm trying to create a regex for validating URLs. I know there are many advanced ones out there, but I want to create my own for learning purposes.
So far I have a regex that works quite well, however I want to improve the validation for the TLD part of the URI because I feel it's not quite there yet.
Here's my regex (or find it on regexr):
/^[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}\b([/#?]{0,1}([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)$/
It works well for links such as foo.com or http://foo.com or foo.co.uk
The problem appears when you introduce subdomains or second-level domains such as co.uk because the regex will accept foo.co.u or foo.co..
I did try using the following to select the substring after the last .:
/[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}[^.]{2,}$/
but this prevents me from defining the path rules of the URI.
How can I ensure that the substring after the last . but before the first /, ? or # is at least 2 characters long?
From what I can see, you're almost there. I made some modifications and it seems to work.
^(http(s)?:\/\/)?(www\.)?[a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
It can be shortened somewhat by doing
^(http(s)?:\/\/)?(www\.)?[\w#:.\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([-\w.~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
(basically just tweaked your regex)
The main difference is that the parameter part is optional, but if it is there it has to start with one of /#?;. That part could probably be simplified as well.
Check it out here.
Edit:
After some experimenting, I think this one is about as simple as it'll get:
^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([#/#?;].*)?$
It also captures the separate parts - scheme, host, port, path and query/params.
Example here.
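For what it's worth, here is a quick standalone check (mine, not part of the answer) that exercises that last pattern and prints the captured scheme/host/port/path/query groups:
import re

pattern = re.compile(
    r'^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([#/#?;].*)?$'
)

for url in ("foo.com", "http://foo.co.uk:8080/path?x=1", "foo.co."):
    m = pattern.match(url)
    print(url, "->", m.groups() if m else "no match")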

Define regular expression that matches urls that end with digits unless anything else comes after

I'm using Scrapy to scrape a web site. I'm stuck at defining properly the rule for extracting links.
Specifically, I need help to write a regular expression that allows urls like:
https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352
https://discuss.dwolla.com/t/enhancement-dwolla-php-updated-to-2-1-3/1180
https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108
while forbidding urls like this one
https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12
In other words, I want URLs that end with digits (i.e., /1352 in the example above), unless anything else comes after those digits (i.e., /12 in the example above).
I am by no means an expert in regular expressions, and I could only come up with something like \/(\d+)$, or even this one ^https:\/\/discuss.dwolla.com\/t\/\S*\/(\d+)$, but both fail at excluding the unwanted URLs since they all capture the last digits in the address.
--- UPDATE ---
Sorry for not being clear in the first place. This addition is to clarify that the digits at the end of the URLs can change, so /1352 is not fixed. As such, another example of a URL to be accepted is:
https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108
This is probably the simplest way:
[^\/\d][^\/]*\/\d+$
or to restrict to a particular domain:
^https?:\/\/discuss.dwolla.com\/.*[^\/\d][^\/]*\/\d+$
See live demo.
This regex requires the last part to be all digits, and the 2nd last part to have at least 1 non-digit.
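As a quick sanity check (mine, not the answerer's), the domain-restricted version behaves as the question asks when run through Python's re module:
import re

pattern = re.compile(r'^https?:\/\/discuss.dwolla.com\/.*[^\/\d][^\/]*\/\d+$')

urls = [
    "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352",
    "https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108",
    "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12",
]
for url in urls:
    # The first two are accepted; the last is rejected because its final
    # two path segments are both all digits.
    print("accept" if pattern.match(url) else "reject", url)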
Here is a Java-style regex that may fit your requirements. You can specify the number of digits N you are expecting with {N}:
^https://discuss.dwolla.com/t/[\\w|-]+/[\\d]+$

RegEx filter links from a document

I am currently learning regex and I am trying to filter all links (e.g. http://www.link.com/folder/file.html) from a document with Notepad++. Actually I want to delete everything else so that in the end only the http links are listed.
So far I tried this : http\:\/\/www\.[a-zA-Z0-9\.\/\-]+
This gives me all links, which is fine, but how do I delete the remaining stuff so that in the end I have a neat list of all links?
If I try to replace it with nothing followed by \1, obviously the link will be deleted, but I want the exact opposite to have everything else deleted.
So it should be something like:
- find a string of numbers, letters and special signs until "http"
- delete what you found
- and keep searching for more numbers, letters and special signs after "html"
- and delete that again
Any ideas? Thanks so much.
In Notepad++, in the Replace menu (CTRL+H) you can do the following:
Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
Replace: $1\n
Options: check the Regular expression and the . matches newline
This will leave you with a list of all your links. There are two issues though:
The regex you provided for matching URLs is far from being generic enough to match any URL. If it is working in your case, that's fine, else check this question.
It will leave the text after the last matched URL intact. You have to delete it manually.
The answer previously given by @psxls was a great help for me when I wanted to perform a similar process.
However, that regex rule was written six years ago now; accordingly, I had to adjust, complete and update it so that it can properly work with some recent links, because:
a lot of URLs now use the HTTPS protocol instead of HTTP
many websites no longer use www as the main subdomain
some links add punctuation marks (which have to be preserved)
I finally reworked the search rule to .*?(https?\:\/\/[a-zA-Z0-9[:punct:]]+) and it worked correctly with the file I had.
Unfortunately, this seemingly simple task is going to be almost impossible to do in notepad++. The regex you would have to construct would be...horrible. It might not even be possible, but if it is, it's not worth it. I pretty much guarantee that.
However, all is not lost. There are other tools more suitable to this problem.
Really what you want is a tool that can search through an input file and print out a list of regex matches. The UNIX utility "grep" will do just that. Don't be scared off because it's a UNIX utility: you can get it for Windows:
http://gnuwin32.sourceforge.net/packages/grep.htm
The grep command line you'll want to use is this:
grep -o 'http:\/\/www.[a-zA-Z0-9./-]\+\?' <filename(s)>
(Where <filename(s)> are the name(s) of the files you want to search for URLs in.)
You might want to shake up your regex a little bit, too. The problems I see with that regex are that it doesn't handle URLs without the 'www' subdomain, and it won't handle secure links (which start with https). Maybe that's what you want, but if not, I would modify it thusly:
grep -o 'https\?:\/\/[a-zA-Z0-9./-]\+\?' <filename(s)>
Here are some things to note about these expressions:
Inside a character group, there's no need to quote metacharacters except for [ and (sometimes) -. I say sometimes because if you put the dash at the end, as I have above, it's no longer interpreted as a range operator.
The grep utility's syntax, annoyingly, is different than most regex implementations in that most of the metacharacters we're familiar with (?, +, etc.) must be escaped to be used, not the other way around. Which is why you see backslashes before the ? and + characters above.
Lastly, the repetition metacharacter in this expression (+) is greedy by default, which could cause problems. I made it lazy by appending a ? to it. The way you have your URL match formulated, it probably wouldn't have caused problems, but if you change your match to, say [^ ] instead of [a-zA-Z0-9./-], you would see URLs on the same line getting combined together.
I did this a different way.
Find everything up to the first/next (https or http), then everything that comes next up to (html or htm), then output just the '(https or http)(everything next)(html or htm)' part with a line feed/carriage return after each.
So:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace with: \1\2\3\r\n
This saves looking for all possible (incl. non-generic) URL matches.
You will need to manually remove any text after the last matched URL.
It can also be used to create URL links:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace: \1\2\3\r\n
or image links (jpg/jpeg/gif):
Find: .*?(https:|http:)(.*?)(jpeg|jpg|gif)
Replace: <img src="\1\2\3">\r\n
I know my answer won't be RegEx related, but here is another efficient way to get lines containing URLs.
This won't remove text around links, as Toto mentioned in the comments.
It works at least when there is a nice common pattern to all links, like https://.
CTRL+F => change tab to Mark
Insert https://
Tick Mark to bookmark.
Mark All.
Find => Bookmarks => Delete all lines without bookmark.
I hope someone who lands here in search of the same problem will find my way more user-friendly.
You can still use RegEx to mark lines :)

Regex for URL routing - match alphanumeric and dashes except words in this list

I'm using CodeIgniter to write an app where a user will be allowed to register an account and is assigned a URL (URL slug) of their choosing (ex. domain.com/user-name). CodeIgniter has a URL routing feature that allows the utilization of regular expressions (link).
Users are only allowed to register URLs that contain alphanumeric characters, dashes (-), and underscores (_). This is the regex I'm using to verify the validity of the URL slug: ^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$
I am using the URL routing feature to route a few URLs to features on my site (ex. /home -> /pages/index, /activity -> /user/activity), so those particular URLs obviously cannot be registered by a user.
I'm largely inexperienced with regular expressions but have attempted to write an expression that would match any URL slugs with alphanumerics/dash/underscore except if they are any of the following:
default_controller
404_override
home
activity
Here is the code I'm using to try to match the words with that specific criteria:
$route['(?!default_controller|404_override|home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
but it isn't routing properly. Can someone help? (Side question: is it necessary to have ^ or $ in the regex when trying to match URLs?)
Alright, let's pick this apart.
Ignore CodeIgniter's reserved routes.
The default_controller and 404_override portions of your route are unnecessary. Routes are compared to the requested URI to see if there's a match. It is highly unlikely that those two items will ever be in your URI, since they are special reserved routes for CodeIgniter. So let's forget about them.
$route['(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
Capture everything!
With regular expressions, a group is created using parentheses (). This group can then be retrieved with a back reference - in our case, the $1, $2, etc. located in the second part of the route. You only had a group around the first set of items you were trying to exclude, so it would not properly capture the entire wild card. You found this out yourself already, and added a group around the entire item (good!).
$route['((?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Look-ahead?!
On that subject, the first group around home|activity is not actually a traditional group, due to the use of ?! at the beginning. This is called a negative look-ahead, and it's a complicated regular expression feature. And it's being used incorrectly:
Negative lookahead is indispensable if you want to match something not followed by something else.
There's a LOT more I could go into with this, but basically we don't really want or need it in the first place, so I'll let you explore if you'd like.
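If you do want to explore it, here is a toy illustration (mine, not taken from your route) of what a negative lookahead does in plain Python:
import re

# Match "foo" only when it is NOT immediately followed by "bar".
print(re.findall(r'foo(?!bar)', 'foobar foobaz foo'))  # ['foo', 'foo']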
In order to make your life easier, I'd suggest separating the home, activity, and other existing controllers in the routes. CodeIgniter will look through the list of routes from top to bottom, and once something matches, it stops checking. So if you specify your existing controllers before the wild card, they will match, and your wild card regular expression can be greatly simplified.
$route['home'] = 'pages';
$route['activity'] = 'user/activity';
$route['([A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Remember to list your routes in order from most specific to least. Wild card matches are less specific than exact matches (like home and activity), so they should come after (below).
Now, that's all the complicated stuff. A little more FYI.
Remember that dashes - have a special meaning when in between [] brackets. You should escape them if you want to match a literal dash.
$route['([A-Za-z0-9][A-Za-z0-9_\-]{2,254})'] = 'view/slug/$1';
Note that your character repetition min/max {2,254} only applies to the second set of characters, so your user names must be 3 characters at minimum, and 255 at maximum. Just an FYI if you didn't realize that already.
I saw your own answer to this problem, and it's just ugly. Sorry. The ^ and $ symbols are used improperly throughout the lookahead (which still shouldn't be there in the first place). It may "work" for a few use cases that you're testing it with, but it will just give you problems and headaches in the future.
Hopefully now you know more about regular expressions and how they're matched in the routing process.
And to answer your question, no, you should not use ^ and $ at the beginning and end of your regex -- CodeIgniter will add that for you.
Use the 404, Luke...
At this point your routes are improved and should be functional. I will throw it out there, though, that you might want to consider using the controller/method defined as the 404_override to handle your wild cards. The main benefit of this is that you don't need ANY routes to direct a wild card, or to prevent your wild card from goofing up existing controllers. You only need:
$route['404_override'] = 'view/slug';
Then, your View::slug() method would check the URI, and see if it's a valid pattern, then check if it exists as a user (same as your slug method does now, no doubt). If it does, then you're good to go. If it doesn't, then you throw a 404 error.
It may not seem that graceful, but it works great. Give it a shot if it sounds better for you.
I'm not familiar with CodeIgniter specifically, but most frameworks' routing operates based on precedence. In other words, the default controller, 404, etc. routes should be defined first. Then you can simplify your regex to only match the slugs.
OK, answering my own question:
I seem to have come up with a different expression that works:
$route['(^(?!default_controller$|404_override$|home$|activity$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$)'] = 'view/slug/$1';
I added parentheses around the whole expression (I think that's what CodeIgniter matches with $1 on the right), added a start-of-line anchor ^, and a bunch of end-of-line anchors $.
Hope this helps someone who may run into this problem later.
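For what it's worth, the final pattern can also be exercised in plain Python (this check is mine, run outside CodeIgniter) to see which slugs it lets through:
import re

pattern = re.compile(
    r'^(?!default_controller$|404_override$|home$|activity$)'
    r'[A-Za-z0-9][A-Za-z0-9_-]{2,254}$'
)

for slug in ("home", "activity", "john-doe", "ab", "jane_doe_99"):
    # Reserved words and slugs shorter than 3 characters fall through.
    print(slug, "->", "route to view/slug" if pattern.match(slug) else "no match")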

How do I write a regular expression for a URL without the scheme?

How can I write an RE that validates URLs without the scheme:
Pass:
www.example.com
example.com
Fail:
http://www.example.com
^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$
string must start with an ASCII letter or number
ASCII letters, numbers, dots and dashes follow (no slashes or colons allowed)
optional: a port is allowed (":8080")
optional: anything after a slash may follow (since you said "URL")
then the end of the string
Thoughts:
no line breaks allowed
no validity or sanity checking
no support for "internationalized domain names" (IDNs)
leave off the "optional:" parts if you like, but be sure to include the final "$"
If your regex flavor supports it, you can shorten the above to:
^[A-Za-z\d][\w.-]+(:\d+)?(/.*)?$
Be aware that \w may include Unicode characters in some regex flavors. Also, \w includes the underscore, which is invalid in host names. An explicit approach like the first one would be safer.
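A quick test of the first pattern (my check, not the answerer's) against the pass/fail examples from the question:
import re

pattern = re.compile(r'^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$')

for s in ("www.example.com", "example.com", "http://www.example.com"):
    print(s, "->", "pass" if pattern.match(s) else "fail")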
If you're trying to do this for some real code, find the URL parsing library for your language and use that. If you don't want to use it, look inside to see what it does.
The thing that you are calling "resource" is known as a "scheme". It's documented in RFC 1738 which says:
[2.1] ... In general, URLs are written as follows:
<scheme>:<scheme-specific-part>
A URL contains the name of the scheme being used (<scheme>) followed
by a colon and then a string (the <scheme-specific-part>) whose
interpretation depends on the scheme.
And, later in the BNF,
scheme = 1*[ lowalpha | digit | "+" | "-" | "." ]
So, if a scheme is there, you can match it with:
/^[a-z0-9+.-]+:/i
If that matches, you have what the URL syntax considers a scheme and your validation fails. If you have strings with port numbers, like www.example.com:80, then things get messy. In practice, I haven't dealt with schemes with - or ., so you might add a real world fudge to get around that until you decide to use a proper library.
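As a rough illustration (mine, not from the answer) of using that scheme match as a rejection test, including the messy port-number case mentioned above:
import re

scheme_re = re.compile(r'^[a-z0-9+.-]+:', re.IGNORECASE)

for s in ("www.example.com", "www.example.com:80", "http://www.example.com"):
    # Anything with a scheme-like prefix is rejected; note that the host:port
    # form is a false positive, which is exactly the messiness described above.
    print(s, "->", "rejected (scheme-like prefix)" if scheme_re.match(s) else "ok")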
Anything beyond that, like checking for existing and reachable domains and so on, is better left to a library that's already figured it all out.
URL syntax is quite complex; you need to narrow it down a bit. You can match anything.ext, if that is enough:
^[a-zA-Z0-9.]+\.[a-zA-Z]{2,4}$
My guess is
/^[\p{Alnum}-]+(\.[\p{Alnum}-]+)+$/
In more primitive RE syntax that would be
/^[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)+$/
Or even more primitive still:
/^[0-9A-Za-z-][0-9A-Za-z-]*\.[0-9A-Za-z-][0-9A-Za-z-]*(\.[0-9A-Za-z-][0-9A-Za-z-]*)*$/
Thanks guys, I think I have a Python and a PHP solution. Here they are:
Python Solution:
import re
url = 'http://www.foo.com'
p = re.compile(r'^(?!http(s)?://$)[A-Za-z][A-Za-z0-9.-]+(:\d+)?(/.*)?$')
m = p.search(url)
print(m) # m is an _sre.SRE_Match object if url is valid, otherwise None
PHP Solution:
$url = 'http://www.foo.com';
preg_match('/^(?!http(s)?:\/\/$)[A-Za-z][A-Za-z0-9\.\-]+(:\d+)?(\/.*)?$/', $url);