Regex to not consider a phrase in a string - regex

The background is as follows:
I m working on converting URL (with or without protocol and www) into
clickable links.
I have the regex working for URLs with http, https, ftp, file, www and some combination of http/https with www.
I also have the regex working for URLs with just www and no protocol.
However, I m unable to figure out a working one for finding URLs with no protocol and no server name (www).
I tried the following in (http://gskinner.com/RegExr/)
([^www\.|http\:// ][a-zA-Z0-9\.]+)((?:[a-zA-Z0-9]+\.)+)([a-zA-Z]{2,4})([\/a-zA-Z0-9]+)([\?][a-zA-Z0-9]+)?
But that seems to work only that website and not on my application. Any help is much appreciated.

OK, you're probably not going to like this answer much - but then maybe you will? I have a regular expression (adapted from ) that seems to find URLs in text. You can see a demo on regex101.com .
The actual expression is very very long - this is because it's got "every legal TLD (top level domain) in it, which is a good start for finding "good" URLs. Here it is
((?:(?:http|ftp|https):\/{2}){0,1}(?:(?:[0-9a-z_-]+\.)+(?:aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cx|cy|cz|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mn|mn|mo|mp|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|nom|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ra|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw|arpa)(?::[0-9]+)?(?:(?:\/(?:[~0-9a-zA-Z\#\+\%#\.\/_-]+))?(?:\?[0-9a-zA-Z\+\%#\/&\[\];=_-]+)?)?))
As you can see the vast majority of the expression is taken up making sure that the TLD is one of the many legal ones (270 alternatives. I didn't know there were this many until I stumbled on http://mathiasbynens.be/demo/url-regex where I found the seeds of this expression).
Changes I made to the expression I found at the link above - mostly I just made all the groups (except the outer one) non-capturing so there is just a single "match". In the sample I posted I showed that a "good" protocol definition (like http://) will be included in the capture, while a "bad" one (like http:/) will be ignored - however the following URL will still be captured. I also showed that adding punctuation right after the expression (tested with ; and !) doesn't phase the expression: it captures "up to that point" and not beyond.
Play with it and see how you like it. It is relatively poor (according to the above link) for "pathological" URLs, and doesn't work with Arabic etc - but I don't think, based on your question, that this would be an issue.
A short explanation:
(?:(?:http|ftp|https):\/{2}){0,1}
(?:http|ftp|https) - match one of http, ftp, or https - non capturing "OR" group
:\/{2} - followed by a colon and exactly two forward slashes
(?: …){0,1} - the whole thing zero or one times (so no protocol, or properly formed)
(?:(?:[0-9a-z_-]+\.)+
[0-9a-z_-]+\. - at least one of the characters in the given range, followed by a period
(?: )+ - the whole thing one or more times, non-capturing
(?:aero|asia …) - one of these strings, non-capturing (these are all the valid TLDs)
(?::[0-9]+)? - zero or one times a colon followed by one or more digits: port specification
- this makes sure that www.something.us:8080 is valid
Everything else that follows matches all the different things that can go after - directories, queries, etc.

#Floris - Your suggestion worked well. I edited it a little bit and utilized adding a # to detect emails as well. I also edited for a simpler workflow as well (without the TLD) -
((?:(?:http|ftp|https):\/{2}){0,1}(?:(?:[0-9a-z_#-]+\.)+(?:[0-9a-zA-Z]){2,4})(?::[0-9]+)?(?:(?:\/(?:[~0-9a-zA-Z\#\+\%\#\.\/_-]+))?(?:\?[0-9a-zA-Z\+\%#\/&\[\];=_-]+)?)?)
Thanks for the help.

Related

Regex match domain that contain certain subdomain

I have this regex (not mine, taken from here)
^[^\.]+\.example\.org$
The regex will match *.example.org (e.g. sub.example.org), but will leaves out sub-subdomain (e.g. sub.sub.example.org), that's great and it is what I want.
But I have other requirement, I want to match subdomain that contain specific string, in this case press. So the regex will match following (literally any subdomain that has word press in it).
free-press.example.org
press.example.org
press23.example.org
I have trouble finding the right syntax, have looked on the internet and mostly they works only for standalone text and not domain like this.
Ok, let's break down what the "subdomain" part of your regex does:
[^\.]+ means "any character except for ., at least once".
You can break your "desired subdomain" up into three parts: "the part before press", "press itself", and "the part after press".
For the parts before and after press, the pattern is basically the same as before, except that you want to change the + (one or more) to a * (zero or more), because there might not be anything before or after press.
So your subdomain pattern will look like [^\.]*press[^\.]*.
Putting it all together, we have ^[^\.]*press[^\.]*\.example\.org$. If we put that into Regex101 we see that it works for your examples.
Note that this isn't a very strict check for valid domains. It might be worth thinking about whether regexes are actually the best tool for the "subdomain checking" part of this task. You might instead want to do something like this:
Use a generic, more thorough, domain-validation regex to check that the domain name is valid.
Split the domain name into parts using String.split('.').
Check that the number of parts is correct (i.e. 3), and that the parts meet your requirements (i.e. the first contains the substring press, the second is example, and the third is org).
If you're looking for a regex that matches URLs whose subdomains contain the word press then use
^[^\.]*press[^\.]*\.example\.org$
See the demo

Find last occurrence of period with regex

I'm trying to create a regex for validating URLs. I know there are many advanced ones out there, but I want to create my own for learning purposes.
So far I have a regex that works quite well, however I want to improve the validation for the TLD part of the URI because I feel it's not quite there yet.
Here's my regex (or find it on regexr):
/^[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}\b([/#?]{0,1}([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)$/
It works well for links such as foo.com or http://foo.com or foo.co.uk
The problem appears when you introduce subdomains or second-level domains such as co.uk because the regex will accept foo.co.u or foo.co..
I did try using the following to select the substring after the last .:
/[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}[^.]{2,}$/
but this prevents me from defining the path rules of the URI.
How can I ensure that the substring after the last . but before the first /, ? or # is at least 2 characters long?
From what I can see, you're almost there. Made some modification and it seems to work.
^(http(s)?:\/\/)?(www\.)?[a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
Can be somewhat shortened by doing
^(http(s)?:\/\/)?(www\.)?[\w#:.\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([-\w.~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
(basically just tweaked your regex)
The main difference is that the parameter part is optional, but if it is there it has to start with one of /#?;. That part could probably be simplified as well.
Check it out here.
Edit:
After some experimenting I think this one is about as simple it'll get:
^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([#/#?;].*)?$
It also captures the separate parts - scheme, host, port, path and query/params.
Example here.

Regex for URL routing - match alphanumeric and dashes except words in this list

I'm using CodeIgniter to write an app where a user will be allowed to register an account and is assigned a URL (URL slug) of their choosing (ex. domain.com/user-name). CodeIgniter has a URL routing feature that allows the utilization of regular expressions (link).
User's are only allowed to register URL's that contain alphanumeric characters, dashes (-), and under scores (_). This is the regex I'm using to verify the validity of the URL slug: ^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$
I am using the url routing feature to route a few url's to features on my site (ex. /home -> /pages/index, /activity -> /user/activity) so those particular URL's obviously cannot be registered by a user.
I'm largely inexperienced with regular expressions but have attempted to write an expression that would match any URL slugs with alphanumerics/dash/underscore except if they are any of the following:
default_controller
404_override
home
activity
Here is the code I'm using to try to match the words with that specific criteria:
$route['(?!default_controller|404_override|home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
but it isn't routing properly. Can someone help? (side question: is it necessary to have ^ or $ in the regex when trying to match with URL's?)
Alright, let's pick this apart.
Ignore CodeIgniter's reserved routes.
The default_controller and 404_override portions of your route are unnecessary. Routes are compared to the requested URI to see if there's a match. It is highly unlikely that those two items will ever be in your URI, since they are special reserved routes for CodeIgniter. So let's forget about them.
$route['(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
Capture everything!
With regular expressions, a group is created using parentheses (). This group can then be retrieved with a back reference - in our case, the $1, $2, etc. located in the second part of the route. You only had a group around the first set of items you were trying to exclude, so it would not properly capture the entire wild card. You found this out yourself already, and added a group around the entire item (good!).
$route['((?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Look-ahead?!
On that subject, the first group around home|activity is not actually a traditional group, due to the use of ?! at the beginning. This is called a negative look-ahead, and it's a complicated regular expression feature. And it's being used incorrectly:
Negative lookahead is indispensable if you want to match something not followed by something else.
There's a LOT more I could go into with this, but basically we don't really want or need it in the first place, so I'll let you explore if you'd like.
In order to make your life easier, I'd suggest separating the home, activity, and other existing controllers in the routes. CodeIgniter will look through the list of routes from top to bottom, and once something matches, it stops checking. So if you specify your existing controllers before the wild card, they will match, and your wild card regular expression can be greatly simplified.
$route['home'] = 'pages';
$route['activity'] = 'user/activity';
$route['([A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Remember to list your routes in order from most specific to least. Wild card matches are less specific than exact matches (like home and activity), so they should come after (below).
Now, that's all the complicated stuff. A little more FYI.
Remember that dashes - have a special meaning when in between [] brackets. You should escape them if you want to match a literal dash.
$route['([A-Za-z0-9][A-Za-z0-9_\-]{2,254})'] = 'view/slug/$1';
Note that your character repetition min/max {2,254} only applies to the second set of characters, so your user names must be 3 characters at minimum, and 255 at maximum. Just an FYI if you didn't realize that already.
I saw your own answer to this problem, and it's just ugly. Sorry. The ^ and $ symbols are used improperly throughout the lookahead (which still shouldn't be there in the first place). It may "work" for a few use cases that you're testing it with, but it will just give you problems and headaches in the future.
Hopefully now you know more about regular expressions and how they're matched in the routing process.
And to answer your question, no, you should not use ^ and $ at the beginning and end of your regex -- CodeIgniter will add that for you.
Use the 404, Luke...
At this point your routes are improved and should be functional. I will throw it out there, though, that you might want to consider using the controller/method defined as the 404_override to handle your wild cards. The main benefit of this is that you don't need ANY routes to direct a wild card, or to prevent your wild card from goofing up existing controllers. You only need:
$route['404_override'] = 'view/slug';
Then, your View::slug() method would check the URI, and see if it's a valid pattern, then check if it exists as a user (same as your slug method does now, no doubt). If it does, then you're good to go. If it doesn't, then you throw a 404 error.
It may not seem that graceful, but it works great. Give it a shot if it sounds better for you.
I'm not familiar with codeIgniter specifically, but most frameworks routing operate based on precedence. In other words, the default controller, 404, etc routes should be defined first. Then you can simplify your regex to only match the slugs.
Ok answering my own question
I've seem to come up with a different expression that works:
$route['(^(?!default_controller$|404_override$|home$|activity$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$)'] = 'view/slug/$1';
I added parenthesis around the whole expression (I think that's what CodeIgniter matches with $1 on the right) and added a start of line identifier: ^ and a bunch of end of line identifiers: $
Hope this helps someone who may run into this problem later.

Regex Problem (newbie)

i'm writing a little app for spam-checking and i'm having problems with a regex.
let's say i'm having this spam-url:
http://hosting.tyumen.ru/tip.html
so i want to check its url for having 2 full stops (subdomain+ending), a slash, a word, full stop and "html".
here's what i got so far:
(http://.*?\..*?..*?/.*?.html)
might look like rubbish but it works - the problem: it's really slow and freezing my app.
any hints on how to optimize it?
thx.re
The reason it's slow is that the non-greedy operators ? being used this way is prone to catastrophic backtracking
Instead of saying "any amount of anything, but only to an extent where it doesn't conflict with later requirements", which is effectively what .*? is saying, try asking for "as much as possible, that isn't a double quote, which would terminate the href ":
\1
I also added a back-reference (\1) to your first capturing group, inside the <a>...</a>, so that you don't have to do the exact same matching all over again.
Note that this regex will be broken if, say, the a has a class name, an id, or anything else in its body. I left it like this because I wanted to give you what you asked for with as few changes as possible, and as to-the-point as possible.
(http://[\w.-]+/.+?\.html) - may be will work for your case only.
or may be faster one
(http://[\w.-]+/[^.]+\.html)
Since you claim to be a regexp newbie, I will offer a more general advice on creating and debugging regular expressions. When they get pretty complicated, I find using Regexp Coach a must.
It's a freeware and really saves a lot of headache. Not to mention you don't have to build / run your application every minute just to see if the regexp works the way you wanted.
In Python, a simple way to match URLs ending in .html or .htm is to use
url_re = re.compile(
r'https?://' # http:// or https://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|' #domain...
r'localhost|' #localhost...
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
r'(?::\d+)?' # optional port
r'(?:\S+.html?)+' # ending in .html
, re.IGNORECASE)
which is a modified version of Django's UrlField regex.
This will match any site ending with .html or .htm. (either localhost, ip, domain).
#http://[-a-zA-Z0-9]+\.[-a-zA-Z0-9]+\.[-a-zA-Z]+/\w+\.html#

Question about URL Validation with Regex [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 14 years ago.
Improve this question
I have the following regex that does a great job matching urls:
((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:##%/;$()~_?\+-=\\\.&]*)`
However, it does not handle urls without a prefix, ie. stackoverflow.com or www.google.com do not match. Anyone know how I can modify this regex to not care if there is a prefix or not?
EDIT: Does my question too vague? Does it need more details?
(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\)))?[\w\d:##%/;$()~_?\+-=\\\.&]*)
I added a ()? around the protocols like Vinko Vrsalovic suggested, but now the regex will match nearly any string, as long as it has valid URL characters.
My implementation of this is I have a database that I manage the contents, and it has a field that either has plain text, a phone number, a URL or an email address. I was looking for an easy way to validate the input so I can have it properly formatted, ie. creating anchor tags for the url/email, and formatting the phone number how I have the other numbers formatted throughout the site. Any suggestions?
The below regex is from the wonderful Mastering Regular Expressions book. If you are not familiar with the free spacing/comments mode, I suggest you get familiar with it.
\b
# Match the leading part (proto://hostname, or just hostname)
(
# ftp://, http://, or https:// leading part
(ftp|https?)://[-\w]+(\.\w[-\w]*)+
|
# or, try to find a hostname with our more specific sub-expression
(?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
# Now ending .com, etc. For these, require lowercase
(?-i: com\b
| edu\b
| biz\b
| gov\b
| in(?:t|fo)\b # .int or .info
| mil\b
| net\b
| org\b
| name\b
| coop\b
| aero\b
| museum\b
| [a-z][a-z]\b # two-letter country codes
)
)
# Allow an optional port number
( : \d+ )?
# The rest of the URL is optional, and begins with / . . .
(
/
# The rest are heuristics for what seems to work well
[^.!,?;"'<>()\[\]{}\s\x7F-\xFF]*
(?:
[.!,?]+ [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]+
)*
)?
To explain this regex briefly (for a full explanation get the book) - URLs have one or more dot separated parts ending with either a limited list of final bits, or a two letter country code (.uk .fr ...). In addition the parts may have any alphanumeric characters or hyphens '-', but hyphens may not be the first or last character of the parts. Then there may be a port number, and then the rest of it.
To extract this from the website, go to http://regex.info/listing.cgi?ed=3&p=207 It is from page 207 of the 3rd edition.
And the page says "Copyright © 2008 Jeffrey Friedl" so I'm not sure what the conditions for use are exactly, but I would expect that if you own the book you could use it so ... I'm hoping I'm not breaking the rules putting it here.
If you read section 5 of the URL specification (http://www.isi.edu/in-notes/rfc1738.txt) you'll see that the syntax of a URL is at a minimum:
scheme ':' schemepart
where scheme is 1 or more characters and schemepart is 0 or more characters. Therefore if you don't have a colon, you don't have a URL.
That said, /users/ don't care if they've given you a url, to them it looks like one. So here's what I do:
BEFORE validation, if there isn't a colon in it, prepend http://, then run it through whatever validator you want. This turns any legitimate hostname (which may not include domain info, after all) into something that looks like a URL.
frob -> http://frob
(Nearly) the only rule for the host part is that it can't begin with a digit if it contains no dots. Now, there are specific validations that should be performed for specific schemes, which none of the regexes given thus far accomplish. But, spec compliance is probably not what you want to 'validate'. Therefore a dns query on the hostname portion may be useful, but unless you're using the same resolver in the same context as your user, it isn't going to work in all cases.
Your regexp matches everything starting with one of those protocols, including a lot of things that cannot possibly be existent URLs, if you relax the protocol part (making it optional with ?) then you'll just be matching almost everything, including the empty string.
In other words, it does a great job matching URLs because it matches almost anything starting with http://,https://,ftp:// and so on. Well, it also matches ftp:\\ and ms-help://, but let's ignore that.
It may make sense, depending on actual usage, because the other regexp approach of whitelisting valid domains becomes non maintainable quickly enough, but making the protocol part optional does not make sense.
An example (with the relaxed protocol part in place):
>>> r = re.compile('(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)?[\w\d:##%/;$()~_?\+-=\\\.&]*)')
>>> r.search('oompaloompa_is_not_an_ur%&%%l').groups()[0]
'oompaloompa_is_not_an_ur%&%%l' #Matches!
>>> r.search('oompaloompa_isdfjakojfsdi.sdnioknfsdjknfsdjk.fsdnjkfnsdjknfsdjk').groups()[0]
'oompaloompa_isdfjakojfsdi.sdnioknfsdjknfsdjk.fsdnjkfnsdjknfsdjk' #Matches!
>>>
Given your edit I suggest you either make the user select what is he adding, adding an enum column, or create a simpler regex that'll check for at least a dot, besides the valid characters and maybe some common domains.
A third alternative which will be VERY SLOW and only to be used when URL validation is REALLY REALLY IMPORTANT is actually accessing the URL and do a HEAD request on it, if you get a host not found or an error you know it's not valid. For emails you could try and see if the MX host exists and has port 25 open. If both fails, it'll be plain text. (I'm not suggesting this either)
You can surround the prefix part in brackets and match 0 or 1 occurrences
(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)?
So the whole regex will become
(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)?[\w\d:##%/;$()~_?\+-=\\\.&]*)
The problem with that is it's going to match more or less any word. For example "test" would also be a match.
Where are you going to use that regex? Are you trying to validate a hostname or are you trying to find hostnames inside a paragraph?
Just use:
.*
i.e. match everything.
The things you want to match are just hostnames, not URL (technically).
There's no structure you can use to definitively identify hostnames.
Perhaps you could look for things that end in ".com" but then you'll miss any .co.uk, net, .org, etc.
Edit:
In other words: If you remove the requirement that the URL-like things start with a protocol you won't have any thing to match on.
Depending on what you are using the regular expression on:
Treat everything as a URL
Keep the requirement for a protocol
Hack checks for common endings for hostnames (e.g. .com .net .org) and accept you'll miss some.