Can a URL contain a semicolon and still be valid? - regex

I am using a regular expression to convert plain text URL to clickable links.
#(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.-]*(\?\S+)?)?)?)#
However, sometimes in the body of the text, URL are enumerated one per line with a semi-colon at the end. The real URL does not contain any ";".
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=123;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=124
Is it permitted to have a semicolon (;) in a URL or can the semicolon be considered a marker of the end of an URL? How would that fit in my regular expression?

A semicolon is reserved and should only for its special purpose (which depends on the scheme).
Section 2.2:
Many URL schemes reserve certain
characters for a special meaning:
their appearance in the
scheme-specific part of the URL has a
designated semantics. If the character
corresponding to an octet is
reserved in a scheme, the octet must
be encoded. The characters ";",
"/", "?", ":", "#", "=" and "&" are
the characters which may be
reserved for special meaning within a
scheme. No other characters may be
reserved within a scheme.

The W3C encourages CGI programs to accept ; as well as & in query strings (i.e. treat ?name=fred&age=50 and ?name=fred;age=50 the same way). This is supposed to be because & has to be encoded as & in HTML whereas ; doesn't.

The semi-colon is a legal URI character; it belongs to the sub-delimiter category: http://www.ietf.org/rfc/rfc3986.txt
However, the specification states that whether the semi-colon is legitimate for a specific URI or not depends on the scheme or producer of that URI. So, if site using those links doesn't allow semi-colons, then they're not valid for that particular case.

Technically, a semicolon is a legal sub-delimiter in a URL string; plenty of source material is quoted above including http://www.ietf.org/rfc/rfc3986.txt.
And some do use it for legitimate purposes though it's use is likely site-specific (ie, only for use with that site) because it's usage has to be defined by the site using it.
In the real world however, the primary use for semicolons in URLs is to hide a virus or phishing URL behind a legitimate URL.
For example, sending someone an email with this link:
http:// www.yahoo.com/junk/nonsense;0200.0xfe.0x37.0xbf/malicious_file/
will result in the Yahoo! link (www.yahoo.com/junk/nonsense) being ignored because even though it is legitimate (ie, properly formed) no such page exists. But the second link (0200.0xfe.0x37.0xbf/malicious_file/) presumably exists* and the user will be directed to the malicious_file page; whereupon one's corporate IT manager will get a report and one will likely get a pink slip.
And before all the nay-sayers get their dander up, this is exactly how the new Facebook phishing problem works. The names have been changed to protect the guilty as usual.
*No such page actually exists to my knowledge. The link shown is for purposes of this discussion only.

http://www.ietf.org/rfc/rfc3986.txt covers URLs and what characters may appear in unencoded form. Given that URLs containing semicolons work properly in browsers, your code should support them.

Yes, semicolons are valid in URLs. However, if you're plucking them from relatively unstructured prose, it's probably safe to assume a semicolon at the end of a URL is meant as sentence punctuation. The same goes for other sentence-punctuation characters like periods, question marks, quotes, etc..
If you're only interested in URLs with an explicit http[s] protocol, and your regex flavor supports lookbehinds, this regex should suffice:
https?://[\w!#$%&'()*+,./:;=?#\[\]-]+(?<![!,.?;:"'()-])
After the protocol, it simply matches one or more characters that may be valid in a URL, without worrying about structure at all. But then it backs off as many positions as necessary until the final character is not something that might be sentence punctuation.

Quoting RFCs is not all that helpful in answering this question, because you will encounter URLs with semicolons (and commas for that matter). We had a Regex that did not handle semicolons and commas, and some of our users at NutshellMail complained because URLs containing them do in fact exist in the wild. Try building a dummy URL in Facebook or Twitter that contains a ';' or ',' and you will see that those two services encode the full URL properly.
I replaced the Regex we were using with the following pattern (and have tested that it works):
string regex = #"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-zA-Z0-9-]+\.[a-zA-Z0-9\/_:#=.+?,##%&~_-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";
This Regex came from http://rickyrosario.com/blog/converting-a-url-into-a-link-in-csharp-using-regular-expressions/ (with a slight modification)

Related

Regular Expression to match a specific URL broken up by arbitrary characters

I run a Django-based forum (the framework is probably not important to the question, but still) and it has been increasingly getting spammed with posts that link to a specific website constantly (www.solidwoodkitchen.co.uk - these people are apparently the worst).
I've implemented a string blocking system that stops them posting to the forum if the URL of the website is included in the post, but as spam bots usually do, it has figured out a way around that by breaking up the URL with other characters (eg. w_w_w.s*olid_wood*kit_ch*en._*co.*uk .). So a couple of questions:
Is it even possible to build a regex capable of finding the specific URL within a block of text even when it has been modified like that?
If it is, would this cause a performance hit?
Description
You could break the url into a string of characters, then join them together with [^a-z0-9]*?. So in this case with www.solidwoodkitchen.co.uk the resulting regex would look like:
w[^a-z0-9]*?w[^a-z0-9]*?w[^a-z0-9]*?[.][^a-z0-9]*?s[^a-z0-9]*?o[^a-z0-9]*?l[^a-z0-9]*?i[^a-z0-9]*?d[^a-z0-9]*?w[^a-z0-9]*?o[^a-z0-9]*?o[^a-z0-9]*?d[^a-z0-9]*?k[^a-z0-9]*?i[^a-z0-9]*?t[^a-z0-9]*?c[^a-z0-9]*?h[^a-z0-9]*?e[^a-z0-9]*?n[^a-z0-9]*?[.][^a-z0-9]*?c[^a-z0-9]*?o[^a-z0-9]*?[.][^a-z0-9]*?u[^a-z0-9]*?k
Edit live on Debuggex
This could would basically search for the entire string of characters seperated by zero or more non alphanumeric characters.
Or you could take the input text and strip out all punctuation then simply search for wwwsolidwoodkitchencouk.

Regex for URL routing - match alphanumeric and dashes except words in this list

I'm using CodeIgniter to write an app where a user will be allowed to register an account and is assigned a URL (URL slug) of their choosing (ex. domain.com/user-name). CodeIgniter has a URL routing feature that allows the utilization of regular expressions (link).
User's are only allowed to register URL's that contain alphanumeric characters, dashes (-), and under scores (_). This is the regex I'm using to verify the validity of the URL slug: ^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$
I am using the url routing feature to route a few url's to features on my site (ex. /home -> /pages/index, /activity -> /user/activity) so those particular URL's obviously cannot be registered by a user.
I'm largely inexperienced with regular expressions but have attempted to write an expression that would match any URL slugs with alphanumerics/dash/underscore except if they are any of the following:
default_controller
404_override
home
activity
Here is the code I'm using to try to match the words with that specific criteria:
$route['(?!default_controller|404_override|home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
but it isn't routing properly. Can someone help? (side question: is it necessary to have ^ or $ in the regex when trying to match with URL's?)
Alright, let's pick this apart.
Ignore CodeIgniter's reserved routes.
The default_controller and 404_override portions of your route are unnecessary. Routes are compared to the requested URI to see if there's a match. It is highly unlikely that those two items will ever be in your URI, since they are special reserved routes for CodeIgniter. So let's forget about them.
$route['(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
Capture everything!
With regular expressions, a group is created using parentheses (). This group can then be retrieved with a back reference - in our case, the $1, $2, etc. located in the second part of the route. You only had a group around the first set of items you were trying to exclude, so it would not properly capture the entire wild card. You found this out yourself already, and added a group around the entire item (good!).
$route['((?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Look-ahead?!
On that subject, the first group around home|activity is not actually a traditional group, due to the use of ?! at the beginning. This is called a negative look-ahead, and it's a complicated regular expression feature. And it's being used incorrectly:
Negative lookahead is indispensable if you want to match something not followed by something else.
There's a LOT more I could go into with this, but basically we don't really want or need it in the first place, so I'll let you explore if you'd like.
In order to make your life easier, I'd suggest separating the home, activity, and other existing controllers in the routes. CodeIgniter will look through the list of routes from top to bottom, and once something matches, it stops checking. So if you specify your existing controllers before the wild card, they will match, and your wild card regular expression can be greatly simplified.
$route['home'] = 'pages';
$route['activity'] = 'user/activity';
$route['([A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Remember to list your routes in order from most specific to least. Wild card matches are less specific than exact matches (like home and activity), so they should come after (below).
Now, that's all the complicated stuff. A little more FYI.
Remember that dashes - have a special meaning when in between [] brackets. You should escape them if you want to match a literal dash.
$route['([A-Za-z0-9][A-Za-z0-9_\-]{2,254})'] = 'view/slug/$1';
Note that your character repetition min/max {2,254} only applies to the second set of characters, so your user names must be 3 characters at minimum, and 255 at maximum. Just an FYI if you didn't realize that already.
I saw your own answer to this problem, and it's just ugly. Sorry. The ^ and $ symbols are used improperly throughout the lookahead (which still shouldn't be there in the first place). It may "work" for a few use cases that you're testing it with, but it will just give you problems and headaches in the future.
Hopefully now you know more about regular expressions and how they're matched in the routing process.
And to answer your question, no, you should not use ^ and $ at the beginning and end of your regex -- CodeIgniter will add that for you.
Use the 404, Luke...
At this point your routes are improved and should be functional. I will throw it out there, though, that you might want to consider using the controller/method defined as the 404_override to handle your wild cards. The main benefit of this is that you don't need ANY routes to direct a wild card, or to prevent your wild card from goofing up existing controllers. You only need:
$route['404_override'] = 'view/slug';
Then, your View::slug() method would check the URI, and see if it's a valid pattern, then check if it exists as a user (same as your slug method does now, no doubt). If it does, then you're good to go. If it doesn't, then you throw a 404 error.
It may not seem that graceful, but it works great. Give it a shot if it sounds better for you.
I'm not familiar with codeIgniter specifically, but most frameworks routing operate based on precedence. In other words, the default controller, 404, etc routes should be defined first. Then you can simplify your regex to only match the slugs.
Ok answering my own question
I've seem to come up with a different expression that works:
$route['(^(?!default_controller$|404_override$|home$|activity$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$)'] = 'view/slug/$1';
I added parenthesis around the whole expression (I think that's what CodeIgniter matches with $1 on the right) and added a start of line identifier: ^ and a bunch of end of line identifiers: $
Hope this helps someone who may run into this problem later.

regex, find last part of a url

Let's take an url like
www.url.com/some_thing/random_numbers_letters_everything_possible/set_of_random_characters_everything_possible.randomextension
If I want to capture "set_of_random_characters_everything_possible.randomextension" will [^/\n]+$work? (solution taken from Trying to get the last part of a URL with Regex)
My question is: what does the "\n" part mean (it works even without it)? And, is it secure if the url has the most casual combination of characters apart "/"?
First, please note that www.url.com/some_thing/random_numbers_letters_everything_possible/set_of_random_characters_everything_possible.randomextension is not a URL without a scheme like http:// in front of it.
Second, don't parse URLs yourself. What language are you using? You probably don't want to use a regex, but rather an existing module that has already been written, tested, and debugged.
If you're using PHP, you want the parse_url function.
If you're using Perl, you want the URI module.
Have a look at this explanation: http://regex101.com/r/jG2jN7
Basically what is going on here is "match any character besides slash and new line, infinite to 1 times". People insert \r\n into negated char classes because in some programs a negated character class will match anything besides what has been inserted into it. So [^/] would in that case match new lines.
For example, if there was a line break in your text, you would not get the data after the linebreak.
This is however not true in your case. You need to use the s-flag (PCRE_DOTALL) for this behavior.
TL;DR: You can leave it or remove it, it wont matter.
Ask away if anything is unclear or I've explained it a little sloppy.

hl.regex.pattern not working in solr

I am using solr to fetch data.
I was using below parameters to fetch data:
http://testURL/solr/core0/select?start=10&rows=10&hl.fl=CC&hl.requireFieldMatch=true&hl=on&hl.maxAnalyzedChars=1&hl.fragsize=145&hl.snippets=99&sort=COlumn1+desc&q=CC%3a%28%22test%22~2%29&fl=title120%2ccolumn2%2ccolumn3%2cRL_DateTime%2cSid%2ccolumn4%2cguid%2chour&hl.regex.pattern=^\d+%20%3E%3E%20
Above query is not working with hl.regex.pattern parameter.
If I remove "hl.regex.pattern" than it is providing results in highlight section.
If I provide that regex pattern than it will not.
Regex is working in my c# code.
So am I missing anything here?
It's almost certainly the ^\. Those aren't valid in a URI, so you'll have to escape them.
From RFC 1738:
only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
This is a little dated, since non-Roman alphanumerics like λάμδα are allowed now, but the gist is the same.
Try hl.regex.pattern=%5E%5Cd+%20%3E%3E%20 instead.

Need regex to get domain + subdomain

So im using this function here:
function get_domain($url)
{
$pieces = parse_url($url);
$domain = isset($pieces['host']) ? $pieces['host'] : '';
if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
return $regs['domain'];
}
return false;
}
$referer = get_domain($_SERVER['HTTP_REFERER']);
And what i need is another regex for it, if someone would be so kind to help.
Exactly what i need is for it to get the whole domain, including subdomains.
Lets say as a real problem i have now. When people blogging link from example: myblog.blogger.com
The referer url will be just blogger.com, which is not ideal..
So if someone could help me so i can get the including subdomain as regex code for the function above, id apreciate it alot!
Thanks!
This regex should match a domain in a string, including any dubdomains:
/([a-z0-9|-]+\.)*[a-z0-9|-]+\.[a-z]+/
Translated to rough english, it functions like this: "match the first part of the string that has 'sometextornumbers.sometext', and also include any number of 'sometextornumbers.' that might preceed it.
See it in action here: http://regexr.com?2vppk
Note that the multiline and global flags in that link are only there to be able to match the entire blob of test-text, so you don't need if you're passing only one line to the regex
Good luck with the above as Domain names now contain non-roman characters. These would have to be processed into equivalent but unique ascii before regex could work reliably. See RFC 3490 Internationalizing Domain Names in Applications (IDNA) ...
See https://www.rfc-editor.org/rfc/rfc3490
which has
Until now, there has been no standard method for domain names to use
characters outside the ASCII repertoire. This document defines
internationalized domain names (IDNs) and a mechanism called
Internationalizing Domain Names in Applications (IDNA) for handling
them in a standard fashion. IDNs use characters drawn from a large
repertoire (Unicode), but IDNA allows the non-ASCII characters to be
represented using only the ASCII characters already allowed in so-
called host names today. This backward-compatible representation is
required in existing protocols like DNS, so that IDNs can be
introduced with no changes to the existing infrastructure. IDNA is
only meant for processing domain names, not free text.
I guess this is an optimization for the first suggestion.
The main improvements:
does not react to invalid pattern sub..domain.xyz
captures more that one sub-domain as group
captures port if given
https://((?:[a-z0-9-]+\.)*)([a-z0-9-]+\.[a-z]+)($|\s|\:\d{1,5})
Test it: https://regex101.com/r/njFIil/1
This regex does not handle any unicode symbols, which could be a problem as mentioned above.
Better solution:
/^([a-z0-9|-]+[a-z0-9]{1,}\.)*[a-z0-9|-]+[a-z0-9]{1,}\.[a-z]{2,}$/
Regex sample:
https://regexr.com/4k71a
And for email address:
/^[a-z0-9|.|-]+[a-z0-9]{1,}#([a-z0-9|-]+[a-z0-9]{1,}\.)*[a-z0-9|-]+[a-z0-9]{1,}\.[a-z]{2,}$/