Regex to match URL not followed by " or < - regex

I'm trying to modify the url-matching regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls to not match anything that's already part of a valid URL tag or used as the link text.
For example, in the following string, I want to match http://www.foo.com, but NOT http://www.bar.com or http://www.baz.com
www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>
I was trying to add a negative lookahead to exclude matches followed by " or <, but for some reason, it's only applying to the "m" in .com. So, this regex still returns http://www.bar.co and http://www.baz.co as matches.
I can't see what I'm doing wrong... any ideas?
\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))(?!["<])
Here is a simpler example too:
((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])

I looked into this issue last year and developed a solution that you may want to look at. See: URL Linkification (HTTP/FTP). This link is a test page for the Javascript solution, with many examples of difficult-to-linkify URLs.
My regex solution, written for both PHP and Javascript - is not simple (but neither is the problem as it turns out.) For more information I would recommend also reading:
The Problem With URLs by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber
The comments following Jeff's blog post are a must read if you want to do this right...
Note also that John Gruber's regex has a component that can go into the realm of catastrophic backtracking (the part which matches one level of matching parentheses).

Yeah, it's actually trivial to make it work if you just want to exclude trailing characters: make your expression 'independent' (an atomic group), so that no backtracking will occur in that segment.
(?>\b ...)(?!["<])
A perl test:
use strict;
use warnings;
my $str = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>http://www.some.com';
while ($str =~ m~
(?>
\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
)
(?!["<])
~xg)
{
print "$1\n";
}
Output:
www.foo.com
http://www.some.com
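To see why the plain lookahead "only applies to the m", here is a minimal Python sketch. It uses a simplified URL pattern based on the shorter regex in the question, not Gruber's full one, and the sample string and variable names are made up for illustration. Since atomic groups are not available in every regex flavour or version, the sketch emulates one with the common trick of capturing inside a lookahead and consuming the capture with a backreference:

```python
import re

# Simplified URL pattern, loosely based on the "simpler example" in the question.
URL = r'(?:(?:ht|f)tps?://|www\.)[a-zA-Z0-9_\-.:#/~}?]+'

# Naive version: when the lookahead fails, the greedy quantifier gives back
# one character, and the lookahead then succeeds on the shortened match.
naive = re.compile(URL + r'(?!["<])')

# Emulated atomic group: lookaheads are not re-entered on backtracking,
# so the text captured inside it and consumed by \1 cannot be shortened.
atomic = re.compile(r'(?=(' + URL + r'))\1(?!["<])')

text = 'http://www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>'

naive_hits = [m.group(0) for m in naive.finditer(text)]
atomic_hits = [m.group(0) for m in atomic.finditer(text)]

print(naive_hits)   # includes the truncated http://www.bar.co and http://www.baz.co
print(atomic_hits)  # only http://www.foo.com
```

The naive pattern still reports the two unwanted URLs, just one character shorter, which is the behaviour described in the question.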

Related

username before @

I want to match the username before the @ in an email address,
and I created this regex:
[A-Za-z+ /w+0-9._%+-]+@
The result for my example is:
example: blabla,blabla,Test@Testing.com,blabla,blabla,blabla
result : Test@
How can I get only Test, without the @?
The simplest way is:
([A-Za-z /0-9._%+-]+)@
and then use what you captured ($1 in Perl, the match variable in Tcl, etc.)
btw,
I didn't know email addresses can have spaces in them, are you sure you're not taking too much in?
Edit:
here's a little tutorial on lookaheads (supporting Wiktor's comment)
http://www.regular-expressions.info/lookaround.html
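For comparison, here is a small Python sketch of both options, the capture group and the lookahead. It assumes the separator is the @ of an email address and uses a tightened character class without the space and slash; the variable names are only illustrative:

```python
import re

line = "blabla,blabla,Test@Testing.com,blabla,blabla,blabla"

# Option 1: capture group. The @ is matched, but only group 1 is kept.
with_group = re.search(r'([A-Za-z0-9._%+-]+)@', line).group(1)

# Option 2: lookahead. The @ must follow, but it is never part of the match.
with_lookahead = re.search(r'[A-Za-z0-9._%+-]+(?=@)', line).group(0)

print(with_group, with_lookahead)
```

Both give the same result here; the lookahead form is mainly useful when you cannot pick a group out of the match, for example in a plain search-and-replace.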

Matching URLs with other characters around

I need a regex pattern to match URLs in a complicated environment.
An URL would be in this position:
[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]
(That's just a sample URL)
I need to match the URL until the colon, the colon and the code after that should be ignored. There are so many URLs out there and I'm not that experienced to create a pattern to match everything from http:// to :
As I said, everything else should be ignored, left away, except the URL which I need to store in a variable.
Could someone help me create such a pattern? My tries were matching the URL above, but when I put in more complicated URLs, they wouldn't match.
This is the pattern I've created. It works with simple URLs, but not with the complicated ones:
http(s)?://[A-Za-z0-9.,/_-]+
I'm not very good in regex, I'm still learning.
Thank you.
This regex should do it for you.
\[url=(.*?):[a-zA-Z0-9]*\]
Run against your test data:
[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]
This will return the URL in capture group 1.
Assuming PHP (since your test URL is for the PHP manual), you'd use this with preg_match like this:
$value = "[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]";
$pattern = "/\[url=(.*?):[a-zA-Z0-9]*\]/";
preg_match($pattern, $value, $matches);
echo $matches[1];
Output:
http://www.php.net/manual/en/function.preg-replace.php
This will also work against URLs which contain colons in them, such as:
http://www.php.net:8080/manual/en/function.preg-replace.php
http://www.php.net/manual/us:en/function.preg-replace.php
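The same pattern can be checked outside PHP. A minimal Python sketch, using the three URLs mentioned above wrapped in the BBCode from the question (the list and variable names are made up for the example):

```python
import re

# Lazy (.*?) keeps extending until ":code]" can match, so colons that are
# followed by anything other than letters/digits and "]" stay in the URL.
BBCODE_URL = re.compile(r'\[url=(.*?):[a-zA-Z0-9]*\]')

samples = [
    "[url=http://www.php.net/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]",
    "[url=http://www.php.net:8080/manual/en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]",
    "[url=http://www.php.net/manual/us:en/function.preg-replace.php:32p0eixu]TEST[/url:32p0eixu]",
]

urls = [BBCODE_URL.search(s).group(1) for s in samples]
for url in urls:
    print(url)
```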
How about this:
^(http(s)?:\/\/)?[^]^(^)^ ]+
Below regex will give you the url part before colon:
\[url=((http|https)?://)?[^\:]+

Selecting URLs using RegExp but ignoring them when surrounded by double quotes

I've searched around quite a bit now, but I can't get any suggestions to work in my situation. I've seen success with negative lookahead or lookaround, but I really don't understand it.
I wish to use RegExp to find URLs in blocks of text but ignore them when quoted. While not perfect yet I have the following to find URLs:
(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?
I want it to match the following:
www.test.com:50/stuff
http://player.vimeo.com/video/63317960
odd.name.amazone.com/pizza
But not match:
"www.test.com:50/stuff
http://plAyerz.vimeo.com/video/63317960"
"odd.name.amazone.com/pizza"
Edit:
To clarify, I could be passing a full paragraph of text through the expression. Sample paragraph of what I'd like below:
I would like the following link to be found www.example.com. However this link should be ignored "www.example.com". It would be nice, but not required, to have "www.example.com and www.example.com" ignored as well.
A sample of a different one I have working below. language is php:
$articleEntry = "Hey guys! Check out this cool video on Vimeo: player.vimeo.com/video/63317960";
$pattern = array('/\n+/', '/(https?\:\/\/)?(player\.vimeo\.com\/video\/[0-9]+)/');
$replace = array('<br/><br/>',
'<iframe src="http://$2?color=40cc20" width="500" height="281" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe>');
$articleEntry = preg_replace($pattern,$replace,$articleEntry);
The result of the above will replace any new lines "\n" with a double break "<br/><br/>" and will embed the Vimeo video by replacing the Vimeo address with an iframe and link.
I've found a solution!
(?=(([^"]+"){2})*[^"]*$)((https?:\/\/)?(\w+\.)+\w{2,}(:[0-9]+)?((\/\w+)+(\.\w+)?)?\/?)
The first part, from (?= to *$), is what makes it work for me. I found this as an answer in java Regex - split but ignore text inside quotes? by https://stackoverflow.com/users/548225/anubhava
While I had read that question before, I had overlooked his answer because it wasn't the one that "solved" the question. I just changed the single quote to double quote and it works out for me.
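The idea behind that lookahead is that a position is outside quotes when an even number of double quotes remains between it and the end of the text. A rough Python sketch of it follows. The URL part is a simplified stand-in rather than the exact pattern above, the lookahead is a slightly rewritten variant, and the sample text is invented:

```python
import re

# Succeeds only if the rest of the text contains an even number of double
# quotes, i.e. the current position is not inside a quoted span.
OUTSIDE_QUOTES = r'(?=(?:[^"]*"[^"]*")*[^"]*$)'

# Simplified URL pattern (assumption for this sketch, not a full URL matcher).
URL = r'(?:https?://)?(?:\w+\.)+\w{2,}(?::\d+)?(?:/\w+)*'

unquoted_url = re.compile(OUTSIDE_QUOTES + URL)

text = ('See www.test.com:50/stuff and odd.name.amazone.com/pizza, but not '
        '"www.example.com" or "http://player.vimeo.com/video/63317960".')

found = unquoted_url.findall(text)
print(found)
```

This relies on the quotes being balanced; a single stray quote flips which parts of the text count as quoted.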
Add ^ and $ to your regex:
^(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?$
Please note that you might need to escape the slashes after http (that is, https?\:\/\/).
Update:
If you want it to be case sensitive, you shouldn't use \w but [a-z]. \w matches all letters (both cases), digits and the underscore, so be careful when using it.

How to Regex Multiple URLs From Same Variable In Perl

I'm trying to search a field in a database to extract URLs. Sometimes there will be more than 1 URL in a field and I would like to extract those in to separate variables (or an array).
I know my regex isn't going to cover all possibilities. As long as I flag on anything that starts with http and ends with a space I'm ok.
The problem I'm having is that my efforts either seem to get only 1 URL per record or they get only the last letter from each URL. I've tried a couple of different techniques based on solutions others have posted, but I haven't found a solution that works for me.
Sample input line:
Testing http://marko.co http://tester.net Just about anything else you'd like.
Output goal
$var[0] = http://marko.co
$var[1] = http://tester.net
First try:
if ( $status =~ m/http:(\S)+/g ) {
print "$&\n";
}
Output:
http://marko.co
Second try:
@statusurls = ($status =~ m/http:(\S)+/g);
print "@statusurls\n";
Output:
o t
I'm new to regex, but since I'm using the same regex for each attempt, I don't understand why it's returning such different results.
Thanks for any help you can offer.
I've looked at these posts and either didn't find what I was looking for or didn't understand how to implement it:
This one seemed the most promising (and it's where I got the 2nd attempt from), but it didn't return the whole URL, just the letter: How can I store regex captures in an array in Perl?
This has some great stuff in it. I'm curious if I need to look at the URL as a word since it's bookended by spaces: Regex Group in Perl: how to capture elements into array from regex group that matches unknown number of/multiple/variable occurrences from a string?
This one offers similar suggestions as the first two. How can I store captures from a Perl regular expression into separate variables?
Solution:
@statusurls = ($status =~ m/(http:\S+)/g);
print "@statusurls\n";
Thanks!
I think that you need to capture more than just one character. Try this regex instead:
m/http:(\S+)/g
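The difference between the two attempts comes down to where the quantifier sits. With (\S)+ the group matches one character at a time and is overwritten on every repetition, so only the last character survives. With the + inside the parentheses the group holds the whole run. Python's re module behaves the same way as Perl here, so a short sketch in Python (variable names are illustrative) shows both:

```python
import re

status = "Testing http://marko.co http://tester.net Just about anything else you'd like."

# Quantifier outside the group: each repetition overwrites the capture,
# so only the final character of each URL is returned.
last_chars = re.findall(r'http:(\S)+', status)

# Quantifier inside the group: the capture holds the whole URL.
whole_urls = re.findall(r'(http:\S+)', status)

print(last_chars)
print(whole_urls)
```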

Except URL regex

Sigh, regex trouble again.
I have following in $text:
[img]http://www.site.com/logo.jpg[/img]
and
[url]http://www.site.com[/url]
I have regex expression:
$text = preg_replace("/(?<!(\[img\]|\[url\]))([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9](?!(\[\/img\]|\[\/url\]))/","there was link",$text);
The point is to replace url only if it's not preceded by [img] or [url] and not followed by [/img] or [/url]. On the output of previous example I get:
there was link
and
there was link
Both, URL and lookbehind and lookforward regexps are working fine separately.
$text = "[img]bash.org/logo.jpg[/img]";
$text = preg_replace("/(?<!(\[img\]|\[url\]))bash.org(?!(\[\/img\]|\[\/url\]))/","there was link",$text);
echo $text leaves everything as is and gives me [img]bash.org/logo.jpg[/img]
I suppose the problem is in combination of lookarounds and URL regex. Where's my mistake?
I WANT TO
replace http://www.google.com with "there was link", but leave as is "[url]http://www.google.com[/url]"
I'M GETTING
http://www.google.com replaced with "there was link" and [url]http://www.google.com[/url] replaced with "there was link"
HERE'S PHP CODE TO TEST
<?php
$text = "[url]http://www.google.com[/url] <br><br> http://www.google.com";
// should NOT be changed //should be changed
$text = preg_replace("/(?<!\[url\])([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9](?!\[\/url\])/","there was link",$text);
echo $text;
echo '<hr width="100%">';
$text = ":) :-) 0:) 0:-) :)) :-))";
$text = preg_replace("/(?<!0):-?\)(?!\))/","smiley",$text);
echo $text; // lookarounds work
echo '<hr width="100%">';
$text = "http://stackoverflow.com/questions/2482921/regexp-exclusion";
$text = preg_replace("/([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9]/","it's a link to stackoverflow",$text);
echo $text; // URL pattern works fine
?>
Assuming I'm understanding you, you wish to replace all URLs in your $input, with the words 'link was here', unless the URL was within either the url or img bbcode tags. The reason the lookaround assertions aren't working is because those parts are actually matching against your very greedy URL pattern (which I'm fairly sure does lots of things you don't mean it to). Writing a pattern that will match any valid URL (including query string) within other text and that will also not match the tags attached to it is not necessarily the simplest of matters. Especially since your current pattern has the http:// or ftp:// as optional.
The only way you are likely to gain any success is to decide on a strict set of rules that constitute a url.
It is tough to fully understand your question, but it looks like you're doing reverse BBcode. So, leave it alone if it's surrounded by tags? If that is the case, then I think you will have an interesting problem on your hands because URL regexes are notoriously complex.
I think you may be making this more complex than it needs to be. Instead, I would change anything that is between the BBcode. Here's what I think needs to happen:
find the string segment "[url]"
capture anything that follows it
end the capture when the string segment "[/url]" is seen
That is an easy regex:
$text = "[url]http://www.google.com[/url] <br><br> http://www.google.com";
$regex = "#\[url\].*?\[/url\]#"; // assumed pattern, reconstructed from the three steps above
$replace = "there was link";
$text = preg_replace($regex, $replace, $text);
echo $text;
I know this isn't exactly what you asked for (in fact, probably the exact opposite), but it would achieve the same result and be much easier.
You can probably try using negative lookaheads with this regex, but I am not sure it would give you proper results:
$regex = "#(?!\[url\])(.*)(?!\[/url\])#";
One important note: This does not sanitize user input. Make sure you do this, but I would separate the logic so it is very easy to see what you are doing and where you are doing it. I would also use a library to do this because it's easier and probably safer.
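A related way to avoid lookarounds altogether is to match either a whole tagged span or a bare URL in one alternation, and then decide in a callback which of the two was found. This is a swapped-in technique, not the code from the question or from this answer. The sketch below is in Python, with a deliberately simple URL pattern and a hypothetical helper name, replace_bare_links:

```python
import re

TAGGED_OR_BARE = re.compile(
    r'\[(url|img)\].*?\[/\1\]'                    # a whole [url]...[/url] or [img]...[/img] span
    r'|((?:https?|ftp)://[^\s<>\[\]]+|www\.[^\s<>\[\]]+)',  # or a bare URL (group 2)
    re.IGNORECASE | re.DOTALL,
)

def replace_bare_links(text, replacement="there was link"):
    """Replace URLs that are not wrapped in [url] or [img] tags."""
    def substitute(match):
        # Group 2 is only set when the bare-URL alternative matched;
        # tagged spans are returned unchanged.
        if match.group(2) is None:
            return match.group(0)
        return replacement
    return TAGGED_OR_BARE.sub(substitute, text)

sample = "[url]http://www.google.com[/url] <br><br> http://www.google.com"
result = replace_bare_links(sample)
print(result)
```

Because the tagged span is consumed as a single match, the URL inside it is never offered to the bare-URL alternative, so no lookbehind or lookahead is needed.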
Final working regexp looks like:
(?<!\[img\]|\[url\])((^|\s)([\w-]+://|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[\/img\]|\[/url\])
Example:
<?php
$text = "
[img]http://google.com/logo.jpg[/img]
[img]www.google.com/logo.jpg[/img]
[img]http://www.google.com/logo.jpg[/img]
[url]http://google.com/logo.jpg[/url]
[url]www.google.com/logo.jpg[/url]
[url]http://www.google.com/logo.jpg[/url]
www.google.com/logo.jpg
http://google.com/logo.jpg
http://www.google.com/logo.jpg
";
$text = nl2br($text);
$text = preg_replace("'(?<!\[img\]|\[url\])((^|\s)([\w-]+://|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[\/img\]|\[/url\])'i","<font color=\"#ff0000\">link</font>",$text);
echo $text;
?>
outputs:
[img]http://google.com/logo.jpg[/img]
[img]www.google.com/logo.jpg[/img]
[img]http://www.google.com/logo.jpg[/img]
[url]http://google.com/logo.jpg[/url]
[url]www.google.com/logo.jpg[/url]
[url]http://www.google.com/logo.jpg[/url]
link
link
link
The trick is to replace only links that are preceded by ^ or \s. I found no other way to solve this issue.
Where's my mistake?
Well, the worst mistake is the lookbehind. It isn't needed, and it's making the job much harder than it needs to be. Assuming the existing tags are well formed, you needn't bother looking for the opening tag; its presence is implied by the presence of the closing tag.
EDIT: Your regex has several other problems besides the lookbehind, but it didn't seem worthwhile to try and fix it. Instead, I grabbed a regex from RegexBuddy's built-in library of useful regexes, and added the lookahead to it.
Try this regex (or see it in action on ideone):
'_\b(?>
(?>www\.|ftp\.|(?:https?|ftp|file)://) # scheme or subdomain
[-+&@#/%=~|$?!:,.\w]*[+&@#/%=~|$\w] # everything else
)(?!\[/(?:img|url)\])
_x'
Just because a problem can be described in terms of looking forward or backward, preceding or following, etc., doesn't mean you should design the regex that way. Lookbehind in particular should never be the first tool you reach for.