Except URL regex - regex

Sigh, regex trouble again.
I have the following in $text:
[img]http://www.site.com/logo.jpg[/img]
and
[url]http://www.site.com[/url]
I have this regular expression:
$text = preg_replace("/(?<!(\[img\]|\[url\]))([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9](?!(\[\/img\]|\[\/url\]))/","there was link",$text);
The point is to replace a URL only if it is not preceded by [img] or [url] and not followed by [/img] or [/url]. For the previous example, the output is:
there was link
and
there was link
The URL regex and the lookbehind/lookahead regexps each work fine separately.
$text = "[img]bash.org/logo.jpg[/img]";
$text = preg_replace("/(?<!(\[img\]|\[url\]))bash.org(?!(\[\/img\]|\[\/url\]))/","there was link",$text);
echo $text leaves everything as is and gives me [img]bash.org/logo.jpg[/img]
I suppose the problem is in combination of lookarounds and URL regex. Where's my mistake?
I WANT TO
replace http://www.google.com with "there was link", but leave as is "[url]http://www.google.com[/url]"
I'M GETTING
http://www.google.com replaced with "there was link" and [url]http://www.google.com[/url] replaced with "there was link"
HERE'S PHP CODE TO TEST
<?php
$text = "[url]http://www.google.com[/url] <br><br> http://www.google.com";
// should NOT be changed //should be changed
$text = preg_replace("/(?<!\[url\])([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9](?!\[\/url\])/","there was link",$text);
echo $text;
echo '<hr width="100%">';
$text = ":) :-) 0:) 0:-) :)) :-))";
$text = preg_replace("/(?<!0):-?\)(?!\))/","smiley",$text);
echo $text; // lookarounds work
echo '<hr width="100%">';
$text = "http://stackoverflow.com/questions/2482921/regexp-exclusion";
$text = preg_replace("/([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9]/","it's a link to stackoverflow",$text);
echo $text; // URL pattern works fine
?>

Assuming I'm understanding you, you wish to replace all URLs in your $text with the words 'there was link', unless the URL is within either the url or img bbcode tags. The lookaround assertions appear not to work because the tags themselves are being consumed by your very greedy URL pattern (which I'm fairly sure does lots of things you don't mean it to). Writing a pattern that will match any valid URL (including the query string) within other text, and that will not also match the tags attached to it, is not a simple matter, especially since your current pattern makes the http:// or ftp:// optional.
The only way you are likely to gain any success is to decide on a strict set of rules that constitute a url.
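To make the point about greediness concrete, here is a minimal Python sketch. It is not the asker's PHP pattern; `greedy` is a deliberately simplified stand-in that keeps only the unanchored `\S+` and the two lookarounds. It suggests that the match simply starts on the `[` of the opening tag, so neither lookaround ever gets a chance to reject it.

```python
import re

text = "[url]http://www.site.com[/url]"

# Simplified stand-in for the question's pattern (an assumption, not the
# original): \S+ can begin anywhere, including on the "[" of the tag.
greedy = re.compile(r"(?<!\[url\])\S+\.com\S*(?!\[/url\])")

match = greedy.search(text)

# The match begins at offset 0, where nothing precedes it, so the lookbehind
# passes. It ends at the end of the string, so the lookahead passes as well.
print(match.group(0))
```

The tags end up inside the match instead of around it, which is why the whole tagged block gets replaced.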

It is tough to fully understand your question, but it looks like you're doing reverse BBcode. So, leave it alone if it's surrounded by tags? If that is the case, then I think you will have an interesting problem on your hands because URL regexes are notoriously complex.
I think you may be making this more complex than it needs to be. Instead, I would change anything that is between the BBcode. Here's what I think needs to happen:
find the string segment "[url]"
capture anything that follows it
end the capture when the string segment "[/url]" is seen
That is an easy regex:
$text = "[url]http://www.google.com[/url] <br><br> http://www.google.com";
$replace = "there was link";
// Lazily match everything from "[url]" up to the next "[/url]"
$regex = "#\[url\].*?\[/url\]#";
$text = preg_replace($regex, $replace, $text);
echo $text;
I know this isn't exactly what you asked for (in fact, probably the exact opposite), but it would achieve the same result and be much easier.
You can probably try using negative lookaheads with this regex, but I am not sure it would give you proper results:
$regex = "#(?!\[url\])(.*)(?!\[/url\])#";
One important note: This does not sanitize user input. Make sure you do this, but I would separate the logic so it is very easy to see what you are doing and where you are doing it. I would also use a library to do this because it's easier and probably safer.

Final working regexp looks like:
(?<!\[img\]|\[url\])((^|\s)([\w-]+://|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[\/img\]|\[/url\])
Example:
<?php
$text = "
[img]http://google.com/logo.jpg[/img]
[img]www.google.com/logo.jpg[/img]
[img]http://www.google.com/logo.jpg[/img]
[url]http://google.com/logo.jpg[/url]
[url]www.google.com/logo.jpg[/url]
[url]http://www.google.com/logo.jpg[/url]
www.google.com/logo.jpg
http://google.com/logo.jpg
http://www.google.com/logo.jpg
";
$text = nl2br($text);
$text = preg_replace("'(?<!\[img\]|\[url\])((^|\s)([\w-]+://|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[\/img\]|\[/url\])'i","<font color=\"#ff0000\">link</font>",$text);
echo $text;
?>
outputs:
[img]http://google.com/logo.jpg[/img]
[img]www.google.com/logo.jpg[/img]
[img]http://www.google.com/logo.jpg[/img]
[url]http://google.com/logo.jpg[/url]
[url]www.google.com/logo.jpg[/url]
[url]http://www.google.com/logo.jpg[/url]
link
link
link
The trick is to replace only links that start at ^ or after \s. No other way to solve this issue was found.

Where's my mistake?
Well, the worst mistake is the lookbehind. It isn't needed, and it's making the job much harder than it needs to be. Assuming the existing tags are well formed, you needn't bother looking for the opening tag; its presence is implied by the presence of the closing tag.
EDIT: Your regex has several other problems besides the lookbehind, but it didn't seem worthwhile to try and fix it. Instead, I grabbed a regex from RegexBuddy's built-in library of useful regexes, and added the lookahead to it.
Try this regex (or see it in action on ideone):
'_\b(?>
(?>www\.|ftp\.|(?:https?|ftp|file)://) # scheme or subdomain
[-+&@#/%=~|$?!:,.\w]*[+&@#/%=~|$\w] # everything else
)(?!\[/(?:img|url)\])
_x'
Just because a problem can be described in terms of looking forward or backward, preceding or following, etc., doesn't mean you should design the regex that way. Lookbehind in particular should never be the first tool you reach for.
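Another way to avoid lookbehind altogether, sketched here in Python, is to let the regex match complete tagged blocks first and hand them back unchanged, so that only bare URLs reach the replacement. This is a different technique from the regex above, not a translation of it. The bare-URL branch is deliberately simplified, and `PATTERN` and `replace_bare_urls` are names made up for the illustration.

```python
import re

PATTERN = re.compile(
    r"(\[(img|url)\].*?\[/\2\])"                 # group 1: a whole tagged block
    r"|(?:https?://|ftp://|www\.)[^\s\[\]<>]+",  # otherwise: a bare URL (simplified)
    re.IGNORECASE | re.DOTALL,
)

def replace_bare_urls(text, replacement="there was link"):
    """Replace URLs that are not wrapped in [img] or [url] tags."""
    def substitute(match):
        # Tagged blocks are returned untouched; everything else is a bare URL.
        if match.group(1) is not None:
            return match.group(1)
        return replacement
    return PATTERN.sub(substitute, text)

sample = "[url]http://www.google.com[/url] <br><br> http://www.google.com"
print(replace_bare_urls(sample))
```

Because the tagged block is consumed as one match, the URL inside it is never offered to the second branch, so no lookaround is needed.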

Related

Regex to capture an URL

I've extracted an URL from a website in this string form:
#{href=http://download.company.net/file.exe}[0]
I can't figure out pattern how to get this part out of it: http://download.company.net/file.exe so I can use it as URL to download file.
From my point of view the logic would be that I need to first match "http" as the beginning of a string, a wildcard in between, and then match "}", but not include it in the final output. So IDK ...[http]*\} (I know that this "syntax" of mine is totally wrong, but you get the idea)
The reason I don't want to include "exe" in the pattern is that the file extension could be "msi" and I want it to be more universal. Also, a good and comprehensive PS regex article would help me greatly (with inexperience in mind). I really didn't find any that were "newbie friendly" or comprehensive enough to understand this topic.
You can either, use [regex]::match or -replace.
In the following example, I capture everything after href= that is not a closing curly bracket }:
'#{href=http://download.company.net/file.exe}[0]' -replace '#{href=([^}]+).*', '$1'
Output:
http://download.company.net/file.exe
I'd use -cmatch or -imatch as
if ($content -imatch '(?<=href=).*(?=})') {
$result = $matches[0]
} else {
$result = ''
}
In case of test data, it will return
http://download.company.net/file.exe
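The two answers differ when the input contains more than one `}`. The Python sketch below (not PowerShell, and using a made-up second input string `raw_two_braces`) compares the negated character class with the greedy `.*` plus lookahead. It assumes the same regex semantics apply, which holds for this simple pattern.

```python
import re

raw = "#{href=http://download.company.net/file.exe}[0]"
raw_two_braces = "#{href=http://a/b.msi}[0] #{x}"  # hypothetical input

NEGATED = re.compile(r"href=([^}]+)")         # stops at the first "}"
GREEDY = re.compile(r"(?<=href=).*(?=\})")    # runs on to the last "}"

url = NEGATED.search(raw).group(1)
first_brace = NEGATED.search(raw_two_braces).group(1)
last_brace = GREEDY.search(raw_two_braces).group(0)

print(url)
print(first_brace)
print(last_brace)
```

On the original sample both approaches agree. They only diverge when something containing `}` follows the link, in which case the negated class is the safer choice.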

Regex gets more result then in text available

I have a really weird problem: I'm searching for URLs on an HTML page and want only a specific part of the URL. In my test HTML page the link occurs only once, but instead of one result I get about 20...
this is my regex im using:
perl -ne 'm/http\:\/\myurl\.com\/somefile\.php.+\/afolder\/(.*)\.(rar|zip|tar|gz)/; print "$1.$2\n";'
sample input would be something like this:
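<html><body><a href="http://myurl.com/somefile.php?x=foo&y=bla&z=sdf&path=/foo/bar/afolder/testfile.zip&more=arguments&and=evenmore">Somelinknme</a></body></html>
This is a very simple example; in reality the link would appear on a normal web page with content around it.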
my result should be something like this:
testfile.zip
But instead I see this line many times. Is this a problem with the regex or with something else?
Yes, the regex is greedy.
Use an appropriate tool for HTML instead: HTML::LinkExtor or one of the link methods in WWW::Mechanize, then URI to extract a specific part.
use 5.010;
use WWW::Mechanize qw();
use URI qw();
use URI::QueryParam qw();
my $w = WWW::Mechanize->new;
$w->get('file:///tmp/so10549258.html');
for my $link ($w->links) {
my $u = URI->new($link->url);
# 'http://myurl.com/somefile.php?x=foo&y=bla&z=sdf&path=/foo/bar/afolder/testfile.zip&more=arguments&and=evenmore'
say $u->query_param('path');
# '/foo/bar/afolder/testfile.zip'
$u = URI->new($u->query_param('path'));
say (($u->path_segments)[-1]);
# 'testfile.zip'
}
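For comparison, the URL-handling half of that approach can be sketched with Python's standard library. This only covers pulling the file name out of a link that has already been extracted. It does not parse HTML, and it reuses the sample link from the comments above.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit

link = (
    "http://myurl.com/somefile.php?x=foo&y=bla&z=sdf"
    "&path=/foo/bar/afolder/testfile.zip&more=arguments&and=evenmore"
)

# Split the URL, then decode its query string into a dict of lists.
query = parse_qs(urlsplit(link).query)

# The "path" parameter holds the file location; take its last segment.
path_value = query["path"][0]
file_name = posixpath.basename(path_value)

print(path_value)
print(file_name)
```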
Are there 20 lines following in the file after your link?
Your problem is that the match variables are not reset. Your link matches the first time, and $1 and $2 get their values. On the following lines the regex does not match, but $1 and $2 still hold the old values. Therefore you should print only if the regex matches, not every time.
From perlre, see section Capture Groups
NOTE: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.
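The fix is just a guard around the print. Python does not have Perl's persistent `$1`/`$2`, so the sketch below only illustrates the "act only when this line matched" shape. The `lines` list and the `ARCHIVE` pattern are simplified assumptions, not the asker's data or regex.

```python
import re

# Hypothetical page content, one entry per input line (as perl -n sees it).
lines = [
    "<html><body>",
    '<a href="http://myurl.com/somefile.php?path=/foo/bar/afolder/testfile.zip&more=arguments">Somelinknme</a>',
    "</body></html>",
]

ARCHIVE = re.compile(r'/afolder/([^/&"]+)\.(rar|zip|tar|gz)')

found = []
for line in lines:
    match = ARCHIVE.search(line)
    if match:  # only lines where the pattern matched contribute output
        found.append(f"{match.group(1)}.{match.group(2)}")

print(found)
```

In the Perl one-liner the same shape would be to print only when the match succeeds, rather than printing unconditionally after it.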
This should do the trick for your sample input & output.
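$Str = '<html><body><a href="http://myurl.com/somefile.php?x=foo&y=bla&z=sdf&path=/foo/bar/afolder/testfile.zip&more=arguments&and=evenmore">Somelinknme</a></body></html>';
@Matches = ($Str =~ m#path=.+/(\w+\.\w+)#g);
print @Matches;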

How to remove a part of an URL with regexes?

How can I turn this:
http://site.com/index.php?id=15
Into this?:
http://site.com/index.php?id=
Which RegEx(s) do I use?
I've been trying to do this for a good 2 hours now and I've had no luck.
I can't seem to take out the number(s) at the end, and sometimes there are
letters in the end as well which give me problems.
I am using Bing! instead of Google.
My RegEx so far is this when I search something:
$start = '<h3><a href="';
$end = '" onmousedown=';
while ($result =~ m/$start(.*?)$end/g)
What can I add in there to take out the letters and digits at the end and just leave it as an equal sign?
Thank you.
Since you cannot parse [X]HTML properly with regular expressions, you should look for the minimum possible context that will get you the href you want.
To the best of my knowledge, the one character that cannot be in an href is ". Therefore
/href="([^"]+)"/
Should yield a URL in $1. I would sanity check it for URL-ishness before extracting the id string you want, and then:
s/\?id=\w+/?id=/
But this has hack written all over it, because you can't parse HTML with regular expressions. So it will probably break the first time you demonstrate it to a customer.
You should really check out proper Perl parsing: http://www.google.com/webhp?q=perl+html+parser
You asked for a regular expression solution but your problem is a bit ill-defined and regexes for HTML are only for stop-gap/one-off stuff or else you’re probably just hurting yourself.
Since I am really not positive what your actual need and HTML source look like this is a generic solution to taking a URL and spitting out all the links found on the page without query strings. Having id= is for all reasonable purposes/code equivalent to no id.
There are many ways, at least three or four of them good solutions, to do this in Perl. This is one that is often overlooked: libxml. Docs: XML::LibXML, URI, and URI::QueryParam (if you want better query manipulation).
use warnings;
use strict;
use URI;
use XML::LibXML;
my $source = shift || die "Give a URL!\n";
my $parser = XML::LibXML->new;
$parser->recover(1);
my $doc = $parser->load_html( location => $source );
for my $anchor ( $doc->findnodes('//a[@href]') )
{
my $uri = URI->new_abs( $anchor->getAttribute("href"), $source );
# commented out ideas.
# next unless $uri->host eq "TARGET HOST NAME";
# next unless $uri->path eq "TARGET PATH";
# Clear the query completely; id= might as well be nothing.
$uri->query(undef);
print $uri, $/;
}
It sounds like maybe you’re using Bing! for scraping. This kind of thing is against pretty much every search engine’s ToS. Don’t do it. They have APIs (well, Google does at least) if you register and get a dev token.
I'm not 100% sure what you are doing, but this is the problem:
while ($result =~ m/$start(.*?)$end/g)
What's the purpose of this loop? You're taking a scalar called $result and checking for a pattern match. How is $result changing?
Your original question was how to make this:
http://site.com/index.php?id=15
into this:
http://site.com/index.php?id=
That is, how do you remove the 15 (or another number) from the expression. The answer is pretty simple:
$url =~ s/=\d+$/=/;
That'll anchor your regular expression at the end of the URL replacing the ending digits with nothing.
If you're removing any string, it's a bit more complex:
$url =~ s/=[^=]+/=/;
You can't simply use \S+ because regular expressions are normally greedy. Therefore, you want to specify any series of non-equal sign characters preceded by an equal sign.
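As a rough Python sketch of the same idea: the version below keeps the `id=` key and drops only its value, stopping at `&` or `#` so that later parameters survive. That stopping rule is an assumption beyond what the question asked for, and `blank_id` is a hypothetical helper name.

```python
import re

def blank_id(url):
    """Remove the value of the id parameter but keep 'id=' itself."""
    # Group 1 keeps the "?id=" or "&id=" prefix; the value runs up to the
    # next "&" or "#" (or to the end of the string).
    return re.sub(r"([?&]id=)[^&#]*", r"\1", url)

print(blank_id("http://site.com/index.php?id=15"))
print(blank_id("http://site.com/index.php?id=ab12&page=2"))
```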
Now, as for the while loop, maybe you want an if statement instead...
if ($result =~ /$start(.*?)$end/g) {
print "Doing something if this matched\n";
}
else {
print "Doing something if there's no match\n";
}
And, I'm not sure what this means:
I am using Bing! instead of Google.
Are you trying to parse the input from Bing!? If so, please explain exactly what you're really trying to do. Maybe we know a better way of doing this. For example, if you're parsing the output of a search result, there might be an API that you can use.
How can I turn this:
http://site.com/index.php?id=15
Into this?:
http://site.com/index.php?id=
I think this is the solution you are looking for
#!/usr/bin/perl
use strict;
use warnings;
my $url="http://site.com/index.php?id=15";
$url =~ s/(?<=id=).*//g;
print $url;
Output :
http://site.com/index.php?id=
As you asked, anything after the = sign is removed from the URL.

Odd Perl Regex Behavior with Parens

I'm pulling in some Wikipedia markup and I'm wanting to match the URLs in relative (on Wikipedia) links. I don't want to match any URL containing a colon (not counting the protocol colon), to avoid special pages and the like, so I have the following code:
while ($body =~ m|<a href="(?<url>/wiki/[^:"]+)|gis) {
my $url = $+{url};
print "$url\n";
}
unfortunately, this code is not working quite as expected. Any URL that contains a parenthetical [i.e. /wiki/Eon_(geology)] is getting truncated prematurely just before the opening paren, so that URL would match as /wiki/Eon_. I've been looking at the code for a bit and I cannot figure out what I'm doing wrong. Can anyone provide some insight?
There isn't anything wrong in this code as it stands, so long as your Perl is new enough to support these RE features. Tested with Perl 5.10.1.
$body = <<"__ENDHTML__";
Body Blah blah
Body
__ENDHTML__
while ($body =~ m|<a href="(?<url>/wiki/[^:"]+)|gis) {
my $url = $+{url};
print "$url\n";
}
Are you using an old Perl?
You didn't anchor the RE to the end of the string. Put a " afterwards.
While that is a problem, it isn't the problem he was trying to solve. The problem he was trying to solve was that there was nothing to match the method/hostname (http://en.wiki...) in the RE. Adding a .*? would help that, before the "(?"

Regex to match URL not followed by " or <

I'm trying to modify the url-matching regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls to not match anything that's already part of a valid URL tag or used as the link text.
For example, in the following string, I want to match http://www.foo.com, but NOT http://www.bar.com or http://www.baz.com
www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>
I was trying to add a negative lookahead to exclude matches followed by " or <, but for some reason, it's only applying to the "m" in .com. So, this regex still returns http://www.bar.co and http://www.baz.co as matches.
I can't see what I'm doing wrong... any ideas?
\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))(?!["<])
Here is a simpler example too:
((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])
I looked into this issue last year and developed a solution that you may want to look at - See: URL Linkification (HTTP/FTP) This link is a test page for the Javascript solution with many examples of difficult-to-linkify URLs.
My regex solution, written for both PHP and Javascript - is not simple (but neither is the problem as it turns out.) For more information I would recommend also reading:
The Problem With URLs by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber
The comments following Jeff's blog post are a must read if you want to do this right...
Note also that John Gruber's regex has a component that can go into realm of catastrophic backtracking (the part which matches one level of matching parentheses).
Yeah, it's actually trivial to make it work if you just want to exclude trailing characters: make your expression 'independent' (atomic), and then no backtracking will occur in that segment.
(?>\b ...)(?!["<])
A perl test:
use strict;
use warnings;
my $str = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>http://www.some.com';
while ($str =~ m~
(?>
\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
)
(?!["<])
~xg)
{
print "$1\n";
}
Output:
www.foo.com
http://www.some.com
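The ".co" symptom from the question can be reproduced with a much smaller pattern. The Python sketch below uses a simplified URL pattern (`URL`, an assumption standing in for the long regex) and fixes the problem in a different way from the atomic group above: it also forbids URL characters in the lookahead, so giving back one character never helps the match succeed.

```python
import re

html = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>'

URL = r'(?:https?://|www\.)[\w./-]+'  # simplified stand-in pattern

# Lookahead alone: when the next character is '"' or '<', the engine gives
# back the final "m" so that the lookahead sees "m" instead and passes.
naive = re.findall(URL + r'(?!["<])', html)

# Also reject URL characters after the match, so a shortened match fails too.
strict = re.findall(URL + r'(?![\w./\-"<])', html)

print(naive)
print(strict)
```

The atomic group in the answer achieves the same result by forbidding the backtracking itself, which is cleaner when the regex flavor supports it.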