Regex to find a href links and add nofollow to them

I am trying to write a regex rule that finds all a href HTML links on my webpage and adds rel="nofollow" to them.
However, I have a list of URLs that must be excluded. For example, any internal link (one that contains my domain name, e.g. pokerdiy.com) should be left alone, wildcard-style. I also want to be able to put exact URLs in the exclude list, for example http://www.example.com/link.aspx.
Here is what I have so far, which is not working:
(<a[^>]+)(href="http://.*?(?!(pokerdiy))[^>]+>)
If you need more background/info you can see the full thread and requirements here (skip the top part to get to the meat):
http://www.snapsis.com/Support/tabid/601/aff/9/aft/13117/afv/topic/afpgj/1/Default.aspx#14737

An improvement to James' regex:
(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>
This regex matches links that are NOT in the string array $follow_list. The strings don't need a leading 'www'. :)
The advantage is that this regex preserves the other attributes in the tag (like target, style, title...). If a rel attribute already exists in the tag, the regex will NOT match, so you can force a follow on URLs not in $follow_list by setting a rel attribute on them yourself.
Replace the match with:
$1$2$3"$4 rel="nofollow">
Full example (PHP):
function dont_follow_links( $html ) {
    // follow these websites only!
    $follow_list = array(
        'google.com',
        'mypage.com',
        'otherpage.com',
    );
    // escape regex metacharacters (such as the dots) in each domain
    $quoted = array_map( function ( $domain ) {
        return preg_quote( $domain, '%' );
    }, $follow_list );
    return preg_replace(
        '%(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $quoted).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>%',
        '$1$2$3"$4 rel="nofollow">',
        $html);
}
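For readers who are not on PHP, here is a minimal Python sketch of the same idea. It is not a line-by-line port: the function name add_nofollow is made up for illustration, the domains are escaped with re.escape, and the rel check uses [^>]* instead of .* so that it only looks inside the current tag. It assumes double-quoted href values and a non-empty follow list.

```python
import re

def add_nofollow(html, follow_list):
    """Add rel="nofollow" to absolute links whose host is not in follow_list.

    Assumes follow_list is non-empty and href values are double-quoted.
    """
    # Each followed domain may optionally be prefixed with "www."
    allowed = "|".join(r"(?:www\.)?" + re.escape(d) for d in follow_list)
    pattern = re.compile(
        r'(<a\s(?![^>]*\brel=)[^>]*?)'        # tag start, only if the tag has no rel=
        r'(href="https?://)'                  # absolute http/https link
        r'((?!(?:' + allowed + r'))[^"]+)"'   # URL that does not start with a followed domain
        r'([^>]*)>'                           # remaining attributes
    )
    return pattern.sub(r'\1\2\3"\4 rel="nofollow">', html)

print(add_nofollow('<a title="x" href="http://evil.com/x">E</a>', ["google.com"]))
```

Links to a followed domain, and tags that already carry a rel attribute, are returned unchanged.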
If you want to overwrite rel no matter what, I would use a preg_replace_callback approach where in the callback the rel attribute is replaced separately:
$subject = preg_replace_callback('%(<a\s*[^>]*href="https?://(?:(?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"[^>]*)>%', function($m) {
    return preg_replace('%\srel\s*=\s*(["\'])(?:(?!\1).)*\1(\s|$)%', ' ', $m[1]).' rel="nofollow">';
}, $subject);
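The callback approach can be sketched in Python as well, since re.sub accepts a function as the replacement. The names force_nofollow and REL_ATTR are hypothetical, and the sketch assumes a non-empty follow list and double-quoted href values; it strips any existing rel attribute from the matched tag and then appends rel="nofollow".

```python
import re

# Matches an existing rel attribute with either quote style.
REL_ATTR = re.compile(r'''\srel\s*=\s*(["'])(?:(?!\1).)*\1''')

def force_nofollow(html, follow_list):
    """Overwrite rel on every absolute link whose host is not in follow_list."""
    allowed = "|".join(r"(?:www\.)?" + re.escape(d) for d in follow_list)
    link = re.compile(
        r'(<a\s[^>]*href="https?://(?!(?:' + allowed + r'))[^"]+"[^>]*)>'
    )

    def fix(match):
        # Drop any rel attribute found in the tag, then add the new one.
        tag = REL_ATTR.sub("", match.group(1))
        return tag + ' rel="nofollow">'

    return link.sub(fix, html)

print(force_nofollow('<a rel="external" href="http://evil.com">E</a>', ["google.com"]))
```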

I've developed a slightly more robust version that can detect whether the anchor tag already has "rel=" in it, therefore not duplicating attributes.
(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!blog.bandit.co.nz)[^"]+)"([^>]*)>
Matches
Google
<a title="Google" href="http://google.com">Google</a>
<a target="_blank" href="http://google.com">Google</a>
Google
But doesn't match
<a rel="nofollow" href="http://google.com">Google</a>
Google
Google
Google
Google
<a target="_blank" href="http://blog.bandit.co.nz">Bandit</a>
Replace using
$1$2$3"$4 rel="nofollow">
Hope this helps someone!
James

(<a href="https?://)((?:(?!\b(pokerdiy.com|www\.example\.com/link\.aspx)\b)[^"])+)"
would match the first part of any link that starts with http:// or https:// and doesn't contain pokerdiy.com or www.example.com/link.aspx anywhere in the href attribute. Replace that by
\1\2" rel="nofollow"
If a rel="nofollow" is already present, you'll end up with two of these. And of course, relative links or other protocols like ftp:// etc. won't be matched at all.
Explanation:
(?!\b(foo|bar)\b)[^"] matches any non-" character unless it is possible to match foo or bar at the current location. The \bs are there to make sure we don't accidentally trigger on rebar or foonly.
This whole construct is repeated ((?: ... )+), and whatever is matched is preserved in backreference \2.
Since the next token to be matched is a ", the entire regex fails if the attribute contains foo or bar anywhere.
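To see the explanation above in action, here is a small Python sketch. The function name tag_external is made up, the dots in pokerdiy.com are escaped, and the inner group is made non-capturing; otherwise the pattern follows the one in this answer.

```python
import re

EXTERNAL_LINK = re.compile(
    r'(<a href="https?://)'
    # Consume one character at a time, but only where an excluded
    # string does not start at the current position.
    r'((?:(?!\b(?:pokerdiy\.com|www\.example\.com/link\.aspx)\b)[^"])+)"'
)

def tag_external(html):
    """Add rel="nofollow" to links that contain none of the excluded strings."""
    return EXTERNAL_LINK.sub(r'\1\2" rel="nofollow"', html)

print(tag_external('<a href="http://other.org/page">x</a>'))
print(tag_external('<a href="http://www.pokerdiy.com/forum">x</a>'))
```

The first link gets the attribute; the second is left alone because the repeated group can never reach the closing quote.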

Related

Regex Pattern to Match A Href and Remove

I am trying to create a regex to match all a href links that contain my domain and I will end up removing the links. It is working fine until I run into an a href link that has another HTML tag within the tag.
Regex Statement:
(<a[^<]*coreyjansen\.com[^<]*>)([^"]*?)(<\/a>)
It matches the a href links in this statement with no problem
Need a lawyer? Contact <span style="color: #000000">Random text is great Corey is awesome</span>
It is unable to match both of the a href links in this statement:
<strong><a href="http://coreyjansen.com/"><img class="alignright size-full
wp-image-12" src="http://50h0.com/wp-content/uploads/2014/06/lawyers.jpg"
alt="lawyers" width="250" height="250" /></a>
I have been trying to play with the negated character set, with no luck. If I remove the negated character set, two links that come right after each other, as in example 2, end up being matched as one.
The issue here is that [^<]*> matches everything up to the last >. That is the greedy behaviour of the * quantifier. You can make it non-greedy by appending ? after the asterisk (which you already do in another part of your regex); it will then match everything up to the first occurrence of >. You then have to change the middle part of your regex too, so that it captures everything up to the first closing tag </a>, like this:
(<a[^<]*coreyjansen\.com[^<]*?>)(.*?)(<\/a>)
Use the regex below, which matches only the opening a tag:
(<a[^>]*coreyjansen\.com[^>]*>)
Example data
<strong><a href="http://coreyjansen.com/"><img class="alignright size-full
wp-image-12" src="http://50h0.com/wp-content/uploads/2014/06/lawyers.jpg"
alt="lawyers" width="250" height="250" /><a href="http://coreyjansen.com/"/>
The regex above will match all three a tags with your required domain.
I'm playing with the following regex and it seems to be working:
<a.*coreyjansen\.com.*</a>
It captures everything between anchor tags that contain your site name. I am using JavaScript pattern matching on www.regexpal.com; depending on the language it could be slightly different.
You need to match the start of the tag, <a, and then match the address before the > character. You are matching the wrong character. Once that matches, everything between <a> and </a> is the displayed link. I don't know why you require it not to contain quotes: every tag attribute (in HTML5) has its value inside quotes, so you need to match everything except the closing tag </a>. That is done with ((?!string to not match).)*, which should then be followed by </a>. The resulting regex is:
(<a[^>]*coreyjansen\.com[^>]*>)((?!<\/a>).)*(<\/a>)
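As a rough Python sketch of that last pattern: the function name strip_links is hypothetical, the groups are dropped because the whole link is removed, and re.DOTALL is an addition so that the dot can cross the line breaks inside the img tag from the question.

```python
import re

DOMAIN_LINK = re.compile(
    r'<a\b[^>]*coreyjansen\.com[^>]*>'   # opening tag that mentions the domain
    r'(?:(?!</a>).)*'                    # anything up to the first closing tag
    r'</a>',
    re.DOTALL,
)

def strip_links(html):
    """Remove every anchor (tag and contents) that points at the domain."""
    return DOMAIN_LINK.sub("", html)

sample = (
    'A <a href="http://coreyjansen.com/"><img\nsrc="x.jpg" /></a> '
    'B <a href="http://coreyjansen.com/2">two</a> '
    'C <a href="http://other.com">keep</a>'
)
print(strip_links(sample))
```

Because the middle part stops at the first </a>, two adjacent links are removed separately instead of being swallowed as one match.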

Replace substring of a string using REGEX in Notepad++

I am using Notepad++ and I want to automate the replacement of some strings.
In this case I am going to deal with the a href tag.
So, I will give 3 examples of some lines I have in my code :
01)
<img src="urlurlurlurl" alt="">
02)
<a href="https://url.com" class="logo"><img src="urlurlurlurl" alt="">
</a>
03)
<img src="urlurlurlurl" alt="">
04)
link
So, if I wanted to replace the full a href tag above in all 4 cases, I would use this one : <a href(.*?)a>
Now, I am trying to think of a way to replace the url within the a href tag only.
I tried using that :
href="(?s)(.*?)"|href ="(?s)(.*?)"
and it works fine because I also take into consideration that there might be a space.
But now in the replace window I have to include href=""
Is there a way to make it search for the a href tags and then replace a specific substring of it?
I want to know because there are cases where I have other tags that include a URL and I want to replace it. But a generic replacement for every string enclosed in quotes ("string") would not be good, as I do not want to replace all of them.
You can use a negated character class to match everything in the tag before the URL, like this:
(a[^>]*href\s*=\s*")[^"]*
Replace with the capture group followed by the new string: $1REPLACE_STRING
What does it do?
a[^>]* matches a followed by anything other than a closing >.
href\s*=\s*" matches href=", allowing spaces around the =. Everything up to here is captured in group 1.
[^"]* matches anything other than ". This forms the URL that you want to replace.
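Outside Notepad++, the same capture-and-keep idea can be tried in Python. This is only a sketch: the function name replace_href and the sample URLs are made up, and the pattern is anchored on <a\b rather than a bare a so that it cannot start in the middle of another word.

```python
import re

# Group 1 keeps everything up to and including the opening quote of href.
HREF_VALUE = re.compile(r'(<a\b[^>]*href\s*=\s*")[^"]*')

def replace_href(html, new_url):
    """Swap the URL inside a href while leaving the rest of the tag alone."""
    # A function replacement avoids backslash surprises in new_url.
    return HREF_VALUE.sub(lambda m: m.group(1) + new_url, html)

line = '<a href ="https://url.com" class="logo"><img src="pic.png" alt=""></a>'
print(replace_href(line, "https://new.example/"))
```

The src of the img tag is untouched, because the pattern only looks at href attributes inside a tags.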

Reg Exp: Get string only if it is not between a tags

I am doing a search and replace on some terms, adding a link to each of these words. If a word is already part of another link, I should skip the replacement (otherwise I would end up with <a href...> <a href ...> word </a> </a>, which is something I want to avoid).
I don't know if this is possible, so I'd like to know whether it is and, if so, get any hint. I am kind of lost. So far I am only able to match the words that are part of a link, but not those that are not.
Thanks!
You can do something like this:
$urls = array('word1' => 'http://urlfor.word1.com',
              'word2' => 'http://urlfor.word2.com',
              'word3' => 'http://urlfor.word3.com');
$pattern = '~<(?:a\s.*?</a>|!--.*?(?:-->|$)|[^>]+>)(*SKIP)(*FAIL)|\b(?:word1|word2|word3)\b~sD';
$result = preg_replace_callback($pattern, function($m) use ($urls) {
    // wrap the matched word in a link to its url
    return '<a href="' . $urls[$m[0]] . '">' . $m[0] . '</a>';
}, $html);
$urls is an associative array where the keys are the words and the values are the corresponding URLs.
The pattern uses the (*SKIP)(*FAIL) trick to skip parts that are already between link tags, inside a tag, or in an HTML comment. (Note that you can easily extend the pattern to skip script, style and CDATA content, or to deal with unclosed <a> tags.)
This worked:
~<(?:a\s.*?</a>|[^>]+>)(*SKIP)(*FAIL)|\b(?:ultrices)\b~ig
adding g to get all the matches and not only the first one.
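Python's built-in re module has no (*SKIP)(*FAIL) verbs, so the sketch below swaps in a different but related technique: the parts to skip are matched by the same alternation, and the callback returns them unchanged. The function name link_words is made up, and a non-empty word-to-URL dictionary is assumed.

```python
import re

def link_words(html, urls):
    """Wrap each key of urls in a link, except inside links, tags and comments."""
    words = "|".join(re.escape(w) for w in urls)
    pattern = re.compile(
        r'<a\s.*?</a>'            # an existing link: skip it whole
        r'|<!--.*?(?:-->|\Z)'     # an HTML comment, closed or not
        r'|<[^>]+>'               # any other tag
        r'|\b(' + words + r')\b', # a word to link (group 1)
        re.DOTALL,
    )

    def repl(match):
        word = match.group(1)
        if word is None:
            # One of the skip branches matched: leave the text as it is.
            return match.group(0)
        return '<a href="{}">{}</a>'.format(urls[word], word)

    return pattern.sub(repl, html)

urls = {"word1": "http://urlfor.word1.com"}
print(link_words('word1 <a href="/x">word1</a> <img alt="word1">', urls))
```

Only the first, bare occurrence is wrapped; the one inside the existing link and the one inside the alt attribute are left alone.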

Regex to match content before string

I am trying to extract a URL from content using Yahoo Pipes, but for that I need to match everything before the URL and everything after it:
<div class="medium mode player"><div class="info-header"><a rel="nofollow" target="_blank"
href="http://i1.sndcdn.com/artworks-000059185212-dsb68g-crop.jpg?3eddc42" class="artwork"
style="background:url(http://i1.sndcdn.com/artworks-000059185212-dsb68g-badge.jpg?
3eddc42);">Dream ft. Notorious BIG Artwork</a> <h3><a rel="nofollow" target="_blank"
href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a>
</h3> <span class="subtitle"><span class="user tiny online"><a rel="nofollow"
target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>
The url I want is that one : http://soundcloud.com/tom-misch/dream-ft-notorious-big
I tried to learn a bit about regex but when I think I understand, nothing I try works
Hope some of you can help me on that guys !
cheers
This will probably do. It only matches URLs from soundcloud that use the http protocol and have no subdomain. The group captures the full URL so that you can use it, and a lazy quantifier matches up to the first quote:
(http://soundcloud.*?)"
Here is an alternative that does not use a lazy quantifier; instead it uses a negated character class to match anything but a quote:
(http://soundcloud[^"]+)
Keep in mind that both regexes will actually match both URLs. Depending on the library and the flags you use, you might get only the first occurrence or both; you can just use the first one, or check the results further for the correct format.
If you really want to use just a regex and your regex library supports look-ahead, you can do this:
(http://soundcloud.*?)\s+(?!class="user-name")
The negative look-ahead (?!...) fails the match if the string that follows is class="user-name".
I couldn't find out which regex library Yahoo Pipes uses either. If you want to replace everything around the URL, you can change the regex to:
^.*?(http://soundcloud[^"]+).*$
And use $1 in the replacement string to get the URL back. (Keep in mind that I mixed .*? with [^"]+. That's because I want to replace the whole string with the first URL and not the second one, so I need the first .*? to match up to the start of the first URL and stop; that's what the lazy quantifier is for.)
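Since the regex flavour of Yahoo Pipes is not known here, the following Python sketch only illustrates the "first match versus all matches" point with the negated-class pattern. The function name first_soundcloud_url is made up and the sample is a shortened version of the markup in the question.

```python
import re

SOUNDCLOUD_URL = re.compile(r'http://soundcloud[^"]+')

def first_soundcloud_url(html):
    """Return the first soundcloud URL in html, or None if there is none."""
    match = SOUNDCLOUD_URL.search(html)
    return match.group(0) if match else None

sample = (
    '<a href="http://i1.sndcdn.com/artwork.jpg" class="artwork">Artwork</a> '
    '<h3><a href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream</a></h3> '
    '<a href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>'
)
print(first_soundcloud_url(sample))     # first match only
print(SOUNDCLOUD_URL.findall(sample))   # both soundcloud URLs
```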

Regex for extracting links with specified attributes

I'm trying to build a regex that extracts links from text which do not have rel="nofollow".
Example:
aiusdiua asudauih <a rel="nofollow" hre="http://uashiuadha.asudh/adas>adsaag</a> uhwaida <br> asdgydug <a href="http://asdha.sda/uduih/dufhuis>aguuia</a>
Thanks!
The following regex will do the job:
<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"
The wanted urls will be in the capture group #1. E.g. in Ruby it would be:
if input =~ /<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"/
  match = $~[1]
end
Since it accepts [^>]*? before rel in the negative lookahead, href or anything else can come before rel. If href comes after rel, it'll of course also be ok.
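To collect every such URL rather than just the first, the same pattern can be used with findall. This is a small Python sketch; the function name followed_links and the example.* URLs are made up for illustration.

```python
import re

# The lookahead scans the rest of the tag, so rel may come before or after href.
FOLLOWED_LINK = re.compile(r'<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"')

def followed_links(text):
    """Return the href of every a tag that has no rel="nofollow"."""
    return FOLLOWED_LINK.findall(text)

text = (
    '<a rel="nofollow" href="http://a.example/x">a</a> <br> '
    '<a href="http://b.example/y">b</a> '
    '<a href="http://c.example/z" rel="nofollow">c</a>'
)
print(followed_links(text))
```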
Try this
<(?:A|AREA)\b[^<>]*?(?!rel="nofollow")[^<>]*?href=['"]([^>"]*)[^>]*?>
If you are using .NET regex, then:
<(?:A|AREA)\b[^<>]*?(?!rel="nofollow")[^<>]*?href=['"](?<URL>[^>"]*)[^>]*?>
The data is in the group named URL, or in group 1.