I want to delete the Google prefix in all URLs.
<a href="http://news.google.com/news/url?sa=t&fd=R&ct2=en&usg=YFo&url=http://www.goo.tv/gd/2015/0509/735557.html
dfgdfgdfgdfgdf9
<a href="http://news.google.com/news/url?sa=t&fd=R&ct2=en&usg=AFQjCNFUS_UVkd9L-r7g&clid=c3878e0698331&cid=5213281008&ei=5DFNVJ4eymQLmyYFo&url=http://www.goo.tv/gd/2015/0509/735557.html
I want to remove the Google prefix http://news.google.com/news/url?sa=t&fd=R&ct2=en&blalba....url=
so that only the real URL remains.
I tried the regex below, but instead of matching each prefix separately, it matches everything from the first prefix to the last &url=:
<a href="(http:\/\/news.google.com/news/url\?([\s\S]*)&url=)
Use a lazy quantifier:
<a href="(http:\/\/news.google.com\/news\/url\?([\s\S]*?)&url=)
Your regex did not work because the quantifier * is greedy and extends the match to the last &url= in the input. The lazy quantifier *? stops at the first &url= it finds, which is the behavior you want here.
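To make the difference concrete, here is a minimal Python sketch that runs both quantifiers over two links shaped like the ones in the question. The closing quotes, the link text ("first", "second") and the helper name strip_google_prefix are assumptions added for illustration, and the second query string is shortened.

```python
import re

# Sample input modelled on the question; closing quotes and link text are assumed.
html = (
    '<a href="http://news.google.com/news/url?sa=t&fd=R&ct2=en&usg=YFo'
    '&url=http://www.goo.tv/gd/2015/0509/735557.html">first</a>\n'
    'dfgdfgdfgdfgdf9\n'
    '<a href="http://news.google.com/news/url?sa=t&fd=R&ct2=en&usg=AFQjCNFUS_UVkd9L-r7g'
    '&clid=c3878e0698331&url=http://www.goo.tv/gd/2015/0509/735557.html">second</a>\n'
)

# Greedy: [\s\S]* runs to the end of the input and backtracks to the LAST "&url=".
GREEDY = re.compile(r'http://news\.google\.com/news/url\?[\s\S]*&url=')
# Lazy: [\s\S]*? grows one character at a time and stops at the FIRST "&url=".
LAZY = re.compile(r'http://news\.google\.com/news/url\?[\s\S]*?&url=')


def strip_google_prefix(text: str) -> str:
    """Remove every Google redirect prefix, keeping the real URL (hypothetical helper)."""
    return LAZY.sub('', text)


greedy_matches = GREEDY.findall(html)   # one match spanning both links
lazy_matches = LAZY.findall(html)       # one match per prefix
cleaned = strip_google_prefix(html)
```

With the greedy pattern the whole span between the two links is swallowed into a single match, which is the symptom described in the question.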
I am trying to create a regex to match all a href links that contain my domain and I will end up removing the links. It is working fine until I run into an a href link that has another HTML tag within the tag.
Regex Statement:
(<a[^<]*coreyjansen\.com[^<]*>)([^"]*?)(<\/a>)
It matches the a href links in this statement with no problem
Need a lawyer? Contact <span style="color: #000000">Random text is great Corey is awesome</span>
It is unable to match both of the a href links in this statement:
<strong><a href="http://coreyjansen.com/"><img class="alignright size-full
wp-image-12" src="http://50h0.com/wp-content/uploads/2014/06/lawyers.jpg"
alt="lawyers" width="250" height="250" /></a>
I have been trying to play with the negated character set with no luck. If I remove the negated character set, two links that come right after each other, as in example 2, end up being matched as one.
The greedy [^<]* is not the real problem here: that class cannot cross a <, so the first group already stops at the end of the opening tag. The match fails because the middle group ([^"]*?) cannot match a quote, and the nested img tag is full of quoted attributes. Make the quantifiers lazy by appending ? to the asterisk (which you already do in another part of your pattern) and let the middle part match everything up to the first </a>. Note that . does not match a line break by default, so enable your engine's dot-matches-newline option if a link spans several lines:
(<a[^<]*coreyjansen\.com[^<]*?>)(.*?)(<\/a>)
Use the regex below, which matches only the opening a tag:
(<a[^>]*coreyjansen\.com[^>]*>)
Example data
<strong><a href="http://coreyjansen.com/"><img class="alignright size-full
wp-image-12" src="http://50h0.com/wp-content/uploads/2014/06/lawyers.jpg"
alt="lawyers" width="250" height="250" /><a href="http://coreyjansen.com/"/>
The regex above matches every a tag in the example that contains your required domain.
You can try it in an online regex tester.
I'm playing with the following regex and it seems to be working:
<a.*coreyjansen\.com.*</a>
It captures everything between anchor tags that contain your site name. I tested it with the JavaScript pattern matching at www.regexpal.com; depending on the language, the syntax could differ slightly. Be aware that .* is greedy, so two links on the same line will be matched as one.
You need to match the start of the tag, <a, and then the address up to the closing > character, so the negated class should exclude > rather than <. Everything between the opening tag and </a> is the displayed link content. I don't see why you exclude quotes there: attribute values of nested tags are usually quoted, so what you actually need is to match everything up to the closing </a>. That is done with ((?!string to not match).)*, followed by </a>. The resulting regex is:
(<a[^>]*coreyjansen\.com[^>]*>)((?!<\/a>).)*(<\/a>)
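As a rough Python check of this pattern, the sketch below runs it over the image link from the question plus two made-up links (the /about link and the example.org link are assumptions for illustration). Two small adjustments are assumed as well: re.DOTALL lets . cross the line breaks inside the img tag, and the middle part is wrapped in one capturing group so that group 2 holds the whole link content rather than just its last character.

```python
import re

# The first link is the one from the question; the other two are hypothetical.
html = (
    '<strong><a href="http://coreyjansen.com/"><img class="alignright size-full\n'
    'wp-image-12" src="http://50h0.com/wp-content/uploads/2014/06/lawyers.jpg"\n'
    'alt="lawyers" width="250" height="250" /></a></strong> and '
    '<a href="http://coreyjansen.com/about">About</a> or '
    '<a href="http://example.org/">elsewhere</a>'
)

ANCHOR = re.compile(
    r'(<a[^>]*coreyjansen\.com[^>]*>)'   # opening tag that mentions the domain
    r'((?:(?!</a>).)*)'                  # any characters, as long as </a> does not start here
    r'(</a>)',                           # the first closing tag
    re.DOTALL,                           # let . match the newlines inside the img tag
)

matches = [m.group(0) for m in ANCHOR.finditer(html)]
stripped = ANCHOR.sub('', html)          # drop the matched links entirely
```

Because the lookahead is checked before every character, the middle group can never run past the first </a>, so adjacent links stay separate matches.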
I need to match only the first occurrence of an HTML link with 'data-{someData}' attributes. I've written a regex like the one below:
\<a\s+(.+)\s+data-\s*(.+)\s*>(.+)<\/a>
and it works for a piece of HTML with only one link, like:
SOME TEXT/HTML
<a href="~/link.aspx?_id=B0B5056BD5984878BEB5C92AF6B74DB3&_z=z"
data-dms="{6782B150-F6FA-49E6-A2FF-6D6014470373}"
data-targetid="{B0B5056B-D598-4878-BEB5-C92AF6B74DB3}"
data-dms-event="Content button">Link1
</a>
SOME TEXT/HTML
The problem arises when the HTML contains more links: the regex then matches up to the last occurrence of </a>. So from the HTML below:
SOME TEXT/HTML
<a href="~/link.aspx?_id=B0B5056BD5984878BEB5C92AF6B74DB3&_z=z"
data-dms="{6782B150-F6FA-49E6-A2FF-6D6014470373}"
data-targetid="{B0B5056B-D598-4878-BEB5-C92AF6B74DB3}"
data-dms-event="Content button">Link1
</a>
SOME TEXT/HTML
<a href="~/link.aspx?_id=1256272320C4429DAB8A1F40D429C841&_z=z"
data-dms="{6782B150-F6FA-49E6-A2FF-6D6014470373}"
data-targetid="{12562723-20C4-429D-AB8A-1F40D429C841}"
data-dms-event="Content button">Link2
</a>
SOME TEXT/HTML
I need to fix my regex to match only:
<a href="~/link.aspx?_id=B0B5056BD5984878BEB5C92AF6B74DB3&_z=z"
data-dms="{6782B150-F6FA-49E6-A2FF-6D6014470373}"
data-targetid="{B0B5056B-D598-4878-BEB5-C92AF6B74DB3}"
data-dms-event="Content button">Link1
</a>
First off, have you looked at options other than regexp? Regexp is not the ideal tool for parsing HTML. If your language has a DOM or HTML parser, you should be able to extract the needed tag with it.
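As one possible illustration of the parser route, here is a small sketch using Python's standard html.parser module. The class name FirstDataLink and the abbreviated second link are assumptions; the question does not say which language or parser is available.

```python
from html.parser import HTMLParser


class FirstDataLink(HTMLParser):
    """Collects the attributes and text of the first <a> that has a data-* attribute."""

    def __init__(self):
        super().__init__()
        self.attrs = None       # attributes of the first matching link, once found
        self.text = ""          # text content of that link
        self._inside = False    # True between the matching <a ...> and its </a>
        self._done = False      # True once the first matching link has been closed

    def handle_starttag(self, tag, attrs):
        has_data = any(name.startswith("data-") for name, _ in attrs)
        if tag == "a" and has_data and not self._done and not self._inside:
            self.attrs = dict(attrs)
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self.text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self._inside = False
            self._done = True


html = '''SOME TEXT/HTML
<a href="~/link.aspx?_id=B0B5056BD5984878BEB5C92AF6B74DB3&_z=z"
data-dms="{6782B150-F6FA-49E6-A2FF-6D6014470373}"
data-targetid="{B0B5056B-D598-4878-BEB5-C92AF6B74DB3}"
data-dms-event="Content button">Link1
</a>
SOME TEXT/HTML
<a href="~/link.aspx?_id=1256272320C4429DAB8A1F40D429C841&_z=z"
data-dms-event="Content button">Link2
</a>
'''

parser = FirstDataLink()
parser.feed(html)
parser.close()
link_text = parser.text.strip()
```

The parser never has to guess where a tag ends, so the greedy-versus-lazy question does not come up at all.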
That said, if you need to use regexp, there are two ways to get around the problem you are facing.
The first, and in general the preferable, solution is to be more restrictive in what you match. Rather than matching any character with ., match only the characters that can legally appear at that point, using a negated character class such as [^>].
The second is to use lazy (non-greedy) matching rather than greedy matching. This is done by adding ? after your quantifiers, i.e. replace + with +? and * with *?. With lazy matching the regexp stops at the first possible match rather than the last.
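Both suggestions can be sketched in Python against the two-link sample. This assumes the original pattern was run with a dot-matches-newline option (re.DOTALL here), since the links span several lines; the names GREEDY, RESTRICTIVE and LAZY are just labels for this example, and the second link is a stand-in built from the first.

```python
import re

FIRST = (
    '<a href="~/link.aspx?_id=B0B5056BD5984878BEB5C92AF6B74DB3&_z=z"\n'
    'data-dms="{6782B150-F6FA-49E6-A2FF-6D6014470373}"\n'
    'data-targetid="{B0B5056B-D598-4878-BEB5-C92AF6B74DB3}"\n'
    'data-dms-event="Content button">Link1\n'
    '</a>'
)
SECOND = FIRST.replace('Link1', 'Link2')  # stand-in for the second link
html = 'SOME TEXT/HTML\n' + FIRST + '\nSOME TEXT/HTML\n' + SECOND + '\nSOME TEXT/HTML\n'

# The pattern from the question: every .+ is greedy, so it runs to the last </a>.
GREEDY = re.compile(r'<a\s+(.+)\s+data-\s*(.+)\s*>(.+)</a>', re.DOTALL)
# Option 1: restrict what may appear inside the tag ([^>]) and in the link text ([^<]).
RESTRICTIVE = re.compile(r'<a\s[^>]*\bdata-[^>]*>[^<]*</a>')
# Option 2: keep . but make each quantifier lazy.
LAZY = re.compile(r'<a\s+.+?\s+data-.+?>.+?</a>', re.DOTALL)

greedy_match = GREEDY.search(html).group(0)
restrictive_match = RESTRICTIVE.search(html).group(0)
lazy_match = LAZY.search(html).group(0)
```

The restrictive version needs no flag at all, because a negated class such as [^>] matches line breaks on its own.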
I'd like to match URLs that don't end in /, to use it in Dreamweaver's find tool.
What regex could I use?
For example, I'd like the following URL to be matched:
<a href="http://www.sometext"
You can do it with this simple regex:
href=".+?[^/]"
Explanation:
It will match href="________X", where X != /.
The following will match:
<a href="http://some-url.com">
<a href="http://www.another-url-here.com/content">
These ones won't:
<a href="http://www.url.com/">
<a href="http://www.url-2.com/posts/2014/">
Edit:
The following also allows whitespace between the equals sign and the opening quote, as in <a href= "http://www.url.com">:
href=\s*".+?[^/]"
Sure. You can use [^/]" at the end of your link expression to match any non-slash followed by a close-quote.
Maybe with this
href\s*=\s*"[^"]*[^"/\s]\s*"
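A quick Python comparison of the first and the last pattern above, under the assumption that Dreamweaver's find tool treats them the same way as Python's re module. The fifth sample (a trailing-slash URL followed by a class attribute) is made up to show where the two differ: .+? is allowed to run past the closing quote of href, while [^"]* is not.

```python
import re

SIMPLE = re.compile(r'href=".+?[^/]"')
STRICT = re.compile(r'href\s*=\s*"[^"]*[^"/\s]\s*"')

samples = [
    '<a href="http://some-url.com">',
    '<a href="http://www.another-url-here.com/content">',
    '<a href="http://www.url.com/">',
    '<a href="http://www.url-2.com/posts/2014/">',
    '<a href="http://www.url.com/" class="nav">',   # hypothetical extra case
]

simple_hits = [s for s in samples if SIMPLE.search(s)]
strict_hits = [s for s in samples if STRICT.search(s)]
```

Here the simple pattern also reports the fifth sample, because its match ends at the quote that closes class="nav"; the stricter pattern reports only the first two.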
I am trying to extract a URL from content using Yahoo Pipes, but for that I need to match everything before the URL and everything after it:
<div class="medium mode player"><div class="info-header"><a rel="nofollow" target="_blank"
href="http://i1.sndcdn.com/artworks-000059185212-dsb68g-crop.jpg?3eddc42" class="artwork"
style="background:url(http://i1.sndcdn.com/artworks-000059185212-dsb68g-badge.jpg?
3eddc42);">Dream ft. Notorious BIG Artwork</a> <h3><a rel="nofollow" target="_blank"
href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a>
</h3> <span class="subtitle"><span class="user tiny online"><a rel="nofollow"
target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>
The URL I want is this one: http://soundcloud.com/tom-misch/dream-ft-notorious-big
I tried to learn a bit about regex, but whenever I think I understand it, nothing I try works.
I hope some of you can help me with this!
Cheers
This will probably do. It only matches SoundCloud URLs that use the http protocol and have no subdomain. The group captures the full URL so that you can use it, and a lazy quantifier matches up to the first quote:
(http://soundcloud.*?)"
Here is an alternative that does not use a lazy quantifier; instead it uses a negated character class to match anything but a quote:
(http://soundcloud[^"]+)
Keep in mind that both regexes match both URLs. Depending on the library and the flags you use, you may get only the first occurrence or both; you can simply take the first one, or check the results further for the correct format.
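For reference, this is how the two patterns behave in Python's re module on the relevant part of the question's markup. It is only a sketch: Yahoo Pipes may use a different regex engine.

```python
import re

# The last two links from the question's snippet.
html = (
    '<h3><a rel="nofollow" target="_blank"\n'
    'href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a>\n'
    '</h3> <span class="subtitle"><span class="user tiny online"><a rel="nofollow"\n'
    'target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>'
)

LAZY = re.compile(r'(http://soundcloud.*?)"')       # lazy, stops at the first quote
NEGATED = re.compile(r'(http://soundcloud[^"]+)')   # negated class, cannot cross a quote

lazy_urls = LAZY.findall(html)          # every occurrence, in document order
negated_urls = NEGATED.findall(html)
first_url = NEGATED.search(html).group(1)   # search() returns only the first occurrence
```

findall returns both URLs, track first and profile second, while search stops at the track URL.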
If you really want to do it with a single regex and your regex library supports look-ahead, you can do this:
(http://soundcloud[^"]+)"(?!\s+class="user-name")
The negative look-ahead (?!...) makes the match fail if the closing quote is followed by class="user-name", so the user-profile URL is skipped.
I also couldn't find out which regex library Yahoo Pipes uses. If you want to replace everything around the URL, you can change the regex to:
^.*?(http://soundcloud[^"]+).*$
Then use $1 in the replacement string to get the URL back. Note that I mixed .*? with [^"]+ on purpose: I want to replace the whole string with the first URL and not the second one, so the leading .* has to match only up to the first URL and stop there, which is what the lazy quantifier is for.
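The same replacement can be sketched in Python, where the back-reference is written \1 instead of $1. Two assumptions: the markup below is an abbreviated copy of the question's snippet (the style attribute is dropped), and the engine needs a dot-matches-newline option because the content spans several lines; whether Yahoo Pipes offers such an option is not known here.

```python
import re

# Abbreviated copy of the question's markup (style attribute omitted).
html = (
    '<div class="medium mode player"><div class="info-header"><a rel="nofollow" target="_blank"\n'
    'href="http://i1.sndcdn.com/artworks-000059185212-dsb68g-crop.jpg?3eddc42" class="artwork">'
    'Dream ft. Notorious BIG Artwork</a> <h3><a rel="nofollow" target="_blank"\n'
    'href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a>\n'
    '</h3> <span class="subtitle"><span class="user tiny online"><a rel="nofollow"\n'
    'target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>'
)

PATTERN = r'^.*?(http://soundcloud[^"]+).*$'

# With DOTALL, . also matches newlines, so the pattern covers the whole input.
url = re.sub(PATTERN, r'\1', html, flags=re.DOTALL)

# Without the flag, .*? cannot get past the first line break, nothing matches,
# and the input comes back unchanged.
unchanged = re.sub(PATTERN, r'\1', html)
```

The lazy prefix skips the i1.sndcdn.com artwork link because the group only accepts text starting with http://soundcloud.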
I'm trying to build a regex to extract links from text that do not have rel="nofollow".
Example:
aiusdiua asudauih <a rel="nofollow" href="http://uashiuadha.asudh/adas">adsaag</a> uhwaida <br> asdgydug <a href="http://asdha.sda/uduih/dufhuis">aguuia</a>
Thanks!
The following regex will do the job:
<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"
The wanted URL will be in capture group 1. For example, in Ruby:
if input =~ /<a (?![^>]*?rel="nofollow")[^>]*?href="(.*?)"/
  match = $~[1]  # text captured by the first group
end
Because the negative lookahead allows [^>]*? before rel, a nofollow link is rejected whether href (or anything else) comes before rel or after it.
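Here is the same pattern as a Python sketch that collects every matching URL at once. The sample is the question's text with href spelled out and the closing quotes of the attribute values assumed; the third link (example.com, with rel after href) is made up to exercise the case described above.

```python
import re

text = (
    'aiusdiua asudauih <a rel="nofollow" href="http://uashiuadha.asudh/adas">adsaag</a> '
    'uhwaida <br> asdgydug <a href="http://asdha.sda/uduih/dufhuis">aguuia</a> '
    '<a href="http://example.com/x" rel="nofollow">y</a>'   # hypothetical: rel after href
)

FOLLOWED = re.compile(
    r'<a (?![^>]*?rel="nofollow")'   # reject the tag if rel="nofollow" appears before its >
    r'[^>]*?href="(.*?)"'            # otherwise capture the href value
)

urls = FOLLOWED.findall(text)        # findall returns the contents of group 1
```

Only the middle link survives, since both nofollow links fail the lookahead regardless of attribute order.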
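Try this (with case-insensitive matching enabled, since the tag names are written in upper case):
<(?:A|AREA)\b(?![^<>]*?rel="nofollow")[^<>]*?href=['"]([^>"]*)[^>]*?>
The lookahead has to come directly after the tag name so that it scans the whole tag; placed after a lazy [^<>]*? it would only test a single position and exclude nothing.
If you are using .NET regex, then:
<(?:A|AREA)\b(?![^<>]*?rel="nofollow")[^<>]*?href=['"](?<URL>[^>"]*)[^>]*?>
The data is in the group named URL, or group 1.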