How to match a URL non-greedily with a regex

I am hoping that someone can help me make this match non-greedy. I am using JavaScript and Classic ASP.
.match(/(<a\s+.*?><\/a>)/ig);
The purpose is to extract URLs from a page where the links have the form <a href ></a>.
I need to capture just the URL.
Thanks

Try the following:
.match(/(<a\s+.*?href="(.*?)".*?>)/ig);
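One caveat with the snippet above: when the g flag is set, match() returns only the whole matches, so the captured URL is lost. Below is a minimal sketch of one way around that, assuming a runtime that supports String.prototype.matchAll. The sample html string and the variable names are made up for illustration, and [^>]* is used in place of .*? so that a match cannot run past the end of a tag.

```javascript
// Hypothetical sample input; the real page markup is not shown in the question.
const html = '<a href="http://example.com/a"></a> text <a class="x" href="/b"></a>';

// [^>]*? keeps the match inside a single tag; ([^"]*) captures only the URL.
const anchorHref = /<a\s+[^>]*?href="([^"]*)"[^>]*>/gi;

// matchAll exposes the capture groups of every match, unlike match() with the g flag.
const urls = Array.from(html.matchAll(anchorHref), (m) => m[1]);

console.log(urls);
```

Each element of urls is the content of the href attribute only, without the surrounding tag.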

Related

Verify whether an href has a valid URL in XSLT 1.0

I need to validate the URL inside the href attribute. If the URL is valid, do nothing; otherwise, remove the href attribute from the <a> tag. Any general regex or other kind of URL validation that checks the href is fine.
Example:
tinyurl
valid url
invalid url
Result:
<a rel="nofollow">tinyurl</a>
valid url
<a rel="nofollow">invalid url</a>
Thanks in advance. Any clue/help given is appreciated.
A regex that may be helpful:
/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+(:[0-9]+)?|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/
Michael Sperberg-McQueen has defined XSD types that match different flavours of URI in
http://www.w3.org/2011/04/XMLSchema/TypeLibrary-URI-RFC3986.xsd
and
http://www.w3.org/2011/04/XMLSchema/TypeLibrary-IRI-RFC3987.xsd
To see the way these complex regular expressions are constructed, view these documents at the raw XML level using (for example) curl.
Regular expressions can be used for pattern matching in XSLT 2.0, but there's no support in XSLT 1.0.

PDF matched incorrectly for image only href/src urls

I am trying to get my regular expression to match any image URL, with certain optional parts.
In the set that matches image file extensions, everything is fine until I add the gif extension. When I do that, the PDF URLs get matched for some reason.
Could anyone shed light on this?
I am using this in PHP with the preg_match_all function.
Rules for matching
Can be either src or href link
Can be relative or absolute link
Protocol can be http or https if given
Select only the link if matched
Case insensitive and global
Pattern (if gif is taken out, the PDFs are skipped)
[src|href]="([(https|http):\/\/]?[^"]*.[jpg|png|jpeg|gif])"
Test strings
Should match <a href="http://blog.mysite.com/wp-content/uploads/2014/04/13061-someimage.jpg">
Should match <a href="/wp-content/uploads/2014/04/13061-someimage.jpg">
No match
No match
Should match <img href="http://blog.mysite.com/wp-content/uploads/2014/04/13061-someimage.jpg"/>
Should match <img href="/wp-content/uploads/2014/04/13061-someimage.gif"/>
Should match <img href="http://blog.mysite.com/wp-content/uploads/2014/04/13061-someimage.jpg" />
Should match <img href="/wp-content/uploads/2014/04/13061-someimage.jpg" />
www.regex101.com fiddle: https://regex101.com/r/x3vVSx/1
Thanks to @Micha Wiedenmann for this.
Quote/Unquote
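You mixed up [ and (. You want (jpg|png|jpeg|gif) instead of [jpg|png|...], and similarly (src|href) instead of [src|href]. Square brackets define a character class, so [jpg|png|jpeg|gif] matches exactly one character out of j, p, g, |, n, e, i and f. Adding gif is what puts f into that set, which is why a URL ending in .pdf suddenly matches: the unescaped . matches the d and the class matches the final f.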
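The difference can be checked with a small sketch. It is written in JavaScript rather than PHP, and the PDF link, the samples array and the helper name extract are made up for illustration; the corrected pattern is one possible fix, with an escaped dot before the extension.

```javascript
// Two test strings from the question, plus a hypothetical PDF link.
const samples = [
  '<a href="http://blog.mysite.com/wp-content/uploads/2014/04/13061-someimage.jpg">',
  '<img href="/wp-content/uploads/2014/04/13061-someimage.gif"/>',
  '<a href="/wp-content/uploads/2014/04/some-document.pdf">', // hypothetical
];

// Original pattern: the square brackets are character classes, not alternations.
const broken = /[src|href]="([(https|http):\/\/]?[^"]*.[jpg|png|jpeg|gif])"/gi;

// Corrected pattern: groups for the alternations and an escaped dot before the extension.
const fixed = /(?:src|href)="((?:https?:\/\/)?[^"]*\.(?:jpg|png|jpeg|gif))"/gi;

// Collect the captured link (group 1) of every match in every sample.
const extract = (re) =>
  samples.flatMap((s) => Array.from(s.matchAll(re), (m) => m[1]));

const brokenLinks = extract(broken);
const fixedLinks = extract(fixed);

console.log(brokenLinks);
console.log(fixedLinks);
```

The original pattern also picks up the PDF link, while the corrected one returns only the two image links. The same corrected pattern should work with preg_match_all once it is wrapped in PHP delimiters and given the i flag.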

Can't seem to capture newline+spaces in Regex

I know regexes aren't the best for web parsing, but I'm using it as an exercise.
I'm using Район:[^<>]*\n\s*<[^<>]*>\n\s*<a[^<>]*>([^<>]+)<\/a>
to try to match:
Район: </span>
<span class="company__contacts-item-text">
<a class="link" href="/moscow/top/marina-roscha/">Марьина роща</a>
I've been looking at it for a while, but I don't know what I'm doing wrong. How can I capture something that has newlines and different URLs in the tags?
Try this regex, with the dotall (s) flag enabled so that . also matches newlines:
Район:.+?<a[^>]+>(.+?)</a>
DEMO
https://regex101.com/r/wA4oH0/1
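The original pattern appears to fail because nothing in it accounts for the </span> that sits between the label and the first newline: [^<>]* stops at the < and the next thing the pattern expects is \n. A minimal JavaScript sketch of the suggested regex follows, assuming a runtime that supports the s (dotAll) flag; the variable names are made up for illustration.

```javascript
// Markup copied from the question.
const html = [
  'Район: </span>',
  '<span class="company__contacts-item-text">',
  '<a class="link" href="/moscow/top/marina-roscha/">Марьина роща</a>',
].join('\n');

// The s (dotAll) flag lets . match newlines; .+? is lazy, so it stops at the first <a ...>.
const district = /Район:.+?<a[^>]+>(.+?)<\/a>/s;

const match = html.match(district);
const name = match ? match[1] : null;

console.log(name);
```

Without the s flag the same pattern finds nothing here, because . cannot cross the line breaks between the label and the link.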

Using a regular expression, how can I get the href URL of an anchor tag that matches a particular class value such as foo?

Can anyone help me with a PHP regex (regular expression)? I want to get all URLs from tags that match a certain attribute. In the following example, I want to get all href URLs from tags that have a class of 'foo'.
<a title="foo" href="http://foo.com/" class="foo">Foo</a>
Bar
<a class="foo" title="foobar" href="http://foobar.com/">FooBar</a>
The result should be the 2 matching URLs:
http://foo.com/
http://foobar.com/
I know this can be done easily using PHP packages such as DOM crawlers, but I want to use PHP RegEx.
See Demo
class="foo"[^>]*href="([^"]*)"[^>]*|href="([^"]*)"[^>]*class="foo"
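[^>]* matches any other attributes inside the same tag, without running past its closing >. The two alternatives cover both orders: class before href, and href before class.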
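Because the URL ends up in group 1 or group 2 depending on which alternative matched, the caller has to pick whichever group is set. The sketch below shows this in JavaScript rather than PHP, with the regex unchanged. The middle link is a hypothetical stand-in for the "Bar" link, and the variable names are made up for illustration.

```javascript
// Sample markup from the question; the middle link is a hypothetical stand-in for "Bar".
const html = [
  '<a title="foo" href="http://foo.com/" class="foo">Foo</a>',
  '<a href="http://bar.com/" class="bar">Bar</a>',
  '<a class="foo" title="foobar" href="http://foobar.com/">FooBar</a>',
].join('\n');

// Two alternatives cover both attribute orders: class before href, or href before class.
const fooHref = /class="foo"[^>]*href="([^"]*)"[^>]*|href="([^"]*)"[^>]*class="foo"/g;

// Only one of the two groups participates in a given match, so take whichever is set.
const urls = Array.from(html.matchAll(fooHref), (m) => m[1] || m[2]);

console.log(urls);
```

With preg_match_all in PHP the same idea applies: the URLs are spread across the first and second capture groups and need to be merged afterwards.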

Regex to match content before string

I am trying to extract a URL from content using Yahoo Pipes, but for that I need to match everything before the URL and everything after it:
<div class="medium mode player"><div class="info-header"><a rel="nofollow" target="_blank"
href="http://i1.sndcdn.com/artworks-000059185212-dsb68g-crop.jpg?3eddc42" class="artwork"
style="background:url(http://i1.sndcdn.com/artworks-000059185212-dsb68g-badge.jpg?
3eddc42);">Dream ft. Notorious BIG Artwork</a> <h3><a rel="nofollow" target="_blank"
href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a>
</h3> <span class="subtitle"><span class="user tiny online"><a rel="nofollow"
target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>
The URL I want is this one: http://soundcloud.com/tom-misch/dream-ft-notorious-big
I tried to learn a bit about regex, but whenever I think I understand, nothing I try works.
I hope some of you can help me with this!
cheers
This will probably do. It only matches SoundCloud URLs that use the http protocol and have no subdomain. The group captures the full URL so that you can use it, and a lazy quantifier matches up to the first quote:
(http://soundcloud.*?)"
Here is an alternative that does not use a lazy quantifier; instead, it uses a negated character class to match anything but a quote:
(http://soundcloud[^"]+)
Keep in mind that both regexes will actually match both URLs. Depending on the library and the flags you use, you might get only the first occurrence or both; you can take just the first one, or check the results further for the correct format.
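To illustrate that the two patterns behave the same on this input, here is a minimal JavaScript sketch. Yahoo Pipes itself is not involved; the markup is a trimmed-down copy of the snippet from the question, and the variable names are made up for illustration.

```javascript
// Trimmed-down version of the markup from the question.
const html =
  '<h3><a rel="nofollow" target="_blank" ' +
  'href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a></h3> ' +
  '<a rel="nofollow" target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>';

const lazy = /(http:\/\/soundcloud.*?)"/g;     // lazy quantifier: stops at the first quote
const negated = /(http:\/\/soundcloud[^"]+)/g; // negated class: cannot cross a quote

// Collect group 1 of every match.
const capture = (re) => Array.from(html.matchAll(re), (m) => m[1]);

const lazyUrls = capture(lazy);
const negatedUrls = capture(negated);

console.log(lazyUrls);
console.log(negatedUrls);
```

Both lists contain the track URL first and the user profile URL second, so taking the first element is enough for this input.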
If you really want to use just a regex and your regex library supports look-ahead, you can do this:
(http://soundcloud[^"]+)"(?!\s+class="user-name")
The negative look-ahead (?!...) makes the match fail if the text right after the closing quote is class="user-name".
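The look-ahead idea can be tried out with the following JavaScript sketch. In this variant the negative look-ahead sits right after the closing quote of the href value, which is a choice made for this sketch. The markup is a trimmed-down copy of the snippet from the question, and the variable names are made up for illustration.

```javascript
// Trimmed-down version of the markup from the question.
const html =
  '<h3><a rel="nofollow" target="_blank" ' +
  'href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a></h3> ' +
  '<a rel="nofollow" target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>';

// Negative look-ahead right after the closing quote: reject a URL whose
// next attribute is class="user-name".
const notUser = /(http:\/\/soundcloud[^"]+)"(?!\s+class="user-name")/g;

const urls = Array.from(html.matchAll(notUser), (m) => m[1]);

console.log(urls);
```

Only the track URL survives, because the profile URL is immediately followed by class="user-name" and [^"]+ cannot backtrack to a different quote.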
I couldn't find out which regex library Yahoo Pipes uses either. If you want to replace everything around the URL, you can change the regex to:
^.*?(http://soundcloud[^"]+).*$
Then use $1 in the replacement string to get the URL back. Note that I mixed .*? with [^"]+ on purpose: I want to replace the whole string with the first URL rather than the second, so the leading .*? must match only up to the first URL and stop there, which is what the lazy quantifier is for.
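A minimal JavaScript sketch of this replace-everything approach follows. It assumes the input may span several lines, so the s (dotAll) flag is added to let . match newlines; whether Yahoo Pipes needs or supports such a flag is not known here. The markup is a shortened copy of the snippet from the question, and the variable names are made up for illustration.

```javascript
// Shortened, multi-line version of the markup from the question.
const html = [
  '<div class="medium mode player"><div class="info-header"><a rel="nofollow" target="_blank"',
  'href="http://i1.sndcdn.com/artworks-000059185212-dsb68g-crop.jpg?3eddc42" class="artwork">Artwork</a>',
  '<h3><a rel="nofollow" target="_blank"',
  'href="http://soundcloud.com/tom-misch/dream-ft-notorious-big">Dream ft. Notorious BIG</a></h3>',
  '<a rel="nofollow" target="_blank" href="http://soundcloud.com/tom-misch" class="user-name">Tom Misch</a>',
].join('\n');

// ^.*? lazily skips to the first soundcloud URL, the group captures it,
// and .*$ swallows the rest; the whole input is replaced by group 1.
const firstUrl = html.replace(/^.*?(http:\/\/soundcloud[^"]+).*$/s, '$1');

console.log(firstUrl);
```

If the leading quantifier were greedy (.* instead of .*?), the capture would land on the last soundcloud URL in the input instead of the first.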