Regex to extract hyperlink containing a specific word - regex

I need to extract a hyperlink containing a specific word in the URL from a piece of text. Example:
"This is a text with a link to some page. Click this link <a href="/server/specificword.htm>this is a link to a page</a> to see that page. Here is a link that doesn't have the word "specificword" in it: <a href="/server/mypage.htm>this is a link without the word "specificword" in the url</a>"
So, I need to parse this text, check the hyperlinks to see if one of them contains the word "specificword", and then extract the entire hyperlink. I would then end up with this:
<a href="/server/specificword.htm>this is a link to a page</a>
I need the hyperlink that has specificword in the URL (e.g. /server/specificword.htm), not in the link text.
One regex I have tried, is this one: /(<a[^>]*>.*?</a>)|specificword/
This will match all hyperlinks in the text, or "specificword". If the text has multiple links, without the word "specificword", I will get those too.
I have also tried this one, but it matches nothing:
<a.*?href\s*=\s*["\']([^"\'>]*specificword[^"\'>]*)["\'][^>]*>.*?<\/a>
My regex skills end here, any help would be great....

Try this to match the whole a tag:
/<a [^>]*\bhref\s*=\s*"[^"]*SPECIFICWORD.*?<\/a>/
or just for the link (in the first capture group):
/<a [^>]*\bhref\s*=\s*"([^"]*SPECIFICWORD[^"]*)/
If you use PHP, for the link:
preg_match_all('/<a [^>]*\bhref\s*=\s*"\K[^"]*SPECIFICWORD[^"]*/', $text, $results);
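For comparison, the same idea can be sketched in Python. This is only a minimal sketch, assuming double-quoted href values and no nested anchors; the function name `links_containing` and the sample text are illustrative and not taken from the question.

```python
import re

def links_containing(text, word):
    """Return (whole_anchor, url) pairs for anchors whose href contains `word`."""
    pattern = re.compile(
        # [^"]* keeps the search for `word` inside the quoted href value,
        # so a match in the link text alone does not count.
        r'<a\s[^>]*\bhref\s*=\s*"([^"]*' + re.escape(word) + r'[^"]*)"[^>]*>.*?</a>',
        re.IGNORECASE | re.DOTALL,
    )
    return [(m.group(0), m.group(1)) for m in pattern.finditer(text)]

# Hypothetical sample: the second link mentions the word only in its text.
sample = (
    'Click <a href="/server/specificword.htm">this is a link to a page</a> '
    'or <a href="/server/mypage.htm">a link about specificword</a>.'
)
result = links_containing(sample, "specificword")
```

Only the first anchor is returned, because the word has to appear between the quotes of the href value.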

This one should suit your needs:
<a href="[^"]*?specificword.*?">.*?</a>
If you want to allow other attributes on your anchor tag, and be more permissive about inner spaces, you could try:
<a( [^>]*?)? href="[^"]*?specificword.*?"( .*?)?>.*?</a>
You could also of course use non-capturing groups (?:...):
<a(?: [^>]*?)? href="[^"]*?specificword.*?"(?: .*?)?>.*?</a>
And finally, if you want to allow single quotes for your href attribute:
<a(?: [^>]*?)? href=(["'])[^\1]*?specificword.*?\1(?: .*?)?>.*?</a>
Last but not least: if you want to capture the URL, just put parentheses around the [^\1]*?specificword.*? part:
<a(?: [^>]*?)? href=(["'])([^\1]*?specificword.*?)\1(?: .*?)?>.*?</a>
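One caveat when porting the quote-agnostic version: in Python's re module (and several other flavors) a \1 inside a character class is read as a numeric character escape, not as a backreference, so [^\1] does not mean "anything except the opening quote". A hedged Python sketch that gets the intended effect with a negative lookahead instead; the names `ANCHOR` and `find_links` and the sample markup are made up for illustration.

```python
import re

# (?:(?!\1).)* consumes characters one at a time until the opening quote recurs,
# which is what [^\1]* was meant to express.
ANCHOR = re.compile(
    r"""<a(?:\s[^>]*?)?\shref=(["'])((?:(?!\1).)*?specificword(?:(?!\1).)*)\1[^>]*>.*?</a>""",
    re.DOTALL,
)

def find_links(text):
    """Return (whole_anchor, url) pairs; group 1 is the quote, group 2 the URL."""
    return [(m.group(0), m.group(2)) for m in ANCHOR.finditer(text)]

# Hypothetical sample mixing single and double quotes and extra attributes.
sample = (
    "<a class='x' href='/a/specificword.htm'>one</a> "
    '<a href="/b/other.htm">two</a> '
    '<a href="/c/specificword.htm" id="y">three</a>'
)
links = find_links(sample)
```

The first and third anchors are returned; the second is skipped because its href value does not contain the word.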

The final regex you tried almost had it. Try this alteration of it:
<a\s.*?href=["']([^"']*?specificword[^"']*?)[^>]*>.*?<\/a>
The main difference is making the quantifiers "lazy".

Try this pattern; it should match exactly what you asked for:
(?=.*href=\"([^\"]*specificword[^"]*)")<a [^>]+>
If you want only the URL value, use Groups[1], like this:
Regex.Match("input string", @"(?=.*href=""([^""]*specificword[^""]*)"")<a [^>]+>").Groups[1].Value;

Related

Regex to find a specific anchor tag that have href with a specific domain and nofollow

I have a string that contains HTML. I want a regex that gets me the anchor tags that link to a specific domain name and have nofollow.
I have found this one, which works for the domain name but does not include the nofollow condition:
(<a\s*(?!.\brel=)[^>])(href="https?://)((?stackoverflow)[^"]+)"([^>]*)>
let's say the domain name I want is stackoverflow
Example:
- "click here " this would match
- "<a href="stackoverflow.com"> would not match since it has no nofollow
- "<a href="google.com" rel = "nofollow"> would not match
It's a bit hard to match an HTML tag with specific conditions, but the following regex should do it:
select regexp_match(str, '<a((?:\s+(([^\/=''"<>\s]+)(=((''[^'']*'')|("[^"]*")|([^\s<>''"=`]+)))?)))* href=((''(https?:\/\/)?stackoverflow\.com[^'']*'')|("(https?:\/\/)?stackoverflow\.com[^"]*"))((?: (([^\/=''"<>\s]+)(=((''[^'']*'')|("[^"]*")|([^\s<>''"=`]+)))?)))*\s+rel=("nofollow"|''nofollow'')((?: (([^\/=''"<>\s]+)(=((''[^'']*'')|("[^"]*")|([^\s<>''"=`]+)))?)))*\/?>') from tes;
It's really hard to read, but basically most of the regex is there for matching attributes. The important thing for you is to find stackoverflow\.com (which can be found 2 times; one for href with single quote and second for double quote) and replace it with whatever domain you need (and don't forget to escape it properly).
Some notes
I don't know which regexp function you want to use, but the pattern should work with whichever one you need. Another thing is that your example click here won't be matched, because you have spaces between the attribute name and the = sign (I don't know if this is valid HTML or not). It will work with this click here . If you need to match tags that might include spaces around the = sign, just leave a comment and I'll try to edit the regex.
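If a regex is not a hard requirement, the same condition can be checked with a parser. The following is a minimal sketch using Python's standard html.parser rather than the regex above; the class name `NofollowLinkFinder`, the helper `find_nofollow_links`, and the sample markup are illustrative assumptions. Treating subdomains as matches is also an assumption about what "specific domain" should mean.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class NofollowLinkFinder(HTMLParser):
    """Collect href values of <a> tags on a given domain that carry rel=nofollow."""

    def __init__(self, domain):
        super().__init__()
        self.domain = domain.lower()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = {name: (value or "") for name, value in attrs}
        href = attr.get("href", "")
        # Bare hosts such as "stackoverflow.com" have no scheme, so prefix "//"
        # to make urlparse read them as a host rather than a path.
        host = urlparse(href if "//" in href else "//" + href).hostname or ""
        rel_tokens = attr.get("rel", "").lower().split()
        on_domain = host == self.domain or host.endswith("." + self.domain)
        if on_domain and "nofollow" in rel_tokens:
            self.matches.append(href)

def find_nofollow_links(html, domain):
    finder = NofollowLinkFinder(domain)
    finder.feed(html)
    return finder.matches

# Hypothetical input covering the three cases from the question.
html = (
    '<a href="https://stackoverflow.com/q/1" rel = "nofollow">click here</a>'
    '<a href="stackoverflow.com">no rel</a>'
    '<a href="google.com" rel="nofollow">other domain</a>'
)
found = find_nofollow_links(html, "stackoverflow.com")
```

Because the parser hands over attributes as name/value pairs, attribute order and quoting style do not need to be spelled out in a pattern.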

RegExp find wrong tags

I have some URLs saved in the DB like hello world
with break tags in them, so I need to delete those. The problem is that <br/> also appears in other places, so I can't delete all of them.
I wrote the RegExp <*"*<br\/?>"> but it selects not only the <br> but the quotes too.
You really shouldn't be using regular expressions for parsing HTML or XML.
Having said that, as I understand it, you have br tags inside the href attribute of a tags.
Try:
href\s*?=\s*?\"(.*?)(<br\/?\>)\"
If you are trying to find the affected rows in the database, this is your regex extended to match the whole line:
<.*\".*<br\/>\">.*>
After this you can match the '<br/>' directly in those lines. Is there a language you can use to edit your DB?
Some of the other answers here are okay. I'll offer an alternative:
https://regex101.com/r/uG5PBA/2
This'll put the break tags in a capture group -- group 1, so that you can simply nix them.
Regex:
<a[\s\S]*?(\<br\/>)[\s\S]*?<\/a>
Test String:
hello worldhello world
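A hedged Python sketch of the "remove the break tags only inside href" idea. The original sample markup did not survive, so the input shape here (a <br/> at the end of a double-quoted href value) is an assumption, as are the names `strip_br_in_hrefs`, `HREF` and `BR`.

```python
import re

# Capture the href prefix, the quoted value, and the closing quote separately.
HREF = re.compile(r'(href\s*=\s*")([^"]*)(")', re.IGNORECASE)
BR = re.compile(r'<br\s*/?>', re.IGNORECASE)

def strip_br_in_hrefs(html):
    """Remove <br> tags that sit inside href values; leave all other <br> tags alone."""
    return HREF.sub(
        lambda m: m.group(1) + BR.sub("", m.group(2)) + m.group(3),
        html,
    )

# Hypothetical row: one stray <br/> inside the href, one legitimate <br/> after the link.
row = '<a href="http://example.com/page<br/>">hello world</a><br/>next line'
cleaned = strip_br_in_hrefs(row)
```

Doing the removal in a callback keeps the second substitution scoped to the attribute value, so break tags elsewhere in the text are untouched.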

Using RegEx to Extract Anchor Text of Links With Beginning of Specific Target URL

I need assistance with capturing "Mr. John Doe" from the following HTML code:
Mr. John Doe
I have been trying various string matching and thought that I was close when I tried using the following RegEx:
(.*)
...But, no matches were found in the capturing group.
This is something I'm trying to set as a parameter in a crawl simulation software (using PCRE). I'm simply looking to extract the author name which would appear within a hyperlink that links to a target URL beginning with /author/...
Any pointers? Thank you in advance!
Your problem is that you require a space when there's none:
(.*)
# ^^^
Remove it and it works:
(.*)
However, this expression could be vastly optimized (no dot-star-soup everywhere, that is) and if you're still at the beginning, better use a parser and xpath queries instead.
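As an illustration of a tighter expression, here is a minimal Python sketch. The HTML from the question was lost, so the markup below is a hypothetical stand-in for a link whose target begins with /author/; the pattern name `AUTHOR_LINK` is likewise made up. Negated character classes replace the dot-star parts.

```python
import re

# [^"]* stays inside the href value and [^<]+ stays inside the link text,
# so the match cannot run past the end of the anchor.
AUTHOR_LINK = re.compile(r'<a\s[^>]*\bhref="/author/[^"]*"[^>]*>([^<]+)</a>')

# Hypothetical markup; the real page structure is unknown.
html = '<p>By <a class="byline" href="/author/john-doe/">Mr. John Doe</a></p>'
match = AUTHOR_LINK.search(html)
author = match.group(1) if match else None
```

The same pattern should carry over to PCRE without changes, since it only uses character classes, a word boundary and one capture group.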

Notepad++ html tag / string (a href) replace

I found another post that uses the regex <a[^>]*>([^<]+)</a>. It works great, but I want to use a capture group to target only URLs that contain the four letters RTRD.
I used <a[^>]*>(RTRD+)</a> and that did not work.
TESTER I want to remove the URL and leave TESTER
LEAVE I want to not touch this one.
One that will work: <a\s[^>]*href\=[\"][^\"]*(RTRD)[^\"]*[\"][^>]*>([^<]+)<\/a>
Decomposition:
<a\s[^>]* find opening a tag with space followed by some arguments
href\=[\"][^\"]* find href attribute with " opening and then multiple non " closing
(RTRD) Your Key group
[^\"]*[\"] Find remainder of argument and closing "
[^>]*>([^<]+)<\/a> The remainder of the original regex
Things your original RegExp would match:
<a stuffhere!!.,?>RTRDDD</a>
<a>RTRD</a>
Decomposing your RegExp:
<a[^>]*> Look for opening tag with any properties
(RTRD+) Look for the RTRD group but also match one or more D
</a> Look for closing tag
Use <a[^>]*RTRD[^>]*>([^<]+)<\/a> here.
Inside the opening tag (<a[^>]*>) the pattern RTRD should appear somewhere. This can be done by replacing [^>]* with [^>]*RTRD[^>]*, which is simply:
[^>]* Anything that's not a > (closing bracket)
RTRD The pattern RTRD
[^>]* Again anything that's not a >
But caution: this also matches <aRTRD>test</a> or <a id="RTRD">blubb</a>
And if you have any other way than using Regex on HTML, use that way (string operations etc)
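To check the behavior outside Notepad++, the href-anchored variant can be sketched in Python. The sample URLs are hypothetical (the originals were lost), and `unlink_rtrd` is an illustrative name; the pattern assumes double-quoted href values and plain-text link content.

```python
import re

# RTRD must appear inside the quoted href value; group 1 is the link text.
RTRD_LINK = re.compile(r'<a\s[^>]*\bhref="[^"]*RTRD[^"]*"[^>]*>([^<]+)</a>')

def unlink_rtrd(html):
    """Replace anchors whose href contains RTRD with their plain link text."""
    return RTRD_LINK.sub(r"\1", html)

# Hypothetical input modeled on the TESTER / LEAVE example.
page = (
    '<a href="http://example.com/RTRD/1">TESTER</a> '
    '<a href="http://example.com/other">LEAVE</a>'
)
result = unlink_rtrd(page)
```

Anchoring RTRD to the href value avoids the false positives mentioned above, such as an id attribute that happens to contain RTRD.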

Conditional Regex to match url

I am trying to write an if/then condition to match the URL, but I can't seem to get it to work. I am trying to match URLs and then capture the non-optional group. So, if a URL comes in like this:
/en/testing.aspx
I want to capture /testing.aspx
if the url comes in like this:
/testing.aspx
I want to capture /testing.aspx
Is there an easy way to do this using regex?
EDIT:
The Url can be multi-part url, like /en/sub1/sub2/testing.aspx - I essentially want everything after "/en/".
Use the regex \/en(\/.+)$
Check it out here:
https://regex101.com/r/lwowhi/6
If "/en/" might not be in the URL and you still want to capture /testing.aspx, then here is an edit: (?:\/en)*(\/.+)$
https://regex101.com/r/lwowhi/8
You can use a lazy prefix and an optional non-capturing group for /en, then capture everything which comes after that point.
^.*?(?:\/en)?(\/.*)$
Assuming all pages end in .aspx, use a capture group.
regex: .*(/.*.aspx)
This will match "/testing.aspx" in all of the samples below:
/testing.aspx or
/en/testing.aspx or
www.abc.com/en-us/testing.aspx
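For the requirement as stated in the question (drop a leading /en if present, keep everything after it), an optional non-capturing group is enough and no conditional construct is needed. A minimal Python sketch, assuming the input is a path that starts with a slash; `path_without_en` and `STRIP_EN` are illustrative names.

```python
import re

# (?:/en)? consumes a leading "/en" only when a "/" follows it,
# because the capture group must itself start with "/".
STRIP_EN = re.compile(r'^(?:/en)?(/.+)$')

def path_without_en(url):
    """Return the path with a leading /en segment removed, or None if it does not match."""
    m = STRIP_EN.match(url)
    return m.group(1) if m else None

paths = [path_without_en(p) for p in (
    "/en/testing.aspx",
    "/testing.aspx",
    "/en/sub1/sub2/testing.aspx",
)]
```

A path such as /english/page.aspx is returned unchanged: after "/en" the next character is not a slash, so the regex backtracks and skips the optional group.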