Regex Pattern to Match A Href and Remove - regex

I am trying to create a regex to match all a href links that contain my domain and I will end up removing the links. It is working fine until I run into an a href link that has another HTML tag within the tag.
Regex Statement:
(<a[^<]*coreyjansen\.com[^<]*>)([^"]*?)(<\/a>)
It matches the a href links in this statement with no problem
Need a lawyer? Contact <span style="color: #000000">Random text is great Corey is awesome</span>
It is unable to match both of the a href links in this statement:
<strong><a href="http://coreyjansen.com/"><img class="alignright size-full
wp-image-12" src="http://50h0.com/wp-content/uploads/2014/06/lawyers.jpg"
alt="lawyers" width="250" height="250" /></a>
I have been trying to play with the negated character set with no luck. If I remove the negated character set, it ends up matching two links that sit right after each other, such as those in example 2, as one match.

The issue here is that [^<]*> is greedy: it matches everything up to the last > before the next <. That's the greedy behaviour of the * quantifier. You can make it non-greedy by appending ? after the asterisk (which you already do in another part of your pattern); it will then match only up to the first occurrence of >. Then you have to change the middle part of your regex too, because [^"]*? cannot get past the quotes inside the nested <img> tag. Catch everything up to the first </a> tag instead, like this:
(<a[^<]*coreyjansen\.com[^<]*?>)(.*?)(<\/a>)
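
A minimal sketch of the corrected pattern in Python (the language is an assumption; the question does not name one). The sample string and the `strip_links` helper are hypothetical, and keeping the wrapped content is only one possible reading of "removing the links". `re.DOTALL` is added so that `.` also crosses the line break inside the `<img>` tag.

```python
import re

# Corrected pattern from the answer above; re.DOTALL lets "." span newlines.
LINK_RE = re.compile(r'(<a[^<]*coreyjansen\.com[^<]*?>)(.*?)(<\/a>)', re.DOTALL)

# Hypothetical sample: two links side by side, the first wrapping an <img>.
html = (
    '<strong><a href="http://coreyjansen.com/"><img class="alignright"\n'
    'src="lawyers.jpg" alt="lawyers" /></a>'
    '<a href="http://coreyjansen.com/">Corey</a></strong>'
)


def strip_links(text):
    """Drop the <a ...> and </a> tags but keep whatever they wrapped."""
    return LINK_RE.sub(r'\2', text)


matches = LINK_RE.findall(html)
print(len(matches))        # number of separate links found
print(strip_links(html))   # the markup with the anchor tags removed
```

Replacing with an empty string instead of `r'\2'` would drop the link content as well.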

Use the regex below, which matches only the opening a tag:
(<a[^>]*coreyjansen\.com[^>]*>)
Example data
<strong><a href="http://coreyjansen.com/"><img class="alignright size-full
wp-image-12" src="http://50h0.com/wp-content/uploads/2014/06/lawyers.jpg"
alt="lawyers" width="250" height="250" /><a href="http://coreyjansen.com/"/>
The regex above will match every opening a tag that contains your required domain.

I'm playing with the following regex and it seems to be working:
<a.*coreyjansen\.com.*</a>
It captures everything between anchor tags that contain your site name. I am using JavaScript pattern matching on www.regexpal.com; depending on the language, it could be slightly different.

You need to match the start of the tag, <a, and then match the address before the > character. You are matching the wrong character. Once that matches, everything between <a> and </a> is the displayed link. I don't know why you require it to contain no quotes: every tag attribute (in HTML5) has its value inside quotes, so you need to match everything except the closing tag </a>. That is done with ((?!string to not match).)*, followed by </a>. The resulting regex is:
(<a[^>]*coreyjansen\.com[^>]*>)((?!<\/a>).)*(<\/a>)
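
As a rough illustration in Python (an assumed language; the sample markup and the example.com link are made up), the sketch below compares the greedy `<a.*coreyjansen\.com.*</a>` pattern from the earlier answer with the pattern above. `re.DOTALL` is an addition so that link content may span lines.

```python
import re

# Greedy pattern from the earlier answer: ".*" runs on to the last </a>.
GREEDY_RE = re.compile(r'<a.*coreyjansen\.com.*</a>')

# Pattern from this answer: "(?!<\/a>)." consumes one character at a time,
# but only while the closing tag does not start at that position.
TEMPERED_RE = re.compile(
    r'(<a[^>]*coreyjansen\.com[^>]*>)((?!<\/a>).)*(<\/a>)',
    re.DOTALL,
)

# Hypothetical sample: two links to the domain around an unrelated link.
html = (
    '<a href="http://coreyjansen.com/a">one</a> | '
    '<a href="http://example.com/">other</a> | '
    '<a href="http://coreyjansen.com/b"><b>two</b></a>'
)

greedy = [m.group(0) for m in GREEDY_RE.finditer(html)]
tempered = [m.group(0) for m in TEMPERED_RE.finditer(html)]

print(len(greedy))               # the greedy pattern swallows the whole line
print(len(tempered))             # one match per link to the domain
print(TEMPERED_RE.sub('', html)) # only the unrelated link is left
```

Because the second group is repeated one character at a time, it only retains the last character; use the full match (`group(0)`) when the whole link is needed.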

Related

Regex find specific character but just when inside an HTML tag

I have an HTML string, e.g. :
<a href=“{{foo.bar}}”>some text “nice” here</a>
I'm trying to find out whether any opening/closing double quote (“”, not ") is present inside an HTML tag (i.e. inside <>, although there could be other things in the tag as well).
In my example, <a href=“{{foo.bar}}”> should match but “nice” or </a> shouldn't.
What is the right regex for this?
Actually, I don't believe you've found it; rather, you fell into a common trap of regular expressions. You found a pattern that matches what you want in one specific case.
If you place a < character inside the text of the link, e.g. <a href=“{{foo.bar}}”>some text < “nice” here</a>, your regex will match both <a href=“{{foo.bar}}”> and < “nice” here</a>.
So extra caution is needed when it comes to regular expressions. To match any opening HTML tag, better use <\w+.*?>. After that, extract whatever you find inside “”.
OK, found it: <[^>]*[“”]+[^>]*>
That does not work as you probably expect it to. When you add capturing groups, you'll see which parts of the string are actually matched by which groups:
<([^>]*)([“”]+)([^>]*)>
matches your example in this way:
Full match: <a href=“{{foo.bar}}”>
1st group: a href=“{{foo.bar}}
2nd group: ”
3rd group: (nothing)
Building on @Themelis' answer, you probably want to start with something like this:
<(\w+ [^<>“]*)“([^”]+)”([^<>]*)>
matches your example in this way:
Full match: <a href=“{{foo.bar}}”>
1st group: a href=
2nd group: {{foo.bar}}
3rd group: (nothing)
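
The group breakdowns above can be checked with a short Python sketch (Python is an assumption; the thread does not name a language). The `tricky` string is the variation with a stray < in the link text that was raised earlier in the thread.

```python
import re

html = '<a href=“{{foo.bar}}”>some text “nice” here</a>'
tricky = '<a href=“{{foo.bar}}”>some text < “nice” here</a>'

# First attempt with capturing groups added, and the refined version.
FIRST_TRY = re.compile(r'<([^>]*)([“”]+)([^>]*)>')
REFINED = re.compile(r'<(\w+ [^<>“]*)“([^”]+)”([^<>]*)>')

for pattern in (FIRST_TRY, REFINED):
    m = pattern.search(html)
    print(m.group(0), m.groups())

# With a stray "<" in the text, the first attempt also matches the tail
# of the string, while the refined pattern still finds only the tag.
first_on_tricky = [m.group(0) for m in FIRST_TRY.finditer(tricky)]
refined_on_tricky = [m.group(0) for m in REFINED.finditer(tricky)]
print(first_on_tricky)
print(refined_on_tricky)
```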

How to Match Redundant Lines From Contenteditable Div in Regex

I'm trying to process the html inside a contenteditable div. It might look like:
<div>Hi I'm Jack...</div>
<div><br></div>
<div><br></div>
<div>More text.</div> *<div><br></div>*
*<div><br></div>**<div><br></div>*
*<div><br></div>*
*<div>
<br>
</div>*
What regex expression would match all trailing <div><br></div> but not the ones sandwiched between useful divs containing text, i.e., <div> text (not html) </div>?
I have enclosed all expressions I want to match in asterisks. The asterisk are for reference only and are not part of my string.
Thanks,
Jack
You can use the pattern:
(?:<div>[\n\s]*<br>[\n\s]*<\/div>)(?!.*?<div>[^<]+<\/div>)
Let me know if this works for all your cases and I will write a detailed explanation of the pattern.
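
A minimal sketch of how the pattern behaves on the sample from the question, written in Python as an assumed language. The `re.DOTALL` flag is my addition: the answer does not state any flags, but without it the lookahead cannot see a text div on a later line.

```python
import re

# Pattern from the answer; re.DOTALL lets the lookahead scan across lines.
TRAILING_EMPTY_DIV = re.compile(
    r'(?:<div>[\n\s]*<br>[\n\s]*<\/div>)(?!.*?<div>[^<]+<\/div>)',
    re.DOTALL,
)

# The question's sample with the reference asterisks removed.
html = (
    "<div>Hi I'm Jack...</div>\n"
    "<div><br></div>\n"
    "<div><br></div>\n"
    "<div>More text.</div> <div><br></div>\n"
    "<div><br></div><div><br></div>\n"
    "<div><br></div>\n"
    "<div>\n<br>\n</div>"
)

trailing = TRAILING_EMPTY_DIV.findall(html)
# Remove the trailing empty divs, then trim the leftover whitespace.
cleaned = TRAILING_EMPTY_DIV.sub('', html).rstrip()

print(len(trailing))
print(cleaned)
```

The two empty divs between the text divs survive because a `<div>text</div>` still follows them, which makes the negative lookahead fail at those positions.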

Replace substring of a string using REGEX in Notepad++

I am using notepad++ and I want to create an automation in order to replace some strings.
In this case I am going to deal with the a href tag.
So, I will give 3 examples of some lines I have in my code :
01)
<img src="urlurlurlurl" alt="">
02)
<a href="https://url.com" class="logo"><img src="urlurlurlurl" alt="">
</a>
03)
<img src="urlurlurlurl" alt="">
04)
link
So, if I wanted to replace the full a href tag above in all 4 cases, I would use this one : <a href(.*?)a>
Now, I am trying to think of a way to replace the url within the a href tag only.
I tried using that :
href="(?s)(.*?)"|href ="(?s)(.*?)"
and it works fine because I also take into consideration that there might be a space.
But now in the replace window I have to include href=""
Is there a way to make it search for the a href tags and then replace a specific substring of it?
I want to know because there are cases where I have other tags that include a URL and I want to replace it. But a generic replacement for all the strings that are enclosed in quotes ("string") would not be good, as I do not want to replace all of them.
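You can use a negated character class to capture everything in the tag up to and including the opening quote of the href value, and then match the URL itself, like this:
(a[^>]*href\s*=\s*")[^"]*
Replace with the capture group followed by the new URL: $1REPLACE_STRING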
What does it do?
a[^>]* matches a followed by anything other than a closing >.
href\s*=\s*" matches href=", with optional whitespace around the =. Everything up to here is captured in group 1.
[^"]* matches anything other than ". This forms the URL that you want to replace.
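
The same search and replace can be tried outside Notepad++. The sketch below uses Python as an assumed stand-in, where the replacement refers to group 1 through the match object rather than `$1`; the sample markup and `NEW_URL` are hypothetical.

```python
import re

# Group 1 keeps everything up to the opening quote of the href value;
# the trailing [^"]* is the old URL that gets replaced.
HREF_RE = re.compile(r'(a[^>]*href\s*=\s*")[^"]*')

# Hypothetical sample: one tight href, one with spaces around "=".
html = (
    '<a href="https://url.com" class="logo"><img src="urlurlurlurl" alt="">\n'
    '</a>\n'
    '<a class="x" href = "https://old.example/page">link</a>'
)

NEW_URL = "https://new.example/"  # placeholder replacement value

# Equivalent of replacing with $1REPLACE_STRING in Notepad++.
result = HREF_RE.sub(lambda m: m.group(1) + NEW_URL, html)
print(result)
```

The `src` attribute of the `<img>` tag is left alone because no `href` follows an `a` inside that tag.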

regex to match the first occurrence of html link

I need to match only the first occurrence of html link with 'data-{someData}' attributes. I've written regex like below:
\<a\s+(.+)\s+data-\s*(.+)\s*>(.+)<\/a>
and it works for a piece of html with only one html link like:
SOME TEXT/HTML
<a href="~/link.aspx?_id=B0B5056BD5984878BEB5C92AF6B74DB3&_z=z"
data-dms="{6782B150-F6FA-49E6-A2FF-6D6014470373}"
data-targetid="{B0B5056B-D598-4878-BEB5-C92AF6B74DB3}"
data-dms-event="Content button">Link1
</a>
SOME TEXT/HTML
but the problem is when the html contains more links. Then the regex matches up to the last occurrence of </a>. So from the html below:
SOME TEXT/HTML
<a href="~/link.aspx?_id=B0B5056BD5984878BEB5C92AF6B74DB3&_z=z"
data-dms="{6782B150-F6FA-49E6-A2FF-6D6014470373}"
data-targetid="{B0B5056B-D598-4878-BEB5-C92AF6B74DB3}"
data-dms-event="Content button">Link1
</a>
SOME TEXT/HTML
<a href="~/link.aspx?_id=1256272320C4429DAB8A1F40D429C841&_z=z"
data-dms="{6782B150-F6FA-49E6-A2FF-6D6014470373}"
data-targetid="{12562723-20C4-429D-AB8A-1F40D429C841}"
data-dms-event="Content button">Link2
</a>
SOME TEXT/HTML
I need to fix my regex to match only:
<a href="~/link.aspx?_id=B0B5056BD5984878BEB5C92AF6B74DB3&_z=z"
data-dms="{6782B150-F6FA-49E6-A2FF-6D6014470373}"
data-targetid="{B0B5056B-D598-4878-BEB5-C92AF6B74DB3}"
data-dms-event="Content button">Link1
</a>
First off, have you looked at options other than regexp? Regexp is not the ideal tool for parsing HTML. If your language has a DOM parser, you should be able to extract the needed tag with it.
That said, if you need to use regexp, there are two ways to get around the problem you are facing.
The first, and in general the preferable, solution is to be more restrictive in what you match. Rather than matching any character with ., match only the characters that are legal at that point with character classes such as [^>].
The second is to use lazy (non-greedy) matching rather than greedy matching. This is done by adding ? after your quantifiers, i.e. replace + with +? and * with *?. With lazy matching each quantifier takes as few characters as possible, so the match ends at the first </a> rather than the last.
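
Both suggestions can be compared side by side. This is a sketch in Python (an assumed language) on a shortened, made-up version of the question's HTML; `RESTRICTIVE` is one possible way to apply the first suggestion, not the only one. `re.DOTALL` is assumed for the dot-based patterns because the tags span several lines.

```python
import re

# Shortened stand-in for the question's markup: two multi-line links.
html = (
    'SOME TEXT/HTML\n'
    '<a href="~/link.aspx?_id=1"\n'
    ' data-dms="{A}"\n'
    ' data-dms-event="Content button">Link1\n'
    '</a>\n'
    'SOME TEXT/HTML\n'
    '<a href="~/link.aspx?_id=2"\n'
    ' data-dms="{B}"\n'
    ' data-dms-event="Content button">Link2\n'
    '</a>\n'
    'SOME TEXT/HTML'
)

# The question's pattern (greedy), its lazy variant, and a restrictive
# variant that uses negated character classes instead of ".".
GREEDY = re.compile(r'<a\s+(.+)\s+data-\s*(.+)\s*>(.+)<\/a>', re.DOTALL)
LAZY = re.compile(r'<a\s+(.+?)\s+data-\s*(.+?)\s*>(.+?)<\/a>', re.DOTALL)
RESTRICTIVE = re.compile(r'<a\s+[^>]*?\s+data-[^>]*>[^<]*<\/a>')

# search() returns only the first match, which is what the question asks for.
greedy_match = GREEDY.search(html).group(0)
lazy_match = LAZY.search(html).group(0)
restrictive_match = RESTRICTIVE.search(html).group(0)

print(greedy_match.count('</a>'))   # how many links the greedy match spans
print(lazy_match)
print(restrictive_match == lazy_match)
```

The restrictive variant cannot leave the tag it started in, whereas the lazy variant could still run into a later link if the first one had no data- attribute.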

Match text via Regex that is within an HTML tag

Via a Regex, I'm trying to match the word one, only when it's within an HTML <p> tag.
<p>zero one two three</p>
zero one two<p>three</p>
<p>zero one <b>two</b></p><p>three</p>
<p>two</p>three one
#1 and #3 above should be matches. It feels like I need a lookahead that makes sure there is a closing </p> tag without an opening <p> tag that comes before it (or a lookbehind that does the opposite). But I can't seem to come up with the right expression. Any ideas are appreciated.
<p>(?:(?!<\/p>).)*(\bone\b)(?:(?!<\/p>).)*<\/p>
You can try this. Just grab the capture group. See the demo.
http://regex101.com/r/xT7yD8/12
You could try the below regex to match the string one which is inside the <p> tag.
\bone\b(?=(?:(?!<\/?p>).)*<\/p>)
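
Both patterns can be run against the four sample lines from the question; the sketch below uses Python as an assumed language and tests each line separately.

```python
import re

lines = [
    '<p>zero one two three</p>',
    'zero one two<p>three</p>',
    '<p>zero one <b>two</b></p><p>three</p>',
    '<p>two</p>three one',
]

# First answer: match the whole paragraph and capture "one" in group 1.
CAPTURE_RE = re.compile(r'<p>(?:(?!<\/p>).)*(\bone\b)(?:(?!<\/p>).)*<\/p>')

# Second answer: match "one" itself when a </p> follows before any <p> or </p>.
LOOKAHEAD_RE = re.compile(r'\bone\b(?=(?:(?!<\/?p>).)*<\/p>)')

capture_hits = [bool(CAPTURE_RE.search(line)) for line in lines]
lookahead_hits = [bool(LOOKAHEAD_RE.search(line)) for line in lines]

print(capture_hits)
print(lookahead_hits)
```

Both patterns look for a bare `<p>`, so a paragraph tag with attributes, such as `<p class="x">`, would need the pattern to be extended.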