How to use regex to replace text between tags in Notepad++ - regex

I have a code like this:
<pre><code>Some HTML code</code></pre>
I need to escape the HTML between the <pre><code></code></pre> tags. I have lots of tags, so I thought - why not let regex do it for me. The problem is I don't know how. I've seen lots of examples using Google and Stackoverflow, but nothing I could use. Can someone here help me?
Example:
<pre><code>Some HTML code</code></pre>
To
<pre><code>Some <a href="http">HTML</a> code</code></pre>
Or just a regex so I can replace anything between the <pre><code> and </code></pre> tags one by one. I'm almost certain that this can be done.

This regex will match the parts of the anchor tag
you need to put back:
<pre><code>([^<]*?)(.*?)(.*?)</code></pre>
See a live demo, which shows it matching correctly and also shows the various parts being captured as groups which we'll refer to in the replacement string (see below).
Use the regex above with the following replacement:
<pre><code>\1<a href="\2">\3</a>\4</pre></code>
The \1, \2 etc are the captured groups in the regex that put back what we're keeping from the match.

A regular expression to return "the thing between <pre><code> and </code></pre>" could be
/(?<=<pre><code>).*?(?=<\/code><\/pre>)/
This uses lookaround expressions to delimit the "thing that gets matched". Typically using regex in situations with nested tags is fraught with danger and you are much better off using "real tools" made specifically for the job of parsing xml, html etc. I am a huge fan of Beautiful Soup (Python) myself. Not familiar with Notepad++, so not sure if its dialect of regex matches this expression exactly.

Related

use regex to get both link and text associated with it (anchor tag)

I created a regex string that I hoped would get both the link and the associated text in an html page. For instance, if I had a link such as:
<a href='www.la.com/magic.htm'>magicians of los angeles</a>
Then the link I want is 'www.la.com/magic.htm' and the text I want is 'magicians of los angeles'.
I used the following regex expression:
strsearch = "\<a\s+(.*?)\>(.*?)\</a\s*?\>|"
But my vb program told me I was getting too many matches.
Is there something wrong with the regEx expression?
The circle-brackets are meant to get 'groups' that can be back-referenced.
Thanks
What about this one:
\<a href=.+\</a>
All there is left to do is to go over each match and extract the substrings using regular string manipulation.
Check here (although regexr follows javascript regex implementation, it is still useful in our scenario)
With that being said, I often see people stating that regexes are not suited for parsing Html. You might need to use an Html Parser for this. You have HtmlAgilityPack, which is not maintained anymore, and AngleSharp, that I know of to recommend.
I tried with following pattern , it worked.
\<a href=(.*?)\>(.*?)\<\/a\s*?\>|
Also Found two errors on your origin string:
missed a escape syntax on /a
the reserved word 'href' is captured on
first group
At last , i would like recommend you a great site to test REGEX string. It will helps your debug really fast. Refer this (also demonstrating the result you want) :
REGEX101

Regex to match format of valid markup language tags

I am trying to write regex for all type tags either it is html or xml.
I wrote two regex for this
<(\"[^\"]*\"|'[^']*'|[^'\">])*>
<html.*>(.*?)</html>
these are matching all valid tags,,,but it is matching invalid tags too like:
<"font size=12">
...so I want regex for valid tags only. Can anybody please help??
Some people worked for this with code coverage to get a good HTML/XML tag matcher (many traps!)
One of the working solution may be: http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx/
The Regex is <\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>
It matchs individually opening + ending tags, useful if you want to remove tags for instance (in fact you can not expect really more with a simple regex as Jithin answered you)

Regex optimization for Tag Extraction

I am extracting the tags and their surrounding text in the HTML source code. This is the Regex pattern that I use in my code:
String a_pattern = "(.*?)<a (.*?)</a>(.*)";
I tried to use "?" to make the quantifier lazy. However, this pattern takes lots of time while compiling a long string.
Can you please give me some hints on optimizing this pattern?
I should note that I need all three groups of text (before, within, and after )
Thank you
If you want only to get the content from those tags you can use this regex:
<a.*?>(.*?)<\/a>
Working demo
You can check the performance for both regex in this link:
Regex performance
As you can see, the regex you are using is 97% slower.
This result uses the javascript regex engine, but it is really useful to know regex performance.

Regular expression with negative look aheads

I am trying to contruct a regular expression to remove links from content unless it contains 1 of 2 conditions.
<a.*?href=[""'](http[s]?:\/\/(.*?)\.link\.com)?\/(?!m\/).*?<\/a>
This will match any link to link.com that does not have m/ at the end of the domain section. I want to change this slightly so it does't match URLs that are links to pdf files regardless of having the m/ in the url, I came up with:
<a.*?href=["'](http[s]?:\/\/(.*?)\.brodies\.com)?\/(?!m\/).*?\.(?!pdf)["'].*?<\/a>
Which is ooh so very close except now it will only match if the URL has a "." at the end - I can see why it's doing it. I can't seem to make the "." optional as this causes the non greedy pattern prior to the "." to keep going until it hits the ["']
Any help would be good to help solve this.
Thanks
Paul
You probably want to use (?<!\.pdf)["'] instead of \.(?!pdf)["'].
But note that this expression has several issues, best way to solve them is to use a proper HTML parser.
First, RegEx match open tags except XHTML self-contained tags.
That said, (since it probably will not deter,) here is a slightly-better-constrained version of what you're trying to, with the caveat that this is still not good enough!
<a[^>]+?href\s*=\s*["'](https?:\/\/[^"']*?\.link\.com)?\/(?!m\/)[^"']*?\.(?!pdf)[^"']*?["'][^>]*?>.*?<\/a>
You can see a running example of this regex at: http://rubular.com/r/obkKrKpB8B.
Your problem was actually just that you were looking for a quote character immediately after the dot, here: .(?!pdf)["'].

Regex challenge: Match phrase only if outside of an <a href> tag

I am working on improving our glossary functionality in a custom CMS that is running with classic ASP (ASP 3.0) on IIS with VBScript code. I am stumped on a regex challenge I cannot solve.
Here is the current code:
If InStr(ART_ArticleBody, "href") = False then
sql="SELECT URL, Term, RegX FROM GLOSSARYDB;"
Set rsGlossary = Server.CreateObject("ADODB.Recordset")
rsGlossary.open sql, strSQLConn
Set RegExObject = New RegExp
While Not rsGlossary.EOF
URL = rsGlossary("URL")
Phrase = rsGlossary("RegX")
With RegExObject
.Pattern = Phrase
.IgnoreCase = true
.Global = false
End With
set expressionmatch = RegExObject.Execute(ART_ArticleBody)
if expressionmatch.count > 0 then
For Each expressionmatched in expressionmatch
RegExObject.Pattern = Phrase
URL = ""& expressionmatched.Value & ""
ART_ArticleBody = RegExObject.Replace(ART_ArticleBody, URL)
next
end if
rsGlossary.movenext
wend
rsGlossary.movefirst
Set RegExObject = nothing
end if
Instead of skipping putting glossary links in any article that has an href in it, as the above code does, I would like to change the code to process every article but have the RegEx pattern avoid matching on a glossary entry if the match is inside of an a tag.
For example, in italics below is a test example for this regex entry in my DB: ROI|return on investment|investment return
Here is a link that uses the glossary term: Info on return on investment.
Now, here is the glossary term in plain text, not inside of a link: return on investment.
We want to find the third instance of a match but not find the first two because they are both inside of a HTML link.
In the above text, if I were processing the article for the glossary entry "ROI|return on investment|investment return" I do not want to match on the first or second occurance that match because they are in an a tag. I need the regex pattern to skip over those matches and just match on any that are not inside of an a tag.
Any help on this would be greatly appreciated.
Try this regex:
<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)
This matches an HTML anchor, or any of the terms you're looking for. The terms are captured into group number 1. So in your VBScript code, check if the first capturing group matched anything, and you've got one of your keywords outside an <a> tag.
This regex indeed won't work correctly if you have nested <a> tags. That shouldn't be a problem, as anchors are normally not nested inside each other. If it is a problem, you can't solve it with VBScript/JavaScript regular expressions. The regex also won't work correctly if you have <a> tags that are missing their closing tags. If you want to take that into account, try this regex:
<a\b[^<>]*>(?:(?:(?!<a\b)[\s\S])*?</a>)?|(ROI|return on investment|investment return)
This problem is, as they say, "non-trivial" in its current state. However, if you could modify your system to output more semantic markup, it would make things much easier:
undesired tag match
This is <span class="tag">a tag</span>
In this case, you can simply search:
(?<=<span class=\"tag\">)(phrase1|phrase2|phrase3)(?=</span>)
Or something a little more robust
(?<=<span class=\"tag\">).+?(?=</span>)
This way you can easily focus your searches to data within a specific <span>, and leave everything else aside.
You can't solve it because it can't be done, at least not with 100% reliability. HTML is not a "regular" language in the regular expression sense. Like the saying goes, when you have a hammer, everything starts to look like a nail. There are some things regular expressions aren't good at. This is one of them.
Most languages have some form of HTML parsing library as standard or easily obtained. Use those. That's what they were designed for.
In general, you can't use a regular expression to recognize arbitrarily nested constructs (such as bracket-delimited HTML tags). If you had solved this problem, there's be a lot of mathematicians lining up to hear about it. :)
Having said that, .NET does indeed offer an extension to regular expressions that permits what I just said was impossible, and--even better!--the sample chapter for the great "Mastering Regular Expressions" available here happens to cover that feature.
(accounts receivable|A/R)(?!((?!</?a\b).)*</a)
(phrase1|phrase2|phrase3)(?!((?!</?a\b).)*</a)
The above approach seems to work, at least in my RegexBuddy software. I didn't figure it out on my own. Had some help from a guru. Time to test it in my ASP code. Thanks to all who provided input. I'm sure I didn't describe what I needed well enough for you to come up with the above solution. Mea culpa.