Regex challenge: Match phrase only if outside of an <a href> tag

I am working on improving our glossary functionality in a custom CMS that is running with classic ASP (ASP 3.0) on IIS with VBScript code. I am stumped on a regex challenge I cannot solve.
Here is the current code:
If InStr(ART_ArticleBody, "href") = 0 Then
    ' Only articles that contain no link at all are processed.
    sql = "SELECT URL, Term, RegX FROM GLOSSARYDB;"
    Set rsGlossary = Server.CreateObject("ADODB.Recordset")
    rsGlossary.Open sql, strSQLConn
    Set RegExObject = New RegExp
    While Not rsGlossary.EOF
        URL = rsGlossary("URL")
        Phrase = rsGlossary("RegX")
        With RegExObject
            .Pattern = Phrase
            .IgnoreCase = True
            .Global = False   ' first match only
        End With
        Set expressionmatch = RegExObject.Execute(ART_ArticleBody)
        If expressionmatch.Count > 0 Then
            For Each expressionmatched In expressionmatch
                RegExObject.Pattern = Phrase
                URL = ""& expressionmatched.Value & ""
                ART_ArticleBody = RegExObject.Replace(ART_ArticleBody, URL)
            Next
        End If
        rsGlossary.MoveNext
    Wend
    rsGlossary.MoveFirst
    Set RegExObject = Nothing
End If
Instead of skipping putting glossary links in any article that has an href in it, as the above code does, I would like to change the code to process every article but have the RegEx pattern avoid matching on a glossary entry if the match is inside of an a tag.
For example, below is a test example for this regex entry in my DB: ROI|return on investment|investment return
Here is a link that uses the glossary term: Info on return on investment.
Now, here is the glossary term in plain text, not inside of a link: return on investment.
We want to find the third instance of a match but not find the first two because they are both inside of an HTML link.
In the above text, if I were processing the article for the glossary entry "ROI|return on investment|investment return", I do not want to match the first or second occurrence because they are in an <a> tag. I need the regex pattern to skip over those matches and match only those that are not inside an <a> tag.
Any help on this would be greatly appreciated.

Try this regex:
<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)
This matches an HTML anchor, or any of the terms you're looking for. The terms are captured into group number 1. So in your VBScript code, check if the first capturing group matched anything, and you've got one of your keywords outside an <a> tag.
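As a rough illustration of the "check whether group 1 matched" step, here is a minimal sketch in Python rather than VBScript. The function name `link_terms`, the sample text and the glossary URL are made up for the example; only the pattern comes from the answer above.

```python
import re

TERMS = r"ROI|return on investment|investment return"
# Either a whole anchor (no capture) or one of the terms (captured in group 1).
PATTERN = re.compile(r"<a\b[^<>]*>[\s\S]*?</a>|(" + TERMS + r")", re.IGNORECASE)


def link_terms(html, url):
    """Wrap glossary terms in a link, leaving existing anchors untouched."""
    def repl(match):
        if match.group(1) is None:
            # The anchor alternative matched: put it back unchanged.
            return match.group(0)
        # Group 1 matched, so the term is outside any <a>...</a>.
        return '<a href="{}">{}</a>'.format(url, match.group(1))
    return PATTERN.sub(repl, html)


# Hypothetical sample text and glossary URL.
sample = ('Here is a link: <a href="/x">Info on return on investment</a>. '
          'Plain text: return on investment.')
result = link_terms(sample, "/glossary/roi")
print(result)
```

Unlike the original code, which sets Global to false, this sketch links every occurrence outside an anchor.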
This regex indeed won't work correctly if you have nested <a> tags. That shouldn't be a problem, as anchors are normally not nested inside each other. If it is a problem, you can't solve it with VBScript/JavaScript regular expressions. The regex also won't work correctly if you have <a> tags that are missing their closing tags. If you want to take that into account, try this regex:
<a\b[^<>]*>(?:(?:(?!<a\b)[\s\S])*?</a>)?|(ROI|return on investment|investment return)
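To see how this second pattern behaves when a closing tag is missing, here is a small Python sketch. The sample string is invented; with it, the unclosed opening tag is consumed on its own, so the term after it is still reported, while the term inside the properly closed anchor is skipped.

```python
import re

TERMS = r"ROI|return on investment|investment return"
PATTERN = re.compile(
    r"<a\b[^<>]*>(?:(?:(?!<a\b)[\s\S])*?</a>)?|(" + TERMS + r")",
    re.IGNORECASE,
)

# Hypothetical input: the first <a> is never closed, the second one is.
sample = '<a href="/x">broken ROI and <a href="/y">ok ROI</a> then ROI'

# Keep only the matches where group 1 (a glossary term) took part.
hits = [m.start(1) for m in PATTERN.finditer(sample) if m.group(1)]
print(hits)
```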

This problem is, as they say, "non-trivial" in its current state. However, if you could modify your system to output more semantic markup, it would make things much easier:
undesired tag match
This is <span class="tag">a tag</span>
In this case, you can simply search:
(?<=<span class=\"tag\">)(phrase1|phrase2|phrase3)(?=</span>)
Or something a little more robust
(?<=<span class=\"tag\">).+?(?=</span>)
This way you can easily focus your searches to data within a specific <span>, and leave everything else aside.
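A quick way to try both lookaround patterns is the Python sketch below. Python's `re` module accepts this fixed-width lookbehind; other engines may not support lookbehind at all. The sample strings are invented for the example.

```python
import re

# Only a phrase wrapped in <span class="tag">...</span> is matched.
specific = re.compile(r'(?<=<span class="tag">)(phrase1|phrase2|phrase3)(?=</span>)')
# The looser variant: anything between the opening and closing span.
generic = re.compile(r'(?<=<span class="tag">).+?(?=</span>)')

text = 'This is <span class="tag">phrase2</span> and phrase2 again'
tagged = specific.findall(text)
anything = generic.findall('This is <span class="tag">a tag</span>')
print(tagged, anything)
```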

You can't solve it because it can't be done, at least not with 100% reliability. HTML is not a "regular" language in the regular expression sense. Like the saying goes, when you have a hammer, everything starts to look like a nail. There are some things regular expressions aren't good at. This is one of them.
Most languages have some form of HTML parsing library as standard or easily obtained. Use those. That's what they were designed for.

In general, you can't use a regular expression to recognize arbitrarily nested constructs (such as bracket-delimited HTML tags). If you had solved this problem, there'd be a lot of mathematicians lining up to hear about it. :)
Having said that, .NET does indeed offer an extension to regular expressions that permits what I just said was impossible, and--even better!--the sample chapter for the great "Mastering Regular Expressions" available here happens to cover that feature.

(accounts receivable|A/R)(?!((?!</?a\b).)*</a)
(phrase1|phrase2|phrase3)(?!((?!</?a\b).)*</a)
The above approach seems to work, at least in my RegexBuddy software. I didn't figure it out on my own. Had some help from a guru. Time to test it in my ASP code. Thanks to all who provided input. I'm sure I didn't describe what I needed well enough for you to come up with the above solution. Mea culpa.
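For anyone who wants to try the lookahead pattern outside RegexBuddy, here is a minimal Python sketch. The sample text and the glossary URL are made up; `count=1` roughly mirrors the Global = false setting in the question's code.

```python
import re

TERMS = r"ROI|return on investment|investment return"
# A term matches only if no "</a" can be reached without first crossing
# another opening or closing <a> tag, i.e. we are not inside an anchor.
OUTSIDE_ANCHOR = re.compile(
    r"(" + TERMS + r")(?!((?!</?a\b).)*</a)", re.IGNORECASE
)

# Hypothetical article text: the term appears in the href, in the link
# text, and once in plain text.
sample = ('Here is a link: <a href="/roi.htm">Info on return on investment</a>. '
          'Plain text: return on investment.')

linked = OUTSIDE_ANCHOR.sub(r'<a href="/glossary/roi">\1</a>', sample, count=1)
print(linked)
```

Note that `.` does not cross line breaks here, so an anchor that spans several lines would need extra care.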

Related

Regex to match everything except a pattern

Regex noob here struggling with this, which I know will be easy for some of you regex gods out there!
Given the following:
title: Some title
date: 2022-08-15
tags: <value to extract>
identifier: 1234567
---------------------------
Some text
some more text
I would like a regex to match everything except the value of tags (ie the "<value to extract>" text).
For context, this is supposed to run on emacs (in case it matters).
EDIT: Just to clarify as per @phils' question, all I care about is extracting the tags value. However, this is via a package setting that asks for a regex string, and I don't have much control over how it gets used. It seems to expect a regex to strip what I don't need from the string rather than matching what I do want, which is slightly annoying. Also, since it seems to match everything with \\(.\\), I'm guessing it's using the global flag?
Please let me know if any of this isn't clear.
Emacs regular expressions can't trivially express "not foo" for arbitrary values of foo. (The likes of PCRE have non-regular extensions for zero-width negative look-ahead/behind assertions, but in Emacs that sort of functionality is generally done with the support of lisp code [1].)
You can still do it purely with regexp matching, but it's simply very cumbersome. An Emacs regexp which matches any line which does not begin with tags: is:
^\(?:$\|[^t]\|t[^a]\|ta[^g]\|tag[^s]\|tags[^:]\).*
or if you need to enter it in the elisp double-quoted read syntax for strings:
"^\\(?:$\\|[^t]\\|t[^a]\\|ta[^g]\\|tag[^s]\\|tags[^:]\\).*"
[1] In lisp code you would instead simply check each line to see whether it does start with tags: and, if so, skip it (which is why Emacs generally gets away without the feature you're looking for, but of course that doesn't help you here).
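To make the character-by-character alternation easier to follow, here is a rough translation into Python syntax (`\(?:` becomes `(?:` and `\|` becomes `|`). This is only an illustration of how the pattern classifies lines, not something Emacs itself can run; the variable names are made up and the text is the sample from the question.

```python
import re

# Same alternation as the Emacs regexp, written in Python syntax.
NOT_TAGS_LINE = re.compile(r"^(?:$|[^t]|t[^a]|ta[^g]|tag[^s]|tags[^:]).*")

text = """title: Some title
date: 2022-08-15
tags: <value to extract>
identifier: 1234567
---------------------------
Some text
some more text"""

# Check each line separately; a line that the pattern does not match
# is one that begins with "tags:".
kept = [line for line in text.splitlines() if not NOT_TAGS_LINE.match(line)]
print(kept)
```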
After playing around with it for a bit and taking inspiration from @phils' answer, I've come up with the following:
"^\\(?:\\(#\\+\\)?\\(?:filetags:\s+\\|tags:\s+\\|title:.*\\|identifier:.*\\|date:.*\\)\\|.*\\)"
I've also added an extra \\(#\\+\\)? to account for org meta keys which would usually have the format #+key: value.

use regex to get both link and text associated with it (anchor tag)

I created a regex string that I hoped would get both the link and the associated text in an html page. For instance, if I had a link such as:
<a href='www.la.com/magic.htm'>magicians of los angeles</a>
Then the link I want is 'www.la.com/magic.htm' and the text I want is 'magicians of los angeles'.
I used the following regex expression:
strsearch = "\<a\s+(.*?)\>(.*?)\</a\s*?\>|"
But my vb program told me I was getting too many matches.
Is there something wrong with the regEx expression?
The circle-brackets are meant to get 'groups' that can be back-referenced.
Thanks
What about this one:
\<a href=.+\</a>
All there is left to do is to go over each match and extract the substrings using regular string manipulation.
Check here (although regexr follows javascript regex implementation, it is still useful in our scenario)
With that being said, I often see people stating that regexes are not suited for parsing Html. You might need to use an Html Parser for this. You have HtmlAgilityPack, which is not maintained anymore, and AngleSharp, that I know of to recommend.
I tried the following pattern, and it worked:
\<a href=(.*?)\>(.*?)\<\/a\s*?\>|
I also found two errors in your original string:
the slash in /a is not escaped
the word 'href' is captured in the first group
Finally, I would like to recommend a great site for testing regex strings. It will help you debug really fast. See this (it also demonstrates the result you want):
REGEX101
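As a sketch of the two-group idea from the question, the Python snippet below captures the href value and the link text separately. The pattern is a simplified variant written for this example, not one of the patterns posted above, and it assumes a quoted href as the first attribute. It also shows one possible source of "too many matches": a trailing `|` adds an empty alternative, which (in Python's `re`, at least) matches at every position.

```python
import re

html = "See <a href='www.la.com/magic.htm'>magicians of los angeles</a> now"

# Group 1: the href value, group 2: the link text.
ANCHOR = r"""<a\s+href=['"]([^'"]*)['"][^>]*>(.*?)</a\s*>"""

pairs = re.findall(ANCHOR, html, flags=re.IGNORECASE)
print(pairs)

# Same pattern with a trailing "|": the empty alternative also matches,
# so many empty results are returned alongside the real one.
noisy = re.findall(ANCHOR + "|", html, flags=re.IGNORECASE)
print(len(noisy))
```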

Reg Expression to scrape background:url but 'url(data:image'

I am working on a Gradle script to go through a large CSS file and scrape out the URLs for images. So far:
def temp = ".post-format background:url(image/goes/here.jpg); {background: .post-format {background: url(../img/post //formats.png);display:;display:.woocommerce-info:before {background: url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAIAAAAFCAYAAABvsz2cAAAAG0lEQVQIHWP8DwQMQMACxIwwBliECcQDATgDAMHrBQqJ6tMZAAAAAElFTkSuQmCC)center no-repeat #18919c }"
def list = temp.findAll(/background:[\s]?url\([^\)]*\)/){ match ->
match
}
This works, but it also picks up the 'data:image' URL that we don't need. So the temp variable here contains both the good 'image/goes/here.jpg' URL and the one we don't need, 'data:image/png[..]'. How should we update the regular expression to make it work? If you could also share the rationale behind the correct regular expression to help us learn regular expressions better, I would much appreciate it. Thanks a lot.
You can use the negative look ahead mechanism to accomplish what you want. Immediately following the escaped left parenthesis you insert (?!data:image) which means that you must not match that text at that point. So your regex becomes:
/background:[\s]?url\((?!data:image)[^\)]*\)/
You can see the approach illustrated in this rubular. See also How can I find everything BUT certain phrases with a regular expression?
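The effect of the negative lookahead can be checked with a short Python sketch. The CSS string is a shortened stand-in for the one in the question, and the base64 payload is a made-up placeholder.

```python
import re

# Shortened, hypothetical CSS; "AAAA" stands in for the real base64 data.
css = (".post-format background:url(image/goes/here.jpg); "
       "{background: url(../img/post //formats.png);display:;} "
       ".woocommerce-info:before {background: url(data:image/png;base64,AAAA)"
       "center no-repeat #18919c }")

# (?!data:image) right after "url(" rejects inline data URIs.
pattern = re.compile(r"background:\s?url\((?!data:image)[^)]*\)")
urls = pattern.findall(css)
print(urls)
```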
You didn't specify what language you're using, but if the URL you want is always the first one, just don't do a global match (which is what findAll does, whatever language that is). Most likely, changing temp.findAll to temp.match and assigning the results to a scalar string variable will do it. But please tell us which language.

How to use regex to replace text between tags in Notepad++

I have a code like this:
<pre><code>Some HTML code</code></pre>
I need to escape the HTML between the <pre><code></code></pre> tags. I have lots of tags, so I thought - why not let regex do it for me. The problem is I don't know how. I've seen lots of examples using Google and Stackoverflow, but nothing I could use. Can someone here help me?
Example:
<pre><code>Some HTML code</code></pre>
To
<pre><code>Some <a href="http">HTML</a> code</code></pre>
Or just a regex so I can replace anything between the <pre><code> and </code></pre> tags one by one. I'm almost certain that this can be done.
This regex will match the parts of the anchor tag you need to put back:
<pre><code>([^<]*?)(.*?)(.*?)</code></pre>
See a live demo, which shows it matching correctly and also shows the various parts being captured as groups which we'll refer to in the replacement string (see below).
Use the regex above with the following replacement:
<pre><code>\1<a href="\2">\3</a>\4</pre></code>
The \1, \2 etc are the captured groups in the regex that put back what we're keeping from the match.
A regular expression to return "the thing between <pre><code> and </code></pre>" could be
/(?<=<pre><code>).*?(?=<\/code><\/pre>)/
This uses lookaround expressions to delimit the "thing that gets matched". Typically using regex in situations with nested tags is fraught with danger and you are much better off using "real tools" made specifically for the job of parsing xml, html etc. I am a huge fan of Beautiful Soup (Python) myself. Not familiar with Notepad++, so not sure if its dialect of regex matches this expression exactly.
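As a sketch of how the lookaround pattern could drive the escaping itself, the Python snippet below escapes whatever sits between the tags. This is not Notepad++ syntax; `html.escape` is used here as one possible way to do the escaping, and the input is the example from the question.

```python
import html
import re

# Lookarounds keep the delimiters out of the match, so only the
# content between <pre><code> and </code></pre> is replaced.
BETWEEN = re.compile(r"(?<=<pre><code>).*?(?=</code></pre>)", re.DOTALL)

source = '<pre><code>Some <a href="http">HTML</a> code</code></pre>'
escaped = BETWEEN.sub(lambda m: html.escape(m.group(0)), source)
print(escaped)
```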

Regex match an attribute value

What would the regular expression be to return 'details.jsp' (without quotes!) from this original tag? I can quite easily match all of value="details.jsp" but am having trouble matching just the contents of the attribute.
<s:include value="details.jsp" />
Any help greatly appreciated!
Thanks
Lawrence
/value=["']([^'"]+)/ would place "details.jsp" in the first capture group.
Edit:
In response to ircmaxell's comment, if you really need it, the following expression is more flexible:
/value=(['"])(.+)\1/
It will match things like <s:include value="something['else']">, but just note that the value will be placed in the second capture group.
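Both expressions can be tried with a short Python sketch (the surrounding `/.../` delimiters are dropped, since Python takes the pattern as a string). The variable names are made up for the example.

```python
import re

tag = '<s:include value="details.jsp" />'

# First form: the value ends at the next quote of either kind.
simple = re.search(r'''value=["']([^'"]+)''', tag)
print(simple.group(1))

# Second form: group 1 remembers the opening quote and \1 requires the
# same quote to close, so the other quote type may appear inside.
nested = '''<s:include value="something['else']">'''
flexible = re.search(r'''value=(['"])(.+)\1''', nested)
print(flexible.group(2))
```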
But as mentioned before, regex is not what you want to use for parsing XML (unless it's a really simple case), so don't invest too much time into complex regexes when you should be using a parser.