Regex to match everything between two tags

I have a string similar to this:
<td><p>alakjsdlajsdlkj</p><p><b>asdkjalsdkjaskldj</b></p><p>asdjlaksjdlaksjd</p></td>
What is the regular expression to grab everything between the tags?
I want to grab the following (including the HTML)
<p>alakjsdlajsdlkj</p><p><b>asdkjalsdkjaskldj</b></p><p>asdjlaksjdlaksjd</p>

You can't accomplish this with regular expressions. They just aren't descriptive/powerful enough, mainly in that there is no mechanism to keep track of how many of something it has seen. In short, this is because the regex mechanism has no notion of a stack (it represents finite state machines, not pushdown automata).
For example, consider the pattern <p>(.*)</p>. If you used greedy mode (match as much as possible) and have a string like <p>first</p><p>second</p>, the match will be first</p><p>second. If you went with non-greedy mode (make the smallest match possible) and get a string like <p><p>stuff</p></p>, you'll be rewarded with the match <p>stuff. So neither mode covers all cases (or any case) well.
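The two failure modes described above can be reproduced with a quick sketch in Python's re module (the variable names are only for illustration):

```python
import re

# Greedy: .* runs to the LAST </p>, swallowing the tags in between.
greedy = re.search(r"<p>(.*)</p>", "<p>first</p><p>second</p>").group(1)

# Non-greedy: .*? stops at the FIRST </p>, which cuts a nested element short.
lazy = re.search(r"<p>(.*?)</p>", "<p><p>stuff</p></p>").group(1)

print(greedy)  # first</p><p>second
print(lazy)    # <p>stuff
```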
As @kristopher points out, it's possible to have a pattern that avoids including another tag inside the match, but this will only match innermost tags.
To do what you want robustly, you'll need an actual parser. Several html parsing solutions have been created by others, or for simple parsing needs, you might be able to write your own.
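As a rough illustration of the parser route, here is a minimal sketch using Python's standard html.parser module. The class InnerHTML and the helper inner_html are hypothetical names made up for this example, and the sketch assumes well-formed input. It keeps a depth counter for the target tag, which is exactly the bookkeeping a plain regex cannot do. Note that character references are decoded by the parser, so the output is not guaranteed to be byte-for-byte identical to the input.

```python
from html.parser import HTMLParser


class InnerHTML(HTMLParser):
    """Collects the markup found inside the outermost <tag>...</tag>."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many target tags are currently open
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())
        if tag == self.tag:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through as written.
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.depth -= 1
        if self.depth > 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def inner_html(html, tag):
    parser = InnerHTML(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


cell = "<td><p>alakjsdlajsdlkj</p><p><b>asdkjalsdkjaskldj</b></p><p>asdjlaksjdlaksjd</p></td>"
result = inner_html(cell, "td")
print(result)
```

Because the counter goes up and down with each td, a td nested inside another td is copied through intact instead of ending the match early.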

If your tags nest, this gets messy fast.
Are you unable to use an HTML parser library? It would be far better to do so.
<([^>]+)>([^<]+)</\1>
gets you
any string wrapped in angle brackets
plus any characters up until the next tag
this doesn't handle nested or mismatched tags though
<div>test <b>nested</b></div>
will only catch the <b>, not the <div>, since the <div> will encounter the start of the <b> before encountering the end of its own tag.
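A quick check of that behaviour in Python (same pattern, with the sample string from above):

```python
import re

pattern = re.compile(r"<([^>]+)>([^<]+)</\1>")

# Each result is (tag name, text content); only the innermost tag qualifies.
found = pattern.findall("<div>test <b>nested</b></div>")
print(found)  # [('b', 'nested')]
```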

try this, it should just match the outermost tags and return the inner string in the group
^<\w+>(.*)</\w+>$
But it does not check for correct nesting, etc. Use an appropriate framework if possible.

If you can't use an HTML parser and the td and ending td are at the beginning and end of the string:
^<td>(.*)</td>$
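One thing to watch with this anchored pattern: in most engines . does not match a newline by default, so a cell that spans several lines needs the "dot matches all" flag. A small Python sketch (the sample string is made up):

```python
import re

cell = "<td><p>one</p>\n<p><b>two</b></p></td>"

# Without DOTALL the newline stops .* and the anchored match fails.
plain = re.match(r"^<td>(.*)</td>$", cell)

# With DOTALL the whole body between the outer tags is captured.
m = re.match(r"^<td>(.*)</td>$", cell, re.DOTALL)
inner = m.group(1) if m else None
print(inner)
```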

Related

find string that is missing substring in xml files regular expression

This is my reg expression that find it
(<instance_material symbol="material_)([0-9]+)(part)(.*?)(")(/)(>)
I need to find a string that does not contain the word "part"
and the xml lines are
<instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>
<instance_material symbol="material_677" target="#material_677"/>
You can use negative lookahead
^(?!.*part).*?$
^ - start of string.
(?!.*part) - condition to avoid part.
.*? - Match anything except new line.
$ - End of string
Many regex beginners run into the problem of finding a string that does not contain certain words. You can find more useful tips on Regular-Expression.info.
^((?!part).)*$
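To see the negative lookahead at work on the two sample lines, here is a short Python sketch. re.MULTILINE is used so that ^ and $ apply per line rather than to the whole string:

```python
import re

xml_lines = (
    '<instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>\n'
    '<instance_material symbol="material_677" target="#material_677"/>'
)

# Keep only the lines that do not contain "part" anywhere.
no_part = re.findall(r"^(?!.*part).*$", xml_lines, re.MULTILINE)
print(no_part)
```

Only the second line survives. The ^((?!part).)*$ variant selects the same lines; it just repeats the lookahead check at every character instead of once at the start.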
You need to be aware that all attempts to process XML using regular expressions are wrong, in the sense that (a) there will be some legitimate ways of writing the XML document that the regex doesn't match, and (b) there will be some ways of getting false matches, e.g. by putting nasty stuff in XML comments. Sometimes being right 99% of the time is OK of course, but don't do this in production because soon we'll have people writing on SO "I need to generate XML with the attributes in a particular order because that's what the receiving application requires."
Your regex, for example, requires the attribute to be in double rather than single quotes, and it doesn't allow whitespace around the "=" sign, or in several other places where XML allows whitespace. If there's any risk of people deliberately trying to defeat your regex, you need to consider tricks like people writing &#112; in place of p.
Even if this is a one-off with no risk of malicious subversion, you're much better off doing this with XPath. It then becomes a simple query like //instance_material[@symbol[not(contains(., 'part'))]]
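For comparison, here is a sketch of the parser-based approach in Python. As far as I know, the standard library's xml.etree.ElementTree only implements a subset of XPath that does not include contains(), so this example does the "part" test in Python instead; a full XPath 1.0 engine should accept the query directly. The root wrapper element and the single-quoted, extra-whitespace attributes are assumptions added to show what the parser tolerates and the regex does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <root> is made up so the fragment is well-formed.
doc = """<root>
  <instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>
  <instance_material symbol='material_677' target = "#material_677"/>
</root>"""

root = ET.fromstring(doc)

# Elements whose symbol attribute does not contain "part".
symbols = [
    el.get("symbol")
    for el in root.iter("instance_material")
    if "part" not in el.get("symbol", "")
]
print(symbols)  # ['material_677']
```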

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairs from a particular tag:
<aaa>key0:val0 key1:val1 key2:val2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s] which will also match correctly in groups on this:
key0:val0 key1:val1 key2:val2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with - (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>) - there are two matching groups. The bigger one matches everything inside the tags, while the other matches a single "key:value" occurrence. The regex engine cannot give you each individual occurrence of a repeated group because it does not work that way. So it just gives you the last one.
If you think in Python, on the matcher object obtained after applying your regex, you will have access to matcher.group(1) and matcher.group(2), because you have two matching ( ) groups in the regex.
But what you want is the n occurrences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
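A sketch of both points in Python: a repeated capturing group only remembers its last iteration, while the two-step approach (grab the tag body first, then scan it) returns every pair. The variable names are only for illustration.

```python
import re

text = "<aaa>key0:val0 key1:val1 key2:val2</aaa>"

# One regex with a repeated group: only the final repetition is kept.
last_only = re.search(r"<aaa>(?:(\w+:\w+)\s*)+</aaa>", text).group(1)

# Two steps: isolate the tag body, then find every key:value inside it.
body = re.search(r"<aaa>(.*?)</aaa>", text, re.DOTALL).group(1)
pairs = re.findall(r"\w+:\w+", body)

print(last_only)  # key2:val2
print(pairs)      # ['key0:val0', 'key1:val1', 'key2:val2']
```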
I uploaded this one at parsemarket, and I'm not sure it's what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.

Removing empty bbcode tags using regex

Using regex I'm trying to remove empty bbcode tags. By empty I mean nothing in between them:
[tag][/tag]
If there is something between them then it should be kept.
I've searched a lot and played around with a regex tester but haven't come up with anything that works right.
Edit: I realize now why I was having a hard time with this. In addition to the example above, I also have ones like:
[url=http://www.somedomain.com/][/url]
I'm trying to cleanup bbcode when a form is submitted so it's not stored since it's unneeded.
In JavaScript, you could do:
str.replace(/\[([^\[\]]*)\]\[\/\1\]/g, '');
The operative aspect of the regex in this case is the use of backreferences within the pattern. I'm not sure, off the top of my head, whether this is universally supported, but the common backtracking engines (PCRE, .NET, JavaScript, Python) all support it.
The pattern, then, is [, a word, ][/, the same word, ]. If we assume the word has simply the quality of "does not contain ]", then an appropriate regex to match an empty tag is \[([^\]]+)\]\[/\1\], escaped as necessary in context.
For the second case, if we assume the form [tag=arg][/tag], and that tag and arg each don't contain any ']' (not a reasonable assumption! but dealing with it is left as an exercise for the reader -- and I'm quite sure most bbcode implementations don't actually deal with that problem, either), one could use a regex \[([^\]=]+)(=[^\]]*)?\]\[/\1\].
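Here is how that last pattern might be applied in Python. The helper name strip_empty_tags is hypothetical, and the loop is an addition of mine: removing [i][/i] from [b][i][/i][/b] leaves a new empty pair behind, so the substitution is repeated until nothing changes.

```python
import re

# [tag] or [tag=arg], immediately followed by the matching [/tag].
EMPTY_TAG = re.compile(r"\[([^\]=]+)(=[^\]]*)?\]\[/\1\]")


def strip_empty_tags(text):
    """Remove empty bbcode pairs, repeating so nested empties also go."""
    while True:
        cleaned = EMPTY_TAG.sub("", text)
        if cleaned == text:
            return cleaned
        text = cleaned


sample = "a[b][/b]c[url=http://www.somedomain.com/][/url]d[i]keep[/i]"
cleaned = strip_empty_tags(sample)
print(cleaned)  # acd[i]keep[/i]
```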

Last Matched String Issue

I'm using the following regular expression to pull out some html:
(?i)(?:\<tr\s*class='list'[^\>]*\>)[^$+]*\</tr\>
The problem is that it's not segregating the TRs correctly. I'm trying to use $+ to reference the tag selector again to ensure that the contents of the match don't have the start tag again. Here is the sample html:
http://www.pastie.org/1311827
There are multiple <tr>s in some matches. Please help.
I don't know what you think [^$+]* means, but it defines a negated character class that matches zero or more times. In other words, it matches an empty string, or one or more characters that aren't a literal dollar sign or plus.
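A two-line experiment makes this concrete (Python here, but the character class behaves the same way in other engines; the sample strings are made up):

```python
import re

# Inside [...] both $ and + are literal characters, not special operators.
prefix = re.match(r"[^$+]*", "abc$def+ghi").group()
nothing = re.match(r"[^$+]*", "$rest").group()

print(repr(prefix))   # 'abc'
print(repr(nothing))  # ''
```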
HTML cannot be trivially parsed by regex (unless it is known ahead of time what the structure will look like) because in order to properly parse a document you need to be able to recurse, as elements within the document can be nested within themselves (for instance a <div> can contain another <div>). While some languages (you didn't specify what you're using) support recursive regular expressions (Perl and PHP for instance), it would likely be more efficient to use a proper DOM parser than a recursive regex anyway, the complexity of the regex notwithstanding.
Use document.getElementsByTagName in your favorite DOM library and iterate through the nodeList with a loop, then parse the getAttribute('class').
I suggest not using regex because it's only a matter of time before the regex breaks, unless you're dealing with very trivial markup. Besides, the DOM is made for exactly this purpose.
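The question doesn't say which language is in use, so purely as an illustration, here is the same idea sketched with Python's standard html.parser instead of a browser DOM. The class name ListRows and the sample markup are made up (the real sample lives behind the pastie link), and the sketch assumes the rows do not contain nested tables.

```python
from html.parser import HTMLParser


class ListRows(HTMLParser):
    """Collects the text of every <tr> whose class list includes 'list'."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None    # text chunks of the row being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            classes = (dict(attrs).get("class") or "").split()
            self._current = [] if "list" in classes else None

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append("".join(self._current).strip())
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


html = (
    "<table>"
    "<tr class='list'><td>one</td></tr>"
    "<tr class='other'><td>skip</td></tr>"
    "<tr class='list'><td>two</td></tr>"
    "</table>"
)
parser = ListRows()
parser.feed(html)
parser.close()
rows = parser.rows
print(rows)  # ['one', 'two']
```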

Regex challenge: Match phrase only if outside of an <a href> tag

I am working on improving our glossary functionality in a custom CMS that is running with classic ASP (ASP 3.0) on IIS with VBScript code. I am stumped on a regex challenge I cannot solve.
Here is the current code:
If InStr(ART_ArticleBody, "href") = False then
    sql="SELECT URL, Term, RegX FROM GLOSSARYDB;"
    Set rsGlossary = Server.CreateObject("ADODB.Recordset")
    rsGlossary.open sql, strSQLConn
    Set RegExObject = New RegExp
    While Not rsGlossary.EOF
        URL = rsGlossary("URL")
        Phrase = rsGlossary("RegX")
        With RegExObject
            .Pattern = Phrase
            .IgnoreCase = true
            .Global = false
        End With
        set expressionmatch = RegExObject.Execute(ART_ArticleBody)
        if expressionmatch.count > 0 then
            For Each expressionmatched in expressionmatch
                RegExObject.Pattern = Phrase
                URL = ""& expressionmatched.Value & ""
                ART_ArticleBody = RegExObject.Replace(ART_ArticleBody, URL)
            next
        end if
        rsGlossary.movenext
    wend
    rsGlossary.movefirst
    Set RegExObject = nothing
end if
Instead of skipping putting glossary links in any article that has an href in it, as the above code does, I would like to change the code to process every article but have the RegEx pattern avoid matching on a glossary entry if the match is inside of an a tag.
For example, in italics below is a test example for this regex entry in my DB: ROI|return on investment|investment return
Here is a link that uses the glossary term: Info on return on investment.
Now, here is the glossary term in plain text, not inside of a link: return on investment.
We want to find the third instance of a match but not find the first two because they are both inside of a HTML link.
In the above text, if I were processing the article for the glossary entry "ROI|return on investment|investment return", I do not want to match on the first or second occurrence because they are inside an a tag. I need the regex pattern to skip over those matches and just match on any that are not inside of an a tag.
Any help on this would be greatly appreciated.
Try this regex:
<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)
This matches an HTML anchor, or any of the terms you're looking for. The terms are captured into group number 1. So in your VBScript code, check if the first capturing group matched anything, and you've got one of your keywords outside an <a> tag.
This regex indeed won't work correctly if you have nested <a> tags. That shouldn't be a problem, as anchors are normally not nested inside each other. If it is a problem, you can't solve it with VBScript/JavaScript regular expressions. The regex also won't work correctly if you have <a> tags that are missing their closing tags. If you want to take that into account, try this regex:
<a\b[^<>]*>(?:(?:(?!<a\b)[\s\S])*?</a>)?|(ROI|return on investment|investment return)
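The question is about VBScript, but the "match what you want to skip, capture what you want to keep" trick is easier to show in a language with replacement callbacks, so here is a hedged Python sketch of the first pattern. GLOSSARY_URL and the sample article text are made-up placeholders.

```python
import re

GLOSSARY_URL = "/glossary/roi"  # hypothetical target for the glossary link

pattern = re.compile(
    r"<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)"
)


def link_term(match):
    # Group 1 is empty when the regex consumed a whole <a>...</a> element:
    # put it back untouched. Otherwise a bare glossary term was found.
    if match.group(1) is None:
        return match.group(0)
    return f'<a href="{GLOSSARY_URL}">{match.group(1)}</a>'


article = 'See <a href="/x">Info on return on investment</a>. Plain: return on investment.'
linked = pattern.sub(link_term, article)
print(linked)
```

Only the plain-text occurrence gets wrapped; the one inside the existing link is swallowed by the first alternative and written back as it was.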
This problem is, as they say, "non-trivial" in its current state. However, if you could modify your system to output more semantic markup, it would make things much easier:
undesired tag match
This is <span class="tag">a tag</span>
In this case, you can simply search:
(?<=<span class=\"tag\">)(phrase1|phrase2|phrase3)(?=</span>)
Or something a little more robust
(?<=<span class=\"tag\">).+?(?=</span>)
This way you can easily focus your searches to data within a specific <span>, and leave everything else aside.
You can't solve it because it can't be done, at least not with 100% reliability. HTML is not a "regular" language in the regular expression sense. Like the saying goes, when you have a hammer, everything starts to look like a nail. There are some things regular expressions aren't good at. This is one of them.
Most languages have some form of HTML parsing library as standard or easily obtained. Use those. That's what they were designed for.
In general, you can't use a regular expression to recognize arbitrarily nested constructs (such as bracket-delimited HTML tags). If you had solved this problem, there'd be a lot of mathematicians lining up to hear about it. :)
Having said that, .NET does indeed offer an extension to regular expressions that permits what I just said was impossible, and--even better!--the sample chapter for the great "Mastering Regular Expressions" available here happens to cover that feature.
(accounts receivable|A/R)(?!((?!</?a\b).)*</a)
(phrase1|phrase2|phrase3)(?!((?!</?a\b).)*</a)
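For anyone who wants to try the lookahead version outside RegexBuddy, here is a small Python sketch of the same pattern with the ROI terms substituted in. The href value and the sample text are made up. The lookahead rejects a term when a closing </a appears later with no other a tag in between, which is taken to mean the term sits inside a link.

```python
import re

pattern = re.compile(
    r"(ROI|return on investment|investment return)(?!((?!</?a\b).)*</a)"
)

article = (
    'See <a href="/x">Info on return on investment</a>. '
    'Plain: return on investment, then <a href="/y">more</a>.'
)

# Wrap only the occurrences that are not already inside an anchor.
linked = pattern.sub(r'<a href="/glossary/roi">\1</a>', article)
print(linked)
```

Since . does not cross line breaks by default, an anchor whose text spans several lines would need the engine's "dot matches newline" option (re.DOTALL in Python) for the check to see the closing tag.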
The above approach seems to work, at least in my RegexBuddy software. I didn't figure it out on my own. Had some help from a guru. Time to test it in my ASP code. Thanks to all who provided input. I'm sure I didn't describe what I needed well enough for you to come up with the above solution. Mea culpa.