Capturing regex group using condition - regex

I have a regular expression that breaks HTML into the pieces I need. I won't post the whole regex because it's too long; in a nutshell, it's a multi-line parser that walks table cells row by row. Recently I ran into trouble: the layout of the pages I parse changed, so I started reworking the regex to fit the new layout, but I found that the markup wrapping the data I need in a particular cell may differ from row to row.
What do we have?
The layout of the cell may be like this or like this
which leads me to my question: how do I capture the data I need without ending up with an additional, unnecessary group?
Conditionals in regexes are described at regular-expressions.info/conditional.html; I've read it but still don't have a clue.

This should help :)
<td class='(?:class1|class2)'>\s*((?=\w).*)\s*</td>

Edited: adopted regexhacks' expression, as it is the better solution.
Not sure, but maybe you are looking for non-capturing groups, written as (?:...). Thus you could do
<td class='class(?:1|2)'>\s*((?=\w).*)\s*</td>
Well, in this example you would not need the groups:
<td class='class[12]'>\s*((?=\w).*)\s*</td>
but in more complex cases you could use them.
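As a rough illustration of the non-capturing group, here is a minimal Python sketch. The sample HTML is invented for the example, and the inner quantifier is made lazy (.*?) so that two cells on one line are matched separately; that is a change from the expression above, not part of it.

```python
import re

# Hypothetical sample input; only the class names come from the answer.
html = "<td class='class1'> foo </td><td class='class2'>bar</td>"

# (?:...) groups the alternatives without creating a capture group,
# so the cell content is the only captured group.
CELL = re.compile(r"<td class='(?:class1|class2)'>\s*((?=\w).*?)\s*</td>")

contents = CELL.findall(html)
print(contents)
```

Because the pattern has exactly one capturing group, findall returns the cell contents directly rather than tuples.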
See sample: rubular
But this might not be what you want. Could you give a more precise example of the problem?

Related

Regular expression with the same opening and closing optional tag

I need help with this regexp.
Using
/\{block:(Posts|Photos|Videos)(\s\[.*?\])?\}(\s?[^\"]+\s?)\{\/block\}/U
I get {block:Posts}abcdef{/block} from this:
<div>
{block:Posts [a=1, b=2]}
abcdef
{/block}
</div>
But if my text is like this:
<div>
{block:Posts [a=1, b=2]}
{block:Text}
abcdef
{/block}
{/block}
</div>
I get {block:Posts}{block:Text}abcdef{/block} because the match ends at the first {/block} found in the text.
A simple way to avoid this is to close the block with {/block:Posts}, but how can I do that when the name in the opening tag can be any of (Posts|Photos|Videos)? If I open the block with Photos, I must be sure it is closed with {/block:Photos}.
Using /\{block:(Posts|Photos|Videos)(\s\[.*?\])?\}(\s?[^\"]+\s?)\{\/block\:(Posts|Photos|Videos)\}/U of course doesn't help...
Can anyone please help me?
Thanks!!
PS
Is it possible, modifying the regex above, to get the optional parameters a and b as an array?
There might be an overall better solution for your problem, but you can use a backreference in this case, as (Posts|Photos|Videos) is capture group already:
\{\/block:\1\}
You can do this using a backreference:
\{block:(Posts|Photos|Videos)(\s\[.*?\])?\}(\s?[^\"]+\s?)\{\/block:\1\}
Note the added backreference \1 at the end. The backreference will match whatever was matched by the first group, i.e. the first pair of parentheses, in our case (Posts|Photos|Videos).
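A minimal Python sketch of the backreference idea follows. It is not the original PHP pattern: the body is matched with a lazy .*? under DOTALL instead of [^\"]+ with the /U modifier, and the parameter parsing (which addresses the PS in the question) assumes the simple "key=value, key=value" form shown in the example.

```python
import re

# \1 must repeat whatever name group 1 captured in the opening tag.
BLOCK = re.compile(
    r"\{block:(Posts|Photos|Videos)(?:\s\[(.*?)\])?\}(.*?)\{/block:\1\}",
    re.DOTALL,
)

text = """<div>
{block:Posts [a=1, b=2]}
{block:Text}
abcdef
{/block}
{/block:Posts}
</div>"""

m = BLOCK.search(text)
name, raw_params, body = m.group(1), m.group(2), m.group(3)

# Assumed parameter format: comma-separated key=value pairs.
params = {}
if raw_params:
    for pair in raw_params.split(","):
        key, value = pair.strip().split("=", 1)
        params[key] = value

print(name, params)
```

The inner {/block} no longer ends the match, because only {/block:Posts} satisfies the backreference.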
Note, however, that in general regular expressions are too limited to parse languages like HTML, as explained by this post. Languages which require counting opening entities (like brackets or tags) and then matching the exact number of closing entities can't be expressed using regular expressions. Another example of a language that isn't regular for this reason is the language of arithmetic expressions with parentheses, or a language composed of strings of the form aa...abb...b with the same number of a's and b's. The general proof of this fact uses the Pumping Lemma.
Note also that regular expressions as used in software tools are usually a bit more powerful than bare mathematical regular expressions, due to a number of additions beyond the basic operations of union, concatenation and Kleene star that these tools provide. Backreferences themselves constitute a major enhancement of regular expressions and allow one to express languages that are not considered regular in the mathematical sense. This is why your problem has a solution at all. Counting of opening and closing entities is still impossible, though.
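Where the counting really is needed (unnamed {/block} closers with nesting), one option is to let a regex find the individual tags and keep the depth count in ordinary code. The following Python sketch assumes that approach; the function name outer_block and the token pattern are made up for the illustration.

```python
import re

# Matches either an opening tag such as {block:Posts [a=1]} or a closing {/block}.
TOKEN = re.compile(r"\{block:[^}]*\}|\{/block\}")

def outer_block(text):
    """Return the first complete block, including any blocks nested inside it."""
    depth = 0
    start = None
    for m in TOKEN.finditer(text):
        if m.group(0).startswith("{/"):
            if depth == 0:
                continue  # stray closing tag before any opener: ignore it
            depth -= 1
            if depth == 0:
                return text[start:m.end()]
        else:
            if depth == 0:
                start = m.start()
            depth += 1
    return None  # no balanced block found

sample = "<div>\n{block:Posts [a=1, b=2]}\n{block:Text}\nabcdef\n{/block}\n{/block}\n</div>"
result = outer_block(sample)
print(result)
```

The regex only tokenizes; the counter does the part that a regular language cannot express.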

perl regex help -- hopefully an easy question

Ashamed as I am to admit it, I'm terrible with regex... so here I am to ask for your help :)
I have an HTML file that looks sort of like this:
<table>
<tr>
<td sadf="a">
asdf
</td>
</tr>
</table>
What I'd like to do, with a Perl regex, is remove everything except the td element and its contents, so I would want the output to be this:
<td sadf="a">
asdf
</td>
Please help me out. Thanks
An HTML parser would be much better at this task, but if you insist on using a regular expression, try this:
<td[\s\S]*?</td>
It matches as few of any character as possible up until the end tag </td>.
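The lazy match can be tried out quickly; the sketch below uses Python rather than Perl purely for illustration, with the sample document from the question.

```python
import re

html = """<table>
<tr>
<td sadf="a">
asdf
</td>
</tr>
</table>"""

# [\s\S] matches any character including newlines; *? stops at the first </td>.
cells = re.findall(r"<td[\s\S]*?</td>", html)
print(cells[0])
```

Each list element is one complete td element, so joining them gives the output the question asks for.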
Try using XML::Simple. As others have pointed out, you can't use regex for parsing XML.
XML::Simple will turn your HTML into a hash structure. From there, you can easily locate the "td" element, and copy the whole thing to another hash reference. Then, you can use XML::Simple to turn it back into HTML.
XML::Simple can't guarantee the same structure in the XML it writes back out (although it will be programmatically the same). However, I rarely have problems with turning HTML into a hashref and back into HTML.
A simpler way of thinking of this is that you want to grab the tag part with a regular expression (rather than remove everything except the tag part).
In this case, the regular expression is simple, and would probably look something like this for the first line, for example: <td \w+?="\w*"> (you can match \n to grab a multiline block). It's hard to answer without knowing exactly what is changing in your regex, but if you follow a reference like this one you should be fine.
In addition, it is probably best to do this without regex at all (using an HTML parser instead) if it's anything more than a limited, specific grab. I'll assume you know that you want to use regex, but there are really much better ways of doing this if you've got something more complicated than a very basic search pattern on your hands.
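For comparison, here is what the parser route can look like. This is a sketch using Python's standard html.parser module rather than a Perl module, and the class name TdCollector is made up; it collects the text inside each td element instead of the raw markup.

```python
from html.parser import HTMLParser

class TdCollector(HTMLParser):
    """Collects the text found inside each top-level <td> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <td> elements we are currently inside
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1
            if self.depth == 1:
                self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.cells[-1] += data

collector = TdCollector()
collector.feed('<table>\n<tr>\n<td sadf="a">\nasdf\n</td>\n</tr>\n</table>')
collector.close()

texts = [cell.strip() for cell in collector.cells]
print(texts)
```

Unlike the regex, this does not depend on how the attributes or whitespace inside the tag happen to be written.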

Regex to Parse HTML Tables

I am trying to remove the tables within an HTML file, specifically, for the following document, I'd like to remove anything within the tags <TABLE....> and </TABLE>. The document contains multiple tables with texts in between.
The expression that I came up with, <TABLE.*>\s*[\s|\S]*</TABLE>\s*, however would remove the text in between the tables. In fact it would remove everything between the first <TABLE> and the last </TABLE> tags. I would like to keep the texts in between and only remove the tables. Any suggestion is greatly appreciated. Thanks.
====================
<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
other texts that should be KEPT...
<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
==========================================
The answer is to use an HTML or SGML parser; there are some around for .NET:
http://htmlagilitypack.codeplex.com/
SGML parser .NET recommendations
If you absolutely want to use regular expressions, familiarize yourself with balancing groups, otherwise nested tables will break. It's not easy, and may perform much slower than a regular SGML parser. Be warned though: Seeing your expression I assume that you are a regex newbie (hint: avoid greedy . matches at any cost), so this is probably not yet your cup of tea.
Since I know you're not going to look at an HTML parser even if I tell you you really should, I'll just answer the question.
This matches only tables:
<table.*?>.*?</table>
It requires two options: dotall and ignoreCase.
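As a quick check of those two options, here is a minimal sketch in Python (the thread is about .NET; in Python the corresponding flags are re.DOTALL and re.IGNORECASE). The sample text is shortened from the question.

```python
import re

doc = (
    "<TABLE STYLE=xxx>\n"
    "table texts that should be DELETED...\n"
    "</TABLE>\n"
    "other texts that should be KEPT...\n"
    "<TABLE STYLE=xxx>\n"
    "table texts that should be DELETED...\n"
    "</TABLE>\n"
)

# Lazy quantifiers stop at the nearest </table>, so the text between tables survives.
TABLE = re.compile(r"<table.*?>.*?</table>", re.DOTALL | re.IGNORECASE)

cleaned = TABLE.sub("", doc)
print(cleaned)
```

As noted elsewhere in the thread, this still breaks on nested tables, since the lazy match ends at the first closing tag.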
You can try it here: http://gskinner.com/RegExr/
Now do consider using the HTML Agility Pack suggested by Lucero, OK?
Edit: maybe this was what you meant, sorry:

Regex match an attribute value

What would the regular expression be to return 'details.jsp' (without quotes!) from this original tag? I can quite easily match all of value="details.jsp" but am having trouble matching just the contents of the attribute.
<s:include value="details.jsp" />
Any help greatly appreciated!
Thanks
Lawrence
/value=["']([^'"]+)/ would place "details.jsp" in the first capture group.
Edit:
In response to ircmaxell's comment, if you really need it, the following expression is more flexible:
/value=(['"])(.+)\1/
It will match things like <s:include value="something['else']">, but just note that the value will be placed in the second capture group.
But as mentioned before, regex is not what you want to use for parsing XML (unless it's a really simple case), so don't invest too much time into complex regexes when you should be using a parser.
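Both expressions can be compared side by side; the following is a small Python sketch using the tag from the question, with variable names chosen only for the example.

```python
import re

tag = '<s:include value="details.jsp" />'

# First form: the value lands in group 1.
simple = re.search(r"""value=["']([^'"]+)""", tag)

# Second form: group 1 is the opening quote, \1 requires the same quote
# to close, and the value lands in group 2.
flexible = re.search(r"""value=(['"])(.+)\1""", tag)

print(simple.group(1), flexible.group(2))
```

Note that .+ is greedy, so on a tag with several quoted attributes the second form would run to the last matching quote; a lazy .+? is the usual adjustment in that case.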

Regex challenge: Match phrase only if outside of an <a href> tag

I am working on improving our glossary functionality in a custom CMS that is running with classic ASP (ASP 3.0) on IIS with VBScript code. I am stumped on a regex challenge I cannot solve.
Here is the current code:
If InStr(ART_ArticleBody, "href") = False then
    sql="SELECT URL, Term, RegX FROM GLOSSARYDB;"
    Set rsGlossary = Server.CreateObject("ADODB.Recordset")
    rsGlossary.open sql, strSQLConn
    Set RegExObject = New RegExp
    While Not rsGlossary.EOF
        URL = rsGlossary("URL")
        Phrase = rsGlossary("RegX")
        With RegExObject
            .Pattern = Phrase
            .IgnoreCase = true
            .Global = false
        End With
        set expressionmatch = RegExObject.Execute(ART_ArticleBody)
        if expressionmatch.count > 0 then
            For Each expressionmatched in expressionmatch
                RegExObject.Pattern = Phrase
                URL = ""& expressionmatched.Value & ""
                ART_ArticleBody = RegExObject.Replace(ART_ArticleBody, URL)
            next
        end if
        rsGlossary.movenext
    wend
    rsGlossary.movefirst
    Set RegExObject = nothing
end if
Instead of skipping putting glossary links in any article that has an href in it, as the above code does, I would like to change the code to process every article but have the RegEx pattern avoid matching on a glossary entry if the match is inside of an a tag.
For example, below is a test example for this regex entry in my DB: ROI|return on investment|investment return
Here is a link that uses the glossary term: Info on return on investment.
Now, here is the glossary term in plain text, not inside of a link: return on investment.
We want to find the third instance of a match but not find the first two because they are both inside of an HTML link.
In the above text, if I were processing the article for the glossary entry "ROI|return on investment|investment return", I do not want to match the first or second occurrence because they are in an a tag. I need the regex pattern to skip over those matches and match only those that are not inside of an a tag.
Any help on this would be greatly appreciated.
Try this regex:
<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)
This matches an HTML anchor, or any of the terms you're looking for. The terms are captured into group number 1. So in your VBScript code, check if the first capturing group matched anything, and you've got one of your keywords outside an <a> tag.
This regex indeed won't work correctly if you have nested <a> tags. That shouldn't be a problem, as anchors are normally not nested inside each other. If it is a problem, you can't solve it with VBScript/JavaScript regular expressions. The regex also won't work correctly if you have <a> tags that are missing their closing tags. If you want to take that into account, try this regex:
<a\b[^<>]*>(?:(?:(?!<a\b)[\s\S])*?</a>)?|(ROI|return on investment|investment return)
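The "match the anchor or the term, then check group 1" idea can be sketched as follows. This is Python rather than VBScript, the function name link_terms and the URL are placeholders, and unlike the original code (Global = false) it replaces every occurrence outside an anchor.

```python
import re

PATTERN = re.compile(
    r"<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)",
    re.IGNORECASE,
)

def link_terms(body, url):
    """Wrap glossary terms in a link, leaving existing anchors untouched."""
    def repl(m):
        if m.group(1) is None:
            return m.group(0)  # a whole existing anchor matched: keep it as is
        return '<a href="{}">{}</a>'.format(url, m.group(1))
    return PATTERN.sub(repl, body)

body = 'See <a href="/x">Info on return on investment</a>. Plain text: return on investment.'
linked = link_terms(body, "/glossary/roi")
print(linked)
```

Existing anchors are consumed as a single match, so the terms inside them never get a chance to match on their own.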
This problem is, as they say, "non-trivial" in its current state. However, if you could modify your system to output more semantic markup, it would make things much easier:
undesired tag match
This is <span class="tag">a tag</span>
In this case, you can simply search:
(?<=<span class=\"tag\">)(phrase1|phrase2|phrase3)(?=</span>)
Or something a little more robust
(?<=<span class=\"tag\">).+?(?=</span>)
This way you can easily focus your searches to data within a specific <span>, and leave everything else aside.
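A small Python sketch of the lookaround version is shown below, for illustration only: the sample string is invented, and lookbehind support varies by engine (Python accepts it here because the lookbehind has a fixed width).

```python
import re

html = 'This is <span class="tag">a tag</span> and <span class="tag">another</span>'

# The lookbehind and lookahead assert the surrounding span without consuming it.
TAGGED = re.compile(r'(?<=<span class="tag">).+?(?=</span>)')

found = TAGGED.findall(html)
print(found)
```

Only the text between the span tags is returned; the tags themselves are not part of the match.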
You can't solve it because it can't be done, at least not with 100% reliability. HTML is not a "regular" language in the regular expression sense. Like the saying goes, when you have a hammer, everything starts to look like a nail. There are some things regular expressions aren't good at. This is one of them.
Most languages have some form of HTML parsing library as standard or easily obtained. Use those. That's what they were designed for.
In general, you can't use a regular expression to recognize arbitrarily nested constructs (such as bracket-delimited HTML tags). If you had solved this problem, there'd be a lot of mathematicians lining up to hear about it. :)
Having said that, .NET does indeed offer an extension to regular expressions that permits what I just said was impossible, and--even better!--the sample chapter for the great "Mastering Regular Expressions" available here happens to cover that feature.
(accounts receivable|A/R)(?!((?!</?a\b).)*</a)
(phrase1|phrase2|phrase3)(?!((?!</?a\b).)*</a)
The above approach seems to work, at least in my RegexBuddy software. I didn't figure it out on my own. Had some help from a guru. Time to test it in my ASP code. Thanks to all who provided input. I'm sure I didn't describe what I needed well enough for you to come up with the above solution. Mea culpa.
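For reference, the negative-lookahead pattern can be exercised with a short Python sketch. The sample text is invented, and since the dot does not match newlines by default, anchors spanning several lines would need an adjustment such as re.DOTALL.

```python
import re

# A term matches only if no </a follows before the next <a or </a tag,
# i.e. only if the term is not inside an anchor.
OUTSIDE = re.compile(
    r"(ROI|return on investment|investment return)(?!((?!</?a\b).)*</a)",
    re.IGNORECASE,
)

body = 'See <a href="/x">Info on return on investment</a>. Plain text: return on investment.'

hits = [(m.start(), m.group(1)) for m in OUTSIDE.finditer(body)]
print(hits)
```

Only the occurrence in plain text is reported; the one inside the link is rejected by the lookahead.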