I want the text between the last /th> and /tbody> - regex

This is what I use
output = System.Text.RegularExpressions.Regex.Replace(output, "(?s)/th>(.*?)</tbody>", "$1")
Notice that I am using (.*?) because I want the match to be non-greedy. There are several /th> in the text, and I want to get rid of everything before the LAST /th>.
This is what I got.
<!-- statistics_period -->
<input name="subForm" type="hidden" value="1">
<input name="hidTotal" type="hidden" value="886">
<div class="domlistframe">
<div class="divMainListingTable">
<table width="76%" align="left" class="mainListTable" cellspacing="0" cellpadding="3">
<tbody><tr>
<th nowrap=""> <
<th colspan="4"> </th>
<th id="sercol" nowrap="" colspan="11">Totals</th>
You see? Several /th> there.
Yes, I know full well the horrible consequences of parsing HTML with regular expressions, as described in "RegEx match open tags except XHTML self-contained tags".
I am mostly parsing tables anyway, and it's working.
Note: here is a simpler problem that's equivalent to the one above.
Say I have a text like this
cow cow cow chicken cat cow cat dog hello bla.
Say I want cat dog hello. That is text between the last cow and bla.
What would be the regular expression for that?
Notice I want the text between the LAST cow and bla.
Using
cow.*bla
will give me the whole text.
Using cow.*?bla should give me what I want. However, as you can see from the sample I used, it didn't work.

HINT
Try the pattern:
.*cow((?!cow).*?)bla
for the cow..bla problem.
The leading .* is greedy, so it consumes as much as possible and backtracks only as far as the last cow that can still be followed by bla.
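The hint can be tried on the simplified cow/bla text. The sketch below uses Python's re module, which is an assumption (the question itself uses .NET, though both engines handle greedy and lazy quantifiers the same way for these patterns); the variable names are only illustrative.

```python
import re

text = "cow cow cow chicken cat cow cat dog hello bla."

# Lazy quantifier alone: the match still starts at the FIRST cow,
# because the engine reports the leftmost possible match.
lazy_only = re.search(r"cow.*?bla", text).group(0)

# Greedy prefix: .* swallows everything, then backtracks to the last cow.
after_last_cow = re.search(r".*cow((?!cow).*?)bla", text).group(1)

print(repr(lazy_only))       # 'cow cow cow chicken cat cow cat dog hello bla'
print(repr(after_last_cow))  # ' cat dog hello '
```

The captured group keeps the surrounding spaces; calling .strip() on it gives "cat dog hello".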

This is only a partial answer. Basically I solved the problem by using the technique hjpotter92 uses.
What I did is
output = System.Text.RegularExpressions.Regex.Replace(output, "(?s).*/th>(.*?)</tbody>", "$1")
Because the first .* is greedy, it automatically matches the longest possible string that ends in /th>.
Some questions remain. Why doesn't my original code work?
I suspect it has something to do with the fact that regular expressions are matched from left to right. Again, any input would be welcome.
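The left-to-right suspicion can be checked on a cut-down table. A regex engine reports the leftmost match it can find, and a lazy quantifier only limits how far the match extends from that starting point, so the original pattern anchors on the first /th>. The following is a rough Python sketch of both replacements, assuming Python's re.sub behaves like Regex.Replace for these patterns; the sample HTML is a shortened, made-up version of the table in the question.

```python
import re

# Hypothetical, shortened sample modelled on the table in the question.
html = (
    '<tbody><tr>\n'
    '<th nowrap=""> </th>\n'
    '<th colspan="4"> </th>\n'
    '<th id="sercol">Totals</th>\n'
    '<tr><td>886</td></tr>\n'
    '</tbody>'
)

# Original pattern: the match starts at the FIRST /th>, so only that
# one /th> (and the closing </tbody>) disappears.
original = re.sub(r"(?s)/th>(.*?)</tbody>", r"\1", html)

# Fixed pattern: the greedy .* pushes the start of the capture
# to just after the LAST /th>.
fixed = re.sub(r"(?s).*/th>(.*?)</tbody>", r"\1", html)

print(original)
print(fixed)
```

With the original pattern the first header line is left as `<th nowrap=""> <`, which is the same dangling `<` that shows up in the output quoted in the question.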
I would also like to thank hjpotter92 for telling me what the complement operator in regex is.
Hmmm... Well, this answer does tell me what I should do to make it work, and now it's working. However, it's based on another answer. Which one should I pick as the answer?

re.sub don't replace match [duplicate]

I have an HTML file with some sections that need to be removed.
All sections are removed except one. I was able to reduce it to a small example; strangely, a regex editor does recognize the section.
I want to remove everything between <!-- and -->, but it doesn't work.
test = '<br/><br/> </span> <!--TABLE<table class=MsoTableGrid border=1 cellspacing=0 cellpadding=0 style=\'border-collapse:collapse;border:none\'> <tr style=\'height:12.95pt\'> <td width=225 valign=top style=\'width:109.45pt;border:solid windowtext 1.0pt;padding:2.4pt 5.4pt 2.4pt 5.4pt;height:12.95pt\'> <span style=\'font-family:"Arial",sans-serif\'> <b>Kontosaldo in \x80</b> </span> </td> </tr> <tr style=\'height:12.95pt\'> <td width=146 valign=top style=\'width:109.45pt;border:solid windowtext 1.0pt;padding:2.4pt 5.4pt 2.4pt 5.4pt;height:12.95pt\'> <span style=\'font-family:"Arial",sans-serif\'> [substringR] </span> </td> </tr> </table>TABLE-->'
r = re.compile(r"(?<=<!--)([\s\n.<>\]\[\\=;,€\/\-\'\":\w\n]+)(?=-->)")
mystring = r.sub('', test)
"Everything in between <!-- and -->" is matched by this expression:
<!--.*?-->
Replace it with the empty string, and compile the pattern with the re.DOTALL flag.
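A minimal sketch of that suggestion follows. The helper name strip_comments and the shortened sample string are illustrative and not taken from the question.

```python
import re

# re.DOTALL lets "." match newlines, so multi-line comments are covered.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_comments(html: str) -> str:
    """Remove every <!-- ... --> block, including the delimiters."""
    return COMMENT_RE.sub("", html)

# Shortened stand-in for the test string in the question.
sample = "<br/> </span> <!--TABLE<table>\n<tr><td>x</td></tr>\n</table>TABLE--> <p>kept</p>"
cleaned = strip_comments(sample)
print(cleaned)
```

Because the quantifier is lazy, each comment is removed on its own rather than everything from the first <!-- to the last -->.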
Note: Modifying HTML with regex is a recipe for disaster. Don't do it. This particular task, namely removing comments, is a grey area: regex cannot deal with languages that can be arbitrarily nested (such as HTML), but HTML comments cannot be nested, so there is a good chance that this works. However, don't try the same approach for "replacing all tables"; it won't work.
But still, HTML can be functional and still horribly broken in so many ways that even for this task there will be HTML files that disintegrate completely when you try this seemingly safe regex on them.
The proper approach is just as @Aaron suggests: parse the HTML file into a DOM tree, find the nodes you want to remove, and write the DOM tree back to a file, as shown in this answer: How to find all comments with Beautiful Soup.

how to separate two regexp (for taking text from brackets in commented area)?

I have some html page, it looks like:
<span>Some text</span>
<p>And again</p>
<table>
<thead>
<tr>
<th>Text</th>
<th>Text [some text]</th>
<th>Text</th>
</tr>
</thead>
<tbody>
<!--[content-->
<tr>
<td>again some txt but with [this]</td>
<td>in this td the same situation [oops]</td>
<td>hello [world]</td>
</tr>
<!--content]-->
</tbody>
</table>
<span>here is [the text]</span>
I need to take the text from square brackets, but only inside the commented fields. I have two regexps, and they work fine, but only separately.
/[^[\]]+(?=])/g - this is for text in brackets;
(?=<!--\[content)([\s\S]*?content]-->) - for commented fields.
But I can't combine them. I tried (?=<!--\[content)([^[\]]+(?=]))([\s\S]*?content]-->) but it doesn't work. I don't know much about regexps; how can I combine them?
UPD: for output I need text in brackets only between commented fields (this, oops, world).
First, I would start with a simple one:
(?<=\[)[^\]\[]*(?=\])(?=[\s\S]*?<!--content\]-->)
Explanation
(?<=\[)[^\]\[]*(?=\]) matches the text inside any square brackets,
(?=[\s\S]*?<!--content\]-->) looks ahead to check that a closing content tag follows somewhere later.
That sounds sensible, right? But check out DEMO1: it doesn't work. So the question is, why?
The problem lies in the lookahead assertion, which I described above as:
(?=[\s\S]*?<!--content\]-->) looks ahead for any string that is followed by a closing content tag.
That is WRONG. It should be:
(?=[\s\S]*?<!--content\]-->) looks ahead across any text, including an opening content tag, up to a closing content tag.
So the issue is that [\s\S]*? can run across a content tag boundary: bracketed text that sits before the opening content tag is still followed by a closing content tag somewhere later, so it is matched too.
Workaround
To prevent this, we can attach a negative lookahead for the opening content tag to every character consumed by [\s\S]*. Thus, we get:
(?<=\[)[^\]\[]*(?=\])(?=(?:(?!<!--\[content-->)[\s\S])*?<!--content\]-->)
Notice that
[\s\S]*
is replaced by
(?:(?!<!--\[content-->)[\s\S])*?
which means (?!<!--\[content-->) is checked in front of every character consumed by [\s\S]*. For example, if [\s\S]* consumes ABCDEF..., the negative lookahead is applied like this:
(?!<!--\[content-->)A(?!<!--\[content-->)B(?!<!--\[content-->)C(?!<!--\[content-->)D(?!<!--\[content-->)E(?!<!--\[content-->)F...
Finally, please check DEMO2. See? It just works!
DISCLAIMER: My regex works only for simple examples like the one provided in the question. For more complex input, such as recursive structures, I cannot guarantee anything.
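Since the demo links are not reproduced here, the two patterns can be compared with a small script. This is only a sketch in Python (an assumption, as the question appears to use JavaScript-style regex literals), run against a trimmed copy of the sample page; the names NAIVE and TEMPERED are illustrative.

```python
import re

html = """<table>
<thead>
<tr>
<th>Text [some text]</th>
</tr>
</thead>
<tbody>
<!--[content-->
<tr>
<td>again some txt but with [this]</td>
<td>in this td the same situation [oops]</td>
<td>hello [world]</td>
</tr>
<!--content]-->
</tbody>
</table>
<span>here is [the text]</span>"""

# First attempt: the lookahead may cross the opening content tag.
NAIVE = re.compile(r"(?<=\[)[^\]\[]*(?=\])(?=[\s\S]*?<!--content\]-->)")

# Workaround: every consumed character is guarded by a negative lookahead.
TEMPERED = re.compile(
    r"(?<=\[)[^\]\[]*(?=\])"
    r"(?=(?:(?!<!--\[content-->)[\s\S])*?<!--content\]-->)"
)

naive_hits = NAIVE.findall(html)
tempered_hits = TEMPERED.findall(html)
print(naive_hits)     # ['some text', 'this', 'oops', 'world']
print(tempered_hits)  # ['this', 'oops', 'world']
```

The first pattern also picks up "some text" from the table header, because a closing content tag does follow it later in the page; the guarded version stops at the opening tag and rejects it.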

Trying to replace part of a string which may need Regex pattern matcher?

I am using MVC3,C#, Razor.
I am creating some XHTML from some data.
I need to remove some inline font-size declarations that get added accidentally and can cause chaos with the formatting.
So I might have:
<p> <span style="font-size: 12px">my text</span> </p>
I would like to replace or remove these "font-size" inline CSS rules. I suspect that I need some regex matching approach, but I am unsure how to go about it, apart from the fact that it will be something like:
string pattern = <regex pattern for "<style=" and "font-size" CSS attribute>;
myString = Regex.Replace(myString, pattern, "");
My intended output is either:
<p> <span style="">my text</span> </p>
Thank you in advance.
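One possible shape for such a pattern, shown as a rough sketch only: it is written in Python rather than C#, the pattern is an assumption rather than an answer from the thread, and it removes any font-size declaration it finds, including one that might appear in ordinary text outside a style attribute.

```python
import re

# Assumed pattern: "font-size", a colon, a value running up to the next
# semicolon or quote, plus an optional trailing semicolon and whitespace.
FONT_SIZE_RE = re.compile(r"font-size\s*:\s*[^;\"']*;?\s*", re.IGNORECASE)

html = '<p> <span style="font-size: 12px">my text</span> </p>'
cleaned = FONT_SIZE_RE.sub("", html)
print(cleaned)  # <p> <span style="">my text</span> </p>
```

Other declarations in the same style attribute are left alone, so style="font-size: 12px; color: red" would become style="color: red".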

How to create a regex to match everything inside and including <div>...</div>?

This is the sample text that I'm working with. I'm using Coda to do a find and replace...
<td width="20%"><div > Item #</div></td>
<td width="20%"><div > Pole Tip</div></td>
<td width="20%"><div > Length</div></td>
<td width="20%"><div > Test Weight (lbs.)</div></td>
<td width="20%"><div > Price</div></td>
I want to get rid of the div tags that markup the text inside the td.
Ex...I want to change this:
<td width="20%"><div > Item #</div></td>
to this:
<td width="20%">Item #</td>
So far I have this as a regex:
<div >[\s\w\(\)#]*</div>
However this matches all of the above in my sample text EXCEPT:
<td width="20%"><div > Test Weight (lbs.)</div></td>
In my regex, I even tried to add the ( and )...what am I doing wrong?
In reply to Andy: I agree that parsing well-formed markup should be left to DOM navigation tools. XML parsers for sure, or HTML-to-XML converters, are good. I don't know what Miles is working with, but I frequently work with HTML that is so malformed that it can't be parsed by markup parsers.
In some of my regex tutorials on document parsing, I discuss the regex trim pattern, which is simply zero or more whitespace characters (\s*). Though you might shy away from it because it adds a little length to the pattern, there is virtually no efficiency loss. That being said...
(<td[^>]*>)\s*<div[^>]*>\s*((?:[^<]*(?(?!</div>\s*</td>)<))*)\s*</div>\s*(</td>)
Replace this with $1$2$3 and you get back a clean result. Of course, you can keep or remove as many trims (\s*) as you like; using them is just my personal preference when parsing documents or malformed markup.
That's because you missed the period (.). This works just fine:
<div >[\s\w\(\)#.]*</div>
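To go from matching to the replacement asked for in the question, the text inside the div has to be captured and written back. The sketch below does that in Python as an illustration (the question uses Coda's find and replace, so the \1 replacement syntax there may differ); the capture group and the \s* trims around it are additions to the pattern above.

```python
import re

rows = [
    '<td width="20%"><div > Item #</div></td>',
    '<td width="20%"><div > Pole Tip</div></td>',
    '<td width="20%"><div > Length</div></td>',
    '<td width="20%"><div > Test Weight (lbs.)</div></td>',
    '<td width="20%"><div > Price</div></td>',
]

# Same character class as above (now including "."), wrapped in a
# capture group; the \s* on either side trims surrounding whitespace.
DIV_RE = re.compile(r"<div >\s*([\s\w()#.]*?)\s*</div>")

cleaned = [DIV_RE.sub(r"\1", row) for row in rows]
for row in cleaned:
    print(row)
```

Inside a character class, the parentheses and the period are literal, so they do not need backslashes.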

A regular expression question

I have content something like
<div class="c2">
<div class="c3">
<p>...</p>
</div>
</div>
What I want is to match the inner HTML of div.c2. Its contents may vary a lot. The only problem I am facing is how to make it work so that the right closing div is taken.
You can't. This problem is unsolvable with classic regular expressions, and with most of the existing regex implementations.
However, some regex engines have special support for balanced pair matching. See, e.g., here (.NET). Even in this case, though, your regex will be able to parse only a subset of syntactically correct texts (e.g., what if a </div> is embedded in a comment?). You need an HTML parser to get reliable results.
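As an illustration of the parser route, here is a minimal sketch built on Python's standard html.parser module. The class name InnerHtmlExtractor is hypothetical, and the sketch deliberately ignores comments, doctype declarations and self-closing tags, so it is not a complete solution.

```python
from html.parser import HTMLParser

class InnerHtmlExtractor(HTMLParser):
    """Collects the markup inside the first <div> carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # div nesting depth inside the target, 0 = outside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag == "div":
                self.depth += 1
        elif tag == "div" and not self.parts:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "div":
                self.depth -= 1
            if self.depth:  # skip the target's own closing tag
                self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

html = '<div class="c2">\n<div class="c3">\n<p>...</p>\n</div>\n</div>'
extractor = InnerHtmlExtractor("c2")
extractor.feed(html)
extractor.close()
inner = "".join(extractor.parts).strip()
print(inner)
```

Counting div depth is what picks the right closing tag: the inner </div> brings the counter from 2 back to 1, and only the tag that brings it to 0 ends the capture.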
Any chance this will always be valid XHTML? If so, you'd be better off parsing it as XML than trying to regex this.
Delete the first line, delete the last line. Problem solved. No need for RegEx.
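The following pattern uses { } to tag the part to keep and \1 to refer to it, which is the syntax of the older Visual Studio Find and Replace dialog (in the .NET Regex class the equivalent would be a ( ) group and $1):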
\<div class="c2"\>{[\n a-z.<>="0-9/]+}\</div\>
And we replace that with \1.
Input:
<div class="c2">
<div class="c3">
<p>...</p>
</div></div></div></div></div></div></div></div>
</div>
Output:
<div class="c3">
<p>...</p>
</div></div></div></div></div></div></div></div>