re.sub don't replace match [duplicate] - regex

I have an HTML file with some sections that need to be removed.
All of the sections are removed except one. I was able to reduce it to a small example; oddly, a regex editor does recognize the section.
I want to remove everything between <!-- and -->, but it doesn't work.
test = '<br/><br/> </span> <!--TABLE<table class=MsoTableGrid border=1 cellspacing=0 cellpadding=0 style=\'border-collapse:collapse;border:none\'> <tr style=\'height:12.95pt\'> <td width=225 valign=top style=\'width:109.45pt;border:solid windowtext 1.0pt;padding:2.4pt 5.4pt 2.4pt 5.4pt;height:12.95pt\'> <span style=\'font-family:"Arial",sans-serif\'> <b>Kontosaldo in \x80</b> </span> </td> </tr> <tr style=\'height:12.95pt\'> <td width=146 valign=top style=\'width:109.45pt;border:solid windowtext 1.0pt;padding:2.4pt 5.4pt 2.4pt 5.4pt;height:12.95pt\'> <span style=\'font-family:"Arial",sans-serif\'> [substringR] </span> </td> </tr> </table>TABLE-->'
import re

# Attempted pattern: a hand-written character class between the comment delimiters
r = re.compile(r"(?<=<!--)([\s\n.<>\]\[\\=;,€\/\-\'\":\w\n]+)(?=-->)")
mystring = r.sub('', test)

"Everything inbetween <!-- and -->" is this expression:
<!--.*?-->
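replaced with the empty string. Compile the pattern with the re.DOTALL flag so that . also matches newlines and the expression covers comments that span several lines.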
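A minimal Python sketch of this suggestion. The helper name strip_comments and the shortened sample string are assumptions made for illustration; they are not from the original post. As a side observation, the question's character class contains € while the sample string contains \x80, which are different characters in a Python 3 str, so that may be why the hand-written class never reaches the closing -->.

```python
import re

# Non-greedy match from "<!--" to the nearest "-->".
# re.DOTALL lets "." match newlines, so multi-line comments are covered too.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html: str) -> str:
    """Return html with every <!-- ... --> span removed."""
    return COMMENT_RE.sub("", html)


# Shortened stand-in for the sample in the question (assumed, not the full string).
sample = (
    "<br/><br/> </span> <!--TABLE<table>\n"
    "<b>Kontosaldo in \x80</b> [substringR]\n"
    "</table>TABLE-->"
)
print(repr(strip_comments(sample)))
```

Because .*? accepts any character, the pattern does not depend on listing every symbol that might appear inside a comment.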
Note: Modifying HTML with regex is a recipe for disaster. Don't do it. This particular task, removing comments, is a grey area: regex cannot deal with languages that can be arbitrarily nested (such as HTML), but HTML comments cannot be nested, so there is a good chance that this works. However, don't try the same approach for "replacing all tables"; it won't work.
Still, HTML can be functional and yet horribly broken in so many ways that, even for this task, there will be HTML files that disintegrate completely when you try this seemingly safe regex on them.
The proper approach is just as @Aaron suggests: parse the HTML file into a DOM tree, find the nodes you want to remove, and write the DOM tree back to a file, as shown in this answer: How to find all comments with Beautiful Soup.
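As a rough illustration of the parser-based route, here is a sketch that uses Python's standard-library html.parser instead of Beautiful Soup. Note the substitution: this is an event-based parser that re-emits the markup while skipping comments, not a DOM tree, and the class name CommentStripper and the function strip_comments_parsed are hypothetical names chosen for this example. Processing instructions are not handled in this sketch.

```python
from html.parser import HTMLParser


class CommentStripper(HTMLParser):
    """Re-emit the parsed markup piece by piece, leaving comments out."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it appeared.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")

    def handle_comment(self, data):
        pass  # comments are dropped on purpose


def strip_comments_parsed(html: str) -> str:
    parser = CommentStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_comments_parsed('<p class="a">x &amp; y</p><!-- gone --><br/>'))
```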

Related

How to remove a particular pattern of string from regex match? Can't use XML parser it will recognise only good XML tags [duplicate]

I have a regex pattern /[\w]+=[\w" :]+/ to remove attributes such as id="..." from XML tags. I tried to keep the pattern as generic as possible, but it also removes the href="https: part of an href attribute, which I don't want removed from the XML tags.
Regex pattern: /[\w]+=[\w" :]+/
The source XML string is:
<table id="this is id">
<tr id="this tr id">
<a href="https://www.w3schools.com">Visit W3Schools.com!</a>
<div id="this is div id"><span id="div Class:a">this is span
text</span></div>
</tr>
</table>
I'm expecting this output:
<table >
<tr >
<a href="https://www.w3schools.com">Visit W3Schools.com!</a>
<div ><span >this is span text</span></div>
</tr>
</table>
but I'm getting this output:
<table >
<tr >
<a //www.w3schools.com">Visit W3Schools.com!</a>
<div ><span >this is span text</span></div>
</tr>
</table>
The above is available at this link: My RegEx pattern to remove id attribute
TL;DR Use a negative look-ahead assertion
Details
You can start your regular expression with a negative look-ahead assertion that will exclude the pattern you don't want matched:
(?!href)\b[\w]+=[\w" :]+
And if there were two or more attributes you wanted to exclude, you would list them with an "or" in between:
(?!href|exclude_this_too)\b[\w]+=[\w" :]+
Demo, extended from yours
Please note the \b I also added: it requires [\w]+ to begin at a word boundary. This is important; otherwise the pattern matches ref=..., leaving only the h behind, which is not what you want.
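The effect of the look-ahead and of \b can be checked with a short Python sketch. The variable names and the one-line sample are assumptions for illustration; the href value is taken from the output shown in the question.

```python
import re

# Pattern from the answer: skip attribute names that start with "href",
# and require the name to begin at a word boundary.
ATTR_RE = re.compile(r'(?!href)\b[\w]+=[\w" :]+')

# Same pattern without \b, to show why the boundary matters.
NO_BOUNDARY_RE = re.compile(r'(?!href)[\w]+=[\w" :]+')

sample = '<tr id="this tr id"><a href="https://www.w3schools.com">Visit</a></tr>'

with_boundary = ATTR_RE.sub("", sample)
without_boundary = NO_BOUNDARY_RE.sub("", sample)

print(with_boundary)     # href attribute is left intact
print(without_boundary)  # the match starts at "ref=", leaving a stray "h"
```

Without \b the look-ahead only rejects the position at the h; the engine then retries one character later, where ref= no longer looks like href.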

I want the text between the last /th> and /tbody>

This is what I use:
output = System.Text.RegularExpressions.Regex.Replace(output, "(?s)/th>(.*?)</tbody>", "$1")
Notice that I am using (.*?) because I want the search to be non-greedy. There are several /th> around, and I want to get rid of the text above the LAST /th>.
This is what I got:
<!-- statistics_period -->
<input name="subForm" type="hidden" value="1">
<input name="hidTotal" type="hidden" value="886">
<div class="domlistframe">
<div class="divMainListingTable">
<table width="76%" align="left" class="mainListTable" cellspacing="0" cellpadding="3">
<tbody><tr>
<th nowrap=""> <
<th colspan="4"> </th>
<th id="sercol" nowrap="" colspan="11">Totals</th>
You see? Several /th> there.
Yes, I know full well the horrible consequences of parsing HTML with regular expressions, as described in RegEx match open tags except XHTML self-contained tags.
I am mostly parsing tables anyway, and it's working.
Note: here is a simpler problem that is equivalent to the one above.
Say I have a text like this:
cow cow cow chicken cat cow cat dog hello bla.
Say I want cat dog hello, that is, the text between the last cow and bla.
What would be the regular expression for that?
Notice that I want the text between the LAST cow and bla.
Using
cow.*bla
gives me the whole text.
Using cow.*?bla should give me what I want. However, as you can see from the sample I used, it didn't work.
HINT
Try the pattern:
.*cow((?!cow).*?)bla
for the cow..bla problem.
The leading .* is greedy, so it skips everything up to the last cow that can still be followed by bla.
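A small Python check of the hint on the sample sentence from the question (the variable names are illustrative only). It compares the lazy pattern from the question with the hinted pattern:

```python
import re

text = "cow cow cow chicken cat cow cat dog hello bla."

# Lazy quantifier alone: the match still begins at the FIRST "cow".
lazy = re.search(r"cow(.*?)bla", text)

# Hinted pattern: the greedy ".*" pushes the match start to the LAST "cow".
hinted = re.search(r".*cow((?!cow).*?)bla", text)

print(repr(lazy.group(1)))
print(repr(hinted.group(1)))
```

The captured group keeps the surrounding spaces, so a strip() may be wanted before using the result.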
This is only a partial answer. Basically, I solved the problem by using the technique hjpotter92 uses.
What I did is:
output = System.Text.RegularExpressions.Regex.Replace(output, "(?s).*/th>(.*?)</tbody>", "$1")
Because the first .* is greedy, it automatically matches the longest possible string that ends in /th>.
Some questions remain. Why doesn't my original code work?
I suspect it has something to do with regular expressions working from left to right. Again, any input would be welcome.
I would also like to thank hjpotter92 for telling me what the complement operator in regex is.
Hmmm... Well, this answer does answer the question of what I should do to make it work, and now it's working. However, it's based on another answer. Which one should I pick as the answer?
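On the open question of why the original pattern fails: a regex engine reports the leftmost match, and a lazy quantifier only controls how far that match extends to the right, not where it starts. The sketch below ports both patterns to Python's re module as an illustration; the original code is .NET, and the tiny html string is an assumed stand-in for the real table.

```python
import re

# Assumed miniature input: two header cells, then the rows we want to keep.
html = "<tbody><tr><th>a</th><th>b</th>ROWS</tbody>tail"

# Original attempt: the match starts at the FIRST "/th>", so the text
# between the first and last header cells ends up inside the captured group.
first = re.sub(r"(?s)/th>(.*?)</tbody>", r"\1", html)

# Fixed version: the greedy ".*" consumes everything up to the LAST "/th>".
last = re.sub(r"(?s).*/th>(.*?)</tbody>", r"\1", html)

print(first)
print(last)
```

Note that the greedy prefix is part of the match, so everything before the last /th> is removed by the replacement as well.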

How to create a regex to match everything inside and including <div>...</div>?

This is the sample text that I'm working with. I'm using Coda to do a find and replace...
<td width="20%"><div > Item #</div></td>
<td width="20%"><div > Pole Tip</div></td>
<td width="20%"><div > Length</div></td>
<td width="20%"><div > Test Weight (lbs.)</div></td>
<td width="20%"><div > Price</div></td>
I want to get rid of the div tags that mark up the text inside the td.
Ex...I want to change this:
<td width="20%"><div > Item #</div></td>
to this:
<td width="20%">Item #</td>
So far I have this as a regex:
<div >[\s\w\(\)#]*</div>
However this matches all of the above in my sample text EXCEPT:
<td width="20%"><div > Test Weight (lbs.)</div></td>
In my regex, I even tried to add the ( and )...what am I doing wrong?
In reply to Andy: I agree that parsing well-formed markup should be left to DOM navigation tools. XML parsers, or HTML-to-XML converters, are good for that. I don't know what Miles is working with, but I frequently work with HTML that is so malformed that markup parsers cannot parse it.
In some of my regex tutorials on document parsing, I discuss the regex trim pattern, which is simply zero or more whitespace characters: \s*. Though you might shy away from it because it adds a little length to the pattern, there is virtually no efficiency loss. That being said...
(<td[^>]*>)\s*<div[^>]*>\s*((?:[^<]*(?(?!</div>\s*</td>)<))*)\s*</div>\s*(</td>)
Replace this with $1$2$3 and you win, as well as get back a clean result. Of course, you can replace or remove as many trims (\s*) as you like; it is just a personal preference when I am parsing documents or malformed markup.
That's because you left the period (.) out of the character class. This works just fine:
<div >[\s\w\(\)#.]*</div>
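A quick Python check of this diagnosis, followed by one possible way to unwrap the text. The capture group, the \s* trims, and the names ORIGINAL, FIXED and UNWRAP are additions for illustration and are not part of either answer; the question itself uses Coda's find and replace rather than Python.

```python
import re

rows = [
    '<td width="20%"><div > Item #</div></td>',
    '<td width="20%"><div > Test Weight (lbs.)</div></td>',
]

ORIGINAL = re.compile(r"<div >[\s\w\(\)#]*</div>")   # pattern from the question
FIXED = re.compile(r"<div >[\s\w\(\)#.]*</div>")     # same class plus "."

# Assumed variant: capture the text and trim whitespace around it.
UNWRAP = re.compile(r"<div >\s*([\s\w()#.]*?)\s*</div>")

original_hits = [bool(ORIGINAL.search(row)) for row in rows]
fixed_hits = [bool(FIXED.search(row)) for row in rows]
unwrapped = [UNWRAP.sub(r"\1", row) for row in rows]

print(original_hits)
print(fixed_hits)
print(unwrapped)
```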

Vb.net help me with regex please

I haven't worked with regex before, but I need to parse values from about 500 URLs, and I need a regex to automate it.
Each site contains about 10 values, which I need to separate into their own lists.
1.
<td width="78" style="padding-left:9px;" align="left"><a style="font-weight:bold;color:#E93393;" href="/meanings/Example1.html">Example1</a> </td>
2.
<td width="78" style="padding-left:9px;" align="left"><a style="font-weight:bold;color:#004EFF;" href="/meanings/Example2.html">Example2</a> </td>
So, I need to get those 2 values into separate lists. The color code should determine which list each value goes into.
Could somebody help me? :)
NO..NO..NO..
Regex doesn't work for parsing HTML files.
HTML is neither strict nor regular in its format.
Use HtmlAgilityPack.
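HtmlAgilityPack is a .NET library, so the following is only an analogous sketch of the parser-based idea in Python, using the standard-library html.parser. The class name LinkTextByColor and the color regex are assumptions for this example; a real page may style its links differently.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumed format: a six-digit hex color inside the link's style attribute.
COLOR_RE = re.compile(r"color:\s*(#[0-9A-Fa-f]{6})")


class LinkTextByColor(HTMLParser):
    """Group the text of <a> elements by the color in their style attribute."""

    def __init__(self):
        super().__init__()
        self.by_color = defaultdict(list)
        self._color = None  # color of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            style = dict(attrs).get("style") or ""
            match = COLOR_RE.search(style)
            self._color = match.group(1).upper() if match else None

    def handle_data(self, data):
        if self._color and data.strip():
            self.by_color[self._color].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._color = None


html = (
    '<td width="78" style="padding-left:9px;" align="left">'
    '<a style="font-weight:bold;color:#E93393;" href="/meanings/Example1.html">Example1</a> </td>'
    '<td width="78" style="padding-left:9px;" align="left">'
    '<a style="font-weight:bold;color:#004EFF;" href="/meanings/Example2.html">Example2</a> </td>'
)

parser = LinkTextByColor()
parser.feed(html)
parser.close()
print(dict(parser.by_color))
```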

Regex: Match a <tr> that contains a string

I am trying to match all <tr> elements that contain the word "Source", even when the other attributes (colspan/width/height, contained <td>s and their attributes, etc.) are unknown. (I know this can be done with a JavaScript/jQuery selector, but I am processing the HTML in a non-JavaScript context.)
Example of target:
<tr>
<td>Don't affect this</td>
</tr>
<tr>
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>
(This is what I want to change it to:)
<tr>
<td>Don't affect this</td>
</tr>
<tr class="source">
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>
Here are regex patterns I have tried that haven't worked:
/<tr>((?:.*?)Source(?:s?):(?:.*?))<\/tr>/gmi,
No matches.
/<tr>((?:[\s\S]*?)Source(?:s?):(?:[\s\S]*?))<\/tr>/gmi,
Matches the first tr, but not the second.
I think there's a regex principle I may be failing to grasp here, about greediness or something related. Any suggestions?
/<tr[^>]*>(?:(?!<|source)[\s\S])*(?:<(?!\/?tr)[^>]*>(?:(?!<|source)[\s\S])*)*source[\s\S]*?<\/tr>/i
Are you sure you can't use jQuery for this? :P But seriously, this will be easier to grasp if I put it in terms of Friedl's "unrolled loop" idiom:
opening normal* ( special normal* )* closing
opening: <tr[^>]*> - the opening <tr> tag
normal: (?:(?!<|source)[\s\S])* - zero or more of any characters, with the lookahead to make sure each time that the character is not the beginning of a tag or the word "source"
special: <(?!\/?tr)[^>]*> - any tag except another opening <tr> or a closing </tr>. By consuming a complete tag, we avoid false positives on the word "source" in the name or value of an attribute.
closing: source - The only other thing it could possibly encounter here is a <tr> or </tr> tag, which would indicate a failed match for our purposes. Finding "source" before one of those tags is how we know we've found a match. (The rest of the regex, [\s\S]*?<\/tr>, merely consumes the remainder of the tag so you can retrieve it via group[0].)
A <tr> there isn't necessarily invalid, of course; it could be the beginning of a nested TR element, presumably within a nested TABLE element. If that TR contains the word "source", the regex will match it on a separate match attempt. It will match only the innermost, complete TR tag with the word "source" in it.
As usual when using regexes on HTML, I'm making several simplifying assumptions involving well-formedness, SGML comments, CDATA sections, etc., etc. Caveat emptor.
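To try the pattern outside JavaScript, here is a sketch in Python. The regex body is the one above; the two capture groups and the replacement string are additions made here so that class="source" can be inserted into the opening tag, and the same simplifying assumptions about well-formed input apply.

```python
import re

# Group 1 is "<tr", group 2 is the rest of the matched row.
TR_WITH_SOURCE = re.compile(
    r"(<tr)([^>]*>"
    r"(?:(?!<|source)[\s\S])*"                      # normal
    r"(?:<(?!/?tr)[^>]*>(?:(?!<|source)[\s\S])*)*"  # ( special normal* )*
    r"source[\s\S]*?</tr>)",                        # closing, then rest of row
    re.IGNORECASE,
)

html = (
    "<tr>\n<td>Don't affect this</td>\n</tr>\n"
    '<tr>\n<td colspan="3" width="288" height="57">'
    "<strong>Sources:</strong> Author</td>\n</tr>"
)

result = TR_WITH_SOURCE.sub(r'\1 class="source"\2', html)
print(result)
```

Only the row that contains the word receives the class; the first row cannot match because neither the normal nor the special part is allowed to cross its closing </tr>.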
If you are using a library like jQuery you do not even need to use a regex:
$('tr:contains("Source")').something...