I am trying to remove the tables from an HTML file. Specifically, for the following document, I'd like to remove everything between the <TABLE ...> and </TABLE> tags. The document contains multiple tables with text in between.
The expression that I came up with, <TABLE.*>\s*[\s|\S]*</TABLE>\s*, also removes the text in between the tables: it removes everything between the first <TABLE> tag and the last </TABLE> tag. I would like to keep the text in between and only remove the tables. Any suggestion is greatly appreciated. Thanks.
====================
<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
other texts that should be KEPT...
<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
==========================================
The answer is to use an HTML or SGML parser; there are some around for .NET:
http://htmlagilitypack.codeplex.com/
SGML parser .NET recommendations
If you absolutely want to use regular expressions, familiarize yourself with balancing groups, otherwise nested tables will break. It's not easy, and may perform much slower than a regular SGML parser. Be warned though: Seeing your expression I assume that you are a regex newbie (hint: avoid greedy . matches at any cost), so this is probably not yet your cup of tea.
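To make the parser route concrete, here is a minimal sketch in Python rather than .NET, using only the standard library's html.parser. The class and function names (TableStripper, strip_tables) are made up for illustration. It tracks nesting depth, so nested tables are removed as a unit; comments and doctype declarations are simply dropped, which is an assumption that may not suit every document.

```python
from html.parser import HTMLParser


class TableStripper(HTMLParser):
    """Keeps everything outside <table>...</table>, tracking nesting depth."""

    def __init__(self):
        # convert_charrefs=False so entities are passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.depth = 0   # number of <table> elements we are currently inside
        self.kept = []   # pieces of the document that survive

    def handle_starttag(self, tag, attrs):
        if tag == "table":          # tag names arrive lowercased
            self.depth += 1
        elif self.depth == 0:
            self.kept.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.kept.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.kept.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.kept.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.kept.append(f"&#{name};")


def strip_tables(html):
    """Return html with every table (including nested ones) removed."""
    parser = TableStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.kept)


nested = (
    "<table border=1><tr><td>a<table><tr><td>nested</td></tr></table>b</td></tr></table>"
    "KEEP <b>me</b><TABLE>x</TABLE> tail"
)
print(strip_tables(nested))
```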
Since I know you're not going to look at an HTML parser even if I tell you you really should, I'll just answer the question.
This matches only tables:
<table.*?>.*?</table>
It requires two options: dotall and ignoreCase.
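As a rough illustration in Python (an assumption; the question does not say which language is in use), the two options correspond to re.DOTALL and re.IGNORECASE. The sample text is the document from the question. Note that this sketch still breaks on nested tables, because the lazy match stops at the first </table> it finds.

```python
import re

# dotall lets "." cross line breaks; ignoreCase matches <TABLE> as well as <table>.
TABLE_RE = re.compile(r"<table.*?>.*?</table>", re.DOTALL | re.IGNORECASE)

sample = """<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
other texts that should be KEPT...
<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
"""

# Replace every table with nothing; the text between tables survives.
cleaned = TABLE_RE.sub("", sample)
print(cleaned)
```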
You can try it here: http://gskinner.com/RegExr/
Now do consider using the HTML Agility Pack suggested by Lucero, ok?
Edit: maybe this was what you meant, sorry:
Good day!
I am trying to write a somewhat more difficult regex, but without success :(
I am trying to match HTML starting from
<div class="about">
and to count the closing
</div>
tags, so as to match everything in between.
I wrote a regex, but it does not work. I guess I am missing something, such as the fact that the counted instances could have anything in between them. I tried to google it, but the might of regex is obviously tough for newbies.
<div class="about">[\s\S]*(<\/div>){2}
Help and advice appreciated.
As others have said, you should avoid regexes in many cases where there exists a better parser (whether HTML, CSS, CSV, or whatever) that works for your use-case.
The reason for this is that the data may be tree structured, and might have some of the things you're looking for within other elements; for example, within <!-- --> comments. And then you have to exclude those. Which means recognizing when a comment really is a comment, and it rapidly becomes a mess.
But there are use-cases where such a parser is overkill. If you want a quick guesstimate, from a commandline command rather than a script you'll be using forever and sharing with others, regexes can still be your friend.
Something like this:
<div class="about">([\s\S]*?<\/div>)*
This will capture not only the divs within the "about" div, but every closing div tag in the remainder of the page, whether it's commented out or not (along with any separating tags and whitespace and other stuff). If yours is a simple enough case that this is all you want, then that's fine.
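A small Python check of that behaviour, on a made-up page (the markup and variable names are illustrative only). It also shows, as a contrast, what happens when the star is replaced by a fixed repetition count.

```python
import re

# Hypothetical page: the "about" div holds one inner div, followed by unrelated markup.
page = (
    '<div class="about"><div>bio</div></div>'
    '<p>unrelated</p>'
    '<div id="footer">footer</div>'
)

# Starred group: keeps consuming "anything up to the next </div>" as long as it can.
starred = re.compile(r'<div class="about">([\s\S]*?</div>)*')
starred_match = starred.search(page)

# Fixed count: stops after exactly two closing div tags.
counted = re.compile(r'<div class="about">([\s\S]*?</div>){2}')
counted_match = counted.search(page)

print(starred_match.group(0) == page)   # the starred form runs to the last </div>
print(counted_match.group(0))
```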
But if you want anything complex, then you'll rapidly venture into recursive regexes, with conditionals, and then the pain starts; the DOM tree parser will become the better option, long before you reach that point.
First thanks to everybody sharing time and knowledge.
With your help I did the job with
<div class="about">([\s\S]*?<\/div>){6}
{6} is the count of closing div tags.
However, what is more important is that you gave me the clue that this will only work until the HTML page structure changes, and that to make it permanent I should use a DOM parser.
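For comparison, a parser-based version does not need a hard-coded count. The following is only a sketch in Python using the standard library's html.parser (the question itself concerns PHP, and AboutDivExtractor and about_text are hypothetical names). It counts div nesting while inside the "about" div and collects the text found there.

```python
from html.parser import HTMLParser


class AboutDivExtractor(HTMLParser):
    """Collects the text inside <div class="about">, however deep its divs nest."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # 0 = outside the "about" div
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1                      # nested div inside "about"
        elif "about" in (dict(attrs).get("class") or "").split():
            self.depth = 1                       # entering the "about" div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1                      # reaches 0 when "about" closes

    def handle_data(self, data):
        if self.depth > 0:
            self.text.append(data)


def about_text(html):
    parser = AboutDivExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.text)


sample_page = (
    '<div class="about"><div>bio</div> and <div><div>more</div></div></div>'
    '<div id="footer">footer</div>'
)
print(about_text(sample_page))
```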
I'd like to hear if anyone can help me clean up the HTML markup in my large XML file.
The XML file has my own schema and it's all fine. But I need to remove <span>, <style>, <div> and attributes in <p> tags.
For example, I need to keep all <ul>, <ol>, <li>, <strong>, <a>, <img> and other tags but remove <div> (with attributes), <span> (with attributes), and attributes in <p> tags.
I have tried many examples from this site and many other sites, but most of them didn't work.
Quoting from an answer I posted yesterday:
I've heard some very good things about Beautiful Soup, HTML Purifier, and the HTML Agility Pack, which use Python, PHP, and .NET, respectively. Trust me--save yourself some pain and use those instead.
I strongly advise you not to use regex for this. No sane regex is going to work, or probably even come close to working. However, a decent XML parser can do this fairly easily. I'm not sure what programming languages you have access to, but if you can use PHP, .NET or another programming language, you can use the above parsers to find each span, style, div, and p and remove attributes or the entire tags.
jQuery has some good functionality for DOM-manipulation like you're describing, and you can use it to generate HTML which you then cut and paste.
If you absolutely must use regex, you could try this:
Pattern: <\s*/?\s*(span|style|div)\b[^>]*?>
Replacement: (nothing)
Pattern: <\s*p\b[^>]*?>
Replacement: <p>
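Applied in Python (an arbitrary choice, since the question names no language), the two substitutions might look like the sketch below; clean_markup and the sample string are illustrative. Two caveats to keep in mind: the first pattern removes the <style> tags but leaves any CSS text between them, and both patterns misfire if an attribute value contains a ">" character.

```python
import re

# Opening or closing span/style/div tags, with whatever attributes they carry.
STRIP_TAGS = re.compile(r"<\s*/?\s*(span|style|div)\b[^>]*?>")
# Opening <p ...> tags only; \b keeps it from matching <pre>, <param>, etc.
P_ATTRS = re.compile(r"<\s*p\b[^>]*?>")


def clean_markup(markup):
    """Drop span/style/div tags and strip attributes from <p> tags."""
    without_wrappers = STRIP_TAGS.sub("", markup)
    return P_ATTRS.sub("<p>", without_wrappers)


sample = (
    '<div class="x"><p style="color:red">Hi <span id="s">there</span> '
    '<strong>you</strong></p></div>'
)
print(clean_markup(sample))
```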
Ashamed as I am to admit it, I'm terrible with regex... so here I am to ask for your help :)
I have an HTML file that looks sort of like this:
<table>
<tr>
<td sadf="a">
asdf
</td>
</tr>
</table>
What I'd like to do, with a Perl regex, is remove everything except the td tag and its contents. So I would want the output to be this:
<td sadf="a">
asdf
</td>
Please help me out. Thanks.
An HTML parser would be much better at this task, but if you insist on using a regular expression, try this:
<td[\s\S]*?</td>
It matches as few of any character as possible up until the end tag </td>.
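The question asks for Perl, but the pattern itself is portable; as a quick sketch, here it is in Python against the sample from the question. Instead of deleting everything around the cell, it extracts the matches, which gives the same result.

```python
import re

# [\s\S] matches any character including newlines, so no dotall flag is needed.
TD_RE = re.compile(r"<td[\s\S]*?</td>")

html = """<table>
<tr>
<td sadf="a">
asdf
</td>
</tr>
</table>"""

# Each list element is one complete <td ...>...</td> block.
cells = TD_RE.findall(html)
print("\n".join(cells))
```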
Try using XML::Simple. As others have pointed out, you can't use regex for parsing XML.
XML::Simple will turn your HTML into a hash structure. From there, you can easily locate the "td" element, and copy the whole thing to another hash reference. Then, you can use XML::Simple to turn it back into HTML.
XML::Simple can't guarantee the same structure in XML (although it'll be programmatically the same). However, I rarely have problems with turning HTML into a hashref and back into HTML.
A simpler way of thinking of this is that you want to grab the tag part with a regular expression (rather than remove everything except the tag part).
In this case, the regular expression is simple, and would probably look something like this for the first line, for example: <td \w+?="\w*"> (you can match \n to grab a multiline block). It's hard to answer without knowing exactly what is changing in your regex, but if you follow a reference like this one you should be fine.
In addition, it is probably best to do this without regex at all (using an HTML parser instead) if it's anything more than a limited, specific grab. I'll assume you know that you want to use regex, but there are really much better ways of doing this if you've got something more complicated than a very basic search pattern on your hands.
I have a regular expression that breaks HTML into the pieces I need. I will not present the whole regex, because it's too long. In a nutshell, it's a multi-line parser that reads table cells row by row. Recently I ran into trouble: the layout of the pages being parsed has changed, so I started reworking the regex to fit the new layout, but I found that the markup wrapping the data I need in a particular cell may differ between rows.
What do we have?
The layout of the cell may be like this or like this
which leads me to my question: how do I capture the data I need without getting an additional, unnecessary group?
Conditionals in regexps are described here: regular-expressions.info/conditional.html. I've read it but still don't have a clue.
This should help :)
<td class='(?:class1|class2)'>\s*((?=\w).*)\s*</td>
Edited: adopted regexhacks' expression, as it is the better solution.
Not sure, but maybe you are looking for non-capturing groups, written as (?:...). Thus you could do
<td class='class(?:1|2)'>\s*((?=\w).*)\s*</td>
Well, in this example you would not need the groups:
<td class='class[12]'>\s*((?=\w).*)\s*</td>
but in more complex cases you could use them.
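To see the difference in the number of groups, here is a short Python comparison. The capturing variant is an assumption about what the original, unposted regex may have looked like, and the sample row is made up.

```python
import re

# Assumed original style: the alternation is a capturing group, so it becomes group 1.
capturing = re.compile(r"<td class='(class1|class2)'>\s*((?=\w).*)\s*</td>")
# Same pattern with (?:...): the alternation no longer produces a group.
non_capturing = re.compile(r"<td class='(?:class1|class2)'>\s*((?=\w).*)\s*</td>")

row = "<td class='class2'>42</td>"

with_extra = capturing.search(row).groups()
without_extra = non_capturing.search(row).groups()

print(with_extra)      # ('class2', '42')
print(without_extra)   # ('42',)
```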
See sample: rubular
But this might not be what you want. Could you give a more precise example of the problem?
I'm trying to write a regular expression for my html parser.
I want to match an HTML tag with a given attribute (e.g. <div> with class="tab news selected") that contains one or more <a href> tags. The regexp should match the entire tag (from <div> to </div>). I always seem to get "memory exhausted" errors - my program probably takes every tag it can find as a matching one.
I'm using boost regex libraries.
You should probably look at this question re. regexps and HTML. The gist is that using regular expressions to parse HTML is not by any means an ideal solution.
You may also find these questions helpful:
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?
As others have said, don't use regexes if at all possible. If your code is actually XHTML (i.e. it is also well-formed XML), I can recommend both the Xerces and Expat XML parsers, which will do a much better job for you than regexes.
Maybe regexps aren't the best solution, but I'm already using like five different libraries and boost does fine when it comes to locating <a href> tags and keywords.
I'm using these regexps:
/<a[^\n]*/searched attribute/[^\n]*>[^\n]*</a>/ for locating <a href> tags and:
/<a[^\n]*href[[^\n]*>/searched keyword/</a>/ for locating links
(BTW can it be done better? - I suck at regex ;))
What I need now is locating tags containing <a href>'s and I think regexps will do all right - maybe I'll need to write my own parsing function as piotr said.
Do as flex does: match <div> with a case insensitive match, and put your parser in a "div matched" state, keep processing input until </div> and reset state.
This takes two regexps and a state variable.
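A rough sketch of that scheme, written in Python for brevity rather than with Boost (the function name div_contents and the exact patterns are assumptions, not taken from flex). One regex finds the opening tag, the other finds the closing tag, and a boolean holds the "div matched" state. Each search starts where the previous one ended, so nothing is re-scanned; nested divs are not handled.

```python
import re

OPEN_DIV = re.compile(r"<div\b[^>]*>", re.IGNORECASE)
CLOSE_DIV = re.compile(r"</div\s*>", re.IGNORECASE)


def div_contents(text):
    """Yield the text between each <div ...> and the next </div>."""
    in_div = False   # the "div matched" state
    pos = 0          # where the next search starts
    start = 0        # where the current div's content begins
    while True:
        pattern = CLOSE_DIV if in_div else OPEN_DIV
        match = pattern.search(text, pos)
        if match is None:
            return
        if in_div:
            yield text[start:match.start()]
            in_div = False               # reset state after </div>
        else:
            start = match.end()
            in_div = True                # enter the "div matched" state
        pos = match.end()


sample = '<DIV class="tab news selected"><a href="/x">x</a></div> gap <div>y</div>'
# Keep only the divs that contain at least one link.
with_links = [body for body in div_contents(sample) if "<a href" in body]
print(with_links)
```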
Valid characters in SGML tag names are [A-Za-z_:]
So: /<[A-Za-z_:]+>/ matches a tag (one without attributes).