How to write a regular expression for html parsing? - c++

I'm trying to write a regular expression for my HTML parser.
I want to match an HTML tag with a given attribute (e.g. <div> with class="tab news selected") that contains one or more <a href> tags. The regexp should match the entire tag (from <div> to </div>). I always seem to get "memory exhausted" errors - my program probably takes every tag it can find as a matching one.
I'm using the Boost regex library.

You should probably look at this question re. regexps and HTML. The gist is that using regular expressions to parse HTML is not by any means an ideal solution.

You may also find these questions helpful:
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?

As others have said, don't use regexes if at all possible. If your code is actually XHTML (i.e. it is also well-formed XML), I can recommend both the Xerces and Expat XML parsers, which will do a much better job for you than regexes.
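For the well-formed XHTML case, here is a minimal sketch of the parser approach. It uses Python's xml.etree.ElementTree (whose default parser is built on Expat) rather than the C/C++ APIs named above. The sample markup and the function name divs_with_links are made up for illustration, and input that is not well-formed XML would raise a ParseError instead of being handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment used only for illustration.
XHTML = """<body>
  <div class="tab news selected"><p><a href="/one">One</a></p></div>
  <div class="tab news selected"><p>No links here</p></div>
  <div class="other"><a href="/two">Two</a></div>
</body>"""


def divs_with_links(source, wanted_class):
    """Return the serialized <div> elements whose class attribute equals
    wanted_class and that contain at least one <a href=...> descendant."""
    root = ET.fromstring(source)
    matches = []
    for div in root.iter("div"):
        if div.get("class") != wanted_class:
            continue
        if div.find(".//a[@href]") is not None:
            # tostring() also emits the whitespace after the element,
            # so strip it to keep only <div>...</div>.
            matches.append(ET.tostring(div, encoding="unicode").strip())
    return matches


found = divs_with_links(XHTML, "tab news selected")
print(found)
```

Each returned string is the whole element from <div> to </div>, which is what the question asks the regexp to capture.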

Maybe regexps aren't the best solution, but I'm already using like five different libraries and boost does fine when it comes to locating <a href> tags and keywords.
I'm using these regexps:
/<a[^\n]*/searched attribute/[^\n]*>[^\n]*</a>/ for locating <a href> tags and:
/<a[^\n]*href[^\n]*>/searched keyword/</a>/ for locating links
(BTW can it be done better? - I suck at regex ;))
What I need now is locating tags containing <a href>'s and I think regexps will do all right - maybe I'll need to write my own parsing function as piotr said.

Do as flex does: match <div> with a case insensitive match, and put your parser in a "div matched" state, keep processing input until </div> and reset state.
This takes two regexps and a state variable.
Valid characters in SGML tag names are [A-Za-z_:]
So: /<[A-Za-z_:]+>/ matches a tag without attributes (add 0-9 to the class to cover tags such as <h1>).
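The two-regexp-plus-state idea could be sketched roughly like this. It is written in Python rather than with flex or Boost, the function name div_contents and the sample input are invented, and the opening pattern is widened with [^>]* so that a <div> carrying attributes still matches (an assumption beyond the tag pattern given above).

```python
import re

# Two patterns and one state flag. Both patterns are case-insensitive.
OPEN_DIV = re.compile(r"<div\b[^>]*>", re.IGNORECASE)
CLOSE_DIV = re.compile(r"</div\s*>", re.IGNORECASE)


def div_contents(text):
    """Collect the text found between each <div ...> and the next </div>.

    Nested <div> elements are not tracked; that would need a depth
    counter in place of the boolean state.
    """
    results = []
    pos = 0
    start = 0
    in_div = False  # the "div matched" state
    while True:
        # Look for the closing tag while inside a div, else the opening one.
        pattern = CLOSE_DIV if in_div else OPEN_DIV
        match = pattern.search(text, pos)
        if match is None:
            break
        if in_div:
            results.append(text[start:match.start()])
        else:
            start = match.end()
        in_div = not in_div  # flip the state on every matched tag
        pos = match.end()
    return results


sample = '<p>skip</p><DIV class="x"><a href="/a">A</a></DIV><div>two</div>'
print(div_contents(sample))
```

Because each search is anchored at the current position and stops at the first closing tag, there is no backtracking across the whole document, which is the kind of thing that tends to cause "memory exhausted" errors.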

Related

Strip specific HTML tags using Notepad++

I'd like to hear if anyone can help me replace the HTML markup in my large XML file.
The XML file has my own schema and it's all fine. But I need to remove <span>, <style>, <div> and attributes in <p> tags.
For example, I need to keep all <ul>, <ol>, <li>, <strong>, <a>, <img> and other tags but remove <div> (with attributes), <span> (with attributes), and attributes in <p> tags.
I have tried many examples from this site and many other sites, but most of them didn't work.
Quoting from an answer I posted yesterday:
I've heard some very good things about Beautiful Soup, HTML Purifier, and the HTML Agility Pack, which use Python, PHP, and .NET, respectively. Trust me--save yourself some pain and use those instead.
I strongly advise you not to use regex for this. No sane regex is going to work, or probably even come close to working. However, a decent XML parser can do this fairly easily. I'm not sure what programming languages you have access to, but if you can use PHP, .NET or another programming language, you can use the above parsers to find each span, style, div, and p and remove attributes or the entire tags.
jQuery has some good functionality for DOM-manipulation like you're describing, and you can use it to generate HTML which you then cut and paste.
If you absolutely must use regex, you could try this:
Pattern: <\s*/?\s*(span|style|div)\b[^>]*?>
Replacement: (nothing)
Pattern: <\s*p\b[^>]*?>
Replacement: <p>
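If you want to check what the two replacements do before running them on the real file, a small script can apply them in the same order. This is only a sketch: the patterns are copied from above, while the sample string and the function name strip_markup are made up.

```python
import re

# Pattern 1: opening or closing span/style/div tags, replaced by nothing.
TAGS_TO_DROP = re.compile(r"<\s*/?\s*(span|style|div)\b[^>]*?>")
# Pattern 2: opening <p ...> tags, replaced by a bare <p>.
P_WITH_ATTRS = re.compile(r"<\s*p\b[^>]*?>")


def strip_markup(text):
    """Apply both replacements in the order listed above."""
    text = TAGS_TO_DROP.sub("", text)
    return P_WITH_ATTRS.sub("<p>", text)


sample = ('<div id="x"><p class="lead">Hi <span style="c">there</span></p>'
          '<ul><li>kept</li></ul></div>')
cleaned = strip_markup(sample)
print(cleaned)
```

Note that only the tags themselves are removed. Whatever sits between <style> and </style> stays in the output as plain text, and the patterns are case-sensitive as written.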

Regex to Parse HTML Tables

I am trying to remove the tables within an HTML file, specifically, for the following document, I'd like to remove anything within the tags <TABLE....> and </TABLE>. The document contains multiple tables with texts in between.
The expression that I came up with, <TABLE.*>\s*[\s|\S]*</TABLE>\s*, however would remove the text in between the tables. In fact it would remove everything between the first <TABLE> and the last </TABLE> tags. I would like to keep the texts in between and only remove the tables. Any suggestion is greatly appreciated. Thanks.
====================
<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
other texts that should be KEPT...
<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
==========================================
The answer is to use an HTML or SGML parser; there are some around for .NET:
http://htmlagilitypack.codeplex.com/
SGML parser .NET recommendations
If you absolutely want to use regular expressions, familiarize yourself with balancing groups, otherwise nested tables will break. It's not easy, and may perform much slower than a regular SGML parser. Be warned though: Seeing your expression I assume that you are a regex newbie (hint: avoid greedy . matches at any cost), so this is probably not yet your cup of tea.
Since I know you're not going to look at an HTML parser even if I tell you you really should, I'll just answer the question.
This matches only tables:
<table.*?>.*?</table>
It requires two options: dotall and ignoreCase.
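As a quick check of that pattern against the sample document from the question, here is a sketch in Python, where the two options are spelled re.DOTALL and re.IGNORECASE (the variable names are arbitrary):

```python
import re

# Non-greedy match from each <table ...> to the nearest </table>.
TABLE = re.compile(r"<table.*?>.*?</table>", re.DOTALL | re.IGNORECASE)

document = """<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
other texts that should be KEPT...
<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
"""

without_tables = TABLE.sub("", document)
print(without_tables)
```

Each match stops at the first </TABLE> it finds, so the text between the tables survives. For the same reason a table nested inside another table would be cut off at the inner closing tag.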
You can try it here: http://gskinner.com/RegExr/
Now do consider using the HTML Agility Pack suggested by Lucero, OK?
Edit: maybe this was what you meant, sorry:

Which regular expression to use to extract some words from an HTML text?

I am having a hard time building a regular expression to grab some words from an HTML text.
Let's say I have the following:
<p style="padding-left :12px">SOME_TEXT_I_WANT</p><p>SOME_OTHER_TEXT</p>
*SOME_TEXT_I_WANT* and *SOME_OTHER_TEXT* can be either a bunch of words like "SOME RANDOM TEXT" or HTML text like "<strong>SOME BOLD TEXT</strong>"
My goal is to extract those texts with one regex.
Which language do you intend to use? Does an HTML parser exist for this language? If yes, consider using a parser.
However, if this is a "one-off", you may be able to get through with something along the lines of:
#<p[^>]*>(.*?)</p>#
The above has certain limitations: most notably, it does not handle <p data-something="a > b">...</p> correctly (the opening tag is cut off at the first >), nor nested <p>s. (I am not able to tell whether the mark-up you're trying to parse actually allows nested <p>s; I'm just informing you of possible pitfalls.)
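To see both the normal case and the attribute pitfall, the pattern can be tried out like this. The sketch is in Python, so the # delimiters are dropped. re.DOTALL is an addition of mine so that paragraphs spanning several lines are also captured, and the "tricky" input is invented.

```python
import re

PARAGRAPH = re.compile(r"<p[^>]*>(.*?)</p>", re.DOTALL)

html = '<p style="padding-left :12px">SOME_TEXT_I_WANT</p><p>SOME_OTHER_TEXT</p>'
texts = PARAGRAPH.findall(html)
print(texts)

# A ">" inside an attribute value ends the opening tag too early,
# so the rest of the attribute leaks into the captured text.
tricky = '<p data-something="a > b">inner</p>'
broken = PARAGRAPH.findall(tricky)
print(broken)
```

The first call yields the two texts from the question. The second shows the capture polluted by the tail of the attribute, which is the sort of silent failure a parser avoids.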
Assuming you are using PHP:
$html = "<p>some text here</p>";
$text = preg_replace("/<.+?>/", "", $html); // strips every tag, leaving "some text here"
Don't use regex. If you ask why, there is a very popular SO post that describes what can happen if you try to use regex for parsing HTML.
Use your language's HTML or XML parser and extract what you need using existing functionality.

How can I parse <img src> with a regex?

I need a clever regex to match ... in these:
<img src="..."
<img src='...'
<img src=...
I want to match the inner content of src, but only if it is surrounded by ", ' or none. This means that <img src=..." or <img src='... must not be accepted.
Any ideas how to match these 3 cases with one regex?
So far I use something like this ("|'|[\s\S])(.*?)\1 and the part that I want to get rid of is the hacky [\s\S] which I use to match the "missing symbol" at the beginning and the end of the ....
Wow, second one I'm answering today.
Don't parse HTML with regex. Use an HTML/XML parser and your life will be much easier. Tidy will clean up your HTML code for you, so you can run the HTML through Tidy first and then through a parser. Some Tidy-based libraries will perform parsing in addition to sanitizing, and so you may not even have to run it through another parser.
Java, for example has JTidy and PHP has PHP Tidy.
UPDATE
Against my better judgement, I'm giving you this:
/<img\s+src\s*=\s*(["'][^"']+["']|[^>]+)>/
Which works only for your specific case. Even so, it will not take into account escaped " or ' in your image-source names, or the > character. There are probably a bunch of other limitations as well. The capturing group gives you your image names (in the case of names surrounded by single or double quotes, it gives you those as well, but you can strip those out).
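Here is a rough way to exercise that pattern and strip the surrounding quotes from the capturing group. It is a Python sketch, and the helper name image_source and the test tags are made up.

```python
import re

# Same pattern as above, in a raw triple-quoted string so that both
# quote characters can appear unescaped.
IMG_SRC = re.compile(r"""<img\s+src\s*=\s*(["'][^"']+["']|[^>]+)>""")


def image_source(tag):
    """Return the src value with surrounding quotes removed, or None."""
    match = IMG_SRC.search(tag)
    if match is None:
        return None
    # Group 1 still carries the quotes when the value was quoted.
    return match.group(1).strip("\"'")


for tag in ('<img src="a.png">', "<img src='a.png'>", "<img src=a.png>"):
    print(image_source(tag))
```

Two more limitations show up quickly when trying other inputs: src has to be the first and only attribute, and a value with a quote on just one side (such as <img src="a.png>) is still accepted through the unquoted alternative, so this does not enforce the "must not be accepted" rule from the question.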
Depending on what scripting or programming language you are using to solve this, it can be done with either multiple regex, or simply one regex that checks groups.
<img[^s]+src=("(.+)"|'(.+)'|(.+))[^/<]+(/>|</img>)
If all you want is the image src attribute, you don't have to parse using a parser. In fact, if you're wanting other attributes, just use a different regex. You will run into issues with multiple matches of the image tag, but in that case just match image tags, and for each one perform your desired regex.

Getting alt tags with regex

I am parsing some HTML source. Is there a regex script to find out whether alt tags in a html document are empty?
I want to see if the alt tags are empty or not.
Is regex suitable for this or should I use string manipulation in C#?
You have to parse the HTML and check the tags. Use the following link: it includes a C# library for parsing HTML tags, and you can loop through the tags and count them: Parsing HTML tags.
If this is valid XHTML, why do you need Regex at all? If you simply search for the string:
alt=""
... you should be able to find all empty alt tags.
In any case, it shouldn't be too complicated to construct a Regex for the search too, taking into account poorly written HTML markup (especially with spaces):
alt\s*=\s*"\s*"
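A quick way to try that expression out is sketched below. It is in Python rather than C#, with a made-up snippet of markup.

```python
import re

EMPTY_ALT = re.compile(r'alt\s*=\s*"\s*"')

html = ('<img src="a.png" alt="">'
        '<img src="b.png" alt = " ">'
        '<img src="c.png" alt="logo">')

empty = EMPTY_ALT.findall(html)
print(empty)
```

It only looks at double-quoted values, so alt='' would slip through, and it says nothing about images that have no alt attribute at all.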
If you want to do it just looking at the page then CSS selectors might be better, assuming your browser supports the :not selector.
Install the selectorgadget bookmarklet. Activate it on your page and then put the following selector in the input box and press enter.
img:not([alt])
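If you are automating it and have access to the DOM for the HTML, you could use the same selector. Note that img:not([alt]) selects images with no alt attribute at all; img[alt=""] selects those whose alt is present but empty.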
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
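As one illustration of the parser route for the alt question, here is a small sketch using Python's standard html.parser module instead of a C# library. The class name AltChecker and the sample markup are invented.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Record the src of every <img> whose alt is missing or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []  # <img> tags with no alt attribute at all
        self.empty = []    # <img> tags whose alt is present but blank

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from the parser.
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if "alt" not in attributes:
            self.missing.append(src)
        elif not (attributes["alt"] or "").strip():
            # Covers alt="", whitespace-only values and a bare alt.
            self.empty.append(src)


SAMPLE = """<img src="a.png">
<img src="b.png" alt="">
<IMG SRC="c.png" ALT="logo">
<img src='d.png' alt=' ' />"""

checker = AltChecker()
checker.feed(SAMPLE)
checker.close()
print(checker.missing, checker.empty)
```

Quoting style, upper-case tags and self-closing syntax are all handled by the parser here, which is exactly the bookkeeping the regex versions have to approximate.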