Regular expression to match style="whatever:0; morestuff:1; otherstuff:3" - regex

I'm trying to match anything between and including style=""
eg: style="whatever:0; morestuff:1; otherstuff:3"

The pattern will be /style="([^"]*)"/, but may vary a bit depending on what language you're using.
Also, if you're doing this in JavaScript, jQuery makes it as easy as:
$("#element-id").attr("style");
If you're trying to do this from another language, use an HTML parsing lib as HTML isn't regular. BeautifulSoup for Python is quite nice.
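As a rough illustration of the parser route, here is a minimal sketch that uses only Python's standard-library html.parser instead of BeautifulSoup. The names StyleCollector and collect_styles are made up for this example.

```python
from html.parser import HTMLParser


class StyleCollector(HTMLParser):
    """Collects the value of every style attribute seen in the markup."""

    def __init__(self):
        super().__init__()
        self.styles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; attribute names arrive lowercased.
        for name, value in attrs:
            if name == "style" and value is not None:
                self.styles.append(value)


def collect_styles(html):
    """Return the style attribute values in document order."""
    parser = StyleCollector()
    parser.feed(html)
    parser.close()
    return parser.styles


if __name__ == "__main__":
    print(collect_styles('<div style="whatever:0; morestuff:1; otherstuff:3">x</div>'))
```

Because the parser tokenizes the attributes itself, single-quoted values and extra whitespace around the equals sign need no special handling.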

String under test
style="whatever:0; morestuff:1; otherstuff:3"
Regex
style\s*=\s*"([^"]*)"
Contents of group 1
whatever:0; morestuff:1; otherstuff:3
Notice!
It is very hard to write a regex-based HTML parser that is correct, secure, and maintainable. If you need to write a program that deals with HTML in a robust, reliable, and secure way, you should use a real HTML parsing library like jsoup (Java) or Html Agility Pack (C#). To find an HTML parser for your favorite language, Google: yourlanguage html parser.
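For a quick one-off, the regex from this answer can be tried in Python roughly as follows. This is only a sketch: the function name style_value is invented here, and it only looks at double-quoted values.

```python
import re

# Same pattern as in the answer: optional whitespace around "=",
# and group 1 captures everything between the double quotes.
STYLE_RE = re.compile(r'style\s*=\s*"([^"]*)"')


def style_value(text):
    """Return the contents of the first style="..." found, or None."""
    match = STYLE_RE.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(style_value('style="whatever:0; morestuff:1; otherstuff:3"'))
```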

If you need to remove all style attributes from HTML (clean inline styles entirely), use this regexp:
style=\"[^\"]*\"
This works for me in Sublime Text 2 and 3.
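Outside an editor, the same idea could be scripted. The sketch below is a slight variant of the pattern, not the exact one above: it also consumes the whitespace before the attribute so no stray space is left behind. The name strip_inline_styles is hypothetical.

```python
import re

# Variant of the pattern above: the leading \s+ removes the space that
# separates the attribute from whatever precedes it.
INLINE_STYLE_RE = re.compile(r'\s+style\s*=\s*"[^"]*"')


def strip_inline_styles(html):
    """Remove every double-quoted style attribute from the markup."""
    return INLINE_STYLE_RE.sub("", html)


if __name__ == "__main__":
    print(strip_inline_styles('<p style="a:1" class="x">t</p>'))
```

Like any regex over markup, this would also hit text that merely looks like a style attribute, so it is best kept to one-off cleanups.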

/(style="([^"]*)")/
for the whole string (untested). Do you want the key-value pairs retrieved as well?
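If the key-value pairs are wanted, one possible follow-up is to split the captured string on semicolons and colons. This is a naive sketch (parse_declarations is an invented name) and it assumes no value contains a semicolon, which is not true for things like data URLs.

```python
def parse_declarations(style):
    """Split 'name:value; name:value' into a dict, ignoring empty chunks."""
    declarations = {}
    for chunk in style.split(";"):
        if ":" not in chunk:
            continue  # skips empty pieces such as the one after a trailing ";"
        name, value = chunk.split(":", 1)
        declarations[name.strip()] = value.strip()
    return declarations


if __name__ == "__main__":
    print(parse_declarations("whatever:0; morestuff:1; otherstuff:3"))
```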

Related

Strip specific HTML tags using Notepad++

I'd like to hear if anyone can help me strip HTML markup from my large XML file.
The XML file has my own schema and it's all fine. But I need to remove <span>, <style>, <div> and attributes in <p> tags.
For example, I need to keep all <ul>, <ol>, <li>, <strong>, <a>, <img> and other tags but remove <div> (with attributes), <span> (with attributes), and attributes in <p> tags.
I have tried many examples from this site and many other sites, but most of them didn't work.
Quoting from an answer I posted yesterday:
I've heard some very good things about Beautiful Soup, HTML Purifier, and the HTML Agility Pack, which use Python, PHP, and .NET, respectively. Trust me--save yourself some pain and use those instead.
I strongly advise you not to use regex for this. No sane regex is going to work, or probably even come close to working. However, a decent XML parser can do this fairly easily. I'm not sure what programming languages you have access to, but if you can use PHP, .NET or another programming language, you can use the above parsers to find each span, style, div, and p and remove attributes or the entire tags.
jQuery has some good functionality for DOM-manipulation like you're describing, and you can use it to generate HTML which you then cut and paste.
If you absolutely must use regex, you could try this:
Pattern: <\s*/?\s*(span|style|div)\b[^>]*?>
Replacement: (nothing)
Pattern: <\s*p\b[^>]*?>
Replacement: <p>
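The two find-and-replace passes above could be scripted along these lines. This is a sketch only: the function name clean_markup is made up, and the case-insensitive flag is an addition that the original patterns do not specify.

```python
import re

# Pass 1: drop opening and closing <span>, <style> and <div> tags.
REMOVE_TAGS_RE = re.compile(r"<\s*/?\s*(span|style|div)\b[^>]*?>", re.IGNORECASE)
# Pass 2: reduce any opening <p ...> tag to a bare <p>.
P_OPEN_RE = re.compile(r"<\s*p\b[^>]*?>", re.IGNORECASE)


def clean_markup(markup):
    """Apply both replacements in order and return the cleaned markup."""
    markup = REMOVE_TAGS_RE.sub("", markup)
    return P_OPEN_RE.sub("<p>", markup)


if __name__ == "__main__":
    sample = '<div class="a"><p style="x">Hi <span id="s">there</span></p></div>'
    print(clean_markup(sample))
```

Only the tags themselves are removed, so the CSS text inside a <style> element would be left behind as plain text.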

Which regular expression to use to extract some words from an HTML text?

I am having a hard time building a regular expression to grab some words from an HTML text.
Let's say I have the following :
<p style="padding-left :12px">SOME_TEXT_I_WANT</p><p>SOME_OTHER_TEXT</p>
*SOME_TEXT_I_WANT* and *SOME_OTHER_TEXT* can be either a bunch of words like "SOME RANDOM TEXT" or HTML text like "<strong>SOME BOLD TEXT</strong>"
My goal is to extract those texts with one regex.
Which language do you intend to use? Does an HTML parser exist for this language? If yes, consider using a parser.
However, if this is a "one-off", you may be able to get through with something along the lines of:
#<p[^>]*>(.*?)</p>#
The above has certain limitations, most notably it does not match <p data-something="a > b">...</p> nor nested <p>s. (I am not able to tell whether the mark-up you're trying to parse actually allows nested <p>s—just informing you on possible pitfalls.)
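To make the pattern and its pitfall concrete, here is a small Python sketch. The name paragraph_contents is hypothetical, and the DOTALL flag is an assumption added so that paragraphs spanning several lines are also captured.

```python
import re

# Same pattern as above, without the # delimiters that PHP-style syntax needs.
P_RE = re.compile(r"<p[^>]*>(.*?)</p>", re.DOTALL)


def paragraph_contents(html):
    """Return the inner text or markup of each <p>...</p> the pattern finds."""
    return P_RE.findall(html)


if __name__ == "__main__":
    print(paragraph_contents(
        '<p style="padding-left :12px">SOME_TEXT_I_WANT</p><p>SOME_OTHER_TEXT</p>'
    ))
    # The ">" inside the attribute value ends the opening tag too early here.
    print(paragraph_contents('<p data-something="a > b">x</p>'))
```

The second call shows the limitation mentioned above: the capture starts inside the attribute value instead of at the real tag end. Note also that <p[^>]*> would match other tags whose names start with p, such as <pre>.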
Assuming you are using PHP:
$html = "<p>some text here</p>";
$text = preg_replace("/<.+?>/", "", $html); // $text is now "some text here"
Don't use regex. If you ask why, there is a very popular SO post that describes what can happen if you try to use regex for parsing HTML.
Use your language's HTML or XML parser and extract what you need using existing functionality.

How to remove both html tags & content,values inside the tags using regular expressions

Have a look at some of these articles
When is it wise to use regular expressions with HTML?
Regex HTML Extraction C#
But be aware, you are opening yourself for a whole world of hurt.
Parsing Html The Cthulhu Way
As generic as your question is, you will probably only get generic answers:
Use an HTML DOM parser.
You should use an html parser for this, e.g. htmlcleaner

Getting alt tags with regex

I am parsing some HTML source. Is there a regex script to find out whether alt tags in a html document are empty?
I want to see if the alt tags are empty or not.
Is regex suitable for this or should I use string manipulation in C#?
You have to parse the HTML and check the tags. The following link includes a C# library for parsing HTML tags; with it you can loop through the tags and count them: Parsing HTML tags.
If this is valid XHTML, why do you need Regex at all? If you simply search for the string:
alt=""
... you should be able to find all empty alt tags.
In any case, it shouldn't be too complicated to construct a Regex for the search too, taking into account poorly written HTML markup (especially with spaces):
alt\s*=\s*"\s*"
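A quick way to try this pattern is to count its matches, as in the Python sketch below. The name count_empty_alts is invented for the example, and only double-quoted values are considered.

```python
import re

# Pattern from the answer: alt, "=", and a quoted value holding only whitespace.
EMPTY_ALT_RE = re.compile(r'alt\s*=\s*"\s*"')


def count_empty_alts(html):
    """Count occurrences of an empty (or whitespace-only) alt="..." in the text."""
    return len(EMPTY_ALT_RE.findall(html))


if __name__ == "__main__":
    print(count_empty_alts('<img alt=""><img alt = " "><img alt="x">'))
```

Since the pattern is not anchored to a tag, it would also count something like data-alt="".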
If you want to do it just looking at the page then CSS selectors might be better, assuming your browser supports the :not selector.
Install the selectorgadget bookmarklet. Activate it on your page, then put the following selector in the input box and press enter.
img:not([alt])
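If you are automating it and have access to the DOM for the HTML, you could use the same selector. Note that img:not([alt]) matches images with no alt attribute at all; img[alt=""] matches images whose alt attribute is present but empty.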
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
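The question is about C#, but to show what the parser approach looks like, here is a minimal sketch using Python's standard-library html.parser. The names AltChecker and check_alts are assumptions made for this example. It separates images with no alt attribute from images whose alt is empty.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Records the src of every <img> whose alt is missing or empty."""

    def __init__(self):
        super().__init__()
        self.missing = []  # no alt attribute at all
        self.empty = []    # alt present but empty or whitespace-only

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if "alt" not in attributes:
            self.missing.append(src)
        elif not (attributes["alt"] or "").strip():
            # A bare "alt" with no value is reported as None by the parser.
            self.empty.append(src)


def check_alts(html):
    """Return (missing, empty) lists of image sources."""
    checker = AltChecker()
    checker.feed(html)
    checker.close()
    return checker.missing, checker.empty


if __name__ == "__main__":
    page = '<img src="a.png" alt=""><img src="b.png"><img src="c.png" alt="logo">'
    print(check_alts(page))
```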

How to write a regular expression for html parsing?

I'm trying to write a regular expression for my html parser.
I want to match an HTML tag with a given attribute (e.g. <div> with class="tab news selected") that contains one or more <a href> tags. The regexp should match the entire tag (from <div> to </div>). I always seem to get "memory exhausted" errors - my program probably takes every tag it can find as a matching one.
I'm using boost regex libraries.
You should probably look at this question re. regexps and HTML. The gist is that using regular expressions to parse HTML is not by any means an ideal solution.
You may also find these questions helpful:
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?
As others have said, don't use regexes if at all possible. If your code is actually XHTML (i.e. it is also well-formed XML), I can recommend both the Xerces and Expat XML parsers, which will do a much better job for you than regexes.
Maybe regexps aren't the best solution, but I'm already using like five different libraries and boost does fine when it comes to locating <a href> tags and keywords.
I'm using these regexps:
/<a[^\n]*/searched attribute/[^\n]*>[^\n]*</a>/ for locating <a href> tags and:
/<a[^\n]*href[^\n]*>/searched keyword/</a>/ for locating links
(BTW can it be done better? - I suck at regex ;))
What I need now is locating tags containing <a href>'s and I think regexps will do all right - maybe I'll need to write my own parsing function as piotr said.
Do as flex does: match <div> with a case insensitive match, and put your parser in a "div matched" state, keep processing input until </div> and reset state.
This takes two regexps and a state variable.
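The two-regexp, one-state-variable idea might look roughly like this. The question uses Boost in C++; the sketch below is in Python only to keep it short. The function name div_blocks is made up, and nested <div> elements are deliberately not handled: a block ends at the first </div>.

```python
import re

DIV_OPEN_RE = re.compile(r"<div\b[^>]*>", re.IGNORECASE)
DIV_CLOSE_RE = re.compile(r"</div\s*>", re.IGNORECASE)


def div_blocks(html):
    """Return each <div>...</div> span, scanning with a single state flag."""
    blocks = []
    inside = False  # True while we are in the "div matched" state
    start = 0
    pos = 0
    while True:
        # Look for a closing tag while inside a div, an opening tag otherwise.
        pattern = DIV_CLOSE_RE if inside else DIV_OPEN_RE
        match = pattern.search(html, pos)
        if match is None:
            break
        if inside:
            blocks.append(html[start:match.end()])
            inside = False  # reset the state after </div>
        else:
            start = match.start()
            inside = True
        pos = match.end()
    return blocks


if __name__ == "__main__":
    page = '<p>x</p><DIV class="tab news selected"><a href="/a">A</a></div>'
    print(div_blocks(page))
```

Each search is simple and linear, which avoids the heavy backtracking a single large pattern can cause. The returned blocks can then be checked for <a href> with a separate search.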
The valid characters for SGML tag names are [A-Za-z_:]
So: /<[A-Za-z_:]+>/ matches an opening tag that has no attributes.