HTTPmodule - Replacing Markup - regex

I have created an HTTPModule that is being called for every request to my website. Inside the module I have created my own filter wrapper for HTTPApplication.Context.Response.Filter that allows me to manipulate the markup just before it is sent back to the client.
The idea here is that I am going to search for certain words/phrases and replace them with the same word/phrase in a given language which will be stored in a database.
One of the words I am trying to replace is "Password". The problem is that there are controls in the markup called _ctl122_txtPassword and when I am in my filter I am literally just doing string manipulation (search/replace/etc.) so the control name gets renamed to _ctl122_txtTranslation which breaks all kinds of things.
So I don't want to replace matches in this:
<input type="password" style="width: 200px;" class="formfield" id="_ctl22_txtPassword" name="_ctl22:txtPassword">
but I do want to replace matches in this:
<td align="right" class="formlabel">Password:</td>
I have tried a few RegEx solutions, but I am far from a RegEx ninja. Regex could well be the way to go; I just don't know it well enough.
The only other alternative I have tried is actually replacing the string '>Password'.
Thanks in advance for the help.

Due to the nature of HTML, it is kind of hard to write a regex that will handle every case.
You could use this as a starting point
http://snook.ca/archives/active_server_pages/vbscript_code_t
A better solution may be to use an HTML parsing tool (Html Agility Pack)
http://social.msdn.microsoft.com/Forums/en/regexp/thread/3b0a595b-cd09-446f-bbcb-d826511c364e
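To illustrate why a parser avoids the txtPassword problem, here is a minimal sketch in Python using the standard library's html.parser (a stand-in for Html Agility Pack, not its API). The class name TextTranslator, the helper translate_text_nodes, and the French translation are hypothetical. It re-emits tags as they arrived and applies replacements only to text nodes, so attribute values such as id="_ctl22_txtPassword" are never touched.

```python
from html.parser import HTMLParser


class TextTranslator(HTMLParser):
    """Re-emit markup as-is, applying replacements only to text nodes."""

    RAW_TEXT_TAGS = {"script", "style"}  # their contents are code, not prose

    def __init__(self, translations):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. verbatim
        self.translations = translations
        self.out = []
        self.in_raw_text = False

    def handle_starttag(self, tag, attrs):
        self.in_raw_text = tag in self.RAW_TEXT_TAGS
        self.out.append(self.get_starttag_text())  # original tag text, untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.in_raw_text = False
        self.out.append(f"</{tag}>")  # note: tag names come back lowercased

    def handle_data(self, data):
        if not self.in_raw_text:
            for word, replacement in self.translations.items():
                data = data.replace(word, replacement)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def translate_text_nodes(markup, translations):
    parser = TextTranslator(translations)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Called with the two snippets from the question and {"Password": "Mot de passe"}, this should leave the input element alone and change only the label text. End tags are rebuilt from the lowercased tag name, so the output is not guaranteed to be byte-identical to the input.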
If I were going to do this (it sounds like a multi-language site), I would probably use resource files.
Or maybe define macro boundaries, so that my regex could find the translatable text easily,
e.g. with ##'s:
<td align="right" class="formlabel">##Password##:</td>
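A minimal sketch of the ## marker idea, written in Python for illustration. The TRANSLATIONS dictionary is a hypothetical stand-in for the database lookup mentioned in the question, and expand_tokens is an assumed helper name.

```python
import re

# Hypothetical in-memory stand-in for the translation table in the database.
TRANSLATIONS = {"Password": "Mot de passe"}

# A token is any run of text wrapped in ## ... ##.
TOKEN = re.compile(r"##(.+?)##")


def expand_tokens(markup, translations):
    """Replace each ##key## with its translation; unknown keys keep the original word."""
    return TOKEN.sub(lambda m: translations.get(m.group(1), m.group(1)), markup)
```

Because only text inside the markers is considered, a control id such as _ctl22_txtPassword can never be hit by accident. The cost is that every translatable string in the markup has to be wrapped by hand.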

Related

Strip specific HTML tags using Notepad++

I'd like to hear if anyone can help me clean up the HTML markup in my large XML file.
The XML file has my own schema and it's all fine. But I need to remove <span>, <style>, <div> and attributes in <p> tags.
For an example, I need to keep all <ul>, <ol>, <li>, <strong>, <a>, <img> and other tags but remove <div> (with attributes), <span> (with attributes), and attributes in <p> tags.
I have tried many examples from this site and many other sites, but most of them didn't work.
Quoting from an answer I posted yesterday:
I've heard some very good things about Beautiful Soup, HTML Purifier, and the HTML Agility Pack, which use Python, PHP, and .NET, respectively. Trust me--save yourself some pain and use those instead.
I strongly advise you not to use regex for this. No sane regex is going to work, or probably even come close to working. However, a decent XML parser can do this fairly easily. I'm not sure what programming languages you have access to, but if you can use PHP, .NET or another programming language, you can use the above parsers to find each span, style, div, and p and remove attributes or the entire tags.
jQuery has some good functionality for DOM-manipulation like you're describing, and you can use it to generate HTML which you then cut and paste.
If you absolutely must use regex, you could try this:
Pattern: <\s*/?\s*(span|style|div)\b[^>]*?>
Replacement: (nothing)
Pattern: <\s*p\b[^>]*?>
Replacement: <p>
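For reference, the two patterns above can be tried outside Notepad++ with a short Python sketch. The function name clean and the re.IGNORECASE flag are assumptions added here, not part of the answer.

```python
import re

# Opening or closing <span>, <style> and <div> tags, with any attributes.
STRIP_TAGS = re.compile(r"<\s*/?\s*(span|style|div)\b[^>]*?>", re.IGNORECASE)
# Opening <p> tags, with any attributes.
P_ATTRS = re.compile(r"<\s*p\b[^>]*?>", re.IGNORECASE)


def clean(markup):
    """Drop span/style/div tags and reset every opening <p ...> to a bare <p>."""
    markup = STRIP_TAGS.sub("", markup)
    return P_ATTRS.sub("<p>", markup)
```

Note that only the tags themselves are removed: the text inside a <span> or <div> stays, and so does the CSS inside a <style> element, which would need separate handling.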

perl regex help -- hopefully an easy question

Ashamed as I am to admit it, I'm terrible with regex... so here I am to ask for your help :)
I have an HTML file that looks sort of like this:
<table>
<tr>
<td sadf="a">
asdf
</td>
</tr>
</table>
What I'd like to do, with a Perl regex, is remove everything except the td tag and its contents, so I would want the output to be this:
<td sadf="a">
asdf
</td>
please help me out. Thanks
An HTML parser would be much better at this task, but if you insist on using a regular expression, try this:
<td[\s\S]*?</td>
It matches as few of any character as possible up until the end tag </td>.
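The same pattern can be checked quickly in Python (used here only for illustration; the syntax of this particular pattern is the same in Perl). The variable names are assumptions.

```python
import re

# [\s\S] matches any character including newlines; *? makes the match lazy.
TD_BLOCK = re.compile(r"<td[\s\S]*?</td>")

html = '<table>\n<tr>\n<td sadf="a">\nasdf\n</td>\n</tr>\n</table>'

# Each list item is one complete <td ...>...</td> block.
cells = TD_BLOCK.findall(html)
```

For the sample in the question, cells holds a single string containing the whole td element. The pattern would also start a match at <tdata or any other tag name beginning with "td", and it cannot cope with a td nested inside another td's table.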
Try using XML::Simple. As others have pointed out, you can't use regex for parsing XML.
XML::Simple will turn your HTML into a hash structure. From there, you can easily locate the "td" element, and copy the whole thing to another hash reference. Then, you can use XML::Simple to turn it back into HTML.
XML::Simple can't guarantee the same structure in the output XML (although it will be programmatically the same). However, I rarely have problems with turning HTML into a hashref and back into HTML.
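A comparable parse-then-extract approach, sketched with Python's xml.etree.ElementTree rather than XML::Simple (a different library with a different API, used here only to show the idea). Like XML::Simple, it assumes the input is well-formed XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

html = '<table>\n<tr>\n<td sadf="a">\nasdf\n</td>\n</tr>\n</table>'

root = ET.fromstring(html)      # raises ParseError if the markup is not well-formed
td = root.find(".//td")         # first <td> anywhere below the root

# Serialize just that element back to markup.
td_markup = ET.tostring(td, encoding="unicode")
```

The serialized string also carries the whitespace that followed the element in the source, so it may need a strip() before being written out.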
A simpler way of thinking of this is that you want to grab the tag part with a regular expression (rather than remove everything except the tag part).
In this case, the regular expression is simple, and would probably look something like this for the first line, for example: <td \w+?="\w*"> (you can match \n to grab a multiline block). It's hard to answer without knowing exactly what is changing in your regex, but if you follow a reference like this one you should be fine.
In addition, it is probably best to do this without regex at all (using an HTML parser instead) if it's anything more than a limited, specific grab. I'll assume you know that you want to use regex, but there are really much better ways of doing this if you've got something more complicated than a very basic search pattern on your hands.

Regex to Parse HTML Tables

I am trying to remove the tables within an HTML file, specifically, for the following document, I'd like to remove anything within the tags <TABLE....> and </TABLE>. The document contains multiple tables with texts in between.
The expression that I came up with, <TABLE.*>\s*[\s|\S]*</TABLE>\s*, however would remove the text in between the tables. In fact it would remove everything between the first <TABLE> and the last </TABLE> tags. I would like to keep the texts in between and only remove the tables. Any suggestion is greatly appreciated. Thanks.
====================
<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
other texts that should be KEPT...
<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
==========================================
The answer is to use an HTML or SGML parser. There are some around for .NET:
http://htmlagilitypack.codeplex.com/
SGML parser .NET recommendations
If you absolutely want to use regular expressions, familiarize yourself with balancing groups, otherwise nested tables will break. It's not easy, and may perform much slower than a regular SGML parser. Be warned though: Seeing your expression I assume that you are a regex newbie (hint: avoid greedy . matches at any cost), so this is probably not yet your cup of tea.
Since I know you're not going to look at an HTML parser even if I tell you you really should, I'll just answer the question.
This matches only tables:
<table.*?>.*?</table>
It requires two options: dotall and ignoreCase.
You can try it here: http://gskinner.com/RegExr/
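As a quick check of that pattern against the sample from the question, here is a small Python sketch, where re.DOTALL and re.IGNORECASE correspond to the dotall and ignoreCase options. The variable names are assumptions.

```python
import re

# Lazy quantifiers stop at the first </table>, so text between tables survives.
TABLE = re.compile(r"<table.*?>.*?</table>", re.DOTALL | re.IGNORECASE)

doc = (
    "<TABLE STYLE=xxx, Font=yyy, etc>\n"
    "table texts that should be DELETED...\n"
    "</TABLE>\n"
    "other texts that should be KEPT...\n"
    "<TABLE STYLE=xxx, Font=yyy, etc>\n"
    "table texts that should be DELETED...\n"
    "</TABLE>\n"
)

result = TABLE.sub("", doc)
```

Only the KEPT line (plus surrounding newlines) remains in result. With nested tables the lazy match would end at the inner </table> and leave the outer closing tag behind, which is the case the balancing-groups remark above is about.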
Now do consider using HTML Agility Pack suggested by Lucero ok?
Edit: maybe this was what you meant, sorry:

Which regular expression to use to extract some words from an HTML text?

I am having a hard time building a regular expression to grab some words from a HTML text.
Let's say I have the following :
<p style="padding-left :12px">SOME_TEXT_I_WANT</p><p>SOME_OTHER_TEXT</p>
*SOME_TEXT_I_WANT* and *SOME_OTHER_TEXT* can be either a bunch of words like "SOME RANDOM TEXT" or HTML text like "<strong>SOME BOLD TEXT</strong>"
My goal is to extract those texts with one regex.
Which language do you intend to use? Does an HTML parser exist for this language? If yes, consider using a parser.
However, if this is a "one-off", you may be able to get through with something along the lines of:
#<p[^>]*>(.*?)</p>#
The above has certain limitations. Most notably, it mishandles <p data-something="a > b">...</p>, because [^>]* stops at the > inside the attribute value, and it cannot pair up nested <p>s. (I am not able to tell whether the mark-up you're trying to parse actually allows nested <p>s; I am just pointing out possible pitfalls.)
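A short Python sketch of the pattern (without the PHP-style # delimiters) on the sample from the question, followed by the attribute case that trips it up. The variable names are assumptions.

```python
import re

P_CONTENT = re.compile(r"<p[^>]*>(.*?)</p>")

html = (
    '<p style="padding-left :12px">SOME_TEXT_I_WANT</p>'
    "<p><strong>SOME BOLD TEXT</strong></p>"
)
# One captured string per paragraph, inner markup included.
texts = P_CONTENT.findall(html)

# A ">" inside an attribute value ends the opening tag too early,
# so part of the attribute leaks into the captured text.
tricky = '<p data-something="a > b">text</p>'
broken = P_CONTENT.findall(tricky)
```

Here texts comes out as the two paragraph bodies, while broken contains a fragment of the attribute value in front of the real text. The pattern also accepts any tag whose name merely starts with "p", such as <pre>.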
Assuming you are using PHP:
$html = "<p>some text here</p>";
// Strip every tag and keep only the text between them.
$text = preg_replace("/<.+?>/", "", $html);
Don't use regex. If you ask why, there is a very popular SO post that describes what can happen if you try to use regex for parsing HTML.
Use your language's HTML or XML parser and extract what you need using existing functionality.

How to write a regular expression for html parsing?

I'm trying to write a regular expression for my html parser.
I want to match a html tag with given attribute (eg. <div> with class="tab news selected" ) that contains one or more <a href> tags. The regexp should match the entire tag (from <div> to </div>). I always seem to get "memory exhausted" errors - my program probably takes every tag it can find as a matching one.
I'm using boost regex libraries.
You should probably look at this question re. regexps and HTML. The gist is that using regular expressions to parse HTML is not by any means an ideal solution.
You may also find these questions helpful:
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?
As others have said, don't use regexes if at all possible. If your code is actually XHTML (i.e. it is also well-formed XML), I can recommend both the Xerces and Expat XML parsers, which will do a much better job for you than regexes.
Maybe regexps aren't the best solution, but I'm already using like five different libraries and boost does fine when it comes to locating <a href> tags and keywords.
I'm using these regexps:
/<a[^\n]*/searched attribute/[^\n]*>[^\n]*</a>/ for locating <a href> tags and:
/<a[^\n]*href[^\n]*>/searched keyword/</a>/ for locating links
(BTW can it be done better? - I suck at regex ;))
What I need now is locating tags containing <a href>'s and I think regexps will do all right - maybe I'll need to write my own parsing function as piotr said.
Do as flex does: match <div> with a case insensitive match, and put your parser in a "div matched" state, keep processing input until </div> and reset state.
This takes two regexps and a state variable.
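Valid characters in SGML tag names include [A-Za-z_:].
So: /<[A-Za-z_:]+>/ matches an opening tag that has no attributes (and no digits in its name, so it misses tags like <h1>).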
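The "two regexps and a state variable" idea can be sketched as follows, in Python rather than Boost.Regex for brevity. All names here are hypothetical, and a third small pattern is added to test for links. The sketch assumes the input arrives line by line, that the target divs are not nested, and that at most one of them starts on any given line.

```python
import re

# Opening <div> carrying the wanted class attribute (from the question's example).
DIV_OPEN = re.compile(r'<div\b[^>]*\bclass="tab news selected"[^>]*>', re.IGNORECASE)
DIV_CLOSE = re.compile(r"</div\s*>", re.IGNORECASE)
# Extra check, not part of the two-regex core: does the block contain <a href>?
LINK = re.compile(r"<a\b[^>]*\bhref\b", re.IGNORECASE)


def divs_with_links(lines):
    """Yield each matching <div>...</div> block that contains an <a href>."""
    inside = False  # the "div matched" state
    block = []
    for line in lines:
        if not inside:
            start = DIV_OPEN.search(line)
            if start is None:
                continue
            inside = True
            block = []
            line = line[start.start():]  # drop anything before the <div>
        end = DIV_CLOSE.search(line)
        if end is None:
            block.append(line)  # still inside the div, keep collecting
            continue
        block.append(line[:end.end()])
        inside = False  # reset state at </div>
        text = "".join(block)
        if LINK.search(text):
            yield text
```

Because each regex only ever looks at one line, no pattern has to span the whole div, which sidesteps the backtracking that can lead to "memory exhausted" errors on large inputs.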