Perl regex help -- hopefully an easy question

Ashamed as I am to admit it, I'm terrible with regex... so here I am to ask for your help :)
I have an HTML file that looks sort of like this:
<table>
<tr>
<td sadf="a">
asdf
</td>
</tr>
</table>
What I'd like to do, with a Perl regex, is remove everything except the td element (the tags and whatever is between them). So I would want the output to be this:
<td sadf="a">
asdf
</td>
please help me out. Thanks

An HTML parser would be much better at this task, but if you insist on using a regular expression, try this:
<td[\s\S]*?</td>
It matches as few characters as possible (newlines included, since [\s\S] matches any character) up to the end tag </td>.
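As a rough illustration of how that lazy pattern behaves, here is a small sketch in Python rather than Perl (the regex itself is the same in both). The variable names are only for this example, and the input is the sample from the question.

```python
import re

html = """<table>
<tr>
<td sadf="a">
asdf
</td>
</tr>
</table>"""

# [\s\S] matches any character, newlines included; *? is lazy,
# so each match stops at the first </td> it can reach.
td_pattern = re.compile(r"<td[\s\S]*?</td>")

cells = td_pattern.findall(html)
print("\n".join(cells))
```

With a greedy `*` instead of `*?`, a file containing several cells would come back as a single match running from the first `<td` to the last `</td>`.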

Try using XML::Simple. As others have pointed out, you can't use regex for parsing XML.
XML::Simple will turn your HTML into a hash structure. From there, you can easily locate the "td" element, and copy the whole thing to another hash reference. Then, you can use XML::Simple to turn it back into HTML.
XML::Simple can't guarantee that the XML it writes back out has the same layout as the original (although it'll be programmatically the same). However, I rarely have problems with turning HTML into a hashref and back into HTML.

A simpler way of thinking of this is that you want to grab the tag part with a regular expression (rather than remove everything except the tag part).
In this case, the regular expression is simple, and would probably look something like this for the first line, for example: <td \w+?="\w*"> (you can match \n to grab a multiline block). It's hard to answer without knowing exactly what is changing in your regex, but if you follow a reference like this one you should be fine.
In addition, it is probably best to do this without regex at all (using an HTML parser instead) if it's anything more than a limited, specific grab. I'll assume you know that you want to use regex, but there are really much better ways of doing this if you've got something more complicated than a very basic search pattern on your hands.
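For comparison, the parser route might look roughly like this. It is a minimal sketch using Python's standard-library html.parser instead of a Perl module; the TdCollector class and its attribute names are made up for this example. It assumes every <td> is explicitly closed, and character references such as &amp; come back decoded rather than verbatim.

```python
from html.parser import HTMLParser


class TdCollector(HTMLParser):
    """Collects the markup of every top-level <td>...</td> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <td> elements we are currently inside
        self.parts = []   # pieces of the cell being collected
        self.cells = []   # finished cells

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1
        if self.depth:
            # get_starttag_text() returns the tag exactly as written
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/> have no separate end tag
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        self.parts.append(f"</{tag}>")
        if tag == "td":
            self.depth -= 1
            if self.depth == 0:
                self.cells.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


html = '<table>\n<tr>\n<td sadf="a">\nasdf\n</td>\n</tr>\n</table>'
collector = TdCollector()
collector.feed(html)
collector.close()
print(collector.cells[0])
```

It is longer than the one-line regex, but it keeps working when a cell contains other tags or when attributes are written in an unexpected order.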

Related

Strip specific HTML tags using Notepad++

I'd like to hear if anyone can help me clean up the HTML markup in my large XML file.
The XML file has my own schema and it's all fine. But I need to remove <span>, <style>, <div> and the attributes in <p> tags.
For example, I need to keep all <ul>, <ol>, <li>, <strong>, <a>, <img> and other tags but remove <div> (with attributes), <span> (with attributes), and the attributes in <p> tags.
I have tried many examples from this site and many other sites, but most of them didn't work.
Quoting from an answer I posted yesterday:
I've heard some very good things about Beautiful Soup, HTML Purifier, and the HTML Agility Pack, which use Python, PHP, and .NET, respectively. Trust me--save yourself some pain and use those instead.
I strongly advise you not to use regex for this. No sane regex is going to work, or probably even come close to working. However, a decent XML parser can do this fairly easily. I'm not sure what programming languages you have access to, but if you can use PHP, .NET or another programming language, you can use the above parsers to find each span, style, div, and p and remove attributes or the entire tags.
jQuery has some good functionality for DOM-manipulation like you're describing, and you can use it to generate HTML which you then cut and paste.
If you absolutely must use regex, you could try this:
Pattern: <\s*/?\s*(span|style|div)\b[^>]*?>
Replacement: (nothing)
Pattern: <\s*p\b[^>]*?>
Replacement: <p>
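To see what those two replacements do, here is a small Python sketch (Notepad++ would apply the same patterns through its Replace dialog). The sample string and the variable names are invented for the example, and the case-insensitive flag is an assumption, since the post does not say whether the tags are always lowercase.

```python
import re

sample = (
    '<div class="x"><p style="color:red">Hi '
    '<span id="s">there</span> <strong>you</strong></p></div>'
)

# First pattern: opening and closing span/style/div tags, attributes included.
strip_tags = re.compile(r"<\s*/?\s*(?:span|style|div)\b[^>]*?>", re.IGNORECASE)
# Second pattern: any opening <p ...> tag, to be replaced by a bare <p>.
strip_p_attrs = re.compile(r"<\s*p\b[^>]*?>", re.IGNORECASE)

cleaned = strip_tags.sub("", sample)
cleaned = strip_p_attrs.sub("<p>", cleaned)
print(cleaned)  # <p>Hi there <strong>you</strong></p>
```

Only the tags themselves are removed, so any CSS text sitting between <style> and </style> stays in the output. The \b after the tag name is what keeps <strong> and <pre> from being touched.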

Capturing regex group using condition

I have a regular expression that breaks HTML into the pieces I need. I will not present the whole regex, because it's too long. In a nutshell, it's a multi-line, row-by-row parser for table cells. Recently I ran into trouble: the layout of the pages I parse has changed, so I started reworking the regex to fit the new layout, but I found that the markup wrapping the data I need in a particular cell may differ between rows.
What do we have?
The layout of the cell may be like this or like this
which leads me to my question: how do I capture the data I need without ending up with an additional, unnecessary group?
Conditionals in regexes are described at regular-expressions.info/conditional.html. I've read it but still don't have a clue.
This should help :)
<td class='(?:class1|class2)'>\s*((?=\w).*)\s*</td>
Edited: adopted regexhacks' expression, as it is the better solution.
Not sure, but maybe you are looking for non-capturing groups used as (?:). Thus you could do
<td class='class(?:1|2)'>\s*((?=\w).*)\s*</td>
Well, in this example you would not need the groups:
<td class='class[12]'>\s*((?=\w).*)\s*</td>
but in more complex cases you could use them.
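A quick way to see the difference is to count the groups each variant produces. This is a sketch in Python; the row string is invented, since the original cell layouts are not shown in the question.

```python
import re

# Same expression twice: once with a capturing alternation, once with (?:...)
capturing = re.compile(r"<td class='(class1|class2)'>\s*((?=\w).*)\s*</td>")
non_capturing = re.compile(r"<td class='(?:class1|class2)'>\s*((?=\w).*)\s*</td>")

row = "<td class='class2'>  42  </td>"

print(capturing.groups)       # 2: the class name and the cell data
print(non_capturing.groups)   # 1: only the cell data

# The greedy .* also swallows trailing spaces, hence the strip().
data = non_capturing.search(row).group(1).strip()
print(data)
```

Both variants match the same rows; the only change is that the (?:...) version no longer reports the class name as a group of its own.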
See sample: rubular
But this might not be what you want. Could you give a more precise example of the problem?

HTTPmodule - Replacing Markup

I have created an HTTPModule that is being called for every request to my website. Inside the module I have created my own filter wrapper for HTTPApplication.Context.Response.Filter that allows me to manipulate the markup just before it is sent back to the client.
The idea here is that I am going to search for certain words/phrases and replace them with the same word/phrase in a given language which will be stored in a database.
One of the words I am trying to replace is "Password". The problem is that there are controls in the markup called _ctl122_txtPassword and when I am in my filter I am literally just doing string manipulation (search/replace/etc.) so the control name gets renamed to _ctl122_txtTranslation which breaks all kinds of things.
So I don't want to replace matches in this:
<input type="password" style="width: 200px;" class="formfield" id="_ctl22_txtPassword" name="_ctl22:txtPassword">
but I do want to replace matches in this:
<td align="right" class="formlabel">Password:</td>
I have tried a few RegEx solutions, but I am far from a RegEx ninja. That could well be the way to go; I just don't know regexes well enough.
The only other alternative I have tried is actually replacing the string '>Password'.
Thanks in advance for the help.
Due to the nature of HTML, it is kind of hard to write a regex that will handle every case.
You could use this as a starting point
http://snook.ca/archives/active_server_pages/vbscript_code_t
A better solution may be to use a HTML parsing tool (Html Agility Pack)
http://social.msdn.microsoft.com/Forums/en/regexp/thread/3b0a595b-cd09-446f-bbcb-d826511c364e
If I were going to do this (it sounds like a multi-language site), I would probably use resource files.
Or maybe define macro boundaries, so that my regex could find things easily,
e.g. with ##'s:
<td align="right" class="formlabel">##Password##:</td>
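The marker idea could be sketched like this, in Python rather than inside an ASP.NET response filter. The translate function, the translations dictionary and the French string are all placeholders for this example; in the question the phrases come from a database.

```python
import re

# Hypothetical lookup table standing in for the database of translations.
translations = {"Password": "Mot de passe"}

# Anything wrapped in ##...## is treated as a translatable phrase.
MARKER = re.compile(r"##(.+?)##")


def translate(markup, table):
    """Replace each ##phrase## with its translation, or the bare phrase if none exists."""
    return MARKER.sub(lambda m: table.get(m.group(1), m.group(1)), markup)


markup = (
    '<input type="password" id="_ctl22_txtPassword" name="_ctl22:txtPassword">'
    '<td align="right" class="formlabel">##Password##:</td>'
)
result = translate(markup, translations)
print(result)
```

Because only marked text is touched, control IDs such as _ctl22_txtPassword are left alone without the regex needing to understand HTML at all.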

Regex to Parse HTML Tables

I am trying to remove the tables within an HTML file, specifically, for the following document, I'd like to remove anything within the tags <TABLE....> and </TABLE>. The document contains multiple tables with texts in between.
The expression that I came up with, <TABLE.*>\s*[\s|\S]*</TABLE>\s*, however would remove the text in between the tables. In fact it would remove everything between the first <TABLE> and the last </TABLE> tags. I would like to keep the texts in between and only remove the tables. Any suggestion is greatly appreciated. Thanks.
====================
<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
other texts that should be KEPT...
<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
==========================================
The answer is to use an HTML or SGML parser; there are some around for .NET:
http://htmlagilitypack.codeplex.com/
SGML parser .NET recommendations
If you absolutely want to use regular expressions, familiarize yourself with balancing groups, otherwise nested tables will break. It's not easy, and may perform much slower than a regular SGML parser. Be warned though: Seeing your expression I assume that you are a regex newbie (hint: avoid greedy . matches at any cost), so this is probably not yet your cup of tea.
Since I know you're not going to look at an HTML parser even if I tell you you really should, I'll just answer the question.
This matches only tables:
<table.*?>.*?</table>
It requires two options: dotall and ignoreCase.
You can try it here: http://gskinner.com/RegExr/
Now do consider using HTML Agility Pack suggested by Lucero ok?
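For reference, here is the lazy pattern with both options applied to the sample from the question. The sketch is in Python, where the two options are called re.DOTALL and re.IGNORECASE; in .NET the corresponding options would be RegexOptions.Singleline and RegexOptions.IgnoreCase.

```python
import re

doc = """<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
other texts that should be KEPT...
<TABLE STYLE=xxx, Font=yyy, etc>
table texts that should be DELETED...
</TABLE>
"""

# DOTALL lets . cross line breaks; IGNORECASE makes <table also match <TABLE.
table_re = re.compile(r"<table.*?>.*?</table>", re.DOTALL | re.IGNORECASE)

without_tables = table_re.sub("", doc)
print(without_tables.strip())  # other texts that should be KEPT...
```

The lazy .*? stops at the first </TABLE> it finds, which is why the text between the two tables survives. For the same reason it would cut a nested table short, as the other answer warns.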
Edit: maybe this was what you meant, sorry:

How to write a regular expression for html parsing?

I'm trying to write a regular expression for my html parser.
I want to match a html tag with given attribute (eg. <div> with class="tab news selected" ) that contains one or more <a href> tags. The regexp should match the entire tag (from <div> to </div>). I always seem to get "memory exhausted" errors - my program probably takes every tag it can find as a matching one.
I'm using boost regex libraries.
You should probably look at this question re. regexps and HTML. The gist is that using regular expressions to parse HTML is not by any means an ideal solution.
You may also find these questions helpful:
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?
As others have said, don't use regexes if at all possible. If your code is actually XHTML (i.e. it is also well-formed XML), I can recommend both the Xerces and Expat XML parsers, which will do a much better job for you than regexes.
Maybe regexps aren't the best solution, but I'm already using like five different libraries and boost does fine when it comes to locating <a href> tags and keywords.
I'm using these regexps:
/<a[^\n]*/searched attribute/[^\n]*>[^\n]*</a>/ for locating <a href> tags and:
/<a[^\n]*href[^\n]*>/searched keyword/</a>/ for locating links
(BTW can it be done better? - I suck at regex ;))
What I need now is locating tags containing <a href>'s and I think regexps will do all right - maybe I'll need to write my own parsing function as piotr said.
Do as flex does: match <div> with a case insensitive match, and put your parser in a "div matched" state, keep processing input until </div> and reset state.
This takes two regexps and a state variable.
The valid characters for SGML tag names are [A-Za-z_:]
So: /<[A-Za-z_:]+>/ matches a tag.
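The two-regex, one-state-variable approach could be sketched as follows. This is Python with its built-in re module rather than boost regex, and the function name div_blocks, the sample page and the href check are assumptions made for the example. Like the description above, it does not handle nested <div> elements.

```python
import re

OPEN_DIV = re.compile(r"<div\b[^>]*>", re.IGNORECASE)
CLOSE_DIV = re.compile(r"</div\s*>", re.IGNORECASE)


def div_blocks(text):
    """Return every <div ...>...</div> span, scanning with a single state flag."""
    blocks = []
    pos = 0
    start = 0
    inside = False  # the state variable: are we in the "div matched" state?
    while True:
        if not inside:
            m = OPEN_DIV.search(text, pos)
            if m is None:
                break
            inside, start, pos = True, m.start(), m.end()
        else:
            m = CLOSE_DIV.search(text, pos)
            if m is None:
                break  # unterminated <div>: give up on it
            blocks.append(text[start:m.end()])
            inside, pos = False, m.end()
    return blocks


page = (
    '<div class="tab news selected"><a href="/a">A</a></div>'
    '<DIV class="other">no links</DIV>'
)
blocks = div_blocks(page)

# Keep only the blocks that contain at least one <a ... href ...> tag.
with_links = [b for b in blocks if re.search(r"<a\b[^>]*\bhref", b, re.IGNORECASE)]
print(with_links)
```

Each search resumes from where the previous match ended, so no single pattern has to span a whole element. That is the kind of open-ended match that tends to trigger "memory exhausted" errors in backtracking engines.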