RegEx for matching HTML tags - regex

I am trying to use a regular expression to extract the start tags in lines of given HTML code. From the following lines I expect to get only 'body' and 'h1' as start tags in the first line, and 'html', 'head' and 'title' as start tags in the second line:
'<body data-modal-target class=\'3\'><h1>Website</h1><br /></body></html>'
'<html><head><title>HTML Parser - II</title></head>'
I have already tried to do this using the following regular expression:
start_tags = re.findall(r'<(\w+)\s*.*?[^\/]>',line)
But my output for the first line is ['body','h1','br'], although I did not expect to catch 'br' since I excluded '/'.
For the second line it is ['html','title'], whereas I expected to catch 'head' too. I would be very grateful if you could let me know which part of my code is wrong.

If you wish to do so with regular expressions, you might want to design multiple different expressions, step by step. You may be able to connect them using OR pipes, but it may not be necessary.
RegEx 1 for h1-h6 tags
The following expression helps you capture tags inside the body, excluding body and head:
(<(.*)>(.*)</([^br][A-Za-z0-9]+)>)
You might want to add more boundaries to it. For example, you can replace (.*) with lists of chars [].
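As a sketch of that idea applied to the start-tag pattern from the question (Python's `re` is assumed, as in the question, and the name `START_TAG` is made up here): replacing `.*?` with the character class `[^<>]*` keeps a match from running past the end of its own tag, and a negative lookbehind `(?<!/)` rejects self-closing tags such as `<br />`. This is not a general HTML matcher; a `<` or `>` inside a quoted attribute value may still confuse it.

```python
import re

# Hypothetical name; the pattern is a sketch, not a full HTML grammar.
# <(\w+)   tag name right after '<' (closing tags start with '</', so they fail here)
# [^<>]*   attributes, never crossing into another tag
# (?<!/)>  closing '>' not preceded by '/', which skips self-closing tags
START_TAG = re.compile(r'<(\w+)[^<>]*(?<!/)>')

line1 = "<body data-modal-target class='3'><h1>Website</h1><br /></body></html>"
line2 = "<html><head><title>HTML Parser - II</title></head>"

print(START_TAG.findall(line1))  # ['body', 'h1']
print(START_TAG.findall(line2))  # ['html', 'head', 'title']
```

The original pattern lets `.*?` cross tag boundaries, which is why `<html><head>` was consumed as one match and `<br /></body>` was accepted as a tag ending in `y>`.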
RegEx 2 for head and body
For head and body tags, you might want to match across new lines, for which you might want an expression similar to:
(<head>([\s\S]*)<\/head>)|(<body>([\s\S]*)</body>)
Performance
These expressions are rather expensive. You might want to simplify them, write some other script to parse your HTML, or perhaps find an HTML parser to do so.
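For the parser route, a minimal sketch using `html.parser` from the Python standard library might look like the following. The class name `StartTagCollector` and the helper `start_tags` are made up for illustration; only `HTMLParser` and its handler methods come from the library.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Collects names of start tags, ignoring self-closing ones like <br />."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # The default implementation forwards to handle_starttag;
        # overriding it keeps tags written as <br /> out of the list.
        pass


def start_tags(line):
    # Hypothetical helper: parse one line and return its start tag names.
    parser = StartTagCollector()
    parser.feed(line)
    parser.close()
    return parser.tags


print(start_tags("<body data-modal-target class='3'><h1>Website</h1><br /></body></html>"))
print(start_tags("<html><head><title>HTML Parser - II</title></head>"))
```

With the two lines from the question this should give `['body', 'h1']` and `['html', 'head', 'title']`.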

Related

RegEx to delete all XML data outside of specified tags

I am using the latest and greatest version of NotePad++. Is it possible for a RegEx to delete all text and tags I don't need and only leave behind text and tags I need? The tags I need to remain look like this:
<warning>I need this text to remain intact together with accompanying tags.</warning>
There must be around 500 of these WARNING tag pairs nested within a variety of XML levels. I would like the RegEx to delete all data that exists outside of these WARNING tags, but not the opening and closing warning tags themselves or the text within them. Below are four RegEx variations I tested; after a Find & Replace operation they all eliminate the text located within the warning tags, so they are no help:
<warning>[^<>]+</warning>
<warning>[^>]+</warning>
<warning>(.+?)</warning>
<warning>.*?</warning>
I would tremendously appreciate any help that will assist me in developing a RegEx that will perform the data clean up task I need to perform.
The Notepad++ regex find and replace below seems to work for me. Remember to select the Regular expression search mode.
Search for each of the two regexes below and replace with nothing. It requires two steps, so it is not perfect yet:
1st replace: remove all lines that do not start with warning
2nd replace: remove all the empty lines, leaving only lines with warning
^(?!\s*?<warning>).*?$
^\s*
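To see what the two steps do, here is a sketch that replays them with Python's `re` instead of Notepad++ (an assumption made only for illustration; the function name and the sample XML are invented). Like the Notepad++ version, it only works when each warning element sits on a line of its own.

```python
import re


def keep_warning_lines(text):
    # Step 1: blank out every line that does not start with <warning>
    # (leading whitespace allowed). re.MULTILINE makes ^ and $ work per line.
    text = re.sub(r'^(?!\s*?<warning>).*?$', '', text, flags=re.MULTILINE)
    # Step 2: remove the empty lines and the leading indentation left behind.
    return re.sub(r'^\s*', '', text, flags=re.MULTILINE)


# Invented sample document.
sample = """<root>
  <a>
    <warning>Keep me.</warning>
    <b>drop me</b>
  </a>
  <warning>Keep me too.</warning>
</root>"""

print(keep_warning_lines(sample))
```

Only the two `<warning>` lines should remain, each without its indentation.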

Regex to match format of valid markup language tags

I am trying to write a regex for all types of tags, whether HTML or XML.
I wrote two regexes for this:
<(\"[^\"]*\"|'[^']*'|[^'\">])*>
<html.*>(.*?)</html>
These match all valid tags, but they match invalid tags too, like:
<"font size=12">
...so I want a regex for valid tags only. Can anybody please help?
Some people have worked on this, with code coverage, to get a good HTML/XML tag matcher (many traps!).
One working solution may be: http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx/
The regex is <\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>
It matches opening and closing tags individually, which is useful if you want to remove tags, for instance (in fact you cannot really expect more from a simple regex, as Jithin answered).
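A quick way to try the pattern is sketched below with Python's `re` (an assumption; the question does not name a language). The constant name `TAG` and the sample strings are invented; the pattern itself is copied unchanged from the line above.

```python
import re

# The pattern quoted above, copied verbatim into a raw string.
TAG = re.compile(
    r"""<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>"""
)

# Well-formed tags are accepted ...
print(bool(TAG.fullmatch('<font size="12">')))  # True
print(bool(TAG.fullmatch('</p>')))              # True
print(bool(TAG.fullmatch('<br />')))            # True

# ... while the invalid example from the question is rejected,
# because a tag name (\w+) must follow '<' directly.
print(bool(TAG.fullmatch('<"font size=12">')))  # False

# Removing tags, as mentioned in the answer.
print(TAG.sub('', '<p class="x">Hi <b>there</b></p>'))  # Hi there
```

Since `\w` does not include `-`, attribute names such as `data-id` are not covered by this pattern.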

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairs from a particular tag:
<aaa>key0:val0 key1:val1 key2:va2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s] which will also match correctly in groups on this:
key0:val0 key1:val1 key2:va2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with - (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>) - there are two matches to look at. The overall match covers everything inside the tags, while the capturing group matches a single "key:value" occurrence. The regex engine cannot give you each individual occurrence of a repeated group, because it does not work that way. It just gives you the last one.
If you think in Python, on the match object obtained after applying your regex, you will have access to matcher.group(0) (the whole match) and matcher.group(1) (the last repetition of the ( ) group).
But what you want is the n occurrences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
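A small sketch of both points in Python (the sample string is adapted from the question, and the variable names are invented):

```python
import re

xml = "<aaa>key0:val0 key1:val1 key2:val2</aaa>"

# A repeated capturing group only remembers its last repetition.
m = re.search(r"<aaa>(?:(\w+:\w+)\s*)+</aaa>", xml)
print(m.group(1))  # key2:val2

# Two steps instead: isolate the tag content, then find every pair in it.
inner = re.search(r"<aaa>(.*?)</aaa>", xml).group(1)
pairs = re.findall(r"\w+:\w+", inner)
print(pairs)  # ['key0:val0', 'key1:val1', 'key2:val2']
```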
I uploaded this one at parsemarket, and I'm not sure it's what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.
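As a sketch of the parser-plus-loop approach, the following uses `xml.etree.ElementTree` from the Python standard library in place of BeautifulSoup, purely to avoid a third-party dependency. The `<root>` wrapper and the variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented document; only the <aaa> element comes from the question.
doc = "<root><aaa>key0:val0 key1:val1 key2:val2</aaa></root>"

root = ET.fromstring(doc)
pairs = {}
for element in root.iter("aaa"):
    # split() breaks the text on whitespace; each piece is one "key:value".
    for item in (element.text or "").split():
        key, value = item.split(":", 1)
        pairs[key] = value

print(pairs)  # {'key0': 'val0', 'key1': 'val1', 'key2': 'val2'}
```

The loop handles any number of pairs, which is the part a single regex cannot express with separate capture groups.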

Google Analytics Regular Expressions

Kinda new to regular expressions and, for the benefit of learning, wanted to know how to do the following on one line:
page matching regular expression: .pdf/$
and page containing "somestring"
and page excluding "someotherstring"
I can obtain my desired output using the 3 rules above. My question is can I put all into one line using regular expression? So the first line would be something like:
page matching reg exp: .pdf/$ somestring+ (then regex for does not contain in GA) someotherstring
Is it possible to put it all in one line?
Lookahead will help you match multiple independent things in one expression, and even allows you to require that something does not match. In your case:
/^(?=.*somestring)(?!.*someotherstring).*\.pdf$/
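The logic of the pattern can be checked outside Google Analytics, for example with Python's `re` (the surrounding `/` delimiters are dropped, and the page paths are invented). This only demonstrates how the lookaheads combine; whether the Google Analytics filter field accepts lookahead syntax would need to be checked separately.

```python
import re

# (?=.*somestring)        the page must contain "somestring"
# (?!.*someotherstring)   the page must not contain "someotherstring"
# .*\.pdf$                the page must end in ".pdf"
PAGE = re.compile(r"^(?=.*somestring)(?!.*someotherstring).*\.pdf$")

# Invented page paths.
pages = [
    "/docs/somestring/report.pdf",                  # passes all three rules
    "/docs/somestring/someotherstring/report.pdf",  # contains the excluded string
    "/docs/report.pdf",                             # lacks "somestring"
    "/docs/somestring/report.html",                 # not a .pdf
]

matching = [p for p in pages if PAGE.match(p)]
print(matching)  # ['/docs/somestring/report.pdf']
```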

How to Easily Remove Unwanted Parts in HTML Table Cells Using Notepad++

I have series of different occurrences of table cells in some html files as shown in this image:
http://screencast.com/t/MqGHN2iwfd
Apart from the beginning and end of each cell, they have the following parts in common:
.net/?mobile=true
/spotlightProfile.htm?f=mkt&v=
/#stats
I want to either be able to remove all the parts that look like that at once
OR be able to remove one-by-one in Notepad++:
the url part that precedes .net/?mobile=true
the url parts before and after /spotlightProfile.htm?f=mkt&v= and
the url part before /#stats
Furthermore, I also want to be able to remove the duplicate occurrences, also in Notepad++.
Thanks a lot in anticipation for helping out.
Regex would look something like this.
Search for: (.*)(\/\?mobile=true|\/spotlightProfile\.htm\?f=mkt&v=|\/#stats)?(.*)
Replace With: \1\3
Basically we create 3 groups:
before the expression you match,
the expression you are trying to replace
the rest of the line
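The three-group idea can be tried out in Python's `re` before running it in Notepad++ (Python is an assumption here, and the URL and names are invented). One thing the sketch suggests: with a greedy first group and an optional middle group, the first `(.*)` can swallow the whole line, so nothing is removed. Making the first group lazy and the middle group mandatory avoids that.

```python
import re

# The three fragments from the question, escaped for use in a regex.
FRAGMENT = r"/\?mobile=true|/spotlightProfile\.htm\?f=mkt&v=|/#stats"


def strip_fragment(line):
    # Lazy first group stops at the first fragment; the fragment group is
    # mandatory, so lines without a fragment are left untouched.
    return re.sub(r"^(.*?)(" + FRAGMENT + r")(.*)$", r"\1\3", line)


url = "http://example.net/player/42/#stats"  # hypothetical URL
print(strip_fragment(url))  # http://example.net/player/42

# Greedy first group with an optional fragment group, for comparison.
greedy = r"(.*)(" + FRAGMENT + r")?(.*)"
print(re.sub(greedy, r"\1\3", url) == url)  # True: nothing was removed
```

This removes only the first fragment found on a line; lines with several fragments would need the replacement to be run again.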