Regex to match format of valid markup language tags - regex

I am trying to write a regex that matches all types of tags, whether HTML or XML.
I wrote two regexes for this:
<(\"[^\"]*\"|'[^']*'|[^'\">])*>
<html.*>(.*?)</html>
These match all valid tags, but they also match invalid tags such as:
<"font size=12">
...so I want a regex that matches valid tags only. Can anybody please help?

Some people have worked on this, using code coverage to arrive at a good HTML/XML tag matcher (there are many traps!).
One working solution may be: http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx/
The Regex is <\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>
It matches opening and closing tags individually, which is useful if you want to strip tags, for instance (in fact, you cannot really expect more from a simple regex, as Jithin answered).
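To see how the patterns behave on the problem input, here is a small Python sketch. Python is only my choice for illustration (the question does not name a language), and the names LOOSE and STRICT are made up for this example. fullmatch is used so that each sample string is tested as one complete tag.

```python
import re

# First pattern from the question: a quoted string is accepted
# anywhere after "<", which is why <"font size=12"> gets through.
LOOSE = re.compile(r"""<("[^"]*"|'[^']*'|[^'">])*>""")

# Pattern from the linked article: a tag name (\w+) is required
# right after "<" or "</", so a leading quote is rejected.
STRICT = re.compile(
    r"""<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>"""
)

samples = ['<"font size=12">', "<font size=12>", "</font>", "<br />"]
for tag in samples:
    # Print whether each pattern accepts the whole string as a tag.
    print(tag, bool(LOOSE.fullmatch(tag)), bool(STRICT.fullmatch(tag)))
```

The stricter pattern still only checks the shape of a single tag. It does not check that the tag name is a real HTML element or that tags are balanced.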

Related

RegEx for matching HTML tags

I am trying to use a regular expression to extract start tags from lines of a given HTML document. In the following lines I expect to get only 'body' and 'h1' as start tags in the first line, and 'html', 'head' and 'title' as start tags in the second line:
I have already tried to do this using the following regular expression:
start_tags = re.findall(r'<(\w+)\s*.*?[^\/]>',line)
'<body data-modal-target class=\'3\'><h1>Website</h1><br /></body></html>'
'<html><head><title>HTML Parser - II</title></head>'
But my output for the first line is ['body','h1','br'], although I did not expect to catch 'br' since I excluded '/'.
And for the second line it is ['html','title'], whereas I expected to catch 'head' too. I would be grateful if you could tell me which part of my code is wrong.
If you wish to do so with regular expressions, you might want to design multiple different expressions, step by step. You may be able to connect them using OR pipes, but it may not be necessary.
RegEx 1 for h1-h6 tags
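This expression helps you capture the tags in the body, excluding body and head themselves:
(<(.*)>(.*)</([^br][A-Za-z0-9]+)>)
You might want to add more boundaries to it. For example, you can replace (.*) with lists of chars []. Note that [^br] is a character class, so it rejects closing tag names that start with b or r rather than the name br as a whole.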
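As an illustration of adding boundaries, here is a minimal Python sketch applied to the pattern from the question. In the original, .*? is allowed to run past a closing >, so '<br />' can match by swallowing text up to a later '>', and '<html><head>' is consumed as one match that only reports 'html'. The pattern below is my own suggestion, not a standard one: it uses [^>]* so a match cannot leave the tag, and a negative lookbehind so the closing > must not be preceded by /. It assumes attribute values never contain a > character.

```python
import re

# Hypothetical replacement pattern: a tag name, then any characters
# other than ">", then a ">" that is not preceded by "/".
START_TAG = re.compile(r"<([A-Za-z]\w*)[^>]*(?<!/)>")

lines = [
    "<body data-modal-target class='3'><h1>Website</h1><br /></body></html>",
    "<html><head><title>HTML Parser - II</title></head>",
]

for line in lines:
    # findall returns the captured tag names, one per start tag.
    print(START_TAG.findall(line))
```

Closing tags are skipped because the character after < must be a letter, and self-closing tags are skipped by the lookbehind.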
RegEx 2 for head and body
For head and body tags, you might want to match across newlines, for which you might use an expression similar to:
(<head>([\s\S]*)<\/head>)|(<body>([\s\S]*)</body>)
Performance
These expressions are rather expensive. You might want to simplify them, write some other script to parse your HTML, or use an HTML parser instead.
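For the parser route, a minimal sketch using Python's standard html.parser module could look like the following. The class name StartTagCollector and the helper start_tags are hypothetical names chosen for this example. The sketch assumes that "start tag" means everything except self-closing tags written with a trailing slash, such as <br />.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Collects the names of start tags, ignoring self-closing ones."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # Called for ordinary start tags such as <body ...> or <h1>.
        self.start_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for tags written as <br />; skip them on purpose.
        pass


def start_tags(line):
    """Return the start tag names found in one line of HTML."""
    collector = StartTagCollector()
    collector.feed(line)
    collector.close()
    return collector.start_tags


print(start_tags("<body data-modal-target class='3'><h1>Website</h1><br /></body></html>"))
print(start_tags("<html><head><title>HTML Parser - II</title></head>"))
```

Note that a <br> written without the slash would be reported as a start tag here, since the parser only sees the syntax, not the list of void elements.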

use regex to get both link and text associated with it (anchor tag)

I created a regex string that I hoped would get both the link and the associated text in an html page. For instance, if I had a link such as:
<a href='www.la.com/magic.htm'>magicians of los angeles</a>
Then the link I want is 'www.la.com/magic.htm' and the text I want is 'magicians of los angeles'.
I used the following regex expression:
strsearch = "\<a\s+(.*?)\>(.*?)\</a\s*?\>|"
But my vb program told me I was getting too many matches.
Is there something wrong with the regEx expression?
The circle-brackets are meant to get 'groups' that can be back-referenced.
Thanks
What about this one:
\<a href=.+\</a>
All that is left to do is to go over each match and extract the substrings using regular string manipulation.
You can test it on RegExr (although RegExr follows the JavaScript regex implementation, it is still useful in our scenario).
That being said, I often see people stating that regexes are not suited for parsing HTML. You might need to use an HTML parser for this. The ones I know of and can recommend are HtmlAgilityPack, which as far as I know is not maintained anymore, and AngleSharp.
I tried the following pattern, and it worked.
\<a href=(.*?)\>(.*?)\<\/a\s*?\>|
I also found two problems in your original string:
the / in </a> is not escaped
the word 'href' is captured in the first group
Finally, I would like to recommend a great site for testing regex strings. It will help you debug really fast. See this (it also demonstrates the result you want):
REGEX101
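One more thing worth checking is the trailing | at the end of the pattern. A | with nothing after it adds an empty alternative, and an empty alternative matches at every position where the first alternative does not, which is a likely cause of "too many matches". The sketch below shows this in Python rather than VB, purely for illustration. The sample string and the ANCHOR pattern are my own assumptions; the stricter pattern expects href to be the first attribute and to be quoted.

```python
import re

html = "See <a href='www.la.com/magic.htm'>magicians of los angeles</a> here."

# The question's pattern (without the backslashes before < and >),
# keeping the trailing "|" that adds an empty alternative.
with_pipe = re.findall(r"<a\s+(.*?)>(.*?)</a\s*?>|", html)

# The same pattern with the trailing "|" removed.
without_pipe = re.findall(r"<a\s+(.*?)>(.*?)</a\s*?>", html)

# Hypothetical stricter pattern: group 1 is the quote character,
# group 2 the href value, group 3 the link text.
ANCHOR = re.compile(r"""<a\s+href=(["'])(.*?)\1[^>]*>(.*?)</a\s*>""")
links = [(m.group(2), m.group(3)) for m in ANCHOR.finditer(html)]

print(len(with_pipe), len(without_pipe))
print(links)
```

With the | removed there is exactly one match per anchor, and the stricter pattern returns the link and its text directly as groups.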

Using -replace and pattern matching with XML tags in PowerShell

I am trying to replace the contents of a string which contains XML tags, as follows.
I want to replace the entirety of the statement below, where ABCDEF could be any random value:
<originalFileName>ABCDEF</originalFileName>
How would I do this?
Solution
You can try this for your purpose:
(<originalFileName>[\w]*</originalFileName>)
Not recommended
However, note that it is not recommended.
Regular expressions are a tool that is insufficient to understand the constructs employed by XML/HTML/XHTML. XML/HTML/XHTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down XML/HTML/XHTML into its meaningful parts. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing XML/HTML/XHTML. XML/HTML/XHTML is a language of sufficient complexity that it cannot be parsed by regular expressions.
Further details : RegEx match open tags except XHTML self-contained tags
This could help you:
EDITED (this new one also accepts non-word characters between your tags, and I have escaped the / symbol, which can cause errors otherwise):
(<originalFileName>.*<\/originalFileName>)
Check it here.
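One caveat with .* is that it is greedy: if two originalFileName elements appear on the same line, a single match will stretch from the first opening tag to the last closing tag. A lazy .*? avoids that. Here is a minimal sketch of the idea in Python (the question is about PowerShell, so treat this only as an illustration of the pattern). The function name replace_original_file_name and the sample values are hypothetical.

```python
import re

# Lazy ".*?" stops at the first closing tag; DOTALL lets the value
# span line breaks if the element is wrapped.
PATTERN = re.compile(r"<originalFileName>.*?</originalFileName>", re.DOTALL)


def replace_original_file_name(xml_text, new_value):
    """Replace every originalFileName element with one holding new_value."""
    replacement = f"<originalFileName>{new_value}</originalFileName>"
    # A function replacement avoids backslash processing of new_value.
    return PATTERN.sub(lambda _match: replacement, xml_text)


sample = "<doc><originalFileName>ABCDEF</originalFileName></doc>"
print(replace_original_file_name(sample, "new.txt"))
```

This works for a flat element like this one. For anything with nesting or attributes, the advice above about using an XML parser still applies.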

Regex to identify HTML tags (as a regex repetition learning exercise ONLY!!)

I'm very very new to regex. I'd managed to not touch it with a 10-foot pole for so long. And I tried my best to avoid it so far. But now a personal project is pushing me to learn it.
So I started. And I'm going through the tutorial located here: http://www.regular-expressions.info/tutorial.html
Currently I'm here: http://www.regular-expressions.info/repeat.html
My question is:
The tutorial says <[A-Za-z][A-Za-z0-9]*> will match an HTML tag.
But wouldn't it also match invalid html tags like - <h11> or <h111>?
Also how would it match the closing tags?
Edit - My question is very specific. I am referring to one particular example in one particular tutorial to clarify whether or not my understanding of repetitions is correct. Again, I REPEAT, I DO NOT care about html parsing with regex.
I don't see any harm in answering your question seeing as how you are attempting to learn regex:
1) Yes, it will match invalid tags as well because it's any letter followed by any zero or more matches of another letter or a number.
2) It will not match closing tags (there would have to be a search for a / somewhere in there).
One more comment: one way people used to look for HTML tags inside a document was to look for the pattern of opening and closing brackets, like so:
<\/?[^>]*>
That's opening-bracket, an optional slash, (anything but a closing bracket)-repeated and then a closing bracket. Of course, I am not recommending anyone do this. It's merely left here as an exercise.
The tutorial says <[A-Za-z][A-Za-z0-9]*> will match an HTML tag.
But wouldn't it also match invalid html tags like - <h11> or <h111>?
Also how would it match the closing tags?
Yes, that will match <h11> as well as <X098wdfhfdshs98fhj2hsdljhkvjnvo9sudvsodfih23234osdfs>.
If you want to just match a letter followed by an optional single digit, so you'd match <h1>, then you want <[A-Za-z][0-9]?>
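If it helps to see the repetition operators side by side, here is a small Python sketch that runs the patterns from this thread against a few tags. The variable names are made up for the example, and the third pattern (with an optional slash) is just one way to also accept closing tags.

```python
import re

# Tutorial pattern: one letter, then zero or more letters or digits.
any_tag = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")

# One letter followed by at most one digit.
one_digit = re.compile(r"<[A-Za-z][0-9]?>")

# Same as the tutorial pattern, with an optional "/" after "<".
open_or_close = re.compile(r"</?[A-Za-z][A-Za-z0-9]*>")

for tag in ["<h1>", "<h11>", "<h111>", "</h1>"]:
    # fullmatch checks that the whole string is matched by the pattern.
    print(
        tag,
        bool(any_tag.fullmatch(tag)),
        bool(one_digit.fullmatch(tag)),
        bool(open_or_close.fullmatch(tag)),
    )
```

The * lets any number of extra characters through, the ? allows zero or one, and the /? is what makes the closing tag match.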

How to use regex to replace text between tags in Notepad++

I have a code like this:
<pre><code>Some HTML code</code></pre>
I need to escape the HTML between the <pre><code></code></pre> tags. I have lots of tags, so I thought: why not let regex do it for me? The problem is I don't know how. I've seen lots of examples using Google and Stack Overflow, but nothing I could use. Can someone here help me?
Example:
<pre><code>Some <a href="http">HTML</a> code</code></pre>
To
<pre><code>Some &lt;a href="http"&gt;HTML&lt;/a&gt; code</code></pre>
Or just a regex so I can replace anything between the <pre><code> and </code></pre> tags one by one. I'm almost certain that this can be done.
This regex will match the parts of the anchor tag you need to put back:
<pre><code>([^<]*?)<a href="(.*?)">(.*?)</a>(.*?)</code></pre>
See a live demo, which shows it matching correctly and also shows the various parts being captured as groups which we'll refer to in the replacement string (see below).
Use the regex above with the following replacement:
<pre><code>\1&lt;a href="\2"&gt;\3&lt;/a&gt;\4</code></pre>
The \1, \2 etc. are the captured groups in the regex that put back what we're keeping from the match.
A regular expression to return "the thing between <pre><code> and </code></pre>" could be
/(?<=<pre><code>).*?(?=<\/code><\/pre>)/
This uses lookaround expressions to delimit the "thing that gets matched". Typically using regex in situations with nested tags is fraught with danger and you are much better off using "real tools" made specifically for the job of parsing xml, html etc. I am a huge fan of Beautiful Soup (Python) myself. Not familiar with Notepad++, so not sure if its dialect of regex matches this expression exactly.
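Outside Notepad++, the lookaround pattern can be combined with a real escaping function instead of a hand-written replacement string. The following is a minimal Python sketch under the assumption that the <pre><code> blocks are not nested and are written exactly like that, with no attributes or whitespace between the tags. The function name escape_code_blocks is hypothetical.

```python
import html
import re

# Lookbehind and lookahead delimit the block without consuming the tags.
# DOTALL lets the content span several lines.
BLOCK = re.compile(r"(?<=<pre><code>).*?(?=</code></pre>)", re.DOTALL)


def escape_code_blocks(page):
    """Escape &, < and > inside every <pre><code>...</code></pre> block."""
    # quote=False leaves double and single quotes untouched.
    return BLOCK.sub(lambda m: html.escape(m.group(0), quote=False), page)


sample = '<pre><code>Some <a href="http">HTML</a> code</code></pre>'
print(escape_code_blocks(sample))
```

Running it twice would escape the ampersands of the first pass again, so it should be applied once to unescaped input.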