I'm using a regex to parse some HTML. I have the following regex, which matches all tags except img and a:
\<(?!img|a)[^\>]+\>
This works well, but I also want it to match the closing tags. I've tried the following, but it doesn't work:
\</?(?!img|a)[^\>]+\>
What would be the best way to do this?
(Also before there is a plethora of comments saying not to use regexes to parse HTML I'd just like to say that this HTML is generated by a tool and is very uniform.)
EDIT:
<p>So in this</p>
<p>HTML <strong>with nested tags</strong></p>
<p>It should remove <i>everything</i> except This link
and this <img src="#" alt="image" /> but it also needs to keep the textual content</p>
I think that the simplest solution would be the following:
<\/?(?!img|a)[^>]+>
It simply matches:
a <,
a / (escaped with \) if there is any (quantifier ?),
asserts that there is neither img nor a,
a sequence of anything but > ([^>]+) and
a >
See it working here on regex101.
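One caveat worth checking, sketched here in Python (an assumption, since the thread does not name a regex flavor). In a backtracking engine the optional slash can match nothing, after which the lookahead sees "/a" rather than "a", so </a> is still matched. The second pattern, stricter, is a hypothetical variant that moves the optional slash inside the lookahead and adds \b so that tags such as <abbr> are not treated like <a>. The <a href="#"> in the sample is also an assumption, since the link markup is missing from the example in the question.

```python
import re

# Sample based on the question; the <a href="#"> wrapper is assumed.
html = ('<p>It should remove <i>everything</i> except '
        '<a href="#">This link</a> and this <img src="#" alt="image" /></p>')

# Pattern from the answer above.
original = re.compile(r'<\/?(?!img|a)[^>]+>')

# Hypothetical variant: optional slash inside the lookahead, plus a word
# boundary so only the exact tag names img and a are excluded.
stricter = re.compile(r'<(?!\/?(?:img|a)\b)[^>]+>')

original_matches = original.findall(html)
stricter_matches = stricter.findall(html)
cleaned = stricter.sub('', html)

print(original_matches)  # includes '</a>' because the slash backtracks
print(stricter_matches)
print(cleaned)
```

With the stricter variant, both the opening and closing a tags and the img tag survive the substitution, and the text content is left in place.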
Ok here is a pretty wasteful solution:
<(?!img|a|\/img|\/a)[^>]+>
It would be great if someone could find a better one.
I have text:
<div class="separator" style="clear: both; text-align: center;"><img src="/demodomain.com/-13ucJuEQEUw/linktoimg.png"><img border="0" data-original-height="618" data-original-width="1062" height="372" src="https://21.imgdomain.com/-13ucJuEQEUw/WsGsjY2E2bI/-13ucJuEQEUw/linktoimg.jpg" width="640"></div>
I use (?<=<img)(.*?)([0-9]+.imgdomain.com)(.*?)(.*?)> to mark the image domain inside the <img> tag.
But it doesn't work as I expect: it also marks the image domain inside the <a> tag.
Demo Regex
How can I get the correct marking? Thanks!
Your regex is too permissive, especially the use of .*, which matches any character. It is better to use [^>], which will not match >. This example matches only inside the img tag:
(?<=<img)([^>]*)([0-9]+\.imgdomain\.com)[^>]*?>
While parsing data from HTML with regex might be OK for some very simple cases, you really should be aware of the pitfalls. For example, a tag containing an escaped > will break the regex above. If you cannot rule that out, it is better to use a parser. Here is the link to the live demo.
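A quick way to try the [^>] idea in Python, as a rough sketch rather than a definitive pattern. Two details are my own assumptions: the dot before imgdomain is escaped, and the first group is lazy ([^>]*?). With a greedy first group, the engine backtracks from the right and the domain group keeps only the last digit of the number. The sample string is a shortened version of the markup in the question.

```python
import re

# Shortened sample from the question.
html = ('<div class="separator">'
        '<img src="/demodomain.com/-13ucJuEQEUw/linktoimg.png">'
        '<img border="0" height="372" '
        'src="https://21.imgdomain.com/-13ucJuEQEUw/linktoimg.jpg" width="640">'
        '</div>')

# Greedy first group: group 2 ends up with only the last digit.
greedy = re.compile(r'(?<=<img)([^>]*)([0-9]+\.imgdomain\.com)[^>]*>')

# Lazy first group (assumed tweak): group 2 gets the whole number.
lazy = re.compile(r'(?<=<img)([^>]*?)([0-9]+\.imgdomain\.com)[^>]*>')

greedy_domain = greedy.search(html).group(2)
lazy_domain = lazy.search(html).group(2)

print(greedy_domain)
print(lazy_domain)
```

Because [^>] cannot cross the end of a tag, neither pattern can start in one tag and pick up a domain from a later one.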
I just don't get my Regex right:
I have the following template:
<!-- Defines the template for the tabs. -->
{{TMPL:Import=../../../../Data/Templates/Ribbon/tabs.tmpl; Name=Tabs}}
<div class="tabs">
<ul role="tablist">
{{BOS:Sequence}}
<li role="tab" class="{{TabType}}" id="{{tabId}}">
<span>{{TabFile}}</span>
</li>
{{EOS:Sequence}}
</ul>
</div>
{{Render:Tabs}}
I would like to find everything between {{ and }}, except the tags that begin with {{BOS, {{EOS, {{TMPL, or {{Render.
Here are a couple approaches:
Attempt 1:
({{).*(}})
This selects everything between {{ }} tags, which is not good.
Attempt 2:
({{)[^TMPL][^BOS][^EOS][^Render].*(}})
With this, {{TabType}} and {{TabFile}} are no longer selected, and I just don't know why.
With some other regex, I get that {{TabType}}" id="{{tabId}} is selected as one match.
Does anyone have a clue how to solve this? I really need a regex guru :-)
You can use negative lookahead based regex like this:
{{(?!TMPL|[BE]OS|Render).*?}}
RegEx Demo
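For what it's worth, here is how the lookahead version behaves on the template from the question when tried in Python. This is only a sketch: the braces are escaped and a capturing group is added around .*? so that findall returns just the names, and both of those are my additions rather than part of the answer.

```python
import re

# Template from the question (HTML comment line omitted).
template = """{{TMPL:Import=../../../../Data/Templates/Ribbon/tabs.tmpl; Name=Tabs}}
<div class="tabs">
<ul role="tablist">
{{BOS:Sequence}}
<li role="tab" class="{{TabType}}" id="{{tabId}}">
<span>{{TabFile}}</span>
</li>
{{EOS:Sequence}}
</ul>
</div>
{{Render:Tabs}}"""

# The negative lookahead rejects a match when the text right after "{{"
# starts with one of the reserved keywords.
placeholder = re.compile(r'\{\{(?!TMPL|[BE]OS|Render)(.*?)\}\}')

names = placeholder.findall(template)
print(names)
```

The lazy .*? is what keeps {{TabType}} and {{tabId}} as two separate matches on the same line.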
You have to use the following regex to get the content between braces:
\{\{(.*?)\}\}
Working Demo
If you want to exclude the content from the comment you posted you can use a regex technique to exclude what you don't want and keep what you want at the end of the regex:
\{\{BOS:Sequence\}\}|\{\{EOS:Sequence\}\}|\{\{TMPL:Import.*?\}\}|\{\{Render:Tabs\}\}|\{\{(.*?)\}\}
Working demo
By the way, if you want a shorter version of the above regex, you can use:
\{\{(?:BOS|EOS):Sequence\}\}|\{\{TMPL:Import.*?\}\}|\{\{Render:Tabs\}\}|\{\{(.*?)\}\}
This is a very useful technique for pattern exclusion that I was glad to learn from Anubhava and zx81 (they rock at regex patterns). With this technique, the content you need ends up in the capturing group, while the excluded alternatives match without capturing anything.
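A small Python sketch of this "match what you don't want, capture what you do" idea. The sample string is a condensed version of the template from the question, and the variable names are just for illustration. Matches that come from the first three alternatives leave group 1 unset, so they can be filtered out.

```python
import re

# Condensed version of the template from the question.
template = ('{{BOS:Sequence}}<li class="{{TabType}}" id="{{tabId}}">'
            '<span>{{TabFile}}</span></li>{{EOS:Sequence}}{{Render:Tabs}}')

pattern = re.compile(
    r'\{\{(?:BOS|EOS):Sequence\}\}'   # unwanted: consumed, nothing captured
    r'|\{\{TMPL:Import.*?\}\}'        # unwanted
    r'|\{\{Render:Tabs\}\}'           # unwanted
    r'|\{\{(.*?)\}\}'                 # wanted: content lands in group 1
)

all_matches = [m.group(0) for m in pattern.finditer(template)]
wanted = [m.group(1) for m in pattern.finditer(template)
          if m.group(1) is not None]

print(all_matches)
print(wanted)
```

The overall match list still contains the unwanted tags; only the group 1 values are the ones to keep.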
Using [^TMPL] and the like won't work because these are character classes. You could use a negative lookahead, though (or even lookbehind depending upon the regex library you are using).
\{\{(?!BOS:)(?!EOS:)(?!Render:)(?!TMPL:)(.*?)\}\}
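To see why the character-class attempt drops {{TabType}} and {{TabFile}}, it may help to test single tokens. In this Python sketch (the results dictionary is just for illustration), [^TMPL] means "one character that is not T, M, P, or L", so any name starting with a capital T is rejected at the first character, whether or not it spells TMPL.

```python
import re

# Attempt 2 from the question: four negated character classes, each of
# which consumes exactly one character.
attempt = re.compile(r'\{\{[^TMPL][^BOS][^EOS][^Render].*\}\}')

# Lookahead version from this answer: checks whole words, consumes nothing.
lookahead = re.compile(r'\{\{(?!BOS:)(?!EOS:)(?!Render:)(?!TMPL:)(.*?)\}\}')

results = {}
for token in ['{{TabType}}', '{{TabFile}}', '{{tabId}}', '{{BOS:Sequence}}']:
    results[token] = (bool(attempt.search(token)),
                      bool(lookahead.search(token)))
    print(token, results[token])
```

{{tabId}} passes the character-class version only because none of its first four characters happens to fall in the excluded sets.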
Still I get the feeling that you want the BOS, EOS, etc. to just be strings in the template with {{ and other values to be interpolated. If you are using handlebars or something, you can have strings interpolated:
{{'{{BOS:Sequence}}'}}
I am parsing html. I know this shouldn't be done with regex but dom/xpath. In my case it should just be fast, simple and without tidy so I chose regex.
The task is replacing all the style='xxx' with an empty string, except within tables.
This regex for preg_replace works catching all style='xxx' no matter where:
'/ style="([^"]+)"/s'
The content can look like this
<!-- more html here -->
<span style='do:smtg'><table class=... > <span style="...">
<table> <div style=""></div></table></span></table>
<!-- more html here -->
or just simple non nested tables, meaning regex should exclude all style='...' also within nested tables.
Is there a simple syntax doing this?
Thou Shalt Not Parse HTML with Regular Expressions!
No, really, you shouldn't.
As evidenced by your example, you can expect nested tables. That means the regex should keep track of the level of nesting, to decide whether or not you're in a table. If you find a way to do this, it will certainly not be "fast and simple".
Email, resurrecting this question because it had a regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the disclaimers about using regex to parse html, here is a simple way to do it.
First we need a regex to match tables, nested or not. This does it with simple recursion:
<table(?:.*?(?R).*?|.*?)</table>
Next, we exclude these, and match what we do want. Here is the whole regex:
(?s)<table(?:.*?(?R).*?|.*?)<\/table>(*SKIP)(*F)|style=(['"])[^'"]*\1
See the demo
The left side of the alternation matches complete tables, nested or not, then deliberately fails. The right side matches and captures your styles to Group 1, allowing for different quote styles. We know these are the right styles because they were not matched by the expression on the left.
With this regex, you can do a simple preg_replace($regex, "", $yourstring);
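The pattern above relies on PCRE features, (?R) and (*SKIP)(*F), which Python's built-in re module does not have. If you ever need the same effect there, one rough alternative (a different technique, not a translation of the regex above) is to match table tags and style attributes with one pattern and track the nesting depth in a replacement callback. The function name strip_styles_outside_tables is made up for this sketch.

```python
import re

# One alternation: table open tag | table close tag | style attribute.
TOKEN = re.compile(
    r"""(<table\b)|(</table\s*>)|\sstyle=(['"])[^'"]*\3""",
    re.IGNORECASE,
)

def strip_styles_outside_tables(html: str) -> str:
    """Remove style attributes unless they occur inside a table."""
    depth = 0  # current table nesting level

    def repl(m: re.Match) -> str:
        nonlocal depth
        if m.group(1):                 # <table ...
            depth += 1
            return m.group(0)
        if m.group(2):                 # </table>
            depth = max(depth - 1, 0)
            return m.group(0)
        # style attribute: keep it inside a table, drop it outside
        return m.group(0) if depth > 0 else ""

    return TOKEN.sub(repl, html)

sample = ("<span style='do:smtg'><table class=\"x\"> <span style=\"a:b\">"
          "<table> <div style=\"\"></div></table></span></table>"
          "<p style=\"c:d\">x</p>")
result = strip_styles_outside_tables(sample)
print(result)
```

Like the regex version, this assumes well-formed, balanced table tags. A style attribute on the table tag itself is treated as being inside the table.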
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
I want to remove a div from a couple hundred html files
<div id="mydiv">
blahblah blah
more blah blah
more html
<some javascript here too>
</div>
I thought that this would do the job but it doesn't
<div(.*)</div>
Does anyone know which is the proper regex for this?
Regex
<div[^>]+>(.*?)</div>
Don't forget to check the ". matches newline" option.
Alternatively, you can use this regex also: <div[^>]+>([\s\S]*?)</div> with or without the checkbox checked.
Discussion
Since the * metacharacter is greedy, you need to tell it to take as few characters as possible (by adding ?).
Check that the divs you want to remove DO NOT contain nested divs. If they do, the regex at the start of my answer won't help you.
If you face this case, I'd suggest using an HTML parser.
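If you end up scripting this instead of using an editor, the same two variants can be tried in Python, where the ". matches newline" option corresponds to the re.DOTALL flag. This is a minimal sketch with a made-up sample document. It assumes the div has no nested divs, as discussed above.

```python
import re

# Made-up sample with a multi-line div to remove.
html = """<p>before</p>
<div id="mydiv">
blahblah blah
more blah blah
</div>
<p>after</p>"""

# Variant 1: "." only crosses line breaks when DOTALL is set.
with_flag = re.sub(r'<div[^>]+>(.*?)</div>', '', html, flags=re.DOTALL)

# Variant 2: [\s\S] matches any character, so no flag is needed.
without_flag = re.sub(r'<div[^>]+>([\s\S]*?)</div>', '', html)

# Without the flag, variant 1 finds nothing because the div spans lines.
unchanged = re.sub(r'<div[^>]+>(.*?)</div>', '', html)

print(with_flag)
```

Both variants give the same result here, while the unflagged first pattern leaves the input untouched.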
I have the following code grabbed from a webpage source code:
<span>41,396</span>
And the following regex:
("<span>.*</span>")
Which returns
<span>New Users</span>
However, I don't want to have the tags in the results. I've tried a few things, but Regular Expressions are new to me.
More so than this I need to get the Regex for the following code:
<span>41,396</span>
</span>
<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>
I was thinking this may involve line breaks and more, which is why I threw this in separately.
You could try something like
<span( class=\".+\")?>(.*)</span>
And then get capture group 2 for the tag's body. But be aware that regular expressions are NOT good for parsing HTML/XML. What would happen if you had nested <span> tags?
If the input gets even the slightest bit more complicated than what you've shown, look for an HTML parser and try using that instead.
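As an illustration of the parser route, here is a small sketch using Python's standard html.parser module (the question does not say which language is in use, so Python is an assumption). The class name SpanText and the sample markup are made up for the example. The parser collects the non-blank text found inside span elements and copes with nesting by counting open tags.

```python
from html.parser import HTMLParser

class SpanText(HTMLParser):
    """Collect non-blank text that appears inside <span> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of currently open <span> tags
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span":
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.values.append(text)

html = """<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>
</span>"""

parser = SpanText()
parser.feed(html)
print(parser.values)
```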
You can use capturing group differently to get the value instead of tag + value
"<span>(.*)</span>"
Consider using an HTML parsing library in your language of choice if the regex becomes more complicated.
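A short Python sketch of the capturing-group approach, under the assumption that the values sit in plain <span> tags as in the question. One detail differs from the pattern above: the quantifier is lazy (.*?), because a greedy .* runs to the last </span> on the line when two spans share a line.

```python
import re

html = """<span class="levelColumn">
<span>2,150</span>
</span>
<span class="xpColumn">
<span>161,305,807</span>
</span>"""

# With a single group, findall returns only the captured text.
values = re.findall(r'<span>(.*?)</span>', html)
print(values)

# Greedy versus lazy when two spans are on the same line.
one_line = '<span>1</span><span>2</span>'
greedy = re.findall(r'<span>(.*)</span>', one_line)
lazy = re.findall(r'<span>(.*?)</span>', one_line)
print(greedy, lazy)
```

The outer <span class="..."> tags are skipped simply because they do not match the literal <span> in the pattern.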
As far as I know, regex matching works line by line by default, but you could write an expression that handles that.
Try: <span>(.*)</span>
You should be able to retrieve the information you want with \1
In the case of <span class="xpColumn"> it would just not match, and \1 would be empty.
Cheers :)