I have an HTML parser doing the hard work, but I need a regex to select anchors that don't have the attribute id="optout". Here's my current regex, which selects all anchors whose href starts with http. It works well, but it needs to ignore anchors with id="optout" -- any ideas?
Thanks!
<cfset matches = ReMatch('<a[^>]*href="http[^"]*"[^>]*>(.+?)</a>', arguments.htmlCode) />
Regex is the wrong tool for this task, and given that you've already got an HTML parser involved, there's no reason not to keep using it!
Here's the trivial way to do it with an HTML parser (jsoup):
jsoup.parse( Arguments.HtmlCode ).select('a:not([id=optout])')
Here's the far less maintainable regex way to do it:
rematch( '(?i)<a\s*(?:(?!id\s*=\s*[''"]optout[''"])[^>])+>(?:[^<]+|<(?!/a>))+</a>' , Arguments.HtmlCode )
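For readers outside ColdFusion, here is a minimal sketch of the same parser-first idea using Python's standard-library html.parser instead of jsoup. The names AnchorCollector and anchors_without_optout are made up for illustration, and the href check (starts with "http") is carried over from the question's regex as an assumption.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href values of <a> tags that are not marked id="optout"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        # Keep only http(s) links whose id is anything other than "optout".
        if href.startswith("http") and attributes.get("id") != "optout":
            self.hrefs.append(href)


def anchors_without_optout(html):
    """Hypothetical helper: returns the matching hrefs in document order."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs
```

The parser deals with attribute order, quoting and case, which is the part that makes the regex version hard to maintain.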
Related
I'm trying to perform a regex replacement on the HTML below. I'm using an existing (I didn't write it and don't really understand it) regex pattern that ignores anything inside of an HTML tag, but I need it to also ignore anything between script tags. The pattern is (?<!<[^>]*)(diversity|and|inclusion). The problem is that the and in 'playerBrandingId' in the javascript is getting matched and ultimately replaced. In case it matters, I'm using C#. You can see what I get here.
<p>When it comes to building more diverse and inclusive workforces, the sports industry is already a leader, but it can do much more. One of the ways SBD/SBJ is focusing on diversity and inclusion is by talking to business leaders about what the industry can do better. In our first video in the “SBJ Diversity and Inclusion” series, we hear from execs working in leagues, technology, recruitment and academia.</p>
<div class="article-offset-block article-video article-offset-block--half">
<div class="u-vr2">
<div id='video-F17F523A70EB43ECAF54DF46144835B4'></div>
</div>
</div>
<script>
var playerParam = {
'pcode': 'poeXI63BtIsR_ugBoy3Z6X8KfiMo',
'playerBrandingId': 'video-F17F523A70EB43ECAF54DF46144835B4',
'autoplay': false,
'loop': false
};
OO.ready(function () { window.ppF17F523A70EB43ECAF54DF46144835B4 = OO.Player.create('video-F17F523A70EB43ECAF54DF46144835B4', 'w5cW9qZTE6qRRDqfBdi861XWJTXci9uE', playerParam); });
</script>
EDIT:
The pattern is generated by a user's query, so the pattern could include the word window or player which would be matched in the javascript when I change the pattern to include the \b like so: (?<!<[^>]*)\b(window|player|and)\b
Another example
Change your regex to (?<!<[^>]*)\b(diversity|and|inclusion)\b. The \b adds a test for a word boundary, forcing each word inside the ( and ) to match only as a whole word.
EDIT:
You are trying to parse the HTML to extract the text nodes and then check them.
You should not under any circumstances try to parse HTML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.
Use an HTML parsing library; see this page for some ways to do it, or search for extracting text nodes from HTML with .NET and C#.
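To make the "extract the text nodes, then check them" idea concrete, here is a rough sketch in Python using the standard-library html.parser as a stand-in for a .NET parsing library. TextNodeReplacer and replace_in_text_nodes are hypothetical names, and the sketch assumes reasonably well-formed input. It re-emits tags untouched and applies the word pattern only to text that is not inside script or style.

```python
import re
from html.parser import HTMLParser


class TextNodeReplacer(HTMLParser):
    """Re-emits the markup as-is; substitutes only inside ordinary text nodes."""

    SKIPPED = {"script", "style"}

    def __init__(self, pattern, replacement):
        # convert_charrefs=False keeps entities such as &amp; as separate events.
        super().__init__(convert_charrefs=False)
        self.pattern = pattern
        self.replacement = replacement
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            self.out.append(data)  # script/style content stays untouched
        else:
            self.out.append(self.pattern.sub(self.replacement, data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def replace_in_text_nodes(html, pattern, replacement):
    parser = TextNodeReplacer(pattern, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Usage would look like replace_in_text_nodes(page, re.compile(r"\b(diversity|and|inclusion)\b", re.I), lambda m: m.group(0).upper()). Since the pattern never sees tag or script content, user-supplied words like window or player are no longer a problem.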
The answer is that you cannot do what I'm trying to do with Regex according to this.
I'm using a regex to parse some HTML. I have the following regex, which matches all tags except img and a.
\<(?!img|a)[^\>]+\>
This works well, but I also want it to match the closing tags. I've tried the following, but it doesn't work:
\</?(?!img|a)[^\>]+\>
What would be the best way to do this?
(Also before there is a plethora of comments saying not to use regexes to parse HTML I'd just like to say that this HTML is generated by a tool and is very uniform.)
EDIT:
<p>So in this</p>
<p>HTML <strong>with nested tags</strong></p>
<p>It should remove <i>everything</i> except This link
and this <img src="#" alt="image" /> but it also needs to keep the textual content</p>
I think that the simplest solution would be the following:
<\/?(?!img|a)[^>]+>
It simply matches:
a <,
a / (escaped with \) if there is any (quantifier ?),
asserts that there is neither img nor a,
a sequence of anything but > ([^>]+) and
a >
See it working here on regex101.
Ok here is a pretty wasteful solution:
<(?!img|a|\/img|\/a)[^>]+>
It would be great if someone could find a better one.
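One possible tightening, sketched in Python: put the optional slash inside the lookahead. When the /? sits outside the lookahead, the engine can backtrack, leave the slash unmatched, and still match </a> with [^>]+ swallowing "/a". The \b is my own addition so that tags such as <abbr> are not mistaken for <a>. The function name strip_tags_except_img_a and the href in the sample are illustrative only.

```python
import re

# Any tag, opening or closing, whose name is not exactly "img" or "a".
TAG_EXCEPT_IMG_A = re.compile(r"<(?!/?(?:img|a)\b)[^>]+>")


def strip_tags_except_img_a(html):
    """Removes every tag except <img> and <a> (and </a>), keeping the text."""
    return TAG_EXCEPT_IMG_A.sub("", html)


if __name__ == "__main__":
    sample = '<p>except <a href="#">This link</a> and this <img src="#" alt="image" /></p>'
    print(strip_tags_except_img_a(sample))
```

This is the same idea as the \/img|\/a alternation above, just folded into a single optional slash.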
I am parsing html. I know this shouldn't be done with regex but dom/xpath. In my case it should just be fast, simple and without tidy so I chose regex.
The task is replacing all the style='xxx' with an empty string, except within tables.
This regex for preg_replace works catching all style='xxx' no matter where:
'/ style="([^"]+)"/s'
The content can look like this
<!-- more html here -->
<span style='do:smtg'><table class=... > <span style="...">
<table> <div style=""></div></table></span></table>
<!-- more html here -->
or just simple non nested tables, meaning regex should exclude all style='...' also within nested tables.
Is there a simple syntax doing this?
Thou Shalt Not Parse HTML with Regular Expressions!
No, really, you shouldn't.
As evidenced by your example, you can expect nested tables. That means the regex should keep track of the level of nesting, to decide whether or not you're in a table. If you find a way to do this, it will certainly not be "fast and simple".
Email, resurrecting this question because it had a regex that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the disclaimers about using regex to parse html, here is a simple way to do it.
First we need a regex to match tables, nested or not. This does it with simple recursion:
<table(?:.*?(?R).*?|.*?)</table>
Next, we exclude these, and match what we do want. Here is the whole regex:
(?s)<table(?:.*?(?R).*?|.*?)<\/table>(*SKIP)(*F)|style=(['"])[^'"]*\1
See the demo
The left side of the alternation matches complete tables, nested or not, then deliberately fails. The right side matches and captures your styles to Group 1, allowing for different quote styles. We know these are the right styles because they were not matched by the expression on the left.
With this regex, you can do a simple preg_replace($regex, "", $yourstring);
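The (?R) recursion and the (*SKIP)(*F) verbs are PCRE features; Python's built-in re module, for one, has neither. As a rough equivalent, here is a sketch that keeps the "skip tables, match styles elsewhere" idea but tracks table nesting with a counter in the replacement callback. The function name strip_styles_outside_tables is made up, and the sketch assumes quoted attribute values and balanced table tags.

```python
import re

# One pass over three kinds of tokens: <table ...>, </table>, and a style attribute.
TOKEN = re.compile(
    r"""(?P<open><table\b[^>]*>)"""
    r"""|(?P<close></table\s*>)"""
    r"""|(?P<style>\s+style=(?P<q>['"])[^'"]*(?P=q))""",
    re.IGNORECASE,
)


def strip_styles_outside_tables(html):
    depth = 0  # how many <table> elements are currently open

    def replace(match):
        nonlocal depth
        if match.group("open"):
            depth += 1
            return match.group(0)
        if match.group("close"):
            depth = max(depth - 1, 0)
            return match.group(0)
        # A style attribute: keep it inside a table, drop it everywhere else.
        return match.group(0) if depth else ""

    return TOKEN.sub(replace, html)
```

Because the opening-tag token consumes the whole <table ...> tag, a style on the table tag itself is preserved too, which matches what the PCRE version does.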
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
I can't seem to put together a working pattern to disallow all html tags except for the strong and em tag.
I don't want to parse the html; I just want to warn the user that the input will not be accepted. I am aware that this is not supported in all browsers, but I would love a pure html solution. I already have a working JS solution, but I want to layer the user experience.
<input name="user_input" pattern="^(?!<[^>]*>).*$" />
So allowed tags: strong, em
the use of all other tags should make the result false
Anyone able to crack this one?
KR
edit:
<input type="text" pattern="((?!<(?!\/?(strong|em))[^>]*>).)*">
is what seems to do the trick. Thank you for your help!
You can use a Negative Lookahead (?!) for this purpose.
An example regex string which matches the entire pair:
<(?!\/?strong|\/?em)[^>]*>.*(?:<\/.*?>)?
A shorter regex, which matches the first tag only
<(?!\/?(strong|em))[^>]*>
This pattern matches if an HTML tag other than strong or em exists.
So, if match = $true, you can deny the input and give the user a warning.
Regex101 demo
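For a server-side check to back up the pattern attribute, the shorter regex can be reused more or less as-is. Below is a minimal Python sketch. The function name has_disallowed_tag is hypothetical, and the \b is my own tweak so that a tag like <embed> is not let through just because it starts with "em".

```python
import re

# Matches any tag whose name is not exactly "strong" or "em" (opening or closing).
DISALLOWED_TAG = re.compile(r"<(?!/?(?:strong|em)\b)[^>]*>", re.IGNORECASE)


def has_disallowed_tag(user_input):
    """True when the input contains a tag other than <strong> or <em>."""
    return DISALLOWED_TAG.search(user_input) is not None


if __name__ == "__main__":
    for text in ["<strong>ok</strong>", "<b>not ok</b>", "<embed src=x>"]:
        print(text, has_disallowed_tag(text))
```

This only detects tags in order to warn the user; it does not sanitise anything.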
I want to remove a div from a couple hundred html files
<div id="mydiv">
blahblah blah
more blah blah
more html
<some javascript here too>
</div>
I thought that this would do the job but it doesn't
<div(.*)</div>
Does anyone know which is the proper regex for this?
Regex
<div[^>]+>(.*?)</div>
Don't forget to check the ". matches newline" option.
Alternatively, you can use this regex also: <div[^>]+>([\s\S]*?)</div> with or without the checkbox checked.
Discussion
Since the * metacharacter is greedy, you need to tell it to take as few characters as possible (by adding ?).
Check that the divs you want to remove DO NOT contain nested divs. If they do, the regex at the start of my answer won't help you.
In that case, I'd suggest using an HTML parser.
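If you would rather script this across a couple hundred files than use an editor's search and replace, the same lazy pattern carries over. Here is a small Python sketch, where re.DOTALL plays the role of the ". matches newline" checkbox. The function name remove_divs is made up, and the nested example shows why the caveat above matters.

```python
import re

# Lazy match from an opening <div ...> to the first </div>; DOTALL lets . span lines.
DIV_BLOCK = re.compile(r"<div[^>]+>.*?</div>", re.DOTALL)


def remove_divs(html):
    return DIV_BLOCK.sub("", html)


if __name__ == "__main__":
    flat = 'A<div id="mydiv">\nblah\n<script>x</script>\n</div>B'
    nested = 'A<div id="o">x<div id="i">y</div>z</div>B'
    print(remove_divs(flat))    # the whole div is removed
    print(remove_divs(nested))  # stops at the inner </div>, leaving a stray tail
```

As in the original regex, [^>]+ requires at least one character after "div", so a bare <div> with no attributes is not matched.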