Is it possible to replace all matches between two substrings in one regular expression?
My case is that I want to inject HTML after the li tag but only inside myList.
Before:
<ul class="myList">
<li>List item</li>
<li>List item</li>
</ul>
After (... would be the injected markup):
<ul class="myList">
<li>...List item</li>
<li>...List item</li>
</ul>
Any help would be great, thanks.
Other than for trivial cases where it's easy to write a "quick, hacky" script to get the job done, you should never use regular expressions to parse HTML.
A "quick, hacky" script in this case could be e.g. to search for:
<ul class="myList">.*<li>([^<]*)</li>(?=.*</ul>)
(Note: the . probably needs to match newlines here, which is usually called "single-line" or "dot-all" mode; if there's no option for this then replace .* with [\s\S]*.)
...And then replace the value of the first match group (probably represented by $1 or \1, depending on how you do this).
However, I'd like to emphasise that this is not a perfect answer. It is literally impossible to perfectly parse HTML with a regular expression.
To do this "properly", you should use an HTML or XML parser instead.
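For comparison, here is a minimal sketch of the parser route in Python. It assumes the fragment is well-formed enough for the standard library's xml.etree.ElementTree to read (real-world HTML often is not, and would need a proper HTML parser). The function name inject_into_list and the injected <span class="icon"> element are placeholders, not anything from the question.

```python
import xml.etree.ElementTree as ET


def inject_into_list(fragment, list_class="myList"):
    """Prepend a placeholder <span> to every <li> directly inside matching <ul>s."""
    root = ET.fromstring(fragment)
    # iter() also yields the root itself, so a top-level <ul> is covered.
    for ul in list(root.iter("ul")):
        if ul.get("class") != list_class:  # exact match; assumes a single class
            continue
        for li in ul.findall("li"):
            marker = ET.Element("span", {"class": "icon"})  # hypothetical markup
            marker.tail = li.text  # the existing text now follows the new element
            li.text = None
            li.insert(0, marker)
    # method="html" writes <span ...></span> rather than a self-closing tag.
    return ET.tostring(root, encoding="unicode", method="html")


before = """<div>
<ul class="myList">
<li>List item</li>
<li>List item</li>
</ul>
<ul class="other">
<li>Untouched</li>
</ul>
</div>"""

after = inject_into_list(before)
print(after)
```

Only the items inside the list with class myList are changed; the second list is left alone because the lookup is scoped to the element, not to a position in the text.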
Related
I just don't get my Regex right:
I have the following template:
<!-- Defines the template for the tabs. -->
{{TMPL:Import=../../../../Data/Templates/Ribbon/tabs.tmpl; Name=Tabs}}
<div class="tabs">
<ul role="tablist">
{{BOS:Sequence}}
<li role="tab" class="{{TabType}}" id="{{tabId}}">
<span>{{TabFile}}</span>
</li>
{{EOS:Sequence}}
</ul>
</div>
{{Render:Tabs}}
I would like to find everything between {{ and }} except the tags that begin with {{BOS, {{EOS, {{TMPL or {{Render.
Here are a couple of approaches I tried:
Attempt 1:
({{).*(}})
This selects everything between {{ }} tags, which is not good.
Attempt 2:
({{)[^TMPL][^BOS][^EOS][^Render].*(}})
With this, {{TabType}} and {{TabFile}} are no longer selected, and I just don't know why.
With some other regexes, {{TabType}}" id="{{tabId}} gets selected as a single match.
Does anyone have a clue how to solve this? I really need a regex guru :-)
You can use negative lookahead based regex like this:
{{(?!TMPL|[BE]OS|Render).*?}}
You have to use the following regex to get the content between braces:
\{\{(.*?)\}\}
If you want to exclude the tags you listed, you can use a technique that matches what you don't want in the leading alternatives and captures what you do want in the final one:
\{\{BOS:Sequence\}\}|\{\{EOS:Sequence\}\}|\{\{TMPL:Import.*?\}\}|\{\{Render:Tabs\}\}|\{\{(.*?)\}\}
By the way, if you want a shorter version of the above regex you can use:
\{\{(?:BOS|EOS):Sequence\}\}|\{\{TMPL:Import.*?\}\}|\{\{Render:Tabs\}\}|\{\{(.*?)\}\}
This is a very useful technique for pattern exclusion that I was glad to learn from Anubhava and zx81 (they rock at regex patterns). With this technique, the content you need ends up in the capturing group; the unwanted tags are matched by the earlier alternatives and leave that group empty.
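A short Python sketch of that exclusion technique, run against a trimmed copy of the template from the question. The names TEMPLATE, PATTERN and placeholders are just for illustration.

```python
import re

TEMPLATE = """{{TMPL:Import=../../../../Data/Templates/Ribbon/tabs.tmpl; Name=Tabs}}
<ul role="tablist">
{{BOS:Sequence}}
<li role="tab" class="{{TabType}}" id="{{tabId}}">
<span>{{TabFile}}</span>
</li>
{{EOS:Sequence}}
</ul>
{{Render:Tabs}}"""

PATTERN = re.compile(
    r"\{\{(?:BOS|EOS):Sequence\}\}"  # unwanted: matched, nothing captured
    r"|\{\{TMPL:Import.*?\}\}"       # unwanted
    r"|\{\{Render:Tabs\}\}"          # unwanted
    r"|\{\{(.*?)\}\}"                # wanted: content lands in group 1
)

# Group 1 is None whenever one of the "unwanted" alternatives matched.
placeholders = [
    m.group(1) for m in PATTERN.finditer(TEMPLATE) if m.group(1) is not None
]
print(placeholders)  # ['TabType', 'tabId', 'TabFile']
```

The unwanted tags still have to be matched so that the engine consumes them; they are simply filtered out afterwards by checking the capture group.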
Using [^TMPL] and the like won't work because these are character classes: each one matches a single character that is not in the set, rather than excluding a whole word. You could use a negative lookahead, though (or even a lookbehind, depending upon the regex library you are using).
\{\{(?!BOS:)(?!EOS:)(?!Render:)(?!TMPL:)(.*?)\}\}
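To see the difference, here is a small Python check of both patterns. The braces are escaped for safety, and the variable names are made up for the example.

```python
import re

# Each [^...] consumes exactly one character that is not in the set, so the
# four classes only constrain the first four characters after "{{".
char_class = re.compile(r"\{\{[^TMPL][^BOS][^EOS][^Render].*\}\}")

# Each (?!...) looks ahead without consuming anything and rejects a whole prefix.
lookahead = re.compile(r"\{\{(?!BOS:)(?!EOS:)(?!Render:)(?!TMPL:)(.*?)\}\}")

line = '<li role="tab" class="{{TabType}}" id="{{tabId}}">'

# "{{TabType}}" is rejected only because its first letter is "T".
print(char_class.search("{{TabType}}"))        # None
print(char_class.search("{{tabId}}").group())  # {{tabId}}

# The lazy .*? keeps the two tags on one line as separate matches.
print(lookahead.findall(line))                 # ['TabType', 'tabId']
print(lookahead.findall("{{BOS:Sequence}} {{TabFile}} {{Render:Tabs}}"))  # ['TabFile']
```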
Still I get the feeling that you want the BOS, EOS, etc. to just be strings in the template with {{ and other values to be interpolated. If you are using handlebars or something, you can have strings interpolated:
{{'{{BOS:Sequence}}'}}
I need a regular expression to get the string between the <ul> and </ul> tags.
The problem is that if there is a nested <ul></ul> inside the outer <ul>, the regex stops at the inner closing tag, but I need the entire string between the two outer tags.
Can anyone help me?
Try this regex
String text = "<ul>My list</ul>";
String text1 = text.replaceAll("</?ul>", "");
The ? makes the preceding / optional (one time or not at all), so the pattern matches both <ul> and </ul> and removes them.
This is Java, by the way. The regex should work in other languages too.
This will pick everything from the first <ul> to the last </ul>:
<ul>(.*)</ul>
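A quick Python illustration of what the greedy pattern does, next to its lazy counterpart. The sample strings are invented, and re.DOTALL is added on the assumption that real markup spans several lines.

```python
import re

nested = "<ul><li>a</li><ul><li>inner</li></ul><li>b</li></ul>"
siblings = "<ul>one</ul><p>between</p><ul>two</ul>"

greedy = re.compile(r"<ul>(.*)</ul>", re.DOTALL)
lazy = re.compile(r"<ul>(.*?)</ul>", re.DOTALL)

# Lazy stops at the first </ul>, which belongs to the inner list.
print(lazy.search(nested).group(1))
# <li>a</li><ul><li>inner</li>

# Greedy runs to the last </ul>, so the outer list is captured whole.
print(greedy.search(nested).group(1))
# <li>a</li><ul><li>inner</li></ul><li>b</li>

# The catch: with two separate lists, greedy swallows everything in between.
print(greedy.search(siblings).group(1))
# one</ul><p>between</p><ul>two
```

So the greedy version only gives the right answer when the document contains a single outer list.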
I have some HTML which I want to grab between two tags. However, nested tags exist in the HTML, so looking for the first closing tag wouldn't work, as it would stop at the first nested div.
Basically I want my regex to:
Match some text literally, followed by ANY characters up to another literal text string. So my question is: how do I get [^<]* to continue matching until it sees the next div?
Such as:
<div id="test"[^<]*<div id="test2"
Example html
<div id="test" class="whatever">
<div class="wrapper">
<fieldset>Test</fieldset><div class="testclass">some info</div>
</div>
<!-- end test div--></div>
</div>
<div id="test2" class="endFind">
In general, I suspect you want to look at "greedy" vs "lazy" in your regex, assuming that's supported by your platform/language.
For example, <div[^>]*>(.*?)</div> would make $1 match all the text inside a div, but would try to keep it as small as possible. Some people call *? a "lazy star".
But it seems you're looking to find the text within a div that is before the start of the first nested div. That would be something like <div[^>]*>(.*?)<div
Read up on greedy vs lazy quantifiers and check to make sure that whatever language you're using supports them.
$ php -r '$text="<div>Test<div>foo</div></div>\n"; print preg_replace("/<div[^>]*>(.*?)<div.*/", "\$1", $text);'
Test
$
Regex is not capable of parsing HTML. If this is part of an application, you're doing something wrong. If you absolutely have to parse a document, use an HTML/XML parser.
If you're trying to screen scrape something and don't want to bother with a parser, look for identifying marks in the page you're scraping. For example, maybe the embedded div ends just before the one you want to match, so you could match </div></div> instead.
Alternatively, here's a regex that meets your requirements. However, it is very fragile: it will break if, for example, #test's children have children, or the html isn't valid, or I missed something, etc, etc ...
/<div id="test"[^<]*(<([^ >]+).+<\/\2>[^<]*)*<\/div>/
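As a rough sketch of the parser alternative, the following Python uses the standard library's html.parser to collect everything inside the element with a given id by counting nesting depth. The class and function names (InnerHTML, inner_html) are made up for this example, and it assumes reasonably balanced markup; character references in text come back decoded.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class InnerHTML(HTMLParser):
    """Collect the markup inside the first element whose id matches."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 = outside the target, 1+ = nesting level inside
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True  # this was the target's own closing tag
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_comment(self, data):
        if self.depth:
            self.parts.append(f"<!--{data}-->")


def inner_html(html, target_id):
    parser = InnerHTML(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = ('<div id="test" class="whatever"><div class="wrapper">'
          '<p>some info</p></div><!-- end test div--></div>'
          '<div id="test2" class="endFind">later</div>')
print(inner_html(sample, "test"))
```

Because the depth counter follows the actual nesting, it does not matter how many levels of children the target element has.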
I have a string like this:
<p>test <span class=\"match\">match</span> <span class=\"testtes\">dddddd</span></p>
I want to get the string without tags, but I want to keep the highlighting given by the class "match":
test <span class=\"match\">match</span> dddddd
If I just wanted to remove all tags, I would replace every substring matching the regexp /<\/?[^>]*>/ with an empty string. But what regexp should I use in my special case?
UPD: The algorithm is: if you see <span class=\"match\">, then some text without tags, and then </span>, you shouldn't remove that span; otherwise you should remove all tags.
I could do something like this:
<\/?(?![^>]*class=\\"match)[^>]*>
This would preserve the opening tag and result in this
test <span class=\"match\">match dddddd
But how should I find the matching closing tag?
<p>test <span class=\"match\">match</span> <span class=\"testtes\">dddddd</span></p>
^^^^^^^ or the next one? ^^^^^^^
Regex can't know which closing tag belongs to the opening <span> tag that contains that class, and it has no way to pair up matching closing tags. So it's not a good idea to do this using regex.
I am quite sure the language you are using has an html parser that can be used to do this task.
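As one possible illustration, here is a sketch in Python using the standard library's html.parser. It drops every tag except span elements whose class list contains "match". The class and function names are invented for the example, and it assumes the markup is properly nested, since opening and closing tags are paired by position.

```python
from html import escape
from html.parser import HTMLParser

# Elements without a closing tag; they are dropped and never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class KeepMatchSpans(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.kept = []  # one flag per open element: True if its tags are kept

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        keep = tag == "span" and "match" in classes
        self.kept.append(keep)
        if keep:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # The flag on top of the stack belongs to the element being closed.
        if self.kept and self.kept.pop():
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives decoded, so re-escape it before writing it back out.
        self.out.append(escape(data, quote=False))


def strip_tags_except_match(html):
    parser = KeepMatchSpans()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


source = '<p>test <span class="match">match</span> <span class="testtes">dddddd</span></p>'
print(strip_tags_except_match(source))
# test <span class="match">match</span> dddddd
```

The stack is what answers the "which closing tag?" question: each </span> is paired with the most recently opened element.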
I'm a total regexp noob. I'm working with WordPress and I'm desperately trying to deal with WordPress's wpautop, which I hate and love (more hate!). Anyway, I'm trying to remove <p> tags around certain commands.
Here's what I get:
<p>
[hide]
<img.../>
[/hide]
</p>
or
<p>
[imagelist]
<img .../>
<img .../>
[/imagelist]
</p>
Here's what I'd like:
[hide]
<img.../>
[/hide]
or
[imagelist]
<img .../>
<img .../>
[/imagelist]
I've tried:
preg_replace('/<p[^>]*>(\[[^>]*\])<\/p[^>]*>/', '$1', $content); // No luck!
EDIT:
When I run the regexp, the content is still just a variable containing text. It is not parsed as HTML yet. I know it is possible because I already did it when getting rid of p tags around an image tag. So I just need a regexp to handle text that will be parsed as HTML at some point in the future.
Here's a similar question
Thanks!
Matt Mueller
You can't use regular expressions to parse HTML, because HTML is, by definition, a non-regular language. Period, end of discussion.
The language of matching HTML tags is context-free, not regular. This means regular expressions are probably not the right tool to use here. Context-free languages require parsers rather than regular expressions. So, you can either remove ALL <p> and </p> tags with a regular expression, or you can use an HTML parser to remove matching tags from certain parts of your document.
Try this regex:
'%<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>%s'
Explanation:
\[([^\[\]]+)\] matches the opening bbcode tag and captures the tag name in group #2.
\[/\2\] matches a corresponding closing tag.
.*? matches anything, reluctantly. Thanks to the s flag at the end, it also matches newlines. The effect of the reluctant .*? is that it stops matching the first time it finds a closing bbcode tag with the right name. If tags are nested (within tags with the same name) or improperly balanced, it won't work correctly. I wouldn't expect that to be a problem, but I have no experience with WordPress, so YMMV.
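For anyone who wants to try the pattern outside PHP, here is the same regex as a Python sketch, with re.DOTALL standing in for the s flag. The sample content, including the image file name, is made up.

```python
import re

SHORTCODE_IN_P = re.compile(
    r"<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>",
    re.DOTALL,  # like PHP's "s" flag: lets . match newlines
)

# Hypothetical content: one wrapped shortcode block and one normal paragraph.
content = (
    "<p>\n[hide]\n<img src='a.png'/>\n[/hide]\n</p>\n"
    "<p>An ordinary paragraph.</p>"
)

# Replace the whole <p>...</p> with group 1, i.e. just the shortcode block.
cleaned = SHORTCODE_IN_P.sub(r"\1", content)
print(cleaned)
```

The ordinary paragraph is left untouched because its first non-whitespace character is not an opening bracket.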