Regular expression to select from X to Y - regex

I have the following HTML:
<Some Html above....../>
<!--Template Start -->
<div>
<p>Some text</p>
...
<div>
<!--Template End -->
<Some Html below/>
Now, how can I write a regular expression to match all the text from Template Start to Template End?
Here it says that Notepad++ uses the Scintilla engine:
Notepad++ non-greedy regular expressions

<!--Template Start -->(.*?)<!--Template End -->
The s modifier (dot matches newline) should be switched on.

Assuming that there are no nested templates:
<!--Template Start -->(.*?)<!--Template End -->
Remember to switch on DOT_ALL mode so that the dot also matches newlines.
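For anyone who wants to check the pattern outside the editor, here is a minimal sketch in Python, where DOT_ALL is spelled re.DOTALL. The sample HTML and the variable names are made up for illustration; only the pattern itself comes from the answers above.

```python
import re

# Sample input, loosely modelled on the question (illustrative only).
html = """<header/>
<!--Template Start -->
<div>
<p>Some text</p>
</div>
<!--Template End -->
<footer/>"""

# re.DOTALL lets "." match newlines; the lazy ".*?" stops at the first end marker.
pattern = re.compile(r"<!--Template Start -->(.*?)<!--Template End -->", re.DOTALL)

match = pattern.search(html)
template_body = match.group(1) if match else None
print(template_body)
```

Without re.DOTALL the same pattern finds nothing here, because the two markers are on different lines.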

It's a shame, but Notepad++ doesn't support matching newlines (\r\n) natively in regex mode. It supports matching newlines only in extended mode.
However it DOES support INSERTING newlines in both modes.
To achieve the desired result, you can use a workaround:
1) Delete all newlines in extended mode (replace \r\n with nothing) so that you have a one-liner.
2) Do the regex manipulations in regex mode.
3) Add the newlines back in extended mode (e.g. replace <div> with <div>\r\n and so on) or in regex mode.
I've read somewhere that the PythonScript plugin for N++ adds better regex support, but I haven't checked it.

Related

Using regex in Find/Replace to replace multiple matches

I'm using Sublime Text, and I want to use Find/Replace to convert HTML to Markdown. One problem I encountered is how to replace multiple matches.
The HTML is below:
<blockquote>
<p> text 1 </p>
<p> text 2 </p>
<p> text 3 </p>
<p> text 4 </p>
</blockquote>
And I want to change it to
><p> text 1 </p>
><p> text 2 </p>
><p> text 3 </p>
><p> text 4 </p>
I use
<blockquote>\n(^.+$\n)+?.+</blockquote>
to capture the p tag within the blockquote. But how to replace it?
Thanks a lot.
I have tested this for your simple test case. The main problem is, it may or may not work for more complex input, where you may need to further customize the regex.
Find what:
(?:<blockquote>\s*+|(?<!\A)(?<!</blockquote>)\G)(.*)\s++(?:</blockquote>)?
This solution removes the closing tag as it matches the last line. It fixes the caveat in the first solution, where the end tag </blockquote> was not removed.
Replace with:
\n> $1
Use regular expression mode and highlight matches to check what will be replaced.
It will strip all leading spaces, and leave only 1 space between > and the text.
The regex above is built based on my own answer to the question of solving this class of problem with regex alone: Collapse and Capture a Repeating Pattern in a Single Regex Expression.
My earlier solution is based on the second construct, while the current solution is based on the first construct. The initial solution is quoted here, in case you want to customize the regex to be more flexible with its end tag (e.g. free spacing):
(?:<blockquote>\s*+|(?!\A)\G\s++(?!</blockquote>))(.*)
You can do this in two steps.
1) Replace <blockquote>((?:(?!<\/blockquote>).)*)<\/blockquote> with $1.
See demo.
http://regex101.com/r/dZ1vT6/35
2) Replace ^\s+ with >
See demo.
http://regex101.com/r/dZ1vT6/36
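If doing this outside the editor is an option, the same two-step idea (grab the blockquote body, then prefix each line) can be sketched in Python. This is only an illustration under my own assumptions: the helper name quote_block and the shortened sample input are not from the question, and it is not what Sublime Text does internally.

```python
import re

# Shortened version of the sample from the question.
html = """<blockquote>
<p> text 1 </p>
<p> text 2 </p>
</blockquote>"""

def quote_block(match):
    # Hypothetical helper: keep the non-empty inner lines, strip surrounding
    # whitespace, and prefix each one with ">".
    lines = [line.strip() for line in match.group(1).splitlines() if line.strip()]
    return "\n".join(">" + line for line in lines)

# Step 1 captures the blockquote body (re.DOTALL so "." spans newlines);
# step 2 happens inside the replacement function.
markdown = re.sub(r"<blockquote>(.*?)</blockquote>", quote_block, html, flags=re.DOTALL)
print(markdown)
```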

Regular expression replace start and end, ignore middle

In an Ant build file, is there a way to use a replaceregexp to find and replace two tags, and retain what's in between them? For example, to find all of these:
</a>1234abcdefg</P>
</a>123456789. </p>
</a> yop </p>
</a></p>
and replace
</a> and </p>
with
<#> and <##>
so that I have, respectively:
<#>1234abcdefg<##>
<#>123456789. <##>
<#> yop <##>
<#><##>
I can't replace the tags individually since they occur in other places, I just want the instances in which </a> is followed by </p>, in the same line, with either nothing or something in between them, and I want to keep what's in between them.
Try this:
<replaceregexp file="notTested.xml" match="(&lt;)\/a(&gt;.*?&lt;)\/p(&gt;)" replace="\1#\2##\3" byline="true" flags="g" />
As for your attempt that replaces what's between the tags with .*: I haven't seen .* used in a replacement/substitution expression. It is probably taken as the literal characters . and *.
As for </a>.*</p>, the greedy .* will not work when you have multiple occurrences of </a> and </p> on the same line, such as:
</a>1234abcdefg</P>abcde</a>123456789. </p>, which would be replaced with
<#>1234abcdefg</P>abcde</a>123456789. <##>
You need to use the non-greedy quantifier ?. See the wiki for the use of .*? vs .*.
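To see the difference between .* and .*? on that line, here is a small Python sketch. It is not Ant; the variable names are mine, and re.IGNORECASE is my own addition so that the upper-case </P> in the sample also counts as a closing tag.

```python
import re

line = "</a>1234abcdefg</P>abcde</a>123456789. </p>"

# Greedy: ".*" runs to the LAST closing tag on the line, swallowing the middle pair.
greedy = re.sub(r"</a>(.*)</p>", r"<#>\1<##>", line, flags=re.IGNORECASE)

# Non-greedy: ".*?" stops at the FIRST closing tag, so each pair is handled separately.
lazy = re.sub(r"</a>(.*?)</p>", r"<#>\1<##>", line, flags=re.IGNORECASE)

print(greedy)
print(lazy)
```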
Solution 1: You can try this
You capture the part you want to keep with parentheses, and then reuse it in the replacement.
exp = new Regex(@"YourtagStartRegex(bodyRegex)YourtagClosingRegex");
str = exp.Replace(str, "$1");
Reference: Replace the start and end of a string ignoring the middle with regex, how?
Or
Solution 2:
Regex ignore middle part of capture

How can I search text using regex when it contains \r\n

I am using Sublime Text 2's regex search and replace tool and would like to search text that includes the \r and \n special characters but cannot see how just at the moment.
For example, I have the text:
<div class="head">\r\n
\r\n Keep this text\r\n</div>
Which I would like to transform into:
<h1>Keep this text</h1>
I would also like to factor in the eventuality that these \r\n characters may not be present.
How might I search accounting for \r\n being present and absent, and then remove them as per above? If two regex are required that's fine too.
So far I have <div class="head">(\w)+</div>, however this is stalled by the aforementioned \r\n.
I think you're looking for \s, which matches white space.
So your regex should be something like the following:
<div class="head">\s*(.+?)\s*</div>
If you can do this in ST2, then I think it would fit your need:
Find:
<div class="head">[\s\r\n]*([\w ]+)[\s\r\n]*<\/div>
Replace by:
<h1>$1</h1>
Demo
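As a quick check that \s* copes with the line breaks both when they are present and when they are absent, here is a sketch in Python rather than ST2 (the names samples, pattern and results are just for illustration; Python uses \1 where ST2 uses $1):

```python
import re

samples = [
    '<div class="head">\r\n\r\n Keep this text\r\n</div>',  # with \r\n
    '<div class="head">Keep this text</div>',               # without
]

# \s* absorbs any mix of spaces, \r and \n on either side of the captured text.
pattern = re.compile(r'<div class="head">\s*(.+?)\s*</div>')

results = [pattern.sub(r"<h1>\1</h1>", s) for s in samples]
print(results)
```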

how to match any string in Emacs regexp?

I'm referring to this page: http://ergoemacs.org/emacs/emacs_regex.html
which says that to capture a pattern in Emacs Regexp, you need to escape the paren like this: \(myPattern\).
It further says that the syntax for capturing a sequence of ASCII characters is [[:ascii:]]+
In my document, I'm trying to match all strings that occur between <p class="calibre3"> and </p>
So, following the syntax above, I do a replace-regexp for
<p class="calibre3">\([[:ascii:]]+\)</p>
but it finds no matches.
Suggestions?
Regexps are not good for general-purpose HTML parsing, but as paragraph tags cannot be validly nested, the following is going to be fine (provided the mark-up is valid & well-formed).
<p class="calibre3">\(.*?\)</p>
*? is the non-greedy zero-or-more repetitions operator, so it will match as little as possible -- in this case everything until the next </p> (as opposed to the greedy version, which would match everything until the final </p> in the text).
The [^<] approach is fine if it fits the data in question, but it won't work if there are other tags within the paragraphs.
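The three behaviours discussed here can be compared side by side. The sketch below uses Python's re module, so the groups are written ( ) rather than the Emacs \( \), and the sample text is invented for illustration:

```python
import re

text = '<p class="calibre3">first</p><p class="calibre3">second <b>bold</b></p>'

# Non-greedy: each paragraph is matched separately.
lazy = re.findall(r'<p class="calibre3">(.*?)</p>', text)

# Greedy: one match that runs to the final </p>.
greedy = re.findall(r'<p class="calibre3">(.*)</p>', text)

# [^<]+ : only paragraphs that contain no nested tags are matched.
no_tags = re.findall(r'<p class="calibre3">([^<]+)</p>', text)

print(lazy, greedy, no_tags, sep="\n")
```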
You don't need to escape the angle brackets (in Emacs regexps, \< and \> match word boundaries), and I would use [^<] instead of [[:ascii:]], like so:
<p class="calibre3">\([^<]+\)</p>
Source: @TooTone

Regular expression to remove <p> tags around elements wrapped in [...]'s

I'm a total regexp noob. I'm working with WordPress and I'm desperately trying to deal with WordPress's wpautop, which I hate and love (more hate!). Anyway, I'm trying to remove <p> tags around certain commands.
Here's what I get:
<p>
[hide]
<img.../>
[/hide]
</p>
or
<p>
[imagelist]
<img .../>
<img .../>
[/imagelist]
</p>
Here's what I'd like:
[hide]
<img.../>
[/hide]
or
[imagelist]
<img .../>
<img .../>
[/imagelist]
I've tried:
preg_replace('/<p[^>]*>(\[[^>]*\])<\/p[^>]*>/', '$1', $content); // No luck!
EDIT:
When I am doing the regexp it is still just a variable containing text.. It is not parsed as html yet. I know it is possible because I already did it with getting rid of p tags around an image tag. So I just need a regexp to handle text that will be parsed as html at some point in the future.
Here's a similar question
Thanks!
Matt Mueller
You can't use regular expressions to parse HTML, because HTML is, by definition, a non-regular language. Period, end of discussion.
The language of matching HTML tags is context-free, not regular. This means regular expressions are probably not the right tool to use here. Context-free languages require parsers rather than regular expressions. So, you can either remove ALL <p> and </p> tags with a regular expression, or you can use an HTML parser to remove matching tags from certain parts of your document.
Try this regex:
'%<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>%s'
Explanation:
\[([^\[\]]+)\] matches the opening bbcode tag and captures the tag name in group #2.
\[/\2\] matches the corresponding closing tag.
.*? matches anything, reluctantly. Thanks to the s flag at the end, it also matches newlines. The effect of the reluctant .*? is that it stops matching the first time it finds a closing bbcode tag with the right name. If tags are nested (within tags with the same name) or improperly balanced, it won't work correctly. I wouldn't expect that to be a problem, but I have no experience with WordPress, so YMMV.
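If you want to try the pattern without a PHP setup, here is a rough Python equivalent. It is a sketch only: re.DOTALL stands in for the s flag, \1 for $1, and the sample content (including the src="a.png" attribute) is made up.

```python
import re

content = """<p>
[hide]
<img src="a.png"/>
[/hide]
</p>
<p>Ordinary paragraph</p>"""

# Group 1 is the whole [tag]...[/tag] block; group 2 is the tag name,
# reused by the \2 backreference so the closing tag must match the opening one.
pattern = re.compile(r"<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>", re.DOTALL)

cleaned = pattern.sub(r"\1", content)
print(cleaned)
```

The ordinary paragraph is left alone because its content does not start with a [tag].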