Using regex in Find/Replace to replace multiple matches - regex

I'm using Sublime Text, and I want to use Find/Replace to convert HTML to Markdown. One problem I ran into: how do I replace multiple matches?
The HTML is below:
<blockquote>
<p> text 1 </p>
<p> text 2 </p>
<p> text 3 </p>
<p> text 4 </p>
</blockquote>
And I want to change it to
><p> text 1 </p>
><p> text 2 </p>
><p> text 3 </p>
><p> text 4 </p>
I use
<blockquote>\n(^.+$\n)+?.+</blockquote>
to capture the p tag within the blockquote. But how to replace it?
Thanks a lot.

I have tested this only on your simple test case. It may not work for more complex input, where you may need to customize the regex further.
Find what:
(?:<blockquote>\s*+|(?<!\A)(?<!</blockquote>)\G)(.*)\s++(?:</blockquote>)?
This solution also removes the closing tag when it matches the last line. It fixes the caveat in my first solution, where the end tag </blockquote> was not removed.
Replace with:
\n> $1
Use regular expression mode and highlight matches to check what will be replaced.
It will strip all leading spaces, and leave only 1 space between > and the text.
The regex above is built based on my own answer to the question of solving this class of problem with regex alone: Collapse and Capture a Repeating Pattern in a Single Regex Expression.
My earlier solution is based on the second construct, while the current solution is based on the first construct. The initial solution is quoted here, in case you want to customize the regex to be more flexible with its end tag (e.g. free spacing):
(?:<blockquote>\s*+|(?!\A)\G\s++(?!</blockquote>))(.*)
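For readers who want to check the expected result outside the editor, here is a minimal Python sketch. Python's re module has no \G anchor, so this does not reproduce the regex above; it swaps in a replacement callback that prefixes each line of the blockquote body instead. The function name quote_block and the sample input are illustrative assumptions.

```python
import re

html = """<blockquote>
<p> text 1 </p>
<p> text 2 </p>
<p> text 3 </p>
<p> text 4 </p>
</blockquote>"""

def quote_block(match: re.Match) -> str:
    # Strip each line of the body and prefix the non-empty ones with "> ".
    lines = [line.strip() for line in match.group(1).splitlines()]
    return "\n".join("> " + line for line in lines if line)

# re.DOTALL lets "." cross line breaks so the whole body is captured.
markdown = re.sub(r"<blockquote>(.*?)</blockquote>", quote_block, html, flags=re.DOTALL)
print(markdown)
```

As with the regex above, this leaves one space between > and the text.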

You can do this in two steps.
1)<blockquote>((?:(?!<\/blockquote>).)*)<\/blockquote> replace by $1.
See demo.
http://regex101.com/r/dZ1vT6/35
2)^\s+ replace by >
See demo.
http://regex101.com/r/dZ1vT6/36
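The two steps can be sketched in Python as follows. This assumes the p lines are indented (otherwise ^\s+ has nothing to match on them), that "." is allowed to match line breaks in step 1, and that the replacement in step 2 is meant to be the Markdown quote marker >. The sample input is made up for illustration.

```python
import re

html = "<blockquote>\n  <p> text 1 </p>\n  <p> text 2 </p>\n</blockquote>"

# Step 1: keep only what sits between the blockquote tags.
body = re.sub(
    r"<blockquote>((?:(?!</blockquote>).)*)</blockquote>",
    r"\1",
    html,
    flags=re.DOTALL,
)

# Step 2: replace leading whitespace at the start of each line with ">".
result = re.sub(r"^\s+", ">", body, flags=re.MULTILINE)
print(result)
```

Note that step 2 runs over the whole text, so indented lines outside a blockquote would be prefixed too.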

Related

How to Match Redundant Lines From Contenteditable Div in Regex

I'm trying to process the html inside a contenteditable div. It might look like:
<div>Hi I'm Jack...</div>
<div><br></div>
<div><br></div>
<div>More text.</div> *<div><br></div>*
*<div><br></div>**<div><br></div>*
*<div><br></div>*
*<div>
<br>
</div>*
What regex expression would match all trailing <div><br></div> but not the ones sandwiched between useful divs containing text, i.e., <div> text (not html) </div>?
I have enclosed all expressions I want to match in asterisks. The asterisks are for reference only and are not part of my string.
Thanks,
Jack
You can use the pattern:
(?:<div>[\n\s]*<br>[\n\s]*<\/div>)(?!.*?<div>[^<]+<\/div>)
You can try it here.
Let me know if this works for all your cases and I will write a detailed explanation of the pattern.
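A quick way to test the pattern is a short Python script. This is only a sketch: it assumes the engine lets "." match line breaks (re.DOTALL here), since the lookahead otherwise cannot see text divs on later lines. The sample string is the one from the question with the asterisks removed.

```python
import re

html = (
    "<div>Hi I'm Jack...</div>\n"
    "<div><br></div>\n"
    "<div><br></div>\n"
    "<div>More text.</div> <div><br></div>\n"
    "<div><br></div><div><br></div>\n"
    "<div><br></div>\n"
    "<div>\n<br>\n</div>"
)

# An empty div that is NOT followed anywhere by a div containing plain text.
pattern = re.compile(
    r"(?:<div>[\n\s]*<br>[\n\s]*<\/div>)(?!.*?<div>[^<]+<\/div>)",
    re.DOTALL,
)

trailing = pattern.findall(html)
cleaned = pattern.sub("", html).rstrip()
print(len(trailing))
print(cleaned)
```

The two empty divs between the text divs are kept because the lookahead finds <div>More text.</div> after them.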

How to search and replace html tag using regex

I want to search for the HTML tags p and /p and replace them with div and /div, but only inside blockquote. An example:
<blockquote>
<p>paragraph 1</p>
</blockquote>
<p>paragraph 1 outside blockquote</p>
<blockquote>
<p>paragraph 2</p>
<p>paragraph 3</p>
</blockquote>
<p>paragraph 2 outside blockquote</p>
the search regex is :
(<blockquote>)(.*?)(p>)(.*?)(</blockquote>)
and the replace regex is :
\1\2div>\4
The problem is that the p tags outside blockquote are changed too after repeating the "replace all" command. The above regex replaces only one instance per pass, so I have to run "replace all" repeatedly until all p tags are replaced. Is there any way to repeat the regex automatically? (I use EditPad Pro v.7.2.3)
Search:
(<blockquote>(?:(?!</?blockquote).)*?)<p>(.*?)</p>((?:(?!</?blockquote).)*</blockquote>)
Replace with:
\1<div>\2</div>\3
DEMO
An alternative would be to replace one tag at a time, which reduces the number of times you have to run "replace all". However, I don't know if this will work in EditPad.
Find:
<p>((?:(?!</?blockquote).)*?)</p>(?=(?:(?!</?blockquote).)*</blockquote>)
Replace with:
<div>\1</div>
DEMO
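The one-tag-at-a-time pattern can be checked with a small Python sketch. It assumes "." matches line breaks (re.DOTALL), which the pattern needs in order to look ahead across lines to the closing </blockquote>. The input is the example from the question.

```python
import re

html = """<blockquote>
<p>paragraph 1</p>
</blockquote>
<p>paragraph 1 outside blockquote</p>
<blockquote>
<p>paragraph 2</p>
<p>paragraph 3</p>
</blockquote>
<p>paragraph 2 outside blockquote</p>"""

# A <p>...</p> pair counts as "inside" when a closing </blockquote> follows
# before any other opening or closing blockquote tag.
pattern = re.compile(
    r"<p>((?:(?!</?blockquote).)*?)</p>(?=(?:(?!</?blockquote).)*</blockquote>)",
    re.DOTALL,
)

result = pattern.sub(r"<div>\1</div>", html)
print(result)
```

In this sketch a single pass converts all three paragraphs inside the blockquotes and leaves the two outside untouched.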
This is a FAQ in many quarters. Regex is good for many things, but parsing balanced delimiters is not one of them.
You need to read up on the Document Object Model (DOM) and XPath. Then load your HTML into a DOM, find its nodes with XPath, operate on them, and write them back.
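A minimal sketch of that approach in Python, using only the standard library. It assumes the markup is well-formed enough to parse as XML once wrapped in a hypothetical root element; real-world HTML usually needs a proper HTML parser instead. ElementTree supports only a limited XPath subset, which is enough here.

```python
import xml.etree.ElementTree as ET

source = """<blockquote>
<p>paragraph 1</p>
</blockquote>
<p>paragraph 1 outside blockquote</p>
<blockquote>
<p>paragraph 2</p>
<p>paragraph 3</p>
</blockquote>
<p>paragraph 2 outside blockquote</p>"""

# Wrap the fragment in a single root element so it parses as one document.
root = ET.fromstring("<root>" + source + "</root>")

# Select the p elements that are direct children of a blockquote and rename them.
for p in root.findall(".//blockquote/p"):
    p.tag = "div"

# Serialize the children back, leaving out the artificial root wrapper.
result = "".join(ET.tostring(child, encoding="unicode") for child in root)
print(result)
```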

Regex Match All Characters Between Tags on nth occurrence

I need to match text between two tags, but starting at a specific occurrence of the tag.
Imagine this text:
Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>
In my example, I would like to match "here. And some".
I successfully matched the text between the first occurrence (between the first and second br tags) with:
<br>(.*?)<br>
But I am looking for the text in the next match (which would be between the second and third br tags). This is probably more obvious than I realize, but regex is not my strong suit.
Just extend your regex:
<br>(.*?)<br>(.*?)<br>
or, for an unlimited number of matches, and trimming the spaces:
<br>\s*(.*?)(?=\s*<br>)
EDIT: Now that I see that you are parsing an HTML document, be aware that regular expressions may not be the best tool for that job, especially if your parsing requirements are complex.
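To pick the nth occurrence with the unlimited-match pattern, collect all matches and index into the list. A short Python sketch, using the sample text from the question:

```python
import re

text = "Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>"

# The lookahead leaves each closing <br> unconsumed, so it can open the next match.
segments = re.findall(r"<br>\s*(.*?)(?=\s*<br>)", text)

second = segments[1]  # text between the second and third <br>
print(segments)
print(second)
```

Because the lookahead does not consume the closing tag, every pair of neighbouring br tags yields one entry in the list.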

Regular expression replace start and end, ignore middle

In an Ant build file, is there a way to use a replaceregexp to find and replace two tags, and retain what's in between them? For example, to find all of these:
</a>1234abcdefg</P>
</a>123456789. </p>
</a> yop </p>
</a></p>
and replace
</a> and </p>
with
<#> and <##>
so that I have, respectively:
<#>1234abcdefg<##>
<#>123456789. <##>
<#> yop <##>
<#><##>
I can't replace the tags individually since they occur in other places. I only want the instances in which </a> is followed by </p> on the same line, with either nothing or something in between them, and I want to keep what's in between them.
Try this:
<replaceregexp file="notTested.xml" match="(&lt;)\/a(>.*?&lt;)\/p(>)" replace="\1#\2##\3" byline="true" flags="g" />
As for the version that replaces what's between the tags with .*: I haven't seen .* used in a replacement/substitution expression. It is probably taken as the literal characters . and *.
As for </a>.*</p>, the greedy >.*< will not work when </a> and </p> occur more than once on the same line, such as:
</a>1234abcdefg</P>abcde</a>123456789. </p> would be replaced as
<#>1234abcdefg</P>abcde</a>123456789. <##>
You need to use the non-greedy quantifier, .*? instead of .*. See the Wiki for the difference between .*? and .*.
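The difference can be seen in a short Python sketch. One assumption is added here: the sample line mixes </P> and </p>, so the match is made case-insensitive (re.IGNORECASE); without that flag both variants give the same result on this line.

```python
import re

line = "</a>1234abcdefg</P>abcde</a>123456789. </p>"

# Greedy: .* runs to the LAST closing p tag on the line.
greedy = re.sub(r"</a>(.*)</p>", r"<#>\1<##>", line, flags=re.IGNORECASE)

# Non-greedy: .*? stops at the FIRST closing p tag after each </a>.
lazy = re.sub(r"</a>(.*?)</p>", r"<#>\1<##>", line, flags=re.IGNORECASE)

print(greedy)
print(lazy)
```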
Solution 1: You can try this
You store the middle part in a capture group (the parentheses), and then use it in the replacement.
var exp = new Regex(@"YourtagStartRegex(bodyRegex)YourtagClosingRegex");
str = exp.Replace(str, "$1");
Reference:Replace the start and end of a string ignoring the middle with regex, how?
Or
Solution 2:
Regex ignore middle part of capture

Regexp: remove all tags from string except one kind of tags

I have a string like this:
<p>test <span class=\"match\">match</span> <span class=\"testtes\">dddddd</span></p>
I want to get the string without tags, but I want to keep the highlighting given by the class "match":
test <span class=\"match\">match</span> dddddd
If I want to just remove all tags I substitute all substrings that satisfied regexp /<\/?[^>]*>/ by empty string. But what regexp should I use in my special case?
UPD: The algorithm is: if you see <span class=\"match\">, then some text without tags, and then </span>, you shouldn't remove these spans; otherwise you should remove all tags.
You could do something like this
<\/?(?![^>]*class=\\"match)[^>]*>
This would preserve the opening tag and result in this
test <span class=\"match\">match dddddd
See it here on Regexr
But how should I find the matching closing tag?
<p>test <span class=\"match\">match</span> <span class=\"testtes\">dddddd</span></p>
                                   ^^^^^^^       or the next one?        ^^^^^^^
Regex can't know which closing tag belongs to the opening <span> tag that contains that class, and there is no way to find matching closing tags with it. So it's not a good idea to do this using regex.
I am quite sure the language you are using has an html parser that can be used to do this task.
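As an illustration of the parser route, here is a minimal Python sketch built on the standard library's html.parser. The class and function names are hypothetical, and it assumes the real string contains plain quotes (class="match") rather than the escaped \" shown in the question. A stack of flags records, for each open span, whether it was kept, which is what lets the closing tags be paired correctly.

```python
from html.parser import HTMLParser

class KeepMatchSpans(HTMLParser):
    """Drops every tag except <span class="match"> and its closing tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.span_stack = []  # one flag per open span: True if it was kept

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return  # any other opening tag is dropped
        keep = dict(attrs).get("class") == "match"
        self.span_stack.append(keep)
        if keep:
            self.out.append('<span class="match">')

    def handle_endtag(self, tag):
        # Emit </span> only when it closes a span that was kept.
        if tag == "span" and self.span_stack and self.span_stack.pop():
            self.out.append("</span>")

    def handle_data(self, data):
        self.out.append(data)

def strip_tags_except_match(html: str) -> str:
    parser = KeepMatchSpans()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

sample = '<p>test <span class="match">match</span> <span class="testtes">dddddd</span></p>'
print(strip_tags_except_match(sample))
```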