Match whitespace created by Chrome devtools source view? - regex

<span style='mso-tab-count:1'>         </span>
<span style='mso-tab-count:1'> </span>
The bottom line above is from a "View Source Code" page and the top line is from the Chrome Developer Tools Source view. The RegEx below matches the bottom tags, which contain a series of spaces, but not the top tags, which enclose just empty whitespace. See this on the Regex Tester at https://regex101.com/r/P9dUP9/2
(<span style='mso-tab-count:1'>)\s{2,}(<\/span>)
How could I make the Regex also match the top line, and how can I tell the difference between the two kinds of whitespace on the screen without copying and pasting both of them into a text editor?

I guess it's a non-printable control character. My hex editor tells me it's \x20, but that's not captured for me. Your best bet would be using an exclusion such as:
(<span style='mso-tab-count:1'>)[^<]{2,}(<\/span>)
or
(<span style='mso-tab-count:1'>)\W{2,}(<\/span>)

Related

Possible Bug using Regex in Notepad++ with Replace All?

Have I found a bug in Notepad++ or am I doing something wrong?
Background info
(Please note that I do know that one are supposed not to use Regex parsing HTML, but I think this is a special case that should work - without the possible Notepad++ bug ;-)
I have exported Apple Notes as HTML using Exporter 3.0 on a Mac. In the HTML output every Note line is between <div> - </div> elements and also "header/title lines" like <h1> - </h1> or <h2> - </h2> etc. Each "header/title line" is often split in several unnecessary HTML header elements as in the following simplified example.
<div><h1>TEST </h1><h1>Title<br></h1></div>
<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2><h2>T3</h2><u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>
This HTML can't be imported into OneNote giving the same result as seen in Apple Notes i.e. each "header/title" line is split in multiple lines. That's true even when changing the <h1>/<h2> block elements to inline elements using an initial <style>h1, h2 {display: inline;}</style> statement. (Maybe that is a bug or restriction in OneNote, but I need to find a workaround.)
Therefore, I need to clean the example HTML output above from the unnecessary HTML header <h1> or <h2> (all but the first in every line) and </h1> or </h2> (all but the last in every line), to get the following result that can be imported to OneNote without problem.
<div><h1>TEST Title<br></h1></div>
<div><b><h2>T1<u>T2</u> </b>(T3<u>T4</u>)<b><br></h2></b></div>
Solution ? - Developed Regex
I'm quite new to Regex, especially advanced Regex, but I think I have found a way to clean the erroneous HTML code using TWO different Regex expressions as follows.
Both works well when tested using regex101.com, I think.
The first one is used to remove unnecessary </h1> or </h2> elements and is a Positive Lookahead function (it works both in regex101 and in Notepad++)
(</h[1-6]>)(?=.*?\1)
(Demo)
Picture 1 shows a working Find All + Mark All in Notepad++
Picture 2 shows a working Replace All
The Second one used to remove unnecessary <h1> or <h2> elements and is a Positive Lookbehind function (it works in regex101 but NOT fully in Notepad++)
(?<=(<(h[1-6])>))(?:.*?)\K\1
(Demo)
Picture 3 shows a working Find All + Mark All in Notepad++ = All 8 occurrences found
Picture 4 shows a NOT working Replace All in Notepad++ = Only 5 occurrences (of the 8 found) are replaced
If I redo the same Replace All a second time 2 of the
remaining 3 occurrences are replaced.
If I redo the same Replace All a third time the last
remaining occurrence is replaced.
BUG ?
Is this a bug in Notepad++ or is this behavior normal or am I doing something strange here? Please help me understand.
So, rather than make multiple passes through your data, you can get it all in one pass with this:
(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)
and replace it with \1\2. The first capture group skips the first <h#> on each line and is null after line start. The second capture group captures everything up to the next <h#> tag. The optional slash (/?) scans and deletes both open and close tags. The last part is a positive lookahead to make sure the last </h#> is not deleted.
In the two lines of your examples all the header levels were the same on the line and this regex is fine. If the first open and last close don't match, then you have a problem but I think your solutions also have that same problem. In any case you can fix that in a second pass with ^(.*<h)([1-6])(.*<h)[1-6] and replace it with \1\2\3\2.
I would also point out that this creates unbalanced HTML with a <b>, followed by <h1>, followed by </b>, followed by </h1>. I don't know if that is OK for your case. If not, it might be better to remove ALL the <h#> tags and anchor new ones just inside the <div> </div> pair.
In any event here is a REGEX101 screenprint with this regex working on your examples:

How to Match Redundant Lines From Contenteditable Div in Regex

I'm trying to process the html inside a contenteditable div. It might look like:
<div>Hi I'm Jack...</div>
<div><br></div>
<div><br></div>
<div>More text.</div> *<div><br></div>*
*<div><br></div>**<div><br></div>*
*<div><br></div>*
*<div>
<br>
</div>*
What regex expression would match all trailing <div><br></div> but not the ones sandwiched between useful divs containing text, i.e., <div> text (not html) </div>?
I have enclosed all expressions I want to match in asterisks. The asterisk are for reference only and are not part of my string.
Thanks,
Jack
You can use the pattern:
(?:<div>[\n\s]*<br>[\n\s]*<\/div>)(?!.*?<div>[^<]+<\/div>)
You can try it here.
Let me know if this works for all your cases and I will write a detailed explanation of the pattern.

Regex Match All Characters Between Tags on nth occurrence

I need to match text between two tags, but starting at a specific occurrence of the tag.
Imagine this text:
Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>
In my example, I would like to match here. And some.
I successfully matched the text between the first occurrence (between the first and second br tags) with:
<br>(.*?)<br>
But I am looking for the text in the next match (which would be between the second and third br tags). This is probably more obvious than I realize, but Regex is not my strong suite.
Just extend your regex:
<br>(.*?)<br>(.*?)<br>
or, for an unlimited number of matches, and trimming the spaces:
<br>\s*(.*?)(?=\s*<br>)
EDIT: Now that I see that you are parsing an HTML document, be aware that regular expressions may not be the best tool for that job, especially if your parsing requirements are complex.

Notepad++ Replace new line inside text

I have this sample, because is one of one million rows with that.
I have this text:
<tr class="even">
<td><a href="http://www.ujk.edu.pl/">Jan Kochanowski
University of Humanities and Sciences (Swietokrzyska Pedagogical
University) / Uniwersytet Humanistyczno Przyrodniczy Jana Kochanowskiego
w Kielcach</a></td>
I want to replace to be like that:
<tr class="even">
<td>Jan Kochanowski University of Humanities and Sciences (Swietokrzyska Pedagogical University) / Uniwersytet Humanistyczno Przyrodniczy Jana Kochanowskiegow Kielcach</td>
I tried that REGEX: (.*)
But didn't work.
Open the replace window with Ctrl + H
Then enter
Find what: ([^>])[\r\n]{1,2}
Replace with: \1
Check Regular Expression
[^>] matches a character that isn't a >
The {1,2} protects against a file that may only have a newline and not a carriage return.
\1 replaces just the character that was in the grouping ( ).
Open Notepad++
Click Search >> Replace..
Replace with: \n
At the bottom you will find "Search Mode", click "Extended"
Live example here: http://postimg.org/image/c66pw8kkr/
If you can't make jmstoker's solution work, try like this:
you need to check if the line breaks are just CRLF or just one of them, for that, click on the toolbar the icon "show all characters" or go to menu View -> Show Symbol -> Show all characters
in the replace dialog select the "Extended" search mode
in the "find what:" field, write this: \r\n (or just \r or just \n, basically match CR with \r and LF with \n)
leave the replace with field empty
once this is done, all the line breaks will have been replaced, but you want the <tr class="even"> to be on its own line, so just replace still using an extended search <tr class="even"><td> with <tr class="even">\r\n<td>
I'm guessing you also have rows with class "odd" or something like that so you might need to repeat that last step with the different class :)

How can I search text using regex when it contains \r\n

I am using Sublime Text 2's regex search and replace tool and would like to search text that includes the \r and \n special characters but cannot see how just at the moment.
For example, I have the text:
<div class="head">\r\n
\r\n Keep this text\r\n</div>
Which I would like to transform into:
<h1>Keep this text</h1>
I would also like to factor in the eventuality that these \r\n characters may not be present.
How might I search accounting for \r\n being present and absent, and then remove them as per above? If two regex are required that's fine too.
So far I have <div class="head">(\w)+</div>, however this is stalled by the aforementioned \r\n.
I think you're looking for \s, which matches white space.
So your regex should be something like the following:
<div class="head">\s*(.+?)\s*</div>
If you can do this in ST2, then I think it would fit your need:
Find:
<div class="head">[\s\r\n]*([\w ]+)[\s\r\n]*<\/div>
Replace by:
<h1>$1</h1>
Demo