Convert GFM highlighted code block to Stack Overflow highlighted code block - regex

1. Question
I can't convert a GFM highlighted code block to a Stack Overflow highlighted code block.
2. Example
For example, I need to convert:
Do not change this line
```markdown
Sasha great!
Sasha nice!
She is beautiful, surprise!
```
Do not change this line
to:
Do not change this line
<!-- language: lang-markdown -->

	Sasha great!
	Sasha nice!
	She is beautiful, surprise!
Do not change this line
3. Problem
To get a highlighted code block, I need to add a tab at the beginning of each line inside the code block. I don't understand how I can do that.
4. Not helped
My example regex:
Find:
\`\`\`(.+?)\n((.+?\n)+)\`\`\`
Replace:
<!-- language: lang-\1 -->\n\n\t\2
I get result:
Do not change this line
<!-- language: lang-markdown -->

	Sasha great!
Sasha nice!
She is beautiful, surprise!
Do not change this line
The tab character is added only at the beginning of the first line inside the code block. What can I do to add a tab at the beginning of each line inside the code block?

Since you are using Sublime Text's find/replace functionality and there is no programming language involved, it takes two steps to achieve what you want.
For first step try to search for:
(?m)(?:^```\h*\S+\s+\K|\G(?!\A))^(?!```)(.*\R+)(?:```)?
and replace with:
\t\1
The second find/replace adds the HTML comment, so search for:
(?m)^```\h*(\S+)
and replace it with:
<!-- language: lang-\1 -->\n
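If a scripting language is available, the same conversion can be done in one pass. The following is only a sketch in Python, not part of the Sublime Text approach above: the names FENCE and gfm_to_so are made up for illustration, and it assumes every fence carries a language tag and sits at the start of a line. A replacement function is used because a plain \t\2 template inserts the tab only once per match, which is the behaviour described in the question.

```python
import re

# Opening fence with a language tag, the block body, then a closing fence on its own line.
FENCE = re.compile(r"^```[ \t]*(\S+)[ \t]*\n(.*?)^```[ \t]*$", re.MULTILINE | re.DOTALL)


def gfm_to_so(text):
    """Turn fenced GFM blocks into Stack Overflow style tab-indented blocks."""
    def convert(match):
        lang, body = match.group(1), match.group(2)
        # Prefix every body line with a tab, not just the first one.
        indented = "".join("\t" + line for line in body.splitlines(keepends=True))
        return f"<!-- language: lang-{lang} -->\n\n{indented}".rstrip("\n")
    return FENCE.sub(convert, text)


sample = (
    "Do not change this line\n"
    "```markdown\n"
    "Sasha great!\n"
    "Sasha nice!\n"
    "She is beautiful, surprise!\n"
    "```\n"
    "Do not change this line\n"
)
print(gfm_to_so(sample))
```

Lines outside the fences are left untouched because the pattern only matches from an opening fence to the next closing fence.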


Possible Bug using Regex in Notepad++ with Replace All?

Have I found a bug in Notepad++ or am I doing something wrong?
Background info
(Please note that I do know that one is not supposed to use regex to parse HTML, but I think this is a special case that should work, were it not for the possible Notepad++ bug ;-)
I have exported Apple Notes as HTML using Exporter 3.0 on a Mac. In the HTML output every Note line is between <div> - </div> elements and also "header/title lines" like <h1> - </h1> or <h2> - </h2> etc. Each "header/title line" is often split in several unnecessary HTML header elements as in the following simplified example.
<div><h1>TEST </h1><h1>Title<br></h1></div>
<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2><h2>T3</h2><u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>
This HTML can't be imported into OneNote giving the same result as seen in Apple Notes i.e. each "header/title" line is split in multiple lines. That's true even when changing the <h1>/<h2> block elements to inline elements using an initial <style>h1, h2 {display: inline;}</style> statement. (Maybe that is a bug or restriction in OneNote, but I need to find a workaround.)
Therefore, I need to clean the example HTML output above from the unnecessary HTML header <h1> or <h2> (all but the first in every line) and </h1> or </h2> (all but the last in every line), to get the following result that can be imported to OneNote without problem.
<div><h1>TEST Title<br></h1></div>
<div><b><h2>T1<u>T2</u> </b>(T3<u>T4</u>)<b><br></h2></b></div>
Solution ? - Developed Regex
I'm quite new to regex, especially advanced regex, but I think I have found a way to clean the erroneous HTML code using two different regular expressions, as follows.
Both work well when tested on regex101.com, I think.
The first one removes unnecessary </h1> or </h2> elements and uses a positive lookahead (it works both in regex101 and in Notepad++):
(</h[1-6]>)(?=.*?\1)
Picture 1 shows a working Find All + Mark All in Notepad++
Picture 2 shows a working Replace All
The second one removes unnecessary <h1> or <h2> elements and uses a positive lookbehind (it works in regex101 but NOT fully in Notepad++):
(?<=(<(h[1-6])>))(?:.*?)\K\1
Picture 3 shows a working Find All + Mark All in Notepad++ = All 8 occurrences found
Picture 4 shows a NOT working Replace All in Notepad++ = Only 5 occurrences (of the 8 found) are replaced
If I redo the same Replace All a second time, 2 of the remaining 3 occurrences are replaced.
If I redo the same Replace All a third time, the last remaining occurrence is replaced.
BUG ?
Is this a bug in Notepad++ or is this behavior normal or am I doing something strange here? Please help me understand.
So, rather than make multiple passes through your data, you can get it all in one pass with this:
(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)
and replace it with \1\2. The first capture group skips the first <h#> on each line and is null after line start. The second capture group captures everything up to the next <h#> tag. The optional slash (/?) scans and deletes both open and close tags. The last part is a positive lookahead to make sure the last </h#> is not deleted.
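To check the single-pass pattern outside Notepad++, here is a rough Python sketch run against the two example lines from the question. The names PATTERN and merge_headers are illustrative only, and re.MULTILINE is an assumption needed so that ^ and $ work per line as they do in the editor. A function is used for the replacement so that the unmatched first group becomes an empty string.

```python
import re

# Same pattern as above; MULTILINE makes ^ and $ anchor at every line.
PATTERN = re.compile(
    r"(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)",
    re.MULTILINE,
)


def merge_headers(html):
    """Drop every header tag except the first opening and last closing one per line."""
    # Equivalent to replacing with \1\2, treating an unmatched group 1 as empty.
    return PATTERN.sub(lambda m: (m.group(1) or "") + m.group(2), html)


exported = "\n".join([
    "<div><h1>TEST </h1><h1>Title<br></h1></div>",
    "<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2><h2>T3</h2>"
    "<u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>",
])
print(merge_headers(exported))
```

Each match consumes exactly one header tag, so the whole file is cleaned in one pass and no match depends on text that an earlier replacement already changed.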
In the two lines of your examples, all the header levels on a line are the same, so this regex is fine. If the first opening tag and the last closing tag don't have the same level, then you have a problem, but I think your solutions have that same problem. In any case you can fix that in a second pass with ^(.*<h)([1-6])(.*</h)[1-6] and replace it with \1\2\3\2, which copies the level of the opening tag into the closing tag.
I would also point out that this creates unbalanced HTML with a <b>, followed by <h1>, followed by </b>, followed by </h1>. I don't know if that is OK for your case. If not, it might be better to remove ALL the <h#> tags and anchor new ones just inside the <div> </div> pair.
In any event here is a REGEX101 screenprint with this regex working on your examples:

Find content of the tag <Caption>

I want to record a macro for Notepad++ to find several texts that are inside an XML document with some <Caption> tags and a lot of other XML tags. So I want to use regex and need a little help. I think I'm quite close.
example: <Caption>ThetextIwanttofind</Caption>
my regex: <Caption\b[^>]*>(.*?)</Caption>
The problem is the closing Caption-tag. How to rewrite my regex to get the inner text with the closing Caption?
Thx for your help!
<Caption\b[^>]*>(.*?)<Caption> --> works for Caption without a closing tag
One solution would be to use:
<Caption\b[^>]*>(.*?)<\/?Caption>
The \/? makes the slash optional, so the terminator matches both </Caption> and a bare <Caption>.
But it's kind of ugly.
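A quick way to see what the optional slash changes is to run the pattern over a small sample. This Python sketch is only an illustration: the sample string and the names CAPTION and captions are invented, and the escaped slash is written as a plain / because Python does not need it escaped.

```python
import re

# Capture the text between an opening <Caption ...> and either </Caption> or <Caption>.
CAPTION = re.compile(r"<Caption\b[^>]*>(.*?)</?Caption>", re.DOTALL)

# Hypothetical input: one properly closed tag and one closed with a bare <Caption>.
xml = (
    "<Item><Caption>ThetextIwanttofind</Caption>"
    '<Caption lang="en">Second<Caption></Item>'
)

captions = CAPTION.findall(xml)
print(captions)  # ['ThetextIwanttofind', 'Second']
```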

KML Inserting a Specific Tag between Two Other Tags Based on a Condition

TLDR - Insert the Style tag and its contents (see code blocks) between the name and description tag only if the description mentions the phrase "the office" in order to change the current Google Earth placemark from the default yellow one to a custom one...
Hi guys, I’m having a bit of trouble figuring this one out…
Using Notepad++ I am editing a Google Earth KML file where I have many placemarks that follow this XML pattern:
<Placemark>
<name>Jim</name>
<description>Jim goes to the office every day</description>
<TimeSpan>
<begin>2016-06-20T12:00:00Z</begin>
<end>2016-06-25T12:00:00Z</end>
</TimeSpan>
<Point>
<coordinates>123412341234,123412341234,1</coordinates>
</Point>
</Placemark>
I would like to find every instance of the phrase “the office”. If that text is found, the code below should be inserted between name and description in a form that Google Earth can read.
<Style id="customstyle">
<IconStyle>
<color>a1ff00ff</color>
<scale>1.5</scale>
<Icon>
<href>http://maps.google.com/mapfiles/kml/shapes/shaded_dot.png</href>
</Icon>
</IconStyle>
</Style>
Doing this would change the placemark from the default one to a custom one.
All of the tutorials I have found so far explain how to add words or phrases to the beginning or end of a line using Notepad++ regex, or how to insert text on the next line using \n.
However, I think my situation is different in that, based on a certain criterion, I want to insert my text above the matching line (more specifically, between the name and description tags).
The end result would look something like this (notice how the text I wanted to add is now in between the name tag and the description tag)
<Placemark>
<name>Jim</name>
<Style id="customstyle">
<IconStyle>
<color>a1ff00ff</color>
<scale>1.5</scale>
<Icon>
<href>http://maps.google.com/mapfiles/kml/shapes/shaded_dot.png</href>
</Icon>
</IconStyle>
</Style>
<description>Jim goes to the office every day</description>
<TimeSpan>
<begin>2016-06-15T12:00:00Z</begin>
<end>2016-06-20T12:00:00Z</end>
</TimeSpan>
<Point>
<coordinates>2135125,1234523451,12341234</coordinates>
</Point>
</Placemark>
Now all placemarks that follow that pattern would have a different type of placemark than the default one (and I would no longer have a headache).
Thanks in advance all.
Well, the regex doesn't really need to match something before the line.
It just needs to put something with lines before your match.
So it's still a fairly simple thing to do.
So using Notepad++
Find What : (\s+)(<description>)(?=.*?the office.*?<\/description>)
Replace with : $1<Style id="customstyle">$1\t<IconStyle>$1\t\t<color>a1ff00ff</color>$1\t\t<scale>1.5</scale>$1\t\t<Icon>$1\t\t\t<href>http://maps.google.com/mapfiles/kml/shapes/shaded_dot.png</href>$1\t\t</Icon>$1\t</IconStyle>$1</Style>$1$2
Search mode : Regular Expression
Note that the whitespaces before the description tag are put in capture group 1.
That's a trick to give the inserted lines the same indentation as the description tag.
But you could also just put in tags without whitespaces.
Find What : (<description>)(?=.*?the office.*?<\/description>)
Replace with : <Style id="customstyle"><IconStyle><color>a1ff00ff</color><scale>1.5</scale><Icon><href>http://maps.google.com/mapfiles/kml/shapes/shaded_dot.png</href></Icon></IconStyle></Style>$1
And then use a plugin like "XML Tools" to "Pretty Print" your XML.
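For files too large or too numerous to edit by hand, the first find/replace can be reproduced in a script. The Python sketch below is an assumption-laden port of it, not something Notepad++ itself provides: STYLE_LINES, PATTERN, insert_style, and the two-placemark sample are all made up for illustration. As in the editor, the dot does not cross line breaks, so the phrase must be on the same line as the description tag.

```python
import re

# The Style block from the question, one entry per line, with relative tab indentation.
STYLE_LINES = [
    '<Style id="customstyle">',
    "\t<IconStyle>",
    "\t\t<color>a1ff00ff</color>",
    "\t\t<scale>1.5</scale>",
    "\t\t<Icon>",
    "\t\t\t<href>http://maps.google.com/mapfiles/kml/shapes/shaded_dot.png</href>",
    "\t\t</Icon>",
    "\t</IconStyle>",
    "</Style>",
]

# Group 1 is the line break plus indentation in front of <description>.
PATTERN = re.compile(r"(\s+)(<description>)(?=.*?the office.*?</description>)")


def insert_style(kml):
    """Insert the Style block before each description that mentions 'the office'."""
    def build(match):
        indent = match.group(1)
        # Reuse the captured whitespace in front of every inserted line.
        block = "".join(indent + line for line in STYLE_LINES)
        return block + indent + match.group(2)
    return PATTERN.sub(build, kml)


sample = (
    "<Placemark>\n"
    "\t<name>Jim</name>\n"
    "\t<description>Jim goes to the office every day</description>\n"
    "</Placemark>\n"
    "<Placemark>\n"
    "\t<name>Ann</name>\n"
    "\t<description>Ann works from home</description>\n"
    "</Placemark>\n"
)
print(insert_style(sample))
```

Only the first placemark receives the Style block, because the lookahead fails for descriptions that do not contain the phrase.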

Limiting a character after a wildcard in regex to its first occurrence

How can I tell a character that comes after a wildcard to use the first occurrence of it?
I did the following to find any tag with the word "title" in it:
<(.*?)(title)(.*?)>
but clearly what happens is I end up with the entire tag to the end of
</title>
So that in
<Bla bla ="nametitle">Yada yada</title>
I want
<Bla bla ="nametitle">
but end up with the whole tag.
Please if anyone is offended by the use of parsing html with regex simply move on and accept my apologies for the transgression. I am simply trying to find out how to use the wildcard which I have not used before correctly and apply as I see fit. Thank you.
You can use this regex:
<title.+?>
The above matches <title and goes till it encounters a >
Stop parsing at the first >. Using your example, you could do this with: <(.*?)(title)([^>]*?)>
<(?![\/]).*?title.*?>
This will find title inside any set of < > tags except for closing tags beginning with </
Example:
https://regex101.com/r/QFs4ny/1
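The difference between a lazy dot and a negated character class shows up when the first tag on the line does not contain the word. The Python sketch below uses an invented sample string and the illustrative names lazy_dot and negated. It also replaces the first .*? with [^>]*?, which goes one step further than the suggestion above and is an assumption on my part.

```python
import re

# Hypothetical input: the first tag has no "title" in it, the second one does.
html = '<p class="x">Yada</p><h1 id="maintitle">Hi</h1>'

# A lazy dot may still run across > characters until it finally finds "title".
lazy_dot = re.compile(r"<(.*?)(title)(.*?)>")
# A negated class can never leave the tag it started in.
negated = re.compile(r"<([^>]*?)(title)([^>]*?)>")

wide = lazy_dot.search(html).group(0)
narrow = negated.search(html).group(0)
print(wide)    # <p class="x">Yada</p><h1 id="maintitle">
print(narrow)  # <h1 id="maintitle">
```

Lazy quantifiers only prefer the shortest match from a given starting point. They do not stop the engine from consuming a > when that is the only way to reach the rest of the pattern.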

Notepad++ Regex to remove styling

I need to remove some tags from a whole lot of html pages.
Lately I discovered the option of regex in Notepad++
But.. Even after hours of Googling I don't seem to get it right.
What do I need?
Example:
<p class=MsoNormal style='margin-left:19.85pt;text-indent:-19.85pt'><spanlang=NL style='font-size:11.0pt;font-family:Symbol'>·<span style='font:7.0pt "Times New Roman"'> </span></span><span lang=NL style='font-size:9.0pt;font-family:"Arial","sans-serif"'>zware uitvoering met doorzichtige vulruimte;</span></p>
I need to remove everything related to styling, classes, and ids, so that only the clean tags remain.
Anyone able to help me on this one?
Kind regards
EDIT
Check an entire file via pastebin: http://pastebin.com/0tNwGUWP
I think this pattern will erase all styles in "p" and "span" tags:
((?<=<p)|(?<=<span))[^>]*(?=>)
=> how it works:
((?<=<p)|(?<=<span)) : a lookbehind block that makes sure the string we are looking for comes after <p or <span
[^>]* : matches any run of characters that are not >
(?=>) : a lookahead block that makes sure the string we are looking for comes before a > character
PS: Tested on Notepad ++
If the sample you provided is representative of what you need to process, then the following quick and dirty solution will work:
Find what: [a-z]+='[^']*'
Replace with: (leave empty)
Find what: [a-z]+=[a-zA-Z]*
Replace with: (leave empty)
You must run the first one first to pick up the quoted style='...' attributes, and then run the second one to pick up the unquoted class=... and lang=... attributes.
There's good reason why other posters are saying not to parse HTML this way. You'll end up in all sorts of trouble, since regex in general cannot handle all the wonderful weirdness of HTML.
My advice is as follows.
As I see in your sample text you have only "p" and "span" tags that need to be handled. And you apparently want to remove all the styles inside them. In this case, you could consider removing everything inside those tags, leave them simple <p> or <span>.
I don't know about Notepad++ but a simple C# program can do this job quickly.
Assuming <spanlang=NL is a typo (it should be <span lang=NL), I'd do:
Find what: (<\w+)[^>]*>
Replace with: $1>
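The same find/replace can be tried out in a few lines of Python before running it over many files. This is only a sketch: TAG_ATTRS and strip_attributes are illustrative names, the sample is a shortened version of the markup in the question, and the pattern will break if an attribute value itself contains a > character.

```python
import re

# Keep "<" plus the tag name, drop everything up to the closing ">".
TAG_ATTRS = re.compile(r"(<\w+)[^>]*>")


def strip_attributes(html):
    """Remove all attributes from opening tags; closing tags do not match."""
    return TAG_ATTRS.sub(r"\1>", html)


sample = (
    "<p class=MsoNormal style='margin-left:19.85pt'>"
    "<span lang=NL style='font-size:9.0pt'>"
    "zware uitvoering met doorzichtige vulruimte;</span></p>"
)
cleaned = strip_attributes(sample)
print(cleaned)  # <p><span>zware uitvoering met doorzichtige vulruimte;</span></p>
```

Closing tags such as </span> are untouched because \w+ cannot match the slash that follows the <.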
If you don't mind doing a little bit of programming, HtmlAgilityPack can easily remove scripts, styles, or whatever else from your XML/HTML.
Example:
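using System.Linq;

// Assumes html already holds the page source as a string.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// Collect the <script> and <style> nodes first, then remove each one.
doc.DocumentNode.Descendants()
    .Where(n => n.Name == "script" || n.Name == "style")
    .ToList()
    .ForEach(n => n.Remove());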