How can I match a Markdown code block with RegEx? - regex

I am trying to extract a code block from a Markdown document using PCRE RegEx. For the uninitiated, a code block in Markdown is defined thus:
To produce a code block in Markdown, simply indent every line of the
block by at least 4 spaces or 1 tab.
A code block continues until it reaches a line that is not indented (or the end of the article).
So, given this text:
This is a code block:
I need capturing along with
this line
This is a code fence below (to be ignored):
``` json
This must have three backticks
flanking it
```
I love `inline code` too but don't capture
and one more short code block:
Capture me
So far I have this RegEx:
(?:[ ]{4,}|\t{1,})(.+)
But it simply captures each line prefixed with at least four spaces or one tab. It doesn't capture the whole block.
What I need help with is how to set the condition to capture everything after 4 spaces or 1 tab until you either get to a line that is not indented or the end of the text.
Here's an online work in progress:
https://www.regex101.com/r/yMQCIG/5

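You should anchor the pattern with ^ and $ in combination with the m modifier, so that they match at the start and end of each line rather than only at the start and end of the whole string. Also, your test text had only 3 leading spaces in the final block: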
^((?:(?:[ ]{4}|\t).*(\R|$))+)
With \R and the repetition you match one whole block with each single match, instead of a line per match.
See demo on regex101
Disclaimer: The rules of markdown are more complicated than the presented example text shows. For instance, when (nested) lists have code blocks in them, these need to be prefixed with 8, 12 or more spaces. Regular expressions are not suitable to identify such code blocks, or other code blocks embedded in markdown notation that uses the wider range of format combinations.
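For readers who want to try the block-matching idea outside PCRE, here is a minimal Python sketch. It assumes Python's built-in re module, which has no \R, so the line ending is written as \n or end of text instead. The names INDENTED_BLOCK and find_indented_blocks are made up for this illustration, and the sample text is the question's text with the indentation and blank lines assumed.

```python
import re

# Python's re has no \R, so each line ends at \n or at the end of the text.
INDENTED_BLOCK = re.compile(r"^(?:(?:[ ]{4}|\t).*(?:\n|$))+", re.MULTILINE)


def find_indented_blocks(markdown):
    """Return every run of consecutive indented lines as one string."""
    return INDENTED_BLOCK.findall(markdown)


fence = "`" * 3  # built this way only to keep a literal fence out of the example
sample = (
    "This is a code block:\n"
    "\n"
    "    I need capturing along with\n"
    "    this line\n"
    "\n"
    "This is a code fence below (to be ignored):\n"
    "\n"
    + fence + " json\n"
    "This must have three backticks\n"
    "flanking it\n"
    + fence + "\n"
    "\n"
    "and one more short code block:\n"
    "\n"
    "\tCapture me"
)

for block in find_indented_blocks(sample):
    print(repr(block))
```

As with the PCRE version, a blank line inside a code block ends the match, so a block containing empty lines would come back as several pieces.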

There are 3 ways to highlight code: 1) using start-of-line indentation 2) using 3 or more backticks enclosing a multiline block of code or 3) inline code.
1 and 3 are part of John Gruber's original Markdown specification.
Here is the way to achieve this. You need to perform 3 separate regexp tests:
1) Using indentation
(?:\n{2,}|\A) # Starting at beginning of string or with 2 new lines
(?<code_all>
(?:
(?<code_prefix> # Lines must start with a tab or a tab-width of spaces
[ ]{4}
|
\t
)
(?<code_content>.*\n+) # with some content, possibly nothing followed by a new line
)+
)
(?<code_after>
(?=^[ ]{0,4}\S) # Lookahead for non-space at line-start
|
\Z # or end of doc
)
2a) Using code block with backticks (vanilla markdown)
(?:\n+|\A)? # Necessarily at the beginning of a new line or start of string
(?<code_all>
(?<code_start>
[ ]{0,3} # Possibly up to 3 leading spaces
\`{3,} # 3 code marks (backticks) or more
)
\n+
(?<code_content>.*?) # enclosed content
\n+
(?<!`)
\g{code_start} # balanced closing block marks
(?!`)
[ \t]* # possibly followed by some space
\n
)
(?<code_trailing_new_line>\n|\Z) # and a new line or end of string
2b) Using code block with backticks with some class specifier (extended markdown)
(?:\n+|\A)? # Necessarily at the beginning of a new line
(?<code_all>
(?<code_start>
[ ]{0,3} # Possibly up to 3 leading spaces
\`{3,} # 3 code marks (backticks) or more
)
[ \t]* # Possibly some spaces or tab
(?:
(?:
(?<code_class>[\w\-\.]+) # or a code class like html, ruby, perl
(?:
[ \t]*
\{(?<code_def>[^\}]+)\} # a definition block like {.class#id}
)? # Possibly followed by class and id definition in curly braces
)
|
(?:
[ \t]*
\{(?<code_def>[^\}]+)\} # a definition block like {.class#id}
) # Followed by class and id definition in curly braces
)
\n+
(?<code_content>.*?) # enclosed content
\n+
(?<!`)
\g{code_start} # balanced closing block marks
(?!`)
)
(?:\n|\Z) # and a new line or end of string
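A reduced version of the fenced-block patterns above can be sketched in Python. This is only an approximation under a few assumptions: it uses Python's (?P<name>...) group syntax instead of PCRE's \g{...}, it handles an optional class word such as json but not the {.class#id} definition block, and the names FENCE and fenced_blocks are invented for the example.

```python
import re

FENCE = re.compile(
    r"""
    ^(?P<code_start>[ ]{0,3}`{3,})   # up to 3 leading spaces, then 3+ backticks
    [ \t]*(?P<code_class>[\w.-]*)    # optional class such as json or ruby
    [ \t]*\n
    (?P<code_content>.*?)            # enclosed content, matched lazily
    \n(?P=code_start)[ \t]*$         # the same fence again, alone on its line
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)


def fenced_blocks(markdown):
    """Return (class, content) pairs for each fenced code block found."""
    return [(m["code_class"], m["code_content"]) for m in FENCE.finditer(markdown)]


mark = "`" * 3  # avoids writing a literal fence inside this example
text = (
    "Intro\n\n"
    + mark + " json\n"
    "This must have three backticks\n"
    "flanking it\n"
    + mark + "\n\n"
    "I love `inline code` too\n"
)
print(fenced_blocks(text))
```

Because the closing fence is a backreference to the opening one, a block opened with four backticks is only closed by four backticks, which mirrors the "balanced closing block marks" idea in the original patterns.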
3) Using 1 or more backticks for inline code
(?<!\\) # Ensuring this is not escaped
(?<code_all>
(?<code_start>\`{1,}) # One or more backtick(s)
(?<code_content>.+?) # Code content in between backticks
(?<!`) # Not preceded by a backtick
\g{code_start} # Balanced closing backtick(s)
(?!`) # And not followed by a backtick
)
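The inline-code pattern translates almost directly to Python. The sketch below assumes Python's re module, where a named backreference is written (?P=name) rather than \g{name}; the constant INLINE_CODE and the helper inline_code_spans are hypothetical names used only here.

```python
import re

INLINE_CODE = re.compile(
    r"""
    (?<!\\)                  # the opening backtick must not be escaped
    (?P<code_start>`+)       # one or more backticks
    (?P<code_content>.+?)    # code content, as short as possible
    (?<!`)                   # the content does not end with a backtick
    (?P=code_start)          # the same number of closing backticks
    (?!`)                    # and no extra backtick after them
    """,
    re.VERBOSE,
)


def inline_code_spans(text):
    """Return the content of every inline code span in a single line of text."""
    return [m.group("code_content") for m in INLINE_CODE.finditer(text)]


print(inline_code_spans("I love `inline code` too"))
print(inline_code_spans("use ``a ` b`` here"))
```

The second call shows why the closing delimiter is a backreference: a single backtick inside a span delimited by two backticks stays part of the content.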

Try this?
[a-z]*\n[\s\S]*?\n
It will extract from your example
This must have three backticks
flanking it

Related

Find :: outside of markdown code formatting

I have a bunch of markdown files, where I want to search for Ruby's double colon :: outside of some code formatting (e.g. where I forgot to apply proper markdown). For example
`foo::bar`
hello `foo::bar` test
` example::with::whitespace `
```
Proper::Formatted
```
```
Module::WithIndendation
```
```
Some::Nested::Modules
```
```ruby
CodeBlock::WithSyntax
```
# Some::Class
## Another::Class Heading
some text
The regex should only match Some::Class and Another::Class, because they are missing the surrounding backticks and are also not inside a multi-line code fence block.
I have this regex, but it also matches the multi line block
[\s]+[^`]+(::)[^`]+[\s]?
Any idea how to exclude this?
EDIT:
It would be great, if the regex would work in Ruby, JS and on the command line for grep.
For the original input, you may use this regex in Ruby to match a :: string that is
not preceded by a ` and
not preceded by a ` followed by whitespace:
Regex:
(?<!`\s)(?<!`)\b\w+::\w+
RegEx Demo 1
RegEx Breakup:
(?<!`\s): Negative lookbehind to assert that a ` followed by whitespace is not at the preceding position
(?<!`): Negative lookbehind to assert that a ` is not at the preceding position
\b: Match word boundary
\w+: Match 1+ word characters
::: Match a ::
\w+: Match 1+ word characters
You can use this regex in Javascript:
(?<!`\w*\s*|::)\b\w+(?:::\w+)+
RegEx Demo 2
For gnu-grep, consider this command:
grep -ZzoP '`\w*\s*\b\w+::\w+(*SKIP)(*F)|\b\w+::\w+' file |
xargs -0 printf '%s\n'
Some::Class
Another::Class
RegEx Demo 3
One can use the regular expression
rgx = /`[^`]*`|([^`\r\n]*::[^`\r\n]*)/
with the form of String#gsub that takes one argument and no block, and therefore returns an enumerator (str holding the example string given in the question):
str.gsub(rgx).select { $1 }
#=> ["# Some::Class", "## Another::Class Heading"]
The idea is that the first part of the regex's alternation, `[^`]*`, matches, but does not capture, all strings delimited by backticks (including ``), whereas the second part, ([^`\r\n]*::[^`\r\n]*), matches and captures all strings on a single line that contain '::' but no backticks. We therefore concern ourselves with captures only, by invoking select { $1 } on the enumerator returned by gsub.
The regular expression can be made self-documenting by writing it in free-spacing mode.
rgx = /
` # match a backtick
[^`]* # match zero or more characters other than backticks
` # match a backtick
| # or
( # begin capture group 1
[^`\r\n]* # match zero or more characters other than backticks and
# line terminators
:: # match two colons
[^`\r\n]* # ditto line before previous
) # end capture group 1
/x # invoke free-spacing regex definition mode
[^`\r\n] contains \r (carriage return) in the event that the file was created with Windows. If desired, [^`]* can be replaced with .*? (match zero or more characters, as few as possible).
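The same "match what you want to skip, capture what you want to keep" idea works outside Ruby. Here is a rough Python equivalent, assuming the built-in re module; RGX mirrors the pattern above, while unformatted_double_colons and the shortened sample text are made up for the illustration.

```python
import re

# First alternative consumes backtick-delimited text without capturing it;
# second alternative captures a backtick-free stretch of one line containing ::
RGX = re.compile(r"`[^`]*`|([^`\r\n]*::[^`\r\n]*)")


def unformatted_double_colons(text):
    """Return the captured stretches, ignoring matches of the first alternative."""
    return [m.group(1) for m in RGX.finditer(text) if m.group(1) is not None]


fence = "`" * 3  # avoids writing a literal fence inside this example
sample = "\n".join([
    "`foo::bar`",
    "hello `foo::bar` test",
    "` example::with::whitespace `",
    fence,
    "Proper::Formatted",
    fence,
    fence + "ruby",
    "CodeBlock::WithSyntax",
    fence,
    "# Some::Class",
    "## Another::Class Heading",
    "some text",
])

print(unformatted_double_colons(sample))
```

Filtering on group 1 plays the role of select { $1 } in the Ruby version: matches produced by the backtick alternative leave the group unset and are dropped.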

Remove the text outside the first brackets in R

I know that it was asked a lot of times, but I've tried to adapt the other answers to my need and I was not able to make it work using SKIP and FAIL (I'm a bit confused, I've to admit)
I'm using R actually.
The url I need to clean is:
url <- "posts.fields(id,from.fields(id,name),message,comments.summary(true).limit(0),likes.summary(true).limit(0))"
and I need to retain only the content inside the first brackets that are always prefixed by the word "fields" (while "posts" may vary). In other words something like
id,from.fields(id,name),message,comments.summary(true).limit(0),likes.summary(true).limit(0)
As you can see, there is some nesting inside. But I could eventually change my source code to accept this string too (removing the parentheses after every prefix)
id,from,message,comments,likes
I don't know how to remove the trailing parenthesis that balances the first one.
If it's good enough to just remove everything up to and including the first open parenthesis and also remove the last close parenthesis and thereafter then:
sub("^.*?\\((.*)\\)[^)]*$", "\\1", url)
Note:
If it's good enough to just remove the first open parenthesis and last close parenthesis then try this:
sub("\\((.*)\\)", "\\1", url)
Using lazy .* instead of greedy:
sub(".*?fields\\((.*)\\)", "\\1", url)
[1] "id,from.fields(id,name),message,comments.summary(true).limit(0),likes.summary(true).limit(0)"
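The two greedy/lazy substitutions above are not specific to R. As a quick cross-check, here is a sketch of the same patterns with Python's re.sub; the function names are invented for the example, and count=1 is passed to mimic R's sub, which replaces only the first match.

```python
import re

url = ("posts.fields(id,from.fields(id,name),message,"
       "comments.summary(true).limit(0),likes.summary(true).limit(0))")


def strip_outer_call(s):
    # Lazy .*? stops at the first "(", greedy (.*) runs to the last ")".
    return re.sub(r"^.*?\((.*)\)[^)]*$", r"\1", s, count=1)


def strip_fields_call(s):
    # Same idea, but the prefix must end with the word "fields".
    return re.sub(r".*?fields\((.*)\)", r"\1", s, count=1)


print(strip_outer_call(url))
print(strip_fields_call(url))
```

Both rely on the closing parenthesis of the first call being the last ")" in the string, which holds for the URL in the question but not for input with further calls after it.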
You need to use a recursive pattern:
sub("[^.]*+(?:\\.(?!fields\\()[^.]*)*+\\.fields\\(([^()]*+(?:\\((?1)\\)[^()]*)*+)\\)(?s:.*)", "\\1", url, perl=T)
demo
details:
# reach the dot before "fields("
[^.]*+ # all except a dot (possessive)
(?: # open a non-capturing group
\\. # a literal dot
(?!fields\\() # not followed by "fields("
[^.]* # all except a dot
)*+ # repeat the group zero or more times
\\.fields\\(
# match a content between parenthesis with any level of nesting
( # open the capture group 1
[^()]*+ # 0 or more character that are not brackets (possessive)
(?: # open a non capturing group
\\(
(?1) # recursion in group 1
\\) #
[^()]* # all that is not a bracket
)*+ # close the non capturing group and repeat 0 or more time (possessive)
) # close the capture group 1
\\)
(?s:.*) # end of the string
Possessive quantifiers are used here to limit the backtracking when for any reason a part of the pattern fails.
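Python's built-in re module has no (?1) recursion, so the pattern above cannot be ported as is. A different technique that gives the same result for balanced input is to find the first ".fields(" and then count parenthesis depth by hand. The sketch below does that; it is not the recursive regex, and first_fields_argument is a hypothetical helper name.

```python
from typing import Optional


def first_fields_argument(url: str) -> Optional[str]:
    """Return the text inside the first '.fields(...)' call, honoring nesting."""
    marker = ".fields("
    start = url.find(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1  # we are already inside the opening parenthesis of fields(
    for i in range(begin, len(url)):
        ch = url[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return url[begin:i]
    return None  # the parentheses never balanced


url = ("posts.fields(id,from.fields(id,name),message,"
       "comments.summary(true).limit(0),likes.summary(true).limit(0))")
print(first_fields_argument(url))
```

Unlike the greedy substitutions, this stops at the parenthesis that actually balances the first one, so trailing calls such as .limit(5) after the fields(...) group are left out.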

Can a Regex Return the Number of the Line where the Match is Found?

In a text editor, I want to replace a given word with the number of the line on which this word is found. Is this possible with Regex?
Recursion, Self-Referencing Group (Qtax trick), Reverse Qtax or Balancing Groups
Introduction
The idea of adding a list of integers to the bottom of the input is similar to a famous database hack (nothing to do with regex) where one joins to a table of integers. My original answer used the Qtax trick. The current answers use either Recursion, the Qtax trick (straight or in a reversed variation), or Balancing Groups.
Yes, it is possible... With some caveats and regex trickery.
The solutions in this answer are meant as a vehicle to demonstrate some regex syntax more than practical answers to be implemented.
At the end of your file, we will paste a list of numbers preceded with a unique delimiter. For this experiment, the appended string is :1:2:3:4:5:6:7. This is a similar technique to a famous database hack that uses a table of integers.
For the first two solutions, we need an editor that uses a regex flavor that allows recursion (solution 1) or self-referencing capture groups (solutions 2 and 3). Two come to mind: Notepad++ and EditPad Pro. For the third solution, we need an editor that supports balancing groups. That probably limits us to EditPad Pro or Visual Studio 2013+.
Input file:
Let's say we are searching for pig and want to replace it with the line number.
We'll use this as input:
my cat
dog
my pig
my cow
my mouse
:1:2:3:4:5:6:7
First Solution: Recursion
Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested).
The recursive structure lives in a lookahead, and is optional. Its job is to balance lines that don't contain pig, on the left, with numbers, on the right: think of it as balancing a nested construct like {{{ }}}... Except that on the left we have the no-match lines, and on the right we have the numbers. The point is that when we exit the lookahead, we know how many lines were skipped.
Search:
(?sm)(?=.*?pig)(?=((?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?:(?1)|[^:]+)(:\d+))?).*?\Kpig(?=.*?(?(2)\2):(\d+))
Free-Spacing Version with Comments:
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # fail right away if pig isn't there
(?= # The Recursive Structure Lives In This Lookahead
( # Group 1
(?: # skip one line
^
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
(?:(?1)|[^:]+) # recurse Group 1 OR match all chars that are not a :
(:\d+) # match digits
)? # End Group
) # End lookahead.
.*?\Kpig # get to pig
(?=.*?(?(2)\2):(\d+)) # Lookahead: capture the next digits
Replace: \3
In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.
Second Solution: Group that Refers to Itself ("Qtax Trick")
Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested). The solution is easy to adapt to .NET by converting the \K to a lookahead and the possessive quantifier to an atomic group (see the .NET Version a few lines below.)
Search:
(?sm)(?=.*?pig)(?:(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*+.*?\Kpig(?=[^:]+(?(1)\1):(\d+))
.NET version: Back to the Future
.NET does not have \K. In its place, we use a "back to the future" lookbehind (a lookbehind that contains a lookahead that skips ahead of the match). Also, we need to use an atomic group instead of a possessive quantifier.
(?sm)(?<=(?=.*?pig)(?=(?>(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*).*)pig(?=[^:]+(?(1)\1):(\d+))
Free-Spacing Version with Comments (Perl / PCRE Version):
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # lookahead: if pig is not there, fail right away to save the effort
(?: # start counter-line-skipper (lines that don't include pig)
(?: # skip one line
^ #
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
# for each line skipped, let Group 1 match an ever increasing portion of the numbers string at the bottom
(?= # lookahead
[^:]+ # skip all chars that are not colons
( # start Group 1
(?(1)\1) # match Group 1 if set
:\d+ # match a colon and some digits
) # end Group 1
) # end lookahead
)*+ # end counter-line-skipper: zero or more times
.*? # match
\K # drop everything we've matched so far
pig # match pig (this is the match!)
(?=[^:]+(?(1)\1):(\d+)) # capture the next number to Group 2
Replace:
\2
Output:
my cat
dog
my 3
my cow
my mouse
:1:2:3:4:5:6:7
In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.
Choice of Delimiter for Digits
In our example, the delimiter : for the string of digits is rather common, and could happen elsewhere. We can invent a UNIQUE_DELIMITER and tweak the expression slightly. But the following optimization is even more efficient and lets us keep the :
Optimization on Second Solution: Reverse String of Digits
Instead of pasting our digits in order, it may be to our benefit to use them in the reverse order: :7:6:5:4:3:2:1
In our lookaheads, this allows us to get down to the bottom of the input with a simple .*, and to start backtracking from there. Since we know we're at the end of the string, we don't have to worry about the :digits being part of another section of the string. Here's how to do it.
Input:
my cat pi g
dog p ig
my pig
my cow
my mouse
:7:6:5:4:3:2:1
Search:
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # lookahead: if pig is not there, fail right away to save the effort
(?: # start counter-line-skipper (lines that don't include pig)
(?: # skip one line that doesn't have pig
^ #
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
# Group 1 matches increasing portion of the numbers string at the bottom
(?= # lookahead
.* # get to the end of the input
( # start Group 1
:\d+ # match a colon and some digits
(?(1)\1) # match Group 1 if set
) # end Group 1
) # end lookahead
)*+ # end counter-line-skipper: zero or more times
.*? # match
\K # drop match so far
pig # match pig (this is the match!)
(?=.*(\d+)(?(1)\1)) # capture the next number to Group 2
Replace: \2
See the substitutions in the demo.
Third Solution: Balancing Groups
This solution is specific to .NET.
Search:
(?m)(?<=\A(?<c>^(?:(?!pig)[^\r\n])*(?:\r?\n))*.*?)pig(?=[^:]+(?(c)(?<-c>:\d+)*):(\d+))
Free-Spacing Version with Comments:
(?xm) # free-spacing, multi-line
(?<= # lookbehind
\A #
(?<c> # skip one line that doesn't have pig
# The length of Group c Captures will serve as a counter
^ # beginning of line
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
) # end skipper
* # repeat skipper
.*? # we're on the pig line: lazily match chars before pig
) # end lookbehind
pig # match pig: this is the match
(?= # lookahead
[^:]+ # get to the digits
(?(c) # if Group c has been set
(?<-c>:\d+) # decrement c while we match a group of digits
* # repeat: this will only repeat as long as the length of Group c captures > 0
) # end if Group c has been set
:(\d+) # Match the next digit group, capture the digits
) # end lookahead
Replace: $1
Reference
Qtax trick
On Which Line Number Was the Regex Match Found?
Because you didn't specify which text editor, in vim it would be:
:%s/searched_word/\=printf('%-4d', line('.'))/g
But as somebody mentioned it's not a question for SO but rather Super User ;)
I don't know of an editor that does that short of extending an editor that allows arbitrary extensions.
You could easily use perl to do the task, though.
perl -i.bak -pe"s/word/$./eg" file
Or if you want to use wildcards,
perl -MFile::DosGlob=glob -i.bak -pe"BEGIN { @ARGV = map glob($_), @ARGV } s/word/$./eg" *.txt
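If a scripting language is an option, the line number does not have to come from the regex at all: the match offset is enough. Here is a small Python sketch of that idea, assuming the whole text fits in memory; replace_with_line_number is a made-up name, and the word is treated as a literal string rather than a pattern.

```python
import re


def replace_with_line_number(text, word):
    """Replace every occurrence of word with the 1-based number of its line."""
    pattern = re.compile(re.escape(word))

    def line_number(match):
        # Count the newlines that come before the match, then add one.
        return str(text.count("\n", 0, match.start()) + 1)

    return pattern.sub(line_number, text)


sample = "my cat\ndog\nmy pig\nmy cow\nmy mouse"
print(replace_with_line_number(sample, "pig"))
```

Counting newlines from the start for every match is quadratic in the worst case, which is fine for editor-sized files; for very large inputs, iterating line by line with enumerate would be the cheaper route.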

Delete lines between and including two patterns

I have a scalar variable that contains some information inside of a file. My goal is to strip that variable (or file) of any multi-line entry containing the words "Administratively down."
The format is similar to this:
Ethernet2/3 is up
... see middle ...
a blank line
VlanXXX is administratively down, line protocol is down
... a bunch of text indented by two spaces on multiple lines ...
a blank line
Ethernet2/5 is up
... same format as previously ...
I was thinking that if I could match "administratively down" and a leading newline (for the blank line), I would be able to apply some logic to the variable to also remove the lines between those lines.
I'm using Perl at the moment, but if anyone can give me an IOS way of doing this, that would also work.
Use Perl's Paragraph Mode
Perl has a rarely-used syntax for using blank lines as record separators: the -00 flags; see Command Switches in perl(1) for details.
Example
For example, given a corpus of:
Ethernet2/3 is up
... see middle ...
VlanXXX is administratively down, line protocol is down
... a bunch of text indented by two spaces on multiple lines ...
Ethernet2/5 is up
You can extract all paragraphs except the ones you don't want with the following one-liner:
$ perl -00ne 'print unless /administratively down/' /tmp/corpus
Sample Output
When tested against your corpus, the one-liner yields:
Ethernet2/3 is up
... see middle ...
Ethernet2/5 is up
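The paragraph-mode approach boils down to "split on blank lines, keep the records you want". For comparison, here is a minimal Python sketch of the same steps; drop_paragraphs is a hypothetical helper, and the corpus assumes the blank lines between interface entries that the question describes.

```python
import re


def drop_paragraphs(text, needle):
    """Remove every blank-line-separated paragraph that contains needle."""
    paragraphs = re.split(r"\n{2,}", text)
    kept = [p for p in paragraphs if needle not in p]
    return "\n\n".join(kept)


corpus = (
    "Ethernet2/3 is up\n"
    "  ... see middle ...\n"
    "\n"
    "VlanXXX is administratively down, line protocol is down\n"
    "  ... a bunch of text indented by two spaces on multiple lines ...\n"
    "\n"
    "Ethernet2/5 is up\n"
)
print(drop_paragraphs(corpus, "administratively down"))
```

Note that runs of several blank lines are collapsed to a single blank line on output, much as paragraph mode treats consecutive empty lines as one separator.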
So, you want to delete from the beginning of a line containing "administratively down" to and including the next blank line (two consecutive newlines)?
$log =~ s/[^\n]+administratively down.+?\n\n//s;
s/ = regex substitution
[^\n]+ = any number of characters, not including newlines, followed by
administratively down = the literal text, followed by
.+? = any amount of text, including newlines, matched non-greedily, followed by
\n\n = two newlines
// = replace with nothing (i.e. delete)
s = single line mode, allows . to match newlines (it usually doesn't)
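The breakdown above carries over to other regex engines with little change. As an illustration, here is the same substitution in Python, where re.DOTALL plays the role of the s flag; remove_down_entries is an invented name. One difference to be aware of: Perl's s/// without g replaces only the first match, whereas re.sub replaces every match unless count is given.

```python
import re


def remove_down_entries(log):
    """Delete each entry from its 'administratively down' line through the next blank line."""
    return re.sub(r"[^\n]+administratively down.+?\n\n", "", log, flags=re.DOTALL)


log = (
    "Ethernet2/3 is up\n"
    "  ... see middle ...\n"
    "\n"
    "VlanXXX is administratively down, line protocol is down\n"
    "  ... indented text ...\n"
    "\n"
    "Ethernet2/5 is up\n"
    "  ... same format ...\n"
)
print(remove_down_entries(log))
```

As in the Perl version, an entry at the very end of the text that is not followed by a blank line would not be removed, because the pattern insists on the two trailing newlines.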
You can use this pattern:
(?<=\n\n|^)(?>[^a\n]++|\n(?!\n)|a(?!dministratively down\b))*+administratively down(?>[^\n]++|\n(?!\n))*+
details:
(?<=\n\n|^) # preceded by a blank line (two newlines) or the beginning of the string
# all that is not "administratively down" or a blank line, details:
(?> # open an atomic group
[^a\n]++ # all that is not an "a" or a newline
| # OR
\n(?!\n) # a newline not followed by a newline
| # OR
a(?!dministratively down\b) # "a" not followed by "dministratively down"
)*+ # repeat the atomic group zero or more times
administratively down # "administratively down" itself
# the end of the paragraph
(?> # open an atomic group
[^\n]++ # all that is not a newline
| # OR
\n(?!\n) # a newline not followed by a newline
)*+ # repeat the atomic group zero or more times

RegEx to match some wrapped texts

Consider following text:
aas( I)f df (as)(dfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asd #54 54 !fa.) sdf
I want to retrieve the text between parentheses, but adjacent parentheses should be considered a single unit. How can I do that?
For above example desired output is:
( I)
(as)(dfdsf)(adf)
(sfg).(dfdf)
(asd #54 54 !fa.)
Assumption
No nesting (), and no escaping of ()
Parentheses are chained together with the . character, or by being right next to each other (no flexible spacing allowed).
(a)(b).(c) is considered a single token (the . is optional).
Solution
The regex below is to be used with global matching (match all) function.
\([^)]*\)(?:\.?\([^)]*\))*
Please add the delimiter on your own.
DEMO
Explanation
Break down of the regex (spacing is insignificant). After and including # are comments and not part of the regex.
\( # Literal (
[^)]* # Match 0 or more characters that are not )
\) # Literal ). These first 3 lines match an instance of wrapped text
(?: # Non-capturing group
\.? # Optional literal .
\([^)]*\) # Match another instance of wrapped text
)* # The whole group is repeated 0 or more times
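To see the pattern in action with a "match all" function, here is a short Python sketch using re.findall on the text from the question. The names WRAPPED and wrapped_units are only for this example; since the pattern contains no capturing groups, findall returns the full matches.

```python
import re

# One wrapped text, followed by any number of further wrapped texts that are
# either directly adjacent or joined by a single dot.
WRAPPED = re.compile(r"\([^)]*\)(?:\.?\([^)]*\))*")


def wrapped_units(text):
    """Return each chain of adjacent parenthesized texts as one string."""
    return WRAPPED.findall(text)


text = "aas( I)f df (as)(dfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asd #54 54 !fa.) sdf"
for unit in wrapped_units(text):
    print(unit)
```

The dot after (adf) is not included in the second result because it is not followed by another opening parenthesis, so the optional group backs off and the match ends at the closing parenthesis.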
I'd go with: /(?:\(\w+\)(?:\.(?=\())?)+/g
\(\w+\) to match a-zA-Z0-9_ inside literal parentheses
(?:\.(?=\())? to match a literal . only if it's followed by another opening parenthesis
The whole thing wrapped in (?:)+ to join adjacent matches together
var str = "aas(I)f df (asdfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asdfa) sdf";
str.match(/(?:\(\w+\)(?:\.(?=\())?)+/g);
// -> ["(I)", "(asdfdsf)(adf)", "(sfg).(dfdf)", "(asdfa)"]
try [^(](\([^()]+([)](^[[:alnum:]]*)?[(][^()]+)*\))[^)]. capture group 1 is what you want.
this expression assumes that every kind of character apart from parentheses may occur in the text between parentheses and it won't match portions with nested parentheses.
This one should do the trick:
\([A-Za-z0-9]+\)