Find :: outside of markdown code formatting - regex

I have a bunch of markdown files, where I want to search for Ruby's double colon :: outside of some code formatting (e.g. where I forgot to apply proper markdown). For example
`foo::bar`
hello `foo::bar` test
` example::with::whitespace `
```
Proper::Formatted
```
```
Module::WithIndendation
```
```
Some::Nested::Modules
```
```ruby
CodeBlock::WithSyntax
```
# Some::Class
## Another::Class Heading
some text
The regex should only match Some::Class and Another::Class, because they lack the surrounding backticks and are not inside a multiline code fence block.
I have this regex, but it also matches the multiline block:
[\s]+[^`]+(::)[^`]+[\s]?
Any idea how to exclude it?
EDIT:
It would be great if the regex worked in Ruby, JS, and on the command line with grep.

For the original input, you may use this regex in Ruby to match a :: string that is
not preceded by a ` and
not preceded by a ` followed by whitespace:
Regex:
(?<!`\s)(?<!`)\b\w+::\w+
RegEx Demo 1
RegEx Breakup:
(?<!`\s): Negative lookbehind to assert that a backtick followed by whitespace is not at the preceding position
(?<!`): Negative lookbehind to assert that a backtick is not at the preceding position
\b: Match word boundary
\w+: Match 1+ word characters
::: Match a ::
\w+: Match 1+ word characters
You can use this regex in Javascript:
(?<!`\w*\s*|::)\b\w+(?:::\w+)+
RegEx Demo 2
For gnu-grep, consider this command:
grep -ZzoP '`\w*\s*\b\w+::\w+(*SKIP)(*F)|\b\w+::\w+' file |
xargs -0 printf '%s\n'
Some::Class
Another::Class
RegEx Demo 3

One can use the regular expression
rgx = /`[^`]*`|([^`\r\n]*::[^`\r\n]*)/
with the form of String#gsub that takes one argument and no block, and therefore returns an enumerator (str holding the example string given in the question):
str.gsub(rgx).select { $1 }
#=> ["# Some::Class", "## Another::Class Heading"]
The idea is that the first part of the regex's alternation, `[^`]*`, matches, but does not capture, all strings delimited by backticks (including ``), whereas the second part, ([^`\r\n]*::[^`\r\n]*), matches and captures all strings on a single line that contain '::' but no backticks. We therefore concern ourselves with captures only, by invoking select { $1 } on the enumerator returned by gsub.
The regular expression can be made self-documenting by writing it in free-spacing mode.
rgx = /
` # match a backtick
[^`]* # match zero or more characters other than backticks
` # match a backtick
| # or
( # begin capture group 1
[^`\r\n]* # match zero or more characters other than backticks and
# line terminators
:: # match two colons
[^`\r\n]* # same as the first line of capture group 1
) # end capture group 1
/x # invoke free-spacing regex definition mode
[^`\r\n] contains \r (carriage return) in the event that the file was created with Windows. If desired, [^`]* can be replaced with .*? (match zero or more characters, as few as possible).
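For readers working in Python, the same match-then-discard idea can be sketched with re.finditer, keeping only the matches in which group 1 participated. This is a rough sketch rather than part of the answer above: the shortened sample text and the name unformatted_lines are made up for illustration, and the fence marker is built programmatically so the example itself contains no literal fence.

```python
import re

FENCE = "`" * 3  # three backticks, built here to avoid a literal fence in the example

# Shortened version of the question's input (an assumption for illustration)
sample = "\n".join([
    "`foo::bar`",
    "hello `foo::bar` test",
    FENCE,
    "Proper::Formatted",
    FENCE,
    FENCE + "ruby",
    "CodeBlock::WithSyntax",
    FENCE,
    "# Some::Class",
    "## Another::Class Heading",
    "some text",
])

# First alternative consumes backtick-delimited spans without capturing;
# second alternative captures a single-line run that contains '::' and no backtick.
pattern = re.compile(r"`[^`]*`|([^`\r\n]*::[^`\r\n]*)")

def unformatted_lines(text):
    """Return the captured runs, skipping matches where group 1 did not participate."""
    return [m.group(1) for m in pattern.finditer(text) if m.group(1)]

print(unformatted_lines(sample))
```

As in the Ruby version, this relies on backticks being balanced: an unpaired backtick shifts which spans count as code.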

Related

Regex: Matches a multi-line pattern until the same one occurs again

I need to match 3 parts in the following bit:
# [1.3.3] (2019-04-16)
### Blah
* Loreum ipsum
# [1.3.0] (2019-04-01)
### Foo
* Loreum ipsum
# [1.2.0] (2019-03-05)
### Foo
* Loreum ipsum
Basically the first one would be
# [1.3.3] (2019-04-16)
### Blah
* Loreum ipsum
and so on.
I tried the following:
(# \[.*\] \([0-9\-]{10}\)(\n|.)*)
But that basically goes on to match the whole document. I need to tell it to stop matching once a new line starts with (# \[) (which would be ^(?!(# \[)).*$).
You could use the first part of your pattern to match the first line and then use a negative lookahead (?!# ) to match the following lines if they don't start with # followed by a space:
^# \[[^]]+\] \([\d-]{10}\)\n(?:(?!# ).*(?:\n|$))*
About the pattern
^# Start of string (or of a line, in multiline mode) followed by # and a space
\[[^]]+\] Match from opening till closing square bracket using a negated character class
\([\d-]{10}\)\n Match opening parenthesis then match 10 times what is listed in the character class followed by a closing parenthesis and a newline
(?: Non capturing group
(?!# ) Negative lookahead, assert what is on the right is not # and a space
.*(?:\n|$) Match any char except newline and match either a newline or assert end of the string
)* Close non capturing group and repeat 0+ times
Regex demo
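The pattern above can be tried out in Python roughly as follows. This is a minimal sketch, assuming the changelog from the question is held in a string and that re.MULTILINE is used so that ^ matches at each line start; the variable names are illustrative only.

```python
import re

# Changelog text from the question, joined without a trailing newline
changelog = "\n".join([
    "# [1.3.3] (2019-04-16)",
    "### Blah",
    "* Loreum ipsum",
    "# [1.3.0] (2019-04-01)",
    "### Foo",
    "* Loreum ipsum",
    "# [1.2.0] (2019-03-05)",
    "### Foo",
    "* Loreum ipsum",
])

# Header line, then every following line that does not start with "# "
entry = re.compile(
    r"^# \[[^]]+\] \([\d-]{10}\)\n(?:(?!# ).*(?:\n|$))*",
    re.MULTILINE,
)

entries = [m.group(0) for m in entry.finditer(changelog)]
for e in entries:
    print(repr(e))
```

Each element of entries holds one release header together with the lines that belong to it.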
You can use the following regex:
(# \[.*\] \([0-9\-]{10}\)(\n|[^#]|###)*)
This will match any text until the next hash (except if that hash is part of a group of three hashes ###) .
If you need to modify it for a varying number of hashes (strictly greater than 1), you could use
(# \[.*\] \([0-9\-]{10}\)(\n|[^#]|##+)*)
You may use
^\#\s+\[.+?(?=^\#\s+\[|\Z)
See a demo on regex101.com and mind the modifiers (singleline and multiline, s and m).
Broken down this is
^\#\s+\[ # start of the line, followed by "# ["
.+? # everything else afterwards until ...
(?=
^\#\s+\[ # ... the pattern from above right at the start of a new line
| # or
\Z # the very end of the string
)
The fastest way to go would be:
^#.*(\r?\n(?!# ).*)+
To make it more precise:
^# \[\d.*(?:\r?\n(?!# ).*)+
See live demo here

How to capture and replace all patterns on a line containing a separate pattern with Regex

I'm trying to set up a regular expression that will allow me to replace 2 spaces with a tab, but only on lines containing a certain pattern.
foo: here is some sample text
bar: here is some sample text
In the above example I want to replace any groups of 2 spaces with a tab, but only on lines that contain "bar":
foo: here is some sample text
bar: here is some sample text
The closest that I've gotten has been using this:
Find: ^(\s.*)(bar)(.*) (.*)
Replace: \1\2\3\t\4
However, this only replaces one group of two spaces at a time, so I end up with this:
foo: here is some sample text
bar: here is some sample text
I could execute the replace 3 more times and get my desired result, but I am dealing with text files that may contain hundreds of these sequences.
I am using Sublime Text, but I'm pretty sure that it uses PCRE for its Regex.
This works as well
(?m-s)(?:^(?=.*\bbar\b)|(?!^)\G).*?\K[ ]{2}
https://regex101.com/r/vnM649/1
or
https://regex101.com/r/vnM649/2
Explained
(?m-s) # Multi-line mode, not Dot-All mode
(?:
^ # Only test at BOL for 'bar'
(?= .* \b bar \b )
| # or,
(?! ^ ) # Not BOL, must have found 2 spaces in this line before
\G # Start where last 2 spaces left off
)
.*? # Minimal any character (except newline)
\K # Ignore anything that matched up to this point
[ ]{2} # 2 spaces to replace with a \t
Is it possible to translate this to work with Python?
Yes.
The \G construct makes it possible to do it all in a single-pass regex. Python's regex module supports it, but its re module does not. If using the re module, you need to do it in two steps.
First, match the line(s) containing bar, then pass each match to a callback that replaces all double spaces with tabs and returns the result as the replacement.
Sample Python code:
https://rextester.com/AYM96859
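# Python 2.7.12 (print() with a single argument also runs on Python 3)
import re

def replcall(m):
    # Replace every run of two spaces in the matched line with a tab
    contents = m.group(1)
    return re.sub(r'[ ]{2}', "\t", contents)

text = (
    r'foo: here is some sample text' + "\n"
    r'bar: here is some sample text' + "\n"
)
# Match only lines that contain the word bar and at least one double space
newstr = re.sub(r'(?m)(^(?=.*\bbar\b)(?=.*[ ]{2}).*)', replcall, text)
print(newstr)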
The regex to get the line, expanded:
(?m)
( # (1 start)
^
(?= .* \b bar \b )
(?= .* [ ]{2} )
.*
) # (1 end)
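A compact variant of the same two-step idea, for illustration: match each whole line containing bar, then replace the double spaces inside it with a plain string replace. This is a sketch under assumptions. The sample text is hypothetical, and the double space is built as a constant so it stays visible even if whitespace is collapsed in transit.

```python
import re

DOUBLE = " " * 2  # the two-space run to be replaced

# Hypothetical sample: both lines contain four runs of two spaces
words = ["here", "is", "sample", "text"]
sample = (
    DOUBLE.join(["foo:"] + words) + "\n"
    + DOUBLE.join(["bar:"] + words) + "\n"
)

# Step 1: select whole lines that contain the word "bar"
bar_line = re.compile(r"(?m)^(?=.*\bbar\b).*$")

def tabify(match):
    # Step 2: replace every non-overlapping double space in that line with a tab
    return match.group(0).replace(DOUBLE, "\t")

result = bar_line.sub(tabify, sample)
print(repr(result))
```

The foo line is left untouched because the lookahead at the start of that line fails.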
This will work:
Find: (^(?!.*bar).*)|
Replace: \1\t
(notice the 2 spaces at the end of the "find" regex) but it'll add a tab at the end of the foo line.
See here a PCRE demo.

I have a regex that returns nil values for excluded words. How do I return nothing instead?

Given the following test string:
{{one}}
<content>{{two}}</content>
{{three}}
I only want to match {{one}} and {{three}}. I have the following regex:
{{((?!#)(?!\/).*?)}}|(?:<content\b[^>]*>[^<>]*<\/content>)
That matches {{one}} and {{three}}, but also matches a nil value (see: https://rubular.com/r/E4faa6Tze04WnG). How do I only match {{one}} and {{three}} and NOT the nil value?
(that is, the regex should only return two matches instead of three)
Taken from your comment:
I have a large body of text and I want to use ruby's gsub method to replace {{tags}} that are outside of the <content> tags.
This regex should do, what you need:
(^{{(?!#|\/).*}}$)
This matches both {{one}} and {{three}}, and similar interpolations à la {{tag}}, except those: <content>{{tag}}</content>.
Can I ignore only <content> tags specifically and not other tags? For example, I tried it with other tags here: rubular.com/r/jTKxwjNuKoSjgN, which I don't want to ignore.
Sure thing. Try this one:
(?!<content>)({{(?!#|\/).*?}})(?!<\/content>)
If you need an explanation of how and why this regex works, you can take a look at the explanation section here: https://regex101.com/r/d4DEK1/1
I suggest doing it in two steps to accommodate more complex strings. I have assumed that the strings "one" and "three" are to be extracted from the following string.
str = <<-_
{{one}}
<content>cats {{two}} and <content2>{{four}}</content2> dogs</content>
{{three}}
_
r0 = /
<
([^>]+) # match >= 1 characters other than '>' in capture group 1
>
.+? # match one or more characters lazily
<\/ # match '<' then forward slash
\1 # match the contents of capture group 1
>
/x # free-spacing regex definition mode
r1 = /
(?<=\{\{) # match '{{' in a positive lookbehind
[^\}]+ # match one or more characters other than '}'
(?=\}\}) # match '}}' in a positive lookahead
/x # free-spacing regex definition mode
str.gsub(r0, '').scan(r1)
#=> ["one", "three"]
The first step is:
str.gsub(r0, '')
#=> "{{one}}\n\n{{three}}\n"
This of course works if the second line of the string is simply
"<content>{{two}}</content>\n"
The two regular expressions are conventionally written as follows.
r0 = /<([^>]+)>.+?<\/\1>/
r1 = /(?<=\{\{)[^\}]+(?=\}\})/
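The same two-step approach might look like this in Python. It is a rough sketch using the re module, with variable names chosen for illustration; the patterns are the ones from the Ruby answer, only the delimiters differ.

```python
import re

text = (
    "{{one}}\n"
    "<content>cats {{two}} and <content2>{{four}}</content2> dogs</content>\n"
    "{{three}}\n"
)

# Step 1: an opening tag, lazily anything, then the closing tag with the same name
element = re.compile(r"<([^>]+)>.+?</\1>")
# Step 2: text between '{{' and '}}', found via lookbehind and lookahead
tag_name = re.compile(r"(?<=\{\{)[^}]+(?=\}\})")

stripped = element.sub("", text)   # remove tagged elements first
names = tag_name.findall(stripped) # then collect the remaining tag names

print(repr(stripped))
print(names)
```

Because . does not match a newline by default, this assumes each tagged element sits on a single line, as in the example.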

RegEx to match some wrapped texts

Consider following text:
aas( I)f df (as)(dfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asd #54 54 !fa.) sdf
I want to retrieve the text between parentheses, but adjacent parentheses should be considered a single unit. How can I do that?
For above example desired output is:
( I)
(as)(dfdsf)(adf)
(sfg).(dfdf)
(asd #54 54 !fa.)
Assumption
No nesting (), and no escaping of ()
Parentheses are chained together with the . character, or by being right next to each other (no flexible spacing allowed).
(a)(b).(c) is considered a single token (the . is optional).
Solution
The regex below is to be used with global matching (match all) function.
\([^)]*\)(?:\.?\([^)]*\))*
Please add the delimiter on your own.
DEMO
Explanation
Break down of the regex (spacing is insignificant). After and including # are comments and not part of the regex.
\( # Literal (
[^)]* # Match 0 or more characters that are not )
\) # Literal ). These first 3 lines match an instance of wrapped text
(?: # Non-capturing group
\.? # Optional literal .
\([^)]*\) # Match another instance of wrapped text
)* # The whole group is repeated 0 or more times
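As a quick check of the pattern, here is a short Python sketch that applies it with re.findall to the text from the question. The variable names are illustrative; since the pattern has no capture groups, findall returns the full matches.

```python
import re

text = "aas( I)f df (as)(dfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asd #54 54 !fa.) sdf"

# One parenthesized chunk, followed by any number of further chunks
# that are adjacent or joined by a single '.'
chained = re.compile(r"\([^)]*\)(?:\.?\([^)]*\))*")

tokens = chained.findall(text)
for token in tokens:
    print(token)
```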
I'd go with: /(?:\(\w+\)(?:\.(?=\())?)+/g
\(\w+\) to match a-zA-Z0-9_ inside literal parentheses
(?:\.(?=\())? to match a literal . only if it's followed by another opening parenthesis
The whole thing wrapped in (?:)+ to join adjacent matches together
var str = "aas(I)f df (asdfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asdfa) sdf";
str.match(/(?:\(\w+\)(?:\.(?=\())?)+/g);
// -> ["(I)", "(asdfdsf)(adf)", "(sfg).(dfdf)", "(asdfa)"]
Try [^(](\([^()]+([)](^[[:alnum:]]*)?[(][^()]+)*\))[^)]. Capture group 1 is what you want.
This expression assumes that any character other than parentheses may occur in the text between parentheses, and it won't match portions with nested parentheses.
This one should do the trick:
\([A-Za-z0-9]+\)

regex to match RTRIM(LTRIM(xx)) = xx

I am trying to write a regex to find where I am using LTRIM/RTRIM in the WHERE clause of stored procedures.
The regex should match things like:
RTRIM(LTRIM(PGM_TYPE_CD))= 'P'))
RTRIM(LTRIM(PGM_TYPE_CD))='P'))
RTRIM(LTRIM(PGM_TYPE_CD)) = 'P'))
RTRIM(LTRIM(PGM_TYPE_CD))= P
RTRIM(LTRIM(PGM_TYPE_CD))= somethingelse))
etc...
I am trying something like...
.TRIM.*\)\s+
[RL]TRIM\s*\( will look for R or L followed by TRIM, any amount of whitespace, and then a (
Is this what you want?
[LR]TRIM\([RL]TRIM\([^)]+\)\)\s*=\s*[^)]+\)*
What that is saying is:
[LR] # Match single char, either "L" or "R"
TRIM # Match text "TRIM"
\( # Match an open parenthesis
[RL] # Match single char, either "R" or "L" (same as [LR], but easier to see intent)
TRIM # Match text "TRIM"
\( # Match an open parenthesis
[^)]+ # Match one or more of anything that isn't closing parenthesis
\)\) # Match two closing parentheses
\s* # Zero or more whitespace characters
= # Match "="
\s* # Again, optional whitespace (not req unless next bit is captured)
[^)]+ # Match one or more of anything that isn't closing parenthesis
\)* # Match zero or more closing parentheses.
If this is automated and you want to know which variables are in it, you can wrap parentheses around the relevant parts:
[LR]TRIM\([RL]TRIM\(([^)]+)\)\)\s*=\s*([^)]+)\)*
Which will give you the first and second variables in groups 1 and 2 (either \1 and \2 or $1 and $2 depending on regex used).
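To see the two groups in practice, here is a small Python sketch that runs the capturing version of the pattern against the sample lines from the question. The names trim_compare, lines and pairs are made up for illustration.

```python
import re

# Group 1: the trimmed expression; group 2: the right-hand side of '='
trim_compare = re.compile(
    r"[LR]TRIM\([RL]TRIM\(([^)]+)\)\)\s*=\s*([^)]+)\)*"
)

lines = [
    "RTRIM(LTRIM(PGM_TYPE_CD))= 'P'))",
    "RTRIM(LTRIM(PGM_TYPE_CD))='P'))",
    "RTRIM(LTRIM(PGM_TYPE_CD)) = 'P'))",
    "RTRIM(LTRIM(PGM_TYPE_CD))= P",
    "RTRIM(LTRIM(PGM_TYPE_CD))= somethingelse))",
]

pairs = [trim_compare.search(line).groups() for line in lines]
for column, value in pairs:
    print(column, "compared with", value)
```

Note that group 2 runs up to the next closing parenthesis or the end of the text, so on a longer line it may include more than the compared value.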
How about something like this:
.*[RL]TRIM\s*\(\s*[RL]TRIM\s*\(([^\)]*)\)\s*\)\s*=\s*(.*)
This will capture the inside of the trim and the right side of the = in groups 1 and 2, and should handle all whitespace in all relevant areas.