RegEx to match some wrapped texts

Consider the following text:
aas( I)f df (as)(dfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asd #54 54 !fa.) sdf
I want to retrieve the text between parentheses, but adjacent parentheses should be considered a single unit. How can I do that?
For the above example, the desired output is:
( I)
(as)(dfdsf)(adf)
(sfg).(dfdf)
(asd #54 54 !fa.)

Assumption
No nesting (), and no escaping of ()
Parentheses are chained together with the . character, or by being right next to each other (no flexible spacing allowed).
(a)(b).(c) is considered a single token (the . is optional).
Solution
The regex below is meant to be used with a global matching (match all) function.
\([^)]*\)(?:\.?\([^)]*\))*
Please add the delimiter on your own.
DEMO
Explanation
Breakdown of the regex (spacing is insignificant). Everything from # onward is a comment and not part of the regex.
\( # Literal (
[^)]* # Match 0 or more characters that are not )
\) # Literal ). These first 3 lines match an instance of wrapped text
(?: # Non-capturing group
\.? # Optional literal .
\([^)]*\) # Match another instance of wrapped text
)* # The whole group is repeated 0 or more times
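As a rough illustration of the global-matching step, here is a minimal JavaScript sketch run against the sample text from the question. The variable names are made up for this example.

```javascript
// Sample text from the question.
const text = "aas( I)f df (as)(dfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asd #54 54 !fa.) sdf";

// One wrapped text, followed by zero or more wrapped texts that are
// either directly adjacent or joined by a single ".".
const chained = /\([^)]*\)(?:\.?\([^)]*\))*/g;

// String.prototype.match with the g flag returns every match (or null if none).
const tokens = text.match(chained) ?? [];
console.log(tokens);
// -> ["( I)", "(as)(dfdsf)(adf)", "(sfg).(dfdf)", "(asd #54 54 !fa.)"]
```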

I'd go with: /(?:\(\w+\)(?:\.(?=\())?)+/g
\(\w+\) to match a-zA-Z0-9_ inside literal parentheses
(?:\.(?=\())? to match a literal . only if it's followed by another opening parenthesis
The whole thing wrapped in (?:)+ to join adjacent captures together
var str = "aas(I)f df (asdfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asdfa) sdf";
str.match(/(?:\(\w+\)(?:\.(?=\())?)+/g);
// -> ["(I)", "(asdfdsf)(adf)", "(sfg).(dfdf)", "(asdfa)"]

try [^(](\([^()]+([)](^[[:alnum:]]*)?[(][^()]+)*\))[^)]. capture group 1 is what you want.
This expression assumes that any character other than a parenthesis may occur in the text between parentheses, and it won't match portions with nested parentheses.

This one should do the trick:
\([A-Za-z0-9]+\)


Find :: outside of markdown code formatting

I have a bunch of markdown files, where I want to search for Ruby's double colon :: outside of some code formatting (e.g. where I forgot to apply proper markdown). For example
`foo::bar`
hello `foo::bar` test
` example::with::whitespace `
```
Proper::Formatted
```
```
Module::WithIndendation
```
```
Some::Nested::Modules
```
```ruby
CodeBlock::WithSyntax
```
# Some::Class
## Another::Class Heading
some text
The regex should only match Some::Class and Another::Class, because they lack the surrounding backticks and are not inside a multiline code fence block.
I have this regex, but it also matches the multiline block:
[\s]+[^`]+(::)[^`]+[\s]?
Any idea how to exclude this?
EDIT:
It would be great, if the regex would work in Ruby, JS and on the command line for grep.
For the original input, you may use this regex in Ruby to match a :: string that is
not preceded by a ` and
not preceded by a ` followed by a whitespace character:
Regex:
(?<!`\s)(?<!`)\b\w+::\w+
RegEx Demo 1
RegEx Breakup:
(?<!`\s): Negative lookbehind to assert that a ` followed by whitespace is not at the preceding position
(?<!`): Negative lookbehind to assert that a ` is not at the preceding position
\b: Match word boundary
\w+: Match 1+ word characters
::: Match a ::
\w+: Match 1+ word characters
You can use this regex in Javascript:
(?<!`\w*\s*|::)\b\w+(?:::\w+)+
RegEx Demo 2
For gnu-grep, consider this command:
grep -ZzoP '`\w*\s*\b\w+::\w+(*SKIP)(*F)|\b\w+::\w+' file |
xargs -0 printf '%s\n'
Some::Class
Another::Class
RegEx Demo 3
One can use the regular expression
rgx = /`[^`]*`|([^`\r\n]*::[^`\r\n]*)/
with the form of String#gsub that takes one argument and no block, and therefore returns an enumerator (str holding the example string given in the question):
str.gsub(rgx).select { $1 }
#=> ["# Some::Class", "## Another::Class Heading"]
The idea is that the first part of the regex's alternation, `[^`]*`, matches, but does not capture, all strings delimited by backticks (including ``), whereas the second part, ([^`\r\n]*::[^`\r\n]*), matches and captures all strings on a single line that contain '::' but no backticks. We therefore concern ourselves with captures only, by invoking select { $1 } on the enumerator returned by gsub.
The regular expression can be made self-documenting by writing it in free-spacing mode.
rgx = /
` # match a backtick
[^`]* # match zero or more characters other than backticks
` # match a backtick
| # or
( # begin capture group 1
[^`\r\n]* # match zero or more characters other than backticks and
# line terminators
:: # match two colons
[^`\r\n]* # ditto line before previous
) # end capture group 1
/x # invoke free-spacing regex definition mode
[^`\r\n] contains \r (carriage return) in the event that the file was created with Windows. If desired, [^`]* can be replaced with .*? (match zero or more characters, as few as possible).
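The same match-or-capture idea can be sketched in JavaScript, which the question also asks about. This is only an illustration: the input is a shortened version of the question's sample, and the variable names are invented here. matchAll stands in for Ruby's gsub enumerator, and the filter on group 1 plays the role of select { $1 }.

```javascript
// Shortened version of the question's sample (assumed for illustration).
const markdown = [
  "hello `foo::bar` test",
  "# Some::Class",
  "## Another::Class Heading",
  "some text",
].join("\n");

// Alternative 1 consumes backtick-delimited spans without capturing them.
// Alternative 2 captures a backtick-free stretch of one line that contains "::".
const rgx = /`[^`]*`|([^`\r\n]*::[^`\r\n]*)/g;

const unformatted = [...markdown.matchAll(rgx)]
  .filter((m) => m[1] !== undefined) // keep only matches where group 1 took part
  .map((m) => m[1]);

console.log(unformatted);
// -> ["# Some::Class", "## Another::Class Heading"]
```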

Regular Expression Nucleotide Search

I am trying to find a regular expression that tells me whether a dinucleotide (two letters) appears twice in a row in my sequence. Here is an example:
Let's suppose I have this sequence (The character ; is to make clear that I am talking about dinucleotides):
"AT;GC;TA;CC;AG;AG;CC;CA;TA;TA"
The result I expect is that it matches the pattern AGAG and TATA.
I have already tried the following, but it fails because it matches any two consecutive dinucleotides, not two copies of the same one:
([ATGC]{2}){2}
You will need to use backreferences.
Start with matching one pair:
[ATGC]{2}
will match any pair of two of the four letters.
You need to put that in capturing parentheses and refer to the contents of the parentheses with \1, like so:
([ATGC]{2});\1
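A small JavaScript sketch of the backreference in action, assuming the ;-separated sequence from the question (the variable names are illustrative):

```javascript
// Sequence from the question, with ";" separating dinucleotides.
const sequence = "AT;GC;TA;CC;AG;AG;CC;CA;TA;TA";

// Group 1 captures one dinucleotide; \1 requires the very same text again after ";".
const repeated = /([ATGC]{2});\1/g;

const pairs = [...sequence.matchAll(repeated)].map((m) => m[0]);
console.log(pairs);
// -> ["AG;AG", "TA;TA"]
```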
Suppose the string were
"TA;TA;GC;TA;CC;AG;AG;CC;CA;TA;TA"
 ^^ ^^          ^^ ^^       ^^ ^^
If you wish to match "TA" twice (and "AG" once) you could apply @Andy's solution.
If you wish to match "TA" just once, no matter the number of instances of "TA;TA" in the string, you could match
([ATGC]{2});\1(?!.*\1;\1)
and retrieve the contents of capture group 1.
Demo
The expression can be broken down as follows.
([ATGC]{2}) # match two characters, each from the character class,
# and save to capture group 1
;\1 # match ';' followed by the content of capture group 1
(?! # begin a negative lookahead
.* # match zero or more characters
\1;\1 # match the content of capture group 1 followed by ';'
# followed by the content of capture group 1
) # end negative lookahead
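To see the effect of the negative lookahead, here is a short JavaScript sketch on the string above. The variable names are made up for the example. Only the last occurrence of each repeated pair survives, so "TA" is reported once.

```javascript
const sequence = "TA;TA;GC;TA;CC;AG;AG;CC;CA;TA;TA";

// Match a doubled dinucleotide only if the same doubled pair
// does not occur again later in the string.
const lastRepeat = /([ATGC]{2});\1(?!.*\1;\1)/g;

// Collect the content of capture group 1 for every match.
const distinct = [...sequence.matchAll(lastRepeat)].map((m) => m[1]);
console.log(distinct);
// -> ["AG", "TA"]
```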

Match string between delimiters, but ignore matches with specific substring

I have to parse all the text inside parentheses, except text that contains "GST"
e.g:
(AUSTRALIAN RED CROSS – ATHERTON)
(Total GST for this Invoice $1,104.96)
today for a quote (07) 55394226 − admin.nerang#waste.com.au − this applies to your Nerang services.
expected parsed value:
AUSTRALIAN RED CROSS – ATHERTON
I am trying:
^\(((?!GST).)*$
But it's only matching the value and not grouping it correctly.
https://regex101.com/r/HndrUv/1
What would be the correct regex for the same?
This regex should work to get the expected string:
^\((?!.*GST)(.*)\)$
It first uses a negative lookahead to check that the text does not match .*GST, that is, that GST does not appear anywhere later on the line. If that holds, it then captures the entire text.
(?!.*GST)(.*)
All that is then surrounded by \( and \) to leave it out of the capturing group.
\((?!.*GST)(.*)\)
Finally you add the BOL and EOL symbols and you get the result.
^\((?!.*GST)(.*)\)$
The expected value is saved in the first capture group (.*).
You can use
^\((?![^()]*\bGST\b)([^()]*)\)$
See the regex demo. Details:
^ - start of string
\( - a ( char
(?![^()]*\bGST\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there are zero or more chars other than ) and ( and then GST as a whole word (remove \bs if you do not need whole word matching)
([^()]*) - Group 1: any zero or more chars other than ) and (
\) - a ) char
$ - end of string
Bonus:
If substrings in longer texts need to be matched, too, you need to remove ^ and $ anchors in the above regex.
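A minimal JavaScript sketch of the anchored pattern applied line by line. Two things here are assumptions and not part of the answer: the m (multiline) flag, added so that ^ and $ match at line boundaries, and the shortened third sample line.

```javascript
const lines = [
  "(AUSTRALIAN RED CROSS – ATHERTON)",
  "(Total GST for this Invoice $1,104.96)",
  "today for a quote (07) 55394226", // shortened third line, for illustration
].join("\n");

// The "m" flag (an assumption of this sketch) makes ^ and $ match per line.
const noGst = /^\((?![^()]*\bGST\b)([^()]*)\)$/gm;

// Group 1 holds the text between the parentheses.
const parsed = [...lines.matchAll(noGst)].map((m) => m[1]);
console.log(parsed);
// -> ["AUSTRALIAN RED CROSS – ATHERTON"]
```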

The different behavior of OR operator in regex when captured or not

I have two regular expressions: one is ^0|[1-9][0-9]*$, the other is ^(0|[1-9][0-9]*)$. The first expression matches the string "01", while the latter does not. What is the difference between the two expressions? In my opinion, the latter only adds a capture of the matched string. I want to know why the latter can't match the string "01".
See graphic explanation
^0|[1-9][0-9]*$
Debuggex Demo
Versus
^(0|[1-9][0-9]*)$
Debuggex Demo
So the second RegEx requires the whole string to be either "0" or a run of digits that starts with a digit from 1 to 9.
Look at them this way:
^0 # Match a 0 at the start of the string
| # or
[1-9][0-9]*$ # match a number >= 1 (no leading zero) at the end of the string.
versus
^ # Match the start of the string.
( # Start of group 1:
0 # Match a zero
| # or
[1-9][0-9]* # a number >= 1 (no leading zero).
) # End of group 1.
$ # Match the end of the string.
The alternation extends to the anchors in the first example whereas it's contained within the group in the second example.
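The difference can be checked with a quick JavaScript sketch. The test strings other than "01" are chosen here just for illustration.

```javascript
// Without a group, the alternation splits the whole pattern:
// (^0) OR ([1-9][0-9]*$).
const ungrouped = /^0|[1-9][0-9]*$/;

// With a group, both anchors apply to whichever alternative matches.
const grouped = /^(0|[1-9][0-9]*)$/;

const results = ["0", "10", "01", "x5"].map((s) => [s, ungrouped.test(s), grouped.test(s)]);
console.log(results);
// -> [["0", true, true], ["10", true, true], ["01", true, false], ["x5", true, false]]
```

"01" passes the ungrouped pattern because ^0 alone already matches, and "x5" passes because the second alternative only needs to match at the end of the string.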

regex to match RTRIM(LTRIM(xx)) = xx

I am trying to write a regex to find where I am using LTRIM/RTRIM in the WHERE clause of stored procedures.
The regex should match things like:
RTRIM(LTRIM(PGM_TYPE_CD))= 'P'))
RTRIM(LTRIM(PGM_TYPE_CD))='P'))
RTRIM(LTRIM(PGM_TYPE_CD)) = 'P'))
RTRIM(LTRIM(PGM_TYPE_CD))= P
RTRIM(LTRIM(PGM_TYPE_CD))= somethingelse))
etc...
I am trying something like...
.TRIM.*\)\s+
[RL]TRIM\s*\( Will look for R or L followed by TRIM, any number of whitespace, and then a (
Is this what you want?
[LR]TRIM\([RL]TRIM\([^)]+\)\)\s*=\s*[^)]+\)*
Here is what it does:
[LR] # Match single char, either "L" or "R"
TRIM # Match text "TRIM"
\( # Match an open parenthesis
[RL] # Match single char, either "R" or "L" (same as [LR], but easier to see intent)
TRIM # Match text "TRIM"
\( # Match an open parenthesis
[^)]+ # Match one or more of anything that isn't closing parenthesis
\)\) # Match two closing parentheses
\s* # Zero or more whitespace characters
= # Match "="
\s* # Again, optional whitespace (not req unless next bit is captured)
[^)]+ # Match one or more of anything that isn't closing parenthesis
\)* # Match zero or more closing parentheses.
If this is automated and you want to know which variables are in it, you can wrap parentheses around the relevant parts:
[LR]TRIM\([RL]TRIM\(([^)]+)\)\)\s*=\s*([^)]+)\)*
Which will give you the first and second variables in groups 1 and 2 (either \1 and \2 or $1 and $2 depending on regex used).
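As an illustration, here is a minimal JavaScript sketch that applies the capturing version to a few of the sample lines from the question. The variable names are invented for this example.

```javascript
const trimCompare = /[LR]TRIM\([RL]TRIM\(([^)]+)\)\)\s*=\s*([^)]+)\)*/;

const samples = [
  "RTRIM(LTRIM(PGM_TYPE_CD))= 'P'))",
  "RTRIM(LTRIM(PGM_TYPE_CD)) = 'P'))",
  "RTRIM(LTRIM(PGM_TYPE_CD))= somethingelse))",
];

// For each line, pull out group 1 (the trimmed column) and group 2 (the compared value).
const extracted = samples.map((s) => {
  const m = s.match(trimCompare);
  return m ? [m[1], m[2]] : null;
});
console.log(extracted);
// -> [["PGM_TYPE_CD", "'P'"], ["PGM_TYPE_CD", "'P'"], ["PGM_TYPE_CD", "somethingelse"]]
```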
How about something like this:
.*[RL]TRIM\s*\(\s*[RL]TRIM\s*\(([^\)]*)\)\s*\)\s*=\s*(.*)
This will capture the inside of the trim and the right side of the = in groups 1 and 2, and should handle all whitespace in all relevant areas.