Search text with a regular expression to match outside specific characters - regex

I have text that looks like:
My name is (Richard) and I cannot do
[whatever (Jack) can't do] and
(Robert) is the same way [unlike
(Betty)] thanks (Jill)
The goal is to use a regular expression to find all parenthesized names that occur anywhere in the text EXCEPT between square brackets.
So in the text above, the result I am looking for is:
Richard
Robert
Jill

You can do it in two steps:
step1: match all bracket contents using:
\[[^\]]*\]
and replace it with ''
step2: match all the remaining parenthesized names (globally) using:
\([^)]*\)
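The two steps above can be sketched in Python as follows. This is only an illustration: the variable names are made up for the example, and a capture group is added to the step-2 pattern so that the names come back without their parentheses.

```python
import re

text = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] and "
        "(Robert) is the same way [unlike (Betty)] thanks (Jill)")

# Step 1: remove every [...] span, including anything parenthesized inside it.
without_brackets = re.sub(r'\[[^\]]*\]', '', text)

# Step 2: collect what is left inside parentheses (group 1 drops the parens).
names = re.findall(r'\(([^)]*)\)', without_brackets)

print(names)  # ['Richard', 'Robert', 'Jill']
```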

You didn't say what language you're using, so here's some Python:
>>> import re
>>> REGEX = re.compile(r'(?:[^[(]+|\(([^)]*)\)|\[[^]]*])')
>>> s="""My name is (Richard) and I cannot do [whatever (Jack) can't do] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"""
>>> list(filter(None, REGEX.findall(s)))
The output is:
['Richard', 'Robert', 'Jill']
One caveat is that this does not work with arbitrary nesting. The only nesting it's really designed to work with is one level of parens in square brackets as mentioned in the question. Arbitrary nesting can't be done with just regular expressions. (This is a consequence of the pumping lemma for regular languages.)
The regex looks for chunks of text without brackets or parens, chunks of text enclosed in parens, and chunks of text enclosed in brackets. Only text in parens (not in square brackets) is captured. Python's findall finds all matches of the regex in sequence. In some languages you may need to write a loop to repeatedly match. For non-paren matches, findall inserts an empty string in the result list, so the call to filter removes those.
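For languages without a findall equivalent, the explicit loop mentioned above might look like the following sketch, written here in Python with finditer. The pattern is the same idea as in this answer with the square brackets escaped; the function name is hypothetical.

```python
import re

# Three alternatives: plain text, a parenthesized chunk (captured), a bracketed chunk.
CHUNK = re.compile(r'[^\[(]+|\(([^)]*)\)|\[[^\]]*\]')

def names_outside_brackets(text):
    names = []
    for match in CHUNK.finditer(text):
        # Group 1 is only filled when the parenthesized alternative matched.
        if match.group(1):
            names.append(match.group(1))
    return names

sample = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] and "
          "(Robert) is the same way [unlike (Betty)] thanks (Jill)")
print(names_outside_brackets(sample))  # ['Richard', 'Robert', 'Jill']
```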

If you are using .NET you can do something like:
"(?<!\[.*?)(?<name>\(\w+\))(?>!.*\])"

It's not really the best job for a single regexp - have you considered, for example, making a copy of the string and then deleting everything in between the square brackets instead? It would then be fairly straightforward to extract things from inside the parentheses. Alternatively, you could write a very basic parser that tokenises the line (into normal text, square bracket text, and parenthesised text, I imagine) and then parses the tree that produces; it'd be more work initially but would make life much simpler if you later want to make the behaviour any more complicated.
Having said that, /(?:(?:^|\])[^\[]*)\((.*?)\)/ does the trick for your test case (but it will almost certainly have some weird behaviour if your [ and ] aren't matched properly, and I'm not convinced it's that efficient).
A quick (PHP) test case:
preg_match_all('/(?:(?:^|\])[^\[]*)\((.*?)\)/', "My name is ... (Jill)", $m);
print(implode(", ", $m[1]));
Outputs:
Richard, Robert, Jill
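The "very basic parser" idea mentioned earlier in this answer could be sketched like this in Python. It is a minimal character scanner rather than a full tokeniser, the function name is invented for the example, and it assumes that an unmatched ] should simply be ignored.

```python
def names_outside_brackets(text):
    names = []
    depth = 0      # how many square brackets are currently open
    start = None   # index just past a "(" seen outside all brackets
    for i, ch in enumerate(text):
        if ch == '[':
            depth += 1
            start = None
        elif ch == ']':
            depth = max(depth - 1, 0)  # assumption: stray "]" is ignored
        elif depth == 0:
            if ch == '(':
                start = i + 1
            elif ch == ')' and start is not None:
                names.append(text[start:i])
                start = None
    return names

sample = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] and "
          "(Robert) is the same way [unlike (Betty)] thanks (Jill)")
print(names_outside_brackets(sample))  # ['Richard', 'Robert', 'Jill']
```

Because it counts depth, this version also copes with nested square brackets, which the single-regex approaches do not.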

>>> s="My name is (Richard) and I cannot do [whatever (Jack) can't do (Jill) can] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"
>>> for item in s.split("]"):
...     st = item.split("[")[0]
...     if ")" in st:
...         for i in st.split(")"):
...             if "(" in i:
...                 print(i.split("(")[-1])
...
Richard
Robert
Jill

So you want the regex to match the name, but not the enclosing parentheses? This should do it:
[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)
As with the other answers, I'm making certain assumptions about your target string, like expecting parentheses and square brackets to be correctly balanced and not nested.
I say it should work because, although I've tested it, I don't know what language/tool you're using to do the regex matching with. We could provide higher-quality answers if we had that info; all regex flavors are not created equal.
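As one data point, here is a quick check of that pattern with Python's re module. Python is an assumed choice here, since the question does not name a language, and other flavors may behave differently.

```python
import re

# The lookahead requires that only balanced [...] groups remain up to the end,
# so a name is accepted only when it is not inside square brackets.
pattern = re.compile(r'[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)')

text = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] and "
        "(Robert) is the same way [unlike (Betty)] thanks (Jill)")

found = pattern.findall(text)
print(found)  # ['Richard', 'Robert', 'Jill']
```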


Vim regex to split string but keep the separators

To my current understanding, the pattern below should work (expected ['bar', 'FOO', 'bar']), but only the first alternative is found (zero-width matches after FOO, but not before).
echo split('barFOObar', '\v(FOO\zs|\zeFOO)') " --> ['barFOO', 'bar']
Neither could I solve it with lookahead/lookbehind.
echo split('barFOObar', '\v((FOO)\@<=|(FOO)\@=)') " --> ['bar', 'bar']
Compare this with e.g. Python:
echo py3eval("re.split('(?=FOO)|(?<=FOO)', 'barFOObar')") " --> ['bar', 'FOO', 'bar']
(Note: in Python, a paren-enclosed '(FOO)' would also work for this.)
Why don't the above examples in Vim's regex work the way I thought they should? (And also, is there a more or less straightforward way to do this in pure Vimscript then?)
There doesn't seem to be a way to accomplish that direct result using a single split(). In fact, the docs for split() mention this particular situation of preserving the delimiter, with:
If you want to keep the separator you can also use \zs at the end of the pattern:
:echo split('abc:def:ghi', ':\zs')
['abc:', 'def:', 'ghi']
Having said that, using both a lookahead and a lookbehind does actually work. In your example, you have a syntax error. Since you're using very magic mode, you shouldn't escape @, since it's already special. (Thanks @user938271 for pointing that out!)
This works:
:echo split('barFOObar', '\v((FOO)@<=|(FOO)@=)')
" --> ['bar', 'FOO', 'bar']
Regarding using the markers for \zs and \ze:
:echo split('barFOObar', '\v(FOO\zs|\zeFOO)')
" --> ['barFOO', 'bar']
So the first trouble you have here is that both expressions on each side of the | are matching the same text "FOO", so since they're identical, the first wins and you get it on the left side.
Change order and you get it on the right side:
:echo split('barFOObar', '\v(\zeFOO|FOO\zs)')
" --> ['bar', 'FOObar']
Now the question is why the second token "FOObar" isn't being split since it's matching again (the lookbehind case splits this one, right?)
Well, the answer is that it's actually being split again, but it matches on the first case of \zeFOO one more time and produces a split with the empty string. You can see that by passing a keepempty argument:
:echo split('barFOObar', '\v(\zeFOO|FOO\zs)', 1)
" --> ['bar', '', 'FOObar']
One question still unanswered here is why the lookahead/lookbehind does work, while the \zs and \ze doesn't. I think I addressed that somehow in this answer to regex usage in syntax groups.
This won't work because Vim won't scan the same text twice trying to match a different regex.
Even though the \zs makes the resulting match only include bar, Vim needs to consume FOO to be able to match that regex and it won't do so if it already matched it with the other half of the pattern.
A lookbehind with \@<= is different. The reason it works is that Vim will first search for bar (or whatever text it's considering) and then look behind to see if FOO also matches. So the pattern gets anchored on bar rather than FOO and doesn't suffer from the issue of trying to start a match on a region that already matched another expression.
You can easily visualize that difference by performing a search with Vim. Try this one:
/\v(\zeFOO|FOO\zs)
And compare it with this one:
/\v((FOO)@<=|(FOO)@=)
You'll notice the latter one will match both before and after FOO, while the former won't.
Compare this with e.g. Python [re.split] ...
in Python, a paren-enclosed '(FOO)' would also work for this.
Vim's and Python's regex engines are different beasts...
Many of the limitations in Vim's engine come from its ancestry from vi. One particular limitation is capture groups, where you're limited to 9 of them and there's no way around that.
Given that limitation, you'll find that capture groups are typically used less often (and, when used, they're less powerful) than in Python.
One option to consider is to use Python in Vim instead of Vimscript. Although typically that impacts portability, so personally I wouldn't switch for this feature alone.
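For reference, the two Python variants mentioned in the question look like this when run as plain Python. This is a small sketch; note that splitting on zero-width matches, as the lookaround form does, relies on Python 3.7 or later.

```python
import re

text = 'barFOObar'

# A capturing group in the pattern makes re.split keep the delimiter.
with_group = re.split('(FOO)', text)

# Zero-width lookahead/lookbehind split around FOO without consuming it.
with_lookaround = re.split('(?=FOO)|(?<=FOO)', text)

print(with_group)       # ['bar', 'FOO', 'bar']
print(with_lookaround)  # ['bar', 'FOO', 'bar']
```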
is there a more or less straightforward way to do this in pure Vimscript then?
One option is to reimplement a version of split() that preserves delimiters, using matchstrpos(). For example:
function! SplitDelim(expr, pat)
  let result = []
  let expr = a:expr
  while 1
    let [w, s, e] = matchstrpos(expr, a:pat)
    if s == -1
      break
    endif
    call add(result, s ? expr[:s-1] : '')
    call add(result, w)
    let expr = expr[e:]
  endwhile
  call add(result, expr)
  return result
endfunction
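To make the loop in SplitDelim() easier to follow, here is a rough Python analogue of the same algorithm. It is only a sketch: split_delim is a hypothetical name, re.search stands in for matchstrpos(), and the extra guard against an empty match at the start of the remaining text is an addition that the Vimscript version does not have.

```python
import re

def split_delim(text, pattern):
    regex = re.compile(pattern)
    result = []
    while True:
        match = regex.search(text)
        # Stop when nothing matches; also stop on an empty match at position 0,
        # which would otherwise never shrink the remaining text.
        if match is None or match.end() == 0:
            break
        result.append(text[:match.start()])  # text before the delimiter
        result.append(match.group(0))        # the delimiter itself
        text = text[match.end():]            # continue with the rest
    result.append(text)
    return result

print(split_delim('barFOObarFOObaz', 'FOO'))
# ['bar', 'FOO', 'bar', 'FOO', 'baz']
```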
You could first replace FOO with -FOO-, then split the string. For example:
:echo split(substitute('barFOObarFOObaz', 'FOO','-&-','g'),'-')
['bar', 'FOO', 'bar', 'FOO', 'baz']

Need to trim underscore from last character in string

I'm in need of assistance for the best method to remove an underscore from a derived string in Python 2.7.
I have a series of filenames I'm parsing, and the first portion gives information on the type of file. I need that data to match with a database entry.
Here's the rub, the regex findall strips the period, but the trailing underscore remains. As such, I can't get a 1:1 match in the database.
tmr_ba_incr_2016091500.csv
orm_160915.csv
TXT_MNG.160916.done
The findall gives me 3 elements in the output;
tmr_ba_incr_, 2016091500, csv
orm_, 160915, csv
TXT_MNG, 160916, done
The first element needs to have the ending underscore dropped.
I can't find a way to do this effectively.
tmr_ba_incr_ should be tmr_ba_incr
orm_ should be orm
TXT_MNG should be TXT_MNG
Can you help?
First I'd strip off the filetype with os.path.splitext
>>> import os
>>> os.path.splitext("tmr_ba_incr_2016091500.csv")
('tmr_ba_incr_2016091500', '.csv')
This is the standard way to deal with finding file extensions.
Then I'd just check that the last character was an underscore and remove it if it was:
>>> def remove_last_underscore(iterable):
...     # endswith() is also safe for an empty string, unlike iterable[-1]
...     if iterable.endswith('_'):
...         return iterable[:-1]
...     else:
...         return iterable
...
>>> remove_last_underscore("this_has_trailing_underscore_")
'this_has_trailing_underscore'
>>> remove_last_underscore("asda_asd_as")
'asda_asd_as'
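Applied to the three prefixes from the question, the same check could look like the sketch below. The helper name and the list are made up for illustration; the prefixes are assumed to be the first element that the asker's findall returns.

```python
def strip_trailing_underscore(name):
    # Remove exactly one trailing underscore, if there is one.
    return name[:-1] if name.endswith('_') else name

prefixes = ['tmr_ba_incr_', 'orm_', 'TXT_MNG']
cleaned = [strip_trailing_underscore(p) for p in prefixes]

print(cleaned)  # ['tmr_ba_incr', 'orm', 'TXT_MNG']
```

Unlike name.rstrip('_'), this removes at most one underscore, which matches the behaviour described above.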
Another way of removing the last underscore from a string is to use a regular expression.
import re
my_string = 'abc_'
re.match(r'^(.*?)_?$', my_string).group(1)
Here I match the whole string (thus ^ and $) against a pattern that lets me extract all characters lazily (.*?) before the last, optional underscore (_?).
Characters are matched lazily (.*? instead of .*) so that the group does not swallow the last underscore.
Please note that the above method is just a regular expression trick. In fact, if I needed to solve this problem in a real system maintained by different people, I would prefer shuttle87's solution because it is simply more transparent.
It simply reads:
if last character is '_':
return new string without trailing character
else
return original string
There is a famous quote from Jamie Zawinski:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
In our case this applies as well. Understanding the regular expression that I proposed requires more advanced knowledge of regular expressions. Beginner programmers might have lots of problems with reading it.
So you should treat my suggestion as a regular expression exercise, not a "clean code" solution to be applied in real systems :)
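The lazy-versus-greedy point made in this answer can be checked with a small experiment. The two helper names are hypothetical; the greedy pattern is shown only for contrast.

```python
import re

def strip_lazy(s):
    # Lazy group: stops as early as possible, leaving "_" to the optional "_?".
    return re.match(r'^(.*?)_?$', s).group(1)

def strip_greedy(s):
    # Greedy group: swallows the trailing underscore, so nothing is removed.
    return re.match(r'^(.*)_?$', s).group(1)

print(strip_lazy('abc_'))    # abc
print(strip_greedy('abc_'))  # abc_
print(strip_lazy('a_b'))     # a_b
```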

Regular Expression Match (get multiple stuff in a group)

I have trouble working on this regular expression.
Here is the string in one line, and I want to be able to extract the things in the swatchColorList; specifically, I want the words Natural Burlap, Navy, Red.
What I have tried is '[(.*?)]' to get everything inside the brackets, but what I really want is to do it in one line. Is that possible, or do I need to do this in two steps?
Thanks
{"id":"1349306","categoryName":"Kids","imageSource":"7/optimized/8769127_fpx.tif","swatchColorList":[{"Natural Burlap":"8/optimized/8769128_fpx.tif"},{"Navy":"5/optimized/8748315_fpx.tif"},{"Red":"8/optimized/8748318_fpx.tif"}],"suppressColorSwatches":false,"primaryColor":"Natural Burlap","clickableSwatch":true,"selectedColorNameID":"Natural Burlap","moreColors":false,"suppressProductAttribute":false,"colorFamily":{"Natural Burlap":"Ivory/Cream"},"maxQuantity":6}
You can try this regex
(?<=[[,]\{\")[^"]+
If lookbehind is not supported, you can use
[[,]\{"([^"]+)
This will save needed word in group 1.
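A quick way to try both patterns is the Python sketch below. Python is an assumed choice, and the sample string is an abbreviated version of the one in the question, keeping the swatchColorList and colorFamily parts.

```python
import re

# Abbreviated sample from the question (several unrelated keys left out).
data = ('{"id":"1349306","swatchColorList":[{"Natural Burlap":"8/optimized/8769128_fpx.tif"},'
        '{"Navy":"5/optimized/8748315_fpx.tif"},{"Red":"8/optimized/8748318_fpx.tif"}],'
        '"colorFamily":{"Natural Burlap":"Ivory/Cream"},"maxQuantity":6}')

# Lookbehind version: the match itself is the colour name.
with_lookbehind = re.findall(r'(?<=[\[,]\{")[^"]+', data)

# Group version: the colour name is captured in group 1.
with_group = re.findall(r'[\[,]\{"([^"]+)', data)

print(with_lookbehind)  # ['Natural Burlap', 'Navy', 'Red']
print(with_group)       # ['Natural Burlap', 'Navy', 'Red']
```

The colorFamily entry is not picked up because its opening brace follows a colon rather than [ or a comma.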
import json
str = '{"id":"1349306","categoryName":"Kids","imageSource":"7/optimized/8769127_fpx.tif","swatchColorList":[{"Natural Burlap":"8/optimized/8769128_fpx.tif"},{"Navy":"5/optimized/8748315_fpx.tif"},{"Red":"8/optimized/8748318_fpx.tif"}],"suppressColorSwatches":false,"primaryColor":"Natural Burlap","clickableSwatch":true,"selectedColorNameID":"Natural Burlap","moreColors":false,"suppressProductAttribute":false,"colorFamily":{"Natural Burlap":"Ivory/Cream"},"maxQuantity":6}'
obj = json.loads(str)
words = []
for thing in obj["swatchColorList"]:
    for word in thing:
        words.append(word)
        print(word)
Output will be
Natural Burlap
Navy
Red
And words will be stored to words list. I realize this is not a regex but I want to discourage the use of regex on serialized object notations as regular expressions are not intended for the purpose of parsing strings with nested expressions.

How do I properly format this Regex search in R? It works fine in the online tester

In R, I have a column of data in a data-frame, and each element looks something like this:
Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae
What I want is the section after the last semicolon, and I've been trying to use 'sub' and also duplicating the existing column and create a new one with just the endings kept. In essence, I want this (the genus):
Marinilabiaceae
A snippet of the code looks like this:
mydata$new_column<- sub("([\\s\\S]*;)", "", mydata$old_column)
In this situation, I am using \\ rather than \ because of R's escape sequences. The sub replaces the parts I don't want and updates it to the new column. I've tested the Regex several times in places such as this: http://regex101.com/r/kS7fD8/1
However, I'm still struggling because the results are very bizarre. Now my new column is populated with the organism's domain rather than the genus: Bacteria.
How do I resolve this? Are there any good easy-to-understand resources for learning more about R's Regex formats?
Starting with your simple string,
string <- "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"
You can remove everything up to the last semicolon with "^(.*);" in your call to sub
> sub("^(.*);", "", string)
# [1] "Marinilabiaceae"
You can also use strsplit with tail
> tail(strsplit(string, ";")[[1]], 1)
# [1] "Marinilabiaceae"
Your regular expression, ([\\s\\S]*;), doesn't behave as expected because R's default regex engine is not PCRE: it does not treat \\s and \\S inside a bracket expression as the "match any character" idiom the way PCRE does. It worked on the regex101 site because that regex tester defaults to pcre (php) (see "Flavor" in the top-left corner). Passing perl = TRUE to sub makes R use PCRE, so the original pattern then behaves as it did in the tester. R also requires extra backslash escape characters in many situations. For reference, this R text processing wiki has come in handy for me many times before.
Make the first group greedy and take the second captured group:
(.*);(.*)
     ^^^^------- Marinilabiaceae
Here is a regex101 demo.
Or, to get the first word, make the first group non-greedy and take the first captured group:
(.*?);(.*)
^^^^^------- Bacteria
Here is a demo.
To extract everything after the last ; to the end of the line you can use:
[^;]*?$
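The same ideas, sketched in Python rather than R purely for illustration (the variable names are made up): a greedy prefix removal, a split on the last semicolon, and the end-anchored pattern above.

```python
import re

taxonomy = "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"

# Greedy .* runs to the LAST semicolon, so everything up to it is removed.
by_sub = re.sub(r'^.*;', '', taxonomy)

# Split once from the right and keep the final piece.
by_split = taxonomy.rsplit(';', 1)[-1]

# Match the run of non-semicolon characters that ends the string.
by_search = re.search(r'[^;]*?$', taxonomy).group(0)

print(by_sub, by_split, by_search)
```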

how to avoid to match the last letter in this regexp?

I have a question about regexp in tcl:
first output: TIP_12.3.4 %
second output: TIP_12.3.4 %
and sometimes the output maybe look like:
first output: TIP_12 %
second output: TIP_12 %
I want to get the number 12.3.4 or 12 using the following regexp:
output: TIP_(/[0-9].*/[0-9])
but why does it not match 12.3.4 or 12?
You need to escape the dot, else it stands for "match every character". Also, I'm not sure about the slashes in your regexp. Better solution:
/TIP_(\d+\.?)+/
Your problem is that / is not special in Tcl's regular expression language at all. It's just an ordinary printable non-letter character. (Other languages are a little different, as it is quite common to enclose regular expressions in / characters; this is not the case in Tcl.) Because it is a simple literal, using it in your RE makes it expect it in the input (despite it not being there); unsurprisingly, that makes the RE not match.
Fixing things: I'd use a regular expression like this: output: TIP_([\d.]+) under the assumption that the data is reasonably well formatted. That would lead to code like this:
regexp {output: TIP_([0-9.]+)} $input -> dottedDigits
Everything not in parentheses is a literal here, so that the code is able to find what to match. Inside the parentheses (the bit we're saving for later) we want one or more digits or periods; putting them inside a square-bracketed-set is perfect and simple. The net effect is to store the 12.3.4 in the variable dottedDigits (if found) and to yield a boolean result that says whether it matched (i.e., you can put it in an if condition usefully).
NB: the regular expression is enclosed in braces because square brackets are also Tcl language metacharacters; putting the RE in braces avoids trouble with misinterpretation of your script. (You could use backslashes instead, but they're ugly…)
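The pattern itself is not Tcl-specific. As a rough cross-check, here it is in Python (an assumed language, used only for illustration), run against the two sample lines from the question:

```python
import re

lines = ["first output: TIP_12.3.4 %", "second output: TIP_12 %"]

# Literal prefix, then one or more digits or periods captured in group 1.
pattern = re.compile(r'output: TIP_([0-9.]+)')

versions = []
for line in lines:
    match = pattern.search(line)
    if match:  # search() returns None when the line does not match
        versions.append(match.group(1))

print(versions)  # ['12.3.4', '12']
```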
Try this :
output: TIP_(/([0-9\.^%]*)/[0-9])
Capture group 1.
Demo here :
http://regexr.com?31f6g
The following expression works for me:
{TIP_((\d+\.?)+)}