Regular expression to match text between either square or curly brackets - regex

Related to my previous question, I have a string in the following format:
this {is} a [sample] string with [some] {special} words. [another one]
What is the regular expression to extract the words within either square or curly brackets, i.e.
{is}
[sample]
[some]
{special}
[another one]
Note: In my use case, brackets cannot be nested. I would also like to keep the enclosing characters, so that I can tell the difference between them when processing the results.

Simply or (|) the different things you wish to match together:
\[.*?\]|\{.*?\}
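As a quick check of the alternation above, here is a minimal Python sketch (Python is an assumption; the question does not name a language). The variable names are illustrative only.

```python
import re

# Sample string from the question.
text = "this {is} a [sample] string with [some] {special} words. [another one]"

# Either a [...] span or a {...} span, each matched lazily.
pattern = re.compile(r"\[.*?\]|\{.*?\}")

# findall returns whole matches because the pattern has no capture groups,
# so the enclosing brackets are kept.
matches = pattern.findall(text)
print(matches)
```

Because each alternative has its own closing bracket, a span opened with [ can only be closed with ].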

This one seems to work:
[[{].*?[}\]]
Or this one, if you want to catch the cases mentioned in the comments (the first pattern also accepts mismatched pairs such as [sample}):
\[.*?\]|{.*?}
You can use an online regex tester to try these things out. I think http://gskinner.com/RegExr/ is one of the more user-friendly options.

regular expression replace removes first and last character when using $1

I have string like this:
&breakUp=Mumbai;city,Puma;brand&
where Mumbai;city and Puma;brand are filters (let's say) separated by a comma (,). I have to add more filters, like Delhi;State.
I am using the following regular expression to find the above string:
&breakUp=.([\w;,]*).&
and the following replacement string:
&breakUp=$1,Delhi;State&
It finds the string correctly, but the replacement drops the first and last characters and gives the following result:
&breakUp=umbai;city,Puma;bran,Delhi;State&
How to resolve this?
Also, if I have no filters, I don't want that first comma. For example,
&breakUp=&
should become
&breakUp=Delhi;State&
How to do it?
Your expression is almost fine; it just has two extra . characters, each of which matches (and consumes) one character outside the capture group. Remove them:
&breakUp=([\w;,]*)&
In this demo, the expression is explained, if you might be interested.
To bypass &breakUp=&, we can likely apply this expression:
&breakUp=([^&]+)&
Your problem seems to be the leading and trailing periods: each one matches any single character, so the first and last characters fall outside the capture group.
Try using this regex:
&breakUp=([\w;,]*)&
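For the second part of the question (no leading comma when there are no filters yet), one option is to build the replacement in code rather than in a fixed replacement string. This is a minimal Python sketch; the question does not say which language is in use, and the function name add_filter is hypothetical.

```python
import re

def add_filter(query, new_filter):
    """Append new_filter to the breakUp list, adding a comma only if needed."""
    def build(match):
        existing = match.group(1)  # may be the empty string
        combined = existing + "," + new_filter if existing else new_filter
        return "&breakUp=" + combined + "&"

    # Same pattern as above, without the two stray dots.
    return re.sub(r"&breakUp=([\w;,]*)&", build, query)

print(add_filter("&breakUp=Mumbai;city,Puma;brand&", "Delhi;State"))
print(add_filter("&breakUp=&", "Delhi;State"))
```

Passing a function as the replacement lets the comma depend on whether the captured group is empty.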

How do you match a pattern skipping exceptions?

In vim, I'd like to match a regular expression in a search and replace operation, but with exceptions — a list of matches that I want to skip.
For example, suppose I have the text:
-one- lorem ipsum -two- blah blah -three- now is the time -four- the quick brown -five- etc. etc.
(but with lots of other possibilities) and I want to match -\(\w\+\)- and replace it with *\1* but skipping over (not matching) -two- and -four-, so the result would be:
*one* lorem ipsum -two- blah blah *three* now is the time -four- the quick brown *five* etc. etc.
It seems like I should be able to use some kind of assertion (lookbehind, lookahead, something) for this, but I'm coming up blank.
You're looking for a negative lookahead assertion. In Vim, that's done via \@! (see :help /\@!), like (?!pattern) in Perl.
Basically, you say don't match FOO here, and in general match word characters:
/-\(\%(FOO\)\@!\w\+\)-/
Note how I'm using non-capturing groups (:help /\%(). What's still missing is an assertion on the end, so the above would also exclude -FOOBAR-. As we have a unique end delimiter here, it's easiest to append that:
/-\(\%(FOO-\)\@!\w\+\)-/
Applied to your example, you just need to introduce two branches (for the two exclusions) in place of FOO, and you're done:
/-\(\%(two-\|four-\)\@!\w\+\)-/
Or, by factoring out the duplicated end delimiter:
/-\(\%(\%(two\|four\)-\)\@!\w\+\)-/
This matches any word characters in between dashes, except if those words form either two or four.
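For readers more familiar with Perl-style syntax, the same negative-lookahead idea can be sketched in Python (a swapped-in language for illustration, not Vim itself). The pattern below is a translation of the final Vim pattern, not something from the original answer.

```python
import re

text = ("-one- lorem ipsum -two- blah blah -three- now is the time "
        "-four- the quick brown -five- etc. etc.")

# After the opening dash, refuse to match if "two-" or "four-" follows;
# otherwise capture the word between the dashes.
pattern = re.compile(r"-(?!(?:two|four)-)(\w+)-")

result = pattern.sub(r"*\1*", text)
print(result)
```

Including the closing dash inside the lookahead keeps words such as -twofold- matchable, mirroring the -FOOBAR- remark above.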
The negative lookahead in my other answer is the direct solution, but its syntax is a bit complex (and there can be patterns where the rules for the delimiter are not so simple, and the result then is much less readable).
As you're using substitution, an alternative is to put the selection logic into the replacement part of :substitute. Vim allows a Vimscript expression in there via :help sub-replace-expression.
We have the captured word in submatch(1) (equivalent to \1 in a normal replacement), and now just need to check for the two excluded words; if it's one of those, do a no-op substitution by returning the original full match (submatch(0)), else return the captured word wrapped in asterisks.
:substitute/-\(\w\+\)-/\=index(['two', 'four'], submatch(1)) == -1 ? '*' . submatch(1) . '*' : submatch(0)/g
It's not shorter than the lookahead pattern, so here I would still use the pattern. But in general, it's good to know that there's more than one way to do it :-)
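The idea of moving the selection logic into the replacement also exists outside Vim. A minimal Python sketch, assuming the same sample text; the names EXCLUDED and star_unless_excluded are hypothetical.

```python
import re

EXCLUDED = {"two", "four"}

def star_unless_excluded(match):
    """Return the match unchanged for excluded words, else wrap it in asterisks."""
    word = match.group(1)
    if word in EXCLUDED:
        return match.group(0)  # no-op substitution
    return "*" + word + "*"

text = ("-one- lorem ipsum -two- blah blah -three- now is the time "
        "-four- the quick brown -five- etc. etc.")

# The pattern stays simple; the exceptions live in the callback.
result = re.sub(r"-(\w+)-", star_unless_excluded, text)
print(result)
```

This keeps the regular expression readable when the exclusion list grows or the delimiter rules get more complicated.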

Regex expression to extract everything inside brackets

I need to extract the content inside the brackets () from the following string in C++:
#82=IFCCLASSIFICATIONREFERENCE($,'E05.11.a','Rectangular',#28);
I tried following regex but it gives an output with brackets intact.
std::regex e2 ("\\((.*?)\\)");
if (std::regex_search(sLine,m,e2)){
}
Output should be:
$,'E05.11.a','Rectangular',#28
The result you are looking for should be in the first matching subexpression, i.e. contained in the [m[1].first, m[1].second) interval.
This is because your regex matches also the enclosing parentheses, but you specified a grouping subexpression, i.e. (.*?). Here is a starting point to some documentation
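Use lookarounds (a lookbehind plus a lookahead): "(?<=\\()[^)]*?(?=\\))". Watch out, as this won't work for nested parentheses. Also note that std::regex's default ECMAScript grammar does not support lookbehind, so this pattern needs an engine that does (for example Boost.Regex or PCRE).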
You can also use lookarounds:
(?<=\().*(?=\))
Try this; I only tested it in one tester, but it worked. It looks for any characters after a ( and before a ), without including the parentheses themselves.
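To compare the two approaches from the answers above (read the capture group, or use lookarounds so the whole match excludes the parentheses), here is a minimal Python sketch. Python is used only for illustration; the question itself is about C++.

```python
import re

line = "#82=IFCCLASSIFICATIONREFERENCE($,'E05.11.a','Rectangular',#28);"

# Approach 1: match the parentheses, but read only capture group 1.
by_group = re.search(r"\((.*?)\)", line).group(1)

# Approach 2: lookbehind and lookahead, so the whole match is the content.
by_lookaround = re.search(r"(?<=\()[^)]*(?=\))", line).group(0)

print(by_group)
print(by_lookaround)
```

In C++ the first approach corresponds to reading m[1] after std::regex_search, with m[0] holding the full match including the parentheses.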

how to avoid to match the last letter in this regexp?

I have a question about regexp in Tcl:
first output: TIP_12.3.4 %
second output: TIP_12.3.4 %
and sometimes the output may look like:
first output: TIP_12 %
second output: TIP_12 %
I want to get the number 12.3.4 or 12 using the following regexp:
output: TIP_(/[0-9].*/[0-9])
but why does it not match 12.3.4 or 12?
You need to escape the dot, else it stands for "match every character". Also, I'm not sure about the slashes in your regexp. Better solution:
/TIP_(\d+\.?)+/
Your problem is that / is not special in Tcl's regular expression language at all. It's just an ordinary printable non-letter character. (Other languages are a little different, as it is quite common to enclose regular expressions in / characters; this is not the case in Tcl.) Because it is a simple literal, using it in your RE makes it expect it in the input (despite it not being there); unsurprisingly, that makes the RE not match.
Fixing things: I'd use a regular expression like this: output: TIP_([\d.]+) under the assumption that the data is reasonably well formatted. That would lead to code like this:
regexp {output: TIP_([0-9.]+)} $input -> dottedDigits
Everything not in parentheses is a literal here, so that the code is able to find what to match. Inside the parentheses (the bit we're saving for later) we want one or more digits or periods; putting them inside a square-bracketed-set is perfect and simple. The net effect is to store the 12.3.4 in the variable dottedDigits (if found) and to yield a boolean result that says whether it matched (i.e., you can put it in an if condition usefully).
NB: the regular expression is enclosed in braces because square brackets are also Tcl language metacharacters; putting the RE in braces avoids trouble with misinterpretation of your script. (You could use backslashes instead, but they're ugly…)
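The same pattern can be tried outside Tcl. A minimal Python sketch, used here only to illustrate the pattern's behaviour on the two sample outputs; the helper name dotted_digits is hypothetical.

```python
import re

# Literal prefix, then one or more digits or periods captured in group 1.
pattern = re.compile(r"output: TIP_([0-9.]+)")

def dotted_digits(line):
    """Return the captured version string, or None if the line does not match."""
    match = pattern.search(line)
    return match.group(1) if match else None

print(dotted_digits("first output: TIP_12.3.4 %"))
print(dotted_digits("second output: TIP_12 %"))
```

As in the Tcl version, the character class stops at the space before %, so the trailing characters are not captured.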
Try this :
output: TIP_(/([0-9\.^%]*)/[0-9])
Capture group 1.
Demo here :
http://regexr.com?31f6g
The following expression works for me:
{TIP_((\d+\.?)+)}

Search text with a regular expression to match outside specific characters

I have text that looks like:
My name is (Richard) and I cannot do
[whatever (Jack) can't do] and
(Robert) is the same way [unlike
(Betty)] thanks (Jill)
The goal is to search using a regular expression to find all parenthesized names that occur anywhere in the text BUT in-between any brackets.
So in the text above, the result I am looking for is:
Richard
Robert
Jill
You can do it in two steps:
step1: match all bracket contents using:
\[[^\]]*\]
and replace it with ''
step2: match all the remaining parenthesized names(globally) using:
\([^)]*\)
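The two steps above can be sketched in Python (an assumption, since the question does not name a language). A capture group is added in step 2 so that only the names are returned, without the parentheses.

```python
import re

text = ("My name is (Richard) and I cannot do\n"
        "[whatever (Jack) can't do] and\n"
        "(Robert) is the same way [unlike\n"
        "(Betty)] thanks (Jill)")

# Step 1: delete every [...] span, including any parenthesized names inside.
without_brackets = re.sub(r"\[[^\]]*\]", "", text)

# Step 2: collect the contents of the remaining (...) spans.
names = re.findall(r"\(([^)]*)\)", without_brackets)
print(names)
```

The negated character class also matches newlines, so bracketed spans that wrap across lines are removed as well.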
You didn't say what language you're using, so here's some Python:
>>> import re
>>> REGEX = re.compile(r'(?:[^[(]+|\(([^)]*)\)|\[[^]]*])')
>>> s="""My name is (Richard) and I cannot do [whatever (Jack) can't do] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"""
>>> list(filter(None, REGEX.findall(s)))
The output is:
['Richard', 'Robert', 'Jill']
One caveat is that this does not work with arbitrary nesting. The only nesting it's really designed to work with is one level of parens in square brackets as mentioned in the question. Arbitrary nesting can't be done with just regular expressions. (This is a consequence of the pumping lemma for regular languages.)
The regex looks for chunks of text without brackets or parens, chunks of text enclosed in parens, and chunks of text enclosed in brackets. Only text in parens (not in square brackets) is captured. Python's findall finds all matches of the regex in sequence. In some languages you may need to write a loop to repeatedly match. For non-paren matches, findall inserts an empty string in the result list, so the call to filter removes those.
If you are using .NET you can do something like:
"(?<!\[.*?)(?<name>\(\w+\))(?!.*\])"
It's not really the best job for a single regexp. Have you considered, for example, making a copy of the string and then deleting everything in between the square brackets instead? It would then be fairly straightforward to extract things from inside the parentheses. Alternatively, you could write a very basic parser that tokenises the line (into normal text, square bracket text, and parenthesised text, I imagine) and then parses the tree that produces; it'd be more work initially but would make life much simpler if you later want to make the behaviour any more complicated.
Having said that, /(?:(?:^|\])[^\[]*)\((.*?)\)/ does the trick for your test case (but it will almost certainly have some weird behaviour if your [ and ] aren't matched properly, and I'm not convinced it's that efficient).
A quick (PHP) test case:
preg_match_all('/(?:(?:^|\])[^\[]*)\((.*?)\)/', "My name is ... (Jill)", $m);
print(implode(", ", $m[1]));
Outputs:
Richard, Robert, Jill
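The "very basic parser" idea from this answer could look roughly like the following Python sketch. It is a simplified single-pass scanner rather than a full tokeniser with a parse tree, and the function name names_outside_brackets is hypothetical.

```python
def names_outside_brackets(text):
    """Collect (...) contents that are not inside any [...] span."""
    names = []
    depth = 0      # current nesting depth of square brackets
    start = None   # index just past the most recent "(" seen outside brackets
    for i, ch in enumerate(text):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth = max(depth - 1, 0)  # tolerate a stray closing bracket
        elif depth == 0:
            if ch == "(":
                start = i + 1
            elif ch == ")" and start is not None:
                names.append(text[start:i])
                start = None
    return names

sample = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] "
          "and (Robert) is the same way [unlike (Betty)] thanks (Jill)")
print(names_outside_brackets(sample))
```

Unlike the single regex, this also handles nested square brackets and several parenthesized names in the same stretch of text.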
>>> s="My name is (Richard) and I cannot do [whatever (Jack) can't do (Jill) can] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"
>>> for item in s.split("]"):
...     st = item.split("[")[0]
...     if ")" in st:
...         for i in st.split(")"):
...             if "(" in i:
...                 print(i.split("(")[-1])
...
Richard
Robert
Jill
So you want the regex to match the name, but not the enclosing parentheses? This should do it:
[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)
As with the other answers, I'm making certain assumptions about your target string, like expecting parentheses and square brackets to be correctly balanced and not nested.
I say it should work because, although I've tested it, I don't know what language or tool you're using to do the regex matching. We could provide higher-quality answers if we had that info; not all regex flavors are created equal.
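As one data point for the lookahead pattern above, here is how it behaves in Python's re module (an assumed flavor; other engines may differ). The sample text is written on one line for simplicity.

```python
import re

s = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] "
     "and (Robert) is the same way [unlike (Betty)] thanks (Jill)")

# A run of non-parenthesis characters, followed by ")" and then, up to the
# end of the string, only text in which every "[" is closed by a "]".
pattern = re.compile(r"[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)")

names = pattern.findall(s)
print(names)
```

A name inside square brackets fails the lookahead because the next bracket character after it is an unmatched ].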