How to Match The Inner Possible Result With Regular Expressions - regex

I have a regular expression to match anything between { and } in my string.
"/{.*}/"
Couldn't be simpler. The problem arises when I have a single line with multiple matches. So if I have a line like this:
this is my {string}, it doesn't {work} correctly
The regex will match
{string}, it doesn't {work}
rather than
{string}
How do I get it to just match the first result?

Question-mark means "non-greedy"
"/{.*?}/"

Use a character class that includes everything except a right bracket:
/{[^}]+}/

this will work with single nested braces with only a depth of one: {(({.*?})*|.*?)*}
I'm not sure how to get infinite depth or if it's even possible with regex
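For arbitrary nesting depth, one option is to skip the regex and count braces by hand. This is a minimal sketch of that swapped-in technique in Python, not a regex solution; the function name top_level_braces is made up for illustration, and unbalanced closing braces are simply ignored.

```python
def top_level_braces(text):
    """Return every balanced top-level {...} span in text, at any nesting depth."""
    spans = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i          # remember where the outermost brace opened
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:         # the outermost brace just closed
                spans.append(text[start:i + 1])
    return spans


nested = top_level_braces("a {b {c {d}}} e {f}")
print(nested)
```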

Default behaviour is greedy matching, i.e. from the first { to the last }. Use lazy matching by putting a ? after your *:
/{.*?}/
or, better still, replace the . with "anything that is not a }":
/{[^}]*}/
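The three variants discussed above can be compared side by side. This is a small sketch using Python's re module (my choice of engine, the question does not name one); the braces are escaped, which Python accepts for literal { and }.

```python
import re

line = "this is my {string}, it doesn't {work} correctly"

# Greedy: .* runs to the end of the line and backtracks to the LAST }
greedy = re.search(r"\{.*\}", line).group()

# Lazy: .*? expands one character at a time and stops at the FIRST }
lazy = re.search(r"\{.*?\}", line).group()

# Negated class: [^}]* can never cross a closing brace at all
negated = re.search(r"\{[^}]*\}", line).group()

# findall returns every non-overlapping match, not just the first
all_matches = re.findall(r"\{[^}]*\}", line)

print(greedy)
print(lazy)
print(negated)
print(all_matches)
```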

Related

Regex not working correctly all the time, why?

Regex just working in some cases, other not working.
https://regex101.com/r/p5u3N6/1
I expected the regex to match only groups of two "{ } { }" with nothing between the two pairs of braces.
I'm guessing that we wish to capture only three of the inputs listed in the demo, using an expression similar to:
(\{.*?\}(.+?){.*?\})
or
(\{(.+?)\}(.+?){(.+?)\})
The .*? in the first part of your pattern passes through the unexpected parts of your input until it finds a match, because . accepts all of those characters. Simply making the quantifier lazy with ? isn't enough: it will still proceed until it finds a match. Use a negated character class instead:
\{[^}]*?\}\s\{[^}]*?\}
https://regex101.com/r/p5u3N6/5
I'm not sure I understood your requirements. I suppose you only want pairs of {}{} to match, with exactly one space allowed between the two. You can try this: \{([^\{]+)\}\ \{([^\}]+)\}.
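To see why the lazy dot still over-matches while the negated class does not, here is a short Python sketch. The sample string is invented for illustration and is not taken from the regex101 demo.

```python
import re

# Hypothetical input: the first pair of braces is separated by extra text
text = "{a} text {b} {c}"

# Lazy dot: .*? keeps expanding across "} text {" until the whole pattern fits
lazy_dot = re.search(r"\{.*?\}\s\{.*?\}", text).group()

# Negated class: [^}]* cannot cross a }, so only a true adjacent pair matches
negated = re.search(r"\{[^}]*\}\s\{[^}]*\}", text).group()

# A pair with other text in between is rejected outright
rejected = re.search(r"\{[^}]*\}\s\{[^}]*\}", "{a} text {b}")

print(lazy_dot)
print(negated)
print(rejected)
```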

How is ? used in regular expression in python?

I have this snippet
print(re.sub(r'(-script\.pyw|\.exe)?', '','.exe1.exe.exe'))
The output is 1
If I remove the ? from the above snippet and run it as
print(re.sub(r'(-script\.pyw|\.exe)', '','.exe1.exe.exe'))
The output is the same again.
Although I am using ?, it is still greedy and replaces every '.exe' with an empty string.
Is there any workaround to replace only first occurrence?
re.sub(pattern, repl, string, count=0, flags=0)
This is the signature of the re.sub function. Notice the count parameter. If you just want the first occurrence to be replaced, use count=1.
? is a non-greedy modifier only when it follows a repetition operator; when it stands next to anything else, it makes the previous element optional. Thus, your top expression is replacing either -script.pyw or .exe or nothing with nothing. Since replacing nothing with nothing doesn't change the string, the top and the bottom version (where the empty string cannot be matched) give the same result.
Question mark is making the preceding token in the regular expression optional
Use
print(re.sub(r'(-script\.pyw|\.exe)', '','.exe1.exe.exe', 1))
if you want to remove only the first match.
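A quick check of the points above, using the string from the question. This sketch passes count as a keyword argument, which is equivalent to the positional form shown in the answer.

```python
import re

name = ".exe1.exe.exe"
pattern = r"(-script\.pyw|\.exe)"

# Without count, every match is replaced
all_removed = re.sub(pattern, "", name)

# With count=1, only the leftmost match is replaced
first_removed = re.sub(pattern, "", name, count=1)

# Making the group optional only adds empty matches, and replacing
# an empty match with an empty string changes nothing
optional_removed = re.sub(pattern + "?", "", name)

print(all_removed)
print(first_removed)
print(optional_removed)
```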
? is greedy, so if it can match, it will.
For example, against the input aaab, aaab? will match aaab instead of aaa.
In order to make ? non-greedy, you must add an extra ? (this is the same way you make * and + non-greedy, by the way).
So aaab?? will just match aaa. Yet, at the same time, aaab??c will match aaabc, because the b has to be consumed for the c to match.
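The behaviour of ? versus ?? described here can be confirmed with a few one-liners in Python:

```python
import re

# Greedy optional: takes the b when it is available
greedy_opt = re.match(r"aaab?", "aaab").group()

# Lazy optional: skips the b because the match can succeed without it
lazy_opt = re.match(r"aaab??", "aaab").group()

# Lazy optional followed by c: the b must be consumed for c to match
lazy_forced = re.match(r"aaab??c", "aaabc").group()

print(greedy_opt, lazy_opt, lazy_forced)
```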

How to make regex stop at a specific character

I created this simple regular expression to match the "display:{value}" in a style attribute so that I can change it. It works fine as long as no text follows it.
(display:.*)[^;]
matches "display:block" when comparing display:block;
But it also matches "display:inline;color:red" when comparing display:inline;color:red;
How do I make it stop after it has found display:{value}
Try it this way:
(display:[^;]*)
A non-greedy regex that stops at the first semicolon would look like this:
(display:.*?);
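Both suggested patterns can be compared with the original on the two style strings from the question. A short Python sketch:

```python
import re

style = "display:inline;color:red;"

# Original pattern: greedy .* runs past the first semicolon
original = re.search(r"(display:.*)[^;]", style).group(0)

# Negated class: stops before the first semicolon
negated = re.search(r"(display:[^;]*)", style).group(1)

# Lazy quantifier anchored by the semicolon: same captured value
lazy = re.search(r"(display:.*?);", style).group(1)

# The single-declaration case still works
single = re.search(r"(display:[^;]*)", "display:block;").group(1)

print(original)
print(negated)
print(lazy)
print(single)
```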

Regex expression to extract everything inside brackets

I need to extract content inside brackets () from the following string in C++;
#82=IFCCLASSIFICATIONREFERENCE($,'E05.11.a','Rectangular',#28);
I tried following regex but it gives an output with brackets intact.
std::regex e2 ("\\((.*?)\\)");
if (std::regex_search(sLine,m,e2)){
}
Output should be:
$,'E05.11.a','Rectangular',#28
The result you are looking for should be in the first matching subexpression, i.e. within the [m[1].first, m[1].second) interval (or simply m[1].str()).
This is because your regex also matches the enclosing parentheses, but you specified a capturing group, i.e. (.*?), which holds only the text between them.
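Use lookarounds: "(?<=\\()[^)]*?(?=\\))". Watch out, as this won't work for nested parentheses, and it needs an engine that supports lookbehind; the default ECMAScript grammar of std::regex does not.
You can also use a lookbehind and a lookahead: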
(?<=\().*(?=\))
Try this. I only tested it in one tester, but it worked. It basically looks for any characters after a ( and before a ), without including them.
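The two approaches from these answers, a capturing group and lookarounds, give the same text for the sample line. The sketch below uses Python's re module rather than C++ std::regex, purely to illustrate the patterns; Python supports the fixed-width lookbehind used here.

```python
import re

line = "#82=IFCCLASSIFICATIONREFERENCE($,'E05.11.a','Rectangular',#28);"

# Capturing group: group(0) includes the parentheses, group(1) does not
m = re.search(r"\((.*?)\)", line)
whole = m.group(0)
inner_group = m.group(1)

# Lookarounds: the parentheses are asserted but not consumed
inner_look = re.search(r"(?<=\()[^)]*(?=\))", line).group(0)

print(whole)
print(inner_group)
print(inner_look)
```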

Regex is behaving lazy, should be greedy

I thought that by default my Regex would exhibit the greedy behavior that I want, but it is not in the following code:
Regex keywords = new Regex(@"in|int|into|internal|interface");
var targets = keywords.ToString().Split('|');
foreach (string t in targets)
{
Match match = keywords.Match(t);
Console.WriteLine("Matched {0,-9} with {1}", t, match.Value);
}
Output:
Matched in        with in
Matched int       with in
Matched into      with in
Matched internal  with in
Matched interface with in
Now I realize that I could get it to work for this small example if I simply sorted the keywords by length descending, but I want to understand why this isn't working as expected, and the actual project I am working on has many more words in the Regex and it is important to keep them in alphabetical order.
So my question is: Why is this being lazy and how do I fix it?
Laziness and greediness apply to quantifiers only (?, *, +, {min,max}). Alternations are always tried in order, and the first alternative that matches wins.
It looks like you're trying to match whole words. To do that, the entire expression needs to be correct, and your current one is not. Try this one instead:
new Regex(@"\b(in|int|into|internal|interface)\b");
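The "\b" matches a word boundary and is a zero-width match. It succeeds at a position between a word character (a letter, digit or underscore) and a non-word character such as whitespace or punctuation, or at the start or end of the input; exactly which characters count as word characters depends on the engine and its options. Being a zero-width match, it will not contain the character that caused the regex engine to detect the word boundary.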
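The effect of the word boundaries can be checked with the same keyword list. This sketch uses Python's re module instead of .NET; ordered alternation and \b behave the same way for this example.

```python
import re

keywords = ["in", "int", "into", "internal", "interface"]

# Plain alternation: "in" is tried first and always succeeds
plain = re.compile("|".join(keywords))

# With \b on both sides, a short alternative fails the trailing boundary
# check and the engine backtracks to the next alternative
bounded = re.compile(r"\b(?:" + "|".join(keywords) + r")\b")

plain_results = [plain.match(w).group() for w in keywords]
bounded_results = [bounded.match(w).group() for w in keywords]

print(plain_results)
print(bounded_results)
```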
According to RegularExpressions.info, regular expressions are eager. Therefore, when it goes through your piped expression, it stops on the first solid match.
My recommendation would be to store all of your keywords in an array or list, then generate the sorted, piped expression when you need it. You would only have to do this once too as long as your keyword list doesn't change. Just store the generated expression in a singleton of some sort and return that on regex executions.
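One way to follow this recommendation is to keep the list alphabetical and sort by length only when building the pattern. This is a rough Python sketch; build_keyword_regex and KEYWORD_RE are hypothetical names, and a module-level constant stands in for the singleton.

```python
import re

# Source list stays in alphabetical order for maintenance
keywords = ["in", "int", "interface", "internal", "into"]


def build_keyword_regex(words):
    """Compile an alternation with the longest keywords first."""
    ordered = sorted(words, key=len, reverse=True)
    # re.escape guards against keywords that contain regex metacharacters
    return re.compile("|".join(re.escape(w) for w in ordered))


# Built once and reused, as long as the keyword list does not change
KEYWORD_RE = build_keyword_regex(keywords)

results = [KEYWORD_RE.match(w).group() for w in keywords]
print(KEYWORD_RE.pattern)
print(results)
```

Sorting by length is enough here because a keyword can only shadow another one if it is a prefix of it, and a prefix is always shorter.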