Why does this regular expression match? - regex

I have this regex:
(?<!Sub ).*\(.*\)
And I'd like it to match this:
MsgBox ("The total run time to fix AREA and TD fields is: " & =imeElapsed & " minutes.")
But not this:
Sub ChangeAreaTD()
But somehow I still match the one that starts with Sub... does anyone have any idea why? I thought I'd be excluding "Sub " by doing
(?<!Sub )
Any help is appreciated!
Thanks.

Do this:
^MsgBox .*\(.*\)
The problem is that a negative lookbehind does not guarantee the beginning of a string. It will match anywhere.
However, adding a ^ character at the beginning of the regex does guarantee the beginning of the string. Then, change Sub to MsgBox so it only matches strings that begin with MsgBox

Your regex (?<!Sub ).*\(.*\), taken apart:
(?<! # negative look-behind
Sub # the string "Sub " must not occur before the current position
) # end negative look-behind
.* # anything ~ matches up to the end of the string!
\( # a literal "(" ~ causes the regex to backtrack to the last "("
.* # anything ~ matches up to the end of the string again!
\) # a literal ")" ~ causes the regex to backtrack to the last ")"
So, with your test string:
Sub ChangeAreaTD()
The look-behind is fulfilled immediately (right at position 0).
The .* travels to the end of the string after that.
Because of this .*, the look-behind never really makes a difference.
You were probably thinking of
(?<!Sub .*)\(.*\)
but it is very unlikely that variable-length look-behind is supported by your regex engine.
So what I would do is this (since variable-length look-ahead is widely supported):
^(?!.*\bSub\b)[^(]+\(([^)]+)\)
which translates as:
^ # At the start of the string,
(?! # do a negative look-ahead:
.* # anything
\b # a word boundary
Sub # the string "Sub"
\b # another word bounday
) # end negative look-ahead. If not found,
[^(]+ # match anything except an opening paren ~ to prevent backtracking
\( # match a literal "("
( # match group 1
[^)]+ # match anything up to a closing paren ~ to prevent backtracking
) # end match group 1
\) # match a literal ")".
and then go for the contents of match group 1.
However, regex generally is hideously ill-suited for parsing code. This is true for HTML the same way it is true for VB code. You will get wrong matches even with the improved regex. For example here, because of the nested parens:
MsgBox ("The total run time to fix all fields (AREA, TD) is: ...")

You have a backtracking problem here. The first .* in (?<!Sub ).*\(.*\) can match ChangeAreaTD or hangeAreaTD. In the latter case, the previous 4 characters are ub C, which does not match Sub. As the lookbehind is negated, this counts as a match!
Just adding a ^ to the beginning of your regex will not help you, as look-behind is a zero-length matching phrase. ^(?<!MsgBox ) would be looking for a line that followed a line ending in MsgBox. What you need to do instead is ^(?!Sub )(.*\(.*\)). This can be interpreted as "Starting at the beginning of a string, make sure it does not start with Sub. Then, capture everything in the string if it looks like a method call".
A good explanation of how regex engines parse lookaround can be found here.

If your wanting to match just the functions call, not declaration, then the pre bracket match should not match any characters, but more likely any identifier characters followed by spaces. Thus
(?<!Sub )[a-zA-Z][a-zA-Z0-9_]* *\(.*\)
The identifier may need more tokens depending on the language your matching.

Related

How to find and replace a particular character but only if it is in quotes?

Problem:
I have thousands of documents which contains a specific character I don't want. E.g. the character a. These documents contain a variety of characters, but the a's I want to replace are inside double quotes or single quotes.
I would like to find and replace them, and I thought using Regex would be needed. I am using VSCode, but I'm open to any suggestions.
My attempt:
I was able to find the following regex to match for a specific string containing the values inside the ().
".*?(r).*?"
However, this only highlights the entire quote. I want to highlight the character only.
Any solution, perhaps outside of regex, is welcome.
Example outcomes:
Given, the character is a, find replace to b
Somebody once told me "apples" are good for you => Somebody once told me "bpples" are good for you
"Aardvarks" make good kebabs => "Abrdvbrks" make good kebabs
The boy said "aaah!" when his mom told him he was eating aardvark => The boy said "bbbh!" when his mom told him he was eating aardvark
Visual Studio Code
VS Code uses JavaScript RegEx engine for its find / replace functionality. This means you are very limited in working with regex in comparison to other flavors like .NET or PCRE.
Lucky enough that this flavor supports lookaheads and with lookaheads you are able to look for but not consume character. So one way to ensure that we are within a quoted string is to look for number of quotes down to bottom of file / subject string to be odd after matching an a:
a(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)
Live demo
This looks for as in a double quoted string, to have it for single quoted strings substitute all "s with '. You can't have both at a time.
There is a problem with regex above however, that it conflicts with escaped double quotes within double quoted strings. To match them too if it matters you have a long way to go:
a(?=[^"\\]*(?:\\.[^"\\]*)*"[^"\\]*(?:\\.[^"\\]*)*(?:"[^"\\]*(?:\\.[^"\\]*)*"[^"\\]*(?:\\.[^"\\]*)*)*$)
Applying these approaches on large files probably will result in an stack overflow so let's see a better approach.
I am using VSCode, but I'm open to any suggestions.
That's great. Then I'd suggest to use awk or sed or something more programmatic in order to achieve what you are after or if you are able to use Sublime Text a chance exists to work around this problem in a more elegant way.
Sublime Text
This is supposed to work on large files with hundred of thousands of lines but care that it works for a single character (here a) that with some modifications may work for a word or substring too:
Search for:
(?:"|\G(?<!")(?!\A))(?<r>[^a"\\]*+(?>\\.[^a"\\]*)*+)\K(a|"(*SKIP)(*F))(?(?=((?&r)"))\3)
^ ^ ^
Replace it with: WHATEVER\3
Live demo
RegEx Breakdown:
(?: # Beginning of non-capturing group #1
" # Match a `"`
| # Or
\G(?<!")(?!\A) # Continue matching from last successful match
# It shouldn't start right after a `"`
) # End of NCG #1
(?<r> # Start of capturing group `r`
[^a"\\]*+ # Match anything except `a`, `"` or a backslash (possessively)
(?>\\.[^a"\\]*)*+ # Match an escaped character or
# repeat last pattern as much as possible
)\K # End of CG `r`, reset all consumed characters
( # Start of CG #2
a # Match literal `a`
| # Or
"(*SKIP)(*F) # Match a `"` and skip over current match
)
(?(?= # Start a conditional cluster, assuming a positive lookahead
((?&r)") # Start of CG #3, recurs CG `r` and match `"`
) # End of condition
\3 # If conditional passed match CG #3
) # End of conditional
Three-step approach
Last but not least...
Matching a character inside quotation marks is tricky since delimiters are exactly the same so opening and closing marks can not be distinguished from each other without taking a look at adjacent strings. What you can do is change a delimiter to something else so that you can look for it later.
Step 1:
Search for: "[^"\\]*(?:\\.[^"\\]*)*"
Replace with: $0Я
Step 2:
Search for: a(?=[^"\\]*(?:\\.[^"\\]*)*"Я)
Replace with whatever you expect.
Step 3:
Search for: "Я
Replace with nothing to revert every thing.
/(["'])(.*?)(a)(.*?\1)/g
With the replace pattern:
$1$2$4
As far as I'm aware, VS Code uses the same regex engine as JavaScript, which is why I've written my example in JS.
The problem with this is that if you have multiple a's in 1 set of quotes, then it will struggle to pull out the right values, so there needs to be some sort of code behind it, or you, hammering the replace button until no more matches are found, to recurse the pattern and get rid of all the a's in between quotes
let regex = /(["'])(.*?)(a)(.*?\1)/g,
subst = `$1$2$4`,
str = `"a"
"helapke"
Not matched - aaaaaaa
"This is the way the world ends"
"Not with fire"
"ABBA"
"abba",
'I can haz cheezburger'
"This is not a match'
`;
// Loop to get rid of multiple a's in quotes
while(str.match(regex)){
str = str.replace(regex, subst);
}
const result = str;
console.log(result);
Firstly a few of considerations:
There could be multiple a characters within a single quote.
Each quote (using single or double quotation marks) consists of an opening quote character, some text and the same closing quote character. A simple approach is to assume that when the quote characters are counted sequentially, the odd ones are opening quotes and the even ones are closing quotes.
Following point 2, it could be worth some further thought on whether single-quoted strings should be allowed. See the following example: It's a shame 'this quoted text' isn't quoted. Here, the simple approach would think there were two quoted strings: s a shame and isn. Another: This isn't a quote ...'this is' and 'it's unclear where this quote ends'. I've avoided attempting to tackle these complexities and gone with the simple approach below.
The bad news is that point 1 presents a bit of a problem, as a capturing group with a wildcard repeat character after it (e.g. (.*)*) will only capture the last captured "thing". But the good news is there's a way of getting around this within certain limits. Many regex engines will allow up to 99 capturing groups (*). So if we can make the assumption that there will be no more than 99 as in each quote (UPDATE ...or even if we can't - see step 3), we can do the following...
(*) Unfortunately my first port of call, Notepad++ doesn't - it only allows up to 9. Not sure about VS Code. But regex101 (used for the online demos below) does.
TL;DR - What to do?
Search for: "([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*([^a"]*)a*"
Replace with: "\1\2\3\4\5\6\7\8\9\10\11\12\13\14\15\16\17\18\19\20\21\22\23\24\25\26\27\28\29\30\31\32\33\34\35\36\37\38\39\40\41\42\43\44\45\46\47\48\49\50\51\52\53\54\55\56\57\58\59\60\61\62\63\64\65\66\67\68\69\70\71\72\73\74\75\76\77\78\79\80\81\82\83\84\85\86\87\88\89\90\91\92\93\94\95\96\97\98\99"
(Optionally keep repeating steps the previous two steps if there's a possibility of > 99 such characters in a single quote until they've all been replaced).
Repeat step 1 but replacing all " with ' in the regular expression, i.e: '([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*([^a']*)a*'
Repeat steps 2-3.
Online demos
Please see the following regex101 demos, which could actually be used to perform the replacements if you're able to copy the whole text into the contents of "TEST STRING":
Demo for double quotes
Demo for single quotes.
If you can use Visual Studio (instead of Visual Studio Code), it is written in C++ and C# and uses the .NET Framework regular expressions, which means you can use variable length lookbehinds to accomplish this.
(?<="[^"\n]*)a(?=[^"\n]*")
Adding some more logic to the above regular expression, we can tell it to ignore any locations where there are an even amount of " preceding it. This prevents matches for a outside of quotes. Take, for example, the string "a" a "a". Only the first and last a in this string will be matched, but the one in the middle will be ignored.
(?<!^[^"\n]*(?:(?:"[^"\n]*){2})+)(?<="[^"\n]*)a(?=[^"\n]*")
Now the only problem is this will break if we have escaped " within two double quotes such as "a\"" a "a". We need to add more logic to prevent this behaviour. Luckily, this beautiful answer exists for properly matching escaped ". Adding this logic to the regex above, we get the following:
(?<!^[^"\n]*(?:(?:"(?:[^"\\\n]|\\.)*){2})+)(?<="[^"\n]*)a(?=[^"\n]*")
I'm not sure which method works best with your strings, but I'll explain this last regex in detail as it also explains the two previous ones.
(?<!^[^"\n]*(?:(?:"(?:[^"\\\n]|\\.)*){2})+) Negative lookbehind ensuring what precedes doesn't match the following
^ Assert position at the start of the line
[^"\n]* Match anything except " or \n any number of times
(?:(?:"(?:[^"\\\n]|\\.)*){2})+ Match the following one or more times. This ensures if there are any " preceding the match that they are balanced in the sense that there is an opening and closing double quote.
(?:"(?:[^"\\\n]|\\.)*){2} Match the following exactly twice
" Match this literally
(?:[^"\\\n]|\\.)* Match either of the following any number of times
[^"\\\n] Match anything except ", \ and \n
\\. Matches \ followed by any character
(?<="[^"\n]*) Positive lookbehind ensuring what precedes matches the following
" Match this literally
[^"\n]* Match anything except " or \n any number of times
a Match this literally
(?=[^"\n]*") Positive lookahead ensuring what follows matches the following
[^"\n]* Match anything except " or \n any number of times
" Match this literally
You can drop the \n from the above pattern as the following suggests. I added it just in case there's some sort of special cases I'm not considering (i.e. comments) that could break this regex within your text. The \A also forces the regex to match from the start of the string (or file) instead of the start of the line.
(?<!\A[^"]*(?:(?:"(?:[^"\\]|\\.)*){2})+)(?<="[^"]*)a(?=[^"]*")
You can test this regex here
This is what it looks like in Visual Studio:
I am using VSCode, but I'm open to any suggestions.
If you want to stay in an Editor environment, you could use
Visual Studio (>= 2012) or even notepad++ for quick fixup.
This avoids having to use a spurious script environment.
Both of these engines (Dot-Net and boost, respectively) use the \G construct.
Which is start the next match at the position where the last one left off.
Again, this is just a suggestion.
This regex doesn't check the validity of balanced quotes within the entire
string ahead of time (but it could with the addition of a single line).
It is all about knowing where the inside and outside of quotes are.
I've commented the regex, but if you need more info let me know.
Again this is just a suggestion (I know your editor uses ECMAScript).
Find (?s)(?:^([^"]*(?:"[^"a]*(?=")"[^"]*(?="))*"[^"a]*)|(?!^)\G)a([^"a]*(?:(?=a.*?")|(?:"[^"]*$|"[^"]*(?=")(?:"[^"a]*(?=")"[^"]*(?="))*"[^"a]*)))
Replace $1b$2
That's all there is to it.
https://regex101.com/r/loLFYH/1
Comments
(?s) # Dot-all inine modifier
(?:
^ # BOS
( # (1 start), Find first quote from BOS (written back)
[^"]*
(?: # --- Cluster
" [^"a]* # Inside quotes with no 'a'
(?= " )
" [^"]* # Between quotes, get up to next quote
(?= " )
)* # --- End cluster, 0 to many times
" [^"a]* # Inside quotes, will be an 'a' ahead of here
# to be sucked up by this match
) # (1 end)
| # OR,
(?! ^ ) # Not-BOS
\G # Continue where left off from last match.
# Must be an 'a' at this point
)
a # The 'a' to be replaced
( # (2 start), Up to the next 'a' (to be written back)
[^"a]*
(?: # --------------------
(?= a .*? " ) # If stopped before 'a', must be a quote ahead
| # or,
(?: # --------------------
" [^"]* $ # If stopped at a quote, check for EOS
| # or,
" [^"]* # Between quotes, get up to next quote
(?= " )
(?: # --- Cluster
" [^"a]* # Inside quotes with no 'a'
(?= " )
" [^"]* # Between quotes
(?= " )
)* # --- End cluster, 0 to many times
" [^"a]* # Inside quotes, will be an 'a' ahead of here
# to be sucked up on the next match
) # --------------------
) # --------------------
) # (2 end)
"Inside double quotes" is rather tricky, because there are may complicating scenarios to consider to fully automate this.
What are your precise rules for "enclosed by quotes"? Do you need to consider multi-line quotes? Do you have quoted strings containing escaped quotes or quotes used other than starting/ending string quotation?
However there may be a fairly simple expression to do much of what you want.
Search expression: ("[^a"]*)a
Replacement expression: $1b
This doesn't consider inside or outside of quotes - you have do that visually. But it highlights text from the quote to the matching character, so you can quickly decide if this is inside or not.
If you can live with the visual inspection, then we can build up this pattern to include different quote types and upper and lower case.

regex for first instance of a specific character that DOESN'T come immediately after another specific character

I have a function, translate(), takes multiple parameters. The first param is the only required and is a string, that I always wrap in single quotes, like this:
translate('hello world');
The other params are optional, but could be included like this:
translate('hello world', true, 1, 'foobar', 'etc');
And the string itself could contain escaped single quotes, like this:
translate('hello\'s world');
To the point, I now want to search through all code files for all instances of this function call, and extract just the string. To do so I've come up with the following grep, which returns everything between translate(' and either ') or ',. Almost perfect:
grep -RoPh "(?<=translate\(').*?(?='\)|'\,)" .
The problem with this though, is that if the call is something like this:
translate('hello \'world\', you\'re great!');
My grep would only return this:
hello \'world\
So I'm looking to modify this so that the part that currently looks for ') or ', instead looks for the first occurrence of ' that hasn't been escaped, i.e. doesn't immediately follow a \
Hopefully I'm making sense. Any suggestions please?
You can use this grep with PCRE regex:
grep -RoPh "\btranslate\(\s*\K'(?:[^'\\\\]*)(?:\\\\.[^'\\\\]*)*'" .
Here is a regex demo
RegEx Breakup:
\b # word boundary
translate # match literal translate
\( # match a (
\s* # match 0 or more whitespace
\K # reset the matched information
' # match starting single quote
(?: # start non-capturing group
[^'\\\\]* # match 0 or more chars that are not a backslash or single quote
) # end non-capturing group
(?: # start non-capturing group
\\\\. # match a backslash followed by char that is "escaped"
[^'\\\\]* # match 0 or more chars that are not a backslash or single quote
)* # end non-capturing group
' # match ending single quote
Here is a version without \K using look-arounds:
grep -oPhR "(?<=\btranslate\(')(?:[^'\\\\]*)(?:\\\\.[^'\\\\]*)*(?=')" .
RegEx Demo 2
I think the problem is the .*? part: the ? makes it a non-greedy pattern, meaning it'll take the shortest string that matches the pattern. In effect, you're saying, "give me the shortest string that's followed by quote+close-paren or quote+comma". In your example, "world\" is followed by a single quote and a comma, so it matches your pattern.
In these cases, I like to use something like the following reasoning:
A string is a quote, zero or more characters, and a quote: '.*'
A character is anything that isn't a quote (because a quote terminates the string): '[^']*'
Except that you can put a quote in a string by escaping it with a backslash, so a character is either "backslash followed by a quote" or, failing that, "not a quote": '(\\'|[^'])*'
Put it all together and you get
grep -RoPh "(?<=translate\(')(\\'|[^'])*(?='\)|'\,)" .

Regex Expression Not Matching

I have the following regular expression:
"\[(\d+)\].?\s+(\S+)\s+(\/+?)\\r\n"
I am pretty new to regex. I have this regexp and a string that I am trying to see if it matches or not. I believe it should match it but my program says it doesn't, and an online analyser says they do not match. I am pretty sure I am missing something small. Here is my string:
[1]+ Stopped sleep 60
However, when using this online tool to check for a match (and my program is saying they're not equal), why does the following expression not match the above regexp? Any ideas?
you appear to have escaped the \ prior to the \r resulting in it searching for the letter r
RegExp interpretation and allowed characters vary slightly with implementation, so you should give your execution context, but this is probably generic enough.
Decomposing your regexp gives
\[ - an open bracket character.
(\d+) - one or more digits; save this as capture group 1 ($1).
\] - a close bracket character.
.? - 0 or 1 character, of any kind
\s+ - 1 or more spaces.
(\S+) - 1 or more non-space characters; save this as $2
\s+ - 1 or more spaces
(\/+?) - 1 or more forward-slash characters, optional as $3
(not sure about this, this is an odd construct)
\\r\n" - an (incorrectly specified) end of line sequence, I think.
First of all, if you want to match the end of a line, use $, not \r\n. That should match the end of a line in most contexts. ^ matches the beginning of a line.
Second, I can't tell from your regexp what you are trying to capture after the "Stopped" word, so I'm going to assume you want the rest as one block, including internal spaces. A reg-exp basically the same as yours will do it.
"\[(\d+)\].?\s+(\S+)\s+(.+)\s*$"
This captures
$1 = 1,
$2 = Stopped
$3 = sleep 60
This is basically the same as yours except for the end, which grabs everything after "stopped" up to the end of the line as a single capture group, $3, except for leading and trailing blanks. If you want to do additional parsing, replace the (.+) as appropriate. Note that there must be at least 1 non-blank character after "stopped " for this to match. If you want it to match even if there is no string $3, use \s*(.*)\s*$ instead of \s+(.+)\s*$
Try to use this pattern:
\[\d+\]\+\s*\w+\s*\w+\s*\d+

Multiline selection of blocks with ID at the end of each block with regular expression

I have regular expression:
BEGIN\s+\[([\s\S]*?)END\s+ID=(.*)\]
which select multiline text and ID from text below. I would like to select only IDs with prefix X_, but if I change ID=(.*) to ID=(X_.*) begin is selected from second pair not from third as I need. Could someone help me to get correct expression please?
text example:
BEGIN [
text a
END ID=X_1]
BEGIN [
text b
text c
END ID=Y_1]
text aaa
text bbb
BEGIN [
text d
text e
END ID=X_2]
text xxx
BEGIN [
text bbb
END ID=X_3]
It isn't the .* that's gobbling everything up as people keep saying, it's the [\s\S]*?. .* can't do it because (as the OP said) the dot doesn't match newlines.
When the END\s+ID=(X_.*)\] part of your regex fails to match the last line of the second block, you're expecting it to abandon that block and start over with the third one. That's what it have to do to make the shortest match.
In reality, it backtracks to the beginning of the line and lets [\s\S]*? consume it instead. And it keeps on consuming until it finds a place where END\s+ID=(X_.*)\] can match, which happens to be the last line of the third block.
The following regex avoids that problem by matching line by line, checking each one to see if it starts with END. This effectively confines the match to one block at a time.
(?m)^BEGIN\s+\[[\r\n]+((?:(?!END).*[\r\n]+)*)END\s+ID=(X_.*)\]
Note that I used ^ to anchor each match to the beginning of a line, so I used (?m) to turn on multiline mode. But I did not--and you should not--turn on single-line/DOTALL mode.
Assuming there aren't any newlines inside a block and the BEGIN/END statements are the first non-space of their line, I'd write the regex like this (Perl notation; change the delimiters and remove comments, whitespaces and the /x modifier if you use a different engine)
m{
\n \s* BEGIN \s+ \[ # match the beginning
( (?!\n\s*\n) .)*? # match anything that isn't an empty line
# checking with a negative look-ahead (?!PATTERN)
\n \s* END \s+ ID=X_[^\]]* \] # the ID may not contain "]"
}sx # /x: use extended syntax, /s: "." matches newlines
If the content may be anything, it might be best to create a list of all blocks, and then grep through them. This regex matches any block:
m{ (
BEGIN \s+ \[
.*? # non-greedy matching is important here
END \s+ ID=[^\]]* \] # greedy matching is safe here
) }xs
(add newlines if wanted)
Then only keep those matches that match this regex:
/ID = X_[^\]]* \] $/x # anchor at end of line
If we don't do this, backtracking may prevent a correct match ([\s\S]*? can contain END ID=X_). Your regex would put anything inside the blocks until it sees a X_.*.
So using BEGIN\s+\[([/s/S]*?)END\s+ID=(.*?)\] — note the extra question mark — one match would be:
BEGIN [
text b
text c
END ID=Y_1]
text aaa
text bbb
BEGIN [
text d
text e
END ID=X_2]
…instead of failing at the Y_. A greedy match (your unchanged regex) should result in the whole file being matched: Your (.*) eats up all characters (until the end of file) and then goes back until it finds a ].
EDIT:
Should you be using perls regex engine, we can use the (*FAIL) verb:
/BEGIN\s+\[(.*?)END\s+ID=(X_[^\]]*|(*FAIL))\]/s
"Either have an ID starting with X_ or the match fails". However, this does not solve the problem with END ID=X_1]-like statements inside your data.
Change your .* to a [^\]]* (i.e. match non-]s), so that your matches can't spill over past an END block, giving you something like BEGIN\s+\[([^\]]*?)END\s+ID=(X_[^\]]*)\]

extract word with regular expression

I have a string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77 and I would like to extract the words temperatoA and CelcieusB.
I have this regular expression (\d+/(\w+),?)*! but I only get the match 1/temperatoA,2/CelcieusB!
Why?
Your whole match evaluates to '1/temperatoA,2/CelcieusB' because that matches the following expression:
qr{ ( # begin group
\d+ # at least one digit
/ # followed by a slash
(\w+) # followed by at least one word characters
,? # maybe a comma
)* # ANY number of repetitions of this pattern.
}x;
'1/temperatoA,' fulfills capture #1 first, but since you are asking the engine to capture as many of those as it can it goes back and finds that the pattern is repeated in '2/CelcieusB' (the comma not being necessary). So the whole match is what you said it is, but what you probably weren't expecting is that '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB'.
Anytime you want to capture anything that fits a certain pattern in a certain string it is always best to use the global flag and assign the captures into an array. Since an array is not a single scalar like $1, it can hold all the values that were captured for capture #1.
When I do this:
my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
if ( my #matches = $str =~ /$regex/g ) {
print Dumper( \#matches );
}
I get this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB',
'23/33',
'33',
'55/66',
'66'
];
Now, I figure that's probably not what you expected. But '3' and '6' are word characters, and so--coming after a slash--they comply with the expression.
So, if this is an issue, you can change your regex to the equivalent: qr{(\d+/(\p{Alpha}\w*))}, specifying that the first character must be an alpha followed by any number of word characters. Then the dump looks like this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB'
];
And if you only want 'temperatoA' or 'CelcieusB', then you're capturing more than you need to and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.
However, the secret to capturing more than one chunk in a capture expression is to assign the match to an array, you can then sort through the array to see if it contains the data you want.
The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?
The expression you want is simply as follows:
(\w+)
With a Perl-compatible regex engine you can search for
(?<=\d/)\w+(?=.*!)
(?<=\d/) asserts that there is a digit and a slash before the start of the match
\w+ matches the identifier. This allows for letters, digits and underscore. If you only want to allow letters, use [A-Za-z]+ instead.
(?=.*!) asserts that there is a ! ahead in the string - i. e. the regex will fail once we have passed the !.
Depending on the language you're using, you might need to escape some of the characters in the regex.
E. g., for use in C (with the PCRE library), you need to escape the backslashes:
myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);
Will this work?
/([[:alpha:]]\w+)\b(?=.*!)
I made the following assumptions...
A word begins with an alphabetic character.
A word always immediately follows a slash. No intervening spaces, no words in the middle.
Words after the exclamation point are ignored.
You have some sort of loop to capture more than one word. I'm not familiar enough with the C library to give an example.
[[:alpha:]] matches any alphabetic character.
The \b matches a word boundary.
And the (?=.*!) came from Tim Pietzcker's post.