Need regexp to find substring between two tokens - regex

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
any and all help greatly appreciated.
Similar question on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text

You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"

Try this:
B=([^;]+);
This matches everything between B= and ; unless it is a ;. So it matches everything between B= and the first ; thereafter.

(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
String s = "A=abc;B=def_3%^123+-;C=123;";
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
System.out.println(m.group(1));
}
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.

Related

Regex expression to match word in a string

I've been banging my head with this issue as I can't get this expression to work.
I'm trying to match and output a specific word from a string, so for example, take this string:
<ANIMALS>
<value>DOG CAT COW</value>
</ANIMALS>
And now I want to match any one of them and return that value otherwise none, let's say, COW.
I've tried a lot of varying expressions with no luck such as:
IF(VALUE == "/(^|\W)COW($|\W)/", "COWVALUE", "NONE")
This doesn't work, nor do any other variants I've tried. If I keep the original string as a single word and no actual calling expression, just a word, then it always works. As soon as I introduce a string of words then I can't make it happen.
Could anyone help please?
Thanks!
Get rid of (^|\W) and ($|\W), since that only matches at word boundaries, but you want to match COW when it's not by itself as a word. The regular expression should just be /COW/, it will match that string wherever it appears in the string.
BTW, to match word boundaries you can use \b rather than those alternations.
Not sure what programming language you're using but following is a simple example based in Javascript to do achieve your goals.
regex demo

How to do regular Expression in AutoIt Script

In Autoit script Iam unable to do Regular expression for the below string Here the numbers will get changed always.
Actual String = _WinWaitActivate("RX_IST2_AM [PID:942564 NPID:10991 SID:498702881] sbivvrwm060.dev.ib.tor.Test.com:30000","")
Here the PID, NPID & SID : will be changing and rest of the things are always constant.
What i have tried below is
_WinWaitActivate("RX_IST2_AM [PID:'([0-9]{1,6})' NPID:'([0-9]{1,5})' SID:'([0-9]{1,9})' sbivvrwm060.dev.ib.tor.Test.com:30000","")
Can someone please help me
As stated in the documentation, you should write the prefix REGEXPTITLE: and surround everything with square brackets, but "escape" all including ones as the dots (.) and spaces () with a backslash (\) and instead of [0-9] you might use \d like "[REGEXPTITLE:RX_IST2_AM\ \[PID:(\d{1,6})\ NPID:(\d{1,5})\ SID:(\d{1,9})\] sbivvrwm060\.dev\.ib\.tor\.Test\.com:30000]" as your parameter for the Win...(...)-Functions.
You can even omit the round brackets ((...)) but keep their content if you don't want to capture the content to process it further like with StringRegExp(...) or StringRegExpReplace(...) - using the _WinWaitActivete(...)-Function it won't make sense anyways as it is only matching and not replacing or returning anything from your regular expression.
According to regex101 both work, with the round brackets and without - you should always use a tool like this site to confirm that your expression is actually working for your input string.
Not familiar with autoit, but remember that regex has to completely match your string to capture results. For example, (goat)s will NOT capture the word goat if your string is goat or goater.
You have forgotten to add a ] in your regex, so your pattern doesn't match the string and capture groups will not be extracted. Also I'm not completely sold on the usage of '. Based on this page, you can do something like StringRegExp(yourstring, 'RX_IST2_AM [PID:([0-9]{1,6}) NPID:([0-9]{1,5}) SID:([0-9]{1,9})]', $STR_REGEXPARRAYGLOBALMATCH) and $1, $2 and $3 would be your results respectively. But maybe your approach works too.

Get all matches for a certain pattern using RegEx

I am not really a RegEx expert and hence asking a simple question.
I have a few parameters that I need to use which are in a particular pattern
For example
$$DATA_START_TIME
$$DATA_END_TIME
$$MIN_POID_ID_DLAY
$$MAX_POID_ID_DLAY
$$MIN_POID_ID_RELTM
$$MAX_POID_ID_RELTM
And these will be replaced at runtime in a string with their values (a SQL statement).
For example I have a simple query
select * from asdf where asdf.starttime = $$DATA_START_TIME and asdf.endtime = $$DATA_END_TIME
Now when I try to use the RegEx pattern
\$\$[^\W+]\w+$
I do not get all the matches(I get only a the last match).
I am trying to test my usage here https://regex101.com/r/xR9dG0/2
If someone could correct my mistake, I would really appreciate it.
Thanks!
This will do the job:
\$\$\w+/g
See Demo
Just Some clarifications why your regex is doing what is doing:
\$\$[^\W+]\w+$
Unescaped $ char means end of string, so, your pattern is matching something that must be on the end of the string, that's why its getting only the last match.
This group [^\W+] doesn't really makes sense, groups starting with [^..] means negate the chars inside here, and \W is the negation of words, and + inside the group means literally the char +, so you are saying match everything that is Not a Not word and that is not a + sign, i guess that was not what you wanted.
To match the next word just \w+ will do it. And the global modifier /g ensures that you will not stop on the first match.
This should work - Based on what you said you wanted to match this should work . Also it won't match $$lower_case_strings if that's what you wanted. If not, add the "i" flag also.
\${2}[A-Z_]+/g

Lua pattern similar to regex positive lookahead?

I have a string which can contain any number of the delimiter §\n. I would like to remove all delimiters from a string, except the last occurrence which should be left as-is. The last delimiter can be in three states: \n, §\n or §§\n. There will never be any characters after the last variable delimiter.
Here are 3 examples with the different state delimiters:
abc§\ndef§\nghi\n
abc§\ndef§\nghi§\n
abc§\ndef§\nghi§§\n
I would like to remove all delimiters except the last occurrence.
So the result of gsub for the three examples above should be:
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n
Using regular expressions, one could use §\\n(?=.), which matches properly for all three cases using positive lookahead, as there will never be any characters after the last variable delimiter.
I know I could check if the string has the delimiter at the end, and then after a substitution using the Lua pattern §\n I could add the delimiter back onto the string. That is however a very inelegant solution to a problem which should be possible to solve using a Lua pattern alone.
So how could this be done using a Lua pattern?
str:gsub( '§\\n(.)', '%1' ) should do what you want. This deletes the delimiter given that it is followed by another character, putting this character back into to string.
Test code
local str = {
'abc§\\ndef§\\nghi\\n',
'abc§\\ndef§\\nghi§\\n',
'abc§\\ndef§\\nghi§§\\n',
}
for i = 1, #str do
print( ( str[ i ]:gsub( '§\\n(.)', '%1' ) ) )
end
yields
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n
EDIT: This answer doesn't work specifically for lua, but if you have a similar problem and are not constrained to lua you might be able to use it.
So if I understand correctly, you want a regex replace to make the first example look like the second. This:
/(.*?)§\\n(?=.*\\n)/g
will eliminate the non-last delimiters when replaced with
$1
in PCRE, at least. I'm not sure what flavor Lua follows, but you can see the example in action here.
REGEX:
/(.*?)§\\n(?=.*\\n)/g
TEST STRING:
abc§\ndef§\nghi\n
abc§\ndef§\nghi§\n
abc§\ndef§\nghi§§\n
SUBSTITUTION:
$1
RESULT:
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n

How to ignore whitespace in a regular expression subject string?

Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* in between every other character in your regex. Although granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\*st\s*s\s*/g)
Just do:
myString.replace(/\s*/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the javascript flavor of regex.
You could put \s* inbetween every character in your search string so if you were looking for cat you would use c\s*a\s*t\s*s\s*s
It's long but you could build the string dynamically of course.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
This approach can be used to automate this
(the following exemplary solution is in python, although obviously it can be ported to any language):
you can strip the whitespace beforehand AND save the positions of non-whitespace characters so you can use them later to find out the matched string boundary positions in the original string like the following:
def regex_search_ignore_space(regex, string):
no_spaces = ''
char_positions = []
for pos, char in enumerate(string):
if re.match(r'\S', char): # upper \S matches non-whitespace chars
no_spaces += char
char_positions.append(pos)
match = re.search(regex, no_spaces)
if not match:
return match
# match.start() and match.end() are indices of start and end
# of the found string in the spaceless string
# (as we have searched in it).
start = char_positions[match.start()] # in the original string
end = char_positions[match.end()] # in the original string
matched_string = string[start:end] # see
# the match WITH spaces is returned.
return matched_string
with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
The accepted answer will not work if and when you are passing a dynamic value (such as "current value" in an array loop) as the regex test value. You would not be able to input the optional white spaces without getting some really ugly regex.
Konrad Hoffner's solution is therefore better in such cases as it will strip both the regest and test string of whitespace. The test will be conducted as though both have no whitespace.