Find match within a first match - regex

I have the following string
abc123+InterestingValue+def456
I want to get only InterestingValue. I am using this regex:
\+.*\+
but the output still includes the + characters.
Is there a way to search for a string between the + characters, then search again for anything that is not a + character?

Use lookarounds.
(?<=\+)[^+]*(?=\+)
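As a quick sketch of how this behaves, here it is in Python (an assumption, since the question does not name a language). The variable names are only illustrative.

```python
import re

text = "abc123+InterestingValue+def456"

# (?<=\+) requires a "+" right before the match and (?=\+) requires one
# right after it; neither "+" becomes part of the matched text.
match = re.search(r"(?<=\+)[^+]*(?=\+)", text)
value = match.group(0) if match else None
print(value)  # InterestingValue
```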

You can use a positive lookbehind and a positive lookahead. A positive lookbehind asserts that its pattern matches immediately before the current position, and a positive lookahead asserts that its pattern matches immediately after it. Both are zero-width: they check for the pattern but do not include it in the match.
A positive lookbehind is a group beginning with ?<= and a positive lookahead is a group beginning with ?=. Adding these to your existing expression would look like this:
(?<=\+).*(?=\+)
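One thing to watch with this version: .* is greedy, so if the input has more than two + characters the match runs up to the last one. A small Python sketch (Python and the sample input "a+b+c+d" are assumptions, not part of the question) compares it with a negated character class:

```python
import re

sample = "a+b+c+d"  # hypothetical input with three "+" characters

# Greedy .* backtracks only as far as the last "+" that satisfies the lookahead.
greedy = re.search(r"(?<=\+).*(?=\+)", sample).group(0)
# [^+]* cannot cross a "+", so it stops at the first closing "+".
negated = re.search(r"(?<=\+)[^+]*(?=\+)", sample).group(0)

print(greedy)   # b+c
print(negated)  # b
```

For the string in the question, which has exactly two + characters, both patterns return InterestingValue.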

If it should be the first match, you can use a capture group with an anchor:
^[^+]*\+([^+]+)\+
^ Start of string
[^+]* Optionally match any char except + using a negated character class
\+ Match literally
([^+]+) Capture group 1, match 1+ chars other than +
\+ Match literally
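In Python (assumed here only for illustration), the value is then read from group 1 rather than from the whole match:

```python
import re

text = "abc123+InterestingValue+def456"

match = re.search(r"^[^+]*\+([^+]+)\+", text)
# group(0) would still include the surrounding "+" characters;
# group(1) holds only the text between the first pair.
first_value = match.group(1) if match else None
print(first_value)  # InterestingValue
```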

Related

Regex positive lookahead multiple occurrence

I have the sample string below:
abc,com;def,med;ghi,com;jkl,med
I need to extract the string that comes before the keyword ",com" (all occurrences).
The final result I am looking for is something like:
abc,ghi
I have tried the positive lookahead regex below:
[\s\S]*?(?=com)
But this only fetches abc, not ghi.
What modification do I need to make to the above regex?
The character class [\s\S] matches any character, so it also matches the , and ; separators.
What you can do instead is match non-whitespace characters except , and ; using a negated character class. That way you don't have to make the quantifier non-greedy either.
Then assert ,com to the right, followed by a word boundary to prevent a partial word match:
[^\s,;]+(?=,com\b)
Instead of using a lookahead, you might also use a capture group:
([^\s,;]+),com\b
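A minimal Python sketch of both variants (Python is an assumption; the lookahead pattern is my reading of the description above, not a pattern quoted from the answer):

```python
import re

text = "abc,com;def,med;ghi,com;jkl,med"

# Capture-group variant: findall returns the contents of group 1.
captured = re.findall(r"([^\s,;]+),com\b", text)

# Lookahead variant (assumed form): the match itself excludes ",com".
looked_ahead = re.findall(r"[^\s,;]+(?=,com\b)", text)

result = ",".join(captured)
print(result)  # abc,ghi
```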

Regex: Match pattern unless preceded by pattern containing element from the matching character class

I am having a hard time coming up with a regex to match a specific case:
This can be matched:
any-dashed-strings
this-can-be-matched-even-though-its-big
This cannot be matched:
strings starting with elem- or asdf- or a single -
elem-this-cannot-be-matched
asdf-this-cannot-be-matched
-
So far what I came up with is:
/\b(?!elem-|asdf-)([\w\-]+)\b/
But I keep matching a single - and the whole -this-cannot-be-matched suffix. I cannot figure out how to both conditionally ignore a character that belongs to the matching character class and avoid matching the rest of the string when one of the excluded prefixes is found.
I am currently working with the Oniguruma engine (Ruby 1.9+/PHP multi-byte string module).
If possible, please elaborate on the solution. Thanks a lot!
If a lookbehind is supported, you can assert a whitespace boundary to the left and, in the negative lookahead, make the elem/asdf alternation optional so that a bare leading hyphen is rejected as well.
(?<!\S)(?!(?:elem|asdf)?-)[\w-]+\b
Explanation
(?<!\S) Assert a whitespace boundary to the left
(?! Negative lookahead, assert that what is directly to the right is not
(?:elem|asdf)?- Optionally match elem or asdf followed by -
) Close the lookahead
[\w-]+ Match 1+ word chars or -
\b A word boundary
Or a version with a capture group and without a lookbehind:
(?:\s|^)(?!(?:elem|asdf)?-)([\w-]+)\b
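Both patterns can be tried out with a short Python sketch. Python's re module stands in for Oniguruma here, which is an assumption; the patterns use only constructs I would expect both engines to handle the same way.

```python
import re

# The sample lines from the question, one per line.
text = """any-dashed-strings
this-can-be-matched-even-though-its-big
elem-this-cannot-be-matched
asdf-this-cannot-be-matched
-"""

# Lookbehind version: the whole match is the wanted string.
with_lookbehind = re.findall(r"(?<!\S)(?!(?:elem|asdf)?-)[\w-]+\b", text)

# Capture-group version: findall returns group 1, so the leading
# whitespace consumed by (?:\s|^) is not part of the result.
with_group = re.findall(r"(?:\s|^)(?!(?:elem|asdf)?-)([\w-]+)\b", text)

print(with_lookbehind)
print(with_group)
```

Both lists contain only any-dashed-strings and this-can-be-matched-even-though-its-big. The elem- and asdf- lines and the lone - are rejected at their first character, and no later position in those lines has whitespace to its left.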

Get the first character using Regex

I'm using regex trying to get the first character of a specific word between (.*?)
About Sildenafil Citrate Phosphodiesterase-5 Enzyme Inhibitor
and the regex:
Citrate (.*?)Enzyme
So I get match Phosphodiesterase-5
But I need to get only the first character P
You could use a capturing group to capture a single non-whitespace char (\S) and use word boundaries \b:
\bCitrate (\S).*? Enzyme\b
Changing your regex to Citrate (.).*?Enzyme would be enough. This captures the first character after "Citrate ".
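Either capture-group pattern can be checked with a few lines of Python (an assumption; the question does not say which language is used). The variable names are illustrative.

```python
import re

text = "About Sildenafil Citrate Phosphodiesterase-5 Enzyme Inhibitor"

# Stricter version: word boundaries and a non-whitespace first character.
strict = re.search(r"\bCitrate (\S).*? Enzyme\b", text)
# Minimal change to the original pattern: capture one character, skip the rest.
loose = re.search(r"Citrate (.).*?Enzyme", text)

print(strict.group(1))  # P
print(loose.group(1))   # P
```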
If your environment supports lookarounds, you can try this pattern:
(?<=\bCitrate ).(?=.*?Enzyme\b)
(?<=\bCitrate ) - Positive lookbehind, match must be preceded by \bCitrate
. - Match any character except a newline
(?=.*?Enzyme\b) - Positive lookahead, match must be followed by .*?Enzyme\b
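With lookarounds the whole match is the single character, so no group is needed. A Python sketch (Python is assumed; its lookbehind must be fixed-width, which \bCitrate followed by a space is):

```python
import re

text = "About Sildenafil Citrate Phosphodiesterase-5 Enzyme Inhibitor"

match = re.search(r"(?<=\bCitrate ).(?=.*?Enzyme\b)", text)
first_char = match.group(0) if match else None
print(first_char)  # P
```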

How to consume lookaround in regex?

I want to match
abc_def_ghi,
abc_abc_ghi,
abc_a2a_ghi,
abc_999_ghi
but not abc_xxx_ghi (with xxx in center).
I came up with manually consuming the lookahead (abc_(?!xxx)..._ghi), but I wonder whether there is another way that doesn't require manually specifying the number of characters to skip.
The original question was with numbers; it has been updated for the string case.
If you don't want to specify exactly how many characters to skip, perhaps you could use a quantifier like + in the negative lookahead and use a negated character class to match not an underscore.
\babc_(?!x+_)[^_]+_ghi\b
Explanation
\babc_ Word boundary, match abc_
(?! Negative lookahead, assert what is directly on the right is not
x+_ Match 1+ times x followed by an underscore
) Close lookahead
[^_]+_ Negated character class, match 1+ times any char except _, then match _
ghi\b Match ghi and word boundary
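A short Python check against the strings from the question (Python is an assumption, and the list name is illustrative):

```python
import re

pattern = re.compile(r"\babc_(?!x+_)[^_]+_ghi\b")

candidates = ["abc_def_ghi", "abc_abc_ghi", "abc_a2a_ghi", "abc_999_ghi", "abc_xxx_ghi"]

# Keep only the strings in which the pattern finds a match.
matched = [s for s in candidates if pattern.search(s)]
print(matched)
```

Only abc_xxx_ghi is dropped, because the lookahead sees xxx_ right after abc_.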
For the original numeric version of the question, you can use this:
123_(?:(?!000)\d){3}_789
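This pattern checks the lookahead before every digit it consumes. A small Python sketch with made-up numeric inputs (both Python and the sample strings are assumptions):

```python
import re

pattern = re.compile(r"123_(?:(?!000)\d){3}_789")

# Hypothetical inputs for the numeric form of the question.
samples = ["123_456_789", "123_100_789", "123_000_789"]
accepted = [s for s in samples if pattern.search(s)]
print(accepted)  # ['123_456_789', '123_100_789']
```

123_100_789 still passes, because 000 never starts at any of the three digit positions.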
If you don't wish to use look-arounds, this expression might be an option:
(?:abc_xxx_ghi)|(abc_.{3}_ghi)
Other than that I can't think of anything else.
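The alternation works by matching the unwanted string without capturing it, so the wanted strings are the ones where group 1 is set. A Python sketch of that filtering step (Python and the comma-separated sample text are assumptions):

```python
import re

pattern = re.compile(r"(?:abc_xxx_ghi)|(abc_.{3}_ghi)")

text = "abc_def_ghi, abc_xxx_ghi, abc_999_ghi"

# The first alternative consumes abc_xxx_ghi but leaves group 1 empty,
# so filtering on group 1 drops it.
wanted = [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]
print(wanted)  # ['abc_def_ghi', 'abc_999_ghi']
```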

Regex negative lookahead not working as expected

I have the following regex:
[a-zA-Z0-9. ]*(?!cs)
and the string
Hotfix H5.12.1.00.cs02_ADV_LCR
I want to match only until
Hotfix H5.12.1.00
But the regex matches until "cs02"
Shouldn't the negative lookahead have done the job?
You may consider using a tempered greedy token:
(?:(?!\.cs)[a-zA-Z0-9. ])*
This works whether or not .cs is present in the string, because the tempered greedy token matches zero or more characters from the [a-zA-Z0-9. ] character class and stops at the first position where .cs begins.
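A Python sketch of the tempered greedy token on the string from the question, plus a made-up string without .cs (Python and the second input are assumptions):

```python
import re

pattern = re.compile(r"(?:(?!\.cs)[a-zA-Z0-9. ])*")

# The lookahead is checked before each character, so matching stops
# right before the "." that starts ".cs".
with_cs = pattern.match("Hotfix H5.12.1.00.cs02_ADV_LCR").group(0)
# Hypothetical input with no ".cs": the whole string is matched.
without_cs = pattern.match("Hotfix H5.12.1.00").group(0)

print(with_cs)     # Hotfix H5.12.1.00
print(without_cs)  # Hotfix H5.12.1.00
```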
You need to use positive lookahead instead of negative lookahead.
[a-zA-Z0-9. ]*(?=\.cs)
or
[a-zA-Z0-9. ]+(?=\.cs)
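In Python (assumed for illustration), the positive lookahead version gives the wanted prefix, but it finds nothing when .cs is absent:

```python
import re

pattern = re.compile(r"[a-zA-Z0-9. ]*(?=\.cs)")

# The greedy class first runs past ".cs02", then backtracks to the
# last position that is directly followed by ".cs".
found = pattern.search("Hotfix H5.12.1.00.cs02_ADV_LCR").group(0)
# Hypothetical input without ".cs": the lookahead can never succeed.
missing = pattern.search("Hotfix H5.12.1.00")

print(found)    # Hotfix H5.12.1.00
print(missing)  # None
```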
Note that your regex [a-zA-Z0-9. ]*(?!cs) is greedy and matches all the characters until it reaches a position that isn't followed by cs.
At first, the pattern [a-zA-Z0-9. ]* greedily matches Hotfix H5.12.1.00.cs02, because it matches letters, digits, dots and spaces. Once it sees the underscore, it stops, and at that point both conditions are satisfied:
_ won't get matched by [a-zA-Z0-9. ]*
_ is not cs
It works the same way for the two further matches.
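The behaviour described above can be reproduced with the original pattern in Python (an assumption; the question does not name an engine):

```python
import re

text = "Hotfix H5.12.1.00.cs02_ADV_LCR"

# The negative lookahead is only tested where the greedy class stops,
# which is in front of "_", so "cs" has already been consumed by then.
original = re.match(r"[a-zA-Z0-9. ]*(?!cs)", text).group(0)
print(original)  # Hotfix H5.12.1.00.cs02
```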