Regex to first occurrence only? [duplicate] - regex

Let's say I have the following string:
this is a test for the sake of
testing. this is only a test. The end.
and I want to select this is a test and this is only a test. What in the world do I need to do?
The following Regex I tried yields a goofy result:
this(.*)test (I also wanted to capture what was between it)
returns this is a test for the sake of testing. this is only a test
It seems like this is probably something easy I'm forgetting.

The regex is greedy, meaning it will capture as many characters as it can that fall into the .* match. To make it non-greedy, try:
this(.*?)test
The ? modifier will make it capture as few characters as possible in the match.
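As a quick illustration of the difference, here is a minimal Python sketch. It assumes the sample text sits on one line (as in the output quoted in the question) and uses Python's re module purely as an example flavor; the variable names are made up for the demo.

```python
import re

# Sample text from the question, joined onto a single line.
text = "this is a test for the sake of testing. this is only a test. The end."

# Greedy: .* runs as far as it can, so the match ends at the LAST "test".
greedy = re.findall(r"this(.*)test", text)

# Lazy: .*? stops at the first "test" that lets the overall match succeed.
lazy = re.findall(r"this(.*?)test", text)

print(greedy)  # one long capture spanning both sentences
print(lazy)    # two short captures, one per sentence
```

The lazy version finds both occurrences because each match ends as early as possible, leaving the rest of the text for the next one.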

Andy E and Ipsquiggle have the right idea, but I want to point out that you might want to add a word boundary assertion, meaning you don't want to deal with words that have "this" or "test" in them, only the words by themselves. In Perl and similar flavors that's done with the "\b" marker.
As it is, this(.*?)test would match "thistles are the greatest", which you probably don't want.
The pattern you want is something like this: \bthis\b(.*?)\btest\b
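A small Python sketch of the effect of the word boundaries, using the "thistles" example above (Python's re is assumed here only because it supports \b in the same way):

```python
import re

plain = re.compile(r"this(.*?)test")
bounded = re.compile(r"\bthis\b(.*?)\btest\b")

sample = "thistles are the greatest"

# Without \b, "this" is found inside "thistles" and "test" inside "greatest".
plain_match = plain.search(sample)

# With \b, neither word stands alone, so there is no match at all.
bounded_match = bounded.search(sample)

print(plain_match.group(1) if plain_match else None)
print(bounded_match)
```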

* is a greedy quantifier. That means it matches as much as possible, i.e. what you are seeing. Depending on the specific language support for regex, you will need to find a non-greedy quantifier. Usually this is a trailing question mark, like this: *?. That means it will stop consuming letters as soon as the rest of the regex can be satisfied.
There is a good explanation of greediness here.

For me, simply removing the /g flag worked.
See https://regex101.com/r/EaIykZ/1

Related

How can I solve this regex using two asserts?

I have these 3 consecutive words : Nocivic Voie and Quartier
I have something like this :
#Nocivic;Voie;Quartier#
Question :
I need to make a regex to extract the 3 words Nocivic, Voie and Quartier using a positive lookahead. The semicolons need to be included in my regex, but not the #.
I realized that this could work : \bNocivic(?=;Voie);\bVoie;Quartier
But why is this not working ?
\bNocivic(?=;Voie);\bVoie(?<=Voie;)\bQuartier
I am not too experienced with regex, so if someone could tell me why, or give me the correct answer if I really wanted to use another lookbehind, it would be greatly appreciated. Thanks.
The first one is equivalent to
\bNocivic;Voie;Quartier\b
(?=;Voie) just tests whether ;Voie follows Nocivic, so it is not useful here.
Excerpt from
https://www.regextutorial.org/positive-and-negative-lookahead-assertions.php
They only assert if in a given test string the match with certain conditions is possible or not Yes or No.
See the difference below
Nocivic;Voie Ok & returns Nocivic;Voie
Nocivic(?=;Voie) Ok & returns Nocivic
Second one :
(?<=Voie;) is a lookbehind, not a lookahead: it checks the text to the left of the current position.
The second one does not work because, after matching Voie, you assert with (?<=Voie;) that Voie; is directly to the left of the current position, but you have not matched the semicolon yet.
Note that the lookaround assertions are fruitless in the example, as you are asserting what you are also matching.
If you want to match exactly those 3 words, it does not make sense to use lookarounds.
You can use 3 capture groups:
#(Nocivic);(Voie);(Quartier)#
Regex demo
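To make the three cases concrete, here is a minimal Python sketch run against the sample string from the question. Python's re is only an assumed flavor here; the lookbehind behaves the same way in most engines that support it.

```python
import re

s = "#Nocivic;Voie;Quartier#"

# Fails: the lookbehind is checked right after "Voie", where the text to the
# left ends in ";Voie", not "Voie;" (the semicolon has not been consumed yet).
failing = re.search(r"\bNocivic(?=;Voie);\bVoie(?<=Voie;)\bQuartier", s)

# Works: the lookahead is redundant, but everything is actually matched.
working = re.search(r"\bNocivic(?=;Voie);\bVoie;Quartier", s)

# Plain capture groups, no lookarounds needed.
groups = re.search(r"#(Nocivic);(Voie);(Quartier)#", s)

print(failing)
print(working.group(0))
print(groups.groups())
```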

Can I exclude Positive Lookaheads and Lookbehinds within a snippet in vscode?

I am having issues excluding parts of a string in a VSCode Snippet. Essentially, what I want is a specific piece of a path but I am unable to get the regex to exclude what I need excluded.
I have recently asked a question about something similar which you can find here: Is there a way to trim a TM_FILENAME beyond using TM_FILENAME_BASE?
As you can see, I am getting mainly tripped up by how the snippets work within vscode and not so much the regular expressions themselves
${TM_FILEPATH/(?<=area)(.+)(?=state)/${1:/pascalcase}/}
Given a file path that looks like abc/123/area/my-folder/state/...
Expected:
/MyFolder/
Actual:
abc/123/areaMyFolderstate/...
You need to match the whole string to achieve that:
"${TM_FILEPATH/.*area(\\/.*?\\/)state.*/${1:/pascalcase}/}"
See the regex demo
Details
.* - any 0+ chars other than line break chars, as many as possible
area - a word
(\\/.*?\\/) - Group 1: /, any 0+ chars other than line break chars, as few as possible, and a /
state.* - state substring and the rest of the line.
NOTE: If there must be no other subparts between area and state, replace .*? with [^\\/]* or even [^\\/]+.
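The "match the whole string, keep only the group" idea can be tried outside VS Code. The sketch below uses Python's re.sub as a stand-in for the snippet transform, which is an assumption: it does not apply pascalcase and only shows which part of the path survives. The second path is a made-up example with an extra folder between area and state.

```python
import re

path = "abc/123/area/my-folder/state/..."

# Whole-string match: everything outside group 1 is consumed and dropped.
lazy = re.sub(r".*area(/.*?/)state.*", r"\1", path)

# Stricter variant from the note: exactly one path segment between area and state.
strict = re.sub(r".*area(/[^/]+/)state.*", r"\1", path)

# Hypothetical path with two segments between "area" and "state".
nested = "abc/area/a/b/state/x"
lazy_nested = re.sub(r".*area(/.*?/)state.*", r"\1", nested)
strict_nested = re.sub(r".*area(/[^/]+/)state.*", r"\1", nested)

print(lazy, strict)
print(lazy_nested, strict_nested)
```

With the nested path, the lazy version still captures both segments, while the strict version does not match and leaves the input unchanged.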
The expected output seems to differ from the corresponding part of the input string. If that is desired, the expression might be pretty complicated, such as:
(?:[\s\S].*?)(?<=area\/)([^-])([^-]*)(-)([^\/])([^\/]*).*
and a replacement of something similar to /\U$1\E$2$3\U$4\E$5/, if available.
Demo 1
If other operations are applied afterwards (I'm guessing the pascalcase transform does something here), this simpler expression might work:
.*area(\\/.*?\\/).*
and the desired data is in this capturing group $1:
(\\/.*?\\/)
Demo 2
Building on my answer you linked to in your question, remember that lookarounds are "zero-length assertions" and "do not consume characters in the string". See lookarounds are zero-length assertions:
Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions just like the start and end of line, and start and end of word anchors explained earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string, but only assert whether a match is possible or not.
So in your snippet transform: /(?<=area)(.+)(?=state)/ the lookaround portions are not actually consumed and so are simply passed through. Vscode treats them, as it should, as not actually being within the "part to be transformed" segment at all.
That is why lookarounds are not excluded from your transform.
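The pass-through effect can be reproduced with any regex replace. Below is a rough Python sketch; pascal_case is a hypothetical helper that only approximates what a pascalcase transform does and is not VS Code's implementation.

```python
import re

def pascal_case(s):
    # Hypothetical stand-in: split on non-alphanumerics, capitalize each piece.
    return "".join(w.capitalize() for w in re.split(r"[^A-Za-z0-9]+", s) if w)

path = "abc/123/area/my-folder/state/..."

# Lookarounds consume nothing, so only the middle is replaced and the
# text before "area" and after "state" is passed through untouched.
with_lookarounds = re.sub(
    r"(?<=area)(.+)(?=state)", lambda m: pascal_case(m.group(1)), path
)

# Consuming the whole string means only the transformed group remains.
whole_match = re.sub(
    r".*area(/.*?/)state.*", lambda m: pascal_case(m.group(1)), path
)

print(with_lookarounds)
print(whole_match)
```

The first result has the same shape as the "Actual" output in the question, which is what you would expect if the replace only touches the consumed part of the string.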

Using just regex is there a way to skip words or chars when using a lookaround? [duplicate]

I guess specifically this might be about a negative look ahead.
If I had a sentence in the form of:
this is the WORD i want and I want and this is the PHRASE I DONT WANT
Is there a way to use just regex to match "WORD", but only if "PHRASE" is not present? My initial idea was a negative lookahead, but that only checks the word immediately following. I then tried using (?:\w+(?:\s*[\,\-\'\:\/]\s*|\s+)){0,3} and other similar tricks, but this would match the words in between and not the actual phrase, not to mention the wonkiness of + in lookarounds. Then I thought about using a grouping like [^something], but I didn't know how to do that with full words without a lookaround. I then had the idea to nest lookarounds, which I found out is possible, but that still leaves me with the root of the problem.
Can you skip words in the matching for a lookaround, and if not, how would I go about solving this issue?
Because if I nest using a lookbehind, I still need to skip stuff to get to WORD in order to match it.
Assume the words are arbitrary in the sentence but the key word and the key phrase is something specific.
If I understand your question, you can try:
/(?!.*PHRASE I DONT WANT)WORD/
Demo
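A minimal Python sketch of how the .* inside the negative lookahead "skips" the words in between. The anchored variant at the end is my own addition, not part of the answer above: it is one possible way to also reject lines where the phrase appears before WORD. The shorter test sentences are made up for the demo.

```python
import re

pattern = re.compile(r"(?!.*PHRASE I DONT WANT)WORD")

with_phrase = "this is the WORD i want and I want and this is the PHRASE I DONT WANT"
without_phrase = "this is the WORD i want and I want"
phrase_first = "PHRASE I DONT WANT then the WORD"

# The lookahead scans from WORD to the end of the line, so it only sees
# a phrase that comes AFTER the word.
hit_with = pattern.search(with_phrase)
hit_without = pattern.search(without_phrase)
hit_phrase_first = pattern.search(phrase_first)

# Assumed variant: anchor the lookahead at the start so the whole line is checked.
anchored = re.compile(r"^(?!.*PHRASE I DONT WANT).*?(WORD)")
anchored_phrase_first = anchored.search(phrase_first)
anchored_without = anchored.search(without_phrase)

print(hit_with, hit_without, hit_phrase_first)
print(anchored_phrase_first, anchored_without)
```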

Simple regex for matching up to an optional character?

I'm sure this is a simple question for someone at ease with regular expressions:
I need to match everything up until the character #
I don't want the string following the # character, just the stuff before it, and the character itself should not be matched. This is the most important part, and what I'm mainly asking. As a second question, I would also like to know how to match the rest, after the # character. But not in the same expression, because I will need that in another context.
Here's an example string:
topics/install.xml#id_install
I want only topics/install.xml. And for the second question (separate expression) I want id_install
First expression:
^([^#]*)
Second expression:
#(.*)$
[a-zA-Z0-9]*[\#]
If your string contains any other special characters you need to add them into the first square bracket escaped.
I don't use C#, but I will assume that it uses PCRE... if so,
"([^#]*)#.*"
with a call to 'match'. A call to 'search' does not need the trailing ".*"
The parens define the 'keep group'; the [^#] means any character that is not a '#'
You probably tried something like
"(.*)#.*"
and found that it fails when multiple '#' signs are present (keeping the leading '#'s)?
That is because ".*" is greedy, and will match as much as it can.
Your matcher should have a method that looks something like 'group(...)'. Most matchers
return the entire matched sequence as group(0), the first paren-matched group as group(1),
and so forth.
PCRE is so important that I strongly encourage you to search for it on Google, learn it, and always have it in your programming toolkit.
Use look ahead and look behind:
To get all characters up to, but not including the pound (#): .*?(?=\#)
To get all characters following, but not including the pound (#): (?<=\#).*
If you don't mind using groups, you can do it all in one shot:
(.*?)\#(.*) Your answers will be in group(1) and group(2). Notice the non-greedy construct, *?, which will attempt to match as little as possible instead of as much as possible.
If you want to allow for missing # section, use ([^\#]*)(?:\#(.*))?. It uses a non-collecting group to test the second half, and if it finds it, returns everything after the pound.
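Here is a short Python sketch of these patterns on the example string. Python is used only for illustration; its group(1)/group(2) accessors play the role of the group(...) method mentioned above.

```python
import re

s = "topics/install.xml#id_install"

# Lookaround versions: the '#' is asserted but never part of the match.
before = re.search(r".*?(?=#)", s).group(0)
after = re.search(r"(?<=#).*", s).group(0)

# One-shot version with an optional fragment.
optional = re.compile(r"([^#]*)(?:#(.*))?")
with_hash = optional.match(s).groups()
without_hash = optional.match("topics/install.xml").groups()

print(before, after)
print(with_hash)
print(without_hash)  # second group is None when there is no '#'
```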
Honestly though, for your situation, it is probably easier to use the Split method provided in String.
More on lookahead and lookbehind
first:
/[^\#]*(?=\#)/ edit: is faster than /.*?(?=\#)/
second:
/(?<=\#).*/
For something like this in C# I would usually skip the regular expressions stuff altogether and do something like:
string[] split = exampleString.Split('#');
string firstString = split[0];
string secondString = split[1];
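One thing to watch with the split approach is the case where the string has no # at all: indexing the second element then fails. As a rough illustration of the same idea with that case handled, here is a Python sketch; split_fragment is a hypothetical helper name, and str.partition is the Python counterpart I am assuming for the job, not something from the answer above.

```python
def split_fragment(value):
    # partition() always returns three parts; sep is "" when '#' is absent.
    before, sep, after = value.partition("#")
    return before, (after if sep else None)

print(split_fragment("topics/install.xml#id_install"))
print(split_fragment("topics/install.xml"))
```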

Regular expression greedy match not working as expected

I have a very basic regular expression that I just can't figure out why it's not working so the question is two parts. Why does my current version not work and what is the correct expression.
Rules are pretty simple:
Must have minimum 3 characters.
If a % character is the first character must be a minimum of 4 characters.
So the following cases should work out as follows:
AB - fail
ABC - pass
ABCDEFG - pass
% - fail
%AB - fail
%ABC - pass
%ABCDEFG - pass
%%AB - pass
The expression I am using is:
^%?\S{3}
Which to me means:
^ - Start of string
%? - Greedy check for 0 or 1 % character
\S{3} - 3 other characters that are not white space
The problem is, the %? for some reason is not doing a greedy check. It's not eating the % character if it exists so the '%AB' case is passing which I think should be failing. Why is the %? not eating the % character?
Someone please show me the light :)
Edit: The answer I used was Dav below: ^(%\S{3}|[^%\s]\S{2})
It was really a two-part answer, though, and Alan's made me understand why. I didn't use his version, ^(?>%?)\S{3}, because although it worked, it did not work in the JavaScript implementation. Both are great answers and a lot of help.
The word for the behavior you described isn't greedy, it's possessive. Normal, greedy quantifiers match as much as they can originally, but back off if necessary to allow the whole regex to match (I like to think of them as greedy but accommodating). That's what's happening to you: the %? originally matches the leading percent sign, but if there aren't enough characters left for an overall match, it gives up the percent sign and lets \S{3} match it instead.
Some regex flavors (including Java and PHP) support possessive quantifiers, which never back off, even if that causes the overall match to fail. .NET doesn't have those, but it has the next best thing: atomic groups. Whatever you put inside an atomic group acts like a separate regex--it either matches at the position where it's applied or it doesn't, but it never goes back and tries to match more or less than it originally did just because the rest of the regex is failing (that is, the regex engine never backtracks into the atomic group). Here's how you would use it for your problem:
^(?>%?)\S{3}
If the string starts with a percent sign, the (?>%?) matches it, and if there aren't enough characters left for \S{3} to match, the regex fails.
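The backtracking described above can be observed directly. The sketch below is in Python and deliberately avoids the (?>...) syntax, since that is only available in Python's re from version 3.11 onward. Instead it uses a lookahead plus backreference, ^(?=(%?))\1\S{3}, which is a common way to emulate an atomic group in flavors that lack one. That emulation is my substitution, not the pattern from the answer.

```python
import re

cases = ["AB", "ABC", "ABCDEFG", "%", "%AB", "%ABC", "%ABCDEFG", "%%AB"]

# Greedy but accommodating: %? gives the percent sign back if \S{3} needs it.
greedy = re.compile(r"^%?\S{3}")

# Atomic-like: the lookahead captures the optional %, and \1 must consume
# exactly that; the engine does not go back into the lookahead to retry.
atomic_like = re.compile(r"^(?=(%?))\1\S{3}")

greedy_pass = [c for c in cases if greedy.search(c)]
atomic_pass = [c for c in cases if atomic_like.search(c)]

print(greedy_pass)
print(atomic_pass)
```

The only difference between the two lists is "%AB", which is exactly the case the question was asking about.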
Note that atomic groups (or possessive quantifiers) are not necessary to solve this problem, as @Dav demonstrated. But they're very powerful tools which can easily make the difference between impossible and possible, or too damn slow and slick as can be.
Regex will always try to match the whole pattern if it can - "greedy" doesn't mean "will always grab the character if it exists", but instead means "will always grab the character if it exists and a match can be made with it grabbed".
Instead, what you probably want is something like this:
^(%\S{3}|[^%\s]\S{2})
Which will match either a % followed by 3 characters, or a non-%, non-whitespace followed by 2 more.
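A quick check of this alternation against the pass/fail table from the question, sketched in Python (the expected dictionary is simply that table transcribed):

```python
import re

pattern = re.compile(r"^(%\S{3}|[^%\s]\S{2})")

# Expected outcomes as listed in the question.
expected = {
    "AB": False,
    "ABC": True,
    "ABCDEFG": True,
    "%": False,
    "%AB": False,
    "%ABC": True,
    "%ABCDEFG": True,
    "%%AB": True,
}

results = {case: bool(pattern.search(case)) for case in expected}

for case, ok in results.items():
    print(case, "pass" if ok else "fail")
```

Because each branch commits to whether the first character is a percent sign, there is nothing for the engine to backtrack into, so no atomic group is needed.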
I always love to look at RE questions to see how much time people spend on them to "Save time"
str.len() >= (str[0] == '%' ? 4 : 3)
Although in real life I'd be more explicit, I just wrote it that way because for some reason some people consider code brevity an advantage (I'd call it an anti-advantage, but that's not a popular opinion right now)
Try the regex modified a little based on Dav's original one:
^(%\S{3,}|[^%\s]\S{2,})
with the regex option "^ and $ match at line breaks" on.