regex optional lookahead - regex

I want a regular expression to match all of these:
startabcend
startdef
blahstartghiend
blahstartjklendsomething
and to return abc, def, ghi and jkl respectively.
I have this the following which works for case 1 and 3 but am having trouble making the lookahead optional.
(?<=start).*(?=end.*)
Edit:
Hmm. Bad example. In reality, the bit in the middle is not numeric, but is preceeded by a certain set of characters and optionally succeeded by it. I have updated the inputs and outputs as requested and added a 4th example in response to someones question.

If you're able to use lookahead,
(?<=start).*?(?=(?:end|$))
as suggested by stema below is probably the simplest way to get the entire pattern to match what you want.
Alternatively, if you're able to use capturing groups, you should just do that instead:
start(.*?)(?:end)?$
and then just get the value from the first capture group.

Maybe like this:
(?<=start).*?(?=(?:end|$))
This will match till "start" and "end" or till the end of line, additionally the quantifier has to be non greedy (.*?)
See it here on Regexr
Extended the example on Regexr to not only work with digits.

An optional lookahead doesn't make sense:
If it's optional then it's ok if it matches, but it's also ok if it doesn't match. And since a lookahead does not extend the match it has absolutely no effect.
So the syntax for an optional lookahead is the empty string.

Lookahead alone won't do the job. Try this:
(?<=start)(?:(?!end).)*
The lookbehind positions you after the word "start", then the rest of it consumes everything until (but not including) the next occurrence of "end".
Here's a demo on Ideone.com

if "end" is always going to be present, then use:
(?<=start)(.*?)(?=end) as you put in the OP. Since you say "make the lookahead optional", then just run up until there's "end" or the carriage return. (?<=start)(.*?)(?=end|\n). If you don't care about capturing the "end" group, you can skip the lookahead and do (?:start)?(.*?)(?:end)? which will start after "start", if it's there and stop before "end", if it's there. You can also use more of those piped "or" patterns: (?:start|^) and (?:end|\n).

Why do you need lookahead?
start(\d+)\w*
See it on rubular

Related

Select Northings from a 1 Line String

I have the following string;
Start: 738392E, 6726376N
I extracted 738392 ok using (?<=.art\:\s)([0-9A-Z]*). This gave me a one group match allowing me to extract it as a column value
.
I want to extract 6726376 the same way. Have only one group appear because I am parsing that to a column value.
Not sure why is (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) giving me the entire line after S.
Helping me get it right with an explanation will go along way.
Because you used positive lookaheads. Those just make some assertions, but don't "move the head along".
(?=(art\:\s\s*)) makes sure you're before "art: ...". The next thing is another positive lookahead that you quantify with a star to make it optional. Finally you match anything, so you get the rest of the line in your capture group.
I propose a simpler regex:
(?<=(art\:\s))(\d+)\D+(\d+)
Demo
First we make a positive lookback that makes sure we're after "art: ", then we match two numbers, seperated by non-numbers.
There is no need for you to make it this complicated. Just use something like
Start: (\d+)E, (\d+)N
or
\b\d+(?=[EN]\b)
if you need to match each bit separately.
Your expression (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) has several problems besides the ones already mentioned: 1) your first and second lookahead match at different locations, 2) your second lookahead is quantified, which, in 25 years, I have never seen someone do, so kudos. ;), 3) your capturing group matches about anything, including any line or the empty string.
You match the whole part after it because you use .* which will match until the end of the line.
Note that this part [0-9]* at the end of the pattern does not match because it is optional and the preceding .* already matches until the end of the string.
You could get the match without any lookarounds:
(art:\s)(\d+)[^,]+,\s(\d+)
Regex demo
If you want the matches only, you could make use of the PyPi regex module
(?<=\bStart:(?:\s+\d+[A-Z],)* )\d+(?=[A-Z])
Regex demo (For example only, using a different engine) | Python demo

RegEx for enforcing and finding the last match

I am trying to extract a part of a name from a string. I almost have it, but something isn't right where I am using a positive lookahead.
Here is my regex: (?=s\s(.*?)$)
I have marked all the results I want with bold text.
Trittbergets Ronja
Minitiger's Samanta Junior
Björntorpets Cita
Sors Kelly's Majsskalle
The problem is that Kelly's Majsskalle gets returned, when it should only select Majsskalle.
Here is a link to regex101 for debugging:
https://regex101.com/r/PZWxr7/1
How do I get the lookahead to disregard the first match?
No need for a lookahead. Just try this:
.*s\s(.*?)$
You need to enforce regular expression engine to find the last match using a dot-star:
^.*s\s(.*)$
A .* consumes everything up to a linebreak immediately then engine backtracks to match the next pattern.
See live demo here
or use a tempered dot:
s(?= ((?:(?!s ).)+)$)
^^^^^^^^^^
Match a byte only if we are not pointing at a `s[ ]`
See live demo here
Note: the former is the better solution.
The lookahead should be used to determine the start of a capture or the end of a capture. To start the capture after the first capture, you need to use a lookbehind - this ensures the text BEFORE the capture is that search pattern.
Update your pattern on regex101 to this and you'll see the difference:
(?<=s\s).*?$
Edit - my bad, I didn't spot that last line.
You can also include a negative lookahead to ensure that there's not another word that ends in s in the next match:
(?<=s\s)(?!.+?s\s).*?$
This solves the issue with the last line.

Regex for selecting words ending in 'ing' unless

I want to select words ending in with a regular expression, but I want exclude words that end in thing. For example:
everything
running
catching
nothing
Of these words, running and catching should be selected, everything and nothing should be excluded.
I've tried the following:
.+ing$
But that selects everything. I'm thinking look aheads/look arounds could be the solution, but I haven't been able to get one that works.
Solutions that work in Python or R would be helpful.
In python you can use negative lookbehind assertion as this:
^.*(?<!th)ing$
RegEx Demo
(?<!th) is negative lookbehind expression that will fail the match if th comes before ing at the end of string.
Note that if you are matching words that are not on separate lines then instead of anchors use word boundaries as:
\w+(?<!th)ing\b
Something like \b\w+(?<!th)ing\b maybe.
You might also use a negative lookahead (?! to assert that what is on the right is not 0+ times a word character followed by thing and a word boundary:
\b(?!\w*thing\b)\w*ing\b
Regex demo | Python demo

Regular Expression Should not start with a character and contain a sequence

For example, should not start with h and should contain ap.
Should match apology, rap god, trap but not match happy.
I tried
^[^h](ap)*
but it doesn't match sequences which start with ap like apology.
You may use
^(?!h).*ap
See the following demo. To match the whole string to the end, append .* at the end:
^(?!h).*ap.*
If you plan to only match words following the rules you outlined, you may use
\b(?!h)\w*ap\w*
Or, without a lookahead:
\b([^\Wh]\w*)?ap\w*
See this regex demo and the demo without a lookahead.
#WiktorStribiżew's comment with negative lookahead is correct (you might want to add .* to it if you want to match the whole string).
For completeness, you can also use alternation:
^(?:[^h].*ap|ap).*
Demo: https://regex101.com/r/ecVTGm/1

Matching double symbol

Hey guys I've been working with this one for a little while. I can't seem to get it.
Here is what I have so far
(#[^{2,}+)([^(\s\W\d{2}]+)(\b)
http://rubular.com/r/zlx3j00Wjl
Although this is not excepting periods in the match.
I basically need to match this.
#function.name(param)
I just need to match function.name. This does that.
http://rubular.com/r/hWMB72LsWT
I don't want to match this
##function.name(param)
hello##test.com`
Didn't know if anyone has any ideas. Thanks for the help.
You can use a negative lookahead: #(?!#) matches a # not followed by another #.
Here is my go at it (here it is on Rubular):
(?<!#)#(\w+(?:\.\w+)*)\([^)]*\)
Explained:
(?<!#)# # an '#' not preceded by an '#'
(\w+(?:\.\w+)*) # any number of xxx.xxx.xxx, captured into a group
\([^)]*\) # brackets, containing anything that isn't a closing bracket
Since this is Ruby, you might not care about matching parentheses. In that case you can just remove the last section.
Try this:
(?:^|\s)#+([^(]+)
You will have function.match and function.name in the first group, will not match hello##test.com. Rubular:
http://rubular.com/r/b8gy1LcVGz
Try this
(?!.*##)^#([^()\s]+)\b
See it here on Rubular
I removed some brackets from your expression
I removed the Quantifier from the leading #
(?!.*##) is a negative lookahead assertion. It will fail if it finds anywhere in the string two # characters in a row.
I am not sure about your requirements, if there is all the time a set of brackets at the end, then you don't need your word boundary. If there can be similar strings without brackets that you don't want to match, then I would add another lookahead to ensure this assertion:
?!.*##)^#([^()\s]+)(?=\()
See it here on Rubular