I can't match the sequence between two specific words - regex

I want to find a string that is between two specific words. See in the picture the pattern I made. The problem is that "INICIO" is appearing twice, I just want it to match it last appearance until the word "FIM". In this case, I wanted the first pattern to be INICIOBBBBBBBBBBBFIM. I need match all occurrence in the string.
Link to Regex
My Regex

What you are looking for sounds like what Rexegg terms a tempered greedy token.
INICIO(?:(?!INICIO).)*?FIM
See your updated demo at regex101
The negative lookahead "tempers" the dot to not jump over another start. This practice is rather costly as before each position the lookahead looks ahead if there is no INICIO ahead before proceeding.
If interested, there is a nice answer on SO explaining it in more detail.

Maybe,
.*(INICIO.*?FIM)
or
INICIO(?!.*\BINICIO\B)\B.*?\BFIM
might be OK to start with.
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
Demo 2

Reverse the string, and search for
MIF.*?OICINI
Then reverse the match
"JJJJJJJJJJJJJJJINICIOAAAAAAAAAAAAINICIOBBBBBBBBBBBBFIMJJJJJJJJJJJJJJINICIOJJJJJJJJJJJJJJFIM"
.split("")
.reverse()
.join("")
.match(/MIF.*?OICINI/g)[1]
.split("")
.reverse()
.join("");
(Notice, I'm actually grabbing the second match here. This is because the matches are in reverse order as well)

A bit contrived but this could work:
(?<=INICIO).*?(INICIO.*?FIM)|(?<=FIM).*?(INICIO.*?FIM)
Your result will be in $2
https://regex101.com/r/qUf4kE/1

Related

How to write a regular expression inside awk to IGNORE a word as a whole? [duplicate]

I've been looking around and could not make this happen. I am not totally noob.
I need to get text delimited by (including) START and END that doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.
Example string:
abcSTARTabcSTARTabcENDabc
The expected result:
STARTabcEND
Not good:
STARTabcSTARTabcEND
I can't use backward search stuff. I am testing my regex here: www.regextester.com
Thanks for any advice.
Try this
START(?!.*START).*?END
See it here online on Regexr
(?!.*START) is a negative lookahead. It ensures that the word "START" is not following
.*? is a non greedy match of all characters till the next "END". Its needed, because the negative lookahead is just looking ahead and not capturing anything (zero length assertion)
Update:
I thought a bit more, the solution above is matching till the first "END". If this is not wanted (because you are excluding START from the content) then use the greedy version
START(?!.*START).*END
this will match till the last "END".
START(?:(?!START).)*END
will work with any number of START...END pairs. To demonstrate in Python:
>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']
If you only care for the content between START and END, use this:
(?<=START)(?:(?!START).)*(?=END)
See it here:
>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']
The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.
Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks to exceed the torture threshold in most cultures, though.
Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this but by the addition of S in [^ST] and adding S* before the non-optional S to allow for arbitrary repetitions of S otherwise.
May I suggest a possible improvement on the solution of Tim Pietzcker?
It seems to me that START(?:(?!START).)*?END is better in order to only catch a START immediately followed by an END without any START or END in between. I am using .NET and Tim's solution would match also something like START END END. At least in my personal case this is not wanted.
[EDIT: I have left this post for the information on capture groups but the main solution I gave was not correct.
(?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END)
as pointed out in the comments would not work; I was forgetting that the ignored characters could not be dropped and thus you would need something such as ...|STA(?![^R])| to still allow that character to be part of END, thus failing on something such as STARTSTAEND; so it's clearly a better choice; the following should show the proper way to use the capture groups...]
The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.
That way if you are using it to do search and replace, you can do, something like BEGIN$1FINISH. So, if you started with:
abcSTARTdefSTARTghiENDjkl
you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:
abcSTARTdefBEGINghiFINISHjkl
which would allow you to change your START/END tokens only when paired properly.
Each (x) is a group, but I have put (?:x) for each of the ones except the middle which marks it as a non-capturing group; the only one I left without a ?: was the middle; however, you could also conceivably capture the BEGIN/END tokens as well if you wanted to move them around or what-have-you.
See the Java regex documentation for full details on Java regexes.

Regular Expression look ahead in log

https://regex101.com/r/9kfa7D/4
I can never get the look ahead portion correct. I've tried a few different things, but I'm trying to get to the next date and parse it like that. Mainly because I don't know what the message will look like and it could be pretty random. Any help would be great.
I need to group the message portion of it.
Edit: Updated to make it a little more clear of what I'm trying to do. Never everything from each date.
You can just tweak your regex without tinkering lookahead like this:
^\d{2}-\w{3}-\d{4} (?:\d{2}:){2}\d{2}\.\d{3}
Updated Regex Demo
EDIT:
As per updated question OP can use this negative lookahead based regex to capture log text:
^[^\[]+\[[^\]]+\] +[^:]+ +(.*(?:\n(?!\d{2}-[a-zA-Z]{3}-).*)*)
This regex doesn't use DOTALL flag by unrolling the loop in last segment. This makes above regex pretty fast to complete the parsing.
New Demo
If you care for the message between log timestamps use this (it's in the 2-nd group):
/(\d{2}-\w{3}-\d{4} \S+ \S+ \[[^\]]++\] )(?=(.+)((?1)|\z))/gms
^(?:\d{2}-\w{3}-\d{4} (?:\d{2}:){2}\d{2}\.\d{3}) ((?:[^\n]+(?:\n+(?!\d{2}-\w{3}-\d{4})|))+)
The first part is the date pattern, which is non-grouping since you do not want to keep the date.
The second part is [^\n]+ which is followed by a \n provided it is not followed by \d{2}-\w{3}-\d{4} (hence the negative look ahead).
The second part is then repeated any number of times.
You can see the demo on regex101.
What you need
(^\d+.[A-Z].*?)[A-Z]
how it works
Lots of people like the complex thinking when they are confront a regex. But you should know exactly what you want.
you just need to match this: 29-Jun-2016 09:33:43.565 INFO and nothing else. So let's begin:
First: two digit,
next: A word with capital letter
next: everything from this word to the next capitalize word
finish.
the main rule
Non-greedy mantch: .*?
prove
NOTE
do you want to match from beginning to log
very easy just add .*?log at the end. that's it.
Do you ever pay attention to how many steps it take?
First of mine: 7952
Second of mine: 13751
Compare it with other
After putting the picture here. some guys update their regex. I do
not want to argue. no problem. I just wanted to show it.
Otherwise I can ( as you can ) makes it less by choice the specific
pattern For example: ^\d+-[A-Za-z]+-\d+\s\d+:\d+:\d+\.\d+ Now 7952 become 3878
Do you want to learn how lock-head assertion works?
Very easy. The main concept is that (?=) is never matches anything. It only matches the position just one point before you want.
like:
^\d+-[A-Z].+(?=[A-Z]+ ).
It still matches: 29-Jun-2016 09:33:43.565 INFO
Pay attention to . at the end. So here the look head assertion point to between F and O
If would like to match this 29-Jun-2016 09:33:43.565 then what can you do?
Think about this:
^\d+-[A-Za-z].+(?=[\d] ).
and figure out it by yourself.

Smallest possible match / nongreedy regex search

I first thought that this answer will totaly solve my issue, but it did not.
I have a string url like this one:
http://www.someurl.com/some-text-1-0-1-0-some-other-text.htm#id_76
I would like to extract some-other-text so basically, I come with the following regex:
/0-(.*)\.htm/
Unfortunately, this matches 1-0-some-other-text because regex are greedy. I can not succeed make it nongreedy using .*?, it just does not change anything as you can see here.
I also tried with the U modifier but it did not help.
Why the "nongreedy" tip does not work?
In case you need to get the closest match, you can make use of a tempered greedy token.
0-((?:(?!0-).)*)\.htm
See demo
The lazy version of your regex does not work because regex engine analyzes the string from left to right. It always gets leftmost position and checks if it can match. So, in your case, it found the first 0-and was happy with it. The laziness applies to the rightmost position. In your case, there is 1 possible rightmost position, so, lazy matching could not help achieve expected results.
You also can use
0-((?!.*?0-).*)\.htm
It will work if you have individual strings to extract the values from.
You want to exclude the 1-0? If so, you can use a non capturing group:
(?:1-0-)+(.*?)\.htm
Demo

Is it possible to say in Regex "if the next word does not match this expression"?

I'm trying to detect occurrences of words italicized with *asterisks* around it. However I want to ensure it's not within a link. So it should find "text" in here is some *text* but not within http://google.com/hereissome*text*intheurl.
My first instinct was to use look aheads, but it doesn't seem to work if I use a URL regex such as John Gruber's:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
And put it in a look ahead at the beginning of the pattern, followed by the rest of the pattern.
(?=URLPATTERN)\*[a-zA-Z\s]\*
So how would I do this?
You can use this alternation technique to match everything first on LHS that you want to discard. Then on RHS use captured group to match desired text.
https?:\/\/\S*|(\*\S+\*)
You can then use captured group #1 for your emphasized text.
RegEx Demo
The following regexp:
^(?!http://google.com/hereissome.*text.*intheurl).*
Matches everything but http://google.com/hereissome*text*intheurl. This is called negative lookahead. Some regexp libraries may not support it, python's does.
Here is a link to Mastering Lookahead and Lookbehind.

Regex - Get string between two words that doesn't contain word

I've been looking around and could not make this happen. I am not totally noob.
I need to get text delimited by (including) START and END that doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.
Example string:
abcSTARTabcSTARTabcENDabc
The expected result:
STARTabcEND
Not good:
STARTabcSTARTabcEND
I can't use backward search stuff. I am testing my regex here: www.regextester.com
Thanks for any advice.
Try this
START(?!.*START).*?END
See it here online on Regexr
(?!.*START) is a negative lookahead. It ensures that the word "START" is not following
.*? is a non greedy match of all characters till the next "END". Its needed, because the negative lookahead is just looking ahead and not capturing anything (zero length assertion)
Update:
I thought a bit more, the solution above is matching till the first "END". If this is not wanted (because you are excluding START from the content) then use the greedy version
START(?!.*START).*END
this will match till the last "END".
START(?:(?!START).)*END
will work with any number of START...END pairs. To demonstrate in Python:
>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']
If you only care for the content between START and END, use this:
(?<=START)(?:(?!START).)*(?=END)
See it here:
>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']
The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.
Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks to exceed the torture threshold in most cultures, though.
Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this but by the addition of S in [^ST] and adding S* before the non-optional S to allow for arbitrary repetitions of S otherwise.
May I suggest a possible improvement on the solution of Tim Pietzcker?
It seems to me that START(?:(?!START).)*?END is better in order to only catch a START immediately followed by an END without any START or END in between. I am using .NET and Tim's solution would match also something like START END END. At least in my personal case this is not wanted.
[EDIT: I have left this post for the information on capture groups but the main solution I gave was not correct.
(?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END)
as pointed out in the comments would not work; I was forgetting that the ignored characters could not be dropped and thus you would need something such as ...|STA(?![^R])| to still allow that character to be part of END, thus failing on something such as STARTSTAEND; so it's clearly a better choice; the following should show the proper way to use the capture groups...]
The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.
That way if you are using it to do search and replace, you can do, something like BEGIN$1FINISH. So, if you started with:
abcSTARTdefSTARTghiENDjkl
you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:
abcSTARTdefBEGINghiFINISHjkl
which would allow you to change your START/END tokens only when paired properly.
Each (x) is a group, but I have put (?:x) for each of the ones except the middle which marks it as a non-capturing group; the only one I left without a ?: was the middle; however, you could also conceivably capture the BEGIN/END tokens as well if you wanted to move them around or what-have-you.
See the Java regex documentation for full details on Java regexes.