match two patterns followed one by another - regex

I'm creating a calendar where users can set events and time in single line, for example:
"6pm supper" - event with start time only
"8:00 - 16:00 work" - event with time period
Regex I'm currently using to get times:
[\d]{1,2}[.|:]?[\d]{0,2}[\s]?[am|pm|AM|PM]{0,2}
It works fine but I can't figure out how to filter out the unwanted occurrences of time if they happen, for example:
"6pm supper at '8pm' restaurant" In this example '8pm' is a restaurant name but it will be interpreted as event with time period while it's not. I suppose I have to write a regex that will only match time pattern in the beginning of line and the next pattern that follows after it without any words between but I have no success composing such a regex so far.
Any suggestions?

What if you used the following regex
([\d]{1,2}[.|:]?[\d]{0,2}[\s]?[apm|APM]{0,2})( - )?([\d]{1,2}[.|:]?[\d]{0,2}[\s]?[apm|APM]{0,2})?(.*)
This would allow you to access the different sections e.g. 6pm supper at '8pm' restaurant
would be:
(6pm)()()( supper at '8pm' restaurant)
$1 $2$3 $4

You could try using a lookbehind construct, to only select dates that are not preceded by letters other than "a","p", and "m". Something in the line of
(?<![letters other than apm].*)
According to http://www.regular-expressions.info/lookaround.html, not all Regex implementations support this in the needed extent, though. Most do not seem to allow .* in a lookbehind.

Would ^[\d]{1,2}[.|:]?[\d]{0,2}[\s]?[am|pm|AM|PM]{0,2} fix the problem of matching the '8pm' in your example?
The ^ is used to match the start of a line. $ can be used for matching the end of a line (in case you need that for later;) ).
UPDATE:
This one's a bit ugly but it seems to work:
[^'"][\d]{1,2}[.|:]?[\d]{0,2}[\s]?[am|pm|AM|PM]{0,2}[^'"]|^[\d]{1,2}[.|:]?[\d]{0,2}[\s]?[am|pm|AM|PM]{0,2}
The first option ensures that if a time appears in the middle of a string, it can't be surrounded by any kind of quote character. The second option allows for times that are at the start of a string. This is ugly looking and can probably be improved somewhat... but it seems to work for me.
UPDATE:
I think this version's a little easier to read:
([^'"]|^)[\d]{1,2}[.|:]?[\d]{0,2}[\s]?[am|pm|AM|PM]{0,2}[^'"]

Related

How to write a regular expression inside awk to IGNORE a word as a whole? [duplicate]

I've been looking around and could not make this happen. I am not totally noob.
I need to get text delimited by (including) START and END that doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.
Example string:
abcSTARTabcSTARTabcENDabc
The expected result:
STARTabcEND
Not good:
STARTabcSTARTabcEND
I can't use backward search stuff. I am testing my regex here: www.regextester.com
Thanks for any advice.
Try this
START(?!.*START).*?END
See it here online on Regexr
(?!.*START) is a negative lookahead. It ensures that the word "START" is not following
.*? is a non greedy match of all characters till the next "END". Its needed, because the negative lookahead is just looking ahead and not capturing anything (zero length assertion)
Update:
I thought a bit more, the solution above is matching till the first "END". If this is not wanted (because you are excluding START from the content) then use the greedy version
START(?!.*START).*END
this will match till the last "END".
START(?:(?!START).)*END
will work with any number of START...END pairs. To demonstrate in Python:
>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']
If you only care for the content between START and END, use this:
(?<=START)(?:(?!START).)*(?=END)
See it here:
>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']
The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.
Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks to exceed the torture threshold in most cultures, though.
Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this but by the addition of S in [^ST] and adding S* before the non-optional S to allow for arbitrary repetitions of S otherwise.
May I suggest a possible improvement on the solution of Tim Pietzcker?
It seems to me that START(?:(?!START).)*?END is better in order to only catch a START immediately followed by an END without any START or END in between. I am using .NET and Tim's solution would match also something like START END END. At least in my personal case this is not wanted.
[EDIT: I have left this post for the information on capture groups but the main solution I gave was not correct.
(?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END)
as pointed out in the comments would not work; I was forgetting that the ignored characters could not be dropped and thus you would need something such as ...|STA(?![^R])| to still allow that character to be part of END, thus failing on something such as STARTSTAEND; so it's clearly a better choice; the following should show the proper way to use the capture groups...]
The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.
That way if you are using it to do search and replace, you can do, something like BEGIN$1FINISH. So, if you started with:
abcSTARTdefSTARTghiENDjkl
you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:
abcSTARTdefBEGINghiFINISHjkl
which would allow you to change your START/END tokens only when paired properly.
Each (x) is a group, but I have put (?:x) for each of the ones except the middle which marks it as a non-capturing group; the only one I left without a ?: was the middle; however, you could also conceivably capture the BEGIN/END tokens as well if you wanted to move them around or what-have-you.
See the Java regex documentation for full details on Java regexes.

Regex return single match starting from beginning of line

I have text that looks something like this (but much uglier of course):
On 9/02/2019, 9/03/2019 and 9/05/2019 this thing happened
and I would like to regex capture anything related to the date
On 9/02/2019, 9/03/2019 and 9/05/2019
Using regex
([a-zA-Z ]{0,5}(?:\d+\/)+\d+\s*(?:,|and|\s)+)
this seems to work fine (allowing for the first 5 characters of string to be leading words).
But when I tell it I only want to start at the beginning of the string with
^([a-zA-Z ]{0,5}(?:\d+\/)+\d+\s*(?:,|and|\s)+)
It only captures On 9/02/2019,.
Strings that begin with the event followed by the date I will need to treat differently
He was walking when he encountered a stray dog on 9/2/19
https://regex101.com/r/p4H10Z/1
Thanks for the rapid responses. I was hoping to figure out how to make it a single match. Now I see. Grouping all of the date matches in another group that can be repeated does the job.
^([a-zA-Z ]{0,5}((?:\d+\/)+\d+\s*(?:,|and|\s)+)+)

Regular Expression look ahead in log

https://regex101.com/r/9kfa7D/4
I can never get the look ahead portion correct. I've tried a few different things, but I'm trying to get to the next date and parse it like that. Mainly because I don't know what the message will look like and it could be pretty random. Any help would be great.
I need to group the message portion of it.
Edit: Updated to make it a little more clear of what I'm trying to do. Never everything from each date.
You can just tweak your regex without tinkering lookahead like this:
^\d{2}-\w{3}-\d{4} (?:\d{2}:){2}\d{2}\.\d{3}
Updated Regex Demo
EDIT:
As per updated question OP can use this negative lookahead based regex to capture log text:
^[^\[]+\[[^\]]+\] +[^:]+ +(.*(?:\n(?!\d{2}-[a-zA-Z]{3}-).*)*)
This regex doesn't use DOTALL flag by unrolling the loop in last segment. This makes above regex pretty fast to complete the parsing.
New Demo
If you care for the message between log timestamps use this (it's in the 2-nd group):
/(\d{2}-\w{3}-\d{4} \S+ \S+ \[[^\]]++\] )(?=(.+)((?1)|\z))/gms
^(?:\d{2}-\w{3}-\d{4} (?:\d{2}:){2}\d{2}\.\d{3}) ((?:[^\n]+(?:\n+(?!\d{2}-\w{3}-\d{4})|))+)
The first part is the date pattern, which is non-grouping since you do not want to keep the date.
The second part is [^\n]+ which is followed by a \n provided it is not followed by \d{2}-\w{3}-\d{4} (hence the negative look ahead).
The second part is then repeated any number of times.
You can see the demo on regex101.
What you need
(^\d+.[A-Z].*?)[A-Z]
how it works
Lots of people like the complex thinking when they are confront a regex. But you should know exactly what you want.
you just need to match this: 29-Jun-2016 09:33:43.565 INFO and nothing else. So let's begin:
First: two digit,
next: A word with capital letter
next: everything from this word to the next capitalize word
finish.
the main rule
Non-greedy mantch: .*?
prove
NOTE
do you want to match from beginning to log
very easy just add .*?log at the end. that's it.
Do you ever pay attention to how many steps it take?
First of mine: 7952
Second of mine: 13751
Compare it with other
After putting the picture here. some guys update their regex. I do
not want to argue. no problem. I just wanted to show it.
Otherwise I can ( as you can ) makes it less by choice the specific
pattern For example: ^\d+-[A-Za-z]+-\d+\s\d+:\d+:\d+\.\d+ Now 7952 become 3878
Do you want to learn how lock-head assertion works?
Very easy. The main concept is that (?=) is never matches anything. It only matches the position just one point before you want.
like:
^\d+-[A-Z].+(?=[A-Z]+ ).
It still matches: 29-Jun-2016 09:33:43.565 INFO
Pay attention to . at the end. So here the look head assertion point to between F and O
If would like to match this 29-Jun-2016 09:33:43.565 then what can you do?
Think about this:
^\d+-[A-Za-z].+(?=[\d] ).
and figure out it by yourself.

Regex PCRE: Validate string to match first set of string instead of last

I tried quite a few things but Im stuck with my regex whenever meets the criteria 2 consecutive times. In this case it just considers it as one expressions instead of 2.
\[ame\=[^\.]+(.+)youtube\.(.+)v\=([^\]\&\"]+)[\]\'\"\&](.+)\[\/ame\]
E.g.
[ame="http://www.youtube.com/watch?v=brfr5CD2qqY"][B][COLOR=yellow]http://www.youtube.com/watch?v=brfrx5D2qqY[/COLOR][/B][/ame][/U]
[B][COLOR=yellow]or[/COLOR][/B] [B][COLOR=yellow]B[/COLOR][/B]
[ame="http://www.youtube.com/watch?v=M9ak3rKIBAU"][B][COLOR=yellow]http://www.youtube.com/watch?v=M9a3arKIBAU[/COLOR][/B][/ame]
[B][COLOR=yellow]or[/COLOR][/B] [B][COLOR=yellow]C[/COLOR][/B]
[ame="http://www.youtube.com/watch?v=7vh--3pyq5U"][COLOR=yellow]http://www.youtube.com/watch?v=7vh--3pyq5U[/COLOR][/ame]
In that case, this regex would instead of matching all 3 options, it takes it as one.
Any ideas how to make an expression that would say match the first "[/ame]"?
The problem is the use of .+ - they are "greedy", meaning they will consume as much input as possible and still match.
Change them to reluctant quantifiers: .+?, which won't skip forward over the end of the first match to match the end if the last match.
I'm not sure what your objective is (you haven't made that clear yet)
But this will match and capture out the youtube URL for you, ensuring you only match each single instance between [ame= and [/ame]
/\[ame=["'](.*?)["'](.*?)\/ame\]/i
Here's a working example, and a great sandbox to play around in: http://regex101.com/r/jR4lK2

Regex - Get string between two words that doesn't contain word

I've been looking around and could not make this happen. I am not totally noob.
I need to get text delimited by (including) START and END that doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.
Example string:
abcSTARTabcSTARTabcENDabc
The expected result:
STARTabcEND
Not good:
STARTabcSTARTabcEND
I can't use backward search stuff. I am testing my regex here: www.regextester.com
Thanks for any advice.
Try this
START(?!.*START).*?END
See it here online on Regexr
(?!.*START) is a negative lookahead. It ensures that the word "START" is not following
.*? is a non greedy match of all characters till the next "END". Its needed, because the negative lookahead is just looking ahead and not capturing anything (zero length assertion)
Update:
I thought a bit more, the solution above is matching till the first "END". If this is not wanted (because you are excluding START from the content) then use the greedy version
START(?!.*START).*END
this will match till the last "END".
START(?:(?!START).)*END
will work with any number of START...END pairs. To demonstrate in Python:
>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']
If you only care for the content between START and END, use this:
(?<=START)(?:(?!START).)*(?=END)
See it here:
>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']
The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.
Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks to exceed the torture threshold in most cultures, though.
Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this but by the addition of S in [^ST] and adding S* before the non-optional S to allow for arbitrary repetitions of S otherwise.
May I suggest a possible improvement on the solution of Tim Pietzcker?
It seems to me that START(?:(?!START).)*?END is better in order to only catch a START immediately followed by an END without any START or END in between. I am using .NET and Tim's solution would match also something like START END END. At least in my personal case this is not wanted.
[EDIT: I have left this post for the information on capture groups but the main solution I gave was not correct.
(?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END)
as pointed out in the comments would not work; I was forgetting that the ignored characters could not be dropped and thus you would need something such as ...|STA(?![^R])| to still allow that character to be part of END, thus failing on something such as STARTSTAEND; so it's clearly a better choice; the following should show the proper way to use the capture groups...]
The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.
That way if you are using it to do search and replace, you can do, something like BEGIN$1FINISH. So, if you started with:
abcSTARTdefSTARTghiENDjkl
you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:
abcSTARTdefBEGINghiFINISHjkl
which would allow you to change your START/END tokens only when paired properly.
Each (x) is a group, but I have put (?:x) for each of the ones except the middle which marks it as a non-capturing group; the only one I left without a ?: was the middle; however, you could also conceivably capture the BEGIN/END tokens as well if you wanted to move them around or what-have-you.
See the Java regex documentation for full details on Java regexes.