Lazy regex that starts match at last match not ends at first - regex

I am trying to extract data from semi-structured text: an email composed of tab-delimited tables. Users have entered a time stamp on top of each table, and within the table they list the security identifiers that I am looking for.
The goal is to extract correctsecurity and the time stamp on top of the table in which correctsecurity is located.
For example...
10:00 AM
not it
not it
9:00 AM
not it
correctsecurity
..is supposed to return 9:00 AM correctsecurity. However my current regex is returning 10:00 AM correctsecurity, meaning right item, but not the right time.
Here is my regex so far:
((1[0-2]|[0-9]):[0-5][0-9](\s?(AM|PM))?)(?:(.*\n)+)(correctsecurity)
Note that the last part, correctsecurity, is created dynamically based on other criteria, so even if I were to provide the actual item in this question it would be of little help (because it is one of many). For simplicity's sake, please assume that correctsecurity is exactly the item I am looking for.
Lastly, I am doing this in VBA, so maybe solving this whole problem is easier without a long regex; feel free to propose non-regex solutions.

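To solve the main problem, simply change the central section of your regex so that it does not accept empty lines (this assumes the tables are separated by empty lines, so a match can no longer run from one table into the next):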
.*\n -> .+\n
Then add a newline \n before the central section, so that the optional AM|PM part cannot be skipped:
So your regex will be:
((1[0-2]|[0-9]):[0-5][0-9](\s?(AM|PM))?)\n(?:(.+\n)+)(correctsecurity)
Changes: the \n added right after the time group, and .+ in place of .* inside the repeated group.
Optional Optimization
You may remove many unneeded groups and add a generic multi-os regex for the newline (?:\r\n?|\n):
((?:1[0-2]|[0-9]):[0-5][0-9](?: [AP]M)?)(?:\r\n?|\n)(?:[^\r\n]+(?:\r\n?|\n))+(correctsecurity)
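As a quick check of the optimized pattern outside VBA, here is a minimal Python sketch. The sample text is an assumption: it is the example from the question with an empty line inserted between the two tables, which is what this pattern relies on.

```python
import re

# The optimized pattern from above, split over two lines for readability.
PATTERN = re.compile(
    r"((?:1[0-2]|[0-9]):[0-5][0-9](?: [AP]M)?)(?:\r\n?|\n)"
    r"(?:[^\r\n]+(?:\r\n?|\n))+(correctsecurity)"
)

# Assumed layout: an empty line separates the two tables.
email = "10:00 AM\nnot it\nnot it\n\n9:00 AM\nnot it\ncorrectsecurity\n"

match = PATTERN.search(email)
print(match.group(1), match.group(2))  # 9:00 AM correctsecurity
```

Without the empty line between the tables, the repeated group would run straight through the second time stamp and the match would start at 10:00 AM again.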

You can solve this with a negative lookahead:
a((?!a).)*correctsecurity
Where a is the pattern that you want the match to start at and do not want to encounter in the middle of the match.
Applied to your specific needs:
\d*:\d* [AP]M((?!\d*:\d* [AP]M).)*correctsecurity
Don't forget to let the dot match line breaks.
I assume VBA uses the VBScript regex dialect, which requires the following modification:
\d*:\d* [AP]M((?!\d*:\d* [AP]M)[\s\S])*correctsecurity
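The same idea can be tried out in Python before porting it to VBA. This is only a sketch: the helper name nearest_time_before and the slightly stricter time pattern are my own assumptions, and re.escape is used because the item is built dynamically.

```python
import re

# Loose time-stamp pattern (an assumption; adjust it to the real data).
TIME = r"\d{1,2}:\d{2} ?[AP]M"

def nearest_time_before(item, text):
    """Return the time stamp heading the first block that contains `item`, or None."""
    pattern = re.compile(
        "(" + TIME + ")"                  # candidate time stamp
        + "(?:(?!" + TIME + r")[\s\S])*"  # any text in which no other time stamp starts
        + re.escape(item)                 # the dynamically built item, taken literally
    )
    match = pattern.search(text)
    return match.group(1) if match else None

email = "10:00 AM\nnot it\nnot it\n9:00 AM\nnot it\ncorrectsecurity\n"
print(nearest_time_before("correctsecurity", email))  # 9:00 AM
```

The attempt that starts at 10:00 AM fails because the lookahead refuses to step over 9:00 AM, so the engine moves on and the match begins at the closest time stamp.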

Related

What is the correct regex pattern to use to clean up Google links in Vim?

As you know, Google links can be pretty unwieldy:
https://www.google.com/search?q=some+search+here&source=hp&newwindow=1&ei=A_23ssOllsUx&oq=some+se....
I have MANY Google links saved that I would like to clean up to make them look like so:
https://www.google.com/search?q=some+search+here
The only issue is that I cannot figure out the correct regex pattern for Vim to do this.
I figure it must be something like this:
:%s/&source=[^&].*//
:%s/&source=[^&].*[^&]//
:%s/&source=.*[^&]//
But none of these are working; they start at &source, and replace until the end of the line.
Also, the search?q=some+search+here can appear anywhere after the .com/, so I cannot rely on it being in the same place every time.
So, what is the correct Vim regex pattern to use in order to clean up these links?
Your example can easily be dealt with by using a very simple pattern:
:%s/&.*
because you want to keep everything that comes before the second parameter, which is marked by the first & in the string.
But, if the q parameter can be anywhere in the query string, as in:
https://www.google.com/search?source=hp&newwindow=1&q=some+search+here&ei=A_23ssOllsUx&oq=some+se....
then no amount of capturing or whatnot will be enough to cover every possible case with a single pattern, let alone a readable one. At this point, scripting is really the only reasonable approach, preferably with a language that understands URLs.
--- EDIT ---
Hmm, scratch that. The following seems to work across the board:
:%s#^\(https://www.google.com/search?\)\(.*\)\(q=.\{-}\)&.*#\1\3
We use # as separator because of the many / in a typical URL.
We capture a first group, up to and including the ? that marks the beginning of the query string.
We capture a second group, whatever comes between the ? and q=, but we do not use it in the replacement.
We capture a third group, the q parameter, up to and excluding the next &.
We replace the whole thing with the first capture group followed by the third capture group.
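If scripting is an option after all, a URL-aware version might look like the following Python sketch. keep_only_q is a made-up helper name; it uses only the standard urllib.parse module and assumes that every parameter other than q should be dropped.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def keep_only_q(url):
    """Rebuild `url` with only its q parameter, wherever q sits in the query string."""
    parts = urlsplit(url)
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key == "q"
    ]
    # urlencode turns the spaces back into "+"; the fragment is dropped.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))

url = "https://www.google.com/search?source=hp&newwindow=1&q=some+search+here&ei=A_23ssOllsUx"
print(keep_only_q(url))  # https://www.google.com/search?q=some+search+here
```

Because the query string is parsed instead of pattern-matched, the position of q and the presence or absence of a trailing & no longer matter.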

Regex matching for optional strings

I am trying to come up with the smallest regex possible for extracting parts of a string whose last section is optional. The string will look something like:
jack:Bill(23):Space Force (23, Apple;Orange)
or
jack:Bill(23):Space Force
I need to extract as follows:
Jack
Bill(23)
Space Force
23
Apple;Orange
The last 2 items may or may not appear based on the source string. I am trying with a regex like:
(.*?):(.*?):(.*?)(\\(([0-9]+),([^\\)]*)?\\))?
But this does not seem to work.
I got it working with (.*?):(.*?):([^\\(]*)(\\(([0-9]+), ([^\\)]*)?\\))?
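The first attempt fails because the lazy third group (.*?) is followed only by an optional group, so it is allowed to match nothing at all. [^(]* instead consumes everything up to the opening parenthesis. A minimal Python sketch of the working pattern follows (single backslashes, since Python raw strings need no doubling); split_fields is a hypothetical helper, and the strip() call is my addition to drop the space left before the parenthesis.

```python
import re

# Same structure as the working pattern, with the outer optional group made non-capturing.
PATTERN = re.compile(r"(.*?):(.*?):([^(]*)(?:\((\d+), ?([^)]*)\))?")

def split_fields(text):
    """Return the 3 or 5 extracted fields, or None if the string does not fit."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return None
    return [group.strip() for group in match.groups() if group is not None]

print(split_fields("jack:Bill(23):Space Force (23, Apple;Orange)"))
print(split_fields("jack:Bill(23):Space Force"))
```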

Extract only the text field needed

I am at the beginning of learning Regex, and I use every opportunity to understand how it works. Currently I am trying to extract dates from a text file (which is in fact a vnt file from my mobile phone). It looks like the following:
BEGIN:VNOTE
VERSION:1.1
BODY;ENCODING=QUOTED-PRINTABLE;CHARSET=UTF-8:18.07.=0A14.08.=0A15.09.=0A15.10.=
=0A13.11.=0A13.12.=0A12.01.=0A03.02. Grippe=0A06.03.=0A04.04.2015=0A0=
5.05.2015=0A03.06.2015=0A03.07.2015=0A02.08.2015=0A30.08.2015=0A28.09=
17.11.2017=0A
DCREATED:20171118T095601
X-IRMC-LUID:150
END:VNOTE
I want to extract all dates, so that the final list is like that:
18.07.
14.08.
15.09.
15.10.
and so on. If the date has also a year, it should also be displayed.
I almost found out how to detect the dates by the following regex:
.+(\d\d\.\d\d\.(2015|2016|2017)?).+
But it only detects very few of the dates. The result is this:
BEGIN:VNOTE
VERSION:1.1
15.10.
04.04.2015
30.08.2015
24.01.2016
DCREATED:20171118T075601
X-IRMC-LUID:150
END:VNOTE
Then I tried to add a question mark which makes the .+ not greedy, as far as I read in tutorials. Then the regex looks like:
.+?(\d\d\.\d\d\.(2015|2016|2017)?).+?
But the result is still not what I am looking for:
BEGIN:VNOTE
VERSION:1.1
21.03.20.04.18.05.18.06.18.07.14.08.15.09.15.10.
13.11.13.12.12.01.03.02.06.03.04.04.20150A0=
03.06.201503.07.201502.08.201530.08.20150A28.09=
28.10.201525.11.201528.12.201524.01.20160A
DCREATED:20171118T075601
X-IRMC-LUID:150
END:VNOTE
For someone who is familiar with regex I am pretty sure this is very easy to solve, but I don't get it. It's very confusing when you are new to regex. I tried to find a hint in some tutorials or stackoverflow posts, but all I found is this: Notepad++ how to extract only the text field which is needed?
But it doesn't work for me. I assume it might have something to do with the fact that my text file is not one single line.
I have my example on regex101 too.
I would be very thankful if maybe someone can give me a hint what else I can try.
Edit: I would like to detect the dates with the regex and as a result have a list with only the dates (maybe it is called substitute?)
Edit 2: Sorry for not mentioning it earlier: I just want to use the regex in e.g. Notepad++ or an online regex test website, just to get the dates and save the result in a new txt file. I don't want to use the regex in a programming language. My apologies for not being precise before.
Edit 3: The result should be a list of the dates, with each date on a new line:
18.07.
14.08.
15.09.
15.10.
I suggest this pattern:
(?:.*?|\G)(\d\d\.\d\d\.(?:\d{4})?)
This makes use of the \G anchor, which matches at the position where the previous match ended. In this case it lets the matches chain from the very start of the text without leaving any unmatched character between them, which allows the removal of everything but what's wanted.
If you want to remove the extra matches as well, add |.* at the end:
(?:.*?|\G)(\d\d\.\d\d\.(?:\d{4})?)|.*
In N++, make sure the search mode is set to Regular expression and that the cursor is at the beginning of the file. When I ran the replacement on the sample (and then undid it), it reported 16 replacements, which shows that the matches were identified.
You can try using the following pattern:
\d{2}\.\d{2}\.(?:\d{4})?
This will match day.month dates of the form 18.07., but it also allows such a date to be followed by a four-digit year, e.g. 18.07.2017. While it would be nice to make the pattern more restrictive, to avoid false positive matches, I do not see anything obvious which can be added to the above pattern.
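If scripting ever becomes an option, the same pattern can be applied after undoing the quoted-printable encoding, which also repairs dates that are split across lines by a trailing = (such as 0= / 5.05.2015 in the sample). This is a sketch only: extract_dates is a made-up name, and the body below is a shortened excerpt of the sample.

```python
import quopri
import re

def extract_dates(body):
    """Decode a quoted-printable BODY value and return every date found in it."""
    # decodestring turns =0A into a newline and joins lines that end with a soft break "=".
    decoded = quopri.decodestring(body.encode("ascii")).decode("utf-8")
    return re.findall(r"\d{2}\.\d{2}\.(?:\d{4})?", decoded)

# Shortened excerpt of the BODY value from the question.
body = (
    "18.07.=0A14.08.=0A15.09.=0A15.10.=\n"
    "=0A13.11.=0A04.04.2015=0A0=\n"
    "5.05.2015=0A03.06.2015=0A"
)
print("\n".join(extract_dates(body)))
```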

Regex to 'clean' main words from suffixes recognized by some repetitive patterns

I have this example list
Veep - Season 1 BDMux.torrent
Vegas S01e01-21.torrent
Velvet S01e13.torrent
Velvet.e10.torrent
Velvet_e01.torrent
Veronica Mars s01.torrent
Vicious S01e01-06.torrent
Victor Ros S01e01-06.torrent
Video.Game.High.School.S01e01-09.XviD.torrent
Vikings - Season 1 EXT.torrent
Vikings_S04e04.avi.torrent
I want to eliminate the near-duplicate lines such as Velvet. or Velvet_, consolidate each show into one entry, and finally print this:
Veep
Vegas
Velvet
Veronica Mars
Victor Ros
Video Game High School
Vikings
How can I do this with a regex?
Doing all that in one regex is, I'd say, impossible. However, this regex
^(.*?)[ ._-]*(?:s\w*\s*\d+)?(?:e\d\d(?:-\d\d)?)?[\s.]*\w*?\.torrent(?:[\s\S]*\1.*$)*$
handles what you threw at us ;). There's one catch though: it can't remove the dots in titles like Video.Game.High.School.
And - it requires the shows to be grouped, like in your example (e.g. All Velvet grouped together). This ought to be easily solved by Notepad++'s Edit>Line Operations>Sort Lines in Ascending though.
Check it out here at regex101.
What it does is to capture everything up to season and/or episode, allowing for an optional format and finally matching .torrent. It then optionally matches everything up to a possible repeat of the first captured and whatever follows up to the end of the line. The last step is repeated until no match found. The capture group now holds the name of the show, but the regex matches all lines of the show. Thus, replacing the whole match with the capture, will leave only one clean entry for each show.
This means that it won't handle the case where a show's name starts with the complete name of another show, e.g. American Crime and American Crime Story, since the first would match the second, and therefore keep matching until the end of the second. This can be fixed by including the test for season/episode in the second part of the regex, but I opted out of this to keep it simpler and faster.
So, you say in a comment "regex does not need to be perfect". Well, here's one that gets most of the job done for you - but isn't perfect.
Regards
Edit
Made some updates and simplified regex considerably. Here's the old one if you want the more specific one:
^(.*?)[ ._]?(?:-? season \d+|(?:s\d\d)?(?:e\d\d(?:-\d\d)?)?)[\s.]*(?:bdmux|xvid|ext|avi)?\.torrent(?:[\s\S]*\1.*$)*$
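For comparison, outside Notepad++ the dot problem and the grouping requirement both disappear if the clean-up is done in two small steps. The following Python sketch is not the regex above but a simplified variant under my own assumptions: the season/episode marker is Season N, sNN, sNNeNN(-NN) or eNN, and show_names is a made-up helper.

```python
import re

# Everything before the first season/episode marker is taken as the show name.
MARKER = re.compile(
    r"^(.*?)[ ._-]+(?:season\s*\d+|s\d+(?:e\d+(?:-\d+)?)?|e\d+(?:-\d+)?)\b",
    re.IGNORECASE,
)

def show_names(filenames):
    """Return the distinct show names in order of first appearance."""
    names = []
    for filename in filenames:
        match = MARKER.match(filename)
        # Without a marker, fall back to dropping the .torrent extension.
        raw = match.group(1) if match else re.sub(r"\.torrent$", "", filename)
        names.append(re.sub(r"[._]+", " ", raw).strip())
    return list(dict.fromkeys(names))  # removes duplicates, keeps order

files = [
    "Velvet S01e13.torrent",
    "Velvet.e10.torrent",
    "Velvet_e01.torrent",
    "Veronica Mars s01.torrent",
    "Video.Game.High.School.S01e01-09.XviD.torrent",
    "Vikings - Season 1 EXT.torrent",
    "Vikings_S04e04.avi.torrent",
]
print(show_names(files))
# ['Velvet', 'Veronica Mars', 'Video Game High School', 'Vikings']
```

Since duplicates are removed with a dictionary, the lines do not need to be sorted or grouped first.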

RegEx: Match Mr. Ms. etc in a "Title" Database field

I need to build a RegEx expression which gets its text strings from the Title field of my Database. I.e. the complete strings being searched are: Mr. or Ms. or Dr. or Sr. etc.
Unfortunately this field was a free field and anything could be written into it. e.g.: M. ; A ; CFO etc.
The expression needs to match on everything except: Mr. ; Ms. ; Dr. ; Sr. (NOTE: The list is a bit longer but for simplicity I keep it short.)
WHAT I HAVE TRIED SO FAR:
This is what I am using successfully on another field:
^(?!(VIP)$).* (This will match every string except "VIP")
I rewrote that expression to look like this:
^(?!(Mr.|Ms.|Dr.|Sr.)$).*
Unfortunately this did not work. I assume this is because the "." (dot) is a reserved symbol in RegEx and needs special handling.
I also tried:
^(?!(Mr\.|Ms\.|Dr\.|Sr\.)$).*
But no luck as well.
I looked around in the forum and tested some other solutions but could not find any which works for me.
I would like to know how I can build my formula to search the complete (short) string and matches everything except "Mr." etc. Any help is appreciated!
Note: My Question might seem unusual and seems to have many open ends and possible errors. However the rest of my application is handling those open ends. Please trust me with this.
If you want your string simply to not start with one of those prefixes, then do this:
^(?!([MDS]r|Ms)\.).*$
The above simply ensures that the beginning of the string (^) is not followed by one of your listed prefixes. (You shouldn't even need the .*$ but this is in case you're using some engine that requires a complete match.)
If you want your string to not have those prefixes anywhere, then do:
^((?!([MDS]r|Ms)\.).)*$
The above ensures that no position in the string, including the very first one, is the start of one of your listed prefixes, all the way to the end (so the $ is necessary in this one).
I just read that your list of prefixes may be longer, so let me expand it so that you can add more:
^((?!(Mr|Ms|Dr|Sr)\.).)*$
You say the string consists entirely of one of the prefixes? Then just do this:
^(?!(Mr|Ms|Dr|Sr)\.$).*$
And if you want to make the dot optional:
^(?!(Mr|Ms|Dr|Sr)\.?$).*$
With | we can list any number of prefix patterns to match against the string.
var pattern = /^(Mrs\.|Mr\.|Ms\.|Dr\.|Er\.)\s?[A-Za-z]+$/;
var str = "Mrs.Panchal";
console.log(str.match(pattern));
this may do it
/(?!.*?(?:^|\W)(?:(?:Dr|Mr|Mrs|Ms|Sr|Jr)\.?|Miss|Phd|\+|&)(?:\W|$))^.*$/i
from that page I mentioned
Rather than trying to construct a regex that matches anything except Mr., Ms., etc., it would be easier (if your application allows it) to write a regex that matches only those strings:
/^(Mr|Ms|Dr|Sr)\.$/
and just swap the logic for handling matching vs non-matching strings.
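A minimal Python sketch of that inverted check, assuming the short list of titles from the question (is_nonstandard is a made-up name):

```python
import re

# Matches only the accepted titles; extend the alternation as needed.
STANDARD_TITLE = re.compile(r"(?:Mr|Ms|Dr|Sr)\.")

def is_nonstandard(title):
    """True for every Title value that is not exactly one of the accepted titles."""
    return STANDARD_TITLE.fullmatch(title) is None

for value in ["Mr.", "Dr.", "M.", "A", "CFO"]:
    print(value, is_nonstandard(value))
```

fullmatch anchors the pattern at both ends, so values such as "Mr.x" or " Mr." still count as non-standard.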
re.sub(r'^([MmDdSs][RSrs]{1,2}|[Mm]iss)\.{0,1} ','',name)