Lookahead in Regex

I am trying to extract the venue from a file that contains several articles using regex. I know that the venue starts with either For or From and is followed by a date, which starts with a day of the week, or by the author's name if the date is missing. I wrote the following regex to match the venue, but it always matches everything up to the author's name, so the date is included in the venue whenever the article has one.
"""((?<=\n)(?:(?:\bFrom\b)|(?:\bFor\b)).*?(?=(?:(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)|(?:[A-Z]+))))""".r
Why does my pattern not stop at the day of the week when one is present, and instead continue until it matches [A-Z]+, which is the author's name?
Input: "The Consequences of Hostilities Between the States
From the New York Packet.
Tuesday, November 20, 1787.
HAMILTON
To the People of the State of New York:"
The line "Tuesday, November 20, 1787." is optional and may not occur in all articles. I want the output to be "From the New York Packet."
I am getting the correct output for articles that do not have a date, however I am getting the output "From the New York Packet.
Tuesday, November 20, 1787." for articles that contain the date.

Based on your edit, all you really need is
^(From|For).*
with the multiline flag.
I know that the venue starts with either For/From
and is followed by date which starts with a day of the week or author's name if the date is missing
it always matches everything till the author's name which means the date also comes in the venue if that article has a date.
Sounds like you want to find an entire line within a text file that begins with "From" or "For"
^(From|For)
(Set the multiline flag on so that ^ matches the beginning of a line rather than the beginning of input).
is followed by an optional date
\s+(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)?
followed by the author's name
\s+\w+\s+\w+
followed by everything until the end of the line
.*
Unless, of course, you mean that you want to skip the date and match only the For/From line and the author's name (not the date). A single regex match is always one contiguous span, so it cannot leave out the date in the middle; you can use capture groups to extract the desired values, though.

You only need to capture the entire line that starts with For or From, so you can simply use this:
^(For|From).*$
With the multiline flag set, the ^ and $ anchor the match to the start and end of a line, and the .* matches everything in between.
Here, try it out with whatever examples you like.
If this needs to be more complicated, I'll update my answer.
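For anyone who wants to try the line-anchored approach outside a regex tester, here is a minimal sketch in Python (my choice of language, the question itself uses Scala). The variable names are only illustrative, and the added \b is my assumption to avoid matching words such as "Forty".

```python
import re

article = (
    "The Consequences of Hostilities Between the States\n"
    "From the New York Packet.\n"
    "Tuesday, November 20, 1787.\n"
    "HAMILTON\n"
    "To the People of the State of New York:"
)

# re.MULTILINE lets ^ and $ match at every line boundary. Without
# re.DOTALL the dot does not cross a newline, so each match is one line.
venue_pattern = re.compile(r"^(?:From|For)\b.*$", re.MULTILINE)

venues = venue_pattern.findall(article)
print(venues)
```

Because the dot never crosses the newline, the date line cannot leak into the venue, whether or not the article has a date.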

Related

How to write a REGEX that captures a string between a string (ie the words between specific words)

The text string below is coming through in the Integromat text parser. I'm trying to capture the values from a form a user filled out, using the built-in Integromat regex text parser.
For example, the test string comes in as (unfortunately the info is not coming through on individual lines):
Information First Name:Frank Last Name:McTester Email:mctesterxmas#rya.com
Guest Name:Debby McTester Party RegistrationNumber of Dinner Guests: 2 [http://
I need the regex to pull the info FRANK, which is between the strings First Name: and Last Name:, and so on for the other fields.
My current regex works great for emails where these strings are on their own lines. For example if the email comes in with each string on its own line, then this regex works well.
First Name:\s*(.*)|Last Name:\s*(.*)|Email:\s*(.*)|Guest Name:\s*(.*)|Number of Dinner Guests:\s*(.*)
But when everything is mashed up, I cannot figure out how to use regex to parse the string.
Instead of using alternatives, match the entire line with each field in order.
First Name:\s*(.*?)\s+Last Name:\s*(.*?)\s+Email:\s*(.*)|Guest Name:\s*(.*?)Number of Dinner Guests:\s*(.*)
This was the final solution if anyone else needs it.
Transaction Date: *(?<date>\S+)|^Information[^:]*:\s*(?<name>.*)Last Name:\s*(?<lastname>.*)Email:(?<email>.*)Guest\n?Name:(?<guestname>.*)|Dinner Guests:\s*(?<guestcount>\d+)
or
Transaction Date: *(?<date>[^\s\n]+)|^Information[^:]*:\s*(?<name>.*)Last Name:\s*(?<lastname>.*)Email:(?<email>.*)Guest\n?Name:(?<guestname>.*)|Dinner Guests:\s*(?<guestcount>\d+)
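The idea of matching every field in order, with a lazy group between each pair of labels, can be sketched in Python as follows. This is only an illustration under my own assumptions: the group names are made up, the sample text is the string from the question, and Python writes named groups as (?P<name>...) rather than (?<name>...). Whether Integromat accepts the same pattern is not something I have checked.

```python
import re

text = (
    "Information First Name:Frank Last Name:McTester "
    "Email:mctesterxmas#rya.com\n"
    "Guest Name:Debby McTester Party Registration"
    "Number of Dinner Guests: 2"
)

# Each lazy group (.*?) grows only until the next label can match,
# so the labels act as delimiters even when everything is on one line.
pattern = re.compile(
    r"First Name:\s*(?P<first>.*?)\s*"
    r"Last Name:\s*(?P<last>.*?)\s*"
    r"Email:\s*(?P<email>\S+)\s*"
    r"Guest Name:\s*(?P<guest>.*?)\s*"
    r"Number of Dinner Guests:\s*(?P<count>\d+)",
    re.DOTALL,  # let the lazy groups cross line breaks if they have to
)

fields = pattern.search(text).groupdict()
print(fields)
```

Note that the guest group captures everything up to "Number of Dinner Guests:", so in this sample it also contains the words "Party Registration".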

what is regex doing in the background?

I played around with regex today and stumbled on something; I don't really understand why it behaves like this.
This is my working regex (I formatted it for better readability):
(?<name>[a-z\ ]+[a-zA-Z]+|[a-zA-Z]+)\
(?<firstname>[a-z-A-Z\ ]+)\n
(?<title>[a-zA-Z\.\ ]+)\n?
(?<company>[a-zA-Zäöü\.\ ]+)?\n
(?<street>[a-zA-Zäöü]+)\ (?<housenumber>[0-9]+)\n?
(?<postfach>Postfach [0-9]+)?\n
(?<zip>[0-9]+)\ (?<place>[a-zA-Zäöü]+)
And this is the string I want to parse through:
Smith John
Dr.
Foobar AG
Smithstrasse 1
Postfach 1
6500 Bellinzona
With this regex it works perfectly. But previously the \n before the street group was optional and the \n before the company group was not. The problem is that there is a case where the string has no company in it. The result with the previous version: the whole street except for its last character ended up in the company group, and the last character of the street in the street group (I used regex101 for testing). Although the company group is optional, it looks like it is "forced" to be part of the match, which is definitely not what I want.
And that's where my question comes in. How does regex work exactly in the background? I think regex tries to pick the best solution out of all the possible groupings it can find in the string. But I have no clue why it takes this solution as the best one.
Here's a link to regex101 where you can see how it behaved previously: https://regex101.com/r/OmuPBn/1
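The behaviour described above can be reproduced with a cut-down version of the pattern. The sketch below uses Python's re module, which is a backtracking engine; regex101's PCRE flavor should behave the same way for this pattern, but that is my assumption. A backtracking engine does not look for a "best" grouping. It tries a greedy optional group first, gives characters back one at a time when the rest of the pattern fails, and stops at the first path that succeeds. The shortened patterns and the names earlier and working are mine, not the original poster's.

```python
import re

# Input without a company line.
text = "Dr.\nSmithstrasse 1\n"

# Earlier version: newline after the title is required,
# newline before the street is optional.
earlier = re.compile(
    r"(?P<title>[a-zA-Z. ]+)\n"
    r"(?P<company>[a-zA-Zäöü. ]+)?\n?"
    r"(?P<street>[a-zA-Zäöü]+) (?P<housenumber>[0-9]+)\n"
)

# Working version: the optional newline sits after the title instead.
working = re.compile(
    r"(?P<title>[a-zA-Z. ]+)\n?"
    r"(?P<company>[a-zA-Zäöü. ]+)?\n"
    r"(?P<street>[a-zA-Zäöü]+) (?P<housenumber>[0-9]+)\n"
)

m_earlier = earlier.search(text)
m_working = working.search(text)

# company grabs the street greedily, then gives back just enough
# characters (one) for the street group to succeed.
print(m_earlier.group("company"), "|", m_earlier.group("street"))
# With a required \n after company, no non-empty company can match,
# so the engine falls back to skipping the optional group.
print(m_working.group("company"), "|", m_working.group("street"))
```

In the earlier version nothing forces a line break between company and street, so the first successful path is the one where company keeps all but one character. In the working version the required \n rules that path out.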

Using Calibre, figuring out RegEx expressions. Configuring metadata from file name

I am trying to use Calibre on my mac to organize my ebook library.
As a summer personal project, I created various epubs of my nephews' and nieces' school reports as keepsakes on my computer and phone. I had labeled the files as: Title_Last Name, First Name.epub
For example: Report on ATP Cycle_Doe, John.epub
With Calibre I found you can configure metadata from the file name: Link
For example:
(?P<title>.+) - (?P<author>[^_]+)
Would only work if the file name was: Title - First Name Last Name.epub
I tried:
(?P<title>.+)[^\w](?P<author>[^_]+)
And it would return the title as: Report on ATP Cycle Doe,
And the author as: John
Can anyone help me figure out a RegEx expression to extract the title and author from the file name convention that I used?
Such that the title is: Report on ATP Cycle
And the author is: John Doe
It is much appreciated.
Use this:
^(?P<title>[^_]+)_(?P<author>.*)\.epub$
In the Regex Demo, look at the named groups in the right pane.
Explanation
The ^ anchor asserts that we are at the beginning of the string
(?P<title>[^_]+) captures chars that are not an underscore to the title capture group
(?P<author>.*) captures any chars to the author capture group
\.epub matches .epub
The $ anchor asserts that we are at the end of the string
Variation
If for some reason the regex is not supposed to match the .epub extension, use this instead:
^(?P<title>[^_]+)_(?P<author>[^.]+)
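Calibre's file name patterns use Python's regex syntax, so the approach is easy to check in plain Python. The sketch below is not Calibre configuration: it splits the author into two hypothetical groups, last and first (names I made up), to show how "Doe, John" could be reordered to "John Doe" in code. As far as I can tell from the pattern in the question, Calibre itself expects a single author group, so the reordering step is an assumption about post-processing, not a Calibre feature.

```python
import re

filename = "Report on ATP Cycle_Doe, John.epub"

# Title is everything before the underscore; the author part is split
# at the comma into last and first name.
pattern = re.compile(
    r"^(?P<title>[^_]+)_(?P<last>[^,]+),\s*(?P<first>.+)\.epub$"
)

m = pattern.match(filename)
title = m.group("title")
author = f"{m.group('first')} {m.group('last')}"

print(title)
print(author)
```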

Regular Expression match everything but two names and <email address> after particular word

I have a bunch of Names and email addresses inside of these aggregated emails and I'd like to get rid of everything but the First Last <email#domain.com> throughout the document. Basically I have...
From: Name Wood <email#gmail.com>
Subject: Yelp entries for iPod contest
Date: April 20, 2012 12:51:07 PM EDT
To: email#domain.cc
Have had a great experience with .... My Son ... is currently almost a year into treatment. Dr. ... is great! Very informative and always updates us on progress and we have our regular visits. The ... buck program is a great incentive which they've implemented to help kids take care of their teeth/braces. They also offer payment programs which help for those of us that need a structured payment option. Wouldn't take my kids anywhere else. Thanks Dr. ... and staff
Text for 1, 2, and 3 entries to Yelp
Hope ... wins!!
Begin forwarded message:
From: Name Wood <email#gmail.com>
Subject: reviews 2 and 3
Date: April 20, 2012 12:44:26 PM EDT
To: email#domain.cc
Have had a great experience with ... Orthodontics. My Son ... is currently almost a year into treatment. Dr. ... is great! Very informative and always updates us on progress and we have our regular visits. The ... buck program is a great incentive which they've implemented to help kids take care of their teeth/braces. They also offer payment programs which help for those of us that need a structured payment option. Wouldn't take my kids anywhere else. Thanks Dr. ... and staff
Have had a great experience with...
I want to only match the...
Name Wood <email#gmail.com>
Name Wood <email#gmail.com>
from this text. So basically I want to match next two words after the word "From: " plus "<"+email address+">" excluding the word "From: ". I've gleaned from researching that this is a negative lookahead (I think) searching for two whole words (somehow using {0,2}) and then an email address from one < character to another >.
You could just do this:
/(?:From: )(.*)/g
This regular expression will find what you're looking for:
(?<=From:)\s*[^<]+<[^>]+>
But what you're going to do with it is a little unclear from your question. The matched text should probably be put into one or more groups so you can extract the text you want. (Name in one group? Email in a separate group? Or both together?) You haven't said what you want to do with it, so you'll have to provide more information. The above is the simplest case scenario.
Explanation:
(?<=From:) # positive lookbehind to find "From:"
\s* # optional whitespace
[^<]+< # everything up to the first '<' (the name)
[^>]+> # everything up to the '>' (the email)
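A capture group gives the same result without a lookbehind and without the leading whitespace in the match. Here is a minimal sketch in Python (my choice, the question does not name a language), using a shortened copy of the sample text; the variable names are illustrative.

```python
import re

text = (
    "From: Name Wood <email#gmail.com>\n"
    "Subject: Yelp entries for iPod contest\n"
    "Date: April 20, 2012 12:51:07 PM EDT\n"
    "To: email#domain.cc\n"
    "Hope ... wins!!\n"
    "Begin forwarded message:\n"
    "From: Name Wood <email#gmail.com>\n"
    "Subject: reviews 2 and 3\n"
)

# "From:" is matched but sits outside the group, so findall returns
# only the name and the bracketed address.
sender_pattern = re.compile(r"^From:\s*([^<\n]+<[^>\n]+>)", re.MULTILINE)

senders = sender_pattern.findall(text)
print("\n".join(senders))
```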
If you want to strip everything but the name and email:
Use the 's' modifier (dot matches newline).
For both regexes, do a global find-and-replace with the replacement $1\n
This one is faster but will leave an extra newline on successes.
Find .*?From:[^\S\n]*([^<\n]+<[^>\n]*\#[^>\n]*>)|.*$
This is slower (uses lookahead) but won't leave the extra newline.
Find .*?From:[^\S\n]*([^<\n]+<[^>\n]*\#[^>\n]*>)(?:(?!From:[^\S\n]*[^<\n]+<[^>\n]*\#[^>\n]*>).)*

How to remove a portion of a string from the end of a field only using find-and-replace in Microsoft Access?

I have a field with names and some of them have a trailing space and letter (middle initial...) at the end that I am trying to remove using find and replace in Microsoft Access 2010.
Example:
Doe John A -> Doe John
Doe Jane B -> Doe Jane
Is this possible using "find and replace" in Microsoft Access?
I was able to look through the following Access tutorials but can't figure out how to get it to only remove them from the END of the field/string:
Examples of wildcards in use
Access wildcard character reference
Replace using wildcards
My current find-and-replace will remove the entire string (because of the asterisk but without the asterisk - nothing is found) not just the trailing space and letter!
I think I am missing a "$" somewhere to tell it to only look at the end of the string but cannot get it to work without deleting the entire string from the field.
I don't think that find & replace dialog is sophisticated enough for what you want to do. You could use a regular expression in VBA code, which should be a close match to what you want. However this could be easy with SQL.
To display all of name_field except for the final space plus letter:
SELECT Left(name_field, Len(name_field)-2)
FROM MyTable
WHERE name_field Like "* [a-z]";
To actually discard the space plus letter from name_field:
UPDATE MyTable
SET name_field = Left(name_field, Len(name_field)-2)
WHERE name_field Like "* [a-z]";
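For comparison, the regular-expression route mentioned above relies on the $ anchor to restrict the match to the end of the string. The sketch below shows the idea in Python rather than VBA, purely as an illustration; the third name is an extra sample I added to show that names without a trailing initial are left alone.

```python
import re

names = ["Doe John A", "Doe Jane B", "Doe Jim"]

# A space followed by exactly one letter, anchored to the end of the string.
trailing_initial = re.compile(r" [A-Za-z]$")

cleaned = [trailing_initial.sub("", name) for name in names]
print(cleaned)
```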