What is regex doing in the background?

I played around with regex today and ran into behavior I don't really understand.
This is my working regex (I formatted it for better readability):
(?<name>[a-z\ ]+[a-zA-Z]+|[a-zA-Z]+)\
(?<firstname>[a-z-A-Z\ ]+)\n
(?<title>[a-zA-Z\.\ ]+)\n?
(?<company>[a-zA-Zäöü\.\ ]+)?\n
(?<street>[a-zA-Zäöü]+)\ (?<housenumber>[0-9]+)\n?
(?<postfach>Postfach [0-9]+)?\n
(?<zip>[0-9]+)\ (?<place>[a-zA-Zäöü]+)
And this is the string I want to parse through:
Smith John
Dr.
Foobar AG
Smithstrasse 1
Postfach 1
6500 Bellinzona
With this regex it works perfectly. But previously the \n before the street group was optional, and the \n before the company group was not. The catch is that there's a case where the string has no company in it. The result with the previous version: the whole street except for its last character ended up in the company group, and the last character of the street in the street group (I used regex101 for testing). Although the company group is optional, it looks like it was "forced" to take part in the match, which is definitely not what I want.
And that's where my question comes in. How does regex work exactly in the background? I think regex tries to pick the best solution out of all the possible groupings it can find in the string. But I have no clue why it takes this solution as the best one.
Here's a link to regex101 where you can see how it behaved previously: https://regex101.com/r/OmuPBn/1
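The behavior can be reproduced outside regex101. The sketch below is a port to Python under a few assumptions: it uses Python's `(?P<name>...)` spelling for named groups, slightly simplifies the character classes, and builds both versions from hypothetical pieces (`HEAD`, `COMPANY`, `TAIL`). A backtracking engine does not search for the "best" grouping. It tries each greedy quantifier at its longest first, gives characters back one at a time when the rest of the pattern fails, and stops at the first combination that succeeds.

```python
import re

# Parts shared by both versions (Python spells named groups as (?P<name>...)).
HEAD = (
    r"(?P<name>[a-z ]+[a-zA-Z]+|[a-zA-Z]+) "
    r"(?P<firstname>[a-zA-Z ]+)\n"
    r"(?P<title>[a-zA-Z. ]+)"
)
COMPANY = r"(?P<company>[a-zA-Zäöü. ]+)?"
TAIL = (
    r"(?P<street>[a-zA-Zäöü]+) (?P<housenumber>[0-9]+)\n?"
    r"(?P<postfach>Postfach [0-9]+)?\n"
    r"(?P<zip>[0-9]+) (?P<place>[a-zA-Zäöü]+)"
)

# Previous version: newline before company required, newline before street optional.
PREVIOUS = re.compile(HEAD + r"\n" + COMPANY + r"\n?" + TAIL)
# Working version: newline before company optional, newline before street required.
FIXED = re.compile(HEAD + r"\n?" + COMPANY + r"\n" + TAIL)

# The problem case: an address without a company line.
NO_COMPANY = "Smith John\nDr.\nSmithstrasse 1\nPostfach 1\n6500 Bellinzona"


def groups(pattern, text):
    """Return the named groups of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.groupdict() if match else None


previous = groups(PREVIOUS, NO_COMPANY)
fixed = groups(FIXED, NO_COMPANY)
print(previous["company"], "|", previous["street"])
print(fixed["company"], "|", fixed["street"])
```

In the previous version, `company` first grabs `Smithstrasse ` (letters and a space are all allowed), after which `street` cannot match. The engine then shortens `company` one character at a time, and the first length at which the rest of the pattern succeeds is `Smithstrass`, leaving `e` for `street`. Skipping `company` entirely is the last option the engine would try, so it is never reached. With the required \n after `company`, no shortened company can be followed by a newline, so the group ends up unset.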

Related

How to write a REGEX that captures a string between a string (ie the words between specific words)

The Text string below is coming through in the Integromat text parser. I'm trying to capture the values from a form a user filled out using the built in Integromat Regex text parser.
For example, the test string comes in as (unfortunately the info is not coming through on individual lines):
Information First Name:Frank Last Name:McTester Email:mctesterxmas#rya.com
Guest Name:Debby McTester Party RegistrationNumber of Dinner Guests: 2 [http://
I need the regex to pull the info FRANK, which is between the string First Name: and Last Name:, so on and so forth.
My current regex works great for emails where these strings are on their own lines. For example if the email comes in with each string on its own line, then this regex works well.
First Name:\s*(.*)|Last Name:\s*(.*)|Email:\s*(.*)|Guest Name:\s*(.*)|Number of Dinner Guests:\s*(.*)
But when everything is mashed up, I cannot figure out how to use regex to parse the string.
Instead of using alternatives, match the entire line with each field in order.
First Name:\s*(.*?)\s+Last Name:\s*(.*?)\s+Email:\s*(.*)|Guest Name:\s*(.*?)Number of Dinner Guests:\s*(.*)
This was the final solution if anyone else needs it.
Transaction Date: *(?<date>\S+)|^Information[^:]*:\s*(?<name>.*)Last Name:\s*(?<lastname>.*)Email:(?<email>.*)Guest\n?Name:(?<guestname>.*)|Dinner Guests:\s*(?<guestcount>\d+)
or
Transaction Date: *(?<date>[^\s\n]+)|^Information[^:]*:\s*(?<name>.*)Last Name:\s*(?<lastname>.*)Email:(?<email>.*)Guest\n?Name:(?<guestname>.*)|Dinner Guests:\s*(?<guestcount>\d+)
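As a rough illustration of the "each field in order" idea, here is a minimal Python sketch (Integromat's parser is not Python, so the group syntax and the names `FIELDS`, `SAMPLE`, and `parse_form` are assumptions for the example). Each value is matched lazily so that it stops at the next label.

```python
import re

# Each value is matched lazily (.*?) so it stops right before the next label.
FIELDS = re.compile(
    r"First Name:\s*(?P<first>.*?)\s*"
    r"Last Name:\s*(?P<last>.*?)\s*"
    r"Email:\s*(?P<email>.*?)\s*"
    r"Guest Name:\s*(?P<guest>.*?)\s*"
    r"Number of Dinner Guests:\s*(?P<count>\d+)",
    re.DOTALL,  # let . cross line breaks, since the input may wrap anywhere
)

# The mashed-up test string from the question.
SAMPLE = (
    "Information First Name:Frank Last Name:McTester "
    "Email:mctesterxmas#rya.com\n"
    "Guest Name:Debby McTester Party RegistrationNumber of Dinner Guests: 2"
)


def parse_form(text):
    """Return the captured fields as a dict, or None if the labels are missing."""
    match = FIELDS.search(text)
    return match.groupdict() if match else None


record = parse_form(SAMPLE)
print(record)
```

Note that the `guest` group also picks up the words `Party Registration`, because nothing in the input separates them from the guest name. If that text is a fixed heading, it would have to be added to the pattern as a literal before `Number of Dinner Guests:`.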

London postcode Regex Validation

I'm using the following Regex Syntax to validate a UK postcode in RSForm!Pro:
^(([gG][iI][rR] {0,}0[aA]{2})|((([a-pr-uwyzA-PR-UWYZ][a-hk-yA-HK-Y]?[0-9][0-9]?)|(([a-pr-uwyzA-PR-UWYZ][0-9][a-hjkstuwA-HJKSTUW])|([a-pr-uwyzA-PR-UWYZ][a-hk-yA-HK-Y][0-9][abehmnprv-yABEHMNPRV-Y]))) {0,}[0-9][abd-hjlnp-uw-zABD-HJLNP-UW-Z]{2}))$
The validation works fine, but I need to allow only postcodes inside London.
Here are the postcodes allowed:
WC, EC, E1-E20, N1-N22, NW1-NW11, SE1-SE28, SW1-SW20, W1-14, HA0-9, EN1-8
Is there any regex that validates only London postcodes? If not, how can I run a separate validation after this one and check that the postcode is one of the above?
This should match only your codes, assuming that each one is a standalone string; otherwise you might want to use word boundaries \b instead of the anchors ^ and $. I don't think much optimization is possible; it just straightforwardly matches all possible codes:
^(?:[WE]C|(?:E|SW)(?:[1-9]|1[0-9]|[12]0)|N(?:[1-9]|1[0-9]|2[0-2])|NW(?:[1-9]|1[01])|SE(?:[1-9]|1[0-9]|2[0-8])|W(?:[1-9]|1[0-4])|HA[0-9]|EN[1-8])$
Here is a demo with the positive matches and some negative examples:
https://regex101.com/r/cT6fQ5/2
However, after reading a bit more into the topic, this might go a lot deeper than your initial post suggests. I found this website with London postcodes, http://www.doogal.co.uk/london_postcodes.php, and worked on a solution that restricts the first 3 or 4 characters to London-specific values and afterwards uses the last part of your initial postcode regex:
(?i)^((?:EC(?:[1234][AMNPRV]|[124]Y)|WC(?:[12][ABEHNR]|1[VX])|(?:E|SW)(?:[0-9]|1[0-9]|[12]0)|N(?:[1-9]|1[0-9]|2[0-2])|NW(?:[1-9]|1[01])|SE(?:[1-9]|1[0-9]|2[0-8])|W(?:[1-9]|1[0-4])|HA[0-9]|EN[1-8]|E1W) {0,}[0-9][ABD-HJLNP-UW-Z]{2})$
And again a demo:
https://regex101.com/r/eD1gW3/3
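For the "separate validation" route, the first pattern above can be checked quickly against the boundaries of each allowed range. This is a small Python sketch; the helper name `is_london_outward` and the two test lists are made up for the example, and the pattern is copied unchanged from the first answer.

```python
import re

# Area/district pattern from the first answer above, copied unchanged.
LONDON_OUTWARD = re.compile(
    r"^(?:[WE]C|(?:E|SW)(?:[1-9]|1[0-9]|[12]0)|N(?:[1-9]|1[0-9]|2[0-2])"
    r"|NW(?:[1-9]|1[01])|SE(?:[1-9]|1[0-9]|2[0-8])|W(?:[1-9]|1[0-4])"
    r"|HA[0-9]|EN[1-8])$"
)


def is_london_outward(code):
    """Return True if `code` is one of the allowed area/district codes."""
    # The pattern is upper-case only, so normalize the input first.
    return LONDON_OUTWARD.match(code.strip().upper()) is not None


# Codes at the edges of the allowed ranges, and codes just outside them.
ALLOWED = ["WC", "EC", "E1", "E20", "N22", "NW11", "SE28", "SW20", "W14", "HA0", "EN8"]
REJECTED = ["E21", "N23", "NW12", "SE29", "W15", "EN9", "M1", "HA10"]

for code in ALLOWED + REJECTED:
    print(code, is_london_outward(code))
```

This only checks the first part of the postcode; a full postcode such as one with a trailing unit code would need the second, longer pattern.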

Using Calibre, figuring out RegEx expressions. Configuring metadata from file name

I am trying to use Calibre on my mac to organize my ebook library.
As a summer personal project, I created various epubs of my nephews' and nieces' school reports as keepsakes on my computer and phone. I had labeled the files as: Title_Last Name, First Name.epub
For example: Report on ATP Cycle_Doe, John.epub
With Calibre I found you can configure metadata from the file name.
For example:
(?P<title>.+) - (?P<author>[^_]+)
Would only work if the file name was: Title - First Name Last Name.epub
I tried:
(?P<title>.+)[^\w](?P<author>[^_]+)
And it would return the title as: Report on ATP Cycle Doe,
And the author as: John
Can anyone help me figure out a regex to extract the title and author from the file name convention that I used?
Such that the title is: Report on ATP Cycle
And the author is: John Doe
It is much appreciated.
Use this:
^(?P<title>[^_]+)_(?P<author>.*)\.epub$
In the Regex Demo, look at the named groups in the right pane.
Explanation
The ^ anchor asserts that we are at the beginning of the string
(?P<title>[^_]+) captures chars that are not an underscore to the title capture group
(?P<author>.*) captures any chars to the author capture group
\.epub matches .epub
The $ anchor asserts that we are at the end of the string
Variation
If for some reason the regex is not supposed to match the .epub extension, use this instead:
^(?P<title>[^_]+)_(?P<author>[^.]+)
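The `(?P<name>...)` syntax is Python's, so the main pattern can be tried directly in a short script. Note that the regex alone captures the author as `Doe, John`; swapping it to `John Doe` is done here in a hypothetical helper (`parse_filename`), which is an assumption for illustration and not something the pattern or Calibre is claimed to do.

```python
import re

# Title is everything before the underscore; author is the rest, minus ".epub".
FILENAME = re.compile(r"^(?P<title>[^_]+)_(?P<author>.*)\.epub$")


def parse_filename(filename):
    """Split 'Title_Last, First.epub' into (title, 'First Last')."""
    match = FILENAME.match(filename)
    if match is None:
        return None
    author = match.group("author")
    # Reordering happens in code: a capture group cannot rearrange its text.
    if ", " in author:
        last, first = author.split(", ", 1)
        author = f"{first} {last}"
    return match.group("title"), author


print(parse_filename("Report on ATP Cycle_Doe, John.epub"))
```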

EditPad: Find and Replace with RegEx Backreferences

I'm trying my hand at regex again. In particular, using a backreference to found text in the replace string in the EditPad text editor.
Subject:
Product1 Desc,12 PIN,GradeA Qty Price
Product2 Desc,28 PIN,GradeA Qty Price
Goal:
Since the text is currently space-separated, I need to replace 12 PIN with 12||PIN, and 28 PIN with 28||PIN.
What I'm trying:
[(0-9)]+[(\s)]PIN seems to be finding what I want just fine.
When I try to replace with backreferences, though, the only one I can get to work is \0.
For example, using \0||PIN as my replace gives me 12 PIN||PIN.
When I try to replace with \1||PIN, however, it gives ||PIN.
What am I missing?
I could have sworn that I saw a previous poster answer this...
Using this as your find string:
([0-9]+)[\s]*PIN
and this as your replace string:
\1||PIN
should do it.
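The reason \1 came back empty is that parentheses inside square brackets are literal characters, not a capture group, so the original pattern has no group 1 at all. The same thing can be seen in Python's `re` module (used here only as a stand-in; EditPad has its own regex engine, and the variable names are made up for the example):

```python
import re

line = "Product1 Desc,12 PIN,GradeA Qty Price"

# Inside [...] the parentheses are ordinary characters, so there are no groups.
no_groups = re.compile(r"[(0-9)]+[(\s)]PIN")
# Parentheses outside the brackets create capture group 1.
with_group = re.compile(r"([0-9]+)\s*PIN")

print(no_groups.groups)   # number of capture groups in the original pattern
print(with_group.groups)  # number of capture groups in the corrected pattern

# \1 re-inserts the captured digits in front of the new separator.
fixed = with_group.sub(r"\1||PIN", line)
print(fixed)
```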

Regular Expression match everything but two names and <email address> after particular word

I have a bunch of Names and email addresses inside of these aggregated emails and I'd like to get rid of everything but the First Last <email#domain.com> throughout the document. Basically I have...
From: Name Wood <email#gmail.com>
Subject: Yelp entries for iPod contest
Date: April 20, 2012 12:51:07 PM EDT
To: email#domain.cc
Have had a great experience with .... My Son ... is currently almost a year into treatment. Dr. ... is great! Very informative and always updates us on progress and we have our regular visits. The ... buck program is a great incentive which they've implemented to help kids take care of their teeth/braces. They also offer payment programs which help for those of us that need a structured payment option. Wouldn't take my kids anywhere else. Thanks Dr. ... and staff
Text for 1, 2, and 3 entries to Yelp
Hope ... wins!!
Begin forwarded message:
From: Name Wood <email#gmail.com>
Subject: reviews 2 and 3
Date: April 20, 2012 12:44:26 PM EDT
To: email#domain.cc
Have had a great experience with ... Orthodontics. My Son ... is currently almost a year into treatment. Dr. ... is great! Very informative and always updates us on progress and we have our regular visits. The ... buck program is a great incentive which they've implemented to help kids take care of their teeth/braces. They also offer payment programs which help for those of us that need a structured payment option. Wouldn't take my kids anywhere else. Thanks Dr. ... and staff
Have had a great experience with...
I want to only match the...
Name Wood <email#gmail.com>
Name Wood <email#gmail.com>
from this text. So basically I want to match next two words after the word "From: " plus "<"+email address+">" excluding the word "From: ". I've gleaned from researching that this is a negative lookahead (I think) searching for two whole words (somehow using {0,2}) and then an email address from one < character to another >.
You could just do this:
/(?:From: )(.*)/g
This regular expression will find what you're looking for:
(?<=From:)\s*[^<]+<[^>]+>
But what you're going to do with it is a little unclear from your question. The matched text should probably be put into one or more groups so you can extract the text you want. (Name in one group? Email in a separate group? Or both together?) You haven't said what you want to do with it, so you'll have to provide more information. The above is the simplest case scenario.
Explanation:
(?<=From:) # positive lookbehind to find "From:"
\s* # optional whitespace
[^<]+< # everything up to the first '<' (the name)
[^>]+> # everything up to the '>' (the email)
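If the goal is simply to collect the senders, a capture group avoids having the leading whitespace in the result. A minimal sketch in Python, assuming the text is available as one string; `SENDER`, `SAMPLE`, and `senders` are names invented for the example, and the sample is a shortened version of the text in the question:

```python
import re

# Capture "Name <address>" after "From:", never crossing a line break.
SENDER = re.compile(r"From:[^\S\n]*([^<\n]+<[^>\n]+>)")

SAMPLE = (
    "From: Name Wood <email#gmail.com>\n"
    "Subject: Yelp entries for iPod contest\n"
    "To: email#domain.cc\n"
    "Text for 1, 2, and 3 entries to Yelp\n"
    "Begin forwarded message:\n"
    "From: Name Wood <email#gmail.com>\n"
    "Subject: reviews 2 and 3\n"
)


def senders(text):
    """Return every 'Name <address>' that follows a 'From:' label."""
    # With exactly one group in the pattern, findall returns the group text.
    return SENDER.findall(text)


for sender in senders(SAMPLE):
    print(sender)
```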
If you want to strip all but the name and email.
Modifier 's' (dot includes newline),
Global find and replacement for both regexes is $1\n
This is faster but will leave an extra newline on successes.
Find .*?From:[^\S\n]*([^<\n]+<[^>\n]*\#[^>\n]*>)|.*$
This is slower (uses lookahead) but won't leave the extra newline.
Find .*?From:[^\S\n]*([^<\n]+<[^>\n]*\#[^>\n]*>)(?:(?!From:[^\S\n]*[^<\n]+<[^>\n]*\#[^>\n]*>).)*