Using Calibre, figuring out RegEx expressions. Configuring metadata from file name - regex

I am trying to use Calibre on my mac to organize my ebook library.
As a summer personal project, I created various epubs of my nephews' and nieces' school reports as keepsakes on my computer and phone. I had labeled the files as: Title_Last Name, First Name.epub
For example: Report on ATP Cycle_Doe, John.epub
With Calibre I found you can configure metadata from the file name.
For example:
(?P<title>.+) - (?P<author>[^_]+)
Would only work if the file name was: Title - First Name Last Name.epub
I tried:
(?P<title>.+)[^\w](?P<author>[^_]+)
And it would return the title as: Report on ATP Cycle Doe,
And the author as: John
Can anyone help me figure out a regex to extract the title and author from the file name convention that I used?
Such that the title is: Report on ATP Cycle
And the author is: John Doe
It is much appreciated.

Use this:
^(?P<title>[^_]+)_(?P<author>.*)\.epub$
In the Regex Demo, look at the named groups in the right pane.
Explanation
The ^ anchor asserts that we are at the beginning of the string
(?P<title>[^_]+) captures chars that are not an underscore to the title capture group
(?P<author>.*) captures any chars to the author capture group
\.epub matches .epub
The $ anchor asserts that we are at the end of the string
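A quick way to check the pattern outside Calibre is a short Python sketch (the (?P<name>...) named-group syntax is Python's). The function name parse_filename and the step that reorders "Doe, John" into "John Doe" are assumptions added for illustration; the regex on its own captures the author as "Doe, John".

```python
import re

# Pattern from the answer above.
PATTERN = re.compile(r"^(?P<title>[^_]+)_(?P<author>.*)\.epub$")

def parse_filename(filename):
    """Return (title, author) for 'Title_Last, First.epub', or None if no match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    title = match.group("title")
    author = match.group("author")
    # Hypothetical post-processing step: turn 'Last, First' into 'First Last'.
    if "," in author:
        last, first = (part.strip() for part in author.split(",", 1))
        author = f"{first} {last}"
    return title, author

print(parse_filename("Report on ATP Cycle_Doe, John.epub"))
```

Whether Calibre itself can reorder the name depends on its author-name settings, so the swap here is done in plain Python only.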
Variation
If for some reason the regex is not supposed to match the .epub extension, use this instead:
^(?P<title>[^_]+)_(?P<author>[^.]+)

Related

Using RegEx to Extract Anchor Text of Links With Beginning of Specific Target URL

I need assistance with capturing "Mr. John Doe" from the following HTML code:
Mr. John Doe
I have been trying various string matching and thought that I was close when I tried using the following RegEx:
(.*)
...But, no matches were found in the capturing group.
This is something I'm trying to set as a parameter in a crawl simulation software (using PCRE). I'm simply looking to extract the author name which would appear within a hyperlink that links to a target URL beginning with /author/...
Any pointers? Thank you in advance!
Your problem is that you require a space when there's none:
(.*)
# ^^^
Remove it and it works:
(.*)
However, this expression could be vastly optimized (no dot-star-soup everywhere, that is) and if you're still at the beginning, better use a parser and xpath queries instead.
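As a rough sketch of the parser route, here is a version using only Python's standard-library html.parser. The sample markup, the class name AuthorLinkParser, and the /author/john-doe/ URL are made up for illustration, since the original HTML is not shown above; a crawler limited to PCRE would not be able to use this directly.

```python
from html.parser import HTMLParser

class AuthorLinkParser(HTMLParser):
    """Collect the anchor text of links whose href starts with /author/."""

    def __init__(self):
        super().__init__()
        self._in_author_link = False
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("/author/"):
                self._in_author_link = True
                self.authors.append("")

    def handle_data(self, data):
        # Only text inside a matching <a> element is kept.
        if self._in_author_link:
            self.authors[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_author_link = False

# Hypothetical sample markup.
sample = '<p>By <a href="/author/john-doe/">Mr. John Doe</a> in <a href="/news/">News</a></p>'
parser = AuthorLinkParser()
parser.feed(sample)
print(parser.authors)
```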

what is regex doing in the background?

I played around with regex today and stumbled on something: I don't really understand why it behaves like this.
This is my working regex (I formatted it for better readability):
(?<name>[a-z\ ]+[a-zA-Z]+|[a-zA-Z]+)\
(?<firstname>[a-z-A-Z\ ]+)\n
(?<title>[a-zA-Z\.\ ]+)\n?
(?<company>[a-zA-Zäöü\.\ ]+)?\n
(?<street>[a-zA-Zäöü]+)\ (?<housenumber>[0-9]+)\n?
(?<postfach>Postfach [0-9]+)?\n
(?<zip>[0-9]+)\ (?<place>[a-zA-Zäöü]+)
And this is the string I want to parse through:
Smith John
Dr.
Foobar AG
Smithstrasse 1
Postfach 1
6500 Bellinzona
With this regex it works perfectly. But previously the \n before group street was optional and the \n before group company was not. The catch is that in some cases the string has no company in it. With the previous version, the whole street except for its last character ended up in group company, and the last character of the street in group street (I used regex101 for testing). Although group company is optional, it looks like it is "forced" to match part of the string, which is definitely not what I want.
And that's where my question comes in. How does regex work exactly in the background? I thought regex tries to pick the best solution out of all the possible groupings it can find in the string, but I have no clue why it takes this solution as the best one.
Here's a link to regex101 where you can see how it behaved previously: https://regex101.com/r/OmuPBn/1
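The behaviour described above is consistent with how backtracking engines work: each greedy quantifier first takes as much as it can, gives characters back one at a time only when the rest of the pattern fails, and the engine accepts the first overall match it reaches rather than a "best" one. The sketch below reproduces this in Python with a shortened, hypothetical input (no company line) and simplified character classes (no umlauts); Python needs (?P<name>...) instead of (?<name>...).

```python
import re

# Simplified sample without a company line (shortened, hypothetical input).
text = "Dr.\nSmithstrasse 1"

# Earlier layout: "\n" after title is required, "\n" after company is optional.
earlier = re.compile(
    r"(?P<title>[a-zA-Z. ]+)\n"
    r"(?P<company>[a-zA-Z. ]+)?\n?"
    r"(?P<street>[a-zA-Z]+) (?P<housenumber>[0-9]+)"
)

# Working layout: "\n" after title is optional, "\n" after company is required.
working = re.compile(
    r"(?P<title>[a-zA-Z. ]+)\n?"
    r"(?P<company>[a-zA-Z. ]+)?\n"
    r"(?P<street>[a-zA-Z]+) (?P<housenumber>[0-9]+)"
)

def groups(pattern):
    """Return the named groups of the first match in the sample text."""
    return pattern.search(text).groupdict()

print(groups(earlier))  # company swallows most of the street
print(groups(working))  # company stays empty, street is complete
```

In the earlier layout the optional company group greedily consumes "Smithstrasse " and then hands back just enough characters (one letter) for street to match. In the working layout the required \n after company cannot match anywhere inside the street line, so the engine falls back to leaving company empty.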

Specific regex name pattern that allows for various name from other countries

Hoping for some help. I want a pattern that matches names in the following format:
Surname, Firstname othername othername
My pattern = "^[\\p{L} .'-]+[,\s][\\p{L} .'-]+$"
It works, but I want to disallow whitespace in the surname part,
i.e. so that "Johnson Fredrick, Whatever" wouldn't be matched.
I tried "^([\\p{L} .'-]+[^\s][,\s][\\p{L} .'-]+$" but no such luck...
Thanks!
Remove the space from the first character class:
My pattern = "^[\\p{L}.'-]+[,\s][\\p{L} .'-]+$"
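To try the idea in Python, note that the built-in re module does not support \p{L}; the sketch below substitutes [^\W\d_] (any word character that is not a digit or underscore) as a stand-in for "any letter". The names LETTER, NAME and is_valid_name are illustrative, not part of the original pattern.

```python
import re

# Stand-in for \p{L}: a word character that is neither a digit nor "_".
LETTER = r"[^\W\d_]"

# Surname part has no space in its allowed set; the rest of the name does.
NAME = re.compile(rf"^(?:{LETTER}|[.'-])+[,\s](?:{LETTER}|[ .'-])+$")

def is_valid_name(name):
    """True if name looks like 'Surname, Firstname othername ...'."""
    return NAME.match(name) is not None

for candidate in ("Johnson, Fredrick Whatever",
                  "Johnson Fredrick, Whatever",
                  "Müller, Jörg"):
    print(candidate, is_valid_name(candidate))
```

Because the separator class [,\s] also accepts whitespace, "Johnson Fredrick" (no comma at all) still matches; use a literal comma there if that is not wanted.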

Lookahead in Regex

I am trying to extract the venue from a file that contains several articles, using regex. I know that the venue starts with either For or From and is followed by a date, which starts with a day of the week, or by the author's name if the date is missing. I wrote the following regex to match the venue; however, it always matches everything up to the author's name, which means the date is also included in the venue if the article has a date.
"""((?<=\n)(?:(?:\bFrom\b)|(?:\bFor\b)).*?(?=(?:(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)|(?:[A-Z]+))))""".r
Why does my code not stop at the day name when one is encountered, but instead go on to match [A-Z]+, which is the author's name?
Input: "The Consequences of Hostilities Between the States
From the New York Packet.
Tuesday, November 20, 1787.
HAMILTON
To the People of the State of New York:"
The line "Tuesday, November 20, 1787." is optional and may not occur in all articles. I want the output to be "From the New York Packet."
I am getting the correct output for articles that do not have a date, however I am getting the output "From the New York Packet.
Tuesday, November 20, 1787." for articles that contain the date.
Based on your edit, all you really need is
^(From|For).*
with the multiline flag.
"I know that the venue starts with either For/From"
"and is followed by date which starts with a day of the week or author's name if the date is missing"
"it always matches everything till the author's name which means the date also comes in the venue if that article has a date."
Sounds like you want to find an entire line within a text file that begins with "From" or "For"
^(From|For)
(Set the multiline flag on so that ^ matches the beginning of a line rather than the beginning of input).
is followed by an optional date
\s+(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)?
followed by the author's name
\s+\w+\s+\w+
followed by everything until the end of the line
.*
Unless, of course, you mean that you want to skip the date and match only the For/From part and the author's name. A single contiguous match cannot skip text in the middle; you can use capture groups to extract the desired values, though.
You only need to capture the entire line that starts with For or From, so you can simply use this:
^(For|From).*$
The ^ and $ anchor the match to the start and end of the line (with the multiline flag set), and the .* matches everything in between.
If this needs to be more complicated, I'll update my answer.
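For reference, a minimal Python sketch of the line-anchored approach, run against the sample input from the question. The question's own code is Scala; find_venue is a hypothetical helper name, and re.MULTILINE plays the role of the multiline flag mentioned above.

```python
import re

# Sample input from the question.
article = (
    "The Consequences of Hostilities Between the States\n"
    "From the New York Packet.\n"
    "Tuesday, November 20, 1787.\n"
    "HAMILTON\n"
    "To the People of the State of New York:"
)

def find_venue(text):
    """Return the first whole line starting with For or From, or None."""
    # re.MULTILINE makes ^ and $ match at line boundaries, and "." does not
    # cross newlines by default, so the match stops at the end of the line.
    match = re.search(r"^(For|From).*$", text, re.MULTILINE)
    return match.group(0) if match else None

print(find_venue(article))  # From the New York Packet.
```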

Regular Expression match everything but two names and <email address> after particular word

I have a bunch of Names and email addresses inside of these aggregated emails and I'd like to get rid of everything but the First Last <email#domain.com> throughout the document. Basically I have...
From: Name Wood <email#gmail.com>
Subject: Yelp entries for iPod contest
Date: April 20, 2012 12:51:07 PM EDT
To: email#domain.cc
Have had a great experience with .... My Son ... is currently almost a year into treatment. Dr. ... is great! Very informative and always updates us on progress and we have our regular visits. The ... buck program is a great incentive which they've implemented to help kids take care of their teeth/braces. They also offer payment programs which help for those of us that need a structured payment option. Wouldn't take my kids anywhere else. Thanks Dr. ... and staff
Text for 1, 2, and 3 entries to Yelp
Hope ... wins!!
Begin forwarded message:
From: Name Wood <email#gmail.com>
Subject: reviews 2 and 3
Date: April 20, 2012 12:44:26 PM EDT
To: email#domain.cc
Have had a great experience with ... Orthodontics. My Son ... is currently almost a year into treatment. Dr. ... is great! Very informative and always updates us on progress and we have our regular visits. The ... buck program is a great incentive which they've implemented to help kids take care of their teeth/braces. They also offer payment programs which help for those of us that need a structured payment option. Wouldn't take my kids anywhere else. Thanks Dr. ... and staff
Have had a great experience with...
I want to only match the...
Name Wood <email#gmail.com>
Name Wood <email#gmail.com>
from this text. So basically I want to match next two words after the word "From: " plus "<"+email address+">" excluding the word "From: ". I've gleaned from researching that this is a negative lookahead (I think) searching for two whole words (somehow using {0,2}) and then an email address from one < character to another >.
You could just do this:
/(?:From: )(.*)/g
This regular expression will find what you're looking for:
(?<=From:)\s*[^<]+<[^>]+>
But what you're going to do with it is a little unclear from your question. The matched text should probably be put into one or more groups so you can extract the text you want. (Name in one group? Email in a separate group? Or both together?) You haven't said what you want to do with it, so you'll have to provide more information. The above is the simplest case scenario.
Explanation:
(?<=From:) # positive lookbehind to find "From:"
\s* # optional whitespace
[^<]+< # everything up to the first '<' (the name)
[^>]+> # everything up to the '>' (the email)
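A small Python sketch of this lookbehind pattern on a shortened version of the sample text. The helper name extract_senders is an assumption, and the .strip() call is added because the match itself starts right after "From:" and so includes the leading space.

```python
import re

# Lookbehind pattern from the answer above.
SENDER = re.compile(r"(?<=From:)\s*[^<]+<[^>]+>")

def extract_senders(text):
    """Return every 'Name <email>' that follows a 'From:' header."""
    return [m.group(0).strip() for m in SENDER.finditer(text)]

# Shortened sample based on the text in the question.
sample = (
    "From: Name Wood <email#gmail.com>\n"
    "Subject: Yelp entries for iPod contest\n"
    "To: email#domain.cc\n"
    "Begin forwarded message:\n"
    "From: Name Wood <email#gmail.com>\n"
    "Subject: reviews 2 and 3\n"
)

print(extract_senders(sample))
```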
If you want to strip everything but the name and email, use the 's' modifier (dot matches newline) and a global find-and-replace. The replacement for both regexes is $1\n
This one is faster but will leave an extra newline on successes.
Find .*?From:[^\S\n]*([^<\n]+<[^>\n]*\#[^>\n]*>)|.*$
This is slower (uses lookahead) but won't leave the extra newline.
Find .*?From:[^\S\n]*([^<\n]+<[^>\n]*\#[^>\n]*>)(?:(?!From:[^\S\n]*[^<\n]+<[^>\n]*\#[^>\n]*>).)*
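A hedged Python sketch of the first (faster) find-and-replace, assuming Python's re module: re.DOTALL stands in for the 's' modifier and \1 for $1. The helper name strip_to_senders and the shortened sample are illustrative. The number of trailing blank lines can differ between engines and versions, so the example filters out empty lines rather than relying on them.

```python
import re

# "Faster" pattern from above; group 1 holds 'Name <email>'.
STRIP = re.compile(
    r".*?From:[^\S\n]*([^<\n]+<[^>\n]*\#[^>\n]*>)|.*$",
    re.DOTALL,
)

def strip_to_senders(text):
    """Replace everything except 'Name <email>' after 'From:' with newlines."""
    # When the second alternative matches, group 1 is unset and is
    # substituted as an empty string (Python 3.5 and later).
    return STRIP.sub(r"\1\n", text)

# Shortened sample based on the text in the question.
sample = (
    "From: Name Wood <email#gmail.com>\n"
    "Subject: Yelp entries for iPod contest\n"
    "Begin forwarded message:\n"
    "From: Name Wood <email#gmail.com>\n"
    "Subject: reviews 2 and 3\n"
)

senders = [line for line in strip_to_senders(sample).splitlines() if line]
print(senders)
```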