Regex: Separated by nested parentheses and semicolon - regex

My strings look like the following (each row is one exemplary string):
Smith, Anna (Univ Cambridge); Doe, Jane (Univ Vienna (Austria)); Doe, John (Univ Tokyo; MIT)
Mueller, Hans (FU Berlin (Germany)); Schmid, Julia (); Doe, John (CalTech); Boe, Jane (TU Wien)
Kim, Lee (Nazarbayev Univ (Kazakhstan); Univ Oxford)
In other words, the pattern comprises Surname, Name (Affiliation); (or without the ; if no other person follows), whereby the parentheses may be optionally nested ( () ) or contain a ; or be empty ().
I want to extract each name and affiliation, as in:
Smith, Anna (Univ Cambridge)
Doe, Jane (Univ Vienna (Austria))
Doe, John (Univ Tokyo; MIT)
Mueller, Hans (FU Berlin (Germany))
Schmid, Julia ()
Doe, John (CalTech)
Boe, Jane (TU Wien)
Kim, Lee (Nazarbayev Univ (Kazakhstan); Univ Oxford)
What would be the correct RegEx to do this?
My attempt with (?<=\()(?:[^()]+|\([^)]+\))+ did not work well...

Since your expected matches can only have one level of nested parentheses, you can use
\w+,\s*\w+\s*\([^()]*(?:\([^()]*\)[^()]*)*\);?
See the regex demo.
Depending on whether or not your regex library supports recursion, or balanced constructs, this can be further enhanced to match parenthetical phrases of any depth.
Details:
\w+ - one or more word chars
, - a comma
\s* - zero or more whitespaces
\w+\s* - one or more word and then zero or more whitespace chars
\( - a ( char
[^()]* - zero or more chars other than ( and )
(?:\([^()]*\)[^()]*)* - zero or more sequences of (...) substrings with no ( and ) in between and then zero or more chars other than ( and )
\);? - a ) and then an optional ;.
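As a quick check, here is a minimal Python sketch that applies the pattern above to the sample strings from the question. The function name extract_people is only illustrative, and stripping the trailing ; from each match is an assumption about the desired output, since the pattern itself also consumes the optional ;.

```python
import re

# Pattern from the answer above: at most one level of nested parentheses.
PERSON = re.compile(r"\w+,\s*\w+\s*\([^()]*(?:\([^()]*\)[^()]*)*\);?")

def extract_people(text):
    """Return every 'Surname, Name (Affiliation)' chunk found in text."""
    # The pattern includes an optional trailing ';', so strip it here.
    return [m.group(0).rstrip(";") for m in PERSON.finditer(text)]

rows = [
    "Smith, Anna (Univ Cambridge); Doe, Jane (Univ Vienna (Austria)); Doe, John (Univ Tokyo; MIT)",
    "Mueller, Hans (FU Berlin (Germany)); Schmid, Julia (); Doe, John (CalTech); Boe, Jane (TU Wien)",
    "Kim, Lee (Nazarbayev Univ (Kazakhstan); Univ Oxford)",
]

for row in rows:
    for person in extract_people(row):
        print(person)
```

Affiliations nested more than one level deep would not be matched by this pattern.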

Related

Capturing string parts in RegEx

I would like to capture different parts of a string; some of them are optional, some of them are always there. I'm using Calibre's built-in function (based on Python regex), but it is a general question: how can I do it in regex?
Sample strings:
!!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
!Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
Mixed Fortunes - An Economic History of China Russia and the West 0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf
!!Mixed Fortunes - An Economic History of China Russia and the West by Vladimir Popov (Jun 17, 2014 4_1).pdf
!!Mixed Fortunes - An Economic History of China Russia and the West by 1 Vladimir Popov (Jun 17, 2014 4_1).pdf
The strings' structure is the following:
[importance markings if any, it can be '!' or '!!'][title][ISBN-10 if available]by[author]([publication date and other metadata]).[file type]
Finally I created this regular expression, but it is not perfect: if an ISBN is present, the title will contain the ISBN part too...
(?P<title>[A-Za-z0-9].+(?P<isbn>[0-9]{10})|([A-Za-z0-9].*))\sby\s.*?(?P<author>[A-Z0-9].*)(?=\s\()
Here is my sandbox: https://regex101.com/r/K2FzpH/1
I really appreciate any help!
Instead of using an alternation, you could use:
^!*(?P<title>[A-Za-z0-9].+?)(?:\s+(?P<isbn>[0-9]{10}))?\s+by\s+(?P<author>[A-Z0-9][^(]+)(?=\s\()
^ Start of the string
!* Match optional exclamation marks
(?P<title>[A-Za-z0-9].+?) Named group title: match one char from the character class, then as few chars as possible
(?:\s+(?P<isbn>[0-9]{10}))? Optionally match 1+ whitespace chars and named group isbn, which matches 10 digits
\s+by\s+ Match by between 1 or more whitespace chars
(?P<author>[A-Z0-9][^(]+) Named group author: match either A-Z or 0-9 followed by 1+ chars other than (
(?=\s\() Positive lookahead to assert a whitespace char and then ( directly to the right
Regex demo
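Since Calibre's regex is based on Python, the pattern can be tried directly with Python's re module. The following is a minimal sketch; the function name parse_filename is hypothetical and not part of Calibre.

```python
import re

# Pattern from the answer above, split over several lines for readability.
BOOK = re.compile(
    r"^!*(?P<title>[A-Za-z0-9].+?)"
    r"(?:\s+(?P<isbn>[0-9]{10}))?"
    r"\s+by\s+(?P<author>[A-Z0-9][^(]+)(?=\s\()"
)

def parse_filename(name):
    """Return a dict with title, isbn and author, or None if nothing matches."""
    m = BOOK.search(name)
    return m.groupdict() if m else None

with_isbn = ("!!Mixed Fortunes - An Economic History of China Russia and the West "
             "0198703635 by Vladimir Popov (Jun 17, 2014 4_1).pdf")
without_isbn = ("!!Mixed Fortunes - An Economic History of China Russia and the West "
                "by Vladimir Popov (Jun 17, 2014 4_1).pdf")

print(parse_filename(with_isbn))
print(parse_filename(without_isbn))
```

When the ISBN is absent, the isbn entry of the returned dict is None, because the optional group did not take part in the match.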

Using REGEX to remove duplicates when entire line is not a duplicate

^(.*)(\r?\n\1)+$
replace with \1
The above is a great way to remove duplicate lines using REGEX
but it requires the entire line to be a duplicate
However, what would I use if I want to detect and remove dups when the entire line as a whole is not a dup, but just the first X characters are?
Example:
Original File
12345 Dennis Yancey University of Miami
12345 Dennis Yancey University of Milan
12345 Dennis Yancey University of Rome
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
Dups Removed
12345 Dennis Yancey University of Miami
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
How about using a second group to check, e.g., the first 10 characters:
^((.{10}).*)(?:\r?\n\2.*)+
Here {n} specifies the number of characters from the line start that should be checked for duplicates.
The whole line is captured to $1, which is also used as the replacement.
The second group is used to check for duplicate line starts.
See this demo at regex101
Another idea would be the use of a lookahead and replace with empty string:
^(.{10}).*\r?\n(?=\1)
This one just drops the current line if the captured $1 is ahead at the start of the next line, so the last line of each run of duplicates is the one that remains.
Here is the demo at regex101
To also remove duplicate lines that are shorter than 10 characters, here is a PCRE idea using conditionals: ^(?:(.{10})|(.{0,9}$)).*+\r?\n(?(1)(?=\1)|(?=\2$)) and replace with empty string.
If your regex flavor supports possessive quantifiers, use of .*+ will improve performance.
Be aware that all these patterns (and your current regex) just target consecutive duplicate lines.
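To compare the first two patterns, here is a minimal Python sketch using re.sub in multiline mode. The function names keep_first and keep_last and the parameter n are illustrative only. It shows that the group-based pattern keeps the first line of each run, while the lookahead pattern keeps the last one.

```python
import re

def keep_first(text, n=10):
    """Collapse consecutive lines sharing their first n chars; keep the first line."""
    pattern = r"^((.{%d}).*)(?:\r?\n\2.*)+" % n
    return re.sub(pattern, r"\1", text, flags=re.MULTILINE)

def keep_last(text, n=10):
    """Drop a line when the next line starts with the same n chars; keep the last line."""
    pattern = r"^(.{%d}).*\r?\n(?=\1)" % n
    return re.sub(pattern, "", text, flags=re.MULTILINE)

original = "\n".join([
    "12345 Dennis Yancey University of Miami",
    "12345 Dennis Yancey University of Milan",
    "12345 Dennis Yancey University of Rome",
    "12344 Ryan Gardner University of Spain",
    "12347 Smith John University of Canada",
])

print(keep_first(original))
print("---")
print(keep_last(original))
```

As with the patterns themselves, only consecutive duplicates are handled; for non-adjacent duplicates the input would have to be sorted first or processed with a set of already seen prefixes.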

Advanced regex: What would be the regex for this pattern?

Want to identify names of all authors in the following text:
#misc{diaz2006automatic,
title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006},
month=jul # "~12",
note={EP Patent 1,678,025}
}
#article{standefer1984sitting,
title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery},
volume={14},
number={6},
pages={649--658},
year={1984},
publisher={LWW}
}
#article{gentsch1992identification,
title={Identification of group A rotavirus gene 4 types by polymerase chain reaction.},
author={GenTSCH, JoN R and Glass, RI and Woods, P and Gouvea, V and Gorziglia, M and Flores, J and Das, BK and Bhan, MK},
journal={Journal of Clinical Microbiology},
volume={30},
number={6},
pages={1365--1373},
year={1992},
publisher={Am Soc Microbiol}
}
For the above text, regex should match:
match1 - Diaz, Navarro David
match2 - Gines, Rodriguez Noe
match3 - Standefer, Michael
match4 - Bay, Janet W
match5 - Trusso, Russell
...and so on
What you want should be easily achievable by capturing the contents between { and } for all lines starting with author= and then splitting that content using the \s*(?:,|\band\b)\s* regex, which will give you all the author names.
But just in case your regex engine is PCRE-based, you can use this regex, whose group 1 content will give you the author names like you want.
^\s*author={|(?!^)\G((?:(?! and|, )[^}\n])+)(?: *and *)?(?:[^\w\n]*)
This regex exploits the \G operator: it first matches lines starting with author={ and then matches the names, which shouldn't contain and or , within them, using the (?!^)\G((?:(?! and|, )[^}\n])+)(?: *and *)?(?:[^\w\n]*) part.
Regex Demo
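Python's built-in re module does not offer \G, so a sketch in Python has to follow the two-step approach instead: capture the author field, then split it. Note one deliberate deviation, which is an assumption about the desired output: the split below uses only the word and as a separator (not the comma), so that each "Surname, Given names" pair stays together as in the expected matches of the question. The function name author_names is hypothetical.

```python
import re

# Capture what sits between { and } on lines that start with author=
AUTHOR_FIELD = re.compile(r"^\s*author=\{([^}\n]*)\}", re.MULTILINE)

def author_names(bibtex):
    """Return all author names found in the author={...} fields of the text."""
    names = []
    for field in AUTHOR_FIELD.findall(bibtex):
        # Split on the standalone word 'and'; commas inside a name are kept.
        names.extend(re.split(r"\s+and\s+", field.strip()))
    return names

sample = """#misc{diaz2006automatic,
title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006},
}
#article{standefer1984sitting,
title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery},
}"""

for name in author_names(sample):
    print(name)
```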

Regex for more than 1 First Name before the Middle Initial

I'm not that good with regular expressions and here is my problem:
I want to create a regex that matches a name that has two or more first names (e.g. Francis Gabriel).
I came up with the regex ^[A-Z][a-z]{3,30}/s[A-Z][a-z]{3,30} but
it only matches two first names and not all first names.
The regex should match John John J. Johnny.
^[A-Z][a-z]{3,30}(\\s[A-Z](\\.|[a-z]{2,30})?)*$
\\s must be used in Java when the regex is written as a string literal passed to Pattern.compile.
Each part after the first name is either an initial such as X. or a full name such as Xyz, so the capital letter is followed by either . or [a-z]{2,30}.
John Johny J.hny -> is wrong
At least one first name must be there, so the * at the end of the second part matches 0 or more additional parts.
Since Java cannot be run in this snippet, here is a JavaScript implementation of the same regex for illustration.
var reg=/^[A-Z][a-z]{3,30}(\s[A-Z](\.|[a-z]{2,30})?)*$/;
console.log(reg.test("John john")); // false because the second part starts with a lowercase letter
console.log(reg.test("John John")); // true
console.log(reg.test("John John J.")); // true
console.log(reg.test("John John J. Johny")); // true
Use the following regex:
^\w+\s(\w+\s)+\w\.\s\w+$
^\w+\s match a name and a space
(\w+\s)+ followed by at least one more name and space
\w\.\s followed by a single-letter initial with a dot, then a space
\w+$ followed by a last name
Regex101
Test code:
String testInput = "John John P. Johnny";
// String.matches() already requires the whole input to match the pattern
if (testInput.matches("^\\w+\\s(\\w+\\s)+\\w\\.\\s\\w+$")) {
    System.out.println("We have a match");
}
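To see how the two patterns from the answers above differ, here is a small Python sketch that runs both against a few names. The labels LOOSE and STRICT and the helper check are illustrative names only. The first pattern accepts any number of additional names or initials, while the second one requires at least two first names, a one-letter initial with a dot, and a last name.

```python
import re

# First answer: a first name, then any number of capitalized names or initials.
LOOSE = re.compile(r"^[A-Z][a-z]{3,30}(\s[A-Z](\.|[a-z]{2,30})?)*$")
# Second answer: two or more first names, a one-letter initial, a last name.
STRICT = re.compile(r"^\w+\s(\w+\s)+\w\.\s\w+$")

def check(name):
    """Return (matches LOOSE, matches STRICT) for the given name."""
    return bool(LOOSE.match(name)), bool(STRICT.match(name))

for name in ["John John J. Johnny", "John J. Johnny", "John John", "John john"]:
    print(name, check(name))
```

Only the second pattern rejects John J. Johnny, because it insists on more than one first name before the initial.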
Try this:
^(\S*\s+)(\S*)?\s+\S*?
Francis Gabriel - matches:
0: [0,10] Francis
1: [0,9] Francis
2: [9,9]
John John2 J. Johnny - matches:
0: [0,11] John John2
1: [0,5] John
2: [5,10] John2

Regex to match a few possible strings with possible leading and/or trailing spaces

Let's say I have a string:
John Smith (auth.), Mary Smith, Richard Smith (eds.), Richie Jack (ed.), Jack Johnny (eds.)
I would like to match:
John Smith(auth.),Mary Smith,Richard Smith(eds.),Richie Jack(ed.),Jack Johnny(eds.)
I have come up with a regex, but I have a problem with the | (or character) because my string contains characters that have to be escaped, like ( ) and the dot. This is what I'm not able to deal with. My regex is:
\s+\((auth\.\)|\(eds\.\))?,\s+
EDIT: I now think that the most universal solution would be to assume that anything could appear inside ().
Try this:
\s*\((auth|eds?)?\.\)?,?\s*
\s+ means one or more
\s* means zero or more
Based on your comment, I modified the regex:
\s*((\([^)]*\))|,)\s*
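A minimal Python sketch of how the modified regex could be applied follows. The answer does not state the replacement, so using \1 (the parenthesized part or the comma itself, without the surrounding whitespace) is an assumption here, as is the function name tighten.

```python
import re

def tighten(text):
    """Remove whitespace around '(...)' groups and commas."""
    # Group 1 holds either the parenthesized part or the comma; putting it
    # back without the surrounding whitespace gives the compact form.
    return re.sub(r"\s*((\([^)]*\))|,)\s*", r"\1", text)

source = ("John Smith (auth.), Mary Smith, Richard Smith (eds.), "
          "Richie Jack (ed.), Jack Johnny (eds.)")
print(tighten(source))
```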