Advanced regex: What would be the regex for this pattern? - regex

Want to identify names of all authors in the following text:
#misc{diaz2006automatic,
title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006},
month=jul # "~12",
note={EP Patent 1,678,025}
}
#article{standefer1984sitting,
title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery},
volume={14},
number={6},
pages={649--658},
year={1984},
publisher={LWW}
}
#article{gentsch1992identification,
title={Identification of group A rotavirus gene 4 types by polymerase chain reaction.},
author={GenTSCH, JoN R and Glass, RI and Woods, P and Gouvea, V and Gorziglia, M and Flores, J and Das, BK and Bhan, MK},
journal={Journal of Clinical Microbiology},
volume={30},
number={6},
pages={1365--1373},
year={1992},
publisher={Am Soc Microbiol}
}
For the above text, regex should match:
match1 - Diaz, Navarro David
match2 - Gines, Rodriguez Noe
match3 - Standefer, Michael
match4 - Bay, Janet W
match5 - Trusso, Russell
...and so on

What you want should be easily achievable by capturing the contents between { and } on every line starting with author= and then splitting that text with the regex \s*(?:,|\band\b)\s*, which gives you the individual name parts. Because that regex also splits on commas, "Diaz, Navarro David" comes back as "Diaz" and "Navarro David"; split on \s+and\s+ alone if you want to keep each author in one piece.
But in case your regex engine is PCRE-based, you can use this regex, whose group 1 gives you the names:
^\s*author={|(?!^)\G((?:(?! and|, )[^}\n])+)(?: *and *)?(?:[^\w\n]*)
This regex uses the \G anchor: the first alternative matches the start of a line beginning with author={, and the part (?!^)\G((?:(?! and|, )[^}\n])+)(?: *and *)?(?:[^\w\n]*) then continues from the end of the previous match, capturing runs of text that contain neither " and" nor ", ".
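The split-based approach described above can be sketched in Python as follows. This is only an illustration, not the answerer's code: the helper name extract_authors and the abbreviated SAMPLE string are made up for the example, and the sketch assumes that splitting on the word "and" alone is what is wanted, so that each "Last, First" pair stays together.

```python
import re

# Hypothetical sample: two of the entries from the question, abbreviated.
SAMPLE = """#misc{diaz2006automatic,
title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006}
}
#article{standefer1984sitting,
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery}
}
"""

# Capture everything between { and } on lines that start with author=
AUTHOR_LINE = re.compile(r"^\s*author\s*=\s*\{([^}\n]*)\}", re.MULTILINE)


def extract_authors(text):
    """Return every author as one string (illustrative helper)."""
    authors = []
    for match in AUTHOR_LINE.finditer(text):
        # Split on the standalone word "and"; commas stay inside each name.
        authors.extend(re.split(r"\s+and\s+", match.group(1).strip()))
    return authors


print(extract_authors(SAMPLE))
```

Doing the work in two steps (find the author line, then split it) avoids the need for \G, which Python's re module does not provide.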

Related

How to use REGEXEXTRACT to extract certain characters between two strings

I am trying to extract a person's name between different characters. For example, the cells contain this information:
PATIENT: 2029985 - COLLINS, JUNIOR .
PATIENT: 1235231-02 - JERRY JR, PATRICK .
PATIENT: 986435--EXP-- - JULIUS, DANIEL .
PATIENT: 2021118-02 - DRED-HARRY, KEVIN .
My goal is to use one REGEXEXTRACT formula to get the following:
COLLINS, JUNIOR
JERRY JR, PATRICK
JULIUS, DANIEL
LOVE ALSTON, BRENDA
So far, I have come up with the formula:
=ARRAYFORMULA(REGEXEXTRACT(B3:B, "-(.*)\."))
Where B3 contains the first information
Using that formula, I get:
COLLINS, JUNIOR
02 - JERRY JR, PATRICK
02 - LOVE-ALSTON, BRENDA
-EXP-- - JULIUS, DANIEL
02 - DRED-HARRY, KEVIN
I managed to get the first name right, but how do I go about extracting the rest?
You can use
=ARRAYFORMULA(REGEXEXTRACT(B3:B, "\s-\s+([^.]*?)\s*\."))
Details:
\s-\s+ - a whitespace, -, one or more whitespaces
([^.]*?) - Group 1: zero or more chars other than a . as few as possible
\s* - zero or more whitespaces
\. - a . char.
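As a quick check outside Sheets, the same pattern can be run against the sample cells in Python. This is a sketch under the assumption that Python's re behaves like the Sheets regex engine for this particular pattern; the names NAME, cells and extract_name are invented for the example.

```python
import re

# Same pattern as in the formula above; group 1 holds the name.
NAME = re.compile(r"\s-\s+([^.]*?)\s*\.")

cells = [
    "PATIENT: 2029985 - COLLINS, JUNIOR .",
    "PATIENT: 1235231-02 - JERRY JR, PATRICK .",
    "PATIENT: 986435--EXP-- - JULIUS, DANIEL .",
    "PATIENT: 2021118-02 - DRED-HARRY, KEVIN .",
]


def extract_name(cell):
    """Return the captured name, or None when the cell does not match."""
    match = NAME.search(cell)
    return match.group(1) if match else None


names = [extract_name(cell) for cell in cells]
print(names)
```

The hyphens inside the IDs (-02, --EXP--) and inside DRED-HARRY are skipped because none of them is both preceded and followed by whitespace, which is what \s-\s+ requires.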
1st solution: With your shown samples, please try the following regex.
^PATIENT:.*-\s+([^.]*?)\s*\.
Or try the following Google Sheets formula:
=ARRAYFORMULA(REGEXEXTRACT(B3:B, "^PATIENT:.*-\s+([^.]*?)\s*\."))
Explanation: The regex checks that the value starts with PATIENT: and then greedily matches everything up to the last - that is followed by one or more whitespace characters. The single capturing group then lazily matches everything other than a dot, and it is followed by zero or more whitespace characters and a literal dot.
2nd solution: Using a lazy match, please try the following regex.
.*?\s-\s([^.]*?)\s*\.
Google-sheet formula will be as follows:
=ARRAYFORMULA(REGEXEXTRACT(B3:B, ".*?\s-\s([^.]*?)\s*\."))

Regex to capture alpha numeric before pipe separated

I've been trying to create a regex that matches spaces and alphanumeric values.
Below I'm sharing the sample strings.
Manchester United 8547|12345678910
|12345678910
Manchester |12345678910
124587933 |12345678910
8457 Manchester United|12345678910
Manchester United|12345678910
I want to capture everything before the pipe (|) separator. At times the part before the pipe may consist only of spaces, with no alphanumeric values, as shown in the 2nd example. The regex should not capture the pipe (|) or the numerical value after it (12345678910).
I've tried the regexes below, but none of them work for me.
^.*$
^[\s\w\d]+$
[a-zA-Z0-9\s]+
[a-zA-Z0-9\s\W]+
^[\sa-z|A-Z|0-9]+$
^[\sa-z|A-Z|0-9]+$
[^\s]*$
([^\"]*)
^[a-zA-Z0-9]$
^([^?]*)$
.+?(?=\w)
\s[a-zA-Z0-9]+
^[\sa-zA-Z0-9]+
I need a full match and not a group match.
For example, for Manchester 8457 the regex would be Manchester \d+, which gives me a full match and not a group match.
You can try this.
input.substring(0,input.indexOf("|"))
If you want to match alphanumeric before the pipe and not get a group match, but a match only, you can use a character class with a positive lookahead (?=\|) (if that is supported) to assert the pipe at the right.
^[A-Za-z0-9 ]+(?=\|)
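To see what "full match, no group" means in practice, here is a small Python sketch of the lookahead pattern. The helper name before_pipe is hypothetical, and the sketch assumes a regex engine with lookahead support (Python's re has it).

```python
import re

# The lookahead (?=\|) asserts the pipe without making it part of the match.
BEFORE_PIPE = re.compile(r"^[A-Za-z0-9 ]+(?=\|)")


def before_pipe(line):
    """Return the whole match (group 0), or None if nothing precedes the pipe."""
    match = BEFORE_PIPE.match(line)
    return match.group(0) if match else None


for sample in ["Manchester United 8547|12345678910", " |12345678910", "|12345678910"]:
    print(repr(before_pipe(sample)))
```

Because the character class uses +, a line that starts directly with the pipe yields no match at all; changing + to * would return an empty match instead.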
Assuming that every line has a pipe, you could split the input string on line breaks (LF or CRLF), and then extract the portion to the left of the pipe:
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

String input = "Manchester United 8547|12345678910\n |12345678910\nManchester |12345678910\n124587933 |12345678910\n8457 Manchester United|12345678910\n Manchester United|12345678910\n";
String[] parts = input.split("\r?\n");  // one element per line
List<String> contents = Arrays.stream(parts)
    .map(x -> x.split("\\|")[0].trim())  // keep the text left of the pipe
    .collect(Collectors.toList());
System.out.println(contents);
This prints:
[Manchester United 8547, , Manchester, 124587933, 8457 Manchester United, Manchester United]
To get the alphanumeric part, use the following:
^\s*\w(.+?)\|
This should answer your question, I guess.
^(.+?)\|
Please try this; it only matches from the beginning of the string.
The \| at the end matches the pipe.

Using REGEX to remove duplicates when entire line is not a duplicate

^(.*)(\r?\n\1)+$
replace with \1
The above is a great way to remove duplicate lines using regex, but it requires the entire line to be a duplicate.
What would I use if I want to detect and remove dups when the line as a whole is not a dup, but just the first X characters are?
Example:
Original File
12345 Dennis Yancey University of Miami
12345 Dennis Yancey University of Milan
12345 Dennis Yancey University of Rome
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
Dups Removed
12345 Dennis Yancey University of Miami
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
How about using a second group to check, e.g., the first 10 characters:
^((.{10}).*)(?:\r?\n\2.*)+
Where {n} specifies the number of characters from the line start that should be checked for duplicates.
the whole line is captured to $1, which is also used as the replacement
the second group is used to check for duplicate line starts
Another idea would be the use of a lookahead and replace with empty string:
^(.{10}).*\r?\n(?=\1)
This one just drops the current line if the captured $1 is ahead at the start of the next line, so the last line of each run of duplicates is the one that remains.
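Both ideas can be tried out with a short Python sketch on the sample file from the question. The function names keep_first and keep_last and the parameter n are invented for this illustration; the patterns are the two shown above with the prefix length made configurable, and the sketch assumes every line has at least n characters.

```python
import re

TEXT = "\n".join([
    "12345 Dennis Yancey University of Miami",
    "12345 Dennis Yancey University of Milan",
    "12345 Dennis Yancey University of Rome",
    "12344 Ryan Gardner University of Spain",
    "12347 Smith John University of Canada",
])


def keep_first(text, n=10):
    """Collapse consecutive lines sharing the first n chars, keeping the first."""
    pattern = r"^((.{%d}).*)(?:\r?\n\2.*)+" % n
    return re.sub(pattern, r"\1", text, flags=re.MULTILINE)


def keep_last(text, n=10):
    """Drop a line when the next line starts with the same n chars."""
    pattern = r"^(.{%d}).*\r?\n(?=\1)" % n
    return re.sub(pattern, "", text, flags=re.MULTILINE)


print(keep_first(TEXT))
print(keep_last(TEXT))
```

On this input keep_first leaves the Miami line and keep_last leaves the Rome line; the other two lines are unchanged in both cases.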
To also remove duplicate lines that are shorter than 10 characters, here is a PCRE idea using conditionals: ^(?:(.{10})|(.{0,9}$)).*+\r?\n(?(1)(?=\1)|(?=\2$)) and replace with empty string.
If your regex flavor supports possessive quantifiers, using .*+ will improve performance.
Be aware that all these patterns (and your current regex) only target consecutive duplicate lines.

How do I get all words that begin with a capital letter following a specific string?

I have some text that could look something like this:
Name is William Bob Francis Ford Coppola-Mr-Cool King-Of-The-Mountain is a fake name.
I would like to run a regular expression against that string and pull out
William Bob Francis Ford Coppola-Mr-Cool King-Of-The-Mountain
as a match.
My current regex looks like this:
/\b((NAME\s\s*)(((\s*\,*\s*)? *)(([A-Z\'\-])([A-Za-z\'\-]+)*\s*){2,})?)\b/ig
and it does most of what I want but it's not perfect. Instead of just getting the name, it is also getting the "is a" following the name like this:
"William Bob Francis Ford Coppola-Mr-Cool King-Of-The-Mountain is a"
What is a regex formula to get only the words starting with a capital letter following the "Name" label and end when the next word starts with a lowercase after a space?
How do you like /Name ((?:[A-Z]\w+[ -]?)+)/?
Regex101: https://regex101.com/r/BFJBpZ/1
You can use:
Name\b[\sa-z]*\K(?:[A-Z][a-z]+[\s-]*)+(?=\s[a-z])
where
\K resets the starting point of the match after having matched Name followed by some words in lower case
(?:[A-Z][a-z]+[\s-]*)+ matches all the words starting with a capital letter
(?=\s[a-z]) adds the constraint that the following word starts with a lower case letter
demo: https://regex101.com/r/WBrdFU/1/
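For engines without \K (Python's built-in re module is one), the same idea can be expressed with a capturing group around the name part. The following is a sketch of that adaptation, not the answerer's regex; TEXT, PATTERN and name are names chosen for the example.

```python
import re

TEXT = ("Name is William Bob Francis Ford Coppola-Mr-Cool "
        "King-Of-The-Mountain is a fake name.")

# Same building blocks as the PCRE pattern, but the name is captured in
# group 1 instead of relying on \K to reset the start of the match.
PATTERN = re.compile(r"Name\b[\sa-z]*((?:[A-Z][a-z]+[\s-]*)+)(?=\s[a-z])")

match = PATTERN.search(TEXT)
name = match.group(1) if match else None
print(name)
```

The trailing space after the last capitalized word is given back by backtracking so that the lookahead (?=\s[a-z]) can succeed, which is why the captured name has no trailing whitespace.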
Notes:
You shouldn't use the i option in your regex. If you do, your char classes [A-Z] will match lower case letters as well as upper case letters, which would prevent you from selecting only the words that start with a capital letter.
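The effect of the i option is easy to demonstrate. The sketch below uses Python's re.IGNORECASE as the equivalent flag; the shortened sample sentence and the variable names are made up for the illustration.

```python
import re

TEXT = "Name is William Bob is a fake name."
CAPITALIZED = r"\b[A-Z][a-z]*\b"  # a word that starts with a capital letter

# Without the flag only capitalized words are found.
case_sensitive = re.findall(CAPITALIZED, TEXT)

# With the flag [A-Z] also matches a-z, so every word qualifies.
case_insensitive = re.findall(CAPITALIZED, TEXT, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```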
To also allow names with an apostrophe:
Name\b[\sa-z]*\K(?:[A-Z][a-z'\s-]*?)+(?=\s[a-z])
demo: https://regex101.com/r/WBrdFU/3/
My guess is that this simple expression might work, if we always have "is" after the desired output:
Name is (.+?) is.+
Test
use strict;
my $str = 'Name is William Bob Francis Ford Coppola-Mr-Cool King-Of-The-Mountain is a fake name.
';
my $regex = qr/Name is (.+?) is.+/mp;
if ( $str =~ /$regex/g ) {
print "Whole match is ${^MATCH} and its start/end positions can be obtained via \$-[0] and \$+[0]\n";
# print "Capture Group 1 is $1 and its start/end positions can be obtained via \$-[1] and \$+[1]\n";
# print "Capture Group 2 is $2 ... and so on\n";
}
# ${^POSTMATCH} and ${^PREMATCH} are also available with the use of '/p'
# Named capture groups can be called via $+{name}
Advice
zdim advises that:
Perhaps, as it may not be "is", just any low-case word (so after a word boundary), something like /\b([A-Z].+?)\b[a-z.!?]/ ... (probably needs tweaking, specially for the possible end of sentence after the name) ?
This worked when I tested it on regex101.com. Please check and let me know if it works for you.
/Name is (([\s]*[A-Z][-a-z]*)*)/
Group 1 contains William Bob Francis Ford Coppola-Mr-Cool King-Of-The-Mountain.
You can test it at the link below:
https://regex101.com/r/M2V2in/2

Regex for more than 1 First Name before the Middle Initial

I'm not that good with regular expressions, and here is my problem:
I want to create a regex that matches a name that has two or more first names (e.g. Francis Gabriel).
I came up with the regex ^[A-Z][a-z]{3,30}/s[A-Z][a-z]{3,30} but
it only matches two first names and not all first names.
The regex should match with John John J. Johnny.
^[A-Z][a-z]{3,30}(\\s[A-Z](\\.|[a-z]{2,30})?)*$
Use \s, not /s, for whitespace; inside a Java string literal passed to Pattern.compile it is written \\s.
After each space, a part is either an initial such as X. or a full name such as Xyz, and both forms have to be validated.
John Johny J.hny -> is wrong
So after the capital letter we allow either a . or [a-z]{2,30}, and at least one first name has to be there. The * at the end of the second part lets it match 0 or more additional parts.
Since Java is not supported in this snippet, here is a JavaScript implementation of the same regex to illustrate:
const reg = /^[A-Z][a-z]{3,30}(\s[A-Z](\.|[a-z]{2,30})?)*$/;
console.log(reg.test("John john")); // false because the second part starts with a lower-case letter
console.log(reg.test("John John")); // true
console.log(reg.test("John John J.")); // true
console.log(reg.test("John John J. Johny")); // true
Use the following regex:
^\w+\s(\w+\s)+\w\.\s\w+$
^\w+\s matches a name and a space
(\w+\s)+ followed by at least one more name and space
\w\.\s followed by a single-letter initial with a dot, then a space
\w+$ followed by a last name
Test code:
String testInput = "John John P. Johnny";
if (testInput.matches("^\\w+\\s(\\w+\\s)+\\w\\.\\s\\w+$")) {
    System.out.println("We have a match");
}
Try this:
^(\S*\s+)(\S*)?\s+\S*?
Francis Gabriel - matches:
0: [0,10] Francis
1: [0,9] Francis
2: [9,9]
John John2 J. Johnny - matches:
0: [0,11] John John2
1: [0,5] John
2: [5,10] John2