I am creating regexes that get the whole sentence if a piece of specific information exists. Right now I am working on my name regex, so if there is any composed name (example: "Jorge Martel", "Jorge Martel del Arnold Albuquerque") the regex should get the whole sentence that has the name.
If I have these two sentences:
(1) - "A hardworking guy is working at the supermarket. They call him Jorge Horizon, but that's not his real name."
(2) - "He has an identity document that contains the name, Jorge Martel Arnold."
The regex should return these two results from the sentences above:
(1) - "They call him Jorge Horizon, but that's not his real name."
(2) - "He has an identity document that contains the name, Jorge Martel Arnold."
This is my regex:
(?:(?(?<=[\.!?]\s([A-Z]))(.+?[^.])|))?((?:(?:[A-Z][A-zÀ-ÿ']+\s(?:(?:(?:[A-zÀ-ÿ']{1,3}\s)?(?:[A-ZÀ-Ÿ][A-zÀ-ÿ']*\s?))+))\b)(.+?[\.!?](?:\s|\n|\Z)))
Basically, it checks whether there is a period, exclamation mark, or question mark followed by a blank space and an uppercase character, and tells the regex that everything after that point must be selected; otherwise it should get the whole sentence.
My else case (|) right now is empty, because using (.+?) avoids my first condition...
Regex without the else case:
Validates until the dot, but doesn't get the second sentence.
Regex with the else case:
Validates the second sentence, but overrides the first condition that appears in the first sentence.
I expect my regex to return correctly the sentences:
"They call him Jorge Horizon, but that's not his real name."
"He has an identity document that contains the name, Jorge Martel Arnold."
I have also created a text to validate the regex operations as I will be using it a lot in texts. I added a lot of conditions in this text, which will probably appear in my daily work.
Check my regex, sentence, and text here:
Does anyone know what I should change in my regex? I have tried many variations and still cannot find the solution.
P.S.: I intend to use it in my python code, but I need to fix it with the regex and not with the python code.
You can try this:
[\w\ \,\']+\.\ ?([\w\ \,\']+\.)|^([\w\ \,\']+\.)$
Print $1$2, i.e. if group 1 is empty it prints nothing for it, since there is no match, and then prints group 2. Vice versa, it prints group 1 when group 2 is not there.
[\w\ \,\']+\.\ ?([\w\ \,\']+\.) - matches anything of the form "XXX. XXX."
then
^([\w\ \,\']+\.)$ - the string must start and end with only 1 sentence.
Though honestly this can easily be done with a tokenizer that splits on (.) and checks for a length of 1 or 2. It's really like using a sledgehammer to hammer a nail.
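If tokenizing in code is acceptable, the idea can be sketched in Python roughly as follows. This is only a sketch under stated assumptions: a sentence is taken to end with ., ! or ? followed by whitespace, and a "name" is taken to be two or more consecutive capitalized words. The helper name sentences_with_names and both patterns are illustrative, not part of the answer above.

```python
import re

# Assumption: a sentence ends with ., ! or ? followed by whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
# Assumption: a name is two or more consecutive capitalized words.
NAME = re.compile(r"\b[A-Z][a-zÀ-ÿ']+(?:\s+[A-Z][a-zÀ-ÿ']+)+")

def sentences_with_names(text):
    """Return the sentences of `text` that contain a multi-word name."""
    return [s for s in SENTENCE_SPLIT.split(text) if NAME.search(s)]

text = (
    "A hardworking guy is working at the supermarket. "
    "They call him Jorge Horizon, but that's not his real name."
)
for sentence in sentences_with_names(text):
    print(sentence)
```

Splitting first keeps the name pattern small, at the cost of doing part of the work outside the regex.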
Matching names can be a very hard job using a regex, but if you want to match at least 2 consecutive capitalized words using the specified ranges, you could use the pattern below.
Assuming the names start with an uppercase char A-Z (else you can extend that character class as well with the allowed chars or if supported use \p{Lu} to match an uppercase char that has a lowercase variant):
(?<!\S)[A-Z][A-Za-zÀ-ÿ]*(?:\s+[a-zÀ-ÿ,]+)*\s+[A-Z][a-zÀ-ÿ]*\s+[A-Z][a-zÀ-ÿ,]*.*?[.!?](?!\S)
(?<!\S) Assert a whitespace boundary to the left
[A-Z][A-Za-zÀ-ÿ]* Match an uppercase char A-Z optionally followed by matching the defined ranges
(?:\s+[a-zÀ-ÿ,]+)* Optionally repeat matching 1+ whitespace chars and 1 or more of the ranges
\s+[A-Z][a-zÀ-ÿ]*\s+[A-Z][a-zÀ-ÿ,]* Match 2 times whitespace chars followed by an uppercase A-Z and optional chars defined in the character class
.*?[.!?] Match as few chars as possible, followed by one of . ! or ?
(?!\S) Assert a whitespace boundary to the right
Regex demo
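Since the question mentions Python, here is a small sketch that runs the pattern above against the two sample sentences from the question. The variable names are illustrative; the pattern is copied unchanged and only split across two string literals for readability.

```python
import re

PATTERN = re.compile(
    r"(?<!\S)[A-Z][A-Za-zÀ-ÿ]*(?:\s+[a-zÀ-ÿ,]+)*"
    r"\s+[A-Z][a-zÀ-ÿ]*\s+[A-Z][a-zÀ-ÿ,]*.*?[.!?](?!\S)"
)

text = (
    "A hardworking guy is working at the supermarket. "
    "They call him Jorge Horizon, but that's not his real name. "
    "He has an identity document that contains the name, Jorge Martel Arnold."
)

# The pattern has no capture groups, so findall returns the full matches.
matches = PATTERN.findall(text)
for sentence in matches:
    print(sentence)
```

On this input the first sentence is skipped and the two sentences containing a name are returned.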
Try this:
((?:^|(?:[^\.!?]*))[^\.!?\n]*(?:(?:[A-ZÀ-Ÿ][A-zÀ-ÿ']+\s?){2,}[^\.!?]*[\.!?]))
It will capture sentences where the name has at least two words, e.g. His name is John Smith.
It won't capture sentences like: John went to a concert.
I'm trying to extract all words with Uppercase initial letter from a text, with the REGEXEXTRACT formula in google sheets.
Ideally the first word of sentences should be ignored and only all subsequent words with first Uppercase letter should be extracted.
Other Close Questions and Formulas:
I've found those other two questions and answers:
How to extract multiple names with capital letters in Google Sheets?
=ARRAYFORMULA(TRIM(IFERROR(REGEXREPLACE(IFERROR(REGEXEXTRACT(IFERROR(SPLIT(A2:A, CHAR(10))), "(.*) .*#")), "Mr. |Mrs. ", ""))))
Extract only ALLCAPS words with regex
=REGEXEXTRACT(A2, REPT(".* ([A-Z]{2,})", COUNTA(SPLIT(REGEXREPLACE(A2,"([A-Z]{2,})","$"),"$"))-1))
They are close but I can't apply them successfully to my project.
The Regex Pattern I Use:
I also found this regex [A-ZÖ][a-zö]+ pattern that works well to get all the Uppercase first letter words.
The problem is that it's not ignoring the first words of sentences.
Other Python Solution Vs Google Sheets Formula:
I've also found this python tutorial and script to do it:
Proper Noun Extraction in Python using NLP in Python
# Importing the required libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Function to extract the proper nouns
def ProperNounExtractor(text):
    print('PROPER NOUNS EXTRACTED :')
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        words = [word for word in words if word not in set(stopwords.words('english'))]
        tagged = nltk.pos_tag(words)
        for (word, tag) in tagged:
            if tag == 'NNP': # If the word is a proper noun
                print(word)
text = """Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, `Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge."""
# Calling the ProperNounExtractor function to extract all the proper nouns from the given text.
ProperNounExtractor(text)
It works well, but the idea of doing it in Google Sheets is to have the capitalized words adjacent to the text in a table format for more convenient reference.
Question Summary:
How would you adjust my formula in the sample sheet below
=ARRAYFORMULA(IF(A1:A="","",REGEXEXTRACT(A1:A,"[A-ZÖ][a-zö]+")))
to add those functions:
Extract all the first Uppercase letter Words from each cell with text
Ignore the first words of sentences
Return all the capitalized words, except the first words of sentences, into the adjacent cells, one word per cell (similar to the example from the 2nd close question above)
Sample Sheet:
Here's my testing Sample Sheet
Many thanks for your help!
You can use
=ARRAYFORMULA(SPLIT(REGEXREPLACE(REGEXREPLACE(A111:A, "(?:[?!]|\.(?:\.\.+)?)\s+", CHAR(10)), "(?m)^\s*[[:upper:]][[:alpha:]]*|.*?([[:upper:]][[:alpha:]]*|$)", "$1" & char(10)), CHAR(10)))
Or, to make sure the ?, ! or . / ... that are matched as sentence boundaries are followed with an uppercase letter:
=ARRAYFORMULA(SPLIT(REGEXREPLACE(REGEXREPLACE(A111:A, "(?:[?!]|\.(?:\.\.+)?)\s+([[:upper:]])", CHAR(10) & "$1"), "(?m)^\s*[[:upper:]][[:alpha:]]*|.*?([[:upper:]][[:alpha:]]*|$)", "$1" & char(10)), CHAR(10)))
See the regex demo.
First, we split the text into sentences in a cell with REGEXREPLACE(A111:A, "(?:[?!]|\.(?:\.\.+)?)\s+", CHAR(10)). Actually, this just replaces final sentence punctuation with a newline.
The second REGEXREPLACE is used with another regex that matches
(?m)^\s*[[:upper:]][[:alpha:]]* - a capitalized word ([[:upper:]][[:alpha:]]*) at the start of string/line (^) together with optional whitespace (\s*)
| - or
.*? - any zero or more chars other than line break chars, as few as possible
([[:upper:]][[:alpha:]]*|$) - Group 1 ($1): an uppercase letter ([[:upper:]]) and then any zero or more letters ([[:alpha:]]*), or end of string ($)
and replaces the match with Group 1 value and a newline, LF char. Then, the result is SPLIT with a newline char.
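For readers who want to check the logic outside Sheets, the same two-step idea (break the text at sentence-final punctuation, drop each sentence's first capitalized word, then collect the remaining capitalized words) can be sketched in Python. This is an approximation, not a port of the formula: Python's re module has no POSIX classes, so [[:upper:]] and [[:alpha:]] are replaced with the ASCII ranges [A-Z] and [A-Za-z], and the function name is made up for illustration.

```python
import re

# Same sentence-boundary pattern as in the first REGEXREPLACE.
SENTENCE_END = re.compile(r"(?:[?!]|\.(?:\.\.+)?)\s+")
# Assumption: ASCII stand-ins for [[:upper:]] and [[:alpha:]].
FIRST_WORD = re.compile(r"^\s*[A-Z][A-Za-z]*")
CAP_WORD = re.compile(r"[A-Z][A-Za-z]*")

def capitalized_words(text):
    """Capitalized words in `text`, skipping each sentence's first word."""
    found = []
    for sentence in SENTENCE_END.split(text):
        rest = FIRST_WORD.sub("", sentence, count=1)
        found.extend(CAP_WORD.findall(rest))
    return found

sample = "Alice saw the White Rabbit. Then Alice ran after it! The Rabbit left."
print(capitalized_words(sample))
```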
My two cents:
Formula in B1:
=INDEX(IF(A1:A<>"",SPLIT(REGEXREPLACE(A1:A,"(?:(?:^|[.?!]+)\s*\S+|\b([A-ZÖ][a-zö]+(?:-[A-ZÖ][a-zö]+)*)\b|.+?)","$1|"),"|",1),""))
The pattern: (?:(?:^|[.?!]+)\s*\S+|\b([A-ZÖ][a-zö]+(?:-[A-ZÖ][a-zö]+)*)\b|.+?) means:
(?: - Open non-capture group to allow for alternations:
(?:^|[.?!]+)\s*\S+ - A nested non-capture group to allow for the start-line anchor or 1+ literal dots or question/exclamation marks, followed by 0+ whitespace chars and 1+ non-whitespace chars;
| - Or;
\b([A-ZÖ][a-zö]+(?:-[A-ZÖ][a-zö]+)*)\b - A 1st capture-group to catch capitalized words (optionally hyphenated, e.g. Rabbit-Hole) between word-boundaries;
| - Or;
.+? - Any 1+ characters (Lazy);
) - Close non-capture group.
The idea here is to use REGEXREPLACE() to substitute any match with the backreference to the 1st capture group and a pipe-symbol (or any other symbol that won't be in your input) and use SPLIT() to get all words separated. Note that it is important to use the 3rd parameter of the function to ignore empty strings.
INDEX() will trigger the array-functionality and spill the results. I used a nested IF() statement to check for empty cells to skip.
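The replace-then-split trick can be tried out in Python with the same pattern. This is a sketch only: the function name is hypothetical, and a lambda stands in for the "$1|" replacement because Python's re.sub needs explicit handling when group 1 did not participate in the match.

```python
import re

PATTERN = re.compile(
    r"(?:(?:^|[.?!]+)\s*\S+|\b([A-ZÖ][a-zö]+(?:-[A-ZÖ][a-zö]+)*)\b|.+?)"
)

def extract_capitalized(text):
    """Replace every match with group 1 plus '|', then split on '|'."""
    marked = PATTERN.sub(lambda m: (m.group(1) or "") + "|", text)
    # Dropping empty strings mirrors SPLIT() ignoring empty results.
    return [word for word in marked.split("|") if word]

print(extract_capitalized("Alice met the White Rabbit. Then Bob left."))
print(extract_capitalized("She saw Jean-Luc there."))
```

As in the formula, the pipe is assumed not to occur in the input text.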
Since you have already found a Python solution for your use case, you may try using it directly as a custom function in your Google Sheet: host the Python code as an API, call it from Google Apps Script, and return the extracted words to the sheet.
For reference, you can check this repository and this YouTube video.
I'm looking for help in making a regex to match and not match a series of name patterns if anyone can help with that.
Here's a list of cases I want to match/ not match :
// Should Match :
_class
c-class
_class-like
_class--variation
_class__children
_class__children--variation
c-custon-button-test
_class__lol--test
c-my-button-super-style
_class--variation-like
// Should not Match :
class
c--class
_class---variation
_class----variation
_class__test__test
_class--variation__children
_like
c-like
noMargin
no-Margin
_no-Margin
no-margin
_class-like__children
_class-like--variation
For now I came up with this regex :
^(c-|_)([a-z]+)(__|--|-)?([a-z]+)(-{0,2}[a-z]+)+(-?(([a-z]-?)+|(like))$)
It almost works, but I still get a match on some cases that shouldn't match, and I'm struggling to work out how to handle those last cases.
(Here's a link to regex101 with unit test and match case: https://regex101.com/r/HNAUpd/1/)
Edit: I forgot to mention that the word "like" is a keyword in my pattern; it can only be found at the end of the string and cannot be the sole word in the string.
Edit 2: The matching rules are as follows:
A string can start only with "_", "c-" or "js-".
The following word can be anything except the word "like", and must contain only lowercase letters in the range [a-z].
The word "like" can only be the last one of the string and must not be the only one in the string.
Words can be separated by "--" or "__".
If the string starts with "c-", words can also be separated by "-" in addition to the previous separators.
The purpose of all this is for a CSS class/id matcher for a linter.
If anyone can help me with this it would be awesome :)
I think you're looking for something like this:
^(?!.*[\-_]like[\-_])(?:c-|js-|_)(?!like$)(?:[a-z]+(?:__|--?))?[a-z]+(?:--?[a-z]+)*$
Demo
Breakdown:
^ - Beginning of the string.
(?!.*[\-_]like[\-_]) - Doesn't contain the word "like" between two separators (only at the end of the string).
(?:c-|js-|_) - Either "c-", "js-", or "_" at the beginning of the string.
(?!like$) - Not immediately followed by the word "like".
(?:[a-z]+(?:__|--?))? - (optional) one or more a-z letters followed by two underscores or one or two hyphens.
[a-z]+ - One or more a-z letters.
(?:--?[a-z]+)* - Match one or two hyphens followed by one or more a-z letters, and repeat zero or more times.
$ - End of string.
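A quick way to check the pattern against the lists from the question is a small Python harness like the one below. The list and variable names are illustrative; the pattern is copied unchanged and only split across two string literals.

```python
import re

PATTERN = re.compile(
    r"^(?!.*[\-_]like[\-_])(?:c-|js-|_)(?!like$)"
    r"(?:[a-z]+(?:__|--?))?[a-z]+(?:--?[a-z]+)*$"
)

SHOULD_MATCH = [
    "_class", "c-class", "_class-like", "_class--variation",
    "_class__children", "_class__children--variation",
    "c-custon-button-test", "_class__lol--test",
    "c-my-button-super-style", "_class--variation-like",
]
SHOULD_NOT_MATCH = [
    "class", "c--class", "_class---variation", "_class----variation",
    "_class__test__test", "_class--variation__children", "_like", "c-like",
    "noMargin", "no-Margin", "_no-Margin", "no-margin",
    "_class-like__children", "_class-like--variation",
]

# Collect every case whose result disagrees with the expectation.
unexpected = [s for s in SHOULD_MATCH if not PATTERN.match(s)]
unexpected += [s for s in SHOULD_NOT_MATCH if PATTERN.match(s)]
print(unexpected or "all cases behave as listed")
```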
I have a regex that seemingly is straightforward but does not act as required. The input to be parsed is described as follows (nb: {} are not part of the regex, only what's inside):
A sequence of 0 or more spaces {\s*}
A dash {-}
A sequence of 0 or more spaces {\s*}
A person's full name (first name, middle names, surname; all captured into f1). The name must not start with a number and must appear at the end of the line {([A-Za-z][\w\s]*)}
The whole construct SPACE-SPACEf1 is optional
Just to explain what is captured into f1:
For the first char, I'm using the set of chars represented by [A-Za-z]. Followed by \w or space 0 or more times. This is captured into f1.
(?:\s*-\s*(?P<f1>[A-Za-z][\w\s]*))?$
I expect the following sequences to match and capture a value into f1:
" - Bruce" (f1=Bruce)
" - Bruce Dickinson" (f1=Bruce Dickinson)
I expect the following to not match:
"Bruce" (there is no leading dash)
" - Bruce!" (there is a non-word (\W) character after the name and before the end of the line)
I expect the following match but not capture a value into f1 (I would prefer it to not match though):
" - 1Bruce" (leading character is numeric)
These are the actual results:
" - Bruce" (f1=Bruce) Tick; this works
" - Bruce Dickinson" (f1=Bruce Dickinson) Tick; this works
"Bruce" (f1= not captured, but expression is a match. This is wrong, because Bruce doesn't match the optional part, and $ comes next which doesn't match Bruce)
" - Bruce!" (f1= not captured, but expression is a match; this is wrong, because of the !, which means that the match does not appear at the end of the line.)
I expect that:
(?:\s*-\s*(?P<f1>[A-Za-z][\w\s]*))?
would consume { - Bruce}, which should leave !, which should fail because of the next regex token being $; however, the computer says no, so I'm wrong but I don't know why :(
" - 1Bruce" (f1= not captured, but expression is match. This is understandable because the whole {space dash space f1} sequence is optional and because it doesn't match, that construct is skipped and then there is nothing else to process on the input; we hit the end of line)
If I can get this to work, I can get the rest of my expression to work the way I want it to. I need somebody else to jolt me into thinking about this differently. I've spent 2 days on this with no positive output, so very frustrating.
PS: I am using regex101.com to test regexes. The regexes will be used as part of a Rust application whose regex engine is based on Google's RE2.
Eventually, I need to be able to recognise a sequence of names delimited by &, and the whole expression is optional by the use of ? and must appear at the end of line $.
So
{ - Bruce & Nicko & Dave Murray } would be valid
and
{ - Bruce & Nicko & Dave Murray & } should not be valid and NOT match
But 1 step at a time!
The point here is that you cannot match and not match something at the same time. If you make the whole pattern optional, and the end of string obligatory, even if there is nothing of interest the end of string will be matched - always.
The way out is to think of a subpattern you are interested in. You are interested in the names, so, make the first letter obligatory. The hyphen seems to be obligatory in all test cases you supplied, too. Everything else can be optional:
\s*-\s*(?P<f1>([^\W\d_])\w*(?:\s+\w+)*)(?:\s*&\s*(?P<f2>([^\W\d_])\w*(?:\s+\w+)*))*$
See the regex demo (the \s is replaced with \h and \n added to the negated character classes just for demo purposes as it is a multiline demo).
Note that I replaced [a-zA-Z] with [^\W\d_] to make the pattern more flexible ([^\W\d_] just matches any letter).
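The difference between the two patterns can be seen in a short Python sketch. Python is used here only for illustration (the question targets Rust), and the helper probe is hypothetical. It returns whether the pattern matched at all and what f1 captured.

```python
import re

# Pattern from the question: the whole construct is optional.
OPTIONAL = re.compile(r"(?:\s*-\s*(?P<f1>[A-Za-z][\w\s]*))?$")
# Pattern from this answer: the dash and the first letter are required.
REQUIRED = re.compile(
    r"\s*-\s*(?P<f1>([^\W\d_])\w*(?:\s+\w+)*)"
    r"(?:\s*&\s*(?P<f2>([^\W\d_])\w*(?:\s+\w+)*))*$"
)

def probe(pattern, line):
    """Return (matched, f1) for an unanchored search of `line`."""
    m = pattern.search(line)
    return (m is not None, m.group("f1") if m else None)

for line in ["Bruce", " - Bruce!", " - Bruce", " - Bruce & Nicko & Dave Murray"]:
    print(repr(line), probe(OPTIONAL, line), probe(REQUIRED, line))
```

With the optional pattern, "Bruce" and " - Bruce!" still match because the engine skips the optional group and matches the empty string at the end of the line. With the required pattern they do not match at all.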
I want to use a regex to validate names. A name may contain a first name, middle name, and last name (not necessarily all of them). I also want to require that the name be at least four characters long. I have found a regex to validate a full name here: Java Regex to Validate Full Name ... and a regex to check for at least three characters (letters) in a string here: Regex to check for at least 3 characters. But I am not sure how to combine these two to obtain the desired result. Please help me build the desired regex so that I can complete my project.
You can use
^[a-zA-Z]{4,}(?: [a-zA-Z]+){0,2}$
See the regex demo
This will work with names starting with both lower- and upper-cased letters.
^ - start of string
[a-zA-Z]{4,} - 4 or more ASCII letters
(?: [a-zA-Z]+){0,2} - 0 to 2 occurrences of a space followed with one or more ASCII letters
$ - end of string.
If you need to restrict the words to start with Uppercase letters, you can use
^[A-Z][a-zA-Z]{3,}(?: [A-Z][a-zA-Z]*){0,2}$
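Both patterns can be exercised with a few sample names. The sketch below uses Python purely for illustration (the question is about Java, where the same patterns should behave the same way for these ASCII inputs); the sample names are made up.

```python
import re

ANY_CASE = re.compile(r"^[a-zA-Z]{4,}(?: [a-zA-Z]+){0,2}$")
CAPITALIZED = re.compile(r"^[A-Z][a-zA-Z]{3,}(?: [A-Z][a-zA-Z]*){0,2}$")

samples = ["John", "Jon", "John Smith", "john a smith", "John A B Smith", "Jo Smith"]

# Map each name to (matches ANY_CASE, matches CAPITALIZED).
results = {
    name: (bool(ANY_CASE.match(name)), bool(CAPITALIZED.match(name)))
    for name in samples
}
for name, flags in results.items():
    print(name, flags)
```

Note that the four-letter minimum applies to the first word only, which is why "Jo Smith" is rejected by both patterns.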
Here is my solution:
^[a-zA-Z]{3,}( {1,2}[a-zA-Z]{3,}){0,}$
^ --> start of string.
[a-zA-Z]{3,} --> 3 or more characters.
( {1,2}[a-zA-Z]{3,}){0,} --> 0 or more words with 3 or more characters.
$ --> end of string.
It might be a bit overkill but:
([A-Z][a-z]{3,} )([A-Z][a-z]{3,} )?([A-Z][a-z]{3,})
should do the trick.
It matches words that start with a capital letter followed by 3 or more lowercase letters, so each word has a length of at least four. The middle name is optional and the last name doesn't contain a trailing whitespace.
Edit:
If you want to support "fancy" characters (äöü etc.) you can read this question for details.
Using the pattern from Java 7 with the UNICODE_CHARACTER_CLASS flag the regex should look like this:
(\\p{Upper}\\p{Lower}{3,} )(\\p{Upper}\\p{Lower}{3,} )?(\\p{Upper}\\p{Lower}{3,})
I am trying to match only the street name from a series of addresses. The addresses might look like:
23 Barrel Rd.
14 Old Mill Dr.
65-345 Howard's Bluff
I want to use a regex to match "Barrel", "Old Mill", and "Howard's". I need to figure out how to exclude the last word. So far I have a lookbehind to exclude the digits, and I can include the words and spaces and "'" by using this:
(?<=\d\s)(\w|\s|\')+
How can I exclude the final word (which may or may not end in a period)? I figure I should be using a lookahead, but I can't figure out how to formulate it.
You don't need a look-behind for this:
/^[-\d]+ ([\w ']+) \w+\.?$/
Match one or more digits and hyphens
space
match letters, digits, spaces, apostrophes into capture group 1
space
match a final word and an optional period
An example Ruby implementation:
regex = /^[-\d]+ ([\w ']+) \w+\.?$/
tests = [ "23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff" ]
tests.each do |test|
p test.match(regex)[1]
end
Output:
"Barrel"
"Old Mill"
"Howard's"
I believe the lookahead you want is (?=\s\w+\.?$).
\s: you don't want to include the last space
\w: at least one word-character (A-Z, a-z, 0-9, or '_')
\.?: optional period (for abbreviations such as "St.")
$: make sure this is the last word
If there's a possibility that there might be additional whitespace before the newline, just change this to (?=\s\w+\.?\s*$).
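Combining this lookahead with the lookbehind from the question can be sketched in Python as follows. The alternation (\w|\s|\')+ is written as the equivalent character class [\w\s']+, and the variable names are illustrative.

```python
import re

# Lookbehind from the question plus the lookahead suggested above.
STREET = re.compile(r"(?<=\d\s)[\w\s']+(?=\s\w+\.?$)")

addresses = ["23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff"]
streets = [STREET.search(address).group(0) for address in addresses]
print(streets)
```

The greedy class first grabs the whole remainder and then backtracks until only one final word (and optional period) is left for the lookahead.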
Why not just match what you want? If I have understood well you need to get all the words after the numbers excluding the last word. Words are separated by space so just get everything between numbers and the last space.
Example
\d+(?:-\d+)? ((?:.)+) Note: there's a space at the end.
That will leave what you want in group \1.
If you want the match itself to be exactly that text, you may use \K (not supported by every regex engine):
With the regex \d+(?:-\d+)? \K.+(?= )
Another option is to use the split() function provided in most scripting languages. Here's the Python version of what you want:
stname = " ".join(address.split()[1:-1])
(Here address is the original address line, and stname is the name of the street, i.e., what you're trying to extract.)