Regexmatch to find all string cells that match multiple words - regex

I'm using ArrayFormula and FILTER combination to list all cells in a column that contain all of the search term words. I'm using REGEXMATCH rather than QUERY/CONTAINS/LIKE because my FILTER has other criteria that return TRUE/FALSE.
My problem seems to be precedence. So the following regex works in a limited way.
=ArrayFormula(filter(A1:A5,regexmatch(A1:A5,"(?i)^"&"(.*?\bbob\b)(.*?\bcat\b)"&".*$")))
It will find Bob and cat but only if Bob precedes cat.
Google Sheets fails if I try to use a lookahead (?=...), i.e.
=ArrayFormula(filter(A1:A5,regexmatch(A1:A5,"(?i)^"&"(?=.*?\bbob\b)(?=.*?\bcat\b)"&".*$")))
I don't want to use the '|' alternation in the string (repeat and reverse) as the input words may be many more than two so alternation becomes exponentially more complex.
Here's the test search array (each row is a single cell containing a string)...
Bob ate the dead cat
The cat ate live bob
No cat ate live dog
Bob is dead
Bob and the cat are alive
... and the desired results I'm after.
Bob ate the dead cat
The cat ate live bob
Bob and the cat are alive
Once I have the regex sorted out, the final solution will be a user input text box where they simply enter the words that must be found in a string ie 'Bob cat'. This input string I think I can unpack into its separate words and concatenate to the above expression, however, if there's a 'best practice' way of doing this I'd like to hear.

Find 2 strings
Try:
=FILTER(A:A,REGEXMATCH(A:A,"(?i)bob.*cat|cat.*bob"))
You don't need ArrayFormula because FILTER is itself an array formula.
(?i) - to make search case insensitive
bob.*cat|cat.*bob - match "bob→cat" or "cat→bob"
Find multiple strings
There's a more complex formula for matching more than 2 words.
Suppose we have a list in column A:
Bob ate the dead cat
The cat ate live bob
No cat ate live dog
Bob is dead
Bob and the cat are alive
Cat is Bob
ate Cat bob
And need to find all matches of 3 words, put them in column C:
cat
ate
bob
The formula is:
=FILTER(A:A,MMULT(--REGEXMATCH(A:A,
"(?i)"&TRANSPOSE(C1:C3)),ROW(INDIRECT("a1:a"&COUNTA(C1:C3)))^0)=COUNTA(C1:C3))
REGEXMATCH is applied against the transposed word list C1:C3, which gives one TRUE/FALSE column per word (converted to 1/0 by the double minus). MMULT then sums the matches in each row, and =COUNTA(C1:C3) compares that sum with the number of words in the list.
The result is:
Bob ate the dead cat
The cat ate live bob
ate Cat bob
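For readers who want to check the counting logic outside Sheets, here is a minimal Python sketch of the same idea: count how many words match each row and keep the rows where the count equals the number of words. The function name rows_with_all_words is made up for illustration, and unlike the Sheets formula it escapes each word so it is treated as literal text rather than as a regex.

```python
import re

def rows_with_all_words(rows, words):
    """Return the rows that contain every word, ignoring case."""
    # Assumption: words are literal text, so they are escaped here.
    patterns = [re.compile(re.escape(w), re.IGNORECASE) for w in words]
    # Count the matching words per row; keep rows where all of them match.
    return [row for row in rows
            if sum(bool(p.search(row)) for p in patterns) == len(patterns)]

rows = [
    "Bob ate the dead cat",
    "The cat ate live bob",
    "No cat ate live dog",
    "Bob is dead",
    "Bob and the cat are alive",
    "Cat is Bob",
    "ate Cat bob",
]
print(rows_with_all_words(rows, ["cat", "ate", "bob"]))
```

Because the match is a plain substring search, "ate" would also match inside a longer word such as "later"; wrap each word in \b if whole-word matching is needed.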

See if this does what you want. In B1 enter:
=arrayformula(filter(A1:A5,regexmatch(lower(A1:A5),lower(index(split(C2," "),0,1)))*regexmatch(lower(A1:A5),lower(index(split(C2," "),0,2)))))
In C2 enter your search words with a space between them (cat Bob).
All words are changed to lower case. The index split separates the words in C2 and the separate words go in the regexmatch. Below is my shared test spreadsheet:
https://docs.google.com/spreadsheets/d/1sDNnSeqHbi0vLosxhyr8t8KXa3MzWC_WJ26eSVNnG80/edit?usp=sharing
Expanding on Max's very good answer, this version lets the formula adapt to however many words are listed in column C. I added an example to the shared spreadsheet (Sheet2).
=FILTER(A:A,MMULT(--REGEXMATCH(A:A,"(?i)"&TRANSPOSE(INDIRECT( "C1:C" & counta(C1:C ) ))),ROW(INDIRECT("a1:a"&COUNTA(INDIRECT( "C1:C" & counta(C1:C ) ))))^0)=COUNTA(INDIRECT( "C1:C" & counta(C1:C ) )))

Maybe a bit easier to understand (I hate MMULT)
=query({A1:A},"select Col1 where "&join(" and ",arrayformula("Col1 matches '.*"&filter(B:B,B:B<>"")&".*'")))
Where A contains your list of phrases and B contains your criteria words.
This part of the formula, =join(" and ",arrayformula("Col1 matches '.*"&filter(B:B,B:B<>"")&".*'")), builds a query string from the terms in B. For example:
Col1 matches '.*cats.*' and Col1 matches '.*dogs.*'
And then this list gets concatenated into the whole "select" expression:
select Col1 where Col1 matches '.*cats.*' and Col1 matches '.*dogs.*'
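To make the string-building step concrete, here is a small Python sketch that assembles the same kind of query text from a list of words. It assumes each word is meant to be wrapped in .* on both sides, since "matches" has to cover the whole cell; build_query is a hypothetical helper name.

```python
def build_query(words):
    """Join one "matches" clause per word into a single select statement."""
    # Assumption: each word is wrapped in .* so it can appear anywhere in the cell.
    clauses = ["Col1 matches '.*{}.*'".format(w) for w in words]
    return "select Col1 where " + " and ".join(clauses)

print(build_query(["cats", "dogs"]))
```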

Related

Convert MS Outlook formatted email addresses to names of attendees using RegEx

I'm trying to use Notepad++'s regex find and replace to extract names from MS Outlook formatted meeting attendee details.
I copied and pasted the attendee details and got names like this:
Fred Jones <Fred.Jones#example.org.au>; Bob Smith <Bob.Smith#example.org.au>; Jill Hartmann <Jill.Hartmann#example.org.au>;
I'm trying to wind up with
Fred Jones; Bob Smith; Jill Hartmann;
I've tried a number of permutations of
\B<.*>; \B
on Regex 101.
Regex is greedy: <.*> matches from the first < to the last > in one fell swoop. You want to say "any character which is neither of these" instead of just "any character".
 *<[^<>]*>
The pattern begins with a single space followed by an asterisk, which consumes any spaces before the match. Replace these matches with nothing and you will be left with just the names, like in your example.
This is a very common FAQ.
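As a quick check of the negated character class, here is a minimal Python sketch that applies the same pattern with re.sub to the sample line from the question. The variable names are just for illustration.

```python
import re

attendees = ("Fred Jones <Fred.Jones#example.org.au>; "
             "Bob Smith <Bob.Smith#example.org.au>; "
             "Jill Hartmann <Jill.Hartmann#example.org.au>;")

# " *" eats the spaces before each bracketed address; [^<>]* cannot run past ">".
names = re.sub(r" *<[^<>]*>", "", attendees)
print(names)  # Fred Jones; Bob Smith; Jill Hartmann;
```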

How to refer to the match order in the replace string?

I'm looking for a way to insert each match's position among all the matches into its replacement string. For example, for the 1st matched string the value should be 1, for the 2nd it's 2, and for the nth it should be n.
Example for what I'm trying to get
Let's say that I have this original content ...
<"BOY"(GUN)><"GIRL"(BAG)><"SISTERS"(CANDY)><"JOHN"(HAT)>
... and I want it to be manipulated to be like this ...
1
BOY
GUN
2
GIRL
BAG
3
SISTERS
CANDY
4
JOHN
HAT
I already know that I need <"(.*?)"\((.*?)\)> to match each element. For the replacement I think I need something like #MATCH ORDER REFERENCE#\n$1\n$2\n.
Note
I'm using Perl on Windows.
Use the /e modifier to evaluate the replacement. See Regexp Quote-Like Operators.
Then you can increase a counter on each replacement.
Code
my $text = '<"BOY"(GUN)><"GIRL"(BAG)><"SISTERS"(CANDY)><"JOHN"(HAT)>';
my $counter = 1;
$text =~ s/<"([^"]+)"\(([^()]+)\)>/$counter++."\n$1\n$2\n\n"/ge;
print $text;
Output
1
BOY
GUN
2
GIRL
BAG
3
SISTERS
CANDY
4
JOHN
HAT
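The same counter-in-the-replacement idea carries over to other languages. As a rough Python sketch (not part of the Perl answer), re.sub accepts a function as the replacement, and that function can pull the next number from a counter. The helper name numbered is made up, and it emits a single newline after each field to follow the layout requested in the question.

```python
import re
from itertools import count

text = '<"BOY"(GUN)><"GIRL"(BAG)><"SISTERS"(CANDY)><"JOHN"(HAT)>'
counter = count(1)  # yields 1, 2, 3, ... one value per match

def numbered(match):
    # Called once per match, left to right, so the counter follows match order.
    return "{}\n{}\n{}\n".format(next(counter), match.group(1), match.group(2))

result = re.sub(r'<"([^"]+)"\(([^()]+)\)>', numbered, text)
print(result, end="")
```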

Regex for words that don't differ by only one letter

I want to create series of puzzle games where you change one letter in a word to create a new word, with the aim of reaching a given target word. For example, to change "this" to "that":
this
thin
than
that
What I want to do is create a regex which will scan a list of words and choose all those that do not match the current word by all but one letter. For example, if my starting word is "pale" and my list of words is...
pale
male
sale
tale
pile
pole
pace
page
pane
pave
palm
peal
leap
play
help
pack
... I want all the words from "peal" to "pack" to be selected. This means that I can delete them from my list, leaving only the words that could be the next match. (It's OK for "pale" itself to be unselected.)
I can do this in parts:
^.(?!ale).{3}\n selects words not like "*ale"
^.(?<!p).{3}\n|^.{2}(?!le).{2}\n selects words not like "p*le"
^.{2}(?<!pa).{2}\n|^.{3}(?!e).\n selects words not like "pa*e"
^.{3}(?<!pal).\n selects words not like "pal*".
However, when I put them together...
^.(?!ale).{3}\n|^.(?<!p).{3}\n|^.{2}(?!le).{2}\n|^.{2}(?<!pa).{2}\n|^.{3}(?!e).\n|^.{3}(?<!pal).\n
... everything but "pale" is matched.
I need some way to create an AND relationship between the different regexes, or (more likely) a completely different approach.
You can use the Python regex module that allows fuzzy matching:
>>> import regex
>>> regex.findall(r'(?:pale){s<=1}', "male sale tale pile pole pace page pane pave palm peal leap play help pack")
['male', 'sale', 'tale', 'pile', 'pole', 'pace', 'page', 'pane', 'pave', 'palm']
In this case, 0 or 1 substitutions counts as a match.
Or consider the TRE library and the command line agrep which supports a similar syntax.
Given:
$ echo $s
male sale tale pile pole pace page pane pave palm peal leap play help pack
You can filter to a list of a single substitution:
$ echo $s | tr ' ' '\n' | agrep '(?:pale){ 1s <2 }'
male
sale
tale
pile
pole
pace
page
pane
pave
palm
Here's a solution that uses cool python tricks and no regex:
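def almost_matches(word1, word2):
    # True when the words have equal length and differ in exactly one position.
    same = sum(map(str.__eq__, word1, word2))
    return len(word1) == len(word2) and same == len(word1) - 1

for word in "male sale tale pile pole pace page pane pave palm peal leap play help pack".split():
    print(almost_matches("pale", word))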
A completely different approach: Levenshtein distance
...the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
PHP example:
$words = array(
"pale",
"male",
"sale",
"tale",
"pile",
"pole",
"pace",
"page",
"pane",
"pave",
"palm",
"peal",
"leap",
"play",
"help",
"pack"
);
foreach ($words as $word)
    if (levenshtein("pale", $word) > 1)
        echo $word."\n";
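PHP ships levenshtein() as a built-in; where no such function is available, the distance can be computed with the usual dynamic-programming table. The following Python sketch is an illustration under that assumption, not the PHP implementation, and it keeps only the previous row of the table.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

words = ("pale male sale tale pile pole pace page pane pave "
         "palm peal leap play help pack").split()
too_far = [w for w in words if levenshtein("pale", w) > 1]
print(too_far)  # ['peal', 'leap', 'play', 'help', 'pack']
```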
This assumes the word on the first line is the keyword. Just a brute force parallel letter-match and count gets the job done:
awk 'BEGIN{FS=""}
NR==1{n=NF;for(i=1;i<=n;++i)c[i]=$i}
NR>1{j=0;for(i=1;i<=n;++i)j+=c[i]==$i;if(j<n-1)print}'
A regexp general solution would need to be a 2-stepper I think -- generate the regexp in first step (from the keyword), run the regexp against the file in the second step.
By the way, the way to do an "and" of regexps is to chain lookaheads one after another (and the lookaheads don't need to be as complicated as the ones you had above, I think):
^(?!.ale)(?!p.le)(?!pa.e)(?!pal.)
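The two-step idea (generate the regex from the keyword, then run it) could look roughly like this in Python. This is only a sketch: far_from is a made-up name, and each lookahead is additionally anchored with $ so that only words of the same length as the keyword are ruled out.

```python
import re

def far_from(keyword):
    """Build a regex that rejects words one substitution away from keyword."""
    # One negative lookahead per position, with that position replaced by ".".
    parts = (
        "(?!" + re.escape(keyword[:i]) + "." + re.escape(keyword[i + 1:]) + "$)"
        for i in range(len(keyword))
    )
    return re.compile("^" + "".join(parts))

words = ("pale male sale tale pile pole pace page pane pave "
         "palm peal leap play help pack").split()
pattern = far_from("pale")  # ^(?!.ale$)(?!p.le$)(?!pa.e$)(?!pal.$)
selected = [w for w in words if pattern.match(w)]
print(selected)  # ['peal', 'leap', 'play', 'help', 'pack']
```

The keyword itself is left unselected because it satisfies every lookahead pattern, which the question says is acceptable.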

How to find words with more than 3 characters between 2 other words

I have 2 sentences:
Today one dog will eat 2 kg of meats more than a cat
Human always prefer dog and cat
With the help of regex:
I would like to find sentences that have dog and cat together without human
I need also to have words between dog and cat which have more than 3 characters in the sentence where we can't find human
Assuming that the string you're matching contains one sentence:
"^(?!.*human)(?=.*dog)(?=.*cat)"
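will match if the string contains dog and cat but not human. The pattern is case-sensitive as written, so use a case-insensitive flag if a capitalized "Human" should also be excluded.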
For your second question (finding all words of more than two (!) characters between dog and cat), you need two steps (at least in Java):
First, find the part of the string between dog and cat using the regex
"(?<=dog).*(?=cat)"
Then, on the match result, use the regex "\\w{3,}" to find all alphanumeric words of length 3 or more.
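The two steps can be sketched in Python as follows. This is an illustration rather than the Java code the answer has in mind: words_between is a hypothetical helper, the first pattern is compiled case-insensitively so that "Human" is excluded, and the word pattern keeps the answer's \w{3,} (use \w{4,} for strictly more than 3 characters).

```python
import re

sentences = [
    "Today one dog will eat 2 kg of meats more than a cat",
    "Human always prefer dog and cat",
]

# Step 0: sentence must contain dog and cat but not human (any case).
wanted = re.compile(r"^(?!.*human)(?=.*dog)(?=.*cat)", re.IGNORECASE)
# Step 1: the text between dog and cat.
between = re.compile(r"(?<=dog).*(?=cat)")
# Step 2: words of length 3 or more inside that text.
long_word = re.compile(r"\w{3,}")

def words_between(sentence):
    """Return the long words between dog and cat, or None if the sentence is rejected."""
    if not wanted.search(sentence):
        return None
    middle = between.search(sentence)
    return long_word.findall(middle.group()) if middle else []

for s in sentences:
    print(words_between(s))
```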

Extract a portion of text using RegEx

I would like to extract portion of a text using a regular expression. So for example, I have an address and want to return just the number and streets and exclude the rest:
2222 Main at King Edward Vancouver BC CA
But the addresses vary in format most of the time. I tried using a lookahead regex and came up with this expression:
.*?(?=\w* \w* \w{2}$)
The above expression handles the above example nicely, but it gets way too messy as soon as commas come into the text, or postal codes, which can be a 6-character string or two 3-character strings with a space in the middle, etc.
Is there a more elegant way of extracting a portion of text than a lookahead regex?
Any suggestion or a point in another direction is greatly appreciated.
Thanks!
Regular expressions are for data that is REGULAR, that follows a pattern. So if your data is completely random, no, there's no elegant way to do this with regex.
On the other hand, if you know what values you want, you can probably write a few simple regexes, and then just test them all on each string.
Ex.
regex1= address # grabber, regex2 = street type grabber, regex3 = name grabber.
Attempt a match on string1 with regex1, regex2, and finally regex3. Move on to the next string.
well i thought i'd throw my hat into the ring:
.*(?=,? ([a-zA-Z]+,?\s){3}([\d-]*\s)?)
and you might want ^ or \d+ at the front for good measure
and i didn't bother specifying lengths for the postal codes... this one just accepts any number of digits and hyphens.
it works for these inputs so far and variations on commas within the City/state/country area:
2222 Main at King Edward Vancouver, BC, CA, 333-333
555 road and street place CA US 95000
2222 Main at King Edward Vancouver BC CA 333
555 road and street place CA US
it is counting at there being three words at the end for the city, state and country but other than that it's like ryansstack said, if it's random it won't work. if the city is two words like New York it won't work. yeah... regex isn't the tool for this one.
btw: tested on regexhero.net
i can think of 2 ways you can do this
1) if you know that "the rest" of your data after the address is exactly 2 fields, ie BC and CA, you can do split on your string using space as delimiter, remove the last 2 items.
2) do a split on delimiter /[A-Z][A-Z]/ and store the result in array. then print out the array ( this is provided that the address doesn't contain 2 or more capital letters)
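Option 1 could be sketched like this in Python, under the assumption stated above that exactly two fields follow the part you want to keep. The function name street_part and the trailing_fields parameter are made up for illustration.

```python
def street_part(address, trailing_fields=2):
    """Drop the last trailing_fields space-separated items from an address."""
    parts = address.split()
    # Slice by length so that trailing_fields=0 keeps everything.
    return " ".join(parts[:len(parts) - trailing_fields])

print(street_part("2222 Main at King Edward Vancouver BC CA"))
# 2222 Main at King Edward Vancouver
```

Note that the city is still part of the result; dropping it as well would need a third trailing field, which breaks for two-word city names.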