R grep and exact matches - regex

It seems grep is "greedy" in the way it returns matches. Assuming I have the following data:
Sources <- c(
"Coal burning plant",
"General plant",
"coalescent plantation",
"Charcoal burning plant"
)
Registry <- seq(from = 1100, to = 1103, by = 1)
df <- data.frame(Registry, Sources)
If I perform grep("(?=.*[Pp]lant)(?=.*[Cc]oal)", df$Sources, perl = TRUE, value = TRUE), it returns
"Coal burning plant"
"coalescent plantation"
"Charcoal burning plant"
However, I only want to return exact matches, i.e. only rows where the whole words "coal" and "plant" occur. I don't want "coalescent", "plantation" and so on. So for this example, I only want to see "Coal burning plant".

You want to use word boundaries \b around your word patterns. A word boundary does not consume any characters. It asserts that on one side there is a word character, and on the other side there is not. You may also want to consider using the inline (?i) modifier for case-insensitive matching.
grep('(?i)(?=.*\\bplant\\b)(?=.*\\bcoal\\b)', df$Sources, perl=T, value=T)

If you always want the order "coal" then "plant", then this should work
grep("\\b[Cc]oal\\b.*\\b[Pp]lant\\b", Sources, perl = TRUE, value=T)
Here we add \b, which stands for a word boundary. You can add the word boundaries to your original attempt as well:
grep("(?=.*\\b[Pp]lant\\b)(?=.*\\b[Cc]oal\\b)", Sources,
perl = TRUE, value = TRUE)
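Putting the sample data and the word-boundary pattern together, a quick sketch (the output shown is what the sample data above should produce):
Sources <- c("Coal burning plant", "General plant",
             "coalescent plantation", "Charcoal burning plant")
df <- data.frame(Registry = 1100:1103, Sources)

# Word boundaries keep "coal" from matching inside "coalescent" or "Charcoal"
grep("(?i)(?=.*\\bplant\\b)(?=.*\\bcoal\\b)", df$Sources, perl = TRUE, value = TRUE)
# [1] "Coal burning plant"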


Get a match when there are duplicate letters in a string

I have a list of inputs in Google Sheets:

Input        Desired Output   Repeated letters (to demonstrate only, not an input)
Outdoors     Match            o
dog          No Match
step         No Match
bee          Match            e
Chessboard   Match            s
Cookbooks    Match            o, k
How do I verify whether all letters in a string are unique, without splitting it?
In other words, if one or more letters occur twice or more in the string, return TRUE.
My process so far
I tried this solution, in addition to splitting the string and dividing the length of the string by the COUNTA of its unique letters: if the result = 1, "Match", else "No match".
Or using regex
I found a method to match a letter that occurs twice in a string (this demonstration with REGEXEXTRACT), but what is needed is to get TRUE when the letters in the string are not unique.
=REGEXEXTRACT(A1,"o{2}?")
Returns:
oo
Something like this would do
=REGEXMATCH(Input,"(anyletter){2}?")
OR like this
=REGEXMATCH(lower(A6),"[a-zA-Z]{2}?")
Notes
The third column, "Column C," is only for demonstration and not for input.
The match is case insensitive
The string shouldn't need to be split, to avoid heavy calculation (I have long lists).
Avoid using LAMBDA and its helper functions (see why?).
It's OK to return TRUE or FALSE instead of Match or No Match, to keep it simple.
More examples

Input            Desired Output
Professionally   Match
Attractiveness   Match
Uncontrollably   Match
disreputably     No Match
Recommendation   Match
Interrogations   Match
Aggressiveness   Match
doublethinks     No Match
You are explicitly asking for an answer using a single regular expression. Unfortunately, there is no such thing as a backreference to an earlier capture group in RE2. So if you spell out the answer to your problem, it would look like:
=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)")))
Since you are looking for case-insensitive matching, the (?i) modifier helps cut the options down to just the 26 letters of the alphabet. I suppose the above can be written a bit more neatly like:
=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")")))
EDIT 1:
The only other reasonable way to do this with a single regex (until I learned about the PREG-supported syntax of the matches clause in QUERY() from DoubleUnary) is to create your own UDF in GAS (AFAIK). It is JavaScript-based and thus supports backreferences. GAS is not my forte, but a simple example could be:
function REGEXMATCH_JS(s) {
  // When called on a range, s is an array: recurse over its elements.
  if (s.map) {
    return s.map(REGEXMATCH_JS);
  } else {
    // A captured letter, anything (lazily), then the same letter again (\1).
    return /([a-z]).*?\1/gi.test(s);
  }
}
The pattern ([a-z]).*?\1 means:
([a-z]) - Capture a single character in range a-z;
.*?\1 - Look for 0+ (lazy) characters up to a copy of this 1st captured character with a backreference.
The match is global and case-insensitive. You can now call:
=INDEX(IF(A2:A="","",REGEXMATCH_JS(A2:A)))
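Just to illustrate the backreference idea outside Sheets: this R snippet uses PCRE via perl = TRUE and is for comparison only, not something REGEXMATCH itself can run.
words <- c("Outdoors", "dog", "step", "bee", "Chessboard", "Cookbooks")
# A captured letter, anything (lazily), then the same letter again via \1
grepl("([a-z]).*?\\1", words, ignore.case = TRUE, perl = TRUE)
# [1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE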
EDIT 2:
For those that are benchmarking speed, I am not testing this myself but maybe this would speed things up:
=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)"))
Or:
=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")"))
Or:
=REGEXMATCH_JS(A2:INDEX(A:A,COUNTA(A:A)))
These correspond to the earlier formulas, respectively, and assume there is a header in the first row.
Benchmark:
Created a benchmark here.
Methodology:
Use NOW() to create a timestamp when the checkbox is clicked.
Use NOW() to create another timestamp when the last row is filled and the checkbox is on.
The difference between those two timestamps gives time taken for the formula to complete.
The sample is random data created with Math.random from [A-Za-z], with 10 characters per word.
Results:
Sample size: 10006

Formula                            Round1    Round2    Avg       % Slower than best
[re2] (a.*a|b.*b) JvDv             0:00:19   0:00:19   0:00:19   -15.15%
[re2+recursion] MASTERMATCH_RE2    0:00:27   0:00:24   0:00:26   -54.55%
[Find+recursion] MASTERMATCH       0:00:17   0:00:16   0:00:17   0.00%
[PREG] Doubleunary                 0:00:57   0:00:53   0:00:55   -233.33%
Conclusion:
This varies greatly based on browser/device/mobile app and on non-randomized sample data, but I found PREG to be consistently slower than RE2.
Use recursion.
This seems much faster than the regex-based approach. Create a named function:
Name: MASTERMATCH
Arguments (in this order):
word - The word to check
start - Starting at
Function:
=IF(
MID(word,start,1)="",
FALSE,
IF(
ISERROR(FIND(MID(word,start,1),word,start+1)),
MASTERMATCH(word,start+1),
TRUE
)
)
Usage:
=ARRAYFORMULA(MASTERMATCH(A2:INDEX(A2:A,COUNTA(A2:A)),1))
Or without case sensitivity
=ARRAYFORMULA(MASTERMATCH(lower(A2:A),1))
Explanation:
It recurses through each character using MID and checks whether the same character appears after the current position using FIND. If so, it returns TRUE and stops checking. If not, it keeps checking up to the last character using recursion.
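For comparison only, a rough R analogue of the same walk-and-find logic (the function name has_repeat is made up for illustration; this is not something you can paste into Sheets):
has_repeat <- function(word, start = 1) {
  ch <- substr(word, start, start)
  if (ch == "") return(FALSE)                      # walked past the last character: no repeat
  rest <- substr(word, start + 1, nchar(word))
  if (grepl(ch, rest, fixed = TRUE)) return(TRUE)  # same character appears later: stop early
  has_repeat(word, start + 1)                      # otherwise move to the next position
}
sapply(tolower(c("Outdoors", "dog", "bee")), has_repeat)
# outdoors      dog      bee
#     TRUE    FALSE     TRUE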
Or with regex,
Create a named function:
Name: MASTERMATCH_RE2
Arguments (in this order):
word - The word to check
start - Starting at
Function:
IF(
MID(word,start,1)="",
FALSE,
IF(
REGEXMATCH(word,MID(word, start, 1)&"(?i).*"&MID(word,start,1)),
TRUE,
MASTERMATCH_RE2(word,start+1)
)
)
Usage:
=ARRAYFORMULA(MASTERMATCH_RE2(A2:A,1))
Or
=ARRAYFORMULA(MASTERMATCH_RE2(lower(A2:A),1))
Explanation:
It recurses through each character and creates a regex for that character. Instead of a.*a, b.*b, ..., it takes the character at the current position (using MID), e.g. o in outdoor, and creates the regex o.*o. If that regex matches (using REGEXMATCH), it returns TRUE and doesn't check other letters or create other regexes.
This uses LAMBDA, but it's efficient. Loop through each row and every character with MAP and REDUCE. REGEXREPLACE each character out of the word and find the difference in length. If the difference is more than 1, stop checking lengths and return Match.
=MAP(
A2:INDEX(A2:A,COUNTA(A2:A)),
LAMBDA(_,
REDUCE(
"No Match",
SEQUENCE(LEN(_)),
LAMBDA(a,c,
IF(a="Match",a,
IF(
LEN(_)-LEN(
REGEXREPLACE(_,"(?i)"&MID(_,c,1),)
)>1,
"Match",a
)
)
)
)
)
)
If you do run into lambda limitations, remove the MAP and drag fill the REDUCE formula.
=REDUCE("No Match",SEQUENCE(LEN(A2)),LAMBDA(a,c,IF(a="Match",a,IF(LEN(A2)-LEN(REGEXREPLACE(A2, "(?i)"&MID(A2,c,1),))>1,"Match",a))))
The latter is preferred for conditional formatting as well.
As Daniel Cruz said, Google Sheets functions such as regexmatch(), regexextract() and regexreplace() use RE2 regexes that do not support backreferences. However, the query() function uses Perl Compatible Regular Expressions that do support named capture groups and backreferences:
=arrayformula(
iferror( not( iserror(
match(
to_text(A3:A),
query(lower(unique(A3:A)), "where Col1 matches '.*?(?<char>.).*?\k<char>.*' ", 0),
0
)
) / (A3:A <> "") ) )
)
In my limited testing with a sample size of 1000 heterograms, pangrams, words with diacritic letters, and 10-character pseudo-random unique values from TheMaster's corpus, this PREG formula ran at about half the speed of the JvdV2 RE2 regex.
With Osm's sample of 50,000 highly repetitive sample values, the formula ran at 8x the speed of JvdV2.
A PREG regex is slower than a RE2 regex, but has the benefit that you can more easily check all characters for repeats. This lets you work with corpuses that include diacritic letters, numbers and other non-English alphabet characters:
Input            Output
Professionally   TRUE
disreputably     FALSE
Abacus           TRUE
Élysée           TRUE
naïve Ï          TRUE
määräävä         TRUE
121              TRUE
123              FALSE
You can also easily state which specific characters to check by replacing <char>. with something like <char>[\wéäåö] or <char>[^-;,.\s\d].
try:
=INDEX(IF(IFERROR(LEN(REGEXREPLACE(A1:A6, "[^"&C1:C6&"]", )), -1)>=
(LEN(SUBSTITUTE(C1:C6, "|", ))*2), "Match", "No Match"))
update
create a query heat map, filter it, and vlookup back the row position:
=INDEX(LAMBDA(a, IF(""<>IFNA(VLOOKUP(ROW(a),
SPLIT(QUERY(QUERY(FLATTEN(ROW(a)&"​"&REGEXEXTRACT(a, REPT("(.)", LEN(a)))),
"select Col1,count(Col1) where Col1 matches '.*\w+$' group by Col1"),
"select Col1 where Col2 > 1", ), "​"), 2, )), "Match", "No Match"))
(A2:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
the case-insensitive version would be:
=INDEX(LAMBDA(a, IF(""<>IFNA(VLOOKUP(ROW(a),
SPLIT(QUERY(QUERY(FLATTEN(ROW(a)&"​"&LOWER(REGEXEXTRACT(a, REPT("(.)", LEN(a))))),
"select Col1,count(Col1) where Col1 matches '.*\w+$' group by Col1"),
"select Col1 where Col2 > 1", ), "​"), 2, )), "Match", "No Match"))
(A2:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
Just to illustrate another method - not likely to be scalable - try to substitute the second occurrence of each letter:
=ArrayFormula(if(isnumber(xmatch(len(A2)-1,len(substitute(upper(A2),char(sequence(1,26,65)),"",2)))),"Match","No match"))
If splitting were permitted, I would favour use of Frequency for speed, e.g.
=ArrayFormula(max(frequency(code(mid(upper(A2),sequence(len(A2)),1)),sequence(1,26,65)))>1)
You can give this a try by using the regex /(\w).*?\1/g in the REGEXMATCH function in Google Sheets.
Explanation :
(\w) - matches a word character (a-z, A-Z, 0-9, _); if you are sure the input will contain only letters, you can also use ([a-zA-Z]); then
.*? - zero or more characters, matched lazily (the ? makes it non-greedy, so it can match consecutive as well as non-consecutive repeats); until
\1 - it finds a repeat of the first matched character.
Live Demo : regex101
Coming after the battle ^^ Why not simply compare the number of unique letters in the string with its original length?
=COUNTUNIQUE(split(regexreplace(A2;"(.)"; "$1_"); "_")) < LEN(A2)
All my tests seem fine.
(split() provided by this answer)

Regex with stringr:: how to find first instance of pattern

Behind this question is an effort to extract all references created by knitr and LaTeX. Not finding another way, my thought was to read the .Rnw script into R and use a regular expression to find references, where the LaTeX syntax is \ref{caption referenced to}. My script has 250+ references, and some are very close to each other.
The text.1 example below works, but not the text example. I think it has to do with R chugging along to the final closing brace. How do I stop at the first closing brace and extract what precedes it back to the opening brace?
library(stringr)
text.1 <- c(" \\ref{test}", "abc", "\\ref{test2}", " \\section{test3}", "{test3")
# In the regular expression below, look behind for "ref{" and grab everything up to the lookahead for } at the end
# braces are special characters and require escaping with double backslashes for R to recognize them as braces
# unlist converts the list returned by str_extract_all to a vector
unlist(str_extract_all(string = text.1, pattern = "(?<=ref\\{).*(?=\\}$)"))
[1] "test" "test2"
# a more complicated string, with more than one set of braces in an element
text <- c("text \ref{?bar labels precision} and more text \ref{?table column alignment}", "text \ref{?table space} }")
unlist(str_extract_all(string = text, pattern = "(?<=ref\\{).*(?=\\}$)"))
character(0)
The problem with text is that the backslash in front of "ref" is interpreted by R's parser as a carriage return \r; so you're trying to match "ref", but the string really contains (CR + "ef") ...
Also * is greedy by default, meaning it will match as much as it can and still allow the remainder of the regular expression to match. Use *? or a negated character class to prevent greediness.
unlist(str_extract_all(text, '(?<=\ref\\{)[^}]*'))
# [1] "?bar labels precision" "?table column alignment" "?table space"
As you can see, you can use a character class to match either \r or r, followed by "ef" ...
x <- c(' \\ref{test}', 'abc', '\\ref{test2}', ' \\section{test3}', '{test3',
'text \ref{?bar labels precision} and more text \ref{?table column alignment}',
'text \ref{?table space} }')
unlist(str_extract_all(x, '(?<=[\rr]ef\\{)[^}]*'))
# [1] "test" "test2" "?bar labels precision"
# [4] "?table column alignment" "?table space"
EDITED
The reason it didn't capture what is before the closing brace } is that you added an end-of-line anchor $. Remove the $ and it will work.
Therefore, your new code should be like this:
unlist(str_extract_all(string = text, pattern = "(?<=ref\\{)[^}]*(?=\\})"))
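Since the original goal was to pull every \ref{} out of an .Rnw file, here is a small end-to-end sketch. The file contents and names below are made up; in a real script the lines would come from something like readLines("your-file.Rnw") (a hypothetical filename).
library(stringr)
# readLines() keeps backslashes literal, so \ref in the file arrives as a real backslash;
# in these in-line examples that backslash therefore has to be written as "\\".
lines <- c("See Table \\ref{tab:results} and Figure \\ref{fig:setup}.",
           "No reference on this line.")
# \\\\ref\\{ matches a literal backslash plus "ref{"; ([^}]*) captures up to the first "}"
refs <- str_match_all(lines, "\\\\ref\\{([^}]*)\\}")
unlist(lapply(refs, function(m) m[, 2]))
# [1] "tab:results" "fig:setup"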

Extracting clock time from string

I have a dataframe that consists of web-scraped data. One of the fields scraped was a time in clock time, but the scraping process wasn't perfect. Most of the 'good' data look something like '4:33, or '103:20 (so a leading single quote, and two fields, minutes and seconds). Also, there is some bad data, the most common one being '],, but also some containing text. I'd like a new string that is something like 4:33, and for bad data, just blank.
So my plan of attack is to match my good data form, and then replace everything else with a blank space. Something like time <- gsub('[0-9]+:[0-9]+', '', time). I know this would replace my pattern with a blank, and I want the opposite, but I'm unsure how to negate this whole pattern. A simple caret doesn't seem to work, nor does applying it to a group. I tried something like gsub("(.)+([0-9]+)(:)([0-9]+)", "\\2\\3\\4", time) but that isn't working either.
Sample:
dput(sample)
c("'], ", "' Ling (2-0)vsThe Dragon(2-0)", "'8:18", "'13:33",
"'43:33")
Expected output:
c("", "", "8:18", "13:33", "43:33")
We can use grepl to find the elements that do not follow the pattern, replace them with '', and then replace the quote (') with ''. Here, the pattern matches strings that start (^) with ' followed by numbers, :, and numbers, in that order, up to the end ($) of the string. So all other string elements (identified by negating with !) are assigned '' using the logical index from grepl, and we use sub to remove the '.
sample[!grepl("^'\\d+:\\d+$", sample)] <- ''
sub("'", '', sample)
#[1] "" "" "8:18" "13:33" "43:33"
Or we can also do this in one step using gsub by replacing all those characters (.) that do not follow the pattern \\d+:\\d+ with ''.
gsub("(\\d+:\\d+)(*SKIP)(*F)|.", '', sample, perl=TRUE)
#[1] "" "" "8:18" "13:33" "43:33"
Or another option is str_extract from library(stringr). It is not clear whether there are other patterns such as "some text '08:20 value" in the OP's original dataset or not. The str_extract will also extract those time values, if present.
library(stringr)
str_extract(sample, '\\d+:\\d+')
#[1] NA NA "8:18" "13:33" "43:33"
It will give NA instead of '' for those that don't follow the pattern.
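If blank strings are preferred over NA (to match the expected output), one small follow-up step works; a sketch based on the same sample vector:
library(stringr)
extracted <- str_extract(sample, '\\d+:\\d+')
# Turn the NA entries into empty strings
ifelse(is.na(extracted), '', extracted)
#[1] "" "" "8:18" "13:33" "43:33"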
You can use sub:
sub('.+?(?=[0-9]+:[0-9]+)|.+', '', sample, perl = TRUE)
[1] "" "" "8:18" "13:33" "43:33"
The regex consists of two parts that are combined with a logical or (|).
.+?(?=[0-9]+:[0-9]+)
This part matches one or more characters (lazily) that are followed by the target pattern; the lookahead does not consume the time itself.
.+
This part matches one or more characters.
The logic: replace everything preceding the target pattern with an empty string (''). If there is no target pattern, replace the whole string with the empty string.

R-regex: match strings not beginning with a pattern

I'd like to use regex to see if a string does not begin with a certain pattern. While I can use [^...] to blacklist certain characters, I can't figure out how to blacklist a pattern.
> grepl("^[^abc].+$", "foo")
[1] TRUE
> grepl("^[^abc].+$", "afoo")
[1] FALSE
I'd like to do something like grepl("^[^(abc)].+$", "afoo") and get TRUE, i.e. to match if the string does not start with abc sequence.
Note that I'm aware of this post, and I also tried using perl = TRUE, but with no success:
> grepl("^((?!hede).)*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^((?!hede).)*$", "foohede", perl = TRUE)
[1] FALSE
Any ideas?
Yeah. Put the zero-width lookahead outside the other parens. That should give you this:
> grepl("^(?!hede).*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^(?!hede).*$", "foohede", perl = TRUE)
[1] TRUE
which I think is what you want.
Alternately if you want to capture the entire string, ^(?!hede)(.*)$ and ^((?!hede).*)$ are both equivalent and acceptable.
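A quick R sketch of that (the string "hedgehog" is just an extra illustrative value): the anchored lookahead alone is enough for detection, and the capturing forms behave as described.
x <- c("hede", "foohede", "hedgehog")
# Detection only: match if the string does not start with "hede"
grepl("^(?!hede)", x, perl = TRUE)
# [1] FALSE  TRUE  TRUE

# Capturing the entire string when it qualifies
sub("^((?!hede).*)$", "<<\\1>>", x, perl = TRUE)
# [1] "hede"         "<<foohede>>"  "<<hedgehog>>"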
There is now (years later) another possibility with the stringr package.
library(stringr)
str_detect("dsadsf", "^abc", negate = TRUE)
#> [1] TRUE
str_detect("abcff", "^abc", negate = TRUE)
#> [1] FALSE
Created on 2020-01-13 by the reprex package (v0.3.0)
I got stuck on the following special case, so I thought I would share...
What if there are multiple instances of the regular expression, but you still only want the first segment?
Apparently you can turn off the implicit greediness of the search
with specific perl wildcard modifiers
Suppose the string I wanted to process was
myExampleString = paste0(c(letters[1:13], "_", letters[14:26], "__",
LETTERS[1:13], "_", LETTERS[14:26], "__",
"laksjdl", "_", "lakdjlfalsjdf"),
collapse = "")
myExampleString
"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ__laksjdl_lakdjlfalsjd"
and that I wanted only the first segment before the first "__".
I cannot simply search on "_", because single-underscore is
an allowable non-delimiter in this example string.
The following doesn't work. It instead gives me the first and second segments because of the default greediness (but not the third, because of the lookahead).
gsub("^(.+(?=__)).*$", "\\1", myExampleString, perl = TRUE)
"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ"
But this does work
gsub("^(.+?(?=__)).*$", "\\1", myExampleString, perl = TRUE)
"abcdefghijklm_nopqrstuvwxyz"
The difference is the non-greedy (lazy) modifier "?" after the quantified wildcard ".+"
in the (perl) regular expression.

Matching a non-commented pattern in Eclipse

I am having trouble with regex syntax.
I want to match all occurrences of a certain word followed by a number, but exclude lines which are commented.
Comments are (multiple) # or ## or ### ...
Examples:
#This is a comment <- no match
#This is a comment myword 8 <- no match
my $var = 'myword 12'; <- match
my $var2 = 'myword'; <- no match
Until now I have
original pattern: ^[^(\#+)](.*?)(myword \d+)(.*?)$
new pattern: ^([^\#]*?)(myword\s+\d+)(.*?)$
This should match lines which do not begin with one or more #, followed by something, then the word/number combination I am searching for, and finally something.
It would perhaps also be good to match parts of lines where the comment does not begin at the beginning of the line.
my $var3 = 'test';#myword 8 <- no match
What am I doing wrong?
I want to use it in Eclipse's file search (with the Perl EPIC module).
Edit: The new pattern does not return false matches, but it returns not just the line that includes myword, it also returns several lines before that line. And I'm not sure it returns all matches.
Note that [] is a character class. You cannot use quantifiers inside it. A character class is like the . in that it matches any one of the characters given inside it; the dot, or a character class, can then itself be quantified.
In your example, [^(\#+)] would match every character except (, ), + and, depending on the flavour (I guess), # and \.
So what you want here is to match a line that starts with any character except for a #. (I think.)
A problem is that the # might occur in a string, where it is not a comment. (Regarding comments not starting at the beginning of the line.)
Re: comments not at the beginning of the string.
To do this right (i.e. not to miss any valid matches), you pretty much have to parse the grammar of the file's specific programming language properly, so you can't do this (easily, or even at all) with a regex.
If you don't, you risk missing valid search hits that follow a "#" used in a context other than a comment start - as an example common to pretty much any language, after a string like "this is my #hash".
It's even worse in Perl, where "#" can also appear as a regex delimiter, as $#myArr (the index of the last element of an array), or - joy of joys - as a valid character in an identifier name!
Of course, if you are aware of these problems and still want to use a regexp to extract the content, something like this may be useful:
^[^\#].[^\n\#]+myword\s\d+.[$;]+
This is a little bit complex, but I hope it works for you.
For me this matches as below:
my $var = 'myword 12'; <- match
my $var = 'myword 17'; <- match
my $var2 = 'myword'; <- no match
my $var = 'myword 9'; #'myword 17'; <- partly match
my $var = 'myword 8'; ##'myword 127'; <- partly match
my $var = ;#'myword 17'; <- no match
#my $var = 'myword 13'; <- no match
##my $var2 = 'myword 14'; <- no match
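As a quick sanity check of the "no # anywhere before the match" idea against the examples from the question, here is an R/PCRE sketch using a simplified form of the question's own "new pattern" (the same pattern should behave similarly in Eclipse's regex file search, subject to the caveats about # inside strings discussed above):
lines <- c("#This is a comment",
           "#This is a comment myword 8",
           "my $var = 'myword 12';",
           "my $var2 = 'myword';",
           "my $var3 = 'test';#myword 8")
# Only match "myword <number>" when no # occurs earlier on the line
grepl("^[^#]*myword\\s+\\d+", lines, perl = TRUE)
# [1] FALSE FALSE  TRUE FALSE FALSE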