agrepl does not work with regular expressions

I need partial (approximate) matching in a string using regexes. I can get exact matches:
pattern <- "(^| )shower only($| )"
stringInQuestion<-"Delta Vero 1-Handle Shower Only Faucet Trim Kit in Chrome"
grepl(pattern,stringInQuestion, ignore.case=TRUE,perl=TRUE)
[1] TRUE
agrepl(pattern,stringInQuestion, ignore.case=TRUE,fixed = FALSE, max.distance=0.2)
[1] FALSE
It works only for plain character strings:
agrepl("shower only",stringInQuestion, ignore.case=TRUE,fixed = FALSE, max.distance=0.2)
Can somebody please help me to figure out what is going on?

Since you intend to just check for a whole word presence, I suggest reducing the pattern to
pattern <- "\\bshower only\\b"
See the official description of the max.distance argument:
max.distance
Maximum distance allowed for a match. Expressed either as integer, or as a fraction of the pattern length times the maximal transformation cost (will be replaced by the smallest integer not less than the corresponding fraction), or a list with possible components
0.2 will allow matching the phrase with errors, say Showerrrrr Only, but won't match Showerrrrrr Only. See this working demo:
pattern <- "\\bshower only\\b"
stringInQuestion<-"Delta Vero 1-Handle Shower Only Faucet Trim Kit in Chrome"
grepl(pattern,stringInQuestion, ignore.case=TRUE,perl=TRUE)
agrepl(pattern,stringInQuestion, ignore.case=TRUE,fixed = FALSE, max.distance=0.2)
## [1] TRUE
## [1] TRUE
Note that max.distance should be tested against the real input you have and adjusted accordingly.
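A minimal sketch of how max.distance changes the outcome. It uses the plain-string pattern "lasy" (borrowed from the examples on the agrep help page) rather than the regex above, and an integer max.distance, which is read as the number of edits allowed:

```r
x <- "1 lazy 2"

# No edits allowed: "lasy" does not occur verbatim in x.
exact <- agrepl("lasy", x, max.distance = 0)

# One edit allowed: substituting "s" with "z" turns "lasy" into "lazy".
fuzzy <- agrepl("lasy", x, max.distance = 1)

print(c(exact = exact, fuzzy = fuzzy))
```

A fractional value such as 0.2 is converted to an integer bound from the pattern length, so it is worth checking the bound against a few known near-misses in your own data.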

Incrementing a number in a string using sub

There's a string with a (single) number somewhere in it. I want to increment the number by one. Simple, right? I wrote the following without giving it a second thought:
sub("([[:digit:]]+)", as.character(as.numeric("\\1")+1), string)
... and got an NA.
> sub("([[:digit:]]+)", as.character(as.numeric("\\1")+1), "x is 5")
[1] NA
Warning message:
In sub("([[:digit:]]+)", as.character(as.numeric("\\1") + 1), "x is 5") :
NAs introduced by coercion
Why doesn't it work? I know other ways of doing this, so I don't need a "solution". I want to understand why this method fails.
The point is that the backreference is only evaluated during a match operation, and you cannot pass it to any function before that.
When you write as.numeric("\\1") the as.numeric function accepts a \1 string (a backslash and a 1 char). Thus, the result is expected, NA.
This happens because there is no built-in backreference interpolation in R.
You may use the gsubfn package:
> library(gsubfn)
> s <- "x is 5"
> gsubfn("\\d+", function(x) as.numeric(x) + 1, s)
[1] "x is 6"
It does not work because the arguments of sub are evaluated before they are passed to the regex engine (which gets called by .Internal).
In particular, as.numeric("\\1") evaluates to NA ... after that you're doomed.
It might be easier to think of it differently. You are getting the same error that you would get if you used:
print(as.numeric("\\1")+1)
Remember, the strings are passed to the function, where they are interpreted by the regex engine. The string \\1 is never transformed to be 5, since this calculation is done within the function.
Note that the string \\1 cannot be parsed as a number, so as.numeric returns NA for it. NA is loosely similar to null in other languages:
NA... is a product of operation when you try to access something that is not there
From mpiktas' answer here.
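The evaluation order described in these answers can be checked directly. The sketch below first shows that the replacement argument is already NA before sub() ever runs, then shows one base R alternative (regexpr plus regmatches) that computes on the matched text after the match has been found. The variable names are only for illustration.

```r
s <- "x is 5"

# The replacement argument is evaluated before sub() runs, so it is
# already NA by the time the regex engine would see it.
replacement <- suppressWarnings(as.character(as.numeric("\\1") + 1))
print(replacement)

# Base R alternative: find the match first, then compute on the matched text.
m <- regexpr("[[:digit:]]+", s)
regmatches(s, m) <- as.character(as.numeric(regmatches(s, m)) + 1)
print(s)
```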

regular expression in R-- new lines

I'm trying to use a regular expression in R with the regexpr function. I have multiple conditions to match, so my regular expression is actually very long, for example "A\s+(\d+)|(\d+)\s+A". So I want to put each separate expression on a different line, like
"A\\s+(\\d+)|
(\\d+)\\s+A|"
But it's not working. The parentheses tell R that I want to extract the digits. Can anyone give suggestions?
1) paste Try using paste:
paste("A\\s+(\\d+)",
"(\\d+)\\s+A",
sep = "|")
2) rex Another possibility is to use the rex package
library(rex)
rex(group("A", spaces, capture(digits)) %or%
group(capture(digits), spaces, "A"))
which gives:
(?:(?:A[[:space:]]+([[:digit:]]+))|(?:([[:digit:]]+)[[:space:]]+A))
3) rebus The rebus package is similar in intent:
library(rebus)
literal("A") %R% one_or_more(space()) %R% capture(one_or_more(ascii_digit())) %|%
capture(one_or_more(digit())) %R% one_or_more(space()) %R% literal("A")
which emits:
<regex> \QA\E[[:space:]]+([0-9]+)|([[:digit:]]+)[[:space:]]+\QA\E
If you want to break a string literal up onto several lines in your script, one solution is to use paste0:
my_expr <- paste0('partone',
'parttwo',
'partthree')
Then you get the desired result:
> my_expr
[1] "partoneparttwopartthree"
You can't just break it up onto several lines in between quotes, because then the newline character becomes part of the expression.
If you are also trying to troubleshoot your regular expression, you'll need to post a sample of the data you are trying to work with and the desired result.
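A small sketch of why the line break inside the quotes fails, using the pattern from the question. The variable names and the test string "12 A" are made up for illustration.

```r
# A line break inside the quotes is kept as a literal newline in the pattern.
broken <- "A\\s+(\\d+)|
(\\d+)\\s+A"

# paste() joins the pieces with "|" and adds nothing else.
joined <- paste("A\\s+(\\d+)",
                "(\\d+)\\s+A",
                sep = "|")

has_newline <- grepl("\n", broken, fixed = TRUE)
hits_broken <- grepl(broken, "12 A", perl = TRUE)
hits_joined <- grepl(joined, "12 A", perl = TRUE)
print(c(has_newline = has_newline,
        hits_broken = hits_broken,
        hits_joined = hits_joined))
```

In the broken version the second alternative only matches digits that are preceded by a newline, which is why "12 A" is missed.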
Just use the x modifier with perl = TRUE in whatever function you're using. Place the x modifier ((?x)) at the beginning of the expression and white space is ignored. Additionally, comments (from a # to the end of the line) are ignored as well.
pat <- "(?x)\\\\ ## Grab a backslash followed by...
[a-zA-Z0-9]*cite[a-zA-Z0-9]* ## A word that contains ‘cite‘
(\\[([^]]+)\\]){0,2}\\** ## Look for 0-2 square brackets w/ content
\\{([a-zA-Z0-9 ,]+)\\}" ## Look for curly braces with viable bibkey
tex <- c(
"Many \\parencite*{Ted2005, Moe1999} say graphs \\textcite{Few2010}.",
"But \\authorcite{Ware2013} said perception good too.",
"Random words \\pcite[see][p. 22]{Get9999c}.",
"Still more \\citep[p. 22]{Foo1882c}?"
)
gsub(pat, "", tex, perl=TRUE)
## [1] "Many say graphs ." "But said perception good too."
## [3] "Random words ." "Still more ?"
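Applied to the pattern from the question, the same idea might look like the sketch below. The layout and the three test strings are assumptions made for illustration; only the (?x) modifier and perl = TRUE come from the answer above.

```r
# (?x) lets the pattern span several lines; spaces and # comments are ignored.
pat_x <- "(?x)
  A \\s+ (\\d+)    # 'A', whitespace, then captured digits
  |                # or
  (\\d+) \\s+ A    # captured digits, whitespace, then 'A'
"

result <- grepl(pat_x, c("A 12", "12 A", "B 12"), perl = TRUE)
print(result)
```

Because white space is ignored in this mode, a literal space has to be written as \\s, as an escaped space, or inside a character class.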
A second approach...I maintain a package called regexr that attempts to enable maintainers of regular expressions libraries:
to write regular expressions in a way that is similar to the ways R code is written.
This may be overkill if you aren't planning long-term maintenance of the expression, but you could do the same thing with regexr (no need for perl = TRUE). Note the minimal comments, as the meaning is shared with the sub-expression names. The %:)% is a comment operator (commented code is happy code), but you need not use the leading names or comments; just use construct:
library(regexr)
pat2 <- construct(
backslash = "\\\\" %:)% "\\",
cite_command = "[a-zA-Z0-9]*cite[a-zA-Z0-9]*" %:)% "parencite",
square_brack = "(\\[([^]]+)\\]){0,2}\\**" %:)% "[e.g.][p. 12]",
bibkeys = "\\{([a-zA-Z0-9 ,]+)\\}" %:)% "{Rinker2014}"
)
gsub(pat2, "", tex)
## [1] "Many say graphs ." "But said perception good too."
## [3] "Random words ." "Still more ?"
The regexr framework requires a bit of upfront time, but the "code" is much easier to maintain, more modular, and easier for others to understand without learning a new "language". This is one approach of many, and I tend to use a combination of standard regex, regexr and rebus (which works within the regexr framework). So, for example, we can grab any of the sub-expressions from pat2 with the subs function as follows:
subs(pat2)
## $backslash
## [1] "\\\\"
##
## $cite_command
## [1] "[a-zA-Z0-9]*cite[a-zA-Z0-9]*"
##
## $square_brack
## [1] "(\\[([^]]+)\\]){0,2}\\**"
##
## $bibkeys
## [1] "\\{([a-zA-Z0-9 ,]+)\\}"
I also included a simple way to test the main expression and sub-expressions for perl validity, as follows:
test(pat2)
## $regex
## [1] TRUE
##
## $subexpressions
## backslash cite_command square_brack bibkeys
## TRUE TRUE TRUE TRUE

Regex in R to limit one term AND another not OR another

I am trying to extract records from a data.frame using grepl. Here are some example cases.
a <- c('This is a healthcare facility', 'this is a hospital', 'this is a hospital district', 'this is a district health service')
I wish to extract all records that have hospital but not district. I have come unstuck when district and hospital occur in the same string. I tried using the following:
str_match(string=a,pattern='hospital|^district' )
How do I limit district but still include hospital in this example?
Thanks.
You need to use the symbol & for AND, ! for NOT, with two grepl calls:
grepl("hospital", a) & !grepl("district", a)
# [1] FALSE TRUE FALSE FALSE
a[.Last.value]
# [1] "this is a hospital"
You could use two calls to grepl:
a[grepl("hospital", a) & !grepl("district", a)]
# [1] "this is a hospital"
R supports Perl-compatible regular expressions, which allow negative lookahead assertions, so in principle, you can write:
grepl('^(?!.*district).*hospital', a, perl=TRUE)
(which matches "start-of-string, followed by a point in the string that is not followed by .*district, followed by .*hospital"). That said, I'm really not sure if putting this condition into a single regex is the best way to do it; there may be a more R-ish way.
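As a quick check, the lookahead version and the two-call version can be compared on the sample vector from the question. This sketch uses base R grepl with perl = TRUE; the variable names are illustrative.

```r
a <- c('This is a healthcare facility', 'this is a hospital',
       'this is a hospital district', 'this is a district health service')

# Two calls combined with & and !.
two_calls <- grepl("hospital", a) & !grepl("district", a)

# (?!.*district) is a negative lookahead; in base R it needs perl = TRUE.
one_regex <- grepl("^(?!.*district).*hospital", a, perl = TRUE)

print(a[one_regex])
print(identical(two_calls, one_regex))
```

Both give the same logical vector here; the two-call form is usually easier to read and to extend with further conditions.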

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions, or must I do something as I would in MS Excel using the mid and find commands? E.g., in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
## V1 V2
## 1 USDZAR Curncy
## 2 R157 Govt
## 3 SPX Index
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me, in that regexp's will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) w/out really knowing the first thing about regexp's.
Edited to reflect @Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.
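For reference, a rough base R analogue of the Excel MID/FIND formula, plus a simple timing comparison against the sub approach. This is a sketch: the vector size is arbitrary and the timings will differ from machine to machine.

```r
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")

# Analogue of FIND(" ", ...): position of the first space in each string.
first_space <- regexpr(" ", x, fixed = TRUE)
# Analogue of MID(..., 1, n): keep characters 1 up to just before the space.
via_substr <- substr(x, 1, first_space - 1)
via_sub <- sub(" .*", "", x)
print(identical(via_substr, via_sub))

# Rough timing on a larger vector; the numbers depend on the machine.
big <- rep(x, 1e4)
print(system.time(sub(" .*", "", big)))
print(system.time(substr(big, 1, regexpr(" ", big, fixed = TRUE) - 1)))
```

One difference to keep in mind: for a string with no space, regexpr returns -1, so the substr version yields an empty string, while sub leaves the string unchanged.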

R regex to validate user input is correct

I'm trying to practice writing better code, so I wanted to validate my input sequence with regex to make sure that the first thing I get is a single letter A to H only, and the second is a number 1 to 12 only. I'm new to regex and not sure what the expression should look like. I'm also not sure what type of error R would throw if the input is invalid.
In Perl it would be something like this I think: =~ m/([A-M]?))/)
Here is what I have so far for R:
input_string = "A1"
first_well_row = unlist(strsplit(input_string, ""))[1] # get the letter out
first_well_col = unlist(strsplit(input_string, ""))[2] # get the number out
In R code, using David's regex: [edited to reflect Marek's suggestion]
validate.input <- function(x){
match <- grepl("^[A-Ha-h]([0-9]|(1[0-2]))$",x,perl=TRUE)
## as Marek points out, instead of testing the length of the vector
## returned by grep() (which will return the index of the match and integer(0)
## if there are no matches), we can use grepl()
if(!match) stop("invalid input")
list(well_row=substr(x,1,1), well_col=as.integer(substr(x,2,nchar(x))))
}
This simply produces an error. If you want finer control over error handling, look up the documentation for tryCatch. Here's a primitive usage example (instead of getting an error as before, we'll return NA):
validate.and.catch.error <- function(x){
tryCatch(validate.input(x), error=function(e) NA)
}
Finally, note that you can use substr to extract your letters and numbers instead of doing strsplit.
You asked specifically for "A through H, then 0-9 or 10-12". Name the exception "InvalidInputException" or something similar that conveys "not valid", "input", and "exception".
/^[A-H]([0-9]|(1[0-2]))$/
In Pseudocode:
validateData(String data)
if not data.match("/^[A-H]([0-9]|(1[0-2]))$/")
throw InvalidInputException
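The pseudocode above could be sketched in R roughly as follows. Two assumptions are made here: the column is restricted to 1 through 12, as stated in the question (so "A0" is rejected), and the function name parse_well is hypothetical. R does not raise an error when a regex fails to match, so the error is signalled explicitly with stop().

```r
# Hypothetical helper: validate a well id such as "A1" or "H12" and split it.
parse_well <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  # Group 1: a single letter A-H. Group 2: a number from 1 to 12.
  parts <- regmatches(x, regexec("^([A-H])(1[0-2]|[1-9])$", x))[[1]]
  if (length(parts) == 0) stop("invalid well id: ", x)
  list(well_row = parts[2], well_col = as.integer(parts[3]))
}

ok <- parse_well("H12")
print(ok)

# tryCatch turns the error into NA, as in the earlier answer.
bad <- tryCatch(parse_well("A0"), error = function(e) NA)
print(bad)
```

Capturing both parts with regexec also avoids the strsplit(input_string, "") approach from the question, which would return only "1" as the column for an input like "A12".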