Understanding `regexp` in R [duplicate]

Understanding regular expressions can sometimes be difficult, especially if you're not used to writing them, like me.
In R there are a few built-in functions (base package) that I would like to understand and be able to use, such as:
grep(pattern, x) and gsub(pattern, replacement, x), where pattern is a regular expression and x is a character vector to search. strsplit also takes a regexp as an argument, as do many others.
Anyway, I have an example such as:
string <- "39 22' 19'' N"
and I need to extract the numbers from it. Using the stringr, iterators and foreach libraries, I am trying to work out an expression with either iter or foreach.
str_locate(string, "[0-9]+") locates and z <- str_extract(string, "[0-9]+") extracts only the first match in my string.
I have tried something like
x <- iter(z)
nextElem(x)
but it doesn't work. Here is another attempt that doesn't work either:
a <- foreach(iter(z))
a
How should I fix this using the above libraries?
Thanks.

Check the stringr manual at http://cran.r-project.org/web/packages/stringr/stringr.pdf. str_extract_all returns every match rather than only the first:
str_extract_all(your_string, "[0-9]+")

For this string you get the same matches with base functions:
strsplit(gsub("(\\D+)"," ", string), " ")

This is another way to do it in base R:
string <- "39 22' 19'' N"
regmatches(string,gregexpr("[0-9]+",string))
# [[1]]
# [1] "39" "22" "19"
Note that regmatches(...) returns a list in which each element is a character vector of matches. To get just the character vector, use:
regmatches(string,gregexpr("[0-9]+",string))[[1]]
# [1] "39" "22" "19"

Related

Incrementing a number in a string using sub

There's a string with a (single) number somewhere in it. I want to increment the number by one. Simple, right? I wrote the following without giving it a second thought:
sub("([[:digit:]]+)", as.character(as.numeric("\\1")+1), string)
... and got an NA.
> sub("([[:digit:]]+)", as.character(as.numeric("\\1")+1), "x is 5")
[1] NA
Warning message:
In sub("([[:digit:]]+)", as.character(as.numeric("\\1") + 1), "x is 5") :
NAs introduced by coercion
Why doesn't it work? I know other ways of doing this, so I don't need a "solution". I want to understand why this method fails.
The point is that the backreference is only evaluated during a match operation, and you cannot pass it to any function before that.
When you write as.numeric("\\1"), as.numeric receives the two-character string \1 (a backslash and a 1), so the result is, as expected, NA.
R has no built-in backreference interpolation that would substitute the match before the function is called.
You can use the gsubfn package:
> library(gsubfn)
> s <- "x is 5"
> gsubfn("\\d+", function(x) as.numeric(x) + 1, s)
[1] "x is 6"
It does not work because the arguments of sub are evaluated before they are passed to the regex engine (which gets called by .Internal).
In particular, as.numeric("\\1") evaluates to NA ... after that you're doomed.
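The evaluation order described above can be checked directly, and the increment can also be done in base R without gsubfn. This is only a sketch of one possible alternative, using regexpr to locate the number and the replacement form of regmatches to assign a computed value into the match:

```r
# The replacement is computed before any matching takes place,
# so sub() only ever sees NA as its replacement argument.
replacement <- suppressWarnings(as.character(as.numeric("\\1") + 1))
is.na(replacement)
# [1] TRUE

# Base R alternative: find the match first, then compute on the matched text.
s <- "x is 5"
m <- regexpr("[[:digit:]]+", s)
regmatches(s, m) <- as.character(as.numeric(regmatches(s, m)) + 1)
s
# [1] "x is 6"
```

Here the arithmetic runs on the text that was actually matched, which is what the backreference version was trying to express.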
It might be easier to think of it differently. You are getting the same error that you would get if you used:
print(as.numeric("\\1")+1)
Remember, the strings are passed to the function, where they are interpreted by the regex engine. The string \\1 is never turned into 5 before as.numeric sees it, because the substitution happens inside sub.
Note that \\1 is not something that works as a number. NA seems to be similar to null in other languages:
NA... is a product of operation when you try to access something that is not there
From mpiktas' answer here.

Extract a string of words between two specific words in R [duplicate]

I have the following string : "PRODUCT colgate good but not goodOKAY"
I want to extract all the words between PRODUCT and OKAY
This can be done with sub:
s <- "PRODUCT colgate good but not goodOKAY"
sub(".*PRODUCT *(.*?) *OKAY.*", "\\1", s)
giving:
[1] "colgate good but not good"
No packages are needed.
x = "PRODUCT colgate good but not goodOKAY"
library(stringr)
str_extract(string = x, pattern = "(?<=PRODUCT).*(?=OKAY)")
(?<=PRODUCT) -- look behind the match for PRODUCT.
.* -- match everything except newlines.
(?=OKAY) -- look ahead to match OKAY.
I should add that you don't need the stringr package for this; the base functions sub and gsub work fine. I use stringr for its consistency of syntax: whether I'm extracting, replacing, detecting etc., the function names are predictable and understandable, and the arguments are in a consistent order. I use stringr because it saves me from needing the documentation every time.
(Note that for stringr versions less than 1.1.0, you need to specify perl-flavored regex to get lookahead and lookbehind functionality - so the pattern above would need to be wrapped in perl().)
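The same lookaround pattern can be tried in base R by passing perl = TRUE. One thing to watch is that (?<=PRODUCT) leaves the space after PRODUCT inside the match; the sketch below assumes a single space always follows PRODUCT and moves it into the lookbehind:

```r
x <- "PRODUCT colgate good but not goodOKAY"

# Lookaround needs perl = TRUE in base R.
# With the pattern as written, the match starts with the space after PRODUCT.
with_space <- regmatches(x, regexpr("(?<=PRODUCT).*(?=OKAY)", x, perl = TRUE))
with_space
# [1] " colgate good but not good"

# Including the space in the (fixed-length) lookbehind excludes it from the match.
trimmed <- regmatches(x, regexpr("(?<=PRODUCT ).*(?=OKAY)", x, perl = TRUE))
trimmed
# [1] "colgate good but not good"
```

If the amount of whitespace varies, wrapping the first result in trimws() is a simpler option than a longer lookbehind.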
You can use gsub:
vec <- "PRODUCT colgate good but not goodOKAY"
gsub(".*PRODUCT\\s*|OKAY.*", "", vec)
# [1] "colgate good but not good"
You could use the rm_between function from the qdapRegex package. It takes a string and a left and right boundary as follows:
x <- "PRODUCT colgate good but not goodOKAY"
library(qdapRegex)
rm_between(x, "PRODUCT", "OKAY", extract=TRUE)
## [[1]]
## [1] "colgate good but not good"
You could use the package unglue:
library(unglue)
x <- "PRODUCT colgate good but not goodOKAY"
unglue_vec(x, "PRODUCT {out}OKAY")
#> [1] "colgate good but not good"

strsplit on first instance [duplicate]

I would like to write a strsplit command that grabs the first ")" and splits the string.
For example:
f("12)34)56")
"12" "34)56"
I have read over several other related regex SO questions but I am afraid I am not able to make heads or tails of this. Thank you for any assistance.
You could get the same list-type result as you would with strsplit if you used regexpr to get the first match, and then the inverted result of regmatches.
x <- "12)34)56"
regmatches(x, regexpr(")", x), invert = TRUE)
# [[1]]
# [1] "12" "34)56"
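The regexpr/regmatches approach above can be wrapped in a small helper like the f() the question asks for. This is a sketch only: the name split_first and the sep argument are made up here, and fixed = TRUE is added so the separator is treated as a literal string rather than a regular expression:

```r
# split_first is a hypothetical helper name, not part of any package.
split_first <- function(x, sep = ")") {
  # regexpr finds only the first occurrence; invert = TRUE returns the
  # pieces before and after that occurrence, as a list with one element per input.
  regmatches(x, regexpr(sep, x, fixed = TRUE), invert = TRUE)
}

res <- split_first(c("12)34)56", "a)b"))
res[[1]]
# [1] "12"    "34)56"
res[[2]]
# [1] "a" "b"
```

Like strsplit, it returns a list, so it works unchanged on a whole character vector.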
Need speed? Then go for stringi functions. See timings e.g. here.
library(stringi)
x <- "12)34)56"
stri_split_fixed(str = x, pattern = ")", n = 2)
It might be safer to identify where the character is and then substring either side of it:
x <- "12)34)56"
spl <- regexpr(")",x)
substring(x,c(1,spl+1),c(spl-1,nchar(x)))
#[1] "12" "34)56"
Another option is to use str_split in the package stringr:
library(stringr)
f <- function(string)
{
  unlist(str_split(string, "\\)", n = 2))
}
> f("12)34)56")
[1] "12" "34)56"
Replace the first ) with the non-printing character "\01" and then strsplit on that. You can use any character you like in place of "\01" as long as it does not appear in the string.
strsplit(sub(")", "\01", "12)34)56"), "\01")

Regex in R: match everything but not "some string" [duplicate]

The answers to another question explain how to match a string not containing a word.
The problem (for me) is that the solutions given don't work in R.
Often I create a data.frame() from existing vectors and want to clean up my workspace. So for example, if my workspace contains:
> ls()
[1] "A" "B" "dat" "V"
>
and I want to retain only dat, I'd have to clean it up with:
> rm(list=ls(pattern="A"))
> rm(list=ls(pattern="B"))
> rm(list=ls(pattern="V"))
> ls()
[1] "dat"
>
(where A, B, and V are just examples of a large number of complicated names like my.first.vector that are not easy to match with rm(list=ls(pattern="[ABV]"))).
It would be most convenient (for me) to tell rm() to remove everything except dat, but the problem is that the solution given in the linked Q&A does not work:
> rm(list=ls(pattern="^((?!dat).)*$"))
Error in grep(pattern, all.names, value = TRUE) :
invalid regular expression '^((?!dat).)*$', reason 'Invalid regexp'
>
So how can I match everything except dat in R?
This will remove all objects except dat. (Use the ls argument all.names = TRUE if you want to remove objects whose names begin with a dot as well.)
rm( list = setdiff( ls(), "dat" ) )
Replace "dat" with a vector of names, e.g. c("dat", "some.other.object"), if you want to retain several objects; or, if the several objects can all be readily matched by a regular expression try something like this which removes all objects whose names do not start with "dat":
rm( list = setdiff( ls(), ls( pattern = "^dat" ) ) )
Another approach is to save the data, save("dat", file = "dat.RData"), exit R, start a new R session and load the data, load("dat.RData"). Also note this method of restarting R.
Negative look-around requires perl=TRUE argument in R. So, you won't be able to directly use ls(pattern = ...) with that regular expression. Alternatively you can do:
rm(list = grep("^((?!dat).)*$", ls(), perl=TRUE, value=TRUE))
This is if you're looking for inexact matches. If you're looking for exact match, you should just do what Ferdinand has commented:
rm(list=ls()[ls() != "dat"])
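The difference between the inexact and exact variants can be tried safely in a scratch environment instead of the real workspace. In the sketch below, the environment e and the extra object name mydata are made up for illustration:

```r
# Build a throwaway environment so nothing in the global workspace is removed.
e <- new.env()
for (nm in c("A", "B", "V", "dat", "mydata")) assign(nm, 1, envir = e)

# Inexact: names that do not contain "dat" anywhere (needs perl = TRUE).
drop_inexact <- grep("^((?!dat).)*$", ls(e), perl = TRUE, value = TRUE)

# Exact: every name other than "dat" itself.
drop_exact <- setdiff(ls(e), "dat")

# "mydata" contains "dat", so only the exact variant drops it.
setdiff(drop_exact, drop_inexact)
# [1] "mydata"

rm(list = drop_exact, envir = e)
ls(e)
# [1] "dat"
```

Dropping the envir arguments and using plain ls() applies the same calls to the global workspace.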

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions, or must I do something as I would in MS Excel using the mid and find commands? E.g. in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
## V1 V2
## 1 USDZAR Curncy
## 2 R157 Govt
## 3 SPX Index
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me, in that regexp's will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) w/out really knowing the first thing about regexp's.
Edited to reflect @Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.
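Timing it could look roughly like the sketch below, which compares the sub approach with the strsplit approach using system.time from base R. The repetition count is arbitrary, and the timings depend on the machine, so none are shown:

```r
# Repeat the sample data so the run time is measurable; 1e4 is an arbitrary choice.
x <- rep(c("USDZAR Curncy", "R157 Govt", "SPX Index"), 1e4)

# Regex approach: delete the first space and everything after it.
t_sub <- system.time(r1 <- sub(" .*", "", x))

# Split approach: split on a literal space and keep the first piece.
t_split <- system.time(
  r2 <- vapply(strsplit(x, " ", fixed = TRUE), `[`, character(1), 1)
)

identical(r1, r2)
# [1] TRUE

# Compare the elapsed times on your own machine.
rbind(sub = t_sub, strsplit = t_split)
```

For a more careful comparison, a dedicated benchmarking package that repeats each expression many times would give steadier numbers than a single system.time call.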