Remove everything except periods and numbers from a string with regex in R

I know there are many questions on stack overflow regarding regex but I cannot accomplish this one easy task with the available help I've seen. Here's my data:
a<-c("Los Angeles, CA","New York, NY", "San Jose, CA")
b<-c("c(34.0522, 118.2437)","c(40.7128, 74.0059)","c(37.3382, 121.8863)")
df<-data.frame(a,b)
df
                a                    b
1 Los Angeles, CA c(34.0522, 118.2437)
2    New York, NY  c(40.7128, 74.0059)
3    San Jose, CA c(37.3382, 121.8863)
I would like to remove everything but the numbers and the periods (i.e. remove "c", "(" and ")"). This is what I've tried thus far:
str_replace(df$b,"[^0-9.]","" )
[1] "(34.0522, 118.2437)" "(40.7128, 74.0059)" "(37.3382, 121.8863)"
str_replace(df$b,"[^\\d\\)]+","" )
[1] "34.0522, 118.2437)" "40.7128, 74.0059)" "37.3382, 121.8863)"
Not sure what's left to try. I would like to end up with the following:
[1] "34.0522, 118.2437" "40.7128, 74.0059" "37.3382, 121.8863"
Thanks.

If I understand you correctly, this is what you want:
df$b <- gsub("[^[:digit:]., ]", "", df$b)
or:
df$b <- strsplit(gsub("[^[:digit:]. ]", "", df$b), " +")
> df
                a                 b
1 Los Angeles, CA 34.0522, 118.2437
2    New York, NY  40.7128, 74.0059
3    San Jose, CA 37.3382, 121.8863
or if you want all the "numbers" as a numeric vector:
as.numeric(unlist(strsplit(gsub("[^[:digit:]. ]", "", df$b), " +")))
[1] 34.0522 118.2437 40.7128 74.0059 37.3382 121.8863
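For what it's worth, the original attempt fails only because str_replace() replaces just the first match per string. Switching to str_replace_all() and keeping commas and spaces in the character class gives the desired output directly (a sketch using the question's data):

```r
library(stringr)

b <- c("c(34.0522, 118.2437)", "c(40.7128, 74.0059)", "c(37.3382, 121.8863)")

# str_replace() replaces only the first match per string;
# str_replace_all() replaces every match.
res <- str_replace_all(b, "[^0-9., ]", "")
res
# [1] "34.0522, 118.2437" "40.7128, 74.0059"  "37.3382, 121.8863"
```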

Try this (inside a character class the backslashes and | are literal, so a plain [c()] is enough):
gsub("[c()]", "", df$b)
#[1] "34.0522, 118.2437" "40.7128, 74.0059" "37.3382, 121.8863"

Not a regular expression solution, but a simple one.
The elements of b are R expressions, so loop over each element, parsing it, then creating the string you want.
vapply(
  b,
  function(bi) {
    toString(eval(parse(text = bi)))
  },
  character(1)
)
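Run on the question's b vector, that approach gives the following (a sketch; USE.NAMES = FALSE is my addition so the result is an unnamed vector, since vapply otherwise keeps the inputs as names):

```r
b <- c("c(34.0522, 118.2437)", "c(40.7128, 74.0059)", "c(37.3382, 121.8863)")

# Each element of b is a valid R expression, so parse and evaluate it,
# then let toString() rebuild the "x, y" form.
res <- vapply(
  b,
  function(bi) toString(eval(parse(text = bi))),
  character(1),
  USE.NAMES = FALSE
)
res
# [1] "34.0522, 118.2437" "40.7128, 74.0059"  "37.3382, 121.8863"
```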

Here is another option with str_extract_all from stringr: extract the numeric parts into a list, convert them to numeric, rbind the list elements, and cbind the result with the first column of 'df'.
library(stringr)
cbind(df[1], do.call(rbind,
    lapply(str_extract_all(df$b, "[0-9.]+"), as.numeric)))
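For reference, a minimal end-to-end sketch of that approach (the lat/lon column names are my own addition for illustration, not part of the original answer):

```r
library(stringr)

df <- data.frame(a = c("Los Angeles, CA", "New York, NY", "San Jose, CA"),
                 b = c("c(34.0522, 118.2437)", "c(40.7128, 74.0059)",
                       "c(37.3382, 121.8863)"))

# One numeric pair per row -> a two-column matrix
coords <- do.call(rbind, lapply(str_extract_all(df$b, "[0-9.]+"), as.numeric))
res <- cbind(df[1], lat = coords[, 1], lon = coords[, 2])
res
```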

Related

regular expression string between two [ ] in R

I am stuck on regular expressions yet again, but this time in R.
The problem I am facing is that I have a vector from which I would like to extract the string between two square brackets ([ ]) for each row. However, some rows contain more than one [ ] pair, so my current code recovers every bracketed string in those rows. In all cases I only need the first bracketed string, not the second or later ones. An example of the data I have is:
comp541_c0_seq1 gi|356502740|ref|XP_003520174.1| PREDICTED: uncharacterized protein LOC100809655 [Glycine max]
comp5041_c0_seq1 gi|460370622|ref|XP_004231150.1| [Solanum lycopersicum] PREDICTED: uncharacterized protein LOC101250457 [Solanum lycopersicum]
The code I have been using to recover the string and its index, and to store the result as a new column, is:
pattern <- "\\[\\w*\\s\\w*]"
match <- gregexpr(pattern, data$Description)
data$Species <- regmatches(data$Description, match)
The structure of the dataframe that I am using is:
'data.frame': 67911 obs. of 6 variables:
$ Column1 : Factor w/ 67911 levels "comp100012_c0_seq1 ",..: 3344 8565 17875 18974 19059 19220 21429 29791 40214 48529 ...
$ Description : Factor w/ 26038 levels "0.0","1.13142e-173",..: NA NA NA NA NA NA NA NA 7970 NA ...
So the problem with my pattern match is that it returns a vector (Species) where some of the rows have:
[Glycine max] # this is good
c("[Solanum lycopersicum]", "[Solanum lycopersicum]") # I only need one set returned
What I would like is:
[Glycine max]
[Solanum lycopersicum]
I have been trying every way I can with the regular expression. Would anyone know how to improve what I have to just extract the first instance of the string within [ ]?
Thanks in advance.
I think this example should be illuminating for your problem:
txt <- c("[Bracket text]","[Bracket text1] and [Bracket text2]","No brackets in here")
pattern <- "\\[\\w*\\s\\w*]"
mat <- regexpr(pattern, txt)
mat
#[1]  1  1 -1
#attr(,"match.length")
#[1] 14 15 -1
txt[mat != -1] <- regmatches(txt, mat)
txt
#[1] "[Bracket text]" "[Bracket text1]" "No brackets in here"
Or if you want to do it all in one go and return NA values for non-matches, try:
ifelse(mat != -1, regmatches(txt,mat), NA)
#[1] "[Bracket text]" "[Bracket text1]" NA
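The same first-match-only result can also be recovered from gregexpr() output, which is useful if you already have it (a sketch; gregexpr() returns every match per element, so we just keep element [1] of each):

```r
txt <- c("[Bracket text]", "[Bracket text1] and [Bracket text2]", "No brackets in here")
pattern <- "\\[\\w*\\s\\w*]"

# gregexpr() returns all matches per element; keep only the first,
# and NA where there was no match at all (regmatches gives character(0) there)
all_matches <- regmatches(txt, gregexpr(pattern, txt))
res <- vapply(all_matches,
              function(m) if (length(m)) m[1] else NA_character_,
              character(1))
res
# [1] "[Bracket text]"  "[Bracket text1]" NA
```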
Using the base-R facilities for string manipulation is just making life hard for yourself. Use rebus to create the regular expression, and stringi (or stringr) to get the matches.
library(rebus)
library(stringi)
txt <- c("[Bracket text]","[Bracket text1] and [Bracket text2]","No brackets in here") # thanks, thelatemail
pattern <- OPEN_BRACKET %R%
  alnum(1, Inf) %R%
  space(1, Inf) %R%
  alnum(1, Inf) %R%
  "]"
stri_extract_first_regex(txt, pattern)
## [1] "[Bracket text]" "[Bracket text1]" NA
I suspect that you probably don't want to keep those square brackets. Try this variant:
pattern <- OPEN_BRACKET %R%
  capture(
    alnum(1, Inf) %R%
      space(1, Inf) %R%
      alnum(1, Inf)
  ) %R%
  "]"
stri_match_first_regex(txt, pattern)[, 2]
## [1] "Bracket text" "Bracket text1" NA

Extract a substring if it has an exact match in another vector

Update: the first version of this question was implicitly asking how to extract a substring if it has ANY match in another vector, for which @Colonel Beauvel provided an elegant response:
This does the trick, base R:
newname = sapply(nametitle, function(u){
  bool = sapply(name, function(x) grepl(x, u))
  if(any(bool)) name[bool][1] else NA
})
newname
# John Smith, MD PhD       Jane Doe, JD
#             "John"             "Jane"
However, I did not realize that I was actually asking for a way to find exact matches until the function kindly contributed did not work for all elements in my vector. Therefore, the following is my revised question.
Say I have the following character vector of generic names and their academic degrees:
nametitle <- c("John Smith, MD PhD", "Jane Doe, JD", "John-Paul Jones, MS")
And I have a "look-up" vector of first names:
name <- c("John", "Jane", "Mark", "Steve")
What I want to do is search each element of nametitle, and if part of the element (i.e., a substring of each string) is an exact match of an element from name, then in a new vector newname, write that element of nametitle with the corresponding element of name, or if there is no exact match, write the original value from nametitle.
Therefore, what I'd expect the proper function to do is return newname with the three elements below:
[1] "John"                "Jane"                "John-Paul Jones, MS"
I've attempted the following using the function contributed above:
newname = sapply(nametitle, function(u){
  bool = sapply(name, function(x) grepl(x, u))
  if(any(bool)) name[bool][1] else NA
})
This performs just fine for the elements "John Smith, MD PhD" and "Jane Doe, JD", but not for "John-Paul Jones, MS" -- this element is replaced with "John" in the new vector newname.
There may be a simple change that can be made to the original function contributed by @Colonel Beauvel to resolve this issue, but using nested sapply functions is throwing me through a loop (pun intended?). Thanks.
This does the trick, base R:
newname = sapply(nametitle, function(u){
  bool = sapply(name, function(x) grepl(x, u))
  if(any(bool)) name[bool][1] else NA
})
newname
# John Smith, MD PhD       Jane Doe, JD
#             "John"             "Jane"
Here's an easy way. First, create a regex pattern based on your name vector:
pattern <- paste0(".*(?<=\\s|^)(", paste(name, collapse = "|"), ")(?=\\s|$).*")
# [1] ".*(?<=\\s|^)(John|Jane|Mark|Steve)(?=\\s|$).*"
If you use this pattern, a single sub command will do the trick:
sub(pattern, "\\1", nametitle, perl = TRUE)
# [1] "John" "Jane" "John-Paul Jones, MS"
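If you'd rather keep the shape of the original sapply() solution, the same lookarounds fix it there too: \b treats the hyphen as a word boundary, so \bJohn\b would still match "John-Paul", whereas requiring whitespace (or the string edges) around the name does not (a sketch):

```r
nametitle <- c("John Smith, MD PhD", "Jane Doe, JD", "John-Paul Jones, MS")
name <- c("John", "Jane", "Mark", "Steve")

newname <- sapply(nametitle, function(u) {
  # match only when the name is delimited by whitespace or the string edges,
  # so "John" does not match inside "John-Paul"
  bool <- sapply(name, function(x)
    grepl(paste0("(?<=\\s|^)", x, "(?=\\s|$)"), u, perl = TRUE))
  if (any(bool)) name[bool][1] else u
}, USE.NAMES = FALSE)
newname
# [1] "John"                "Jane"                "John-Paul Jones, MS"
```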

Eliminating the characters that are not a date in R

I have a data frame, df, with a column of dates in the following format:
pv$day
01/01/13 00:00:00
03/01/13 00:02:03
04/03/13 00:10:15
....
I would like to eliminate the timestamp, just leaving the date (e.g. 01/01/13 for the first row). I have tried both using sapply() to apply the strsplit() function, and tried to filter the content using a regex, but don't seem to have quite gotten it right in either case. This:
sapply(pv$day, function(x) strsplit(toString(x), ' '))
gives me the column with the correct split, but indexing with either [1] or [[1]] does not return the first element of the split.
What is the best way to go about this?
You can use sub:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
sub(" .+", "", vec)
# [1] "01/01/13" "03/01/13" "04/03/13"
A simple, flexible solution is to use strptime and strftime. Here is an example that uses your dates from the example above:
# Your dates
t <- c("01/01/13 00:00:00","03/01/13 00:02:03", "04/03/13 00:10:15")
# Convert character strings to dates
z <- strptime(t, "%d/%m/%y %H:%M:%OS")
# Convert dates to string, omitting the time
z.date <- strftime(z,"%d/%m/%y")
# Print the first date
z.date[1]
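A related variant (a sketch): if the end goal is a column you can sort and do date arithmetic on, it may be worth keeping an actual Date object rather than going back to a string. as.Date() simply ignores the trailing time once the date fields have been parsed:

```r
t <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")

# Parse day/month/year; strptime-style parsing stops after the date fields,
# so the " HH:MM:SS" tail is ignored
d <- as.Date(t, format = "%d/%m/%y")
format(d, "%d/%m/%y")
# [1] "01/01/13" "03/01/13" "04/03/13"
```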
Here's a nice way to use sapply: it uses strsplit to split at the space.
> d <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
> sapply(strsplit(d, " "), `[`, 1)
# [1] "01/01/13" "03/01/13" "04/03/13"
And also, you could use stringr::word if you just want a character vector.
> library(stringr)
> word(d)
# [1] "01/01/13" "03/01/13" "04/03/13"
Here is an approach using a lookahead assertion:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
gsub(pattern = "(?=00).*$", replacement = "", vec, perl = TRUE)
[1] "01/01/13 " "03/01/13 " "04/03/13 "
The pattern looks for anything at the end of a string that begins with a double 00 and removes it. Note that this keeps the trailing space, and assumes the time portion always starts with 00.

Remove the last comma in a string

I have a data.frame. It looks like this:
name  state
Lily  NY
Tom   NY,NJ,
John  PA,NJ
David SC,PA,NY,
Jim   FL,PA
......
There are more than 100 rows. I just want to remove the last comma in each string, if there is one. My goal is not to strip the last character unconditionally, only a trailing comma.
Use a regular expression? Assuming your data frame is DF:
DF$state <- gsub(",$", "", DF$state)
The regular expression ,$ means every comma that occurs at the end of a string. The command gsub replaces every instance of the first argument with the second argument (in this case, nothing) that occurs in the third argument (DF$state).
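Applied to the data above, for example:

```r
DF <- data.frame(name = c("Lily", "Tom", "John", "David", "Jim"),
                 state = c("NY", "NY,NJ,", "PA,NJ", "SC,PA,NY,", "FL,PA"))

# ",$" anchors the comma to the end of the string, so interior commas survive
DF$state <- gsub(",$", "", DF$state)
DF$state
# [1] "NY"       "NY,NJ"    "PA,NJ"    "SC,PA,NY" "FL,PA"
```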
With R 3.6.0 and later, we can also use trimws with the whitespace parameter set to ",":
DF$state <- trimws(DF$state, whitespace = ",")
DF$state
#[1] "NY" "NY,NJ" "PA,NJ" "SC,PA,NY" "FL,PA"
data
DF <- structure(list(name = c("Lily", "Tom", "John", "David", "Jim"
), state = c("NY", "NY,NJ,", "PA,NJ", "SC,PA,NY,", "FL,PA")),
class = "data.frame", row.names = c(NA, -5L))

R: removing the '$' symbols

I have downloaded some data from a web server, with prices formatted for humans, including $ signs and thousands separators.
> head(m)
[1] $129,900 $139,900 $254,000 $260,000 $290,000 $295,000
I was able to get rid of the commas, using
m <- sub(',','',m)
but
m <- sub('$','',m)
does not remove the dollar sign. If I try mn <- as.numeric(m) or as.integer(m), I get a warning:
Warning message: NAs introduced by coercion
and the result is:
> head(m)
[1] NA NA NA NA NA NA
How can I remove the $ sign? Thanks
A bare $ is the end-of-string anchor in a regex, so put it in a character class (or escape it):
dat <- gsub('[$]','',dat)
dat <- as.numeric(gsub(',','',dat))
> dat
[1] 129900 139900 254000 260000 290000 295000
In one step
gsub('[$]([0-9]+)[,]([0-9]+)','\\1\\2',dat)
[1] "129900" "139900" "254000" "260000" "290000" "295000"
Try this. It means replace anything that is not a digit with the empty string:
as.numeric(gsub("\\D", "", dat))
or to remove anything that is neither a digit nor a decimal:
as.numeric(gsub("[^0-9.]", "", dat))
UPDATE: Added a second similar approach in case the data in the question is not representative.
You could also use:
x <- c("$129,900", "$139,900", "$254,000", "$260,000", "$290,000", "$295,000")
library(qdap)
as.numeric(mgsub(c("$", ","), "", x))
yielding:
> as.numeric(mgsub(c("$", ","), "", x))
[1] 129900 139900 254000 260000 290000 295000
If you wanted to stay in base use the fixed = TRUE argument to gsub:
x <- c("$129,900", "$139,900", "$254,000", "$260,000", "$290,000", "$295,000")
as.numeric(gsub("$", "", gsub(",", "", x), fixed = TRUE))
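Both removals can also be collapsed into a single character class (a sketch; inside [ ] the $ is literal, so no escaping is needed):

```r
x <- c("$129,900", "$139,900", "$254,000", "$260,000", "$290,000", "$295,000")

# "[$,]" matches either a dollar sign or a comma
res <- as.numeric(gsub("[$,]", "", x))
res
# [1] 129900 139900 254000 260000 290000 295000
```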