I have a sequence of addresses and I am trying to replace numbers with ordinals. Right now I have the following.
library(toOrdinal)
addlist<-c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane" )
numstringc<-gsub("\\D", "", addlist)
numstring <-as.integer(numstringc)
ordstring<-sapply(numstring[!is.na(numstring)], toOrdinal)
ordstring
[1] "1st" "4th" "5th" "43rd"
I want to eventually get a vector that says
[1] "east 1st street", "4th ave", "5th blvd", "plaza", "43rd lane"
but I can't figure out how to make that.
With \\1 you can access the part of the match captured in parentheses, but gsub doesn't accept a function as the replacement, so you have to use gsubfn from the package of the same name, which doesn't actually need the \\1 part:
library(gsubfn)
addlist<-c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane" )
ordstring <- gsubfn("[0-9]+", function (x) toOrdinal(as.integer(x)), addlist)
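Alternatively, you can use gregexpr and regmatches to replace them:
m <- gregexpr("[0-9]+", addlist)
# regmatches() returns a list, so convert each element's matches separately
regmatches(addlist, m) <- lapply(regmatches(addlist, m), function(x) vapply(as.integer(x), toOrdinal, character(1)))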
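If neither toOrdinal nor gsubfn is installed, the same replace-each-number idea can be sketched in base R alone. The helper to_ordinal below is hypothetical (it is not part of any package) and only handles the English suffixes st, nd, rd and th.

```r
# Hypothetical helper (not from any package): append the English ordinal suffix.
to_ordinal <- function(n) {
  n <- as.integer(n)
  suffix <- ifelse((n %% 100L) %in% 11:13, "th",   # 11th, 12th, 13th
            ifelse(n %% 10L == 1L, "st",
            ifelse(n %% 10L == 2L, "nd",
            ifelse(n %% 10L == 3L, "rd", "th"))))
  paste0(n, suffix)
}

addlist <- c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane")
m <- gregexpr("[0-9]+", addlist)
# regmatches() gives one character vector per element (empty when nothing matched),
# so the helper is applied element by element.
regmatches(addlist, m) <- lapply(regmatches(addlist, m), to_ordinal)
addlist
```

Elements without a number, such as "plaza", pass through unchanged because their match vector is empty.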
I know there are many questions on Stack Overflow regarding regex, but I cannot accomplish this one easy task with the available help I've seen. Here's my data:
a<-c("Los Angeles, CA","New York, NY", "San Jose, CA")
b<-c("c(34.0522, 118.2437)","c(40.7128, 74.0059)","c(37.3382, 121.8863)")
df<-data.frame(a,b)
df
a b
1 Los Angeles, CA c(34.0522, 118.2437)
2 New York, NY c(40.7128, 74.0059)
3 San Jose, CA c(37.3382, 121.8863)
I would like to remove everything but the numbers and the period (i.e. remove "c", ")" and "("). This is what I've tried thus far:
str_replace(df$b,"[^0-9.]","" )
[1] "(34.0522, 118.2437)" "(40.7128, 74.0059)" "(37.3382, 121.8863)"
str_replace(df$b,"[^\\d\\)]+","" )
[1] "34.0522, 118.2437)" "40.7128, 74.0059)" "37.3382, 121.8863)"
Not sure what's left to try. I would like to end up with the following:
[1] "34.0522, 118.2437" "40.7128, 74.0059" "37.3382, 121.8863"
Thanks.
If I understand you correctly, this is what you want:
df$b <- gsub("[^[:digit:]., ]", "", df$b)
or:
df$b <- strsplit(gsub("[^[:digit:]. ]", "", df$b), " +")
> df
a b
1 Los Angeles, CA 34.0522, 118.2437
2 New York, NY 40.7128, 74.0059
3 San Jose, CA 37.3382, 121.8863
or if you want all the "numbers" as a numeric vector:
as.numeric(unlist(strsplit(gsub("[^[:digit:]. ]", "", df$b), " +")))
[1] 34.0522 118.2437 40.7128 74.0059 37.3382 121.8863
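The attempts in the question left characters behind because str_replace, like base sub, replaces only the first match in each string, whereas gsub (or str_replace_all) replaces every match. A minimal base R sketch of the difference, using the same data:

```r
b <- c("c(34.0522, 118.2437)", "c(40.7128, 74.0059)", "c(37.3382, 121.8863)")

# sub() stops after the first match, so only the leading "c" is removed.
first_only <- sub("[^0-9., ]", "", b)

# gsub() removes every character that is not a digit, period, comma or space.
all_matches <- gsub("[^0-9., ]", "", b)

first_only[1]
all_matches[1]
```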
Try this
gsub("[c()]", "", df$b)
#[1] "34.0522, 118.2437" "40.7128, 74.0059" "37.3382, 121.8863"
Not a regular expression solution, but a simple one.
The elements of b are R expressions, so loop over each element, parsing it, then creating the string you want.
vapply(
  b,
  function(bi) {
    toString(eval(parse(text = bi)))
  },
  character(1)
)
Here is another option with str_extract_all from stringr. Extract the numeric parts into a list with str_extract_all, convert them to numeric, rbind the list elements, and cbind the result with the first column of 'df':
library(stringr)
cbind(df[1], do.call(rbind,
lapply(str_extract_all(df$b, "[0-9.]+"), as.numeric)))
I have a character vector like:
"I t is tim e to g o"
I wanted it to be:
"It is time to go"
This regex works in your case: "\\s(?=\\S\\s\\S{2,}|\\S$)"
string <- "I t is tim e to g o"
gsub("\\s(?=\\S\\s\\S{2,}|\\S$)", "", string, perl=TRUE)
## [1] "It is time to go"
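To see why this works, it helps to list which spaces the lookahead actually selects. The pattern matches a space only when it is followed by a single non-space character that is then followed either by a space and a word of at least two characters, or by the end of the string. A small sketch that prints the positions of the matched spaces:

```r
string <- "I t is tim e to g o"
pattern <- "\\s(?=\\S\\s\\S{2,}|\\S$)"

# gregexpr() returns the start position of every match in the string.
pos <- as.integer(gregexpr(pattern, string, perl = TRUE)[[1]])
pos

# Only the spaces before the stray "t", "e" and final "o" are removed.
cleaned <- gsub(pattern, "", string, perl = TRUE)
cleaned
```

The space before "g" is not matched, because "g" is followed by a space and then only one character, which is why "to" and "go" stay separate words.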
Try this. Replace the match with an empty string.
See demo.
https://regex101.com/r/nL5yL3/32
Using rex may make this type of task a little simpler. Although in this case maybe not :)
string <- "I t is tim e to g o"
library(rex)
re_substitutes(string, rex(
space %if_next_is%
list(
list(non_space, space, at_least(non_space, 2)) %or%
list(non_space, end)
)
), "", global = TRUE)
#> [1] "It is time to go"
a <- "1 \"US\", 2 \"UK\", 3 \"GE\""
I hope to get the following results:
1 US
2 UK
3 GE
Seems like you want something like this,
> a <- c("1 \"US\"","2 \"UK\"","3 \"GE\"")
> gsub("\"", "", a)
[1] "1 US" "2 UK" "3 GE"
OR
> a <- "1 \"US\", 2 \"UK\", 3 \"GE\""
> gsub("\"", "", a)
[1] "1 US, 2 UK, 3 GE"
> gsub("\"|,", "", a)
[1] "1 US 2 UK 3 GE"
\" is the escape sequence for a double quote inside a double-quoted string.
There are no backslashes in your string (a character vector of length 1).
> cat(a)
1 "US", 2 "UK", 3 "GE"
The backslashes you see are there to escape the double quotes, which would otherwise close the string. Compare what it would look like if you were using single quotes to delimit the string (in which case a double quote would not close it):
> identical(a, '1 "US", 2 "UK", 3 "GE"')
[1] TRUE
If you want to remove the commas,
> gsub(",", "", a)
[1] "1 \"US\" 2 \"UK\" 3 \"GE\""
If you want to display it without having it printed as a delimited string and without escaping things in it, use cat. You can even do both.
> cat(gsub(",", "", a))
1 "US" 2 "UK" 3 "GE"
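As a quick check of the point above, and one possible way to get the one-item-per-line output asked for in the question, here is a short sketch. It assumes the items are always separated by a comma followed by a single space.

```r
a <- "1 \"US\", 2 \"UK\", 3 \"GE\""

# The backslashes exist only in the source code, not in the stored string.
has_backslash <- grepl("\\", a, fixed = TRUE)
n_chars <- nchar(a)

# Drop the double quotes, then split on ", " to get one item per element.
items <- strsplit(gsub("\"", "", a, fixed = TRUE), ", ", fixed = TRUE)[[1]]
writeLines(items)
```

writeLines prints each element on its own line without quoting or escaping, much like cat does for a single string.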
I'm having trouble with this regular expression. Consider the following vector.
> vec <- c("new jersey", "south dakota", "virginia:chincoteague",
"washington:whidbey island", "new york:main")
Of those strings that contain a :, I would like to keep only the ones with main after :, resulting in
[1] "new jersey" "south dakota" "new york:main"
So far, I've only been able to get there with this ugly nested nightmare, which is quite obviously far from optimal.
> g1 <- grep(":", vec)
> vec[ -g1[grep("main", grep(":", vec, value = TRUE), invert = TRUE)] ]
# [1] "new jersey" "south dakota" "new york:main"
How can I write a single regular expression to keep :main but remove others containing : ?
Using | (pick the strings that contain :main or that do not contain : at all):
> vec <- c("new jersey", "south dakota", "virginia:chincoteague",
+ "washington:whidbey island", "new york:main")
> grep(":main|^[^:]*$", vec)
[1] 1 2 5
> vec[grep(":main|^[^:]*$", vec)]
[1] "new jersey" "south dakota" "new york:main"
You can use this single simple regex:
^[^:]+(?::main.*)?$
Not sure about the exact R code, but something like
vec[grepl("^[^:]+(?::main.*)?$", vec, perl=TRUE)]
Explanation
The ^ anchor asserts that we are at the beginning of the string
The [^:]+ matches one or more characters that are not a colon
The optional non-capturing group (?::main.*)? matches a colon, main and any chars that follow
The $ anchor asserts that we are at the end of the string
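The two patterns suggested in this thread agree on the example vector but are not equivalent in general. The sketch below runs both on the original data and on one extra, made-up string with two colons ("a:b:main", not from the question) to show where they differ.

```r
vec <- c("new jersey", "south dakota", "virginia:chincoteague",
         "washington:whidbey island", "new york:main")

alt_pattern      <- ":main|^[^:]*$"          # contains ":main", or has no colon at all
anchored_pattern <- "^[^:]+(?::main.*)?$"    # the first colon, if any, must start ":main"

kept_alt      <- vec[grepl(alt_pattern, vec)]
kept_anchored <- vec[grepl(anchored_pattern, vec, perl = TRUE)]

# Hypothetical edge case: the alternation accepts it, the anchored pattern does not.
edge <- "a:b:main"
edge_alt      <- grepl(alt_pattern, edge)
edge_anchored <- grepl(anchored_pattern, edge, perl = TRUE)
```

Which behaviour is preferable depends on whether strings with more than one colon can occur in the data.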
I would like to capture the first match, and return NA if there is no match.
regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1] 1 -1 3 1
# attr(,"match.length")
# [1] 1 -1 1 2
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1] "a" "a" "aa"
So I expected "a", NA, "a", "aa"
Staying with regexpr:
r <- regexpr("a+", x)
out <- rep(NA,length(x))
out[r!=-1] <- regmatches(x, r)
out
#[1] "a" NA "a" "aa"
Use regexec instead, since it returns a list, which lets you catch the character(0) elements before unlisting:
R <- regmatches(x, regexec("a+", x))
unlist({R[sapply(R, length)==0] <- NA; R})
# [1] "a" NA "a" "aa"
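The regexpr approach can be wrapped in a small reusable function. The name first_match is hypothetical, and the sketch assumes that x contains no NA values, since regexpr returns NA for those and the logical index below would then fail.

```r
# Hypothetical helper: first match of `pattern` in each element of `x`, NA where none.
# Assumes `x` has no NA elements.
first_match <- function(pattern, x, ...) {
  m <- regexpr(pattern, x, ...)
  out <- rep(NA_character_, length(x))   # start with all-NA character vector
  out[m != -1L] <- regmatches(x, m)      # fill in only the elements that matched
  out
}

x <- c("abc", "def", "cba a", "aa")
res <- first_match("a+", x)
res
```

Extra arguments such as perl = TRUE are passed through to regexpr.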
In R 3.3.0, it is possible to pull out both the matched and the non-matched substrings using the invert=NA argument. The help file says:
if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).
The output is a list. In most cases of interest (matching a single pattern with regexpr), regmatches with this argument returns a list whose elements have length 3 or 1: length 1 when no match is found and length 3 when there is a match.
myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] "" "a" "bc"
[[2]]
[1] "def"
[[3]]
[1] "cb" "a" " a"
[[4]]
[1] "" "aa" ""
So to extract what you want (with "" in place of NA), you can use sapply as follows:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a" "" "a" "aa"
At this point, if you really want NA instead of "", you can use
is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a" NA "a" "aa"
Some revisions:
Note that you can collapse the last two lines into a single line:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})
The default data type of NA is logical, so using it will result in additional data conversions. Using the character version, NA_character_, avoids this.
An even slicker extraction method for the final line is to use [:
sapply(myMatch, `[`, 2)
[1] "a" NA "a" "aa"
So you can do the whole thing in a fairly readable single line:
sapply(regmatches(x, m, invert=NA), `[`, 2)
Using more or less the same construction as yours -
chars <- c("abc", "def", "cba a", "aa")
chars[
regexpr("a+", chars, perl=TRUE) > 0
][1] #abc
chars[
regexpr("q", chars, perl=TRUE) > 0
][1] #NA
#vector[
# find all indices where regexpr returned positive value i.e., match was found
#][return the first element of the above subset]
Edit - Seems like I misunderstood the question. But since two people have found this useful I shall let it stay.
You can use stringr::str_extract(string, pattern). It returns NA where there is no match, and it has a simpler interface than regmatches().
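A minimal sketch of that call on the vector from the question, assuming the stringr package is installed:

```r
library(stringr)

x <- c("abc", "def", "cba a", "aa")
# str_extract() returns the first match per element, or NA when nothing matches.
res <- str_extract(x, "a+")
res
```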