I have a sequence of addresses and I am trying to replace numbers with ordinals. Right now I have the following.
library(toOrdinal)
addlist<-c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane" )
numstringc<-gsub("\\D", "", addlist)
numstring <-as.integer(numstringc)
ordstring<-sapply(numstring[!is.na(numstring)], toOrdinal)
ordstring
[1] "1st" "4th" "5th" "43rd"
I want to eventually get a vector that says
[1] "east 1st street", "4th ave", "5th blvd", "plaza", "43rd lane"
but I can't figure out how to make that.
With \\1 you can access the part of the matched expression in parentheses, but gsub doesn't allow functions in the replacement. So you have to use gsubfn from the package of the same name, which actually doesn't need the \\1 part:
library(gsubfn)
addlist<-c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane" )
ordstring <- gsubfn("[0-9]+", function (x) toOrdinal(as.integer(x)), addlist)
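Alternatively, you can use gregexpr and regmatches to replace them:
m <- gregexpr("[0-9]+", addlist)
# regmatches() returns a list (empty for "plaza"), so convert each element's matches separately
regmatches(addlist, m) <- lapply(regmatches(addlist, m), function(x) vapply(as.integer(x), toOrdinal, character(1)))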
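If you want to try the gregexpr/regmatches approach without installing anything, here is a minimal base R sketch. Note that `ordinal` is a hypothetical stand-in I wrote for illustration (English suffixes only); it is not the toOrdinal package's implementation.

```r
# Hypothetical helper (an assumption, not part of toOrdinal): English ordinal suffixes.
ordinal <- function(n) {
  if ((n %% 100) %in% 11:13) {
    suffix <- "th"                      # 11th, 12th, 13th are irregular
  } else {
    suffix <- switch(as.character(n %% 10),
                     "1" = "st", "2" = "nd", "3" = "rd", "th")
  }
  paste0(n, suffix)
}

addlist <- c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane")
m <- gregexpr("[0-9]+", addlist)

# One list element per input string; strings without digits get character(0).
regmatches(addlist, m) <- lapply(regmatches(addlist, m), function(nums) {
  vapply(as.integer(nums), ordinal, character(1))
})
addlist
# [1] "east 1st street" "4th ave"         "5th blvd"        "plaza"           "43rd lane"
```

Working element by element with lapply is what lets "plaza" (no digits) pass through unchanged.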
Consider
text <- "who let the dogs out"
fooo <- strsplit(text, " ")
fooo
[[1]]
[1] "who" "let" "the" "dogs" "out"
The output of strsplit is a list. The list's first element is then a vector that contains the words above.
Why does the function behave that way? Is there any case in which it would return a list with more than one element?
And I can access the words using
fooo[[1]][1]
[1] "who"
However, is there no simpler way?
To your first question: one reason that comes to mind is that a list can hold result vectors of different lengths in the same object, since strsplit is vectorized over x:
text <- "who let the dogs out"
vtext <- c(text, "who let the")
##
> strsplit(text, " ")
[[1]]
[1] "who" "let" "the" "dogs" "out"
> strsplit(vtext, " ")
[[1]]
[1] "who" "let" "the" "dogs" "out"
[[2]]
[1] "who" "let" "the"
If this were to be returned as a data.frame, matrix, etc... instead of a list, it would have to be padded with additional elements.
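As for a simpler way to get at the words, here is a small sketch using only base R. The variable names (`words`, `flat`, `parts`, `padded`) are mine, chosen for illustration. It also shows the NA padding a matrix result would need.

```r
text  <- "who let the dogs out"
vtext <- c(text, "who let the")

# For a single string, take the first list element or flatten the list.
words <- strsplit(text, " ")[[1]]
flat  <- unlist(strsplit(text, " "))
words[1]
# [1] "who"

# For several strings, the list keeps one vector per input.
parts   <- strsplit(vtext, " ")
n_words <- sapply(parts, length)   # 5 3

# A rectangular result needs padding: indexing past the end yields NA.
padded <- t(sapply(parts, function(p) p[seq_len(max(n_words))]))
```

Be careful with `unlist` on a multi-element input: it merges all words into one vector and you lose track of which string each word came from.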
I have a character vector like:
"I t is tim e to g o"
I wanted it to be:
"It is time to go"
This regex works in your case: "\\s(?=\\S\\s\\S{2,}|\\S$)"
string <- "I t is tim e to g o"
gsub("\\s(?=\\S\\s\\S{2,}|\\S$)", "", string, perl=TRUE)
## [1] "It is time to go"
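To see why the lookahead picks the right spaces, you can list the match positions before replacing anything. This is just a sketch for inspection; `hits` is a name I made up. A space matches only if it is followed by a single non-space character and then either a word of two or more characters or the end of the string.

```r
string  <- "I t is tim e to g o"
pattern <- "\\s(?=\\S\\s\\S{2,}|\\S$)"

# Positions of the spaces that the pattern selects for removal.
hits <- as.integer(gregexpr(pattern, string, perl = TRUE)[[1]])
hits
# [1]  2 11 18

# The characters at those positions are all spaces.
substring(string, hits, hits)
```

Positions 2, 11 and 18 are the spaces inside "I t", "tim e" and "g o"; the spaces between real words are left alone.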
Try this. Replace with an empty string.
See the demo:
https://regex101.com/r/nL5yL3/32
Using rex may make this type of task a little simpler. Although in this case maybe not :)
string <- "I t is tim e to g o"
library(rex)
re_substitutes(string, rex(
  space %if_next_is%
    list(
      list(non_space, space, at_least(non_space, 2)) %or%
        list(non_space, end)
    )
), "", global = TRUE)
#> [1] "It is time to go"
In R you can use the strsplit function to split a vector on a delimiter (split) as follows:
x <- "What is this? It's an onion. What! That's| Well Crazy."
unlist(strsplit(x, "[\\?\\.\\!\\|]", perl=TRUE))
## [1] "What is this" " It's an onion" " What" " That's"
## [5] " Well Crazy"
I'd like to keep the delimiter (split) in the output. So the desired output would be:
## [1] "What is this?" " It's an onion." " What!" " That's|"
## [5] " Well Crazy."
You can use "(?<=DELIMITERS)":
unlist(strsplit(x, "(?<=[?.!|])", perl=TRUE))
## [1] "What is this?" " It's an onion." " What!" " That's|"
## [5] " Well Crazy."
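A different technique that gives the same result here is to extract the pieces instead of splitting: match a run of non-delimiter characters followed by one delimiter. This is a sketch with base R's gregexpr and regmatches, not the lookbehind approach above. It assumes every piece ends in a delimiter, so any trailing text without one would be dropped.

```r
x <- "What is this? It's an onion. What! That's| Well Crazy."

# Zero or more non-delimiters, then exactly one delimiter.
pieces <- regmatches(x, gregexpr("[^?.!|]*[?.!|]", x))[[1]]
pieces
# [1] "What is this?"   " It's an onion." " What!"          " That's|"
# [5] " Well Crazy."
```

Wrap the result in `trimws()` if you don't want the leading spaces.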
I am trying to write a program with regular expressions to clean up some data. Let's say I have room names with a letter and a number. In the final output I need to output the room names using the pattern "the full string (excluding letter & number) + letter + number" as in the examples below. However, with the regular expressions I've written so far, I get very messed up results, which are at the bottom of my message. For some reason, it puts letters and characters on some of the rows, even though there may be none in the input data. Thank you.
EDITED: I made edits to the input data. I would like to generalize the code to take any number of character strings, not just the single word "ROOM".
# the pattern should be "the full string (excluding letter & number) + letter + number". For example:
ATLANTA ROOM
ATLANTA ROOM 3
NEW YORK ROOM A 2
ROOM A 4
THE BIG AWESOME ROOM B
ROOM B 4
GEORGETOWN ROOM B 2
NEW YORK ROOM C 2
NEW YORK ROOM C
LOS ANGELES ROOM E 2
# program to clean with regular expressions. there could be multiple spaces between words
dd <- c("ATLANTA ROOM ",
" ATLANTA ROOM 3",
"NEW YORK A ROOM 2",
"4 ROOM A",
"THE BIG AWESOME ROOM B",
" ROOM 4 B",
"GEORGETOWN B 2 ROOM ",
" C NEW YORK ROOM 2",
"NEW YORK ROOM C",
"LOS ANGELES ROOM 2 E")
m_char_num <- regexpr("(\\<A|B|C|D|E|1|2|3|4\\>)", dd)
m_char <- regexpr("(\\<A|B|C|D|E\\>)", dd)
m_num <- regexpr("(\\<1|2|3|4\\>)", dd)
(dd2 <- paste(gsub("( +)", " ",
gsub("(^ +)|( +$)", "",
gsub("(\\<A|B|C|D|E|1|2|3|4\\>)", "", dd))),
regmatches(dd, m_char), regmatches(dd, m_num), sep = " "))
# actual output from the program
"TLANTA ROOMA3",
"TLANTA ROOMA2",
"NW YORK ROOMA4",
"ROOMA4",
"TH IG WSOM ROOME2",
"ROOMB2",
"GORGTOWN ROOMB2",
"NW YORK ROOMC3",
"NW YORK ROOMC2",
"LOS NGLS ROOMA4"
Here's an attempt:
sub(' $', '', # clean up spaces at the end
gsub(' +', ' ', # clean up double spaces
# rearrange letter and numbers
sub('^([A-Z]?)([0-9]*)([A-Z]?)$', 'ROOM \\1\\3 \\2',
gsub(' |ROOM', '', dd) # remove spaces and ROOM
)
)
)
#[1] "ROOM" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B" "ROOM B 4" "ROOM B 2"
#[8] "ROOM C 2" "ROOM C" "ROOM E 2"
And here's the same logic for the edited OP and comment below (assuming room names are words that have at least 3 letters and at most a 2-letter room designation):
gsub('(^ | $)', '', # clean up spaces in front or end
gsub(' +', ' ', # clean up double spaces
# extract room name and put it in front of the letter and number
paste(gsub('\\b([A-Z][A-Z]?|[0-9]+)\\b', '', dd, perl = TRUE),
sub('^([A-Z]+)?([0-9]*)([A-Z]+)?$', '\\1\\3 \\2',
gsub(' |\\w\\w\\w+', '', dd) # remove spaces and words
)
)
)
)
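If the nested regex calls are hard to follow, the same rearrangement can be sketched by splitting each name into tokens and sorting them into three groups. `clean_room` is a hypothetical helper name. It assumes the room letter is a token consisting of exactly one capital letter and the room number is an all-digit token; everything else is treated as part of the name and kept in its original order.

```r
dd <- c("ATLANTA ROOM ", " ATLANTA ROOM 3", "NEW YORK A ROOM 2", "4 ROOM A",
        "THE BIG AWESOME ROOM B", " ROOM 4 B", "GEORGETOWN B 2 ROOM ",
        " C NEW YORK ROOM 2", "NEW YORK ROOM C", "LOS ANGELES ROOM 2 E")

# Hypothetical helper: name words first, then the letter, then the number.
clean_room <- function(x) {
  tokens <- strsplit(trimws(x), " +")   # tolerate repeated spaces
  vapply(tokens, function(tok) {
    is_num    <- grepl("^[0-9]+$", tok)
    is_letter <- grepl("^[A-Z]$", tok)
    paste(c(tok[!is_num & !is_letter], tok[is_letter], tok[is_num]),
          collapse = " ")
  }, character(1))
}

clean_room(dd)
```

On the sample input this should give "ATLANTA ROOM", "ATLANTA ROOM 3", "NEW YORK ROOM A 2", and so on, matching the desired pattern from the question.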
What's happening is that regmatches() returns only the matches it finds, so e.g. your program only finds 8 letters, and instead of inserting "" or NA for the missing ones, paste() recycles them.
Here is a fix:
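# Use a character class so the word boundaries apply to every letter and digit
m_char <- regexpr("\\<[A-E]\\>", dd)
m_num <- regexpr("\\<[1-4]\\>", dd)
# regmatches() drops non-matches, so fill pre-sized vectors at the matched positions
numbers <- rep("", length(dd))
numbers[m_num > 0] <- regmatches(dd, m_num)
letters <- rep("", length(dd))  # note: this masks base R's built-in `letters`
letters[m_char > 0] <- regmatches(dd, m_char)
# trimws() is base R (>= 3.2.0); gsub() collapses the double space left by an empty letter
output <- gsub(" +", " ", trimws(paste("ROOM", letters, numbers)))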
[1] "ROOM" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B" "ROOM B 4" "ROOM B 2" "ROOM C 2" "ROOM C"
[10] "ROOM E 2"
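The recycling problem described above is easy to reproduce on a tiny input. This is only an illustrative sketch; the vector `x` and the names `found` and `aligned` are made up for the example.

```r
x <- c("ROOM A 1", "ROOM", "ROOM B 2")
m <- regexpr("[0-9]", x)

# regmatches() silently drops the element with no match.
found <- regmatches(x, m)
found
# [1] "1" "2"

# paste() recycles the shorter vector, so values end up on the wrong rows.
paste(x, found)
# [1] "ROOM A 1 1" "ROOM 2"     "ROOM B 2 1"

# Filling a pre-sized vector keeps every match aligned with its input.
aligned <- rep("", length(x))
aligned[m > 0] <- found
aligned
# [1] "1" ""  "2"
```

Note that paste() gives no warning when it recycles, which is why the wrong output is easy to miss.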
Try this:
library(gsubfn)
# extract numbers (num) and room letters (char)
# (assumes each name is just "ROOM" plus an optional letter A-E and a number)
num <- sapply(strapplyc(dd, "\\d|$"), paste, collapse = "")
char <- sapply(strapplyc(dd, "[A-E]|$"), paste, collapse = "")
# put back together and sort
out <- sort(paste("ROOM", char, num))
# trim spaces (optional)
out <- gsub(" +", " ", sub(" *$", "", out))
> out
[1] "ROOM" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B" "ROOM B 2"
[7] "ROOM B 4" "ROOM C" "ROOM C 2" "ROOM E 2"