change position of word within a string in r

I have a string vector that looks like:
> string_vec
[1] "XXX" "Snakes On A Plane" "Mask of the Ninja" "Ruslan"
[5] "Kill Switch" "Buddy Holly Story, The" "Believers, The" "Closet, The"
[9] "Eyes of Tammy Faye, The" "Gymnast, The" "Hunger, The"
Some names end in ", The". I want to delete the comma and the space and move "The" in front of the rest of the text.
For example, "Buddy Holly Story, The" becomes "The Buddy Holly Story".
Isolating the records with the pattern was easy:
string_vec[grepl("[A-Za-z]+, The$", string_vec)]
How can I adjust the position now?
Data:
string_vec <- c("XXX", "Snakes On A Plane", "Mask of the Ninja",
"Ruslan",
"Kill Switch", "Buddy Holly Story, The", "Believers, The",
"Closet, The",
"Eyes of Tammy Faye, The", "Gymnast, The", "Hunger, The")

You may try
sub('^(.*), The', 'The \\1', string_vec)
#[1] "XXX" "Snakes On A Plane" "Mask of the Ninja"
#[4] "Ruslan" "Kill Switch" "The Buddy Holly Story"
#[7] "The Believers" "The Closet" "The Eyes of Tammy Faye"
#[10] "The Gymnast" "The Hunger"
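To spell out what the pattern does: (.*) captures everything before ", The", and \\1 in the replacement puts that captured text back after the new leading "The". A small sketch of why anchoring the pattern with $ may be safer; the title "Hunger, The Sequel" is made up for illustration and is not in the original data.

```r
titles <- c("Buddy Holly Story, The", "Snakes On A Plane", "Hunger, The Sequel")

# Unanchored: ", The" is matched anywhere, so the made-up third title is mangled
unanchored <- sub("^(.*), The", "The \\1", titles)
# "The Buddy Holly Story" "Snakes On A Plane" "The Hunger Sequel"

# Anchored with $: only a trailing ", The" is moved to the front
anchored <- sub("^(.*), The$", "The \\1", titles)
# "The Buddy Holly Story" "Snakes On A Plane" "Hunger, The Sequel"
```

On the vector in the question both versions give the same result, since every ", The" there sits at the end of the string.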

Related

using toOrdinal to replace numbers with ordinals

I have a sequence of addresses and I am trying to replace numbers with ordinals. Right now I have the following.
library(toOrdinal)
addlist<-c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane" )
numstringc<-gsub("\\D", "", addlist)
numstring <-as.integer(numstringc)
ordstring<-sapply(numstring[!is.na(numstring)], toOrdinal)
ordstring
[1] "1st" "4th" "5th" "43rd"
I want to eventually get a vector that says
[1] "east 1st street", "4th ave", "5th blvd", "plaza", "43rd lane"
but I can't figure out how to make that.

With \\1 you can access the part of the matched expression in parentheses, but gsub doesn't allow functions in the replacement. So you have to use gsubfn from the package of the same name, which doesn't actually need the \\1 part:
library(gsubfn)
library(toOrdinal)
addlist<-c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane" )
ordstring <- gsubfn("[0-9]+", function (x) toOrdinal(as.integer(x)), addlist)
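Alternatively, you can use gregexpr and regmatches to replace them:
m <- gregexpr("[0-9]+", addlist)
# regmatches() returns a list here (one element per string), so convert element by element
regmatches(addlist, m) <- lapply(regmatches(addlist, m),
  function(x) vapply(as.integer(x), toOrdinal, character(1)))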
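For anyone who wants to try the regmatches approach without installing anything, here is a minimal base-R sketch. Note that to_ordinal is a hypothetical stand-in written for this example (English suffixes only); it is not the toOrdinal package function.

```r
# Hypothetical helper, NOT toOrdinal::toOrdinal: appends an English ordinal suffix
to_ordinal <- function(n) {
  suffix <- if ((n %% 100) %in% 11:13) {
    "th"                                    # 11th, 12th, 13th
  } else {
    switch(as.character(n %% 10), "1" = "st", "2" = "nd", "3" = "rd", "th")
  }
  paste0(n, suffix)
}

addlist <- c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane")
out <- addlist
m <- gregexpr("[0-9]+", out)

# One list element per string; strings without digits yield character(0)
regmatches(out, m) <- lapply(regmatches(out, m), function(x) {
  vapply(as.integer(x), to_ordinal, character(1))
})
out
# "east 1st street" "4th ave" "5th blvd" "plaza" "43rd lane"
```

Working on a copy (out) keeps the original addlist intact, since the replacement form of regmatches modifies its argument.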

Why does strsplit return a list

Consider
text <- "who let the dogs out"
fooo <- strsplit(text, " ")
fooo
[[1]]
[1] "who" "let" "the" "dogs" "out"
The output of strsplit is a list. Its first element is a vector that contains the words above.
Why does the function behave that way? Is there any case in which it would return a list with more than one element?
I can access the words using
fooo[[1]][1]
[1] "who"
but is there no simpler way?

To your first question, one reason that comes to mind is that it lets strsplit keep result vectors of different lengths in the same object, since it is vectorized over x:
text <- "who let the dogs out"
vtext <- c(text, "who let the")
##
> strsplit(text, " ")
[[1]]
[1] "who" "let" "the" "dogs" "out"
> strsplit(vtext, " ")
[[1]]
[1] "who" "let" "the" "dogs" "out"
[[2]]
[1] "who" "let" "the"
If this were to be returned as a data.frame, matrix, etc... instead of a list, it would have to be padded with additional elements.
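As for the "simpler way" part of the question, a few base-R options, sketched here with the same example strings (the variable names are only for illustration):

```r
vtext <- c("who let the dogs out", "who let the")
parts <- strsplit(vtext, " ", fixed = TRUE)

# One list element per input string, each with its own length
n_words <- lengths(parts)                             # 5 3

# Single input string: take element [[1]] once and keep the plain vector
words <- strsplit(vtext[1], " ", fixed = TRUE)[[1]]
words[1]                                              # "who"

# Several strings: pull the first word of each, or flatten everything
first_words <- vapply(parts, `[`, character(1), 1)    # "who" "who"
all_words <- unlist(parts)
```

unlist discards the information about which string each word came from, so it mostly suits the single-string case.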

How to remove unwanted space between words inside a character vector using R?

I have a character vector like:
"I t is tim e to g o"
I wanted it to be:
"It is time to go"

This regex works in your case: "\\s(?=\\S\\s\\S{2,}|\\S$)"
string <- "I t is tim e to g o"
gsub("\\s(?=\\S\\s\\S{2,}|\\S$)", "", string, perl=TRUE)
## [1] "It is time to go"
Try this. Replace each match with an empty string.
See the demo:
https://regex101.com/r/nL5yL3/32
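A short sketch of how the lookahead in the first pattern picks its spaces. A space is removed when it is followed either by a single non-space character, a space, and then a word of at least two characters, or by a single non-space character at the very end of the string. This is a heuristic that fits the example string; it is not a general way to repair broken words.

```r
string <- "I t is tim e to g o"
pattern <- "\\s(?=\\S\\s\\S{2,}|\\S$)"

# Positions of the spaces that the pattern selects for removal
hits <- as.integer(gregexpr(pattern, string, perl = TRUE)[[1]])
hits
# 2 11 18  (the spaces inside "I t", "tim e" and "g o")

fixed <- gsub(pattern, "", string, perl = TRUE)
fixed
# "It is time to go"
```

perl = TRUE is needed because lookahead, (?=...), is not supported by R's default regex engine.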

Using rex may make this type of task a little simpler. Although in this case maybe not :)
string <- "I t is tim e to g o"
library(rex)
re_substitutes(string, rex(
  space %if_next_is%
    list(
      list(non_space, space, at_least(non_space, 2)) %or%
        list(non_space, end)
    )
), "", global = TRUE)
#> [1] "It is time to go"

R split on delimiter (split) keep the delimiter (split)

In R you can use the strsplit function to split a vector on a delimiter (split) as follows:
x <- "What is this? It's an onion. What! That's| Well Crazy."
unlist(strsplit(x, "[\\?\\.\\!\\|]", perl=TRUE))
## [1] "What is this" " It's an onion" " What" " That's"
## [5] " Well Crazy"
I'd like to keep the delimiter (split) using R. So the desired output would be:
## [1] "What is this?" " It's an onion." " What!" " That's|"
## [5] " Well Crazy."

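You can use a lookbehind, "(?<=DELIMITERS)", which matches the empty position right after a delimiter, so nothing is consumed by the split: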
unlist(strsplit(x, "(?<=[?.!|])", perl=TRUE))
## [1] "What is this?" " It's an onion." " What!" " That's|"
## [5] " Well Crazy."
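If the leading spaces on the pieces are unwanted, one option is to trim them afterwards. A minimal sketch using the same input; trimws is base R (3.2.0 or later).

```r
x <- "What is this? It's an onion. What! That's| Well Crazy."

# Split at the empty position after each delimiter; the delimiter stays with its piece.
# Inside a character class, ? . ! and | need no escaping.
parts <- unlist(strsplit(x, "(?<=[?.!|])", perl = TRUE))

# Drop the whitespace left at the start of each piece
trimmed <- trimws(parts)
trimmed
# "What is this?" "It's an onion." "What!" "That's|" "Well Crazy."
```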

Regular expressions to re-order strings in a field

I am trying to write a program with regular expressions to clean up some data. Let's say I have room names with a letter and a number. In the final output I need to output the room names using the pattern "the full string (excluding letter & number) + letter + number" as in the examples below. However, with the regular expressions I've written so far, I get very messed up results, which are at the bottom of my message. For some reason, it puts letters and characters on some of the rows, even though there may be none in the input data. Thank you.
EDITED: I made edits to the input data. I would like to generalize the code to take any number of character strings, not just the single word "ROOM".
# the pattern should be "the full string (excluding letter & number) + letter + number". For example:
ATLANTA ROOM
ATLANTA ROOM 3
NEW YORK ROOM A 2
ROOM A 4
THE BIG AWESOME ROOM B
ROOM B 4
GEORGETOWN ROOM B 2
NEW YORK ROOM C 2
NEW YORK ROOM C
LOS ANGELES ROOM E 2
# program to clean with regular expressions. there could be multiple spaces between words
dd <- c("ATLANTA ROOM ",
" ATLANTA ROOM 3",
"NEW YORK A ROOM 2",
"4 ROOM A",
"THE BIG AWESOME ROOM B",
" ROOM 4 B",
"GEORGETOWN B 2 ROOM ",
" C NEW YORK ROOM 2",
"NEW YORK ROOM C",
"LOS ANGELES ROOM 2 E")
m_char_num <- regexpr("(\\<A|B|C|D|E|1|2|3|4\\>)", dd)
m_char <- regexpr("(\\<A|B|C|D|E\\>)", dd)
m_num <- regexpr("(\\<1|2|3|4\\>)", dd)
(dd2 <- paste(gsub("( +)", " ",
gsub("(^ +)|( +$)", "",
gsub("(\\<A|B|C|D|E|1|2|3|4\\>)", "", dd))),
regmatches(dd, m_char), regmatches(dd, m_num), sep = " "))
# actual output from the program
"TLANTA ROOMA3",
"TLANTA ROOMA2",
"NW YORK ROOMA4",
"ROOMA4",
"TH IG WSOM ROOME2",
"ROOMB2",
"GORGTOWN ROOMB2",
"NW YORK ROOMC3",
"NW YORK ROOMC2",
"LOS NGLS ROOMA4"

Here's an attempt:
sub(' $', '',                 # clean up spaces at the end
  gsub(' +', ' ',             # clean up double spaces
    # rearrange letter and numbers
    sub('^([A-Z]?)([0-9]*)([A-Z]?)$', 'ROOM \\1\\3 \\2',
      gsub(' |ROOM', '', dd)  # remove spaces and ROOM
    )
  )
)
#[1] "ROOM" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B" "ROOM B 4" "ROOM B 2"
#[8] "ROOM C 2" "ROOM C" "ROOM E 2"
Here is the same logic for the edited question and the comment below (assuming room names are words of at least 3 letters and the room designation has at most 2 letters):
gsub('(^ | $)', '',           # clean up spaces in front or at the end
  gsub(' +', ' ',             # clean up double spaces
    # extract room name and put it in front of the letter and number
    paste(gsub('\\b([A-Z][A-Z]?|[0-9]+)\\b', '', dd, perl = TRUE),
      sub('^([A-Z]+)?([0-9]*)([A-Z]+)?$', '\\1\\3 \\2',
        gsub(' |\\w\\w\\w+', '', dd)  # remove spaces and words
      )
    )
  )
)
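For comparison, a base-R sketch of the generalized case that pulls out the three pieces separately. It assumes the room designation is a single standalone capital letter and the number is a standalone run of digits, so a one-letter word inside a room name would be misread. first_match is a hypothetical helper written for this example.

```r
dd <- c("ATLANTA ROOM ", " ATLANTA ROOM 3", "NEW YORK A ROOM 2", "4 ROOM A",
        "THE BIG AWESOME ROOM B", " ROOM 4 B", "GEORGETOWN B 2 ROOM ",
        " C NEW YORK ROOM 2", "NEW YORK ROOM C", "LOS ANGELES ROOM 2 E")

# Hypothetical helper: first match per string, "" where nothing matches
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep("", length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

letter <- first_match("\\b[A-Z]\\b", dd)     # standalone letter
number <- first_match("\\b[0-9]+\\b", dd)    # standalone number
name <- gsub("\\b([A-Z]|[0-9]+)\\b", "", dd, perl = TRUE)  # everything else

# Reassemble as name + letter + number, then squeeze and trim spaces
cleaned <- trimws(gsub(" +", " ", paste(name, letter, number)))
cleaned
```

On the sample data this gives the ten strings listed as the desired output in the question.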

What's happening is that regmatches drops the elements with no match, so your program finds, for example, only 8 letters for 10 strings, and instead of inserting "" or NA, paste recycles them.
Here is a fix:
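# \\< and \\> now wrap the whole character class, so only standalone letters/digits match
m_char <- regexpr("\\<[A-E]\\>", dd)
m_num <- regexpr("\\<[1-4]\\>", dd)
# start from empty strings and fill in only where a match was found
numbers <- rep("", length(dd))
numbers[m_num > 0] <- regmatches(dd, m_num)
room_letters <- rep("", length(dd))  # avoids masking the built-in `letters`
room_letters[m_char > 0] <- regmatches(dd, m_char)
# trimws() is base R; gsub() squeezes the double space left by an empty letter
output <- gsub(" +", " ", trimws(paste("ROOM", room_letters, numbers)))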
[1] "ROOM" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B" "ROOM B 4" "ROOM B 2" "ROOM C 2" "ROOM C"
[10] "ROOM E 2"
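The recycling problem can be reproduced on a tiny made-up vector (the three strings below are not from the question). regmatches with regexpr output returns only the matches that exist, so the result can be shorter than the input, and paste then silently recycles it.

```r
x <- c("ROOM A", "ROOM", "ROOM B")
m <- regexpr("\\b[A-E]\\b", x, perl = TRUE)

found_only <- regmatches(x, m)
found_only
# "A" "B"   (length 2, although x has length 3)

# paste() recycles the shorter vector without a warning
misaligned <- paste(x, found_only)
# "ROOM A A" "ROOM B" "ROOM B A"

# Keeping alignment: start with "" everywhere, fill in only where a match exists
aligned <- rep("", length(x))
aligned[m > 0] <- found_only
aligned
# "A" "" "B"
```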

Try this:
library(gsubfn)
# extract numbers (num) and room letters (char)
num <- sapply(strapplyc(dd, "\\d|$"), paste, collapse = "")
char <- sapply(strapplyc(dd, "[A-D]|$"), paste, collapse = "")
# put back together and sort
out <- sort(paste("ROOM", char, num))
# trim spaces (optional)
out <- gsub(" +", " ", sub(" *$", "", out))
> out
[1] "ROOM" "ROOM 2" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B"
[7] "ROOM B 2" "ROOM B 4" "ROOM C" "ROOM C 2"
