R: Abbreviate state names in strings - regex

I have strings with state names in them. How do I efficiently abbreviate them? I am aware of state.abb[grep("New York", state.name)] but this works only if "New York" is the whole string. I have, for example, "Walmart, New York". Thanks in advance!
Let's assume this input:
x = c("Walmart, New York", "Hobby Lobby (California)", "Sold in Sears in Illinois")
Edit: desired outputs will be a la "Walmart, NY", "Hobby Lobby (CA)", "Sold in Sears in IL". As you can see, the state can appear anywhere in a string.

Here's a base R way, using gregexpr(), regmatches(), and regmatches<-():
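abbreviateStateNames <- function(x) {
    # One alternation pattern covering every full state name
    pat <- paste(state.name, collapse="|")
    m <- gregexpr(pat, x)
    # Map each matched name to its abbreviation
    ff <- function(x) state.abb[match(x, state.name)]
    regmatches(x, m) <- lapply(regmatches(x, m), ff)
    x
}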
x <- c("Hobby Lobby (California)",
"Hello New York City, here I come (from Greensboro North Carolina)!")
abbreviateStateNames(x)
# [1] "Hobby Lobby (CA)"
# [2] "Hello NY City, here I come (from Greensboro NC)!"
Alternatively -- and quite a bit more naturally -- you can accomplish the same thing using the gsubfn package:
library(gsubfn)
pat <- paste(state.name, collapse="|")
gsubfn(pat, function(x) state.abb[match(x, state.name)], x)
[1] "Hobby Lobby (CA)"
[2] "Hello NY City, here I come (from Greensboro NC)!"
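As a hedged variation on the base R function above, the sketch below wraps the alternation in word boundaries, so a state name is only replaced when it stands as a whole word, and runs it on the input from the question. The \\b boundaries and the name abbreviate_states are additions for illustration, not part of the original answer.

```r
# Sketch only: same gregexpr()/regmatches<-() idea, plus word boundaries.
# `abbreviate_states` is a hypothetical name chosen for this example.
abbreviate_states <- function(x) {
  pat <- paste0("\\b(", paste(state.name, collapse = "|"), ")\\b")
  m <- gregexpr(pat, x)
  # Replace every matched full name with the abbreviation at the same index
  regmatches(x, m) <- lapply(regmatches(x, m),
                             function(s) state.abb[match(s, state.name)])
  x
}

x <- c("Walmart, New York", "Hobby Lobby (California)", "Sold in Sears in Illinois")
result <- abbreviate_states(x)
print(result)
```

Strings without any state name pass through unchanged, because regmatches() returns an empty match set for them.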

Related

readHTMLTable is not giving me the information I want

I am trying to analyze some Formula 1 data. Wikipedia has a table with the data I want. I am importing the data into R with the code below:
library(XML)
library(RCurl)
url <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
tabs <- getURL(url)
tabs <- readHTMLTable(tabs, stringsAsFactors=FALSE)
pilots <- tabs[[3]]
pilots <- pilots[-dim(pilots)[1], ]
head(pilots[, 1])
[1] "Abate, CarloCarlo Abate"
[2] "Abecassis, GeorgeGeorge Abecassis"
[3] "Acheson, KennyKenny Acheson"
[4] "Adamich, Andrea deAndrea de Adamich"
[5] "Adams, PhilippePhilippe Adams"
[6] "Ader, WaltWalt Ader"
However, the pilot names come out strange: each one is prefixed with its sort key ("Abate, Carlo" followed directly by "Carlo Abate"). I'd like them to be like this:
head(pilots[, 1])
[1] "Carlo Abate"
[2] "George Abecassis"
[3] "Kenny Acheson"
[4] "Andrea de Adamich"
[5] "Philippe Adams"
[6] "Walt Ader"
However, I have not been able to write a regex that deals with this problem, or to find an argument for readHTMLTable that ignores the sortkey value in the table I am interested in. How can I solve my problem?
Use readHTMLTable with a bespoke elFun argument.
library(XML)
library(RCurl)
url <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
tabs <- getURL(url)
myFun <- function(x){
    # If the cell contains links, return only the link text
    # (this leaves out the sort key text that precedes the link)
    if(length(getNodeSet(x, ".//a")) > 0){
        value <- xpathSApply(x, ".//a", fun = xmlValue)
        return(paste(value, collapse = ","))
    }
    # Otherwise fall back to the cell's plain text
    xmlValue(x, encoding = "UTF-8")
}
tabs <- readHTMLTable(tabs, elFun = myFun, stringsAsFactors=FALSE)
pilots <- tabs[[3]]
pilots <- pilots[-dim(pilots)[1], ]
> head(pilots[, 1])
[1] "Carlo Abate" "George Abecassis" "Kenny Acheson" "Andrea de Adamich"
[5] "Philippe Adams" "Walt Ader"
> pilots[1,]
Name Country Seasons Championships Entries Starts Poles Wins Podiums Fastest laps Points[note]
1 Carlo Abate Italy 1962,1963 0 2 0 0 0 0 0 0
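For the regex route the question asks about, here is a rough sketch that works on the already-imported strings. It assumes every affected value has the shape "Last, First" immediately followed by "First Last", as in the six sample names; the helper name strip_sortkey is made up for this example, and values that do not fit the pattern are returned unchanged.

```r
# Sample values copied from the question; no network access is needed.
raw <- c("Abate, CarloCarlo Abate",
         "Abecassis, GeorgeGeorge Abecassis",
         "Acheson, KennyKenny Acheson",
         "Adamich, Andrea deAndrea de Adamich",
         "Adams, PhilippePhilippe Adams",
         "Ader, WaltWalt Ader")

# Group 1 = last name, group 2 = first name(s). The backreferences inside
# the pattern require the visible "First Last" text to follow the sort key.
strip_sortkey <- function(x) {
  sub("^(.+), (.+)\\2 \\1$", "\\2 \\1", x, perl = TRUE)
}

cleaned <- strip_sortkey(raw)
print(cleaned)
```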

String rearrangement in R

I have a long list of city names, each with its province (state) name. This is a partial list of my data:
data <- c('Ranchi_Capital_State_Jharkhand', 'Bokaro_State_Jharkhand', 'Tata Nagar_State_Jharkhand', 'Ramgarh_State_Jharkhand',
'Pune_State_Maharashtra', 'Mumbai_Capital_State_Maharashtra', 'Nagpur_State_Maharashtra')
I want to rearrange each entry so that the state comes first, like this: State_Jharkhand_Bokaro. If the city is a capital, then State_Jharkhand_Capital_Ranchi. Also note that a city name or state name may consist of one word or more than one word (e.g. Tata Nagar).
What is the most efficient way to do this (without using any loop)?
You could use the below gsub function.
> data <- c('Ranchi_Capital_State_Jharkhand', 'Bokaro_State_Jharkhand', 'Tata Nagar_State_Jharkhand', 'Ramgarh_State_Jharkhand',
+ 'Pune_State_Maharashtra', 'Mumbai_Capital_State_Maharashtra', 'Nagpur_State_Maharashtra')
> gsub("^(?:(.*?)(_Capital))?(.*?)_(State.*)", "\\4\\2_\\1\\3", data)
[1] "State_Jharkhand_Capital_Ranchi" "State_Jharkhand_Bokaro"
[3] "State_Jharkhand_Tata Nagar" "State_Jharkhand_Ramgarh"
[5] "State_Maharashtra_Pune" "State_Maharashtra_Capital_Mumbai"
[7] "State_Maharashtra_Nagpur"
This doesn't really use much regex, but is mostly based on the expected position of the information. Split the strings by "_" and then reorder them as required:
data
# [1] "Ranchi_Capital_State_Jharkhand" "Bokaro_State_Jharkhand"
# [3] "Tata Nagar_State_Jharkhand" "Ramgarh_State_Jharkhand"
# [5] "Pune_State_Maharashtra" "Mumbai_Capital_State_Maharashtra"
# [7] "Nagpur_State_Maharashtra"
A <- strsplit(data, "_", TRUE)
sapply(A, function(x) {
  if (length(x) == 3) {
    paste(x[c(2, 3, 1)], collapse = "_")
  } else if (length(x) == 4) {
    paste(x[c(3, 4, 2, 1)], collapse = "_")
  } else {
    stop("unexpected length")
  }
})
# [1] "State_Jharkhand_Capital_Ranchi" "State_Jharkhand_Bokaro"
# [3] "State_Jharkhand_Tata Nagar" "State_Jharkhand_Ramgarh"
# [5] "State_Maharashtra_Pune" "State_Maharashtra_Capital_Mumbai"
# [7] "State_Maharashtra_Nagpur"
I don't know if using sapply breaks your requirement of "without using any loop" though.
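If an explicit loop-free call is preferred, the regex idea can also be written as a single vectorised sub() with one optional group. This is a simplified sketch rather than the pattern from the answer above; it assumes every entry ends in "_State_<name>" and that "_Capital", when present, sits directly before it.

```r
data <- c('Ranchi_Capital_State_Jharkhand', 'Bokaro_State_Jharkhand',
          'Tata Nagar_State_Jharkhand', 'Ramgarh_State_Jharkhand',
          'Pune_State_Maharashtra', 'Mumbai_Capital_State_Maharashtra',
          'Nagpur_State_Maharashtra')

# (.*?)        city name (lazy, so it stops before an optional "_Capital")
# (_Capital)?  optional capital marker; empty in the replacement if absent
# (State_.*)   "State_" plus the state name
rearranged <- sub("^(.*?)(_Capital)?_(State_.*)$", "\\3\\2_\\1", data, perl = TRUE)
print(rearranged)
```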

Use strsplit to get last character in r

I have a file of baby names that I am reading in and then trying to get the last character in the baby name. For example, the file looks like..
Name Sex
Anna F
Michael M
David M
Sarah F
I read this in using
sourcenames = read.csv("babynames.txt", header=F, sep=",")
I ultimately want to end up with my result looking like..
Name Last Initial Sex
Michael l M
Sarah h F
I've managed to split the name into separate characters..
sourceout = strsplit(as.character(sourcenames$Name),'')
But now I'm stuck on how to get the last letter, so in the case of Michael, how to get 'l'. I thought tail() might work, but it's returning the last few records, not the last character in each Name element.
Any help or advice is greatly appreciated.
Thanks :)
For your strsplit method to work, you can use tail with sapply
df$LastInit <- sapply(strsplit(as.character(df$Name), ""), tail, 1)
df
# Name Sex LastInit
# 1 Anna F a
# 2 Michael M l
# 3 David M d
# 4 Sarah F h
Alternatively, you can use substring
with(df, substring(Name, nchar(Name)))
# [1] "a" "l" "d" "h"
Try this function from stringi package:
require(stringi)
x <- c("Ala", "Sarah","Meg")
stri_sub(x, from = -1, to = -1)
This function extracts substrings between from and to index. If indexes are negative, then it counts characters from the end of a string. So if from=-1 and to=-1 it means that we want substring from last to last character :)
Why use stringi? Just look at these benchmarks :)
require(microbenchmark)
require(stringr) # provides str_extract(), used in the comparison below
x <- sample(x,1000,T)
microbenchmark(stri_sub(x,-1), str_extract(x, "[a-z]{1}$"), gsub(".*(.)$", "\\1", x),
sapply(strsplit(as.character(x), ""), tail, 1), substring(x, nchar(x)))
Unit: microseconds
expr min lq median uq max neval
stri_sub(x, -1) 56.378 63.4295 80.6325 85.4170 139.158 100
str_extract(x, "[a-z]{1}$") 718.579 764.4660 821.6320 863.5485 1128.715 100
gsub(".*(.)$", "\\1", x) 478.676 493.4250 509.9275 533.8135 673.233 100
sapply(strsplit(as.character(x), ""), tail, 1) 12165.470 13188.6430 14215.1970 14771.4800 21723.832 100
substring(x, nchar(x)) 133.857 135.9355 141.2770 147.1830 283.153 100
Here is another option using data.table (for relatively clean syntax) and stringr (easier grammar).
library(data.table); library(stringr)
df = read.table(text="Name Sex
Anna F
Michael M
David M
Sarah F", header=T)
setDT(df) # convert to data.table
df[, "Last Initial" := str_extract(Name, "[a-z]{1}$") ][]
Name Sex Last Initial
1: Anna F a
2: Michael M l
3: David M d
4: Sarah F h
One liner:
x <- c("abc","123","Male")
regmatches(x,regexpr(".$", x))
## [1] "c" "3" "e"
You can do it with a Regular Expression and gsub:
sourcenames$last.letter = gsub(".*(.)$", "\\1", sourcenames$Name)
sourcenames
Name Sex last.letter
1 Anna F a
2 Michael M l
3 David M d
4 Sarah F h
You can also try the str_sub() function from the stringr package:
library(dplyr)
library(stringr)
library(babynames)
babynames %>%
select(name,sex) %>%
mutate(last_letter = str_sub(name,-1,-1)) %>%
head()
dplyr approach:
sourcenames %>% rowwise() %>% mutate("Last Initial" = strsplit(as.character(Name),'') %>% unlist() %>% .[length(.)])
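Putting the pieces together, a minimal base R sketch that produces the layout asked for in the question might look like this. The inline data stands in for babynames.txt, and the column name LastInitial (without a space) is a choice made for this example.

```r
# Inline stand-in for the file described in the question
sourcenames <- read.table(text = "Name Sex
Anna F
Michael M
David M
Sarah F", header = TRUE, stringsAsFactors = FALSE)

# substring() from position nchar(Name) to nchar(Name) is the last character
n <- nchar(sourcenames$Name)
sourcenames$LastInitial <- substring(sourcenames$Name, n, n)

# Reorder the columns to match the desired output
result <- sourcenames[, c("Name", "LastInitial", "Sex")]
print(result)
```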

How to separate the variables of a particular column in a CSV file and write to a CSV file in R?

I have a CSV file like
Market,CampaignName,Identity
Wells Fargo,Gary IN MetroChicago IL Metro,56
EMC,Los Angeles CA MetroBoston MA Metro,78
Apple,Cupertino CA Metro,68
Desired Output to a CSV file with the first row as the headers
Market,City,State,Identity
Wells Fargo,Gary,IN,56
Wells Fargo,Chicago,IL,56
EMC,Los Angeles,CA,78
EMC,Boston,MA,78
Apple,Cupertino,CA,68
res <-
gsub('(.*) ([A-Z]{2})*Metro (.*) ([A-Z]{2}) .*','\\1,\\2:\\3,\\4',
xx$Market)
How to modify the above regular expressions to get the result in R?
New to R, any help is appreciated.
library(stringr)
xx.to.split <- with(xx, setNames(gsub("Metro", "", as.character(CampaignName)), Market))
do.call(rbind, str_match_all(xx.to.split, "(.+?) ([A-Z]{2}) ?"))[, -1]
Produces:
[,1] [,2]
Wells Fargo "Gary" "IN"
Wells Fargo "Chicago" "IL"
EMC "Los Angeles" "CA"
EMC "Boston" "MA"
Apple "Cupertino" "CA"
This should work even if you have a different number of campaign names in each market. Unfortunately I think base options are annoying to implement because, frustratingly, there isn't a gregexec (it was only added to base R later, in version 4.1.0), although I'd be curious if someone comes up with something comparably compact in base.
Here is a solution using base R. Split the CampaignName column on the string Metro adding sequential numbers as names. stack turns it into a data frame with columns ind and values which we massage into DF1. Merge that with xx by the sequence numbers of DF1 and the row numbers of xx. Move Market to the front of DF2 and remove ind and CampaignName. Finally write it out.
xx <- read.csv("Campaign.csv", as.is = TRUE)
s <- strsplit(xx$CampaignName, " Metro")
names(s) <- seq_along(s)
ss <- stack(s)
DF1 <- with(ss, data.frame(ind,
City = sub(" ..$", "", values),
State = sub(".* ", "", values)))
DF2 <- merge(DF1, xx, by.x = "ind", by.y = 0)
DF <- DF2[ c("Market", setdiff(names(DF2), c("ind", "Market", "CampaignName"))) ]
write.csv(DF, file = "myfile.csv", row.names = FALSE, quote = FALSE)
REVISED to handle extra columns after poster modified the question to include such. Minor improvements.
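A more compact base R variant of the same split-and-expand idea is sketched below. It skips stack() and merge() by repeating each row's other columns once per city, and it assumes exactly the three columns shown in the question, with the inline text standing in for the CSV file.

```r
xx <- read.csv(text = "Market,CampaignName,Identity
Wells Fargo,Gary IN MetroChicago IL Metro,56
EMC,Los Angeles CA MetroBoston MA Metro,78
Apple,Cupertino CA Metro,68", stringsAsFactors = FALSE)

# Split each campaign string into its "City ST" pieces
pieces <- strsplit(xx$CampaignName, " Metro", fixed = TRUE)
n <- lengths(pieces)     # number of cities per original row
flat <- unlist(pieces)

out <- data.frame(
  Market   = rep(xx$Market, n),
  City     = sub(" [A-Z]{2}$", "", flat),  # drop the trailing state code
  State    = sub("^.* ", "", flat),        # keep only the last word
  Identity = rep(xx$Identity, n),
  stringsAsFactors = FALSE
)
print(out)
# write.csv(out, file = "myfile.csv", row.names = FALSE, quote = FALSE)
```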

Check if term preceding keyword is a number

Given the string "3 Design Features", I'm trying to check whether the term preceding "Design Features" is a number, using the code below. (The number can appear as 2 or 2.)
score=0;
str = <P>3 Design Features</P>
regexp_number = "/^[0-9]+./";
if(str_detect(y,regexp_number) ==TRUE)
{
score=score++;
}
This returns 0. What am I doing wrong here? Hoping someone can point out?
Thanks in advance.
-Simak
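Your regex is wrong. In R the pattern is an ordinary string, so the surrounding / characters are matched as literal slashes, and the unescaped . matches any character rather than an optional literal dot. Note also that ^ anchors the match to the start of the string, so leading markup such as <P> has to be removed first.
Change it to
regexp_number = "^[0-9]+\\.?";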
w <- "aghj 3 Design Features kjkl"
x <- "aghj 3. Design Features kjkl"
y <- "aghj c Design Features kjkl"
z <- "4 aghj c gn Features kjkl"
fun <- function(x) grepl("[[:digit:]]",
regmatches(x,
regexpr(".\\.?(?= Design Features)",x,perl = TRUE)))
fun(w)
[1] TRUE
fun(x)
[1] TRUE
fun(y)
[1] FALSE
fun(z)
logical(0)
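A single grepl() call can give a plain TRUE/FALSE for every input, including strings where the keyword is missing. The sketch below is one possible way to write it; the function name has_number_before is hypothetical, and the keyword is pasted into the pattern as-is, so it is assumed to contain no regex metacharacters.

```r
# TRUE when digits (optionally followed by a literal dot) come directly
# before the keyword; FALSE otherwise, including when the keyword is absent.
has_number_before <- function(x, keyword = "Design Features") {
  pat <- paste0("\\b[0-9]+\\.? ", keyword)
  grepl(pat, x, perl = TRUE)
}

inputs <- c("aghj 3 Design Features kjkl",
            "aghj 3. Design Features kjkl",
            "aghj c Design Features kjkl",
            "4 aghj c gn Features kjkl",
            "<P>3 Design Features</P>")
hits <- has_number_before(inputs)
print(hits)

# R has no ++ operator; count the matches with ordinary arithmetic instead
score <- sum(hits)
print(score)
```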