Use strsplit to get last character in r - regex

I have a file of baby names that I am reading in and then trying to get the last character in the baby name. For example, the file looks like..
Name Sex
Anna F
Michael M
David M
Sarah F
I read this in using
sourcenames = read.csv("babynames.txt", header=F, sep=",")
I ultimately want to end up with my result looking like..
Name Last Initial Sex
Michael l M
Sarah h F
I've managed to split the name into separate characters..
sourceout = strsplit(as.character(sourcenames$Name),'')
But now where I'm stuck is how to get the last letter, so in the case of Michael, how to get 'l'. I thought tail() might work but its returning the last few records, not the last character in each Name element.
Any help or advice is greatly appreciated.
Thanks :)

For your strsplit method to work, you can use tail with sapply
df$LastInit <- sapply(strsplit(as.character(df$Name), ""), tail, 1)
df
# Name Sex LastInit
# 1 Anna F a
# 2 Michael M l
# 3 David M d
# 4 Sarah F h
Alternatively, you can use substring
with(df, substring(Name, nchar(Name)))
# [1] "a" "l" "d" "h"

Try this function from stringi package:
require(stringi)
x <- c("Ala", "Sarah","Meg")
stri_sub(x, from = -1, to = -1)
This function extracts substrings between from and to index. If indexes are negative, then it counts characters from the end of a string. So if from=-1 and to=-1 it means that we want substring from last to last character :)
Why use stringi? Just look at this benchmarks :)
require(microbenchmark)
x <- sample(x,1000,T)
microbenchmark(stri_sub(x,-1), str_extract(x, "[a-z]{1}$"), gsub(".*(.)$", "\\1", x),
sapply(strsplit(as.character(x), ""), tail, 1), substring(x, nchar(x)))
Unit: microseconds
expr min lq median uq max neval
stri_sub(x, -1) 56.378 63.4295 80.6325 85.4170 139.158 100
str_extract(x, "[a-z]{1}$") 718.579 764.4660 821.6320 863.5485 1128.715 100
gsub(".*(.)$", "\\\\1", x) 478.676 493.4250 509.9275 533.8135 673.233 100
sapply(strsplit(as.character(x), ""), tail, 1) 12165.470 13188.6430 14215.1970 14771.4800 21723.832 100
substring(x, nchar(x)) 133.857 135.9355 141.2770 147.1830 283.153 100

Here is another option using data.table (for relatively clean syntax) and stringr (easier grammar).
library(data.table); library(stringr)
df = read.table(text="Name Sex
Anna F
Michael M
David M
Sarah F", header=T)
setDT(df) # convert to data.table
df[, "Last Initial" := str_extract(Name, "[a-z]{1}$") ][]
Name Sex Last Initial
1: Anna F a
2: Michael M l
3: David M d
4: Sarah F h

One liner:
x <- c("abc","123","Male")
regmatches(x,regexpr(".$", x))
## [1] "c" "3" "e"

You can do it with a Regular Expression and gsub:
sourcenames$last.letter = gsub(".*(.)$", "\\1", sourcenames$Name)
sourcenames
Name Sex last.letter
1 Anna F a
2 Michael M l
3 David M d
4 Sarah F h

you can try this one... str_sub() function in stringr package would help you.
library(dplyr)
library(stringr)
library(babynames)
babynames %>%
select(name,sex) %>%
mutate(last_letter = str_sub(name,-1,-1)) %>%
head()

dplyr approach:
sourcenames %>% rowwise() %>% mutate("Last Initial" = strsplit(as.character(Name),'') %>% unlist() %>% .[length(.)])

Related

R - extract all strings matching pattern and create relational table

I am looking for a shorter and more pretty solution (possibly in tidyverse) to the following problem. I have a data.frame "data":
id string
1 A 1.001 xxx 123.123
2 B 23,45 lorem ipsum
3 C donald trump
4 D ssss 134, 1,45
What I wanted to do is to extract all numbers (no matter if the delimiter is "." or "," -> in this case I assume that string "134, 1,45" can be extracted into two numbers: 134 and 1.45) and create a data.frame "output" looking similar to this:
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
I managed to do this (code below) but the solution is pretty ugly for me also not so efficient (two for-loops). Could someone suggest a better way to do do this (preferably using dplyr)
# data
data <- data.frame(id = c("A", "B", "C", "D"),
string = c("1.001 xxx 123.123",
"23,45 lorem ipsum",
"donald trump",
"ssss 134, 1,45"),
stringsAsFactors = FALSE)
# creating empty data.frame
len <- length(unlist(sapply(data$string, function(x) gregexpr("[0-9]+[,|.]?[0-9]*", x))))
output <- data.frame(id = rep(NA, len), string = rep(NA, len))
# main solution
start = 0
for(i in 1:dim(data)[1]){
tmp_len <- length(unlist(gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i])))
for(j in (start+1):(start+tmp_len)){
output[j,1] <- data$id[i]
output[j,2] <- regmatches(data$string[i], gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i]))[[1]][j-start]
}
start = start + tmp_len
}
# further modifications
output$string <- gsub(",", ".", output$string)
output$string <- as.numeric(ifelse(substring(output$string, nchar(output$string), nchar(output$string)) == ".",
substring(output$string, 1, nchar(output$string) - 1),
output$string))
output
1) Base R This uses relatively simple regular expressions and no packages.
In the first 2 lines of code replace any comma followed by a space with a
space and then replace all remaining commas with a dot. After these two lines s will be: c("1.001 xxx 123.123", "23.45 lorem ipsum", "donald trump", "ssss 134 1.45")
In the next 4 lines of code trim whitespace from beginning and end of each string field and split the string field on whitespace producing a
list. grep out those elements consisting only of digits and dots. (The regular expression ^[0-9.]*$ matches the start of a word followed by zero or more digits or dots followed by the end of the word so only words containing only those characters are matched.) Replace any zero length components with NA. Finally add data$id as the names. After these 4 lines are run the list L will be list(A = c("1.001", "123.123"), B = "23.45", C = NA, D = c("134", "1.45")) .
In the last line of code convert the list L to a data frame with the appropriate names.
s <- gsub(", ", " ", data$string)
s <- gsub(",", ".", s)
L <- strsplit(trimws(s), "\\s+")
L <- lapply(L, grep, pattern = "^[0-9.]*$", value = TRUE)
L <- ifelse(lengths(L), L, NA)
names(L) <- data$id
with(stack(L), data.frame(id = ind, string = values))
giving:
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
2) magrittr This variation of (1) writes it as a magrittr pipeline.
library(magrittr)
data %>%
transform(string = gsub(", ", " ", string)) %>%
transform(string = gsub(",", ".", string)) %>%
transform(string = trimws(string)) %>%
with(setNames(strsplit(string, "\\s+"), id)) %>%
lapply(grep, pattern = "^[0-9.]*$", value = TRUE) %>%
replace(lengths(.) == 0, NA) %>%
stack() %>%
with(data.frame(id = ind, string = values))
3) dplyr/tidyr This is an alternate pipeline solution using dplyr and tidyr. unnest converts to long form, id is made factor so that we can later use complete to recover id's that are removed by subsequent filtering, the filter removes junk rows and complete inserts NA rows for each id that would otherwise not appear.
library(dplyr)
library(tidyr)
data %>%
mutate(string = gsub(", ", " ", string)) %>%
mutate(string = gsub(",", ".", string)) %>%
mutate(string = trimws(string)) %>%
mutate(string = strsplit(string, "\\s+")) %>%
unnest() %>%
mutate(id = factor(id))
filter(grepl("^[0-9.]*$", string)) %>%
complete(id)
4) data.table
library(data.table)
DT <- as.data.table(data)
DT[, string := gsub(", ", " ", string)][,
string := gsub(",", ".", string)][,
string := trimws(string)][,
string := setNames(strsplit(string, "\\s+"), id)][,
list(string = list(grep("^[0-9.]*$", unlist(string), value = TRUE))), by = id][,
list(string = if (length(unlist(string))) unlist(string) else NA_character_), by = id]
DT
Update Removed assumption that junk words do not have digit or dot. Also added (2), (3) and (4) and some improvements.
We can replace the , in between the numbers with . (using gsub), extract the numbers with str_extract_all (from stringr into a list), replace the list elements that have length equal to 0 with NA, set the names of the list with 'id' column, stack to convert the list to data.frame and rename the columns.
library(stringr)
setNames(stack(setNames(lapply(str_extract_all(gsub("(?<=[0-9]),(?=[0-9])", ".",
data$string, perl = TRUE), "[0-9.]+"), function(x)
if(length(x)==0) NA else as.numeric(x)), data$id))[2:1], c("id", "string"))
# id string
#1 A 1.001
#2 A 123.123
#3 B 23.45
#4 C NA
#5 D 134
#6 D 1.45
Same idea as Gabor's. I had hoped to use R's built-in parsing of strings (type.convert, used in read.table) rather than writing custom regex substitutions:
sp = setNames(strsplit(data$string, " "), data$id)
spc = lapply(sp, function(x) {
x = x[grep("[^0-9.,]$", x, invert=TRUE)]
if (!length(x))
NA_real_
else
mapply(type.convert, x, dec=gsub("[^.,]", "", x), USE.NAMES=FALSE)
})
setNames(rev(stack(spc)), names(data))
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
Unfortunately, type.convert is not robust enough to consider both decimal delimiters at once, so we need this mapply malarkey instead of type.convert(x, dec = "[.,]").

Subset all 3 digit numbers and collapse them with a separator in a data frame. R

I'm formating a data set so each entry has the adegenet format for codominant markers, such as:
Loci1
###/###
208/210
200/204
198/208
where the # represents any digit (the number is a allele size in basepairs). My data has some homozygous entries (all 3 digit integers with no separator) that have the the form of:
Loci1
###
208
198
I intend to paste the 3 digit string to itself with sep='/' to produce the first format. I've tried to use grep to subset these homozygous entries by finding all non ###/### and negating the match using the table matching such as:
a <- grep('\\b\\d{3}?[/]\\d{3}', score$Loci1, value =T ) # Subset all ###/###/
score[!(a %in% 1:nrow(score$Loci1)), ] # works but only on vectors...
After the subset I could paste. The problem arises when I apply this to a data frame. grep seems to treat the data frame as a list (which in part it is) and returns columns that have a match.
So in short how can I go from ### to ###/### in a data frame
self contained example of data:
score2 <- NULL
set.seed(9)
Loci1 <- NULL
Loci2 <- NULL
Loci3 <- NULL
for (i in 1:5) Loci1 <- append(Loci1, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci2 <- append(Loci2, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci3 <- append(Loci3, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
score2 <- data.frame(Loci1, Loci2, Loci3, stringsAsFactors = F)
score2[2,3] <- strsplit(score2[2,3], split = '/')[1]
score2[5,2] <- strsplit(score2[3,3], split = '/')[1]
score2[1,1] <- strsplit(score2[1,1], split = '/')[1]
score2[c(1, 4),c(2,3)] <- NA
score2
You could just replace the 3 digit items with the separator and a copy:
sub("^(...)$", "\\1/\\1", Loci1)
Use lapply with an anonymized function:
data.frame( lapply(score2, function(x) sub("^(...)$", "\\1/\\1", x) ) )
Loci1 Loci2 Loci3
1 251/251 <NA> <NA>
2 251/329 320/257 260/260
3 275/242 278/329 281/320
4 269/266 <NA> <NA>
5 296/326 281/281 326/314
(Not sure what the "paste-part" was supposed to refer to, but I think this was the intent of your question)
If the numeric values could have a varying number of digits then use a pattern argument like "^([0-9]{1,9})$"
An option using grep/paste,
m1 <- as.matrix(score2)
indx <- grep('^...$', m1)
m1[indx] <- paste(m1[indx], m1[indx], sep="/")
as.data.frame(m1)
# Loci1 Loci2 Loci3
#1 251/251 <NA> <NA>
#2 251/329 320/257 260/260
#3 275/242 278/329 281/320
#4 269/266 <NA> <NA>
#5 296/326 281/281 326/314
Or without converting to matrix, this can be done using lapply
score2[] <- lapply(score2, function(x) ifelse(grepl('^...$', x),
paste(x, x, sep="/"),x))

How to separate the variables of a particular column in a CSV file and write to a CSV file in R?

I have a CSV file like
Market,CampaignName,Identity
Wells Fargo,Gary IN MetroChicago IL Metro,56
EMC,Los Angeles CA MetroBoston MA Metro,78
Apple,Cupertino CA Metro,68
Desired Output to a CSV file with the first row as the headers
Market,City,State,Identity
Wells Fargo,Gary,IN,56
Wells Fargo,Chicago,IL,56
EMC,Los Angeles,CA,78
EMC,Boston,MA,78
Apple,Cupertino,CA,68
res <-
gsub('(.*) ([A-Z]{2})*Metro (.*) ([A-Z]{2}) .*','\\1,\\2:\\3,\\4',
xx$Market)
How to modify the above regular expressions to get the result in R?
New to R, any help is appreciated.
library(stringr)
xx.to.split <- with(xx, setNames(gsub("Metro", "", as.character(CampaignName)), Market))
do.call(rbind, str_match_all(xx.to.split, "(.+?) ([A-Z]{2}) ?"))[, -1]
Produces:
[,1] [,2]
Wells Fargo "Gary" "IN"
Wells Fargo "Chicago" "IL"
EMC "Los Angeles" "CA"
EMC "Boston" "MA"
Apple "Cupertino" "CA"
This should work even if you have different number of Compaign Names in each market. Unfortunately I think base options are annoying to implement because frustratingly there isn't a gregexec, although I'd be curious if someone comes up with something comparably compact in base.
Here is a solution using base R. Split the CampaignName column on the string Metro adding sequential numbers as names. stack turns it into a data frame with columns ind and values which we massage into DF1. Merge that with xx by the sequence numbers of DF1 and the row numbers of xx. Move Market to the front of DF2 and remove ind and CampaignName. Finally write it out.
xx <- read.csv("Campaign.csv", as.is = TRUE)
s <- strsplit(xx$CampaignName, " Metro")
names(s) <- seq_along(s)
ss <- stack(s)
DF1 <- with(ss, data.frame(ind,
City = sub(" ..$", "", values),
State = sub(".* ", "", values)))
DF2 <- merge(DF1, xx, by.x = "ind", by.y = 0)
DF <- DF2[ c("Market", setdiff(names(DF2), c("ind", "Market", "CampaignName"))) ]
write.csv(DF, file = "myfile.csv", row.names = FALSE, quote = FALSE)
REVISED to handle extra columns after poster modified the question to include such. Minor improvements.

How to add column to data.table with values from list based on regex

I have the following data.table:
id fShort
1 432-12 1245
2 3242-12 453543
3 324-32 45543
4 322-34 45343
5 2324-34 13543
DT <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
and the following list:
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
I would like to create a new column "fComplete" that includes the complete filename from the list. For this the values of column "id" need to be matched with the filename-list. If the filename starts with the "id" string, the complete filename should be returned. I use the following regex
t <- grep("432-12","432-124343.png",value=T)
that return the correct filename.
This is how the final table should look like:
id fShort fComplete
1 432-12 1245 432-124343.png
2 3242-12 453543 3242-124342345.png
3 324-32 45543 NA
4 322-34 45343 NA
5 2324-34 13543 NA
DT2 <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fshort=c("1245", "453543", "45543", "45343", "13543"),
fComplete = c("432-124343.png", "3242-124342345.png", NA, NA, NA))
I tried using apply and data.table approaches but I always get warnings like
argument 'pattern' has length > 1 and only the first element will be used
What is a simple approach to accomplish this?
Here's a data.table solution:
DT[ , fComplete := lapply(id, function(x) {
m <- grep(x, filenames, value = TRUE)
if (!length(m)) NA else m})]
id fShort fComplete
1: 432-12 1245 432-124343.png
2: 3242-12 453543 3242-124342345.png
3: 324-32 45543 NA
4: 322-34 45343 NA
5: 2324-34 13543 NA
In my experience with similar functions, sometimes the regex functions return a list, so you have to consider that in the apply - I usually do an example manually
Also apply will not always in y experience on its own return something that always works into a data.frame,sometimes I had to use lap ply, and or unlist and data.frame to modify it
Here is an answer - I am not familiar with data.tables and I was having issues with the filenames being in a list, but with some transformations this works. I worked it out by seeing what apply was outputting and adding the [1] to get the piece I needed
DT <- data.frame(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
filenames1 <- unlist(filenames)
x<-apply(DT[1],1,function(x) grep(x,filenames1)[1])
DT$fielname <- filenames1[x]

How to properly manipulate a string column in a data frame in R?

I have a data.frame with a string column that contains periods e.g "a.b.c.X". I want to split out the string by periods and retain the third segment e.g. "c" in the example given. Here is what I'm doing.
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> df
v b
1 a.b.a.X 1
2 a.b.b.X 2
3 a.b.c.X 3
And what I want is
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> df
v b
1 a 1
2 b 2
3 c 3
I'm attempting to use within, but I'm getting strange results. The value in the first row in the first column is being repeated.
> get = function(x) { unlist(strsplit(x, "\\."))[3] }
> within(df, v <- get(as.character(v)))
v b
1 a 1
2 a 2
3 a 3
What is the best practice for doing this? What am I doing wrong?
Update:
Here is the solution I used from #agstudy's answer:
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> get = function(x) gsub(".*?[.].*?[.](.*?)[.].*", '\\1', x)
> within(df, v <- get(v))
v b
1 a 1
2 b 2
3 c 3
Using some regular expression you can do :
gsub(".*?[.].*?[.](.*?)[.].*", '\\1', df$v)
[1] "a" "b" "c"
Or more concise:
gsub("(.*?[.]){2}(.*?)[.].*", '\\2', v)
The problem is not with within but with your get function. It returns a single character ("a") which gets recycled when added to your data.frame. Your code should look like this:
get.third <- function(x) sapply(strsplit(x, "\\."), `[[`, 3)
within(df, v <- get.third(as.character(v)))
Here is one possible solution:
df[, "v"] <- do.call(rbind, strsplit(as.character(df[, "v"]), "\\."))[, 3]
## > df
## v b
## 1 a 1
## 2 b 2
## 3 c 3
The answer to "what am I doing wrong" is that the bit of code that you thought was extracting the third element of each split string was actually putting all the elements of all your strings in a single vector, and then returning the third element of that:
get = function(x) {
splits = strsplit(x, "\\.")
print("All the elements: ")
print(unlist(splits))
print("The third element:")
print(unlist(splits)[3])
# What you actually wanted:
third_chars = sapply(splits, function (x) x[3])
}
within(df, v2 <- get(as.character(v)))