stringr split column by alpha and numeric - regex

I can only use stringr / regular expressions, and I am working in R.
I have a CSV I downloaded called mpg2, and a subset of it containing only Mercedes-Benz makes. What I am trying to do is split the model into its alpha and numeric parts so I can plot them. For example, a Mercedes C300 would need to be split into C and 300, or GLS500 into GLS and 550.
So now I have all of the model numbers, and I want to split between letters and numbers.
I have tried
mercedes<- subset(mpg2, make=="Mercedes-Benz")
str_split(mercedes$model, "[0:9]")
but this doesn't do what I want it to and I have played with n= and that doesn't work either.
then I have
MB$modelnumber<-as.numeric(gsub("([0-9]+).*$", "\\1", mercedes$model))
Which makes a column of only numbers, I can't get the letters to work.
If I need to upload my specific dataset let me know, I just have to figure out how to do that.
But I need to basically split "XYZ123" into its alpha and numeric parts and put them in 2 separate columns.

Something like this:
x <- "XYZ123"
x <- gsub("([0-9]+)", ",\\1", x)
strsplit(x, ",")
I've replaced the original group of numbers with a comma followed by that group, so that I can split on the comma easily.

You can use something like this:
SplitMe <- function(string, alphaFirst = TRUE) {
  # Zero-width split point between a letter and a digit (or the reverse)
  Pattern <- if (isTRUE(alphaFirst)) "(?<=[a-zA-Z])(?=[0-9])" else "(?<=[0-9])(?=[a-zA-Z])"
  strsplit(string, split = Pattern, perl = TRUE)
}
String <- c("C300", "GLS500", "XYZ123")
SplitMe(String)
# [[1]]
# [1] "C" "300"
#
# [[2]]
# [1] "GLS" "500"
#
# [[3]]
# [1] "XYZ" "123"
To get the output as a two column matrix, just use do.call(rbind, ...):
do.call(rbind, SplitMe(String))
#      [,1]  [,2]
# [1,] "C"   "300"
# [2,] "GLS" "500"
# [3,] "XYZ" "123"
The above is just a convenience function that I have saved for the following scenarios:
strsplit(String, split = "(?<=[a-zA-Z])(?=[0-9])", perl = T)
and
strsplit(String, split = "(?<=[0-9])(?=[a-zA-Z])", perl = T)
This function won't change a GLS500 into a GLS550 though.
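Since the question asks for stringr specifically, here is a minimal sketch of the same split done with capture groups and str_match(). It assumes every model is one run of letters followed by one run of digits; the vector `models` and the column names `series` and `number` are made up for illustration and are not from the original data.

```r
library(stringr)

# Hypothetical stand-in for mercedes$model
models <- c("C300", "GLS500", "XYZ123")

# Column 1 of the result is the full match; columns 2 and 3 are the capture groups
parts <- str_match(models, "^([A-Za-z]+)([0-9]+)$")

split_models <- data.frame(
  series = parts[, 2],
  number = as.numeric(parts[, 3]),
  stringsAsFactors = FALSE
)
split_models
```

Models that do not fit the letters-then-digits shape come back as NA in both columns, so they are easy to spot before plotting.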

Stata - Extract numbers before characters, create a list

Good morning,
I have a dataframe where one of the columns has observations that look like that:
row1: 28316496(15)|28943784(8)|28579919(7)
row2: 29343898(1)
I would like to create a new column that would extract the numbers that are not in parenthesis, create a list, and then append all these numbers to create a list with all these numbers.
Said differently at the end, I would like to end up with the following list:
28316496;28943784;28579919;29343898
It could also be any other similar object, I am just interested in getting all these numbers and matching them with another dataset.
I have tried using str_extract_all to extract the numbers but I am having trouble understanding the pattern argument. For instance I have tried:
str_extract_all("28316496(15)|28943784(8)", "\d+(\d)")
and
gsub("\s*\(.*", "", "28316496(15)|28943784(8)")
but it is not returning exactly what I want.
Any idea for extracting the number outside the brackets and create a giant list out of that?
Thanks a lot!
In base R, we can use gsub to remove each parenthesised group of digits, then use read.table to read the result into a data.frame:
read.table(text = gsub("\\(\\d+\\)", "", df1$col1),
           header = FALSE, sep = "|", fill = TRUE)
        V1       V2       V3
1 28316496 28943784 28579919
2 29343898       NA       NA
Or, using str_extract_all, use a regex lookahead:
library(stringr)
str_extract_all(df1$col1, "\\d+(?=\\()")
[[1]]
[1] "28316496" "28943784" "28579919"
[[2]]
[1] "29343898"
data
df1 <- structure(list(col1 = c("28316496(15)|28943784(8)|28579919(7)",
"29343898(1)")), class = "data.frame", row.names = c(NA, -2L))
Here is a way.
x <- c("28316496(15)|28943784(8)|28579919(7)", "29343898(1)")
y <- strsplit(x, "\\|")
y <- lapply(y, \(.y) sub("\\([^\\(\\)]+\\)$", "", .y))
y
#> [[1]]
#> [1] "28316496" "28943784" "28579919"
#>
#> [[2]]
#> [1] "29343898"
Created on 2022-09-24 with reprex v2.0.2
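The question also asks for one flat list joined with semicolons. A minimal base R sketch of that last step, assuming the input looks like the two example rows (the names `ids` and `joined` are illustrative):

```r
x <- c("28316496(15)|28943784(8)|28579919(7)", "29343898(1)")

# Runs of digits that are immediately followed by "("
m <- gregexpr("[0-9]+(?=\\()", x, perl = TRUE)

# regmatches() returns one character vector per row; unlist() flattens them
ids <- unlist(regmatches(x, m))
ids

# Single string in the form shown in the question
joined <- paste(ids, collapse = ";")
joined
```

For matching against another dataset, the plain vector `ids` is probably more convenient than the joined string.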

Combining fragmented sentences in an R dataframe

I have a dataframe which contains parts of whole sentences spread across, in some cases, multiple rows of a dataframe.
For example, head(mydataframe) returns
# 1 Do you have any idea what
# 2 they were arguing about?
# 3 Do--Do you speak
# 4 English?
# 5 yeah.
# 6 No, I'm sorry.
Assuming a sentence can be terminated by either
"." or "?" or "!" or "..."
are there any R library functions capable of outputting the following:
# 1 Do you have any idea what they were arguing about?
# 2 Do--Do you speak English?
# 3 yeah.
# 4 No, I'm sorry.
This should work for all the sentences ending with: . ... ? or !
x <- paste0(foo$txt, collapse = " ")
trimws(unlist(strsplit(x, "(?<=[?.!|])(?=\\s)", perl=TRUE)))
Credits to @AvinashRaj for the pointers on the lookbehind
Which gives:
#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"
#[3] "yeah..."
#[4] "No, I'm sorry."
Data
I modified the toy dataset to include a case where a string ends with ... (as requested by the OP)
foo <- data.frame(num = 1:6,
txt = c("Do you have any idea what", "they were arguing about?",
"Do--Do you speak", "English?", "yeah...", "No, I'm sorry."),
stringsAsFactors = FALSE)
Here is what I got. I am sure there are better ways to do this. Here I used base functions. I created a sample data frame called foo. First, I created a single string from all the texts in txt. toString() joins them with commas, so I removed those in the first gsub(). Then, I took care of white space (more than 2 spaces) in the second gsub(). Then, I split the string at the delimiters you specified. Crediting Tyler Rinker for this post, I managed to keep the delimiters in strsplit(). The final job was to remove the white space at the start of each sentence, and then unlist the list. Note that removing every comma also strips commas that belong to the text itself, which is why the last sentence in the output reads "No I'm sorry."
EDIT
Steven Beaupré revised my code. That is the way to go!
foo <- data.frame(num = 1:6,
txt = c("Do you have any idea what", "they were arguing about?",
"Do--Do you speak", "English?", "yeah.", "No, I'm sorry."),
stringsAsFactors = FALSE)
library(magrittr)
toString(foo$txt) %>%
gsub(pattern = ",", replacement = "", x = .) %>%
strsplit(x = ., split = "(?<=[?.!])", perl = TRUE) %>%
lapply(., function(x)
{gsub(pattern = "^ ", replacement = "", x = x)
}) %>%
unlist
#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"
#[3] "yeah."
#[4] "No I'm sorry."
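A minimal sketch of a variant that keeps commas inside the text, assuming the fragments live in a `txt` column as in the toy data. It joins with paste() instead of toString() and splits on the whitespace that follows a terminator; the names `joined` and `sentences` are illustrative.

```r
foo <- data.frame(
  txt = c("Do you have any idea what", "they were arguing about?",
          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."),
  stringsAsFactors = FALSE
)

# Join the fragments with single spaces, so there are no separator commas to clean up
joined <- paste(foo$txt, collapse = " ")

# Split on whitespace that directly follows ".", "?" or "!";
# the whitespace is consumed, the terminator stays with its sentence
sentences <- strsplit(joined, "(?<=[.?!])\\s+", perl = TRUE)[[1]]
sentences
```

This still treats every terminator followed by a space as a sentence boundary, so abbreviations such as "Dr. Smith" would be split too.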

Replace character string elements by indices efficiently in R

I would like to efficiently replace elements in my character object with other particular elements in particular places (these places are indices which I know as they are results of the gregexpr function).
I would like some foo function that works like:
foo("qwerty", c(1,3,5), c("z", "x", "y"))
giving me:
[1] "zwxryy"
I searched the stringr package cran pdf but nothing hit my mind. Thank you in advance for any suggestions.
For example:
xx <- unlist(strsplit("qwerty",""))
xx[c(1,3,5)] <- c("z", "x", "y")
paste0(xx,collapse='')
[1] "zwxryy"
You could also try the one below, if you don't have that many characters to replace
st1 <- "qwerty"
gsub("^.(.).(.).","z\\1x\\2y", st1)
#[1] "zwxryy"
In the stringi package there is a stri_sub function that works like this:
library(stringi)
a <- "12345"
stri_sub(a, from=c(1,3,5),len=1) <- letters[c(1,3,5)]
a
## [1] "a2345" "12c45" "1234e"
It's almost what you want. Just use it in a loop:
a <- "12345"
for(i in c(1,3,5)){
stri_sub(a, from=i,len=1) <- letters[i]
}
a
## [1] "a2c4e"
Be aware that this kind of function is on our TODO list, check:
https://github.com/Rexamine/stringi/issues?state=open
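The strsplit approach above can be wrapped into the kind of function the question asks for. This is only a sketch: `replace_at` is a hypothetical name, not part of stringr or stringi, and it assumes a single input string and single-character replacements.

```r
# replace_at() is a hypothetical helper, not a function from any package
replace_at <- function(string, idx, repl) {
  stopifnot(
    length(string) == 1,
    length(idx) == length(repl),
    all(nchar(repl) == 1),
    all(idx >= 1 & idx <= nchar(string))
  )
  chars <- strsplit(string, "")[[1]]  # one element per character
  chars[idx] <- repl                  # overwrite the chosen positions
  paste(chars, collapse = "")
}

replace_at("qwerty", c(1, 3, 5), c("z", "x", "y"))
```

Positions coming from gregexpr() can be passed as `idx` after taking the first list element; a result of -1 (no match) is rejected by the range check.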

Regular expression-based list matching in R

I have two lists (more exactly, character atomic vectors) that I want to compare using regular expressions to produce a sub-set of one of the lists. I can use a 'for' loop for this, but is there some simpler code? Following exemplifies my case:
# list of unique cities
city <- c('Berlin', 'Perth', 'Oslo')
# list of city-months, like 'New York-Dec'
temp <- c('Berlin-Jan', 'Delhi-Jan', 'Lima-Feb', 'Perth-Feb', 'Oslo-Jan')
# need sub-set of 'temp' for only 'Jan' month for only the items in 'city' list:
# 'Berlin-Jan', 'Oslo-Jan'
Added clarification: In the actual case that I am seeking code for, the values of the 'month' equivalent are more complex, and rather random alphanumeric values with only the first two characters having informational value of my interest (has to be '01').
Added actual case example:
# equivalent of 'city' in the first example
# values match pattern TCGA-[0-9A-Z]{2}-[0-9A-Z]{4}
patient <- c('TCGA-43-4897', 'TCGA-65-4897', 'TCGA-78-8904', 'TCGA-90-8984')
# equivalent of 'temp' in the first example
# values match pattern TCGA-[0-9A-Z]{2}-[0-9A-Z]{4}-[\d]{2}[0-9A-Z]+
sample <- c('TCGA-21-5732-01A333', 'TCGA-43-4897-01A159', 'TCGA-65-4897-01T76', 'TCGA-78-8904-11A70')
# sub-set wanted (must have '01' after the 'patient' ID part)
# 'TCGA-43-4897-01A159', 'TCGA-65-4897-01T76'
Something like this?
temp <- temp[grepl("Jan", temp)]
temp[sapply(strsplit(temp, "-"), "[[", 1) %in% city]
# [1] "Berlin-Jan" "Oslo-Jan"
Even better, borrowing the idea from @agstudy:
temp[temp %in% paste0(city, "-Jan")]
# [1] "Berlin-Jan" "Oslo-Jan"
Edit: How about this?
sample[gsub("(.*-01).*$", "\\1", sample) %in% paste0(patient, "-01")]
# [1] "TCGA-43-4897-01A159" "TCGA-65-4897-01T76"
Here's a solution after the others, with your new requirements:
sample[na.omit(pmatch(paste0(patient, '-01'), sample))]
You can use gsub
x <- gsub(paste(paste(city,collapse='-Jan|'),'-Jan',sep=''),1,temp)
temp[x==1]
[1] "Berlin-Jan" "Oslo-Jan"
the pattern here is :
"Berlin-Jan|Perth-Jan|Oslo-Jan"
Here's a solution with two partial string matches...
temp[agrep("Jan",temp)[which(agrep("Jan",temp) %in% sapply(city, agrep, x=temp))]]
# [1] "Berlin-Jan" "Oslo-Jan"
As a function just for fun...
fun <- function(x,y,pattern) y[agrep(pattern,y)[which(agrep(pattern,y) %in% sapply(x, agrep, x=y))]]
# x is a vector of values to filter by (here, city)
# y is the vector being filtered (here, temp)
# pattern is the quoted pattern you're filtering on
fun(city, temp, "Jan")
# [1] "Berlin-Jan" "Oslo-Jan"
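For the TCGA case, a minimal regex-based sketch is to pull the patient part and the two-digit code out of each sample ID using the patterns given in the question, then filter on both. The variable is named `sample_ids` here (an illustrative choice) to avoid masking base R's sample() function.

```r
patient <- c('TCGA-43-4897', 'TCGA-65-4897', 'TCGA-78-8904', 'TCGA-90-8984')
sample_ids <- c('TCGA-21-5732-01A333', 'TCGA-43-4897-01A159',
                'TCGA-65-4897-01T76', 'TCGA-78-8904-11A70')

# Group 1: patient ID, group 2: the two digits that follow it
pattern <- "^(TCGA-[0-9A-Z]{2}-[0-9A-Z]{4})-([0-9]{2}).*$"

id   <- sub(pattern, "\\1", sample_ids)
code <- sub(pattern, "\\2", sample_ids)

# Keep samples whose patient ID is known and whose code is "01"
wanted <- sample_ids[id %in% patient & code == "01"]
wanted
```

Strings that do not fit the pattern are returned unchanged by sub(), so they never match a patient ID and are dropped.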

Split on first comma in string

How can I efficiently split the following string on the first comma using base?
x <- "I want to split here, though I don't want to split elsewhere, even here."
strsplit(x, ???)
Desired outcome (2 strings):
[[1]]
[1] "I want to split here" "though I don't want to split elsewhere, even here."
Thank you in advance.
EDIT: Didn't think to mention this. This needs to be able to generalize to a column, vector of strings like this, as in:
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
The outcome can be two columns or one long vector (that I can take every other element of) or a list of strings with each index ([[n]]) having two strings.
Apologies for the lack of clarity.
Here's what I'd probably do. It may seem hacky, but since sub() and strsplit() are both vectorized, it will also work smoothly when handed multiple strings.
XX <- "SoMeThInGrIdIcUlOuS"
strsplit(sub(",\\s*", XX, x), XX)
# [[1]]
# [1] "I want to split here"
# [2] "though I don't want to split elsewhere, even here."
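Another base-only option, sketched here under the assumption that every string contains at least one comma, is to locate the first comma with regexpr() and ask regmatches() for everything around it. The names `first_comma` and `pieces` are illustrative.

```r
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")

# regexpr() reports only the first match in each string
first_comma <- regexpr(",\\s*", y)

# invert = TRUE returns the text before and after that match
pieces <- regmatches(y, first_comma, invert = TRUE)
pieces

# Two-column matrix, one row per input string
do.call(rbind, pieces)
```

Because both functions are vectorized, this works on a whole column without a placeholder string.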
From the stringr package:
str_split_fixed(x, pattern = ', ', n = 2)
# [,1]
# [1,] "I want to split here"
# [,2]
# [1,] "though I don't want to split elsewhere, even here."
(That's a matrix with one row and two columns.)
Here is yet another solution, with a regular expression to capture what is before and after the first comma.
x <- "I want to split here, though I don't want to split elsewhere, even here."
library(stringr)
str_match(x, "^(.*?),\\s*(.*)")[,-1]
# [1] "I want to split here"
# [2] "though I don't want to split elsewhere, even here."
library(stringr)
str_sub(x,end = min(str_locate(string=x, ',')-1))
This will get the first bit you want. Change the start= and end= in str_sub to get what ever else you want.
Such as:
str_sub(x,start = min(str_locate(string=x, ',')+1 ))
and wrap in str_trim to get rid of the leading space:
str_trim(str_sub(x,start = min(str_locate(string=x, ',')+1 )))
This works, but I like Josh O'Brien's better:
y <- strsplit(x, ",")
sapply(y, function(x) data.frame(x = x[1],
                                 z = paste(x[-1], collapse = ",")), simplify = FALSE)
Inspired by chase's response.
A number of people gave non base approaches so I figure I'd add the one I usually use (though in this case I needed a base response):
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
library(reshape2)
colsplit(y, ",", c("x","z"))