R separating out number and units from a column in a dataframe - regex

I have a dataframe which contains a column that has numbers as well as variable units:
num <- c(1:5)
val <- c("5%","10K", "100.2mv","1.4g","1.007kbars")
df <- data.frame(num,val)
df
How can I create two new columns from df$val, one that contains just the number and one the units?
Thank you for your help.

Here's a solution using stringr:
library(stringr)
df$extr_nums <- str_extract(df$val, "\\d+\\.?\\d*")
df$extr_units <- str_replace(df$val, "\\d+\\.?\\d*", "")
df
  num        val extr_nums extr_units
1   1         5%         5          %
2   2        10K        10          K
3   3    100.2mv     100.2         mv
4   4       1.4g       1.4          g
5   5 1.007kbars     1.007      kbars
The regexp is translated as: "at least 1 digit, followed by optional dot, followed by optional digits".
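For comparison, the same split can be sketched in base R without stringr. This is a minimal sketch, not part of the original answer; the names num_pattern, nums and units are illustrative, and the pattern is anchored with ^ on the assumption that the number always comes first.

```r
val <- c("5%", "10K", "100.2mv", "1.4g", "1.007kbars")

# same pattern as above, anchored to the start of the string
num_pattern <- "^\\d+\\.?\\d*"

# regexpr() returns -1 where nothing matches, so fill only the matching rows
m <- regexpr(num_pattern, val)
nums <- rep(NA_real_, length(val))
nums[m > 0] <- as.numeric(regmatches(val, m))

# whatever is left after removing the leading number is the unit
units <- sub(num_pattern, "", val)

data.frame(val, nums, units)
```

Here nums is numeric rather than character, which may or may not be what you want for the new column.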


R: How to group and aggregate list elements using regex?

I want to aggregate (sum up) the following product list by groups (see below):
prods <- list("101.2000"=data.frame(1,2,3),
"102.2000"=data.frame(4,5,6),
"103.2000"=data.frame(7,8,9),
"104.2000"=data.frame(1,2,3),
"105.2000"=data.frame(4,5,6),
"106.2000"=data.frame(7,8,9),
"101.2001"=data.frame(1,2,3),
"102.2001"=data.frame(4,5,6),
"103.2001"=data.frame(7,8,9),
"104.2001"=data.frame(1,2,3),
"105.2001"=data.frame(4,5,6),
"106.2001"=data.frame(7,8,9))
test= list("100.2000"=data.frame(2,3,5),
"100.2001"=data.frame(4,5,6))
names <- c("A", "B", "C")
prods <- lapply(prods, function (x) {colnames(x) <- names; return(x)})
Each element of the product list (prods) has a name combination of the product number and the year (e.g. 101.2000 --> 101 = prod nr. and 2000 = year). And the groups only contain product numbers for the aggregation.
group1 <- c(101, 106)
group2 <- c(102, 104)
group3 <- c(105, 103)
My expected result, shows the aggregated product groups by year:
$group1.2000
A B C
1 8 10 12
$group2.2000
A B C
1 5 7 9
$group3.2000
A B C
1 11 13 15
$group1.2001
A B C
1 8 10 12
$group2.2001
A B C
1 5 7 9
$group3.2001
A B C
1 11 13 15
So far, I tried this way: First I decomposed the names of prods into product numbers:
prodnames <- names(prods)
prodnames_sub <- gsub("\\..*.","", prodnames)
And then I tried to aggregate using lapply:
lapply(prods, function(x) aggregate( ... , FUN = sum)
However, I didn't find how to implement the previous product numbers in the aggregation function. Ideas? Thanks
Here are two approaches. No packages are used in either one.
1) Using lists Create a two column data.frame S from the groups, whose columns are the products (values column) and the associated groups (ind column). Then create By, the list to split by. In the code that produces By, sub("\\..*", "", names(prods)) extracts the product number and match is then used to find the associated group; sub(".*\\.", "", names(prods)) extracts the year. Next, perform the split and lapply over it to run the summations. The two components of By (group and year) can be reversed to change the order of the output, if desired.
S <- stack(list(group1 = group1, group2 = group2, group3 = group3))
By <- list(group = S$ind[match(sub("\\..*", "", names(prods)), S$values)],
year = sub(".*\\.", "", names(prods)))
lapply(split(prods, By), function(x) colSums(do.call(rbind, x)))
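To see what the two sub() calls and the match() lookup produce, here is a small sketch on three made-up element names. The vector nms and the variables prod, year and grp are illustrative only; the groups are the ones from the question.

```r
group1 <- c(101, 106)
group2 <- c(102, 104)
group3 <- c(105, 103)
S <- stack(list(group1 = group1, group2 = group2, group3 = group3))

# three sample names in the "product.year" form used by prods
nms <- c("101.2000", "104.2000", "105.2001")

prod <- sub("\\..*", "", nms)   # drop everything from the first dot: product
year <- sub(".*\\.", "", nms)   # drop everything up to the last dot: year

# look up the group label for each product number
grp <- as.character(S$ind[match(prod, S$values)])

paste(grp, year, sep = ".")
# [1] "group1.2000" "group2.2000" "group3.2001"
```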
2) Using data.frames Convert the groups and prods each to a data frame, merge them, perform an aggregate and split back into a list. The output is the same as requested except for order. (Reverse the two right hand variables in the aggregate formula to get the order shown in the question, but that will also reverse the two parts of each component name in the output list.)
S <- stack(list(group1 = group1, group2 = group2, group3 = group3))
DF0 <- do.call(rbind, prods)
DF <- cbind(do.call(rbind, strsplit(rownames(DF0), ".", fixed = TRUE)), DF0)
M <- merge(DF, S, all.x = TRUE, by = 1)
Ag <- aggregate(cbind(A, B, C) ~ ind + `2`, M, sum)
lapply(split(Ag, paste(Ag[[1]], Ag[[2]], sep = ".")), "[", 3:5)
giving:
$group1.2000
A B C
1 8 10 12
$group1.2001
A B C
4 8 10 12
$group2.2000
A B C
2 5 7 9
$group2.2001
A B C
5 5 7 9
$group3.2000
A B C
3 11 13 15
$group3.2001
A B C
6 11 13 15

How to replicate column-names, split them at delimiter '/', into multiple column-names, in R?

I have this matrix (it's big in size) "mymat". I need to replicate the columns that have "/" in their column name matching at "/" and make a "resmatrix". How can I get this done in R?
mymat
a b IID:WE:G12D/V GH:SQ:p.R172W/G c
1 3 4 2 4
22 4 2 2 4
2 3 2 2 4
resmatrix
a b IID:WE:G12D IID:WE:G12V GH:SQ:p.R172W GH:SQ:p.R172G c
1 3 4 4 2 2 4
22 4 2 2 2 2 4
2 3 2 2 2 2 4
Find out which columns have the "/" and replicate them, then rename. To calculate the new names, just split on / and replace the last letter for the second name.
# which columns have '/' in them?
which.slash <- grep('/', names(mymat), value=TRUE)
new.names <- unlist(lapply(strsplit(which.slash, '/'),
function (bits) {
# bits[1] is e.g. IID:WE:G12D and bits[2] is the V
# take bits[1] and replace the last letter for the second colname
c(bits[1], sub('.$', bits[2], bits[1]))
}))
# make resmat by copying the appropriate columns
resmat <- cbind(mymat, mymat[, which.slash])
# order the columns to make sure the names replace properly
resmat <- resmat[, order(names(resmat))]
# put the new names in
names(resmat)[grep('/', names(resmat))] <- sort(new.names)
resmat looks like this
# a b c GH:SQ:p.R172G GH:SQ:p.R172W IID:WE:G12D IID:WE:G12V
# 1 1 3 4 2 2 4 4
# 2 22 4 4 2 2 2 2
# 3 2 3 4 2 2 2 2
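The renaming step can be tried on its own. The sketch below wraps it in a helper, split_name, which is a hypothetical name introduced here for illustration; it assumes the alternative after the slash is a single character that replaces the last character of the name.

```r
# split "prefixX/Y" into c("prefixX", "prefixY")
split_name <- function(nm) {
  bits <- strsplit(nm, "/", fixed = TRUE)[[1]]
  # bits[1] is the full first name, bits[2] the replacement for its last character
  c(bits[1], sub(".$", bits[2], bits[1]))
}

split_name("IID:WE:G12D/V")
# [1] "IID:WE:G12D" "IID:WE:G12V"
split_name("GH:SQ:p.R172W/G")
# [1] "GH:SQ:p.R172W" "GH:SQ:p.R172G"
```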
You could use grep to get the index of column names with / ('nm1'), replicate the column names in 'nm1' by using sub/scan to create 'nm2'. Then, cbind the columns that are not 'nm1', with the replicated columns ('nm1'), change the column names with 'nm2', and if needed order the columns.
#get the column index with grep
nm1 <- grepl('/', names(df1))
#used regex to rearrange the substrings in the nm1 column names
#removed the `/` and use `scan` to split at the space delimiter
nm2 <- scan(text=gsub('([^/]+)(.)/(.*)', '\\1\\2 \\1\\3',
names(df1)[nm1]), what='', quiet=TRUE)
#cbind the columns that are not in nm1, with the replicate nm1 columns
df2 <- cbind(df1[!nm1], setNames(df1[rep(which(nm1), each= 2)], nm2))
#create another index to find the starting position of nm1 columns
nm3 <- names(df1)[1:(which(nm1)[1L]-1)]
#we concatenate the nm3, nm2, and the rest of the columns to match
#the expected output order
df2N <- df2[c(nm3, nm2, setdiff(names(df1)[!nm1], nm3))]
df2N
# a b IID:WE:G12D IID:WE:G12V GH:SQ:p.R172W GH:SQ:p.R172G c
#1 1 3 4 4 2 2 4
#2 22 4 2 2 2 2 4
#3 2 3 2 2 2 2 4
data
df1 <- structure(list(a = c(1L, 22L, 2L), b = c(3L, 4L, 3L),
`IID:WE:G12D/V` = c(4L,
2L, 2L), `GH:SQ:p.R172W/G` = c(2L, 2L, 2L), c = c(4L, 4L, 4L)),
.Names = c("a", "b", "IID:WE:G12D/V", "GH:SQ:p.R172W/G", "c"),
class = "data.frame", row.names = c(NA, -3L))
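The regex in the gsub call above is the least obvious part of this answer, so here is a small sketch of what it does to the two column names. It uses strsplit instead of scan to split at the space, which is a substitution made here for illustration; the variable names nm, expanded and new_names are also illustrative.

```r
nm <- c("IID:WE:G12D/V", "GH:SQ:p.R172W/G")

# \\1 = everything before the last character ahead of the slash
# \\2 = that last character, \\3 = the alternative after the slash
expanded <- gsub("([^/]+)(.)/(.*)", "\\1\\2 \\1\\3", nm)
expanded
# [1] "IID:WE:G12D IID:WE:G12V"     "GH:SQ:p.R172W GH:SQ:p.R172G"

# split each pair at the space to get one name per output column
new_names <- unlist(strsplit(expanded, " ", fixed = TRUE))
new_names
# [1] "IID:WE:G12D"   "IID:WE:G12V"   "GH:SQ:p.R172W" "GH:SQ:p.R172G"
```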

R regex find ranges in strings

I have a bunch of email subject lines and I'm trying to extract whether a range of values are present. This is how I'm trying to do it but am not getting the results I'd like:
library(stringi)
df1 <- data.frame(id = 1:5, string1 = NA)
df1$string1 <- c('15% off','25% off','35% off','45% off','55% off')
df1$pctOff10_20 <- stri_match_all_regex(df1$string1, '[10-20]%')
id string1 pctOff10_20
1 1 15% off NA
2 2 25% off NA
3 3 35% off NA
4 4 45% off NA
5 5 55% off NA
I'd like something like this:
id string1 pctOff10_20
1 1 15% off 1
2 2 25% off 0
3 3 35% off 0
4 4 45% off 0
5 5 55% off 0
Here is the way to go,
df1$pctOff10_20 <- stri_count_regex(df1$string1, '^(1\\d|20)%')
Explanation:
^        the beginning of the string
(        group and capture to \1:
  1        '1'
  \d       digits (0-9)
 |        OR
  20       '20'
)        end of \1
%        '%'
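A base R sketch of the same idea, using grepl instead of stringi, may help show why the original '[10-20]%' pattern did not work. The sample vector subjects is made up for illustration. Note that the ^ anchor assumes the percentage is at the very start of the subject line.

```r
subjects <- c("15% off", "25% off", "20% off", "120% off")

# 1 followed by any digit, or exactly 20, directly before the % sign
as.integer(grepl("^(1\\d|20)%", subjects))
# [1] 1 0 1 0

# [10-20] is a character class: one character that is 1, 0-2 or 0.
# It therefore matches any single 0, 1 or 2 placed right before the %.
grepl("[10-20]%", subjects)
# [1] FALSE FALSE  TRUE  TRUE
```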
1) strapply in gsubfn can do that by combining a regex (pattern= argument) and a function (FUN= argument). Below we use the formula representation of the function. Alternatively we could make use of between from data.table (or a number of other packages). This extracts the matches to the pattern, applies the function to them and returns the result, simplifying it into a vector (rather than a list):
library(gsubfn)
btwn <- function(x, a, b) as.numeric(a <= as.numeric(x) & as.numeric(x) <= b)
transform(df1, pctOff10_20 =
strapply(
X = string1,
pattern = "\\d+",
FUN = ~ btwn(x, 10, 20),
simplify = TRUE
)
)
2) A base solution using the same btwn function defined above is:
transform(df1, pctOff10_20 = btwn(gsub("\\D", "", string1), 10, 20))

How do I count the number of words in a text (string)?

I have this string vector (for example):
str <- c("this is a string current trey",
"feather rtttt",
"tusla",
"laq")
To count the number of words in this vector I used this (as given here Count the number of words in a string in R?, which is a possible duplicate but with another issue)
No_words <- sapply(gregexpr("\\W+", str), length) + 1
but it returns
6 2 2 2
The last two elements (i.e. "tusla" and "laq") contain only one word each,
so it should return
6 2 1 1
How do I get around this problem?
You can try
sapply(gregexpr("\\S+", str), length)
## [1] 6 2 1 1
Or as suggested in comments you can try
sapply(strsplit(str, "\\s+"), length)
## [1] 6 2 1 1
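The reason the original attempt over-counts single words can be seen in a short sketch: gregexpr returns a vector of length one containing -1 when nothing matches, so "no separators" and "one separator" both have length 1. The variable names txt, sep_counts and word_counts are illustrative, and trimws is added on the assumption that leading or trailing spaces should not count as words.

```r
txt <- c("this is a string current trey", "feather rtttt", "tusla", "laq")

# no separator in "tusla": gregexpr signals this with a single -1
gregexpr("\\W+", "tusla")[[1]][1]
# [1] -1

# counting separators and adding 1 therefore gives 2 for single words
sep_counts <- lengths(gregexpr("\\W+", txt)) + 1
sep_counts
# [1] 6 2 2 2

# splitting on whitespace and counting the pieces avoids the problem
word_counts <- lengths(strsplit(trimws(txt), "\\s+"))
word_counts
# [1] 6 2 1 1
```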
Use the stringi package and stri_count:
require(stringi)
str <- c(
"this is a string current trey",
"nospaces",
"multiple spaces",
" leadingspaces",
"trailingspaces ",
" leading and trailing ",
"just one space each")
> stri_count(str,regex="\\S+")
[1] 6 1 2 1 1 3 4
Use the wc-function from the qdap package.
str <- c("this is a string current trey",
"feather rtttt",
"tusla",
"laq")
library("qdap")
wc(str)
That returns:
wc(str)
[1] 6 2 1 1

How to calculate the number of occurrence of a given character in each row of a column of strings?

I have a data.frame in which certain variables contain a text string. I wish to count the number of occurrences of a given character in each individual string.
Example:
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"))
I wish to create a new column for q.data with the number of occurrences of "a" in string (i.e. c(2,1,0)).
The only convoluted approach I have managed is:
string.counter<-function(strings, pattern){
counts<-NULL
for(i in 1:length(strings)){
counts[i]<-length(attr(gregexpr(pattern,strings[i])[[1]], "match.length")[attr(gregexpr(pattern,strings[i])[[1]], "match.length")>0])
}
return(counts)
}
string.counter(strings=q.data$string, pattern="a")
number string number.of.a
1 1 greatgreat 2
2 2 magic 1
3 3 not 0
The stringr package provides the str_count function which seems to do what you're interested in
# Load your example data
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
library(stringr)
# Count the number of 'a's in each element of string
q.data$number.of.a <- str_count(q.data$string, "a")
q.data
# number string number.of.a
#1 1 greatgreat 2
#2 2 magic 1
#3 3 not 0
If you don't want to leave base R, here's a fairly succinct and expressive possibility:
x <- q.data$string
lengths(regmatches(x, gregexpr("a", x)))
# [1] 2 1 0
nchar(as.character(q.data$string)) -nchar( gsub("a", "", q.data$string))
[1] 2 1 0
Notice that I coerce the factor variable to character, before passing to nchar. The regex functions appear to do that internally.
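The nchar difference can be wrapped in a small helper. This is a sketch only; count_char is a hypothetical name, fixed = TRUE is added so that characters such as "." are treated literally, and the division by nchar(ch) is an assumption that lets it also count non-overlapping multi-character patterns.

```r
count_char <- function(x, ch) {
  x <- as.character(x)  # factors must be coerced before nchar()
  # characters removed by gsub, divided by the pattern length
  (nchar(x) - nchar(gsub(ch, "", x, fixed = TRUE))) / nchar(ch)
}

count_char(factor(c("greatgreat", "magic", "not")), "a")
# [1] 2 1 0
count_char("abc.d.aa", ".")
# [1] 2
```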
Here are benchmark results (with the test scaled up to 3000 rows):
library(rbenchmark)  # provides benchmark()
library(stringr)     # provides str_count()
q.data<-q.data[rep(1:NROW(q.data), 1000),]
str(q.data)
'data.frame': 3000 obs. of 3 variables:
$ number : int 1 2 3 1 2 3 1 2 3 1 ...
$ string : Factor w/ 3 levels "greatgreat","magic",..: 1 2 3 1 2 3 1 2 3 1 ...
$ number.of.a: int 2 1 0 2 1 0 2 1 0 2 ...
benchmark( Dason = { q.data$number.of.a <- str_count(as.character(q.data$string), "a") },
Tim = {resT <- sapply(as.character(q.data$string), function(x, letter = "a"){
sum(unlist(strsplit(x, split = "")) == letter) }) },
DWin = {resW <- nchar(as.character(q.data$string)) -nchar( gsub("a", "", q.data$string))},
Josh = {x <- sapply(regmatches(q.data$string, gregexpr("g",q.data$string )), length)}, replications=100)
#-----------------------
   test replications elapsed  relative user.self sys.self user.child sys.child
1 Dason          100   4.173  9.959427     2.985    1.204          0         0
3  DWin          100   0.419  1.000000     0.417    0.003          0         0
4  Josh          100  18.635 44.474940    17.883    0.827          0         0
2   Tim          100   3.705  8.842482     3.646    0.072          0         0
Another good option, using charToRaw:
sum(charToRaw("abc.d.aa") == charToRaw('.'))
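The charToRaw line works on a single string. A possible way to apply it to a whole column is sketched below; count_raw is a hypothetical helper name, and the approach assumes the character being counted is a single byte (plain ASCII), since the comparison is done byte by byte.

```r
count_raw <- function(x, ch) {
  target <- charToRaw(ch)  # assumed to be exactly one byte
  vapply(as.character(x),
         function(s) sum(charToRaw(s) == target),
         integer(1), USE.NAMES = FALSE)
}

count_raw(c("abc.d.aa", "none", "..."), ".")
# [1] 2 0 3
count_raw(c("greatgreat", "magic", "not"), "a")
# [1] 2 1 0
```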
The stringi package provides the functions stri_count and stri_count_fixed which are very fast.
stringi::stri_count(q.data$string, fixed = "a")
# [1] 2 1 0
benchmark
Compared to the fastest approach from @42-'s answer and to the equivalent function from the stringr package, for a vector with 30,000 elements.
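library(microbenchmark)
library(ggplot2)  # provides the autoplot() generic used below
library(stringr)  # provides str_count()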
benchmark <- microbenchmark(
stringi = stringi::stri_count(test.data$string, fixed = "a"),
baseR = nchar(test.data$string) - nchar(gsub("a", "", test.data$string, fixed = TRUE)),
stringr = str_count(test.data$string, "a")
)
autoplot(benchmark)
data
q.data <- data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
test.data <- q.data[rep(1:NROW(q.data), 10000),]
A variation of https://stackoverflow.com/a/12430764/589165 is
> nchar(gsub("[^a]", "", q.data$string))
[1] 2 1 0
I'm sure someone can do better, but this works:
sapply(as.character(q.data$string), function(x, letter = "a"){
sum(unlist(strsplit(x, split = "")) == letter)
})
greatgreat magic not
2 1 0
or in a function:
countLetter <- function(charvec, letter){
sapply(charvec, function(x, letter){
sum(unlist(strsplit(x, split = "")) == letter)
}, letter = letter)
}
countLetter(as.character(q.data$string),"a")
You could just use string division
require(roperators)
my_strings <- c('apple', 'banana', 'pear', 'melon')
my_strings %s/% 'a'
Which will give you 1, 3, 1, 0. You can also use string division with regular expressions and whole words.
The question below has been moved here, but this page doesn't seem to answer Farah El's question directly.
How to find number 1s in 101 in R
So, I'll write an answer here, just in case.
library(magrittr)
library(stringr)  # provides str_count()
n %>% # n is a number you'd like to inspect
as.character() %>%
str_count(pattern = "1")
https://stackoverflow.com/users/8931457/farah-el
Yet another base R option could be:
lengths(lapply(q.data$string, grepRaw, pattern = "a", all = TRUE, fixed = TRUE))
[1] 2 1 0
The expression below does the job and also works for symbols, not only letters.
It works as follows:
1: It uses lapply to iterate over the rows of column 2 of the data frame q.data ("lapply(q.data[,2],").
2: It applies the function "function(x){sum('a' == strsplit(as.character(x), '')[[1]])}" to each row of column 2.
The function takes each row value of column 2 (x), converts it to character (in case it is a factor, for example), and splits the string into individual characters ("strsplit(as.character(x), '')"). The result is a vector containing each character of the string for that row.
3: Each element of that vector is compared with the character to be counted, in this case "a" (" 'a' == "). This returns a logical vector such as "c(TRUE, FALSE, TRUE, ...)", which is TRUE wherever the element matches the character to be counted.
4: The number of times 'a' appears in the row is the sum of the TRUE values in the vector ("sum(....)").
5: Finally, unlist unpacks the result of lapply so that it can be assigned to a new column of the data frame ("q.data$number.of.a<-unlist(....").
q.data$number.of.a<-unlist(lapply(q.data[,2],function(x){sum('a' == strsplit(as.character(x), '')[[1]])}))
>q.data
#  number     string number.of.a
#1      1 greatgreat           2
#2      2      magic           1
#3      3        not           0
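The steps described above can be followed on a single string. This is only an illustrative walk-through; the variables x, chars and hits are introduced here and are not part of the answer.

```r
x <- factor("greatgreat")  # a factor, as in older data.frame defaults

# step 2: coerce to character and split into single characters
chars <- strsplit(as.character(x), "")[[1]]
chars
# [1] "g" "r" "e" "a" "t" "g" "r" "e" "a" "t"

# step 3: compare every character with the one to be counted
hits <- "a" == chars

# step 4: TRUE counts as 1, so the sum is the number of matches
sum(hits)
# [1] 2
```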
Another base R answer, not as good as those by @IRTFM and @Finn (or as those using stringi/stringr), but better than the others:
sapply(strsplit(q.data$string, split=""), function(x) sum(x %in% "a"))
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"))
q.data<-q.data[rep(1:NROW(q.data), 3000),]
library(rbenchmark)
library(stringr)
library(stringi)
benchmark( Dason = {str_count(q.data$string, "a") },
Tim = {sapply(q.data$string, function(x, letter = "a"){sum(unlist(strsplit(x, split = "")) == letter) }) },
DWin = {nchar(q.data$string) -nchar( gsub("a", "", q.data$string, fixed=TRUE))},
Markus = {stringi::stri_count(q.data$string, fixed = "a")},
Finn={nchar(gsub("[^a]", "", q.data$string))},
tmmfmnk={lengths(lapply(q.data$string, grepRaw, pattern = "a", all = TRUE, fixed = TRUE))},
Josh1 = {sapply(regmatches(q.data$string, gregexpr("g",q.data$string )), length)},
Josh2 = {lengths(regmatches(q.data$string, gregexpr("g",q.data$string )))},
Iago = {sapply(strsplit(q.data$string, split=""), function(x) sum(x %in% "a"))},
replications =100, order = "elapsed")
     test replications elapsed relative user.self sys.self user.child sys.child
4  Markus          100   0.076    1.000     0.076    0.000          0         0
3    DWin          100   0.277    3.645     0.277    0.000          0         0
1   Dason          100   0.290    3.816     0.291    0.000          0         0
5    Finn          100   1.057   13.908     1.057    0.000          0         0
9    Iago          100   3.214   42.289     3.215    0.000          0         0
2     Tim          100   6.000   78.947     6.002    0.000          0         0
6 tmmfmnk          100   6.345   83.487     5.760    0.003          0         0
8   Josh2          100  12.542  165.026    12.545    0.000          0         0
7   Josh1          100  13.288  174.842    13.268    0.028          0         0
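A short and clean way, IMHO, is:
q.data$number.of.a <- lengths(regmatches(q.data$string, gregexpr('a', q.data$string)))
# regmatches() is needed here: gregexpr() returns -1 when there is no match,
# so lengths(gregexpr(...)) alone would report 1 instead of 0 for "not"
#   number     string number.of.a
#1      1 greatgreat           2
#2      2      magic           1
#3      3        not           0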
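One thing worth checking when counting with gregexpr is how strings without any match are reported. The sketch below, with the illustrative names strings and m, compares taking lengths of the raw gregexpr result with taking lengths after regmatches.

```r
strings <- c("greatgreat", "magic", "not")
m <- gregexpr("a", strings)

# "not" has no match, which gregexpr encodes as a single -1
m[[3]][1]
# [1] -1

# lengths of the raw result: the -1 is counted as one element
lengths(m)
# [1] 2 1 1

# regmatches drops the -1, so the count for "not" becomes 0
lengths(regmatches(strings, m))
# [1] 2 1 0
```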
s <- "aababacababaaathhhhhslsls jsjsjjsaa ghhaalll"
p <- "a"
s2 <- gsub(p,"",s)
numOcc <- nchar(s) - nchar(s2)
This may not be the most efficient approach, but it serves my purpose.