R: What's the easiest way to print out pairs of values from a data.frame? - list

I have a data.frame:
df<-data.frame(a=c("x","x","y","y"),b=c(1,2,3,4))
> df
a b
1 x 1
2 x 2
3 y 3
4 y 4
What's the easiest way to print out each pair of values as a list of strings like this:
"x1", "x2", "y1", "y2"

apply(df, 1, paste, collapse="")

with(df, paste(a, b, sep=""))
And this should be faster than apply.
About timing
For 10000 rows we get:
df <- data.frame(
a = sample(c("x","y"), 10000, replace=TRUE),
b = sample(1L:4L, 10000, replace=TRUE)
)
N = 100
mean(replicate(N, system.time( with(df, paste(a, b, sep="")) )["elapsed"]), trim=0.05)
# 0.005778
mean(replicate(N, system.time( apply(df, 1, paste, collapse="") )["elapsed"]), trim=0.05)
# 0.09611
So the speed-up is already visible with a few thousand rows.
That's because Shane's solution calls paste separately for each row, so there are nrow(df) calls to paste; in my solution there is a single vectorised call.
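To make the difference concrete, here is a minimal sketch on the small example from the question. The names vectorised and row_wise are only illustrative. It checks that one call to paste0 on whole columns gives the same strings as one call to paste per row:

```r
df <- data.frame(a = c("x", "x", "y", "y"), b = c(1, 2, 3, 4))

# One vectorised call: paste0 works on whole columns at once.
vectorised <- paste0(df$a, df$b)

# One call per row: apply() first converts df to a character matrix,
# then calls paste() separately for each row.
row_wise <- unname(apply(df, 1, paste, collapse = ""))

print(vectorised)
print(identical(vectorised, row_wise))
```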

Also, you can use sqldf library:
library("sqldf")
df<-data.frame(a=c("x","x","y","y"),b=c(1,2,3,4))
result <- sqldf("SELECT a || cast(cast(b as integer) as text) as concat FROM df")
You will get the following result:
concat
1 x1
2 x2
3 y3
4 y4

Related

R: convert text duration ("..d ..h ..m ..s") into seconds

Trying to convert the following durations into seconds
x <- "1005d 16h 09m 57s"
x1 <- "16h 09m 57s"
x2 <- "06h 09m 57s"
x3 <- "09m 57s"
x4 <- "57s"
I've modified the answer from Jthorpe in this post Convert factor of format Hh Mm Ss to time duration.
days <- as.numeric(gsub('^*([0-9]+)d.*$','\\1',x))
hours <- as.numeric(gsub('^.*([0-9][0-9])h.*$','\\1',x))
minutes <- as.numeric(gsub('^.*([0-9][0-9])m.*$','\\1',x))
seconds <- as.numeric(gsub('^.*([0-9][0-9])s.*$','\\1',x))
duration_seconds <- seconds + 60*minutes + 60*60*hours + 24*60*60*days
However, this is only working with x, but not x1-x4. Now, I know I can probably use if logic to get around the issue, but is there a better way?
Thanks in advance.
We can replace the whitespace (\\s+) with + using gsub, replace 'd', 'h', 'm', 's' with the corresponding multipliers using gsubfn, and then loop through the resulting strings and evaluate each one as an arithmetic expression.
library(gsubfn)
v2 <- gsubfn("[a-z]", list(d="*24*60*60", h = "*60*60", m = "*60",
s="*1"), gsub("\\s+", "+", v1))
unname(sapply(v2, function(x) eval(parse(text=x))))
#[1] 86890197 58197 22197 597 57
data
v1 <- c(x, x1, x2, x3, x4)
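If you would rather not evaluate generated strings, a base-R sketch along these lines is one alternative. The function name to_seconds and the lookup vector mult are made up for this illustration, and it assumes every token is a run of digits followed by one of d, h, m, s:

```r
to_seconds <- function(x) {
  # Seconds per unit (hypothetical lookup table for this sketch).
  mult <- c(d = 24 * 60 * 60, h = 60 * 60, m = 60, s = 1)
  vapply(x, function(one) {
    # Pull out tokens such as "16h" or "09m".
    tokens <- regmatches(one, gregexpr("[0-9]+[dhms]", one))[[1]]
    values <- as.numeric(sub("[dhms]$", "", tokens))  # numeric part
    units  <- sub("^[0-9]+", "", tokens)              # unit letter
    sum(values * mult[units])
  }, numeric(1), USE.NAMES = FALSE)
}

v1 <- c("1005d 16h 09m 57s", "16h 09m 57s", "06h 09m 57s", "09m 57s", "57s")
print(to_seconds(v1))
```

Missing units simply contribute nothing to the sum, so no special-casing is needed for the shorter strings.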
Use:
ifelse(is.na(your_exp), 0, your_exp)
so that whenever your expression evaluates to NA it becomes 0, and otherwise keeps its value (ifelse needs both a yes and a no argument).
E.g.:
na0 <- function(v) ifelse(is.na(v), 0, v)  # replace NA with 0, keep everything else
days <- na0(as.numeric(gsub('^([0-9]+)d.*$','\\1',x1)))
hours <- na0(as.numeric(gsub('^.*([0-9][0-9])h.*$','\\1',x1)))
minutes <- na0(as.numeric(gsub('^.*([0-9][0-9])m.*$','\\1',x1)))
seconds <- na0(as.numeric(gsub('^.*([0-9][0-9])s.*$','\\1',x1)))
Output (after duration_seconds <- seconds + 60*minutes + 60*60*hours + 24*60*60*days):
> duration_seconds
[1] 58197

Subset all 3 digit numbers and collapse them with a separator in a data frame. R

I'm formatting a data set so each entry has the adegenet format for codominant markers, such as:
Loci1
###/###
208/210
200/204
198/208
where the # represents any digit (the number is an allele size in base pairs). My data has some homozygous entries (all 3-digit integers with no separator) that have the form of:
Loci1
###
208
198
I intend to paste the 3 digit string to itself with sep='/' to produce the first format. I've tried to use grep to subset these homozygous entries by finding all non ###/### and negating the match using the table matching such as:
a <- grep('\\b\\d{3}?[/]\\d{3}', score$Loci1, value =T ) # Subset all ###/###/
score[!(a %in% 1:nrow(score$Loci1)), ] # works but only on vectors...
After the subset I could paste. The problem arises when I apply this to a data frame. grep seems to treat the data frame as a list (which in part it is) and returns columns that have a match.
So, in short, how can I go from ### to ###/### in a data frame?
self contained example of data:
score2 <- NULL
set.seed(9)
Loci1 <- NULL
Loci2 <- NULL
Loci3 <- NULL
for (i in 1:5) Loci1 <- append(Loci1, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci2 <- append(Loci2, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci3 <- append(Loci3, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
score2 <- data.frame(Loci1, Loci2, Loci3, stringsAsFactors = F)
score2[2,3] <- strsplit(score2[2,3], split = '/')[1]
score2[5,2] <- strsplit(score2[3,3], split = '/')[1]
score2[1,1] <- strsplit(score2[1,1], split = '/')[1]
score2[c(1, 4),c(2,3)] <- NA
score2
You could just replace the 3 digit items with the separator and a copy:
sub("^(...)$", "\\1/\\1", Loci1)
Use lapply with an anonymous function:
data.frame( lapply(score2, function(x) sub("^(...)$", "\\1/\\1", x) ) )
Loci1 Loci2 Loci3
1 251/251 <NA> <NA>
2 251/329 320/257 260/260
3 275/242 278/329 281/320
4 269/266 <NA> <NA>
5 296/326 281/281 326/314
(Not sure what the "paste-part" was supposed to refer to, but I think this was the intent of your question)
If the numeric values could have a varying number of digits, then use a pattern argument like "^([0-9]{1,9})$".
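As a quick illustration of the variable-width pattern, here is a small sketch on a made-up vector (alleles is a hypothetical name, not part of the question's data). Entries that already contain a separator, and NA values, are left alone:

```r
alleles <- c("208", "198/208", NA, "95")

# Duplicate any entry that consists only of digits; everything else is unchanged.
doubled <- sub("^([0-9]+)$", "\\1/\\1", alleles)
print(doubled)
```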
An option using grep/paste,
m1 <- as.matrix(score2)
indx <- grep('^...$', m1)
m1[indx] <- paste(m1[indx], m1[indx], sep="/")
as.data.frame(m1)
# Loci1 Loci2 Loci3
#1 251/251 <NA> <NA>
#2 251/329 320/257 260/260
#3 275/242 278/329 281/320
#4 269/266 <NA> <NA>
#5 296/326 281/281 326/314
Or without converting to matrix, this can be done using lapply
score2[] <- lapply(score2, function(x) ifelse(grepl('^...$', x),
paste(x, x, sep="/"),x))
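The score2[] <- lapply(...) idiom keeps the object a data frame while replacing every column. A minimal self-contained check on a made-up two-column data frame (scores is a hypothetical stand-in for score2, and sub is used here instead of ifelse/grepl purely for brevity):

```r
scores <- data.frame(Loci1 = c("251", "251/329"),
                     Loci2 = c(NA, "320/257"),
                     stringsAsFactors = FALSE)

# Assigning into scores[] replaces the columns but keeps the data.frame class.
scores[] <- lapply(scores, function(col) sub("^([0-9]{3})$", "\\1/\\1", col))
print(scores)
```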

How to properly manipulate a string column in a data frame in R?

I have a data.frame with a string column that contains periods e.g "a.b.c.X". I want to split out the string by periods and retain the third segment e.g. "c" in the example given. Here is what I'm doing.
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> df
v b
1 a.b.a.X 1
2 a.b.b.X 2
3 a.b.c.X 3
And what I want is
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> df
v b
1 a 1
2 b 2
3 c 3
I'm attempting to use within, but I'm getting strange results. The value in the first row in the first column is being repeated.
> get = function(x) { unlist(strsplit(x, "\\."))[3] }
> within(df, v <- get(as.character(v)))
v b
1 a 1
2 a 2
3 a 3
What is the best practice for doing this? What am I doing wrong?
Update:
Here is the solution I used from @agstudy's answer:
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> get = function(x) gsub(".*?[.].*?[.](.*?)[.].*", '\\1', x)
> within(df, v <- get(v))
v b
1 a 1
2 b 2
3 c 3
Using a regular expression you can do:
gsub(".*?[.].*?[.](.*?)[.].*", '\\1', df$v)
[1] "a" "b" "c"
Or, more concisely:
gsub("(.*?[.]){2}(.*?)[.].*", '\\2', df$v)
The problem is not with within but with your get function. It returns a single character ("a") which gets recycled when added to your data.frame. Your code should look like this:
get.third <- function(x) sapply(strsplit(x, "\\."), `[[`, 3)
within(df, v <- get.third(as.character(v)))
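The recycling problem is easy to see on a plain character vector. This is only a sketch with illustrative names (splits, flattened_third, per_string_third):

```r
v <- c("a.b.a.X", "a.b.b.X", "a.b.c.X")
splits <- strsplit(v, ".", fixed = TRUE)   # a list with one character vector per string

# unlist() flattens all pieces into one vector, so [3] is a single value.
flattened_third <- unlist(splits)[3]

# Indexing inside each list element gives one value per input string.
per_string_third <- vapply(splits, `[`, character(1), 3)

print(flattened_third)
print(per_string_third)
```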
Here is one possible solution:
df[, "v"] <- do.call(rbind, strsplit(as.character(df[, "v"]), "\\."))[, 3]
## > df
## v b
## 1 a 1
## 2 b 2
## 3 c 3
The answer to "what am I doing wrong" is that the bit of code that you thought was extracting the third element of each split string was actually putting all the elements of all your strings in a single vector, and then returning the third element of that:
get = function(x) {
splits = strsplit(x, "\\.")
print("All the elements: ")
print(unlist(splits))
print("The third element:")
print(unlist(splits)[3])
# What you actually wanted:
third_chars = sapply(splits, function (x) x[3])
}
within(df, v2 <- get(as.character(v)))

How to calculate the number of occurrence of a given character in each row of a column of strings?

I have a data.frame in which certain variables contain a text string. I wish to count the number of occurrences of a given character in each individual string.
Example:
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"))
I wish to create a new column for q.data with the number of occurrences of "a" in string (i.e. c(2,1,0)).
The only convoluted approach I have managed is:
string.counter<-function(strings, pattern){
counts<-NULL
for(i in 1:length(strings)){
counts[i]<-length(attr(gregexpr(pattern,strings[i])[[1]], "match.length")[attr(gregexpr(pattern,strings[i])[[1]], "match.length")>0])
}
return(counts)
}
string.counter(strings=q.data$string, pattern="a")
number string number.of.a
1 1 greatgreat 2
2 2 magic 1
3 3 not 0
The stringr package provides the str_count function, which seems to do what you're interested in:
# Load your example data
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = F)
library(stringr)
# Count the number of 'a's in each element of string
q.data$number.of.a <- str_count(q.data$string, "a")
q.data
# number string number.of.a
#1 1 greatgreat 2
#2 2 magic 1
#3 3 not 0
If you don't want to leave base R, here's a fairly succinct and expressive possibility:
x <- q.data$string
lengths(regmatches(x, gregexpr("a", x)))
# [1] 2 1 0
nchar(as.character(q.data$string)) -nchar( gsub("a", "", q.data$string))
[1] 2 1 0
Notice that I coerce the factor variable to character, before passing to nchar. The regex functions appear to do that internally.
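Wrapped up as a small helper, the same idea might look like this. count_char is a hypothetical name, and fixed = TRUE is an assumption that the thing being counted is a literal character rather than a regular expression:

```r
count_char <- function(x, ch) {
  x <- as.character(x)   # nchar() needs character input, so coerce factors first
  nchar(x) - nchar(gsub(ch, "", x, fixed = TRUE))
}

strings <- factor(c("greatgreat", "magic", "not"))
counts <- count_char(strings, "a")
print(counts)
```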
Here are benchmark results (with the test scaled up to 3000 rows):
q.data<-q.data[rep(1:NROW(q.data), 1000),]
str(q.data)
'data.frame': 3000 obs. of 3 variables:
$ number : int 1 2 3 1 2 3 1 2 3 1 ...
$ string : Factor w/ 3 levels "greatgreat","magic",..: 1 2 3 1 2 3 1 2 3 1 ...
$ number.of.a: int 2 1 0 2 1 0 2 1 0 2 ...
benchmark( Dason = { q.data$number.of.a <- str_count(as.character(q.data$string), "a") },
Tim = {resT <- sapply(as.character(q.data$string), function(x, letter = "a"){
sum(unlist(strsplit(x, split = "")) == letter) }) },
DWin = {resW <- nchar(as.character(q.data$string)) -nchar( gsub("a", "", q.data$string))},
Josh = {x <- sapply(regmatches(q.data$string, gregexpr("g",q.data$string )), length)}, replications=100)
#-----------------------
test replications elapsed relative user.self sys.self user.child sys.child
1 Dason 100 4.173 9.959427 2.985 1.204 0 0
3 DWin 100 0.419 1.000000 0.417 0.003 0 0
4 Josh 100 18.635 44.474940 17.883 0.827 0 0
2 Tim 100 3.705 8.842482 3.646 0.072 0 0
Another good option, using charToRaw:
sum(charToRaw("abc.d.aa") == charToRaw('.'))
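charToRaw works on one string at a time, so for a whole column you need to loop over the elements. A minimal sketch, assuming the character to count is a single byte; count_raw is a hypothetical helper name:

```r
count_raw <- function(strings, ch) {
  target <- charToRaw(ch)
  stopifnot(length(target) == 1L)   # assumption: a single-byte character
  vapply(strings,
         function(s) sum(charToRaw(s) == target),   # compare byte by byte
         integer(1), USE.NAMES = FALSE)
}

raw_counts <- count_raw(c("greatgreat", "magic", "not"), "a")
print(raw_counts)
```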
The stringi package provides the functions stri_count and stri_count_fixed which are very fast.
stringi::stri_count(q.data$string, fixed = "a")
# [1] 2 1 0
benchmark
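Compared to the fastest approach from @42-'s answer and to the equivalent function from the stringr package, for a vector with 30,000 elements.
library(microbenchmark)
library(stringr)  # for str_count
library(ggplot2)  # autoplot() on a microbenchmark object needs ggplot2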
benchmark <- microbenchmark(
stringi = stringi::stri_count(test.data$string, fixed = "a"),
baseR = nchar(test.data$string) - nchar(gsub("a", "", test.data$string, fixed = TRUE)),
stringr = str_count(test.data$string, "a")
)
autoplot(benchmark)
data
q.data <- data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
test.data <- q.data[rep(1:NROW(q.data), 10000),]
A variation of https://stackoverflow.com/a/12430764/589165 is
> nchar(gsub("[^a]", "", q.data$string))
[1] 2 1 0
I'm sure someone can do better, but this works:
sapply(as.character(q.data$string), function(x, letter = "a"){
sum(unlist(strsplit(x, split = "")) == letter)
})
greatgreat magic not
2 1 0
or in a function:
countLetter <- function(charvec, letter){
sapply(charvec, function(x, letter){
sum(unlist(strsplit(x, split = "")) == letter)
}, letter = letter)
}
countLetter(as.character(q.data$string),"a")
You could just use string division
require(roperators)
my_strings <- c('apple', 'banana', 'pear', 'melon')
my_strings %s/% 'a'
Which will give you 1, 3, 1, 0. You can also use string division with regular expressions and whole words.
The question below has been moved here, but it seems this page doesn't directly answer Farah El's question.
How to find number 1s in 101 in R
So, I'll write an answer here, just in case.
library(magrittr)
library(stringr)  # provides str_count
n %>% # n is a number you'd like to inspect
as.character() %>%
str_count(pattern = "1")
https://stackoverflow.com/users/8931457/farah-el
Yet another base R option could be:
lengths(lapply(q.data$string, grepRaw, pattern = "a", all = TRUE, fixed = TRUE))
[1] 2 1 0
The next expression does the job and also works for symbols, not only letters.
The expression works as follows:
1: It uses lapply to iterate over the rows of column 2 of the data frame q.data ("lapply(q.data[,2],").
2: It applies to each row of column 2 the function "function(x){sum('a' == strsplit(as.character(x), '')[[1]])}".
The function takes each row value of column 2 (x), converts it to character (in case it is a factor, for example), and splits the string into single characters ("strsplit(as.character(x), '')"). The result is a vector containing each character of the string for that row.
3: Each element of that vector is compared with the character to be counted, in this case "a" (" 'a' == "). This returns a logical vector "c(TRUE,FALSE,TRUE,....)" that is TRUE wherever the element matches the character to be counted.
4: The number of times 'a' appears in the row is the sum of the TRUE values in that vector "sum(....)".
5: Finally, "unlist" unpacks the result of "lapply" so that it can be assigned to a new column of the data frame ("q.data$number.of.a<-unlist(....").
q.data$number.of.a<-unlist(lapply(q.data[,2],function(x){sum('a' == strsplit(as.character(x), '')[[1]])}))
>q.data
# number string number.of.a
#1 1 greatgreat 2
#2 2 magic 1
#3 3 not 0
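The numbered steps above can be traced on a single string. This is just a sketch with illustrative variable names (chars, hits, n_a):

```r
x <- "greatgreat"

chars <- strsplit(as.character(x), "")[[1]]  # step 2: split into single characters
hits  <- "a" == chars                        # step 3: logical vector, TRUE where the character is "a"
n_a   <- sum(hits)                           # step 4: TRUE counts as 1, so the sum is the count

print(chars)
print(hits)
print(n_a)
```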
Another base R answer, not as good as those by @IRTFM and @Finn (or as those using stringi/stringr), but better than the others:
sapply(strsplit(q.data$string, split=""), function(x) sum(x %in% "a"))
q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"))
q.data<-q.data[rep(1:NROW(q.data), 3000),]
library(rbenchmark)
library(stringr)
library(stringi)
benchmark( Dason = {str_count(q.data$string, "a") },
Tim = {sapply(q.data$string, function(x, letter = "a"){sum(unlist(strsplit(x, split = "")) == letter) }) },
DWin = {nchar(q.data$string) -nchar( gsub("a", "", q.data$string, fixed=TRUE))},
Markus = {stringi::stri_count(q.data$string, fixed = "a")},
Finn={nchar(gsub("[^a]", "", q.data$string))},
tmmfmnk={lengths(lapply(q.data$string, grepRaw, pattern = "a", all = TRUE, fixed = TRUE))},
Josh1 = {sapply(regmatches(q.data$string, gregexpr("g",q.data$string )), length)},
Josh2 = {lengths(regmatches(q.data$string, gregexpr("g",q.data$string )))},
Iago = {sapply(strsplit(q.data$string, split=""), function(x) sum(x %in% "a"))},
replications =100, order = "elapsed")
test replications elapsed relative user.self sys.self user.child sys.child
4 Markus 100 0.076 1.000 0.076 0.000 0 0
3 DWin 100 0.277 3.645 0.277 0.000 0 0
1 Dason 100 0.290 3.816 0.291 0.000 0 0
5 Finn 100 1.057 13.908 1.057 0.000 0 0
9 Iago 100 3.214 42.289 3.215 0.000 0 0
2 Tim 100 6.000 78.947 6.002 0.000 0 0
6 tmmfmnk 100 6.345 83.487 5.760 0.003 0 0
8 Josh2 100 12.542 165.026 12.545 0.000 0 0
7 Josh1 100 13.288 174.842 13.268 0.028 0 0
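The easiest and cleanest way IMHO is:
q.data$number.of.a <- lengths(regmatches(q.data$string, gregexpr('a', q.data$string)))
# number string number.of.a
#1 1 greatgreat 2
#2 2 magic 1
#3 3 not 0
Note that lengths(gregexpr('a', q.data$string)) on its own would report 1 for "not", because gregexpr returns -1 (a vector of length 1) when there is no match.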
s <- "aababacababaaathhhhhslsls jsjsjjsaa ghhaalll"
p <- "a"
s2 <- gsub(p,"",s)
numOcc <- nchar(s) - nchar(s2)
This may not be the most efficient approach, but it served my purpose.

How to expand a list with NULLs up to some length?

Given a list whose length is <= N, what is the best / most efficient way to fill it up with trailing NULLs so that it has length N?
This is something which is a one-liner in any decent language, but I don't have a clue how to do it (efficiently) in a few lines in R so that it works for every corner case (zero length list etc.).
Let's keep it really simple:
tst<-1:10 #whatever, to get a vector of length 10
tst<-tst[1:15]
Try this :
> l = list("a",1:3)
> N = 5
> l[N+1]=NULL
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
>
How about this ?
> l = list("a",1:3)
> length(l)=5
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
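To cover the corner cases mentioned in the question, the length<- approach can be wrapped in a small function. pad_list is a hypothetical name, and refusing lists longer than n is a design assumption of this sketch (length<- would otherwise truncate them):

```r
pad_list <- function(x, n) {
  stopifnot(is.list(x), length(x) <= n)  # assumption: never shorten the list
  length(x) <- n                         # new trailing elements are NULL
  x
}

padded <- pad_list(list("a", 1:3), 5)
empty  <- pad_list(list(), 3)            # zero-length input also works
str(padded)
str(empty)
```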
Directly editing the list's length appears to be the fastest as far as I can tell:
tmp <- vector("list",5000)
sol1 <- function(x){
x <- x[1:10000]
}
sol2 <- function(x){
x[10001] <- NULL
}
sol3 <- function(x){
length(x) <- 10000
}
library(rbenchmark)
benchmark(sol1(tmp),sol2(tmp),sol3(tmp),replications = 5000)
test replications elapsed relative user.self sys.self user.child sys.child
1 sol1(tmp) 5000 2.045 1.394952 1.327 0.727 0 0
2 sol2(tmp) 5000 2.849 1.943383 1.804 1.075 0 0
3 sol3(tmp) 5000 1.466 1.000000 0.937 0.548 0 0
But the differences aren't huge, unless you're doing this a lot on very long lists, I suppose.
I'm sure there are shorter ways, but I would be inclined to do:
l <- as.list(1:10)
N <- 15
l <- c(l, as.list(rep(NA, N - length(l) )))
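Note that this pads with NA rather than NULL. If NULL padding is what you need, a sketch of the same c() approach using vector("list", k), which creates a list of k NULLs, could look like this (variable names are illustrative):

```r
l <- as.list(1:3)
N <- 5

padded_na   <- c(l, as.list(rep(NA, N - length(l))))   # trailing elements are NA
padded_null <- c(l, vector("list", N - length(l)))     # trailing elements are NULL

print(is.na(padded_na[[4]]))
print(is.null(padded_null[[4]]))
```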
Hi: I'm not sure if you were talking about an actual list but, if you were, the code below will work. It works because, once you assign to an element of a vector (which a list is) that is not there, R expands the vector up to that index; assigning NULL then removes that last element again, so the list ends up with n elements.
n <- 10
temp <- list("a","b")
print(temp)
temp[n + 1] <- NULL
print(temp)