I'm working with a data structure that is stratified by three levels. For example, suppose my structure D is
First Second Third Value Index (let's do 1-indexing)
1 1 1 a 1
1 1 2 a 2
1 1 3 a 3
1 2 1 a 4
1 2 2 a 5
2 1 1 a 6
2 1 2 a 7
I formed a nested list structure to index components corresponding to each grouping structure:
[[1]]
[[1]][[1]]
1 2 3
[[1]][[2]]
4 5
[[2]]
[[2]][[1]]
6 7
In reality, my structure D is enormous, and in my context I will perform operations on random subsets of D. The way I'm doing the random subsetting is stratified sampling of each of the different levels. For example, suppose the first level contains 1:1000, the second is 1:100, the third containing 1:50. I may subsample 50 of the first-level identifiers, 10 of the second-level identifiers, and 3 of the third-level identifiers, which may give me something like (call the below nested structure A)
[[1]]
[[1]][[1]]
2 27 49
[[1]][[2]]
61 80 95
....
[[1]][[10]]
2409 3509 5609
[[2]]
[[2]][[1]]
7092 8091 9039
...
After subsetting based on the above indices, say now with structure D', the indices for D' are no longer the above, but rather (call the below nested structure A')
[[1]]
[[1]][[1]]
1 2 3
[[1]][[2]]
4 5 6
....
[[1]][[10]]
28 29 30
[[2]]
[[2]][[1]]
31 32 33
...
What is the most efficient way to map A to A'? It would be nice for this to be efficient, since it appears I'll be doing this operation many times and it could be a rate-limiting step in my entire program. I'm currently using Rcpp (C++), and it is possible that some subsets are smaller than the required sampling number (for example, a second-level group could have fewer than 10 identifiers, say 7, in which case we just take all 7 labels).
Before going straight to Rcpp, I would encourage you to think about how we could do this in base R. Lists, subsetting, and indexing are the bread and butter of base R and can be quite efficient.
Below is a solution that attacks your problem. It could probably be made more efficient, but for now it is a very straightforward approach that is easy to understand. We will test the efficiency later and address any concerns then.
A_prime <- function(A) {
    ## Generate index vector for all of A
    ind <- seq_len(sum(unlist(lapply(A, lengths))))
    ## Generate the ending index of each vector
    endInd <- cumsum(lengths(unlist(A, recursive = FALSE)))
    ## Use the endInd to create the corresponding start index
    startInd <- c(1L, endInd[-length(endInd)] + 1L)
    ## Create a simple list with the appropriate index vector
    A_ind <- mapply(function(s, e) ind[s:e], startInd, endInd, SIMPLIFY = FALSE)
    ## Again, using the structure of A, we begin creating starting
    ## and ending indices to replicate the structure of A
    A_end <- cumsum(lengths(A))
    A_strt <- c(1L, A_end[-length(A_end)] + 1L)
    ## Create the desired result
    lapply(seq_along(A), function(x) A_ind[A_strt[x]:A_end[x]])
}
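To make the cumsum step in A_prime concrete, here is a tiny sketch on a made-up nested list (the name toy and its values are only for illustration). The running total of the inner vector lengths gives each vector's last position in the renumbered index, and shifting it by one gives the first position.

```r
## Hypothetical toy input: two first-level groups holding 2 and 1 vectors
toy <- list(list(c(11L, 15L, 19L), c(23L, 27L)), list(c(31L, 35L)))

## Lengths of the innermost vectors after flattening one level: 3 2 2
inner_len <- lengths(unlist(toy, recursive = FALSE))

## Last and first position of each vector in the renumbered index
endInd   <- cumsum(inner_len)                      # 3 5 7
startInd <- c(1L, endInd[-length(endInd)] + 1L)    # 1 4 6
```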
Let's test it on a similar problem that the OP has presented:
set.seed(35)
a <- sort(sample(10000, 60))
L1 <- lapply(seq.int(1, 60, 3), function(x) {
a[x:(x + 2)]
})
A <- list(L1[1:10], L1[11:20])
str(A)
List of 2
$ :List of 10
..$ : int [1:3] 4 203 205
..$ : int [1:3] 710 1281 1515
..$ : int [1:3] 1605 1784 1846
..$ : int [1:3] 1904 1993 2425
..$ : int [1:3] 2468 2499 2630
..$ : int [1:3] 2910 2920 3210
..$ : int [1:3] 3360 3464 3469
..$ : int [1:3] 3689 3811 4002
..$ : int [1:3] 4053 4304 4358
..$ : int [1:3] 4433 5290 5862
$ :List of 10
..$ : int [1:3] 6017 6021 6155
..$ : int [1:3] 6250 6370 6414
..$ : int [1:3] 6447 6530 6656
..$ : int [1:3] 6706 6820 6977
..$ : int [1:3] 6986 7148 7338
..$ : int [1:3] 7515 7522 7666
..$ : int [1:3] 7755 7889 7891
..$ : int [1:3] 8071 8143 8487
..$ : int [1:3] 8625 8731 8945
..$ : int [1:3] 8957 9149 9770
And here is the output:
str(A_prime(A))
List of 2
$ :List of 10
..$ : int [1:3] 1 2 3
..$ : int [1:3] 4 5 6
..$ : int [1:3] 7 8 9
..$ : int [1:3] 10 11 12
..$ : int [1:3] 13 14 15
..$ : int [1:3] 16 17 18
..$ : int [1:3] 19 20 21
..$ : int [1:3] 22 23 24
..$ : int [1:3] 25 26 27
..$ : int [1:3] 28 29 30
$ :List of 10
..$ : int [1:3] 31 32 33
..$ : int [1:3] 34 35 36
..$ : int [1:3] 37 38 39
..$ : int [1:3] 40 41 42
..$ : int [1:3] 43 44 45
..$ : int [1:3] 46 47 48
..$ : int [1:3] 49 50 51
..$ : int [1:3] 52 53 54
..$ : int [1:3] 55 56 57
..$ : int [1:3] 58 59 60
That looks pretty good! A benefit of the function A_prime is that the nested list need not be uniform. Let's test this on a very large list with over 10Mb of data:
set.seed(123)
big_ind <- sort(sample(1e8, 1e6))
## generate random chunks
endIndBig <- sort(sample(1e6, 1e5))
startIndBig <- c(1L, endIndBig)
endIndBig <- c(endIndBig, 1e6)
A_big_init <- mapply(function(s, e) big_ind[s:e], startIndBig, endIndBig, SIMPLIFY = FALSE)
## generate random chunks for nested lists
A_big_ends <- sort(sample(length(A_big_init), 1e3))
A_big_strts <- c(1L, A_big_ends)
A_big_ends <- c(A_big_ends, length(A_big_init))
A_big <- lapply(seq_along(A_big_ends), function(x) A_big_init[A_big_strts[x]:A_big_ends[x]])
Here is some summary information on A_big. As you can see the length of each sub list is not uniform:
print(object.size(A_big), units = "Mb")
# 10.7 Mb
length(A_big)
# [1] 1001
head(lengths(A_big))
# [1] 159 175 59 69 175 38
tail(lengths(A_big))
# [1] 88 4 225 91 74 59
A_big[[1]][[1]]
# [1] 3 72 722 836 929 1014 1091 1127
A_big[[1]][[159]]
# [1] 170285 170370 170482 170763 170793 170913 170965 171066 171240 171397 171464 171572 171590 171722
# [15] 171898 171903 172196 172284 172298 172590 172696 172698
A_big[[1000]][[2]]
# [1] 99856337 99856415
A_big[[1000]][[74]]
# [1] 99938669 99938699 99938743 99939158 99939664 99939803
Now, for the moment of truth...
system.time(A_prime(A_big))
user system elapsed
0.201 0.003 0.203
A_big_prime <- A_prime(A_big)
A_big_prime[[1]][[1]]
# [1] 1 2 3 4 5 6 7 8
A_big_prime[[1]][[159]]
# [1] 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898
# [21] 1899 1900
A_big_prime[[1000]][[2]]
# [1] 1109671 1109672
A_big_prime[[1000]][[74]]
# [1] 1110583 1110584 1110585 1110586 1110587 1110588
That's not bad!!! As we can see, the mapping was performed correctly and quickly all with base R.
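As a possible compact alternative (a sketch, not part of the solution above), relist() from the utils package can pour a flat vector back into the shape of a nested list. The input A_small below is made up for illustration, and I have not timed this against A_prime, so it is worth checking with system.time() before relying on it for large inputs.

```r
## Hypothetical small input with non-uniform inner lengths
A_small <- list(list(c(4L, 203L, 205L), c(710L, 1281L)),
                list(c(6017L, 6021L, 6155L)))

## Number the flattened elements 1..n, then restore the nesting of A_small
A_small_prime <- relist(seq_along(unlist(A_small)), skeleton = A_small)

## Inspect the result: same nesting as A_small, consecutive indices inside
str(A_small_prime)
```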
I have data where some of the items are numbers separated by "|", like:
head(mintimes)
[1] "3121|3151" "1171" "1351|1381" "1050" "" "122"
head(minvalues)
[1] 14 10 11 31 Inf 22
What I would like to do is extract all the times and match them to the minvalues. To end up with something like:
times values
3121 14
3151 14
1171 10
1351 11
1381 11
1050 31
122 22
I've tried to strsplit(mintimes, "|") and I've tried str_extract(mintimes, "[0-9]+") but they don't seem to work. Any ideas?
| is a regular expression metacharacter. When used literally, these special characters need to be escaped either with [] or with \\ (or you could use fixed = TRUE in some functions). So your call to strsplit() should be
strsplit(mintimes, "[|]")
or
strsplit(mintimes, "\\|")
or
strsplit(mintimes, "|", fixed = TRUE)
Regarding your other try with stringr functions, str_extract_all() seems to do the trick.
library(stringr)
str_extract_all(mintimes, "[0-9]+")
To get your desired result,
> mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
> minvalues <- c(14, 10, 11, 31, Inf, 22)
> s <- strsplit(mintimes, "[|]")
> data.frame(times = as.numeric(unlist(s)),
values = rep(minvalues, sapply(s, length)))
# times values
# 1 3121 14
# 2 3151 14
# 3 1171 10
# 4 1351 11
# 5 1381 11
# 6 1050 31
# 7 122 22
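To see why the unescaped pattern fails and why the rep() call above lines things up, here is a small sketch using the same sample data. The names bad and n_pieces are only for illustration.

```r
mintimes  <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)

## Unescaped: the regex "|" matches the empty string, so the input is
## split into single characters and the bar itself is kept
bad <- strsplit(mintimes[1], "|")[[1]]
bad
# "3" "1" "2" "1" "|" "3" "1" "5" "1"

## Escaped: one piece per time; the empty string yields zero pieces
s <- strsplit(mintimes, "[|]")
n_pieces <- lengths(s)   # 2 1 2 1 0 1

## The Inf entry is repeated zero times, so it drops out of the result
rep(minvalues, n_pieces)
```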
By default strsplit splits using a regular expression and "|" is a special character in the regular expression syntax. You can either escape it
strsplit(mintimes,"\\|")
or just set fixed=T to not use regular expressions
strsplit(mintimes,"|", fixed=T)
I have written a function called cSplit that is useful for these types of things. You can get it from my Gist: https://gist.github.com/mrdwab/11380733
Usage would be:
cSplit(data.table(mintimes, minvalues), "mintimes", "|", "long")
# mintimes minvalues
# 1: 3121 14
# 2: 3151 14
# 3: 1171 10
# 4: 1351 11
# 5: 1381 11
# 6: 1050 31
# 7: 122 22
It also has a "wide" setting, in case that would be at all useful to you:
cSplit(data.table(mintimes, minvalues), "mintimes", "|", "wide")
# minvalues mintimes_1 mintimes_2
# 1: 14 3121 3151
# 2: 10 1171 NA
# 3: 11 1351 1381
# 4: 31 1050 NA
# 5: Inf NA NA
# 6: 22 122 NA
Note: The output is a data.table.
As others have mentioned, you need to escape the | to include it literally in a regular expression. As always, we can skin this cat many ways, and here's one way to do it with stringr:
x <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
library(stringr)
unlist(str_extract_all(x, "\\d+"))
# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"
This won't work as expected if you have any decimal points in a character string of numbers, so the following (which says to match anything but |) might be safer:
unlist(str_extract_all(x, '[^|]+'))
# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"
Either way, you might want to wrap the result in as.numeric.
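If you would rather avoid the stringr dependency, a rough base R equivalent of str_extract_all() is regmatches() combined with gregexpr(). This is a sketch on the same sample vector, with the conversion to numeric included.

```r
x <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")

## gregexpr() finds every run of digits; regmatches() pulls them out
## as a list with one character vector per input string
digits <- regmatches(x, gregexpr("[0-9]+", x))

## Flatten and convert; the empty string contributes nothing
times <- as.numeric(unlist(digits))
times
```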
And here's another solution using stri_split_fixed from the stringi package. As an added value, we also play with mapply and do.call.
Input data:
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)
Split mintimes w.r.t. | and convert to numeric:
library("stringi")
mintimes <- lapply(stri_split_fixed(mintimes, "|"), as.numeric)
## [[1]]
## [1] 3121 3151
##
## [[2]]
## [1] 1171
##
## [[3]]
## [1] 1351 1381
##
## [[4]]
## [1] 1050
##
## [[5]]
## [1] NA
##
## [[6]]
## [1] 122
Column-bind each minvalues with corresponding mintimes:
tmp <- mapply(cbind, mintimes, minvalues)
## [[1]]
## [,1] [,2]
## [1,] 3121 14
## [2,] 3151 14
##
## [[2]]
## [,1] [,2]
## [1,] 1171 10
##
## [[3]]
## [,1] [,2]
## [1,] 1351 11
## [2,] 1381 11
##
## [[4]]
## [,1] [,2]
## [1,] 1050 31
##
## [[5]]
## [,1] [,2]
## [1,] NA Inf
##
## [[6]]
## [,1] [,2]
## [1,] 122 22
Row-bind all the 6 matrices & remove NA-rows:
res <- do.call(rbind, tmp)
res[!is.na(res[,1]),]
## [,1] [,2]
## [1,] 3121 14
## [2,] 3151 14
## [3,] 1171 10
## [4,] 1351 11
## [5,] 1381 11
## [6,] 1050 31
## [7,] 122 22
To get the output you want, try something like this:
library(dplyr)
Split.Times <- function(x) {
mintimes <- as.numeric(unlist(strsplit(as.character(x$mintimes), "\\|")))
return(data.frame(mintimes = mintimes, minvalues = x$minvalues, stringsAsFactors=FALSE))
}
df <- data.frame(mintimes, minvalues, stringsAsFactors=FALSE)
df %>%
filter(mintimes != "") %>%
group_by(mintimes) %>%
do(Split.Times(.))
This produces:
mintimes minvalues
1 1050 31
2 1171 10
3 122 22
4 1351 11
5 1381 11
6 3121 14
7 3151 14
(I borrowed from my answer here - which is pretty much the same question/problem)
Here's a qdap package approach:
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)
library(qdap)
list2df(setNames(strsplit(mintimes, "\\|"), minvalues), "times", "values")
## times values
## 1 3121 14
## 2 3151 14
## 3 1171 10
## 4 1351 11
## 5 1381 11
## 6 1050 31
## 7 122 22
You can use [:punct:]
strsplit(mintimes, "[[:punct:]]")
I would like to split strings on the first and last comma. Each string has at least two
commas. Below is an example data set and the desired result.
A similar question here asked how to split on the first comma: Split on first comma in string
Here I asked how to split strings on the first two colons: Split string on first two colons
Thank you for any suggestions. I prefer a solution in base R. Sorry if this is a duplicate.
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
desired.result <- read.table(text='
my.string1 my.string2 my.string3 some.data
123 34,56,78 90 10
87 65,43 21 20
a4 b6 c8888 30
11 bbbb ccccc 40
uu vv,ww xx 50
j k,l,m,n,o p 60', header = TRUE, stringsAsFactors=FALSE)
You can use the \K operator which keeps text already matched out of the result and a negative look ahead assertion to do this (well almost, there is an annoying comma at the start of the middle portion which I am yet to get rid of in the strsplit). But I enjoyed this as an exercise in constructing a regex...
x <- '123,34,56,78,90'
strsplit( x , "^[^,]+\\K|,(?=[^,]+$)" , perl = TRUE )
#[[1]]
#[1] "123" ",34,56,78" "90"
Explanation:
^[^,]+ : from the start of the string match one or more characters that are not a ,
\\K : but don't include those matched characters in the match
So the first match is the first comma...
| : or you can match...
,(?=[^,]+$) : a , so long as it is followed by [(?=...)] one or more characters that are not a , until the end of the string ($)...
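One way to sidestep the leftover comma is to capture the three pieces instead of splitting. The sketch below swaps in a different technique from the strsplit() call above: regexec() with three capture groups, read back with regmatches(). The names pattern, m and parts are only for illustration.

```r
x <- c("123,34,56,78,90", "a4,b6,c8888")

## Group 1: text before the first comma
## Group 2: everything between the first and last comma
## Group 3: text after the last comma
pattern <- "^([^,]*),(.*),([^,]*)$"

## Each list element holds the full match followed by the three groups
m <- regmatches(x, regexec(pattern, x))

## Drop the full match and stack the groups into one row per string
parts <- t(sapply(m, function(p) p[-1]))
parts
```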
Here is a relatively simple approach. In the first line we use sub to replace the first and last commas with semicolons producing s. Then we read s using sep=";" and finally cbind the rest of my.data to it:
s <- sub(",(.*),", ";\\1;", my.data[[1]])
DF <- read.table(text=s, sep =";", col.names=paste0("mystring",1:3), as.is=TRUE)
cbind(DF, my.data[-1])
giving:
mystring1 mystring2 mystring3 some.data
1 123 34,56,78 90 10
2 87 65,43 21 20
3 a4 b6 c8888 30
4 11 bbbb ccccc 40
5 uu vv,ww xx 50
6 j k,l,m,n,o p 60
Here is code to split on the first and last comma. This code draws heavily from an answer by #bdemarest here: Split string on first two colons. The gsub pattern below, which is the meat of the answer, contains important differences. The code for creating the new data frame after the strings are split is the same as that of #bdemarest.
# Replace first and last commas with colons.
new.string <- gsub(pattern="(^[^,]+),(.+),([^,]+$)",
replacement="\\1:\\2:\\3", x=my.data$my.string)
new.string
# Split on colons
split.data <- strsplit(new.string, ":")
# Create data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 my.string3 some.data
# 1 123 34,56,78 90 10
# 2 87 65,43 21 20
# 3 a4 b6 c8888 30
# 4 11 bbbb ccccc 40
# 5 uu vv,ww xx 50
# 6 j k,l,m,n,o p 60
# Here is code for splitting strings on the first comma
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
# Replace first comma with colon
new.string <- gsub(pattern="(^[^,]+),(.+$)",
replacement="\\1:\\2", x=my.data$my.string)
new.string
# Split on colon
split.data <- strsplit(new.string, ":")
# Create data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 some.data
# 1 123 34,56,78,90 10
# 2 87 65,43,21 20
# 3 a4 b6,c8888 30
# 4 11 bbbb,ccccc 40
# 5 uu vv,ww,xx 50
# 6 j k,l,m,n,o,p 60
# Here is code for splitting strings on the last comma
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
# Replace last comma with colon
new.string <- gsub(pattern="^(.+),([^,]+$)",
replacement="\\1:\\2", x=my.data$my.string)
new.string
# Split on colon
split.data <- strsplit(new.string, ":")
# Create new data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 some.data
# 1 123,34,56,78 90 10
# 2 87,65,43 21 20
# 3 a4,b6 c8888 30
# 4 11,bbbb ccccc 40
# 5 uu,vv,ww xx 50
# 6 j,k,l,m,n,o p 60
You can do a simple strsplit here on that column
popshift<-sapply(strsplit(my.data$my.string,","), function(x)
c(x[1], paste(x[2:(length(x)-1)],collapse=","), x[length(x)]))
desired.result <- cbind(data.frame(my.string=t(popshift)), my.data[-1])
I just split up all the values and make a new vector with the first, middle and last strings. Then I cbind that with the rest of the data. The result is
my.string.1 my.string.2 my.string.3 some.data
1 123 34,56,78 90 10
2 87 65,43 21 20
3 a4 b6 c8888 30
4 11 bbbb ccccc 40
5 uu vv,ww xx 50
6 j k,l,m,n,o p 60
Using str_match() from package stringr, and a little help from one of your links,
> library(stringr)
> data.frame(str_match(my.data$my.string, "(.+?),(.*),(.+?)$")[,-1],
some.data = my.data$some.data)
# X1 X2 X3 some.data
# 1 123 34,56,78 90 10
# 2 87 65,43 21 20
# 3 a4 b6 c8888 30
# 4 11 bbbb ccccc 40
# 5 uu vv,ww xx 50
# 6 j k,l,m,n,o p 60
Define:
dats <- list( df1 = data.frame(A=sample(1:3), B = sample(11:13)),
df2 = data.frame(AA=sample(1:3), BB = sample(11:13)))
s.t.
> dats
$df1
A B
1 2 12
2 3 11
3 1 13
$df2
AA BB
1 1 13
2 2 12
3 3 11
I would like to change all variable names from all caps to lower. I can do this with a loop but somehow cannot get this lapply call to work:
dats <- lapply(dats, function(x)
names(x)<-tolower(names(x)))
which results in:
> dats
$df1
[1] "a" "b"
$df2
[1] "aa" "bb"
while the desired result is:
> dats
$df1
a b
1 2 12
2 3 11
3 1 13
$df2
aa bb
1 1 13
2 2 12
3 3 11
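If you don't use return at the end of a function, the value of the last evaluated expression is returned. Here that expression is the assignment names(x) <- ..., whose value is the vector of new names rather than the data frame. So you need to return x.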
dats <- lapply(dats, function(x) {
names(x)<-tolower(names(x))
x})
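A slightly shorter variant, sketched here as an alternative rather than a correction, uses setNames(), which returns the renamed object itself and so needs no separate return line. The data frames below use fixed values instead of sample() so the result is reproducible, and dats_lower is a name chosen for illustration.

```r
dats <- list(df1 = data.frame(A = 1:3, B = 11:13),
             df2 = data.frame(AA = 1:3, BB = 11:13))

## setNames() returns the data frame with its names replaced
dats_lower <- lapply(dats, function(x) setNames(x, tolower(names(x))))

## Each element is still a data frame, now with lower-case column names
lapply(dats_lower, names)
```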
I have several vectors of unequal length and I would like to cbind them. I've put the vectors into a list and I have tried to combine them using do.call(cbind, ...):
nm <- list(1:8, 3:8, 1:5)
do.call(cbind, nm)
# [,1] [,2] [,3]
# [1,] 1 3 1
# [2,] 2 4 2
# [3,] 3 5 3
# [4,] 4 6 4
# [5,] 5 7 5
# [6,] 6 8 1
# [7,] 7 3 2
# [8,] 8 4 3
# Warning message:
# In (function (..., deparse.level = 1) :
# number of rows of result is not a multiple of vector length (arg 2)
As expected, the number of rows in the resulting matrix is the length of the longest vector, and the values of the shorter vectors are recycled to make up for the length.
Instead I'd like to pad the shorter vectors with NA values to obtain the same length as the longest vector. I'd like the matrix to look like this:
# [,1] [,2] [,3]
# [1,] 1 3 1
# [2,] 2 4 2
# [3,] 3 5 3
# [4,] 4 6 4
# [5,] 5 7 5
# [6,] 6 8 NA
# [7,] 7 NA NA
# [8,] 8 NA NA
How can I go about doing this?
You can use indexing: if you index a position beyond the length of a vector, it returns NA. This works for any number of rows, defined here with foo:
nm <- list(1:8,3:8,1:5)
foo <- 8
sapply(nm, '[', 1:foo)
EDIT:
Or in one line using the largest vector as number of rows:
sapply(nm, '[', seq(max(sapply(nm,length))))
From R 3.2.0 you may use lengths ("get the length of each element of a list") instead of sapply(nm, length):
sapply(nm, '[', seq(max(lengths(nm))))
You should fill vectors with NA before calling do.call.
nm <- list(1:8,3:8,1:5)
max_length <- max(unlist(lapply(nm,length)))
nm_filled <- lapply(nm,function(x) {ans <- rep(NA,length=max_length);
ans[1:length(x)]<- x;
return(ans)})
do.call(cbind,nm_filled)
This is a shorter version of Wojciech's solution.
nm <- list(1:8,3:8,1:5)
max_length <- max(sapply(nm,length))
sapply(nm, function(x){
c(x, rep(NA, max_length - length(x)))
})
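Another base R idiom that does the same padding is the replacement function `length<-`, which extends a vector to a given length and fills the new positions with NA. This is a sketch on the same nm list; padded is a name chosen for illustration.

```r
nm <- list(1:8, 3:8, 1:5)

## `length<-`(x, n) returns x extended to length n, padded with NA
padded <- sapply(nm, `length<-`, max(lengths(nm)))
padded
```

Because every padded vector has the same length, sapply() simplifies the result to a matrix with one column per list element.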
Here is an option using stri_list2matrix from stringi
library(stringi)
out <- stri_list2matrix(nm)
class(out) <- 'numeric'
out
# [,1] [,2] [,3]
#[1,] 1 3 1
#[2,] 2 4 2
#[3,] 3 5 3
#[4,] 4 6 4
#[5,] 5 7 5
#[6,] 6 8 NA
#[7,] 7 NA NA
#[8,] 8 NA NA
Late to the party, but you could use cbind.fill from the rowr package with fill = NA
library(rowr)
do.call(cbind.fill, c(nm, fill = NA))
# object object object
#1 1 3 1
#2 2 4 2
#3 3 5 3
#4 4 6 4
#5 5 7 5
#6 6 8 NA
#7 7 NA NA
#8 8 NA NA
If you have a named list instead and want to maintain the headers you could use setNames
nm <- list(a = 1:8, b = 3:8, c = 1:5)
setNames(do.call(cbind.fill, c(nm, fill = NA)), names(nm))
# a b c
#1 1 3 1
#2 2 4 2
#3 3 5 3
#4 4 6 4
#5 5 7 5
#6 6 8 NA
#7 7 NA NA
#8 8 NA NA