I have several vectors of unequal length and I would like to cbind them. I've put the vectors into a list and I have tried to combine the using do.call(cbind, ...):
nm <- list(1:8, 3:8, 1:5)
do.call(cbind, nm)
# [,1] [,2] [,3]
# [1,] 1 3 1
# [2,] 2 4 2
# [3,] 3 5 3
# [4,] 4 6 4
# [5,] 5 7 5
# [6,] 6 8 1
# [7,] 7 3 2
# [8,] 8 4 3
# Warning message:
# In (function (..., deparse.level = 1) :
# number of rows of result is not a multiple of vector length (arg 2)
As expected, the number of rows in the resulting matrix is the length of the longest vector, and the values of the shorter vectors are recycled to make up for the length.
Instead I'd like to pad the shorter vectors with NA values to obtain the same length as the longest vector. I'd like the matrix to look like this:
# [,1] [,2] [,3]
# [1,] 1 3 1
# [2,] 2 4 2
# [3,] 3 5 3
# [4,] 4 6 4
# [5,] 5 7 5
# [6,] 6 8 NA
# [7,] 7 NA NA
# [8,] 8 NA NA
How can I go about doing this?
You can use indexing, if you index a number beyond the size of the object it returns NA. This works for any arbitrary number of rows defined with foo:
nm <- list(1:8,3:8,1:5)
foo <- 8
sapply(nm, '[', 1:foo)
EDIT:
Or in one line using the largest vector as number of rows:
sapply(nm, '[', seq(max(sapply(nm,length))))
From R 3.2.0 you may use lengths ("get the length of each element of a list") instead of sapply(nm, length):
sapply(nm, '[', seq(max(lengths(nm))))
You should fill vectors with NA before calling do.call.
nm <- list(1:8,3:8,1:5)
max_length <- max(unlist(lapply(nm,length)))
nm_filled <- lapply(nm,function(x) {ans <- rep(NA,length=max_length);
ans[1:length(x)]<- x;
return(ans)})
do.call(cbind,nm_filled)
This is a shorter version of Wojciech's solution.
nm <- list(1:8,3:8,1:5)
max_length <- max(sapply(nm,length))
sapply(nm, function(x){
c(x, rep(NA, max_length - length(x)))
})
Here is an option using stri_list2matrix from stringi
library(stringi)
out <- stri_list2matrix(nm)
class(out) <- 'numeric'
out
# [,1] [,2] [,3]
#[1,] 1 3 1
#[2,] 2 4 2
#[3,] 3 5 3
#[4,] 4 6 4
#[5,] 5 7 5
#[6,] 6 8 NA
#[7,] 7 NA NA
#[8,] 8 NA NA
Late to the party but you could use cbind.fill from rowr package with fill = NA
library(rowr)
do.call(cbind.fill, c(nm, fill = NA))
# object object object
#1 1 3 1
#2 2 4 2
#3 3 5 3
#4 4 6 4
#5 5 7 5
#6 6 8 NA
#7 7 NA NA
#8 8 NA NA
If you have a named list instead and want to maintain the headers you could use setNames
nm <- list(a = 1:8, b = 3:8, c = 1:5)
setNames(do.call(cbind.fill, c(nm, fill = NA)), names(nm))
# a b c
#1 1 3 1
#2 2 4 2
#3 3 5 3
#4 4 6 4
#5 5 7 5
#6 6 8 NA
#7 7 NA NA
#8 8 NA NA
Related
I have a dataframe:
df <- data.frame(name=c("john", "david", "callum", "joanna", "allison", "slocum", "lisa"), id=1:7)
df
name id
1 john 1
2 david 2
3 callum 3
4 joanna 4
5 allison 5
6 slocum 6
7 lisa 7
I have a vector containing regex that I wish to find in the df$name variable:
vec <- c("lis", "^jo", "um$")
The output I want to get is as follows:
name id group
1 john 1 2
2 david 2 NA
3 callum 3 3
4 joanna 4 2
5 allison 5 1
6 slocum 6 3
7 lisa 7 1
I could do this doing the following:
df$group <- ifelse(grepl("lis", df$name), 1,
ifelse(grepl("^jo", df$name), 2,
ifelse(grepl("um$", df$name), 3,
NA)
However, I want to do this directly from 'vec'. I am generating different values into vec reactively in a shiny app. Can I assign groups based on index in vec?
Further, if something like the below happens, the group should be the first appearing. e.g. 'Callum' is TRUE for 'all' and "um$" but should get a group 1 here.
vec <- c("all", "^jo", "um$")
Here are several options:
df$group <- apply(Vectorize(grepl, "pattern")(vec, df$name),
1,
function(ii) which(ii)[1])
# name id group
#1 john 1 2
#2 david 2 NA
#3 callum 3 3
#4 joanna 4 2
#5 allison 5 1
#6 slocum 6 3
#7 lisa 7 1
Use a named vector and merge on it:
names(vec) <- seq_along(vec)
df <- merge(df, stack(Vectorize(grep, "pattern", SIMPLIFY=FALSE)(vec, df$name)),
by.x="id", by.y="values", all.x = TRUE)
df[!duplicated(df$id),] # to keep only the first match
# id name ind
#1 1 john 2
#2 2 david <NA>
#3 3 callum 3
#4 4 joanna 2
#5 5 allison 1
#6 6 slocum 3
#7 7 lisa 1
A for loop:
df$group <- NA
for ( i in rev(seq_along(vec))) {
TFvec <- grepl(vec[i], df$name)
df$group[TFvec] <- i
}
df
# name id group
#1 john 1 2
#2 david 2 NA
#3 callum 3 3
#4 joanna 4 2
#5 allison 5 1
#6 slocum 6 3
#7 lisa 7 1
Or you can use outer with stri_match_first_regex from stringi
library(stringi)
match.mat <- outer(df$name, vec, stri_match_first_regex)
df$group <- apply(match.mat, 1, function(ii) which(!is.na(ii))[1])
# [1] for first match in `vec`
# name id group
#1 john 1 2
#2 david 2 NA
#3 callum 3 3
#4 joanna 4 2
#5 allison 5 1
#6 slocum 6 3
#7 lisa 7 1
A vectorised solution, using rebus and stringi.
library(rebus)
library(stringi)
Create a regular expression that captures any of the values in vec.
vec <- c("lis", "^jo", "um$")
(rx <- or1(vec, capture = TRUE))
## <regex> (lis|^jo|um$)
Match the regex, then convert to factor and integer.
matches <- stri_match_first_regex(df$name, rx)[, 2]
df$group <- as.integer(factor(matches, levels = c("lis", "jo", "um")))
df now looks like this:
name id group
1 john 1 2
2 david 2 NA
3 callum 3 3
4 joanna 4 2
5 allison 5 1
6 slocum 6 3
7 lisa 7 1
In the following string:
"I may opt for a yam for Amy, May, and Tommy."
How to remove non-alphabetic characters and convert all letter to lowercase and sort the letters within each word in R?
Meanwhile, I try to sort words in sentence and removes the duplicates.
You could use stringi
library(stringi)
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE))))
Which gives:
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"
Update
As per mentionned by #DavidArenburg, I overlooked the "sort the letters within words" part of your question. You didn't provide a desired output and no immediate application comes to mind but, assuming you want to identify which words have a matching counterpart (string distance of 0):
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE)))) %>%
stringdistmatrix(., ., useNames = "strings", method = "qgram") %>%
# a amy and for i may opt tommy yam
# a 0 2 2 4 2 2 4 6 2
# amy 2 0 4 6 4 0 6 4 0
# and 2 4 0 6 4 4 6 8 4
# for 4 6 6 0 4 6 4 6 6
# i 2 4 4 4 0 4 4 6 4
# may 2 0 4 6 4 0 6 4 0
# opt 4 6 6 4 4 6 0 4 6
# tommy 6 4 8 6 6 4 4 0 4
# yam 2 0 4 6 4 0 6 4 0
apply(., 1, function(x) sum(x == 0, na.rm=TRUE))
# a amy and for i may opt tommy yam
# 1 3 1 1 1 3 1 1 3
Words with more than one 0 per row ("amy", "may", "yam") have a scrambled counterpart.
str <- "I may opt for a yam for Amy, May, and Tommy."
## Clean the words (just keep letters and convert to lowercase)
words <- strsplit(tolower(gsub("[^A-Za-z ]", "", str)), " ")[[1]]
## split the words into characters and sort them
sortedWords <- sapply(words, function(word) sort(unlist(strsplit(word, ""))))
## Join the sorted letters back together
sapply(sortedWords, paste, collapse="")
# i may opt for a yam for amy may and
# "i" "amy" "opt" "for" "a" "amy" "for" "amy" "amy" "adn"
# tommy
# "mmoty"
## If you want to convert result back to string
do.call(paste, lapply(sortedWords, paste, collapse=""))
# [1] "i amy opt for a amy for amy amy adn mmoty"
stringr will let you work on all character sets in R and at C-speed, and magrittr will let you use a piping idiom that works well for your needs:
library(stringr)
library(magrittr)
txt <- "I may opt for a yam for Amy, May, and Tommy."
txt %>%
str_to_lower %>% # lowercase
str_replace_all("[[:punct:][:digit:][:cntrl:]]", "") %>% # only alpha
str_replace_all("[[:space:]]+", " ") %>% # single spaces
str_split(" ") %>% # tokenize
extract2(1) %>% # str_split returns a list
sort %>% # sort
unique # unique words
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"
The qdap package that I maintain has the bag_o_words function that works well for this:
txt <- "I may opt for a yam for Amy, May, and Tommy."
library(qdap)
unique(sort(bag_o_words(txt)))
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"
I have a data.frame and I want to split one of its columns to two based on a regular expression. More specifically the strings have a suffix in parentheses that needs to be extracted to a column of its own.
So e.g. I want to get from here:
dfInit <- data.frame(VAR = paste0(c(1:10),"(",c("A","B"),")"))
to here:
dfFinal <- data.frame(VAR1 = c(1:10), VAR2 = c("A","B"))
1) gsubfn::read.pattern read.pattern in the gsubfn package can do that. The matches to the parenthesized portions of the regular rexpression are regarded as the fields:
library(gsubfn)
read.pattern(text = as.character(dfInit$VAR), pattern = "(.*)[(](.*)[)]$")
giving:
V1 V2
1 1 A
2 2 B
3 3 A
4 4 B
5 5 A
6 6 B
7 7 A
8 8 B
9 9 A
10 10 B
2) sub Another way is to use sub:
data.frame(V1=sub("\\(.*", "", dfInit$VAR), V2=sub(".*\\((.)\\)$", "\\1", dfInit$VAR))
giving the same result.
3) read.table This solution does not use a regular expression:
read.table(text = as.character(dfInit$VAR), sep = "(", comment = ")")
giving the same result.
You could also use extract from tidyr
library(tidyr)
extract(dfInit, VAR, c("VAR1", "VAR2"), "(\\d+).([[:alpha:]]+).", convert=TRUE) # edited and added `convert=TRUE` as per #aosmith's comments.
# VAR1 VAR2
#1 1 A
#2 2 B
#3 3 A
#4 4 B
#5 5 A
#6 6 B
#7 7 A
#8 8 B
#9 9 A
#10 10 B
See Split column at delimiter in data frame
dfFinal <- within(dfInit, VAR<-data.frame(do.call('rbind', strsplit(as.character(VAR), '[[:punct:]]'))))
> dfFinal
VAR.X1 VAR.X2
1 1 A
2 2 B
3 3 A
4 4 B
5 5 A
6 6 B
7 7 A
8 8 B
9 9 A
10 10 B
An approach with regmatches and gregexpr:
as.data.frame(do.call(rbind, regmatches(dfInit$VAR, gregexpr("\\w+", dfInit$VAR))))
You can also use cSplit from splitstackshape.
library(splitstackshape)
cSplit(dfInit, "VAR", "[()]", fixed=FALSE)
# VAR_1 VAR_2
# 1: 1 A
# 2: 2 B
# 3: 3 A
# 4: 4 B
# 5: 5 A
# 6: 6 B
# 7: 7 A
# 8: 8 B
# 9: 9 A
#10: 10 B
I have imported a dataset in R of 10 columns and 100 row. But in few columns there are brackets([]) and commas along with the values. How can i get rid of them?
As an instance, consider one of 4 columns and 2 rows.
V1 V2 V3 V4
3( [4 ([5 8
(1 5 9 [10,
And what i want is
V1 V2 V3 V4
3 4 5 8
1 5 9 10
Just use gsub:
mydf[] <- lapply(mydf, function(x) gsub("[][(),]", "", x))
mydf
# V1 V2 V3 V4
# 1 3 4 5 8
# 2 1 5 9 10
Instead of lapply, you can also use as.matrix:
mydf[] <- gsub("[][(),]", "", as.matrix(mydf))
mydf
# V1 V2 V3 V4
# 1 3 4 5 8
# 2 1 5 9 10
I have data where some of the items are numbers separated by "|", like:
head(mintimes)
[1] "3121|3151" "1171" "1351|1381" "1050" "" "122"
head(minvalues)
[1] 14 10 11 31 Inf 22
What I would like to do is extract all the times and match them to the minvalues. To end up with something like:
times values
3121 14
3151 14
1171 10
1351 11
1381 11
1050 31
122 22
I've tried to strsplit(mintimes, "|") and I've tried str_extract(mintimes, "[0-9]+") but they don't seem to work. Any ideas?
| is a regular expression metacharacter. When used literally, these special characters need to be escaped either with [] or with \\ (or you could use fixed = TRUE in some functions). So your call to strsplit() should be
strsplit(mintimes, "[|]")
or
strsplit(mintimes, "\\|")
or
strsplit(mintimes, "|", fixed = TRUE)
Regarding your other try with stringr functions, str_extract_all() seems to do the trick.
library(stringr)
str_extract_all(mintimes, "[0-9]+")
To get your desired result,
> mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
> minvalues <- c(14, 10, 11, 31, Inf, 22)
> s <- strsplit(mintimes, "[|]")
> data.frame(times = as.numeric(unlist(s)),
values = rep(minvalues, sapply(s, length)))
# times values
# 1 3121 14
# 2 3151 14
# 3 1171 10
# 4 1351 11
# 5 1381 11
# 6 1050 31
# 7 122 22
By default strsplit splits using a regular expression and "|" is a special character in the regular expression syntax. You can either escape it
strsplit(mintimes,"\\|")
or just set fixed=T to not use regular expressions
strsplit(mintimes,"|", fixed=T)
I have written a function called cSplit that is useful for these types of things. You can get it from my Gist: https://gist.github.com/mrdwab/11380733
Usage would be:
cSplit(data.table(mintimes, minvalues), "mintimes", "|", "long")
# mintimes minvalues
# 1: 3121 14
# 2: 3151 14
# 3: 1171 10
# 4: 1351 11
# 5: 1381 11
# 6: 1050 31
# 7: 122 22
It also has a "wide" setting, in case that would be at all useful to you:
cSplit(data.table(mintimes, minvalues), "mintimes", "|", "wide")
# minvalues mintimes_1 mintimes_2
# 1: 14 3121 3151
# 2: 10 1171 NA
# 3: 11 1351 1381
# 4: 31 1050 NA
# 5: Inf NA NA
# 6: 22 122 NA
Note: The output is a data.table.
As others have mentioned, you need to escape the | to include it literally in a regular expression. As always, we can skin this cat many ways, and here's one way to do it with stringr:
x <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
library(stringr)
unlist(str_extract_all(x, "\\d+"))
# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"
This won't work as expected if you have any decimal points in a character string of numbers, so the following (which says to match anything but |) might be safer:
unlist(str_extract_all(x, '[^|]+'))
# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"
Either way, you might want to wrap the result in as.numeric.
And here's another solution using stri_split_fixed from the stringi package. As an added value, we also play with mapply and do.call.
Input data:
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)
Split mintimes w.r.t. | and convert to numeric:
library("stringi")
mintimes <- lapply(stri_split_fixed(mintimes, "|"), as.numeric)
## [[1]]
## [1] 3121 3151
##
## [[2]]
## [1] 1171
##
## [[3]]
## [1] 1351 1381
##
## [[4]]
## [1] 1050
##
## [[5]]
## [1] NA
##
## [[6]]
## [1] 122
Column-bind each minvalues with corresponding mintimes:
tmp <- mapply(cbind, mintimes, minvalues)
## [[1]]
## [,1] [,2]
## [1,] 3121 14
## [2,] 3151 14
##
## [[2]]
## [,1] [,2]
## [1,] 1171 10
##
## [[3]]
## [,1] [,2]
## [1,] 1351 11
## [2,] 1381 11
##
## [[4]]
## [,1] [,2]
## [1,] 1050 31
##
## [[5]]
## [,1] [,2]
## [1,] NA Inf
##
## [[6]]
## [,1] [,2]
## [1,] 122 22
Row-bind all the 6 matrices & remove NA-rows:
res <- do.call(rbind, tmp)
res[!is.na(res[,1]),]
## [,1] [,2]
## [1,] 3121 14
## [2,] 3151 14
## [3,] 1171 10
## [4,] 1351 11
## [5,] 1381 11
## [6,] 1050 31
## [7,] 122 22
To get the output you want, try something like this:
library(dplyr)
Split.Times <- function(x) {
mintimes <- as.numeric(unlist(strsplit(as.character(x$mintimes), "\\|")))
return(data.frame(mintimes = mintimes, minvalues = x$minvalues, stringsAsFactors=FALSE))
}
df <- data.frame(mintimes, minvalues, stringsAsFactors=FALSE)
df %>%
filter(mintimes != "") %>%
group_by(mintimes) %>%
do(Split.Times(.))
This produces:
mintimes minvalues
1 1050 31
2 1171 10
3 122 22
4 1351 11
5 1381 11
6 3121 14
7 3151 14
(I borrowed from my answer here - which is pretty much the same question/problem)
Here's a qdap package approach:
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)
library(qdap)
list2df(setNames(strsplit(mintimes, "\\|"), minvalues), "times", "values")
## times values
## 1 3121 14
## 2 3151 14
## 3 1171 10
## 4 1351 11
## 5 1381 11
## 6 1050 31
## 7 122 22
You can use [:punct:]
strsplit(mintimes, "[[:punct:]]")