I have a large text file with a variable number of fields in each row. The first entry in each row corresponds to a biological pathway, and each subsequent entry corresponds to a gene in that pathway. The first few lines might look like this:
path1 gene1 gene2
path2 gene3 gene4 gene5 gene6
path3 gene7 gene8 gene9
I need to read this file into R as a list, with each element being a character vector, and the name of each element in the list being the first element on the line, for example:
> pathways <- list(
+ path1=c("gene1","gene2"),
+ path2=c("gene3","gene4","gene5","gene6"),
+ path3=c("gene7","gene8","gene9")
+ )
>
> str(pathways)
List of 3
$ path1: chr [1:2] "gene1" "gene2"
$ path2: chr [1:4] "gene3" "gene4" "gene5" "gene6"
$ path3: chr [1:3] "gene7" "gene8" "gene9"
>
> str(pathways$path1)
chr [1:2] "gene1" "gene2"
>
> print(pathways)
$path1
[1] "gene1" "gene2"
$path2
[1] "gene3" "gene4" "gene5" "gene6"
$path3
[1] "gene7" "gene8" "gene9"
...but I need to do this automatically for thousands of lines. I saw a similar question posted here previously, but I couldn't figure out how to do this from that thread.
Thanks in advance.
Here's one way to do it:
# Read in the data
x <- scan("data.txt", what="", sep="\n")
# Separate elements by one or more whitespace characters
y <- strsplit(x, "[[:space:]]+")
# Extract the first vector element and set it as the list element name
names(y) <- sapply(y, `[[`, 1)
#names(y) <- sapply(y, function(x) x[[1]]) # same as above
# Remove the first vector element from each list element
y <- lapply(y, `[`, -1)
#y <- lapply(y, function(x) x[-1]) # same as above
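A quick sanity check, assuming data.txt holds the three sample lines from the question:
str(y)
# List of 3
#  $ path1: chr [1:2] "gene1" "gene2"
#  $ path2: chr [1:4] "gene3" "gene4" "gene5" "gene6"
#  $ path3: chr [1:3] "gene7" "gene8" "gene9"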
One solution is to read the data in via read.table(), using the fill = TRUE argument to pad the rows that have fewer entries, convert the resulting data frame to a list, and then clean up the "empty" elements.
First, read your snippet of data in:
con <- textConnection("path1 gene1 gene2
path2 gene3 gene4 gene5 gene6
path3 gene7 gene8 gene9
")
dat <- read.table(con, fill = TRUE, stringsAsFactors = FALSE)
close(con)
Next, drop the first column, saving it first so it can be used for the names of the list later:
nams <- dat[, 1]
dat <- dat[, -1]
Convert the data frame to a list. Here I just split the data frame on the indices 1,2,...,n where n is the number of rows:
ldat <- split(dat, seq_len(nrow(dat)))
Clean up the empty cells:
ldat <- lapply(ldat, function(x) x[x != ""])
Finally, apply the names:
names(ldat) <- nams
Giving:
> ldat
$path1
[1] "gene1" "gene2"
$path2
[1] "gene3" "gene4" "gene5" "gene6"
$path3
[1] "gene7" "gene8" "gene9"
A quick solution based on the linked page...
inlist <- strsplit(readLines("file.txt"), "[[:space:]]+")
pathways <- lapply(inlist, tail, n = -1)
names(pathways) <- lapply(inlist, head, n = 1)
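The last line relies on names<- coercing the list that lapply() returns to character; a character vector is slightly more direct (same assumption that file.txt holds the sample lines):
names(pathways) <- vapply(inlist, `[[`, character(1), 1)  # first field of each line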
One more solution:
sl <- c("path1 gene1 gene2", "path2 gene1 gene2 gene3") # created by readLines
f <- function(l, s) {
  v <- strsplit(s, " ")[[1]]
  l[[v[1]]] <- v[2:length(v)]
  return(l)
}
res <- Reduce(f, sl, list())
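Checking the result on the two sample lines (note this builds the list one line at a time, so for thousands of lines the vectorized strsplit() answers above will scale better):
str(res)
# List of 2
#  $ path1: chr [1:2] "gene1" "gene2"
#  $ path2: chr [1:3] "gene1" "gene2" "gene3"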
I am using a list of files and trying to create a data frame that contains, for each sample, the percentage of two particular "GT" types by the levels of another factor variable called "chr" (with 1 to 24 levels).
It would have to look like this: one row per sample, with a Sample column followed by 24 chromosome columns (chr1 through chr22, chrX, and chrXY) holding the percentages.
The problem I keep getting is that the vector never gets updated for the ith sample; it only keeps the first vector created. I am also not sure how to save that updated vector in my data frame (df).
vector_chr <- vector()
for (i in seq_along(list_files)) {
  GT <- list_files[[i]][, 9]
  chr <- list_files[[i]][, 3]
  GT$chr <- chr$chr                      # creating one df with both GT and chr
  for (j in unique(GT$chr)) {
    dat_list <- split(GT, GT$chr)        # split data frames by chr (1 to 24)
    table <- table(dat_list[[j]][, 1])   # take GT and make a table
    sum <- sum(table[3:4])               # sum GTs 3 and 4
    perc <- sum / nrow(GT)
    vector_chr <- c(vector_chr, perc)    # assign the 24 percentages to a vector
  }
  df <- data.frame(matrix(ncol = 25, nrow = length(files)))
  x <- c("Sample", "chr1", "chr2", "chr3",
         "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10",
         "chr11", "chr12", "chr13", "chr14", "chr15", "chr16",
         "chr17", "chr18", "chr19", "chr20", "chr21", "chr22",
         "chrX", "chrXY")
  colnames(df) <- x
  df$Sample <- names(list_files)
  df[i, 2:25] <- vector_chr              # assign the 24 percentages for EACH sample
}
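Two things stand out: vector_chr is never reset between samples, so the percentages keep accumulating, and df is rebuilt from scratch on every pass through the outer loop, wiping out earlier rows. A minimal sketch of a restructured loop, assuming (as above) that list_files is a named list of data frames whose column 9 holds GT and column 3 holds chr:
# Build the output frame once, outside the loop
df <- data.frame(matrix(ncol = 25, nrow = length(list_files)))
colnames(df) <- c("Sample", paste0("chr", c(1:22, "X", "XY")))
df$Sample <- names(list_files)

for (i in seq_along(list_files)) {
  GT <- list_files[[i]][, 9]
  GT$chr <- list_files[[i]][, 3]$chr
  dat_list <- split(GT, GT$chr)   # split once per sample
  vector_chr <- vector()          # reset for each sample
  for (j in unique(GT$chr)) {     # assumes this yields chr1..chrXY in column order
    tab <- table(dat_list[[j]][, 1])
    vector_chr <- c(vector_chr, sum(tab[3:4]) / nrow(GT))
  }
  df[i, 2:25] <- vector_chr
}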
I am looking for a shorter and prettier solution (possibly in tidyverse) to the following problem. I have a data.frame "data":
id string
1 A 1.001 xxx 123.123
2 B 23,45 lorem ipsum
3 C donald trump
4 D ssss 134, 1,45
What I want to do is extract all numbers, no matter whether the delimiter is "." or "," (in this case I assume the string "134, 1,45" can be extracted into two numbers: 134 and 1.45), and create a data.frame "output" looking similar to this:
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
I managed to do this (code below), but the solution is pretty ugly to me and also not so efficient (two for-loops). Could someone suggest a better way to do this (preferably using dplyr)?
# data
data <- data.frame(id = c("A", "B", "C", "D"),
                   string = c("1.001 xxx 123.123",
                              "23,45 lorem ipsum",
                              "donald trump",
                              "ssss 134, 1,45"),
                   stringsAsFactors = FALSE)
# creating empty data.frame
len <- length(unlist(sapply(data$string, function(x) gregexpr("[0-9]+[,|.]?[0-9]*", x))))
output <- data.frame(id = rep(NA, len), string = rep(NA, len))
# main solution
start <- 0
for (i in 1:dim(data)[1]) {
  tmp_len <- length(unlist(gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i])))
  for (j in (start + 1):(start + tmp_len)) {
    output[j, 1] <- data$id[i]
    output[j, 2] <- regmatches(data$string[i], gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i]))[[1]][j - start]
  }
  start <- start + tmp_len
}
# further modifications
output$string <- gsub(",", ".", output$string)
output$string <- as.numeric(ifelse(substring(output$string, nchar(output$string), nchar(output$string)) == ".",
                                   substring(output$string, 1, nchar(output$string) - 1),
                                   output$string))
output
1) Base R. This uses relatively simple regular expressions and no packages.
In the first two lines of code, replace any comma followed by a space with a space, and then replace all remaining commas with a dot. After these two lines s will be: c("1.001 xxx 123.123", "23.45 lorem ipsum", "donald trump", "ssss 134 1.45").
In the next four lines of code, trim whitespace from the beginning and end of each string field and split the string field on whitespace, producing a list. grep out those elements consisting only of digits and dots. (The regular expression ^[0-9.]*$ anchors to the start and end of a word and allows only digits or dots in between, so only words made up of those characters are matched.) Replace any zero-length components with NA. Finally, add data$id as the names. After these four lines are run, the list L will be list(A = c("1.001", "123.123"), B = "23.45", C = NA, D = c("134", "1.45")).
In the last line of code convert the list L to a data frame with the appropriate names.
s <- gsub(", ", " ", data$string)
s <- gsub(",", ".", s)
L <- strsplit(trimws(s), "\\s+")
L <- lapply(L, grep, pattern = "^[0-9.]*$", value = TRUE)
L <- ifelse(lengths(L), L, NA)
names(L) <- data$id
with(stack(L), data.frame(id = ind, string = values))
giving:
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
2) magrittr. This variation of (1) writes it as a magrittr pipeline.
library(magrittr)
data %>%
  transform(string = gsub(", ", " ", string)) %>%
  transform(string = gsub(",", ".", string)) %>%
  transform(string = trimws(string)) %>%
  with(setNames(strsplit(string, "\\s+"), id)) %>%
  lapply(grep, pattern = "^[0-9.]*$", value = TRUE) %>%
  replace(lengths(.) == 0, NA) %>%
  stack() %>%
  with(data.frame(id = ind, string = values))
3) dplyr/tidyr. This is an alternate pipeline solution using dplyr and tidyr: unnest converts to long form; id is made a factor so that we can later use complete to recover ids that are removed by the subsequent filtering; filter removes junk rows; and complete inserts NA rows for each id that would otherwise not appear.
library(dplyr)
library(tidyr)
data %>%
  mutate(string = gsub(", ", " ", string)) %>%
  mutate(string = gsub(",", ".", string)) %>%
  mutate(string = trimws(string)) %>%
  mutate(string = strsplit(string, "\\s+")) %>%
  unnest() %>%
  mutate(id = factor(id)) %>%
  filter(grepl("^[0-9.]*$", string)) %>%
  complete(id)
4) data.table
library(data.table)
DT <- as.data.table(data)
res <- DT[, string := gsub(", ", " ", string)][,
  string := gsub(",", ".", string)][,
  string := trimws(string)][,
  string := setNames(strsplit(string, "\\s+"), id)][,
  list(string = list(grep("^[0-9.]*$", unlist(string), value = TRUE))), by = id][,
  list(string = if (length(unlist(string))) unlist(string) else NA_character_), by = id]
res
Update: Removed the assumption that junk words do not contain a digit or dot. Also added (2), (3), and (4), plus some improvements.
We can replace the "," in between the numbers with "." (using gsub), extract the numbers with str_extract_all (from stringr) into a list, replace the list elements that have length 0 with NA, set the names of the list from the 'id' column, stack to convert the list to a data.frame, and rename the columns.
library(stringr)
setNames(
  stack(
    setNames(
      lapply(
        str_extract_all(
          gsub("(?<=[0-9]),(?=[0-9])", ".", data$string, perl = TRUE),
          "[0-9.]+"),
        function(x) if (length(x) == 0) NA else as.numeric(x)),
      data$id))[2:1],
  c("id", "string"))
# id string
#1 A 1.001
#2 A 123.123
#3 B 23.45
#4 C NA
#5 D 134
#6 D 1.45
Same idea as Gabor's. I had hoped to use R's built-in parsing of strings (type.convert, used in read.table) rather than writing custom regex substitutions:
sp <- setNames(strsplit(data$string, " "), data$id)
spc <- lapply(sp, function(x) {
  x <- x[grep("[^0-9.,]$", x, invert = TRUE)]
  if (!length(x))
    NA_real_
  else
    mapply(type.convert, x, dec = gsub("[^.,]", "", x), USE.NAMES = FALSE)
})
setNames(rev(stack(spc)), names(data))
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
Unfortunately, type.convert is not robust enough to consider both decimal delimiters at once, so we need this mapply malarkey instead of type.convert(x, dec = "[.,]").
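If custom substitutions are acceptable after all, a normalize-then-convert variant of the same split avoids the per-word dec juggling entirely; a sketch, reusing sp from above:
spc2 <- lapply(sp, function(x) {
  # keep words made up only of digits, dots and commas
  # (slightly stricter than the $-anchored pattern above)
  x <- x[grep("[^0-9.,]", x, invert = TRUE)]
  if (!length(x)) NA_real_ else as.numeric(gsub(",", ".", x))  # comma -> dot, then convert
})
setNames(rev(stack(spc2)), names(data))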
I am collecting tweets with the twitteR package and get many duplicates. This code works fine:
tweets <- searchTwitter(keyword, n=500, lang="en", since=NULL, until=NULL, retryOnRateLimit=100)
mydata <- sapply(tweets, function(x) x$getText())
mydata <- unique(mydata, incomparables = F, nmax = NA)
The problem is that it actually doesn't delete any duplicates, because it doesn't recognise them as such. The duplicate tweets typically contain shortened URLs that differ by 1 or 2 digits. So I tried to clean the tweets of URLs with this code:
tweets <- searchTwitter(keyword, n=500, lang="en", since=NULL, until=NULL,
                        retryOnRateLimit=100)
mydata <- sapply(tweets, function(x) x$getText())
mydata <- data.frame(mydata, stringsAsFactors = FALSE)
names(mydata) <- c('words')
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
mydata$words <- removeURL(mydata$words)
removeURL <- function(x) gsub("https[[:alnum:]]*", "", x)
mydata$words <- removeURL(mydata$words)
mydata$words <- unique(mydata$words, incomparables = F, nmax = NA)
Now I get the error message:
Error in `$<-.data.frame`(`*tmp*`, "words", value = c("Tripping around #DisneySprings.....) : replacement has 295 rows, data has 300
Advice? Thanks!
Your error is easily reproducible:
mydata <- data.frame(list(w = c(0, 1, 0, 1)))
mydata$words <- c(0, 1, 1)
# Error in `$<-.data.frame`(`*tmp*`, "words", value = c(0, 1, 1)) :
# replacement has 3 rows, data has 4
This just means that you need to assign a vector of the same length as the number of rows in the data frame.
To filter out duplicate values you need to change the last line of your code to:
res <- mydata[!duplicated(mydata$words), ]
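Applied to a toy frame (hypothetical data), duplicated() flags every repeat after its first occurrence, so negating it keeps one row per distinct value:
mydata <- data.frame(words = c("a", "b", "a", "c"), stringsAsFactors = FALSE)
mydata[!duplicated(mydata$words), , drop = FALSE]  # drop = FALSE keeps the one-column data frame
#   words
# 1     a
# 2     b
# 4     c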
I am trying to find an elegant way to find matches between the two following character columns in a data frame. The complicated part is that either string can contain a comma-separated list, and if a member of one list matches any member of the other list, the whole entry is considered a match. I'm not sure how well I've explained this, so here's sample data and output:
Alt1:
AT
A
G
CGTCC,AT
CGC
Alt2:
AA
A
GG
AT,GGT
CG
Expected Match per row:
Row 1 = none
Row 2 = A
Row 3 = none
Row 4 = AT
Row 5 = none
Non-working solutions:
First attempt: merge entire data frames by desired columns, then match up the alt columns shown above:
match1 = data.frame(merge(vcf.df, ref.df, by=c("chr", "start", "end", "ref")))
matches = unique(match1[unlist(sapply(match1$Alt1, grep, match1$Alt2, fixed=TRUE)),])
Second method, using the findOverlaps feature from VariantAnnotation/GRanges:
findOverlaps(ranges(vcf1), ranges(vcf2))
Any suggestions would be greatly appreciated! Thank you!
Solution
Thanks to @Marat Talipov's answer below, the following solution works to compare two comma-separated strings:
> ##read in edited kaviar vcf and human ref
> ref <- readVcfAsVRanges("ref.vcf.gz", humie_ref)
Warning message:
In .vcf_usertag(map, tag, ...) :
ScanVcfParam ‘geno’ fields not present: ‘AD’
> ##rename chromosomes to match with vcf files
> ref <- renameSeqlevels(ref, c("1"="chr1"))
> ##################################
> ## Gather VCF files to process ##
> ##################################
> ##data frame *.vcf.gz files in directory path
> vcf_path <- data.frame(path=list.files(vcf_dir, pattern="*.vcf.gz$", full=TRUE))
> ##read in everything but sample data for speediness
> vcf_param = ScanVcfParam(samples=NA)
> vcf <- readVcfAsVRanges("test.vcf.gz", humie_ref, param=vcf_param)
> #################
> ## Match SNP's ##
> #################
> ##create data frames of info to match on
> vcf.df = data.frame(chr =as.character(seqnames(vcf)), start = start(vcf), end = end(vcf), ref = as.character(ref(vcf)),
+ alt=alt(vcf), stringsAsFactors=FALSE)
> ref.df = data.frame(chr =as.character(seqnames(ref)), start = start(ref), end = end(ref),
+ ref = as.character(ref(ref)), alt=alt(ref), stringsAsFactors=FALSE)
>
> ##merge based on all positional fields except vcf
> col_match = data.frame(merge(vcf.df, ref.df, by=c("chr", "start", "end", "ref")))
> library(stringi)
> ##split each alt column by comma and bind together
> M1 <- stri_list2matrix(sapply(col_match$alt.x,strsplit,','))
> M2 <- stri_list2matrix(sapply(col_match$alt.y,strsplit,','))
> M <- rbind(M1,M2)
> ##compare results
> result <- apply(M,2,function(z) unique(na.omit(z[duplicated(z)])))
> ##add results column to col_match df for checking/subsetting
> col_match$match = result
> head(col_match)
chr start end ref alt.x alt.y match
1 chr1 39998059 39998059 A G G G
2 chr1 39998059 39998059 A G G G
3 chr1 39998084 39998084 C A A A
4 chr1 39998084 39998084 C A A A
5 chr1 39998085 39998085 G A A A
6 chr1 39998085 39998085 G A A A
If the input lists are of equal length and you'd like to compare list elements in a pairwise manner, you could use this solution:
library(stringi)
M1 <- stri_list2matrix(sapply(Alt1,strsplit,','))
M2 <- stri_list2matrix(sapply(Alt2,strsplit,','))
M <- rbind(M1,M2)
result <- apply(M,2,function(z) unique(na.omit(z[duplicated(z)])))
Sample input:
Alt1 <- list('AT','A','G','CGTCC,AT','CGC','GG,CC')
Alt2 <- list('AA','A','GG','AT,GGT','CG','GG,CC')
Output:
# [[1]]
# character(0)
#
# [[2]]
# [1] "A"
#
# [[3]]
# character(0)
#
# [[4]]
# [1] "AT"
#
# [[5]]
# character(0)
#
# [[6]]
# [1] "GG" "CC"
Sticking with the stringi package, you could do something like this, using the Alt1 and Alt2 data from Marat's answer.
library(stringi)
f <- function(x, y) {
  ssf <- stri_split_fixed(c(x, y), ",", simplify = TRUE)
  if (any(sd <- stri_duplicated(ssf))) ssf[sd] else NA_character_
}
Map(f, Alt1, Alt2)
# [[1]]
# [1] NA
#
# [[2]]
# [1] "A"
#
# [[3]]
# [1] NA
#
# [[4]]
# [1] "AT"
#
# [[5]]
# [1] NA
#
# [[6]]
# [1] "GG" "CC"
Or in base R, we can use scan() to split the strings on commas.
g <- function(x, y, sep = ",") {
  s <- scan(text = c(x, y), what = "", sep = sep, quiet = TRUE)
  s[duplicated(s)]
}
Map(g, Alt1, Alt2)
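One difference from f(): rows with no match come back as character(0) rather than NA, so with the same six-element inputs the result is:
# [[1]]
# character(0)
#
# [[2]]
# [1] "A"
#
# [[3]]
# character(0)
#
# [[4]]
# [1] "AT"
#
# [[5]]
# character(0)
#
# [[6]]
# [1] "GG" "CC"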
You could do something like this:
Alt1 <- list('AT', 'A', 'G', c('CGTCC', 'AT'), 'CGC')
Alt2 <- list('AA', 'A', 'GG', c('AT', 'GGT'), 'CG')
# make sure you change the lists within the lists into vectors
matchlist <- list()
for (i in 1:length(Alt1)) {
  matchlist[[i]] <- ifelse(Alt1[[i]] %in% Alt2[[i]],
                           paste("Row", i, "=", c(Alt1[[i]], Alt2[[i]])[duplicated(c(Alt1[[i]], Alt2[[i]]))], sep = " "),
                           paste("Row", i, "= none", sep = " "))
}
print(matchlist)
I want to append or add a data.frame to itself...
Much in the same way that one adds:
n <- n + t
I have a function that creates a data.frame.
I have been using:
g <- function(compareA, compareB) {
  for (i in 1:1000) {
    ttr <- t.test(compareA, compareB, var.equal = TRUE)
    tt_pvalues[i] <- ttr$p.value
  }
  name_tag <- paste(nameA, nameB, sep = "_Vs_")
  tt_titles <- data.frame(name_tag, tt_titles)
  # character vector which I want to add to a list
  ALL_pvalues <- data.frame(tt_pvalues, ALL_pvalues)
  # adding a numeric vector of values to a larger data.frame
}
Would cbind be better here?
There are two methods that would "add or append" data to a data.frame by columns, and one that appends by rows. Assuming tag is the data.frame and tt_titles is a vector whose length equals the number of rows of tag, either of these would work:
tag <- cbind(tag, tt_titles)
# tt_titles could also be a data.frame with same number of rows
Or:
tag[["tt_titles"]] <- tt_titles
Now let's assume that we have instead two data.frames with the same column names:
bigger.df <- rbind(tag, tag2)
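A toy illustration of all three (hypothetical frames):
tag  <- data.frame(x = 1:3)
tag2 <- data.frame(x = 4:6)

tag <- cbind(tag, y = c("a", "b", "c"))    # append a column with cbind
tag[["z"]] <- c(TRUE, FALSE, TRUE)         # or append a column by name
tag2 <- cbind(tag2, y = c("d", "e", "f"), z = c(TRUE, TRUE, FALSE))
bigger.df <- rbind(tag, tag2)              # append by rows: 6 rows, columns x, y, z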