I am looking for an elegant way to find matches between the following two character columns in a data frame. The complication is that either string can contain a comma-separated list; if any member of one list matches any member of the other list, the whole entry counts as a match. I'm not sure how well I've explained this, so here's sample data and the expected output:
Alt1:
AT
A
G
CGTCC,AT
CGC
Alt2:
AA
A
GG
AT,GGT
CG
Expected Match per row:
Row 1 = none
Row 2 = A
Row 3 = none
Row 4 = AT
Row 5 = none
Non-working solutions:
First attempt: merge entire data frames by desired columns, then match up the alt columns shown above:
match1 = data.frame(merge(vcf.df, ref.df, by=c("chr", "start", "end", "ref")))
matches = unique(match1[unlist(sapply(match1$Alt1, grep, match1$Alt2, fixed=TRUE)),])
Second method, using the findOverlaps function from VariantAnnotation/GRanges:
findOverlaps(ranges(vcf1), ranges(vcf2))
Any suggestions would be greatly appreciated! Thank you!
Solution
Thanks to @Marat Talipov's answer below, the following solution works to compare two comma-separated strings:
> ##read in edited kaviar vcf and human ref
> ref <- readVcfAsVRanges("ref.vcf.gz", humie_ref)
Warning message:
In .vcf_usertag(map, tag, ...) :
ScanVcfParam ‘geno’ fields not present: ‘AD’
> ##rename chromosomes to match with vcf files
> ref <- renameSeqlevels(ref, c("1"="chr1"))
> ##################################
> ## Gather VCF files to process ##
> ##################################
> ##data frame *.vcf.gz files in directory path
> vcf_path <- data.frame(path=list.files(vcf_dir, pattern="*.vcf.gz$", full=TRUE))
> ##read in everything but sample data for speediness
> vcf_param = ScanVcfParam(samples=NA)
> vcf <- readVcfAsVRanges("test.vcf.gz", humie_ref, param=vcf_param)
> #################
> ## Match SNP's ##
> #################
> ##create data frames of info to match on
> vcf.df = data.frame(chr =as.character(seqnames(vcf)), start = start(vcf), end = end(vcf), ref = as.character(ref(vcf)),
+ alt=alt(vcf), stringsAsFactors=FALSE)
> ref.df = data.frame(chr =as.character(seqnames(ref)), start = start(ref), end = end(ref),
+ ref = as.character(ref(ref)), alt=alt(ref), stringsAsFactors=FALSE)
>
> ##merge based on all positional fields except vcf
> col_match = data.frame(merge(vcf.df, ref.df, by=c("chr", "start", "end", "ref")))
> library(stringi)
> ##split each alt column by comma and bind together
> M1 <- stri_list2matrix(sapply(col_match$alt.x,strsplit,','))
> M2 <- stri_list2matrix(sapply(col_match$alt.y,strsplit,','))
> M <- rbind(M1,M2)
> ##compare results
> result <- apply(M,2,function(z) unique(na.omit(z[duplicated(z)])))
> ##add results column to col_match df for checking/subsetting
> col_match$match = result
> head(col_match)
   chr    start      end ref alt.x alt.y match
1 chr1 39998059 39998059   A     G     G     G
2 chr1 39998059 39998059   A     G     G     G
3 chr1 39998084 39998084   C     A     A     A
4 chr1 39998084 39998084   C     A     A     A
5 chr1 39998085 39998085   G     A     A     A
6 chr1 39998085 39998085   G     A     A     A
If the input lists are of equal length and you want to compare their elements pairwise, you could use this solution:
library(stringi)
M1 <- stri_list2matrix(sapply(Alt1,strsplit,','))
M2 <- stri_list2matrix(sapply(Alt2,strsplit,','))
M <- rbind(M1,M2)
result <- apply(M,2,function(z) unique(na.omit(z[duplicated(z)])))
Sample input:
Alt1 <- list('AT','A','G','CGTCC,AT','CGC','GG,CC')
Alt2 <- list('AA','A','GG','AT,GGT','CG','GG,CC')
Output:
# [[1]]
# character(0)
#
# [[2]]
# [1] "A"
#
# [[3]]
# character(0)
#
# [[4]]
# [1] "AT"
#
# [[5]]
# character(0)
#
# [[6]]
# [1] "GG" "CC"
Sticking with the stringi package, you could do something like this, using the Alt1 and Alt2 data from Marat's answer.
library(stringi)
f <- function(x, y) {
  ssf <- stri_split_fixed(c(x, y), ",", simplify = TRUE)
  if(any(sd <- stri_duplicated(ssf))) ssf[sd] else NA_character_
}
Map(f, Alt1, Alt2)
# [[1]]
# [1] NA
#
# [[2]]
# [1] "A"
#
# [[3]]
# [1] NA
#
# [[4]]
# [1] "AT"
#
# [[5]]
# [1] NA
#
# [[6]]
# [1] "GG" "CC"
Or in base R, we can use scan() to separate the strings with commas.
g <- function(x, y, sep = ",") {
  s <- scan(text = c(x, y), what = "", sep = sep, quiet = TRUE)
  s[duplicated(s)]
}
Map(g, Alt1, Alt2)
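The helpers above rely on duplicated() over the combined vector, so an allele repeated within a single entry (for example "A,A" against "G") would also be reported. A minimal base R sketch that sidesteps this by intersecting the two split lists row by row, assuming the inputs are plain character vectors (the names alt1, alt2 and match_alts are made up for illustration):

```r
# Sample data from the question, as plain character vectors
alt1 <- c("AT", "A", "G", "CGTCC,AT", "CGC")
alt2 <- c("AA", "A", "GG", "AT,GGT", "CG")

# match_alts is a hypothetical helper name, not part of any package
match_alts <- function(a, b, sep = ",") {
  # Split every entry on the separator, then intersect row by row
  common <- Map(intersect,
                strsplit(a, sep, fixed = TRUE),
                strsplit(b, sep, fixed = TRUE))
  # Collapse each row's matches to one string; NA when nothing matches
  vapply(common,
         function(m) if (length(m) > 0) paste(m, collapse = sep) else NA_character_,
         character(1))
}

matches <- match_alts(alt1, alt2)
print(matches)
```

Because the result is an ordinary character vector, it can be assigned straight to a data frame column and filtered with !is.na().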
You could do something like this:
Alt1 <- list('AT','A','G',c('CGTCC','AT'),'CGC')
Alt2 <- list('AA','A','GG',c('AT','GGT'),'CG')
# make sure you change the lists within the lists into vectors
matchlist <- list()
for (i in seq_along(Alt1)){
  # alleles present in both entries of row i
  common <- intersect(Alt1[[i]], Alt2[[i]])
  matchlist[[i]] <- paste("Row", i, "=",
                          if (length(common) > 0) paste(common, collapse=",") else "none")
}
print(matchlist)
Related
I am trying to analyze some Formula 1 data. Wikipedia has a table with the data I want. I am importing the data into R with the code below:
library(XML)
library(RCurl)
url <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
tabs <- getURL(url)
tabs <- readHTMLTable(tabs, stringsAsFactors=FALSE)
pilots <- tabs[[3]]
pilots <- pilots[-dim(pilots)[1], ]
head(pilots[, 1])
[1] "Abate, CarloCarlo Abate"
[2] "Abecassis, GeorgeGeorge Abecassis"
[3] "Acheson, KennyKenny Acheson"
[4] "Adamich, Andrea deAndrea de Adamich"
[5] "Adams, PhilippePhilippe Adams"
[6] "Ader, WaltWalt Ader"
However, the pilot names come out garbled: each entry has the table's sort key ("Abate, Carlo") fused onto the front of the display name. I'd like them to look like this:
head(pilots[, 1])
[1] "Carlo Abate"
[2] "George Abecassis"
[3] "Kenny Acheson"
[4] "Andrea de Adamich"
[5] "Philippe Adams"
[6] "Walt Ader"
However, I have not been able to write a regex that deals with this, nor find an argument to readHTMLTable that ignores the sortkey value in the table I am interested in. How can I solve this?
Use readHTMLTable with a bespoke elFun argument.
library(XML)
library(RCurl)
url <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
tabs <- getURL(url)
myFun <- function(x){
  if(length(y <- getNodeSet(x, ".//a")) > 0){
    # cell contains links: return their text, comma-separated
    title <- xpathSApply(x, ".//a", fun = xmlGetAttr, name = "title")
    href <- xpathSApply(x, ".//a", fun = xmlGetAttr, name = "href")
    value <- xpathSApply(x, ".//a", fun = xmlValue)
    return(paste(value, collapse = ","))
  }
  # no links: fall back to the cell's plain text
  xmlValue(x, encoding = "UTF-8")
}
tabs <- readHTMLTable(tabs, elFun = myFun, stringsAsFactors=FALSE)
pilots <- tabs[[3]]
pilots <- pilots[-dim(pilots)[1], ]
> head(pilots[, 1])
[1] "Carlo Abate" "George Abecassis" "Kenny Acheson" "Andrea de Adamich"
[5] "Philippe Adams" "Walt Ader"
> pilots[1,]
         Name Country   Seasons Championships Entries Starts Poles Wins Podiums Fastest laps Points[note]
1 Carlo Abate   Italy 1962,1963             0       2      0     0    0       0            0            0
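If only the Name column needs fixing, a regular expression with backreferences may also be enough to strip the sort key after the fact. This is a sketch under the assumption that every affected entry has the form "Last, FirstFirst Last", as in the question's output; entries that do not fit the pattern are left unchanged. The vector names raw and clean are illustrative.

```r
# Names as they appear in the question, before any clean-up
raw <- c("Abate, CarloCarlo Abate",
         "Adamich, Andrea deAndrea de Adamich",
         "Ader, WaltWalt Ader")

# Group 1 captures the sort-key surname, group 2 the sort-key given name(s).
# The display name that follows must repeat them as "given surname",
# which the backreferences in the pattern enforce.
clean <- sub("^(.+), (.+)\\2 \\1$", "\\2 \\1", raw, perl = TRUE)
print(clean)
```

The elFun approach is more robust when the table markup is available, since it does not depend on the sort key mirroring the display name exactly.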
I'm searching for the locations of 4 different substrings in x and trying to merge these four outputs into one cumulative string:
x <- ("AAABBADSJALKACCWIEUADD")
outputA <- gregexpr(pattern = "AAA", x)
outputB <- gregexpr(pattern = "ABB", x)
outputC <- gregexpr(pattern = "ACC", x)
outputD <- gregexpr(pattern = "ADD", x)
I would like to merge these four outputs and write the merged result to a text file, with each element on a new line.
merged_output
# 1
# 3
# 13
# 20
Thank you
Actually you can do it all at once using a lookahead (?=)
gregexpr("A(?=AA|BB|CC|DD)", x, perl=TRUE)[[1]]
# [1] 1 3 13 20
# attr(,"match.length")
# [1] 1 1 1 1
# attr(,"useBytes")
# [1] TRUE
For example
library(stringi)
cat("merged_output",
    paste("#",
          stri_locate_first_fixed(pattern = c("AAA", "ABB", "ACC", "ADD"),
                                  "AAABBADSJALKACCWIEUADD")[, "start"]),
    file = tf <- tempfile(fileext = ".txt"),
    sep = "\n")
Now, the file named in tf contains
> merged_output
> # 1
> # 3
> # 13
> # 20
Not very automated, but
cat(paste(c(outputA[[1]][1], outputB[[1]][1], outputC[[1]][1], outputD[[1]][1]),
          collapse = "\n"),
    file = "outputfile.txt")
should do it.
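A more automated variant of the same idea, sketched in base R: loop over the patterns, collect every match position, and write one position per line. The names patterns, positions and out_file are illustrative, and a temporary file stands in for the real output path.

```r
x <- "AAABBADSJALKACCWIEUADD"
patterns <- c("AAA", "ABB", "ACC", "ADD")

# gregexpr() returns -1 when a pattern is absent, so drop those entries
positions <- unlist(lapply(patterns, function(p) {
  hits <- as.integer(gregexpr(p, x, fixed = TRUE)[[1]])
  hits[hits > 0]
}))
positions <- sort(positions)

# Write one position per line
out_file <- tempfile(fileext = ".txt")
writeLines(as.character(positions), out_file)
print(readLines(out_file))
```

Unlike indexing each result with [1], this keeps every occurrence of a pattern rather than only the first.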
I'm formatting a data set so each entry has the adegenet format for codominant markers, such as:
Loci1
###/###
208/210
200/204
198/208
where the # represents any digit (the number is an allele size in base pairs). My data has some homozygous entries (all 3-digit integers with no separator) that have the form:
Loci1
###
208
198
I intend to paste each 3-digit string to itself with sep='/' to produce the first format. I've tried using grep to pick out the homozygous entries by matching everything of the form ###/### and then negating the match, such as:
a <- grep('\\b\\d{3}?[/]\\d{3}', score$Loci1, value =T ) # Subset all ###/###/
score[!(a %in% 1:nrow(score$Loci1)), ] # works but only on vectors...
After the subset I could paste. The problem arises when I apply this to a data frame. grep seems to treat the data frame as a list (which in part it is) and returns columns that have a match.
So in short, how can I go from ### to ###/### in a data frame?
Self-contained example data:
score2 <- NULL
set.seed(9)
Loci1 <- NULL
Loci2 <- NULL
Loci3 <- NULL
for (i in 1:5) Loci1 <- append(Loci1, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci2 <- append(Loci2, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci3 <- append(Loci3, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
score2 <- data.frame(Loci1, Loci2, Loci3, stringsAsFactors = F)
score2[2,3] <- strsplit(score2[2,3], split = '/')[1]
score2[5,2] <- strsplit(score2[3,3], split = '/')[1]
score2[1,1] <- strsplit(score2[1,1], split = '/')[1]
score2[c(1, 4),c(2,3)] <- NA
score2
You could just replace the 3 digit items with the separator and a copy:
sub("^(...)$", "\\1/\\1", Loci1)
Use lapply with an anonymous function:
data.frame( lapply(score2, function(x) sub("^(...)$", "\\1/\\1", x) ) )
    Loci1   Loci2   Loci3
1 251/251    <NA>    <NA>
2 251/329 320/257 260/260
3 275/242 278/329 281/320
4 269/266    <NA>    <NA>
5 296/326 281/281 326/314
(Not sure what the "paste-part" was supposed to refer to, but I think this was the intent of your question)
If the numeric values could have a varying number of digits then use a pattern argument like "^([0-9]{1,9})$"
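As a small illustration of the variable-width case, here is a sketch using a `+` quantifier instead of an explicit range, on the assumption that allele sizes are plain runs of digits. The vector alleles is made up. Entries that already contain a separator, and missing values, pass through untouched.

```r
# Made-up vector mixing homozygous, heterozygous, and missing entries
alleles <- c("208", "198/208", NA, "95", "1024")

# Only strings consisting entirely of digits are doubled
fixed <- sub("^([0-9]+)$", "\\1/\\1", alleles)
print(fixed)
```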
An option using grep/paste,
m1 <- as.matrix(score2)
indx <- grep('^...$', m1)
m1[indx] <- paste(m1[indx], m1[indx], sep="/")
as.data.frame(m1)
#    Loci1   Loci2   Loci3
#1 251/251    <NA>    <NA>
#2 251/329 320/257 260/260
#3 275/242 278/329 281/320
#4 269/266    <NA>    <NA>
#5 296/326 281/281 326/314
Or without converting to matrix, this can be done using lapply
score2[] <- lapply(score2, function(x) ifelse(grepl('^...$', x),
                                              paste(x, x, sep="/"), x))
I have a data.frame object which I obtain by converting an object of class rules into data.frame in this way:
trx.cpf.rules.df <- as(trx.cpf.rules, "data.frame")
(You can build the trx.cpf.rules.df object from the structure dputed here).
The head of this data frame looks like this:
> head(trx.cpf.rules.df)
                                                      rules   support confidence     lift
66 {Product_Group_1,Product_Group_49} => {Product_Group_48} 0.1060016  0.7371274 6.683635
12                 {Product_Group_48} => {Product_Group_49} 0.1067810  0.9681979 6.386621
68 {Product_Group_1,Product_Group_23} => {Product_Group_49} 0.1079501  0.9052288 5.971252
16                 {Product_Group_23} => {Product_Group_49} 0.1098987  0.8392857 5.536265
71 {Product_Group_1,Product_Group_23} => {Product_Group_34} 0.1024942  0.8594771 4.702384
19                 {Product_Group_34} => {Product_Group_23} 0.1079501  0.5906183 4.510496
Is there a fast way (a dedicated function or something like that) to convert each of the trx.cpf.rules.df$rules into two vectors containing the rule's elements? For example, for the first row it would be:
> (lhs.el <- c("Product_Group_1", "Product_Group_49"))
[1] "Product_Group_1" "Product_Group_49"
> (rhs.el <- c("Product_Group_48"))
[1] "Product_Group_48"
This will give you a list structure with lhs/rhs vectors:
l <- lapply(strsplit(as.character(trx.cpf.rules.df$rules), " => ", fixed = TRUE), function(x) {
  strsplit(gsub("[{}]", "", x), ",", fixed = TRUE)
})
To inspect the first rule:
l[[1]]
# [[1]]
# [1] "Product_Group_1" "Product_Group_49"
#
# [[2]]
# [1] "Product_Group_48"
To inspect the left-hand-sides of all rules (head):
head(sapply(l, "[", 1))
# [[1]]
# [1] "Product_Group_1" "Product_Group_49"
#
# [[2]]
# [1] "Product_Group_48"
#
# [[3]]
# [1] "Product_Group_1" "Product_Group_23"
#
# [[4]]
# [1] "Product_Group_23"
#
# [[5]]
# [1] "Product_Group_1" "Product_Group_23"
#
# [[6]]
# [1] "Product_Group_34"
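To get the two sides as separate objects, as the question asks, the nested list can be pulled apart with `[[`. A self-contained sketch using two rules copied from the question's output; the vector rules stands in for trx.cpf.rules.df$rules, and lhs.el / rhs.el follow the question's naming:

```r
# Two rules copied from the question's output
rules <- c("{Product_Group_1,Product_Group_49} => {Product_Group_48}",
           "{Product_Group_48} => {Product_Group_49}")

# Same splitting as above: first on " => ", then on "," after dropping braces
l <- lapply(strsplit(rules, " => ", fixed = TRUE), function(x) {
  strsplit(gsub("[{}]", "", x), ",", fixed = TRUE)
})

# Pull each side into its own list of character vectors
lhs.el <- lapply(l, `[[`, 1)
rhs.el <- lapply(l, `[[`, 2)
print(lhs.el[[1]])
print(rhs.el[[1]])
```

Each element of lhs.el and rhs.el lines up with the corresponding row of the data frame, so they can be indexed by row position.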
I have a large text file with a variable number of fields in each row. The first entry in each row corresponds to a biological pathway, and each subsequent entry corresponds to a gene in that pathway. The first few lines might look like this
path1 gene1 gene2
path2 gene3 gene4 gene5 gene6
path3 gene7 gene8 gene9
I need to read this file into R as a list, with each element being a character vector, and the name of each element in the list being the first element on the line, for example:
> pathways <- list(
+ path1=c("gene1","gene2"),
+ path2=c("gene3","gene4","gene5","gene6"),
+ path3=c("gene7","gene8","gene9")
+ )
>
> str(pathways)
List of 3
$ path1: chr [1:2] "gene1" "gene2"
$ path2: chr [1:4] "gene3" "gene4" "gene5" "gene6"
$ path3: chr [1:3] "gene7" "gene8" "gene9"
>
> str(pathways$path1)
chr [1:2] "gene1" "gene2"
>
> print(pathways)
$path1
[1] "gene1" "gene2"
$path2
[1] "gene3" "gene4" "gene5" "gene6"
$path3
[1] "gene7" "gene8" "gene9"
...but I need to do this automatically for thousands of lines. I saw a similar question posted here previously, but I couldn't figure out how to do this from that thread.
Thanks in advance.
Here's one way to do it:
# Read in the data
x <- scan("data.txt", what="", sep="\n")
# Separate elements by one or more whitespace characters
y <- strsplit(x, "[[:space:]]+")
# Extract the first vector element and set it as the list element name
names(y) <- sapply(y, `[[`, 1)
#names(y) <- sapply(y, function(x) x[[1]]) # same as above
# Remove the first vector element from each list element
y <- lapply(y, `[`, -1)
#y <- lapply(y, function(x) x[-1]) # same as above
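The same steps can be tried without a file on disk. In this sketch an inline character vector stands in for the lines of data.txt, and the naming and trimming are combined with setNames(); the variable names parts and pathways are illustrative.

```r
# Inline stand-in for the lines of data.txt
x <- c("path1 gene1 gene2",
       "path2 gene3 gene4 gene5 gene6",
       "path3 gene7 gene8 gene9")

# Split each line on runs of whitespace
parts <- strsplit(x, "[[:space:]]+")

# First field becomes the element name, the remaining fields the genes
pathways <- setNames(lapply(parts, `[`, -1),
                     vapply(parts, `[[`, character(1), 1))
str(pathways)
```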
One solution is to read the data in via read.table(), but use the fill = TRUE argument to pad the rows with fewer "entries", convert the resulting data frame to a list and then clean up the "empty" elements.
First, read your snippet of data in:
con <- textConnection("path1 gene1 gene2
path2 gene3 gene4 gene5 gene6
path3 gene7 gene8 gene9
")
dat <- read.table(con, fill = TRUE, stringsAsFactors = FALSE)
close(con)
Next we drop the first column, first saving it for the names of the list later
nams <- dat[, 1]
dat <- dat[, -1]
Convert the data frame to a list. Here I just split the data frame on the indices 1,2,...,n where n is the number of rows:
ldat <- split(dat, seq_len(nrow(dat)))
Clean up the empty cells:
ldat <- lapply(ldat, function(x) x[x != ""])
Finally, apply the names
names(ldat) <- nams
Giving:
> ldat
$path1
[1] "gene1" "gene2"
$path2
[1] "gene3" "gene4" "gene5" "gene6"
$path3
[1] "gene7" "gene8" "gene9"
A quick solution based on the linked page...
inlist <- strsplit(readLines("file.txt"), "[[:space:]]+")
pathways <- lapply(inlist, tail, n = -1)
names(pathways) <- lapply(inlist, head, n = 1)
One more solution:
sl <- c("path1 gene1 gene2", "path2 gene1 gene2 gene3") # created by readLines
f <- function(l, s) {
  v <- strsplit(s, " ")[[1]]
  l[[v[1]]] <- v[-1]  # all fields after the pathway name
  return(l)
}
res <- Reduce(f, sl, list())