readHTMLTable is not giving me the information I want - regex

I am trying to analyze some Formula 1 data. Wikipedia has a table with the data I want. I am importing the data into R with the code below:
library(XML)
library(RCurl)
url <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
tabs <- getURL(url)
tabs <- readHTMLTable(tabs, stringsAsFactors=FALSE)
pilots <- tabs[[3]]
pilots <- pilots[-dim(pilots)[1], ]
head(pilots[, 1])
[1] "Abate, CarloCarlo Abate"
[2] "Abecassis, GeorgeGeorge Abecassis"
[3] "Acheson, KennyKenny Acheson"
[4] "Adamich, Andrea deAndrea de Adamich"
[5] "Adams, PhilippePhilippe Adams"
[6] "Ader, WaltWalt Ader"
However, the pilot names come out strange: each one has the hidden sortkey ("Lastname, Firstname") glued onto the front of the display name. I'd like them to be like this:
head(pilots[, 1])
[1] "Carlo Abate"
[2] "George Abecassis"
[3] "Kenny Acheson"
[4] "Andrea de Adamich"
[5] "Philippe Adams"
[6] "Walt Ader"
However, it seems I am not able to write a regex that can deal with this problem, or to find an argument for readHTMLTable that ignores the sortkey value in the table I am interested in. How can I solve my problem?

Use readHTMLTable with a bespoke elFun argument. In each cell the hidden sortkey text sits outside the <a> element, so returning only the anchor text drops it:
library(XML)
library(RCurl)
url <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
tabs <- getURL(url)
myFun <- function(x) {
  # If the cell contains links, return only the anchor text;
  # the hidden sortkey text is thereby dropped
  if (length(getNodeSet(x, ".//a")) > 0) {
    # the link title and href are also available here if you need them:
    # title <- xpathSApply(x, ".//a", fun = xmlGetAttr, name = "title")
    # href  <- xpathSApply(x, ".//a", fun = xmlGetAttr, name = "href")
    value <- xpathSApply(x, ".//a", fun = xmlValue)
    return(paste(value, collapse = ","))
  }
  # otherwise fall back to the plain cell text
  xmlValue(x, encoding = "UTF-8")
}
tabs <- readHTMLTable(tabs, elFun = myFun, stringsAsFactors=FALSE)
pilots <- tabs[[3]]
pilots <- pilots[-dim(pilots)[1], ]
> head(pilots[, 1])
[1] "Carlo Abate" "George Abecassis" "Kenny Acheson" "Andrea de Adamich"
[5] "Philippe Adams" "Walt Ader"
> pilots[1,]
Name Country Seasons Championships Entries Starts Poles Wins Podiums Fastest laps Points[note]
1 Carlo Abate Italy 1962,1963 0 2 0 0 0 0 0 0
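If you have already read the table without elFun, a regex with back-references can also strip the sortkey after the fact: each cell is the sort key "Last, First" immediately followed by the display name "First Last", so forcing the two halves to agree recovers the display name. A sketch, assuming every cell follows that pattern:
# "Abate, CarloCarlo Abate" -> "Carlo Abate"
# \1 = last name(s), \2 = first name(s); the back-references \2 and \1 in the
# pattern require the second half of the cell to repeat the first
pilots[, 1] <- sub("^(.*), (.*)\\2 \\1$", "\\2 \\1", pilots[, 1])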

Related

Subset all 3 digit numbers and collapse them with a separator in a data frame. R

I'm formatting a data set so each entry has the adegenet format for codominant markers, such as:
Loci1
###/###
208/210
200/204
198/208
where the # represents any digit (the number is an allele size in base pairs). My data has some homozygous entries (all 3-digit integers with no separator) that have the form of:
Loci1
###
208
198
I intend to paste the 3-digit string to itself with sep='/' to produce the first format. I've tried to use grep to subset these homozygous entries by matching all ###/### entries and then negating the match, such as:
a <- grep('\\b\\d{3}?[/]\\d{3}', score$Loci1, value = TRUE)  # Subset all ###/###
score[!(a %in% 1:nrow(score$Loci1)), ] # works but only on vectors...
After the subset I could paste. The problem arises when I apply this to a data frame. grep seems to treat the data frame as a list (which in part it is) and returns columns that have a match.
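For example, grep() coerces a data frame to one deparsed string per column, which is why the indices it returns point at whole columns (a quick illustration, assuming the score2 data below):
grep('\\d{3}', score2)   # every column deparses to a string containing digits
# [1] 1 2 3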
So, in short, how can I go from ### to ###/### in a data frame?
Self-contained example of the data:
score2 <- NULL
set.seed(9)
Loci1 <- NULL
Loci2 <- NULL
Loci3 <- NULL
for (i in 1:5) Loci1 <- append(Loci1, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci2 <- append(Loci2, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci3 <- append(Loci3, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
score2 <- data.frame(Loci1, Loci2, Loci3, stringsAsFactors = F)
score2[2,3] <- strsplit(score2[2,3], split = '/')[1]
score2[5,2] <- strsplit(score2[3,3], split = '/')[1]
score2[1,1] <- strsplit(score2[1,1], split = '/')[1]
score2[c(1, 4),c(2,3)] <- NA
score2
You could just replace the 3 digit items with the separator and a copy:
sub("^(...)$", "\\1/\\1", Loci1)
Use lapply with an anonymous function:
data.frame( lapply(score2, function(x) sub("^(...)$", "\\1/\\1", x) ) )
Loci1 Loci2 Loci3
1 251/251 <NA> <NA>
2 251/329 320/257 260/260
3 275/242 278/329 281/320
4 269/266 <NA> <NA>
5 296/326 281/281 326/314
(Not sure what the "paste-part" was supposed to refer to, but I think this was the intent of your question)
If the numeric values could have a varying number of digits then use a pattern argument like "^([0-9]{1,9})$"
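For instance, with a made-up vector of mixed widths (purely illustrative):
x <- c("208/210", "98", "1024")
sub("^([0-9]{1,9})$", "\\1/\\1", x)
# [1] "208/210"   "98/98"     "1024/1024"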
An option using grep/paste,
m1 <- as.matrix(score2)
indx <- grep('^...$', m1)
m1[indx] <- paste(m1[indx], m1[indx], sep="/")
as.data.frame(m1)
# Loci1 Loci2 Loci3
#1 251/251 <NA> <NA>
#2 251/329 320/257 260/260
#3 275/242 278/329 281/320
#4 269/266 <NA> <NA>
#5 296/326 281/281 326/314
Or without converting to matrix, this can be done using lapply
score2[] <- lapply(score2, function(x) ifelse(grepl('^...$', x),
                                              paste(x, x, sep = "/"), x))

String rearrangement in R

I have a long list of city names with their province names. This is a partial list of my data:
data <- c('Ranchi_Capital_State_Jharkhand', 'Bokaro_State_Jharkhand', 'Tata Nagar_State_Jharkhand', 'Ramgarh_State_Jharkhand',
'Pune_State_Maharashtra', 'Mumbai_Capital_State_Maharashtra', 'Nagpur_State_Maharashtra')
I want to rearrange it so that the state comes first, like this: State_Jharkhand_Bokaro. If a city is a capital, then: State_Jharkhand_Capital_Ranchi. Also note that a city name or state name may consist of one word or more than one word (e.g. Tata Nagar).
What is the most efficient way to do it (without using any loop)?
You could use the gsub call below; note that perl = TRUE is needed for the non-capturing group (?:...).
> data <- c('Ranchi_Capital_State_Jharkhand', 'Bokaro_State_Jharkhand', 'Tata Nagar_State_Jharkhand', 'Ramgarh_State_Jharkhand',
+ 'Pune_State_Maharashtra', 'Mumbai_Capital_State_Maharashtra', 'Nagpur_State_Maharashtra')
> gsub("^(?:(.*?)(_Capital))?(.*?)_(State.*)", "\\4\\2_\\1\\3", data, perl = TRUE)
[1] "State_Jharkhand_Capital_Ranchi" "State_Jharkhand_Bokaro"
[3] "State_Jharkhand_Tata Nagar" "State_Jharkhand_Ramgarh"
[5] "State_Maharashtra_Pune" "State_Maharashtra_Capital_Mumbai"
[7] "State_Maharashtra_Nagpur"
Here \1 and \2 capture the city name and the literal _Capital when a capital is present, \3 captures the city name otherwise, and \4 captures the State_... tail; the replacement \4\2_\1\3 simply reorders those pieces.
This doesn't really use much regex, but is mostly based on the expected position of the information. Split the strings by "_" and then reorder them as required:
data
# [1] "Ranchi_Capital_State_Jharkhand" "Bokaro_State_Jharkhand"
# [3] "Tata Nagar_State_Jharkhand" "Ramgarh_State_Jharkhand"
# [5] "Pune_State_Maharashtra" "Mumbai_Capital_State_Maharashtra"
# [7] "Nagpur_State_Maharashtra"
A <- strsplit(data, "_", fixed = TRUE)
sapply(A, function(x) {
  if (length(x) == 3) {
    paste(x[c(2, 3, 1)], collapse = "_")
  } else if (length(x) == 4) {
    paste(x[c(3, 4, 2, 1)], collapse = "_")
  } else {
    stop("unexpected length")
  }
})
# [1] "State_Jharkhand_Capital_Ranchi" "State_Jharkhand_Bokaro"
# [3] "State_Jharkhand_Tata Nagar" "State_Jharkhand_Ramgarh"
# [5] "State_Maharashtra_Pune" "State_Maharashtra_Capital_Mumbai"
# [7] "State_Maharashtra_Nagpur"
I don't know if using sapply breaks your requirement of "without using any loop" though.

R match between two comma-separated strings

I am trying to find an elegant way to find matches between the two following character columns in a data frame. The complicated part is that either string can contain a comma-separated list, and if a member of one list is a match for any member of the other list, then that whole entry would be considered a match. I'm not sure how well I've explained this, so here's sample data and output:
Alt1:
AT
A
G
CGTCC,AT
CGC
Alt2:
AA
A
GG
AT,GGT
CG
Expected Match per row:
Row 1 = none
Row 2 = A
Row 3 = none
Row 4 = AT
Row 5 = none
Non-working solutions:
First attempt: merge entire data frames by desired columns, then match up the alt columns shown above:
match1 = data.frame(merge(vcf.df, ref.df, by=c("chr", "start", "end", "ref")))
matches = unique(match1[unlist(sapply(match1$Alt1, grep, match1$Alt2, fixed=TRUE)),])
Second method, using the findOverlaps feature from VariantAnnotation/GRanges:
findOverlaps(ranges(vcf1), ranges(vcf2))
Any suggestions would be greatly appreciated! Thank you!
Solution
Thanks to @Marat Talipov's answer below, the following solution works to compare two comma-separated strings:
> ##read in edited kaviar vcf and human ref
> ref <- readVcfAsVRanges("ref.vcf.gz", humie_ref)
Warning message:
In .vcf_usertag(map, tag, ...) :
ScanVcfParam ‘geno’ fields not present: ‘AD’
> ##rename chromosomes to match with vcf files
> ref <- renameSeqlevels(ref, c("1"="chr1"))
> ##################################
> ## Gather VCF files to process ##
> ##################################
> ##data frame *.vcf.gz files in directory path
> vcf_path <- data.frame(path=list.files(vcf_dir, pattern="*.vcf.gz$", full=TRUE))
> ##read in everything but sample data for speediness
> vcf_param = ScanVcfParam(samples=NA)
> vcf <- readVcfAsVRanges("test.vcf.gz", humie_ref, param=vcf_param)
> #################
> ## Match SNP's ##
> #################
> ##create data frames of info to match on
> vcf.df = data.frame(chr =as.character(seqnames(vcf)), start = start(vcf), end = end(vcf), ref = as.character(ref(vcf)),
+ alt=alt(vcf), stringsAsFactors=FALSE)
> ref.df = data.frame(chr =as.character(seqnames(ref)), start = start(ref), end = end(ref),
+ ref = as.character(ref(ref)), alt=alt(ref), stringsAsFactors=FALSE)
>
> ##merge based on all positional fields except vcf
> col_match = data.frame(merge(vcf.df, ref.df, by=c("chr", "start", "end", "ref")))
> library(stringi)
> ##split each alt column by comma and bind together
> M1 <- stri_list2matrix(sapply(col_match$alt.x,strsplit,','))
> M2 <- stri_list2matrix(sapply(col_match$alt.y,strsplit,','))
> M <- rbind(M1,M2)
> ##compare results
> result <- apply(M,2,function(z) unique(na.omit(z[duplicated(z)])))
> ##add results column to col_match df for checking/subsetting
> col_match$match = result
> head(col_match)
chr start end ref alt.x alt.y match
1 chr1 39998059 39998059 A G G G
2 chr1 39998059 39998059 A G G G
3 chr1 39998084 39998084 C A A A
4 chr1 39998084 39998084 C A A A
5 chr1 39998085 39998085 G A A A
6 chr1 39998085 39998085 G A A A
In the case that input lists are of equal length and you'd like to compare list elements in a pairwise manner, you could use this solution:
library(stringi)
M1 <- stri_list2matrix(sapply(Alt1,strsplit,','))
M2 <- stri_list2matrix(sapply(Alt2,strsplit,','))
M <- rbind(M1,M2)
result <- apply(M,2,function(z) unique(na.omit(z[duplicated(z)])))
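The duplicated() step works because any allele shared between the two lists shows up at least twice in the stacked column; a tiny illustration with one made-up stacked column:
z <- c("CGTCC", "AT", NA, "AT", "GGT")
unique(na.omit(z[duplicated(z)]))
# [1] "AT"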
Sample input:
Alt1 <- list('AT','A','G','CGTCC,AT','CGC','GG,CC')
Alt2 <- list('AA','A','GG','AT,GGT','CG','GG,CC')
Output:
# [[1]]
# character(0)
#
# [[2]]
# [1] "A"
#
# [[3]]
# character(0)
#
# [[4]]
# [1] "AT"
#
# [[5]]
# character(0)
#
# [[6]]
# [1] "GG" "CC"
Sticking with the stringi package, you could do something like this, using the Alt1 and Alt2 data from Marat's answer.
library(stringi)
f <- function(x, y) {
ssf <- stri_split_fixed(c(x, y), ",", simplify = TRUE)
if(any(sd <- stri_duplicated(ssf))) ssf[sd] else NA_character_
}
Map(f, Alt1, Alt2)
# [[1]]
# [1] NA
#
# [[2]]
# [1] "A"
#
# [[3]]
# [1] NA
#
# [[4]]
# [1] "AT"
#
# [[5]]
# [1] NA
#
# [[6]]
# [1] "GG" "CC"
Or in base R, we can use scan() to separate the strings with commas.
g <- function(x, y, sep = ",") {
s <- scan(text = c(x, y), what = "", sep = sep, quiet = TRUE)
s[duplicated(s)]
}
Map(g, Alt1, Alt2)
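Expected output (a sketch, not run here): the same matches as above, except that non-matches come back as character(0) rather than NA:
# [[1]]
# character(0)
#
# [[2]]
# [1] "A"
#
# [[3]]
# character(0)
#
# [[4]]
# [1] "AT"
#
# [[5]]
# character(0)
#
# [[6]]
# [1] "GG" "CC"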
You could do something like this:
Alt1 <- list('AT','A','G',c('CGTCC','AT'),'CGC')
Alt2 <- list('AA','A','GG',c('AT','GGT'),'CG')
# make sure the list elements are vectors, not nested lists
matchlist <- list()
for (i in 1:length(Alt1)) {
  matchlist[[i]] <- ifelse(Alt1[[i]] %in% Alt2[[i]],
                           paste("Row", i, "=", c(Alt1[[i]], Alt2[[i]])[duplicated(c(Alt1[[i]], Alt2[[i]]))], sep = " "),
                           paste("Row", i, "= none", sep = " "))
}
print(matchlist)

How to expand a list with NULLs up to some length?

Given a list whose length <= N, what is the best / most efficient way to fill it up with trailing NULLs so that it has length N?
This is something which is a one-liner in any decent language, but I don't have a clue how to do it (efficiently) in a few lines in R so that it works for every corner case (zero length list etc.).
Let's keep it really simple:
tst <- 1:10   # whatever, to get a vector of length 10
tst <- tst[1:15]
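Note that tst here is an atomic vector, so the out-of-range indexing pads with NA. On an actual list the same indexing trick pads with NULL, e.g.:
l <- list("a", 1:3)
l <- l[1:5]   # elements 3 to 5 are now NULL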
Try this: assigning NULL one past the desired length first grows the list to N + 1 elements (padding with NULLs) and then removes that last element, leaving length N.
> l = list("a",1:3)
> N = 5
> l[N+1]=NULL
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
>
How about this?
> l = list("a",1:3)
> length(l)=5
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
Directly editing the list's length appears to be the fastest as far as I can tell:
tmp <- vector("list", 5000)
sol1 <- function(x) {
  x <- x[1:10000]
}
sol2 <- function(x) {
  x[10001] <- NULL
}
sol3 <- function(x) {
  length(x) <- 10000
}
library(rbenchmark)
benchmark(sol1(tmp),sol2(tmp),sol3(tmp),replications = 5000)
test replications elapsed relative user.self sys.self user.child sys.child
1 sol1(tmp) 5000 2.045 1.394952 1.327 0.727 0 0
2 sol2(tmp) 5000 2.849 1.943383 1.804 1.075 0 0
3 sol3(tmp) 5000 1.466 1.000000 0.937 0.548 0 0
But the differences aren't huge, unless you're doing this a lot on very long lists, I suppose.
I'm sure there are shorter ways, but I would be inclined to do:
l <- as.list(1:10)
N <- 15
l <- c(l, as.list(rep(NA, N - length(l) )))
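Note that rep(NA, ...) pads with NA values rather than NULL. If trailing NULL elements are really wanted, concatenating an empty list of the right size does it (same idea, minor variation):
l <- c(l, vector("list", N - length(l)))   # vector("list", k) is a list of k NULLs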
Hi: I'm not sure if you were talking about an actual list but, if you were, the below will work. It works because, once you assign to an element of a vector (which a list is) beyond its current length, R expands the vector to that length; since assigning NULL then removes that element again, use an index one past the target length.
n <- 10   # target length (avoid calling it `length`, which masks base::length)
temp <- list("a", "b")
print(temp)
temp[n + 1] <- NULL   # grows temp to n + 1 elements, then drops the last one
print(temp)

Text file to list in R

I have a large text file with a variable number of fields in each row. The first entry in each row corresponds to a biological pathway, and each subsequent entry corresponds to a gene in that pathway. The first few lines might look like this:
path1 gene1 gene2
path2 gene3 gene4 gene5 gene6
path3 gene7 gene8 gene9
I need to read this file into R as a list, with each element being a character vector, and the name of each element in the list being the first element on the line, for example:
> pathways <- list(
+ path1=c("gene1","gene2"),
+ path2=c("gene3","gene4","gene5","gene6"),
+ path3=c("gene7","gene8","gene9")
+ )
>
> str(pathways)
List of 3
$ path1: chr [1:2] "gene1" "gene2"
$ path2: chr [1:4] "gene3" "gene4" "gene5" "gene6"
$ path3: chr [1:3] "gene7" "gene8" "gene9"
>
> str(pathways$path1)
chr [1:2] "gene1" "gene2"
>
> print(pathways)
$path1
[1] "gene1" "gene2"
$path2
[1] "gene3" "gene4" "gene5" "gene6"
$path3
[1] "gene7" "gene8" "gene9"
...but I need to do this automatically for thousands of lines. I saw a similar question posted here previously, but I couldn't figure out how to do this from that thread.
Thanks in advance.
Here's one way to do it:
# Read in the data
x <- scan("data.txt", what="", sep="\n")
# Separate elements by one or more whitespace characters
y <- strsplit(x, "[[:space:]]+")
# Extract the first vector element and set it as the list element name
names(y) <- sapply(y, `[[`, 1)
#names(y) <- sapply(y, function(x) x[[1]]) # same as above
# Remove the first vector element from each list element
y <- lapply(y, `[`, -1)
#y <- lapply(y, function(x) x[-1]) # same as above
One solution is to read the data in via read.table(), but use the fill = TRUE argument to pad the rows with fewer "entries", convert the resulting data frame to a list and then clean up the "empty" elements.
First, read your snippet of data in:
con <- textConnection("path1 gene1 gene2
path2 gene3 gene4 gene5 gene6
path3 gene7 gene8 gene9
")
dat <- read.table(con, fill = TRUE, stringsAsFactors = FALSE)
close(con)
Next we drop the first column, after saving it for use as the names of the list later:
nams <- dat[, 1]
dat <- dat[, -1]
Convert the data frame to a list. Here I just split the data frame on the indices 1,2,...,n where n is the number of rows:
ldat <- split(dat, seq_len(nrow(dat)))
Clean up the empty cells; x != "" gives a logical matrix here, and indexing the one-row data frame by it returns just the non-empty values as a character vector:
ldat <- lapply(ldat, function(x) x[x != ""])
Finally, apply the names:
names(ldat) <- nams
Giving:
> ldat
$path1
[1] "gene1" "gene2"
$path2
[1] "gene3" "gene4" "gene5" "gene6"
$path3
[1] "gene7" "gene8" "gene9"
A quick solution based on the linked page...
inlist <- strsplit(readLines("file.txt"), "[[:space:]]+")
pathways <- lapply(inlist, tail, n = -1)
names(pathways) <- lapply(inlist, head, n = 1)
One more solution:
sl <- c("path1 gene1 gene2", "path2 gene1 gene2 gene3") # created by readLines
f <- function(l, s) {
  v <- strsplit(s, " ")[[1]]
  l[[v[1]]] <- v[-1]   # first field becomes the name, the rest the value
  return(l)
}
res <- Reduce(f, sl, list())
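Checking the result against the two sample lines (expected output):
str(res)
# List of 2
#  $ path1: chr [1:2] "gene1" "gene2"
#  $ path2: chr [1:3] "gene1" "gene2" "gene3"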