R Wildcard data frame merging - regex

I'm trying to merge a data frame and vector not by exact string matches in a column, but by wildcard string matches. To clarify, say you have this dataframe:
v <-data.frame(X1=c("AGTACAGT","AGTGAAGT","TGTA","GTTA","GAT","GAT"),X2=c(1,1,1,1,1,1))
# X1 X2
# 1 AGTACAGT 1
# 2 AGTGAAGT 2
# 3 TGTA 3
# 4 GTTA 4
# 5 GAT 5
# 6 GAT 6
I want to create a dataframe by creating a different color for every AGT.{3}GT,.{T|G}TA,GAT pattern, and creating a new column X3 that would show that color. So something like this:
# X1 X2 X3
# 1 AGTACAGT 1 "#FE7F01"
# 2 AGTGAAGT 2 "#FE7F01"
# 3 TGTA 3 "#FE7F00"
# 4 GTTA 4 "#FE7F00"
# 5 GAT 5 "#FE8002"
# 6 GAT 6 "#FE8002"
So far I am using this to create colors for each level, but I don't know how to count how many "wildcard levels" as opposed to singular levels there are:
x <- nlevels(v$X1)
x.colors2 <- colorRampPalette(brewer.pal(8,"Paired"))(x)
G <- data.frame("X1"=levels(v$X1),"X3"=x.colors2)
v <- merge(v,G)

Here's a solution.
Find patterns:
pat <- c("^AGT.{3}GT$", "^.(T|G)TA$", "^GAT$")
n <- length(pat)
indList <- lapply(pat, grep, v$X1)
Generate colors:
library(RColorBrewer)
col <- colorRampPalette(brewer.pal(8, "Paired"))(n)
Add colors to data frame:
colFull <- rep(col, sapply(indList, length))
v$color <- colFull[order(unlist(indList))]
The result:
v
# X1 X2 color
# 1 AGTACAGT 1 #A6CEE3
# 2 AGTGAAGT 1 #A6CEE3
# 3 TGTA 1 #979C62
# 4 GTTA 1 #979C62
# 5 GAT 1 #FF7F00
# 6 GAT 1 #FF7F00

Related

R - Inserting variable number of spaces into postcode string

I have a set of UK postcodes which need to be reformatted. They are made up of an incode and an outcode, where incode is of the form 'number letter letter' e.g. 2DB and the outcode is a combination of between 2 and 4 letters and numbers e.g. NW1 or SW10 or EC1A
Currently there is one space between the incode and outcode, but I need to reformat these so that the full postcode is 7 characters long e.g: ('-' stands for space)
NW1-2DB -> NW1-2DB (1 space between outcode and incode)
SW10-9NH -> SW109NH (0 spaces)
E1-6QL -> E1--6QL (2 spaces)
Data:
df <- data.frame("postcode"=c("NW1 2DB","SW10 9NH","E1 6QL"))
df
# postcode
# 1 NW1 2DB
# 2 SW10 9NH
# 3 E1 6QL
I have written a regex string to separate the outcode and incode, but couldn't find a way to add a variable number of spaces between them (this example just creates two spaces between outcode and incode).
require(dplyr)
df <- df %>% mutate(postcode_2sp = gsub('?(\\S+)\\s*?(\\d\\w{2})$','\\1 \\2', postcode)
To get around that I've tried to use mutate(),nchar() and rep():
df<-df %>%
mutate(outcode=gsub('?(\\S+)\\s*\\d\\w{2}$','\\1',postcode),
incode=gsub('\\S+\\s*?(\\d\\w{2})$','\\1',postcode)) %>%
mutate(out_length=nchar(outcode))%>%
mutate(postcode7=paste0(outcode,
paste0(rep(" ",4-out_length),collapse=""),
incode))
but get this error:
Error: invalid 'times' argument
without the last step to create postcode7 the df looks as follows:
df
# postcode outcode incode out_length
# 1 NW1 2DB NW1 2DB 3
# 2 SW10 9NH SW10 9NH 4
# 3 E1 6QL E1 6QL 2
And if I set the rep 'times' argument to a constant the code runs as expected (but doesn't do what I need it to do!)
df<-df %>%
mutate(outcode=gsub('?(\\S+)\\s*\\d\\w{2}$','\\1',postcode),
incode=gsub('\\S+\\s*?(\\d\\w{2})$','\\1',postcode)) %>%
mutate(out_length=nchar(outcode))%>%
mutate(postcode7=paste0(outcode,
paste0(rep(" ",4),collapse=""),
incode))
df
# postcode outcode incode out_length postcode7
# 1 NW1 2DB NW1 2DB 3 NW1 2DB
# 2 SW10 9NH SW10 9NH 4 SW10 9NH
# 3 E1 6QL E1 6QL 2 E1 6QL
Is there a way to make rep() accept a column as the times argument in a mutate? Or should I be looking at a totally different approach?
EDIT: I've just realised that I can use an if statement for each case of 2 characters, 3 characters or 4 characters in the outcode but that doesn't feel very elegant.
Have a look at the str_pad method from stringr package, which is suited for your case:
library(stringr)
df<-df %>%
mutate(outcode=gsub('?(\\S+)\\s*\\d\\w{2}$','\\1',postcode),
incode=gsub('\\S+\\s*?(\\d\\w{2})$','\\1',postcode)) %>%
mutate(out_length=nchar(outcode)) %>%
mutate(postcode7 = paste(outcode, str_pad(incode, 7-out_length), sep = ""))
df
# postcode outcode incode out_length postcode7
# 1 NW1 2DB NW1 2DB 3 NW1 2DB
# 2 SW10 9NH SW10 9NH 4 SW109NH
# 3 E1 6QL E1 6QL 2 E1 6QL
Another solution, using sprintf to format the output, and tidyr::extract for matching. This has the advantage of drastically simplifying both the pattern and the code for padding:
df %>%
extract(postcode, into = c('out', 'in'), '(\\S{2,4})\\s*(\\d\\w\\w)') %>%
mutate(postcode = sprintf('% -4s%s', out, `in`))
I do like the separate version posted above, but it requires that the postcodes are all separated by whitespace. In my experience this generally isn’t the case.
Using str_pad and separate:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate(postcode, into = c("incode", "outcode"), remove = FALSE) %>%
mutate(
postcode8 = paste0(incode,
str_pad(outcode,
8 - nchar(incode), side = "left", pad = " ")))
# postcode incode outcode postcode8
# 1 NW1 2DB NW1 2DB NW1 2DB
# 2 SW10 9NH SW10 9NH SW10 9NH
# 3 E1 6QL E1 6QL E1 6QL
df%>%mutate(Postcode7=paste0(format(gsub('\\s.*$','',postcode),justify='left'),
format(gsub('^\\S+\\s','',postcode),justify='right')))

How to replicate column-names, split them at delimiter '/', into multiple column-names, in R?

I have this matrix (it's big in size) "mymat". I need to replicate the columns that have "/" in their column name matching at "/" and make a "resmatrix". How can I get this done in R?
mymat
a b IID:WE:G12D/V GH:SQ:p.R172W/G c
1 3 4 2 4
22 4 2 2 4
2 3 2 2 4
resmatrix
a b IID:WE:G12D IID:WE:G12V GH:SQ:p.R172W GH:SQ:p.R172G c
1 3 4 4 2 2 4
22 4 2 2 2 2 4
2 3 2 2 2 2 4
Find out which columns have the "/" and replicate them, then rename. To calculate the new names, just split on / and replace the last letter for the second name.
# which columns have '/' in them?
which.slash <- grep('/', names(mymat), value=T)
new.names <- unlist(lapply(strsplit(which.slash, '/'),
function (bits) {
# bits[1] is e.g. IID:WE:G12D and bits[2] is the V
# take bits[1] and replace the last letter for the second colname
c(bits[1], sub('.$', bits[2], bits[1]))
}))
# make resmat by copying the appropriate columns
resmat <- cbind(mymat, mymat[, which.slash])
# order the columns to make sure the names replace properly
resmat <- resmat[, order(names(resmat))]
# put the new names in
names(resmat)[grep('/', names(resmat))] <- sort(new.names)
resmat looks like this
# a b c GH:SQ:p.R172G GH:SQ:p.R172W IID:WE:G12D IID:WE:G12V
# 1 1 3 4 2 2 4 4
# 2 22 4 4 2 2 2 2
# 3 2 3 4 2 2 2 2
You could use grep to get the index of column names with / ('nm1'), replicate the column names in 'nm1' by using sub/scan to create 'nm2'. Then, cbind the columns that are not 'nm1', with the replicated columns ('nm1'), change the column names with 'nm2', and if needed order the columns.
#get the column index with grep
nm1 <- grepl('/', names(df1))
#used regex to rearrange the substrings in the nm1 column names
#removed the `/` and use `scan` to split at the space delimiter
nm2 <- scan(text=gsub('([^/]+)(.)/(.*)', '\\1\\2 \\1\\3',
names(df1)[nm1]), what='', quiet=TRUE)
#cbind the columns that are not in nm1, with the replicate nm1 columns
df2 <- cbind(df1[!nm1], setNames(df1[rep(which(nm1), each= 2)], nm2))
#create another index to find the starting position of nm1 columns
nm3 <- names(df1)[1:(which(nm1)[1L]-1)]
#we concatenate the nm3, nm2, and the rest of the columns to match
#the expected output order
df2N <- df2[c(nm3, nm2, setdiff(names(df1)[!nm1], nm3))]
df2N
# a b IID:WE:G12D IID:WE:G12V GH:SQ:p.R172W GH:SQ:p.R172G c
#1 1 3 4 4 2 2 4
#2 22 4 2 2 2 2 4
#3 2 3 2 2 2 2 4
data
df1 <- structure(list(a = c(1L, 22L, 2L), b = c(3L, 4L, 3L),
`IID:WE:G12D/V` = c(4L,
2L, 2L), `GH:SQ:p.R172W/G` = c(2L, 2L, 2L), c = c(4L, 4L, 4L)),
.Names = c("a", "b", "IID:WE:G12D/V", "GH:SQ:p.R172W/G", "c"),
class = "data.frame", row.names = c(NA, -3L))

R regex to fetch strings between characters at specific positions

I have some string data as follows in R.
DT <- structure(list(ID = c(1, 2, 3, 4, 5, 6), GKT = c("G1:GRST, G45:KRPT",
"G48932:KD56", "G7764:MGI45, K7786:IRE4R, K45:TG45", "K4512:3345, G51:56:34, K22:45I67",
"K678:RT,IG, G123:TGIF, G33:IG56", "T4534:K456")), .Names = c("ID",
"GKT"), class = "data.frame", row.names = c(NA, 6L))
DT
ID GKT
1 1 G1:GRST, G45:KRPT
2 2 G48932:KD56
3 3 G7764:MGI45, K7786:IRE4R, K45:TG45
4 4 K4512:3345, G51:56:34, K22:45I67
5 5 K678:RT,IG, G123:TGIF, G33:IG56
6 6 T4534:K456
I want to get the output out from DT$GKT using gsub and regex in R.
out <- c("G1, G45", "G48932", "G7764, K7786, K45", "K4512, G51, K22",
"K678, G123, G33", "T4534")
DT$out <- out
DT
ID GKT out
1 1 G1:GRST, G45:KRPT G1, G45
2 2 G48932:KD56 G48932
3 3 G7764:MGI45, K7786:IRE4R, K45:TG45 G7764, K7786, K45
4 4 K4512:3345, G51:56:34, K22:45I67 K4512, G51, K22
5 5 K678:RT,IG, G123:TGIF, G33:IG56 K678, G123, G33
6 6 T4534:K456 T4534
I have tried gsub(x=DT$GKT, pattern = "(:)(.*)(, |\\b)", replacement=""), but it fetches only first instances.
gsub(x=DT$GKT, pattern = "(:)(.*)(, |\\b)", replacement="")
[1] "G1" "G48932" "G7764" "K4512" "K678" "T4534"
Another option using gsub is to use a look behind
DT$out <- gsub("(?=:)(.[A-Z0-9,]+)(?=\\b)", "", DT$GKT, perl = TRUE)
DT
# ID GKT out
# 1 1 G1:GRST, G45:KRPT G1, G45
# 2 2 G48932:KD56 G48932
# 3 3 G7764:MGI45, K7786:IRE4R, K45:TG45 G7764, K7786, K45
# 4 4 K4512:3345, G51:56:34, K22:45I67 K4512, G51, K22
# 5 5 K678:RT,IG, G123:TGIF, G33:IG56 K678, G123, G33
# 6 6 T4534:K456 T4534
EDIT
You can use the following regular expression for replacing ...
DT$out <- gsub(':\\S+\\b', '', DT$GKT)
DT
# ID GKT out
# 1 1 G1:GRST, G45:KRPT G1, G45
# 2 2 G48932:KD56 G48932
# 3 3 G7764:MGI45, K7786:IRE4R, K45:TG45 G7764, K7786, K45
# 4 4 K4512:3345, G51:56:34, K22:45I67 K4512, G51, K22
# 5 5 K678:RT,IG, G123:TGIF, G33:IG56 K678, G123, G33
# 6 6 T4534:K456 T4534
You could use a lookahead (?=) to check for : and capture just the first group
unlist(regmatches(DT$GKT, gregexpr("([A-Z0-9]+)(?=:)", DT$GKT, perl=T)))
# [1] "G1" "G45" "G48932" "G7764" "K7786" "K45" "K4512" "G51"
# [9] "56" "K22" "K678" "G123" "G33" "T4534"

How to count frequencies of certain character in a string?

If I have a run of characters such as "AABBABBBAAAABBAAAABBBAABBBBABABB".
Is there a way to get R to count the runs of A and state how many of each length ?
So I'd like to know how many instances of 3 A's in a row, how many instances of a single A, how many instances of 2 A's in a row, etc.
table(rle(strsplit("AABBABBBAAAABBAAAABBBAABBBBABABB","")[[1]]))
gives
values
lengths A B
1 3 1
2 2 3
3 0 2
4 2 1
which (reading down the A column) means there were 3 A runs of length 1, 2 A runs of length 2 and 2 A runs of length 4.
Try
v1 <- scan(text=gsub('[^A]+', ',', str1), sep=',', what='', quiet=TRUE)
table(v1[nzchar(v1)])
# A AA AAAA
# 3 2 2
Or
library(stringi)
table(stri_extract_all_regex(str1, '[A]+')[[1]])
# A AA AAAA
# 3 2 2
Benchmarks
set.seed(42)
x1 <- stri_rand_strings(1,1e7, pattern='[A-G]')
system.time(table(stri_split_regex(x1, "[^A]+", omit_empty = TRUE)))
# user system elapsed
# 0.829 0.002 0.831
system.time(table(stri_extract_all_regex(x1, '[A]+')[[1]]))
# user system elapsed
# 0.790 0.002 0.791
system.time(table(rle(strsplit(x1,"")[[1]])) )
# user system elapsed
# 30.230 1.243 31.523
system.time(table(strsplit(x1, "[^A]+")))
# user system elapsed
# 4.253 0.006 4.258
system.time(table(attr(gregexpr("A+",x1)[[1]], 'match.length')))
# user system elapsed
# 1.994 0.004 1.999
library(microbenchmark)
microbenchmark(david=table(stri_split_regex(x1, "[^A]+", omit_empty = TRUE)),
akrun= table(stri_extract_all_regex(x1, '[A]+')[[1]]),
david2 = table(strsplit(x1, "[^A]+")),
glen = table(rle(strsplit(x1,"")[[1]])),
plannapus = table(attr(gregexpr("A+",x1)[[1]], 'match.length')),
times=20L, unit='relative')
#Unit: relative
# expr min lq mean median uq max neval cld
# david 1.0000000 1.000000 1.000000 1.000000 1.0000000 1.000000 20 a
# akrun 0.7908313 1.023388 1.054670 1.336510 0.9903384 1.004711 20 a
# david2 4.9325256 5.461389 5.613516 6.207990 5.6647301 5.374668 20 c
# glen 14.9064240 15.975846 16.672339 20.570874 15.8710402 15.465140 20 d
#plannapus 2.5077719 3.123360 2.836338 3.557242 2.5689176 2.452964 20 b
data
str1 <- 'AABBABBBAAAABBAAAABBBAABBBBABABB'
Here's additional way using strsplit
x <- "AABBABBBAAAABBAAAABBBAABBBBABABB"
table(strsplit(x, "[^A]+"))
# A AA AAAA
# 3 2 2
Or similarly with the stringi package
library(stringi)
table(stri_split_regex(x, "[^A]+", omit_empty = TRUE))
For completeness, here is another way, using the regmatches and gregexpr combo, to extract regexes:
x <- "AABBABBBAAAABBAAAABBBAABBBBABABB"
table(regmatches(x,gregexpr("A+",x))[[1]])
# A AA AAAA
# 3 2 2
Or in fact, since gregexpr keeps the length of the captured substring as attribute, one could even do, directly:
table(attr(gregexpr("A+",x)[[1]],'match.length'))
# 1 2 4
# 3 2 2

R separating out number and units from a column in a dataframe

I have a dataframe which contains a column that has numbers as well as variable units:
num <- c(1:5)
val <- c("5%","10K", "100.2mv","1.4g","1.007kbars")
df <- data.frame(num,val)
df
How can I create two new columns from df$val, one that contains just the number and one the units?
Thank you for your help.
Here's a solution using stringr:
library(stringr)
df$extr_nums <- str_extract(val, "\\d+\\.?\\d*")
df$extr_units <- str_replace(val, nums, "")
df
num val extr_nums extr_units
1 1 5% 5 %
2 2 10K 10 K
3 3 100.2mv 100.2 mv
4 4 1.4g 1.4 g
5 5 1.007kbars 1.007 kbars
The regexp is translated as: "at least 1 digit, followed by optional dot, followed by optional digits".