R regex to fetch strings between characters at specific positions

I have some string data as follows in R.
DT <- structure(list(ID = c(1, 2, 3, 4, 5, 6), GKT = c("G1:GRST, G45:KRPT",
"G48932:KD56", "G7764:MGI45, K7786:IRE4R, K45:TG45", "K4512:3345, G51:56:34, K22:45I67",
"K678:RT,IG, G123:TGIF, G33:IG56", "T4534:K456")), .Names = c("ID",
"GKT"), class = "data.frame", row.names = c(NA, 6L))
DT
  ID                                GKT
1  1                  G1:GRST, G45:KRPT
2  2                        G48932:KD56
3  3 G7764:MGI45, K7786:IRE4R, K45:TG45
4  4   K4512:3345, G51:56:34, K22:45I67
5  5    K678:RT,IG, G123:TGIF, G33:IG56
6  6                         T4534:K456
I want to get the following output `out` from DT$GKT using gsub and a regex in R.
out <- c("G1, G45", "G48932", "G7764, K7786, K45", "K4512, G51, K22",
"K678, G123, G33", "T4534")
DT$out <- out
DT
  ID                                GKT               out
1  1                  G1:GRST, G45:KRPT           G1, G45
2  2                        G48932:KD56            G48932
3  3 G7764:MGI45, K7786:IRE4R, K45:TG45 G7764, K7786, K45
4  4   K4512:3345, G51:56:34, K22:45I67   K4512, G51, K22
5  5    K678:RT,IG, G123:TGIF, G33:IG56   K678, G123, G33
6  6                         T4534:K456             T4534
I have tried gsub(x=DT$GKT, pattern = "(:)(.*)(, |\\b)", replacement=""), but it keeps only the first entry, because the greedy (.*) consumes everything after the first colon.
gsub(x=DT$GKT, pattern = "(:)(.*)(, |\\b)", replacement="")
[1] "G1"     "G48932" "G7764"  "K4512"  "K678"   "T4534"

Another option using gsub is to use a lookahead:
DT$out <- gsub("(?=:)(.[A-Z0-9,]+)(?=\\b)", "", DT$GKT, perl = TRUE)
DT
#   ID                                GKT               out
# 1  1                  G1:GRST, G45:KRPT           G1, G45
# 2  2                        G48932:KD56            G48932
# 3  3 G7764:MGI45, K7786:IRE4R, K45:TG45 G7764, K7786, K45
# 4  4   K4512:3345, G51:56:34, K22:45I67   K4512, G51, K22
# 5  5    K678:RT,IG, G123:TGIF, G33:IG56   K678, G123, G33
# 6  6                         T4534:K456             T4534

EDIT
You can use the following regular expression for the replacement:
DT$out <- gsub(':\\S+\\b', '', DT$GKT)
DT
#   ID                                GKT               out
# 1  1                  G1:GRST, G45:KRPT           G1, G45
# 2  2                        G48932:KD56            G48932
# 3  3 G7764:MGI45, K7786:IRE4R, K45:TG45 G7764, K7786, K45
# 4  4   K4512:3345, G51:56:34, K22:45I67   K4512, G51, K22
# 5  5    K678:RT,IG, G123:TGIF, G33:IG56   K678, G123, G33
# 6  6                         T4534:K456             T4534

You could use a lookahead ((?=:)) to check for : and capture just the group before it:
unlist(regmatches(DT$GKT, gregexpr("([A-Z0-9]+)(?=:)", DT$GKT, perl=T)))
# [1] "G1"     "G45"    "G48932" "G7764"  "K7786"  "K45"    "K4512"  "G51"
# [9] "56"     "K22"    "K678"   "G123"   "G33"    "T4534"
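If the goal is the per-row, comma-separated `out` strings rather than one flat vector, the per-row grouping that `unlist` discards can be kept instead — a sketch along the lines of the answer above, on a cut-down copy of the data:

```r
DT <- data.frame(GKT = c("G1:GRST, G45:KRPT", "G48932:KD56"),
                 stringsAsFactors = FALSE)
# regmatches() returns a list with one character vector per row,
# so paste each element back together instead of unlist()-ing
m <- regmatches(DT$GKT, gregexpr("[A-Z0-9]+(?=:)", DT$GKT, perl = TRUE))
DT$out <- vapply(m, paste, character(1), collapse = ", ")
DT$out
# [1] "G1, G45" "G48932"
```

As with the flat output above, note that bare numbers that precede a colon (the "56" in "G51:56:34") are also picked up by this pattern.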

Related

Concatenate pandas dataframe with result of apply(lambda) where lambda returns another dataframe

A dataframe stores some values in its columns; passing those values to a function returns another dataframe. I'd like to concatenate the returned dataframe's columns to the original dataframe.
I tried to do something like
i = pd.concat([i, i[['cid', 'id']].apply(lambda x: xy(*x), axis=1)], axis=1)
but it failed with the error:
ValueError: cannot copy sequence with size 2 to array axis with dimension 1
So I did like this:
def xy(x, y):
    return pd.DataFrame({'x': [x*2], 'y': [y*2]})

df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})
print('df1:\n{}'.format(df1))
df2 = pd.DataFrame()
for _, row in df1.iterrows():
    nr = xy(row['cid'], row['id'])
    nr['cid'] = row['cid']
    nr['id'] = row['id']
    df2 = df2.append(nr, ignore_index=True)
print('df2:\n{}'.format(df2))
Output:
df1:
   cid  id
0    4   6
1    4  10
df2:
   x   y  cid  id
0  8  12    4   6
1  8  20    4  10
The code does not look nice and is probably slow.
Is there a pandas/pythonic way to do this properly and fast?
python 2.7
Option 0
Most directly with pd.DataFrame.assign. Not very generalizable.
df1.assign(x=df1.cid * 2, y=df1.id * 2)
   cid  id  x   y
0    4   6  8  12
1    4  10  8  20
Option 1
Use pd.DataFrame.join to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.join(df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
   cid  id  x   y
0    4   6  8  12
1    4  10  8  20
Option 2
Use pd.DataFrame.assign to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.assign(**df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
   cid  id  x   y
0    4   6  8  12
1    4  10  8  20
Option 3
However, if your function really is just multiplying by 2
df1.join(df1.mul(2).rename(columns=dict(cid='x', id='y')))
Or
df1.assign(**df1.mul(2).rename(columns=dict(cid='x', id='y')))
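For the record, the asker's original pd.concat attempt also works once the helper returns a Series instead of a one-row DataFrame — apply(axis=1) then expands the Series into columns that concat can align with df1. A sketch:

```python
import pandas as pd

def xy(x, y):
    # returning a Series (not a DataFrame) lets apply(axis=1)
    # expand the result into columns 'x' and 'y'
    return pd.Series({'x': x * 2, 'y': y * 2})

df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})
out = pd.concat([df1, df1.apply(lambda r: xy(r['cid'], r['id']), axis=1)],
                axis=1)
# columns: cid, id, x, y  with  x = 2*cid, y = 2*id
```

The ValueError in the question comes precisely from the helper handing back a 2-column DataFrame where apply expected a scalar or Series.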

R - How do I document the number of grepl matches based on another data frame?

This is a rather tricky question indeed. It would be awesome if someone might be able to help me out.
What I'm trying to do is the following. I have data frame in R containing every locality in a given state, scraped from Wikipedia. It looks something like this (top 10 rows). Let's call it NewHampshire.df:
   Municipality       County Population
1       Acworth     Sullivan        891
2        Albany      Carroll        735
3    Alexandria      Grafton       1613
4    Allenstown    Merrimack       4322
5       Alstead     Cheshire       1937
6         Alton      Belknap       5250
7       Amherst Hillsborough      11201
8       Andover    Merrimack       2371
9        Antrim Hillsborough       2637
10      Ashland      Grafton       2076
I've further created a new variable called grep_term, which combines the values from Municipality and County into a new variable that functions as an or-statement, something like this:
  Municipality   County Population          grep_term
1      Acworth Sullivan        891 "Acworth|Sullivan"
2       Albany  Carroll        735   "Albany|Carroll"
and so on. Furthermore, I have another dataset, containing self-disclosed locations of 2000 Twitter users. I call it location.df, and it looks a bit like this:
 [1] "London"                     "Orleans village VT USA"     "The World"
 [4] "D M V Towson "              "Playa del Sol Solidaridad"  "Beautiful Downtown Burbank"
 [7] NA                           "US"                         "Gaithersburg Md"
[10] NA                           "California "                "Indy"
[13] "Florida"                    "exsnaveen com"              "Houston TX"
I want to do two things:
1: Grepl through every observation in the location.df dataset, and save a TRUE or FALSE into a new variable depending on whether the self-disclosed location is part of the list in the first dataset.
2: Save the number of matches for a particular line in the NewHampshire.df dataset to a new variable. I.e., if there are 4 matches for Acworth in the twitter location dataset, there should be a value "4" for observation 1 in the NewHampshire.df on the newly created "matches" variable
What I've done so far: I've solved task 1, as follows:
for(i in 1:234){
  location.df$isRelevant <- sapply(location.df$location,
                                   function(s) grepl(NH_Places[i], s, ignore.case = TRUE))
}
How can I solve task 2, ideally in the same for loop?
Thanks in advance, any help would be greatly appreciated!
With regard to task one, you could also use:
# location vector to be matched against
loc.vec <- c("Acworth","Hillsborough","California","Amherst","Grafton","Ashland","London")
location.df <- data.frame(location=loc.vec)
# create a 'grep-vector'
places <- paste(paste(NewHampshire$Municipality, NewHampshire$County,
sep = "|"),
collapse = "|")
# match them against the available locations
location.df$isRelevant <- sapply(location.df$location,
function(s) grepl(places, s, ignore.case = TRUE))
which gives:
> location.df
      location isRelevant
1      Acworth       TRUE
2 Hillsborough       TRUE
3   California      FALSE
4      Amherst       TRUE
5      Grafton       TRUE
6      Ashland       TRUE
7       London      FALSE
To get the number of matches in the location.df with the grep_term column, you can use:
NewHampshire$n.matches <- sapply(NewHampshire$grep_term, function(x) sum(grepl(x, loc.vec)))
gives:
> NewHampshire
   Municipality       County Population            grep_term n.matches
1       Acworth     Sullivan        891     Acworth|Sullivan         1
2        Albany      Carroll        735       Albany|Carroll         0
3    Alexandria      Grafton       1613   Alexandria|Grafton         1
4    Allenstown    Merrimack       4322 Allenstown|Merrimack         0
5       Alstead     Cheshire       1937     Alstead|Cheshire         0
6         Alton      Belknap       5250        Alton|Belknap         0
7       Amherst Hillsborough      11201 Amherst|Hillsborough         2
8       Andover    Merrimack       2371    Andover|Merrimack         0
9        Antrim Hillsborough       2637  Antrim|Hillsborough         1
10      Ashland      Grafton       2076      Ashland|Grafton         2
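Note that the asker's original loop overwrites location.df$isRelevant on every iteration, so only the last municipality's matches survive. Both tasks can instead be read off a single logical matrix — a sketch, using toy stand-ins for the real data:

```r
# toy stand-ins for loc.vec and NewHampshire
loc.vec      <- c("Acworth", "Hillsborough", "California")
NewHampshire <- data.frame(grep_term = c("Acworth|Sullivan",
                                         "Amherst|Hillsborough"),
                           stringsAsFactors = FALSE)
location.df  <- data.frame(location = loc.vec, stringsAsFactors = FALSE)

# one logical matrix: rows = locations, columns = grep terms
hits <- sapply(NewHampshire$grep_term,
               function(p) grepl(p, location.df$location, ignore.case = TRUE))
location.df$isRelevant <- rowSums(hits) > 0  # task 1: matched by any term
NewHampshire$n.matches <- colSums(hits)      # task 2: matches per term
```

Building the matrix once means each grepl result is computed exactly one time and reused for both questions.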

R Wildcard data frame merging

I'm trying to merge a data frame and vector not by exact string matches in a column, but by wildcard string matches. To clarify, say you have this dataframe:
v <-data.frame(X1=c("AGTACAGT","AGTGAAGT","TGTA","GTTA","GAT","GAT"),X2=c(1,1,1,1,1,1))
#         X1 X2
# 1 AGTACAGT  1
# 2 AGTGAAGT  1
# 3     TGTA  1
# 4     GTTA  1
# 5      GAT  1
# 6      GAT  1
I want to create a dataframe by assigning a different color to each of the patterns AGT.{3}GT, .(T|G)TA and GAT, and adding a new column X3 that holds that color. So something like this:
#         X1 X2        X3
# 1 AGTACAGT  1 "#FE7F01"
# 2 AGTGAAGT  1 "#FE7F01"
# 3     TGTA  1 "#FE7F00"
# 4     GTTA  1 "#FE7F00"
# 5      GAT  1 "#FE8002"
# 6      GAT  1 "#FE8002"
So far I am using this to create colors for each level, but I don't know how to count how many "wildcard levels" as opposed to singular levels there are:
x <- nlevels(v$X1)
x.colors2 <- colorRampPalette(brewer.pal(8,"Paired"))(x)
G <- data.frame("X1"=levels(v$X1),"X3"=x.colors2)
v <- merge(v,G)
Here's a solution.
Find patterns:
pat <- c("^AGT.{3}GT$", "^.(T|G)TA$", "^GAT$")
n <- length(pat)
indList <- lapply(pat, grep, v$X1)
Generate colors:
library(RColorBrewer)
col <- colorRampPalette(brewer.pal(8, "Paired"))(n)
Add colors to data frame:
colFull <- rep(col, sapply(indList, length))
v$color <- colFull[order(unlist(indList))]
The result:
v
#         X1 X2   color
# 1 AGTACAGT  1 #A6CEE3
# 2 AGTGAAGT  1 #A6CEE3
# 3     TGTA  1 #979C62
# 4     GTTA  1 #979C62
# 5      GAT  1 #FF7F00
# 6      GAT  1 #FF7F00
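The order(unlist(indList)) step only puts the colors back into row order; an equivalent and arguably more direct sketch writes each pattern's color straight into the matched rows (with a hard-coded stand-in palette, so RColorBrewer isn't needed for the illustration):

```r
v <- data.frame(X1 = c("AGTACAGT", "AGTGAAGT", "TGTA", "GTTA", "GAT", "GAT"),
                X2 = 1, stringsAsFactors = FALSE)
pat <- c("^AGT.{3}GT$", "^.(T|G)TA$", "^GAT$")
col <- c("#A6CEE3", "#979C62", "#FF7F00")  # any palette of length(pat)

v$color <- NA  # rows matching no pattern stay NA
for (i in seq_along(pat)) v$color[grep(pat[i], v$X1)] <- col[i]
```

If a string matched more than one pattern, the later pattern would win here, whereas the indList version keeps the first match — something to check against the real data.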

How to replicate columns, splitting their names at the delimiter '/', into multiple columns, in R?

I have this (very large) matrix "mymat". I need to replicate the columns that have "/" in their column name, splitting at the "/", to make "resmatrix". How can I get this done in R?
mymat
 a b IID:WE:G12D/V GH:SQ:p.R172W/G c
 1 3             4               2 4
22 4             2               2 4
 2 3             2               2 4
resmatrix
 a b IID:WE:G12D IID:WE:G12V GH:SQ:p.R172W GH:SQ:p.R172G c
 1 3           4           4             2             2 4
22 4           2           2             2             2 4
 2 3           2           2             2             2 4
Find out which columns have the "/" and replicate them, then rename. To calculate the new names, just split on / and replace the last letter for the second name.
# which columns have '/' in them?
which.slash <- grep('/', names(mymat), value=T)
new.names <- unlist(lapply(strsplit(which.slash, '/'),
                           function (bits) {
                             # bits[1] is e.g. IID:WE:G12D and bits[2] is the V
                             # take bits[1] and replace the last letter for the second colname
                             c(bits[1], sub('.$', bits[2], bits[1]))
                           }))
# make resmat by copying the appropriate columns
resmat <- cbind(mymat, mymat[, which.slash])
# order the columns to make sure the names replace properly
resmat <- resmat[, order(names(resmat))]
# put the new names in
names(resmat)[grep('/', names(resmat))] <- sort(new.names)
resmat looks like this
#    a b c GH:SQ:p.R172G GH:SQ:p.R172W IID:WE:G12D IID:WE:G12V
# 1  1 3 4             2             2           4           4
# 2 22 4 4             2             2           2           2
# 3  2 3 4             2             2           2           2
You could use grep to get the index of column names with / ('nm1'), replicate the column names in 'nm1' by using sub/scan to create 'nm2'. Then, cbind the columns that are not 'nm1', with the replicated columns ('nm1'), change the column names with 'nm2', and if needed order the columns.
#get a logical index of the columns with '/' in the name
nm1 <- grepl('/', names(df1))
#used regex to rearrange the substrings in the nm1 column names
#removed the `/` and use `scan` to split at the space delimiter
nm2 <- scan(text=gsub('([^/]+)(.)/(.*)', '\\1\\2 \\1\\3',
names(df1)[nm1]), what='', quiet=TRUE)
#cbind the columns that are not in nm1, with the replicate nm1 columns
df2 <- cbind(df1[!nm1], setNames(df1[rep(which(nm1), each= 2)], nm2))
#create another index to find the starting position of nm1 columns
nm3 <- names(df1)[1:(which(nm1)[1L]-1)]
#we concatenate the nm3, nm2, and the rest of the columns to match
#the expected output order
df2N <- df2[c(nm3, nm2, setdiff(names(df1)[!nm1], nm3))]
df2N
#   a b IID:WE:G12D IID:WE:G12V GH:SQ:p.R172W GH:SQ:p.R172G c
#1  1 3           4           4             2             2 4
#2 22 4           2           2             2             2 4
#3  2 3           2           2             2             2 4
data
df1 <- structure(list(a = c(1L, 22L, 2L), b = c(3L, 4L, 3L),
`IID:WE:G12D/V` = c(4L,
2L, 2L), `GH:SQ:p.R172W/G` = c(2L, 2L, 2L), c = c(4L, 4L, 4L)),
.Names = c("a", "b", "IID:WE:G12D/V", "GH:SQ:p.R172W/G", "c"),
class = "data.frame", row.names = c(NA, -3L))
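A plainer base-R sketch of the same idea — loop over the '/' columns, duplicate each one, and build the second name by swapping the last character (the column order then still needs reordering, as in the answers above):

```r
df1 <- data.frame(a = c(1, 22, 2), b = c(3, 4, 3),
                  `IID:WE:G12D/V` = c(4, 2, 2),
                  `GH:SQ:p.R172W/G` = c(2, 2, 2), c = c(4, 4, 4),
                  check.names = FALSE)

res <- df1
for (nm in grep('/', names(df1), value = TRUE)) {
  base <- sub('/.*$', '', nm)                  # e.g. "IID:WE:G12D"
  alt  <- sub('.$', sub('^.*/', '', nm), base) # e.g. "IID:WE:G12V"
  res[[alt]] <- res[[nm]]                      # duplicate the values
  names(res)[names(res) == nm] <- base         # rename the original
}
```

This assumes, like both answers, that the part after '/' is a single character replacing the last character of the first name.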

Split a vector of strings over a character to return a matrix

I have
rownames(results.summary)
[1] "2 - 1" "3 - 1" "4 - 1"
What I want is to return a matrix of
2 1
3 1
4 1
The way I've done it is:
for(i in 1:length(rownames(results.summary))){
  current.split <- unlist(strsplit(rownames(results.summary)[i], "-"))
  matrix.results$comparison.group[i] <- trim(current.split[1])
  matrix.results$control.group[i] <- trim(current.split[2])
}
The trim function basically removes any whitespace on either end.
I've been learning regex and was wondering if there's perhaps a more elegant vectorized solution?
No need to use strsplit, just read it using read.table:
read.table(text=vec,sep='-',strip.white = TRUE) ## see #flodel comment
  V1 V2
1  2  1
2  3  1
3  4  1
where vec is :
vec <- c("2 - 1", "3 - 1", "4 - 1")
This should work:
vv <- c("2 - 1", "3 - 1", "4 - 1")
matrix(as.numeric(unlist(strsplit(vv, " - "))), ncol = 2, byrow = TRUE)
#      [,1] [,2]
# [1,]    2    1
# [2,]    3    1
# [3,]    4    1
You can also try scan
vec <- c("2 - 1", "3 - 1", "4 - 1")
s <- scan(text = vec, what = integer(), sep = "-", quiet = TRUE)
matrix(s, length(s)/2, byrow = TRUE)
#      [,1] [,2]
# [1,]    2    1
# [2,]    3    1
# [3,]    4    1
Another option is cSplit.
library(splitstackshape)
cSplit(data.frame(vec), "vec", sep = " - ", fixed=TRUE)
#    vec_1 vec_2
# 1:     2     1
# 2:     3     1
# 3:     4     1
You can use str_match from the package stringr for this:
library(stringr)
##
x <- c("2 - 1","3 - 1","4 - 1")
##
cmat <- str_match(x, "(\\d).+(\\d)")[,-1]
apply(cmat, 2, as.numeric)
     [,1] [,2]
[1,]    2    1
[2,]    3    1
[3,]    4    1
Using reshape2 colsplit
library(reshape2)
colsplit(x, " - ", c("A", "B"))
#   A B
# 1 2 1
# 2 3 1
# 3 4 1
Or using tidyr's separate
library(tidyr)
separate(data.frame(x), x, c("A", "B"), sep = " - ")
#   A B
# 1 2 1
# 2 3 1
# 3 4 1
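For comparison, a plain base-R variant of the same split — rbind the strsplit pieces and coerce each row to numeric, with no extra packages:

```r
vec <- c("2 - 1", "3 - 1", "4 - 1")
# strsplit gives a list of character pairs; as.numeric converts each,
# and do.call(rbind, ...) stacks them into a 3 x 2 numeric matrix
m <- do.call(rbind, lapply(strsplit(vec, " - ", fixed = TRUE), as.numeric))
m
#      [,1] [,2]
# [1,]    2    1
# [2,]    3    1
# [3,]    4    1
```

Using fixed = TRUE splits on the literal " - ", which also makes the trim step from the question unnecessary.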