I'm trying to make the following call to outer() substantially faster. Parallelizing via foreach is still prohibitively slow, so I'd like to try calling this from C++ using Rcpp, but I would love to hear about any faster alternative.
Given a matrix mat and a list col.list whose elements are vectors of column names of mat, I summarize mat as follows:
mycall <- function(mat, col.list) {
  outer(
    rownames(mat),
    col.list,
    Vectorize(function(x, y) {
      mean(mat[x, y])
    })
  )
}
For instance:
set.seed(123)
mat <- matrix(rnorm(100),nrow=10)
rownames(mat) <- letters[1:10]
colnames(mat) <- LETTERS[1:10]
mat
A B C D E F G H I J
a -0.56047565 1.2240818 -1.0678237 0.42646422 -0.69470698 0.25331851 0.37963948 -0.4910312 0.005764186 0.9935039
b -0.23017749 0.3598138 -0.2179749 -0.29507148 -0.20791728 -0.02854676 -0.50232345 -2.3091689 0.385280401 0.5483970
c 1.55870831 0.4007715 -1.0260044 0.89512566 -1.26539635 -0.04287046 -0.33320738 1.0057385 -0.370660032 0.2387317
d 0.07050839 0.1106827 -0.7288912 0.87813349 2.16895597 1.36860228 -1.01857538 -0.7092008 0.644376549 -0.6279061
e 0.12928774 -0.5558411 -0.6250393 0.82158108 1.20796200 -0.22577099 -1.07179123 -0.6880086 -0.220486562 1.3606524
f 1.71506499 1.7869131 -1.6866933 0.68864025 -1.12310858 1.51647060 0.30352864 1.0255714 0.331781964 -0.6002596
g 0.46091621 0.4978505 0.8377870 0.55391765 -0.40288484 -1.54875280 0.44820978 -0.2847730 1.096839013 2.1873330
h -1.26506123 -1.9666172 0.1533731 -0.06191171 -0.46665535 0.58461375 0.05300423 -1.2207177 0.435181491 1.5326106
i -0.68685285 0.7013559 -1.1381369 -0.30596266 0.77996512 0.12385424 0.92226747 0.1813035 -0.325931586 -0.2357004
j -0.44566197 -0.4727914 1.2538149 -0.38047100 -0.08336907 0.21594157 2.05008469 -0.1388914 1.148807618 -1.0264209
col.list <- replicate(5, sample(colnames(mat),sample(10,1)), simplify = F)
col.list
[[1]]
[1] "I" "H" "F" "C"
[[2]]
[1] "H" "C" "E" "D"
[[3]]
[1] "F" "A" "B" "C"
[[4]]
[1] "I" "G" "H" "F"
[[5]]
[1] "B" "F" "A" "D" "J"
mycall(mat, col.list)
[,1] [,2] [,3] [,4] [,5]
[1,] -0.32494304 -0.45677441 -0.03772476 0.03692275 0.46737855
[2,] -0.54260254 -0.75753314 -0.02922133 -0.61368967 0.07088301
[3,] -0.10844910 -0.09763415 0.22265121 0.06475016 0.61009334
[4,] 0.14372171 0.40224937 0.20522554 0.07130067 0.36000416
[5,] -0.43982636 0.17912380 -0.31934091 -0.55151435 0.30598183
[6,] 0.29678266 -0.27389757 0.83293885 0.79433814 1.02136588
[7,] 0.02527506 0.17601171 0.06195023 -0.07211925 0.43025291
[8,] -0.01188734 -0.39897791 -0.62342288 -0.03697956 -0.23527315
[9,] -0.28972770 -0.12070775 -0.24994491 0.22537340 -0.08066115
[10,] 0.61991819 0.16277087 0.13782578 0.81898563 -0.42188074
You could try:
sapply(col.list, function(v) rowMeans(mat[, v, drop = FALSE]))
The drop = FALSE keeps mat[, v] a matrix even when v names a single column, which rowMeans requires.
I suspect the reason your solution is slow is Vectorize: it's a nice way to turn a scalar function into a vectorized one, but it has a huge cost. Since it's based on mapply, it calls the function on each element, one by one: that is, one call to mean for each entry of the result. If the outer result is large, that's going to be very costly. With the solution above, the code is at least vectorized in one direction, thanks to rowMeans.
I applied the GeoSeries.almost_equals(other[, decimal=6]) function to a GeoDataFrame with 10 million entries in order to find geo points that lie close to each other.
This gave me a boolean matrix. I now need to filter all the True values in order to create a DataFrame or list containing only the POIs that are geographically related.
Now I am struggling to figure out how to filter this matrix further.
The expected output is a vector, a list, or ideally a DataFrame with all the True (matched) values, paired one to one and without mirrored repeats (if [1,9] is present, then [9,1] should be removed from the output).
list example:
DF example:
Consider this example dataframe (with pandas imported as pd and numpy imported as np):
In [1]: df = pd.DataFrame([[True, False, False, True],
...: [False, True, True, False],
...: [False, True, True, False],
...: [True, False, False, True]])
In [2]: df
Out[2]:
0 1 2 3
0 True False False True
1 False True True False
2 False True True False
3 True False False True
A possible solution to get to the dataframe of matching indexes:
First I use np.triu to only consider the upper triangle (so you don't have duplicates):
In [15]: df2 = pd.DataFrame(np.triu(df))
In [16]: df2
Out[16]:
0 1 2 3
0 True False False True
1 False True True False
2 False False True False
3 False False False True
Then I stack the dataframe, give the index levels the desired names, and select only the rows where we have 'True' values:
In [17]: result = df2.stack()
In [18]: result
Out[18]:
0 0 True
1 False
2 False
3 True
1 0 False
1 True
2 True
3 False
2 0 False
1 False
2 True
3 False
3 0 False
1 False
2 False
3 True
dtype: bool
In [21]: result.index.names = ['POI_id', 'matched_POI_ids']
In [23]: result[result].reset_index()
Out[23]:
POI_id matched_POI_ids 0
0 0 0 True
1 0 3 True
2 1 1 True
3 1 2 True
4 2 2 True
5 3 3 True
You can then of course delete the column with trues: .drop(0, axis=1)
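Note that np.triu keeps the diagonal by default, so every point is also reported as matching itself. As a hedged sketch of a more compact variant of the steps above, the function below skips the diagonal with k=1 and reads the pair positions straight from np.nonzero. The function name matched_pairs and its include_self flag are made up for this illustration, and the result holds positional indices, which coincide with the labels only for a default integer index.

```python
import numpy as np
import pandas as pd


def matched_pairs(matrix, include_self=False):
    """Return one row per True cell in the upper triangle of a square boolean matrix.

    matched_pairs and include_self are illustrative names, not part of pandas or geopandas.
    """
    arr = np.asarray(matrix, dtype=bool)
    # k=0 keeps the diagonal (self-matches); k=1 starts one step above it.
    k = 0 if include_self else 1
    # np.nonzero returns the row and column positions of the True cells.
    rows, cols = np.nonzero(np.triu(arr, k=k))
    return pd.DataFrame({"POI_id": rows, "matched_POI_ids": cols})


df = pd.DataFrame([[True, False, False, True],
                   [False, True, True, False],
                   [False, True, True, False],
                   [True, False, False, True]])

pairs = matched_pairs(df)
print(pairs)
```

For the example dataframe this leaves only the pairs (0, 3) and (1, 2); passing include_self=True gives the same six rows as the stacked result above. For a matrix built from millions of points a dense boolean array may not fit in memory at all, so this is only meant to show the filtering step.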
I have the following character array
head(rest, n=20)
[,1] [,2] [,3]
[1,] "" "" ""
[2,] "" "" ""
[3,] "B" "-1" "-tv"
[4,] "" "" ""
[5,] "" "" ""
[6,] "A" "" ""
[7,] "" "" ""
...
[2893,] "" "" ""
[2894,] "" "" ""
[2895,] "" "" ""
[2896,] "st" "" ""
[2897,] "2" "-th" ""
[2898,] "1" "" ""
I would like to extract all the capital letters, all the numbers, and all the lowercase letters while keeping the index values.
I can find the positions of all the capital letters with this
grep("[A-Z]", rest, perl=TRUE)
and the values with
grep("[A-Z]", rest, perl=TRUE, value=TRUE)
But I can't figure out how to return the value while keeping the index.
I think this might be what you're looking for (using your example data):
rest <- matrix(c('','','','','','','B','-1','-tv','','','','','','','A','','','','','','','','','','','','','','','st','','','2','-th','','1','',''),13,byrow=T);
pat <- c('[A-Z]','[0-9]','[a-z]');
name <- c('house','floor','side');
res <- setNames(as.data.frame(lapply(pat,function(x) { i <- grep(x,rest); x <- rep('',nrow(rest)); x[(i-1)%%nrow(rest)+1] <- rest[i]; x; }),stringsAsFactors=F),name);
res;
## house floor side
## 1
## 2
## 3 B -1 -tv
## 4
## 5
## 6 A
## 7
## 8
## 9
## 10
## 11 st
## 12 2 -th
## 13 1
Actually that's not a great demo because of a dearth of populated cells, here's some randomized data for another demo:
set.seed(9);
R <- 12;
C <- 3;
N <- 5;
rest <- matrix(sample(c(rstr(N,charset=letters,lmin=1,lmax=3),rstr(N,charset=LETTERS,lmin=1,lmax=3),rstr(N,charset=0:9,lmin=1,lmax=3),rep('',R*C-N*3))),R);
rest;
## [,1] [,2] [,3]
## [1,] "AN" "" ""
## [2,] "895" "" ""
## [3,] "698" "" ""
## [4,] "zd" "" "32"
## [5,] "" "" ""
## [6,] "CK" "" ""
## [7,] "" "" ""
## [8,] "JWZ" "" "r"
## [9,] "1" "j" "IX"
## [10,] "" "" "ZFM"
## [11,] "k" "d" ""
## [12,] "" "" "252"
pat <- c('[A-Z]','[0-9]','[a-z]');
name <- c('house','floor','side');
res <- setNames(as.data.frame(lapply(pat,function(x) { i <- grep(x,rest); x <- rep('',R); x[(i-1)%%R+1] <- rest[i]; x; }),stringsAsFactors=F),name);
res;
## house floor side
## 1 AN
## 2 895
## 3 698
## 4 32 zd
## 5
## 6 CK
## 7
## 8 JWZ r
## 9 IX 1 j
## 10 ZFM
## 11 d
## 12 252
Note that I used a little function I wrote called rstr() to produce the random string values. It's not relevant to this question so I haven't posted it, but if you want I can provide it as well in this answer.
By chance in row 11 there's a collision between two side values. You specified in the comments that this can't happen in your actual data, but you can see from the output that the code handles that case gracefully; it ends up keeping the rightmost value that was in the row.
The new requirement of moving single-letter lowercase strings from the third column to the first, concatenating with any existing value in the first column, can be satisfied thusly (continuing with my second demo):
res$house <- ifelse(nchar(res$side)==1,paste0(res$house,res$side),res$house);
res$side <- ifelse(nchar(res$side)==1,'',res$side);
res;
## house floor side
## 1 AN
## 2 895
## 3 698
## 4 32 zd
## 5
## 6 CK
## 7
## 8 JWZr
## 9 IXj 1
## 10 ZFM
## 11 d
## 12 252
I want to use regex to capture substrings. I already have a working solution, but I wonder if there is a faster one. I am applying applyCaptureRegex to a vector with about 400,000 entries.
exampleData <- as.data.frame(c("[hg19:21:34809787-34809808:+]","[hg19:11:105851118-105851139:+]","[hg19:17:7482245-7482266:+]","[hg19:6:19839915-19839936:+]"))
captureRegex <- function(captRegEx,str){
sapply(regmatches(str,gregexpr(captRegEx,str))[[1]], function(m) regmatches(m,regexec(captRegEx,m)))
}
applyCaptureRegex <- function(mir,r){
mir <- unlist(apply(mir, 1, function(x) captureRegex(r,x[1])))
mir <- matrix(mir ,ncol=5, byrow = TRUE)
mir
}
Usage and results:
> captureRegex("\\[[a-z0-9]+:([0-9]+):([0-9]+)-([0-9]+):([-+])\\]","[hg19:12:125627828-125627847:-]")
$`[hg19:12:125627828-125627847:-]`
[1] "[hg19:12:125627828-125627847:-]" "12" "125627828" "125627847" "-"
> applyCaptureRegex(exampleData,"\\[[a-z0-9]+:([0-9]+):([0-9]+)-([0-9]+):([-+])\\]")
[,1] [,2] [,3] [,4] [,5]
[1,] "[hg19:21:34809787-34809808:+]" "21" "34809787" "34809808" "+"
[2,] "[hg19:11:105851118-105851139:+]" "11" "105851118" "105851139" "+"
[3,] "[hg19:17:7482245-7482266:+]" "17" "7482245" "7482266" "+"
[4,] "[hg19:6:19839915-19839936:+]" "6" "19839915" "19839936" "+"
Thank you!
Why reinvent the wheel? You have several library packages to choose from with functions that return a character matrix with one column for each capturing group in your pattern.
stri_match_all_regex — stringi
library(stringi)
x <- c('[hg19:21:34809787-34809808:+]', '[hg19:11:105851118-105851139:+]', '[hg19:17:7482245-7482266:+]', '[hg19:6:19839915-19839936:+]')
do.call(rbind, stri_match_all_regex(x, '\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])]'))
# [,1] [,2] [,3] [,4] [,5]
# [1,] "[hg19:21:34809787-34809808:+]" "21" "34809787" "34809808" "+"
# [2,] "[hg19:11:105851118-105851139:+]" "11" "105851118" "105851139" "+"
# [3,] "[hg19:17:7482245-7482266:+]" "17" "7482245" "7482266" "+"
# [4,] "[hg19:6:19839915-19839936:+]" "6" "19839915" "19839936" "+"
str_match — stringr
library(stringr)
str_match(x, '\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])]')
strapplyc — gsubfn
library(gsubfn)
strapplyc(x, "(\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])])", simplify = rbind)
Below is a benchmark comparison of all combined solutions.
x <- rep(c('[hg19:21:34809787-34809808:+]',
'[hg19:11:105851118-105851139:+]',
'[hg19:17:7482245-7482266:+]',
'[hg19:6:19839915-19839936:+]'), 1000)
applyCaptureRegex <- function(mir, r) {
do.call(rbind, lapply(mir, function(x) regmatches(x, regexec(r, x))[[1]]))
}
gsubfn <- function(x1) strapplyc(x1, '(\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])])', simplify = rbind)
regmtch <- function(x1) applyCaptureRegex(x1, '\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])]')
stringr <- function(x1) str_match(x1, '\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])]')
stringi <- function(x1) do.call(rbind, stri_match_all_regex(x1, '\\[[^:]+:(\\d+):(\\d+)-(\\d+):([-+])]'))
require(microbenchmark)
microbenchmark(gsubfn(x), regmtch(x), stringr(x), stringi(x))
Result
Unit: milliseconds
expr min lq mean median uq max neval
gsubfn(x) 372.27072 382.82179 391.21837 388.32396 396.27361 449.03091 100
regmtch(x) 394.03164 409.87523 419.42936 417.76770 427.08208 456.92460 100
stringr(x) 65.81644 70.28327 76.02298 75.43162 78.92567 116.18026 100
stringi(x) 15.88171 16.53047 17.52434 16.96127 17.76007 23.94449 100
In R, is it possible to extract group capture from a regular expression match? As far as I can tell, none of grep, grepl, regexpr, gregexpr, sub, or gsub return the group captures.
I need to extract key-value pairs from strings that are encoded thus:
\((.*?) :: (0\.[0-9]+)\)
I can always just do multiple full-match greps, or do some outside (non-R) processing, but I was hoping I could do it all within R. Is there a function or a package that does this?
str_match(), from the stringr package, will do this. It returns a character matrix with one column for each group in the match (and one for the whole match):
> s = c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
> str_match(s, "\\((.*?) :: (0\\.[0-9]+)\\)")
[,1] [,2] [,3]
[1,] "(sometext :: 0.1231313213)" "sometext" "0.1231313213"
[2,] "(moretext :: 0.111222)" "moretext" "0.111222"
gsub does this, from your example:
gsub("\\((.*?) :: (0\\.[0-9]+)\\)","\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"
You need to double each backslash inside the quoted string (write \\ for \) so that it reaches the regex engine intact.
Hope this helps.
Try regmatches() and regexec():
regmatches("(sometext :: 0.1231313213)",regexec("\\((.*?) :: (0\\.[0-9]+)\\)","(sometext :: 0.1231313213)"))
[[1]]
[1] "(sometext :: 0.1231313213)" "sometext" "0.1231313213"
gsub() can do this and return only the capture group.
However, for this to work, your pattern must also match the parts of the string that lie outside your capture group, as mentioned in the gsub() help:
(...) elements of character vectors 'x' which are not substituted will be returned unchanged.
So if the text you want lies in the middle of some string, adding .* before and after the capture group lets you return only that text.
gsub(".*\\((.*?) :: (0\\.[0-9]+)\\).*","\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"
Solution with strcapture from the utils package:
x <- c("key1 :: 0.01",
"key2 :: 0.02")
strcapture(pattern = "(.*) :: (0\\.[0-9]+)",
x = x,
proto = list(key = character(), value = double()))
#> key value
#> 1 key1 0.01
#> 2 key2 0.02
This is how I ended up working around this problem. I used two separate regexes to match the first and second capture groups, ran two gregexpr calls, and then pulled out the matched substrings (here str holds the input string):
regex.string <- "(?<=\\().*?(?= :: )"
regex.number <- "(?<= :: )\\d\\.\\d+"
match.string <- gregexpr(regex.string, str, perl=T)[[1]]
match.number <- gregexpr(regex.number, str, perl=T)[[1]]
strings <- mapply(function (start, len) substr(str, start, start+len-1),
match.string,
attr(match.string, "match.length"))
numbers <- mapply(function (start, len) as.numeric(substr(str, start, start+len-1)),
match.number,
attr(match.number, "match.length"))
I like Perl-compatible regular expressions. Probably someone else does too...
Here is a function that uses Perl-compatible regular expressions and mirrors the behavior of the functions I am used to from other languages:
regexpr_perl <- function(expr, str) {
  match <- regexpr(expr, str, perl = TRUE)
  matches <- character(0)
  # match.length is -1 when there is no match
  if (attr(match, 'match.length') >= 0) {
    # one entry per capture group; length 0 if expr has no groups
    capture_start <- attr(match, 'capture.start')
    capture_length <- attr(match, 'capture.length')
    total_matches <- 1 + length(capture_start)
    matches <- character(total_matches)
    # element 1 is the full match, followed by one element per group
    matches[1] <- substr(str, match, match + attr(match, 'match.length') - 1)
    # seq_along also covers the single-group case, which "> 1" skipped
    for (i in seq_along(capture_start)) {
      matches[i + 1] <- substr(str, capture_start[[i]], capture_start[[i]] + capture_length[[i]] - 1)
    }
  }
  matches
}
As suggested in the stringr package, this can be achieved using either str_match() or str_extract().
Adapted from the manual:
library(stringr)
strings <- c(" 219 733 8965", "329-293-8753 ", "banana",
"239 923 8115 and 842 566 4692",
"Work: 579-499-7527", "$1000",
"Home: 543.355.3679")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
Extracting and combining our groups:
str_extract_all(strings, phone, simplify=T)
# [,1] [,2]
# [1,] "219 733 8965" ""
# [2,] "329-293-8753" ""
# [3,] "" ""
# [4,] "239 923 8115" "842 566 4692"
# [5,] "579-499-7527" ""
# [6,] "" ""
# [7,] "543.355.3679" ""
Indicating groups with an output matrix (we're interested in columns 2+):
str_match_all(strings, phone)
# [[1]]
# [,1] [,2] [,3] [,4]
# [1,] "219 733 8965" "219" "733" "8965"
#
# [[2]]
# [,1] [,2] [,3] [,4]
# [1,] "329-293-8753" "329" "293" "8753"
#
# [[3]]
# [,1] [,2] [,3] [,4]
#
# [[4]]
# [,1] [,2] [,3] [,4]
# [1,] "239 923 8115" "239" "923" "8115"
# [2,] "842 566 4692" "842" "566" "4692"
#
# [[5]]
# [,1] [,2] [,3] [,4]
# [1,] "579-499-7527" "579" "499" "7527"
#
# [[6]]
# [,1] [,2] [,3] [,4]
#
# [[7]]
# [,1] [,2] [,3] [,4]
# [1,] "543.355.3679" "543" "355" "3679"
This can be done using the package unglue, taking the example from the selected answer:
# install.packages("unglue")
library(unglue)
s <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
unglue_data(s, "({x} :: {y})")
#> x y
#> 1 sometext 0.1231313213
#> 2 moretext 0.111222
Or starting from a data frame
df <- data.frame(col = s)
unglue_unnest(df, col, "({x} :: {y})",remove = FALSE)
#> col x y
#> 1 (sometext :: 0.1231313213) sometext 0.1231313213
#> 2 (moretext :: 0.111222) moretext 0.111222
You can get the raw regex from the unglue pattern, optionally with named capture groups:
unglue_regex("({x} :: {y})")
#> ({x} :: {y})
#> "^\\((.*?) :: (.*?)\\)$"
unglue_regex("({x} :: {y})",named_capture = TRUE)
#> ({x} :: {y})
#> "^\\((?<x>.*?) :: (?<y>.*?)\\)$"
More info : https://github.com/moodymudskipper/unglue/blob/master/README.md