R: how to differentiate between inner and innermost brackets using regex

What I need from the string ((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS)))) is this:
"JJ", "RBJJ", "DTJJNNPNNPS", "JJCCRBJJ", "INDTJJNNPNNPS", "VBDJJCCRBJJINDTJJNNPNNPS"
that is, to find the text between the innermost brackets and delete the immediately surrounding brackets so that the text can be combined and extracted. But this involves several levels of nesting. The brackets can't all be stripped at once, because the number of brackets goes out of balance:
str1<-c()
str2<-c()
library(gsubfn)
strr<-c("((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS))))")
repeat {
str1<-unlist(strapply(strr, "((\\(([A-Z])+\\))+)"))
str2<-append(str1, str2)
strr<-gsub("(\\(\\w+\\))", "~\\1~", strr)
strr<-gsub("~\\(|\\)~", "", strr)
if (strr == "") {break}
}
strr
[1] "(VBD(JJCCRBJJINDTJJNNPNNPS"
Leftover brackets block the text from being combined, so it escapes the regex. I think the solution is to differentiate between the innermost brackets (JJ, RB, JJ, DT, JJ, NNP, NNPS; groups 2, 4, 5, 7, 8, 9 and 10 in the original string) and the inner brackets, so that when the innermost brackets are stripped step by step and the text is combined and extracted, we eventually reach the whole string. Is there a regular expression to do this? Or is there any other way? Please help.

This doesn't use regexp. In fact, I'm not sure that regexps are powerful enough to solve the problem; I suspect a parser is necessary. Rather than create/define a parser in R, I leverage the existing R code parser. Doing so uses some potentially dangerous tricks.
The basic idea is to turn the string into parsable code which generates a tree structure using lists. Then this structure is effectively reverse pruned (keeping only the leaf node inward), and the various strings at each level are created.
Some helper packages
library("plotrix")
library("plyr")
The original string that you gave
strr<-c("((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS))))")
Turn this string into parsable code, quoting what is inside the parentheses, and then making each set of parentheses a call to list. Commas have to be inserted between list items, but the innermost parts are always lists of length 1, so that isn't a problem. Then parse the code.
tmp <- gsub("\\(([^\\(\\)]*)\\)", '("\\1")', strr)
tmp <- gsub("\\(", "list(", tmp)
tmp <- gsub("\\)list", "),list", tmp)
tmp <- eval(parse(text=tmp))
At this point, tmp looks like
> str(tmp)
List of 3
$ :List of 1
..$ : chr "VBD"
$ :List of 3
..$ :List of 1
.. ..$ :List of 1
.. .. ..$ : chr "JJ"
..$ :List of 1
.. ..$ : chr "CC"
..$ :List of 2
.. ..$ :List of 1
.. .. ..$ : chr "RB"
.. ..$ :List of 1
.. .. ..$ : chr "JJ"
$ :List of 2
..$ :List of 1
.. ..$ : chr "IN"
..$ :List of 4
.. ..$ :List of 1
.. .. ..$ : chr "DT"
.. ..$ :List of 1
.. .. ..$ : chr "JJ"
.. ..$ :List of 1
.. .. ..$ : chr "NNP"
.. ..$ :List of 1
.. .. ..$ : chr "NNPS"
The nesting of parentheses is now nesting of lists. A few more helper functions are needed. The first collapses everything below a certain depth and throws away any node above that depth. The second is just a wrapper for paste to work on the elements of a list collectively.
atdepth <- function(l, d) {
if (d > 0 && !is.list(l)) {
return(NULL)
}
if (d == 0) {
return(unlist(l))
}
if (is.list(l)) {
llply(l, atdepth, d-1)
}
}
pastelist <- function(l) {paste(unlist(l), collapse="", sep="")}
Create a list where each element is the tree structure collapsed to a particular depth.
down <- llply(1:listDepth(tmp), atdepth, l=tmp)
Iterating backwards over this list, paste the leaf sets together. Work backwards "up" the (collapsed) trees. Doing this produces some blank strings (where there was a leaf higher up), so these are trimmed out.
out <- if (length(down) > 2) {
c(unlist(llply(length(down):3, function(i) {
unlist(do.call(llply, c(list(down[[i]]), replicate(i-3, llply), pastelist)))
})), unlist(pastelist(down[[2]])))
} else {
unlist(pastelist(down[[2]]))
}
out <- out[out != ""]
The result is what I think you asked for:
> out
[1] "JJ" "RBJJ"
[3] "DTJJNNPNNPS" "JJCCRBJJ"
[5] "INDTJJNNPNNPS" "VBDJJCCRBJJINDTJJNNPNNPS"
> dput(out)
c("JJ", "RBJJ", "DTJJNNPNNPS", "JJCCRBJJ", "INDTJJNNPNNPS", "VBDJJCCRBJJINDTJJNNPNNPS"
)
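If evaluating generated code is a concern, a small hand-written parser can collect the same strings without eval(parse()). The following is only a sketch and not part of the approach above: nested_groups is a hypothetical helper, it assumes the input consists only of brackets and tag text, and it returns the groups in the order their closing bracket is reached rather than deepest first.

```r
# Sketch: collect the concatenated text of every bracket group that
# itself contains other groups, using an explicit stack (base R only).
nested_groups <- function(s) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  text  <- character(0)  # stack: text accumulated for each open group
  nest  <- logical(0)    # stack: does the open group contain another group?
  out   <- character(0)
  for (ch in chars) {
    if (ch == "(") {
      # the enclosing group (if any) now has a nested child
      if (length(nest) > 0) nest[length(nest)] <- TRUE
      text <- c(text, "")
      nest <- c(nest, FALSE)
    } else if (ch == ")") {
      n <- length(text)
      if (n == 0) stop("unbalanced brackets")
      done <- text[n]
      if (nest[n]) out <- c(out, done)   # leaves such as (VBD) are skipped
      text <- text[-n]
      nest <- nest[-n]
      # hand the finished text up to the enclosing group
      if (n > 1) text[n - 1] <- paste0(text[n - 1], done)
    } else {
      n <- length(text)
      if (n == 0) stop("text outside brackets")
      text[n] <- paste0(text[n], ch)
    }
  }
  if (length(text) > 0) stop("unbalanced brackets")
  out
}

strr <- "((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS))))"
res <- nested_groups(strr)
res
# [1] "JJ" "RBJJ" "JJCCRBJJ" "DTJJNNPNNPS" "INDTJJNNPNNPS"
# [6] "VBDJJCCRBJJINDTJJNNPNNPS"
```

The set of strings is the same as in out; only the ordering differs.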
EDIT:
In response to a comment with a subsequent question: How to adapt this to process over a set of these strings.
The general approach to solving the do-it-multiple-times-for-different-inputs is to create a function which takes a single item as input and returns the associated single output. Then loop over the function with one of the apply family of functions.
Pulling together all the code from earlier into a single function:
parsestrr <- function(strr) {
atdepth <- function(l, d) {
if (d > 0 && !is.list(l)) {
return(NULL)
}
if (d == 0) {
return(unlist(l))
}
if (is.list(l)) {
llply(l, atdepth, d-1)
}
}
pastelist <- function(l) {paste(unlist(l), collapse="", sep="")}
tmp <- gsub("\\(([^\\(\\)]*)\\)", '("\\1")', strr)
tmp <- gsub("\\(", "list(", tmp)
tmp <- gsub("\\)list", "),list", tmp)
tmp <- eval(parse(text=tmp))
down <- llply(1:listDepth(tmp), atdepth, l=tmp)
out <- if (length(down) > 2) {
c(unlist(llply(length(down):3, function(i) {
unlist(do.call(llply, c(list(down[[i]]), replicate(i-3, llply), pastelist)))
})), unlist(pastelist(down[[2]])))
} else {
unlist(pastelist(down[[2]]))
}
out[out != ""]
}
Now given a vector of strings to process, say:
strrs<-c("((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS))))",
"((VBD)(((JJ))(CC)((RB)(XX)(JJ)))((IN)(BB)((DT)(JJ)(NNP)(NNPS))))",
"((VBD)(((JJ)(QQ))(CC)((RB)(JJ)))((IN)((TQR)(JJ)(NNPS))))")
You can process all of them with
llply(strrs, parsestrr)
which returns
[[1]]
[1] "JJ" "RBJJ"
[3] "DTJJNNPNNPS" "JJCCRBJJ"
[5] "INDTJJNNPNNPS" "VBDJJCCRBJJINDTJJNNPNNPS"
[[2]]
[1] "JJ" "RBXXJJ"
[3] "DTJJNNPNNPS" "JJCCRBXXJJ"
[5] "INBBDTJJNNPNNPS" "VBDJJCCRBXXJJINBBDTJJNNPNNPS"
[[3]]
[1] "JJQQ" "RBJJ"
[3] "TQRJJNNPS" "JJQQCCRBJJ"
[5] "INTQRJJNNPS" "VBDJJQQCCRBJJINTQRJJNNPS"
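Since parsestrr evaluates code built from its input, it may be worth checking each string before passing it in. The following is a minimal sketch under the assumption that valid input contains only upper-case tags and balanced parentheses; is_safe_tag_string is a hypothetical helper, not part of plyr or plotrix.

```r
# Sketch: accept only strings made of A-Z and balanced parentheses,
# so nothing unexpected reaches eval(parse()).
is_safe_tag_string <- function(s) {
  if (!grepl("^[A-Z()]+$", s)) return(FALSE)
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  # running nesting depth: +1 for "(", -1 for ")"
  depth <- cumsum((chars == "(") - (chars == ")"))
  all(depth >= 0) && depth[length(depth)] == 0
}

ok  <- is_safe_tag_string("((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS))))")
bad <- is_safe_tag_string("((VBD)(system('ls')))")
c(ok, bad)
# [1]  TRUE FALSE
```

Strings that fail the check could be skipped or reported instead of being parsed.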

I'm not sure whether you just want to build a tree structure of balanced text.
Nor why you want to strip the containing parentheses at the innermost level.
Using your example, if it is to be done in stages, the innermost level has to be determined first. Then parentheses are stripped off at subsequent levels in recursive passes.
This of course requires a way to match balanced text. Some regex engines can do this.
If the engine you are using doesn't support this, it would have to be done manually via text processing.
I happen to have a regex analysis program. I pumped your initial string into it and it visually formatted it by group level. On each pass, I just stripped the innermost parentheses, which simulates a recursion.
Maybe this can help you visualize what needs to be done.
## Pass 0
## ---------
(
( VBD )
(
(
( JJ )
)
( CC )
(
( RB )
( JJ )
)
)
(
( IN )
(
( DT )
( JJ )
( NNP )
( NNPS )
)
)
)
## Pass 1
## ---------
(
( VBD )
(
( JJ )
( CC )
( RB JJ )
)
(
( IN )
( DT JJ NNP NNPS )
)
)
## Pass 2
## ---------
(
( VBD )
( JJ CC RB JJ )
( IN DT JJ NNP NNPS )
)
## Pass 3
## ---------
( VBD JJ CC RB JJ IN DT JJ NNP NNPS )
## Pass 4
## ---------
VBD JJ CC RB JJ IN DT JJ NNP NNPS
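The passes shown above can be approximated in base R with a loop, without balanced-text support in the regex engine. This is a sketch only: collapse_levels is a hypothetical helper, and it assumes the text inside the brackets never contains parentheses itself. On each pass it finds every group whose children are all innermost groups, records the combined text, and replaces the group by a single innermost group.

```r
# Sketch: repeatedly merge groups that contain only innermost groups.
collapse_levels <- function(s) {
  # "(" followed by one or more "(text)" groups followed by ")"
  pat <- "\\((\\([^()]*\\))+\\)"
  found <- character(0)
  repeat {
    m <- gregexpr(pat, s)
    hits <- regmatches(s, m)[[1]]
    if (length(hits) == 0) break          # nothing left to merge
    inner <- gsub("[()]", "", hits)       # combined text of each group
    found <- c(found, inner)
    regmatches(s, m) <- list(paste0("(", inner, ")"))
  }
  found
}

strr <- "((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS))))"
levels_out <- collapse_levels(strr)
levels_out
# [1] "JJ" "RBJJ" "DTJJNNPNNPS" "JJCCRBJJ" "INDTJJNNPNNPS"
# [6] "VBDJJCCRBJJINDTJJNNPNNPS"
```

The loop stops once the string has been reduced to a single innermost group, which corresponds to Pass 3 in the listing.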

You don't really need to think of matching brackets here... Sounds like you just want to recursively match the pattern [()]([^()]*)[()].
That is, "match something containing no ( ) and delimited by ( or )"

Related

R regex find ranges in strings

I have a bunch of email subject lines and I'm trying to extract whether a range of values are present. This is how I'm trying to do it but am not getting the results I'd like:
library(stringi)
df1 <- data.frame(id = 1:5, string1 = NA)
df1$string1 <- c('15% off','25% off','35% off','45% off','55% off')
df1$pctOff10_20 <- stri_match_all_regex(df1$string1, '[10-20]%')
id string1 pctOff10_20
1 1 15% off NA
2 2 25% off NA
3 3 35% off NA
4 4 45% off NA
5 5 55% off NA
I'd like something like this:
id string1 pctOff10_20
1 1 15% off 1
2 2 25% off 0
3 3 35% off 0
4 4 45% off 0
5 5 55% off 0
Here is the way to go,
df1$pctOff10_20 <- stri_count_regex(df1$string1, '^(1\\d|20)%')
Explanation:
^ the beginning of the string
( group and capture to \1:
1 '1'
\d digits (0-9)
| OR
20 '20'
) end of \1
% '%'
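For reference, the original pattern fails because [10-20] is a character class, not a numeric range: it matches a single character that is '1', anything from '0' to '2', or '0'. A minimal base R sketch of the difference (grepl is used here instead of stringi, which is my substitution, not the answer's code):

```r
x <- c("15% off", "25% off", "35% off", "45% off", "55% off")

# Character class: one of the characters 0, 1 or 2 directly before "%"
class_hit <- grepl("[10-20]%", x)
class_hit
# [1] FALSE FALSE FALSE FALSE FALSE

# Alternation: "1" followed by any digit, or "20", at the start of the string
range_hit <- as.integer(grepl("^(1[0-9]|20)%", x))
range_hit
# [1] 1 0 0 0 0
```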
1) strapply in gsubfn can do that by combining a regex (pattern= argument) and a function (FUN= argument). Below we use the formula representation of the function. Alternatively, we could make use of between from data.table (or a number of other packages). This extracts the matches to the pattern, applies the function to them and returns the result, simplifying it into a vector (rather than a list):
library(gsubfn)
btwn <- function(x, a, b) as.numeric(a <= as.numeric(x) & as.numeric(x) <= b)
transform(df1, pctOff10_20 =
strapply(
X = string1,
pattern = "\\d+",
FUN = ~ btwn(x, 10, 20),
simplify = TRUE
)
)
2) A base solution using the same btwn function defined above is:
transform(df1, pctOff10_20 = btwn(gsub("\\D", "", string1), 10, 20))
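Note that gsub("\\D", "", ...) glues together every digit in the string, so a subject line with a second number in it would give a wrong value. A hedged variant that only looks at the digits directly in front of the % sign is sketched below; pct_between is a hypothetical helper name and the sample subject lines are made up for illustration.

```r
# Sketch: 1 if the first "<digits>%" in the string lies in [lo, hi], else 0.
pct_between <- function(x, lo, hi) {
  m <- regexpr("[0-9]+(?=%)", x, perl = TRUE)   # digits followed by "%"
  hit <- as.vector(m) > 0                       # -1 means no match
  pct <- rep(NA_real_, length(x))
  pct[hit] <- as.numeric(regmatches(x, m))      # regmatches drops non-matches
  as.integer(!is.na(pct) & pct >= lo & pct <= hi)
}

subjects <- c("15% off", "save 20% on 3 items", "25% off", "no discount")
flags <- pct_between(subjects, 10, 20)
flags
# [1] 1 1 0 0
```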

Detecting number repetition in R using regex

Shouldn't this code work for repeating number detection in R?
> grep(pattern = "\\d{2}", x = 1223)
[1] 1
> grep(pattern = "\\d{3}", x = 1223)
[1] 1
If we have 988 we should get true and if 123 we should get false.
Sounds like it isn't.
> grep(pattern = "\\d{2}", x = "1223")
[1] 1
> grep(pattern = "\\d{2}", x = "13")
[1] 1
You need to use backreferences:
> grep(pattern = "(\\d)\\1", x = "1224")
[1] 1
> grep(pattern = "(\\d)\\1{1,}", x = "1224")
[1] 1
> grep(pattern = "(\\d)\\1", x = "1234")
integer(0)
EDIT: Seems like you need to figure out how it works: (\\d) creates a capture group for the \\d, which can be referred to using a backreference \\1. For example, if you have numbers like x2y and you want to find those where x is the same as y, then:
> grep(pattern = "(\\d)2\\1", x = "121")
[1] 1
> grep(pattern = "(\\d)2\\1", x = "124")
integer(0)
I'd strongly recommend that you read a basic tutorial on regular expressions.
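To get the TRUE/FALSE result the question asks for, grepl can be used with the same backreference pattern. A small sketch (has_repeat is just an illustrative name): "\\d{2}" matches any two consecutive digits, whereas "(\\d)\\1" requires the second digit to equal the first.

```r
# Sketch: TRUE when a digit is immediately followed by the same digit.
has_repeat <- function(x) grepl("(\\d)\\1", as.character(x))

repeats <- has_repeat(c(988, 123, 1223))
repeats
# [1]  TRUE FALSE  TRUE

# For comparison, any two digits in a row match "\\d{2}"
any_two <- grepl("\\d{2}", c("988", "123", "1223"))
any_two
# [1] TRUE TRUE TRUE
```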
I know the question explicitly says "using regex" in the title, but here is a non-regex method that could work, depending on what you want to do.
strings <- c("1223","1233","1234","113")
# detect consecutive repeat digits, or characters
(strings.rle <- lapply(strings, function(x)rle(unlist(strsplit(x,"")))))
[[1]]
Run Length Encoding
lengths: int [1:3] 1 2 1
values : chr [1:3] "1" "2" "3"
[[2]]
Run Length Encoding
lengths: int [1:3] 1 1 2
values : chr [1:3] "1" "2" "3"
[[3]]
Run Length Encoding
lengths: int [1:4] 1 1 1 1
values : chr [1:4] "1" "2" "3" "4"
[[4]]
Run Length Encoding
lengths: int [1:2] 2 1
values : chr [1:2] "1" "3"
Now you can work with strings.rle to do what you want
# which entries have consecutive repeat digits, or characters
strings[sapply(strings.rle, function(x) any(x$lengths > 1))]
[1] "1223" "1233" "113"
or
# which digits or characters are consecutively repeated?
lapply(strings.rle, function(x) x$values[which(x$lengths > 1)])
[[1]]
[1] "2"
[[2]]
[1] "3"
[[3]]
character(0)
[[4]]
[1] "1"

How to properly manipulate a string column in a data frame in R?

I have a data.frame with a string column that contains periods e.g "a.b.c.X". I want to split out the string by periods and retain the third segment e.g. "c" in the example given. Here is what I'm doing.
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> df
v b
1 a.b.a.X 1
2 a.b.b.X 2
3 a.b.c.X 3
And what I want is
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> df
v b
1 a 1
2 b 2
3 c 3
I'm attempting to use within, but I'm getting strange results. The value in the first row in the first column is being repeated.
> get = function(x) { unlist(strsplit(x, "\\."))[3] }
> within(df, v <- get(as.character(v)))
v b
1 a 1
2 a 2
3 a 3
What is the best practice for doing this? What am I doing wrong?
Update:
Here is the solution I used from @agstudy's answer:
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> get = function(x) gsub(".*?[.].*?[.](.*?)[.].*", '\\1', x)
> within(df, v <- get(v))
v b
1 a 1
2 b 2
3 c 3
Using some regular expression you can do :
gsub(".*?[.].*?[.](.*?)[.].*", '\\1', df$v)
[1] "a" "b" "c"
Or more concise:
gsub("(.*?[.]){2}(.*?)[.].*", '\\2', df$v)
The problem is not with within but with your get function. It returns a single character ("a") which gets recycled when added to your data.frame. Your code should look like this:
get.third <- function(x) sapply(strsplit(x, "\\."), `[[`, 3)
within(df, v <- get.third(as.character(v)))
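The difference between the two versions can be seen on a plain character vector. The following sketch uses vapply rather than sapply, which is my choice and not the answer's, so that the result is guaranteed to be a character vector with one element per input string.

```r
v <- c("a.b.a.X", "a.b.b.X", "a.b.c.X")

# unlist() flattens all pieces into one vector, so [3] picks one value
# that is then recycled down the column
flat_third <- unlist(strsplit(v, "\\."))[3]
flat_third
# [1] "a"

# Taking the third piece of each split string gives one value per row
third <- vapply(strsplit(v, ".", fixed = TRUE), `[`, character(1), 3)
third
# [1] "a" "b" "c"
```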
Here is one possible solution:
df[, "v"] <- do.call(rbind, strsplit(as.character(df[, "v"]), "\\."))[, 3]
## > df
## v b
## 1 a 1
## 2 b 2
## 3 c 3
The answer to "what am I doing wrong" is that the bit of code that you thought was extracting the third element of each split string was actually putting all the elements of all your strings in a single vector, and then returning the third element of that:
get = function(x) {
splits = strsplit(x, "\\.")
print("All the elements: ")
print(unlist(splits))
print("The third element:")
print(unlist(splits)[3])
# What you actually wanted:
third_chars = sapply(splits, function (x) x[3])
}
within(df, v2 <- get(as.character(v)))

How to expand a list with NULLs up to some length?

Given a list whose length <= N, what is the best / most efficient way to fill it up with trailing NULLs so that it has length N?
This is something which is a one-liner in any decent language, but I don't have a clue how to do it (efficiently) in a few lines in R so that it works for every corner case (zero length list etc.).
Let's keep it really simple:
tst<-1:10 #whatever, to get a vector of length 10
tst<-tst[1:15]
Try this :
> l = list("a",1:3)
> N = 5
> l[N+1]=NULL
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
>
How about this ?
> l = list("a",1:3)
> length(l)=5
> l
[[1]]
[1] "a"
[[2]]
[1] 1 2 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
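Wrapped in a function, the length<- approach also covers the zero-length corner case from the question. A minimal sketch, where pad_list is a hypothetical helper name and the check against shortening the list is my own addition:

```r
# Sketch: pad a list with trailing NULLs up to length n.
pad_list <- function(l, n) {
  stopifnot(is.list(l), length(l) <= n)  # refuse to truncate
  length(l) <- n                         # new elements are NULL
  l
}

padded <- pad_list(list("a", 1:3), 5)
length(padded)
# [1] 5
empty_padded <- pad_list(list(), 3)
length(empty_padded)
# [1] 3
```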
Directly editing the list's length appears to be the fastest as far as I can tell:
tmp <- vector("list",5000)
sol1 <- function(x){
x <- x[1:10000]
}
sol2 <- function(x){
x[10001] <- NULL
}
sol3 <- function(x){
length(x) <- 10000
}
library(rbenchmark)
benchmark(sol1(tmp),sol2(tmp),sol3(tmp),replications = 5000)
test replications elapsed relative user.self sys.self user.child sys.child
1 sol1(tmp) 5000 2.045 1.394952 1.327 0.727 0 0
2 sol2(tmp) 5000 2.849 1.943383 1.804 1.075 0 0
3 sol3(tmp) 5000 1.466 1.000000 0.937 0.548 0 0
But the differences aren't huge, unless you're doing this a lot on very long lists, I suppose.
I'm sure there are shorter ways, but I would be inclined to do:
l <- as.list(1:10)
N <- 15
l <- c(l, as.list(rep(NA, N - length(l) )))
Hi: I'm not sure if you were talking about an actual list but, if you were, the code below will work. It works because, once you assign to an element of a vector (which is what a list is) that is not there, R expands the vector.
length <- 10
temp <- list("a","b")
print(temp)
temp[length] <- NULL
print(temp)

Accessing R list elements through function parameters

I have an R list which looks as follows
> str(prices)
List of 4
$ ID : int 102894616
$ delay: int 8
$ 47973 :List of 12
..$ id : int 47973
..$ index : int 2
..$ matched: num 5817
$ 47972 :List of 12
..
Clearly, I can access any element by e.g. prices$"47973"$id.
However, how would I write a function which parametrises the access to that list? For example an access function with signature:
access <- function(index1, index2) { .. }
Which can be used as follows:
> access("47973", "matched")
5817
This seems very trivial but I fail to write such a function. Thanks for any pointers.
Using '[[' instead of '$' seems to work:
prices <- list(
`47973` = list( id = 1, matched = 2))
access <- function(index1, index2) prices[[index1]][[index2]]
access("47973","matched")
As to why this works instead of:
access <- function(index1, index2) prices$index1$index2
(which I assume is what you tried?): with $, index1 and index2 are not evaluated. That is, it searches the list for an element literally called index1 instead of the value this object evaluates to.
You can take advantage of the fact that [[ accepts a vector, used recursively:
prices <- list(
`47973` = list( id = 1, matched = 2))
prices[[c("47973", "matched")]]
# 2
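If the list should be passed in rather than looked up globally, the same idea can be written as a loop over the keys. This is only a sketch: the signature with the list as first argument is my assumption, not what the question asked for, and the sample values are taken from the str() output in the question.

```r
prices <- list(ID = 102894616L, delay = 8L,
               `47973` = list(id = 47973L, index = 2L, matched = 5817))

# Sketch: walk down a nested list, one key per level.
access <- function(x, ...) {
  for (key in c(...)) {
    x <- x[[key]]   # [[ evaluates key, unlike $
  }
  x
}

hit <- access(prices, "47973", "matched")
hit
# [1] 5817
miss <- access(prices, "47973", "missing")
miss
# NULL
```

Unlike prices[[c("47973", "missing")]], which signals an error, the loop returns NULL when the last name is not present in a named list.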