Keep string up to first occurrence of pattern in R - regex

I would like to keep the string up to the first occurrence of the following pattern: lower case letter followed by upper case, followed by lower case again.
For example
"This is My testString, how to keepUntil test"
I would like to return This is My test
This is what I have tried unsuccessfully so far:
library("magrittr")
"This is My testString, how to keepUntil test" %>% gsub("(.*[a-z])[A-Z][a-z]?.*", "\\1", .)

We can use strsplit
strsplit(str1, "(?<=[a-z])(?=[A-Z])", perl = TRUE)[[1]][1]
#[1] "This is My test"
or with sub
sub("([A-Za-z ]+[a-z])[A-Z].*", "\\1", str1)
#[1] "This is My test"
data
str1 <- "This is My testString, how to keepUntil test"
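Note that both patterns above stop at the first lower-case letter followed by an upper-case one; they do not check for the third, lower-case letter the question asks for. Here is a minimal sketch that checks all three, assuming PCRE (perl = TRUE) and ASCII letters. keep_until_camel is a hypothetical helper name, not something from the answer.

```r
# Hypothetical helper: cut the string just before the first upper-case letter
# that sits between two lower-case letters (lower, UPPER, lower).
keep_until_camel <- function(x) {
  # (?<=[a-z]) : the previous character is lower case (not consumed)
  # [A-Z][a-z] : an upper-case letter followed by a lower-case one
  # .*         : everything after that, all of which is dropped
  sub("(?<=[a-z])[A-Z][a-z].*", "", x, perl = TRUE)
}

str1 <- "This is My testString, how to keepUntil test"
res1 <- keep_until_camel(str1)
res2 <- keep_until_camel("abCD eFg")       # "C" is followed by "D", so it is skipped
res3 <- keep_until_camel("no match here")  # no match: returned unchanged
print(res1)
```

Strings without a match come back unchanged, so the helper can be applied to a whole character vector.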

You can use a recursive function with regex capturing groups to always extract the first (leftmost) instance of the pattern you want, regardless of how many sections your text has.
regex <- "^(.*[a-z])[A-Z].*$"
text <- "This is My testString, how to keepUntil test"
library(stringr)
ExtractFirstPart <- function(Text, Regex) {
  firstpart <- str_match(Text, Regex)[2]
  if (is.na(firstpart)) {
    return(Text)
  } else {
    firstpart <- ExtractFirstPart(firstpart, Regex)
    return(firstpart)
  }
}
Using this function, you will get:
> ExtractFirstPart(text,regex)
[1] "This is My test"
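The recursion is needed because the greedy .* in the pattern runs as far right as it can, so a single match lands on the last lower/upper pair rather than the first. As a sketch of an alternative (not the answer's method), a lazy quantifier .*? finds the leftmost pair in one pass. text_demo is a stand-in variable name.

```r
text_demo <- "This is My testString, how to keepUntil test"

# Greedy: .* backtracks from the right, so the LAST lower/upper pair wins.
greedy <- sub("^(.*[a-z])[A-Z].*$", "\\1", text_demo, perl = TRUE)

# Lazy: .*? grows from the left, so the FIRST lower/upper pair wins.
lazy <- sub("^(.*?[a-z])[A-Z].*$", "\\1", text_demo, perl = TRUE)

print(greedy)
print(lazy)
```

The greedy version keeps everything up to "keep", while the lazy version stops at "test", which is why the recursive function has to call itself until nothing is left to strip.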

Related

Regex for extracting all words between word and character

I know the basics of using regex in R, but I have a file like this:
[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981
[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767
I want to extract the timestamp along with all the SERVICE_ID values in each line.
So, my expected output is:
[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981
[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134
The code which I tried was only extracting one SERVICE_ID.
library(qdapRegex)
a <- readLines("C:\\MY_FOLDER\\vinita\\sample.txt")
testi <- rm_between(a,"SERVICE_ID",",",extract = T)
We first replace each run of two or more commas with " " to get 'str2'. Then, using a regex lookbehind, we match one or more spaces (\\s+) that follow the ], together with all the characters after them (.*) up to the end of the string, and replace that with "" so that only the [2016-04..,03] part remains ('str3'). From 'str2', we extract every substring "SERVICE_ID=" followed by digits (\\d+) into a list, paste the elements of each list entry together, and finally paste the result onto 'str3'.
library(stringr)
str2 <- gsub(",{2,}", " ", str1)
str3 <- sub("(?<=\\])\\s+.*", "", str2, perl = TRUE)
paste(str3, sapply(str_extract_all(str2, "SERVICE_ID=\\d+"), paste, collapse=" "))
#[1] "[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981"
#[2] "[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134"
data
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")
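library(stringr)
str2 <- gsub(",{2,}", " ", str1)   # collapse runs of commas into one space
str4 <- sub("\\].*", "", str2)     # drop everything from the closing bracket on
str5 <- sub("\\[", "", str4)       # drop the opening bracket
service_ids <- sapply(str_extract_all(str2, "SERVICE_ID=\\d+"), paste, collapse = " ")
net <- cbind(str5, service_ids)    # two-column matrix: timestamp, service IDs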
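For comparison, the same result can be sketched in base R only, with regmatches instead of stringr. This assumes every line starts with a bracketed timestamp. log_lines, stamps and ids are illustrative names.

```r
log_lines <- c(
  "[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
  "[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767"
)

# Leading "[...]" block. regmatches() silently drops lines without a match,
# hence the assumption that every line has a timestamp.
stamps <- regmatches(log_lines, regexpr("^\\[[^\\]]*\\]", log_lines, perl = TRUE))

# All SERVICE_ID=<digits> tokens per line, as a list of character vectors.
ids <- regmatches(log_lines, gregexpr("SERVICE_ID=[0-9]+", log_lines))

# Collapse each line's IDs into one string and prepend the timestamp.
result <- paste(stamps, vapply(ids, paste, character(1), collapse = " "))
print(result)
```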

Combine regex 'or' with stop at first occurrence

Conceptually, I want to search for (a|b) and get only the first occurrence. I know this is a lazy/non-greedy application, but can't seem to combine it properly with the or.
Moving beyond the conceptual level, which might change things a lot, a and b are actually longer patterns, but they have been tested separately and work fine. And I'm using this in strapply from package gsubfn which intrinsically finds all matches.
I suspect the answer is here in SO somewhere, but it's hard to search on such things.
Details: I'm trying to find function expressions var functionName = function(...) and function declarations function functionName(...) and extract the name of the function in javascript (parsing the lines with R). a is \\s*([[:alnum:]]*)\\s*=*\\s*function\\s*\\([^d|i] and b is \\s*function\\s*([[:alnum:]]+)\\s*\\([^d|i]. They work fine individually. A single function definition will take one form or the other, so I need to stop searching when I find one.
EDIT: In this string Here is a string of blah blah blah I'd like to find only the first 'a' using (a|b) or the first 'b' only using (b|a), plus of course whatever regex goodies I am missing.
EDIT 2: A big thanks to all who have looked at this. The details turn out to be important, so I'm going to post more info. Here are the test lines I am searching:
dput(lines)
c("var activateBrush = function() {", " function brushed() { // Handles the response to brushing",
" var followMouse = function(mX, mY) { // This draws the guides, nothing else",
".x(function(d) { return xContour(d.x); })", ".x(function(i) { return xContour(d.x); })"
)
Here are the two patterns I want to use, and how I use them individually.
fnPat1 <- "\\s*function\\s*([[:alnum:]]+)\\s*\\([^d|i]" # conveniently drops 'var'
fnNames <- unlist(strapply(pattern = fnPat1, replacement = paste0, X = lines))
fnPat2 <- "\\s*([[:alnum:]]*)\\s*=*\\s*function\\s*\\([^d|i]" # conveniently drops 'var'
fnNames <- unlist(strapply(pattern = fnPat2, replacement = paste0, X = lines))
They return, in order:
[1] "brushed" "brushed"
[1] "activateBrush" "followMouse" "activateBrush" "followMouse"
What I want to do is use both of these patterns at the same time. What I tried was
fnPat3 <- paste("((", fnPat1, ")|(", fnPat2, "))") # which is (a|b) of the orig. question
But that returns
[1] " activateBrush = function() " " function brushed() "
What I want is a vector of all the function names, namely c("brushed", "activateBrush", "followMouse") Duplicates are fine, I can call unique.
Maybe this is clearer now, maybe someone sees an entirely different approach. Thanks everyone!
To match the first a or b,
> x <- "Here is a string of blah blah blah"
> m <- regexpr("[ab]", x)
> regmatches(x, m)
[1] "a"
> x <- "Here b is a string of blah blah blah"
> m <- regexpr("[ab]", x)
> regmatches(x, m)
[1] "b"
You can check with sub whether the regex matches the first a or b. Below, sub replaces the first a or b with ***. This takes advantage of the fact that sub does not replace globally: it replaces only the first occurrence of the characters that match the given pattern.
> x <- "Here is a string of blah blah blah"
> sub("[ab]", "***", x)
[1] "Here is *** string of blah blah blah"
> x <- "Here b is a string of blah blah blah"
> sub("[ab]", "***", x)
[1] "Here *** is a string of blah blah blah"
We could use gregexpr or gsub functions also.
> x <- "Here is a string of blah blah blah"
> m <- gregexpr("^[^ab]*\\K[ab]", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "a"
> gsub("^[^ab]*\\K[ab]", "***", x, perl=TRUE)
[1] "Here is *** string of blah blah blah"
> x <- "Here b is a string of blah blah blah"
> gsub("^[^ab]*\\K[ab]", "***", x, perl=TRUE)
[1] "Here *** is a string of blah blah blah"
Explanation:
^ asserts that we are at the start of the string.
[^ab]* is a negated character class that matches any character other than a or b, zero or more times. We don't use [^ab]+ because a or b might be the very first character of the line.
\K discards the previously matched characters, i.e. the characters matched by [^ab]* are left out of the reported match.
[ab] then matches the a or b that follows.
It seems to me this would be a lot easier if you combined the expressions ...
strapply(lines, '(?:var|function)\\s*([[:alnum:]]+)', simplify = c)
# [1] "activateBrush" "brushed" "followMouse"
(?: ... ) is a non-capturing group. Placing ?: at the start of a group specifies that it only groups things and is not captured. In other words: group but do not capture "var" or "function", then capture the alphanumeric characters that follow.
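The same idea can be sketched without gsubfn, using base R's regexec, which reports capture groups. This version uses an ordinary group instead of (?: ... ) so that it runs on the default regex engine, and it requires at least one space after var or function, which is what excludes the anonymous function(d) callbacks. js_lines is an illustrative copy of the test data.

```r
js_lines <- c("var activateBrush = function() {",
              "    function brushed() { // Handles the response to brushing",
              "    var followMouse = function(mX, mY) { // This draws the guides, nothing else",
              ".x(function(d) { return xContour(d.x); })",
              ".x(function(i) { return xContour(d.x); })")

# Group 1 is the keyword, group 2 the name that follows it.
pat <- "(var|function)[[:space:]]+([[:alnum:]]+)"
m <- regmatches(js_lines, regexec(pat, js_lines))

# Each element is c(full match, group 1, group 2), or character(0) if no match;
# indexing character(0) with [3] yields NA, which is filtered out below.
hits <- vapply(m, function(x) x[3], character(1))
fn_names <- hits[!is.na(hits)]
print(fn_names)
```

Because regexec stops at the first match in each line, a line of the form var name = function(...) contributes only its name, which is the "stop at first occurrence" behaviour asked for.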
Try str_extract() from stringr package.
str_extract("b a", "a|b")
[1] "b"
str_extract("a b", "a|b")
[1] "a"
str_extract(c("a b", "b a"), "a|b")
[1] "a" "b"
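A base R counterpart to str_extract can be sketched with regexpr, which reports only the leftmost match. first_match is a hypothetical helper. Unlike a bare regmatches call, it keeps an NA for elements with no match, so the result stays aligned with the input.

```r
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)   # position of the leftmost match, -1 if none
  out <- rep(NA_character_, length(x))
  hit <- as.vector(m) != -1
  out[hit] <- regmatches(x, m)            # regmatches() returns matches for hits only
  out
}

res <- first_match(c("a b", "b a", "zzz"), "a|b")
print(res)
```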

How to trim and replace a string

string<-c("       this is a string  ")
Is it possible to trim-off the white spaces on both the sides of the string (or just one side as required) and replace it with a desired character, such as this, in R? The number of white spaces differ on each side of the string and have to be retained on replacement.
"~~~~~~~this is a string~~"
This seems like an inefficient way of doing it, but maybe you should be looking in the direction of gregexpr and regmatches instead of gsub:
x <- "    this is a string  "
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(x, gregexpr(pattern, x))[[1]])
text <- paste(regmatches(x, gregexpr(pattern, x), invert=TRUE)[[1]], collapse="")
paste0(startstop[1], text, startstop[2])
# [1] "~~~~this is a string~~"
And, for fun, as a function, and a "vectorized" function:
## The function
replaceEnds <- function(string) {
  pattern <- "^ +?\\b|\\b? +$"
  startstop <- gsub(" ", "~", regmatches(string, gregexpr(pattern, string))[[1]])
  text <- paste(regmatches(string, gregexpr(pattern, string), invert = TRUE)[[1]],
                collapse = "")
  paste0(startstop[1], text, startstop[2])
}
## use Vectorize here if you want to apply over a vector
vReplaceEnds <- Vectorize(replaceEnds)
Some sample data:
myStrings <- c("    Four at the start, 2 at the end  ",
               "   three at the start, one at the end ")
vReplaceEnds(myStrings)
# Four at the start, 2 at the end three at the start, one at the end
# "~~~~Four at the start, 2 at the end~~" "~~~three at the start, one at the end~"
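As a sketch of a vectorised alternative that avoids Vectorize: measure the leading and trailing runs of spaces, then rebuild the string with strrep. replace_ends2 is a hypothetical name. The sketch assumes the padding consists of plain spaces and that no string is made up entirely of spaces (those would be counted on both sides).

```r
replace_ends2 <- function(x, new = "~") {
  # "^ *" and " *$" always match (possibly with length 0), so match.length
  # is simply the number of leading / trailing spaces.
  n_lead  <- attr(regexpr("^ *", x, perl = TRUE), "match.length")
  n_trail <- attr(regexpr(" *$", x, perl = TRUE), "match.length")
  core <- gsub("^ +| +$", "", x)          # the text without its padding
  paste0(strrep(new, n_lead), core, strrep(new, n_trail))
}

demo <- c("    Four at the start, 2 at the end  ",
          "   three at the start, one at the end ")
res <- replace_ends2(demo)
print(res)
```

Both regexpr and strrep work element-wise, so the function accepts a whole character vector and returns an unnamed result.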
Use gsub:
gsub(" ", "~", "    this is a string  ")
[1] "~~~~this~is~a~string~~"
This function uses regular expressions to replace (i.e. substitute) all occurrences of a pattern inside a string.
In your case, you have to express the pattern in a special way:
gsub("(^ *)|( *$)", "~~~", " this is a string ")
[1] "~~~this is a string~~~"
The pattern means:
(^ *): Find zero or more spaces at the start of the string
( *$): Find zero or more spaces at the end of the string
|: The OR operator
Now you can use this approach to tackle your problem of replacing each space with a new character:
txt <- " this is a string "
foo <- function(x, new="~"){
  lead <- gsub("(^ *).*", "\\1", x)
  last <- gsub(".*?( *$)", "\\1", x)
  mid <- gsub("(^ *)|( *$)", "", x)
  paste0(
    gsub(" ", new, lead),
    mid,
    gsub(" ", new, last)
  )
}
> foo("    this is a string  ")
[1] "~~~~this is a string~~"
> foo(" And another one        ")
[1] "~And another one~~~~~~~~"
For more, see ?gsub or ?regexp.
Or using a more complex pattern matching and gsub...
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", "    this is a string  " , perl = TRUE )
#[1] "~~~~this is a string~~"
Or with @AnandaMahto's data:
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", myStrings , perl = TRUE )
#[1] "~~~~Four at the start, 2 at the end~~"
#[2] "~~~three at the start, one at the end~"
Explanation
This uses positive and negative lookahead and lookbehind assertions:
\\s(?!\\b) - match a space, \\s, that is not followed by a word boundary, (?!\\b). This would work by itself for everything except the last space before the first word, i.e. by itself we would get "~~~ this is a string~~". So we need another pattern...
(?<=\\s)\\s(?=\\b) - match a space, \\s, that is preceded by another space, (?<=\\s), and is followed by a word boundary, (?=\\b).
And because it is gsub, it replaces every match it can find.

regex match within parenthesis

I'm trying to make some regular expressions that I wrote for Python also work in R.
Here is what I have in Python (using the excellent re module), with my expected 3 matches:
import re
line = 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
re.findall('"(.*?)"', line)
# ['First [T]', 'Second [L]', 'Third [1/T]']
Now with R, here is my best attempt:
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line)
regmatches(line, m)[[1]]
# [1] "\"First [T]\"" "\"Second [L]\"" "\"Third [1/T]\""
Why did R match the whole pattern, rather than just the part within the parentheses? I was expecting:
[1] "First [T]" "Second [L]" "Third [1/T]"
Furthermore, perl=TRUE didn't make any difference. Is it safe to assume that R's regex functions cannot return only the part within the parentheses, or is there some trick that I'm missing?
Summary of solution: thanks @flodel. It appears to work well with other patterns too, so it seems to be a good general solution. A compact form of the solution, using an input string line and a regular expression pattern pat, is:
pat <- '"(.*?)"'
sub(pat, "\\1", regmatches(line, gregexpr(pat, line))[[1]])
Furthermore, perl=TRUE should be added to gregexpr if using PCRE features in pat.
If you print m, you'll see that gregexpr(..., perl = TRUE) gives you the positions and lengths of the matches for a) your full pattern, including the opening and closing quotes, and b) the captured (.*?).
Unfortunately for you, when m is used by regmatches, it uses the positions and lengths of the former.
There are two solutions I can think of.
Pass your final output through sub:
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line, perl = TRUE)
z <- regmatches(line, m)[[1]]
sub('"(.*?)"', "\\1", z)
Or use substring using the positions and lengths of the captured expressions:
start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos, end.pos)
To further your understanding, see what happens if your pattern tries to capture more than one thing. Also note that you can give names to your capture groups (what the documentation refers to as Python-style named captures), here "capture1" and "capture2":
m <- gregexpr('"(?P<capture1>.*?) \\[(?P<capture2>.*?)\\]"', line, perl = TRUE)
m
start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos[, "capture1"],
end.pos[, "capture1"])
# [1] "First" "Second" "Third"
substring(line, start.pos[, "capture2"],
end.pos[, "capture2"])
# [1] "T" "L" "1/T"
1) strapplyc in the gsubfn package acts in the way you were expecting:
> library(gsubfn)
> strapplyc(line, '"(.*?)"')[[1]]
[1] "First [T]" "Second [L]" "Third [1/T]"
2) Although it involves delving into m's attributes, it's possible to make regmatches work by reconstructing m to refer to the captures rather than the whole match:
at <- attributes( m[[1]] )
m2 <- list( structure( c(at$capture.start), match.length = at$capture.length ) )
regmatches( line, m2 )[[1]]
3) If we knew that the strings always ended in ] and were willing to modify the regular expression then this would work:
> m3 <- gregexpr('[^"]*]', line)
> regmatches( line, m3 )[[1]]
[1] "First [T]" "Second [L]" "Third [1/T]"
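In newer versions of R there is also gregexec, which returns the capture groups for every match directly. This is a minimal sketch, assuming R >= 4.1.0 (where, as far as I know, gregexec was introduced). The pattern uses [^"]* instead of the lazy .*? so that it works with the default regex engine.

```r
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'

# regmatches() on a gregexec() result gives one matrix per input string:
# row 1 holds the full matches, the following rows hold the capture groups,
# and each column corresponds to one match.
mat <- regmatches(line, gregexec('"([^"]*)"', line))[[1]]
captured <- mat[2, ]
print(captured)
```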

R : regular expression for 'not followed by' not working

I needed to retain the words enclosed in brackets and delete the others in the following string.
(a(b(c)d)(e)f)
So what I expected would be (((c))(e)).
To delete a, b, d, f, I tried the 'not followed by' regex.
str <- "(a(b(c)d)(e)f)"
gsub("([a-z]+)(?!\\))", "", str) #(sub. anything that isn't followed by a ")" )
The message says my regex is invalid. As far as I can see, the brackets in the second part of the regex, "(?!\))", don't match up properly. In my editor, the first "(" pairs with the immediately following ")", which is not meant to be a closing bracket (the one to its right is). That is the only error I could make out in my regex. Can you please tell me what is actually wrong? Is there any other way to do this?
In two steps, and using positive lookaheads:
str1 <- gsub("\\([a-z](?=\\()", "\\(", str, perl=TRUE)
str1
# [1] "(((c)d)(e)f)"
str2 <- gsub("\\)[a-z](?=\\))", "\\)", str1, perl=TRUE)
str2
# [1] "(((c))(e))"
Edit: it turns out you can even do it in one:
gsub("([\\(\\)])[a-z](?=\\1)", "\\1", str, perl=TRUE)
# [1] "(((c))(e))"
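On the original attempt: lookaheads such as (?!...) are only supported by the PCRE engine, so the pattern needs perl = TRUE to compile at all. Even then, "not followed by )" alone would keep d and f, since both are followed by a closing bracket. Here is a sketch of a single lookaround pattern that removes every letter not wrapped directly in brackets, assuming single-letter tokens as in the example. keep_wrapped is a hypothetical name.

```r
keep_wrapped <- function(x) {
  # (?<!\\()[a-z] : a letter NOT preceded by "("
  # [a-z](?!\\))  : a letter NOT followed by ")"
  # Only letters with "(" on the left AND ")" on the right survive.
  gsub("(?<!\\()[a-z]|[a-z](?!\\))", "", x, perl = TRUE)
}

res <- keep_wrapped("(a(b(c)d)(e)f)")
print(res)
```

With multi-letter tokens this pattern would trim wrapped words letter by letter, so it should be treated as specific to the single-letter case.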
I agree with @Dason's comment:
st <- "(a(b(c)d)(e)f)"
while(grepl("\\([a-z]+\\(",st)) {
st <- sub("\\([a-z]+(\\(.+\\))[a-z]+\\)","\\1",st)
}
> st
[1] "(c)(e)"
Written on my iPad :-)