Matching entire string in R

Matching entire string in R - regex

Consider the following string:
string = "I have #1 file and #11 folders"
I would like to replace the pattern #1 with the word one, but I don't want to modify the #11. The result should be:
string = "I have one file and #11 folders"
I have tried:
string = gsub("#1", "one", string, fixed = TRUE)
but this also replaces the #1 inside #11. I have also tried:
string = gsub("^#1$", "one", string, fixed = TRUE)
but this doesn't replace anything since the pattern is part of a string that contains spaces.
Please note that if the initial string looked like:
string = "I have #1 file blah blah blah and #11 folders"
I would want the result to be:
string = "I have one file blah blah blah and #11 folders"
In other words, I literally just want to change the exact pattern #1 without touching the rest of the string. Is that possible?

I'm not sure I understood correctly, but does this help?
a <- "I have #1 file and #11 folders"
b <- "I have #1file and #11 folders"
c <- "I have #1,file and #11 folders"
> gsub(x = a, pattern = "#1.*file", replacement = "one file")
[1] "I have one file and #11 folders"
> gsub(x = b, pattern = "#1.*file", replacement = "one file")
[1] "I have one file and #11 folders"
> gsub(x = c, pattern = "#1.*file", replacement = "one file")
[1] "I have one file and #11 folders"

If you pass perl=TRUE to functions like gsub, the Perl-compatible regex engine is used, which has some options that could help.
The pattern "#1\\b" matches #1 followed by a word boundary, so it matches #1 but not #11 (there is no boundary between the two 1s). There are also positive and negative lookaheads, which check what follows your pattern (the word file, for example) without including it in the part that gets replaced.
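As a minimal sketch of both ideas, assuming the question's string and using only base R's gsub (the variable names with_boundary and with_lookahead are illustrative):

```r
string <- "I have #1 file and #11 folders"

# Word boundary: the 1 must be followed by a non-word character (or the end),
# so the "#1" inside "#11" is skipped.
with_boundary <- gsub("#1\\b", "one", string, perl = TRUE)

# Negative lookahead: "#1" must not be followed by another digit.
# The lookahead checks the next character without consuming it.
with_lookahead <- gsub("#1(?!\\d)", "one", string, perl = TRUE)

print(with_boundary)
print(with_lookahead)
```

The two patterns agree on this input but not in general: for text like "#1a", the word-boundary version does not match (1 and a are both word characters), while the lookahead version does.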

Use the space after #1 to your advantage:
gsub("#1 ", "one ", string, fixed = TRUE)
[1] "I have one file and #11 folders"

Related

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to split the regular expression in my gsub() call into alternative subpatterns, but the combined pattern does not match anything.
Task: I want to delete every part of a string that contains either .ST or -XST in my vector of strings.
As you can see below, using a single expression works fine, but the | expression simply does not work. I'm following the metacharacter guide at https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What is the issue, and what causes it?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"

It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x), which allows whitespace inside patterns for readability. In R you can enable it with the (?x) inline modifier together with the perl=T option:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=T)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
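To illustrate why the spaced pattern fails without (?x), here is a small sketch on two of the IDs from the question (the names ids, spaced and compact are made up for the example). By default every space in the pattern is a literal character that must appear in the input:

```r
ids <- c("AAK.ST", "ALIV-SDB.ST")

# Spaces are literal here: the first alternative needs a space after the
# end of the string and the second needs a space before "-", so nothing matches.
spaced <- gsub("[.].*$ | [-].*$", "", ids)

# Same alternation without the spaces.
compact <- gsub("[.].*$|[-].*$", "", ids)

print(spaced)
print(compact)
```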
However, you really do not need such a complex regex here.
If you plan to remove everything from the first hyphen or dot up to the end, you may use the following regex:
[.-].*$
The character class [.-] matches the first . or - symbol, and .* matches all characters up to the end of the string ($).
Demo:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"

This finds .ST or -XST at the end of the text and replaces it with an empty string (effectively removing that part). Don't forget that gsub returns the modified string rather than modifying it in place. You won't see any change until you assign the return value back to a variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression ('[.](.*)$ | [-](.*)$'), if not for the unnecessary spaces, would remove everything from the first dot (.) or dash (-) to the end of the text. This might be what you want, but it is not what you said you want.
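A short side-by-side sketch of that difference, on a few of the strings above (suffix_only and from_first_sep are illustrative names):

```r
strings <- c("AAK.ST", "ALIV-SDB.ST", "AAC-XST", "AAD-XSTV")

# Remove only a trailing ".ST" or "-XST".
suffix_only <- gsub("(\\.ST|-XST)$", "", strings)

# Remove everything from the first "." or "-" onward.
from_first_sep <- gsub("[.-].*$", "", strings)

print(suffix_only)
print(from_first_sep)
```

The two agree on simple IDs such as "AAK.ST" but diverge as soon as an ID contains an earlier hyphen ("ALIV-SDB.ST") or a suffix that is not exactly .ST or -XST ("AAD-XSTV").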

Replace a random block of characters in a string in R

I have a text and I want to replace a block of text in a line like this:
"\t\t\tFGHGFJKJKJKGDSJS"
with
x= "ABCCCBBHHJJJH"
I want to change only the text block (FGHGFJKJKJKGDSJS) without modifying the other special characters, obtaining:
"\t\t\tABCCCBBHHJJJH"
Is there a way to replace FGHGFJKJKJKGDSJS without specifying the exact combination of letters?
I found this solution: txt[line_number] = paste0("\t", "\t", "\t", x)
But I would like to know whether there is a more general solution.

> library(stringr)
> mystring <- "\t\t\tFGHGFJKJKJKGDSJS"
> x <- "ABCCCBBHHJJJH"
> str_replace(mystring,"\\w+",x)
[1] "\t\t\tABCCCBBHHJJJH"
\w+ matches one or more word characters (letters, digits, or underscore), as many as possible. str_replace therefore replaces the first run of word characters with x and leaves the leading tabs untouched.
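str_replace only touches the first match. A rough base R sketch of the same distinction, assuming a hypothetical line two_blocks that contains two text blocks (sub is the first-match analogue, gsub replaces every match):

```r
x <- "ABCCCBBHHJJJH"
two_blocks <- "\t\t\tFGHGF\tJKJKG"  # hypothetical line with two blocks

# Replace only the first run of word characters.
first_only <- sub("\\w+", x, two_blocks)

# Replace every run of word characters.
every_block <- gsub("\\w+", x, two_blocks)

print(first_only)
print(every_block)
```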

> a = "\t\t\tDFGGD"
> gsub("(\t\t\t).*","\\1ABCDF",a)
[1] "\t\t\tABCDF"

mystring <- "\t\t\tFGHGFJKJKJKGDSJS"
x <- "ABCCCBBHHJJJH"
sub('\\w+',x,mystring,ignore.case=T)

Combine regex 'or' with stop at first occurrence

Conceptually, I want to search for (a|b) and get only the first occurrence. I know this is a lazy/non-greedy application, but can't seem to combine it properly with the or.
Moving beyond the conceptual level, which might change things a lot, a and b are actually longer patterns, but they have been tested separately and work fine. And I'm using this in strapply from package gsubfn which intrinsically finds all matches.
I suspect the answer is here in SO somewhere, but it's hard to search on such things.
Details: I'm trying to find function expressions var functionName = function(...) and function declarations function functionName(...) and extract the name of the function in javascript (parsing the lines with R). a is \\s*([[:alnum:]]*)\\s*=*\\s*function\\s*\\([^d|i] and b is \\s*function\\s*([[:alnum:]]+)\\s*\\([^d|i]. They work fine individually. A single function definition will take one form or the other, so I need to stop searching when I find one.
EDIT: In this string Here is a string of blah blah blah I'd like to find only the first 'a' using (a|b) or the first 'b' only using (b|a), plus of course whatever regex goodies I am missing.
EDIT 2: A big thanks to all who have looked at this. The details turn out to be important, so I'm going to post more info. Here are the test lines I am searching:
dput(lines)
c("var activateBrush = function() {", " function brushed() { // Handles the response to brushing",
" var followMouse = function(mX, mY) { // This draws the guides, nothing else",
".x(function(d) { return xContour(d.x); })", ".x(function(i) { return xContour(d.x); })"
)
Here are the two patterns I want to use, and how I use them individually.
fnPat1 <- "\\s*function\\s*([[:alnum:]]+)\\s*\\([^d|i]" # conveniently drops 'var'
fnNames <- unlist(strapply(pattern = fnPat1, replacement = paste0, X = lines))
fnPat2 <- "\\s*([[:alnum:]]*)\\s*=*\\s*function\\s*\\([^d|i]" # conveniently drops 'var'
fnNames <- unlist(strapply(pattern = fnPat2, replacement = paste0, X = lines))
They return, in order:
[1] "brushed" "brushed"
[1] "activateBrush" "followMouse" "activateBrush" "followMouse"
What I want to do is use both of these patterns at the same time. What I tried was
fnPat3 <- paste("((", fnPat1, ")|(", fnPat2, "))") # which is (a|b) of the orig. question
But that returns
[1] " activateBrush = function() " " function brushed() "
What I want is a vector of all the function names, namely c("brushed", "activateBrush", "followMouse") Duplicates are fine, I can call unique.
Maybe this is clearer now, maybe someone sees an entirely different approach. Thanks everyone!

To match the first a or b,
> x <- "Here is a string of blah blah blah"
> m <- regexpr("[ab]", x)
> regmatches(x, m)
[1] "a"
> x <- "Here b is a string of blah blah blah"
> m <- regexpr("[ab]", x)
> regmatches(x, m)
[1] "b"
You can check with sub whether the regex matches the first a or b. Below, sub replaces the first a or b with ***. This relies on the fact that sub does not do a global replacement: it only replaces the first occurrence that matches the given pattern.
> x <- "Here is a string of blah blah blah"
> sub("[ab]", "***", x)
[1] "Here is *** string of blah blah blah"
> x <- "Here b is a string of blah blah blah"
> sub("[ab]", "***", x)
[1] "Here *** is a string of blah blah blah"
We could use gregexpr or gsub functions also.
> x <- "Here is a string of blah blah blah"
> m <- gregexpr("^[^ab]*\\K[ab]", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "a"
> gsub("^[^ab]*\\K[ab]", "***", x, perl=TRUE)
[1] "Here is *** string of blah blah blah"
> x <- "Here b is a string of blah blah blah"
> gsub("^[^ab]*\\K[ab]", "***", x, perl=TRUE)
[1] "Here *** is a string of blah blah blah"
Explanation:
^ asserts that we are at the start of the string.
[^ab]* is a negated character class that matches any character other than a or b, zero or more times. We don't use [^ab]+ because a or b might be the very first character of the line.
\K discards the previously matched characters, i.e. everything matched by [^ab]* is dropped from the reported match.
[ab] then matches the a or b that follows.
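The regexpr approach is not limited to single characters. A small sketch with multi-character alternatives, where "blah" and "string" are stand-ins for the longer patterns a and b:

```r
x <- "Here is a string of blah blah blah"

# regexpr returns only the first (leftmost) match of the whole alternation.
first_hit <- regmatches(x, regexpr("blah|string", x))

# Swapping the alternatives does not change which occurrence comes first.
first_hit_swapped <- regmatches(x, regexpr("string|blah", x))

print(first_hit)
print(first_hit_swapped)
```

What decides the result is the position in the text, not the order in which the alternatives are written.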

It seems to me this would be a lot easier if you combined the expressions ...
strapply(lines, '(?:var|function)\\s*([[:alnum:]]+)', simplify = c)
# [1] "activateBrush" "brushed" "followMouse"
(?: ... ) is a non-capturing group: placing ?: at the start groups the alternatives without capturing them. The pattern says: match "var" or "function" without capturing it, then capture the alphanumeric characters that follow.
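For readers without gsubfn, a base R sketch of the same idea is possible with regexec and regmatches. It assumes the test lines from the question and requires at least one space after the keyword (\\s+ rather than \\s*), which is a slight variation on the pattern above; fn_names is an illustrative name:

```r
lines <- c("var activateBrush = function() {",
           " function brushed() { // Handles the response to brushing",
           " var followMouse = function(mX, mY) { // This draws the guides, nothing else",
           ".x(function(d) { return xContour(d.x); })",
           ".x(function(i) { return xContour(d.x); })")

pat <- "(?:var|function)\\s+([[:alnum:]]+)"

# regexec reports the full match plus each capture group for the first hit per line.
m <- regmatches(lines, regexec(pat, lines, perl = TRUE))

# Keep only lines that matched, then take element 2 (the first capture group).
hits <- m[lengths(m) > 0]
fn_names <- vapply(hits, function(hit) hit[2], character(1))

print(fn_names)
```

The anonymous callbacks in the last two lines are skipped because "function" is followed directly by "(" there.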

Try str_extract() from the stringr package.
str_extract("b a", "a|b")
[1] "b"
str_extract("a b", "a|b")
[1] "a"
str_extract(c("a b", "b a"), "a|b")
[1] "a" "b"

Extract matched strings in C++ with regex

I have the following test strings.
#5=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$);
#6=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$);
#7=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$);
I need to extract:
"#integer" (which always starts at the beginning of the string) from each of the above strings, stored in a variable.
the string between "(" and ")" in each test string.
Can someone please suggest how I can achieve this in C++ with regex?
I tried the following simple example (it's a loop that processes one line at a time):
std::regex e ("\#[:d:]+");
if (std::regex_match(sLine,e)){
//store it and process it
}
output should be:
#5
and
'xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$ ?? (not sure)

Description
This expression will:
capture the initial # and integer
capture the value between the parentheses
^(\#\d+).*?\(([^)]*)\)
Example
Sample Text
#5=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$);
#6=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$);
#7=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$);
Capture Groups
Group 0 gets the entire matched string
Group 1 gets the # and integer
Group 2 gets the value between the parentheses
[0][0] = #5=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$)
[0][1] = #5
[0][2] = 'xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$
[1][0] = #6=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$)
[1][1] = #6
[1][2] = 'xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$
[2][0] = #7=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$)
[2][1] = #7
[2][2] = 'xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$

regex match within parenthesis

I'm attempting to make some regular expressions that I wrote for Python also work with R.
Here is what I have in Python (using the excellent re module), with my expected 3 matches:
import re
line = 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
re.findall('"(.*?)"', line)
# ['First [T]', 'Second [L]', 'Third [1/T]']
Now with R, here is my best attempt:
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line)
regmatches(line, m)[[1]]
# [1] "\"First [T]\"" "\"Second [L]\"" "\"Third [1/T]\""
Why did R match the whole pattern, rather than just the part within the parentheses? I was expecting:
[1] "First [T]" "Second [L]" "Third [1/T]"
Furthermore, perl=TRUE didn't make any difference. Is it safe to assume that R's regex functions cannot return only the part within the parentheses, or is there some trick that I'm missing?
Summary of solution: thanks @flodel. It appears to work well with other patterns too, so it seems to be a good general solution. A compact form of the solution, using an input string line and a regular expression pattern pat, is:
pat <- '"(.*?)"'
sub(pat, "\\1", regmatches(line, gregexpr(pat, line))[[1]])
Furthermore, perl=TRUE should be added to gregexpr if using PCRE features in pat.

If you print m, you'll see that gregexpr(..., perl = TRUE) gives you the positions and lengths of matches for a) your full pattern, including the opening and closing quotes, and b) the captured (.*?).
Unfortunately for you, when m is used by regmatches, it uses the positions and lengths of the former.
There are two solutions I can think of.
Pass your final output through sub:
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line, perl = TRUE)
z <- regmatches(line, m)[[1]]
sub('"(.*?)"', "\\1", z)
Or use substring using the positions and lengths of the captured expressions:
start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos, end.pos)
To further your understanding, see what happens if your pattern tries to capture more than one thing. Also note that you can give names to your capture groups (what the doc refers to as Python-style named captures), here "capture1" and "capture2":
m <- gregexpr('"(?P<capture1>.*?) \\[(?P<capture2>.*?)\\]"', line, perl = TRUE)
m
start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos[, "capture1"],
end.pos[, "capture1"])
# [1] "First" "Second" "Third"
substring(line, start.pos[, "capture2"],
end.pos[, "capture2"])
# [1] "T" "L" "1/T"
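The substring approach can be wrapped in a small function. This is only a sketch: extract_first_group is a hypothetical helper name, and it handles a single input string and returns only the first capture group of each match:

```r
# Hypothetical helper: text of the first capture group for every match in one string.
extract_first_group <- function(pattern, text) {
  m <- gregexpr(pattern, text, perl = TRUE)[[1]]
  if (m[1] == -1) return(character(0))           # no match at all
  starts <- attr(m, "capture.start")[, 1]        # start of group 1, one row per match
  lens   <- attr(m, "capture.length")[, 1]       # length of group 1
  substring(text, starts, starts + lens - 1L)
}

line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
quoted <- extract_first_group('"(.*?)"', line)
print(quoted)
```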

1) strapplyc in the gsubfn package acts in the way you were expecting:
> library(gsubfn)
> strapplyc(line, '"(.*?)"')[[1]]
[1] "First [T]" "Second [L]" "Third [1/T]"
2) Although it involves delving into m's attributes, it's possible to make regmatches work by reconstructing m to refer to the captures rather than the whole match:
at <- attributes( m[[1]] )
m2 <- list( structure( c(at$capture.start), match.length = at$capture.length ) )
regmatches( line, m2 )[[1]]
3) If we knew that the strings always ended in ] and were willing to modify the regular expression then this would work:
> m3 <- gregexpr('[^"]*]', line)
> regmatches( line, m3 )[[1]]
[1] "First [T]" "Second [L]" "Third [1/T]"
