R: gsub of exact full string with fixed = T - regex

I am trying to gsub exact FULL string - I know I need to use ^ and $. The problem is that I have special characters in strings (could be [, or .) so I need to use fixed=T. This overrides the ^ and $. Any solution is appreciated.
Need to replace 1st, 2nd element in exact_orig with 1st, 2nd element from exact_change but only if full string is matched from beginning to end.
exact_orig = c("oz","32 oz")
exact_change = c("20 oz","32 ct")
gsub_FixedTrue <- function(i) {
for(k in seq_along(exact_orig)) i = gsub(exact_orig[k],exact_change[k],i,fixed=TRUE)
return(i)
}
Test cases:
print(gsub_FixedTrue("32 oz")) #gives me "32 20 oz" - wrong! Must be "32 ct"
print(gsub_FixedTrue("oz oz")) # gives me "20 oz 20 oz" - wrong! Must remain as "oz oz"
I read a somewhat similar thread, but could not make it work for full string (grep at the beginning of the string with fixed =T in R?)

If you want to exactly match full strings, i don't think you really want to use regular expressions in this case. How about just the match() function
fixedTrue<-function(x) {
m <- match(x, exact_orig)
x[!is.na(m)] <- exact_change[m[!is.na(m)]]
x
}
fixedTrue(c("32 oz","oz oz"))
# [1] "32 ct" "oz oz"

Related

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to subdivide my metacharacter expression in my gsub() function. But it does not return anything found.
Task: I want to delete all sections of string that contain either .ST or -XST in my vector of strings.
As you can see below, using one expression works fine. But the | expression simply does not work. I'm following the metacharacter guide on https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What can be the issue? And what caused this issue?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"
It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x) that allows whitespace inside patterns for readability. You can use it with perl=T option:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=T)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
However, you really do not have to use that complex regex here.
If you plan to remove all substrings from ther first hyphen or dot up to the end, you may use the following regex:
[.-].*$
The character class [.-] will match the first . or - symbol and .* wil match all characters up to the end of the string ($).
See IDEONE demo:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
This will find .ST or -XST at the end of the text and substitute it with empty characters string (effectively removing that part). Don't forget that gsub returns modified string, not modifies it in place. You won't see any change until you reassign return value back to some variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression ([.](.*)$ | [-](.*)$'), if not for unnecessary spaces, would remove everything from first dot (.) or dash (-) to end of text. This might be what you want, but not what you said you want.

Split string on [:punct:] except for underscore in R

I have an equation as a string where the variables in the string equation are variables in the R workspace. I would like to replace each variable with its numeric value in the R workspace. This is easy enough when the variable names don't contain punctuation.
Here is a simple example.
x <- 5
y <- 10
yy <- 15
z <- x*(y + yy)
zAsChar <- "z=x*(y+yy)"
vars <- unlist(strsplit(zAsChar, "[[:punct:]]"))
notVars <- unlist(strsplit(zAsChar, "[^[:punct:]]"))
varsValues <- sapply(vars[vars != ""], FUN=function(aaa) get(aaa))
notVarsValues <- notVars[notVars != ""]
paste(paste0(varsValues, notVarsValues), collapse="")
This yields "125=5*(10+15)", which is great.
However, I would love the option to use underscores in the variable names so that I can use "subscripts" for variable names. I am using these strings in math mode in R markdown.
So I need a [:punct:] that excludes _. I tried using [\\+\\-\\*\\/\\(\\)\\=] rather than [:punct:], but with this approach I couldn't split on the minus sign. Is there a way to preserve the _?
Instead of [:punct:] use the unicode character class \pP (shortcut for \p{P}) and its negation \PP to do that:
[^\\PP_]
(It works with perl=TRUE option)
Are you sure you need to do all this string manipulation? The substitute() function can help you out
substitute(z==x*(y+yy), list(x=x, y=y, yy=yy,z=z))
Or if you really need to start with a character value
do.call("substitute", list(parse(text=zAsChar)[[1]],list(x=x, y=y, yy=yy,z=z)))
# 125 = 5 * (10 + 15)
You can deparse() the result to turn it back into a character.

Pattern matching and replacement in R

I am not familiar at all with regular expressions, and would like to do pattern matching and replacement in R.
I would like to replace the pattern #1, #2 in the vector: original = c("#1", "#2", "#10", "#11") with each value of the vector vec = c(1,2).
The result I am looking for is the following vector: c("1", "2", "#10", "#11")
I am not sure how to do that. I tried doing:
for(i in 1:2) {
pattern = paste("#", i, sep = "")
original = gsub(pattern, vec[i], original, fixed = TRUE)
}
but I get :
#> original
#[1] "1" "2" "10" "11"
instead of: "1" "2" "#10" "#11"
I would appreciate any help I can get! Thank you!
Specify that you are matching the entire string from start (^) to end ($).
Here, I've matched exactly the conditions you are looking at in this example, but I'm guessing you'll need to extend it:
> gsub("^#([1-2])$", "\\1", original)
[1] "1" "2" "#10" "#11"
So, that's basically, "from the start, look for a hash symbol followed by either the exact number one or two. The one or two should be just one digit (that's why we don't use * or + or something) and also ends the string. Oh, and capture that one or two because we want to 'backreference' it."
Another option using gsubfn:
library(gsubfn)
gsubfn("^#([1-2])$", I, original) ## Function substituting
[1] "1" "2" "#10" "#11"
Or if you want to explicitly use the values of your vector , using vec values:
gsubfn("^#[1-2]$", as.list(setNames(vec,c("#1", "#2"))), original)
Or formula notation equivalent to function notation:
gsubfn("^#([1-2])$", ~ x, original) ## formula substituting
Here's a slightly different take that uses zero width negative lookahead assertion (what a mouthful!). This is the (?!...) which matches # at the start of a string as long as it is not followed by whatever is in .... In this case two (or equivalently, more as long as they are contiguous) digits. It replaces them with nothing.
gsub( "^#(?![0-9]{2})" , "" , original , perl = TRUE )
[1] "1" "2" "#10" "#11"

Regular expression to find and replace conditionally

I need to replace string A with string B, only when string A is a whole word (e.g. "MECH"), and I don't want to make the replacement when A is a part of a longer string (e.g. "MECHANICAL"). So far, I have a grepl() which checks if string A is a whole string, but I cannot figure out how to make the replacement. I have added an ifelse() with the idea to makes the gsub() replacement when grep() returns TRUE, otherwise not to replace. Any suggestions? Please see the code below. Thanks.
aa <- data.frame(type = c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH", "MECH CONSTR", "MECHCONSTRUCTION"))
from <- c("MECH", "MECHANICAL", "CONSTR", "CONSTRUCTION")
to <- c("MECHANICAL", "MECHANICAL", "CONSTRUCTION", "CONSTRUCTION")
gsub2 <- function(pattern, replacement, x, ...) {
for(i in 1:length(pattern)){
reg <- paste0("(^", pattern[i], "$)|(^", pattern[i], " )|( ", pattern[i], "$)|( ", pattern[i], " )")
ifelse(grepl(reg, aa$type),
x <- gsub(pattern[i], replacement[i], x, ...),
aa$type)
}
x
}
aa$title3 <- gsub2(from, to, aa$type)
You can enclose the strings in the from vector in \\< and \\> to match only whole words:
x <- c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH",
"MECH CONSTR", "MECHCONSTRUCTION")
from <- c("\\<MECH\\>", "\\<CONSTR\\>")
to <- c("MECHANICAL", "CONSTRUCTION")
for(i in 1:length(from)){
x <- gsub(from[i], to[i], x)
}
print(x)
# [1] "CONSTRUCTION" "MECHANICAL CONSTRUCTION"
# [3] "MECHANICAL CONSTRUCTION MECHANICAL" "MECHANICAL CONSTRUCTION"
# [5] "MECHCONSTRUCTION"
I use regex (?<=\W|^)MECH(?=\W|$) to get if inside the string contain whole word MECH like this.
Is that what you need?
Just for posterity, other than using the \< \> enclosure, a whole word can be defined as any string ending in a space or end-of-line (\s|$).
gsub("MECH(\\s|$)", "MECHANICAL\\1", aa$type)
The only problem with this approach is that you need to carry over the space or end-of-line that you used as part of the match, hence the encapsulation in parentheses and the backreference (\1).
The \< \> enclosure is superior for this particular question, since you have no special exceptions. However, if you have exceptions, it is better to use a more explicit method. The more tools in your toolbox, the better.

R: Capitalizing everything after a certain character

I would like to capitalize everything in a character vector that comes after the first _. For example the following vector:
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f")
Should come out like this:
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F"
I have been trying to play with regular expressions, but am not able to do this. Any suggestions would be appreciated.
You were very close:
gsub("(_.*)","\\U\\1",x,perl=TRUE)
seems to work. You just needed to use _.* (underscore followed by zero or more other characters) rather than _* (zero or more underscores) ...
To take this apart a bit more:
_.* gives a regular expression pattern that matches an underscore _ followed by any number (including 0) of additional characters; . denotes "any character" and * denotes "zero or more repeats of the previous element"
surrounding this regular expression with parentheses () denotes that it is a pattern we want to store
\\1 in the replacement string says "insert the contents of the first matched pattern", i.e. whatever matched _.*
\\U, in conjunction with perl=TRUE, says "put what follows in upper case" (uppercasing _ has no effect; if we wanted to capitalize everything after (for example) a lower-case g, we would need to exclude the g from the stored pattern and include it in the replacement pattern: gsub("g(.*)","g\\U\\1",x,perl=TRUE))
For more details, search for "replacement" and "capitalizing" in ?gsub (and ?regexp for general information about regular expressions)
gsubfn in the gsubfn package is like gsub except the replacement string can be a function. Here we match _ and everything afterwards feeding the match through toupper :
library(gsubfn)
gsubfn("_.*", toupper, x)
## [1] "NYC_23DF" "BOS_3_RB" "mgh_3_3_F"
Note that this approach involves a particularly simple regular expression.
Simple example using base::strsplit
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f", "a")
myCap <- function(x) {
out <- sapply(x, function(y) {
temp <- unlist(strsplit(y, "_"))
out <- temp[1]
if (length(temp[-1])) {
out <- paste(temp[1], paste(toupper(temp[-1]),
collapse="_"), sep="_")
}
return(out)
})
out
}
> myCap(x)
NYC_23df BOS_3_rb mgh_3_3_f a
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F" "a"
Example using the stringr package
pkg <- "stringr"
if (!require(pkg, character.only=TRUE)) {
install.packages(pkg)
require(pkg, character.only=TRUE)
}
myCap.2 <- function(x) {
out <- sapply(x, function(y) {
idx <- str_locate(y, "_")
if (!all(is.na(idx[1,]))) {
str_sub(y, idx[,1], nchar(y)) <- toupper(str_sub(y, idx[,1], nchar(y)))
}
return(y)
})
out
}
> myCap.2(x)
NYC_23df BOS_3_rb mgh_3_3_f a
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F" "a"