Regular expression to find and replace conditionally - regex

I need to replace string A with string B, only when string A is a whole word (e.g. "MECH"), and I don't want to make the replacement when A is a part of a longer string (e.g. "MECHANICAL"). So far, I have a grepl() which checks if string A is a whole string, but I cannot figure out how to make the replacement. I have added an ifelse() with the idea to makes the gsub() replacement when grep() returns TRUE, otherwise not to replace. Any suggestions? Please see the code below. Thanks.
aa <- data.frame(type = c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH", "MECH CONSTR", "MECHCONSTRUCTION"))
from <- c("MECH", "MECHANICAL", "CONSTR", "CONSTRUCTION")
to <- c("MECHANICAL", "MECHANICAL", "CONSTRUCTION", "CONSTRUCTION")
gsub2 <- function(pattern, replacement, x, ...) {
for(i in 1:length(pattern)){
reg <- paste0("(^", pattern[i], "$)|(^", pattern[i], " )|( ", pattern[i], "$)|( ", pattern[i], " )")
ifelse(grepl(reg, aa$type),
x <- gsub(pattern[i], replacement[i], x, ...),
aa$type)
}
x
}
aa$title3 <- gsub2(from, to, aa$type)

You can enclose the strings in the from vector in \\< and \\> to match only whole words:
x <- c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH",
"MECH CONSTR", "MECHCONSTRUCTION")
from <- c("\\<MECH\\>", "\\<CONSTR\\>")
to <- c("MECHANICAL", "CONSTRUCTION")
for(i in 1:length(from)){
x <- gsub(from[i], to[i], x)
}
print(x)
# [1] "CONSTRUCTION" "MECHANICAL CONSTRUCTION"
# [3] "MECHANICAL CONSTRUCTION MECHANICAL" "MECHANICAL CONSTRUCTION"
# [5] "MECHCONSTRUCTION"

I use regex (?<=\W|^)MECH(?=\W|$) to get if inside the string contain whole word MECH like this.
Is that what you need?

Just for posterity, other than using the \< \> enclosure, a whole word can be defined as any string ending in a space or end-of-line (\s|$).
gsub("MECH(\\s|$)", "MECHANICAL\\1", aa$type)
The only problem with this approach is that you need to carry over the space or end-of-line that you used as part of the match, hence the encapsulation in parentheses and the backreference (\1).
The \< \> enclosure is superior for this particular question, since you have no special exceptions. However, if you have exceptions, it is better to use a more explicit method. The more tools in your toolbox, the better.

Related

replace every other space with new line

I have strings like this:
a <- "this string has an even number of words"
b <- "this string doesn't have an even number of words"
I want to replace every other space with a new line. So the output would look like this...
myfunc(a)
# "this string\nhas an\neven number\nof words"
myfunc(b)
# "this string\ndoesn't have\nan even\nnumber of\nwords"
I've accomplished this by doing a strsplit, paste-ing a newline on even numbered words, then paste(a, collapse=" ") them back together into one string. Is there a regular expression to use with gsub that can accomplish this?
#Jota suggested a simple and concise way:
myfunc = function(x) gsub("( \\S+) ", "\\1\n", x) # Jota's
myfunc2 = function(x) gsub("([^ ]+ [^ ]+) ", "\\1\n", x) # my idea
lapply(list(a,b), myfunc)
[[1]]
[1] "this string\nhas an\neven number\nof words"
[[2]]
[1] "this string\ndoesn't have\nan even\nnumber of\nwords"
How it works. The idea of "([^ ]+ [^ ]+) " regex is (1) "find two sequences of words/nonspaces with a space between them and a space after them" and (2) "replace the trailing space with a newline".
#Jota's "( \\S+) " is trickier -- it finds any word with a space before and after it and then replaces the trailing space with a newline. This works because the first word that is caught by this is the second word of the string; and the next word caught by it is not the third (since we have already "consumed"/looked at the space in front of the third word when handling the second word), but rather the fourth; and so on.
Oh, and some basic regex stuff.
[^xyz] means any single char except the chars x, y, and z.
\\s is a space, while \\S is anything but a space
x+ means x one or more times
(x) "captures" x, allowing for reference in the replacement, like \\1

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to subdivide my metacharacter expression in my gsub() function. But it does not return anything found.
Task: I want to delete all sections of string that contain either .ST or -XST in my vector of strings.
As you can see below, using one expression works fine. But the | expression simply does not work. I'm following the metacharacter guide on https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What can be the issue? And what caused this issue?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"
It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x) that allows whitespace inside patterns for readability. You can use it with perl=T option:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=T)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
However, you really do not have to use that complex regex here.
If you plan to remove all substrings from ther first hyphen or dot up to the end, you may use the following regex:
[.-].*$
The character class [.-] will match the first . or - symbol and .* wil match all characters up to the end of the string ($).
See IDEONE demo:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
This will find .ST or -XST at the end of the text and substitute it with empty characters string (effectively removing that part). Don't forget that gsub returns modified string, not modifies it in place. You won't see any change until you reassign return value back to some variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression ([.](.*)$ | [-](.*)$'), if not for unnecessary spaces, would remove everything from first dot (.) or dash (-) to end of text. This might be what you want, but not what you said you want.

Extracting hashtags AND attached string elements (IF ANY) with regular expressions AND positive lookarounds and lookbehinds in r

I'd like to create a function in r using regular expressions that extracts hashtags (and one for #'s as well) but checks to see if its a part of a string and return those parts of that string. I'm still picking up hashtags (and #'s) and so I'm assuming that I'm not picking up pure hashtag strings (#word) because this is after using a function to remove URLs, emails, hashtags, and #'s via:
clean.text <- function(x){
x <- gsub("http[^[:space:]]+"," ", x)
x <- gsub("([_+A-Za-z0-9-]+(\\.[_+A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,14}))","", x)
x <- gsub("\\s#[[:alnum:]_]+"," ", x)
x <- gsub("\\s#[^[:space:]]+"," ", x)
x
}
So I'd like to know what parts of the string are attached to the hashtags (and #'s) because I'm still getting hashtags (and #'s) when use the following on my cleaned text.
findHash2 <- function(x){
m <- gregexpr("(#\\w+)", x, perl=TRUE)
w <- unlist(regmatches(x,m))
op <- paste(w,collapse=" ")
return(op)
}
findAT2 <- function(x){
m <- gregexpr("#(\\w+)", x, perl=TRUE)
w <- unlist(regmatches(x,m))
op <- paste(w,collapse=" ")
return(op)
}
Note: again, this is after I apply my clean.text function to my text. Would it be something like this?
findHash1 <- function(x){
m <- gregexpr("(?<=^)#\\w+(?=$)", x, perl=TRUE)
w <- unlist(regmatches(x, m))
return(paste(w, collapse=" "))
}
UPDATE Example
x <- "yp#MonicaSarkar: RT #saultracey: Sun kissed .....#olmpicrings at #towerbridge #london2012 # Tower Bridge http://t.co/wgIutHUl
x <-I don'nt#know #It would %%%%#be #best if#you just.idk#provided/a fewexample#character! strings# #my#&^( 160,000+posts#in #of text #my) #data is#so huge!# (some# that #should match#and some that #shouldn't) and post# the desired#output.#We'll take it from there."
As for the desired output, I guess something like:
[1] yp#MonicaSarkar: #saultracey: .....#olmpicrings
Or in the second example:
[1] don'nt#know if#you 160,000+posts#in %%%%#be fewexample#character!
Ultimately, I'd like to see what's attached to the hash tags.
I'd like to use a function or functions that would extract a hashtag (another function or set of functions for #'s) if part of a string in three scenarios and presented my attempt at the first: one that says it must be preceded and followed by one or more characters, another that matches if only followed by one or more characters and a third that matches only if preceded by one or more characters. That is: one that would match the string hashtag only if it's at the middle not if it's present at the start or at the end of a string, one that would match the string only if it's present at the start, and one that would match the string if it's present at the end.
Would three functions like I discussed need to be created for that type of procedure or could it be combined into one?

R: gsub of exact full string with fixed = T

I am trying to gsub exact FULL string - I know I need to use ^ and $. The problem is that I have special characters in strings (could be [, or .) so I need to use fixed=T. This overrides the ^ and $. Any solution is appreciated.
Need to replace 1st, 2nd element in exact_orig with 1st, 2nd element from exact_change but only if full string is matched from beginning to end.
exact_orig = c("oz","32 oz")
exact_change = c("20 oz","32 ct")
gsub_FixedTrue <- function(i) {
for(k in seq_along(exact_orig)) i = gsub(exact_orig[k],exact_change[k],i,fixed=TRUE)
return(i)
}
Test cases:
print(gsub_FixedTrue("32 oz")) #gives me "32 20 oz" - wrong! Must be "32 ct"
print(gsub_FixedTrue("oz oz")) # gives me "20 oz 20 oz" - wrong! Must remain as "oz oz"
I read a somewhat similar thread, but could not make it work for full string (grep at the beginning of the string with fixed =T in R?)
If you want to exactly match full strings, i don't think you really want to use regular expressions in this case. How about just the match() function
fixedTrue<-function(x) {
m <- match(x, exact_orig)
x[!is.na(m)] <- exact_change[m[!is.na(m)]]
x
}
fixedTrue(c("32 oz","oz oz"))
# [1] "32 ct" "oz oz"

R: Capitalizing everything after a certain character

I would like to capitalize everything in a character vector that comes after the first _. For example the following vector:
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f")
Should come out like this:
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F"
I have been trying to play with regular expressions, but am not able to do this. Any suggestions would be appreciated.
You were very close:
gsub("(_.*)","\\U\\1",x,perl=TRUE)
seems to work. You just needed to use _.* (underscore followed by zero or more other characters) rather than _* (zero or more underscores) ...
To take this apart a bit more:
_.* gives a regular expression pattern that matches an underscore _ followed by any number (including 0) of additional characters; . denotes "any character" and * denotes "zero or more repeats of the previous element"
surrounding this regular expression with parentheses () denotes that it is a pattern we want to store
\\1 in the replacement string says "insert the contents of the first matched pattern", i.e. whatever matched _.*
\\U, in conjunction with perl=TRUE, says "put what follows in upper case" (uppercasing _ has no effect; if we wanted to capitalize everything after (for example) a lower-case g, we would need to exclude the g from the stored pattern and include it in the replacement pattern: gsub("g(.*)","g\\U\\1",x,perl=TRUE))
For more details, search for "replacement" and "capitalizing" in ?gsub (and ?regexp for general information about regular expressions)
gsubfn in the gsubfn package is like gsub except the replacement string can be a function. Here we match _ and everything afterwards feeding the match through toupper :
library(gsubfn)
gsubfn("_.*", toupper, x)
## [1] "NYC_23DF" "BOS_3_RB" "mgh_3_3_F"
Note that this approach involves a particularly simple regular expression.
Simple example using base::strsplit
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f", "a")
myCap <- function(x) {
out <- sapply(x, function(y) {
temp <- unlist(strsplit(y, "_"))
out <- temp[1]
if (length(temp[-1])) {
out <- paste(temp[1], paste(toupper(temp[-1]),
collapse="_"), sep="_")
}
return(out)
})
out
}
> myCap(x)
NYC_23df BOS_3_rb mgh_3_3_f a
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F" "a"
Example using the stringr package
pkg <- "stringr"
if (!require(pkg, character.only=TRUE)) {
install.packages(pkg)
require(pkg, character.only=TRUE)
}
myCap.2 <- function(x) {
out <- sapply(x, function(y) {
idx <- str_locate(y, "_")
if (!all(is.na(idx[1,]))) {
str_sub(y, idx[,1], nchar(y)) <- toupper(str_sub(y, idx[,1], nchar(y)))
}
return(y)
})
out
}
> myCap.2(x)
NYC_23df BOS_3_rb mgh_3_3_f a
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F" "a"