R: Capitalizing everything after a certain character - regex

I would like to capitalize everything in a character vector that comes after the first _. For example the following vector:
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f")
Should come out like this:
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F"
I have been trying to play with regular expressions, but am not able to do this. Any suggestions would be appreciated.

You were very close:
gsub("(_.*)","\\U\\1",x,perl=TRUE)
seems to work. You just needed to use _.* (underscore followed by zero or more other characters) rather than _* (zero or more underscores) ...
To take this apart a bit more:
_.* gives a regular expression pattern that matches an underscore _ followed by any number (including 0) of additional characters; . denotes "any character" and * denotes "zero or more repeats of the previous element"
surrounding this regular expression with parentheses () denotes that it is a pattern we want to store
\\1 in the replacement string says "insert the contents of the first matched pattern", i.e. whatever matched _.*
\\U, in conjunction with perl=TRUE, says "put what follows in upper case" (uppercasing _ has no effect; if we wanted to capitalize everything after (for example) a lower-case g, we would need to exclude the g from the stored pattern and include it in the replacement pattern: gsub("g(.*)","g\\U\\1",x,perl=TRUE))
For more details, search for "replacement" and "capitalizing" in ?gsub (and ?regexp for general information about regular expressions)
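As a minimal sketch of the points above, the snippet below runs the capture-group replacement, the variant that keeps the anchor character out of the group, and \\E, which ?gsub documents as ending case conversion when perl=TRUE. The input y and the variable names are illustrative assumptions, not part of the original answer.

```r
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f")

# Upper-case the first underscore and everything after it.
after_underscore <- gsub("(_.*)", "\\U\\1", x, perl = TRUE)

# Keep the anchor character ("g") outside the group so it stays lower-case.
after_g <- gsub("g(.*)", "g\\U\\1", x, perl = TRUE)

# \\E ends the conversion, so only the first captured group is upper-cased.
y <- "abc_def_ghi"  # illustrative input, not from the question
first_part_only <- sub("_([^_]*)(.*)", "_\\U\\1\\E\\2", y, perl = TRUE)

print(after_underscore)
print(after_g)
print(first_part_only)
```

Only the third element of x contains a g, so it is the only one changed by the second call.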

gsubfn in the gsubfn package is like gsub, except that the replacement can be a function. Here we match _ and everything after it, feeding the match through toupper:
library(gsubfn)
gsubfn("_.*", toupper, x)
## [1] "NYC_23DF" "BOS_3_RB" "mgh_3_3_F"
Note that this approach involves a particularly simple regular expression.
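For readers who do not have gsubfn installed, a rough base R counterpart keeps the same simple pattern and passes the matched text through toupper by way of regexpr() and the replacement form of regmatches(). This is a different technique from gsubfn, shown only as a sketch; the extra element "a" is added to show that strings without an underscore are left alone.

```r
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f", "a")

# regexpr() finds the first match of "_.*" in each element (-1 when none).
m <- regexpr("_.*", x)

# Replace only the matched substrings; elements without a match are untouched.
out <- x
regmatches(out, m) <- toupper(regmatches(out, m))
print(out)
```

Because both calls are vectorized, no sapply loop over the elements is needed.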

Simple example using base::strsplit
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f", "a")
myCap <- function(x) {
  out <- sapply(x, function(y) {
    temp <- unlist(strsplit(y, "_"))
    out <- temp[1]
    if (length(temp[-1])) {
      out <- paste(temp[1], paste(toupper(temp[-1]),
                                  collapse="_"), sep="_")
    }
    return(out)
  })
  out
}
> myCap(x)
NYC_23df BOS_3_rb mgh_3_3_f a
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F" "a"
Example using the stringr package
pkg <- "stringr"
if (!require(pkg, character.only=TRUE)) {
install.packages(pkg)
require(pkg, character.only=TRUE)
}
myCap.2 <- function(x) {
  out <- sapply(x, function(y) {
    idx <- str_locate(y, "_")
    if (!all(is.na(idx[1,]))) {
      str_sub(y, idx[,1], nchar(y)) <- toupper(str_sub(y, idx[,1], nchar(y)))
    }
    return(y)
  })
  out
}
> myCap.2(x)
NYC_23df BOS_3_rb mgh_3_3_f a
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F" "a"

Related

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to subdivide my metacharacter expression in my gsub() function. But it does not return anything found.
Task: I want to delete all sections of string that contain either .ST or -XST in my vector of strings.
As you can see below, using one expression works fine. But the | expression simply does not work. I'm following the metacharacter guide on https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What can be the issue? And what caused this issue?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"
It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x) that allows whitespace inside patterns for readability. You can use it with perl=T option:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=T)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
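To make the role of the spaces concrete, here is a small sketch on a shortened version of the data (the three-element vector and the variable names are illustrative). It compares the pattern with literal spaces, the same pattern with the spaces removed, and the (?x) form under perl = TRUE.

```r
d <- c("AAK.ST", "ALIV-SDB.ST", "ATCO-A.ST")

# With the default engine the spaces around "|" are literal characters,
# so neither alternative can match and the input comes back unchanged.
with_spaces <- gsub("[.](.*)$ | [-](.*)$", "", d)

# Removing the spaces makes the alternation behave as intended.
without_spaces <- gsub("[.](.*)$|[-](.*)$", "", d)

# (?x) with perl = TRUE makes PCRE ignore unescaped whitespace in the pattern.
extended_mode <- gsub("(?x)[.](.*)$ | [-](.*)$", "", d, perl = TRUE)

print(with_spaces)
print(without_spaces)
print(extended_mode)
```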
However, you really do not have to use that complex regex here.
If you plan to remove every substring from the first hyphen or dot up to the end, you may use the following regex:
[.-].*$
The character class [.-] matches the first . or - symbol, and .* matches all characters up to the end of the string ($).
See IDEONE demo:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
The pattern below finds .ST or -XST at the end of the text and replaces it with an empty string (effectively removing that part). Remember that gsub returns the modified string rather than modifying it in place, so you won't see any change until you assign the return value back to a variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression ('[.](.*)$ | [-](.*)$'), if not for the unnecessary spaces, would remove everything from the first dot (.) or dash (-) to the end of the text. This might be what you want, but it is not what you said you want.
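The difference between the two readings of the task can be seen side by side in a short sketch (a subset of the vector above; the variable names are illustrative):

```r
strings <- c("AAK.ST", "ALIV-SDB.ST", "ATCO-A.ST", "AAC-XST", "AAD-XSTV")

# Reading 1: remove everything from the first dot or hyphen onwards.
from_first_sep <- gsub("[.-].*$", "", strings)

# Reading 2: remove only a literal ".ST" or "-XST" suffix at the very end.
suffix_only <- gsub("(\\.ST|-XST)$", "", strings)

print(from_first_sep)
print(suffix_only)
```

The second pattern leaves "ALIV-SDB" and "ATCO-A" intact, and it does not touch "AAD-XSTV" because "-XST" is not at the end of that string.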

Regexp to match text with optional text in parenthesis

Given the following vector of strings x
x <- c("hello", "foo_bar", "blah_blub_(bleep)", "blah_(xyz)", "xyz(_$_)")
I am looking for a regexp to extract everything before the optional parenthesis (and its content). So the final result for the above vector should be:
c("hello", "foo_bar", "blah_blub", "blah", "xyz")
I came up with the following regexp which, however, does not work (why?):
R> sub("^(.*)[_?\\(.*\\)]?$", "\\1", x)
[1] "hello" "foo_bar" "blah_blub_(bleep)" "blah_(xyz)" "xyz(_$_)"
Any help is appreciated!
We can match the pattern of zero or more _ followed by ( followed by zero or more characters until the end of the string, and replace it with ''.
sub('_*\\(.*$', '', x)
#[1] "hello" "foo_bar" "blah_blub" "blah" "xyz"
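On the "why?" in the question: square brackets define a character class, so [_?\\(.*\\)]? stands for at most one character from a set rather than an optional "_(...)" group, and the greedy (.*) in front of it has already consumed the whole string. The sketch below checks that diagnosis and shows one possible grouped rewrite; the names attempt and grouped, and the rewrite itself, are illustrative and not taken from the answer.

```r
x <- c("hello", "foo_bar", "blah_blub_(bleep)", "blah_(xyz)", "xyz(_$_)")

# Original idea: the bracketed part is a character class, so group 1
# captures the entire string and nothing is removed.
attempt <- sub("^(.*)[_?\\(.*\\)]?$", "\\1", x)

# Grouped rewrite: a lazy prefix followed by an optional "_(...)" group.
grouped <- sub("^(.*?)(_?\\(.*\\))?$", "\\1", x, perl = TRUE)

print(attempt)
print(grouped)
```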

Extracting hashtags AND attached string elements (IF ANY) with regular expressions AND positive lookarounds and lookbehinds in r

I'd like to create a function in R using regular expressions that extracts hashtags (and one for @'s as well) but checks whether each one is part of a longer string and returns those parts of the string. I'm still picking up hashtags (and @'s), so I'm assuming that what I'm picking up are not pure hashtag strings (#word), because this is after using a function to remove URLs, emails, hashtags, and @'s via:
clean.text <- function(x){
  x <- gsub("http[^[:space:]]+"," ", x)
  x <- gsub("([_+A-Za-z0-9-]+(\\.[_+A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,14}))","", x)
  x <- gsub("\\s#[[:alnum:]_]+"," ", x)
  x <- gsub("\\s@[^[:space:]]+"," ", x)
  x
}
So I'd like to know what parts of the string are attached to the hashtags (and @'s), because I'm still getting hashtags (and @'s) when I use the following on my cleaned text.
findHash2 <- function(x){
  m <- gregexpr("(#\\w+)", x, perl=TRUE)
  w <- unlist(regmatches(x,m))
  op <- paste(w,collapse=" ")
  return(op)
}
findAT2 <- function(x){
  m <- gregexpr("@(\\w+)", x, perl=TRUE)
  w <- unlist(regmatches(x,m))
  op <- paste(w,collapse=" ")
  return(op)
}
Note: again, this is after I apply my clean.text function to my text. Would it be something like this?
findHash1 <- function(x){
  m <- gregexpr("(?<=^)#\\w+(?=$)", x, perl=TRUE)
  w <- unlist(regmatches(x, m))
  return(paste(w, collapse=" "))
}
UPDATE Example
x <- "yp#MonicaSarkar: RT #saultracey: Sun kissed .....#olmpicrings at #towerbridge #london2012 # Tower Bridge http://t.co/wgIutHUl"
x <- "I don'nt#know #It would %%%%#be #best if#you just.idk#provided/a fewexample#character! strings# #my#&^( 160,000+posts#in #of text #my) #data is#so huge!# (some# that #should match#and some that #shouldn't) and post# the desired#output.#We'll take it from there."
As for the desired output, I guess something like:
[1] yp#MonicaSarkar: #saultracey: .....#olmpicrings
Or in the second example:
[1] don'nt#know if#you 160,000+posts#in %%%%#be fewexample#character!
Ultimately, I'd like to see what's attached to the hash tags.
I'd like to use a function or functions that would extract a hashtag when it is part of a longer string (with another function or set of functions for @'s), in three scenarios; my attempt above addresses the first. One function would match only if the hashtag is both preceded and followed by one or more characters, another only if it is followed by one or more characters, and a third only if it is preceded by one or more characters. That is: one that matches the hashtag only when it sits in the middle of a string (not at the start or the end), one that matches only when it is at the start, and one that matches only when it is at the end.
Would three functions like I discussed need to be created for that type of procedure or could it be combined into one?
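One possible way to combine the three cases into a single function is sketched below. It is only an illustration under stated assumptions: tokens are taken to be whitespace-delimited, the function name classify_hash_tokens is hypothetical, and the position of # is judged within each token. A token that contains several # characters can fall into more than one category.

```r
# Hypothetical helper: split on whitespace, keep tokens containing "#",
# and sort them by where the "#" sits inside the token.
classify_hash_tokens <- function(x) {
  tokens <- unlist(strsplit(x, "[[:space:]]+"))
  tokens <- tokens[grepl("#", tokens, fixed = TRUE)]
  list(
    middle = tokens[grepl("[^#]#[^#]", tokens)],  # text on both sides of "#"
    start  = tokens[grepl("^#[^#]", tokens)],     # "#" first, text after it
    end    = tokens[grepl("[^#]#$", tokens)]      # text first, "#" last
  )
}

res <- classify_hash_tokens("if#you #best strings# plain")
print(res)
```

The same structure would apply to @ by swapping the character in the three patterns.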

r regex Lookbehind Lookahead issue

I try to extract passages like 44.11.36.00-1 (precisely, nn.nn.nn.nn-n, where n stands for any number from 0-9) from text in R.
I want to extract passages if they are "stuck" to non-digit characters:
44.11.36.00-1 extracted from nsfghstighsl44.11.36.00-1vsdfgh is OK
44.11.36.00-1 extracted from fa0044.11.36.00-1000 is NOT
I have read that str_extract_all is not working with Lookbehind and Lookahead expressions, so I sadly came back to grep, but cannot deal with it:
> pattern1 <- "(?<![0-9]{1})[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1}(?![0-9]{1})"
> grep(pattern1, "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 ", perl=TRUE, value = TRUE)
[1] "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 "
which is not the result I expected.
I thought that:
(?<![0-9]{1}) means "match the expression only if it is not preceded by a digit"
[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1} stands for the expression I am looking for
(?![0-9]{1}) means "match the expression only if it is not followed by a digit"
You don't actually need lookahead or lookbehind with this approach. Just parenthesize the portion you want extracted:
library(gsubfn)
x <- c("nsfghstighsl44.11.36.00-1vsdfgh", "fa0044.11.36.00-1000") # test data
pat <- "(^|\\D)(\\d{2}[.]\\d{2}[.]\\d{2}[.]\\d{2}-\\d)(\\D|$)"
strapply(x, pat, ~ ..2, simplify = c)
## "44.11.36.00-1"
Note that ~ ..2 is short for the function function(...) ..2 which means grab the match to the second parenthesized portion in the regular expression. We could also have written function(x, y, z) y or x + y + z ~ y .
Note: The question seems to say that a non-digit must come directly before and after the string, but based on comments that have since disappeared, what was really wanted was that the string be either at the beginning or just after a non-digit, and either at the end or followed by a non-digit. The answer has been modified accordingly.
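The same "grab the second parenthesized group" idea can be approximated in base R with regexec() and regmatches() instead of gsubfn's strapply. This is a sketch of a different technique, not the answer's method; the variable names are illustrative, and elements without a match are reported as NA here rather than dropped.

```r
x <- c("nsfghstighsl44.11.36.00-1vsdfgh", "fa0044.11.36.00-1000")
pat <- "(^|\\D)(\\d{2}[.]\\d{2}[.]\\d{2}[.]\\d{2}-\\d)(\\D|$)"

# regexec() records the full match plus each parenthesized group.
groups <- regmatches(x, regexec(pat, x, perl = TRUE))

# Element 1 is the full match, so the second group is element 3.
codes <- vapply(groups,
                function(g) if (length(g) >= 3) g[3] else NA_character_,
                character(1))
print(codes)
```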
As @Roland said in his comment, you need to use regmatches instead of grep:
> s <- "nsfghstighsl44.11.36.00-1vsdfgh"
> m <- gregexpr("(?<![0-9]{1})[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1}(?![0-9]{1})", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "44.11.36.00-1"
A reduced one,
> x <- c('nsfghstighsl44.11.36.00-1vsdfgh', 'fa0044.11.36.00-1000')
> m <- gregexpr("(?<!\\d)\\d{2}\\.\\d{2}\\.\\d{2}\\.\\d{2}-\\d(?!\\d)", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "44.11.36.00-1"

[[2]]
character(0)
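The reason grep did not give the expected result is that grep(value = TRUE) returns every whole element that contains a match, whereas gregexpr() with regmatches() returns only the matched substrings. A short sketch using the string from the question (variable names are illustrative):

```r
txt <- "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 "
pattern1 <- "(?<![0-9])[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9](?![0-9])"

# grep(value = TRUE) returns each whole element that contains a match.
whole <- grep(pattern1, txt, perl = TRUE, value = TRUE)

# gregexpr() + regmatches() return only the matched substrings, as a list
# with one entry per input element; unlist() flattens that list.
m <- gregexpr(pattern1, txt, perl = TRUE)
found <- unlist(regmatches(txt, m))

print(whole)
print(found)
```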

Regular expression to find and replace conditionally

I need to replace string A with string B only when string A is a whole word (e.g. "MECH"); I don't want to make the replacement when A is part of a longer string (e.g. "MECHANICAL"). So far, I have a grepl() which checks whether string A is a whole word, but I cannot figure out how to make the replacement. I have added an ifelse() with the idea of making the gsub() replacement when grepl() returns TRUE, and otherwise not replacing. Any suggestions? Please see the code below. Thanks.
aa <- data.frame(type = c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH", "MECH CONSTR", "MECHCONSTRUCTION"))
from <- c("MECH", "MECHANICAL", "CONSTR", "CONSTRUCTION")
to <- c("MECHANICAL", "MECHANICAL", "CONSTRUCTION", "CONSTRUCTION")
gsub2 <- function(pattern, replacement, x, ...) {
  for(i in 1:length(pattern)){
    reg <- paste0("(^", pattern[i], "$)|(^", pattern[i], " )|( ", pattern[i], "$)|( ", pattern[i], " )")
    ifelse(grepl(reg, aa$type),
           x <- gsub(pattern[i], replacement[i], x, ...),
           aa$type)
  }
  x
}
aa$title3 <- gsub2(from, to, aa$type)
You can enclose the strings in the from vector in \\< and \\> to match only whole words:
x <- c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH",
"MECH CONSTR", "MECHCONSTRUCTION")
from <- c("\\<MECH\\>", "\\<CONSTR\\>")
to <- c("MECHANICAL", "CONSTRUCTION")
for(i in seq_along(from)){
  x <- gsub(from[i], to[i], x)
}
print(x)
# [1] "CONSTRUCTION" "MECHANICAL CONSTRUCTION"
# [3] "MECHANICAL CONSTRUCTION MECHANICAL" "MECHANICAL CONSTRUCTION"
# [5] "MECHCONSTRUCTION"
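A variation on the loop above, as a sketch: the boundaries can be added programmatically so the lookup table holds plain abbreviations. The named vector expansions is a hypothetical structure introduced for illustration, and \\b (which matches at either edge of a word) is used in place of \\< and \\>.

```r
x <- c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH",
       "MECH CONSTR", "MECHCONSTRUCTION")

# Hypothetical lookup table: names are abbreviations, values are expansions.
expansions <- c(MECH = "MECHANICAL", CONSTR = "CONSTRUCTION")

for (abbr in names(expansions)) {
  # Wrap each abbreviation in word boundaries before substituting.
  x <- gsub(paste0("\\b", abbr, "\\b"), expansions[[abbr]], x, perl = TRUE)
}
print(x)
```

"MECHCONSTRUCTION" is left alone because neither abbreviation appears there as a whole word.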
I use the regex (?<=\W|^)MECH(?=\W|$) to check whether the string contains MECH as a whole word.
Is that what you need?
Just for posterity, other than using the \< \> enclosure, a whole word can be defined as any string ending in a space or end-of-line (\s|$).
gsub("MECH(\\s|$)", "MECHANICAL\\1", aa$type)
The only problem with this approach is that you need to carry over the space or end-of-line that you used as part of the match, hence the encapsulation in parentheses and the backreference (\1).
The \< \> enclosure is superior for this particular question, since you have no special exceptions. However, if you have exceptions, it is better to use a more explicit method. The more tools in your toolbox, the better.
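One caveat worth checking before relying on MECH(\\s|$): it constrains only what follows the word, not what precedes it. The sketch below uses made-up inputs (including the hypothetical "BIOMECH") to show that, and contrasts it with negative lookarounds under perl = TRUE, which test both sides without consuming any characters.

```r
type <- c("MECH CONSTR", "BIOMECH CONSTR", "MECH MECH", "MECHANICAL")

# Trailing check only: "BIOMECH" is also rewritten because nothing
# constrains what comes before "MECH".
trailing_only <- gsub("MECH(\\s|$)", "MECHANICAL\\1", type)

# Negative lookarounds require that no word character touches either side.
both_sides <- gsub("(?<!\\w)MECH(?!\\w)", "MECHANICAL", type, perl = TRUE)

print(trailing_only)
print(both_sides)
```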