R regex lookbehind/lookahead issue

I am trying to extract passages like 44.11.36.00-1 (precisely, nn.nn.nn.nn-n, where n stands for any digit 0-9) from text in R.
I only want to extract a passage if it is "stuck" to non-digit characters:
44.11.36.00-1 extracted from nsfghstighsl44.11.36.00-1vsdfgh is OK
44.11.36.00-1 extracted from fa0044.11.36.00-1000 is NOT
I have read that str_extract_all does not work with lookbehind and lookahead expressions, so I reluctantly came back to grep, but I cannot get it to work:
> pattern1 <- "(?<![0-9]{1})[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1}(?![0-9]{1})"
> grep(pattern1, "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 ", perl=TRUE, value = TRUE)
[1] "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 "
which is not the result I expected.
I thought that:
(?<![0-9]{1}) means "match an expression that is not preceded by a digit"
[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1} stands for the expression I am looking for
(?![0-9]{1}) means "match an expression that is not followed by a digit"

You don't actually need lookahead or lookbehind with this approach. Just parenthesize the portion you want extracted:
library(gsubfn)
x <- c("nsfghstighsl44.11.36.00-1vsdfgh", "fa0044.11.36.00-1000") # test data
pat <- "(^|\\D)(\\d{2}[.]\\d{2}[.]\\d{2}[.]\\d{2}-\\d)(\\D|$)"
strapply(x, pat, ~ ..2, simplify = c)
## "44.11.36.00-1"
Note that ~ ..2 is short for the function function(...) ..2, which means grab the match to the second parenthesized portion of the regular expression. We could also have written function(x, y, z) y or x + y + z ~ y.
Note: The question seems to say that a non-number must come directly before and after the string, but based on comments that have since disappeared it appears that what was really wanted was that the string either be at the beginning or come just after a non-number, and either be at the end or be followed by a non-number. The answer has been modified accordingly.
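If you prefer not to pull in a package, roughly the same capture-group idea can be sketched with base R's regexec() and regmatches() (regexec() returns the full match followed by the capture groups, so the wanted group is element 3 of each result):
x <- c("nsfghstighsl44.11.36.00-1vsdfgh", "fa0044.11.36.00-1000")
m <- regexec("(^|\\D)(\\d{2}[.]\\d{2}[.]\\d{2}[.]\\d{2}-\\d)(\\D|$)", x, perl = TRUE)
sapply(regmatches(x, m), function(g) if (length(g)) g[3] else NA_character_)
## [1] "44.11.36.00-1" NA
Note that regexec() only reports the first match per element, which is enough for this example.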

As @Roland said in his comment, you need to use regmatches instead of grep:
> s <- "nsfghstighsl44.11.36.00-1vsdfgh"
> m <- gregexpr("(?<![0-9]{1})[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1}(?![0-9]{1})", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "44.11.36.00-1"
A more concise version:
> x <- c('nsfghstighsl44.11.36.00-1vsdfgh', 'fa0044.11.36.00-1000')
> m <- gregexpr("(?<!\\d)\\d{2}\\.\\d{2}\\.\\d{2}\\.\\d{2}-\\d(?!\\d)", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "44.11.36.00-1"

[[2]]
character(0)
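Since regmatches() applied to a gregexpr() result returns a list (one character vector per element of x, with character(0) for elements that have no match), you can flatten it with unlist() if all you want is the matches themselves:
unlist(regmatches(x, m))
## [1] "44.11.36.00-1"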

Related

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to subdivide my pattern into alternative subpatterns in my gsub() function, but it does not find anything.
Task: I want to delete all sections of string that contain either .ST or -XST in my vector of strings.
As you can see below, using a single expression works fine, but the | alternation simply does not work. I'm following the metacharacter guide at https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What can be the issue? And what caused this issue?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"
It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x), which allows whitespace inside patterns for readability. In R you can enable it with the (?x) inline modifier and the perl=TRUE option:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=TRUE)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
However, you really do not have to use that complex regex here.
If you plan to remove all substrings from the first hyphen or dot up to the end, you may use the following regex:
[.-].*$
The character class [.-] will match the first . or - symbol, and .* will match all characters up to the end of the string ($).
For example:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
This will find .ST or -XST at the end of the text and substitute it with an empty string (effectively removing that part). Don't forget that gsub returns the modified string rather than modifying it in place; you won't see any change until you reassign the return value to a variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression, [.](.*)$ | [-](.*)$, were it not for the unnecessary spaces, would remove everything from the first dot (.) or dash (-) to the end of the text. This might be what you want, but it is not what you said you want.
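For completeness, a quick sketch showing that the alternation itself works once the literal spaces are removed (same d as above, default regex engine):
gsub('[.](.*)$|[-](.*)$', '', d)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"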

Extracting hashtags AND attached string elements (IF ANY) with regular expressions AND positive lookarounds and lookbehinds in r

I'd like to create a function in R using regular expressions that extracts hashtags (and another one for @'s as well), but that checks whether the tag is part of a longer string and returns those parts of the string. I'm still picking up hashtags (and @'s), and I'm assuming these are not pure hashtag strings (#word), because this comes after using a function that removes URLs, emails, hashtags, and @'s:
clean.text <- function(x){
  x <- gsub("http[^[:space:]]+", " ", x)  # URLs
  x <- gsub("([_+A-Za-z0-9-]+(\\.[_+A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,14}))", "", x)  # emails
  x <- gsub("\\s#[[:alnum:]_]+", " ", x)  # hashtags
  x <- gsub("\\s@[^[:space:]]+", " ", x)  # @'s
  x
}
So I'd like to know which parts of the string are attached to the hashtags (and @'s), because I'm still getting hashtags (and @'s) when I use the following on my cleaned text.
findHash2 <- function(x){
  m <- gregexpr("(#\\w+)", x, perl=TRUE)
  w <- unlist(regmatches(x, m))
  op <- paste(w, collapse=" ")
  return(op)
}
findAT2 <- function(x){
  m <- gregexpr("@(\\w+)", x, perl=TRUE)
  w <- unlist(regmatches(x, m))
  op <- paste(w, collapse=" ")
  return(op)
}
Note: again, this is after I apply my clean.text function to my text. Would it be something like this?
findHash1 <- function(x){
  m <- gregexpr("(?<=^)#\\w+(?=$)", x, perl=TRUE)
  w <- unlist(regmatches(x, m))
  return(paste(w, collapse=" "))
}
UPDATE: Example
x <- "yp#MonicaSarkar: RT #saultracey: Sun kissed .....#olmpicrings at #towerbridge #london2012 # Tower Bridge http://t.co/wgIutHUl"
x <- "I don'nt#know #It would %%%%#be #best if#you just.idk#provided/a fewexample#character! strings# #my#&^( 160,000+posts#in #of text #my) #data is#so huge!# (some# that #should match#and some that #shouldn't) and post# the desired#output.#We'll take it from there."
As for the desired output, I guess something like:
[1] yp#MonicaSarkar: #saultracey: .....#olmpicrings
Or in the second example:
[1] don'nt#know if#you 160,000+posts#in %%%%#be fewexample#character!
Ultimately, I'd like to see what's attached to the hash tags.
I'd like to use a function or functions that would extract a hashtag (and another function, or set of functions, for @'s) when it is part of a longer string, covering three scenarios; I presented my attempt at the first above. One version should match only if the tag is both preceded and followed by one or more characters (i.e. it sits in the middle of a string, not at the start or the end), another only if it is followed by one or more characters (i.e. it sits at the start), and a third only if it is preceded by one or more characters (i.e. it sits at the end).
Would three functions like I discussed need to be created for that type of procedure or could it be combined into one?
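For what it's worth, here is a minimal sketch of one combined approach, assuming "attached" simply means that the tag shares a whitespace-delimited token with other non-space characters (the helper name is mine, not from the question):
findAttachedHash <- function(x){
  # every whitespace-delimited token that contains a hashtag ...
  m <- gregexpr("\\S*#\\w+\\S*", x, perl=TRUE)
  w <- unlist(regmatches(x, m))
  # ... minus the "pure" tags (#word with nothing attached)
  paste(w[!grepl("^#\\w+$", w)], collapse=" ")
}
On the first example string this should give "yp#MonicaSarkar: #saultracey: .....#olmpicrings", with the pure #towerbridge and #london2012 dropped. Separate start/middle/end variants could be built the same way by swapping in lookaround patterns such as (?<=\S)#\w+(?!\S) ("preceded but not followed"), so whether one function or three is needed mostly depends on how the results are to be grouped.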

How to fill gap between two characters with regex

I have a data set like below. I would like to replace all dots between two 1's with 1's, as shown in the desired.result. Can I do this with regex in base R?
I tried:
regexpr("^1\\.1$", my.data$my.string, perl = TRUE)
Here is a solution in C#: Characters between two exact characters
Thank you for any suggestions.
my.data <- read.table(text='
my.string state
................1...............1. A
......1..........................1 A
.............1.....2.............. B
......1.................1...2..... B
....1....2........................ B
1...2............................. C
..........1....................1.. C
.1............................1... C
.................1...........1.... C
........1....2.................... C
......1........................1.. C
....1....1...2.................... D
......1....................1...... D
.................1...2............ D
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
desired.result <- read.table(text='
my.string state
................11111111111111111. A
......1111111111111111111111111111 A
.............1.....2.............. B
......1111111111111111111...2..... B
....1....2........................ B
1...2............................. C
..........1111111111111111111111.. C
.111111111111111111111111111111... C
.................1111111111111.... C
........1....2.................... C
......11111111111111111111111111.. C
....111111...2.................... D
......1111111111111111111111...... D
.................1...2............ D
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
Below is an option using gsub with the \G feature and lookaround assertions.
> gsub('(?:1|\\G(?<!^))\\K\\.(?=\\.*1)', '1', my.data$my.string, perl = TRUE)
# [1] "................11111111111111111." "......1111111111111111111111111111"
# [3] ".............1.....2.............." "......1111111111111111111...2....."
# [5] "....1....2........................" "1...2............................."
# [7] "..........1111111111111111111111.." ".111111111111111111111111111111..."
# [9] ".................1111111111111...." "........1....2...................."
# [11] "......11111111111111111111111111.." "....111111...2...................."
# [13] "......1111111111111111111111......" ".................1...2............"
The \G feature is an anchor that can match at one of two positions: the start of the string, or the position at the end of the previous match. Since it seems you want to avoid the dots at the start of the string, we use the lookaround assertion \G(?<!^) to exclude the start-of-string position.
The \K escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included.
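To see the \G behaviour in isolation, here is a tiny sketch: the run of a's at the start is replaced because each match begins exactly where the previous one ended, but the a after the gap is left alone.
gsub("\\Ga", "-", "aaab aaa", perl = TRUE)
## [1] "---b aaa"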
Using gsubfn, the first argument is a regular expression which matches the 1's and the characters between the 1's and captures the latter. The second argument is a function, expressed in formula notation, which uses gsub to replace each character in the captured string with 1:
library(gsubfn)
transform(my.data, my.string = gsubfn("1(.*)1", ~ gsub(".", 1, x), my.string))
If there can be multiple pairs of 1's in a string then use "1(.*?)1" as the regular expression instead.
The regular expression here, 1(.*)1, is simple enough to be understood directly.
Here is an option that uses a relatively simple regex and the standard combination of gregexpr(), regmatches(), and regmatches<-() to identify, extract, operate on, and then replace substrings matching that regex.
## Copy the character vector
x <- my.data$my.string
## Find sequences of "."s bracketed on either end by a "1"
m <- gregexpr("(?<=1)\\.+(?=1)", x, perl=TRUE)
## Standard template for operating on and replacing matched substrings
regmatches(x,m) <- sapply(regmatches(x,m), function(X) gsub(".", "1", X))
## Check that it worked
head(x)
# [1] "................11111111111111111." "......1111111111111111111111111111"
# [3] ".............1.....2.............." "......1111111111111111111...2....."
# [5] "....1....2........................" "1...2............................."

Regular expression to find and replace conditionally

I need to replace string A with string B, but only when string A is a whole word (e.g. "MECH"); I don't want to make the replacement when A is part of a longer string (e.g. "MECHANICAL"). So far, I have a grepl() that checks whether string A appears as a whole word, but I cannot figure out how to make the replacement. I have added an ifelse() with the idea of making the gsub() replacement when grepl() returns TRUE and otherwise not replacing. Any suggestions? Please see the code below. Thanks.
aa <- data.frame(type = c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH", "MECH CONSTR", "MECHCONSTRUCTION"))
from <- c("MECH", "MECHANICAL", "CONSTR", "CONSTRUCTION")
to <- c("MECHANICAL", "MECHANICAL", "CONSTRUCTION", "CONSTRUCTION")
gsub2 <- function(pattern, replacement, x, ...) {
  for(i in 1:length(pattern)){
    reg <- paste0("(^", pattern[i], "$)|(^", pattern[i], " )|( ", pattern[i], "$)|( ", pattern[i], " )")
    ifelse(grepl(reg, aa$type),
           x <- gsub(pattern[i], replacement[i], x, ...),
           aa$type)
  }
  x
}
aa$title3 <- gsub2(from, to, aa$type)
You can enclose the strings in the from vector in \\< and \\> to match only whole words:
x <- c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH",
"MECH CONSTR", "MECHCONSTRUCTION")
from <- c("\\<MECH\\>", "\\<CONSTR\\>")
to <- c("MECHANICAL", "CONSTRUCTION")
for(i in 1:length(from)){
  x <- gsub(from[i], to[i], x)
}
print(x)
# [1] "CONSTRUCTION" "MECHANICAL CONSTRUCTION"
# [3] "MECHANICAL CONSTRUCTION MECHANICAL" "MECHANICAL CONSTRUCTION"
# [5] "MECHCONSTRUCTION"
I use the regex (?<=\W|^)MECH(?=\W|$), which matches MECH only when it appears in the string as a whole word.
Is that what you need?
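In R that would look something like the following (a sketch on the example column; perl = TRUE is required for the lookarounds):
gsub("(?<=\\W|^)MECH(?=\\W|$)", "MECHANICAL", aa$type, perl = TRUE)
## [1] "CONSTR"                             "MECHANICAL CONSTRUCTION"
## [3] "MECHANICAL CONSTRUCTION MECHANICAL" "MECHANICAL CONSTR"
## [5] "MECHCONSTRUCTION"
Like the other single-pattern answers, this only handles the MECH/MECHANICAL pair; the CONSTR/CONSTRUCTION pair would need its own pass.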
Just for posterity, other than using the \< \> enclosure, a whole word can be defined as any string ending in a space or end-of-line (\s|$).
gsub("MECH(\\s|$)", "MECHANICAL\\1", aa$type)
The only problem with this approach is that you need to carry over the space or end-of-line that you used as part of the match, hence the encapsulation in parentheses and the backreference (\1).
The \< \> enclosure is superior for this particular question, since you have no special exceptions. However, if you have exceptions, it is better to use a more explicit method. The more tools in your toolbox, the better.

R: Capitalizing everything after a certain character

I would like to capitalize everything in a character vector that comes after the first _. For example the following vector:
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f")
Should come out like this:
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F"
I have been trying to play with regular expressions, but am not able to do this. Any suggestions would be appreciated.
You were very close:
gsub("(_.*)","\\U\\1",x,perl=TRUE)
seems to work. You just needed to use _.* (underscore followed by zero or more other characters) rather than _* (zero or more underscores) ...
To take this apart a bit more:
_.* gives a regular expression pattern that matches an underscore _ followed by any number (including 0) of additional characters; . denotes "any character" and * denotes "zero or more repeats of the previous element"
surrounding this regular expression with parentheses () denotes that it is a pattern we want to store
\\1 in the replacement string says "insert the contents of the first matched pattern", i.e. whatever matched _.*
\\U, in conjunction with perl=TRUE, says "put what follows in upper case" (uppercasing _ has no effect; if we wanted to capitalize everything after (for example) a lower-case g, we would need to exclude the g from the stored pattern and include it in the replacement pattern: gsub("g(.*)","g\\U\\1",x,perl=TRUE))
For more details, search for "replacement" and "capitalizing" in ?gsub (and ?regexp for general information about regular expressions)
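As a quick check, applying that call to the example vector should give (sketch):
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f")
gsub("(_.*)", "\\U\\1", x, perl = TRUE)
## [1] "NYC_23DF" "BOS_3_RB" "mgh_3_3_F"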
gsubfn in the gsubfn package is like gsub except the replacement string can be a function. Here we match _ and everything after it, feeding the match through toupper:
library(gsubfn)
gsubfn("_.*", toupper, x)
## [1] "NYC_23DF" "BOS_3_RB" "mgh_3_3_F"
Note that this approach involves a particularly simple regular expression.
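Another base-R take, a sketch reusing the regexpr()/regmatches<-() combination from the gap-filling question above; elements without an underscore are left untouched:
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f", "a")
m <- regexpr("_.*$", x)                        # first "_" through end of string, or -1 if absent
regmatches(x, m) <- toupper(regmatches(x, m))  # uppercase only the matched parts
x
## [1] "NYC_23DF"  "BOS_3_RB"  "mgh_3_3_F" "a"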
Simple example using base::strsplit
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f", "a")
myCap <- function(x) {
  out <- sapply(x, function(y) {
    temp <- unlist(strsplit(y, "_"))
    out <- temp[1]
    if (length(temp[-1])) {
      out <- paste(temp[1], paste(toupper(temp[-1]), collapse="_"), sep="_")
    }
    return(out)
  })
  out
}
> myCap(x)
NYC_23df BOS_3_rb mgh_3_3_f a
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F" "a"
Example using the stringr package
pkg <- "stringr"
if (!require(pkg, character.only=TRUE)) {
  install.packages(pkg)
  require(pkg, character.only=TRUE)
}
myCap.2 <- function(x) {
  out <- sapply(x, function(y) {
    idx <- str_locate(y, "_")
    if (!all(is.na(idx[1,]))) {
      str_sub(y, idx[,1], nchar(y)) <- toupper(str_sub(y, idx[,1], nchar(y)))
    }
    return(y)
  })
  out
}
> myCap.2(x)
NYC_23df BOS_3_rb mgh_3_3_f a
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F" "a"