Regular expression to keep some matches, remove others - regex

I'm having a trouble with this regular expression. Consider the following vector.
> vec <- c("new jersey", "south dakota", "virginia:chincoteague",
"washington:whidbey island", "new york:main")
Of those strings that contain a :, I would like to keep only the ones with main after :, resulting in
[1] "new jersey" "south dakota" "new york:main"
So far, I've only been able to get there with this ugly nested nightmare, which is quite obviously far from optimal.
> g1 <- grep(":", vec)
> vec[ -g1[grep("main", grep(":", vec, value = TRUE), invert = TRUE)] ]
# [1] "new jersey" "south dakota" "new york:main"
How can I write a single regular expression to keep :main but remove others containing : ?

Using | (Pick one that contains :main or that does not contains : at all):
> vec <- c("new jersey", "south dakota", "virginia:chincoteague",
+ "washington:whidbey island", "new york:main")
> grep(":main|^[^:]*$", vec)
[1] 1 2 5
> vec[grep(":main|^[^:]*$", vec)]
[1] "new jersey" "south dakota" "new york:main"

You can use this single simple regex:
^[^:]+(?::main.*)?$
See demo
Not sure about the exact R code, but something like
grepl("^[^:]+(?::main.*)?$", subject, perl=TRUE);
Explanation
The ^ anchor asserts that we are at the beginning of the string
The [^:]+ matches all chars that are not a colon
The optional non-capturing group (?::main.*)? matches a colon, main and any chars that follow
The $ anchor asserts that we are at the end of the string

Related

Regexp in R to match everything in between first and last occurene of some specified character

I'd like to match everything between the first and last underscore. I use R.
What I have until now is this:
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')
sub('[^_]*_(.*)_[^_]*', x = p.subject, replacement = '\\1', perl = T)
Where 'bla' is any character except an underscore...
The result I'd like would be something like this:
c(NA, NA, bla, bla_bla)
I can't figure it out! Why does the first pattern match? It shouldn't because the pattern must have 2 underscores! Or do I have to use some kind of lookahead expression?
Your help is very welcome!
You can use gsub:
vec <- gsub("(^[^_]+)_?|_?([^_]+$)", "", p.subject)
vec <- ifelse(nchar(vec) == 0 , NA, vec)
vec
[1] NA NA "bla" "bla_bla"
Data:
dput(p.subject)
c("bla_bla", "bla", "bla_bla_bla", "bla_bla_bla_bla")
Here is another option using str_extract. We use regex lookarounds to extract the pattern between the first and the last occurrence of a specified character i.e. _.
library(stringr)
str_extract(p.subject, "(?<=[^_]{1,30}_).*(?=_[^_]+)")
#[1] NA NA "bla" "bla_bla"
NOTE: We didn't use any ifelse.
data
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')

How to erase all non-letter characters before first letter (R vector of character strings)

I have a vector of character strings:
cities <- c("London", "001 London", "Stockholm", "002 Stockholm")
I need to erase anything in each string that precedes first letter so that I would have:
cities <- c("London", "London", "Stockholm", "Stockholm")
I've tried e.g. this
cities <- sub("^.*?[a-zA-Z]", "", cities)
but that erases the first letter too, which I don't want to happen.
Use
cities <- c("London", "001 London", "Stockholm", "002 Stockholm")
gsub("^\\P{L}*", "", cities, perl=T)
See IDEONE demo
The ^\\P{L}* regex means:
^ - Assert the beginning of the string
\\P{L}* - 0 or more characters other than a letter.
This solution is preferable if you have city names starting with Unicode letters.
Use a negated character class to match all the non-alphabetic characters which exists at the start.
cities <- sub("^[^a-zA-Z]*", "", cities)
or
Use capturing group to capture the first letter character.
cities <- sub("^.*?([a-zA-Z])", "\\1", cities)
Delete number:
gsub('\\d+','',cities)
[1] "London" " London" "Stockholm" " Stockholm"

using toOrdinal to replace numbers with ordinals

I have a sequence of addresses and I am trying to replace numbers with ordinals. Right now I have the following.
library(toOrdinal)
addlist<-c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane" )
numstringc<-gsub("\\D", "", addlist)
numstring <-as.integer(numstringc)
ordstring<-sapply(numstring[!is.na(numstring)], toOrdinal)
ordstring
[1] "1st" "4th" "5th" "43rd"
I want to eventually get a vector that says
[1] "east 1st street", "4th ave", "5th blvd", "plaza", "43rd lane"
but I can't figure out how to make that.
With \\1 you can access the part of the matched expression in paranthesis, but gsub doesn't allow functions in the replacement, so you have to use gsubfn from the package by the same name, which actually doesn't need the \\1 part:
library(gsubfn)
addlist<-c("east 1 street", "4 ave", "5 blvd", "plaza", "43 lane" )
ordstring <- gsubfn("[0-9]+", function (x) toOrdinal(as.integer(x)), addlist)
Alternatively you can use gregexpr and regmatches, to replace them:
m <- gregexpr("[0-9]+", addlist)
regmatches(addlist, m) <- sapply(as.integer(regmatches(addlist,m)), toOrdinary)

Remove trailing and leading spaces and extra internal whitespace with one gsub call

I know you can remove trailing and leading spaces with
gsub("^\\s+|\\s+$", "", x)
And you can remove internal spaces with
gsub("\\s+"," ",x)
I can combine these into one function, but I was wondering if there was a way to do it with just one use of the gsub function
trim <- function (x) {
x <- gsub("^\\s+|\\s+$|", "", x)
gsub("\\s+", " ", x)
}
testString<- " This is a test. "
trim(testString)
Here is an option:
gsub("^ +| +$|( ) +", "\\1", testString) # with Frank's input, and Agstudy's style
We use a capturing group to make sure that multiple internal spaces are replaced by a single space. Change " " to \\s if you expect non-space whitespace you want to remove.
Using a positive lookbehind :
gsub("^ *|(?<= ) | *$",'',testString,perl=TRUE)
# "This is a test."
Explanation :
## "^ *" matches any leading space
## "(?<= ) " The general form is (?<=a)b :
## matches a "b"( a space here)
## that is preceded by "a" (another space here)
## " *$" matches trailing spaces
You can just add \\s+(?=\\s) to your original regex:
gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x, perl=T)
See DEMO
You've asked for a gsub option and gotten good options. There's also rm_white_multiple from "qdapRegex":
> testString<- " This is a test. "
> library(qdapRegex)
> rm_white_multiple(testString)
[1] "This is a test."
If an answer not using gsub is acceptable then the following does it. It does not use any regular expressions:
paste(scan(textConnection(testString), what = "", quiet = TRUE), collapse = " ")
giving:
[1] "This is a test."
You can also use nested gsub. Less elegant than the previous answers tho
> gsub("\\s+"," ",gsub("^\\s+|\\s$","",testString))
[1] "This is a test."

regex match within parenthesis

I'm attempting to use some regular expressions that I made for Python also work with R.
Here is what I have in Python (using the excellent re module), with my expected 3 matches:
import re
line = 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
re.findall('"(.*?)"', line)
# ['First [T]', 'Second [L]', 'Third [1/T]']
Now with R, here is my best attempt:
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line)
regmatches(line, m)[[1]]
# [1] "\"First [T]\"" "\"Second [L]\"" "\"Third [1/T]\""
Why did R match the whole pattern, rather than just within the parenthesis? I was expecting:
[1] "First [T]" "Second [L]" "Third [1/T]"
Furthermore, perl=TRUE didn't make any difference. Is it safe to assume that R's regex does not consider matching only the parenthesis, or is there some trick that I'm missing?
Summary of solution: thanks #flodel, it appears to work well with other patterns too, so it appears to be a good general solution. A compact form of the solution using an input string line and regular expression pattern pat is:
pat <- '"(.*?)"'
sub(pat, "\\1", regmatches(line, gregexpr(pat, line))[[1]])
Furthermore, perl=TRUE should be added to gregexpr if using PCRE features in pat.
If you print m, you'll see gregexpr(..., perl = TRUE) gives you the positions and lengths of matches for a) your full pattern including the leading and closing quotes and b) the captured (.*).
Unfortunately for you, when m is used by regmatches, it use the positions and lengths of the former.
There are two solutions I can think of.
Pass your final output through sub:
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line, perl = TRUE)
z <- regmatches(line, m)[[1]]
sub('"(.*?)"', "\\1", z)
Or use substring using the positions and lengths of the captured expressions:
start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos, end.pos)
To further your understanding, see what happens if your pattern is trying to capture more than one thing. Also see that you can give names to your captures groups (what the doc refers to as Python-style named captures), here "capture1" and "capture2":
m <- gregexpr('"(?P<capture1>.*?) \\[(?P<capture2>.*?)\\]"', line, perl = TRUE)
m
start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos[, "capture1"],
end.pos[, "capture1"])
# [1] "First" "Second" "Third"
substring(line, start.pos[, "capture2"],
end.pos[, "capture2"])
# [1] "T" "L" "1/T"
1) strapplyc in the gsubfn package acts in the way you were expecting:
> library(gsubfn)
> strapplyc(line, '"(.*?)"')[[1]]
[1] "First [T]" "Second [L]" "Third [1/T]"
2) Although it involves delving into m's attributes, its possible to make regmatches work by reconstructing m to refer to the captures rather than the whole match:
at <- attributes( m[[1]] )
m2 <- list( structure( c(at$capture.start), match.length = at$capture.length ) )
regmatches( line, m2 )[[1]]
3) If we knew that the strings always ended in ] and were willing to modify the regular expression then this would work:
> m3 <- gregexpr('[^"]*]', line)
> regmatches( line, m3 )[[1]]
[1] "First [T]" "Second [L]" "Third [1/T]"