Grep in R matching non-digits - regex

I need to get the non-digit part of a character string. I have a problem with this regex in R (which, according to regexpal, should work):
grep("[\\D]+", "PC 17610", value = TRUE, perl = F)
It should return "PC ", but it returns character(0).
Other test cases:
grep("[\\D]+", "STON/O2. 3101282 ", value = TRUE, perl = F)
# should return "STON/O2."
grep("[\\D]+", "S.C./A.4. 23567", value = TRUE, perl = F)
# should return "S.C./A.4."
grep("[\\D]+", "C.A. 31026", value = TRUE, perl = F)
# should return "C.A."
Update:
The job is to divide the column "Ticket" (from the Titanic disaster dataset) into "TicketNumber" and "TicketSeries" columns. As of now, Ticket holds values such as "A/5 21171", "PC 17599", "STON/O2. 3101282", and "113803". So for the first record, TicketNumber should be "21171" and TicketSeries "A/5", and so on for the remaining records.
For the record "113803", TicketNumber should be "113803" and TicketSeries NA.
Help appreciated,
Thanks!

Use sub instead, utilizing the \S regex token to match any non-whitespace characters.
x <- c('PC 17610', 'STON/O2. 3101282 ', 'S.C./A.4. 23567', 'C.A. 31026')
sub('(\\S+).*', '\\1', x)
# [1] "PC" "STON/O2." "S.C./A.4." "C.A."
EDIT
Otherwise, if you want to return NA for invalid or empty matches, I suppose you could do ...
x <- c('PC 17610', 'STON/O2. 3101282 ', 'S.C./A.4. 23567', 'C.A. 31026', '31026')
r <- regmatches(x, gregexpr('^\\S+(?=\\s+)', x, perl=T))
unlist({r[sapply(r, length)==0] <- NA; r})
# [1] "PC" "STON/O2." "S.C./A.4." "C.A." NA

You can use str_extract
library(stringr)
str_extract(x, '\\S+(?=\\s+)')
#[1] "PC" "STON/O2." "S.C./A.4." "C.A." NA
data
x <- c('PC 17610', 'STON/O2. 3101282 ', 'S.C./A.4. 23567',
       'C.A. 31026', '31026')
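The matching ticket numbers can be pulled out the same way; a sketch with a lookahead that tolerates trailing whitespace:
str_extract(x, '\\d+(?=\\s*$)')
#[1] "17610"   "3101282" "23567"   "31026"   "31026"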

Related

Replace element in string after first occurrence

I wish to replace all 2's in a string after the first occurrence of a 2, ideally using regex in base R. This seems like it must be a duplicate, but I cannot locate the answer.
Here is an example:
my.data <- read.table(text='
my.string
.1.222.2.2
..1..1..2.
1.1.2.2...
.222.232..
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
my.data
desired.result <- read.table(text='
my.string
.1.2......
..1..1..2.
1.1.2.....
.2....3...
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
desired.result
my.last.2 <- c(4, 9, 5, 2, NA)
my.last.2
Thank you for any assistance.
This appears to match your desired output:
> gsub(pattern = "(?<=2)(.*?)2",
+      replacement = "\\1\\.",
+      x = my.data$my.string,
+      perl = TRUE)
[1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
This is literally a direct modification of this answer to a very similar question, adapted to make it R-specific. I'll be honest, I don't quite understand this regex, so use (and up-vote) with caution.
This works, but is probably inefficient:
with(my.data, gsub("#", "2", gsub("2", ".", sub("2", "#", my.string))))
# [1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
Approach: Use sub to match only the first occurrence and change it to # (or some other placeholder character which doesn't show up elsewhere in my.string), then use gsub to replace all remaining 2s, then gsub # back into 2.
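To see how the three passes compose, here is each intermediate result for the first problem string (the variable names are just for illustration):
s <- ".1.222.2.2"
step1 <- sub("2", "#", s)       # first 2 becomes the placeholder
step1
# [1] ".1.#22.2.2"
step2 <- gsub("2", ".", step1)  # every remaining 2 becomes a dot
step2
# [1] ".1.#......"
gsub("#", "2", step2)           # placeholder turns back into 2
# [1] ".1.2......"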

R - Gsub return first match

I want to extract the 12 and the 0 from the test vector. Every time I try, I either get 120 or 12:0.
TestVector <- c("12:0")
gsub("\\b[:numeric:]*",replacement = "\\1", x = TestVector, fixed = F)
What can I use to extract the 12 and the 0? Alternatively, can we have one expression that extracts just the 12, so I can change it to extract the 0? Can we do this exclusively with gsub?
One option, which doesn't involve using explicit regular expressions, would be to use strsplit() and split the timestamp on the colon:
TestVector <- c("12:0")
parts <- unlist(strsplit(TestVector, ":"))
> parts[1]
[1] "12"
> parts[2]
[1] "0"
Try this
gsub("\\b(\\d+):(\\d+)\\b",replacement = "\\1 \\2", x = TestVector, fixed = F)
Regex Breakdown
\\b #Word boundary
(\\d+) #Find all digits before :
: #Match literally colon
(\\d+) #Find all digits after :
\\b #Word boundary
As far as I know, there is no named class [:numeric:] in R, but there is the named class [[:digit:]]. You can use it as
gsub("\\b([[:digit:]]+):([[:digit:]]+)\\b",replacement = "\\1 \\2", x = TestVector)
As suggested by rawr, a much simpler and more intuitive way would be to simply replace : with a space:
gsub(":",replacement = " ", x = TestVector, fixed = F)
This can be done using scan from base R
scan(text=TestVector, sep=":", what=numeric(), quiet=TRUE)
#[1] 12 0
or with str_extract
library(stringr)
str_extract_all(TestVector, "[^:]+")[[1]]
# [1] "12" "0"

R regex matching for tweet pattern

I am trying to use the regex feature in R to parse some tweet text into its keywords. I have the following code.
sentence = gsub("[[:punct:]]", "", sentence)
sentence = gsub("[[:cntrl:]]", "", sentence)
sentence = gsub("\\d+", "", sentence)
sentence = tolower(sentence)
However, one of my sentences has the sequence "\ud83d\udc4b". The parsing fails for this sequence (the error is "invalid input in utf8towcs"). I would like to replace such sequences with "". I tried substituting the regex "\u+", but that did not match. What is the regex I should use to match this sequence? Thanks.
I think you want something like this,
> s <- "\ud83d\udc4b Delta"
> Encoding(s)
[1] "UTF-8"
> iconv(s, "ASCII", sub="")
[1] " Delta"
> f <- iconv(s, "ASCII", sub="")
> sentence = tolower(f)
> sentence
[1] " delta"
A function to remove non-ASCII characters from a whole data frame, column by column, is below.
RemoveNotASCII <- function # Remove all non-ASCII characters
### Remove non-ASCII characters column by column from a dataframe
(
  x ##<< dataframe
) {
  n <- ncol(x)
  z <- list()
  for (j in 1:n) {
    y <- as.character(x[, j])
    if (class(y) == "character") {
      Encoding(y) <- "latin1"
      y <- iconv(y, "latin1", "ASCII", sub = "")
    }
    z[[j]] <- y
  }
  z <- do.call("cbind.data.frame", z)
  names(z) <- names(x)
  return(z)
  ### Dataframe with non-ASCII characters removed
}
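A quick usage sketch on a small data frame (the accented sample strings are made up):
df <- data.frame(txt = c("Caf\u00e9 cr\u00e8me", "plain ASCII"),
                 stringsAsFactors = FALSE)
RemoveNotASCII(df)
#           txt
# 1    Caf crme
# 2 plain ASCII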
The qdapRegex package has the rm_non_ascii function to handle this:
library(qdapRegex)
tolower(rm_non_ascii(s))
## [1] "delta"

Eliminating the characters that are not a date in R

I have a data frame, df, with a column of dates in the following format:
pv$day
01/01/13 00:00:00
03/01/13 00:02:03
04/03/13 00:10:15
....
I would like to eliminate the timestamp, leaving just the date (e.g. 01/01/13 for the first row). I have tried using sapply() to apply strsplit(), and also filtering the content with a regex, but don't seem to have gotten it quite right in either case. This:
sapply(pv$day, function(x) strsplit(toString(x), ' '))
gives me the column with the correct split, but indexing with either [1] or [[1]] does not return the first element of the split.
What is the best way to go about this?
You can use sub:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
sub(" .+", "", vec)
# [1] "01/01/13" "03/01/13" "04/03/13"
A simple, flexible solution is to use strptime and strftime. Here is an example that uses your dates from the example above:
# Your dates
t <- c("01/01/13 00:00:00","03/01/13 00:02:03", "04/03/13 00:10:15")
# Convert character strings to dates
z <- strptime(t, "%d/%m/%y %H:%M:%OS")
# Convert dates to string, omitting the time
z.date <- strftime(z,"%d/%m/%y")
# Print the first date
z.date[1]
# [1] "01/01/13"
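If you only need the date part, another option is converting to Date, which drops the time component entirely; a minimal sketch using the same vector:
d <- as.Date(t, "%d/%m/%y")
format(d, "%d/%m/%y")
# [1] "01/01/13" "03/01/13" "04/03/13"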
Here's a nice way to use sapply; it uses strsplit to split at the space:
> d <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
> sapply(strsplit(d, " "), `[`, 1)
# [1] "01/01/13" "03/01/13" "04/03/13"
And also, you could use stringr::word if you just want a character vector.
> library(stringr)
> word(d)
# [1] "01/01/13" "03/01/13" "04/03/13"
Here is an approach using a lookahead assertion:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
gsub(pattern = "(?=00).*$", replacement = "", vec, perl = TRUE)
[1] "01/01/13 " "03/01/13 " "04/03/13 "
The pattern looks for anything at the end of the string that begins with a double zero ("00") and removes it.
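Since the match starts at the time rather than at the separating space, the results keep a trailing blank; wrapping the call in trimws() (available from R 3.2.0) tidies that up:
trimws(gsub(pattern = "(?=00).*$", replacement = "", vec, perl = TRUE))
# [1] "01/01/13" "03/01/13" "04/03/13"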

dynamic regex in R

The code below works as long as the before and after strings contain no characters that are special in a regex:
before <- 'Name of your Manager (note "self" if you are the Manager)' # parentheses cause problems in the regex
after <- 'CURRENT FOCUS'
pattern <- paste0(c('(?<=', before, ').*?(?=', after, ')'), collapse='')
ex <- regmatches(x, gregexpr(pattern, x, perl=TRUE))
Does R have a function to escape strings to be used in regexes?
In Perl, there is http://perldoc.perl.org/functions/quotemeta.html for doing exactly that. If the doc is correct when it says
Returns the value of EXPR with all the ASCII non-"word" characters backslashed. (That is, all ASCII characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings.)
then you can achieve the same by doing:
quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)
And your pattern should be:
pattern <- paste0(c('(?<=', quotemeta(before), ').*?(?=', quotemeta(after), ')'),
                  collapse='')
Quick sanity check:
a <- "he'l(lo)"
grepl(a, a)
# [1] FALSE
grepl(quotemeta(a), a)
# [1] TRUE
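Putting it together with the original extraction (the x value here is a made-up string containing both markers):
x <- 'Name of your Manager (note "self" if you are the Manager) Jane Doe CURRENT FOCUS'
pattern <- paste0(c('(?<=', quotemeta(before), ').*?(?=', quotemeta(after), ')'),
                  collapse='')
regmatches(x, gregexpr(pattern, x, perl=TRUE))
# [[1]]
# [1] " Jane Doe "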
Use \Q...\E to surround the verbatim subpatterns:
# test data
before <- "A."
after <- ".Z"
x <- c("A.xyz.Z", "ABxyzYZ")
pattern <- sprintf('(?<=\\Q%s\\E).*?(?=\\Q%s\\E)', before, after)
which gives:
> gregexpr(pattern, x, perl = TRUE) > 0
[1] TRUE FALSE
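And to actually extract the text between the delimiters rather than just test for a match, regmatches works with the same pattern:
> regmatches(x, gregexpr(pattern, x, perl = TRUE))
[[1]]
[1] "xyz"

[[2]]
character(0)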
dnagirl, such a function exists: it is glob2rx
a <- "he'l(lo)"
tt <- glob2rx(a)
# [1] "^he'l\\(lo)$"
before <- 'Name of your Manager (note "self" if you are the Manager)'
tt <- glob2rx(before)
# [1] "^Name of your Manager \\(note \"self\" if you are the Manager)$"
You can just remove the "^" and "$" from the strings by doing:
substr(tt, 2, nchar(tt)-1)
# [1] "he'l\\(lo)"