What is the equivalent of the LEFT plus FIND function in R? - regex

I am trying to extract the first few characters from a column in a data frame. What I need is the leading characters of each string, up to the first ",".
Data:
texts
12/5/15, 11:49 - thanks, take care
12/5/15, 11:51 - cool
What I need is
texts date
12/5/15, 11:49 - thanks, take care 12/5/15
12/5/15, 11:51 - cool 12/5/15
I tried the following, but it returned the whole string with only the comma removed
df$date <- sub(", ", "", df$date, fixed = TRUE)
and
df$date <- gsub( ".,","", df$texts)
Excel equivalent
=LEFT(A1, FIND(",",A1,1)-1)

You can use sub:
sub('(^.*?),.*', '\\1', df$texts)
# [1] "12/5/15" "12/5/15"
The pattern matches
the start of the line ^ followed by any character . repeated zero to infinity times, but as few as possible *?, all captured ( ... )
followed by a comma ,
followed by any character, repeated zero to infinity times .*
which will match the whole line, and replaces it with
the captured group \\1.
Other options: substr, strsplit, stringr::str_extract.
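As a rough sketch of two of those other options in base R, the following mirrors Excel's LEFT plus FIND step by step. The vector texts and the names comma_pos, left_part and first_piece are made up for illustration, and the sketch assumes every string contains a comma (without one, substr returns an empty string).

```r
texts <- c("12/5/15, 11:49 - thanks, take care", "12/5/15, 11:51 - cool")

# FIND: position of the first comma in each string (-1 when there is none)
comma_pos <- as.integer(regexpr(",", texts, fixed = TRUE))

# LEFT: keep the characters before that position
left_part <- substr(texts, 1, comma_pos - 1)
left_part
# [1] "12/5/15" "12/5/15"

# strsplit alternative: split at the commas and keep the first piece
first_piece <- vapply(strsplit(texts, ",", fixed = TRUE), `[`, character(1), 1)
first_piece
# [1] "12/5/15" "12/5/15"
```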
If you're planning on using said dates, as.Date (or strptime, if you want the times too) can actually strip out what it needs:
as.Date(df$texts, '%m/%d/%y') # or '%d/%m/%y', if that's the format
# [1] "2015-12-05" "2015-12-05"
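If the times are wanted as well, the strptime route could look roughly like this. The format string, the UTC time zone and the name stamps are assumptions based on the sample rows, not something the question specifies.

```r
texts <- c("12/5/15, 11:49 - thanks, take care", "12/5/15, 11:51 - cool")

# Parse date and time from the start of each string; strptime stops reading
# once the format is satisfied, so the trailing message text is ignored.
stamps <- as.POSIXct(strptime(texts, "%m/%d/%y, %H:%M", tz = "UTC"))

format(stamps, "%Y-%m-%d %H:%M")
# [1] "2015-12-05 11:49" "2015-12-05 11:51"
```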
Data:
df <- structure(list(texts = structure(1:2, .Label = c("12/5/15, 11:49 - thanks, take care",
"12/5/15, 11:51 - cool"), class = "factor")), .Names = "texts",
class = "data.frame", row.names = c(NA, -2L))

Why not just,
sub(',.*', '', df$texts)
#[1] "12/5/15" "12/5/15"

You can do
l <- strsplit(as.character(df$texts), split = ",")
to split the text at the commas and then
sapply (l, "[", 1)
to keep just the first part.

Related

Matching a word after another word in R regex

I have a dataframe in R with one column (called 'city') containing a text string. My goal is to extract only one word ie the city text from the text string. The city text always follows the word 'in', eg the text might be:
'in London'
'in Manchester'
I tried to create a new column ('municipality'):
df$municipality <- gsub(".*in ?([A-Z+).*$","\\1",df$city)
This gives me the first letter following 'in', but I need the next word (ONLY the next word)
I then tried:
gsub(".*in ?([A-Z]\w+))")
which worked on a regex checker, but not in R. Can someone please help me. I know this is probably very simple but I can't crack it. Thanks in advance.
We can use str_extract
library(stringr)
str_extract(df$city, '(?<=in\\s)\\w+')
#[1] "London" "Manchester"
The following regular expression will match the second word from your city column:
^in\\s([^ ]*).*$
This matches the word in followed by a single whitespace character, followed by a capture group of any non-space characters, which comprises the city name.
Example:
df <- data.frame(city=c("in London town", "in Manchester city"))
df$municipality <- gsub("^in\\s([^ ]*).*$", "\\1", df$city)
> df$municipality
[1] "London" "Manchester"
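The lookbehind used in the str_extract answer also works in base R when perl = TRUE is set. A minimal sketch, assuming the same two sample strings (the names city and municipality are only for illustration):

```r
city <- c("in London town", "in Manchester city")

# Find the word that directly follows "in " without including "in " itself
m <- regexpr("(?<=\\bin\\s)\\w+", city, perl = TRUE)

# regmatches() returns only the matched text; elements without a match are dropped
municipality <- regmatches(city, m)
municipality
# [1] "London"     "Manchester"
```

Because regmatches drops non-matching elements, this is only safe to assign back into a data frame column when every row is known to contain a match.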

Trying to use a regular expression in R to capture some data

So I have a table in R, and an example of the string I am trying to capture is this:
C.Hale (79-83)
I want to write a regular expression to extract the (79-83).
How do I go about doing this?
We can use sub. We match one or more characters that are not a space ([^ ]+) from the beginning of the string (^), followed by a whitespace character (\\s), and replace the match with ''.
sub('^[^ ]+\\s', '', str1)
#[1] "(79-83)"
Or another option is stri_extract_all from stringi
library(stringi)
stri_extract_all_regex(str1, '\\([^)]+\\)')[[1]]
#[1] "(79-83)"
data
str1 <- 'C.Hale (79-83)'
One possibility using the qdapRegex package I maintain:
x <- "C.Hale (79-83)"
library(qdapRegex)
rm_round(x, extract = TRUE, include.markers = TRUE)
## [[1]]
## [1] "(79-83)"
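For comparison, a base R sketch that needs no extra package. It assumes each string contains a single parenthesised group; the names with_parens and inside are hypothetical.

```r
x <- "C.Hale (79-83)"

# Extract the first parenthesised group, parentheses included
with_parens <- regmatches(x, regexpr("\\([^)]+\\)", x))
with_parens
# [1] "(79-83)"

# Capture only what sits between the parentheses
inside <- sub(".*\\((.*)\\).*", "\\1", x)
inside
# [1] "79-83"
```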

remove leading zeroes from timestamp %j%Y %H:%M

My timestamp is in the form
0992006 09:00
I need to remove the leading zeros to get this form:
992006 9:00
Here's the code I'm using now, which doesn't remove leading zeros:
prediction$TIMESTAMP <- as.character(format(prediction$TIMESTAMP, '%j%Y %H:%M'))
Simplest way is to create your own boundary that asserts either the start of the string or a space precedes.
gsub('(^| )0+', '\\1', '0992006 09:00')
# [1] "992006 9:00"
You could do the same with an empty replacement by using a trick. \K resets the starting point of the reported match, so any previously consumed characters are no longer included in it.
gsub('(^| )\\K0+', '', '0992006 09:00', perl=T)
# [1] "992006 9:00"
Or you could use sub and match until the second set of leading zeros.
sub('^0+([0-9]+ )0+', '\\1', '0992006 09:00')
# [1] "992006 9:00"
And to cover all possibilities, if you might ever have a format like 0992006 00:00, simply remove the + quantifier from the zero in the regular expression so that it only removes the first leading zero.
str1 <- "0992006 09:00"
gsub("(?<=^| )0+", "", str1, perl=TRUE)
#[1] "992006 9:00"
For situations like below, it could be:
str2 <- "0992006 00:00"
gsub("(?<=^| )0", "", str2, perl=TRUE)
#[1] "992006 0:00"
Explanation
Here the idea is to use a lookbehind, (?<=^| )0+, to match 0s
that occur either at the beginning of the string, (?<=^
or, |
right after a space, )0+
and to replace the matched 0s with "", the replacement argument of gsub.
In the second string, the hour and minutes are all 0's. So, using the first code would result in:
gsub("(?<=^| )0+", "", str2, perl=TRUE)
#[1] "992006 :00"
Here, it is unclear what the OP would accept as a result. So I thought that, instead of removing all the 0s before the :, it would be better to leave one 0. I therefore changed 0+ to a single 0 and replaced that with "".
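One way to handle both cases with a single pattern is to require that a digit still follows the zeros being removed. This is a sketch under the assumption that "0:00" is the desired result for a midnight timestamp; the function name strip_leading_zeros is made up for illustration.

```r
# Remove zeros at the start of the string or after a space,
# but only while at least one digit remains after them.
strip_leading_zeros <- function(ts) {
  gsub("(^| )0+(?=[0-9])", "\\1", ts, perl = TRUE)
}

strip_leading_zeros(c("0992006 09:00", "0992006 00:00"))
# [1] "992006 9:00" "992006 0:00"
```

The lookahead (?=[0-9]) forces 0+ to give back one zero when nothing but zeros precedes the colon, so the hour never disappears entirely.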
Here's another option using a lookbehind
gsub("(^0)|(?<=\\s)0", "", "0992006 09:00", perl = TRUE)
## [1] "992006 9:00"
With sub:
sub("^[0]+", "", prediction$TIMESTAMP)
[1] "992006 09:00"
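You can also use stringr without a regular expression, by taking substrings. This drops the first character of each word, so it assumes both parts start with exactly one 0.
> library(stringr)
> x <- "0992006 09:00"
> str_c(str_sub(word(x, 1:2), 2), collapse = " ")
# [1] "992006 9:00"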
Some more Perl regexes,
> gsub("(?<!:)\\b0+", "", "0992006 09:00", perl=T)
[1] "992006 9:00"
> gsub("(?<![\\d:])0+", "", "0992006 09:00", perl=T)
[1] "992006 9:00"

Extract part of string between two different patterns

I try to use stringr package to extract part of a string, which is between two particular patterns.
For example, I have:
my.string <- "nanaqwertybaba"
left.border <- "nana"
right.border <- "baba"
and by the use of str_extract(string, pattern) function (where pattern is defined by a POSIX regular expression) I would like to receive:
"qwerty"
Solutions from Google did not work.
In base R you can use gsub. The parentheses in the pattern create numbered capturing groups. Here we select the second group in the replacement, i.e. the group between the borders. The . matches any character. The * means zero or more of the preceding element.
gsub(pattern = "(.*nana)(.*)(baba.*)",
replacement = "\\2",
x = "xxxnanaRisnicebabayyy")
# "Risnice"
I do not know whether and how this is possible with functions provided by stringr but you can also use base regexpr and substring:
pattern <- paste0("(?<=", left.border, ")[a-z]+(?=", right.border, ")")
# "(?<=nana)[a-z]+(?=baba)"
rx <- regexpr(pattern, text=my.string, perl=TRUE)
# [1] 5
# attr(,"match.length")
# [1] 6
substring(my.string, rx, rx+attr(rx, "match.length")-1)
# [1] "qwerty"
I would use str_match from stringr: "str_match extracts capture groups formed by () from the first match. It returns a character matrix with one column for the complete match and one column for each group."
str_match(my.string, paste(left.border, '(.+)', right.border, sep=''))[,2]
The code above creates a regular expression with paste concatenating the capture group (.+) that captures 1 or more characters, with left and right borders (no spaces between strings).
A single match is assumed. So, [,2] selects the second column from the matrix returned by str_match.
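The same capture-group idea can be wrapped in a small base R helper. This is only a sketch: extract_between is a hypothetical name, the borders are interpreted as regular expressions (so special characters would need escaping), and a string without a match is returned unchanged.

```r
extract_between <- function(x, left, right) {
  # Lazy quantifiers keep the capture as short as possible:
  # it starts after the first left border and ends at the next right border.
  pattern <- paste0(".*?", left, "(.*?)", right, ".*")
  sub(pattern, "\\1", x, perl = TRUE)
}

extract_between("nanaqwertybaba", "nana", "baba")
# [1] "qwerty"
extract_between("xxxnanaRisnicebabayyy", "nana", "baba")
# [1] "Risnice"
```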
You can use the package unglue:
library(unglue)
my.string <- "nanaqwertybaba"
unglue_vec(my.string, "nana{res}baba")
#> [1] "qwerty"

Get Twitter #Username with Regex in R

How can I use regex in R to extract Twitter usernames from a string of text?
I've tried
library(stringr)
theString <- '#foobar Foobar! and #foo (#bar) but not foo#bar.com'
str_extract_all(string=theString,pattern='(?:^|(?:[^-a-zA-Z0-9_]))#([A-Za-z]+[A-Za-z0-9_]+)')
But I end up with #foobar, #foo and (#bar which contains an unwanted parenthesis.
How can I get just #foobar, #foo and #bar as output?
Here's one method that works in R:
theString <- '#foobar Foobar! and #foo (#bar) but not foo#bar.com'
theString1 <- unlist(strsplit(theString, " "))
regex <- "(^|[^#\\w])#(\\w{1,15})\\b"
idx <- grep(regex, theString1, perl = T)
theString1[idx]
[1] "#foobar" "#foo" "(#bar)"
If you want to use #Jerry's answer in R:
regex <- "#([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)"
idx <- grep(regex, theString1, perl = T)
theString1[idx]
[1] "#foobar" "#foo" "(#bar)"
Both of these methods include the parenthesis that you don't want, however.
UPDATE This will get you from start to finish with no parentheses or any other kind of punctuation (except underscores, since they're allowed in usernames)
theString <- '#foobar Foobar! and #fo_o (#bar) but not foo#bar.com'
theString1 <- unlist(strsplit(theString, " "))
regex1 <- "(^|[^#\\w])#(\\w{1,15})\\b" # get strings with #
regex2 <- "[^[:alnum:]#_]" # remove all punctuation except _ and #
users <- gsub(regex2, "", theString1[grep(regex1, theString1, perl = T)])
users
[1] "#foobar" "#fo_o" "#bar"
#[a-zA-Z0-9_]{0,15}
Where:
# matches the character # literally (case sensitive).
[a-zA-Z0-9_] matches a single character present in the list
{0,15} Quantifier matches between 0 and 15 times, as many times as possible, giving back as needed
It is working fine on selecting twitter usernames from a mixed dataset.
Try using a negative lookbehind so that characters are not consumed in your match:
(?:^|(?<![-a-zA-Z0-9_]))#([A-Za-z]+[A-Za-z0-9_]+)
^^^
EDIT: Since it seems lookbehinds don't work in R (I found somewhere here that lookbehinds worked on R, but apparently not...), try this one:
#([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)
Edit: double escaped the dot
EDITv3... : Try turning on PCRE:
str_extract_all(string=theString,perl("(?:^|(?<![-a-zA-Z0-9_]))#([A-Za-z]+[A-Za-z0-9_]+)"))
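For what it is worth, base R does accept lookbehinds once perl = TRUE is passed, so the negative lookbehind idea can be tried without stringr. A minimal sketch using the sample string from the question (the pattern is a simplified variant and the name users is hypothetical):

```r
theString <- '#foobar Foobar! and #foo (#bar) but not foo#bar.com'

# Match # followed by a letter and at least one more username character,
# but only when # is not directly preceded by a username character or hyphen.
m <- gregexpr("(?<![A-Za-z0-9_-])#[A-Za-z][A-Za-z0-9_]+", theString, perl = TRUE)

users <- regmatches(theString, m)[[1]]
users
# [1] "#foobar" "#foo"    "#bar"
```

Because the lookbehind does not consume the preceding character, the opening parenthesis before #bar stays out of the match.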