R - remove anything after comma from column - regex

I'd like to strip this column so that it just shows the last name: if there is a comma, I'd like to remove the comma and anything after it. I have a data column that is a mix of bare last names and "last, first" entries. The data looks as follows:
Last Name
Sample, A
Tester
Wilfred, Nancy
Day, Bobby Jean
Morris

You could use gsub() and some regex:
> x <- 'Day, Bobby Jean'
> gsub("(.*),.*", "\\1", x)
[1] "Day"

You can use gsub:
gsub(",.*", "", c("last only", "last, first"))
# [1] "last only" "last"
",.*" says: replace comma (,) and every character after that (.*), with nothing "".

str1 <- c("Sample, A", "Tester", "Wifred, Nancy", "Day, Bobby Jean", "Morris")
library(stringr)
str_extract(str1, '[A-Za-z]+(?=(,|\\b))')
#[1] "Sample" "Tester" "Wifred" "Day" "Morris"
Match a run of letters, [A-Za-z]+, and extract the first run that is followed by a comma or a word boundary.

This will work:
library(stringr)
a <- read.delim("C:\\Desktop\\a.csv", row.names = NULL, header = TRUE,
                stringsAsFactors = FALSE, sep = ",")
a <- as.matrix(a)
# Remove the comma and everything after it in every cell
Data <- str_replace_all(string = a, pattern = ",.*$", replacement = "")

Also try strsplit:
string <- c("Sample, A", "Tester", "Wifred, Nancy", "Day, Bobby Jean", "Morris")
sapply(strsplit(string, ","), "[", 1)
#[1] "Sample" "Tester" "Wifred" "Day" "Morris"
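
To apply one of these patterns to a data frame column rather than a bare vector, a minimal sketch could look like the following. The data frame name `people` and the column name `LastName` are assumptions made up for illustration, and `trimws()` is added only in case stray spaces surround a name.

```r
# Hypothetical data frame; `people` and `LastName` are illustrative names.
people <- data.frame(
  LastName = c("Sample, A", "Tester", "Wilfred, Nancy", "Day, Bobby Jean", "Morris"),
  stringsAsFactors = FALSE
)

# Drop the first comma and everything after it, then trim stray whitespace.
people$LastName <- trimws(sub(",.*", "", people$LastName))

people$LastName
```

Since each string is edited at most once, `sub` and `gsub` behave the same here.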

Related

What is the equivalent of the LEFT plus FIND function in R?

I am trying to extract the first few characters from a column in a data frame. What I need is the first few characters up to the first ",".
Data:
texts
12/5/15, 11:49 - thanks, take care
12/5/15, 11:51 - cool
What I need is
texts date
12/5/15, 11:49 - thanks, take care 12/5/15
12/10/15, 11:51 - cool 12/10/15
I tried using this, but this returned everything without the columns
df$date <- sub(", ", "", df$date, fixed = TRUE)
and
df$date <- gsub( ".,","", df$texts)
Excel equivalent
=LEFT(A1, FIND(",",A1,1)-1)
You can use sub:
sub('(^.*?),.*', '\\1', df$texts)
# [1] "12/5/15" "12/5/15"
The pattern matches
the start of the line ^ followed by any character . repeated zero to infinity times, but as few as possible *?, all captured ( ... )
followed by a comma ,
followed by any character, repeated zero to infinity times .*
which will match the whole line, and replaces it with
the captured group \\1.
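To see why the lazy quantifier matters, here is a small sketch comparing it with the greedy form on one of the sample strings. Passing `perl = TRUE` is my own choice so that both calls run on the same PCRE engine; the call above omits it.

```r
s <- "12/5/15, 11:49 - thanks, take care"

# Lazy (.*?): the capture stops at the first comma.
lazy <- sub("(^.*?),.*", "\\1", s, perl = TRUE)

# Greedy (.*): the capture runs up to the last comma.
greedy <- sub("(^.*),.*", "\\1", s, perl = TRUE)

lazy    # "12/5/15"
greedy  # "12/5/15, 11:49 - thanks"
```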
Other options: substr, strsplit, stringr::str_extract.
If you're planning on using said dates, as.Date (or strptime, if you want the times too) can actually strip out what it needs:
as.Date(df$texts, '%m/%d/%y') # or '%d/%m/%y', if that's the format
# [1] "2015-12-05" "2015-12-05"
Data:
df <- structure(list(texts = structure(1:2, .Label = c("12/5/15, 11:49 - thanks, take care",
"12/5/15, 11:51 - cool"), class = "factor")), .Names = "texts",
class = "data.frame", row.names = c(NA, -2L))
Why not just,
sub(',.*', '', df$texts)
#[1] "12/5/15" "12/5/15"
You can do
l <- strsplit(as.character(df$texts), split = ",")
to split the text at each comma (as.character is needed because texts is a factor in the sample data) and then
sapply(l, "[", 1)
to keep just the first part.
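
For a more literal translation of the Excel formula, `regexpr` can play the role of FIND and `substr` the role of LEFT. This is only a sketch: the vector name `texts`, the third test string, and the fallback of keeping the whole string when no comma is present are assumptions added for illustration (Excel's FIND would return an error in that case).

```r
texts <- c("12/5/15, 11:49 - thanks, take care",
           "12/5/15, 11:51 - cool",
           "no comma here")

# FIND: position of the first comma, or -1 when there is none.
pos <- as.integer(regexpr(",", texts, fixed = TRUE))

# LEFT: keep everything before that position; keep the whole string if no comma.
left_part <- ifelse(pos > 0, substr(texts, 1, pos - 1), texts)

left_part
```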

Matching a word after another word in R regex

I have a dataframe in R with one column (called 'city') containing a text string. My goal is to extract only one word ie the city text from the text string. The city text always follows the word 'in', eg the text might be:
'in London'
'in Manchester'
I tried to create a new column ('municipality'):
df$municipality <- gsub(".*in ?([A-Z]+).*$","\\1",df$city)
This gives me the first letter following 'in', but I need the next word (ONLY the next word)
I then tried:
gsub(".*in ?([A-Z]\w+))")
which worked on a regex checker, but not in R. Can someone please help me. I know this is probably very simple but I can't crack it. Thanks in advance.
We can use str_extract
library(stringr)
str_extract(df$city, '(?<=in\\s)\\w+')
#[1] "London" "Manchester"
The following regular expression will match the second word from your city column:
^in\\s([^ ]*).*$
This matches the word in followed by a single whitespace character, followed by a capture group of any non-space characters, which comprises the city name.
Example:
df <- data.frame(city=c("in London town", "in Manchester city"))
df$municipality <- gsub("^in\\s([^ ]*).*$", "\\1", df$city)
> df$municipality
[1] "London" "Manchester"
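
A base R version of the lookbehind approach is also possible with `regexpr` and `regmatches`. This is a sketch under the same assumption as above, namely that every string contains "in" followed by one whitespace character; the vector name `city` stands in for `df$city`.

```r
city <- c("in London town", "in Manchester city")

# First run of word characters that directly follows "in" plus one whitespace character.
m <- regexpr("(?<=\\bin\\s)\\w+", city, perl = TRUE)
municipality <- regmatches(city, m)

municipality
```

Note that `regmatches` silently drops elements with no match, so the result can be shorter than the input if some rows lack the "in" prefix.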

Extracting values from a string in R using regex

I'm trying to extract the first and second numbers of this string and store them in separate variables.
(User20,10.25)
I can't figure out how to get the user number and then his value.
What I have managed to do so far is this, but I don't know how to remove the rest of the string and get only the number.
gsub("\\(User", "", string)
Try
str1 <- '(User20,10.25)'
scan(text=gsub('[^0-9.-]+', ' ', str1),quiet=TRUE)
#[1] 20.00 10.25
In case the string is
str2 <- '(User20-ht,-10.25)'
scan(text=gsub('-(?=[^0-9])|[^0-9.-]+', " ", str2, perl=TRUE), quiet=TRUE)
#[1] 20.00 -10.25
Or
library(stringr)
str_extract_all(str1, '[0-9.-]+')[[1]]
#[1] "20" "10.25"
Or using stringi
library(stringi)
stri_extract_all_regex(str1, '[0-9.-]+')[[1]]
#[1] "20" "10.25"
Tyler Rinker's "qdapRegex" package has some functions that are useful for this kind of stuff.
In this case, you would most likely be interested in rm_number:
library(qdapRegex)
x <- "(User20,10.25)"
rm_number(x, extract = TRUE)
# [[1]]
# [1] "20" "10.25"
You can use strsplit with sub ...
> sub('\\(User|\\)', '', strsplit(x, ',')[[1]])
[1] "20" "10.25"
It would probably be easier to match the context that you want instead.
> regmatches(x, gregexpr('[0-9.]+', x))[[1]]
[1] "20" "10.25"
The following is one approach:
[^,\)\([A-Z]]
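
Since the question asks for the two numbers in separate variables, here is a minimal base R sketch using capture groups with `regexec`. The variable names `user` and `value` are illustrative, and the pattern assumes the input always has the exact shape (User<digits>,<number>).

```r
string <- "(User20,10.25)"

# regexec returns the full match followed by one entry per capture group.
m <- regmatches(string, regexec("\\(User([0-9]+),(-?[0-9.]+)\\)", string))[[1]]

user  <- as.integer(m[2])   # first capture group: the user number
value <- as.numeric(m[3])   # second capture group: the value

user
value
```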

remove leading zeroes from timestamp %j%Y %H:%M

My timestamp is in the form
0992006 09:00
I need to remove the leading zeros to get this form:
992006 9:00
Here's the code I'm using now, which doesn't remove leading zeros:
prediction$TIMESTAMP <- as.character(format(prediction$TIMESTAMP, '%j%Y %H:%M'))
Simplest way is to create your own boundary that asserts either the start of the string or a space precedes.
gsub('(^| )0+', '\\1', '0992006 09:00')
# [1] "992006 9:00"
You could do the same without a capture group by using \K, which resets the starting point of the reported match so that any previously consumed characters are no longer included.
gsub('(^| )\\K0+', '', '0992006 09:00', perl=TRUE)
# [1] "992006 9:00"
Or you could use sub and match until the second set of leading zeros.
sub('^0+([0-9]+ )0+', '\\1', '0992006 09:00')
# [1] "992006 9:00"
And to cover all possibilities, if you think you might ever have a format like 0992006 00:00, simply remove the + quantifier after the zero in the regular expression so that it removes only the first leading zero.
str1 <- "0992006 09:00"
gsub("(?<=^| )0+", "", str1, perl=TRUE)
#[1] "992006 9:00"
For situations like below, it could be:
str2 <- "0992006 00:00"
gsub("(?<=^| )0", "", str2, perl=TRUE)
#[1] "992006 0:00"
Explanation
Here the idea is to use a lookbehind, (?<=^| ), so that 0+ matches zeros only
if they occur at the beginning of the string: ^
or: |
if they follow a space: " "
and to replace the matched zeros with "", the replacement argument of gsub.
In the second string, the hour and minutes are all 0's. So, using the first code would result in:
gsub("(?<=^| )0+", "", str2, perl=TRUE)
#[1] "992006 :00"
Here, it is unclear what the OP would accept as a result. So I thought that, instead of removing all the 0s before the :, it would be better to leave one 0. To do that, I changed the 0+ in the pattern to a single 0 and replaced that with "".
Here's another option using a lookbehind
gsub("(^0)|(?<=\\s)0", "", "0992006 09:00", perl = TRUE)
## [1] "992006 9:00"
With sub (this strips only the zeros at the start of the string, so the hour keeps its leading zero):
sub("^[0]+", "", prediction$TIMESTAMP)
[1] "992006 09:00"
You can also use stringr without a regular expression, by using the substrings.
> x <- "0992006 09:00"
> library(stringr)
> str_c(str_sub(word(x, 1:2), 2), collapse = " ")
# [1] "992006 9:00"
Some more Perl regexes,
> gsub("(?<!:)\\b0+", "", "0992006 09:00", perl=T)
[1] "992006 9:00"
> gsub("(?<![\\d:])0+", "", "0992006 09:00", perl=T)
[1] "992006 9:00"
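
The variants above differ mainly in how they treat an all-zero hour such as 00:00. One way to cover both cases with a single pattern is to strip zeros only when another digit follows. The helper name `strip_leading_zeros` is hypothetical, and the sketch assumes the two fields are separated by a single space as in the question.

```r
# Hypothetical helper: remove zeros at the start of the string or after a space,
# but only when another digit follows, so "00:00" becomes "0:00" rather than ":00".
strip_leading_zeros <- function(x) {
  gsub("(^| )0+(?=[0-9])", "\\1", x, perl = TRUE)
}

res <- strip_leading_zeros(c("0992006 09:00", "0992006 00:00"))
res
```

The lookahead `(?=[0-9])` forces the engine to leave at least one digit in place, which is what keeps the single 0 in the hour field.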

How to add a "." after a String under Conditions in R

Data <- c("My name is Ernst.","I love chicken","Hello, my name is Stan!","Who?","I Love you!","Winner")
The function should add a "." at the end of each sentence that does not already end with one of [.?!].
I was trying to build a function in R with the help of regex, but I had trouble looking only at the end of the string.
The gsub call below adds a dot at the end of a sentence only if the sentence does not already end with ., ? or !.
> Data <- c("My name is Ernst.","I love chicken","Hello, my name is Stan!","Who?","I Love you!","Winner")
> gsub("^(?!.*[.?!]$)(.*)$", "\\1.", Data, perl=TRUE)
[1] "My name is Ernst." "I love chicken."
[3] "Hello, my name is Stan!" "Who?"
[5] "I Love you!" "Winner."
In regex, lookaheads are used for condition checking. The negative lookahead (?!.*[.?!]$) checks whether the line ends with ., ? or !. If it does, the match fails and no replacement happens for that element. The replacement occurs only when the last character is none of ., ? or !.
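To make the role of the lookahead visible on its own, this small sketch uses it with `grepl` to flag which elements would receive a dot. The variable name `needs_dot` is illustrative.

```r
Data <- c("My name is Ernst.", "I love chicken", "Hello, my name is Stan!",
          "Who?", "I Love you!", "Winner")

# TRUE where the string does NOT end in ., ? or ! (the negative lookahead succeeds).
needs_dot <- grepl("^(?!.*[.?!]$)", Data, perl = TRUE)

Data[needs_dot]
```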
OR
Through negative lookbehind and positive lookahead,
> Data <- c("My name is Ernst.","I love chicken","Hello, my name is Stan!","Who?","I Love you!","Winner")
> sub("(?<![!?.])(?=$)", ".", Data, perl=TRUE)
[1] "My name is Ernst." "I love chicken."
[3] "Hello, my name is Stan!" "Who?"
[5] "I Love you!" "Winner."
using stringi
library(stringi)
stri_replace_all_regex(Data, "(?<![!?.])\\b$", ".")
#[1] "My name is Ernst." "I love chicken."
#[3] "Hello, my name is Stan!" "Who?"
#[5] "I Love you!" "Winner."
Here is another solution.
x <- c('My name is Ernst.', 'I love chicken',
'Hello, my name is Stan!', 'Who?', 'I Love you!', 'Winner')
r <- sub('[^?!.]\\K$', '.', x, perl=TRUE)
## [1] "My name is Ernst." "I love chicken."
## [3] "Hello, my name is Stan!" "Who?"
## [5] "I Love you!" "Winner."
Here are some possible approaches:
1) If the last character is not dot, ? or ! then replace it with that character followed by dot:
sub("([^.!?])$", "\\1.", Data)
For the data in the question we get:
[1] "My name is Ernst." "I love chicken."
[3] "Hello, my name is Stan!" "Who?"
[5] "I Love you!" "Winner."
2) A gsubfn solution is even simpler. It replaces the empty () with a dot if the last character is not a dot, ! or ? .
library(gsubfn)
gsubfn("[^.!?]()$", ".", Data)
3) This one uses grepl. If dot, ! or ? is the last character then append the empty string and otherwise append dot.
paste0(Data, ifelse(grepl("[.!?]$", Data), "", "."))
4) This one does not use regular expressions at all. It picks off the last character and, if it's one of dot, ! or ?, appends the empty string; otherwise it appends a dot:
paste0(Data, ifelse(substring(Data, nchar(Data)) %in% c(".", "!", "?"), "", "."))
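
Since the question asks for a function, approach 3 can be wrapped as follows. This is a sketch only: the name `end_sentence` is hypothetical, and removing trailing whitespace before the check is an extra assumption that the question does not ask for.

```r
# Hypothetical helper built on approach 3 (grepl + paste0).
end_sentence <- function(x) {
  x <- sub("[[:space:]]+$", "", x)               # assumption: ignore trailing whitespace
  ifelse(grepl("[.!?]$", x), x, paste0(x, "."))  # append "." only when needed
}

out <- end_sentence(c("I love chicken", "Who?", "Winner  "))
out
```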