R gsub( ) , Regular Expression - regex

I have the following data
Names[]
[1] John Simon is a great player
[2] Chi-Twi is from china
[3] O'Konnor works hard
[4] R.F is a swimmer
I need to extract only the names from all these rows and store them. The result should be:
[1] John Simon
[2] Chi-Twi
[3] O'Konnor
[4] R.F
I tried doing it this way:
names = gsub("(^[A-Z|a-z|.|-|']+[ ]+[A-Z|a-z|.|-|]+)[ ]+.*", "\\1",names)
Can someone help me out?

Here's a regex that will work for this sample data:
names = gsub("(^[A-Za-z]+[^A-Za-z][A-Za-z]+).*", "\\1", names)
If digits and underscores are valid characters in a first or last name, you could shorten it to:
names = gsub("(^\\w+\\W\\w+).*", "\\1", names)
It simply takes one or more letters, a non-letter, and then one or more letters again. The trailing .* consumes the rest of the line, so only the captured name is kept.
Some things I noticed wrong in your regex:
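[A-Z|a-z|.|-|']+ actually matches A-Z, |, a-z, | (again), ., |-| (that's a range), and '. Inside a bracket expression the pipe is a literal character, not alternation. You really wanted [A-Za-z.'-]+, with the hyphen placed last so that it is taken literally.
In any case, you don't need dots or dashes inside the name parts: in this sample data they only appear as the separator between the two parts.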
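To make the bracket-expression point concrete, here is a minimal sketch in base R. It assumes the sample data from the question is stored in a character vector called Names; the variable names pipe_is_literal, hyphen_last_ok and extracted are only illustrative.

```r
# Inside [...] the pipe is a literal character, not alternation,
# so a string containing "|" is accepted by the class below.
pipe_is_literal <- grepl("^[A-Z|a-z]+$", "a|b")

# A hyphen placed last in the class is literal rather than a range operator.
hyphen_last_ok <- grepl("^[A-Za-z.'-]+$", "O'Konnor-R.F")

# Sample data from the question.
Names <- c("John Simon is a great player", "Chi-Twi is from china",
           "O'Konnor works hard", "R.F is a swimmer")

# Capture letters, one non-letter separator, letters; .* drops the rest.
extracted <- sub("^([A-Za-z]+[^A-Za-z][A-Za-z]+).*", "\\1", Names)
print(extracted)
```

Since each string needs only one replacement, sub is enough here; gsub would give the same result because the pattern is anchored with ^.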

Based on @nhahtdh's comment, you can use
sub("(^\\w+\\W\\w+).*", "\\1", Names)
# [1] "John Simon" "Chi-Twi" "O'Konnor" "R.F"

Related

R - Manipulate string based on pattern

This is the name of a file that I have on R:
> lst.files[1]
[1] "clt_Amon_CanESM2_rcp45_185001-230012.nc"
What I need to do is capture just the part up to and including the 4th underscore, so it would be something like this:
clt_Amon_CanESM2_rcp45_
How can I get this in R?
If you know you always have exactly four underscores, then you could do something like this (with more than four, the greedy .* would run on to the last underscore):
regmatches(lst.files[1], regexec(".*_.*_.*_.*_", lst.files[1]))[[1]]
# [1] "clt_Amon_CanESM2_rcp45_"
If there are not always four underscores, but the final part never contains one, you could do something like this:
regmatches(lst.files[1], regexec(".*_", lst.files[1]))[[1]]
# [1] "clt_Amon_CanESM2_rcp45_"
This doesn't require any extra package, just base R.
Using the qdap package, you can do the following.
x <- "clt_Amon_CanESM2_rcp45_185001-230012.nc"
library(qdap)
beg2char(x, "_", 4, include = TRUE)
# [1] "clt_Amon_CanESM2_rcp45_"
We can also capture the repeating pattern as a group using sub. From the beginning (^) of the string, we match one or more characters that are not an underscore ([^_]+) followed by an underscore (\\_), repeat that 4 times ({4}), and capture the whole run as a group by wrapping it in parentheses. The group is followed by zero or more characters (.*). We replace the entire match with the capture group (\\1) to get the expected output.
sub('^(([^_]+\\_){4}).*', '\\1', str1)
#[1] "clt_Amon_CanESM2_rcp45_"
data
str1 <- "clt_Amon_CanESM2_rcp45_185001-230012.nc"
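As a rough generalization of the sub approach, the repeat count can be made a parameter. The helper name prefix_through_nth is hypothetical (it does not come from any package), and the sketch assumes that inputs with fewer than n underscores should be returned unchanged.

```r
# Hypothetical helper: keep everything up to and including the n-th underscore.
prefix_through_nth <- function(x, n = 4) {
  # (?:[^_]*_){n} matches n runs of "non-underscores followed by an underscore".
  pattern <- sprintf("^((?:[^_]*_){%d}).*$", as.integer(n))
  # Strings with fewer than n underscores do not match and come back unchanged.
  sub(pattern, "\\1", x, perl = TRUE)
}

files <- c("clt_Amon_CanESM2_rcp45_185001-230012.nc", "a_b.nc")
prefixes <- prefix_through_nth(files)
print(prefixes)
```

Using [^_]* instead of .* keeps each repetition from running past an underscore, which is why this version still stops at the 4th underscore when the string contains more than four.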

Regex for known start and end characters in Perl and R-lang

I'm looking to match mentions of foo in a username. I need to be able to match text strings that start with '#' and contain the word 'foo' at any location within that username, ending at either a space or punctuation.
I need to be able to match:
example1: #anycharacterhere_foo, anything else here
example2: #foo_anymorecharacters here
I'm looking to use the stringr library like so:
str_extract_all(x, perl("?<=#"))
What I don't understand is how the match-all function works.
Assuming that your usernames won't have special characters:
x <- "#anycharacterhere_foo, anything else here"
username <- str_extract_all(x, "\\w*(foo)\\w*")
which yields a list holding the matches, including your username. This will also pick up any additional foos later in the string; you could avoid that by using str_extract rather than str_extract_all. I am not certain whether you really need every foo in the string or simply the username, which in your example data is at the beginning. You could also restrict str_extract_all by including the # in the pattern, thus:
username <- str_extract_all(x, "\\#\\w*(foo)\\w*")
You need to look for "zero or more" word characters that precede or follow:
x <- '#anycharacterhere_foo #foo_anymorecharacters here anything else here'
str_extract_all(x, '#\\w*foo\\w*')[[1]]
# [1] "#anycharacterhere_foo" "#foo_anymorecharacters"
If you don't want to include the marker:
str_extract_all(x, '(?<=#)\\w*foo\\w*')[[1]]
# [1] "anycharacterhere_foo" "foo_anymorecharacters"
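For comparison, the same two extractions can be sketched in base R with gregexpr and regmatches, assuming stringr is not available. The input string and the variable names with_marker and without_marker are made up for illustration; the "#bar" tag is there to show that tags without foo are skipped.

```r
x <- "#anycharacterhere_foo, #foo_anymorecharacters here, #bar nothing"

# With the marker: '#', then word characters that contain "foo".
with_marker <- regmatches(x, gregexpr("#\\w*foo\\w*", x))[[1]]

# Without the marker: a lookbehind requires the PCRE engine (perl = TRUE).
without_marker <- regmatches(
  x, gregexpr("(?<=#)\\w*foo\\w*", x, perl = TRUE)
)[[1]]

print(with_marker)
print(without_marker)
```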
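You could also use rm_hash from the qdapRegex package for this (rm_tag looks for '@' tags rather than '#'). Note that it extracts every hashtag, not only those containing foo:
library(qdapRegex)
rm_hash(x, extract=TRUE)[[1]]
# [1] "#anycharacterhere_foo" "#foo_anymorecharacters"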

Extract subset of a string following specific text in R

I am trying to extract all of the words in the string below contained within the brackets following the word 'tokens' only if the 'tokens' occurs after 'tag(noun)'.
For example, I have the string:
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),
inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),
inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),
head([lexmatch([department]),inputmatch(['Department']),tag(noun),
tokens([department])])],0/29,[])."
I want to get a list of all of the words that occur within the brackets after the word 'tokens' only when the word tokens occurs after 'tag(noun)'.
Therefore, I want my output to be a vector of the following:
[1] new, york, state, department
How do I do this? I'm assuming I have to use a regular expression, but I'm lost on how to write this in R.
Thanks!
Remove newlines and then extract the portion matched to the part between parentheses in pattern pat. Then split apart such strings by commas and simplify into a character vector:
library(gsubfn)
pat <- "tag.noun.,tokens..(.*?)\\]"
strapply(gsub("\\n", "", m), pat, ~ unlist(strsplit(x, ",")), simplify = c)
giving:
[1] "new" "york" "state" "department"
Visualization: Here is the debuggex representation of the regular expression in pat. (Note that we need to double the backslash when put within R's double quotes):
tag.noun.,tokens..(.*?)\]
Note that .*? means match the shortest string of any characters such that the entire pattern matches. Without the ? it would try to match the longest string.
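The difference between the greedy and the non-greedy quantifier can be seen on a small made-up string. This is only an illustrative sketch in base R; perl = TRUE is used so that both patterns run on the same engine.

```r
s <- "a[x]b[y]c"

# Greedy: .* runs to the last "]" that still lets the pattern match.
greedy <- regmatches(s, regexpr("\\[.*\\]", s, perl = TRUE))

# Non-greedy: .*? stops at the first "]" that lets the pattern match.
lazy <- regmatches(s, regexpr("\\[.*?\\]", s, perl = TRUE))

print(greedy)  # "[x]b[y]"
print(lazy)    # "[x]"
```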
How about something like this? Here I'll use the regcapturedmatches helper function (not part of base R, so it has to be defined or sourced first) to make it easier to extract the captured matches.
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."
rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", m, perl=T)
lapply(regcapturedmatches(m,rx), function(x) {
unlist(strsplit(c(x),","))
})
# [[1]]
# [1] "new" "york" "state" "department"
The regular expression is a bit messy because your desired match contains many special regular expression symbols so we need to properly escape them.
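If the regcapturedmatches helper is not at hand, a similar result can be sketched with base R alone: find the full matches first, then reduce each one to its capture group with sub. The shortened input string below is a made-up stand-in for the original, and the variable names are illustrative.

```r
# Abbreviated, hypothetical version of the string from the question.
m <- paste0("mod([tag(noun),tokens([new,york,state])]),",
            "head([tag(noun),tokens([department])]),",
            "det([tag(det),tokens([the])])")

rx <- "tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)"

# Step 1: all full matches of the pattern.
full <- regmatches(m, gregexpr(rx, m, perl = TRUE))[[1]]

# Step 2: reduce each full match to its first capture group.
inner <- sub(rx, "\\1", full, perl = TRUE)

# Step 3: split on commas and flatten into one character vector.
tokens <- unlist(strsplit(inner, ",", fixed = TRUE))
print(tokens)
```

The tokens after tag(det) are skipped because the pattern requires tag(noun) immediately before tokens.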
Here is a one liner if you like:
paste(unlist(regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T))), collapse=",")
[1] "new,york,state,department"
Broken down:
# Get match indices
indices <- gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T)
# Extract the matches
matches <- regmatches(m, indices)
# unlist and paste together
paste(unlist(matches), collapse=",")
[1] "new,york,state,department"

Exception handling for regular expressions in R

I've found several related questions, but haven't found one that solves my problem yet. Please let me know if I'm missing a question that addresses this.
Essentially I want to use a regular expression to find a pattern but with an exception based on the preceding characters. For example, I have the following text object ("muffins") as a vector and I want to match the names ("Sarah", "Muffins", and "Bob"):
muffins
[1] "Dear Sarah,"
[2] "I love your dog, Muffins, who is adorable and very friendly. However, I cannot say I enjoy the \"muffins\" he regularly leaves in my front yard. Please consider putting him on a leash outside and properly walking him like everyone else in the neighborhood."
[3] "Sincerely,"
[4] "Bob"
My approach was to search for capitalized words and then exclude words capitalized for grammatical reasons, such as the beginning of a sentence.
pattern = "\\b[[:upper:]]\\w+\\b"
m = gregexpr(pattern,muffins)
regmatches(muffins,m)
This pattern gets me most of the way, returning:
[[1]]
[1] "Dear" "Sarah"
[[2]]
[1] "Muffins" "However" "Please"
[[3]]
[1] "Sincerely"
[[4]]
[1] "Bob"
and I can identify some of the sentence beginnings with:
pattern2 = "[.]\\s[[:upper:]]\\w+\\b"
m = gregexpr(pattern2,muffins)
regmatches(muffins,m)
but I can't seem to do both simultaneously, where I say I want pattern where pattern2 is not the case.
I've tried several combinations that I thought would work, but with little success. A few of the ones I tried:
pattern2 = "(?<![.]\\s[[:upper:]]\\w+\\b)(\\b[[:upper:]]\\w+\\b)"
pattern2 = "(^[.]\\s[[:upper:]]\\w+\\b)(\\b[[:upper:]]\\w+\\b)"
Any advice or insight would be greatly appreciated!
You may be looking for a negative look-behind.
pattern = "(?<!\\.\\s)\\b[[:upper:]]\\w+\\b"
m = gregexpr(pattern,muffins, perl=TRUE)
regmatches(muffins,m)
# [[1]]
# [1] "Dear" "Sarah"
#
# [[2]]
# [1] "Muffins"
#
# [[3]]
# [1] "Sincerely"
#
# [[4]]
# [1] "Bob"
The look behind part (?<!\\.\\s) makes sure there's not a period and a space immediately before the match.
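As a small variation, the same look-behind can be widened to other sentence terminators with a character class. Treating "!" and "?" as sentence ends is an assumption added here, not something the question asked for, and the sample text is made up.

```r
txt <- "Hello Anna. Today Bob called! Then Carl left."

# Skip capitalized words that directly follow ". ", "! " or "? ".
# The look-behind still has a fixed length of two characters.
pattern <- "(?<![.!?]\\s)\\b[[:upper:]]\\w+\\b"
caps <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
print(caps)
```

"Hello" is still returned because nothing precedes it, so the negative look-behind succeeds at the start of the string. The same happens with "Dear" and "Sincerely" in the output above.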
The below regex would match only the names Bob, Sarah and Muffins,
(?<=^)[A-Z][a-z]+(?=$)|(?<!\. )[A-Z][a-z]+(?=,[^\n])|(?<= )[A-Z][a-z]+(?=,$)
Using regular expressions to identify names is problematic and cannot be made to work reliably, because matching names in arbitrary data is very complicated. If extracting these names is your goal, you need to approach this in a different way instead of simply trying to match an uppercase letter followed by word characters.
Considering your vector is as you posted in your question:
x <- c('Dear Sarah,',
'I love your dog, Muffins, who is adorable and very friendly. However, I cannot say I enjoy the "muffins" he regularly leaves in my front yard. Please consider putting him on a leash outside and properly walking him like everyone else in the neighborhood.',
'Sincerely',
'Bob')
m = regmatches(x, gregexpr('(?<!\\. )[A-Z][a-z]{1,7}\\b(?! [A-Z])', x, perl=T))
Filter(length, m)
# [[1]]
# [1] "Sarah"
# [[2]]
# [1] "Muffins"
# [[3]]
# [1] "Bob"

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just the root symbols (first capture group)
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured group of the match (in R, for example, that is \\1 in the replacement argument of sub) :-)
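In R, the regex described above could be applied roughly as follows. This is a sketch under the assumption that each symbol is a whole string; a leading ^ is added to anchor the match, and the sample symbols and the variable name roots are made up for illustration.

```r
symbols <- c("ZNH3", "ZN", "CLZ14", "ES")

# (.+?) is non-greedy, so it gives up characters to the optional
# expiration code whenever one is present before the end of the string.
roots <- sub("^(.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$", "\\1", symbols, perl = TRUE)
print(roots)
```

Symbols without an expiration code pass through unchanged, because the optional group matches nothing and the first group takes the whole string.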