R list character removal - regex

lst1 is a list:
lst1 <- list(c("all the apples", "apples in the tables", "fashion prosthetics"), c("meteorological concepts", "effects of climate change", "environmental"))
I want to preserve the list structure and remove the last s from all the words. The desired answer is the list below:
> lst1
[[1]]
[1] "all the apple" "apple in the table" "nature"
[[2]]
[1] "meteorological concept" "effect of climate change"
[3] "environmental"
I tried
gsub("\\'s|s$|s[[:space:]]{0}","",lst1)
but it is not preserving the list structure.
How can it be done?

You can use gsub with lapply to loop over the list elements
lapply(lst1, gsub, pattern= "\\'s|s$|s\\b", replacement='')
#[[1]]
#[1] "all the apple" "apple in the table" "fashion prosthetic"
#[[2]]
#[1] "meteorological concept" "effect of climate change"
#[3] "environmental"

Same solution, different regex, using a non-capturing group to leave whitespace as is:
> lapply(lst1, gsub, pattern="s(?= |$)", replacement="", perl=TRUE)
[[1]]
[1] "all the apple" "apple in the table" "fashion prosthetic"
[[2]]
[1] "meteorological concept" "effect of climate change" "environmental"

Simpler regex:
lapply(lst1, function(x) gsub('s\\b', '', x))
Results in:
[[1]]
[1] "all the apple" "apple in the table" "fashion prosthetic"
[[2]]
[1] "meteorological concept" "effect of climate change"
[3] "environmental"

Related

Cleaning up dates (years, specifically) with regex

I have database with an non-validated year field. Most of the entries are 4-digit years but about 10% of the entries are "whatever." This has led me down the rabbit hole of regular expressions to little avail. Getting better results than what I have is progress, even if I don't extract 100%.
#what a mess
yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96 ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")
#does a good job with any string containing a 4-digit year
as.numeric(sub('\\D*(\\d{4}).*', '\\1', yearEntries))
#does a good job with any string containing a 2-digit year, nought else
as.numeric(sub('\\D*(\\d{2}).*', '\\1', yearEntries))
The desired output is to grab the first readable year, so 1992-1993 would be 1992 and "the 70s" would be 1970.
How can I improve my parsing accuracy? Thanks!
EDIT: Pursuant to garyh's answer this gets me much closer:
sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\d)|\\d{4}).*","\\1",yearEntries,perl=TRUE)
# [1] "79" "07-2608" "07-262008" "96" "70" "93" "70" "15" "60" "70" NA "2013" "1992"
but note that, while the dates with dashes in them work with garyh's regex101.com demo, they don't work with R, keeping the month and day values, and the first dash.
Also I realize I didn't include an example date with slashes rather dashes. Another term in the regex should handle that but again, with R, it doesn't not produce the same (correct) result that regex101.com does.
sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\/|\\d)|\\d{4}).*","\\1","07/09/13",perl=TRUE)
# [1] "07/0913"
These negative lookbacks and lookaheads are very powerful but stretch my feeble brain.
Not sure what flavour of regex R uses but this seems to get all the years in the string
/((?<!\d)\d{2}(?!\-|\d)|\d{4})/g
This is matching any 4 digits or any 2 digits provided they're not followed by a dash - or digit, or preceded by another digit
see demo here
You're going to need some elbow grease and do something like:
library(lubridate)
yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96 ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")
x <- yearEntries
x <- gsub("(late|early)", "", x, ignore.case=TRUE)
x <- gsub("[']*[s]*", "", x)
x <- gsub(",.*$", "", x)
x <- gsub(" ", "", x)
x <- ifelse(nchar(x)==9 | nchar(x)<8, gsub("[-/]+[[:digit:]]+$", "", x), x)
x <- ifelse(nchar(x)==4, gsub("^[[:digit:]]{2}", "", x), x)
y <- format(parse_date_time(x, "%m-%d-%y!"), "%y")
yearEntries <-ifelse(!is.na(y), y, x)
yearEntries
## [1] "79" "08" "08" "96" "70" "93" "70" "15" "60" "70" NA "13" "92"
We have no idea which year you need from ranged entries, but this should get you started.
I found a very simple way to get a good result (though I would not claim it is bullet proof). It grabs the last readable year, which is okay too.
yearEntries <- c("79, 80, 99","07/26/08","07-26-2008","'96 ","Early 70's","93/95","15",NA,"2013","1992-1993","ongoing")
# assume last two digits present in any string represent a 2-digit year
a<-sub(".*(\\d{2}).*$","\\1",yearEntries)
# [1] "99" "08" "08" "96" "70" "95" "15" "ongoing" NA "13" "93"
# change to numeric, strip NAs and add 2000
b<-na.omit(as.numeric(a))+2000
# [1] 2099 2008 2008 2096 2070 2095 2015 2013 2093
# assume any greater than present is last century
b[b>2015]<-b[b>2015]-100
# [1] 1999 2008 2008 1996 1970 1995 2015 2013 1993
...and Bob's your uncle!
#garyth's regex work well actually if you use the regmatches/grexprcombo to extract the pattern instead of sub:
regmatches(yearEntries,
gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}",yearEntries,perl=TRUE))
[[1]]
[1] "79" "80" "99"
[[2]]
[1] "08"
[[3]]
[1] "2008"
[[4]]
[1] "96"
[[5]]
[1] "70"
[[6]]
[1] "95"
[[7]]
[1] "70"
[[8]]
[1] "15"
[[9]]
[1] "60"
[[10]]
[1] "70"
[[11]]
character(0)
[[12]]
[1] "2013"
[[13]]
[1] "1992" "1993"
To only keep the first matching pattern:
sapply(regmatches(yearEntries,gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}",yearEntries,perl=TRUE)),`[`,1)
[1] "79" "08" "2008" "96" "70" "95" "70" "15" "60" "70" NA "2013" "1992"
regmatches("07/09/13",gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}","07/09/13",perl=TRUE))
[[1]]
[1] "13"

regex capture repeated phrases

I can capture repeated words pretty easily using:
"(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b" but this regex does not seem to extend to mutipe words (and why should it in its current state). How could I find repeated phrases using regex?
Here I extract repeated terms (regardless of case) but the same regex doesn't word to extract a repeated phrase:
library(qdapRegex)
rm_default("this is a big Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
rm_default("this is a big is a Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
I would hope for a regex that would return:
"is a big is a Big"
for:
x <- "this is a big is a Big deal"
To cover corner cases here's a larger desired test and output...
"this is a big is a Big deal",
"I want want to see",
"I want, want to see",
"I want...want to see see how",
"this is a big is a Big deal for those of, those of you who are.",
"I like it. It is cool",
)
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
[[6]]
[1] NA
My current regex only gets me to:
rm_default(y, pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
## [[1]]
## [1] NA
##
## [[2]]
## [1] "want want"
##
## [[3]]
## [1] "want, want"
##
## [[4]]
## [1] "want...want" "see see"
##
## [[5]]
## [1] NA
I think this does what you want (note we only allow a single space, ..., or , as separators, but you should be able to tweak that easily):
pattern <- "(?i)\\b(\\w.*)((?:\\s|\\.{3}|,)+\\1)+\\b"
rm_default(x, pattern = pattern, extract=TRUE)
Produces:
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
Try this:
> regmatches(x, gregexpr("(?i)\\b(\\S.*\\S)[ ,.]*\\b(\\1)", x, perl = TRUE))
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
Here is a visualization (except there is an error in the visualization - the \S parts should be within the group.
(?i)\b(\S.*\S)[ ,.]*\b(\1)
Debuggex Demo
You might want to replace [ ,.] with [ [:punct:]]. I did not do that since debuggex does not support POSIX character groups.

splitting a string in which upper case follows lower case in stringr

I have a string vector that looks like this and I'd like to split it up:
str <- c("Fruit LoopsJalapeno Sandwich", "Red Bagel", "Basil LeafBarbeque SauceFried Beef")
str_split(str, '[a-z][A-Z]', n = 3)
[[1]]
[1] "Fruit Loop" "alapeno Sandwich"
[[2]]
[1] "Red Bagel"
[[3]]
[1] "Basil Lea" "arbeque Sauc" "ried Beef"
But I need to keep those letters at the end and start of the words.
Here's 2 approaches in base (you can generalize to stringr if you want).
This one subs out this place with a placeholder and then splits on that.
strsplit(gsub("([a-z])([A-Z])", "\\1SPLITHERE\\2", str), "SPLITHERE")
## [[1]]
## [1] "Fruit Loops" "Jalapeno Sandwich"
##
## [[2]]
## [1] "Red Bagel"
##
## [[3]]
## [1] "Basil Leaf" "Barbeque Sauce" "Fried Beef"
This method uses lookaheads and lookbehinds:
strsplit(str, "(?<=[a-z])(?=[A-Z])", perl=TRUE)
## [[1]]
## [1] "Fruit Loops" "Jalapeno Sandwich"
##
## [[2]]
## [1] "Red Bagel"
##
## [[3]]
## [1] "Basil Leaf" "Barbeque Sauce" "Fried Beef"
EDIT Generalized to stringr so you can grab 3 pieces if you want
stringr::str_split(gsub("([a-z])([A-Z])", "\\1SPLITHERE\\2", str), "SPLITHERE", 3)
You could also match instead of splitting based off your string.
unlist(regmatches(str, gregexpr('[A-Z][a-z]+ [A-Z][a-z]+', str)))
# [1] "Fruit Loops" "Jalapeno Sandwich" "Red Bagel"
# [4] "Basil Leaf" "Barbeque Sauce" "Fried Beef"

How to separate POS from words

Need to create an text sparce matrix (DTM) for classification. To prepare the text, first I need to eliminate (separate) the POS-tags the text. My guess was to do it like below. I'm new to R and don't now how to negate a REGEX (see below NOT!).
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")
My guess how it could work:
(POSs <- regmatches(text, gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text)))
[[1]]
[1] "/KOUS" "/VVFIN" "./$."
[[2]]
[1] "/VVFIN" "/PTKVZ" ";/$."
[[3]]
[1] "-/TRUNC" "/APPR" "/KON"
[[4]]
[1] "/PIS" "/ADJD" "./$."
[[5]]
[1] "/NN" "!!!/NE"
But don't konw how to negate the expression like:
# VVV
(texts <- regmatches(text, NOT!(gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text))))
[[1]]
[1] "wenn" "ausläuft"
[[2]]
[1] "Kommt" "vor"
[[3]]
[1] "Durch" "und"
[[4]]
[1] "man" "zügig"
[[5]]
[1] "empfehlung"
One possibility is to eliminate the tags by, searching for POS-tags and replacing them with '' (i.e. empty text):
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")
(textlist <- strsplit(paste(gsub('[[:punct:]]*/[[:alpha:][:punct:]]*','', text), sep=' '), " "))
[[1]]
[1] "wenn" "ausläuft"
[[2]]
[1] "Kommt" "vor"
[[3]]
[1] "-RRB" "Durch" "und"
[[4]]
[1] "man" "zügig"
[[5]]
[1] "empfehlung"
With the friendly help of rawr

Extract phone number regex

How can I extract phone numbers from a text file?
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
"43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
"Please contact Mr. Bean (613)2134567",
"1.575.555.5555 is his #1 number",
"7164347566"
)
This is a question that's been answered for other languages (see php abd general regex) but doesn't seem to have been tackled on SO for R.
I have searched and found what appears to be possible regexes to find phone numbers (In addition to the regexes from other languages above): http://regexlib.com/Search.aspx?k=phone but have not been able to use gsub within R with these to extract all of these numbers in the example.
Ideally, we'd get something like:
[[1]]
[1] "2-613-213-4567" "5555555555"
[[2]]
[1] "613 213 4567"
[[3]]
[1] "(613)2134567"
[[4]]
[1] "1.575.555.5555"
[[5]]
[1] "7164347566"
This is the best I've been able to do- you have a pretty wide range of formats, including some with spaces, so the regex is pretty general. It just says "look for a string of at least 5 characters made up entirely of digits, periods, brackets, hyphens or spaces":
library(stringr)
str_extract_all(x, "(^| )[0-9.() -]{5,}( |$)")
Output:
[[1]]
[1] " 2-613-213-4567 " " 5555555555 "
[[2]]
[1] " 613 213 4567"
[[3]]
[1] " (613)2134567"
[[4]]
[1] "1.575.555.5555 "
[[5]]
[1] "7164347566"
The leading/trailing spaces could probably be fixed with some additional complexity, or you could just fix it in post.
Update: a bit of searching lead me to this answer, which I slightly modified to allow periods. A bit stricter in terms of requiring a valid (US?) phone number, but seems to cover all your examples:
str_extract_all(x, "\\(?\\d{3}\\)?[.-]? *\\d{3}[.-]? *[.-]?\\d{4}")
Output:
[[1]]
[1] "613-213-4567" "5555555555"
[[2]]
[1] "613 213 4567"
[[3]]
[1] "(613)2134567"
[[4]]
[1] "575.555.5555"
[[5]]
[1] "7164347566"
The monstrosity found here also works once you take out the ^ and $ at either end. Use only if you really need it:
huge_regex = "(?:(?:\\+?1\\s*(?:[.-]\\s*)?)?(?:\\(\\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\\s*\\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\\s*(?:[.-]\\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\\s*(?:[.-]\\s*)?([0-9]{4})(?:\\s*(?:#|x\\.?|ext\\.?|extension)\\s*(\\d+))?"
The qdapRegex now has the rm_phone specifically designed for this task:
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
"43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
"Please contact Mr. Bean (613)2134567",
"1.575.555.5555 is his #1 number",
"7164347566"
)
library(qdapRegex)
ex_phone(x)
## [[1]]
## [1] "613-213-4567" "5555555555"
##
## [[2]]
## [1] "613 213 4567"
##
## [[3]]
## [1] "(613)2134567"
##
## [[4]]
## [1] "1.575.555.5555"
##
## [[5]]
## [1] "7164347566"
You would need a complex regex to cover all rules for matching phone numbers, but to cover your examples.
> library(stringi)
> unlist(stri_extract_all_regex(x, '(\\d[.-])?\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}\\b'))
# [1] "2-613-213-4567" "5555555555" "613 213 4567" "(613)2134567"
# [5] "1.575.555.5555" "7164347566"